Variableselectiongui

From Eigenvector Research Documentation Wiki

Revision as of 11:13, 16 January 2018

Introduction

The Variable Selection panel provides an interface to several variable selection methods. The goal is to find a subset of variables that gives better predictions than using all variables. The best method and option settings depend on the data, so finding them usually requires some experimentation. Use the Methods section below to find out more about a particular method.

General Guidance

All variable selection methods can over-fit, and you need to be aware of this to avoid spurious results. The risk is lower if you have many samples, few variables and a good relation between your variables. It is much higher if you have a weak model, few samples and many variables.

In such cases, it is important to use a very conservative cross-validation and to set aside a test set to verify your findings. A conservative cross-validation uses few segments, random segments (if possible) and many repetitions.
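
As an illustration of what "few segments, random segments and many repetitions" means, here is a minimal Python sketch that generates such splits. It is not part of the software; the function name and the default values (3 segments, 20 repetitions) are assumptions chosen for the example.

```python
import random

def repeated_random_splits(n_samples, n_segments=3, n_repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random-segment cross-validation.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # new random segment assignment for every repetition
        # Deal the shuffled samples into n_segments nearly equal segments.
        segments = [order[k::n_segments] for k in range(n_segments)]
        for k, test_idx in enumerate(segments):
            train_idx = [i for j, seg in enumerate(segments) if j != k for i in seg]
            yield sorted(train_idx), sorted(test_idx)

# Samples reserved as the independent test set should be removed
# before calling this, so they never appear in any split.
splits = list(repeated_random_splits(n_samples=12, n_segments=3, n_repeats=5))
print(len(splits))  # 5 repetitions x 3 segments = 15 splits
```

With few segments, each model is built on a smaller share of the data and tested on a larger one, which makes an over-fitted variable subset less likely to look good by chance.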

How to do variable selection in practice

  1. If you can build an almost reasonable model on your data, the diagnostics of that model will be meaningful, so you can use the model parameters to select variables. It therefore makes sense to start with Automatic (VIP or sRatio). If the predictions improve and the selected variables seem reasonable, use those variables and refine the model using the loadings, scores, etc. Always check for outliers before and after selection; after removing outliers, consider re-including all variables and redoing the variable selection. See the Using the DataSet Include Field page for more information on how to manage including/excluding data.
  2. If you do not have a good model to begin with, or if step 1 did not succeed, try a variable selection method that does not rely on the initial overall model. Forward selection of variables can be performed with Interval PLS (iPLS). If you have spectral data or similarly continuous data, you have to choose windows (intervals). Select the interval width so that you get roughly 10-40 intervals. Try a few interval widths to see whether the choice makes a large difference. As above, if the selection works, build a model with the selected variables, refine it, check for outliers, etc.
  3. If steps 1 and 2 fail, you can turn to heavier machinery and run a genetic algorithm (GA). It is a greedier approach, so it may find a solution in more difficult cases. Because it is greedier, it is even more important that the cross-validation is conservative.
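
The interval-width guideline in step 2 can be checked with a few lines of Python. This is only a sketch: make_intervals is a hypothetical helper, not a toolbox function, and the 700 variables are an assumed example size.

```python
def make_intervals(n_vars, width):
    """Split variable indices 0..n_vars-1 into contiguous windows of `width`.

    Hypothetical helper for illustration; the last window may be shorter.
    """
    return [list(range(start, min(start + width, n_vars)))
            for start in range(0, n_vars, width)]

n_vars = 700  # assumed number of spectral channels
for width in (10, 20, 35, 50, 100):
    n_intervals = len(make_intervals(n_vars, width))
    verdict = "ok" if 10 <= n_intervals <= 40 else "outside 10-40"
    print(f"width {width:3d}: {n_intervals:2d} intervals ({verdict})")
```

For 700 variables, widths of 20, 35 and 50 fall inside the suggested range, while 10 gives too many intervals and 100 too few.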

Methods

Variable Selection Performance

Method                                     Speed        Works without good model on all data
Automatic (VIP or sRatio)                  Fast         No
GA - Genetic Algorithm                     Slow         Yes
iPLS - Interval (Forward Selection) PLS    Slow         Yes
rPLS - Recursive PLS                       Very Fast    No
sRatio - Selectivity Ratio                 Very Fast    No
VIP - Variable Importance in Projection    Very Fast    No


Work Flow

[Image: Variableselectionpanel.jpg]

  • Select a Method - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, its results will be displayed.
  • Adjust Options - By default, a simplified set of options is displayed. If the Show All Options checkbox is selected, all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the time it takes to complete. See the documentation for more details on optional settings. Clicking the Reset button will reset all options to their default values.
  • Run Variable Selection - Clicking the Execute button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating that the method is running. Some methods will display a waitbar with a message indicating that it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If Show Plots is checked, any additional plots will be displayed in separate windows. This is useful for GA because it shows the progress of the calculation.
  • View Results - When a calculation is complete, the selected variables will be displayed as green bars under a plot of the data mean. Two lines are displayed indicating the relative RMSECV values. The red line is the best RMSECV for all variables and the blue line is the best RMSECV for the selected variables. These lines are scaled to the max RMSECV with all variables included.
  • Use Selected Variables - To use the selected variables, click the Use button and the current selection will become the "included" variables of the current dataset (.include{2} field). You can undo this by clicking the Discard button.
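
The relative RMSECV lines described under View Results can be read as in the following Python sketch. It is one interpretation of the scaling, not the panel's actual code; the RMSECV values per number of latent variables are invented for the example and all names are hypothetical.

```python
import math

def rmsecv(y_true, y_cv_pred):
    """Root mean square error of cross-validation for one model size."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_cv_pred)) / n)

# Invented RMSECV values for 1..4 latent variables (illustration only).
rmsecv_all = [0.80, 0.55, 0.50, 0.52]       # model using all variables
rmsecv_selected = [0.70, 0.42, 0.40, 0.41]  # model using the selected variables

scale = max(rmsecv_all)                      # max RMSECV with all variables included
red_line = min(rmsecv_all) / scale           # best RMSECV, all variables
blue_line = min(rmsecv_selected) / scale     # best RMSECV, selected variables

print(f"red (all variables): {red_line:.3f}")
print(f"blue (selected):     {blue_line:.3f}")
```

Under this reading, a blue line below the red line means that the selected variables gave a lower best RMSECV than the full variable set.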

Other Features

  • Automatic (VIP or sRatio) - Uses the .fractionstotest option to survey a range of values for the .fractiontoremove option of selectvars and selects the best result.
  • GA Window - The GenAlg (GA) Window can be displayed by clicking the GA Window button. This gives access to all options. Values set on the panel will be reflected in the GA Window.
  • Show List - The Show List button will open a separate window listing the indices of the selected variables.
  • Help - The Help button displays this page.
  • Cross Validation Settings - For methods that use cross validation settings (e.g., maxlv and splits), these settings are pulled from the current values in the Cross Val window.
  • Drag Legend - The legend on the plot can be dragged to a new location if needed.
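
The idea behind the Automatic method, surveying several removal fractions and keeping the best result, can be sketched in Python as follows. This is not the selectvars implementation: survey_fractions, the importance scores and the toy error function are all hypothetical stand-ins, with the scores playing the role of VIP or sRatio values and toy_cv_error the role of a cross-validated error.

```python
def survey_fractions(scores, cv_error, fractions_to_test):
    """Try each removal fraction and return (error, fraction, kept indices) of the best.

    Hypothetical sketch: `scores` are per-variable importance values and
    `cv_error` is any function mapping a list of kept indices to an error.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # least important first
    best = None
    for fraction in fractions_to_test:
        n_remove = int(fraction * len(scores))
        keep = sorted(order[n_remove:])
        if not keep:
            continue  # never evaluate an empty variable set
        error = cv_error(keep)
        if best is None or error < best[0]:
            best = (error, fraction, keep)
    return best

# Toy setup: variables 0-2 are informative, the rest are noise.
scores = [3.0, 2.5, 2.0, 0.4, 0.3, 0.2, 0.1, 0.05]
informative = {0, 1, 2}

def toy_cv_error(keep):
    missing = len(informative - set(keep))   # losing an informative variable costs a lot
    noise = len(set(keep) - informative)     # every kept noise variable costs a little
    return 1.0 * missing + 0.1 * noise

error, fraction, keep = survey_fractions(scores, toy_cv_error, (0.0, 0.25, 0.5, 0.75))
print(fraction, keep)
```

In this toy case, removing half of the variables gives the lowest error; removing more starts to discard informative variables.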