Variable Selection

From Eigenvector Research Documentation Wiki

Revision as of 11:17, 15 November 2019

Introduction

The Variable Selection panel provides an interface to several variable selection methods. The goal is to find a subset of variables that gives better predictions than using all variables. The best method and option settings depend on the data, and finding them usually requires some experimentation. Use the entries listed below to find more information on a particular method.

General Guidance

All variable selection methods can over-fit, and you need to be aware of this to avoid spurious results. If you have many samples, few variables, and a good relationship between your variables, over-fitting is less of a problem. If you have a weak model, few samples, and many variables, the risk is much higher.

In such cases, it is important to use a very conservative cross-validation and to set aside a test set to verify your findings. A conservative cross-validation uses few segments, randomly assigned segments (if possible), and many repetitions.

The available variable selection methods work for linear regression methods (PCR, PLS, MLR), but only the Genetic Algorithm (GA) and Interval PLS (iPLS) variable selection methods are available when performing linear classification using PLSDA.
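To make "few segments, random segments and many repetitions" concrete, here is a minimal Python sketch of repeated random-segment cross-validation splits. It is an illustration only and not the toolbox's cross-validation code; the function name and its parameters (`n_segments`, `n_repeats`, `seed`) are assumptions made for this example. The test set mentioned above should be set aside before any of these splits are generated.

```python
import random

def repeated_random_segments(n_samples, n_segments=3, n_repeats=20, seed=0):
    """Yield (repeat, segment, train_idx, test_idx) for repeated random-segment CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # random assignment of samples to segments
        # Deal the shuffled samples into n_segments nearly equal groups.
        segments = [sorted(order[s::n_segments]) for s in range(n_segments)]
        for s, test_idx in enumerate(segments):
            held_out = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            yield rep, s, train_idx, test_idx

# Hypothetical example: 12 calibration samples, 3 segments, 5 repetitions.
splits = list(repeated_random_segments(n_samples=12, n_segments=3, n_repeats=5))
```

With few segments, each model is built with a large share of the samples left out, and repeating with fresh random segments averages out the luck of any single split.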

Variable Selection in Practice

  1. If you can build a reasonable (even if imperfect) model on your data, the diagnostics of that model will be meaningful, so you can use the model parameters to select variables. It therefore makes sense to start with Automatic (VIP or sRatio). If the predictions improve and the selected variables seem reasonable, use those variables and refine the model using the loadings, scores, etc. Always check for outliers before and after selection, and consider redoing the variable selection after removing outliers and re-adding all variables to the data. See Using the DataSet Include Field for more information on how to manage including/excluding data.
  2. If you do not have a good model to begin with, or if step 1 did not succeed, try a variable selection method that does not rely on the initial overall model. Forward selection of variables can be performed with Interval PLS (iPLS). If you have spectral data or similarly continuous data, you have to choose windows (intervals). Select the interval width so that you get roughly 10-40 intervals. Try a few different interval widths to see whether the choice makes a large difference. As above, if the selection works, build a model with the selected variables, refine it, check for outliers, etc.
  3. If steps 1 and 2 fail, you can turn to heavier machinery and run a genetic algorithm. It is a more greedy approach, so it may be able to find a solution in more difficult cases. Because it is more greedy, it is even more important that the cross-validation is conservative.
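The interval-width advice in step 2 can be checked with a little arithmetic. The Python sketch below lists the widths that give between 10 and 40 intervals for a given number of variables. It assumes contiguous, equal-width intervals with a possibly shorter last interval; that layout, the function names, and the 700-variable example are assumptions for illustration and are not taken from the ipls function.

```python
import math

def interval_bounds(n_vars, width):
    """Split variable indices 0..n_vars-1 into contiguous intervals of `width`."""
    return [(start, min(start + width, n_vars) - 1)
            for start in range(0, n_vars, width)]

def widths_in_range(n_vars, min_intervals=10, max_intervals=40):
    """Interval widths that yield between min_intervals and max_intervals intervals."""
    return [w for w in range(1, n_vars + 1)
            if min_intervals <= math.ceil(n_vars / w) <= max_intervals]

n_vars = 700  # hypothetical number of spectral channels
candidates = widths_in_range(n_vars)
print("smallest and largest suitable widths:", candidates[0], candidates[-1])
print("intervals at width 50:", len(interval_bounds(n_vars, 50)))
```

Picking two or three widths spread across the candidate range is a simple way to see whether the selection is sensitive to the interval width.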

Methods

Variable Selection Performance

Method                                   | Speed     | Works without good model on all data
-----------------------------------------|-----------|-------------------------------------
Automatic (VIP or sRatio)                | Fast      | No
GA - Genetic Algorithm                   | Slow      | Yes
iPLS - Interval (Forward Selection) PLS  | Slow      | Yes
rPLS - Recursive PLS                     | Very Fast | No
sRatio - Selectivity Ratio               | Very Fast | No
VIP - Variable Importance in Projection  | Very Fast | No

Work Flow

[Figure: Variable Selection panel (variableselectionpanel.jpg)]

  • Select a Method - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, its results will be displayed.
  • Adjust Options - By default, a simplified set of options is displayed. If the Show All Options checkbox is selected, all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the amount of time it takes to complete. See the documentation for more details on optional settings. Clicking the Reset button will reset all options to their default values.
  • Run Variable Selection - Clicking the Execute button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating that the method is running. Some methods will display a waitbar with a message indicating that it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If Show Plots is checked, any additional plots will be displayed in separate windows. This is useful for GA because it shows the progress of the calculation.
  • View Results - When a calculation is complete the selected variables will be displayed under a plot of the data mean as green bars. Two lines are displayed indicating the relative RMSECV values. The red line is the best RMSECV for all variables and the blue line is the best RMSECV for selected variables. These lines are scaled to the max RMSECV with all variables included.
    NOTE: These RMSECV values differ from those displayed in the old interface and in the existing plots (shown when Show Plots is checked).
  • Use Selected Variables - To use the selected variables, click the Use button. The current selection will become the "included" variables of the current dataset (.include{2} field).
  • Build a model with the new include field.
  • To undo the last selection, click the Discard button.
  • Repeat these steps to try different settings in Variable Selection.

NOTE: Discard will only undo the last selection. If Use is clicked, another selection is run, and Use is clicked again, the Discard button will only undo the last selection and will not return the include field to the "original" state.
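The single-level undo described in this note can be pictured with a small toy model. The Python class below is not the panel's implementation; the class name, its methods, and the ten-variable example are hypothetical and only mirror the Use and Discard behavior as described above.

```python
class IncludeSelection:
    """Toy model of an include field with a single-level undo."""

    def __init__(self, n_vars):
        self.include = list(range(n_vars))  # start with every variable included
        self._previous = None               # only the most recent state is kept

    def use(self, selected):
        """Apply a selection, remembering only the state just before it."""
        self._previous = self.include
        self.include = sorted(selected)

    def discard(self):
        """Undo the most recent use(); earlier states are not recoverable."""
        if self._previous is not None:
            self.include = self._previous
            self._previous = None

panel = IncludeSelection(n_vars=10)
original = list(panel.include)   # keep your own copy of the starting state
panel.use([1, 2, 3, 4, 5])       # first Use
panel.use([2, 3])                # second Use
panel.discard()                  # back to [1, 2, 3, 4, 5], not to `original`
```

In this model the only way back to the starting state after two Use clicks is the separately saved copy, which suggests recording the original include field before experimenting.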

Other Features

  • Automatic (VIP or sRatio) - Uses the .fractionstotest option to survey a range of values for the .fractiontoremove option of selectvars and selects the best result.
  • GA Window - The GenAlg (GA) Window can be displayed by clicking the GA Window button. This gives access to all options. Values set on the panel will be reflected in the GA Window.
  • Show List - The Show List button will open a separate window listing the indices of the selected variables.
  • Help - The Help button displays this page.
  • Cross Validation Settings - For methods that use cross validation settings (e.g., maxlv and splits), these settings are pulled from the current values in the Cross Val window.
  • Drag Legend - The legend on the plot can be dragged to a new location if needed.
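The Automatic entry above describes a survey over candidate removal fractions. One plausible reading of that idea is sketched below in Python: for each fraction, drop that share of the lowest-scoring variables and keep the subset with the lowest cross-validated error. This is not the selectvars algorithm; the function, the importance scores, and the toy error function are all hypothetical stand-ins chosen so the example runs on its own.

```python
def survey_fractions(scores, fractions_to_test, cv_error):
    """Try each removal fraction; return (best_fraction, kept_indices, error)."""
    # Rank variables from least to most important according to their scores.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    best = None
    for frac in fractions_to_test:
        n_remove = round(frac * len(scores))
        kept = sorted(ranked[n_remove:])
        err = cv_error(kept)
        if best is None or err < best[2]:
            best = (frac, kept, err)
    return best

# Toy stand-in for a cross-validated error (hypothetical): variables 0-2 are
# informative, every other kept variable adds noise, and dropping an
# informative variable is penalized heavily.
informative = {0, 1, 2}

def toy_cv_error(kept):
    noise = sum(1 for i in kept if i not in informative)
    missing = sum(1 for i in informative if i not in kept)
    return 1.0 + 0.1 * noise + 1.0 * missing

scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.05, 0.15, 0.25, 0.12]
best_fraction, kept, err = survey_fractions(
    scores, [0.0, 0.2, 0.4, 0.6, 0.8], toy_cv_error)
```

In a real analysis the error function would be a cross-validated RMSECV computed with the settings from the Cross Val window, and the scores would come from VIP or the selectivity ratio.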

Variable Selection Functions

calibsel - Statistical procedure for variable selection.
fullsearch - Exhaustive Search Algorithm for small problems.
gaselctr - Genetic algorithm for variable selection with PLS.
genalg - Genetic Algorithm for Variable Selection. See also Genetic Algorithms for Variable Selection.
genalgplot - Plot GA results using selected variable plot, color-coded by RMSECV.
ipls - Interval PLS variable selection. See also Interval PLS (IPLS) for Variable Selection.
rpls - Recursive PLS (and PCR) variable selection.
sratio - Calculates selectivity ratio for a given regression model.
selectvars - Selects variables that are predictive.
vip - Calculate Variable Importance in Projection from regression model.
