Automatic sample selection and Variableselectiongui: Difference between pages

From Eigenvector Research Documentation Wiki
(Difference between pages)
imported>Donal
 
imported>Scott
No edit summary
 
Line 1: Line 1:
The Calibration/Validation Sample selection interface allows the user to choose which samples to keep in the calibration set (Cal) and which to move to the validation set (Val).
==Introduction==


Selection can be done manually, by setting the "Sample Type" Class set (Under the Row Labels tab) to either Calibration or Validation for each sample, or automatically by selecting the Automatic split button (gear) in the toolbar.
The Variable Selection panel contains an interface to several methods for performing variable selection. The goal is to find subsets of variables that improve predictions when compared to using all variables. This interface has several different methods available. Finding the best method and options settings will take some experimentation. Use links below for more information on particular methods.


The sample selection interface is opened by choosing "Split into Calibration / Validation" from any of the data blocks in the Analysis status window. The resulting interface is a customized DataSet editor which shows one row for each sample in the current calibration and validation blocks and allows the user to modify the status of each sample.
==Methods==


Once set selection is done, the "Accept Experiment Setup" toolbar button can be used to automatically sort the data into the calibration and validation blocks. All data marked as "Calibration" will be moved to the X/Y blocks in the calibration section of the Analysis window and all data marked as "Validation" will be moved to the X/Y blocks in the validation section of the Analysis window. Clicking the "Discard Experiment Setup" button will discard all Cal / Val changes.
* Automatic (VIP or sRatio)
* GA - Genetic Algorithm
* iPLS - Interval PLS
* rPLS - Recursive PLS
* sRatio - Selectivity Ratio
* VIP - Variable Importance in Projection


__TOC__
==Work Flow==


==Manual Sample Selection==
* <u>Select a Method</u> - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, its results will be displayed.
 
* <u>Adjust Options</u> - By default, a simplified set of options is displayed. If the "Show All Options" checkbox is selected, all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the amount of time it takes to complete. See the documentation for each method for more details on optional settings.
Each sample can be moved to either the Calibration or Validation set by simply changing the "Sample Type" class. If there are labels for the samples, these will be shown in the Label field of the interface.
* <u>Run Variable Selection</u> - Clicking the "Execute" button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating that the method is running. Some methods will display a waitbar with a message indicating that it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If "Show Plots" is checked, any additional plots will be displayed in separate windows. This is useful for GA because it shows the progress of the calculation.
 
* <u>View Results</u> - When a calculation is complete, the selected variables will be displayed as green bars under a plot of the data mean.
To move more than one sample at a time, click the button at the left of each row to select that row. Once all the desired rows are selected, use the Class pull-down menu on one of the selected rows to choose Calibration or Validation, as desired. All selected samples will be switched to the indicated set.
 
==Automatic Sample Selection==
 
Automatic sample selection walks the user through the selection by asking the series of questions outlined below.
 
===Disposition of Previous Selection Changes===
 
First, if there are any samples which have been manually or automatically moved from Cal to Val, or vice versa, the user is asked if they want to Reset all samples back to their original set before automatic selection is done. Choosing "Reset" will restore all the samples to the set they were in when the sample selection interface was opened. Choosing "Select from Current Split" will keep the samples in their current split and allow '''further''' selection automatically. "Cancel" stops all selection.
 
[[Image:Selreset.png]]
 
===Direction for Sample Selection===
 
Next, if there are any samples marked as Validation, the user is asked which "direction" they want to select: either removing samples FROM the calibration set (to the validation set), or adding samples TO the calibration set (out of the validation set). The first option is used when there are more samples in the calibration set than are desired ''or'' when the user wishes to create a test set for their model. The second option is used when new data has been measured and the user wishes to add some subset of these samples to a previous set of calibration samples (to improve model performance on the new types of samples).
 
If all the samples are in the calibration set already, Remove From Calibration is assumed.
 
[[Image:Seldirection.png]]
 
 
===Selection Method===
 
Next, the selection method must be chosen:
 
[[Image:Selmethod.png]]
 
 
* Kennard-Stone - based on [[kennardstone]], this method selects samples that best span the same range as the original data, but with an even distribution of samples across that range. This is similar to the previously offered method [[reducennsamples]].
 
* Onion - based on [[distslct]] and [[Splitcaltest]], this method first selects a ring of the most unique samples (based on distance from the mean, like the outermost layer of an onion). These are used in the calibration set. Next, a ring of less unique samples, just inside the first set (the next onion layer), is put into the validation set. This is repeated two more times, so that there are three outer rings of most unique and less unique samples. Finally, all remaining samples are split randomly into calibration and validation.
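
The Kennard-Stone idea described above can be illustrated with a short Python sketch. This is not the code behind [[kennardstone]]; it is a minimal version of the classic algorithm, assuming Euclidean distance on the raw data and no preprocessing. The function name <code>kennard_stone</code> and its arguments are hypothetical and chosen only for this illustration.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X picked by the classic Kennard-Stone rule.

    Illustrative sketch only: Euclidean distance, no preprocessing.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= n_select <= n:
        raise ValueError("n_select must be between 1 and the number of samples")

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i)]
    if j != i and n_select > 1:
        selected.append(int(j))
    remaining = [k for k in range(n) if k not in selected]

    # Repeatedly add the sample whose nearest selected neighbor is farthest away.
    while len(selected) < n_select:
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Example: keep 3 of 4 one-dimensional samples as calibration samples.
cal_idx = kennard_stone([[0.0], [1.0], [2.0], [10.0]], 3)
```

In the "Remove From Calibration" direction, the returned indices would correspond to the samples kept in the calibration set, and all other samples would go to the validation set.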
 
===Handling Replicates===
 
When choosing a selection method, the user can also define whether to use special replicate handling. In most cases, if you have replicate measurements, you do '''not''' want to split them between the calibration and validation sets. You want to keep them together in either the calibration or validation sets.
 
To use this feature, you must first [[Assigning Sample Classes|create a class set in your data]] in which each set of replicate samples is assigned to the same class (with a different class for each group of samples that can safely be split from the others; see the example below). Next, after a sample selection method is chosen, mark the "Keep Replicates Together" check-box on the Data Split Dialog. Then, from the "Replicate Class Set" list, choose the class set that defines the replicate classes. The automatic splitting algorithm will keep those replicate samples together.
 
Below is an example of a set of samples and the classes that would work to keep replicates together:
 
{| class="wikitable" border="1" align="center"
|+
! Label !! Class
|-align="center"
| Sample 1 Replicate 1 || A
|-align="center"
| Sample 1 Replicate 2 || A
|-align="center"
| Sample 2 Replicate 1 || B
|-align="center"
| Sample 2 Replicate 2 || B
|-align="center"
| Sample 2 Replicate 3 || B
|-align="center"
| Sample 3 Replicate 1 || C
|-align="center"
| Sample 3 Replicate 2 || C
|}
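
The effect of keeping replicates together can be sketched in Python as follows. This is only an illustration of the grouping idea, not the splitting algorithm used by the interface: it assigns whole classes to a set at random, whereas the interface applies the chosen selection method. The function name <code>split_keeping_replicates</code>, the <code>keep_fraction</code> argument, and the rounding of the number of groups are assumptions made for this example.

```python
import random
from collections import defaultdict

def split_keeping_replicates(classes, keep_fraction, seed=0):
    """Split sample indices into (calibration, validation) without splitting a class.

    Illustrative sketch: whole replicate classes are assigned at random.
    """
    # Collect the sample indices belonging to each replicate class.
    groups = defaultdict(list)
    for idx, cls in enumerate(classes):
        groups[cls].append(idx)

    # Shuffle the class names reproducibly, then keep a fraction of the classes.
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    n_cal_groups = max(1, round(keep_fraction * len(names)))

    cal = sorted(i for g in names[:n_cal_groups] for i in groups[g])
    val = sorted(i for g in names[n_cal_groups:] for i in groups[g])
    return cal, val

# Classes from the example table: A, A, B, B, B, C, C
cal, val = split_keeping_replicates(list("AABBBCC"), keep_fraction=0.67)
```

With the classes from the table above, every replicate of a given sample ends up in the same set, so no class appears in both the calibration and validation sets.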
 
===Choosing Percentage to Keep===
 
Finally, the user must select the percentage of samples to "select". In the case of Removing From Calibration, this is the percentage of Calibration samples to '''keep''' in the calibration set. In the case of Adding To Calibration, this is the percentage of Validation samples to '''add''' to the calibration set. The value must be between 1 and 100.
 
[[Image:Selpct.png]]
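
The meaning of the percentage in each direction can be made concrete with a small Python sketch. The function name <code>set_sizes_after_selection</code> and the direction strings are hypothetical, and the rounding to a whole number of samples is an assumption; the interface may round differently.

```python
def set_sizes_after_selection(n_cal, n_val, percent, direction):
    """Return (calibration size, validation size) after an automatic selection.

    Illustrative sketch: the rounding rule is an assumption.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if direction == "remove_from_calibration":
        # percent = share of the current calibration samples that are kept.
        keep = round(n_cal * percent / 100)
        return keep, n_val + (n_cal - keep)
    if direction == "add_to_calibration":
        # percent = share of the current validation samples moved to calibration.
        add = round(n_val * percent / 100)
        return n_cal + add, n_val - add
    raise ValueError("unknown direction")

# 100 calibration samples, keep 70 percent: 70 stay, 30 move to validation.
sizes_remove = set_sizes_after_selection(100, 0, 70, "remove_from_calibration")
# 80 calibration and 20 validation samples, add 50 percent of the validation set.
sizes_add = set_sizes_after_selection(80, 20, 50, "add_to_calibration")
```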
 
 
===Finishing the Selection===
 
Once all settings have been defined, the selection will take place and the samples will be marked in their new sets. It may be useful to create a plot (click on the Plot toolbar button, or the Plot tab) to view which samples are in which sets. Accepting the changes will move all samples to the new sets and make sure Analysis is in the appropriate configuration for analysis of the data.

Revision as of 14:24, 11 January 2018

Introduction

The Variable Selection panel contains an interface to several methods for performing variable selection. The goal is to find subsets of variables that improve predictions when compared to using all variables. This interface has several different methods available. Finding the best method and options settings will take some experimentation. Use links below for more information on particular methods.

Methods

  • Automatic (VIP or sRatio)
  • GA - Genetic Algorithm
  • iPLS - Interval PLS
  • rPLS - Recursive PLS
  • sRatio - Selectivity Ratio
  • VIP - Variable Importance in Projection

Work Flow

  • Select a Method - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, its results will be displayed.
  • Adjust Options - By default, a simplified set of options is displayed. If the "Show All Options" checkbox is selected, all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the amount of time it takes to complete. See the documentation for each method for more details on optional settings.
  • Run Variable Selection - Clicking the "Execute" button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating that the method is running. Some methods will display a waitbar with a message indicating that it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If "Show Plots" is checked, any additional plots will be displayed in separate windows. This is useful for GA because it shows the progress of the calculation.
  • View Results - When a calculation is complete, the selected variables will be displayed as green bars under a plot of the data mean.