Sample Classification Predictions and Variableselectiongui: Difference between pages

From Eigenvector Research Documentation Wiki
(Difference between pages)
imported>Donal
No edit summary
 
imported>Scott
No edit summary
 
__TOC__
==Introduction==
Classification results for samples can be viewed through the scores for a PLSDA, SVMDA, KNN or SIMCA model. If the model has been applied to test data, predictions will also be available for those samples. The predictions for the calibration data are "self-predictions" (predictions of the model on the calibration data itself).


Results can be viewed as a plot using the Plot Scores toolbar button in Analysis (or the [[plotscores]] command at the command line) and can be viewed as a table by selecting File > Edit Data from the Plot Controls window while viewing a scores plot, or by using the Edit Data toolbar button on the scores plot itself.
The Variable Selection panel provides an interface to several methods for performing variable selection. The goal is to find subsets of variables that improve predictions compared to using all variables. Finding the best method and option settings will take some experimentation. Use the links below for more information on particular methods.


The predictions available are based on various classification rules, including the following (all rules are described in detail after the list):
==Methods==


* '''Class Pred Strict''' - Numerical class assignment based on strict assignment rules.
* Automatic (VIP or sRatio)
* '''Class Pred Most Probable''' - Numerical class assignment based on most probable class rules.
* GA - Genetic Algorithm
* '''Class Pred Probability <ClassID>''' - Probability that the sample belongs to a specific class <ClassID>.
* iPLS - Interval PLS
* '''Class Pred Member <ClassID>''' - Logical (true/false) class assignment to a specific class <ClassID> based on strict multiple-class assignment rules.
* rPLS - Recursive PLS
* '''Class Pred Member - Unassigned''' - Logical (true/false) class assignment indicating when no class could be assigned to a sample.
* sRatio - Selectivity Ratio
* '''Misclassified''' - Logical (true/false) indicating when the strict classification does not match the known "measured" class assignment.
* VIP - Variable Importance in Projection


While viewing a plot, the Plot Controls window allows selection and viewing of the different rule predictions. For example, setting the Plot Controls X selection to "Sample Number" and the Y selection to "Class Pred Most Probable" will show the most probable class for each sample in the Scores Plot. This is displayed as the numerical class number (for reference, this is the same number viewable in the class lookup table, if the model was built from a DataSet with classes.) When selected, the Y axis ranges over all possible class numbers and a sample determined to belong to class = 2 would be shown at (x,y) = (sample number, 2).
==Work Flow==


If viewing the table of results, the columns of the table will be the different classification results and the rows the different samples. Note that this information is also available in the model or prediction structure itself in the field "classifications", as described in the [[Standard Model Structure]] page.
* <u>Select a Method</u> - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, the results of it will be displayed.  
 
* <u>Adjust Options</u> - By default, a simplified set of options is displayed. If the "Show All Options" checkbox is selected then all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the amount of time it takes to complete. See documentation for more details on optional settings.
===Class Pred Strict===
* <u>Run Variable Selection</u> - Clicking the "Execute" button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating the method is running. Some methods will display a waitbar with a message indicating it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If "Show Plots" is checked then any additional plots will be displayed in separate windows. This is useful for GA as it will show progress of the calculation.
Strict class predictions are based on the rule that a sample belongs to a class if the probability is > 0.50 for one and only one class. If no class has a probability > 0.50 or if more than one class has a probability > 0.50, then the sample is assigned to class zero (0), indicating no class could be assigned. These predictions provide the most safety in class assignment. If there is too large an uncertainty of a sample being a member of a class, or if the sample appears to be in more than one class, these predictions will indicate that. If samples are expected to belong to more than one class, use the '''Class Pred Member''' predictions (described below).
* <u>View Results</u> - When a calculation is complete the selected variables will be displayed as green bars under a plot of the data mean.
 
Use strict class predictions if you need to see a class assignment for each sample where the model is confident the sample belongs to this class and to this class only.
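
The strict rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the toolbox's implementation; the function name <code>class_pred_strict</code>, its arguments, and the example probabilities are hypothetical.

```python
def class_pred_strict(probabilities, class_numbers, threshold=0.5):
    """Illustrative sketch (hypothetical helper, not a toolbox function).

    Returns the class number when exactly one class probability exceeds
    the threshold; otherwise returns 0, meaning "no class assigned".
    """
    above = [c for p, c in zip(probabilities, class_numbers) if p > threshold]
    return above[0] if len(above) == 1 else 0


classes = [1, 2, 3]
one_clear_class = class_pred_strict([0.90, 0.10, 0.20], classes)   # only class 1 is > 0.50
two_classes_high = class_pred_strict([0.70, 0.60, 0.10], classes)  # two classes are > 0.50
no_class_high = class_pred_strict([0.30, 0.40, 0.20], classes)     # no class is > 0.50
```

In this sketch the first sample is assigned to class 1, while the other two are assigned to class 0 because the rule cannot single out exactly one class.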
 
===Class Pred Most Probable===
Most probable predictions are based on choosing the class that has the highest probability regardless of the magnitude of that probability. Note this differs from Strict class predictions because if more than one class has > 0.50 probability, the highest probability will "win" the sample. Likewise, if all probabilities are below 0.50, the largest probability still "wins".
 
Use these predictions if you need to see a single class assignment for each sample and are not concerned with the absolute probability of the classes. This might be the case when a model has been built on only a few example samples for each class, when samples have been pre-screened as being in one of the classes modeled, or when "no class" has no meaning.
 
There is always a most likely class for a sample to belong to but it is possible that the sample is not well modeled and has low probabilities for all classes. Or it is possible that two classes are similar and a sample belonging to one of them will also have a high predicted probability of belonging to the second class. In these situations it may be more useful to use the Strict class predictions.
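
As a minimal sketch of the most probable rule (again illustrative only; the helper name and the tie-breaking behavior are assumptions, since the text does not say how ties are resolved):

```python
def class_pred_most_probable(probabilities, class_numbers):
    """Illustrative sketch (hypothetical helper, not a toolbox function).

    Returns the class number with the highest probability, regardless of
    its magnitude. Assumption: on an exact tie the first class wins.
    """
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_numbers[best_index]


classes = [1, 2, 3]
both_high = class_pred_most_probable([0.70, 0.60, 0.10], classes)  # class 1 "wins"
all_low = class_pred_most_probable([0.30, 0.40, 0.20], classes)    # class 2 "wins"
```

Both example samples would be assigned to class 0 under the strict rule, but here each still receives a class: 1 for the first and 2 for the second.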
 
===Class Pred Probability <ClassID>===
The predicted probability that a sample belongs to a particular class is a method-dependent calculation as described in [[Class Probability Calculation]], but in general it is calculated such that a sample belonging to this class will have a value closer to 1; otherwise, the value will be closer to 0. A separate probability is calculated for each class, and the class is named in the label. For example, the probability for the class named <ClassID> is available under the label "Class Pred Probability <ClassID>".
 
These predictions are useful when you need to report a confidence of assignment or need to derive special rules for class assignment.
 
===Class Pred Member <ClassID>===
Class member predictions are reported as true/false for each class (<ClassID>) and are similar to the strict class predictions described earlier. A sample will be indicated as a member of a class if and only if the predicted probability for the given class is > 0.50. However, there is no restriction that a sample be assigned to one and only one class. As a result, a sample may be a member of more than one class if each class's probability is > 0.50.
 
These predictions should be used when an analysis permits a sample to belong to more than one class, or to no class, that is, when the classes being predicted are not mutually exclusive. For example, a model that reports both the water solubility of a compound (is or is not water soluble) and whether or not that compound is organic (organic vs. inorganic) should allow all combinations of organic/inorganic and soluble/insoluble without exclusion.
 
The predictions for "Class Pred Member - Unassigned" identify samples which were not assigned to any class because no predicted probability was greater than 0.50.
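
The member and unassigned flags can be sketched as follows. This is an illustration of the rule described above under stated assumptions, not the toolbox's code; the helper names and the class labels "organic" and "soluble" are hypothetical.

```python
def class_pred_member(probabilities, class_ids, threshold=0.5):
    """Illustrative sketch (hypothetical helper, not a toolbox function).

    Returns one true/false flag per class. Each class is tested on its
    own, so a sample may be a member of several classes or of none.
    """
    return {cid: p > threshold for cid, p in zip(class_ids, probabilities)}


def is_unassigned(member_flags):
    """True when no class probability exceeded the threshold."""
    return not any(member_flags.values())


class_ids = ["organic", "soluble"]
in_both = class_pred_member([0.80, 0.90], class_ids)  # member of both classes
in_none = class_pred_member([0.20, 0.10], class_ids)  # member of neither class
```

Here <code>is_unassigned(in_both)</code> is false and <code>is_unassigned(in_none)</code> is true, which mirrors the "Class Pred Member - Unassigned" column.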
 
===Misclassified===
Misclassified predictions identify samples where the predicted "Class Pred Strict" does not agree with the sample's actual class.
For SIMCA and PLSDA the actual class could include more than one class, and the sample is misclassified if its "Class Pred Member <ClassID>" predictions do not correctly predict the actual class(es). If the sample's actual class is unknown then the sample will not be identified as misclassified.
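
One way to read the multiple-class case is as a comparison between the set of predicted member classes and the set of actual classes. The sketch below follows that interpretation; it is an assumption about the rule, not the toolbox's implementation, and the helper name is hypothetical.

```python
def is_misclassified(member_flags, actual_classes):
    """Illustrative sketch (hypothetical helper, not a toolbox function).

    member_flags: dict mapping class ID to the true/false member prediction.
    actual_classes: collection of known class IDs, empty when unknown.
    Assumption: a sample is misclassified when the set of predicted
    member classes differs from the set of actual classes.
    """
    if not actual_classes:
        # Unknown actual class: never flagged as misclassified.
        return False
    predicted = {cid for cid, flag in member_flags.items() if flag}
    return predicted != set(actual_classes)


flags = {"K": True, "BL": True}                      # member of both K and BL
flagged = is_misclassified(flags, {"BL"})            # actual class is BL only
matched = is_misclassified({"K": False, "BL": True}, {"BL"})
unknown = is_misclassified(flags, set())             # actual class unknown
```

Under this reading a sample predicted as a member of both "K" and "BL" but known to belong only to "BL" is flagged, while a sample with no known class is never flagged.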
 
===Example of Classification Predictions===
 
Shown below is an example Scores Plot from PLSDA run on the arch dataset. The Plot Controls window (on the left) shows some of the classification predictions which may be plotted. The X menu is set to "Sample" and the Y menu is set to "Misclassified". The Scores Plot shows that all X samples have value 0 (NOT misclassified) except for one sample, the 16th, which has value 1, indicating it is misclassified. The "Class Pred Most Probable" predictions show this sample correctly predicted as belonging to class 2 ("BL"). However, "Class Pred Member K" and "Class Pred Member BL" both show sample 16 as a member, meaning that sample 16 is predicted to belong to each of these classes with probability > 0.50. Sample 16 actually belongs only to class "BL", as shown by Y="Class Measured 2 (BL)", and it is therefore considered to be misclassified. Note that none of the unknown class samples (samples 64-75) are marked as misclassified.
 
<gallery  widths="798px" heights="547px" perrow="1">
File:Scoresplot_classification.png|Scores Plot (right) and its Plot Controls (left) for PLSDA on arch dataset.
</gallery>
 
 
===Class Probability Calculation===
 
Calculating the probability that a sample belongs to each possible class is done differently for each of the classifier methods, PLSDA, SVMDA, KNN, and SIMCA. These methods are described here.

Revision as of 14:24, 11 January 2018

Introduction

The Variable Selection panel provides an interface to several methods for performing variable selection. The goal is to find subsets of variables that improve predictions compared to using all variables. Finding the best method and option settings will take some experimentation. Use the links below for more information on particular methods.

Methods

  • Automatic (VIP or sRatio)
  • GA - Genetic Algorithm
  • iPLS - Interval PLS
  • rPLS - Recursive PLS
  • sRatio - Selectivity Ratio
  • VIP - Variable Importance in Projection

Work Flow

  • Select a Method - Select a method from the drop-down menu. Options for the method will be displayed. If a previous calculation has been done, the results of it will be displayed.
  • Adjust Options - By default, a simplified set of options is displayed. If the "Show All Options" checkbox is selected then all available options will be displayed. Depending on the options set, a particular method can take an extended amount of time to complete. For example, decreasing the window width in GA will increase the amount of time it takes to complete. See documentation for more details on optional settings.
  • Run Variable Selection - Clicking the "Execute" button will run the current variable selection method with the values specified in the options. A waitbar will be displayed indicating the method is running. Some methods will display a waitbar with a message indicating it can be closed to cancel execution. NOTE: It can take some time for the method to finish a calculation loop and detect that the user has canceled. If "Show Plots" is checked then any additional plots will be displayed in separate windows. This is useful for GA as it will show progress of the calculation.
  • View Results - When a calculation is complete the selected variables will be displayed as green bars under a plot of the data mean.