FAQ: How are ROC curves calculated for PLSDA?


Issue:

How are the ROC curves calculated for PLSDA?

Possible Solutions:

The ROC curves are based on the predicted y-values for each of your samples. These values are not discrete zeros and ones but vary continuously from around zero to around one (take a look at a plot of y predicted to see what we mean). Each point on an ROC curve (or each pair of points in the "threshold" plots on the right-hand side of the ROC figure) comes from calculating the sensitivity and specificity at a given threshold value. Specificity is calculated as the fraction of "not-in-class" samples whose predicted y-value falls below the threshold. Sensitivity is calculated as the fraction of "in-class" samples whose predicted y-value falls above the threshold.
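To make the calculation concrete, here is a minimal Python/NumPy sketch of the idea (this is not the PLS_Toolbox code itself; the data and variable names are made up for illustration):

<pre>
import numpy as np

# Illustrative predicted y-values from a PLSDA model: continuous, roughly 0 to 1
y_pred   = np.array([0.05, 0.22, 0.31, 0.45, 0.48, 0.55, 0.71, 0.83, 0.90, 0.97])
in_class = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1], dtype=bool)

def sens_spec(y_pred, in_class, threshold):
    """Sensitivity and specificity at one threshold value."""
    # Sensitivity: fraction of in-class samples predicted above the threshold
    sensitivity = np.mean(y_pred[in_class] > threshold)
    # Specificity: fraction of not-in-class samples predicted below the threshold
    specificity = np.mean(y_pred[~in_class] < threshold)
    return sensitivity, specificity

# Each threshold gives one (sensitivity, specificity) pair, i.e. one point on
# the empirical ROC curve (plotted as sensitivity vs. 1 - specificity).
thresholds = np.linspace(0.0, 1.0, 101)
pairs = np.array([sens_spec(y_pred, in_class, t) for t in thresholds])
sensitivity, specificity = pairs[:, 0], pairs[:, 1]
</pre>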

These are empirical curves in that they are calculated directly from the data, not from a model of the data's distribution, so there will be some "stepping". In fact, with smaller sample sizes the curves may NEVER be smooth, because sensitivity and specificity only change (up or down) when the threshold moves past a sample's predicted y-value. For example, if the number of "not-in-class" samples above a threshold of 0.46 is no different from the number above 0.45, these two thresholds give exactly the same specificity. As of version 3.5.4 of PLS_Toolbox, we calculate only the "critical" thresholds (those that actually make a difference in the sensitivity and specificity curves) and interpolate between them. Even then, a multi-modal distribution of y-predictions for either the in-class or out-of-class samples will lead to non-smooth curves.
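Continuing the sketch above (reusing y_pred, in_class, and sens_spec), the "critical" thresholds are simply the distinct predicted y-values: sensitivity and specificity can only change when the threshold crosses one of them, so evaluating only those points captures every step of the curves. This illustrates the idea; it is not the PLS_Toolbox implementation:

<pre>
# Critical thresholds: the curves can only step when the threshold crosses a
# predicted y-value, so the distinct y-values (plus endpoints just outside the
# data range) are the only thresholds that need to be evaluated.
critical = np.concatenate(([y_pred.min() - 0.01], np.unique(y_pred), [y_pred.max() + 0.01]))
crit_pairs = np.array([sens_spec(y_pred, in_class, t) for t in critical])

# Points between critical thresholds can be filled in by interpolation
# (e.g. with np.interp) instead of recomputing on a dense threshold grid.
</pre>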

The cross-validated versions of the curves are determined using the same procedure outlined above, except that we use the y-value predicted for each sample when it was left out of the calibration set during cross-validation (see the sketch after the two points below). One might assume that doing multiple replicate cross-validation subsets would lead to smoother cross-validation curves. Two things keep this from happening:

First, prior to version 4.0 of PLS_Toolbox, the software did not average the predicted y-values from multiple replicates; it only kept the predicted y-value from the LAST time a given sample was left out.

Second, even if the above ''issue'' weren't there, the curves would only get smoother if the different sets of samples left out during each cross-validation replicate induced a significant change in the model, and thus in the predicted y-value for a sample. If the models calculated in each cycle are essentially the same, there will be little to no variation in the predicted y-values and the curves will look very similar for all replicates. In fact, significant variation in a predicted y-value from one subset to the next is an indication that the cross-validation is unstable (e.g., outliers in the data, too little data, or "critical" good samples which, when left out, keep a good model from being calculated).
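For completeness, here is a hedged sketch of the cross-validated case, using scikit-learn's PLSRegression on a 0/1-coded y as a stand-in for a PLSDA model (the data, component count, and leave-one-out scheme are assumptions for illustration, not the PLS_Toolbox cross-validation code):

<pre>
import numpy as np
from sklearn.cross_decomposition import PLSRegression   # stand-in for a PLSDA model
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Made-up data: 30 samples, 8 variables, two classes coded y = 0 / 1
X = rng.normal(size=(30, 8))
y = (np.arange(30) < 15).astype(float)
X[y == 1] += 0.8   # shift class 1 so the classes are partly separable

# Cross-validated predictions: each sample's y-value comes from the model
# built in the cycle in which that sample was left out of the calibration set.
y_cv = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = PLSRegression(n_components=2)
    model.fit(X[train_idx], y[train_idx])
    y_cv[test_idx] = model.predict(X[test_idx]).ravel()

# y_cv is then fed into exactly the same sensitivity/specificity calculation
# used for the calibration (self-prediction) ROC curves.
</pre>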


Still having problems? Please contact our helpdesk at helpdesk@eigenvector.com