Tools: Permutation Test

From Eigenvector Research Documentation Wiki
Revision as of 13:23, 1 September 2011 by imported>Jeremy

Permutation Test Tool

Some regression and preprocessing methods are so good at finding correlation between the measured data (X- and Y-blocks) that the model becomes too specific and applies only to that exact data. Such overfit models are often useless for prediction and even for interpretation. In many cases, careful use of cross-validation and/or validation data will help identify when this has happened. Permutation tests are another way to help identify an overfit model. They also provide a probability for judging whether the given model is significantly different from one built under the same conditions but on random data.

Permutation tests involve repeatedly and randomly reordering the y-block and rebuilding the model under the current modeling settings. For a regression problem, this means each sample is assigned a nominally "incorrect" y-value (although the distribution of y-values is maintained, because every sample's y-value is simply reassigned to a different sample). In the case of classification models, reordering the y-block is equivalent to shuffling the class assignments of the samples.
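The reordering step can be illustrated with a minimal sketch in plain Python. This is not the tool's own code; the function name `permute_y` and the example values are hypothetical and chosen only to show that shuffling changes which sample each y-value belongs to while leaving the set of y-values unchanged.

```python
import random

def permute_y(y, rng):
    """Return a shuffled copy of y; the original list is left untouched."""
    shuffled = list(y)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)  # fixed seed so the sketch is repeatable
y = [1.2, 3.4, 2.2, 5.1, 4.0, 0.7]  # made-up y-block values
y_perm = permute_y(y, rng)

# Same values, hence the same distribution; only the pairing with samples changes.
assert sorted(y_perm) == sorted(y)
```

For a classification model the same function applies unchanged if `y` holds class labels instead of numeric values.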

Such permutation tests show to what extent the modeling conditions might be finding "chance correlation" between the x-block and the y-block. After shuffling the y-block samples, the values predicted for each sample from cross-validation and self-prediction (a.k.a. calibration) are calculated, as well as the RMSEC and RMSECV (see Using Cross-Validation) for the given shuffling. The shuffling is repeated multiple times; several statistics are calculated for each shuffling, and all the RMSE results are accumulated. The result is two pieces of information: a table of "Probability of Model Insignificance" and a plot of Sum Squared Y versus Y-block correlation.
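The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: a one-variable least-squares line stands in for the actual regression method, leave-one-out stands in for whatever cross-validation settings are in use, and all function names and data are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(y, y_hat):
    """Root mean squared error between reference and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def rmsec(x, y):
    """Self-prediction error: fit on all samples, predict the same samples."""
    slope, icpt = fit_line(x, y)
    return rmse(y, [slope * xi + icpt for xi in x])

def rmsecv(x, y):
    """Leave-one-out cross-validation error (an assumed CV scheme)."""
    preds = []
    for i in range(len(x)):
        slope, icpt = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + icpt)
    return rmse(y, preds)

def permutation_rmse(x, y, n_perm=50, seed=0):
    """Shuffle y n_perm times; store (RMSEC, RMSECV) for each shuffling."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        results.append((rmsec(x, y_perm), rmsecv(x, y_perm)))
    return results

# Made-up data with a real linear relationship plus a little noise.
x = [float(i) for i in range(12)]
noise = random.Random(1)
y = [2.0 * xi + 1.0 + noise.gauss(0.0, 0.5) for xi in x]

true_rmsec, true_rmsecv = rmsec(x, y), rmsecv(x, y)
perm = permutation_rmse(x, y)
print("unpermuted RMSEC / RMSECV:", true_rmsec, true_rmsecv)
print("smallest permuted RMSECV: ", min(cv for _, cv in perm))
```

When the unpermuted errors sit well below everything the shuffled y-blocks produce, as they should for this synthetic data, chance correlation is an unlikely explanation for the model's fit.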

Probability Table

Probability of Model Insignificance vs. Permuted Samples
For model with 3 component(s)
_________________________________
Y-column:  1
                     Wilcoxon     Sign Test     Rand t-test
Self-Pred (RMSEC) :    0.00         0.00           0.01
Cross-Val (RMSECV):    0.00         0.00           0.01
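The page does not spell out how the three tests in the table are computed. As one illustration only, a one-sided sign test on paired per-sample errors might look like the sketch below. The pairing of errors from the unpermuted model against errors from a permuted model is an assumption, and `sign_test_p` is a hypothetical name, not a function of the tool.

```python
from math import comb

def sign_test_p(err_model, err_permuted):
    """One-sided sign test on paired per-sample errors.

    Counts the samples where the unpermuted model has the smaller error and
    returns the binomial probability of at least that many wins if both
    models were equally good (p = 0.5). Ties are dropped.
    """
    wins = sum(1 for a, b in zip(err_model, err_permuted) if a < b)
    losses = sum(1 for a, b in zip(err_model, err_permuted) if a > b)
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs: no evidence of a difference
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Made-up squared errors: the unpermuted model wins on all 10 samples.
p = sign_test_p([0.1] * 10, [1.0] * 10)
print(p)  # 1/1024, about 0.001
```

A value near zero, like the entries in the table above, would indicate that the unpermuted model is unlikely to be equivalent to a model built on shuffled data.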

SSQ Y Plot

For each shuffled y-block, the root mean squared errors of calibration and cross-validation (RMSEC and RMSECV, respectively) are calculated and stored. In general practice, the RMSEC will always decrease as the modeling conditions push towards