FAQ: How does PCA cross-validation work in PLS_Toolbox and Solo and how do I set up the command-line options?

From Eigenvector Research Documentation Wiki

Issue:

How does PCA cross-validation work in PLS_Toolbox and Solo and how do I set up the command-line options to best use it?

Possible Solutions:

Missing Data Approach to PCA Cross-Validation

PCA cross-validation in PLS_Toolbox and Solo is very different from that in other software packages. It uses a missing data approach: it tests how well the PCA model replaces left-out columns (variables) of the left-out samples, so it leaves out both samples AND variables. This gives an estimate of how well the PCA model is fitting systematic information (which is needed to replace the missing variables) versus noise (which is not useful in replacing missing variables and is, in fact, detrimental to doing so). This is more diagnostic than the standard residuals test, in which the residuals keep decreasing toward zero as components are added.
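The idea can be illustrated outside the toolbox with a few lines of plain MATLAB. The following is a minimal sketch, not the PLS_Toolbox implementation. It assumes synthetic rank-2 data, a 5-way interleaved sample split, leave-one-out on the variables, and that a left-out variable is estimated by a least-squares projection of the remaining variables onto the matching rows of the loadings. All variable names are illustrative.

```matlab
% Sketch of missing-data PCA cross-validation (illustrative only; this is
% NOT the PLS_Toolbox code). Data, split sizes, and names are assumptions.
m = 40; n = 12; k = 2;                        % samples, variables, true rank
X = randn(m,k)*randn(k,n) + 0.1*randn(m,n);   % systematic part plus noise
X = X - repmat(mean(X,1),m,1);                % mean-center (simplified: uses all samples)

nsplit = 5;                                   % number of sample splits (assumed)
group  = mod(0:m-1,nsplit) + 1;               % interleaved sample assignment
maxpc  = 6;                                   % largest model size to test
press  = zeros(1,maxpc);                      % squared prediction error per model size

for s = 1:nsplit
  test    = (group == s);                     % samples left out in this split
  [~,~,V] = svd(X(~test,:),'econ');           % loadings from the remaining samples
  for j = 1:n                                 % leave one variable out at a time
    keep    = true(1,n);
    keep(j) = false;
    for a = 1:maxpc
      P    = V(:,1:a);                        % loadings of an a-component model
      T    = X(test,keep) / P(keep,:)';       % scores estimated without variable j
      xhat = T * P(j,:)';                     % model's replacement for variable j
      press(a) = press(a) + sum((X(test,j) - xhat).^2);
    end
  end
end

rmsecv = sqrt(press/(m*n));                   % one value per number of components
disp(rmsecv)
```

Components that capture systematic variation should lower rmsecv, while components that mostly fit noise should stop improving it or make it worse. Because the data are random, the exact values differ from run to run.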

For a description of how this method is done and how it compares to other PCA cross-validation methods, see:

"Cross-validation of component models: A critical look at current methods," R. Bro, K. Kjeldahl, A. K. Smilde, H. A. L. Kiers; Analytical and Bioanalytical Chemistry, March 2008, Volume 390, Issue 5, pp 1241-1251

Command-Line Settings for PCA Cross-Validation

The unusual approach to cross-validation means that, when calling the crossval function for PCA, you need to define both a pattern for leaving out samples (the "cvi" input to crossval) AND a pattern for leaving out variables (the "pcacvi" option in the options input). The I/O for crossval is:

>> results = crossval(x,y,rm,cvi,ncomp,options);

By default, the pcacvi option is 'loo', meaning it leaves one variable out at a time. Thus, for each split of samples, crossval splits the data again into as many sets as there are variables.

In the Analysis window, the pcacvi setting is chosen automatically from the number of included variables (n) using the following logic:

>> if n>25;
      cvopts.pcacvi = {'con' min(10,floor(sqrt(n)))};
    else
      cvopts.pcacvi = {'loo'}; 
    end

This says that, if the number of included variables is 25 or fewer, it uses a "leave one out" pattern on the variables. Otherwise, it leaves out contiguous blocks of variables, with the number of blocks equal to the square root of the number of variables (rounded down) or 10, whichever is less. For example, with 1500 variables, it would choose 10 splits (because floor(sqrt(1500)) = 38, which is more than 10). This means that crossval would split the variables into 10 groups and leave out one group at a time.
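As a quick check of that rule, the following plain MATLAB sketch evaluates it for a few variable counts. The counts in nvars are arbitrary examples, and nsplits is a name introduced here for illustration only.

```matlab
% Evaluate the Analysis-window pcacvi rule for several example variable counts.
nvars   = [10 25 26 100 1500];                 % example numbers of included variables
nsplits = zeros(size(nvars));                  % resulting number of variable splits
for i = 1:numel(nvars)
  n = nvars(i);
  if n > 25
    pcacvi     = {'con' min(10,floor(sqrt(n)))};   % contiguous blocks
    nsplits(i) = pcacvi{2};
  else
    pcacvi     = {'loo'};                          % leave one variable out
    nsplits(i) = n;                                % one split per variable
  end
  fprintf('n = %4d: %-3s, %2d splits\n', n, pcacvi{1}, nsplits(i));
end
```

Leave-one-out is used up to 25 variables, 26 variables gives 5 blocks, and from 100 variables upward the cap of 10 blocks applies. At the command line, the same cell array can be assigned to the pcacvi option before calling crossval.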

Using a contiguous block split when many variables are present is both significantly faster and more accurate (with 1500 variables, you can often expect a lot of correlated noise between variables). Note that this assumes adjacent variables are correlated with each other. Where they are not, you can use other leave-out patterns on the variables, including custom sets where you choose the pattern.

So, to maximize the accuracy and speed of PCA cross-validation, modify the pcacvi option when calling crossval. For example:

>> opts        = crossval('options');
>> opts.pcacvi = {'con' 10};
>> results     = crossval(x,[],'pca',{'con' 5}, 15, opts);

The command-line function crossval does not "automatically" adjust the pcacvi because many command-line users want more control over such options and switching from one method to another could cause unexpected results when doing very specific testing.


Still having problems? Please contact our helpdesk at helpdesk@eigenvector.com