Crossval

From Eigenvector Research Documentation Wiki
Revision as of 19:01, 8 October 2008 by imported>Scott (→‎Description)

Purpose

Cross-validation for PCA, PLS, MLR, and PCR.

Synopsis

results = crossval(x,y,rm,cvi,ncomp,options)
[press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)

Description

CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and principal components analysis (PCA). Inputs are the M by N predictor variable matrix (x), the predicted variable (y), the regression method (rm), the cross-validation method (cvi), and the maximum number of latent variables / components (ncomp). For rm = 'pca', (y) is empty [ ].

The regression method (rm) can be any of the following:

rm = 'pca' performs cross-validation for PCA (see PCA)
rm = 'mlr' performs cross-validation for MLR
rm = 'pcr' performs cross-validation for PCR (see PCR)
rm = 'nip' performs cross-validation for PLS using NIPALS
rm = 'sim' or 'pls' performs cross-validation for PLS using SIMPLS (see PLS)
rm = 'correlationpcr' performs cross-validation for CorrelationPCR
rm = 'lwr' performs cross-validation for Locally Weighted Regression (see LWRPRED).
rm = 'cls' performs cross-validation for CLS

The cross-validation method (cvi) can be

cvi = {method splits iterations}, a cell array containing one of the cross-validation methods (method) below with the appropriate parameters (splits) and (iterations).
cvi = a vector representing user-defined cross-validation groups.

The cross-validation method (method) can be

cvi = {'loo'} : leave one out cross-validation (each sample left out on its own; inputs (splits) and (iterations) are not used).
cvi = {'vet' (splits)} : venetian blinds cross-validation for (splits) subsets. Each subset includes every splits-th sample. For example, if splits = 3 the first subset uses rows 1:3:end, the second uses 2:3:end and the third subset uses 3:3:end.
cvi = {'vet' (splits) (iter)} randomly moves the starting point for the first (and subsequent) blocks.
cvi = {'con' (splits)} contiguous blocks cross-validation for (splits) subsets.
cvi = {'con' (splits) (iter)} randomly moves the starting point for the first (and subsequent) blocks. E.g., cvi = {'con' 5} gives 5 contiguous blocks (one iteration).
cvi = {'rnd' (splits) (iter)} : random subset selected cross-validation for (splits) subsets and (iter) iterations [number of replicate splits to perform].
cvi = an M-element vector with integer elements allowing user-defined subsets. (cvi) is a vector with the same number of elements as x has rows, i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:
cvi(i) = -2 the sample is always in the test set,
cvi(i) = -1 the sample is always in the calibration set,
cvi(i) = 0 the sample is never used, and
cvi(i) = 1,2,3... defines each test subset.
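As a rough illustration of the user-defined vector form, the sketch below builds a (cvi) vector that reproduces the venetian blinds pattern described above for splits = 3, then overrides a few entries with the special codes. The sample count M and the overridden sample indices are made up for illustration. The final call is commented out because it assumes x and y are loaded and PLS_Toolbox is on the path.

```matlab
M      = 12;                      % hypothetical number of samples (rows of x)
splits = 3;                       % number of test subsets

% Venetian blinds pattern: subset k holds rows k:splits:end
cvi = mod(0:M-1, splits) + 1;     % 1 2 3 1 2 3 ...

% Override individual samples with the special codes (indices are arbitrary)
cvi(1)  = -1;                     % always in the calibration set
cvi(2)  =  0;                     % never used
cvi(12) = -2;                     % always in the test set

% List the rows that remain in each numbered test subset
for k = 1:splits
    fprintf('subset %d: rows %s\n', k, mat2str(find(cvi == k)));
end

% With PLS_Toolbox available and x, y loaded (not run here):
% res = crossval(x, y, 'sim', cvi, 10);
```

The vector has one entry per row of x, so length(cvi) must equal size(x,1) before it is passed to crossval.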

Options

Optional input options is an options structure containing the following fields:

display : [ 'off' | {'on'} ] Governs output to command window.
plots : [ 'none' | {'final'} ] Governs plotting.
preprocessing : an integer or cell array used to control preprocessing. Default is mean centering: preprocessing = 1. (preprocessing) can be input in two ways:
preprocessing = [ integer ]
preprocessing = 0: uses no preprocessing
preprocessing = 1: uses mean centering {default}
preprocessing = 2: uses autoscaling.
preprocessing = {xp yp}, is a cell array containing a preprocessing structure(s) for the X- and Y-blocks respectively (see PREPROCESS). For example:
preprocessing = {xp [ ]} is used for PCA.
preprocessing = {xp yp} is used to preprocess the X- and Y-blocks respectively.
threshold : [ ] Alternative PLSDA threshold level {default = [ ] = automatic}.
prior : [ ] Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0 to 1) of observing a "1" for each column of y (i.e., each class). E.g., [0.25 0.50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the Y-block. [ ] (Empty) = equal priors.
structureoutput : [ {'no'} | 'yes' ] Governs output variables. 'yes' returns a structure instead of individual variables. 'yes' is the default if only one output is requested.
jackknife : [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jackknifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
rmsec : [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of the "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs).
pcacvi : {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to the cross-validation (cvi) options and include:
{'loo'} leave one variable out at a time,
{'con' splits} contiguous blocks (total of splits groups),
{'vet' splits} venetian blinds (every n'th variable), or
{'rnd' splits} random subsets (note: no iterations).
fastpca : [ 'off' | {'auto'} ] Governs use of the "fast" PCA cross-validation algorithm. 'off' never uses the fast algorithm; 'auto' uses the fast algorithm when other options permit. Fast PCA can only be used with pcacvi set to 'loo'.
lwr : Sub-structure of options to use for locally weighted regression cross-validation. Most of these options are used as defined in the LWRPRED function (see LWRPRED for more details), but there are two additional options defined for cross-validation:
lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
In all cases, the number of samples in an individual test set will be the upper limit of samples to include in any LWR prediction.
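The options above can be collected in an ordinary MATLAB structure. The sketch below sets a handful of the documented fields (following the Examples section, which sets only the fields it changes) and then works through the lwr.ptsperterm arithmetic for 1 to 3 LVs. Treating lwr.minimumpts as a floor via max() is an assumption drawn from its description, not confirmed toolbox behavior, and the test-set upper limit is not modeled here.

```matlab
% Options structure using field names from the list above
opts = struct();
opts.display        = 'off';      % suppress command-window output
opts.plots          = 'none';     % suppress plotting
opts.preprocessing  = 2;          % integer form: autoscaling
opts.jackknife      = 'yes';      % keep jackknifed regression vectors
opts.lwr.minimumpts = 20;         % documented default
opts.lwr.ptsperterm = 20;         % documented default

% Points per LWR sub-model for models with 1, 2 and 3 LVs
lvs = 1:3;
if opts.lwr.ptsperterm == 0
    % Same number of points regardless of model size
    npts = repmat(opts.lwr.minimumpts, size(lvs));
else
    % ASSUMPTION: minimumpts acts as a lower bound on ptsperterm * LVs
    npts = max(opts.lwr.minimumpts, opts.lwr.ptsperterm * lvs);
end
disp(npts)

% With PLS_Toolbox available and x, y loaded (not run here):
% res = crossval(x, y, 'lwr', {'vet' 5}, 3, opts);
```

With the default values this reproduces the 20 and 40 sample counts quoted for 1 LV and 2 LV models.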

Outputs

press : predictive residual error sum of squares PRESS for each subset (subsets are rows of this matrix, number of components are columns). Note that for multivariate (y) the output (press) is grouped by output variable, i.e., all of the PRESS values for the first Y-variable are followed by all of the PRESS values for the second Y-variable, etc.
cumpress : cumulative PRESS (sum of columns of press).
rmsecv : root mean square error of cross-validation.
rmsec : root mean square error of calibration.
cvpred : cross-validation Y-predictions (regression methods only). If cross-validation method was random cvi = {'rnd' (splits) (iter)}, this is the average prediction of all replicates.
misclassed : fractional misclassifications for each class (valid for regression methods only, and only when input (y) is a class logical, i.e., discrete-value, vector).
reg : jack-knifed regression vectors from each sub-set. This will be size [ncomp*Ny by Nx by (splits)] such that reg(1,:,:) will be the regression vectors for 1 component model of the first column of (y) for all subsets [a 1 by Nx by (splits) matrix]. Use SQUEEZE to reduce to an Nx by (splits) matrix. [note: options.jackknife must be 'yes' to use (reg)].
If options.structureoutput = 'yes', a single output (results) will return all the above outputs as fields in a structure array. If options.rmsec = 'no', then RMSEC is not returned (provides faster iterative calculation).
When (options.plots) is not 'none', both RMSECV and RMSEC are plotted.
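To make the output shapes concrete, the sketch below uses placeholder arrays (not real cross-validation results) with hypothetical sizes. It shows how (cumpress) follows from (press) as a column sum, and how SQUEEZE reduces one row of (reg) to an Nx by (splits) matrix.

```matlab
% Hypothetical sizes, for illustration only
nsub  = 5;                         % number of subsets (splits)
ncomp = 10;                        % number of components
Nx    = 8;                         % number of x variables
Ny    = 1;                         % number of y variables

% Placeholder arrays with the documented layouts (values are not real results)
press = ones(nsub, ncomp);         % subsets are rows, components are columns
reg   = zeros(ncomp*Ny, Nx, nsub); % jackknifed regression vectors

% Cumulative PRESS: sum down each column of press
cumpress = sum(press, 1);          % 1 by ncomp

% Regression vectors of the 1-component model for the first y column
b1 = squeeze(reg(1, :, :));        % Nx by nsub

disp(size(cumpress))
disp(size(b1))
```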

Examples

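% PLS (NIPALS), leave one out, up to 10 latent variables
[press,cumpress] = crossval(x,y,'nip',{'loo'},10);
% PCR, venetian blinds with 3 splits
[press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
% PLS (NIPALS), 5 contiguous blocks
[press,cumpress] = crossval(x,y,'nip',{'con',5},10);
% PLS (SIMPLS), random subsets: 3 splits, 20 iterations
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
% Single output: results are returned as a structure
res = crossval(x,y,'sim',{'rnd',3,20},10);
% Autoscale both blocks and turn plotting off through the options structure
pre = {preprocess('autoscale') preprocess('autoscale')};
opts.preprocessing = pre;
opts.plots = 'none';
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
res = crossval(x,y,'sim',{'rnd',3,20},10,opts);
% PCA: pass an empty y
[press,cumpress] = crossval(x,[],'pca',{'loo'},10);
[press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
res = crossval(x,[],'pca',{'con',5},10);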

See Also

encodemethod, pca, pcr, pls, preprocess, ncrossval