Crossval

From Eigenvector Research Documentation Wiki
Revision as of 15:24, 3 September 2008 by imported>Jeremy (Importing text file)

Purpose

Cross-validation for PCA, PLS, MLR, and PCR.

Synopsis

results = crossval(x,y,rm,cvi,ncomp,options)
[press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)

Description

CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and principal components analysis (PCA). Inputs are the predictor variable matrix x, predicted variable y (y is empty [] for rm = 'pca'), regression method rm, cross-validation method cvi, and maximum number of latent variables / components ncomp.

rm = 'pca' performs cross-validation for PCA,

rm = 'mlr' performs cross-validation for MLR,

rm = 'pcr' performs cross-validation for PCR,

rm = 'nip' performs cross-validation for PLS using NIPALS,

rm = 'sim' or 'pls' performs cross-validation for PLS using SIMPLS,

rm = 'correlationpcr' performs cross-validation for CorrelationPCR, and

rm = 'lwr' performs cross-validation for Locally Weighted Regression (see LWRPRED).

cvi can be 1) a cell containing one of the cross-validation methods below with the appropriate parameters {method splits iterations}, or 2) a vector representing user-defined cross-validation groups.

  • loo : leave one out cross-validation (each sample left out on its own; does not take splits or iterations as inputs),
  • vet : {splits} venetian blinds (every n-th sample together),
  • con : {splits} contiguous blocks, and
  • rnd : {splits iter} random subsets.

Except for leave-one-out, all methods require the number of data splits, splits. Random subsets ('rnd') also require the number of iterations, iter, which sets the number of replicate splits to perform. For 'con' and 'vet', supplying iterations randomly moves the starting point of the first (and subsequent) blocks on each iteration.

E.g. cvi = {'con' 5}; for 5 contiguous blocks (one iteration)
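
To make the two deterministic patterns concrete, the sketch below builds subset labels that mimic venetian blinds and contiguous blocks for a small data set. It is plain MATLAB and only an illustration: n and splits are made-up values, and the exact block boundaries that crossval chooses internally may differ.

```matlab
n      = 10;   % number of samples (illustrative)
splits = 3;    % number of data splits (illustrative)

% Venetian blinds: every splits-th sample falls in the same subset
vet = mod(0:n-1, splits) + 1;       % 1 2 3 1 2 3 1 2 3 1

% Contiguous blocks: neighbouring samples share a subset
con = ceil((1:n) * splits / n);     % 1 1 1 2 2 2 3 3 3 3

disp([vet; con]);
```

Venetian blinds spread each test subset across the whole data set, while contiguous blocks keep neighbouring samples (for example, replicates or time-ordered measurements) together.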

For user-defined cross-validation, cvi is a vector with the same number of elements as x has rows (i.e. length(cvi) = size(x,1); when x is class "double", or length(cvi) = size(x.data,1); when x is class "dataset") with integer elements, defining test subsets. Each cvi(i) is defined as:

cvi(i) = -2 the sample is always in the test set,

cvi(i) = -1 the sample is always in the calibration set,

cvi(i) = 0 the sample is never used, and

cvi(i) = 1,2,3... defines each subset.
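
The following sketch shows one reading of these codes: for each numbered subset it lists which samples would be used for calibration and which for testing. The cvi vector is hypothetical, and the logical masks are an interpretation of the definitions above, not code taken from crossval.

```matlab
% Hypothetical user-defined cvi for 8 samples
cvi = [-1 1 1 2 2 3 -2 0];

nsub = max(cvi);                     % number of numbered subsets (3 here)
for k = 1:nsub
    test = (cvi == k) | (cvi == -2);            % left-out subset plus always-test samples
    cal  = (cvi ~= k) & (cvi > 0 | cvi == -1);  % other subsets plus always-calibration samples
    fprintf('subset %d: cal = [%s], test = [%s]\n', k, ...
        num2str(find(cal)), num2str(find(test)));
end
```

Sample 8 (code 0) appears in neither mask. A vector like this is passed in place of the cell, e.g. crossval(x,y,'sim',cvi,10), provided its length matches the number of rows of x.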

Options

Optional input options is an options structure containing one or more of the following fields:

  • display: [ 'off' | {'on'} ] Governs output to command window,
  • plots: [ 'none' | {'final'} ] Governs plotting,
  • preprocessing: {[1]} Controls preprocessing. Default is mean centering (1). Can be input in two ways:
    • a) As a single value: 0 = none, 1 = mean centering, 2 = autoscaling, or
    • b) As {xp yp}, a cell array containing preprocessing structures for the X- and Y-blocks (see PREPROCESS). To include preprocessing of each subset, use pre = {xp yp}; or, for PCA, pre = {xp []}. To avoid preprocessing of each subset, use pre = {[] []}; or pre = 0 (zero).
  • threshold: {[]} Alternative PLSDA threshold level (default = [] = automatic)
  • prior: {[]} Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0-1) of observing a "1" for each column of y (i.e. each class). E.g. [.25 .50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the y-block. [] (empty) = equal priors.
  • structureoutput: [ {'no'} | 'yes' ] Governs output variables. 'yes' returns a structure instead of individual variables. 'yes' is the default if only one output is requested.
  • jackknife: [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jack-knifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
  • rmsec: [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs)
  • pcacvi: {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to cross-validation CVI options and include:
    • {'loo'} leave one variable out at a time
    • {'con' splits} contiguous blocks (total of splits groups)
    • {'vet' splits} venetian blinds (every n-th variable), or
    • {'rnd' splits} random subsets (note: no iterations)
  • fastpca: [ 'off' | {'auto'} ] Governs use of the "fast" PCA cross-validation algorithm. 'off' never uses the fast algorithm; 'auto' uses it when other options permit. The fast algorithm can only be used with pcacvi set to 'loo'.
  • lwr: Sub-structure of options to use for locally-weighted regression cross-validation. Most of these options are used as defined in the LWRPRED function (see LWRPRED for more details) but there are two additional options defined for cross-validation:
    • lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
    • lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
    • In all cases, the number of samples in an individual test set will be the upper limit of samples to include in any LWR prediction.
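
As a minimal sketch of how such a structure might be assembled, the block below sets a few of the fields listed above. It assumes, as in the examples on this page, that a partially filled structure is accepted and that unset fields keep their defaults; the call itself is left as a comment because it needs PLS_Toolbox and data in the workspace.

```matlab
opts = struct();
opts.display       = 'off';    % suppress command-window output
opts.plots         = 'none';   % no plots
opts.preprocessing = 2;        % autoscaling (0 = none, 1 = mean centering)
opts.jackknife     = 'yes';    % keep jack-knifed regression vectors
opts.rmsec         = 'no';     % skip RMSEC to save time

% With PLS_Toolbox on the path and x, y defined:
% res = crossval(x, y, 'sim', {'vet' 5}, 10, opts);
```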

Outputs

  • press: predictive residual error sum of squares PRESS for each subset (subsets are rows of this matrix, number of components are columns)
  • cumpress: cumulative PRESS (sum of columns of press).
  • rmsecv: root mean square error of cross-validation.
  • rmsec: root mean square error of calibration.
  • cvpred: cross-validation y-predictions (regression methods only). If cross-validation method was random, this is the average prediction of all replicates.
  • misclassed: fractional misclassifications for each class (valid for regression methods only, and only when y is a logical, i.e. discrete-value, vector).
  • reg: jack-knifed regression vectors from each subset. This will be size [k*ny nx splits] such that reg(1,:,:) will be the regression vectors for the 1-component model of the first column of y for all subsets (a 1 by nx by splits array). Use squeeze to reduce this to an nx by splits matrix. (Note: options.jackknife must be 'yes' to use reg.)
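
The indexing described for reg can be tried on a placeholder array of the documented shape. In this sketch the sizes are made up, k is assumed to be the number of components, and zeros stands in for real regression vectors; only the first row is used, since the ordering of the remaining rows is not spelled out here.

```matlab
% Placeholder with the documented layout [k*ny nx splits]
k = 3; ny = 1; nx = 4; splits = 5;   % illustrative sizes
reg = zeros(k*ny, nx, splits);       % stands in for the real output

b1 = squeeze(reg(1,:,:));            % 1-component model, first y column
disp(size(b1));                      % nx by splits

% Spread of each coefficient across the subsets
spread = std(b1, 0, 2);              % nx by 1
```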

If options.structureoutput is 'yes', a single output (results) will return all the above outputs as fields in a structure. If options.rmsec is 'no', then RMSEC is not returned (provides faster iterative calculation)

Note that for multivariate y, the output press is grouped by output variable, i.e. all of the PRESS values for the first variable are followed by all of the PRESS values for the second variable, etc.

When options.plots is not 'none', plots of both RMSECV and RMSEC are provided.
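
To show how press, cumpress and rmsecv relate, here is a hand-rolled cross-validation of a least-squares (MLR) model in plain MATLAB. It is a sketch of the general procedure, not the crossval implementation: the data are synthetic and noise-free, mean centering is recomputed on each calibration subset to mirror the default preprocessing, and RMSECV is taken as sqrt(cumpress/n), a common definition that is assumed here.

```matlab
% Synthetic, noise-free data (illustrative only)
x = [(1:12)', mod((1:12)', 5)];
y = x*[2; -1] + 3;

cvi   = repmat(1:3, 1, 4)';          % user-defined subsets 1 2 3 1 2 3 ...
nsub  = max(cvi);
press = zeros(nsub, 1);              % rows = subsets; one column (MLR has no LVs)
for k = 1:nsub
    test = (cvi == k);               % samples left out in this round
    xc = x(~test,:);  yc = y(~test);
    mx = mean(xc, 1); my = mean(yc); % centering from calibration data only
    b  = bsxfun(@minus, xc, mx) \ (yc - my);          % least-squares fit
    yhat = bsxfun(@minus, x(test,:), mx)*b + my;      % predict left-out samples
    press(k) = sum((y(test) - yhat).^2);
end
cumpress = sum(press, 1);                 % sum over subsets
rmsecv   = sqrt(cumpress / numel(y));     % assumed definition
```

Because y is an exact linear function of x, the resulting rmsecv is zero to within rounding error; with real data the same loop would be repeated for each number of components to fill the columns of press.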

Examples

% PLS (NIPALS), leave-one-out, up to 10 LVs
[press,cumpress] = crossval(x,y,'nip',{'loo'},10);
% PCR, venetian blinds with 3 splits
[press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
% PLS (NIPALS), 5 contiguous blocks
[press,cumpress] = crossval(x,y,'nip',{'con',5},10);
% PLS (SIMPLS), random subsets: 3 splits, 20 iterations
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
% Same settings, single output (results structure)
res = crossval(x,y,'sim',{'rnd',3,20},10);
% Autoscale both blocks within each subset and suppress plots
pre = {preprocess('autoscale') preprocess('autoscale')};
opts.preprocessing = pre;
opts.plots = 'none';
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
res = crossval(x,y,'sim',{'rnd',3,20},10,opts);
% PCA (y is empty)
[press,cumpress] = crossval(x,[],'pca',{'loo'},10);
[press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
res = crossval(x,[],'pca',{'con',5},10);

See Also

encodemethod, pca, pcr, pls, preprocess, ncrossval