Crossval

Revision as of 15:04, 7 October 2008

Purpose

Cross-validation for PCA, PLS, MLR, and PCR.

Synopsis

results = crossval(x,y,rm,cvi,ncomp,options)
[press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)

Description

CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and principal components analysis (PCA). Inputs are the M by N predictor variable matrix (x), the predicted variable (y) (empty, [ ], when rm = 'pca'), the regression method (rm), the cross-validation method (cvi), and the maximum number of latent variables / components (ncomp).

The regression method (rm) can be any of the following:

rm = 'pca' performs cross-validation for PCA (see PCA)
rm = 'mlr' performs cross-validation for MLR
rm = 'pcr' performs cross-validation for PCR (see PCR)
rm = 'nip' performs cross-validation for PLS using NIPALS
rm = 'sim' or 'pls' performs cross-validation for PLS using SIMPLS (see PLS)
rm = 'correlationpcr' performs cross-validation for CorrelationPCR
rm = 'lwr' performs cross-validation for Locally Weighted Regression (see LWRPRED).

The cross-validation method (cvi) can be

cvi = {method splits iterations}, a cell array containing one of the cross-validation methods (method) below with the appropriate parameters (splits) and (iterations).
cvi = a vector representing user-defined cross-validation groups.

The cross-validation method (method) can be

cvi = {'loo'} : leave-one-out cross-validation (each sample is left out on its own); inputs (splits) and (iterations) are not used.
cvi = {'vet' (splits)} : venetian blinds cross-validation for (splits) subsets. Each subset includes every splits-th sample. For example, if splits = 3 the first subset uses rows 1:3:end, the second uses 2:3:end and the third subset uses 3:3:end.
cvi = {'vet' (splits) (iter)} randomly moves the starting point for the first (and subsequent) blocks.
cvi = {'con' (splits)} contiguous blocks cross-validation for (splits) subsets.
cvi = {'con' (splits) (iter)} randomly moves the starting point for the first (and subsequent) blocks. E.g., cvi = {'con' 5}; for 5 contiguous blocks (one iteration).
cvi = {'rnd' (splits) (iter)} : random subset cross-validation for (splits) subsets and (iter) iterations [number of replicate splits to perform].
cvi = an M-element vector with integer elements allowing user-defined subsets. (cvi) is a vector with the same number of elements as x has rows, i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:
cvi(i) = -2 the sample is always in the test set,
cvi(i) = -1 the sample is always in the calibration set,
cvi(i) = 0 the sample is never used, and
cvi(i) = 1,2,3... defines each test subset.
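
The fixed-split schemes above can also be written out as explicit group vectors, which makes it easier to see how each scheme assigns the rows of x. The following is a minimal sketch in base MATLAB, assuming 7 samples and 3 splits. The variable names (m, splits, vet, con, custom) are illustrative only, and the block boundaries that CROSSVAL itself chooses for 'con' when M is not divisible by (splits) are an assumption here and may differ.

```matlab
m = 7;        % number of samples (rows of x), chosen for illustration
splits = 3;   % number of cross-validation subsets

% Venetian blinds: subset k holds rows k:splits:end
vet = mod((0:m-1)', splits) + 1;     % [1 2 3 1 2 3 1]'

% Contiguous blocks: consecutive rows share a subset (assumed boundaries)
con = ceil((1:m)' * splits / m);     % [1 1 2 2 3 3 3]'

% User-defined vector: start from venetian blinds, then pin special samples
custom = vet;
custom(1) = -1;   % always in the calibration set
custom(2) = -2;   % always in the test set
custom(3) = 0;    % never used
```

A vector built this way can be passed as the cvi input, provided length(custom) equals size(x,1).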

Options

Optional input options is an options structure containing the following fields:

display : [ 'off' | {'on'} ] Governs output to command window.
plots : [ 'none' | {'final'} ] Governs plotting.
preprocessing : an integer or cell array used to control preprocessing. Default is mean centering: preprocessing = 1. (preprocessing) can be input in two ways:
preprocessing = [ integer ]
preprocessing = 0: uses no preprocessing
preprocessing = 1: uses mean centering {default}
preprocessing = 2: uses autoscaling.
preprocessing = {xp yp} is a cell array containing preprocessing structures for the X- and Y-blocks, respectively (see PREPROCESS). For example:
preprocessing = {xp [ ]} is used for PCA.
preprocessing = {xp yp} is used to preprocess the X- and Y-blocks respectively.
threshold : [ ] Alternative PLSDA threshold level {default = [ ] = automatic}.
prior : [ ] Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0 to 1) of observing a "1" for each column of y (i.e., each class). E.g., [0.25 0.50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the Y-block. [ ] (Empty) = equal priors.
  • structureoutput: [ {'no'} | 'yes' ] Governs output variables. 'yes' returns a structure instead of individual variables. 'yes' is the default if only one output is requested.
  • jackknife: [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jackknifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
  • rmsec: [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of the "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs).
  • pcacvi: {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to the cross-validation (cvi) options and include:
    • {'loo'} leave one variable out at a time,
    • {'con' splits} contiguous blocks (total of splits groups),
    • {'vet' splits} venetian blinds (every n'th variable), or
    • {'rnd' splits} random subsets (note: no iterations).
  • fastpca: [ 'off' | {'auto'} ] Governs use of the "fast" PCA cross-validation algorithm. 'off' never uses the fast algorithm; 'auto' uses the fast algorithm when other options permit. Fast PCA can only be used with pcacvi set to 'loo'.
  • lwr: Sub-structure of options to use for locally weighted regression cross-validation. Most of these options are used as defined in the LWRPRED function (see LWRPRED for more details), but there are two additional options defined for cross-validation:
    • lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
    • lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
    • In all cases, the number of samples in an individual test set will be the upper limit of samples to include in any LWR prediction.
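
The fields listed above are set on an ordinary MATLAB structure before calling CROSSVAL. The sketch below only builds such a structure, using field names and values taken from this section. The variable name opts follows the Examples section, the combination of settings is an arbitrary illustration rather than a recommendation, and it is assumed (not confirmed here) that fields left unset fall back to their defaults.

```matlab
opts = struct();
opts.display       = 'off';    % suppress command-window output
opts.plots         = 'none';   % no plotting
opts.preprocessing = 2;        % autoscaling (0 = none, 1 = mean centering)
opts.jackknife     = 'yes';    % store jackknifed regression vectors (reg)
opts.rmsec         = 'no';     % skip RMSEC for faster calculation

% Passed as the last input, using the call form from the Synopsis:
% results = crossval(x,y,'sim',{'vet' 5},10,opts);
```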

Outputs

press : predictive residual error sum of squares PRESS for each subset (subsets are rows of this matrix, number of components are columns).
cumpress : cumulative PRESS (sum of columns of press).
rmsecv : root mean square error of cross-validation.
rmsec : root mean square error of calibration.
cvpred : cross-validation Y-predictions (regression methods only). If cross-validation method was random cvi = {'rnd' (splits) (iter)}, this is the average prediction of all replicates.
misclassed : fractional misclassifications for each class (valid for regression methods only, and only when input (y) is a logical, i.e., discrete-value, vector).
reg : jack-knifed regression vectors from each sub-set. This will be size [ncomp*Ny by Nx by (splits)] such that reg(1,:,:) will be the regression vectors for 1 component model of the first column of (y) for all subsets [a 1 by Nx by (splits) matrix]. Use SQUEEZE to reduce to an Nx by (splits) matrix. [note: options.jackknife must be 'yes' to use (reg)].
If options.structureoutput = 'yes', a single output (results) will return all the above outputs as fields in a structure array. If options.rmsec = 'no', then RMSEC is not returned (provides faster iterative calculation).
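
The shape bookkeeping for (reg) can be checked without running a model. This is a minimal sketch, assuming placeholder dimensions (ncomp = 3, Ny = 1, Nx = 5, 4 splits) and a zero-filled stand-in array rather than real CROSSVAL output:

```matlab
ncomp = 3; ny = 1; nx = 5; splits = 4;   % placeholder sizes
reg = zeros(ncomp*ny, nx, splits);       % stand-in with the documented shape

b1 = reg(1,:,:);     % 1-component model, first y column: 1 by nx by splits
b1 = squeeze(b1);    % drop the singleton dimension: nx by splits

disp(size(b1))       % size is [5 4]
```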

Note that for multivariate (y) the output (press) is grouped by output variable, i.e. all of the PRESS values for the first variable are followed by all of the PRESS values for the second variable, etc.
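
The relationship between (press) and (cumpress) given in the Outputs list can be reproduced by hand. The sketch below uses a made-up 2-subset by 3-component press matrix for a single y column. The RMSECV line assumes the common definition sqrt(cumpress/M), with M the total number of samples; this page does not state the formula, so treat that line as an assumption.

```matlab
press = [4 2 3;     % subset 1: PRESS for 1, 2, 3 components (made-up values)
         5 2 6];    % subset 2
m = 4;              % total number of samples (assumed)

cumpress = sum(press, 1);       % sum over subsets for each component: [9 4 9]
rmsecv   = sqrt(cumpress / m);  % assumed definition: [1.5 1 1.5]

[~, best] = min(rmsecv);        % number of components with the lowest RMSECV
```

Here best is 2, since the second column has the smallest cumulative PRESS.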

When options.plots is not 'none', plots of both RMSECV and RMSEC are provided.

Examples

[press,cumpress] = crossval(x,y,'nip',{'loo'},10);
[press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
[press,cumpress] = crossval(x,y,'nip',{'con',5},10);
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
res = crossval(x,y,'sim',{'rnd',3,20},10);
pre = {preprocess('autoscale') preprocess('autoscale')};
opts.preprocessing = pre;
opts.plots = 'none';
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
res = crossval(x,y,'sim',{'rnd',3,20},10,opts);
[press,cumpress] = crossval(x,[],'pca',{'loo'},10);
[press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
res = crossval(x,[],'pca',{'con',5},10);

See Also

encodemethod, pca, pcr, pls, preprocess, ncrossval