Crossval
Revision as of 19:56, 2 September 2008
Purpose
Cross-validation for PCA, PLS, MLR, and PCR.
Synopsis
- results = crossval(x,y,rm,cvi,ncomp,options)
- [press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)
Description
CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and for principal components analysis (PCA). Inputs are the predictor variable matrix x, the predicted variable y (empty, [], for rm = 'pca'), the regression method rm, the cross-validation method cvi, and the maximum number of latent variables or components ncomp.
The regression method rm can be:
- 'pca' : PCA,
- 'mlr' : MLR,
- 'pcr' : PCR,
- 'nip' : PLS using NIPALS,
- 'sim' or 'pls' : PLS using SIMPLS,
- 'correlationpcr' : CorrelationPCR, and
- 'lwr' : Locally Weighted Regression (see LWRPRED).
The cross-validation method cvi can be 1) a cell containing one of the cross-validation methods below with the appropriate parameters {method splits iterations}, or 2) a vector representing user-defined cross-validation groups.
- loo : leave one out cross-validation (each sample left out on its own; does not take splits or iterations as inputs),
- vet : {splits} venetian blinds (every n-th sample together),
- con : {splits} contiguous blocks, and
- rnd : {splits iter} random subsets.
Except for leave-one-out, all methods require the number of data splits, splits, to be provided. Random data subsets ('rnd') also require the number of iterations, iter, which defines the number of replicate splits to perform. For 'con' and 'vet', iterations randomly move the starting point for the first (and subsequent) blocks. E.g. cvi = {'con' 5}; for 5 contiguous blocks (one iteration).
For user-defined cross-validation, cvi is a vector of integers with the same number of elements as x has rows (i.e. length(cvi) = size(x,1) when x is class "double", or length(cvi) = size(x.data,1) when x is class "dataset"), defining the test subsets. Each cvi(i) is defined as:
- cvi(i) = -2 : the sample is always in the test set,
- cvi(i) = -1 : the sample is always in the calibration set,
- cvi(i) = 0 : the sample is never used, and
- cvi(i) = 1,2,3... : defines each subset.
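As an illustration of the user-defined form, the following MATLAB sketch builds a cvi vector by hand. The sample count n, the choice of three subsets, and the particular samples flagged with special codes are hypothetical; the commented-out call assumes x and y are already in the workspace and follows the Synopsis.

```matlab
% Hypothetical data size: 12 samples (rows of x), 3 subsets
n = 12;
nsplits = 3;

% Assign samples to subsets 1..3 in an interleaved pattern
% (sample 1 -> subset 1, sample 2 -> subset 2, sample 3 -> subset 3,
%  sample 4 -> subset 1, ...)
cvi = mod(0:n-1, nsplits) + 1;

% Special codes from the description
cvi(1)  = -1;   % sample 1 is always in the calibration set
cvi(2)  = -2;   % sample 2 is always in the test set
cvi(12) =  0;   % sample 12 is never used

% cvi has one element per row of x
% res = crossval(x, y, 'sim', cvi, 5);   % assumes x (12 rows) and y exist
```

Building the vector explicitly like this is mainly useful when replicate measurements or batches must stay together in one subset, which the built-in methods cannot guarantee.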
Options
Optional input options is an options structure containing one or more of the following fields:
- display: [ 'off' | {'on'} ] Governs output to command window,
- plots: [ 'none' | {'final'} ] Governs plotting,
- preprocessing: {[1]} Controls preprocessing. Default is mean centering (1). Can be input in two ways:
- a) As a single value: 0 = none, 1 = mean centering, 2 = autoscaling, or
- b) As {xp yp}, a cell array containing preprocessing structures for the X- and Y-blocks (see PREPROCESS). To include preprocessing of each subset use pre = {xp yp}; or pre = {xp []}; for PCA. To avoid preprocessing of each subset use pre = {[] []}; or pre = 0 (zero).
- threshold: {[]} Alternative PLSDA threshold level (default = [] = automatic)
- prior: {[]} Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0-1) of observing a "1" for each column of y (i.e. each class). E.g. [.25 .50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the y-block. [] (Empty) = equal priors.
- structureoutput: [ {'no'} | 'yes' ] Governs output variables. 'Yes' returns a structure instead of individual variables. 'Yes' is default if only one output is requested.
- jackknife: [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jack-knifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
- rmsec: [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs)
- pcacvi: {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to cross-validation CVI options and include:
- {'loo'} leave one variable out at a time
- {'con' splits} contiguous blocks (total of splits groups)
- {'vet' splits} venetian blinds (every n'th variable), or
- {'rnd' splits} random subsets (note: no iterations)
- fastpca: [ 'off' | {'auto'} ] Governs use of "fast" PCA cross-validation algorithm. 'off' never uses the fast algorithm, 'auto' uses the fast algorithm when other options permit. Fast PCA can only be used with pcacvi set to 'loo'.
- lwr: Sub-structure of options to use for locally-weighted regression cross-validation. Most of these options are used as defined in the LWRPRED function (see LWRPRED for more details) but there are two additional options defined for cross-validation:
- lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
- lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
- In all cases, the number of samples in an individual test set will be the upper limit of samples to include in any LWR prediction.
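A minimal sketch of filling in an options structure with fields described in this section. Only fields named here are set; any field left out is assumed to take its listed default. The last lines work through the lwr.ptsperterm arithmetic (20 points per LV); taking the larger of that product and lwr.minimumpts is an assumption about how the two settings combine, not a statement of the toolbox's internals.

```matlab
opts = struct();
opts.display        = 'off';    % suppress command-window output
opts.plots          = 'none';   % no plots
opts.preprocessing  = 2;        % single-value form: 2 = autoscaling
opts.jackknife      = 'yes';    % store jackknifed regression vectors
opts.lwr.minimumpts = 20;       % minimum samples in any LWR sub-model
opts.lwr.ptsperterm = 20;       % samples per LV

% Points per LWR sub-model for 1..3 LVs
% (assumed rule: the larger of minimumpts and ptsperterm * number of LVs)
nlv  = 1:3;
npts = max(opts.lwr.minimumpts, opts.lwr.ptsperterm * nlv);

% res = crossval(x, y, 'lwr', {'con' 5}, 3, opts);   % assumes x and y exist
```

Under the same assumed rule, setting opts.lwr.ptsperterm = 0 gives npts equal to lwr.minimumpts for every model, which matches the behaviour described for a zero setting.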
Outputs
- press: predictive residual error sum of squares PRESS for each subset (subsets are rows of this matrix, number of components are columns)
- cumpress: cumulative PRESS (sum of columns of press).
- rmsecv: root mean square error of cross-validation.
- rmsec: root mean square error of calibration.
- cvpred: cross-validation y-predictions (regression methods only). If cross-validation method was random, this is the average prediction of all replicates.
- misclassed: fractional misclassifications for each class (valid for regression methods only, and only when y is a logical, i.e. discrete-value, vector).
- reg: jack-knifed regression vectors from each subset. This will be size [k*ny nx splits] such that reg(1,:,:) will be the regression vectors for the 1 component model of the first column of y for all subsets (a 1 by nx by splits matrix). Use squeeze to reduce to an nx by splits matrix. (Note: options.jackknife must be 'yes' to use reg.)
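The indexing of reg can be tried out on a placeholder array of the stated size. The dimensions below (k = 4 components, ny = 1, nx = 50 variables, 5 splits) and the zero-filled array are hypothetical stand-ins for a real reg output.

```matlab
k = 4; ny = 1; nx = 50; splits = 5;   % hypothetical sizes
reg = zeros(k*ny, nx, splits);        % stand-in for the reg output

b1 = reg(1,:,:);     % 1-component model, first y column: 1 x nx x splits
B1 = squeeze(b1);    % nx x splits matrix, one column per subset

disp(size(b1));
disp(size(B1));
```

With the columns of B1 lined up per subset, the spread of each regression coefficient across subsets can be inspected, for example with std(B1, 0, 2).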
If options.structureoutput is 'yes', a single output (results) will return all the above outputs as fields in a structure. If options.rmsec is 'no', then RMSEC is not returned (this provides faster iterative calculation). Note that for multivariate y the output press is grouped by output variable, i.e. all of the PRESS values for the first variable are followed by all of the PRESS values for the second variable, etc. When options.plots is not 'none', plots of both RMSECV and RMSEC are provided.
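The relationship between press and cumpress can be checked with a small made-up matrix. The numbers are invented for illustration, and the final line uses the textbook relation RMSECV = sqrt(PRESS / n) with a hypothetical sample count n; whether CROSSVAL applies exactly this normalization is an assumption.

```matlab
% Made-up PRESS values: 2 subsets (rows) x 3 components (columns)
press = [4 2 1;
         5 2 3];

% Cumulative PRESS: sum down each column, one value per number of components
cumpress = sum(press, 1);

n = 4;                          % hypothetical total number of samples
rmsecv = sqrt(cumpress / n);    % assumed textbook normalization
```

For these values cumpress is [9 4 4], so the sketch gives rmsecv = [1.5 1 1].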
Examples
- [press,cumpress] = crossval(x,y,'nip',{'loo'},10);
- [press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
- [press,cumpress] = crossval(x,y,'nip',{'con',5},10);
- [press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
- res = crossval(x,y,'sim',{'rnd',3,20},10);
- pre = {preprocess('autoscale') preprocess('autoscale')};
- opts.preprocessing = pre;
- opts.plots = 'none';
- [press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
- res = crossval(x,y,'sim',{'rnd',3,20},10,opts);
- [press,cumpress] = crossval(x,[],'pca',{'loo'},10);
- [press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
- res = crossval(x,[],'pca',{'con',5},10);
See Also
encodemethod, pca, pcr, pls, preprocess, ncrossval