Crossval

From Eigenvector Research Documentation Wiki
Revision as of 15:49, 22 February 2024 by Scott

Purpose

Cross-validation for PCA, PLS, PCR, MLR, ANN, CLS, KNN, LWR, SVM, PLSDA, SVMDA, NPLS, MPCA, or any regression vector based method.

Synopsis

results = crossval(x,y,rm,cvi,ncomp,options)
model = crossval(x,y,model,cvi,ncomp,options)
[press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)

The output is either a structure/model or a variable list, depending on the value of the 'structureoutput' option. Set structureoutput = 'no' to get the list output. With structureoutput = 'yes', the default, crossval returns a structure (or the updated model if the third input parameter is a model).

Note, using

model = crossval(x,y,model,cvi,ncomp,options)

can give the same RMSEC and RMSECV results as seen in the Analysis window, but only if the 'preprocessing' field of 'options' is set from the model's preprocessing. For PCA models it also requires setting the 'pcacvi' option to {'loo'} if there are 25 or fewer included variables, or to {'con', 10} otherwise. (During cross-validation for PCA, the 'pcacvi' option governs how the X-block data are broken into groups of variables. These groups are used to estimate X for each "test" group of variables in turn, thus estimating X for all variables. The residuals associated with this X estimate are used to calculate RMSECV for PCA models.) It is preferable to use the Evrimodel .crossvalidate method instead, as it handles these details:

model = model.crossvalidate(x,cvi)
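
If crossval is called directly for a PCA model rather than through .crossvalidate, the variable-count rule for 'pcacvi' described above could be set up roughly as follows. This is only a sketch: nvars is a hypothetical variable standing in for the number of included variables, and the options structure is built by hand rather than taken from the toolbox defaults.

```matlab
% Hypothetical stand-in for the number of included X-block variables.
nvars = 40;

% Rule quoted above: {'loo'} for 25 or fewer variables, {'con', 10} otherwise.
opts = struct();
if nvars <= 25
    opts.pcacvi = {'loo'};
else
    opts.pcacvi = {'con', 10};
end
```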

Description

CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and principal components analysis (PCA). Inputs are the M by N predictor variable matrix (x), the predicted variable (y) (empty, [ ], for rm = 'pca'), the regression method (rm), the cross-validation method (cvi), and the maximum number of latent variables / components (ncomp).

The third parameter can represent the regression method (rm), which can be any of the following:

rm = 'pca' performs cross-validation for PCA (see PCA)
rm = 'mlr' performs cross-validation for MLR
rm = 'pcr' performs cross-validation for PCR (see PCR)
rm = 'nip' performs cross-validation for PLS using NIPALS
rm = 'sim' or 'pls' performs cross-validation for PLS using SIMPLS (see PLS)
rm = 'correlationpcr' performs cross-validation for CorrelationPCR
rm = 'lwr' performs cross-validation for Locally Weighted Regression (see LWRPRED).
rm = 'cls' performs cross-validation for CLS

The third parameter can also be a previously built model. In this case crossval uses the model's regression method and returns the model with its cross-validation related fields updated (for example, model.detail.rmsecv). Note that crossval always uses the include fields from inputs x and y, and does not use the include fields used when building the model. The preprocessing is taken from the preprocessing field of the input options. If a model is passed in and there are 5 or fewer input parameters (that is, no options input), the preprocessing is taken from that used in the model.

The cross-validation method (cvi) can be

cvi = {method splits iterations}, a cell array containing one of the cross-validation methods (method) below with the appropriate parameters (splits) and (iterations). Here (splits) is the number of subsets to split the data into and (iterations) is the number of replicates to perform on each split. The replicate results are averaged before being used in the final results (RMSECV, cross-validation predictions, and misclassification rates).
cvi = a vector representing user-defined cross-validation groups.

The cross-validation method (method) can be any one of the following. For more information on choosing a cross-validation method, see Using Cross-Validation.

cvi = {'loo'} : leave one out cross-validation (each sample left out on its own; inputs (splits) and (iterations) are not used).
cvi = {'vet' (splits)} : venetian blinds cross-validation for (splits) subsets. Each subset includes every splits-th sample. For example, if splits = 3 the first subset uses rows 1:3:end, the second uses 2:3:end and the third subset uses 3:3:end.
cvi = {'vet' (splits) (blindsize)} randomly moves the starting point for the first (and subsequent) blocks. Here 'vet' means venetian blinds: the data are split into (splits) groups, leaving out (blindsize) adjacent samples at a time. When blindsize is 1 (one), this leaves out every n'th sample. (splits) defines the number of groups to split the data into and (blindsize) defines the number of samples to include in each blind.
For example, splits=4 and blindsize=2 means split the data into 4 groups, taking 2 adjacent samples for each group at a time. Several examples:
splits   blindsize   grouping (same #s left out together)
   4         1         1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4...
   4         2         1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4...
   2         5         1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2...
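
The groupings in the table above can be reproduced with a few lines of plain MATLAB. This is a minimal sketch for illustration only: vetgroups is a hypothetical helper, not a toolbox function, and it is not necessarily how crossval assigns groups internally.

```matlab
% Hypothetical helper (not part of the toolbox): venetian-blinds group
% labels for m samples, given the number of splits and the blind size.
vetgroups = @(m, splits, blindsize) mod(floor((0:m-1) / blindsize), splits) + 1;

g1 = vetgroups(16, 4, 1);   % every 4th sample shares a group
g2 = vetgroups(16, 4, 2);   % pairs of adjacent samples share a group
g3 = vetgroups(16, 2, 5);   % runs of 5 adjacent samples share a group

disp(g1); disp(g2); disp(g3);
```

Samples carrying the same label are left out together, so each row printed here corresponds to one row of the table.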
cvi = {'con' (splits)} contiguous blocks cross-validation for (splits) subsets.
cvi = {'con' (splits) (iter)} randomly moves the starting point for the first (and subsequent) blocks. E.g., cvi = {'con' 5}; for 5 contiguous blocks (one iteration).
cvi = {'rnd' (splits) (iter)} : random subset cross-validation for (splits) subsets and (iter) iterations [number of replicate splits to perform].
cvi = an M-element vector with integer elements allowing user-defined subsets. (cvi) is a vector with the same number of elements as x has rows, i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:
cvi(i) = -2 the sample is always in the test set,
cvi(i) = -1 the sample is always in the calibration set,
cvi(i) = 0 the sample is never used, and
cvi(i) = 1,2,3... defines each test subset.
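
As an illustration of the coding above, the following plain-MATLAB sketch lists which samples would fall in the test and calibration sets for each numbered subset. The 10-sample vector is hypothetical, and the loop reflects one reading of the rules (samples coded -2 join every test set, samples coded -1 join every calibration set); it does not call crossval and is not the toolbox's own implementation.

```matlab
% Hypothetical user-defined vector for a 10-sample data set.
cvi = [-1 -1 1 1 2 2 3 3 0 -2];

% Walk through each numbered test subset in turn.
subsets = unique(cvi(cvi > 0));
for k = subsets
    % Test set: subset k plus the samples coded -2 (always test).
    testidx = find(cvi == k | cvi == -2);
    % Calibration set: samples coded -1 plus all other numbered subsets.
    calidx = find(cvi == -1 | (cvi > 0 & cvi ~= k));
    fprintf('subset %d: test = [%s], cal = [%s]\n', ...
        k, num2str(testidx), num2str(calidx));
end
```

Sample 9, coded 0, appears in neither set for any subset.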

Options

Optional input options is an options structure containing the following fields:

  • display : [ 'off' | {'on'} ] Governs output to command window.
  • plots : [ 'none' | {'final'} ] Governs plotting.
  • waitbartrigger : [15] Governs display of waitbar. If a given crossval run is expected to take longer than this many seconds, a waitbar will be presented to the user. Set to "inf" to disable the waitbar entirely.
  • preprocessing : an integer or cell array used to control preprocessing. Default is mean centering: preprocessing = 1. (preprocessing) can be input in two ways:
preprocessing = [ integer ]
preprocessing = 0: uses no preprocessing
preprocessing = 1: uses mean centering {default}
preprocessing = 2: uses autoscaling.
preprocessing = {xp yp}, is a cell array containing a preprocessing structure(s) for the X- and Y-blocks respectively (see PREPROCESS). For example:
preprocessing = {xp [ ]} is used for PCA.
preprocessing = {xp yp} is used to preprocess the X- and Y-blocks respectively.
  • testx : [] Use to provide a separate validation X-block from which an RMSEP will be calculated for each number of components in the model. Must be supplied with option testy. Only functional for 2-way regression methods.
  • testy : [] Use to provide a separate validation Y-block for use with option testx (see above).
  • discrim : [ {'no'} | 'yes' ] Force cross-validation in "discriminant analysis" mode. Returns average misclassification rate and returns misclassed output. Also triggered by y being logical.
  • threshold : [ ] Alternative PLSDA threshold level {default = [ ] = automatic}.
  • prior : [ ] Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0 to 1) of observing a "1" for each column of y (i.e., each class). E.g., [0.25 0.50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the Y-block. [ ] (Empty) = equal priors.
  • structureoutput: [ {'no'} | 'yes' ] Governs output variables. 'Yes' returns a structure instead of individual variables. 'Yes' is default if only one output is requested.
  • jackknife: [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jack-knifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
  • rmsec: [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs)
  • permutation : [ {'no'} | 'yes' ] Performs permutation test instead of simple cross-validation. This calls the permutetest function with the same inputs as provided to cross-validation.
  • pcacvi: {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to cross-validation CVI options and include:
{'loo'} leave one variable out at a time
{'con' splits} contiguous blocks (total of splits groups)
{'vet' splits} venetian blinds (every n'th variable), or
{'rnd' splits} random subsets (note: no iterations)
  • fastpca: [ 'off' | {'auto'} ] Governs use of "fast" PCA Cross-validation algorithm. 'off' never uses fast algorithm, 'auto' uses fast algorithm when other options permit. Fast pca can only be used with pcacvi set to 'loo'
  • lwr: Sub-structure of options to use for locally-weighted regression cross-validation. Most of these options are used as defined in the LWRPRED function (see LWRPRED for more details) but there are two additional options defined for cross-validation:
    lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
    lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
    In all cases, the number of samples in an individual test set will be the upper limit of samples to include in any LWR prediction.
  • weights: [ 'hist' | [vector] ] governs sample weighting for PLS regression method ONLY. If set to the string 'hist', y-block histogram weighting is done on the samples. If set to a vector, the vector must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.
  • weightsvect: [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.
  • rmoptions: Sub-structure of regression method options to specify what options should be passed directly to the function specified by the regression method.

Outputs

  • press : predictive residual error sum of squares PRESS for each subset (subsets are rows of this matrix, number of components are columns). Note that for multivariate (y) the output (press) is grouped by output variable, i.e., all of the PRESS values for the first Y-variable are followed by all of the PRESS values for the second Y-variable, etc.
  • cumpress : cumulative PRESS (sum of columns of press).
  • rmsecv : root mean square error of cross-validation.
  • rmsec : root mean square error of calibration.
  • cvpred : cross-validation Y-predictions (regression methods only). If cross-validation method was random cvi = {'rnd' (splits) (iter)}, this is the average prediction of all replicates.
  • misclassed : fractional misclassifications for each class (valid for regression methods only, and only when input (y) is a class, i.e., logical or discrete-value, vector). Each cell of this array contains the fractional misclassification rates for a given class in the data. The columns of the matrix in a cell correspond to the number of latent variables in the model. The first row of the matrix is the false positive rate and the second row is the false negative rate.
  • reg : jack-knifed regression vectors from each subset. This will be size [ncomp*Ny by Nx by (splits)] such that reg(1,:,:) will be the regression vectors for the 1-component model of the first column of (y) for all subsets [a 1 by Nx by (splits) matrix]. Use SQUEEZE to reduce to an Nx by (splits) matrix. [note: options.jackknife must be 'yes' to use (reg)].
If y has more than one column (i.e., multivariate y) then the rows of reg are grouped by number of latent variables. The regression vectors for each y-column are given for the one-factor models (lv=1), followed by the regression vectors for each y-column for the two-factor models (lv=2), etc.:
1 LV, y-column 1
1 LV, y-column 2
...
2 LVs, y-column 1
2 LVs, y-column 2
...
ncomp LVs, y-column 1
ncomp LVs, y-column 2
...
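
The row ordering above implies a simple index calculation for pulling one set of jack-knifed regression vectors out of reg. The sketch below uses a zero-filled placeholder array with hypothetical sizes in place of a real crossval output; only the indexing and the use of SQUEEZE are the point.

```matlab
% Hypothetical sizes: 3 components, 2 y-columns, 5 x-variables, 4 subsets.
ncomp = 3; Ny = 2; Nx = 5; nsplits = 4;
reg = zeros(ncomp*Ny, Nx, nsplits);   % placeholder with the documented shape

lv = 2; ycol = 1;                     % want the 2-LV model, first y-column
row = (lv - 1)*Ny + ycol;             % rows are grouped by LV, then by y-column
b = squeeze(reg(row, :, :));          % Nx by nsplits matrix of regression vectors
```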

If options.structureoutput = 'yes', a single output (results) will return all the above outputs as fields in a structure array. If options.rmsec = 'no', then RMSEC is not returned (provides faster iterative calculation).

When (options.plots) is not 'none', both RMSECV and RMSEC are plotted.

Examples

[press,cumpress] = crossval(x,y,'nip',{'loo'},10);

[press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
[press,cumpress] = crossval(x,y,'nip',{'con',5},10);
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
res = crossval(x,y,'sim',{'rnd',3,20},10);

pre = {preprocess('autoscale') preprocess('autoscale')};
opts.preprocessing = pre;
opts.plots = 'none';
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
res = crossval(x,y,'sim',{'rnd',3,20},10,opts);

[press,cumpress] = crossval(x,[],'pca',{'loo'},10);
[press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
res = crossval(x,[],'pca',{'con',5},10);

See Also

Using_Cross-Validation, pca, pcr, pls, preprocess, encodemethod, Evrimodel, EVRIModel_Objects