Splitcaltest

Purpose

Splits data into calibration and test sets.

Synopsis

z = splitcaltest(model,options); %identifies model (calibration step)
z = splitcaltest(model,options,y); %for splitting using spxy method
Also available in the Analysis interface via the data context menu
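
A minimal usage sketch follows. It assumes a DataSet object x, a 3-component PCA model chosen purely for illustration, and the common PLS_Toolbox convention of retrieving a default options structure by calling the function with the keyword 'options'; variable names are illustrative, not part of the documented interface.

opts = splitcaltest('options');   % default options structure (assumed toolbox convention)
opts.fraction = 0.75;             % request ~75% of samples in the calibration set
model = pca(x, 3);                % any factor-based model supplies the scores
z = splitcaltest(model, opts);    % z.class labels each sample as calibration or test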

Description

The split is based on the scores from the input model. If a matrix or DataSet is passed in place of a model, it is assumed to contain the scores for the data. Randomization is used in the splitting process, so no assumption about the data acquisition order is necessary. The usereplicates option can be used to keep replicated samples together during the split.

If the usereplicates option is enabled and the repidclass option indicates which sample classset identifies replicated samples, the split will not separate replicates from each other. Replicates are first combined using classcenter before splitcaltest is applied to the class-centered data. Replicates contribute to the class-centered result only if they were not excluded in the input dataset or model. The results of splitting these combined samples are then mapped back to the original replicates, so replicates are never separated in the resulting calibration and test sets. (For more information, see: https://eigenvector.com/wp-content/uploads/2020/01/Onion_SampleSelection.pdf)
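
As a hedged illustration of the replicate handling described above, the sketch below passes a matrix (or DataSet) of scores directly and enables the usereplicates option. The variable name scores is illustrative, and the defaults call via 'options' is an assumed toolbox convention.

opts = splitcaltest('options');   % assumed defaults call
opts.usereplicates = 1;           % combine replicates via classcenter before splitting
opts.repidclass = 1;              % X-block sample classset that identifies replicates
z = splitcaltest(scores, opts);   % scores = matrix or DataSet of scores (illustrative name)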

Inputs

  • model = standard model structure from a factor-based model OR a double or DataSet object containing the scores to analyze.

Outputs

  • z = a structure containing the class assignments and the classlookup table; one way to apply them is sketched below.
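
The sketch below is one hedged way to apply the returned assignment to the original data. The exact numeric class values and set names are defined by z.classlookup, so the assumption that its first row corresponds to the calibration set should be checked before use.

z = splitcaltest(model);
disp(z.classlookup)                        % maps numeric class values to set names
calidx = (z.class == z.classlookup{1,1});  % assumed: first classlookup row is the calibration set
xcal = x(calidx,:);                        % calibration subset of the original data
xtest = x(~calidx,:);                      % remaining samples form the test set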

Options

options = structure array with the following fields:

  • plots: [ 'none' | {'final'} ] Governs level of plotting
  • algorithm: [ {'kennardstone'} | 'reducennsamples' | 'onion' | 'duplex' | 'spxy' | 'random' ] Algorithm used to select calibration samples.
'kennardstone' selects a fraction (options.fraction) of the samples uniformly, starting on the exterior of the data space, using the Kennard-Stone method; see kennardstone.
RW Kennard, LA Stone (1969): Computer Aided Design of Experiments, Technometrics, 11:1, 137-148.
'reducennsamples' selects a subset of samples by removing nearest neighbors, see reducennsamples. Results are similar to Kennard-Stone.
JS Shenk, MO Westerhaus, Crop Sci., 1991, 31, 469; JS Shenk, MO Westerhaus, Crop Sci., 1991, 31, 1548.
'onion' selects samples on the exterior of the data space (see distslct); uses the nonion, loopfraction, and fraction options (see the sketch after this list).
'duplex' see duplex
'spxy' see spxy
'random' see randomsplit
  • nonion: [ {3} ] the number of 'external layers' to select. A layer consists of a cal and a test set.
The first cal set consists of the (loopfraction*fraction*M) samples furthest apart on the exterior of the data space. Once nonion layers have been assigned, the remainder (interior samples) are split randomly between cal and test sets.
  • loopfraction: [{0.1}] onion: fraction of unassigned samples assigned per onion layer.
  • fraction: [ {0.66} ] fraction of data to be set as calibration samples.
  • usereplicates: [{0} | 1] Keep replicates together (1) or not (0).
  • repidclass: [{1}] The X-block classset used to identify sample replicates
  • distmeasure: [{'euclidean'} | 'mahalanobis'] Defines the type of distance measure to use for the onion method. 'euclidean' uses simple, non-scaled distance. 'mahalanobis' scales each direction by the covariance matrix (correcting for unusually small or large directions). Either method can be abbreviated to its first letter, 'e' or 'm'.
  • nnt_maxdistance: [{inf}] reducennsamples: Maximum allowed closest distance between samples. Sample thinning stops if the two closest samples are further away than this value.
  • nnt_maxsamples: [{5000}] reducennsamples: Maximum number of samples which can be passed for down-sampling. More than this number will throw an error.
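
For example, the onion algorithm and its related options might be configured as in the sketch below. The defaults call via 'options' is again an assumed toolbox convention, and the specific option values are illustrative only.

opts = splitcaltest('options');     % assumed defaults call
opts.algorithm = 'onion';           % select exterior 'onion' layers first
opts.nonion = 4;                    % number of exterior cal/test layers
opts.loopfraction = 0.1;            % fraction of unassigned samples assigned per layer
opts.distmeasure = 'mahalanobis';   % covariance-scaled distances ('m' also accepted)
z = splitcaltest(model, opts);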

See Also

distslct, reducennsamples, crossval, pca, pcr, preprocess, classcenter, duplex, spxy, randomsplit