Reducennsamples
Purpose
Selects a subset of samples by removing nearest neighbors.
Synopsis
- [sc,incl] = reducennsamples(model,minsamples,options);
- [sc,incl] = reducennsamples(model,newdata,minsamples,options);
Description
(Also available in the Analysis interface via the data context menu)
Performs a selection of samples that fill out the multivariate space by removing ("thinning out") samples that are similar to each other, based on nearest-neighbor distance. The algorithm is useful for selecting the minimum number of samples needed to define a subspace, which reduces the number of reference measurements required or the amount of data that must be stored.
Initially, the nearest neighbor of each sample is found, along with the distance between the two. Of the two closest samples, one is excluded from the data and the distances are recalculated. This process is repeated until either the smallest distance between the remaining samples exceeds an upper limit (see maxdistance below) or the number of samples reaches a lower limit (see minsamples below).
The source of data can be either a factor-based model (PCA, PLS, PCR, etc.) that contains scores for all samples, or a raw data matrix or DataSet, in which case distances are calculated in raw variable space.
The algorithm is based on work published in:
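The thinning loop described above can be sketched in base MATLAB as follows. This is an illustrative sketch only, not the toolbox implementation: the example data, the use of Euclidean distance, and the rule for which sample of the closest pair is dropped (here, the one with the higher index) are all assumptions, since the text does not specify them.

```matlab
rng(0);                                   % reproducible example data
x = [randn(20,2); randn(20,2)*0.05 + 3];  % hypothetical data with one tight cluster
minsamples  = 10;    % lower limit on the number of retained samples
maxdistance = inf;   % stop early once the closest pair is farther apart than this

keep = 1:size(x,1);  % indices of samples still retained
while numel(keep) > minsamples
    xs = x(keep,:);
    % Squared Euclidean distances between all retained samples
    g  = xs*xs';
    d2 = bsxfun(@plus, diag(g), diag(g)') - 2*g;
    d2(logical(eye(numel(keep)))) = inf;  % ignore self-distances
    [dmin, idx] = min(d2(:));             % closest pair overall
    if sqrt(max(dmin,0)) > maxdistance
        break                             % remaining samples are far enough apart
    end
    [i, j] = ind2sub(size(d2), idx);
    keep(max(i,j)) = [];                  % assumption: drop the later sample of the pair
end
```

With maxdistance set to inf, the loop runs until exactly minsamples samples remain in keep. Distances are recomputed from scratch on every pass, which keeps the sketch short at the cost of speed.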
- J.S. Shenk, M.O. Westerhaus, Crop Sci., 1991, 31, 469,
- J.S. Shenk, M.O. Westerhaus, Crop Sci., 1991, 31, 1548.
Inputs
- model: Standard model structure, double array, or DataSet object containing the data to select from,
- newdata: Additional data to be considered for addition to the data provided by the model input. When provided, all model samples are used and newdata is examined for samples that fill in empty regions of the model space. Under these conditions, minsamples is taken as the number of additional samples to select beyond those included in model (see minsamples below),
- minsamples: Minimum number of samples to retain. Sample thinning stops when the number of retained samples reaches this value. If omitted, 4 times the number of factors in the model or 1/2 the number of samples (whichever is smaller) is used.
- options: Structure array with the fields described under Options below.
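As a small worked example of the default described for minsamples, the rule can be written as below. The factor and sample counts are hypothetical, and rounding half the sample count down with floor is an assumption; the text does not say how an odd number of samples is handled.

```matlab
nfactors = 3;    % hypothetical number of factors in the model
nsamples = 50;   % hypothetical number of samples in the model

% Smaller of 4 x (number of factors) and half the number of samples
minsamples = min(4*nfactors, floor(nsamples/2));
```

Here 4 x 3 = 12 is smaller than 50/2 = 25, so minsamples is 12.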
Outputs
- sc = DataSet object containing either the scores (if a model was supplied) or the data supplied. Samples selected are included. Thinned samples are excluded.
- incl = Indices of retained samples (samples not thinned as redundant).
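A minimal usage sketch of the first calling form, assuming PLS_Toolbox is on the MATLAB path. The data matrix and the minsamples value are made up for illustration, and the options structure is built by hand from the fields documented under Options; whether the function expects any further fields is not covered here.

```matlab
x = rand(200,6);              % hypothetical raw data: 200 samples x 6 variables

options = struct;             % fields as listed under Options
options.maxdistance = inf;    % thin until minsamples is reached
options.maxsamples  = 5000;
options.mustuse     = [];
options.waitbar     = 'no';

% Raw data is passed in place of a model, so distances are in raw variable space
[sc,incl] = reducennsamples(x,50,options);

xsel = x(incl,:);             % rows of the original data that were retained
```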
Options
options is a structure array with the following fields:
- maxdistance: [inf] Maximum allowed closest distance between samples. Sample thinning stops if the two closest samples are farther apart than this value. If "inf", thinning continues until the number of samples given in minsamples is reached. If empty, the nearest-neighbor distances are calculated for the initial set and 1/2 of the maximum observed distance is used,
- maxsamples: [5000] Maximum number of samples that can be passed for down-sampling. Passing more samples than this throws an error,
- mustuse: [] Indices of samples which must be used,
- waitbar: [ 'no' | {'yes'} ] indicates whether a waitbar can be shown.
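The rule applied when maxdistance is empty can be sketched in base MATLAB as follows. This is one reading of the description rather than the toolbox code: it assumes Euclidean distance and takes "maximum observed distance" to mean the largest of the per-sample nearest-neighbor distances. The data are hypothetical.

```matlab
rng(1);
x = randn(30,3);                              % hypothetical data: 30 samples x 3 variables

g  = x*x';
d2 = bsxfun(@plus, diag(g), diag(g)') - 2*g;  % squared Euclidean distances
d2(logical(eye(size(x,1)))) = inf;            % exclude self-distances

nn = sqrt(max(min(d2,[],2), 0));              % nearest-neighbor distance of each sample
maxdistance = max(nn)/2;                      % half of the largest nearest-neighbor distance
```

Under this reading, thinning stops once the closest remaining pair is separated by more than half the gap of the most isolated sample in the initial set.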
See Also
distslct doptimal knnscoredistance stdgen stdsslct splitcaltest