Glsw
Revision as of 11:53, 10 April 2013
Purpose
Calculate or apply Generalized Least Squares weighting (GLSW), External Parameter Orthogonalization (EPO), and Extended Mixture Model (EMM) filters. See also GLSW_Settings_GUI.
Synopsis
- modl = glsw(x,a); %GLS on matrix
- modl = glsw(x1,x2,a); %GLS between two data sets
- modl = glsw(x,y,a); %GLS on matrix in groups based on y
- modl = glsw(modl,a); %Update model to use a new value
- xt = glsw(newx,modl,options); %apply correction
- xt = glsw(newx,modl,a); %apply correction
Description
This filter uses a Generalized Least Squares (GLS) based weighting strategy to down-weight features identified from the singular value decomposition of a clutter data matrix. Clutter is context dependent and the cases are described in detail below.
If the singular value decomposition (SVD) of the input matrix x is $X = USV^{T}$, then the deweighting matrix is estimated with the following pseudo-inverse:
- $W = U\,\mathrm{diag}\left(\sqrt{1/(\mathrm{diag}(S)/a^{2}+1)}\right)V^{T} = U S_{inv} V^{T}$
where $S_{inv}$ corresponds to a regularized inverse of the singular values. The adjustable parameter a is a regularization parameter used to scale the singular values before their inverse is calculated. As a gets larger, the extent of deweighting decreases (because $S_{inv}$ approaches 1). As a gets smaller (e.g., 0.1 decreasing to 0.001), the extent of deweighting increases (because $S_{inv}$ approaches 0) and the deweighting includes increasing amounts of the directions represented by smaller singular values. A good initial guess for a is $1\times10^{-2}$, but the best value will vary depending on the covariance structure of X and the specific application. It is recommended that a number of different values be investigated, using an external cross-validation metric to evaluate performance.
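The effect of a on the weights can be checked numerically. The following is a minimal plain-Python sketch that evaluates only the scalar weighting term of the formula above for a list of singular values. The function name `glsw_weights` and the example singular values are illustrative assumptions, not part of the toolbox.

```python
import math

def glsw_weights(singular_values, a):
    """Return sqrt(1 / (s / a**2 + 1)) for each singular value s.

    Hypothetical helper: it evaluates only the diagonal weighting term
    of the formula above, not the full filter matrix W.
    """
    if a <= 0:
        raise ValueError("a must be positive for GLSW weighting")
    return [math.sqrt(1.0 / (s / a**2 + 1.0)) for s in singular_values]

# Made-up singular values of a clutter matrix, largest first.
s = [4.0, 0.25, 0.0001]

# Print the weights for a large, a moderate, and a small value of a.
for a in (10.0, 0.1, 0.001):
    print(a, [round(w, 4) for w in glsw_weights(s, a)])
```

For the largest a, every weight stays close to 1 (little deweighting). For the smallest a, even the direction with the smallest singular value is strongly down-weighted.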
For more information see:
- H. Martens, M. Høy, B.M. Wise, R. Bro and P.B. Brockhoff, "Pre-whitening of data by covariance-weighted pre-processing," J. Chemom., 17(3), 153-165, 2003.
This function will also perform EPO (External Parameter Orthogonalization), which is GLSW with a filter built from a specific number of singular vectors rather than the weighting scheme described above, and EMM (Extended Mixture Modeling) filtering, which is EPO orthogonalizing to all available singular vectors. To perform EPO, a negative integer is supplied in place of (a), where (-a) specifies the number of singular vectors to include in the filter. This is GLSW with a square-wave function for the deweighting, i.e., the first (-a) singular values of $S_{inv}$ are set to zero and the remaining singular values are set to 1. To perform EMM, negative infinity (-inf) is supplied in place of (a).
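The square-wave weighting described above can be sketched in a few lines of plain Python. The helper name `epo_weights` is hypothetical. It returns one weight per available singular value and is meant only to illustrate the rule stated in the text, not the toolbox's implementation.

```python
import math

def epo_weights(n_values, a):
    """Square-wave weights for EPO (a = negative integer) or EMM (a = -inf).

    Hypothetical helper: returns one weight per singular value, with 0 for
    directions that are removed and 1 for directions that are kept.
    """
    if a >= 0:
        raise ValueError("EPO/EMM expects a negative value in place of a")
    if math.isinf(a):
        n_removed = n_values                 # EMM: remove every available direction
    else:
        n_removed = min(int(-a), n_values)   # EPO: remove the first -a directions
    return [0.0] * n_removed + [1.0] * (n_values - n_removed)

print(epo_weights(5, -2))         # EPO with two singular vectors
print(epo_weights(5, -math.inf))  # EMM
```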
Finally, an alternative method to use GLSW is in quantitative analysis where a continuous y-variable is used to develop pseudo-groupings of samples in X by comparing the differences in the corresponding y values. This is referred to as the "gradient method" because it utilizes a gradient of the sorted X- and y-blocks to calculate a covariance matrix. For more information on this method, see the Advanced Preprocessing: Multivariate Filtering page or the paper:
- B. M. Zorzetti, J. M. Shaver, J. J. Harynuk, "Estimation of the age of a weathered mixture of volatile organic compounds" Analytica Chimica Acta, 694, 31–37, 2011.
For calibration of the GLSW model modl, inputs can be provided by one of four methods:
1) modl = glsw(x,a)
- x = a clutter data or covariance matrix containing features to be downweighted, and
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
- Note: If x is a dataset with classes, differences within each class are used for down-weighting (i.e., intra-class variance is considered clutter). This reduces intra-class variation but ignores the inter-class variation. Only classes with class numbers >0 are included in the clutter calculation (see DataSet object for more information).
2) modl = glsw(x1,x2,a)
- x1 = an M by N data matrix and
- x2 = an M by N data matrix.
- The clutter is defined as x = x1-x2, the row-by-row differences between x1 and x2. The input data represent two or more measured populations which should otherwise be the same (e.g., the same samples measured on two different analyzers or using different solvents).
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
3) modl = glsw(x,y,a)
- x = an M by N data matrix,
- y = column vector of integers with M rows specifying sample groups in x within which differences should be downweighted.
- Note: This method is identical to method (1) when classes of the X-block are used to identify groups. The only difference is that the groups are identified from the separate input y instead of the dataset classes. If y is empty, this defaults to method (1) without class information where x is then defined as the clutter data matrix.
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
4) modl = glsw(x,y,a)
- x = an M by N data matrix,
- y = column vector with M rows specifying a y-block continuous variable. In this input, the "gradient method" is used to identify similar samples and downweight differences between them. See also the gradientthreshold option below.
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
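How the clutter matrix is formed in methods (2) and (1)/(3) can be sketched in plain Python using nested lists as matrices. Both function names are hypothetical. The within-group version assumes that each group's mean is removed first (the behaviour described for the meancenter option set to 'yes'). Skipping labels of 0 or less follows the note given for dataset classes; applying the same rule to a separate y input is an assumption.

```python
def paired_difference_clutter(x1, x2):
    """Method 2: clutter rows are x1[i] - x2[i] for matching samples."""
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must have the same number of rows")
    return [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(x1, x2)]

def within_group_clutter(x, groups):
    """Methods 1 and 3: subtract each group's mean from its own rows.

    Rows whose group label is 0 or less are skipped (assumption, see text).
    """
    members = {}
    for row, g in zip(x, groups):
        if g > 0:
            members.setdefault(g, []).append(row)
    clutter = []
    for g in sorted(members):
        rows = members[g]
        # Column-wise mean of the rows belonging to this group.
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        clutter.extend([[v - m for v, m in zip(row, mean)] for row in rows])
    return clutter

# Two samples measured on two instruments (made-up numbers).
print(paired_difference_clutter([[1, 2], [3, 4]], [[0.5, 2], [1, 1]]))

# Four samples in groups 1, 1, 2 and 0 (group 0 is ignored).
print(within_group_clutter([[1, 1], [3, 3], [10, 0], [5, 5]], [1, 1, 2, 0]))
```

In both cases the resulting rows contain only the variation that is to be treated as clutter. The singular value decomposition and weighting are then applied to that matrix.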
The input a can be replaced with an options structure (see Options below).
When applying a GLSW model the inputs are newx, the X-block to be deweighted, and modl, a GLSW model structure.
Outputs are modl, a GLSW model structure, and xt, the deweighted X-block.
Options
An options structure can be used in place of (a) for any call or as the third input in an apply call. This structure consists of any of the fields:
- a: [ 0.02 ] scalar parameter limiting downweighting {default = 1e-2},
- meancenter: [ 'no' | {'yes'} ] For single x-block modes only: governs the calculation of a mean of each group of data before calculating the covariance. If set to no, the filter will include the offset of each group. This is equivalent to saying the offset in the data is part of the clutter which should be removed.
- applymean: [ 'no' | {'yes'} ] governs the use of the mean difference calculated between two instruments (difference between two instruments mode). When applying a GLS filter to data collected on the x1 instrument, the mean should NOT be applied. Data collected on the SECOND instrument should have the mean applied.
- gradientthreshold: [ .25 ] "continuous variable" threshold fraction above which the column gradient method will be used with a continuous y. Usually, when (y) is supplied, it is assumed to be the identification of discrete groups of samples. However, when calibrating, the number of samples in each "group" is calculated and the fraction of samples in "singleton" groups (i.e., in their own group) is determined.
- fraction = (# Samples in Singleton Groups) / Total Samples
- If this fraction is above the value specified by this option, (y) is considered a continuous variable (such as a concentration or other property to predict). In these cases, the "sample similarity" (a.k.a. "column gradient") method of calculating the covariance matrix will be used. The sample similarity method determines the down-weighting required based mostly on samples which are the most similar (on the specified y-scale). Set to >=1 to disable and to 0 (zero) to always use.
- xgradient: [ {'no'} | 'yes' ] apply an x-block gradient filter before calculating the filter. This filter performs a derivative down the columns of the x-block accentuating differences between adjacent samples. For example: When samples are sorted by time, this creates a GLSW filter that down-weights differences in the short time scale while retaining long-scale differences.
- xgradwindow: [ 3 ] number of samples over which the xgradient should be taken (see xgradient option)
- maxpcs: [ 50 ] maximum number of components (factors) to allow in the GLSW model. Typically, the number of factors included in a model will be the smallest of this number, the number of variables, or the number of samples. Having a limit set here is useful when deriving a GLSW model from a large number of samples and variables. Often, a GLSW model effectively uses fewer than 20 components. Thus, this option can be used to keep the GLSW model smaller in size. It may, however, decrease its effectiveness if critical factors are not included in the model.
- classset: [ 1 ] indicates which class set in x to use when no y-block is provided.
- maxperclass: [inf] indicates the maximum number of samples from each class that should be used to calculate the filter when class-based filtering is being done. When < inf, only the first "k" samples from each class are used to calculate the filter.
- downweight: [ 'no' | {'yes'} ] governs whether the filter will downweight identified features or upweight them. Normally, "clutter" is identified and downweighted by a GLSW filter. However, GLSW filters can also be supplied with features that are of interest (signal), and this flag can be reversed, causing GLSW to "upweight" these signal features.
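The singleton-fraction test described for the gradientthreshold option can be illustrated with a short plain-Python sketch. The function names are hypothetical, and the handling of the two boundary settings (>=1 to disable, 0 to always use) simply encodes the wording of the option description. It is not taken from the toolbox source.

```python
from collections import Counter

def singleton_fraction(y):
    """Fraction of samples whose y value occurs exactly once."""
    if not y:
        raise ValueError("y must contain at least one sample")
    counts = Counter(y)
    singles = sum(1 for value in y if counts[value] == 1)
    return singles / len(y)

def uses_gradient_method(y, gradientthreshold=0.25):
    """True when y would be treated as continuous under the stated rule."""
    if gradientthreshold >= 1:
        return False   # documented as "disable"
    if gradientthreshold == 0:
        return True    # documented as "always use"
    return singleton_fraction(y) > gradientthreshold

# Class-like y: every value is shared, so no sample is a singleton.
print(uses_gradient_method([1, 1, 2, 2]))

# Concentration-like y: half of the samples have a unique value.
print(uses_gradient_method([0.1, 0.2, 0.3, 0.3]))
```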