Glsw

From Eigenvector Research Documentation Wiki
Revision as of 13:08, 12 August 2009 by Jeremy

Purpose

Calculate or apply a Generalized Least Squares (GLS) weighting filter. Also performs External Parameter Orthogonalization (EPO) preprocessing.

Synopsis

modl = glsw(x,a); %GLS on matrix
modl = glsw(x1,x2,a); %GLS between two data sets
modl = glsw(x,y,a); %GLS on matrix in groups based on y
modl = glsw(modl,a); %Update model to use a new value
xt = glsw(newx,modl,options); %apply correction
xt = glsw(newx,modl,a); %apply correction

Description

This filter uses Generalized Least Squares (GLS) to down-weight features identified from the singular value decomposition of a data matrix. The input data usually represent two or more measured populations that should otherwise be the same (e.g., the same samples measured on two different analyzers or using different solvents) and can be supplied in one of several forms, as explained below. In all cases, the downweighting is derived from the eigenvectors and eigenvalues of the differences.

If the singular value decomposition (SVD) of the input matrix x is $X = USV^T$, then the deweighting matrix is estimated with the following pseudo-inverse:

$$W = U \, \mathrm{diag}\!\left(\sqrt{\frac{1}{\mathrm{diag}(S)/a^2 + 1}}\right) V^T$$

where the center term defines $S_{inv}$. The adjustable parameter a is used to scale the singular values prior to calculating their inverse. As a gets larger, the extent of deweighting decreases (because $S_{inv}$ approaches 1). As a gets smaller (e.g., 0.1 to 0.001), the extent of deweighting increases (because $S_{inv}$ approaches 0) and the deweighting includes increasing amounts of the directions represented by smaller singular values.
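
To see how a controls the weighting, the center term can be evaluated directly for a few singular values. The following is a minimal Python sketch of that one formula, not PLS_Toolbox code; the function name `glsw_weights` and the example singular values are made up for illustration.

```python
import math

def glsw_weights(singular_values, a):
    """Return sqrt(1 / (s / a**2 + 1)) for each singular value s.

    This is only the center (diagonal) term of the weighting formula;
    the function name is hypothetical.
    """
    return [math.sqrt(1.0 / (s / a**2 + 1.0)) for s in singular_values]

# Illustrative singular values, largest first (not from real data).
s = [10.0, 1.0, 0.01]

for a in (10.0, 1.0, 0.1, 0.01):
    weights = glsw_weights(s, a)
    print(f"a = {a:<5} weights = {[round(w, 3) for w in weights]}")
```

With a large a, every weight is close to 1 (little deweighting). As a shrinks, the weights for the largest singular values fall toward 0 first, and the smaller singular values are progressively affected as well.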

A good initial guess for a is $1 \times 10^{-2}$, but the best value will vary depending on the covariance structure of X and the specific application. It is recommended that a number of different values be investigated using some external cross-validated metric for performance evaluation.
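
One way to organize that search is a log-spaced grid of candidate values, scored by whatever cross-validated metric suits the application. The Python sketch below shows only the bookkeeping; `candidate_a_values`, `pick_best_a`, and the toy scoring function are hypothetical, and in practice the score would come from cross-validating a model built on the filtered data.

```python
def candidate_a_values(start_exp=0, stop_exp=-4, per_decade=2):
    """Log-spaced candidates from 10**start_exp down to 10**stop_exp."""
    n_steps = (start_exp - stop_exp) * per_decade
    return [10.0 ** (start_exp - i / per_decade) for i in range(n_steps + 1)]

def pick_best_a(candidates, score):
    """Return the candidate with the lowest score (lower is better)."""
    return min(candidates, key=score)

# Toy stand-in for a cross-validated error; replace with a real metric.
def toy_score(a):
    return abs(a - 0.01)

grid = candidate_a_values()
best = pick_best_a(grid, toy_score)
print("candidates:", grid)
print("selected a:", best)
```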

This function will also perform EPO (External Parameter Orthogonalization) which is GLSW with a filter built from a specific number of singular vectors rather than the weighting scheme described above. To perform EPO, a negative integer is supplied in place of (a) where -a specifies the number of singular vectors to include in the filter. This is GLSW with a square-wave function for the deweighting.
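
The difference between the two weighting schemes can be illustrated on the diagonal weights alone. This Python sketch assumes that EPO fully removes the first n singular directions (weight 0) and leaves the rest untouched (weight 1); `epo_weights` and `glsw_weights` are hypothetical helpers, not PLS_Toolbox functions, and the singular values are made up.

```python
import math

def glsw_weights(singular_values, a):
    """Smooth GLSW weights: sqrt(1 / (s / a**2 + 1)) per singular value."""
    return [math.sqrt(1.0 / (s / a**2 + 1.0)) for s in singular_values]

def epo_weights(n_values, n_removed):
    """Square-wave weights: 0 for the first n_removed directions, 1 after.

    Assumed behaviour, for illustration only.
    """
    return [0.0 if i < n_removed else 1.0 for i in range(n_values)]

# Illustrative singular values, largest first.
s = [10.0, 1.0, 0.1, 0.01]

print("GLSW weights, a = 0.1:", [round(w, 3) for w in glsw_weights(s, 0.1)])
# Two singular vectors in the filter, i.e. supplying -2 in place of a.
print("EPO weights, 2 vectors:", epo_weights(len(s), 2))
```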

Finally, GLSW can also be used in quantitative analysis, where a continuous y-variable is used to develop pseudo-groupings of samples in X by comparing the differences in the corresponding y values. This is referred to as the "gradient method" because it utilizes a gradient of the sorted X and y blocks to calculate a covariance matrix. For more information on this method, see the chapter discussing Preprocessing in the PLS_Toolbox Manual.

For calibration, inputs can be provided by one of four methods:

1)

x = data matrix containing features to be downweighted, and
a = scalar parameter limiting downweighting {default = 1e-2}.
Note: If x is a dataset with classes, the differences within each class will be downweighted rather than the entire matrix. This reduces the within-class variation while ignoring the between-class variation.

2)

x1 = an M by N data matrix and
x2 = an M by N data matrix.
The row-by-row differences between x1 and x2 will be used to estimate the downweighting.
a = scalar parameter limiting downweighting {default = 1e-2}.

3)

x = an M by N data matrix,
y = column vector with M rows which specifies sample groups in x within which differences should be downweighted. Note that this method is identical to method (1) when classes of the X block are used to identify groups. The only difference is that these groupings are passed as a separate input. In fact, if y is empty, this defaults to method (1) above.
a = scalar parameter limiting downweighting {default = 1e-2}.

4)

x = an M by N data matrix,
y = column vector with M rows specifying a y-block continuous variable. In this input, the "gradient method" is used to identify similar samples and downweight differences between them. See also the gradientthreshold option below.
a = scalar parameter limiting downweighting {default = 1e-2}.

The input a can be replaced with an options structure (see Options below).
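
For method (2), the quantity the filter is built from is the set of row-by-row differences between paired measurements. The following is a minimal Python sketch of that step, using nested lists in place of matrices. The names `row_differences` and `column_means` are hypothetical, the data is made up, and the sign convention (x1 minus x2) is an assumption.

```python
def row_differences(x1, x2):
    """Element-wise x1 - x2 for two M by N matrices given as nested lists."""
    if len(x1) != len(x2) or any(len(r1) != len(r2) for r1, r2 in zip(x1, x2)):
        raise ValueError("x1 and x2 must both be M by N")
    return [[v1 - v2 for v1, v2 in zip(r1, r2)] for r1, r2 in zip(x1, x2)]

def column_means(rows):
    """Mean of each column of a nested-list matrix."""
    m = len(rows)
    return [sum(col) / m for col in zip(*rows)]

# The same two samples measured on two instruments (illustrative values).
x1 = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
x2 = [[1.5, 2.0, 2.0], [2.5, 4.0, 5.0]]

d = row_differences(x1, x2)
mean_d = column_means(d)
print("differences:", d)
print("mean difference:", mean_d)
```

The column means of these differences correspond to the mean difference between the two instruments, which is the quantity the applymean option governs when the filter is applied.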

When applying a GLSW model, the inputs are newx, the x-block to be deweighted, and modl, a GLSW model structure.

Outputs are modl, a GLSW model structure, and xt, the deweighted x-block.

Options

An options structure can be used in place of (a) for any call or as the third input in an apply call. This structure consists of any of the following fields:

  • a: [ 0.02 ] scalar parameter limiting downweighting {default = 1e-2},
  • applymean: [ 'no' | {'yes'} ] governs the use of the mean difference calculated between two instruments (difference between two instruments mode). When applying a GLS filter to data collected on the first (x1) instrument, the mean should NOT be applied. Data collected on the SECOND instrument should have the mean applied.
  • gradientthreshold: [ .25 ] "continuous variable" threshold fraction above which the column gradient method will be used with a continuous y. Usually, when (y) is supplied, it is assumed to be the identification of discrete groups of samples. However, when calibrating, the number of samples in each "group" is calculated and the fraction of samples in "singleton" groups (i.e., in their own group) is determined.
fraction = (# Samples in Singleton Groups) / Total Samples
If this fraction is above the value specified by this option, (y) is considered a continuous variable (such as a concentration or other property to predict). In these cases, the "sample similarity" (a.k.a. "column gradient") method of calculating the covariance matrix will be used. The sample similarity method determines the downweighting required based mostly on the samples which are the most similar (on the specified y-scale). Set this option to >=1 to disable the method and to 0 (zero) to always use it.
  • maxpcs: [ 50 ] maximum number of components (factors) to allow in the GLSW model. Typically, the number of factors included in a model will be the smallest of this number, the number of variables, or the number of samples. Having a limit set here is useful when deriving a GLSW model from a large number of samples and variables. Often, a GLSW model effectively uses fewer than 20 components. Thus, this option can be used to keep the GLSW model smaller in size. It may, however, decrease its effectiveness if critical factors are not included in the model.
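
The singleton-fraction test described under gradientthreshold can be sketched in a few lines. This is a Python illustration of the stated rule only, not PLS_Toolbox code; `singleton_fraction` and `treat_as_continuous` are hypothetical names, the example data is made up, and the strict "above" comparison is taken from the text (how the toolbox handles a threshold of exactly 0 is not covered here).

```python
from collections import Counter

def singleton_fraction(y):
    """Fraction of samples whose y value occurs exactly once."""
    counts = Counter(y)
    singletons = sum(1 for value in y if counts[value] == 1)
    return singletons / len(y)

def treat_as_continuous(y, gradientthreshold=0.25):
    """True when the singleton fraction is above the threshold."""
    return singleton_fraction(y) > gradientthreshold

# Discrete group labels: every sample shares its label with others.
class_labels = [1, 1, 1, 2, 2, 2, 3, 3]
# Continuous property values: every sample is its own "group".
concentrations = [0.10, 0.25, 0.31, 0.48, 0.52, 0.77, 0.80, 0.95]

print(singleton_fraction(class_labels), treat_as_continuous(class_labels))
print(singleton_fraction(concentrations), treat_as_continuous(concentrations))
```

Because the fraction can never exceed 1, a threshold of 1 or more means the comparison never succeeds, which matches the documented way of disabling the method.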

See Also

pca, pls, preprocess, osccalc