Svmda and Advanced Preprocessing: Sample Normalization: Difference between pages

From Eigenvector Research Documentation Wiki
(Difference between pages)
imported>Donal
 
imported>Jeremy
 
===Purpose===
===Introduction===


SVMDA Support Vector Machine (LIBSVM) for classification.
In many analytical methods, the variables measured for a given sample are subject to overall scaling or gain effects. That is, all (or maybe just a portion of) the variables measured for a given sample are increased or decreased from their true value by a multiplicative factor. Each sample can, in these situations, experience a different level of multiplicative scaling.


===Synopsis===
In spectroscopic applications, scaling differences arise from pathlength effects, scattering effects, source or detector variations, or other general instrumental sensitivity effects (see, for example, Martens and Næs, 1989). Similar effects can be seen in other measurement systems due to physical or chemical effects (e.g., decreased activity of a contrast reagent or physical positioning of a sample relative to a sensor). In these cases, it is often the relative value of variables which should be used when doing multivariate modeling rather than the absolute measured value. Essentially, one attempts to use an internal standard or other pseudo-constant reference value to correct for the scaling effects.


:model = svmda(x,y,options);          %identifies model (calibration step)
The sample normalization preprocessing methods attempt to correct for these kinds of effects by identifying some aspect of each sample which should be essentially constant from one sample to the next, and correcting the scaling of all variables based on this characteristic. The ability of a normalization method to correct for multiplicative effects depends on how well one can separate the scaling effects which are due to properties of interest (e.g., concentration) from the interfering systematic effects.
:pred  = svmda(x,model,options);      %makes predictions with a new X-block
:pred  = svmda(x,y,model,options);  %performs a "test" call with a new X-block and known y-values


===Description===
Normalization also helps give all samples an equal impact on the model. Without normalization, some samples may have such severe multiplicative scaling effects that they will not be significant contributors to the variance and, as a result, will not be considered important by many multivariate techniques.


SVMDA performs calibration and application of Support Vector Machine (SVM) classification models. (For support vector machine regression problems, see the svm function.) These are non-linear models for classification problems. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block. This mapping allows prediction of the class, which is passed in either as the classes field of the x-block or in a y-block containing numerical classes.
When creating discriminant analysis models such as PLS-DA or SIMCA models, normalization is done if the relationship between variables, and not the absolute magnitude of the response,  is the most important aspect of the data for identifying a species (e.g., the concentration of a chemical isn't important, just the fact that it is there in a detectable quantity). Use of normalization in these conditions should be considered after evaluating how the variables' response changes for the different classes of interest. Models with and without normalization should be compared.


Svmda is implemented using the LIBSVM package, which provides both C-support vector classification (C-SVC) and nu-support vector classification (nu-SVC). Linear and Gaussian Radial Basis Function kernel types are supported by this function.
Typically, normalization should be performed before any centering, scaling, or other column-wise preprocessing steps and after baseline or offset removal (see above regarding these preprocessing methods). The presence of baseline or offsets can impede correction of multiplicative effects. In these cases, normalization prior to background removal will not improve model results and may degrade model performance.


Note: Calling svmda with no inputs starts the graphical user interface (GUI) for this analysis method.  
One exception to this guideline of preprocessing order is when the baseline or background is very consistent from sample to sample and, therefore, provides a very useful reference for normalization. This can sometimes be seen in near-infrared spectroscopy in the cases where the overall background shape is due to general solvent or matrix vibrations. In these cases, normalization before background subtraction may provide improved models. In any case, cross-validation results can be compared for models with the normalization and background removal steps in either order and the best selected.


====Inputs====
A second exception is when normalization is used after a scaling step (such as autoscaling). This should be used when autoscaling emphasizes features which may be useful in normalization. This reversal of normalization and scaling is sometimes done in discriminant analysis applications.


* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,
===Normalize===
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing integer values,
Simple normalization of each sample is a common approach to the multiplicative scaling problem. The Normalize preprocessing method calculates one of several different metrics using all the variables of each sample. Possibilities include:
* '''model''' = previously generated model (when applying model to new data).


====Outputs====


* '''model''' = a standard model structure with the following fields (see MODELSTRUCT):
{| class="wikitable" border="1"
** '''modeltype''': 'SVM',
|+
** '''datasource''': structure array with information about input data,
! Name!! Description!! Equation*
** '''date''': date of creation,
|-
** '''time''': time of creation,
| 1-Norm ||
** '''info''': additional model information,
Normalize  to (divide each variable by) the sum of the absolute value of all variables  for the given sample. Returns a vector with unit area (area = 1) "under  the curve."
** '''pred''': 2 element cell array with
|| <math>w_{i}=\sum_{j=1}^{n}\left | x_{i,j} \right |</math>
*** model predictions for each input block (when options.blockdetails='standard', x-block predictions are not saved and this will be an empty array)
|-
** '''detail''': sub-structure with additional model details and results, including:
| 2-Norm ||
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java)
Normalize to the square root of the sum of the squared values of all variables for the given sample. Returns a vector of unit length (length = 1). A form of weighted normalization where larger values are weighted more heavily in the scaling.
*** model.detail.svm.cvscan: results of CV parameter scan
|| <math>w_{i}=\sqrt{\sum_{j=1}^{n}x_{i,j}^{2}}</math>
*** model.detail.svm.outlier: results of outlier detection (one-class svm)
|-
| Inf-Norm ||
Normalize to the maximum value observed for all variables  for the given sample. Returns a vector with unit maximum value. Weighted  normalization where only the largest value is considered in the scaling.
|| <math>w_{i}=Max\left ( \mathbf{x}_{i} \right )</math>
|}


* '''pred''' = a structure, similar to '''model''', for the new data.
* Where, in each case, <math>w_{i}</math> is the normalization weight for sample ''i'', <math>\mathbf{x}_{i}</math> is the vector of observed values for the given sample, ''j'' is the variable number, and ''n'' is the total number of variables (columns of x).


===Options===
The weight calculated for a given sample is then used to calculate the normalized sample, <math>\mathbf{x}_{i,norm}</math>, using:
''options'' =  a structure array with the following fields:


* '''display''': [ 'off' | {'on'} ], governs level of display to command window,
: <math>\mathbf{x}_{i,norm}=\mathbf{x}_{i}w_{i}^{-1}</math>
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,
An example using the 1-norm on near-infrared spectra is shown in the figure below. These spectra were measured as 20 replicates of 5 synthetic mixtures of gluten and starch (Martens, Nielsen, and Engelsen, 2003). In the original data (top plot), the five concentrations of gluten and starch are not discernible because of multiplicative and baseline effects among the 20 replicate measurements of each mixture. After normalization using a 1-norm (bottom plot), the five mixtures are clearly observed in groups of 20 replicate measurements each.
* '''preprocessing''': {[]} preprocessing structures for x block (see PREPROCESS). NOTE that y-block preprocessing is NOT used with SVMs. Any y-preprocessing will be ignored.
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.
* '''svmtype''': [ {'c-svc'} | 'nu-svc' ] Type of SVM to apply. The default is 'c-svc' for classification.
* '''probabilityestimates''': [ 0 | {1} ], whether to train the SVC model for probability estimates, 0 or 1 (default 1).


* '''cvtimelimit''': Set a time limit (seconds) on individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 2 (seconds);
[[Image:App_normalize.png|||]]
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5.
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].
* '''outliernu''': Value to use for nu in LIBSVM's one-class svm outlier detection. Default is 0.05.


===Algorithm===
'''Figure''': Effect of normalization on near-IR spectra of five synthetic gluten and starch mixtures. Original spectra (top plot) and spectra after 1-norm normalization (bottom plot) are shown.
Svmda uses the LIBSVM implementation with the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options.


The default SVMDA parameters cost, nu and gamma have value ranges rather than single values. In that case the svmda function searches over the grid of parameter combinations, uses cross-validation to select the optimal SVM parameter values, and builds an SVM model using those values. This is the recommended usage. The user can, however, avoid this grid search by passing in single values for these parameters.
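The size of the default search grid can be sketched in a few lines. This is only an illustration in Python, not the svmda implementation: the helper `log_spaced` is hypothetical, and it assumes the quoted default ranges include both endpoints.

```python
import itertools

def log_spaced(low_exp, high_exp, count):
    """Return `count` values from 10**low_exp to 10**high_exp, uniform in log10."""
    step = (high_exp - low_exp) / (count - 1)
    return [10 ** (low_exp + k * step) for k in range(count)]

# Default ranges quoted in the options list (endpoints assumed inclusive).
gammas = log_spaced(-6, 1, 15)   # 15 values from 10^-6 to 10
costs = log_spaced(-3, 2, 11)    # 11 values from 10^-3 to 100

# Each (cost, gamma) pair is one candidate to be scored by cross-validation.
grid = list(itertools.product(costs, gammas))
print(len(grid), "candidate (cost, gamma) pairs")
```

Under these assumptions both ranges advance in half-decade steps, and a C-SVC model with an RBF kernel has 11 x 15 candidate pairs to cross-validate. Passing single values for cost and gamma reduces the grid to one point, which is why no search is needed in that case.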
From the Preprocessing GUI, the only setting associated with this method is the type of normalization (1-norm, 2-norm or inf-norm). There is currently no option to perform this normalization based on anything other than all selected variables.


===See Also===
From the command line, this method is performed using the [[normaliz]] function (note the unusual spelling of the function).


[[analysis]], [[svm]]
===SNV (Standard Normal Variate)===
 
Unlike the simple 1-Norm Normalize described above, the Standard Normal Variate (SNV) normalization method is a weighted normalization (i.e., not all points contribute to the normalization equally). SNV calculates the standard deviation of all the pooled variables for the given sample (see for example Barnes et al., 1989). The entire sample is then normalized by this value, thus giving the sample a unit standard deviation (s = 1). Note that this procedure also includes a zero-order detrend (subtraction of the individual mean value from each spectrum - see discussion of detrending, above), and also that this is different from mean centering (described later). The equations used by the algorithm are the mean and standard deviation equations:
 
<math>\bar{x}_i=\frac{\sum_{j=1}^{n}{x_{i,j}}}{n}</math>
 
<math>w_i = \sqrt{\frac{\sum_{j=1}^{n}{(x_{i,j}-\bar{x}_i)^2}}{(n-1)}}+\delta</math>
 
where ''n'' is the number of variables, <math>x_{i,j}</math> is the value of the j<sup>th</sup> variable for the i<sup>th</sup> sample, and <math>\delta</math> is a user-definable offset. The user-definable offset can be used to avoid over-normalizing samples which have near zero standard deviation. The default value for this offset is zero, indicating that samples will be normalized by their standard deviation alone. The selection of <math>\delta</math> is dependent on the scale of the variables. A setting near the expected noise level (in the variables' units) is a good approximation.
 
This normalization approach is weighted towards considering the values that deviate from the individual sample mean more heavily than values near the mean. Consider the example Raman spectrum in the figure below. The horizontal line at intensity 0.73 indicates the mean of the spectrum. The dashed lines at 1.38 and 0.08 indicate one standard deviation away from the mean. In general, the normalization weighting for this sample is driven by how far all the points deviate from the mean of 0.73. Considering the square term in the equation above and the histogram on the left of the figure, it is clear that the high intensity points will contribute greatly to this normalization.
 
[[Image:App_snv_1.png|||]]
 
'''Figure''': Example Raman spectrum (right plot) and corresponding intensity histogram (left plot). The mean of the spectrum is shown as a dashed line at intensity 0.73; one standard deviation above and below this mean are shown at intensities 1.38 and 0.08 and indicated by the arrow.
 
 
This approach is a very empirical normalization method in that one seldom expects that variables for a given sample should deviate about their mean in a normal distribution with unit variance (except in the case where the primary contribution to most of the variables is noise and the variables are all in the same units). When much of the signal in a sample is the same in all samples, this method can do very well. However, in cases where the overall signal changes significantly from sample to sample, problems may occur. In fact, it is quite possible that this normalization can lead to non-linear responses to what were otherwise linear responses. SNV should be carefully compared to other normalization methods for quantitative models.
 
The figure below shows the result of SNV on the gluten and starch mixtures described earlier. Comparing the SNV results to the original spectra and 1-norm spectra shown in the  "normalize" section above, it is obvious that SNV gives tighter groupings of the replicate measurements. In fact, SNV was originally developed for NIR data of this sort, and it behaves well with this kind of response.
'''Figure''': Effect of SNV normalization on near-IR spectra of five synthetic gluten and starch mixtures.
 
From the Preprocessing GUI, the only setting associated with this method is the offset. There is currently no option to perform this normalization based on anything other than all selected variables.
From the command line, this method is performed using the [[snv]] function.
 
===MSC (Multiplicative Scatter Correction)===

Revision as of 15:50, 7 July 2010

Introduction

In many analytical methods, the variables measured for a given sample are subject to overall scaling or gain effects. That is, all (or maybe just a portion of) the variables measured for a given sample are increased or decreased from their true value by a multiplicative factor. Each sample can, in these situations, experience a different level of multiplicative scaling.

In spectroscopic applications, scaling differences arise from pathlength effects, scattering effects, source or detector variations, or other general instrumental sensitivity effects (see, for example, Martens and Næs, 1989). Similar effects can be seen in other measurement systems due to physical or chemical effects (e.g., decreased activity of a contrast reagent or physical positioning of a sample relative to a sensor). In these cases, it is often the relative value of variables which should be used when doing multivariate modeling rather than the absolute measured value. Essentially, one attempts to use an internal standard or other pseudo-constant reference value to correct for the scaling effects.

The sample normalization preprocessing methods attempt to correct for these kinds of effects by identifying some aspect of each sample which should be essentially constant from one sample to the next, and correcting the scaling of all variables based on this characteristic. The ability of a normalization method to correct for multiplicative effects depends on how well one can separate the scaling effects which are due to properties of interest (e.g., concentration) from the interfering systematic effects.

Normalization also helps give all samples an equal impact on the model. Without normalization, some samples may have such severe multiplicative scaling effects that they will not be significant contributors to the variance and, as a result, will not be considered important by many multivariate techniques.

When creating discriminant analysis models such as PLS-DA or SIMCA models, normalization is done if the relationship between variables, and not the absolute magnitude of the response, is the most important aspect of the data for identifying a species (e.g., the concentration of a chemical isn't important, just the fact that it is there in a detectable quantity). Use of normalization in these conditions should be considered after evaluating how the variables' response changes for the different classes of interest. Models with and without normalization should be compared.

Typically, normalization should be performed before any centering, scaling, or other column-wise preprocessing steps and after baseline or offset removal (see above regarding these preprocessing methods). The presence of baseline or offsets can impede correction of multiplicative effects. In these cases, normalization prior to background removal will not improve model results and may degrade model performance.
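A small numerical sketch can make the ordering argument concrete. The data are made up, and `remove_offset` (subtracting each sample's minimum) is only a hypothetical stand-in for whatever baseline or offset removal is actually used. Two samples share the same underlying shape but differ in gain, and both carry the same additive offset.

```python
def one_norm(sample):
    """Divide a sample by the sum of its absolute values (1-norm)."""
    weight = sum(abs(v) for v in sample)
    return [v / weight for v in sample]

def remove_offset(sample):
    """Hypothetical stand-in for baseline removal: subtract the sample minimum."""
    low = min(sample)
    return [v - low for v in sample]

shape = [1.0, 2.0, 3.0, 4.0]                    # made-up underlying signal
offset = 0.5                                    # made-up additive baseline
sample_a = [1.0 * v + offset for v in shape]    # gain of 1
sample_b = [2.0 * v + offset for v in shape]    # gain of 2

# Normalization alone: the offset distorts the weights, so the samples differ.
norm_only_a = one_norm(sample_a)
norm_only_b = one_norm(sample_b)

# Offset removal first, then normalization: the gain difference cancels.
both_a = one_norm(remove_offset(sample_a))
both_b = one_norm(remove_offset(sample_b))

gap_norm_only = max(abs(p - q) for p, q in zip(norm_only_a, norm_only_b))
gap_both = max(abs(p - q) for p, q in zip(both_a, both_b))
print(gap_norm_only, gap_both)
```

In this toy case the two samples become identical only when the offset is removed before normalizing. With real data the baseline is rarely a pure constant, so comparing cross-validation results for both orders remains the practical check.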

One exception to this guideline of preprocessing order is when the baseline or background is very consistent from sample to sample and, therefore, provides a very useful reference for normalization. This can sometimes be seen in near-infrared spectroscopy in the cases where the overall background shape is due to general solvent or matrix vibrations. In these cases, normalization before background subtraction may provide improved models. In any case, cross-validation results can be compared for models with the normalization and background removal steps in either order and the best selected.

A second exception is when normalization is used after a scaling step (such as autoscaling). This should be used when autoscaling emphasizes features which may be useful in normalization. This reversal of normalization and scaling is sometimes done in discriminant analysis applications.

Normalize

Simple normalization of each sample is a common approach to the multiplicative scaling problem. The Normalize preprocessing method calculates one of several different metrics using all the variables of each sample. Possibilities include:


{| class="wikitable" border="1"
! Name !! Description !! Equation*
|-
| 1-Norm || Normalize to (divide each variable by) the sum of the absolute value of all variables for the given sample. Returns a vector with unit area (area = 1) "under the curve." || <math>w_{i}=\sum_{j=1}^{n}\left | x_{i,j} \right |</math>
|-
| 2-Norm || Normalize to the square root of the sum of the squared values of all variables for the given sample. Returns a vector of unit length (length = 1). A form of weighted normalization where larger values are weighted more heavily in the scaling. || <math>w_{i}=\sqrt{\sum_{j=1}^{n}x_{i,j}^{2}}</math>
|-
| Inf-Norm || Normalize to the maximum value observed for all variables for the given sample. Returns a vector with unit maximum value. Weighted normalization where only the largest value is considered in the scaling. || <math>w_{i}=Max\left ( \mathbf{x}_{i} \right )</math>
|}

  • Where, in each case, <math>w_{i}</math> is the normalization weight for sample i, <math>\mathbf{x}_{i}</math> is the vector of observed values for the given sample, j is the variable number, and n is the total number of variables (columns of x).

The weight calculated for a given sample is then used to calculate the normalized sample, <math>\mathbf{x}_{i,norm}</math>, using:

<math>\mathbf{x}_{i,norm}=\mathbf{x}_{i}w_{i}^{-1}</math>
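The three weights and the division of each sample by its weight can be sketched as follows. This is an illustrative Python sketch, not the normaliz function itself. The names `norm_weight` and `normalize` are hypothetical, the 2-norm weight is taken as the square root of the sum of squares (which is what yields unit length), and the inf-norm weight is assumed to be the largest absolute value, which equals the maximum observed value for non-negative data.

```python
import math

def norm_weight(sample, kind):
    """Return the normalization weight w_i for one sample (one row of x)."""
    if kind == "1-norm":
        return sum(abs(v) for v in sample)           # unit area
    if kind == "2-norm":
        return math.sqrt(sum(v * v for v in sample))  # unit length
    if kind == "inf-norm":
        return max(abs(v) for v in sample)           # unit maximum (assumed abs)
    raise ValueError("unknown normalization type: " + kind)

def normalize(sample, kind):
    """Divide every variable of the sample by its weight."""
    weight = norm_weight(sample, kind)
    return [v / weight for v in sample]

sample = [3.0, 4.0, 12.0]   # made-up values for one sample
for kind in ("1-norm", "2-norm", "inf-norm"):
    print(kind, norm_weight(sample, kind), normalize(sample, kind))
```

Each sample is treated independently, so applying the same function row by row to a data matrix gives every sample unit area, unit length, or unit maximum, depending on the type chosen.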

An example using the 1-norm on near-infrared spectra is shown in the figure below. These spectra were measured as 20 replicates of 5 synthetic mixtures of gluten and starch (Martens, Nielsen, and Engelsen, 2003). In the original data (top plot), the five concentrations of gluten and starch are not discernible because of multiplicative and baseline effects among the 20 replicate measurements of each mixture. After normalization using a 1-norm (bottom plot), the five mixtures are clearly observed in groups of 20 replicate measurements each.

App normalize.png

Figure: Effect of normalization on near-IR spectra of five synthetic gluten and starch mixtures. Original spectra (top plot) and spectra after 1-norm normalization (bottom plot) are shown.

From the Preprocessing GUI, the only setting associated with this method is the type of normalization (1-norm, 2-norm or inf-norm). There is currently no option to perform this normalization based on anything other than all selected variables.

From the command line, this method is performed using the normaliz function (note the unusual spelling of the function).

SNV (Standard Normal Variate)

Unlike the simple 1-Norm Normalize described above, the Standard Normal Variate (SNV) normalization method is a weighted normalization (i.e., not all points contribute to the normalization equally). SNV calculates the standard deviation of all the pooled variables for the given sample (see for example Barnes et al., 1989). The entire sample is then normalized by this value, thus giving the sample a unit standard deviation (s = 1). Note that this procedure also includes a zero-order detrend (subtraction of the individual mean value from each spectrum - see discussion of detrending, above), and also that this is different from mean centering (described later). The equations used by the algorithm are the mean and standard deviation equations:

<math>\bar{x}_i=\frac{\sum_{j=1}^{n}{x_{i,j}}}{n}</math>

<math>w_i = \sqrt{\frac{\sum_{j=1}^{n}{(x_{i,j}-\bar{x}_i)^2}}{(n-1)}}+\delta</math>

where n is the number of variables, <math>x_{i,j}</math> is the value of the jth variable for the ith sample, and <math>\delta</math> is a user-definable offset. The user-definable offset can be used to avoid over-normalizing samples which have near zero standard deviation. The default value for this offset is zero, indicating that samples will be normalized by their standard deviation alone. The selection of <math>\delta</math> is dependent on the scale of the variables. A setting near the expected noise level (in the variables' units) is a good approximation.
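The SNV calculation for a single sample can be sketched as below. This is a minimal Python illustration rather than the snv function itself. It assumes the offset is added to the standard deviation before dividing, so that the default of zero leaves the plain standard deviation, and it uses made-up data.

```python
import statistics

def snv(sample, offset=0.0):
    """Subtract the sample mean, then divide by (standard deviation + offset)."""
    mean = statistics.mean(sample)
    weight = statistics.stdev(sample) + offset   # stdev uses the n-1 denominator
    return [(v - mean) / weight for v in sample]

spectrum = [0.2, 0.4, 1.5, 0.3]      # made-up sample with one strong peak
scaled = snv(spectrum)
print(statistics.mean(scaled), statistics.stdev(scaled))

# A nearly flat sample: with no offset its small variations are blown up to
# unit standard deviation; an offset near the noise level limits that.
flat = [1.00, 1.01, 0.99]
print(snv(flat))
print(snv(flat, offset=0.05))
```

With a zero offset the result has zero mean and unit standard deviation. For the nearly flat sample, the offset keeps what is probably noise from being scaled up to the same magnitude as real signal in other samples.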

This normalization approach is weighted towards considering the values that deviate from the individual sample mean more heavily than values near the mean. Consider the example Raman spectrum in the figure below. The horizontal line at intensity 0.73 indicates the mean of the spectrum. The dashed lines at 1.38 and 0.08 indicate one standard deviation away from the mean. In general, the normalization weighting for this sample is driven by how far all the points deviate from the mean of 0.73. Considering the square term in the equation above and the histogram on the left of the figure, it is clear that the high intensity points will contribute greatly to this normalization.

App snv 1.png

Figure: Example Raman spectrum (right plot) and corresponding intensity histogram (left plot). The mean of the spectrum is shown as a dashed line at intensity 0.73; one standard deviation above and below this mean are shown at intensities 1.38 and 0.08 and indicated by the arrow.


This approach is a very empirical normalization method in that one seldom expects that variables for a given sample should deviate about their mean in a normal distribution with unit variance (except in the case where the primary contribution to most of the variables is noise and the variables are all in the same units). When much of the signal in a sample is the same in all samples, this method can do very well. However, in cases where the overall signal changes significantly from sample to sample, problems may occur. In fact, it is quite possible that this normalization can lead to non-linear responses to what were otherwise linear responses. SNV should be carefully compared to other normalization methods for quantitative models.
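The loss of linearity mentioned above can be seen in a toy example. The three-variable "spectra" below are made up: a constant background plus an analyte response that grows linearly with concentration in the first variable. The sketch uses SNV with a zero offset.

```python
import statistics

def snv(sample):
    """SNV with zero offset: mean-center, then divide by the standard deviation."""
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)
    return [(v - mean) / std for v in sample]

background = [0.0, 1.0, 2.0]   # made-up signal common to all samples
analyte = [1.0, 0.0, 0.0]      # made-up pure-component response

first_channel = []
for conc in (0.0, 1.0, 2.0):   # equally spaced concentrations
    raw = [conc * a + b for a, b in zip(analyte, background)]
    first_channel.append(snv(raw)[0])

# Before SNV the first variable rises by exactly 1.0 per concentration step.
step_low = first_channel[1] - first_channel[0]
step_high = first_channel[2] - first_channel[1]
print(step_low, step_high)
```

After SNV the two equal concentration steps no longer produce equal changes in the first variable, because the analyte itself alters the mean and standard deviation used for scaling. This is the kind of behavior that makes a comparison against other normalization methods worthwhile for quantitative models.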

The figure below shows the result of SNV on the gluten and starch mixtures described earlier. Comparing the SNV results to the original spectra and 1-norm spectra shown in the "normalize" section above, it is obvious that SNV gives tighter groupings of the replicate measurements. In fact, SNV was originally developed for NIR data of this sort, and it behaves well with this kind of response.

Figure: Effect of SNV normalization on near-IR spectra of five synthetic gluten and starch mixtures.

From the Preprocessing GUI, the only setting associated with this method is the offset. There is currently no option to perform this normalization based on anything other than all selected variables. From the command line, this method is performed using the snv function.

MSC (Multiplicative Scatter Correction)