=Model Exporter Reference Manual=

__TOC__
==Introduction==


[[Model_Exporter]] converts models created within the [[Software_User_Guide|PLS_Toolbox or Solo]] chemometrics modeling environments into an interpretable format for use outside of these products. These exported models can be used with the included C# or Java interpreters or with a user-supplied interpreter to make predictions on new data.


Model_Exporter takes as input a standard model structure created in PLS_Toolbox or Solo and outputs the model into one of three formats: an [[#XML_File_Format|XML file]] (executable by a user-supplied external parser or the Java or C# [[Model_Exporter Interpreter]] class provided with Model_Exporter), an [[#M-file_Format|m-file]] (executable in MATLAB – separately distributed by Mathworks, Inc. – without any additional toolboxes, or in LabVIEW with its MathScript add-on package) or a [[#TCL_File_Format|TCL file]] (executable in a Tcl interpreter or in the Symbion software package – by Symbion Systems, Inc.).


The exported model requires very few resources to be executed. Specifically, it requires floating-point numerical calculations, a small amount of memory, and the overhead resources required by the specific interpreter.


This documentation describes the use of Model_Exporter and of the exported [[#M-file_Format|M-file]] and [[#TCL_File_Format|TCL-file]] formats, and provides guidance for the design of external [[#XML_File_Format|XML parsing engines]]. Model_Exporter includes a freely-distributable interpreter class with versions in C# and Java as described on the [[Model_Exporter Interpreter]] page. In addition, an example interpreter engine is supplied for the PHP language (often used for web-page scripting predictions; see http://www.php.net for more information on PHP). Additional engines may be available - [mailto:helpdesk@eigenvector.com Contact Eigenvector Research, Inc.] for more information.


Latest version release notes can be found at http://wiki.eigenvector.com/index.php?title=Model_Exporter_Release_Notes


==System Requirements==


Model_Exporter can be executed from either the MATLAB computational environment ([http://mathworks.com Mathworks, Inc., Natick, MA]), or  [[Software User Guide|Solo]] (Eigenvector Research, Inc., Wenatchee, WA). Model_Exporter converts models created by PLS_Toolbox 3.5 or higher or Solo 4.0 or higher.


===Matlab-Based Exporter Requirements===


For execution of Model_Exporter within the MATLAB environment, the following is required:


:Matlab 7.3 or higher
:256 MB RAM (recommended – less may be required)


===Solo-Based Exporter Requirements===


For execution of the Model_Exporter, the following is recommended:


:Solo+Model_Exporter 4.1 or higher
:Operating system requirements as listed for the specified Solo version
:200 MB Disk Space (for installation; some models may require additional space)
:256 MB RAM (recommended – less may be required)


===Requirements for Using Exported Models===


The requirements to execute an exported model vary depending on the interpreter used, the number of variables in the modeled data, and the complexity of the model (i.e. the number of factors/components included in the model and the types of preprocessing used).


Memory requirements depend on the precision required for the application, the number of variables in the data and the total number of factors in the model. For example, a model working on 10,000 variables and 5 factors would require around 1MB for double-precision calculations and 500KB for single-precision calculations.


The software which executes the specific file formats may have additional requirements. See the file format description sections later in this manual for where to locate model execution details.


==Supported Methods==


Model_Exporter supports the following model types:
:PCA – Principal Components Analysis model
:PLS – Partial Least Squares regression model
:PLSDA – Partial Least Squares discriminant analysis model
:PCR – Principal Components Regression model
:CLS – Classical Least Squares Regression model
:SVM – Support Vector Machine Regression model
:SVMDA – Support Vector Machine Classification model
:ANN – Artificial Neural Network Regression model


and preprocessing methods:
:Absolute value     
:Autoscale       
:Baseline (specified)
:Derivative (SavGol) 
:Detrend         
:ELS
:EPO                 
:GLS weighting   
:Log Decay Scaling
:Log10               
:MSC             
:Mean center
:Median center       
:Normalize       
:OSC
:Pareto Scaling     
:Poisson Scaling 
:SNV
:Smooth (SavGol)     
:Sqrt Mean Scale 
:Transmission to Absorbance
:Variance Scaling




Normalization and Baseline support windowing. Normalization supports type 1 (area) and type 2 (length) normalization, but does not support 'Inf' type normalization.


Model_Exporter does not support replacement of missing values (values must be measured for all variables).


==Exporting a Model==


===Exporting from PLS_Toolbox and MATLAB===


Model_Exporter is easily called from the MATLAB environment. After adding the Model_Exporter folder to the MATLAB path, a model can be exported by simply calling the exportmodel function, passing the model structure itself and an optional input specifying the file name and type to which the exported model should be written. When the filename is omitted, Model_Exporter will prompt for a filename, file type, and location.


    exportmodel(modelstructure,filename)


    exportmodel(modelstructure,filename, options)


The third parameter, options, allows specification of how excluded variables are handled, how numerical values are stored (text or binary), and whether the exported m-file is a script or function. See below for further details of options.


Model_Exporter is also accessible from the PLS_Toolbox through the Analysis GUI. With the model to export loaded into the Analysis GUI, go to the '''File > Export Model > To Predictor…''' menu and select the file type to export from the flyout menu.


===Exporting from Solo===


When installed with the stand-alone Solo software, a model is exported from the Analysis GUI. With the model to export loaded into the Analysis GUI, go to the '''File > Export Model > To Predictor…''' menu and select the file type to export from the flyout menu.


===Handling Excluded Variables===


When excluded variables are detected within a model, the user will be given two options for how to handle these variables.


# Compress Model – Model_Exporter will attempt to remove all references to excluded variables. The created predictor will expect values for only the included variables.
# Use Placeholders – Model_Exporter will create a predictor which expects values for all variables, excluded or included, although excluded values will be ignored.


The choice between these two methods depends on the environment in which the exported model is going to be used. If it is easier to always provide all variables to the predictor, then the “Use Placeholders” option is probably preferred. If, instead, only the included variables will be available (e.g. the excluded variables are not going to be measured), compressing the model is the correct approach.


In general, the two methods give identical numerical results with the sole exception of models which make use of smoothing and derivative preprocessing. These methods may give slightly different “edge effects” after compressing a model, and validation of such models is encouraged.


In either case, the header information in the exported model will always reflect the number of variables expected and any labels or axisscale information for those variables.


===Storing Numerical Values as Binary===


With large numbers of variables, and with certain types of preprocessing (e.g. derivatives and smoothing), the numerical matrices needed to apply the model can become quite large, particularly when stored in the standard text format of an exported model. When the '''m-file format''' is selected as the output target, you have the choice to store the numerical values in one of three formats:


* Text in the script (Default)
* Binary data file in DOUBLE (64-bit) precision
* Binary data file in SINGLE (32-bit) precision


Text in the script is the default format to store numerical values and allows all the model information to be included in a single file (the text script). The other two options instead store these values in a separate binary file as a simple stream of numerical values of the indicated precision. When the binary formats are selected, the script is written to automatically open the binary file and read in the values from there instead of parsing them out of the script.


:'''Notes:'''
:# This storage format is currently only available for scripts exported in the m-file format. [mailto:helpdesk@eigenvector.com Contact Eigenvector Research] if you have interest in using a similar format for other export formats.
:# Single Precision Binary will reduce the accuracy of the predictions due to rounding error. The extent of error will depend greatly on the noise level of the data and the precision required by the model. Models exported with this precision should be validated with known samples to determine the effect of rounding on the predictions for the given model.
:# It is assumed that the binary file is in the "current working directory" (unless the script is edited to change the file location).


==M-file Format==


The m-files output by [[Model Exporter]] are stand-alone. That is, they can be run by the MATLAB computational environment (available from Mathworks, Inc., http://www.mathworks.com) without any additional toolboxes or the LabVIEW environment (available from National Instruments, Inc., http://www.ni.com) with any MathScript-enabled package.


For maximum flexibility, an exported model is written as a script which expects only to find a variable named x in its workspace. This variable provides the input data to which the model should be applied. It is important to note that the variable x will be modified by the script and, thus, the caller should not expect the variable to remain unchanged. See "Creating Functions from Exported Models", below, for more information on how to isolate the script and call it as a function. (Those unfamiliar with MATLAB scripts and functions should read the MATLAB documentation describing these concepts and the associated "variable scope" documentation.)


The input variable x should be a vector, representing a single sample, and the output will be a prediction for this one sample.
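For example, a minimal session applying an exported model might look like the following sketch. The script name mymodel.m and the sample values are hypothetical, and the expected length of x is model-specific (see Input Data, below):

  x = [0.12 0.34 0.56 0.78 0.90 0.11 0.22 0.33 0.44 0.55];  % one sample as a row vector (hypothetical values)
  mymodel     % run the exported script; results appear as variables in this workspace
  disp(yhat)  % e.g., display a regression prediction (see Returned Results, below)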
 
===Options===
When exporting from the Matlab environment to a .m file using the ''options'' parameter, it is possible to specify preferred behavior for these choices. The available options are:
* '''handleexcludes''': [ {'ask'} | 'ignore' | 'placeholders' ] Governs how excluded variables should be handled.
:'ignore' = attempt to remove all references to excluded variables. Only included values will be expected.
:'placeholders' = expect values for all variables, although excluded values will not be used by model.
:'ask' = prompt user for desired behavior.
 
* '''datastorageformat''': [ {'ask'} | 'text' | 'binarydouble' | 'binarysingle' ] Governs output format of numerical values.
:'text' = store numerical values as text in the script (the normal output mode).
:'binarydouble' = store as binary data file in DOUBLE precision.
:'binarysingle' = store as binary data file in SINGLE precision.
:'ask' = prompt user for desired format.
Note: Single Precision Binary will reduce the accuracy of the predictions due to rounding error; validate results against known samples if single precision is used. Note: binary output formats provide a smaller memory footprint but require parsers that can execute binary file read instructions.
 
* '''creatematlabfunction''': [ {'no'}  | 'yes' ] Governs the m-file format, specifying whether to create a Matlab script or function.
:'yes' outputs m-files with appropriate code to allow calls to the model application in a functional form.
:'no' outputs m-file in script form.
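For example, the following sketch exports a model with all three options specified. It assumes ''options'' is a plain MATLAB structure whose fields are the option names above; the model variable and file name are hypothetical:

  options = struct('handleexcludes','placeholders', ...
                   'datastorageformat','text', ...
                   'creatematlabfunction','no');
  exportmodel(model, 'mymodel.m', options)   % model: a standard model structure from PLS_Toolbox or Solo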
 
===Input Data===
 
The expected length (number of elements) and contents of the input x vector are defined in the comments and initial sections of the exported model script. The script, as exported, does not use this information to perform any validity testing on the input variable. This information is only provided to indicate to the user what type of data is expected.
 
The example below shows the part of an exported model which indicates the expected data size and associated context information. This particular model expects input data of ten variables as a row vector (as described by inputdata.size). The labels of these ten variables are specified in the string array inputdata.label. As there was no axisscale information in this particular data, the inputdata.axisscale value is empty.
 
  inputdata.size = [ 1 10 ];
  inputdata.axisscale = [ ];
  inputdata.label = ['Fe';'Ti';'Ba';'Ca';'K ';'Mn';'Rb';'Sr';'Y ';'Zr'];
 
The user can make use of this information to assure the data being passed to the model is correct. Again, as written, the script provides no testing. Incorrect data sizes will be indicated by a runtime error when executing the script.
 
===Returned Results===
 
The results available from a model prediction will be present as variables in the script's workspace. The user is responsible for making use of these variables as needed. The following list specifies the supported results which may be of interest to the user.
 
:'''scores''' - Scores for each component as a row vector.
:'''T2''' - The Hotelling's T^2 as a scalar value.
:'''Q''' - The sum squared x residuals (Q value) as a scalar value.
:'''Tcon''' - Variable contributions to T2 as a row vector.
:'''Qcon''' - Q residuals contributions (x residuals) as a row vector.
:'''x''' - The preprocessed version of the input data.
:'''Xhat''' - Model estimate of the data as a row vector (in preprocessed units - comparable to the preprocessed '''x''').
 
All regression and PLSDA models return the following additional value:
 
:'''yhat''' - Model prediction for y (predicted y value) as a scalar value or vector.
 
PLSDA discriminant analysis models also return an additional value:
 
:'''probs''' - Model predicted probability of the input sample belonging to each class, where the classes are ordered as unique(y), as a vector. (y refers to the classes variable originally used in building the model).
 
SVM regression analysis models return values:
:'''yhat''' - Model prediction for y (predicted y value) as a scalar.
:'''nsvs''' - Number of support vectors used by the model, as a scalar.
 
SVMDA discriminant analysis (classification) models return values:
:'''probs''' - Model predicted "probability" of the input sample belonging to each class, where the classes are ordered as shown in classorder (below). Note that the probability reported here is '''not''' the same as the probability reported by the SVM algorithm (based on a maximum likelihood calculation). Instead, this is based on the classvotes reported below. The class votes are normalized to give fraction of votes for each class. This fraction is raised to the power of 10, then normalized to unit area again. This gives a ROUGH estimate of probability where the class with the highest votes also gets the highest probability and the remaining classes are ranked in decreasing order. The log of the probability is roughly proportional to the number of class votes that would have to change to cause the assignment to change.
::NOTE: For historical reasons, the output '''prob''' will also contain the identical probabilities as '''probs'''.
:'''classvotes''' - Votes cast in favor of each class, as a vector. The class with most votes is the predicted class of the input sample.
:'''classorder''' - A vector of class numbers identifying which class each classvotes value is associated with. For example, if the second entry in classvotes has the largest value then the second value in classorder gives the winning class number. See the model's model.classification.classnums and model.classification.classids to translate between class numbers and class names. Ties between two or more classes are resolved by choosing the first.
:'''nsvs''' - Number of support vectors used by the model, as a scalar.
:'''df''' - Vector of decision function values for pairwise classifiers, as a vector. If there are N classes then there are N*(N-1)/2 pairwise classifiers used. The decision functions are in order: class 1-2, 1-3,...1-N, 2-3, 2-4, ...,2-N,..., (N-1)-N. The classvotes are based on the decision function values.
 
:Note these exported model results should be the same as results from SVMDA when using option probabilityestimates = 0 (even if the exported model was built using option probabilityestimates = 1). Thus the exported model's predictions should only be validated against SVMDA models built using probabilityestimates = 0.
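The vote-to-probability conversion described above can be written as a short sketch (the classvotes values are hypothetical):

  classvotes = [5 2 1];                  % votes for each class (hypothetical)
  frac  = classvotes./sum(classvotes);   % fraction of votes per class
  probs = frac.^10;                      % raised to the power of 10
  probs = probs./sum(probs)              % renormalized to unit area; most votes -> highest probability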
 
===Creating Functions from Exported Models===
 
Although the exported model is written as a script which would normally operate in the base workspace of MATLAB, the user can also wrap the script into a function by simply adding a standard function definition to the script file. A function wrapper keeps the input variable x from being modified outside the function. This approach tends to be safer than a script, but is not implemented by default in order to provide the widest flexibility to the user.
 
An example function line is provided in the exported model file (commented out) along with instructions for customization. In addition, there is an example block of code (also commented out by default) which will return “expected information” about x if the function is called without any inputs.
 
In general, the function definition requires only one input, x, and can output any of the variables which are present after the script's execution. An example would be:
 
  function [scores,Q,T2,Qcon,Tcon] = mymodel(x)
 
This function definition returns the vectors: scores, Qcon, and Tcon, as well as the scalar values: Q and T2 to the caller.
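With that function definition added to the file, the model can then be called like any other MATLAB function, for example:

  [scores,Q,T2,Qcon,Tcon] = mymodel(x);   % x is passed by value and is not modified in the caller's workspace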
 
Note, as discussed above, the user can have this conversion of the exported m-file from a script to a function applied automatically by specifying the exportmodel option '''creatematlabfunction''' = 'yes'.
 
==TCL File Format==
 
The tcl-files output by [[Model Exporter]] can be run by either a stand-alone Tcl parser (for example see the "Batteries Included" ActiveTcl Distribution http://www.tcl.tk/software/tcltk/ ) or by Symbion (available from Symbion Systems, Inc., http://www.gosymbion.com ). When run in a stand-alone Tcl parser, the La package for matrix support is required (available free from: http://www.hume.com/la/ )
 
For maximum flexibility, an exported model is written as a Tcl script which expects only to find a variable named x in its workspace. This variable provides the input data to which the model should be applied. It is important to note that the variable x will be modified by the script and, thus, the caller should not expect the variable to remain unchanged.
 
The input variable x should be a vector, representing a single sample, and the output will be a prediction for this one sample.
 
===Input Data===
 
The expected length (number of elements) and contents of the input x vector are defined in the comments and initial sections of the exported model script. The script, as exported, does not use this information to perform any validity testing on the input variable. This information is only provided to indicate to the user what type of data is expected.
 
The example below shows the part of an exported model which indicates the expected data size and associated context information. This particular model expects input data of ten variables as a row vector (as described by inputdata.size). The labels of these ten variables are specified in the string array inputdata.label. As there was no axisscale information in this particular data, the inputdata.axisscale value is empty.
 
# inputdata.size = [ 1 10 ];
# inputdata.axisscale = [ ];
# inputdata.label = ['Fe';'Ti';'Ba';'Ca';'K ';'Mn';'Rb';'Sr';'Y ';'Zr'];
 
The user can make use of this information to assure the data being passed to the model is correct. Again, such testing is not provided by the script as written. Incorrect data sizes will be indicated by a runtime error when executing the script.
 
===Returned Results===
 
The results available from a model prediction will be present as variables in the script's workspace. The user is responsible for making use of these variables as needed. The list of output variables is the same as those listed under the [[#Returned_Results|M-file format description]].
 
==XML File Format==
 
The input variable x should be a vector, representing a single sample, and the output will be a prediction for this one sample.
 
===Numerical Matrix Definitions===
 
The XML format utilizes custom tags to define various parts of the model. For some tags, the content is a vector or matrix of values. In these cases, a comma character delineates different column elements and semicolon indicates the end of a matrix row and the beginning of the next. All white space is ignored. If a given matrix contains only one row, it is described as a "row vector". A matrix with a single column is described as a "column vector". Orientation of such vectors is critical to the mathematical operations and must be parsed appropriately.
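For example, the text <pre>1.0, 2.0, 3.0; 4.0, 5.0, 6.0</pre> encodes a matrix with two rows and three columns, while <pre>1.0, 2.0, 3.0</pre> encodes a row vector of three elements.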
 
===XML Structure===
 
The XML file will consist of a top level &lt;model&gt; tag which will contain an &lt;information&gt; tag, an &lt;inputdata&gt; tag, and one or more step segments, each wrapped in a separate &lt;step&gt; tag.
 
'''&lt;model&gt;'''
:'''&lt;information&gt;'''  General information on the encoded model.
::'''&lt;source&gt;'''    Text description of file source (EVRI Model_Exporter).
::'''&lt;modeltype&gt;'''  Standard model method acronym (PCA, PLS, etc).
::'''&lt;description&gt;''' Text description of model including preprocessing, data size(s), and number of components. Each row of this multi-row string is delineated by &lt;sr&gt; (string row) tags.
::'''&lt;datasource&gt;''' Information block of modeled calibration data. &lt;datasource&gt; is a multi-cell table format. There will be one column of information for each block of data required by the given modeltype (e.g. PCA requires 1 block, PLS requires 2). Each &lt;td&gt; tag will contain a number of sub-fields describing the data used for the given block. Informational only, sub-fields may change.
:'''&lt;/information&gt;'''
:'''&lt;inputdata&gt;'''    Specific requirements for input data including the following information:
::'''&lt;size&gt;'''    Numeric class row vector describing the size expected for the input data (x). The first element of the vector gives the expected number of rows, the second is the expected number of columns.
::'''&lt;axisscale&gt;'''  Numeric class row vector providing the expected axisscale of the input values. The actual values stored in the axisscale vector are completely dependent on the application and the analytical method used and may be empty.
::'''&lt;label&gt;'''    Strings (delimited by &lt;sr&gt; sub-tags) defining the names of the variables expected in the input data (x). The names are dependent on the application and the analytical method used and may be empty.
:'''&lt;/inputdata&gt;'''
:'''&lt;step&gt;'''      Repeated tag for each step required for making a prediction using this model. Will contain the following sub-fields:
::'''&lt;sequence&gt;'''  Numeric class single value indicating the order in which this step should be performed. The steps are generally included in the XML file in sequence-order (sequence 1 will be the first step in the file), but this field can be used to assure in-order processing of steps.
::'''&lt;description&gt;''' String class description of the step (informational only)
::'''&lt;constants&gt;'''  Contains information on constants required by this step. Each constant is defined as a sub-tag herein. The name of the constant is the sub-tag name and will contain a matrix (or vector) of values to use for the given constant. See below for more information.
::'''&lt;script&gt;'''   One or more rows of strings describing the mathematical operations to perform this step. When more than one mathematical operation is to be performed, each will be given in a separate string row &lt;sr&gt; tag; however, these row divisions can be ignored, as each mathematical operation will be terminated with a semicolon.
:'''&lt;/step&gt;'''
''(Additional &lt;step&gt; tags located here…)''
 
'''&lt;/model&gt;'''
 
See the provided files "pcaexample.xml" and "plsexample.xml" for full examples of the XML structure.
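For illustration, a hypothetical &lt;step&gt; segment which subtracts a stored offset from the input data might look like the example below. The constant name "offset" and its values are invented here; the tag names and the script command format follow the specification in this section:

<pre>
<step>
  <sequence>1</sequence>
  <description>Subtract calibration offset (illustrative)</description>
  <constants>
    <offset class="numeric" size="[1,3]">1.2, 3.4, 5.6</offset>
  </constants>
  <script>
    <sr>x = minus(x,offset);</sr>
  </script>
</step>
</pre>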
 
==Requirements for XML Interpreters==
 
To execute each of the &lt;step&gt; segments contained in the XML file, an interpreter must be able to parse the constants defined into matrices and be able to execute the script commands. The following sections give the specifications for an interpreter.
 
For examples of interpreters, see the [[Model_Exporter Interpreter]] objects in folder: interpreters/MEInterpreter, or the PHP interpreter in interpreters/predict.php. These are all distributed with Model_Exporter and are ''freely-distributable without additional licensing''.
 
===Managing of Constants and Variables===
* The interpreter must  maintain a "workspace" of stored constants and variables in  which the matrices can be accessed by a variable name (specified by the  tag in which the given constant was read, for example:<pre>&lt;s class="numeric" size="[1,1]">4&lt;/s></pre>
:would define a constant "s" which was equal to the scalar value 4).
* Constants are NOT case sensitive and any interpreter must be written to consider the upper or lower case variables as the same.
* "Constants" are just pre-defined variables. Although every effort will be made to avoid changing these values, it is NOT a rule that these "constants" cannot be  changed – scripts may modify and overwrite these values. They are called  "constants" because they are initially defined by the model.
* The enclosing tag for the  constant will define the class of the constant (in this application,  constants will always be "numeric") and will also define the  size of the constant using the attribute 'size'. For example, <pre>&lt;s class="numeric" size="[1,5]"></pre> defines that the enclosed constant will be a row vector (1 row) of 5 elements (5 columns).
* Prior to the execution of  the script(s), the XML interpreter must place a variable named "x"  (lower-case) in their workspace. This variable must contain the data to which  the model should be applied. The value of "x" will be modified by the script so, following initial assignment, no alteration of this  variable should be done outside what is specified by the script.
* All constants/variables  must be retained for the entirety of a given step. In many cases, the  variables remaining in the workspace will contain results of interest to the caller and, therefore, all workspace values should be retained. The variable "x" must always be present.
 
===Script Execution===
 
The following lists define the script commands which must be supported by the interpreter (scripts may contain only these commands). When applicable, the Matlab operator corresponding to the given function is given. Interpreters do not need to interpret these operators. They will never be used in any script and are provided here only for reference.
 
====Single Input Functions====
 
C = function(A); 
  abs            Absolute Value    Removal of sign of elements ( abs(A) )
  log10          log (base 10)      Base 10 logarithm of elements ( log10(A) )
  transpose      transpose array    Exchange rows for columns ( A' )
 
====Double Input Functions====
 
C = function(A,B);
    plus          Plus                              Addition of paired elements ( A+B )
    minus        Minus                            Subtraction of paired elements ( A-B )
    mtimes        Matrix multiply (dot product)    Dot product of matrices ( A*B )
    times        Array multiply                    Multiplication of paired elements ( A.*B )
    power        Array power                      Exponent using paired elements ( A.^B )
    rdivide      Right array divide                Division of paired elements ( A./B )
    cols          Index into columns of matrix      Select or replicate columns  ( A(:,B) )
    rows          Index into rows of matrix        Select or replicate rows    ( A(B,:) )
 
===Mathematical Operation Requirements===
 
* All mathematical operations are expected to be performed using signed, single precision numbers.
* With the exception of mtimes (dot product), all operations are "element-by-element". That is, the two matrices passed will be equal in size (see scalar exception below) and the mathematical operation is performed between each element of matrix A and its corresponding element in matrix B. The output matrix C is always the same size as A and B.
* Scalar Exception (except mtimes): A or B may be a scalar even if the other isn't. In this situation, the scalar input must be interpreted as an appropriately-sized matrix containing all the same value.
* mtimes (dot product) is performed using the standard linear-algebraic dot-product operation. In generic terms, the input matrix A will contain m rows and k columns, the input matrix B will contain k rows and n columns and the output matrix C will contain m rows and n columns. The following equation is used to calculate each element of the C matrix (loop for i = 1 to m and for j = 1 to n):
::<math>C_{i,j}=A_{i,1}B_{1,j} + A_{i,2}B_{2,j} + A_{i,3}B_{3,j}  + ... + A_{i,k}B_{k,j}</math>
:Subscripts indicate the row and column indexing (respectively) into the  given array. When either A or B is a scalar, the mtimes operation should  be handled as a "times" operation. That is, the operation  becomes an element-by-element multiplication where each element of the  matrix input is multiplied by the scalar value and C is the same size as  the input matrix.
* cols and rows indexing operations should expect a row vector for B that may have repeated elements (which allows replication of a given row or column). For example, given a row vector for B of
::B = [1 1 1 2 2 2]
:passed into the cols operation, this would replicate column 1 three times then replicate column 2 three times giving a total of 6 columns in the output.
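In MATLAB notation, this replication behavior of the cols operation can be illustrated as follows:

  A = [10 20; 30 40];
  B = [1 1 1 2 2 2];
  C = A(:,B)   % cols operation: C = [10 10 10 20 20 20; 30 30 30 40 40 40]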
 
===Script Execution Requirements===
* The format for a single  script command is: <pre> C = function(A,B);</pre> where function is one of the above functions, A and B are the pre-defined constants / variables to use as input to function, and C is the output. Input B will be omitted for functions which require only one input. Each command of the script will end in a semi-colon ";". All commands must be performed in the order in which they appear in the script.
* The expected size, axisscale, and labels associated with x will be stored in the &lt;inputdata&gt; tag (if any exist). These values can be used by an XML interpreter to verify the data being analyzed.
* Constants are NOT case  sensitive and any interpreter must be written to consider the upper or lower  case variables as the same.
 
===Returned Results===
 
The results returned by a model prediction will be present as variables in the interpreter's workspace upon completion of the XML parsing. The returned results are the same as those listed for the [[#Returned_Results|M-file format]].
 
==Requirements for XML Writers==
 
The following rules are to be followed by the script creation algorithm of Model_Exporter. These rules may be of interest to script interpreters, but should not have any critical impact on interpreter design.
 
* Nesting of functions is not  allowed. Functions can only take variables or pre-defined constants as  input.
* NO iterative processes are supported. All scripts must be straight-through executing (no control structures such as "if", "while", etc. are supported.)
* Missing data replacement  is not supported.
* As of version 1.0 of this product,  only variables or pre-defined constants may be used in a function. No  "in-line" constants may be used. For example:
 
    C = minus(A,1);
 
is invalid because the in-line constant "1" is not allowed. The command should instead be written with the "1" pre-defined as a constant and the name of that constant used in the command.
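For example, a valid equivalent pre-defines the value as a constant in the step's &lt;constants&gt; block (the constant name "one" used here is hypothetical):<pre>&lt;one class="numeric" size="[1,1]">1&lt;/one></pre>
:and then references that constant by name:<pre> C = minus(A,one);</pre>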
 
* Variables are NOT case sensitive and any interpreter must be written to consider the upper or lower case variables as the same. Note, however, that the Matlab output of Model_Exporter will be case-sensitive code, so the scripts should try to be consistent in case, even if other interpreters won't care.

=Advanced Preprocessing: Multivariate Filtering=

===Introduction===

In some cases, there is insufficient selectivity in the variables to easily remove things like backgrounds or other signals which are interferences to a multivariate model. In these cases, using multivariate filtering methods before model calibration may help simplify the end model. Multivariate filters identify some unwanted covariance structure (i.e., how variables change together) and remove these sources of variance from the data prior to calibration or prediction. In a simple way, these filters can be viewed as pattern filters in that they remove certain patterns among the variables. The resulting data contain only those covariance patterns which passed through the filter and are, ideally, useful or interesting in the context of the model.

Identification of the patterns to filter can be based on a number of different criteria. The full discussion of multivariate filtering methods is outside the scope of this chapter, but it is worth noting that these methods can be very powerful for calibration transfer and instrument standardization problems, as well as for filtering out other differences between measurements which should otherwise be the same (e.g., differences in the same sample due to changes with time, or differences within a class of items being used in a classification problem).

One common method to identify the multivariate filter "target" uses the Y-block of a multivariate regression problem. This Y-block contains the quantitative (or qualitative) values for each sample and, theoretically, samples with the same value in the Y-block should have the same covariance structure (i.e., they should be similar in a multivariate fashion). A multivariate filter can be created which attempts to remove differences between samples with similar y-values. This filter should reduce the complexity of any regression model needed to predict these data. Put in mathematical terms, the multivariate filter removes signals in the X-block (measured responses) which are orthogonal to the Y-block (property of interest).

Three multivariate filtering methods are provided in the Preprocessing window: Orthogonal Signal Correction (OSC), Generalized Least Squares Weighting (GLSW), and External Parameter Orthogonalization (EPO), where this last one also encompasses Extended Mixture Model (EMM) filtering. In the context of the Preprocessing window, these methods require a Y-block and are thus only relevant in the context of regression models. Additionally, as of the current version of PLS_Toolbox, the graphical interface access to these functions only permits their use to orthogonalize to a Y-block, not for calibration transfer applications. From the command line, however, these functions can also be used for calibration transfer or other filtering tasks. For more information on these uses, please see the calibration transfer and instrument standardization chapter of this manual.

===OSC (Orthogonal Signal Correction)===

Orthogonal Signal Correction (Sjöblom et al., 1998) removes variance in the X-block which is orthogonal to the Y-block. Such variance is identified as some number of factors (described as components) of the X-block which have been made orthogonal to the Y-block. When applying this preprocessing to new data, the same directions are removed from the new data prior to applying the model.

The algorithm starts by identifying the first principal component (PC) of the X-block. Next, the loading is rotated so that the scores are orthogonal to the Y-block. This loading represents a feature which is not influenced by changes in the property of interest described in the Y-block. Once the rotation is complete, a PLS model is created which can predict these orthogonal scores from the X-block. The number of components in the PLS model is adjusted to achieve a given level of captured variance for the orthogonal scores. Finally, the weights, loadings, and predicted scores are used to remove the given orthogonal component, and are also set aside for use when applying OSC to a new unknown sample. This entire process can then be repeated on the "deflated" X-block (the X-block with the previously-identified orthogonal component removed) for any given number of components. Each cycle results in additional PLS weights and loadings being added to the total that will be used when applying to new data.

There are three settings for the OSC preprocessing method: number of components, number of iterations, and tolerance level. The number of components defines how many times the entire process will be performed. The number of iterations defines how many cycles will be used to rotate the initial PC loading to be as orthogonal to Y as possible. The tolerance level defines the percent variance that must be captured by the PLS model(s) of the orthogonalized scores.

In the Preprocessing window, this method allows for adjustment of the settings identified above. From the command line, this method is performed using the osccalc and oscapp functions.
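From MATLAB, a calibrate-then-apply session might look like the sketch below. The data variables are synthetic placeholders, and the osccalc/oscapp calling forms shown are assumptions that should be checked against the PLS_Toolbox function help before use:

  rng(0); Xcal = randn(30,50); ycal = randn(30,1); Xnew = randn(5,50);  % synthetic data (placeholders)
  ncomp = 1; iter = 20; tol = 99.9;                 % the three settings described above (assumed values)
  [nx,nw,np] = osccalc(Xcal,ycal,ncomp,iter,tol);   % filtered X-block plus weights and loadings (assumed signature)
  nxnew = oscapp(Xnew,nw,np);                       % remove the same directions from new data (assumed signature)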

===GLS Weighting and EPO===

Generalized Least Squares Weighting (GLSW) is a filter calculated from the differences between samples which should otherwise be similar. These differences are considered interferences or "clutter" and the filter attempts to down-weight (shrink) those interferences. A simplified version of GLSW is called External Parameter Orthogonalization (EPO), which does an orthogonalization (complete subtraction) of some number of significant patterns identified as clutter. A simplified version of EPO, in which the data are orthogonalized to all identified clutter patterns, emulates the Extended Mixture Model (EMM).

====Clutter Identification====

In the case of a classification problem, similar samples would be the members of a given class. Any variation within each class group (known as "within-class variance") can be considered clutter which will make the classification task harder. The goal of GLSW in this case is to remove this within-class variance as much as possible without reducing the separation between the classes (the between-class variance).

In the case of a calibration transfer problem, similar samples would be data from the same samples measured on two different instruments or on the same instrument at two different points in time. The goal of GLSW is to down-weight the differences between the two instruments and, therefore, make them appear more similar. A regression model built from GLSW-filtered data can be used on either instrument after applying the filtering to any measured spectrum. Although this specific application of GLSW is not covered by this chapter, the description below gives the mathematical basis of this use.

GLSW can also be used prior to building a regression model in order to remove variance from the X-block which is mostly orthogonal to the Y-block. This application of GLSW is similar to OSC (see above), and such filtering can allow a regression model to achieve a required error of calibration and prediction using fewer latent variables. In this context, GLSW uses samples with similar Y-block values to identify the sources of variance to down-weight.

In all cases, the default algorithm for GLSW uses a single adjustable parameter, <math>\alpha</math>, which defines how strongly GLSW downweights interferences. Adjusting <math>\alpha</math> towards larger values (typically above 0.001) decreases the effect of the filter. Smaller values of <math>\alpha</math> (typically 0.001 and below) apply more filtering.

====GLSW Algorithm====

The GLSW algorithm will be described here for the calibration transfer application (because it is simpler to visualize) and then the use of GLSW in classification and regression applications will be described. In all cases, the approach involves the calculation of a covariance matrix from the differences between similar samples. In the case of calibration transfer problems, this difference is defined as the numerical difference between the two groups of mean-centered transfer samples. Given two sample matrices, X1 and X2, the data are mean-centered and the difference calculated:

:<math>\mathbf{X}_{1,mc}=\mathbf{X}_{1}-\mathbf{1}\bar{\mathbf{x}}_{1}</math> <div align="right">(1)</div>

:<math>\mathbf{X}_{2,mc}=\mathbf{X}_{2}-\mathbf{1}\bar{\mathbf{x}}_{2}</math> <div align="right">(2)</div>

:<math>\mathbf{X}_{d}=\mathbf{X}_{2,mc}-\mathbf{X}_{1,mc}</math> <div align="right">(3)</div>


where '''1''' is a vector of ones equal in length to the number of rows in '''X<sub>1</sub>''', <math>\bar{\mathbf{x}}_1</math> is the mean of all rows of '''X<sub>1</sub>''', and <math>\bar{\mathbf{x}}_2</math> is the mean of all rows of '''X<sub>2</sub>'''. Note that this requires that '''X<sub>1</sub>''' and '''X<sub>2</sub>''' are arranged such that the rows are in the same order in terms of samples measured on the two instruments.

The next step is to calculate the covariance matrix, C:

:<math>\mathbf{C}=\mathbf{X}_d^T\mathbf{X}_d</math> <div align="right">(4)</div>

followed by the singular-value decomposition of the matrix, which produces the left eigenvectors, '''V''', and the diagonal matrix of singular values, '''S''':

:<math>\mathbf{C}=\mathbf{V}\mathbf{S}^2\mathbf{V}^T</math> <div align="right">(5)</div>


Next, a weighted, ridged version of the singular values is calculated

:<math>\mathbf{D}=\sqrt{\frac{\mathbf{S}^2}{\alpha}+\mathbf{1}_D}</math> <div align="right">(6)</div>

where '''1'''<sub>D</sub> is a diagonal matrix of ones of appropriate size and <math>\alpha</math> is the weighting parameter mentioned earlier. The scale of the weighting parameter depends on the scale of the variance in '''X'''<sub>d</sub>. Finally, the inverse of these weighted eigenvalues is used to calculate the filtering matrix.

:<math>\mathbf{G}=\mathbf{V}\mathbf{D}^{-1}\mathbf{V}^T</math> <div align="right">(7)</div>

This multivariate filtering matrix can be used by simply projecting a sample into the matrix. The result of this projection is that correlations present in the original covariance matrix are down-weighted (to the extent defined by <math>\alpha</math>). The filtering matrix is used both on the original calibration data prior to model calibration and on any future new data prior to application of the regression model.

The choice of <math>\alpha</math> depends on the scale of the original values but also on how similar the interferences are to the net analyte signal. If the interferences are similar to the variance necessary to the analytical measurement, then <math>\alpha</math> will need to be higher in order to keep from removing analytically useful variance. However, a higher <math>\alpha</math> will decrease the extent to which interferences are down-weighted. In practice, values between 1 and 0.0001 are often used.
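The calibration-transfer form of the algorithm (equations 1 through 7) can be summarized in a short MATLAB sketch. The data here are synthetic placeholders, and this is an illustration of the equations only, not the PLS_Toolbox glsw function:

  rng(0); X1 = randn(20,50); X2 = X1 + 0.1*randn(20,50);  % synthetic row-matched transfer samples
  alpha = 0.02;                                 % weighting parameter (assumed value)
  X1mc = X1 - ones(size(X1,1),1)*mean(X1,1);    % eq 1
  X2mc = X2 - ones(size(X2,1),1)*mean(X2,1);    % eq 2
  Xd = X2mc - X1mc;                             % eq 3
  C = Xd'*Xd;                                   % eq 4
  [V,Ssq] = svd(C);                             % eq 5: C = V*S^2*V'
  D = sqrt(Ssq./alpha + eye(size(Ssq)));        % eq 6
  G = V*(D\V');                                 % eq 7: G = V*inv(D)*V'
  X1f = X1*G;                                   % filter data by projecting samples through G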

====Y-Gradient GLSW====

When using GLSW to filter out X-block variance which is orthogonal to a Y-block, a different approach is used to calculate the difference matrix, '''X'''<sub>d</sub>. In this situation we have only one X-block, '''X''', of measured calibration samples, but we also have a Y-block, '''y''' (here defined only for a single column-vector), of reference measurements. To a first approximation, the Y-block can be considered a description of the similarity between samples. Samples with similar y values should have similar values in the X-block.

In order to identify the differences between samples with similar y values, the rows of the X- and Y-blocks are first sorted in order of increasing y value. This puts samples with similar values near each other in the matrix. Next, the difference between proximate samples is determined by calculating the derivative of each column of the X-block. These derivatives are calculated using a 5-point, first-order, Savitzky-Golay first derivative (note that a first-order polynomial derivative is essentially a block-average derivative including smoothing and derivatizing simultaneously). This derivative yields a matrix, '''X'''<sub>d</sub>, in which each sample (row) is an average of the difference between it and the four samples most similar to it. A similar derivative is calculated for the sorted Y-block, yielding vector '''y'''<sub>d</sub>, a measure of how different the y values are for each group of 5 samples.

At this point, '''X'''<sub>d</sub> could be used in equation 4 to calculate the covariance matrix of differences. However, some of the calculated differences (rows) may have been done on groups of samples with significantly different y values. These rows contain features which are correlated to the Y-block and should not be removed by GLS. To avoid this, the individual rows of '''X'''<sub>d</sub> need to be re-weighted by converting the sorted Y-block differences into a diagonal re-weighting matrix, '''W''', in which the ''i''<sup>th</sup> diagonal element, ''w''<sub>i</sub>, is calculated from the rearranged equation

:<math>\log_2(w_i)=-\mathbf{y}_{d,i}s_{yd}</math> <div align="right">(8)</div>

The value <math>\mathbf{y}_{d,i}</math> is the ''i''<sup>th</sup> element of the '''y'''<sub>d</sub> vector, and ''s''<sub>yd</sub> is the standard deviation of y-value differences:

:<math>s_{yd}=\sqrt{\sum_{i=1}^m{\frac{(y_{d,i}-\bar{y}_d)^2}{m-1}}}</math> <div align="right">(9)</div>


The re-weighting matrix is then used along with '''X'''<sub>d</sub> to form the covariance matrix

:<math>\mathbf{C}=\mathbf{X}_d^T\mathbf{W}^{2}\mathbf{X}_d</math> <div align="right">(10)</div>

which is then used in equations 5 through 7 as described above.
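A hedged sketch of this Y-gradient variant follows. The data are synthetic placeholders, edge samples are simply skipped (the actual implementation handles them through the Savitzky-Golay filter), and the weighting follows equation 8 as written:

  rng(0); X = randn(30,50); y = randn(30,1);   % synthetic calibration data (placeholders)
  [ys,idx] = sort(y); Xs = X(idx,:);           % sort rows by increasing y value
  w5 = (-2:2)/10;                              % slope weights of a 5-point, first-order LS fit
  m = size(Xs,1);
  Xd = zeros(m-4,size(Xs,2)); yd = zeros(m-4,1);
  for i = 3:m-2                                % block-average derivative over 5 neighboring samples
      Xd(i-2,:) = w5*Xs(i-2:i+2,:);
      yd(i-2) = w5*ys(i-2:i+2);
  end
  syd = std(yd);                               % eq 9
  W = diag(2.^(-yd*syd));                      % eq 8 rearranged: w_i = 2^(-y_d,i*s_yd)
  C = Xd'*(W^2)*Xd;                            % eq 10; then proceed with eqs 5 through 7 as above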

====External Parameter Orthogonalization (EPO)====

An alternative multivariate filter called External Parameter Orthogonalization (EPO) uses the same process as GLSW except that only a certain number of the eigenvectors calculated in equation 5 are kept and the '''D''' matrix calculated in equation 6 is replaced with a diagonal matrix of ones. The result is that '''X''' is "hard-orthogonalized" to the eigenvectors (the directions are completely removed) rather than simply "shrinking" these directions as is done with GLSW.
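The hard orthogonalization can be sketched in a few lines of MATLAB; here k is an assumed, user-chosen number of clutter components and V comes from equation 5 (as in the GLSW sketch above):

  k = 2;                   % number of clutter eigenvectors to remove (assumed value)
  Vk = V(:,1:k);           % retained clutter directions from eq 5
  Xf = X - X*(Vk*Vk');     % these directions are completely removed from X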

If all of the calculated eigenvectors are used in an EPO filter, the method becomes equivalent to the Extended Mixture Model (EMM) method described in Martens and Naes 1989.

For a literature reference on EPO, see: Roger, Chauchard, Bellon-Maurel, "EPO–PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits." Chemom. Intell. Lab. Syst., 66, 191–204 (2003).

====Settings and Command-line Usage====

In the Preprocessing window, the GLSW method has a [[Declutter_Settings_Window|Settings Window]] to allow for adjustment of the weighting parameter <math>\alpha</math>, whether or not to include mean-centering ("ignore means"), whether to use '''EPO''' mode and select a given number of components to orthogonalize to, or whether to use '''EMM/ELS''' mode in which the data is orthogonalized to all available components. From the command line, this method is performed using the [[glsw]] function, which also permits a number of other modes of application (including identification of "classes" of similar samples).