Advanced Preprocessing: Multivariate Filtering and Faq obtain or use recompilation license for PLS Toolbox: Difference between pages

From Eigenvector Research Documentation Wiki
(Difference between pages)
imported>Jeremy
 
imported>Scott
No edit summary
 
===Introduction===
===Issue:===


In some cases, there is insufficient selectivity in the variables to easily remove things like backgrounds or other signals which are interferences to a multivariate model. In these cases, using multivariate filtering methods before model calibration may help simplify the end model. Multivariate filters identify some unwanted covariance structure (i.e., how variables change together) and remove these sources of variance from the data prior to calibration or prediction. In a simple way, these filters can be viewed as pattern filters in that they remove certain patterns among the variables. The resulting data contain only those covariance patterns which passed through the filter and are, ideally, useful or interesting in the context of the model.
How do I obtain or use a recompilation license for PLS_Toolbox?


Identification of the patterns to filter can be based on a number of different criteria. The full discussion of multivariate filtering methods is outside the scope of this chapter, but it is worth noting that these methods can be very powerful for calibration transfer and instrument standardization problems, as well as for filtering out other differences between measurements which should otherwise be the same (e.g., differences in the same sample due to changes with time, or differences within a class of items being used in a classification problem).
===Possible Solutions:===


One common method to identify the multivariate filter "target" uses the Y-block of a multivariate regression problem. This Y-block contains the quantitative (or qualitative) values for each sample and, theoretically, samples with the same value in the Y-block should have the same covariance structure (i.e., they should be similar in a multivariate fashion). A multivariate filter can be created which attempts to remove differences between samples with similar y-values. This filter should reduce the complexity of any regression model needed to predict these data. Put in mathematical terms, the multivariate filter removes signals in the X-block (measured responses) which are orthogonal to the Y-block (property of interest).
The standard [http://www.eigenvector.com/software/license_evri.html PLS_Toolbox license] does not permit recompilation of any part of the code without written permission from Eigenvector Research, Inc. This permission is usually granted in the form of a recompilation license (for more information on recompilation licenses, see our [http://www.eigenvector.com/evriblog/?p=27 blog post on Compiling PLS_Toolbox]).


Three multivariate filtering methods are provided in the Preprocessing window: Orthogonal Signal Correction (OSC), Generalized Least Squares Weighting (GLSW), and External Parameter Orthogonalization (EPO), where the last of these also encompasses Extended Mixture Model (EMM) filtering. In the context of the Preprocessing window, all of these methods require a Y-block and are thus only relevant in the context of regression models. Additionally, as of the current version of PLS_Toolbox, the graphical interface access to these functions only permits their use to orthogonalize to a Y-block, not for calibration transfer applications. From the command line, however, these functions can also be used for calibration transfer or other filtering tasks. For more information on these uses, please see the calibration transfer and instrument standardization chapter of this manual.
If you have purchased a recompilation license for PLS_Toolbox and/or other Matlab-based Eigenvector Research products, you can use the following instructions to compile your application including the licensed Eigenvector Research (EVRI) code.


===OSC (Orthogonal Signal Correction)===
# If you were not supplied an ''evrilicense.lic'' file by EVRI, create one by copying the license code supplied for your compilation license (found on the download tab of your EVRI account) into a plain-text file named ''evrilicense.lic''. The file should consist of the license code on a single line of the file. For example: <pre>12345678-98765432-ab-1234-1234</pre>
# Copy the ''evrilicense.lic'' file into one of the folders on your Matlab path. This could be either one of the PLS_Toolbox folders, or your application's folder.
# Add the ''evrilicense.lic'' file to the "Shared Resources" list in the Matlab project builder. This will ensure that the EVRI license gets included in the compiled application.
# Compile your application as usual using Mathworks' standard instructions. The Matlab dependency logic will automatically include the PLS_Toolbox functions in your compiled application. (See note below regarding "blocking" certain functions from being included.)


Orthogonal Signal Correction (Sjöblom et al., 1998) removes variance in the X-block which is orthogonal to the Y-block. Such variance is identified as some number of factors (described as components) of the X-block which have been made orthogonal to the Y-block. When applying this preprocessing to new data, the same directions are removed from the new data prior to applying the model.
'''Blocking Unnecessary Functions'''


The algorithm starts by identifying the first principal component (PC) of the X-block. Next, the loading is rotated to make the scores be orthogonal to the Y-block. This loading represents a feature which is not influenced by changes in the property of interest described in the Y-block. Once the rotation is complete, a PLS model is created which can predict these orthogonal scores from the X-block. The number of components in the PLS model is adjusted to achieve a given level of captured variance for the orthogonal scores. Finally, the weights, loadings, and predicted scores are used to remove the given orthogonal component, and are also set aside for use when applying OSC to a new unknown sample. This entire process can then be repeated on the "deflated" X-block (the X-block with the previously-identified orthogonal component removed) for any given number of components. Each cycle results in additional PLS weights and loadings being added to the total that will be used when applying to new data.
By default, Matlab's compiler automatically identifies all m-files which are necessary to run your application and includes all of these in the compiler output. Because of the integrated nature of many of the PLS_Toolbox functions, this can lead to "sprawl": inclusion of many more functions than are actually needed. The following steps can be taken to reduce the size of a compiled application:


There are three settings for the OSC preprocessing method: number of components, number of iterations, and tolerance level. The number of components defines how many times the entire process will be performed. The number of iterations defines how many cycles will be used to rotate the initial PC loading to be as orthogonal to Y as possible. The tolerance level defines the percent variance that must be captured by the PLS model(s) of the orthogonalized scores.
* Remove PLS_Toolbox 'dems' folder and 'help' folder from your path prior to compiling. Files in these folders can be large and are unnecessary for compilation.


In the Preprocessing window, this method allows for adjustment of the settings identified above. From the command line, this method is performed using the osccalc and oscapp functions.
* Add "dummy" functions to reduce dependencies:
:: One way to help reduce these unnecessary additions is to create empty "shell" functions to overload certain PLS_Toolbox functions. These functions, if placed in a folder above PLS_Toolbox when you are compiling, will shadow (hide) the actual function and help avoid sprawl. In particular the following functions are useful to shadow:


===GLS Weighting and EPO===
:* analysis.m
:* browse.m
:* plotgui.m
:* evriinstall.m
:* evrireporterror.m


Generalized Least Squares Weighting (GLSW) is a filter calculated from the differences between samples which should otherwise be similar. These differences are considered interferences or "clutter" and the filter attempts to down-weight (shrink) those interferences. A simplified version of GLSW is called External Parameter Orthogonalization (EPO), which does an orthogonalization (complete subtraction) of some number of significant patterns identified as clutter. A simplified version of EPO emulates the Extended Mixture Model (EMM) in which all identified clutter patterns are orthogonalized to.
:: These functions will not be called in normal operation and, in most cases, our compilation licenses do not permit their inclusion in your application anyway.


====Clutter Identification====
* Find top level functions and see if you can "manually" determine dependencies. Look at the results of the top level dependency check and see what functions are called from the primary PLS_Toolbox function you're working with. If the dependencies are few, you may be able to iterate over the results (get 'toponly' dependencies from results) and get a smaller subset of dependencies. '''NOTE''': This will require some experimentation and time to work through. The dataset object is extensively used by most functions, so this folder should almost always be included.


In the case of a classification problem, similar samples would be the members of a given class. Any variation within each class group (known as "within-class variance") can be considered clutter which will make the classification task harder. The goal of GLSW in this case is to remove this within-class variance as much as possible without reducing the separation between the classes (the between-class variance).
<pre> [fList, pList] = matlab.codetools.requiredFilesAndProducts('peakfind','toponly') </pre>


In the case of a calibration transfer problem, similar samples would be data from the same samples measured on two different instruments or on the same instrument at two different points in time. The goal of GLSW is to down-weight the differences between the two instruments and, therefore, make them appear more similar. A regression model built from GLSW-filtered data can be used on either instrument after applying the filtering to any measured spectrum. Although this specific application of GLSW is not covered by this chapter, the description below gives the mathematical basis of this use.
'''Uninstall the Stats Toolbox '''


GLSW can also be used prior to building a regression model in order to remove variance from the X-block which is mostly orthogonal to the Y-block. This application of GLSW is similar to OSC (see above), and such filtering can allow a regression model to achieve a required error of calibration and prediction using fewer latent variables. In this context, GLSW uses samples with similar Y-block values to identify the sources of variance to down-weight.
Although moving the Stats Toolbox below PLS_Toolbox on your MATLAB path (or removing the Stats Toolbox folders altogether) will allow the PLS_Toolbox DataSet Object to function normally, you must uninstall the Stats Toolbox before compiling PLS_Toolbox functions that require the DataSet Object.


In all cases, the default algorithm for GLSW uses a single adjustable parameter, <math>\alpha</math>, which defines how strongly GLSW downweights interferences. Adjusting  <math>\alpha</math>    towards larger values (typically above 0.001) decreases the effect of the filter. Smaller  <math>\alpha</math>s (typically 0.001 and below) apply more filtering.
The MathWorks states:


====GLSW Algorithm====
"When you compile [a program] into an application and run it, the MATLAB Compiler Run-time references its in-built Dataset function which is higher in its PATH and hence runs the data against this inbuilt Dataset function."


The GLSW algorithm will be described here for the calibration transfer application (because it is simpler to visualize) and then the use of GLSW in classification and regression applications will be described. In all cases, the approach involves the calculation of a covariance matrix from the differences between similar samples. In the case of calibration transfer problems, this difference is defined as the numerical difference between the two groups of mean-centered transfer samples. Given two sample matrices, X1 and X2, the data are mean-centered and the difference calculated:
For more information on the DataSet Object history see here:
*[http://www.eigenvector.com/evriblog/?p=10 DataSet Object Conflict]
*[http://www.eigenvector.com/evriblog/?p=11 DataSet Object — Letter to MathWorks March 15, 2007]


:<math>\mathbf{X}_{1,mc}=\mathbf{X}_{1}-\mathbf{1}\bar{\mathbf{x}}_{1}</math> <div align="right">(1)</div>
'''Troubleshooting'''


:<math>\mathbf{X}_{2,mc}=\mathbf{X}_{2}-\mathbf{1}\bar{\mathbf{x}}_{2}</math> <div align="right">(2)</div>
* In some cases PLS_Toolbox may need to be moved out of the default installation folder into a folder with more permissions and/or no spaces in the path. For example, "C:\eigenvector\PLS_Toolbox".
'''Still having problems? Please contact our helpdesk at [mailto:helpdesk@eigenvector.com helpdesk@eigenvector.com]'''


:<math>\mathbf{X}_{d}=\mathbf{X}_{2,mc}-\mathbf{X}_{1,mc}</math> <div align="right">(3)</div>
[[Category:FAQ]]
 
 
where '''1''' is a vector of ones equal in length to the number of rows in '''X<sub>1</sub>''',  <math>\bar{x}_1</math>  is the mean of all rows of '''X<sub>1</sub>''', and  <math>\bar{x}_2</math>  is the mean of all rows of '''X<sub>2</sub>'''. Note that this requires that '''X<sub>1</sub>''' and '''X<sub>2</sub>''' are arranged such that the rows are in the same order in terms of samples measured on the two instruments.
 
The next step is to calculate the covariance matrix, C:
 
:<math>\mathbf{C}=\mathbf{X}_d^T\mathbf{X}_d</math> <div align="right">(4)</div>
 
followed by the singular-value decomposition of the matrix, which produces the left eigenvectors, '''V''', and the diagonal matrix of singular values, '''S''':
 
:<math>\mathbf{C}=\mathbf{V}\mathbf{S}^2\mathbf{V}^T</math> <div align="right">(5)</div>
 
 
Next, a weighted, ridged version of the singular values is calculated
 
:<math>\mathbf{D}=\sqrt{\frac{\mathbf{S}^2}{\alpha}+\mathbf{1}_D}</math> <div align="right">(6)</div>
 
where '''1'''<sub>D</sub> is a diagonal matrix of ones of appropriate size and  <math>\alpha</math>  is the weighting parameter mentioned earlier. The scale of the weighting parameter depends on the scale of the variance in '''X'''<sub>d</sub>. Finally, the inverse of these weighted eigenvalues is used to calculate the filtering matrix:
 
:<math>\mathbf{G}=\mathbf{V}\mathbf{D}^{-1}\mathbf{V}^T</math> <div align="right">(7)</div>
 
This multivariate filtering matrix can be used by simply projecting a sample into the matrix. The result of this projection is that correlations present in the original covariance matrix are down-weighted (to the extent defined by    <math>\alpha</math> ). The filtering matrix is used both on the original calibration data prior to model calibration, and any future new data prior to application of the regression model.
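
Equations 1 through 7 can be sketched in a few lines of base MATLAB. This is a minimal illustration under stated assumptions, not the code of the [[glsw]] function: the matrices <code>X1</code> and <code>X2</code> are synthetic stand-ins for transfer samples, the value of <code>alpha</code> is arbitrary, and all variable names are chosen here for illustration only.

<pre>
% Minimal sketch of equations 1-7 (illustration only, not the glsw function).
rng(0);
X1 = randn(10, 6);                        % hypothetical samples on instrument 1
X2 = X1 + 0.1*randn(10, 1)*randn(1, 6);   % same samples plus one interference pattern

alpha = 0.01;                             % weighting parameter (assumed value)

X1mc = X1 - mean(X1, 1);                  % eq. 1 (implicit expansion, R2016b or later)
X2mc = X2 - mean(X2, 1);                  % eq. 2
Xd   = X2mc - X1mc;                       % eq. 3: difference of mean-centered data
C    = Xd' * Xd;                          % eq. 4

[V, S2] = eig((C + C')/2);                % eq. 5: eigenvectors V, eigenvalues S^2
s2   = max(diag(S2), 0);                  % clip tiny negative values from round-off
D    = diag(sqrt(s2/alpha + 1));          % eq. 6
G    = V / D * V';                        % eq. 7: V*inv(D)*V'

X1f = X1mc * G;                           % apply the filter to each data set
X2f = X2mc * G;
</pre>

Comparing <code>norm(X2f - X1f, 'fro')</code> with <code>norm(Xd, 'fro')</code> gives a quick indication of how strongly the chosen <code>alpha</code> down-weights the difference between the two data sets.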
 
The choice of <math>\alpha</math> depends on the scale of the original values but also how similar the interferences are to the net analyte signal. If the interferences are similar to the variance necessary to the analytical measurement, then  <math>\alpha</math>    will need to be higher in order to keep from removing analytically useful variance. However, a higher  <math>\alpha</math>    will decrease the extent to which interferences are down-weighted. In practice, values between 1 and 0.0001 are often used.
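
The effect of <math>\alpha</math> can be made explicit by combining equations 6 and 7. The following short derivation assumes that '''V''' is a complete orthonormal set of eigenvectors; the symbols ''s''<sub>i</sub> (the ''i''<sup>th</sup> singular value) and '''v'''<sub>i</sub> (the ''i''<sup>th</sup> column of '''V''') are notation introduced here for illustration:

:<math>\mathbf{G}=\sum_{i}\frac{1}{\sqrt{s_i^2/\alpha+1}}\,\mathbf{v}_i\mathbf{v}_i^T</math>

Each direction '''v'''<sub>i</sub> is therefore scaled by a factor between 0 and 1. When <math>s_i^2 \gg \alpha</math> the factor is approximately <math>\sqrt{\alpha}/s_i</math> and the direction is strongly down-weighted; when <math>s_i^2 \ll \alpha</math> the factor is close to 1 and the direction passes nearly unchanged. For example, a direction with <math>s_i^2=1</math> is scaled by <math>1/\sqrt{101}\approx 0.0995</math> when <math>\alpha=0.01</math>, but only by <math>1/\sqrt{2}\approx 0.707</math> when <math>\alpha=1</math>.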
 
====Y-Gradient GLSW====
When using GLSW to filter out X-block variance which is orthogonal to a Y-block, a different approach is used to calculate the difference matrix,  '''X'''<sub>d</sub>. In this situation we have only one X-block, '''X''', of measured calibration samples, but we also have a Y-block, '''y''' (here defined only for a single column-vector), of reference measurements. To a first approximation, the Y-block can be considered a description of the similarity between samples. Samples with similar y values should have similar values in the X-block.
 
In order to identify the differences between samples with similar y values, the rows of the X- and Y-blocks are first sorted in order of increasing y value. This puts samples with similar values near each other in the matrix. Next, the difference between proximate samples is determined by calculating the derivative of each column of the X-block. These derivatives are calculated using a 5-point, first-order, Savitzky-Golay first derivative (note that a first-order polynomial derivative is essentially a block-average derivative including smoothing and derivatizing simultaneously). This derivative yields a matrix,  '''X'''<sub>d</sub>  , in which each sample (row) is an average of the difference between it and the four samples most similar to it. A similar derivative is calculated for the sorted Y-block, yielding vector  '''y'''<sub>d</sub>    , a measure of how different the y values are for each group of 5 samples.
 
At this point,  '''X'''<sub>d</sub>    could be used in equation 4 to calculate the covariance matrix of differences. However, some of the calculated differences (rows) may have been done on groups of samples with significantly different y values. These rows contain features which are correlated to the Y-block and should not be removed by GLS. To avoid this, the individual rows of  '''X'''<sub>d</sub>    need to be re-weighted by converting the sorted Y-block differences into a diagonal re-weighting matrix,  '''W'''    , in which the ''i''<sup>th</sup> diagonal element, ''w''<sub>i</sub>, is calculated from the rearranged equation
 
:<math>\log_2(w_i)=-\mathbf{y}_{d,i}s_{yd}</math> <div align="right">(8)</div>
 
The value  <math>\mathbf{y}_{d,i}</math>  is the ''i''<sup>th</sup> element of the  '''y'''<sub>d</sub>    vector, and  ''s''<sub>yd</sub>    is the standard deviation of y-value differences:
 
:<math>s_{yd}=\sqrt{\sum_{i=1}^m{\frac{(y_{d,i}-\bar{y}_d)^2}{m-1}}}</math> <div align="right">(9)</div>
 
 
The re-weighting matrix is then used along with  '''X'''<sub>d</sub>    to form the covariance matrix
 
:<math>\mathbf{C}=\mathbf{X}_d^T\mathbf{W}^{2}\mathbf{X}_d</math> <div align="right">(10)</div>
 
which is then used in equations 5 through 7 as described above.
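
The Y-gradient steps above can be sketched in base MATLAB as follows. This is an illustration under stated assumptions rather than the toolbox implementation: the data are synthetic, only interior rows are kept (how the ends of the sorted data are handled is an assumption made here), and equation 8 is applied exactly as written above.

<pre>
% Sketch of the Y-gradient difference and re-weighting (equations 8-10).
rng(0);
y = rand(20, 1);                          % hypothetical reference values
X = y*randn(1, 6) + 0.05*randn(20, 6);    % hypothetical X-block

[ys, order] = sort(y);                    % sort samples by increasing y
Xs = X(order, :);

k  = [-2 -1 0 1 2]'/10;                   % 5-point, first-order Savitzky-Golay
                                          % first-derivative weights
Xd = conv2(Xs, flipud(k), 'valid');       % flipped because conv2 reverses the kernel
yd = conv2(ys, flipud(k), 'valid');       % same derivative for the sorted y values

syd = std(yd);                            % eq. 9 (normalized by m-1)
w   = 2.^(-yd*syd);                       % eq. 8, as written above
C   = Xd' * diag(w.^2) * Xd;              % eq. 10
</pre>

The resulting <code>C</code> would then take the place of the covariance matrix in equations 5 through 7.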
 
This approach is discussed in:
:B. M. Zorzetti, J. M. Shaver, J. J. Harynuk, "Estimation of the age of a weathered mixture of volatile organic compounds," Analytica Chimica Acta, '''694''', 31–37, 2011.
 
====External Parameter Orthogonalization (EPO)====
An alternative multivariate filter called External Parameter Orthogonalization (EPO) uses the same process as GLSW except that only a certain number of eigenvectors calculated in equation 5 are kept and the '''D''' matrix calculated in equation 6 is a diagonal matrix of ones. The result is that '''X''' is "hard-orthogonalized" to the eigenvectors (the directions are completely removed) rather than simply "shrinking" these directions as is done with GLSW.
 
If all of the calculated eigenvectors are used in an EPO filter, the method becomes equivalent to the Extended Mixture Model (EMM) method described in Martens and Naes 1989.
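
The "hard-orthogonalization" described above can be illustrated with a standard orthogonal-complement projector. This sketch is an assumption about how such a filter can be written in base MATLAB, not the toolbox code; the data, the variable names, and the choice of one retained eigenvector are all hypothetical.

<pre>
% Sketch of an EPO-style filter: completely remove the first ncomp clutter directions.
rng(0);
Xd = randn(8, 1)*randn(1, 6) + 0.01*randn(8, 6);  % hypothetical difference matrix
X  = randn(5, 6);                                 % hypothetical data to filter
ncomp = 1;                                        % number of eigenvectors to remove

[~, ~, V] = svd(Xd, 'econ');              % right singular vectors of Xd are the
                                          % eigenvectors of Xd'*Xd (equation 5)
Vk = V(:, 1:ncomp);                       % keep the largest ncomp directions
G  = eye(size(Xd, 2)) - Vk*Vk';           % projector onto their orthogonal complement
Xf = X * G;                               % filtered data

% Xf*Vk is zero to within round-off: the kept directions are fully removed.
</pre>

In contrast to the GLSW filter, there is no <math>\alpha</math> here; the only adjustable quantity in this sketch is the number of directions removed.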
 
For a literature reference on EPO, see: Roger, Chauchard, Bellon-Maurel, "EPO–PLS external parameter orthogonalisation of PLS application to temperature-independent measurement of sugar content of intact fruits." Chemom. Intell. Lab. Syst., 66, 191– 204 (2003).
 
====Settings and Command-line Usage====
 
In the Preprocessing window, the GLSW method has a [[Declutter_Settings_Window|Settings Window]] to allow for adjustment of the weighting parameter,  <math>\alpha</math>, whether or not to include mean-centering ("ignore means"), whether to use '''EPO''' mode and select a given number of components to orthogonalize to, or whether to use '''EMM/ELS''' mode in which the data are orthogonalized to all available components. From the command line, this method is performed using the [[glsw]] function, which also permits a number of other modes of application (including identification of "classes" of similar samples).

Revision as of 11:28, 28 June 2019

Issue:

How do I obtain or use a recompilation license for PLS_Toolbox?

Possible Solutions:

The standard PLS_Toolbox license does not permit recompilation of any part of the code without written permission from Eigenvector Research, Inc. This permission is usually granted in the form of a recompilation license (for more information on recompilation licenses, see our blog post on Compiling PLS_Toolbox).

If you have purchased a recompilation license for PLS_Toolbox and/or other Matlab-based Eigenvector Research products, you can use the following instructions to compile your application including the licensed Eigenvector Research (EVRI) code.

  1. If you were not supplied an evrilicense.lic file by EVRI, create one by copying the license code supplied for your compilation license (found on the download tab of your EVRI account) into a plain-text file named evrilicense.lic. The file should consist of the license code on a single line of the file. For example:
    12345678-98765432-ab-1234-1234
  2. Copy the evrilicense.lic file into one of the folders on your Matlab path. This could be either one of the PLS_Toolbox folders, or your application's folder.
  3. Add the evrilicense.lic file to the "Shared Resources" list in the Matlab project builder. This will ensure that the EVRI license gets included in the compiled application.
  4. Compile your application as usual using Mathworks' standard instructions. The Matlab dependency logic will automatically include the PLS_Toolbox functions in your compiled application. (See note below regarding "blocking" certain functions from being included.)
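
Steps 1 and 2 can also be carried out from the MATLAB command line. The following is a minimal sketch using only base MATLAB file functions; the license code is the placeholder from the example above, and writing the file into the current folder assumes that this is your application's folder:

    code = '12345678-98765432-ab-1234-1234';   % placeholder, use your own license code
    fid  = fopen('evrilicense.lic', 'w');      % create a plain-text file
    assert(fid > 0, 'Could not create evrilicense.lic');
    fprintf(fid, '%s', code);                  % license code on a single line
    fclose(fid);
    which('evrilicense.lic')                   % shows which copy MATLAB finds

If which reports a different copy of the file than the one just written, an older license file elsewhere on the path is taking precedence.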

Blocking Unnecessary Functions

By default, Matlab's compiler automatically identifies all m-files which are necessary to run your application and includes all of these in the compiler output. Because of the integrated nature of many of the PLS_Toolbox functions, this can lead to "sprawl": inclusion of many more functions than are actually needed. The following steps can be taken to reduce the size of a compiled application:

  • Remove PLS_Toolbox 'dems' folder and 'help' folder from your path prior to compiling. Files in these folders can be large and are unnecessary for compilation.
  • Add "dummy" functions to reduce dependencies:
One way to help reduce these unnecessary additions is to create empty "shell" functions to overload certain PLS_Toolbox functions. These functions, if placed in a folder above PLS_Toolbox when you are compiling, will shadow (hide) the actual function and help avoid sprawl. In particular the following functions are useful to shadow:
  • analysis.m
  • browse.m
  • plotgui.m
  • evriinstall.m
  • evrireporterror.m
These functions will not be called in normal operation and, in most cases, our compilation licenses do not permit their inclusion in your application anyway.
  • Find top level functions and see if you can "manually" determine dependencies. Look at the results of the top level dependency check and see what functions are called from the primary PLS_Toolbox function you're working with. If the dependencies are few, you may be able to iterate over the results (get 'toponly' dependencies from results) and get a smaller subset of dependencies. NOTE: This will require some experimentation and time to work through. The dataset object is extensively used by most functions, so this folder should almost always be included.
 [fList, pList] = matlab.codetools.requiredFilesAndProducts('peakfind','toponly') 
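
The dependency check above returns a cell array of file paths, which can be narrowed to the toolbox's own files. The following is a minimal sketch, assuming that the PLS_Toolbox installation folder has 'PLS_Toolbox' in its name; peakfind is simply the example function used above:

    % List the direct ('toponly') dependencies of one function
    fList = matlab.codetools.requiredFilesAndProducts('peakfind', 'toponly');

    % Keep only files whose path contains 'PLS_Toolbox' (assumed folder name)
    isPLS    = contains(fList, 'PLS_Toolbox');
    plsFiles = fList(isPLS);

    fprintf('%s\n', plsFiles{:});              % print one file per line

Repeating this for each listed file gives a manual view of the dependency tree, one level at a time.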

Uninstall the Stats Toolbox

Although moving the Stats Toolbox below PLS_Toolbox on your MATLAB path (or removing the Stats Toolbox folders altogether) will allow the PLS_Toolbox DataSet Object to function normally, you must uninstall the Stats Toolbox before compiling PLS_Toolbox functions that require the DataSet Object.

The MathWorks states:

"When you compile [a program] into an application and run it, the MATLAB Compiler Run-time references its in-built Dataset function which is higher in its PATH and hence runs the data against this inbuilt Dataset function."

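For more information on the DataSet Object history, see the Eigenvector blog post "DataSet Object Conflict" (http://www.eigenvector.com/evriblog/?p=10) and the letter to MathWorks of March 15, 2007 (http://www.eigenvector.com/evriblog/?p=11).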

Troubleshooting

  • In some cases PLS_Toolbox may need to be moved out of the default installation folder into a folder with more permissions and/or no spaces in the path. For example, "C:\eigenvector\PLS_Toolbox".

Still having problems? Please contact our helpdesk at helpdesk@eigenvector.com