Auto-Alignment and Missing Data Replacement

From Eigenvector Research Documentation Wiki
Revision as of 15:01, 22 February 2011 by imported>Jeremy

Validation data can sometimes be missing variables that were available in the calibration data, or the variables provided may have different axisscale values (if relevant to the x block). Within certain limits, Analysis will attempt to handle these scenarios using either Auto-Alignment or Missing Data Replacement.

Auto-Alignment

Missing or shifted variables are not uncommon in multivariate data analysis. The Auto-Alignment procedure (using the matchvars function of PLS_Toolbox) can help correct for these problems.

If the calibration data contains labels for the x block variables, and labels are present in both the calibration and validation sets, then auto-alignment reorders the validation x block variables to match the calibration data. If any labels were present in the calibration data but not in the validation data, these variables appear as missing data (NaN = Not a Number) and will be handled by Missing Data Replacement.
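
The label-based reordering described above can be illustrated with a minimal Python sketch. This is not PLS_Toolbox code and makes no claim about how matchvars is implemented; the function name align_by_labels and the example labels are hypothetical, and dropping validation variables that have no calibration counterpart is an assumption made for the illustration.

```python
import math

def align_by_labels(cal_labels, val_labels, val_rows):
    """Reorder validation columns into the calibration label order.

    Labels present in the calibration data but absent from the validation
    data are filled with NaN. Validation columns whose label is unknown to
    the calibration data are dropped (an assumption of this sketch).
    """
    # Map each validation label to its column index.
    col_of = {label: j for j, label in enumerate(val_labels)}
    aligned = []
    for row in val_rows:
        aligned.append([row[col_of[lab]] if lab in col_of else math.nan
                        for lab in cal_labels])
    return aligned

cal_labels = ["Fe", "Cu", "Zn", "Pb"]
val_labels = ["Zn", "Fe", "Pb", "Ni"]   # Cu is missing, Ni is extra
val_rows = [[3.0, 1.0, 4.0, 9.0],
            [3.5, 1.5, 4.5, 9.5]]

aligned = align_by_labels(cal_labels, val_labels, val_rows)
# Each aligned row is now ordered Fe, Cu, Zn, Pb, with NaN in the Cu column.
```

The NaN column is what Missing Data Replacement would subsequently try to fill.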

If the calibration data contains an axisscale for the x block variables, and an axisscale is also present on the validation data, but the two axisscales differ, Analysis performs shifting and/or interpolation to match up the variables. If needed variables fall outside the axisscale range of the validation data, they would require extrapolation and are instead marked as missing data (NaN).

Note: By default, the axisscale alignment performs a linear interpolation, which is not appropriate for data in which adjacent variables are not highly correlated (e.g. narrow peaks in Mass Spectrometry, Chromatography, or NMR).
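
As a rough illustration of axisscale alignment by linear interpolation, the Python sketch below maps one validation row onto a calibration axisscale and marks out-of-range points as NaN instead of extrapolating. It is not the matchvars implementation; the function name align_by_axisscale and the example values are hypothetical, and the sketch assumes a strictly increasing validation axisscale with at least two points.

```python
import math
from bisect import bisect_right

def align_by_axisscale(cal_axis, val_axis, val_row):
    """Linearly interpolate one validation row onto the calibration axisscale.

    Assumes val_axis is strictly increasing and has at least two points.
    Calibration points outside the validation range are set to NaN rather
    than extrapolated.
    """
    out = []
    for x in cal_axis:
        if x < val_axis[0] or x > val_axis[-1]:
            out.append(math.nan)   # would require extrapolation
            continue
        # Right-hand neighbour, clamped so that x == val_axis[-1] still works.
        hi = min(bisect_right(val_axis, x), len(val_axis) - 1)
        lo = hi - 1
        frac = (x - val_axis[lo]) / (val_axis[hi] - val_axis[lo])
        out.append(val_row[lo] + frac * (val_row[hi] - val_row[lo]))
    return out

cal_axis = [1000.0, 1002.0, 1004.0, 1006.0]
val_axis = [1001.0, 1003.0, 1005.0, 1007.0]   # shifted by one unit
val_row = [10.0, 20.0, 30.0, 40.0]

aligned_row = align_by_axisscale(cal_axis, val_axis, val_row)
# The first calibration point (1000.0) lies below the validation range,
# so it becomes NaN; the remaining points are interpolated.
```

Because each output value is a weighted average of its two neighbours, a narrow peak that falls between validation points is flattened, which is the limitation the note above warns about.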

Missing Data Replacement

When data has been marked as "missing", either because auto-alignment could not locate the needed variables or because they were originally provided as missing, Analysis will attempt to replace these values using the current model as a template. This imputation procedure projects the known data onto the model and then replaces the missing values using that projection and the loadings of the model. The procedure uses the replace function and is generally unbiased, assuming the model is accurate. In fact, the replaced values can be viewed as the values that would not influence the model predictions.
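
The projection-then-replacement idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the replace function's implementation: it assumes a factor model of the form x ≈ P t (loadings P, scores t), estimates t by least squares from the known variables only, and fills each missing variable with its model estimate. The names solve and replace_missing and the example loadings are hypothetical, and preprocessing such as mean-centering is assumed to have been applied already.

```python
import math

def solve(A, b):
    """Solve the small square system A t = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))   # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    t = [0.0] * n
    for r in reversed(range(n)):                        # back-substitution
        s = sum(M[r][k] * t[k] for k in range(r + 1, n))
        t[r] = (M[r][n] - s) / M[r][r]
    return t

def replace_missing(x, loadings):
    """Fill NaN entries of x using the model x ~ P t.

    loadings[j][a] is the loading of variable j on component a.
    Assumes enough variables are known to determine the scores t.
    """
    known = [j for j, v in enumerate(x) if not math.isnan(v)]
    ncomp = len(loadings[0])
    # Normal equations (Pk' Pk) t = Pk' xk, using the known variables only.
    A = [[sum(loadings[j][a] * loadings[j][b] for j in known)
          for b in range(ncomp)] for a in range(ncomp)]
    rhs = [sum(loadings[j][a] * x[j] for j in known) for a in range(ncomp)]
    t = solve(A, rhs)
    # Keep known values; replace missing ones with the model estimate.
    return [v if not math.isnan(v)
            else sum(loadings[j][a] * t[a] for a in range(ncomp))
            for j, v in enumerate(x)]

# Hypothetical two-component loadings for four variables
# (left unnormalized to keep the arithmetic readable).
loadings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
x = [2.0, 3.0, math.nan, -1.0]   # third variable is missing

filled = replace_missing(x, loadings)
```

In this sketch the filled-in value lies exactly on the model, so it contributes no residual and does not change the scores estimated from the known variables, which is one way to read the statement that the replaced values do not influence the predictions.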

Because not all model types contain such a representation of the data, replacement is not possible with every model type. In these cases, it is useful to build a factor-based model (for example, PCA, PCR, or PLS) of the calibration data, apply that model to the validation data, and have Analysis use the automatic data replacement algorithm to fill in the missing data. The new validation data (with the missing data filled in) can then be used with any other model type without error.