Auto-Alignment and Missing Data Replacement

From Eigenvector Research Documentation Wiki

Applying a model to validation or new data is generally not possible when those data are missing variables that were available in the calibration data, or contain variables whose axisscale values differ from those of the calibration data. Within limits, however, Analysis will attempt to handle these scenarios using Auto-Alignment, Missing Data Replacement, or both, as described below.

Auto-Alignment

Missing or shifted variables are not uncommon in multivariate data analysis. The Auto-Align procedure (using the matchvars function of PLS_Toolbox) can help correct for these problems.

If both the calibration and validation data contain labels for the x block variables, auto-alignment reorders the validation x block variables to match the calibration data. Any variables whose labels are present in the calibration data but not in the validation data appear as missing data (NaN = Not a Number) and are handled by Missing Data Replacement.
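The label-matching step can be pictured with a short Python sketch. This is only an illustration of the idea, not the matchvars implementation: the function name `align_by_labels` and the example labels are hypothetical, and dropping validation variables that the calibration data never used is an assumption made for the sketch.

```python
import math

def align_by_labels(cal_labels, val_labels, val_row):
    """Reorder one validation sample into the calibration variable order.

    Variables named in cal_labels but absent from val_labels become NaN.
    Validation variables not named in cal_labels are dropped (an assumption).
    """
    position = {label: i for i, label in enumerate(val_labels)}
    return [val_row[position[label]] if label in position else math.nan
            for label in cal_labels]

# Hypothetical labels: the validation data is reordered, lacks "pressure",
# and carries an extra "ph" variable.
cal_labels = ["temp", "pressure", "flow"]
val_labels = ["flow", "temp", "ph"]
val_row = [3.2, 71.0, 6.8]

aligned = align_by_labels(cal_labels, val_labels, val_row)
print(aligned)  # [71.0, nan, 3.2]
```

The NaN left in the "pressure" position is the kind of entry that Missing Data Replacement would then try to fill.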

If the calibration data contains an axisscale for the x block variables, and a corresponding axisscale is present on the validation data, but the two axisscales are not equivalent, Analysis performs shifting and/or interpolation to match up the variables. Any needed variables that fall outside the axisscale range of the validation data would require extrapolation and are instead marked as missing data (NaN).

Note: By default, the axisscale alignment performs a linear interpolation, which is not appropriate for data in which adjacent variables are not highly correlated (e.g., narrow peaks in mass spectrometry, chromatography, or NMR).
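To make the interpolation behavior concrete, here is a minimal Python sketch of linear interpolation onto a calibration axisscale, with out-of-range points marked as NaN rather than extrapolated. It is not the code Analysis uses: `interp_to_axis` and the axis values are hypothetical, and the validation axisscale is assumed to be strictly increasing.

```python
import math
from bisect import bisect_right

def interp_to_axis(cal_axis, val_axis, val_row):
    """Linearly interpolate val_row (sampled at val_axis) onto cal_axis.

    val_axis must be strictly increasing. Calibration points outside the
    validation range are set to NaN instead of being extrapolated.
    """
    out = []
    for x in cal_axis:
        if x < val_axis[0] or x > val_axis[-1]:
            out.append(math.nan)          # would need extrapolation
            continue
        hi = bisect_right(val_axis, x)
        if hi == len(val_axis):           # x sits exactly on the last point
            out.append(val_row[-1])
            continue
        lo = hi - 1
        frac = (x - val_axis[lo]) / (val_axis[hi] - val_axis[lo])
        out.append(val_row[lo] + frac * (val_row[hi] - val_row[lo]))
    return out

cal_axis = [1000.0, 1002.0, 1004.0, 1006.0]
val_axis = [1001.0, 1003.0, 1005.0]

smooth = interp_to_axis(cal_axis, val_axis, [10.0, 20.0, 30.0])
print(smooth)  # [nan, 15.0, 25.0, nan]

# A one-point-wide peak is smeared: its height of 100 never appears.
spike = interp_to_axis(cal_axis, val_axis, [0.0, 100.0, 0.0])
print(spike)   # [nan, 50.0, 50.0, nan]
```

The second call shows why the note matters: for a smoothly varying signal the interpolated values are reasonable, but a narrow peak that falls between calibration points is halved and spread across its neighbors.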

Missing Data Replacement

When data has been marked as "missing" either because auto-alignment could not locate the needed variables or because they were provided as missing originally, Analysis will attempt to replace these variables using the current model as a template. This imputation procedure uses projection of the known data into the model followed by replacement of the missing values using the projection and loadings of the model. This procedure uses the replace function and is generally unbiased assuming the model is accurate. In fact, the replaced values can be seen as the values which avoid influence on the model predictions. (Note that during calibration, the mdcheck function is used to replace missing data.)
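The projection-and-reconstruction idea can be sketched in Python for the simplest case, a one-component factor model. This is an illustrative assumption-laden sketch, not the algorithm of the replace function: `replace_missing`, the loadings, and the mean vector are all hypothetical, and a real model would usually have several components and its own preprocessing.

```python
import math

def replace_missing(x, loadings, mean):
    """Fill NaN entries of x using a one-component factor model.

    The score is estimated by least squares from the known variables only,
    then each missing entry is rebuilt as mean + score * loading.
    """
    known = [i for i, v in enumerate(x) if not math.isnan(v)]
    num = sum((x[i] - mean[i]) * loadings[i] for i in known)
    den = sum(loadings[i] ** 2 for i in known)
    if den == 0.0:
        raise ValueError("known variables carry no loading; score is undefined")
    score = num / den
    return [v if not math.isnan(v) else mean[i] + score * loadings[i]
            for i, v in enumerate(x)]

# Hypothetical model and sample; the second variable is missing.
loadings = [1.0, 2.0, 2.0]
mean = [10.0, 20.0, 30.0]
x = [12.0, math.nan, 34.0]

filled = replace_missing(x, loadings, mean)
print(filled)  # [12.0, 24.0, 34.0]
```

In this sketch the filled value lies exactly on the model, so it adds no residual of its own and does not change the score estimated from the known variables, which is one way to read the statement that the replaced values avoid influencing the predictions.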

Note that this kind of data replacement requires correlation among the variables. If your data is missing unique variables that have little or no correlation with other variables, these data cannot be replaced successfully, and you may have no indication of this failure. It is incumbent on the user to review which data are missing (see View > Missing Data Map in the DataSet Editor, or the Missing Data count in Plot Controls, for visualizations of the missing variables).

Because not all model types contain a representation of the data that supports this kind of projection, replacement cannot be done with all models. In these cases, it is useful to build a factor-based model (for example, PCA, PCR, or PLS) of the calibration data, then apply that model to the validation data and have Analysis use the automatic data replacement algorithm to replace the missing data. The new validation data (with the missing data filled in) can then be used with any other model type without error.