Advanced Preprocessing: Introduction
Data preprocessing is widely used in multivariate analysis, but it is often unclear why and when to preprocess the data, and in what order. The large number of available preprocessing methods adds to the confusion. In short, the objective of data preprocessing is to separate the signal of interest from clutter, where clutter is defined as all signal that is not of interest (e.g., signal attributable to interferences and noise). This means that the appropriate preprocessing method depends on the data analysis objective and on how the signal and clutter manifest in the data. The intention here is to provide only a brief introduction to why preprocessing is performed and in what order. Simple preprocessing methods are used as examples to introduce these concepts. A more thorough discussion of the objective, theory and math of each preprocessing procedure is not included here.
Preprocessing is typically performed prior to data analysis methods such as principal components analysis (PCA) or partial least squares regression (PLS). Recall that PCA maximizes the capture of sum-of-squares with factors or principal components (PCs) within a single block of data, and PLS is a slightly more complicated method that finds linear relationships between two blocks of data. This introduction will use PCA in the examples. Two of the simplest preprocessing methods are mean-centering and autoscaling, and both are described in more detail below. First, however, the data analysis objective with no preprocessing is discussed.
Imagine that an $M \times N$ data matrix $\mathbf{X}$ is available and the objective is to perform exploratory analysis of this data using PCA. Recall that samples (or objects) correspond to the rows of $\mathbf{X}$ and variables correspond to the columns. If no preprocessing is applied to $\mathbf{X}$ prior to the PCA decomposition, then the PCA loadings will capture the most sum-of-squares in $\mathbf{X}$ centered about the origin (i.e., the model is a force fit about zero). In this case, the first principal component (PC) will point in the direction that captures the most sum-of-squares about zero (variance about zero).
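As a rough illustration of the force fit about zero, the sketch below estimates the first PCA loading of an uncentered matrix by power iteration on $\mathbf{X}^T\mathbf{X}$. The data values and the function name `first_pc` are hypothetical and chosen only for illustration; plain Python lists are used so that the sketch has no dependencies.

```python
import math

def first_pc(X, iters=200):
    """Estimate the first PCA loading of X (no preprocessing) by
    power iteration on X^T X. Hypothetical helper for illustration."""
    n_cols = len(X[0])
    p = [1.0] * n_cols  # arbitrary non-zero starting vector
    for _ in range(iters):
        # scores t = X p, then updated loading p_new = X^T t
        t = [sum(x * pj for x, pj in zip(row, p)) for row in X]
        p_new = [sum(row[j] * ti for row, ti in zip(X, t)) for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in p_new))
        p = [v / norm for v in p_new]
    return p

# Hypothetical data: variable 1 is constant and far from zero,
# variable 2 carries all of the sample-to-sample variation.
X = [[10.0, 1.0], [10.0, 2.0], [10.0, 3.0], [10.0, 4.0]]
p_raw = first_pc(X)
print(p_raw)
```

For these data the loading is approximately (0.97, 0.25), which is close to the direction of the column means (10, 2.5), even though all of the variation between samples lies in the second variable.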
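Next, define the $N \times 1$ mean of data matrix $\mathbf{X}$ as $\bar{\mathbf{x}}$. The mean is calculated down the rows of $\mathbf{X}$ so that for the $n$th column of $\mathbf{X}$ (i.e., the $n$th element of the vector $\bar{\mathbf{x}}$) the mean is a scalar $\bar{x}_n$ and is given by
$$\bar{x}_n = \frac{1}{M}\sum_{m=1}^{M} x_{m,n} \qquad (1)$$
The mean-centered data $\mathbf{X}_c$ is then calculated by subtracting the column mean from the corresponding column so that
$$x_{c,m,n} = x_{m,n} - \bar{x}_n \quad \text{for } m = 1, \ldots, M \text{ and } n = 1, \ldots, N$$
$$\mathbf{x}_{c,n} = \mathbf{x}_n - \mathbf{1}\bar{x}_n \quad \text{for } n = 1, \ldots, N$$
$$\mathbf{X}_c = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^T \qquad (2)$$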
where $\mathbf{1}$ is an $M \times 1$ vector of ones (typically it is assumed that $\mathbf{1}$ is of appropriate size) and $T$ is the transpose operator. (The notations in Equation 2 provide identical results, but the simplicity of the last form shows why the linear algebra notation is often preferred.) The first step in mean-centering, represented by Equation 1, is to calculate the mean of each column of $\mathbf{X}$. This procedure can be considered “calibration” of the mean-centering preprocessing and it consists of estimating the mean from the “calibration” data $\mathbf{X}$. The second step, represented by Equation 2, subtracts the mean from the data. This procedure can be considered “applying” the centering to the data $\mathbf{X}$. The first PC of a PCA model of $\mathbf{X}_c$ will then capture the most sum-of-squares about the mean (variance about the mean or simply ‘variance’). The mean is now a part of the overall PCA model “calibrated” on the “calibration” data and the mean-centering operation has changed what sum-of-squares is captured by the first PC. In other words, the preprocessing has changed the data to get the PCA model to focus on a different type of variance. As a result, the PCA model must be interpreted differently during the exploratory analysis.
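The two steps of mean-centering can be sketched as a pair of small functions, one for the “calibration” step (Equation 1) and one for the “application” step (Equation 2). The function names and the data values are hypothetical, and the sketch uses plain Python lists rather than a linear algebra library.

```python
def calibrate_mean(X):
    """Equation 1: mean of each column of the calibration data X."""
    M = len(X)
    return [sum(col) / M for col in zip(*X)]

def apply_centering(X, xbar):
    """Equation 2: subtract the stored column means from every row of X."""
    return [[x - m for x, m in zip(row, xbar)] for row in X]

# Hypothetical calibration data: 4 samples (rows), 2 variables (columns)
X = [[10.0, 1.0], [10.0, 2.0], [10.0, 3.0], [10.0, 4.0]]

xbar = calibrate_mean(X)       # calibration: estimate the column means
Xc = apply_centering(X, xbar)  # application: center the calibration data
print(xbar)
print(Xc)
```

Keeping the two steps separate means that the estimated mean can be stored with the model and reused later without being re-estimated.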
Next, assume that a new $M_2 \times N$ data matrix $\mathbf{X}_2$ is available, where $M_2 \ge 1$. To apply the PCA model calibrated above to the new data, the new data set must first be centered to the mean of the calibration data. The preprocessing is “applied” to the new data using a procedure analogous to Equation 2 as follows:
$$\mathbf{X}_{2,c} = \mathbf{X}_2 - \mathbf{1}\bar{\mathbf{x}}^T \qquad (4)$$
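A minimal sketch of this application step is given below for a single new sample. It assumes hypothetical data and helper names, and it contrasts centering to the calibration mean with the mistake of re-estimating the mean from the new data.

```python
def calibrate_mean(X):
    """Column means of X (Equation 1)."""
    return [sum(col) / len(X) for col in zip(*X)]

def apply_centering(X, xbar):
    """Subtract the column means xbar from every row of X."""
    return [[x - m for x, m in zip(row, xbar)] for row in X]

# Hypothetical calibration data and one new sample
X_cal = [[10.0, 1.0], [10.0, 2.0], [10.0, 3.0], [10.0, 4.0]]
X_new = [[11.0, 5.0]]

xbar = calibrate_mean(X_cal)            # estimated once, from calibration data only
X_new_c = apply_centering(X_new, xbar)  # new data centered to the calibration mean

# Mistake: centering the new sample about its own mean erases it entirely.
wrong = apply_centering(X_new, calibrate_mean(X_new))

print(X_new_c)  # [[1.0, 2.5]]
print(wrong)    # [[0.0, 0.0]]
```

With only one new sample, re-estimating the mean removes all information from the sample, which is why the mean stored during calibration is used instead.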
Autoscaling of the data is treated in a manner very similar to mean-centering but the preprocessing includes an additional step. During calibration, Equation 1 is first used to estimate the mean of the calibration data $\mathbf{X}$. Next, the standard deviation $s_n$ of each column is calculated using
$$s_n = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\left(x_{m,n} - \bar{x}_n\right)^2} \qquad (5)$$
Equations 1 and 5 correspond to “calibration” of the autoscaling preprocessing procedure. After calibration, the mean-centered columns are divided by the corresponding standard deviation as follows
$$\mathbf{x}_{as,n} = \frac{\mathbf{x}_n - \mathbf{1}\bar{x}_n}{s_n} \quad \text{for } n = 1, \ldots, N \qquad (6)$$
Autoscaling includes mean-centering and division by the standard deviation, and Equation 6 corresponds to “applying” the preprocessing to the calibration data. Equation 7 shows how the preprocessing is applied to new data $\mathbf{X}_2$.
$$\mathbf{x}_{2,as,n} = \frac{\mathbf{x}_{2,n} - \mathbf{1}\bar{x}_n}{s_n} \quad \text{for } n = 1, \ldots, N \qquad (7)$$
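The autoscaling calibration and application steps can be sketched in the same two-function style. The sketch assumes the sample standard deviation with an $M-1$ denominator, hypothetical function names and data values, and calibration columns that are not constant (a zero standard deviation would cause a division by zero).

```python
import math

def calibrate_autoscale(X):
    """Calibration: estimate the column means and the column standard
    deviations (M-1 denominator assumed) from the calibration data X."""
    M = len(X)
    cols = list(zip(*X))
    xbar = [sum(col) / M for col in cols]
    s = [math.sqrt(sum((x - m) ** 2 for x in col) / (M - 1))
         for col, m in zip(cols, xbar)]
    return xbar, s

def apply_autoscale(X, xbar, s):
    """Application: center each column with the stored mean and divide
    by the stored standard deviation."""
    return [[(x - m) / sd for x, m, sd in zip(row, xbar, s)] for row in X]

# Hypothetical data: variable 2 has a much larger spread than variable 1
X = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
X2 = [[4.0, 700.0]]  # new data, one sample

xbar, s = calibrate_autoscale(X)   # calibration
X_as = apply_autoscale(X, xbar, s)    # applied to the calibration data
X2_as = apply_autoscale(X2, xbar, s)  # applied to the new data

print(xbar, s)
print(X_as)
print(X2_as)
```

In this example each autoscaled calibration column has a sum-of-squares of $M-1 = 2$, so the two variables carry equal weight even though their original standard deviations differ by a factor of 200.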
In summary, the autoscaling preprocessing parameters $\bar{x}_n$ and $s_n$ for $n = 1, \ldots, N$ were estimated from the calibration data $\mathbf{X}$, and the application step used these parameters to center and scale both $\mathbf{X}$ and new data $\mathbf{X}_2$. It should be clear that the calibration data should be sufficiently representative of what is expected in the future if the estimated preprocessing parameters are to adequately represent the mean and standard deviation of new data. Also, variables (columns) with large standard deviation are now down-weighted relative to variables with small standard deviation. This changes the relative sum-of-squares for the preprocessed data and the first PC will now capture the largest sum-of-squares relative to the mean of the weighted (autoscaled) matrix $\mathbf{X}_{as}$.
Although outside the scope of the present introduction, it should be noted that some preprocessing methods do not operate down the rows (i.e., on each column) but instead operate across the columns (i.e., on each row, or sample). For such methods, quantities such as the mean and standard deviation might not be estimated from the calibration data. However, these methods most often include settings or parameters that dictate how they operate, and it is important that all the data are treated similarly. As a result, the preprocessing settings are a part of the model just like the estimated means and standard deviations: both are established during the “calibration” step and stored as a part of the model. Subsequently, during the model application step, the stored preprocessing parameters and settings are applied to new data. The two-step model “calibration” and model “application” includes preprocessing as well as data modeling such as PCA. It should also be clear that preprocessing can change the focus of the data modeling procedure. For example, PCA always captures the most sum-of-squares in the first PC. However, the different preprocessing methods examined above changed what sum-of-squares was the focus of the PCA decomposition. It is in this way that preprocessing can be used to tune what variance is captured by the PCA or PLS model.
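As one hedged example of a method that operates across the columns, the sketch below scales each sample (row) to unit norm. Nothing is estimated from the calibration data, but the choice of norm is a setting that is stored with the model and reused for new data. The function name, the dictionary layout and the data values are hypothetical, and rows are assumed to be non-zero.

```python
def normalize_rows(X, norm_order=2):
    """Scale each row (sample) of X to unit p-norm. Each row is treated
    on its own, so nothing is estimated from the calibration data."""
    out = []
    for row in X:
        norm = sum(abs(x) ** norm_order for x in row) ** (1.0 / norm_order)
        out.append([x / norm for x in row])
    return out

# The setting is stored as part of the (hypothetical) model so that
# calibration data and new data are treated identically.
model = {"preprocessing": {"method": "normalize_rows", "norm_order": 1}}

X_cal = [[1.0, 3.0], [2.0, 2.0]]
X_new = [[5.0, 15.0]]

order = model["preprocessing"]["norm_order"]
X_cal_n = normalize_rows(X_cal, order)  # application to the calibration data
X_new_n = normalize_rows(X_new, order)  # same setting applied to new data

print(X_cal_n)  # [[0.25, 0.75], [0.5, 0.5]]
print(X_new_n)  # [[0.25, 0.75]]
```

Changing `norm_order` between calibration and application would put the new data on a different scale from the data the model was built on, which is why the setting belongs to the model.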
This brief introduction described how preprocessing is calibrated (based on calibration data) and applied (to both the calibration and new test data). A more detailed discussion of mean-centering and autoscaling for PCA can be found in Wise, B.M. and Gallagher, N.B., "The Process Chemometrics Approach to Chemical Process Monitoring and Fault Detection," J. Proc. Cont. 6(6), 329-348 (1996).