Advanced Preprocessing: Introduction

From Eigenvector Research Documentation Wiki
Revision as of 13:27, 13 July 2011 by imported>Jeremy

Data preprocessing is often employed in multivariate analysis, but it is often unclear why and when to preprocess the data, and in what order. The large number of available preprocessing methods adds to the confusion. In short, the objective of data preprocessing is to separate the signal of interest from clutter, where clutter is defined as all signal that is not of interest (e.g., signal attributable to interferences and noise). This means that the appropriate preprocessing method depends on the data analysis objective and on how the signal and clutter manifest in the data. The topic can become complicated quickly, so the intention here is to provide only a brief introduction to why preprocessing is performed and in what order. Simple preprocessing methods are used as examples to introduce these concepts. A more thorough discussion of the objective, theory and math of each preprocessing procedure is not included here.

Preprocessing is typically performed prior to data analysis methods such as principal components analysis (PCA) or partial least squares regression (PLS). Recall that PCA maximizes the capture of sum-of-squares with factors or principal components (PCs) within a single block of data, and PLS is a slightly more complicated method that finds linear relationships between two blocks of data. This introduction will use PCA in the examples. Two of the simplest preprocessing methods are mean-centering and autoscaling, and both will be described in a bit more detail; first, however, the data analysis objective with no preprocessing is discussed.

Imagine that an $M \times N$ data matrix $\mathbf{X}$ is available and the objective is to perform exploratory analysis of this data using PCA. Recall that samples (or objects) correspond to the rows of $\mathbf{X}$ and variables correspond to the columns. If no preprocessing is applied to $\mathbf{X}$ prior to the PCA decomposition, then the PCA loadings will capture the most sum-of-squares in $\mathbf{X}$ centered about the origin (i.e., the model is a force fit about zero). In this case, the first principal component (PC) will point in the direction that captures the most sum-of-squares about zero (variance about zero).
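The force fit about zero can be illustrated with a minimal NumPy sketch. This is not PLS_Toolbox code; the simulated data, the offset of (10, 10), and the helper name `first_loading` are assumptions made only for illustration. PCA is computed here through the singular value decomposition of the raw, unpreprocessed matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: M = 200 samples (rows), N = 2 variables (columns).
# The samples sit far from the origin, around (10, 10), and most of their
# spread lies along the direction (1, -1).
M, N = 200, 2
offset = np.array([10.0, 10.0])
spread = 2.0 * rng.normal(size=(M, 1)) @ (np.array([[1.0, -1.0]]) / np.sqrt(2))
X = offset + spread + 0.1 * rng.normal(size=(M, N))


def first_loading(A):
    """Return the first PCA loading of A (no preprocessing applied) and the
    fraction of the total sum-of-squares about zero that it captures."""
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    return vt[0], s[0] ** 2 / np.sum(s ** 2)


p_raw, frac_raw = first_loading(X)

# Unit vector pointing from the origin toward the average sample.
avg = X.mean(axis=0)
mean_dir = avg / np.linalg.norm(avg)

# Alignment close to 1 means the first PC points from the origin toward
# the bulk of the data rather than along the internal spread (1, -1).
print("alignment with average direction:", abs(p_raw @ mean_dir))
print("fraction of sum-of-squares captured:", frac_raw)
```

In this simulated case the offset from the origin dominates the sum-of-squares, so the first PC mostly describes where the data sit relative to zero, not how the samples differ from one another.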

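Next, define the $N \times 1$ mean of data matrix $\mathbf{X}$ as $\bar{\mathbf{x}}$. The mean is calculated down the rows of $\mathbf{X}$ so that for the nth column of $\mathbf{X}$ (i.e., the nth element of the vector $\bar{\mathbf{x}}$) the mean is a scalar $\bar{x}_n$ and is given by

$$\bar{x}_n = \frac{1}{M} \sum_{m=1}^{M} x_{m,n} \qquad (1)$$

The mean centered data $\mathbf{X}_{mc}$ is then calculated by subtracting the column mean from the corresponding column so that

$$x_{mc,m,n} = x_{m,n} - \bar{x}_n \quad \text{for } m = 1, \ldots, M \text{ and } n = 1, \ldots, N$$

$$\mathbf{x}_{mc,n} = \mathbf{x}_n - \mathbf{1}\bar{x}_n \quad \text{for } n = 1, \ldots, N$$

$$\mathbf{X}_{mc} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^T \qquad (2)$$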

where $\mathbf{1}$ is an $M \times 1$ vector of ones (typically it is assumed that $\mathbf{1}$ is of appropriate size) and the superscript $T$ is the transpose operator. (The notations in Equation 2 provide identical results, but the simplicity of the last form shows why the linear algebra notation is often preferred.) The first step in mean-centering, represented by Equation 1, is to calculate the mean of each column of $\mathbf{X}$. This procedure can be considered “calibration” of the mean-centering preprocessing, and it consists of estimating the mean $\bar{\mathbf{x}}$ from the “calibration” data $\mathbf{X}$. The second step, represented by Equation 2, subtracts the mean from the data. This procedure can be considered “applying” the centering to the data $\mathbf{X}$. The first PC of a PCA model of the mean-centered data $\mathbf{X}_{mc}$ will then capture the most sum-of-squares about the mean (variance about the mean or simply ‘variance’). The mean is now a part of the overall PCA model “calibrated” on the “calibration” data, and the mean-centering operation has changed what sum-of-squares is captured by the first PC. In other words, the preprocessing has changed the data so that the PCA model focuses on a different type of variance. As a result, the PCA model must be interpreted differently during the exploratory analysis.
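The two steps, “calibration” (Equation 1) and “applying” (Equation 2), can be sketched in NumPy as follows. The function names `calibrate_mean` and `apply_centering`, the simulated data, and the variable `X_new` are hypothetical and introduced only for illustration; this is a sketch of the idea, not the implementation used by any particular toolbox.

```python
import numpy as np


def calibrate_mean(X_cal):
    """Equation 1: estimate the N-element column mean from calibration data."""
    return X_cal.mean(axis=0)


def apply_centering(X, x_bar):
    """Equation 2, matrix form: X - 1 x_bar^T, with 1 an M x 1 vector of ones."""
    ones = np.ones((X.shape[0], 1))
    return X - ones @ x_bar[np.newaxis, :]


rng = np.random.default_rng(1)
# Hypothetical calibration data (50 x 3) and new data (10 x 3).
X_cal = rng.normal(loc=[5.0, -3.0, 2.0], scale=[1.0, 0.5, 2.0], size=(50, 3))
X_new = rng.normal(loc=[5.0, -3.0, 2.0], scale=[1.0, 0.5, 2.0], size=(10, 3))

x_bar = calibrate_mean(X_cal)             # "calibration" step
X_mc = apply_centering(X_cal, x_bar)      # "applying" to the calibration data
X_new_mc = apply_centering(X_new, x_bar)  # new data reuse the calibrated mean

# Element-wise form of Equation 2, to check it matches the matrix form.
X_mc_elem = np.empty_like(X_cal)
for m in range(X_cal.shape[0]):
    for n in range(X_cal.shape[1]):
        X_mc_elem[m, n] = X_cal[m, n] - x_bar[n]

# PCA of the centered data via SVD: the squared singular values are now
# sums-of-squares about the mean, captured per PC.
_, s, vt = np.linalg.svd(X_mc, full_matrices=False)
var_captured = s ** 2 / np.sum(s ** 2)

print("column means after centering:", X_mc.mean(axis=0))
print("fraction of variance per PC:", var_captured)
```

Note that the new data are centered with the mean estimated from the calibration data, not with their own mean, so their column means after centering are generally not exactly zero. This reflects the point that the mean is part of the calibrated model.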