Advanced Preprocessing: Variable Centering: Difference between revisions

From Eigenvector Research Documentation Wiki
imported>Jeremy
Revision as of 11:32, 27 July 2011

Introduction

Many preprocessing methods are based on the variance in the data. Such techniques should generally be provided with data which are centered relative to some reference point. Centering is generically defined as

$$\mathbf{X}_c = \mathbf{X} - \mathbf{1}\mathbf{x}^T$$

where $\mathbf{x}$ is a vector representing the reference point for each variable, $\mathbf{1}$ is a column-vector of ones, and $\mathbf{X}_c$ represents the centered data. Often the reference point is the mean of the data. Interpretation of loadings and samples from models built on centered data is done relative to this reference point. For example, when centering is used before calculating a PCA model, the resultant eigenvalues can be interpreted as variance captured by each principal component. Without centering, the eigenvalues include both variance and the sum-squared mean of each variable.
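
The statement about eigenvalues can be checked with a short derivation. This is a sketch under assumed notation, not taken from the original page: $\mathbf{X}$ is the data matrix with $m$ samples (rows), $\mathbf{x}$ is the vector of column means, $\mathbf{1}$ is a column-vector of $m$ ones, and $\mathbf{X}_c = \mathbf{X} - \mathbf{1}\mathbf{x}^T$ is the mean-centered data, so that $\mathbf{1}^T\mathbf{X}_c = \mathbf{0}^T$.

```latex
\begin{aligned}
\mathbf{X}^T\mathbf{X}
  &= (\mathbf{X}_c + \mathbf{1}\mathbf{x}^T)^T(\mathbf{X}_c + \mathbf{1}\mathbf{x}^T) \\
  &= \mathbf{X}_c^T\mathbf{X}_c
     + \mathbf{X}_c^T\mathbf{1}\mathbf{x}^T
     + \mathbf{x}\mathbf{1}^T\mathbf{X}_c
     + \mathbf{x}\,(\mathbf{1}^T\mathbf{1})\,\mathbf{x}^T \\
  &= \mathbf{X}_c^T\mathbf{X}_c + m\,\mathbf{x}\mathbf{x}^T ,
\qquad\text{so}\qquad
\operatorname{tr}(\mathbf{X}^T\mathbf{X})
  = \operatorname{tr}(\mathbf{X}_c^T\mathbf{X}_c) + m\sum_j x_j^2 .
\end{aligned}
```

Because the trace equals the sum of the eigenvalues, the eigenvalues of the uncentered cross-product matrix carry the extra term $m\sum_j x_j^2$ (the sum-squared mean) on top of the variation about the mean.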

In most cases, centering and/or scaling (see next section) will be the last method in a series of preprocessing methods. When other preprocessing methods are being used, they are usually performed prior to a centering and/or scaling method.

Mean-Center

One of the most common preprocessing methods, mean-centering calculates the mean of each column and subtracts this from the column. Another way of interpreting mean-centered data is that, after mean-centering, each row of the mean-centered data includes only how that row differs from the average sample in the original data matrix.
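
As an illustration of the column-wise operation described above, here is a minimal sketch in plain Python. It is not the `mncn` implementation; the helper name `mean_center` and the example matrix are hypothetical and chosen only for this example.

```python
from statistics import fmean


def mean_center(X):
    """Subtract each column's mean from that column.

    X is a list of rows (samples); returns (centered rows, column means).
    Hypothetical helper for illustration only.
    """
    n_cols = len(X[0])
    means = [fmean(row[j] for row in X) for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in X]
    return centered, means


# Three samples (rows), two variables (columns).
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 60.0]]

Xc, means = mean_center(X)
print(means)  # [2.0, 30.0]
print(Xc)     # [[-1.0, -20.0], [0.0, -10.0], [1.0, 30.0]]
```

Each row of `Xc` shows how that sample differs from the average sample `[2.0, 30.0]`, and every column of `Xc` sums to zero.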

Mean-centering has the effect of including an adjustable intercept in multivariate models. For example, mean-centering both the X and Y blocks in a regression model effectively allows for a non-zero intercept of the regression line. This is critical in many inferential regression problems where Y is not necessarily zero when X goes to zero (predicting temperature in Kelvin, for example).
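
The intercept effect can be seen in a small single-variable sketch. The data below are made up for illustration (they follow y = 2x + 5 exactly), and the variable names are hypothetical. A slope-only model is fitted twice: once on the raw data, which forces the line through the origin, and once on mean-centered data, where the intercept is recovered afterwards from the two means.

```python
from statistics import fmean

# Made-up data that follow y = 2*x + 5 exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [7.0, 9.0, 11.0, 13.0]


def slope_through_origin(u, v):
    """Least-squares slope of v on u for a model with no intercept term."""
    return sum(a * b for a, b in zip(u, v)) / sum(a * a for a in u)


# Without centering, the slope-only model is forced through (0, 0).
slope_raw = slope_through_origin(x, y)

# With both blocks mean-centered, the same slope-only model fits the
# variation about the means; the intercept follows from the two means.
x_mean, y_mean = fmean(x), fmean(y)
xc = [a - x_mean for a in x]
yc = [b - y_mean for b in y]
slope = slope_through_origin(xc, yc)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.0 5.0
print(slope_raw)         # about 3.667, distorted by the missing intercept
```

The centered fit recovers the true slope and intercept, while the uncentered fit compensates for the missing intercept by inflating the slope.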

In the Preprocessing GUI, this method has no adjustable settings. From the command line, this method is achieved using the mncn function.

For more information on the use of mean-centering, see the discussion on Principal Components Analysis in Chapter 5 of the Chemometrics Tutorial.

Median-Center

The median-centering preprocessing method is very similar to mean-centering except that the reference point is the median of each column rather than the mean. This is considered one of the "robust" preprocessing methods in that it is not as influenced by outliers (unusual samples) in the data.
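
To illustrate the robustness claim, the following sketch compares the two reference points on a single made-up column containing one outlier. It is not the `medcn` implementation; the helper name `center_columns` and the data are hypothetical.

```python
from statistics import fmean, median


def center_columns(X, reference):
    """Subtract a per-column reference point (e.g. mean or median).

    X is a list of rows (samples); `reference` maps a list of column
    values to a single number. Hypothetical helper for illustration only.
    """
    n_cols = len(X[0])
    ref = [reference([row[j] for row in X]) for j in range(n_cols)]
    centered = [[row[j] - ref[j] for j in range(n_cols)] for row in X]
    return centered, ref


# One variable, five samples; the last sample is an outlier.
X = [[1.0], [2.0], [3.0], [4.0], [100.0]]

mean_centered, mean_ref = center_columns(X, fmean)
median_centered, median_ref = center_columns(X, median)

print(mean_ref)         # [22.0]
print(median_ref)       # [3.0]
print(median_centered)  # [[-2.0], [-1.0], [0.0], [1.0], [97.0]]
```

The outlier pulls the mean to 22.0, far from the four typical samples, while the median stays at 3.0, so the median-centered values of the typical samples remain close to zero.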

In the Preprocessing GUI, this method has no adjustable settings. From the command line, this method is performed using the medcn function.