===Purpose===

Selection and application of standard [[Model_Building:_Preprocessing_Methods|preprocessing methods]].

===Synopsis===


:s = preprocess; %Modal GUI to select preprocessing
:[s,changed] = preprocess(s); %Modal GUI to modify preprocessing
:list = preprocess('initcatalog'); %Gives a list of the available methods
:preprocess('keywords'); %Lists valid method names
:s = preprocess('default','methodname'); %Non-GUI preprocessing selection
:[datap,sp] = preprocess('calibrate',s,data); %single-block calibration of preprocessing
:[datap,sp] = preprocess('calibrate',s,xblock,yblock); %multi-block calibration of preprocessing
:datap = preprocess('apply',sp,data); %apply to new data
:data = preprocess('undo',sp,datap); %undo preprocessing
:data = preprocess('undo_silent',sp,datap); %undo preprocessing (no warnings)
:[datap,s] = preprocess(data); %Modal GUI to preprocess selected data
:[datap,s] = preprocess(data,s); %Modal GUI to preprocess selected data


===Description===

PREPROCESS is a general tool to choose preprocessing steps and to perform those steps on data. It can be used as a graphical interface or as a command-line tool. See [[ModelBuilding_PreProcessingMethods|Model Building - PreProcessing Methods]] for a description of the use of the graphical user interface. See [[User Defined Preprocessing]] and [[preprouser]] for a description of how custom preprocessing can be added to the standard preprocessing options listed below.


From the command line, PREPROCESS can be used to perform four different tasks:
* 1) Specification of preprocessing (the preprocessing steps),
* 2) Estimate preprocessing parameters (calibrate),
* 3) Apply preprocessing to new data (apply), and
* 4) Remove preprocessing from preprocessed data (undo).


====Case 1) Specification of Preprocessing====


The purpose of the following calls to PREPROCESS is to generate standard preprocessing structure arrays that contain the desired preprocessing steps. The commands are listed here.

:s = preprocess;

creates a GUI that allows the user to select preprocessing steps interactively. The output <tt>s</tt> is a standard preprocessing structure. If multiple preprocessing steps are selected, <tt>s</tt> is a multi-record structure with each record corresponding to a preprocessing step.

:[s,changed] = preprocess(s);

allows the user to interactively edit a previously-built preprocessing structure <tt>s</tt>. The output <tt>s</tt> is the edited preprocessing structure. The second output <tt>changed</tt> is a flag that indicates whether the user clicked "OK" (==1) or "Cancel" (==0) to close the interface. If the user cancels, the output is the same as the input (no changes have been made).

:s = preprocess('default','methodname');

returns the default preprocessing structure for method <tt>methodname</tt>. A list of valid method names can be obtained using the command:

:preprocess('keywords')

The technical description of the different types of preprocessing can be found on the [[Model_Building:_Preprocessing_Methods|Model Building: Preprocessing Methods]] page. Below is a list of standard methods that can be used for 'methodname':

* ''''abs'''': takes the absolute value of the data (see [[Advanced_Preprocessing:_Simple_Mathematical_Operations|abs]]),
* ''''arithmetic'''': simple arithmetic operations,
* ''''autoscale'''': centers columns to zero mean and scales to unit variance (see [[auto]]),
* ''''autoscalenomean'''': variance (std) scaling, scales each variable by its standard deviation without mean-centering,
* ''''baseline'''': baselining using an iterative weighted least squares algorithm (see [[wlsbaseline]]),
* ''''whittaker'''': baselining using an automatic Whittaker filter,
* ''''simple baseline'''': baselining based on user-specified points (see [[baseline]]),
* ''''classcenter'''': centers classes in data to the mean of each class (see [[classcenter]]),
* ''''classcentroid'''': centers data to the centroid of all classes (see [[classcentroid]]),
* ''''classcentroidscale'''': centers data to the centroid of all classes and scales to intra-class variance (see [[classcentroid]]),
* ''''derivative'''': Savitzky-Golay smoothing and derivative across rows (see [[savgol]]),
* ''''derivative columns'''': Savitzky-Golay smoothing and derivative down columns (see [[savgol]]),
* ''''detrend'''': remove a linear trend (see [[baseline]]),
* ''''eemfilter'''': EEM filtering,
* ''''emsc'''': extended multiplicative scatter correction (see [[emscorr]]),
* ''''epo'''': External Parameter Orthogonalization - remove clutter covariance (see [[glsw]]),
* ''''gapsegment'''': gap segment derivatives (see [[gapsegment]]),
* ''''gls weighting'''': generalized least squares weighting (see [[glsw]]),
* ''''gscale'''': group/block scaling (see [[gscale]]),
* ''''holoreact'''': Kaiser HoloReact Method (see [[hrmethodreadr]]),
* ''''logdecay'''': log decay scaling,
* ''''log10'''': calculate base 10 logarithm of data (see [[Advanced_Preprocessing:_Simple_Mathematical_Operations|log10]]),
* ''''mean center'''': center columns to have zero mean (see [[mncn]]),
* ''''median center'''': center columns to have zero median (see [[medcn]]),
* ''''minmax'''': min-max scaling, scales each row or column to have a minimum of 0 and a maximum of 1 (see [[minmax]]),
* ''''msc'''': multiplicative scatter correction with offset, the mean is the reference spectrum (see [[mscorr]]),
* ''''msc_median'''': multiplicative scatter correction with offset, the median is the reference spectrum (see [[mscorr]]),
* ''''centering'''': multiway center,
* ''''scaling'''': multiway scale,
* ''''normalize'''': normalization of the rows (see [[normaliz]]),
* ''''osc'''': orthogonal signal correction (see [[osccalc]] and [[oscapp]]),
* ''''pareto'''': Pareto (sqrt std) scaling, scales each variable by the square root of its standard deviation,
* ''''sqmnsc'''': Poisson (sqrt mean) scaling, scales each variable by the square root of its mean (see [[poissonscale]]),
* ''''referencecorrection'''': reference/background correction,
* ''''smooth'''': Savitzky-Golay smoothing (see [[savgol]]),
* ''''snv'''': standard normal deviate (autoscale the rows, see [[snv]]),
* ''''specalign'''': variable alignment via [[cow]] and [[registerspec]],
* ''''trans2abs'''': transmission to absorbance ([[Advanced_Preprocessing:_Simple_Mathematical_Operations|log(1/T)]]),
* ''''window_filter'''': spectral filtering (see [[windowfilter]]).
 
Additional methods are available with MIA_Toolbox. The valid method names for 'methodname' follow.
* ''''Image_Flatfield'''': background subtraction (flatfield),
* ''''Image_Close'''': close (dilate+erode),
* ''''Image_Dilate'''': dilate,
* ''''Image_Erode'''': erode,
* ''''Image_Max'''': replaces window of pixels with the max (see [[box_filter]]),
* ''''Image_Mean'''': replaces window of pixels with the mean (see [[box_filter]]),
* ''''Image_Median'''': replaces window of pixels with the median (see [[box_filter]]),
* ''''Image_Min'''': replaces window of pixels with the min (see [[box_filter]]),
* ''''Image_Open'''': open (erode+dilate),
* ''''Image_Smooth'''': smooth,
* ''''Image_TrimmedMean'''': replaces window of pixels with the trimmed mean (see [[box_filter]]),
* ''''Image_TrimmedMedian'''': replaces window of pixels with the trimmed median (see [[box_filter]]).


The following command generates a multi-record structure of the default preprocessing steps included in PREPROCESS:

:list = preprocess('initcatalog');
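For example, a short sequence of preprocessing steps can be assembled at the command line from default structures. The following is a minimal sketch only; it assumes (based on the multi-record description above) that default structures returned by PREPROCESS can be collected into a multi-record array, and the variable names are illustrative.

:% Sketch: assemble a two-step preprocessing sequence from default structures.
:% Assumes default structures can be indexed into a multi-record array, as
:% described for the GUI output above.
:preprocess('keywords');                      % display the list of valid method names
:s = preprocess('default','smooth');          % step 1: Savitzky-Golay smoothing
:s(2) = preprocess('default','mean center');  % step 2: mean centering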


====Case 2) Estimate preprocessing parameters (calibrate)====


Many preprocessing methods derive statistics and other numerical values from the calibration data. These values must be stored and reused when new data (test or other future data) are to be preprocessed in the same way as the calibration data. Examples include the mean or variance of each variable of the calibration data.

The objective of the 'calibrate' call to PREPROCESS is to estimate preprocessing parameters, if any, from the calibration data set and to perform the preprocessing on that data. The I/O format is:

:[datap,sp] = preprocess('calibrate',s,data);


The inputs are <tt>s</tt>, a standard preprocessing structure, and <tt>data</tt>, the calibration data. The preprocessed data is returned in <tt>datap</tt>, and the preprocessing parameters are returned in a modified preprocessing structure <tt>sp</tt>. Note that <tt>sp</tt> is used as an input with the 'apply' and 'undo' commands described below.


Short-cuts for each method can also be used. Examples for 'mean center' and 'autoscale' are:

:[datap,sp] = preprocess('calibrate','mean center',data);
:[datap,sp] = preprocess('calibrate','autoscale',data);


Preprocessing for some multi-block methods (specifically, 'osc' and 'gls weighting') requires that the y-block be passed as well. The I/O format in these cases is:

:[datap,sp] = preprocess('calibrate',s,xblock,yblock);
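As an illustrative sketch of a 'calibrate' call (assuming a plain double matrix is accepted as the data input; the variable names are only examples):

:% Sketch: calibrate mean centering on a small calibration matrix (assumed
:% to be accepted as a plain double array).
:data = [1 2; 3 4; 5 6];                       % calibration data (3 samples x 2 variables)
:s = preprocess('default','mean center');      % preprocessing structure
:[datap,sp] = preprocess('calibrate',s,data);  % estimate the column means and center the data
:% datap has (approximately) zero column means; the means are stored in sp
:% for later use with the 'apply' and 'undo' commands.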


====Case 3) Apply preprocessing to new data (apply)====


Once the preprocessing steps have been calibrated on a set of data (see Case 2), the preprocessing can be "applied" to new data. The parameters determined during calibration are used to preprocess the new data. For example, consider mean-centering: the parameters are the means of the calibration set, and during 'apply' the new data are centered to the mean of the calibration data.

The following call to PREPROCESS

:datap = preprocess('apply',sp,data);

applies the calibrated preprocessing <tt>sp</tt> to the new data <tt>data</tt> and returns the preprocessed data <tt>datap</tt>, which is of class "dataset".
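A minimal sketch of the calibrate/apply sequence (the variables <tt>xcal</tt> and <tt>xtest</tt> are illustrative calibration and test data):

:% Sketch: apply preprocessing calibrated on xcal to new data xtest.
:[xcalp,sp] = preprocess('calibrate','mean center',xcal);  % calibrate on the calibration data
:xtestp = preprocess('apply',sp,xtest);                    % center xtest using the means of xcal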


====Case 4) Remove preprocessing from preprocessed data (undo)====


The inverse operation of applying preprocessing is performed by the following call to PREPROCESS:

:data = preprocess('undo',sp,datap);

Inputs are <tt>sp</tt>, the calibrated preprocessing structure (see Case 2 above), and the preprocessed data <tt>datap</tt> (class "double" or "dataset"). The output is the "unpreprocessed" data <tt>data</tt>.

The 'undo' operation is most often used on y-block predictions in regression models and in missing data replacement algorithms (see [[mdcheck]] and [[replace]]), in which the data is preprocessed and an estimate of the data is then converted back to its original, unpreprocessed form.

Note that some preprocessing cannot be undone (for example, 'osc' and 'sg'). In these cases, an inverse does not exist or has not been defined, and an 'undo' call will result in a warning. Using 'undo_silent' instead of 'undo' suppresses the warning message. One reason for not defining an inverse, or undo, is that the procedure would require a significant amount of memory storage (e.g., when data sets are large).
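For example (a sketch assuming mean centering, which has a defined inverse, and illustrative variable names), 'undo' restores the original scale of the data:

:% Sketch: undo mean centering to recover the original data values.
:[datap,sp] = preprocess('calibrate','mean center',data);  % center the calibration data
:data2 = preprocess('undo',sp,datap);                      % add the column means back
:% data2 should match the original data to within numerical precision.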


===See Also===

[[analysis]], [[crossval]], [[pca]], [[pcr]], [[pls]], [[preprocessiterator]], [[preprocatalog]], [[preprouser]]
