Auto
Latest revision as of 10:47, 5 December 2019
Purpose
Autoscales a matrix to mean zero and unit variance.
Synopsis
- [ax,mx,stdx,msg] = auto(x,options)
- [ax,mx,stdx,msg] = auto(x,offset)
Description
[ax,mx,stdx] = auto(x); autoscales a matrix (x) and returns the resulting matrix (ax), whose columns have zero mean and unit variance, together with a vector of means (mx) and a vector of standard deviations (stdx) used in the scaling. Output (msg) returns any warning messages. If missing data (NaNs) are found, the available data are autoscaled provided the fraction missing does not exceed the thresholds specified below. (mx) and (stdx) can be used to scale new data (see SCALE). Optional input (offset) is a scalar added to the standard deviations to avoid division by zero. Optional input (options) is described below.
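The arithmetic described above can be sketched in base MATLAB, without calling auto itself. This is a minimal illustration, not the toolbox implementation: the data values are made up, and the use of the n-1 normalization in std is an assumption. It also shows how the returned (mx) and (stdx) would be reused to scale new data.

```matlab
% Illustrative data: 4 samples (rows) by 2 variables (columns).
x = [1 10; 2 20; 3 30; 4 40];

mx   = mean(x, 1);      % row vector of column means
stdx = std(x, 0, 1);    % column standard deviations (n-1 normalization, assumed)

% Center and scale each column (implicit expansion, MATLAB R2016b or later).
ax = (x - mx) ./ stdx;

% New data are scaled with the calibration statistics, not their own.
xnew  = [2.5 25];               % hypothetical new sample, equal to the column means
axnew = (xnew - mx) ./ stdx;    % therefore scales to [0 0]
```

After this step each column of ax has mean zero and unit standard deviation, which is the property the Purpose section states.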
Options
options = a structure array with the following fields:
- offset: scaling can use standard deviation plus an offset {default = 0}. This can be used to avoid divide-by-zero errors.
- display: [ {'off'}| 'on' ] governs level of display to the command window.
- matrix_threshold: fraction of missing data allowed based on entire matrix (x) {default = 0.15}.
- column_threshold: fraction of missing data allowed based on a single column {default = 0.25}.
- algorithm: [ {'standard'} | 'robust' ] scaling algorithm. 'robust' centers with the median instead of the mean and scales with MADC; use it when the data will be passed to robust techniques. The MADC function is a scale estimator given by the Median Absolute Deviation (with finite sample correction) and is part of the LIBRA package included in PLS_Toolbox/Solo. It is defined as
madc(x) = b_n * 1.4826 * med(|x_i - med(x)|)
with b_n a small-sample correction factor (b_n = n/(n-0.8) for n > 9) that makes the MAD unbiased at the normal distribution.
- stdthreshold: [ 0 ] scalar or vector of standard deviation threshold values. If a standard deviation is below its corresponding threshold value, the threshold value is used in place of the actual value. Note that the actual standard deviation is always returned, whether or not it exceeds the threshold. A scalar value is used as a threshold for all variables.
- badreplacement: [0] value to use in place of standard deviation values of 0 (zero). Typical values used with the following effects:
- 0 = Any value in the given variable is set to zero. The variable is effectively excluded (but still expected by the model). This is also the behavior when badreplacement = inf.
- 1 = Values different from the mean of the given variable are flagged in Q residuals with no reweighting.
- Values >0 and <inf give the variable a different weighting in the Q residuals (values >1 down-weight the bad variables for Q residual calculations; values <1 up-weight them).
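The 'robust' algorithm in the list above can be illustrated with the MADC formula as quoted. This is a minimal base-MATLAB sketch for a single column with n > 9. It is not the LIBRA madc function, and it does not handle the correction factors for n <= 9, which this page does not give. The data are made up.

```matlab
% Illustrative column with one gross outlier (n = 11, so the n > 9 factor applies).
x = [1 2 3 4 5 6 7 8 9 10 100]';
n = numel(x);

medx = median(x);                             % robust center, used instead of the mean
bn   = n / (n - 0.8);                         % correction factor quoted above for n > 9
s    = bn * 1.4826 * median(abs(x - medx));   % MADC-style scale estimate

ax = (x - medx) ./ s;                         % robustly autoscaled column

% For comparison: the classical estimates are pulled strongly by the outlier.
classical_mean = mean(x);
classical_std  = std(x);
```

Because the median and the median absolute deviation ignore the magnitude of the single outlier, s stays close to the spread of the ten regular values, whereas classical_std is inflated by it.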
If the input (offset) is a scalar, it is used as the offset value, with all other options set to their default values.
The optional input offset is added to the standard deviations before scaling and can be used to suppress low-level variables that would otherwise have standard deviations near zero.
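The effect of the offset, and of the related stdthreshold option, on a nearly constant variable can be sketched as follows. This is base MATLAB only. The numeric values are hypothetical, and how auto combines the two settings internally is not stated on this page, so they are shown separately.

```matlab
% Illustrative data: the second column is nearly constant.
x = [1 5.00; 2 5.01; 3 4.99; 4 5.00];
mx   = mean(x, 1);
stdx = std(x, 0, 1);

% offset: added to every standard deviation before dividing.
offset    = 0.1;                          % hypothetical value
ax_offset = (x - mx) ./ (stdx + offset);

% stdthreshold: a standard deviation below the threshold is replaced by the threshold.
stdthreshold = 0.5;                       % hypothetical scalar, applied to all columns
divisor      = max(stdx, stdthreshold);
ax_thresh    = (x - mx) ./ divisor;
```

In both cases the low-variance second column is divided by a number much larger than its own standard deviation, so its small fluctuations are not blown up to unit variance.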
The default options can be retrieved using: options = auto('options');
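A typical calling pattern, assuming PLS_Toolbox is installed and on the MATLAB path, might look like the following. The calls follow the synopsis and option names given on this page. The data and the chosen option values are illustrative only.

```matlab
% Requires PLS_Toolbox; the data below are illustrative.
x = (1:12)' * [1 10];            % 12 samples by 2 variables

options = auto('options');       % start from the default options structure
options.algorithm = 'robust';    % median and MADC instead of mean and standard deviation
options.display   = 'on';        % show messages in the command window

[ax, mx, stdx, msg] = auto(x, options);

% Passing a scalar instead of a structure uses it as the offset,
% with all other options left at their defaults.
[ax2, mx2, stdx2] = auto(x, 0.1);
```

Starting from auto('options') and changing only the fields of interest keeps the remaining settings at their documented defaults.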
See Also
gscale, gscaler, medcn, mncn, normaliz, npreprocess, regcon, rescale, scale, snv, madc