Parafac

Purpose

PARAFAC (PARAllel FACtor analysis) for multi-way arrays

Synopsis

model = parafac(X,ncomp,options)
pred = parafac(Xnew,model)
parafac % Launches an analysis window with Parafac as the selected method

Please note that the recommended way to build and apply a PARAFAC model from the command line is to use the Model Object. Please see the EVRIModel_Objects wiki page on building and applying models using the Model Object.

Description

PARAFAC will decompose an array of order N (where N >= 3) into the summation over the outer product of N vectors (a low-rank model). For example, if N = 3, the array is of size I by J by K. An example of three-way fluorescence data is shown below.

For example, twenty-seven samples containing different amounts of dissolved hydroquinone, tryptophan, phenylalanine, and dopa are measured spectrofluorometrically using 233 emission wavelengths (250-482 nm) and 24 excitation wavelengths (200-315 nm, in 5 nm steps). A typical sample is also shown.

[Image: Parafacdata.gif]

A four-component PARAFAC model of these data will give four factors, each corresponding to one of the chemical analytes. This is illustrated graphically below. The first mode scores (loadings in mode 1) in the matrix A (27x4) contain estimated relative concentrations of the four analytes in the 27 samples. The second mode loadings B (233x4) are estimated emission loadings and the third mode loadings C (24x4) are estimated excitation loadings.

[Image: Parafacresults.gif]
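As a brief sketch of the command-line workflow for this example (the variable name X is assumed; only the parafac call and the loads field are taken from this page):

  % X is the 27 x 233 x 24 fluorescence array (sample x emission x excitation)
  model = parafac(X,4);     % four-component PARAFAC model
  A = model.loads{1};       % 27x4  relative concentrations (mode 1 scores)
  B = model.loads{2};       % 233x4 emission loadings (mode 2)
  C = model.loads{3};       % 24x4  excitation loadings (mode 3)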

For more information about how to use PARAFAC, see the University of Copenhagen's Multi-Way Analysis Videos (https://www.youtube.com/playlist?list=PL4L59zaizb3E-Pgp-f90iKHdQQi15JJoL).

In the PARAFAC algorithm, any missing values must be set to NaN or Inf and are then automatically handled by expectation maximization. This routine employs an alternating least squares (ALS) algorithm in combination with a line search. For 3-way data, the initial estimate of the loadings is usually obtained from the tri-linear decomposition (TLD).
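For example, missing elements can simply be marked as NaN before building the model; the indices used below are purely illustrative:

  X(12,50:55,3) = NaN;     % flag some hypothetical bad readings as missing
  model = parafac(X,4);    % missing values are handled by expectation maximization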

For assistance in preparing batch data for use in PARAFAC please see bspcgui.

Inputs

  • x = the multiway array to be decomposed, and
  • ncomp =
    • the number of factors (components) to use, OR
    • a cell array of parameters such as {a,b,c} which will then be used as the starting point for the model. The cell array must be the same length as the number of modes, and element j contains the scores/loadings for that mode. If one cell element is empty, this mode is guessed based on the remaining modes.

Optional Inputs

  • initval =
    • If a PARAFAC model is input, the data are fit to this model and the loadings for the first mode (scores) are estimated.
    • If loadings are input (e.g. model.loads), these are used as starting values (see the sketch after this list).
  • options = discussed below.
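A minimal sketch of supplying starting values, assuming a previous fit oldmodel of the same array and a new array Z are available (both names are placeholders):

  % reuse the loadings of a previous model as the starting point for a new fit
  model = parafac(X,oldmodel.loads);

  % or fit new data Z to the existing model; only the mode 1 scores are estimated
  pred = parafac(Z,oldmodel);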

Outputs

The output model is a structure array with the following fields:

  • modeltype: 'PARAFAC',
  • datasource: structure array with information about input data,
  • date: date of creation,
  • time: time of creation,
  • info: additional model information,
  • loads: 1 by K cell array with model loadings for each mode/dimension,
  • pred: cell array with model predictions for each input data block,
  • tsqs: cell array with T² values for each mode,
  • ssqresiduals: cell array with sum of squares residuals for each mode,
  • description: cell array with text description of model, and
  • detail: sub-structure with additional model details and results. For example, detail.ssq contains information on the loss function and the variance explained, as well as the variance per component.

Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see MCR and PARAFAC Variance Captured for details. The output pred is a structure array that contains the approximation of the data if the options field blockdetails is set to 'all' (see Options below).
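For instance, the variance-captured information and per-mode diagnostics can be read directly off the model structure; the exact layout of detail.ssq is not spelled out here, so simply displaying it is a reasonable first step (mode-wise cell indexing is assumed):

  model.detail.ssq               % loss function and variance captured per component
  res = model.ssqresiduals{1};   % sum-of-squares residuals, sample mode
  t2  = model.tsqs{1};           % T² values, sample mode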

Options

options = a structure array with the following fields:

  • display: [ {'on'} | 'off' ], governs level of display,
  • plots: [ {'final'} | 'all' | 'none' ], governs level of plotting,
  • weights: [], used for fitting a weighted loss function (discussed below),
  • stopcriteria: structure defining when to stop iterating, based on any one of four criteria:
    • relativechange: Default is 1e-6. When the relative change in fit falls below this threshold, the algorithm stops.
    • absolutechange: Default is 1e-6. When the absolute change in fit falls below this threshold, the algorithm stops.
    • iterations: Default is 10000. When the number of iterations exceeds this threshold, the algorithm stops.
    • seconds: Default is 3600 (seconds). When the time spent exceeds this threshold, the algorithm stops.
  • init: [ 0 ], defines how parameters are initialized (discussed below),
  • line: [ 0 | {1}] defines whether to use the line search {default uses it},
  • algo: [ {'ALS'} | 'tld' | 'swatld' ], governs the algorithm used. Only 'ALS' supports arrays with more than three modes and allows constraints,
  • iterative: settings for iterative reweighted least squares fitting (see help on weights below),
  • validation.splithalf: [ 'on' | {'off'} ], Allows doing splithalf analysis. See the help of SPLITHALF for more information,
  • auto_outlier.perform: [ 'on' | {'off'} ], Will automatically remove detected outliers in an iterative fashion. See auto_outlier.help for more information,
  • scaletype: Defines how loadings are scaled. See options.scaletype.text for help,
  • blockdetails: [ {'standard'} | 'compact' | 'all' ], level of detail (predictions, raw residuals, and calibration data) included in the model.
    • 'standard' = the predictions and raw residuals for the X-block, as well as the X-block itself, are not stored in the model in order to reduce its size in memory. Specifically, these fields in the model object are left empty: model.pred{1}, model.detail.res{1}, model.detail.data{1}.
    • 'compact' = like 'standard', except that residual limits from the old model are used and the core consistency field in the model structure is left empty (model.detail.reslim, model.detail.coreconsistency.consistency).
    • 'all' = keep the predictions and raw residuals for the X-block, as well as the X-block dataset itself.
  • preprocessing: {[]}, one-element cell array containing a preprocessing structure (see PREPROCESS) defining the preprocessing to use on the X-block,
  • samplemode: [1], defines which mode should be considered the sample or object mode,
  • constraints: {3x1 cell}, defines constraints on parameters (discussed below),
  • coreconsist: [ {'on'} | 'off' ], governs calculation of core consistency (turning off may save time with large data sets and many components), and
  • waitbar: [ {'on'} | 'off' ], display waitbar.

The default options can be retrieved using: options = parafac('options');.
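For example, a modified options structure might be set up as follows before calling parafac (all field names appear in the list above):

  options = parafac('options');            % retrieve the defaults
  options.plots = 'none';                  % suppress plotting
  options.waitbar = 'off';                 % suppress the waitbar
  options.stopcriteria.iterations = 2000;  % stop after at most 2000 iterations
  model = parafac(X,3,options);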

Weights

Through the use of the options field weights it is possible to fit a PARAFAC model in a weighted least squares sense. The input is an array of the same size as the input data X, holding an individual weight for each element. Instead of minimizing the Frobenius norm ||x-M||², where M is the PARAFAC model, the norm ||(x-M).*weights||² is minimized. The algorithm used for weighted regression is based on a majorization step according to Kiers, Psychometrika, 62, 251-266, 1997, which has the advantage of being computationally inexpensive.
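As a rough sketch, a weights array of the same size as X can be built and passed in through the options; the down-weighted region below is purely illustrative:

  options = parafac('options');
  weights = ones(size(X));         % equal weight for every element to start
  weights(:,200:233,:) = 0.1;      % example: down-weight a noisy emission region
  options.weights = weights;
  model = parafac(X,4,options);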

Init

The options field init governs how the initial guess for the loadings is obtained. If the optional input initval is supplied, options.init is not used. The following choices for init are available.

Generally, options.init = 0 will do for well-behaved data, whereas options.init = 10 is more suitable for difficult models (see the sketch following the list below). Difficult models are typically those with many components, with very correlated loadings, or models where there are indications that local minima are present.

  • init = 0, PARAFAC chooses initialization {default},
  • init = 1, uses TLD (unless data is more than three-way. Then ATLD is used),
  • init = 2, based on singular value decomposition (good alternative to 1),
  • init = 3, based on orthogonalization of random values (good for checking local minima),
  • init = 4, based on approximate (sequentially fitted) PARAFAC model,
  • init = 5, based on compression which may be useful for large data, and
  • init > 5, based on best fit of many (the value options.init) small runs.
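Following the guidance above, a well-behaved data set can simply use the default, while a difficult model can be started from the best of, say, ten small runs, or refit from orthogonalized random values to check for local minima:

  options = parafac('options');
  options.init = 10;               % best fit of 10 small runs (difficult models)
  model = parafac(X,4,options);

  options.init = 3;                % orthogonalized random start, to check for local minima
  model2 = parafac(X,4,options);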

Constraints

The options field constraints is used to impose constraints on the parameters. It is a cell array with the number of elements equal to the number of modes of the input data X. Each cell contains a structure array that defines the constraints in that particular mode. Hence, options.constraints{2} defines constraints on the second-mode loadings. For help on setting constraints see constrainfit. Note that if your dataset is, for example, a five-way array, then the default constraints field in options only defines the first three modes. You will have to create the constraint field for the remaining modes yourself, which can be done by copying from the other modes, for example: options.constraints{4} = options.constraints{1}; options.constraints{5} = options.constraints{1};
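A minimal sketch for a five-way array X5 (an assumed variable), copying one of the default constraint definitions into the missing modes exactly as described above:

  options = parafac('options');
  options.constraints{4} = options.constraints{1};   % reuse an existing definition for mode 4
  options.constraints{5} = options.constraints{1};   % and for mode 5
  model = parafac(X5,3,options);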

Examples

parafac demo gives a demonstration of the use of the PARAFAC algorithm.

model = parafac(X,5) fits a five-component PARAFAC model to the array X using default settings.

pred = parafac(Z,model) fits an existing PARAFAC model to new data Z. The scores will be taken to be in the first mode, but you can change this by setting options.samplemodex to the mode which is the sample mode. Note that the sample-mode dimension may be different for the old model and the new data, but all other dimensions must be the same.

options = parafac('options'); generates a set of default settings for PARAFAC. options.plots = 0; sets the plotting off.

options.init = 3; sets the initialization of PARAFAC to orthogonalized random numbers.

options.samplemodex = 2; Defines the second mode to be the sample mode. Useful, for example, when fitting an existing model to new data where the scores should be estimated in the second mode.

model = parafac(X,2,options); fits a two-component PARAFAC model with the settings defined in options.

parafac io shows the I/O of the algorithm.

See Also

analysis, bspcgui, datahat, eemoutlier, explode, gram, mpca, npls, outerm, parafac2, pca, preprocess, splithalf, tld, tucker, unfoldm, modelviewer, EVRIModel_Objects