Importtool and Ann: Difference between pages

From Eigenvector Research Documentation Wiki
(Difference between pages)
imported>Mathias
 
imported>Jeremy
 
Line 1: Line 1:
===Purpose===
===Purpose===


GUI for designating column/row data types in incoming data.  Allows user to specify a column or row as labels, class sets, or axisscale or data.
Predictions based on Artificial Neural Network (ANN) regression models.


===Synopsis===
===Synopsis===


: [ctypes, rtypes] = importtool(data);
: [model] = ann(x,y,options);
: [ctypes, rtypes] = importtool(data,options);
: [model] = ann(x,y, nhid, options);
: [pred] = ann(x,model,options);
: [valid] = ann(x,y,model,options);


===Description===
===Description===


Allows user to identify data type (data, class, axisscale, include, and ignore) fields (row and columns) in a data matrix.
Build an ANN model from input X and Y block data using the specified number of layers and layer nodes.
Alternatively, if a model is passed in, ANN makes a Y prediction for an input test X block. The ANN model
contains quantities (weights, etc.) calculated from the calibration data. When a model structure is passed in
to ANN, these weights do not need to be recalculated.


===Options===
There are two implementations of ANN available referred to as 'BPN' and 'Encog'.
:BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.
:Encog is a feedforward ANN using Resilient Backpropagation training. See [http://en.wikipedia.org/wiki/Rprop Rprop] for further details.
Encog is implemented using the [http://www.heatonresearch.com/encog Encog] framework provided by
Heaton Research, Inc., under the Apache 2.0 license. Further details of Encog Neural Network features are
available at [http://www.heatonresearch.com/wiki/Main_Page#Encog_Documentation Encog Documentation].
BPN is the ANN version used by default but the user can specify the option 'algorithm' = 'encog' to use Encog instead.
Both implementations should give similar results but one may be faster than the other for different datasets.
BPN is currently the only version which calculates RMSECV.


options = a structure array with the following fields:
====Inputs====


* '''fields''': Nx2 cell array, first column is field name, second column is color to use.
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,
* '''nhid''' = number of nodes in a single hidden layer ANN, or a vector of two numbers, indicating a two hidden layer ANN, giving the number of nodes in each of the two hidden layers (this takes precedence over options nhid1 and nhid2),
* '''model''' = previously generated model (when applying model to new data).


====Outputs====


==Examples==
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):
** '''modeltype''': 'ANN',
** '''datasource''': structure array with information about input data,
** '''date''': date of creation,
** '''time''': time of creation,
** '''info''': additional model information,
** '''pred''': 2 element cell array with
*** model predictions for each input block (when options.blockdetails='standard', x-block predictions are not saved and this will be an empty array)
** '''detail''': sub-structure with additional model details and results, including:
*** model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.


* '''pred''' = a structure, similar to '''model''', for the new data.


Here we import a csv file by dragging the file into the browse window. 
====Training Termination====
<gallery caption="Steps involved with specifying columns and rows with the import tool" widths="300px" heights="300px" perrow="2">
The ANN is trained on a calibration dataset to minimize prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.
File:csv_examp.jpg|a)    '' example file with class and label data''


File:Text_import1.jpg|b) ''text import settings''
BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV, computed on the calibration data, over a range of learning iteration values. The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified, then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This value is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.


File:importtool1.jpg|c) ''import tool''
Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations
</gallery>
becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results may not be optimal if training is stopped prematurely by this time limit).
Note these RMSE values refer to the internal preprocessed and scaled y values.


====Cross-validation====
Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window, specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window, the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line, use the crossval method to add cross-validation information to an existing model.


[[Image:csv_examp.jpg]]
===Options===
[[Image:Text_import1.jpg]]
[[Image:importtool1.jpg]]
 


options = a structure array with the following fields:
* '''display''' : [ 'off' |{'on'}] Governs display
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.
* '''blockdetails''' : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.
* '''waitbar''' : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.
* '''algorithm''' : [{'bpn'} | 'encog'] ANN implementation to use.
* '''nhid1''' : [{2}] Number of nodes in first hidden layer.
* '''nhid2''' : [{0}] Number of nodes in second hidden layer.
* '''learnrate''' : [0.125] ANN backpropagation learning rate (bpn only).
* '''learncycles''' : [20] Number of ANN learning iterations (bpn only).
* '''terminalrmse''' : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).
* '''terminalrmserate''' : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).
* '''maxseconds''' : [{20}] Maximum duration of ANN training in seconds (encog only).
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANN model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a PLS model. Compression can make the ANN more stable and less prone to overfitting.
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.
* '''compressmd''': [{'yes'} | 'no'] Use Mahalanobis Distance corrected.
* '''cvmethod''' : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.
* '''cvsplits''' : [{5}] Number of CV subsets.
* '''cvi''' : ''M'' element vector with integer elements allowing user-defined subsets. cvi is a vector with the same number of elements as x has rows, i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:
::cvi(i) = -2  the sample is always in the test set,
::cvi(i) = -1  the sample is always in the calibration set,
::cvi(i) =  0  the sample is never used, and
::cvi(i) =  1,2,3... defines each test subset.


===See Also===
===See Also===


[[parsemixed]]
[[analysis]], [[crossval]], [[lwr]], [[modelselector]], [[pls]], [[pcr]], [[svm]]

Revision as of 14:39, 22 September 2014

Purpose

Predictions based on Artificial Neural Network (ANN) regression models.

Synopsis

[model] = ann(x,y,options);
[model] = ann(x,y, nhid, options);
[pred] = ann(x,model,options);
[valid] = ann(x,y,model,options);

Description

Build an ANN model from input X and Y block data using the specified number of layers and layer nodes. Alternatively, if a model is passed in, ANN makes a Y prediction for an input test X block. The ANN model contains quantities (weights, etc.) calculated from the calibration data. When a model structure is passed in to ANN, these weights do not need to be recalculated.

There are two implementations of ANN available referred to as 'BPN' and 'Encog'.

BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.
Encog is a feedforward ANN using Resilient Backpropagation training. See Rprop for further details.

Encog is implemented using the Encog framework provided by Heaton Research, Inc., under the Apache 2.0 license. Further details of Encog Neural Network features are available at Encog Documentation. BPN is the ANN version used by default, but the user can specify the option 'algorithm' = 'encog' to use Encog instead. Both implementations should give similar results, but one may be faster than the other depending on the dataset. BPN is currently the only version which calculates RMSECV.

Inputs

  • x = X-block (predictor block) class "double" or "dataset", containing numeric values,
  • y = Y-block (predicted block) class "double" or "dataset", containing numeric values,
  • nhid = number of nodes in a single hidden layer ANN, or a vector of two numbers, indicating a two hidden layer ANN, giving the number of nodes in each of the two hidden layers (this takes precedence over options nhid1 and nhid2),
  • model = previously generated model (when applying model to new data).

Outputs

  • model = a standard model structure model with the following fields (see Standard Model Structure):
    • modeltype: 'ANN',
    • datasource: structure array with information about input data,
    • date: date of creation,
    • time: time of creation,
    • info: additional model information,
    • pred: 2 element cell array with
      • model predictions for each input block (when options.blockdetails='standard', x-block predictions are not saved and this will be an empty array)
    • detail: sub-structure with additional model details and results, including:
      • model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.
  • pred = a structure, similar to model, for the new data.
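
The calibration and prediction calls from the Synopsis can be sketched as follows. This is a minimal illustration, not a definitive recipe: the data are synthetic and hypothetical, the options structure lists only a few fields from the Options list (assuming the toolbox fills in the rest with defaults), and the location of the Y predictions in pred.pred{2} is an assumption based on the description of the pred field above. The call to ann is guarded so the snippet also runs where the toolbox is not installed.

```matlab
% Deterministic synthetic data so the sketch is self-contained (hypothetical values).
xcal  = sin((1:40)' * (1:5));          % 40 calibration samples, 5 variables
ycal  = xcal * [1; 2; 0; -1; 0.5];     % single-column Y-block
xtest = sin((41:50)' * (1:5));         % 10 new samples, same 5 variables

% Field names and values are taken from the Options list on this page.
options = struct('display', 'off', 'plots', 'none', 'algorithm', 'bpn');

% Only call ann when it is on the path (it is a toolbox function, not base MATLAB).
if exist('ann', 'file')
    model = ann(xcal, ycal, 3, options);   % calibrate with 3 nodes in one hidden layer
    pred  = ann(xtest, model, options);    % apply the model to new X data
    yhat  = pred.pred{2};                  % assumption: Y predictions are in the second cell
end
```

Passing nhid as a two-element vector, for example [3 2], would request two hidden layers, as described under Inputs.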

Training Termination

The ANN is trained on a calibration dataset to minimize prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.

BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV, computed on the calibration data, over a range of learning iteration values. The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified, then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This value is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.
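
The selection rule described above amounts to picking the number of learning cycles with the lowest cross-validated error. A minimal sketch of that idea, using hypothetical RMSECV values and variable names rather than anything produced by the toolbox:

```matlab
% Hypothetical RMSECV values, one per candidate number of learning cycles.
cycles = 5:5:40;
rmsecv = [0.42 0.31 0.25 0.22 0.21 0.23 0.26 0.30];

% Pick the cycle count with the lowest cross-validated error.
[bestrmsecv, idx] = min(rmsecv);
bestcycles = cycles(idx);

fprintf('Lowest RMSECV %.2f at %d learning cycles\n', bestrmsecv, bestcycles);
```

In this made-up curve the error rises again beyond the minimum, which is the overtraining behavior the criterion is meant to avoid.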

Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results may not be optimal if training is stopped prematurely by this time limit). Note these RMSE values refer to the internal preprocessed and scaled y values.
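
The three Encog stopping conditions can be illustrated on a recorded RMSE trace. This is only a sketch of the logic described above, not the Encog implementation: the RMSE and timing values are hypothetical, and measuring the rate of improvement as the difference between the RMSE 100 iterations earlier and the current RMSE is an assumption.

```matlab
% Hypothetical training history: RMSE of scaled y and elapsed seconds per iteration.
rmse    = 1 ./ (1:1000);
elapsed = 0.01 * (1:1000);

% Default values from the Options list.
terminalrmse     = 0.05;
terminalrmserate = 1e-9;
maxseconds       = 20;

reason   = 'end of trace';
stopiter = numel(rmse);
for k = 1:numel(rmse)
    if rmse(k) < terminalrmse
        reason = 'terminalrmse';      stopiter = k; break   % (a) error small enough
    elseif k > 100 && (rmse(k-100) - rmse(k)) < terminalrmserate
        reason = 'terminalrmserate';  stopiter = k; break   % (b) improvement has stalled
    elseif elapsed(k) > maxseconds
        reason = 'maxseconds';        stopiter = k; break   % (c) time limit reached
    end
end

fprintf('Stopped at iteration %d (%s)\n', stopiter, reason);
```

With this trace the first condition is met first, since the RMSE drops below 0.05 well before the rate or time limits apply.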

Cross-validation

Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window, specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window, the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line, use the crossval method to add cross-validation information to an existing model.

Options

options = a structure array with the following fields:

  • display : [ 'off' |{'on'}] Governs display
  • plots: [ {'none'} | 'final' ] governs plotting of results.
  • blockdetails : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.
  • waitbar : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.
  • algorithm : [{'bpn'} | 'encog'] ANN implementation to use.
  • nhid1 : [{2}] Number of nodes in first hidden layer.
  • nhid2 : [{0}] Number of nodes in second hidden layer.
  • learnrate : [0.125] ANN backpropagation learning rate (bpn only).
  • learncycles : [20] Number of ANN learning iterations (bpn only).
  • terminalrmse : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).
  • terminalrmserate : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).
  • maxseconds : [{20}] Maximum duration of ANN training in seconds (encog only).
  • preprocessing: {[] []} preprocessing structures for x and y blocks (see PREPROCESS).
  • compression: [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANN model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a PLS model. Compression can make the ANN more stable and less prone to overfitting.
  • compressncomp: [1] Number of latent variables (or principal components) to include in the compression model.
  • compressmd: [{'yes'} | 'no'] Use Mahalanobis Distance corrected.
  • cvmethod : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.
  • cvsplits : [{5}] Number of CV subsets.
  • cvi : M element vector with integer elements allowing user-defined subsets. cvi is a vector with the same number of elements as x has rows, i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:
      cvi(i) = -2 the sample is always in the test set,
      cvi(i) = -1 the sample is always in the calibration set,
      cvi(i) = 0 the sample is never used, and
      cvi(i) = 1,2,3... defines each test subset.
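
As an illustration of the cvi coding above, the following sketch builds a user-defined subset vector for a hypothetical X-block with 12 rows and places it in an options structure. The sample assignments are made up for the example, and whether the toolbox expects a row or a column vector is not stated on this page.

```matlab
m   = 12;                        % number of rows in x (hypothetical)
cvi = zeros(1, m);               % 0 = sample is never used

cvi(1:9) = repmat(1:3, 1, 3);    % samples 1-9 rotate through test subsets 1, 2, 3
cvi(10)  = -1;                   % sample 10 is always in the calibration set
cvi(11)  = -2;                   % sample 11 is always in the test set
                                 % sample 12 stays 0 and is never used

assert(numel(cvi) == m);         % length(cvi) must equal size(x,1)

options = struct('cvi', cvi);    % field name taken from the Options list
disp(cvi);
```

According to the Training Termination section, cvi takes priority over cvmethod when both are supplied.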

See Also

analysis, crossval, lwr, modelselector, pls, pcr, svm