Svm and Model Building: Analysis Phases Overview

===Purpose===


SVM performs Support Vector Machine regression using LIBSVM. Use SVMDA for SVM classification ([[Svmda]]). The [[Svmda]] page also contains more detailed information, much of which applies equally to SVM regression.


===Synopsis===


:model = svm(x,y,options);          %identifies model (calibration step).
:pred = svm(x,model,options);      %makes predictions with a new X-block
:pred = svm(x,y,model,options);    %performs a "test" call with a new X-block and known y-values
:svm % Launches an Analysis window with SVM as the selected method.
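For example, a minimal command-line session might look like the following sketch, where ''xcal'', ''ycal'', and ''xnew'' are hypothetical calibration and prediction data arrays, and the svm('options') call assumes the standard PLS_Toolbox convention for retrieving a default options structure:

 % Sketch: calibrate an SVM regression model, then apply it to new data.
 opts = svm('options');            % default options (standard toolbox convention)
 opts.display = 'off';             % suppress command-window output
 model = svm(xcal, ycal, opts);    % calibration step
 pred  = svm(xnew, model, opts);   % apply the model to a new X-block
 yhat  = pred.pred{2};             % predicted y values (see Outputs, below)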


===Description===


The SVM function or analysis method performs calibration and application of Support Vector Machine (SVM) regression models. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block, allowing prediction of the continuous y-block variable. It is recommended that classification be done through the svmda function.


Svm is implemented using the LIBSVM package, which provides both epsilon-support vector regression (epsilon-SVR) and nu-support vector regression (nu-SVR). Linear and Gaussian radial basis function kernel types are supported by this function.


Note: Calling svm with no inputs starts the graphical user interface (GUI) for this analysis method.  


====Inputs====


* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,
* '''model''' = previously generated model (when applying model to new data).
 
====Outputs====
 
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):
** '''modeltype''': 'SVM',
** '''datasource''': structure array with information about input data,
** '''date''': date of creation,
** '''time''': time of creation,
** '''info''': additional model information,
** '''pred''': 2 element cell array with
*** model predictions for each input block (when options.blockdetails='standard', x-block predictions are not saved and this will be an empty array)
** '''detail''': sub-structure with additional model details and results, including:
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java). Note that the number of support vectors used is given by model.detail.svm.model.l. It is useful to check this value: if most of the calibration samples are used as support vectors it can indicate overfitting, and if there are no support vectors the model has failed to fit (all prediction values will equal a constant value, a weighted mean).
*** model.detail.svm.cvscan: Results of CV parameter scan
*** model.detail.svm.svindices: Indices of X-block samples which are support vectors.
 
* '''pred''' a structure, similar to '''model''' for the new data.
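As noted under '''detail''' above, the support-vector count is a quick diagnostic for a badly fitted model. A minimal sketch of that check, where ''xcal'' is a hypothetical calibration X-block and the 0.9 threshold is illustrative rather than a toolbox rule:

 % Sketch: use the support-vector count as an overfitting diagnostic.
 nsv  = model.detail.svm.model.l;    % number of support vectors
 ncal = size(xcal, 1);               % number of calibration samples
 if nsv == 0
     disp('No support vectors: predictions will equal a constant (weighted mean).')
 elseif nsv > 0.9*ncal               % illustrative threshold, not a toolbox rule
     disp('Nearly all samples are support vectors: possible overfitting.')
 end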
 
===Options===
''options'' =  a structure array with the following fields:
 
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,
* '''preprocessing''': {[] []}  preprocessing structures for x and y blocks (see PREPROCESS).
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the SVM model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the svmtype). Compression can make the SVM more stable and less prone to overfitting.
* '''compressncomp''': [1]  Number of latent variables (or principal components) to include in the compression model.
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' = x- and y-blocks.
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.
* '''svmtype''': [ {'epsilon-svr'} | 'nu-svr' ] Type of SVM to apply. The default is 'epsilon-svr' for regression.
* '''probabilityestimates''': [ 0 | {1} ], whether to train the SVR model for probability estimates, 0 or 1 (default 1).
 
* '''cvtimelimit''': Set a time limit (seconds) on each individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 10.
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5. This option is only used when the "cvi" option is empty.
* '''cvi''': {{}} Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for the error estimate on the final selected parameter values. If empty (the default), then random cross-validation is done based on the "splits" option.
 
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.
* '''epsilon''': Value(s) to use for LIBSVM 'p' parameter (epsilon in loss function). Default is the set of values [1.0, 0.1, 0.01].
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].
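Each of these four options accepts either a vector of candidate values, which triggers the cross-validated grid search described under Algorithm below, or a single value. A sketch of supplying a coarser custom grid than the defaults, where ''xcal'' and ''ycal'' are hypothetical calibration data:

 % Sketch: replace the default search grids with coarser ones.
 opts = svm('options');           % assumes the standard options-retrieval convention
 opts.gamma   = 10.^(-4:2:0);     % 3 candidate gamma values: 1e-4, 1e-2, 1
 opts.cost    = 10.^(-1:2);       % 4 candidate cost values: 0.1, 1, 10, 100
 opts.epsilon = [0.1 0.01];       % 2 candidate epsilon values
 opts.splits  = 5;                % 5-fold random cross-validation for the scan
 model = svm(xcal, ycal, opts);   % grid search, then final model at best settings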
 
===Algorithm===
Svm uses the LIBSVM implementation with the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options.
 
The default SVM parameters cost, epsilon, nu and gamma are value ranges rather than single values. The svm function searches over the grid of applicable parameter combinations, using cross-validation to select the optimal SVM parameter values, and then builds an SVM model using those values. This is the recommended usage. The user can avoid this grid search, however, by passing in single values for these parameters. If you are using the command-line svm function to build a model, the optimal SVM parameters are shown in model.detail.svm.cvscan.best. If you are using the graphical Analysis SVM, the optimal parameters are reported in the summary window which is shown when you mouse over the model icon once the model is built.
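For example, once a scan has located good settings, they can be fixed in a later calibration to skip the grid search entirely (a sketch; the numeric values are arbitrary placeholders):

 % Sketch: read the optimal parameters found by a previous grid search ...
 best = model.detail.svm.cvscan.best;  % best parameter values from the scan
 % ... then pass single values so no search is performed on the next run.
 opts.gamma   = 0.01;                  % single value: no scan over gamma
 opts.cost    = 10;                    % single value: no scan over cost
 opts.epsilon = 0.1;                   % single value: no scan over epsilon
 model2 = svm(xcal, ycal, opts);       % builds directly at these settings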
 
====Model building performance====
Building a single SVM model can sometimes be slow, especially if the calibration dataset is large. Using ranges for the SVM parameters to search for the optimal parameter combination increases the final model-building time significantly. If cross-validation is used, the calculation time increases again, possibly dramatically if the number of CV subsets is large. Some suggestions for faster SVM building include:
:1) Turning CV off ("none") during preliminary analyses. This is MUCH faster, and cross-validation is still performed internally for the parameter search using a default "Random Subsets" scheme with 5 data splits and 1 iteration.
:2) Using a coarse grid of SVM parameter values to search over for optimal values.
:3) Choosing the CV method carefully, at least initially. For example, use "Random Subsets" with a small number of data splits (e.g. 5) and a small "Number of Iterations" (e.g. 1).
:4) Using the compression option if the number of variables is large (see the sketch below).
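A sketch of suggestion 4, using the '''compression''' and '''compressncomp''' options documented above (the number of components is an illustrative choice):

 % Sketch: PCA compression of a wide X-block before the SVM calculation.
 opts = svm('options');           % assumes the standard options-retrieval convention
 opts.compression   = 'pca';      % compress x-block information with a PCA model
 opts.compressncomp = 3;          % keep 3 components (illustrative choice)
 model = svm(xcal, ycal, opts);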
 
====epsilon-SVR and nu-SVR====
There are two commonly used versions of SVM regression, 'epsilon-SVR' and 'nu-SVR'. The original SVM formulation for regression (SVR) used parameters C [0, inf) and epsilon [0, inf) to apply a penalty to the optimization for points which were not correctly predicted. An alternative version of SVM regression was later developed where the epsilon penalty parameter was replaced by an alternative parameter, nu [0,1], which applies a slightly different penalty. The main motivation for the nu version of SVM is that it has a more meaningful interpretation: nu represents an upper bound on the fraction of training samples which are errors (badly predicted) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C or epsilon.
Epsilon and nu are just different versions of the penalty parameter; the same optimization problem is solved in either case, so it should not matter which form of SVM you use. PLS_Toolbox uses epsilon by default since this was the original formulation and is the most commonly used form. For more details on 'nu' SVM regression see [http://www.csie.ntu.edu.tw/~cjlin/papers/newsvr.pdf].
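To use the nu formulation instead of the default, the relevant options documented above are '''svmtype''' and '''nu'''; a minimal sketch:

 % Sketch: switch from the default epsilon-SVR to nu-SVR.
 opts = svm('options');           % assumes the standard options-retrieval convention
 opts.svmtype = 'nu-svr';         % nu formulation of SVM regression
 opts.nu      = [0.2 0.5 0.8];    % candidate nu values (the documented defaults)
 model = svm(xcal, ycal, opts);   % nu and cost (plus gamma for an rbf kernel)
                                  % now control the fit; epsilon is not used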
 
The user must provide parameters (or parameter ranges) for SVM regression as:
:*'epsilon-SVR':
::'''epsilon''','''C''',  (using linear kernel), or
::'''epsilon''','''C''', '''gamma''' (using radial basis function kernel),
 
:*'nu-SVR':
::'''nu''', '''C''',    (using linear kernel), or
::'''nu''', '''C''', '''gamma''' (using radial basis function kernel).
 
====SVM Parameters====
 
* '''cost''': Cost [0, inf) represents the penalty associated with errors larger than epsilon. Increasing cost value causes closer fitting to the calibration/training data.
* '''gamma''': Kernel ''gamma'' parameter controls the shape of the separating hyperplane. Increasing gamma usually increases number of support vectors.
* '''epsilon''': In training the regression function there is no penalty associated with points which are predicted within distance epsilon from the actual value. Decreasing epsilon forces closer fitting to the calibration/training data.
* '''nu''': Nu (0, 1] indicates a lower bound on the number of support vectors to use, given as a fraction of total calibration samples, and an upper bound on the fraction of training samples which are errors (poorly predicted).
 
===See Also===
 
[[analysis]], [[ann]], [[mlr]], [[lwr]], [[pls]], [[pcr]], [[svmda]], [[preprocess]]



==Analysis Phases==

The Analysis window serves as the core interface to the Solo modeling and analysis functions. You create your models in an Analysis window, apply models in this window, and also analyze and explore the models in this window. Three phases are required to completely carry out modeling and analysis in the Analysis window: the [[ModelBuilding_AnalysisPhasesOverview#Calibration phase|Calibration phase]], the [[ModelBuilding_AnalysisPhasesOverview#Test and Validation phase|Test and Validation phase]], and the [[ModelBuilding_AnalysisPhasesOverview#Model Application phase|Model Application phase]].

===Calibration phase===

The Calibration phase consists of model building and exploratory analysis. In this phase, which affects only the Calibration side of the Status pane, you must load data into the X calibration control. This data is referred to as x block data, and it is a set of multivariate measurements on your data samples. Some analysis methods also require you to load data into the Y calibration control. This data is referred to as y block data, and it is a set of secondary or reference measurements on the same data samples. During analysis, you identify any patterns or trends in the data, along with any other information that you consider relevant (for example, any relationships that might exist between the x data and the y data), and use this information to build a model. See [[ModelBuilding_CalibrationPhase|Building the Model in the Calibration Phase]].

===Test and Validation phase===

The Test and Validation phase consists of applying the model that you built in the Calibration phase to your validation data, which is data with known physical and/or chemical characteristics. In this phase, which affects the Validation side of the Status pane, you must load data into the X validation control and, if applicable, the Y validation control. As in the Calibration phase, the data that you load into the X control is referred to as x block data, a set of multivariate measurements on your data samples, and the data that you load into the Y control is referred to as y block data, a set of secondary or reference measurements on the same data samples. You use this validation data to confirm that the model that you built captures valid patterns and trends in the data. You test and validate the model by applying it to the validation data and verifying that the test results are acceptable. For example, PCA analysis is typically used for pattern recognition. A correctly built PCA model, therefore, can identify the instances for which the expected pattern has been broken, such as material that does not meet specifications. During the Test and Validation phase of a PCA model, some of the validation data samples should meet specifications and some should be "out of spec." A well-built PCA model will identify or flag these "out of spec" samples. If the test results are acceptable, you can continue to the next phase, the Model Application phase. If the test results are not acceptable, you must return to the Calibration phase. See [[ModelApplication_ValidationPhase|Applying the Model in the Test and Validation Phase]].

===Model Application phase===

The Model Application phase consists of applying the tested and verified model to new data, which is data with unknown characteristics; the results of applying the model therefore cannot be known in advance. If your test results were acceptable in the Test and Validation phase, however, then the results from the Model Application phase are also likely to be accurate. For example, a correctly built PCA model that was successfully tested and validated in the Test and Validation phase should identify "out of spec" samples during the Model Application phase. See [[ModelApplication_ValidationPhase|Applying the Model in the Test and Validation Phase]].