Lda
Latest revision as of 15:32, 6 December 2023
Purpose
Linear Discriminant Analysis.
Synopsis
- lda - Launches an Analysis window with the LDA method selected
- model = lda(x,y,ncomp,options)
- model = lda(x,ncomp,options)
- pred = lda(x,model,options)
- valid = lda(x,y,model,options)
Please note that the recommended way to build and apply an LDA model from the command line is to use the Model Object. Please see this wiki page on building and applying models using the Model Object.
Description
Linear Discriminant Analysis (LDA) is a supervised machine learning method for classification and dimensionality reduction. It works by finding the linear combinations of features that best separate two or more classes of samples. The main idea is to project the data onto a lower-dimensional space that maximizes the separation between different classes while minimizing the intra-class separation of the samples. LDA assumes that the data within each class is normally (Gaussian) distributed and that all classes share the same covariance matrix, but LDA has proven effective in scenarios where these assumptions are not strictly met. We have included two solvers for LDA: 'eig' and 'svd' (default).
- Generalized Eigenvalue Problem ('eig' solver): The 'eig' solver operates by computing the eigenvectors and eigenvalues of the scatter matrices, using these eigenvectors to project data onto a lower-dimensional space. This approach is more suitable when the number of variables (features) is smaller than the number of samples, offering better numerical stability in such scenarios. However, it can struggle with singular or nearly singular scatter matrices, a situation often encountered in datasets where variables far outnumber samples or when multicollinearity is present, e.g. spectra.
- Regularization for the 'eig' solver: To improve the performance and stability of the 'eig' solver, we have incorporated ridge regularization into our LDA implementation. Ridge (L2) regularization involves adding a small value λ (regularization parameter or penalty) to the diagonal elements of the within-class scatter matrix. This addition helps prevent the matrix from being singular, a common issue in datasets with either more variables than samples or with highly collinear variables (e.g. spectra). The benefit of this regularization is twofold: it not only stabilizes the LDA calculations, especially in high-dimensional spaces, but also reduces the risk of overfitting, thereby enhancing the model's ability to generalize to new data. Regularization is applied by default with a lambda value λ=0.001. This value typically works well with auto-scaled data, but λ should be optimized for each dataset. To turn off regularization, simply set λ = 0.
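The ridge regularization described above can be sketched as follows. This is a minimal illustration in Python/NumPy of the idea, not the toolbox's MATLAB code; the function names are hypothetical.

```python
import numpy as np

def within_class_scatter(X, y):
    """Within-class scatter matrix Sw: sum of each class's scatter
    around its own class mean."""
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        Sw += Xc.T @ Xc
    return Sw

def regularize(Sw, lam=0.001):
    """Ridge (L2) regularization: add lam to the diagonal of Sw.
    lam=0.001 mirrors the default described above; lam=0 disables it."""
    return Sw + lam * np.eye(Sw.shape[0])
```

With collinear variables the raw Sw is rank-deficient; adding λ to the diagonal makes it full rank, which is exactly why the 'eig' solver benefits from this step.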
- For example, calibrating an LDA model on the mean-centered OliveOilData.mat dataset with λ set to zero results in an unstable LDA model (e.g., extremely large eigenvalues). Repeating the calculation with the default λ = 0.001 produces a robust LDA model.
- Singular Value Decomposition (‘svd’ solver) (Default): On the other hand, the 'svd' solver utilizes singular value decomposition, which doesn't explicitly compute scatter matrices, making it more robust in handling singular or nearly singular matrices. This method is more numerically stable and is better suited for high-dimensional data or sparse datasets. While potentially more computationally demanding, especially for larger datasets, 'svd' is generally preferred in cases where numerical stability and handling of a higher number of features are critical.
- Choosing Between 'eig' and 'svd'
- The choice between these two solvers often depends on the specific characteristics of the data at hand:
- If the dataset is large and the number of features is significantly less than the number of samples, 'eig' might be a good choice.
- For high-dimensional data, or in cases where numerical stability is a concern, 'svd' is usually preferred.
- In practical applications, it's often a good idea to experiment with both methods to see which one performs better for your specific dataset and problem.
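The rule of thumb above can be written down as a small helper. This is purely illustrative (the crossover point depends on the dataset, and the function name is hypothetical):

```python
def pick_lda_solver(n_samples, n_features):
    """Heuristic following the guidance above: 'eig' when features are
    clearly fewer than samples, otherwise 'svd' for numerical stability.
    A rule of thumb, not a hard rule -- try both in practice."""
    return 'eig' if n_features < n_samples else 'svd'
```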
Probability Estimation and Classification
In the LDA algorithm, the estimation of the probability of each sample belonging to each class is grounded in Bayes' Theorem. This theorem provides a way to update our probability estimates for a hypothesis (in this case, class membership) based on new evidence (the sample's features). The key steps are as follows:
- Prior Probabilities: The algorithm begins by establishing prior probabilities for each class. These priors can be either provided or calculated based on the frequency of each class in the training data.
- Log-likelihood: For the log-likelihood calculation, LDA assumes that all classes are normally distributed; therefore, LDA uses a multivariate Gaussian density function to model each class. In a nutshell, the Mahalanobis distance between a sample and the mean of each class is calculated using the shared covariance matrix, which accounts for the exponential term in the Gaussian distribution.
- Complete log-likelihood: Bayes' Theorem combines the latter log-likelihoods with the class priors to calculate the complete log-likelihood. The complete (posterior) log-likelihood represents the updated beliefs about the sample's class membership after observing its features.
- Probabilities: To derive the probabilities, the calculated complete log-likelihoods are normalized for each sample using softmax across all classes to ensure the probabilities sum up to one. This normalization step is crucial for making the probabilities comparable and meaningful.
- Classification: The predicted class for each sample is determined by selecting the class with the highest probability. This decision reflects the class that is statistically “most probable”.
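The steps above can be sketched numerically as follows. This is an illustrative Python/NumPy version of the documented procedure (shared-covariance Gaussian log-likelihood, plus log-priors, normalized with softmax), not the toolbox's own code:

```python
import numpy as np

def lda_posteriors(X, means, cov, priors):
    """Posterior class probabilities: Gaussian log-likelihood per class
    with a shared covariance matrix, combined with log-priors (Bayes),
    then softmax-normalized so each row sums to one."""
    icov = np.linalg.inv(cov)
    loglik = np.empty((X.shape[0], len(means)))
    for k, mu in enumerate(means):
        d = X - mu
        # Mahalanobis term from the shared-covariance Gaussian log-density
        loglik[:, k] = -0.5 * np.einsum('ij,jk,ik->i', d, icov, d)
    logpost = loglik + np.log(priors)              # combine with class priors
    logpost -= logpost.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logpost)
    return p / p.sum(axis=1, keepdims=True)        # softmax normalization

# Classification then picks the most probable class per sample:
# predicted = lda_posteriors(X, means, cov, priors).argmax(axis=1)
```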
Inputs
- x = X-block (predictor block), class "double" or "dataset".
- y = Y-block
- OPTIONAL if x is a dataset containing classes for sample mode (mode 1)
- otherwise, y is one of the following:
- (A) column vector of sample classes for each sample in x
- (B) a logical array with '1' indicating class membership for each sample (rows) in one or more classes (columns), or
- (C) a cell array of class groupings of classes from the x-block data. For example: {[1 2][3]} would model classes 1 and 2 as a single group against class 3.
- ncomp = the number of used LDA components or discriminant functions (positive integer scalar).
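The class-grouping input (C) above can be illustrated as a label mapping. This Python sketch only mimics the effect of a grouping like {[1 2] [3]}; it is not part of the lda function itself, and the helper name is hypothetical:

```python
def apply_groupings(classes, groupings):
    """Map original class labels to group indices: e.g. [[1, 2], [3]]
    models classes 1 and 2 as one group (index 0) against class 3
    (index 1), mirroring the cell-array grouping input (C) above."""
    lookup = {c: g for g, group in enumerate(groupings) for c in group}
    return [lookup[c] for c in classes]
```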
Optional Inputs
- options = an optional input options structure.
Outputs
- model = standard model structure containing the LDA model.
- pred = structure array with predictions
- valid = structure array with predictions, includes known class information (Y block data) of test samples.
Options
options = a structure that can contain the following fields:
- display: [ 'off' | {'on'} ] governs level of display to command window.
- plots: [ 'none' | {'final'} ] governs level of plotting.
- preprocessing: {[] []} preprocessing structures for x and y blocks (see PREPROCESS).
- priorprob: [ ] Vector of prior probabilities of observing each class. If any class prior is "Inf", the frequency of observation of that class in the calibration is used as its prior probability. If all priors are Inf, this has the effect of providing the fewest incorrect predictions, assuming that the probability of observing a given class in future samples is similar to the frequency of that class in the calibration set. The default [] uses all ones, i.e. equal priors.
- classset: [ 1 ] indicates which class set in x to use when no y-block is provided.
- algorithm: [ 'eig' | {'svd'} ] LDA solver to use; 'svd' is the default.
- blockdetails: [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.
- 'all' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.
- strictthreshold: Probability threshold value to use in strict class assignment, see Sample_Classification_Predictions#Class_Pred_Strict. Default = 0.5.
- lambda: [ 0.001 ] Regularization parameter only applied to the ‘eig’ solver.
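The priorprob behavior described above (Inf entries replaced by observed class frequencies, empty vector meaning equal priors) can be sketched as follows. This is an illustration of the documented behavior in Python, not the toolbox code, and the function name is hypothetical:

```python
import numpy as np

def resolve_priors(priorprob, y):
    """Resolve a priorprob vector: Inf entries become the observed
    frequency of that class in the calibration data y; an empty or
    missing vector means all ones, i.e. equal priors."""
    classes = np.unique(y)
    if priorprob is None or len(priorprob) == 0:
        return np.ones(len(classes))  # default []: equal priors
    priors = np.asarray(priorprob, dtype=float)
    freq = np.array([(y == c).mean() for c in classes])
    inf_mask = np.isinf(priors)
    priors[inf_mask] = freq[inf_mask]  # Inf -> calibration frequency
    return priors
```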
See Also
analysis, compressmodel, crossval, discrimprob, knn, modelselector, plsda, plsdaroc, plsdthres, preprocess, simca, svmda, vip, EVRIModel_Objects