Diviner: Difference between revisions

From Eigenvector Research Documentation Wiki
Line 3: Line 3:
==Diviner==
==Diviner==


'''Diviner''' is a semi-automated machine learning (Semi-AutoML) tool specifically designed to enhance the development of multivariate calibration models for linear regression. Unlike traditional AutoML systems that entirely automate the machine learning workflow, often at the expense of domain-specific insights and transparency, Diviner strikes a balance between automation and expert involvement. It allows users to leverage automation efficiently while maintaining control over critical decision points in the modeling process. This hybrid approach addresses key shortcomings of AutoML, such as the lack of domain knowledge, overfitting, and limited customization, by integrating user input to guide model development more effectively.
'''Diviner''' is a semi-automated machine learning (Semi-AutoML) tool specifically designed to enhance the development of linear multivariate regression models. Unlike traditional AutoML systems that entirely automate the machine learning workflow —often sacrificing domain-specific insights and transparency— Diviner strikes a balance between automation and expert involvement. It allows users to efficiently leverage automation while retaining control over critical decision points in the modeling process. This hybrid approach addresses key shortcomings of AutoML, such as the lack of domain knowledge, overfitting, and limited customization, by incorporating user input to guide model development more effectively.


Diviner is a tool designed for calibrating linear models, specifically Partial Least Squares ('''PLS''') and regularized multiple linear regression ('''MLR''') models, such as Elastic Net. It provides a comprehensive workflow that includes outlier assessment, a grid search for preprocessing methods, variable selection, and user-guided model refinement.
Diviner is a tool designed for calibrating linear models, specifically Partial Least Squares ('''PLS''') and regularized multiple linear regression ('''MLR''') models, such as Elastic Net. It provides a comprehensive workflow that includes outlier assessment, a grid search for preprocessing methods, variable selection, and user-guided model refinement.


Due to its extensive library of preloaded preprocessing methods, Diviner is particularly suited for chemometrics and spectral data analysis. However, users can also create custom preprocessing libraries, making Diviner a versatile tool for linear regression tasks with various data types.
With its extensive library of preloaded preprocessing methods, Diviner is particularly suited for chemometrics and spectral data analysis. However, users can also create custom preprocessing libraries, making Diviner a versatile tool for linear regression tasks across various data types.


[[File:Diviner workflow.png|thumb]]
[[File:Diviner workflow.png|thumb]]
Line 14: Line 14:


====Exploratory Module:====
====Exploratory Module:====
'''Data loading:''' The process begins with data being loaded into the system. If a test dataset is available, it should be loaded at this time so it can be used to evaluate the performance of the models in that test dataset. Note: if a test set is not loading before the run, applying the models to the test set a posteriori won't be possible.  
'''Data loading:''' The process begins with data being loaded into the system. If a test dataset is available, it should be loaded at this time to evaluate the performance of the models. Note: if a test set is not loaded before the run, applying the models to the test set later will not be possible.


'''Choice of algorithms:''' PLS is set by default, and MLR must be activated in the options. While the optimization of the PLS models is part of the initial grid search in diviner, MLR (Elastic Nets) on the other hand, performs an optimization routine to find the best penalty for each model.  
'''Choice of algorithms:'''PLS is set as the default algorithm, while MLR must be activated in the options. PLS model optimization of number of components (LVs) is included in Diviner's initial grid search. MLR (Elastic Net) performs a separate optimization routine to find the best penalty for each MLR model in the grid search.


'''Preprocessing:''' This step involves selecting multiple preprocessing methods depending on the data type and application. In the same step, preprocessing methods used for outlier assessment must be selected.  
'''Preprocessing:''' This step involves selecting multiple preprocessing methods based on the data type and application. Preprocessing methods for outlier assessment must also be selected at this stage.


'''Cross-validation (CV):''' Diviner supports all modes of cross-validation in the PLS_Toolbox and Solo. This step also sets the number of PLS latent variables (LVs) that will be used in the initial grid search.  
'''Cross-validation (CV):''' Diviner supports all cross-validation modes available in the PLS_Toolbox and Solo. This step also sets the number of PLS latent variables (LVs) to be used in the initial grid search.


'''Auto-Variable Selection:''' Variable selection in the first module uses two fast algorithms VIP (Variable Importance in Projection) and sRatio (Selectivity Ratio). https://www.wiki.eigenvector.com/index.php?title=Selectvars
'''Auto-Variable Selection:''' Variable selection in the initial module uses two fast algorithms: Variable Importance in Projection (VIP) and Selectivity Ratio (sRatio). Learn more about these algorithms [[Selectvars|here]].
   
   
'''Outlier assessment:''' Diviner automatically assesses possible outliers in the dataset using a combination of robust PCA and PLS with different preprocessing methods previously chosen by the user.  
'''Outlier assessment:''' Diviner automatically identifies potential outliers in the dataset using a combination of robust PCA and PLS with the preprocessing methods previously selected by the user.


'''Preliminary Output & First Model Selection:''' after the full grid-search of PLS (and MLR) models based on preprocessing recipes, variable selection, and Latent Variable combinations. The initial output is generated as a plot of the Overfit (RMSECV/RMSEC) vs. RMSECV. The user manually selects the best models from this plot for refinement or further analysis.  
'''Preliminary Output & First Model Selection:''' After the full grid search of PLS (and MLR) models—based on preprocessing recipes, variable selection, and Latent Variable combinations—the initial output is generated as a plot of Overfit (RMSECV/RMSEC) vs. RMSECV. The user manually selects the best models from this plot for refinement or further analysis.


====Refinement Module:====
====Refinement Module:====
Refinement consists of further variable selection and outlier inclusion. Since these two procedures are time consuming and not ideal to perform on a large number of models it follows the model selection from the grid-search on the first module.  
'''Refinement Process:''' This involves further variable selection and outlier reinclusion. Given that these procedures are time-consuming and not ideal for a large number of models, they are performed only on the models selected from the initial grid search.


'''Further Variable Selection:''' In this step, additional refinement of the variable selection is carried out, using interval Partial Least Squares (iPLS) to fine-tune which variables contribute to the model.
'''Further Variable Selection:''' Additional refinement of variable selection is conducted using interval Partial Least Squares (iPLS) to fine-tune which variables contribute to the model.


'''Outlier Reinclusion:''' If any outliers previously excluded are significant upon further analysis, they might be re-evaluated and potentially reintegrated into the models.
'''Outlier Reinclusion:''' If any previously excluded outliers are found to be significant upon further analysis, they may be re-evaluated and potentially reintegrated into the models.


'''Final Model Selection and Output:''' After all refinements are made, the best-performing models are selected by the user for the final output.
'''Final Model Selection and Output:''' After all refinements are completed, the user selects the best-performing models for the final output.


==Text==
==Text==


'''[[File: image_file_name.png | 500px]]'''
'''[[File: image_file_name.png | 500px]]'''

Revision as of 08:03, 10 September 2024

Page under construction

Diviner

Diviner is a semi-automated machine learning (Semi-AutoML) tool specifically designed to enhance the development of linear multivariate regression models. Unlike traditional AutoML systems that entirely automate the machine learning workflow, often sacrificing domain-specific insights and transparency, Diviner strikes a balance between automation and expert involvement. It allows users to efficiently leverage automation while retaining control over critical decision points in the modeling process. This hybrid approach addresses key shortcomings of AutoML, such as the lack of domain knowledge, overfitting, and limited customization, by incorporating user input to guide model development more effectively.

Diviner is a tool designed for calibrating linear models, specifically Partial Least Squares (PLS) and regularized multiple linear regression (MLR) models, such as Elastic Net. It provides a comprehensive workflow that includes outlier assessment, a grid search for preprocessing methods, variable selection, and user-guided model refinement.

With its extensive library of preloaded preprocessing methods, Diviner is particularly suited for chemometrics and spectral data analysis. However, users can also create custom preprocessing libraries, making Diviner a versatile tool for linear regression tasks across various data types.

Diviner workflow.png

Data Workflow

Exploratory Module:

Data loading: The process begins with data being loaded into the system. If a test dataset is available, it should be loaded at this time so that the models can be evaluated on it. Note: if a test set is not loaded before the run, applying the models to the test set later will not be possible.

Choice of algorithms: PLS is the default algorithm, while MLR must be activated in the options. For PLS, optimization of the number of latent variables (LVs) is part of Diviner's initial grid search. For MLR (Elastic Net), a separate optimization routine finds the best penalty for each MLR model in the grid search.

Preprocessing: This step involves selecting multiple preprocessing methods based on the data type and application. Preprocessing methods for outlier assessment must also be selected at this stage.

Cross-validation (CV): Diviner supports all cross-validation modes available in the PLS_Toolbox and Solo. This step also sets the number of PLS latent variables (LVs) to be used in the initial grid search.
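
As a rough illustration of how the number of LVs set here combines with the preprocessing methods chosen earlier, the sketch below enumerates one candidate model per (preprocessing recipe, LV count) pair. It assumes a full Cartesian product; the recipe names, `max_lvs`, and `build_grid` are hypothetical and are not Diviner options or functions. Other grid axes, such as variable selection, could be added in the same way.

```python
from itertools import product

# Hypothetical inputs: these names are illustrative, not Diviner options.
preprocessing_recipes = ["mean center", "autoscale", "SNV + mean center"]
max_lvs = 4  # assumed maximum number of PLS latent variables


def build_grid(recipes, max_lvs):
    """Return one candidate per (preprocessing recipe, LV count) pair."""
    return [
        {"preprocessing": recipe, "n_lvs": n_lvs}
        for recipe, n_lvs in product(recipes, range(1, max_lvs + 1))
    ]


grid = build_grid(preprocessing_recipes, max_lvs)
print(len(grid))  # 3 recipes x 4 LV counts = 12 candidates
```

The grid grows multiplicatively with each added axis, which is one reason the slower refinement steps are reserved for a small set of selected models.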

Auto-Variable Selection: Variable selection in the initial module uses two fast algorithms: Variable Importance in Projection (VIP) and Selectivity Ratio (sRatio). For details on these algorithms, see the Selectvars page: https://www.wiki.eigenvector.com/index.php?title=Selectvars
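
For readers unfamiliar with VIP, the following is a minimal sketch of the commonly used textbook definition for a single-response PLS model. It assumes the weights `W` (variables x LVs), scores `T` (samples x LVs), and y-loadings `q` (one per LV) are already available from a fitted model; the function name and the random example inputs are hypothetical, and Diviner's own implementation may differ.

```python
import numpy as np


def vip_scores(W, T, q):
    """Textbook VIP scores for a single-response PLS model.

    W: (n_vars, n_lvs) X-weights, T: (n_samples, n_lvs) scores,
    q: (n_lvs,) y-loadings. All are assumed inputs from a fitted model.
    """
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=float)
    q = np.asarray(q, dtype=float).ravel()
    n_vars = W.shape[0]
    # Sum of squares of y explained by each latent variable.
    ssy = q ** 2 * np.sum(T ** 2, axis=0)
    # Squared weights after normalizing each weight vector to unit length.
    w_unit_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(n_vars * (w_unit_sq @ ssy) / ssy.sum())


# Illustrative random inputs only; not real model output.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))
T = rng.normal(size=(20, 2))
q = np.array([1.5, 0.4])

vip = vip_scores(W, T, q)
# A common rule of thumb keeps variables with VIP > 1; the threshold
# Diviner applies is not stated in this page.
selected = np.flatnonzero(vip > 1.0)
```

With this definition the squared VIP scores average to 1 across variables, which is why 1 is often used as a cutoff.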

Outlier assessment: Diviner automatically identifies potential outliers in the dataset using a combination of robust PCA and PLS with the preprocessing methods previously selected by the user.

Preliminary Output & First Model Selection: After the full grid search of PLS (and MLR) models, which covers combinations of preprocessing recipes, variable selection, and number of latent variables, the initial output is generated as a plot of Overfit (RMSECV/RMSEC) vs. RMSECV. The user manually selects the best models from this plot for refinement or further analysis.
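
The two plot axes can be computed as sketched below, where RMSEC is the root mean square error of calibration and RMSECV is the root mean square error of cross-validation. The candidate names and error values are made up for illustration, and the sort at the end is only one possible way to shortlist models; in Diviner the selection is made manually by the user from the plot.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def overfit_ratio(rmsec, rmsecv):
    """Overfit as defined in the text: RMSECV divided by RMSEC."""
    return rmsecv / rmsec


# Hypothetical grid-search results (values are illustrative only).
candidates = [
    {"name": "recipe A, 3 LVs", "rmsec": 0.50, "rmsecv": 0.55},
    {"name": "recipe B, 8 LVs", "rmsec": 0.20, "rmsecv": 0.60},
    {"name": "recipe C, 5 LVs", "rmsec": 0.40, "rmsecv": 0.48},
]
for c in candidates:
    c["overfit"] = overfit_ratio(c["rmsec"], c["rmsecv"])

# Models near the lower-left corner of the plot (low RMSECV, overfit
# close to 1) are usually the most attractive candidates.
shortlist = sorted(candidates, key=lambda c: (c["rmsecv"], c["overfit"]))
for c in shortlist:
    print(f'{c["name"]}: RMSECV={c["rmsecv"]:.2f}, overfit={c["overfit"]:.2f}')
```

In this made-up example, recipe B has the lowest calibration error but an overfit ratio near 3, which suggests that its calibration fit does not carry over to cross-validation.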

Refinement Module:

Refinement Process: This involves further variable selection and outlier reinclusion. Given that these procedures are time-consuming and not ideal for a large number of models, they are performed only on the models selected from the initial grid search.

Further Variable Selection: Additional refinement of variable selection is conducted using interval Partial Least Squares (iPLS) to fine-tune which variables contribute to the model.

Outlier Reinclusion: If any previously excluded outliers are found to be significant upon further analysis, they may be re-evaluated and potentially reintegrated into the models.

Final Model Selection and Output: After all refinements are completed, the user selects the best-performing models for the final output.
