From Eigenvector Research Documentation Wiki
Revision as of 13:00, 9 September 2024

Page under construction

Diviner

Diviner is a semi-automated machine learning (Semi-AutoML) tool specifically designed to enhance the development of multivariate calibration models for linear regression. Unlike traditional AutoML systems that entirely automate the machine learning workflow, often at the expense of domain-specific insights and transparency, Diviner strikes a balance between automation and expert involvement. It allows users to leverage automation efficiently while maintaining control over critical decision points in the modeling process. This hybrid approach addresses key shortcomings of AutoML, such as the lack of domain knowledge, overfitting, and limited customization, by integrating user input to guide model development more effectively.

Diviner is focused on calibrating partial least squares (PLS) and regularized (elastic net) multiple linear regression (MLR) models. It offers a workflow combining outlier overview, preprocessing grid search, and variable selection with user-guided model refinement.

Diviner workflow.png

Data Workflow

Exploratory Module:

• Data Loading and Cross-Validation (CV): The process begins with data being loaded into the system. If a test set is available, it should be loaded at this time as well, since this is the only way to use the test set. Cross-validation (CV) is applied to estimate the reliability of the model's performance by repeatedly splitting the data into training and testing subsets.
• Preprocessing: This step prepares the data by applying transformations and normalizations to improve model performance. These can include scaling, centering, or derivatives.
• Outlier Auto-Detection: Outliers are automatically detected so that data points that could skew the model's accuracy can be identified and potentially excluded.
• Auto-Variable Selection: This automated step selects the most relevant variables from the dataset, reducing dimensionality and focusing the model on significant predictors.
• Partial Least Squares (PLS) and Regularized Multiple Linear Regression (Reg-MLR): These regression techniques generate initial models that provide preliminary insights into the relationships within the data.
• Preliminary Output & First Model Selection: The initial output is generated, and a preliminary model is selected based on performance metrics such as RMSECV (Root Mean Square Error of Cross-Validation).
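The idea of comparing preprocessing options by RMSECV can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Diviner's implementation: the function names, the synthetic data, and the use of ridge regression as a simple stand-in for a regularized MLR model (it applies only the L2 part of an elastic net penalty) are all hypothetical choices made for the example.

```python
import numpy as np

def center(X_train, X_test):
    """Mean-center both sets using the training-set mean."""
    mu = X_train.mean(axis=0)
    return X_train - mu, X_test - mu

def autoscale(X_train, X_test):
    """Mean-center and scale to unit variance using training-set statistics."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mu) / sd, (X_test - mu) / sd

def ridge_fit_predict(X_train, y_train, X_test, alpha=1e-3):
    """Ridge regression on centered data (stand-in for regularized MLR)."""
    y_mean = y_train.mean()
    n_vars = X_train.shape[1]
    coef = np.linalg.solve(
        X_train.T @ X_train + alpha * np.eye(n_vars),
        X_train.T @ (y_train - y_mean),
    )
    return X_test @ coef + y_mean

def rmsecv(X, y, preprocess, n_splits=5, seed=0):
    """Root mean square error of cross-validation over shuffled k folds."""
    order = np.random.default_rng(seed).permutation(len(y))
    residuals = np.empty(len(y))
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        # Preprocessing is re-estimated inside each fold to avoid leakage.
        X_tr, X_te = preprocess(X[train_idx], X[test_idx])
        y_hat = ridge_fit_predict(X_tr, y[train_idx], X_te)
        residuals[test_idx] = y[test_idx] - y_hat
    return float(np.sqrt(np.mean(residuals ** 2)))

# Synthetic data (assumed for illustration): 60 samples, 8 variables
# with very different scales; y depends on the first two variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8)) * np.array([1, 10, 0.1, 5, 1, 1, 2, 0.5]) + 3.0
y = X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=60)

# A tiny "grid" of preprocessing options, ranked by RMSECV.
options = {"mean_center": center, "autoscale": autoscale}
scores = {name: rmsecv(X, y, fn) for name, fn in options.items()}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```

Preprocessing statistics are computed from the training portion of each fold only, so the held-out samples do not influence the model that predicts them.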

Refinement Module:

• Further Variable Selection: The variable selection is refined further, often using techniques such as interval Partial Least Squares (iPLS) to fine-tune which variables contribute to the model.
• Outlier Reinclusion: Previously excluded outliers may be re-evaluated and, if further analysis shows them to be significant, reintegrated into the models.
• Final Model Selection: After all refinements are made, the best-performing model is selected for final output.
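As a rough illustration of interval-based variable selection, the sketch below scores each contiguous block of variables with its own small PLS model and keeps the block with the lowest RMSECV. It is a simplified, single-pass version of the interval idea behind iPLS, not Diviner's algorithm: the function names, the interval count, the number of latent variables, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """PLS1 via NIPALS-style deflation; X and y must already be centered."""
    Xk, yk = X.astype(float).copy(), y.astype(float).copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:  # nothing left to model
            break
        w /= norm
        t = Xk @ w                 # scores
        tt = t @ t
        p = Xk.T @ t / tt          # X loadings
        qk = yk @ t / tt           # y loading
        Xk -= np.outer(t, p)       # deflate X
        yk -= qk * t               # deflate y
        W.append(w); P.append(p); q.append(qk)
    if not W:
        return np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def rmsecv_pls(X, y, n_components=2, n_splits=5, seed=0):
    """RMSECV of a PLS1 model using shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(len(y))
    residuals = np.empty(len(y))
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        x_mean, y_mean = X[train_idx].mean(axis=0), y[train_idx].mean()
        b = pls1_coefficients(X[train_idx] - x_mean, y[train_idx] - y_mean,
                              n_components)
        residuals[test_idx] = y[test_idx] - ((X[test_idx] - x_mean) @ b + y_mean)
    return float(np.sqrt(np.mean(residuals ** 2)))

def score_intervals(X, y, n_intervals):
    """Return (column_indices, RMSECV) for each contiguous variable interval."""
    intervals = np.array_split(np.arange(X.shape[1]), n_intervals)
    return [(cols, rmsecv_pls(X[:, cols], y)) for cols in intervals]

# Synthetic data (assumed): 30 variables, only columns 13 and 15 carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 30))
y = X[:, 13] + 0.5 * X[:, 15] + rng.normal(scale=0.1, size=40)

results = score_intervals(X, y, n_intervals=5)
best_interval = min(range(len(results)), key=lambda i: results[i][1])
best_columns = results[best_interval][0]
print("best interval:", best_interval, "columns:", best_columns)
```

A fuller interval search would typically go on to combine intervals (for example by forward addition) rather than stop at the single best one; the single-pass version is used here only to keep the example short.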

Output:

• The process concludes with generating the final model, which is the culmination of the exploratory and refinement processes, ensuring that the model is robust, accurate, and tailored to the specific dataset.

This structured workflow allows for a balanced approach that combines automated processes with critical user feedback.

