Overview
This page broadly summarizes the steps needed to go from data gathering to model building.
Gather the data
Import and understand the data. Do things like the following to learn more about the data at hand (see the sketch after this list):
check the shape of the data
check the number of unique values in each column, and drop columns that contain only a single value
if it is a classification problem, check for class imbalance
check the column datatypes and fix them if necessary
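A minimal pandas sketch of these checks, assuming the data lives in a hypothetical data.csv with a target column named target:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file name

# Shape of the data
print(df.shape)

# Unique values per column; constant columns carry no signal and can be dropped
nunique = df.nunique()
df = df.drop(columns=nunique[nunique <= 1].index)

# Class balance for a classification problem (assumes a 'target' column)
print(df["target"].value_counts(normalize=True))

# Column datatypes; cast where necessary
print(df.dtypes)
df["target"] = df["target"].astype("category")
```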
Check for and handle missing values, e.g. by dropping sparse columns and imputing the rest:
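One possible approach, continuing with the same df: drop columns that are mostly missing, then impute numeric columns with the median and categorical ones with the mode (the 50% threshold and the strategies are illustrative, not a rule):

```python
# Count missing values per column
print(df.isna().sum())

# Drop columns with more than 50% missing values
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Impute numerics with the median, categoricals with the mode
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
cat_cols = df.select_dtypes(exclude="number").columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
```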
Perform feature engineering to build new columns from the existing ones, e.g. year-over-year (YoY) growth.
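A small illustration of a YoY-growth feature on made-up yearly revenue data:

```python
import pandas as pd

# Hypothetical yearly revenue data to illustrate a YoY-growth feature
sales = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023],
    "revenue": [100.0, 120.0, 138.0, 131.0],
})
sales = sales.sort_values("year")
sales["revenue_yoy_growth"] = sales["revenue"].pct_change()  # (current - prior) / prior
print(sales)
```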
Run univariate and multivariate analysis to understand the features better.
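A quick sketch of both kinds of analysis using pandas, matplotlib, and seaborn (assumes the df from above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of each numeric feature
df.hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()

# Multivariate: pairwise correlations between numeric features
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```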
Detect and treat outliers, for example with the IQR rule:
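One common treatment is the IQR rule with capping; amount is a hypothetical numeric column:

```python
# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["amount"].quantile([0.25, 0.75])  # 'amount' is a hypothetical column
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Treat by capping (winsorizing) rather than dropping rows
df["amount"] = df["amount"].clip(lower=lower, upper=upper)
```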
Encode categorical variables, e.g. one-hot encoding for nominal features and explicit mappings for ordinal ones:
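A sketch of both styles of encoding; city and size are hypothetical columns:

```python
import pandas as pd

# One-hot encode a nominal category
df = pd.get_dummies(df, columns=["city"], drop_first=True)  # 'city' is hypothetical

# Explicit mapping for an ordered category
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)  # 'size' is hypothetical
```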
Standardize the data so features are on comparable scales.
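A sketch with scikit-learn's StandardScaler (in practice, fit it on the training split only to avoid leaking test statistics):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_cols = df.select_dtypes(include="number").columns
# Rescales each numeric column to zero mean and unit variance
df[num_cols] = scaler.fit_transform(df[num_cols])
```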
If needed, apply sampling techniques (e.g. SMOTE) to reduce class imbalance, or run dimensionality reduction such as PCA.
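A sketch assuming a feature matrix X and labels y, using imbalanced-learn's SMOTE for oversampling and scikit-learn's PCA for dimensionality reduction:

```python
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

# Oversample the minority class
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Reduce dimensionality while keeping 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_res)
print(pca.n_components_, "components retained")
```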
Split the data into train, validation, and test sets: Training data are the examples used to 'teach' or 'train' the machine learning model. In contrast, the validation set contains different samples used to evaluate the trained model; at this stage it is still possible to tune and control the model. Working on validation data assesses model performance and fine-tunes the model's parameters, in an iterative loop where the model learns from the training data and is then validated and fine-tuned on the validation set. Finally, the test set is a separate, unseen sample that provides an unbiased final evaluation of the model fit.
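A sketch of a 60/20/20 split with scikit-learn, assuming features X and labels y:

```python
from sklearn.model_selection import train_test_split

# First carve out a held-out test set, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)  # 60% train / 20% validation / 20% test overall
```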
Cross-validation involves one or more splits of the data into training and validation sets. In particular, K-fold cross-validation divides the source data into K equally sized folds; the model is trained K times, each time holding out a different fold for validation and training on the remaining K-1, and the scores are averaged for a more reliable performance estimate.
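A minimal 5-fold example with scikit-learn; logistic regression here is just a placeholder model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(scores.mean(), "+/-", scores.std())
```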
Determine which metric you want to track; on imbalanced problems, accuracy alone can be misleading.
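For instance, a few common classification metrics side by side; y_pred and y_proba below are hypothetical model outputs on the validation set:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Accuracy can look good while the model misses the minority class entirely
print("accuracy:", accuracy_score(y_val, y_pred))
print("f1:      ", f1_score(y_val, y_pred))
print("roc auc: ", roc_auc_score(y_val, y_proba))  # y_proba = predicted probabilities
```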
Run some basic algorithms to understand which models might give the best results; PyCaret is a good option for quick prototyping.
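A minimal PyCaret classification sketch, assuming the df from above with a target column named target:

```python
from pycaret.classification import setup, compare_models

# Fit and rank many candidate models with one call
setup(data=df, target="target", session_id=42)
best = compare_models()
print(best)
```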
Select the most promising models and deep dive into those.
Check whether the model makes assumptions that must hold, e.g. perform checks on collinearity.
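One way to check collinearity is the variance inflation factor from statsmodels, assuming a numeric X_train DataFrame:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# A VIF above ~5-10 is a common rule of thumb for problematic collinearity
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [variance_inflation_factor(X_train.values, i)
            for i in range(X_train.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```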
Use regularization, cross-validation, etc. to reduce overfitting.
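A sketch combining both ideas: logistic regression with an L2 penalty whose strength is chosen by cross-validation:

```python
from sklearn.linear_model import LogisticRegressionCV

# L2-regularized logistic regression; penalty strength chosen by 5-fold CV
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000)
clf.fit(X_train, y_train)
print("chosen C per class:", clf.C_)
```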
Check feature importance using built-in functions or SHAP.
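A sketch of both options, with a random forest as a stand-in model and the shap package:

```python
import shap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Built-in importances: top ten features by impurity-based score
top = sorted(zip(X_train.columns, model.feature_importances_),
             key=lambda t: -t[1])[:10]
print(top)

# SHAP values give richer, per-prediction attributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
```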