MODEL BUILDING

Overview

This page broadly summarizes the steps needed to go from data gathering to model building. Minimal Python sketches for several of the steps follow the list.

  1. Gather the data

  2. Import and understand the data: do things like the following to learn more about the data at hand

    1. check the shape of the data

    2. check the number of unique values in each column, and drop columns that contain only a single value

    3. if it is a classification problem, check for class imbalance

    4. check the column datatypes and fix them if necessary

  3. Check for and fix missing values

  4. Perform feature engineering to build new columns from the existing ones, e.g. YoY growth

  5. Run univariate and multivariate analyses to understand more about the features

  6. Detect and treat outliers

  7. Encode categorical variables

  8. Standardize the data

  9. If needed, run sampling techniques to reduce class imbalance, or dimensionality reduction such as PCA

  10. Split the data into train, validation, and test sets: training data are the samples used to 'teach', or train, the machine learning model. Validation data are a separate sample used to evaluate the trained model during development; the model can still be tuned at this stage, so this becomes an iterative process in which the model learns from the training data and is then evaluated and fine-tuned on the validation set. Finally, the test set is a separate, unseen sample that provides an unbiased final evaluation of the fitted model.

    Cross-validation repeats this split instead of relying on a single one. In particular, K-fold cross-validation divides the source data into K bins or groups (folds); in each of K rounds, all folds except one are used for training and the held-out fold is used for validation, so every observation is used for validation exactly once and the scores are averaged across rounds.

  11. Determine which metric you want to track

  12. Run some baseline algorithms to understand which models might give the best results; PyCaret is a good option for quick prototyping

  13. Select the promising ones and deep-dive into those

  14. Check whether the assumptions of the model hold, e.g. perform checks on collinearity

  15. Use regularization, cross-validation, etc. to reduce overfitting

  16. Check for feature importance using built-in functions or SHAP

  17. Tune the model's hyperparameters
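The sketches below illustrate several of the steps above. They are minimal examples rather than definitive implementations; file names, column names, and parameter values are illustrative.

For step 2, a quick pandas pass over the data, assuming a CSV file train.csv with a target column named target (both hypothetical):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file name

print(df.shape)      # number of rows and columns
print(df.dtypes)     # column datatypes; fix with astype() if necessary
print(df.nunique())  # unique values per column

df = df.loc[:, df.nunique() > 1]  # drop constant (single-value) columns

# For a classification problem, inspect the class balance
print(df["target"].value_counts(normalize=True))
```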
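For step 3, a sketch of median imputation with scikit-learn's SimpleImputer (one of several possible strategies) on toy data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration
df = pd.DataFrame({"age": [25, None, 40], "income": [50_000, 60_000, None]})
print(df.isna().sum())  # missing values per column

imputer = SimpleImputer(strategy="median")  # replace NaNs with the column median
df[df.columns] = imputer.fit_transform(df)
```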
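For step 4, a small feature-engineering example computing YoY growth with pandas (the columns are made up):

```python
import pandas as pd

df = pd.DataFrame({"year": [2021, 2022, 2023], "revenue": [100, 120, 150]})

# Year-over-year growth as the percentage change from the previous row
df["yoy_growth"] = df["revenue"].pct_change()
```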
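For step 5, univariate summaries and pairwise correlations as a basic multivariate look:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 40, 51], "income": [40_000, 52_000, 61_000, 80_000]})

print(df.describe())  # univariate: count, mean, std, quartiles per column
print(df.corr())      # multivariate: pairwise correlations between columns
```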
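For step 6, one common (though not the only) treatment is capping values that fall outside the 1.5 × IQR fences:

```python
import pandas as pd

df = pd.DataFrame({"income": [45_000, 52_000, 49_000, 51_000, 250_000]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1

# Cap values outside the Tukey fences instead of dropping the rows
df["income"] = df["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```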
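For step 7, one-hot encoding for nominal variables and a manual mapping for ordinal ones:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "NY"], "size": ["S", "M", "L"]})

df = pd.get_dummies(df, columns=["city"])              # nominal: one-hot encode
df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2})  # ordinal: preserve the order
```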
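For step 8, standardization with StandardScaler; note that the scaler should be fit on the training data only and then applied unchanged to the validation/test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler().fit(X_train)    # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same mean/std
```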
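For step 9, a simple random-oversampling sketch using sklearn.utils.resample; libraries such as imbalanced-learn offer more sophisticated methods (e.g. SMOTE):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Oversample the minority class with replacement to match the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])
print(df_balanced["y"].value_counts())
```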
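For step 10, a 60/20/20 train/validation/test split via two successive calls to train_test_split (the ratios are a common but arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(-1, 1), np.arange(100)

# First carve off 40%, then split that 40% half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```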
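And a K-fold cross-validation sketch on a built-in dataset; each fold serves as the validation set exactly once:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: the reported score is the average over the five validation folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```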
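For step 11, scikit-learn exposes most common metrics; for example, a classification report on toy labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Precision, recall and F1 per class, plus overall accuracy
print(classification_report(y_true, y_pred))
```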
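For step 12, a PyCaret prototyping sketch using its classification module (the file and target column names are hypothetical):

```python
import pandas as pd
from pycaret.classification import compare_models, setup

df = pd.read_csv("train.csv")  # hypothetical file with a "target" column

# setup() prepares the data; compare_models() trains and ranks many baseline models
setup(data=df, target="target", session_id=42)
best = compare_models()
print(best)
```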
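For step 14, one quick collinearity check is flagging highly correlated feature pairs; variance inflation factors (available in statsmodels) are a more formal alternative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(scale=0.1, size=100), "c": rng.normal(size=100)})

# Keep the upper triangle of the absolute correlation matrix and flag large entries
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.8])  # pairs with |correlation| > 0.8
```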
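For step 15, comparing L2 (Ridge) and L1 (Lasso) regularization with cross-validation; the alpha values are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Higher alpha means stronger regularization and (usually) less overfitting
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```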
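For step 16, built-in impurity-based importances from a tree ensemble; SHAP (the shap package) additionally gives per-prediction attributions:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=42).fit(data.data, data.target)

# Impurity-based importances, one number per feature
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False))
```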
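For step 17, an exhaustive grid search over a small, illustrative hyperparameter grid:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, scoring each with 5-fold CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```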