# Overview

1. Gather the data
2. Import and understand the data: do things like the following to understand more about the data at hand
   1. check the shape of data
   2. check the number of unique values in each column, drop the ones which have the same value
   3. if it is a classification problem check for class imbalance
   4. check for column datatypes and fix if necessary
3. Check and fix for missing values: [Read more about this here](/model-building/data/missing-value.md)
4. Perform feature engineering to build new columns from the existing ones, it can be YoY growth, etc.
5. Run univariate and multivariate analysis to understand more about the features
6. Detect and treat outliers: [Read more about this here](/model-building/data/outlier.md)
7. Encode categorical variables: [Read more about this here](/model-building/data/categorical-variable.md)
8. Standardize the data
9. If needed run different sampling techniques to reduce imbalance or run PCA etc.
10. Split the data to train, test and validation: Training data are collections of examples or samples that are used to 'teach' or 'train the machine learning model. In contrast, validation datasets contain different samples to evaluate trained ML models. It is still possible to tune and control the model at this stage. Working on validation data is used to assess the model performance and fine-tune the parameters of the model. This becomes an iterative process wherein the model learns from the training data and is then validated and fine-tuned on the validation set. Finally, a test data set is a separate sample, an unseen data set, to provide an unbiased final evaluation of a model fit.&#x20;

    **Cross-validation** involves one or more splits of the training data set and validation data set. In particular, K-fold cross-validation aims to maximize accuracy in testing by dividing the source data into several bins or groups. All except one of these are for training and validation purposes. The last is for testing.
11. Determine which metric you want to track
12. Run some basic algorithms to understand which models might give the best results, PYCARET is a good option to quickly prototype
13. Select the promising ones and deep dive on those
14. Check if there are any assumptions of the model --> Perform checks on Collinearity etc.
15. [Tune the model ](/model-building/hyperparameter-optimization.md)--> Use regularization, Cross validation etc. to reduce overfitting
16. Check for feature importance using built in functions or SHAP


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://book.thedatascienceinterviewproject.com/model-building/overview.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
