# Regression

## Linear Regression

$Y = b_0 + b_1 * x_1 + b_2 * x_2 + \epsilon$

The idea is to find the line or plane which best fits the data. Collectively, $b_0, b_1, b_2$ are called regression coefficients. $\epsilon$ is the error term, the part of $Y$ the regression model is unable to explain.

**Loss Function:** The loss function quantifies how well the model's predictions match the actual target values. In linear regression, the most common loss function is the Mean Squared Error (MSE), the average squared difference between the predicted values and the actual target values over all data points.

**Optimization Criterion:** The optimization criterion is the goal of finding the best-fitting line (or hyperplane in higher dimensions), i.e. the one that minimizes the chosen loss function. For linear regression, this means finding the coefficients (intercept and slopes) of the linear equation that minimize the MSE.

**Optimization Routine:** To find the coefficients that minimize the MSE, an optimization routine is used. Gradient Descent is a widely used optimization algorithm for this; for ordinary least squares the minimizer can also be computed in closed form.
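The three pieces (loss, criterion, routine) can be sketched for a single predictor as follows. This is a minimal illustration in plain Python, not a production implementation: the function names, the learning rate, the step count, and the toy data are all assumptions made for the example.

```python
def mse(xs, ys, b0, b1):
    """Mean squared error of the line y = b0 + b1 * x."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def fit_gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Minimize the MSE over (b0, b1) with plain batch gradient descent."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = -2.0 / n * sum(residuals)
        grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Toy data generated from y = 1 + 2x with no noise (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_gradient_descent(xs, ys)
print(b0, b1, mse(xs, ys, b0, b1))
```

The learning rate has to be small enough for the updates to converge; the value used here was picked by hand for this toy data and would need tuning on real data.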

### Metrics

Once the model is fit, the next step is to measure how good the fit is. Some common metrics are:

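$RSS$ (Residual sum of squares) $= \sum_{i=1}^{n} (Y_{actual,i} - Y_{predicted,i})^2$. Its value depends on the scale of $Y$.

$TSS$ (Total sum of squares) $= \sum_{i=1}^{n} (Y_{actual,i} - Y_{avg})^2$

$R^2$ $= 1-\frac{RSS}{TSS}$. Higher is better, but it never decreases when more coefficients are added, even if they are uninformative.

$RSE$ (Residual Standard Error) $= \sqrt{\frac{RSS}{d.o.f}}$, where $d.o.f = n-p-1$ for $p$ predictors ($n-2$ for simple linear regression with one predictor).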
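The metrics above translate directly into code once the squared differences are summed over all data points. A small sketch in plain Python, where the function name `regression_metrics`, the `n_predictors` parameter, and the four data points are made up for illustration:

```python
import math


def regression_metrics(y_actual, y_predicted, n_predictors=1):
    """Compute RSS, TSS, R^2 and RSE for a fitted regression model."""
    n = len(y_actual)
    y_avg = sum(y_actual) / n
    rss = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    tss = sum((a - y_avg) ** 2 for a in y_actual)
    r2 = 1.0 - rss / tss
    # Degrees of freedom: n - p - 1, which is n - 2 for a single predictor.
    dof = n - n_predictors - 1
    rse = math.sqrt(rss / dof)
    return {"RSS": rss, "TSS": tss, "R2": r2, "RSE": rse}


# Illustrative values only.
y_actual = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.5, 1.5, 3.5, 3.5]
metrics = regression_metrics(y_actual, y_predicted)
print(metrics)
```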

### Feature selection

One option is hypothesis testing: use the p-value of each coefficient to judge whether the feature is important.

Another is to use metrics such as adjusted $R^2$, $AIC$, or $BIC$, which take into account the number of features used to build the model and penalize accordingly.
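To make the penalty concrete, here is a sketch of these metrics using their usual textbook forms. The AIC and BIC expressions assume a least-squares fit with Gaussian errors and drop additive constants, so only differences between models are meaningful. The function names and the example numbers are hypothetical.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def aic(rss, n, k):
    """AIC up to an additive constant; k is the number of fitted parameters."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC up to an additive constant; penalizes parameters by log(n)."""
    return n * math.log(rss / n) + k * math.log(n)


# Two made-up models on n = 50 points: four extra predictors buy only a
# tiny gain in R^2, so adjusted R^2 prefers the smaller model.
small = adjusted_r2(r2=0.800, n=50, p=2)
large = adjusted_r2(r2=0.805, n=50, p=6)
print(small, large)
```

For AIC and BIC, lower is better; for adjusted $R^2$, higher is better.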

How do we find the model that minimizes a metric like $AIC$? One approach is to search through all possible models, called **all subset regression**. This is computationally expensive and is not feasible for problems with large data and many variables. An attractive alternative is **stepwise regression**, which successively adds and drops predictors to find a model that lowers $AIC$. Simpler yet are **forward selection** and **backward selection**. In forward selection, you start with no predictors and add them one by one, at each step adding the predictor that has the largest contribution to $R^2$, and stopping when the contribution is no longer statistically significant. In backward selection, or backward elimination, you start with the full model and take away predictors that are not statistically significant until you are left with a model in which all predictors are statistically significant.

**Penalized Regression** or **Regularization**: Penalized regression is similar in spirit to AIC. Instead of explicitly searching through a discrete set of models, the model-fitting equation incorporates a constraint that penalizes the model for too many variables (parameters). Rather than eliminating predictor variables entirely, as with stepwise, forward, and backward selection, penalized regression applies the penalty by reducing coefficients, in some cases to near zero. Common penalized regression methods are ridge regression and lasso regression.

Regularization simply adds a penalty term to the objective function and uses it to control model complexity. It can be used with many machine learning algorithms. Ridge regression uses $L2$ regularization (a penalty on the sum of squared coefficients) and lasso regression uses $L1$ regularization (a penalty on the sum of absolute coefficients).

Ridge brings the coefficients close to $0$ but not exactly to $0$, so the model retains all the features. Lasso, on the other hand, can bring coefficients exactly to $0$ and hence results in fewer features. In the simplified case of orthonormal predictors, lasso shrinks every coefficient by the same amount, whereas ridge shrinks them by the same proportion.
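The "same amount" versus "same proportion" behaviour can be seen in a special case. The sketch below assumes orthonormal predictors, a lasso objective of $\frac{1}{2}RSS + \lambda \sum |b_j|$, and a ridge objective of $\frac{1}{2}RSS + \frac{\lambda}{2} \sum b_j^2$; under those assumptions each penalized coefficient is a simple function of the corresponding OLS coefficient. The function names and the coefficient values are illustrative.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: move b toward 0 by lam, stopping at exactly 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Proportional shrinkage: scale b by 1 / (1 + lam)."""
    return b / (1.0 + lam)


ols_coefficients = [4.0, -2.0, 0.5]  # made-up OLS estimates
lam = 1.0
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
print(lasso)  # [3.0, -1.0, 0.0]
print(ridge)  # [2.0, -1.0, 0.25]
```

The smallest coefficient is set exactly to zero by lasso, while ridge only scales it down. With correlated predictors there is no such closed form, and the fits are obtained numerically.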

Elastic Net is another useful technique which combines both L1 and L2 regularization.

### Assumptions

The relationship between $X$ and $Y$ is **linear**. Because we are fitting a linear model, we assume that the relationship really is linear, and that the errors, or residuals, are simply random fluctuations around the true line.

The error terms are **normally distributed**. This can be checked with a Q-Q plot.

The error terms are independent of each other. This can be checked with an ACF (autocorrelation function) plot, which is especially useful for time-series data.

The error terms are **homoscedastic**, i.e. they have constant variance. The residuals vs. fitted plot should show a flat band of roughly constant spread. If the spread instead grows or shrinks with the fitted value, the variability in the response is changing as the predicted value increases (heteroscedasticity). This is a problem, in part, because the observations with larger errors will have more pull or influence on the fitted model.

The independent variables are not multicollinear. **Multicollinearity** is when a variable can be explained as a combination of other variables. This can be checked using the **VIF (Variance Inflation Factor)** $= \frac{1}{1-R_i^2}$, where $R_i^2$ is the $R^2$ obtained by regressing the $i$-th predictor on all the other predictors. A VIF above $10$ is a common rule of thumb for a problem.

If a multicollinear variable is present, the coefficients swing wildly, which hurts the interpretability of the model, and the p-values are not reliable. It does not, however, affect prediction or the goodness-of-fit statistics.
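As a small illustration of the VIF formula, consider the special case of exactly two predictors. There, regressing one predictor on the other gives an $R_i^2$ equal to their squared Pearson correlation, so no full regression is needed. The helper names and data below are made up for the example; with more predictors, $R_i^2$ has to come from a multiple regression.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2), with R^2 = r^2 when there are two predictors."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1]     # roughly 2 * x1
x2_unrelated = [1.0, -2.0, 0.0, 2.0, -1.0]    # uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(vif_high, vif_low)
```

The nearly collinear pair gives a VIF far above the rule-of-thumb threshold of $10$, while the uncorrelated pair gives the minimum possible value of $1$.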

To deal with multicollinearity, you can drop variables, create new features from existing ones, or use PCA/PLS.

One very important point to remember is that linear regression is called linear because $Y$ is linear w.r.t. its coefficients $b_0, b_1, b_2$, etc., irrespective of whether the features $x_1, x_2$, etc. enter linearly or not. This means a feature can be a transformation such as $x_1^2$ and the model is still linear.

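### OLS Stats Model (Ordinary Least Squares)

OLS (ordinary least squares), as implemented in statsmodels, helps identify the more significant features that have an influence on the output. In Python, with `smf` as the conventional alias for `statsmodels.formula.api`, the model is fit with `lm = smf.ols(formula = 'Sales ~ am+constant', data = data).fit()`. Calling `lm.conf_int()` then returns the confidence intervals of the coefficients, and `lm.summary()` prints the full regression summary.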

## SVR (Support Vector Regression)

In simple linear regression, we try to minimize the error. In SVR, we instead try to keep the error within a certain threshold.

The best-fit line is the hyperplane that has the maximum number of points within its margin. What we are doing is choosing boundary lines at a distance $\epsilon$ ('e') on either side of the hyperplane so that as many data points as possible fall inside them. Points inside this tube are not penalized; the points on or outside the boundary are the support vectors that determine the fit.
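The threshold idea is commonly expressed as the epsilon-insensitive loss: errors smaller than the threshold cost nothing, and larger errors are penalized only by the amount they exceed it. A minimal sketch for comparison with the squared loss of linear regression, with function names chosen for illustration:

```python
def epsilon_insensitive_loss(y_actual, y_predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess outside it."""
    return max(0.0, abs(y_actual - y_predicted) - epsilon)


def squared_loss(y_actual, y_predicted):
    """Per-point loss used by ordinary linear regression."""
    return (y_actual - y_predicted) ** 2


epsilon = 0.5
inside = epsilon_insensitive_loss(3.0, 3.25, epsilon)   # error 0.25 <= epsilon
outside = epsilon_insensitive_loss(3.0, 5.0, epsilon)   # error 2.0 > epsilon
print(inside, outside)             # 0.0 1.5
print(squared_loss(3.0, 3.25))     # 0.0625: squared loss penalizes every error
```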

## Non-Linear Regression

In some cases, the true relationship between the outcome and a predictor variable might not be linear. There are different solutions extending the linear regression model for capturing these nonlinear effects, some of these are covered below.

### Polynomial Regression

The polynomial regression equation looks like this:

$Y = b_0 + b_1 * x_1 + b_2 * x_1^2 + \dots + b_n * x_1^n$

The degree of the polynomial is a hyperparameter that must be chosen carefully. A high degree tends to overfit the data, while a low degree tends to underfit, so we need to find the optimum degree. **Polynomial regression on datasets with high variability is likely to result in over-fitting.**
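The effect of the degree can be explored with a small least-squares fit. The sketch below builds the powers of $x_1$ as features and solves the normal equations with a hand-written Gaussian elimination, purely to stay dependency-free; in practice a numerical library would be used. All names and the toy data (a quadratic with small hand-added perturbations) are assumptions for the example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, ..., b_degree] via the normal equations."""
    size = degree + 1
    powers = [[x ** j for j in range(size)] for x in xs]
    xtx = [[sum(row[i] * row[j] for row in powers) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * y for row, y in zip(powers, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def training_mse(xs, ys, coeffs):
    """Mean squared error of the fitted polynomial on the training data."""
    preds = [sum(b * x ** j for j, b in enumerate(coeffs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)


# Roughly y = 1 + 2x + 0.5x^2 with small perturbations (illustrative only).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1.1, -0.4, 1.0, 3.6, 6.9, 11.6]
mse_by_degree = {d: training_mse(xs, ys, fit_polynomial(xs, ys, d))
                 for d in (1, 2, 4)}
print(mse_by_degree)
```

Training error can only go down as the degree increases, so it cannot be used on its own to pick the degree; comparing error on held-out data is the usual way to detect overfitting.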

### Regression Splines

In order to overcome the disadvantages of polynomial regression, we can use an improved regression technique which, instead of building one model for the entire dataset, divides the dataset into multiple bins and fits each bin with a separate model. Such a technique is known as Regression spline.

In polynomial regression, we generated new features by using various polynomial functions on the existing features which imposed a global structure on the dataset. To overcome this, we can divide the distribution of the data into separate portions and fit linear or low degree polynomial functions on each of these portions. The points where the division occurs are called **Knots**. Functions which we can use for modelling each piece/bin are known as Piecewise functions. There are various piecewise functions that we can use to fit these individual bins.
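One common choice of piecewise function, used here as an assumption since the text does not name a specific one, is the truncated linear basis $(x - k)_+$, which is zero to the left of the knot $k$ and linear to the right. Adding one such term per knot gives a continuous piecewise linear function whose slope changes at each knot. The function names and coefficient values below are illustrative, not fitted.

```python
def hinge(x, knot):
    """Truncated linear basis function (x - knot)_+ ."""
    return max(0.0, x - knot)


def linear_spline(x, knots, b0, b1, knot_coeffs):
    """Piecewise linear function: the slope changes by knot_coeffs[i] at knots[i]."""
    value = b0 + b1 * x
    for knot, coeff in zip(knots, knot_coeffs):
        value += coeff * hinge(x, knot)
    return value


def f(x):
    # Slope 1 up to the knot at x = 2, then slope 1 + (-3) = -2 afterwards.
    return linear_spline(x, knots=[2.0], b0=0.0, b1=1.0, knot_coeffs=[-3.0])


print(f(1.0), f(2.0), f(3.0))  # 1.0 2.0 0.0
```

Because each basis term is just another feature, the coefficients can be estimated with ordinary least squares once the knots are fixed.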

### Generalized additive models

Generalized additive models do the same thing as regression splines but remove the need to specify the knots: they fit spline models with automated selection of knots.

## Questions
