THE DATA SCIENCE INTERVIEW BOOK

Regression


Linear Regression

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \epsilon$$

  1. Loss Function: The loss function in linear regression quantifies how well the model's predictions match the actual target values. In linear regression, the most common loss function is the Mean Squared Error (MSE). The MSE calculates the average squared difference between the predicted values and the actual target values for all data points.

  2. Optimization Criterion: The optimization criterion is the goal of finding the best-fitting line (or hyperplane in higher dimensions) that minimizes the chosen loss function. In the case of linear regression, the goal is to find the coefficients (slope and intercept) of the linear equation that minimize the MSE. This involves adjusting the coefficients to minimize the overall squared difference between the predicted and actual values.

  3. Optimization Routine: To find the optimal coefficients that minimize the MSE, an optimization routine is used. Gradient Descent is a widely used optimization algorithm for linear regression; a minimal sketch follows this list.
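As an illustration, here is a minimal NumPy sketch (not part of the original text) of fitting the coefficients by batch gradient descent on the MSE; the toy data, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

def fit_linear_regression_gd(X, y, lr=0.05, n_iters=2000):
    """Fit y ≈ X @ w + b by gradient descent on the MSE."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        y_hat = X @ w + b
        error = y_hat - y
        grad_w = (2 / n) * X.T @ error   # d(MSE)/dw
        grad_b = (2 / n) * error.sum()   # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: y = 3*x1 + 2*x2 + 1 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)
w, b = fit_linear_regression_gd(X, y)
print(w, b)  # should be close to [3, 2] and 1
```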

Metrics

Once the model is fit, the next step is to measure how good that fit is. Some of the common metrics are $RSS$, $TSS$, $R^2$, and $RSE$; their formulas are given later on this page.

Feature selection

  • Hypothesis testing and using p-values to understand if the feature is important or not

  • Penalized Regression or Regularization:

Elastic Net is another useful technique which combines both L1 and L2 regularization.
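As a hedged sketch (not from the original text), penalized regression can be used for feature selection with scikit-learn; the synthetic data, `alpha`, and `l1_ratio` below are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 4 of which actually drive y
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

# l1_ratio blends the L1 (lasso) and L2 (ridge) penalties
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

selected = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-6]
print("Non-zero coefficients (selected features):", selected)
```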

Assumptions

  • The error terms are normally distributed. This can be checked with a Q-Q plot

  • Error terms are independent of each other. This can be checked with an ACF plot, which is especially useful when working with time-series data

  • Error terms are homoscedastic, i.e. they have constant variance. A Residuals vs Fitted plot should look flat. If instead the variability of the residuals changes as the predicted value increases, that is a problem, in part because the observations with larger errors will have more pull or influence on the fitted model.

    • If a multicollinear variable is present, the coefficients swing wildly, thereby affecting the interpretability of the model, and p-values are not reliable. However, it does not affect prediction or the goodness-of-fit statistics (a VIF check is sketched after this list).

    • To deal with multicollinearity

      • drop variables

      • create new features from existing ones

      • PCA/PLS
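Multicollinearity is commonly diagnosed with the variance inflation factor (VIF) described in the assumptions later on this page. Here is a minimal sketch using statsmodels; the DataFrame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["x3"] = 0.7 * df["x1"] + 0.3 * df["x2"] + rng.normal(scale=0.01, size=300)

exog = sm.add_constant(df)  # VIF is usually computed with an intercept included
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(exog.shape[1])],
    index=exog.columns,
)
print(vif)  # VIF > 10 for x1/x3 signals a multicollinearity problem
```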

OLS Stats Model (Ordinary Least Squares)

OLS is a stats model which helps us identify the features that have a significant influence on the output. In Python, an OLS model is fit as `lm = smf.ols(formula = 'Sales ~ am+constant', data = data).fit()`, after which `lm.conf_int()` and `lm.summary()` give the confidence intervals and the model summary (a runnable sketch follows).
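A hedged, self-contained version of that snippet: the `Sales`/`am` data below is synthetic and purely illustrative, and the explicit `constant` term is dropped because `smf.ols` adds an intercept automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the 'Sales ~ am' example above
rng = np.random.default_rng(0)
data = pd.DataFrame({"am": rng.normal(size=200)})
data["Sales"] = 5 + 2.5 * data["am"] + rng.normal(scale=1.0, size=200)

lm = smf.ols(formula="Sales ~ am", data=data).fit()  # intercept added automatically
print(lm.conf_int())   # 95% confidence intervals for each coefficient
print(lm.summary())    # coefficients, t-values, p-values, R^2, etc.
```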

SVR (Support Vector Regression)

In simple linear regression, we try to minimize the error. In SVR, we instead try to fit the error within a certain threshold.

Our best-fit line is the hyperplane that has the maximum number of points. What we are trying to do here is decide a decision boundary at a distance 'e' from the original hyperplane such that the data points closest to the hyperplane, the support vectors, are within that boundary line.
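A minimal scikit-learn sketch of SVR (not from the original text); `epsilon` plays the role of the threshold 'e' discussed above, and the kernel and `C` values are arbitrary.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy 1-D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# epsilon defines the tube within which errors are not penalized
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr.fit(X, y)
print(svr.predict([[2.5]]))  # prediction at x = 2.5
```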

Non-Linear Regression

In some cases, the true relationship between the outcome and a predictor variable might not be linear. There are different solutions that extend the linear regression model to capture these nonlinear effects; some of them are covered below.

Polynomial Regression

The equation of the polynomial becomes something like $Y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n$.

The degree of the polynomial is a hyperparameter, and we need to choose it wisely. A high degree tends to overfit the data, while a small degree tends to underfit, so we need to find the optimum value of the degree. Polynomial regression on datasets with high variability has a good chance of resulting in over-fitting.
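A hedged sketch (not from the original text) of treating the degree as a hyperparameter with scikit-learn; the toy data and the range of degrees tried are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a roughly cubic relationship
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(scale=1.0, size=150)

for degree in range(1, 8):
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree={degree}: mean CV R^2 = {score:.3f}")
# Too low a degree underfits; too high a degree starts to overfit noisy data.
```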

Regression Splines

In order to overcome the disadvantages of polynomial regression, we can use an improved regression technique which, instead of building one model for the entire dataset, divides the dataset into multiple bins and fits a separate model to each bin. Such a technique is known as a regression spline.

In polynomial regression, we generated new features by applying various polynomial functions to the existing features, which imposed a global structure on the dataset. To overcome this, we can divide the distribution of the data into separate portions and fit linear or low-degree polynomial functions on each of these portions. The points where the division occurs are called knots. The functions used for modelling each piece/bin are known as piecewise functions. There are various piecewise functions that we can use to fit these individual bins.
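As a sketch (not from the original text), a cubic regression spline can be fit through the statsmodels formula interface using patsy's `bs()` basis; the `df=6` knot setting below is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy nonlinear data
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.sort(rng.uniform(0, 10, 300))})
df["y"] = np.sin(df["x"]) + 0.1 * df["x"] + rng.normal(scale=0.2, size=300)

# bs(x, df=6, degree=3): a cubic B-spline basis; knots are placed automatically
spline_fit = smf.ols("y ~ bs(x, df=6, degree=3)", data=df).fit()
print(spline_fit.rsquared)
print(spline_fit.predict(pd.DataFrame({"x": [2.5, 7.5]})))
```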

Generalized additive models

It does the same thing as above but removes the need to specify the knots: it fits spline models with automated selection of knots.
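A hedged sketch using the third-party `pygam` package (an assumption on my part, not part of the original text; it must be installed separately): `s(0)` requests a smooth spline term on the first feature, and `gridsearch` tunes the smoothing penalty so the reader does not pick knots by hand.

```python
import numpy as np
from pygam import LinearGAM, s  # third-party package: pip install pygam

# Toy nonlinear data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(300, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * X.ravel() + rng.normal(scale=0.2, size=300)

# A GAM with a single smooth (spline) term; the smoothing strength is
# selected automatically over a grid of penalty values.
gam = LinearGAM(s(0)).gridsearch(X, y)
gam.summary()
print(gam.predict(np.array([[2.5], [7.5]])))
```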

Questions

[UPSTART] Regression Coefficient

Answer

Linear Regression in Time Series

Do you think Linear Regression should be used in Time series analysis?

Answer

Linear regression, in my view, can be used in time series analysis but might not always give good results. A few reasons that come up are:

  • Linear Regression is good for interpolation but not for extrapolation, so the results can vary wildly

  • When Linear Regression is used but the observations are correlated (as in time-series data), you will have a biased estimate of the variance (see the residual-autocorrelation check sketched after this list)

  • Moreover, time-series data have patterns, such as peak hours or festive seasons, which would most likely be treated as outliers in a linear regression analysis
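A hedged sketch (not from the original text) of checking residual autocorrelation after fitting a linear trend to autocorrelated data; the AR(1) simulation and the interpretation thresholds are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulate a trend plus AR(1) noise, as often seen in time series
rng = np.random.default_rng(0)
n = 300
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(scale=1.0)
t_idx = np.arange(n)
y = 0.05 * t_idx + noise

# Fit a plain linear regression of y on time
X = sm.add_constant(t_idx)
fit = sm.OLS(y, X).fit()

# Durbin-Watson near 2 means little autocorrelation; well below 2 flags it
print("Durbin-Watson:", durbin_watson(fit.resid))
print("Lag-1 residual autocorrelation:",
      np.corrcoef(fit.resid[:-1], fit.resid[1:])[0, 1])
```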

[AIRBNB] Booking Regression

Let's say we want to build a model to predict booking prices.

  1. Explain the difference between a linear regression versus a random forest regression.

  2. Which one would likely perform better?

Answer

Linear Regression is used to predict continuous outputs where there is a linear relationship between the features of the dataset and the output variable. It is used for regression problems where you are trying to predict something with infinite possible answers such as the price of a house.

In the case of regression, the decision trees in a random forest learn by splitting the training examples in a way that minimizes the sum of squared residuals. To classify a new object based on its attributes, each tree gives a classification and we say the tree “votes” for that class; the forest chooses the classification having the most votes (over all the trees in the forest), and in the case of regression it takes the average of the outputs of the different trees. Random forests are useful when there are complex relationships between the features and the output variable. They also work well compared to other algorithms when there are missing features, when there is a mix of categorical and numerical features, and when there is a big difference in the scale of the features.

It is difficult to tell which will perform better; it completely depends on the problem statement and the available data. Other than the points mentioned above, some of the key advantages of linear models over tree-based ones are:

  • they can extrapolate (e.g., if labels are between 1-5 in train set, tree-based model will never predict 10, but linear will)

  • could be used for anomaly detection because of extrapolation

  • interpretability (yes, tree-based models have feature importance, but it's only a proxy, weights in linear model are better)

  • need less data to get good results

  • Random Forest is able to discover more complex relations at the cost of time

The first point becomes clearly important in this case, as we would need to predict booking price values that might not necessarily be in the training data range.
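To make the extrapolation point concrete, here is a hedged sketch (not from the original text) comparing the two model families outside the training range; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train on prices driven linearly by a single feature in the range [0, 10]
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 50 + 20 * X_train.ravel() + rng.normal(scale=5.0, size=500)

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Ask both models about a point well outside the training range
X_new = np.array([[20.0]])
print("Linear regression:", lin.predict(X_new))  # extrapolates the linear trend
print("Random forest:", rf.predict(X_new))       # stays near the largest values seen in training
```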

[GOOGLE] Adding Noise

What is the new objective function? How do you compute it?

[UBER] L1 vs L2

What is L1 and L2 regularization? What are the differences between the two?

The loss functions for the two are given in the answer later on this page.

[TESLA] Choice of Cost Function

You're working with several sensors that are designed to predict a particular energy consumption metric on a vehicle. Using the outputs of the sensors, you build a linear regression model to make the prediction. There are many sensors, and several of the sensors are prone to complete failure.

What are some cost functions you might consider, and which would you decide to minimize in this scenario?

[AIRBNB] Prove that maximizing the likelihood is equivalent to minimizing the sum of squared residuals

Suppose you are running a linear regression and model the error terms as being normally distributed. Show that in this setup, maximizing the likelihood of the data is equivalent to minimizing the sum of squared residuals.

A mathematical derivation like this requires us to:

  • Define the correct mathematical symbols and their relationships through equations

  • Recall and use the definitions of terms like likelihood and normally distributed

  • Perform mathematical manipulations to derive the required result

Problem Setup:

Next, we are given a set of training data points, consisting of input vectors and their corresponding outputs.

Likelihood:

Take a look at the problem statement again. We are assuming that the error terms are normally distributed. There is an implicit assumption that all the error terms are independent of each other. (Make sure you make this assumption explicit to your interviewer).

Since we are assuming that the error terms are also independent, their joint probability distribution is given by the product of their likelihoods.

Maximum Likelihood Estimator:

The maximum likelihood estimator seeks to maximize the likelihood function defined above. For the maximization,

  • We can take the log of the likelihood function, converting the product into a sum

The log likelihood function of the errors is given by

But this is just the negative of the sum of squared errors!

Thus, if you want to maximize the likelihood (or log-likelihood) of the errors, you should minimize the sum of squared errors of the estimates.

The idea is to find the line or plane which best fits the data. Collectively, $b_0, b_1, b_2$ are called regression coefficients. $\epsilon$ is the error term, the part of $Y$ the regression model is unable to explain.

  • $RSS$ (Residual sum of squares) $= \sum (Y_{actual} - Y_{predicted})^2$; it changes with a change of scale

  • $TSS$ (Total sum of squares) $= \sum (Y_{actual} - Y_{avg})^2$

  • $R^2 = 1 - \frac{RSS}{TSS}$; the higher the better; it increases as more coefficients are added

  • $RSE$ (Residual Standard Error) $= \sqrt{\frac{RSS}{d.o.f.}}$, where $d.o.f. = n - 2$

  • Metrics like adjusted $R^2$, $AIC$, and $BIC$ take into consideration the number of features used to build the model and penalize accordingly

How do we find the model that minimizes a metric like $AIC$? One approach is to search through all possible models, called all subset regression. This is computationally expensive and is not feasible for problems with large data and many variables. An attractive alternative is stepwise regression, which successively adds and drops predictors to find a model that lowers $AIC$. Simpler yet are forward selection and backward selection. In forward selection, you start with no predictors and add them one by one, at each step adding the predictor that has the largest contribution to the fit, stopping when the contribution is no longer statistically significant. In backward selection, or backward elimination, you start with the full model and take away predictors that are not statistically significant until you are left with a model in which all predictors are statistically significant.

Penalized regression is similar in spirit to $AIC$. Instead of explicitly searching through a discrete set of models, the model-fitting equation incorporates a constraint that penalizes the model for too many variables (parameters). Rather than eliminating predictor variables entirely, as with stepwise, forward, and backward selection, penalized regression applies the penalty by reducing coefficients, in some cases to near zero. Common penalized regression methods are ridge regression and lasso regression. Regularization is nothing but adding a penalty term to the objective function and controlling the model complexity using that penalty term; it can be used with many machine learning algorithms. Ridge and lasso regression use $L_2$ and $L_1$ regularization, respectively.

Ridge brings the coefficients close to $0$ but not exactly to $0$, so the model retains all the features. Lasso, on the other hand, can bring coefficients exactly to $0$ and hence results in fewer features. Lasso shrinks the coefficients by the same amount, whereas ridge shrinks them by the same proportion.

The relationship between $X$ and $Y$ is linear. Because we are fitting a linear model, we assume that the relationship really is linear, and that the errors, or residuals, are simply random fluctuations around the true line.

The independent variables are not multicollinear. Multicollinearity is when a variable can be explained as a combination of other variables. This can be checked using the VIF (Variance Inflation Factor) $= \frac{1}{1 - R_i^2}$.

A VIF score of $> 10$ indicates that there is a problem.

One very important point to remember is that generalized linear regression is called so because $Y$ is linear with respect to its coefficients $b_0, b_1, b_2$, etc., irrespective of whether the features $x_1, x_2$, etc. are linear or not. Meaning $x_1$ can actually be $x_1^2$ and it won't matter.

$$Y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n$$ and so on...

Suppose we have two variables, $X$ and $Y$, where $Y = X +$ some normal white noise.

What will our coefficient be if we run a regression of $Y$ on $X$?

What happens if we run a regression of $X$ on $Y$?

Let's start with $Y = X$; then the regression line is a perfect fit. The points of such a dataset are $(1,1), (2,2), (3,3), (4,4), (5,5)$.

Adding some normal white noise to these points gives, say, $(1,1), (2,3), (3,5), (4,5), (5,5)$. A regression line fit on these points will move up. Hence, for the coefficients of $Y = mX + c$, $m$ will increase, and $c$ might still stay at $0$ or at most increase.

This movement will go in the negative direction if we predict $X$ based on $Y$.

Say we are running a probabilistic linear regression which does a good job of modeling the underlying relationship between some $y$ and $x$. Now assume all inputs have some noise $\epsilon$ added, which is independent of the training data.

Answer

The objective function for linear regression, where $x$ is the set of input vectors and $w$ are the weights, is: $L(w) = E[(w^T x - y)^2]$

Let's assume that the noise added is Gaussian, $\epsilon \sim N(0, \lambda I)$; then the new objective function is given by $L'(w) = E[(w^T(x + \epsilon) - y)^2]$.

To compute it, we expand:

$$L'(w) = E[(w^T x - y + w^T\epsilon)^2]$$

$$L'(w) = E[(w^T x - y)^2 + 2(w^T x - y)w^T\epsilon + w^T\epsilon \epsilon^T w]$$

$$L'(w) = E[(w^T x - y)^2] + E[2(w^T x - y)w^T\epsilon] + E[w^T\epsilon \epsilon^T w]$$

We know that the expectation of $\epsilon$ is $0$ and that $\epsilon$ is independent of the data, so the middle term becomes $0$ and we are left with: $L'(w) = L(w) + 0 + w^T E[\epsilon \epsilon^T] w$

The last term can be simplified, since $E[\epsilon \epsilon^T] = \lambda I$: $L'(w) = L(w) + w^T \lambda I w$

And therefore, the objective function simplifies to that of L2-regularization: $L'(w) = L(w) + \lambda \|w\|^2$
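As a sanity check (not part of the original text), a small Monte Carlo simulation can confirm that training-time input noise behaves like an $L_2$ penalty; the data, $\lambda$, and the fixed weight vector below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100_000, 3, 0.5

# Fixed data and an arbitrary weight vector w
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.3, size=n)
w = np.array([0.8, -1.5, 0.2])

# Empirical objective with Gaussian input noise eps ~ N(0, lam * I)
eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
noisy_obj = np.mean(((X + eps) @ w - y) ** 2)

# Closed form from the derivation: L(w) + lam * ||w||^2
clean_obj = np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)

print(noisy_obj, clean_obj)  # the two values should be close
```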

Answer

$L_1$ and $L_2$ regularization are both methods of regularization that attempt to prevent overfitting in machine learning. For a regular regression model, assume the loss function is given by $L$. $L_1$ adds the absolute value of the coefficients as a penalty term, whereas $L_2$ adds the squared magnitude of the coefficients as a penalty term.

$$Loss(L_1) = L + \lambda \sum_i |w_i|$$

$$Loss(L_2) = L + \lambda \sum_i w_i^2$$

Where the loss function $L$ is the sum of squared errors, given by the following, where $f(x)$ is the model of interest, for example, linear regression with $p$ predictors:

$$L = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{p} x_{ij} w_j\right)^2 \space \text{for linear regression}$$

If we run gradient descent on the weights $w$, we find that $L_1$ regularization forces any weight towards $0$ irrespective of its magnitude, whereas for $L_2$ regularization the rate at which a weight goes towards $0$ becomes slower as the weight itself approaches $0$. Because of this, $L_1$ is more likely to “zero out” particular weights, and hence remove certain features from the model completely, leading to more sparse models.
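A hedged scikit-learn sketch (not from the original text) illustrating that behavior: on data where only a few features matter, the lasso ($L_1$) zeroes out coefficients while ridge ($L_2$) only shrinks them. The `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which are informative
X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(np.abs(lasso.coef_) < 1e-8))  # typically many
print("Ridge zero coefficients:", np.sum(np.abs(ridge.coef_) < 1e-8))  # typically none
```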

Answer

There are two potential cost functions here, one using the $L_1$ norm and the other using the $L_2$ norm. Below are two basic cost functions using an $L_1$ and an $L_2$ norm respectively:

$$J(w) = \|Xw - y\|_1$$

$$J(w) = \|Xw - y\|_2^2$$

It would be more sensible to use the $L_1$ norm in this case, since the $L_2$ norm penalizes outliers harder and thus the $L_1$ norm gives less weight to the complete failures than the $L_2$ norm does.

Additionally, it would be prudent to involve a regularization term to account for noise. If we assume that the noise is added to each sensor uniformly as $\epsilon \sim N(0, \lambda I)$, then using traditional $L_2$ regularization we would have the cost function: $J(w) = \|Xw - y\|_1 + \lambda\|w\|^2$

However, given the fact that there are many sensors (and a broad range of how useful they are), we could instead assume that the noise is added as $\epsilon \sim N(0, \lambda D)$, where each diagonal term of the matrix $D$ represents the error term for each sensor (and hence penalizes certain sensors more than others). Then our final cost function is given by: $J(w) = \|Xw - y\|_1 + \lambda w^T D w$

Answer

A linear regression model proposes that the output $y$ is linearly dependent on the input vector $X$ by the relation

$$y = W^T X + \beta$$

where $X$ is an $m$-dimensional vector, and $(W; \beta) = \{w_1, w_2, ..., w_m; \beta\}$ are the parameters of the model.

Specifically, we have a set of input vectors, $X = \{X_1, X_2, ..., X_n\}$. Note that every input $X_i$ is a vector, $X_i = \{X_{i1}, X_{i2}, ..., X_{im}\}$,

and a set of outputs $Y = \{y_1, y_2, ..., y_n\}$.

Given the values of the parameters $W$ and $\beta$, the estimate $\hat{y}$ is given by

$$\hat{y}_i = \sum_{j=1}^m X_{ij} w_j + \beta = X_{i1} w_1 + X_{i2} w_2 + ... + X_{im} w_m + \beta$$

Finally, the error term for the $i^{th}$ input is simply the difference between the observed value $y_i$ and the estimate $\hat{y}_i$:

$$\epsilon_i(W, \beta) = y_i - \hat{y}_i = y_i - \sum_{j=1}^m X_{ij} w_j - \beta$$

Note that the error term depends on the parameters of the model, $W$ and $\beta$, and hence is denoted $\epsilon_i(W, \beta)$.

What does it mean for the error terms to be normally distributed? It means that, by definition, the probability density function of the $i^{th}$ error term is given by

$$l(\epsilon_i \mid W; \beta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{\frac{-\epsilon_i^2}{2\sigma^2}\right\}$$

This probability density function of the $i^{th}$ error term is also called its likelihood function, and it also depends on the parameters of the model, $W$ and $\beta$.

$$l = \prod_{i=1}^n l(\epsilon_i \mid W; \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{\frac{-\epsilon_i^2}{2\sigma^2}\right\} = \frac{1}{(\sqrt{2\pi}\sigma)^n} \prod_{i=1}^n \exp\left\{\frac{-\epsilon_i^2}{2\sigma^2}\right\} = \frac{1}{(\sqrt{2\pi}\sigma)^n} \prod_{i=1}^n \exp\left\{\frac{-(y_i - \sum_{j=1}^m X_{ij} w_j - \beta)^2}{2\sigma^2}\right\}$$

We can ignore the constant factor $\frac{1}{(\sqrt{2\pi}\sigma)^n}$.

$$L = \log(l) = \log\left(\prod_{i=1}^n \exp\left\{\frac{-(y_i - \sum_{j=1}^m X_{ij} w_j - \beta)^2}{2\sigma^2}\right\}\right) = \sum_{i=1}^n \frac{-(y_i - \sum_{j=1}^m X_{ij} w_j - \beta)^2}{2\sigma^2}$$

As a final step, for the purpose of optimization, we can ignore the constant divisor $2\sigma^2$ in the summation, giving us

$$L = \sum_{i=1}^n -\left(y_i - \sum_{j=1}^m X_{ij} w_j - \beta\right)^2$$
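As a numerical sanity check (not part of the original text), maximizing this log-likelihood with a generic optimizer should recover the same coefficients as ordinary least squares; $\sigma$ is fixed to 1 below since it does not affect the location of the maximum.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 200, 2
X = rng.normal(size=(n, m))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.3, size=n)

def neg_log_likelihood(params, sigma=1.0):
    w, beta = params[:m], params[m]
    resid = y - X @ w - beta
    # Up to additive constants, -log l = sum(resid^2) / (2 * sigma^2)
    return np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_likelihood, x0=np.zeros(m + 1)).x

# Ordinary least squares solution for comparison
X1 = np.column_stack([X, np.ones(n)])
ols, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("MLE:", mle)   # both should be approximately [2.0, -1.0, 0.5]
print("OLS:", ols)
```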

The higher the t-value for a feature, the more significant the feature is to the output variable. The p-value also plays a role in rejecting the null hypothesis (the null hypothesis being that the feature has zero significance on the target variable). If the p-value is less than 0.05 (95% confidence level) for a feature, then we can consider the feature to be significant.