THE DATA SCIENCE INTERVIEW BOOK
Probability Distribution

Knowing the distribution of data helps us better model the world around us. It helps us to determine the likeliness of various outcomes or make an estimate of the variability of an occurrence.


Random Variable

Random Variable maps the outcome of sample space into real numbers.

Example: How many heads when we toss 3 coins?

X could be 0, 1, 2 or 3 randomly, and each value might have a different probability. X = "the number of heads" is the random variable.

In this case, there could be 0 heads (if all the coins land tails up), 1 head, 2 heads or 3 heads. So the sample space = {0, 1, 2, 3}. But this time the outcomes are NOT all equally likely. The three coins can land in eight possible ways: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.

Looking at these eight outcomes, we see just 1 case of three heads, but 3 cases of two heads, 3 cases of one head, and 1 case of zero heads. So:

  • P(X = 3) = 1/8 = {HHH}

  • P(X = 2) = 3/8 = {HHT, HTH, THH}

  • P(X = 1) = 3/8 = {HTT, THT, TTH}

  • P(X = 0) = 1/8 = {TTT}

And this is what becomes the probability distribution.

A frequency distribution comes from actually performing the experiment n times; as n → ∞, its shape comes closer and closer to the probability distribution.
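A quick simulation makes this concrete. The sketch below (plain Python, no external libraries) repeats the 3-coin experiment many times and compares the empirical frequencies with the theoretical PMF:

```python
import random
from collections import Counter

random.seed(0)

# Theoretical PMF for the number of heads in 3 fair coin tosses
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

# Empirical frequency distribution from n repetitions of the experiment
n = 100_000
counts = Counter(sum(random.choice([0, 1]) for _ in range(3)) for _ in range(n))
freq = {k: counts[k] / n for k in range(4)}

for k in range(4):
    print(f"X={k}: empirical {freq[k]:.3f} vs theoretical {pmf[k]:.3f}")
```

With 100,000 repetitions the empirical frequencies land within a fraction of a percent of 1/8 and 3/8, illustrating the convergence described above.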

Now, a probability distribution can be of 2 types: discrete and continuous. An example of a discrete distribution is shown above.

  • When we use a probability function to describe a discrete probability distribution, we call it a probability mass function (PMF). The probability mass function, f, simply returns the probability of the outcome. For a fair six-sided die, the probability of rolling a 3 is f(3) = 1/6.

  • When we use a probability function to describe a continuous probability distribution, we call it a probability density function (PDF).

Now, depending on the problem, one can choose the corresponding distribution and find the probability for some value of the random variable.
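As a minimal illustration of the two kinds of probability function, the snippet below codes the PMF of a fair die and the Gaussian PDF directly from its formula (the function names `die_pmf` and `normal_pdf` are just illustrative):

```python
import math

# PMF of a fair six-sided die: returns the probability of outcome k
def die_pmf(k):
    return 1/6 if k in {1, 2, 3, 4, 5, 6} else 0.0

# PDF of a normal distribution N(mu, sigma^2) evaluated at x
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

print(die_pmf(3))       # 1/6, a genuine probability
print(normal_pdf(0.0))  # the PDF's peak value, a density, not a probability
```

Note that a PDF value is a density, not a probability; probabilities for a continuous variable come from integrating the PDF over an interval.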

Types of Distribution

Some common types of probability distribution are as follows:

Normal (Gaussian) Distribution

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution.

Despite the different shapes, all forms of the normal distribution have the following characteristic properties.

  • They’re all symmetric. The normal distribution cannot model skewed distributions.

  • The mean, median, and mode are all equal.

  • Half of the population is less than the mean and half is greater than the mean.

  • They follow the Empirical Rule, which describes the percentage of the data that falls within specific numbers of standard deviations from the mean for bell-shaped curves:

| Mean ± standard deviations | Percentage of data contained |
| --- | --- |
| 1 | 68% |
| 2 | 95% |
| 3 | 99.7% |
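The Empirical Rule is easy to verify by simulation; a small sketch using only the standard library:

```python
import random

random.seed(42)
mu, sigma, n = 0.0, 1.0, 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

# Fraction of samples within k standard deviations of the mean
within = {k: sum(abs(x - mu) <= k * sigma for x in samples) / n for k in (1, 2, 3)}
print(within)  # close to {1: 0.683, 2: 0.954, 3: 0.997}
```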

Measures to understand a distribution:

There are 3 varieties of measures required to understand a distribution:

  • Measure of Central tendency

  • Measure of dispersion

  • Measure to describe the shape of the curve

Measure of Central Tendency

Measures of central tendency describe a population through a single metric. For example, if you were to compare the saving habits of people across various nations, you would compare the average savings rate in each of these nations. The measures of central tendency are:

  • Mean: or the average

  • Median: the value which divides the population into two halves

  • Mode: the most frequent value in a population

Measure of Dispersion

Measures of dispersion reveal how the population is distributed around the measures of central tendency.

  • Range: The difference between the maximum and minimum values in the population

  • Quartiles: Values which divide the population into 4 equal subsets (typically referred to as the first, second and third quartiles)

  • Inter-quartile range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1). By definition of quartiles, 50% of the population lies in the inter-quartile range.

  • Variance: The average of the squared differences from the Mean.

  • Standard Deviation: The square root of the variance
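All of these dispersion measures are available in Python's standard `statistics` module; a small illustrative sketch (the sample data is made up):

```python
import statistics

# Made-up sample data for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(data) - min(data)                   # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # inter-quartile range
var = statistics.pvariance(data)              # variance (1/n denominator)
std = statistics.pstdev(data)                 # standard deviation

print(rng, (q1, q2, q3), iqr, var, std)
```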

Measure to describe shape of distribution

  • Skewness: A measure of asymmetry. A negatively skewed curve has a long left tail, and vice versa.

  • Kurtosis: A measure of "peakedness". Distributions with higher peaks have positive kurtosis, and vice versa.

Box Plots

Box plots are one of the easiest and most intuitive ways to understand distributions. They show the median, quartiles and outliers on a single plot.

Unbiased Estimator

An unbiased estimator is a statistic whose expected value equals the population parameter it approximates: on average, it is neither an overestimate nor an underestimate. If the estimator systematically over- or underestimates, the expected difference between the estimator and the parameter is called the "bias". For example, the sample mean is an unbiased estimator of the population mean.
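A simulation makes the idea tangible: the sample mean is unbiased, while the variance computed with a 1/n denominator is biased (it underestimates on average). The population parameters below are arbitrary illustrative choices:

```python
import random
import statistics

random.seed(1)
mu, sigma2 = 10.0, 9.0   # true population mean and variance (illustrative)
n, trials = 5, 50_000    # small samples make the bias visible

mean_estimates, var_n, var_n1 = [], [], []
for _ in range(trials):
    sample = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    mean_estimates.append(m)
    var_n.append(ss / n)         # 1/n denominator: biased
    var_n1.append(ss / (n - 1))  # 1/(n-1) denominator: unbiased

print(statistics.mean(mean_estimates))  # near 10 (unbiased)
print(statistics.mean(var_n))           # near 7.2 = (n-1)/n * 9 (biased low)
print(statistics.mean(var_n1))          # near 9 (unbiased)
```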

Maximum Likelihood Estimation (MLE)

Say you have some data. Say you're willing to assume that the data comes from some distribution -- perhaps Gaussian. There are an infinite number of different Gaussians that the data could have come from (which correspond to the combination of the infinite number of means and variances that a Gaussian distribution can have). MLE will pick the Gaussian (i.e., the mean and variance) that is "most consistent" with your data (the precise meaning of consistent is explained below).

So, say you've got a data set of y = {−1, 3, 7}. The most consistent Gaussian from which that data could have come has a mean of 3 and a variance of 32/3. It could have been sampled from some other Gaussian, but one with a mean of 3 and a variance of 32/3 is most consistent with the data in the following sense: the probability of getting the particular y values you observed is greater with this choice of mean and variance than with any other choice.

The calculation, step by step:

We have a dataset: $y = \{-1, 3, 7\}$

Assuming the data follows a normal distribution $\mathcal{N}(\mu, \sigma^2)$, the likelihood function is:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$$

Taking the log-likelihood:

$$\log L(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2$$

The MLE for the mean of a Gaussian is the sample mean:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

Substituting the values: $\hat{\mu} = \frac{-1 + 3 + 7}{3} = \frac{9}{3} = 3$

The MLE for the variance is: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2$

Substituting the values: $\hat{\sigma}^2 = \frac{(-1 - 3)^2 + (3 - 3)^2 + (7 - 3)^2}{3} = \frac{32}{3} \approx 10.67$

However, if the sample variance formula with the $n-1$ denominator were used instead of the true MLE formula: $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 = \frac{32}{2} = 16$

This would be the unbiased variance estimator, not the true MLE.

Maximum Likelihood Estimation can be applied to both regression and classification problems.
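The worked example above is straightforward to reproduce in code:

```python
# MLE of a Gaussian fitted to y = [-1, 3, 7], matching the derivation above
y = [-1, 3, 7]
n = len(y)

mu_hat = sum(y) / n                                         # MLE mean = sample mean
var_mle = sum((v - mu_hat) ** 2 for v in y) / n             # MLE variance (1/n)
var_unbiased = sum((v - mu_hat) ** 2 for v in y) / (n - 1)  # unbiased (1/(n-1))

print(mu_hat, var_mle, var_unbiased)  # 3.0, 32/3 ≈ 10.67, 16.0
```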

Questions

[LIME] Example of unbiased estimator

What is an unbiased estimator and can you provide an example for a layman to understand?

Answer

One famous example of an unrepresentative sample is the Literary Digest voter survey, which predicted Alf Landon would win the 1936 presidential election. The survey was biased, as it failed to include a representative sample of low-income voters, who were more likely to be Democrats and vote for Franklin D. Roosevelt.

If the sampling had been done correctly, the estimator would have been unbiased, as it would have matched the actual outcome for the population: a win for Franklin D. Roosevelt.

[GOOGLE] Median of Uniform Distribution

Given 3 i.i.d. variables from a uniform distribution from 0 to 4, what's the chance the median is greater than 3?

Answer

The median exceeds 3 only if at least 2 of the 3 variables are greater than 3.

$P(M>3) = P(GGL) + P(GLG) + P(LGG) + P(GGG) = 3 \cdot (1/4)^2 \cdot (3/4) + (1/4)^3 = 5/32$, where $G$ stands for a draw greater than 3, which for a uniform distribution on $(0, 4)$ has probability $P(X > 3) = 1/4$, and $L$ stands for a draw less than 3, with probability $P(X < 3) = 3/4$.
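A Monte Carlo check of the answer, treating the distribution as continuous U(0, 4) as stated:

```python
import random

random.seed(7)
trials = 200_000
hits = 0
for _ in range(trials):
    draws = sorted(random.uniform(0, 4) for _ in range(3))
    if draws[1] > 3:  # the median of 3 values is the middle one
        hits += 1

print(hits / trials, 5 / 32)  # simulated probability vs exact 0.15625
```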

[SPOTIFY] MLE of Uniform Distribution

Suppose you draw n samples from a uniform distribution U(a, b). What is the MLE estimate of a and b?

Answer

Let $x_1, x_2, \ldots, x_n$ be the $n$ samples drawn.

Recall that the pdf of the uniform distribution is:

$$f(x) = \frac{1}{b-a}$$

Thus, the likelihood function $\mathcal{L}$ is simply the product of the pdf $n$ times:

$$\mathcal{L}(a, b) = \frac{1}{(b-a)^n}$$

The MLE occurs at the values of $a$ and $b$ for which this quantity is maximized. Since $(b-a)^n$ is in the denominator, and $b-a$ must always be positive because $b > a$, the likelihood is maximized when $b-a$ is as small as possible. This means we want $a$ as big as possible and $b$ as small as possible. But for each sample $x_i$ to be possible, $a$ must be no larger than that value (and $b$ no smaller), so the maximum likelihood estimates are $\hat{a} = \min(x_1, x_2, \ldots, x_n)$ and $\hat{b} = \max(x_1, x_2, \ldots, x_n)$.
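A quick numerical sketch of these estimators, with made-up values for $a$ and $b$:

```python
import random

random.seed(3)
a_true, b_true = 2.0, 9.0  # illustrative true parameters of U(a, b)
samples = [random.uniform(a_true, b_true) for _ in range(1_000)]

# MLE estimates: the sample minimum and maximum
a_hat, b_hat = min(samples), max(samples)
print(a_hat, b_hat)  # slightly inside [a_true, b_true]
```

Note that although these are the MLEs, they are biased: the sample minimum can never be below $a$ and the sample maximum can never exceed $b$, so on average $\hat{a} > a$ and $\hat{b} < b$.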

[MCKINSEY] Flipping 576 Times

You flip a fair coin 576 times. Without using a calculator, calculate the probability of flipping at least 312 heads.

Answer

The coin is fair, so $p(H) = 0.5$. Since each flip has only 2 outcomes, we can use a binomial distribution:

mean $= np = 576 \times 0.5 = 288$, var $= np(1-p) = 576 \times 0.5 \times 0.5 = 144$, stddev $= \sqrt{144} = 12$.

For a normal distribution, 68% of the data falls within one standard deviation of the mean, 95% within two, and 99.7% within three.

$312 = 288$ (mean) $+ 2 \times 12$ (stddev), so 312 is two standard deviations above the mean. The probability of landing more than two standard deviations from the mean, in either tail, is about 5%. Since we only want at least 312 heads, we need just the right-tail area, which is 5%/2 = 2.5%. So the probability of flipping at least 312 heads is about 2.5%.
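The 2.5% figure from the Empirical Rule can be checked against the exact binomial tail:

```python
import math

n, p = 576, 0.5
# Exact tail probability P(X >= 312) for X ~ Binomial(576, 0.5)
tail = sum(math.comb(n, k) for k in range(312, n + 1)) / 2 ** n
print(tail)  # close to 0.025, matching the Empirical Rule estimate
```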

[GOOGLE] Non-normal Probability Distribution

Explain how a probability distribution could be not normal and give an example scenario.

Answer

Normal probability distributions are characterized by their famous bell-shaped probability density function. The observations are centered around the mean and spread symmetrically around it according to the standard deviation of the distribution. Normal distributions occur frequently in nature, e.g. the distribution of human heights.

There are other types of distributions which are not normal; in particular, since the normal distribution is continuous, no discrete random variable follows a normal distribution.

There can be many examples of Non-Normal distribution:

  • Flip a coin ten times and count the number of heads you get. That follows a binomial distribution

  • Flip a coin until you get five heads and count the number of flips. That follows a negative binomial distribution

  • Take a well-shuffled deck of cards and count how many red cards there are in the first ten. That follows a hypergeometric distribution
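All three examples can be simulated with the standard library alone; the sketch below checks that the sample means land where the corresponding distributions predict:

```python
import random

random.seed(11)

def heads_in_10():                # binomial(10, 0.5): mean = 5
    return sum(random.random() < 0.5 for _ in range(10))

def flips_until_5_heads():        # negative binomial: mean flips = 5 / 0.5 = 10
    flips = heads = 0
    while heads < 5:
        flips += 1
        heads += random.random() < 0.5
    return flips

def red_in_first_10():            # hypergeometric: 26 red in a 52-card deck, mean = 5
    deck = ["R"] * 26 + ["B"] * 26
    random.shuffle(deck)
    return deck[:10].count("R")

trials = 20_000
mean_binom = sum(heads_in_10() for _ in range(trials)) / trials
mean_negbin = sum(flips_until_5_heads() for _ in range(trials)) / trials
mean_hyper = sum(red_in_first_10() for _ in range(trials)) / trials
print(mean_binom, mean_negbin, mean_hyper)  # near 5, 10, 5
```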
