Probability Distribution
Knowing the distribution of data helps us better model the world around us. It helps us to determine the likeliness of various outcomes or make an estimate of the variability of an occurrence.
A Random Variable maps the outcomes of a sample space to real numbers.
Example: How many heads when we toss 3 coins?
There could be 0, 1, 2 or 3 heads, and each count has a different probability. X = "The number of Heads" is the Random Variable.

In this case, there could be 0 Heads (if all the coins land Tails up), 1 Head, 2 Heads or 3 Heads. So, the Sample Space = {0, 1, 2, 3}. But this time the outcomes are NOT all equally likely. The three coins can land in eight possible ways: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.
Looking at the eight outcomes we see just 1 case of Three Heads, but 3 cases of Two Heads, 3 cases of One Head, and 1 case of Zero Heads. So: P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8.
And this is what becomes the probability distribution.
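This distribution can be checked by enumerating the eight equally likely coin arrangements; a small Python sketch:

```python
from itertools import product
from collections import Counter

# Enumerate all 2**3 = 8 equally likely outcomes of tossing 3 fair coins.
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome.
counts = Counter(o.count("H") for o in outcomes)

# Probability distribution of X: P(X = k) = (# outcomes with k heads) / 8.
pmf = {k: counts[k] / len(outcomes) for k in sorted(counts)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```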
When we use a probability function to describe a continuous probability distribution, we call it a probability density function (PDF).
Now, depending on the problem type, one can choose the corresponding distribution and find the probability for some value of the random variable.
Some common types of probability distribution are as follows:
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution.
Despite the different shapes, all forms of the normal distribution have the following characteristic properties.
They’re all symmetric. The normal distribution cannot model skewed distributions.
The mean, median, and mode are all equal.
Half of the population is less than the mean and half is greater than the mean.
They follow the Empirical Rule, which describes the percentage of the data that falls within specific numbers of standard deviations from the mean for bell-shaped curves.
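As a quick sanity check of the Empirical Rule, one can sample from a standard normal and count what fraction of the draws falls within 1, 2 and 3 standard deviations of the mean (a simulation sketch; the sample size is arbitrary):

```python
import random

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(100_000)]

# Fraction of samples within k standard deviations of the mean
# (standard normal, so mean 0 and standard deviation 1).
fractions = {k: sum(abs(x) <= k for x in samples) / len(samples) for k in (1, 2, 3)}
print(fractions)  # roughly {1: 0.683, 2: 0.954, 3: 0.997}
```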
There are 3 varieties of measures required to understand a distribution:
Measure of Central tendency
Measure of dispersion
Measure to describe shape of curve
Measures of central tendency are measures which help you describe a population through a single metric. For example, if you were to compare the saving habits of people across various nations, you would compare the average savings rate in each of these nations. Following are the measures of central tendency:
Mean: or the average
Median: the value which divides the population into two halves
Mode: the most frequent value in a population
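All three measures are available in Python's standard library; a minimal sketch with made-up data:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, just for illustration

print(statistics.mean(data))    # 5.0 -> (2+3+3+5+7+10)/6
print(statistics.median(data))  # 4.0 -> average of the two middle values, 3 and 5
print(statistics.mode(data))    # 3   -> most frequent value
```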
Measures of dispersion reveal how the population is distributed around the measures of central tendency.
Range: Difference between the maximum and minimum values in the population
Quartiles: Values which divide the population into 4 equal subsets (typically referred to as the first quartile, second quartile and third quartile)
Inter-quartile range: The difference between the third quartile (Q3) and the first quartile (Q1). By definition of quartiles, 50% of the population lies in the inter-quartile range.
Variance: The average of the squared differences from the Mean.
Standard Deviation: The square root of the Variance
Skewness: A measure of the asymmetry of a distribution. A negatively skewed curve has a long left tail and vice versa.
Kurtosis: A measure of the "peakedness" of a distribution. Distributions with higher peaks have positive kurtosis and vice versa.
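Most of these dispersion and shape measures can be computed with the standard library alone; a sketch with made-up data, where skewness and excess kurtosis are written out as standardized moments since the statistics module has no built-ins for them:

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]  # made-up data with one high outlier

rng = max(data) - min(data)                   # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles ("exclusive" method)
iqr = q3 - q1                                 # inter-quartile range
var = statistics.pvariance(data)              # population variance
sd = statistics.pstdev(data)                  # population standard deviation

# Skewness and excess kurtosis as standardized third and fourth moments
# (population versions).
mean = statistics.mean(data)
skew = sum((x - mean) ** 3 for x in data) / (len(data) * sd ** 3)
kurt = sum((x - mean) ** 4 for x in data) / (len(data) * sd ** 4) - 3

print(rng, iqr, round(var, 2), round(skew, 2), round(kurt, 2))
```

The outlier at 10 stretches the right tail, so the computed skewness comes out positive, matching the definition above.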
Box plots are one of the easiest and most intuitive ways to understand distributions. They show the median, quartiles and outliers (and sometimes the mean) on a single plot.
An unbiased estimator is a statistic used to approximate a population parameter that, on average, is neither an overestimate nor an underestimate: its expected value equals the parameter. When an estimator does systematically over- or underestimate, the mean of the difference is called the "bias." For example, the sample mean is an unbiased estimator of the population mean.
Reference: Discussion, Explanation, Implementation
Say you have some data. Say you're willing to assume that the data comes from some distribution -- perhaps Gaussian. There are an infinite number of different Gaussians that the data could have come from (which correspond to the combination of the infinite number of means and variances that a Gaussian distribution can have). MLE will pick the Gaussian (i.e., the mean and variance) that is "most consistent" with your data (the precise meaning of consistent is explained below).
Maximum Likelihood Estimation can be applied to both regression and classification problems.
A frequency distribution comes from actually performing the experiment some number of times; as the number of repetitions grows, its shape comes closer and closer to the probability distribution.
Now, the probability distribution can be of two types, discrete and continuous. An example of a discrete one is shown above.
When we use a probability function to describe a discrete probability distribution, we call it a probability mass function (PMF). The probability mass function, p(x), just returns the probability of the outcome x. For example, the probability of rolling any particular face of a fair die is p(x) = 1/6.
| Mean +/- standard deviations | Percentage of data contained |
| --- | --- |
| 1 | 68% |
| 2 | 95% |
| 3 | 99.7% |
So, say you've got a data set x1, ..., xn. The most consistent Gaussian from which that data could have come has its mean equal to the sample mean and its variance equal to the (biased) sample variance, i.e. the average squared deviation from that mean. The data could have been sampled from some other Gaussian. But this choice of mean and variance is most consistent with the data in the following sense: the probability of getting the particular values you observed is greater with this choice than it is with any other choice.
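A sketch of this idea with a small hypothetical data set: the MLE pair (sample mean, biased sample variance) yields a higher log-likelihood than any other choice of parameters, illustrated here against one alternative:

```python
import math
import statistics

data = [1.8, 2.0, 2.2, 2.4, 2.6]  # hypothetical sample

# MLE estimates for a Gaussian: sample mean and the (biased) sample variance.
mu_hat = statistics.mean(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

def log_likelihood(mu, var, xs):
    """Log-likelihood of i.i.d. Gaussian data under parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

# The MLE pair beats any other choice of (mu, var) on this data;
# here we compare it against a shifted mean as one example.
best = log_likelihood(mu_hat, var_hat, data)
other = log_likelihood(mu_hat + 0.5, var_hat, data)
print(best > other)  # True
```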
Given 3 i.i.d. variables from a uniform distribution from a to b, what's the chance the median is greater than some threshold t?

This is only possible if at least 2 of the 3 random variables are greater than t.

P(median > t) = 3p^2(1 - p) + p^3, where p = (b - t)/(b - a) stands for the probability of a single draw being greater than t: the first term counts the 3 ways in which exactly two of the draws exceed t, and the second term the case where all three do.
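A quick check of this reasoning by simulation, using assumed values Uniform(0, 1) and threshold t = 0.75 (any values would do):

```python
import random

random.seed(1)

# Assumed values for illustration: X ~ Uniform(0, 1), threshold t = 0.75.
t = 0.75
p = 1 - t  # P(X > t) for Uniform(0, 1)

# Median of 3 draws exceeds t iff at least 2 of the 3 draws exceed t.
exact = 3 * p**2 * (1 - p) + p**3  # = 3(1/16)(3/4) + 1/64 = 10/64 = 0.15625

trials = 200_000
hits = 0
for _ in range(trials):
    draws = sorted(random.random() for _ in range(3))
    if draws[1] > t:  # draws[1] is the median of the three
        hits += 1
print(exact, hits / trials)  # 0.15625 and a nearby simulated estimate
```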
Let x1, ..., xn be the samples drawn from Uniform(a, b).

Thus, the likelihood function is simply the product of the pdf n times, which is: L(a, b) = (1 / (b - a))^n.

The MLE will occur at the values of a and b for which that quantity is maximized. Since (b - a) is in the denominator, and must always be positive because b > a, the likelihood is maximized when (b - a) is minimized. This means we want a as big as possible, and b as small as possible. But for each xi to have been sampled, a must be smaller than that value (and b must be larger), so the maximum likelihood estimates are a = min(xi) and b = max(xi).
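A quick numerical illustration with assumed parameters: with enough samples, the min and max closely bracket the true interval.

```python
import random

random.seed(2)

# Assumed "true" (unknown) parameters of the uniform distribution.
a_true, b_true = 3.0, 7.0
samples = [random.uniform(a_true, b_true) for _ in range(1000)]

# MLE for Uniform(a, b): a_hat = min(samples), b_hat = max(samples).
a_hat, b_hat = min(samples), max(samples)
print(round(a_hat, 2), round(b_hat, 2))  # close to 3.0 and 7.0
```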
For a fair coin flipped n times: since each flip has only 2 outcomes, we can use a binomial distribution with p = 1/2.

mean = np = n/2, var = np(1 - p) = n/4, stddev = sqrt(var) = sqrt(n)/2.

A head count of mean + k * stddev can be turned into a probability with the Empirical Rule: the rule gives the fraction of outcomes within k standard deviations of the mean, so the fraction outside is 1 minus that. Since we are only looking at the probability of at least that many heads, we take the right tail of the distribution, which is half of the outside fraction.
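A sketch with assumed numbers (n = 100 flips, threshold 60 heads = mean + 2 standard deviations), comparing the Empirical Rule tail estimate with the exact binomial tail:

```python
import math

# Hypothetical numbers for illustration: n = 100 flips of a fair coin,
# asking for "at least 60 heads" = mean + 2 standard deviations.
n, p = 100, 0.5
mean = n * p                     # 50
sd = math.sqrt(n * p * (1 - p))  # 5.0

# Empirical Rule: ~95% of outcomes lie within 2 sd of the mean,
# so the right tail holds about (1 - 0.95) / 2 = 0.025.
approx = (1 - 0.95) / 2

# Exact binomial right-tail probability P(X >= 60).
exact = sum(math.comb(n, k) for k in range(60, n + 1)) / 2**n
print(round(approx, 3), round(exact, 4))  # 0.025 vs roughly 0.028
```

The small gap between the two comes from the Empirical Rule being an approximation for bell-shaped curves and from the binomial being discrete.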