# Hypothesis Testing

Hypothesis testing is the process of evaluating the strength of evidence in a sample; it provides a framework for drawing conclusions about the population.


Inferential Statistics

Sometimes an analysis calls for more data than you can afford to collect, in time or in resources. In such situations you work with a smaller sample instead of the entire population.

Situations like these arise all the time at big companies like Amazon. For example, say the Amazon QC department wants to know what proportion of the products in its warehouses is defective. Instead of inspecting every product (which would be a lot!), the QC team can check a small sample of 1,000 products and compute the defect rate for that sample (i.e. the proportion of defective products). Based on the sample's defect rate, the team can then "infer" the defect rate for all the products in the warehouses. **This process of “inferring” insights from sample data is called “Inferential Statistics”.**
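As a rough illustration of the warehouse example, the sketch below estimates a defect rate from a sample and attaches an approximate 95% interval using the normal approximation. The count of 23 defective items is a hypothetical number chosen for illustration, and the function name `estimate_proportion` is not from the original text.

```python
from math import sqrt
from statistics import NormalDist


def estimate_proportion(defective, sample_size, confidence=0.95):
    """Return the sample proportion and a normal-approximation interval."""
    p_hat = defective / sample_size
    # Standard error of a sample proportion.
    std_err = sqrt(p_hat * (1 - p_hat) / sample_size)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat, p_hat - z_c * std_err, p_hat + z_c * std_err


# Hypothetical inspection result: 23 defective items out of 1,000 sampled.
p_hat, low, high = estimate_proportion(defective=23, sample_size=1000)
print(f"sample defect rate: {p_hat:.3f}")
print(f"approximate 95% interval: [{low:.4f}, {high:.4f}]")
```

The interval is what turns the sample figure into a statement about the whole warehouse: the sample rate alone says nothing about how far the true rate might plausibly be from it.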

Hypothesis Testing

Hypothesis testing is used to check a claim about a population parameter: it tells us whether the sample provides enough evidence to reject that claim.

$H_0 =$ null hypothesis, the status quo (what is already assumed to be true); it always contains one of the following signs: = OR ≤ OR ≥

$H_1 =$ alternate hypothesis, a challenge to the null hypothesis; it always contains one of the following signs: ≠ OR > OR <

Steps

There is an initial research hypothesis of which the truth is unknown. The testing process then goes as follows:

1. State the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process.
2. Consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important, as invalid assumptions will mean that the results of the test are invalid.
3. Decide which test is appropriate, and state the relevant test statistic $T$.
4. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution with known degrees of freedom, or a normal distribution with known mean and variance. If the distribution of the test statistic is completely fixed by the null hypothesis, we call the hypothesis simple; otherwise it is called composite.
5. Select a significance level ($\alpha$), a probability threshold below which the null hypothesis will be rejected. Common values are $5\%$ and $1\%$.
6. The distribution of the test statistic under the null hypothesis partitions the possible values of $T$ into those for which the null hypothesis is rejected (the so-called critical region) and those for which it is not. The probability of the critical region is $\alpha$. In the case of a composite null hypothesis, the maximal probability of the critical region is $\alpha$.
7. Compute from the observations the observed value $t_{obs}$ of the test statistic $T$.
8. Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis $H_0$ if the observed value $t_{obs}$ is in the critical region, and not to reject the null hypothesis otherwise.

A common alternative formulation of this process replaces the critical region with a $p$-value. The last steps then go as follows:

1. Compute from the observations the observed value $t_{obs}$ of the test statistic $T$.
2. Calculate the $p$-value. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as that which was observed (the maximal probability of that event, if the hypothesis is composite).
3. Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the $p$-value is less than (or equal to) the significance level threshold ($\alpha$), for example $0.05$ or $0.01$.

The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.

The difference in the two processes, applied to the radioactive suitcase example:

- "The Geiger-counter reading is $10$. The limit is $9$. Check the suitcase."
- "The Geiger-counter reading is high; $97\%$ of safe suitcases have lower readings. The limit is $95\%$. Check the suitcase."

The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.
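One way to see what the significance level means in practice is to simulate many samples for which the null hypothesis is true and count how often the test statistic falls in the critical region. The sketch below does this for a two-sided z-test; the parameter values (`mu0`, `sigma`, `n`, `trials`, `seed`) and the function name `rejection_rate` are illustrative assumptions, not part of the original text.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(alpha=0.05, mu0=0.0, sigma=1.0, n=30, trials=10_000, seed=0):
    """Fraction of samples, drawn with H0 true, that land in the critical region."""
    rng = random.Random(seed)
    # Two-sided critical value for a standard normal test statistic.
    z_c = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        # Observed test statistic t_obs (a z-score, since sigma is known).
        t_obs = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(t_obs) > z_c:
            rejections += 1
    return rejections / trials


rate = rejection_rate()
print(f"rejection rate under H0: {rate:.3f}")  # expected to be close to alpha
```

Because every simulated sample really does come from the null distribution, each rejection here is a false alarm, and the long-run share of false alarms should sit near the chosen $\alpha$.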

Example

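A manufacturer claims that the average life of its products is $36$ months. An auditor selects a sample of $49$ units of the product and calculates the average life to be $34.5$ months. The population standard deviation is $4$ months. Test the manufacturer's claim at the $3\%$ significance level.

So as per the above problem:

$H_0 : \mu = 36$ months

$H_1 : \mu \neq 36$ months

$\alpha = 3\%$ or $0.03$

$\sigma = 4$

$N = 49$

First, we will take a look at the critical-value method:

In this case we will have critical regions on both sides, with a total area of $0.03$.

So the area to the right is $0.015$, which means that the area up to the upper critical value, UCV (the cumulative probability up to that point), is $1 - 0.015 = 0.985$.

The $z$-score of the cumulative probability at the UCV ($Z_c$ in this case) $=$ the $z$-score of $0.985 = 2.17$

Calculating the critical values: UCV/LCV $= \mu \pm Z_c \times \frac{\sigma}{\sqrt{N}}$

UCV/LCV $= 36 \pm 2.17 \times \frac{4}{\sqrt{49}} = 37.24 \text{ and } 34.76$

Since the sample mean $34.5$ does not lie between the LCV and the UCV, we reject the null hypothesis.

Now let's solve it using the $p$-value method:

Calculate the $z$-score of $34.5 = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{N}}} = \frac{34.5 - 36}{\frac{4}{\sqrt{49}}} = -2.62$

Calculate the $p$-value from $z$, which is the cumulative area up to the sample point: $= 0.0044$

As this point is on the left-hand side, there is no need to subtract the area from $1$.

Since this is a 2-tailed test (we are checking for inequality), we need to multiply by $2$: $2 \times 0.0044 = 0.0088$. *Remember that a 1-tailed test provides more power to detect an effect in its chosen direction, because the entire weight is allocated to one direction only.*

As the $p$-value is $< \alpha$ ($0.0088 < 0.03$), we reject the null hypothesis.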
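The manufacturer example can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` in place of a z-table, so its results differ slightly from the table-rounded figures ($z = -2.62$, $p = 0.0088$); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the manufacturer example.
mu0, x_bar, sigma, n, alpha = 36, 34.5, 4, 49, 0.03

std_normal = NormalDist()
std_err = sigma / sqrt(n)

# Critical-value method: two-sided, so alpha / 2 sits in each tail.
z_c = std_normal.inv_cdf(1 - alpha / 2)
lcv, ucv = mu0 - z_c * std_err, mu0 + z_c * std_err
reject_by_critical_value = not (lcv <= x_bar <= ucv)

# p-value method: tail area beyond |z_obs|, doubled for a two-tailed test.
z_obs = (x_bar - mu0) / std_err
p_value = 2 * std_normal.cdf(-abs(z_obs))
reject_by_p_value = p_value <= alpha

print(f"z_c = {z_c:.2f}, LCV = {lcv:.2f}, UCV = {ucv:.2f}")
print(f"z_obs = {z_obs:.3f}, p-value = {p_value:.4f}")
print(f"reject H0: {reject_by_critical_value} (critical value), {reject_by_p_value} (p-value)")
```

Both methods use the same test statistic and the same $\alpha$, so they lead to the same decision; the $p$-value additionally reports how strong the evidence is.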

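Errors in Hypothesis Testing

A Type I error is rejecting a null hypothesis that is actually true; its probability is the significance level $\alpha$. A Type II error is failing to reject a null hypothesis that is actually false.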

Types of Test

**Z-Test**

- **Purpose**: Used to test hypotheses about a population mean when the population standard deviation is known.
- **Example**: Suppose you want to test if the average weight of a sample of 50 apples is significantly different from 150 grams (population mean). You know the population standard deviation is 10 grams.

**T-Test**

- **Purpose**: Used to test hypotheses about a population mean when the population standard deviation is unknown or when dealing with small sample sizes.
- **Example**: You want to test if a new drug has a statistically significant effect on blood pressure. You collect data from 30 patients before and after treatment and perform a t-test to compare the means.

**Chi-Square Test**

- **Purpose**: Used to test the independence of categorical variables or goodness-of-fit of observed data to an expected distribution.
- **Example**: You want to determine if there is an association between smoking habits (smoker, non-smoker) and the incidence of lung cancer (yes, no) in a population. You create a contingency table and perform a chi-square test for independence.

**ANOVA (Analysis of Variance)**

- **Purpose**: Used to compare means of more than two groups to determine if there are statistically significant differences among them.
- **Example**: You have data on test scores from three different teaching methods (A, B, C). You want to determine if there is a significant difference in mean test scores between the methods.

**Paired T-Test**

- **Purpose**: Used to compare the means of two related groups (e.g., before and after treatment) to determine if there is a significant difference.
- **Example**: You measure the blood pressure of the same group of patients before and after a 6-week exercise program to see if there is a significant change.

**Wilcoxon Rank-Sum Test (Mann-Whitney U Test)**

- **Purpose**: Used to compare two independent groups when the data is not normally distributed or when ordinal data is involved.
- **Example**: You want to determine if there is a significant difference in test scores between students who received tutoring and those who did not. The data is not normally distributed.

**Fisher's Exact Test**

- **Purpose**: Used to test the independence of two categorical variables in a 2x2 contingency table, especially when sample sizes are small.
- **Example**: You want to determine if there's an association between gender (male, female) and the success of a medical treatment (success, failure) in a small sample of patients.

**K-Sample Anderson-Darling Test**

- **Purpose**: Used to compare the distribution of multiple independent samples to determine if they come from the same population.
- **Example**: You have three different groups of people, and you want to test if their ages are drawn from the same population distribution.

These are just a few examples of hypothesis tests, and there are many more specific tests designed for different types of data and research questions. The choice of which test to use depends on the nature of your data, the research question, and the assumptions of the test.
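To make one of these tests concrete, here is a minimal sketch of the chi-square test of independence for the smoking and lung-cancer example, written with the standard library only. The counts in the table are hypothetical, the function name `chi_square_2x2` is illustrative, and no continuity correction is applied. The p-value shortcut used here is valid only for a 2x2 table (1 degree of freedom).

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed_count - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom; for 1 degree of freedom the
    # upper-tail chi-square probability equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = smoker / non-smoker, columns = cancer yes / no.
observed = [[30, 70],
            [10, 90]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.5f}")
```

For larger tables or the other tests in the list, a statistics library is the more practical choice; the point of the hand-rolled version is only to show where the statistic and the p-value come from.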

Questions


The chance of making at least one Type I error grows with the number of t-tests that are run. If $\alpha = 0.05$, there is a $5\%$ chance of a Type I error on a single test, and $p(\text{Type I})$ across many tests will be higher. For example, with 2 tests (A and B):

$P(\text{type I error}) = p(\text{type I error on A OR type I error on B})$

$= 2p(\text{type I error on single test}) - p(\text{type I error on A AND type I error on B})$

$= 2 \times 0.05 - 0.05^2 \ (\text{assuming independence of tests}) = 0.1 - 0.0025 = 0.0975$

If you want your $p(\text{type I error})$ across $n$ tests to remain at $5\%$, you will need to decrease the $\alpha$ used in each individual test. The Bonferroni correction can be applied: $\alpha$ is reduced to $\alpha / n$, where $n$ is the number of experiments you are running.

Otherwise, you can start with an F-test to identify whether at least $1$ test sees some significant effect, and then run a t-test on the specific experiment with the highest effect size. Granted, the $p$-value of that test will also depend on the variance of its sample; but if we assume constant variance across tests, the test with the highest effect size is, in expectation, the best-performing one. Running only a single t-test keeps your $p(\text{type I error})$ low.
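The growth of the overall Type I error rate, and the effect of the Bonferroni correction, can be tabulated with a few lines of Python. This is a sketch that assumes the tests are independent; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one Type I error) across n independent tests run at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


for n in (1, 2, 5, 10):
    uncorrected = familywise_error_rate(0.05, n)
    corrected = familywise_error_rate(bonferroni_alpha(0.05, n), n)
    print(f"n={n:2d}  uncorrected={uncorrected:.4f}  bonferroni={corrected:.4f}")
```

For $n = 2$ the uncorrected rate matches the $0.0975$ derived by hand, while the corrected rate stays at or just below $0.05$ for every $n$.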
