Statistics

Questions

JOIN Dataframes

Can you tell me the ways in which 2 pandas data frames can be joined?

Answer

  • merge() combines two dataframes on the values of common columns, much like a SQL join; the how option selects an inner, left, right, or outer join. Indices can be used as keys with left_index=True and/or right_index=True.

  • concat() stacks two or more dataframes one below the other (axis=0) or side by side (axis=1).

  • join() combines dataframes on the index and performs a left join by default; it is a shorthand for merge() with left_index=True and right_index=True.
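The three approaches above can be sketched on two small frames. The frame names (`left`, `right`) and their values are made up for illustration; this is a minimal sketch, not the only way to call these functions.

```python
import pandas as pd

# Hypothetical example frames; names and values are assumptions for illustration.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge(): SQL-style join on a shared column (inner join by default).
merged = pd.merge(left, right, on="key")

# concat(): stack frames vertically (axis=0); columns missing in one frame become NaN.
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# join(): align on the index, here after moving "key" into the index of both frames.
joined = left.set_index("key").join(right.set_index("key"), how="left")

print(merged)
print(stacked)
print(joined)
```

With these inputs, merge() keeps only the keys present in both frames, concat() keeps every row from both, and the left join() keeps every key from `left` and fills `y` with NaN where `right` has no match.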

[GOOGLE] Normal Distribution

Write a function to generate N samples from a normal distribution and plot the histogram.

Answer

import numpy as np
import matplotlib.pyplot as plt

def generate_samples(n, mean, std):
  """Generates N samples from a normal distribution with mean `mean` and standard deviation `std`."""
  return np.random.normal(mean, std, n)

def plot_histogram(samples):
  """Plots a histogram of the given samples."""
  plt.hist(samples, bins=100)
  plt.show()

# Generate 1000 samples from a normal distribution with mean 0 and standard deviation 1.
samples = generate_samples(1000, 0, 1)

# Plot the histogram of the samples.
plot_histogram(samples)
[UBER] Bernoulli trial generator

Given a random Bernoulli trial generator, write a function to return a value sampled from a normal distribution.

Answer

# *Solution received from the community via [merge request](https://github.com/dipranjan/dsinterviewqns/pull/5)*

import numpy as np
import matplotlib.pyplot as plt

# straightforward using the central limit theorem.

p = .5
n = 10000

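# Returns one approximately standard normal value via the central limit theorem:
# the mean of n Bernoulli(p) trials is standardized by its mean p and its standard error sqrt(p*(1-p)/n).
def standard_normal_output(p, n):
    bernoulli_mean = p
    bernoulli_std = np.sqrt(p * (1 - p))
    # A binomial draw with n=1 is a single Bernoulli trial; size=n draws n of them.
    sample = np.random.binomial(size=n, n=1, p=p)
    return (sample.mean() - bernoulli_mean) / (bernoulli_std / np.sqrt(n))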

# Now we generate this output n = 10000 times and plot a histogram to show that it is approximately standard normal.
def plot_output():
    outputs = [standard_normal_output(p=p, n=n) for _ in range(n)]
    num_bins = 20
    plt.hist(outputs, bins=num_bins, facecolor='blue', alpha=0.5)
    plt.show()

plot_output()
[PINTEREST] Interquartile Distance

Given an array of unsorted random numbers (decimals) find the interquartile distance.

Answer

# Interquartile distance is the difference between the third and first quartiles (Q3 - Q1)

# First, let's generate a list of random numbers

import random
import numpy as np

li = [round(random.uniform(33.33, 66.66), 2) for i in range(50)]
print(li)

qtl_1 = np.quantile(li, 0.25)
qtl_3 = np.quantile(li, 0.75)

print("Interquartile distance: ", qtl_3 - qtl_1)
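If the interviewer asks for a solution without NumPy, the same quantity can be computed by sorting the array and interpolating linearly between neighbouring values, which is what np.quantile does by default. The helper names `quantile` and `interquartile_distance` are hypothetical, and the sketch assumes a non-empty input.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted, non-empty list (hypothetical helper)."""
    pos = q * (len(sorted_values) - 1)        # fractional position of the quantile
    lower = int(pos)                          # index of the value at or below that position
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower                        # how far the position lies between the two neighbours
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def interquartile_distance(values):
    """Third quartile minus first quartile of an unsorted list of numbers."""
    ordered = sorted(values)
    return quantile(ordered, 0.75) - quantile(ordered, 0.25)


print(interquartile_distance([7.5, 1.0, 4.0, 2.5, 9.0]))  # prints 5.0
```

Sorting dominates the cost, so this runs in O(n log n) time.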
[GENENTECH] Imputing the median

Write a function cheese_median to impute the median price of the selected California cheeses in place of the missing values. You may assume at least one cheese is not missing its price.

Answer

import pandas as pd

cheeses = {
    "Name": ["Bohemian Goat", "Central Coast Bleu", "Cowgirl Mozzarella", "Cypress Grove Cheddar", "Oakdale Colby"],
    "Price": [15.00, None, 30.00, None, 45.00],
}

df_cheeses = pd.DataFrame(cheeses)
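
The setup above stops before defining the function the question asks for. One possible sketch of `cheese_median` follows; it assumes the prices live in a column named "Price" as in the setup, and it returns a copy rather than modifying the input, which is a design choice and not something the question requires.

```python
import pandas as pd


def cheese_median(df):
    """Return a copy of df with missing prices replaced by the median of the known prices."""
    out = df.copy()
    # Series.median() skips missing values by default, so only known prices are used.
    out["Price"] = out["Price"].fillna(out["Price"].median())
    return out


# Same data as in the setup above, repeated so the example runs on its own.
cheeses = {
    "Name": ["Bohemian Goat", "Central Coast Bleu", "Cowgirl Mozzarella", "Cypress Grove Cheddar", "Oakdale Colby"],
    "Price": [15.00, None, 30.00, None, 45.00],
}
df_cheeses = pd.DataFrame(cheeses)

print(cheese_median(df_cheeses))
```

The known prices are 15, 30, and 45, so both missing entries are filled with 30.
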
Show the Central Limit Theorem

To do this, we start with a non-normal distribution, for example the uniform distribution. Next, we draw a sample from that distribution and compute its mean, and we repeat this many times. Per the central limit theorem, the histogram of the sample means will resemble a normal distribution.

Answer

import numpy as np
import matplotlib.pyplot as plt

def sampling(n):
    # Draw a sample of size n from the uniform distribution on [1, 6) and return its mean.
    sample = np.random.uniform(size=n, low=1, high=6)
    # Subtract the population mean 3.5 here if you want the sample means centered at 0.
    return sample.mean()

# Now we repeat the sampling n times to show that the sample means are approximately normally distributed.
def plot_output(n):
    outputs = [sampling(30) for _ in range(n)]
    num_bins = 20
    plt.hist(outputs, bins=num_bins, facecolor='blue', alpha=0.5)
    plt.title("Sample")
    plt.show()

plot_output(10000)
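
The histogram can be backed up numerically: for samples of size 30 from Uniform(1, 6), the sample means should center on 3.5 with a standard deviation close to the population standard deviation divided by sqrt(30). The following is a minimal sketch; the names `rng`, `sample_size`, and `n_repeats` and the fixed seed are assumptions added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only so the run is reproducible

sample_size = 30
n_repeats = 10000

# Draw all samples at once: one row per repeat, one column per observation.
draws = rng.uniform(low=1, high=6, size=(n_repeats, sample_size))
sample_means = draws.mean(axis=1)

# Theoretical values for the uniform distribution on [1, 6).
population_mean = (1 + 6) / 2
population_std = (6 - 1) / np.sqrt(12)
expected_std_of_mean = population_std / np.sqrt(sample_size)

print("mean of sample means:", sample_means.mean(), "expected:", population_mean)
print("std of sample means: ", sample_means.std(), "expected:", expected_std_of_mean)
```

Drawing a two-dimensional array in one call avoids the Python loop and gives the same distribution of sample means.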
