Central Limit Theorem
In machine learning, statistics plays a significant role in understanding data distributions and in inferential statistics. A data scientist must understand the math behind sample data, and the Central Limit Theorem addresses many of the problems that arise. Let us discuss the concept of the Central Limit Theorem. It states that the distribution of sample means approaches a normal distribution as the sample size grows, even if the distribution of the population is not normal.
Let us take a look at population distribution.
The data present in the population may follow any type of distribution. Some of these types are-
Normal Distribution in Central Limit Theorem
This distribution follows a bell curve. It is also known as the Gaussian distribution. In this type of distribution, values near the mean are more frequent than values far from the mean. Here is a diagram to represent a normal distribution curve.
Left-skewed Distribution in Central Limit Theorem
This type of data has a very long tail towards the left and the data is mostly concentrated towards the right. It is not normal and can denote different conditions for different types of data. Let us take a look at an example.
In the above data, which is left-skewed, the median lies to the right of the mean. If we consider the monthly turnover of a business, this can be considered good news: it signifies that the business is growing. If, however, we are a machine manufacturing company and this is the data on faulty machines we have produced, then it is bad news, because the number of faulty machines is increasing, which means a great loss for us.
Right-skewed Distribution
As the name suggests, it is just the opposite of the left-skewed distribution. The data has a long tail towards the right and the data is concentrated towards the left.
In the above diagram, the median is on the left side of the mean and the tail is on the right side. Now, if we take the same business example from the left-skewed case, we can say the company may soon go bankrupt. But if we consider the manufacturing company, we can say the number of faulty machines is decreasing as time passes. This is how we interpret the distribution of the data.
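The relationship between mean and median described for the two skewed shapes can be checked on synthetic data. The following is a minimal sketch, assuming exponentially distributed values as a stand-in for right-skewed data and their mirror image for left-skewed data; the variable names and the choice of distribution are illustrative and not taken from the diagrams above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Assumption: exponential values stand in for right-skewed data (long right tail).
right_skewed = rng.exponential(scale=1.0, size=100_000)
# Mirroring the same values produces a long left tail instead.
left_skewed = -right_skewed

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = float(data.mean())
    median = float(np.median(data))
    side = "left" if median < mean else "right"
    print(f"{name}: mean={mean:.3f}, median={median:.3f}, median is to the {side} of the mean")
```

In the right-skewed case the long tail pulls the mean above the median, and in the left-skewed case it pulls the mean below the median.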
Uniform distribution
In this type of distribution, every value in an interval from a to b is equally likely, so the probability density is constant over that interval.
Here, the probability function f(x) is given by-
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases}$$
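The uniform density is 1/(b-a) inside the interval from a to b and zero outside it. A minimal sketch of that definition follows; the function name `uniform_pdf` is hypothetical and chosen only for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b] (illustrative helper)."""
    if b <= a:
        raise ValueError("b must be greater than a")
    # Constant height 1 / (b - a) inside the interval, zero outside it.
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(uniform_pdf(3, 2, 6))  # inside [2, 6]: 1 / (6 - 2)
print(uniform_pdf(7, 2, 6))  # outside [2, 6]
```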
The types above describe possible distributions of the population from which we draw a random sample. The Central Limit Theorem applies to all of these distributions, provided that the population has a finite variance. It also requires the variables to be independent and identically distributed: each observation comes from the same distribution, and the value of one observation does not depend on another.
Sampling Distribution
When we repeatedly draw samples of the same size and calculate the mean of each one, we can plot those means in a histogram. The histogram then helps us understand how the sample mean is distributed. This is called the sampling distribution of the mean. In practice, statistical procedures spare us from repeating the study many times and make it possible to estimate the population mean from a single random sample.
Central Limit theorem with respect to Size
The shape of the sampling distribution changes as the sample size increases. So the question is ‘how large should the sample size be for the sampling distribution to look normal?’. It depends on the population data. If the population is far from normal, the sample size must be larger before the sampling distribution becomes approximately normal. A sample size of 30 is a commonly quoted rule of thumb, but sometimes a much larger size is required. Let us see the distribution for several sample sizes.
When n=1, the sample means simply follow the population distribution, which here is uniform. By n=10, the distribution of the sample means is already close to normal. This graph shows how the sampling distribution approaches normality as the sample size increases.
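The effect of sample size can be reproduced with a small simulation. The sketch below assumes a uniform(0, 1) population and uses excess kurtosis (about 0 for a normal distribution, -1.2 for a uniform one) as a rough numerical measure of shape; the sample sizes, the number of repetitions, and the helper `excess_kurtosis` are illustrative choices, not part of the original graph.

```python
import numpy as np

rng = np.random.default_rng(42)

def excess_kurtosis(values):
    """Excess kurtosis: about 0 for a normal distribution (illustrative helper)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 4) - 3.0)

n_repeats = 20_000  # number of samples drawn for each sample size
kurtosis_by_size = {}
for n in (1, 2, 10, 30):
    # Each row is one sample of size n from a uniform(0, 1) population;
    # taking the row mean gives one sample mean per repetition.
    sample_means = rng.uniform(0.0, 1.0, size=(n_repeats, n)).mean(axis=1)
    kurtosis_by_size[n] = excess_kurtosis(sample_means)
    print(f"n={n:>2}  excess kurtosis of sample means: {kurtosis_by_size[n]:.3f}")
```

The printed values should move toward 0 as n grows, which is one way to see the sampling distribution losing its flat, uniform shape.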
Approximation of Normal distribution
The Central Limit Theorem describes a relationship between the sampling distribution of the mean and the distribution of the variable in the population. The population distribution may be skewed, yet the means of sufficiently large samples drawn from it will be approximately normally distributed. The following representation of the data makes this easier to interpret.
The above distribution is not normal and is skewed, with a long tail towards the right. If we assume the above data belongs to the population, then the means of samples drawn from it should follow the distribution given below.
As we see, the sample means look approximately normal once we take samples of a large enough size ‘n’.
If the population data is normal to begin with, the sample means are normally distributed even for a small sample size. What is surprising is that the sample means are approximately normal even when the population itself is not.
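The same idea can be checked numerically for a right-skewed population. This is a minimal sketch, assuming a synthetic exponential population and a sample size of 50; the helper `skewness` and all parameter values are illustrative assumptions. Skewness near 0 indicates a symmetric shape.

```python
import numpy as np

rng = np.random.default_rng(7)

def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution (illustrative helper)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Assumption: an exponential population stands in for right-skewed data.
population = rng.exponential(scale=1.0, size=200_000)

sample_size = 50
n_repeats = 5_000
# Draw n_repeats samples (with replacement) and take the mean of each one.
samples = rng.choice(population, size=(n_repeats, sample_size), replace=True)
sample_means = samples.mean(axis=1)

population_skew = skewness(population)
means_skew = skewness(sample_means)
print(f"skewness of population:   {population_skew:.3f}")
print(f"skewness of sample means: {means_skew:.3f}")
```

The sample means should show much less skew than the population, while their average stays close to the population mean.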
Statistical Significance
The significance that the Central Limit theorem holds is as follows-
- Statistical procedures such as building confidence intervals and hypothesis testing often assume that the data is normal. According to this theorem, the sampling distribution of the mean can be treated as approximately normal even if the population data is not normal.
- When we take many samples from the population and calculate the mean of each, the standard deviation of those means (the standard error) decreases as the sample size increases, so the sample means cluster more tightly around the population mean. That gives us an idea of the nature of the population data.
- The sample mean gives us an estimate of the population mean. When we construct a confidence interval, it gives us a range in which the population mean may lie.
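The points about spread and confidence intervals can be illustrated with a short simulation. The sketch below assumes an exponential population with standard deviation 2, a sample size of 100, and the usual normal-approximation 95% interval (mean ± 1.96 standard errors); all names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 2.0        # an exponential distribution with scale=2.0 has mean 2.0 and std 2.0
n = 100            # sample size
n_repeats = 10_000

# Standard deviation of many sample means versus the theoretical sigma / sqrt(n).
sample_means = rng.exponential(scale=sigma, size=(n_repeats, n)).mean(axis=1)
empirical_se = float(sample_means.std(ddof=1))
theoretical_se = sigma / np.sqrt(n)
print(f"empirical SE: {empirical_se:.4f}, theoretical SE: {theoretical_se:.4f}")

# Approximate 95% confidence interval for the population mean from ONE sample,
# relying on the CLT to treat the sample mean as approximately normal.
one_sample = rng.exponential(scale=sigma, size=n)
estimated_se = one_sample.std(ddof=1) / np.sqrt(n)
ci_low = one_sample.mean() - 1.96 * estimated_se
ci_high = one_sample.mean() + 1.96 * estimated_se
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

An interval built this way is expected to contain the true mean in roughly 95% of repeated samples, not in every single one.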
Applications of Central Limit Theorem
The Central Limit theorem is applied in many domains. Let us quickly discuss some of the real-world applications of such theorems.
- Election- When people in different regions vote for their candidates, a sample of voters can be used with the Central Limit Theorem to build a confidence interval. This gives us a range for the percentage of votes a particular candidate is likely to receive.
- Census- Central limit theorem is applied to different fields of the census to calculate different details of the population such as family income, amount of electricity consumed, salaries of individuals, etc.
Assumptions in Central Limit theorem
The Central Limit theorem holds certain assumptions which are given as follows.
- The sample must be drawn randomly from the population. This means the observations must be selected without any prior knowledge or preference, i.e., in a random manner.
- The observations drawn from the population must be independent of one another. In other words, there must not be any relationship between two observations in the sample data. If the data is taken randomly, this condition can be met without much effort.
- If the sample is drawn without replacement, it should be no larger than 10% of the population, so that the observations remain approximately independent.
- The size of the sample should be large when the population data is skewed or non-symmetric. When the population data is symmetric and close to normal, a small sample can be enough. As a rule of thumb, a sample size of 30 is often considered sufficient.
Case Study
Let us take data on heart disease patients which tells us if a patient has heart diseases or not. Our motive is to demonstrate the concept of the Central Limit theorem. We are not building any model here. So, we take any attribute and try to see whether the sample data is normal or not after the increase in size.
Let us import the libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now let us read the dataset-
df=pd.read_csv('heart.csv')
df.isnull().sum()
As we see there are no missing values in any column. So, we proceed with the analysis.
Let us take the 'chol' column, which contains the cholesterol level of each patient.
Cholesterol=df['chol']
Now let us see the distribution of the cholesterol data.
num_bins=100
plt.hist(Cholesterol,num_bins,color='green')
plt.xlabel('cholesterol')
plt.ylabel('Amount')
plt.title('Distribution of Cholesterol')
plt.axvline(x=Cholesterol.mean(),color='black')
As we see, the data is roughly bell-shaped, with a few values beyond 500 that are outliers. So, we will draw samples of size 30, 60 and 400 and see how the distribution of the sample means changes.
Let us create lists to store the means of random samples of size 30, 60 and 400.
array30 = []
array60 = []
array400 = []
n = 300
for i in range(n):
    # Draw a sample with replacement and store its mean
    array30.append(Cholesterol.sample(n=30, replace=True).mean())
    array60.append(Cholesterol.sample(n=60, replace=True).mean())
    array400.append(Cholesterol.sample(n=400, replace=True).mean())
Now let us create subplots to visualize the data-
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3,figsize=(12,6))
Subplot of n=30
ax1.hist(array30, bins=100, color='red')
ax1.set_xlabel('Cholesterol')
ax1.set_ylabel('Amount')
ax1.set_title ('When Sample Size = 30')
ax1.axvline(x=np.mean(array30),color='black')
Subplot of n=60
ax2.hist(array60, bins=100, color='yellow')
ax2.set_xlabel('Cholesterol')
ax2.set_ylabel('Amount')
ax2.set_title ('When Sample= 60')
ax2.axvline(x=np.mean(array60),color='r')
Subplot of n=400
ax3.hist(array400, bins=100)
ax3.set_xlabel('Cholesterol')
ax3.set_ylabel('Amount')
ax3.set_title ('When Sample= 400')
ax3.axvline(x=np.mean(array400),color='black')
After defining the subplots above, we get the following:
As we see, the distribution of the sample means looks more normal, and becomes narrower around the mean, when we increase the sample size from 30 to 400. This is consistent with the Central Limit Theorem, which says that increasing the sample size brings the distribution of the sample means closer to normal.
The Central Limit Theorem plays a crucial role in machine learning wherever inference relies on approximately normal sampling distributions. Besides, it is also important to study measures of central tendency such as the mean, median, and mode, and measures of spread such as the standard deviation. Confidence intervals and the shape of the distribution, such as skewness and kurtosis, are also important to look into before relying on the Central Limit Theorem. It applies to observations that are independent of each other; the data itself does not need to be normal.