- What is Distribution?
- What is Data?
- Measurement level of Data
- What does Data do? In What ways it matters most?
- Why are distributions important?
- Difference between Frequency and Probability Distribution
- Types of Distributions
- Python Libraries for Distributions
- Bernoulli Distribution
- Normal Distribution
- Binomial Distribution
- Poisson Distribution
- Uniform Distribution
- Gamma Distribution
- Exponential Distribution
- References
Contributed by: Venkat M
LinkedIn Profile: https://www.linkedin.com/in/venkat-murali-3753bab/
What is Distribution?
The distribution of a statistical dataset is the spread of the data: it shows all possible values or intervals of the data and how often they occur.
A distribution is simply a collection of data, or scores, on a variable. These scores are usually arranged in order, from smallest to largest, and can then be presented graphically.
A probability distribution provides a parameterized mathematical function that gives the probability of any individual observation in the sample space.
Before moving on to distributions, it helps to understand the term “data”, which is critical for any data analyst or data scientist.
To understand more about distribution in statistics, watch this complete video where Abhinand Sarkar will share some of his thoughts on distribution.
What is Data?
Data is a collection of information (numbers, words, measurements, observations) about facts, figures and statistics collected together for analysis.
Example: Distribution of Categorical Data (True/False, Yes/No): It shows the number or percentage of individuals in each group.
How to Visualize Categorical Data: Bar Plot, Pie Chart and Pareto Chart.
Distribution of Numerical Data (Height, Weight and Salary): The data is first sorted in ascending or descending order and grouped based on similarity. It is then represented in graphs and charts to examine the amount of variance in the data.
How to Visualize Numerical Data: Histogram, Line Plot and Scatter Plot.
Measurement level of Data
S.No | Qualitative | Quantitative |
1 | Nominal – Brand name, Zip code and Gender; Ordinal – Grades, Star Reviews | Ordinal – Position in Race and Date; Interval – Temperature in Celsius, Year of Birth; Ratio – Height, Age, Weight |
What does Data do? In What ways it matters most?
- Identifies the relationship between two variables
- Predicts and forecasts future values based on previous trends in the data
- Determines patterns that exist in the dataset
- Detects fraud and anomalies
Why are distributions important?
Sampling distributions are important in statistics because we usually cannot observe a whole population. Instead, we collect a sample and use it to estimate the parameters of the population distribution. Distributions are therefore necessary to make inferences about the overall population.
For example, the most common measures of how samples differ from each other are the standard deviation and the standard error of the mean.
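As a minimal sketch of these two measures, the snippet below computes the sample standard deviation and the standard error of the mean using only Python's standard library. The `heights` values and the `standard_error` helper are made up for illustration and are not from the original article.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of heights in cm (illustrative values only).
heights = [160, 165, 170, 175, 180]

# Sample standard deviation (uses n - 1 in the denominator).
sd = statistics.stdev(heights)
# Standard error of the mean for the same sample.
sem = standard_error(heights)

print(f"mean = {statistics.mean(heights)}, sd = {sd:.3f}, sem = {sem:.3f}")
```

The standard deviation describes the spread of individual observations, while the standard error describes how much the sample mean itself would vary from sample to sample.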
Difference between Frequency and Probability Distribution
S.No | Frequency Distribution | Probability Distribution |
1 | It records how often an event occurs. It is based on actual observations | It records the likelihood that an event will occur. It is based on a theoretical assumption of what should happen |
Frequency Distribution:
The number of times each numerical value occurs.
Probability Distribution:
The list of probabilities associated with each of its possible numerical values.
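The contrast can be sketched with a die. The observed `rolls` below are made-up data, and the fair-die assumption behind `probability` is an illustrative choice, not something stated in the article.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observed rolls of a six-sided die (made-up data).
rolls = [1, 3, 3, 6, 2, 3, 6, 1, 5, 3]

# Frequency distribution: how often each value was actually observed.
frequency = Counter(rolls)

# Relative frequencies: observed counts divided by the number of rolls.
relative = {face: frequency[face] / len(rolls) for face in range(1, 7)}

# Probability distribution: what theory says should happen for a fair die.
probability = {face: Fraction(1, 6) for face in range(1, 7)}

for face in range(1, 7):
    print(face, frequency[face], relative[face], float(probability[face]))
```

With only ten rolls the observed relative frequencies differ noticeably from the theoretical probabilities; they tend to move closer as the number of observations grows.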
Types of Distributions
- Bernoulli Distribution
- Uniform Distribution
- Binomial Distribution
- Normal Distribution
- Poisson Distribution
- Gamma Distribution
- Exponential Distribution
Python Libraries for Distributions
Bernoulli Distribution
The Bernoulli distribution is a special case of the binomial distribution with a single trial. It is a discrete probability distribution with exactly two possible outcomes: 1 (success) and 0 (failure).
Example: In cricket, tossing a coin leads to either winning or losing the toss. There is no intermediate result. The occurrence of a head denotes success, and the occurrence of a tail denotes failure.
As an illustration, take the probability of success (1) to be 0.4 and the probability of failure (0) to be 0.6.
Bernoulli Distribution in Python
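The original code for this section is not available here, so the following is a minimal sketch that uses only the standard library's `random` module. The helper `bernoulli_trial`, the seed and the sample size are assumptions made for illustration; libraries such as SciPy provide a ready-made `scipy.stats.bernoulli` that could be used instead.

```python
import random


def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0


p = 0.4                      # probability of success, as in the text above
rng = random.Random(42)      # fixed seed so the run is repeatable

# Simulate many independent single trials.
samples = [bernoulli_trial(p, rng) for _ in range(100_000)]

# The observed share of successes should be close to p.
observed_p = sum(samples) / len(samples)

# Theoretical mean is p and theoretical variance is p * (1 - p).
theoretical_variance = p * (1 - p)

print(f"observed p = {observed_p:.3f}, theoretical variance = {theoretical_variance:.2f}")
```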
Normal Distribution
It is also known as the Gaussian distribution. It is a continuous probability distribution that is symmetric about the mean, with the majority of observations clustering around the central peak.
It is a bell-shaped curve.
Examples: Performance appraisals, height, blood pressure, measurement error and IQ scores approximately follow a normal distribution.
Mean = Median = Mode
The standard normal distribution is a normal distribution with µ = 0 and σ = 1.
Basic Properties:
- The normal distribution always runs between –∞ and +∞
- Zero skewness; the distribution is symmetrical about the mean
- Zero excess kurtosis (a kurtosis of 3)
- About 68% of the values are within 1 SD of the mean
- About 95% of the values are within 2 SD of the mean
- About 99.7% of the values are within 3 SD of the mean
Normal Distribution in Python
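The original code for this section is not available here. As a small sketch, the 68/95/99.7 percentages listed above can be checked with `statistics.NormalDist` from the standard library. The helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k, dist=standard_normal):
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

Because the calculation is expressed in units of standard deviations, the same shares hold for any mean and standard deviation, for example `share_within(1, NormalDist(mu=100, sigma=15))`.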
Binomial Distribution
The binomial distribution is the most widely known discrete probability distribution. It has been used for hundreds of years.
Assumptions:
- The experiment involves n identical trials.
- Each trial has only two possible outcomes – success or failure.
- Each trial is independent of the previous trials.
- The terms p and q remain constant throughout the experiment, where p is the probability of getting a success on any one trial and q = (1 – p) is the probability of getting a failure on any one trial.
Binomial Distribution in Python
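The original code for this section is not available here. Under the assumptions listed above, the probability of exactly k successes in n trials is C(n, k) · p^k · q^(n − k). The sketch below evaluates that formula with the standard library; the function name `binomial_pmf` and the values n = 10, p = 0.5 are chosen only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure on any one trial
    return comb(n, k) * p**k * q**(n - k)


# Illustrative values: 10 tosses of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Expected number of successes, which should equal n * p.
mean = sum(k * prob for k, prob in enumerate(pmf))

print(f"P(X = 5) = {pmf[5]:.4f}, mean = {mean:.1f}")
```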
Poisson Distribution
It is the discrete probability distribution of the number of times an event is likely to occur within a specified period of time. It is used for independent events which occur at a constant rate within a given interval of time.
The occurrences in each interval can range from zero to infinity (0 to ∞).
Examples:
- The number of black cars in a random sample of 50 cars
- The number of cars arriving at a car wash during a 20-minute time interval
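For a Poisson variable with an average of λ events per interval, the probability of exactly k events is e^(−λ) · λ^k / k!. The sketch below applies this to the car wash example, assuming a hypothetical average of 4 arrivals per 20-minute interval; that rate and the function name `poisson_pmf` are not from the article.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam per interval."""
    return exp(-lam) * lam**k / factorial(k)


# Hypothetical average: 4 cars arrive at the car wash per 20-minute interval.
lam = 4

# Probability that no car arrives in a given interval.
p_none = poisson_pmf(0, lam)

# Probability that at most 2 cars arrive in a given interval.
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))

print(f"P(0 cars) = {p_none:.4f}, P(at most 2 cars) = {p_at_most_2:.4f}")
```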
Uniform Distribution
The continuous uniform distribution is also called the rectangular distribution. It describes an experiment where the outcome lies between certain boundaries and every value within those boundaries is equally likely.
Examples:
- The time to fly from Newark to Atlanta ranges from 120 to 150 minutes. If we monitor the flight time for many commercial flights, it will follow, more or less, a uniform distribution.
- The time students take to finish a one-hour test may range from 50 to 60 minutes. If roughly equal numbers of students finish in each equal-length interval within this range, the finishing time can be approximated by a uniform distribution.
- The time for pizza delivery from Nanganallur to Alandur may range uniformly from 20 to 30 minutes, measured from the time the delivery person leaves the Pizza Hut.
Uniform Distribution in Python
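The original code for this section is not available here. As a sketch, the flight example above (120 to 150 minutes) can be worked through with the standard library. The helper `uniform_cdf`, the 130-minute threshold, the seed and the sample size are assumptions made for illustration.

```python
import random


def uniform_cdf(x, a, b):
    """P(X <= x) for a continuous uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Flight time boundaries in minutes, taken from the example in the text.
a, b = 120, 150

# Exact probability that a flight takes 130 minutes or less.
p_under_130 = uniform_cdf(130, a, b)

# Check the same probability by simulation with a fixed seed.
rng = random.Random(0)
times = [rng.uniform(a, b) for _ in range(100_000)]
simulated = sum(t <= 130 for t in times) / len(times)

print(f"exact = {p_under_130:.4f}, simulated = {simulated:.4f}")
```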
Gamma Distribution
The gamma distribution deals with continuous, positive variables that take on a wide range of values, such as individual call times. It has two parameters: the first is the shape parameter (α) and the second is the scale parameter (β). With these, a gamma distribution function can model probabilities across any range of possible values.
Examples:
- The amount of rainfall accumulated in a reservoir.
- The size of loan defaults and the aggregate amount of insurance claims
- The flow of items through manufacturing and distribution processes
- The load on web servers
Gamma Distribution in Python
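The original code for this section is not available here. The sketch below draws samples with `random.gammavariate(alpha, beta)` from the standard library, which takes the shape and scale parameters in that order. The values α = 2 and β = 3, the seed and the sample size are hypothetical choices for illustration. For a gamma distribution the theoretical mean is α · β and the variance is α · β².

```python
import random

# Hypothetical shape and scale parameters, chosen only for illustration.
shape, scale = 2.0, 3.0

rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [rng.gammavariate(shape, scale) for _ in range(100_000)]

# Sample mean and (unbiased) sample variance of the simulated values.
n = len(samples)
sample_mean = sum(samples) / n
sample_var = sum((x - sample_mean) ** 2 for x in samples) / (n - 1)

# Theoretical values for comparison.
theoretical_mean = shape * scale
theoretical_var = shape * scale**2

print(f"mean: {sample_mean:.2f} vs {theoretical_mean}, variance: {sample_var:.2f} vs {theoretical_var}")
```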
Exponential Distribution
It is concerned with the amount of time until some specific event occurs.
Examples:
- The amount of time until an earthquake occurs has an exponential distribution
- The length of business telephone calls
- The length of time a car battery lasts
- The amount of money customers spend on one trip to the supermarket follows an exponential distribution. There are more people who spend small amounts of money and fewer people who spend large amounts of money.
The exponential distribution is widely used in the field of reliability.
Note: Reliability deals with the amount of time a product lasts.
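As a reliability-style sketch, suppose a car battery lasts 4 years on average; this figure is hypothetical and not from the article. For an exponential distribution with rate λ = 1 / mean, the probability that the event occurs within time t is 1 − e^(−λt). The helper `exponential_cdf`, the seed and the sample size below are likewise assumptions.

```python
import math
import random


def exponential_cdf(t, rate):
    """P(T <= t) for an exponential distribution with the given rate."""
    if t < 0:
        return 0.0
    return 1 - math.exp(-rate * t)


# Hypothetical average battery lifetime in years.
mean_lifetime = 4.0
rate = 1 / mean_lifetime  # events per year

# Probability that the battery fails within the first 2 years.
p_fail_within_2 = exponential_cdf(2, rate)

# Simulate lifetimes; random.expovariate takes the rate (1 / mean).
rng = random.Random(7)
lifetimes = [rng.expovariate(rate) for _ in range(100_000)]
simulated_mean = sum(lifetimes) / len(lifetimes)

print(f"P(fail within 2 years) = {p_fail_within_2:.4f}, simulated mean = {simulated_mean:.2f}")
```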