Understanding Box Plot - What does it mean?

A single box which gives you a visual idea about 5 components in a dataset. It is also known as box and whiskers plot or simply box plot. It is useful for describing measures of central tendencies and measures of dispersion in a dataset.

Contributed by: Avantika Shukla

Box Plot represents the following points in a dataset.

Minimum Value
First Quartile (Q1 or 25^th Percentile)
Second Quartile (Q2 or 50^th Percentile)
Third Quartile (Q3 or 75^th Percentile)
Maximum Value

Along with the above 5 components. Boxplot also gives us below information:

Outliers: Points lying beyond the minimum and maximum values are outliers
Interquartile range: It is Q3-Q1. It is the spread or range of the middle 50% of the data.
Whiskers: From Minimum Value to Q1 is the first 25% of data
From Q3 to Maximum value is the last 25% of the data

Let us understand box plot with an example. Suppose you have a dataset of runs scored by a batsman in his 12 matches. You arrange the dataset in descending order. Divide the dataset into 4 equal parts. Now how to find out 3 percentiles? 25^th percentile, 50^th percentile and 75^th percentile of the boxplot. Percentile is the number below which a given percentage falls or you can also apply the below formula to find out the percentile.

Position of the number, for given percentile (Pn) =Percentile(N+1)/100

N= No. of items in the dataset

If the above result comes in float (in decimals) then, take the mean of 2 numbers of Pn and P(n+1)

If the result comes in integer, then take the value of Pn

In our example N = 12

Position number for 25^th percentile= 25(12+1)/100= 3.25

The result is a decimal number, we will take mean of 3^rd and 4^th number

Position number for 50^th percentile= 50(12+1)/100= 6.5

The result is a decimal number, we will take mean of 6^th and 7^th number

Position number for 50^th percentile= 75(12+1)/100= 9.75

The result is a decimal number, we will take mean of 9^th and 10^th number

Let us visualize it:

Q1 or 25^th percentile = Number below which 25% is falling
Q2 or 50^th percentile = Number below which 50% is falling

Q3 or 75^th percentile = Number below which 75% is falling

Q1=17

Q2=19.5

Q3=24

Interquartile range (IQR)=Q3-Q1=24-17=7

Minimum Value= Q1-1.5 * IQR=17-1.5*7=6.5

Maximum Value= Q3 +1.5 * IQR=24+1.5*7=34.5

Outliers of the dataset= 5 & 64

Let us impose all the points of the boxplot on a number line:

Boxplot also tells us about the distribution and symmetry of the data. The above example shows the data is right skewed. Interquartile range shows us that the middle 50% of the data lies between 17 runs to 24 runs. Whiskers of the box plot cover approximately 99.65% of the data.

Also Read: Python Tutorial For Beginners – A Complete Guide | Learn Python Easily

Graphing a boxplot using Python

Read the data.

The code below will import all the necessary libraries and will read the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Read the dataset RunScored.xlsx
df=pd.read_excel('RunScored.xlsx')
#Read top 5 rows
df.head()

Let us check the dimension of the data.

#Check the dimensions of the data
df.shape

The above result tells us that there are 12 rows and 1 column. This dataset contains runs scored in 12 matches.

Now let us see how to make boxplot with seaborn library:

#Plot boxplot with seaborn
sns.boxplot(x = 'RunsScored',color = 'violet',data = df);

The above result shows the player A has a median of 19.5, 2 outliers are also present in data. The middle 50% spread is between 17 to 24.

Now, let us compare runs scored by 2 players (Player A and Player B) through box plot.

Read the data:

#Reading 2nd database with 2 players to compare their runs scored using boxplots
df=pd.read_excel('RunScored2player.xlsx')
#Reading top 5 rows of the dataset 
df.head()

Let us see the dimension of the dataset.

#Check the dimensions of the data
df.shape

The above result tells us that there are 12 rows and 2 columns. This dataset contains runs scored in 12 matches.

Let us plot the boxplot of runs scored by 2 players:

By the above plot we can deduce that the median of player B is greater than player A. IQR tells us the spread of Player B for middle 50% is higher. Even the Q3 and maximum value of player B is higher.

If you want to display the median on your boxplot you can make use of function box.annotate().

medians=[df.RunsScoredPlayerA.median(),df.RunsScoredPlayerB.median()]
 
box=sns.boxplot(data=df);
for i in range(2):
    box.annotate(str(medians[i]),xy=(i,medians[i]),horizontalalignment='center');

The above result shows the median of player A is 19.5 on the other hand median of player B is 25.

Similarly, we can also display Q1, Q3, Minimum value and Maximum value on the boxplot.

Q1=[df.RunsScoredPlayerA.quantile(.25),df.RunsScoredPlayerB.quantile(.25)]
Q3=[df.RunsScoredPlayerA.quantile(.75),df.RunsScoredPlayerB.quantile(.75)]
IQR=[(Q3[0]-Q1[0]),(Q3[1]-Q1[1])]
Min_value=[(Q1[0]-1.5*IQR[0]),(Q1[1]-1.5*IQR[1])]
Max_value=[(Q3[0]+1.5*IQR[0]),(Q3[1]+1.5*IQR[1])]

box=sns.boxplot(data=df);
for i in range(2):
    box.annotate(str(medians[i]),xy=(i,medians[i]),horizontalalignment='center');
    box.annotate(str(Q1[i]),xy=(i,Q1[i]),horizontalalignment='center');
    box.annotate(str(Q3[i]),xy=(i,Q3[i]),horizontalalignment='center');
    box.annotate(str(Min_value[i]),xy=(i,Min_value[i]),horizontalalignment='center');
    box.annotate(str(Max_value[i]),xy=(i,Max_value[i]),horizontalalignment='center');

Displaying values on the boxplot gives a more clear idea.

This brings us to the end of the blog on Understanding boxplot. Hope this helps you gain a better understanding of the same. If you wish to learn more such concepts, head over to Great Learning Academy and choose from a plethora of free online courses.