Statistics is an important field that forms a strong base for learning data science and working with large volumes of data. Are you looking to build a career in this field? These Statistics Interview Questions will help you prepare for jobs in data science and machine learning by refreshing your memory of key aspects of statistics as well as probability. To gain clarity on the fundamentals, you can enroll in a statistics for data science course with a certificate.
Basic Statistics Interview Questions
Ready to kickstart your Statistics career? This section is curated to help you understand the basics and has a list of basic statistics interview questions. Let’s get started.
1. What is the Central Limit Theorem?
The Central Limit Theorem is a cornerstone of statistics. It states that if you draw sufficiently large samples from a population, the distribution of the sample means will be approximately normal, regardless of the shape of the original population distribution.
Central Limit Theorem is widely used in the calculation of confidence intervals and hypothesis testing. Here is an example – We want to calculate the average height of people in the world, and we take some samples from the general population, which serves as the data set. Since it is hard or impossible to obtain data regarding the height of every person in the world, we will simply calculate the mean of our sample.
By repeating this sampling many times, we obtain many sample means and their frequencies, which we can plot on a graph to create a normal distribution. It will form a bell-shaped curve centred close to the mean of the original data set.
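To make this concrete, here is a minimal Python sketch (not part of the original example; it assumes NumPy is installed and uses a made-up, skewed population) that repeatedly samples and shows the sample means clustering around the population mean:

```python
import numpy as np

# A population that is clearly NOT normal: exponential, strongly right-skewed
rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)

# Repeatedly draw samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print("Population mean:", round(population.mean(), 3))
print("Mean of sample means:", round(np.mean(sample_means), 3))   # close to the population mean
print("Std of sample means:", round(np.std(sample_means), 3))     # roughly population std / sqrt(50)
# Plotting a histogram of sample_means would show the bell-shaped curve described above.
```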
2. What is the assumption of normality?
The assumption of normality states that the sampling distribution of the mean is normal. This is assumed to hold across independent samples as well.
3. Describe Hypothesis Testing. How is the statistical significance of an insight assessed?
Hypothesis testing in statistics is used to see whether an experiment yields meaningful results. It helps assess the statistical significance of an insight by determining the probability that the observed results occurred by chance. The first step is to state the null hypothesis. The p-value is then calculated: the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. The alpha value denotes the significance level and is chosen in advance (commonly 0.05).
If the p-value is less than alpha, the null hypothesis is rejected; if it is greater than alpha, we fail to reject the null hypothesis. Rejecting the null hypothesis indicates that the results obtained are statistically significant.
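As a quick illustration, the following sketch (assuming SciPy is available, with made-up sample data) runs a one-sample t-test and compares the p-value against alpha:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=172, scale=8, size=40)   # hypothetical heights in cm

alpha = 0.05                                     # chosen significance level
# H0: the population mean height is 170 cm
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)

if p_value < alpha:
    print(f"p = {p_value:.3f} < alpha: reject the null hypothesis (statistically significant)")
else:
    print(f"p = {p_value:.3f} >= alpha: fail to reject the null hypothesis")
```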
4. What are observational and experimental data in statistics?
Observational data is derived from observational studies, in which variables are simply observed (not controlled) to determine whether any correlation exists between them.
Experimental data is derived from experimental studies, in which certain variables are held constant or deliberately manipulated to determine causality.
5. What is an outlier?
Outliers can be defined as data points within a data set that differ markedly from the other observations. Depending on its cause, an outlier can decrease the accuracy as well as the efficiency of a model, so outliers are often removed from the data set.
6. How to screen for outliers in a data set?
There are many ways to screen and identify potential outliers in a data set. Two key methods are described below –
- Standard deviation/z-score – The z-score, or standard score, measures how many standard deviations a data point in a (roughly) normal distribution lies from the mean; data points whose z-scores fall outside the ±3 range are flagged. The z-score is measured from the mean. If the z-score is positive, the data point is above average.
If the z-score is negative, the data point is below average.
If the z-score is close to zero, the data point is close to average.
If the z-score is above +3 or below −3, the data point is considered unusual and treated as an outlier.
The formula for calculating a z-score is –
z = (data point − mean) / standard deviation, or z = (x − μ) / σ
- Interquartile range (IQR) – IQR, also called the midspread, is the range of values spanning the middle 50% of a data set; it is the difference between the third quartile (Q3) and the first quartile (Q1). Data points that fall far outside this range (commonly more than 1.5 × IQR below Q1 or above Q3) are flagged as outliers.
IQR=Q3 – Q1
Other methods to screen for outliers include Isolation Forests, Robust Random Cut Forests, and DBSCAN clustering.
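For illustration, here is a small Python sketch (assuming NumPy, with synthetic data and two injected outliers) applying both the z-score and IQR screens described above:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.append(rng.normal(loc=50, scale=5, size=200), [95, 12])  # two injected outliers

# Z-score screen: flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR screen: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```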
7. What is the meaning of an inlier?
An inlier is a data point within a data set that lies at the same general level as the others, yet is actually an error; it is removed to improve model accuracy. Unlike outliers, inliers are hard to find and often require external data for accurate identification.
8. What is the meaning of six sigma in statistics?
Six sigma in statistics is a quality control method used to produce an error- or defect-free data set. Standard deviation is known as sigma, or σ. The larger the standard deviation, the less accurately the process performs and the more likely it is to produce defects. If a process outcome is 99.99966% error-free, it is considered six sigma. A six sigma process performs better than 1σ, 2σ, 3σ, 4σ, and 5σ processes and is reliable enough to produce defect-free work.
9. What is the meaning of KPI in statistics?
KPI is an acronym for a key performance indicator. It can be defined as a quantifiable measure to understand whether the goal is being achieved or not. KPI is a reliable metric to measure the performance level of an organization or individual with respect to the objectives. An example of KPI in an organization is the expense ratio.
10. What is the Pareto principle?
Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or results in an experiment come from 20% of the causes. A simple example is that 80% of sales come from 20% of customers.
11. What is the Law of Large Numbers in statistics?
According to the Law of Large Numbers, as the number of trials in an experiment increases, the average of the results gets closer to the expected value. For example, if we roll a six-sided die only three times, the average we obtain may be far from the expected value; if we roll the die a large number of times, the average result will be close to the expected value (which is 3.5 in this case).
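A quick simulation makes this visible. The sketch below (NumPy assumed, die rolls generated at random) shows the running average approaching 3.5 as the number of rolls grows:

```python
import numpy as np

rng = np.random.default_rng(7)

for n_rolls in (3, 30, 300, 30_000):
    rolls = rng.integers(1, 7, size=n_rolls)   # fair six-sided die: values 1..6
    print(f"{n_rolls:>6} rolls -> average {rolls.mean():.3f} (expected value 3.5)")
```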
12. What are some of the properties of a normal distribution?
Also known as the Gaussian distribution, the normal distribution refers to data that is symmetric about the mean, with data far from the mean occurring less frequently. In graphical form it appears as a bell-shaped curve that is symmetrical about the mean.
The properties of a normal distribution are –
- Symmetrical – The distribution is symmetric about the mean; its shape is determined by its parameter values (the mean and standard deviation).
- Unimodal – Has only one mode.
- Mean – the measure of central tendency
- Central tendency – the mean, median, and mode lie at the centre, which means that they are all equal, and the curve is perfectly symmetrical at the midpoint.
13. How would you describe a ‘p-value’?
The p-value in statistics is calculated during hypothesis testing, and it is a number that indicates the likelihood of obtaining the observed data by random chance. For example, if the p-value is 0.05 and less than alpha, we can conclude that there is a 5% probability that the experiment results occurred by chance; in other words, 5% of the time we would observe these results purely by chance.
14. How can you calculate the p-value using MS Excel?
The formula used in MS Excel to calculate p-value is –
=TDIST(x, deg_freedom, tails)
The p-value is expressed in decimals in Excel. Here are the steps to calculate it –
- Go to the Data tab
- In the Analysis group, click on the Data Analysis icon
- Select Descriptive Statistics and then click OK
- Select the relevant column
- Input the confidence level and other variables
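Outside Excel, the same TDIST-style p-value can be cross-checked in Python. This is only a sketch and assumes SciPy is installed; the t statistic and degrees of freedom below are made-up numbers:

```python
from scipy import stats

t_stat = 2.1   # the observed t statistic (x in Excel's TDIST)
df = 24        # degrees of freedom

one_tailed = stats.t.sf(t_stat, df)       # comparable to =TDIST(2.1, 24, 1)
two_tailed = 2 * stats.t.sf(t_stat, df)   # comparable to =TDIST(2.1, 24, 2)

print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```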
15. What are the types of biases that you can encounter while sampling?
Sampling bias occurs when a sample does not fairly represent the population during an investigation or a survey. The six main types of bias that one can encounter while sampling are –
- Undercoverage bias
- Observer Bias
- Survivorship bias
- Self-Selection/Voluntary Response Bias
- Recall Bias
- Exclusion Bias
Intermediate Statistics Interview Questions
Planning to switch to a career where you need Statistics? This section will help you prepare well for the upcoming interview. It has a compiled list of intermediate statistics interview questions that are commonly asked during the interview process.
16. What is cherry-picking, P-hacking, and significance chasing?
Cherry-picking can be defined as the practice in statistics where only that information is selected which supports a certain claim and ignores any other claim that refutes the desired conclusion.
P-hacking refers to a technique in which data collection or analysis is manipulated until significant patterns are found that have no real underlying effect.
Significance chasing is also known by the names of Data Dredging, Data Fishing, or Data Snooping. It refers to the reporting of insignificant results as if they are almost significant.
17. What is the difference between type I vs type II errors?
A type 1 error occurs when the null hypothesis is rejected even though it is true. It is also known as a false positive.
A type 2 error occurs when the null hypothesis fails to get rejected, even if it is false. It is also known as a false negative.
18. What is a statistical interaction?
A statistical interaction occurs when the effect of one input variable on the output depends on the level of another input variable. A real-life example is adding sugar to tea and stirring: neither adding sugar alone nor stirring alone produces the sweetness you taste, but the combination of the two does.
19. Give an example of a data set with a non-Gaussian distribution?
A non-Gaussian distribution is a common occurrence in many statistical processes. It arises when the data naturally follows a non-normal distribution, with values clumped toward one side of the graph. For example, bacterial growth naturally follows non-Gaussian distributions such as the exponential or Weibull distribution.
20. What is the Binomial Distribution Formula?
The binomial distribution formula is:
b(x; n, P) = nCx * P^x * (1 – P)^(n – x)
Where:
b = binomial probability
x = total number of “successes” (pass or fail, heads or tails, etc.)
P = probability of success on an individual trial
n = number of trials
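In code, the same formula can be evaluated directly. The sketch below uses only the Python standard library; the coin-toss numbers are illustrative:

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example: probability of exactly 3 heads in 5 tosses of a fair coin
print(binomial_pmf(3, 5, 0.5))   # 0.3125
```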
21. What are the criteria that Binomial distributions must meet?
Here are the three main criteria that Binomial distributions must meet –
- The number of observation trials must be fixed, meaning the probability can only be computed for a set number of trials.
- Each trial needs to be independent. It means that none of the trials should impact the probability of other trials.
- The probability of success remains the same across all trials.
22. What is linear regression?
In statistics, linear regression is an approach that models the relationship between one or more explanatory (predictor) variables and an outcome variable. For example, linear regression can be used to quantify the relationship between predictor variables such as age, gender, genetics, and diet and an outcome variable such as height.
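As an illustration, the following sketch fits a linear regression with scikit-learn; the predictor and height values are entirely made up, and the library choice is an assumption rather than something prescribed by the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: predict height (cm) from age (years) and daily calorie intake
X = np.array([[10, 1800], [12, 2000], [14, 2200], [16, 2500], [18, 2700]])
y = np.array([140, 150, 160, 170, 176])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)        # estimated effect of each predictor
print("Intercept:", model.intercept_)
print("Prediction for age 15, 2300 kcal:", model.predict([[15, 2300]]))
```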
23. What are the assumptions required for linear regression?
Four major assumptions for linear regression are as under –
- There’s a linear relationship between the predictor (independent) variables and the outcome (dependent) variable. It means that the relationship between X and the mean of Y is linear.
- The errors are normally distributed and uncorrelated with one another. Correlation between the errors is known as autocorrelation, and it is assumed to be absent.
- There is no correlation between the predictor variables. Correlation among predictors is called multicollinearity, and it is assumed to be absent.
- The variation in the outcome or response variable is the same for all values of independent or predictor variables. This phenomenon of assumption of equal variance is known as homoscedasticity.
24. What are some of the low and high-bias Machine Learning algorithms?
Some of the widely used low and high-bias Machine Learning algorithms are –
Low bias – Decision Trees, Support Vector Machines, k-Nearest Neighbors, etc.
High bias – Linear Regression, Logistic Regression, Linear Discriminant Analysis, etc.
Check out the free course on Statistical Methods For Decision Making.
25. When should you use a t-test vs a z-test?
The z-test is used for hypothesis testing in statistics when the data follows a normal distribution and the population variance is known, which typically requires a large sample.
The t-test is used with a t-distribution when the population variance is unknown and must be estimated from a small sample.
As a rule of thumb, when the sample size is large (n > 30), a z-test is used; t-tests are helpful when the sample size is small (n < 30).
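The choice can be illustrated in code. This sketch assumes SciPy and statsmodels are available and uses simulated samples; it is one reasonable way to run the two tests, not the only one:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(3)
small_sample = rng.normal(loc=100, scale=15, size=20)    # n < 30 -> t-test
large_sample = rng.normal(loc=100, scale=15, size=200)   # n > 30 -> z-test

t_stat, t_p = stats.ttest_1samp(small_sample, popmean=105)
z_stat, z_p = ztest(large_sample, value=105)

print(f"t-test (n=20):  t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"z-test (n=200): z = {z_stat:.2f}, p = {z_p:.3f}")
```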
26. What is the equation for confidence intervals for means vs for proportions?
To calculate the confidence interval for a mean, we use the following equations –
For n > 30 (use the Z table for the standard normal distribution):
CI = x̄ ± z* × (σ / √n)
For n < 30 (use the t table with df = n − 1):
CI = x̄ ± t* × (s / √n)
Confidence interval for the population proportion –
CI = p̂ ± z* × √(p̂(1 − p̂) / n)
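A short Python sketch of both intervals (SciPy assumed; the sample and the 42-out-of-100 proportion are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(loc=170, scale=10, size=25)   # small sample -> t interval

# 95% confidence interval for the mean (t distribution, df = n - 1)
mean_ci = stats.t.interval(0.95, df=len(sample) - 1,
                           loc=sample.mean(), scale=stats.sem(sample))

# 95% confidence interval for a proportion (normal approximation)
successes, n = 42, 100
p_hat = successes / n
z_star = stats.norm.ppf(0.975)                    # critical value for 95%
margin = z_star * np.sqrt(p_hat * (1 - p_hat) / n)

print("CI for the mean:", mean_ci)
print("CI for the proportion:", (p_hat - margin, p_hat + margin))
```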
27. What is the empirical rule?
In statistics, the empirical rule states that virtually all of the data in a normal distribution lies within three standard deviations of the mean. It is also known as the 68–95–99.7 rule: 68% of values fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations of the mean.
28. How are confidence tests and hypothesis tests similar? How are they different?
Confidence tests and hypothesis tests both form the foundation of statistics.
The confidence interval holds importance in research to offer a strong base for research estimations, especially in medical research. The confidence interval provides a range of values that helps in capturing the unknown parameter.
Hypothesis testing is used to test an experiment or observation and determine whether the results could have occurred purely by chance.
Confidence intervals and hypothesis tests are both inferential techniques that use a sample of data to either estimate a parameter or test the validity of a hypothesis. While a confidence interval provides a range of values that quantifies the precision of the parameter estimate, hypothesis testing tells us how confident we can be in accurately drawing conclusions about a parameter from a sample. The two can be used together to infer population parameters.
If 0 is included in the confidence interval, it indicates that there is no detectable difference between the sample and the population. Similarly, if hypothesis testing yields a p-value higher than alpha, we fail to reject the null hypothesis.
29. What general conditions must be satisfied for the central limit theorem to hold?
Here are the conditions that must be satisfied for the central limit theorem to hold –
- The data must follow the randomization condition which means that it must be sampled randomly.
- The Independence Assumptions dictate that the sample values must be independent of each other.
- Sample sizes must be sufficiently large, equal to or greater than 30 as a rule of thumb, for the CLT approximation to hold accurately.
30. What is Random Sampling? Give some examples of some random sampling techniques.
Random sampling is a sampling method in which each sample has an equal probability of being chosen as a sample. It is also known as probability sampling.
Let us check four main types of random sampling techniques –
- Simple Random Sampling technique – In this technique, a sample is chosen using randomly generated numbers. A sampling frame listing the members of the population is required; a random number is then generated for each element (for example, using Excel), and the elements with the selected numbers form the sample.
- Systematic Random Sampling technique – This technique is very common and easy to use in statistics. Every k-th element of the population is sampled: one element is chosen as a starting point, and further elements are then selected at a fixed interval.
In a sampling frame, divide the size of the frame N by the sample size n to get the interval k. Then pick every k-th element to create your sample.
- Cluster Random Sampling technique -In this technique, the population is divided into clusters or groups in such a way that each cluster represents the population. After that, you can randomly select clusters to sample.
- Stratified Random Sampling technique – In this technique, the population is divided into groups that have similar characteristics. Then a random sample can be taken from each group to ensure that different segments are represented equally within a population.
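For illustration, the sketch below (assuming NumPy and pandas, with a made-up population of 1,000 people) shows simple, systematic, and stratified sampling; cluster sampling is omitted for brevity:

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1,000 people, each labelled with an age group
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "person_id": range(1000),
    "age_group": rng.choice(["18-30", "31-50", "51+"], size=1000),
})

# Simple random sampling: every member has the same chance of selection
simple = population.sample(n=100, random_state=1)

# Systematic sampling: pick every k-th member, where k = N / n
k = len(population) // 100
systematic = population.iloc[::k]

# Stratified sampling: draw the same fraction from every age group
stratified = population.groupby("age_group", group_keys=False).sample(frac=0.1, random_state=1)

print(len(simple), len(systematic), len(stratified))
```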
31. What is the difference between population and sample in inferential statistics?
A population in inferential statistics is the entire group about which we want to draw conclusions. A sample, on the other hand, is the specific group we actually collect data from, and this data is used to calculate the sample statistics. The sample size is always smaller than the size of the population.
32. What are descriptive statistics?
Descriptive statistics are used to summarize the basic characteristics of a data set in a study or experiment. It has three main types –
- Distribution – refers to the frequencies of responses.
- Central Tendency – gives a measure or the average of each response.
- Variability – shows the dispersion of a data set.
33. What are quantitative data and qualitative data?
Qualitative data describes the characteristics or categories of the data and is also known as categorical data, for example, the type or kind of something. Quantitative data is a measure of numerical values or counts, for example, how much or how often something occurs. It is also known as numeric data.
34. How to calculate range and interquartile range?
The range is the difference between the highest and the lowest values, whereas the interquartile range is the difference between the upper and lower quartiles.
Range (X) = Max(X) – Min(X)
IQR = Q3 – Q1
Here, Q3 is the third quartile (75th percentile) and Q1 is the first quartile (25th percentile).
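In Python, both statistics can be computed in a couple of lines (NumPy assumed; the data values are illustrative):

```python
import numpy as np

x = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

data_range = x.max() - x.min()                 # 21 - 3 = 18
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print("Range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```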
35. What is the meaning of standard deviation?
Standard deviation is a measure of the dispersion of values in a data set. It represents how far each observation or data point lies from the mean.
σ = √( ∑(x − µ)² / n )
The variance is the square of the standard deviation.
36. What is the relationship between mean and median in normal distribution?
In a normal distribution, the mean and the median are equal.
37. What is the left-skewed distribution and the right-skewed distribution?
In the left-skewed distribution, the left tail is longer than the right side. It is also known as a negatively skewed distribution.
Mean < median < mode
In the right-skewed distribution, the right tail is longer. It is also known as positive-skew distribution.
Mode < median < mean
38. How to convert normal distribution to standard normal distribution?
Any point (x) from the normal distribution can be converted into standard normal distribution (Z) using this formula –
Z(standardized) = (x-µ) / σ
Here, Z for any particular x value indicates how many standard deviations x is away from the mean of all values of x.
39. What can you do with an outlier?
Outliers can affect A/B testing, and they can be either removed or kept depending on what the situation or the data set requires.
Here are some ways to deal with outliers in data –
- Filter out outliers especially when we have loads of data.
- If a data point is clearly erroneous, it is best to remove it.
- Alternatively, two options can be provided – one with outliers and one without.
- During post-test analysis, outliers can be removed or modified. The best way to modify them is to trim the data set.
- If there are a lot of outliers and the results are critical, it is best to replace the outlier values with values that are representative of the rest of the data set.
- When outliers have meaning, they can be considered, especially in the case of mild outliers.
40. How to detect outliers?
The best way to detect outliers is through graphical means. Apart from that, outliers can also be detected through the use of statistical methods using tools such as Excel, Python, SAS, among others. The most popular graphical ways to detect outliers include box plot and scatter plot.
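As an example of the graphical approach, the sketch below (matplotlib and NumPy assumed, synthetic data with two injected outliers) draws a box plot in which the outliers show up as points beyond the whiskers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = np.append(rng.normal(loc=60, scale=8, size=150), [110, 5])  # two injected outliers

plt.boxplot(data)                                   # outliers appear beyond the whiskers
plt.scatter(np.ones_like(data), data, alpha=0.3)    # raw points overlaid for comparison
plt.title("Box plot: points beyond the whiskers are potential outliers")
plt.show()
```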
41. Why do we need sample statistics?
Sampling in statistics is done when population parameters are not known, especially when the population size is too large.
42. What is the relationship between standard error and margin of error?
Margin of error = Critical value X Standard deviation for the population
and
Margin of error = Critical value X Standard error of the sample.
The margin of error will increase with the standard error.
43. What is the proportion of confidence intervals that will not contain the population parameter?
Alpha is the probability in a confidence interval that will not contain the population parameter.
α = 1 – CL
Alpha is usually expressed as a proportion. For instance, if the confidence level is 95%, then alpha would be equal to 1-0.95 or 0.05.
44. What is skewness?
Skewness measures the asymmetry of a distribution. If a distribution is not normal or is asymmetrical, it is skewed. A distribution exhibits positive skewness if its right tail is longer and negative skewness if its left tail is longer.
45. What is the meaning of covariance?
In statistics, covariance is a measure of how two random variables vary together, that is, the extent to which they deviate from their respective means in the same direction.
46. What is a confounding variable?
A confounding variable in statistics is an ‘extra’ or ‘third’ variable that is associated with both the dependent variable and the independent variable, and it can distort the estimated relationship and lead to misleading results.
For example, if we are studying the effect of lack of exercise on weight gain, lack of exercise is the independent variable and weight gain is the dependent variable. The amount of food consumed can be a confounding variable, as it can mask or distort the effect of exercise in the study. Weather can be another confounding variable that may alter the design of the experiment.
47. What does it mean if a model is heteroscedastic?
A model is said to be heteroscedastic when the variation in errors comes out to be inconsistent. It often occurs in two forms – conditional and unconditional.
48. What is selection bias and why is it important?
Selection bias is a term in statistics used to denote the situation in which the individuals or groups selected for a study differ from the population of interest in a way that introduces systematic error into the outcome.
Typically selection bias can be identified using bivariate tests apart from using other methods of multiple regression such as logistic regression.
It is crucial to understand and identify selection bias to avoid skewing results in a study. Selection bias can lead to false insights about a particular population group in a study.
Different types of selection bias include –
- Sampling bias – It is often caused by non-random sampling. The best way to overcome this is by drawing from a sample that is not self-selecting.
- Participant attrition – The dropout rate of participants from a study constitutes participant attrition. It can be avoided by following up with the participants who dropped off to determine if the attrition is due to the presence of a common factor between participants or something else.
- Exposure – It occurs due to the incorrect assessment or the lack of internal validity between exposure and effect in a population.
- Data – It includes dredging of data and cherry-picking and occurs when a large number of variables are present in the data causing even bogus results to appear significant.
- Time-interval – It is a sampling error that occurs when observations are selected from a certain time period only. For example, analyzing sales during the Christmas season.
- Observer selection- It is a kind of discrepancy or detection bias that occurs during the observation of a process and dictates that for the data to be observable, it must be compatible with the life that observes it.
49. What does autocorrelation mean?
Autocorrelation is the degree of correlation between the values of the same variable at different points in a given time series. It means the data is correlated in such a way that future outcomes are linked to past outcomes. Autocorrelation can make a model less accurate because even the errors follow a sequential pattern.
50. What does Design of Experiments mean?
The Design of Experiments or DOE is a systematic method that explains the relationship between the factors affecting a process and its output. It is used to infer and predict an outcome by changing the input variables.
51. What is Bessel’s correction?
Bessel’s correction is the use of n − 1 instead of n in the formula for the sample variance and sample standard deviation. It corrects the bias that arises when a sample is analyzed to draw more general conclusions about the population.
52. What types of variables are used for Pearson’s correlation coefficient?
Variables (both the dependent and independent variables) used for Pearson’s correlation coefficient must be quantitative. It will only test for the linear relationship between two variables.
53. What is the use of Hash tables in statistics?
In statistics, hash tables are used to store key-value pairs in a structured way. A hash table uses a hash function to compute an index into an array of slots in which the desired elements can be found.
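In Python, the built-in dict is a hash table, so a minimal illustration needs no extra libraries:

```python
# Python's dict is a hash table: each key is hashed to locate its slot,
# giving average O(1) insertion and lookup of key-value pairs.
frequencies = {}
for value in [3, 7, 3, 9, 7, 3]:
    frequencies[value] = frequencies.get(value, 0) + 1

print(frequencies)        # {3: 3, 7: 2, 9: 1}
print(frequencies[7])     # constant-time lookup by key
```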
54. Does symmetric distribution need to be unimodal?
A symmetric distribution does not necessarily need to be unimodal; it can be bimodal with two peaks or multimodal with multiple peaks.
55. What is the benefit of using box plots?
A box plot is a visually effective summary of one or more data sets based on their quartiles, and it facilitates quicker comparison between groups than a set of histograms would.
56. What is the meaning of TF/IDF vectorization?
TF/IDF is an acronym for Term Frequency – Inverse Document Frequency, a numerical measure widely used in text mining and summarization. It reflects how important a word or term is to a document relative to a collection of documents, which is called a corpus.
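A short example using scikit-learn's TfidfVectorizer (the corpus is made up, and the library choice is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))            # higher weight = more important term in that document
```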
57. What is the meaning of sensitivity in statistics?
Sensitivity measures how well a classifier identifies actual positive events; it is also known as recall or the true positive rate. It can be calculated using the formula –
Sensitivity = True Positives / (True Positives + False Negatives)
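In practice, this is the same quantity scikit-learn reports as recall. A tiny sketch with made-up labels:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 1]   # actual events (1 = positive)
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]   # classifier predictions

sensitivity = recall_score(y_true, y_pred)   # TP / (TP + FN)
print(sensitivity)   # 4 of the 5 actual positives were caught -> 0.8
```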
58. What is the difference between the first quartile, the second quartile, and the third quartile?
The first quartile is denoted by Q1 and it is the median of the lower half of the data set.
The second quartile is denoted by Q2 and is the median of the data set.
The third quartile is denoted by Q3 and is the median of the upper half of the data set.
About 25% of the data set lies above Q3, 75% lies below Q3 and 50% lies below Q2. The Q1, Q2, and Q3 are the 25th, 50th, and 75th percentile respectively.
59. What is kurtosis?
Kurtosis measures how heavy the tails of a distribution are compared with a normal distribution, that is, the degree of extreme values present in the tails relative to the peak of the frequency distribution. The standard normal distribution has a kurtosis of 3 (an excess kurtosis of 0), and excess kurtosis values between −2 and +2 are generally considered acceptable. Data sets with a high level of kurtosis have heavy tails, which implies the presence of outliers; one needs to add data or remove outliers to overcome this problem. Data sets with low kurtosis levels have light tails and lack outliers.
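A quick check with SciPy (synthetic data; fisher=True, the default, reports excess kurtosis, which is 0 for a normal distribution):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(4)
normal_data = rng.normal(size=10_000)
heavy_tailed = rng.standard_t(df=3, size=10_000)   # heavier tails than a normal

print("Normal excess kurtosis:      ", round(kurtosis(normal_data), 2))   # ~0
print("Heavy-tailed excess kurtosis:", round(kurtosis(heavy_tailed), 2))  # clearly > 0
```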
60. What is a bell-curve distribution?
A bell-curve distribution is represented by the shape of a bell and indicates normal distribution. It occurs naturally in many situations especially while analyzing financial data. The top of the curve shows the mode, mean and median of the data and is perfectly symmetrical. The key characteristics of a bell-shaped curve are –
- The empirical rule says that approximately 68% of data lies within one standard deviation of the mean in either of the directions.
- Around 95% of data falls within two standard deviations and
- Around 99.7% of data fall within three standard deviations in either direction.
Statistics FAQs
To prepare for a statistics interview, you can read this blog on the top commonly asked interview questions. These questions will help you brush up your skills and ace your upcoming interview.
Estimation (bias, maximum likelihood, method of moments, the Rao-Blackwell theorem, Fisher information), the Central Limit Theorem, hypothesis testing, likelihood ratio tests, and the Law of Large Numbers are some of the most important topics in statistics.
Statistics itself is a collection of methods to display, analyze, and draw conclusions from data. It can be of two types: descriptive statistics and inferential statistics.
The general steps of hypothesis testing are:
1. State the null hypothesis
2. State the alternate hypothesis
3. Decide which test and test statistic to use
4. Collect Data
5. Calculate the test statistic
6. Construct Acceptance / Rejection regions
7. Based on steps 5 and 6, draw a conclusion about H0
These Statistics interview questions cover the basic ground of Statistics and make it easier for students and professionals to clarify their fundamentals of the subject. If you notice gaps in your knowledge, you can easily fill them by signing up for free courses that target those specific areas. These courses allow you to brush up on weaker areas before your interview, making sure you can confidently tackle any question thrown at you.