What is Outlier Detection?


Contributed by: Saurabh Gupta

A principal task of a data scientist is to apply models to data, and at some point you will inevitably encounter a dataset that contains outliers. Outliers are data points or observations that fall outside of an expected distribution or pattern. For example, if we approximate the data with a Poisson distribution, the outliers are the observations that do not appear to follow a Poisson pattern.
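
To make this concrete, here is a minimal sketch assuming count data and a fitted Poisson model; the toy counts and the 0.001 tail-probability cutoff are illustrative choices, not standards:

```python
# Flag counts that are improbable under a Poisson model fitted to the data.
import numpy as np
from scipy.stats import poisson

counts = np.array([3, 5, 4, 6, 2, 5, 4, 30, 3, 5])  # toy daily event counts
lam = counts.mean()  # maximum-likelihood estimate of the Poisson rate

# Two-sided tail probability of seeing a value at least this extreme
tail = np.minimum(poisson.cdf(counts, lam), poisson.sf(counts - 1, lam))
print(counts[tail < 0.001])  # 30 falls far outside the fitted Poisson pattern
```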

The same idea applies when linear regression is the chosen model and the residual plot shows that a small number of observations sit far away from the bulk of the points.
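
A small sketch of that residual check, using a synthetic line with one injected anomaly; the 3-standard-deviation cutoff is a common rule of thumb, not a fixed rule:

```python
# Fit a least-squares line and flag points with unusually large residuals.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)
y[40] += 15  # inject one anomalous point

slope, intercept = np.polyfit(x, y, 1)  # ordinary least squares
residuals = y - (slope * x + intercept)
mask = np.abs(residuals) > 3 * residuals.std()
print(np.where(mask)[0])  # index 40 should be flagged
```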

Outlier detection is usually performed during the Exploratory Data Analysis stage of a data science project, and how we decide to deal with outliers determines how well or poorly the model performs on the business problem at hand. The model, and hence the entire workflow, is greatly affected by their presence.

They can be critical in analyzing the data for at least two reasons:

  1. The outliers may negatively bias the entire result of an analysis.
  2. The behavior of outliers may be precisely what is being sought, and this is where a discussion with a domain expert becomes necessary.

It is therefore pertinent to first detect and then remove the outliers present in your data before proceeding with model building, which helps in building a more reliable model in the end.

Types of Outliers

  1. Type 1 – Global
  2. Type 2 – Contextual
  3. Type 3 – Collective

1. Global Outliers

Also known as “point anomalies,” these are outliers that deviate significantly from the rest of the data. A measurement is called a global outlier if it diverges from the overall distribution of the data regardless of the other features, because it lies far from the global distribution. This is the simplest type of outlier and the one found most often in practice.

A global outlier stands apart from all other data points. It can be better explained with a real-life example: a credit card fraud detection dataset containing the transactional data of a bank’s credit card customers. If we take the daily transaction amount as one of the attributes, a transaction with a very high amount compared to the normal range of an individual’s expenditure would be considered a point, or global, outlier.
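
As a toy illustration of that credit card example, the sketch below uses the median absolute deviation (MAD), a robust variant of the z-score; the amounts and the commonly cited 3.5 cutoff are illustrative assumptions:

```python
# Flag a transaction that sits far outside this customer's usual spending range.
import numpy as np

amounts = np.array([40, 55, 32, 60, 45, 50, 38, 5000])  # toy daily amounts
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))  # robust to the extreme value
modified_z = 0.6745 * (amounts - median) / mad
print(amounts[np.abs(modified_z) > 3.5])  # flags the 5000 transaction
```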

2. Contextual Outliers 

If a data instance is anomalous in a specific context, it is called a contextual outlier, or a conditional outlier. A contextual outlier therefore represents a small group of observations (sharing some similar features) that stand apart from a significantly larger group in the same context. The same value might, however, be seen as normal in a different context.

The idea of a context is induced by the structure of the dataset and should be specified as part of the problem formulation. Whether applying a contextual outlier technique is worthwhile depends on how meaningful contextual outliers are in the target domain.
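
A hedged sketch of the idea: each reading is judged against its own context group (here, the calendar month); the column names, toy temperatures, and 2-sigma cutoff are illustrative assumptions:

```python
# A value normal in one context (July) can be an outlier in another (January).
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan"] * 8 + ["Jul"] * 8,
    "temp_c": [2, 3, 1, 2, 3, 2, 1, 18,
               28, 30, 29, 31, 30, 29, 28, 30],
})

# Standardize each reading within its own context group
grouped = df.groupby("month")["temp_c"]
z = (df["temp_c"] - grouped.transform("mean")) / grouped.transform("std")
print(df[z.abs() > 2])  # 18°C is flagged in January; the same value in July would pass
```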

3. Collective Outliers

When a subset of observations in a dataset, taken as a collection, deviates significantly from the entire dataset, it is called a collective outlier. It is not necessary that each instance within the collective outlier is an outlier on its own. When performing outlier detection, it is very important to keep the context in mind, because a point or collective outlier can also be a contextual outlier given the context of the study.

Challenges of Outlier Detection

1. Effective Identification

Defining an outlier is a highly subjective task that depends on the domain and the application scenario. The grey area between normal observations and outliers is often very small, and even a small oversight can lead to a possible outlier being treated as a normal observation, or vice versa. Hence, we must be very careful when selecting the method used to detect and treat outliers.

2. Application-Specific Challenges

As stated earlier, choosing the similarity or distance measure and the relationship model used to describe data objects is of utmost importance in outlier detection. Unfortunately, these choices are often application-dependent. Different applications may have very different requirements; for example, in medical datasets, even observations that deviate only slightly from the rest of the data may be outliers. Hence, outlier detection methods dedicated to specific applications must often be developed.

3. Handling Noise

Noise in the data tends to resemble actual outliers and is therefore difficult to distinguish from them and remove. We must understand that outliers and noise are two different entities. And because noise is almost invariably present in all kinds of collected data, it brings real challenges to outlier detection by blurring the difference between normal observations and outliers. Noise hides outlier objects, reducing the effectiveness of the outlier detection algorithm.

Outlier Detection Methods

1. Statistical Methods

Simply starting with a visual analysis of the univariate data, using box-and-whisker plots, scatter plots, etc., can help in finding the extreme values. Assuming a normal distribution, we can calculate the z-score, i.e., the number of standard deviations (σ) a data point lies from the sample’s mean. The empirical rule tells us that about 68% of the data falls within one standard deviation of the mean, 95% within two, and 99.7% within three, so data points more than three standard deviations from the mean can be flagged as outliers. Another way is to use the interquartile range (IQR) as a criterion, treating as outliers any points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles.
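
Both rules in a minimal sketch, assuming roughly normal data; the toy values are illustrative:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 45])

# Rule 1 (z-score): flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# Rule 2 (IQR): flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules flag 45
```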

2. Proximity Methods

Proximity-based methods deploy clustering techniques to identify the clusters in the data and find the centroid of each cluster. They assume that an object is an outlier if its nearest neighbors are far away in feature space; that is, the proximity of the object to its neighbors deviates significantly from the proximity of most other objects to their neighbors in the same dataset. The usual approach is to fix a threshold, evaluate the distance of each data point from its cluster centroid, remove the outlying data points, and then go ahead with the modeling.
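
A sketch of that approach using k-means from scikit-learn; the choice of k = 2, the synthetic clusters, and the mean-plus-3-standard-deviations threshold are illustrative assumptions:

```python
# Flag points whose distance to their assigned cluster centroid is unusually large.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, (50, 2)),
    rng.normal([5, 5], 0.5, (50, 2)),
    [[10, -5]],  # a point far from both clusters
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = dist.mean() + 3 * dist.std()  # illustrative cutoff
print(X[dist > threshold])  # the far-away point exceeds the threshold
```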

As intuitive as this approach is, the success of such models depends heavily on the metric used as the distance measure. One drawback is that for some problem types it can be a challenge to find the right distance measure. Another is that accuracy suffers when a group of outliers lies close together.

Proximity-based methods are classified into two types. Distance-based methods judge a data point by the distance(s) to its neighbors, often building on clustering techniques such as k-means or hierarchical clustering. Density-based methods determine the degree of outlierness of each data instance from its local density; DBSCAN and the Local Outlier Factor (LOF) are examples of density-based outlier detection methods.
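
A brief density-based sketch with scikit-learn’s LocalOutlierFactor; the synthetic data and n_neighbors=20 are illustrative choices:

```python
# Score each point by how isolated it is relative to its neighbours' density.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8, 8]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 marks points in unusually low-density regions
print(X[labels == -1])
```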

3. Projection Methods

Projection methods use techniques such as PCA to model the data in a lower-dimensional subspace using linear correlations. The distance of each data point to the plane that fits this subspace is then calculated, and that distance can be used to find the outliers. Projection methods are simple and easy to apply and can highlight irrelevant values.

A PCA-based method approaches the problem by analyzing the available features to determine what constitutes a “normal” class, and then applies distance metrics to identify cases that deviate from it.
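
A hedged sketch of that idea: reconstruct each point from a one-dimensional PCA subspace and flag large reconstruction errors; the synthetic data and the 3-standard-deviation threshold are illustrative:

```python
# Points that do not fit the dominant linear structure have large residuals.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
t = rng.normal(0, 1, 200)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, 200)])  # nearly 1-D data
X = np.vstack([X, [[0, 5]]])  # a point off the main direction

pca = PCA(n_components=1).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))  # project and reconstruct
error = np.linalg.norm(X - X_hat, axis=1)
threshold = error.mean() + 3 * error.std()
print(X[error > threshold])  # the off-subspace point has a large residual
```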

Summary

Data of immense size, with multi-faceted properties, is being generated and captured every second by hordes of devices across multiple industries. This data has tremendous business value if it is processed, analyzed, and understood with the proper tools and techniques. But that is easier said than done, because the data brings along many hidden inconsistencies that can badly compromise the overall process and analysis.

Outliers occur naturally in data. They can carry hidden patterns and meaning which, once revealed, can either improve model performance (because unnecessary or erroneous data points are removed from the analysis) or unearth a pattern that otherwise could not have been revealed. Because each kind of dataset has different kinds of outliers, this article has described how to recognize them and apply a suitable technique to reach a much better-grounded conclusion.
