A Quick Introduction to KNN Algorithm

Are you venturing into machine learning? Here is a quick introduction to one of the simplest machine learning algorithms, KNN, which will help you grasp its key dynamics.

The K-Nearest Neighbors algorithm (KNN) is one of the most widely used machine learning algorithms due to its simplicity. So what is it?

KNN is a lazy-learning, non-parametric algorithm. It uses labelled data from several classes to predict the class of a new sample point. KNN is non-parametric because it makes no assumptions about the distribution of the data being studied; that is, the model structure is determined from the data.

What does it mean to say KNN is a lazy algorithm? It means it doesn’t use the training data points to build a generalised model ahead of time. This implies:

You expect little to no explicit training phase,
The training phase is pretty fast,
KNN keeps all the training data since they are needed during the testing phase.

Most real-world data does not obey typical theoretical assumptions, such as those made by a model like linear regression. This makes KNN useful when studying data about which there is little or no prior knowledge.

Where was KNN born?

KNN was born out of research done for the armed forces. Fix and Hodges, two statisticians, wrote a technical report for the USAF School of Aviation Medicine in 1951 introducing the KNN algorithm.

KNN is a Supervised Learning Algorithm

A supervised machine learning algorithm is one that relies on labelled input data to learn a function that produces an appropriate output when given unlabeled data.

In machine learning, there are two main categories:

1. Supervised Learning

2. Unsupervised Learning

In supervised learning, you train your model on a labelled set of data and ask it to predict the label for an unlabeled point. For example, a tumour prediction model is trained on many clinical test results, each classified as either positive or negative. The trained model can then predict whether an unlabeled test is positive or negative.

It works just like the way a teacher or a parent would teach a child new things. If a teacher wants the child to learn what an elephant looks like, the teacher will show the child pictures of elephants, and then pictures of animals that are not elephants, like zebras and monkeys.

When the picture shows an elephant, the teacher says, “elephant!”; when it does not, the teacher says, “no, not an elephant!” After doing this for a while, the teacher shows the child a picture and asks “elephant?”, and the child will (mostly) correctly say “elephant!” or “no, not an elephant!” depending on the picture. That is supervised learning. When we substitute the child with a computer, it becomes supervised machine learning.

We train it using the labelled data already available to us. Given a dataset of observations (x, y), we want to learn a function g: X → Y so that, for an unseen x, we can use g(x) to predict the corresponding output y.

Where to use KNN

KNN can be used for both regression and classification problems. However, in industrial settings it is mostly used for classification, since it fares well across the parameters evaluated when determining the usability of a technique:

Prediction Power
Calculation Time
Ease to Interpret the Output

The KNN algorithm fares well across all of these considerations. But mostly, it is used due to its ease of interpretation and low calculation time.
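
Since KNN can also be used for regression, here is a minimal sketch of that variant, assuming a single numeric feature and a prediction taken as the plain average of the K nearest targets. The function name knn_regress and the housing figures are invented for illustration.

```python
def knn_regress(train, query, k):
    """Predict a number as the mean target of the k nearest training points."""
    # train is a list of (feature, target) pairs with one numeric feature
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical data: (floor area in square metres, price in thousands)
houses = [(50, 150), (60, 180), (80, 240), (100, 310), (120, 360)]

prediction = knn_regress(houses, 85, k=3)
print(prediction)
```

For a query of 85 with K = 3, the three nearest areas are 80, 100 and 60, so the prediction is the average of 240, 310 and 180. Classification replaces this average with a majority vote.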

The primary step in Machine Learning

KNN is very simple and is often used as a benchmark for more complex classifiers like the Support Vector Machines (SVM) and the Artificial Neural Networks (ANN).

How is it employed in daily problems?

Despite its simplicity, KNN can do better than more powerful classifiers on some problems and is used in areas such as genetics, data compression, and economic forecasting.

In political science – classifying a potential voter as “will vote Republican” or “will vote Democrat”, or as “will vote” or “will not vote”.
Banking system – KNN can be used to predict whether a person is fit for loan approval, or whether he or she has traits similar to those of a defaulter.
Calculating credit ratings – KNN can help when calculating an individual’s credit score by comparing the individual with persons who have similar traits.

Other areas that use the KNN algorithm include Video Recognition, Image Recognition, Handwriting Detection, and Speech Recognition.

Companies Using KNN

Companies like Amazon or Netflix use KNN when recommending books to buy or movies to watch. Netflix even offered a $1 million prize to the team that could come up with the most accurate recommendation algorithm!

How do these companies make recommendations? Well, these companies gather data on the books you have read or movies you have watched on their website and apply KNN. The companies will input your available customer data and compare that to other customers who have purchased similar books or have watched similar movies.

The books and movies recommended depend on how the algorithm classifies that data point.

How does KNN work?

Contributed by: Augustine Joseph

The k-nearest neighbor algorithm stores all the available data and classifies a new data point based on a similarity measure (e.g., a distance function). This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.

Suppose there are two classes, Class A and Class B, and we have a new unknown data point “?”. In which of these classes does this data point lie? To solve this problem, we can use the K-NN algorithm. With the help of K-NN, we can easily identify the class of a particular data point. The data point is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by a distance function.

Consider the below diagram:

Here, we can see that if k = 3, then based on the distance function used, the nearest three neighbors of the data point are found, and the data point is classified according to the majority vote of those neighbors. In the case of k = 3, for the above diagram, it’s Class B. Similarly, when k = 7, for the above diagram, the majority vote of the seven nearest neighbors classifies the data point as Class A.
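
The effect of k on the vote can be reproduced with a small sketch. The coordinates below are not taken from the diagram; they are invented so that the three nearest neighbors favour Class B while the seven nearest favour Class A, and the helper name classify is likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D training points, invented for illustration: (x, y, label)
training = [
    (0.5, 0.0, "B"), (0.0, 0.6, "B"),
    (0.7, 0.7, "A"),
    (1.5, 0.0, "A"), (0.0, 1.6, "A"), (-1.7, 0.0, "A"), (0.0, -1.8, "A"),
    (3.0, 3.0, "B"), (-3.0, 3.0, "B"),
]


def classify(query, k):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda row: math.dist(query, row[:2]))
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


new_point = (0.0, 0.0)
print(classify(new_point, 3))  # B
print(classify(new_point, 7))  # A
```

With k = 3 the neighbors are two B points and one A point; widening to k = 7 pulls in four more A points, which flips the vote.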

K-Nearest Neighbors

The KNN algorithm applies the principle that birds of a feather flock together. It assumes that similar things exist close to each other.

KNN captures the idea of similarity (sometimes called closeness, proximity, or distance) with some mathematics you may have learned as a child: calculating the distance between points on a graph.

Euclidean distance, or straight-line distance, is a popular and familiar choice for calculating distance.
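
For two points p and q with the same number of coordinates, the Euclidean distance is the square root of the sum of the squared differences between matching coordinates. A minimal sketch follows; the function name euclidean_distance is chosen here for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# A 3-4-5 right triangle: the differences are 3 and 4, so the distance is 5
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```

In Python 3.8 and later, the standard library function math.dist computes the same quantity.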

Choosing the right value for K

To get the right K, you should run the KNN algorithm several times with different values of K and select the one that has the least number of errors.

The right K must be able to predict data that it hasn’t seen before accurately.

Things to guide you as you choose the value of K

As K approaches 1, your prediction becomes less stable.
As your value of K increases, your prediction becomes more stable, because the majority vote is taken over more neighbours.
When you start receiving an increasing number of errors, you should know you are pushing your K too far.
When taking a majority vote between two classes, K should be an odd number so that there is always a tiebreaker.
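
The search for K can be sketched as a loop that counts errors on data held back from training. Everything below is an assumption made for illustration: the points, the single deliberately mislabelled "B" inside the A cluster, the helper names predict and error_count, and the candidate values of K.

```python
import math
from collections import Counter


def predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_count(train, validation, k):
    """Number of validation points that are misclassified for this k."""
    return sum(predict(train, x, k) != y for x, y in validation)


# Hypothetical training data: an A cluster, a B cluster, and one noisy "B"
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"), ((2.0, 2.0), "A"),
    ((1.5, 1.5), "B"),  # mislabelled point sitting inside the A cluster
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"), ((6.0, 6.0), "B"),
]
# Hypothetical held-out points that were not used for training
validation = [
    ((1.4, 1.6), "A"), ((1.6, 1.4), "A"), ((5.5, 5.5), "B"), ((2.2, 1.8), "A"),
]

errors = {k: error_count(train, validation, k) for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)  # smallest k among those with fewest errors
print(errors)
print(best_k)
```

On this toy data, K = 1 is thrown off by the noisy point, the middle values make no errors, and K = 9 uses every training point, so it always predicts the larger class. That matches the guidance above: very small K is unstable, and pushing K too far increases the errors again.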

Working of KNN Algorithm in Machine Learning

To understand how it works, note that the KNN algorithm applies the following steps:

Step 1 – When implementing an algorithm, you will always need a data set. So, you start by loading the training and the test data.

Step 2 – Choose the value of K, that is, the number of nearest data points to consider. K can be any positive integer.

Step 3 – Do the following for each test data point –

3.1 – Use Euclidean, Hamming, or Manhattan distance to calculate the distance between the test data point and each row of the training data. The Euclidean method is the most used when calculating distance.

3.2 – Sort the data set in ascending order based on the distance value.

3.3 – From the sorted array, choose the top K rows.

3.4 – Based on the most frequent class among these rows, assign a class to the test point.

Step 4 – End
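
The steps above can be sketched in a few lines of plain Python. This is one possible reading of the steps, not a reference implementation: it assumes Euclidean distance, and the data, the labels "red" and "blue", and the function name knn_classify are all invented for illustration.

```python
import math
from collections import Counter


def knn_classify(train_rows, test_point, k):
    """Classify one test point by following steps 3.1 to 3.4."""
    # Step 3.1: Euclidean distance from the test point to every training row
    distances = [(math.dist(features, test_point), label)
                 for features, label in train_rows]
    # Step 3.2: sort in ascending order of distance
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows
    top_k = distances[:k]
    # Step 3.4: assign the most frequent class among those rows
    return Counter(label for _, label in top_k).most_common(1)[0][0]


# Step 1: load the training and test data (hypothetical values)
train_rows = [
    ((1.0, 1.1), "red"), ((1.2, 0.9), "red"), ((0.8, 1.0), "red"),
    ((4.0, 4.2), "blue"), ((4.1, 3.9), "blue"), ((3.8, 4.0), "blue"),
]
test_points = [(1.1, 1.0), (3.9, 4.1)]

# Step 2: choose K
k = 3

# Step 3: classify each test point
predictions = [knn_classify(train_rows, point, k) for point in test_points]
print(predictions)  # ['red', 'blue']
```

Note that every prediction scans and sorts the whole training set, which is why prediction slows down as the data grows.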

Some KNN Advantages and Disadvantages

Some Advantages of KNN

Quick calculation time – there is little or no training phase
Simple algorithm – easy to interpret
Versatile – useful for regression and classification
High accuracy – often accurate enough to be competitive with more sophisticated supervised learning models
No assumptions about data – no need to make additional assumptions, tune several parameters, or build a model. This makes it valuable for nonlinear data.

Some Disadvantages of KNN

Accuracy depends on the quality of the data
With large data, the prediction stage might be slow
Sensitive to the scale of the data and irrelevant features
Requires high memory – needs to store all of the training data
Given that it stores all of the training data, it can be computationally expensive
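
The sensitivity to scale is easy to demonstrate. In the sketch below, which uses invented rows of (age in years, income in dollars), income dominates the raw distance simply because its numbers are larger. Min-max scaling is used here as one common remedy; it is an assumption of this example, not something the list above prescribes, and the helper names are hypothetical.

```python
import math


def nearest_label(rows, point):
    """Label of the single closest row (KNN with K = 1)."""
    return min(rows, key=lambda row: math.dist(row[0], point))[1]


def fit_min_max(points):
    """Per-feature minimum and maximum, learned from the training points."""
    columns = list(zip(*points))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(point, lows, highs):
    """Map each feature to [0, 1]; assumes no feature is constant."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))


# Hypothetical rows: ((age in years, income in dollars), label)
train = [
    ((25, 50_000), "under 40"), ((60, 51_000), "over 40"),
    ((30, 90_000), "under 40"), ((55, 20_000), "over 40"),
]
query = (27, 51_000)

raw_prediction = nearest_label(train, query)

lows, highs = fit_min_max([features for features, _ in train])
scaled_train = [(scale(features, lows, highs), label) for features, label in train]
scaled_prediction = nearest_label(scaled_train, scale(query, lows, highs))

print(raw_prediction)     # over 40
print(scaled_prediction)  # under 40
```

Before scaling, the 60-year-old is the nearest neighbor only because the incomes match exactly; after scaling, the 33-year age gap counts as much as a comparable share of the income range.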

A Quick Summary of KNN Algorithm

K is a positive integer
With a new sample, you have to specify K
The K entries in the database closest to the new sample are selected
KNN doesn’t learn any model
KNN makes predictions using the similarity between an input sample and each training instance.

This blog has given you the fundamentals of one of the most basic machine learning algorithms.

KNN is a great place to start when first learning to build models based on different data sets.

A data set with a lot of different points and accurate information is the best place to begin with KNN.

You should keep these 3 points in mind:

A data set with lots of different points and labelled data is ideal to use.
R and Python are popular languages to use with KNN.
To find the most accurate results from your data set, you need to learn the correct practices for using this algorithm.
