This paper was presented at the International Conference on Data Science, Machine Learning and its applications (ICDML-2020) by Rohit Thakur, Apratim Banerjee, Robin George, and Rajvir Singh Prajapati, PGP DSBA students of Great Learning.
Banks accept several types of KYC (Know Your Customer) documents, such as the Aadhar card, PAN card, driving licence, passport, and voter ID, to verify the identity of their customers. Customers can submit these documents over the counter or through an online portal, and the uploaded images may vary in size, colour, and orientation depending on whether they were captured with a camera, a cell phone, or a scanner. The documents are then manually verified, classified by type of identification document, and saved in the system for further processing. Manual classification is not only time-consuming but may also lead to information leakage. In this project, our learners aimed to build a machine learning model that automates KYC document classification and retrieves the textual information from the documents at the same time.
In this study, images of different KYC documents, namely PAN cards, Aadhar cards, passports, driving licences, ration cards, and voter IDs, were collected from 346 individuals. A few images of cats and dogs were also added and labelled as “Garbage”. All the images were collected in .jpg format.
All the images were resized to the standard input size required by the machine learning models. A total of 9,090 images were then created by transforming the images of the KYC documents using image augmentation, a technique for artificially creating new images from existing image data. The transformations include a range of image manipulation operations, such as shifts, flips, and zooms, and are intended to expand the training dataset with new, plausible examples that mimic the different ways a user might capture an image. Further, a combination of a machine learning model and optical character recognition (OCR) was developed: the ML model classifies the images into different classes (PAN card, Aadhar card, passport, driving licence, ration card, voter ID), and OCR retrieves the textual information from the images.
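The augmentation step described above can be illustrated with a minimal sketch. The paper does not state which library or parameter ranges were used, so everything here is an assumption: the helper names (`shift`, `hflip`, `zoom_in`, `augment`), the 10% shift limit, and the 1.2x zoom limit are hypothetical, and NumPy is used only to keep the example self-contained.

```python
import numpy as np

def shift(img, dy, dx):
    """Translate the image by (dy, dx) pixels, filling exposed areas with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_y0, dst_y1 = max(0, dy), min(h, h + dy)
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    dst_x0, dst_x1 = max(0, dx), min(w, w + dx)
    out[dst_y0:dst_y1, dst_x0:dst_x1] = img[src_y0:src_y1, src_x0:src_x1]
    return out

def hflip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def zoom_in(img, factor):
    """Centre-crop by `factor` (>= 1) and scale back up with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    ch = max(1, int(round(h / factor)))
    cw = max(1, int(round(w / factor)))
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    rows = np.arange(h) * ch // h   # map each output row to a crop row
    cols = np.arange(w) * cw // w   # map each output column to a crop column
    return crop[rows][:, cols]

def augment(img, rng, max_shift=0.1, max_zoom=1.2):
    """Apply a random shift, an optional flip, and a random zoom (assumed ranges)."""
    h, w = img.shape[:2]
    my, mx = int(h * max_shift), int(w * max_shift)
    dy = int(rng.integers(-my, my + 1))
    dx = int(rng.integers(-mx, mx + 1))
    out = shift(img, dy, dx)
    if rng.random() < 0.5:
        out = hflip(out)
    out = zoom_in(out, rng.uniform(1.0, max_zoom))
    return np.ascontiguousarray(out)

# Usage: generate several variants of one (synthetic) document image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 96, 3), dtype=np.uint8)
variants = [augment(image, rng) for _ in range(5)]
```

Each variant keeps the original image size, so it can be fed to the classifier without further resizing. In practice a deep learning framework's built-in augmentation utilities would likely replace these hand-written helpers.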
The combination of the machine learning model and optical character recognition was able to classify the KYC documents and retrieve the correct information from them with a high accuracy of 98%. Automating the classification and retrieval of textual information in this way would speed up the process while ensuring zero information leakage. The automated process could then be implemented in any industry where KYC documents are required.