I am Vibhav Kharangate, a Product Manager at an EdTech company. We serve customers in more than 10 countries. Read on to learn about my journey with Great Learning’s PGP AIML Course.
The majority of our customers in these foreign locations were NRIs (Non-Resident Indians). We therefore needed reliable numbers on the split between native and NRI customers in each country, so that we could identify the countries where we needed to focus our efforts. The only relevant data we had about these customers was their names and the country they signed up from. It was brought to my attention that some members of the sales and operations teams were manually classifying each customer as native or NRI. With thousands of customers, this would have taken a lot of time. So I decided to try building an ethnicity detector: an AI model that works from names alone.
For this, I needed labeled data of names from various countries. I found some open datasets of names on GitHub, Kaggle, and other sources. I planned to build a binary classification model with just NRI and Native as its classes, so I treated Indian names as the NRI class and names from all other countries as the native class. Please find this below for reference:
Next, to design the neural network, I had to think about how we as humans judge whether a name is Indian or not. Often, the name is simply one we have heard before. But sometimes even a name we have never heard sounds very Indian. This is because names from a given region share similar sounds, which come from their syllables. The model therefore needed to detect patterns at the character level. So I first used a character-wise tokenizer, followed by a 1-dimensional convolutional network, to detect patterns in how those characters are arranged in names from different countries.
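To make the idea concrete, here is a minimal pure-Python sketch of character-wise tokenization and of what a single 1-dimensional convolution filter does to the resulting sequence. It is an illustration under assumptions, not the model described in this post: the function names (`build_vocab`, `encode`, `conv1d_max`), the padding length, the max-pooling step, and the hand-written filter are all hypothetical. A real network would learn many such filters from the labeled names.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (assumed convention)
UNK = 1  # index reserved for characters missing from the vocabulary


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase character to an integer index (2 and up)."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a name into a fixed-length list of character indices."""
    ids = [vocab.get(ch, UNK) for ch in name.lower()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def conv1d_max(ids: List[int], kernel: List[Dict[int, float]]) -> float:
    """Slide one filter over the sequence and return its strongest response.

    kernel[k] holds the weight the filter gives to each character index at
    offset k, which is equivalent to a filter applied to one-hot characters.
    """
    width = len(kernel)
    responses = [
        sum(kernel[k].get(ids[start + k], 0.0) for k in range(width))
        for start in range(len(ids) - width + 1)
    ]
    return max(responses)


names = ["patel", "smith"]
vocab = build_vocab(names)

# Hand-written filter that responds to the character pair "te".
# In a trained network these weights would be learned, not written by hand.
te_filter = [{vocab["t"]: 1.0}, {vocab["e"]: 1.0}]

patel_score = conv1d_max(encode("patel", vocab, 8), te_filter)
smith_score = conv1d_max(encode("smith", vocab, 8), te_filter)
```

The filter responds more strongly to "patel", which contains the pair "te", than to "smith", which contains only the "t". Stacking many learned filters of different widths is how a convolutional layer can pick up syllable-like patterns without ever seeing the exact name before.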
The prediction had its challenges. We are an EdTech startup in the K-12 segment, so we had the names of both students and parents. We noticed that many students who were second-generation NRIs had westernized names. For example, a name like Chris Patel could throw off the prediction. So we decided to use parents’ names wherever possible. Secondly, most customers had listed only their first names. This created another problem, because some names, such as Ali, are common to several countries. We tried to solve this to some extent by combining the first names of both the parent and the student.
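The name combination described above could look something like the sketch below. The function name `combined_name`, the lowercasing, and the choice to put the parent's name first are assumptions made for illustration; the post does not specify how the two names were joined.

```python
from typing import Optional


def combined_name(parent_name: Optional[str], student_name: Optional[str]) -> str:
    """Join the available parent and student names into one model input.

    The parent's name comes first because it was the preferred signal;
    missing or blank names are skipped.
    """
    parts = [
        name.strip().lower()
        for name in (parent_name, student_name)
        if name and name.strip()
    ]
    return " ".join(parts)


both = combined_name(" Ali ", "Chris")      # parent and student available
student_only = combined_name(None, "Chris")  # parent name missing
```

The combined string can then be passed through the same character-level tokenizer as a single name, so the model sees more characters per customer than it would from one first name alone.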
To test the model, we used a set of data that was already manually classified by our operations team before I started this project.
We got the following metrics:
As seen above, we had an overall accuracy of 85%. We also had very high precision for the NRI class. However, precision for the native class was somewhat low. For future model improvements, my focus will be on increasing this metric. Even simple measures, such as capturing the full name of the customer, should noticeably boost the accuracy of this model.
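For readers less familiar with these metrics, the sketch below shows how accuracy and per-class precision are computed from manually assigned labels and model predictions. The six labels are made-up toy data, not the results reported above, and the helper names are hypothetical.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of all customers whose predicted class matches the manual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: List[str], y_pred: List[str], cls: str) -> float:
    """Of the customers predicted as `cls`, the share that truly are `cls`."""
    truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
    if not truths_for_predicted:
        return 0.0  # nothing was predicted as this class
    return sum(t == cls for t in truths_for_predicted) / len(truths_for_predicted)


# Toy example only: manual labels versus model predictions.
y_true = ["NRI", "NRI", "NRI", "Native", "Native", "NRI"]
y_pred = ["NRI", "NRI", "Native", "Native", "Native", "NRI"]

overall_accuracy = accuracy(y_true, y_pred)
nri_precision = precision(y_true, y_pred, "NRI")
native_precision = precision(y_true, y_pred, "Native")
```

In this toy data, every customer predicted as NRI really is one, while one of the three customers predicted as native is actually an NRI. Low native precision therefore means that some NRIs are being counted as natives.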
Based on this model, we were able to classify our customers as natives and NRIs country by country. We can now identify which countries need our attention to grow a native customer base. This has given the company a clearly targeted direction to work in.