A Capstone Project by Amit Bajaj and Sathya Guruprasad
Introduction
Cloud computing has become very popular because of the many benefits it provides, and it is being adopted by businesses worldwide. The flexibility to scale up or down as business needs change, faster and more efficient disaster recovery, subscription-based models that reduce the high cost of hardware, and flexible working for employees are some of the benefits of the cloud that attract businesses. Like the cloud, data analytics is another crucial area that businesses are exploring for their growth. The amount of data available on the internet has risen exponentially as a result of the boom in the usage of social media, mobile apps, IoT devices, sensors and so on. It has become imperative for organisations to analyse this data to gain insights into their businesses and take appropriate action.
AWS provides a reliable platform for solving complex problems, on which cost-effective infrastructure can be built with great ease. AWS offers a wide range of managed services, including computing, storage, networking, database, analytics, application services and many more.
Problem Statement:
We analysed multiple software solutions that analyse data collected from the market, provide information and suggestions, and deliver a better customer experience. These include trading applications that provide stock prices, taxi companies that provide the locations of nearby taxis, journey-planning applications that provide live updates on different modes of transport, and many more.
We chose a “server-less” platform (the “Server-less Computing Execution Model”) to build the real-time data-processing app. The architecture is based on managed services provided by AWS.
What is “Server-less”?
Server-less is a cloud-based execution model in which the cloud provider dynamically allocates and runs the servers. It is a consumption-based model in which pricing is directly proportional to consumer use. AWS takes complete ownership of operational responsibilities, which eliminates infrastructure management for the user and provides availability with higher uptime.
Services Consumed:
- Kinesis (Kinesis Data Streams, Kinesis Data Analytics, Kinesis Firehose)
- Athena
- Lambda
- DynamoDB
- Amazon S3
- AWS CLI
How can data from different sources be received in the cloud without building a sizable infrastructure?
Kinesis is a managed service from AWS. Amazon Kinesis makes it easy to collect, process, and analyse real-time, streaming data so you can get timely insights and react quickly to new information. Kinesis Data Streams allows users to receive data from the data generation source. We created an Amazon Kinesis data stream using AWS CLI commands; it consumes data from the data source.
Technical + Functional Flow
Create Kinesis data streams:
- Create two streams in Kinesis using the AWS Console or AWS CLI commands: one to receive data from the data generator and another to hold the data after processing. The data generator produces the data, which is written to the input/source data stream. The Kinesis Analytics app processes it and writes the result to the output/destination stream.
- We created a program to generate data and transmitted it to Kinesis Data Streams with the help of the AWS SDKs and AWS CLI commands. Data can be generated in various ways:
- Using IoT devices
- Live trackers
- GPS trackers
- API
- Data generator tools (in case of Analysis)
Create a Kinesis Analytics App to Aggregate data:
- Build a Kinesis Data Analytics application that reads from the input/source data stream and writes formatted output to the output/destination data stream at a specified time interval.
- It is very important to stop the application when it is not in use to avoid unnecessary cost.
Data Storage and Processing:
- Lambda, another managed service by AWS, processes data from the triggering data stream and writes it to DynamoDB.
- A Lambda function runs when triggered, and its cost model is strictly driven by consumption: the user incurs no cost when the function is not running. Data is stored in DynamoDB and can be accessed in the standard fashion.
Kinesis Firehose, S3 and Athena:
- Kinesis Firehose acts as a mediator between Kinesis Data Streams and S3: data received from the data stream is delivered to a predefined S3 bucket in a specified format.
- Amazon Athena is a server-less interactive query service that enables users to query data stored in an S3 bucket for analysis.
AWS CLI, AWS CloudFormation and AWS IAM also play a very important role in building cloud-based infrastructure and ensuring secure connectivity within and outside the AWS cloud.
Conclusion:
Using AWS services, we were able to create a real-time data processing application based on serverless architecture that is capable of accepting data through Kinesis Data Streams, processing it through Kinesis Data Analytics, triggering a Lambda function and storing the results in DynamoDB. The architecture can be reused for multiple data types from various data sources and formats with minor modifications. We used only managed services provided by AWS, which led to zero infrastructure management effort.
The capstone project has helped us build practical expertise in AWS services such as Kinesis, Lambda, DynamoDB, Athena, S3 and Identity and Access Management, as well as in serverless architecture and managed services. We also learnt the Go programming language to build pseudo data producer programs. The AWS CLI helped us connect on-premises infrastructure with cloud services.
This project is a part of Great Learning’s post-graduate program in Cloud Computing.
Authors
Amit Bajaj – Project Manager at Cognizant
Sathya Guruprasad – Infrastructure Specialist at IBM Pvt Ltd