In the Kafka Spark Streaming Integration, Kafka acts as a central hub to input data streams which are processed, and then Spark streaming publishes the results using Spark engine into another Kafka topic or store in the form of dashboards, HDFS, and databases.
Two approaches involved in this process are -
- Receiver-based approach
Using the Kafka consumer API, the Receiver is implemented to receive the data which is then stored in Spark executors. Spark stream then processes the data synchronously to ensure zero data loss.
- Direct approach
The direct approach is receiver-less and works by querying Kafka for offsets in each topic within its designation partition rather than using receivers. This approach makes it easier to read data in parallel and offers zero data loss by eliminating the need for write-ahead logs.
It was introduced in Spark 1.4 for the Python API and Spark 1.3 for the Java and Scala API.
Here are Kafka-Spark APIs -
- SparkConf API
It is used to set configuration for key-value pairs using the SparkConf class.
- StreamingContext API
It denotes the connection to a Spark cluster and can be used to create broadcast variables, accumulators, and RDDs on it. Its signature is -
public StreamingContext(String master, String appName, Duration batchDuration,
String sparkHome, scala.collection.Seq<String> jars,
scala.collection.Map<String,String> environment)
where app name indicates the name of your job, the master is the cluster URL to connect to and batchDuration is the time interval required to divide streams of data into batches.
- KafkaUtils API
This API connects Kafka cluster to Spark streaming using createStream signature mentioned below -
public static ReceiverInputDStream<scala.Tuple2<String,String>> createStream(
StreamingContext ssc, String zkQuorum, String groupId,
scala.collection.immutable.Map<String,Object> topics, StorageLevel storageLevel)
Where, ssc is a StreamingContext object, zkQuorum is Zookeeper quorum, groupId is the group id for the consumer, topics means a map of topics to consume and storageLevel indicates the level used for storing received objects.