Importing data from Kafka to GCS using Dataproc Serverless
Data, a new language spoken by every industry in the modern era.
Q: Why does data play such an important role in any business?
Ans: Because data helps you make decisions, understand the health of your business, drive it forward, and so on.
To get there, data passes through various stages of its journey in a data pipeline before it reaches a meaningful state.
Today we will discuss how to transfer data from Apache Kafka to GCS.
To achieve this, we run a Dataproc template on GCP. That's it. Simple, right? Now let's take a brief look at the components involved that make it simple yet powerful.
Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Spark, Flink, Presto, and 30+ open source tools and frameworks.
Today we will run Apache Spark on Dataproc, so let's first see what Apache Spark is.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.
So far we have defined and understood our processing engine; now let's take a look at the source and destination. Our source is Apache Kafka.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation, written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
And our destination is Google Cloud Storage (GCS).
GCS is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
So, to move data from Kafka to GCS, you just run a template that is already in place: a Dataproc template developed by Google Cloud Platform.
Steps to execute Dataproc Template
1. Clone the Dataproc Templates repository and navigate to the Java templates folder.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
2. Obtain authentication credentials to submit the job.
gcloud auth application-default login
3. Ensure the subnet you use has Private Google Access enabled. Even if you are using the "default" VPC created by GCP, you will still have to enable Private Google Access, as shown below.
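For example, here is a minimal sketch of enabling Private Google Access on the default subnet with gcloud (the region here is illustrative; use your own):
# Enable Private Google Access on the default subnet in your region
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access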
4. Configure the Dataproc Serverless job by exporting the variables needed for submission:
GCP_PROJECT: GCP project id to run Dataproc Serverless on.
REGION: Region to run Dataproc Serverless in.
GCS_STAGING_LOCATION: GCS staging bucket location.
SUBNET: The VPC subnet to run Dataproc Serverless on, if not using the default subnet (format: projects/<project_id>/regions/<region>/subnetworks/<subnetwork>).
HISTORY_SERVER_CLUSTER: An existing Dataproc cluster to act as a Spark History Server. This property can be used to specify a dedicated server where you can view the status of running and completed Spark jobs.
export GCP_PROJECT=<project_id>
export REGION=<region>
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
export SUBNET=<vpc-subnet>
export HISTORY_SERVER_CLUSTER=projects/<project_id>/regions/<region>/clusters/<cluster_name>
5. Now run the KafkaToGCS template.
Sample execution command:
export GCP_PROJECT=dp-test-project
export REGION=us-central1
export SUBNET=test-subnet
export GCS_STAGING_LOCATION=gs://dp-templates-kafkatogcs/stg
export GCS_SCHEMA_FILE=gs://dp-templates-kafkatogcs/schema/msg_schema.json
export GCS_OUTPUT_PATH=gs://dp-templates-kafkatogcs/output/
bin/start.sh \
-- --template KAFKATOGCS \
--templateProperty project.id=$GCP_PROJECT \
--templateProperty kafka.bootstrap.servers=102.1.1.20:9092 \
--templateProperty kafka.topic=events-topic \
--templateProperty kafka.starting.offset=latest \
--templateProperty kafka.schema.url=$GCS_SCHEMA_FILE \
--templateProperty kafka.gcs.await.termination.timeout.ms=1200000 \
--templateProperty kafka.gcs.output.location=$GCS_OUTPUT_PATH \
--templateProperty kafka.gcs.output.format=parquet
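For context, kafka.gcs.output.format also accepts other formats such as json, csv, or avro, and kafka.schema.url points to a JSON schema file describing the Kafka message payload. Below is a purely illustrative sketch of creating and uploading such a file, assuming a Spark StructType-style JSON layout with hypothetical field names; verify the exact schema format the template expects in its documentation before relying on this:
# Write an example message schema locally (field names are hypothetical)
cat > msg_schema.json <<'EOF'
{
  "type": "struct",
  "fields": [
    {"name": "event_id", "type": "string", "nullable": true, "metadata": {}},
    {"name": "event_ts", "type": "timestamp", "nullable": true, "metadata": {}},
    {"name": "payload", "type": "string", "nullable": true, "metadata": {}}
  ]
}
EOF
# Upload it to the GCS location referenced by kafka.schema.url
gsutil cp msg_schema.json $GCS_SCHEMA_FILE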
More information on running the template is available in the Dataproc Templates repository.
6. Monitor the Spark batch job.
After submitting the job, you will be able to view it in the Dataproc Batches UI, where you can see both metrics and logs for the job.
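You can also check on the batch from the command line; a minimal sketch using gcloud (the batch id is whatever your submission returned):
# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=$REGION
# Inspect the state and details of a specific batch
gcloud dataproc batches describe <batch_id> --region=$REGION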