Moving Data from BigQuery to GCS — using GCP Dataproc Serverless and PySpark

Dataproc Templates allow us to run common use cases on Dataproc Serverless using Java and Python without the need to develop them ourselves. These templates implement common Spark workloads, letting us customize and run them easily.

This blog post is for you if you are new to Dataproc Serverless, or if you are looking for a PySpark template to export data from BigQuery to GCS using Dataproc Serverless.

Prerequisites

For running these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • Python 3.7+ installed
  • A VPC subnet with Private Google Access enabled. The default subnet works, as long as Private Google Access is enabled on it. You can review all the Dataproc Serverless networking requirements here. A quick way to verify these prerequisites is shown after this list.
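
If you want to quickly verify these prerequisites, commands along the following lines should work (the subnet name "default" and <region> are placeholders you may need to adjust):

# Confirm the SDK is authenticated and pointing at the right project
gcloud auth list
gcloud config get-value project
# Confirm the Python version
python3 --version
# Confirm Private Google Access is enabled on the subnet you plan to use
gcloud compute networks subnets describe default \
  --region=<region> \
  --format="get(privateIpGoogleAccess)"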

Key Benefits

  • Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster.
  • The BigQueryToGCS template is open source, fully customizable, and ready to use for simple jobs.
  • You can export data from BigQuery to GCS in Parquet, Avro, CSV, and JSON formats.

Configuration Arguments

This template includes the following arguments to configure the execution:

  • bigquery.gcs.input.table: BigQuery Input table name (format: project:dataset.table)
  • bigquery.gcs.output.format: Output file format. One of parquet, avro, csv, json
  • bigquery.gcs.output.location: GCS location for output files (format: gs://BUCKET/…)
  • bigquery.gcs.output.mode: Output write mode. One of: append, overwrite, ignore, errorifexists. Default is append. You can read about how each save mode behaves here.
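
To make the expected formats concrete, here is a hypothetical set of argument values (the project, dataset, table, and bucket names are made up):

--bigquery.gcs.input.table=my-project:sales_dataset.transactions
--bigquery.gcs.output.format=parquet
--bigquery.gcs.output.location=gs://my-output-bucket/exports/transactions
--bigquery.gcs.output.mode=overwrite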

Usage

1. Ensure that the subnet you plan to use has Private Google Access enabled. Even if you use the "default" VPC network generated by GCP, you still need to enable Private Google Access, as shown below.
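
For example, assuming you are using the default subnet, you can enable it with gcloud (<region> is a placeholder):

gcloud compute networks subnets update default \
    --region=<region> \
    --enable-private-ip-google-access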

2. Create a GCS bucket to use as the staging location for Dataproc. This bucket will be used to store dependencies required to run the serverless Spark job.

export STAGING_BUCKET="my-staging-bucket"
gsutil mb gs://$STAGING_BUCKET
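
Optionally, if you want the bucket co-located with the region where the job will run, you can instead pass a location flag when creating it (<region> is a placeholder):

gsutil mb -l <region> gs://$STAGING_BUCKET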

3. Clone the Dataproc Templates repository and navigate to the Python templates directory

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/python

4. Configure the Dataproc Serverless job

To submit the job to Dataproc Serverless, we will use the provided bin/start.sh script. The script requires us to configure the Dataproc Serverless job using environment variables.

The mandatory configuration values are:

  • GCP_PROJECT : The GCP project to run Dataproc Serverless on.
  • REGION : The region to run Dataproc Serverless on.
  • GCS_STAGING_LOCATION : A GCS location where Dataproc will store staging assets. It should be within the bucket we created earlier.

# Project ID to run the Dataproc Serverless Job
export GCP_PROJECT=<project_id>
# GCP region where the job should be submitted
export REGION=<region>
# The staging location for Dataproc
export GCS_STAGING_LOCATION=gs://$STAGING_BUCKET/staging

In our case, the BigQuery To GCS template needs the Spark BigQuery Connector to be available in the classpath. The connector is publicly hosted, so we will add it using the JARS environment variable. You can also choose to store the JAR file in a bucket you own.

For exporting BigQuery data in the Avro file format, we also need spark-avro.jar, which is already included by bin/start.sh.

# Path to the Spark BigQuery Connector JAR file
export JARS="gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
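
If you would rather host the connector in a bucket you own, as mentioned above, something along these lines works (the jars/ path is just an example):

gsutil cp gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar gs://$STAGING_BUCKET/jars/
export JARS=gs://$STAGING_BUCKET/jars/spark-bigquery-latest_2.12.jar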

5. Execute the BigQuery To GCS Dataproc template

After configuring the job, we are ready to trigger it. We will run the bin/start.sh script, specifying the template we want to run and the argument values for the execution.

./bin/start.sh \
-- --template=BIGQUERYTOGCS \
   --bigquery.gcs.input.table=<projectId:datasetName.tableName> \
   --bigquery.gcs.output.format=<csv|parquet|avro|json> \
   --bigquery.gcs.output.mode=<overwrite|append|ignore|errorifexists> \
   --bigquery.gcs.output.location=<gs://bucket/path>

NOTE: Submitting the job will prompt you to enable the Dataproc API, if it is not already enabled.
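
If you prefer to enable the API ahead of time, you can do so with gcloud:

gcloud services enable dataproc.googleapis.com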

6. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
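
The same information is also available from the command line (the batch ID is a placeholder):

# List recent Dataproc Serverless batch jobs in the region
gcloud dataproc batches list --region=<region>
# Inspect the status and details of a specific batch job
gcloud dataproc batches describe <batch_id> --region=<region>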

