BigQuery data preparation: A Step-by-Step Guide
TL;DR: Learn how to use BigQuery data preparation, an AI-powered visual interface to transform raw data into analysis-ready tables. This guide walks you through a step-by-step example, demonstrating how to use its features for data cleansing, transformation, and enrichment.
Introduction
BigQuery data preparation is a powerful feature within BigQuery Studio that empowers users to build AI-assisted, visual data pipelines. This allows you to transform raw data from various sources into refined BigQuery tables, ready for analysis and insights.
Why is data preparation important?
In today’s data-driven world, organizations often deal with data that is:
- Raw and Unstructured: Data might come from various sources like applications, websites, or external databases, often requiring cleaning and transformation.
- Inconsistent: Data formats, missing values, and errors can hinder analysis and lead to inaccurate conclusions.
- Not Optimized for Analysis: Raw data might not be in the ideal format for specific analytical tasks.
BigQuery data preparation addresses these challenges by providing a user-friendly interface to:
- Cleanse Data: Remove errors, handle missing values, and standardize formats.
- Transform Data: Convert data types, extract relevant information, and create new features.
- Enrich Data: Combine data from multiple sources to gain a more comprehensive view.
Use Cases for BigQuery data preparation
This feature is valuable for various scenarios, including:
- Creating data pipelines for data warehousing.
- Preparing data for business intelligence (BI) dashboards and reports.
- Building machine learning (ML) datasets.
- Enabling self-service data preparation for business users.
In this example, a business user is notified that new raw data on web/app interactions is available. Many businesses rely on analyzing user behavior on their websites or apps to make informed decisions. This data often resides in raw, unprocessed formats, making it difficult to extract meaningful insights. For example, imagine a marketing team trying to understand why a recent campaign didn’t perform as expected. They have access to website click-stream data, but it’s a jumbled mess of timestamps, user IDs, and product codes. BigQuery data preparation can help them transform this raw data into a clean, structured format, allowing them to identify trends, segment users, and ultimately optimize their marketing strategy.
Before the data is used by the rest of the team, the user will prepare it for their teammates using data preparation’s features. Let’s see how the business user would accomplish that.
Want to follow along in your console? Check this initial setup first
If you want to follow the steps below in your own Cloud console for a hands-on experience, you will need to create the tables used throughout the post. You can check the setup steps here. Otherwise, just keep reading below!
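Alternatively, if you just want a scratch table to experiment with, here is a minimal sketch of a stand-in you could create yourself. The column names match the ones used later in this post, but the schema and sample values are assumptions for illustration (the real setup uses an external table):

-- Minimal stand-in for the web_app_interaction_external table.
-- Schema and sample values are illustrative assumptions, not the official setup.
CREATE OR REPLACE TABLE demo_dataset.web_app_interaction_external AS
SELECT * FROM UNNEST([
  STRUCT(
    '2024-05-01 10:15:00 America/Los_Angeles' AS session_start_time,
    'YES' AS purchase_made,
    '{"product_name": "T-Shirt|XL"}' AS product_json,
    '540' AS time_on_site,
    12.5 AS distance_to_store,
    CAST(NULL AS STRING) AS discount_code,
    'Android 14' AS device_info,
    'ana@example.com' AS user_email),
  STRUCT(
    '2024-05-01 18:30:00 UTC', 'NO', '{"product_name": "Mug|S"}',
    '95', 3.2, 'SAVE10', 'iOS 17', 'bob@example.com')
]);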
Navigating the data preparation interface
It all starts from the web_app_interaction_external table, which is the table the user wants to transform before sharing it with the rest of the team. To open the data preparation interface, click the three dots next to the web_app_interaction_external table and select Open in > Data preparation.
The data preparation interface is divided into two main sections:
- The left section shows a sample of the data.
- The right section shows Gemini suggestions for transforming the data.
You can click on each column to see specific suggestions for that column.
Transforming and Refining Your Data
Data preparation offers several ways to easily transform your data: AI suggestions, intent detection from direct edits, manual input, and code generation from natural language. Let’s go over all of them using practical examples.
The first thing you might notice is that the session_start_time column contains different time zones and is in STRING format. To convert the column to timestamp and handle the different time zones, click on the session_start_time column name and then click on the PREVIEW button for a particular suggestion. If the transformation looks good, click on the APPLY button.
Data preparation is very flexible. If a specific suggestion that you are looking for does not show up, you can always click EDIT and create it using natural language in the prompt or even add the command manually:
COALESCE(SAFE.PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S %Z', session_start_time))
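To sanity-check what the expression does, you can run the parse over a couple of literal values; the sample timestamps below are assumptions for illustration:

-- Both values parse to a single UTC TIMESTAMP despite carrying different time zones.
SELECT
  SAFE.PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S %Z', '2024-05-01 10:15:00 America/Los_Angeles') AS from_la,
  SAFE.PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S %Z', '2024-05-01 18:30:00 UTC') AS from_utc;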
You might also notice that the purchase_made column is not following the company standard of TRUE/FALSE for BOOLEAN values.
To fix this, click on the purchase_made column name and then click on the PREVIEW button for the suggestion. If the transformation looks good, click on the APPLY button.
CASE WHEN purchase_made = 'YES' THEN TRUE ELSE FALSE END
The product_json column contains a JSON object, but the user only wants the product name. To extract it, click on the product_json column name and then click the PREVIEW button on the suggestion that extracts the product name from the JSON object.
JSON_EXTRACT_SCALAR(product_json, r'$.product_name')
Before applying the transformation, you might notice that it would be written directly back to the product_json column, which is not the column name we want for the extracted value.
To fix this, click on EDIT, then Target column, then CREATE NEW COLUMN, add product_name and click SAVE. Then, click on PREVIEW and then APPLY.
You might notice that the newly created product_name column still contains extra information: the product name and the product size, separated by a pipe character.
To fix this, you can further transform the product_name column. Let’s do this one using the power of natural language transformations! First, click on the product_name column, then ADD STEP, and then Transformation.
Then, enter the command using natural language. A good prompt would look like this: “This expression removes everything before the pipe symbol (|) from product_name”. Click “Send input”. Also remember to create a new target column for this transformation (for example, product_size).
REGEXP_REPLACE(product_name, r'.*\|', r'')
Now that you have extracted the product size into a separate column, you can safely change the product_name column to remove the size information from it.
To do this, click on the product_name column, then click PREVIEW and APPLY on the suggestion to remove everything after the pipe symbol.
REGEXP_REPLACE(product_name, r'\|.*', r'')
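Putting these steps together, here is roughly what they do to a sample value; the JSON literal below is an assumption for illustration:

-- Walk one sample value through the steps: extract from JSON, then split on the pipe.
WITH sample AS (
  SELECT '{"product_name": "T-Shirt|XL"}' AS product_json
)
SELECT
  JSON_EXTRACT_SCALAR(product_json, r'$.product_name') AS extracted,  -- 'T-Shirt|XL'
  REGEXP_REPLACE(JSON_EXTRACT_SCALAR(product_json, r'$.product_name'), r'.*\|', r'') AS product_size,  -- 'XL'
  REGEXP_REPLACE(JSON_EXTRACT_SCALAR(product_json, r'$.product_name'), r'\|.*', r'') AS product_name   -- 'T-Shirt'
FROM sample;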
Intelligent Transformations based on user’s intention
Data preparation also lets you transform data by simply editing values directly; it captures the user's intention from the edit. Let’s see how this works with the time_on_site column.
The user notices that time_on_site data is being delivered in seconds. However, the team needs it in minutes. The user simply goes to the column, divides a value by 60, and waits to see if Gemini offers a suggestion based on that intention. As you can see, Gemini correctly infers that the user is probably converting from seconds to minutes, and it even suggests rounding to 2 decimal places!
If the suggestion does not show up right away, try editing a few more rows, dividing them by 60 as well, so Gemini can properly infer what you are trying to achieve.
COALESCE(ROUND(CAST(time_on_site AS NUMERIC) / CAST(60 AS NUMERIC), 2), time_on_site)
The user also knows that distance_to_store data was populated in miles. However, the team is used to doing analyses in the metric system, in kilometers. The user knows that miles can be converted to kilometers by multiplying by 1.609. When the user edits a value with this in mind, Gemini understands that the user wants to apply a conversion from miles to kilometers!
COALESCE(ROUND(distance_to_store * CAST(1.609 AS NUMERIC), 3), distance_to_store)
Handling Missing Data
The user can also add default values to columns that contain NULLs. The user wants the discount_code column to always have some value in it. The user picks a row where the value is NULL, types the default value they want to see, and Gemini suggests the corresponding transformation.
CASE WHEN discount_code is null THEN 'n/a' ELSE discount_code END
Sometimes the user just wants to filter out missing values altogether. In this case, the user clicks on device_info and selects the suggestion to keep only the rows that are not empty.
device_info <> ''
Note that this predicate also drops NULL rows: device_info <> '' evaluates to NULL when device_info is NULL, and rows where the filter is not TRUE are excluded.
Joining and Removing Columns
To gain a more comprehensive understanding, users can combine data from multiple tables based on related columns, bringing together information that might be scattered across different sources. Data preparation allows users to join the current data with other existing tables, enhancing or complementing the data they need to prepare for their team.
Imagine that the user wants to join this web/app data with a customer table that exists in BigQuery. The user can easily achieve this by creating a Join, which is also powered by Gemini.
Once the user selects the table, Gemini automatically suggests which columns to join on! In this case, Gemini suggested joining on email, which is a common column between the two tables. The user can also select the join type, which in this case will be an Inner join. Once the join is created, all the columns from the customer table are added to the data preparation.
Now that the customer table is joined, the user realizes that some columns are not needed in the final result. One of them is customer_email, which holds the same information as the user_email column. Data preparation allows users to remove any column by going to the SCHEMA tab and dropping it there, as sketched below.
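Under the hood, this step boils down to SQL along these lines; the dataset, the customer table name, and its email column are assumptions based on this example, and the sketch is not the exact code data preparation generates:

-- Rough SQL equivalent of the Gemini-suggested inner join, minus the duplicate email column.
SELECT
  w.*,
  c.* EXCEPT (customer_email)  -- customer_email duplicates user_email, so it is dropped
FROM demo_dataset.web_app_interaction_external AS w
INNER JOIN demo_dataset.customer AS c
  ON w.user_email = c.customer_email;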
Saving and executing your data preparation
Once the data is prepared, you can save the data preparation object. Depending on your organization’s policies, business users may be able to create a new table directly, or they may need to hand the data preparation object to a data engineer. If the latter is the case, the business user can easily share their data preparation object with the data engineer within the console.
Whether it’s the business user or the data engineer, the next step is to add a destination where this data will be loaded. You can add a Destination step and configure the project/dataset/table as needed.
Data preparation gives you the flexibility of selecting from different write modes (Full refresh, Append, or Incremental), giving users options for how the data should be ingested into the final table.
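As a mental model, the first two write modes map onto familiar SQL patterns, while Incremental processes only the data that arrived since the previous run. The sketch below reuses the table names assumed earlier and is not the exact code data preparation generates:

-- Full refresh: each run replaces the destination table's contents.
CREATE OR REPLACE TABLE demo_dataset.web_app_interaction_prepared AS
SELECT * FROM demo_dataset.web_app_interaction_external;

-- Append: each run adds its output to whatever the table already holds.
INSERT INTO demo_dataset.web_app_interaction_prepared
SELECT * FROM demo_dataset.web_app_interaction_external;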
After selecting how the data should be written to our table, we may want to run the data preparation and see the data flow into the destination table. Users have the option to run an ad hoc execution directly from the data preparation object, or to create a schedule that executes the data preparation at specified time intervals. Let’s do both. First, an ad hoc execution.
We can follow the execution status right from the data preparation screen. We can even see the commands that were run as part of the data preparation. As you can see, all code is generated as SQL and executed in BigQuery itself, so it is transparent to users and requires no extra configuration.
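Because every step runs as a regular BigQuery job, you can also inspect the executed statements through standard job metadata. The region qualifier below is an assumption; adjust it to wherever your data lives:

-- List recent jobs in the project to see the SQL that data preparation executed.
SELECT creation_time, job_id, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10;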
Now let’s see how we could easily schedule this data preparation to run on a regular basis. We simply create a schedule and define its frequency.
The schedule is saved, and it can be viewed and managed alongside all other scheduled objects in BigQuery, in the Scheduling menu.
Viewing the final results
Now it’s time to see our final results! We can go to our dataset and see that a new table was created there. When we preview it, we see the data in the final format that was prepared. All of this without a single line of code written by the user (or very few, if you needed to adjust any suggestions)!
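If you prefer SQL to the preview pane, a quick query shows the same thing; the destination table name here is the one assumed earlier in this post:

SELECT * FROM demo_dataset.web_app_interaction_prepared LIMIT 10;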
Saving and Version Controlling Your Work
Version control is essential for any data workflow. Although data preparation doesn’t have direct Git integration yet, you can still export your work as a YAML file and save it in your Git repository from within BigQuery. By exporting and saving data preparation objects in Git, users can track changes, revert to previous versions, and collaborate effectively.
To do this, go to the data preparation object and click DOWNLOAD. This saves a .yaml file to your system.
Once we upload our YAML file to Git, we can also see the graph representation of our data preparation. As with any other Git interaction, we add a comment to our version and commit/push the change to our repository.
Conclusion
BigQuery data preparation is a powerful and user-friendly tool that simplifies the process of preparing data for analysis. From data cleansing and transformation to enrichment and delivery, it streamlines the journey from raw data to actionable insights.
In this blog post, we’ve shown you how it can handle a variety of data preparation tasks, empowering you to take control of your data. And with the rise of generative AI, we can expect even more intuitive and automated data preparation experiences in the future.
Ready to simplify your data transformation process? Dive into BigQuery data preparation now and experience the ease of building AI-powered data pipelines!