How to find aggregate value for rows in a table in different range of values in Bigquery?

Question

I have a BigQuery table with the columns company_id, date, sales_amount. sales_amount is a FLOAT64 column whose value can range from 0 to 1 billion. For every company_id, I need to find the first date on which each sales_amount range was hit.

What I have written so far uses one WITH clause per range, for example:

With A as (
SELECT company_id, min(date) breakDate
FROM <table>
WHERE sales_amount >= 100000 and sales_amount < 500000
GROUP BY company_id
),
B as (
SELECT company_id, min(date) breakDate
FROM <table>
WHERE sales_amount >= 500000 and sales_amount < 1000000
GROUP BY company_id
),
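AllUnion AS (
-- Keep the lower-range event only if the higher range was hit later, or never.
-- Columns are listed explicitly so both sides of the UNION ALL have the same shape.
SELECT A.company_id, A.breakDate FROM A
LEFT JOIN B
USING(company_id)
WHERE B.breakDate > A.breakDate OR B.company_id is NULL

UNION ALL
SELECT company_id, breakDate FROM B
)
SELECT * FROM AllUnion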

So whenever a new range is added, I have to add a new WITH section and extend the big union at the end that merges all the break events. When merging, I make sure that if a higher-order event happened first, later lower-order events are filtered out. For example, suppose a company made more than 500K in sales in January (for the first time), and then its sales went down and hit 120K in February. Only the 500K event is returned; the February event is filtered out.

I have to do this for several tables and might need more events. Is there a smarter way to write this query in BigQuery?

ted · Accepted Answer · 2022-06-18 18:17:25Z

You can bucketize sales_amount so that all values within the same range share one bucket ID. Then, by grouping by company_id and the bucket ID, you get MIN(date) for each bucket.

https://cloud.google.com/bigquery/docs/reference/standard-sql/mathematical_functions#range_bucket

SELECT company_id, MIN(date) AS breakDate 
  FROM <table>
 WHERE sales_amount >= 100000 
 GROUP BY company_id, RANGE_BUCKET(sales_amount, [100000, 500000, 1000000]);
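
To see which bucket index each amount lands in without running BigQuery, here is a minimal Python sketch. It assumes that RANGE_BUCKET, for non-NULL and non-NaN inputs, returns the number of boundaries that are less than or equal to the point, which is what bisect_right computes. The helper name range_bucket is made up for this illustration.

```python
from bisect import bisect_right

# Same boundaries as in the query above.
BOUNDARIES = [100000, 500000, 1000000]


def range_bucket(point, boundaries):
    """Rough model of RANGE_BUCKET: count of boundaries <= point.

    Assumption: NULL and NaN handling is ignored in this sketch.
    """
    return bisect_right(boundaries, point)


for amount in (99999, 100000, 499999, 500000, 1100000):
    print(amount, range_bucket(amount, BOUNDARIES))
```

An index of 0 means the amount is below the lowest boundary, which is why the query filters with sales_amount >= 100000.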

Example:

WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-04' AS date, 500000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
)  
SELECT company_id, 
       buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
       ARRAY_AGG(sales_amount ORDER BY date)[OFFSET(0)] AS sales_amount,
       MIN(date) AS breakDate
  FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
 WHERE sales_amount >= 100000 
 GROUP BY company_id, bucket_id
;

output: (result table image not preserved)
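
As a rough local check of the grouping logic, the sketch below replays the same five rows in Python. It is a model of the query under the assumption that RANGE_BUCKET behaves like bisect_right, not BigQuery's actual output; first_hits is a hypothetical helper name.

```python
from bisect import bisect_right

BOUNDARIES = [100000, 500000, 1000000]

# Same rows as the `sales` CTE in the example.
SALES = [
    ("c1", "2022-05-01", 99999),
    ("c1", "2022-05-02", 100000),
    ("c1", "2022-05-03", 499999),
    ("c1", "2022-05-04", 500000),
    ("c1", "2022-05-05", 1100000),
]


def first_hits(rows, boundaries):
    """Return (company_id, bucket_id, sales_amount, break_date) per company and bucket."""
    first = {}
    for company_id, date, amount in rows:
        idx = bisect_right(boundaries, amount)  # models RANGE_BUCKET
        if idx == 0:
            continue  # below the lowest boundary, like the WHERE filter
        key = (company_id, boundaries[idx - 1])  # lower bound names the bucket
        # ISO-8601 date strings sort chronologically, so string comparison is enough.
        if key not in first or date < first[key][0]:
            first[key] = (date, amount)
    return sorted(
        (company_id, bucket_id, amount, date)
        for (company_id, bucket_id), (date, amount) in first.items()
    )


for row in first_hits(SALES, BOUNDARIES):
    print(row)
```

With this data the sketch yields one row per bucket: 100000 first hit on 2022-05-02, 500000 on 2022-05-04, and 1000000 on 2022-05-05 (with a sales amount of 1100000).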

Example 2:

WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-07' AS date, 500000 AS sales_amount
),
bucketized_sales AS (
  SELECT company_id, 
         buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
         ARRAY_AGG(sales_amount ORDER BY date)[OFFSET(0)] AS sales_amount,
         MIN(date) AS breakDate
    FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
   WHERE sales_amount >= 100000 
   GROUP BY company_id, bucket_id
)
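SELECT *,
       -- factor = highest bucket this company reached on any earlier breakDate
       MAX(bucket_id) OVER (PARTITION BY company_id ORDER BY breakDate
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS factor
  FROM bucketized_sales
 WHERE TRUE
QUALIFY bucket_id > factor OR factor IS NULL  -- drop lower buckets first hit after a higher one
 ORDER BY breakDate
;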

output: (result table image not preserved)
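
The filtering step can be reasoned through locally with a small Python sketch. It models the window function as a running maximum per company, under the same assumption that RANGE_BUCKET matches bisect_right; the helper names first_hits and drop_lower_after_higher are hypothetical, and this is not BigQuery's actual output.

```python
from bisect import bisect_right

BOUNDARIES = [100000, 500000, 1000000]

# Same rows as the `sales` CTE in Example 2: 500000 is first hit after 1100000.
SALES = [
    ("c1", "2022-05-01", 99999),
    ("c1", "2022-05-02", 100000),
    ("c1", "2022-05-03", 499999),
    ("c1", "2022-05-05", 1100000),
    ("c1", "2022-05-07", 500000),
]


def first_hits(rows, boundaries):
    """Earliest date per (company_id, bucket_id), as (company_id, bucket_id, break_date)."""
    first = {}
    for company_id, date, amount in rows:
        idx = bisect_right(boundaries, amount)  # models RANGE_BUCKET
        if idx == 0:
            continue  # below the lowest boundary
        key = (company_id, boundaries[idx - 1])
        if key not in first or date < first[key]:
            first[key] = date
    return [(c, b, d) for (c, b), d in first.items()]


def drop_lower_after_higher(hits):
    """Keep a hit only if its bucket exceeds every bucket the company reached earlier."""
    kept = []
    highest_so_far = {}  # company_id -> highest bucket already reached
    for company_id, bucket_id, break_date in sorted(hits, key=lambda h: (h[0], h[2])):
        previous_max = highest_so_far.get(company_id)
        if previous_max is None or bucket_id > previous_max:
            kept.append((company_id, bucket_id, break_date))
            highest_so_far[company_id] = bucket_id
    return kept


for row in drop_lower_after_higher(first_hits(SALES, BOUNDARIES)):
    print(row)
```

For this data, the 500000 bucket is first reached on 2022-05-07, after the 1000000 bucket on 2022-05-05, so the sketch keeps only the 100000 and 1000000 events.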

Is there a way I could add the bucket info to the select query to understand which bucket it is? Also, how can I filter out lower-order events that come after higher-order ones? For example, a company hits 500K sales first, and then its sales go down and hit 100K the next month. In that case the 100K event should be filtered out. - ted, Commented May 29, 2022 at 10:56
I've updated the query to display the bucket info for your first question. - Jaytiger, Commented May 29, 2022 at 11:10
Hey, thanks, it works perfectly for me. The only thing I am still wondering about: earlier I was left joining the WITH statements and filtering out low sales amount events that happened after high sales amount events. - ted, Commented May 29, 2022 at 11:13
For your second question, I've assumed gross sales, which increase over time. If you're considering net sales, you may need another approach to get the result you want. - Jaytiger, Commented May 29, 2022 at 11:15
The sales figure is that particular day's sales. It is not cumulative. - ted, Commented May 29, 2022 at 11:19
