I have a postgres table with a 100K± records representing edges of graph (will grows even to 1M) defined as (I left the relevant parts only):
CREATE TABLE public.collection_lineages (
id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY( INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1 NO CYCLE),
upstream_id int8 NOT NULL,
downstream_id int8 NOT NULL,
);
worth mentioning:
- table has cycles.
- there are multiple paths reaching from A -> B.
- my main query is: get all downstream collections.
After researching about postgres graph traversal I found postgres CYCLE DETECTION and I wrote this query:
WITH RECURSIVE
"recursive_table"("id",
"upstream_id",
"downstream_id",
"depth") AS (
SELECT
"working_table"."id",
"working_table"."upstream_id",
"working_table"."downstream_id",
1
FROM
"collection_lineages" AS "working_table"
WHERE
"working_table"."upstream_id" = $1
UNION ALL
SELECT
"working_table"."id",
"working_table"."upstream_id",
"working_table"."downstream_id",
"recursive_table"."depth" + 1
FROM
"collection_lineages" AS "working_table",
"recursive_table"
WHERE
"working_table"."upstream_id" = "recursive_table"."downstream_id") CYCLE "id" SET "is_cycle" USING PATH
-- filter out cycles
SELECT
"recursive_table"."id"
FROM
"recursive_table"
WHERE
NOT "recursive_table"."is_cycle";
when querying it with some collection_id as $1
, CTE traverse the table but revisiting edges. simple example will be:
B
/ \
A D-E-F...-Z
\ /
C
query with A
will produce:
A->B, A->C, B->D, C->D, D->E, D->E...
notice path D->Z returned twice due to the nature of recursive CTEs. This isn't efficient in scale.
My question is: is there a better way query the graph using CTE avoiding visited edges?
I tried adding: WHERE "working_table"."id" NOT IN (SELECT id from recursive_table)
but I'm getting: recursive reference to query "recursive_table" must not appear within a subquery
.
My workaround is to SELECT DISTINCT "recursive_table"."id"
but the issue is still there and query takes unreasonable amount of time.
I already tried writing a simple BFS script for that, maintaining visited_ids
table, iterating layer after layer. It shows better performance in general.
I also looked at age extension but this works with Cypher and force me to create the graph via them and I'd like to use my existing table.
DISTINCT
from it, this was my original query. I'm asking if there is another way avoiding the repetitive edges in the response rather than usingDISTINCT
? something like NOT duplicate records during recursion?