0

I have a postgres table with a 100K± records representing edges of graph (will grows even to 1M) defined as (I left the relevant parts only):

CREATE TABLE public.collection_lineages (
    id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY( INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1 NO CYCLE),
    upstream_id int8 NOT NULL,
    downstream_id int8 NOT NULL,
);

worth mentioning:

  • table has cycles.
  • there are multiple paths reaching from A -> B.
  • my main query is: get all downstream collections.

After researching about postgres graph traversal I found postgres CYCLE DETECTION and I wrote this query:

WITH RECURSIVE
"recursive_table"("id",
"upstream_id",
"downstream_id",
"depth") AS (
SELECT
    "working_table"."id",
    "working_table"."upstream_id",
    "working_table"."downstream_id",
    1
FROM
    "collection_lineages" AS "working_table"
WHERE
    "working_table"."upstream_id" = $1
UNION ALL
SELECT
    "working_table"."id",
    "working_table"."upstream_id",
    "working_table"."downstream_id",
    "recursive_table"."depth" + 1
FROM
    "collection_lineages" AS "working_table",
    "recursive_table"
WHERE
    "working_table"."upstream_id" = "recursive_table"."downstream_id") CYCLE "id" SET "is_cycle" USING PATH

-- filter out cycles
SELECT
    "recursive_table"."id"
FROM
    "recursive_table"
WHERE
    NOT "recursive_table"."is_cycle";

when querying it with some collection_id as $1, CTE traverse the table but revisiting edges. simple example will be:

  B 
 / \
A   D-E-F...-Z
 \ /
  C

query with A will produce:

A->B, A->C, B->D, C->D, D->E, D->E...

notice path D->Z returned twice due to the nature of recursive CTEs. This isn't efficient in scale.

My question is: is there a better way query the graph using CTE avoiding visited edges?

I tried adding: WHERE "working_table"."id" NOT IN (SELECT id from recursive_table) but I'm getting: recursive reference to query "recursive_table" must not appear within a subquery.

My workaround is to SELECT DISTINCT "recursive_table"."id" but the issue is still there and query takes unreasonable amount of time.

I already tried writing a simple BFS script for that, maintaining visited_ids table, iterating layer after layer. It shows better performance in general.

I also looked at age extension but this works with Cypher and force me to create the graph via them and I'd like to use my existing table.

4
  • Start by removing the DISTINCT in the recursive CTE: the CYCLE already stops the recursion and having the "depth" in it prevents it to be useful in the general case: if you had A-B1-B2-D, the depth will be 4 when reaching D while it's 3 when coming from the A-C-D branch. Commented May 12, 2024 at 13:35
  • @p3consulting sorry, I edited the query so I removed the DISTINCT from it, this was my original query. I'm asking if there is another way avoiding the repetitive edges in the response rather than using DISTINCT? something like NOT duplicate records during recursion? Commented May 15, 2024 at 6:16
  • The usual techniques to detect cycles for database dialects not having the CYCLE keyword as some have... often based on building a path string and checking the current node is not yet in or by using power of 2 math to maintain a bitmap where bitnum == rowid() of the node. Commented May 15, 2024 at 10:14
  • I ended up with the BFS script, it's way faster gave me 42K rows for 28 depth graph in just 2.4s. Sorry folks recursive CTEs produce repetitive records - it's just doesn't scale. Commented May 30, 2024 at 12:12

1 Answer 1

0

I ended up with the BFS script, it's way faster gave me 42K rows for 28 depth graph in just 2.4s. Sorry folks recursive CTEs produce repetitive records - it's just doesn't scale.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.