9

Using PostgreSQL 9.5. It seems like Postgres is being inconsistent with how it compares strings similar to لى. A unique constraint is considering some strings equal that GROUP BY is considering different.

I have a select query that is using GROUP BY on a TEXT column and an aggregate function on some other column to ensure uniqueness of the TEXT column in the output, and I'm inserting the results into a table with a primary key (and thus unique) constraint on the TEXT column.

Looks similar to this; I've just changed the table names for simplicity:

 INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
            SELECT 
            sq1.string string, sq2.score / sq1.score
            FROM
            (
                SELECT n.string string, SUM(n.score) score
                FROM
                othertable1 n % has string TEXT (non-unique) and score INT
                GROUP BY string
            ) sq1,
            (
                SELECT n.string string, SUM(n.score) score
                FROM
                othertable2 n % has string TEXT (non-unique) and score INT
                GROUP BY string
            ) sq2
            WHERE sq1.string = sq2.string
            ORDER BY score DESC LIMIT 100000
        );

This should never fail, right? It fails: violation of unique constraint due on key لى.

And it has worked many times before with other data sets containing millions of rows in PG 9.3; I don't know whether لى was in the data back then. I know Arabic has decorations you can put on the letters, so I wonder if that's tripping it up.

Does anyone have an alternative explanation, or should I report this as a bug once I can reproduce it more easily?

UPDATE: Confirmed that the query runs successfully on a PostgreSQL 9.3 server with the same data. There are some moving parts here, so I'm trying to find exactly what the problematic strings are so I can make a simple list of queries anyone can run to expose a bug.

UPDATE 2: Argh, I can't get my database to give me a set of strings I can copy into a table and expose a bug. I've been trying to do it with COPY TO. Something along the way keeps stripping the Arabic text of the differences that are making it fail, I think. But I tried a simpler query, and it's also failing. It's more obvious that this should work:

INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
    SELECT n.string string, SUM(n.score) score
    FROM
    othertable2 n % has string TEXT (non-unique) and score INT
    GROUP BY string
);

I'm still working towards getting something others can try because, of course, a bug report is useless if I say that it only works on my data.

UPDATE 3: I ran it again with different data and ran into the same problem with Cyrillic characters У and В. Making a table containing them didn't reveal anything. Same problem as with the Arabic text, I think. Something is getting stripped along the way.

UPDATE 4: This is definitely a bug. I'm still trying to figure out how to report this. I found a workaround:

WITH glitch(string, score) AS (
    SELECT n.string string, SUM(n.score) score
    FROM
    othertable2 n % has string TEXT (non-unique) and score INT
    GROUP BY string
)
INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
    SELECT DISTINCT ON(string) * FROM glitch
);
3
  • All my TEXT columns are using default COLLATE settings, and my locale is en_US.UTF-8. Commented Feb 21, 2016 at 18:46
  • Is mytable empty before running this INSERT? Commented Feb 23, 2016 at 8:04
  • Yes. I made sure it was empty. Furthermore, if I run the SELECT without INSERTing the results in psql, it opens a vim window of the results, I search for لى, and I get multiple results that look the same but with different scores. Commented Feb 23, 2016 at 19:37

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.