Inconsistent equality checking of Unicode text in Postgres 9.5?

Ask Question

Asked 9 years, 9 months ago

Modified 2 years, 3 months ago

Viewed 770 times

Using PostgreSQL 9.5. It seems like Postgres is being inconsistent with how it compares strings similar to لى. A unique constraint is considering some strings equal that GROUP BY is considering different.

I have a select query that is using GROUP BY on a TEXT column and an aggregate function on some other column to ensure uniqueness of the TEXT column in the output, and I'm inserting the results into a table with a primary key (and thus unique) constraint on the TEXT column.

Looks similar to this; I've just changed the table names for simplicity:

 INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
            SELECT 
            sq1.string string, sq2.score / sq1.score
            FROM
            (
                SELECT n.string string, SUM(n.score) score
                FROM
                othertable1 n % has string TEXT (non-unique) and score INT
                GROUP BY string
            ) sq1,
            (
                SELECT n.string string, SUM(n.score) score
                FROM
                othertable2 n % has string TEXT (non-unique) and score INT
                GROUP BY string
            ) sq2
            WHERE sq1.string = sq2.string
            ORDER BY score DESC LIMIT 100000
        );

This should never fail, right? It fails: violation of unique constraint due on key لى.

And it has worked many times before with other data sets containing millions of rows in PG 9.3; I don't know whether لى was in the data back then. I know Arabic has decorations you can put on the letters, so I wonder if that's tripping it up.

Does anyone have an alternative explanation, or should I report this as a bug once I can reproduce it more easily?

UPDATE: Confirmed that the query runs successfully on a PostgreSQL 9.3 server with the same data. There are some moving parts here, so I'm trying to find exactly what the problematic strings are so I can make a simple list of queries anyone can run to expose a bug.

UPDATE 2: Argh, I can't get my database to give me a set of strings I can copy into a table and expose a bug. I've been trying to do it with COPY TO. Something along the way keeps stripping the Arabic text of the differences that are making it fail, I think. But I tried a simpler query, and it's also failing. It's more obvious that this should work:

INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
    SELECT n.string string, SUM(n.score) score
    FROM
    othertable2 n % has string TEXT (non-unique) and score INT
    GROUP BY string
);

I'm still working towards getting something others can try because, of course, a bug report is useless if I say that it only works on my data.

UPDATE 3: I ran it again with different data and ran into the same problem with Cyrillic characters У and В. Making a table containing them didn't reveal anything. Same problem as with the Arabic text, I think. Something is getting stripped along the way.

UPDATE 4: This is definitely a bug. I'm still trying to figure out how to report this. I found a workaround:

WITH glitch(string, score) AS (
    SELECT n.string string, SUM(n.score) score
    FROM
    othertable2 n % has string TEXT (non-unique) and score INT
    GROUP BY string
)
INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
    SELECT DISTINCT ON(string) * FROM glitch
);

edited Jun 2, 2016 at 0:58

asked Feb 21, 2016 at 6:49

sudo

1,0611 gold badge11 silver badges18 bronze badges

All my TEXT columns are using default COLLATE settings, and my locale is en_US.UTF-8.

sudo
– sudo

2016-02-21 18:46:42 +00:00
Commented Feb 21, 2016 at 18:46
Is mytable empty before running this INSERT?

ypercubeᵀᴹ
– ypercubeᵀᴹ

2016-02-23 08:04:28 +00:00
Commented Feb 23, 2016 at 8:04
Yes. I made sure it was empty. Furthermore, if I run the SELECT without INSERTing the results in psql, it opens a vim window of the results, I search for لى, and I get multiple results that look the same but with different scores.

sudo
– sudo

2016-02-23 19:37:24 +00:00
Commented Feb 23, 2016 at 19:37

Add a comment |

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Stack Exchange Network

Inconsistent equality checking of Unicode text in Postgres 9.5?

0

Hot Network Questions

Inconsistent equality checking of Unicode text in Postgres 9.5?

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Related

Hot Network Questions