Using PostgreSQL 9.5. It seems like Postgres is being inconsistent with how it compares strings similar to لى. A unique constraint is considering some strings equal that GROUP BY is considering different.
I have a select query that is using GROUP BY on a TEXT column and an aggregate function on some other column to ensure uniqueness of the TEXT column in the output, and I'm inserting the results into a table with a primary key (and thus unique) constraint on the TEXT column.
Looks similar to this; I've just changed the table names for simplicity:
INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
SELECT
sq1.string string, sq2.score / sq1.score
FROM
(
SELECT n.string string, SUM(n.score) score
FROM
othertable1 n % has string TEXT (non-unique) and score INT
GROUP BY string
) sq1,
(
SELECT n.string string, SUM(n.score) score
FROM
othertable2 n % has string TEXT (non-unique) and score INT
GROUP BY string
) sq2
WHERE sq1.string = sq2.string
ORDER BY score DESC LIMIT 100000
);
This should never fail, right? It fails: violation of unique constraint due on key لى.
And it has worked many times before with other data sets containing millions of rows in PG 9.3; I don't know whether لى was in the data back then. I know Arabic has decorations you can put on the letters, so I wonder if that's tripping it up.
Does anyone have an alternative explanation, or should I report this as a bug once I can reproduce it more easily?
UPDATE: Confirmed that the query runs successfully on a PostgreSQL 9.3 server with the same data. There are some moving parts here, so I'm trying to find exactly what the problematic strings are so I can make a simple list of queries anyone can run to expose a bug.
UPDATE 2: Argh, I can't get my database to give me a set of strings I can copy into a table and expose a bug. I've been trying to do it with COPY TO. Something along the way keeps stripping the Arabic text of the differences that are making it fail, I think. But I tried a simpler query, and it's also failing. It's more obvious that this should work:
INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
SELECT n.string string, SUM(n.score) score
FROM
othertable2 n % has string TEXT (non-unique) and score INT
GROUP BY string
);
I'm still working towards getting something others can try because, of course, a bug report is useless if I say that it only works on my data.
UPDATE 3: I ran it again with different data and ran into the same problem with Cyrillic characters У and В. Making a table containing them didn't reveal anything. Same problem as with the Arabic text, I think. Something is getting stripped along the way.
UPDATE 4: This is definitely a bug. I'm still trying to figure out how to report this. I found a workaround:
WITH glitch(string, score) AS (
SELECT n.string string, SUM(n.score) score
FROM
othertable2 n % has string TEXT (non-unique) and score INT
GROUP BY string
)
INSERT INTO mytable ( % mytable has string TEXT PRIMARY KEY, score INT
SELECT DISTINCT ON(string) * FROM glitch
);
mytableempty before running this INSERT?لى, and I get multiple results that look the same but with differentscores.