I am comparing the collation of Intl.Collator in three browsers (Chrome, Firefox and WebKit) and in Node with Postgres collations, and I realized that the sort orders of most implementations differ considerably. The only two implementations that match each other for the locale 'en' are Chrome and Firefox. For example, the letters from MATHEMATICAL SANS-SERIF ITALIC SMALL A to Z are sorted together with the letters a - z by Intl with locales such as en in Chrome and Firefox, but not in Postgres. I had assumed that all implementations are based on the same CLDR data. The number of positions at which the sorted character lists differ between implementations (tested with Playwright):

Chrome - Node 1474
Chrome - Webkit 34727
Chrome - Firefox 0
Chrome - Postgres 25781
Webkit - Node 34727
Webkit - Postgres 34727
Node   - Postgres 34892

A more predictable sort order would make it possible to sort consistently in the frontend and the backend.
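The example with the mathematical letters mentioned above can be checked in isolation. This is only a minimal sketch: the four sample characters are my own choice, and the version lookup assumes a Node runtime (browsers do not expose these values, so it prints `undefined` there).

```typescript
// Minimal repro: sort two ASCII letters together with their
// MATHEMATICAL SANS-SERIF ITALIC counterparts.
const mathA = String.fromCodePoint(0x1d622); // MATHEMATICAL SANS-SERIF ITALIC SMALL A
const mathB = String.fromCodePoint(0x1d623); // MATHEMATICAL SANS-SERIF ITALIC SMALL B

const collator = new Intl.Collator('en');
const sorted = ['b', mathB, 'a', mathA].sort(collator.compare);

// Print code points so the order is readable regardless of font support.
console.log(sorted.map((c) => 'U+' + c.codePointAt(0)!.toString(16).toUpperCase()).join(' '));

// Node reports the bundled ICU and CLDR versions; this is an assumption about
// the runtime, hence the defensive lookup.
const versions = (globalThis as any).process?.versions;
console.log('ICU:', versions?.icu, 'CLDR:', versions?.cldr);
```

Running the same snippet in each browser console and in Node shows directly whether the mathematical letters are interleaved with a and b or placed elsewhere.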

The code is as follows:

The Unicode data is parsed from 'UnicodeData.txt':

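import * as fs from 'fs';
import * as path from 'path';

// Assumed shape; the original type definition is not shown.
export interface UnicodeData { codeValue: string }

// Reads UCD/UnicodeData.txt and keeps the first field (the hex code point) of every line.
export async function parseUnicodeData(unicodePath: string): Promise<UnicodeData[]> {
  return (await fs.promises.readFile(path.join(unicodePath, 'UCD/UnicodeData.txt'), 'utf-8'))
    .split('\n')
    .filter((line) => line !== '')
    .map((line) => line.split(';'))
    .map(([codeValue]) => ({codeValue}));
}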

In Postgres:

CREATE TABLE unicode.character (
  character text PRIMARY KEY
);

-- Filtered codepoints: '0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'

INSERT INTO unicode.character VALUES
  (chr(1)),
  (chr(2)),
  ...
  (chr(1114109));

Then in Node:

import {Pool} from 'pg';
import {parseUnicodeData} from 'util-unicode-parser';
import {chromium, webkit, firefox} from 'playwright';

const pool = new Pool();

export async function compareCollations(
  unicodeDirectory: string,
  postgresCollationName: string,
  intlLocale: string,
  intlSettings = '{}',
) {
  // setting up playwright
  const pages = await Promise.all([chromium, webkit, firefox]
    .map((browserType) => browserType.launch({headless: true}))
    .map(async (browser) => (await browser).newContext())
    .map(async (context) => (await context).newPage()));

  // parsing unicode data
  const unicodeData = (await parseUnicodeData(unicodeDirectory))
    // code points filtered out because they cannot be inserted into Postgres
    .filter(({codeValue}) => !['0000', 'D800', 'DB7F', 'DB80', 'DBFF', 'DC00', 'DFFF'].includes(codeValue));
  const browserCollationString = `[${unicodeData
      .map(({codeValue}) => 
      `String.fromCodePoint(parseInt('${codeValue}', 16))`).join(',')
    }].sort(new Intl.Collator('${intlLocale}', ${intlSettings}).compare)`;

  // creating sorted arrays of all characters
  const nodeCollation = unicodeData
    .map(({codeValue}) => String.fromCodePoint(parseInt(codeValue, 16)))
    .sort(new Intl.Collator(intlLocale, JSON.parse(intlSettings)).compare);
  const chromeCollation = await pages[0].evaluate(browserCollationString);
  const webkitCollation = await pages[1].evaluate(browserCollationString);
  const firefoxCollation = await pages[2].evaluate(browserCollationString);
  const postgresCollation = (await pool.query(`SELECT character from unicode.character ORDER BY character COLLATE "${postgresCollationName}";`))
    .rows.map(({character}) => character);

  // comparing the sorted arrays position by position; only the first pair is shown
  console.log(chromeCollation
    .map((c, i) => [c, nodeCollation[i], c.codePointAt(0), nodeCollation[i].codePointAt(0)])
    .filter(([c, r, pc, rc]) => c !== r)
    .sort((a, b) => a[2] < b[2] ? -1 : 1)
    .length);
  // ...
}
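The position-by-position comparison used for the numbers above can be written as a small helper. This is a sketch only: `countPositionalDifferences` is a hypothetical name and is not part of the code above.

```typescript
// Hypothetical helper: counts the positions at which two equally long
// sorted arrays hold different characters.
function countPositionalDifferences(a: string[], b: string[]): number {
  if (a.length !== b.length) {
    throw new Error(`Length mismatch: ${a.length} vs ${b.length}`);
  }
  let differences = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differences++;
  }
  return differences;
}

// Toy example: only 'b' is placed differently, but every entry after its
// original position shifts as well.
const left = ['a', 'b', 'c', 'd'];
const right = ['a', 'c', 'd', 'b'];
console.log(countPositionalDifferences(left, right)); // 3
```

As the toy example shows, a single displaced character changes every position between its old and new place, so the counts measure how far the lists are out of alignment rather than how many characters are ordered differently.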
  • Can you share the code involved? Commented Apr 5, 2024 at 7:43
