Update for unicode 16.0 #215

@YourFin

Unicode 16 has been released, and bstr doesn't match spec behavior anymore.

From some fiddling around last night, it looks like the following needs to happen:

  • Update the test cases in src/unicode/data
  • Update the regex-automata dependency to 0.4.8 to ensure the transitive dependency on regex-syntax supports Unicode 16
  • Update grapheme.sh to the new grammar in the spec. As far as I can tell, the changes are the inclusion of CR and LF as grapheme breaks on their own (previously only Windows-style CR LF was included), and rule GB9c for the new Indic_Conjunct_Break behavior. Draft Implementation
  • Regenerate the finite state machines with an up-to-date regex-cli
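
To make the CR/LF point concrete, here is a minimal sketch of just the newline part of UAX #29 (rules GB3 through GB5). It is not bstr's implementation and not the grapheme.sh regex; `split_newline_clusters` is a hypothetical helper, and it treats every other character as its own cluster, which a real segmenter would not do.

```rust
/// Splits `text` using only the CR/LF handling of UAX #29 rules GB3-GB5.
/// Assumption for illustration: every other character is emitted as its
/// own cluster, which is NOT how a full grapheme segmenter behaves.
fn split_newline_clusters(text: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut iter = text.char_indices().peekable();
    while let Some((start, ch)) = iter.next() {
        let mut end = start + ch.len_utf8();
        // GB3: never break between CR and LF, so consume the LF as well.
        if ch == '\r' {
            if let Some(&(_, '\n')) = iter.peek() {
                iter.next();
                end += 1; // LF is one byte in UTF-8
            }
        }
        // GB4/GB5: a lone CR or a lone LF has a break on both sides,
        // so it ends up as a cluster of its own.
        clusters.push(&text[start..end]);
    }
    clusters
}
```

With this sketch, `split_newline_clusters("a\r\nb\rc\nd")` gives `["a", "\r\n", "b", "\r", "c", "\n", "d"]`: the CR LF pair stays together, while the lone CR and the lone LF each stand alone.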

Unfortunately, the spec implementation of GB9c depends on a few obscure character classes that neither regex nor ucd-generate (which codegens the relevant table) currently supports.

We could get around this by inlining the character class definitions; that would be pretty brittle, though, and upstreaming support seems like a better solution.
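
For reference, this is roughly what GB9c asks for, written as a hand-rolled check rather than a regex. It is a sketch under assumptions: the `Incb` enum and `gb9c_forbids_break` are hypothetical names, and the per-character Indic_Conjunct_Break values are passed in by the caller instead of being looked up in a generated table, which is exactly the piece that is missing upstream.

```rust
/// Indic_Conjunct_Break values that GB9c cares about.
/// `Other` stands for any character without one of the three values.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Incb {
    Consonant,
    Linker,
    Extend,
    Other,
}

/// Returns true if GB9c forbids a break before `next`, given the InCB
/// values of the characters that precede it (in text order).
///
/// GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant
fn gb9c_forbids_break(before: &[Incb], next: Incb) -> bool {
    if next != Incb::Consonant {
        return false;
    }
    let mut saw_linker = false;
    // Walk backwards over the run of Extend/Linker characters.
    for &class in before.iter().rev() {
        match class {
            Incb::Linker => saw_linker = true,
            Incb::Extend => {}
            // The run must contain a Linker and start right after a Consonant.
            Incb::Consonant => return saw_linker,
            Incb::Other => return false,
        }
    }
    // Ran out of characters without finding the leading Consonant.
    false
}
```

For example, `gb9c_forbids_break(&[Incb::Consonant, Incb::Linker], Incb::Consonant)` is true, while replacing the `Linker` with `Extend` makes it false because the run no longer contains a linker.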
