Why does the Java collator for the Hungarian locale mix E and É?

Question

I want to order a huge list of countries, and I have noticed that, for example, E and É are treated as equals. But in the Hungarian grammar, E comes before É, so a rule should be added, 'E < É'.

Example code:

import java.text.Collator;
import java.util.List;
import java.util.Locale;

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
Locale locale = Locale.of("hu", "HU");
Collator collator = Collator.getInstance(locale);
List<String> orderedCountries = countries.stream().sorted(collator).toList();
System.out.println(orderedCountries);

The result is this:

[El Salvador, Eritrea, Észak-korea, Észtország, Etiópia]

The expected result:

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

Setting the collator strength changes nothing. The rule set clearly lacks an ordering of the accented letters relative to their unaccented counterparts, such as A < Á, I < Í, E < É, etc.

I use OpenJDK 21 (21.0.2).

Why is there a rule set for Hungarian language that fails to follow the rules of the Hungarian language? I don't want to write my own rules to implement functionality that should already be there. Maybe there is a good reason for this. Or is there a different way to achieve the expected result with the default Java libraries?

The concept here is called "collation order". Knowing the right term can help you find information. Collation order is a characteristic of locale / localization (and not a matter of grammar). Knowing the right classification can also help you find information.
– John Bollinger, Commented yesterday
Here is the CLDR entry for Hungarian collation. It does not contain the rule "E < É". That's why.
– Sweeper, Commented yesterday
Does this occur when you use Hungarian Notation? What if you move to Polish Notation? Is CLDR the variant of TLDR? [[ Just Joking !! ]]
– Prem, Commented 6 hours ago
At least one answer smells of having been generated by ChatGPT (or similar).
– Peter Mortensen, Commented 1 hour ago

DuncG · Accepted Answer · 2025-10-31 11:29:11Z

Java isn’t missing any grammar rules. The Hungarian collation in OpenJDK follows the Unicode/CLDR standard, where accented letters (like É) are treated as secondary forms of their base letter (E). Because of this, the traditional Hungarian dictionary order (A < Á < B < C < Cs … E < É) is not applied by default.
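
To make the "secondary forms" point concrete, here is a minimal sketch that compares strings with the JDK's built-in Hungarian collator at different strengths. The class and helper names are made up for illustration. At primary strength the accent is ignored, and at the default tertiary strength it only breaks ties between otherwise identical strings, which is why changing the strength does not move Észak-korea after Etiópia.

```java
import java.text.Collator;
import java.util.List;
import java.util.Locale;

public class HungarianStrengthDemo {

    // Hypothetical helper: sign of the comparison at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    // Hypothetical helper: sorts with the built-in Hungarian collator.
    static List<String> sortAt(int strength, List<String> words) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
        collator.setStrength(strength);
        return words.stream().sorted(collator).toList();
    }

    public static void main(String[] args) {
        // Accent is ignored at primary strength: prints 0.
        System.out.println(compareAt(Collator.PRIMARY, "E", "É"));
        // Accent breaks the tie at tertiary strength: prints -1.
        System.out.println(compareAt(Collator.TERTIARY, "E", "É"));
        // The first primary difference (sz before t) decides: prints -1.
        System.out.println(compareAt(Collator.TERTIARY, "Észak", "Etiópia"));

        List<String> countries = List.of(
                "Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
        // Same order as reported in the question, at either strength.
        System.out.println(sortAt(Collator.PRIMARY, countries));
        System.out.println(sortAt(Collator.TERTIARY, countries));
    }
}
```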

No built-in Java Collator implements the full Hungarian dictionary alphabet.

If you need the real Hungarian dictionary order, you must use a tailored collation. For example, with ICU4J:

Collator coll = Collator.getInstance(new ULocale("hu@collation=dictionary"));

This collator follows the correct Hungarian dictionary rules, including treating E and É as separate letters.

edited yesterday

DuncG

answered yesterday

Siripireddy Giri

3 Comments

Matthieu M. 5 hours ago

I would like to note that having multiple collation orders depending on the usecase is not unusual. For example, if I remember correctly, Spanish uses a different collation order between phone books (yeah, disappearing as they are) and dictionaries, with regard to whether LL is treated as its own letter, or as just two Ls.

andrewJames 3 hours ago

Have you run this code? What output does it produce using the data in the question? There is no such thing as collation=dictionary for locale hu in the CLDR (as used by ICU4J). There is only collation=standard. See the hu.xml file: it only contains <collation type="standard" > (and a proposed draft - also standard).

andrewJames 3 hours ago

As noted by @MatthieuM. there are examples in the CLDR of languages having multiple collations, such as Spanish. Another one: the German data, which has multiple sets of collation rules.

Aniketh T S · Accepted Answer · 2025-10-31 11:09:31Z

In Hungarian, letters like É and E are not the same. The correct alphabetical order should be E < É, I < Í, and so on.
So it’s natural to expect Java’s Hungarian Collator to sort words that way.

But here’s the catch:
Even though Java supports the Hungarian locale, the built-in Collator doesn’t fully follow Hungarian collation rules. It still treats accented letters like “É” as just variants of “E,” not as separate letters.

This happens because Java’s locale data comes from CLDR (Common Locale Data Repository), which actually does define the proper Hungarian rules.
However, the Java Collator class doesn’t use those CLDR rules — it still relies on older internal data that doesn’t include the accent order for Hungarian. In other words, Java’s Collator knows about the Hungarian language but not about the finer details of its alphabet.

That’s why no matter how you tweak the Collator strength, “E” and “É” are treated as equals.

Use ICU4J

ICU4J is a library that handles collation (and many other locale features) far more accurately. It fully supports Hungarian ordering right out of the box.

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import java.util.List;

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
Collator collator = Collator.getInstance(ULocale.forLanguageTag("hu-HU"));
collator.setStrength(Collator.TERTIARY);

List<String> ordered = countries.stream().sorted(collator).toList();
System.out.println(ordered);

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

That’s the correct Hungarian order — E < É, just as it should be.

Did you try running this code? On macOS 26.0.1, this code produces the same output as the one shown in the question, not the expected output.

Peter Mortensen · Accepted Answer · 2025-11-01 17:07:28Z

Regarding your specific question:

Why is there a rule set for Hungarian language that fails to follow the rules of the Hungarian language?

You would almost certainly have to ask the language experts who contributed these specific collation rules to the Unicode CLDR project, if you want a definitive answer.

Having said that, you can see that the CLDR itself alludes to this:

There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated).

There is also a discussion (from almost 20 years ago) when the MySQL team was trying to define Hungarian collation rules - outside the scope of this Java question, but still interesting. I won't try to capture the full thread here, but it includes the following snippets:

"Most people agree that this is the Hungarian alphabet: a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs"

Some people also say there's a secondary sort rule for these short/long vowel pairs: a á, e é, i í, o ó, ö ő, u ú, ü ű For these pairs, long = short usually, but long > short if all else is equal.

An alternative collation sometimes used (in libraries, and some dictionaries and lexica) is according to the basic latin [sic] alphabet, whit the accented letters having the same value as the not accented. Or anything in between. E.g., honoring the digraphs and the trigraph, but leaving the accents out of the business.

So, "leaving the accents out of the business" seems to have been acceptable in some cases.

The thread goes on to discuss various other complexities. Worth a read.

If you feel strongly enough you could ask the CLDR team for more info about the history of this topic.

There are already some tickets about Hungarian collation such as collation rules for hu (Hungarian). This one was marked "won't fix".

I just want my expected sort order to work - what can I do?

I appreciate that you want to provide the least surprising sort order for consumers of your data.

Here is one way you can do that in Java - but with a very large caveat that I am not any kind of expert in Hungarian. I do not know the language at all.

You can use a custom RuleBasedCollator to do that.

Here is a first (somewhat naive) implementation:

import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.List;

...

String myRules = """
     < a, A < á, Á < b, B < c, C < cs, CS, Cs < d, D <
     dz, DZ, Dz < dzs, DZS, Dzs <
     e, E < é, É < f, F < g, G < gy, GY, Gy < h, H <
     i, I < í, Í < j, J < k, K < l, L < ly, LY, Ly <
     m, M < n, N < ny, NY, Ny < o, O < ó, Ó < ö, Ö < ő, Ő < p, P <
     q, Q < r, R < s, S < sz, SZ, Sz < t, T < ty, TY, Ty <
     u, U < ú, Ú < ü, Ü < ű, Ű < v, V < w, W < x, X <
     y, Y < z, Z < zs, ZS, Zs
    """;
RuleBasedCollator myCollator = new RuleBasedCollator(myRules);

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea",
        "Etiópia", "El Salvador");
List<String> expectedResults = List.of("El Salvador", "Eritrea", "Etiópia",
        "Észak-korea", "Észtország");
List<String> orderedCountries = countries.stream().sorted(myCollator).toList();

System.out.println(expectedResults);
System.out.println(orderedCountries);

This outputs expected and actual results:

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]
[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

It uses rules such as e, E < é, É to ensure accented letters are sorted after their unaccented counterparts.
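
As a small illustration of what these rules change, the sketch below reuses the same rule string (written as a plain concatenated string rather than a text block) and checks a few pairs. The class name, the helper, and the sample words cukor and csak are illustrative choices, not something from the original thread. With these rules the accent becomes a primary difference, and the digraph cs is treated as one letter that sorts after c.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class CustomRulesCheck {

    // Same rules as in the answer above, one alphabet segment per line.
    static final String RULES =
          "< a, A < á, Á < b, B < c, C < cs, CS, Cs < d, D "
        + "< dz, DZ, Dz < dzs, DZS, Dzs "
        + "< e, E < é, É < f, F < g, G < gy, GY, Gy < h, H "
        + "< i, I < í, Í < j, J < k, K < l, L < ly, LY, Ly "
        + "< m, M < n, N < ny, NY, Ny < o, O < ó, Ó < ö, Ö < ő, Ő < p, P "
        + "< q, Q < r, R < s, S < sz, SZ, Sz < t, T < ty, TY, Ty "
        + "< u, U < ú, Ú < ü, Ü < ű, Ű < v, V < w, W < x, X "
        + "< y, Y < z, Z < zs, ZS, Zs";

    // Hypothetical helper: sign of the comparison under the custom rules.
    static int compareCustom(int strength, String a, String b) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(strength);
            return Integer.signum(collator.compare(a, b));
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // e and é now differ even at primary strength.
        System.out.println(compareCustom(Collator.PRIMARY, "e", "é"));
        // The digraph cs sorts after every word starting with plain c.
        System.out.println(compareCustom(Collator.TERTIARY, "cukor", "csak"));
        // The pair from the question.
        System.out.println(compareCustom(Collator.TERTIARY, "Etiópia", "Észak-korea"));
    }
}
```

Each of the three comparisons should report a negative sign if the rules behave as described.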

Note that I do not use a Hungarian locale anywhere in the above code.

This could be enhanced in a couple of ways (and maybe more):

(a) If you explicitly use a Hungarian locale, Java uses the following collation rules (as extracted from the CLDR rules):

CollationData_hu.java:

{ "Rule",
    /* for hu, default sorting except for the following: */
    /* add cs "ligature" between c and d. */
    /* add d<stroke> between d and e. */
    /* add gy "ligature" between g and h. */
    /* add ly "ligature" between l and l<stroke>. */
    /* add l<stroke> between l and m. */
    /* add sz "ligature" between s and t. */
    /* add zs "ligature" between z and z<abovedot> */
    /* add z<abovedot> after z.       */
    "& C < cs , cS , Cs , CS " // cs ligatures
    + "& D < \u0111, \u0110 "    // tal : african [sic] d < d-stroke
    + "& G < gy, Gy, gY, GY "    // gy ligatures
    + "& L < ly, Ly, lY, LY "    // l < ly
    + "& O < o\u0308 , O\u0308 " // O < o-umlaut
    + "< o\u030b , O\u030b "     // o-double-accute
    + "& S < sz , sZ , Sz , SZ " // s < sz ligature
    + "& U < u\u0308 , U\u0308 " // u < u-umlaut
    + "< u\u030b , U\u030b "     // u-double-accute
    + "& Z < zs , zS , Zs , ZS " // stop-stroke < zs ligature
}

As noted already in some comments, they do not handle accented characters in the way you expect/need. You could enhance your custom rules by adding these extra ones to your custom collator.

Also (b): These Hungarian-specific rules are appended to a much larger set of default rules in CollationRules.java.

This is a very long list.

You can't (shouldn't!) access this class directly - it's in a sun.util package. You could take a copy of the rules, and append your custom rules, for completeness. These rules may, of course, change in the future (but they appear to be quite stable, these days).
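
One way to avoid copying the rules by hand might be to read them from the built-in Hungarian collator with RuleBasedCollator.getRules() and append the extra tailorings. The sketch below assumes that Collator.getInstance returns a RuleBasedCollator for this locale, which is typical for OpenJDK but not guaranteed by the API, so the code checks for it. The class name and the five long-vowel rules are illustrative assumptions, not vetted Hungarian collation data, and the tailoring may shift characters that the default rules attach to these base letters.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.List;
import java.util.Locale;

public class TailoredHungarianCollator {

    // Assumption: illustrative tailorings that place each long vowel as a
    // separate letter right after its short counterpart. The built-in
    // Hungarian rules already give ö/ő and ü/ű their own positions.
    static final String LONG_VOWEL_RULES =
            "& A < á, Á & E < é, É & I < í, Í & O < ó, Ó & U < ú, Ú";

    static RuleBasedCollator create() {
        Collator base = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
        if (!(base instanceof RuleBasedCollator builtIn)) {
            throw new IllegalStateException("Built-in collator is not rule-based");
        }
        try {
            // Default rules + Hungarian rules + the extra tailorings.
            return new RuleBasedCollator(builtIn.getRules() + LONG_VOWEL_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Tailored rules failed to parse", e);
        }
    }

    static List<String> sort(List<String> words) {
        return words.stream().sorted(create()).toList();
    }

    public static void main(String[] args) {
        List<String> countries = List.of(
                "Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
        System.out.println(sort(countries));
    }
}
```

Building the collator parses the full default rule set, so it is worth creating it once and reusing it rather than calling create() for every sort.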

Just to repeat my earlier warning: I am not a Hungarian language expert. I would only want to use this type of custom approach in a very specific circumstance - so maybe only for your country name sort, and nowhere else. You may be able to improve upon my approach.

Or you may find a better way.
