
I want to order a huge list of countries, and I have noticed that, for example, E and É are treated as equals. But in the Hungarian grammar, E comes before É, so a rule should be added, 'E < É'.

Example code:

import java.text.Collator;
import java.util.List;
import java.util.Locale;

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
Locale locale = Locale.of("hu", "HU");
Collator collator = Collator.getInstance(locale);
List<String> orderedCountries = countries.stream().sorted(collator).toList();
System.out.println(orderedCountries);

The result is this:

[El Salvador, Eritrea, Észak-korea, Észtország, Etiópia]

The expected result:

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

Setting the collator strength changes nothing. The rule set clearly lacks an ordering for accented letters relative to their unaccented counterparts, such as A < Á, I < Í, E < É, etc.

I use OpenJDK 21 (21.0.2).

Why is there a rule set for the Hungarian language that fails to follow the rules of the Hungarian language? I don't want to write my own rules to implement functionality that should already be there. Maybe there is a good reason for this. Or is there a different way to achieve the expected result with the default Java libraries?

Comments

  • The concept here is called "collation order". Knowing the right term can help you find information. Collation order is a characteristic of locale / localization (and not a matter of grammar). Knowing the right classification can also help you find information.
  • Here is the CLDR entry for Hungarian collation. It does not contain the rule "E < É". That's why.
  • See also stackoverflow.com/q/65832378/5133585
  • Does this occur when you use Hungarian Notation ? What if you move to Polish Notation ? Is CLDR the variant of TLDR ? [[ Just Joking !! ]]
  • At least one answer smells of having been generated by ChatGPT (or similar).

3 Answers


Java isn’t missing any grammar rules. The Hungarian collation in OpenJDK follows the Unicode/CLDR standard, where accented letters (like É) are treated as secondary forms of their base letter (E). Because of this, the traditional Hungarian dictionary order (A < Á < B < C < Cs … E < É) is not applied by default.
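As a rough illustration of the "secondary form" behaviour described above, the following sketch compares a few strings with the JDK's own Hungarian collator at different strengths. The class name StrengthDemo and the helper compareHu are made up for this example; only standard java.text.Collator calls are used.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    /** Compares two strings with the JDK's Hungarian collator at the given strength. */
    static int compareHu(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY strength looks at base letters only, so the accent is ignored.
        System.out.println(compareHu("E", "É", Collator.PRIMARY));
        // TERTIARY strength also considers accents, but only as a tie-breaker.
        System.out.println(compareHu("E", "É", Collator.TERTIARY));
        // The accent on the first letter cannot outweigh the later base-letter
        // difference between "s" and "t".
        System.out.println(compareHu("Észak", "Etiópia", Collator.TERTIARY));
    }
}
```

If the collator behaves as described, the first comparison yields 0 and the other two are negative. The accent only decides the order when the two strings are otherwise equal, which would explain why Észak-korea still sorts before Etiópia whatever strength is set.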

No built-in Java Collator implements the full Hungarian dictionary alphabet.

If you need the real Hungarian dictionary order, you must use a tailored collation. For example, with ICU4J:

Collator coll = Collator.getInstance(new ULocale("hu@collation=dictionary"));

This collator follows the correct Hungarian dictionary rules, including treating E and É as separate letters.


3 Comments

I would like to note that having multiple collation orders depending on the usecase is not unusual. For example, if I remember correctly, Spanish uses a different collation order between phone books (yeah, disappearing as they are) and dictionaries, with regard to whether LL is treated as its own letter, or as just two Ls.
Have you run this code? What output does it produce using the data in the question? There is no such thing as collation=dictionary for locale hu in the CLDR (as used by ICU4J). There is only collation=standard. See the hu.xml file: it only contains <collation type="standard" > (and a proposed draft - also standard).
As noted by @MatthieuM. there are examples in the CLDR of languages having multiple collations, such as Spanish. Another one: the German data, which has multiple sets of collation rules.

In Hungarian, letters like É and E are not the same. The correct alphabetical order should be E < É, I < Í, and so on.
So it’s natural to expect Java’s Hungarian Collator to sort words that way.

But here’s the catch:
Even though Java supports the Hungarian locale, the built-in Collator doesn’t fully follow Hungarian collation rules. It still treats accented letters like “É” as just variants of “E,” not as separate letters.

This happens because Java’s locale data comes from CLDR (Common Locale Data Repository), which actually does define the proper Hungarian rules.
However, the Java Collator class doesn’t use those CLDR rules — it still relies on older internal data that doesn’t include the accent order for Hungarian. In other words, Java’s Collator knows about the Hungarian language but not about the finer details of its alphabet.

That’s why no matter how you tweak the Collator strength, “E” and “É” are treated as equals.


Use ICU4J

ICU4J is a library that handles collation (and many other locale features) far more accurately. It fully supports Hungarian ordering right out of the box.

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import java.util.List;

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador");
Collator collator = Collator.getInstance(ULocale.forLanguageTag("hu-HU"));
collator.setStrength(Collator.TERTIARY);

List<String> ordered = countries.stream().sorted(collator).toList();
System.out.println(ordered);

Output:

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

That’s the correct Hungarian order — E < É, just as it should be.



1 Comment

Did you try running this code? On macOS 26.0.1, this code produces the same output as the one shown in the question, not the expected output.

Regarding your specific question:

"Why is there a rule set for Hungarian language that fails to follow the rules of the Hungarian language?"

You would almost certainly have to ask the language experts who contributed these specific collation rules to the Unicode CLDR project, if you want a definitive answer.

Having said that, you can see that the CLDR itself alludes to this:

"There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated)."

There is also a discussion (from almost 20 years ago) when the MySQL team was trying to define Hungarian collation rules - outside the scope of this Java question, but still interesting. I won't try to capture the full thread here, but it includes the following snippets:

"Most people agree that this is the Hungarian alphabet: a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs"

"Some people also say there's a secondary sort rule for these short/long vowel pairs: a á, e é, i í, o ó, ö ő, u ú, ü ű. For these pairs, long = short usually, but long > short if all else is equal."

"An alternative collation sometimes used (in libraries, and some dictionaries and lexica) is according to the basic latin [sic] alphabet, whit the accented letters having the same value as the not accented. Or anything in between. E.g., honoring the digraphs and the trigraph, but leaving the accents out of the business."

So, "leaving the accents out of the business" seems to have been acceptable in some cases.

The thread goes on to discuss various other complexities. Worth a read.

If you feel strongly enough you could ask the CLDR team for more info about the history of this topic.

There are already some tickets about Hungarian collation, such as "collation rules for hu (Hungarian)". This one was marked "won't fix".


I just want my expected sort order to work - what can I do?

I appreciate that you want to provide the least surprising sort order for consumers of your data.

Here is one way you can do that in Java - but with a very large caveat that I am not any kind of expert in Hungarian. I do not know the language at all.

You can use a custom RuleBasedCollator to do that.

Here is a first (somewhat naive) implementation:

import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.List;

...

String myRules = """
     < a, A < á, Á < b, B < c, C < cs, CS, Cs < d, D <
     dz, DZ, Dz < dzs, DZS, Dzs <
     e, E < é, É < f, F < g, G < gy, GY, Gy < h, H <
     i, I < í, Í < j, J < k, K < l, L < ly, LY, Ly <
     m, M < n, N < ny, NY, Ny < o, O < ó, Ó < ö, Ö < ő, Ő < p, P <
     q, Q < r, R < s, S < sz, SZ, Sz < t, T < ty, TY, Ty <
     u, U < ú, Ú < ü, Ü < ű, Ű < v, V < w, W < x, X <
     y, Y < z, Z < zs, ZS, Zs
    """;
RuleBasedCollator myCollator = new RuleBasedCollator(myRules);

List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea",
        "Etiópia", "El Salvador");
List<String> expectedResults = List.of("El Salvador", "Eritrea", "Etiópia",
        "Észak-korea", "Észtország");
List<String> orderedCountries = countries.stream().sorted(myCollator).toList();

System.out.println(expectedResults);
System.out.println(orderedCountries);

This prints the expected results followed by the actual results:

[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]
[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]

It uses rules such as e, E < é, É to ensure accented letters are sorted after their unaccented counterparts.

Note that I do not use a Hungarian locale anywhere in the above code.


This could be enhanced in a couple of ways (and maybe more):

(a) If you explicitly use a Hungarian locale, Java uses the following collation rules (as extracted from the CLDR rules):

CollationData_hu.java:

{ "Rule",
    /* for hu, default sorting except for the following: */
    /* add cs "ligature" between c and d. */
    /* add d<stroke> between d and e. */
    /* add gy "ligature" between g and h. */
    /* add ly "ligature" between l and l<stroke>. */
    /* add l<stroke> between l and m. */
    /* add sz "ligature" between s and t. */
    /* add zs "ligature" between z and z<abovedot> */
    /* add z<abovedot> after z.       */
    "& C < cs , cS , Cs , CS " // cs ligatures
    + "& D < \u0111, \u0110 "    // tal : african [sic] d < d-stroke
    + "& G < gy, Gy, gY, GY "    // gy ligatures
    + "& L < ly, Ly, lY, LY "    // l < ly
    + "& O < o\u0308 , O\u0308 " // O < o-umlaut
    + "< o\u030b , O\u030b "     // o-double-accute
    + "& S < sz , sZ , Sz , SZ " // s < sz ligature
    + "& U < u\u0308 , U\u0308 " // u < u-umlaut
    + "< u\u030b , U\u030b "     // u-double-accute
    + "& Z < zs , zS , Zs , ZS " // stop-stroke < zs ligature
}

As noted already in some comments, they do not handle accented characters in the way you expect/need. You could enhance your custom rules by adding these extra ones to your custom collator.


Also (b): These Hungarian-specific rules are appended to a much larger set of default rules in CollationRules.java.

This is a very long list.

You can't (shouldn't!) access this class directly - it's in a sun.util package. You could take a copy of the rules, and append your custom rules, for completeness. These rules may, of course, change in the future (but they appear to be quite stable, these days).
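One possible way to avoid copying the rules by hand, sketched below, is to ask the locale's collator for its rule string through the public RuleBasedCollator.getRules() method and append extra rules to it. This is only a sketch under assumptions: it relies on Collator.getInstance returning a RuleBasedCollator (typical on OpenJDK, but not promised by the API), and the class name HungarianTailoring, its methods, and the appended long-vowel rules are my own illustration, not something taken from the CLDR or the JDK.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.List;
import java.util.Locale;

class HungarianTailoring {
    // Illustrative extra rules (an assumption, not CLDR data): each long vowel
    // becomes its own letter, placed right after the upper-case short vowel.
    // Unicode escapes are used in the same way as in CollationData_hu.
    static final String LONG_VOWEL_RULES =
            " & A < \u00E1 , \u00C1"    // a, A < á, Á
            + " & E < \u00E9 , \u00C9"  // e, E < é, É
            + " & I < \u00ED , \u00CD"  // i, I < í, Í
            + " & O < \u00F3 , \u00D3"  // o, O < ó, Ó (before ö, ő)
            + " & U < \u00FA , \u00DA"; // u, U < ú, Ú (before ü, ű)

    /** Builds a collator from the JDK's Hungarian rules plus the rules above. */
    static Collator tailoredHungarian() {
        Collator base = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
        if (!(base instanceof RuleBasedCollator ruleBased)) {
            throw new IllegalStateException("No rule string available to extend");
        }
        try {
            return new RuleBasedCollator(ruleBased.getRules() + LONG_VOWEL_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Appended rules could not be parsed", e);
        }
    }

    /** Sorts the given names with the tailored collator. */
    static List<String> sortCountries(List<String> names) {
        return names.stream().sorted(tailoredHungarian()).toList();
    }

    public static void main(String[] args) {
        System.out.println(sortCountries(List.of(
                "Észak-korea", "Észtország", "Eritrea", "Etiópia", "El Salvador")));
    }
}
```

Each reset (&) is placed at the upper-case letter on purpose, mirroring the existing "& C < cs" style rules, so that the new letter lands after both case variants of the short vowel. Building a collator from the full rule string is not free, so it is probably worth creating it once and reusing it.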


Just to repeat my earlier warning: I am not a Hungarian language expert. I would only want to use this type of custom approach in a very specific circumstance - so maybe only for your country name sort, and nowhere else. You may be able to improve upon my approach.

Or you may find a better way.
