Regarding your specific question:
Why is there a rule set for Hungarian language that fails to follow the rules of the Hungarian language?
For a definitive answer, you would almost certainly have to ask the language experts who contributed these specific collation rules to the Unicode CLDR project.
Having said that, you can see that the CLDR itself alludes to this:
There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated).
There is also a discussion (from almost 20 years ago) when the MySQL team was trying to define Hungarian collation rules - outside the scope of this Java question, but still interesting. I won't try to capture the full thread here, but it includes the following snippets:
"Most people agree that this is the Hungarian alphabet: a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs"
"Some people also say there's a secondary sort rule for these short/long vowel pairs:
a á, e é, i í, o ó, ö ő, u ú, ü ű
For these pairs, long = short usually, but long > short if all else is equal."
"An alternative collation sometimes used (in libraries, and some dictionaries and lexica) is according to the basic latin [sic] alphabet, whit the accented letters having the same value as the not accented. Or anything in between. E.g., honoring the digraphs and the trigraph, but leaving the accents out of the business."
So, "leaving the accents out of the business" seems to have been acceptable in some cases.
The thread goes on to discuss various other complexities. Worth a read.
If you feel strongly enough, you could ask the CLDR team for more info about the history of this topic.
There are already some CLDR tickets about Hungarian collation, such as "collation rules for hu (Hungarian)". That one was marked "won't fix".
I just want my expected sort order to work - what can I do?
I appreciate that you want to provide the least surprising sort order for consumers of your data.
Here is one way you can do that in Java, with one very large caveat: I am not any kind of expert in Hungarian. I do not know the language at all.
You can use a custom RuleBasedCollator to do that.
Here is a first (somewhat naive) implementation:
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.List;

// The class name is arbitrary.
public class HungarianSort {
    public static void main(String[] args) throws ParseException {
        // "<" starts the next letter of the alphabet; "," separates case variants.
        String myRules = """
            < a, A < á, Á < b, B < c, C < cs, CS, Cs < d, D <
            dz, DZ, Dz < dzs, DZS, Dzs <
            e, E < é, É < f, F < g, G < gy, GY, Gy < h, H <
            i, I < í, Í < j, J < k, K < l, L < ly, LY, Ly <
            m, M < n, N < ny, NY, Ny < o, O < ó, Ó < ö, Ö < ő, Ő < p, P <
            q, Q < r, R < s, S < sz, SZ, Sz < t, T < ty, TY, Ty <
            u, U < ú, Ú < ü, Ü < ű, Ű < v, V < w, W < x, X <
            y, Y < z, Z < zs, ZS, Zs
            """;
        // The constructor throws ParseException if the rules are malformed.
        RuleBasedCollator myCollator = new RuleBasedCollator(myRules);
        List<String> countries = List.of("Észak-korea", "Észtország", "Eritrea",
                "Etiópia", "El Salvador");
        List<String> expectedResults = List.of("El Salvador", "Eritrea", "Etiópia",
                "Észak-korea", "Észtország");
        // A Collator is a Comparator, so it can be passed straight to sorted().
        List<String> orderedCountries = countries.stream().sorted(myCollator).toList();
        System.out.println(expectedResults);
        System.out.println(orderedCountries);
    }
}
This prints the expected results, followed by the actual results:
[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]
[El Salvador, Eritrea, Etiópia, Észak-korea, Észtország]
It uses rules such as e, E < é, É to ensure accented letters are sorted after their unaccented counterparts.
Note that I do not use a Hungarian locale anywhere in the above code.
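To make the rule syntax a bit more concrete, here is a minimal sketch using a deliberately tiny rule set (my own, for illustration only, and nowhere near a full alphabet). In java.text.RuleBasedCollator rules, "<" introduces a primary difference (a different letter) and "," a tertiary difference (typically case). The class and method names are made up for this example.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class RuleSyntaxDemo {

    // Illustrative rules only. \u00e9 and \u00c9 are e-acute in lower and upper
    // case, escaped so that the source file encoding does not matter.
    static final String DEMO_RULES =
            "< c, C < cs, CS, Cs < d, D < e, E < \u00e9, \u00c9 < s, S < z, Z";

    static RuleBasedCollator demoCollator() {
        try {
            return new RuleBasedCollator(DEMO_RULES);
        } catch (ParseException e) {
            // Only reached if DEMO_RULES is malformed.
            throw new IllegalArgumentException("Bad collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = demoCollator();

        // "<" is a primary difference: e-acute is its own letter after e/E.
        System.out.println(collator.compare("E", "\u00e9") < 0);  // true
        // The digraph "cs" is one letter that sorts after any plain "c...".
        System.out.println(collator.compare("cz", "cs") < 0);     // true
        // "," is a tertiary difference: lower case before upper case.
        System.out.println(collator.compare("e", "E") < 0);       // true

        // At PRIMARY strength the case difference is ignored,
        // but the letter difference is not.
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("e", "E") == 0);      // true
        System.out.println(collator.compare("e", "\u00e9") < 0);  // true
    }
}
```

If you would rather treat accents as tie-breakers only (the "long = short usually" view quoted earlier), the rules would have to relate the vowel pairs with a weaker operator than "<". I have not tried that variant.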
This could be enhanced in a couple of ways (and maybe more):
(a) If you explicitly use a Hungarian locale, Java uses the following collation rules (as extracted from the CLDR rules):
CollationData_hu.java:
{ "Rule",
/* for hu, default sorting except for the following: */
/* add cs "ligature" between c and d. */
/* add d<stroke> between d and e. */
/* add gy "ligature" between g and h. */
/* add ly "ligature" between l and l<stroke>. */
/* add l<stroke> between l and m. */
/* add sz "ligature" between s and t. */
/* add zs "ligature" between z and z<abovedot> */
/* add z<abovedot> after z. */
"& C < cs , cS , Cs , CS " // cs ligatures
+ "& D < \u0111, \u0110 " // tal : african [sic] d < d-stroke
+ "& G < gy, Gy, gY, GY " // gy ligatures
+ "& L < ly, Ly, lY, LY " // l < ly
+ "& O < o\u0308 , O\u0308 " // O < o-umlaut
+ "< o\u030b , O\u030b " // o-double-accute
+ "& S < sz , sZ , Sz , SZ " // s < sz ligature
+ "& U < u\u0308 , U\u0308 " // u < u-umlaut
+ "< u\u030b , U\u030b " // u-double-accute
+ "& Z < zs , zS , Zs , ZS " // stop-stroke < zs ligature
}
As noted already in some comments, these rules do not handle accented characters in the way you expect/need. You could enhance your custom collator by adding these extra rules to it.
Also (b): These Hungarian-specific rules are appended to a much larger set of default rules in CollationRules.java.
This is a very long list.
You can't (shouldn't!) access this class directly - it's in a sun.util package. You could take a copy of the rules, and append your custom rules, for completeness. These rules may, of course, change in the future (but they appear to be quite stable, these days).
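One possible alternative to copying the rules by hand: RuleBasedCollator has a public getRules() method, so you may be able to read the rules from the collator Java gives you for the Hungarian locale and append your own. The sketch below assumes that Collator.getInstance returns a RuleBasedCollator for "hu" (it guards the cast in case it does not). The extra rule is only an example for one vowel pair, and the class and method names are made up.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PatchedHungarianCollator {

    // Example rule only: make e-acute (\u00e9, \u00c9) a separate letter after E.
    // The other vowel pairs would need similar rules; they are left out here.
    static final String EXTRA_RULES = "& E < \u00e9, \u00c9";

    static Collator patchedCollator() {
        Collator base = Collator.getInstance(Locale.forLanguageTag("hu"));
        if (!(base instanceof RuleBasedCollator)) {
            return base; // no rules to extend, fall back to the stock collator
        }
        // Stock rules for the locale, followed by the custom additions.
        String rules = ((RuleBasedCollator) base).getRules() + EXTRA_RULES;
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalStateException("Could not parse combined rules", e);
        }
    }

    // Returns a sorted copy; a Collator is a Comparator, so List.sort accepts it.
    static List<String> sort(Collator collator, List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> countries = List.of("\u00c9szak-korea", "\u00c9sztorsz\u00e1g",
                "Eritrea", "Eti\u00f3pia", "El Salvador");
        Collator stock = Collator.getInstance(Locale.forLanguageTag("hu"));
        System.out.println("stock:   " + sort(stock, countries));
        System.out.println("patched: " + sort(patchedCollator(), countries));
    }
}
```

This keeps the default handling of spaces, punctuation and non-Hungarian letters, which the hand-written rule set above does not define.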
Just to repeat my earlier warning: I am not a Hungarian language expert. I would only want to use this type of custom approach in a very specific circumstance - so maybe only for your country name sort, and nowhere else. You may be able to improve upon my approach.
Or you may find a better way.