$\begingroup$

In parallel with implementing the C language, I'm also implementing the Shell and Utilities volume of the POSIX standard, to ensure that I have everything I need to compile my own C compiler.

One thing that has bothered me is the internationalization of regex. As some of us may be aware, in addition to individual characters, POSIX regex requires support for multi-character collating elements (e.g. 'ch' in some continental European languages) and equivalence classes (e.g. A, Å, Ǎ, and Á all treated as equivalent). I plan not to implement these features, on the following grounds:

  1. POSIX permits an implementation to support no locale other than one that is compatible with the POSIX (a.k.a. C) locale, which is based on the ASCII character set.

  2. A large body of text-processing software is written in other languages, such as Perl, and language models (large ones, or ones small enough to deploy) can handle day-to-day translation in place of locales, which in this case I consider a legacy feature from the era of selling System V systems to non-English-speaking countries.

  3. Software that depends on regex behaviour in the locale of a specific language is by definition not strictly POSIX-conforming, so I do no harm by not supporting it.

Q: is it an acceptable feature set to limit the implementation to ASCII-only considering real-world use of regular expressions in programming?

$\endgroup$
  • $\begingroup$ Is it acceptable to who? This seems to be a matter of opinion. If you want to know whether it would be acceptable to users of your implementation, you would have to survey them. $\endgroup$ Commented Jun 18 at 13:50
  • $\begingroup$ If the goal is just "to compile my own C compiler", you may not need regexp in the shell at all. $\endgroup$ Commented Jun 18 at 15:22
  • $\begingroup$ @kaya3 I'll do a survey. I did say "considering real-world use of [regices] ..". $\endgroup$ Commented Jun 18 at 18:54
  • $\begingroup$ I don’t understand what this is getting at. Locales other than C are not mandated by POSIX, as you already noted. If you only support the C locale, all of the equivalence and collation classes are single-byte sets and trivial. Including special bracket expressions in BREs is not tied to internationalisation (but is obligatory), but even basic range expressions are affected by locale and presumably you’re not excluding those as well. You’ve posted an answer that suggests you mean something other than what you’ve asked here. What is the intended kind of answer to this question? $\endgroup$ Commented Jun 18 at 20:04
  • $\begingroup$ But real-world usage of special bracket expressions is completely unrelated to being ASCII-only. $\endgroup$ Commented Jun 18 at 20:24

1 Answer

$\begingroup$

I think this question is conflating two separate issues: does an implementation of POSIX regular expressions need to support non-ASCII text, and does it need to support collating/equivalence/character class expressions. The answers to these are different.

POSIX requires only the C/POSIX locale to be supported, which is fully defined in section 7.2. This locale contains the ASCII letters, digits, punctuation, and control characters, appropriately categorised. Section 7.3.2.6 then defines collation consistent with ASCII (by byte order). Section 6.4 is explicit that

POSIX.1-2024 does not require that multiple character sets or codesets be supported

A conforming implementation of POSIX thus needs only to support this locale, is free to reject attempts to use other locales, and may live entirely within the ASCII ordering of the portable character set. Your implementation of POSIX languages (sed, awk, Makefile, ...) and the regular expression standard library operations for a C implementation can be ASCII-only.

However, the elements of the bracket expression syntax are not optional and are not tied to internationalisation. For example, the graph and blank classes are fully defined for the POSIX locale (§7.3.1), so the bracket expressions [[:graph:]] and [[:blank:]] may be used in basic or extended regular expressions, explicitly in all locales (including POSIX!). [[=a=]] can be used anywhere, and in the POSIX locale it is exactly equivalent to [a]; the same goes for [[.a.]]. In another locale they might have more complex semantics, but you're not in those locales, so it's only the syntax you have to handle.
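To illustrate how little work the syntax is when only the POSIX locale exists, here is a minimal sketch of a bracket expression parser. It is written in Python purely for brevity, and the names `POSIX_CLASSES`, `_element`, and `parse_bracket` are hypothetical, not taken from any real implementation. It assumes 7-bit ASCII, byte-order ranges, and only single-character forms of `[[=x=]]` and `[[.x.]]`; it rejects anything longer rather than guessing at symbolic names.

```python
import string

# Character classes of the POSIX locale, assuming plain 7-bit ASCII.
POSIX_CLASSES = {
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
    "alpha": set(string.ascii_letters),
    "digit": set(string.digits),
    "xdigit": set(string.hexdigits),
    "alnum": set(string.ascii_letters + string.digits),
    "punct": set(string.punctuation),
    "blank": {" ", "\t"},
    "space": set(" \t\n\v\f\r"),
    "cntrl": {chr(c) for c in range(0x20)} | {"\x7f"},
    "graph": {chr(c) for c in range(0x21, 0x7F)},
    "print": {chr(c) for c in range(0x20, 0x7F)},
}


def _element(p, i):
    """Parse one item at p[i]; return (chars, range_endpoint, next_index).

    range_endpoint is None for items that this sketch refuses to use as
    the end of a range (character classes and equivalence classes).
    """
    for opener, kind in (("[:", "class"), ("[=", "equiv"), ("[.", "coll")):
        if p.startswith(opener, i):
            j = p.find(opener[1] + "]", i + 2)
            if j < 0:
                raise ValueError("unterminated " + opener)
            name = p[i + 2:j]
            if kind == "class":
                if name not in POSIX_CLASSES:
                    raise ValueError("unknown class: " + name)
                return set(POSIX_CLASSES[name]), None, j + 2
            if len(name) != 1:
                # Assumption: no multi-character elements in this locale.
                raise ValueError("unsupported element: " + name)
            # [[=a=]] and [[.a.]] both collapse to the single character.
            return {name}, (name if kind == "coll" else None), j + 2
    return {p[i]}, p[i], i + 1


def parse_bracket(p, start=0):
    """Parse the bracket expression at p[start]; return (chars, negated, end)."""
    if not p.startswith("[", start):
        raise ValueError("not a bracket expression")
    i = start + 1
    negated = p.startswith("^", i)
    if negated:
        i += 1
    members = set()
    first = True
    while i < len(p):
        if p[i] == "]" and not first:  # a leading "]" is an ordinary member
            return members, negated, i + 1
        first = False
        chars, lo, i = _element(p, i)
        # "x-y" is a range unless the "-" is the last character before "]".
        if lo is not None and p.startswith("-", i) and i + 1 < len(p) and p[i + 1] != "]":
            _, hi, i = _element(p, i + 1)
            if hi is None or hi < lo:
                raise ValueError("invalid range")
            chars = {chr(c) for c in range(ord(lo), ord(hi) + 1)}  # byte order
        members |= chars
    raise ValueError("unterminated bracket expression")
```

With this sketch, `parse_bracket("[[=a=]]")`, `parse_bracket("[[.a.]]")`, and `parse_bracket("[a]")` all yield the same one-element set, which is the sense in which the equivalence-class and collating-symbol forms cost almost nothing once the syntax is recognised.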

You do have to handle the syntax, because applications do use these. In particular, the sort of deep-buried build-system code that you probably want to support, given the storyline of the question, is quite likely to rely on it somewhere, potentially within generated sed or awk code. That's not because it depends on being in some other locale, but because it doesn't: it's making sure that it works in any locale, and it's even intended to work on systems with other encodings entirely. If your BRE/ERE implementation doesn't support these, it will mis-match text or error out on valid inputs. Since you are only supporting the POSIX locale, there's no complexity in implementing them, as they are essentially no-ops. You should support these if you are attempting to implement POSIX-compatible semantics for the shell command language and basic utilities.
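As a rough illustration of the mis-matching failure mode, Python's `re` module can stand in for an engine that lacks the `[[:class:]]` syntax. It is not a POSIX engine, and the helper name `compile_without_posix_classes` is hypothetical. Such an engine reads `[[:space:]]` as an ordinary set followed by a literal `]`, so the pattern compiles but matches the wrong text.

```python
import re
import warnings


def compile_without_posix_classes(pattern):
    """Compile with Python's re, which has no POSIX class syntax.

    re warns about the leading "[[" ("possible nested set"); the warning
    is silenced here only to keep the demonstration quiet.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return re.compile(pattern)


# Parsed as the set "[:space" followed by a literal "]".
space = compile_without_posix_classes(r"[[:space:]]")

print(space.fullmatch(" "))    # None: a real space does not match
print(space.fullmatch("s]"))   # a match object: the text "s]" does
```

A script that strips trailing whitespace with a pattern like this would silently do nothing useful, which is harder to diagnose than an outright syntax error.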

The answer to the explicit question

Q: is it an acceptable feature set to limit the implementation to ASCII-only considering real-world use of regular expressions in programming?

is yes, explicitly, because you only need to support the POSIX locale. The answer to the implicit empirical question about prevalence of usage of this syntax is that no, you probably can't leave it out as unused, but whether you'll really encounter it in code you care about isn't something we can answer.

$\endgroup$
  • $\begingroup$ "the implicit empirical question about prevalence of usage", I did a survey and answered and deleted. You probably can see it, since you're diamond decorated. I'm aware that internationalization bracket elements are mandatory, I just didn't know they're called "special" in the standard. $\endgroup$ Commented Jun 19 at 7:37
  • $\begingroup$ You did! I think it is mismatched with what the question says, but presumably it is exactly what you meant (and so it gave you what you need?). I do expect you would encounter load-bearing uses of these constructs in things like configure and build scripts even if they don’t make the first pages of search results, but it may not be any ones that matter for what you’re trying to do. $\endgroup$ Commented Jun 19 at 8:41
