
Here is a bash script:

#!/bin/bash

char=$'\x01'

if [[ "$char" =~ ^[\x01-\x20\x7F-\xFF]$ ]]
then
  echo "Unprintable"
else
  echo "Printable"
fi

The script ought to say "Unprintable", but the hex escape sequence doesn't seem to catch that the character is unprintable. How do I get hex escape sequences to work in bash regex ranges?

  • Your locale might interfere. Should work with export LC_ALL=C (after fixing typo in char=$'\x01'). Commented 2 days ago
  • @pmf Why would locale interfere with matching bytes? Maybe if they were interpreted as codepoints, but Bash isn't that high-level. Commented 2 days ago
  • @wjandrea Idk, but I had a (now deleted) comment basically suggesting your approach (but having ^[$'\x01-\x20\x7F-\xFF']$ as a single C-string instead), which did not work until I also changed LC_ALL. But then, even ^[\x01-\x20\x7F-\xFF]$ (as in the OP) worked as expected (GNU bash 5.3.3). Commented 2 days ago
  • \x20 is a printable char. It could also be written as for s in $'\x01' $'\x20' $'\x7F' $'\x61'; do [[ "$s" =~ ^[[:cntrl:]$'\x20']$ ]] && echo "Unprintable $(xxd -p <<<"$s")"; done Commented 2 days ago
  • Beside the point, but why do you have \x20 and \xA0-\xFF as "unprintable"? Are you not using UTF-8 (or at least extended ASCII)? Commented 2 days ago

2 Answers


From what I can tell, Bash doesn't support this syntax.* Instead, you can use C-style quoting to generate the characters you want to match, but the behaviour is locale-dependent, so set the locale to C to treat the escapes as bytes:

LC_ALL=C
[[ "$char" =~ ^[$'\x01'-$'\x20'$'\x7F'-$'\xFF']$ ]]

Or, in this case, you might actually want to use the character class [:print:] (printable), inverted:

[[ "$char" =~ ^[^[:print:]]$ ]]
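As a rough sketch of how the inverted class behaves, the loop below classifies a few sample characters. It assumes the C locale, where [:print:] covers the bytes 0x20 through 0x7E, and classify_print is a hypothetical helper name introduced only for illustration. Note that the space character counts as printable here, unlike in the question's range.

```shell
#!/bin/bash
# Assumption: pin the C locale so [:print:] is evaluated byte by byte.
LC_ALL=C

# classify_print is a hypothetical helper, not part of the original script.
# It reports whether its single-character argument falls outside [:print:].
classify_print() {
  if [[ "$1" =~ ^[^[:print:]]$ ]]; then
    echo "Unprintable"
  else
    echo "Printable"
  fi
}

# Try a control character, a space, a letter, and DEL.
for c in $'\x01' ' ' 'a' $'\x7F'; do
  # %q shows the character in a shell-quoted, readable form.
  printf '%q: %s\n' "$c" "$(classify_print "$c")"
done
```

If space has to be treated as unprintable as well, this class alone is not enough and the bracket expression would need to list it explicitly.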

* My research: Bash documentation for [[:

An additional binary operator, '=~', is available, with the same precedence as '==' and '!='. When you use '=~', the string to the right of the operator is considered a POSIX extended regular expression pattern and matched accordingly (using the POSIX regcomp and regexec interfaces usually described in regex(3)).

regex(3) on my machine points to regex(7), which says:

a '\' followed by any other character(!) (matching that character taken as an ordinary character, as if the '\' had not been present(!))

POSIX.2 leaves some aspects of RE syntax and semantics open; "(!)" marks decisions on these aspects that may not be fully portable to other POSIX.2 implementations.


13 Comments

I assume pmf's comment (export LC_ALL=C) is important.
I couldn't get that to work, and the docs I found don't say anything about locale.
Your example works for me if I add export LC_ALL=C before if in an extra line: char=$'\x01'; export LC_ALL=C; if [[ "$char" =~ ^[$'\x01'-$'\x20'$'\x7F'-$'\xFF']$ ]]; then echo "Unprintable"; else echo "Printable"; fi
Oh! Whoops, I did that too and forgot to remove it before trying the next thing. I'll work on editing.
I don't think \x7f-\xff is guaranteed to work. [:print:] is best for the use case presented in OP, and for checking against a custom range I'd do something like printf -v tmp %u "'$char" && ((tmp > lo && tmp < hi)).
It should with the C locale, no?
The Bash manual says "You can quote any part of the pattern to force the quoted portion to be matched literally instead of as a regular expression". That makes use of both the backslash and $'...' problematic inside a regex bracket expression, which, as a whole, is not matched literally.
Changed quoting style to fix that. It did work from what I tried, but if it doesn't fit the spec, that might be a mistake.
Regarding [:print:]: this character class is locale sensitive, and (at least in most locales) it matches the space character. The latter, especially, makes ^[^[:print:]]$ inconsistent with the OP's apparent intention to treat the space character as unprintable.
Who says space is \x20? ;) For example, it's \x40 in EBCDIC, not that that's consistent with the other \xs. It's unclear what OP's intention is exactly.

The backslash character (\) is a quoting character, both for the shell and for POSIX regexes, and sometimes also a dequoting character in POSIX regexes. POSIX does not provide for C-style hex escapes in either mode, and Bash supports them only in C-quoted mode (that is, within $'...').

The Bash manual's section on the =~ operator notes:

Shell programmers should take special care with backslashes, since backslashes are used by both the shell and regular expressions to remove the special meaning from the following character.

Additionally,

You can quote any part of the pattern to force the quoted portion to be matched literally instead of as a regular expression

and

The shell performs any word expansions before passing the pattern to the regular expression functions, so you can assume that the shell’s quoting takes precedence. As noted above, the regular expression parser will interpret any unquoted backslashes remaining in the pattern after shell expansion according to its own rules.

The part about taking quoted text literally is problematic for the contents of bracket expressions, which, as a whole, are not taken literally at all. It's not clear how Bash is meant to handle that, but whether it is the shell itself or the regex engine that processes the backslashes, there's good reason to think that your condition

[[ "$char" =~ ^[\x01-\x20\x7F-\xFF]$ ]]

will be treated as equivalent to

[[ "$char" =~ ^[x01-x20x7F-xFF]$ ]]

which does not match $'\x01'.

Especially when you have a regex that contains quoting characters or shell metacharacters, it can be very helpful to store the pattern in a variable. In this case, that would also be a way to remove any ambiguity about what the regex actually is. For example:

unprintable_pattern=$'^[\x01-\x20\x7F-\xFF]$'
if [[ "$char" =~ $unprintable_pattern ]]
# ...

That uses Bash's C-quoting form to construct the bracket expression with literal (as the regex will see it) characters. Note that the reference to the pattern variable should be unquoted, lest the expansion be matched literally.

When I make that change in your original code, it outputs "Unprintable" as you expected.

Comments on the question and on another answer remark upon a locale sensitivity. It's unclear to me where that comes from, but I tested the variation described here with several different locales, and it worked the same in every case.
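Given the mixed locale reports, here is a minimal sketch that stores the pattern in a variable and also pins the locale to C. Setting LC_ALL=C is an assumption taken from the comments rather than something I needed in my own tests, and is_unprintable is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Assumption: force the C locale so the range endpoints are compared as bytes.
LC_ALL=C

# C-quoting turns each \xHH escape into the literal byte before the
# regex engine ever sees the pattern.
unprintable_pattern=$'^[\x01-\x20\x7F-\xFF]$'

# is_unprintable is a hypothetical helper: it succeeds when its argument
# is a single character inside the ranges above.
is_unprintable() {
  # The expansion is deliberately unquoted so it is treated as a regex.
  [[ "$1" =~ $unprintable_pattern ]]
}

# Try a control character, a space, and two ordinary characters.
for c in $'\x01' ' ' 'a' '~'; do
  if is_unprintable "$c"; then
    printf '%q: Unprintable\n' "$c"
  else
    printf '%q: Printable\n' "$c"
  fi
done
```

Keeping the pattern in one variable also means the bracket expression is built exactly once, which makes it easier to inspect (for example with printf '%q') when a match behaves unexpectedly.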

3 Comments

"I tested the variation described here with several different locales, and it worked the same in every case." - Huh, weird. It works for me with LC_ALL=C or LANG=C but not by default (exit code 2 for invalid regex). I'm using en_CA.UTF-8, and echo $BASH_VERSION reports 5.2.15(1)-release on Debian Bookworm.
It also works with LANG=, but I imagine that's undefined behaviour.
@wjandrea, it works for me both by default and with LC_ALL=en_CA.UTF-8, with Bash 5.2.21 on Ubuntu 24.04.1. Maybe, then, the locale sensitivity in some other versions or builds of Bash is unintentional. As I already remarked in this answer, that does seem anomalous.
