Which horizontal whitespace should be supported?

Question

There seem to be three approaches to horizontal whitespace for separating tokens:

Only tab and space: Bash, C, D, Dart, Go, Java, Lua, OCaml, Python, PHP, Perl, Ruby, Rust, Scala, Swift (also supports null!), Zig
Tab, space, non-breaking space: Common Lisp, Erlang, Wolfram Language
All Unicode whitespace: C#, Haskell, JavaScript, Julia, PowerShell, R, Raku

However, I've not been able to find any reasoning for this, nor any discussion of the issues involved. What are the pros and cons of the three approaches?
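To make the three groups concrete, here is a minimal sketch of each as a predicate. The policy names (`:ascii`, `:nbsp`, `:unicode`) are hypothetical, and "all Unicode whitespace" is approximated as tab plus general category Zs; individual languages differ in the details.

```ruby
# Hypothetical policy table; each pattern matches exactly one character.
HORIZONTAL_WHITESPACE = {
  ascii:   /\A[ \t]\z/,        # tab and space only
  nbsp:    /\A[ \t\u00A0]\z/,  # tab, space, non-breaking space
  unicode: /\A[\t\p{Zs}]\z/    # tab plus every Unicode space separator (Zs)
}.freeze

# True if `char` separates tokens under the given policy.
def horizontal_whitespace?(char, policy)
  HORIZONTAL_WHITESPACE.fetch(policy).match?(char)
end

samples = {
  "space" => " ",
  "tab" => "\t",
  "NBSP" => "\u00A0",
  "ideographic space" => "\u3000"
}

# Print which policies accept each sample character.
samples.each do |name, char|
  accepted = HORIZONTAL_WHITESPACE.keys.select { |policy| horizontal_whitespace?(char, policy) }
  puts "#{name}: #{accepted.inspect}"
end
```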

Swift doesn’t support all Unicode whitespace, but it does support some other weird choices within ASCII, such as form feed and NUL.
– Bbrk24, Commented May 16, 2023 at 17:02
Honestly, anything that isspace() in C would return 1 for.
– CPlus, Commented May 16, 2023 at 20:36
@Pseudonym Why? The question is specifically about horizontal whitespace.
– Adám, Commented May 17, 2023 at 4:38

Jon Purdy · Accepted Answer · 2023-05-16 19:12:08Z

Tabs and spaces are those whitespace characters most widely supported by common keyboard layouts. They’re both available in ASCII, and likely to be preserved by systems that don’t correctly handle character encoding translation, because they’re also present in other common legacy encodings such as Windows-1252, SJIS, and EBCDIC.

Space (SP, U+0020)

Allowing only SP simplifies the calculation of column positions, both for source locations in diagnostic messages, and for languages where indentation or alignment are significant. (However, see the later section on Unicode.)

This could be considered the minimal whitespace to allow and require, although in fact it’s quite possible to design a grammar where spaces aren’t significant at all. Notably, early versions of FORTRAN did not require spaces around keywords and symbols, and permitted spaces in identifiers. However, I won’t dwell on this, because it was considered a misfeature: a small typo could significantly change the interpretation of a line of code.

Tab (HT, U+0009)

Allowing HT requires handling the variable positioning of tab stops. A variety of approaches are possible, and I can’t list them exhaustively, but here are a few examples for illustration.

Neither indentation nor alignment is significant: column positions in error messages may use a fixed tab width parameter, or be computed based on byte offsets rather than graphemes
Alignment is significant, using a fixed tab width (GHC Haskell uses 8 columns by default)
Alignment is significant, only within a table column (treating a series of one or more HT characters as a column separator)
Only indentation is significant; an indented region shares the same prefix of whitespace characters, which may begin with HT characters
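Two of these approaches can be sketched briefly: a fixed tab width, where each HT advances to the next tab stop, and prefix comparison, where no width is computed at all. The function names are hypothetical and the sketch is not taken from any particular compiler.

```ruby
# Visual width of the leading whitespace, with tab stops every `tab_width` columns.
def indent_width(line, tab_width: 8)
  width = 0
  line.each_char do |char|
    case char
    when " "  then width += 1
    when "\t" then width += tab_width - (width % tab_width) # advance to the next tab stop
    else break                                              # first non-whitespace character
    end
  end
  width
end

# Leading run of spaces and tabs, kept as characters rather than converted to a width.
def indent_prefix(line)
  line[/\A[ \t]*/]
end

# Prefix rule: `inner` is nested under `outer` if its indentation starts with outer's.
def nested_under?(inner, outer)
  indent_prefix(inner).start_with?(indent_prefix(outer))
end
```

Under the prefix rule, a line indented with eight spaces is not nested under a line indented with one tab, even though both have width 8 under the fixed-width rule; that mismatch is exactly what the prefix rule is meant to reject.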

Non-breaking Space (NBSP, U+00A0)

Non-breaking spaces are simply space characters that don’t allow automatic insertion of a line break.

Allowing NBSP can improve usability in a few small ways. First, NBSP characters may frequently be inserted by some input methods. For example, traditional French typography places NBSP before certain punctuation ( : ! ? ) and within quotation marks ( ‹ › « » ); and on some keyboard layouts, hitting the Space key with a modifier such as Alt, Option, or AltGr will also insert NBSP.

Many professional programmers will use a US English keyboard layout to accommodate languages that were designed to be typed on such a layout. However, a beginner programmer is likely starting with the familiar layout of their native language and country, so they may insert NBSP characters unintentionally. Their text editor may give no clear indication that the whitespace is incorrect, and they may not notice even if it does, because they haven’t yet developed the clerical skills involved in programming text-based languages.

Furthermore, NBSP may be inserted for presentation reasons on tutorial websites and blogs, rather than (or in addition to) using a proper facility for preformatted/verbatim text, like the HTML <pre> tag. Programmers will frequently copy and paste code examples from such websites, and they have a reasonable expectation that it should work, or deliver a sensible diagnostic message.

As a rule of thumb, producing helpful diagnostics from a parser requires the parser to accept a larger language than is actually valid, so that it can describe specifically how the input isn’t valid.
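As a minimal sketch of that rule of thumb, suppose a language officially allows only space and tab. The scanner below still recognises every other Zs character so that it can name the offender instead of reporting a generic "unexpected character". `LexWarning` and `scan_whitespace` are hypothetical names, and columns are counted in code points for brevity.

```ruby
# Hypothetical diagnostic record: 1-based line and column plus a message.
LexWarning = Struct.new(:line, :column, :message)

# Report every space separator other than U+0020, with its code point.
def scan_whitespace(source)
  warnings = []
  source.each_line.with_index(1) do |line, line_number|
    line.each_char.with_index(1) do |char, column|
      next unless char.match?(/\p{Zs}/) && char != " "

      code_point = format("U+%04X", char.ord)
      warnings << LexWarning.new(
        line_number, column,
        "unexpected #{code_point}; use an ordinary space (U+0020)"
      )
    end
  end
  warnings
end

scan_whitespace("x\u00A0= 1\n").each do |w|
  puts "#{w.line}:#{w.column}: #{w.message}"
end
```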

Unicode

Allowing all Unicode whitespace may sound big and complicated, but in fact there isn’t much difficulty to deal with. The advantage of proper Unicode support is improved usability—code that is represented as text will break fewer user expectations by functioning predictably like other text.

For example, Unicode technical reports such as the Unicode line-breaking algorithm are actually extremely informative as a reference for syntax design. The lack of a break opportunity between two characters tells you that a programmer is likely to interpret those characters as a single semantic unit if they appear side by side without any intervening whitespace.

For computing column positions for diagnostic message alignment in a terminal, you should use ICU extended grapheme clusters, or your platform’s wcwidth function. However, depending on alignment is best avoided in a terminal if possible, since emulator support for proper alignment can still be a bit dodgy, especially for East Asian scripts.
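To illustrate why the unit matters, the sketch below counts the same prefix in bytes, code points, and extended grapheme clusters using Ruby's built-in `String#grapheme_clusters`. The helper `grapheme_column` is a hypothetical name and assumes the byte offset falls on a character boundary. Note that a grapheme count is still not a display width: wide East Asian characters occupy two terminal cells, and Ruby core has no wcwidth equivalent.

```ruby
# "é" written as "e" + U+0301 (combining acute accent), followed by a space.
prefix = "e\u0301 "

puts prefix.bytesize                  # => 4 (UTF-8 bytes)
puts prefix.length                    # => 3 (code points)
puts prefix.grapheme_clusters.length  # => 2 (user-perceived characters)

# 1-based column of the character starting at `byte_offset`, counted in
# extended grapheme clusters.
def grapheme_column(line, byte_offset)
  line.byteslice(0, byte_offset).grapheme_clusters.length + 1
end

puts grapheme_column("e\u0301 x", 4)  # => 3
```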

As for whitespace, the Unicode General Categories that we must consider are the following, which account for both horizontal and vertical spacing.

Zs: separator, space (17 characters)

This includes SP and NBSP, as well as U+3000 ideographic space, which is automatically entered by keyboard layouts for East Asian languages that use CJK ideographs, and thus beneficial to understand for the same reasons as for NBSP given above.

The remainder of these characters are various typographical spaces; they’re used more rarely, but there are no particular caveats about supporting them. Nominally they do have different widths, which could be significant for code typeset in a proportional-width typeface, but by far the majority of code is set in monospace / fixed width.

Zl/Zp: separator, line/paragraph (2 characters)

These categories contain only LS (U+2028 line separator) and PS (U+2029 paragraph separator), which can be treated as mandatory line breaks.

It used to be the case that JSON strings could contain LS/PS while JavaScript strings couldn’t; this was amended by a TC39 proposal, Subsume JSON.

Cc: other, control (65 characters)

Of these, the significant ones are HT (already covered; SP belongs to Zs rather than Cc) and the following:

line feed / newline (LF, U+000A)
line tabulation / vertical tab (VT, U+000B)
form feed / new page (FF, U+000C)
carriage return (CR, U+000D)
next line (NEL, U+0085)

All of these can be treated as mandatory line breaks like LF; the only exception is that CR is not a break if followed by LF, to support CRLF sequences. I recommend normalising newlines before parsing to avoid this.
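A minimal sketch of such a normalisation pass, run before lexing, might look like this. `normalize_newlines` is a hypothetical helper; it also folds in LS and PS from the Zl/Zp categories, which is an assumption about how you want those treated.

```ruby
# CRLF must be tried first so that it becomes one line break rather than two.
# The class covers LF, VT, FF, CR, NEL, LS and PS.
LINE_BREAK = /\r\n|[\n\v\f\r\u0085\u2028\u2029]/

# Replace every recognised line break with a single LF.
def normalize_newlines(source)
  source.gsub(LINE_BREAK, "\n")
end

p normalize_newlines("a\r\nb\rc\u0085d")  # => "a\nb\nc\nd"
```

After this pass the lexer only needs to recognise LF, and line numbers in diagnostics agree regardless of the platform the file was written on.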

Besides CR and LF, the others are uncommon. FF used to be pretty widespread as a section separator in text files, and this is still seen occasionally, since it has some editor support, notably in Emacs. VT could be used as a row separator, if your sources might contain tabular alignment like I described for HT, but this currently has no editor support.

Cf: other, format (2 notable characters)

This category contains 170 characters, but almost all of them are insignificant for the purpose of code typography. Notable exceptions are these two:

zero-width space (ZWSP, U+200B)

Should be treated as a space, but should not advance the column position of source locations (but again, prefer grapheme clustering or wcwidth).

zero-width no-break space (ZWNBSP, U+FEFF)

May appear as a byte-order mark (BOM); should be ignored at the start of a file, and otherwise treated as a space.
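The handling of these two characters can be sketched as follows. The helper names are hypothetical, and the column rule simply gives both characters zero width instead of doing full width computation.

```ruby
BOM  = "\uFEFF" # zero-width no-break space, used as a byte-order mark
ZWSP = "\u200B" # zero-width space

# Drop a leading BOM; any later U+FEFF is left for the lexer to treat as a space.
def strip_bom(source)
  source.delete_prefix(BOM)
end

# Split a line into tokens on space, tab, ZWSP and (non-initial) U+FEFF.
def split_tokens(line)
  line.split(/[ \t\u200B\uFEFF]+/).reject(&:empty?)
end

# Column advance for one character: the zero-width separators take no width.
def advance(char)
  [ZWSP, BOM].include?(char) ? 0 : 1
end

p split_tokens(strip_bom("\uFEFFlet\u200Bx = 1"))  # => ["let", "x", "=", "1"]
```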

Matheus Moreira · Accepted Answer · 2023-12-09 02:40:11Z

“I've not been able to find any reasoning for this, nor any discussion of the issues involved.”

This Ruby code perfectly illustrates the possible consequences of failing to recognize all whitespace characters as separators:

def this is a single symbol
  p __method__
end

this is a single symbol
# :this is a single symbol

The spaces between the words are non-breaking spaces. The Stack Exchange text area seems to have filtered them out but if you replace them with the appropriate code points you can use spaces in symbols and variable names.
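Since the non-breaking spaces do not survive copy and paste, here is a sketch that builds the same definition with explicit `\u00A0` escapes and `eval`. It relies on Ruby accepting non-ASCII characters in identifiers; the constant and variable names are only for illustration.

```ruby
NBSP = "\u00A0"

# Join the words with non-breaking spaces to form a single identifier.
name = %w[this is a single symbol].join(NBSP)

# The parser sees one method name, because NBSP is not a separator in Ruby.
eval("def #{name}; __method__; end")

result = send(name)
p result.is_a?(Symbol)         # the method reports its own name as a symbol
p result.to_s.include?(NBSP)   # and that name contains the non-breaking spaces
```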

Supporting this kind of thing will enable programmers to write natural-looking code, but it will be much harder to tell where each logical symbol begins and ends. Even with editor whitespace display support I find it pretty annoying to read.

Any symbol unavailable on standard keyboards will also make it hard to work with the code, because it cannot be inserted easily. It will require elaborate or custom input methods, or people will resort to copy/paste. This also applies to Unicode symbol support in general.

I suppose it's up to the language's designer to decide whether this is a good thing or not.



The decision to not count NBSP as whitespace doesn't entail that you must therefore count it as a valid identifier character. Consider e.g. that ? is not a separator character in Python but also isn't allowed in identifiers. Personally I think NBSP (and other Unicode shenanigans, like RTL order markers) should not be allowed at all outside of string literals.
– kaya3, Commented Dec 9, 2023 at 12:35

feldentm · Accepted Answer · 2023-12-09 09:54:18Z

If you treat as whitespace only those characters that are whitespace in both Unicode and ASCII, and your language tokenizes all non-ASCII input without looking at code points, you can ignore Unicode in the lexer and parser entirely and deal with it in later passes, if at all.

This has a positive impact on performance and complexity, but it completely depends on your language if anybody will notice.

It could have a negative impact on what is considered an identifier in your language, but it is debatable whether that matters in practice, as most style guides wouldn't allow you to use non-ASCII identifiers anyway because they are incomprehensible.

One property of this approach is that strings may contain invalid Unicode code points. You might wish to perform a check later, but that later check means that Unicode validity checks are applied to strings only and not to the entire file.
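A minimal sketch of this byte-level approach: only the ASCII space and tab bytes separate tokens, and every other byte, including all bytes of multi-byte UTF-8 sequences, is passed through without decoding. `byte_tokens` and `valid_utf8?` are hypothetical names; the validity check stands in for the optional later pass.

```ruby
# Split on ASCII space (0x20) and tab (0x09) only; never decode code points.
def byte_tokens(source)
  tokens = []
  current = String.new(encoding: Encoding::BINARY)
  source.each_byte do |byte|
    if byte == 0x20 || byte == 0x09
      tokens << current unless current.empty?
      current = String.new(encoding: Encoding::BINARY)
    else
      current << byte # appends the raw byte, because `current` is a binary string
    end
  end
  tokens << current unless current.empty?
  tokens
end

# Optional later pass: check whether a token's bytes form valid UTF-8.
def valid_utf8?(token)
  token.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

tokens = byte_tokens("na\u00EFve\u00A0x = 1")
p tokens.length                 # NBSP does not split, so there are 3 tokens
p tokens.all? { |t| valid_utf8?(t) }
```

Note that NBSP ends up inside the first token here, which is the identifier behaviour described in the Ruby answer above.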
