Tabs and spaces are the whitespace characters most widely supported by common keyboard layouts. Both are available in ASCII, and they are likely to survive systems that handle character encoding translation incorrectly, because they are also present in other common legacy encodings such as Windows-1252, SJIS, and EBCDIC.
Space (SP, U+0020)
Allowing only SP simplifies the calculation of column positions, both for source locations in diagnostic messages and for languages where indentation or alignment is significant. (However, see the later section on Unicode.)
This could be considered the minimal whitespace to allow and require, although in fact it’s quite possible to design a grammar where spaces aren’t significant at all. Notably, early versions of FORTRAN did not require spaces around keywords and symbols, and permitted spaces in identifiers. However, I won’t dwell on this, because it was considered a misfeature: a small typo could significantly change the interpretation of a line of code.
Tab (HT, U+0009)
Allowing HT requires handling the variable positioning of tab stops. A variety of approaches are possible, and I can’t list them exhaustively, but here are a few examples for illustration.
- Neither indentation nor alignment is significant
  - Column positions in error messages may use a fixed tab width parameter, or be computed based on byte offsets rather than graphemes
- Alignment is significant, using a fixed tab width (GHC Haskell uses 8 columns by default)
- Alignment is significant, only within a table column (treating a series of one or more HT characters as a column separator)
- Only indentation is significant; an indented region shares the same prefix of whitespace characters, which may begin with HT characters
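Two of these approaches can be sketched briefly. The following is a minimal illustration, not a reference implementation: `visual_column` assumes tab stops at every multiple of a fixed width, and `is_indented_under` compares whitespace prefixes character for character, so it needs no tab width at all. All function names here are hypothetical.

```python
def advance_column(column: int, ch: str, tab_width: int = 8) -> int:
    """Return the 0-based column reached after placing `ch` at `column`."""
    if ch == "\t":
        # A tab jumps to the next multiple of tab_width.
        return (column // tab_width + 1) * tab_width
    return column + 1


def visual_column(prefix: str, tab_width: int = 8) -> int:
    """Column reached after the text `prefix`, starting from column 0."""
    column = 0
    for ch in prefix:
        column = advance_column(column, ch, tab_width)
    return column


def leading_whitespace(line: str) -> str:
    """The exact run of SP and HT characters that starts `line`."""
    return line[: len(line) - len(line.lstrip(" \t"))]


def is_indented_under(parent: str, child: str) -> bool:
    """True if `child` is indented more deeply than `parent`.

    The child's leading whitespace must start with the parent's leading
    whitespace, character for character, and be strictly longer.
    """
    p, c = leading_whitespace(parent), leading_whitespace(child)
    return len(c) > len(p) and c.startswith(p)
```

With the prefix rule, a line indented with a tab followed by two spaces nests under a line indented with one tab, while a line indented with eight spaces does not, even though both may look identical in an editor.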
Non-breaking Space (NBSP, U+00A0)
A non-breaking space is simply a space character that does not permit automatic insertion of a line break at its position.
Allowing NBSP can improve usability in a few small ways. First, NBSP characters may frequently be inserted by some input methods. For example, traditional French typography places NBSP before certain punctuation ( : ! ? ) and within quotation marks ( ‹ › « » ); and on some keyboard layouts, hitting the Space key with a modifier such as Alt, Option, or AltGr will also insert NBSP.
Many professional programmers will use a US English keyboard layout to accommodate languages that were designed to be typed on such a layout. However, a beginner programmer is likely starting with the familiar layout of their native language and country, so they may insert NBSP characters unintentionally. Their text editor may give no clear indication that the whitespace is incorrect, and they may not notice even if it does, because they haven’t yet developed the clerical skills involved in programming text-based languages.
Furthermore, NBSP may be inserted for presentation reasons on tutorial websites and blogs, rather than (or in addition to) using a proper facility for preformatted/verbatim text, like the HTML <pre> tag. Programmers frequently copy and paste code examples from such websites, and they reasonably expect the code either to work or to produce a sensible diagnostic message.
As a rule of thumb, producing helpful diagnostics from a parser requires the parser to accept a larger language than is actually valid, so that it can describe specifically how the input isn’t valid.
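As an illustration of that rule of thumb, here is a small sketch of a pre-pass that accepts NBSP, records where it occurred, and hands ordinary spaces on to the rest of the parser. The `Diagnostic` type and the `scan_nbsp` function are hypothetical names, and columns are counted in code points for simplicity.

```python
from dataclasses import dataclass

NBSP = "\u00a0"


@dataclass
class Diagnostic:
    line: int      # 1-based line number
    column: int    # 1-based column, counted in code points
    message: str


def scan_nbsp(source: str) -> tuple[str, list[Diagnostic]]:
    """Replace each NBSP with SP and report where each one was found."""
    diagnostics: list[Diagnostic] = []
    cleaned_lines: list[str] = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == NBSP:
                diagnostics.append(Diagnostic(
                    line_no, col_no,
                    "non-breaking space (U+00A0) treated as an ordinary space",
                ))
        cleaned_lines.append(line.replace(NBSP, " "))
    return "\n".join(cleaned_lines), diagnostics
```

Whether the diagnostic is a warning, an error, or silent is a design decision; the point is only that the scanner has to recognise NBSP before it can say anything useful about it.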
Unicode
Allowing all Unicode whitespace may sound big and complicated, but in fact there isn’t much difficulty to deal with. The advantage of proper Unicode support is improved usability—code that is represented as text will break fewer user expectations by functioning predictably like other text.
For example, Unicode technical reports such as the Unicode line-breaking algorithm (UAX #14) are actually extremely informative as a reference for syntax design. The lack of a break opportunity between two characters tells you that a programmer is likely to interpret those characters as a single semantic unit if they appear side by side without any intervening whitespace.
For computing column positions for diagnostic message alignment in a terminal, you should use ICU extended grapheme clusters, or your platform’s wcwidth function. However, depending on alignment is best avoided in a terminal if possible, since emulator support for proper alignment can still be a bit dodgy, especially for East Asian scripts.
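Python's standard library has neither ICU nor `wcwidth`, so the sketch below is only a rough stand-in that shows the idea: it estimates terminal width from the `unicodedata` module, counting combining marks and format characters as zero columns and East Asian wide or fullwidth characters as two. The function names are hypothetical, and a real implementation should prefer a proper grapheme or `wcwidth` library.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate number of terminal columns occupied by one code point."""
    # Combining marks and format characters (e.g. ZWSP) take no column.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters usually take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_width(text: str) -> int:
    """Approximate terminal width of `text` (tabs and controls not handled)."""
    return sum(char_width(ch) for ch in text)
```

This per-code-point approach ignores emoji sequences and other multi-code-point clusters, which is one reason the terminal result can differ from what is computed.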
As for whitespace, the Unicode General Categories that we must consider are the following, which account for both horizontal and vertical spacing.
Zs: separator, space (17 characters)
This includes SP and NBSP, as well as U+3000 ideographic space, which is automatically entered by keyboard layouts for East Asian languages that use CJK ideographs, and thus beneficial to understand for the same reasons as for NBSP given above.
The remainder of these characters are various typographical spaces; they’re used more rarely, but there are no particular caveats about supporting them. Nominally they do have different widths, which could be significant for code typeset in a proportional-width typeface, but by far the majority of code is set in monospace / fixed width.
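To see exactly which characters fall into this category, one option is to enumerate them from the Unicode database that ships with your language runtime. A short sketch using Python's `unicodedata` module follows; the helper name `chars_in_category` is hypothetical, and the result reflects whichever Unicode version your Python build bundles.

```python
import sys
import unicodedata


def chars_in_category(category: str) -> list[str]:
    """All code points whose General Category equals `category`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


space_separators = chars_in_category("Zs")
for ch in space_separators:
    # Print each space separator as a code point and its Unicode name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```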
Zl/Zp: separator, line/paragraph (2 characters)
These categories contain only LS (U+2028 line separator) and PS (U+2029 paragraph separator), which can be treated as mandatory line breaks.
It used to be the case that JSON strings could contain LS/PS while JavaScript strings couldn’t; this was amended by a TC39 proposal, Subsume JSON.
Cc: other, control (65 characters)
Of these, the significant ones are HT (already covered) and the following:
- line feed / newline (LF, U+000A)
- line tabulation / vertical tab (VT, U+000B)
- form feed / new page (FF, U+000C)
- carriage return (CR, U+000D)
- next line (NEL, U+0085)
All of these can be treated as mandatory line breaks like LF; the only exception is that CR is not a break if followed by LF, to support CRLF sequences. I recommend normalising newlines before parsing to avoid this.
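A minimal sketch of such a normalisation pass is shown below. It assumes you also want to fold LS and PS into LF, as suggested earlier; the name `normalise_newlines` is hypothetical. The CRLF alternative is listed first so that the pair is consumed as a single break.

```python
import re

# CRLF must come first so that it is matched as one line break, not two.
_LINE_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def normalise_newlines(source: str) -> str:
    """Replace every mandatory line break with a single LF."""
    return _LINE_BREAK.sub("\n", source)
```

Note that normalising changes character offsets wherever CRLF occurs, so if diagnostics need to point back into the original bytes, the offsets have to be tracked separately.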
Besides CR and LF, the others are uncommon. FF used to be pretty widespread as a section separator in text files, and this is still seen occasionally, since it has some editor support, notably in Emacs. VT could be used as a row separator, if your sources might contain tabular alignment like I described for HT, but this currently has no editor support.
Cf: other, format (170 characters)
This category contains 170 characters, but almost all of them are insignificant for the purpose of code typography. Notable exceptions are these two:
zero-width space (ZWSP, U+200B)
Should be treated as a space, but should not advance the column position of source locations (though again, prefer grapheme clustering or wcwidth).
zero-width no-break space (ZWNBSP, U+FEFF)
May appear as a byte-order mark (BOM); should be ignored at the start of a file, and otherwise treated as a space.
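The handling of these two characters could be sketched roughly as follows. This assumes the source has already been decoded to a string, and all names (`strip_bom`, `split_words`, `column_of`) are hypothetical; counting columns in code points is a simplification, as noted above.

```python
import re

BOM = "\ufeff"    # zero-width no-break space / byte-order mark
ZWSP = "\u200b"   # zero-width space

# A "word" is a run of characters that are neither whitespace nor ZWSP/BOM.
_WORD = re.compile(r"[^\s\u200b\ufeff]+")


def strip_bom(source: str) -> str:
    """Drop a single BOM at the very start of the file, if present."""
    return source[1:] if source.startswith(BOM) else source


def split_words(text: str) -> list[str]:
    """Split on whitespace, treating ZWSP and a stray BOM as spaces too."""
    return _WORD.findall(text)


def column_of(line: str, index: int) -> int:
    """1-based column of line[index], ignoring zero-width spaces before it."""
    return 1 + sum(1 for ch in line[:index] if ch not in (ZWSP, BOM))
```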