https://invisible-island.net/xterm/bad-utf8/
Copyright © 2020,2024 by Thomas E. Dickey
Markus Kuhn added rudimentary support for UTF-8 to the Linux console in 1996. Later, in April 1999, he began adding support for UTF-8 to xterm (patch #97). Kuhn's initial implementation in the Linux console did rudimentary error checking, discarding unexpected input. His changes for xterm were adapted from the Linux console source, adding a comment:
    /* Combine UTF-8 into Unicode */
    /* Incomplete characters silently ignored,
     * should probably be better represented by U+fffc
     * (replacement character). */
Actually, the proper Unicode replacement character is U+FFFD, but that was a start. Shortly after, Kuhn provided a patch to use U+FFFD (in xterm patch #99).
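That error handling is simple to sketch. Here is a minimal illustration in C (not xterm's actual code; decode_byte and emit are names invented for this example) of a byte-at-a-time decoder which substitutes U+FFFD when a sequence is truncated or a byte cannot begin one. A conforming decoder also needs overlong, surrogate and range checks, omitted here:

    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu            /* U+FFFD REPLACEMENT CHARACTER */

    static unsigned acc;                   /* code point being accumulated */
    static int need;                       /* continuation bytes still expected */

    static void emit(unsigned cp) { printf("U+%04X\n", cp); }

    static void decode_byte(unsigned char ch)
    {
        if (need) {
            if ((ch & 0xC0) == 0x80) {     /* the expected continuation byte */
                acc = (acc << 6) | (ch & 0x3F);
                if (--need == 0)
                    emit(acc);
                return;
            }
            need = 0;                      /* truncated sequence */
            emit(REPLACEMENT);             /* ...then treat ch as a new start */
        }
        if (ch < 0x80)
            emit(ch);                      /* ASCII */
        else if ((ch & 0xE0) == 0xC0)
            acc = ch & 0x1F, need = 1;     /* 2-byte sequence */
        else if ((ch & 0xF0) == 0xE0)
            acc = ch & 0x0F, need = 2;     /* 3-byte sequence */
        else if ((ch & 0xF8) == 0xF0)
            acc = ch & 0x07, need = 3;     /* 4-byte sequence */
        else
            emit(REPLACEMENT);             /* stray continuation, 0xFE, 0xFF */
    }

    int main(void)
    {
        unsigned char sample[] = { 'a', 0xC2, 'b' };  /* "a", truncated C2, "b" */
        unsigned i;
        for (i = 0; i < sizeof(sample); ++i)
            decode_byte(sample[i]);        /* prints U+0061, U+FFFD, U+0062 */
        return 0;
    }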
Later that year, Kuhn created a demonstration file for valid input, and a test file for invalid input. He made minor improvements to both over the next few years, but no substantial revisions to account for changes in Unicode conformance requirements. The Internet Archive has a succession of versions of UTF-8-test.txt from Kuhn's website, but the file has changed only a half-dozen times. The Internet Archive lacks the first two versions; here is a complete set:
XTerm (barring the occasional bug report) works with that file. XTerm also works with the demonstration file UTF-8-demo.txt, but this page is about the former, the test-file.
Others created similar demonstrations, e.g., Frank da Cruz's UTF-8 Sampler at Columbia, part of the Kermit project. Originally written to promote Kermit 95, da Cruz refocused it in 2011, noting:
This, however, is a Web page, which started out as a kind of stress test for UTF-8 support in Web browsers, which was spotty when this page was first created but which has become standard in all modern browsers.
But developing a test suite for Unicode has been neglected. While the Unicode documentation takes about a hundred pages to describe how one might develop tests for conformance, it does not provide a reference implementation. As a result, developers have been presented with the opportunity of interpreting the (always incomplete) documentation. These provide interesting reading:
DRAFT - L2/02-149 Unicode Compliance Testing (2001)
In a section discussing ISO-8859:
If the converter distinguishes between illegal (source) values and unassigned values (in the target set), verify that the appropriate responses are generated:
- unassigned: U+0212, U+FFFD, U+10FFFD
- illegal: U+FFFF, U+10FFFF
That looks promising, until one reads it closely and realizes that the converter is only asked to recognize those invalid codes, not do anything in particular with them.
UNICODE CONFORMANCE MODEL (2008)
Section 2.5.1 says
A conformance test for the Unicode Standard is a list of data certified by the Unicode Technical Committee [UTC] to be "correct" with regard to some particular requirement for conformance to the standard.
and
A conformance verification test for the Unicode Standard is a test, usually designed and implemented by a third party not associated with the Unicode Consortium or the UTC, intended to test a product which claims conformance to one or more aspects of the Unicode Standard, for actual conformance to the standard.
That is, the emphasis is on handling correct data. How to handle errors is left to the implementors. That “usually” tells us that the Unicode organization is not going to develop a reference implementation.
Providing correct data (and mostly-complete procedures) is good for demonstration purposes, but applications have to handle error conditions consistently.
Kuhn's test file is simple enough. One tests the terminal by sending the test-file to the terminal, making it display the result. One gauges the success of the test by checking that a vertical bar “|” is in column 79.

Because it contains ill-formed UTF-8, some of the expected display will be the replacement character. The position of the vertical bar takes that into account.
The test file has a few problems:
Two of the test-file lines lack a vertical bar (sections 2.1.1 and 2.2.1). Kuhn left those out in the first revision of the file, in November 1999.
Kuhn apparently assumed that U+0080 would display like a space. Terminals do not do that; it is used for padding. Unicode agrees with that description, saying it is a control character.
Section 5 of the test-file uses codes that Unicode says explicitly are not characters. Kuhn apparently assumed those would display as a double-cell (aka fullwidth) character.
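For reference, the codes in question are the Unicode noncharacters, which at least are easy to test for. A short sketch (the ranges are from the Unicode standard; the function name is mine):

    #include <stdbool.h>

    /* True for the 66 Unicode noncharacters: U+FDD0..U+FDEF, plus the
     * last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...). */
    static bool is_noncharacter(unsigned long cp)
    {
        if (cp >= 0xFDD0 && cp <= 0xFDEF)
            return true;
        return (cp & 0xFFFE) == 0xFFFE && cp <= 0x10FFFF;
    }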
The test-file was intended for terminals, but after all, this is Unicode which is supposed to work everywhere—even with a web browser. Someone attempted to transform Kuhn's test file into a webpage, which can be seen on W3C's website. It looks different from xterm:
[side-by-side screenshots of the test-file: xterm (left), Firefox (right)]
On the Firefox side, the vertical bars do not line up. There are two reasons for this:
Firefox (and most web browsers) displays each byte of the ill-formed UTF-8 as a replacement character.
Even if one adjusts the test-file to account for that, it is not possible to make the vertical bars line up because Firefox does not have the replacement character available in a monospace font.
It turns out that is not just Firefox. Apparently font designers do not consider the replacement character useful.
There are other differences of course. Web browsers have no concept of control characters (aside from whitespace), so they are guaranteed to do the wrong thing when told to handle a padding character.
On xterm's side, some of the characters for which Firefox displays a replacement character are shown as empty boxes. For a while (from patch #233 to patch #334), xterm would have shown a replacement character, but the current scheme avoids doing that for characters which appear to be valid but are missing from the available fonts.
Comparing the results in a web browser was an issue to explore because Dan Gohman suggested changes to xterm's error handling which would have it imitate Firefox (or equivalently, act like a terminal whose developers imitate Firefox). That was motivated by a comment in the Unicode standard's chapter 3 on conformance:
which overlooked the following (quoting from Unicode 13):

U+FFFD Substitution of Maximal Subparts
An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements. See:
The Unicode Standard does not require this practice for conformance. The following text describes this practice and gives detailed examples.
The reference to W3C deals with the way Firefox displays multiple replacement characters. W3C's recommendation for Unicode is moot with regard to terminals, and actually not everyone agreed that it was suitable for browsers. For example, in this page Henri Sivonen disputes that; at the top of the page, he notes that the Unicode and ICU organizations amended their wording to avoid the appearance that W3C's approach is recommended (or “best practice”).
Ignoring that conclusion, the suggested changes might be relevant in terms of browser-like features imitated in some other terminals. In that regard, it could be useful as a resource setting for people who must deal with scripts which rely upon this feature.
I decided to explore this by comparing the results from Kuhn's test-file with and without Gohman's changes.
Because those changes are incompatible with longstanding practice (more than 20 years), I refactored Gohman's changes as a new resource setting, utf8Weblike.
By making the choice a resource, it is possible to make a test-script which exercises the terminal with/without the feature and compare the results. My script (called bad-utf8) uses the terminal's cursor movement and reporting controls to determine where the vertical bar is, and produces a copy of Kuhn's test-file which would give the expected results (by adding or removing spaces before the vertical bar).
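The reporting control in question is the VT100 cursor-position report: the host writes CSI 6 n, and the terminal replies CSI row ; column R. A sketch of that exchange in C (bad-utf8 itself is a script; cursor_column is a name invented here):

    #include <stdio.h>
    #include <termios.h>
    #include <unistd.h>

    /* Ask the terminal where the cursor is: send ESC [ 6 n, then read
     * the reply ESC [ row ; col R.  Returns the column, or -1. */
    static int cursor_column(void)
    {
        struct termios saved, raw;
        char reply[32];
        int row = 0, col = 0;
        size_t n = 0;

        tcgetattr(STDIN_FILENO, &saved);
        raw = saved;
        raw.c_lflag &= ~(ICANON | ECHO);   /* read the report unbuffered */
        raw.c_cc[VMIN] = 1;
        raw.c_cc[VTIME] = 0;
        tcsetattr(STDIN_FILENO, TCSANOW, &raw);

        (void) write(STDOUT_FILENO, "\033[6n", 4);
        while (n < sizeof(reply) - 1 && read(STDIN_FILENO, reply + n, 1) == 1)
            if (reply[n++] == 'R')         /* final byte of the report */
                break;
        reply[n] = '\0';

        tcsetattr(STDIN_FILENO, TCSANOW, &saved);
        return (sscanf(reply, "\033[%d;%dR", &row, &col) == 2) ? col : -1;
    }

Printing a line of the test-file and then asking for the cursor position tells the script which column the vertical bar actually landed in; the difference from the expected column is the adjustment.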
Because those controls are part of the common VT100 repertoire, that script can also run successfully on other terminals.
I used the script to collect information on these terminals:
xterm (with utf8Weblike enabled)

Twenty should be enough. There are other terminals, but they have no distributed packages (or they are duplicates).
For testing, I used these machines (aside from xterm, the package versions cited above were current on those):
Kuhn's test-file numbers each of the test cases. My script reports “0” for each terminal if the test matched exactly. Each line in the test-file which does not match exactly adds one to that count. Most test cases have only one line of data; a few (e.g., 3.1.9, 5.3.3 and 5.3.4) have more than one.
Linux console, xterm (i.e., patch #358) and PuTTY were the only terminals which matched Kuhn's test-file closely (5-6 differences out of 98 data lines).
The table is here.
The bad-utf8 script constructs a file which could be sent to the corresponding terminal with the vertical bars aligned. It does this by adding spaces.
Counting the number of spaces needed may give more insight into how closely related the terminals are in their error handling. Reviewing the results, you may see that the relative ranking does not change appreciably.
The table is here.
I wrote another script, diff-utf8, to process the data collected with bad-utf8.
Beyond seeing how closely a given terminal matches the assumptions in Kuhn's test file, one might want to know if there are groups of terminals which give similar results. It turns out that there are two groups (more than two terminals having close matches):
xterm.js, vte-0.60 and xterm (with utf8Weblike enabled) are identical.
Linux console, PuTTY, xterm (patch #358) are the same except for two problematic test cases (see Analysis):
- 2.1.2  2 bytes (U-00000080)
- 2.2.1  1 byte (U-0000007F)

XTerm (and any VT100-compatible terminal) should treat the corresponding single-byte values as nonprinting. The latter is a single byte, but the former becomes 0xC2, 0x80. Where the three differ: 0x7F was handled by moving the cursor left one column (bad-utf8 added a space to compensate), and U+0080 by moving the cursor right one column (bad-utf8 removed a space to compensate).

A few other terminals are close to one of those groups. Most are not close.
The table is here.
That table is large, and you may not pick out the pattern easily. In vile, I can see this using the editor's highlighting:
To explore this, I added a report to diff-utf8. Initially, I had only tested vte-0.60, until noticing that adding an earlier version would help explain the relationships among these terminals:
    ** pairwise report
    .. level 0
        xterm-w3c vs vte-0.60
        xterm.js vs vte-0.60
        xterm.js vs xterm-w3c
    .. level 1
        putty vs linux (2.2.1)
        vte-0.46 vs konsole (2.1.2)
        vte-0.60 vs macos (2.1.2)
        xterm-358 vs linux (2.1.2)
        xterm-w3c vs macos (2.1.2)
        xterm.js vs macos (2.1.2)
    .. level 2
        vte-0.46 vs macos (3.3.7, 3.4)
        vte-0.60 vs konsole (3.3.7, 3.4)
        xterm-358 vs putty (2.1.2, 2.2.1)
        xterm-w3c vs konsole (3.3.7, 3.4)
        xterm.js vs konsole (3.3.7, 3.4)
    .. level 3
        macos vs konsole (2.1.2, 3.3.7, 3.4)
        vte-0.60 vs vte-0.46 (2.1.2, 3.3.7, 3.4)
        xterm-w3c vs vte-0.46 (2.1.2, 3.3.7, 3.4)
        xterm.js vs vte-0.46 (2.1.2, 3.3.7, 3.4)
    .. level 4
    .. level 5
    .. level 6
    .. level 7
    .. level 8
    .. level 9
        macos vs iTerm2 (2.2.1, 3.3.2, 3.3.3, 3.3.4, 3.3.5, 3.3.8, 3.3.9, 3.3.10, 3.4)
Recapitulating that in words: xterm.js, vte-0.60 and xterm (with utf8Weblike enabled) give the same result.
The differences in xterm patch #344 versus patch #358 bear explanation. Patch #344 happens to be the version provided in the current Debian 10 (stable), so it was useful for comparison. Before Gohman's suggested changes, he submitted a bug-report which required some scrutiny of Kuhn's test-file, since it pointed out an additional case overlooked in the fixes for patch #268. Since this was before writing bad-utf8, with visual inspection alone it was easy to overlook an additional problem introduced in patch #328. Both are fixed in patch #357, but patch #358 was the most recently published version of xterm when this page was created.
The history of other terminals' handling of ill-formed UTF-8 is just as complicated. Confining it to the story of how web-browser behavior came to be relevant to terminals, we have this:
Konsole's developers made the initial change in 2013:
    commit 8dd47e34b9b96ac27a99cdcf10b8aec506882fc2
    Author: Thiago Macieira <thiago.macieira@intel.com>
    Date:   Sun Oct 20 17:43:46 2013 +0100

        Add a new UTF-8 decoder, similar to the encoder we've just added

        Like before, this is taken from the existing QUrl code and is
        optimized for ASCII handling (for the same reasons). And like
        previously, make QString::fromUtf8 use a stateless version of the
        codec, which is faster.

        There's a small change in behavior in the decoding: we insert a
        U+FFFD for each byte that cannot be decoded properly. Previously, it
        would "eat" all bad high-bit bytes and replace them all with one
        single U+FFFD. Either behavior is allowed by the UTF-8
        specifications, even though this new behavior will cause
        misalignment in the Bradley Kuhn sample UTF-8 text.

        Change-Id: Ib1b1f0b4291293bab345acaf376e00204ed87565
        Reviewed-by: Olivier Goffart <ogoffart@woboq.com>
        Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>

    commit d51130cc3a00df8147e2eb0799e06865c901c6e0
    Author: Thiago Macieira <thiago.macieira@intel.com>
    Date:   Sat Oct 19 18:54:55 2013 -0400

        Add a new UTF-8 encoder and use it from QString

        This is a new and faster UTF-8 encoder, based on the code from QUrl.
        This code specializes for ASCII, which is the most common case
        anyway, especially since QString's "ascii" mode is actually UTF-8
        now.

        In addition, make QString::toUtf8 use a stateless encoder. Stateless
        means that the function doesn't handle state between calls in the
        form of QTextCodec::ConverterState. This allows it to be faster than
        otherwise.

        The new code is in the form of a template so that it can be used
        from QJsonDocument and QUrl, which have small modifications to how
        the encoding is handled.

        Change-Id: I305ee0fd8523cc4ec74c2678cb9ea88b75bac7ac
        Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
There are a few problems with that:
Bradley Kuhn is a different person (the test-file is Markus Kuhn's). Really, Bradley is unlikely to ever contribute to one of these projects.
Lacking other information, the reader is left with the understanding that the reason for this change was the convenience of copy/pasting from one Qt class into another. One alternative (that this was the only way to make it faster) is implausible, since returning multiple replacement characters would increase the number of memory-allocation calls.
I looked for this change in git://code.qt.io/qt/qt5.git, but did not find it. Based on the commit-date, one assumes that this was in Qt 5.2; however, the Git repository has no tag for “5.2” (but “5.3” is close enough).
VTE 0.46 was tagged 2016-08-15 (Debian 9's package says 0.46, but actually uses 0.45.90, since no "0.46" was tagged).
This version of VTE used GIConv (part of GLib), which in turn simply uses iconv.
Because it used iconv, error-handling was left to the discretion of the application (VTE). It provided a replacement character on each error, though iconv's behavior is unsatisfactory (i.e., iconv is not helpful when it comes to determining how many bytes to skip on ill-formed data).
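The shape of the problem is visible in iconv's API: on ill-formed input, iconv(3) fails with EILSEQ and leaves the input pointer at the offending byte, but it says nothing about how long the ill-formed sequence is, so the caller must pick a resynchronization rule itself. A sketch, skipping one byte per error (only one of the possible choices):

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>

    /* Convert UTF-8 to UTF-32, substituting U+FFFD on error.  iconv(3)
     * stops at the offending byte but gives no length; this sketch
     * resynchronizes by skipping exactly one byte. */
    static void convert(const char *text, size_t len)
    {
        iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
        char *in = (char *) text;
        size_t inleft = len;

        while (inleft != 0) {
            char out[256];
            char *outp = out;
            size_t outleft = sizeof(out);

            if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t) -1) {
                if (errno == EILSEQ) {
                    ++in, --inleft;        /* skip one byte, emit U+FFFD */
                } else if (errno == EINVAL) {
                    inleft = 0;            /* truncated sequence at the end */
                }
                /* errno == E2BIG: output buffer full; loop to drain more */
            }
            /* ... consume the (sizeof(out) - outleft) bytes in out[] ... */
        }
        iconv_close(cd);
    }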
Independently of this, the Unicode and W3C organizations came up with guidance in their respective areas:
Best Practices for Using U+FFFD. When using U+FFFD to replace ill-formed subsequences encountered during conversion, there are various logically possible approaches to associate U+FFFD with all or part of an ill-formed subsequence. To promote interoperability in the implementation of conversion processes, the Unicode Standard recommends a particular best practice. The following definitions simplify the discussion of this best practice:
concluding with
Neither of the code units <80> or <BF> in the sequence <63 80 BF 64> is the start of a potentially well-formed sequence; therefore each of them is separately replaced by U+FFFD. For a discussion of the generalization of this approach for conversion of other character sets to Unicode, see Section 5.22, Best Practice for U+FFFD Substitution.
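That final example is easy to work through in code. Here is a sketch contrasting the two policies on <63 80 BF 64> (neither branch is any particular terminal's decoder; only ASCII and stray continuation bytes are handled):

    #include <stdbool.h>
    #include <stdio.h>

    /* per_byte = true:  one U+FFFD per ill-formed byte; for this input the
     *                    same as "maximal subparts", since each stray
     *                    continuation byte is its own maximal subpart.
     * per_byte = false: one U+FFFD per run of ill-formed bytes (the older
     *                    behavior described in the Qt commit above). */
    static void decode(const unsigned char *s, size_t len, bool per_byte)
    {
        size_t i = 0;
        while (i < len) {
            if (s[i] < 0x80) {
                printf("%c ", s[i++]);
            } else {
                printf("U+FFFD ");
                if (per_byte)
                    i++;
                else
                    while (i < len && s[i] >= 0x80)
                        i++;               /* swallow the whole run */
            }
        }
        printf("\n");
    }

    int main(void)
    {
        const unsigned char example[] = { 0x63, 0x80, 0xBF, 0x64 };
        decode(example, 4, true);          /* prints: c U+FFFD U+FFFD d */
        decode(example, 4, false);         /* prints: c U+FFFD d */
        return 0;
    }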
That recommendation first appeared in Unicode 6's chapter 3 on conformance (February 2011).
However, the comments about “best practice” were removed in Unicode 11.0.0 (June 2018).
The W3C WHATWG page entitled Encoding Standard started in January 2013. It notes
The constraints in the utf-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged).
Although Unicode withdrew the recommendation more than two years ago, to date (August 2020) that is not yet corrected in the WHATWG page.
The vte-0.60 changes date from 2018-09-02, with this:
    commit b8b1aa4ed5ef12368c5c3f6d85ebf3e1d72f91a8
    Author: Christian Persch <chpe@src.gnome.org>
    Date:   Mon Sep 3 16:10:51 2018 +0200

        utf8: Make decoder conform to recommendation on replacement characters

        With this change, the decoder conforms to the W3 Encoding TR and the
        Unicode recommendation on inserting replacement characters from §3.9
        of the Unicode core spec.

        https://gitlab.gnome.org/GNOME/vte/issues/30
That particular issue has been deleted, so it is not possible to determine who suggested the change, etc. Most of the change consisted of copying test-data from Henri Sivonen's encoding_rs project.
Sivonen provides a lengthy description in encoding_rs: a Web-Compatible Character Encoding Library in Rust.
At the time this change was applied, there was no longer a relevant Unicode recommendation; that had changed the previous year.
The developers for xterm.js made an improved Utf8ToUtf32 class in June 2019. Reading the source, there are a few differences in its boundary conditions versus the WHATWG code, but (see below) it gives an equivalent result using Sivonen's test-data.
Gohman brought the WHATWG page to my attention, as part of explaining why he thought it was a feature that xterm should adopt, and provided patches to implement it. While investigating this, I found that it was no longer recommended by the Unicode organization.
Seeing that one of the pitfalls here was the absence of testing for the whole process, I added scripts to review differences between terminals.
In reviewing Gohman's changes, I noticed that he had removed the check which transformed surrogates into the replacement character. The WHATWG process for decoding UTF-8 does not mention surrogates, but it is possible to deduce that by a suitable test-case (lacking in that document). Henri Sivonen's code (with test-cases) was useful for that purpose.
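The deduction is straightforward: UTF-8 can mechanically encode the surrogates U+D800..U+DFFF as three-byte sequences (ED A0 80 through ED BF BF), but well-formed UTF-8 excludes them, so a decoder must reject them explicitly. A sketch of the missing check (is_surrogate is a name invented here; the test bytes are the kind of case Sivonen's data covers):

    #include <stdbool.h>

    /* Well-formed UTF-8 excludes U+D800..U+DFFF; a decoder which
     * mechanically assembles ED A0 80 into U+D800 must substitute
     * U+FFFD.  Test input for that case: { 0xED, 0xA0, 0x80 }. */
    static bool is_surrogate(unsigned long cp)
    {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }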
The bad-utf8 script works by printing a line from the test-file, finding where the cursor is after printing, and computing an adjustment. It attempts to handle line-wrapping, but in practice some of the terminals tested differ too much in their handling of wrapping to make that work reliably. Running the script on a terminal sized at least 90 columns wide gives good results. I chose the Arch Linux console for testing because it had been set up as a wide display.
Besides wrapping, one must be careful to not interrupt the terminal while it is processing the test-cases, because that can interfere with the cursor-position report.
The bad-utf8 script updates a CSV-file for each completed test on a given terminal, while also writing a copy of the adjusted test-file. There are two CSV-files, chosen according to whether an option is used to tell it to report test successes or the amount of adjustment needed. The adjusted test-files in either case should be identical. I ran the bad-utf8 script more than once, ensuring that those files were in fact identical (indicating that no data was lost or corrupted due to timing or inadvertent interference with the cursor-position reports).