STFU-8 is a hacky text encoding/decoding protocol for data that might not be
quite UTF-8 but is still mostly UTF-8. It is based on the syntax of the `repr`
created when you write (or print) binary text in Rust, Python, C or other
common programming languages.

Its primary purpose is to allow a human to visualize and edit
"data" that is mostly (or fully) visible UTF-8 text. It encodes all non-visible
or non-UTF-8-compliant bytes as longform text (e.g. ESC becomes the
full string `r"\x1B"`). It can also encode/decode ill-formed UTF-16.
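
The escaping idea can be sketched in a few lines of Rust. This is a minimal illustration built only on the standard library, not the `stfu8` crate's implementation: the name `sketch_encode_u8` is hypothetical, and treating every ASCII control character as "not visible" is an assumption made for the example.

```rust
/// Hypothetical helper (not part of the stfu8 crate): escape a byte slice so
/// that the result is always valid UTF-8 text.
fn sketch_encode_u8(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        // Find the longest valid UTF-8 prefix and the length of the
        // ill-formed sequence (if any) that follows it.
        let (valid, bad_len) = match std::str::from_utf8(bytes) {
            Ok(s) => (s, 0),
            Err(e) => {
                let prefix = &bytes[..e.valid_up_to()];
                // error_len() is None when the input ends mid-sequence, in
                // which case all remaining bytes are escaped.
                let bad = e.error_len().unwrap_or(bytes.len() - prefix.len());
                (std::str::from_utf8(prefix).unwrap(), bad)
            }
        };
        for c in valid.chars() {
            match c {
                '\\' => out.push_str("\\\\"),
                '\t' => out.push_str("\\t"),
                '\n' => out.push_str("\\n"),
                '\r' => out.push_str("\\r"),
                // Assumption: the remaining ASCII control characters (such
                // as ESC) count as "not visible" and get the \xXX form.
                c if c.is_ascii_control() => out.push_str(&format!("\\x{:02X}", c as u32)),
                c => out.push(c),
            }
        }
        // Bytes that are not UTF-8 are always written as \xXX.
        let rest = &bytes[valid.len()..];
        for b in &rest[..bad_len] {
            out.push_str(&format!("\\x{:02X}", b));
        }
        bytes = &rest[bad_len..];
    }
    out
}
```

With this sketch, the bytes of an ANSI reset sequence (`ESC`, `[`, `0`, `m`) come out as the editable text `\x1B[0m`, while ordinary UTF-8 text such as `þ` passes through unchanged.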
Comparison to other formats:
- `std::str`
- base64: also encodes binary
- `[0x72, 0x65, 0x61, 0x64, 0x20, 0x69, 0x74]`

In simple terms, encoded STFU-8 is itself always valid Unicode which decodes
to binary (the binary is not necessarily UTF-8). It differs from Unicode in
that a lone `\` is illegal. The following patterns are legal:
- `\\`: decodes to the backslash (`\`) byte (`\x5C`)
- `\t`: decodes to the tab byte (`\x09`)
- `\n`: decodes to the newline byte (`\x0A`)
- `\r`: decodes to the carriage-return byte (`\x0D`)
- `\xXX`, where `XX` are exactly two case-insensitive hexadecimal digits: decodes to the `\xXX` byte, where `XX` is a hexadecimal number (example: `\x9F`, `\xaB` or `\x05`). This never gets resolved into a code point; the value is copied verbatim into the decoder.
- `\uXXXXXX`, where `XXXXXX` are exactly six case-insensitive hexadecimal digits: decodes to a Unicode code point if the value is one. Otherwise `stfu8` will attempt to store the value into the decoder (if the value does not fit into the decoder's output type, decoding fails).

`stfu8` provides 2 different categories of functions for encoding/decoding data
that are not necessarily interoperable (don't decode output created from `encode_u8`
with `decode_u16`).

- `encode_u8(&[u8]) -> String` and `decode_u8(&str) -> Vec<u8>`: encodes or decodes `u8` values to/from STFU-8, primarily used for interfacing with data that is almost UTF-8.
- `encode_u16(&[u16]) -> String` and `decode_u16(&str) -> Vec<u16>`: encodes or decodes `u16` values to/from STFU-8, primarily used for data that is almost UTF-16 (i.e. possibly ill-formed UTF-16).

There are some general rules for encoding and decoding:
- If `\u...` cannot be resolved into a valid UTF code point it must fit into the decoder. For instance, decoding `"\u00DEED"` (which is a UTF-16 trail surrogate) with `decode_u8` will fail, but will succeed with `decode_u16`.
- Escaped values are never chained. For example, `"\x01\x02"` will be `[0x01, 0x02]`, not `[0x0102]`, even if you use `decode_u16`.
- Values escaped with `\x...` are always copied verbatim into the decoder. For example, `\xFE` is a valid UTF-32 code point, but if decoded with `decode_u8` it will be the single byte `0xFE` in the buffer, not two bytes of data as the UTF-8 character `'þ'`. Note that with `decode_u16`, `0xFE` is a valid UTF-16 code point, so the result is the `'þ'` character. Moral of the story: don't mix the `u8` and `u16` functions.

Tab, newline, and carriage-return characters are "visible", so encoding them in "pretty form" is optional.
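
The patterns and rules above can be sketched as a small decoder. This is a simplified illustration under stated assumptions, not the `stfu8` crate's code: `Token`, `tokenize`, `sketch_decode_u8` and `sketch_decode_u16` are hypothetical names, and errors are plain `String`s for brevity.

```rust
/// One decoded unit of (simplified) STFU-8 text. All names in this block are
/// hypothetical and only illustrate the rules described above.
enum Token {
    Text(char), // an unescaped character
    Raw(u8),    // \\, \t, \n, \r or \xXX: copied verbatim
    Code(u32),  // \uXXXXXX: a code point, or a raw value that must fit
}

fn tokenize(s: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = s.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '\\' {
            tokens.push(Token::Text(chars[i]));
            i += 1;
            continue;
        }
        // A backslash must be followed by one of the legal patterns.
        let kind = *chars.get(i + 1).ok_or("lone backslash at end of input")?;
        i += 2;
        let token = match kind {
            '\\' => Token::Raw(0x5C),
            't' => Token::Raw(0x09),
            'n' => Token::Raw(0x0A),
            'r' => Token::Raw(0x0D),
            'x' | 'u' => {
                // \x takes exactly two hex digits, \u exactly six.
                let len = if kind == 'x' { 2 } else { 6 };
                let digits = chars.get(i..i + len).ok_or("truncated escape")?;
                let mut value = 0u32;
                for c in digits {
                    value = value * 16 + c.to_digit(16).ok_or("expected a hex digit")?;
                }
                i += len;
                if kind == 'x' { Token::Raw(value as u8) } else { Token::Code(value) }
            }
            other => return Err(format!("illegal escape: \\{}", other)),
        };
        tokens.push(token);
    }
    Ok(tokens)
}

fn sketch_decode_u8(s: &str) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for token in tokenize(s)? {
        match token {
            Token::Text(c) => out.extend_from_slice(c.encode_utf8(&mut [0; 4]).as_bytes()),
            Token::Raw(b) => out.push(b), // never merged with its neighbours
            Token::Code(v) => match std::char::from_u32(v) {
                Some(c) => out.extend_from_slice(c.encode_utf8(&mut [0; 4]).as_bytes()),
                // Surrogates and out-of-range values are all above 0xFF.
                None => return Err(format!("\\u{:06X} does not fit into a u8", v)),
            },
        }
    }
    Ok(out)
}

fn sketch_decode_u16(s: &str) -> Result<Vec<u16>, String> {
    let mut out = Vec::new();
    for token in tokenize(s)? {
        match token {
            Token::Text(c) => out.extend_from_slice(c.encode_utf16(&mut [0; 2])),
            Token::Raw(b) => out.push(u16::from(b)), // one unit per \x escape
            Token::Code(v) => match std::char::from_u32(v) {
                Some(c) => out.extend_from_slice(c.encode_utf16(&mut [0; 2])),
                // An unpaired surrogate is not a code point but fits a u16.
                None if v <= 0xFFFF => out.push(v as u16),
                None => return Err(format!("\\u{:06X} does not fit into a u16", v)),
            },
        }
    }
    Ok(out)
}
```

In this sketch `"\u00DEED"` is rejected by `sketch_decode_u8` but becomes the single unit `0xDEED` with `sketch_decode_u16`, and `"\x01\x02"` always yields two separate values. Likewise `"\xFE"` decodes to the single byte `0xFE`, whereas `"\u0000FE"` decodes to the two UTF-8 bytes of `'þ'`.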
The problem is succinctly stated here:
http://unicode.org/faq/utf_bom.html
Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. [AF]
Also, from the WTF-8 spec:
As a result, [unpaired] surrogates do occur in practice and need to be preserved. For example:
In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed. Windows applications normally use UTF-16, but the file system treats path and file names as an opaque sequence of WCHARs (16-bit code units).
We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.
Basically: you can't (always) convert from UTF-16 to UTF-8 and it's a real bummer. WTF-8, while kind of an answer to this problem, doesn't allow me to serialize UTF-16 into a UTF-8 format, send it to my webapp, edit it (as a human), and send it back. That is what STFU-8 is for.
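
The problem, and the escaping approach, can be seen with the Rust standard library alone. The block below is a rough sketch: `sketch_encode_u16` is a hypothetical name, not the crate's function, and it only escapes backslashes and unpaired surrogates (a full encoder would presumably also escape non-visible characters).

```rust
/// Hypothetical helper (not part of the stfu8 crate): turn possibly
/// ill-formed UTF-16 into valid UTF-8 text without losing information.
fn sketch_encode_u16(units: &[u16]) -> String {
    let mut out = String::new();
    // decode_utf16 yields Ok(char) for well-formed data and Err for each
    // unpaired surrogate it encounters.
    for item in std::char::decode_utf16(units.iter().copied()) {
        match item {
            Ok('\\') => out.push_str("\\\\"),
            Ok(c) => out.push(c),
            // Keep the surrogate's value as a six-digit \u escape.
            Err(e) => out.push_str(&format!("\\u{:06X}", e.unpaired_surrogate())),
        }
    }
    out
}
```

For the ill-formed input `[0x0068, 0xD800, 0x0069]`, `String::from_utf16` returns an error and `String::from_utf16_lossy` replaces the surrogate with U+FFFD, so the original value is gone. The sketch instead produces the text `h\u00D800i`, which keeps the surrogate's value in a form a human can read and edit.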
The source code in this repository is licensed under either of
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
The STFU-8 protocol/specification itself (including the name) is licensed under CC0 (Creative Commons) and anyone should be able to reimplement or change it for any purpose without need of attribution. However, using the same name for a completely different protocol would probably confuse people, so please don't do it.