A Note on W-plus String Encoding

Since a string can in general contain arbitrary bytes, there are various reasons why we may want to encode it into a more manageable form, such as some kind of text format. Encodings of this kind are sometimes called binary-to-text encoding schemes. We also usually want to be able to decode an encoded string back into its original form.

A very useful encoding of this kind would be one that encodes a binary string into a string of letters, digits, and the underscore character (`_'), since these are rarely treated as special characters, or so-called meta-characters. Strings encoded in this way can be used, for example, as parts of identifiers in programming languages or as parts of URLs. They are matched by the regular expression /\w+/, and for this reason let us call encodings of this kind w-plus encodings.

It is easy to come up with such encodings. One example is to encode each byte as a pair of hexadecimal digits, which we could call the hexadecimal encoding. This encoding is not very intelligible, in the sense that an obvious textual string such as "hello world" is encoded in an unrecognizable way: "68656C6C6F20776F726C64". We will consider a w-plus encoding intelligible if it preserves most of the original letters, digits, and underscore characters. Additionally, we would prefer an encoding that is more economical in length than the hexadecimal encoding, which always doubles the length of the string.
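For comparison, the hexadecimal encoding mentioned above can be sketched in Perl as follows. The function names encode_hex and decode_hex are hypothetical and chosen only for this illustration; the sketch assumes that the input is a string of bytes.

```perl
use strict;
use warnings;

# Hypothetical helper names, used only for this illustration.
sub encode_hex {
    my ($bytes) = @_;
    # "H*" unpacks every byte as two hexadecimal digits, high nybble first.
    return uc unpack("H*", $bytes);
}

sub decode_hex {
    my ($hex) = @_;
    # "H*" in pack is the inverse: two hexadecimal digits become one byte.
    return pack("H*", $hex);
}

print encode_hex("hello world"), "\n";   # prints 68656C6C6F20776F726C64
```

The output is always exactly twice as long as the input, and none of the original letters survive, which is what motivates the encoding described next.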

W-plus Encoding

One simple encoding that is more intelligible and more space-preserving than the hexadecimal encoding is one in which we preserve all word characters (letters, digits, and underscore) except one, which we use as the `escape' character: it signals that the original character is encoded in the following two hexadecimal digits. Every character that is not preserved, including the escape character itself, is written as the escape character followed by the two hexadecimal digits of its byte value. One obvious choice for the escape character would be the underscore; however, the lowercase letter `x' is also a good choice, since it appears relatively infrequently in typical English text and it conveniently reminds us of the hexadecimal code in the two characters that follow. It is also used in Perl, C, and some other languages to indicate hexadecimal numbers, as in `0x1f'.

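So, the W-plus encodings of the strings ``hello world'' and ``hexadecimal numbers'' are ``hellox20world'' and ``hex78adecimalx20numbers'', respectively: the space (byte 0x20) becomes ``x20'', and the letter `x' itself (byte 0x78) becomes ``x78''.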
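To illustrate the choice of the escape character discussed above, here is a minimal sketch in which the escape character is a parameter. The function encode_with_escape is hypothetical and is not part of the W-plus encoding as defined in this note, which always uses `x'; the sketch assumes byte strings and a single word character as the escape.

```perl
use strict;
use warnings;

# Hypothetical generalization: $esc is the escape character to use.
sub encode_with_escape {
    my ($s, $esc) = @_;
    # Escape every non-word character and the escape character itself.
    $s =~ s/(\W|\Q$esc\E)/$esc . uc unpack("H2", $1)/ge;
    return $s;
}

print encode_with_escape("hello_world x", "x"), "\n";  # prints hello_worldx20x78
print encode_with_escape("hello_world x", "_"), "\n";  # prints hello_5Fworld_20x
```

With the underscore as the escape, identifier-like strings such as ``hello_world'' grow; with `x', only text containing the letter `x' grows.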

W-plus Encoding in Perl

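Another advantage of the W-plus encoding is that encoding and decoding are very easy to implement, with minimal code, in Perl (and also in C, and likely in other languages). The substitutions below operate on the string in $_ and assume that it holds bytes rather than decoded Unicode characters. W-plus encoding of a string in Perl can be performed with the following substitution: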

  s/[\Wx]/'x'.uc unpack("H2",$&)/ge;

or as the following function encode_w:

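sub encode_w {
  local $_ = shift;
  # Replace every non-word character, and `x' itself, with `x' plus
  # the two uppercase hexadecimal digits of its byte value.
  s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
  return $_;
}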
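As a quick check, the following self-contained script repeats encode_w and applies it to the two example strings from this note. The third input, "a/b.c", is an illustrative string that does not come from the note, and the script assumes byte strings.

```perl
use strict;
use warnings;

sub encode_w {
    local $_ = shift;
    s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
    return $_;
}

for my $s ("hello world", "hexadecimal numbers", "a/b.c") {
    my $enc = encode_w($s);
    # A non-empty encoded string should consist of word characters only.
    my $ok = $enc =~ /\A\w+\z/ ? "w-plus" : "NOT w-plus";
    print "$s -> $enc ($ok)\n";
}
# prints:
#   hello world -> hellox20world (w-plus)
#   hexadecimal numbers -> hex78adecimalx20numbers (w-plus)
#   a/b.c -> ax2Fbx2Ec (w-plus)
```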

W-plus Decoding in Perl

W-plus decoding in Perl can be done with the following substitution:

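  s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;

or as the following function decode_w:

sub decode_w {
  local $_ = shift;
  # Turn each `x' followed by two hexadecimal digits back into one byte.
  # "C" packs an unsigned byte (0..255).
  s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
  return $_;
}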
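A round-trip check over all 256 byte values can be sketched as follows. This is only a test sketch under the assumption of byte strings. It repeats both functions so that it is self-contained, and its decode_w uses the unsigned "C" template of pack so that bytes above 0x7F do not produce a wrap warning under `use warnings'.

```perl
use strict;
use warnings;

sub encode_w {
    my ($s) = @_;
    $s =~ s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
    return $s;
}

sub decode_w {
    my ($s) = @_;
    # "C" is the unsigned 8-bit template of pack.
    $s =~ s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
    return $s;
}

# A byte string containing every byte value exactly once.
my $all = join '', map { chr($_) } 0 .. 255;
my $enc = encode_w($all);

print "encoded is w-plus: ", ($enc =~ /\A\w+\z/ ? "yes" : "no"), "\n";
print "round trip ok: ", (decode_w($enc) eq $all ? "yes" : "no"), "\n";
print "lengths: ", length($all), " -> ", length($enc), "\n";
```

In this test only 62 of the 256 byte values are preserved and the rest take three characters each, so the encoded string has 644 characters. The space advantage over the hexadecimal encoding therefore holds for text-like input rather than for arbitrary binary data.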

created: 2020-05-17, last update: 2020-05-18