Heart Cat Emoji

Emoji. They’re everywhere these days. If your application takes user input, guess what? You’re going to get them. They come in text like this: 😻, a character first defined in version 6.0 of the Unicode standard. According to its Unicode table entry, its value is U+1F63B. It may be coded in an HTML document as &#x1F63B;.

You’re likely familiar with Unicode text encodings. In the popular encoding, UTF-8, the emoji requires four bytes. Their values are F0 9F 98 BB.
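You can check both of those claims for yourself; in Ruby, for example,

irb 0> "😻".ord.to_s(16)
=> "1f63b"
irb 0> "😻".bytes.collect { |b| sprintf("%02x", b) }.join(' ')
=> "f0 9f 98 bb"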

UTF-8 was developed by no less than Ken Thompson and Rob Pike, both alumni of the famous Bell Labs, where so many of the foundations of computing were laid. Thompson wrote the first Unix and collaborated on the development of the C programming language. The two are part of the trio at Google that developed the programming language Go.

UTF-8 is the most popular Unicode encoding scheme on the planet. (Who knows about elsewhere?) By a large margin. As in, almost everybody uses it. You use it. I’ll bet.

mySQL

If you use mySQL, also wildly popular but losing ground to Postgres, so I hear, you might have been seduced by the character encoding ‘utf8’ for the text and string valued columns of your data tables. Lovely. Marvelous. UTF-8 everywhere!

If you were, as I was, you might have missed a little tiny detail. The utf8 character encoding in mySQL uses (up to) three bytes per character. That’s a three. The cat emoji you try to put into the string column of a record in your table will whisper to you a sweet nothing like,

# Mysql2::Error:
#   Incorrect string value: '\xF0\x9F\x98\xBBysto...'

(Not exactly that, but something like it. You get the idea.) What’s up? Well, we just tried to stuff four bytes of perfectly legitimate UTF-8 encoded cat into a three byte crate. Meow 😾. Cat scratch fever.
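You can reproduce the complaint in isolation at a mySQL prompt with a throwaway table. A minimal sketch (the table and column names are made up, and strict SQL mode is assumed; without it, mySQL merely warns and mangles the value):

CREATE TABLE cats (note VARCHAR(20)) CHARACTER SET utf8;
-- With strict SQL mode, this INSERT fails with error 1366,
-- "Incorrect string value", just like the error above.
INSERT INTO cats (note) VALUES ('😻');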

Checking in with the mySQL 5.7 reference manual, section 10.10, or opening a mySQL command prompt, typing SHOW CHARACTER SET; and scanning the list of available sets, we find that little three in the Maxlen column for utf8. Oh the woe.

What we have to do to properly support UTF-8 content in mySQL is apply the character encoding utf8mb4. It has the same description as utf8, “UTF-8 Unicode”, only, happily, with a four in the Maxlen column.
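At the prompt, trimmed to the two rows of interest, the comparison looks something like this:

mysql> SHOW CHARACTER SET LIKE 'utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+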

Testing

The other day (well, yesterday), while updating some tests for iaccdb, I picked up a mySQL error like the one above. It happened because the table, by some accident of inattention, had picked up a latin1 character encoding. The story would be much the same for a utf8 encoding, only slightly less drastic. The other tables were encoded utf8.

The error had never occurred before because all of my test data was pretty much ASCII. UTF-8 encoded, but not straying from the easy-for-everybody ASCII portion of the set.

What had changed was that I had entertained myself by fixing up some of the test data using Faker::Space.meteorite (and Faker::Space.nasa_space_craft). That data set has a couple of entries with UTF-8 two-byte characters, including for example, “Sołtmany” and “Příbram”. These aren’t going to shoehorn into latin1, not any day.
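For the curious, the test data tweak looked roughly like this. A sketch only; the model and attribute names here stand in for iaccdb’s real ones, which may differ.

# Seed a handful of records with fancier, non-ASCII values from the faker gem.
require 'faker'

10.times do
  MakeModel.create!(
    make:  Faker::Space.nasa_space_craft,  # e.g. "Orion"
    model: Faker::Space.meteorite          # e.g. "Sołtmany", "Příbram"
  )
end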

First, I thought, “You idiot, why did you go getting all fancy with the test data? Leave well enough alone.” For a moment, I felt tempted to revert to a tamer data set for testing. But only for a moment.

What had I done? I’d uncovered a bug, albeit accidentally, by making my tests fancier. That is, assuming I want the application to support values like “Łowicz”, or emoji like 🛩️. It should. The only right thing to do was to fix it.

Repairing the encoding

Skipping back to the mySQL documentation, there’s lots of great information about character encodings and ways to change them, including a section on how to convert a table:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;

The key here is that the statement doesn’t merely set a new default for new columns, but converts all existing text and string columns to the changed encoding. There are a few caveats about this.

If you already have a lot of data in your live, production database, this is going to take some time. It’s going to be down time, because at the very least, writes to the table will be locked. In my case, the production database has fewer than two thousand records in the affected tables. If your tables have millions of records, you’re going to have to get creative.

If a column has latin1 encoding but actually contains UTF-8 encoded data, which is possible but not probable, the CONVERT operation will double-encode the multibyte characters in strange ways. You have to verify, for example, by inspecting the hex of the stored values:

mysql> select distinct make, hex(make) from make_models \
  where make is not null and 0 < length(make) limit 12;
+------------+----------------------+
| make       | hex(make)            |
+------------+----------------------+
| 107        | 313037               |
| 336        | 333336               |
| 8K         | 384B                 |
| 8K  CAB    | 384B2020434142       |
| 8KCAB      | 384B434142           |
| ACA        | 414341               |
| ACA8KCAB   | 414341384B434142     |
| Acro       | 4163726F             |
| Acro One   | 4163726F204F6E65     |
| Acro Sport | 4163726F2053706F7274 |
| Acrosport  | 4163726F73706F7274   |
| Aeronca    | 4165726F6E6361       |
+------------+----------------------+

and then compare with the encoded values, in Ruby for example,

irb 0> "Aeronca".encode('ISO-8859-1').bytes.collect { |b|
irb 1*   sprintf("%02x", b) }.join
=> "4165726f6e6361"
irb 0> "Aeronca".encode('UTF-8').bytes.collect { |b|
irb 1*   sprintf("%02x", b) }.join
=> "4165726f6e6361"
irb 0> "Aéróncá".encode('ISO-8859-1').bytes.collect { |b|
irb 1*   sprintf("%02x", b) }.join
=> "41e972f36e63e1"
irb 0> "Aéróncá".encode('UTF-8').bytes.collect { |b|
irb 1*   sprintf("%02x", b) }.join
=> "41c3a972c3b36e63c3a1"

You might want to insert a multi-byte value, like Aéróncá, into the table if, as in the example, there are none.
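Something like this will do it, using the make_models table from above (a sketch; other required columns, if any, are omitted):

mysql> INSERT INTO make_models (make) VALUES ('Aéróncá');
mysql> SELECT make, HEX(make) FROM make_models WHERE make = 'Aéróncá';

A latin1 column should show 41E972F36E63E1; a utf8 or utf8mb4 column, 41C3A972C3B36E63C3A1, the same bytes as the irb session above.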

Collation

It’s necessary to ensure that the collation used by the new character encoding is available on the instance of mySQL running your production databases. In my case, the conversion on my development machine produced the default collation utf8mb4_0900_ai_ci. That collation was not present on Travis (which caught it), running mySQL 5.7 on Ubuntu Xenial (16.04).

The available collations may be determined by invoking SHOW COLLATION; at a mySQL prompt. In this case, there are a number of locale-specific collations available, and a few general ones. According to the mySQL documentation section 10.1.1, Unicode Character Sets, the utf8mb4_general_ci collation is fastest, but it only compares characters one at a time. The utf8mb4_unicode_ci collation properly handles combining characters for most languages, and the utf8mb4_unicode_520_ci collation implements Unicode Collation Algorithm 5.2.0.
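To see exactly what a given server offers, narrow the list:

-- Run on each server (development, CI, production) and compare.
SHOW COLLATION WHERE Charset = 'utf8mb4';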

In order to get a collation that’s bound to work, I had to update the ALTER TABLE statements as follows:

ALTER TABLE `airplanes` CONVERT TO CHARACTER SET `utf8mb4` \
  COLLATE `utf8mb4_unicode_ci`;
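In a Rails project like iaccdb, a migration is a handy place to run these statements. A minimal sketch; the migration class, its version, and the table list are illustrative, not lifted from iaccdb:

class ConvertTablesToUtf8mb4 < ActiveRecord::Migration[5.2]
  def up
    %w[airplanes make_models].each do |table|
      execute "ALTER TABLE `#{table}` CONVERT TO CHARACTER SET `utf8mb4` " \
              "COLLATE `utf8mb4_unicode_ci`;"
    end
  end

  def down
    # Converting back risks mangling four-byte data; treat it as one-way.
    raise ActiveRecord::IrreversibleMigration
  end
end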

Column and key lengths

Other than the time required to lock down a table and convert it, there can be space changes. Depending on how many records you have in your tables, they could be significant. Changes in column lengths have other implications as well.

The maximum length of text and string columns, measured in bytes, is going to increase by as much as a factor of four. If your current character encoding is a one-byte encoding, it will be the full factor of four. If your current encoding is ‘utf8’, which requires up to three bytes per character, the multiplier will be four-thirds (4/3).

One of the impacts might be on your keys, as happened to me:

ActiveRecord::StatementInvalid: Mysql2::Error: \
  Specified key was too long; max key length is 767 bytes: \
  ALTER TABLE `make_models` CONVERT TO CHARACTER SET `utf8mb4` \
  COLLATE `utf8mb4_unicode_ci`;

To repair it, I had to shorten the columns involved in the key by fixing a smaller maximum length. This didn’t truncate any data; the lengths were far larger to begin with than required for any of the expected data values. You could say this was another error, not conservatively fixing the column lengths in the first place.
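In ActiveRecord terms, the fix was along these lines. A sketch with illustrative column names and limits; the point is that an index over utf8mb4 columns reserves four bytes per character, so the characters in a key must total 191 or fewer (191 * 4 = 764 bytes):

class ShortenMakeModelColumns < ActiveRecord::Migration[5.2]
  def up
    # 64 + 64 = 128 characters, 512 bytes, comfortably under the 767-byte limit.
    change_column :make_models, :make,  :string, limit: 64
    change_column :make_models, :model, :string, limit: 64
  end

  def down
    # The original, over-generous limits aren't worth restoring.
    raise ActiveRecord::IrreversibleMigration
  end
end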

Conclusion

The moral of the story is, if you’re going to support UTF-8 (and you are), make sure you have some four-byte, UTF-8 encoded characters in your test data. In other words, use some emoji. Give it a little ❤️.