What are multibyte strings in PHP?

If you read New String Functions in PHP 8, you might be wondering why multibyte strings are important. Computers are built to process binary digits, and there is a straightforward approach for converting integers to binary. There is even a standard for representing floating point numbers in binary. However, how do all the sentences, words, and characters on this page get represented as binary?

The answer is character encoding. Without diving too much into the details, a character encoding is a mapping between characters and integers, and we know how to represent integers in binary. So… problem solved!
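As a minimal sketch of that mapping, assuming PHP 7.2 or later with the mbstring extension enabled and UTF-8 as the encoding, mb_ord() returns the integer a character maps to and decbin() writes that integer in binary. The helper name describe_char is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper: returns the integer (code point) a character maps to
// and that integer written in binary. Assumes the mbstring extension is loaded.
function describe_char(string $char): array
{
    $codePoint = mb_ord($char, 'UTF-8');    // character -> integer
    return [
        'code_point' => $codePoint,
        'binary'     => decbin($codePoint), // integer -> binary digits
    ];
}

print_r(describe_char('A'));        // code point 65, binary 1000001
print_r(describe_char("\u{BC18}")); // the Hangul syllable 반, code point 48152
```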

Many of PHP's built-in string functions assume a single-byte character encoding such as ISO-8859-1. In fact, the PHP manual defines a string as:

A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support.

This strategy and character encoding work well for languages that use the Latin alphabet. However, the scheme does not work for languages such as Korean, Chinese, or Japanese, which use other characters. ISO-8859-1 encodes every character in 1 byte, which limits its capacity to 256 characters. Other encoding schemes have emerged, and UTF-8 is by far the most popular. Unlike ISO-8859-1, which always uses 1 byte, UTF-8 uses between 1 and 4 bytes (up to 32 bits) per character and can therefore represent over a million characters.
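As a rough illustration of how many bytes UTF-8 spends on different characters, the sketch below prints the byte count and raw bytes for four sample characters. It assumes PHP 7 or later (for the "\u{...}" escape syntax); the helper name utf8_bytes and the sample labels are made up for this example.

```php
<?php
// Hypothetical helper: byte count and raw bytes (as hex) of a string.
function utf8_bytes(string $char): array
{
    return ['bytes' => strlen($char), 'hex' => bin2hex($char)];
}

// "\u{...}" escapes always produce UTF-8, however this file itself is saved.
$samples = [
    'U+0041 Latin A'  => "\u{0041}",
    'U+00E9 e-acute'  => "\u{00E9}",
    'U+BC18 Hangul'   => "\u{BC18}",
    'U+1F600 emoji'   => "\u{1F600}",
];

foreach ($samples as $label => $char) {
    $info = utf8_bytes($char);
    printf("%-16s %d byte(s): %s\n", $label, $info['bytes'], $info['hex']);
}
// Byte counts printed: 1, 2, 3 and 4.
```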

What do multibyte strings have to do with PHP?

Suppose we have Hello World in Hangul (Korean): 반갑다 세상아. What is its length? The answer is seven characters, but strlen returns 19! Nineteen??! This is because strlen actually returns the number of bytes used to represent the string, not the number of characters. In UTF-8, each of the six Hangul syllables takes 3 bytes and the space takes 1, so 6 × 3 + 1 = 19.

Some functions assume that the string is encoded in some (any) single-byte encoding, but they do not need to interpret those bytes as specific characters. This is the case for, for instance, substr(), strpos(), strlen(), and strcmp(). Another way to think of these functions is that they operate on memory buffers, i.e., they work with bytes and byte offsets.

This is precisely why multibyte string functions exist. The mb_strlen counterpart indeed returns the correct length.

>>> $str = "반갑다 세상아";
=> "반갑다 세상아"

>>> strlen($str);
=> 19

>>> mb_strlen($str);
=> 7

Why are multibyte string functions separate?

You might ask yourself, why have a completely separate set of functions for handling multibyte strings? The answer is technical debt. PHP was developed shortly after UTF-8 was standardized, and I suspect the encoding scheme was not popular enough at the time to warrant its use. The mb_ functions were added later as a way to support multibyte strings. In 2005, a small group from Yahoo, Zend, and the PHP community launched a project to add native Unicode support to PHP. That project was named PHP 6, and it was never released. In his slide deck below, Andrei Zmievski provides more details on the PHP 6 project.

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6 from Andrei Zmievski