[Catalyst] Unicode::Encoding - utf8 "\xBA" does not map to Unicode

Bernhard Graf catalyst4 at augensalat.de
Sat Mar 12 11:55:28 GMT 2011


On 12.03.2011 11:35, Eisenberger Tamás wrote:

> When you do utf8::decode, on a string from any input (file, http
> request, etc..) perl groups the two separate input bytes represented as
> two characters in the string "\xC3\xA9" into one utf8 character: "\xE9"
> and marks the string as utf8 (so utf8::is_utf8 returns 1).

Sorry, but that answer is complete nonsense, and it shows where the main
problem lies for most developers: they simply don't understand the
difference between Unicode and UTF-8.

Unicode is a standard that assigns a distinct code point to every known
character. Those code points are non-negative integers, unbounded in
theory (in practice, of course, the number of characters is finite, and
the standard limits code points to the range 0..0x10FFFF).

To store any information in a computer, it must be encoded, and that is
also true for Unicode code points. UTF-8 is such an encoding standard.

If you have an "é", which is assigned Unicode code point 233 (aka
\xE9), and you want to store it, then you have to encode it.
If you choose the UTF-8 encoding, the result is the two-byte "octet
stream" (16 bits) "\xC3\xA9". That is what is stored in a file, for
example, when you save "é" as UTF-8.
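
To make the distinction between code point and encoded bytes concrete,
here is a minimal sketch. It is written in Python purely for
illustration (the variable names are made up for this example); the same
distinction applies to Perl's character strings and byte strings.

```python
# "é" as an abstract character: exactly one Unicode code point.
char = "\u00e9"
code_point = ord(char)               # the number Unicode assigns to "é"

# Encoding turns the code point into bytes (octets) that can be stored.
utf8_bytes = char.encode("utf-8")

# Decoding is the inverse: bytes in, characters out.
decoded = utf8_bytes.decode("utf-8")

print(code_point, hex(code_point))   # 233 0xe9
print(utf8_bytes)                    # b'\xc3\xa9'
print(len(utf8_bytes), len(decoded)) # 2 1
```

One character, one code point, but two bytes once it is encoded as
UTF-8.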

For historical reasons, the characters at Unicode code points 128..255
are the same as in the commonly used ISO-8859-1 encoding. That sometimes
makes it harder to decide whether some data is actually meant as UTF-8
or is broken, unencoded Unicode data.
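
A small sketch of why this overlap causes confusion, again in Python
and only as an illustration (variable names are hypothetical). It shows
that the ISO-8859-1 byte for "é" has the same numeric value as its code
point, and what happens when bytes are interpreted with the wrong
encoding.

```python
char = "\u00e9"  # é

latin1_bytes = char.encode("iso-8859-1")  # a single byte
utf8_bytes = char.encode("utf-8")         # two bytes

# In ISO-8859-1 the byte value equals the Unicode code point.
same_number = latin1_bytes[0] == ord(char)

# Wrong guess 1: UTF-8 bytes read as ISO-8859-1 give two characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# Wrong guess 2: ISO-8859-1 bytes read as UTF-8 fail to decode at all.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_bytes, utf8_bytes)   # b'\xe9' b'\xc3\xa9'
print(same_number)                # True
print(mojibake)                   # Ã©
print(latin1_is_valid_utf8)       # False
```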

Back to the URL: when you have 'first_name=K%E9vyn' in a URL, the
meaning of %E9 is actually ambiguous, because the URL carries no
information about the encoding. Fortunately RFC 3986 advises encoding
non-ASCII characters as UTF-8 before applying the URI percent-encoding,
because percent-encoding only works on bytes (octets). In that sense,
"K%E9vyn" is simply invalid, because "\xE9" alone is not a valid UTF-8
encoded character. What you have there is obviously the
percent-encoding of the ISO-8859-1 encoding of "Kévyn". The correct
RFC 3986 compliant URI encoding for "Kévyn" would therefore be
"K%C3%A9vyn".
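
The two percent-encoded forms can be reproduced with a short sketch.
This uses Python's standard urllib.parse module as a stand-in, not
Catalyst's own decoding path; the variable names are assumptions made
for the example.

```python
from urllib.parse import quote, unquote_to_bytes

name = "K\u00e9vyn"  # Kévyn

# Encode to bytes first, then percent-encode each non-ASCII byte.
utf8_form = quote(name, encoding="utf-8")
latin1_form = quote(name, encoding="iso-8859-1")

# Percent-decoding only yields bytes; the encoding is still unknown.
raw = unquote_to_bytes("K%E9vyn")

# Treating those bytes as UTF-8 fails, much like the error in the
# subject line of this thread.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Treating them as ISO-8859-1 recovers the intended name.
recovered = raw.decode("iso-8859-1")

print(utf8_form)      # K%C3%A9vyn
print(latin1_form)    # K%E9vyn
print(raw)            # b'K\xe9vyn'
print(is_valid_utf8)  # False
```

A server that receives "K%E9vyn" therefore has to either reject it or
fall back to guessing a legacy encoding such as ISO-8859-1.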

HTH

Bernhard Graf


