[Catalyst] Re: decoding in core

Mon Feb 23 21:31:35 GMT 2009

* Neo [GC] <neo at gothic-chat.de> [2009-02-23 16:45]:
> Does anyone know a _safe_ method to convert _any_ string-scalar
> to utf8?

There isn’t. Strings in Perl are untyped. They are simply
sequences of arbitrarily large integers.

If a string only contains values between 0 and 255, then it can
be stored in an optimised form that uses exactly one byte per
integer, with the UTF8 flag off. Otherwise, it is stored in a
variable-width format that uses the same scheme as UTF-8, but
is not actually UTF-8. (There is no particular meaning implied
for these integers, and Perl strings can store integer values
that are undefined in Unicode.) The UTF8 flag simply means “this
is an unoptimised string”. It will sometimes be enabled on octet
strings (even though no integer value in the string is > 255) and
it will frequently be disabled on character strings. It tells you
nothing useful *at all* about the content of the string and you
should just forget that it exists. [^1]
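
To see this for yourself, here is a minimal sketch using only the
core `utf8::upgrade` and `utf8::is_utf8` functions. The variable
names are mine, chosen for illustration:

```perl
use strict;
use warnings;

# A string whose integers all fit in one byte each.
my $s = "caf\xE9";
my $t = $s;

# Force the variable-width internal storage on the copy only.
utf8::upgrade($t);

# The flag now differs between the two...
printf "flag on \$s: %s\n", utf8::is_utf8($s) ? "on" : "off";
printf "flag on \$t: %s\n", utf8::is_utf8($t) ? "on" : "off";

# ...but they are still the same sequence of integers.
printf "equal:   %s\n", $s eq $t ? "yes" : "no";
printf "lengths: %d and %d\n", length $s, length $t;

# Perl will also store an integer that Unicode does not define.
my $beyond = do { no warnings 'utf8'; chr(0x110000) };
printf "ord:     0x%X\n", ord $beyond;
```

Both strings compare equal and have length 4, even though the flag
is off on one and on on the other.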

If you have a string that holds a sequence of octets, i.e. the
encoded form of some text according to some encoding, you have
to keep track of that encoding yourself, because nothing about
the string tells you what it is.
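
As an illustration (a sketch, with made-up sample data), the same
two octets decode to different text depending on which encoding
you claim they are in, and both decodes succeed:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets as they might arrive from a file or socket.
my $octets = "\xC3\xA9";

# Nothing in $octets says which of these is the right reading.
my $as_utf8   = decode('UTF-8',      $octets);  # U+00E9
my $as_latin1 = decode('ISO-8859-1', $octets);  # U+00C3 U+00A9

printf "as UTF-8:      %d character(s)\n", length $as_utf8;
printf "as ISO-8859-1: %d character(s)\n", length $as_latin1;
```

The first decode yields one character and the second yields two.
Only knowledge from outside the string (a header, a spec, a
convention) can tell you which one was meant.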

The best approach is to simply decode strings as soon after input
as possible and encode them as late before output as possible. In
the middle of your code, then, you only have strings containing
Unicode codepoints.
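
A minimal sketch of that shape, assuming the input is known out of
band to be UTF-8 (the sample data and variable names are
hypothetical):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input boundary: octets come in and are decoded right away.
my $input_octets = "na\xC3\xAFve";
my $text = decode('UTF-8', $input_octets);

# Middle of the program: only strings of codepoints.
my $chars   = length $text;   # counts characters, not bytes
my $shouted = uc $text;       # case-maps the i-diaeresis too

# Output boundary: encode at the last moment.
my $output_octets = encode('UTF-8', $shouted);

printf "%d characters in, %d octets out\n",
    $chars, length $output_octets;
```

Here 6 octets come in, the middle of the code sees 5 characters,
and 6 octets go out again.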

[^1]: Almost. Unfortunately, there is quite a bit of broken XS
      code in modules out there which means you will have to
      `utf8::downgrade` strings to make sure they are stored in
      byte-wise optimised format before passing them in to such
      modules.
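
A sketch of that workaround follows. `legacy_xs_call` is a
hypothetical pure-Perl stand-in for such an XS function, not a real
API. It only exists so the example runs:

```perl
use strict;
use warnings;

# Hypothetical stand-in for XS code that assumes byte-wise storage.
sub legacy_xs_call {
    my ($octets) = @_;
    return utf8::is_utf8($octets) ? "unsafe" : "safe";
}

my $octets = "\xE9\x01";
utf8::upgrade($octets);   # simulate an upgrade that happened elsewhere

# Downgrade a copy just before the call. This dies if the string
# contains any integer above 255, which octets never should.
my $arg = $octets;
utf8::downgrade($arg);

print legacy_xs_call($arg), "\n";
```

The content of the string is unchanged by the downgrade. Only its
internal storage differs.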

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>