[Catalyst] Catalyst Unicode

Fri Jan 31 15:00:51 GMT 2014

* Will Crawford <billcrawford1970 at gmail.com> [2014-01-31 13:05]:
> If the string has been decoded *from* UTF-8 to Perl's internal
> representation, it's *not* going to be marked as UTF8 internally; it
> *shouldn't* be. It's no longer a "UTF8" string but a "Unicode" string,
> complete with wide characters. If anything, the internal "UTF8" flag
> means "this string needs decoding" rather than "has been decoded".

Sorry, this is nonsense. The UTF8 flag means the string is internally
stored as a variable-width integer sequence using the same encoding
scheme as UTF-8, which means it can store characters > 0xFF. If the
UTF8 flag is off, the string is stored as a byte array.

You are correct only insofar as decoding a string could in theory
yield a string with the UTF8 flag *off*.

That is because the UTF8 flag doesn’t tell you anything about what the
string contains. It only means that the string can store characters
> 0xFF, which only matters to perl internally, since UTF8=0 strings will
be transparently promoted to UTF8=1 whenever necessary.
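
To illustrate the promotion, here is a minimal sketch (the variable
names are mine, purely for illustration). It uses only the `utf8::`
functions that perl always provides, and it assumes nothing beyond core
Perl:

```perl
use strict;
use warnings;

my $latin = "\xFC";       # "ü" as one character, stored as one byte (UTF8=0)
my $wide  = "\x{263A}";   # a character > 0xFF, which needs UTF8=1 storage

# Joining the two makes perl promote the UTF8=0 contents behind the scenes.
my $joined = $latin . $wide;

printf "latin:  length=%d is_utf8=%d\n",
    length($latin), utf8::is_utf8($latin) ? 1 : 0;
printf "joined: length=%d is_utf8=%d\n",
    length($joined), utf8::is_utf8($joined) ? 1 : 0;

# The first character is still 0xFC; only the storage format changed.
printf "first char of joined: 0x%X\n", ord(substr($joined, 0, 1));
```

The flag differs between `$latin` and `$joined`, yet the “ü” in both is
the same single character.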

But Perl can’t tell whether a string is a Unicode string or a byte
string. The UTF8 flag is irrelevant.

*You* can tell, because `length` returns 2 for a byte string with a “ü”
represented in UTF-8, and 1 for a Unicode string with the character “ü”.

(But `length` can return 1 for a UTF8=0 string, because the codepoint is
0xFC which can be stored as a single byte just fine; and it can return
2 even for a UTF8=1 string, because the UTF-8 encoded representation of
“ü” is 0xC3 0xBC and it doesn’t matter whether you store that in
a UTF8=0 or UTF8=1 string, it’s still the sequence 0xC3 0xBC.)
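
All four combinations can be checked with a short script. This is only
a sketch with variable names of my own choosing; `utf8::decode`,
`utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` are core
functions that are available without loading anything:

```perl
use strict;
use warnings;

# Byte string: "ü" encoded as UTF-8, i.e. the two octets 0xC3 0xBC.
my $bytes = "\xC3\xBC";

# Character string: decode a copy in place, leaving one character, 0xFC.
my $chars = $bytes;
utf8::decode($chars);

# Change only the internal storage, not the contents.
my $bytes_up = $bytes;
utf8::upgrade($bytes_up);      # still 0xC3 0xBC, now stored with UTF8=1
my $chars_down = $chars;
utf8::downgrade($chars_down);  # still 0xFC, now stored with UTF8=0

for my $case (
    [ 'bytes'            => $bytes      ],
    [ 'bytes, upgraded'  => $bytes_up   ],
    [ 'chars'            => $chars      ],
    [ 'chars, downgraded' => $chars_down ],
) {
    my ($label, $str) = @$case;
    printf "%-18s length=%d is_utf8=%d\n",
        $label, length($str), utf8::is_utf8($str) ? 1 : 0;
}
```

The two byte strings report a length of 2 and the two character strings
a length of 1, whichever way the flag is set.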

Christian:

This also affects you: you should not be looking at `is_utf8`. Instead
you should be looking at whether `length` returns the correct value.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>