[Catalyst] Re: Catalyst Unicode woes ...

Sat Aug 11 17:29:12 GMT 2007

* Tobias Kremer <list at funkreich.de> [2007-08-10 12:41]:
> Quoting Tatsuhiko Miyagawa <miyagawa at gmail.com>:
> >Concatenating utf-8 flagged variables with utf-8 encoded byte
> >string causes automatic SV upgrade, which causes double utf-8
> >encoded string.
> 
> Hmmm. So my templates are utf8 _ENCODED_ and the strings coming
> in from other perl modules are just utf8 _FLAGGED_. When TT
> concats them together during process() the result is wrecked
> because of the automatic upgrade. Correct?

Forget the fact that they are UTF-8 flagged. Think of it this
way: Perl has two kinds of strings, byte strings and character
strings.

Byte strings consist of, well, bytes; they might be text, or
maybe they’re not. If they are, they are _encoded_; to understand
the text you have to _decode_ the byte sequence to characters.
This notion may seem weird if you haven’t dealt with Unicode in
depth, because many older character sets have at most 256
characters, which they just represent using a single byte each.
But if you have more than 256 characters (and Unicode has a lot
more), then suddenly you have to pick some way to represent the
character codes. A sequence of bytes alone is meaningless as text
until you know what encoding it’s in.
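
To make that concrete, here is a minimal sketch using the Encode
module that ships with Perl. The sample bytes and variable names are
made up for illustration; the point is only that the same five bytes
become different text depending on which encoding you assume.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five bytes, as they might arrive from a file or a socket.
my $bytes = "caf\xC3\xA9";

# The same bytes decoded under two different assumptions.
my $as_utf8   = decode('UTF-8',      $bytes);   # c a f U+00E9
my $as_latin1 = decode('ISO-8859-1', $bytes);   # c a f U+00C3 U+00A9

printf "%d bytes\n", length $bytes;
printf "as UTF-8:      %d characters, last is U+%04X\n",
    length $as_utf8, ord substr($as_utf8, -1);
printf "as ISO-8859-1: %d characters, last is U+%04X\n",
    length $as_latin1, ord substr($as_latin1, -1);
```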

Character strings, OTOH, consist of Unicode characters; pure,
ideal, atomic characters that have no particular representation.
Of course the interpreter has to store these ideal characters
somehow, so it uses UTF-8 internally; but that could equally well
be UTF-16 or UCS-4 or for that matter ASCII plus XML entities.

For deeper exposition of the concepts (what is an ideal character
and how does it relate to encodings), read Joel Spolsky’s classic
article:

    The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets (No
    Excuses!)
    http://www.joelonsoftware.com/articles/Unicode.html

Anyway, the problem you are seeing is that as long as you stay in
one realm, things will work.

F.ex., if you mix byte strings, and the bytes represent text
encoded with the same encoding in both strings, you can mix them
just fine. Note though that with multibyte or variable-width
encodings (eg. UCS-2 and UTF-8 respectively), you will have to be
careful to take the encoding into account in every string
mutation. F.ex. if you truncate post titles for display in a
sidebar, you will have to manually take care not to cut the
string off in the middle of a three-byte character.
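
A rough sketch of what that manual care looks like for UTF-8, assuming
the title is a byte string. Both helpers (`truncate_utf8_bytes` and
`is_valid_utf8`) are hypothetical names I made up for this example,
not functions from any module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: cut a UTF-8 encoded byte string to at most
# $max bytes without splitting a multi-byte sequence.
sub truncate_utf8_bytes {
    my ($bytes, $max) = @_;
    return $bytes if length($bytes) <= $max;
    # UTF-8 continuation bytes look like 10xxxxxx. If the first byte
    # we would drop is one of those, the cut is mid-character: back up.
    $max-- while $max > 0
        && (ord(substr($bytes, $max, 1)) & 0xC0) == 0x80;
    return substr($bytes, 0, $max);
}

# Hypothetical helper: true if the bytes decode cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# "10 <euro sign> off"; the euro sign is the three bytes E2 82 AC.
my $title = "10 \xE2\x82\xAC off";

my $naive   = substr($title, 0, 4);             # ends in a lone \xE2
my $careful = truncate_utf8_bytes($title, 4);   # backs up to "10 "

printf "naive cut valid:   %s\n", is_valid_utf8($naive)   ? 'yes' : 'no';
printf "careful cut valid: %s\n", is_valid_utf8($careful) ? 'yes' : 'no';
```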

Likewise, if the strings are both character strings, then you
can mix them no problem. And because they consist of pure ideal
characters, any operations on them treat characters as atomic.
You do not need to care whether a character is one, two, three or
however many bytes in the internal representation used by Perl;
you can just truncate strings or run substitutions on them etc
without worrying.
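
For comparison, a small sketch of the same kind of truncation on a
character string, assuming you decode at input and encode at output
with Encode. The sample title is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode once at the boundary: bytes in, characters from here on.
my $title = decode('UTF-8', "10 \xE2\x82\xAC off");   # 8 characters

# substr counts characters, so the euro sign cannot be split.
my $short = substr($title, 0, 4);                     # "10 " plus U+20AC

# Substitutions treat the euro sign as one atomic character too.
(my $swapped = $title) =~ s/\x{20AC}/EUR/;

# Encode once at the other boundary: characters out as bytes.
my $out = encode('UTF-8', $short);

printf "%d characters in, %d kept, %d bytes out\n",
    length $title, length $short, length $out;
```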

But if you mix byte strings and character strings, there is
trouble. Perl must find out what characters are in the byte
string, so it must decode it. By default it does so by assuming
that byte strings are text encoded in ISO-8859-1. If this is the
wrong encoding, because, say, your data was actually
UTF-8-encoded – well, oops: now you have UTF-8 that was decoded
as ISO-8859-1, which leads to the well-known artifacts.
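
Here is a minimal sketch of that failure and one way around it,
assuming Encode and a UTF-8-encoded template. The names `$template`,
`$name` and `hex_bytes` are hypothetical stand-ins for this example;
they are not taken from TT or Catalyst.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as hex.
sub hex_bytes { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

# Byte string: UTF-8 encoded text, e.g. read raw from a template file.
my $template = "caf\xC3\xA9 ";

# Character string: a euro sign that some other module already decoded.
my $name = decode('UTF-8', "\xE2\x82\xAC");

# Mixing the two: Perl treats each byte of $template as one
# ISO-8859-1 character, so \xC3 and \xA9 become two characters.
my $mixed = $template . $name;
my $wrong = encode('UTF-8', $mixed);

# Staying in one realm: decode the template first, then concatenate.
my $fixed = encode('UTF-8', decode('UTF-8', $template) . $name);

print "mixed: ", hex_bytes($wrong), "\n";
# mixed: 63 61 66 C3 83 C2 A9 20 E2 82 AC
print "fixed: ", hex_bytes($fixed), "\n";
# fixed: 63 61 66 C3 A9 20 E2 82 AC
```

The `C3 83 C2 A9` run in the first line is the double-encoded é.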

Note, however, that you can change the default using the
`encoding` pragma. See `perldoc encoding`.

If the program code itself is in UTF-8, you may want to declare
that also: see `perldoc utf8`.

And finally – see `perldoc perlunicode`.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>