[Catalyst] Re: utf8 in regexes in Catalyst

Sun Mar 2 16:08:13 GMT 2008

Hi Alexandre,

Don’t use encoding.pm. It’s a confused and broken design, and the
author himself recommends against its use. Its main purpose is to
allow you to write code in some arbitrary encoding. As a side
effect it sets your input/output encoding, but it shouldn’t, and
confusing the encoding of the source with the encoding of its
input/output is utterly broken.

Think about it this way: there are byte strings, and there are text
strings. Text strings consist of Unicode characters; byte strings
consist of byte values and have no meaning whatsoever as text.
(Even if you are used to thinking of them as though they did.)
Text strings need to be encoded to become byte strings; byte
strings need to be decoded to become text strings.
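
To make the distinction concrete, here is a minimal sketch using
the core Encode module. The variable names are only for
illustration, and it assumes the incoming bytes are UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 terminal, file or socket:
# "é" is the two bytes 0xC3 0xA9.
my $bytes = "\xC3\xA9";

# Decoding turns the byte string into a text string of one character.
my $text = decode('UTF-8', $bytes);

# Encoding turns the text string back into bytes for output.
my $out = encode('UTF-8', $text);

# Only the decoded text string is seen as a word character by \w.
printf "bytes: length %d, \\w %s\n",
    length($bytes), $bytes =~ /\w/ ? 'matches' : 'does not match';
printf "text:  length %d, \\w %s\n",
    length($text), $text =~ /\w/ ? 'matches' : 'does not match';
```

The regex only behaves as expected on `$text`; `$bytes` is just
two opaque byte values as far as perl is concerned.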

So for that one-liner, you do this:

    echo 'é' | perl -MEncode -e '$_ = decode "UTF-8", scalar <>; print /\w/'

Yes, this is tedious. So what you do is you find ways to get the
parts of your program that speak to the outside world to decode
input on receipt and encode output on emission. Then inside your
program, you don’t need to think about it at all. For the
one-liner, for example, you would declare that your STDIN and
STDOUT are in UTF-8, and then reading from and writing to them
automatically does what it should. Handily, perl has a switch for
that when it comes to UTF-8:

    echo 'é' | perl -CS -e 'print <> =~ /\w/'

If your input was in a different encoding, you could use the
`open` pragma:

    echo 'é' | perl -Mopen=':encoding(latin-1),:std' -e 'print <> =~ /\w/'

Granted, that does not look like a big win in this example, but
if you had to do several I/O operations inside the code, it would
be, because you wouldn’t need to de-/encode every time.
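
As a rough sketch of what that looks like in a longer script: the
encoding is stated once per filehandle, and everything in between
works on text strings. I am using in-memory filehandles with
explicit `:encoding(UTF-8)` layers here as stand-ins for real
files or sockets, and the sample data is made up:

```perl
use strict;
use warnings;

# Stand-ins for real files or sockets: in-memory handles over raw bytes.
# The input is "café\nnaïve\n" encoded as UTF-8.
my $input_bytes  = "caf\xC3\xA9\nna\xC3\xAFve\n";
my $output_bytes = '';

# The encoding is declared once per handle, at the edge.
open my $in,  '<:encoding(UTF-8)', \$input_bytes  or die "open in: $!";
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open out: $!";

# From here on the code only ever sees text strings.
my @words;
while (my $line = <$in>) {
    chomp $line;
    push @words, $line if $line =~ /^\w+$/;  # é and ï are word characters
    print {$out} uc($line), "\n";            # encoded to bytes on the way out
}
close $in;
close $out;

printf "%d of 2 lines consist only of word characters\n", scalar @words;
```

There is no decode or encode call anywhere in the loop; the
layers on the handles do that work.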

So to deal with Unicode with minimal hassle, the conversion from
bytes to characters should happen at the “edge” of your code
where it interfaces with the outside world.

For Catalyst, that means things like Catalyst::Plugin::Unicode
and configuring your database and template engine correctly.

Aside from the configuration, your code should then avoid dealing
with encodings at all.

See also http://use.perl.org/~miyagawa/journal/35700

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>