[Catalyst] Problem with Catalyst::Plugin::I18N using UTF-8

Fri Dec 21 21:35:19 GMT 2007

> I looked at the Unicode plugin and I believe it most likely will break the
> integration against our LDAP backend, for example when searching for names
> containing characters like æøå. (OpenLDAP requires its input as UTF-8.)
>
> In addition, this is bad if your code (or templates) contains special unicode
> characters, which then become double-encoded.
>
>
> The Unicode plugin looks like it could be useful if you are migrating old data
> or an old website that didn't use UTF-8 before. It is definitely not the
> solution for me, as it means more data processing and might introduce new
> bugs.
>
>
> As I said in my first post, the solution (which works for me) was to turn off
> the Decode parameter. This makes more sense to me now, since my mo/po-files
> are already in UTF-8 and don't need to be converted.

Right, I think there is some confusion on your part as to what is the
proper way of handling unicode in perl.

(The basic problem is that "perl's magic internal representation" just
happens to look exactly like UTF-8 plus a magic flag. Longer
description below.)

First off, you need to understand the difference between characters
and bytes/octets:

"æøå" is a character string
"\303\246\303\270\303\245" is a utf8 byte sequence != a string

"\303\246\303\270\303\245" + UTF8 flag = "æøå" perl string

From perldoc perlunicode:

    ... What the "UTF8" flag means is that the sequence of octets in the
    representation of the scalar is the sequence of UTF-8 encoded code
    points of the characters of a string.  The "UTF8" flag being off means
    that each octet in this representation encodes a single character with
    code point 0..255 within the string.  Perl's Unicode model is not to
    use UTF-8 until it is absolutely necessary.
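
To make the flag-on/flag-off distinction concrete, here is a minimal sketch (the variable names are mine, purely for illustration). It takes the same six octets, asks perl how long it thinks the string is, and then asks again after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Six octets: the UTF-8 encoding of "æøå", UTF8 flag off.
my $bytes = "\303\246\303\270\303\245";

# decode() interprets the octets as UTF-8 and returns a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, UTF8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';   # length 6, flag off
printf "chars: length %d, UTF8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';   # length 3, flag on
```

With the flag off perl counts one character per octet, exactly as the perldoc text says.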

The problem is that you can have two strings of data that look the
same when you print them. Let's take the example you gave of "æøå".
If this data comes from a source that doesn't set the UTF8 flag, the
SV (scalar value - where perl internals store scalars) will hold the
six octets

   "\303\246\303\270\303\245"

However, since none of these code points are above 255 (they can't be,
as each character = one byte), perl thinks this isn't a utf8 string.
Devel::Peek is a good module for seeing this:

   DB<3> x $foo = "\303\246\303\270\303\245"
0  'æøå'
   DB<4> Dump($foo)
SV = PV(0x918d08) at 0x926848
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x5ace10 "\303\246\303\270\303\245"\0
   CUR = 6
   LEN = 8

It "looks right", but wait - CUR = 6. Perl thinks it's a string of 6
characters that our terminal just happens to print right. (LEN is only
the size of the allocated buffer.)

Compare that with:

   DB<6> x $bar = "\x{E6}\x{F8}\x{E5}"
0  '???'
   DB<7> Dump($bar)
SV = PV(0x9398dc) at 0x9306b4
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x5acbf0 "\346\370\345"\0
   CUR = 3
   LEN = 4

Still not quite what we want...

   DB<10> Dump($baz = Encode::decode("utf8", $foo))
SV = PVMG(0x974e20) at 0x974168
   REFCNT = 1
   FLAGS = (POK,pPOK,UTF8)
   IV = 0
   NV = 0
   PV = 0x656d30 "\303\246\303\270\303\245"\0 [UTF8 "\x{e6}\x{f8}\x{e5}"]
   CUR = 6
   LEN = 8
   MAGIC = 0x6575e0
     MG_VIRTUAL = &PL_vtbl_utf8
     MG_TYPE = PERL_MAGIC_utf8(w)
     MG_LEN = 3

Right, *now* $baz is a proper unicode string that perl knows is a
string of UTF8 *characters* (note MG_LEN = 3: three characters stored
in six octets).
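
As a side note, the flag only describes how the characters are stored, not which characters they are. A small sketch, assuming nothing beyond core perl and Encode (variable names are illustrative), comparing $bar and $baz from the dumps above:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bar = "\x{E6}\x{F8}\x{E5}";                         # 3 chars, stored as 3 octets, flag off
my $baz = decode('UTF-8', "\303\246\303\270\303\245");  # 3 chars, stored as 6 octets, flag on

print "same characters\n" if $bar eq $baz;              # eq compares characters, not storage

# utf8::upgrade() switches a copy to the flagged representation
# without changing the characters it holds.
my $upgraded = $bar;
utf8::upgrade($upgraded);
printf "upgraded: length %d, UTF8 flag %s\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 'on' : 'off';   # length 3, flag on
```

So $bar and $baz hold the same three characters. The six-octet $foo is the odd one out, because perl was never told those octets are UTF-8.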

To relate this to your problem, you are getting some of your data
double encoded because the perl module you are using to access your
LDAP server returns a byte sequence that perl doesn't know is supposed
to be UTF8.
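
A minimal sketch of how that double encoding comes about. Here $from_ldap is a made-up name standing in for whatever octets your LDAP module hands back; no real LDAP call is involved:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a value returned by the LDAP module:
# UTF-8 octets for "æøå", UTF8 flag off.
my $from_ldap = "\303\246\303\270\303\245";

# Encoding on output without decoding first: perl sees six characters
# in the range 128..255 and encodes each one as two octets.
my $double = encode('UTF-8', $from_ldap);

# Decoding on input and encoding on output round-trips cleanly.
my $ok = encode('UTF-8', decode('UTF-8', $from_ldap));

printf "without decode: %d octets\n", length($double);   # 12
printf "with decode:    %d octets\n", length($ok);       # 6
```

The same thing happens implicitly when the undecoded value is joined to a string that does have the UTF8 flag and the result is encoded on the way out.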

The answer is to do Encode::decode("utf8", $utf8_byte_sequence) on all
the data coming back from your LDAP server (or to find the right option
to make the module you are using do it).
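
Something along these lines, as a rough sketch only. decode_entry is a hypothetical helper, and the entry is modelled as a plain hash of array refs rather than whatever object your LDAP module really returns:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode every attribute value of one entry.
# The hash-of-arrayrefs layout is an assumption made for illustration.
sub decode_entry {
    my ($entry) = @_;
    my %decoded;
    for my $attr (keys %$entry) {
        $decoded{$attr} = [ map { decode('UTF-8', $_) } @{ $entry->{$attr} } ];
    }
    return \%decoded;
}

# Made-up sample data: "Jørgen" as UTF-8 octets, plus a plain ASCII value.
my $raw   = { cn => ["J\303\270rgen"], mail => ['jorgen@example.org'] };
my $entry = decode_entry($raw);

printf "cn is %d characters\n", length($entry->{cn}[0]);   # 6

# Going the other way (e.g. building a search filter), encode again.
my $for_ldap = encode('UTF-8', $entry->{cn}[0]);
```

Genuinely binary attributes (photos, certificates and the like) should be left alone rather than decoded.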

Any of this make any sense?

PS. It seems that even Apple has problems with UTF8. In writing this  
email I saved it in my drafts folder. When I came back to edit it  
again, the non-ascii characters got fluffed up. Fun eh?