[Catalyst] Problem with Catalyst::Plugin::I18N using UTF-8
Ash Berlin
ash_cpan at firemirror.com
Fri Dec 21 21:35:19 GMT 2007
> I looked at the Unicode plugin and I believe it most likely will break the
> integration against our LDAP backend, for example when searching for names
> containing characters like æøå. (OpenLDAP requires its input as UTF-8.)
>
> In addition, this is bad if your code (or templates) contains special unicode
> characters; which then becomes double-encoded.
>
> The Unicode plugin looks like could be useful if you are migrating old data or
> an old website that didn't use UTF-8 before. It is definitely not the
> solution for me, as it means more data processing and might introduce new
> bugs.
>
> As I said in my first post, the solution (which works for me) was to turn off
> the Decode parameter. This makes more sense to me now, since my mo/po-files
> are already in UTF-8 and don't need to be converted.
Right, I think there is some confusion on your part as to what is the
proper way of handling unicode in perl.
(The basic problem is that "perl's magic internal representation" just
happens to look exactly like UTF-8 plus a magic flag. Longer
description below)
First off, you need to understand the difference between characters
and bytes/octets:
"æøå" is a character string
"\303\246\303\270\303\245" is a utf8 byte sequence != a string
"\303\246\303\270\303\245" + UTF8 flag = "æøå" perl string
From perldoc perlunicode:
    ... What the "UTF8" flag means is that the sequence of octets in the
    representation of the scalar is the sequence of UTF-8 encoded code
    points of the characters of a string. The "UTF8" flag being off means
    that each octet in this representation encodes a single character with
    code point 0..255 within the string. Perl's Unicode model is not to use
    UTF-8 until it is absolutely necessary.
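To make the bytes-versus-characters distinction concrete, here is a minimal sketch using only the core Encode module and utf8::is_utf8. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Six octets: the UTF-8 encoding of "æøå", with the UTF8 flag off.
my $bytes = "\303\246\303\270\303\245";

# Decoding turns the octets into a three-character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d utf8_flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d utf8_flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

The first line should report a length of 6 with the flag off, the second a length of 3 with the flag on, even though both would look the same on a UTF-8 terminal.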
The problem is that you can have two strings of data that look the
same when you print them. Let's take the example you gave of "æøå".
If this data comes from a source that doesn't set the UTF8 flag, the
SV (scalar value - where perl internals store scalars) will hold the
characters
"\303\246\303\270\303\245"
However, since none of these code points are above 255 (they can't be,
as each character = one byte), perl thinks this isn't a utf8 string.
Devel::Peek is a good module for this:
  DB<3> x $foo = "\303\246\303\270\303\245"
0  'æøå'
  DB<4> Dump($foo)
SV = PV(0x918d08) at 0x926848
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5ace10 "\303\246\303\270\303\245"\0
  CUR = 6
  LEN = 8
It "looks right", but wait - CUR = 6. Perl thinks it's a string of 6
characters (one per byte) that our terminal just happens to print
right. (LEN is only the size of the allocated buffer.)
Compare that with:
  DB<6> x $bar = "\x{E6}\x{F8}\x{E5}"
0  '???'
  DB<7> Dump($bar)
SV = PV(0x9398dc) at 0x9306b4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5acbf0 "\346\370\345"\0
  CUR = 3
  LEN = 4
Still not quite what we want...
  DB<10> Dump($baz = Encode::decode("utf8", $foo))
SV = PVMG(0x974e20) at 0x974168
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x656d30 "\303\246\303\270\303\245"\0 [UTF8 "\x{e6}\x{f8}\x{e5}"]
  CUR = 6
  LEN = 8
  MAGIC = 0x6575e0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 3
Right, *now* $baz is a proper unicode string: perl knows it is a
string of 3 *characters* (MG_LEN = 3) stored internally as UTF8.
To relate this to your problem, you are getting some of your data
double encoded because the perl module you are using to access your
LDAP server is returning a byte sequence that perl doesn't know is
supposed to be UTF8.
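A small sketch of how the double encoding comes about, using only Encode. The octets here stand in for what an LDAP module might hand back; nothing in it is specific to any LDAP library.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "æøå", UTF8 flag off (stand-in for undecoded input).
my $octets = "\303\246\303\270\303\245";

# Encoding the undecoded octets treats each octet as a character
# (U+00C3, U+00A6, ...) and encodes every one of them again.
my $double = encode('UTF-8', $octets);

# Decoding first, then encoding, round-trips to the original octets.
my $single = encode('UTF-8', decode('UTF-8', $octets));

printf "double-encoded: %d octets\n", length($double);
printf "decoded first:  %d octets\n", length($single);
```

The first result is twice as long as the input, which is the garbage you see on output; the second is identical to the original six octets.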
The answer is to do Encode::decode("utf8", $utf8_byte_sequence)
on all the data coming back from your LDAP server (or to find the
right option to make the module you are using do it).
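Roughly, that means decoding at the boundary where data enters your app and encoding again where it leaves. The following is only a sketch: decode_values is a hypothetical helper, and the hash is a stand-in for whatever structure your LDAP module actually returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode every attribute value of one result row
# from UTF-8 octets into perl character strings.
sub decode_values {
    my ($row) = @_;
    my %decoded;
    $decoded{$_} = decode('UTF-8', $row->{$_}) for keys %$row;
    return \%decoded;
}

# Stand-in for a row of raw UTF-8 octets coming back from the server.
my $raw = { cn => "\303\246\303\270\303\245", uid => "abc" };

my $row = decode_values($raw);
printf "cn is %d characters long\n", length($row->{cn});

# Going the other way, encode character strings back to octets before
# passing them to something that expects UTF-8 input.
my $filter_value = encode('UTF-8', $row->{cn});
```

Inside the application everything is then a character string, and only the edges deal in octets.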
Any of this make any sense?
PS. It seems that even Apple has problems with UTF8. In writing this
email I saved it in my drafts folder. When I came back to edit it
again, the non-ascii characters got fluffed up. Fun eh?