[Catalyst] tips for troubleshooting/QAing Unicode

Wed Oct 1 01:54:56 BST 2008

* On Sat, Sep 27 2008, Darren Duncan wrote:
> Maybe you're already aware of this, but I've found from experience
> that troubleshooting encoding/Unicode problems in a web/db app can be
> difficult, especially with multiple conversions at different stages,
> but I've come up with a short generic algorithm to help test/ensure
> that things are working and where things need fixing.

A simplified version:

1) Identify sources of input to your application

2) Ensure that you call Encode::decode('the-character-encoding', ...)
on all of that data.  If the input is guaranteed to be pure ASCII you
can skip this step, although Encode::decode('us-ascii', ...) works too.

Sometimes libraries will do this for you, but don't count on it; verify
it.  If you don't see the code doing it, it's not being done.
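
A minimal sketch of that decode step, assuming the input arrives as
UTF-8 bytes.  The helper name decode_input and the sample bytes are made
up for illustration; the strict CHECK flags are one reasonable choice,
not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw input bytes into a character string.
# FB_CROAK makes decode() die on malformed input instead of silently
# substituting U+FFFD; LEAVE_SRC keeps decode() from modifying $bytes.
sub decode_input {
    my ($bytes, $encoding) = @_;
    return decode($encoding, $bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Sample input: the six UTF-8 bytes for the two characters U+307B U+3052.
my $bytes = "\xE3\x81\xBB\xE3\x81\x92";
my $text  = decode_input($bytes, 'UTF-8');

printf "%d bytes in, %d characters out\n", length $bytes, length $text;
```

If the helper dies here, the input was not in the encoding you assumed,
which is exactly the thing you want to find out at the boundary.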

Note that the existence of the "UTF-8 flag" does not tell you whether
this is being correctly done.  Your program can be perfectly
Unicode-clean and never have a string with the UTF-8 flag on.
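
To illustrate the point about the flag, here is a small sketch.  It uses
utf8::upgrade and utf8::is_utf8 purely to inspect Perl's internal
storage; that is not something application code should normally do.

```perl
use strict;
use warnings;

# One character, U+00FF.  Perl stores it internally as a single byte,
# so the UTF-8 flag is off.
my $plain = chr(0xFF);

# The same character after forcing the other internal representation.
# Only the storage changes, not the string's value.
my $upgraded = $plain;
utf8::upgrade($upgraded);

printf "flag on plain:    %s\n", utf8::is_utf8($plain)    ? 'on' : 'off';
printf "flag on upgraded: %s\n", utf8::is_utf8($upgraded) ? 'on' : 'off';
printf "strings equal:    %s\n", $plain eq $upgraded      ? 'yes' : 'no';
```

Both variables hold the same one-character string, so the flag tells you
how Perl stored the string and nothing about whether it was decoded.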

If you see stuff like Encode::_utf8_on and Encode::_utf8_off, or
utf8::encode and utf8::decode sprinkled around until the output looks
right, your program is almost certainly broken.  Use Encode properly
before continuing.

Finally, keep in mind that there are odd sources of data: hash keys
from config files, file names, file extended attributes, form params,
form field names, URIs(*), etc.

(*) Handle these manually.  A URI itself is ASCII only, and the URI
standard does not tell you which character encoding the percent-escaped
bytes are in.
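
A rough sketch of handling a URI component by hand.  The helper name
decode_uri_component is made up for this example, and defaulting to
UTF-8 is an assumption your application has to make and verify for its
own clients.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo the percent-escapes to get bytes, then
# decode those bytes into characters.
sub decode_uri_component {
    my ($component, $encoding) = @_;
    $encoding ||= 'UTF-8';    # assumption: the client sent UTF-8

    # Step 1: %XX escapes -> raw bytes.
    (my $bytes = $component) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Step 2: bytes -> characters, dying on malformed input.
    return decode($encoding, $bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $name = decode_uri_component('%E3%81%BB%E3%81%92');
printf "%d characters\n", length $name;
```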

Some people do things like put Japanese text in the HTTP headers.  This
is not allowed.  ASCII only.

3) Identify where you output text.

4) Ensure that you called Encode::encode('output-character-encoding', ...) on
any data that leaves your program.

In the case of dealing with external applications, make sure that you've
told them what the output character encoding is.  Databases have flags
for this, HTTP has the charset parameter of the Content-Type header, etc.
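
As a sketch of steps 3 and 4 for the HTTP case, assuming UTF-8 output.
The helper name build_response is made up for illustration; a real
Catalyst app would go through its response object rather than building
the message by hand.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a minimal HTTP response from character data.
sub build_response {
    my ($text) = @_;

    # Characters -> bytes, right at the point where data leaves the program.
    my $body = encode('UTF-8', $text);

    my $head = join "\r\n",
        'Content-Type: text/plain; charset=UTF-8',   # tell the client
        'Content-Length: ' . length($body),          # counted in bytes
        '', '';

    return $head . $body;
}

my $response = build_response("\x{307B}\x{3052}");
```

Note that Content-Length is computed after encoding; taking the length
of the character string instead would give 2 here rather than 6.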

5) You're done.

I have found that Devel::StringInfo is very helpful; you can have it
dump information about a string at the point where data comes in, and
it will make it clear when you have bytes instead of characters.

Be sure to test with all sorts of input -- I always use characters from
ASCII ("foo"), Latin-1 ("ÿ"), and Japanese ("ほげ").  If your app gets
those three right, it is probably OK.
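
A small sketch of that check, using the same three sample strings.  It
only round-trips them through Encode as UTF-8; a real test would push
them through the whole app (form, database, rendered page) and compare
what comes back.  The expected byte counts are for UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Label, character string, expected length in bytes once UTF-8 encoded.
my @samples = (
    [ 'ASCII',    "foo",              3 ],
    [ 'Latin-1',  "\x{FF}",           2 ],   # U+00FF
    [ 'Japanese', "\x{307B}\x{3052}", 6 ],   # U+307B U+3052
);

my $failures = 0;
for my $sample (@samples) {
    my ($label, $text, $expected_bytes) = @$sample;

    my $bytes = encode('UTF-8', $text);    # what leaves the program
    my $back  = decode('UTF-8', $bytes);   # what comes back in

    my $ok = length($bytes) == $expected_bytes && $back eq $text;
    $failures++ unless $ok;
    printf "%-8s %s\n", $label, $ok ? 'ok' : 'NOT ok';
}
```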

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"