[Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

Sat Sep 27 23:39:03 BST 2008

Maybe you're already aware of this, but I've found from experience that 
troubleshooting encoding/Unicode problems in a web/db app can be difficult, 
especially with multiple conversions at different stages.  I've come up 
with a short generic procedure to help test that things are working and to 
show where things need fixing.  Note that these details assume we're using 
Perl 5.8+.

1. Make sure all your text/code/template/non-binary/etc files are saved as 
UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy 
text editor.

2. Have a "use utf8;" at the top of every Perl file, so Perl treats your 
source files as being UTF-8 encoded.

3. Place a text string literal in your program code that you know isn't in 
ASCII.  For example, I like to use the word 'サンプル', which is what came 
out of Google's translation tool when I asked it to translate the word 
'sample' to Japanese.  Then set up your program to display that text 
directly in your web page text, without any escaping.

4. Make sure the HTTP response headers for the webpage with that text have 
a content-type charset value of UTF-8, and make sure that Perl is encoding 
its output as actual UTF-8.  If you are printing directly to STDOUT, for 
example in a CGI script, it could be: "binmode *main::STDOUT, 
':encoding(UTF-8)';" or such.  Make sure your web browser is Unicode savvy.

5. At this point, if the web page displays correctly with the non-ASCII 
literal (and moreover, if you "view source" in the browser and the literal 
also displays literally), then you know your program can represent Unicode 
correctly internally, and it can output Unicode correctly to the browser.  
It is very important to get this step working first, in isolation, so that 
you are in a position to judge or troubleshoot other issues such as 
receiving Unicode input from a browser or using it with a database.

6. Next, test that you can receive Unicode from the browser in the various 
ways, whether by query string / HTTP headers or in an HTTP POST.  E.g., try 
outputting a value and have the user submit it again, and compare for 
equality either in the Perl program or by displaying it again next to the 
original for visual inspection.  If any differences come up, then you know 
any fixes you have to do concern either how you read and interpret the 
browser request, or perhaps how you instruct the browser to submit a 
request.  Once that's all cleared up, then you know your I/O with the web 
browser works fine.

7. To test a database, I suggest first using a known-good and Unicode-savvy 
alternate input method for putting some Unicode text in the database, such 
as an admin/utility tool that came with the DBMS.  Also make sure that the 
database itself is using UTF-8 character strings, e.g. that the schema is 
declared this way.

8. With a database known to contain some valid Unicode text, first test 
simply selecting that text from the database and displaying it.  If 
anything doesn't match, it means you probably have to configure your DBMS 
client connection encoding so it is UTF-8 (often done with a few SQL 
commands), and then separately ensure that Perl is decoding the UTF-8 data 
into Perl text strings properly.  It's important to make sure you can 
retrieve Unicode from the database properly, so that you have a baseline 
for judging whether you can insert such text in the database.

9. Next try to insert some Unicode text in the database using your program, 
then select it back to check that it worked.  If it didn't, then check DBMS 
client connection settings, or that Perl is encoding text as UTF-8 properly.

10. Actually, when you have a known-good external tool to help you, you can 
alternatively start the DBMS tests with step 9, where your program inserts 
text, and then use the known-good tool to check that it actually was 
recorded properly.
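
As a rough illustration of steps 2 to 4 taken together, here is a minimal 
CGI-style sketch.  The helper name build_response and the HTML around the 
literal are made up for this example; only the "use utf8;" pragma, the 
charset header and the binmode call come from the steps above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file is saved as UTF-8 (step 2)

# Hypothetical helper: build the whole CGI response as a Perl character string.
sub build_response {
    my $sample = 'サンプル';    # non-ASCII test literal (step 3)
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"UTF-8\"><title>Unicode test</title></head>\n"
         . "<body><p>$sample</p></body></html>\n";
}

# Encode everything written to STDOUT as UTF-8 (step 4).
binmode STDOUT, ':encoding(UTF-8)';
print build_response();
```

If the literal shows up correctly both in the rendered page and in "view 
source", step 5 is satisfied for this code path.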

Anyway, that's it in a nutshell.  Now I'm sure many of you have already 
figured this out, but for those who haven't, I hope these tips help you. 
Adjust as appropriate to account for any abstraction tools or frameworks 
you are using, since your tests may then also involve testing or 
configuring those tools.
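
For the input side (step 6), and for comparing values fetched from a 
database (step 8), it can help to look at code points rather than at 
rendered text.  The sketch below is only an illustration: code_points and 
decode_segment are hypothetical helpers, and which layer of a given 
framework should do the decoding is not shown here.  A value that dumps as 
'Jes\xc3\xbas', as in the quoted message below, looks like UTF-8 bytes that 
have not yet been decoded into a Perl character string.

```perl
use strict;
use warnings;
use utf8;                 # the literals below are UTF-8 in the source
use Encode qw(decode);

# Hypothetical helper: list a string's code points, so byte strings and
# character strings can be told apart at a glance.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# Hypothetical helper: percent-encoded URL segment -> Perl character string.
sub decode_segment {
    my ($raw) = @_;
    # Percent-decoding yields raw bytes, not characters.
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    # Interpret the bytes as UTF-8; dies if they are not valid UTF-8.
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

my $expected  = 'Jesús';            # character string, 5 characters
my $undecoded = "Jes\xc3\xbas";     # the same name as raw UTF-8 bytes, 6 bytes
my $received  = decode_segment('Jes%C3%BAs');

print code_points($expected),  "\n";
print code_points($undecoded), "\n";
print code_points($received),  "\n";
print $received eq $expected ? "round trip ok\n" : "mismatch\n";
```

The first and third lines of output should be identical, with a single 
U+00FA for the accented letter, while the undecoded value shows U+00C3 
U+00BA in its place.  Seeing that two-code-point pattern in a value is a 
hint that a decoding step is missing somewhere between the browser or 
database and your Perl strings.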

-- Darren Duncan

Hugh Hunter wrote:
> I've been struggling with this for some time and know there must be an 
> answer out there.
> 
> I'm using URL arguments to pass parameters to my controller.  It's a 
> site about names, so take the url http://domain.com/name/Jesús (note the 
> accented u).  The Name.pm controller has an :Args(1) decorator so Jesús 
> is stored in $name and then passed to my DBIC model in a ->search({name 
> => $name}) call.  This doesn't manage to find the row that exists in 
> mysql.  When I dump $name I get:
> 
> 'name' => 'Jes\xc3\xbas'
> 
> which I think I understand as being perl's internal escaping of utf-8 
> characters.
> 
> I've done everything recommended on 
> http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode and 
> the name column in my mysql database uses the utf-8 charset.
> 
> Where am I going wrong?