[Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

Sun Sep 28 08:11:54 BST 2008

Hi,

If it helps somebody, here are the things I did to make my Catalyst-based 
app handle UTF-8 well.

I installed Perl 5.10.0 and Catalyst 5.7014 a few days ago (under Windows 
XP) and started to use C::C::HTML::FormFu.
Before I used C::C::HTML::FormFu, to make UTF-8 strings display correctly 
I only needed to save the TT templates as UTF-8, to put "use utf8" in the 
Perl modules that contain non-ASCII chars, and to configure Apache to send 
a Content-Type header with the UTF-8 charset.

Once I started using C::C::HTML::FormFu, the templates that include 
HTML::FormFu forms no longer displayed non-ASCII chars correctly, and in 
the end I needed to do the following to make the app work:

1. Add to httpd.conf:

AddDefaultCharset UTF-8

2. In the Perl modules of the application that contain non-ASCII chars, use:

use utf8;

3. In MyApp.pm, add the "Unicode" plugin:

use Catalyst qw/Unicode/;

4. In the configuration file or MyApp.pm, specify that the TT templates and 
the HTML::FormFu forms are UTF-8 encoded:

__PACKAGE__->config(
  'Controller::HTML::FormFu' => {
    constructor => {
      tt_args => {
        ENCODING => 'UTF-8',
      },
    },
  },
  'View::TTSite' => {
    ENCODING => 'UTF-8',
  },
);

5. In the class files generated by DBIC::Schema helper, after the line with 
"# DO NOT MODIFY THIS OR ANYTHING ABOVE", add the following 2 lines, and in 
the second line, specify the columns that can contain UTF-8 encoded chars:

__PACKAGE__->load_components("UTF8Columns");
__PACKAGE__->utf8_columns(qw/username first_name last_name/);

I've seen a recommendation to load "UTF8Columns" at the start of the file, as
__PACKAGE__->load_components("UTF8Columns", "Core");
but if the module is generated by the Catalyst helper, the helper doesn't 
like it when the part it created is modified.
(It would be helpful if the helper accepted one more parameter specifying 
that all columns should be UTF-8 encoded, and made these settings itself.)

6. The TT templates and HTML::FormFu forms should be UTF-8 encoded, without 
a BOM.

I've seen that non-ASCII chars are displayed correctly with even fewer 
settings if the HTML::FormFu forms are UTF-8 encoded without a BOM but the 
TT templates are UTF-8 encoded with a BOM. However, it is not very nice to 
have to create 2 types of files, and other problems might appear when doing 
so, so I think the settings above should be made.
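
To check which template or form files carry a BOM, something like the 
following could be used. It is only a sketch that uses core Perl; the helper 
names (has_utf8_bom, strip_utf8_bom, slurp_raw) are made up for this 
example and are not part of Catalyst, TT or HTML::FormFu.

```perl
use strict;
use warnings;

# True if the byte string starts with the UTF-8 BOM (bytes EF BB BF).
sub has_utf8_bom {
    my ($bytes) = @_;
    return substr($bytes, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Return a copy of the byte string with a leading UTF-8 BOM removed.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Read a whole file as raw bytes, with no decoding layer.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return defined $bytes ? $bytes : '';
}

# Report every file named on the command line that starts with a BOM.
for my $path (@ARGV) {
    print "$path starts with a UTF-8 BOM\n" if has_utf8_bom(slurp_raw($path));
}
```

Run it with the template and form files as arguments; files that are not 
listed in the output have no BOM.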

I have also created the MySQL tables as UTF-8 encoded, but I am not sure 
this is really necessary:

create table table_name(
...
) engine=InnoDB default charset=utf8;
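
The table charset only covers how MySQL stores the data; the charset of the 
client connection is a separate setting. One possible way to set it from 
the application is the "on_connect_do" option of DBIx::Class, sketched 
below. This is not something I listed in the steps above, and the DSN, user, 
password and the model name "Model::DB" are placeholders.

```perl
use strict;
use warnings;

# Placeholder DSN and credentials; replace with your own.
my $connect_info = [
    'dbi:mysql:database=myapp',
    'myapp_user',
    'secret',
    { AutoCommit => 1 },                        # DBI attributes
    { on_connect_do => ['SET NAMES utf8'] },    # SQL run on every connect
];

# In MyApp.pm this could then be passed along as, for example:
#   __PACKAGE__->config( 'Model::DB' => { connect_info => $connect_info } );
```

"SET NAMES utf8" asks the MySQL server to talk UTF-8 on that connection, so 
the bytes that UTF8Columns decodes are really UTF-8.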

And as a separate note, I just found a simpler way to send a UTF-8 encoded 
email from a Catalyst app or from a standalone program, using the module 
Mail::Builder for creating the message and Email::Send for sending it.
It allows creating a multipart/alternative message with a text and an HTML 
part, makes attaching files easy, and encodes the headers as UTF-8 
automatically.

HTH.

Octavian

----- Original Message ----- 
From: "Darren Duncan" <darren at darrenduncan.net>
To: "The elegant MVC web framework" <catalyst at lists.scsys.co.uk>
Sent: Sunday, September 28, 2008 1:39 AM
Subject: [Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing 
UTF-8 arg in URL to DBIC search)

Maybe you're already aware of this, but I've found from experience that
troubleshooting encoding/Unicode problems in a web/db app can be difficult,
especially with multiple conversions at different stages, but I've come up
with a short generic algorithm to help test/ensure that things are working
and where things need fixing.  Note that these details assume we're using
Perl 5.8+.

1. Make sure all your text/code/template/non-binary/etc files are saved as
UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy
text editor.

2. Have a "use utf8;" at the top of every Perl file, so Perl treats your
source files as being Unicode.

3. Place a text string literal in your program code that you know isn't in
ASCII ... for example I like to use the word 'サンプル', which is what came
out of Google's translation tool when I asked it to translate the word
'sample' to Japanese.  Then setup your program to display that text
directly in your web page text, without any escaping.

4. Make sure the HTTP response headers for the webpage with that text have
a content-type charset value of UTF-8, and make sure that Perl is encoding
its output as actual UTF-8; if you were doing it directly using STDOUT for
example such as in a CGI, it could be: "binmode *main::STDOUT,
':encoding(UTF-8)';" or such.  Make sure your web browser is Unicode savvy.

5. At this point, if the web page displays correctly with the non-ASCII
literal (and moreover, if you "view source" in the browser and the literal
also displays literally), then you know your program can work/represent
internally with Unicode correctly, and it can output Unicode correctly to
the browser.  It is very important to get this step working first, in
isolation, so that you are in a position to judge or troubleshoot other
issues such as receiving Unicode input from a browser or using it with a
database.

6. Next test that you can receive Unicode from the browser in the various
ways, whether by query string / http headers or in an http post.  Eg try
outputting a value and have the user submit it again, and compare for
equality either in the Perl program or by displaying it again next to the
original for visual inspection.  If any differences come up, then you know
any fixes you have to do concern either how you read and interpret the
browser request, or perhaps on how you instruct the browser on how to
submit a request.  Once that's all cleared up, then you know your I/O with
the web browser works fine.

7. To test a database, I suggest first using a known-good and Unicode savvy
alternate input method for putting some Unicode text in the database, such
as using an admin/utility tool that came with the DBMS.  Also make sure
that the database is itself using UTF-8 character strings in its schema, eg
that the schema is declared this way.

8. With a database known to contain some valid Unicode etc text, you first
test simply selecting that text from the database and displaying it.  If
anything doesn't match, it means you probably have to configure your DBMS
client connection encoding so it is UTF-8 (often done with a few certain
SQL commands), and then separately ensure that Perl is decoding the UTF-8
data into Perl text strings properly.  It's important to make sure you can
retrieve Unicode from the database properly so that you have a context for
judging that you can insert such text in the database.

9. Next try to insert some Unicode text in the database using your program,
then select it back to check that it worked.  If it didn't, then check DBMS
client connection settings, or that Perl is encoding text as UTF-8 properly.

10. Actually, when you have a known-good external tool to help you, you can
alternately start the DBMS tests with step 9, where your program inserts
text, then you use the known-good tool to ensure it actually was recorded
properly.
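
As a rough illustration of the checks in steps 3 to 6 (and the same idea
behind steps 8 and 9), here is a small core-Perl sketch.  It assumes the
script itself is saved as UTF-8, and it writes to an in-memory filehandle
instead of STDOUT so that the bytes that would be sent can be inspected;
the variable names are only for this example.

```perl
use strict;
use warnings;
use utf8;                         # this source file is UTF-8 (step 2)
use Encode qw(encode decode);

my $sample = 'サンプル';           # the non-ASCII literal from step 3

# Steps 4-5: what leaves the program must be UTF-8 bytes.  An in-memory
# file with an :encoding(UTF-8) layer stands in for STDOUT here.
my $wire = '';
open my $out, '>:encoding(UTF-8)', \$wire or die "open failed: $!";
print {$out} $sample;
close $out;

# Steps 6, 8, 9: whatever comes back (a form post, a SELECT result)
# arrives as bytes and has to be decoded before it can be compared.
my $returned = decode('UTF-8', $wire);

printf "characters: %d, bytes on the wire: %d\n",
    length($sample), length($wire);
print "round trip ", ($returned eq $sample ? "ok" : "FAILED"), "\n";
```

The literal is 4 characters long but 12 bytes once encoded, so comparing
lengths is a quick way to tell whether a given value is still bytes or has
already been decoded into characters.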

Anyway, that's it in a nutshell.  Now I'm sure many of you have already
figured this out, but for those who haven't, I hope these tips help you.
Adjust as appropriate to account for any abstraction tools or frameworks
you are using, which may mean your tests also involve testing those tools
or configuring them.

-- Darren Duncan

Hugh Hunter wrote:
> I've been struggling with this for some time and know there must be an 
> answer out there.
>
> I'm using URL arguments to pass parameters to my controller.  It's a site 
> about names, so take the url http://domain.com/name/Jesús (note the 
> accented u).  The Name.pm controller has an :Args(1) decorator so Jesús is 
> stored in $name and then passed to my DBIC model in a ->search({name => 
> $name}) call.  This doesn't manage to find the row that exists in mysql. 
> When I dump $name I get:
>
> 'name' => 'Jes\xc3\xbas'
>
> which I think I understand as being perl's internal escaping of utf-8 
> characters.
>
> I've done everything recommended on 
> http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode 
> and the name column in my mysql database uses the utf-8 charset.
>
> Where am I going wrong?
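
One possible reading of the dump quoted above is that $name still holds the
raw UTF-8 bytes of the URL argument rather than a decoded character string.
The sketch below only shows what decoding does to that exact value; where
(or whether) the decoding should happen in the application is not settled
by it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The value exactly as dumped in the question: six bytes.
my $raw_arg = "Jes\xc3\xbas";

# Decode the UTF-8 bytes into a Perl character string.
my $name = decode('UTF-8', $raw_arg);

printf "before decode: length %d\n", length $raw_arg;
printf "after decode:  length %d\n", length $name;
printf "fourth character: U+%04X\n", ord substr($name, 3, 1);
```

Before decoding the value is 6 units long; after decoding it is 5
characters, and the fourth one is U+00FA, the accented u.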
