[Catalyst] utf8 in regexes in Catalyst

Mon Mar 3 02:10:22 GMT 2008

* On Sun, Mar 02 2008, Alexandre Jousset wrote:
> 	Hello list,
>
> 	I'm able to "use encoding utf8" in a normal Perl program to
> correctly match regexes like this: echo 'é' | perl -e 'use encoding
> "utf8";print <> =~ /\w/' (in a utf8 terminal), but how could I
> achieve the same result in a Catalyst application? Am I obliged to
> "use encoding" in every module? Is there a way to do it globally? What
> is the right way to do this?

OK, never "use encoding".  Perl regexes will always DTRT on unicode
strings.  But, you need to ensure two things.  First, if you're putting
any unicode literals (as utf8) in your source code, you need to "use
utf8" at the top of the file.
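
To see what "use utf8" changes for a literal, here is a minimal sketch.
It assumes the file itself is saved as UTF-8; the variable names are
only for illustration.

```perl
use strict;
use warnings;

# Assumes this file is saved as UTF-8.
my $as_bytes = do { no utf8;  length "ほげ" };  # literal read as raw bytes
my $as_chars = do { use utf8; length "ほげ" };  # literal read as characters

print "without 'use utf8': $as_bytes, with 'use utf8': $as_chars\n";
```

Without the pragma the literal is six bytes; with it, two characters.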

Secondly, you need to make sure all data is correctly decoded into perl
characters (which is what regexes operate on).  To do this, you need to
say:

  use Encode;

  # get your data: a raw bytestream from somewhere that you know is
  # utf8-encoded (the literal bytes here are a stand-in for "é")
  my $utf8_octets = "\xc3\xa9";

  # convert the outside bytes to perl characters (CRITICAL STEP!)
  my $perl_characters = Encode::decode('utf-8', $utf8_octets);

Now $perl_characters will behave as expected.  For example, if you wrote:

  use utf8;
  use feature 'say';  # "say" needs perl 5.10 or later
  use Encode;
  my $data = "ほげ"; # "use utf8" is for this literal
  $data =~ s/(.)/($1)/g;

  # turn perl characters into utf8 for my xterm
  say Encode::encode('utf-8', $data);

The output would be:

  (ほ)(げ)

If you didn't "use utf8" for the literal or Encode::decode the outside
data, then the result would be something like

  (ã)()(»)(ã)()()

or

  (\343)(\201)(\273)(\343)(\201)(\222)
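
The two outcomes above can be reproduced side by side.  This is only a
sketch: the bytes are written as escapes so that the file encoding does
not matter, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 bytes for "ほげ", written as escapes
my $octets = "\xe3\x81\xbb\xe3\x81\x92";

# Without decoding, the regex sees six one-byte "characters"
(my $raw = $octets) =~ s/(.)/($1)/g;

# After decoding, the regex sees two characters
my $chars = Encode::decode('UTF-8', $octets);
(my $wrapped = $chars) =~ s/(.)/($1)/g;

# Count the opening parentheses that the substitution added
my $raw_groups     = () = $raw     =~ /\(/g;
my $wrapped_groups = () = $wrapped =~ /\(/g;

print "raw groups: $raw_groups, decoded groups: $wrapped_groups\n";

# Encode again before printing to a utf8 terminal
print Encode::encode('UTF-8', $wrapped), "\n";
```

The undecoded string gets six groups, the decoded one gets two.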

To summarize:

  * decode everything coming into your application
  * encode everything going out
  * "use utf8" if your source code is utf8 encoded.
  * NEVER "use encoding", it's b0rken.

Catalyst::Plugin::Unicode will do the first two things for you.  
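
Loading the plugin should only take one line in the application class.
A minimal sketch, assuming an application called MyApp (a hypothetical
name) and that Catalyst and Catalyst::Plugin::Unicode are installed:

```perl
package MyApp;   # hypothetical application class
use strict;
use warnings;

# The plugin decodes request parameters from UTF-8 on the way in
# and encodes the response body as UTF-8 on the way out.
use Catalyst qw/Unicode/;

__PACKAGE__->setup;

1;
```

This covers the whole application, so there is nothing to add to
individual controllers or models for request and response data.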

You can verify that it's working by using Devel::Peek to dump the
string.

Correctly decoded data:

  $ perl -MDevel::Peek -e 'my $data = "ほげ"; utf8::decode($data); Dump($data)'
  SV = PV(0x72b098) at 0x72e3e0
    REFCNT = 1
    FLAGS = (PADMY,POK,pPOK,UTF8)
    PV = 0x73aa40 "\343\201\273\343\201\222"\0 [UTF8 "\x{307b}\x{3052}"]
    CUR = 6
    LEN = 8

Note the UTF8 flag and the decoded \x{...} string shown next to the PV.

Incorrectly decoded data:

  $ perl -MDevel::Peek -e 'my $data = "ほげ"; Dump($data)'
  SV = PV(0x72b098) at 0x72e3e0
    REFCNT = 1
    FLAGS = (PADMY,POK,pPOK)
    PV = 0x73aa30 "\343\201\273\343\201\222"\0
    CUR = 6
    LEN = 8

Note the missing UTF8 flag and the lack of the "decoded" string (in PV).

IMPORTANT NOTE: THIS CHECK DOES NOT ALWAYS GIVE CORRECT RESULTS.  If you
used ü instead of ほげ, perl can store the data as characters without
converting to utf8 (it uses latin1 internally instead), so a perfectly
good character string can show up without the UTF8 flag.  So if you
want to verify that your code is correct, use some non-latin1 data and
some latin1 data, and make sure it works in both cases.  You shouldn't
have to do this, but it's better to confirm than to guess.
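
To illustrate the latin1 case, here is a small sketch.  It builds "ü"
with chr() so nothing depends on the file encoding, and uses
utf8::is_utf8 and utf8::upgrade only to peek at and change the
internal storage; the variable names are made up.

```perl
use strict;
use warnings;

my $native   = chr(0xFC);    # "ü" as one character, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored as utf8 internally

printf "flag:   %d vs %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
printf "length: %d vs %d\n", length($native), length($upgraded);
printf "equal:  %s\n", ($native eq $upgraded ? 'yes' : 'no');
```

Both strings hold the same single character and compare equal; only
the flag differs, which is why the flag alone is not a reliable test.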

Finally, I recently wrote this article:

  http://blog.jrock.us/articles/Fuck%20the%20internal%20representation.pod

Hope this helps.

Regards,
Jonathan Rockway