[Catalyst] URI->new() with utf8 string and Unicode::Encoding will not work (but URI->new() with utf8 octets will work)

Francisco Obispo fobispo at isc.org
Fri Mar 4 07:45:05 GMT 2011


I believe what's happening is that Catalyst decodes the UTF-8 octets into Perl's internal character format, and that particular example works for you because the encode_utf8 function forces the string back into UTF-8 octets.
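A rough sketch of that decode/encode round trip, using only the core Encode module. The helper name lengths() is made up for illustration and is not part of Catalyst or Encode; the input is the same character as %E3%81%8B in the test URL:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper (not from Catalyst): return the length of the raw
# octets, of the decoded character string, and of the re-encoded octets.
sub lengths {
    my ($octets) = @_;
    my $chars = decode_utf8($octets);    # UTF-8 octets -> Perl characters
    my $back  = encode_utf8($chars);     # Perl characters -> UTF-8 octets
    return ( length($octets), length($chars), length($back) );
}

# "\xE3\x81\x8B" is the UTF-8 encoding of U+304B.
my ( $as_octets, $as_chars, $re_encoded ) = lengths("\xE3\x81\x8B");
printf "octets: %d, characters: %d, re-encoded: %d\n",
    $as_octets, $as_chars, $re_encoded;
# prints "octets: 3, characters: 1, re-encoded: 3"
```

The decoded form is the only one where length() reports 1, which would be consistent with the test passing only when the value seen by the action has been decoded.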

This is a script I wrote and use to test Unicode issues:

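#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Uncomment to list every encoding that Encode knows about:
#printf("Encodings Available:\n");
#printf("\t-%s\n", $_) for Encode->encodings(q{:all});

foreach my $arg (@ARGV) {
  # @ARGV holds raw bytes, so print them back unchanged.
  printf( "Word is %s\n", $arg );

  # Decode the UTF-8 octets into Perl's internal character string.
  my $string = decode_utf8($arg);
  my @chr    = split( q{}, $string );

  printf( "Length decoded is %d\n", length($string) );
  printf( "Length as bytes is %d\n", length($arg) );

  # One line per character: index, code point in hex and in decimal,
  # and the character itself re-encoded to UTF-8 for output.
  my $i = 0;
  foreach my $char (@chr) {
    printf( '%d] +U%.4X - %2$04d - %s' . "\n",
            ++$i, ord($char), encode_utf8($char) );
  }
}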

To get the correct length, I have to decode the UTF-8 string into Perl's internal format; otherwise length() just counts bytes:

$ ./test_unicode.pl español
Word is español
Length decoded is 7
Length as bytes is 8
1] +U0065 - 0101 - e
2] +U0073 - 0115 - s
3] +U0070 - 0112 - p
4] +U0061 - 0097 - a
5] +U00F1 - 0241 - ñ
6] +U006F - 0111 - o
7] +U006C - 0108 - l

As you can see, Perl's length() treats the string either as characters or as bytes, depending on whether the string has been decoded.

So if you don't decode the string, string functions such as split() operate on individual bytes, and the result is a disaster.
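To make the split() problem concrete, here is a minimal sketch with the same word as above. The helper name codes() is only for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: split a string into its elements and return the
# ordinal of each one as two hex digits.
sub codes {
    my ($string) = @_;
    return map { sprintf( '%02X', ord($_) ) } split( //, $string );
}

my $octets = "espa\xC3\xB1ol";    # "español" as UTF-8 octets

# Undecoded: split() works on bytes, so the n-tilde breaks into C3 and B1.
print join( ' ', codes($octets) ), "\n";
# prints "65 73 70 61 C3 B1 6F 6C"

# Decoded: split() works on characters, so the n-tilde stays whole (F1).
print join( ' ', codes( decode_utf8($octets) ) ), "\n";
# prints "65 73 70 61 F1 6F 6C"
```

Anything downstream that treats those two bytes as separate characters (substr, reverse, regex character classes) will mangle the text in the same way.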

Hope this helps.

Francisco


On Mar 3, 2011, at 11:26 PM, Eisenberger Tamás wrote:

> Hi!
> 
> Yes, using encode_utf8 makes the test work.
> 
> But anyway, this looks like a problem with the test, because we have
> tests that compare the entire captures / arguments / params strings with
> their originals, and if these tests pass, the length of the strings must
> be OK!
> 
> So Erik, can you please review your test, or explain a real-world
> situation where you are facing the problem?
> 
> I actually use utf8 strings in URLs now without problems :)
> -- 
> Eisenberger Tamás <tamas at eisenberger.hu>
> 
> On Thu, 2011-03-03 at 21:33 -0800, Bill Moseley wrote:
>> Does this help?
>> 
>> On Thu, Mar 3, 2011 at 2:38 PM, Erik Wasser <erik.wasser at iquer.net>
>> wrote:
>>        foreach my $u ( 'http://localhost/test/%E3%81%8B',
>>                        "http://localhost/test/\x{304b}" )
>>        {
>>           my $request = HTTP::Request->new(
>>               'GET' => encode_utf8($u),
>>               [ 'Content-Type' => 'text/html; charset=utf8', ],
>>           );
>>           print $request->as_string();
>>           my $response = request( $request );
>>           is( $response->content, 'length = 1', 'length = 1' );
>>        }
>> 
>> 
>> -- 
>> Bill Moseley
>> moseley at hank.org
>> _______________________________________________
>> List: Catalyst at lists.scsys.co.uk
>> Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
>> Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
>> Dev site: http://dev.catalyst.perl.org/

Francisco Obispo 
Hosted@ Programme Manager
email: fobispo at isc.org
Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
Key fingerprint = 532F 84EB 06B4 3806 D5FA  09C6 463E 614E B38D B1BE






