[Catalyst] URI->new() with utf8 string and Unicode::Encoding will
not work (but URI->new() with utf8 octets will work)
Francisco Obispo
fobispo at isc.org
Fri Mar 4 07:45:05 GMT 2011
I believe what's happening is that Catalyst decodes the UTF-8 string into Perl's internal character format, and that particular example works for you because encode_utf8 forces the string back into UTF-8 octets.
This is a script I wrote and use to test Unicode issues:
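To make the decode/encode round trip concrete, here is a minimal sketch. It only exercises the Encode functions on the octets from the test URL (%E3%81%8B); the variable names are mine, and it is not a picture of what Catalyst does internally.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# The UTF-8 octets for U+304B, i.e. the percent-decoded form of %E3%81%8B.
my $octets = "\xE3\x81\x8B";

# Decoding turns the three octets into a single Perl character.
my $chars = decode_utf8($octets);

# Encoding turns that character back into the original three octets.
my $round_trip = encode_utf8($chars);

printf "octets: %d, characters: %d, round trip identical: %s\n",
    length($octets), length($chars),
    ( $round_trip eq $octets ? 'yes' : 'no' );
```

The decoded form is the one where length() returns 1; the encoded form is what has to go on the wire.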
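#!/usr/bin/env perl
use strict;
use warnings;
use Encode;

#my @list = Encode->encodings(q{:all});
#printf("Encodings Available:\n");
#map {printf("\t-%s\n",$_)}@list;

# Each argument arrives from the shell as raw UTF-8 octets.
foreach my $word (@ARGV) {
    printf( "Word is %s\n", $word );

    # Decode the octets into Perl characters, then split per character.
    my $string = decode( 'utf8', $word );
    my @chr    = split( q{}, $string );

    printf( "Length decoded is %d\n",  length($string) );
    printf( "Length as bytes is %d\n", length($word) );

    # Index, code point in hex and decimal, and the character re-encoded as UTF-8.
    my $i = 0;
    foreach my $char (@chr) {
        printf( '%d] +U%.4X - %2$04d - %s' . "\n",
            ++$i, ord($char), encode_utf8($char) );
    }
}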
To get the correct length, I have to decode the UTF-8 string into Perl's internal format; otherwise length() just counts bytes:
$ ./test_unicode.pl español
Word is español
Length decoded is 7
Length as bytes is 8
1] +U0065 - 0101 - e
2] +U0073 - 0115 - s
3] +U0070 - 0112 - p
4] +U0061 - 0097 - a
5] +U00F1 - 0241 - ñ
6] +U006F - 0111 - o
7] +U006C - 0108 - l
As you can see, Perl's length() treats the string either as characters or as bytes, depending on whether the string has been decoded.
So if you don't decode the string, string functions such as split() work on individual bytes, and the result is a disaster for anything outside ASCII.
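A small sketch of the split() case, using the same word as above written out as explicit octets. The variable names are only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "español" as UTF-8 octets; the ñ (U+00F1) is the two octets C3 B1.
my $word_octets = "espa\xC3\xB1ol";

# Splitting the raw octets cuts the ñ into two separate pieces.
my @byte_pieces = split //, $word_octets;

# Splitting the decoded string yields one piece per character.
my @char_pieces = split //, decode_utf8($word_octets);

printf "pieces from octets: %d\n",     scalar @byte_pieces;
printf "pieces from characters: %d\n", scalar @char_pieces;
printf "fifth piece from octets: 0x%02X\n",     ord $byte_pieces[4];
printf "fifth piece from characters: U+%04X\n", ord $char_pieces[4];
```

Without decoding, the fifth piece is the lone lead byte 0xC3, which is not a valid character by itself.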
Hope this helps.
Francisco
On Mar 3, 2011, at 11:26 PM, Eisenberger Tamás wrote:
> Hi!
>
> Yes, using encode_utf8 makes the test work.
>
> But anyway, this looks like a problem with the test, because we have
> tests to compare the entire captures / arguments / params strings with
> their originals, and if these tests pass the length of the strings must
> be ok!
>
> So Erik, can you please review your test, or explain a real-world
> situation of the problem you are facing?
>
> I actually use utf8 strings in URLs now without problems :)
> --
> Eisenberger Tamás <tamas at eisenberger.hu>
>
> On Thu, 2011-03-03 at 21:33 -0800, Bill Moseley wrote:
>> Does this help?
>>
>> On Thu, Mar 3, 2011 at 2:38 PM, Erik Wasser <erik.wasser at iquer.net>
>> wrote:
>> foreach my $u ('http://localhost/test/%E3%81%8B',
>>                "http://localhost/test/\x{304b}" )
>> {
>>     my $request = HTTP::Request->new(
>>         'GET' => encode_utf8($u),
>>         [ 'Content-Type' => 'text/html; charset=utf8', ],
>>     );
>>     print $request->as_string();
>>     my $response = request( $request );
>>     is( $response->content, 'length = 1', 'length = 1' );
>> }
>>
>>
>> --
>> Bill Moseley
>> moseley at hank.org
>> _______________________________________________
>> List: Catalyst at lists.scsys.co.uk
>> Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
>> Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
>> Dev site: http://dev.catalyst.perl.org/
Francisco Obispo
Hosted@ Programme Manager
email: fobispo at isc.org
Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
Key fingerprint = 532F 84EB 06B4 3806 D5FA 09C6 463E 614E B38D B1BE