[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8
Matias E. Fernandez
pisco at gmx.ch
Thu May 13 16:11:53 GMT 2010
Hello again
On 2010-05-13, at 24:19, Marc Mims wrote:
>> I disagree with that. Consider this:
>>
>> my $string = "\x{e4}\x{f6}\x{fc}";
>> utf8::upgrade($string);
>>
>> my $other_string = "\x{e4}\x{f6}\x{fc}";
>>
>> ok($string eq $other_string, "upgraded and not upgraded character strings are equal");
>>
>> Both $string and $other_string are perfectly valid Perl character strings, and
>> they are equal. How Perl holds them internally doesn't and shouldn't matter.
>
> Unfortunately, it does matter. Perl supports 2 types of strings: byte
> strings and unicode strings. For legacy reasons, byte-strings are
> interpreted as latin-1. In your example, $string (after the
> utf8::upgrade) is a unicode string. $other_string is not. DBD::mysql
> with mysql_enable_utf8 will be happy with $string but apparently isn't
> happy with $other_string.
The important point is not that byte-strings are interpreted as Latin-1 in some cases, but that Perl tries
to keep its data as eight-bit bytes for as long as possible!
From perluniintro [1]:
> Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of Unicode characters. The principle is that
> Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be
> avoided, the data is (mostly) transparently upgraded to Unicode. There are some problems--see "The Unicode
> Bug" in perlunicode.
>
> Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1)
> is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl
> uses the native eight-bit character set. Otherwise, it uses UTF-8.
>
> A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it
> becomes relevant when outputting Unicode strings to a stream without a PerlIO layer (one with the "default"
> encoding). In such a case, the raw bytes used internally (the native character set or UTF-8, as appropriate for each
> string) will be used, and a "Wide character" warning will be issued if those strings contain a character beyond
> 0x00FF.
Note the first sentence of the third paragraph: "A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings"! I really recommend reading the whole chapter and having a look at
the examples! The problem arises because DBD::mysql sends data out without using a PerlIO layer; otherwise
there would be no problem with $other_string! Again, a user should never have to mess around with Perl internals
such as utf8::upgrade().
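A minimal sketch of this point, reusing $string and $other_string from the quoted example (the remaining variable names are illustrative only): the two strings compare equal and differ only in Perl's internal flag, and an explicit encoding step produces identical octets for both.

```perl
use strict;
use warnings;
use Encode ();

my $string = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($string);                    # internal storage becomes UTF-8

my $other_string = "\x{e4}\x{f6}\x{fc}";   # internal storage stays native 8-bit

# Same characters, different internal representation.
my $equal      = $string eq $other_string     ? 1 : 0;
my $flag_first = utf8::is_utf8($string)       ? 1 : 0;
my $flag_other = utf8::is_utf8($other_string) ? 1 : 0;

# Encoding explicitly yields the same octets regardless of the flag.
my $octets_first = Encode::encode("UTF-8", $string);
my $octets_other = Encode::encode("UTF-8", $other_string);
my $same_octets  = $octets_first eq $octets_other ? 1 : 0;

printf "equal=%d upgraded=%d/%d same_octets=%d octets=%d\n",
    $equal, $flag_first, $flag_other, $same_octets, length($octets_first);
```

Code that encodes at the output boundary like this never needs to inspect the internal flag at all.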
I repeat: it is incorrect to send data unencoded if you want to send UTF-8! Suppose a UTF-8 shell and the
following example:
my $var = "\x{fc}bercool \x{263a}";
print $var,"\n";
Perl will issue a warning "Wide character in print at -e line 1." although everything looks fine in the terminal.
The correct way in this situation would be something like this:
my $var = "\x{fc}bercool \x{263a}";
binmode(STDOUT, ":encoding(UTF-8)");
print $var,"\n";
or
use Encode;
my $var = "\x{fc}bercool \x{263a}";
print Encode::encode("UTF-8", $var),"\n";
Everything works as expected and there are no warnings. Consider the following:
my $a_string = "\x{fc}bercool :-)"; # "übercool :-)"
my $another_string = "\x{fc}bercool \x{263a}"; # "übercool ☺"
Does it really make sense that a library does not work as expected with $a_string, but does so with $another_string?
I think it doesn't! Using the \x{...} notation is absolutely okay; it only has implications if a library you use doesn't
respect Perl's Unicode model!
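As a rough sketch of what respecting the model could look like, the following writes both strings through an :encoding(UTF-8) layer into an in-memory buffer and checks that the result decodes back to the original characters. The helper name octets_via_layer is hypothetical and not part of DBD::mysql or DBIx::Class; it only stands in for "data leaving Perl through an encoding step".

```perl
use strict;
use warnings;
use Encode ();

my $a_string       = "\x{fc}bercool :-)";         # all code points <= 0xFF
my $another_string = "\x{fc}bercool \x{263a}";    # contains a code point > 0xFF

# Hypothetical helper: send text through an :encoding(UTF-8) layer
# into an in-memory buffer and return the resulting octets.
sub octets_via_layer {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

for my $text ($a_string, $another_string) {
    my $octets = octets_via_layer($text);

    # Strict decoding dies on malformed UTF-8; LEAVE_SRC keeps $octets intact.
    my $round_trip = Encode::decode("UTF-8", $octets,
                                    Encode::FB_CROAK | Encode::LEAVE_SRC);

    printf "%d characters -> %d octets, round trip %s\n",
        length($text), length($octets),
        $round_trip eq $text ? "ok" : "FAILED";
}
```

Both strings take the same code path and both come out as valid UTF-8; nothing depends on which internal representation Perl happened to choose.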
> I was just trying to be helpful. Like I said, I'm no unicode expert.
Thank you very much. I'm trying to be helpful too by pointing at an existing and real problem.
Regards
Matias E. Fernandez
[1] http://perldoc.perl.org/5.12.0/perluniintro.html#Perl's-Unicode-Model