[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8

Thu May 13 16:11:53 GMT 2010

Hello again

On 2010-05-13, at 24:19, Marc Mims wrote:

>> I disagree with that. Consider this:
>> 
>> my $string = "\x{e4}\x{f6}\x{fc}";
>> utf8::upgrade($string);
>> 
>> my $other_string = "\x{e4}\x{f6}\x{fc}";
>> 
>> ok($string eq $other_string, "upgraded and not upgraded character strings are equal");
>> 
>> Both $string and $other_string are perfectly valid Perl character strings, and 
>> they are equal. How Perl holds them internally doesn't and shouldn't matter.
> 
> Unfortunately, it does matter.  Perl supports 2 types of strings: byte
> strings and unicode strings.  For legacy reasons, byte-strings are
> interpreted as latin-1. In your example, $string (after the
> utf8::upgrade) is a unicode string. $other_string is not.  DBD::mysql
> with mysql_enable_utf8 will be happy with $string but apparently isn't
> happy with $other_string.

The important point is not that byte-strings are interpreted as Latin-1 in some cases, but that Perl tries 
to keep its data as eight-bit bytes for as long as possible!

From perluniintro [1]:

> Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of Unicode characters. The principle is that 
> Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be 
> avoided, the data is (mostly) transparently upgraded to Unicode. There are some problems--see "The Unicode 
> Bug" in perlunicode.
> 
> Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) 
> is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl 
> uses the native eight-bit character set. Otherwise, it uses UTF-8.
> 
> A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it 
> becomes relevant when outputting Unicode strings to a stream without a PerlIO layer (one with the "default" 
> encoding). In such a case, the raw bytes used internally (the native character set or UTF-8, as appropriate for each 
> string) will be used, and a "Wide character" warning will be issued if those strings contain a character beyond
>  0x00FF.

Note the first sentence of the third paragraph: "A user of Perl does not normally need to know nor care how Perl 
happens to encode its internal strings"! I really recommend reading the whole chapter and having a look at 
the examples! The problem arises because DBD::mysql sends data out without using a PerlIO layer, otherwise 
there would be no problem with $other_string! Again, a user should never have to mess around with Perl internals 
such as utf8::upgrade().
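
To make this concrete, here is a minimal sketch using the two strings from my earlier example. It only uses utf8::upgrade(), utf8::is_utf8() and Encode::encode(); the variable names are mine. It is meant to show that the strings compare equal, that only the internal flag differs, and that explicit encoding gives the same UTF-8 octets for both:

```perl
use strict;
use warnings;
use Encode ();

# Same three characters, two different internal representations.
my $upgraded = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($upgraded);            # now held internally as UTF-8
my $native = "\x{e4}\x{f6}\x{fc}";   # still held as native eight-bit bytes

# At the Perl level the two strings are indistinguishable.
my $equal = $upgraded eq $native ? 1 : 0;

# Only the internal flag differs. Application code should not depend on it.
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;

# Explicit encoding yields the same UTF-8 octets in both cases.
my $octets_upgraded = Encode::encode("UTF-8", $upgraded);
my $octets_native   = Encode::encode("UTF-8", $native);

printf "equal=%d flag_upgraded=%d flag_native=%d\n",
    $equal, $flag_upgraded, $flag_native;
printf "octets: %s / %s\n",
    unpack("H*", $octets_upgraded), unpack("H*", $octets_native);
```

A consumer that reads the internal buffer instead of encoding would see different bytes for the two strings, which as far as I can tell is the behaviour being discussed in this thread.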

I repeat that data must be encoded explicitly if you want to send UTF-8! Assume a UTF-8 terminal and the 
following example:

my $var = "\x{fc}bercool \x{263a}"; 
print $var,"\n";

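Perl will issue the warning "Wide character in print at -e line 1.", although everything looks fine in the terminal. 
It only looks fine because the string contains a character above 0xFF, so Perl happens to write its internal UTF-8 bytes. 
The correct way in this situation would be something like this: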

my $var = "\x{fc}bercool \x{263a}";
binmode(STDOUT, ":encoding(UTF-8)");
print $var,"\n";

or 

use Encode;
my $var = "\x{fc}bercool \x{263a}";
print Encode::encode("UTF-8", $var),"\n";

Everything works as expected and there are no warnings. Consider the following:

my $a_string = "\x{fc}bercool :-)";                         # "übercool :-)"
my $another_string = "\x{fc}bercool \x{263a}";  # "übercool ☺"

Does it really make sense that a library does not work as expected with $a_string, but does so with $another_string? 
I think it doesn't! Using the \x{...} notation is absolutely okay, it only has implications if a library you use doesn't 
respect Perl's Unicode model!
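
Here is a small sketch of the difference between the two strings. The helper to_utf8_octets() is hypothetical: it only illustrates what I would expect a library to do before sending a character string over a UTF-8 connection, and it is not taken from DBD::mysql. bytes::length() is used purely as a debugging aid to look at the size of the internal buffer:

```perl
use strict;
use warnings;
use Encode ();
use bytes ();   # loaded only for bytes::length(), a debugging aid

my $a_string       = "\x{fc}bercool :-)";
my $another_string = "\x{fc}bercool \x{263a}";

# Hypothetical helper: encode a character string to UTF-8 octets,
# regardless of how Perl happens to hold it internally.
sub to_utf8_octets {
    my ($chars) = @_;
    return Encode::encode("UTF-8", $chars);
}

my %report;
for my $pair ([a => $a_string], [another => $another_string]) {
    my ($name, $str) = @$pair;
    $report{$name} = {
        chars    => length $str,                   # characters
        flagged  => utf8::is_utf8($str) ? 1 : 0,   # internal UTF-8 flag
        internal => bytes::length($str),           # size of internal buffer
        encoded  => length to_utf8_octets($str),   # size as UTF-8 octets
    };
    printf "%-8s chars=%d flagged=%d internal=%d encoded=%d\n",
        $name, @{ $report{$name} }{qw(chars flagged internal encoded)};
}
```

For $a_string the internal buffer is one byte shorter than the UTF-8 encoding, because the character 0xFC is held as a single native byte. For $another_string the two sizes match, which is why that string happens to work when something reads the internal buffer directly. With explicit encoding both strings come out as valid UTF-8.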

> I was just trying to be helpful.  Like I said, I'm no unicode expert.

Thank you very much. I'm trying to be helpful too by pointing at an existing and real problem.

Regards
Matias E. Fernandez

[1] http://perldoc.perl.org/5.12.0/perluniintro.html#Perl's-Unicode-Model