[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8

Wed May 12 21:25:12 GMT 2010

Hi again

On 2010-05-12, at 17:16, Marc Mims wrote:
>> It is a string consisting of the three characters \x{e4}, \x{f6} and \x{fc}. 
>> That's about all I have to know as a Perl user, reread [1] if in doubt. The 
>> important thing to know is that you cannot rely on Perl internally holding 
>> strings in UTF-8! Of course I could force Perl to internally hold this string in 
>> UTF-8 by using utf8::upgrade(), but the question is: where should I do that so 
>> as to cover all cases? As pointed out in [2], overwriting get_columns and 
>> store_columns won't work reliably. That's why I suggested using the 
>> inflate/deflate subroutines, but will this work in all cases? Even then it would 
>> be a bad idea to use utf8::upgrade() because that's not what it's meant for. As 
>> pointed out in [3] the flow should be as follows:
> 
> No.

What do you mean by "no"? Which part of the passage do you disagree with?

> It's a string consisting of 3 bytes that happen to be latin-1 characters.

I disagree with that. Consider this:

use Test::More tests => 1;

my $string = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($string);

my $other_string = "\x{e4}\x{f6}\x{fc}";

ok($string eq $other_string, "upgraded and not upgraded character strings are equal");

Both $string and $other_string are perfectly valid Perl character strings, and 
they are equal. How Perl holds them internally doesn't and shouldn't matter.
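
To make the point concrete, here is a minimal sketch that looks at both the internal flag and the character-level view. The variable names are only illustrative; utf8::is_utf8() and utf8::upgrade() are the built-in functions documented in the utf8 manpage.

```perl
use strict;
use warnings;

# Two character strings holding the same three code points.
my $native   = "\x{e4}\x{f6}\x{fc}";
my $upgraded = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($upgraded);    # changes only the internal storage

# The internal flag differs between the two scalars ...
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# ... but length and equality are defined on characters, not on storage.
my $same_length = length($native) == length($upgraded) ? 1 : 0;
my $equal       = $native eq $upgraded                 ? 1 : 0;

print "flags: $native_flag vs $upgraded_flag, ",
      "same length: $same_length, equal: $equal\n";
```

Code that behaves differently for these two scalars is looking at the storage rather than at the characters.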

> If you're going to feed them to a module that expects UTF-8, you need to make them UTF-8, first.

If I were to send a module UTF-8 encoded data, I would do as follows:

my $bytes = Encode::encode('UTF-8', $data);

The resulting scalar would be UTF-8, but it would not have the UTF8 flag! I 
think that neither DBIx::Class nor DBD::mysql wants users to send them UTF-8 
encoded data, instead they wrongly rely on Perl strings being internally in 
UTF-8.
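
A small sketch of what explicit encoding produces, assuming the same three-character test string as above (the variable names are mine):

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{e4}\x{f6}\x{fc}";              # 3 characters
my $bytes = Encode::encode('UTF-8', $chars);   # 6 octets

# Hex dump of the encoded octets.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

# The encoded scalar is a plain byte string: no UTF8 flag.
my $flag = utf8::is_utf8($bytes) ? 1 : 0;

print "octets: $hex\n";
print "length: ", length($bytes), ", UTF8 flag: $flag\n";
```

So "UTF-8 encoded data" and "a scalar with the UTF8 flag set" are two different things.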

Have you read and understood the part of the Perl documentation I referenced 
about "The Unicode Bug"?

>>> 1. Receive and decode
>>> 2. Process
>>> 3. Encode and output
> 
> Correct.  The byte string in $title hasn't been decoded, yet.

Certainly not! utf8::upgrade() is not about decoding! Decoding would be 
Encode::decode(..., ...).
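
The difference can be shown on a scalar that really does hold undecoded UTF-8 octets. This is only a sketch with made-up variable names:

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+00E4 as they might arrive from a file or socket.
my $octets = "\xc3\xa4";

# utf8::upgrade() keeps the meaning: still the two characters
# U+00C3 and U+00A4, only stored differently.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Encode::decode() changes the meaning: the two octets are
# interpreted as UTF-8 and become the single character U+00E4.
my $decoded = Encode::decode('UTF-8', $octets);

printf "upgraded: %d characters, decoded: %d character\n",
    length($upgraded), length($decoded);
```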

> And I've had no trouble with DBD::mysql with mysql_enable_utf8 set and either 
> the default encoding or specific columns set to utf8 in the DDL.

I agree that this is an edge case that exploits "The Unicode Bug", nonetheless I 
think that either DBIx::Class or DBD::mysql handles Perl strings wrong and that 
this is a valid test case. Why don't they Encode::encode("UTF-8", ...) before sending 
data to the database?

> You're quite correct that you shouldn't have to worry about what the
> internal representation is, in perl.  As long as you decode
> input and encode output, you should be good.

Again, utf8::upgrade() and Encode::decode('UTF-8', ...) are not the same thing. Read 
the documentation I referenced in my previous messages.

> With DBD::mysql/mysql_enable_utf8, you send it decoded utf8 and you get
> back decoded utf8.  It takes care of the decoding and encoding on
> input/output for you.

Not quite. I get back decoded UTF-8 data, yes! That's why you'll find 
sv_utf8_decode in the DBD::mysql source, but nowhere do you find an encode!

> Perl itself understands that $title is latin-1, and when you encode it
> to utf8, it does the right thing.  DBD::mysql isn't quite as smart.  It
> expects a decoded utf8 string, so the utf8::upgrade is necessary.

utf8::upgrade() is not about decoding data.

> I don't think deflate/inflate is the correct place.  That's serializing
> and de-serializing objects.  If you're using DBD::mysql, you can simply
> use the mysql_enable_utf8 flag and you won't need
> DBIx::Class::UTF8Columns [1].

My unit tests show that DBD::mysql does not work correctly if exposed to "The 
Unicode Bug". If deflate/inflate are about de-serializing objects, then it 
should be the right place, because I have to encode Perl Unicode strings to 
UTF-8 before sending them to the database. Oddly enough DBD::mysql does only the 
decode when receiving data from the database, but not the encode part.

How about doing a utf8::upgrade() in the deflate subroutine?
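
As a rough sketch of that workaround: upgraded_copy() is a hypothetical helper name, not part of DBIx::Class or DBD::mysql, and the wiring into a deflate hook is not shown. Whether such a hook is actually invoked for plain string values is an assumption that would need checking.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the value whose internal
# storage is UTF-8, leaving the caller's scalar untouched.
sub upgraded_copy {
    my ($value) = @_;
    return $value unless defined $value;    # pass NULLs through
    my $copy = $value;
    utf8::upgrade($copy);
    return $copy;
}

my $title  = "\x{e4}\x{f6}\x{fc}";
my $for_db = upgraded_copy($title);

printf "flag before: %d, flag after: %d, same string: %d\n",
    utf8::is_utf8($title)  ? 1 : 0,
    utf8::is_utf8($for_db) ? 1 : 0,
    $title eq $for_db      ? 1 : 0;
```

This only works around a driver that looks at the internal storage; it does not encode anything.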

Do you know the bug found in April 2010 that the DBIx::Class::UTF8Columns 
documentation refers to?

This is how I think the flow should be:

1. Receive and decode <-- DBD::mysql or DBIx::Class
2. Process <-- Perl code using either DBD::mysql or DBIx::Class
3. Encode and output <-- DBD::mysql or DBIx::Class

Have a look at JSON::XS, that's exactly the way JSON::XS behaves.
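
The three steps can be sketched as follows. I use the core module JSON::PP here on the assumption that it may be easier to run than JSON::XS; its new/utf8/encode/decode interface is compatible. The sample data and variable names are made up.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP ();

# 1. Receive and decode: UTF-8 octets become a character string.
my $input_octets = "\xc3\xa4\xc3\xb6\xc3\xbc";
my $chars        = Encode::decode('UTF-8', $input_octets);

# 2. Process: work on characters, whatever the internal storage is.
my $processed = scalar reverse $chars;

# 3. Encode and output: the module produces UTF-8 octets itself.
my $json        = JSON::PP->new->utf8;
my $json_octets = $json->encode([$processed]);

# The internal representation of the input does not matter:
my $native   = "\x{e4}";
my $upgraded = $native;
utf8::upgrade($upgraded);
my $same_output = $json->encode([$native]) eq $json->encode([$upgraded]) ? 1 : 0;

print "same output for native and upgraded input: $same_output\n";
```

The caller never has to think about utf8::upgrade(); the encoding happens at the boundary.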

Regards
Matias E. Fernandez