[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8
Matias E. Fernandez
pisco at gmx.ch
Wed May 12 21:25:12 GMT 2010
Hi again
On 2010-05-12, at 17:16, Marc Mims wrote:
>> It is a string consisting of the three characters \x{e4}, \x{f6} and \x{fc}.
>> That's about all I have to know as a Perl user, reread [1] if in doubt. The
>> important thing to know is that you cannot rely on Perl internally holding
>> strings in UTF-8! Of course I could force Perl to internally hold this string in
>> UTF-8 by using utf8::upgrade(), but the question is: where should I do that so
>> as to cover all cases? As pointed out in [2], overwriting get_columns and
>> store_columns won't work reliably. That's why I suggested using the
>> inflate/deflate subroutines, but will this work in all cases? Even then it would
>> be a bad idea to use utf8::upgrade() because that's not what it's meant for. As
>> pointed out in [3] the flow should be as follows:
>
> No.
What do you mean by "no"? Which part of the passage do you disagree with?
> It's a string consisting of 3 bytes that happen to be latin-1 characters.
I disagree with that. Consider this:
use Test::More tests => 1;
my $string = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($string);    # changes the internal storage only
my $other_string = "\x{e4}\x{f6}\x{fc}";
ok($string eq $other_string, "upgraded and not upgraded character strings are equal");
Both $string and $other_string are perfectly valid Perl character strings, and
they are equal. How Perl holds them internally doesn't and shouldn't matter.
> If you're going to feed them to a module that expects UTF-8, you need to make them UTF-8, first.
If I were to send a module UTF-8 encoded data, I would do it as follows:
my $encoded = Encode::encode('UTF-8', $data);
The resulting scalar would be UTF-8, but it would not have the UTF8 flag! I
think that neither DBIx::Class nor DBD::mysql wants users to send them UTF-8
encoded data, instead they wrongly rely on Perl strings being internally in
UTF-8.
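To make the difference concrete, here is a minimal sketch using only Encode and the built-in utf8:: functions. The variable names are mine and purely illustrative; it says nothing about what DBD::mysql does internally, it only shows what encoding and upgrading each do to the same three-character string.

```perl
use strict;
use warnings;
use Encode ();

# Three characters, held by Perl in its native 8-bit form (UTF8 flag off).
my $chars = "\x{e4}\x{f6}\x{fc}";

# Upgrading a copy changes only the internal storage: still three
# characters, but now with the UTF8 flag on.
my $upgraded = $chars;
utf8::upgrade($upgraded);

# Encoding produces a byte string: six octets, UTF8 flag off.
my $bytes = Encode::encode('UTF-8', $chars);

for my $pair ([chars => $chars], [upgraded => $upgraded], [bytes => $bytes]) {
    my ($name, $value) = @$pair;
    printf "%-8s length=%d utf8_flag=%s\n",
        $name, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}

print "chars eq upgraded: ", ($chars eq $upgraded ? "yes" : "no"), "\n";
print "chars eq bytes:    ", ($chars eq $bytes    ? "yes" : "no"), "\n";
```

Code that inspects only the UTF8 flag would treat $chars and $upgraded differently even though they are the same string as far as Perl is concerned.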
Have you read and understood the part of the Perl documentation I referenced
about "The Unicode Bug"?
>>> 1. Receive and decode
>>> 2. Process
>>> 3. Encode and output
>
> Correct. The byte string in $title hasn't been decoded, yet.
Certainly not! utf8::upgrade() is not about decoding! Decoding would be
Encode::decode(..., ...).
> And I've had no trouble with DBD::mysql with mysql_enable_utf8 set and either
> the default encoding or specific columns set to utf8 in the DDL.
I agree that this is an edge case that exploits "The Unicode Bug". Nonetheless I
think that either DBIx::Class or DBD::mysql handles Perl strings incorrectly and
that this is a valid test case. Why don't they Encode::encode("UTF-8", ...) before
sending data to the database?
> You're quite correct that you shouldn't have to worry about what the
> internal representation is, in perl. As long as you decode
> input and encode output, you should be good.
Again, utf8::upgrade() and Encode::decode('UTF-8', ...) are not the same thing.
Read the documentation I referenced in my previous messages.
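A small sketch of that difference, assuming the input is the two octets that encode U+00E4 in UTF-8 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 encoded input as it might arrive from outside: two octets for U+00E4.
my $octets = "\xc3\xa4";

# Upgrading keeps the same two characters (U+00C3, U+00A4) and only
# changes how Perl stores them internally.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Decoding interprets the octets as UTF-8 and yields one character, U+00E4.
my $decoded = Encode::decode('UTF-8', $octets);

printf "upgraded: length=%d first=U+%04X\n", length($upgraded), ord($upgraded);
printf "decoded:  length=%d first=U+%04X\n", length($decoded),  ord($decoded);
```

The upgraded string still compares equal to the original octets; the decoded one does not, because it is a different string.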
> With DBD::mysql/mysql_enable_utf8, you send it decoded utf8 and you get
> back decoded utf8. It takes care of the decoding and encoding on
> input/output for you.
Not quite. I get back decoded UTF-8 data, yes! That's why you'll find
sv_utf8_decode in the DBD::mysql source, but nowhere do you find an encode!
> Perl itself understands that $title is latin-1, and when you encode it
> to utf8, it does the right thing. DBD::mysql isn't quite as smart. It
> expects a decoded utf8 string, so the utf8::upgrade is necessary.
utf8::upgrade() is not about decoding data.
> I don't think deflate/inflate is the correct place. That's serializing
> and de-serializing objects. If you're using DBD::mysql, you can simply
> use the mysql_enable_utf8 flag and you won't need
> DBIx::Class::UTF8Columns [1].
My unit tests show that DBD::mysql does not work correctly if exposed to "The
Unicode Bug". If deflate/inflate are about serializing and de-serializing
objects, then that should be the right place, because I have to encode Perl
Unicode strings to UTF-8 before sending them to the database. Oddly enough,
DBD::mysql does only the decode when receiving data from the database, but not
the encode when sending it.
How about doing a utf8::upgrade() in the deflate subroutine?
Do you know which bug, found in April 2010, the DBIx::Class::UTF8Columns
documentation refers to?
This is how I think the flow should be:
1. Receive and decode <-- DBD::mysql or DBIx::Class
2. Process <-- Perl code using either DBD::mysql or DBIx::Class
3. Encode and output <-- DBD::mysql or DBIx::Class
Have a look at JSON::XS, that's exactly the way JSON::XS behaves.
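The three steps above can be sketched as follows. from_wire() and to_wire() are hypothetical helpers that I made up to stand in for the boundary of a driver or ORM; they are not part of DBD::mysql, DBIx::Class or JSON::XS.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical boundary helpers, for illustration only.
sub from_wire { my ($octets) = @_; return Encode::decode('UTF-8', $octets) }
sub to_wire   { my ($chars)  = @_; return Encode::encode('UTF-8', $chars) }

# 1. Receive and decode: six UTF-8 octets become three characters.
my $incoming = "\xc3\xa4\xc3\xb6\xc3\xbc";
my $text     = from_wire($incoming);

# 2. Process: mix the decoded text with a native literal (UTF8 flag off).
#    Perl handles the differing internal representations by itself.
my $processed = $text . "\x{e4}";

# 3. Encode and output: always encode explicitly, whatever the flag says.
my $outgoing = to_wire($processed);

print join(' ', map { sprintf '%02x', ord } split //, $outgoing), "\n";

# A string that was never upgraded encodes to the same octets as one that
# went through decode, because the encode step does not rely on the flag.
my $native = "\x{e4}\x{f6}\x{fc}";
print "same octets: ", (to_wire($native) eq $incoming ? "yes" : "no"), "\n";
```

With the encode done explicitly at the boundary, the internal representation of the string never matters to the caller.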
Regards
Matias E. Fernandez