[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8

Marc Mims marc at questright.com
Wed May 12 15:16:33 GMT 2010


* Matias E. Fernandez <pisco at gmx.ch> [100512 07:43]:
> On 2010-05-12, at 15:40, Marc Mims wrote:
> >> my $title = "\x{e4}\x{f6}\x{fc}"; # "äöü"
> > 
> > This isn't a UTF-8 string.
> > 
> >    utf8::is_utf8($title); # false
> > 
> >    utf8::upgrade($title); # now it is
> 
> It is a string consisting of the three characters \x{e4}, \x{f6} and \x{fc}. That's about all I have to know as a Perl user, reread [1] if in doubt. The important thing to know is that you cannot rely on Perl internally holding strings in UTF-8! Of course I could force Perl to internally hold this string in UTF-8 by using utf8::upgrade(), but the question is: where should I do that so as to cover all cases? As pointed out in [2], overriding get_columns and store_columns won't work reliably. That's why I suggested using the inflate/deflate subroutines, but will this work in all cases? Even then it would be a bad idea to use utf8::upgrade() because that's not what it's meant for. As pointed out in [3] the flow should be as follows:

No.  It's a string consisting of 3 bytes that happen to be latin-1
characters.  If you're going to feed them to a module that expects
UTF-8, you need to make them UTF-8 first.
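
A minimal sketch of the distinction under discussion, assuming a stock
perl with no source-encoding pragmas in effect.  The variable names
other than $title are only for illustration.  It shows what
utf8::upgrade changes (the internal storage flag) and what it leaves
alone (the characters and their count):

```perl
use strict;
use warnings;

# Three characters, all below 0x100, written with \x{..} escapes.
my $title = "\x{e4}\x{f6}\x{fc}";

# Record the internal-storage flag and the character count.
my $flag_before = utf8::is_utf8($title) ? 1 : 0;
my $len_before  = length $title;

# Switch the internal storage to UTF-8; the characters are unchanged.
utf8::upgrade($title);

my $flag_after = utf8::is_utf8($title) ? 1 : 0;
my $len_after  = length $title;

printf "flag: %d -> %d, length: %d -> %d\n",
    $flag_before, $flag_after, $len_before, $len_after;
```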

> > 1. Receive and decode
> > 2. Process
> > 3. Encode and output

Correct.  The byte string in $title hasn't been decoded, yet.

I'm not a Unicode expert.  I've struggled mightily with it and seem to
have eventually got it right in Net::Twitter.  And I've had no trouble
with DBD::mysql with mysql_enable_utf8 set and either the default
encoding or specific columns set to utf8 in the DDL.  But in order for
it to work, you have to send it decoded utf8 strings, not latin-1 strings.

You're quite correct that you shouldn't have to worry about what the
internal representation is in perl.  As long as you decode input and
encode output, you should be good.
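
The decode/process/encode flow can be sketched with the core Encode
module.  This is only an illustration: the input bytes are a made-up
sample (the UTF-8 encoding of the same three characters), and uc
stands in for whatever processing you actually do:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Receive and decode: six bytes of UTF-8 become three characters.
my $input_bytes = "\xc3\xa4\xc3\xb6\xc3\xbc";
my $text        = decode('UTF-8', $input_bytes);

# 2. Process: work on characters, not bytes.
my $upper = uc $text;

# 3. Encode and output: characters become bytes again.
my $output_bytes = encode('UTF-8', $upper);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length $input_bytes, length $text, length $output_bytes;
```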

With DBD::mysql/mysql_enable_utf8, you send it decoded utf8 and you get
back decoded utf8.  It takes care of the decoding and encoding on
input/output for you.
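
For reference, a connection sketch with the flag set.  The DSN, user
name and password are placeholders, not values from this thread, and
it assumes DBI and DBD::mysql are installed and a server is reachable:

```perl
use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials; substitute your own.
my $dbh = DBI->connect(
    'dbi:mysql:database=example_db;host=localhost',
    'example_user',
    'example_password',
    {
        RaiseError        => 1,   # die on errors instead of returning undef
        mysql_enable_utf8 => 1,   # let DBD::mysql handle UTF-8 on input/output
    },
);
```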

> and as a matter of fact, neither DBIx::Class nor DBD::mysql does the 3rd step (encoding to UTF-8); otherwise the problem would not arise. Look at this:
> 
> my $title = "\x{e4}\x{f6}\x{fc}";
> return Encode::encode('UTF-8', $title);
> 
> and
> 
> my $other_title = "\x{e4}\x{f6}\x{fc}";
> utf8::upgrade($other_title); 
> return Encode::encode('UTF-8', $other_title);
> 
> Both yield the same result. Using utf8::upgrade() here is useless, and again: as pointed out in [1] you shouldn't care about the internal format.

Perl itself understands that $title is latin-1, and when you encode it
to utf8, it does the right thing.  DBD::mysql isn't quite as smart.  It
expects a decoded utf8 string, so the utf8::upgrade is necessary.
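
A small sketch of the comparison quoted above, assuming only core perl
and Encode ($plain and $upgraded are illustrative names).  It checks
that both strings encode to the same bytes, and that the only
observable difference is the internal flag:

```perl
use strict;
use warnings;
use Encode ();

my $plain    = "\x{e4}\x{f6}\x{fc}";
my $upgraded = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($upgraded);   # changes internal storage only

# Perl compares by characters, so the two strings are equal.
my $same_chars = $plain eq $upgraded ? 1 : 0;

# Both encode to the same UTF-8 byte sequence.
my $bytes_plain    = Encode::encode('UTF-8', $plain);
my $bytes_upgraded = Encode::encode('UTF-8', $upgraded);
my $same_bytes     = $bytes_plain eq $bytes_upgraded ? 1 : 0;

# The internal flag is where they differ.
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

printf "same chars: %d, same bytes: %d, flags: %d vs %d\n",
    $same_chars, $same_bytes, $flag_plain, $flag_upgraded;
```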

> My question remains: is deflate/inflate a safe place to do encoding, or will it suffer the same flaws as DBIx::Class::UTF8Columns?

I don't think deflate/inflate is the correct place.  That's serializing
and de-serializing objects.  If you're using DBD::mysql, you can simply
use the mysql_enable_utf8 flag and you won't need
DBIx::Class::UTF8Columns [1].

[1] http://search.cpan.org/~frew/DBIx-Class-0.08121/lib/DBIx/Class/UTF8Columns.pm#Warning_-_Native_Database_Unicode_Support

	-Marc


