[Dbix-class] Unicode conversion problems

Octavian Rasnita octavian at fcc.ro
Tue Jul 6 16:00:56 GMT 2010


Hi Jesse,

Check:

http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

or search Google for "mysql double encoding" and you'll find more step-by-step instructions.

I had the same problem. I dumped a few tables with mysqldump (not all of the tables contained badly encoded data), then re-imported them with the correct encoding. I had been using DBIx::Class::UTF8Columns, but I stopped using it and instead added the attribute mysql_enable_utf8 => 1 to the model config.
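
As a rough illustration of where that attribute goes, here is a minimal sketch of the connection settings. The DSN, user name, password and the MyApp::Schema class name are placeholders invented for this example; only mysql_enable_utf8 comes from the thread.

```perl
use strict;
use warnings;

# Hypothetical connection settings: DSN, user and password are placeholders.
my @connect_info = (
    'dbi:mysql:database=mydb;host=localhost',
    'db_user',
    'db_password',
    {
        mysql_enable_utf8 => 1,   # DBD::mysql attribute mentioned in this thread
        RaiseError        => 1,   # standard DBI attribute: die on errors
    },
);

# The same list can be handed to either connect call (not run here,
# because both need a live database):
#   my $schema = MyApp::Schema->connect(@connect_info);   # DBIx::Class
#   my $dbh    = DBI->connect(@connect_info);             # plain DBI
printf "mysql_enable_utf8 = %d\n", $connect_info[3]{mysql_enable_utf8};
```

In a Catalyst model the same attribute hash would normally sit inside connect_info in the model config.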

Octavian

----- Original Message ----- 
From: "Jesse Sheidlower" <jester at panix.com>
To: "DBIx::Class user and developer list" <dbix-class at lists.scsys.co.uk>
Sent: Tuesday, July 06, 2010 12:55 AM
Subject: Re: [Dbix-class] Unicode conversion problems


On Mon, Jul 05, 2010 at 11:02:02PM +0200, Matias E. Fernandez wrote:
> Hello Jesse
> 
> I'm pretty sure your data has been UTF-8 encoded twice. Consider this example:
> 
> use strict;
> use warnings;
> 
> use Encode;
> 
> # $string is UTF-8, but Perl doesn't know
> my $string = 'Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢';
> # $double_utf8 contains the double UTF-8 encoded string
> # note that this is an implicit ISO-8859-1 to UTF-8 conversion
> my $double_utf8 = Encode::encode('UTF-8', $string);
> 
> print "double encoded UTF-8:\n", "$double_utf8\n\n";
> 
> # let Perl believe that $double_utf8 is UTF-8
> Encode::_utf8_on($double_utf8);
> # run $double_utf8 through a UTF-8 to ISO-8859-1 conversion
> my $double_utf8_to_latin1 = Encode::decode('ISO-8859-1', $double_utf8);
> 
> print "double UTF-8 to ISO-8859-1:\n", "$double_utf8_to_latin1\n\n";

Right, that looks "correct". But this is latin1, not UTF-8,
so...

> So why is your data in the database double encoded UTF-8?
> The problem is that you're not using the mysql_enable_utf8
> option (see the DBD::mysql documentation). If you don't use
> that option as part of the call to 'connect()', DBD::mysql
> will configure the connection in a way that MySQL
> believes it's being sent ISO-8859-1. Because your table is
> configured to store character data as UTF-8, MySQL converts
> the received data from ISO-8859-1 to UTF-8. There you have
> double encoded UTF-8!

I am now, but there was a point when I hadn't been, or these
tables were first set up as latin-1, or some other screwup.
The problem is, the tables do exist now.

> The solution is simply to use mysql_enable_utf8 as part of
> the call to 'connect()'. If you're using DBIx::Class I
> recommend also disabling the mysql_auto_reconnect option,
> this will save you a lot of headache.

But that doesn't help me right now, it only helps me for the
future.

That is, I currently have data in the database, some of which
is double-encoded UTF-8. If I try to retrieve this, setting
mysql_enable_utf8 doesn't help. That is if I take my existing
data (e.g. the example I originally posted), connect to MySQL
with mysql_enable_utf8, and pull the data with a Perl script,
I still get junk.

In your above example you show how to un-double-encode the
data I have, but only by turning it into latin1, right? How do
I take my existing data and turn it into proper UTF-8, at
which point I can make sure everything is set correctly so
that I never have this problem again?

Thanks for looking at this so closely.

Jesse Sheidlower
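
On the question above of turning existing double-encoded data back into proper UTF-8: the following is only a sketch of one way to do it in Perl, assuming the value arrives as raw bytes and was encoded exactly one time too many. The helper name fix_double_utf8 is made up for this example. Note also that MySQL's "latin1" is really closer to cp1252, so data containing bytes in the 0x80-0x9F range may need 'cp1252' in place of 'ISO-8859-1' in step 2.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strip one surplus layer of UTF-8 encoding.
# Takes raw bytes, returns a Perl character string.
sub fix_double_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Step 1: decode the outer UTF-8 layer. Each resulting character is a
    # code point 0..255 that stands for one byte of the real UTF-8 data.
    my $chars = decode('UTF-8', $bytes, $check);

    # Step 2: turn those characters back into the bytes they stand for.
    # This croaks if a character is above 0xFF, i.e. the value was not
    # double encoded in the assumed way.
    my $utf8_bytes = encode('ISO-8859-1', $chars, $check);

    # Step 3: decode the real UTF-8 into a Perl character string.
    return decode('UTF-8', $utf8_bytes, $check);
}

# Reproduce the damage on a small sample, then undo it.
my $original = "P\x{e9}rez-Reverte";              # character string
my $single   = encode('UTF-8', $original);        # correct UTF-8 bytes
my $double   = encode('UTF-8', decode('ISO-8859-1', $single));  # bytes misread as Latin-1 and encoded again

my $fixed = fix_double_utf8($double);
print $fixed eq $original ? "round trip ok\n" : "mismatch\n";
```

Because the thread says only some of the data is double encoded, a sketch like this should be applied only to the affected rows. A correctly encoded value will often croak in step 2, but that is not guaranteed, so testing on a copy of the table first is advisable.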

_______________________________________________
List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/
Searchable Archive: http://www.grokbase.com/group/dbix-class@lists.scsys.co.uk



