[Dbix-class] Unicode conversion problems

Mon Jul 5 23:18:42 GMT 2010

Hello Jesse

> Right, that looks "correct". But this is latin1, not UTF-8,
> so...

No, I think I lost you halfway. Look at the example carefully:

First you have character data encoded as UTF-8 (my $string).
You then run that already UTF-8 encoded character data through
an ISO-8859-1 to UTF-8 conversion (resulting in my $double_utf8).
What you get is _double_ encoded character data. Did you
notice that I said "_double_"? ;-)
You then run this _double_ encoded character data through a UTF-8
to ISO-8859-1 conversion (my $double_utf8_to_latin1), which
results in single UTF-8 encoded data.

Let's recap:

1. start with UTF-8 encoded data
2. run UTF-8 encoded data through ISO-8859-1 to UTF-8 conversion
3. run double UTF-8 encoded data through UTF-8 to ISO-8859-1 conversion

and in the end we have single UTF-8 encoded data!
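
The recap can be checked with a short script. This is only a sketch:
it assumes "ä" as the sample character, reuses the variable names from
the example, and uses Encode::from_to, which converts octets in place.

```perl
use strict;
use warnings;
use Encode ();

# step 1: "ä" (U+00E4) as UTF-8 encoded octets: 0xC3 0xA4
my $string = "\xC3\xA4";

# step 2: treat the octets as ISO-8859-1 and convert them to UTF-8
my $double_utf8 = $string;
Encode::from_to($double_utf8, 'ISO-8859-1', 'UTF-8');

# step 3: convert the double encoded octets from UTF-8 to ISO-8859-1
my $double_utf8_to_latin1 = $double_utf8;
Encode::from_to($double_utf8_to_latin1, 'UTF-8', 'ISO-8859-1');

# hex dumps of the octets after step 2 and step 3
my $hex_double = unpack('H*', $double_utf8);
my $hex_single = unpack('H*', $double_utf8_to_latin1);
print "double: $hex_double, single again: $hex_single\n";
```

After step 3 the octets are identical to the ones in $string.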

Now, because we are 100% sure that the scalar $double_utf8_to_latin1
is holding valid UTF-8 encoded character data, it would be a good
idea to let Perl know that we have a Unicode string:

my $unicode = Encode::decode('UTF-8', $double_utf8_to_latin1);

Note that, for valid UTF-8 input, all this effectively does is turn on Perl's internal UTF8 flag.
Alternatively, after having read and understood all of the Perl 
documentation on Unicode, you could just issue:

Encode::_utf8_on($double_utf8_to_latin1);

and then use $double_utf8_to_latin1 as your scalar holding
the Unicode character data.
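
To see what the decoding step changes, here is a small sketch. It
assumes "ä" as sample data; the variable names are only for
illustration. It also shows one difference between the two approaches:
Encode::decode can be asked to reject invalid input, while
Encode::_utf8_on performs no check at all.

```perl
use strict;
use warnings;
use Encode ();

my $octets  = "\xC3\xA4";                        # "ä" as UTF-8 octets
my $unicode = Encode::decode('UTF-8', $octets);  # character string

my $octet_len = length $octets;     # counted in octets
my $char_len  = length $unicode;    # counted in characters
my $flagged   = utf8::is_utf8($unicode) ? 1 : 0;

# A lone 0xE4 is ISO-8859-1 for "ä", but it is not valid UTF-8.
# With FB_CROAK, decode() dies instead of silently accepting it.
my $bad = "\xE4";
my $decoded_ok = eval {
    Encode::decode('UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 1 : 0;

print "octets: $octet_len, characters: $char_len, ",
      "flag: $flagged, invalid input accepted: $decoded_ok\n";
```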

> I am now, but there was a point when I hadn't been, or these
> tables were first set up as latin-1, or some other screwup.
> The problem is, the tables do exist now.

True. And how are you going to fix the situation?

>> The solution is simply to use mysql_enable_utf8 as part of
>> the call to 'connect()'. If you're using DBIx::Class I
>> recommend also disabling the mysql_auto_reconnect option,
>> this will save you a lot of headache.
> 
> But that doesn't help me right now, it only helps me for the
> future.

Of course it does, you should start fixing your code asap.
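
For reference, a minimal sketch of such a connect() call. The DSN, the
environment variable names and the schema class My::Schema are
placeholders I made up; only mysql_enable_utf8 and mysql_auto_reconnect
are the actual DBD::mysql attributes discussed here.

```perl
use strict;
use warnings;

# Hypothetical connection details, replace with your own.
my $dsn  = 'DBI:mysql:database=mydb;host=localhost';
my %attr = (
    RaiseError           => 1,
    mysql_enable_utf8    => 1,  # exchange text with the server as UTF-8
    mysql_auto_reconnect => 0,  # see the recommendation for DBIx::Class
);

# Connect only when credentials are supplied through the (hypothetical)
# environment variables DB_USER and DB_PASS.
if (defined $ENV{DB_USER}) {
    require DBI;
    my $dbh = DBI->connect($dsn, $ENV{DB_USER}, $ENV{DB_PASS}, \%attr);

    # With DBIx::Class the same attributes go to the schema's connect():
    # my $schema = My::Schema->connect(
    #     $dsn, $ENV{DB_USER}, $ENV{DB_PASS}, \%attr);
}
```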

> That is, I currently have data in the database, some of which
> is double-encoded UTF-8. If I try to retrieve this, setting
> mysql_enable_utf8 doesn't help. That is if I take my existing
> data (e.g. the example I originally posted), connect to MySQL
> with mysql_enable_utf8, and pull the data with a Perl script,
> I still get junk.

True. What you need to do, apart from fixing your code, is to
fix the encoding in the database itself. There are several
possibilities to accomplish this.
One would be to use a Perl script to read the data from your
broken database, convert it and then write it into a new and
correctly set up database.
Another one would be to use:

mysqldump --default-character-set=utf8

and convert that dump using iconv:

iconv -f UTF-8 -t ISO-8859-1

and, as a last step, load the converted data back into your
database.

Last but not least you could try using MySQL's CONVERT [1] 
function. I'm sure you'll find recipes on how to convert table 
encodings using CONVERT.

> In your above example you show how to un-double-encode the
> data I have, but only by turning it into latin1, right?

No, not quite! I showed how to get double UTF-8 encoded
data by running the character data, that was already UTF-8
encoded, through an ISO-8859-1 to UTF-8 conversion.

I then ran that double UTF-8 encoded data through a
UTF-8 to ISO-8859-1 conversion, to demonstrate why your
MySQL client is displaying correct results.

> How do
> I take my existing data and turn it into proper UTF-8, at
> which point I can make sure everything is set correctly so
> that I never have this problem again?

See one of the 3 aforementioned alternatives.
In Perl you could write something like this:

use Encode;

my $double_utf8 = "whatever"; # comes from your broken database

# let Perl believe that $double_utf8 is UTF-8
# this hack is only for advanced Perl users, handle with care
Encode::_utf8_on($double_utf8);
# run $double_utf8 through a UTF-8 to ISO-8859-1 conversion
# $utf8 will have valid UTF-8 data
my $utf8 = Encode::encode('ISO-8859-1', $double_utf8);
# let perl know that we have Unicode data
my $unicode = Encode::decode('UTF-8', $utf8);
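
If only some of your rows are double encoded, a more defensive variant
might be useful. The following is a sketch, not a tested migration
tool: fix_double_utf8 is a helper name I made up, and it relies on the
heuristic that a value is double encoded if undoing one layer still
yields valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: takes octets, returns a character string with
# one layer of ISO-8859-1 to UTF-8 double encoding undone, if present.
sub fix_double_utf8 {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Not valid UTF-8 at all: return the input unchanged.
    my $chars = eval { Encode::decode('UTF-8', $octets, $check) };
    return $octets unless defined $chars;

    # Characters beyond ISO-8859-1 cannot come from double encoding.
    my $inner = eval { Encode::encode('ISO-8859-1', $chars, $check) };
    return $chars unless defined $inner;

    # If the inner octets are valid UTF-8 again, one layer was extra.
    my $fixed = eval { Encode::decode('UTF-8', $inner, $check) };
    return defined $fixed ? $fixed : $chars;
}

my $from_double = fix_double_utf8("\xC3\x83\xC2\xA4");  # double encoded
my $from_single = fix_double_utf8("\xC3\xA4");          # single encoded
```

Be aware that the heuristic is ambiguous: text that legitimately
contains a sequence such as "Ã¤" would be converted as well, so check
the results before writing anything back.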

And remember to encode correctly again before writing
to STDOUT or a file (and avoid "Wide character" warnings):

binmode(STDOUT, ":encoding(UTF-8)");
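
The same layer works for files. As a small sketch (assuming "ä" as
sample data), writing to an in-memory file lets you inspect the octets
that actually get written:

```perl
use strict;
use warnings;

my $unicode = "\x{E4}";    # "ä" as a one-character string

# Write through an encoding layer into an in-memory file, so that
# the resulting octets can be inspected afterwards.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "cannot open in-memory file: $!";
print {$fh} $unicode;
close($fh);

# hex dump of what was written
my $hex = unpack('H*', $buffer);
print "written octets: $hex\n";
```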

> Thanks for looking at this so closely.

You are welcome!

Regards
Matias E. Fernandez

[1] http://dev.mysql.com/doc/refman/5.1/en/charset-convert.html