[Dbix-class] Unicode conversion problems

Jesse Sheidlower jester at panix.com
Mon Jul 5 14:09:09 GMT 2010


Summary: I have a MySQL database with data in an unknown character set,
or mixture thereof (I thought it was Unicode, but it's not). It displays
correctly when used with MySQL commandline tools under certain
configurations, but I need to figure out how to convert it to proper
Unicode. After muddling through lots of Catalyst/DBIx::Class issues, I
now realize it's at a lower level than that, and I have various test
scripts, and have asked some people for advice, but no one is sure
what's going on or how to fix it.

In my database I have this string:

Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢

On my UTF-8 terminal now, this begins with a capital "P", a capital "A"
with a tilde, and a copyright sign.

This is supposed to look like:

Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢

The second character is an "e" with an acute accent; after
"Arturo" there are characters in Cyrillic, then in Georgian,
then in Chinese.

A MySQL script to test this:

--- foo-load.sql ---

/*!40101 SET NAMES utf8 */;

DROP TABLE IF EXISTS `foo`;

CREATE TABLE `foo` (
  `author` varchar(100) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET='utf8';

INSERT INTO `foo` VALUES ('Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢');

--- end foo-load.sql ---

If I load this into a MySQL database, and then select it on the
commandline, it "works", i.e. it looks like I want it to:

---
mysql> select * from foo\G
*************************** 1. row ***************************
author: Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢
1 row in set (0.00 sec)
---

Note that if I _don't_ have the "SET NAMES" command, which indicates
what character set the client uses to send messages to the server, it
fails, even though I have the server character set as utf8 in the my.cnf
file, and I specify utf8 as the default charset in the CREATE statement.

I thought that this represented double-encoded Unicode, but
when I wrote a script to fix this with
Encode::DoubleEncodedUTF8, it did not work, or, rather, it
only partially worked:

--- foo-test.pl ---
#!/usr/bin/perl

use strict;
use warnings;

use Encode qw(:all);
use Encode::DoubleEncodedUTF8;

my $data = 'Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢';

my $fixed = decode("utf-8-de", $data); # Fix it

if (is_utf8($data)) { print "Original is Unicode!\n"; }
else { print "Original is not Unicode!\n"; }

if (is_utf8($fixed)) { print "Fixed data is Unicode!\n"; }
else { print "Fixed data is not Unicode!\n"; }

print "Data is: $data\n";
print "Fixed data is: $fixed\n";
--- end foo-test.pl ---

Output:

---
$ perl foo-test.pl
Original is not Unicode!
Fixed data is Unicode!
Data is: Pérez-Reverte, Arturo Кири́ллица ქართული  汉字 / 漢
Wide character in print at foo-test.pl line 20.
Fixed data is: Pérez-Reverte, Arturo Ки�€и́лл�¸Ñ†а ქართული  汉字 / 漢
$ 
---

Note that it gets the "e acute" right in the second character,
and some other random characters, but the rest is glop.

Clearly _something_ can figure this out, because the MySQL
commandline query can retrieve the data in the form I think
it's supposed to look.

I'd be very grateful if anyone could tell me what's going on, and more
importantly, how to fix it. I have several databases that are affected
by this, and they're failing in production.

Thanks. Would also be happy for pointers to other places to ask, since I
now know this is not a DBIC issue.

Jesse Sheidlower



More information about the DBIx-Class mailing list