[Dbix-class] Wrong UTF-8 handling in DBIx::Class/DBD::mysql despite mysql_enable_utf8

Alexander Hartmaier alexander.hartmaier at t-systems.at
Fri May 21 08:19:37 GMT 2010


Why not just switch to a real db like Postgres or Oracle, I'm sure that
will fix your problem as well.

--
Best regards, Alex


On Thursday, 13.05.2010 at 18:11 +0200, Matias E. Fernandez wrote:
> Hello again
>
> On 2010-05-13, at 24:19, Marc Mims wrote:
>
> >> I disagree with that. Consider this:
> >>
> >> my $string = "\x{e4}\x{f6}\x{fc}";
> >> utf8::upgrade($string);
> >>
> >> my $other_string = "\x{e4}\x{f6}\x{fc}";
> >>
> >> ok($string eq $other_string, "upgraded and not upgraded character strings are equal");
> >>
> >> Both $string and $other_string are perfectly valid Perl character strings, and
> >> they are equal. How Perl holds them internally doesn't and shouldn't matter.
> >
> > Unfortunately, it does matter.  Perl supports 2 types of strings: byte
> > strings and unicode strings.  For legacy reasons, byte-strings are
> > interpreted as latin-1. In your example, $string (after the
> > utf8::upgrade) is a unicode string. $other_string is not.  DBD::mysql
> > with mysql_enable_utf8 will be happy with $string but apparently isn't
> > happy with $other_string.
>
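A minimal sketch of the distinction being argued about, assuming only core Perl (the utf8:: functions are built in and need no "use utf8"). The variable names are illustrative. It shows that the two strings compare equal as character strings even though only one of them carries the internal UTF-8 flag:

```perl
use strict;
use warnings;

# Same three characters (a-umlaut, o-umlaut, u-umlaut), built twice.
my $upgraded = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($upgraded);            # force the internal UTF-8 representation

my $native = "\x{e4}\x{f6}\x{fc}";   # stays in the native eight-bit representation

# Character-level comparison ignores the internal representation.
my $same_chars = ($upgraded eq $native) ? 1 : 0;

# utf8::is_utf8() only reports the internal flag, not "is this text".
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;

printf "equal=%d flag(upgraded)=%d flag(native)=%d length=%d/%d\n",
    $same_chars, $flag_upgraded, $flag_native,
    length($upgraded), length($native);
```

Code that branches on the flag instead of treating both values as the same character string will see these two as different, which is the behaviour reported for $other_string in this thread.
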
> The important point is not that byte-strings are interpreted as Latin-1 in some cases, but that Perl tries
> to keep its data as eight-bit bytes for as long as possible!
>
> From perluniintro [1]:
>
> > Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of Unicode characters. The principle is that
> > Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be
> > avoided, the data is (mostly) transparently upgraded to Unicode. There are some problems--see "The Unicode
> > Bug" in perlunicode.
> >
> > Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1)
> > is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl
> > uses the native eight-bit character set. Otherwise, it uses UTF-8.
> >
> > A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it
> > becomes relevant when outputting Unicode strings to a stream without a PerlIO layer (one with the "default"
> > encoding). In such a case, the raw bytes used internally (the native character set or UTF-8, as appropriate for each
> > string) will be used, and a "Wide character" warning will be issued if those strings contain a character beyond
> >  0x00FF.
>
> Note the first sentence of the third paragraph: "A user of Perl does not normally need to know nor care how Perl
> happens to encode its internal strings"! I really recommend reading the whole chapter and having a look at
> the examples! The problem arises because DBD::mysql sends data out without using a PerlIO layer, otherwise
> there would be no problem with $other_string! Again, a user should never have to mess around with Perl internals
> such as utf8::upgrade().
>
> I repeat that it is incorrect to skip encoding data if you want to send UTF-8! Suppose a UTF-8 shell and the
> following example:
>
> my $var = "\x{fc}bercool \x{263a}";
> print $var,"\n";
>
> Perl will issue a warning "Wide character in print at -e line 1." although everything looks fine in the terminal.
> The correct way in this situation would be something like this:
>
> my $var = "\x{fc}bercool \x{263a}";
> binmode(STDOUT, ":encoding(UTF-8)");
> print $var,"\n";
>
> or
>
> use Encode;
> my $var = "\x{fc}bercool \x{263a}";
> print Encode::encode("UTF-8", $var),"\n";
>
> Everything works as expected and there are no warnings. Consider the following:
>
> my $a_string = "\x{fc}bercool :-)";                         # "übercool :-)"
> my $another_string = "\x{fc}bercool \x{263a}";  # "übercool ☺"
>
> Does it really make sense that a library does not work as expected with $a_string, but does so with $another_string?
> I think it doesn't! Using the \x{...} notation is absolutely okay, it only has implications if a library you use doesn't
> respect Perl's Unicode model!
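A rough sketch of the point above, assuming the core Encode module. to_wire_octets() is a hypothetical helper invented for illustration, not code from DBD::mysql or DBIx::Class. It shows that explicit encoding yields the same UTF-8 octets whether or not the string happens to be upgraded internally:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode a character string explicitly before it
# leaves the program, instead of relying on its internal representation.
sub to_wire_octets {
    my ($chars) = @_;
    return Encode::encode('UTF-8', $chars);
}

my $a_string       = "\x{fc}bercool :-)";        # all code points <= 0xFF
my $another_string = "\x{fc}bercool \x{263a}";   # contains U+263A

# Same characters as $a_string, but with the internal UTF-8 flag set.
my $a_upgraded = $a_string;
utf8::upgrade($a_upgraded);

my $octets_native   = to_wire_octets($a_string);
my $octets_upgraded = to_wire_octets($a_upgraded);
my $octets_wide     = to_wire_octets($another_string);

print "identical octets for both representations\n"
    if $octets_native eq $octets_upgraded;
```

With this approach $a_string and $another_string are handled the same way, and callers never need utf8::upgrade().
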
>
> > I was just trying to be helpful.  Like I said, I'm no unicode expert.
>
> Thank you very much. I'm trying to be helpful too by pointing at an existing and real problem.
>
> Regards
> Matias E. Fernandez
>
> [1] http://perldoc.perl.org/5.12.0/perluniintro.html#Perl's-Unicode-Model
