[Catalyst] Re: Avoiding UTF8 in Catalyst

Mon Nov 23 16:34:20 GMT 2009

* Carl Johnstone <catalyst at fadetoblack.me.uk> [2009-11-23 15:35]:
> Aristotle Pagaltzis wrote:
> >     # everything should be bytes at this point, but just in case
> >     $response->content_length( bytes::length( $response->body ) );
> >
> > I was shocked to discover this! Any code that uses
> > bytes::length is automatically broken.
>
> Not in this case

Yes in this case.

> the HTTP spec says that the Content-Length header should
> contain the number of octets in the body. If you're sending
> UTF-8 then this is likely different to the number of characters
> in the string.

You’re right about HTTP.

But there’s no room for “likelies” here: that’s programming by
coincidence. Either you want it or you don’t, and in this case
you do. But bytes::length doesn’t do that.
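
For what it's worth, here is a minimal sketch of getting the octet count without bytes::length. The helper name octet_length and the sample text are made up for illustration; they are not part of Catalyst or of any module. The assumption is only that once the body has been explicitly encoded, plain length() already counts octets, whichever internal format the string happens to be in.

```perl
use strict;
use warnings;
use Carp ();
use Encode ();

# Hypothetical helper, not part of any module: refuses strings that
# cannot be a sequence of octets, otherwise just counts characters.
sub octet_length {
    my ($string) = @_;
    Carp::croak('string has characters above 0xFF; encode it first')
        if $string =~ /[^\x00-\xFF]/;
    return length $string;
}

my $text = "na\x{EF}ve \x{263A}";            # 7 characters
my $body = Encode::encode('UTF-8', $text);   # 10 octets
my $before = octet_length($body);

utf8::upgrade($body);                        # changes the internal format only
my $after = octet_length($body);

print "$before $after\n";                    # the same number both times
```

The croak is a deliberate choice in this sketch: a body that still contains characters above 0xFF has not been encoded yet, so there is no octet count to report.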

Please, please don’t make statements like “not in this case”
without knowing what the thing you are talking about (in this
case bytes::length) actually does. There are enough misconceptions
about Unicode in Perl already.

Try this:

    use 5.010; require utf8; require bytes; require Data::Dump;
    $a = $b = chr(0xff);
    utf8::upgrade($a);
    utf8::downgrade($b);
    say Data::Dump::pp $a, $b;
    say $a eq $b ? 'ok' : 'not ok';
    say length($a) == length($b) ? 'ok' : 'not ok';
    say bytes::length($a) == bytes::length($b) ? 'ok' : 'not ok';

It will print the following:

    ("\xFF", "\xFF")
    ok
    ok
    not ok

In other words, there are two entirely identical strings here,
their internal buffers just happen to be in different formats:
one is a packed byte array, the other is a variable-width integer
array. And then bytes::length goes and *IGNORES* which is which,
and just blithely looks at the size of the buffer without caring
about the (ill-named) UTF8 flag – even though both strings, when
printed, will produce the *exact same output*. Because they are
IDENTICAL.
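
The "exact same output" claim can be checked with a small sketch like the following. It prints an upgraded and a downgraded copy of the same one-character string to in-memory files opened with the :raw layer and compares what was written. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $upgraded   = chr(0xff);
my $downgraded = chr(0xff);
utf8::upgrade($upgraded);
utf8::downgrade($downgraded);

# Print each string to an in-memory file opened as raw bytes.
my %written;
for my $pair ( [ upgraded => $upgraded ], [ downgraded => $downgraded ] ) {
    my ( $name, $string ) = @$pair;
    open my $fh, '>:raw', \my $buffer or die $!;
    print {$fh} $string;
    close $fh or die $!;
    $written{$name} = $buffer;
}

print $written{upgraded} eq $written{downgraded}
    ? "same output\n"
    : "different output\n";
print length( $written{upgraded} ), " octet(s) written\n";
```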

In Perl, there are ONLY strings. Semantically, there are no “byte
strings and character strings”. Just strings. All strings are the
same: character sequences, where a character is an arbitrarily
large integer value. That’s *all*.

Now there are, on the level of the perl implementation, two
string formats: packed byte sequence strings (which are fast but
can only store codepoints < 0x100) and variable-width integer
sequence strings (which are slower but can store all codepoints).
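
If you want to watch the two formats in action, a rough sketch along these lines should do. It uses utf8::is_utf8 purely as a debugging peek at the internal format; the sample strings are arbitrary.

```perl
use strict;
use warnings;

my $narrow  = "caf\x{E9}";           # all codepoints < 0x100
my $wide    = $narrow . "\x{263A}";  # a codepoint >= 0x100 needs the wide format
my $trimmed = substr $wide, 0, 4;    # the same four characters as $narrow

# utf8::is_utf8 only reports the internal format; use it for debugging.
printf "narrow  flag: %d\n", utf8::is_utf8($narrow)  ? 1 : 0;
printf "wide    flag: %d\n", utf8::is_utf8($wide)    ? 1 : 0;
printf "trimmed flag: %d\n", utf8::is_utf8($trimmed) ? 1 : 0;

# Whatever the flags say, the two four-character strings are equal.
print $trimmed eq $narrow ? "equal\n" : "not equal\n";
```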

However, from the Perl level, there is NO difference between
those two kinds of string. If you have binary data in a string,
then it’s simply a string that happens to consist of characters
all < 0x100. Note how I didn’t talk about whether it’s a byte
array string or a variable-width integer string? That’s because
that doesn’t matter. Observe:

    my $jpeg = do {
        # :raw keeps the bytes untouched on platforms that translate newlines
        open my $fh, '<:raw', 'some-image.jpeg' or die $!;
        local $/;
        <$fh>;
    };

    utf8::upgrade( $jpeg ); ### <------ note here

    open my $fh, '>:raw', 'output.jpeg' or die $!;
    print $fh $jpeg;
    close $fh or die $!;

If you run this code, the end result will be two EXACTLY IDENTICAL
files. Because the contents of $jpeg mean the SAME THING after
upgrading as they did before. You cannot tell, just from looking
at a string, whether it contains binary data or text.

However, if you ask for its bytes::length( $jpeg ), you’ll get
the wrong number! Because bytes.pm is broken! As designed!
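
To put numbers on that without needing a JPEG at hand, here is a small sketch that uses a 256-character string covering every byte value as a stand-in for binary file contents. The variable names are illustrative only.

```perl
use strict;
use warnings;
require bytes;    # loaded only to demonstrate the problem

my $data = pack 'C*', 0 .. 255;    # stand-in for binary file contents
my $length_before = length $data;

utf8::upgrade($data);              # same string, different internal format

my $length_after = length $data;
my $buffer_size  = bytes::length($data);

print "length before upgrade: $length_before\n";
print "length after upgrade:  $length_after\n";
print "bytes::length:         $buffer_size\n";
```

Here length() stays at 256 both times, while bytes::length reports 384: the 128 characters from 0x80 to 0xFF each take two bytes in the upgraded buffer.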

Note that up- or downgrading a string like this will happen at
pretty random points in your code, and it won’t be obvious where
or why. It’s not actually random of course, but the point where
it happens might be hidden in some module several layers down
your call stack. It might happen only some of the time. Which is
perfectly fine, because the distinction between these two kinds
of strings is an implementation detail in perl! Just like when
you print numbers in Perl, and perl stringifies the scalar,
caches the result of that conversion in the PV slot of the
scalar, and never bothers to let you know.

Because you don’t need to know.

So it might happen that you properly Encode::encode’d your
string, but it’s passed to some routine somewhere in the guts of
some module you are using, which still causes it to get upgraded
in the course of some operation. And that’s just fine. It’s not
a bug, just like it’s not a bug that perl silently stringifies
numbers and silently numifies strings. The resulting output will
always be correct in the end because every operation knows to pay
attention to all the IOK, POK, etc flags in scalars that keep
track of these conversions.
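
A sketch of how that can play out: the string below is properly encoded, then concatenated with a string that something else has upgraded. The $prefix variable is a made-up stand-in for whatever a module several layers down might hand you.

```perl
use strict;
use warnings;
use Encode ();

my $octets = Encode::encode( 'UTF-8', "\x{E9}" );    # two octets: 0xC3 0xA9

# Stand-in for a string that some module upgraded along the way.
my $prefix = 'x';
utf8::upgrade($prefix);

my $joined = $prefix . $octets;    # concatenation upgrades the result

open my $fh, '>:raw', \my $written or die $!;
print {$fh} $joined;
close $fh or die $!;

printf "upgraded: %s\n", utf8::is_utf8($joined) ? 'yes' : 'no';
printf "length:   %d\n", length $joined;
printf "written:  %d octets\n", length $written;
```

The joined string is still three characters long and three octets go out to the file, even though its internal buffer has changed format.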

But bytes::length doesn’t! It breaks the fixed-/variable-width
abstraction by blithely ignoring the UTF8 flag. (Which should
have been named UOK, to go with the IOK, POK, etc flags that
scalars already have.) It’s as if, when you asked for the length
of the number 65, and the scalar had never been stringified
before, Perl didn’t bother to stringify it, and just looked at
the length of the IV slot (integer value), and because you are
running a 32-bit perl, the answer you got was 4. Whereas if you
had stringified the scalar, then instead the answer would be
2 because "65" is two characters long. And maybe your code is
written such that it sometimes happens to stringify the scalar
(e.g. by printing it in a diagnostic message) and sometimes not.
Then you get to play a lottery! Fun!

Conclusion of this much longer rant than I planned to write:

If you’re using bytes.pm or any of its functions, your code is
BROKEN. Unconditionally.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>