[Catalyst] Patch for Catalyst::Plugin::Unicode::Encoding

Bill Moseley moseley at hank.org
Wed Mar 19 14:32:27 GMT 2008


On Wed, Mar 19, 2008 at 12:36:42AM -0500, Jonathan Rockway wrote:
> * On Wed, Mar 19 2008, Jonathan Rockway wrote:
> 
> > We should not need to check the flag.  The incoming data should be
> > encoded.  Then we should decode it.  Then we should not try to decode it
> > again.
> 
> A key thing I forgot to mention is that "is_utf8" doesn't mean "we tried
> to decode this already".  It means that the internal representation of
> the string is utf8.

Please don't say that. ;)

I think the fact that people know that Perl's *internal* encoding is
utf8 causes a lot of confusion.  I wish is_utf8 was aliased as
"is_decoded" or "is_characters" and all we knew was that Perl uses
some kind of hidden internal character representation.


This is an important plugin, and it is often cited as something to
use.  I'd argue that it should be part of Catalyst core and not a
plugin -- when do we not work with octets externally and characters
internally?

I think there are a number of problems with the plugin:


To be clear, the problem I posted about was the plugin *decoding*
already decoded content.  If the string already holds wide (decoded)
characters, Encode will die with:

    Cannot decode string with wide characters at ...
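
For illustration, here's a throwaway one-off (not plugin code) that
reproduces that error by decoding twice:

    use Encode qw(decode_utf8);

    # UTF-8 octets for the EURO SIGN (U+20AC), as they might arrive
    # in a request body
    my $octets = "\xe2\x82\xac";

    my $chars = decode_utf8($octets);  # first decode: characters, flag on
    my $again = decode_utf8($chars);   # second decode dies as above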

But, you ask, why would the plugin see already decoded content?

This can happen if the body was parsed with, for example, Expat or
XML::LibXML -- or there's some other custom body parser that needs to
work in decoded characters before this plugin runs.
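
For example, XML::LibXML hands back character strings (a throwaway
illustration; the payload here is made up):

    use XML::LibXML;
    use Encode ();

    # UTF-8 octets, as an XML body would arrive on the wire
    my $octets = qq{<?xml version="1.0" encoding="UTF-8"?>}
               . qq{<name>caf\xc3\xa9</name>};

    my $doc  = XML::LibXML->new->parse_string($octets);
    my $name = $doc->documentElement->textContent;

    # the parser returns decoded characters, so the flag is already set
    print Encode::is_utf8($name) ? "decoded characters\n" : "octets\n";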

So, it's very possible that the plugin would see already decoded
content and thus should not attempt to decode again.


Unless someone does something stupid and forces the utf8 flag true,
the utf8 flag indicates that the content has been decoded. That's why
it makes perfect sense to test it before decoding to avoid the error.

True, the utf8 flag does not mean we have non-ascii or wide
characters.

    $ perl -MEncode -wle \
        'print utf8::is_utf8( Encode::decode_utf8( "foo" ))
            ? "Is utf8\n"
            : "Is not utf8\n"'

    Is utf8

But it does mean we don't have to call decode again.

Even if decoding ascii-only chars as utf8 did NOT set the utf8 flag,
decoding them as utf8 a second time would not be a problem (thanks to
ascii mapping cleanly into utf8).
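
A quick throwaway check of that claim (default check mode assumed):

    use Encode qw(decode_utf8);

    my $once  = decode_utf8("foo");   # flag on, but still plain ASCII
    my $twice = decode_utf8($once);   # harmless: ASCII is valid UTF-8
    print $once eq $twice ? "same\n" : "different\n";   # prints "same"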

Testing the utf8 flag before decoding is the fix.


Here's another thing this plugin does incorrectly.  The
current code is:

    sub prepare_parameters {
        my $c = shift;

        $c->NEXT::prepare_parameters;

        for my $value ( values %{ $c->request->{parameters} } ) {

        [...]


It should be decoding *body_parameters*.

In other words, the result of that code above is:

    length $c->req->parameters->{foo}
    != length $c->req->body_parameters->{foo}.

(or one parameter is utf8-flagged characters and one is a byte
string).
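
A standalone illustration of why those lengths disagree when only one
copy has been decoded (example data made up):

    use Encode qw(decode_utf8);

    my $octets = "\xe2\x82\xac";        # EURO SIGN as three UTF-8 octets
    my $chars  = decode_utf8($octets);  # one character

    print length($octets), "\n";        # 3
    print length($chars),  "\n";        # 1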


So, the fix is to override prepare_body_parameters to decode the
body params; $c->engine->prepare_parameters will then merge the
decoded body params with the query params:


sub prepare_body_parameters {
    my $c = shift;

    # Read in the body params via the engine
    $c->NEXT::prepare_body_parameters;

    for my $value ( values %{ $c->request->body_parameters } ) {

        # This breaks decoding if a param is a hash
        # and Data::Visitor will return hash *keys* so that's no help
        next if ref $value && ref $value ne 'ARRAY';

        for ( ref( $value ) ? @{$value} : $value ) {

            # Don't decode if already decoded.
            next if Encode::is_utf8( $_ );

            $_ = decode_utf8( $_, $CHECK );
        }
    }

    return;
}

Now, the query parameters need fixing in the same way.
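
Untested, but presumably something along these lines, mirroring the
prepare_body_parameters override above:

sub prepare_query_parameters {
    my $c = shift;

    # Read in the query params via the engine
    $c->NEXT::prepare_query_parameters;

    for my $value ( values %{ $c->request->query_parameters } ) {

        # Same hash-ref caveat as above
        next if ref $value && ref $value ne 'ARRAY';

        for ( ref( $value ) ? @{$value} : $value ) {

            # Don't decode if already decoded.
            next if Encode::is_utf8( $_ );

            $_ = decode_utf8( $_, $CHECK );
        }
    }

    return;
}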

Where the code to decode body and query parameters should go is
debatable.  In some ways I think it should be in the Engine as that's
what is bringing the data into the application.  

Maybe the decoding could happen in Engine's prepare_parameters as it's
looking at both the body and query params -- but that runs the risk of
something using the query params before they are decoded.


Finally, the plugin does this:

    unless ( $c->response->content_type =~ /^text|xml$|javascript$/ ) {
        return $c->NEXT::finalize;
    }

    unless ( Encode::is_utf8( $c->response->body ) ) {
        return $c->NEXT::finalize;
    }

    $c->response->body( $c->encoding->encode( $c->response->body, $CHECK ) );

It's a bit less clear whether that's correct or not.  The is_utf8
test does serve a purpose, but I think it's implemented wrong.

Here's an example where that might be a "fix": your normal encoding
is utf8, but you also return XML, and you have a client that, for
some reason, insists on a different encoding.  So you encode the body
yourself (clearing the utf8 flag) before the plugin sees it, and the
is_utf8 test above then keeps the plugin from re-encoding the
already-encoded byte string.
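
That works because encoding drops the utf8 flag (throwaway
illustration):

    use Encode ();

    my $chars  = "caf\x{e9}";                             # character string
    my $octets = Encode::encode( 'iso-8859-1', $chars );  # pre-encoded body

    print Encode::is_utf8($octets) ? "flag on\n" : "flag off\n";  # "flag off"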

Of course, what should really happen is that the output encoding is
determined by looking at the Accept-Charset header, and perhaps a 406
returned if a suitable encoding can't be found.
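
Something like this hypothetical sketch (not a patch -- it ignores
q-values and assumes utf8 is the only encoding we can produce), in
place of the unconditional encode in finalize:

my $accept = $c->request->header('Accept-Charset');

if ( !defined $accept || $accept =~ /utf-8|\*/i ) {
    # keep the existing guard against double-encoding
    $c->response->body( Encode::encode_utf8( $c->response->body ) )
        if Encode::is_utf8( $c->response->body );
}
else {
    # Nothing we can produce is acceptable to the client
    $c->response->status( 406 );
    $c->response->body( '' );
}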


-- 
Bill Moseley
moseley at hank.org



