[Catalyst] UTF8 problems with plugin::encoding

Roman Jurkov winfinit at gmail.com
Tue Jul 22 13:34:54 GMT 2014


Bernhard,

To stop the exception, you can modify $CHECK in Catalyst::Plugin::Unicode::Encoding by removing "FB_CROAK". Without that flag decode() no longer throws, so the request goes through, but the content will not be decoded correctly: bytes that are not valid UTF-8 come back as replacement characters (U+FFFD). Since the client here is a spider, I don't know if that matters to you.

You can just add this to your Catalyst application:

use Catalyst::Plugin::Unicode::Encoding;
$Catalyst::Plugin::Unicode::Encoding::CHECK = Encode::LEAVE_SRC;
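
A minimal sketch of what that setting means at the Encode level, assuming the plugin hands $CHECK straight to decode() (as the patched snippet quoted further down suggests). With only LEAVE_SRC set, input that is not valid UTF-8 is not rejected; the offending bytes are replaced with U+FFFD. The test string is the one used elsewhere in this thread, written with code point escapes so the sketch does not depend on the source file encoding:

```perl
use strict;
use warnings;
use Encode ();

# "深入 so what" as characters, then as GB2312 (euc-cn) bytes
my $str = "\x{6DF1}\x{5165} so what";
my $oct = Encode::encode('euc-cn', $str);

my $enc = Encode::find_encoding('UTF-8');

# LEAVE_SRC without FB_CROAK: malformed bytes become U+FFFD, no exception
my $res = $enc->decode($oct, Encode::LEAVE_SRC());

# Count the replacement characters that ended up in the decoded string
my $replaced = () = $res =~ /\x{FFFD}/g;
printf "replacement characters: %d\n", $replaced;
printf "ascii tail intact: %s\n", ($res =~ / so what\z/ ? 'yes' : 'no');
```

The ASCII part survives, the Chinese part does not, which is the "will not decode content correctly" trade-off mentioned above.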

A more isolated test that illustrates the issue:

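use strict;
use warnings;
use utf8;   # the string literal below is UTF-8 source text

use Encode qw(decode encode);

# FB_CROAK makes decode() die on malformed input
our $CHECK = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $str = '深入 so what';
my $oct = encode("euc-cn", $str);   # GB2312 bytes, not valid UTF-8

my $obj = Encode::find_encoding('UTF-8');
my $res = $obj->decode($oct, $CHECK);   # dies here
warn $res;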
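
If you would rather keep FB_CROAK and decide yourself what happens on bad input, one option is to trap the exception around the decode call. This is only a sketch: try_decode_utf8 is a hypothetical helper name, not something Catalyst or the plugin provides, and where you would hook it in depends on your application.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Catalyst): strict UTF-8 decode that
# returns undef instead of dying, so the caller can choose a response.
sub try_decode_utf8 {
    my ($octets) = @_;
    my $enc   = Encode::find_encoding('UTF-8');
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { $enc->decode($octets, $check) };
    return $chars;    # undef when the octets are not valid UTF-8
}

# Same text encoded two ways: once as UTF-8, once as GB2312
my $good = Encode::encode('UTF-8',  "\x{6DF1}\x{5165} so what");
my $bad  = Encode::encode('euc-cn', "\x{6DF1}\x{5165} so what");

for my $octets ($good, $bad) {
    my $chars = try_decode_utf8($octets);
    print defined $chars ? "decoded ok\n" : "not valid UTF-8\n";
}
```

The first string should decode and the second should not; what to do with the undef case (ignore the parameter, answer 400, log less noisily) is left to the caller.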


-roman

On Jul 22, 2014, at 7:31 AM, Mark Ellis <m at rkellis.com> wrote:

> I don't think there's anything you can do. Your app wants UTF-8 and they're sending something else which doesn't map. Since you can't know what encoding it is in, all you can do is die if it doesn't map, which is what the plugin does.
> 
> As far as I can tell, the Ruby middleware I found handles this by returning a 400 Bad Request, which Catalyst does as well. So there's no effect other than the noise in the logs.
> 
> 
> On 22 July 2014 11:21, Bernhard Bauch <bauch at zsi.at> wrote:
> here's also a perl script that triggers it
> 
> ------------------------------------------
> use utf8;   # so the literal below is read as characters, not raw bytes
> use Encode qw(decode encode);
> use LWP::UserAgent;
> 
> my $str = '深入 so what';
> my $oct = encode("gb2312", $str);
> my $url = 'http://wbc-inco.net/object/event/past';
> my $ua       = LWP::UserAgent->new();
> my $response = $ua->post( $url, { $oct => $oct } );
> my $content  = $response->decoded_content();
> ------------------------------------------
> 
> On 22 Jul 2014, at 11:33, Bernhard Bauch <bauch at zsi.at> wrote:
> 
>> hey all,
>> 
>> this python3 script triggers the error ...
>> 
>> --------------------------------
>> import httplib2
>> import urllib.parse
>> 
>> somestr = '深入 so what'
>> encodedstr = somestr.encode('gb2312')
>> url = 'http://myappdomain.com/search'   
>> body = { encodedstr:encodedstr }
>> headers = {
>>     'Content-type': 'application/x-www-form-urlencoded', 
>>     'Accept': 'text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1',
>>     'Accept-Encoding': 'gzip, deflate',
>>     'Accept-Language': 'zh;q=0.9,en;q=0.8'
>> }
>> http = httplib2.Http()
>> response, content = http.request(url, 'POST', headers=headers, body=urllib.parse.urlencode(body))
>> ————————————————
>> 
>> now it's possible to reproduce the error :)
>> 
>> any ideas how to solve this?
>> the ruby people did this by adding a utf8 sanitizer in the middleware.
>> 
>> bye, bernhard
>> 
>> 
>> On 21 Jul 2014, at 22:19, Bernhard Bauch <bauch at zsi.at> wrote:
>> 
>>> more news..
>>> 
>>> the crawler/search engine that triggers these errors is
>>> 	http://easou.com
>>> 
>>> this search engine delivers its pages not in UTF-8 but in "gb2312", which is Simplified Chinese.
>>> if i decode the "wrong utf8" parameters from the faulty requests as "gb2312", some readable characters appear.
>>> this leads me to: catalyst does not handle requests with gb2312-encoded parameters (because they are not utf8), and the request does not declare that it is encoded in anything other than utf8.
>>> 
>>> any ideas what to do ?
>>> 
>>> bye, bernhard
>>> 
>>> 
>>> 
>>> On 21 Jul 2014, at 14:36, Roman Winfinit <winfinit at gmail.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> How are you running your application? Ie: mod_perl, fcgi, fcgi + httpd/nginx, plack + ... also what version of perl are you using and what os?
>>>> 
>>>> -roman
>>>> 
>>>> On Jul 21, 2014 6:58 AM, "Bernhard Bauch" <bauch at zsi.at> wrote:
>>>> Hey all,
>>>> 
>>>> on most of my websites running on the latest catalyst (5.90065) i keep getting utf8-related errors.
>>>> they usually appear when a spider
>>>> 	Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)
>>>> comes across.
>>>> 
>>>> the error is:
>>>> 	Caught exception in engine "UTF8 Error: utf8 "\x98" does not map to Unicode at /usr/local/…./lib/perl5/Catalyst/Plugin/Unicode/Encoding.pm line 167.
>>>> 
>>>> It took me a while to get the actual parameters the spider sends, because the debug messages of catalyst do not tell that much:
>>>> 
>>>> —————————————
>>>> [2014/07/16 15:08:47] [5.255.253.218] [INFO] vim /usr/local/…./lib/perl5/Catalyst.pm +2016: *** Request 164 (0.032/s) [10682] [Wed Jul 16 15:08:47 2014] ***
>>>> [2014/07/16 15:08:47] [5.255.253.218] [DEBUG] vim /usr/local/…./lib/perl5/Catalyst.pm +2309: Response Code: 400; Content-Type: text/plain; charset=UTF-8; Content-Length: unknown
>>>> [2014/07/16 15:08:47] [5.255.253.218] [INFO] vim /usr/local/.../lib/perl5/Catalyst.pm +1880: Request took 0.006491s (154.059/s)
>>>> .---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------.
>>>> | Action                                                                                                                                                                                            | Time      |
>>>> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
>>>> '---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------'
>>>> —————————————
>>>> 
>>>> i changed the Plugin::Unicode::Encoding plugin a bit to find out what the client sends. the results are these:
>>>> UTF8 trash arrives, and the module seems unable to deal with it...
>>>> 
>>>> ————————————
>>>> Caught exception in engine "UTF8 Error: utf8 "\x98" does not map to Unicode at /usr/local/…../lib/perl5/Catalyst/Plugin/Unicode/Encoding.pm line 170.
>>>>  -
>>>> 
>>>> URL: notice/list
>>>> 
>>>> PARAMS:$VAR1 = {
>>>>           'X*Ö^K^@^@^@^@¸®ä
>>>> ^@^@^@^@8<83>^H^K^@^@^@^@h¡ä
>>>> ^@^@^@^@Hµä
>>>> ^@^@^@^@X^Z^N^Q^@^@^@^@ø<91>^F^Q^@^@^@^@Ø^F^N^Q^@^@^@^@¸<92>^F^Q^@^@^@^@(^K^N^Q^@^@^@^@<88>^B^N^Q^@^@^@^@¸úÝ^P^@^@^@^@^X%q^G^@^@^@^@اñ^O^@^@^@^@ØøB.^@^@^@^@èâÝ^P^@^@^@^@XÛ_^L^@^@^@^@ÈíÝ^P^@^@^@^@¸~P^S^@^@^@^@èåÝ^P^@^@^@^@Øný^O^@^@^@^@<88>úÝ^P^@^@^@^@^Xá( ^@^@^@^@ئÆ
>>>> ^@^@^@^@Øï*^Q^@^@^@^@^X' => '^F^L^@^@^@^@<98>Ûø^O^@^@^@^@Ø~^A^N^@^@^@^@<98>=H>^@^@^@^@ø<99>ó^K^@^@^@^@hÔu^R^@^@^@^@¸<8e>ó^K^@^@^@^@^Xä_^L^@^@^@^@Ø<90>a^G^@^@^@^@hðÉ^O^@^@^@^@8ã*^G^@^@^@^@ØØý^M^@^@^@^@Xùë^F^@^@^@^@^HÜý^M^@^@^@^@8W6^H^@^@^@^@øÐý^M^@^@^@^@xÿÃ^K^@^@^@^@X]i^O^@^@^@^@8^Mÿ^H^@^@^@^@Xû<98>^Q^@^@^@^@x¦h^H^@^@^@^@Xý<98>^Q^@^@^@^@^X=5^H^@^@^@^@^X¦ú^K^@^@^@^@^XVQ^P^@^@^@^@^H^Yû^N^@^@^@^@x¤h^H^@^@^@^@^Xå<98>^Q^@^@^@^@ø¤h^H^@^@^@^@Xé<98>^Q^@^@^@^@X¼h^H^@^@^@^@Ø¡h^H^@^@^@^@øf<82>^Q^@^@^@^@^X>éH^@^@^@^@xv<82>^Q^@^@^@^@X6éH^@^@^@^@xl<82>^Q^@^@^@^@83Ì^G^@^@^@^@Xl<82>^Q^@^@^@^@¸Ñý^M^@^@^@^@xr<82>^Q^@^@^@^@H[^H^Q^@^@^@^@^X|<82>^Q^@^@^@^@¸Ë¢^K^@^@^@^@¸u<82>^Q^@^@^@^@<98>Á¢^K^@^@^@^@Øp<82>^Q^@^@^@^@8Í¢^K^@^@^@^@Øl<82>^Q^@^@^@^@XË¢^K^@^@^@^@Xq<82>^Q^@^@^@^@^Xi^W^H^@^@^@^@Xc<82>^Q^@^@^@^@¸Å¢^K^@^@^@^@8h<82>^Q^@^@^@^@<98>Т^K^@^@^@^@¨fÐ^Q^@^@^@^@ØÉ=^R^@^@^@^@ÀC<95>^M^@^@^@^@°S<95>^M^@^@^@^@^PI<95>^M^@^@^@^@À\\<95>^M^@^@^@^@ðE<95>^M^@^@^@^@<80>B<95>^M^@^@^@^@@P<95>^M^@^@^@^@<80>Q<95>^M^@^@^@^@ J<95>^M^@^@^@^@p\\<95>^M^@^@^@^@àU<95>^M^@^@^@^@àF<95>^M^@^@^@^@àA<95>^M^@^@^@^@^@<9e>ô^P^@^@^@^@°<9d>ô^P^@^@^@^@0<91>ô^P^@^@^@^@ <9e>ô^P^@^@^@^@^P<8e>ô^P^@^@^@^@ <88>ô^P^@^@^@^@Ð<82>ô^P^@^@^@^@ <8d>ô^P^@^@^@^@<90><95>ô^P^@^@^@^@à<90>ô^P^@^@^@^@@<95>ô^P^@^@^@^@P<8f>ô^P^@^@^@^@<90><81>ô^P^@^@^@^@ <97>ô^P^@^@^@^@Ð<8c>ô^P^@^@^@^@p<88>ô^P^@^@^@^@P<99>ô^P^@^@^@^@<90><90>ô^P^@^@^@^@@<9a>ô^P^@^@^@^@0<9b>ô^P^@^@^@'
>>>>         };
>>>> 
>>>> 
>>>>  // value: $VAR1 = '^F^L^@^@^@^@<98>Ûø^O^@^@^@^@Ø~^A^N^@^@^@^@<98>=H>^@^@^@^@ø<99>ó^K^@^@^@^@hÔu^R^@^@^@^@¸<8e>ó^K^@^@^@^@^Xä_^L^@^@^@^@Ø<90>a^G^@^@^@^@hðÉ^O^@^@^@^@8ã*^G^@^@^@^@ØØý^M^@^@^@^@Xùë^F^@^@^@^@^HÜý^M^@^@^@^@8W6^H^@^@^@^@øÐý^M^@^@^@^@xÿÃ^K^@^@^@^@X]i^O^@^@^@^@8^Mÿ^H^@^@^@^@Xû<98>^Q^@^@^@^@x¦h^H^@^@^@^@Xý<98>^Q^@^@^@^@^X=5^H^@^@^@^@^X¦ú^K^@^@^@^@^XVQ^P^@^@^@^@^H^Yû^N^@^@^@^@x¤h^H^@^@^@^@^Xå<98>^Q^@^@^@^@ø¤h^H^@^@^@^@Xé<98>^Q^@^@^@^@X¼h^H^@^@^@^@Ø¡h^H^@^@^@^@øf<82>^Q^@^@^@^@^X>éH^@^@^@^@xv<82>^Q^@^@^@^@X6éH^@^@^@^@xl<82>^Q^@^@^@^@83Ì^G^@^@^@^@Xl<82>^Q^@^@^@^@¸Ñý^M^@^@^@^@xr<82>^Q^@^@^@^@H[^H^Q^@^@^@^@^X|<82>^Q^@^@^@^@¸Ë¢^K^@^@^@^@¸u<82>^Q^@^@^@^@<98>Á¢^K^@^@^@^@Øp<82>^Q^@^@^@^@8Í¢^K^@^@^@^@Øl<82>^Q^@^@^@^@XË¢^K^@^@^@^@Xq<82>^Q^@^@^@^@^Xi^W^H^@^@^@^@Xc<82>^Q^@^@^@^@¸Å¢^K^@^@^@^@8h<82>^Q^@^@^@^@<98>Т^K^@^@^@^@¨fÐ^Q^@^@^@^@ØÉ=^R^@^@^@^@ÀC<95>^M^@^@^@^@°S<95>^M^@^@^@^@^PI<95>^M^@^@^@^@À\\<95>^M^@^@^@^@ðE<95>^M^@^@^@^@<80>B<95>^M^@^@^@^@@P<95>^M^@^@^@^@<80>Q<95>^M^@^@^@^@ J<95>^M^@^@^@^@p\\<95>^M^@^@^@^@àU<95>^M^@^@^@^@àF<95>^M^@^@^@^@àA<95>^M^@^@^@^@^@<9e>ô^P^@^@^@^@°<9d>ô^P^@^@^@^@0<91>ô^P^@^@^@^@ <9e>ô^P^@^@^@^@^P<8e>ô^P^@^@^@^@ <88>ô^P^@^@^@^@Ð<82>ô^P^@^@^@^@ <8d>ô^P^@^@^@^@<90><95>ô^P^@^@^@^@à<90>ô^P^@^@^@^@@<95>ô^P^@^@^@^@P<8f>ô^P^@^@^@^@<90><81>ô^P^@^@^@^@ <97>ô^P^@^@^@^@Ð<8c>ô^P^@^@^@^@p<88>ô^P^@^@^@^@P<99>ô^P^@^@^@^@<90><90>ô^P^@^@^@^@@<9a>ô^P^@^@^@^@0<9b>ô^P^@^@^@';
>>>> 
>>>> 
>>>> headers: Connection: close
>>>> Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
>>>> Accept-Encoding: gzip, deflate
>>>> Accept-Language: zh;q=0.9,en;q=0.8
>>>> Host: wbc-inco.net
>>>> User-Agent: Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)
>>>> Content-Length: 927
>>>> Content-Type: application/x-www-form-urlencoded
>>>> REFER: http://b------.net/“
>>>> 
>>>> ————————————
>>>> 
>>>> to understand the logging above: this is what i added/changed in Catalyst::Plugin::Unicode::Encoding
>>>> 
>>>> ————————————————————
>>>> around line 168:
>>>> 
>>>>         my $val;
>>>>         eval {
>>>>             $val = Encode::is_utf8( $value ) ? $value : $enc->decode( $value, $CHECK );
>>>>         };
>>>>         if ($@) {
>>>>             # oops: decoding failed, so collect the request info and die with it
>>>>             use Data::Dumper;
>>>>             my $params  = $self->req->parameters;
>>>>             my $headers = $self->req->headers->as_string;
>>>>             die "UTF8 Error: $@ - \n\nURL: " . $self->req->path . "\n\nPARAMS:" . Dumper( $params ) . "\n\n // value: " . Dumper($value) . "\n\nheaders: " . $headers;
>>>> ….
>>>> ————————————————————
>>>> 
>>>> I guess my Catalyst Apps are not the only ones with these errors ?
>>>> 
>>>> 
>>>> about my App settings / config:
>>>> 
>>>> app-config has
>>>> 	encoding                UTF-8
>>>> 
>>>> App.pm does not load Unicode::Encoding anymore (since this is not needed when using the latest Catalyst: 5.90065)
>>>> 
>>>> i am using postgres with
>>>> 	pg_enable_utf8 1
>>>> (but the error above is far away from any DB-related problem, i guess)
>>>> 
>>>> using Catalyst::Plugin::Unicode::Encoding version 2.1 (coming with catalyst)
>>>> 
>>>> i just checked the tracker for catalyst on cpan; there is a UTF8 issue ticket
>>>> 	https://rt.cpan.org/Public/Bug/Display.html?id=94957
>>>> but it does not look like it is this problem ...
>>>> 
>>>> Any ideas what to do?
>>>> Add an issue/ticket?
>>>> 
>>>> thanks for feedback,
>>>> bernhard bauch	
>>>> 
>>>> 
>>>> 
>>>> Bernhard Bauch, Webdevelopment
>>>> ZSI - Zentrum für soziale Innovation
>>>> bauch at zsi.at
>>>> Skype: berni-zsi
>>>> 
>>>> 
>>>> _______________________________________________
>>>> List: Catalyst at lists.scsys.co.uk
>>>> Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
>>>> Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
>>>> Dev site: http://dev.catalyst.perl.org/
>>>> 
