[Catalyst] My Life with UTF-8
jon at jrock.us
Fri Aug 11 18:10:05 CEST 2006
This isn't a Catalyst question per se, but my Catalyst app is affected.
I'm getting some strange double-encoding issues with UTF-8 characters on
my web page. Pretty much everything was working fine, until I added RSS
feeds to my page that contained unicode. Then all the unicode bytes
broke at once. Here's a description of how the problem evolved and
what I've tried.
I do development on my Linux machine, which uses the en_US.UTF-8 locale.
UTF-8 stuff works great -- I have tons of UTF-8 filenames, my xterm
displays UTF-8 perfectly, etc. (Perfect for the languages I use
regularly, anyway -- English and Japanese :)
In my Catalyst app, I have a variety of potential data sources which are
unicode -- file contents for blog posts, file *names* for the titles,
tags read from YAML files (or extended filesystem attributes, but those
don't work on OpenBSD, my "production" environment, yet), and finally,
UTF-8 constants in the source code.
The first unicode breakage I had was when I added Japanese-style dates
as timestamps on the pages. (Japanese day-name character in
parentheses.) What was weird was that adding this to the page worked fine --
but it broke OTHER unicode characters on the page (sourced from a file
or file attribute). Adding "use utf8" to the top of my source file
fixed my problems, on Linux anyway. (Never tried on OpenBSD.)
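A minimal sketch of what "use utf8" changed here: without it, perl treats UTF-8 literals in the source file as raw octets; with it, they become character strings, so length() counts characters rather than bytes. (The character below is just an illustrative stand-in.)

```perl
#!/usr/bin/perl
# Without "use utf8", a UTF-8 literal is 3 raw octets; with it, the
# literal is decoded into a 1-character string with the utf8 flag on.
use strict;
use warnings;
use utf8;                      # the source file itself is UTF-8

my $day = "月";                # a Japanese day-name character
print length($day), "\n";      # 1 character, not 3 octets
print utf8::is_utf8($day) ? "flagged\n" : "octets\n";
```

Mixing a flagged literal like this with unflagged octets from a file is exactly the kind of thing that makes the *other* strings on the page break.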
The next problem I noticed was that C::V::TT::ForceUTF8 broke TT's "uri"
filter. According to the HTML validator, URIs can't contain raw
unicode, so you have to percent-escape the characters as UTF-8
octets. TT's uri filter was documented to do this, but instead it
translated anything with the 8th bit set to nothing,
annoying me greatly. I filed a bug report with TT, but it was closed as
"resolved" with no commentary. I eventually ditched ForceUTF8 and used
regular TT, and everything worked for a while. URIs were encoded, UTF-8
looked great no matter the source, and all this worked on both Linux and
OpenBSD. I also added Catalyst::Plugin::Unicode around this time, and
it got rid of an occasional "warning: wide character in print at line
... in Catalyst::Engine", which made me feel that my unicode support was
finally solid. No display bugs, no warnings, everything worked.
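For reference, here's the behavior I expected from the "uri" filter, sketched with core modules only: encode the characters to UTF-8 octets first, then percent-escape each unsafe octet. (URI::Escape's uri_escape_utf8 does the same job; the helper name here is my own.)

```perl
#!/usr/bin/perl
# Percent-escape a character string as UTF-8 octets: the octets come
# first, then each byte outside the URI "unreserved" set becomes %XX.
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

sub uri_utf8 {
    my $octets = encode_utf8(shift);   # characters -> UTF-8 octets
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

print uri_utf8("日本"), "\n";          # %E6%97%A5%E6%9C%AC
```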
Then, I decided to add RSS feeds to the sidebar. I tested it with my
delicious bookmarks, which contain links to various Japanese websites I
read, and were therefore unicode-encoded. When added to my page, these
titles broke *all* other unicode on the page, the unicode '...'
character, unicode tag names, even unicode in the articles. Ouch! (But
the characters in the RSS titles displayed fine!)
This too was fixable; I called utf8::encode($title) for each RSS feed
entry, and everything continued to work. On Linux. (Incidentally, this
is what TT::Provider::Encoding does -- encode $string if
!is_utf8($string) for everything on the template. This leads me to
believe that the uri filter's failure to encode is a TT bug: if the
utf8 flag is on for a string, it doesn't know what to do; if it's
off, it just dumbly url-encodes the octets.)
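The RSS fix above, sketched: the feed parser hands back character strings with the utf8 flag on, while the rest of the page is raw octets, so I downgrade each title to octets before it reaches the template. (The literal below is a stand-in for a real feed title.)

```perl
#!/usr/bin/perl
# Downgrade a flagged character string to raw UTF-8 octets in place,
# so it matches the unflagged strings elsewhere on the page.
use strict;
use warnings;
use Encode qw(decode_utf8);

my $title = decode_utf8("\xE6\x97\xA5\xE6\x9C\xAC");  # 2 flagged characters
utf8::encode($title);               # characters -> UTF-8 octets, in place
print utf8::is_utf8($title) ? "still flagged\n" : "octets now\n";
print length($title), "\n";         # 6 octets, not 2 characters
```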
Anyway, feeling quite happy with myself, I deployed the changes to my
OpenBSD machine. Bad. Very bad. Now *all* unicode was broken,
including the RSS feeds. I looked at the bytes with hexdump, and it
looks like a classic double-encoding issue. I figured I would just set
the locale on my server to en_US.UTF-8, and all would be well.
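The double-encoding in miniature: take octets that are already UTF-8, mistakenly treat them as Latin-1 characters, and encode them to UTF-8 again. Each high byte doubles, which is the pattern hexdump shows.

```perl
#!/usr/bin/perl
# "é" encoded once is C3 A9; decode those octets as Latin-1 and encode
# again, and each byte becomes two: C3 83 C2 A9.
use strict;
use warnings;
use Encode qw(encode decode);

my $once  = encode('UTF-8', "\x{e9}");                     # "é" -> C3 A9
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));  # encoded again
print join(' ', map { sprintf '%02X', ord } split //, $twice), "\n";
# C3 83 C2 A9
```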
Except OpenBSD doesn't have locale support!!!!
Any way I can tell perl, "trust me, everything is already UTF-8... don't
#^$ing touch it."? (I have no problem working with UTF-8 on my OpenBSD
machine because it's headless, and I use my UTF-8 aware xterm to work
with it. I write my articles on my Linux machine and copy them via DAV
to the OpenBSD machine. I add tags to the articles via my browser,
which sends the OpenBSD machine UTF-8 data. So everything works in
practice, even without locale support.)
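What I'm after, sketched: as far as I can tell, Encode never consults the locale, so even without locale support I can decode the octets explicitly ("trust me, this is UTF-8") and push an explicit UTF-8 layer onto the output handle instead of relying on the environment.

```perl
#!/usr/bin/perl
# Decode raw octets to characters and re-encode on output, with no
# locale involvement anywhere.
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';    # locale-independent output layer
my $octets = "\xE6\x97\xA5";           # raw UTF-8 from disk, DAV, or browser
my $chars  = decode('UTF-8', $octets); # now a 1-character string
print length($chars), "\n";            # 1
print $chars, "\n";                    # re-encoded by the output layer
```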
Any advice would be appreciated. Maybe this will even help others clear
up their unicode problems on sane operating systems.