[Catalyst] create search engine friendly uri from string

Octavian Rasnita orasnita at gmail.com
Tue Dec 16 19:33:31 GMT 2008


From: "Peter Karman" <peter at peknet.com>
> no. you must set ebit in new(), not after instantiation. I've added a
> note to the docs to emphasize that.
> my $tr = Search::Tools::Transliterate->new( ebit => 0 );

Thanks. This way it works fine.

> > The last 4 chars are 4 new UTF-8 chars in the Romanian language (U+0218,
> > U+0219, U+021A, U+021B). Can they be transliterated?
> > They are şŞţŢ but with a comma below, and not with a cedilla. Can they be
> > displayed as sStT?

> sure. Just add them via the map() method. I believe that's documented
> with an example, but here's another:
> use strict;
> use Search::Tools::Transliterate;
> use utf8;
>
> binmode STDERR, ':utf8';
>
> my $string = "ăşţâîĂŞŢÂÎ";
>
> # new romanian utf8 chars
> $string .= "\x{0218}";
> $string .= "\x{0219}";
> $string .= "\x{021A}";
> $string .= "\x{021B}";
>
> my $tr = Search::Tools::Transliterate->new(ebit=>0);
> # U+0218 and U+021A are the capitals, U+0219 and U+021B the small letters
> $tr->map->{"\x{0218}"} = 'S';
> $tr->map->{"\x{0219}"} = 's';
> $tr->map->{"\x{021A}"} = 'T';
> $tr->map->{"\x{021B}"} = 't';
>
> print STDERR $tr->convert($string) . "\n";
>
> I added the above code as part of a new test and just uploaded 0.19 to
> CPAN.
>
> If you have suggestions for permanent additions/changes to the character
> mapping file, please open an RT ticket and I'll see that they get
> reviewed for a future release.
> Thanks for the feedback.

Just as feedback, here is a short comparison I made between Text::Unidecode 
and Search::Tools::Transliterate (S::T::T below):

Text::Unidecode is 5 or 6 times faster than S::T::T.

I haven't checked what S::T::T does internally, but Text::Unidecode uses many 
other Perl modules which are loaded dynamically, and the current ActiveState 
PDK can't load them automatically, so it is harder to use Text::Unidecode.

Because it lets you override entries through the map hash, S::T::T is more 
flexible than Text::Unidecode.
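
To show what a per-character override map buys you, here is a minimal sketch 
that uses only core Perl. It is not how S::T::T implements map() internally; 
the %override table and the apply_overrides() helper are hypothetical names 
made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical override table: Romanian comma-below letters to plain ASCII.
my %override = (
    "\x{0218}" => 'S',    # LATIN CAPITAL LETTER S WITH COMMA BELOW
    "\x{0219}" => 's',    # LATIN SMALL LETTER S WITH COMMA BELOW
    "\x{021A}" => 'T',    # LATIN CAPITAL LETTER T WITH COMMA BELOW
    "\x{021B}" => 't',    # LATIN SMALL LETTER T WITH COMMA BELOW
);

# Replace every character that has an entry in the table and leave
# everything else untouched.
sub apply_overrides {
    my ($text) = @_;
    $text =~ s/(.)/ exists $override{$1} ? $override{$1} : $1 /gse;
    return $text;
}

binmode STDOUT, ':utf8';
print apply_overrides("\x{0218}\x{0219}\x{021A}\x{021B}"), "\n";
```

A pre-pass like this could in principle be run before handing the string to 
any transliterator, which is roughly the kind of flexibility the map hash 
gives you out of the box.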

I found that Text::Unidecode gives "Bei Jing" for the string 
"\x{5317}\x{4EB0}\n" while S::T::T just gives 2 spaces.

I've also tried to transliterate those 4 new Romanian chars using these 2 
modules:

use Text::Unidecode;
print unidecode("\x{0218}\x{0219}\x{021A}\x{021B}");
#It printed: SsTt

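use strict;
use warnings;
use Search::Tools::Transliterate;

my $tr = Search::Tools::Transliterate->new(ebit => 0);

# Write the result as UTF-8 so the non-ASCII output stays readable
open(my $out, ">:utf8", "test.txt") or die "Cannot open test.txt: $!";
print {$out} $tr->convert("\x{0218}\x{0219}\x{021A}\x{021B}");
close($out);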

It printed: ŞşŢţ
Without the map hash this doesn't print the "correct" plain ASCII string, 
but the result is interesting: it prints the corresponding characters with 
a cedilla, which are the ones used now instead of those new characters with 
a comma below them.
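
For anyone who wants to check which code point is which, here is a small 
sketch using only core modules (charnames and Unicode::Normalize). It prints 
the official names of the comma-below and cedilla letters, then strips the 
diacritics by canonical decomposition. This is a different technique from 
what either module above does, and strip_marks() is just a name I made up 
for the example.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFD);

# Decompose each character into base letter + combining marks,
# then drop the combining marks (general category Mn).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode STDOUT, ':utf8';

# Comma-below letters first, then their cedilla counterparts.
for my $cp (0x0218 .. 0x021B, 0x015E, 0x015F, 0x0162, 0x0163) {
    printf "U+%04X %s\n", $cp, charnames::viacode($cp);
}

print strip_marks("\x{0218}\x{0219}\x{021A}\x{021B}"), "\n";    # comma-below forms
print strip_marks("\x{015E}\x{015F}\x{0162}\x{0163}"), "\n";    # cedilla forms
```

Both of the last two lines should come out as SsTt, since the comma-below and 
the cedilla letters each decompose to the same base letter plus a combining 
mark. This only works for characters that have a canonical decomposition, so 
it is no replacement for a real mapping table.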

HTH.

Octavian



