[Catalyst] create search engine friendly uri from string
Octavian Rasnita
orasnita at gmail.com
Tue Dec 16 19:33:31 GMT 2008
From: "Peter Karman" <peter at peknet.com>
> no. you must set ebit in new(), not after instantiation. I've added a
> note to the docs to emphasize that.
> my $tr = Search::Tools::Transliterate->new( ebit => 0 );
Thanks. This way it works fine.
> The last 4 chars are 4 new UTF-8 chars in the Romanian language (U+0218,
> U+0219, U+021A, U+021B). Can they be transliterated?
> They are like şŞţŢ but with a comma below instead of a cedilla. Can they be
> displayed as sStT?
> sure. Just add them via the map() method. I believe that's documented
> with an example, but here's another:
> use strict;
> use Search::Tools::Transliterate;
> use utf8;
>
> binmode STDERR, ':utf8';
>
> my $string = "ăşţâîĂŞŢÂÎ";
>
> # new romanian utf8 chars
> $string .= "\x{0218}";
> $string .= "\x{0219}";
> $string .= "\x{021A}";
> $string .= "\x{021B}";
>
> my $tr = Search::Tools::Transliterate->new(ebit=>0);
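> $tr->map->{"\x{0218}"} = 'S';   # capital S with comma below
> $tr->map->{"\x{0219}"} = 's';   # small s with comma below
> $tr->map->{"\x{021A}"} = 'T';   # capital T with comma below
> $tr->map->{"\x{021B}"} = 't';   # small t with comma below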
>
> print STDERR $tr->convert($string) . "\n";
>
> I added the above code as part of a new test and just uploaded 0.19 to
> cpan.
>
> If you have suggestions for permanent additions/changes to the character
> mapping file, please open a RT ticket and I'll see that they get
> reviewed for a future release.
> Thanks for the feedback.
Just as feedback, here is a short comparison I made between these two
modules:
Text::Unidecode is 5 or 6 times faster than S::T::T.
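A timing comparison like this one could be reproduced with a sketch along the following lines. It is only an illustration, not the script used for the figure above: the sample string and the one-second run length are assumptions, and each module is loaded only if it is installed.

```perl
use strict;
use warnings;
use utf8;
use Benchmark qw(cmpthese);

# Assumed sample input; any non-ASCII text would do.
my $sample = "ăşţâîĂŞŢÂÎ" x 100;

my %candidates;

# Add Text::Unidecode only if it can be loaded.
if ( eval { require Text::Unidecode; 1 } ) {
    $candidates{unidecode} = sub { Text::Unidecode::unidecode($sample) };
}

# Add Search::Tools::Transliterate only if it can be loaded.
if ( eval { require Search::Tools::Transliterate; 1 } ) {
    my $tr = Search::Tools::Transliterate->new( ebit => 0 );
    $candidates{stt} = sub { $tr->convert($sample) };
}

# Run each candidate for at least 1 CPU second and print a rate table.
cmpthese( -1, \%candidates ) if %candidates;
```

The ratio will depend on the input, the module versions and the machine, so it should be read as a rough indication only.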
I haven't looked at what S::T::T does internally, but Text::Unidecode uses many
other Perl modules which are loaded dynamically, and the current ActiveState
PDK can't load them automatically, so it is harder to use Text::Unidecode.
Because it is able to use the map hash, S::T::T is more flexible than
Text::Unidecode.
I found that Text::Unidecode gives "Bei Jing" for the string
"\x{5317}\x{4EB0}\n" while S::T::T just gives 2 spaces.
And I've tried to transliterate those 4 new Romanian chars using these 2
modules:
use Text::Unidecode;
print unidecode("\x{0218}\x{0219}\x{021A}\x{021B}");
# It printed: SsTt
use Search::Tools::Transliterate;
my $tr = Search::Tools::Transliterate->new(ebit => 0);
open(my $out, ">:utf8", "test.txt") or die "Can't open test.txt: $!";
print $out $tr->convert("\x{0218}\x{0219}\x{021A}\x{021B}");
close($out);
# It wrote: ŞşŢţ
Well, without using the map hash, this doesn't print the "correct" string,
but it is interesting because it prints the corresponding characters which
are used now instead of those new characters with a comma instead of a
cedilla below them.
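For the original goal of this thread (a search engine friendly URI), here is a minimal sketch that uses neither module. It swaps in a different technique: Unicode decomposition with the core Unicode::Normalize module, followed by stripping the combining marks. The slugify() helper is a hypothetical name, and this approach only handles letters that decompose into an ASCII base plus marks (which covers both the cedilla and the comma-below Romanian letters), not scripts such as Chinese.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper, not part of Text::Unidecode or Search::Tools.
sub slugify {
    my ($text) = @_;
    my $ascii = NFKD($text);           # split letters from their diacritics
    $ascii =~ s/\p{Mn}+//g;            # drop the combining marks
    $ascii =~ s/[^A-Za-z0-9]+/-/g;     # turn every other run into one hyphen
    $ascii =~ s/^-+|-+$//g;            # trim leading and trailing hyphens
    return lc $ascii;
}

# Mixes cedilla forms (Ş, ş) with the newer comma-below forms (U+021B, U+0219).
print slugify("Ştiri şi \x{021B}ări: Bucure\x{0219}ti 2008"), "\n";
# prints: stiri-si-tari-bucuresti-2008
```

Characters without a decomposition are simply replaced by hyphens here, so a module with a mapping table is still needed for anything beyond accented Latin letters.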
HTH.
Octavian