[Dbix-class] Maybe OT - How to create a result set based on 'similarity'?

Mario Minati mario at minati.de
Fri Mar 2 21:51:37 GMT 2007


Jason Kohles schrieb:
> On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:
>
>> Hello @all,
>>
>> I'm looking for a solution to find out if there is already some data 
>> in my dataset that is similar to a new entry.
>>
>> Example:
>> Companynames
>> I would like to find out if there are already companies in my 
>> addressbook (DB) which are similar to a given name to avoid double 
>> entries.
>>
>> How to measure similarity:
>> I'am thinking of the hammingdistance. That means the difference 
>> between Linux and Linus is 1 as there is one letter different. The 
>> distance between Linux and Lisa is 3 as there is one letter more and 
>> two are different.
>>
>> Does anyone have an idea how to realize that?
>> Can one realize this with code running on the database (PL/SQL or 
>> something) or is there a way doing that with DBIx::Class (drawback: 
>> all data had to read before processing).
>
> Many databases have either built-in or extensions available to do 
> this.  The only ones that I have used are from the PostgreSQL contrib 
> module fuzzystrmatch, which provides functions for 
> calculating Levenshtein Distance and Soundex (possibly others, I can't 
> recall).
>
> If your database doesn't support it directly, for some solutions 
> (Soundex, Metaphone) where you are calculating a checksum of the value 
> and then looking for ones where the checksum is the same, you can 
> precalculate the value when updating the database, and then just do a 
> simple comparison search.  Something like this would probably work...
>
> package MyDB::Company;
> use base 'DBIx::Class';
> use Text::Metaphone qw();
>
> __PACKAGE__->add_columns( qw( company company_metaphone ) );
>
> sub calculate_company_metaphone {
> my ( $self, $value ) = @_;
>
> my $co = lc( $value );
> $co =~ s/[^\w\s]//g;
> $co =~ s/\b(incorporated|inc|corporation|corp)\b/INC/;
> $co =~ s/\b(company|co)\b/CO/;
> # probably some other useful standardization stuff here
> return Text::Metaphone::Metaphone( $co );
> }
>
> sub store_column {
> my ( $self, $name, $value ) = @_;
>
> if ( $name eq 'company' ) {
> $self->company_metaphone( $self->calculate_comany_metaphone( $value ) );
> }
> $self->next::method( $name, $value );
> }
>
> Then you can simply search something like...
>
> my $rs = $schema->resultset( 'Company' )->search( {
> company_metaphone => MyDB::Company->calculate_company_metaphone( 
> $company_name ),
> } );
>  
Hello Jason,

your idea of an adapted store_column function is very clever.
But in the meantime I did some more googlesports and found out that the 
soundex functions (Metaphone is one of them) was build for english 
language and that there are some serious implications with the german 
language. As I am living in germany ...

So I thought of useing the Levenshtein algorithm which cannot be 
precalculated. So I'll do it in the database. I found out that there is 
an addon for Levenshtein and it was pretty easy to install.
No I've just to create a stored procedure which takes the company names 
and split's them at whitespaces and calculates the Levenshtein 
difference to a given word and returns the lowest value per company 
name. This way I'll get a good value, as the 'string' is an ajax user input.

Once again thank you for the code sample, I'll use it soon for an other 
problem.

Greets,
Mario



More information about the Dbix-class mailing list