[Dbix-class] Maybe OT - How to create a result set based on 'similarity'?

Jason Kohles email at jasonkohles.com
Fri Mar 2 17:06:37 GMT 2007


On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:

> Hello @all,
>
> I'm looking for a solution to find out if there is already some  
> data in my dataset that is similar to a new entry.
>
> Example:
> Companynames
> I would like to find out if there are already companies in my  
> addressbook (DB) which are similar to a given name to avoid double  
> entries.
>
> How to measure similarity:
> I'am thinking of the hammingdistance. That means the difference  
> between Linux and Linus is 1 as there is one letter different. The  
> distance between Linux and Lisa is 3 as there is one letter more  
> and two are different.
>
> Does anyone have an idea how to realize that?
> Can one realize this with code running on the database (PL/SQL or  
> something) or is there a way doing that with DBIx::Class (drawback:  
> all data had to read before processing).

Many databases have either built-in or extensions available to do  
this.  The only ones that I have used are from the PostgreSQL contrib  
module fuzzystrmatch, which provides functions for calculating  
Levenshtein Distance and Soundex (possibly others, I can't recall).

If your database doesn't support it directly, for some solutions  
(Soundex, Metaphone) where you are calculating a checksum of the  
value and then looking for ones where the checksum is the same, you  
can precalculate the value when updating the database, and then just  
do a simple comparison search.  Something like this would probably  
work...

package MyDB::Company;
use base 'DBIx::Class';
use Text::Metaphone qw();

__PACKAGE__->add_columns( qw( company company_metaphone ) );

sub calculate_company_metaphone {
	my ( $self, $value ) = @_;

	my $co = lc( $value );
	$co =~ s/[^\w\s]//g;
	$co =~ s/\b(incorporated|inc|corporation|corp)\b/INC/;
	$co =~ s/\b(company|co)\b/CO/;
	# probably some other useful standardization stuff here
	return Text::Metaphone::Metaphone( $co );
}

sub store_column {
	my ( $self, $name, $value ) = @_;

	if ( $name eq 'company' ) {
		$self->company_metaphone( $self->calculate_comany_metaphone 
( $value ) );
	}
	$self->next::method( $name, $value );
}

Then you can simply search something like...

my $rs = $schema->resultset( 'Company' )->search( {
	company_metaphone	=> MyDB::Company->calculate_company_metaphone 
( $company_name ),
} );

-- 
Jason Kohles
email at jasonkohles.com
http://www.jasonkohles.com/
"A witty saying proves nothing."  -- Voltaire


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.scsys.co.uk/pipermail/dbix-class/attachments/20070302/e83dd545/attachment.htm


More information about the Dbix-class mailing list