[Dbix-class] Maybe OT - How to create a result set based on 'similarity'?

Mario Minati mario at minati.de
Fri Mar 2 21:53:22 GMT 2007


John Napiorkowski wrote:
> ----- Original Message ----
> From: Mario Minati <mario at minati.de>
> To: dbix-class at lists.rawmode.org
> Sent: Friday, March 2, 2007 10:42:29 AM
> Subject: [Dbix-class] Maybe OT - How to create a result set based on 'similarity'?
>
> Hello @all,
>
> I'm looking for a solution to find out if there is already some data in 
> my dataset that is similar to a new entry.
>
> Example:
> Company names
> I would like to find out if there are already companies in my
> address book (DB) which are similar to a given name, to avoid duplicate entries.
>
> How to measure similarity:
> I'm thinking of the Hamming distance. That means the distance between
> Linux and Linus is 1, as one letter is different. The distance
> between Linux and Lisa is 3, as there is one letter more and two are
> different.
>
> Does anyone have an idea how to implement that?
> Can this be done with code running on the database (PL/SQL or
> something similar), or is there a way of doing it with DBIx::Class (drawback: all
> data would have to be read before processing)?
>
> Thank you for any hint.
>
> Greets,
> Mario Minati
>
> Mario,
>
> Seems more like something you'd want to do in a search engine.  PostgreSQL has done some work in this area, so you might want to check their site.  I think using plain SQL to do this would be prohibitive.  I can imagine building a SQL statement that would return all rows in a table where a given column had a value that differed by one or two characters in the way you mentioned, but for anything bigger than that you'd end up with quite a large SQL statement.  I'd try to do this using some built-in capabilities of the database if I could.  If the dataset is small, then doing it in Perl would be easy as well, but you are going to generate lots of database traffic.  If that's not an issue (say, the job runs on a scheduler during low-activity periods), you could cache the resultset out to disk to avoid filling all your memory.
>
> good luck!
> --john
>   
As I just answered Jason, I'll use the Postgres add-on for Levenshtein to
solve my problem.
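
A note on terms: the Linux/Lisa example, where the two strings differ in
length, is an edit (Levenshtein) distance; strict Hamming distance is only
defined for strings of equal length. In PostgreSQL, the fuzzystrmatch contrib
module provides a levenshtein() function, which is presumably the add-on
meant here. For small datasets the same check can also be done client-side.
Below is a minimal sketch in Python (not Perl, and not the code used in this
thread); the helper name similar_names, its max_distance default, and the
case-insensitive comparison are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop ca from a
                current[j - 1] + 1,      # add cb to a
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def similar_names(candidate, existing, max_distance=2):
    """Hypothetical helper: return (name, distance) pairs for every existing
    name within max_distance edits of candidate, closest first.
    Comparison is case-insensitive (an assumption, not from the thread)."""
    key = candidate.casefold()
    hits = []
    for name in existing:
        distance = levenshtein(key, name.casefold())
        if distance <= max_distance:
            hits.append((name, distance))
    return sorted(hits, key=lambda pair: pair[1])
```

With this sketch, levenshtein("Linux", "Linus") gives 1 and
levenshtein("Linux", "Lisa") gives 3, matching the numbers in the original
question. As noted in the quoted reply, this approach has to read every
candidate row out of the database first, so for larger tables computing the
distance inside the database is likely the better choice.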

Thank you for your thoughts, they were helpful.

Greets,
Mario


