[Dbix-class] Maybe OT - How to create a result set based on 'similarity'?

Jason Kohles email at jasonkohles.com
Fri Mar 2 17:09:08 GMT 2007


On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:

> I would like to find out if there are already companies in my  
> addressbook (DB) which are similar to a given name to avoid double  
> entries.
>
> How to measure similarity:
> I'am thinking of the hammingdistance. That means the difference  
> between Linux and Linus is 1 as there is one letter different. The  
> distance between Linux and Lisa is 3 as there is one letter more  
> and two are different.
>
> Does anyone have an idea how to realize that?
> Can one realize this with code running on the database (PL/SQL or  
> something) or is there a way doing that with DBIx::Class (drawback:  
> all data had to read before processing).
>
In addition to my other response, you might find this interesting,  
although I would be a bit afraid of how long this might take in a  
large database...

 From http://www.merriampark.com/ldplsql.htm

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation
by Barbara Boehmer (baboehme at hotmail.com)

CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
   (p_source_string   IN VARCHAR2,
    p_target_string   IN VARCHAR2)
   RETURN                NUMBER
   DETERMINISTIC
AS
   v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
   v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
   TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
   column_to_left        mytabtype;
   current_column        mytabtype;
   v_cost                NUMBER := 0;
BEGIN
   IF v_length_of_source = 0 THEN
     RETURN v_length_of_target;
   ELSIF v_length_of_target = 0 THEN
     RETURN v_length_of_source;
   ELSE
     FOR j IN 0 .. v_length_of_target LOOP
       column_to_left(j) := j;
     END LOOP;
     FOR i IN 1.. v_length_of_source LOOP
       current_column(0) := i;
       FOR j IN 1 .. v_length_of_target LOOP
         IF SUBSTR (p_source_string, i, 1) =
            SUBSTR (p_target_string, j, 1)
         THEN v_cost := 0;
         ELSE v_cost := 1;
         END IF;
         current_column(j) := LEAST (current_column(j-1) + 1,
                                     column_to_left(j) + 1,
                                     column_to_left(j-1) + v_cost);
       END LOOP;
       FOR j IN 0 .. v_length_of_target  LOOP
         column_to_left(j) := current_column(j);
       END LOOP;
     END LOOP;
   END IF;
   RETURN current_column(v_length_of_target);
END ld;

-- 
Jason Kohles
email at jasonkohles.com
http://www.jasonkohles.com/
"A witty saying proves nothing."  -- Voltaire


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.scsys.co.uk/pipermail/dbix-class/attachments/20070302/d2f95a12/attachment-0001.htm


More information about the Dbix-class mailing list