[Dbix-class] MySQL "OR" Performance for Large Query Sets

Wed Nov 19 13:38:14 GMT 2008

On Sun, 2008-11-16 at 13:48 +0100, demerphq wrote:
> 2008/11/16 Jon Schutz <jon+dbix at youramigo.com>:
> > I've just done some benchmarking on MySQL 5 for queries where you need
> > to select back many individual records, e.g. you could use
> >
> > SELECT ... WHERE id=? (and iterate for all possible id values)
> >
> > or
> >
> > SELECT ... WHERE id=1 OR id=2 OR id=3 OR ...
> >
> > or
> >
> > SELECT ... WHERE id IN (1, 2, 3, ...)
> >
> > and there are several other ways to do it.
> >
> > What I found was that using OR, at a certain point (when selecting >
> > about 1000 keys) the performance went exponentially sour - like, to
> > quote one benchmark, the OR query took 6400 secs vs 30s typical for
> > other methods.  I've written the results up in more detail at
> > http://notes.jschutz.net/19/perl/mysql-many-row-select-performance -
> > have a look at the graph if nothing else.  Using 'IN' is the best
> > solution for any sort of large key set.
> >
> > I figure this is relevant to DBIx::Class since it uses SQL::Abstract
> > which will translate a search spec such as A => [ 1 .. 1000 ] into (A=?
> > OR A=? ... ) 1000 times, which could be a real gotcha (indeed, some of
> > my own code, which I shall now be reviewing, can expect arrays of tens
> > of thousands of keys...).
> >
> > I was wondering - has anyone else seen similar behaviour?
> 
> Yes. In some cases or is faster (older MySQL servers) in some cases IN
> is faster (newer MySQL). In most cases they ARENT equivalent, the
> optimiser is just not smart enough, nor designed for this type of
> equivalence conversion. Also, with either, at a certain point the
> optimiser figures out it cant use the indexes on the column and turns
> things into a table scan, and with the IN, it can construct an in
> memory hash, and check each value in O(1) time, where as with the OR
> approach it doesnt, and thus operates in O(k) time.
> 

I note that I was a bit surprised about the IN result, as my previous
experience (as you say, on an older MySQL) showed IN to be unexpectedly
slow.

> If you are generate IN lists of 10s of thousands of items I bet you
> would be better to create an indexed temporary table containing the
> list, and then use join syntax.
> 

Nope.  Tried it, and performance with an index is not significantly
different to that without, and both are much slower than using IN(...).
In this test case, MySQL would be using the index on the main table and
scanning through the temporary table, I suspect, hence little
difference.

It was also suggested that I should try using SELECT ... UNION
SELECT ..., which turned out to be much the same as OR.

New benchmarks described at 
http://notes.jschutz.net/20/mysql/mysql-multi-select-performance-the-sequel

-- 

Jon Schutz                        My tech notes http://notes.jschutz.net
Chief Technology Officer                        http://www.youramigo.com
YourAmigo         
-- 

Jon Schutz                        My tech notes http://notes.jschutz.net
Chief Technology Officer                        http://www.youramigo.com
YourAmigo