[Dbix-class] MySQL "OR" Performance for Large Query Sets

Jon Schutz jon+dbix at youramigo.com
Wed Nov 19 13:38:05 GMT 2008


On Mon, 2008-11-17 at 12:15 -0500, Jason Gottshall wrote:
> demerphq wrote:
> > 2008/11/16 Jon Schutz <jon+dbix at youramigo.com>:
> >> I've just done some benchmarking on MySQL 5 for queries where you need
> >> to select back many individual records, e.g. you could use
> >>
> >> SELECT ... WHERE id=? (and iterate for all possible id values)
> >>
> >> or
> >>
> >> SELECT ... WHERE id=1 OR id=2 OR id=3 OR ...
> >>
> >> or
> >>
> >> SELECT ... WHERE id IN (1, 2, 3, ...)
> >>
> >> and there are several other ways to do it.
> >>
> >> What I found was that using OR, at a certain point (when selecting >
> >> about 1000 keys) the performance went exponentially sour - like, to
> >> quote one benchmark, the OR query took 6400 secs vs 30s typical for
> >> other methods.  I've written the results up in more detail at
> >> http://notes.jschutz.net/19/perl/mysql-many-row-select-performance -
> >> have a look at the graph if nothing else.  Using 'IN' is the best
> >> solution for any sort of large key set.
> >>
> >> I figure this is relevant to DBIx::Class since it uses SQL::Abstract
> >> which will translate a search spec such as A => [ 1 .. 1000 ] into (A=?
> >> OR A=? ... ) 1000 times, which could be a real gotcha (indeed, some of
> >> my own code, which I shall now be reviewing, can expect arrays of tens
> >> of thousands of keys...).
> >>
> >> I was wondering - has anyone else seen similar behaviour?
> > 
> > Yes. In some cases OR is faster (older MySQL servers); in some cases IN
> > is faster (newer MySQL). In most cases they AREN'T equivalent to the
> > optimiser, which is neither smart enough nor designed for this type of
> > equivalence conversion. Also, with either form, at a certain point the
> > optimiser decides it can't use the indexes on the column and falls back
> > to a table scan. With IN, it can construct an in-memory hash and check
> > each value in O(1) time, whereas with the OR approach it doesn't, and
> > thus operates in O(k) time.
> > 
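(A toy Python model of the two membership strategies described above; this illustrates the complexity argument only, not MySQL's actual internals:)

```python
# Checking one candidate row's id against a k-element value list,
# two ways: hash probe (the IN case) vs. linear scan (the OR case).
k = 10_000
values = list(range(k))        # the literal value list from the query
value_set = set(values)        # what IN can build: an in-memory hash

row_id = k - 1
hit_via_hash = row_id in value_set   # one hash probe, O(1)
hit_via_scan = row_id in values      # linear scan over k values, O(k)
```

Both answer the same question; only the per-row cost differs, which is why the gap widens as the key set grows.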
> > If you are generating IN lists of tens of thousands of items, I bet you
> > would be better off creating an indexed temporary table containing the
> > list, and then using join syntax.
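(A sketch of that temporary-table approach, using SQLite here just for illustration; table and column names are made up, and the PRIMARY KEY on the temp table supplies the index:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, "item%d" % i) for i in range(10000)])

wanted = list(range(0, 10000, 7))  # a large key set

# Load the keys into an indexed temporary table...
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)",
                 [(i,) for i in wanted])

# ...and join against it instead of generating a huge IN list.
rows = conn.execute(
    "SELECT items.id, items.name FROM items "
    "JOIN wanted_ids ON items.id = wanted_ids.id"
).fetchall()
```

The query text stays a constant size no matter how many keys there are; only the inserts into the temp table scale with the key set.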
> 
> We've found that Oracle (9 or 10) seems to have a hard limit of 1000 
> values in an "IN()" criterion, so dbic probably shouldn't try to use 
> that as its default behavior. I'm intrigued by the idea of using an 
> indexed temp table...
> 

That's a hard-coded limit, then, unlike MySQL, which has a
max_allowed_packet variable limiting the overall size of the query?

The trouble is that even the OR form, as DBIC currently generates it,
will eventually run into MySQL's max_allowed_packet limit (and
presumably equivalent limits on any database), so the only way to be
100% reliable is to break the query into multiple queries, each under
max_allowed_packet.  I imagine it would be quite a challenge to abstract
this reliably in SQL::Abstract.
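(A rough sketch of that chunking idea, again in Python/SQLite for illustration; the function name and the chunk size of 500 are arbitrary choices, and in practice the chunk size would be picked to stay under the target database's limit, e.g. Oracle's 1000-value IN ceiling or a size that keeps the statement under max_allowed_packet:)

```python
import sqlite3

def select_in_chunks(conn, table, column, keys, chunk_size=500):
    """Fetch rows matching a large key set by issuing several smaller
    IN (...) queries, each with at most chunk_size placeholders."""
    rows = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        # NOTE: table/column are trusted identifiers here, not user input.
        sql = "SELECT * FROM %s WHERE %s IN (%s)" % (table, column, placeholders)
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows

# Demo against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5000)])
keys = list(range(0, 5000, 3))
rows = select_in_chunks(conn, "t", "id", keys)
```

One caveat this sketch ignores: splitting a query also splits any ordering or LIMIT semantics across the chunks, which is part of why abstracting it generically is hard.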


-- 

Jon Schutz                        My tech notes http://notes.jschutz.net
Chief Technology Officer                        http://www.youramigo.com
YourAmigo         
