[Dbix-class] MySQL "OR" Performance for Large Query Sets

Mon Nov 17 17:15:13 GMT 2008

demerphq wrote:
> 2008/11/16 Jon Schutz <jon+dbix at youramigo.com>:
>> I've just done some benchmarking on MySQL 5 for queries where you need
>> to select back many individual records, e.g. you could use
>>
>> SELECT ... WHERE id=? (and iterate for all possible id values)
>>
>> or
>>
>> SELECT ... WHERE id=1 OR id=2 OR id=3 OR ...
>>
>> or
>>
>> SELECT ... WHERE id IN (1, 2, 3, ...)
>>
>> and there are several other ways to do it.
>>
>> What I found was that using OR, at a certain point (when selecting >
>> about 1000 keys) the performance went exponentially sour - like, to
>> quote one benchmark, the OR query took 6400 secs vs 30s typical for
>> other methods.  I've written the results up in more detail at
>> http://notes.jschutz.net/19/perl/mysql-many-row-select-performance -
>> have a look at the graph if nothing else.  Using 'IN' is the best
>> solution for any sort of large key set.
>>
>> I figure this is relevant to DBIx::Class since it uses SQL::Abstract
>> which will translate a search spec such as A => [ 1 .. 1000 ] into (A=?
>> OR A=? ... ) 1000 times, which could be a real gotcha (indeed, some of
>> my own code, which I shall now be reviewing, can expect arrays of tens
>> of thousands of keys...).
>>
>> I was wondering - has anyone else seen similar behaviour?
> 
> Yes. In some cases or is faster (older MySQL servers) in some cases IN
> is faster (newer MySQL). In most cases they ARENT equivalent, the
> optimiser is just not smart enough, nor designed for this type of
> equivalence conversion. Also, with either, at a certain point the
> optimiser figures out it cant use the indexes on the column and turns
> things into a table scan, and with the IN, it can construct an in
> memory hash, and check each value in O(1) time, where as with the OR
> approach it doesnt, and thus operates in O(k) time.
> 
> If you are generate IN lists of 10s of thousands of items I bet you
> would be better to create an indexed temporary table containing the
> list, and then use join syntax.

We've found that Oracle (9 or 10) seems to have a hard limit of 1000 
values in an "IN()" criterion, so dbic probably shouldn't try to use 
that as it's default behavior. I'm intrigued by the idea of using an 
indexed temp table...

Jason

--
Jason Gottshall
jgottshall at capwiz.com