[Dbix-class] trying to figure out why DBIc query is slow

Thu Nov 8 16:24:58 GMT 2012

Frank Sheiness <frank at korcett.com> wrote:
> I have two postgres queries that return the same data set.  One has some
> hard coded values, the other uses some joins  to figure out which rows to
> get.  Both queries run very quickly via the psql client and a DBI script.
> Using DBIc, the latter query runs very slowly.  I've tried using
> HashReinflator but it's still very slow.  Here are the two queries:
>
> SELECT me.dataset_id, me.sample_time, me.sample_mean, me.sample_stdv,
> me.sample_min, me.sample_max, me.sample_size, myself.sample_mean FROM
> datasample me  JOIN datasample myself ON myself.sample_time = me.sample_time
> WHERE ( ( ( me.sample_time >= ? AND me.sample_time <= ? AND
> myself.sample_time >= ? AND myself.sample_time <= ? ) AND me.dataset_id = ?
> AND myself.dataset_id = ? ) ) ORDER BY me.sample_time: '2012-10-01',
> '2012-11-06 23:59', '2012-10-01', '2012-11-06 23:59', '3165', '3166'
>
> SELECT me.dataset_id, me.sample_time, me.sample_mean, me.sample_stdv,
> me.sample_min, me.sample_max, me.sample_size, myself.sample_mean FROM
> datasample me  JOIN dataset dataset ON dataset.dataset_id = me.dataset_id
> JOIN datasample myself ON myself.sample_time = me.sample_time  JOIN dataset
> dataset_2 ON dataset_2.dataset_id = myself.dataset_id WHERE ( ( (
> me.sample_time >= ? AND me.sample_time <= ? AND myself.sample_time >= ? AND
> myself.sample_time <= ? ) AND dataset.device_id = ? AND dataset.label LIKE ?
> AND dataset.name = ? AND dataset_2.device_id = ? AND dataset_2.label LIKE ?
> AND dataset_2.name = ? ) ) ORDER BY me.sample_time: '2012-10-01',
> '2012-11-07 23:59', '2012-10-01', '2012-11-07 23:59', '659', '%[1]%',
> 'BytT', '659', '%[1]%', 'BytR'

It's not clear to me that those two queries should have comparable
performance.  I see three differences between them:

- The slow query considers a time period one day longer than the fast query's
- The slow query joins four tables, where the fast query joins only two
- The slow query contains two separate `LIKE '%[1]%'` clauses

My guess is that the last point is the biggest cause of the
performance difference you're seeing.  In practice, databases can't
readily optimise LIKE comparisons, and even in the presence of a index
on the relevant column, a LIKE pattern that begins with a wildcard is
almost certain to require at least a partial table scan.  In this
case, if you don't also have a suitable index on dataset.device_id
and/or dataset.name, you'll probably get a full table scan.

I'm aware that you say the two queries have comparable performance
when run from the psql client, but could I ask you to double-check?

You may also find Postgres's EXPLAIN useful.

-- 
Aaron Crane ** http://aaroncrane.co.uk/