[Dbix-class] paging question

James Marca jmarca at translab.its.uci.edu
Wed Feb 3 21:57:20 GMT 2010


On Wed, Feb 03, 2010 at 10:40:12AM -0800, Ihnen, David wrote:

> It takes four hours to run the query?  I have written queries
> grouping and sorting 6 billion rows that ran several orders of
> magnitude faster than that.

Well, I'm pretty good at writing SQL queries, but I've never claimed
to be an expert, and efficiently querying this rather large data set
is something new to me.

> 
> I suspect that you need to consider appropriate indexes to support your WHERE clauses?
> 
I've indexed the primary table, created a view in the database, and
verified that the view is properly using the index.  However, I need
to drill down into the PostgreSQL docs to pick up more hints on how
to devise useful partial indexes (I can't yet see how they would
apply here) or a better index over my search expression.
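
For the record, here's roughly what I think a partial index would
look like here (table and column names below are invented for
illustration, not my actual schema):

    -- hypothetical raw table, one row per detector observation:
    -- CREATE TABLE detector_data (
    --     detector  integer   NOT NULL,
    --     ts        timestamp NOT NULL,
    --     occupancy real
    -- );

    -- A partial index covers only the rows matching its WHERE
    -- clause, so it stays small; the planner uses it only when a
    -- query's predicate implies the index predicate.
    CREATE INDEX detector_data_2007_idx
        ON detector_data (detector, ts)
        WHERE ts >= '2007-01-01' AND ts < '2008-01-01';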

> If you're truly working with data of that billion-record magnitude,
> your heap is too big to query this way.  You'll need to take an

Ah.  I have about 3.5 billion records now, holding _most_ of 2007.
This is expected to grow as I add '08, '09, ...

> alternative approach such as denormalizing the data store into
> derived data closer to the final form that will be needed, then
> keeping the denormalized data in sync.  (That denormalization
> could take the form of creating a smaller summary table, which could
> be queried with DBIx::Class - I did that for statistics reporting
> before.  Though I could have queried the stats directly, it was very
> expensive.)

I guess I *am* trying to denormalize the data right now.  I'm trying
to get the data out of psql and into CouchDB, which I've found to be
extremely fast for the 'front end' queries I need to do once I set up
the proper view.  My goal for this Perl/SQL step is to summarize the
raw data into 5-minute aggregates.
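
In case it helps anyone searching the archives later, the rollup
itself is simple enough; something along these lines (same made-up
table and column names as above) is what I'm after:

    -- bucket each timestamp to a 300-second boundary, then aggregate
    SELECT detector,
           to_timestamp(floor(extract(epoch FROM ts) / 300) * 300)
               AS bucket,
           avg(occupancy) AS avg_occupancy,
           count(*)       AS n_obs
    FROM   detector_data
    GROUP  BY detector, bucket
    ORDER  BY detector, bucket;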

I think your advice is good.  My SQL view is excessively complicated
and is probably the root of the problem.  I spent the morning looking
at statement caching and so on, and I think that's a useless
optimization here, so I'll stop wondering whether the problem is sth
(statement handle) caching and start concentrating on helping psql do
a better job.  I will take your suggestion and break the problem down
into parts, and maybe do more work in Perl and less in SQL.

Still, given how OFFSET and LIMIT work (according to the PostgreSQL
planner output), it irritates me that I have to run a monster select
only to keep a subset of the rows, then go back and run the exact
same select again and again for different subsets.
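
Two workarounds I'm considering, in case anyone else hits this (same
hypothetical table as above): keyset paging, which resumes from the
last row seen instead of counting past it, and a server-side cursor,
which plans the select once and streams it out in batches:

    -- (a) keyset paging: with an index on (detector, ts), this reads
    --     only the rows it returns instead of skipping OFFSET rows
    SELECT detector, ts, occupancy
    FROM   detector_data
    WHERE  (detector, ts) > (42, timestamp '2007-03-01 00:00:00')
    ORDER  BY detector, ts
    LIMIT  10000;   -- feed the last row back in as the next bound

    -- (b) server-side cursor: one plan, fetched in chunks
    BEGIN;
    DECLARE agg_cur CURSOR FOR
        SELECT detector, ts, occupancy
        FROM   detector_data
        ORDER  BY detector, ts;
    FETCH 10000 FROM agg_cur;   -- repeat until no rows come back
    COMMIT;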


Thanks for the comments,
James

> 
> David
> 
