[Dbix-class] paging question

James Marca jmarca at translab.its.uci.edu
Fri Feb 5 20:30:26 GMT 2010


Thanks to the list for tips.  I can't say the process is running that
much faster (it's just a lot of data to sling around), but postgres can
now handle multiple processes at once, so I'm seeing a definite
speed-up with multiple jobs running.

However, I'm still seeing something interesting, as follows.  

What I did was to step through my data hour by hour, as suggested,
dispensing with page and rows entirely.  My outer loop picks the
detector id.  My inner loop increments a date time object from the
earliest record to the latest record, or between two dates specified
on the command line.  My table is indexed on both values and the
queries are super quick; I just have lots of them.

But I was still curious about how the 'disable_sth_caching' parameter
would affect things, so I experimented and got what I think are
counterintuitive results.  Perhaps I am still doing something wrong?

In an unscientific test, I ran two processes on 4 days of data, one
process with sth caching enabled, one with sth caching disabled.  I
ran both jobs at about the same time.  Each process connected to the
db and I could see the postgres jobs in top and using ps -ax.  As the
jobs neared completion, the postgres process associated with the sth
caching *enabled* job had consumed about twice the CPU time of its
counterpart, and in top was consistently using more CPU.  The two
perl jobs had about
identical CPU and RAM usage.  The perl job with disable_sth_caching=>1
finished first.  The output of time for both is:

 disable_sth_caching=>1  real 71m23.551s; user 23m0.370s;  sys 0m9.330s
 disable_sth_caching=>0  real 79m14.924s; user 25m22.100s; sys 0m21.580s

While the two jobs processed different sets of days, the data are
roughly equivalent in size.  Disabling sth caching gave the better
performance.

So perhaps I am doing something wrong, and caching lots and lots of
sth's that never get used again?
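
One way I could check that hypothesis would be to peek at the dbh's
CachedKids attribute, which DBI documents as the hash of handles that
prepare_cached is holding.  A sketch (reaching the dbh through my
delegated storage object):

  # Count the statement handles prepare_cached has accumulated so far.
  my $dbh    = $self->storage->dbh;
  my $cached = $dbh->{CachedKids} || {};
  printf "prepare_cached is holding %d statement handles\n",
      scalar keys %$cached;

If that count keeps growing with every search, the cache is pure
overhead here.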

Here is how I create the result sets.  (This is not exactly how I do
it; I'm cutting and pasting from different subroutines that keep my
code modular, and I've changed some of the names to protect the
innocent.)

($self is an object via MooseX::Declare, and my Schema is an object
attribute that is wrapped with a broad 'handles'=>qr/^(.*)/sxm,)

  my $data_rs = $self->resultset('Raw5minuteAggregatesView')->search(
      {},
      {
          'select' =>
              [qw{detector_id fivemin intervals nsum oave nlanes olanes}],
      }
  );

  while ( DateTime->compare( $self->curr_dt, $self->end_dt ) < 0 ) {
      my $query = {
          'detector_id' => $detector->id,
          'fivemin'     => {
              -between => [
                  DateTime::Format::Pg->format_datetime( $self->curr_dt ),
                  DateTime::Format::Pg->format_datetime(
                      $self->curr_dt + $self->duration
                  ),
              ]
          },
      };
      my $rs     = $data_rs->search($query);
      my $cursor = $rs->cursor;
      while ( my @vals = $cursor->next ) {
          $self->inner_loop_method( [@vals] );
      }
      # do other stuff and increment $self->curr_dt, etc.
  }


From reading the docs, starting with
http://search.cpan.org/~frew/DBIx-Class-0.08115/lib/DBIx/Class/Storage/DBI.pm#DBIx::Class_specific_connection_attributes
and then linking to http://search.cpan.org/perldoc?DBI#prepare_cached,
I see this for 'prepare_cached':

  Like "prepare" except that the statement handle returned will be
  stored in a hash associated with the $dbh. If another call is made
  to prepare_cached with the same $statement and %attr parameter
  values, then the corresponding cached $sth will be returned without
  contacting the database server.
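
As a minimal illustration of that rule (assuming DBD::SQLite is
available; the table here is made up):

  use DBI;

  my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
      { RaiseError => 1 } );
  $dbh->do('CREATE TABLE raw (detector_id INTEGER, fivemin TEXT)');

  # Identical SQL text and attributes -> the cached handle comes back.
  my $sth1 = $dbh->prepare_cached(
      'SELECT detector_id FROM raw WHERE fivemin BETWEEN ? AND ?');
  my $sth2 = $dbh->prepare_cached(
      'SELECT detector_id FROM raw WHERE fivemin BETWEEN ? AND ?');
  print $sth1 == $sth2 ? "cache hit\n" : "cache miss\n";   # cache hit

  # Any difference in the SQL string means a new handle in the cache.
  my $sth3 = $dbh->prepare_cached(
      'SELECT detector_id FROM raw WHERE fivemin BETWEEN ? AND ? LIMIT 1');
  print $sth1 == $sth3 ? "cache hit\n" : "cache miss\n";   # cache miss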

So if I understand this correctly, what is happening is that
prepare_cached caches a statement keyed on both the query *and* the
attributes of the query.  And because my attributes change with each
$data_rs->search($query) call, I never end up reusing a statement
handle, right?  What I want is a prepared statement handle that gets
reused for the same query with different bind values, without having
to dive down into the DBI connection to do it.
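
To be concrete, something like this untested sketch is what I'm
imagining: borrow the DBI handle via the storage layer, prepare once,
and just rebind each iteration.  (I'm guessing at the underlying view
name; dbh_do is the documented DBIx::Class way to use the handle
safely across reconnects.)

  # Sketch only: one prepared handle, new bind values every iteration.
  my $sql = q{
      SELECT detector_id, fivemin, intervals, nsum, oave, nlanes, olanes
        FROM raw_5minute_aggregates_view
       WHERE detector_id = ? AND fivemin BETWEEN ? AND ?
  };
  $self->storage->dbh_do(
      sub {
          my ( $storage, $dbh ) = @_;
          my $sth = $dbh->prepare($sql);
          while ( DateTime->compare( $self->curr_dt, $self->end_dt ) < 0 ) {
              $sth->execute(
                  $detector->id,
                  DateTime::Format::Pg->format_datetime( $self->curr_dt ),
                  DateTime::Format::Pg->format_datetime(
                      $self->curr_dt + $self->duration
                  ),
              );
              while ( my @vals = $sth->fetchrow_array ) {
                  $self->inner_loop_method( [@vals] );
              }
              # increment $self->curr_dt, etc.
          }
      }
  );

But I'd rather stay at the DBIx::Class level if there's a way.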

Any thoughts or advice?

Regards, 
James

On Wed, Feb 03, 2010 at 02:14:35PM -0800, Ihnen, David wrote:
> Based on what you've said (5 minute summaries) I would expect your records have a date field that you are grouping in your query.
> 
> My strategy for handling summarizing this type of data would probably be something on the order of an hour at a time in five minute groups.  With a good b-tree ranging index on the date field, and the manageable data chunks from splitting it down that way I'd get 12 rows for each query that got a particular hour.  I'd just iterate over the hours, flipping the binding value for the time until I got to the end of the data set.
> 
> With a 'last updated' timestamp (indexed of course) on the records, you can periodically get 'all of the hours with changed data since I last checked' and iterate over those hours to replace the summaries stored in the denormalized tables to maintain sync.
> 
> And in the end never have a data set on the large heap generated of more than 12 rows output, and never ask the database to work with the entire data set - just the subset that the b-tree index limits it to.
> 
> Exciting stuff, working with data sets that large.
> 
> David
