[Catalyst] advice on data processing in Cat/DBIC/Model

Mon Nov 26 23:28:10 GMT 2007

On Mon, Nov 26, 2007 at 07:04:24PM +0000, Matt S Trout wrote:
> On Mon, Nov 26, 2007 at 04:33:02PM +0100, Rainer Clasen wrote:
> > Hello,
> > 
> > within my current project, some value is collected up to once a day:
> > 
> >  CREATE TABLE a_value (
> >   day date PRIMARY KEY,
> >   other_value integer NOT NULL,
> >   value integer,
> >   another_value integer
> >  );
> > 
> > Data comes in a bit sporadically, so I cannot rely on each day having an
> > entry. Actually, there may also be longer periods (weeks/months/??)
> > without data.
> > 
> > I'm currently a bit at a loss on how to "properly" cook up this data to
> > easily display it in fixed time steps. I'm thinking of a list of *all*
> > days/weeks/months/... in a certain time range. Such a list would give the
> > view easy access to the data (say, as an HTML table with one row per
> > time step, or as input for GD::Graph).
> > 
> > This means there are basically two tasks:
> > - aggregate the data for each time step: No-brainer with DBIx::Class.
> > - get NULL entries for time steps without data: The interesting part.
> > 
> > I can come up with the following solutions to generate the NULL entries:
> > 
> > - use a SQL stored procedure or temp table with the start-dates of the
> >   desired time-steps, do an outer join and stuff this in a DBIC
> >   result_source as described in the DBIC cookbook under "arbitrary SQL".
> > 
> >   example query for ->name():
> > 	SELECT
> > 	 d.id,
> > 	 steps AS day,
> > 	 d.value,
> > 	 COALESCE( d.other_value, $4 ) AS other_value
> > 	FROM
> > 	 timeseries( $1, $2, $3) AS steps
> > 	 LEFT JOIN ( SELECT * FROM data WHERE other_value = $4 ) d
> > 	  ON ( d.day >= $2 AND d.day + $1 < $3 );
> >   $1 = time step, e.g. '1 day'
> >   $2 = start date, e.g. '2007-11-1'
> >   $3 = end date, e.g. '2007-11-30'
> >   $4 = other_value to filter on.
> >   timeseries(step,start,end) = stored procedure that returns the 
> > 	start-dates of the time-steps within the specified time-range.
> 
> I tend to do -sort- of this.
> 
> Except that instead of using a function like timeseries() I'll create a
> pivot table with a 'date' column that I prepopulated with all dates from
> now to say 2020 (and make sure one of my cron jobs extends this when we
> reach say 2019 or so). Then I put function indexes on the various DATE_PART
> or equivalent functions that I might use to pull the month, year etc.
> 
> That way I can query the pivot as "just another DBIC class" and everything
> gets simpler.

I believe that's the method Joe Celko recommends as well.
You can extend it later with public holidays or some other metadata, too.
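
In case a concrete picture helps, here is a minimal sketch of that
calendar-table idea. It uses SQLite through Python only so that it is
self-contained; the table name "calendar", the trimmed-down a_value columns
and the sample rows are my own assumptions, not Rainer's real schema or
Matt's actual setup.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical calendar ("pivot") table: one row per day.
    CREATE TABLE calendar (day TEXT PRIMARY KEY);
    -- Sparse data table, loosely modelled on a_value from the thread.
    CREATE TABLE a_value (
        day         TEXT PRIMARY KEY,
        other_value INTEGER NOT NULL,
        value       INTEGER
    );
""")

# Prepopulate the calendar for the range of interest (here: one week).
start = date(2007, 11, 1)
conn.executemany(
    "INSERT INTO calendar (day) VALUES (?)",
    [((start + timedelta(days=i)).isoformat(),) for i in range(7)],
)

# Only two of the seven days have data.
conn.executemany(
    "INSERT INTO a_value (day, other_value, value) VALUES (?, ?, ?)",
    [("2007-11-02", 1, 10), ("2007-11-05", 1, 20)],
)

# The LEFT JOIN keeps every calendar day; days without data get NULL (None).
rows = conn.execute("""
    SELECT c.day, d.value
    FROM calendar c
    LEFT JOIN a_value d ON d.day = c.day
    WHERE c.day BETWEEN ? AND ?
    ORDER BY c.day
""", ("2007-11-01", "2007-11-07")).fetchall()

for day, value in rows:
    print(day, value)
```

Every day in the range comes back as a row, with an empty value where
nothing was recorded. Grouping on a week or month expression over c.day
instead of selecting the day itself should give the coarser time steps.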

tjc