[Catalyst] advise on data processing in Cat/DBIC/Model

Mon Nov 26 15:33:02 GMT 2007

Hello,

within my current project, some value is collected up to once a day:

 CREATE TABLE a_value {
  day date PRIMARY KEY,
  other_values integer NOT NULL,
  value integer
  another_value integer
 );

Data comes in a bit sporadic - so I cannot rely each day having an entry.
Actually there also be longer periods (weeks/month/??) without data.

I'm currently a bit at a loss on how to "properly" cook up this data to
easily display it in fixed time steps. I'm thinking of a list of *all*
days/weeks/month/... in a certain timerange. Such a list would allow the
view easy access to present the data (say as html table with one row per
time step or as input for GD::Graph).

This means there are basically two tasks:
- aggregate the data for each time step: No-brainer with DBIx::Class.
- get NULL entries for time steps without data: The intersting part.

I can come up the following solutions to generate the NULL entries:

- use a SQL stored procedure or temp table with the start-dates of the
  desired time-steps, do an outer join and stuff this in a DBIC
  result_source as described in the DBIC cookbook under "arbitrary SQL".

  example query for ->name():
	SELECT
	 d.id,
	 steps AS day,
	 d.value,
	 COALESCE( d.other_value, $4 ) AS other_value
	FROM
	 timeseries( $1, $2, $3) AS steps
	 LEFT JOIN ( SELECT * FROM data WHERE other_value = $4 ) d
	  ON ( d.day >= $2 AND d.day + $1 < $3;
  $1 = time steps. eg. '1 day'
  $2 = start date. eg. '2007-11-1'
  $3 = end date. eg '2007-11-30'
  $4 = other_value to filter on.
  timeseries(step,start,end) = stored procedure that returns the 
	start-dates of the time-steps within the specified time-range.

  pro: 
  - also usable without catalyst / DBIC.
  - easy data aggregation
  - same interface as non-cooked data
  con: 
  - Outer joins tend to be slow.
  - the NULL entries have to be transfered from DB->application
  - the "arbitrary SQL" example makes me feel a bit sick.
  - the "WHERE other_value..." subquery needs to be different in certain
    cases.

- Catalyst Model, that creates a DBIC resultset and feeds it to a plain
  perl "aggregation module" (which returns the aggregated data in the
  desired time steps).

  pro:
  - aggregation module is quite easy to write
  -  "            "    is also usable without catalyst / DBIC
  - allows easy tweaking of the resultset
  con:
  - still a bit tricky to teach the DB about the time-steps for data
    aggregation (think of non trivial steps like '4 days').  
  - alternatively the aggregation has to be done "manually"
  - same applies for selecting the actual columns I want to
    retrieve/aggregate.

Which do you consider the more elegant way? Am I missing the "obvious" and
therefore trivial solution (as usual)? Am I completely nuts?

Rainer

-- 
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0  B0E1 0556 E25A 7599 75BD