[Catalyst] "Streamed" content delivery?!
bobtfish at bobtfish.net
Wed Aug 6 00:24:29 BST 2008
On 5 Aug 2008, at 11:12, Heiko Jansen wrote:
> thanks for your detailed response.
>> I'd recommend that you use some sort of distributed job queueing
>> system for this. [Gearman] could be a good fit..
> Will have a look at that and/or at other job queue systems.
> Whatever I'll end up with needs to allow for inserting custom filter
> steps (every datasource has its own transformation routines for query
> and result), and the processing must be done in parallel.
I was thinking that the job workers handle this themselves and the
job queue system never sees it.. As you have different workers for
each DB you're talking to, they can do different things if they
want to.. (Obviously, there is nothing stopping them from sharing
most of their code, however.)
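To make that concrete, here's a rough sketch of what one such worker could look like, using the real Gearman::Worker CPAN client (the function name 'search_db_a' and the transformation subs are made up for illustration; the Gearman wiring is guarded so the sketch runs even without the module installed):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical per-datasource transformations: each worker owns the
# routines for turning the shared query format into its DB's dialect,
# and the DB's rows back into the common format.
sub query_to_db_a {
    my ($query) = @_;
    # DB A happens to want a LIKE pattern (use placeholders in real code)
    return "SELECT * FROM docs WHERE title LIKE '%$query%'";
}

sub rows_to_common {
    my (@rows) = @_;
    # Common format here is a trivial XML-ish string; a real worker
    # would build this with XML::LibXML or similar.
    return join '', map { "<hit>$_</hit>" } @rows;
}

# Gearman wiring, guarded so this sketch runs standalone:
if (eval { require Gearman::Worker; 1 }) {
    my $worker = Gearman::Worker->new;
    $worker->job_servers('127.0.0.1:4730');
    $worker->register_function(search_db_a => sub {
        my $job  = shift;
        my $sql  = query_to_db_a($job->arg);
        my @rows = ();                 # run $sql against DB A here
        return rows_to_common(@rows);
    });
    # $worker->work while 1;          # uncomment in a real worker
}

print rows_to_common('foo', 'bar'), "\n";   # <hit>foo</hit><hit>bar</hit>
```

The queue only ever sees opaque function names and payloads, so each worker is free to do arbitrarily different things (and they can still share the common-format code via a shared module).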
>> At first glance, it doesn't look like option (b) above will allow
>> you to do the templating in parallel,
> My intention was that every thread would call a model, receive a
> result (data in the common format), then process a template (the same
> for all databases, representing a part of a html page) and return the
> output back to the original controller thread, which would then be
> responsible for sending the output to the client.
> That should've allowed for parallel templating.
>> but I'm not sure that I agree that it
>> has to be done that way. Given that your templating step is, really,
>> the munging of the DB specific data into a common format,
>> I'd consider it the job of the model.. So you generate and cache a
>> fragment of what you're going to output to the user, then you only
>> have to sew things together..
> Yes and no. I should receive responses in a common format (presumably
> XML) back from the model. The templating step would be the generation
> of html chunks for the user/client based on the homogeneously
> formatted data.
> That should keep knowledge of the data sources out of the controller
> as far as possible, and likewise keep knowledge of the intended
> output out of the model (allowing the same data to be used to
> generate SOAP responses instead of HTML pages).
I see what you're saying here - I was deliberately simplifying at
first. I'd probably then, in this case, arrange for there to be 2
steps / 2 job types:
1) take query => db format, take data => xml, return
2) take XML => html/whatever
so any jobs which finish step (1) are just added to the pool of
waiting jobs for step (2)..
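As a rough sketch of the two job types (all names are illustrative; step 1 would really hit the DB and step 2 would really run a template engine like TT):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Step 1: DB-specific query + normalisation into the common XML format.
sub step1_fetch_as_xml {
    my ($query) = @_;
    my @rows = ("result for $query");    # stand-in for the real DB call
    return '<results>'
         . join('', map { "<hit>$_</hit>" } @rows)
         . '</results>';
}

# Step 2: common XML -> HTML fragment. The same code runs for every
# datasource, which is what makes it a separate, shared job type.
sub step2_xml_to_html {
    my ($xml) = @_;
    my @hits = $xml =~ m{<hit>(.*?)</hit>}g;
    return join '', map { "<li>$_</li>" } @hits;
}

# A finished step-1 job just becomes input for the step-2 pool:
my $xml  = step1_fetch_as_xml('cats');
my $html = step2_xml_to_html($xml);
print "$html\n";                         # <li>result for cats</li>
```

Splitting them means the step-2 workers can be scaled (and cached) independently of the per-DB step-1 workers.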
>> You *need* the job queueing system to be de-duping things for you (so
>> if 2 controllers ask for the same query, they don't execute the same
>> query in parallel twice, but just wait for the same job), and you
>> *need* a limited number of workers per DB, otherwise you'll be
>> flattened by the thundering herd problem if you have to do a complete
>> cache flush for any reason...
> Since I expect it to be rather uncommon that the same query will come
> in again, the cache is mostly intended to speed up cases like a user
> asking for hits 1 - 10 for a query, then 11 - 20, and then returning
> to range 1 - 10.
> But apart from that you're definitely right.
I'm not sure I 100% buy this - I mean, sure, you're not going to get
the same users searching for the same thing over and over - but the
chances of someone coming back to look at the output of a query again
a few hours / days later are quite high.. So probably quite a low
cache hit ratio, but high value in caching (and also in continuing to
process jobs which have already started in the background).
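For the de-duping itself, Gearman tasks can carry a 'uniq' value so identical jobs coalesce; the part that needs care is deriving a canonical key from the query parameters so two controllers asking for the same thing actually submit the same job. A minimal sketch (the key scheme is an assumption, not anything Gearman mandates):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Canonical job key: sort the parameter names so hashes built in
# different orders still produce the same key.
sub job_key {
    my (%params) = @_;
    my $canon = join ';', map { "$_=$params{$_}" } sort keys %params;
    return md5_hex($canon);
}

my $a = job_key(q => 'cats', page => 1);
my $b = job_key(page => 1, q => 'cats');
print $a eq $b ? "same job\n" : "different jobs\n";   # same job

# With the Gearman client this would look something like:
#   my $task = Gearman::Task->new('search_db_a', \$args,
#                                 { uniq => job_key(%params) });
```

Two requests for the same query then wait on one job instead of running it twice, which is also what protects the workers after a cache flush.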
> I was only referring to omitting the content-length header, not all
> headers (because I don't know how much content will be delivered if I
> send part of the response page as soon as I receive the request and
> before the database results are available).
Ah, well, just not setting one and calling $c->write($somedata) will
do the right thing then.
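For illustration, a sketch of the streaming pattern ($c->write is the real Catalyst context method; the action name and chunks are made up, and a tiny mock context is included so the sketch runs outside Catalyst):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# With no Content-Length set, each $c->write sends data to the client
# as it becomes available (chunked under HTTP/1.1).
sub stream_results {
    my ($self, $c) = @_;
    $c->res->content_type('text/html');
    $c->write('<html><body><ul>');   # sent before any DB results exist
    for my $chunk ('<li>db a</li>', '<li>db b</li>') {
        # in the real app, block here on the next finished job
        $c->write($chunk);
    }
    $c->write('</ul></body></html>');
}

# Minimal mock context so this runs standalone:
{
    package MockRes;
    sub content_type { }
}
{
    package MockCtx;
    sub new   { bless { out => '' }, shift }
    sub res   { bless({}, 'MockRes') }
    sub write { $_[0]{out} .= $_[1] }
}

my $c = MockCtx->new;
stream_results(undef, $c);
print $c->{out}, "\n";
```

The page header goes out immediately, and each fragment follows as its job completes, which is exactly the behaviour you were after.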