[Catalyst] "Streamed" content delivery?!

Tue Aug 5 11:12:25 BST 2008

Tomas,

thanks for your detailed response.

On Tuesday, 05.08.2008 at 09:49 +0100, Tomas Doran wrote:

> > b) A thread is started in the controller for every database, in  
> > which a method in a Model is called that handles one database; the  
> > thread works like the one above but also processes the template and  
> > sends the final result to the Controller who sends it to the client.
> >
> > I'd probably go for method a) because I could implement the Model  
> > method as a rather simple wrapper which in fact connects to a  
> > different server which does the heavy lifting. On the other hand  
> > method a) would mean that template processing would happen  
> > sequentially while in method b) this also could be done in parallel.
> 
> I'd recommend that you use some sort of distributed job queueing  
> system for this. [Gearman][1] could be a good fit..

I will have a look at that and/or at other job queue systems. Whatever I
end up with needs to allow for inserting custom filter steps (every
data source has its own transformation routines for query and result), and
the processing must be done in parallel.

> At first glance, it doesn't look like option (b) above will allow you  
> to do the templating in parallel,

My intention was that every thread would call a model, receive a response
(data in the common format), then process a template (the same one for all
databases, representing part of an HTML page) and return the generated
output to the original controller thread, which would then be
responsible for sending the output to the client.
That should have allowed for parallel templating.
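
As a rough illustration of that thread-per-database idea, here is a minimal
sketch. It is written in Python rather than Perl, and everything in it
(query_db_a, query_db_b, render_fragment, the record layout) is a hypothetical
stand-in, not Catalyst API. Each worker queries one source and renders the
shared template; the main thread only collects the finished fragments. Note
that in CPython threads overlap I/O waits but not CPU-bound templating.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from html import escape

# Hypothetical stand-ins for the per-database model calls; each returns
# records in the common format (here: a list of dicts with a "title" key).
def query_db_a(query):
    return [{"title": f"{query} (from A)"}]

def query_db_b(query):
    return [{"title": f"{query} <from B>"}]

SOURCES = {"db_a": query_db_a, "db_b": query_db_b}

def render_fragment(source, records):
    """The shared template: turn common-format records into an HTML chunk."""
    items = "".join(f"<li>{escape(r['title'])}</li>" for r in records)
    return f'<ul class="{escape(source)}">{items}</ul>'

def fetch_and_render(source, query):
    """Runs inside a worker thread: query the model, then do the templating."""
    records = SOURCES[source](query)
    return render_fragment(source, records)

def render_all(query):
    """Main thread: start one task per database, yield fragments as they finish."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(fetch_and_render, name, query) for name in SOURCES]
        for future in as_completed(futures):
            yield future.result()

# Sorted only to make the result deterministic; arrival order may vary.
fragments = sorted(render_all("perl"))
```

The controller side would write each fragment to the client as it arrives
instead of collecting them into a list.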

> but I'm not sure that I agree that it  
> has to be done that way. Given that your templating step is, really,  
> the munging of the DB specific data into a common format,
> I'd consider it the job of the model.. So you generate and cache a  
> fragment of what you're going to output to the user, then you only  
> have to sew things together..

Yes and no. I should receive responses in a common format (presumably
XML) back from the model. The templating step would be the generation
of HTML chunks for the user/client based on the homogeneously formatted
data.
That way the controller needs to know as little as possible about the data
sources, and the model as little as possible about the intended output
(which allows using the same data to, e.g., generate SOAP responses
instead of HTML pages).
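
To make that separation concrete, here is a small sketch in Python (not
Perl). The XML layout, parse_hits, render_html and render_text are all
assumptions made up for the example. The model side only produces the common
format; each output format is a separate renderer over the same parsed data.
A SOAP renderer would simply be one more function of the same shape.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical common format returned by every model, whatever the data source.
COMMON_XML = """<results source="db_a">
  <hit><title>Streaming &amp; Catalyst</title><year>2008</year></hit>
  <hit><title>Job queues</title><year>2007</year></hit>
</results>"""

def parse_hits(xml_text):
    """Turn the common XML into plain dicts; knows nothing about output formats."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in hit} for hit in root.findall("hit")]

def render_html(hits):
    """One possible output: an HTML chunk for the browser."""
    rows = "".join(f"<li>{escape(h['title'])} ({escape(h['year'])})</li>" for h in hits)
    return f"<ol>{rows}</ol>"

def render_text(hits):
    """Another output built from exactly the same data."""
    return "\n".join(f"{h['title']} ({h['year']})" for h in hits)

hits = parse_hits(COMMON_XML)
html_chunk = render_html(hits)
text_chunk = render_text(hits)
```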

> Here's a step-through of how I imagine it working.
> . You start a pile of workers to do DB queries (X workers per DB).
> . Your catalyst application starts, and connects to the job queueing  
> system.
> . A user makes a query to catalyst which is dispatched to your  
> controller.
> . Your controller makes a decision about *which* databases to query
> . Your controller creates a job for each database, and stashes the job  
> IDs
> . You ask the job system to block for 'x' seconds, or until a  
> query result becomes available.
> . Query result becomes available (it is converted to the common  
> format, and also cached before being flagged available)
> . You pull it out of the cache and stream it to the user.
> . You ask the job system to block for 'x' seconds, or until a query  
> result becomes available.
> . Repeat until your give-up threshold is reached.
> . Controller either terminates outstanding jobs, or detaches them  
> (where they will complete and cache their results)
> . Controller writes a footer to the page, closes the connection.

Sounds sensible to me.
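
The controller loop in that step-through could be sketched roughly as
follows. This is Python with a thread pool standing in for a real job queue
such as Gearman; slow_query, stream_results, poll_seconds and give_up_after
are hypothetical names, and write stands in for whatever sends a chunk to the
client. Threads cannot be killed here, so outstanding jobs are only detached,
not terminated.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def slow_query(name, delay):
    """Hypothetical worker: pretend to query one database."""
    time.sleep(delay)
    return f"<div>{name} results</div>"

def stream_results(jobs, write, poll_seconds=0.05, give_up_after=0.3):
    """Submit one job per database and write each result as it arrives.

    Returns the names of the jobs still outstanding at the give-up threshold.
    """
    pool = ThreadPoolExecutor(max_workers=len(jobs))
    pending = {pool.submit(slow_query, name, delay): name
               for name, delay in jobs.items()}
    deadline = time.monotonic() + give_up_after
    write("<html><body>")                       # page header goes out first
    while pending and time.monotonic() < deadline:
        # Block for at most poll_seconds, or until some result is available.
        done, _ = wait(pending, timeout=poll_seconds, return_when=FIRST_COMPLETED)
        for future in done:
            del pending[future]
            write(future.result())
    write("</body></html>")                     # footer, then close
    pool.shutdown(wait=False)                   # detach stragglers, do not wait
    return sorted(pending.values())

chunks = []
abandoned = stream_results({"fast": 0.01, "medium": 0.05, "slow": 1.0},
                           chunks.append)
```

With these made-up delays the "fast" and "medium" results are written between
header and footer, while "slow" misses the threshold and is reported as
abandoned.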

> You *need* the job queueing system to be de-duping things for you (so  
> if 2 controllers ask for the same query, they don't execute the same  
> query in parallel twice, but just wait for the same job), and you  
> *need* a limited number of workers per DB, otherwise you'll be  
> flattened by the thundering herd problem if you have to do a complete  
> cache flush for any reason...

Since I expect it to be rather uncommon for the same query to come in
again, the cache is mostly intended to speed up cases like a user asking for
hits 1 - 10 for a query, then 11 - 20, and then returning to range 1 - 10.
But apart from that you're definitely right.
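
The paging case can be illustrated with a cache keyed on the query plus the
requested range. This is a Python sketch with made-up names (backend_search,
cached_search, CALLS); it uses an in-process cache for brevity, whereas a
multi-process deployment would presumably need a shared cache instead.

```python
from functools import lru_cache

CALLS = []  # records every real backend query, for illustration only

def backend_search(query, offset, limit):
    """Hypothetical expensive database query returning hit identifiers."""
    CALLS.append((query, offset, limit))
    return tuple(f"{query}-hit-{i}" for i in range(offset + 1, offset + limit + 1))

@lru_cache(maxsize=128)
def cached_search(query, offset, limit):
    """Cache one page of hits per (query, offset, limit) key."""
    return backend_search(query, offset, limit)

page1 = cached_search("catalyst", 0, 10)        # hits 1-10: goes to the backend
page2 = cached_search("catalyst", 10, 10)       # hits 11-20: goes to the backend
page1_again = cached_search("catalyst", 0, 10)  # back to 1-10: served from cache
```

Only two backend queries are made for the three page views; the return to
hits 1 - 10 is answered from the cache.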

> As noted by the reply previously, you can pass a filehandle in and  
> Catalyst will stream it for you. You can also call $c->write($chunk)  
> repeatedly to write out the response in chunks, if you want to.  
> Useful for doing an LWP request of some form, and passing in a  
> closure to write the file to catalyst's output as a callback (or  
> that's what I use it for, anyway)..
> 
> Have a look at:
> http://dev.catalyst.perl.org/repos/Catalyst/trunk/examples/Streaming/ 
> lib/Streaming.pm
> 
> for a demo solution.

Thanks for the hint.

> I'm not sure, however, how to get that to happen without outputting  
> response headers. Why do you actually _need_ to do this in this  
> application? If possible I'd go with headers, as then you've got an  
> extensible channel for metadata passing, which will probably come in  
> handy later. (And it's not like just throwing away the headers and  
> reading the rest isn't trivial to do).

I was only referring to omitting the Content-Length header, not all headers
(because I don't know how much content will be delivered if I send part of the
response page as soon as I receive the request and before the database
results are available).
But I've just had another look at RFC 2616, and it seems that I should
familiarise myself with "Chunked Transfer Coding".
So we can probably forget what I wrote in the first mail.
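
For reference, the wire format of chunked transfer coding is simple: the
response carries "Transfer-Encoding: chunked" instead of Content-Length, and
each chunk is its size in hexadecimal, CRLF, the data, CRLF, with a
zero-length chunk marking the end. The Python sketch below only illustrates
that framing; encode_chunk and chunked_body are names invented for the
example, and in practice the web server or framework normally does this
framing itself.

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame one chunk: hex length, CRLF, payload, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)

LAST_CHUNK = b"0\r\n\r\n"  # zero-length chunk, no trailers, ends the body

def chunked_body(parts):
    """Yield the wire bytes for a sequence of payload parts."""
    for part in parts:
        if part:  # a zero-length chunk would terminate the body early
            yield encode_chunk(part)
    yield LAST_CHUNK

wire = b"".join(chunked_body(
    [b"<html><body>", b"", b"<p>hits 1-10</p>", b"</body></html>"]
))
```

Because every chunk announces its own length, the total size of the page
never has to be known before the first bytes are sent.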

Thanks again for your response! I've done quite a bit of work in Perl but have
not yet designed a whole application of this complexity. Therefore I very much
appreciate the background info you provided.

Heiko