[Catalyst] "Streamed" content delivery?!

Tomas Doran bobtfish at bobtfish.net
Tue Aug 5 09:49:36 BST 2008


On 4 Aug 2008, at 21:04, Heiko Jansen wrote:

> Dear Catalystians.
>
> I have a question on how content generation and delivery works in  
> Catalyst.
>
> I'll probably have to implement a metasearch / federated search  
> component for an app built with Catalyst.
> The user submits a request containing a list of databases and a  
> query. For each database the query has to be transformed to the  
> query syntax of the respective db and is then submitted via one of  
> many connection methods (SQL, HTTP, ...). The result received is  
> then transformed to a common format, which gets cached and is used  
> to generate output to the user (TT2 templates).
>
> Accessing the databases sequentially is not an option because of  
> greatly differing and probably quite long response times.
>
> As far as I understand it, the usual way would be to implement a  
> custom Model for the search.
> I can think of two implementation strategies:
> a) A method in the Model is called with all database names; one
> thread per database is started, performing query transformation,
> connection and result parsing; as soon as the threads are started,
> the main method returns, handing back something like a thread queue
> or one or more sockets to the Controller; the Controller loops over
> the pipelines and, whenever it receives a complete answer, processes
> a template (fragment) with that data structure.
> b) A thread is started in the Controller for every database, in
> which a method in a Model is called that handles one database; the
> thread works like the one above but also processes the template and
> sends the final result to the Controller, which sends it to the client.
>
> I'd probably go for method a) because I could implement the Model
> method as a rather simple wrapper which in fact connects to a
> different server that does the heavy lifting. On the other hand,
> method a) would mean that template processing would happen
> sequentially, while in method b) it could also be done in parallel.

I'd recommend using some sort of distributed job queueing system for
this; [Gearman][1] could be a good fit.

At first glance it doesn't look like option (a) above will let you do
the templating in parallel, but I'm not sure I agree that it has to be
done that way. Given that your templating step is really the munging
of the DB-specific data into a common format, I'd consider it the job
of the model. So you generate and cache a fragment of what you're
going to output to the user, and then you only have to sew the
fragments together.
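
That munge-and-cache step maps naturally onto a Gearman worker. Here's
a rough sketch of one, using Gearman::Worker and Cache::Memcached; the
'search_db' function name, the cache key scheme, and the
run_backend_query() / munge_to_common_format() / render_fragment()
helpers are all hypothetical placeholders for your own model code:

    use strict;
    use warnings;
    use Gearman::Worker;
    use Cache::Memcached;
    use Storable qw(thaw);

    my $cache = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    my $worker = Gearman::Worker->new;
    $worker->job_servers('127.0.0.1:7003');
    $worker->register_function(
        search_db => sub {
            my $job  = shift;
            my $args = thaw( $job->arg );  # e.g. { db => ..., query => ... }

            # These three calls stand in for your own model code:
            my $raw      = run_backend_query( $args->{db}, $args->{query} );
            my $common   = munge_to_common_format( $args->{db}, $raw );
            my $fragment = render_fragment($common);  # the "templating" step

            # Cache the finished fragment before returning it, so a later
            # identical query can be answered without touching the backend.
            $cache->set( cache_key($args), $fragment, 300 );
            return $fragment;
        }
    );

    $worker->work while 1;    # serve jobs forever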

Here's a step-through of how I imagine it working:

 1. You start a pile of workers to do DB queries (X workers per DB).
 2. Your Catalyst application starts and connects to the job queueing
    system.
 3. A user makes a query to Catalyst, which is dispatched to your
    controller.
 4. Your controller decides *which* databases to query.
 5. Your controller creates a job for each database and stashes the
    job IDs.
 6. You ask the job system to block for 'x' seconds, or until a query
    result becomes available.
 7. A query result becomes available (it is converted to the common
    format, and also cached, before being flagged available).
 8. You pull it out of the cache and stream it to the user.
 9. You go back to asking the job system to block, repeating until
    your give-up threshold is reached.
 10. The controller either terminates the outstanding jobs or detaches
     them (in which case they complete and cache their results anyway).
 11. The controller writes a footer to the page and closes the
     connection.
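
The controller side could then look something like this sketch, built
on the Gearman::Client API from the CPAN link below. The 'search_db'
function name, the templates, and the exact timeout handling are
illustrative rather than gospel (in particular, check whether your
version of the client supports a timeout on $taskset->wait):

    package MyApp::Controller::Search;
    use strict;
    use warnings;
    use base 'Catalyst::Controller';
    use Gearman::Client;
    use Storable qw(freeze);

    sub search : Local {
        my ( $self, $c ) = @_;
        my @dbs   = $c->req->param('db');    # which databases to query
        my $query = $c->req->param('q');

        my $client = Gearman::Client->new;
        $client->job_servers('127.0.0.1:7003');

        $c->res->content_type('text/html');
        $c->write( $c->view('TT')->render( $c, 'search/header.tt' ) );

        my $taskset = $client->new_task_set;
        for my $db (@dbs) {
            $taskset->add_task(
                'search_db' => freeze( { db => $db, query => $query } ),
                {
                    # on_complete receives a reference to the worker's
                    # return value: the already-rendered fragment.
                    on_complete => sub { $c->write( ${ $_[0] } ) },
                    on_fail     => sub { $c->write("<!-- $db failed -->") },
                }
            );
        }

        # Block until every job has finished or the give-up threshold
        # passes; completed fragments stream out as they arrive.
        $taskset->wait( timeout => 10 );

        $c->write( $c->view('TT')->render( $c, 'search/footer.tt' ) );
    }

    1;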

You *need* the job queueing system to be de-duping things for you (so
that if two controllers ask for the same query, they don't execute it
in parallel twice, but just wait on the same job), and you *need* a
limited number of workers per DB, otherwise you'll be flattened by the
thundering herd problem if you ever have to do a complete cache flush.
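
Gearman can do that de-duping for you through a task's unique key; in
the add_task() call above that would look something like this (the key
scheme is made up). Limiting workers per DB is then just a matter of
how many worker processes you start:

    $taskset->add_task(
        'search_db' => $frozen_args,
        {
            # Jobs submitted with the same 'uniq' key are coalesced by
            # the server: the query runs once, everyone gets the result.
            uniq        => join( ':', $db, $query ),
            on_complete => sub { $c->write( ${ $_[0] } ) },
        }
    );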

This also gives you much better scaling properties and more
flexibility, as you'll be able to run the worker processes on
different machines from the web servers.

>
> But no matter which one I choose, the main question is this:
> Can I send data to the client incrementally (or, you could say, as a
> stream) with Catalyst?
> I want to send (using HTTP/1.0 without a Content-Length header) the
> start of the HTML page when I receive the request, then send out a
> block for every database's result, and finally (after the last db or
> a search timeout) a page footer.
> I'm a novice regarding Catalyst and have so far only seen code
> examples where all output generation is done as a whole at the end
> of the request's lifecycle, so I'd be very happy if you could tell
> me where to look for examples or documentation concerning my needs.
>

As noted in the previous reply, you can pass a filehandle in and
Catalyst will stream it for you. You can also call $c->write($chunk)
repeatedly to write out the response in chunks, if you want to. That's
useful for doing an LWP request of some form and passing in a callback
that writes each chunk to Catalyst's output as it arrives (or that's
what I use it for, anyway).
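
For instance (both action names and the proxied URL are made up):

    use LWP::UserAgent;

    # 1. Hand Catalyst a filehandle and let it stream the body for you.
    sub download : Local {
        my ( $self, $c ) = @_;
        open my $fh, '<', '/path/to/big/file' or die $!;
        $c->res->content_type('application/octet-stream');
        $c->res->body($fh);    # Catalyst streams the filehandle in chunks
    }

    # 2. Call $c->write($chunk) yourself, e.g. from an LWP content
    #    callback, so data is pushed to the client as it arrives.
    sub proxy : Local {
        my ( $self, $c ) = @_;
        my $ua = LWP::UserAgent->new;
        $c->res->content_type('text/html');
        $ua->get(
            'http://example.com/slow-resource',
            ':content_cb' => sub {
                my ($chunk) = @_;
                $c->write($chunk);
            },
        );
    }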

Have a look at:

http://dev.catalyst.perl.org/repos/Catalyst/trunk/examples/Streaming/lib/Streaming.pm

for a demo solution.

I'm not sure, however, how to make that happen without outputting
response headers. Why do you actually _need_ to avoid them in this
application? If possible I'd go with headers, as they give you an
extensible channel for passing metadata, which will probably come in
handy later. (And throwing away the headers and reading the rest is
trivial to do anyway.)
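
For example (the header name here is made up):

    # Headers as a metadata channel; they go out with the first write.
    $c->res->header( 'X-Search-Backends' => scalar @dbs );
    $c->res->content_type('text/html');
    $c->write($page_header);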

Cheers
Tom

[1]: http://search.cpan.org/~bradfitz/Gearman-1.09/ 


