[Catalyst] Hypothetical Site and Scalability Planning

J. Shirley jshirley at gmail.com
Fri Oct 26 20:30:43 GMT 2007

On 10/26/07, Mesdaq, Ali <amesdaq at websense.com> wrote:
>  Hey All,
> Just wanted to start a thread about scalability planning and design. I was
> thinking we could take the approach of what peoples opinions, ideas, and
> best practices are for large scale sites and use a hypothetical site or a
> existing site as the model to plan for. Not everything discussed needs to be
> catalyst only it could be general web server configs or something similar.
> For example how would you guys approach a project where you needed to
> create a site like a myspace.com or similar with 0 current users but could
> surpass 1 million users in 1 month then 100 million in 1 year. I am
> interested to see the opinions and designs people would have to deal with
> that type of scalability. I mean even simple issues become very complex with
> those numbers. Like where and how to store photos. Should they be stored in
> filesystem, db, or external sites like akamai. What web server should be
> used? Apache? Should it be threaded version? How does that affect catalyst
> and its modules are they all thread safe or is threaded apache not even the
> way to go?

Here are my opinions on the matter:
1) Start out with memcached in place, and actually use it; it scales well.
Use PageCache where you can.
2) Store images in something that is built for storing data, not files.
Storing images as files means you are stuck with some file system format that
binds you unnecessarily.  Things like S3, Akamai or your own homegrown MogileFS
cluster give you an API into the data.  Granted, you could do the same for
NFS or whatever and just write a good compatibility API, but then you are
largely duplicating the work of the previous tech.  If you use S3, set up your
image servers to cache for a loooooong time (on disk).  Pull from S3, and store
it for as long as you reasonably can.  This is an area a lot of people get
wrong, and then they get stuck with costly migrations.
3) Use database replication strategies where you can.  In the F/OSS world,
MySQL is outshining PostgreSQL at this.  InnoDB removes a lot of the
complaints that folks have about MySQL, but there is always evangelism
against MySQL.  If it works for you, just take it in stride - a LOT of high
traffic sites use MySQL, and you can usually get some insight from them.  MySQL
allows InnoDB on the master and MyISAM on the slaves, which gets you faster
read times and tends not to block on inserts that badly.  As you
grow it is easier to grow into a full-blown MySQL cluster... but at that
point, you have enough money to thoroughly explore every option available.
4) You'll have to tune Apache or whatever web server you have to your
specific app.  Every app has different usage patterns, and you'll have to
customize your web server accordingly.  This is where starting from scratch
pays off -- you can experiment and see what improves performance.
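
To make the "pull from S3 and keep it on disk" idea in point 2 concrete,
here is a rough sketch of a read-through disk cache.  It is in Python only
for brevity; the same shape works in Perl.  The class name, the
fetch_from_origin callback and the origin_fetches counter are all made up
for illustration, and nothing here talks to a real S3 or MogileFS API.
It also never expires anything, which only works if image keys are
immutable.

```python
import hashlib
import os
import tempfile
from typing import Callable


class DiskImageCache:
    """Read-through disk cache in front of a slow or metered blob store."""

    def __init__(self, cache_dir: str, fetch_from_origin: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        # Hypothetical callback: stands in for e.g. an HTTP GET against S3.
        self.fetch_from_origin = fetch_from_origin
        self.origin_fetches = 0  # counts cache misses, handy for monitoring
        os.makedirs(cache_dir, exist_ok=True)

    def _path_for(self, key: str) -> str:
        # Hash the key so arbitrary object names map to safe file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, key: str) -> bytes:
        path = self._path_for(key)
        try:
            with open(path, "rb") as fh:
                return fh.read()  # cache hit: served from local disk
        except FileNotFoundError:
            pass
        data = self.fetch_from_origin(key)  # cache miss: hit the origin
        self.origin_fetches += 1
        # Write to a temp file and rename it into place so that concurrent
        # readers never see a partially written image.
        fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp_path, path)
        return data
```

The application only ever calls get(key), so swapping the backing store
later means changing the callback rather than migrating files.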

Another piece of advice: Don't look at requests per second as the measure of
webserver scalability -- sure, you want to have efficient code, but that just
measures code efficiency, not scalability.  Look at it this way: How
many webservers do I need to add to my cluster to double traffic?  If the
answer is more than two, start looking at bottlenecks.  If it is two, and
you are still near peak usage, look at bottlenecks.  If you add two, and
everything is running smoothly, then you are probably in good shape.
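
One possible way to put a number on that check (my own framing, not a
standard metric) is to compare how much the peak traffic you can handle
grew against how much the cluster grew.  The function and parameter names
below are invented for the example, and the sketch is in Python only for
brevity.

```python
def scaling_efficiency(servers_before: int, peak_before: float,
                       servers_after: int, peak_after: float) -> float:
    """Ratio of traffic growth to cluster growth; 1.0 means linear scaling."""
    if servers_before <= 0 or servers_after <= 0 or peak_before <= 0:
        raise ValueError("server counts and baseline traffic must be positive")
    traffic_growth = peak_after / peak_before
    cluster_growth = servers_after / servers_before
    return traffic_growth / cluster_growth


# Doubling the cluster doubled the traffic it can take: linear, good shape.
print(scaling_efficiency(2, 1000, 4, 2000))
# Doubling the cluster bought only 50% more traffic: go find the bottleneck.
print(scaling_efficiency(2, 1000, 4, 1500))
```

A value well below 1.0 suggests something shared (database, cache, NFS,
load balancer) is the limit rather than the web servers themselves.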

Now start worrying about your databases :)

Hope this helps; it is an area I have some experience in and find fun.


-- 
J. Shirley :: jshirley at gmail.com :: Killing two stones with one bird...