[Catalyst] Hypothetical Site and Scalability Planning

Wade.Stuart at fallon.com
Fri Oct 26 21:31:13 GMT 2007



"J. Shirley" <jshirley at gmail.com> wrote on 10/26/2007 02:30:43 PM:

> On 10/26/07, Mesdaq, Ali <amesdaq at websense.com> wrote:
> Hey All,
> Just wanted to start a thread about scalability planning and design.
> I was thinking we could take the approach of asking what people's
> opinions, ideas, and best practices are for large scale sites, and
> use a hypothetical site or an existing site as the model to plan
> for. Not everything discussed needs to be Catalyst-only; it could be
> general web server configs or something similar.
> For example, how would you guys approach a project where you needed
> to create a site like myspace.com or similar, with 0 current users
> but which could surpass 1 million users in 1 month and 100 million
> in 1 year? I am interested to see the opinions and designs people
> would have to deal with that type of scalability. Even simple issues
> become very complex at those numbers, like where and how to store
> photos. Should they be stored on the filesystem, in the db, or on an
> external service like Akamai? What web server should be used?
> Apache? Should it be the threaded version? How does that affect
> Catalyst and its modules? Are they all thread safe, or is threaded
> Apache not even the way to go?
>
> Here are my opinions on the matter:
> 1) Start out with memcached in place.  It scales well; use it.  Use
> PageCache where you can.

Seconded
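
For reference, the Catalyst wiring for that is roughly the following.
This is a sketch from memory, assuming Catalyst::Plugin::Cache::Memcached
and Catalyst::Plugin::PageCache; the config keys differ between plugin
versions, so check the CPAN docs before relying on it:

    # MyApp.pm (sketch; plugin names and config keys are assumptions)
    package MyApp;
    use strict;
    use warnings;
    use Catalyst qw/ Cache::Memcached PageCache /;

    __PACKAGE__->config(
        name  => 'MyApp',
        cache => {
            # pool of memcached boxes; add more as you grow
            servers => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
        },
        page_cache => {
            expires => 300,    # cache whole rendered pages for 5 minutes
        },
    );

    __PACKAGE__->setup;

    1;

Then, in any action whose output is safe to share between users:

    sub front_page : Path('/') {
        my ( $self, $c ) = @_;
        $c->cache_page( 300 );   # PageCache serves the stored copy next time
        # ... build the page as usual ...
    }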

> 2) Store images in something that is for storing data, not files.
> Storing images as files means you are stuck with some file system
> format that binds you unnecessarily.  Things like S3, Akamai, or
> your own homegrown MogileFS cluster give you an API into the data.
> Granted, you could do the same for NFS or whatever and just write a
> good compatibility API, but you would largely be duplicating the
> work of the previous tech.  If you use S3, set up your image servers
> to cache for a loooooong time (on disk).  Pull from S3, and store it
> for as long as you reasonably can.  This is an area a lot of people
> get wrong, and then they get stuck with costly migrations.

NFS gets a bad rap.  As long as you do sane planning and lay it out
properly, NFS works very, very well for serving static files to the
webservers.  Breaking out to S3 seems silly (Amazon is out to make money
with S3, and if you do it yourself you should be able to do it for less
cost).  KISS works wonders as long as you think about usability.  Get a
sysadmin to think out the NFS side realistically (masters with multiple
read-only replicas, etc.).
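
Whichever backend wins, the "pull once, cache on disk for a long time"
pattern above is simple to get right.  A bare-bones sketch using
Net::Amazon::S3; the bucket name, key layout, and cache path are made up
for illustration:

    # fetch_image.pl (sketch; bucket and paths are hypothetical)
    use strict;
    use warnings;
    use Net::Amazon::S3;
    use File::Basename qw(dirname);
    use File::Path qw(mkpath);

    my $cache_root = '/var/cache/images';

    sub fetch_image {
        my ($key) = @_;
        my $local = "$cache_root/$key";
        return $local if -e $local;    # already on local disk, serve it

        my $s3 = Net::Amazon::S3->new({
            aws_access_key_id     => $ENV{AWS_ACCESS_KEY_ID},
            aws_secret_access_key => $ENV{AWS_SECRET_ACCESS_KEY},
            retry                 => 1,
        });
        my $bucket = $s3->bucket('my-image-bucket');

        mkpath( dirname($local) );
        $bucket->get_key_filename( $key, 'GET', $local )
            or die 'S3 fetch failed: ' . $s3->errstr;
        return $local;
    }

The webservers then serve straight off local disk and only touch S3 (or
NFS, if you go that route) on a cold cache.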




> 3) Use database replication strategies where you can.  In the F/OSS
> world, MySQL is outshining PostgreSQL with this.  InnoDB removes a
> lot of the complaints that folks have about MySQL, but there is
> always evangelism against MySQL.  If it works for you, just take it
> in stride -- a LOT of high traffic sites use MySQL; you can usually
> get some insight from them.  MySQL allows InnoDB on the master and
> MyISAM on the slaves -- that gets you faster read times and tends
> not to block on inserts as badly -- and then as you grow it is
> easier to grow into a full-blown MySQL cluster... but at that point,
> you have enough money to thoroughly explore every option available.

MySQL will be getting a huge dump of code from Google in the next 6
months, most of it relating to replication.  Agreed; do not build your
own.  No matter what database you choose, have a well-rehearsed disaster
recovery plan in place.
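
The master/slave split above usually shows up in app code as nothing
more than two handles.  A bare-bones sketch with plain DBI; the DSNs and
credentials are placeholders, and real code would also want failover and
replication-lag checks:

    # db_split.pl (sketch; hosts and credentials are placeholders)
    use strict;
    use warnings;
    use DBI;

    my $master = DBI->connect(
        'dbi:mysql:database=myapp;host=db-master',
        'app', 'secret', { RaiseError => 1 },
    );
    my @slaves = map {
        DBI->connect( "dbi:mysql:database=myapp;host=$_",
            'app', 'secret', { RaiseError => 1 } )
    } qw( db-slave1 db-slave2 );

    # Writes always go to the InnoDB master ...
    $master->do( 'INSERT INTO photos (user_id, path) VALUES (?, ?)',
        undef, 42, 'photos/123.jpg' );

    # ... while reads are spread across the MyISAM slaves.
    my $dbh  = $slaves[ int rand @slaves ];
    my $rows = $dbh->selectall_arrayref(
        'SELECT path FROM photos WHERE user_id = ?', undef, 42 );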


> 4) You'll have to tune Apache or whatever web server you run to
> your specific app.  Every app has different usage patterns, and
> you'll have to customize your web server accordingly.  This is where
> starting from scratch pays off -- you can experiment and see what
> improves performance.

Apache and lighttpd are the two major contenders here.  Really, though,
spend a lot of time on the reverse proxy servers versus the web servers.
A smart reverse proxy that can transparently divvy requests up to
dedicated (compartmentalized) web app servers/image servers/file servers
will save you a _ton_ of time and headaches as the site grows.
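
One Catalyst-specific detail worth knowing here: when the app sits
behind a reverse proxy, tell it so, or $c->req->address and generated
redirect URIs will reflect the proxy and backend host instead of the
outside world.  Assuming the stock using_frontend_proxy option:

    # MyApp.pm -- with this on, Catalyst trusts X-Forwarded-For /
    # X-Forwarded-Host from the frontend proxy when building the
    # request object and URIs.
    __PACKAGE__->config(
        name                 => 'MyApp',
        using_frontend_proxy => 1,
    );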


>
> Another piece of advice: Don't look at requests per second as the
> measure of webserver scalability -- sure, you want to have efficient
> code, but that is just a measurement of code efficiency, not
> scalability.  Look at it this way: how many webservers do I need to
> add to my cluster to double traffic?  If the answer is more than
> two, start looking at bottlenecks.  If it is two, and you are still
> near peak usage, look at bottlenecks.  If you add two and everything
> is running smoothly, then you are probably in good shape.

Exactly.  Also, the frontend proxy servers will probably start off at
1:1 or 1:2 ratios to the webservers, and as you add more webservers for
load you will drop down to 1:4 -> 1:8 (or less) depending on what your
site is actually serving (large, long-running downloads and file serving
usually mean the ratio will stay high).

>
> Now start worrying about your databases :)



>
> Hope this helps; it is an area I have some experience in and find fun.
>
> -J
>
> --
> J. Shirley :: jshirley at gmail.com :: Killing two stones with one bird...
> http://www.toeat.com



