[Catalyst] Hypothetical Site and Scalability Planning

Sun Oct 28 13:40:50 GMT 2007

Memcached is not distributed.  Thus, you can't support distributed 
session state with it.  Having only one server is obviously bad from
a scalability and reliability point of view. 

Graphics in the database means that you are not taking advantage of
the web server's built-in sendfile(2) handling of static files. 
This is bad for performance and thus, scalability.

The key to scalability can be boiled down to distribution.  That is,
can you distribute the load (CPU, memory, filesystem, database, etc.)
to multiple servers in a near-linear way?  If you do it right, you 
also gain much increased reliability because you have no single point
of failure.

Distributed filesystems such as Lustre, Isilon, NetApp, OpenFiler 
or even something hacked together with DRBD or rsync replication 
will help.  Database replication is available for most major RDBMSs,
including MySQL, PostgreSQL, Oracle, etc.  Consider using hardware 
load balancers to front-end your web applications.  Think about 
splitting your mod_perl processes from static file serving httpd 
processes.  I could go on, but I won't.

Scalability planning isn't something you can do on the cheap (both
mentally and financially).  Think hard, try to consider all sides
and plan carefully.

Cheers,
Rob

-----Original Message-----
From: Mesdaq, Ali [mailto:amesdaq at websense.com] 
Sent: Friday, October 26, 2007 5:59 PM
To: The elegant MVC web framework
Subject: RE: [Catalyst] Hypothetical Site and Scalability Planning

J,

Amazing feedback, this is great! 

I think memcached is great. I haven't had time to play with it yet but I
have pretty much read everything and been prepped to play with it once I
have a chance.

I personally think that storing images in the DB is the best place to
start because if other better solutions are available later you can very
easily migrate. But if you start out with the filesystem, migration is a
little bit more kludgy in my opinion. I mean you have to go traverse
directories and copy/move/delete or whatever you have to do for the
migration.

We have been using MySQL on some pretty big internal projects here and
it's been working satisfactorily. However, there are issues with it that
make me not so confident in these big claims of large sites using it.
Mainly it's the scaling-out paradigm that is not very clear with MySQL.
We tried using replication with master/slaves and the replication speed
was wayyyyyy too slow. Then the whole clustering approach with MySQL
seems to be very confusing and not very well documented as far as I have
poked around. The only really solid scaling approaches I have seen with
MySQL are either using VMware to cluster hardware at the hardware/OS/VM
layer to make one big virtual machine or using third-party
hardware/software bundles with MySQL like ones from NetApp or similar. I
wish clustering with MySQL was as simple as adding a node to the cluster
and you gain 0.7 performance per machine.

Another very intriguing thing with super large sites is the actual
schema design. You have to be very smart about design, data segregation,
indexes, etc. I mean I don't know for sure but I am pretty sure sites
like MySpace don't just have one huge users table with user_id, email,
sha1_password. I would imagine they have segregated users into separate
schemas, which would scale far better than MySQL replication or
clustering would. Something like every 10,000 users being allocated on a
new MySQL server.
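
As a rough sketch of that last idea (purely an assumption for
illustration, not how MySpace or anyone else actually does it), range-based
allocation could look like the Python below. The names USERS_PER_SHARD,
shard_index and shard_host, and the host name pattern, are all made up.

```python
# Hypothetical range-based sharding: each block of 10,000 user ids
# lives on its own MySQL server. Host names are invented for illustration.
USERS_PER_SHARD = 10_000


def shard_index(user_id: int, users_per_shard: int = USERS_PER_SHARD) -> int:
    """Return the zero-based shard number for a positive user id."""
    if user_id < 1:
        raise ValueError("user ids are assumed to start at 1")
    return (user_id - 1) // users_per_shard


def shard_host(user_id: int) -> str:
    """Map a user id to the (hypothetical) host holding that user's rows."""
    return f"userdb{shard_index(user_id):04d}.example.internal"
```

With this scheme users 1 through 10,000 land on shard 0 and user 10,001
is the first one on shard 1, so the application only needs the user id
to know which server to connect to.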

Thanks,
------------------------------------------
Ali Mesdaq
Security Researcher II
Websense Security Labs
http://www.WebsenseSecurityLabs.com
------------------------------------------

-----Original Message-----
From: J. Shirley [mailto:jshirley at gmail.com] 
Sent: Friday, October 26, 2007 12:31 PM
To: The elegant MVC web framework
Subject: Re: [Catalyst] Hypothetical Site and Scalability Planning

On 10/26/07, Mesdaq, Ali <amesdaq at websense.com> wrote:

	Hey All, 

	Just wanted to start a thread about scalability planning and
design. I was thinking we could take the approach of what people's
opinions, ideas, and best practices are for large scale sites and use a
hypothetical site or an existing site as the model to plan for. Not
everything discussed needs to be Catalyst only; it could be general web
server configs or something similar. 

	For example, how would you guys approach a project where you
needed to create a site like myspace.com or
similar with 0 current users but could surpass 1 million users in 1
month then 100 million in 1 year? I am interested to see the opinions
and designs people would have to deal with that type of scalability. I
mean even simple issues become very complex with those numbers. Like
where and how to store photos. Should they be stored on filesystem, db,
or external sites like Akamai? What web server should be used? Apache?
Should it be the threaded version? How does that affect Catalyst and its
modules: are they all thread safe, or is threaded Apache not even the way
to go? 

Here's my opinions on the matter:
1) Start out with memcached in place.  It scales well, so use it.  Use
PageCache where you can.
2) Store images in something that is for storing data, not files.
Storing images as files means you are stuck with some file system format
that binds you unnecessarily.  Things like S3, Akamai or your own
homegrown MogileFS cluster give you an API into the data.  Granted, you
could do the same for NFS or whatever, and just write a good
compatibility API, but you are largely duplicating the work of the previous
tech.  If you use S3, set up your image servers to cache for a loooooong
time (on disk).  Pull from S3, and store it for as long as you
reasonably can.  This is an area a lot of people get wrong and then get
stuck with costly migrations. 
3) Use database replication strategies where you can.  In the F/OSS
world, MySQL is outshining PostgreSQL with this.  InnoDB removes a lot
of the complaints that folks have about MySQL but there is always
evangelism against MySQL.  If it works for you, just take it in stride -
a LOT of high traffic sites use MySQL; you can usually get some insight
from them.  MySQL allows InnoDB on the master, and MyISAM on the slaves
-- gets you faster read times, and tends to not block on inserts that
badly -- and then as you grow it is easier to grow into a full blown MySQL
cluster... but at that point, you have enough money to thoroughly
explore every option available. 
4) You'll have to tune Apache or whatever web server you have to your
specific app.  Every app has different usage patterns, and you'll have
to customize your web server accordingly.  This is where starting from
scratch pays off -- you can experiment and see what improves
performance. 
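
To make point 2 a bit more concrete, here is a minimal Python sketch of
an image server that pulls from the backing store once and then serves
from local disk.  It is only an illustration under assumptions: the class
name DiskImageCache and the fetch callable are hypothetical, and fetch
stands in for whatever S3 or MogileFS client call you actually use.

```python
import hashlib
import os
import tempfile
from typing import Callable


class DiskImageCache:
    """Read-through cache: serve from local disk, fetch from origin on a miss."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical stand-in for an S3/MogileFS GET
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        # Hash the key so arbitrary object names map to safe file names.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, key: str) -> bytes:
        path = self._path(key)
        try:
            with open(path, "rb") as fh:
                return fh.read()  # cache hit: no trip to the origin
        except FileNotFoundError:
            pass
        data = self.fetch(key)  # cache miss: pull from the origin
        # Write to a temp file and rename so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)
        return data
```

The sketch never expires anything, which matches the "cache for a
loooooong time" advice; a real deployment would still need some way to
evict or invalidate files when disk fills up or an image changes.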

Another piece of advice: Don't look at requests per second as the measure
of webserver scalability -- sure, you want to have efficient code, but
that is just efficient code measurement; not scalability.  Look at it
this way: How many webservers do I need to add to my cluster to double
traffic?  If the answer is more than two, start looking at
bottlenecks.  If it is two, and you are still near peak usage, look at
bottlenecks.  If you add two, and everything is running smooth then you
are probably in good shape. 
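
One way to put a number on that check, as a sketch rather than a rule:
compare how much capacity grew with how much the cluster grew.  The
function name scaling_efficiency and its parameters are hypothetical,
and "capacity" is assumed to be whatever sustained traffic figure you
measure before and after adding servers.

```python
def scaling_efficiency(servers_before: int, capacity_before: float,
                       servers_after: int, capacity_after: float) -> float:
    """Capacity growth divided by server growth; 1.0 means linear scaling."""
    if min(servers_before, servers_after) < 1 or capacity_before <= 0:
        raise ValueError("need at least one server and a positive baseline")
    capacity_growth = capacity_after / capacity_before
    server_growth = servers_after / servers_before
    return capacity_growth / server_growth
```

Doubling the servers and doubling the traffic you can handle gives 1.0.
Anything noticeably below that suggests a shared bottleneck (database,
filesystem, network) is eating the extra machines.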

Now start worrying about your databases :)

Hope this helps, it is an area I have some experience in and find fun.

-J

--
J. Shirley :: jshirley at gmail.com :: Killing two stones with one bird...
http://www.toeat.com 

_______________________________________________
List: Catalyst at lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.rawmode.org/
Dev site: http://dev.catalyst.perl.org/