[Dbix-class] Help with Titanic Number of Inserts

Hermida, Leandro Leandro.Hermida at fmi.ch
Wed Jan 24 11:44:05 GMT 2007


> From: Ryan Richt [mailto:rjr2 at cec.wustl.edu] 
> Sent: Wednesday, January 24, 2007 02:01
> To: dbix-class at lists.rawmode.org
> Subject: [Dbix-class] Help with Titanic Number of Inserts
> 
> Hey Guys and Gals...
> 
> 	We are just starting a DBIx::Class project here at the 
> Washington University Genome Sequencing Center in St. Louis, 
> and __of__course__ we chose DBIx::Class over all that other 
> stuff ;-).
> 
> Anyway, we need to make > 112,000,000 inserts per day.  I 
> _know_ that DBIx::Class is not concerned with performance, 
> but with the
> (wonderful) design/interface/features.
> 
> However, other than wrapping it in a transaction, do you have 
> any other tips on making this manageable?  Currently this 
> would take 130 DAYS to insert w/o transactions, and something 
> like 30 DAYS with transactions.
> 
> (As far as setup goes, this is currently in SQLite and soon 
> to be Oracle once SQLite explodes, testing on a newer dual 
> processor  3gig ram machine, and could deploy on anything 
> from 1000+ node cluster or 50 node cluster of Opterons or 5 x 
> 100gig ram itanium quad processor machines, if that could 
> help us in any way!).
> 
> Thank you very much!
> 
> Ryan

Hi Ryan,

I also work in the comp bio field and have worked on projects involving
large ETL processes into data warehouses and subsequent analysis.
Having skimmed the posts in this thread so far, my advice is along the
same lines as Jess Robinson's earlier reply.  From my own experience, I
believe you are taking the wrong approach if you use DBIx::Class, or
even plain DBI, for such a huge number of inserts.  Even DBI by itself
is not really meant for such tasks, because it has to go through the
DBMS's regular SQL processing path, which carries too much
per-statement overhead.

Most DBMSs ship bulk loading programs with a wealth of features that
bypass those regular channels, and they can be automated nicely.
Oracle has SQL*Loader and MySQL has mysqlimport.  I would add that if
this is going to serve as a data warehouse with logical tables of
millions or billions of rows, you should go with Oracle 9i or 10g
Enterprise Edition rather than MySQL.  In multiple real-world tests we
have found that MySQL with MyISAM performs poorly, or at least
inconsistently, when faced with a stream of SELECT statements against
very large tables.  Someone mentioned in an earlier post that MyISAM
tables are great for SELECTs, and compared to InnoDB or Oracle tables
this is generally true because of their straightforward, low-overhead
structure.  It does not hold, however, for tables and indexes with many
millions or billions of rows: there you need a DBMS that can physically
partition the logical table and then run parallel operations over those
partitions.  MySQL does not yet have mature facilities for this,
whereas Oracle's are excellent, and Oracle can also parallelize DDL and
DML (inserts, updates, deletes).

I hope you have a license for Oracle 9i or 10g Enterprise Edition,
which has all the data warehousing capabilities.  I would read the
SQL*Loader docs in full, as there are a lot of features and
intricacies.  The overall workflow looks roughly like this:

1) Design the data model for the data warehouse.

2) Write the DDL script that creates the tables, constraints, indexes,
physical partitions and parallelization settings.

3) Make sure constraints and indexes are disabled before bulk loading,
but read through the docs as there are *a lot* of different methods.

4) Write a SQL*Loader control file for each table you want to bulk
insert into.

5) Write a Perl script that takes your raw data, transforms it as
needed and generates properly formatted SQL*Loader data files.

6) Run the sqlldr command line program with the appropriate switches
and options on the data files to insert the data.

All of this can be automated: each time you need to bulk load data, a
single Perl script takes the raw data, transforms it, generates the
SQL*Loader data files, disables indexes and constraints, runs sqlldr,
and re-enables the indexes and constraints after a successful load.  A
rough sketch of such a script follows.
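As an illustration of steps 4 through 6, here is a minimal sketch under
several assumptions of my own: the sequence_read table comes from the
made-up DDL above, sequence_read_pk_idx is an index I assume was
created separately, the raw input is tab-delimited, and the control
file holds only the bare minimum of options.  The real control file,
direct path settings and constraint handling would come from the
SQL*Loader and Oracle docs:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my ( $raw_file, $data_file, $ctl_file ) = @ARGV;

# 1. Transform the raw input into a SQL*Loader data file
#    (tab-delimited in, pipe-delimited out).
open my $in,  '<', $raw_file  or die "Cannot read $raw_file: $!";
open my $out, '>', $data_file or die "Cannot write $data_file: $!";
while ( my $line = <$in> ) {
    chomp $line;
    my ( $read_id, $run_id, $bases ) = split /\t/, $line;
    print {$out} join( '|', $read_id, $run_id, uc $bases ), "\n";
}
close $in;
close $out;

# 2. Write a control file describing the data file layout.
open my $ctl, '>', $ctl_file or die "Cannot write $ctl_file: $!";
print {$ctl} <<"CTL";
LOAD DATA
INFILE '$data_file'
APPEND
INTO TABLE sequence_read
FIELDS TERMINATED BY '|'
( read_id, run_id, bases, load_date SYSDATE )
CTL
close $ctl;

# 3. Run sqlldr with a direct path load, skipping index maintenance;
#    this leaves the index unusable, so it gets rebuilt afterwards.
system( 'sqlldr', 'dw_user/secret',
        "control=$ctl_file", 'direct=true',
        'skip_index_maintenance=true' ) == 0
    or die "sqlldr failed: $?";

# 4. Rebuild the index in parallel after a successful load.
my $dbh = DBI->connect( 'dbi:Oracle:dwhdev', 'dw_user', 'secret',
                        { RaiseError => 1 } );
$dbh->do('ALTER INDEX sequence_read_pk_idx REBUILD PARALLEL 4');
$dbh->disconnect;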

You will find that if the Oracle server itself is set up properly (good
hardware, a good OFA layout), if your logical data model is properly
physically partitioned and parallelized, and if SQL*Loader is
configured correctly, then inserting 112,000,000 rows is extremely
fast; to take a guess, and assuming a lot about the kind of data you
want to store and about the hardware, it could take just a couple of
minutes.  For fun, write a Perl script that does the same inserts with
plain DBI and you will see that it is orders of magnitude slower.  That
is not DBI's fault; DBI is wonderful and performs very well.  The same
would happen with any database access library that has to talk to the
DBMS from the "outside" through the regular channels.

Hope this helps,

Leandro 



