[Dbix-class] Database: Slave or Master - call for paper review

Darren Duncan darren at DarrenDuncan.net
Tue Oct 3 08:00:12 CEST 2006


At 5:24 PM +1300 10/3/06, Sam Vilain wrote:
>I've published an update to my paper, mostly based on your feedback, in
>the same place. Diffs at http://www.utsl.gen.nz/gitweb/?p=dbic-tangram-paper

Those changes all look like improvements to me.

I have little left in the way of feedback on the paper right now, but 
I'll briefly respond to some of your comments here.

>> I say that it is perfectly valid for a data type to be arbitrarily
>> complicated, and this does in no way violate any relational database
>> normalization principles.
>
>In principle, yes. But can you point to a database that supports this?

All of them.

You see, the principle of normalization describes the layout of a 
collection of relations/tables, and says nothing about what data type 
is stored in each tuple/row attribute/field/column.  You can store an 
arbitrarily complex data type within a single attribute/field of a 
single tuple/row, and that database can be normalized to 5NF+.  The 
complexity of a data type is orthogonal to normalization of relations.

This is true no matter how the data type / object is stored, whether 
serialized or mapped or natively supported as-is.

Also, for the record: the idea going around that storing a non-atomic 
type in a single field violates 1NF is untrue.

1NF just means that a table doesn't have duplicate rows, and so in a 
SQL database, any table with a key on it is in 1NF.  For a true 
set-oriented database, that's a non-issue as they are always in 1NF 
by definition.
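A minimal sketch of the point, using SQLite from Python (the schema and 
data here are hypothetical): a table with a declared key stores an 
arbitrarily complex value, serialized as JSON, in a single column, and 
the relation still contains no duplicate rows.  The complexity of the 
column's type is orthogonal to normalization.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        email   TEXT PRIMARY KEY,  -- a key, so no duplicate rows
        profile TEXT NOT NULL      -- arbitrarily complex value in one field
    )
""")

# A nested, non-atomic value stored in a single attribute.
profile = {"name": "Alice", "langs": ["Perl", "SQL"], "prefs": {"theme": "dark"}}
conn.execute("INSERT INTO person VALUES (?, ?)",
             ("alice@example.com", json.dumps(profile)))

# The key keeps the table a set: a duplicate row is rejected.
try:
    conn.execute("INSERT INTO person VALUES (?, ?)",
                 ("alice@example.com", json.dumps(profile)))
except sqlite3.IntegrityError:
    print("duplicate rejected")
```

The complex value round-trips intact, yet the table remains keyed and 
duplicate-free.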

>> I would say that there is nothing insane about a primary key being
>> over all of the columns in a table, if there is no subset of those
>> columns that can by themselves uniquely identify an object.
>
>I mean it would be insane, because then the primary key index would be
>the same size as the table, unless you made it index organised and the
>RDBMS DTRT. It's the 'correct' thing to do on one level, but really not
>what people expect. It is more likely to indicate an error.

That depends on what people expect.  In my mind, there are a *lot* of 
general practitioners, not versed in the relational data model, for 
whom duplicates cause a lot of unexpected grief, and who wish items 
just appeared once in each table, since usually that is all they ever 
intended.  Just because a lot of implementations store duplicates by 
default doesn't mean that behavior is what a lot of users expect or 
want.  If the database did not store duplicates by default, then it 
would be doing what most people want; only those who have used DBMSs 
for a while and are accustomed to duplicates will get thrown, and 
they can easily adapt.

Regardless, I consider the issue largely moot as most objects, afaik, 
can be uniquely identified by a subset of their attributes, so not 
all of them would get used in their .WHICH or the primary key; eg, a 
person probably has a government id or a phone number or an email 
address that is unique.  If this isn't the case, then either the 
objects are for highly specialized fields, or they are badly designed.

Just because we can support all object types in a general sense 
doesn't mean we have to have good performance characteristics for 
badly designed objects.

As for specialized fields, I think of temporal (eg year+month+day) or 
spatial (eg, X+Y+Z) data as examples; all of their attributes are 
probably needed to identify them in the general case, and if not, 
then any extra attributes are meta-data or are redundant and can be 
derived.  But then DBMSs (or the SQL standard at least) tend to have 
built-in extensions for temporal or spatial types anyway, to make 
using them more efficient.
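To make the spatial case concrete, here is a hypothetical SQLite 
sketch of a "key over all the columns" table: a point where every 
attribute is needed to identify a tuple, so the primary key spans 
everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE point (
        x REAL, y REAL, z REAL,
        PRIMARY KEY (x, y, z)  -- the whole tuple is its own identifier
    )
""")

# Two points differing in only one coordinate are distinct tuples.
conn.execute("INSERT INTO point VALUES (1.0, 2.0, 3.0)")
conn.execute("INSERT INTO point VALUES (1.0, 2.0, 4.0)")
```

No subset of (x, y, z) suffices here, so the all-columns key is the 
correct, not insane, design.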

Regardless, performance characteristics are often orthogonal to the interface.

>In one sense, I agree with what you're saying. However, if this is
>happening, it means that people are treating the object like it has some
>kind of /je ne sais quoi/ binding it together, and that's what they're
>referring to - its "object-ness". Or, its tuple-ness, if you prefer. In
>a relational database, that's fine - you could never uniquely refer to a
>row with no constraints anyway; the 'correct' way to refer to such a row
>is with all of its values.

Not quite.

A true relational database by definition (being set oriented) has no 
duplicate tuples (rows) in a relation (table), so a key being over 
all the attributes (fields/columns) is implicit, and you can always 
identify a tuple using all of its attributes.  The reason you ever 
define an explicit key over a relation is that a subset of its 
attributes is known to be distinct for every tuple, so you can refer 
to the tuple by just those attributes.  Similarly, you can have 
multiple keys over different attributes where each can uniquely 
identify a tuple.  (Assuming for example that everyone has been given 
an id from the same country, and no one shares an email address, 
either could be a key for a person relation, and neither one needs be 
more "important" or "primary" than the other.)

On the other hand, pseudo-relational databases, like SQL databases 
(being bag oriented), always require an explicit key to gain the 
advantage that a relational database has implicitly: being able to 
refer to a single item uniquely.
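The multiple-candidate-keys point above can be sketched in SQLite 
(hypothetical schema and data): either the government id or the email 
identifies a tuple, and neither needs to be more "primary" than the 
other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        govt_id TEXT NOT NULL UNIQUE,  -- candidate key 1
        email   TEXT NOT NULL UNIQUE,  -- candidate key 2
        name    TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO person VALUES (?, ?, ?)",
             ("ID-12345", "darren@example.com", "Darren"))

# Either candidate key alone pins down the same tuple.
by_id    = conn.execute("SELECT name FROM person WHERE govt_id = ?",
                        ("ID-12345",)).fetchone()
by_email = conn.execute("SELECT name FROM person WHERE email = ?",
                        ("darren@example.com",)).fetchone()
```

Both lookups resolve to the same row; the choice of which key to call 
"primary" is a matter of convention, not of the data model.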

>Or the rowid, but nobody does that except bad
>Oracle DB developers IME.

Rowids are indeed evil, and arise as a consequence of the system 
being bag-oriented (other DBMSs have them too, such as SQLite). 
The proper way to do things is for all attributes to be user-defined, 
so users can fetch data using the same info they added in the first 
place, rather than having to use something the database generated 
that isn't part of their actual data.

>I have a sneaky feeling that object pointers actually point to the
>.WHICH value. Which would mean that you'd invalidate pointers if you
>changed those values. But in objects, people won't be used to not being
>able to change tuples. Well, except Erlang programmers (Erlang is a
>language with no[*] mutable state).

AFAIK, for an arbitrary user-defined object type, .WHICH returns the 
same thing as .WHERE, which is akin to a memory address of the 
object; this is the .WHICH defined by the Object superclass of all 
objects.  Only classes which are explicitly acting as value types 
(such as Int, Str, etc), by both declaring their own .WHICH (or ===) 
and by being immutable, return something derived from their 
attributes.  So, generally speaking, in the manner of good Perl 6 
object design, either you declare your .WHICH based on the object's 
value and the object is immutable, or neither is true.  And if the 
object is mutable, then .WHICH is its memory address, which remains 
valid as a unique id for the object itself, though not for the 
object's contents.  Whether an explicit .WHICH uses all or just some 
of the attributes is orthogonal to this matter.
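A rough analogy in Python (not Perl 6, so this is illustrative only): 
immutable value types compare and hash by their attributes, much like 
a declared .WHICH, while mutable objects fall back to identity, akin 
to the memory-address .WHICH.

```python
from dataclasses import dataclass

# Value type: immutable, with equality and hash derived from attributes.
@dataclass(frozen=True)
class Point:
    x: int
    y: int

# Mutable type: default equality and identity are the object itself.
class Counter:
    def __init__(self):
        self.n = 0

a, b = Point(1, 2), Point(1, 2)
c, d = Counter(), Counter()

print(a == b, hash(a) == hash(b))  # value-based: True True
print(c == d, id(c) == id(d))      # identity-based: False False
```

As in the Perl 6 case, the value-based identity is only safe because 
the type is immutable; mutating a hashed-by-value object would 
invalidate anything holding its old identity.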

>If your database doesn't have the ability to store arbitrary, custom
>column types, then you have two options:
>
>- express the object's type in a shape you can map, and keep mapping it.
>- give up and serialise it to a BLOB or TEXT
>
>Actually the second can be a lot faster, too. And so long as you don't
>care about your database ensuring referential integrity or being able to
>query or inspect the data :-), it's fine.

Possibly.  If you are going the serialized route, then no component 
of your serialized value needs referential integrity against anything 
elsewhere in the database; rather, you just want to know whether the 
entirety of that value occurs somewhere else (serialized by the same 
method).  If components did need such integrity, you would probably 
want to go the map route instead, as it is generally the database's 
job to guarantee the integrity of the data, so long as you tell it 
what the rules for the data are.
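The trade-off can be sketched in SQLite (hypothetical tables and 
data): a serialized object in a TEXT column is opaque to the 
database, so no rules get enforced, while mapped columns let the 
database guarantee the integrity you declare.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blob_store (obj TEXT);  -- serialized: opaque, no checks
    CREATE TABLE mapped (
        name  TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE       -- a rule the database enforces
    );
""")

obj = {"name": "Sam", "email": "sam@example.com"}

# Serialized route: the database happily stores the same value twice.
conn.execute("INSERT INTO blob_store VALUES (?)", (json.dumps(obj),))
conn.execute("INSERT INTO blob_store VALUES (?)", (json.dumps(obj),))

# Mapped route: the declared rule catches the duplicate.
conn.execute("INSERT INTO mapped VALUES (?, ?)", (obj["name"], obj["email"]))
dup_caught = False
try:
    conn.execute("INSERT INTO mapped VALUES (?, ?)", (obj["name"], obj["email"]))
except sqlite3.IntegrityError:
    dup_caught = True
```

With the blob, any integrity checking falls back on application code; 
with the mapping, it is the database's job, as it should be.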

-- Darren Duncan


