[Dbix-class] Database: Slave or Master - call for paper review
Darren Duncan
darren at DarrenDuncan.net
Tue Oct 3 08:00:12 CEST 2006
At 5:24 PM +1300 10/3/06, Sam Vilain wrote:
>I've published an update to my paper, mostly based on your feedback, in
>the same place. Diffs at http://www.utsl.gen.nz/gitweb/?p=dbic-tangram-paper
Those changes all look like improvements to me.
I have little left in the way of feedback on the paper right now, but
I'll briefly respond to some of your comments here.
> > I say that it is perfectly valid for a data type to be arbitrarily
>> complicated, and this does in no way violate any relational database
>> normalization principles.
>
>In principle, yes. But can you point to a database that supports this?
All of them.
You see, the principle of normalization describes the layout of a
collection of relations/tables, and says nothing about what data type
is stored in each tuple/row attribute/field/column. You can store an
arbitrarily complex data type within a single attribute/field of a
single tuple/row, and that database can be normalized to 5NF+. The
complexity of a data type is orthogonal to normalization of relations.
This is true no matter how the data type / object is stored, whether
serialized or mapped or natively supported as-is.
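To make that concrete, here is a rough SQL sketch (the table and column
names are hypothetical, not from the paper): the relation below can be
fully normalized even though one attribute holds an arbitrarily complex
value.
  -- A relation that stays normalized regardless of how complex the value
  -- in "preferences" is; the column is simply a single value of some
  -- (possibly user-defined or serialized) type.
  CREATE TABLE user_profile (
    profile_id  INTEGER NOT NULL PRIMARY KEY,
    name        TEXT    NOT NULL,
    preferences TEXT    NOT NULL  -- eg a serialized structure, opaque to SQL
  );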
Also, FYI, the idea going around that storing a non-atomic type in a
single field violates 1NF is untrue.
1NF just means that a table doesn't have duplicate rows, so in a SQL
database, any table with a key on it is in 1NF. For true set-oriented
databases, that's a non-issue, as they are always in 1NF by definition.
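As an illustration (again a hypothetical sketch), declaring any key at
all is enough to rule out duplicate rows, which is all that 1NF requires
under the definition above:
  -- With a key declared over some (or all) columns, the table cannot
  -- hold two identical rows.
  CREATE TABLE colour (
    name TEXT NOT NULL,
    PRIMARY KEY (name)
  );
  INSERT INTO colour VALUES ('red');
  INSERT INTO colour VALUES ('red');  -- rejected: duplicate key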
> > I would say that there is nothing insane about a primary key being
>> over all of the columns in a table, if there is no subset of those
> > columns that can by themselves uniquely identify an object.
>
>I mean it would be insane, because then the primary key index would be
>the same size as the table, unless you made it index organised and the
>RDBMS DTRT. It's the 'correct' thing to do on one level, but really not
>what people expect. It is more likely to indicate an error.
That depends on what people expect. In my mind, there are a *lot* of
general practitioners, not versed in the relational data model, for whom
duplicates cause a lot of unexpected grief, and who wish items just
appeared once in each table, since usually that is all they ever
intended. Just because a lot of implementations store duplicates by
default doesn't mean that duplicates are what a lot of users expect or
want to happen. If the database did not store duplicates by default,
then it would be doing what most people want; only those who have used
DBMSs for a while and are used to the duplicates thing will be thrown,
and they can easily adapt.
Regardless, I consider the issue largely moot as most objects, afaik,
can be uniquely identified by a subset of their attributes, so not
all of them would get used in their .WHICH or the primary key; eg, a
person probably has a government id or a phone number or an email
address that is unique. If this isn't the case, then either the
objects are for highly specialized fields, or they are badly designed.
Just because we can support all object types in a general sense
doesn't mean we have to have good performance characteristics for
badly designed objects.
As for specialized fields, I think of temporal (eg year+month+day) or
spatial (eg, X+Y+Z) data as examples; all of their attributes are
probably needed to identify them in the general case, and if not,
then any extra attributes are meta-data or are redundant and can be
derived. But then DBMSs (or the SQL standard at least) tend to have
built-in extensions for temporal or spatial types anyway, to make
using them more efficient.
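A minimal sketch of such a case (hypothetical table), where identity
genuinely needs every attribute and so the key spans every column:
  -- A point is identified only by all of its coordinates together.
  CREATE TABLE point (
    x REAL NOT NULL,
    y REAL NOT NULL,
    z REAL NOT NULL,
    PRIMARY KEY (x, y, z)
  );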
Regardless, performance characteristics are often orthogonal to the interface.
>In one sense, I agree with what you're saying. However, if this is
>happening, it means that people are treating the object like it has some
>kind of /je ne sais quoi/ binding it together, and that's what they're
>referring to - its "object-ness". Or, its tuple-ness, if you prefer. In
>a relational database, that's fine - you could never uniquely refer to a
>row with no constraints anyway; the 'correct' way to refer to such a row
>is with all of its values.
Not quite.
A true relational database by definition (being set oriented) has no
duplicate tuples (rows) in a relation (table), so a key being over
all the attributes (fields/columns) is implicit, and you can always
identify a tuple using all of its attributes. The reason you would ever
define an explicit key over a relation is that some subset of its
attributes is known to be distinct for every tuple, so you can refer to
a tuple by just those attributes. Similarly, you can have multiple keys
over different attributes where each can uniquely identify a tuple.
(Assuming, for example, that everyone has been given an id from the
same country, and no one shares an email address, either could be a key
for a person relation, and neither one need be more "important" or
"primary" than the other.)
On the other hand, pseudo-relational databases like SQL databases
(being bag oriented) always require an explicit key to gain the
advantage that a relational database has implicitly: being able to
refer to a single item uniquely.
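A quick sketch of that bag-oriented behaviour (hypothetical table):
  -- SQL with no declared key: indistinguishable rows can pile up, and
  -- there is then no value-based way to refer to just one of them.
  CREATE TABLE tag (name TEXT);
  INSERT INTO tag VALUES ('urgent');
  INSERT INTO tag VALUES ('urgent');       -- accepted: two identical rows
  DELETE FROM tag WHERE name = 'urgent';   -- removes both; can't target one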
>Or the rowid, but nobody does that except bad
>Oracle DB developers IME.
Rowids are indeed evil, and arise as a consequence of the system being
bag-oriented (and other DBMSs have them too, such as SQLite). The
proper way to do things is that all attributes are user-defined, so
users can fetch data using the same info they added in the first
place, rather than having to use something the database generated
that isn't part of their actual data.
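Continuing the hypothetical person sketch from above, the point is
simply that lookups should be phrased in terms of data the user
supplied:
  -- Address rows by user-supplied attributes...
  SELECT name FROM person WHERE email = 'alice@example.org';
  -- ...rather than by a system-generated surrogate such as Oracle's
  -- ROWID or SQLite's rowid, which is not part of the user's data.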
>I have a sneaky feeling that object pointers actually point to the
>.WHICH value. Which would mean that you'd invalidate pointers if you
>changed those values. But in objects, people won't be used to not being
>able to change tuples. Well, except Erlang programmers (Erlang is a
>language with no[*] mutable state).
AFAIK, for an arbitrary user-defined object type, .WHICH returns the
same thing as .WHERE, which is akin to a memory address of the
object; this is the .WHICH defined by the Object superclass of all
objects. Only classes which are explicitly acting as value types
(such as Int, Str, etc), by both declaring their own .WHICH (or ===)
and by being immutable, return something derived from their
attributes. So, generally speaking, in the manner of good Perl 6
object design, either you declare your .WHICH based on the object's
value and the object is immutable, or neither is true. And if the
object is mutable, then .WHICH is its memory address, which remains
valid as a unique id for the object itself, though not for the
object's contents. Whether an explicit .WHICH uses all or just some
of the attributes is orthogonal to this matter.
>If your database doesn't have the ability to store arbitrary, custom
>column types, then you have two options:
>
>- express the object's type in a shape you can map, and keep mapping it.
>- give up and serialise it to a BLOB or TEXT
>
>Actually the second can be a lot faster, too. And so long as you don't
>care about your database ensuring referential integrity or being able to
>query or inspect the data :-), it's fine.
Possibly, if you are going the serialized route, then no components of
your serialized value need to have referential integrity against
anything elsewhere in the database; rather, you just want to know
whether the entirety of that value occurs somewhere else (serialized
in the same way). If components did need that integrity, you would
probably want to go the mapping route instead, as it is generally the
database's job to guarantee the integrity of the data, to the extent
that you tell it what the rules for the data are.
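A rough sketch of the two routes being contrasted, continuing the
hypothetical person table from earlier (not from the paper):
  -- Mapped route: components are visible to the database, so it can
  -- enforce referential integrity and answer queries about them.
  CREATE TABLE phone_number (
    person_email TEXT NOT NULL REFERENCES person (email),
    number       TEXT NOT NULL
  );

  -- Serialized route: the value is opaque, so the database can only
  -- compare it as a whole ("does this exact serialization occur
  -- elsewhere?") and cannot inspect or constrain its components.
  CREATE TABLE person_serialized (
    email         TEXT NOT NULL,
    phone_numbers TEXT NOT NULL  -- serialized list, uninspectable by SQL
  );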
-- Darren Duncan