[Dbix-class] RFC - some DBMS features you use

Darren Duncan darren at darrenduncan.net
Fri Feb 5 06:05:00 GMT 2010


Rob Kinyon wrote:
> On Thu, Feb 4, 2010 at 17:07, Darren Duncan <darren at darrenduncan.net> wrote:
>> Hello,

Rob,

Thank you very much for your reply; your comments are extremely helpful.

Your sentiments also generally agree with my own preferences, and in the few 
cases where I had thought differently, those weren't places I had strong 
feelings about.

A few comments about your comments ...

>> 1.  Locking.  Do you explicitly place locks (including with "select ... for
>> update" or otherwise) on either the whole or parts of the database to ensure
>> no one else updates them for the duration of the locks?  What DBMSs do or
>> don't support doing this?
> 
> All DBMSes worth talking about support locking at some level. Whether
> or not they should is a different story.
> 
> Locking, afaict, is a way of managing the conflicting transactions
> problem. Even though it's harder to program (at least, initially),
> database engines should really support indicating that a transaction
> commit would have a "merging conflict", similar to how git works.
> Then, the client can query the engine as to what the nature of the
> conflict is, resolve it, and attempt to commit again with resolved
> fixes. Normally, this would be done by discarding actions done in the
> transaction and restarting with a fresh view of the database.

Interesting that you mention git, because a few weeks ago I made a rough draft 
proposal in another (database-but-not-Perl-specific) forum about using version 
control systems, specifically git, as inspiration for conceptualizing how 
relational DBMSs can manage concurrency issues.  Under that proposal an RDBMS 
would fundamentally be multi-versioned, which is also the approach I was 
thinking of taking with Muldis D; of course, you can easily express current 
DBMSs' feature sets using that paradigm.

I will forward a copy of that next as a reply to this message.

In this paradigm, using locks is not necessary for concurrency, but it is useful 
as a preemptive action by a user to prevent merge conflicts.
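
In current SQL DBMSs that preemptive action usually looks something like the 
following; a minimal DBI sketch, where the connected handle $dbh and the 
"accounts" table are assumed purely for illustration:

  # Lock the row for the duration of the transaction so no concurrent
  # transaction can update it underneath us; in an MVCC engine such as
  # PostgreSQL or Oracle, plain readers are not blocked.
  $dbh->begin_work;
  my ($balance) = $dbh->selectrow_array(
      'SELECT balance FROM accounts WHERE id = ? FOR UPDATE', undef, 42);
  $dbh->do('UPDATE accounts SET balance = ? WHERE id = ?',
      undef, $balance - 100, 42);
  $dbh->commit;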

Another reason that your mentioning git is interesting ... see my reply under #2b.

>> 2.  Read consistency.  After you start a transaction, are you seeing a
>> consistent database state besides any changes you make yourself, or if you
>> repeated the same query without making changes, would you get the same
>> results both times?  That is, are committed changes by other database users
>> visible to you to affect the results.  Do you want to be able to choose a
>> desired behavior and have it enforced, such as whether you want to see other
>> committed changes immediately or not see them at all during your
>> transaction, the latter especially if it is a read-only transaction?  For
>> picking either behavior, how do you currently tell the DBMS to do this?  And
>> do you do it natively with SQL or does your Perl wrapper handle it for you
>> and if so then how?
> 
> This concept of non-serializable transactions is absolutely insane.
> Transactions need to be serializable, period. Anything else is an
> optimization and a poor one given the current push towards
> parallelization. Serializable transactions can be transferred between
> multi-masters.

I agree with you in regard to how things should be, and I may very well just 
spec serializable period.

The reason I talked about other possibilities is that, while Muldis D pushes 
some ideals by making the ideal way of doing things the easier road to travel, 
it also aims, where reasonable, to be able to express the semantics that any 
DBMSs it might work with support, especially where they don't support the ideal 
and the ideal can't be emulated; this is mainly for the purpose of emulating 
legacy systems (not that this needs to be 100% for everything).

For example, I have heard that some big name DBMSs don't support serializable 
transaction isolation at all.  Mind you, I have also heard that different DBMSs 
use the same terms, eg serializable or read committed or repeatable read, to 
mean different things.
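
For reference, requesting a given isolation level today is typically just 
another SQL statement issued at transaction start; a hedged DBI sketch, 
assuming a handle $dbh and a DBMS that accepts the standard syntax:

  $dbh->begin_work;
  # Standard syntax, but the semantics differ between DBMSs; Oracle's
  # SERIALIZABLE, for example, is historically snapshot isolation rather
  # than true serializability.
  $dbh->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
  # ... all queries here see one consistent snapshot ...
  $dbh->commit;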

Are there any DBMSs that have transactions but not serializable isolation? 
(Arguably then they don't actually have transactions, but whatever.)

>> 2b.  Is your DBMS multi-versioned, so that someone can see a consistent
>> database state for an extended period without blocking others that want to
>> update it?  Do you desire such functionality or do you prefer that updates
>> of a common-interest section will block other users?  What do current DBMSs
>> do in this regard or which let you choose?
> 
> When implementing transactions in dbm-deep, I ended up doing pure
> MVCC. When I tried to add SQLite and MySQL as optional backends, I
> found that their concept of transactions was so poor that I couldn't
> support transactions in dbm-deep if I was using them as a backend.
> MVCC is, afaik, the only way to do properly serializable transactions.
> Oracle does this, but no-one else does.

Hmm, I wasn't aware that DBM::Deep ever tried to use a SQL DBMS store.

Continuing my response for #1 ...

Back on 2007.12.17 in a private discussion with Sam Vilain, who in my experience 
was one of the first big proponents of Git in the Perl community (certainly the 
one that convinced me to use it starting in the Spring of 2007), Sam proposed 
something I readily agreed with, which was to use Git itself as a backend for my 
Muldis database project, as one of several non-SQL alternatives to SQL DBMSs. 
That is, Muldis Rosetta would be a new porcelain of Git.

This thought also contributed to my aforementioned proposal on concurrency, 
which takes advantage of features that some backends make easy to support.

Presumably, DBM::Deep could be used as a backend likewise, or as a frontend.

> Yes, I want transition constraints. These are very important when it
> comes to properly modeling financial systems. There are several
> situations where state constraints are too complex (and so are never
> built), but transition constraints would be helpful.

Many of the examples I have seen before of transition constraints involved 
financial systems.  Others were, say, government databases where various 
projects had to go through project statuses in a specific order, or where a 
project could only be in certain statuses directly after certain other ones. 
So basically transition constraints are good for helping follow processes.

That said, what I had in mind only dealt with direct transitions (the most 
common kind I think); if you wanted to constrain multiple stages in more complex 
ways, you'd need audit records, and then the further you go in that direction, 
the more transition constraints can become state constraints.
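
To make the direct-transition case concrete, here is a rough application-side 
Perl sketch of the kind of rule a declarative transition constraint would 
express; the status names, table, and helper are invented for illustration:

  # Allowed direct transitions between project statuses; any status
  # change not listed here is rejected.
  my %allowed_next = (
      proposed => { approved  => 1, rejected  => 1 },
      approved => { active    => 1 },
      active   => { completed => 1, cancelled => 1 },
  );

  sub change_status {
      my ($dbh, $project_id, $new) = @_;
      my ($old) = $dbh->selectrow_array(
          'SELECT status FROM projects WHERE id = ? FOR UPDATE',
          undef, $project_id);
      die "illegal transition: $old -> $new\n"
          unless $allowed_next{$old}{$new};
      $dbh->do('UPDATE projects SET status = ? WHERE id = ?',
          undef, $new, $project_id);
  }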

>> 3a.  Do you expect transition constraints to evaluate per statement or inner
>> transaction / savepoint or only at the end of a main transaction, comparing
>> the state before the transaction to after it?
> 
> Yes. :)
> 
> In other words, all of those options should be possible.

What I currently have specced is that transition constraints, same as database 
state constraints, are evaluated after every "multi-update statement", which 
basically is an arbitrarily complex set of database changes that are defined to 
occur as a single atomic unit from the point of view of every database user, 
including the one performing it.  The idea is that, because a single 
multi-update statement should be able to do all interdependent changes (for 
example, an insert plus a delete, or an insert of parent plus child records) as 
a single unit, the database should be fully consistent both just before and just 
after it executes, meaning it obeys all database constraints.
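
The nearest everyday analogue is wrapping interdependent changes so they 
succeed or fail as one unit; a hedged DBIx::Class sketch, with made-up Artist 
and Album result classes:

  # Both inserts happen or neither does; from every user's point of
  # view the database goes straight from one consistent state to the
  # next.
  $schema->txn_do(sub {
      my $artist = $schema->resultset('Artist')
                          ->create({ name => 'Example Artist' });
      $artist->create_related(albums => { title => 'First Album' });
  });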

What I currently have specced is that transition constraints execute after state 
constraints, so the transition constraint definition can assume that both of its 
inputs (the before and after state) are consistent for state constraints.  A 
failure on either of these would rollback/make-into-a-noop the multi-update 
statement at least, and possibly the rest of the transaction depending on how 
the user set up their exception handlers.

I have not yet specced a transition constraint ranging over a whole transaction, 
but it occurs to me that this would be natural to add.  After all, once you 
conceive of nested transactions (which savepoints are just a different syntax 
for specifying), each statement is conceptually a child transaction, so if you 
can spec a transition constraint there, you can do it elsewhere.

That said, following the Git analogy (also in my next message), a transaction 
effectively includes a Git rebase command at the end, since multiple 
statements/commits from the performing user's point of view have effectively 
turned into a single atomic change from the POV of everyone else.

Anyway, yes, transition constraints are very important.  Exact details to work 
out.  Flexibility that doesn't cause more problems than it prevents is good.

>> 4.  Do you use triggers to optionally cause a database update that isn't
>> conceptually part of the triggering action, such as write to a log table, or
>> do you only use them to define constraints that say can't be expressed with
>> a CHECK clause or unique/foreign/etc key declaration?  Note that if the
>> trigger is simply implementing an updateable view, where all of the values
>> being written are from the user in the same statement/statement group, that
>> doesn't count as I would consider that lot simultaneous, the cause of
>> itself.
> 
> Triggers . . . I have a love/hate relationship with them.

I consider that triggers have their uses, but in a well designed DBMS they are 
genuinely useful for far fewer things than people use them for in SQL.

Things that are value or transition constraints, or virtual variables, should be 
defined and implemented declaratively, with some SQL examples being key 
constraints or check constraints or view definitions.  Effectively speaking, 
evaluation of these happens atomically with their stimuli, like with a type 
constraint or an action on the variable.

Eliminating those, I find that the general functionality of a trigger is best 
termed a "stimulus response rule", which is to specify things that are not 
atomic with their stimuli, but generally take place "after".  For example, you 
could have one to keep track of what users do, timestamped.  Or you could use 
them to implement nondeterministic constraints such as "inserted date must be 
before now()", which I would argue can't directly be in a normal constraint.

In fact, in a pure Muldis D application, where "the database" and "the 
application" are the same thing (all code is stored routines), the "main 
program" *is* a stimulus response rule, where the stimulus is the act of the 
DBMS starting up.  In a mixed-language application such as you would normally 
write, with Perl as the "main program", that wouldn't be the case.

But you get the idea that the role of "triggers" has both been reduced and expanded.

I believe some SQL DBMSs already support triggers on all sorts of activity, not 
just table inserts/updates/deletes/etc, so this can be expressed.

> Ugh. This is a security hole.

I may have misdescribed what was actually happening in the government 
application, given that it was designed by very experienced people.

But whatever, that was an aside and need not be mentioned again.

> The datastore should provide and enforce one set of privileges. The
> application should provide and enforce another set of privileges. To
> conflate the two is to muddy the waters.

Well my current inclination is to not support defining database level privileges 
at all in the first functioning release, and come back to that later when 
someone needs it.  Especially since these could be arbitrarily complicated.

>> 6b.  How important is it to you to be able to have per-user privileges in
>> your databases?  Or how important is it to be able to define these without
>> using SQL?
> 
> This isn't an SQL problem and should be taken out of SQL. Most DBMS
> authn/authz systems are unnecessarily hampered by the "requirement"
> that all interaction with the engine be in SQL-like syntax. SQL should
> be about interacting with the data. Administrivia should be handled
> with a language designed for that purpose.

Fair enough.  But then the matter still exists that if one wants to administer 
the database, this needs to be expressible somehow.

I do agree with the principle of the DBMS essentially being a virtual machine, 
with as much interaction with it as possible taking the form of machine 
instructions.  For a SQL DBMS the instructions would be written in SQL, but on 
the other hand there is no saying the DBMS has to support only SQL.

This principle is in contrast to the idea that some things are better specified 
out of band, such as with a special programmatic API to the VM where each 
command has its own VM method or something.

Expressing more things in command language means a lot more flexibility in 
evolving the DBMS, and its programmatic API can be kept simple, such as just 
having a "run this" method.

Sort of like how the DBI module works, except that I wouldn't have all those 
special methods like "table_info"; rather, users would just query the system 
catalog or some such, through the "run this".
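
To illustrate the contrast with DBI itself (a sketch; the catalog query assumes 
a DBMS that provides the standard information_schema):

  # DBI's special-purpose method:
  my $sth = $dbh->table_info(undef, 'public', undef, 'TABLE');

  # The "run this" equivalent: query the catalog like any other data.
  my $tables = $dbh->selectcol_arrayref(
      q{SELECT table_name FROM information_schema.tables
        WHERE table_schema = ?}, undef, 'public');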

>> 7.  Spatial types and operators.  Who uses spatial or GIS or such types or
>> operators?  How valuable is it to be able to do so without writing SQL?
> 
> I don't use them. I know people who do. Instead of building these into
> the engine, much better for the engine to provide the ability to
> plugin new column types.

The latter, user-defined plugins, is a given.  In fact, the ability to define 
your own, arbitrarily complex, data types is one of the most fundamental 
features of my project, and part of what makes it comparable to a "programming 
language".

I raised question 7 because spatial types are provided by multiple DBMSs 
built-in and so to leverage those I thought it would be useful for me to also 
have something built-in that people can use by default.

It's the same issue with any good programming language.  Yes, it lets you build 
your own anything, but it also bundles some common things so people don't 
reinvent the wheel.

As it is, built-in spatial types are strictly an "optional extension", built in 
to start with.

For that matter, dates and times (except for simple TAI timestamps and 
durations) are *also* an "optional extension", but I didn't bring up dates and 
times since those are a lot more commonly used, and I already specced them.

Basically my only core built-in base types are: booleans, integers, rationals, 
bit strings, character strings, tuples, relations; other types, which are 
defined in terms of those, include: sets, arrays, bags, intervals, TAI instant, 
TAI duration; also there are core types mainly just used for code, not user 
data, such as identifiers and comments and 'order'.

> NaN is a NULL. Of course, q.v. my answer to #7 (type plugins) for
> people who think otherwise.

Agreed.  And I have not specced an "extension" for this as yet.

I expect to add the 2 core singleton types "+Inf" and "-Inf" though, which are 
type-generic infinities, and useful in defining unbounded intervals.

Not the same thing though.

>> 9.  Full text search.  Who uses this versus a simple substring match?
> 
> It's absolutely key to be able to provide this. q.v. my answer to #7
> (type plugins).

Agreed.  And I plan to add a system-defined plugin for this, to front the FTS 
commonly built into SQL DBMSs already, or several if there are several.

>> 10.  Who uses regular expressions of any kind in the database?  Either
>> Perl-compatible or otherwise.  Basically any kind of pattern matching
>> besides the trivial kind that LIKE supports (literals plus . plus .*).
> 
> I do, for full-text searching. q.v. my answer to #7 (type plugins).

Agreed.  What I had in mind was to spec a separate system-defined optional 
plugin for each kind of complex pattern match engine.  One for Perl 5 regexes. 
One for Perl 6 rules.  One for Parrot's basic built-in pattern matching. 
Etcetera.  Only the simple matching of SQL's generic LIKE would be in core.
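
For reference, that trivial core matching maps onto a tiny regex subset; a Perl 
sketch of the correspondence, with an invented pattern:

  my $str = 'abXcdef';
  # SQL LIKE 'ab_c%' (_ = any one character, % = any run of characters)
  # is the same trivial pattern as this regex:
  print "match\n" if $str =~ /\Aab.c.*\z/s;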

>> 11.  Out of band sequence generators.  If you read from / increment a
>> sequence generator from within a transaction, do you want a rollback of that
>> transaction to also rollback the sequence generator or not?  Do you want
>> both possibilities to be supported.  What do current DBMSs support?
> 
> I don't know of a current DBMS that supports rolling back a sequence
> generator because too many people have (mis)used sequences as an
> insertion ordering. Frankly, I don't think sequence generators should
> be supported. Instead, provide nano-second timestamping so that you
> can always know that -this- row came before -that- row and UUID
> generators for primary keys (the most common use for sequence
> generators).

Makes sense.

One thing I could do is not build in sequence generators per se, but make it 
easy for users to define their own.  In fact, any built-in ones would be 
designed as if they were built by users, for example:

1.  The current state of a sequence generator is just stored in a relvar/table 
like any user data.  Whether there is a separate relvar per generator is up to 
the user, though that approach would most closely mimic the concept of a SQL 
DBMS having the generator as its own schema object.

2.  To read the current/last sequence value without changing it, just read from 
the relvar holding its value.  A shorthand syntax for referencing the sole value 
of a single-tuple/row, single-attribute/column relvar/table would be built-in 
(see the DBI sketch after this list).

3.  To increment it, just update the value.  A shorthand for this, which is 
analogous to "$nextval = $foo++" or "$nextval = $foo+=$x" in Perl, would be 
built-in.

4.  Doing all of the above normally would make the generator increment subject 
to transactions, as it is a normal relvar/table update.

5.  To make an increment not subject to transactions, ostensibly you would do 
the operation against the sequence value in a prior or parallel transaction 
which is successfully committed first; then whatever you use the sequence value 
in later won't cause the sequence to roll back.
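
A rough DBI sketch of points 1 through 3 (PostgreSQL-flavoured SQL; the my_seq 
table and last_val column are made up for illustration):

  # 1.  Generator state is an ordinary one-row table.
  $dbh->do('CREATE TABLE my_seq (last_val INTEGER NOT NULL)');
  $dbh->do('INSERT INTO my_seq (last_val) VALUES (0)');

  # 2.  Read the current value without changing it.
  my ($curr) = $dbh->selectrow_array('SELECT last_val FROM my_seq');

  # 3.  "$nextval = ++$foo" as one atomic, transactional update.
  my ($nextval) = $dbh->selectrow_array(
      'UPDATE my_seq SET last_val = last_val + 1 RETURNING last_val');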

Any built-in sequence generator support would be a thin veneer over the above, 
and any native sequence generators in the backend would be abstracted into the 
above as far as the user sees; if both are done, it's essentially 2 
mutually-reversing abstractions, and so users can still work as if they were 
using the SQL ones anyway, sort of.

> YES YES YES!!!1! **ALL** SQL statements inside a transaction are
> subject to the transaction. Anyone who provides an "implicit commit"
> should be shot. Period, end of story.

I'm glad we agree.

> q.v. my answer to #6b about SQL for data interaction and some other
> language for administrivia.

I currently spec data definition and transaction management as a built-in, and 
very fundamentally.

Also, a transaction is always scope-defined, by declaring a stored procedure 
"is transaction"; that transaction commits on a normal return and rolls back on 
a thrown exception; that is currently the only supported way to do it; there 
are no "statements" for transaction management.  I believe DBIx::Class's 
"txn_do()" works exactly the same way.

This said, a DBMS VM/wrapper's interface to Perl would have explicit transaction 
control methods (begin, commit, rollback), breaking the general "run this" 
simplicity, for the sole purpose of grouping several distinct database 
interactions by Perl into a transaction, when those actions can't be, or one 
doesn't want them to be, rendered as a single Muldis D "is transaction" procedure.
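
For comparison, the same commit-on-return, rollback-on-exception shape written 
out by hand against plain DBI (a sketch assuming a connected $dbh and an 
invented log table):

  $dbh->begin_work;
  eval {
      $dbh->do('INSERT INTO log (msg) VALUES (?)', undef, 'step 1');
      $dbh->do('INSERT INTO log (msg) VALUES (?)', undef, 'step 2');
      $dbh->commit;
  };
  if ($@) {
      $dbh->rollback;   # any thrown exception undoes the whole unit
      die $@;
  }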

> I agree with subjugate-or-fail. Implicit -anything- is _ALWAYS_ bad.
> I'm talking about implicit commits, implicit type coercions, implicit
> case-insensitive matching . . . ANYTHING. I know why MySQL did things
> this way, but it's a bad practice. It's unpredictable and if you want
> your engine to have any traction in places that will pay you, you have
> to be predictable. Even Oracle could do more about this.

Predictable is *very* important to me.

Also, about your checklist:

1.  There are no implicit type conversions, such as num-to-string or vice-versa. 
  You can assign a subtype value to a supertype variable, but that doesn't 
change the *value*.  Moreover, the explicit type conversions are detailed or 
varied enough to reduce ambiguity; for example, the Int_from_Text() function 
makes you declare what numeric base you are interpreting the text in (choose 
from 2..36); you can also specify numeric literals in all 35 bases (same as 
Perl 6), though for base-10 literals you have the shortcut that the base may be 
implicit, same as any normal language.  Also, the Int_from_Rat() function makes 
you explicitly declare what numeric rounding method you want to use, eg half-up 
versus to-even etc.  (See the Perl sketch after this list.)  Of course, wrappers 
like DBIx::Class are welcome to provide implicit config arguments so their users 
don't have to.  Note that Muldis D *does* support optional parameters, but some 
params are made mandatory when it is thought that code omitting explicit 
arguments would be ambiguous, such as in the prior 2 examples.  On a related 
note, I don't have an "average" function, but rather I have the 3 functions 
"mean", "median", and "mode", which cuts ambiguity.

2.  All text comparison is case-sensitive, accent-sensitive, etc, and to do 
anything insensitive you have to explicitly fold the arguments, same as in Perl 
or in Oracle I believe.  Both text data and identifiers are that way, same as in 
Perl and many other languages, and as with delimited SQL identifiers.

3.  If an engine doesn't support subjugating data definition to transactions, it 
will throw an exception (or fail to compile, if we would know at compile time) 
when someone tries to do data definition in a transaction (they'll need to try 
again outside a transaction), not just go ahead.
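
Perl analogues of points 1 and 2, for flavour (a sketch; fc() needs Perl 5.16 
or later):

  use feature 'fc';                # explicit Unicode case-folding
  use POSIX 'floor';

  # Point 1: be explicit about the base when reading an integer from text.
  my $n = oct('0x1F');             # explicit base-16 interpretation => 31

  # Point 1 again: be explicit about the rounding method.
  my $half_up = floor(2.5 + 0.5);  # half-up => 3; int() would truncate

  # Point 2: nothing compares case-insensitively unless you fold it.
  print "equal\n" if fc('FOO') eq fc('foo');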

Thank you very much.

-- Darren Duncan


