[Dbix-class] RFC - some DBMS features you use
Darren Duncan
darren at darrenduncan.net
Fri Feb 5 06:05:00 GMT 2010
Rob Kinyon wrote:
> On Thu, Feb 4, 2010 at 17:07, Darren Duncan <darren at darrenduncan.net> wrote:
>> Hello,
Rob,
Thank you very much for your reply; your comments are extremely helpful.
Your sentiments also generally agree with my own preferences, and in the few
cases where I had thought differently, those weren't places I had strong
feelings about.
A few comments about your comments ...
>> 1. Locking. Do you explicitly place locks (including with "select ... for
>> update" or otherwise) on either the whole or parts of the database to ensure
>> no one else updates them for the duration of the locks? What DBMSs do or
>> don't support doing this?
>
> All DBMSes worth talking about support locking at some level. Whether
> or not they should is a different story.
>
> Locking, afaict, is a way of managing the conflicting transactions
> problem. Even though it's harder to program (at least, initially),
> database engines should really support indicating that a transaction
> commit would have a "merging conflict", similar to how git works.
> Then, the client can query the engine as to what the nature of the
> conflict is, resolve it, and attempt to commit again with resolved
> fixes. Normally, this would be done by discarding actions done in the
> transaction and restarting with a fresh view of the database.
Interesting that you mention git, because a few weeks ago, in another
(database-but-not-Perl-specific) forum, I made a rough draft proposal to use
version control systems, specifically git, as inspiration for conceptualizing
how relational DBMSs can manage concurrency; under that paradigm an RDBMS would
be fundamentally multi-versioned. That is also the approach I was thinking of
taking with Muldis D, and of course you can easily express current DBMSs'
feature sets using that paradigm.
I will forward a copy of that next as a reply to this message.
In this paradigm, using locks is not necessary for concurrency, but it is useful
as a preemptive action by a user to prevent merge conflicts.
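To make the preemptive-lock case concrete, this is roughly how it looks from
Perl today with DBI (a sketch; the accounts table and connection details are
hypothetical, and the FOR UPDATE syntax is supported by e.g. PostgreSQL,
Oracle, and InnoDB):

    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=example', 'user', 'pass',
        { AutoCommit => 0, RaiseError => 1 });

    # Lock the matching row until commit/rollback; competing writers block,
    # so the read-modify-write below can't suffer a lost update.
    my ($balance) = $dbh->selectrow_array(
        'SELECT balance FROM accounts WHERE id = ? FOR UPDATE', undef, 42);

    $dbh->do('UPDATE accounts SET balance = ? WHERE id = ?',
        undef, $balance - 100, 42);

    $dbh->commit;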
Another reason that your mentioning git is interesting ... see my reply under #2b.
>> 2. Read consistency. After you start a transaction, are you seeing a
>> consistent database state besides any changes you make yourself, or if you
>> repeated the same query without making changes, would you get the same
>> results both times? That is, are committed changes by other database users
>> visible to you to affect the results. Do you want to be able to choose a
>> desired behavior and have it enforced, such as whether you want to see other
>> committed changes immediately or not see them at all during your
>> transaction, the latter especially if it is a read-only transaction? For
>> picking either behavior, how do you currently tell the DBMS to do this? And
>> do you do it natively with SQL or does your Perl wrapper handle it for you
>> and if so then how?
>
> This concept of non-serializable transactions is absolutely insane.
> Transactions need to be serializable, period. Anything else is an
> optimization and a poor one given the current push towards
> parallelization. Serializable transactions can be transferred between
> multi-masters.
I agree with you in regard to how things should be, and I may very well just
spec serializable, period.
The reason I talked about other possibilities is that, while Muldis D pushes
some ideals by making the ideal way of doing things the easier road to travel,
it also aims, where reasonable, to express the semantics of any DBMS it might
work with, especially where that DBMS doesn't support the ideal and the ideal
can't be emulated; this is mainly for the purpose of emulating legacy systems
(and doesn't need to be 100% complete for everything).
For example, I have heard that some big-name DBMSs don't support serializable
transaction isolation at all. Mind you, I have also heard that different DBMSs
use the same terms, e.g. serializable or read committed or repeatable read, to
mean different things.
Are there any DBMSs that have transactions but not serializable isolation?
(Arguably then they don't actually have transactions, but whatever.)
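For reference, picking an isolation behavior today means issuing a bare SQL
statement whose meaning varies by engine, which is exactly the terminology
problem above (Oracle's SERIALIZABLE, for instance, is really snapshot
isolation). Via DBI, a sketch:

    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=example', 'user', 'pass',
        { AutoCommit => 0, RaiseError => 1 });

    # Same spelling on most engines; the actual guarantee differs per DBMS.
    $dbh->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');

    # ... every query in this transaction now sees one consistent view ...

    $dbh->commit;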
>> 2b. Is your DBMS multi-versioned, so that someone can see a consistent
>> database state for an extended period without blocking others that want to
>> update it? Do you desire such functionality or do you prefer that updates
>> of a common-interest section will block other users? What do current DBMSs
>> do in this regard or which let you choose?
>
> When implementing transactions in dbm-deep, I ended up doing pure
> MVCC. When I tried to add SQLite and MySQL as optional backends, I
> found that their concept of transactions was so poor that I couldn't
> support transactions in dbm-deep if I was using them as a backend.
> MVCC is, afaik, the only way to do properly serializable transactions.
> Oracle does this, but no-one else does.
Hmm, I wasn't aware that DBM::Deep ever tried to use a SQL DBMS store.
Continuing my response for #1 ...
Back on 2007.12.17, in a private discussion, Sam Vilain proposed something I
readily agreed with: using Git itself as a backend for my Muldis database
project, as one of several non-SQL alternatives to SQL DBMSs. (In my experience
Sam was one of the first big proponents of Git in the Perl community, and
certainly the one who convinced me to use it, starting in the spring of 2007.)
That is, Muldis Rosetta would be a new porcelain for Git.
This thought also contributed to my aforementioned proposal on concurrency,
which takes advantage of features that some backends make easy to support.
Presumably, DBM::Deep could likewise be used as a backend, or as a frontend.
> Yes, I want transition constraints. These are very important when it
> comes to properly modeling financial systems. There are several
> situations where state constraints are too complex (and so are never
> built), but transition constraints would be helpful.
Many of the examples of transition constraints I have seen involved financial
systems. Others were, say, government databases where projects had to move
through statuses in a specific order, or where only certain statuses could
directly follow certain others. So basically, transition constraints are good
for helping to follow processes.
That said, what I had in mind only dealt with direct transitions (the most
common kind, I think); if you wanted to constrain multiple stages in more
complex ways, you'd need audit records, and the further you go in that
direction, the more transition constraints become state constraints.
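For what it's worth, the closest thing in current SQL is a row trigger that
compares OLD to NEW; a minimal sketch in PostgreSQL's plpgsql, sent through a
connected DBI handle $dbh as in the earlier sketches (the projects table and
its status values are hypothetical):

    $dbh->do(<<'END_SQL');
    CREATE FUNCTION check_status_transition() RETURNS trigger AS $$
    BEGIN
        -- Permit only the direct transitions the process allows.
        IF NOT (   (OLD.status = 'draft'     AND NEW.status = 'submitted')
                OR (OLD.status = 'submitted' AND NEW.status IN ('approved', 'rejected')))
        THEN
            RAISE EXCEPTION 'illegal status transition % -> %',
                OLD.status, NEW.status;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    END_SQL

    $dbh->do(<<'END_SQL');
    CREATE TRIGGER project_status_transition
        BEFORE UPDATE ON projects
        FOR EACH ROW EXECUTE PROCEDURE check_status_transition();
    END_SQL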
>> 3a. Do you expect transition constraints to evaluate per statement or inner
>> transaction / savepoint or only at the end of a main transaction, comparing
>> the state before the transaction to after it?
>
> Yes. :)
>
> In other words, all of those options should be possible.
What I currently have specced is that transition constraints, same as database
state constraints, are evaluated after every "multi-update statement", which
basically is an arbitrarily complex set of database changes that are defined to
occur as a single atomic unit from the point of view of every database user,
including the one performing it. The idea is that, because a single
multi-update statement should be able to do all interdependent changes (for
example, an insert plus a delete, or an insert of parent plus child records) as
a single unit, the database should be fully consistent both just before and just
after it executes, meaning it obeys all database constraints.
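SQL's nearest widely available approximation of that insert-parent-plus-child
atomicity is deferring constraint checks so a group of interdependent changes
is validated as one unit; a sketch with DBI and a deferrable foreign key in
PostgreSQL (tables hypothetical):

    # Given a connected $dbh with AutoCommit on, and this schema:
    #   CREATE TABLE parent (id int PRIMARY KEY);
    #   CREATE TABLE child  (id int PRIMARY KEY,
    #       parent_id int REFERENCES parent DEFERRABLE INITIALLY DEFERRED);

    $dbh->begin_work;

    # Statement order no longer matters; the foreign key is checked at
    # commit, once the whole group of changes is in place.
    $dbh->do('INSERT INTO child  (id, parent_id) VALUES (1, 10)');
    $dbh->do('INSERT INTO parent (id) VALUES (10)');

    $dbh->commit;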
What I currently have specced is that transition constraints execute after state
constraints, so the transition constraint definition can assume that both of its
inputs (the before and after state) are consistent for state constraints. A
failure on either of these would rollback/make-into-a-noop the multi-update
statement at least, and possibly the rest of the transaction depending on how
the user set up their exception handlers.
I have not yet specced a transition constraint ranging over a whole transaction,
but it occurs to me that this would be natural to add. After all, once you
conceive of nested transactions (which savepoints are just a different syntax
for specifying), each statement is conceptually a child transaction anyway, so
if you can spec a transition constraint at one level, you can do it at any other.
That said, following the Git analogy (also in my next message), a transaction
effectively includes a Git rebase command at the end, since multiple
statements/commits from the performing user's point of view have effectively
turned into a single atomic change from the POV of everyone else.
Anyway, yes, transition constraints are very important. Exact details to work
out. Flexibility that doesn't cause more problems than it prevents is good.
>> 4. Do you use triggers to optionally cause a database update that isn't
>> conceptually part of the triggering action, such as write to a log table, or
>> do you only use them to define constraints that say can't be expressed with
>> a CHECK clause or unique/foreign/etc key declaration? Note that if the
>> trigger is simply implementing an updateable view, where all of the values
>> being written are from the user in the same statement/statement group, that
>> doesn't count as I would consider that lot simultaneous, the cause of
>> itself.
>
> Triggers . . . I have a love/hate relationship with them.
I consider that triggers have their uses, but in a well-designed DBMS they are
genuinely useful for far fewer things than people use them for in SQL.
Things that are value or transition constraints, or virtual variables, should be
defined and implemented declaratively; SQL examples include key constraints,
check constraints, and view definitions. Effectively speaking, evaluation of
these happens atomically with their stimuli, like a type constraint firing on
an assignment to its variable.
Eliminating those, I find that the general functionality of a trigger is best
termed a "stimulus response rule", which is to specify things that are not
atomic with their stimuli, but generally take place "after". For example, you
could have one to keep track of what users do, timestamped. Or you could use
them to implement nondeterministic constraints such as "inserted date must be
before now()", which I would argue can't directly be in a normal constraint.
In fact, in a pure Muldis D application, where "the database" and "the
application" are the same thing (all code is stored routines), the "main
program" *is* a stimulus response rule, where the stimulus is the act of the
DBMS starting up. In a mixed-language application, such as one you would
normally write with Perl as the "main program", that wouldn't be the case.
But you get the idea that the role of "triggers" has both been reduced and expanded.
I believe some SQL DBMSs already support triggers on all sorts of activity, not
just table inserts/updates/deletes/etc, so this can be expressed.
> Ugh. This is a security hole.
I may have expressed wrong what was actually happening in the government
application, given that it was designed by very experienced people.
But whatever, that was an aside and need not be mentioned again.
> The datastore should provide and enforce one set of privileges. The
> application should provide and enforce another set of privileges. To
> conflate the two is to muddy the waters.
Well my current inclination is to not support defining database level privileges
at all in the first functioning release, and come back to that later when
someone needs it. Especially since these could be arbitrarily complicated.
>> 6b. How important is it to you to be able to have per-user privileges in
>> your databases? Or how important is it to be able to define these without
>> using SQL?
>
> This isn't an SQL problem and should be taken out of SQL. Most DBMS
> authn/authz systems are unnecessarily hampered by the "requirement"
> that all interaction with the engine be in SQL-like syntax. SQL should
> be about interacting with the data. Administrivia should be handled
> with a language designed for that purpose.
Fair enough. But then the matter still exists that if one wants to administer
the database, it needs to be expressible somehow.
I do agree with the principle of the DBMS essentially being a virtual machine
and as much as possible all interaction with it is in the form of machine
instructions. For a SQL DBMS the instructions would be written in SQL, but on
the other hand there is no saying the DBMS has to support only SQL.
This principle is in contrast to the idea that some things are better specified
out of band, such as with a special programmatic API to the VM where each
command has its own VM method or something.
Expressing more things in command language means a lot more flexibility in
evolving the DBMS, and its programmatic API can be kept simple, such as just
having a "run this" method.
Sort of like how the DBI module works, except that I wouldn't have all those
special methods like "table_info"; rather, users would just query the system
catalog or some such, through the "run this".
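That is, where DBI grows a special-purpose method, the "run this" style just
queries the catalog like any other data; compare (a sketch, using the standard
information_schema):

    # DBI's special-purpose metadata method ...
    my $sth = $dbh->table_info(undef, 'public', undef, 'TABLE');

    # ... versus the same answer through the one generic entry point.
    my $tables = $dbh->selectcol_arrayref(q{
        SELECT table_name
          FROM information_schema.tables
         WHERE table_schema = 'public'
           AND table_type   = 'BASE TABLE'
    });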
>> 7. Spatial types and operators. Who uses spatial or GIS or such types or
>> operators? How valuable is it to be able to do so without writing SQL?
>
> I don't use them. I know people who do. Instead of building these into
> the engine, much better for the engine to provide the ability to
> plugin new column types.
The latter, user-defined plugins, is a given. In fact, the ability to define
your own, arbitrarily complex, data types is one of the most fundamental
features of my project, and part of what makes it comparable to a "programming
language".
I raised question 7 because spatial types are provided built-in by multiple
DBMSs, and so to leverage those I thought it would be useful for me to also
have something built-in that people can use by default.
It's the same issue with any good programming language. Yes, it lets you build
your own anything, but also bundles some common things so people don't reinvent
the wheel.
As it is, the built-in spatial types are strictly an "optional extension" that
happens to be bundled from the start.
For that matter, dates and times (except for simple TAI timestamps and
durations) are *also* an "optional extension", but I didn't bring up dates and
times since those are a lot more commonly used, and I already specced them.
Basically my only core built-in base types are: booleans, integers, rationals,
bit strings, character strings, tuples, relations; other types, which are
defined in terms of those, include: sets, arrays, bags, intervals, TAI instant,
TAI duration; also there are core types mainly just used for code, not user
data, such as identifiers and comments and 'order'.
> NaN is a NULL. Of course, q.v. my answer to #7 (type plugins) for
> people who think otherwise.
Agreed. And I have not specced an "extension" for this as yet.
I expect to add the 2 core singleton types "+Inf" and "-Inf" though, which are
type-generic infinities, and useful in defining unbounded intervals.
Not the same thing though.
>> 9. Full text search. Who uses this versus a simple substring match?
>
> It's absolutely key to be able to provide this. q.v. my answer to #7
> (type plugins).
Agreed. And I plan to add a system-defined plugin for this, to front the FTS
commonly built into SQL DBMSs already, or several if there are several.
>> 10. Who uses regular expressions of any kind in the database? Either
>> Perl-compatible or otherwise. Basically any kind of pattern matching
>> besides the trivial kind that LIKE supports (literals plus . plus .*).
>
> I do, for full-text searching. q.v. my answer to #7 (type plugins).
Agreed. What I had in mind was to spec a separate system-defined optional
plugin for each kind of complex pattern match engine. One for Perl 5 regexes.
One for Perl 6 rules. One for Parrot's basic built-in pattern matching.
Etcetera. Only the simple matching of SQL's generic LIKE would be in core.
>> 11. Out of band sequence generators. If you read from / increment a
>> sequence generator from within a transaction, do you want a rollback of that
>> transaction to also rollback the sequence generator or not? Do you want
>> both possibilities to be supported. What do current DBMSs support?
>
> I don't know of a current DBMS that supports rolling back a sequence
> generator because too many people have (mis)used sequences as an
> insertion ordering. Frankly, I don't think sequence generators should
> be supported. Instead, provide nano-second timestamping so that you
> can always know that -this- row came before -that- row and UUID
> generators for primary keys (the most common use for sequence
> generators).
Makes sense.
One thing I could do is not build in sequence generators per se, but make it
easy for users to define their own. In fact, any built-in ones would be
designed as if they were built by users, for example:
1. The current state of a sequence generator is just stored in a relvar/table
like any user data. Whether there is a separate relvar per generator is up to
the user, though that approach would most closely mimic the concept of a SQL
DBMS having the generator as its own schema object.
2. To read the current/last sequence value without changing it, just read from
the relvar holding its value. A shorthand syntax for referencing the sole value
of a single-tuple/row, single-attribute/column relvar/table would be built-in.
3. To increment it, just update the value. A shorthand for this, which is
analogous to "$nextval = $foo++" or "$nextval = $foo+=$x" in Perl, would be
built-in.
4. Doing all of the above normally would make the generator increment subject
to transactions, as it is a normal relvar/table update.
5. To do that not subject to transactions, ostensibly you would do the
operation against the sequence value in a prior or parallel transaction, which
is successfully committed first, and then whatever you use the sequence value in
later won't cause the sequence to rollback.
Any built-in sequence generator support would be a thin veneer over the above,
and any native sequence generators in the backend would be abstracted into the
above as far as the user sees; if both are done, it's essentially two
mutually-reversing abstractions, so users can still work as if they were using
the SQL ones anyway, sort of.
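Concretely, the user-built version of points 1 through 4 above, in today's SQL
through DBI, is just (a sketch; the counters table is hypothetical, and
RETURNING is PostgreSQL syntax):

    # Point 1: generator state lives in an ordinary table.
    $dbh->do('CREATE TABLE counters (name text PRIMARY KEY, val int NOT NULL)');
    $dbh->do(q{INSERT INTO counters VALUES ('invoice_no', 0)});

    # Point 2: read the current value without changing it.
    my ($cur) = $dbh->selectrow_array(
        q{SELECT val FROM counters WHERE name = 'invoice_no'});

    # Points 3 and 4: increment-and-fetch is a plain update, so it is
    # subject to the enclosing transaction like any other table change.
    my ($next) = $dbh->selectrow_array(
        q{UPDATE counters SET val = val + 1
           WHERE name = 'invoice_no' RETURNING val});

For point 5, you would commit the counter update in a prior or parallel
transaction before using $next elsewhere.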
> YES YES YES!!!1! **ALL** SQL statements inside a transaction are
> subject to the transaction. Anyone who provides an "implicit commit"
> should be shot. Period, end of story.
I'm glad we agree.
> q.v. my answer to #6b about SQL for data interaction and some other
> language for administrivia.
I currently spec data definition and transaction management as a built-in, and
very fundamentally.
Also, a transaction is always scope-defined, by declaring for a stored procedure
"is transaction"; that transaction commits with a normal return and rolls back
with a thrown exception; that is currently the only way supported to do it; no
"statements" for transaction management. I believe that DBIx::Class's
"txn_do()" works exactly the same way if I'm not mistaken.
This said, a DBMS VM/wrapper's interface to Perl would have explicit transaction
control methods (begin, commit, rollback), breaking the general "run this"
simplicity, for the sole purpose of grouping several distinct database
interactions by Perl into one transaction, when those actions can't be, or one
doesn't want them to be, rendered as a single Muldis D "is transaction" procedure.
> I agree with subjugate-or-fail. Implicit -anything- is _ALWAYS_ bad.
> I'm talking about implicit commits, implicit type coercions, implicit
> case-insensitive matching . . . ANYTHING. I know why MySQL did things
> this way, but it's a bad practice. It's unpredictable and if you want
> your engine to have any traction in places that will pay you, you have
> to be predictable. Even Oracle could do more about this.
Predictable is *very* important to me.
Also, about your checklist:
1. There are no implicit type conversions, such as num-to-string or vice-versa.
You can assign a subtype value to a supertype variable, but that doesn't
change the *value*. Moreover, the explicit type conversions are detailed or
varied enough to reduce ambiguity; for example, the Int_from_Text() function
makes you declare what numeric base you are interpreting the text in (choose
from 2..36); you can also specify numeric literals in all 35 bases (same as
Perl 6), though for base-10 literals you have the shortcut that the base may be
implicit, same as in any normal language. Also, the Int_from_Rat() function
makes you explicitly declare what numeric rounding method you want to use, e.g.
half-up versus to-even etc. Of course, wrappers like DBIx::Class are welcome to
provide implicit config arguments so their users don't have to. Note that
Muldis D *does* support optional parameters, but some params are made mandatory
when it is thought that code omitting explicit arguments would be ambiguous,
such as in the prior 2 examples. On a related note, I don't have an "average"
function; rather, I have the 3 functions "mean", "median", and "mode", which
cuts ambiguity. (A rough Perl analogue of the two conversion functions follows
after this list.)
2. All text comparison is case-sensitive, accent-sensitive, etc, and to do
anything insensitively you have to explicitly fold the arguments, same as in
Perl or, I believe, in Oracle. Both text data and identifiers are that way,
same as in Perl and many other languages, and as with delimited SQL identifiers.
3. If an engine doesn't support subjugating data definition to transactions, it
will throw an exception (or fail to compile, if we would know at compile time)
when someone tries to do data definition inside a transaction (they'll need to
try again outside of one), rather than just going ahead.
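As promised under point 1, a rough Perl analogue of what Int_from_Text() and
Int_from_Rat() force you to spell out (this is not Muldis D syntax; the helper
below is hypothetical, and negative-number edge cases are glossed over):

    use POSIX qw(strtol floor);

    # Like Int_from_Text(): the numeric base is a mandatory argument.
    my ($n, $unparsed) = strtol('1f', 16);   # $n == 31
    die 'malformed integer text' if $unparsed;

    # Like Int_from_Rat(): the rounding method is a mandatory argument.
    sub int_from_rat {
        my ($rat, $method) = @_;
        return floor($rat + 0.5)         if $method eq 'half-up';
        return 0 + sprintf('%.0f', $rat) if $method eq 'to-even'; # C lib rounding
        die "unknown rounding method '$method'";
    }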
Thank you very much.
-- Darren Duncan