[Dbix-class] RFC - some DBMS features you use
Darren Duncan
darren at darrenduncan.net
Fri Feb 5 06:33:43 GMT 2010
Darren Duncan wrote:
> Rob Kinyon wrote:
>> Locking, afaict, is a way of managing the conflicting transactions
>> problem. Even though it's harder to program (at least, initially),
>> database engines should really support indicating that a transaction
>> commit would have a "merging conflict", similar to how git works.
>> Then, the client can query the engine as to what the nature of the
>> conflict is, resolve it, and attempt to commit again with resolved
>> fixes. Normally, this would be done by discarding actions done in the
>> transaction and restarting with a fresh view of the database.
>
> Interesting that you mention git, because a few weeks ago I made a rough
> draft proposal in another (database-but-not-Perl-specific) forum about
> using version control systems, specifically git, as inspiration for
> conceptualizing how relational DBMSs can manage concurrency issues,
> meaning that an RDBMS would fundamentally be multi-versioned; that is
> also the approach I was thinking of taking Muldis D in; of course, you
> can easily express current DBMS' feature sets using that paradigm.
>
> I will forward a copy of that next as a reply to this message.
>
> In this paradigm, using locks is not necessary for concurrency, but it
> is useful as a preemptive action by a user to prevent merge conflicts.
And here below the line is that draft, nearly pristine.
Please note that, being written to a different audience, I may have used some
terminology that I assume they know my meaning of, while you might not. But on
the other hand it is generally the same terminology I use in my language spec
itself. Also, some details are now out of date with my current thoughts, but
not too much. (More recently, a truncated version of the below is also stuffed
as an item in the TODO file of the Muldis D spec.)
============
I propose a conceptual framework for database transactions that is strongly
inspired by how distributed source-code version control systems (VCSs) work, in
particular drawing on my limited knowledge of GIT (see http://git-scm.com/ and
http://eagain.net/articles/git-for-computer-scientists/ also) specifically.
Being a *conceptual* framework, this isn't proposing syntax, but more just
general semantics and features that could have syntax provided for them.
My proposed framework is orthogonal or complementary to the relational model, or
that is to say, using it does not require that a database is comprised of
relvars or that the database has any particular structure; for the rest of this
proposal, when I say "database", I just mean an arbitrarily complex collection
such as any tuple whose attributes are all relations, however you can always
read it as the latter.
When drawing an analogy between the data that a VCS manages and the data of an
RDBMS, don't interpret this proposal as mapping any particular components. Each
folder or file or file line managed by a VCS is not intended to correspond
specifically to any of an RDBMS relvar or tuple or attribute thereof. All that
matters for this proposal is that one entire database under management by an
RDBMS corresponds to one entire depot/repository under management by a VCS, and
either of those things being managed can be arbitrarily complex, and is structured.
The fundamental feature of the framework is that the DBMS is managing a depot
consisting of 1..N versions of the same database, where every one of these
versions is both consistent and durable. A "version" is also called a "commit";
anywhere that the word "commit" appears as a noun in this proposal, it means "a
database version". Each commit is completely defined in isolation,
conceptually, and so any commits in a depot may be deleted without compromising
each of the other commits' ability to define a version of the entire database.
It is implementation-dependent as to how the commits are actually stored, such
as each having all of the data versus most of them just having deltas from some
other commit; what matters is that each commit *appears* to be self-contained.
Every commit is created as a single atomic action, and it is never modified
afterwards, though it may be later deleted (also an atomic action).
Every in-DBMS user process, henceforth called "user", has its own concept of the
current state of the database, which is one of the depot's commits that is
designated a "head". A user's current head is never replaced during the course
of the in-DBMS process unless the user explicitly replaces it, such as by either
performing an update or requesting to see the latest version (the latter done
such as with an explicit "synchronize" control statement). Therefore, each user
is highly isolated from all the others, and is guaranteed consistent repeatable
reads and no phantoms; they will get repeatable reads until they request
otherwise. (If a DBMS policy limits for how long they may keep this guarantee
after someone else starts to update the database, then trying to use it beyond
that time will produce an exception rather than inconsistent results, which they
would resolve by requesting to see the latest version.)
The framework has no native concept of "nesting transactions" or "savepoints" or
explicit "commit" or "rollback" commands. Rather, every single DBMS-performed
parent-most multi-update statement (which is the smallest scope where the
database is required to be consistent both immediately before and immediately
after its execution), is a durable atomic transaction all by itself. The effect
of a successful multi-update statement is to both produce a new (durable) commit
in the depot and to update the executing user's "head" to be that new commit
(the prior commit may then be deleted automatically depending on circumstances);
a failed multi-update statement is a no-op for the depot, and the user gets a
thrown exception. A successful statement can sometimes be "undone" simply by
updating the executing user's "head" to a prior commit and then optionally
deleting the later commit; other times, it would take a forward-moving new
commit consisting of another multi-update statement that performed updates with
the opposite semantics.
A depot's commits are arranged in a directed acyclic graph where each commit
save the oldest cites 1..N other commits as its parents, and conversely each
commit may have 0..N children. A child commit has exactly 1 parent when it was
created as the result of executing a multi-update statement in the context of
the parent commit; the parent commit is the pre-update state of the database and
the child is the post-update state of the database. A child commit has multiple
parents when it is the result of merging or serializing the changes of multiple
users' statements that ran in parallel. One main purpose of tracking parents
like this is for reliable merging of parallel changes, so that the intended
semantics of each change can be interpreted correctly, and potential conflicts
can be easily detected, and effectively resolved. More on how this works
follows below. Note that commits simply have unique identifiers to be
referenced with and there is no implied ordering between them if they are
generated as serial numbers or using date stamps, though commits with earlier
date stamps are given priority in the case of a merge conflict.
So a multi-update statement is the only native "transaction" concept, and it is
ACID all by itself. Now, the multi-statement "transactions" or concepts of
nested transactions or savepoints would all be syntactic sugar over the native
concept, and basically involve keeping track of commits prior to the head and
optionally making an older one the head. My personal preference would be to
define the syntactic sugar as being tied to code block boundaries rather than
command statements; doing this we avoid the "goto" problem and don't ever have
to manually handle commit ids or human readable labels for said that savepoint
names are.
This framework uses the VCS concept of "branching" (which is something that GIT
strongly encourages the use of, as GIT makes later "merging" relatively
painless) as the native way to manage concurrent autonomous database updates by
multiple users. By default, when no users have made any changes to the
database, a depot just has a "trunk", and its childmost or only commit is called
"master"; every database user process' "head" starts off as the "master" commit
when that process starts. Each (autonomous) user process that wants to update
the database will start by creating a new branch off of the trunk, and
subsequent commits of theirs will go into that, rather than into the trunk or
some other branch. The trunk is shared by all users while each user's branch is
just for that user, as their private working space.
Note that, unlike a VCS in general where branches can become long-lived and
interact with each other independently of the trunk, the framework instead
follows the typical needs of an RDBMS, which espouses a single world view as
being dominant over any others, and expects that any branches will be very
short-lived, not existing for longer than a conceptual "database transaction"
would; only the trunk is expected to be long-lived. (This isn't to say that a
DBMS can't maintain them long term, but one that acts like a typical RDBMS of
today wouldn't.)
So, lets start with a simple scenario to begin with where there only exists a
single user / in-DBMS process and there are no conflicts or needs to rollback.
1. Commit C1 is the initial state of the database and it is the only version
that the depot contains; C1 lives in the trunk and is the "master". The user
process U1 exists and its head is C1 and it has no branches yet.
2. U1 creates branch B1, which designates that its first commit would have C1
as its sole parent.
3. U1 executes a multi-update S1 which succeeds and the result is that commit
C2 is created, in branch B1, and U1's head is now C2.
4. U1 executes a multi-update S2 which fails for a constraint violation and the
result is a no-op (plus a thrown exception which U1 chooses to catch since it
had S2 wrapped in a try-block).
5. U1 now wants to merge branch B1 into the trunk.
6. U1 moves commit C2 to the trunk from branch B1, and C2 is designated master
over C1, and (empty) branch B1 is then deleted.
Now another scenario that has 2 updating user processes whose updates may
possibly conflict.
1. Commit C1 is the initial state of the database and it is the only version
that the depot contains; C1 lives in the trunk and is the "master". The user
process U1 exists and its head is C1 and it has no branches yet. The user
process U2 exists and its head is C1 and it has no branches yet.
2. U1 creates branch B1, which designates that its first commit would have C1
as its sole parent.
3. U2 creates branch B2, which designates that its first commit would have C1
as its sole parent.
4. U1 executes a multi-update S1 which succeeds and the result is that commit
C2 is created, in branch B1, and U1's head is now C2.
5. U2 executes a multi-update S2 which succeeds and the result is that commit
C3 is created, in branch B2, and U2's head is now C3.
6. U1 now wants to merge branch B1 into the trunk.
7. U1 moves commit C2 to the trunk from branch B1, and C2 is designated master
over C1, and (empty) branch B1 is then deleted.
8. U2 now wants to merge branch B2 into the trunk.
9. U2 tries to move commit C3 to the trunk from branch B1 but fails because C3
isn't a direct child of the current master, C2.
10. U2 creates commit C4 in B2 by merging commits C3 and C2; C4 cites both C3
and C2 as parents. If database version C4 would violate any constraints / be
inconsistent, then commit C4 would fail to be created, the effect of this
attempt a depot no-op plus a thrown exception; the user would then have to deal
with it same as if S2 had failed. For this example, we'll say that the merge
succeeded with no constraint violations / inconsistencies.
11. U2 creates commit C5 by rebasing C4 (the current head of B2) so it now
designates C2 (the current master) as its direct and only parent (C5 represents
an identical database to C4); all of the other commits in branch B2 are deleted,
as C5 consolidated them into itself.
12. U2 moves commit C5 to the trunk from branch B2, and C5 is designated master
over C2, and (empty) branch B2 is then deleted.
Note that merge steps of trunk to branch can be done at any time, such as
between each statement done in the branch, whether inside or outside (typically
inside) an explicit transaction or not; they don't have to just be done at the
end. But the conceptual framework assumes that the user, U2 in this case, can
choose exactly when they happen, since this also affects the ability to do a
repeatable read.
Now another scenario that still just has one user process but adds use of
explicit multi-statement transaction syntax by the user.
1. Commit C1 is the initial state of the database and it is the only version
that the depot contains; C1 lives in the trunk and is the "master". The user
process U1 exists and its head is C1 and it has no branches yet.
2. U1 creates branch B1, which designates that its first commit would have C1
as its sole parent.
3. U1 starts transaction T1 in branch B1, which designates C1 still of interest
by T1.
4. U1 executes multi-update S1, which creates C2 in branch B1 and sets U1's
head to it.
5. U1 executes multi-update S2, which creates C3 and sets U1's head to it.
6. U1 starts transaction T2, which designates C3 still of interest.
7. U1 executes multi-update S3, which creates C4 and sets U1's head to it.
8. U1 executes multi-update S4, which creates C5 and sets U1's head to it.
9. U1 ends transaction T2 with success; this designates C3 as no longer of
interest to T2.
10. U1 starts transaction T3, which designates C5 still of interest.
11. U1 executes multi-update S5, which creates C6 and sets U1's head to it.
12. U1 executes multi-update S6, which fails and is a no-op; S6 throws an
exception.
13. The exception handler as proxy for U1 ends transaction T3 with failure;
this first resets U1's head to C5 and second designates C5 as no longer of
interest to T3.
14. U1 had a try-catch block around T3 which traps the exception produced by S6.
15. U1 executes multi-update S7, which creates C7 and sets U1's head to it.
16. U1 ends transaction T1 with success; this designates C1 as no longer of
interest to T1.
17. U1 now wants to merge branch B1 into the trunk.
18. U1 creates commit C8 by rebasing C7 (the current head of B1) so it now
designates C1 (the current master) as its direct and only parent (C8 represents
an identical database to C7); all of the other commits in branch B2 are deleted,
as C8 consolidated them into itself.
19. U1 moves commit C8 to the trunk from branch B1, and C8 is designated master
over C1, and (empty) branch B1 is then deleted.
As a typical final step to either of the 3 above scenarios, all of the other
commits in the trunk except for the 1 master would also be deleted, as in a
typical DBMS that didn't maintain history, they are no longer needed.
Note that the final action on a branch that involves moving its head commit to
the trunk as master, this would be perceived by all other DBMS users as all of
the changes wrought by the branch being a single atomic update.
And so, this conceptual framework should work to provide true ACID to both whole
"transactions" as well as each individual multi-update statement, or "inner
transaction", comprising said.
Now this message hasn't touched on write-locking, but that can be applied also,
in appropriate places, though presumably it would have to be applied across
database versions.
-- Darren Duncan
More information about the DBIx-Class
mailing list