[Dbix-class] abstract syntax (extract from a conversation)
Darren Duncan
darren at DarrenDuncan.net
Sat Jul 21 10:55:00 GMT 2007
At 4:01 AM +0100 7/21/07, Matt S Trout wrote:
>This is the most coherent message to show the lines along which we were
>thinking (hdp is confound on #dbix-class).
>
>Those of you who don't have context for this, poke the list archives.
>
>Those of you who do, please bear in mind when responding that anything
>involving the creation of assloads of objects will be laughed at and/or
>ignored since the intention is SQL::Abstract-level performance.
<snippage>
Since it's already nearly 2am here, I'll just put out a few of my
ideas and leave the rest for later.
1. I'm assuming your comment about objects means that you don't want
to have, eg, an object for each scalar value or entity name or
expression tree node or whatever. Sure, that's fine, and should be
good for speed, and perhaps brevity, but taking your examples as a
point of departure, we'll probably want to add at least one more
element to each of the many array refs defining expression nodes so
to provide meta-data about the node, such as a replacement for the
meta-data that the name of the class an object is blessed into
provides.
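
To make point 1 concrete, here is a minimal sketch of a tagged node.
Python lists stand in for Perl array refs, and the tag names ('op',
'col', 'lit') and helper functions are hypothetical, not anything
agreed on the list. The leading element carries the meta-data that a
class name would otherwise provide.

```python
# Hypothetical node layout: [tag, payload...]. No objects are created;
# the first element says what kind of node this is.
def lit(type_name, value):
    return ["lit", type_name, value]

def col(name):
    return ["col", name]

def op(name, *args):
    return ["op", name, list(args)]

def node_kind(node):
    """Return the meta-data tag of a node without any object dispatch."""
    return node[0]

expr = op("equal", col("age"), lit("Int", 42))
print(node_kind(expr))        # op
print(node_kind(expr[2][0]))  # col
```

The cost over an untagged array is one extra element per node.
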
2. As far as I recall from a discussion we had on IRC, the new AST
we are defining here is supposed to work not just with SQL databases
but also databases accessed via some other language, such as LDAP, or
my Muldis D. Partly for this reason, and partly just because SQL
DBMSs differ enough among themselves that talking to them is like
speaking multiple languages, I believe that our AST should not be
conceptually limited by some SQL lowest common denominator, and it
should not simply try to mirror the structure of a simple select
query.
3. Don't go lowest common denominator. If our AST is good, glue
code that talks to a less capable back-end should be able to break
down what the AST says into smaller chunks that the back-end
understands, and feed them appropriately so that the back-end still
does the right thing, and appropriately gather the results and return
them as if the back-end was able to do that natively.
4. As should be the nature of ASTs, the focus of ours should be on
accurately representing the *semantics* or meaning of what the user
wants. The AST should provide the means to explicitly say what the
desired behaviour is for particular constructs, any time there is a
reasonable chance that either different backends have different
defaults in that regard, or users are likely to have different
default expectations. Of course, our AST can have various default
behaviours defined for it such that users don't have to be explicit
about some details if their desires match the defaults, but we still
need to specify it in the design docs of our AST itself, and not just
leave a lot of things to be back end implementation defined.
5. Our AST should be strongly typed from end to end, which assists
in semantics. Any piece of data that it carries should know whether
it is text or a number or whatever. That way, if we have '0124', we
know whether round-tripping it through a database should retain the
leading zero or not. Matters of case-sensitivity need to be defined
and not left to back-end defaults. That's not to say that we can't
have generic types, as per Perl scalars, but these should be defined
over stronger types, such that eg every value is of a certain
stronger type, but a particular variable is allowed to hold values of
any of several types.
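
A small sketch of point 5, assuming a hypothetical encoding in which
every value is a (type name, payload) pair. The type names 'Text' and
'Int' and the helpers are illustrative only. It shows why '0124'
round-trips differently depending on its declared type, and how a
generic slot can be defined over the stronger types.

```python
# Hypothetical typed-value encoding: (type_name, payload).
def text(s):
    return ("Text", str(s))

def integer(n):
    return ("Int", int(n))

def render(value):
    """Render a typed value back to a string, as a round trip would."""
    _type_name, payload = value
    return str(payload)

# A generic "scalar" slot accepts values of any of several strong types.
SCALAR = {"Text", "Int"}

def fits(value, allowed_types):
    return value[0] in allowed_types

print(render(text("0124")))     # 0124
print(render(integer("0124")))  # 124
```
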
6. We need to define our own full set of system-defined types and
operators, which users of our AST invoke, and which back-ends gluing
our AST then convert into or emulate native equivalents. Moreover, I
recommend that our names for all such things are spelled with just
letters, eg use 'equal' and 'not_equal' rather than '==' or '!=' etc.
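
As a rough illustration of point 6, the glue for each back-end could
hold a table from our letter-spelled operator names to native
spellings. The table layout and the render() helper below are
assumptions made for the example, and no quoting or escaping of
operands is attempted.

```python
# Hypothetical per-back-end operator table. Where a back-end has no
# native operator (LDAP filters have no "not equal"), the template
# emulates it with what the back-end does have.
NATIVE_OPS = {
    "sql": {"equal": "{l} = {r}", "not_equal": "{l} <> {r}"},
    "ldap": {"equal": "({l}={r})", "not_equal": "(!({l}={r}))"},
}

def render(backend, op_name, left, right):
    try:
        template = NATIVE_OPS[backend][op_name]
    except KeyError:
        raise ValueError(f"{backend} cannot express {op_name}") from None
    return template.format(l=left, r=right)

print(render("sql", "not_equal", "age", "42"))   # age <> 42
print(render("ldap", "not_equal", "cn", "bob"))  # (!(cn=bob))
```
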
7. It is essential to have the distinct concept of a logical boolean
data type, and values, and operators. This is the result type of
equality tests, and of 'and', 'or', etc.
8. The most important distinct simple data types are: boolean,
integer, bit string (blob), character string (text); then other
numerics, then temporal types, then whatever such as spatial types if
we want them.
9. The AST should support the concept of having collection-typed
values, so that eg we can have table field values that are themselves
eg tables, rows, arrays, etc. Never mind whether the back-end DBMS
can do this; some can, some can't, and where they can't, we can fake
it by splitting tables behind the scenes. If we have native support
like this, it should be easy to, say, formulate a query over eg a
one-to-many table relationship that returns both a parent record and
its child records, in a single result set, without duplication; the
result set eg has one row per parent record, and one field of that
row is table-typed and contains the child records.
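
The one-to-many case in point 9 can be sketched as follows, with rows
as Python dicts and a table-typed field as a list of dicts. The
function name nest_children and the sample data are hypothetical. A
back-end without native support could produce the same shape from two
flat queries.

```python
def nest_children(parents, children, key, field):
    """Return one row per parent; `field` holds that parent's child rows."""
    grouped = {}
    for child in children:
        # Drop the join key from the child row; the parent carries it.
        rest = {k: v for k, v in child.items() if k != key}
        grouped.setdefault(child[key], []).append(rest)
    return [{**parent, field: grouped.get(parent[key], [])}
            for parent in parents]

artists = [{"artist_id": 1, "name": "A"}, {"artist_id": 2, "name": "B"}]
cds = [
    {"artist_id": 1, "title": "First"},
    {"artist_id": 1, "title": "Second"},
]
result = nest_children(artists, cds, "artist_id", "cds")
print(len(result))  # 2
```

Each parent appears exactly once, and a parent with no children gets
an empty table rather than a row of nulls.
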
10. Tables should always contain, and queries should always return,
no duplicate rows, if not always then at least by default. Users
should have to explicitly say if they want duplicates, and if not
then every row will be distinct. This is what most people want
anyway, and doing it by default will significantly reduce bugs in
user code that crop up due to duplicates being present.
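
A minimal sketch of the default in point 10, assuming rows are dicts
with hashable values. The name make_rowset and the allow_duplicates
flag are hypothetical; the point is only that duplicates are opt-in.

```python
def make_rowset(rows, allow_duplicates=False):
    """Build a rowset; rows are distinct unless the caller opts out."""
    if allow_duplicates:
        return list(rows)
    seen = set()
    distinct = []
    for row in rows:
        key = frozenset(row.items())  # order-independent row identity
        if key not in seen:
            seen.add(key)
            distinct.append(row)
    return distinct

raw = [{"id": 1}, {"id": 1}, {"id": 2}]
print(make_rowset(raw))                             # [{'id': 1}, {'id': 2}]
print(len(make_rowset(raw, allow_duplicates=True)))  # 3
```
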
11. Any operator that is conceptually N-ary should simply be defined
to take its N similar operands as one argument, that argument being
an array ref or something. Similarly, 'and' and 'or' should be
ordinary N-ary
boolean operators, as are string concatenation, and numerical
addition and multiplication. For that matter, relational union,
intersection, and natural join are all N-ary as well. All the
examples are commutative, save concatenation, and all are
associative. Put another way, any N-ary operator is a "reduce
operator", iterating over a list to produce one result.
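
The "reduce operator" reading of point 11 can be sketched like this.
The operator names and the NARY table are hypothetical; each entry
pairs a binary function with its identity value, so a call with zero
operands is still well defined.

```python
from functools import reduce
import operator

# Hypothetical table: name -> (binary function, identity value).
NARY = {
    "and": (operator.and_, True),
    "or": (operator.or_, False),
    "add": (operator.add, 0),
    "multiply": (operator.mul, 1),
    "concat": (operator.add, ""),
}

def apply_nary(name, operands):
    """Fold a list of operands into one result."""
    func, identity = NARY[name]
    return reduce(func, operands, identity)

print(apply_nary("add", [1, 2, 3]))           # 6
print(apply_nary("concat", ["a", "b", "c"]))  # abc
print(apply_nary("and", []))                  # True
```
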
12. Our AST should be set up to only allow column names in a rowset
to be distinct. If eg 2 tables are joined that have common column
names, then if those 2 columns represent the same data and are
redundant following the join, then eliminate one, or else if they
don't represent the same data, rename one (SQL has 'as' for a reason).
13. Columns should be referred to by name only, not by any ordinal
position. When specifying a relational union, the column names of
both operands need to be the same, and columns will match up on
common names.
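
A sketch of the union rule in point 13, assuming rows are dicts keyed
by column name with hashable values. The function is hypothetical. It
refuses operands whose column names differ, and because rows are
matched by name the order in which columns were written is
irrelevant.

```python
def union(a, b):
    """Relational union of two lists of dict rows, matched by name."""
    rows = a + b
    if not rows:
        return []
    columns = set(rows[0])
    seen, out = set(), []
    for row in rows:
        if set(row) != columns:
            raise ValueError("union operands must have the same column names")
        key = frozenset(row.items())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

left = [{"id": 1, "name": "a"}]
right = [{"name": "a", "id": 1}, {"name": "b", "id": 2}]
print(len(union(left, right)))  # 2
```
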
14. Relational joins should all be natural joins, such that given 2
rowset/table operands, the join should simply match them up on
columns of the same name (if necessary, columns of the operands can
be renamed first to either be the same or different as needed).
Doing it this way lets an N-table join be commutative, and the result
won't have any duplicate columns. It's also easier to specify since,
aside from possible column renaming of the operands, you don't need
to specify join conditions to do a join. And if the db schema is
well designed in the first place, you won't often have to rename
columns when joining them, either.
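
Point 14 can be sketched as a natural join plus an optional rename
step. Both helpers and the sample tables are hypothetical, rows are
dicts, and the nested loop is for clarity rather than speed.

```python
def rename(rel, mapping):
    """Rename columns, e.g. so that two operands share a join column."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rel]

def natural_join(a, b):
    """Match rows on all columns that have the same name in both operands."""
    if not a or not b:
        return []
    common = set(a[0]) & set(b[0])
    return [{**ra, **rb}
            for ra in a for rb in b
            if all(ra[c] == rb[c] for c in common)]

artists = [{"artist_id": 1, "name": "A"}]
cds = [{"cd_id": 10, "artist": 1, "title": "T"},
       {"cd_id": 11, "artist": 2, "title": "U"}]

# Rename first so the operands share a column name; no join condition.
joined = natural_join(artists, rename(cds, {"artist": "artist_id"}))
print(joined)
# [{'artist_id': 1, 'name': 'A', 'cd_id': 10, 'title': 'T'}]
```

Swapping the operands yields the same rows, and the shared column
appears only once in the result.
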
15. There should be distinct entity name spaces for system-defined
types and operators, and user-defined ones. Eg, have either a tag or
a name prefix on eg every operator call to specify. So then, our AST
can specify invocation of stored procedures or functions et al the
same way it specifies using other operators. Eg, 'sys.Int.add'
versus 'user.bar_schema.foo_proc'. Doing it this way, there's no
concern about reserved words.
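
A sketch of the name spaces in point 15. The names 'sys.Int.add' and
'user.bar_schema.foo_proc' come from the text above; the registry
layout, the call() helper, and the body given to foo_proc are
assumptions made for the example.

```python
# Separate registries, selected by the leading name-space prefix.
SYSTEM_OPS = {"sys.Int.add": lambda a, b: a + b}
USER_OPS = {}

def register_user_op(name, func):
    USER_OPS["user." + name] = func

def call(qualified_name, *args):
    """Invoke a system or user operator through the same mechanism."""
    space = qualified_name.split(".", 1)[0]
    table = {"sys": SYSTEM_OPS, "user": USER_OPS}.get(space)
    if table is None or qualified_name not in table:
        raise KeyError(qualified_name)
    return table[qualified_name](*args)

# Hypothetical stored procedure standing in for a real one.
register_user_op("bar_schema.foo_proc", lambda x: x * 2)

print(call("sys.Int.add", 2, 3))            # 5
print(call("user.bar_schema.foo_proc", 21))  # 42
```

A user routine named 'add' or 'select' cannot clash with anything,
since it always lives under the 'user.' prefix.
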
16. The AST should treat a query as an arbitrary depth self-similar
expression tree, where both scalar and relational operators can be
called in any place. In SQL terms, the AST should embrace derived
tables or subqueries or whatever. Don't leave these out just because
some backends don't have them; we can fake it there if we have to.
17. The AST should just use the various simpler relational
algebra or calculus operators rather than a monolithic 'select'. For
example, each of these is done using a separate operator, calls to
which can be chained: selecting a subset of columns, filtering rows,
joining rowsets, unioning rowsets, attaching new columns, grouping
rows, summarizing rows, sorting rows, etc.
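
The chaining described in point 17 might look like this, with each
step a small separate operator over lists of dict rows. The operator
names (restrict, project, extend, order_by) and the sample data are
hypothetical.

```python
def restrict(rel, predicate):
    """Filter rows."""
    return [row for row in rel if predicate(row)]

def project(rel, columns):
    """Select a subset of columns; the result has no duplicate rows."""
    seen, out = set(), []
    for row in rel:
        picked = {c: row[c] for c in columns}
        key = frozenset(picked.items())
        if key not in seen:
            seen.add(key)
            out.append(picked)
    return out

def extend(rel, name, func):
    """Attach a new column computed from each row."""
    return [{**row, name: func(row)} for row in rel]

def order_by(rel, column):
    """Sort rows by one column."""
    return sorted(rel, key=lambda row: row[column])

people = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bob", "age": 17, "city": "Rome"},
    {"name": "Cy", "age": 45, "city": "Oslo"},
]
adults = restrict(people, lambda r: r["age"] >= 18)
with_decade = extend(adults, "decade", lambda r: r["age"] // 10 * 10)
result = order_by(project(with_decade, ["name", "decade"]), "decade")
print(result)
# [{'name': 'Ann', 'decade': 30}, {'name': 'Cy', 'decade': 40}]
```
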
So that's probably a good start. I can suggest specific alterations
to the example syntax if that is useful or people can't get what I'm
saying above without such examples.
Meanwhile, I highly recommend looking at
http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language/Core.pod
, which is the currently defined list of core system-defined
operators (and data types) of Muldis D. This may give you some good
ideas for what specific operators you want to have built in to our
AST definition. Note that if you don't understand some of my
terminology, or want some context, you may want to read
http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language.pod
first, at least the NOTES ON TERMINOLOGY section. For example, I say
'relation value' rather than 'rowset' and 'relation variable' rather
than 'table'.
P.S. And now it's 3am here.
-- Darren Duncan