[Dbix-class] abstract syntax (extract from a conversation)

Sat Jul 21 10:55:00 GMT 2007

At 4:01 AM +0100 7/21/07, Matt S Trout wrote:
>This is the most coherent message to show the lines along which we were
>thinking (hdp is confound on #dbix-class).
>
>Those of you who don't have context for this, poke the list archives.
>
>Those of you who do, please bear in mind when responding what anything
>involving the creation of assloads of objects will be laughed at and/or
>ignored since the intention is SQL::Abstract-level performance.
<snippage>

Since its already nearly 2am here, I'll just put out a few of my 
ideas and leave the rest for later.

1.  I'm assuming your comment about objects means that you don't want 
to have, eg, an object for each scalar value or entity name or 
expression tree node or whatever.  Sure, that's fine, and should be 
good for speed, and perhaps brevity, but taking your examples as a 
point of departure, we'll probably want to add at least one more 
element to each of the many array refs defining expression nodes so 
to provide meta-data about the node, such as a replacement for the 
meta-data that the name of the class an object is blessed into 
provides.

2.  As far as I recall from a discussion we had on IRC, the new AST 
we are defining here is supposed to work not just with SQL databases 
but also databases accessed via some other language, such as LDAP, or 
my Muldis D.  Partly for this reason, and partly just because SQL 
DBMSs differ from themselves enough that talking to them is like 
multiple languages, I believe that our AST should not conceptually be 
limited by some SQL lowest common denominator, and it should not 
simply try to mirror the structure of a simple select query.

3.  Don't go lowest common denominator.  If our AST is good, glue 
code that talks to a less capable back-end should be able to break 
down what the AST says into smaller chunks that the back-end 
understands, and feed them appropriately so that the back-end still 
does the right thing, and appropriately gather the results and return 
them as if the back-end was able to do that natively.

4.  As should be the nature of ASTs, the focus of ours should be to 
accurately representing the *semantics* or meaning of what the user 
wants.  The AST should provide the means to explicitly say what the 
desired behaviour is for particular constructs, any time there is a 
reasonable chance that either different backends have different 
defaults in that regard, or users are likely to have different 
default expectations.  Of course, our AST can have various default 
behaviours defined for it such that users don't have to be explicit 
about some details if their desires match the defaults, but we still 
need to specify it in the design docs of our AST itself, and not just 
leave a lot of things to be back end implementation defined.

5.  Our AST should be strongly typed from end to end, which assists 
in semantics.  Any piece of data that it carries should know whether 
it is text or a number or whatever.  That way, if we have '0124', we 
know how round-tripping it through a database would retain the 
leading zero or not.  Matters of case-sensitivity need to be defined 
and not left to back-end defaults.  That's not to say that we can't 
have generic types, as per Perl scalars, but these should be defined 
over stronger types, such that eg every value is of a certain 
stronger type, but a particular variable is allowed to hold values of 
any of several types.

6.  We need to define our own full set of system-defined types and 
operators, which users of our AST invoke, and which back-ends gluing 
our AST then convert into or emulate native equivalents.  Moreover, I 
recommend that our names for all such things are spelled with just 
letters, eg use 'equal' and 'not_equal' rather than '==' or '!=' etc.

7.  It is essential to have the distinct concept of a logical boolean 
data type, and values, and operators.  This is the result type of 
equality tests or and|or etc.

8.  The most important distinct simple data types are: boolean, 
integer, bit string (blob), character string (text); then other 
numerics, then temporal types, then whatever such as spatial types if 
we want them.

9.  The AST should support the concept of having collection-typed 
values, so that eg we can have table field values that are themselves 
eg tables, rows, arrays, etc.  Nevermind whether the back-end DBMS 
can do this, some can, some can't, where they don't, we can fake it 
by splitting tables behind the scenes.  If we have native support 
like this, it should be easy to, say, formulate a query over eg a 
one-to-many table relationship that returns both a parent record and 
its child records, in a single result set, without duplication; the 
result set eg has one row per parent record, and one field of that 
row is table-typed and contains the child records.

10.  Tables should always contain, and queries should always return, 
no duplicate rows, if not always than at least by default.  Users 
should have to explicitly say if they want duplicates, and if not 
then every row will be distinct.  This is what most people want 
anyway, and doing it by default will significantly reduce bugs in 
user code that crop up due to duplicates being present.

11.  Any operator that is conceptually N-ary should simply be defined 
to take N similar arguments, that argument being an array ref or 
something.  Similarly, 'and' and 'or' should be ordinary N-ary 
boolean operators, as are string concatenation, and numerical 
addition and multiplication.  For that matter, relational union, 
intersection, and natural join are all N-ary as well.  All the 
examples are commutative, save concatenation, and all are 
associative.  Put another way, any N-ary operator is a "reduce 
operator", iterating over a list to produce one result.

12.  Our AST should be setup to only allow column names in a rowset 
to be distinct.  If eg 2 tables are joined that have common column 
names, then if those 2 columns represent the same data and are 
redundant following the join, then eliminate one, or else if they 
don't rep the same data, rename one (SQL has 'as' for a reason).

13.  Columns should be referred to by name only, not by any ordinal 
position.  When specifying a relational union, the column names of 
both operands need to be the same, and columns will match up on 
common names.

14.  Relational joins should all be natural joins, such that given 2 
rowset/table operands, the join should simply match them up on 
columns of the same name (if necessary, columns of the operands can 
be renamed first to either be the same or different as needed). 
Doing it this way lets an N-table join be commutative, and the result 
won't have any duplicate columns.  Its also easier to specify since, 
aside from possible column renaming of the operands, you don't need 
to specify join conditions to do a join.  And if the db schema is 
well designed in the first place, you often won't have to rename 
columns either when joining them, or not often.

15.  There should be distinct entity name spaces for system-defined 
types and operators, and user-defined ones.  Eg, have either a tag or 
a name prefix on eg every operator call to specify.  So then, our AST 
can specify invocation of stored procedures or functions et al the 
same way it specifies using other operators.  Eg, 'sys.Int.add' 
versus 'user.bar_schema.foo_proc'.  Doing it this way, there's no 
concern about reserved words.

16.  The AST should treat a query as an arbitrary depth self-similar 
expression tree, where both scalar and relational operators can be 
called in any place.  In SQL terms, the AST should embrace derived 
tables or subqueries or whatever.  Don't leave these out just because 
some backends don't have them; we can fake it there if we have to.

17.  The AST should just use such as the various simpler relational 
algebra or calculus operators rather than monolithic 'select'.  For 
example, each of these is done using a separate operator, calls to 
which can be chained: selecting a subset of columns, filtering rows, 
joining rowsets, unioning rowsets, attaching new columns, grouping 
rows, summarizing rows, sorting rows, etc.

So that's probably a good start.  I can suggest specific alterations 
to the example syntax if that is useful or people can't get what I'm 
saying above without such examples.

Meanwhile, I highly recommend looking at 
http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language/Core.pod 
, which is the currently defined list of core system-defined 
operators (and data types) of Muldis D.  This may give you some good 
ideas for what specific operators you want to have built-in to our 
AST definition.  Note that if you don't understand some of my 
terminology, or want some context, you may want to read 
http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language.pod 
first, at least the NOTES ON TERMINOLOGY section.  For example, I say 
'relation value' rather than 'rowset' and 'relation variable' rather 
than 'table'.

P.S.  And now it's 3am here.

-- Darren Duncan