[Dbix-class] Re: how to use a complex query in DBIx::Class

Mon Jan 1 17:39:32 GMT 2007

* Matt LeBlanc <antirice at gmail.com> [2006-12-31 10:30]:
> On 12/30/06, A. Pagaltzis <pagaltzis at gmx.de> wrote:
> >* Matt LeBlanc <mleblanc at cpan.org> [2006-12-31 03:40]:
> >> On 12/30/06, A. Pagaltzis <pagaltzis at gmx.de> wrote:
> >> >You're abusing the ON clause. You're not joining on the userid
> >> >condition; you're constraing the result set. You should put it
> >> >in the WHERE clause:
> >>
> >> Forgive my ignorance but could you please give me a reason why
> >> this is an abuse? This is standard ANSI SQL.
> >
> >How does "abuse" imply it's not in the standard? Finding out the
> >length of an array in Perl by saying `$#array + 1` is valid and
> >working Perl as well, and yet it's abuse, because the semantics
> >are off.
> 
> Although you will disagree, I do not find your example to be analogous
> to the query presented by Wan.

I know. I didn’t come up with a better example off the cuff; your
mention of regexes was a much better analogy.

> We could simply have just returned hiddenboards.userid and
> checked if userid was not undefined in our perl code. Would
> this too have been an abuse?

It’s not abuse in that you are not using an inappropriate
language construct, but you *are* distributing concerns badly.

> >I did not realise that the ON clause behaviour is affected in
> >this way because of the LEFT JOIN, although in retrospect it
> >is obvious. (Jess didn't catch this point either, I notice.)
> >I'm pretty sure that issue would catch out a *lot* more people
> >glancing at the query than just me and her.
> >
> >This is what I mean by abuse: the intent is not very explicit
> >in the original formulation of the query.
> 
> If you don't normally use left joins, then I can see why this
> would seem odd.

I use them plenty. I just never use them for filtering the
resultset in the sense of WHERE and HAVING.

> >I've rewritten quite a few queries to use subqueries instead
> >of horribly convoluted joins (for some inexplicable reason,
> >MySQL people seem to have a habit of writing these…), and
> >performance has usually benefited from it, sometimes
> >spectacularly so.
> 
> This would be because MySQL tends to suck the more you join. In
> most other databases, joins tend to be much quicker with slower
> subqueries.

I have no idea about MySQL’s performance characteristics
concerning joins and subqueries, but it is not with MySQL that I
made my experiences and actually I would expect that subqueries
are slower on MySQL than joins, as they were not supported for a
very long time. (They were first added sometime in 4.x, but were
limited; only the very latest 5.0 series has solid enough support
that they truly are useful.)

> >Not that I shun joins; but I've found that it's rarely
> >problematic for performance when queries are written to reveal
> >intent and always a boon for maintenance, so I use a subquery
> >when that's what expresses intent more closely and a join when
> >that's what does.
> 
> I find the opposite to be true since joins tend to reveal all
> the sources of data involved and make the logic of the
> inclusion of the resulting data involved to be fairly apparent.

I don’t know how subqueries can be construed to conceal any of
that. Badly designed queries using subqueries of course do
conceal these issues, but that’s because of the “badly designed”
part and not because of the “using subqueries” qualification and
applies equally to badly designed queries using joins. Obfuscated
code is always harder to read than clear code, in any language,
and this is ultimately exactly why I’m arguing that this query
should be written with a subquery.

> >> >    SELECT
> >> >        b.id,
> >> >        EXISTS (
> >> >            SELECT NULL
> >> >            FROM hiddenboards h
> >> >            WHERE
> >> >                h.userid = 3
> >> >                AND h.boardid = b.id
> >> >        ) AS hidden
> >> >    FROM boards b
> >> >    WHERE b.active = 1
> >> >    ORDER BY b.pos
> >> >
> >> >Note that I use EXISTS and select nothing but NULL to make
> >> >it clear that the only fact of interest is whether a
> >> >matching row exists in `hiddenboards`.
> >> >
> >> >In that way it is self-documenting what you're trying to
> >> >ask for.
> >>
> >> If it allows you to read the query more readily, then use
> >> it. However, you now know that there is a more efficient
> >> solution that is acceptable to most individuals well-versed
> >> in the use of SQL.
> >
> >    Programs must be written for people to read, and only
> >    incidentally for machines to execute.
> >                                      — Abelson & Sussman
> 
> Again, the more you use them the more apparent they become.
> Should I stop using regular expressions since most programmers
> can't read them?

I find your leaps of logic confusing. I quoted Abelson & Sussman
to say that I care primarly about clarity and documentation of
intent and only secondarily, at best, about performance.

How do you leap from that to the assertion that one should use
the least powerful constructs? And anyway, where are the
programmers who can read subqueries but not joins – does your
analogy even work at all?

> >    SELECT
> >        id,
> >        id IN (
> >            SELECT boardid
> >            FROM hiddenboards
> >            WHERE userid = 3
> >        ) AS hidden
> >    FROM boards
> >    WHERE active = 1
> >    ORDER BY pos
> >
> >Yes, I tested it; it works correctly. Now go on, tell me
> >that's not much simpler and easier to understand than the
> >original query. Go on, tell me that's going to be slower than
> >a join.
> 
> Some might wonder why you have a condition in the select list
> that is normally seen in a where clause.

Why should it be in the WHERE clause? I’m not interested in
constraining the resultset. I’m interested in deriving the value
of an expression based on one of the columns in the query’s
sources. Cf. `SELECT price * 1.19 AS price_after_vat`.

> However, depending upon the database, it will be slower than a
> left join (MySQL being the odd database where this does not
> hold true). =)

Seriously? How does that work? The subquery is not correlated, so
only needs to be run once for the entirety of the query, so I’d
be curious to learn how that can still be slower than joining
tables.

(OK, if boardid is indexed and userid is not, the subquery will
require a full table scan where the join would pre-constrain the
resultset a tad. But that’s a stupid way to set a up a join table
in any event, as the entire point of such a table is to have at
least two columns of FKs.)

> However, despite all of this, it *is* equivalent. Of course,
> this would mean you couldn't use any of the fields from
> hiddenboards in any where condition or joins meaning we throw
> this stuff out if we ever need these fields. Of course, for now
> it doesn't matter. Yeah, weak argument but considering the
> future is always useful.

Seems like a very weak argument indeed when you consider the
tables involved – there is no information you would get from
keeping the fields that you can’t infer anyway from the results
of the query as written.

I don’t like throwing in extra data “just because” either. It’s
better to keep interfaces narrow by default, to avoid the
temptation to use them for their secondary effects, which results
in the familiar big ball of mud where everything is dependent on
everything else.

> >> Would you not use regular expressions if someone told you
> >> that they were a hack and that you should instead use index,
> >> rindex, and substr?
> >
> >Oh, that is an excellent example. Is the following not abuse
> >of regular expressions?
> >
> >    if( $foo =~ /^.{5}$/s ) { ... }
> >
> >Why would you write that and not this?
> >
> >    if( 5 == length $foo ) { ... }
> >
> 
> Obviously those are not equivalent.
> 
> perl -l
> my $foo = "12345$/";
> if ($foo =~ /^.{5}$/s) {
>    print "regex DOES match";
> } else {
>    print "regex DOES NOT match";
> }
> if (5 == length $foo) {
>    print "length check DOES match";
> } else {
>    print "length check DOES NOT match";
> }
> __END__
> 
> regex DOES match
> length check DOES NOT match
> 
> ;-)
> 
> Sorry for the pedantry.

Yes, I know. I left it out because \A and \z are unfamiliar to
people who’re not aware of that subtlety and to be so elaborate
seemed beside the point. (Although it is not *that* irrelevant in
practice; a lot of people are not familiar with the subtlety
you’re pointing out.)

So consider all my examples as written that way if that makes you
feel better. :-)

> This, I agree, is an abuse. However, this is a case of using a
> construct when a better solution exists; specifically one whose
> purpose is exactly what you are trying to do.

But the constructs are exactly equivalent. (Assuming, of course,
you know the subtleties well enough.) So why is it an abuse?

It’s an abuse because the construct does not reveal the intent as
clearly as another could.

And that’s why I’m saying joins and subqueries should not be used
interchangably.

> However, as I explained earlier, this sort of circumstance is
> what left and right joins were meant to be used for (that being
> the inclusion of all rows on the left by having null fields on
> the right in the resultset). The condition was just a logical
> progression from what we already knew would be the result of
> the joins.

Yes, but the purpose of a join is, well, to join rows, including
duplicating rows from either of the sources as necessary.
Generating null columns by evaluating conditions in a join that
do not involve both sources is a use of joins for their side
effects rather than their purpose.

> I'm certain when you read the last paragraph, you couldn't help
> but think this sort of thing is exactly what subqueries are
> used for. However, suppose we need more fields or have another
> condition to consider. If the base conditions of the join were
> more complex, we could possibly have a hell of a subselect
> going on in our select list.

How is that an argument for anything? If the conditions are
complex, they’re going to be complex whether you put them in a
subquery or the main query.

Are you sure you’re not making the exact kind of assertion that
you mistakenly charged me with? What you’re saying sounds like “I
don’t like to read complex subqueries in the SELECT clause” which
to me is congruent to “a lot of programmers don’t know regular
expressions.”

> Of course, I honestly cannot remember the last time I used
> scalar subqueries since I can always construct a join that will
> provide the fields that I need in the select list.

What I wrote returns a single column, but not a single row, so it
is not a scalar subquery. I don’t use those a whole lot either.

> I often use derived tables (joins against subqueries) in cases
> when I need data that includes aggregations but grouping
> against some of the normal fields does not make sense.

I join against subqueries occasionally as well. I don’t have an
aversion against joins at all, really. What I find is that it
seems that rather most people have an aversion against subqueries
that I don’t understand. I use both freely.

> >Or this, is it not abuse?
> >
> >    if( $foo =~ /^foobar$/ ) { ... }
> >
> >Why not just say this?
> >
> >    if( $foo eq 'foobar' ) { ... }
> >
> >People use regular expressions inappropriately quite
> >frequently. That doesn't mean regular expressions are a hack.
> >It does mean that they're not always the right tool for the
> >job.
> 
> I am in complete agreement on this point. However, suppose you
> discovered someone wrote subroutines for isnumber and isword
> and used them along with index, rindex and substr because they
> never learned regular expressions? You try to show regular
> expressions to them but they don't use them simply because it
> does not look obvious to them and they refer to regular
> expressions as a hack. They are using a subset of the language
> even though they are missing out on a very powerful feature. I
> hate analogies, but I hope this can at least convey my view on
> our disagreement.

I still don’t understand what point you are arguing.

a) I never said not to use joins. I said in this instance a join
   is inappropriate because it is used for its side effect.

b) It seems very improbable to me that anyone would understand
   subqueries but never have learned about joins, in the way that
   many know string functions but not regular expressions.

> Aristotle, I've respected you for a long time on perlmonks.

And now I disgraced my reputation with you? :-)

> I doubt I will change your position as I know you haven't
> rattled mine. The whole subquery subject was a tangent to the
> original comment that you made. On that subject and just to
> clarify, I never said your subquery-based solution was wrong. I
> just said it was not ideal. Hopefully I have explained why this
> is the case for myself. Since you use MySQL, subqueries are
> probably best for you.

I feel like I am talking into a void.

Was I talking to someone else in my last four diatribe-length
mails?

I DO NOT CARE which RDBMS I am using. I will use subqueries on
MySQL and I will use them on Postgres and SQLite and Oracle just
the same.

I DO NOT CARE about performance. Please re-read the points I have
been making over and over.

I brought up MySQL because it didn’t support subqueries until
recently so people who grew up using MySQL tend to do EVERYTHING
with joins, which often makes queries far more difficult to read
than they could be.

The arguments you keep making sound as if somehow you got the
idea in your head that I was saying joins are bad or subqueries
are faster or people don’t understand joins or something else I
don’t know along these lines. None of this true.

> Hopefully you can now see why I disagree with your point of
> view that we were abusing the on clause. Even if this were an
> inner join, putting additional filtering conditionals on the
> join would not be a hack since this would indicate exactly when
> the conditionals come into effect. There are generally four
> filtering periods in select queries, namely join filtering,
> where filtering, having filtering, and finally ordering and
> limiting. Changing the timing of when you choose to filter can
> often change the results by quite a margin and attempts to fix
> said changes can increase the complexity of other levels of
> filtering.

And that “changing the timing” issue is precisely my point.
Trying to move the condition from the ON clause to another point
in the query tripped you up several times. If I were to maintain
that query, I would have to be very careful about where I add any
non-trivial conditions in order to avoid exactly these complex
effects that you point out. I have O’Reilly’s _SQL Cookbook_
here, and there are a bunch of recipes in there devoted solely to
how to add a particular kind of condition to some kind of complex
query such that you won’t get tripped up by the filter cascade
timing.

The query as written with a non-correlated subquery, meanwhile,
makes it trivially obvious what’s going to happen.

Of course, that’s not always the case; when a lot of subqueries
are in use it can be difficult to see what the conditions
governing the eventual result really are, and indiscriminately
nested subqueries are also a very effective obfuscation device.

As I keep saying; the choice for one or the other depends on
which one better reveals intent.

> To clarify my point of view from above, what is available at
> the end of each of these filtering stages is the information I
> am interested in when attempting to deconstruct a select query.
> The earlier the filtering, the less stuff I need to consider at
> later stages.

I can see that point. Equally, though, the fewer sources in any
particular FROM clause, the fewer interactions to consider there
are.

> Hopefully in this mass of text, I got my point across.

I know what point I perceived you making, but I know not whether
it was the one you intended to make and I cannot know whether you
would interpret an explanation of the point as I understood it in
the way I intended, either.

Such is the peril of communication, and many a philosopher has
been lost in the blind alley that is the question of whether it
is even fundamentally possible. :-)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>