[Dbix-class] Re: how to use a complex query in DBIx::Class

Sun Dec 31 09:25:59 GMT 2006

On 12/30/06, A. Pagaltzis <pagaltzis at gmx.de> wrote:
> * Matt LeBlanc <mleblanc at cpan.org> [2006-12-31 03:40]:
> > On 12/30/06, A. Pagaltzis <pagaltzis at gmx.de> wrote:
> > >You're abusing the ON clause. You're not joining on the userid
> > >condition; you're constraing the result set. You should put it
> > >in the WHERE clause:
> >
> > Forgive my ignorance but could you please give me a reason why
> > this is an abuse? This is standard ANSI SQL.
>
> How does "abuse" imply it's not in the standard? Finding out the
> length of an array in Perl by saying `$#array + 1` is valid and
> working Perl as well, and yet it's abuse, because the semantics
> are off.

Although you will disagree, I do not find your example to be analogous
to the query presented by Wan. $#array + 1 being equivalent to scalar
@array is incidental by default and depends entirely upon $[ remaining
the default of 0. Left and right joins were created so resulting rows
would include all rows on the left or right, respectively, but with
null fields on the right or left, respectively, if no valid join
existed. We could simply have just returned hiddenboards.userid and
checked if userid was not undefined in our perl code. Would this too
have been an abuse?

>
> > >    SELECT
> > >        b.id,
> > >        CASE WHEN h.userid IS NOT NULL THEN 1 ELSE 0 END AS hidden
> > >    FROM boards b
> > >    LEFT JOIN hiddenboards h ON h.boardid = b.id
> > >    WHERE
> > >        b.active = 1
> > >        AND h.userid = 3
> > >    ORDER BY b.pos
> > >
> >
> > Did you run this query? Did it return the same results as my
> > query? Just in case, I will now present the results for you.
> >
> > id,hidden
> > 2,1
>
> Ah, interesting. I did not realise that the ON clause behaviour
> is affected in this way because of the LEFT JOIN, although in
> retrospect it is obvious. (Jess didn't catch this point either,
> I notice.) I'm pretty sure that issue would catch out a *lot*
> more people glancing at the query than just me and her.
>
> This is what I mean by abuse: the intent is not very explicit in
> the original formulation of the query.
>

If you don't normally use left joins, then I can see why this would seem odd.

> > To remove the "abuse" that I had, a simple method would be to
> > do something like this:
> >
> > SELECT
> >     b.id,
> >     CASE WHEN h.userid IS NOT NULL THEN 1 ELSE 0 END AS hidden
> > FROM boards b
> >    LEFT JOIN hiddenboards h ON h.boardid = b.id
> > WHERE
> >     b.active = 1 AND
> >     (h.userid = 3 OR h.userid IS NULL)
> > ORDER BY b.pos
> >
> > This results in:
> > id,hidden
> > 2,1
> > 3,0
> >
> > Urg, we still don't have the first row. Let's move the where
> > condition into the case statement.
>
> Yeah, I was going to point that out before you showed the
> results.
>
> > SELECT
> >     b.id,
> >     CASE WHEN h.userid = 3 THEN 1 ELSE 0 END AS hidden
> > FROM boards b
> >    LEFT JOIN hiddenboards h ON h.boardid = b.id
> > WHERE
> >     b.active = 1
> > ORDER BY b.pos
> >
> > Eureka!
> > id,hidden
> > 1,0
> > 2,1
> > 3,0
> >
> >
> > But wait! userId 3 sets board 1 to hidden as well.
> >
> > INSERT INTO hiddenboards VALUES(3,1);
> >
> > id,hidden
> > 1,0
> > 1,1
> > 2,1
> > 3,0
>
> Yes, I was going to make this point about the original query but
> refrained because as long as the userid is constrained with a
> condition it cannot happen. Once you removed the constraint, you
> got duplicate rows. This *is* a join, after all; duplicating rows
> is what it does.
>

Yes, but it is specifically why the userid check was in the join condition.

> > >But this is really an abuse of joining, because you're not
> > >actually interested in joining data,
> >
> > As a matter of fact, I am. This is why the left join was used.
>
> Not in the final result. You only want to know *whether* a
> matching row for the pair of given userid and examined boardid
> exists, but you're not interested in returning any data from that
> row.
>

I am interested in the data resulting from the joins. The joins return
the rows with all the information we want. We are just using the
available data to produce the final resultset.

> > >only in checking, which should be written as a correlated
> > >subquery:
> >
> > Subqueries are generally expensive, depending upon the database
> > engine. With the left join, the database can use an index if
> > one is available in coming up with all of the rows. The
> > subqueries you have would also be able to use the index but one
> > at a time as these will involve a scan of the rows returned
> > since they are not run until the very end of the joining
> > process. In most circumstances, a subquery is the wrong answer.
>
> My experience has been the exact opposite. I've rewritten quite a
> few queries to use subqueries instead of horribly convoluted
> joins (for some inexplicable reason, MySQL people seem to have a
> habit of writing these…), and performance has usually benefited
> from it, sometimes spectacularly so.
>

This would be because MySQL tends to suck the more you join. In most
other databases, joins tend to be much quicker with slower subqueries.

> Not that I shun joins; but I've found that it's rarely
> problematic for performance when queries are written to reveal
> intent and always a boon for maintenance, so I use a subquery
> when that's what expresses intent more closely and a join when
> that's what does.
>

I find the opposite to be true since joins tend to reveal all the
sources of data involved and make the logic of the inclusion of  the
resulting data involved to be fairly apparent.

> > >    SELECT
> > >        b.id,
> > >        EXISTS (
> > >            SELECT NULL
> > >            FROM hiddenboards h
> > >            WHERE
> > >                h.userid = 3
> > >                AND h.boardid = b.id
> > >        ) AS hidden
> > >    FROM boards b
> > >    WHERE b.active = 1
> > >    ORDER BY b.pos
> > >
> > >Note that I use EXISTS and select nothing but NULL to make it
> > >clear that the only fact of interest is whether a matching row
> > >exists in `hiddenboards`.
> > >
> > >In that way it is self-documenting what you're trying to ask
> > >for.
> >
> > If it allows you to read the query more readily, then use it.
> > However, you now know that there is a more efficient solution
> > that is acceptable to most individuals well-versed in the use
> > of SQL.
>
>     Programs must be written for people to read, and only
>     incidentally for machines to execute.
>                                       — Abelson & Sussman
>

Again, the more you use them the more apparent they become. Should I
stop using regular expressions since most programmers can't read them?

> I don't care about efficiency unless it's a problem.
>
> However, it occured to me that there is no need to use a
> correlated subquery in this case anyway. I wrote it that way
> because I was still blinkered by the JOIN abuse when I rewrote
> that query.
>
>     SELECT
>         id,
>         id NOT IN (
>             SELECT boardid
>             FROM hiddenboards
>             WHERE userid = 3
>         ) AS hidden
>     FROM boards
>     WHERE active = 1
>     ORDER BY pos
>
> Yes, I tested it; it works correctly. Now go on, tell me that's
> not much simpler and easier to understand than the original
> query. Go on, tell me that's going to be slower than a join.
>

Some might wonder why you have a condition in the select list that is
normally seen in a where clause. However, depending upon the database,
it will be slower than a left join (MySQL being the odd database where
this does not hold true). =)

However, despite all of this, it *is* equivalent. Of course, this
would mean you couldn't use any of the fields from hiddenboards in any
where condition or joins meaning we throw this stuff out if we ever
need these fields. Of course, for now it doesn't matter. Yeah, weak
argument but considering the future is always useful.

> > No offense, but your "purist" view on SQL is completely odd to
> > me.
>
> It's a purist view on programming; SQL is incidental.
>
> > Why would I not want to use everything the language has to
> > offer to return the data I want? Using left joins so you can
> > check for null values is a common usage.
>
> For the same reason you'd sometimes use `@array - 1` and
> sometimes `$#array`, yet not use them interchangably.
>
> > Would you not use regular expressions if someone told
> > you that they were a hack and that you should instead use
> > index, rindex, and substr?
>
> Oh, that is an excellent example. Is the following not abuse of
> regular expressions?
>
>     if( $foo =~ /^.{5}$/s ) { ... }
>
> Why would you write that and not this?
>
>     if( 5 == length $foo ) { ... }
>

Obviously those are not equivalent.

perl -l
my $foo = "12345$/";
if ($foo =~ /^.{5}$/s) {
    print "regex DOES match";
} else {
    print "regex DOES NOT match";
}
if (5 == length $foo) {
    print "length check DOES match";
} else {
    print "length check DOES NOT match";
}
__END__

regex DOES match
length check DOES NOT match

;-)

Sorry for the pedantry. This, I agree, is an abuse. However, this is a
case of using a construct when a better solution exists; specifically
one whose purpose is exactly what you are trying to do. However, as I
explained earlier, this sort of circumstance is what left and right
joins were meant to be used for (that being the inclusion of all rows
on the left by having null fields on the right in the resultset). The
condition was just a logical progression from what we already knew
would be the result of the joins.

I'm certain when you read the last paragraph, you couldn't help but
think this sort of thing is exactly what subqueries are used for.
However, suppose we need more fields or have another condition to
consider. If the base conditions of the join were more complex, we
could possibly have a hell of a subselect going on in our select list.
Of course, I honestly cannot remember the last time I used scalar
subqueries since I can always construct a join that will provide the
fields that I need in the select list. I often use derived tables
(joins against subqueries) in cases when I need data that includes
aggregations but grouping against some of the normal fields does not
make sense. Meh...that's just my experience.

> Or this, is it not abuse?
>
>     if( $foo =~ /^foobar$/ ) { ... }
>
> Why not just say this?
>
>     if( $foo eq 'foobar' ) { ... }
>
> People use regular expressions inappropriately quite frequently.
> That doesn't mean regular expressions are a hack. It does mean
> that they're not always the right tool for the job.
>

I am in complete agreement on this point. However, suppose you
discovered someone wrote subroutines for isnumber and isword and used
them along with index, rindex and substr because they never learned
regular expressions? You try to show regular expressions to them but
they don't use them simply because it does not look obvious to them
and they refer to regular expressions as a hack. They are using a
subset of the language even though they are missing out on a very
powerful feature. I hate analogies, but I hope this can at least
convey my view on our disagreement.

> Regards,
> --
> Aristotle Pagaltzis // <http://plasmasturm.org/>
>

Aristotle, I've respected you for a long time on perlmonks. I doubt I
will change your position as I know you haven't rattled mine. The
whole subquery subject was a tangent to the original comment that you
made. On that subject and just to clarify, I never said your
subquery-based solution was wrong. I just said it was not ideal.
Hopefully I have explained why this is the case for myself. Since you
use MySQL, subqueries are probably best for you.

Hopefully you can now see why I disagree with your point of view that
we were abusing the on clause. Even if this were an inner join,
putting additional filtering conditionals on the join would not be a
hack since this would indicate exactly when the conditionals come into
effect. There are generally four filtering periods in select queries,
namely join filtering, where filtering, having filtering, and finally
ordering and limiting. Changing the timing of when you choose to
filter can often change the results by quite a margin and attempts to
fix said changes can increase the complexity of other levels of
filtering. To clarify my point of view from above, what is available
at the end of each of these filtering stages is the information I am
interested in when attempting to deconstruct a select query. The
earlier the filtering, the less stuff I need to consider at later
stages.

Hopefully in this mass of text, I got my point across. I know that my
communication skills are lacking so I doubt I have. Sorry if anything
I wrote seemed smug or disparaging.