2014-07-14

MySQL join syntax

http://dev.mysql.com/doc/refman/5.7/en/join.html
http://dev.mysql.com/doc/refman/5.7/en/nested-join-optimization.html

In the classical way to write a query
SELECT a.x, b.y
  FROM a, a2b, b
 WHERE a.a_no = a2b.a_no
   AND a2b.b_no = b.b_no
   AND a.z = 123

the WHERE clause mixes join conditions with selection conditions. Therefore, and once one gets to things like outer joins, it is better to be explicit, using the where condition to restrict the results by value, and the joins to restrict them by matches between tables.

As always, tables in joins can be actual tables, or subqueries returning tables, and we can define aliases for tables. I am ignoring partitions here.

A join merges two tables into one, using matching values between them, to restrict the cartesian product of all possible combos that can be made with rows from these tables to those rows that have the matching values. There are several kinds of joins

JOIN ... needs to have matching values in tables on both sides
LEFT JOIN ... needs values in left table, and tacks on values from the right, or null otherwise

You use left joins if you have a table that only has values for some records in your main table, and you want to see them if available, but you do not want to filter out rows from the main table, if they are absent.

One can optionally insert an OUTER between LEFT and JOIN. One can optionally call the simple, bidirectional join INNER JOIN or CROSS JOIN (in other dialects INNER requires an ON clause, here not), or use a comma between table names to imply it with the join condition in the where clause. Be careful when mixing table lists (i.e. implied inner joins) and left or explicit joins: because the explicit joins have precedence over the implied inner joins, the column you may want to join on in your outer join may not yet have been joined in, resulting in an error message. (See below).

There are several ways to describe on which fields from the left and right operands should be joined.

ON ... followed by an explicit statement which field from which table equals which field from which other (one actually can use anything one could use in a where clause in ON, but then one loses the advantage of cleanly splitting joins from selection criteria).

USING ... followed by a parenthesis enclosed list of fields which need to be present in both tables under the same name, and need to have matching content for a match.

NATURAL JOIN of two tables even does not need the field(s) in USING any more, as it will join on all fields of the same name between the tables.

The SELECT * output of USING and NATURAL are not quite identical to ON: the matching columns are listed only once, before the other columns (so called "coalesced" common columns), in order of appearance in the first table, then columns unique to the first table in order, then those unique to the second table in order. ON just lists all columns from all joined tables.

For example, assume you have the tables
t1(id,x), t2(id,y)then
SELECT * FROM t1 JOIN t2 ON t1.id=t2.id;
SELECT * FROM t1 JOIN t2 USING (id);
SELECT * FROM t1 NATURAL JOIN t2;
will all achieve the same effect.
If you have the three tables t1(a,b), t2(c,b), and t3(a,c) then the natural join SELECT * FROM t1 NATURAL JOIN t2 NATURAL JOIN t3; really means that after joining up t1 and t2 the resutling table will have the columns b, a, c. Again natural joining this with t3 will thus join on both a and c, not just on c, equivalent to this: SELECT ... FROM t1, t2, t3 WHERE t1.b = t2.b AND t2.c = t3.c AND t1.a = t3.a; This may not be what you intendet. So while NATURAL is nice for the lazy, it also is dangerous for its side effects. It is better to be explicit and use USING.
As long as all of them are inner joins, the order of joins is unimportant, and no parentheses are needed. 
However, this is not true for outer joins.

Lets say we have
CREATE TABLE t1 (i1 INT, j1 INT);
CREATE TABLE t2 (i2 INT, j2 INT);
CREATE TABLE t3 (i3 INT, j3 INT);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t2 VALUES(1,1);
INSERT INTO t3 VALUES(1,1);
SELECT * FROM t1, t2 JOIN t3 ON (t1.i1 = t3.i3);
will create an error Unknown column 't1.i1' in 'on clause', because the explicit join is evaluated first, and does not know anything about i1. SELECT * FROM t1, t2 JOIN t3 ON (t1.i1 = t3.i3);would fix that, as would SELECT * FROM t1 JOIN t2 JOIN t3 ON (t1.i1 = t3.i3); because there the joins are worked off left to right in order. That means you can list first all your joins (making a huge cartesian table in theory), then all the on conditions with AND. Likewise, you can only refer to tables that were mentioned before (to the left) in the ON clause, not to tables that are mentioned after (joined to the right of the clause).

Referring to joins

Joins generate one big table on which the where clause can work. You can supply aliases to the various tables contribution to a join.

Finally, multiple joins, in a mix of inner and outer, with the outer being nested via linking tables. Ie what you deal with in the real world:

 select * 
   from finding f 
        join acq_source s using (acq_source_id)
        join finding2treatment f2t using (finding_id)
        join treatment t using (treatment_id)
        join treatment2genotype t2y using (treatment_id)
        join genotype y using (genotype_id)
        join genotype2variant y2v using (genotype_id)
        join gene g using (gene_id)
   left join (treatment2disease t2d join disease d) on (t2d.treatment_id = t.treatment_id and t2d.disease_id = d.disease_id)
   left join (treatment2drug t2u join drug u) on (t2u.treatment_id = t.treatment_id and t2u.drug_id = u.drug_id)
   left join (reference2finding r2e join reference r) on (r2e.finding_id = f.finding_id and r2e.reference_id = r.reference_id)
 where f.finding_id in ( )

No comments:

Post a Comment