Simple (?) JOIN question

rblon · January 8, 2010, 11:49am

I have a question how to have two JOINs using the same table.
The previous sentence probably doesn’t make sense, so let me try to explain using an example. Let say, I have two tables


CREATE TABLE users (
   userid INTEGER AUTO_INCREMENT,
   name VARCHAR(36),
   PRIMARY KEY (userid)
)

CREATE TABLE transactions (
   transactionid INTEGER AUTO_INCREMENT,
   buyerid INTEGER,
   sellerid INTEGER,
   price INTEGER,
   PRIMARY KEY (transactionid)
)

Now, when I want buyer information for all transactions where the price was greater than 10, I could retrieve this using the following statement


SELECT 
  users.name AS buyer 
FROM 
  transactions INNER JOIN users
    ON transactions.buyerid = users.userid
WHERE
   transactions.price > 10

But what statement should I use when I also want to retrieve the sellers name with the same query?

swampBoogie · January 8, 2010, 12:19pm


select buyer.name as buyer,
         seller.name as seller
  from users as buyer
  join  transactions 
    on buyer.userid = transactions.buyerid
  join users as seller
    on transaction.sellerid = seller.userid
 where transactions.price > 10

r937 · January 8, 2010, 12:35pm

i prefer to write the tables in the FROM clause in a logical sequence

SELECT buyer.name AS buyer
     , seller.name AS seller
  FROM transactions 
INNER
  JOIN users AS buyer
    ON buyer.userid = transactions.buyerid
INNER
  JOIN users AS seller
    ON seller.userid = transactions.sellerid
 WHERE transactions.price > 10

i always try to start the FROM clause with the “driving” table, i.e. the one mentioned in the WHERE clause which has the main restriction, the one which the other tables are joined to

rblon · January 8, 2010, 12:43pm

great, your responses make it all clear.
I missed the point that I should give the table that is being joined an alias, so that I can join it twice

rblon · January 8, 2010, 2:56pm

I actually would like to ask a second question and that is about the efficiency of JOIN statements.

Let say, the transaction table has many records. And I would like to get the buyer and seller name for a specific transaction (eg transaction 123456). As I have been explained, I can do this in one query:


SELECT 
  buyers.name AS buyer,
  sellers.name AS seller 
FROM 
  transactions 
    INNER JOIN users AS buyers
      ON transactions.buyerid = buyers.userid
    INNER JOIN users AS sellers
      ON transactions.sellerid = sellers.userid
WHERE
  transactions.transactionid = 123456

Now, I would “visualize” this query as a large table being constructed (with buyer and seller names for all transactions) to which the WHERE clause is being applied.

If the database would indeed process the query like that, it would be very inefficient. In that case it would be better to obtain the buyerid and sellerid, and get the names in separate query.

Does anyone know if one should take these kind of considerations into account? Or can you expect the database to optimize the query?

r937 · January 8, 2010, 3:27pm

yes, i do

have you ever done an EXPLAIN? which database system is this?

rblon · January 8, 2010, 3:37pm

You take these considerations into account? Or you know the answer?

No I have never done an EXPLAIN… I am not running a large database, it is more an academic interest. Hence, the stylized example.

I am using MySQL.

r937 · January 8, 2010, 3:42pm

i know the answer

the reason i asked which database you’re using is because we have a mysql forum (for mysql, eh) and a databases forum (for everything else), and you didn’t post in the mysql forum and you also didn’t say which database you’re using

in your query, put the word EXPLAIN in front of the word SELECT, run it, and show us the results

rblon · January 8, 2010, 4:36pm

Ok, so I have created the following php script to populate the tables with some random data


$seed = 100;
srand($seed);

// Fill users table with 1000 users
for ($i = 0; $i < 1000; $i++) {
  $name = rand(0, 100000);
  $sql = "INSERT INTO users (name) VALUES ('$name')";
  mysql_query($sql, $con);
}

// Fill transactions table with 200k transactions
for ($i = 0; $i < 200000; $i++) {
  $buyerid = rand(0, 999);
  while (($sellerid = rand(0, 999)) == $buyerid);
  $price = rand(1, 100);
  $sql = "INSERT INTO transactions (buyerid, sellerid, price) ";
  $sql .= "VALUES ($buyerid, $sellerid, $price)";
  mysql_query($sql, $con);
}

Now executing the SELECT statement with two JOINs returns buyer=72396 (corresponding to userid=148) and seller=88733 (corresponding to userid=184) pretty instantaneously.

And EXPLAIN gives an output as per screenshot.

Based on the speed of returning the result I would think MySQL is optimizing.

And, then I would assume any other respectable database system would do the same.

r937 · January 8, 2010, 4:57pm

yep, they do

rblon · January 8, 2010, 6:43pm

ok great, then we can keep it general, and I might be posting in the right forum after all

Any insights in what EXPLAIN can tell me?

r937 · January 8, 2010, 7:04pm

it can tell you when certain operations are inefficient, e.g. table scans where the “key” column is empty

rblon · January 8, 2010, 7:12pm

but it doesn’t tell me that the WHERE clause is executed before the JOINs (as seems to be the case)?

r937 · January 8, 2010, 7:22pm

yes, the WHERE clause is applied in step 1 to the transaction table (using its PRIMARY key)

the “rows” column indicates that 1 row has been retrieved (instead of all of them)

after that, the other tables are joined