Joining 3 tables and SUMming by 2 of them - which was is 'best'?

Say I’ve got an orders table, an ordered_products table, and order_costs table, like so:

orders:
--------------------
orderID | orderName
   1    | bob
   2    | fred
--------------------

order_products
-------------------------
orderID | productID | cost
   1    |     100   |  10
   1    |     101   |  20
-------------------------

ordered_costs
------------------------------
orderID | costDesc | costValue
   1    | shipping | 5
------------------------------

If I want to get the cost, and the product cost for a given order, What would be the best way to do it? My first thought would normally go to a simple 3 way join:

SELECT o.orderID, SUM(p.cost) AS product_cost, SUM(c.cost) AS order_cost
FROM
orders o
JOIN ordered_products p ON o.orderID=p.orderID
JOIN order_costs c ON o.orderID=c.orderID
WHERE o.orderID=1
GROUP BY o.orderID

However, that would be incorrect because it would give me:

------------------------------
orderID | product_cost| orderCost
   1    |      30     |     10
------------------------------

I understand that it’s because the flat table before grouping would be:


------------------------------
orderID | product_cost| orderCost
   1    |      10     |     5
   1    |      20     |     5
------------------------------

What’s the best way to return the correct result here? I would usually do something like joining one table at a time, within a series of subqueries:

SELECT o.orderID, o.product_cost, SUM(c.cost) AS order_cost
FROM
(
SELECT o.orderID, SUM(p.cost) AS product_cost
FROM
orders o
JOIN ordered_products p ON o.orderID=p.orderID
WHERE o.orderID=1
GROUP BY o.orderID
) AS o
JOIN order_costs c ON o.orderID=c.orderID
GROUP BY o.orderID

But, if I were to join a large number of records, the subquery requires an un-indexed join, which I believe is inefficient and potentially slow. Another option is a correlated subquery in a similar way:

SELECT o.orderID, SUM(p.cost) AS product_cost, (SELECT SUM(c.cost) orderCost FROM order_costs c WHERE c.orderID=o.orderID) AS order_cost
FROM
orders o
JOIN ordered_products p ON o.orderID=p.orderID
WHERE o.orderID=1
GROUP BY o.orderID

But isn’t this similarly inefficient, especially if I wanted to SUM many columns in each table?

Any suggestions in this scenario? I’m probably missing something very easy.

And what if the order has multiple rows in the order_costs table as well?
How about:


SELECT 
    o.orderID
  , p.product_cost
  , c.order_cost
FROM orders o
INNER JOIN 
  (SELECT 
       orderID
     , SUM(cost) AS product_cost
   FROM ordered_products
   GROUP BY orderID
  ) AS p 
ON o.orderID=p.orderID
INNER JOIN
  (SELECT 
       orderID
     , SUM(cost) AS order_cost
   FROM order_costs
   GROUP BY orderID
  ) AS c 
ON o.orderID=c.orderID
WHERE o.orderID=1

Or, since you don’t want all orders, but only a specific one:


SELECT 
    o.orderID
  , p.product_cost
  , c.order_cost
FROM orders o
CROSS JOIN 
  (SELECT 
     SUM(cost) AS product_cost
   FROM ordered_products
   WHERE orderID = 1
  ) AS p 
CROSS JOIN
  (SELECT 
     SUM(cost) AS order_cost
   FROM order_costs
   WHERE orderID = 1
  ) AS c 
ON o.orderID=c.orderID
WHERE o.orderID=1

Or maybe CROSS JOIN isn’t the way to go, and this is better (I don’t know, test it or wait for someone with more knowledge about query performance to give his opinion):


SELECT 
    o.orderID
  , (SELECT 
        SUM(cost) AS product_cost
      FROM ordered_products
      WHERE orderID = 1
     ) AS p.product_cost
  , (SELECT 
        SUM(cost) AS order_cost
      FROM order_costs
      WHERE orderID = 1
     ) AS c.order_cost
FROM orders o
WHERE o.orderID=1

Yeah, these are the kind of approaches I’ve been considering. I’ve got it into my head that these correlated subqueries would be resource intensive, and they’d be doing many, many lookups if we were selecting multiple orders, but when I’ve tested them they seem to work ok. It could be my lack of thorough understanding of the optimisation mechanisms. So I was just wondering if there was sme radically different and better way to do it, but it seems I’m pretty much on the right track.

If you limit the subqueries to only specific id’s, and you have an index on the id columns, I don’t think it should be very resource intensive, but I don’t know for sure.

And how would I go about summing on a range of orders, say by date?

The only way I can think of is a massive subquery ie (using Guido’s example):



SELECT COUNT(orderID) orders, SUM(product_cost) product_cost, SUM(order_cost) order_cost
FROM
(

    SELECT 
        o.orderID
      , (SELECT 
            SUM(cost) AS product_cost
          FROM ordered_products op
          WHERE op.orderID=o.orderID
         ) AS p.product_cost
      , (SELECT 
            SUM(cost) AS order_cost
          FROM order_costs oc
          WHERE oc.orderID = o.orderID
         ) AS c.order_cost
    FROM orders o
    WHERE o.order_date between "2011-01-01" AND "2011-01-31"

) AS orders

Now, is there a better way than THAT?

I would use this structure (you can test it to see if it makes a difference):


SELECT 
    COUNT(o.orderID) orders
  , SUM(product_cost) product_cost
  , SUM(order_cost) order_cost
FROM orders o
INNER JOIN 
  (SELECT 
       orderID
     , SUM(cost) AS product_cost
   FROM ordered_products
   GROUP BY orderID
  ) AS p 
ON o.orderID=p.orderID
INNER JOIN
  (SELECT 
       orderID
     , SUM(cost) AS order_cost
   FROM order_costs
   GROUP BY orderID
  ) AS c 
ON o.orderID=c.orderID
WHERE o.order_date between "2011-01-01" AND "2011-01-31"

For some reason it just feels like doing those subqueries in the FROM is more efficient that doing them in the SELECT, but of course I could be completely wrong.

I’ve just tried that approach and it never finished running the query!

I think it’s because the subqueries are selecting and grouping every row on large tables. They ran faster when I included in the subquery a join to the orders table, so that I could use the date in a WHERE, but still pretty sluggish.

So far the correlated subqueries are looking pretty good.