Huge Performance difference in two forms of writing a subquery

Hi,

I wrote the following subquery to return the names of people that have a phone number in common with someone else:


SELECT  Name
FROM    People
WHERE   Phone IN (SELECT Phone
                  FROM   People
                  GROUP BY Phone
                  HAVING COUNT(*) > 1)

The table has about 100K rows and this takes about 20 seconds which does not make sense because running the subquery by itself takes only about 0.20 secs and the main query about that much if I enter the result of the subquesry manually (10 rows returned). So I would have expected this to run in 0.40 seconds and not 20!

So MySQL must be doing something wrong here, interpreting this the wrong way or something or somehow running the subquery multiple times? Though I can’t see why.

Playing around with this a little I then (almost accidentally) found that if I rewrite the above as:


SELECT  Name
FROM    People
WHERE   Phone IN (SELECT Phone FROM (SELECT Phone
                  FROM   People
                  GROUP BY Phone
                  HAVING COUNT(*) > 1) InnerTable)

Does indeed run in only 0.40 secs!!! Now I can’t figure out why these two forms should behave so differently. I can’t understand why MySQL would interpret the first one in any other way than the obvious and even more puzzling (to me) that the second form fixes the problem.

I am happy to use the second form but I would also like to understand why this difference. The only other problem is that the second (and faster) form is not allowed to be written as a View, it complains that the View’s SELECT contains a subquery in the FROM clause.

Here are the explain plans for the two queries (I left out the columns that were NULL for all rows):

First Form (SLOW):


+----+--------------------+--------+------+--------+----------------------------------------------+
| id | select_type        | table  | type | rows   | Extra                                        |
+----+--------------------+------- +------+--------+----------------------------------------------+
|  1 | PRIMARY            | People | ALL  | 108423 | Using where                                  |
|  2 | DEPENDENT SUBQUERY | People | ALL  | 108423 | Using where; Using temporary; Using filesort |
+----+--------------------+--------+------+--------+----------------------------------------------+

Second Form (FAST):


+----+--------------------+------------+------+--------+----------------------------------------------+
| id | select_type        | table      | type | rows   | Extra                                        |
+----+--------------------+------------+------+--------+----------------------------------------------+
|  1 | PRIMARY            | People     | ALL  | 108431 | Using where                                  |
|  2 | DEPENDENT SUBQUERY | <derived3> | ALL  |      3 | Using where                                  |
|  3 | DERIVED            | People     | ALL  | 108431 | Using where; Using temporary; Using filesort |
+----+--------------------+------------+------+--------+----------------------------------------------+

Note that there is also a time difference to come up with each plan, the first one (SLOW) takes no time (0.00 sec) while the second one takes 0.20 sec, not sure if this matters or is in any way significant.

Also welcome are any suggestions for a better way to write this query.

Thanks
John P.

i think this might be a bug. in the first query, mysql thinks the subquery is dependent. it should instead be a derived table.

what EXACT version of mysql are you using?

I am using version 5.0.18.

Yes, it does indeed appear to run that query for every row that matches because I have another restriction in the WHERE clause that limits the main query to about 100 rows, so 100 * 0.20 = 20 seconds.

try this query:

select a.name
  from people a
  join people b
    on a.phone = b.phone
   and a.name != b.name

i have confirmed that 5.0.24 behaves in the same manner

Hi longneck,

Thanks for the query, obviously much better than mine. Actually yours will return douplicate rows (every row twice) because a.name != b.name will match two rows a:Alan b:Bill and a:Bill b:Alan with the same phone, so it needs a DISTINCT which of course is not too bad because my result set in only a few rows. It runs in exactly the same time as my fast version of the subquery.

So, you guys think this is a bug? I tried giving an alias to the table in the subquery and prefixing the columns with that alias in an attempt to convince MySQL that the subquery is independent but with no success.

Thank you
john

doesn’t need DISTINCT (which is very expensive)

just change a.name != b.name to a.name > b.name

:slight_smile:

i think it’s a bug. i’m looking through other subquery bugs to see if anyone has reported something similar.

http://bugs.mysql.com/25926 created using the attached sample data

Sorry my bad. The douplicate rows were not from the fact that we got two rows per phone, actually that’s what we want and we need to keep the != operator. The problem comes when the same phone is shared by more than 2 people which creates a lot more combinations resulting in duplicate a.name and a.phone.

Not sure if there is a way to avoid DISTINCT in this case but, as I said, the expected result set is less than a dozen rows which means that sorting shouldn’t take long enough to be able to measure.

Thanks all for your responses and longneck for creating the bug at MySQL.

Hi,

I know this is not a very prompt reply but if anyone still interested, this problem is fixed by new subquery materialization optimizations in MySQL 6.0 - the query runs in 0.03 sec now. Thank you guys for reporting the bug.

Sergey Petrunia
Software developer, Query optimizer team
MySQL AB

thank you very much for taking the time to let us know

p.s. they aren’t bugs, they’re opportunities :slight_smile: