MySQL: When GROUP BY is faster than DISTINCT

I had a query on a very small table (~500 rows) and noticed it was taking over 100ms. If I did any one of the following three things, the query ran about 3 to 4 times faster (all tested with SQL_NO_CACHE):

  • Removed DISTINCT
  • Removed ORDER BY
  • Removed JOINs

All of these things were there for a reason, so I didn't want to remove any of them, and I didn't want to write separate queries when the version with the JOINs isn't that complicated.

I found that simply adding GROUP BY on the primary key also made it run about 4 times faster, which was about the speed I expected. Interestingly, once I added GROUP BY, whether DISTINCT was in there or not made no difference.

I think I know why but I just wanted to be sure.

  • When you run DISTINCT, MySQL has to look across all selected columns, whereas GROUP BY only does it for whatever columns you explicitly assign to GROUP BY, so there is less work to do (my query was selecting about 15 columns)
  • DISTINCT would usually be faster than GROUP BY if a) there’s no index on that column and b) you are not ordering as well since GROUP BY does both filtering and ordering
  • As my example illustrates, GROUP BY can be faster than DISTINCT if you are ordering as well and if you are doing JOINs, selecting a lot of columns, etc
  • If you use both DISTINCT and GROUP BY the optimiser seems to ignore the DISTINCT
  • And most importantly, always try both to see which is faster in your use case!

Does that sound right?

this is true as far as it goes… however, if you have GROUP BY with only a few of your 15 columns, then technically you are running an invalid query in mysql (yes, mysql will run these invalid queries), and the results are indeterminate

not necessarily true… abandon this line of thought

unless you GROUP BY the exact same columns as you would have in the DISTINCT, all bets are off (see above comment)
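
to illustrate, a quick sketch with a made-up table (not your schema):

-- orders(id PK, customer_id, status)
SELECT customer_id, status   -- status is neither grouped nor aggregated
FROM orders
GROUP BY customer_id         -- mysql will happily run this, but which status value
                             -- you get for each customer_id is indeterminate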

  • As my example illustrates, GROUP BY can be faster than DISTINCT if you are ordering as well and if you are doing JOINs, selecting a lot of columns, etc

you are well and truly off the deep end, now

you cannot compare a query without joins to a query with joins

that’s like saying if you’re going to fly to New York from Boston, it’s faster than if you fly to New York from San Francisco

just not comparable

no, it does not ignore it

no, no, no… sorry

run the query that is correct for your requirements, not some “similar” query which you think runs faster

Thanks for the reply. You’ve actually helped me out a few times over the years so I appreciate your advice. :slight_smile:

I may not have explained myself well enough in certain areas so let me clarify and provide an example.

My comment about trying more than one way was that if you have multiple queries that return the same data it is best to use the one that is faster. And that is what I am trying to do. I’m not an expert in MySQL, as you can tell, since I don’t really work with large data sets. So, I was surprised to see this query take 100 ms on a small table.

And yes, let’s forget about joins for now since it’s apples to oranges.

What I am specifically referring to is a query like this:

Query 1

SELECT DISTINCT id, column1, column2 … FROM journal JOIN … ORDER BY dateFiled DESC LIMIT 0, 10

dateFiled is not indexed due to the small table size. The query joins to categories and sometimes that results in duplicate rows, hence the DISTINCT. id is the PK.

If I do this it is 4 times faster.

Query 2

SELECT DISTINCT id, column1, column2 … FROM journal JOIN … GROUP BY id ORDER BY dateFiled DESC LIMIT 0, 10

All I’ve done is add in GROUP BY id. I really just want to understand why this is faster.

You said:

this is true as far as it goes… however, if you have GROUP BY with only a few of your 15 columns, then technically you are running an invalid query in mysql (yes, mysql will run these invalid queries), and the results are indeterminate

Given the above difference between Query 1 and Query 2:

  • Is my first point correct? From what I can tell from what I have read, and from your response, it is, so long as I don’t start adding extra columns to the GROUP BY
  • Why is Query 2 faster than Query 1?
  • In Query 2 is the ORDER BY superfluous?
  • If I remove DISTINCT from Query 2 there seems to be no difference in performance. Why is this? Is it because GROUP BY already makes the rows unique, so DISTINCT runs on a result set that is already unique and adds negligible overhead?

Thanks again.

you need to stop here and understand why it’s wrong

https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html

Okay, I get now why my use of GROUP BY is wrong.

Why does it significantly speed up my query?

Your type of GROUP BY was traditionally wrong but it doesn’t appear to be so any longer. When you group by a primary key column, the values of the other columns are determinate, because there can be only one row for any primary key value and so there is no ambiguity.

In fact, the very manual page you were linked to mentions this at the beginning:

SQL99 and later permits such nonaggregates per optional feature T301 if they are functionally dependent on GROUP BY columns: If such a relationship exists between name and custid, the query is legal. This would be the case, for example, were custid a primary key of customers.

MySQL 5.7.5 and up implements detection of functional dependence. […]

So it looks like the new SQL standard permits grouping by the primary key and MySQL has started supporting it. The same happened in PostgreSQL a few years ago:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

So I’d say your GROUP BY id is perfectly fine, and as it is functionally identical to the DISTINCT clause in this case I don’t see anything wrong with using it for performance reasons. But I don’t know if this kind of query will work in other databases, which might not be as permissive.
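
In other words, something like this (a sketch based on the customers example from the manual excerpt above, assuming custid is the primary key) is now accepted even with ONLY_FULL_GROUP_BY enabled:

SELECT custid, name    -- name is not in the GROUP BY, but it is functionally
FROM customers         -- dependent on custid because custid is the primary key
GROUP BY custid;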

I don’t really know, but it must be caused by how the MySQL optimizer decides to execute the query - it is certainly doing something different if the speed is different. Maybe the GROUP BY causes it to use an index while the DISTINCT does not? Try running EXPLAIN for both. It is quite possible that in other databases you might get completely different speed results because they have different query planners.
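
For example, just prefix both variants with EXPLAIN and compare the plans (the "…" here stands for the rest of the column list and joins from your earlier post):

EXPLAIN SELECT DISTINCT id, column1, column2 … FROM journal JOIN … ORDER BY dateFiled DESC LIMIT 0, 10;
EXPLAIN SELECT DISTINCT id, column1, column2 … FROM journal JOIN … GROUP BY id ORDER BY dateFiled DESC LIMIT 0, 10;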

Hi @Lemon_Juice, thanks for that, very useful.

It looks like my query runs okay under 5.6.x because ONLY_FULL_GROUP_BY is enabled by default but as of 5.7.5 ONLY_FULL_GROUP_BY is disabled by default but would still work because of what you cited. Is that right?

Bearing in mind that I only use MySQL, is there anything inherently bad about using GROUP BY in that way?

Also, my hosts have said that when they upgrade they will likely move from MySQL 5.6 to MariaDB 10 or 10.1. Have you any experience/issues with this?

EXPLAIN does show some differences. I will post later on.

That should be right except you reversed the defaults: ONLY_FULL_GROUP_BY is off by default in 5.6 and on in >=5.7.5.

No, it’s just taking advantage of a more recent SQL standard. I’ve read some people prefer the traditional strict way of listing all columns but to me this is a nuisance and I’m glad databases are lifting this restriction when grouping by a PK.

I don’t have any experience with MariaDB but I think it should be easy to find this in their docs?

I’ve just found out you may be out of luck. There’s a bug for it but not much activity.

You’re right, I got them mixed up!

Thanks for that, good find! I will forward that on to my hosting company. I thought MariaDB was meant to be interchangeable but I guess it is not fully. That’s a real shame that no one seems interested in looking at that.

Removing ONLY_FULL_GROUP_BY from the SQL_MODE would be the only “fix” in that case I guess.
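
Something along these lines, I suppose (a sketch of one common way to strip just that one flag from the session’s sql_mode; I haven’t tested it on my setup):

SELECT @@sql_mode;   -- see which modes are currently enabled
SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));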

Here’s the query:

SELECT DISTINCT
       `journalItems`.`id`,
       `pages`.`slug`,
       `journalItems`.`title`,
       `journalItems`.`html`,
       `journalItems`.`dateFiled`,
       `journalItems`.`image`,
       `journalItems`.`gallery`,
       `journalItems`.`documents`,
       `journalItems`.`links`,
       YEAR(`journalItems`.`dateFiled`) AS `year`,
       (SELECT GROUP_CONCAT(`categoryID`) FROM `journalListings` WHERE `itemID` = `journalItems`.`id`) AS `categories`
FROM `journalItems`
JOIN `journalListings` ON `journalItems`.`id` = `journalListings`.`itemID`
JOIN `pages` ON `journalItems`.`page` = `pages`.`id`
WHERE `pages`.`status` = 'published' AND `dateFiled` <= NOW()
GROUP BY `journalItems`.`id`
ORDER BY `dateFiled` DESC
LIMIT 0, 3

Here are three EXPLAINs: the query as-is, then with GROUP BY removed (DISTINCT only), then with DISTINCT removed (GROUP BY only).

DISTINCT AND GROUP BY:
select_type			table				type 		possible_keys		key			key_len 	ref								rows	Extra
PRIMARY				journalListings		index		itemID				itemID		4			NULL							459		Using index; Using temporary; Using filesort
PRIMARY				journalItems		eq_ref		PRIMARY,page		PRIMARY		4			client.journalListings.itemID	1		Using where
PRIMARY				pages				eq_ref		PRIMARY				PRIMARY		4			client.journalItems.page		1		Using where
DEPENDENT SUBQUERY	journalListings		ref			itemID				itemID		4			func							1		NULL

DISTINCT ONLY:
select_type			table				type 		possible_keys		key			key_len 	ref								rows	Extra
PRIMARY				journalItems		ALL			PRIMARY,page		NULL		NULL		NULL							351		Using where; Using temporary; Using filesort
PRIMARY				journalListings		ref			itemID				itemID		4			client.journalItems.id			1		Using index
PRIMARY				pages				eq_ref		PRIMARY				PRIMARY		4			client.journalItems.page		1		Using where
DEPENDENT SUBQUERY	journalListings		ref			itemID				itemID		4			client.journalItems.id			1		NULL

GROUP BY ONLY:
select_type			table				type 		possible_keys		key			key_len 	ref								rows	Extra
PRIMARY				journalListings		index		itemID				itemID		4			NULL							459		Using index; Using temporary; Using filesort
PRIMARY				journalItems		eq_ref		PRIMARY,page		PRIMARY		4			client.journalListings.itemID	1		Using where
PRIMARY				pages				eq_ref		PRIMARY				PRIMARY		4			client.journalItems.page		1		Using where
DEPENDENT SUBQUERY	journalListings		ref			itemID				itemID		4			func							1		NULL

If I keep the GROUP BY and comment out DISTINCT it seems the same, so is DISTINCT being ignored by the optimiser? It seems with DISTINCT only it is choosing not to use a key. Why would that be and is that likely to be the cause of the difference in performance?

That’s why out of the forks I prefer Percona Server as it’s much more compatible with MySQL.

You can also list all the columns in your GROUP BY - apart from being verbose it shouldn’t affect performance. However, I’m not sure how adding the subquery column to the GROUP BY will work, you’d need to test it.
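
Something like this, I think (a sketch based on your posted query; I’ve left out the subquery column since that’s the part I’m unsure about) in place of the current GROUP BY line:

GROUP BY `journalItems`.`id`, `pages`.`slug`, `journalItems`.`title`, `journalItems`.`html`,
         `journalItems`.`dateFiled`, `journalItems`.`image`, `journalItems`.`gallery`,
         `journalItems`.`documents`, `journalItems`.`links`, YEAR(`journalItems`.`dateFiled`)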

I don’t know but I think it would be wise for the optimizer to ignore it in this case so maybe it does?

That’s what I suspected. It turns out MySQL doesn’t make use of the index for the DISTINCT filter - it appears to create a temporary table with the whole result set and then remove duplicates. And yes, using an index vs not using one can have a big impact on performance.

Again, the manual may shed some light:

DISTINCT combined with ORDER BY needs a temporary table in many cases.

@Lemon_Juice, thank you for your help. Honestly, it is really very much appreciated.

I think I will experiment with SQL MODE ONLY_FULL_GROUP_BY in 5.6 and see how I get on. :slight_smile:

Thanks again.

For anyone else reading this, there is also this option in 5.7:

https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_any-value
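
For example (a sketch using a couple of columns from the query above, just to show the syntax; in this particular case title is functionally dependent on the PK anyway). ANY_VALUE() tells MySQL you don’t care which value it picks for an ungrouped column, so the ONLY_FULL_GROUP_BY check is suppressed:

SELECT `journalItems`.`id`,
       ANY_VALUE(`journalItems`.`title`) AS `title`
FROM `journalItems`
GROUP BY `journalItems`.`id`;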

thanks for this… it will assist me in illustrating my points

okay, so i’m going to guess journalItems.id is that table’s primary key, and that there is a one-to-many relationship to journalListings, where itemID is a foreign key

also, i’m going to guess that journalItems is in a many-to-one relationship with pages, such that each journalItems belongs to only one pages.id

so everything would be peachy keen in your query with just GROUP BY journalItems.id (assuming ONLY_FULL_GROUP_BY is disabled), since all journalItems columns in the SELECT clause are in a 1-to-1 relationship with their PK, and pages.slug also has only one value for each journalItems.id, while journalListings.categoryID has a GROUP_CONCAT aggregate function applied to it, which is perfectly fine in a grouping query

except for one small error … that blows your EXPLAINS out of the water

in your SELECT clause, in the subselect, you are pulling all categories for each journalItems.id from a different copy of the journalListings table than in your FROM clause

consequently, the FROM clause produces multiple rows for each journalItems row, one for each category

however, your GROUP BY then collapses these multiple rows into one… and you never even notice, because your SELECT clause doesn’t pull any columns from the journalListings table that the query has joined. the journalListings table in the subquery is a different copy of the table, and it requires additional resources to execute because it sits inside a correlated subquery

if you fix that problem by removing journalListings from the FROM clause, you will find that you need neither the GROUP BY nor the DISTINCT

or, you could remove the correlated subquery, and instead just use the aggregate function alone, but then you would need the GROUP BY – but not the DISTINCT
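
roughly, those two options would look something like this (sketches based on your posted query, untested; the /* ... */ comments stand in for the rest of your SELECT list):

-- option 1: drop the join to journalListings, keep the correlated subquery,
--           no GROUP BY and no DISTINCT needed
SELECT `journalItems`.`id`,
       `pages`.`slug`, /* ...other journalItems columns as before... */
       (SELECT GROUP_CONCAT(`categoryID`)
          FROM `journalListings`
         WHERE `itemID` = `journalItems`.`id`) AS `categories`
FROM `journalItems`
JOIN `pages` ON `journalItems`.`page` = `pages`.`id`
WHERE `pages`.`status` = 'published' AND `dateFiled` <= NOW()
ORDER BY `dateFiled` DESC
LIMIT 0, 3;

-- option 2: drop the subquery, keep the join, aggregate directly,
--           GROUP BY needed but DISTINCT is not
SELECT `journalItems`.`id`,
       `pages`.`slug`, /* ...other journalItems columns as before... */
       GROUP_CONCAT(`journalListings`.`categoryID`) AS `categories`
FROM `journalItems`
JOIN `journalListings` ON `journalItems`.`id` = `journalListings`.`itemID`
JOIN `pages` ON `journalItems`.`page` = `pages`.`id`
WHERE `pages`.`status` = 'published' AND `dateFiled` <= NOW()
GROUP BY `journalItems`.`id`
ORDER BY `dateFiled` DESC
LIMIT 0, 3;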

moral of the story: as i’ve often said, optimizing a broken query is a mug’s game – make sure your query is actually running correctly before worrying about optimization


But it’s also unavailable in MariaDB. Anyway, I would treat it as a temporary band-aid solution only, for example when you want to quickly adapt legacy code for a new version of MySQL, and not for use when creating new queries.

And r937 has a good point. I didn’t look closely enough at your query. It doesn’t make sense in this case to join with journalListings if you are not referencing journalListings in your SELECT columns, WHERE clause or other JOINs (subqueries do not need JOINs to the same table in the main query), and the join can only cause unnecessary duplicates.

But the good side of this is that you’ve learned some valuable lessons about the GROUP BY clause and how it changed in 5.7.5 - I was also unaware this had changed in MySQL so recently!

You absolutely nailed it. You were correct in all your assumptions.

if you fix that problem by removing journalListings from the FROM clause, you will find that you need neither the GROUP BY nor the DISTINCT

This worked a treat and the query is now running at around 10ms, which is what I would expect.

I can’t believe I didn’t see that major flaw in the query, thanks for pointing it out.

Yes, you are right.

From what I’ve read, if you are forced to upgrade you are probably best off disabling ONLY_FULL_GROUP_BY until you are able to thoroughly test your queries. If you’re sticking with MySQL you should be okay as long as non-aggregates are functionally dependent on the GROUP BY columns; MariaDB not so much, unless they implement the feature you linked to.

That is how I have read it anyway.

@r937 Do you consider ONLY_FULL_GROUP_BY lazy/bad practice?

hell no… it should be enabled… all the time…

of course, those of us who grew up writing SQL when “only full group by” was baked into the language are shaking our heads at the unnecessary mayhem caused by mysql’s odious behaviour in returning indeterminate values in syntactically invalid grouped queries written by novices who honestly had no idea how grouping actually worked
