Getting the number of rows that a PostgreSQL SELECT would return without a LIMIT from a query with a LIMIT

Mittineague · September 13, 2018, 5:56pm

An interesting feature request at meta discourse:

To avoid poorly crafted SELECT queries from causing problems, the Data Explorer plugin has a LIMIT built in.

I’m having trouble seeing how it can be both ways, getting the row COUNT of what a potentially resource hungry query would return without a LIMIT in place from a query that has a LIMIT.

The best approach I can think of is, when the number of rows returned is the same as the LIMIT, to show some type of “LIMIT met, there may be more rows that meet the SELECT conditions” message.

But maybe there’s some “trick” I’m unaware of that could get the possible row count without causing resource use problems?

rpkamp · September 13, 2018, 6:14pm

In postgres you have the OVER() window function you can use that will return the number of rows that would have been retrieved had there been no limit on the query, eg

SELECT id, COUNT(*) OVER() FROM sometable LIMIT 500;

this will return the number of rows found in each and every row, but you could filter the column out when displaying the data and just use it to display the message you mentioned.

Mittineague · September 13, 2018, 10:17pm

Thanks ScallioXTX, I doubt I would have discovered OVER() soon, if ever, on my own. Works a treat. eg.

SELECT users.username 
 , COUNT(users.username) OVER ()  
FROM users 
WHERE users.id > 4 
LIMIT 5

over-count

Lemon_Juice · September 16, 2018, 8:32pm

I decided to test how COUNT(*) OVER() performs and my initial conclusions are that for simple queries like those selecting data from a single table it is faster to use COUNT(*) OVER() and run just one query. But for more complex queries it was no longer the case.

For instance, I tested a select query with 5 left joins and the performance was as follows (in seconds):

pure SELECT: 0.011
SELECT COUNT(*): 0.008
SELECT with COUNT(*) OVER(): 0.025

This means that the two queries - SELECT and SELECT COUNT - ran in 0.019 seconds total, which is faster than the single query that selects data and counts rows at the same time.

This by no means proves how exactly COUNT(*) OVER() affects performance but shows that it’s important to benchmark specific queries.

And finally, I found out that if I really needed one query both to fetch data and count rows then using another SELECT COUNT(*) in the same query that fetches the data is faster than using COUNT(*) OVER(). For example, I can do the count in a subquery -

instead of:

SELECT *
, COUNT(*) OVER()         
FROM orders
WHERE progress='D'
ORDER BY id DESC
LIMIT 20 OFFSET 0

this is faster:

SELECT *
, (SELECT COUNT(*) FROM (
    SELECT *	            
    FROM orders	            
    WHERE progress='D') dt) AS cnt

FROM orders      
WHERE progress='D'
ORDER BY id DESC
LIMIT 20 OFFSET 0

For simple queries the performance improvement is small but for complex ones it is greater. The obvious downside is that I have to repeat (almost) the whole query with the WHERE conditions in the subquery, which is less elegant than simply adding COUNT(*) OVER(). I suppose, we could use a CTE instead of the subquery and achieve the same effect.

I tested this on Postgres 9.6. I don’t know about later versions - maybe Postgres improved the query optimizer so it can achieve better performance with COUNT(*) OVER()?

Mittineague · September 16, 2018, 11:25pm

Just curious. If you do COUNT(an_indexed_field) instead of the COUNT(wildcard), are the performance times significantly different?

Lemon_Juice · September 17, 2018, 10:36am

From my experience I can say the performance is the same. Maybe there are some edge cases where counting by some column would result in different execution times but I’ve never seen them.

Mittineague · December 17, 2018, 5:37pm

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.