Result returning wrong COUNT values?

Hi all

The code below produces the result I need based on the venue position, though things aren’t playing so kindly with the COUNTs. Not sure if I’m building this correctly :expressionless:

For some reason the reviewCount and hitsCount are slightly inflated, meaning if the number should be 12, it returns 96. Both COUNTs also have the exact same number.

SELECT *
  FROM ( SELECT v.id
  , v.venue_id AS Venue
  , COUNT(r.venue_fk) AS reviewCount
  , COUNT(h.venue_id) AS hitsCount
  
  FROM tbl_venues v
    INNER JOIN tbl_venue_page_hits h
        ON (h.venue_id = v.id)
    INNER JOIN tbl_reviews r
        ON (v.id = r.venue_fk)
    
  WHERE h.dtstamp > CURDATE()
  GROUP BY v.venue_id
  
  ) AS d
  
  ORDER BY hitsCount DESC

The result based on the code above:

id    Venue    reviewCount       hitsCount
46    v1       609               609
16    v5       208               208
34    v7       184               184
17    v9       136               136

It should be looking something like:

id    Venue    reviewCount       hitsCount
46    v1       29               21
16    v5       22               18
34    v7       34               16
17    v9       28               11

If I remove the reviews table count and join (snippet below), reducing it to two tables, everything works and displays the hits count correctly.

COUNT(r.venue_fk) AS reviewCount
INNER JOIN tbl_reviews r ON (v.id = r.venue_fk)

Two tables without reviews show

id    Venue    hitsCount
46    v1       21
16    v5       18
34    v7       16
17    v9       11

Any ideas what I’m doing wrong — why the reviews table is causing the incorrect result?

I have left the outer SELECT in place, was playing around with this though the results are the same.

Cheers, Barry

I think a better approach is, as the below works:

SELECT v.id
  , v.venue_id AS Venue
  , COUNT(h.venue_id) AS hitsCount
  
  FROM tbl_venues v
    INNER JOIN tbl_venue_page_hits h
        ON (h.venue_id = v.id)
    
  WHERE h.dtstamp > CURDATE()
  GROUP BY v.venue_id
  ORDER BY hitsCount DESC

Showing:

id    Venue    hitsCount
46    v1       21
16    v5       18
34    v7       16
17    v9       11

So how would I add the reviews table into the mix so I can COUNT the reviews from each venue, based on the matching v.id = r.venue_fk?

r.venue_fk is the foreign key inside the reviews table which matches the venues id (v.id).

Just to recap - When I add INNER JOIN tbl_reviews r ON (v.id = r.venue_fk) thats when everything starts to go wrong.

Can anybody help, thanks.

let’s start with what happens when you do multiple JOINs in the FROM clause

the result that’s produced is a NxM table – N rows, each of M columns

try running your problematic query ~without~ doing a GROUP BY and without counts, just to see the actual result rows produced by the joins

look at it until you figure out why the COUNTs are identical – it’s because each venue joins to multiple hits rows, and every one of those is joined to every review

so where i said it’s producing N rows of M columns, that’s for the overall result table – but each venue is represent PxQ times, where P is the number of hits rows, and Q is the number of reviews

somebody is going to spoil the party and say “just use COUNT(DISTINCT foo) instead” but i want you to understand the inefficiency of that, just by looking at what your GROUP BY is “hiding under the covers” so to speak

let me know if this doesn’t make sense

also, google “cross join effects” for other examples

1 Like

Hey r937

Been reading a bit about this, thanks for the heads up, was hoping you was about :sunglasses:

Ok, I think I understand what’s happening with cross joins, and in particular what’s casuing the result to be the same and have the inflated values.

Cross Joins produce results that consist of every combination of rows from two or more tables.

…if you have table A with 10,000 rows and Table B with 30,000 rows, and you accidentally create the product of these two tables, you will end up with a 300,000,000 row result

So generally speaking, we are adding everything together — resulting in one big total.

I tried running the query without the COUNTs and Group By :open_mouth:

So am I right in saying the COUNT is the same because the query is acting like a CROSS JOIN, and adding all the rows together?

And first time I’ve heard/seen the NxM formula — Intresting.

Cool, ok, so now we know why this is happening, how do we fix? :grin:

Barry

And I had to try COUNT(DISTINCT foo)
But this only returned 1 for both counts, and for every venue.

let me know if this doesn’t make sense

Not a 100%, though I think I have a fair understanding.

not “adding” – joining

let’s set up an example

table A has rows A1, A2, A3

A1 A2 A3

table B has a foreign key to A, so for each row in A, there are multiple rows in B

this is what we get when we join B to A –

A1 B23 A1 B24 A1 B25 A2 B46 A3 B71 A3 B72 A3 B73 A3 B74

table C has a foreign key to A, so for each row in A, there are multiple rows in C

this is what we get when we join C to A –

A1 C012 A1 C013 A2 C025 A2 C026 A3 C064 A3 C065 A3 C066

however, this is what we get when we join ~both~ B and C to A

A1 B23 C012 A1 B23 C013 A1 B24 C012 A1 B24 C013 A1 B25 C012 A1 B25 C013 A2 B46 C025 A2 B46 C026 A3 B71 C064 A3 B71 C065 A3 B71 C066 A3 B72 C064 A3 B72 C065 A3 B72 C066 A3 B73 C064 A3 B73 C065 A3 B73 C066 A3 B74 C064 A3 B74 C065 A3 B74 C066

this is what you were supposed to see when you ran your own query without GROUP BY

this is the MxN result table, where N is 20 total rows and M is 3 columns

okay, now watch this…

let’s do the GROUP BY and the COUNTs on the 3-table join –

A1 6 6 A2 2 2 A3 12 12

now, stop here and study this until you see exactly why the counts are what they are, and why the B and C counts are the same

2 Likes

Yes thats made sense.

So we know things do work when joining one table, but not when joining both.

Correct this is what I was seeing, something similar when I removed the GROUP BY.

I wish I could reply with… I’ve found it! I know!
Unfortunately, I’ve been trying to figure out what’s happen here for the past 4 hours :upside_down: :thinking:

I’ll keep trying, see if I can put my finger on the problem.
And thanks for the detailed example r937 :slight_smile:

I was wondering, maybe:

  1. sub query
  2. union
  3. am I joining the wrong tables
  4. using the wrong joins
  5. is this even possible

---------- Update

Everything I read talks about, distinct, sub queries and selecting from the correct tables.

I was curious and tried adding distinct to the counts and made a small change to the inner join on clasue:

SELECT v.id
  , v.venue_id AS Venue
  , COUNT(distinct r.venue_fk) AS reviewCount
  , COUNT(distinct h.venue_id) AS hitsCount 
  
  FROM tbl_venues v
    INNER JOIN tbl_page_hits h ON (v.id = h.venue_id)
    INNER JOIN tbl_reviews r ON (v.id = r.venue_fk)
    
  WHERE h.dtstamp > CURDATE()
  GROUP BY v.venue_id
  ORDER BY hitsCount DESC

Which outputted:

id    Venue    hitsCount reviewCount
46    v1       1         1
16    v5       1         1
34    v7       1         1
17    v9       1         1

Still don’t understand what’s going on :expressionless:

Something simple, I bet.

Thanks,
Barry

john has a hat and a coat

john also has a dog and a cat and a hamster

you would think that the count of clothes is 2, and the count of pets is 3, and it is, if you did them separately

but when you join john to clothes and also to pets, you get 6 rows in the result, before grouping –

john hat dog
john hat cat
john hat hamster
john coat dog
john coat cat
john coat hamster

that’s 6 rows in the join, before grouping

six(6) rows in the join before grouping

now, carefully scan your eye down the columns in the result of the join

how many non-null values in each column? six (6) in both clothes and pets

(also 6 non-null john values, but that’s not terribly relevant at the moment)

now, please, scroll back up and find my comment about how you should ~not~ attempt to solve this underlying problem by using DISTINCT

  1. sub query
    YES … but not until you understand the joins you’re currently running

  2. union
    NO

  3. am I joining the wrong tables
    NO

  4. using the wrong joins
    NO

  5. is this even possible
    OF COURSE IT IS

No eureka moment, though I think I understand what’s happening here.

I just looked at one of my own results after studying your results:

With the join — the total amount for v1 both hitsCount and reviewCount was 406

reviews = 29
hits = 14

29 x 14 (every combination) = 406 :slight_smile:

So what you’re saying is, even though the DISTINCT might work in some instances, its really only hiding the flaws within the code/db design, and once removed, the flaws are plain to see — as our results show. Does that sound right?

Side note:
When I tried using DISTINCT as a test, my result above was not working and only shown 1.

Ok, so do I understand things now?

And the way to get the results we need will mean adding a sub query, separate from the two table join — is that what you have in mind?

Thanks, Barry

i will help you with the subqueries if you would follow along…

step 1, write a stand-alone query that queries only tbl_page_hits and returns a count for each venue_id – a result table of size Nx2

test it and everything, okay?

step 2, write a stand-alone query that queries only tbl_reviews and returns a count for each venue_fk

when both these queries are working, post them here and i’ll show you the next step

bingo

you don’t have a full CROSS JOIN (every row times every row) but instead you have cross join effects which are the same multiplication of rows but locally, within each key value

:sunglasses:

Sounds good!
Actually feels great now I have a understanding :nerd:

Will get to work on the others now.

And just wondering, you keep mentioning things like:

  • MxN
  • PxQ
  • Nx2

What are they and should I know about them in detail?
If I understand correctly, a type of table formula used in mathematics.

I have a couple of other questions but will save these until we have things up and running.

Cheers and speak soon,
Barry

Ok r937
I think everything looks right, queries work as expected in my testing area.

The first query which shows each venue and the hits from today.

SELECT 
      venue_id
    , COUNT(venue_id) AS venueHits
FROM
	tbl_venue_page_hits
WHERE 
	dtstamp > CURDATE()
GROUP BY
	venue_id
ORDER BY 
	venueHits  DESC

Not sure if we need CURDATE() in this query because we are using it on the master query ?
And again, I think ORDER BY could be optionally if we can do this with the master.

Second query

SELECT 
      venue_fk
    , COUNT(venue_fk) AS reviewCount
FROM
	tbl_reviews
WHERE
	confirmed = 1
GROUP BY
	venue_fk

The next step :slight_smile:

Thanks,
Barry

in both of those queries, you would be better off using COUNT(*) instead of COUNT(colname)

it’s always easier for the database engine to count rows rather than inspecting every value to find the NULLs

anyhow…

that NxM business simply describes cardinality of a table’s dimensions

a PxQ table would have P rows consisting of Q columns

the point is very much that a database table is a two-dimensional structure

(the only “holes” are where there are NULLs but for our purposes in this example, there simply won’t be any NULLs to worry about)

the reason i keep bringing this up is because every query produces a table – N rows of M columns

in particular, the FROM clause produces a table – which, sometimes, gets converted into another, different table, by grouping

your three-table join in your FROM clause was producing an intermediate table (before grouping) that had cross join effects in it

Yes best use (*) then. I did read something about this a while back, saying something similar. Thanks for pointing the out.

And appreciate the deep explanations r937, moving into some advance stuff :nerd:
Explained well, making things much clearer.

Barry

Next step?

:thinking:

join the venues table to both of those subqueries

i guess you’ve never seen a subquery in the FROM clause before?

it’s called a derived table – and it has exactly what you’d expect, an NxM two-dimensional structure

so here you go –

SELECT ... FROM tbl_venues AS v LEFT OUTER JOIN ( SELECT venue_id , COUNT(*) AS venueHits FROM tbl_venue_page_hits GROUP BY venue_id ) AS h ON h.venue_id = v.venue_id LEFT OUTER JOIN ( SELECT venue_fk , COUNT(*) AS reviewCount FROM tbl_reviews GROUP BY venue_fk ) AS r ON r.venue_fk = v.venue_id

notice this time it has to be LEFT OUTER JOINs

1 Like

This is excellent :grinning:
Thanks for getting back with the example.

I’ve amended my code and things are looking and working really well :sunglasses:

SELECT v.id
  , v.venue_id
  , venueHits
  , reviewCount
  FROM tbl_venues AS v
  LEFT OUTER JOIN ( SELECT venue_id
                , COUNT(*) AS venueHits
             FROM tbl_venue_page_hits
             WHERE dtstamp > CURDATE()
           GROUP BY venue_id ) AS h
      ON h.venue_id = v.id
  LEFT OUTER JOIN ( SELECT venue_fk                    
                , COUNT(*) AS reviewCount
             FROM tbl_reviews
           GROUP BY venue_fk ) AS r 
      ON r.venue_fk = v.id
  WHERE v.venue_active = 1
  ORDER BY venueHits DESC
  LIMIT 5

I also needed to add WHERE dtstamp > CURDATE() into the first subquery — Is this good practice/ok to add WHERE in this instance?

Yes I noticed this.

So what is the main the purpose of us using the LEFT OUTER JOIN over the INNER JOIN?
And is this something I’ll need to get used to if I’m going to build other queries like this?

I have seen them. I’ve just never needed to use them, which has shown in my little knowledge :slight_smile:

And thanks again for sharing the knowledge r937 :nerd:

inner join requires a matching key

what happens if you have a venue that has no page hits or reviews?

don’t say “oh, that’ll never happen” – it will happen the moment you add a new venue, because you have to add a venue before it can accumulate hits or reviews