Group similar results, listing users

Forbes · February 22, 2011, 10:58pm

Hi guys!

I’d like to group similar results, but list all of the users who’ve added the same data.

Just to explain, I have two tables:

a table for the bookmarks, containing the URI by which I’d be grouping, and;
a table for the users and bookmark IDs, linking users to bookmarks.

I know how to group by the URI, but not how to pull through the names of all of the users who added the same bookmark.

Maybe this would have to happen within PHP? Right now, I don’t know what would be quicker — which is why I’m here, asking you guys!

SpacePhoenix · February 22, 2011, 11:26pm

Can you post the output of a SHOW CREATE TABLE query for both of the tables?

r937 · February 23, 2011, 3:49am

SELECT bookmarks.uri
     , COUNT(DISTINCT bookmarks.id) AS dupes
     , GROUP_CONCAT(userbookmarks.user_id) AS users
  FROM bookmarks
LEFT OUTER
  JOIN userbookmarks
    ON userbookmarks.bookmark_id = bookmarks.id
GROUP
    BY bookmarks.uri

the count of distinct ids for each bookmark should always be 1… i just threw that in there in case you might have duplicates (which you can avoid by defining a UNIQUE index on the uri, and then you won’t need this count)

Forbes · February 23, 2011, 9:26am

Hi, and thanks for the reply.

CREATE TABLE IF NOT EXISTS `bookmarks` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `title` text,
  `snippet` text,
  `datetime` datetime default NULL,
  PRIMARY KEY  (`id`),
  KEY `datetime` (`datetime`),
  FULLTEXT KEY `title` (`title`,`snippet`)
) ENGINE=MyISAM

CREATE TABLE IF NOT EXISTS `links` (
  `id` mediumint(11) NOT NULL auto_increment,
  `user_id` mediumint(11) NOT NULL,
  `bookmark_id` int(11) NOT NULL,
  `status` enum('public','private') NOT NULL default 'public',
  UNIQUE KEY `id` (`id`),
  KEY `bookmark_id` (`bookmark_id`)
) ENGINE=MyISAM

Forbes · February 23, 2011, 9:29am

r937:

SELECT bookmarks.uri
     , COUNT(DISTINCT bookmarks.id) AS dupes
     , GROUP_CONCAT(userbookmarks.user_id) AS users
  FROM bookmarks
LEFT OUTER
  JOIN userbookmarks
    ON userbookmarks.bookmark_id = bookmarks.id
GROUP
    BY bookmarks.uri
the count of distinct ids for each bookmark should always be 1… i just threw that in there in case you might have duplicates (which you can avoid by defining a UNIQUE index on the uri, and then you won’t need this count)

Hi and thanks for the reply.

Yes, there are duplicate URIs; that’s an integral function of the application.

I’ll get back later today, once I’ve managed to try out the code.

Forbes · February 23, 2011, 3:55pm

r937:

SELECT bookmarks.uri
     , COUNT(DISTINCT bookmarks.id) AS dupes
     , GROUP_CONCAT(userbookmarks.user_id) AS users
  FROM bookmarks
LEFT OUTER
  JOIN userbookmarks
    ON userbookmarks.bookmark_id = bookmarks.id
GROUP
    BY bookmarks.uri
the count of distinct ids for each bookmark should always be 1… i just threw that in there in case you might have duplicates (which you can avoid by defining a UNIQUE index on the uri, and then you won’t need this count)

I got an error relating to a non unique alias for the links table, which I removed from the FROM. But then I got another error:

Unknown column 'bookmarks.id' in 'on clause'

Here’s the sequel statement in full:

SELECT bookmarks.id, COUNT(DISTINCT bookmarks.id) AS duplicates, GROUP_CONCAT(links.user_id) AS users, links.user_id, bookmarks.url, bookmarks.title, SUBSTR(bookmarks.snippet, 1, 100) as snippet, GROUP_CONCAT(tags.tag) AS tags, bookmarks.datetime, links.status FROM bookmarks, tags LEFT OUTER JOIN links ON (links.bookmark_id = bookmarks.id) WHERE (links.status = 'public') AND (links.bookmark_id = bookmarks.id) GROUP BY bookmarks.url ORDER BY bookmarks.datetime DESC LIMIT 0, 9

Needless to say, there most certainly is a column called ‘id’ for the bookmarks table.

You’ll perhaps notice that there’s a duplicate (links.bookmark_id = bookmarks.id) in the statement; that’s a hold over from the previous statement. Even if I remove it, I still get the error.

r937 · February 23, 2011, 4:07pm

okay, we’ll have to deconstruct your latest query to focus on what you’re really after

what i gave you in post #3 was based on your original requirements, but your latest query adds all kinds of extra stuff, much of which is quite incompatible with the original GROUP BY clause

let me give you an analogy to illustrate the conceptual problem you’re having

suppose you have a school consisting of multiple classrooms, where each classroom has multiple students

now i will ask you to write a query which returns the number of students in each classroom, along with the student’s last name

your first and immediate response should be “which student?”

it is the same issue – when you have a GROUP BY clause, every column in the SELECT clause must either be mentioned in the GROUP BY clause or be the argument of an aggregate function, like COUNT() or MAX()

do you see the issue?

Forbes · February 23, 2011, 4:13pm

Before I do that, I looked up the error and, apparently, the cause is the usage of the LEFT OUTER JOIN not coming after an originating JOIN. Would that be correct?

r937 · February 23, 2011, 4:17pm

yeah, it is down to mixing the (deprecated) comma-list join syntax together with explicit JOIN syntax

but we will have to tear all that apart anyway…

Forbes · February 23, 2011, 4:31pm

Yes, sort of.

Why would I be asking you which student? The whole purpose of this query is to just return the names of those in the same classrooms, not to return a specific student by name.

That aside, what do you want me to do? I’ assuming the statement will need carving into two, or something like that.

r937 · February 23, 2011, 4:53pm

because a GROUP BY on the classroom returns a single aggregate row for the classroom, and since there are multiple ~different~ student names in a classroom, asking for one of them, without stating which one, is nonsensical

similarly, you have both

     , GROUP_CONCAT(links.user_id) AS users
     , links.user_id

in the SELECT clause – the first one is okay, because it aggregates all the users into a single string, but the second one isn’t, because it’s not an aggregate function

let’s get back to your fixing your query, though

we’ll begin with simple retrieval from the bookmarks table –

SELECT bookmarks.id
     , bookmarks.url
     , bookmarks.title
     , SUBSTR(bookmarks.snippet, 1, 100) as snippet
     , bookmarks.datetime
  FROM bookmarks

you mentioned that there could be duplicate urls in this table – may i ask why?

Forbes · February 24, 2011, 11:16am

r937:

We’ll begin with simple retrieval from the bookmarks table –
SELECT bookmarks.id
     , bookmarks.url
     , bookmarks.title
     , SUBSTR(bookmarks.snippet, 1, 100) as snippet
     , bookmarks.datetime
  FROM bookmarks
you mentioned that there could be duplicate urls in this table – may i ask why?

Sorry about the late reply — life!

I’m working on a bookmarking service, so it’s a function of that service to allow people to add the same bookmark.

Over time, the number of times a bookmark has been added will build towards a ranking algorithm.

r937 · February 24, 2011, 11:29am

okay, so given that a particular url can be in the table more than once, will ~all~ rows for the same url have identical titles and snippets and datetimes?

presumably the answer is no

so let me ask you then, if you were to aggregate/collapse all the rows for a particular url into one result row, which title or snippet or datetime would you like to see? keep in mind the student name analogy for the classroom count

Forbes · February 24, 2011, 11:45am

Correct. And here’s an example of a bookmark that’s been added by the same person.

So far, Under Cloud can handle duplicate bookmarks on an individual basis, like the aforementioned.

As with the example I provided, the first instance would be the proper bookmark to use, as all others are more recent.

So in the index view, when a duplication is discovered, the very first bookmark would be used, along with the date and time by which it was added.

I suppose the next logical question to you would be, are the duplicates detected within the rows retrieved within the remit of the LIMIT clause, or across the whole of the table?

r937 · February 24, 2011, 11:58am

not sure what you mean by index view

here’s the query which counts the bookmarks, augmented with the “earliest” data…

SELECT bookmarks.url
     , m.dupes
     , bookmarks.title
     , SUBSTR(bookmarks.snippet, 1, 100) as snippet
     , bookmarks.datetime
  FROM ( SELECT url
              , COUNT(*) AS dupes
              , MIN(datetime) AS earliest
           FROM bookmarks
         GROUP
             BY url ) AS m
INNER
  JOIN bookmarks
    ON bookmarks.url = m.url
   AND bookmarks.datetime = m.earliest

make sense?

Forbes · February 24, 2011, 1:13pm

Index is simply the listing of all bookmarks in date order.

r937:

here’s the query which counts the bookmarks, augmented with the “earliest” data…

SELECT bookmarks.url
     , m.dupes
     , bookmarks.title
     , SUBSTR(bookmarks.snippet, 1, 100) as snippet
     , bookmarks.datetime
  FROM ( SELECT url
              , COUNT(*) AS dupes
              , MIN(datetime) AS earliest
           FROM bookmarks
         GROUP
             BY url ) AS m
INNER
  JOIN bookmarks
    ON bookmarks.url = m.url
   AND bookmarks.datetime = m.earliest

make sense?

I just tried that on the live data and it didn’t combine the two most recent duplicates “Alan Turing’s Patterns in Nature, and Beyond”, which was added by another user and myself.

SpacePhoenix · February 24, 2011, 1:34pm

Check the URLs for the duplicates of “Alan Turing’s Patterns in Nature, and Beyond”, are the URLs identical?

Forbes · February 24, 2011, 2:09pm

They were, but the I’ve since edited mine (removing extraneous name-value parameters).

Having run the query again, the “dupes” column is now showing two and there’s only one bookmark being listed, and the date and time relates to the first bookmark added by Anna. So that’s now working.

By way of proof, here’s the actual bookmark listing, with me as the other user having bookmarked the same article.

I’ve just tried adding back in the options for retrieving user information, but to no avail.

r937 · February 24, 2011, 2:11pm

my crystal ball is broken, and i can’t see your attempt from here

mind sharing?

Forbes · February 24, 2011, 2:26pm

Well it was such an appalling mess, I just reverted back!

I was trying to merge the parts of the original query (posted previously) with your own.

I’ll be honest, all of this way beyond me — the queries I write are, for the most part, very basic.

SELECT
bookmarks.id, source.duplicates, GROUP_CONCAT(links.user_id) AS users, links.user_id, bookmarks.url, bookmarks.title, SUBSTR(bookmarks.snippet, 1, 100) as snippet, GROUP_CONCAT(tags.tag) AS tags, bookmarks.datetime, links.status
FROM (
	SELECT url, COUNT(*) AS duplicates, MIN(datetime) AS earliest
	FROM bookmarks GROUP BY url) AS source, links, tags
INNER JOIN bookmarks ON (bookmarks.url = source.url)
AND (bookmarks.datetime = source.earliest)
AND (links.status = 'public')
AND (links.bookmark_id = bookmarks.id) GROUP BY bookmarks.url
ORDER BY bookmarks.datetime DESC LIMIT 0, 9

All of which gives me an error:

#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ON (bookmarks.url = source.url) AND (bookmarks.datetime = source.earliest) AND (’ at line 1

From what I can grasp, MySQL doesn’t know to handle the various references to columns belonging to the links and tags tables, and I don’t know where to put the table names.