Complicated MySQL query

I have two database tables which track documents and the document release dates. The ‘document’ table only contains the current version of each document. The ‘document_replaced’ table only contains previous versions of the documents.

When querying the database we always specify a project_id and a document_category.

I need to create a list of document release dates which takes into account the following rules/attributes:

  • Each date should only appear once in the list UNLESS a specific document was updated two or more times on the same day.
  • The first version isn’t always ‘1’ - it can be any number.
  • Documents that haven’t been updated will have an entry in the ‘documents’ table only.

Here is some sample data from the two database tables (document and document_replaced):

SELECT document_id,document_version,DATE(document_date),document_project,document_category FROM document WHERE document_project = 242 AND document_category=‘Architectural’ order by document_id;

±------------±-----------------±--------------------±-----------------±------------------+
| document_id | document_version | DATE(document_date) | document_project | document_category |
±------------±-----------------±--------------------±-----------------±------------------+
| 12503 | 7 | 2016-04-11 | 242 | Category_1 |
| 12504 | 2 | 2016-04-07 | 242 | Category_1 |
| 12505 | 2 | 2016-04-07 | 242 | Category_1 |
| 12506 | 3 | 2016-04-07 | 242 | Category_1 |
| 12507 | 4 | 2016-04-07 | 242 | Category_1 |
| 12508 | 2 | 2016-04-07 | 242 | Category_1 |
| 12509 | 1 | 2016-04-11 | 242 | Category_1 |
±------------±-----------------±--------------------±-----------------±---------------+

SELECT parent_doc,document_version,DATE(document_date),parent_project,document_category FROM document_replaced WHERE parent_project = 242 AND document_category=‘Architectural’ order by parent_doc;
±-----------±-----------------±--------------------±---------------±------------------+
| parent_doc | document_version | DATE(document_date) | parent_project | document_category |
±-----------±-----------------±--------------------±---------------±------------------+
| 12503 | 1 | 2016-04-03 | 242 | Category_1 |
| 12503 | 2 | 2016-04-07 | 242 | Category_1 |
| 12503 | 3 | 2016-04-07 | 242 | Category_1 |
| 12503 | 4 | 2016-04-07 | 242 | Category_1 |
| 12503 | 5 | 2016-04-07 | 242 | Category_1 |
| 12503 | 6 | 2016-04-07 | 242 | Category_1 |
| 12504 | 1 | 2016-04-07 | 242 | Category_1 |
| 12508 | 1 | 2016-04-07 | 242 | Category_1 |
±-----------±-----------------±--------------------±---------------±---------------+

The ‘parent_doc’ field in the ‘document_replaced’ table corresponds with the ‘document_id’ field in the ‘document’ table. When a document is replaced we copy the field values over from the ‘document’ table and set the ‘parent_doc’ field in the document_replaced table to the document_id of the document that is being replaced.

It might be easier to understand what I’m looking for if I include a crude visualisation of how the final data is displayed. Document names on the left, revision dates across the top and corresponding version numbers in the grid:

					3-Apr-16  7-Apr-16  7-Apr-16  7-Apr-16  7-Apr-16  7-Apr-16  11-Apr-16
Document 12503		                   1	     2	       3	 4	   5	     6		7
Document 12504			                     1	       2
Document 12505			                     2
Document 12506			                     3
Document 12507			                     4
Document 12508			                     1	       2
Document 12509			                                                                        1

I’m not even sure if what I need can be done with a single query but I’ve tried doing it with a couple of queries, some PHP code and an array and it’s doing my head in. I’m looking to get that list of dates but if a single query can gather all of the related data then that would be a bonus.

Gary

yes, you can grab everything in one query –

SELECT document_id , document_version , DATE(document_date) AS document_date , document_project , document_category FROM document WHERE document_project = 242 AND document_category = 'Architectural' UNION ALL SELECT parent_doc , document_version , DATE(document_date) , parent_project , document_category FROM document_replaced WHERE parent_project = 242 AND document_category = 'Architectural' ORDER BY document_id

Thanks - that’s way more efficient than how I’m currently doing it! I think I need to spend some time reading up on UNION and UNION ALL.

My other stumbling block in this is getting the right number of dates across the top of the grid. I have tried but I always end up with too many date columns. If we take document 12503 out (it’s an exception - there really wouldn’t be that many updates in one day) we have documents that were updated once on 7-Apr, documents that were updated twice on 7-Apr and one that was updated on 11-Apr. In this case, there should be two columns for 7-Apr and one for 11-Apr. This is one of the areas where it got really messy - I had a query for each table that pulled out the dates and then ran them through some loops, looking at the maximum number of updates that were done on each date. An easy query (for me) if it were a single table but because there are two tables I’m finding it difficult.

I suspect the query needs to be a UNION query that can group by document_id/parent_doc,version,document_date. It has to be able to return, for example, a specific date 3 times if there are two versions in document_replaced and one version in document. The closest I’ve got is this, which is returning way too many date columns:

SELECT DATE(document_date) AS document_date
       FROM document 
 WHERE document_project = 242 
   AND document_category = 'Architectural'
   GROUP BY document_id,document_version,document_date
UNION ALL
SELECT DATE(document_date)
       FROM document_replaced 
 WHERE parent_project = 242 
   AND document_category = 'Architectural' 
   GROUP BY parent_doc,document_version,document_date
ORDER
    BY document_date

Gary

You already have all the info you need with the query r937 gave you.
I’d say that creating the grid is something you need to do in PHP.

1 Like

Thanks for the assistance r937. I’ve worked some more on the query to extract the number of times each document has been updated on each date. This query, based on the query you provided, should allow me to create an array of dates by looking at the datecount column which I can then filter and arrange in an array.

SELECT docid,document_date,COUNT(*) AS datecount FROM
(
SELECT document_id docid
     , document_version
     , DATE(document_date) AS document_date
     , document_project
     , document_category 
  FROM document 
 WHERE document_project = 229 
   AND document_category = 'Architectural'
UNION ALL
SELECT parent_doc docid
     , document_version
     , DATE(document_date)
     , parent_project
     , document_category 
  FROM document_replaced 
 WHERE parent_project = 229 
   AND document_category = 'Architectural' 
ORDER BY docid,document_version
) AS alldocs
GROUP BY docid,document_date
ORDER BY document_date ASC

Gary

looks good… may i offer a couple suggestions

you don’t need document_project or document_category in the SELECT clauses, that just bloats the intermedate table produced by the union

also, the ORDER BY inside the union should be removed

Yep - done and done. Thanks :slight_smile:

With your help I have everything working correctly. The function I originally had for generating the dates at the top of the grid has been reduced from around 100 lines of mess to a much more efficient 15 lines or so.

Gary

1 Like

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.