SitePoint Sponsor

User Tag List

Results 1 to 3 of 3
  1. #1
    Obey the Purebreed trib4lmaniac's Avatar
    Join Date
    Dec 2004
    Location
    Cornwall, UK
    Posts
    594
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Revision Control

    What would be an efficient manner to store revision information in a database? I am not talking about large quantities of text such as on Wikipedia (which I would use diff tools for), but for your average tables made up of a set of small fields.
    Perhaps a few of the database gurus around here could conjecture as to pluses and pitfalls of the methods listed here, and perhaps suggest different ways of doing it.

    Having thought about it a bit and looked around a few sites, I can see 3 obvious ways to store this information.

    Split Tables
    Code SQL:
    -- Contains unversioned foo data.
    CREATE TABLE foos
    (
        id INTEGER NOT NULL,
        bar INTEGER NOT NULL,
     
        PRIMARY KEY (id)
    );
     
    -- Contains versioned foo data.
    CREATE TABLE foo_revisions
    (
        foo_id INTEGER NOT NULL,
        revision_id INTEGER NOT NULL,
        revision_timestamp DATETIME NOT NULL,
     
        baz DATETIME NOT NULL,
        quux VARCHAR(255) NOT NULL,
     
        PRIMARY KEY (foo_id, revision_id)
    );
    This methods seems to be the cleanest to me, but I would be worried about performance after a while as joins would be required for the simplest of selects. On the plus side though, not all data is required to be under reversion control.

    Single Table
    Code SQL:
    -- Contains ALL foo data.
    CREATE TABLE foos
    (
        id INTEGER NOT NULL,
        revision_id INTEGER NOT NULL,
        revision_timestamp DATETIME NOT NULL,
     
        bar INTEGER NOT NULL,
        baz DATETIME NOT NULL,
        quux VARCHAR(255) NOT NULL,
     
        PRIMARY KEY (id, revision_id)
    );
    This method reduces the number of joins required for selects, but also requires that all data is versioned.

    Shadow Tables
    Code SQL:
    -- Contains the most current foo data.
    CREATE TABLE foos
    (
        id INTEGER NOT NULL,
        bar INTEGER NOT NULL,
        baz DATETIME NOT NULL,
        quux VARCHAR(255) NOT NULL,
     
        PRIMARY KEY (id)
    );
     
     
    -- Contains past foo data.
    CREATE TABLE foo_shadows
    (
        foo_id INTEGER NOT NULL,
        revision_id INTEGER NOT NULL,
        revision_timestamp DATETIME NOT NULL,
     
        bar INTEGER NOT NULL,
        baz DATETIME NOT NULL,
        quux VARCHAR(255) NOT NULL,
     
        PRIMARY KEY (foo_id, revision_id)
    );
    Shadow tables could make use of triggers to make the versioning system transparent to the client. It also holds all the "current" information in a separate table, meaning the shadow tables can be ignored if they are not required. It does require duplication of table structure though.

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    i got lost at the Split Table design

    the original table has only one data column (bar), so if there is versioning, then i would expect to see bar in the revisions table too

    instead, i see a revision_id (how does it work? it appears to want to be incremental within foo_id), a timestamp (presumably self-explanatory), and then a couple of new columns, including a datetime (what for?) and a varchar (what for?)

    but no bar

    so i am lost as to where the previous value of bar might be found
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    Obey the Purebreed trib4lmaniac's Avatar
    Join Date
    Dec 2004
    Location
    Cornwall, UK
    Posts
    594
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    the original table has only one data column (bar), so if there is versioning, then i would expect to see bar in the revisions table too
    I split the columns between the two tables to demonstrate the fact that the first design does not require all information to be versioned. bar is not versioned, whilst baz and quux are. In the other two designs, all three fields are required to be under version control.

    Quote Originally Posted by r937 View Post
    instead, i see a revision_id (how does it work? it appears to want to be incremental within foo_id), a timestamp (presumably self-explanatory), and then a couple of new columns, including a datetime (what for?) and a varchar (what for?)
    If I were to actually implement this, revision_id would likely be AUTO_INCREMENTed, I am just trying to demonstrate the layout of the tables. The revision_id simply says what revision the "row" (there could be many of them) is currently at. Ideally, you could use the timestamp alone to tell the current revision, but theoretically two updates could have the same timestamp.
    As for the DATETIME for which you ask "what for?" it is simply "foo data." An arbitrary data type for an arbitrary column.

    Perhaps it would help if I demonstrated the sort of things I would like to be able to do once the structure is in place. I'll try and make the example more practical this time

    Code SQL:
    CREATE TABLE articles
    (
      id INTEGER NOT NULL,
      author_id INTEGER NOT NULL,
     
      PRIMARY KEY (id)
    );
     
    CREATE TABLE article_revisions
    (
      article_id INTEGER NOT NULL,
      revision_id INTEGER NOT NULL,
      revision_timestamp DATETIME NOT NULL,
     
      title TINYTEXT NOT NULL,
      body TEXT NOT NULL,
     
      PRIMARY KEY (article_id, revision_id)
    );
    In this set up - using the "split table" design I described earlier, the two tables represent articles (obviously), which have 4 fields; the id, the author's id, and the article's title and body. The author_id is in the first table because it does not require versioning (it will never change). The second table represents the changes made to an articles title and body, along with the time it happened (revision_timestamp).

    With this structure I can, for example, retrieve the latest revisions for all articles in the database with the following query:
    Code SQL:
    SELECT
        a.id,
        a.author_id,
        ar.title,
        ar.body
    FROM
        articles a,
        article_revisions ar
    WHERE
        a.id = ar.article_id AND
        (ar.article_id, ar.revision_id) IN (SELECT ar.article_id, MAX(ar.revision_id) revision_id FROM article_revisions ar GROUP BY ar.article_id)
    ORDER BY
        ar.article_id ASC,
        ar.revision_id DESC;
    This works wonderfully well, but I wonder as to its efficiency. For a task as simple as retrieve the latest articles, I had to use an inner join and a subquery! Of course, there may be better ways to do this...

    Of the 3 methods I have outlined in my previous post, I'm sure each have their pros and cons; but I am interested in potential bottlenecks and slow queries that could arise.
    If I were to implement revision control, it would be in a database heavy, web application, so I need to be aware of these things before I go about committing myself to any particular method.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •