SitePoint Sponsor

User Tag List

Page 1 of 2 12 LastLast
Results 1 to 25 of 47
  1. #1
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    best way to get total number of rows

    i am building something similar to what a forum is like.
    so on the list of all the threads, i am going to state the total number of replies on each threads displayed on the list.

    if i were to do a query of mysql_num_rows() or count() on the replies table for every single thread to get the total number of replies, is this the right way to do things? i just find that one day when many of the threads has like a few thousand replies then will my query timeout or have some error or will cause the page to load very slowly?

  2. #2
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Maintain a reply count in the threads table. A little denormalization for the sake of huge performance gains. You don't want to be counting up thousands of items on every request for a potentially high usage app like a forum.

  3. #3
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    if i were to add a field called totalreplies into the threads table instead of creating another table, would it be the same as creating a new table just to maintain the reply count?

  4. #4
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You can issue a query to count the number of rows.

    Assuming that your table looks like this:
    Code:
    create table replies (
      reply_id serial,
      thread_id integer,
      reply_text text
    )
    Where thread_id is the id of a thread, in a different table. The following will give you the number of replies for a given thread:
    Code:
    select count(reply_id) from replies group by thread_id;

  5. #5
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    but if one thread has like more than a thousand replies, will it slow down or even timeout by doing a count on all the rows?

    on the thread index, it has a lot of threads. imagine counting the rows for all the threads, will it cause problems?

  6. #6
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    mysql doesn't "count" rows in a literal sense, it just pulls the number out of the index file. This is very fast even if you have millions of records.

  7. #7
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    oh...then what about mysql_num_rows() in php? does it do the same thing as the count function in the sql statement?

  8. #8
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by x[E]nOn View Post
    oh...then what about mysql_num_rows() in php? does it do the same thing as the count function in the sql statement?
    Nope. That gives you the number of rows in a resultset. To use that, you need to actually select out all the rows, which will be very slow on a large table. It's not the right way to count rows in a table.

  9. #9
    SitePoint Addict Jasper Bekkers's Avatar
    Join Date
    May 2007
    Location
    The Netherlands
    Posts
    282
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by stereofrog View Post
    mysql doesn't "count" rows in a literal sense, it just pulls the number out of the index file. This is very fast even if you have millions of records.
    Even for simple queries like "SELECT COUNT(*) FROM aTable GROUP BY aColumn" can't be retrieved from the index, the only possible case where you can say 'for sure' (I'll remove the quotes if you can find a reference in the MySql documentation saying it does) that it retrieves the number from the index is with a 'SELECT COUNT(*) FROM aTable' which isn't the most common query anyway. A more optimal solution would be to cache the data somewhere in your database because a wrong COUNT can be a huge hog on performance.

  10. #10
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    MySql does a lot of caching, so if you issue a count query many times, or perhaps follow it by a select with the same criteria, it may not be such a big problem. Before optimising on a database, you really need to profile the application first.

    That said, count() may not be efficient under all circumstances, and then you could apply something like Dan Grossman suggested. You just shouldn't do it, until you have actually identified a performance bottleneck.

  11. #11
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    "select count(*) from replies group by thread_id" will be lighting fast for any number of rows (if there's an index on thread_id).

    Quote Originally Posted by Jasper Bekkers View Post
    Even for simple queries like "SELECT COUNT(*) FROM aTable GROUP BY aColumn" can't be retrieved from the index
    Why?

  12. #12
    SitePoint Addict Jasper Bekkers's Avatar
    Join Date
    May 2007
    Location
    The Netherlands
    Posts
    282
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kyberfabrikken View Post
    MySql does a lot of caching
    I think you're talking about the query cache, which only results the execution path of the query, not the resultset.
    Quote Originally Posted by stereofrog View Post
    Why?
    Sorry it should have been a WHERE clause, MyISAM and InnoDB are notoriously slow in that case. (InnoDB is anyway for a COUNT query).

  13. #13
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    what are some of the methods that some big apps are using? say for this vBulletin forum, how are they counting our total posts the total posts in each thread?

  14. #14
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Jasper Bekkers View Post
    Even for simple queries like "SELECT COUNT(*) FROM aTable GROUP BY aColumn" can't be retrieved from the index, the only possible case where you can say 'for sure' (I'll remove the quotes if you can find a reference in the MySql documentation saying it does) that it retrieves the number from the index is with a 'SELECT COUNT(*) FROM aTable' which isn't the most common query anyway.
    You're confusing indexes with table statistics stored by MyISAM tables. Selecting a COUNT(*) from an entire table takes constant time on a MyISAM table because this is stored in the table statistics and neither the table nor an index needs to be examined. A COUNT that is completely covered by an index won't take constant time but will still be much faster than if it had to scan the table instead.

  15. #15
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by x[E]nOn View Post
    what are some of the methods that some big apps are using? ?
    Both phpBB and punBB store a reply count in the threads/topics table.

  16. #16
    SitePoint Guru
    Join Date
    Sep 2004
    Location
    NY, USA
    Posts
    712
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Dan Grossman View Post
    Both phpBB and punBB store a reply count in the threads/topics table.

    as does vbulletin

  17. #17
    SitePoint Addict
    Join Date
    Jan 2005
    Posts
    204
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    thanks guys!
    by the way, where is the index and index file you guys were talking about? i am not sure what is it.

  18. #18
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by x[E]nOn View Post
    thanks guys!
    by the way, where is the index and index file you guys were talking about? i am not sure what is it.
    You *can't* design a database for a real world application without knowing what indexes are, when to use them, and how to use them. Time to study up

    http://dev.mysql.com/doc/refman/5.0/...l-indexes.html

  19. #19
    SitePoint Member
    Join Date
    May 2007
    Location
    San Francisco, California
    Posts
    18
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    my own .02:
    Making up some numbers for the sake of an example..
    If I had/was expecting on average like 50-75 replies to each post, I'd store the reply count in the thread record as a column. If I had like 10,000 threads with maybe 2-3 replies, I'd just do a select count() sql query as mentioned above.

    I'd be interested in what is faster though...

    I'm sure its something that is *technically* rdbms-specific, but from past experience, SELECT count(*) FROM table is pretty quick, even on tables with over 100k rows. My personal experience is MSSQL, but have had similar results with postgres (don't really use MySQL[anymore])

  20. #20
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by oranjcrush View Post
    I'd be interested in what is faster though...
    There's no question about what's going to be faster. You're already retrieving the rows from the threads table to display the list of threads. If you stored the reply count in that table, retrieving an additional column in the same query adds nothing. Not maintaining this count means *at the very least* scanning an index on the posts table; you're retrieving a count grouped by thread_id, not just an unbounded count(*).

  21. #21
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by oranjcrush View Post
    I'd be interested in what is faster though...
    The right question would be: what is more proper? Storing derived facts like 'count of replies' makes the database bloated and hard to maintain.

  22. #22
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadephia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by stereofrog View Post
    The right question would be: what is more proper? Storing derived facts like 'count of replies' makes the database bloated and hard to maintain.
    In exchange for running at all. SitePoint, for example, would be offline if it didn't. The number of times a thread list is retrieved far exceeds the number of times a reply is made and the count has to be updated.

  23. #23
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

  24. #24
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,220
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    normalization does not mean eliminating "redundant" data like a count that is calculated elsewhere and then stored in a parent table

    normalization deals with the dependence of the attribute on the primary key, so the redundant count is actually not denormalized

    normalization is not for sissies, it is for everyone

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  25. #25
    SitePoint Member
    Join Date
    May 2007
    Location
    San Francisco, California
    Posts
    18
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by stereofrog View Post
    The right question would be: what is more proper? Storing derived facts like 'count of replies' makes the database bloated and hard to maintain.
    Granted. By 'best' are we talking about best pratice or best performance?
    I'd rather be poor than run a proxy site.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •