SitePoint Sponsor

User Tag List

Results 1 to 13 of 13
  1. #1
    purple monkey dishwasher scoates's Avatar
    Join Date
    Nov 2001
    Location
    Montreal
    Posts
    794
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    MySQL Optimization help.

    I have a table with ~300,000 rows in it.

    I want to do this:
    SELECT COUNT(DISTINCT target) FROM stats_log;

    It returns the correct number (1364), but takes 3.27 seconds to run.

    In the mysql client, if I do this:
    SELECT DISTINCT target FROM stats_log;

    it returns the whole result set, and then says: "1364 rows in set (0.63 sec)"

    I want to avoid sending the 1364 rows to my PHP script, and have PHP run a num_rows on the query, but I don't seem to have another option. (3.x seconds is WAY too long for a select query).

    Is there a better way to do COUNT(DISTINCT column) ? MySQL's default algorithm seems to suck.

    Thanks,
    S

  2. #2
    Making a better wheel silver trophy DR_LaRRY_PEpPeR's Avatar
    Join Date
    Jul 2001
    Location
    Missouri
    Posts
    3,428
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    maybe you could insert all the rows into a temp table and then count the rows from there?

    Code:
    CREATE TEMPORARY TABLE stats_temp TYPE=HEAP
    	SELECT DISTINCT target FROM stats_log;
    SELECT COUNT(*) FROM stats_temp;
    DROP TABLE stats_temp;
    i made the temp table a HEAP, so it's in memory. see how that works. i'm not sure if COUNT(*) is optimized (as fast) the same on HEAP tables as MyISAM, but i would think the HEAP would be faster since it's not disk-based.
    - Matt ** Ignore old signature for now... **
    Dr.BB - Highly optimized to be 2-3x faster than the "Big 3."
    "Do not enclose numeric values in quotes -- that is very non-standard and will only work on MySQL." - MattR

  3. #3
    purple monkey dishwasher scoates's Avatar
    Join Date
    Nov 2001
    Location
    Montreal
    Posts
    794
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yep, that works.. ~0.63 seconds.

    too bad MySQL isn't smart enough to do this internally.. )-:

    Thanks!

    S

  4. #4
    Making a better wheel silver trophy DR_LaRRY_PEpPeR's Avatar
    Join Date
    Jul 2001
    Location
    Missouri
    Posts
    3,428
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    all right, goody. MySQL 4 is supposed to have a faster COUNT(DISTINCT col).

  5. #5
    SitePoint Author Kevin Yank's Avatar
    Join Date
    Apr 2000
    Location
    Melbourne, Australia
    Posts
    2,571
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Wow, cool tip! Just be aware of the memory footprint this operation is occupying on your server. I wouldn't recommend it on a table where most of the ~300,000 rows would be distinct.

    -Kev.
    Kevin Yank
    CTO, sitepoint.com
    I wrote: Simply JavaScript | BYO PHP/MySQL | Tech Times | Editize
    Baby’s got back—a hard back, that is: The Ultimate CSS Reference

  6. #6
    purple monkey dishwasher scoates's Avatar
    Join Date
    Nov 2001
    Location
    Montreal
    Posts
    794
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, even if it WAS 300,000 rows, they only occupy ~35MB of space on disk, so it's possible to have them in memory.

    On Linux, I know that the VM system would keep the whole thing in RAM anyway (it has REALLY efficient caching/swapping)..

    I'm actually surprised that the number of distinct rows is that big (~1300). I doubt it would get much larget than that, unless I start getting a lot of requests fo pages that don't exist (and then, I'd just filter out 404s).

    (-:

    S

  7. #7
    Making a better wheel silver trophy DR_LaRRY_PEpPeR's Avatar
    Join Date
    Jul 2001
    Location
    Missouri
    Posts
    3,428
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by kyank
    Wow, cool tip! Just be aware of the memory footprint this operation is occupying on your server. I wouldn't recommend it on a table where most of the ~300,000 rows would be distinct.
    you mean because it's a HEAP table? well, you can't have HEAP tables bigger than max_heap_table_size, which is 16MB by default. if it gets larger than that, it will automatically be converted to a disk-based MyISAM table, i believe.

  8. #8
    SitePoint Author Kevin Yank's Avatar
    Join Date
    Apr 2000
    Location
    Melbourne, Australia
    Posts
    2,571
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Yes, but consider three simultaneous requests that each create a 35MB heap table... If you're designing something like a Web-based forum where simultaneous requests will be common, a solution like this might not be practical.

    Dr. Pepper is right though, I think -- likely MySQL resorts to disk-based table when the maximum heap size is exceeded.
    Kevin Yank
    CTO, sitepoint.com
    I wrote: Simply JavaScript | BYO PHP/MySQL | Tech Times | Editize
    Baby’s got back—a hard back, that is: The Ultimate CSS Reference

  9. #9
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    That is really bizarre behavior; do you have an index on the counted column? That would probably help.

  10. #10
    Making a better wheel silver trophy DR_LaRRY_PEpPeR's Avatar
    Join Date
    Jul 2001
    Location
    Missouri
    Posts
    3,428
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by MattR
    That is really bizarre behavior; do you have an index on the counted column? That would probably help.
    that's what i was gonna ask, but you would think that it would process the COUNT() the same way, except counting the rows instead of returning them. ya know, it should still have to examine the same amount or rows...

  11. #11
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by DR_LaRRY_PEpPeR
    that's what i was gonna ask, but you would think that it would process the COUNT() the same way, except counting the rows instead of returning them. ya know, it should still have to examine the same amount or rows...
    Well, if there is an index on the column it would only have to read the index which is much smaller than the table (so the likelihood of it sitting in memory is higher).

    Plus, depending on how MySQL optimizes distinct it could factor out rows, since on the leaf-levels of the B tree you have your key in asc order:
    1203
    1204
    1205
    1205
    1205
    1205

    So it can 'skip ahead' to the next largest value without counting all of them if it is smart as to the key distribution.

    Take a look at the EXPLAIN output, here is what it says for Sybase:
    Code:
    SELECT COUNT( DISTINCT userid )
      FROM user
    
    QUERY PLAN FOR STATEMENT 1 (at line 1).
    
    
        STEP 1
            The type of query is INSERT.
            The update mode is direct.
    
            FROM TABLE
                users
            Nested iteration.
            Table Scan.
            Forward scan.
            Positioning at start of table.
            Using I/O Size 2 Kbytes for data pages.
            With LRU Buffer Replacement Strategy for data pages.
            TO TABLE
                Worktable1.
    
        STEP 2
            The type of query is SELECT.
            Evaluate Ungrouped COUNT AGGREGATE.
    
            FROM TABLE
                Worktable1.
            Nested iteration.
            Table Scan.
            Forward scan.
            Positioning at start of table.
            Using I/O Size 2 Kbytes for data pages.
            With MRU Buffer Replacement Strategy for data pages.
    
        STEP 3
            The type of query is SELECT.
    Last edited by MattR; May 9, 2002 at 17:29.

  12. #12
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    So it looks like it is performing the same thing internally but quite slowly.. I noticed the same performance for non-indexed columns in ASE for our 400,000 row thread table (2.7 seconds vs. .6).

  13. #13
    purple monkey dishwasher scoates's Avatar
    Join Date
    Nov 2001
    Location
    Montreal
    Posts
    794
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yes, there are indexes (indices) on columns logtime and target..
    target is the key here..

    adding the index on target made the query faster, but hardly (.1 to .2 seconds).

    S


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •