
  1. #1
    ShytKicka

    How to optimize a 100 million record table?

    I have a MyISAM table with 100 million records.

    The table has a PRIMARY KEY and a non-unique INDEX on a BIGINT column. My question is: how can we run a query on this indexed column and figure out which primary keys have the most matches? As of right now, this is pretty much as far as I can get:

    SELECT primary_key_column, COUNT(*)
    FROM table_name
    WHERE indexed_column = some_numeric_value
    GROUP BY primary_key_column
    ORDER BY COUNT(*) DESC
    LIMIT 10;


    I know there are small ways to improve this, but I am looking to go from several seconds to getting data back in as little as 100-200 ms. Any ideas? I'm interested in hearing about any algorithms, tuning tips, or even a database specifically designed for this, even if it's not MySQL.

  2. #2
    Dan Grossman
    Having enough RAM on the server to hold that table's index in memory would resolve the query in under 200 ms, but I'll assume you're already maximizing your use of memory.

    Does it hurt if the counts are 10 seconds stale? A minute? An hour?

    You can set up a cron job to issue that query every so often and populate a second table with the counts, which you could then select the values from directly without aggregating rows.
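
    A minimal sketch of that approach, reusing the placeholder names from the first post (the match_counts table, its layout, and the refresh schedule are assumptions, not anything agreed in the thread):

        -- Hypothetical summary table: one row per (indexed value, primary key) pair.
        CREATE TABLE match_counts (
            indexed_column     BIGINT NOT NULL,
            primary_key_column BIGINT NOT NULL,
            match_count        INT UNSIGNED NOT NULL,
            PRIMARY KEY (indexed_column, primary_key_column)
        );

        -- Refresh script, run from cron every so often:
        TRUNCATE TABLE match_counts;
        INSERT INTO match_counts (indexed_column, primary_key_column, match_count)
        SELECT indexed_column, primary_key_column, COUNT(*)
        FROM table_name
        GROUP BY indexed_column, primary_key_column;

        -- The user-facing query then becomes a plain indexed lookup:
        SELECT primary_key_column, match_count
        FROM match_counts
        WHERE indexed_column = some_numeric_value
        ORDER BY match_count DESC
        LIMIT 10;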

  3. #3
    ShytKicka
    No can do. We are getting dynamic values every time and need to return the results to the user; we wouldn't know ahead of time what the value would be.

    My idea was to use MySQL partitioning (a new feature in 5.1+), which would do the job. We could divide a 100-million-record table into 1,000 partitions, so a scan would be as fast as scanning a 100,000-record table. That's my one idea thus far; a rough sketch of the DDL is below.
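
    Something like this, using the placeholder names from the first post (note that MySQL requires the partitioning column to appear in every unique key, so the primary key here has to be made composite; whether that fits the real schema is an assumption):

        CREATE TABLE table_name (
            primary_key_column BIGINT NOT NULL,
            indexed_column     BIGINT NOT NULL,
            -- partitioning column must be part of every unique key
            PRIMARY KEY (primary_key_column, indexed_column),
            KEY idx_indexed (indexed_column)
        ) ENGINE=MyISAM
        PARTITION BY HASH (indexed_column)
        PARTITIONS 1000;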

    I am not sure if we can add any more RAM at this time...

  4. #4
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    the only way partitioning would help you is if your stats are calculated in a way that lets mysql prune partitions. but since your query looks over the entire table, that's out of the question.

    what is the cardinality of indexed_column?

  5. #5
    ShytKicka
    Not sure if I understand.

    The idea for the partitioning would be to split the table into 100 or 1,000 partitions based on the numeric value. So values from 1000000 - 1100000 would be partition 1, values from 1100001 - 1200000 would be partition 2, etc.

    We'd know the value we are trying to find at the time of executing the query, so MySQL would just pick the partition based on the value. If our value were 1102131, MySQL would use partition 2.
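
    A sketch of that range scheme, plus a way to verify the pruning (MySQL 5.1 syntax, placeholder names again; the same every-unique-key rule from the earlier sketch applies):

        ALTER TABLE table_name
        PARTITION BY RANGE (indexed_column) (
            PARTITION p0 VALUES LESS THAN (1000000),
            PARTITION p1 VALUES LESS THAN (1100001),  -- 1000000 - 1100000
            PARTITION p2 VALUES LESS THAN (1200001),  -- 1100001 - 1200000
            -- ... remaining ranges ...
            PARTITION pmax VALUES LESS THAN MAXVALUE
        );

        -- EXPLAIN PARTITIONS lists the partitions the optimizer will
        -- actually read; for 1102131 it should show only p2.
        EXPLAIN PARTITIONS
        SELECT primary_key_column, COUNT(*)
        FROM table_name
        WHERE indexed_column = 1102131
        GROUP BY primary_key_column;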

    The only problem with this is that a partition might hold 100,000 rows but have 50,000 or more matches, and I don't know what MySQL would do at that point. Perhaps a filesort, which can slow things down? No idea...

    To add complication to the equation, I will need to do this up to 50 times for different column values in a UNION query, and need all results grouped by the most-occurring primary key in under a second.

    So far, testing on a development server with a 150,000-record table, results come back in as little as 36 ms for 100 queries in a UNION scanning the indexed column and returning results grouped by ID, which is quite good so far. The shape of that query is sketched below.
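
    Roughly this shape, with two made-up values standing in for the real 50 (UNION ALL rather than UNION, so duplicate primary keys survive to be counted):

        SELECT primary_key_column, COUNT(*) AS matches
        FROM (
            SELECT primary_key_column FROM table_name WHERE indexed_column = 1102131
            UNION ALL
            SELECT primary_key_column FROM table_name WHERE indexed_column = 2203542
            -- ... up to 50 branches ...
        ) AS all_matches
        GROUP BY primary_key_column
        ORDER BY matches DESC
        LIMIT 10;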

    The only problem is that this is a single table, not a partition, so I'm not sure whether query execution times will differ when working with a partition, or with 100-1,000 separate tables of my own. I guess I will need to test. A bigger problem may be cardinality: currently I am testing with unique keys, but the final data will have many repeated values in the index, so I'll need to figure out the impact on execution time, and of course on grouping 30,000 matched records rather than 100, which is basically the cardinality issue.

    I am just trying to figure out if there is any approach, or any other database software out there, that works well with millions of records without expensive hardware.

    Or some other way to form the query. In other words, I'm looking for some type of algorithm that can place the data ahead of time in a way that doesn't require a lot of work at query time.

  6. #6
    longneck
    sorry, when i read your query, i completely missed the part where you had a WHERE clause. yes, if you partition on that value or a hash of that value, you can effectively prune. (does 5.1 actually do partition pruning yet? last i checked, it didn't.)

    so, this changes my advice a little.

    the absolute best thing you can do is create a covering index and configure a large enough key cache to keep that index in RAM. you might also have to use the FORCE INDEX hint to make sure mysql uses it, and you can use LOAD INDEX INTO CACHE to pre-load the entire index into memory.
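
    a sketch with the placeholder names from the first post (the index name idx_covering and the 4 GB key_buffer_size figure are made up; size the cache to fit the index):

        -- covering index: the whole query can be answered from the index alone
        ALTER TABLE table_name
            ADD INDEX idx_covering (indexed_column, primary_key_column);

        -- give the MyISAM key cache room for it, then warm it up
        SET GLOBAL key_buffer_size = 4 * 1024 * 1024 * 1024;
        LOAD INDEX INTO CACHE table_name;

        SELECT primary_key_column, COUNT(*)
        FROM table_name FORCE INDEX (idx_covering)
        WHERE indexed_column = some_numeric_value
        GROUP BY primary_key_column
        ORDER BY COUNT(*) DESC
        LIMIT 10;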

    if that's not an option, the next best thing would probably be to load your values in to a temporary table and join to that instead of using UNION.

    as i'm typing this out, i'm thinking that both are a good idea, actually. in my experience with large data sets, if you have more than 10 or so data points you're looking for, a temp table results in a faster execution plan than UNION or IN().
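
    for example (the wanted_values table and its contents are made up for illustration):

        CREATE TEMPORARY TABLE wanted_values (
            val BIGINT NOT NULL PRIMARY KEY
        );
        INSERT INTO wanted_values (val)
        VALUES (1102131), (2203542);  -- ... up to 50 values ...

        SELECT t.primary_key_column, COUNT(*) AS matches
        FROM table_name t
        INNER JOIN wanted_values w ON w.val = t.indexed_column
        GROUP BY t.primary_key_column
        ORDER BY matches DESC
        LIMIT 10;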

    finally, if none of this is possible or fails to yield adequate results, you will probably have to resort to a summary table maintained by triggers. normally i strongly recommend against triggers, but this is definitely one of the situations where the benefits outweigh the side effects.
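
    something like this, assuming the hypothetical match_counts table sketched earlier in the thread (a matching AFTER DELETE trigger would decrement the count):

        DELIMITER //
        CREATE TRIGGER trg_match_counts_ins
        AFTER INSERT ON table_name
        FOR EACH ROW
        BEGIN
            INSERT INTO match_counts (indexed_column, primary_key_column, match_count)
            VALUES (NEW.indexed_column, NEW.primary_key_column, 1)
            ON DUPLICATE KEY UPDATE match_count = match_count + 1;
        END//
        DELIMITER ;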

  7. #7
    ShytKicka
    Spot on.

    I think I'll probably add 4 GB of RAM to my server for about $80/month and call it a day. But your idea to use a temporary table is an interesting one. I am loading data now, so it will take a while to test. I guess it will all come down to cardinality.

    If UNIONs don't work out I will see what I can do with temporary tables.

    The only thing I don't understand is that I'll need to run 50 queries, so I don't see how I can avoid using UNIONs in this situation. Either way, temporary table or not, my guess is that I'll still need to execute those 50 queries separately (assuming each query accesses a different partition/table).

    A temporary table would be an interesting solution if everything were stored in one table without partitioning. I guess I will test it out and see how it compares to partitions.

  8. #8
    longneck
    i don't understand the bit about 50 queries. why do you need 50 queries?

