  1. #1
    SitePoint Addict bimalpoudel's Avatar
    Join Date
    Feb 2009
    Location
    Kathmandu, Nepal
    Posts
    279
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Speedy: ORDER BY RAND() operation

    What can be done to make the following query run faster?
    file_id is the primary key on the table (index)
    Code:
    # Random playlist
    SELECT
    	file_name `title`,
    	REPLACE(file_path, 'P:/', '') `location`
    FROM physical_files
    ORDER BY
    	RAND()
    LIMIT 30;
    I have many records in the database, and this query is just not efficient. I want to draw random files, but ORDER BY RAND() is not good here.

    There is an interesting post on a similar topic by Anton Titov. Still, my problem cannot be solved that way: I cannot generate a list of random file_ids in PHP, because they may not exist or may be filtered out by the WHERE condition.

    Thank you in advance for a good suggestion.
    Last edited by bimalpoudel; Jan 21, 2011 at 14:33. Reason: remove url added
    Bimal Poudel @ Sanjaal Framework » over Smarty Template Engine
    ASKING INTERESTING QUESTIONS ON SITEPOINT FORUM

    Hire for coding support - PHP/MySQL

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,220
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    i thought it was an excellent article

    why can't you generate the file_ids? what do they look like?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,788
    Mentioned
    25 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by bimalpoudel View Post
    I cannot generate a list of random file_ids in PHP, because they may not exist or may be filtered out by the WHERE condition.
    There are a number of options on how to select random entries depending on how many you want.

    In the variant where you get the ids and then select at random, you use a database query to extract the list of all the key values that exist in the table, and then select randomly from that list. That way they would always exist and satisfy the WHERE condition, because you'd specify that condition in the query retrieving them.

    If there are going to be too many ids to extract them all, then one alternative is to first run a query to determine how many rows of the table satisfy the conditions for inclusion in the random selection. Then generate the required number of random numbers less than that count, and run further queries to return the data in the entries corresponding to those random positions in the results (using LIMIT 1 OFFSET randomnumber). Retrieving 30 randomly selected results would require 31 database calls: one to retrieve the number of records to select from, and then one for each of the 30 records randomly selected. But it would still be lots faster than using ORDER BY RAND().
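    To make the shape of that concrete, here is a minimal sketch of the count-then-OFFSET scheme in Python, with an in-memory SQLite table standing in for the MySQL table; the table, column names, and data are illustrative only:

    ```python
    import random
    import sqlite3

    # Illustrative stand-in for the physical_files table (SQLite instead of MySQL).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE physical_files "
               "(file_id INTEGER PRIMARY KEY, file_name TEXT, allow_play TEXT)")
    db.executemany(
        "INSERT INTO physical_files (file_name, allow_play) VALUES (?, ?)",
        [("file%03d" % i, "Y" if i % 3 else "N") for i in range(100)],
    )

    def random_rows(db, n):
        # Call 1: count the rows that satisfy the WHERE condition.
        total = db.execute(
            "SELECT COUNT(*) FROM physical_files WHERE allow_play = 'Y'"
        ).fetchone()[0]
        # Calls 2..n+1: one LIMIT 1 OFFSET query per random position.
        # random.sample also avoids picking the same offset twice.
        return [
            db.execute(
                "SELECT file_name FROM physical_files "
                "WHERE allow_play = 'Y' LIMIT 1 OFFSET ?", (offset,)
            ).fetchone()[0]
            for offset in random.sample(range(total), n)
        ]

    print(random_rows(db, 5))
    ```

    Every offset is guaranteed to hit an existing, WHERE-satisfying row, because the offsets are drawn below the count returned by the same WHERE clause.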

    ps. I also agree that it is an excellent article. It doesn't cover quite as many alternative options as the book "SQL Antipatterns" does in chapter 16, which provides half a dozen alternative solutions for this problem, but the article provides a far more detailed explanation of the alternatives it does cover.
    Stephen J Chapman

    javascriptexample.net, Book Reviews, follow me on Twitter
    HTML Help, CSS Help, JavaScript Help, PHP/mySQL Help, blog
    <input name="html5" type="text" required pattern="^$">

  4. #4
    SitePoint Addict bimalpoudel's Avatar
    @r937
    The reason why I cannot rely on self-generated random numbers between 1 and max(file_id) is:
    * The file_ids may not be continuous.
    * Some rows will be filtered out by the WHERE condition, e.g. allow_play='Y'; if the file_id of a row whose allow_play is 'N' were generated, there would be no data.

    @felgall
    What about generating a pool of valid file_ids in a different table,
    then randomly selecting file_ids from it, still with ORDER BY RAND() LIMIT 30,
    building a comma-separated list of those file_ids, and querying with the WHERE part:
    Code:
    WHERE file_id IN(r,a,n,d,o,m,l,y,...,f,i,l,e,_,i,d,s)
    I think ORDER BY RAND() will run much faster on this pool than on the table physical_files.
    The pool table has only one column, file_id, which can be indexed,
    or even made the primary key, as in physical_files.
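    A rough sketch of that two-step pool idea (illustrative names and data; an in-memory SQLite database with RANDOM() standing in for MySQL's RAND()):

    ```python
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE physical_files (file_id INTEGER PRIMARY KEY, file_name TEXT)")
    db.execute("CREATE TABLE file_pool (file_id INTEGER PRIMARY KEY)")
    db.executemany("INSERT INTO physical_files VALUES (?, ?)",
                   [(i, "file%03d" % i) for i in range(1, 101)])
    # The pool holds only the file_ids that are valid to play (here: the odd ids).
    db.executemany("INSERT INTO file_pool VALUES (?)", [(i,) for i in range(1, 101, 2)])

    # Step 1: ORDER BY RANDOM() over the slim one-column pool.
    ids = [r[0] for r in db.execute(
        "SELECT file_id FROM file_pool ORDER BY RANDOM() LIMIT 30")]

    # Step 2: fetch the full rows via WHERE file_id IN (...).
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        "SELECT file_id, file_name FROM physical_files "
        "WHERE file_id IN (%s)" % placeholders, ids).fetchall()
    print(len(rows))
    ```

    The pool still pays the full cost of a random sort, just over a narrower table; whether that saves anything in practice is exactly what the rest of the thread debates.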
    Last edited by bimalpoudel; Jan 21, 2011 at 23:01. Reason: indexing a new table with single row: file_id

  5. #5
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Quote Originally Posted by bimalpoudel View Post
    I think ORDER BY RAND() will run much faster on this pool than on the table physical_files.
    Since ORDER BY RAND has to load all the keys into memory and then spend a huge amount of time switching them around, getting rid of the first step will not save any significant amount of time (much as standing on a ladder makes no significant difference in getting you to the moon).

    Based on the information you have provided I would suggest that you do one call to count the number of entries that match the criteria you are after. Then generate your 30 random numbers between 0 and one less than the number of entries. Finally do thirty separate calls to the database each to return one of the random results you are after - specify LIMIT 1 and OFFSET randomNumber to get the specific entry corresponding to the random number you generated.

    The 31 separate calls to the database will between them be way more efficient than the one call using ORDER BY RAND. Just how much time it will save depends on how many records you are extracting the 30 from but even if you only have a few hundred records it should result in saving you at least 90% of the time the order by rand would take (and probably a lot more).

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Quote Originally Posted by felgall View Post
    The 31 separate calls to the database will between them be way more efficient than the one call using ORDER BY RAND.
    i strenuously dispute this blanket statement, and ask you to present proof

  7. #7
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Quote Originally Posted by r937 View Post
    i strenuously dispute this blanket statement, and ask you to present proof
    Why don't you provide an example where the one call using ORDER BY RAND would be more efficient at retrieving 30 records at random than the way I suggested? The inefficiency of ORDER BY RAND is a well documented fact, so I believe the burden of proof that there are circumstances where it isn't lies on the person who disputes that inefficiency.

    Anyway, the simplest, most efficient situation using ORDER BY RAND in this case would be with only 30 records to sort randomly (you can't have fewer than that and still be able to retrieve 30 records). Assuming the minimum possible number of comparisons in sorting, you'd have 29 comparisons, with the final random order being identical to the original order. The time taken to do those 29 comparisons, compared to the time taken to retrieve the number of records in the table, is the difference between the shortest possible time that ORDER BY RAND can take and the time that the alternative will always take. This might be slightly faster than doing the extra read, but it will occur only once in 30! = 2.6525285981219105863630848e+32 runs of the code.

    In the other 30! - 1 possible orders in which the 30 records can be returned, more than 29 comparisons need to be done, and the average number of compares required will take much longer to carry out than retrieving the number of records in the table. That's with just 30 records in the table. The number of comparisons to be done grows rapidly as additional records are added, so the average time taken to do the compares will be significantly larger if there are even a few more records in the table. The time taken by the alternative changes far less, since all we need to do is count the records, not sort them. So except in the very few instances where the "random order" ends up the same as the order in the database, the time taken to do the compares will be significantly longer than the time taken to count the records.

  8. #8
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    come on, stephen, please stop making stuff up, i've called you on that before and yet you keep making these statements

    i'm not disputing that ORDER BY RAND is slow, i'm disputing your claim that 31 calls are "way more efficient" and "saving you at least 90% of the time"

    you're the one who pulled the 90% number out of the air, and i'm just asking you to cite your sources

    otherwise, it's just more bafflegab on your part, like that monolithic huge paragraph of rationalization that you just posted

  9. #9
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,033
    Mentioned
    152 Post(s)
    Tagged
    2 Thread(s)
    I was rather intrigued by this topic so I decided I'd run a test to see if the alternative method Stephen is discussing actually works.

    I used a database table called 'items' that is part of a website I made several years ago. It holds RSS entries from several RSS feeds (stored in a table called 'feeds' and referenced in the 'items' table by 'feedId').

    The table currently contains 141687 rows and the SHOW CREATE is as follows:

    Code:
    CREATE TABLE `items` (
      `itemId` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `feedId` int(11) NOT NULL,
      `itemHash` char(32) DEFAULT NULL,
      `title` varchar(255) NOT NULL,
      `description` text NOT NULL,
      `link` varchar(255) NOT NULL,
      `fetchtime` datetime NOT NULL,
      `imageId` char(64) NOT NULL,
      `pubdate` datetime NOT NULL,
      `pubdate_raw` varchar(255) DEFAULT NULL,
      `numViews` int(11) NOT NULL DEFAULT '0',
      PRIMARY KEY (`itemId`),
      KEY `pd_feedid` (`feedId`,`pubdate`),
      FULLTEXT KEY `search` (`title`,`description`)
    ) ENGINE=MyISAM AUTO_INCREMENT=245465 DEFAULT CHARSET=latin1
    The table holds the results of several years of RSS feeds, but every so often I delete rows that don't get a lot of views (indicated by the column numViews), so there are quite a lot of gaps in the itemId auto_increment column.

    For the WHERE condition I chose WHERE feedId > 34, which holds for 101065 (roughly two-thirds) of the rows.

    So this is real world data collected from real world RSS feeds. Not some artificial data that might be generated to favor one method over the other.

    The script I used for testing is:
    PHP Code:
    set_time_limit(0);

    class Stopwatch
    {
        private $_starttime;

        function __construct() {}

        public function start()
        {
            $mtime = microtime();
            list($usec, $sec) = split(' ', $mtime);
            $this->_starttime = (float)$usec + (float)$sec;
        }

        public function stop()
        {
            $mtime = microtime();
            list($usec, $sec) = split(' ', $mtime);
            return (((float)$usec + (float)$sec) - (float)$this->_starttime);
        }
    }

    $db = mysql_connect('***', '***', '***');
    mysql_select_db('***');
    $sw = new Stopwatch;
    $fp = fopen('times.csv', 'a+');

    for ($numItems = 10; $numItems <= 300; $numItems += 10)
    {
        echo "\n", '++++ numItems = ', $numItems, ' ++++', "\n";

        // ORDER BY RAND()
        $sw->start();
        for ($n = 0; $n < 10; $n++)
        {
            $query = 'SELECT itemId FROM items WHERE feedId>34 ORDER BY RAND() LIMIT ' . $numItems;
            $res = mysql_query($query);
        }
        $orderRandTime = $sw->stop() / 10;

        // ALTERNATIVE METHOD
        $sw->start();
        for ($n = 0; $n < 10; $n++)
        {
            $query = 'SELECT count(*) FROM items WHERE feedId>34';
            $res = mysql_query($query);

            $row = mysql_fetch_assoc($res);
            $items = $row['count(*)'];

            for ($i = 0; $i < $numItems; $i++) {
                $r = rand(0, $items - 1);
                $q = 'SELECT itemId FROM items LIMIT 1 OFFSET ' . $r;
                $res = mysql_query($q);
            }
        }
        $altMethodTime = $sw->stop() / 10;

        fputs($fp, $numItems . ';' . ((string)$orderRandTime) . ';' . ((string)$altMethodTime) . "\n");
    }

    fclose($fp);
    As you can see, I start by taking 10 random items from the database, first using ORDER BY RAND() and then using the method Stephen described.
    I do that 10 times in a row and take the average time over those 10 runs to even out outliers (I should have done more runs, like 100, but that would take too long; anyway, 10 also gives quite a good estimate).

    After taking 10 random items, I move on to 20, 30, 40, ... 300

    Attached to this post are the CSV file (with a .txt extension because vB won't let me upload .csv files) of the results, and a JPG of the graph where the results are drawn in. The horizontal axis indicates the number of random items to take from the database ($numItems in the test script) and the vertical axis indicates the number of seconds it took (on average over 10 runs) to obtain this data.

    As one can see from the graph the ORDER BY RAND() method appears to have a O(1) time complexity, whereas the method Stephen described appears to have a O(n) time complexity.
    When $numItems is smaller than ~75, the method Stephen describes is faster, but when I need more than 75 random items that method loses to the ORDER BY RAND() method.

    Of course the actual numbers of this test are highly dependent on the data at hand, but the trend of O(1) vs O(n) time complexity should hold regardless of the data.
    Attached Images
    Attached Files
    Rémon - Hosting Advisor

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  10. #10
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Quote Originally Posted by ScallioXTX View Post
    As one can see from the graph the ORDER BY RAND() method appears to have a O(1) time complexity, whereas the method Stephen described appears to have a O(n) time complexity.
    Wouldn't the complexity of ORDER BY RAND also vary with the number of records in the database (which you didn't change), and by a lot more than the alternative method varies with the number of records retrieved? ORDER BY RAND only has O(1) time complexity because you didn't change the number of records in the data.

    Also, the number of records in the database would be the one changing over time; the number of records to be retrieved would usually be fixed. That would give this method the O(1) complexity when the number of records in the database is varying and the number of records to retrieve is static.

  11. #11
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Quote Originally Posted by ScallioXTX View Post
    I was rather intrigued by this topic so I decided I'd run a test
    thank you for the effort, it is much appreciated


  12. #12
    SitePoint Addict bimalpoudel's Avatar
    Or, what if we suggest Sun/MySQL add a different way ;-)

    Code:
    SELECT ... FROM ...
    ORDER BY
      RAND_IMPROVED(file_id)
    LIMIT 30;
    All with the logic we discussed above? Again ;-)

  13. #13
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by felgall View Post
    Wouldn't the complexity of ORDER BY RAND also vary with the number of records in the database (which you didn't change), and by a lot more than the alternative method varies with the number of records retrieved? ORDER BY RAND only has O(1) time complexity because you didn't change the number of records in the data.
    Touché. The time it takes for ORDER BY RAND would indeed change as a function of the cardinality of the table, m. So O(1) is indeed not correct; the Big O for ORDER BY RAND must contain m in some way.

    Quote Originally Posted by felgall View Post
    That would give this method the O(1) complexity when the number of records in the database is varying and the number of records to retrieve is static.
    Indeed, the method you're describing will always take the same time given a fixed value for n, but that doesn't magically give the method as a whole a complexity of O(1); the complexity is still O(n), since the time needed still increases as a function of n. Pointing out that the time remains the same for a fixed value of n is cheating.

    But what this means is, if m changes but n remains fixed, the time it will take for ORDER BY RAND will increase, while the time the method you outlined will indeed remain constant.

    In terms of the graph I attached with my last post, the blue line will be displaced vertically (its values will be increased) whereas the red line will stay the way it is now.

    Nevertheless, there will still be a point where the two lines intersect and after that point ORDER BY RAND will be faster than the method you've described.

    To further illustrate this, I have an older version of the database, a backup from before the last time I pruned it, and that one has 216032 rows. Of these rows, 160008 have a feedId of more than 34. Since the ratio of feedId > 34 vs feedId <= 34 was not the same in this database as it is in the other database, I removed 5913 rows with feedId > 34 randomly from the database.
    Code:
    DELETE FROM items WHERE feedId>34 ORDER BY RAND() LIMIT 5913
    That makes the number of rows with feedId > 34 equal to 154095.
    Since 101065 / 141687 ~= 154095 / 216032, the ratios are now the same for both databases, just to make sure the results aren't skewed.

    See the attachment for the graph with the results of this test on the bigger database versus the results from the smaller database. As you can see, the line for the method you outlined indeed hasn't changed (except for high values of n, but I think we can attribute that to taking only 10 runs per sample), whereas ORDER BY RAND has increased from ~0.95 to ~0.98 seconds.

    I think we can agree that the rule of thumb seems to be that the method you outlined is better for smaller values of n, while ORDER BY RAND is better for bigger values of n.
    Attached Images

  14. #14
    From space with love silver trophy
    SpacePhoenix's Avatar
    Join Date
    May 2007
    Location
    Poole, UK
    Posts
    5,000
    Mentioned
    101 Post(s)
    Tagged
    0 Thread(s)
    Scallio, do you mind testing this way of doing it (with the same initial dataset as the first two ways)?

    PHP Code:
    // 3rd Method

    $sw->start();

    for ($n = 0; $n < 10; $n++) {

        $query = 'SELECT itemId FROM items WHERE feedId>34 LIMIT ' . $numItems;

        $res = mysql_query($query);

        $item_ids = array();
        while ($row = mysql_fetch_array($res, MYSQL_BOTH)) {
            $item_ids[] = $row;
        }

        $random_item_ids = array_rand($item_ids, 10);

        // Optional shuffling of the order
        shuffle($random_item_ids);
    }

    $orderRandTime = $sw->stop() / 10;
    Community Team Advisor
    Forum Guidelines: Posting FAQ Signatures FAQ Self Promotion FAQ
    Help the Mods: What's Fluff? Report Fluff/Spam to a Moderator

  15. #15
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    I wouldn't mind, but I don't see what you're trying to do. You get $numItems rows from the database, pick $numItems of those at random (so you'll just get all items in the array) and then shuffle them.

    So you just get the first $numItems rows from the database in a random order. Or am I missing something?

  16. #16
    From space with love silver trophy
    SpacePhoenix's Avatar
    Remove the LIMIT from the query

  17. #17
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by SpacePhoenix View Post
    Remove the LIMIT from the query
    Ah, now it makes sense!

    Well, that also has a relatively constant time, and I have to say I'm totally surprised. I would have expected it to be the slowest of all options (due to the massive amount of data transfer needed), but at only ~0.21 seconds for $numItems=10 to ~0.22 seconds for $numItems=300 (on the larger table) it's actually quite a lot (~67%) faster than ORDER BY RAND!

    Nice one SpacePhoenix!!

    PS. If PHP and MySQL run on different servers this probably isn't the best option, especially not if you have to pay for data transfer between those servers. But still, very nice!
    PPS. This is the code I used:

    PHP Code:
    $sw->start();
    for ($n = 0; $n < 10; $n++) {
        $query = 'SELECT itemId FROM items WHERE feedId>34';

        $res = mysql_query($query);

        $item_ids = array();
        while ($row = mysql_fetch_assoc($res)) {
            $item_ids[] = $row['itemId'];
        }

        $random_item_ids = array_rand($item_ids, $numItems);

        // Optional shuffling of the order
        // shuffle($random_item_ids);
    }
    $time = $sw->stop() / 10;
    I didn't apply shuffle because array_rand also gets random items, so shuffle would only give it slightly more entropy.
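    In outline, the fetch-all-ids-and-pick-in-code approach looks like this (a Python sketch with an in-memory SQLite table standing in for the items table; names and data are illustrative):

    ```python
    import random
    import sqlite3

    # Illustrative stand-in for the items table (SQLite instead of MySQL).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (itemId INTEGER PRIMARY KEY, feedId INTEGER)")
    db.executemany("INSERT INTO items (feedId) VALUES (?)",
                   [(i % 50,) for i in range(1000)])

    # One plain, unordered query pulls every matching id into the application...
    item_ids = [r[0] for r in db.execute("SELECT itemId FROM items WHERE feedId > 34")]
    # ...and the random selection happens in code, like PHP's array_rand().
    random_item_ids = random.sample(item_ids, 30)
    print(len(random_item_ids))
    ```

    One caveat worth noting: PHP's array_rand() returns the chosen keys in their original array order, so if the output order itself needs to be random, the shuffle() step is not redundant.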

  18. #18
    From space with love silver trophy
    SpacePhoenix's Avatar
    Scallio, it would make for an interesting test to repeat the three tests with PHP and MySQL on physically separate server boxes, and to compare the speed of each option both against the others and against its same-box result.

  19. #19
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by SpacePhoenix View Post
    Scallio, it would make for an interesting test to repeat the three tests with PHP and MySQL on physically separate server boxes, and to compare the speed of each option both against the others and against its same-box result.
    If I had access to two different server boxes I'd surely like to try that. Sadly I don't ...

  20. #20
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  21. #21
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by r937 View Post
    Sorry, but I'm going to write that one off as a Bad Idea (tm).

    It's based on the assumption that if you want to get x percent of the rows in the database you should use WHERE RAND() < x/100.
    That in turn is based on the assumption that RAND() in MySQL has a perfectly uniform distribution.
    Quite an assumption to make, and it doesn't hold.

    Back to my table with 141687 rows. Say I want 100 rows from it. From the info you linked to, I should use 100/141687 ~= 0.000706. So the query is:

    Code:
    SELECT itemId FROM items WHERE RAND()<=0.000706;
    (leaving out the WHERE feedId>34 for simplicity)

    If I run that 1000 times, these are the numbers of rows I get back (n: m meaning n rows were returned in m of the 1000 runs):
    Code:
    73: 2
    75: 1
    77: 4
    78: 5
    79: 3
    81: 6
    82: 10
    83: 8
    84: 9
    85: 12
    86: 17
    87: 13
    88: 20
    89: 20
    90: 24
    91: 28
    92: 23
    93: 40
    94: 30
    95: 40
    96: 45
    97: 40
    98: 40
    99: 32
    100: 47
    101: 39
    102: 35
    103: 37
    104: 39
    105: 30
    106: 26
    107: 29
    108: 32
    109: 29
    110: 24
    111: 20
    112: 16
    113: 11
    114: 9
    115: 20
    116: 13
    117: 13
    118: 9
    119: 5
    120: 7
    121: 7
    122: 9
    123: 2
    124: 4
    125: 5
    126: 3
    127: 1
    128: 1
    129: 1
    130: 1
    131: 1
    132: 1
    135: 1
    139: 1
    Which looks like the beginnings of a bell curve with μ = 100, which is no surprise at all.

    My point being that you don't necessarily get 100 rows back with this 0.000706. The author of the method found that too, and said "we can increase the fragment number a bit and limit the query", but that's just adjusting the numbers because the theory doesn't fit, and it still doesn't guarantee 100 rows. Just because the chance of fewer than 100 rows decreases doesn't mean it won't happen.

    All that, plus I need to know how many rows are in the database before I can even start this procedure, which is an extra query (SELECT COUNT(*) FROM daTable).
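    For what it's worth, that spread in row counts can be reproduced with any plain pseudo-random generator, uniform or not; a quick simulation (Python, using the row count from the post) shows the same behaviour:

    ```python
    import random

    # Simulate WHERE RAND() <= p over a table the size of the one in the post.
    total_rows = 141687
    p = 100 / total_rows          # ~0.000706, aiming for 100 rows

    # Each run returns a different number of rows.
    counts = [sum(random.random() <= p for _ in range(total_rows))
              for _ in range(20)]
    print(min(counts), max(counts))
    ```

    The per-run count is binomially distributed around 100, which matches the bell-curve shape of the numbers above.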

  22. #22
    SitePoint Addict bimalpoudel's Avatar
    Code:
    WHERE RAND()<=0.000706;
    This either repeats data, or misses records. Every time I tried this without LIMIT, it returned a different number of results.

    Further, one really needs to calculate the value up to which to compare.

  23. #23
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by bimalpoudel View Post
    Code:
    WHERE RAND()<=0.000706;
    That one was specific for the number of results I wanted from my database. It's not some generic number you can just plug in anywhere.

    Quote Originally Posted by bimalpoudel View Post
    This either repeats data,
    No it doesn't

    Quote Originally Posted by bimalpoudel View Post
    or misses records.
    That's the whole point of ORDER BY RAND(), isn't it?

    Quote Originally Posted by bimalpoudel View Post
    Every time I tried this without LIMIT, it returned a different number of results. Further, one really needs to calculate the value up to which to compare.
    Yes, that's the whole point of my previous post. Did you read that post or just scan it?

