Speedy: ORDER BY RAND() operation

What can be done to make the following query run faster?
file_id is the primary key on the table (index)


# Random playlist
SELECT
	file_name `title`,
	REPLACE(file_path, 'P:/', '') `location`
FROM physical_files
ORDER BY
	RAND()
LIMIT 30;

I have a lot of records in the database, and this query is just not efficient. I want to draw random files, but ORDER BY RAND() is not good here.

There is an interesting post on a similar topic by Anton Titov. Still, my problem cannot be solved that way: I cannot generate a list of random file_ids in PHP, because those ids may not exist, or they may be filtered out by the WHERE condition.

Thank you in advance for a good suggestion.

i thought it was an excellent article

why can’t you generate the file_ids? what do they look like?

There are a number of options on how to select random entries depending on how many you want.

In the variant where you get the ids and then select from them at random, you use a database query to extract the list of all the key values that exist in the table and then select randomly from that list. That way they will always exist and satisfy the WHERE condition, because you’d specify that condition in the query retrieving them.

If there are going to be too many ids to extract them all, then one alternative is to first run a query to determine how many rows of the table satisfy the conditions for inclusion in the random selection. Then generate the required number of random numbers less than that count, and run a second query for each one to return the entry at that random position in the results (using LIMIT 1 OFFSET randomnumber). Retrieving 30 randomly selected results would therefore require 31 database calls: one to get the number of records to select from, and one for each of the 30 randomly selected records. It would still be a lot faster than using ORDER BY RAND().
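
In rough SQL terms, against the table from the original query (the 12345 offset is just a placeholder for one of the random numbers the PHP script would generate, and any WHERE condition would need to appear in both statements):


# step 1: how many rows qualify for the random selection
SELECT COUNT(*) FROM physical_files;

# step 2: repeated 30 times, once per random offset generated in PHP
SELECT file_name `title`,
       REPLACE(file_path, 'P:/', '') `location`
  FROM physical_files
 LIMIT 1 OFFSET 12345;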

ps. Also agree that it is an excellent article. It doesn’t cover quite as many alternative options as the book “SQL Antipatterns” does in chapter 16, which provides half a dozen alternative solutions for this problem, but the article does provide a far more detailed explanation of the alternatives it does cover.

@r937
The reason why I cannot rely on self-generated random numbers between 1 and max(file_id) is:

  • The file_ids may not be continuous.
  • Some rows will be filtered out by the WHERE condition, e.g. allow_play = ‘Y’; if the file_id of a row whose allow_play is ‘N’ were generated, there would be no data for it.

@felgall
What about generating a pool of valid file_ids in a different table,
and then randomly selecting file_ids from it, still with ORDER BY RAND() LIMIT 30,
building a comma-separated list of those file_ids, and querying with a WHERE part like:

WHERE file_id IN(r,a,n,d,o,m,l,y,...,f,i,l,e,_,i,d,s)

I think that ORDER BY RAND() will run much faster on that pool table than on physical_files.
The pool table has only one column, file_id, and it can be indexed,
or even made the primary key, just as in physical_files.
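
Something like this is what I have in mind (the pool table name, the INT UNSIGNED type and the allow_play = ‘Y’ filter are just my examples; the ids in the final IN () are dummies standing in for the 30 ids returned by the previous query):


# build / refresh the pool of valid file_ids
CREATE TABLE file_id_pool (file_id INT UNSIGNED NOT NULL PRIMARY KEY);

INSERT INTO file_id_pool (file_id)
SELECT file_id FROM physical_files WHERE allow_play = 'Y';

# pick 30 ids from the narrow pool
SELECT file_id FROM file_id_pool ORDER BY RAND() LIMIT 30;

# fetch the rows, with the 30 ids glued into an IN () list by PHP
SELECT file_name `title`, REPLACE(file_path, 'P:/', '') `location`
  FROM physical_files
 WHERE file_id IN (101, 205, 309);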

Since ORDER BY RAND() has to load all the keys into memory and then spend a huge amount of time shuffling them around, getting rid of the first step will not save any significant amount of time (like standing on a ladder making no significant difference in getting you to the moon).

Based on the information you have provided I would suggest that you do one call to count the number of entries that match the criteria you are after. Then generate your 30 random numbers between 0 and one less than the number of entries. Finally do thirty separate calls to the database each to return one of the random results you are after - specify LIMIT 1 and OFFSET randomNumber to get the specific entry corresponding to the random number you generated.

The 31 separate calls to the database will between them be far more efficient than the one call using ORDER BY RAND. Just how much time it saves depends on how many records you are extracting the 30 from, but even if you only have a few hundred records it should save you at least 90% of the time the ORDER BY RAND would take (and probably a lot more).

i strenuously dispute this blanket statement, and ask you to present proof

Why don’t you provide an example where the one call using ORDER BY RAND would be more efficient at retrieving 30 records at random than the way I suggested? The inefficiency of ORDER BY RAND is a well-documented fact, so I believe the burden of proof that there are circumstances where it isn’t lies on the person who disputes that inefficiency.

Anyway, the simplest, most efficient situation using ORDER BY RAND in this case would be with only 30 records to sort randomly (you can’t have fewer than that and still be able to retrieve 30 records). Assuming the minimum possible number of comparisons in the sort, you’d have 29 comparisons, with the final random order being identical to the original order. The time taken to do those 29 comparisons, compared to the time taken to retrieve the number of records in the table, is the difference between the shortest possible time the ORDER BY RAND can take and the time the alternative will always take. That might be slightly faster than doing the extra read, but it will occur only once in 2.6525285981219105863630848e+32 runs of the code. In all the other possible orders in which the 30 records can be returned there will be more than 29 comparisons to do, and the average number of compares required will take much longer to carry out than retrieving the number of records in the table. That’s with just 30 records in the table. The number of comparisons grows rapidly as records are added, so the average time spent on compares will be significantly larger if there are even a few more records in the table. The time taken by the alternative changes far less, since all we need to do is count the records, not sort them. So, except in the very few instances where the “random order” ends up the same as the order in the database, the time taken to do the compares will be significantly longer than the time taken to count the records.
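
(For the record, that long number is just 30 factorial, the number of possible orderings of 30 records. A quick sanity check with PHP’s BCMath extension, for anyone who wants to verify it:)


$f = '1';
for ($i = 2; $i <= 30; $i++) {
	$f = bcmul($f, (string)$i); // requires the BCMath extension
}
echo $f; // 265252859812191058636308480000000, i.e. ~2.65e+32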

come on, stephen, please stop making stuff up, i’ve called you on that before and yet you keep making these statements

i’m not disputing that ORDER BY RAND is slow, i’m disputing your claim that 31 calls are “way more efficient” and “saving you at least 90% of the time”

you’re the one who pulled the 90% number out of the air, and i’m just asking you to cite your sources

otherwise, it’s just more bafflegab on your part, like that monolithic huge paragraph of rationalization that you just posted

I was rather intrigued by this topic so I decided I’d run a test to see if the alternative method Stephen is discussing actually works.

I used a database table called ‘items’ that is part of a website I made several years ago. It holds RSS entries from several RSS feeds (the feeds themselves are stored in a table called ‘feeds’, referenced from the ‘items’ table by ‘feedId’).

The table currently contains 141687 rows and the SHOW CREATE is as follows:


CREATE TABLE `items` (
  `itemId` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `feedId` int(11) NOT NULL,
  `itemHash` char(32) DEFAULT NULL,
  `title` varchar(255) NOT NULL,
  `description` text NOT NULL,
  `link` varchar(255) NOT NULL,
  `fetchtime` datetime NOT NULL,
  `imageId` char(64) NOT NULL,
  `pubdate` datetime NOT NULL,
  `pubdate_raw` varchar(255) DEFAULT NULL,
  `numViews` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`itemId`),
  KEY `pd_feedid` (`feedId`,`pubdate`),
  FULLTEXT KEY `search` (`title`,`description`)
) ENGINE=MyISAM AUTO_INCREMENT=245465 DEFAULT CHARSET=latin1

The table holds the results of several years of RSS feeds, but every so often I delete rows that don’t get a lot of views (indicated by the column numViews), so there are quite a lot of gaps in the itemId auto_increment column.

For the WHERE condition I chose WHERE feedId > 34, which holds for 101065 rows (roughly two thirds of the data).

So this is real world data collected from real world RSS feeds. Not some artificial data that might be generated to favor one method over the other.

The script I used for testing is:


set_time_limit(0);
class Stopwatch
{
	private $_starttime;
	function __construct() {}

	public function start()
	{
		$mtime = microtime();
		list($usec, $sec) = explode(' ', $mtime);
		$this->_starttime = (float)$usec + (float)$sec;
	}

	public function stop()
	{
		$mtime = microtime();
		list($usec, $sec) = explode(' ', $mtime);
		return (((float)$usec + (float)$sec) - (float)$this->_starttime);
	}
}

$db=mysql_connect('***','***','***');
mysql_select_db('***');
$sw=new StopWatch;
$fp=fopen('times.csv', 'a+');

for ($numItems=10; $numItems<=300; $numItems+=10)
{
	echo "\
", '++++ numItems = ', $numItems, ' ++++', "\
";

	// ORDER BY RAND()
	$sw->start();
	for ($n=0;$n<10;$n++)
	{
		$query='SELECT itemId FROM items WHERE feedId>34 ORDER BY RAND() LIMIT '.$numItems;
		$res=mysql_query($query);
		
	}
	$orderRandTime=$sw->stop() / 10;

	// ALTERNATIVE METHOD
	$sw->start();
	for ($n=0;$n<10;$n++)
	{
		$query='SELECT count(*) FROM items WHERE feedId>34';
		$res=mysql_query($query);

		$row = mysql_fetch_assoc($res);
		$items=$row['count(*)'];

		for ($i=0;$i<$numItems;$i++) {
			$r=rand(0,$items-1);
			$q='SELECT itemId FROM items WHERE feedId>34 LIMIT 1 OFFSET '.$r;
			$res=mysql_query($q);
		}
	}
	$altMethodTime=$sw->stop() / 10;

	fputs($fp, $numItems.';'.((string)$orderRandTime).';'.((string)$altMethodTime)."\n");
}

fclose($fp);

As you can see I start with taking 10 random items from the database, first using ORDER BY RAND() and then using the method Stephen described.
I do that 10 times in a row and take the average time over those 10 runs to even out outliers (I should have done more runs, like 100, but that would have taken too long, and 10 also gives quite a good estimate).

After taking 10 random items, I move on to 20, 30, 40, … 300

Attached to this post are the CSV file of the results (with a .txt extension because vB won’t let me upload .csv files) and a JPG of the graph of those results. The horizontal axis indicates the number of random items to take from the database ($numItems in the test script) and the vertical axis indicates the number of seconds it took (on average over 10 runs) to obtain this data.

As one can see from the graph, the ORDER BY RAND() method appears to have O(1) time complexity, whereas the method Stephen described appears to have O(n) time complexity.
When $numItems is smaller than ~75 items, the method Stephen describes is faster, but when I need more than ~75 random items that method loses to the ORDER BY RAND() method.

Of course the actual numbers of this test are highly dependent on the data at hand, but the trend of O(1) vs O(n) time complexity should hold regardless of the data.

Wouldn’t the complexity of the ORDER BY RAND also vary with the number of records in the database (which you didn’t change), and by a lot more than the alternative method varies with the number of records retrieved? ORDER BY RAND only has O(1) time complexity because you didn’t change the number of records in the data.

Also, the number of records in the database is the one that would change over time. The number of records to be retrieved would usually be fixed. That gives this method the O(1) complexity when the number of records in the database varies and the number of records to retrieve is static.

thank you for the effort, it is much appreciated

:slight_smile:

Or, what about suggesting that Sun/MySQL provide a different way :wink:


SELECT ... FROM ...
ORDER BY
  RAND_IMPROVED(file_id)
LIMIT 30;

Using all of the logic that we discussed above? Again :wink:

Touché. The time it takes for ORDER BY RAND would indeed change as a function of the cardinality of the table, m, so O(1) is indeed not correct. The Big O for ORDER BY RAND must contain m in some way.

Indeed, the method you’re describing will always take the same time given a fixed value of n, but that doesn’t magically give the method as a whole a complexity of O(1); the complexity is still O(n), since the time needed still increases as a function of n. Pointing out that the time remains the same for a fixed value of n is cheating :wink:

But what this means is, if m changes but n remains fixed, the time it will take for ORDER BY RAND will increase, while the time the method you outlined will indeed remain constant.

In terms of the graph I attached to my last post, the blue line will be displaced vertically (its values will increase) whereas the red line will stay the way it is now.

Nevertheless, there will still be a point where the two lines intersect and after that point ORDER BY RAND will be faster than the method you’ve described.

To further illustrate this, I have an older version of the database, a backup from before the last time I pruned it, and that one has 216032 rows. Of these rows, 160008 have a feedId of more than 34. Since the ratio of feedId > 34 vs feedId <= 34 is not the same in this database as it is in the other database, I randomly removed 5913 rows with feedId > 34:


DELETE FROM items WHERE feedId>34 ORDER BY RAND() LIMIT 5913

That makes the number of rows with feedId > 34 equal to 154095.
Since 101065 / 141687 ~= 154095 / 216032, the ratios are now the same for both databases, just to make sure the results aren’t skewed.

See the attachment for the graph with the results of this test on the bigger database versus the results for the smaller database. As you can see, the line for the method you outlined indeed hasn’t changed (except for high values of n, but I think we can attribute that to only averaging over 10 runs), whereas ORDER BY RAND has increased from ~0.95 to ~0.98 seconds.

I think we can agree that the rule of thumb seems to be that the method you outlined is better for smaller values of n, while ORDER BY RAND is better for bigger values of n :slight_smile:

Scallio, do you mind testing this way of doing it (with the same initial dataset as the first two ways)?

    // 3rd Method
    
    $sw->start();

    for ($n=0;$n<10;$n++) {


        $query='SELECT itemId FROM items WHERE feedId>34 LIMIT '.$numItems;

        $res=mysql_query($query);
        
        $item_ids=array();
        while ($row = mysql_fetch_array($res, MYSQL_BOTH)) {
            $item_ids[]=$row;
        }
        
        $random_item_ids=array_rand($item_ids,10);
        
        // Optional shuffling of the order
        shuffle($random_item_ids);
    }
    
    $orderRandTime=$sw->stop() / 10;

I wouldn’t mind but I don’t see what you’re trying to do. You get $numItems from the database, pick $numItems of those at random (so you’ll just get all items in the array) and then shuffle them.

So you just get the $numItems first rows from the database in a random order. Or am I missing something? :shifty:

:blush: Remove the LIMIT from the query

Ah, now it makes sense!

Well, that also has a relatively constant time, and I have to say I’m totally surprised. I would have expected it to be the slowest of all options (due to the massive amount of data transfer needed), but at only ~0.21 seconds for $numItems=10 to ~0.22 seconds for $numItems=300 (on the larger table) it’s actually quite a lot (~67%) faster than ORDER BY RAND!

Nice one SpacePhoenix!!

PS. If PHP and MySQL run on different servers this probably isn’t the best option, especially not if you have to pay for data transfer between those servers. But still, very nice! :slight_smile:
PPS. This is the code I used:


$sw->start();
for ($n=0;$n<10;$n++) {
	$query='SELECT itemId FROM items WHERE feedId>34';

	$res=mysql_query($query);

	$item_ids=array();
	while ($row = mysql_fetch_assoc($res)) {
		$item_ids[]=$row['itemId'];
	}

	$random_item_ids=array_rand($item_ids,$numItems);

	// Optional shuffling of the order
	// shuffle($random_item_ids);
}
$time=$sw->stop() / 10;

I didn’t apply shuffle because array_rand also picks random items, so shuffle would only add slightly more entropy.
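
One caveat, outside the timing itself: array_rand() returns random keys of $item_ids rather than the values themselves, so to end up with the actual itemIds you’d still map those keys back to the values, something like:


$random_keys = array_rand($item_ids, $numItems);
$random_item_ids = array();
foreach ((array)$random_keys as $key) {	// the (array) cast covers $numItems = 1, where array_rand returns a single key
	$random_item_ids[] = $item_ids[$key];
}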

Scallio, it would make for an interesting test to repeat the three tests with PHP and MySQL on physically separate server boxes, and to compare the speed of each option both against the others and against itself when everything runs on the same server box.

If I had access to two different server boxes I’d surely like to try that. Sadly I don’t … :-/

here’s another method …