I was rather intrigued by this topic so I decided I’d run a test to see if the alternative method Stephen is discussing actually works.
I used a database table called ‘items’ that is part of a website I made several years ago. It holds RSS entries from several RSS feeds (stored in a table called ‘feeds’ and referenced in the ‘items’ table by ‘feedId’).
The table currently contains 141687 rows and the SHOW CREATE is as follows:
CREATE TABLE `items` (
`itemId` int(11) unsigned NOT NULL AUTO_INCREMENT,
`feedId` int(11) NOT NULL,
`itemHash` char(32) DEFAULT NULL,
`title` varchar(255) NOT NULL,
`description` text NOT NULL,
`link` varchar(255) NOT NULL,
`fetchtime` datetime NOT NULL,
`imageId` char(64) NOT NULL,
`pubdate` datetime NOT NULL,
`pubdate_raw` varchar(255) DEFAULT NULL,
`numViews` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`itemId`),
KEY `pd_feedid` (`feedId`,`pubdate`),
FULLTEXT KEY `search` (`title`,`description`)
) ENGINE=MyISAM AUTO_INCREMENT=245465 DEFAULT CHARSET=latin1
The table holds the results of several years of RSS feeds, but every so often I delete rows that don’t get a lot of views (indicated by the numViews column), so there are quite a lot of gaps in the itemId auto_increment column.
For the WHERE condition I chose WHERE feedId > 34, which holds for 101065 rows (roughly two thirds of the data).
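Both of those figures are easy to check directly; just for illustration (these queries are not part of the test script):

SELECT COUNT(*) AS rows_present, MAX(itemId) AS highest_itemId FROM items;
-- compares the number of rows left with the highest itemId handed out

SELECT COUNT(*) FROM items WHERE feedId > 34;
-- the 101065 rows the WHERE condition matches

With 141687 rows and the auto_increment counter already at 245465, roughly 100000 ids in that range are gone.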
So this is real-world data collected from real-world RSS feeds, not artificial data generated to favor one method over the other.
The script I used for testing is:
set_time_limit(0);

// Simple stopwatch for timing a batch of queries
class Stopwatch
{
    private $_starttime;

    public function start()
    {
        // microtime() returns "usec sec"; combine the parts into one float
        list($usec, $sec) = explode(' ', microtime());
        $this->_starttime = (float)$usec + (float)$sec;
    }

    public function stop()
    {
        list($usec, $sec) = explode(' ', microtime());
        return ((float)$usec + (float)$sec) - $this->_starttime;
    }
}

$db = mysql_connect('***', '***', '***');
mysql_select_db('***');

$sw = new Stopwatch();
$fp = fopen('times.csv', 'a+');

for ($numItems = 10; $numItems <= 300; $numItems += 10)
{
    echo "\n", '++++ numItems = ', $numItems, ' ++++', "\n";

    // ORDER BY RAND(): one query per batch, averaged over 10 runs
    $sw->start();
    for ($n = 0; $n < 10; $n++)
    {
        $query = 'SELECT itemId FROM items WHERE feedId>34 ORDER BY RAND() LIMIT '.$numItems;
        $res = mysql_query($query);
    }
    $orderRandTime = $sw->stop() / 10;

    // Alternative method: count the matching rows, then fetch one row at a
    // random offset, $numItems times; also averaged over 10 runs
    $sw->start();
    for ($n = 0; $n < 10; $n++)
    {
        $query = 'SELECT count(*) FROM items WHERE feedId>34';
        $res = mysql_query($query);
        $row = mysql_fetch_assoc($res);
        $items = $row['count(*)'];
        for ($i = 0; $i < $numItems; $i++)
        {
            $r = rand(0, $items - 1);
            $q = 'SELECT itemId FROM items LIMIT 1 OFFSET '.$r;
            $res = mysql_query($q);
        }
    }
    $altMethodTime = $sw->stop() / 10;

    // One line per $numItems value: numItems;orderByRandTime;altMethodTime
    fputs($fp, $numItems.';'.((string)$orderRandTime).';'.((string)$altMethodTime)."\n");
}
fclose($fp);
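Each pass through the outer loop appends one semicolon-separated line to times.csv: the value of $numItems, the average ORDER BY RAND() time, and the average time for the alternative method.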
As you can see, I start by taking 10 random items from the database, first using ORDER BY RAND() and then using the method Stephen described.
I do that 10 times in a row and take the average over those 10 runs to even out outliers (I should have used more runs, say 100, but that would take too long, and 10 already gives a reasonable estimate).
After taking 10 random items, I move on to 20, 30, 40, … 300.
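In SQL terms, one batch of 10 items boils down to either

SELECT itemId FROM items WHERE feedId > 34 ORDER BY RAND() LIMIT 10;

or

SELECT COUNT(*) FROM items WHERE feedId > 34;
SELECT itemId FROM items LIMIT 1 OFFSET <random offset>;  -- repeated 10 times

where each random offset is drawn between 0 and the count minus one.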
Attached to this post are the CSV file (with .txt extension because vB won’t let me upload .csv files) of the results, and a JPG of the graph the results are plotted in. The horizontal axis indicates the number of random items to take from the database ($numItems in the test script) and the vertical axis indicates the number of seconds it took (on average over 10 runs) to obtain that data.
As one can see from the graph, the ORDER BY RAND() method appears to have O(1) time complexity with respect to the number of items requested, whereas the method Stephen described appears to have O(n) time complexity.
When $numItems is smaller than ~75, the method Stephen describes is faster, but when I need more than 75 random items it loses to the ORDER BY RAND() method.
Of course, the actual numbers of this test are highly dependent on the data at hand, but the trend of O(1) vs O(n) time complexity should hold regardless of the data.
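A rough way to see why: per batch, ORDER BY RAND() is a single query that has to read and order all 101065 matching rows no matter how many of them I keep, so its cost barely changes with $numItems. The alternative method issues 1 + $numItems queries per batch (the COUNT plus one LIMIT 1 OFFSET query per item), and as far as I know MySQL cannot jump straight to an offset; it reads and discards the skipped rows, on average tens of thousands of them per query here. At $numItems = 300 that is 301 round trips per batch versus one, which matches the roughly linear growth in the graph.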