Speedy: ORDER BY RAND() operation

Sorry, but I’m going to write that one off as a Bad Idea ™.

It’s based on the assumption that if you want you to get x percent of the rows in the database you should use WHERE RAND() < x/100
That in turn is based on the assumption that RAND() in MySQL has a perfectly uniform distribution.
Quite an assumption to make, and it doesn’t hold.

Back to my table with 141687 rows. Say I want 100 rows from that. From the info you linked to I should use 100/141687 ~= 0.000706. So the query is:


SELECT itemId FROM items WHERE RAND()<=0.000706;

(leaving out the WHERE feedId>34 for simplicity)

If I run that 1000 times, these are the numbers of rows I get back (n: m meaning n rows returned m times out of 1000):


73: 2
75: 1
77: 4
78: 5
79: 3
81: 6
82: 10
83: 8
84: 9
85: 12
86: 17
87: 13
88: 20
89: 20
90: 24
91: 28
92: 23
93: 40
94: 30
95: 40
96: 45
97: 40
98: 40
99: 32
100: 47
101: 39
102: 35
103: 37
104: 39
105: 30
106: 26
107: 29
108: 32
109: 29
110: 24
111: 20
112: 16
113: 11
114: 9
115: 20
116: 13
117: 13
118: 9
119: 5
120: 7
121: 7
122: 9
123: 2
124: 4
125: 5
126: 3
127: 1
128: 1
129: 1
130: 1
131: 1
132: 1
135: 1
139: 1

Which looks like the starting of a bell curve with mju=100, which is no surprise at all.

My point being that you don’t necessarily get 100 rows back with this 0.000706. The author of the method also found that and said “we can increase the fragment number a bit and limit the query”, but that’s just changing the facts because the theory doesn’t fit, and still doesn’t guarantee 100 rows. Just because the chance of less than 100 rows decreases doesn’t mean it won’t happen.

All that, plus I need to know many rows are in the database before I can even start this procedure, which is an extra query (SELECT COUNT(*) FROM daTable).

WHERE RAND()<=0.000706;

This either repeats data, or misses records. Every time I tried this without LIMIT, it returns different number of results.

Further, really needs to calculate the value upto which we can compare.

That one was specific for the number of results I wanted from my database. It’s not some generic number you can just plug in anywhere.

No it doesn’t

That’s the whole point of ORDER BY RAND() isn’t it? :wink:

Yes, that’s the whole point of my previous post. Did you read that post or just scan it?