the big, bad thread of "MySQL Best Practices and Other Useful Information"
in this thread will be posted various Best Practices and Other Useful Infomation, gleaned from the experienced members lurking around the forums and good other external resources.
please observe the following when posting in this thread:
- make sure to create a good title when replying to this thread. see my first couple posts for examples.
- this is not the place to ask questions about a problem you're having. if you need help with something but want to reference a topic in this thread, please right-click the post number in the upper-right of the post and pick copy link. then start a new thread (not a reply) and paste that link in.
- if you disagree with a point, go ahead and post in this thread, but include some really good evidence.
- give credit where credit is due. if you get your information from an external source, make sure to reference your source.
- if you post sample code, make sure it's not vulnerable to SQL injection or other problems, or state the that code is not inherintly secure and needs additional consideration.
- if your post is version specific, include that information in your post.
let the knowledge feast begin!
mysql "features" to avoid, part 1: hidden group by fields
mysql is not very strict in certain situations. this leads to problems: other SQL servers are often more strict and won't accept mysql's syntax, the feature becomes unsupported when mysql releases newer versions (good SQL compliance is a goal of mysql), or the results returned are unexpected. this is the first in a series detailing which of these features to avoid.
standard SQL requires that all non-aggregate fields in a select statement with a group by clause appear in that group by clause. mysql does not have this requirement. this "feature" is supposed to be used in situations where the "hidden" fields do not change. for example:
this query will return a list of users and the number of posts they've made. notice that one of the non-aggregate fields (u.username) does not appear in the group by clause.
from users u
join posts p
on p.user_id = u.user_id
according to the mysql manual, u.username does not need to be in the group by clause because that value will not change. however, in situations where you DO need to include a field in the group by clause, mysql doesn't warn you that your query will return indeterminate results. here's a common mistake:
the user who wrote this query mistakenly assumes that fp.title will contain the title of the latest post from that forum. it will not. according to the mysql manual, "The server is free to return any value from the group, so the results are indeterminate unless all values are the same".
from forum f
join forumpost fp
on fp.forum_id = f.forum_id
mysql "features" to avoid, part 2: non-SQL standard quoting
here are the quotes i will be refering to in this post:
' single quote, a.k.a. apostrophe
" double quote
` backtick (on the same key as ~ on US keyboards)
let's start by discussing ANSI SQL standard quote usage. in ANSI SQL, single quotes (and only single quotes) are used to signify a string. if a string contains a single quote, double the quote. for purposes of correctly processing a query, this is the ONLY escaping that needs to be done and is completely supported by mysql. here's an example:
note that the last names contain quotes and therefore have the single quotes doubled. when mysql stores the data, the extra single quotes will be removed. therefore, there's no need to try and remove doubled single quotes when retrieving data.
( username, first, last, active )
( 'brian', 'Brian', 'O''Reilly', 0 )
, ( 'billy', 'Billy', 'O''Toole', 1 )
best practice: escape quotes only by doubling the single quotes. since it's ANSI SQL standard, it work with most other database servers. even though mysql supports escaping by using backslashes, this is not always portable to other servers. since doubling the single quotes is ANSI SQL standard, it's very likely to be portable.
however, if you adopt a database abstraction layer for database connectivity and it uses slashes for escaping, make sure the only function from the mysql client library it uses to insert those slashes is mysql_real_escape_string() as it is the only function guaranteed to be appropriate for mysql. (by extension, yes, PHP's addslashes() and magic_quotes are NOT appropriate)
notice also that the numbers were not in quotes. ANSI SQL does not allow numbers to be in quotes.
best practice: for maximum portability, do not put numbers in quotes.
next up is the double quote. in ANSI SQL, double quotes are used to quote identifiers (database, table and column names). use them in cases where the names of your objects contain spaces, operators (. , + / etc.) or are reserved words. here's an example:
unfortunately, by default mysql does not allow double quotes to be used for identifiers. by default, double quotes in mysql can also be used to identify strings, just like single quotes. this can be changed by setting the SQL mode, but it is unusual in mysql installations to see this. mysql instead uses backticks to quote identifiers. unfortunately, this backtick syntax is not widely supported by other servers.
, "product name"
where size = 'XL'
best practice: use names that do not requires quoting. this is the only thing guaranteed to be supported by servers from different vendors since it's ANSI SQL compatible. staying away from reserved words and limiting yourself to letters, numbers and underscores is a good way to stay safe.
bonus tip: DO NOT use microsoft word for ANYTHING relating to SQL development. microsoft word has this nasty habit of converting single and double quotes to ‘ ’ and “ ”, which are not recognized by SQL, mysql or otherwise. do yourself a favor and get a real code editor.
why does count(*) take so long?
this query can sometimes take longer than you would suspect to complete:
usually, a complaint of a slow count(*) is asked in the context of an innodb table, so i will start with an explanation of how myisam tables handle this query. (:
with myisam tables, the number of rows in the table is kept in a special counter in the table header. a request for count(*) on a single myisam table with no where clause returns that value instantly without consulting any data or indexes.
with innodb tables, there is no row counter. and even if there was, it would not always be accurate because of transactions. that only leaves one option: actually count all the rows.
the worst case scenario is when the transaction isolation mode is set to repeatable read or higher. in these modes, every row that you "view" has to be duplicated so that if you go back and read it again the same result is returned. hence the mode name "repeatable read". when you ask for count(*), you've asked to examine every row in the table. innodb has to duplicate the entire table so when you ask for one of those rows again or another count(*), you get the same value.
note: innodb doesn't actually create a copy of all of the rows in all cases, but that is the worst case scenario. this process is called multi-versioning and innodb's implementation is discussed in the innodb section of the mysql manual. the performance implications are effectively the same when compared to myisam's behavior: multiple orders of magnitude difference in execution time.
query optimization: benchmarking functions
this is the first in a series of posts on query optimization. there are loads of good articles and tutorials on the internet about using EXPLAIN to optimize queries, so i'll skip that particular tool and concentrate on lesser-known tools.
questions on the sitepoint forums often yield multiple answers, each with a different approach. some are obviously better than others, but others are more difficult to compare. using EXPLAIN is a good way to make sure indexes are being used, but it shows nothing about the efficiency of the constituent statements within a query. this is where the valuable tool called benchmark() comes in.
for example, compare the following two queries:
since both statements are very similar and return the same result set, you would think that the performance would be the same. let's test that theory using the benchmark() function:
select case when 'foo'='foo' then 1 else 0 end;
select case 'foo' when 'foo' then 1 else 0 end;
as you can see, the second method is slightly faster. the difference isn't much, but in a CPU bound query with a couple of case statements and thousands or rows, the difference can add up.
mysql> select benchmark(10000000, case when 'foo'='foo' then 1 else 0 end);
| benchmark(10000000, case when 'foo'='foo' then 1 else 0 end) |
| 0 |
1 row in set (1.11 sec)
mysql> select benchmark(10000000, case 'foo' when 'foo' then 1 else 0 end);
| benchmark(10000000, case 'foo' when 'foo' then 1 else 0 end) |
| 0 |
1 row in set (0.95 sec)
here are some other optimization lessons that i have learned:
when should i use indexes?
here are some common cases where indexes should be used:
columns that are part of joins should be indexed in both tables. for example:
the columns tablea.bar and tableb.tablea_bar should both have indexes. this can improve query execution speed from minutes down to seconds.
from tablea a
join tableb b
on a.bar = b.tablea_bar
columns that are used in where clauses should generally be indexed. for example:
the column tablea.bar should be indexed. additionally, you could create a compound index on (bar, foo) to really speed things up. with that index, mysql will not have to pull any data records off the disk to find the corresponding value of foo. most likely, it will have already been brought in to memory when mysql pulled the index off disk. this is how to create a compound index:
where bar = 5
a compound index will also help queries with multiple items in the where clause:
alter table tablea
add index (bar, foo)
this query could benefit from the index created above. note that mysql will only sometimes recognize that reordering elements in the where clause will result in an index being able to be used. rule of thumb: try order your where clause so that the elements match the order of an compound indexes you have.
where bar = 5
and foo = 11
if bar is indexed in the above query, mysql can quickly consult the index and return the answer without ever having to read data rows.
in this case, creating independent indexes on foo and bar would be somewhat helpful, but a compound index on (foo, bar) would be most helpful. to understand why, you need to know that: a) mysql processes the WHERE clause before the GROUP BY clause, b) mysql will only use one index per table, and c) compound indexes can only be used in the order they are created. so create your compound indexes so that mysql can take advantage of more than one part.
where foo = 11
in all cases...
make sure that any columns that you have indexes for appear on the left side of any equality operators by themselves. for example, in the following query, an index can not be used:
because some_date_column is being processed by the year function, mysql can not consult an index. rewriting your query like this will allow an index to be used:
where year(some_date_column) = 2004
where some_date_column between '2004-01-01' and '2004-12-31'