The big, bad thread of "MySQL Best Practices and Other Useful Information"

in this thread will be posted various Best Practices and Other Useful Infomation, gleaned from the experienced members lurking around the forums and good other external resources.

please observe the following when posting in this thread:

[list][]make sure to create a good title when replying to this thread. see my first couple posts for examples.
[
]this is not the place to ask questions about a problem you’re having. if you need help with something but want to reference a topic in this thread, please right-click the post number in the upper-right of the post and pick copy link. then start a new thread (not a reply) and paste that link in.
[]if you disagree with a point, go ahead and post in this thread, but include some really good evidence.
[
]give credit where credit is due. if you get your information from an external source, make sure to reference your source.
[]if you post sample code, make sure it’s not vulnerable to SQL injection or other problems, or state the that code is not inherintly secure and needs additional consideration.
[
]if your post is version specific, include that information in your post.[/list]
let the knowledge feast begin!

mysql is not very strict in certain situations. this leads to problems: other SQL servers are often more strict and won’t accept mysql’s syntax, the feature becomes unsupported when mysql releases newer versions (good SQL compliance is a goal of mysql), or the results returned are unexpected. this is the first in a series detailing which of these features to avoid.

standard SQL requires that all non-aggregate fields in a select statement with a group by clause appear in that group by clause. mysql does not have this requirement. this “feature” is supposed to be used in situations where the “hidden” fields do not change. for example:

select u.user_id
     , u.username
     , count(p.id)
  from users u
left outer
  join posts p
    on p.user_id = u.user_id
group
    by u.user_id

this query will return a list of users and the number of posts they’ve made. notice that one of the non-aggregate fields (u.username) does not appear in the group by clause.

according to the mysql manual, u.username does not need to be in the group by clause because that value will not change. however, in situations where you DO need to include a field in the group by clause, mysql doesn’t warn you that your query will return indeterminate results. here’s a common mistake:

select f.name
     , max(dateposted)
     , fp.title
  from forum f
  join forumpost fp
    on fp.forum_id = f.forum_id
group
    by f.name

the user who wrote this query mistakenly assumes that fp.title will contain the title of the latest post from that forum. it will not. according to the mysql manual, “The server is free to return any value from the group, so the results are indeterminate unless all values are the same”.

references:
http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html

some useful information: features that were added and removed in the different versions of mysql. this is not an exhaustive list, but instead important things you need to keep in mind when developing for a specific platform.

mysql 3.23 (retired)

  • the first version where innodb tables were included and turned on by default
  • partial support for inner joins
  • later versions added partial support for outer joins
  • no union support

mysql 4.0 (end-of-life)

  • full support for union
  • full support for inner and outer joins
  • upgraded myisam tables to replace isam tables
  • query cache

mysql 4.1 (end-of-life)

  • added support for per-table character sets and collations (but you absolutely need to use a later release in this series because early versions had some evil problems with multibyte character set support in certain functions)
  • better password hashing
  • subqueries (early versions in the 4.1 series did not allow some clauses inside subqueries, like LIMIT. this restriction was eventually lifted.)
  • spatial extensions
  • clustering using the ndbcluster in-memory tables

mysql 5.0 (current stable release)

  • added support for per-column character sets and collations
  • stored procedures
  • triggers (in early versions of the mysql 5.0 and 5.1 series, triggers could not refer to any data other than the row that caused the trigger or run non-deterministic procedures. in later releases, this restriction was lifted and triggers can run just about any query from any table, and just about any procedure.)
  • views
  • HUGELY improved replication
  • INFORMATION_SCHEMA database added (allows access to database objects via standard SELECT statements)

mysql 5.1 (beta)

  • horizontal partitioning
  • event scheduling (like a built-in cron)
  • XML data manipulation in VARCHAR and TEXT columns
  • disk based clustering

if i missed anything significant, please let me know via PM and i’ll add it to the list.

here are the quotes i will be refering to in this post:
’ single quote, a.k.a. apostrophe
" double quote
` backtick (on the same key as ~ on US keyboards)

let’s start by discussing ANSI SQL standard quote usage. in ANSI SQL, single quotes (and only single quotes) are used to signify a string. if a string contains a single quote, double the quote. for purposes of correctly processing a query, this is the ONLY escaping that needs to be done and is completely supported by mysql. here’s an example:

insert
  into users
     ( username, first, last, active )
       values
     ( 'brian', 'Brian', 'O''Reilly', 0 )
   , ( 'billy', 'Billy', 'O''Toole', 1 )

note that the last names contain quotes and therefore have the single quotes doubled. when mysql stores the data, the extra single quotes will be removed. therefore, there’s no need to try and remove doubled single quotes when retrieving data.

best practice: escape quotes only by doubling the single quotes. since it’s ANSI SQL standard, it work with most other database servers. even though mysql supports escaping by using backslashes, this is not always portable to other servers. since doubling the single quotes is ANSI SQL standard, it’s very likely to be portable.

however, if you adopt a database abstraction layer for database connectivity and it uses slashes for escaping, make sure the only function from the mysql client library it uses to insert those slashes is mysql_real_escape_string() as it is the only function guaranteed to be appropriate for mysql. (by extension, yes, PHP’s addslashes() and magic_quotes are NOT appropriate)

notice also that the numbers were not in quotes. ANSI SQL does not allow numbers to be in quotes.

best practice: for maximum portability, do not put numbers in quotes.

next up is the double quote. in ANSI SQL, double quotes are used to quote identifiers (database, table and column names). use them in cases where the names of your objects contain spaces, operators (. , + / etc.) or are reserved words. here’s an example:

select id
     , "product name"
     , "count"
  from inventory
 where size = 'XL'

unfortunately, by default mysql does not allow double quotes to be used for identifiers. by default, double quotes in mysql can also be used to identify strings, just like single quotes. this can be changed by setting the SQL mode, but it is unusual in mysql installations to see this. mysql instead uses backticks to quote identifiers. unfortunately, this backtick syntax is not widely supported by other servers.

best practice: use names that do not requires quoting. this is the only thing guaranteed to be supported by servers from different vendors since it’s ANSI SQL compatible. staying away from reserved words and limiting yourself to letters, numbers and underscores is a good way to stay safe.

bonus tip: DO NOT use microsoft word for ANYTHING relating to SQL development. microsoft word has this nasty habit of converting single and double quotes to ‘ ’ and “ ”, which are not recognized by SQL, mysql or otherwise. do yourself a favor and get a real code editor.

references:
http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html
http://dev.mysql.com/doc/refman/5.0/en/mysql-real-escape-string.html
http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html

this query can sometimes take longer than you would suspect to complete:

select count(*)
  from sometable

usually, a complaint of a slow count(*) is asked in the context of an innodb table, so i will start with an explanation of how myisam tables handle this query. (:

with myisam tables, the number of rows in the table is kept in a special counter in the table header. a request for count(*) on a single myisam table with no where clause returns that value instantly without consulting any data or indexes.

with innodb tables, there is no row counter. and even if there was, it would not always be accurate because of transactions. that only leaves one option: actually count all the rows.

the worst case scenario is when the transaction isolation mode is set to repeatable read or higher. in these modes, every row that you “view” has to be duplicated so that if you go back and read it again the same result is returned. hence the mode name “repeatable read”. when you ask for count(), you’ve asked to examine every row in the table. innodb has to duplicate the entire table so when you ask for one of those rows again or another count(), you get the same value.

note: innodb doesn’t actually create a copy of all of the rows in all cases, but that is the worst case scenario. this process is called multi-versioning and innodb’s implementation is discussed in the innodb section of the mysql manual. the performance implications are effectively the same when compared to myisam’s behavior: multiple orders of magnitude difference in execution time.

this is the first in a series of posts on query optimization. there are loads of good articles and tutorials on the internet about using EXPLAIN to optimize queries, so i’ll skip that particular tool and concentrate on lesser-known tools.

questions on the sitepoint forums often yield multiple answers, each with a different approach. some are obviously better than others, but others are more difficult to compare. using EXPLAIN is a good way to make sure indexes are being used, but it shows nothing about the efficiency of the constituent statements within a query. this is where the valuable tool called benchmark() comes in.

for example, compare the following two queries:

select case when 'foo'='foo' then 1 else 0 end;
select case 'foo' when 'foo' then 1 else 0 end;

since both statements are very similar and return the same result set, you would think that the performance would be the same. let’s test that theory using the benchmark() function:

mysql> select benchmark(10000000, case when 'foo'='foo' then 1 else 0 end);
+--------------------------------------------------------------+
| benchmark(10000000, case when 'foo'='foo' then 1 else 0 end) |
+--------------------------------------------------------------+
|                                                            0 |
+--------------------------------------------------------------+
1 row in set (1.11 sec)

mysql> select benchmark(10000000, case 'foo' when 'foo' then 1 else 0 end);
+--------------------------------------------------------------+
| benchmark(10000000, case 'foo' when 'foo' then 1 else 0 end) |
+--------------------------------------------------------------+
|                                                            0 |
+--------------------------------------------------------------+
1 row in set (0.95 sec)

as you can see, the second method is slightly faster. the difference isn’t much, but in a CPU bound query with a couple of case statements and thousands or rows, the difference can add up.

here are some other optimization lessons that i have learned:

[list][]comparing long strings takes longer than short strings, so if you’re matching 1000+ characters, computing the md5() hash would probably be faster. see http://www.sitepoint.com/forums/showthread.php?t=342006
[
]if you have to choose between find_in_set(), field(), or an elaborate case statement, use find_in_set(). see http://www.sitepoint.com/forums/showthread.php?t=406656[/list]

comparing long strings takes longer than short strings, so if you’re matching 1000+ characters, computing the md5() hash would probably be faster

Can I just clarify that you mean cache the md5 hashes and compare them, not simply replace string comparison with md5 calc + comparison?

It took me a couple of reads through to figure that one out.

here are some common cases where indexes should be used:

join clauses
columns that are part of joins should be indexed in both tables. for example:

select a.foo
     , b.foo
  from tablea a
  join tableb b
    on a.bar = b.tablea_bar

the columns tablea.bar and tableb.tablea_bar should both have indexes. this can improve query execution speed from minutes down to seconds.

where clauses
columns that are used in where clauses should generally be indexed. for example:

select foo
  from tablea
 where bar = 5

the column tablea.bar should be indexed. additionally, you could create a compound index on (bar, foo) to really speed things up. with that index, mysql will not have to pull any data records off the disk to find the corresponding value of foo. most likely, it will have already been brought in to memory when mysql pulled the index off disk. this is how to create a compound index:

alter table tablea
add index (bar, foo)

a compound index will also help queries with multiple items in the where clause:

select wee
     , ping
  from tablea
 where bar = 5
   and foo = 11

this query could benefit from the index created above. note that mysql will only sometimes recognize that reordering elements in the where clause will result in an index being able to be used. rule of thumb: try order your where clause so that the elements match the order of an compound indexes you have.

group by

select bar
     , count(*)
  from tablea
group
     by bar

if bar is indexed in the above query, mysql can quickly consult the index and return the answer without ever having to read data rows.

select bar
     , count(*)
  from tablea
 where foo = 11
group
    by bar

in this case, creating independent indexes on foo and bar would be somewhat helpful, but a compound index on (foo, bar) would be most helpful. to understand why, you need to know that: a) mysql processes the WHERE clause before the GROUP BY clause, b) mysql will only use one index per table, and c) compound indexes can only be used in the order they are created. so create your compound indexes so that mysql can take advantage of more than one part.

in all cases…
make sure that any columns that you have indexes for appear on the left side of any equality operators by themselves. for example, in the following query, an index can not be used:

select foo
  from tablea
 where year(some_date_column) = 2004

because some_date_column is being processed by the year function, mysql can not consult an index. rewriting your query like this will allow an index to be used:

select foo
  from tablea
 where some_date_column between '2004-01-01' and '2004-12-31'

Common Queries
Extension to chapter 9 of the book “Get It Done With MySQL 5&6” by Peter Brawley and Arthur Fuller

From their site:

Common queries
Our collection of common MySQL queries soon outgrew Chapter 9. It runs now to more than 120 printed pages, and it’s still growing, so we turned it into a PHP page driven by a pair of MySQL tables implementing an edge list tree model (for how to use edge list and other tree models in MySQL, see Chapter 20 of the book). Click here for the Common Queries page, and here for a panelled tree view of the same material.

Don’t use != for “not equal to” as it is specific to MySQL, instead use <>

I don’t know if I’d go as far to say it’s specific to MySQL, there are lots of different databases. But it seems there are at least some that don’t recognize it.

So if you know your code will be used only with MySQL it doesn’t much matter. If you’re concerned about developing a good habit so you won’t have problems while working with a different database, then using <> is the wise choice.

Determining what version of MySQL you have

There are times when knowing what version of MySQL you are using is helpful. eg. When posting a thread asking for help :wink:

One way is to use this query

SHOW  VARIABLES  LIKE  "%version%"

If your data is really sensitive, I suggest to use the following additional flags:
added_on, modified_on, sink_weight, is_active as:

`added_on` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Added on',
`modified_on` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Last modified on',
`sink_weight` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Sorting weight',
`is_active` enum('N','Y') NOT NULL DEFAULT 'N' COMMENT 'Active Record?',

Here are my feelings:
Added On (added_on)
This defaults to UNIX_TIMESTAMP(CURRENT_TIMESTAMP()) when you first make your INSERT query.
This means, you log the time when this entry was made to the database. Later, you can reuse
this column to calculate the age of the record, sort chronologically, or any other purpose.
But my aim is to secretly log the first entry time of the record.

Modified On (modified_on)
Record when your data was last modified on. Use this during your UPDATE query.
Set to the UNIX_TIMESTAMP(CURRENT_TIMESTAMP()) at the moment.
Also, while deleting a record (see below).
This might be useful in identifying which record has the fresh information.
As seen in PHPBB forum updates (if you reply to the old posts, it appears on the top again).

Sinking weight (sink_weight)
The buoyancy: heavier objects sink deep (appear on the bottom of the list) and
the reverse. Less the number, the record appears on the top. By this way, you can create
an admin interface to allow a user control the sinking amount. Or, just symbolically the
move up and move down arrows. Move up means decreate the sink weight value that the one
just above it currently. Move down means increase the sink weight just one more than the one
in the bottom of current record.

Is Active (is_active)
Secretly administer your data! Never DELETE any data physically. But rather mark them as
deleted. If Y, it means, you CANNOT show the data anywhere else. If N, use the data a general
record. By this way, you can survive for a customer’s hearing: Oh, I mistakenly deleted my
precious data, help me! - Just unflag it to N again, manually.

Final points:
If you don’t like to use them, never mind; your database design might have been still robust.
But these flags are administrative flags only. These days, you do not have to worry about
the space that will be added to each record of all the tables. Even if you do not use them
currently, put them for the future reference and use. One day, you might come to hear a similar
design. So, why don’t make everything compatible to the future demands?

If you did not have these flags earlier, you can add them; without affecting the current
feature of your database. Except that, you did not write queries like:
SELECT * FROM …
INSERT INTO table (full list of columns, …), etc.

A sticky is not the place to start discussions, but… if you want to register the moment of insertion and/or modification, use DATETIME or DATE, not int(10).