Unique Index and its Rationale
This article is inspired by Craig's recent post "How to Use Unique Indexes in MySQL and Other Databases".
There are a few more tightly related and equally important aspects regarding Primary Key (PK) and Unique Index (UI) which were not covered in that article.
So in this article, we will take a further look at these areas.
Auto Increment as PK
A common practice for database designers is to use an auto-increment as the PK for a particular table.
As the definition tells us:
The PRIMARY KEY constraint uniquely identifies each record in a database table.
To be exact, the PK is there to tell the database that the record is unique.
Using auto-increment as the PK is an easy but lazy way to accomplish this. A record with
id=1 is different from a record with
id=2. That's it.
The most critical thing to bear in mind is that in most cases, an auto-increment PK field is NOT part of the record itself. So how can we rely on the auto-increment PK itself, which has nothing related to the record content to identify a record?
Take a typical record to hold a sales item for example. It may include the following fields:
- id (auto increment)
- invoice number
- client name (or id)
- date of entry
- item name (or id)
- unit price
- subtotal (a calculated field)
A "natural" structure will lead us to use id as the auto-increment field and thus as the PK. I am not convinced at all that this will suffice.
Consider the following two entries:
1 12345 GoodGuy 13-12-12 item1 2 199 398 2 12346 GoodGuy 13-12-12 item1 2 199 398
Are they unique or just duplicated? I would rather treat them as duplicated entries where the person who keys in the entries accidentally typed the invoice number wrong for at least one of the entries.
Furthermore, in this example, we must argue that even the
invoice number wouldn't be a good candidate to act as a PK or
We will have to use a compound unique index, i.e., a unique index comprised of several fields to assure us of the uniqueness of a record.
Compound unique index
Let's step back from the database structure first and look at the physical real world entity: a sales item. Based on our experience and discussion with the sales people, how can we distinguish two sales entries?
The general logic could be:
- If purchased items are different, definitely they are different sales items;
- If purchased items are the same, but with different client, date, quantity, they are different.
Depending on the situation, more approaches could be possible.
Now it is clearer: if two sales entries have the same item, client, date, quantity, it is very likely they are duplicated entries. So in this database structure we will set up a compound unique index based on the fields listed above:
ALTER TABLE `test`.`salesitem` ADD UNIQUE INDEX `uniqueentry` (`itemid` ASC, `clientid` ASC, `date` ASC, `quantity` ASC);
With this uniqueness constraint applied to our table, we will be able to pick out the above two entries as duplicated.
NOTE: Keep in mind that the duplication detected by the
uniqueentry index is very likely. In real cases, they may not be duplicated after all (say the
GoodGuy just placed a repeated order at the same day with the exactly same items). If that is the case, the database designer must work with the sales person and further discuss what other fields should be introduced in the table structure and the unique index composition to distinguish two entries.
NOTE: To avoid a duplicated entry in invoice numbers, it is also highly recommended to create another unique index based on the
invoice number field.
In combination of the above two unique index constraints, we have avoided such human errors as:
- A typo in invoice number that causes a duplicated invoice number;
- A typo in invoice number that makes a duplicated entry in the table with the same item, client, date and quantity;
We are not able to eliminate all human errors by our design. It is in the realm of business process design and relying on supporting hardware (for example, a scanner to scan the invoice number to avoid manual input) to enhance the accuracy.
Rules of thumb:
- In any real world application, never ever rely on the auto-increment PK to determine the uniqueness of a record;
- Don't rely on a single man-made field (in this case, the invoice number) to check the uniqueness.
- The fields in a compound unique index should be those physically existing and generated by business activities. In our example above, date, item (which is indirectly referred to as a foreign key constraint to our
producttable), quantity and client (which is indirectly referred to as a foreign key constraint to our
clienttable) all have these characteristics.
Further note on 3 above, that is why I don't like to use UUID as PK or UI. Besides the complexity to calculate a UUID, the UUID itself has no real-world corresponding attribute for a record.
With these principles in mind, our table structure design will be more robust.
Justification of the existence of an auto-increment PK
After saying all this and setting up the two unique indexes, we may still use an auto-increment field as PK. Why?
The answer is simple: to facilitate the data location and make it simpler and faster.
In our example, to locate a single record, we can either use
select * from sales_entry where id=1 or use
select * from sales_entry where invoice='...' or use
select * from sales_entry where itemid=... and clientid=... and date='...' and quantity=.... Obviously, locating through an integer will be the fastest to execute and the easiest to write.
Also, an auto-increment PK will help other tables in the database to have a simpler Foreign Key setup to enforce data integrity. It is still recommended to set up an auto-increment field as the PK but it is strongly NOT recommended to rely on the PK to determine the uniqueness of a record.
The order of fields in a compound (unique) index
The design of a database and its tables is an art. Especially so when we are designing the indexes of each table.
After the above setup, there are already 3 indexes in our table:
- One PK using the auto-increment
- One UI using the
- One UI using the combination of several fields.
Let's look again at the SQL that creates the 3rd index:
ADD UNIQUE INDEX `uniqueentry` (`itemid` ASC, `clientid` ASC, `date` ASC, `quantity` ASC);
and consider the following SQL statement and its corresponding
EXPLAIN output excerpt:
select * from salesitem where clientid=1
The table does not have an index on clientid yet, so the above SQL will be unable to use any indexes (
possible_keys = null) to speed up the query and has to do a full table scan to fetch the record (
Extra = Using Where).
select * from salesitem where itemid=1 and clientid=1 and date='13-12-14'
where statement contains several filters and they ARE in the strict order that we created our
uniqueentry index. It takes the full advantage of our UI and we can expect the best speed out of it.
select * from salesitem where itemid=1 and clientid=1 and quantity=1
where statement contains several filters and they ARE NOT in the strict order that we created our
uniqueentry index. MySQL still tries hard to use the UI and we can expect a moderate speed.
select * from salesitem where itemid=1 and date='13-12-14' and quantity=1 and clientid=1
This is bad. The 4 fields in the UI are in the SQL but in a totally distorted order. It will also invoke a full table scan to fetch the record and most unlucky, there is no record found after we seek through the full table!
EXPLAIN is a very useful command in MySQL and MariaDB. To use it, simply issue
EXPLAIN *your sql statment to run* in MySQL terminal or in your MySQL GUI applications like PhpMyAdmin.
When designing a compound index, be cautious about the order of the fields in the index.
uniqueentry index is created with 4 fields in the strict order: item, client, date, quantity.
In our SQL statement, if our
where clause matches the full or partial (like the 2nd and 3rd SQL) order of a compound index fields order, MySQL will be smart enough to match the longest field chain in the compound index to speed up the search. Otherwise, MySQL won't be able to find a matching index and will have to do a full table scan, which is always discouraged in any database development.
uniqueentry UI's fields order may not be useful if our program will most often query on a client's purchase history (like in our 1st SQL statement). If that is the case, we need either change the order of fields in
uniqueentry UI, or create a new (compound) index with
clientid as the only (or the first) field in that index.
Too many indexes will slow down the database performance as the database engine, upon every create, update and delete operation, will have to do some extra work to update the indexes. In a multi-user, high-frequency-write environment, this can cause significant performance issues.
Thus, there must be a balance: less table/record lock time, faster query/aggregation time, reasonable number of indexes for a particular table. There are no set rules on how to achieve this – everything can be accomplished with experimentation and knowing the purpose and usage of the app in great detail.
In this article, we further elaborated on some basic and key concepts in designing a table and its indexes. In particular, we focused a lot on how to determine the uniqueness of a record by formulating a meaningful compound unique index. We also explained a bit on how a well designed compound index can be used by MySQL to speed up the query.
In future articles, we will look at another important aspect of the database: Stored Procedures and their usage in PHP.