Using Unique Identifiers as Primary Key in MySQL
5 min read
Unique identifiers are used to distinguish one object from another. They are commonly used to identify entities such as users, files, processes, network devices, and other objects. Unique identifiers are often implemented using a numerical or alphanumeric string that is assigned to an object when it is created and they have many types like UUID, ULID, Snowflake ID, GUID and more.
Using integers for tables’ primary key is the go-to approach when working with MySQL and other RDBMS. But, using unique identifiers instead of auto-incremented integers can have some benefits:
Unique identifiers have a low probability of collision even across tables and databases.
They can be generated by the application without a centralized authority.
And since they are pseudo-random, they can be sent safely to the client side.
But as with any tech or tool, some tradeoffs come with using a unique identifier.
UUID stands for Universally Unique Identifier, which is a standard for generating unique identifiers that are widely used in computer systems and software applications. UUID consists of two parts: a time-based component and a random component, and there are officially 5 versions of UUID, each with a different algorithm for generating the random and time-based components. The 128-bit value of a UUID is usually represented as a string of 36 characters, consisting of 32 hexadecimal digits separated by hyphens in the form of
8-4-4-4-12. For example, a UUID might look like this:
UUIDs are Pseudo Random
The negative impact of using UUID as the primary key comes from how MySQL’s default engine,
InnoDB, stores the data.
InnoDB by default creates a b-tree for the table’s primary key and stores the rows data in the leaf nodes of the same b-tree which is called a clustered index. In other words, when a table has a clustered index, the data is physically stored on disk in the same order as the index.
When a new row with a random UUID needs to be inserted,
InnoDB needs to:
Checks whether the page is already in the buffer pool
If the page is not in the buffer pool,
InnoDBreads the page from the disk.
Write the page to the buffer pool.
Insert the new row.
At some point, the page will be flushed back to the disk.
Normally when a page is frequently accessed and modified it becomes a hot page, and the buffer pool gives it a higher priority to stay in memory as long as possible. This can result in performance gains in terms of latency and throughput. But, when the primary key is random, it is unlikely for a page to become frequently accessed thus each insert will mostly require an extra disk I/O.
Additionally, inserting rows with random primary keys can cause more page splits. A page split occurs when an
InnoDB page becomes full and cannot accommodate the new data. thus it needs to be divided into two or more smaller pages. This operation is costly in terms of performance as it requires many disk I/O operations to read and write data.
Random primary keys can also lead to a page low filling factor. According to MySQL documentation:
When new records are inserted into an
InnoDBtries to leave 1/16 of the page free for future insertions and updates of the index records. If index records are inserted in a sequential order (ascending or descending), the resulting index pages are about 15/16 full. If records are inserted in a random order, the pages are from 1/2 to 15/16 full.
UUIDs have a size of 128 bits and their hexadecimal form (
8-4-4-4-12) can be stored in a
CHAR(36) column because each hexadecimal digit represents 4 bits so it will 32. However, since each character in a CHAR column in MySQL is represented by one byte of storage, 36 characters (32 for the digits and 4 for the hyphens) in a
CHAR(36) column will require 36 bytes of storage.
Having 36 bytes for the primary key does not seem to be a lot at first glance and to put it into perspective let's assume a table with a UUID primary key and 4 secondary indexes. Considering that each secondary index stores the primary key of its row, the UUID will be stored 5 times for each row. If the table has 1 million rows the storage required will be:
6 x 36 = 216 bytes of storage per row
For the 10M rows: 10,000,000 rows x 216 bytes/row = 2,160,000,000 bytes
To convert this to gigabytes, we divide by 100,000,000 (the number of bytes in a gigabyte): 2,160,000,000 bytes / 1,000,000,000 bytes/GB = 2.16 GB
ULID stands for Universally Unique Lexicographically Sortable Identifier and it tries to solve some of UUIDs limitations. ULIDs are 26 characters long and consist of two parts:
A 10-character timestamp, measured in milliseconds since Unix epoch time (January 1, 1970, 00:00:00 UTC). This provides the chronological ordering of ULIDs.
A 16-character random value, encoded using Crockford's base32 encoding. This value ensures that each ULID is unique even if generated at the same time.
Additionally, according to the ULID spec, it can be superior to UUIDs given that:
1.21 X 10^24possible values per millisecond.
Canonically encoded as a 26-character string, as opposed to the 36-character UUID
In conclusion, ULID can solve the performance issues resulting from the randomness of UUID and it is more efficient in terms of storage. So it would be a better alternative to integer primary key than UUIDs. It is also implemented in many programming language