MySQL is one of the most frequently used database engine used in online applications. When a programmer starts using SQL and uses just small tables, performance is not the priority. When used in real-life applications, databases start to grow fast. With a database consisting of millions of records and being hammered with multiple simultaneous connections, performance optimization becomes important. In this article, I’ll present some basic and more advanced techniques that I tried to optimize an event database.
I’ve written this article relying on my experience developed when developing the online party search engine Party Locator. This platform was powered by a mediocre web server using dual XEON hexacores, 16 GB of ram and running on a raid of hard drives. When I just started the project, only several hundreds of entries were in the events table.
The platform grew exponentially and after just five months, the platform stored more than one hundred thousand events in the database. This was the first stage when database optimization became necessary. Just one month later the crawler had indexed over half a million events, the second time optimization was desired. Using the techniques described below, I managed to reduce the average response time from 5.24 seconds to just 1.02 seconds (including abnormal peaks). The practical response time for a more complex query is now just 0.78 seconds, a performance boost of roughly 413%. Just in time, one year after launch, the Party Locator contained about 2.3 million events.
Not every tip I’ll give you will solve all your problems and will magically let your database double its performance. Optimization is all about testing and benchmarking. I have failed hundreds of times when trying to improve overall performance. But after testing multiple days, the solution finally had the desired impact. And at last, I’m not a professional of course.
Try to replace sub queries with JOINS as sub queries do not use indices and are evaluated for every row. Use JOIN-structures instead. A subquery, though more reader-friendly, can easily be rewritten to a regular query using JOINS. This was the first optimization I tried and it had an immediate effect. Almost cutting the response time in half, this was one of the most influential techniques I used. E.g.:
WHERE t1.b NOT IN (
The example above can be rewritten to:
FROM t1 LEFT JOIN t2 ON t1.b = t2.b
WHERE t2.b IS NULL
Every accessed field in the WHERE-clause should have an index. But keep in mind to not index every column as MySQL will not use more than two indices when it evaluates a query. Using many indices will also have an impact on INSERT and UPDATE statements as indices (most of the time BTREES) need to be rebuilt or updated. When writing queries are not executed very often, using multiple indices is the way to go.
Note that it is possible to define indices for multiple fields. The order of these fields has a role as well.
Try not to cast were not necessary. For example, when comparing strings, converting a string to lowercase (or uppercase) is a bottleneck as it is to be executed for every row. String comparisons are expensive. Try using another character set or other operands.
Another form of casting can be in date functions, e.g. when you compare dates but have a stored timestamp, you’ll use the cast DATE(timestamp), but this function is also executed for every row. If you use more date or time functions, try using DATETIME format instead of timestamp.
Field lengths have an impact on string comparisons. Using varchar with a specified maximum length instead of a text block (e.g. tinytext), the comparison will run much faster.
Do not use larger types than necessary. When you know that names will never be longer than 100 characters, don’t allocate more space for the field. Note that it is good practice to specify maximum lengths and let your application check these requirements before sending a query to the database.
In the SELECT clause, almost never use the * selector. This selector selects all columns from the result. Only specify the columns that you need. If you select columns that are not needed, useless data will be retrieved from disk and sent to the application which is of course slower.
MySQL has a build-in optimize command which works pretty well after many insertions or updates. Give it a try! The command is:
OPTIMIZE TABLE tablename
Another option is to defragment the table. Assuming you are using the InnoDB engine:
ALTER TABLE tablename ENGINE = InnoDB
Order of operations
The general rule is that expressions that evaluate to true for most expressions are placed last, expressions that accept the least rows are placed first. e.g.
1 = 1
a LIKE 'something'
Note that the first expression ‘1 = 1’ will be evaluated for every row in the table (smart DBMS will ignore this row though) and then the field ‘a’ will be compared to the string ‘something’.
When this query is restructured to:
a LIKE 'something'
1 = 1
At first, the field ‘a’ will be compared with the string ‘something’. Not all fields in the column ‘a’ will equal to ‘something’. That means that the ‘1 = 1’ expression will be executed fewer times.
Some number: let's say that there are 1000 entries in the table and 10 entries in the table have ‘something’ in column ‘a’. In the first code example, there will be approx. 1000 + 1000 = 2000 evaluations while there will be 1000 + 10 evaluations in the second.
From my experience, InnoDB is faster when the database is extensively used for writing and MyISAM is faster when reading is the most frequent task. Thus, the Party Locator API queries a MyISAM database. The crawler runs in the background and has a less strict response time requirement.
The public API consists mostly out of search endpoints. These endpoints are case-insensitive. One way to handle this is to convert your search query to the lowercase character and use LOWER(str) in the SQL. However, this function has a performance impact on the query. It’s better to use a character set that is case-insensitive for this purpose.
In my queries I also like to ignore special characters, for me, an ‘é’ is the same as a regular ‘e’ and visa-versa. A naive approach to this problem is to convert a search query to only use ASCII characters, but this cannot be done for the data in the table. What if these special characters are important in the data? For this purpose, there exists a character set (or collation) — which is also case-insensitive- and threats an ‘é’ as an ‘e’ and visa-versa (and of course for all the variants), e.g. utf8_general_ci.
When you need the result of multiple queries, keep the connection with the DBMS open for further requests. Connecting to the database takes some time and only increases the response time of your application.
When you only need a certain amount of entries, limit the number of rows using the LIMIT clause. Sending unnecessary data to the application will only increase latency.
Using LIMIT does not always have a positive performance impact. In some cases, using LIMIT did have a negative impact when sorting was enabled. E.g. when you have an ORDER BY clause and a LIMIT clause, all the rows will first be sorted before the LIMIT clause will take effect.
Mind your paging!
If one is sure that no doubles can exist in a table, do not use the DISTINCT clause. This clause will iterate over all selected rows and make sure every row only appears once. When returning many rows, this can take some time.
Do not use ORDER BY when sorting isn’t needed. Most of the time, MySQL will sort its data clusters on disk using the primary key. However, this is not always the case. Not using ORDER BY does not guarantee the order of the records.
I’ve encountered some situations where only a very small number of rows were selected (without LIMIT) the response time became 10 times longer when the ORDER BY clause were added. In that scenario, it was better to do the sorting of the little rows on the application level.
The ultimate optimization technique was to cache frequent queries. For example, from logs, it became apparent that queries for the Belgian cities Antwerp, Diepenbeek, Ghent, Hasselt, and Leuven were most frequent. These results were cached in JSON files on the server. When a user issued such a query, the cache was returned. While the user views the result, a background process updates the cache and adds new entries to the interface. CRON jobs also updated cache files on specified times.
Sometimes all these optimizations aren’t enough and the only possibility is to buy faster hardware. If you want to check if your DBMS is eating up all the system resources, use the SQL command:
SHOW FULL PROCESSLIST
Did I write something incorrectly? Do you have better practices? Contact me!