SQL Query Pitfalls: Common Mistakes That Undermine Index Efficiency
Boost Database Performance and Query Optimisation by Avoiding These Critical Index-Unfriendly Practices
Table of contents
- Introduction
- Avoiding Function Pitfalls on Indexed Columns
- The Hidden Cost of Leading Wildcards in LIKE Patterns
- Optimizing OR Conditions for Better Index Usage
- Rethinking NOT IN and NOT EXISTS for Performance Gains
- How Calculations in WHERE Clauses Sabotage Indexes
- Composite Indexes: Order Matters
- The Role of Statistics and Cardinality in Index Efficiency
- How Statistics and Cardinality Affect Index Usage
- Conclusion
Introduction
The efficiency of SQL indexes relies heavily on how the queries that use them are written. Poorly written queries can force the database engine to bypass existing indexes and carry out a time-consuming full table scan instead. This hurts not only the individual queries but also the overall performance of the system, particularly in databases that handle large volumes of data or heavy traffic.
Let's dive into SQL optimisation and look at the query mistakes that sabotage index usage.
Avoiding Function Pitfalls on Indexed Columns
One of the most common mistakes that prevent the use of indexes is applying functions to indexed columns in the WHERE clause of a SQL query. This practice can force the database engine to perform a full table scan, negating the benefits of having an index in the first place.
Indexes are built on the actual values stored in a column. When we apply a function to an indexed column, we are asking the database to compute a new value for every row before making the comparison. That computation happens at runtime for each row, so the database can't look the result up in the pre-computed index structure.
-- This query can use an index on last_name
SELECT * FROM employees WHERE last_name = 'Smith';
-- This query cannot use an index on last_name
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';
If we frequently search on a transformed version of a column (e.g., uppercase), consider adding a computed column and indexing it. For date-based queries, use date range conditions instead of extracting parts of the date.
-- This query can use an index on hire_date
SELECT * FROM employees WHERE hire_date = '2023-01-01';
-- This query cannot use an index on hire_date
SELECT * FROM employees WHERE YEAR(hire_date) = 2023;
-- Instead of: WHERE YEAR(hire_date) = 2023
SELECT * FROM employees WHERE hire_date >= '2023-01-01' AND hire_date < '2024-01-01';
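One way to check these claims is to ask the engine for its plan. The sketch below uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption on our part: the index names are hypothetical, SQLite has no YEAR() so strftime plays that role, and an index on an expression stands in for an indexed computed column. Other engines word their plans differently, but the SEARCH versus SCAN distinction carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id        INTEGER PRIMARY KEY,
        last_name TEXT,
        hire_date TEXT,   -- ISO-8601 strings sort chronologically
        salary    REAL
    );
    CREATE INDEX idx_last_name ON employees(last_name);
    CREATE INDEX idx_hire_date ON employees(hire_date);
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text

# Bare column: the index can be searched directly.
plain_plan = plan("SELECT * FROM employees WHERE last_name = 'Smith'")

# Function applied to the column: the plain index no longer applies.
upper_plan_before = plan(
    "SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH'")
year_plan = plan(
    "SELECT * FROM employees WHERE strftime('%Y', hire_date) = '2023'")

# Range condition on the bare column: the index applies again.
range_plan = plan(
    "SELECT * FROM employees "
    "WHERE hire_date >= '2023-01-01' AND hire_date < '2024-01-01'")

# Indexing the expression itself makes the UPPER() query searchable.
conn.execute(
    "CREATE INDEX idx_upper_last_name ON employees(UPPER(last_name))")
upper_plan_after = plan(
    "SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH'")

for label, text in [("plain", plain_plan), ("upper", upper_plan_before),
                    ("year", year_plan), ("range", range_plan),
                    ("upper+index", upper_plan_after)]:
    print(f"{label:12} {text}")
```

A plan that begins with SEARCH and names an index means the index is used for lookup. A plan that begins with SCAN means every row is visited.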
The Hidden Cost of Leading Wildcards in LIKE Patterns
Using a wildcard at the beginning of a LIKE pattern (e.g., '%name') prevents the database from using an index effectively. A B-tree index stores values in sorted order, so it can only narrow the search when the start of the value is known. With a leading wildcard, the database can't use this ordering to its advantage and must check every row.
If such searches are frequent, consider storing the reversed value of the column in a new column and creating an index on that. A suffix search then becomes a prefix search on the reversed column.
-- Create a computed column with reversed string
ALTER TABLE products ADD reversed_name AS REVERSE(name);
-- Create an index on the reversed column
CREATE INDEX idx_reversed_name ON products(reversed_name);
-- Query using the reversed column
SELECT * FROM products WHERE reversed_name LIKE REVERSE('%shirt');
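To see why sort order helps prefixes but not suffixes, we can model an index as a sorted Python list. This is only an illustration, not how any engine implements its B-tree; the product names and helper functions are made up for the example.

```python
from bisect import bisect_left

# A sorted list stands in for an index on products.name.
names = sorted(["blue shirt", "red shirt", "shirt dress",
                "shoes", "socks", "t-shirt"])

def prefix_search(sorted_values, prefix):
    """Binary-search the contiguous run of values starting with `prefix`."""
    lo = bisect_left(sorted_values, prefix)
    # The highest code point acts as an upper bound for the prefix range.
    hi = bisect_left(sorted_values, prefix + "\U0010ffff")
    return sorted_values[lo:hi]

def suffix_scan(values, suffix):
    """Values ending in `suffix` are scattered, so every value is checked."""
    return [v for v in values if v.endswith(suffix)]

# The reversed-column trick: index the reversed strings instead.
reversed_index = sorted(v[::-1] for v in names)

def suffix_search(reversed_sorted, suffix):
    """A suffix search becomes a prefix search on the reversed values."""
    hits = prefix_search(reversed_sorted, suffix[::-1])
    return sorted(h[::-1] for h in hits)

by_prefix = prefix_search(names, "sh")
by_scan = suffix_scan(names, "shirt")
by_reversed = suffix_search(reversed_index, "shirt")
print(by_prefix)
print(by_scan)
print(by_reversed)
```

The prefix search touches only a handful of entries via binary search, while the suffix scan inspects all of them. Searching the reversed list brings the suffix case back to a binary search.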
Optimizing OR Conditions for Better Index Usage
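Using OR conditions across different columns can lead to poor query performance, as the database often can't combine multiple indexes efficiently and may resort to a table scan.
Instead of using OR, we can often improve performance by splitting the query into one branch per condition with UNION ALL, so that each branch can use its own index. UNION ALL keeps duplicates, so the second branch must exclude rows already returned by the first (or use UNION, which removes duplicates at the cost of an extra step).
SELECT * FROM employees WHERE last_name = 'Smith'
UNION ALL
-- Exclude rows the first branch already returned
SELECT * FROM employees WHERE first_name = 'John' AND (last_name <> 'Smith' OR last_name IS NULL);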
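One caveat when replacing OR with UNION ALL is that a row matching both conditions comes back twice. The sketch below reproduces this with Python's sqlite3 module and a few made-up rows, then compares two ways of matching the OR result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    INSERT INTO employees (first_name, last_name) VALUES
        ('John', 'Smith'),   -- id 1 matches both conditions
        ('John', 'Doe'),     -- id 2
        ('Jane', 'Smith'),   -- id 3
        ('Ann',  'Lee');     -- id 4 matches neither
""")

def ids(sql):
    """Run `sql` and return the selected ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

or_ids = ids("SELECT id FROM employees "
             "WHERE last_name = 'Smith' OR first_name = 'John'")

# Naive rewrite: id 1 is returned by both branches.
union_all_ids = ids(
    "SELECT id FROM employees WHERE last_name = 'Smith' "
    "UNION ALL "
    "SELECT id FROM employees WHERE first_name = 'John'")

# Fix 1: UNION removes duplicates.
union_ids = ids(
    "SELECT id FROM employees WHERE last_name = 'Smith' "
    "UNION "
    "SELECT id FROM employees WHERE first_name = 'John'")

# Fix 2: keep UNION ALL, but exclude first-branch rows in the second branch.
exclusive_ids = ids(
    "SELECT id FROM employees WHERE last_name = 'Smith' "
    "UNION ALL "
    "SELECT id FROM employees WHERE first_name = 'John' "
    "AND (last_name <> 'Smith' OR last_name IS NULL)")

print(or_ids, union_all_ids, union_ids, exclusive_ids)
```

The IS NULL check in the second fix matters because `last_name <> 'Smith'` is not true for NULL values, and those rows would otherwise be dropped.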
Rethinking NOT IN and NOT EXISTS for Performance Gains
While NOT IN and NOT EXISTS have their uses, they can lead to poor performance, especially with large datasets. Depending on the engine and the plan it chooses, the subquery may be evaluated against every row of the outer table rather than resolved through an index.
Rewriting NOT IN as a LEFT JOIN with an IS NULL check, or with EXCEPT, can be more performant and allows better use of indexes in some scenarios. These forms return the same rows as NOT IN only when the subquery column contains no NULLs; if it contains even one NULL, NOT IN returns no rows at all.
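-- Employees with no matching row in managers, via LEFT JOIN ... IS NULL
SELECT e.*
FROM employees e
LEFT JOIN managers m ON e.id = m.employee_id
WHERE m.employee_id IS NULL;
-- The same result expressed with NOT EXISTS
SELECT *
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM managers m
    WHERE e.id = m.employee_id
);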
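These forms differ from NOT IN once NULLs are involved, which is easy to overlook when swapping one for another. The sketch below uses Python's sqlite3 module with made-up data in which managers.employee_id contains a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY);
    CREATE TABLE managers  (employee_id INTEGER);
    INSERT INTO employees (id) VALUES (1), (2), (3);
    INSERT INTO managers (employee_id) VALUES (1), (NULL);
""")

def ids(sql):
    """Run `sql` and return the selected ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

# NOT IN: the NULL makes the test "unknown" for every non-matching row,
# so no rows qualify.
not_in_ids = ids(
    "SELECT id FROM employees "
    "WHERE id NOT IN (SELECT employee_id FROM managers)")

not_exists_ids = ids(
    "SELECT e.id FROM employees e WHERE NOT EXISTS ("
    "  SELECT 1 FROM managers m WHERE e.id = m.employee_id)")

left_join_ids = ids(
    "SELECT e.id FROM employees e "
    "LEFT JOIN managers m ON e.id = m.employee_id "
    "WHERE m.employee_id IS NULL")

except_ids = ids(
    "SELECT id FROM employees EXCEPT SELECT employee_id FROM managers")

print(not_in_ids, not_exists_ids, left_join_ids, except_ids)
```

Only NOT IN comes back empty here. If the subquery column is declared NOT NULL, all four forms agree.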
How Calculations in WHERE Clauses Sabotage Indexes
When we include calculations in the WHERE clause, the database engine must perform these calculations for each row before it can apply the filter condition. This process prevents the use of indexes, as the calculated values don't exist in the index structure.
-- This query can't use an index on order_date
SELECT * FROM orders WHERE DATEADD(day, 30, order_date) > GETDATE();
-- This query can't use an index on price
SELECT * FROM products WHERE price * 1.1 > 100;
The above queries can be rewritten by moving the calculation to the right-hand side of the condition, so that the indexed column (order_date or price) stands alone in the comparison. SQL Server can then use an index on that column.
SELECT * FROM orders WHERE order_date > DATEADD(day, -30, GETDATE());
SELECT * FROM products WHERE price > 100 / 1.1;
If calculations on a column are frequent, consider creating computed columns and indexing them. SQL Server can then use the indexed computed column when the query is executed, without having to recalculate on the fly.
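The effect of moving the arithmetic off the column can be checked against a query plan. The sketch below uses Python's sqlite3 module rather than SQL Server, so the plan wording differs, but the same principle applies. The table contents and the index name idx_price are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE INDEX idx_price ON products(price);
    INSERT INTO products (name, price) VALUES
        ('cap', 50), ('belt', 90), ('shirt', 91), ('coat', 120);
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

def ids(sql):
    """Run `sql` and return the selected ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

# Calculation on the column side of the comparison.
on_column = "SELECT id, name, price FROM products WHERE price * 1.1 > 100"
# Same filter with the calculation moved to the constant side.
on_constant = "SELECT id, name, price FROM products WHERE price > 100 / 1.1"

column_plan = plan(on_column)
constant_plan = plan(on_constant)
column_ids = ids(on_column)
constant_ids = ids(on_constant)

print(column_plan)
print(constant_plan)
print(column_ids, constant_ids)
```

Both queries select the same products here, but only the second can search idx_price. With floating-point arithmetic the two forms can disagree on values that sit exactly on the boundary, so it is worth checking edge cases before swapping one for the other.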
Composite Indexes: Order Matters
Composite indexes can be powerful tools for query optimization, but their effectiveness depends heavily on how they're created and used.
The order of columns in a composite index determines its usefulness for different queries. The index can be used efficiently for queries that reference:
- The first column
- The first and second columns
- The first, second, and third columns, and so on
However, it's not efficient for queries that don't include the first column in the WHERE clause.
Assume we have a composite index on (last_name, first_name, birth_date). This index will be efficient for:
-- Uses the index efficiently
SELECT * FROM employees WHERE last_name = 'Smith';
-- Also uses the index efficiently
SELECT * FROM employees WHERE last_name = 'Smith' AND first_name = 'John';
-- Uses the index most efficiently
SELECT * FROM employees WHERE last_name = 'Smith' AND first_name = 'John' AND birth_date = '1990-01-01';
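But not for:
-- Can't use the index efficiently: the leading column (last_name) is missing
SELECT * FROM employees WHERE first_name = 'John';
-- Can't use the index efficiently: the leading column (last_name) is missing
SELECT * FROM employees WHERE first_name = 'John' AND birth_date = '1990-01-01';
Note that the order in which conditions are written in the WHERE clause does not matter. WHERE first_name = 'John' AND last_name = 'Smith' can use the index just as well as the reversed form, because both leading columns are present.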
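The leading-column rule can be checked against query plans. The sketch below uses Python's sqlite3 module as a stand-in engine; the index name idx_name_birth and the extra salary column are assumptions made for the example. It also checks that the order in which predicates are written does not change index usage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        birth_date TEXT,
        salary     REAL
    );
    CREATE INDEX idx_name_birth
        ON employees(last_name, first_name, birth_date);
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Leading column present: the index can be searched.
leading_plan = plan(
    "SELECT * FROM employees WHERE last_name = 'Smith'")

# Both leading columns present, written in "reverse" order: the optimizer
# matches predicates to index columns regardless of how they are written.
reordered_plan = plan(
    "SELECT * FROM employees "
    "WHERE first_name = 'John' AND last_name = 'Smith'")

# Leading column missing: the index order is of no help.
no_leading_plan = plan(
    "SELECT * FROM employees WHERE first_name = 'John'")

print(leading_plan)
print(reordered_plan)
print(no_leading_plan)
```

Some engines can still use such an index when the leading column is missing (for example through a skip scan), but that depends on the engine and on the statistics available to it.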
The Role of Statistics and Cardinality in Index Efficiency
In the context of databases, a statistic is metadata that describes the distribution of data in a table or index. It includes, but is not limited to:
- The number of rows in a table
- The number of distinct values in a column
- The distribution of values in a column, such as the most common values and the range of values
These statistics enable the query optimizer to estimate the number of rows that will be returned by different operations in a query. That information is vital to choosing the most efficient execution plan.
Cardinality refers to the number of unique values in a column relative to the total number of rows.
- High cardinality: Many unique values (e.g., a primary key column)
- Low cardinality: Few unique values (e.g., a boolean column)
Cardinality is a key factor in index effectiveness. Columns with high cardinality are often good candidates for indexing, as they allow the database to quickly narrow down the set of rows to examine.
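A rough way to gauge a column's cardinality is to divide its number of distinct values by the row count. The sketch below does this with Python's sqlite3 module on a made-up table of 1,000 employees. The selectivity helper is hypothetical, and real optimizers rely on their own stored statistics rather than a query like this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, is_active INTEGER)")
conn.executemany(
    "INSERT INTO employees (id, is_active) VALUES (?, ?)",
    [(i, i % 2) for i in range(1, 1001)],  # 1,000 rows, two is_active values
)

def selectivity(column):
    """Distinct values divided by total rows: near 1.0 means high cardinality."""
    # `column` comes from the fixed names used below, never from user input.
    sql = (f"SELECT CAST(COUNT(DISTINCT {column}) AS REAL) / COUNT(*) "
           "FROM employees")
    return conn.execute(sql).fetchone()[0]

id_ratio = selectivity("id")             # every value is unique
active_ratio = selectivity("is_active")  # only two values across all rows
print(id_ratio, active_ratio)
```

An equality filter on id narrows the search to a single row, while one on is_active still leaves about half the table to read, which is why an index on it alone rarely pays off.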
How Statistics and Cardinality Affect Index Usage
The query optimizer uses statistics to decide whether to use an index and how to use it. For example:
- If statistics show that a WHERE clause will return a large percentage of rows, the optimizer might choose a full table scan instead of using an index.
- For joins, the optimizer uses statistics to determine the best join order and method (nested loops, hash join, merge join).
- Cardinality influences index choice in multi-column indexes. Columns with higher cardinality are often more effective when placed first in the index.
Most databases automatically update statistics, but in some cases, we might need to do it manually:
-- SQL Server
UPDATE STATISTICS TableName;
-- PostgreSQL
ANALYZE TableName;
-- MySQL
ANALYZE TABLE TableName;
Ensure the database is configured to update statistics automatically, and update them manually after large data modifications (bulk inserts, updates, or deletes).
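As an illustration of what a statistics refresh produces, the sketch below runs ANALYZE in SQLite through Python's sqlite3 module and reads back the internal sqlite_stat1 table. This is specific to SQLite; SQL Server, PostgreSQL, and MySQL store and expose their statistics differently. The orders table and idx_status index are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE INDEX idx_status ON orders(status);
""")

def stats_table_exists():
    """Check whether SQLite has created its statistics table yet."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()
    return row[0] == 1

had_stats_before = stats_table_exists()

# Simulate a bulk load: 100 rows split evenly between two status values.
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open" if i % 2 else "closed",) for i in range(100)],
)
conn.commit()

# Refresh statistics after the large modification.
conn.execute("ANALYZE")

had_stats_after = stats_table_exists()
index_stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'idx_status'"
).fetchall()
print(had_stats_before, had_stats_after, index_stats)
```

The stat column begins with the number of rows in the index, followed by an estimate of how many rows share each indexed value. These are the kinds of figures the optimizer consults when weighing an index lookup against a scan.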
Conclusion
Optimizing SQL queries for index usage is a crucial skill for anyone working with databases. The mistakes we've discussed – from using functions on indexed columns to ignoring database statistics – can significantly impact query performance and overall system efficiency.
Regularly analysing query performance and execution plans also helps to identify areas for improvement. As the data grows and changes, continually reassess indexing strategies and query patterns. By doing so, we can ensure our database continues to perform well, providing fast and efficient data access for applications and users.