SQL performance

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the databases category.

Last Updated: 2024-04-28

Follow-up questions:

- learn about GIN uses
- learn about GiST uses
- learn about blocks
- learn (A LOT) about reading explain statements
- learn how to get stats for the DB
- follow up with https://modern-sql.com/
- how to prevent Postgres using an index for a specific query
- when does Postgres update statistics? on index creation?
- most selective first in concatenated indexes?

Index basics

When useful

An index is a way to efficiently retrieve a relatively small number of rows from a table. It's only useful when the condition for retrieving rows - the WHERE clause - is selective, i.e. it matches a small fraction of the table.

If your index is too big to fit into RAM, it may perform poorly.

B-tree indexes are also useful for avoiding sorting.
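
For instance (a minimal sketch; table and column names are hypothetical), an index on the sort column lets the database read rows already in order instead of sorting them:

CREATE INDEX articles_published_at_idx ON articles (published_at);

SELECT *
  FROM articles
 ORDER BY published_at DESC
 LIMIT 10;

Since a B-tree can be scanned in either direction, the same ascending index can serve the DESC ordering here.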

How they are constructed

Layer 1: Leaf Nodes (containing leaf entries)

Each index leaf node is connected to its neighbors via a doubly linked list. Keeping entries physically sequential is not possible because an insert would need to move large amounts of data. The solution is to establish a logical order that is independent of the physical order in memory: new entries can then be inserted without moving large amounts of data - the database just needs to change some pointers. The doubly linked list also enables the database to read the index forwards or backwards as needed.

Databases use doubly linked lists to connect the so-called index leaf nodes. An index leaf node contains many index entries. Each leaf node is stored in a database block or page; that is, the database’s smallest storage unit. All index blocks are of the same size—typically a few kilobytes. The database uses the space in each block to the extent possible and stores as many index entries as possible in each block. That means that the index order is maintained on two different levels: the index entries within each leaf node, and the leaf nodes among each other using a doubly linked list. See image here: https://use-the-index-luke.com/sql/anatomy/the-leaf-nodes

Each index entry consists of the indexed columns (the key of the index - e.g. user_id or whatever you set the index on) and refers to the corresponding table row (via ROWID or RID). So the index maps the key to the row.

Unlike the index, the table data is stored in a heap structure and is not sorted at all. There is neither a relationship between the rows stored in the same table block nor is there any connection between the blocks.

Layer 2: Branch nodes

The index leaf nodes are stored in an arbitrary order—the position on the disk does not correspond to the logical position according to the index order. Think of a telephone directory with shuffled pages. A database needs a second structure to find the entry among the shuffled pages quickly: a balanced search tree—in short: the B-tree (balanced tree, not binary tree).

Branch nodes contain multiple branch entries - e.g. [46,53,57]. Each of these numbers corresponds to the biggest value in the respective leaf node. There is also a pointer to where that data is stored.

The next layer is built similarly, but on top of the first branch node level. The procedure repeats until all keys fit into a single node, the root node.

The structure is a balanced search tree because the tree depth is equal at every position; the distance between root node and leaf nodes is the same everywhere.

Once created, the database maintains the index automatically. It applies every insert, delete and update to the index and keeps the tree in balance, thus causing maintenance overhead for write operations.

The tree traversal is a very efficient operation. It works almost instantly—even on a huge data set. That is primarily because of the tree balance, which allows accessing all elements with the same number of steps, and secondly because of the logarithmic growth of the tree depth. That means the tree depth grows very slowly compared to the number of leaf nodes: with a few hundred entries per node, for example, a tree of depth four can already address billions of rows.

Why might indexes still be slow?

Big Picture

One case is where there are many index entries for the indexed value (e.g. user_id = 60). The database must read the next leaf node to see if there are any more matching entries. That means an index lookup not only needs to perform the tree traversal, it also needs to follow the leaf node chain.

The second reason for a slow index lookup is accessing the table. Even a single leaf node might contain many hits—often hundreds. The corresponding table data is usually scattered across many table blocks. That means that there is an additional table access for each hit.

Thus: An index lookup requires three steps: (1) the tree traversal; (2) following the leaf node chain; (3) fetching the table data. The tree traversal is the only step that has an upper bound for the number of accessed blocks—the index depth. The other two steps might need to access many blocks—they cause a slow index lookup.

Oracle DB actually has a bunch of operations here:

  1. INDEX UNIQUE SCAN (fast) The INDEX UNIQUE SCAN performs the tree traversal only. The Oracle database uses this operation if a unique constraint ensures that the search criteria will match no more than one entry. E.g. primary keys or unique indexes

  2. INDEX RANGE SCAN (potentially much slower) The INDEX RANGE SCAN performs the tree traversal and follows the leaf node chain to find all matching entries. This is the fallback operation if multiple entries could possibly match the search criteria. In Postgres, this is called a bitmap index scan.

  3. TABLE ACCESS BY INDEX ROWID (also potentially much slower) The TABLE ACCESS BY INDEX ROWID operation retrieves the row from the table. This operation is (often) performed for every matched record from a preceding index scan operation.

  4. TABLE ACCESS FULL must read and process the entire table. This can be faster than table access by row ID because the engine can read large parts of the table (many rows) at once (vs. doing them one by one). In Postgres, I believe this is called a "Seq Scan". Where clauses appear as "filters".

In Postgres, there is something called a bitmap heap scan. Postgres first fetches all results from the index (Bitmap Index Scan), then sorts the rows according to their physical storage location in the heap table, and then fetches the rows from the table (Bitmap Heap Scan). This method reduces the number of random-access IOs on the table.
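
A hedged sketch of how this appears in EXPLAIN output (table and index names hypothetical; costs illustrative):

EXPLAIN SELECT * FROM messages WHERE receiver = 42;

Bitmap Heap Scan on messages  (cost=4.36..39.67 rows=11 width=64)
  Recheck Cond: (receiver = 42)
  ->  Bitmap Index Scan on messages_receiver_idx  (cost=0.00..4.36 rows=11 width=0)
        Index Cond: (receiver = 42)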

This operation can become a performance bottleneck if there are many entries to look up—but there is no such risk with an INDEX UNIQUE SCAN (e.g. when looking up by a unique index, such as a primary key).

Poor Where Clauses

Primary Keys

The database automatically creates an index for the primary key. That means if EMPLOYEE_ID is the primary key, there is an index on the EMPLOYEE_ID column, even though there is no CREATE INDEX statement. The WHERE clause cannot match multiple rows because the primary key constraint ensures uniqueness of the EMPLOYEE_ID values. The database does not need to follow the index leaf nodes—it is enough to traverse the index tree.

Partial indexes

A partial index covers just a subset of a table’s data. It’s an index with a WHERE clause.

CREATE INDEX articles_flagged_created_at_index ON articles(created_at) WHERE flagged IS TRUE;

Think of it as specifying the rows to be indexed (instead of the columns, as with ordinary index creation).

This index remains fairly small, and can also be used alongside other indexes in more complex queries that require it.

A partial index is useful for commonly used where conditions that use constant values—like the status code in the following example:

SELECT message
 FROM messages
WHERE processed = 'N'
  AND receiver  = ?

If the above is for a queue system, and you don't want to index the huge number of items already processed, add a where clause to the index

 CREATE INDEX messages_todo
          ON messages (receiver)
       WHERE processed = 'N'

Compare this to the naive approach

CREATE INDEX messages_todo
          ON messages (receiver, processed)

Compared to the naive approach, the partial index is smaller both vertically (fewer rows) and horizontally (one column instead of two).

It can even mean that the index size remains unchanged although the table grows without bounds. The index does not contain all messages, just the unprocessed ones.

Multi-column indexes (aka Concatenated Indexes)

Postgres can combine single-field indexes, so benchmark and justify a multi-column index before you create one.

CREATE UNIQUE INDEX employees_pk
    ON employees (employee_id, subsidiary_id)

Index unique scan will be used with SELECT first_name, last_name FROM employees WHERE employee_id = 123 AND subsidiary_id = 30

But if only subsidiary_id appears in the WHERE clause, a TABLE ACCESS FULL will be used (reads the entire table and evaluates every row against the WHERE clause):

SELECT first_name, last_name
  FROM employees
 WHERE subsidiary_id = 20

Here is why: The first column is the primary sort criterion and the second column determines the order only if two entries have the same value in the first column, and so on. The ordering of a two-column index is therefore like the ordering of a telephone directory: it is first sorted by surname, then by first name. That means that a two-column index does not support searching on the second column alone; that would be like searching a telephone directory by first name: entries with the same first name are not next to each other globally.

=> The most important consideration when defining a concatenated index is how to choose the column order so it can be used as often as possible. E.g. if EMPLOYEE_ID were already indexed on its own (say, as a primary key), it would be wiser to create this index as `CREATE UNIQUE INDEX employees_pk ON employees (subsidiary_id, employee_id)` so that queries on subsidiary_id alone can use it too.

Incomplete indexes

If you have an index on subsidiary_id, but not last_name, and run this query:

SELECT first_name, last_name, subsidiary_id, phone_number
  FROM employees
 WHERE last_name  = 'WINAND'
   AND subsidiary_id = 30

It could end up slower than with no index at all (exactly when? when the query optimizer's statistics are wrong guesstimates). This is because the index will be used for subsidiary_id (INDEX RANGE SCAN), followed by a slow TABLE ACCESS BY INDEX ROWID for every row (potentially many) in that subsidiary, in order to filter on last name. Without the index, a TABLE ACCESS FULL could be faster. NB: the statement's response time does not depend on the result set size but on the number of employees in the particular subsidiary.

The fix (if the above query was slow enough to be problematic) is adding an index on employee name

CREATE INDEX emp_name ON employees (last_name)

Function-Based-Indexes (FBI)

If a function is applied to an indexed column, the function is a black box to the database, and a plain index on that column can't be used.

An index on last_name cannot be used for the query

SELECT first_name, last_name, phone_number
  FROM employees
 WHERE UPPER(last_name) = UPPER('winand')

From the database’s perspective, calling a function like UPPER results in something entirely different.

To support that query, we need an index that covers the actual search term. That means we do not need an index on last_name but on UPPER(last_name):

CREATE INDEX emp_up_name
    ON employees (UPPER(last_name))

As a general tip, unify your access paths in your code such that (e.g.) your case insensitive search always works with UPPER. If you sometimes do LOWER, then you'll need another index and this gets redundant and bloated.

FBIs must use deterministic functions

If you have a function GET_AGE(date_of_birth) that calculates age based on the current time, it cannot be used in an index: the result of the function call is not fully determined by its parameters. When inserting a new row, the database calls the function and stores the result in the index, and there it stays, unchanged. There is no periodic process that updates the index. The database updates the indexed age only when the date of birth is changed by an UPDATE statement. After the next birthday, the age stored in the index will be wrong.

Besides actually being deterministic, a function must also be declared deterministic to be used in an index: in PostgreSQL, use the keyword IMMUTABLE.
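
A minimal sketch (function and index names hypothetical); Postgres will refuse to create the index if the function is not marked IMMUTABLE:

CREATE FUNCTION normalized_name(name text) RETURNS text AS $$
  SELECT UPPER(TRIM(name))
$$ LANGUAGE SQL IMMUTABLE;

CREATE INDEX emp_norm_name ON employees (normalized_name(last_name));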

Index for equality first, then ranges/between

Say you have the following query

SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE date_of_birth >= TO_DATE(?, 'YYYY-MM-DD')
   AND date_of_birth <= TO_DATE(?, 'YYYY-MM-DD')
   AND subsidiary_id  = ?

If your index is (date_of_birth, subsidiary_id), it will often be inefficient because the index entries are sorted first on date_of_birth and then on subsidiary_id (only if two employees were born on the same day is subsidiary_id used to sort these records). Many of the DOBs in the range may point to leaf entries that do not match the subsidiary ID (e.g. 27) in your query. Essentially, the filter on date_of_birth is the only condition that limits the scanned index range. If the index were reversed to (subsidiary_id, date_of_birth), we'd narrow down to the right subsidiary first and then B-tree our way to the right DOB range quickly.

Note: The actual performance difference depends on the data and search criteria. The difference can be negligible if the filter on DATEOFBIRTH is very selective on its own. The bigger the date range becomes, the bigger the performance difference will be.
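
Assuming the query above, a sketch of the preferable index definition (index name hypothetical):

CREATE INDEX emp_sub_dob
    ON employees (subsidiary_id, date_of_birth)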

Concatenated indexes cannot use access predicates for queries with multiple ranges

SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE UPPER(last_name) < ?
   AND date_of_birth    < ?

It is impossible to define a B-tree index that would support this query without filter predicates. If you define the index as UPPER(last_name), date_of_birth (in that order), the list begins with A and ends with Z; the date of birth is considered only when there are two employees with the same name. If you define the index the other way around, it will start with the eldest employees and end with the youngest; in that case, the names have almost no impact on the sort order.

An index can therefore only support one range condition as an access predicate. You can of course accept the filter predicate and use a multi-column index nevertheless. That is the best solution in many cases anyway. The index definition should then mention the more selective column first so it can be used with an access predicate. NB: this is a case where you should put the most selective column first in the index (this isn't automatically true in general).

The other option is to use two separate indexes, one for each column. Then the database must scan both indexes first and then combine the results. The duplicate index lookup alone already involves more effort because the database has to traverse two index trees. Additionally, the database needs a lot of memory and CPU time to combine the intermediate results. This combination happens either via the "join operation" or "data warehouse" approach (read only), which converts your b-tree indexes into temporary special purpose bitmap indexes which can be combined easily. (Persisted bitmap indexes are not suitable for web applications generally due to poor concurrency support and due to slow writes)

Bitmap indexes work well for low-cardinality columns (modest number of distinct values). Extreme case is boolean data. Bitmap indexes use bit arrays (bit maps) and answer queries by performing bitwise logical operations on these bitmaps. With basic bitmap indexes, for each possible value of the column, a bit array will be created - e.g. in boolean case, one bitmap for "has internet" (true) and another for "does not have internet" (false). It is possible to reduce the number of bitmaps needed by using a different encoding method. E.g. C distinct values can be encoded using log(C) bitmaps with binary encoding. The downside is that more bitmaps have to be accessed to answer the query.

Index sorting is a good idea when nulls are sorted last in the index

This technique is mostly relevant for single-column indexes when you require "nulls sort last" behavior; otherwise the order is already available, since an index can be scanned in either direction.

B-Tree index entries are sorted in ascending order by default. In some cases, it makes sense to supply a different sort order for an index. Take the case when you’re showing a paginated list of articles, sorted by most recent published first

CREATE INDEX articles_published_at_index ON articles(published_at DESC NULLS LAST);

Production impact during index creation

Keep in mind that creating an index locks the table against writes. For large tables that can mean your site is down for hours.

Postgres allows you to CREATE INDEX CONCURRENTLY, which takes much longer to build but doesn’t require a lock that blocks writes
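
For example (index name hypothetical):

CREATE INDEX CONCURRENTLY articles_author_id_idx ON articles (author_id);

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and a failed build leaves behind an invalid index that must be dropped manually.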

Only B-Trees can produce sorted output

Other index types return matching rows in an unspecified, implementation-dependent order

Nulls and order by via index in postgres

Postgres B-trees place items in ascending order (1, 2, ... 8) with nulls last. So ORDER BY x really means ORDER BY x ASC NULLS LAST, and ORDER BY x DESC produces ORDER BY x DESC NULLS FIRST. You can adjust the ordering by including options when creating the index:

CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);

Nulls and unique indexes in postgres

(This may vary in another DBs)

Postgres by default treats NULL values as distinct values in indexing. In general, this is consistent with the SQL standard, where NULL is unknown and it is impossible to determine whether one unknown equals another unknown. Because NULL values are not equal to one another, they do not violate unique constraints.

This can be changed (as of PG15) by creating constraints and indexes using UNIQUE NULLS NOT DISTINCT - e.g.

CREATE TABLE null_new_style
(
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    val1 TEXT NOT NULL,
    val2 TEXT NULL,
    CONSTRAINT uq_val1_val2_new
        UNIQUE NULLS NOT DISTINCT (val1, val2)
);

Now, repeated NULL values are not allowed.

Reindexing

Finally, indexes become slightly fragmented and unoptimized over time, especially if the rows in the table are often updated or deleted. In those cases, it can be necessary to reindex, leaving you with a balanced and optimized index. One strategy to achieve the same result on a live site is to concurrently build an index on the same table and columns but with a different name, then drop the original index and rename the new one (see the sketch below).
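
A minimal sketch of that rebuild strategy in Postgres (index and table names hypothetical):

CREATE INDEX CONCURRENTLY articles_author_idx_new ON articles (author_id);
DROP INDEX articles_author_idx;
ALTER INDEX articles_author_idx_new RENAME TO articles_author_idx;

Postgres 12+ can do this in one step with REINDEX INDEX CONCURRENTLY articles_author_idx.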

Don't expect more than a 10-20% improvement in speed here.

Index types

1. B-Tree

B-Tree is the default that you get when you do CREATE INDEX

2. Generalized Inverted Indexes (GIN)

Useful when an index must map many values to one row, whereas B-tree indexes are optimized for rows with a single key value. GINs are good for indexing array values as well as for implementing full-text search (see the sketch after this list).

3. Generalized Search Tree (GiST)

Allow you to build general balanced tree structures, and can be used for operations beyond equality and range comparisons. They’re used to index the geometric data types, as well as full-text search.
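
A sketch of the GIN case from point 2 above (assumes articles has a text-array tags column; names hypothetical):

CREATE INDEX articles_tags_idx ON articles USING GIN (tags);

SELECT * FROM articles WHERE tags @> ARRAY['postgres'];

The @> operator asks whether the array column contains all the given values; the GIN index maps each array element back to the rows containing it.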

Join basics

Joins start with the table with the least data (less data needs to be carried around, increasing performance). In the query plan this will be the first child of the rightmost join (except when you are doing a correlated subquery).

Abstractly, you could say they take normalized data and create denormalized data for a specific purpose.

All algorithms only process two tables at a time; an SQL query with N tables therefore requires N-1 intermediate tables. DBs use pipelining to avoid storing full intermediate tables, reducing the memory that would otherwise be used for intermediate result sets. Because the optimizer must evaluate execution plans, planning work grows at O(N!) with the number of tables if you don't use bind parameters (since not using bind parameters is like recompiling a program every time).

Types of Join

The optimiser uses stats about row counts to choose the most efficient join. This depends on those stats being up to date and useful.

The query engine is adaptive and may change the type of join midway: it starts with a nested loop, but if the number of rows crosses a threshold, it starts building a hash table.

1. Nested loop join

For every row in the first table, iterate through the entire second table looking for matches. Very inefficient.

The algorithm is O(n^2)

Nested loops are made faster with indexes on the join columns of the 2nd table... this enables the DB to pick out the right rows without iterating through the whole table. By comparison, a hash join (below) cannot use an index on the 2nd table's join column because the algorithm must scan that table in full to build the hash table.

2. Merge join

Speed can often be improved by sorting both tables on the join columns... then you know that once you reach a value in the second table greater than the current one in the first, you can stop searching in the second table.

So if you have a posts table and a comments table (with post_id), you'd sort the posts table on id and the comments table on post_id. Then run this algorithm:

// A sort-merge pass: assumes posts is sorted by id, postComments is sorted
// by postId, and every comment references an existing post (FK integrity).
List<Tuple> tuples = new ArrayList<>();
int postCount = posts.size(), postCommentCount = postComments.size();
int i = 0, j = 0;

while(i < postCount && j < postCommentCount) {
    Post post = posts.get(i);
    PostComment postComment = postComments.get(j);

    if(post.getId().equals(postComment.getPostId())) {
        tuples.add(
            new Tuple()
                .add("post_id", postComment.getPostId())
                .add("post_title", post.getTitle())
                .add("review", postComment.getReview())
        );
        j++; // the same post may match further comments
    } else {
        i++; // current post is behind this comment; advance the posts side
    }
}

The algorithm is O(n log(n) + m log(m)).

When the joining relations have an index, there is no need to sort the relation as the index can be used to read the records in the desired sorted order.

3. Hash Join

Go through every item in table one and apply a hash function to the join columns to create a hash table. Then go through the second table applying the same hash function and checking for matches. A single loop. The sole purpose of the hash table is to act as a temporary in-memory structure that avoids accessing the hashed table many times.

In the first step, it creates an in-memory hash table structure (i.e. a dict) from the records of the relation with fewer elements. With a posts table and comments table, this will likely be the posts table:

postMap.put(post.getId(), post); // done for each post in the smaller relation

In the second step, the larger relation is iterated and the matching record from the smaller table is located using the previously built hash map:

for (PostComment postComment : postComments) {
    Long postId = postComment.getPostId();
    Post post = postMap.get(postId);

    if (post != null) {
        tuples.add(
            new Tuple()
                .add("post_id", postComment.getPostId())
                .add("post_title", post.getTitle())
                .add("review", postComment.getReview())
        );
    }
}

The complexity of the hash join algorithm is linear: O(N + M).

Hash join is usually much faster, but it only works on equalities. In other words, greater than, not equal to, etc. cannot be used.

Hash joins can be slower than other joins when only a relatively small number of rows will be joined, because building the hash table has a fixed up-front cost and requires enough RAM to hold it.

The indexing strategy for a hash join is quite different because there is no need to index the join columns. Only indexes on independent where predicates improve hash join performance. An "independent predicate" is one that refers to one table only and does not belong to the join predicates. Indexing those predicates is what improves performance here.

It's possible to reduce hash table size (and improve perf) by selecting fewer columns or by adding filters that will decrease the result set that needs to be placed into hash table
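
A hedged sketch (hypothetical tables and columns): select only the columns you need and filter early, so less data lands in the hash table:

SELECT p.title, c.review
  FROM posts p
  JOIN comments c ON c.post_id = p.id
 WHERE p.published_at >= ?  -- independent predicate: shrinks the hashed side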

Index range scan

(In Postgres, this is called Bitmap Index Scan)

Skips a full table read and retrieves the matching rows via the index.

If you have an index for a where query, you'll do an index range scan

Clustering factor

Measures how closely the physical data order matches the index order. If the ordering is similar, fewer seeks and block reloads are needed, and access is more efficient. The number is high if there is no correlation. The lower bound is the number of blocks in the table; the upper bound is the number of rows.

Optimizing the clustering factor for one column or set of columns almost always comes at the cost of other columns' clustering factors (because the columns are uncorrelated).

Obfuscated conditions

Obfuscated conditions are where clauses that are phrased in a way that prevents proper index usage.

Obfuscation 1: Date types

Watch out and ensure that date types (rather than character types) are passed to the DB via the ORM etc. An index on a date column will not be used if the query compares against a character value.

If the ORM passes a character type, you can always convert it to a date:

SELECT ...
  FROM sales
 WHERE sale_date = TO_DATE('1970-01-01', 'YYYY-MM-DD')

Also, prefer to use the BETWEEN operator over converting date formats (which may prevent index usage)

# Slow
SELECT ...
  FROM sales
 WHERE DATE_FORMAT(sale_date, "%Y-%M")
     = DATE_FORMAT(now()    , "%Y-%M")

# Faster (using a custom function)
SELECT ...
  FROM sales
 WHERE sale_date BETWEEN quarter_begin(?)
                     AND quarter_end(?)

Obfuscation 2: Storing numbers in a string column

If you index a numeric string, the index can only be used when the query also supplies the value as a string. Using a number will fail to use the index due to ambiguity about how to convert (e.g. leading 0s, spaces, etc.).

# index
SELECT ...
  FROM ...
 WHERE numeric_string = '42'

# no index.
SELECT ...
  FROM ...
 WHERE numeric_string = 42

It's very bad practice to put integers in strings instead of numeric columns.

If stuck with legacy code, convert the search term

# index usable again
SELECT ...
  FROM ...
 WHERE numeric_string = TO_CHAR(42)

Obfuscation 3: Math

A plain numeric_number index cannot be used for this query

SELECT numeric_number
  FROM table_name
 WHERE numeric_number - 1000 > ?

Instead, move the 1000 to the other side of the equation.

If using multiplication etc., put an index on the matching expression:

CREATE INDEX math ON table_name (3*a - b)

Using a redundant WHERE condition can speed things up

The following query cannot use indexes due to ADDTIME turning this into derived data.

SELECT ...
  FROM ...
 WHERE ADDTIME(date_column, time_column)
     > DATE_ADD(now(), INTERVAL -1 DAY)

But if we add a redundant where clause, it will be faster

SELECT ...
  FROM ...
WHERE ADDTIME(date_column, time_column)
     > DATE_ADD(now(), INTERVAL -1 DAY)
   AND date_column
    >= DATE(DATE_ADD(now(), INTERVAL -1 DAY))

This is a straight filter on date_column that can be used as an access predicate.

Tips

Tip 1. Wrapping table columns in a function weakens the index

CREATE INDEX tbl_idx ON tbl (date_column)

SELECT COUNT(*)
  FROM tbl
 WHERE EXTRACT(YEAR FROM date_column) = 2017

With this index, the database can still read the full index end to end. Although this can be faster than reading the full table end to end, it is still not very efficient.

The best rewrite is to keep the index but rewrite as a range

SELECT COUNT(*)
  FROM tbl
 WHERE date_column >= DATE'2017-01-01'
   AND date_column <  DATE'2018-01-01'

This is preferable to an index on EXTRACT(YEAR FROM date_column) because the plain date column index is more flexible for other possible queries. Otherwise you might need a 2nd or 3rd index, etc.

Tip 2. Case-insensitive search cannot use case-sensitive indexes

CREATE INDEX tbl_idx ON employees (last_name)

SELECT first_name, last_name, phone_number
  FROM employees
 WHERE UPPER(last_name) = UPPER('winand')

This cannot use the index on last_name (changing case turns it into something completely different according to the DB) and the software must use a full table scan. Instead, you need to index the UPPER(last_name)

Tip 3: Your indexes need to support order clauses by listing the order column in the index

The following two-column index is necessary for the subsequent query:

CREATE INDEX tbl_idx ON tbl (a, date_column)

SELECT *
  FROM tbl
 WHERE a = 12
 ORDER BY date_column DESC
 FETCH FIRST 1 ROW ONLY

The database uses the index to find the last entry that matches the where clause and takes it as the result. Even though there is an order by clause, there is no need to sort any rows.

Tip 4: Concatenated indexes can only be used from left to right (not the reverse)

CREATE INDEX tbl_idx ON tbl (a, b)

SELECT *
  FROM tbl
 WHERE a = 38
   AND b = 1

SELECT *
  FROM tbl
 WHERE b = 1

That index will work efficiently for the 1st query (a and b) but not the 2nd (b only). Here, changing the index column order makes the index suitable for both queries—without additional overhead.

CREATE INDEX tbl_idx ON tbl (b, a)

Tip 5: LIKE queries with the wildcard at the end (or in the middle) can use an index; those starting with a wildcard cannot

LIKE filters can only use the characters before the first wild card during tree traversal. The remaining characters are just filter predicates that do not narrow the scanned index range.

The following can use the index efficiently because the wildcard is at the end

CREATE INDEX tbl_idx ON tbl (text varchar_pattern_ops)

SELECT *
  FROM tbl
 WHERE text LIKE 'TJ%'

A single LIKE expression can therefore contain two predicate types: (1) the part before the first wildcard acts as an access predicate (narrows the scanned index range); (2) the remaining characters act as a filter predicate (applied while iterating, which is slow).

A like expression with a leading wildcard cannot use the index at all.

Tip 6: Plan for index-only scans

In this first query, all the data necessary to answer the query is available in the index

CREATE INDEX tbl_idx ON tbl (a, date_column)

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 38
 GROUP BY date_column

Therefore it will be lightning fast (index-only scan).

But if we modify the query to search through a column not in the index

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 38
   AND b = 1
 GROUP BY date_column

Now the database must look into the actual table for each row that satisfies the original where clause to see if it also satisfies the new filter. Even if the new filter removes all rows, it does so only after incurring that additional work. Although the grouping then has fewer rows to aggregate, this cannot compensate for the additional table look-ups.

Query plans

Sometimes these can be self-contradictory. E.g., below, the index yields 40 rows yet 100 rows are supposedly accessed by row ID. This isn't possible. It indicates a problem with the statistics. Look up, for your DB, whether creating an index updates statistics (sometimes it doesn't...).

|Id |Operation                   | Name        | Rows | Cost |
--------------------------------------------------------------
| 0 |SELECT STATEMENT            |             |  100 |   41 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES   |  100 |   41 |
|*2 |  INDEX RANGE SCAN          | EMP_UP_NAME |   40 |    1 |

If you see the word "filter", you've got a filter predicate. These are like unexploded ordnance devices and cause response time to blow up in an awful way. You want to be seeing "access" predicates.

Query Plans in Postgres

Index Scan using employees_pk on employees
   (cost=0.00..8.27 rows=1 width=14)
   Index Cond: (employee_id = 123::numeric)

The PostgreSQL operation Index Scan combines the INDEX [UNIQUE/RANGE] SCAN and TABLE ACCESS BY INDEX ROWID operations from the Oracle Database.

Prefer calling functions on input rather than on existing tables

E.g. the following converts every existing row's sale_date to a character string. So N operations. (It's also bad because it obfuscates any index on the date column.)

SELECT ...
  FROM sales
 WHERE TO_CHAR(sale_date, 'YYYY-MM-DD') = '1970-01-01'

Compare that to this, which performs only one operation, on the input:

SELECT ...
  FROM sales
 WHERE sale_date = TO_DATE('1970-01-01', 'YYYY-MM-DD')

Latency vs Bandwidth

Generally, when a DB is over a network (as is usual with web applications), there are two big contributors to response time:

Latency can make response time slow through many round-trips between your server and the SQL database, e.g. due to N+1 queries.

Bandwidth can make response time slow by increasing the amount of data to be transferred, therefore taking longer.

In general, latency is a much more common and serious issue.

Suspicious response times are often taken lightly during development. This is largely because we expect the “more powerful production hardware” to deliver better performance. More often than not it is the other way around because the production infrastructure is more complex and accumulates latencies that do not occur in the development environment. Even when testing on a production equivalent infrastructure, the background load can still cause different response times.

Batch execution (aka array execution)

Many databases offer a way to execute more than one statement in a single round trip over the network.

UNION clause

If data from two different tables that have the same structure is needed, you can use a UNION to fetch both at once. For example a CURRENT and a HISTORY table might contain the life-cycle of a row.
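
A sketch (hypothetical tables with identical structure):

SELECT order_id, status, changed_at FROM orders_current WHERE order_id = ?
UNION ALL
SELECT order_id, status, changed_at FROM orders_history WHERE order_id = ?

UNION ALL is preferable when the two sets cannot overlap, since plain UNION adds a duplicate-eliminating sort.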

IN clause

Consider using an IN clause to fetch multiple records from the same table in one round trip. Note that the SQL standard doesn't guarantee that the result set order corresponds to the IN clause order...
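
E.g. (hypothetical table):

SELECT id, status
  FROM jobs
 WHERE id IN (?, ?, ?)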

Open Cursors / Prepared Statement

The same PreparedStatement instance should be used if the same SQL statement is executed multiple times. The first execution of a Statement involves an additional round trip to open the cursor. This additional round trip can be avoided when the same PreparedStatement instance is reused. As a side effect, PARSING overhead is also reduced. However, the number of open cursors has a hard limit; don’t forget to close them when the work is done.
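
The same idea is exposed at the SQL level in Postgres via PREPARE/EXECUTE (a sketch; statement name hypothetical):

PREPARE find_employee (bigint) AS
  SELECT first_name, last_name FROM employees WHERE employee_id = $1;

EXECUTE find_employee(123);
EXECUTE find_employee(456);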

Prefer Joins to SubQueries

While subqueries may be easier to understand and use for many SQL users, JOINs are often more efficient.

However, subqueries are necessary in some situations.

  1. Subquery in FROM with a GROUP BY

SELECT city, sum_price
  FROM
(
  SELECT city, SUM(price) AS sum_price FROM sale
  GROUP BY city
) AS s
WHERE sum_price < 2100;

Using the results of the subquery, the outer query selects only the cities whose total sale price is less than $2,100. A plain join cannot access that aggregated GROUP BY result.

Explain

How to read:

Actual time=8163.890..8163.893 means:

Initializing that step ("startup") took 8163.890ms. Running the whole step took 8163.893ms. So nearly all the time was startup.

The same logic applies to the cost information.

Resources