In this environment, a single inefficient query can have disastrous effects. A bad
statement may overload all database processors, leaving them unavailable to serve
other customers' orders. Of course, such problems typically occur shortly after the
launch of new offers... that is, precisely under heavy marketing fire. Can you imagine
the mood of our senior management if such a disaster happened?
That's why every database developer (and every application developer working with
databases) should understand the basic concepts of database performance tuning. The
objective of this article is to give a theoretical introduction to the problem. At the end of
this article, you should be able to answer the question: is this execution plan reasonable
given the concrete amount of data I have?
I have to warn you: this is about theory. I know everyone dislikes it, but there is no serious
way around it. So expect to find here a lot of logarithms and probabilities... Not
afraid? Then let's continue.
• Applying predicates
• Joining tables with hash joins
• Sorting
• Merging
• Joining tables with merge joins
Scenario
I need a sample database for the examples of this article. Let's set up the scene.
The CUSTOMERS table contains general information about all customers. Say the company
has about a million customers. This table has a primary key, CUSTOMER_ID, which is
indexed by PK_CUSTOMERS. The LAST_NAME column is indexed by
IX_CUSTOMERS_LAST_NAME. There are 100,000 unique last names. Records in this table
are 100 bytes long on average.
The REGION_ID column of the CUSTOMERS table references the REGIONS table, which
contains all the geographical regions of the country. There are approximately 50 regions.
This table has a primary key REGION_ID indexed by PK_REGIONS.
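For concreteness, here is one way this schema might be declared in T-SQL. The column types and the non-key columns are assumptions, since the article only specifies the names, the indexes and the average record size:

CREATE TABLE REGIONS (
    REGION_ID  INT         NOT NULL CONSTRAINT PK_REGIONS PRIMARY KEY,
    NAME       VARCHAR(50) NOT NULL
)

CREATE TABLE CUSTOMERS (
    CUSTOMER_ID INT         NOT NULL CONSTRAINT PK_CUSTOMERS PRIMARY KEY,
    LAST_NAME   VARCHAR(40) NOT NULL,
    REGION_ID   INT         NOT NULL REFERENCES REGIONS (REGION_ID)
    -- ...plus other customer columns, for an average record size of about 100 bytes
)

CREATE INDEX IX_CUSTOMERS_LAST_NAME ON CUSTOMERS (LAST_NAME)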
The component of the database server that is responsible for computing the optimal
execution plan is called the optimizer. The optimizer bases its decision on its knowledge
of the database content.
How to inspect an execution plan
If you are using Microsoft SQL Server 2000, you can use the Query Analyzer to see which
execution plan is chosen by the optimizer. Simply type an SQL statement in the Query
window and press Ctrl+L. The plan is displayed graphically.
As an alternative, you can get a text representation. This is especially useful if you have
to print the execution plan. From a Command Prompt, open the isql program (type isql
-? to display the possible command line parameters), then proceed as follows:
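A minimal session looks like this (assuming a trusted connection to a database containing our sample tables; SET SHOWPLAN_TEXT ON tells the server to return the execution plan of each subsequent statement instead of executing it):

C:\> isql -E
1> SET SHOWPLAN_TEXT ON
2> GO
1> SELECT * FROM CUSTOMERS WHERE LAST_NAME = 'Smith'
2> GO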
The main disk allocation unit of database engines is called a page. The size of a page is
typically a few kilobytes. A page usually contains between dozens and hundreds of
records. This is important to remember: sometimes a query may look optimal in terms
of record accesses while it is not in terms of page accesses.
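For instance, assuming the 8 KB page size of SQL Server 2000, about 80 of our 100-byte customer records fit in a page, so the CUSTOMERS table occupies in the order of:

PAGES(CUSTOMERS) ≈ RECORDS(CUSTOMERS) / 80 ≈ 1,000,000 / 80 = 12,500 pages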
Say we are looking for a few records in a single table -- for instance we are
looking for the customers whose last name is @LastName.
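The rest of the article refers to this query as sql1; its listing is presumably something like:

SELECT *
FROM CUSTOMERS
WHERE LAST_NAME = @LastName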
The first strategy is to read the records from the table of customers and select the ones
fulfilling the condition LAST_NAME = @LastName. Since the records are not
sorted, we have to read absolutely all of them, from the beginning to the end of
the table. This operation is called a full table scan. It has linear complexity,
which means that the execution time is proportional to the number of rows in the
table. If it takes 500 ms to look for a record in a table of 1,000 records, it may take
8 minutes in a table of one million records and over 5 days in a table of one billion
records...
To compute the cost of sql1, we set up a table of primitive operations. For each
operation, we specify the cost of one occurrence and the number of occurrences.
The total cost of the query is then the sum of the products of each operation's unit
cost and its number of repetitions.
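In the notation used for the other formulas of this article:

COST(sql1) = SUM over all operations of UNIT_COST(operation) × OCCURRENCES(operation)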
Let's take a metaphor: a full table scan is like finding all occurrences of a word in
a novel.
Now what if the book is not a novel but a technical manual with an exhaustive
index at the end? For sure, the search would be much faster. But what exactly is
an index?
• An index is a collection of pairs of key and location. The key is the word
we are looking for. In the case of a book, the location is the page
number. In the case of a database, it is the physical row identifier. Looking
for a record in a table by physical row identifier has constant complexity,
that is, it does not depend on the number of rows in the table.
• Keys are sorted, so we don't have to read all keys to find the right one.
Indeed, searching in an index has logarithmic complexity. If looking for
a record in an index of 1,000 records takes 100 ms, it may take 200 ms in
an index of a million rows and 300 ms in an index of a billion rows.
(Here I'm talking about B-tree indexes. There are other types of indexes,
but they are less relevant for application development.)
If we are looking for customers by name, we can perform the following physical
operations:
• search the index IX_CUSTOMERS_LAST_NAME for the key @LastName;
• read all consecutive index entries having this key (this is called an index range scan);
• for each entry, fetch the corresponding record from CUSTOMERS by its physical row identifier.
The detailed cost analysis of sql1 using an index range scan is the following.
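In terms of orders of magnitude, the scenario's statistics are enough to work it out. A given last name matches, on average, RECORDS(CUSTOMERS) / KEYS(IX_CUSTOMERS_LAST_NAME) = 1,000,000 / 100,000 = 10 customers, so the plan performs roughly:

1 search in the index (logarithmic, hence negligible)
+ 10 consecutive index entries read
+ 10 record fetches by physical row identifier
≈ some 20 operations, against roughly 12,500 page reads for the full table scan
(under the 8 KB page assumption made earlier).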
The bad news is that the query complexity is still linear: statistically, the number
of matching customers grows with the size of the table, so the query time is still
proportional to the table size. The good news is that we cannot really do better: the
complexity of a query cannot be smaller than the size of its result set.
In the next sections of this article, we will accept a simplification: we will assume
that an index look-up has unit cost. This estimation is not so rough, because a
logarithmic cost can always be neglected when it is added to a linear cost. The
simplification is not valid, however, when it is multiplied by another cost.
Index selectivity
Comparing the cost of the full table scan approach and the index range scan
approach introduces us to a crucial concept in database tuning. The conclusion of
the previous section is that the index range scan approach will be faster if, in
terms of order of magnitude, the cost of fetching the matching records is smaller
than the cost of reading all the pages of the table.
The probability that a customer has a given name is simply the number of customers
having this name divided by the total number of customers. Let
KEYS(IX_CUSTOMERS_LAST_NAME) denote the number of unique keys in the index
IX_CUSTOMERS_LAST_NAME. The number of customers named @LastName is
statistically RECORDS(CUSTOMERS) / KEYS(IX_CUSTOMERS_LAST_NAME). The
condition above therefore reads:
RECORDS(CUSTOMERS) / KEYS(IX_CUSTOMERS_LAST_NAME) < PAGES(CUSTOMERS)
That is, an index is adequate if the number of records per unique key is
smaller than the number of pages of the table.
The inverse of the left-hand side of the previous condition is called the selectivity
of an index:
SELECTIVITY(IX_CUSTOMERS_LAST_NAME) =
KEYS(IX_CUSTOMERS_LAST_NAME) / RECORDS(CUSTOMERS)
The selectivity of a unique index is always 1. The more selective an index is (the
larger its selectivity coefficient), the more efficient it is. Corollary: indexes
with poor selectivity can be counter-productive.
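You can estimate the selectivity of a candidate index directly from the data. A sketch in T-SQL, using our scenario's names (the multiplication by 1.0 merely avoids integer division):

SELECT COUNT(DISTINCT LAST_NAME) * 1.0 / COUNT(*) AS selectivity
FROM CUSTOMERS
-- with the scenario's statistics: 100,000 / 1,000,000 = 0.1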
Suppose we want to display the name of the region beside the name of each customer:
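The query is along these lines (the NAME columns are assumptions, since the article only names the key columns):

SELECT C.LAST_NAME, R.NAME
FROM CUSTOMERS C
JOIN REGIONS R ON R.REGION_ID = C.REGION_ID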
Among the possible strategies, I will present in this article the most natural one: choose a
table, read it from beginning to end and, for each record, search for the
corresponding record in the second table. The first table is called the outer table or
leading table, and the second one the inner table. The dilemma is, of course, deciding
which table should be leading.
So let's first try to start with the table of regions. We learnt before that an index on
CUSTOMERS.REGION_ID would have too low a selectivity to be efficient, so our first
candidate execution plan is to read the table of regions and, for each region, perform a
full table scan of CUSTOMERS.
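With the page count estimated earlier (about 12,500 pages for CUSTOMERS under the 8 KB page assumption), the order of magnitude of this plan is:

COST ≈ 1 page read for REGIONS
     + 50 × 12,500 page reads for the repeated full scans of CUSTOMERS
     ≈ 625,000 page reads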
Now what if we did the opposite? Since the table of regions is so small that it fits in a single
page, it is useless to have an index, so we choose again two nested full table scans.
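Counting every page access naively, under the same page-size assumption:

COST ≈ 12,500 page reads for the single scan of CUSTOMERS
     + 1,000,000 × 1 page access for the look-ups in REGIONS
     ≈ 1,012,500 page accesses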
Since we are interested only in minimizing disk access, we can consider that the cost of
reading a page from memory is zero.
The REGIONS table and its primary key can both be stored in cache memory. It follows
that the cost matrix can be rewritten as follows:
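Under this convention, only the first physical read of the REGIONS page counts, and the comparison becomes (still assuming about 12,500 pages for CUSTOMERS):

Leading with REGIONS:   1 + 50 × 12,500 ≈ 625,000 physical page reads
Leading with CUSTOMERS: 12,500 + 1      ≈ 12,501 physical page reads

Leading with the large table is therefore cheaper by a factor of about fifty.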
Summary
Let's summarize what I have tried to introduce in this article: