re: SQL query optimization -- RPG400-L

hey guys, it seems someone at iseries News mag is listening in....

this just in from the Club Tech iSeries Programming Tips Newsletter:

MYSTERY OF THE QUERY OPTIMIZER AND AN EXISTING INDEX
Q: I have a query that sometimes uses an existing index and at other times
builds an access path. I'm querying a table of 50 million records, so you
can well imagine that the query that builds an access path takes a LONG
time. The query that uses an existing index is very fast, on the other
hand.

For example, consider the following query:

  Select   *
  From     MyTable
  Where    SalesDate >= 20010701 and
           SalesDate <= 20010731 and
           SalesAmt  >= 1000.00  and
           SalesAmt  <= 2000.00
  Order by SalesDate

This query selects sales history records where the amount of the sale was
between $1,000 and $2,000 (inclusive) and the sale occurred in the month of
July in 2001. The query is presented in date order. This query uses an
index
comprised of SalesDate, SalesRep, and SalesAmt and runs very fast.

Now, consider the following query:

  Select   *
  From     MyTable
  Where    SalesDate >= 20010701 and
           SalesDate <= 20010930 and
           SalesAmt  >= 1000.00  and
           SalesAmt  <= 2000.00
  Order by SalesDate

This query is identical to the first except that the span of time is a
three-month period beginning with July 2001.

Why does the first query use the existing index, yet the second one
doesn't?

A: As you probably know, the iSeries uses a query optimizer whose goal is
to
execute queries in the timeliest fashion. Your mystery is most likely
explained by the key range estimation feature of the query optimizer.

The optimizer examines the range of values specified for key fields. Based
on your database characteristics and the specific query request, the
optimizer predicts whether it is more efficient to use the index or to use
some other method. For example, the optimizer could choose to perform a
table scan (read the entire table) or to build an access path.

The optimizer is deciding that the range of possible key values in your
second query is large enough that it is more efficient to build an access
path (based on the specifics of the query) than it is to read through the
index and then access the dataspace that the selected index entries
reference. Your query specifies that all fields should be returned
(Select *). This influences the optimizer's decision to not use the
existing
index. Because all fields are to be retrieved, the optimizer knows that
whether or not the system uses the existing index, the system must access
the dataspace to retrieve those fields that are not included in the index's
key fields. Given the range of key values and the fact that the dataspace
must be accessed, the optimizer chooses not to use the existing index.

Depending on your requirements, you may be able to force the query to use
the existing access path, however. Consider the following query:

  Select   SalesDate, SalesAmt
  From     MyTable
  Where    SalesDate >= 20010701 and
           SalesDate <= 20010930 and
           SalesAmt  >= 1000.00  and
           SalesAmt  <= 2000.00
  Order by SalesDate

When you run this query, you'll likely find that though the range of dates
is identical to the slow-running query that foregoes use of the existing
index, this query uses the index and completes very quickly. Because this
query selects only fields that exist in the list of key fields specified by
the index, the system can retrieve all required data from the index without
the need to access the dataspace. Hence, the optimizer chooses to use the
index and all is well.

If you need fields that are not included in the list of key fields and
you're running the query as part of an application (rather than interactive
SQL), you can try a technique that MIGHT prove beneficial. Add a field to
the end of your list of key fields that will uniquely identify each record
(if necessary, add this unique identifier to the table). Then, issue the
following query from your application:

  Select   MyUniqueIDField,
  From     MyTable
  Where    SalesDate >= 20010701 and
           SalesDate <= 20010930 and
           SalesAmt  >= 1000.00  and
           SalesAmt  <= 2000.00
  Order by SalesDate

Your application can then cycle through the returned unique identifiers
(MyUniqueIDField) and perform a random I/O operation (such as RPG's Chain)
to retrieve all necessary fields. I realize this is a two-phase approach,
but the overall time to obtain the information you need might be less than
that required to build an access path and extract the data.