On 14-Mar-2015 08:59 -0500, Darryl Freinkel wrote:
> I have a complex SQL script that used to run in about 3 minutes and
> processed through 700k records.
> I made a bunch of changes and one of the changes has resulted in the
> run time going out to 12 hours.
There may be great value in obtaining the pre-change copy of the
script and running each version step-by-step to compare\contrast the
results and timings.
Quite common for such a radical change are multiple-file queries for
which a predicate between the files has [since the changes] gone
missing; that omission resulting in a cross-product. Another issue
[possibly in combination] is a change that resulted in the Classic
Query Engine (CQE) performing the query instead of the SQL Query
Engine (SQE); the classic optimizer was not designed originally with
the sole intention of implementing SQL queries, so the particular
query may lack an efficient implementation there. Likely resolutions
are to avoid whatever operations require a fallback to CQE [e.g.
shared-weight sort sequences for collation and selection] or [for
joins anyhow] to provide the query more, even seemingly redundant,
predicates [or statistics, if running SQE] to better limit selection.
A contrived example of the missing-predicate scenario follows the link.
<http://www.ibm.com/support/knowledgecenter/api/content/ssw_i5_54/rzajq/queryoptimize.htm>
_Query Engine Overview_
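A contrived sketch with hypothetical file and column names; the
missing join predicate turns what should be matched-row selection into
an every-row-with-every-row pairing:

  SELECT o.ordnum, l.qty
    FROM orders o, ordlines l     -- hypothetical 700K-row files
   WHERE o.status = 'A'
  -- missing:  AND o.ordnum = l.ordnum
  -- Without that predicate every ORDERS row pairs with every
  -- ORDLINES row [a cross-product], so the run time explodes.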
> I have all the advised indexes required by the system.
Advised indexes are those which the optimizer presumes would be
useful for the queries being run. Additional indexes on the data can
provide statistics about the data that may yield alternate preferences,
thus changing what had previously been advised; i.e. having only\all
the advised indexes does not imply that the ideal keyed access paths
exist for implementing the queries. Adding RI and Check Constraints
provides even more information than just statistics; known logical
constraints on the data can narrow the focus to specific data rather
than all indexed data.
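A minimal sketch [hypothetical names] of adding an index plus a check
constraint, the latter giving the optimizer a known logical constraint
rather than just statistics:

  CREATE INDEX mylib.orders_ix2
    ON mylib.orders (custno, status);
  ALTER TABLE mylib.orders
    ADD CONSTRAINT orders_status_ck
    CHECK (status IN ('A', 'C', 'X'));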
> I am trying to monitor system usage and found the following:
> - CPU utilization is constantly around 40% (2 hours)
> - Disk IO across all 10 drives is 1% or less (1 hour)
> On the face of it, it appears the system is idle whilst it is
> processing the data.
Unlikely /idle/, more likely just not visible at the job level. The
Work With System Activity (WRKSYSACT) command shows LIC tasks that are
not visible via the Work With Active Jobs (WRKACTJOB) interface. Unless
all of that 40% was known to be taken by work other than the
long-running query, a noticeable portion of it was probably the query.
Watching the job status for status transitions [in WRKACTJOB] can be
easy and more telling than much else; if indeed there are any /wait/
conditions, e.g. something like DEQW, LCKW, xxxW, then those are
visible there. If the job is not changing status, and for example
remains continually in RUN status, then either work is progressing, or
any waits are not those exposed outside of the LIC; the long-running
database operations are typically exposed in the job status as IDX for
building an access path or as RUN for selecting and reading data
records. If the program stack appears to be unchanging, then a LIC dump
of the process will show the effective stack of the LIC processing.
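A minimal sketch of that monitoring [the job name is hypothetical]:

  WRKSYSACT                    /* LIC tasks and jobs, by CPU use  */
  WRKACTJOB JOB(MYQRYJOB)      /* watch the Status column: RUN,   */
                               /* IDX, DEQW, LCKW, etc.           */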
> Does anyone have any advice on how to track this down?
My first pass is usually just to look at the queries to decide which
are the most likely to be difficult to optimize and\or complete, and
then run and investigate those separately. However, running the script
with the query timeout option [the query governor] set, and ensuring
inquiry messages are prompted, lets the optimizer reveal before each
query runs which ones are presumed to take longer than a specified
limit; a limit of zero effectively forces the optimizer to reveal its
costing presumptions for every request, irrespective of any actual
limit. See Change Query Attributes (CHGQRYA) for the Query Time Limit
(QRYTIMLMT) parameter; the value zero stops every request and reports
the estimated cost before a reply allows continuing. That feature also
enables a similar limit for query temporary storage requirements, via
the Query Storage Limit (QRYSTGLMT) parameter. See the following, along
with the related and parent links at the bottom of the page that refer
to what also might be of interest, plus a small example after the link:
<http://www.ibm.com/support/knowledgecenter/api/content/ssw_i5_54/rzajq/testperfwithqrygov.htm>
_Testing performance with the query governor_
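A minimal sketch of setting the governor for the job:

  CHGQRYA QRYTIMLMT(0)      /* stop and report the estimate for  */
                            /* every query before it runs        */
  CHGQRYA QRYTIMLMT(600)    /* stop only queries estimated to    */
                            /* run longer than 600 seconds       */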
If not separating each query with the forced timeout inquiry, I might
insert a CALL to a routine that logs the current timestamp along with
something to identify which statement is being run; if only to get
timings for each, reviewed just once, or compared across separate runs
to see any [lack of] consistency. Because I prefer to see the job
traces, I might also surround each request with requests to start and
end a trace, and later print the trace information for the request(s)
that are suspect as poor performers.
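A minimal sketch of such logging, with a hypothetical log table,
inserted between the statements of the script:

  CREATE TABLE mylib.qrylog
    (stmt_id VARCHAR(20), ts TIMESTAMP);

  INSERT INTO mylib.qrylog VALUES ('STEP010_BEG', CURRENT TIMESTAMP);
  -- ... original statement 010 of the script ...
  INSERT INTO mylib.qrylog VALUES ('STEP010_END', CURRENT TIMESTAMP);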
With the old script having the same pre\post processing inserted
around the statements to effect the logging as was included in the
updated script, the timing differences on a per-statement basis are
easier to fathom. Other tooling like the DB Monitor should assist in
the same way, but I generally prefer my own logging. For example, one
script might have combined two queries into one; with my own logging I
may be able to logically group the /same/ work more easily than can be
done with other tooling. If the script is really long vs just a few
queries, I might run the old and new concurrently side-by-side instead
of using any logging.
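The per-statement elapsed times could then be computed from that
hypothetical log table with something like:

  SELECT b.stmt_id,
         TIMESTAMPDIFF(2, CHAR(e.ts - b.ts)) AS elapsed_seconds
    FROM mylib.qrylog b
    JOIN mylib.qrylog e
      ON e.stmt_id = REPLACE(b.stmt_id, '_BEG', '_END')
   WHERE RIGHT(b.stmt_id, 4) = '_BEG'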
> We are on V5R4 so no IBM support?
AFaIK individual support contracts are available to be purchased,
even for a release for which general support has since ended. There
are also non-IBM service providers that presumably could similarly
assist.
> Do you know how I can use the database monitor to analyze what is
> going on?
There are some /canned queries/ documented that can be run over the
DBMON output to make certain inquiries about a specific query or a
batch of queries. I have seen some published on the web, perhaps in
articles at one of the news sites; some may be included in the iNav
database feature, I do not recall. As I recall there was a simple way
to query which statements were run via CQE vs SQE; a sketch follows
the documents below. A really old document that has some information
that may assist, followed by links to some progressively newer
documents [which may have /related topics/ links included]:
<http://www.redbooks.ibm.com/redpapers/pdfs/redp0502.pdf> ??-???-2000
Using AS/400 Database Monitor and Visual Explain To Identify and Tune
SQL Queries
<http://www.redbooks.ibm.com/abstracts/sg246092.html> 16-Sep-2004
DB2 Universal Database for iSeries Administration: The Graphical Way
on V5R3
<http://www.redbooks.ibm.com/abstracts/sg246654.html> 26-May-2006
SQL Performance Diagnosis on IBM DB2 Universal Database for iSeries
<http://www.redbooks.ibm.com/abstracts/tips0624.html> 08-Aug-2006 upd:
12-Aug-2008
Index Advisor Surfaces Maintained Temporary Indexes (MTI) Activity on
DB2 for i5/OS
<http://www.redbooks.ibm.com/abstracts/sg246598.html> 27-Sep-2006
Preparing for and Tuning the SQL Query Engine on DB2 for i5/OS
<http://www.redbooks.ibm.com/abstracts/sg247326.html> 12-Mar-2007
OnDemand SQL Performance Analysis Simplified on DB2 for i5/OS in V5R4
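A minimal sketch, assuming my recollection of the monitor file layout
is correct; the 3014 record supposedly carries a column [QQC16, as I
recall] reflecting whether SQE ('Y') or CQE ('N') ran the statement:

  STRDBMON OUTFILE(MYLIB/DBMON) JOB(*) TYPE(*DETAIL)
    /* run the script, then ENDDBMON JOB(*); afterward: */

  SELECT qqc16, COUNT(*) AS nbr_stmts
    FROM mylib.dbmon
   WHERE qqrid = 3014
   GROUP BY qqc16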
> Why can I not see where the processing is? It's almost like there is
> a wait happening?
The easy answer is "the lack of knowing [best] where to look".
Hopefully some of the above will assist. Having worked on DB2 for i, I
tend to rely much more on what I know should happen, without much
reliance on the various feedback mechanisms beyond debug messages and
maybe VE, paired with my knowledge of both the actual data and the DDL
that defined the files, the relations [including RI], and the
indexes... plus consideration for anything extra like read triggers,
fieldprocs, or UDFs that might inhibit quick access.