A data management head-scratcher: CHAIN taking 30+ seconds -- MIDRANGE-L

Greetings all--

A customer has documented long response times--over 30 seconds--when doing
a read-only CHAIN to an indexed (one key, 36 characters) file, "XXX", with
*IMMED maintenance and *AFTER journaling. The program waits and the
transaction is abandoned by the requestor. *ALL commitment control is not
active. No other apps use this file. The problem is chronic and
intermittent. The system is well-resourced with CPU, memory, and disk.

The program in question--ZED--is a web service acting as a wrapper for the
business app (providing price quotes) and about 50 instances are running
concurrently. Tens of thousands of transactions per day flow through this
app. File XXX has about 5 million records and about 200,000 deleted
records. The job log shows SQL7917, "Access plan not updated", when the
program initiates.

Here's the scenario:

1. At IPL, 50 instances of program ZED are initiated.
2. Each instance opens file XXX as a shared input file. Important fact:
at this point, XXX is not available for “exclusive use”.
3. An instance of ZED leaves its wait state and begins execution.
4. The program executes to the CHAIN to XXX.
5. The database engine—the query optimizer, to be specific—recognizes a
need to rebuild the access path for XXX. We are sure the file is XXX but
SQL7917 doesn't have any details.
6. The query optimizer updates the access path in memory but—and here’s
where there’s trouble in River City—it can’t update the access path
object. According to the second-level text on the error message, this
update requires exclusive access to XXX.
7. Why no exclusive access? Because XXX is open the other 49 copies of
ZED. It's cousin to the database "deadly embrace" design flaw.

My hypothesis is this: the query optimizer can't get exclusive use of the
file and can't update the file object with the updated (meaning the 200K
deleted entries are dropped) access path. The resulting wait for the CHAIN
to complete is caused by the database reading through those 200K+ deleted
entries in XXX's index. Because the first instance of ZED ties up the
file, the other 49 instances have the same issue and the access path
remains dirty.

Having deleted entries in the access path doesn't make sense; my
understanding is that the system rebalances the tree dynamically. But the
purge program deleting unneeded records runs while ZED is active and, once
again, no exclusive use of XXX is possible. So, all 50 instances of ZED
are stuck with a dirty index.

My suggestions to the customer:

1. Use SQL instead of a CHAIN (not sure this will work)
2. Create an LF over XXX and use it in ZED.
3. Change the schedule for XXX's purging to be part of the monthly IPL,
when the system is down and we can RGZPFM the file; I'm not sure if a
reorg-while-active will work.
4. Open and close the file within the RPG program; this may provide the
window for the first executing instance to let the access path update
complete.
5. Spawn one instance, wait a minute (or longer, thereby allowing this
first instance to do let the access path update complete), then spawn the
other 49.

What have I missed?

Thank you for any insight/feedback/instruction!

--reeve