Re: System slow-down - disk usage? -- MIDRANGE-L

Paul

Besides Hi!!

I believe a CHAIN is more expensive than a SETLL - the latter only accesses the index (the LF), whereas the former actually reads both the index and the PF.

They should be using SETLL with a key list, then check the %equal() BIF - this is the most efficient way to check for existence of a record, methinks.

Vern

On 6/15/2010 12:16 PM, Paul Nelson wrote:

Is the previous file keyed, or can it be? I'd try using a chain to get the
exact record instead of reading the whole file.

Paul Nelson
Office 512-392-2577
Cell 708-670-6978
nelsonp@xxxxxxxxxxxxx

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Tuesday, June 15, 2010 12:02 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

Thought I'd bounce a couple things out there to see what people think.

We've identified the slow-down cause, and it has to do with us checking
every record we process against a file of records that we've previously
processed. If that "previous" file starts off empty, the program starts off
screaming along. However, 20 million records later, when the previous file
is at 20 million records, the program has slowed down considerably. (Just
using 20 million as an example, it's really a gradual decline in
performance.)

We check this Previous file by doing a SETLL on it. If we get a hit, we
kick out the record we're processing, otherwise we write the record to the
Previous file and continue along.
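
The check described above can be sketched outside RPG as follows. This is only an in-memory analogy, not how the database handles it: a sorted Python list stands in for the keyed access path, `bisect_left` plays the role of SETLL (position at the first key greater than or equal to the search key), and the equality test plays the role of %EQUAL. The function name `process` and the sample keys are made up for illustration.

```python
import bisect


def process(records, previous_keys):
    """Split incoming keys into (kept, rejected).

    previous_keys is a sorted list standing in for the keyed "Previous" file.
    """
    kept, rejected = [], []
    for key in records:
        # Position at the first key >= the search key (the SETLL step).
        pos = bisect.bisect_left(previous_keys, key)
        # Exact match at that position? (the %EQUAL step)
        found = pos < len(previous_keys) and previous_keys[pos] == key
        if found:
            # Already processed: kick the record out.
            rejected.append(key)
        else:
            # Not seen before: add it to the "Previous" file and carry on.
            previous_keys.insert(pos, key)
            kept.append(key)
    return kept, rejected


previous = []
kept, rejected = process(["A1", "B2", "A1", "C3", "B2"], previous)
print(kept)      # ['A1', 'B2', 'C3']
print(rejected)  # ['A1', 'B2']
```

Note that only the key is examined here; no record data is fetched, which is the point of using SETLL rather than a read for an existence test.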

I also noticed that when the program is running with a large Previous file,
that the page/fault ratio closes in on 3:1. Is this because the SetLL is
moving through so much data? (No other batch jobs were running, and we have
very little interactive use.)

We are looking at getting more main memory (as well as aux storage), but I'm
trying to think outside the box... is there a way for me to perform this
previous check without flooding the memory?

Thanks,

-Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Monday, June 14, 2010 8:41 AM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

Actually, that wasn't it. Although it was worth discovering anyway.

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Friday, June 11, 2010 9:44 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

Looks like I found the issue. The output file which receives a high volume
of records in our "problem" job had a field defined as VARLEN(16), but 90%
of the calls had a value that exceeded 16 bytes. This was a new field for
this client and the size was determined based on all other clients (before
we saw the new client's data). Once I discovered that, I upped the VARLEN
to 25 (based on new client's data) and suddenly the job screams in speed. I
knew a misuse of VARLEN would cause issues, but didn't realize it would be
so bad.
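
For anyone wanting to sanity-check an allocated length before settling on it, here is a rough sketch of the measurement involved: given a sample of field values, what share is longer than the candidate VARLEN allocation? It assumes single-byte characters (so character count equals byte count), and the sample data is invented to mirror the 90% figure above; the function name `share_over_allocation` is hypothetical.

```python
def share_over_allocation(values, allocated_len):
    """Return the fraction of values longer than the allocated length.

    Assumes one byte per character.
    """
    if not values:
        return 0.0
    over = sum(1 for v in values if len(v) > allocated_len)
    return over / len(values)


# Invented sample: ten values with these lengths.
lengths = [12, 18, 20, 22, 25, 17, 19, 21, 23, 24]
sample = ["x" * n for n in lengths]

print(share_over_allocation(sample, 16))  # 0.9
print(share_over_allocation(sample, 25))  # 0.0
```

With this sample, an allocation of 16 leaves 9 of 10 values over the limit, while 25 covers all of them.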

Now, I had mentioned that a completely unrelated job took 24 hours to run
(4-5x normal). That still boggles me because that job was not exposed to
any of the changes made - which is what made me think there was a system
issue.

We did check the Cache Battery, which at least exposed us to the fact that
we are 100 days until a warning, so I really appreciate that tidbit. I've
also been exposed to some system monitoring tools.

Thank you everyone for responding with ideas and things to check into.

-Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Thursday, June 10, 2010 2:57 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

I knew I forgot something.

Main Storage: 3885.01MB

QPFRADJ = 2

In WrkDskSts I see Active for all units.

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of DrFranken
Sent: Thursday, June 10, 2010 12:09 PM
To: midrange-l@xxxxxxxxxxxx
Subject: Re: System slow-down - disk usage?

OK Cleaning up never hurts and often helps so that much is good.

Your disk configuration is a bit odd in that the first two units are
mirrored to each other while the rest are RAID protected. Not
unsupported or anything like that but you do lose about 35GB of
available storage this way and the RAID is on only 4 drives rather than
across 8 drives (all drives ARE still protected). In any case, while 'odd',
it shouldn't be the source of your problem.

I forgot to mention that when looking at %busy on the drives or at the
paging/faulting ratios you need to wait until the elapsed time at the
top is about 5 minutes. Much shorter than that and you get 'spikey' data
that isn't real valuable. Much longer than that and all your data
averages into useless mush. It's not bad to know that your average
%busy for the entire morning was say 12% but it doesn't help you
troubleshoot much.

On the WRKDSKSTS screen after you pressed F11 did you see 'Degraded' or
'Unprotected'? That would indicate battery failure/drive failure.
'Active' is what you want to see.

One disk is one arm so you're correct there. How much memory is in the
machine? (Easy find is at top of WRKSHRPOOL screen.)

Definitely watch the paging/faulting and %BUSY numbers while the long
slow job is running.

Also what is your system value QPFRADJ set at?

- DrFranken

After I sent out my earlier email, we buckled down and cleaned up a lot of
excess on the system, essentially gaining back 10% of the disk, which put us
back to where we started. We had another job running, although this time
with less to process, but it did take significantly longer than expected.

I found an article on the Cache Battery and have passed it along to my
boss.

http://www.itjungle.com/fhg/fhg050907-story03.html

I checked our paging to fault ratio, and it seems decent. Our "hog" jobs
aren't running right now, but I looked at wrksyssts enough while they were
running to recall that the ratio was around 1 fault per 50 pages.

Using WRKDSKSTS, we have 7 units. Is each unit considered an arm? (I
guess, 1 arm per disk? I'm a software guy doing his best to understand the
system side here.) 1 & 2 are Protection Type MRR, 3-7 are DPY. All are
Active. I'll have to start up some tests to take a look at the Busy %. At
the moment the Busy % is in the low teens or lower.

In regard to our system, it's a 520, 9405, P10.

Thanks for the help,
Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of DrFranken
Sent: Wednesday, June 09, 2010 11:45 PM
To: midrange-l@xxxxxxxxxxxx
Subject: Re: System slow-down - disk usage?

Absolutely check Richard's suggestion on the Cache Battery. If any of
them are dead this sort of performance WILL result during significant I/O.

Going from 60 to 70% DASD should not cause this dramatic slow down. It
may cause some small fraction but nothing like 4 times plus.

What is the %BUSY in WRKDSKSTS when the long running job is running? If
the disks are 40% or more busy then you likely need more arms, faster
arms, or bigger disk cache but even then that's only a 'probable'. Also
how many disk arms do you currently have? Are they DPY or MRR protected
(from WRKDSKSTS F11)?

You also need to check faulting as you mention. The big thing about
paging and faulting is the ratios. If the pool running the jobs is
paging at say 2500 but faulting at 25 then you're doing exceptionally
well as only 1 in 100 pages results in a fault. If you've got 500 faults
out of 500 pages then you likely have a memory pool that is far too
small or has too many jobs running in it. The reason you don't find
specific numbers is because 'It Depends'. If you have a 32 way 595 you
can have faulting numbers that would make a 520 user cry and not bat an
eye.
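
The ratio arithmetic above is simple enough to put in a few lines. This is just a sketch of the calculation using the two sets of figures quoted in this message; the function name `pages_per_fault` is made up, and it says nothing about what a "good" number is for any particular machine.

```python
def pages_per_fault(pages, faults):
    """Return how many pages are moved per fault (higher is better)."""
    if faults == 0:
        # No faults at all in the interval.
        return float("inf")
    return pages / faults


# Figures from the message above.
print(pages_per_fault(2500, 25))  # 100.0 -> 1 fault per 100 pages
print(pages_per_fault(500, 500))  # 1.0   -> every page is a fault
```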

What is your system CPW (or processor feature code) and how much memory
is installed?

- - DrFranken

On 6/9/2010 6:35 PM, Kurt Anderson wrote:

I'm on v5r4, and we've recently gotten a very large customer and have had
some speed issues. At first we thought they were specific to some certain
new programs, but today we discovered the issue was impacting another job
that was completely and absolutely isolated as far as programs go. So, we
were looking at things from a system point of view to see what changed to
cause this other job to slow down so much. Our guess - that our % system
ASP used went from ~60% to ~70%. Is it possible that that would cause us an
issue? (We had a job that would normally run 5 hours take almost 24
hours.)

We IPL'd over the weekend as well. Anyway, I realize this email is
probably lacking a lot of specific information, but I'm not really a systems
guy, and we're kind of grasping at straws, so I thought I'd see if such a
change to disk % used should have such a big impact?

I am looking into other performance improving methods, but at this time
we'd really like to pin down the cause of our performance crawl before
attempting to put in enhancements.

While I'm at it, I'm curious how to quantify "excessive paging." I've
seen reference to that phrase online, yet can't seem to find a number.

Thanks,

Kurt Anderson
Sr. Programmer/Analyst
CustomCall Data Systems