Re: Domino 10.0.1 experiences: Server Restart Notification -- DOMINO400

If you haven't already, set Log_AgentManager=1 in notes.ini so that the
server records the start and stop times of each agent. That way you know
exactly which agent was running when the server panicked. You may even need
to add some print statements to the agent to see how far it gets in its
code execution. If it hangs at the same place each time a crash happens,
then something in the agent may need to be adjusted. It could be a memory
leak in the agent or in the amgr task. Either way, agents can definitely
cause servers to crash.
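
Once the agent manager logging is on, something like the rough sketch below
could pull out the agents that logged a start but never logged a completion,
which are the candidates for "running at the time of the crash". The two
message patterns are my assumption of what the AMgr lines look like in an
exported log, and the agent and database names in the sample are made up, so
adjust both to match your own log before trusting the result.

```python
import re

# ASSUMPTION: these patterns describe how AMgr start/completion lines appear
# in a text export of the server log. Check them against your own log.
START = re.compile(r"AMgr: Start executing agent '(?P<agent>[^']+)' in '(?P<db>[^']+)'")
DONE = re.compile(r"AMgr: Agent '(?P<agent>[^']+)' in '(?P<db>[^']+)' completed execution")


def unfinished_agents(lines):
    """Return (agent, database) pairs that started but never logged completion."""
    running = []
    for line in lines:
        match = START.search(line)
        if match:
            running.append((match.group("agent"), match.group("db")))
            continue
        match = DONE.search(line)
        if match:
            key = (match.group("agent"), match.group("db"))
            # Remove one matching start; ignore completions with no recorded start.
            if key in running:
                running.remove(key)
    return running


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "02/27/2019 12:40:01 AMgr: Start executing agent 'ErpImport' in 'erp.nsf'",
        "02/27/2019 12:40:05 AMgr: Agent 'ErpImport' in 'erp.nsf' completed execution",
        "02/27/2019 12:42:30 AMgr: Start executing agent 'ErpImport' in 'erp.nsf'",
    ]
    print(unfinished_agents(sample))
```

Feed it only the log lines from before the fault recovery timestamp. Anything
it returns started and did not finish before the server went down.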

Thanks,
Chris

On Thu, Feb 28, 2019 at 9:02 AM Rob Berendt <rob@xxxxxxxxx> wrote:

HCL:
<snip>
...
After reviewing the log files that you have submitted, the least we can do
for now is to monitor your server if another crash or fault recovery will
happen. Based on the call stack captured on the NSD (as seen below), the
process that caused the server to crash is an Agent Manager.
...
Unfortunately, the database and the specific agent that caused the crash
was not captured in any of the log files uploaded. This is why we need to
monitor the server if another crash will happen and check if same call
stack will be captured and if we will be able to determine the Agent and
database affected.
May I confirm with you if you are actually running an agent for this
server? May I know what agent is that, so we can check further?
</snip>

Again, blame the server restart on a "divide by zero" error... :-(
It's like this whole product is cobbled together from tissue paper and
spit.

-----Original Message-----
From: Domino400 <domino400-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Rob
Berendt
Sent: Thursday, February 28, 2019 8:41 AM
To: Lotus Domino on the IBM i (AS/400 and iSeries) <
domino400@xxxxxxxxxxxxxxxxxx>
Subject: RE: Domino 10.0.1 experiences: Server Restart Notification

You know that field in the server document, "Mail Fault Notification
to:"? This is what it is used for:
Fault Recovery Notification: Server QUALITY3/DEKKO was restarted after a
fault on 02/27/2019 12:42:37

Hopefully they will figure out why this 10.0.1 server faulted on its
own. I have so many of these on my 9.0.1FP10 servers that the tickets drag
on for months. So it doesn't initially make me paranoid about 10.0.1. I
was kind of hoping they'd go away though.

HCL tends to blame our agent code. Which, to me, makes about as much
sense as blaming the IPL of an lpar of IBM i because some RPG programmer
had a divide by zero error.
HCL: The nsd shows the agent was running at the time of the system fault.
Me: The agent runs a bazillion times a day doing transactions from our
ERP into Domino. So I think it's just a coincidence. It didn't fault the
system the other gazillion times it ran.

--
This is the Lotus Domino on the IBM i (AS/400 and iSeries) (Domino400)
mailing list
To post a message email: Domino400@xxxxxxxxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: https://lists.midrange.com/mailman/listinfo/domino400
or email: Domino400-request@xxxxxxxxxxxxxxxxxx
Before posting, please take a moment to review the archives
at https://archive.midrange.com/domino400.