BlueGene Fixes 2008

  • Hardware
    
    09/19  R02-M1-N6-C:J05-U01  L3 major internal error: recovery failed, DJT8WDJ
    09/08  R02-M0-N7-C:J11-U01  CE sym 0, at 0x1089a540, mask 0x04 
    09/08  R00-M1-N7-I:J18-U01  Error: unable to mount filesystem, #DJT9VL8
    09/05  R01-M1-NE-I:J18-U11, R01-M0-NF-I:J18-U01     Error: unable to mount filesystem  #DJT9VL8
    09/01  R00-M1-NE-I:J18-U11, R01-M0-NA-I:J18-U01, R01-M0-NB-I:J19-U11 no ethernet link, cables unplugged 
    07/11  R00-M1-N9-I:J19-U11, R00-M1-NE-I:J18-U11, R01-M0-NA-I:J18-U01, R01-M0-NB-I:J19-U11 no ethernet link
           new cables to Hollborn
    05/23  R00-M0-NC-I:J18:U11, R00-M1-N6-I:J19:U11 no ethernet link, new cables to Victoria
    04/29  NODECARD: R01-M0-N9
    03/03  I/O node board down R01-M1-N6-I:J19-U01, I/O CARD REPLACEMENT, board replaced
    02/21  ethernet down R01-M0-NE-I:J19-U11, new cable to Victoria
    
    
  • PMR 51568 BG: CE sym 0 errors
    The user is seeing wrong results from his code on rack R02, while the very
    same code runs correctly on rack R01.
    
    9/8/08 11:15:20 AM  Informational  Kernel  R02  46303 R02-M0-N7-C:J11-U01  total of 11 ddr error(s) detected and corrected    over 12881 seconds                                                      
    9/8/08 11:15:20 AM  Informational  Kernel  R02  46303 R02-M0-N7-C:J11-U01  CE sym 0, at 0x1089a540, mask 0x04    
    
    9/9/08 10:38:28 AM  Informational  Kernel  R02-bot  46374  R02-M0-NC-C:J15-U11  critical input interrupt 
    (unit=0x0b bit=0x0a): warning for torus z+ wire, suppressing further interrupts of same type
    
    
    I suggest running full midplane diags on both midplanes on R02. Do not run linkcard diags.     
    Sep 11 13:26:15 2008 PDT forceCardsUnavailable: WARNING: Marking nodecard(s): R02-M0-NE in ERROR until next service action         
    Compute Node  dr_bitfail  9/11/08 1:38:15 PM 07441025612FFFFF0E081B9044E0  R02-M1-N8-C:J06-U11  Fatal hardware failure
    
    Action Taken: I have placed a hardware call, number DJT9LVR, to have node
    card R02-M0-NE and compute card R02-M1-N8-C:J06-U11 replaced on the
    customer's system.
                                         
    Action Taken: There was a node card in R02-M0 that was not being
                 discovered.  Eva helped us get onto the system; I looked
                 at the Service Card and saw that the linkcard did not have
                 the 1.5 volt power set, so this linkcard was not
                 initialized correctly.  I reset the Service Card and marked
                 the linkcards and nodecards for this midplane as missing
                 for the idochips.  After doing this we started Discovery0
                 back up, all of the hardware was found, and the midplane
                 looked good.
    
    
    9/12/08 12:59:21 PM  Fatal  Kernel  bot128-1  46526 R00-M0-N3-C:J03-U11  machine check interrupt (bit=0x06): L3 major internal error       
    
                 Eva then said that she had one other question in R00-M0.  We
                 did some checking and there was a compute card that took an
                 L3 Major error.  We helped her do an SA (service action) on
                 this midplane to get it back up, and she is aware that the
                 node card could go into an error status again if the L3
                 error happens again.
                                                                            
    After the hardware was replaced, EndServiceAction hangs:
    
    Sep 15 15:32:09.154 PDT:   All of R00-M0's NodeCards are active
    Sep 15 15:32:09.159 PDT: @ Still waiting for 1 compute Processor cards in R00-M0 to become active
    
    ===> Restart the discovery processes:
    bgsn:/discovery # ./Discovery0 stop
    bgsn:/discovery # ./PostDiscovery stop
    bgsn:/discovery # ./SystemController stop
    bgsn:/discovery # ./SystemController start
    bgsn:/discovery # ./Discovery0 start
    bgsn:/discovery # ./PostDiscovery start
    
    
    
  • PMR 39625 timeout
    
    
    {270}.0: Starting syslog services
    {270}.0: 
    Starting ciod
    Starting XL Compiler Environment for I/O node
    
    
    BusyBox v1.00 (2007.10.12-17:59+0000) Built-in shell (ash)
    Enter 'help' for a list of built-in commands.
    
    /bin/sh: can't access tty; job control turned off
    ~ # ciod: version "Oct 23 2007 19:23:06"
    ciod: running in virtual node mode with 16 processors
    
    Boot a rack:
    
    mmcs$ allocate_block R00
    OK
    mmcs$ boot_block
    OK
    mmcs$ select_block top64-2 
    OK
    mmcs$  boot_block
    FAIL
    timeout   ->> communication failure? 
    
    still fails after power cycle
    reseat switch card Network
    replace I/O cards
    
    su - bglsysdb
    bglsysdb@bgsn:~> db2 "connect to bgdb0"
    db2 "select location,ipaddress from bglnode where location='R00-M1-N7-I:J18-U01'" 
    
    LOCATION                         IPADDRESS                                                                                                                                                                                                                                                      
    R00-M1-N7-I:J18-U01              192.168.253.62                                                                                                                                                                                                                                                 
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location='R01-M0-NF-I:J18-U01'" 
    
    LOCATION                         IPADDRESS                                                                                                                                                                                                                                                      
    R01-M0-NF-I:J18-U01              192.168.253.202                                                                                                                                                                                                                                                
    R01-M0-NF-I:J18-U01              192.168.253.202                                                                                                                                                                                 
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location='R01-M1-NE-I:J18-U01'" 
    R01-M1-NE-I:J18-U01              192.168.253.183                                                                                                                                                                                                                                                
    R01-M1-NE-I:J18-U01              192.168.253.183                                                                                                                                                                                                                                                
    
      2 record(s) selected.
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location like '%I%'order by location" > 
    location.tx
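    
    With the addresses in hand, a quick reachability check from the service node
    (a sketch; the IPs are the ones returned by the queries above):
    
    ping -c 3 192.168.253.62        # R00-M1-N7-I:J18-U01
    ping -c 3 192.168.253.202       # R01-M0-NF-I:J18-U01
    ping -c 3 192.168.253.183       # R01-M1-NE-I:J18-U01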
    
    
    
    
  • PMR 39792 BG: midplane switch error:
    mmcs$ allocate_block R02
    OK
    mmcs$ boot_block
    FAIL
    midplane switch error: 
    
    Reset link cards:
    mmcs$ pgood_linkcards all
    OK
    
    
    
  • PMR 39625 BG: GPFS not mounting on I/O nodes
    
    mmcs$ allocate rack
    
    FAIL
    
    RAS event: KERNEL FATAL: /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come
    up on I/O node bgio163 : 192.168.253.8
    
    ssh bgio163     shows:
    /dev/gdg             18897027072 18826870784  70156288 100% /bggpfs
    /dev/gpfswan         658336110592 449413885952 208922224640  69% /gpfs-wan
    
    
    mmgetstate -a
    mmgetstat.bgio:     556      bgio163          arbitrating
    ...
    
    arbitrating: A node is trying to form a quorum with the other available nodes. 
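    
    A minimal sketch for getting the node out of "arbitrating" (assumes the usual
    GPFS admin commands are available where mmgetstate was run above; node name
    taken from the RAS event):
    
    mmgetstate -a -L          # node states plus quorum information
    mmstartup -N bgio163      # (re)start GPFS on the arbitrating I/O node
    mmgetstate -N bgio163     # confirm the node reaches "active"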
    
    
    
    
  • PrepareForService fails
    
    Jul 18 11:39:47.014 PDT: NcInfo - Exception occurred while building an
    IDo for NcInfo(NodeCard: Card location (R01-M0-N0), Type(4),
    SN(203231503833343000000000594c31324b35323330305039,
    LP(FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3), IP(10.0.0.179))
    java.io.IOException: Could not contact iDo with
    LP=FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3 and IP=/10.0.0.179 because
    java.lang.RuntimeException: Communication error: (DirectIDo for
    Uninitialized DirectIDo for
    FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3@/10.0.0.179:0 is in state =
    COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber
    = 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout =
    1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync
    Command = 10, Actual Sync Reply = -1)
    
    Check /bgl/BlueLight/logs/BGL/bgsn-mmcs_db_server-*.log for errors:
    
    [IBM][CLI Driver][DB2/LINUXPPC] SQL0968C  The file system is full.  SQLSTATE=57011
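    
    Since DB2 reports the filesystem full, a short sketch to confirm and make room
    before retrying (assumes the same /dbhome filesystem as in the dbhome-full PMR
    below; run the db2 commands as the instance owner):
    
    df -h /dbhome                      # confirm which filesystem is full
    db2 get dbm cfg | grep -i diag     # where DIAGPATH points (diag files land there)
    db2diag -A                         # archive the current db2diag.log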
    
    
    
  • PMR 19096 BG: Node Card R01-M1-NE Error ( allocate error)
    mmcs$ allocate R01
    FAIL
    connect: idoproxy communication failure: Invalid Licence plate
    
    
    mmcs$ allocate R01
    FAIL
    create_block: resources are unavailable - NODECARD: R01-M1-N0, NODECARD: R01-M1-N1, NODECARD: R01-M1-N2, 
    NODECARD: R01-M1-N3, NODECARD: R01-M1-N4, NODECARD: R01-M1-N5, NODECARD: R01-M1-N6, NODECARD: R01-M1-N7, 
    NODECARD: R01-M1-N8, NODECARD: R01-M1-N9, NODECARD: R01-M1-NA, NODECARD: R01-M1-NB, NODECARD: R01-M1-NC, 
    NODECARD: R01-M1-ND, NODECARD: R01-M1-NE, NODECARD: R01-M1-NF
    
    su - bglsysdb
    bglsysdb@bgsn:~> db2
    (c) Copyright IBM Corporation 1993,2002
    Command Line Processor for DB2 SDK 8.2.8
    
    db2 => connect to bgdb0
    
    
    db2 => select status, location from  tbglprocessorcard where location like 'R01-M1-NE-%' AND 
         status = 'A' order by location
    
    STATUS LOCATION                        
    ------ --------------------------------
    A      R01-M1-NE-C:J02                 
    A      R01-M1-NE-C:J03                 
    A      R01-M1-NE-C:J04                 
    A      R01-M1-NE-C:J05                 
    A      R01-M1-NE-C:J06                 
    A      R01-M1-NE-C:J07                 
    A      R01-M1-NE-C:J08                 
    A      R01-M1-NE-C:J09                 
    A      R01-M1-NE-C:J10                 
    A      R01-M1-NE-C:J11                 
    A      R01-M1-NE-C:J12                 
    A      R01-M1-NE-C:J13                 
    A      R01-M1-NE-C:J14                 
    A      R01-M1-NE-C:J15                 
    A      R01-M1-NE-C:J16                 
    A      R01-M1-NE-C:J17                 
    A      R01-M1-NE-I:J18                 
    A      R01-M1-NE-I:J19                 
    
      18 record(s) selected.
    
    reseat card
    
    restart:
    bglmaster restart
    stop/start Discovery0, SystemController, and PostDiscovery (same sequence as above)
    Power off rack R01 and leave it off for 15 min, then power it back up and
    let discovery find it all.
    
    
    check that the monitor is turned off:
    bgsn:/discovery # bglmaster status
    idoproxy            started [25699]
    ciodb               started [25700]
    mmcs_server         started [25701]
    monitor0            stopped
    perfmon             stopped
    
    In the mmcs log I saw lots of these messages:
    
    
    /bgl/BlueLight/logs/BGL/bgsn-mmcs_db_server-2008-0716-15:21:53.log
                                                                            
    WARNING: Node R01-M1-N9-C:J13-U01 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-NC-C:J14-U11 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-N5-C:J15-U11 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-N2-C:J15-U11 has less memory than other nodes on   
    the midplane!                                                       
        
    so I ran these queries.                                                 
    
    db2 "select memorymodulesize,serialnumber from TBGLNODEHWATTR where memorymodulesize <> 6"                                                  
    
                                                                            
    MEMORYMODULESIZE SERIALNUMBER                                           
    ---------------- ---------------------------------------------------    
                  14 x'00000000000000000000074624A6522FFFFF08101B611CDE'    
                  14 x'00000000000000000000076124A54B2FFFFF06091BC08EE6'    
                                                                            
      2 record(s) selected.                                                 
    
    db2 "select serialnumber,location,status from bglprocessorcard where serialnumber = x'00000000000000000000076124A54B2FFFFF06091BC08EE6'"     
                                                                            
    SERIALNUMBER                                        LOCATION         STATUS
    --------------------------------------------------- ---------------- ------
    x'00000000000000000000076124A54B2FFFFF06091BC08EE6' R01-M1-NE-C:J12  A
    The other serialnumber did not return a hit.
                                                                            
    
    
    
  • Partition not booting
    mmcs$ setusername sharikov
    OK
    mmcs$ redirect R02-bot on
    OK
    mmcs$ {207}.0: mount
    {234}.0: mount
    {234}.0: : Mounting /dev/gdg on /bggpfs failed: Stale NFS file handle
    
    
    on BGSN: mmshutdown
             mmstartup
    
    bgsn:/var/adm/ras # mount /bggpfs
    mount: Stale NFS file handle
    
    /var/adm/ras/mmfs.log.latest:
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0181gdg.
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0182gdg.
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0183gdg.
    Thu Jun 12 11:19:43 2008: File System gdg unmounted by the system with return code 19 reason code 0
    
    
    on tg-c008: mmlsnsd -m |grep not
     d0161gdg     C6CA703244A3004D   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0162gdg     C6CA703244A30065   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0163gdg     C6CA703244A3007B   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0181gdg     C6CA703444A3004F   -              tg-c018.sdsc.teragrid.org (not found) primary node
     d0182gdg     C6CA703444A30067   -              tg-c018.sdsc.teragrid.org (not found) primary node
     d0183gdg     C6CA703444A3007D   -              tg-c018.sdsc.teragrid.org (not found) primary node
    
    
    --> check GigE connection and restart mmfs on tg-c018
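    
    A sketch of the checks on tg-c018 (the interface name eth0 is an assumption):
    
    ssh tg-c018
    ethtool eth0                       # verify the GigE link is up
    mmshutdown                         # restart GPFS on this NSD server
    mmstartup
    mmgetstate                         # local node should come back "active"
    mmlsnsd -m | grep "not found"      # the d018*gdg NSDs should no longer show up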
    
    {45}.0: /var/etc/rc.d/rc3.d/S40gpfs: GPFS is ready on I/O node bgio222 :
    
    
  • Bluegene MDS Provider crashed
    
    Restart on bg-login1:
    
    kenneth@bg-login1:/etc/rc.d> ps -ef | grep mds
    kenneth  32087 29928  0 14:54 pts/24   00:00:00 grep mds
    
    kenneth@bg-login1:/etc/rc.d> sudo ./globus-mds-info start
    Starting Globus Container:                                           done
    
    kenneth@bg-login1:/etc/rc.d> ps -ef | grep mds
    globus   32104     1  0 14:54 pts/24   00:00:00 /usr/local/apps/globus-mds-info-4.0.5-r2/sbin/globus-start-container-detached -p 8446
    globus   32105 32104 75 14:54 ?        00:00:03 /usr/local/apps/IBMJava2-ppc64-142/bin/java
    -Dlog4j.configuration=container-log4j.properties
    -DGLOBUS_LOCATION=/usr/local/apps/globus-mds-info-4.0.5-r2
    -Djava.endorsed.dirs=/usr/local/apps/globus-mds-info-4.0.5-r2/endorsed
    -DGLOBUS_HOSTNAME=bg-login1.sdsc.edu -DGLOBUS_TCP_PORT_RANGE=50000,51000
    -Djava.security.egd=file:///dev/urandom -classpath /usr/local/apps/globus-mds-info-4.0.5-r2/lib/bootstrap.jar:/usr/local/apps/globus-mds-info-4.0.5-r2/lib/cog-url.jar:/usr/local/apps/globus-mds-info-4.0.5-r2/lib/axis-url.jar org.globus.bootstrap.Bootstrap org.globus.wsrf.container.ServiceContainer -p 8446
    
    
  • PMR 91160 BlueGene: resources are unavailable
    
    mmcs$ allocate R01                                                      
    FAIL                                                                    
    create_block: resources are unavailable - NODECARD: R01-M0-N9   
    
    4/28/08 1:46:20 PM  R01   R01-M0-N9-C:J12-U01  rts panic! - stopping execution
    
    run diagnostics:
    http://bgsn:8080/BlueGeneNavigator/faces/secured/diagnostics.jsp
    
    Logs in: /bgl/BlueLight/logs/diags
    
    1. Node Card  R01-M0-N9  J205  10.0.2.80    is red
    3. 4/29/08 2:08:47 AM  Failure  Monitor    R01-M0-N1  power module status fault detected on node card.
       status registers are: 16/0/0/0
    
    this error shows up in the RAS log every 30 minutes until 11:52:41 AM
    
    4. I started with the short test:
       dr_bitfail  FAILED  4/29/08 1:04:37 PM  4/29/08 1:06:10 PM
       mem_l2_coherency  SUCCESS 4/29/08 1:06:10 PM 4/29/08 1:08:14 PM
       ms_gen_short FAILED  4/29/08 1:08:14 PM  4/29/08 1:12:54 PM
       dgemm3200  FAILED  4/29/08 1:12:54 PM  4/29/08 1:18:34 PM
       dgemm160  SUCCESS 4/29/08 1:18:34 PM  4/29/08 1:22:28 PM
       ts_multinode FAILED  4/29/08 1:22:46 PM  4/29/08 1:32:03 PM
       tr_multinode SUCCESS 4/29/08 1:32:30 PM  4/29/08 1:37:48 PM
       emac_dg  SUCCESS 4/29/08 1:38:13 PM  4/29/08 1:38:50 PM
    Result:  Fatal hardware failure
    
    Now we have 2 errors:
    Node Card  R01-M0-N6  Error Position J115
    Node Card  R01-M0-N9  Error Position J205
    
    from the log file:
    forceCardsUnavailable: WARNING: Marking nodecard(s): R01-M0-N6,R01-M0-N9 in ERROR until next service action
    forceCardsUnavailable: WARNING: Marking nodecard(s): R01-M0-N6,R01-M0-N9 in ERROR until next service action
    
    from the bgldiag.report
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 dr_bitfail < 1, 0, 6>  failed 219
    04DB84B6872FFFFF06071B404ADA       0 29001049  R01-M0-N6-C:J02-U11   37 dr_bitfail < 7, 5, 1>  failed 350
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 ms_gen_short < 1, 0, 6>  failed 219
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 dgemm3200 < 1, 0, 6>  failed 219
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 ts_multinode < 1, 0, 6>  failed 219
    
    
    
  • PMR 00227 llq failed
    bgsn /users/whitej> llq                                                 
    llq: 2512-301 An error occurred while receiving data from the           
    LoadL_negotiator daemon on host bgsn.                                   
    
    
    I realized the problem: db2usrinf was missing because db2 is not
    installed on one of the submission nodes, and LoadLeveler just kept
    favoring that node as the submitter.  Stopping LoadLeveler on that
    node with llctl solved the problem.
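    
    A sketch of the LoadLeveler side (<badnode> is a placeholder for the
    submission node without db2):
    
    llstatus                      # see which machines are in the pool
    llctl -h <badnode> stop       # stop LoadLeveler on the node without db2
    llq                           # queries should now reach the negotiator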
    
    
    I believe that DB2 APAR LI71571 should fix this problem.  From the
    APAR description:
                                                                            
    ERROR DESCRIPTION:                                                      
    A memory leak and gradual performance slowdown may occur in DB2         
    client applications due to a growing list of thread-related             
    memory structures.  These structures may not be reused due to a         
    bad comparison resulting from a data-type mismatch.  When we
    search the list of memory structures, the correct one or "match"        
    will not be found when the ID of the thread inside the DB2              
    application exceeds a 32-bit value - instead a new structure            
    will incorrectly be allocated.                                          
    .                                                                       
    The problem is more likely to happen on a Linux 64-bit platform
    since that platform can have threads with IDs that are larger
    than 32-bit in size.
    
    
  • PMR 91052 BlueGene: Internal compiler error
    bg-login3 mahidhar/src_PCG> more stokes.lst                   
    
    /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf90: 1501-230 Internal compiler     
    error; please contact your Service Representative                       
    1501-511  Compilation failed for file stokes.F.                         
    make: *** [obj/stokes.o] Error 40                                       
    1501-511  Compilation failed for file stokes.F.                         
    1501-544  Object file not created.                                                                                                 
    
    
    Compiler development has determined that this is a BG operating system
    issue.  The contents of *.mod files generated by the compiler are not
    flushed to disk right away.  As a result, when they are used in a
    compilation immediately thereafter, the compiler gets bad data from the
    .mod files and issues the internal compiler error.
                                                                            
    We have worked with the BG operating system team, and they suggested    
    that temporarily, you can compile it locally in /tmp or any file system 
    that is local to the front end node to get around the problem. 
                                                                           
                 This can happen based on how they have set up their NFS
                 mounted file system.  I think there is a setting they can
                 change to prevent this, but it slows things down.  To prove
                 this, have them compile it locally, e.g. in /tmp.
                 
                 Please let us know if it works when you compile locally.
                                                                          
    
    Development did some more looking at the NFS setting: if you use the
    sync export option, that fixes this as well.
                                                                
    hocks@tg-c078:/users/hocks> more /etc/exports 
    /bgl/users \
            198.202.0.0/255.255.0.0(rw,async,no_root_squash) \
            bgio*(rw,async,no_root_squash)
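    
    What the export would look like with the suggested sync option (a sketch;
    shown for both client entries, re-export afterwards):
    
    /bgl/users \
            198.202.0.0/255.255.0.0(rw,sync,no_root_squash) \
            bgio*(rw,sync,no_root_squash)
    
    exportfs -r        # re-export everything listed in /etc/exports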
    
    
    
  • PMR 15649 Compiler crashes XL Fortran Ver: 10 Rel: 1
    APAR LI72593
    
    source modules: _evb_gfchk.f
    
    IBM:  
    
    I have verified that we have solved the problem, and have made an       
    interim fix available
    
    John:
    we have compiled code successfully with this efix and are awaiting a    
    job to push through the queue.                                          
    
    IBM:
    Good news!  The March 2008 XL Fortran for BG/L Update is now available: 
    http://www.ibm.com/support/docview.wss?rs=43&uid=swg24012720            
    
    
    
  • PMR 53970 BlueGene: dbhome full
    the largest files are in /dbhome/sqllib/db2dump
    backup image:
     4605485056 Apr  4  2007 BGDB0.0.bglsysdb.NODE0000.CATN0000.20070404085201.001                   
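    
    A hedged sketch for confirming what is taking the space (size threshold is
    illustrative):
    
    df -h /dbhome                               # how full is it
    du -sk /dbhome/sqllib/* | sort -n | tail    # biggest directories under sqllib
    find /dbhome -xdev -size +500000k -ls       # individual files over ~500 MB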
                                                                            
    To archive the log file:
    
    bgsn: db2diag -A   
    
    db2diag: Moving "/dbhome/sqllib/db2dump/db2diag.log"
             to     "/dbhome/sqllib/db2dump/db2diag.log_2008-07-18-13.19.26"
    
    To find where the database is:
    bgsn: db2 list db directory
    
    crontab on bgsn:
    45 23 * * * /dbhome/DB2/sbin/inst.run -b -e /dbhome/DB2/sbin/runstatsall
    30 23 * * 0 /u/bgdb2cli/sqllib/bin/db2diag -A 
    
    
    
  • PMR 89264 DB2 not accepting connections
    Resolution: Fixpak15 and increasing Message Limits (details below).
    db2 is not accepting connections.  In the log, I see multiple:
    Jul 16 09:00:02 bgsn DB2[30932]: Open of log file                       
    "/u/bgdb2cli/sqllib/db2dump/db2diag.log" failed with rc 0x840F0001      
                                                                            
    When db2 is started back up and all the bglmaster processes are         
    brought back up, llq -s on any given job returns:                       
    ==================== EVALUATIONS FOR JOB STEP bgsn.28967.0              
    ====================                                                    
                                                                            
    Not enough resources to start now.                                      
    No resource available. BlueGene is configured but not active.           
    Not enough resources for this step as top-dog.                          
    
    db2diag.log 
    FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:110       
    MESSAGE : DIA3202C The TCP/IP call "connect" returned an errno="111     
    /var/log/messages:                                                      
    Aug 19 17:52:01 bgsn /USR/SBIN/CRON[15407]: (root) CMD (/usr/local/bin/perl /opt/llview/bglquery.pl > /srv/www/htdocs/llview/.data/llqxml.dat  2>/dev/null)                                                
    Aug 19 17:52:07 bgsn mmcs_db_server[25257]: MMCSDBMonitor: error accessing the DB, result = 1                                            
    Aug 19 17:52:23 bgsn idoproxy[25255]: GetDbPollInterval data base query failed                                                            
    
    IBM: 
    I was thinking that it was a problem with /dbhome being full, but John
    says that the log file seems to become inaccessible and then LL fails.
    Once he restarts DB2 and LL, things get back in sync and all is well.
    
    db2 "connect to bgdb0 user bglsysdb using db24bgls"
    
    If you cannot perform a db2 connect, then there is no Blue Gene software
    problem here at all.                                                    
    Generally, we do something like   
    
    db2 force applications all     
    db2 terminate   
    db2stop  
    db2start
                                                                            
    
    Another way to do it (although typically not recommended) is with ipcs and ipcrm.
    Here's a quick-and-dirty example:
                                                                            
    for i in $(ipcs | awk '{print $2}' | grep -v ^S | grep -vi message | grep -vi id); do ipcrm -q $i ; ipcrm -m $i ; ipcrm -s $i; done
                                                                            
    After that runs issue 
    
    db2start as bglsysdb.                             
                                                                            
    Typically we set db2 to autostart on boot:
    /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb
    and then add an init.d script for db2, i.e.
    
    /etc/init.d/rc.db2   ( enabled via /sbin/chkconfig --add rc.db2 ).  
    
    db2 diag:
    DIAGLEVEL setting. To check the current setting, issue:                 
                                                                            
    DB2 GET DBM CFG                                                         
                                                                            
    Look for the following variable:                                        
                                                                            
    Diagnostic error capture level              (DIAGLEVEL) = 3             
    
    To alter the setting, use the command:                                  
                                                                            
    DB2 UPDATE DBM CFG USING DIAGLEVEL X                                    
    where X is the desired notification level.                              
    There are five settings (0-4) for DIAGLEVEL:
                                                                            
    # 0 - No diagnostic data captured                                       
    # 1 - Severe errors only                                                
    # 2 - All errors                                                        
    # 3 - All errors and warnings                                           
    # 4 - All errors, warnings and informational messages                   
                                                                            
    Change the path to a filesystem with more space.                     
          Example:   db2 update dbm cfg using DIAGPATH  /db2home/diaglog .  
                                                                            
    Either way the permissions have to be 666, and at some point the size of
    the db2diag.log should be trimmed.
    
    
    The system errors regarding MMCSDBMonitor and GetDbPollInterval are not
    db2 functions.  I found a similar issue in another BlueGene PMR, and
    upgrading to Fixpak15 and increasing Message Limits resolved the problem.
    
    ftp://ftp.software.ibm.com/ps/products/db2/fixes2/english-us/db2linuxPPC64v8/fixpak                                                             
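    
    "Increasing Message Limits" presumably refers to the Linux System V
    message-queue kernel limits DB2 depends on; a sketch (values illustrative,
    persist them in /etc/sysctl.conf):
    
    sysctl kernel.msgmni kernel.msgmax kernel.msgmnb    # current limits
    sysctl -w kernel.msgmni=1024
    sysctl -w kernel.msgmax=65536
    sysctl -w kernel.msgmnb=65536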
                                                                            
    
    
  • PMR 65044 DB2 crashes
    Jobs do not get started due to a hung db2 server.
    LoadLeveler lost the connection to db2.
    
    The last error in the db2diag log:
    FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:110       
    MESSAGE : DIA3202C The TCP/IP call "connect" returned an errno="111".   
    
    
    1) on the server, check db2set -all. DB2COMM should be set as follows:  
    [i] DB2COMM=TCPIP                                                       
                                                                            
    2) check that the client machine has cataloged the correct port at the  
    server for DB2.                                                         
                                                                            
    3) from a client, try: 'telnet <server> 50000', where 50000 is the
    default port used
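    
    A sketch for checking steps 2) and 3) (bgsn assumed to be the DB2 server;
    service name and port vary per instance):
    
    db2 list node directory            # client: cataloged TCP/IP node, service name/port
    db2 list db directory              # client: bgdb0 should be cataloged against that node
    db2 get dbm cfg | grep SVCENAME    # server: service name the instance listens on
    telnet bgsn 50000                  # raw connectivity test to the DB2 listener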
                                                                            
    
    IBM:  
    When the hang is taking place please run the following script, THEN     
    collect a db2support.zip.                                               
                                                                            
    >>>>>>db2service.perf1<<<<<<<<                                          
                                                                            
    Here are the steps on gathering a db2support.zip file and placing it out
    on our EMEA ftp site.                                                   
                                                                            
    1) From the CLP run "db2support . -d <dbname> -c -s"
    2) Rename the files created above to 65044.227.000. and then  
    ftp the file to our ftp server.                                         
    3) ftp ftp.emea.ibm.com                                                 
    4) Log in as "anonymous" and use your "email address" as the password   
    5) cd toibm/unix                                                        
    6) Type "bin" for binary mode                                           
    7) put 65044.227.000.                                         
    8) quit