BlueGene Fixes 2008

  • Hardware
    
    09/19  R02-M1-N6-C:J05-U01  L3 major internal error: recovery failed, DJT8WDJ
    09/08  R02-M0-N7-C:J11-U01  CE sym 0, at 0x1089a540, mask 0x04 
    09/08  R00-M1-N7-I:J18-U01  Error: unable to mount filesystem, #DJT9VL8
    09/05  R01-M1-NE-I:J18-U11, R01-M0-NF-I:J18-U01     Error: unable to mount filesystem  #DJT9VL8
    09/01  R00-M1-NE-I:J18-U11, R01-M0-NA-I:J18-U01, R01-M0-NB-I:J19-U11 no ethernet link, cables unplugged 
    07/11  R00-M1-N9-I:J19-U11, R00-M1-NE-I:J18-U11, R01-M0-NA-I:J18-U01, R01-M0-NB-I:J19-U11 no ethernet link
           new cables to Hollborn
    05/23  R00-M0-NC-I:J18:U11, R00-M1-N6-I:J19:U11 no ethernet link, new cables to Victoria
    04/29  NODECARD: R01-M0-N9
    03/03  I/O node board down R01-M1-N6-I:J19-U01, I/O CARD REPLACEMENT, board replaced
    02/21  ethernet down R01-M0-NE-I:J19-U11, new cable to Victoria
    
    
  • PMR 51568 BG: CE sym 0 errors
    The user is seeing wrong results from his code on rack R02, while the very
    same code runs correctly on rack R01.
    
    9/8/08 11:15:20 AM  Informational  Kernel  R02  46303 R02-M0-N7-C:J11-U01  total of 11 ddr error(s) detected and corrected    over 12881 seconds                                                      
    9/8/08 11:15:20 AM  Informational  Kernel  R02  46303 R02-M0-N7-C:J11-U01  CE sym 0, at 0x1089a540, mask 0x04    
    
    9/9/08 10:38:28 AM  Informational  Kernel  R02-bot  46374  R02-M0-NC-C:J15-U11  critical input interrupt 
    (unit=0x0b bit=0x0a): warning for torus z+ wire, suppressing further interrupts of same type
    
    
    I suggest running full midplane diags on both midplanes on R02. Do not run linkcard diags.     
    Sep 11 13:26:15 2008 PDT forceCardsUnavailable: WARNING: Marking nodecard(s): R02-M0-NE in ERROR until next service action         
    Compute Node  dr_bitfail  9/11/08 1:38:15 PM 07441025612FFFFF0E081B9044E0  R02-M1-N8-C:J06-U11  Fatal hardware failure
    
    Action Taken: I have placed a hardware call, number DJT9LVR, to have node
    card R02-M0-NE and compute card R02-M1-N8-C:J06-U11 replaced on the
    customer's system.
                                         
    Action Taken: There was a node card in R02-M0 that was not being
                 discovered.  Eva helped us get onto the system; I looked
                 at the Service Card and saw that the linkcard did not have
                 the 1.5 volt power set, so this linkcard was not
                 initialized correctly.  I reset the Service Card and marked
                 the linkcards and nodecards for this midplane as missing
                 for the idochips.  After doing this we started Discovery0
                 back up, all of the hardware was found, and the midplane
                 looked good.
    
    
    9/12/08 12:59:21 PM  Fatal  Kernel  bot128-1  46526 R00-M0-N3-C:J03-U11  machine check interrupt (bit=0x06): L3 major internal error       
    
                 Eva then said that she had one other question in R00-M0.  We
                 did some checking and there was a compute card that took an
                 L3 Major error.  We helped her do an SA (service action) on
                 this midplane to get it back up, and she is aware that the
                 node card could go into an error status again if the L3
                 error happens again.
                                                                            
    After the hardware was replaced, EndServiceAction hangs:
    
    Sep 15 15:32:09.154 PDT:   All of R00-M0's NodeCards are active
    Sep 15 15:32:09.159 PDT: @ Still waiting for 1 compute Processor cards in R00-M0 to become active
    
    ===> Restart the discovery processes:
    bgsn:/discovery # ./Discovery0 stop
    bgsn:/discovery # ./PostDiscovery stop
    bgsn:/discovery # ./SystemController stop
    bgsn:/discovery # ./SystemController start
    bgsn:/discovery # ./Discovery0 start
    bgsn:/discovery # ./PostDiscovery start
    
    
    
  • PMR 39625 timeout
    
    
    {270}.0: Starting syslog services
    {270}.0: 
    Starting ciod
    Starting XL Compiler Environment for I/O node
    
    
    BusyBox v1.00 (2007.10.12-17:59+0000) Built-in shell (ash)
    Enter 'help' for a list of built-in commands.
    
    /bin/sh: can't access tty; job control turned off
    ~ # ciod: version "Oct 23 2007 19:23:06"
    ciod: running in virtual node mode with 16 processors
    
    Boot a rack:
    
    mmcs$ allocate_block R00
    OK
    mmcs$ boot_block
    OK
    mmcs$ select_block top64-2 
    OK
    mmcs$  boot_block
    FAIL
    timeout   ->> communication failure? 
    
    still fails after power cycle
    reseat switch card Network
    replace I/O cards
    
    su - bglsysdb
    bglsysdb@bgsn:~> db2 "connect to bgdb0"
    db2 "select location,ipaddress from bglnode where location='R00-M1-N7-I:J18-U01'" 
    
    LOCATION                         IPADDRESS                                                                                                                                                                                                                                                      
    R00-M1-N7-I:J18-U01              192.168.253.62                                                                                                                                                                                                                                                 
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location='R01-M0-NF-I:J18-U01'" 
    
    LOCATION                         IPADDRESS                                                                                                                                                                                                                                                      
    R01-M0-NF-I:J18-U01              192.168.253.202                                                                                                                                                                                                                                                
    R01-M0-NF-I:J18-U01              192.168.253.202                                                                                                                                                                                 
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location='R01-M1-NE-I:J18-U01'" 
    R01-M1-NE-I:J18-U01              192.168.253.183                                                                                                                                                                                                                                                
    R01-M1-NE-I:J18-U01              192.168.253.183                                                                                                                                                                                                                                                
    
      2 record(s) selected.
    
    bglsysdb@bgsn:~> db2 "select location,ipaddress from bglnode where location like '%I%'order by location" > 
    location.tx
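    
    With the addresses in hand, a quick reachability check from the service node
    (a sketch; the IPs are the ones returned by the queries above):
    
    ping -c 3 192.168.253.62        # R00-M1-N7-I:J18-U01
    ping -c 3 192.168.253.202       # R01-M0-NF-I:J18-U01
    ping -c 3 192.168.253.183       # R01-M1-NE-I:J18-U01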
    
    
    
    
  • PMR 39792 BG: midplane switch error:
    mmcs$ allocate_block R02
    OK
    mmcs$ boot_block
    FAIL
    midplane switch error: 
    
    Reset link cards:
    mmcs$ pgood_linkcards all
    OK
    
    
    
  • PMR 39625 BG: GPFS not mounting on I/O nodes
    
    mmcs$ allocate rack
    
    FAIL
    
    RAS event: KERNEL FATAL: /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come
    up on I/O node bgio163 : 192.168.253.8
    
    ssh bgio163     shows:
    /dev/gdg             18897027072 18826870784  70156288 100% /bggpfs
    /dev/gpfswan         658336110592 449413885952 208922224640  69% /gpfs-wan
    
    
    mmgetstate -a
    mmgetstat.bgio:     556      bgio163          arbitrating
    ...
    
    arbitrating: A node is trying to form a quorum with the other available nodes. 
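    
    A minimal sketch for getting the node out of "arbitrating" (assumes the usual
    GPFS admin commands are available where mmgetstate was run above; node name
    taken from the RAS event):
    
    mmgetstate -a -L          # node states plus quorum information
    mmstartup -N bgio163      # (re)start GPFS on the arbitrating I/O node
    mmgetstate -N bgio163     # confirm the node reaches "active"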
    
    
    
    
  • PrepareForService fails
    
    Jul 18 11:39:47.014 PDT: NcInfo - Exception occurred while building an
    IDo for NcInfo(NodeCard: Card location (R01-M0-N0), Type(4),
    SN(203231503833343000000000594c31324b35323330305039,
    LP(FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3), IP(10.0.0.179))
    java.io.IOException: Could not contact iDo with
    LP=FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3 and IP=/10.0.0.179 because
    java.lang.RuntimeException: Communication error: (DirectIDo for
    Uninitialized DirectIDo for
    FF:F2:9F:16:AF:0C:00:0D:60:E9:50:F3@/10.0.0.179:0 is in state =
    COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber
    = 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout =
    1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync
    Command = 10, Actual Sync Reply = -1)
    
    Check /bgl/BlueLight/logs/BGL/bgsn-mmcs_db_server-*.log for errors:
    
    [IBM][CLI Driver][DB2/LINUXPPC] SQL0968C  The file system is full.  SQLSTATE=57011
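    
    Since DB2 reports the filesystem full, a short sketch to confirm and make room
    before retrying (assumes the same /dbhome filesystem as in the dbhome-full PMR
    below; run the db2 commands as the instance owner):
    
    df -h /dbhome                      # confirm which filesystem is full
    db2 get dbm cfg | grep -i diag     # where DIAGPATH points (diag files land there)
    db2diag -A                         # archive the current db2diag.log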
    
    
    
  • PMR 19096 BG: Node Card R01-M1-NE Error ( allocate error)
    mmcs$ allocate R01
    FAIL
    connect: idoproxy communication failure: Invalid Licence plate
    
    
    mmcs$ allocate R01
    FAIL
    create_block: resources are unavailable - NODECARD: R01-M1-N0, NODECARD: R01-M1-N1, NODECARD: R01-M1-N2, 
    NODECARD: R01-M1-N3, NODECARD: R01-M1-N4, NODECARD: R01-M1-N5, NODECARD: R01-M1-N6, NODECARD: R01-M1-N7, 
    NODECARD: R01-M1-N8, NODECARD: R01-M1-N9, NODECARD: R01-M1-NA, NODECARD: R01-M1-NB, NODECARD: R01-M1-NC, 
    NODECARD: R01-M1-ND, NODECARD: R01-M1-NE, NODECARD: R01-M1-NF
    
    su - bglsysdb
    bglsysdb@bgsn:~> db2
    (c) Copyright IBM Corporation 1993,2002
    Command Line Processor for DB2 SDK 8.2.8
    
    db2 => connect to bgdb0
    
    
    db2 => select status, location from  tbglprocessorcard where location like 'R01-M1-NE-%' AND 
         status = 'A' order by location
    
    STATUS LOCATION                        
    ------ --------------------------------
    A      R01-M1-NE-C:J02                 
    A      R01-M1-NE-C:J03                 
    A      R01-M1-NE-C:J04                 
    A      R01-M1-NE-C:J05                 
    A      R01-M1-NE-C:J06                 
    A      R01-M1-NE-C:J07                 
    A      R01-M1-NE-C:J08                 
    A      R01-M1-NE-C:J09                 
    A      R01-M1-NE-C:J10                 
    A      R01-M1-NE-C:J11                 
    A      R01-M1-NE-C:J12                 
    A      R01-M1-NE-C:J13                 
    A      R01-M1-NE-C:J14                 
    A      R01-M1-NE-C:J15                 
    A      R01-M1-NE-C:J16                 
    A      R01-M1-NE-C:J17                 
    A      R01-M1-NE-I:J18                 
    A      R01-M1-NE-I:J19                 
    
      18 record(s) selected.
    
    reseat card
    
    restart:
    bglmaster restart
    stop/start Discovery0, SystemController, and PostDiscovery (same sequence as above)
    Power off rack R01 and leave it off for 15 min, then power it back up and
    let discovery find it all.
    
    
    check that the monitor is turned off:
    bgsn:/discovery # bglmaster status
    idoproxy            started [25699]
    ciodb               started [25700]
    mmcs_server         started [25701]
    monitor0            stopped
    perfmon             stopped
    
    In the mmcs log I saw lots of these messages:
    
    
    /bgl/BlueLight/logs/BGL/bgsn-mmcs_db_server-2008-0716-15:21:53.log
                                                                            
    WARNING: Node R01-M1-N9-C:J13-U01 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-NC-C:J14-U11 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-N5-C:J15-U11 has less memory than other nodes on   
    the midplane!                                                           
    WARNING: Node R01-M1-N2-C:J15-U11 has less memory than other nodes on   
    the midplane!                                                       
        
    so I ran these queries.                                                 
    
    db2 "select memorymodulesize,serialnumber from TBGLNODEHWATTR where memorymodulesize <> 6"                                                  
    
                                                                            
    MEMORYMODULESIZE SERIALNUMBER                                           
    ---------------- ---------------------------------------------------    
                  14 x'00000000000000000000074624A6522FFFFF08101B611CDE'    
                  14 x'00000000000000000000076124A54B2FFFFF06091BC08EE6'    
                                                                            
      2 record(s) selected.                                                 
    
    db2 "select serialnumber,location,status from bglprocessorcard where serialnumber = x'00000000000000000000076124A54B2FFFFF06091BC08EE6'"     
                                                                            
    SERIALNUMBER                                        LOCATION         STATUS
    --------------------------------------------------- ---------------- ------
    x'00000000000000000000076124A54B2FFFFF06091BC08EE6' R01-M1-NE-C:J12  A
    The other serialnumber did not return a hit.
                                                                            
    
    
    
  • Partition not booting
    mmcs$ setusername sharikov
    OK
    mmcs$ redirect R02-bot on
    OK
    mmcs$ {207}.0: mount
    {234}.0: mount
    {234}.0: : Mounting /dev/gdg on /bggpfs failed: Stale NFS file handle
    
    
    on BGSN: mmshutdown
             mmstartup
    
    bgsn:/var/adm/ras # mount /bggpfs
    mount: Stale NFS file handle
    
    /var/adm/ras/mmfs.log.latest:
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0181gdg.
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0182gdg.
    Thu Jun 12 11:19:43 2008: Disk failure.  Volume gdg. rc = 19. Physical volume d0183gdg.
    Thu Jun 12 11:19:43 2008: File System gdg unmounted by the system with return code 19 reason code 0
    
    
    on tg-c008: mmlsnsd -m |grep not
     d0161gdg     C6CA703244A3004D   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0162gdg     C6CA703244A30065   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0163gdg     C6CA703244A3007B   -              tg-c018.sdsc.teragrid.org (not found) backup node
     d0181gdg     C6CA703444A3004F   -              tg-c018.sdsc.teragrid.org (not found) primary node
     d0182gdg     C6CA703444A30067   -              tg-c018.sdsc.teragrid.org (not found) primary node
     d0183gdg     C6CA703444A3007D   -              tg-c018.sdsc.teragrid.org (not found) primary node
    
    
    --> check GigE connection and restart mmfs on tg-c018
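    
    A sketch of the checks on tg-c018 (the interface name eth0 is an assumption):
    
    ssh tg-c018
    ethtool eth0                       # verify the GigE link is up
    mmshutdown                         # restart GPFS on this NSD server
    mmstartup
    mmgetstate                         # local node should come back "active"
    mmlsnsd -m | grep "not found"      # the d018*gdg NSDs should no longer show up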
    
    {45}.0: /var/etc/rc.d/rc3.d/S40gpfs: GPFS is ready on I/O node bgio222 :
    
    
  • Bluegene MDS Provider crashed
    
    Restart on bg-login1:
    
    kenneth@bg-login1:/etc/rc.d> ps -ef | grep mds
    kenneth  32087 29928  0 14:54 pts/24   00:00:00 grep mds
    
    kenneth@bg-login1:/etc/rc.d> sudo ./globus-mds-info start
    Starting Globus Container:                                           done
    
    kenneth@bg-login1:/etc/rc.d> ps -ef | grep mds
    globus   32104     1  0 14:54 pts/24   00:00:00 /usr/local/apps/globus-mds-info-4.0.5-r2/sbin/globus-start-container-detached -p 8446
    globus   32105 32104 75 14:54 ?        00:00:03 /usr/local/apps/IBMJava2-ppc64-142/bin/java
    -Dlog4j.configuration=container-log4j.properties
    -DGLOBUS_LOCATION=/usr/local/apps/globus-mds-info-4.0.5-r2
    -Djava.endorsed.dirs=/usr/local/apps/globus-mds-info-4.0.5-r2/endorsed
    -DGLOBUS_HOSTNAME=bg-login1.sdsc.edu -DGLOBUS_TCP_PORT_RANGE=50000,51000
    -Djava.security.egd=file:///dev/urandom -classpath /usr/local/apps/globus-mds-info-4.0.5-r2/lib/bootstrap.jar:/usr/local/apps/globus-mds-info-4.0.5-r2/lib/cog-url.jar:/usr/local/apps/globus-mds-info-4.0.5-r2/lib/axis-url.jar org.globus.bootstrap.Bootstrap org.globus.wsrf.container.ServiceContainer -p 8446
    
    
  • PMR 91160 BlueGene: resources are unavailable
    
    mmcs$ allocate R01                                                      
    FAIL                                                                    
    create_block: resources are unavailable - NODECARD: R01-M0-N9   
    
    4/28/08 1:46:20 PM  R01   R01-M0-N9-C:J12-U01  rts panic! - stopping execution
    
    run diagnostics:
    http://bgsn:8080/BlueGeneNavigator/faces/secured/diagnostics.jsp
    
    Logs in: /bgl/BlueLight/logs/diags
    
    1. Node Card  R01-M0-N9  J205  10.0.2.80    is red
    3. 4/29/08 2:08:47 AM  Failure  Monitor    R01-M0-N1  power module status fault detected on node card.
       status registers are: 16/0/0/0
    
    this error shows up in the RAS log every 30 minutes until 11:52:41 AM
    
    4. I started with the short test:
       dr_bitfail  FAILED  4/29/08 1:04:37 PM  4/29/08 1:06:10 PM
       mem_l2_coherency  SUCCESS 4/29/08 1:06:10 PM 4/29/08 1:08:14 PM
       ms_gen_short FAILED  4/29/08 1:08:14 PM  4/29/08 1:12:54 PM
       dgemm3200  FAILED  4/29/08 1:12:54 PM  4/29/08 1:18:34 PM
       dgemm160  SUCCESS 4/29/08 1:18:34 PM  4/29/08 1:22:28 PM
       ts_multinode FAILED  4/29/08 1:22:46 PM  4/29/08 1:32:03 PM
       tr_multinode SUCCESS 4/29/08 1:32:30 PM  4/29/08 1:37:48 PM
       emac_dg  SUCCESS 4/29/08 1:38:13 PM  4/29/08 1:38:50 PM
    Result:  Fatal hardware failure
    
    Now we have 2 errors:
    Node Card  R01-M0-N6  Error Position J115
    Node Card  R01-M0-N9  Error Position J205
    
    from the log file:
    forceCardsUnavailable: WARNING: Marking nodecard(s): R01-M0-N6,R01-M0-N9 in ERROR until next service action
    forceCardsUnavailable: WARNING: Marking nodecard(s): R01-M0-N6,R01-M0-N9 in ERROR until next service action
    
    from the bgldiag.report
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 dr_bitfail < 1, 0, 6>  failed 219
    04DB84B6872FFFFF06071B404ADA       0 29001049  R01-M0-N6-C:J02-U11   37 dr_bitfail < 7, 5, 1>  failed 350
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 ms_gen_short < 1, 0, 6>  failed 219
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 dgemm3200 < 1, 0, 6>  failed 219
    076210C1922FFFFF01091B3120DA  0.0714 29001049  R01-M0-N9-C:J12-U01   31 ts_multinode < 1, 0, 6>  failed 219
    
    
    
  • PMR 00227 llq failed
    bgsn /users/whitej> llq                                                 
    llq: 2512-301 An error occurred while receiving data from the           
    LoadL_negotiator daemon on host bgsn.                                   
    
    
    I realized the problem: db2usrinf was missing because db2 is not
    installed on one of the submission nodes, and LoadLeveler just kept
    favoring that node as the submitter.  Stopping LoadLeveler on that
    node with llctl solved the problem.
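    
    A sketch of the LoadLeveler side (<badnode> is a placeholder for the
    submission node without db2):
    
    llstatus                      # see which machines are in the pool
    llctl -h <badnode> stop       # stop LoadLeveler on the node without db2
    llq                           # queries should now reach the negotiator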
    
    
    I believe that DB2 APAR LI71571 should fix this problem.  From the
    APAR description:
                                                                            
    ERROR DESCRIPTION:                                                      
    A memory leak and gradual performance slowdown may occur in DB2         
    client applications due to a growing list of thread-related             
    memory structures.  These structures may not be reused due to a         
    bad comparison resulting from a data-type mismatch.  When we
    search the list of memory structures, the correct one or "match"        
    will not be found when the ID of the thread inside the DB2              
    application exceeds a 32-bit value - instead a new structure            
    will incorrectly be allocated.                                          
    .                                                                       
    The problem is more likely to happen on a Linux 64-bit platform
    since that platform can have threads with IDs that are larger
    than 32-bit in size.
    
    
  • PMR 91052 BlueGene: Internal compiler error
    bg-login3 mahidhar/src_PCG> more stokes.lst                   
    
    /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf90: 1501-230 Internal compiler     
    error; please contact your Service Representative                       
    1501-511  Compilation failed for file stokes.F.                         
    make: *** [obj/stokes.o] Error 40                                       
    1501-511  Compilation failed for file stokes.F.                         
    1501-544  Object file not created.                                                                                                 
    
    
    Compiler development has determined that this is a BG operating system
    issue.  The contents of *.mod files generated by the compiler are not
    flushed to disk right away.  As a result, when they are used in a
    compilation immediately thereafter, the compiler gets bad data from the
    .mod files and issues the internal compiler error.
                                                                            
    We have worked with the BG operating system team, and they suggested    
    that temporarily, you can compile it locally in /tmp or any file system 
    that is local to the front end node to get around the problem. 
                                                                           
                 This can happen based on how they have set up their NFS
                 mounted file system.  I think there is a setting they can
                 change to prevent this, but it slows things down.  To prove
                 this, have them compile it locally, e.g. in /tmp.
                 
                 Please let us know if it works when you compile locally.
                                                                          
    
    Development did some more looking at the NFS setting: if you use the
    sync export option, that fixes this as well.
                                                                
    hocks@tg-c078:/users/hocks> more /etc/exports 
    /bgl/users \
            198.202.0.0/255.255.0.0(rw,async,no_root_squash) \
            bgio*(rw,async,no_root_squash)
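    
    What the export would look like with the suggested sync option (a sketch;
    shown for both client entries, re-export afterwards):
    
    /bgl/users \
            198.202.0.0/255.255.0.0(rw,sync,no_root_squash) \
            bgio*(rw,sync,no_root_squash)
    
    exportfs -r        # re-export everything listed in /etc/exports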
    
    
    
  • PMR 15649 Compiler crashes XL Fortran Ver: 10 Rel: 1
    APAR LI72593
    
    source modules: _evb_gfchk.f
    
    IBM:  
    
    I have verified that we have solved the problem, and have made an       
    interim fix available
    
    John:
    we have compiled code successfully with this efix and are awaiting a    
    job to push through the queue.                                          
    
    IBM:
    Good news!  The March 2008 XL Fortran for BG/L Update is now available: 
    http://www.ibm.com/support/docview.wss?rs=43&uid=swg24012720            
    
    
    
  • PMR 53970 BlueGene: dbhome full
    the largest files are in /dbhome/sqllib/db2dump
    backup image:
     4605485056 Apr  4  2007 BGDB0.0.bglsysdb.NODE0000.CATN0000.20070404085201.001                   
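    
    A hedged sketch for confirming what is taking the space (size threshold is
    illustrative):
    
    df -h /dbhome                               # how full is it
    du -sk /dbhome/sqllib/* | sort -n | tail    # biggest directories under sqllib
    find /dbhome -xdev -size +500000k -ls       # individual files over ~500 MB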
                                                                            
    To archive the log file:
    
    bgsn: db2diag -A   
    
    db2diag: Moving "/dbhome/sqllib/db2dump/db2diag.log"
             to     "/dbhome/sqllib/db2dump/db2diag.log_2008-07-18-13.19.26"
    
    To find where the database is:
    bgsn: db2 list db directory
    
    crontab on bgsn:
    45 23 * * * /dbhome/DB2/sbin/inst.run -b -e /dbhome/DB2/sbin/runstatsall
    30 23 * * 0 /u/bgdb2cli/sqllib/bin/db2diag -A 
    
    
    
  • PMR 89264 DB2 not accepting connections
    Resolution: Fixpak15 and increasing Message Limits (details below).
    db2 is not accepting connections.  In the log, I see multiple:
    Jul 16 09:00:02 bgsn DB2[30932]: Open of log file                       
    "/u/bgdb2cli/sqllib/db2dump/db2diag.log" failed with rc 0x840F0001      
                                                                            
    When db2 is started back up and all the bglmaster processes are         
    brought back up, llq -s on any given job returns:                       
    ==================== EVALUATIONS FOR JOB STEP bgsn.28967.0              
    ====================                                                    
                                                                            
    Not enough resources to start now.                                      
    No resource available. BlueGene is configured but not active.           
    Not enough resources for this step as top-dog.                          
    
    db2diag.log 
    FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:110       
    MESSAGE : DIA3202C The TCP/IP call "connect" returned an errno="111     
    /var/log/messages:                                                      
    Aug 19 17:52:01 bgsn /USR/SBIN/CRON[15407]: (root) CMD (/usr/local/bin/perl /opt/llview/bglquery.pl > /srv/www/htdocs/llview/.data/llqxml.dat  2>/dev/null)                                                
    Aug 19 17:52:07 bgsn mmcs_db_server[25257]: MMCSDBMonitor: error accessing the DB, result = 1                                            
    Aug 19 17:52:23 bgsn idoproxy[25255]: GetDbPollInterval data base query failed                                                            
    
    IBM: 
    I was thinking that it was a problem with /dbhome being full, but John
    says that the log file seems to become inaccessible and then LL fails.
    Once he restarts DB2 and LL, things get back in sync and all is well.
    
    db2 "connect to bgdb0 user bglsysdb using db24bgls"
    
    If you cannot perform a db2 connect, then there is no Blue Gene software
    problem here at all.                                                    
    Generally, we do something like   
    
    db2 force applications all     
    db2 terminate   
    db2stop  
    db2start
                                                                            
    
    Another way to do it (although typically not recommended) is with ipcs and ipcrm.
    Here's a quick-and-dirty example:
                                                                            
    for i in $(ipcs | awk '{print $2}' | grep -v ^S | grep -vi message | grep -vi id); do ipcrm -q $i ; ipcrm -m $i ; ipcrm -s $i; done
                                                                            
    After that runs issue 
    
    db2start as bglsysdb.                             
                                                                            
    Typically we set db2 to autostart on boot:
    /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb
    and then add an init.d script for db2, i.e.
    
    /etc/init.d/rc.db2   ( enabled via /sbin/chkconfig --add rc.db2 ).  
    
    db2 diag:
    DIAGLEVEL setting. To check the current setting, issue:                 
                                                                            
    DB2 GET DBM CFG                                                         
                                                                            
    Look for the following variable:                                        
                                                                            
    Diagnostic error capture level              (DIAGLEVEL) = 3             
    
    To alter the setting, use the command:                                  
                                                                            
    DB2 UPDATE DBM CFG USING DIAGLEVEL X                                    
    where X is the desired notification level.                              
    There are five settings (0-4) for DIAGLEVEL:
                                                                            
    # 0 - No diagnostic data captured                                       
    # 1 - Severe errors only                                                
    # 2 - All errors                                                        
    # 3 - All errors and warnings                                           
    # 4 - All errors, warnings and informational messages                   
                                                                            
    Change the path to a filesystem with more space.                     
          Example:   db2 update dbm cfg using DIAGPATH  /db2home/diaglog .  
                                                                            
    Either way the permissions have to be 666, and at some point the size of
    the db2diag.log should be trimmed.
    
    
    The system errors regarding MMCSDBMonitor and GetDbPollInterval are not
    db2 functions.  I found a similar issue in another BlueGene PMR, and
    upgrading to Fixpak15 and increasing Message Limits resolved the problem.
    
    ftp://ftp.software.ibm.com/ps/products/db2/fixes2/english-us/db2linuxPPC64v8/fixpak                                                             
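    
    "Increasing Message Limits" presumably refers to the Linux System V
    message-queue kernel limits DB2 depends on; a sketch (values illustrative,
    persist them in /etc/sysctl.conf):
    
    sysctl kernel.msgmni kernel.msgmax kernel.msgmnb    # current limits
    sysctl -w kernel.msgmni=1024
    sysctl -w kernel.msgmax=65536
    sysctl -w kernel.msgmnb=65536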
                                                                            
    
    
  • PMR 65044 DB2 crashes
    Jobs do not get started due to a hung db2 server.
    LoadLeveler lost the connection to db2.
    
    The last error in the db2diag log:
    FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:110       
    MESSAGE : DIA3202C The TCP/IP call "connect" returned an errno="111".   
    
    
    1) on the server, check db2set -all. DB2COMM should be set as follows:  
    [i] DB2COMM=TCPIP                                                       
                                                                            
    2) check that the client machine has cataloged the correct port at the  
    server for DB2.                                                         
                                                                            
    3) from a client, try: 'telnet <server> 50000', where 50000 is the
    default port used
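    
    A sketch for checking steps 2) and 3) (bgsn assumed to be the DB2 server;
    service name and port vary per instance):
    
    db2 list node directory            # client: cataloged TCP/IP node, service name/port
    db2 list db directory              # client: bgdb0 should be cataloged against that node
    db2 get dbm cfg | grep SVCENAME    # server: service name the instance listens on
    telnet bgsn 50000                  # raw connectivity test to the DB2 listener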
                                                                            
    
    IBM:  
    When the hang is taking place please run the following script, THEN     
    collect a db2support.zip.                                               
                                                                            
    >>>>>>db2service.perf1<<<<<<<<                                          
                                                                            
    Here are the steps on gathering a db2support.zip file and placing it out
    on our EMEA ftp site.                                                   
                                                                            
    1) From the CLP run "db2support . -d <dbname> -c -s"
    2) Rename the files created above to 65044.227.000. and then  
    ftp the file to our ftp server.                                         
    3) ftp ftp.emea.ibm.com                                                 
    4) Log in as "anonymous" and use your "email address" as the password   
    5) cd toibm/unix                                                        
    6) Type "bin" for binary mode                                           
    7) put 65044.227.000.                                         
    8) quit