To diagnose a problem with Catalina:

Run any commands as someone in the catalina group:
- Check the catalina.config file in the Catalina install directory for CAT_LOCK_GROUP
- su to a user in the appropriate group
- newgrp the appropriate group
- umask 002

1. Run show_res to get information on existing reservations.

    log in to ds001
    cd /catalina/install
    ./show_res | more

2. Run show_q to get information on jobs.

    cd /catalina/install
    ./show_q | more

3. To show all reservations on a node, run show_res and grep for the node name.

    ./show_res --nodegrep --purpose --comment --readable --start --end --runID | grep tf228i

4. To show overlapping reservations, run show_res with the --overlap option.

    ./show_res --overlap --purpose --comment --readable --start --end

5. To show nodes in a reservation, use the --node_list option.

    ./show_res --res_id= --node_list

6. To show the job restriction code for a reservation, use the --job_rest option.

    ./show_res --res_id= --job_rest

The job restriction will probably be something like:

    if input_tuple[0]['user'] in ['user1','user2'] : result = 0

This should be interpreted as, "if the job's user is in the list containing user1 and user2, then allow the job to run in the reservation."

7. To evaluate any problems in the scheduling loop, including problems getting database locks, run an iteration in the foreground:

    ./catalina_schedule_jobs

This will send output, including any uncaught exceptions, to stdout. If an uncaught exception is seen, check the Catalina code at the indicated line number for hints on what might be wrong. Often, when locks can't be obtained, the scheduling loop is hitting a problem and failing to release its locks.

8. Corrupt databases. If a database file is corrupted, then Catalina may have problems. If catalina_schedule_jobs reports problems with cPickle.load, then something is wrong with a database file.

    exceptions.KeyboardInterrupt
      File "./catalina_schedule_jobs", line 119, in ?
        jobs_db_handle = Catalina.open_db(JOBS_DB,write)
      File "./Catalina.py", line 464, in open_db
        dict = cPickle.load(FO)
      File "", line 0, in ?

For example, if catalina_schedule_jobs reports a problem with the JOBS db, and ls -l in the Catalina HOMEDIR looks like this:

    ls -l jobs*
    -rw-rw-r--   1 kenneth  catalina        0 Nov 07 19:10 jobs
    -rw-rw-r--   1 kenneth  catalina        0 Oct 31 19:01 jobs.lock
    -rw-rw-r--   1 kenneth  catalina   809048 Nov 07 19:10 jobs_readonly
    -rw-rw-r--   1 kenneth  catalina        0 Jun 27 16:27 jobs_readonly.lock

that means the new jobs file was not written successfully. Unless Catalina is in the process of writing out the new db file, the writable file should be the same size as the readonly version. Here, the 0 size of 'jobs' does not match the 809048 of 'jobs_readonly'.

To recover from corrupt databases, it may be possible to use the *_readonly copy of the database file. Each database has a writable copy, such as 'jobs', 'reservations', 'events', etc. A complete list can be found in the initialize_dbs script. After releasing a write-lock, Catalina copies the writable instance to the readonly instance, for example:

    cp jobs jobs_readonly

This allows read-only utilities to use the database without obtaining a read-lock first. It also means that if the writable instance, 'jobs', is corrupt, then it may be possible to do:

    cp jobs_readonly jobs

This will take the readonly instance and drop it over the writable instance. Assuming the corruption occurred in the writable instance only, the database should now be good.

Be sure to check owner, group and mode for database files. The owner and group should be loadl:catalina. Mode should be 664.
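The size comparison and mode check described above can be scripted. The following is a minimal sketch, not part of Catalina: the function name find_suspect_dbs and its arguments are hypothetical, and it assumes each database is a writable file with a '<name>_readonly' copy in the same directory. A size mismatch can also appear briefly while Catalina is writing a new file, so treat a hit as a hint rather than proof of corruption.

```python
import os
import stat


def find_suspect_dbs(homedir, names, expected_mode=0o664):
    """Return (name, reason) pairs for database files that look wrong.

    Hypothetical helper, not part of Catalina. 'names' are writable
    database file names such as 'jobs'; the read-only copy is assumed
    to be '<name>_readonly' in the same directory.
    """
    suspects = []
    for name in names:
        writable = os.path.join(homedir, name)
        readonly = writable + "_readonly"
        if not os.path.exists(writable) or not os.path.exists(readonly):
            suspects.append((name, "missing file"))
            continue
        w_stat = os.stat(writable)
        r_stat = os.stat(readonly)
        # Outside of a write in progress, both copies should be the same size.
        if w_stat.st_size != r_stat.st_size:
            suspects.append((name, "size %d != readonly size %d"
                             % (w_stat.st_size, r_stat.st_size)))
        # The guide asks for mode 664 on database files.
        if stat.S_IMODE(w_stat.st_mode) != expected_mode:
            suspects.append((name, "mode %o" % stat.S_IMODE(w_stat.st_mode)))
    return suspects
```

For the listing shown earlier, calling find_suspect_dbs on the Catalina HOMEDIR with ['jobs'] would report the 0 versus 809048 size mismatch. Owner and group are not checked here; ls -l remains the simplest way to confirm loadl:catalina.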
Current database files:

    resource_readonly
    resource
    reservations_readonly
    reservations
    old_reservations_readonly
    old_reservations
    old_jobs_readonly
    old_jobs
    jobs_readonly
    jobs
    events_readonly
    events
    configured_resources_readonly
    configured_resources
    configuration_readonly
    configuration

If it proves impossible to recover valid copies of the database files, then, as a last resort, the databases can be re-initialized. This will wipe out all info, including standing reservations and system queue times.

    ./initialize_dbs

At this point, re-create standing reservations with the create_standing_res command.

9. 'Catalina failure' email gets sent when an uncorrectable error occurs in the scheduling loop. The information will often be in the form of a python traceback.

    Date: Wed, 6 Nov 2002 19:07:26 GMT
    From: SP2 Load Leveler
    To: diegella@hpcmail.sdsc.edu, hocks@hpcmail.sdsc.edu, kenneth@hpcmail.sdsc.edu
    Subject: Catalina failure

    exceptions.KeyError 11
      File "./catalina_schedule_jobs", line 141, in ?
        Catalina.schedule_jobs(events_db_handle, jobs_db_handle, resources_db_handle, reservations_db_handle, cfg_resources_db_handle, standing_reservations_db_handle )
      File "./Catalina.py", line 2330, in schedule_jobs
        update_job_priorities(jobs_db_handle)
      File "./Catalina_LL.py", line 779, in update_job_priorities
        max_pri = float(Catalina.QOS_MAX_PRIORITY_dict[temp_job[QOS]])

This means that there was a problem in line 779 for the Catalina_LL.py file. 'KeyError 11' means that a dictionary in that line did not have the key '11'.
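The failure can be reproduced in isolation. The sketch below is illustrative only: the dictionary is a shortened stand-in with example values, and missing_qos_keys is a hypothetical helper, not a Catalina function. It shows that indexing a dictionary with an absent key raises KeyError, and how the QOS values in use could be compared against the dictionary's keys ahead of time.

```python
def missing_qos_keys(qos_max_priority, job_qos_values):
    """Return the sorted QOS values that have no entry in the dictionary."""
    return sorted(set(job_qos_values) - set(qos_max_priority))


# Shortened stand-in for a QOS priority dictionary (values are examples).
qos_max_priority = {'0': 100000000, '1': 1000000000}

try:
    max_pri = float(qos_max_priority['11'])
except KeyError as err:
    # The missing key is carried in the exception, as in 'KeyError 11'.
    failed_key = err.args[0]

# QOS values seen on jobs (hypothetical sample) checked against the dictionary.
print(missing_qos_keys(qos_max_priority, ['0', '11', '1', '11']))
```

Any value the helper returns is a QOS that needs an entry before a job using it reaches the priority calculation.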
The QOS_MAX_PRIORITY dictionary is defined in the catalina.config file:

    QOS_MAX_PRIORITY_STRING = {
    '0' : 100000000L,
    '1' : 1000000000L,
    '2' : 1000000000L,
    '3' : 1000000000L,
    '4' : 1000000000L,
    '5' : 1000000000L,
    '6' : 1000000000000000L,
    '7' : 10000000000L,
    '8' : 10000000000L,
    '9' : 10000000000000000000L,
    '10' : 1000000000000000L
    }

This was changed to:

    QOS_MAX_PRIORITY_STRING = {
    '0' : 100000000L,
    '1' : 1000000000L,
    '2' : 1000000000L,
    '3' : 1000000000L,
    '4' : 1000000000L,
    '5' : 1000000000L,
    '6' : 1000000000000000L,
    '7' : 10000000000L,
    '8' : 10000000000L,
    '9' : 10000000000000000000L,
    '10' : 1000000000000000L,
    '11' : 100000000000000L
    }

and that fixed the problem.

10. LoadL job is trying to start repeatedly.

    ./show_events | more

This will display job start events and resulting return codes:

    --------------------------------------------------------------------------------
    Tue Jun 1 16:15:26 2004
    cmd : /catalina/install/rj_LL ds001 14703 0 ds100
    name : run_jobs
    return_string : ['adding to nodelist (ds100)\n', 'rc from ll_start_job is >0<\n' ]
    --------------------------------------------------------------------------------

If the return code from ll_start_job is non-zero, look up the interpretation for that code in /usr/lpp/LoadL/full/include/llapi.h. The NegotiatorLog may also have information on what went wrong with the job start.

Here's an example:

    ./show_events | more
    ...
    list (ds266)\n', 'adding to nodelist (ds266)\n', 'adding to nodelist (ds266)\n', 'adding to nodelist (ds266)\n', 'adding to nodelist (ds266)\n', 'rc from ll_start_job is >-10<\n']
    ...

/usr/lpp/LoadL/full/include/llapi.h shows:

    /***********************************************************************
     * Status codes to support external scheduler.
     **********************************************************************/
    #define API_OK                 0   /* API call runs to complete */
    #define API_INVALID_INPUT     -1   /* Invalid input */
    #define API_CANT_CONNECT      -2   /* can't connect to CM */
    #define API_CANT_MALLOC       -3   /* out of memory */
    #define API_CONFIG_ERR        -4   /* Error from init_params() */
    #define API_CANT_FIND_PROC    -5   /* can't find proc */
    #define API_CANT_TRANSMIT     -6   /* xdr error */
    #define API_CANT_AUTH         -7   /* can't authorize */
    #define API_WRNG_PROC_VERSION -8   /* Wrong proc version */
    #define API_WRNG_PROC_STATE   -9   /* Wrong proc state */
    #define API_MACH_NOT_AVAIL    -10  /* Machine not available */
    ...

So, a node is not available. /var/loadl/log/NegotiatorLog shows:

    06/01 21:38:25 TI-954793 Received Start Command for step ds001.14802.0
    06/01 21:38:25 TI-954793 machine:ds114 class: normal starter_inuse=0
    06/01 21:38:25 TI-954793 machine:ds114 class: normal starter_inuse=0
    06/01 21:38:25 TI-954793 machine:ds114 class: normal starter_inuse=0
    06/01 21:38:25 TI-954793 machine:ds114 class: normal starter_inuse=0
    ...
    06/01 21:38:25 TI-954793 Machine ds121 not available.

llstatus ds121 shows:

    Name   Schedd  InQ  Act  Startd  Run  LdAvg  Idle  Arch   OpSys
    ds121  Down      0    0  Idle      0  0.00   9999  R6000  AIX52

So, something is wrong with ds121. LoadL thinks it's Idle, but won't start a job on it.

11. Why is the job not starting, even though there is a reservation?

    ./show_q | more

Check the RES_START value for the job. If it's None, the scheduler can't find nodes and time for the job.

If the job has been bound to the reservation, and there is not enough time in the reservation, the job will never start. It may be possible to use unbind_job_from_res to allow the job to run past the end of the reservation. If there are no jobs scheduled after the end of the reservation, the job should run.

If the RES_START time is some value in the future, outside the reservation, there are a couple of possibilities. The reservation may not be allowing the job in.
To check the jobs allowed into the reservation, run

    ./show_res --res= --job_rest

This will print out the Python code used to filter the job. Check to see that the appropriate user is allowed to run:

    ds001 $ ./show_res --res=1084809696 --job_rest
    SERVERMODE (TEST)
    TZ (GMT)
    ResID        JobRestriction
    1084809696   (if input_tuple[0]['user'] in ['kenneth'] or input_tuple[0]['group'] in [] or input_tuple[0]['account'] in [] : result = 0 )
    ds001 $

This means that if the job's user is 'kenneth', the job is allowed into the reservation.

Another possibility is that the job would run past the end of the reservation, and higher priority jobs are scheduled after the reservation. Check the wall clock limit of the job to see if it would run past the end of the reservation. If so, it may be possible to run the job by increasing its priority with update_system_priority.
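Both checks in step 11 can be tried offline. The sketch below is a rough illustration under stated assumptions, not Catalina's own code: job_allowed and fits_in_reservation are hypothetical names, the job is a plain dictionary with made-up group and account values, 'result' is assumed to start non-zero so that only an explicit result = 0 admits the job, and all times are assumed to be in seconds. Because exec runs the restriction text as code, only feed it text taken from your own show_res output.

```python
def job_allowed(restriction, job):
    """Run a job restriction string against one job dictionary.

    Assumption: the restriction reads input_tuple[0] and sets
    result = 0 to admit the job; any other outcome is a rejection.
    """
    namespace = {'input_tuple': (job,), 'result': 1}
    exec(restriction, namespace)
    return namespace['result'] == 0


def fits_in_reservation(start, wall_clock_limit, res_end):
    """True if a job starting at 'start' finishes by 'res_end' (all in seconds)."""
    return start + wall_clock_limit <= res_end


# Restriction text as printed by show_res --job_rest in the example above.
restriction = ("if input_tuple[0]['user'] in ['kenneth'] "
               "or input_tuple[0]['group'] in [] "
               "or input_tuple[0]['account'] in [] : result = 0")

# Hypothetical job record; only 'user' matters for this restriction.
job = {'user': 'kenneth', 'group': 'staff', 'account': 'abc123'}

print(job_allowed(restriction, job))
print(fits_in_reservation(start=0, wall_clock_limit=7200, res_end=3600))
```

If the first check passes and the second fails, the job is admitted by the restriction but its wall clock limit carries it past the end of the reservation, which is the case where raising its priority may help.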