                                                  (17 Jul 97)

          **************************************
          *                                    *
          * Section 5 - Programmer's Reference *
          *                                    *
          **************************************

This section describes features of the GAMESS implementation
which are true for all machines.  See the section 'hardware
specifics' for information on each machine type.  The contents
of this section are:

    o  Installation overview (sequential mode)
    o  Files comprising the GAMESS distribution.
    o  Altering program limits
    o  Names of source code modules
    o  Programming conventions
    o  Parallel version of GAMESS
          philosophy
          TCGMSG toolkit
          installation process
          execution process
          load balancing
          timing examples
          broadcast identifiers
    o  Disk files used by GAMESS
    o  Contents of DICTNRY master file

Installation overview

GAMESS will run on a number of different machines under
FORTRAN 77 compilers.  However, even given the F77 standard
there are still a number of differences between various
machines.  For example some machines have 32 bit word lengths,
requiring the use of double precision, while others have 64
bit words and are used in single precision.

Although there are many types of computers, there is only
one (1) version of GAMESS.

This portability is made possible mainly by keeping machine
dependencies to a minimum (that is, writing in F77, not vendor
specific language extensions).  The unavoidable few statements
which do depend on the hardware are commented out, for example,
with "*IBM" in columns 1-4.  Before compiling GAMESS on an IBM
machine, these four columns must be replaced by 4 blanks.  The
process of turning on a particular machine's specialized code
is dubbed "activation".

A semi-portable FORTRAN 77 program to activate the desired
machine dependent lines is supplied with the GAMESS package as
program ACTVTE.  Before compiling ACTVTE on your machine, use
your text editor to activate its very few machine dependent
lines.  Be careful not to change the DATA initialization!
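As a rough illustration of what activation does, the following
Python sketch replaces a machine tag in columns 1-4 by four
blanks and passes every other line through unchanged.  This is
not the ACTVTE program itself: the function name 'activate' and
the sample source lines are hypothetical, and the precision
conversions that ACTVTE also performs are omitted.

```python
def activate(lines, tag="*IBM"):
    """Return a copy of lines with the machine tag in columns 1-4 blanked.

    Hypothetical helper for illustration only; the real ACTVTE is a
    FORTRAN 77 program and does more than this.
    """
    if len(tag) != 4:
        raise ValueError("machine tag must occupy exactly columns 1-4")
    activated = []
    for line in lines:
        if line.startswith(tag):
            # Replace the four tag columns by four blanks.
            activated.append("    " + line[4:])
        else:
            # Lines for other machines stay commented out.
            activated.append(line)
    return activated


# Hypothetical sample lines, not taken from the GAMESS source.
sample = [
    "*IBM  CALL TIMIBM(T)",
    "*VMS  CALL TIMVMS(T)",
    "      X = 1.0D+00",
]
for line in activate(sample, "*IBM"):
    print(line)
```

Only the line tagged for the selected machine changes, so the
same master source can be activated for any supported machine.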
The task of building an executable form of GAMESS is this:

           activate     compile       link
     *.SRC  --->  *.FOR  --->  *.OBJ  --->  *.EXE
     source      FORTRAN      object      executable
      code        code         code         image

where the intermediate files *.FOR and *.OBJ are discarded
once the executable has been linked.  It may seem odd at first
to delete FORTRAN code, but this can always be reconstructed
from the master source code using ACTVTE.

The advantage of maintaining only one master version is
obvious.  Whenever any improvements are made, they are
automatically in place for all the currently supported
machines.  There is no need to make the same changes in a
plethora of other versions.

The control language needed to activate, compile, and link
GAMESS on your brand of computer is probably present on the
distribution tape.  These files should not be used without
some examination and thought on your part, but should give
you a starting point.

There may be some control language procedures for one computer
that cannot be duplicated on another.  However, some general
comments apply:  Files named COMP will compile a single module.
COMPALL will compile all modules.  LKED will link together an
executable image.  RUNGMS will run a GAMESS job, and RUNALL
will run all the example jobs.

The first step in installing GAMESS should be to print the
manual.  If you are reading this, you've got that done!  The
second step would be to get the source code activator compiled
and linked (note that the activator must be activated manually
before it is compiled).  Third, you should now compile all the
source modules (if you have an IBM, you should also assemble
the two provided files).  Fourth, link the program.  Finally,
run all the short tests, and very carefully compare the key
results shown in the 'sample input' section against your
outputs.  These "correct" results are from an IBM RS/6000, so
there may be very tiny (last digit) precision differences for
other machines.  That's it!
Before starting the installation, you should read the pages
describing your computer in the 'Hardware Specifics' section
of the manual.  There may be special instructions for your
machine.

Files for GAMESS

    *.DOC        The files you are reading now.  You should
                 print these on 8.5 by 11 inch white paper,
                 using column one as carriage control.  Double
                 sided, 3 hole, 10 pitch laser output is best!
    *.SRC        source code for each module
    *.ASM        IBM mainframe assembler source
    *.C          C code used by some UNIX systems.
    EXAM*.INP    29 short test jobs (see TESTS.DOC).

These are files related to some utility programs:

    ACTVTE.CODE  Source code activator.  Note that you must
                 use a text editor to MANUALLY activate this
                 program before using it.
    MBLDR.*      model builder (internal to Cartesian)
    CARTIC.*     Cartesian to internal coordinates
    CLENMO.*     cleans up $VEC groups

There are files related to X windows graphics.  See the file
INTRO.MAN for their names.

The remaining files are command language for the various
machines.

    *.COM        VAX command language.  PROBE is especially
                 useful for persons learning GAMESS.
    *.MVS        IBM command language for MVS (dreaded JCL).
    *.CMS        IBM command language for CMS.  These should
                 be copied to filetype EXEC.
    *.CSH        UNIX C shell command language.  These should
                 have the "extension" omitted, and have their
                 mode changed to executable.

Altering program limits

Almost all arrays in GAMESS are allocated dynamically, but
some variables must be held in common as their use is
ubiquitous.  An example would be the common block which holds
the basis set.
The following Unix script, which we call 'mung', changes the
PARAMETER statements that set various limitations:

    #!/bin/csh
    #
    #   automatically change GAMESS' built-in dimensions
    #
    chdir /u3/mike/gamess/source
    #
    foreach FILE (*.src)
       set FILE=$FILE:r
       echo ===== redimensioning in $FILE =====
       echo "C 01 JAN 97 - SELECT NEW DIMENSIONS" \
                   > $FILE.munged
       sed -e "/MXATM=500/s//MXATM=100/" \
           -e "/MXFRG=50/s//MXFRG=1/" \
           -e "/MXDFG=5/s//MXDFG=1/" \
           -e "/MXPT=100/s//MXPT=1/" \
           -e "/MXAOCI=768/s//MXAOCI=768/" \
           -e "/MXRT=100/s//MXRT=100/" \
           -e "/MXSP=100/s//MXSP=1/" \
           -e "/MXTS=2500/s//MXTS=1/" \
           -e "/MXSH=1000/s//MXSH=1000/" \
           -e "/MXGSH=30/s//MXGSH=30/" \
           -e "/MXGTOT=5000/s//MXGTOT=5000/" \
           $FILE.src >> $FILE.munged
       mv $FILE.munged $FILE.src
    end
    exit

In this script,

    MXATM = max number of atoms
    MXFRG = max number of effective fragment potentials
    MXDFG = max number of different effective fragments
    MXPT  = max number of effective fragment points
    MXAOCI= max number of basis functions in CI/MCSCF
    MXRT  = max number of CI roots saved by $GUGDIA
    MXSP  = max number of spheres in PCM
    MXTS  = max number of tesserae in PCM
    MXSH  = max number of symmetry unique shells
    MXGSH = max number of Gaussians per shell
    MXGTOT= max number of symmetry unique Gaussians

The script shows how to -minimize- memory use, by a small
decrease in the number of atoms, and turning off the effective
fragment and PCM dimensioning.  Little can be saved by
reducing the other adjustable parameters.  Of course, the
'mung' script shown above could also be used to increase the
dimensions...

If you are really trying to save memory, you can also link
only the core program (SCF energy and gradients) by selecting
FULLGAMESS=false when linking.  Linking to the dummy routines
in STUB.SRC will omit the CI, MCSCF, MP2, analytic Hessian,
MOPAC, effective fragment, and PCM code.  This roughly halves
the size of the program's executable image, and may be useful
on a small memory machine.
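For sites without csh and sed, the substitutions made by 'mung'
might be sketched in Python as below.  This is an illustration
under stated assumptions, not part of the GAMESS distribution:
the names NEW_LIMITS and 'redimension' and the sample PARAMETER
line are hypothetical, the function works on a string rather
than on the *.src files, and only the limits that 'mung'
actually changes are listed.  Unlike the sed commands, it
matches whole numbers only and replaces every occurrence on a
line.

```python
import re

# (old, new) values copied from the 'mung' script; the entries that
# 'mung' leaves unchanged (MXAOCI, MXRT, MXSH, MXGSH, MXGTOT) are omitted.
NEW_LIMITS = {
    "MXATM": (500, 100),
    "MXFRG": (50, 1),
    "MXDFG": (5, 1),
    "MXPT": (100, 1),
    "MXSP": (100, 1),
    "MXTS": (2500, 1),
}


def redimension(text, limits=NEW_LIMITS):
    """Return text with each NAME=old replaced by NAME=new."""
    for name, (old, new) in limits.items():
        # The trailing \b keeps MXPT=100 from matching inside MXPT=1000.
        pattern = r"\b%s=%d\b" % (name, old)
        text = re.sub(pattern, "%s=%d" % (name, new), text)
    return text


# Hypothetical PARAMETER statement, for illustration only.
sample = "      PARAMETER (MXATM=500, MXSH=1000, MXTS=2500)"
print(redimension(sample))
```

To increase a dimension instead, swap in a different (old, new)
pair; the old value must match what is currently in the source.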
Names of source code modules

The source code for GAMESS is divided into a number of
sections, called modules, each of which does related things,
and is a handy size to edit.  The following is a list of the
different modules, what they do, and notes on their machine
dependencies.

                                                    machine
    module   description                            dependency
    -------  -------------------------              ----------
    BASECP   SBK and HW valence basis sets
    BASEXT   DH, MC, 6-311G extended basis sets
    BASHUZ   Huzinaga MINI/MIDI basis sets to Xe
    BASHZ2   Huzinaga MINI/MIDI basis sets Cs-Rn
    BASN21   N-21G basis sets
    BASN31   N-31G basis sets
    BASSTO   STO-NG basis sets
    BLAS     level 1 basic linear algebra subprograms
    CPHF     coupled perturbed Hartree-Fock          1
    CPROHF   open shell/TCSCF CPHF                   1
    DRC      dynamic reaction coordinate
    ECP      pseudopotential integrals
    ECPHW    Hay/Wadt effective core potentials
    ECPLIB   initialization code for ECP
    ECPSBK   Stevens/Basch/Krauss/Jasien/Cundari ECPs
    EIGEN    Givens-Householder, Jacobi diagonalization
    EFDRVR   fragment only calculation drivers
    EFELEC   fragment-fragment interactions
    EFGRD2   2e- integrals for EFP numerical hessian
    EFGRDA   ab initio/fragment gradient integrals
    EFGRDB    "     "       "      "        "
    EFGRDC    "     "       "      "        "
    EFINP    effective fragment potential input
    EFINTA   ab initio/fragment integrals
    EFINTB    "     "       "      "
    EFPAUL   effective fragment Pauli repulsion
    FFIELD   finite field polarizabilities
    FRFMT    free format input scanner
    GAMESS   main program, single point energy
             and energy gradient drivers, misc.
    GRADEX   traces gradient extremals
    GRD1     one electron gradient integrals
    GRD2A    two electron gradient integrals         1
    GRD2B    specialized sp gradient integrals
    GRD2C    general spdfg gradient integrals
    GUESS    initial orbital guess
    GUGDGA   Davidson CI diagonalization             1
    GUGDGB      "     "        "                     1
                                          (continued...)
                                                    machine
    module   description                            dependency
    -------  -------------------------              ----------
    GUGDM    1 particle density matrix
    GUGDM2   2 particle density matrix               1
    GUGDRT   distinct row table generation
    GUGEM    GUGA method energy matrix formation     1
    GUGSRT   sort transformed integrals              1
    GVB      generalized valence bond HF-SCF         1
    HESS     hessian computation drivers
    HSS1A    one electron hessian integrals
    HSS1B     "     "        "       "
    HSS2A    two electron hessian integrals          1
    HSS2B     "     "        "       "
    INPUTA   read geometry, basis, symmetry, etc.
    INPUTB    "     "        "       "
    INPUTC    "     "        "       "
    INT1     one electron integrals
    INT2A    two electron integrals                  1
    INT2B     "     "        "
    IOLIB    input/output routines,etc.              2
    LAGRAN   CI Lagrangian matrix                    1
    LOCAL    various localization methods            1
    LOCCD    LCD SCF localization analysis
    LOCPOL   LCD SCF polarizability analysis
    MCCAS    FOCAS/SOSCF MCSCF calculation           1
    MCQDPT   multireference perturbation theory      1,2
    MCQUD    QUAD MCSCF calculation                  1
    MCSCF    FULLNR MCSCF calculation                1
    MCTWO    two electron terms for FULLNR MCSCF     1
    MOROKM   Morokuma energy decomposition           1
    MP2      2nd order Moller-Plesset                1
    MP2GRD   CPHF and density for MP2 gradients      1
    MPCDAT   MOPAC parameterization
    MPCGRD   MOPAC gradient
    MPCINT   MOPAC integrals
    MPCMOL   MOPAC molecule setup
    MPCMSC   miscellaneous MOPAC routines
    MTHLIB   printout, matrix math utilities         1
    NAMEIO   namelist I/O simulator
    ORDINT   sort atomic integrals                   1
    PARLEY   communicate to other programs
    PCM      Polarizable Continuum Model setup
    PCMCAV   PCM cavity creation
    PCMDER   PCM gradients
    PCMDIS   PCM dispersion energy
    PCMPOL   PCM polarizabilities
    PCMVCH   PCM repulsion and escaped charge
    PRPEL    electrostatic properties
    PRPLIB   miscellaneous properties
    PRPPOP   population properties
                                          (continued...)
                                                    machine
    module   description                            dependency
    -------  -------------------------              ----------
    RHFUHF   RHF, UHF, and ROHF HF-SCF               1
    RXNCRD   intrinsic reaction coordinate
    RYSPOL   roots for Rys polynomials
    SCFLIB   HF-SCF utility routines, DIIS code
    SCRF     self consistent reaction field
    SOBRT    full Breit-Pauli spin-orbit coupling
    SOFFAC   spin-orbit matrix element form factors
    SOZEFF   1e- spin-orbit coupling terms
    STATPT   geometry and transition state finder
    STUB     small version dummy routines
    SURF     PES scanning
    SYMORB   orbital symmetry assignment
    SYMSLC      "        "         "                 1
    TCGSTB   stub routines to link a serial GAMESS
    TDHF     time-dependent Hartree-Fock NLO         1
    TRANS    partial integral transformation         1
    TRFDM2   two particle density backtransform      1
    TRNSTN   CI transition moments
    TRUDGE   nongradient optimization
    UNPORT   unportable, nasty code                  3,4,5,6,7,8
    VECTOR   vectorized version routines             9
    VIBANL   normal coordinate analysis
    ZHEEV    complex matrix diagonalization ?
    ZMATRX   internal coordinates

Ordinarily, you will not use STUB.SRC, which is linked only
if your system has a very small amount of physical memory.

In addition, the IBM mainframe version uses the following
assembler language routines:  ZDATE.ASM, ZTIME.ASM.
UNIX versions may use the C code:  ZUNIX.C, ZMIPS.C.

The machine dependencies noted above are:

    1) packing/unpacking
    2) OPEN/CLOSE statements
    3) machine specification
    4) fix total dynamic memory
    5) subroutine walkback
    6) error handling calls
    7) timing calls
    8) LOGAND function
    9) vector library calls

Programming Conventions

The following "rules" should be followed when making changes
in GAMESS.  They are important in maintaining portability,
and should be strictly adhered to.

Rule 1.  If there is a way to do it that works on all
computers, do it that way.  Commenting out statements for the
different types of computers should be your last resort.  If
it is necessary to add lines specific to your computer, PUT IN
CODE FOR ALL OTHER SUPPORTED MACHINES.
Even if you don't have access to all the types of supported
hardware, you can look at the other machine specific examples
found in GAMESS, or ask for help from someone who does
understand the various machines.  If a module does not already
contain some machine specific statements (see the above list)
be especially reluctant to introduce dependencies.

Rule 2.  a) Use IMPLICIT DOUBLE PRECISION(A-H,O-Z)
specification statements throughout.  b) All floating point
constants should be entered as if they were in double
precision.  The constants should contain a decimal point and
a signed two digit exponent.  A legal constant is 1.234D-02.
Illegal examples are 1D+00, 5.0E+00, and 3.0D-2.  c) Double
precision BLAS names are used throughout, for example DDOT
instead of SDOT.

The source code activator ACTVTE will automatically convert
these double precision constructs into the correct single
precision expressions for machines that have 64 rather than
32 bit words.

Rule 3.  FORTRAN 77 allows the use of generic functions.
Thus the routine SQRT should be used in place of DSQRT, as
this will automatically be given the correct precision by the
compilers.  Use ABS, COS, INT, etc.  Your compiler manual will
tell you all the generic names.

Rule 4.  Every routine in GAMESS begins with a card
containing the name of the module and the routine.  An example
is "C*MODULE xxxxxx *DECK yyyyyy".  The second star is in
column 18.  Here, xxxxxx is the name of the module, and yyyyyy
is the name of the routine.  Furthermore, the individual decks
yyyyyy are stored in alphabetical order.  This rule is
designed to make it easier for a person completely unfamiliar
with GAMESS to find routines.  The trade off for this is that
the driver for a particular module is often found somewhere in
the middle of that module.

Rule 5.  Whenever a change is made to a module, this should
be recorded at the top of the module.
The information required is the date, initials of the person
making the change, and a terse summary of the change.

Rule 6.  No lower case characters, no more than 6 letter
variable names, no embedded tabs, statements must lie between
columns 7 and 72, etc.  In other words, old style syntax.

                        * * *

The next few "rules" are not adhered to in all sections of
GAMESS.  Nonetheless they should be followed as much as
possible, whether you are writing new code, or modifying an
old section.

Rule 7.  Stick to the FORTRAN naming convention for integer
(I-N) and floating point variables (A-H,O-Z).  If you've ever
worked with a program that didn't obey this, you'll understand
why.

Rule 8.  Always use a dynamic memory allocation routine that
calls the real routine.  A good name for the memory routine is
formed by replacing the last letter of the real routine's name
with the letter M, for memory.

Rule 9.  All the usual good programming techniques, such as
indented DO loops ending on CONTINUEs, IF-THEN-ELSE where this
is clearer, 3 digit statement labels in ascending order, no
three branch GO TO's, descriptive variable names, 4 digit
FORMATs, etc, etc.

The next set of rules relates to coding practices which are
necessary for the parallel version of GAMESS to function
sensibly.  They must be followed without exception!

Rule 10.  All open, rewind, and close operations on
sequential files must be performed with the subroutines
SEQOPN, SEQREW, and SEQCLO respectively.  You can find these
routines in IOLIB; they are easy to use.

Rule 11.  All READ and WRITE statements for the formatted
files 5, 6, 7 (variables IR, IW, IP, or named files INPUT,
OUTPUT, PUNCH) must be performed only by the master task.
Therefore, these statements must be enclosed in
"IF (MASWRK) THEN" clauses.  The MASWRK variable is found in
the /PAR/ common block, and is true on the master process
only.  This avoids duplicate output from slave processes.
At the present time, all other disk files in GAMESS also obey
this rule.

Rule 12.
All error termination is done by means of "CALL ABRT" rather
than a STOP statement.  Since this subroutine never returns,
it is OK to follow it with a STOP statement, as compilers may
not be happy without a STOP as the final executable statement
in a routine.

Parallel version of GAMESS

Under the auspices of a 1991 joint ARPA/Air Force project, we
began to parallelize GAMESS.  Today, nearly all ab initio
methods run in parallel, although many of these still have a
step or two running sequentially only.  Only the MP2 energy
for MCSCF, and RHF MP2 gradients have no parallel method
coded.  We have not parallelized the semi-empirical MOPAC
runs, and probably never will.  Current information about the
parallel implementation is given below.  Additional parallel
work is in progress, under a new DoD software initiative which
"kicked off" in 1996.

If a parallel linked version of GAMESS is run on only one
node, it behaves as if it is a sequential version, and the
full functionality of the program is available to you.

                        * * *

The two major philosophies for distributed memory MIMD
(multiple instruction on multiple data) programs are

1) Have a master program running on one node do all of the
   work, except that smaller slave programs running on the
   other nodes are called to do their part of the compute
   intensive tasks, or
2) Have a single program duplicate all work except for
   compute intensive code where each node does only its
   separate piece of the work (SPMD, which means single
   program, multiple data).

We have chosen to implement the SPMD philosophy in GAMESS for
several reasons.  The first of these is that only one code is
required (not master and slave codes).  Therefore, two
separate GAMESS codes do not need to be maintained.  The
second reason is also related to maintenance.  GAMESS is
constantly evolving as new code is incorporated into it.
The parallel calls are "hidden" at the lowest possible
subroutine levels to allow programmers to add their code with
a minimum of extra effort to parallelize it.  Therefore, new
algorithms or methods are available to all nodes.  The final
reason given here is that duplication of computation generally
cuts down on communication.

The only portion of the master/slave concept to survive in
GAMESS is that the first process (node 0) handles reading all
input lines and does all print out and PUNCH file output, as
well as all I/O to the DICTNRY master file.  In this sense
node 0 is a "master".

                        * * *

Several tools are available for parallelization of codes.  We
have chosen to use the parallelization tool TCGMSG from Robert
Harrison, now at Pacific Northwest Laboratory.  This message
passing toolkit has been ported to many UNIX machines and was
written specifically for computational chemistry.  It works on
distributed memory MIMD systems, on Ethernetworks of ordinary
workstations, and on shared memory parallel computers.  Thus
TCGMSG allows one to run parallel GAMESS on a fairly wide
assortment of hardware.

Be sure to note that TCGMSG does support communication between
Ethernet workstations of different brands and/or speeds.  For
example, we have been able to run on a 3 node parallel system
built from a DECstation, an SGI Iris, and an IBM RS/6000!
(see XDR in $SYSTEM.)  It is also useful to note that your
Ethernet parallel system does not have to contain a power of
two number of nodes.

TCGMSG uses the best interprocess communication available on
the hardware being used.  For an Ethernetwork of workstations,
this means that TCP/IP sockets are used for communication.  In
turn, this means it is extremely unlikely you will be able to
include non-Unix systems in an Ethernet parallel system.

                        * * *

If you are trying to run on a genuine parallel system on which
TCGMSG does not work, you may still be in luck.
The "stubs" TCGSTB.SRC can be used to translate from the
TCGMSG calls sprinkled throughout GAMESS to some other message
passing language.  For example, we run GAMESS on the IBM SP2
(using MPI), Intel Paragon (using Intel's own library), and
Thinking Machines CM-5 (using TMC's CMMD library) in this way,
so there is no need to install TCGMSG to run GAMESS on these
systems.  The Cray T3D uses a special code written at Cray to
convert TCGMSG calls to PVM, so again, TCGMSG is not used.
All of these systems should compile for parallel execution
just by picking the appropriate "target" in the compiling
scripts.

                        * * *

Our experience with parallel GAMESS is that it is quite robust
in production runs.  In other words, most of the grief comes
during the installation phase!  TCGMSG will install and
execute without any special privileges.

The first step in getting GAMESS to run in parallel on
machines using TCGMSG is to link GAMESS in sequential mode,
against the object file from TCGSTB.SRC, and ensure that the
program is working correctly in sequential mode.

Next, obtain a copy of the TCGMSG toolkit.  This is available
by anonymous ftp from ftp.tcg.anl.gov.  Go to the directory
/pub/tcgmsg and, using binary mode, transfer the file
tcgmsg.4.04.tar.Z (or higher version).

Unpack this file with 'uncompress' and 'tar -xvf'.  The only
modification we make to TCGMSG before compiling it is to
remove all -DEVENTLOG flags from the prototype file
tcgmsg/ipcv4.0/Makefile.proto.  Then, use the makefile
provided to build the TCGMSG library

    chdir ~/tcgmsg
    make all MACHINE=IBMNOEXT

If your machine is not an IBM RS/6000, substitute the name of
your machine instead.  At this point you should try the simple
"hello" example,

    chdir ipcv4.0
    parallel hello

to make sure TCGMSG is working.

Finally, link GAMESS against the libtcgmsg.a library instead
of tcgstb.o to produce a parallel executable for GAMESS.  It
is not necessary to recompile to accomplish this.  Instead
just change the 'lked' script and relink.
                        * * *

Execute GAMESS on systems which do not use TCGMSG by picking
the appropriate "target" with the 'pargms' script.

                        * * *

Execute GAMESS under TCGMSG by picking the "tcgmsg" target
within the 'pargms' script, according to the directions within
that script.  You also must create a file named 'gamess.p' in
your home directory, such as

    # user, host, nproc, executable, workdir
    theresa si.fi.ameslab.gov 1
           /u/theresa/gamess/gamess.01.x /scr/theresa
    windus ge.fi.ameslab.gov 1
           /u/windus/gamess/gamess.01.x /wrk/windus

The fields in each line are:  username on that machine,
hostname of that machine, number of processes to be run on
that machine, full file name of the GAMESS executable on that
machine, and working directory on that machine.  Comments
begin with a # character.  Although TCGMSG allows long lines
to continue on to a new line, as shown above, you should not
do this.  The execution script provided with GAMESS will
automatically delete work files established in the temporary
directories, but only if each host's information is given on a
single line.

A detailed explanation of each field follows:

The first hostname given must be the name of the machine on
which you run the 'pargms' script.  This script defines
environment variables specifying the location of the input and
output files.  The environment is not passed to other nodes by
TCGMSG's "parallel" program, so the master process (node 0)
running "pargms" **must** be the first line of your gamess.p
file.

The hostname may need to be the shortened form, rather than
the full dotted name, especially on SGI and Unicos.  In
general, the correct choice is whatever the response to
executing the command "hostname" is.

The processes on other workstations are generated by use of
the Unix 'rsh' command.  This means that you must set up a
.rhosts file in your home directory on each node on which you
intend to run in parallel.
This file validates login permission between the various
machines, by mapping your accounts on one machine onto the
others.  For example, the following .rhosts file might be in
Theresa's home directory on both systems,

    si.fi.ameslab.gov theresa
    ge.fi.ameslab.gov windus

You can test this by getting 'rsh' to work, by a command such
as this, (from si.fi.ameslab.gov)

    rsh ge.fi.ameslab.gov -l windus 'df'

and then also try it in the reverse direction too.

Note that the number of processes to be started on a given
machine is ordinarily one.  The only exception is if you are
running on a multiCPU box, with a common memory.  In this
case, gamess.p should contain just one line, starting n
processes on your n CPUs.  This will use shared memory
communications rather than sockets to pass messages, and is
more efficient.

The executable file does not have to be duplicated on every
node, although as shown in the example it can be.  If you have
a homogeneous Ethernet system, and there is a single file
server, the executable can be stored on this server to be read
by the other nodes by NFS.  Of course, if you have a
heterogeneous network, you must build a separate executable
for each different brand of computer you have.

At present GAMESS may write various binary files to the
working directory, depending on what kind of run you are
doing.  In fact, the only types of run which will not open
files on the other nodes are a direct SCF (non-analytic
hessian job), or direct MP2 energy.  Any core dump your job
might produce will end up in this work directory as well.

                        * * *

We have provided you with a script named 'seqgms' which will
run a TCGMSG-linked version of GAMESS using only one process,
on your current machine.  Seqgms will automatically build a
single node .p file.  Using this script means you need to keep
only a parallel-linked GAMESS executable, and yet you still
retain access to the parts of GAMESS that do not yet run in
parallel.
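The gamess.p rules described above (five fields per host, all
on one line, with the machine running 'pargms' listed first)
can be checked before a run with a short script.  The Python
sketch below is only an illustration: the function names
'parse_procgrp' and 'check_procgrp' are hypothetical and are
not part of GAMESS or TCGMSG, it assumes that a # anywhere on
a line starts a comment, and it checks only the rules stated
in this section.

```python
def parse_procgrp(text):
    """Parse gamess.p text into (user, host, nproc, exe, workdir) tuples."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 5:
            # A continued line shows up here as a short or odd field count.
            raise ValueError(
                "line %d: expected 5 fields on one line, found %d"
                % (number, len(fields))
            )
        user, host, nproc, exe, workdir = fields
        entries.append((user, host, int(nproc), exe, workdir))
    return entries


def check_procgrp(entries, local_host):
    """Verify the first host is the local one; return total process count."""
    if not entries:
        raise ValueError("no hosts listed")
    if entries[0][1] != local_host:
        raise ValueError(
            "first host is %s, but 'pargms' runs on %s"
            % (entries[0][1], local_host)
        )
    return sum(entry[2] for entry in entries)


# The example hosts from this section, each written on a single line.
SAMPLE = """\
# user, host, nproc, executable, workdir
theresa si.fi.ameslab.gov 1 /u/theresa/gamess/gamess.01.x /scr/theresa
windus ge.fi.ameslab.gov 1 /u/windus/gamess/gamess.01.x /wrk/windus
"""
print(check_procgrp(parse_procgrp(SAMPLE), "si.fi.ameslab.gov"))
```

In practice the second argument would be whatever the
'hostname' command returns on the machine running 'pargms'.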
                        * * *

We turn now to a description of the way each major parallel
section of GAMESS is implemented, and give some suggestions
for efficient usage.

                        * * *

The HF wavefunctions can be evaluated in parallel using either
conventional disk storage of the integrals, or via direct
recomputation of the integrals.  Assuming the I/O speed of
your system is good, direct SCF is *always* slower than disk
storage.  But, a direct SCF might be faster if your nodes
access disk via NFS over the Ethernet, or if you are running
on an Intel or CM-5 machine.  If you are running on
Ethernetted workstations which have large local disks on each
one, then conventional disk based SCF is probably fastest.

When you run a disk based SCF in parallel, the integral files
are opened in the work directory which you defined in your
gamess.p file.  Only the subset of integrals computed by each
node is stored on that node's disk space.  This lets you
store integral files distributed (in pieces) across all nodes,
and which are larger than will fit on any one of your
computers.

You may wish to experiment with both options, so that you
learn which is fastest on your hardware setup.

                        * * *

One of the most important issues in parallelization is load
balancing.  Currently, GAMESS has two algorithms available for
load balancing of the two-electron integrals and gradients.
The first is a simple inner loop algorithm (BALTYP=LOOP).  The
work of the innermost loop is split up so the next processor
works on the next loop occurrence.  If all processors are of
the same speed and none of the processors is dedicated to
other work (for example, an Intel), this is the most effective
load balance technique.

The second method is designed for systems where the processor
speeds may not be the same, or where some of the processors
are doing other work (such as a system of equal workstations
in which one of them might be doing other things).
In this technique, as soon as a processor finishes its
previous task, it takes the next task on the list of work to
be done.  Thus, a faster node can take more of the work,
allowing all nodes to finish the run at the same time.  This
method is implemented throughout most of GAMESS (see
BALTYP=NXTVAL in $SYSTEM).  It requires some extra work
coordinating which tasks have been done already, so NXTVAL
adds a communication penalty of about 5% to the total run
time.

All integral sections (meaning the ordinary integrals,
gradient integrals, and hessian integrals) have both LOOP and
NXTVAL balancing implemented.  Thus every part of an HF level
run involving only the energy and gradient has both load
balancing techniques.  Analytic HF hessians also have both
balancing techniques for the integral transformation step.

The parallel CI/MCSCF program also contains both balancing
algorithms, except that for technical reasons MCSCF gradient
computation will internally switch to the LOOP balancing
method for that step only.

The parallel MP2 program uses only LOOP balancing during the
MP2, although it will use either balancing method during the
preliminary SCF.

The IBM SP2, Intel, CM-5, and Cray T3D always use LOOP
balancing, since they do not use TCGMSG, and thus they will
ignore your input BALTYP.

                        * * *

You can find performance numbers for conventional and direct
SCF, as well as gradient evaluation in the paper
M.W.Schmidt, et al., J.Comput.Chem. 14, 1347-1363(1993).

The gradient code has been replaced since 1993 with a version
that is much faster, yet should still scale well.  Total job
scaling is poorer now, since the SCF step is now the
predominant part of the calculation, but note that total job
CPU is much smaller.

Below is some data collected by Paul Day in 1996 on an Intel
Paragon.  The system is glutamic acid, the computation is a
direct SCF with gradient, using 195 AOs in a DH(d,p) basis.
    #nodes   -------- AB INITIO ----------
               time     speed-up  efficiency
       1     52714.3      1.00      100%
       2     26961.8      1.96       98%
       4     13738.0      3.84       96%
       8      7049.5      7.48       93%
      16      3724.1     14.15       88%
      32      2038.0     25.87       81%
      64      1186.1     44.44       69%
     128       755.0     69.82       55%

Adding ten EFP waters to the simulation yields

    #nodes   ----------- EFP --------------
               time     speed-up  efficiency
       1     63381.0      1.00      100%
       2     33158.0      1.91       96%
       4     17304.6      3.66       92%
       8      9136.0      6.94       87%
      16      5275.6     12.01       75%
      32      3174.3     19.97       62%
      64      2548.2     24.87       39%
     128      1808.6     35.04       27%

An idea of the variation in time with basis set size can be
gained from the following runs made by Johannes Grotendorst,
Juelich, Germany, on a Cray T3E or Intel Paragon, using 32
nodes on either, for various asymmetric organic compounds,
computing the RHF energy and gradient, using the 6-31G(d)
basis set:

                                         T3E   Paragon
    taxol,     1032 AOs, CPU TIME =    546.8     --     minutes
    cAMP,       356 AOs, CPU TIME =     14.6    106.4
    luciferin,  294 AOs, CPU TIME =      8.9     67.2
    nicotine,   208 AOs, CPU TIME =      3.8     26.1
    thymine,    149 AOs, CPU TIME =      1.5     12.2
    alanine,    104 AOs, CPU TIME =      0.5      5.2
    glycine,     85 AOs, CPU TIME =      0.3      3.2

                        * * *

The MCSCF program is described in T.L.Windus, M.W. Schmidt,
M.S.Gordon, Theoret.Chim.Acta 89, 77-88(1994).  This paper
gives timings for the FULLNR MCSCF program, considering three
limits: transformation dominant, CI diagonalization dominant,
orbital improvement dominant.  The data are from IBM RS/6000
model 350s, connected by Ethernet, using LOOP balancing.  We
refer you to this paper for more details.

The SOSCF MCSCF procedure shortens the integral transformation
and orbital improvement steps.  For 7-azaindole, using a
DH(d,p) basis with 165 AOs, correlating all 10 pi electrons in
9 orbitals gives 5292 CSFs.  Some data collected by Galina
Chaban in 1996 on an IBM SP2 using duplicated AO integral
files:

    # of nodes=        1     2     3     4     5     6     7     8
    2e- ints         574   575   574   575   575   574   574   573
    transformation   318   172   125    95    88    86    86    81
    Hamiltonian       15    13    12    12    11    11    11    11
    diag. H           20    11     8     6     6     5     5     4
    2e- density       13    12    12    11    11    11    11    11
    rho*pqkl           5     5     5     5     5     5     5     5
    Fock constr.      32    32    32    32    32    32    32    32
    lagr.+rot.         2     1     2     1     7     3     4    13
    total(10 its)   4618  3119  2627  2391  2270  2203  2252  2099

shows that use of more than 4 nodes is pretty silly.  Since
the SP2 has faster communications, the use of a distributed AO
integral file improves things somewhat,

    # of nodes=        1     2     3     4     5     6     7     8
    2e- ints         574   289   195   149   119    99    86    75
    transformation   318   499   504   365   238   187   176   102
    Hamiltonian       15    13    13    12    12    11    11    11
    diag. H           20    12     8     6     5     5     4     4
    2e- density       13    12    12    11    11    11    11    11
    rho*pqkl           5     5     5     5     5     5     5     5
    Fock constr.      32    17    17    10     6     4     4     3
    lagr.+rot.         2     2     1     1     1     1     1     1
    total(10 its)   4618  5866  5400  3976  2241  2284  2208  1473

which shows better performance on more nodes, but a bit of a
communication bottleneck during the transformation on 2 or 3
nodes.  The difference between duplicated and distributed AO
integrals is discussed below.

A CI or MCSCF job will open several disk files on each node.
For example, if the integral transformation is not being run
in direct mode (see DIRTRF in $TRANS) then each node will
compute and store either a full copy or a subset of the AO
integrals (see AOINTS in $TRANS).  Each node will store a
subset of the transformed integrals, the CI Hamiltonian, and
the density matrix.  The algorithm thus demands not only disk
storage on each node, but also a reasonable I/O bandwidth.  We
would not normally run the MCSCF code on an Intel Paragon or a
Thinking Machines CM-5, which are not famous for their I/O
capacity!  Similar comments apply to analytic hessians: you
must have disk storage on each node, reachable at a reasonable
bandwidth.

The integral transformation described in the TCA article is
used to drive both parallel analytic hessians and energy
localizations.
Thus the scalability of parallel hessians is much better than
described by T.L.Windus, M.W.Schmidt, M.S.Gordon,
Chem.Phys.Lett., 216, 375-379(1993), in that all steps but the
coupled Hartree-Fock are now parallel.  Current timing data for
the example discussed in this paper are given at the end of this
section.

The closed shell parallel MP2 computation is adapted from Michel
Dupuis' HONDO implementation.  A description of this method can
be found in A.M.Marquez, M.Dupuis, J.Comput.Chem., 16,
395-404(1995).

The partial integral transformation or the specialized MP2
transformation must have all AO integrals available to all
nodes, in order to produce a subset of the transformed MO
integrals on each node.  Two possible strategies for this are
   a) duplicate the AO integrals on disk on each node, or
   b) distribute a single AO list (in pieces) across all nodes,
      then broadcast each record read from disk by a particular
      node to all other nodes.
If the speed of your communications link is low compared to disk
I/O rates, as is the case with Ethernet, then strategy (a) is
better.  If your machine has high speed inter-processor
communication, such as an IBM SP2, then strategy (b) is perhaps
better.  Both methods are implemented in both transformation
routines, since one system may have paid for disk drives, while
another paid for communications speed.

The program usually chooses a sensible default for AO integral
storage: TCGMSG systems use duplicated integral files, because
they are probably communicating over Ethernet, whereas the
genuine parallel computers use distributed AO integral files.
You can select the alternative strategy by use of the AOINTS
keyword in $TRANS (AOINTS in $MP2 for MP2 level computations).
Shared memory computers such as IBM or DEC SMP systems using
TCGMSG will have high communications rates, and should be forced
to use AOINTS=DIST.
If you have an FDDI connection between your machines, you may
want to experiment with AOINTS=DIST to see if it is of benefit,
as FDDI bandwidths are comparable to disk I/O rates (especially
if your FDDI segment is isolated from the rest of your
institution).  In these cases, GAMESS will not choose the best
method of AO integral storage by default, so you will need to
input this keyword.

Some numbers might make all this clearer: typical SCSI disk I/O
rates are 3-5 MBytes/sec, Ethernet is 1.25 MB/sec, FDDI is
10 MB/sec (note that Ethernet and FDDI bandwidth is shared by
all nodes on the segment), and the SP2 switch is 40 MB/sec.  The
typical AO integral file is measured in GBytes, and broadcasting
it places considerable stress on a low speed communication
medium.

Note that both transformations, if run in direct mode, compute
the full AO integral list on each node during each pass, a
blatant sequential bottleneck.  Only the time to carry out the
transformation is being parallelized.

Most types of ab initio runs should now execute in parallel.
However, only the code for HF energies and gradients is mature,
so several sequential bottlenecks remain.  The following steps
of a parallel run will be conducted sequentially by the master:

   HF: solution of SCF equations
   MCSCF: solution of Newton-Raphson equations
   analytic hessians: the coupled Hartree-Fock
   energy localizations: the orbital localization step
   transition moments/spin-orbit: the final property step

However, all other steps (such as the evaluation of the
underlying wavefunction) do speed up in parallel.  Other steps
which do not scale well, although they do speed up slightly,
are:

   MCSCF/CI: Hamiltonian and 2 body density generation

Future versions of GAMESS will address these bottlenecks.  In
the meantime, some trial and error will teach you how many nodes
can effectively contribute to any particular type of run.
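
As a rough aid to that trial and error, the speed-up and
efficiency columns of the timing tables earlier in this section
can be recomputed from raw times, and an apparent sequential
fraction can be estimated from Amdahl's law.  The Python sketch
below is an illustration only and is not part of GAMESS.  The
function names are hypothetical, and the simple Amdahl model
assumes a run splits cleanly into a sequential part and a
perfectly parallel part, which ignores communication costs.

```python
def speedup(t1, tp):
    """Speed-up: time on one node divided by time on p nodes."""
    return t1 / tp


def efficiency(t1, tp, nodes):
    """Parallel efficiency: speed-up divided by the node count."""
    return speedup(t1, tp) / nodes


def serial_fraction(t1, tp, nodes):
    """Apparent sequential fraction s under the Amdahl model.

    Assumed model: tp = t1 * (s + (1 - s) / nodes), solved for s.
    """
    if nodes < 2:
        raise ValueError("need at least 2 nodes to estimate s")
    return (tp / t1 - 1.0 / nodes) / (1.0 - 1.0 / nodes)


def amdahl_speedup(s, nodes):
    """Speed-up predicted by the same model for a given s."""
    return 1.0 / (s + (1.0 - s) / nodes)
```

For the 128 node ab initio run tabulated earlier (52714.3 on one
node, 755.0 on 128 nodes), speedup() gives 69.82 and efficiency()
gives 55%, matching the table, and the apparent sequential
fraction comes out a little under 1%.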
One example, using the same RS/6000-350 machines and molecule
(bench12.inp converted to runtyp=hessian, with 2,000,000 words,
with baltyp=loop) gives the following replacement for Table 1 of
the Chem.Phys.Lett. 216, 375-379(1993) paper:

   p=                      1          2          3
                         ---        ---        ---
   setup                0.57       0.69       0.73
   1e- ints             1.10       0.87       0.88
   huckel guess        15.77      15.74      16.17
   2e- ints           111.19      55.34      37.42
   RHF cycles         223.13     103.26      79.44
   properties           2.23       2.46       2.63
   2e- ints              --      111.28     110.97
   transformation    1113.67     552.38     381.09
   1e- hess ints       28.20      16.46      14.63
   2e- hess ints     3322.92    1668.86    1113.37
   CPHF              1438.66    1433.34    1477.32
                     -------    -------    -------
   total CPU         6258.01    3961.34    3235.27
   total wall      8623(73%)  5977(66%)  5136(63%)

so you can see the CPHF is currently a hindrance to full
scalability of the analytic hessian program.  You would not
submit this type of run to a 32 node partition!

             List of parallel broadcast identifiers

GAMESS uses TCGMSG calls to pass messages between the parallel
processes.  Every message is identified by a unique number,
hence the following list of how the numbers are used at present.
If you need to add to these, look at the existing code and use
the following numbers as guidelines to make your decision.  All
broadcast numbers must be between 1 and 32767.

       20         : Parallel timing
      100 -  199  : DICTNRY file reads
      200 -  204  : Restart info from the DICTNRY file
      210 -  214  : Pread
      220 -  224  : PKread
      225         : RAread
      230         : SQread
      250 -  265  : Nameio
      275 -  310  : Free format
      325 -  329  : $PROP group input
      350 -  354  : $VEC group input
      400 -  424  : $GRAD group input
      425 -  449  : $HESS group input
      450 -  474  : $DIPDR group input
      475 -  499  : $VIB group input
      500 -  599  : matrix utility routines
      800 -  830  : Orbital symmetry
      900         : ECP 1e- integrals
      910         : 1e- integrals
      920 -  975  : EFP and SCRF integrals
      980 -  999  : property integrals
     1000 - 1025  : SCF wavefunctions
     1030 - 1039  : reserved for Kurt
     1050         : Coulomb integrals
     1200 - 1215  : MP2
     1300 - 1320  : localization
     1495 - 1499  : reserved for Jim Shoemaker
     1500         : One-electron gradients
     1505 - 1599  : EFP and SCRF gradients
     1600 - 1602  : Two-electron gradients
     1605 - 1620  : One-electron hessians
     1650 - 1665  : Two-electron hessians
     1700 - 1750  : integral transformation
     1800         : GUGA sorting
     1850 - 1865  : GUGA CI diagonalization
     1900 - 1910  : GUGA DM2 generation
     2000 - 2010  : MCSCF
     2100 - 2120  : coupled perturbed HF

                  Disk files used by GAMESS

These files must be defined by your control language for
executing GAMESS.  For example, on UNIX the "name" field shown
below should be set in the environment to the actual file name
to be used.  Most runs will open only a subset of the files
shown below, with only files 5, 6, 7, and 10 existing in every
run.  Only files 4, 5, 6, and 7 contain formatted data.

     unit name     contents
     ---- ----     --------
      4   IRCDATA  archive results punched by IRC runs,
                   restart data for numerical HESSIAN runs,
                   summary of results for DRC.
      5   INPUT    Namelist input file.  This MUST be a disk
                   file, as GAMESS rewinds this file often.
      6   OUTPUT   Print output (FT06F001 on IBM mainframes).
                   If not defined, UNIX systems will use the
                   standard output for this file.
      7   PUNCH    Punch output.  A copy of the $DATA deck,
                   orbitals for every geometry calculated,
                   hessian matrix, normal modes from FORCE,
                   properties output, IRC restart data, etc.
      8   AOINTS   Two e- integrals in AO basis
      9   MOINTS   Two e- integrals in MO basis
     10   DICTNRY  Master dictionary, for contents see below.
     11   DRTFILE  Distinct row table file for -CI- or -MCSCF-
     12   CIVECTR  Eigenvector file for -CI- or -MCSCF-
     13   NTNFMLA  Newton-Raphson formula tape for -MCSCF-;
                   semi-transformed ints for FOCAS/SOSCF MCSCF
     14   CIINTS   Sorted integrals for -CI- or -MCSCF-
     15   WORK15   GUGA loops for diagonal elements;
                   ordered two body density matrix for MCSCF;
                   scratch storage during Davidson diag;
                   Hessian update info during 2nd order SCF;
                   [ia|jb] integrals during MP2 gradient
     16   WORK16   GUGA loops for off diagonal elements;
                   unordered two body density matrix for MCSCF;
                   orbital hessian during MCSCF;
                   orbital hessian for analytic hessian CPHF;
                   orbital hessian during MP2 gradient CPHF;
                   two body density during MP2 gradient
     17   CSFSAVE  CSF data for transition moments or SOC
     18   FOCKDER  derivative Fock matrices for analytic hess
     20   DASORT   Sort file for -MCSCF- or -CI-;
                   also used by SCF level DIIS;
                   form factor sorting for spin-orbit
     23   JKFILE   J and K "Fock" matrices for -GVB-;
                   Hessian update info during SOSCF MCSCF;
                   orbital gradient and hessian for QUAD MCSCF
     24   ORDINT   sorted AO integrals;
                   integral subsets during Morokuma analysis
     25   EFPIND   electric field integrals for EFP
     26   PCMDATA  gradient and D-inverse data for PCM runs
     27   PCMINTS  normal projections of PCM field gradients
     30   MCDIIS   DIIS data during FOCAS MCSCF

     files 50-64 are used only for MCQDPT runs:

     50   MCQD50   Direct access file for MC-QDPT, its
                   contents are documented in source code.
     51   MCQD51   One-body coupling constants for CAS-CI and
                   other routines
     52   MCQD52   One-body coupling constants for perturb.
     53   MCQD53   One-body coupling constants extracted from
                   MCQD52
     54   MCQD54   One-body coupling constants extracted
                   further from MCQD52
     55   MCQD55   Sorted 2-e integrals
     56   MCQD56   Half transformed 2-e integral
     57   MCQD57   Sorted half transformed 2-e integral of the
                   (ii/aa) type
     58   MCQD58   Sorted half transformed 2-e integral of the
                   (ei/aa) type
     59   MCQD59   2-e integral in MO basis of the (ii/ii),
                   (ei/ii), (ei/ei) types
     60   MCQD60   2-e integral in MO basis arranged for
                   perturbation calculations
     61   MCQD61   One-body coupling constants between state
                   and CSF
     62   MCQD62   Two-body coupling constants between state
                   and CSF
     63   MCQD63   canonical Fock orbitals (FORMATTED)
     64   MCQD64   Spin functions and orbital configuration
                   functions (FORMATTED)

          Contents of the direct access file 'DICTNRY'

         1. Atomic coordinates
         2. various energy quantities in /ENRGYS/
         3. Gradient vector
         4. Hessian (force constant) matrix
    int  5. ISO - symmetry operations for shells
    int  6. ISOC - symmetry operations for centers (atoms)
         7. PTR - symmetry transformation for p orbitals
         8. DTR - symmetry transformation for d orbitals
         9. not used, reserved for FTR
        10. not used, reserved for GTR
        11. Bare nucleus Hamiltonian integrals
        12. Overlap integrals
        13. Kinetic energy integrals
        14. Alpha Fock matrix (current)
        15. Alpha orbitals
        16. Alpha density matrix
        17. Alpha energies or occupation numbers
        18. Beta Fock matrix (current)
        19. Beta orbitals
        20. Beta density matrix
        21. Beta energies or occupation numbers
        22. Error function interpolation table
        23. Old alpha Fock matrix
        24. Older alpha Fock matrix
        25. Oldest alpha Fock matrix
        26. Old beta Fock matrix
        27. Older beta Fock matrix
        28. Oldest beta Fock matrix
        29. Vib 0 gradient for FORCE runs
        30. Vib 0 alpha orbitals in FORCE
        31. Vib 0 beta orbitals in FORCE
        32. Vib 0 alpha density matrix in FORCE
        33. Vib 0 beta density matrix in FORCE
        34. dipole derivative tensor in FORCE.
        35. frozen core Fock operator
        36. Lagrangian multipliers
        37. floating point part of common block /OPTGRD/
    int 38. integer part of common block /OPTGRD/
        39. ZMAT of input internal coords
    int 40. IZMAT of input internal coords
        41. B matrix of redundant internal coords
        42. not used.
        43. Force constant matrix in internal coordinates.
        44. SALC transformation
        45. symmetry adapted Q matrix
        46. S matrix for symmetry coordinates
        47. ZMAT for symmetry internal coords
    int 48. IZMAT for symmetry internal coords
        49. B matrix
        50. B inverse matrix
        51. overlap matrix in Lowdin basis,
            temp Fock matrix storage for ROHF
        52. genuine MOPAC overlap matrix
        53. MOPAC repulsion integrals
        54. exchange integrals for screening
        55. orbital gradient during SOSCF MCSCF
        56. orbital displacement during SOSCF MCSCF
        57. orbital hessian during SOSCF MCSCF
        58. not used
        59. Coulomb integrals in Ruedenberg localizations
        60. exchange integrals in Ruedenberg localizations
        61. temp MO storage for GVB and ROHF-MP2
        62. temp density for GVB
        63. dS/dx matrix for hessians
        64. dS/dy matrix for hessians
        65. dS/dz matrix for hessians
        66. derivative hamiltonian for OS-TCSCF hessians
        67. partially formed EG and EH for hessians
        68. MCSCF first order density in MO basis
        69. alpha Lowdin populations
        70. beta Lowdin populations
        71. alpha orbitals during localization
        72. beta orbitals during localization
        73. alpha localization transformation
        74. beta localization transformation
        75. fitted EFP interfragment repulsion values
     76-77. not used
        78. "Erep derivative" matrix associated with F-a terms
        79. "Erep derivative" matrix associated with S-a terms
        80. EFP 1-e Fock matrix including induced dipole terms
        81. not used
        82. MO-based Fock matrix without any EFP contributions
        83. LMO centroids of charge
        84. d/dx dipole velocity integrals
        85. d/dy dipole velocity integrals
        86. d/dz dipole velocity integrals
     87-88. not used
        89. EFP multipole contribution to one e- Fock matrix
        90. ECP coefficients
    int 91. ECP labels
        92. ECP coefficients
    int 93. ECP labels
        94. bare nucleus Hamiltonian during FFIELD runs
        95. x dipole integrals, in AO basis
        96. y dipole integrals, in AO basis
        97. z dipole integrals, in AO basis
        98. former coords for Schlegel geometry search
        99. former gradients for Schlegel geometry search
       100. not used

     records 101-248 are used for NLO properties

    101. U'x(0)       149. U''xx(-2w;w,w)     200. UM''xx(-w;w,0)
    102. y            150. xy                 201. xy
    103. z            151. xz                 202. xz
    104. G'x(0)       152. yy                 203. yz
    105. y            153. yz                 204. yy
    106. z            154. zz                 205. yz
    107. U'x(w)       155. G''xx(-2w;w,w)     206. zx
    108. y            156. xy                 207. zy
    109. z            157. xz                 208. zz
    110. G'x(w)       158. yy                 209. U''xx(0;w,-w)
    111. y            159. yz                 210. xy
    112. z            160. zz                 211. xz
    113. U'x(2w)      161. e''xx(-2w;w,w)     212. yz
    114. y            162. xy                 213. yy
    115. z            163. xz                 214. yz
    116. G'x(2w)      164. yy                 215. zx
    117. y            165. yz                 216. zy
    118. z            166. zz                 217. zz
    119. U'x(3w)      167. UM''xx(-2w;w,w)    218. G''xx(0;w,-w)
    120. y            168. xy                 219. xy
    121. z            169. xz                 220. xz
    122. G'x(3w)      170. yy                 221. yz
    123. y            171. yz                 222. yy
    124. z            172. zz                 223. yz
    125. U''xx(0)     173. U''xx(-w;w,0)      224. zx
    126. xy           174. xy                 225. zy
    127. xz           175. xz                 226. zz
    128. yy           176. yz                 227. e''xx(0;w,-w)
    129. yz           177. yy                 228. xy
    130. zz           178. yz                 229. xz
    131. G''xx(0)     179. zx                 230. yz
    132. xy           180. zy                 231. yy
    133. xz           181. zz                 232. yz
    134. yy           182. G''xx(-w;w,0)      233. zx
    135. yz           183. xy                 234. zy
    136. zz           184. xz                 235. zz
    137. e''xx(0)     185. yz                 236. UM''xx(0;w,-w)
    138. xy           186. yy                 237. xy
    139. xz           187. yz                 238. xz
    140. yy           188. zx                 239. yz
    141. yz           189. zy                 240. yy
    142. zz           190. zz                 241. yz
    143. UM''xx(0)    191. e''xx(-w;w,0)      242. zx
    144. xy           192. xy                 243. zy
    145. xz           193. xz                 244. zz
    146. yy           194. yz
    147. yz           195. yy
    148. zz           196. yz
                      197. zx
                      198. zy
                      199. zz

       245. old NLO Fock matrix
       246. older NLO Fock matrix
       247. oldest NLO Fock matrix
       249. not used
       250. transition density matrix in AO basis
       251. static polarizability tensor alpha
       252. X dipole integrals in MO basis
       253. Y dipole integrals in MO basis
       254. Z dipole integrals in MO basis
   255-299. not used
       300. Z-vector during MP2 gradient
       301. Pocc during MP2 gradient
       302. Pvir during MP2 gradient
       303. Wai during MP2 gradient
       304. Lagrangian Lai during MP2 or CI gradient
       305. Wocc during MP2 gradient
       306. Wvir during MP2 gradient
       307. P(MP2)-P(RHF) during MP2 gradient
       308. SCF density during MP2 gradient
       309. energy weighted density during MP2 gradient
       311. Supermolecule h during Morokuma
       312. Supermolecule S during Morokuma
       313. Monomer 1 orbitals during Morokuma
       314. Monomer 2 orbitals during Morokuma
       315. combined monomer orbitals during Morokuma
   316-319. not used
       320. MCSCF active orbital density
       321. MCSCF DIIS error matrix
       322. MCSCF orbital rotation indices
       323. Hamiltonian matrix during QUAD MCSCF
       324. MO symmetry labels during MCSCF
       330. CEL matrix during PCM
       331. VEF matrix during PCM
       332. QEFF matrix during PCM
       333. ELD matrix during PCM
   340-346. reserved for Kurt's code

In order to correctly pass data between different machine types
when running in parallel, a DAF record must contain only
floating point values, or only integer values.  No logical or
Hollerith data may be stored.  The final calling argument to
DAWRIT and DAREAD must be 0 or 1 to indicate floating point or
integer values are involved.  The records containing integers
are marked "int" in the list above.

Physical record 1 (containing the DAF directory) is written
whenever a new record is added to the file.  This is invisible
to the programmer.  The numbers shown above are "logical record
numbers", and are the only thing that the programmer need be
concerned with.
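
The record typing rule can be illustrated outside of GAMESS.
The Python sketch below is not GAMESS code and does not call
DAWRIT or DAREAD; the helper name daf_type_flag is hypothetical,
and the mapping 0 = floating point, 1 = integer is an assumption
taken from the order in which the two cases are listed above.
It only shows the check a caller would make before choosing the
final argument: every value in a record must be of one kind.

```python
def daf_type_flag(values):
    """Return the assumed type flag for a homogeneous record.

    0 means all floating point, 1 means all integer (assumed
    mapping).  Mixed, logical, or character data are rejected.
    """
    if not values:
        raise ValueError("a record must hold at least one value")
    # bool is a subclass of int in Python, so logical data must
    # be rejected explicitly before the integer test.
    if any(isinstance(v, bool) for v in values):
        raise TypeError("logical data may not be stored")
    if all(isinstance(v, float) for v in values):
        return 0
    if all(isinstance(v, int) for v in values):
        return 1
    raise TypeError("record mixes types or holds non-numeric data")
```

For example, a list of Cartesian coordinates such as
[0.0, 0.0, 1.4] yields flag 0, a list of integer labels such as
[1, 2, 3] yields flag 1, and a list mixing the two raises an
error rather than being written.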