Prequest: 4.4 MOL2 Format
BCCAB is the format used in-house at the CCDC to create the main-file
database. It is also the internal format used in PreQuest;
each entry is stored in a temporary working file using this format. When
an entry is edited in PreQuest using a text editor, the BCCAB format
will appear, regardless of the original input format (SHELX etc.).
The description given here is sufficient to allow the user to achieve
two objectives:
(a) to edit with the text-editor
(b) to define the important fields for a stand-alone input record
An entry in BCCAB is defined as a set of text lines of maximum length
80 characters. The first line must begin with the
"?" character, followed by the reference code.
The last line is #END.
For private databases it is best to use numeric reference codes which will
not be confused with main-file records. The reference code must be the
single unique identifier for each record, e.g. 00001234. This is the
key reference number used in the CSD database system.
A typical input record for a published structure is shown below:
?HEMTEY
#JRNL 035,59,2787,1994
#AUTHOR R.Gleiter, B.Treptow, H.Irngartinger, T.Oeser
#QUAL at 243 deg.K
#PROPS "Color: colorless.
#SYSCAT sys O cat 3
#CELL a 15.284,2 b 14.882,2 c 7.805,1
z 4 sg Pnma v 1775. cent 1
#DENSITY dx 1.24 fw 330.4
#RFACT R= 0.054
#COMPND Dimethyl hexacyclo(9.5.0.0$1,3!.0$2,10!.0$3,9!.0$9,11!)hexadeca 2,10-dicarboxylate
#SYNONM Dimethyl propella(3)prismane dicarboxylate
#FORMUL C20 H26 O4
#ATOM O1 0.8519,1 0.4546,1 0.0761,2
O2 0.7778,1 0.3475,1 -0.0662,2
C1 0.8856,1 0.3022,1 0.1253,3
C2 0.9865,1 0.3024,1 0.1573,3
C3 0.9251,1 0.3025,1 0.3078,3
C4 0.9150,1 0.3661,1 0.4540,3
C5 0.9732,2 0.3380,2 0.6058,3
C6 0.9476,3 0.2500 0.6921,5
C7 1.0560,1 0.3662,2 0.1005,3
C8 1.1455,1 0.3382,2 0.1703,4
C9 1.1831,2 0.2500 0.0997,6
C10 0.8376,1 0.3760,1 0.0449,3
C11 0.7316,2 0.4181,2 -0.1575,4
#BOND O1 C10 1.214,3
C2 C3 1.504,3
O2 C10 1.329,3
C2 C7 1.492,3
O2 C11 1.453,3
C3 C3* 1.562,3
C1 C10 1.462,3
C3 C4 1.491,3
C1 C1* 1.554,3
C4 C5 1.539,4
C1 C2 1.562,3
C5 C6 1.525,3
C1 C3 1.548,3
C7 C8 1.530,3
C2 C2* 1.560,3
C8 C9 1.535,3
#END
The entry consists of a number of data fields. Each field begins on a new
line with the character "#" followed by the field name.
There are no restrictions on the order of fields, or the spacing within
the text. The suggested minimum requirement for a private database is
the following:
?refcode
#COMPND : compound name
#AUTHOR : person responsible for data
#CELL : unit cell and space group
#ATOM : atomic coordinates
#END
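The record layout described above (a "?" line, "#" fields, a closing #END) can be split into fields with a few lines of Python. This is only a minimal sketch, not the PreQuest reader: the function names `parse_entry` and `missing_fields` are hypothetical, and the handling of repeated fields (joined with newlines) is an assumption.

```python
# Suggested minimum fields for a private database, as listed in the text.
REQUIRED_FIELDS = ("COMPND", "AUTHOR", "CELL", "ATOM")

def parse_entry(text):
    """Split one BCCAB entry into (refcode, {field_name: field_text})."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or not lines[0].startswith("?"):
        raise ValueError("entry must begin with '?' followed by the reference code")
    if lines[-1] != "#END":
        raise ValueError("entry must finish with #END")
    refcode = lines[0][1:].strip()
    fields, current = {}, None
    for line in lines[1:-1]:
        if line.startswith("#"):
            # A new field: the name is the first token after '#'.
            current, _, rest = line[1:].partition(" ")
            rest = rest.strip()
            # Assumption: a repeated field (e.g. two #QUAL lines) is appended.
            fields[current] = fields[current] + "\n" + rest if current in fields else rest
        elif current is not None:
            # Continuation line of the field started earlier.
            fields[current] += "\n" + line
    return refcode, fields

def missing_fields(fields):
    """Names from the suggested minimum that the entry does not contain."""
    return [name for name in REQUIRED_FIELDS if name not in fields]

sample = """?00001234
#COMPND Compound A12387
#CELL a 15.284,2 b 14.882,2 c 7.805,1
z 4 sg Pnma
#ATOM O1 0.8519,1 0.4546,1 0.0761,2
O2 0.7778,1 0.3475,1 -0.0662,2
#END"""
refcode, fields = parse_entry(sample)
print(refcode, missing_fields(fields))
```

Because the field order is unrestricted, the sketch stores fields in a dictionary rather than relying on position.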
It is strongly recommended that as much data as possible is entered at the
time of input as it will enrich the database and make the record much
more useful to future users. You should check through the fields listed
below and provide data if you can. Remember that #QUAL and
#RMARKS text is "searchable" by Quest, and you can
use your own keywords here.
A Note on Chemical Diagrams
At the CCDC a 2D chemical diagram is constructed using the
2D Edit function of Prequest, which will appear as the
two fields #CONN and #DIAG. There is no need to describe
the format of these fields here - if they appear in a working record,
please do not edit them. In general you do not input the diagram by typing
these fields. Prequest will make the diagram either automatically
(see Make 2D), or using the graphical interface
(see 2D Edit).
In the #ATOM field each atom has an atom label and fractional coordinates
x/a, y/b, z/c with optional estimated standard deviations (e.s.d.), e.g.
#ATOM C1 0.1234,3 -0.3456,12 0.4567,8
C1' 0.3456,2 0.2345,13 0.3456,7
Atom labels
Maximum length 8 characters. Must begin with a valid element symbol,
usually followed by numbers, e.g. Br2, C123, but any string of
alphabetic characters or quote marks is allowed, e.g. C1' C1"
H11a Ow1
Atom coordinates
Input these with a decimal point. If an e.s.d. is given, type it after
a comma with no space, e.g. 0.1234,12 means the e.s.d. is 0.0012.
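The e.s.d. convention can be illustrated with a short Python sketch. The helper name `parse_value` is hypothetical; it assumes the e.s.d. digits always apply to the last decimal places quoted for the value, as in the 0.1234,12 example.

```python
from decimal import Decimal

def parse_value(token):
    """Return (value, esd) for a token such as '0.1234,12'; esd is None if absent."""
    number, _, esd_digits = token.partition(",")
    value = Decimal(number)
    if not esd_digits:
        return float(value), None
    # Exponent of the last quoted decimal place, e.g. -4 for 0.1234.
    exponent = value.as_tuple().exponent
    # Scale the e.s.d. digits to that decimal place: 12 -> 0.0012.
    esd = Decimal(f"{int(esd_digits)}E{exponent}")
    return float(value), float(esd)

print(parse_value("0.1234,12"))
print(parse_value("0.2500"))
```

`Decimal` is used so that the number of quoted decimal places is read from the text itself rather than from a binary float.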
Suppressed flag
Suppressed atoms have an "S" after the z-coordinate. These can
be manually written using the text editor. In the example below C1'
is suppressed.
#ATOM C1 0.1234,3 -0.3456,12 0.4567,8
C1' 0.3456,2 0.2345,13 0.3456,7 S
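A single #ATOM line, including the suppressed flag, could be read as sketched below. The function name `parse_atom_line` and the returned dictionary layout are assumptions for illustration; e.s.d. values are simply dropped here, and the element-symbol check on the label is not attempted.

```python
def parse_atom_line(line):
    """Parse one atom line; a leading '#ATOM' keyword is optional."""
    tokens = line.split()
    if tokens and tokens[0] == "#ATOM":
        tokens = tokens[1:]
    if len(tokens) not in (4, 5):
        raise ValueError(f"expected label, x, y, z and optional S: {line!r}")
    label, coords = tokens[0], tokens[1:4]
    if len(label) > 8:
        raise ValueError(f"atom label longer than 8 characters: {label!r}")
    if len(tokens) == 5 and tokens[4] != "S":
        raise ValueError(f"unexpected flag after z-coordinate: {tokens[4]!r}")
    # Keep only the coordinate value; the text after the comma is the e.s.d.
    xyz = tuple(float(c.partition(",")[0]) for c in coords)
    return {"label": label, "xyz": xyz, "suppressed": len(tokens) == 5}

print(parse_atom_line("#ATOM C1 0.1234,3 -0.3456,12 0.4567,8"))
print(parse_atom_line("C1' 0.3456,2 0.2345,13 0.3456,7 S"))
```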
Author names in the #AUTHOR field should be written in the following style:
#AUTHOR A.B.Smith, J.-P.Mornon, P.Van Stappen, G.L'Abbe, P.Murray-Rust,
D. van der Helm, Yu.T.Struchkov, R.King III, E.F.Meyer Junior, Shao Mei-cheng
Note
Give initials with full stops and no spaces, and use commas to separate names.
This is consistent with the main CSD file and enables
concurrent searching.
The #BOND field contains the bond lengths reported by the author, which
correspond to the atomic coordinates listed in #ATOM. Each bond length is
described using the appropriate pair of atom labels followed by the value
of the distance. If the e.s.d. of the bond length is available, it
follows the value and is separated from it by a comma, e.g.
#BOND C1 C12 1.451,3
C1 H1 0.98,1
Note
This is optional input. Check-3D uses these author-given bonds as a
consistency check, comparing calculated and given values; they are
not vital in PreQuest.
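The idea of comparing a calculated distance with an author-given bond length can be sketched as follows. This is not the Check-3D algorithm: the function names and the 0.02 Å acceptance limit are assumptions, and symmetry-generated atoms (labels such as C3*) are not handled. The distance formula is the standard one for fractional coordinates in a general cell.

```python
import math

def distance(cell, frac1, frac2):
    """Distance in Angstroms between two fractional positions.

    cell is (a, b, c, alpha, beta, gamma) with angles in degrees.
    """
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    dx, dy, dz = (p - q for p, q in zip(frac1, frac2))
    d2 = ((a * dx) ** 2 + (b * dy) ** 2 + (c * dz) ** 2
          + 2 * b * c * dy * dz * ca
          + 2 * a * c * dx * dz * cb
          + 2 * a * b * dx * dy * cg)
    return math.sqrt(d2)

def bond_is_consistent(calculated, given, limit=0.02):
    """Hypothetical acceptance test; the 0.02 A limit is an assumption."""
    return abs(calculated - given) <= limit

# O1 and C10 from the HEMTEY record (orthorhombic cell, author bond 1.214).
cell = (15.284, 14.882, 7.805, 90.0, 90.0, 90.0)
o1 = (0.8519, 0.4546, 0.0761)
c10 = (0.8376, 0.3760, 0.0449)
d = distance(cell, o1, c10)
print(round(d, 3), bond_is_consistent(d, 1.214))
```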
This field contains the Chemical Abstracts Service Registry Number. It
takes the numeric form AAAAAA-BB-C where the first number AAAAAA can have
up to 6 digits, BB has 2 digits, C is a single check digit, e.g. 699-98-9
Note
Optional input.
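The check digit of a CAS Registry Number can be verified with the standard CAS rule: reading the digits before the check digit from right to left, multiply them by 1, 2, 3, ... and take the sum modulo 10. The sketch below uses a hypothetical function name and restricts the first group to at most 6 digits, following the description above.

```python
import re

def cas_is_valid(cas):
    """True if the text has the form AAAAAA-BB-C and the check digit matches."""
    match = re.fullmatch(r"(\d{1,6})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Rightmost digit has weight 1, the next weight 2, and so on.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

print(cas_is_valid("699-98-9"))
```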
The #CELL field contains all of the unit cell information using a variety
of keywords:
a : length of unit cell a-axis (Å)
b : length of unit cell b-axis (Å)
c : length of unit cell c-axis (Å)
alpha : value of interaxial angle alpha (in degrees)
beta : value of interaxial angle beta (degrees)
gamma : value of interaxial angle gamma (degrees)
v : volume of unit cell (in cubic Angstroms)
z : number of formula units per unit cell
sg : space group symbol (in Hermann-Mauguin notation)
cent : flag to indicate whether or not the space group origin is at
a centre of symmetry
An example of a #CELL field is:
#CELL a 6.3746,5 b 15.8638,8 c 7.7460,6
alpha 87.12,1 beta 91.34,4 gamma 93.67,4 v 776.42
z 4 sg P-1 cent 1
This example of an anorthic cell can be used to illustrate various details:
- In most reported studies the monoclinic cell is chosen with the b-axis unique,
so beta is recorded. However, if the a-axis is unique then alpha is recorded,
and likewise if the c-axis is unique then gamma is recorded.
- The unit cell volume is expressed as a real number. If the author reports
v = 1234 it is recorded as v 1234 (v is used only for check purposes and is
not archived to the database).
- The z value is recorded as an integer (number of formula units
per cell).
- The space group symbol is recorded in Hermann-Mauguin notation with the
use of two conventions. The bar symbol above a symmetry axis symbol is
replaced by a minus sign in front of the axis symbol, e.g. P-1. Suffixes
which are normally involved in screw axis symbols are recorded "in-line", e.g.
21 for a 2-fold screw axis.
Special conventions are used for the recording of monoclinic space group
symbols for a- or c-axis unique:
a-axis unique P21 is recorded as P2111
P21/n is recorded as P21/n11
c-axis unique P21 is recorded as P1121
P21/n is recorded as P1121/n
If a trigonal space group is described in terms of a rhombohedral unit cell
(a and alpha recorded) then the conventional space group symbol, e.g. R3,
is recorded as R3r.
- The cent flag is program-generated from the space group symbol and takes
the values 1 or 2.
cent 1 indicates that the space group origin is at a centre
of symmetry
cent 2 indicates that the space group origin is not at a centre of symmetry
The cent flag is directly linked to the set of general equivalent
positions (#SYMM field) which is program-generated from the space
group symbol. For cent 1 the #SYMM field contains only one half of the
general equivalent positions - those not related by the centre of symmetry
at the origin.
Certain space groups allow a choice of origin and the program default
always chooses the setting with a centre of symmetry at the space group
origin. If this choice is incorrect for a particular structure
determination then cent 2 should be manually set and the appropriate #SYMM
field input manually.
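Since v is used for check purposes, a calculated volume can be compared with the reported one. The sketch below uses the standard formula for the volume of a general unit cell; the function name and the default of 90 degrees for unrecorded angles are assumptions for illustration.

```python
import math

def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit cell volume in cubic Angstroms; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# Orthorhombic HEMTEY cell from the typical record (reported v 1775.)
print(round(cell_volume(15.284, 14.882, 7.805), 1))
```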
The #CLASS field contains the chemical class assignment for the compound.
These classes are listed below or can be seen in Quest
by typing HELP CLASS. Example:
#CLASS 5 9
Note
Each entry can be assigned up to 4 class numbers.
This is optional data but can be very useful especially for classifying
natural products.
| Chemical Class | Class Number |
| Carbohydrates | 1 |
| Nucleosides & nucleotides | 2 |
| Amino-acids, peptides and complexes | 3 |
| Porphyrins, corrins & complexes | 4 |
| Antibiotics | 5 |
| Steroids | 6 |
| Terpenes | 7 |
| Alkaloids | 8 |
| Miscellaneous natural products | 9 |
| Supramolecular entities | 10 |
| High polymers | 11 |
The #COMPND field contains the chemical compound name, following standard
rules if possible. For natural products the trivial name is usually recorded;
a separate field exists to supplement the name with one or more synonyms,
e.g. drug or trade name (see #SYNONM).
CSD conventions are:
(a) Greek lower-case letters are spelt out in full e.g. alpha
(b) Subscripts and superscripts are indicated by the use of ! sub,
and $ super
(c) Square brackets are replaced by parentheses
(d) Metal oxidation states are recorded e.g. iron(0), cobalt(ii),
molybdenum(vi)
Note
You can use trivial names or local names e.g. Compound A12387
The #FORMUL field contains the molecular formula, represented as the sum of
the individual formulae for each of the residues. A residue is defined
here as being a discrete bonded unit. For example, sodium acetate
monohydrate consists of 3 residues viz. acetate anion, sodium
cation, water molecule - its formula is recorded as:
#FORMUL C2 H3 O2 1-,Na1 1+,H2 O1
The general expression for a residue formula is:
pre-multiplier (elements with atom counts charge) post-multiplier
Elements are listed in the order:
C followed by H followed by D
followed by other elements in alphabetic order
Note
This is always used at CCDC for consistency checking. If you
use Prequest to construct the 2D chemical diagram (Make-2D
or Edit 2D) the #FORMUL field is generated automatically.
A #FORMUL field is necessary for screen generation by PreQuest
and without it the entry may be rendered unsearchable in the database.
In the example:
#FORMUL C2 H3 O2 1-,Na1 1+,H2 O1
notice that spaces separate each element-item, commas separate residues.
Charges are given at end of residue in the style 1-, 2+ etc.
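The separator rules (spaces between element items, commas between residues, charge at the end of a residue) can be sketched in Python as below. The function names are hypothetical, pre- and post-multipliers are not handled, and the atomic masses are a small illustrative subset rather than a complete table.

```python
import re

# Illustrative subset of standard atomic masses; extend as needed.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "Na": 22.990}

def parse_formula(text):
    """Return one (element_counts, charge) tuple per residue."""
    residues = []
    for residue in text.split(","):
        counts, charge = {}, 0
        for item in residue.split():
            element = re.fullmatch(r"([A-Z][a-z]?)(\d+)", item)
            sign = re.fullmatch(r"(\d+)([+-])", item)
            if element:
                symbol = element.group(1)
                counts[symbol] = counts.get(symbol, 0) + int(element.group(2))
            elif sign:
                charge = int(sign.group(1)) * (1 if sign.group(2) == "+" else -1)
            else:
                raise ValueError(f"unrecognised item in formula: {item!r}")
        residues.append((counts, charge))
    return residues

def formula_weight(text):
    """Sum of atomic masses over all residues (multipliers not supported)."""
    return sum(ATOMIC_MASS[symbol] * n
               for counts, _ in parse_formula(text)
               for symbol, n in counts.items())

print(parse_formula("C2 H3 O2 1-,Na1 1+,H2 O1"))
print(round(formula_weight("C20 H26 O4"), 1))
```

The second call uses the formula of the typical record shown earlier, whose #DENSITY field gives fw 330.4, so the two fields can be cross-checked.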
The #JRNL field contains the journal reference for a published structure. It
takes the form: coden, volume, page, year, e.g.
#JRNL 591,39,136,1983
#JRNL 1078,,,1995
Note
Optional input for private databases. If #JRNL is not input, PreQuest
will treat the record as "Private Communication", coden
1078, and fill in the current year from the computer system date. The coden
is a code number pointing to the text name of a journal; the table can be seen
in Quest by typing HELP CODEN.
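The four-part #JRNL layout and the private-communication default can be mimicked as in the sketch below. The function name, the returned dictionary and the choice to treat the coden as an integer are assumptions; this is not how PreQuest itself is implemented.

```python
from datetime import date

PRIVATE_COMMUNICATION = 1078  # coden for "Private Communication", from the text

def parse_jrnl(text=None, today=None):
    """Parse 'coden,volume,page,year'; a missing field falls back to the default."""
    today = today or date.today()
    if not text:
        return {"coden": PRIVATE_COMMUNICATION, "volume": None,
                "page": None, "year": today.year}
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("expected coden, volume, page, year")
    coden, volume, page, year = parts
    return {"coden": int(coden),
            "volume": volume or None,   # kept as text; may be empty
            "page": page or None,
            "year": int(year) if year else today.year}

print(parse_jrnl("591,39,136,1983"))
print(parse_jrnl("1078,,,1995"))
```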
The #PROPS field stores crystal property information using four keywords:
"Mp: melting point
"Color: colour of the crystal
"Source: source of a natural product or polymorph
"Note: other properties
Example:
#PROPS "Mp. 123deg.C
"Color: red-brown.
"Source: bark of Japoni
"Note: Hygroscopic and decomposes on exposure to x-ra
Note
Optional input but potentially contains valuable information - include
if relevant data are available.
The #QUAL field contains important attributes of the compound, or of
the crystallographic study if it differs from an X-ray study at room temperature.
The types of data recorded in main-file CSD are:
- low- or high-temperature study
- low- or high-pressure study
- neutron study
- absolute configuration determined
- specification of polymorphic form
- data or refinement details
- drug activity
Examples:
#QUAL neutron study, at 123 deg.C
#QUAL neuroleptic drug, absolute configuration
#QUAL blue monoclinic form
#QUAL alpha form, high-order data only
#QUAL refinement of data of Cannas et al.,Inorg.Chem. 16,228,1977
#QUAL refinement in space group no. 33
Note
Optional for private databases - but recommended input.
The #RADIUS field contains the radius used to determine the crystal
connectivity for each element present in the list of atomic coordinates (#ATOM).
The distance Dij between two atoms i and j is defined
to be a bonding distance if
(Ri + Rj - Tol) .le. Dij .le. (Ri + Rj + Tol)
where Ri, Rj are the radii of the atoms and Tol is a
tolerance, normally 0.4 Å.
Note
Not normally input. The radius values are obtained from a standard table
by PreQuest. The user will not normally need to edit these values
as radius adjustment can be performed by Radj in Check 3D.
The table is printed in Vol 3. Appendix 10.
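The bonding criterion given above translates directly into code. In this sketch the function name is hypothetical and the radii in the example are placeholder values chosen for illustration, not values taken from the standard table.

```python
def is_bonded(distance, radius_i, radius_j, tol=0.4):
    """True if (Ri + Rj - Tol) <= Dij <= (Ri + Rj + Tol)."""
    total = radius_i + radius_j
    return total - tol <= distance <= total + tol

# Placeholder radii (assumed, in Angstroms) for a C-O contact.
R_C, R_O = 0.68, 0.68
print(is_bonded(1.214, R_C, R_O))   # within the window 0.96 to 1.76
print(is_bonded(2.500, R_C, R_O))   # too long to count as a bond
```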
The #RFACT field contains the crystallographic R-factor as a decimal number,
e.g.
#RFACT R=0.0410
The sub-keyword R= has the value 0.0410, indicating an R-factor of 4.1%
Note
Optional input - but a very valuable indicator of accuracy.
The #RMARKS field contains general remarks not catered for by the other keywords.
Examples:
#RMARKS The coordinates of P(8) seem to be in error.
#RMARKS Unresolved problems with the coordinates of solvent.
Note
Optional input - but remember this is searchable text in Quest.
The #SYNONM field contains any appropriate synonym(s) for the compound name.
It is often used to record trade or drug names. If more than one synonym
is required, separate them with semi-colons. Examples:
#SYNONM Aspirin
#SYNONM Ampicillin; Nuvapen; Totapen
#SYNONM 1,8-Dihydroxy-2,4,5,7-tetranitro-9,10-anthracenedione
Note
Optional input. Local company names could be given e.g. A1234C5.
The #SYSCAT field contains the crystal system information using the keywords
sys and cat e.g. #SYSCAT sys A cat
Crystal system is recorded as:
sys A for anorthic (triclinic)
sys M for monoclinic
sys O for orthorhombic
sys T for tetragonal
sys H for hexagonal
sys R for rhombohedral
sys C for cubic
For trigonal space groups the conventions used are:
If space group lattice symbol is R, e.g. R-3, we set sys R
If space group lattice symbol is P, e.g. P31, we set sys H
Category is always recorded as 3, meaning full structure determination.
Note
Not normally input - generated automatically by Prequest.
The #TOLER field contains the tolerance, in Angstroms, used in conjunction
with #RADIUS for the determination of the crystal connectivity.
See #RADIUS. E.g.
#TOLER 0.40
Note
Not normally input - generated automatically by PreQuest.
This field is used to control the level of processing of the data,
causing certain tests to be by-passed. It is chiefly for use by the
CCDC editorial staff to allow unresolved errors to be flagged in the
main database, and to ensure that no incorrect connectivity is stored.
The data consists of keywords followed by values.
int 3 diffractometer data
2 densitometer photographic data
1 visual photographic data
ig 1 ignore atom valence checks
2 serious error - no crystal connectivity output
rpa 1 refer problem to author - by letter
pd 1 disorder is present (set by PreQuest)