Macromolecular Crystallographic Information (mmCIF) Tutorial

Last Update Nov. 08, 1995
CIF Editor

Contents

  1. Introduction
  2. How Does an mmCIF Data File Differ from a PDB File?
    1. The Format of an mmCIF Data File Versus a PDB File
  3. Describing a Structure using mmCIF
    1. Examples - Partial CIFs
    2. Examples - Full CIFs

Introduction

The macromolecular Crystallographic Information File (mmCIF) provides a way of describing in detail the features of a macromolecular structure and the X-ray diffraction experiment that was used to derive that structure. The mmCIF may be used for data exchange, for data archiving, and to assist in writing software. This tutorial will help crystallographers understand how mmCIFs are used to represent structures (described as content) and will help crystallographic programmers write code that uses mmCIFs (described as context).

We begin this discussion with what should be familiar to the reader: a PDB file.

How Does an mmCIF Data File Differ from a PDB File?

The mmCIF differs from the existing PDB file in the following important ways:

The Format of an mmCIF Data File Versus a PDB File

A PDB file consists of a series of records, each identified by a keyword (e.g. HEADER) of up to 6 characters. The format and content of the fields within a given record depend on the record type. An mmCIF, on the other hand, consists of a series of _name value pairs. The name is distinguished from the value by a leading underscore (_). Thus, the HEADER record of a PDB entry (1CBN):

        HEADER    PLANT SEED PROTEIN                    11-OCT-91   1CBN

becomes:

                _struct.entry_id                '1CBN'
                _struct.title                   'PLANT SEED PROTEIN'

                _struct_keywords.entry_id       '1CBN'
                _struct_keywords.text           'plant seed protein'

                _database_2.database_id          PDB
                _database_2.database_code        1CBN

                _database_PDB_rev.num                   1
                _database_PDB_rev.date_original 1991-10-11
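
As a rough illustration of how a program might read such name-value pairs, the sketch below collects single-line items into a Python dictionary. The function name parse_items is hypothetical. The standard-library shlex module is used only as a stand-in for quote handling: it follows shell quoting rules, which resemble but are not identical to the STAR rules.

```python
import shlex


def parse_items(text):
    """Collect single-line '_name value' items into a dictionary."""
    items = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts or not parts[0].startswith("_"):
            continue  # skip blank lines and anything that is not a data item
        name = parts[0]
        # shlex.split removes the quotes around values such as 'PLANT SEED PROTEIN'
        tokens = shlex.split(parts[1]) if len(parts) > 1 else []
        items[name] = tokens[0] if tokens else None
    return items


example = """
        _struct.entry_id                '1CBN'
        _struct.title                   'PLANT SEED PROTEIN'
        _database_2.database_id          PDB
        _database_PDB_rev.date_original 1991-10-11
"""
items = parse_items(example)
print(items["_struct.title"])
```

Keying the dictionary on the full data name works because a name may appear only once within a data block.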

This layout is inefficient when a data name has many values, as in PDB ATOM records. Such cases are handled with the STAR loop_ construct, shown here for PDB ATOM records:

    loop_
    _atom_site.group_PDB
    _atom_site.type_symbol
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_asym_id
    _atom_site.label_seq_id
    _atom_site.label_alt_id
    _atom_site.cartn_x
    _atom_site.cartn_y
    _atom_site.cartn_z
    _atom_site.occupancy
    _atom_site.B_iso_or_equiv
    _atom_site.footnote_id
    _atom_site.entity_id
    _atom_site.entity_seq_num
    _atom_site.id
  ATOM N  N   VAL  A  11 . 25.360  30.691  11.795  1.00  17.93 . 1  11 1
  ATOM C  CA  VAL  A  11 . 25.970  31.965  12.332  1.00  17.75 . 1  11 2
  ATOM C  C   VAL  A  11 . 25.569  32.010  13.881  1.00  17.83 . 1  11 3
#               [data omitted]
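
The loop_ construct maps naturally onto a table: the data names are the column headings and the values fill the rows in order. The following sketch, under the simplifying assumption that no value contains white space or quotes, turns a loop into a list of dictionaries. The function name parse_loop is hypothetical, and the example uses only a subset of the columns shown above.

```python
def parse_loop(text):
    """Turn one loop_ into a list of {data_name: value} rows."""
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue
        if line.startswith("_") and not values:
            names.append(line)           # still reading the column headings
        else:
            values.extend(line.split())  # values are not tied to line breaks
    width = len(names)
    if width == 0 or len(values) % width:
        raise ValueError("number of values is not a multiple of the number of names")
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


example = """
    loop_
    _atom_site.group_PDB
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_seq_id
    _atom_site.cartn_x
    _atom_site.cartn_y
    _atom_site.cartn_z
  ATOM N   VAL 11 25.360 30.691 11.795
  ATOM CA  VAL 11 25.970 31.965 12.332
"""
rows = parse_loop(example)
print(rows[1]["_atom_site.label_atom_id"], rows[1]["_atom_site.cartn_x"])
```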

Notice the structure of the _name, which is of the form _category.extension. We shall meet the concept of a category in several different contexts. A category forms a natural grouping of data items that is intuitive to a crystallographer. We will look at these categories in more detail later; they are tabulated in Table I. There is no restriction on the length or contents of a _name (compare the 6 character limit of a PDB keyword); however, special characters should be avoided. While there is no formal structure in a _name beyond the category and extension, the extension is usually written as an informal hierarchy of parts, with each part separated by an underscore (_).
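
A short sketch of how a program might take a data name apart is given here. The function name split_name is hypothetical, and splitting the extension on underscores reflects only the informal convention described above, not a formal rule.

```python
def split_name(data_name):
    """Split '_category.extension' into the category and the extension parts."""
    if not data_name.startswith("_") or "." not in data_name:
        raise ValueError("expected a name of the form _category.extension")
    category, extension = data_name[1:].split(".", 1)
    return category, extension.split("_")


category, parts = split_name("_atom_site.label_atom_id")
print(category, parts)
```

Only the first full stop is treated as the separator, so the category itself may contain underscores, as in atom_site.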

The length of records in an mmCIF is restricted to 80 characters.

Questions arise concerning the separation of data names and data values, and these are solved with some additional syntax. For example, what if the data value contains white space or an underscore, or runs over several lines? Or what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules complete the picture:
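
The examples later in this tutorial show these conventions in use: values containing white space are enclosed in quotes, values running over several lines are enclosed between lines beginning with a semicolon, a question mark (?) stands for an unknown value, and a full stop (.) stands for a value with no meaning in its context. The sketch below is one simplified way to read such values in Python; it is not a complete STAR parser, and the names read_values, INAPPLICABLE and UNKNOWN are hypothetical.

```python
import re

INAPPLICABLE = "<inapplicable>"  # stands for a '.' value
UNKNOWN = "<unknown>"            # stands for a '?' value
TOKEN = re.compile(r"'([^']*)'|\"([^\"]*)\"|(\S+)")


def read_values(text):
    """Return the data values found in text, in order of appearance."""
    values = []
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(";"):
            # a text field runs until the next line that starts with ';'
            parts = [line[1:]]
            for more in lines:
                if more.startswith(";"):
                    break
                parts.append(more)
            values.append(" ".join(p.strip() for p in parts).strip())
            continue
        for match in TOKEN.finditer(line):
            quoted = match.group(1) if match.group(1) is not None else match.group(2)
            if quoted is not None:
                values.append(quoted)
                continue
            word = match.group(3)
            if word.startswith("#"):
                break  # the rest of the line is a comment
            values.append({".": INAPPLICABLE, "?": UNKNOWN}.get(word, word))
    return values


sample = """\
H1 'Alpha-N start' . ?
;  The complex consists of 2 protein domains bound to a
   20 base pair DNA segment.
;
"""
values = read_values(sample)
print(values)
```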

To complete the introductory picture of the appearance of an mmCIF data file, consider the notion of scope. A PDB file has essentially one form of scope: the complete file. Thus, a single structure corresponds to a single file, and an ensemble of structures (usually from an NMR experiment) is represented by a single file with each member of the ensemble separated by a PDB MODEL keyword record. There is no computer-readable mechanism for associating components of, say, the REMARK records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or the end of the file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique. In addition, associated with each data item is the version of the dictionary in which it was defined. This is important if a true historic record and reproducibility of the data file are to be maintained. At present there is no formal provision for associating data contained in two blocks, as would be desirable in the case of, for example, a native structure and a mutant.
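
The data block rules can be sketched in a few lines of Python. The code below splits a file into blocks at each data_ line and reports any data name that occurs more than once within a block. The function names and the block names native and mutant are hypothetical, and for simplicity the check only looks at names that start a line, so it ignores the names listed inside a loop_ only if they are indented differently from what is assumed here.

```python
def split_data_blocks(text):
    """Map each data block name to the non-blank lines within its scope."""
    blocks = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("data_"):
            current = stripped[len("data_"):]
            if current in blocks:
                raise ValueError("duplicate data block: " + current)
            blocks[current] = []
        elif current is not None and stripped:
            blocks[current].append(stripped)
    return blocks


def duplicate_names(lines):
    """Return the data names that appear more than once in one block."""
    seen, duplicates = set(), []
    for line in lines:
        if line.startswith("_"):
            name = line.split()[0]
            if name in seen:
                duplicates.append(name)
            seen.add(name)
    return duplicates


example = """
data_native
_struct.entry_id  native
_struct.title     'native structure'
data_mutant
_struct.entry_id  mutant
_struct.title     'mutant structure'
"""
blocks = split_data_blocks(example)
print(list(blocks), duplicate_names(blocks["native"]))
```

The same data name appears in both blocks without conflict, because each occurrence is qualified by the name of the block that contains it.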

Describing a Structure using mmCIF

There are a number of concepts to be introduced for describing a macromolecular structure. These are:

The relationship between these features is illustrated below:

You can view this from the top down (how a molecular biologist would think of it) or from the bottom up (how, I think, a crystallographer would think of it). To remain unbiased we will start in the middle.

An entity is a unique chemical feature of the structure and is represented by the ENTITY category group. If that entity happens to be a polymer, most likely a polypeptide chain or DNA strand, it will be described by the ENTITY_POLY category group, which details the overall features of the polymer, indicating the presence of non-standard monomers, chirality, and linkages. The actual sequence of monomers describing the polymer is found in the ENTITY_POLY_SEQ category.

The chemical components of the structure are described in the CHEM_COMP category group. This is where, for example, the details of a non-standard group would be given, including connectivity and geometry.

At the bottom of the hierarchy is the category group ATOM_SITE which describes the actual atomic positions. Another category group (not shown) ATOM_SITES describes the overall features of a group of atom positions. Full compatibility with the PDB ATOM and HETATM record types is maintained.

Moving up the hierarchy, each ATOM_SITE is a member of a unique component of the asymmetric unit, designated by the STRUCT_ASYM category. Similarly, groups of atom sites are used to designate particular structures of interest: STRUCT_SITE (e.g., active sites), STRUCT_CONN (e.g., hydrogen bonds, salt bridges, disulphide linkages), and STRUCT_CONF (e.g., beta strands, alpha helices, turns).

Finally, components of the asymmetric unit can be combined to produce biologically interesting components. These are described by the category group STRUCT_BIOL and generated within the STRUCT_BIOL_GEN group.

We can now apply these definitions to features of several structures and then move on to complete files containing a full description of the experiment.

Examples - Partial CIFs

Examples - Full CIFs

The protein examples were generated with a PDB2CIF convertor and should be considered preliminary. The DNA examples were generated by the NDB project and are believed to be accurate.

Crambin (1CRN) - A Simple Example

Crambin consists of a single polypeptide chain of 48 amino acids (grey). There is a single ethanol molecule co-crystallized (red) and solvent (not shown). The structure is stabilized by 3 disulphide linkages (yellow). Using the author assignments for secondary structure, we have residues 7-19 and 23-30 as helix (green) and residues 1-4 and 32-35 as beta strand (blue).

There are 3 unique chemical components (entities) - the polypeptide chain, the ethanol molecule, and the solvent. These could be described as follows:


        loop_
        _entity.id
        _entity.type
        _entity.formula_weight
        _entity.src_method
                A               polymer         4716            'NATURAL'
                ethanol         non-polymer     52              'SYNTHETIC'
                water           water           18              .

This is not the complete list of data items in the entity category; however, there is no need to include all data items. Those that must be included if the category is used at all are labelled mandatory. The only mandatory data item in the entity category is _entity.id. Which categories are mandatory for a data submission will presumably be determined by community consensus.

It is then possible to expand upon this basic description of each entity using the _entity.id as a reference. So, for example, the common and systematic names are specified as:


        _entity_name_com.entity_id              A
        _entity_name_com.name                   crambin

        _entity_name_sys.entity_id              A
        _entity_name_sys.name                   'Crambe Abyssinica'

Similarly the natural and synthetic description can be given in more detail, so for the natural product we have:

        _entity_src_nat.entity_id               A
        _entity_src_nat.common_name             'Abyssinian Cabbage'
        _entity_src_nat.genus                   ?
        _entity_src_nat.species                 ?
        _entity_src_nat.details                 ?

In this context the ? indicates that the value is unknown, but should be supplied.

Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is straightforward since each entity appears only once in the asymmetric unit:


        loop_
        _struct_asym.id
        _struct_asym.entity_id
        _struct_asym.details
                chain_a A           'Single polypeptide chain'
                ethanol ethanol     'Cocrystallized ethanol molecule'
                HOH     water           .
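
Because each _struct_asym.entity_id refers back to an _entity.id, a program can check that every reference resolves. The sketch below shows one way to do this; the function name dangling_entity_refs is hypothetical, and the entity identifiers in the example are chosen so that each reference has a matching entity.

```python
def dangling_entity_refs(struct_asym, entity_ids):
    """Return (asym_id, entity_id) pairs whose entity_id is not a known entity."""
    return [(asym_id, entity_id)
            for asym_id, entity_id in struct_asym
            if entity_id not in entity_ids]


# (_struct_asym.id, _struct_asym.entity_id) pairs for crambin
struct_asym = [("chain_a", "A"), ("ethanol", "ethanol"), ("HOH", "water")]

ok = dangling_entity_refs(struct_asym, {"A", "ethanol", "water"})
bad = dangling_entity_refs(struct_asym, {"A", "ethanol", "H20"})
print(ok, bad)
```

The second call shows how a mismatch in spelling between the two categories would be reported.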

Entities classified as polymer, in this instance only that entity identified as A, are further described. First, the overall features of the polypeptide chain:

        _entity_poly.entity_id          A
        _entity_poly.type               polypeptide(L)
        _entity_poly.nstd_chirality     no
        _entity_poly.nstd_linkage       no
        _entity_poly.nstd_monomers      no
        _entity_poly.type_details       'None Available'

These data items indicate that there is nothing non-standard about the entity: no non-standard monomers, linkages or chirality.

Now the components of the polymer entity:


        loop_
        _entity_poly_seq.entity_id
        _entity_poly_seq.num
        _entity_poly_seq.mon_id
                A       1       THR             A       2       THR
#       [data omitted]
                A       22      PRO             A       23      GLU
                A       24      ALA             A       25      LEU
#       [data omitted]
                A       47      ALA             A       48      ASN
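
Note that the listing above places two rows of the loop on each line; values in a loop are read in order, regardless of line breaks. The following sketch reads such a listing in groups of three values and converts each monomer to a one-letter code. The function name read_poly_seq and the use of X for unrecognised monomers are assumptions made for this illustration.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_poly_seq(loop_body, entity_id):
    """Map sequence number to one-letter code for one entity."""
    tokens = []
    for line in loop_body.splitlines():
        if not line.lstrip().startswith("#"):  # skip comment lines
            tokens.extend(line.split())
    if len(tokens) % 3:
        raise ValueError("expected entity_id, num, mon_id triples")
    residues = {}
    for i in range(0, len(tokens), 3):
        ent, num, mon_id = tokens[i:i + 3]
        if ent == entity_id:
            residues[int(num)] = THREE_TO_ONE.get(mon_id, "X")
    return residues


body = """
        A       1       THR             A       2       THR
#       [data omitted]
        A       22      PRO             A       23      GLU
        A       24      ALA             A       25      LEU
"""
residues = read_poly_seq(body, "A")
print(residues)
```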

The entity may also exist in other databases, and these references may be cited and described. For the entity designated A, which is defined in Genbank:

_entity_reference.entity_id             A
_entity_reference.database_name         Genbank
_entity_reference.database_code         ?
_entity_reference.details               ?

Once each component of the asymmetric unit is defined, the details of the secondary structure can be defined using the STRUCT_CONF category:

        loop_
        _struct_conf.id
        _struct_conf.conf_type_id
        _struct_conf.beg_label_comp_id
        _struct_conf.beg_label_asym_id
        _struct_conf.beg_label_seq_id
        _struct_conf.end_label_comp_id
        _struct_conf.end_label_asym_id
        _struct_conf.end_label_seq_id
        _struct_conf.details
        H1  HELX-RHAL  ILE  chain_a  7  PRO  chain_a  19 'HELX-RH3T 17-19'
        H2  HELX-RHAL  GLU  chain_a 23  THR  chain_a  30 'Alpha-N start'
        S1  STRN       CYS  chain_a 32  ILE  chain_a  35 .
        S2  STRN       THR  chain_a  1  CYS  chain_a   4 .
        T1  TURN-TY1P  ARG  chain_a 17  GLY  chain_a  20 .
        T2  TURN-TY1P  PRO  chain_a 41  TYR  chain_a  44 .
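
Since each STRUCT_CONF row gives a beginning and an ending residue, a program can look up which conformations cover a given residue. A minimal sketch follows, with the rows above reduced to identifier, type and residue range; the function name conformations_at is hypothetical.

```python
# (id, conf_type_id, beg_label_seq_id, end_label_seq_id) from the loop above
STRUCT_CONF = [
    ("H1", "HELX-RHAL", 7, 19),
    ("H2", "HELX-RHAL", 23, 30),
    ("S1", "STRN", 32, 35),
    ("S2", "STRN", 1, 4),
    ("T1", "TURN-TY1P", 17, 20),
    ("T2", "TURN-TY1P", 41, 44),
]


def conformations_at(seq_id, rows=STRUCT_CONF):
    """Return the ids of all conformations whose range includes seq_id."""
    return [conf_id for conf_id, _type, beg, end in rows if beg <= seq_id <= end]


print(conformations_at(18), conformations_at(5))
```

A residue may belong to more than one conformation: residue 18 lies in both helix H1 and turn T1.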

These assignments are similar to those made in a PDB file for the record types HELIX, TURN and SHEET; however, the STRUCT_CONF_TYPE category (Table I) specifies the method of assignment, which could, for example, be deduced by the crystallographer from the electron density maps or defined algorithmically:

        loop_
        _struct_conf_type.id
        _struct_conf_type.criteria
        _struct_conf_type.reference
        HELX-RHAL       'author judgement'      .
        STRN            'author judgement'      .
        TURN-TY1P       'author judgement'      .
#       HELX-RHAL       'Kabsch and Sander'     'Biopolymers (1983) 22:2577'

The commented entry at the end is a hypothetical example for a calculated assignment. Data items also exist (Table I) for the description of beta sheets, but are not shown in this introductory example.

Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE category:


        loop_
        _struct_conn.id
        _struct_conn.conn_type_id
        _struct_conn.ptnr1_label_comp_id
        _struct_conn.ptnr1_label_asym_id
        _struct_conn.ptnr1_label_seq_id
        _struct_conn.ptnr1_label_atom_id
        _struct_conn.ptnr1_role
        _struct_conn.ptnr1_symmetry
        _struct_conn.ptnr2_label_comp_id
        _struct_conn.ptnr2_label_asym_id
        _struct_conn.ptnr2_label_seq_id
        _struct_conn.ptnr2_label_atom_id
        _struct_conn.ptnr2_role
        _struct_conn.ptnr2_symmetry
        _struct_conn.details
        SS1 disulf CYS chain_a 3  SG . 1_555 CYS chain_a 40 SG . 1_555 .
        SS2 disulf CYS chain_a 4  SG . 1_555 CYS chain_a 32 SG . 1_555 .
#       [data omitted]
        HB1  hydrog  SER  chain_a  6  OG positive  1_555
                     LEU  chain_a  8  O  negative  1_556  .
        HB2  hydrog  ARG  chain_a 17  N  positive  1_555
                     ASP  chain_a 43  O  negative  1_554  .
#       [data omitted]
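
The symmetry values in this loop, such as 1_555 and 1_556, can be decoded by a program. The sketch below assumes the common crystallographic convention in which the number before the underscore identifies the symmetry operation and the three digits after it give unit cell translations along a, b and c, each offset by 5 so that 555 means no translation. The function name parse_symmetry is hypothetical.

```python
def parse_symmetry(code):
    """Decode 'n_klm' into (operation number, (ta, tb, tc))."""
    operation, _, shifts = code.partition("_")
    if len(shifts) != 3 or not operation.isdigit() or not shifts.isdigit():
        raise ValueError("expected a code such as 1_555")
    # each digit is a lattice translation offset by 5
    return int(operation), tuple(int(digit) - 5 for digit in shifts)


print(parse_symmetry("1_555"), parse_symmetry("1_556"), parse_symmetry("1_554"))
```

Under this reading, the partner atoms of HB1 and HB2 lie one unit cell away along c, in opposite directions.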

These intermolecular interactions are partially specified on PDB CONECT records. However, mmCIF provides an additional level of detail such that the criteria used to define an interaction may be given using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge and a hydrogen bond:

        loop_
        _struct_conn_type.id
        _struct_conn_type.criteria
        _struct_conn_type.reference
        saltbr   'negative to positive distance > 2.5 angstroms and < 3.2 angstroms'    .
        hydrog   'N to O distance > 2.5 angstroms, < 3.2 angstroms, NOC angle < 120 degrees'    .

Components of a Protein DNA Complex

Consider an mmCIF representation for a more complex structure, the gene regulatory protein 434 CRO complexed with a 20 base pair DNA segment containing the operator (Mondragon and Harrison; PDB code 3CRO).

The protein consists of 2 domains R (red) and L (green). The DNA strands (gold) are complementary given a one base offset.

        loop_
        _struct_biol.id
        _struct_biol.details
                        complex
;               The complex consists of 2 protein domains bound to a
                20 base pair DNA segment.
;
                        protein
;               Each of the 2 protein domains is a single homologous
                polypeptide chain of 71 residues. The first chain is
                designated L and the second designated R.
;
                        DNA
;               The two strands (A and B) are complementary given a one
                base offset.
;

The protein/DNA complex, the protein, and the DNA are considered as three separate biological components, each generated from the contents of the asymmetric unit. No crystallographic symmetry need be applied to generate the biologically relevant components.

        loop_
        _struct_biol_gen.biol_id
        _struct_biol_gen.asym_id
        _struct_biol_gen.symmetry
                complex L       1_555
                complex R       1_555
                complex A       1_555
                complex B       1_555
                protein L       1_555
                protein R       1_555
                DNA             A       1_555
                DNA             B       1_555
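
To see which components of the asymmetric unit make up each biological component, the rows above can be grouped by _struct_biol_gen.biol_id. A minimal sketch, with the hypothetical function name members_by_biol:

```python
from collections import defaultdict

# (biol_id, asym_id, symmetry) rows from the loop above
STRUCT_BIOL_GEN = [
    ("complex", "L", "1_555"), ("complex", "R", "1_555"),
    ("complex", "A", "1_555"), ("complex", "B", "1_555"),
    ("protein", "L", "1_555"), ("protein", "R", "1_555"),
    ("DNA", "A", "1_555"), ("DNA", "B", "1_555"),
]


def members_by_biol(rows):
    """Group the asym ids under the biological component they help generate."""
    members = defaultdict(list)
    for biol_id, asym_id, _symmetry in rows:
        members[biol_id].append(asym_id)
    return dict(members)


members = members_by_biol(STRUCT_BIOL_GEN)
print(members)
```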

        loop_
        _entity.id
        _entity.type
                434             polymer
                DNA_A           polymer
                DNA_B           polymer
                water           water

Since each protein domain is chemically identical, they constitute a single entity, which has been designated 434. The complementary DNA strands are not chemically identical and therefore constitute two separate entities:

        loop_
        _struct_asym.id
        _struct_asym.entity_id
        _struct_asym.details
                L       434             '71 residue polypeptide chain'
                R       434             '71 residue polypeptide chain'
                A       DNA_A           '20 base strand'
                B       DNA_B           '20 base strand'
                HOH     water           'solvent'

Features of the 434 CRO secondary structure and intermolecular contacts can be described in the same way in which crambin was represented and are not repeated.

Non-standard Groups - Ribonuclease H (1RNH)

This example illustrates how to represent primary structure where certain amino acids are modified.

Selenomethionyl ribonuclease H (1RNH) consists of a single polypeptide chain (blue) with 3 methionyl substitutions shown in red. A fourth substitution at the N terminus is not seen in the electron density maps.

In mmCIF terminology monomers are structural building blocks which may or may not polymerize, and therefore comprise amino acids and nucleotides (modified and unmodified) as well as heterogen groups. The category group CHEM_COMP (Table I) permits a full description of monomers, including connectivity and geometry.

Ribonuclease H has 4 methionines substituted by 4 selenomethionines at positions 1, 47, 50, and 142 of the single polypeptide chain (Fig. 5). The N terminal substitution is not observed in the electron density maps at 2.0 Å. The selenomethionine monomer is described as follows:

        _chem_comp.id                   MSE
        _chem_comp.model_source         '1987 Protin/Prolsq ideals'
        _chem_comp.name                 selenomethionine
        _chem_comp.mon_nstd_flag        yes
        _chem_comp.mon_nstd_class       'modified amino acid'
        _chem_comp.mon_nstd_parent      methionine
        _chem_comp.mon_nstd_details
;       Derived from protein expressed in a selenomethionine rich
        auxotrophic environment.
;

The monomer then appears in the sequence representation as:

        loop_
        _entity_poly_seq.entity_id
        _entity_poly_seq.num
        _entity_poly_seq.mon_id
                A       1       MSE             A       2       LEU
#       [data omitted]
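
A program can combine the CHEM_COMP description with the sequence to locate the non-standard monomers and report their parents. The sketch below holds the relevant items in plain Python structures; the function name nonstandard_positions is hypothetical, and only the two residues shown above are included.

```python
# selected _chem_comp items, keyed by _chem_comp.id
CHEM_COMP = {
    "MSE": {
        "name": "selenomethionine",
        "mon_nstd_flag": "yes",
        "mon_nstd_parent": "methionine",
    },
}

# (_entity_poly_seq.num, _entity_poly_seq.mon_id) for entity A
SEQUENCE = [(1, "MSE"), (2, "LEU")]


def nonstandard_positions(sequence, chem_comp):
    """Return (num, mon_id, parent) for each monomer flagged as non-standard."""
    found = []
    for num, mon_id in sequence:
        entry = chem_comp.get(mon_id)
        if entry and entry.get("mon_nstd_flag") == "yes":
            found.append((num, mon_id, entry.get("mon_nstd_parent")))
    return found


print(nonstandard_positions(SEQUENCE, CHEM_COMP))
```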