We begin this discussion with what should be familiar to the reader: a PDB file. The header record
HEADER PLANT SEED PROTEIN 11-OCT-91 1CBN
becomes:
_struct.entry_id '1CBN'
_struct.title 'PLANT SEED PROTEIN'
_struct_keywords.entry_id '1CBN'
_struct_keywords.text 'plant seed protein'
_database_2.database_id PDB
_database_2.database_code 1CBN
_database_PDB_rev.num 1
_database_PDB_rev.date_original 1991-10-11
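The mapping above can be sketched in a few lines of Python. This is only an illustration, not an actual conversion tool: the function name header_to_mmcif, the whitespace-based regular expression (real PDB files use fixed columns), and the rule for expanding two-digit years are all assumptions made for this example.

```python
import re

# Month abbreviations as they appear in PDB dates (e.g. 11-OCT-91).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Assumption: fields are separated by whitespace, as in the record shown
# above; real PDB files place them in fixed columns.
HEADER_RE = re.compile(
    r"^HEADER\s+(.+?)\s+(\d{2})-([A-Z]{3})-(\d{2})\s+(\w{4})\s*$")


def header_to_mmcif(record):
    """Return (data name, value) pairs for one PDB HEADER record."""
    match = HEADER_RE.match(record)
    if match is None:
        raise ValueError("not a recognisable HEADER record")
    title, day, month, yy, code = match.groups()
    # Assumption: two-digit years of 70 or more belong to the 1900s.
    year = 1900 + int(yy) if int(yy) >= 70 else 2000 + int(yy)
    date = f"{year:04d}-{MONTHS[month]:02d}-{int(day):02d}"
    return [
        ("_struct.entry_id", f"'{code}'"),
        ("_struct.title", f"'{title}'"),
        ("_struct_keywords.entry_id", f"'{code}'"),
        ("_struct_keywords.text", f"'{title.lower()}'"),
        ("_database_2.database_id", "PDB"),
        ("_database_2.database_code", code),
        ("_database_PDB_rev.num", "1"),
        ("_database_PDB_rev.date_original", date),
    ]


for name, value in header_to_mmcif("HEADER PLANT SEED PROTEIN 11-OCT-91 1CBN"):
    print(name, value)
```

Each field of the HEADER record thus becomes one data name paired with one data value.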
This form is not very efficient when a single data name
often has multiple values, as is the case
for PDB ATOM records. Such cases are
handled with the STAR loop_ construct, shown here for PDB ATOM records:
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
# [data omitted]
Notice the structure of the _name, which is of the form _category.extension.
We shall meet the concept of a category in several different contexts.
A category forms a natural grouping of data items that is intuitive to a
crystallographer. We will look at these categories in more detail; note that
they are tabulated in Table I. There is no
restriction on the length or contents of _name (compare the 6 character
limit of a PDB keyword); however, special characters should be avoided.
While there is no formal structure in _name
beyond the category and extension, the extension is usually written as
an informal hierarchy of parts, with each part separated by an underscore
(_).
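As a small illustration of the _category.extension convention, the following Python sketch splits a data name into its category and the informal parts of its extension. The function name split_data_name is an assumption made for this example; the splitting of the extension on underscores reflects only the informal convention described above, not a formal rule.

```python
def split_data_name(name):
    """Split '_category.extension' into (category, list of extension parts)."""
    if not name.startswith("_") or "." not in name:
        raise ValueError(f"not an mmCIF data name: {name!r}")
    category, extension = name[1:].split(".", 1)
    # The underscore-separated parts are an informal convention only.
    return category, extension.split("_")


print(split_data_name("_atom_site.label_atom_id"))
print(split_data_name("_atom_site.B_iso_or_equiv"))
```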
The length of records in an mmCIF file is restricted to 80 characters.
Questions arise concerning the separation of data names and data values, and these are resolved with some additional syntax. For example, what if the data value contains white space or an underscore, or runs over several lines? Or what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules complete the picture:
To complete the introductory picture of the appearance of an mmCIF data file, consider the notion of scope. A PDB file has essentially one form of scope: the complete file. Thus, a single structure corresponds to a single file, and an ensemble of structures (usually from an NMR experiment) is represented by a single file with each member of the ensemble separated by a PDB MODEL keyword record. There is no computer-readable mechanism for associating components of, say, the REMARK records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique. In addition, associated with each data item is the version of the dictionary in which it was defined. This is important if a true historic record and reproducibility of the data file are to be maintained. At present there is no formal provision for associating data contained in two blocks, as would be desirable in the case of, for example, a native structure and a mutant.
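The data block rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions: data_ headers and data names are taken to start a line, multi-line text values are not handled, and the function name read_blocks and the block names native and mutant are hypothetical.

```python
def read_blocks(text):
    """Map each data block name to the list of data names found in it.

    Raises ValueError if a data name is repeated within one block.
    """
    blocks = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        token = fields[0]
        if token.startswith("data_"):
            # A new data_ header closes the scope of the previous block.
            current = token[len("data_"):]
            blocks.setdefault(current, [])
        elif token.startswith("_") and current is not None:
            if token in blocks[current]:
                raise ValueError(
                    f"{token} appears twice in data block {current}")
            blocks[current].append(token)
    return blocks


# Hypothetical two-block file; the block names are made up for illustration.
EXAMPLE = """\
data_native
_struct.entry_id 'native'
_struct.title 'hypothetical native structure'
data_mutant
_struct.entry_id 'mutant'
_struct.title 'hypothetical mutant structure'
"""

for block, names in read_blocks(EXAMPLE).items():
    print(block, names)
```

The same data name may appear in both blocks because uniqueness is only required for the combination of block name and data name.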
The relationship between these features is illustrated below:
You can view this from the top down (how a molecular biologist would think of it) or from the bottom up (how, I think, a crystallographer would think of it). To remain unbiased we will start in the middle.
An entity is a unique chemical feature of the structure and is represented by the ENTITY category group. If that entity happens to be a polymer (most likely a polypeptide chain or DNA strand), it will be described by the ENTITY_POLY category group, which details the overall features of the polymer, indicating the presence of non-standard monomers, chirality, and linkages. The actual sequence of monomers describing the polymer is found in the ENTITY_POLY_SEQ category.
The chemical components of the structure are described in the CHEM_COMP category group. This is where, for example, the details of a non-standard group would be given, including connectivity and geometry.
At the bottom of the hierarchy is the category group ATOM_SITE which describes the actual atomic positions. Another category group (not shown) ATOM_SITES describes the overall features of a group of atom positions. Full compatibility with the PDB ATOM and HETATM record types is maintained.
Moving up the hierarchy, each ATOM_SITE is a member of a unique component of the asymmetric unit, designated by the STRUCT_ASYM category. Similarly, groups of atom sites are used to designate particular structures of interest: STRUCT_SITE (e.g., active sites), STRUCT_CONN (e.g., hydrogen bonds, salt bridges, disulphide linkages), and STRUCT_CONF (e.g., beta strands, alpha helices, turns).
Finally, components of the asymmetric unit can be combined to produce biologically interesting components. These are described by the category group STRUCT_BIOL and generated within the STRUCT_BIOL_GEN group.
We can now apply these definitions to features of several structures and then move on to complete files containing a full description of the experiment.
Crambin consists of a single polypeptide chain of 48 amino acids (grey). There is a single ethanol molecule co-crystallized (red) and solvent (not shown). The structure is stabilized by 3 disulphide linkages (yellow). Using the author assignments for secondary structure, we have residues 7-19 and 23-30 as helix (green) and residues 1-4 and 32-35 as beta strand (blue).
There are 3 unique chemical components (entities) - the polypeptide chain, the ethanol molecule, and the solvent. These could be described as follows:
loop_
_entity.id
_entity.type
_entity.formula_weight
_entity.src_method
A polymer 4716 'NATURAL'
ethanol non-polymer 52 'SYNTHETIC'
water water 18 .
This is not the complete list of data items in the entity category; however,
there is no need to include all data items. Those that must be included if
the category is used at all are labelled mandatory. The only mandatory
data item in the entity category is _entity.id. Which categories are
mandatory for a data submission will presumably be determined by
community consensus.
It is then possible to expand upon this basic description of each entity using the _entity.id as a reference. So, for example, the common and systematic names are specified as:
_entity_name_com.entity_id A
_entity_name_com.name crambin
_entity_name_sys.entity_id A
_entity_name_sys.name 'Crambe Abyssinica'
Similarly the natural and synthetic description can be given in more
detail, so for the natural product we have:
_entity_src_nat.entity_id A
_entity_src_nat.common_name 'Abyssinian Cabbage'
_entity_src_nat.genus ?
_entity_src_nat.species ?
_entity_src_nat.details ?
In this context the ? indicates that the value is unknown, but
should be supplied.
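A program reading such a file needs to treat the placeholder values differently from real data. The Python sketch below shows one way this might be done; the function name interpret_value is an assumption, and the reading of . as "not applicable" follows general CIF usage rather than anything stated explicitly above.

```python
def interpret_value(token):
    """Classify a raw mmCIF value token as (kind, value)."""
    if token == "?":
        # Unknown, but a value should eventually be supplied.
        return ("unknown", None)
    if token == ".":
        # Assumption from general CIF usage: no value applies here.
        return ("inapplicable", None)
    return ("value", token)


# Items copied from the ENTITY_SRC_NAT example above.
items = {
    "_entity_src_nat.entity_id": "A",
    "_entity_src_nat.common_name": "Abyssinian Cabbage",
    "_entity_src_nat.genus": "?",
    "_entity_src_nat.species": "?",
    "_entity_src_nat.details": "?",
}

# List the data names whose values still need to be supplied.
still_needed = [name for name, raw in items.items()
                if interpret_value(raw)[0] == "unknown"]
print(still_needed)
```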
Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is straightforward since each entity appears only once in the asymmetric unit:
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
chain_a A 'Single polypeptide chain'
ethanol ethanol 'Cocrystallized ethanol molecule'
HOH water .
Entities classified as polymer, in this instance only that entity
identified as A, are further
described. First, the overall features of the polypeptide chain:
_entity_poly.entity_id A
_entity_poly.type polypeptide(L)
_entity_poly.nstd_chirality no
_entity_poly.nstd_linkage no
_entity_poly.nstd_monomers no
_entity_poly.type_details 'None Available'
These data items indicate that there is nothing non-standard about
the entity: no non-standard monomers, linkages or chirality.
Now the components of the polymer entity:
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
A 1 THR A 2 THR
# [data omitted]
A 22 PRO A 23 GLU
A 24 ALA A 25 LEU
# [data omitted]
A 47 ALA A 48 ASN
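Note that the listing above places two rows of the loop on each line; values are simply assigned to the data names in turn. A minimal Python sketch of this behaviour follows. It assumes values are separated by white space or wrapped in single quotes and does not handle multi-line text fields; the function name parse_loop is hypothetical.

```python
import shlex


def parse_loop(text):
    """Parse one loop_ into a list of {data name: value} rows."""
    names, values = [], []
    for line in text.splitlines():
        # shlex drops '#' comments and keeps quoted strings together.
        for token in shlex.split(line, comments=True):
            if token == "loop_":
                continue
            if token.startswith("_") and not values:
                names.append(token)
            else:
                values.append(token)
    if not names or len(values) % len(names):
        raise ValueError("value count is not a multiple of the name count")
    width = len(names)
    # Values are assigned to the data names cyclically, row by row.
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


LOOP_TEXT = """\
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
A 1 THR A 2 THR
# [data omitted]
A 47 ALA A 48 ASN
"""

for row in parse_loop(LOOP_TEXT):
    print(row)
```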
The entity may also exist in other databases and these references may
be cited and described. For example, the entity designated A is defined in
Genbank:
_entity_reference.entity_id A
_entity_reference.database_name Genbank
_entity_reference.database_code ?
_entity_reference.details ?
Once each component of the asymmetric unit is defined, the details of the secondary structure can be described using the STRUCT_CONF category:
loop_
_struct_conf.id
_struct_conf.conf_type_id
_struct_conf.beg_label_comp_id
_struct_conf.beg_label_asym_id
_struct_conf.beg_label_seq_id
_struct_conf.end_label_comp_id
_struct_conf.end_label_asym_id
_struct_conf.end_label_seq_id
_struct_conf.details
H1 HELX-RHAL ILE chain_a 7 PRO chain_a 19 'HELX-RH3T 17-19'
H2 HELX-RHAL GLU chain_a 23 THR chain_a 30 'Alpha-N start'
S1 STRN CYS chain_a 32 ILE chain_a 35 .
S2 STRN THR chain_a 1 CYS chain_a 4 .
T1 TURN-TY1P ARG chain_a 17 GLY chain_a 20 .
T2 TURN-TY1P PRO chain_a 41 TYR chain_a 44 .
These assignments are similar to those made in a PDB file for the
record types HELIX, TURN and SHEET. However, the STRUCT_CONF_TYPE category
(Table I) also specifies the method of assignment, which could, for example,
be deduced by the crystallographer from the electron density maps or
defined algorithmically:
loop_
_struct_conf_type.id
_struct_conf_type.criteria
_struct_conf_type.reference
HELX-RHAL 'author judgement' .
STRN 'author judgement' .
TURN-TY1P 'author judgement' .
# HELX-RHAL 'Kabsch and Sander' 'Biopolymers (1983) 22:2577'
The commented entry at the end is a hypothetical example for a
calculated assignment.
Data items also exist (Table I)
for the description of beta sheets,
but are not shown in this introductory example.
Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE categories:
loop_
_struct_conn.id
_struct_conn.conn_type_id
_struct_conn.ptnr1_label_comp_id
_struct_conn.ptnr1_label_asym_id
_struct_conn.ptnr1_label_seq_id
_struct_conn.ptnr1_label_atom_id
_struct_conn.ptnr1_role
_struct_conn.ptnr1_symmetry
_struct_conn.ptnr2_label_comp_id
_struct_conn.ptnr2_label_asym_id
_struct_conn.ptnr2_label_seq_id
_struct_conn.ptnr2_label_atom_id
_struct_conn.ptnr2_role
_struct_conn.ptnr2_symmetry
_struct_conn.details
SS1 disulf CYS chain_a 3 SG . 1_555 CYS chain_a 40 SG . 1_555 .
SS2 disulf CYS chain_a 4 SG . 1_555 CYS chain_a 32 SG . 1_555 .
# [data omitted]
HB1 hydrog SER chain_a 6 OG positive 1_555
LEU chain_a 8 O negative 1_556 .
HB2 hydrog ARG chain_a 17 N positive 1_555
ASP chain_a 43 O negative 1_554 .
# [data omitted]
These intermolecular interactions are partially specified on PDB
CONECT records.
However, mmCIF provides an additional level of detail, in that the
criteria used to define
an interaction may be given using the STRUCT_CONN_TYPE category. Here is a
hypothetical example used to describe a salt bridge and a hydrogen bond:
loop_
_struct_conn_type.id
_struct_conn_type.criteria
_struct_conn_type.reference
saltbr 'negative to positive distance > 2.5 Å and < 3.2 Å ' .
hydrog 'N to O distance > 2.5 Å, < 3.2 Å, NOC angle < 120 ' .
The protein consists of 2 domains R (red) and L (green). The DNA strands (gold) are complementary given a one base offset.
loop_
_struct_biol.id
_struct_biol.details
complex
; The complex consists of 2 protein domains bound to a
20 base pair DNA segment.
;
protein
; Each of the 2 protein domains is a single homologous
polypeptide chain of 71 residues. The first chain is
designated L and the second designated R.
;
DNA
; The two strands (A and B) are complementary given a one
base offset.
;
The protein/DNA complex, the protein, and the DNA are considered as
three separate
biological components each generated from the contents of the
asymmetric unit. No
crystallographic symmetry need be applied to generate the biologically relevant
components.
loop_
_struct_biol_gen.biol_id
_struct_biol_gen.asym_id
_struct_biol_gen.symmetry
complex L 1_555
complex R 1_555
complex A 1_555
complex B 1_555
protein L 1_555
protein R 1_555
DNA A 1_555
DNA B 1_555
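The symmetry column in the loop above uses codes such as 1_555. Following the usual CIF convention, the number before the underscore identifies a symmetry operator and the three digits give lattice translations along a, b and c, each offset by 5, so 1_555 is operator 1 with no translation. The Python sketch below decodes such a code; the function name parse_symmetry_code is an assumption made for this example.

```python
def parse_symmetry_code(code):
    """Split a code such as '1_556' into (operator number, (ta, tb, tc))."""
    operator, _, shifts = code.partition("_")
    if not operator.isdigit() or len(shifts) != 3 or not shifts.isdigit():
        raise ValueError(f"not a symmetry code: {code!r}")
    # Each digit is a lattice translation offset by 5, so '5' means none.
    translation = tuple(int(digit) - 5 for digit in shifts)
    return int(operator), translation


print(parse_symmetry_code("1_555"))  # operator 1, no translation
print(parse_symmetry_code("1_556"))  # operator 1, one cell along c
```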
loop_
_entity.id
_entity.type
434 polymer
DNA_A polymer
DNA_B polymer
water water
Since the two protein domains are chemically identical, they constitute a
single entity, which has
been designated 434. The complementary DNA strands are not
chemically identical and
therefore constitute two separate entities:
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
L 434 '71 residue polypeptide chain'
R 434 '71 residue polypeptide chain'
A DNA_A '20 base strand'
B DNA_B '20 base strand'
HOH water 'solvent'
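The link between the two listings is the entity identifier: every _struct_asym.entity_id must match an _entity.id. The following Python sketch checks this for the data above and groups the components of the asymmetric unit by entity. The function name asym_by_entity and the in-memory representation (a dictionary and a list of pairs) are assumptions chosen for brevity.

```python
from collections import defaultdict

# _entity.id -> _entity.type, copied from the ENTITY loop above.
entities = {
    "434": "polymer",
    "DNA_A": "polymer",
    "DNA_B": "polymer",
    "water": "water",
}

# (_struct_asym.id, _struct_asym.entity_id) pairs from the STRUCT_ASYM data.
struct_asym = [
    ("L", "434"),
    ("R", "434"),
    ("A", "DNA_A"),
    ("B", "DNA_B"),
    ("HOH", "water"),
]


def asym_by_entity(entities, struct_asym):
    """Group asymmetric unit components by the entity they refer to."""
    grouped = defaultdict(list)
    for asym_id, entity_id in struct_asym:
        if entity_id not in entities:
            raise KeyError(
                f"struct_asym {asym_id!r} refers to unknown entity {entity_id!r}")
        grouped[entity_id].append(asym_id)
    return dict(grouped)


print(asym_by_entity(entities, struct_asym))
```

The two chemically identical chains L and R are grouped under the single entity 434, while each DNA strand has an entity of its own.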
Features of the CRO 434 secondary structure and intermolecular contacts can be
described in the same way in which crambin was represented and are not repeated.
Selenomethionyl ribonuclease H (1RNH) consists of a single polypeptide chain (blue) with 3 methionyl substitutions shown in red. A fourth substitution at the N terminus is not seen in the electron density maps.
In mmCIF terminology monomers are structural building blocks which may or may not polymerize, and therefore comprise amino acids and nucleotides (modified and unmodified) as well as heterogen groups. The category group CHEM_COMP (Table I) permits a full description of monomers, including connectivity and geometry.
Ribonuclease H has 4 methionines substituted by 4 selenomethionines at positions 1, 47, 50, and 142 of the single polypeptide chain (Fig. 5). The N terminal substitution is not observed in the electron density maps at 2.0 Å. The selenomethionine monomer is described as:
_chem_comp.id MSE
_chem_comp.model_source '1987 Protin/Prolsq ideals'
_chem_comp.name selenomethionine
_chem_comp.mon_nstd_flag yes
_chem_comp.mon_nstd_class 'modified amino acid'
_chem_comp.mon_nstd_parent methionine
_chem_comp.mon_nstd_details
; Derived from protein expressed in a selenomethionine rich
auxotrophic environment.
;
and then appears in the sequence representation as:
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
A 1 MSE A 2 LEU
# [data omitted]