We begin this discussion with what should be familiar to the reader: a PDB file.
    HEADER    PLANT SEED PROTEIN                      11-OCT-91   1CBN

becomes:

    _struct.entry_id                  '1CBN'
    _struct.title                     'PLANT SEED PROTEIN'
    _struct_keywords.entry_id         '1CBN'
    _struct_keywords.text             'plant seed protein'
    _database_2.database_id           PDB
    _database_2.database_code         1CBN
    _database_PDB_rev.num             1
    _database_PDB_rev.date_original   1991-10-11

This is not very efficient where a data name often has multiple values, for example in PDB ATOM records. This is dealt with using a STAR loop_ construct, as shown here for PDB ATOM records:
    loop_
    _atom_site.group_PDB
    _atom_site.type_symbol
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_asym_id
    _atom_site.label_seq_id
    _atom_site.label_alt_id
    _atom_site.cartn_x
    _atom_site.cartn_y
    _atom_site.cartn_z
    _atom_site.occupancy
    _atom_site.B_iso_or_equiv
    _atom_site.footnote_id
    _atom_site.entity_id
    _atom_site.entity_seq_num
    _atom_site.id
    ATOM N N  VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
    ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
    ATOM C C  VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
    # [data omitted]

Notice the structure of the _name, which is of the form _category.extension. We shall meet the concept of a category in several different contexts. A category forms a natural grouping of data items that is intuitive to a crystallographer. We will be looking more closely at these categories; note that they are tabulated in Table I. There is no restriction on the length or contents of _name (compare the 6 character limit of a PDB keyword); however, special characters should be avoided. While there is no formal specification in _name beyond the category and extension, the extension is usually represented as an informal hierarchy of parts, with each part separated by an underscore (_).
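As a minimal sketch of how the loop_ layout and the _category.extension naming might be read by a program, the following Python splits a data name into its two parts and turns a simple loop_ into one record per row. The function names split_name and parse_simple_loop are illustrative only, the example uses a shortened subset of the ATOM columns, and the parser assumes that values contain no quotes or embedded white space.

```python
def split_name(data_name):
    """Split '_category.extension' into (category, extension)."""
    if not data_name.startswith("_") or "." not in data_name:
        raise ValueError(f"not of the form _category.extension: {data_name!r}")
    category, extension = data_name[1:].split(".", 1)
    return category, extension


def parse_simple_loop(text):
    """Parse one loop_ into a list of {data_name: value} dictionaries.

    Simplifying assumption: no quoted or multi-line values, so plain
    white-space splitting is enough.
    """
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments such as '# [data omitted]'
        tokens.extend(line.split())
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected the text to start with loop_")

    # Data names come first; everything after them is a value.
    count = 1
    while count < len(tokens) and tokens[count].startswith("_"):
        count += 1
    names, values = tokens[1:count], tokens[count:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the name count")

    width = len(names)
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


# Shortened version of the ATOM loop (only five of the sixteen columns).
EXAMPLE = """
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.cartn_x
_atom_site.id
ATOM N  VAL 25.360 1
ATOM CA VAL 25.970 2
# [data omitted]
"""

rows = parse_simple_loop(EXAMPLE)
print(rows[1]["_atom_site.label_atom_id"])
print(split_name("_atom_site.cartn_x"))
```

Because the values are matched to names purely by position, every row must supply a value for every name, which is why placeholders such as . are needed in the full listing.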
The length of records in a mmCIF file is restricted to 80 characters.
Questions arise concerning the separation of data names and data values; these are solved with some additional syntax. For example, what if the data value contains white space or an underscore, or runs over several lines? Or what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules complete the picture:
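A rough Python sketch of a tokenizer that deals with these cases is given here. It assumes the usual STAR/CIF conventions: values containing white space are enclosed in single or double quotes, text running over several lines is delimited by lines beginning with a semicolon, ? marks an unknown value, and . marks a value that has no meaning in its context. The names tokenize and classify are hypothetical, and the code is a simplification rather than a complete STAR parser.

```python
import re

# A quoted value ends at a quote followed by white space or end of line.
_TOKEN = re.compile(r"'(.*?)'(?=\s|$)" r'|"(.*?)"(?=\s|$)' r"|(\S+)")


def classify(word):
    """Label an unquoted word."""
    if word == "?":
        return ("unknown", word)
    if word == ".":
        return ("inapplicable", word)
    if word == "loop_" or word.startswith("data_"):
        return ("keyword", word)
    if word.startswith("_"):
        return ("name", word)
    return ("value", word)


def tokenize(text):
    """Return a list of (kind, text) pairs for a fragment of a data file."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Multi-line text: collect lines up to the closing semicolon.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            i += 1  # skip the closing semicolon line
            tokens.append(("value", "\n".join(body).strip()))
            continue
        for match in _TOKEN.finditer(line):
            single, double, bare = match.groups()
            if bare is None:
                # Quoted text is always a value, even if it starts with '_'.
                tokens.append(("value", single if single is not None else double))
            elif bare.startswith("#"):
                break  # the rest of the line is a comment
            else:
                tokens.append(classify(bare))
        i += 1
    return tokens


SAMPLE = """\
_struct.title             'PLANT SEED PROTEIN'
_entity_src_nat.genus     ?
_atom_site.label_alt_id   .
_struct_biol.details
; The complex consists of 2 protein domains
  bound to a 20 base pair DNA segment.
;
"""

for kind, value in tokenize(SAMPLE):
    print(kind, repr(value))
```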
To complete the introductory picture of the appearance of a mmCIF data file, consider the notion of scope. A PDB file has essentially one form of scope: the complete file. Thus, a single structure corresponds to a single file, and an ensemble of structures (usually from an NMR experiment) is represented by a single file with each member of the ensemble separated by a PDB MODEL keyword record. There is no computer readable mechanism for associating components of, say, the REMARK records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique. That is, associated with each data item is the version of the dictionary in which it was defined. This is important if a true historic record and reproducibility of the data file are to be maintained. At present there is no formal provision for associating data contained in two blocks, as would be desirable in the case of, for example, a native structure and a mutant.
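The scope rules can be sketched in a few lines of Python. The function below (data_names_by_block, a hypothetical name) collects the data names seen in each data block and rejects a name that is repeated within one block. The block names native and mutant are invented for the illustration, and the scan assumes that no data value begins with an underscore or with data_.

```python
def data_names_by_block(text):
    """Map each data block name to the list of data names it contains."""
    blocks = {}
    current = None
    for line in text.splitlines():
        for word in line.split("#", 1)[0].split():
            if word.startswith("data_"):
                # A new data_ keyword closes the previous block's scope.
                current = word[len("data_"):]
                if current in blocks:
                    raise ValueError(f"data block {current!r} appears twice")
                blocks[current] = []
            elif word.startswith("_"):
                if current is None:
                    raise ValueError(f"{word} appears outside any data block")
                if word in blocks[current]:
                    raise ValueError(f"{word} repeated in data block {current!r}")
                blocks[current].append(word)
    return blocks


TWO_BLOCKS = """
data_native
_struct.entry_id   1CBN
_struct.title      'PLANT SEED PROTEIN'
data_mutant
_struct.entry_id   ?
"""

print(data_names_by_block(TWO_BLOCKS))
```

The same data name appears in both blocks without conflict, since uniqueness is only required within a block; the pair (block name, data name) is what identifies an item.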
The relationship between these features is illustrated below:
You can view this from the top down (how a molecular biologist would think of it) or from the bottom up (how, I think, a crystallographer would think of it). To remain unbiased we will start in the middle.
An entity is a unique chemical feature of the structure and is represented by the ENTITY category group. If that entity happens to be a polymer (most likely a polypeptide chain or DNA strand), it will be described by the ENTITY_POLY category group, which details the overall features of the polymer, indicating the presence of non-standard monomers, chirality, and linkages. The actual sequence of monomers describing the polymer is found in the ENTITY_POLY_SEQ category.
The chemical components of the structure are detailed in the CHEM_COMP category group. This is where, for example, the details of a non-standard group would be given, including connectivity and geometry.
At the bottom of the hierarchy is the category group ATOM_SITE which describes the actual atomic positions. Another category group (not shown) ATOM_SITES describes the overall features of a group of atom positions. Full compatibility with the PDB ATOM and HETATM record types is maintained.
Moving up the hierarchy, each ATOM_SITE is a member of a unique component of the asymmetric unit, designated by the STRUCT_ASYM category. Similarly, groups of atom sites are used to designate particular structures of interest: STRUCT_SITE (e.g., active sites), STRUCT_CONN (e.g., hydrogen bonds, salt bridges, disulphide linkages), and STRUCT_CONF (e.g., beta strands, alpha helices, turns).
Finally, components of the asymmetric unit can be combined to produce biologically interesting components. These are described by the category group STRUCT_BIOL and generated within the STRUCT_BIOL_GEN group.
We can now apply these definitions to features of several structures and then move on to complete files containing a full description of the experiment.
Crambin consists of a single polypeptide chain of 48 amino acids (grey). There is a single ethanol molecule co-crystallized (red) and solvent (not shown). The structure is stabilized by 3 disulphide linkages (yellow). Using the author assignments for secondary structure, we have residues 7-19 and 23-30 as helix (green) and residues 1-4 and 32-35 as beta strand (blue).
There are 3 unique chemical components (entities) - the polypeptide chain, the ethanol molecule, and the solvent. These could be described as follows:
    loop_
    _entity.id
    _entity.type
    _entity.formula_weight
    _entity.src_method
    A         polymer       4716   'NATURAL'
    ethanol   non-polymer   52     'SYNTHETIC'
    H20       water         18     .

This is not the complete list of data items in the entity category; however, there is no need to include all data items. Those that must be included if the category is used at all are labelled mandatory. The only mandatory data item in the entity category is _entity.id. Which categories are mandatory for a data submission will presumably be determined by community consensus.
It is then possible to expand upon this basic description of each entity using the _entity.id as a reference. So, for example, the common and systematic names are specified as:

    _entity_name_com.entity_id   A
    _entity_name_com.name        crambin
    _entity_name_sys.entity_id   A
    _entity_name_sys.name        'Crambe Abyssinica'

Similarly, the natural and synthetic descriptions can be given in more detail, so for the natural product we have:

    _entity_src_nat.entity_id     A
    _entity_src_nat.common_name   'Abyssinian Cabbage'
    _entity_src_nat.genus         ?
    _entity_src_nat.species       ?
    _entity_src_nat.details       ?

In this context the ? indicates that the value is unknown but should be supplied.
Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is straightforward since each entity appears only once in the asymmetric unit:
    loop_
    _struct_asym.id
    _struct_asym.entity_id
    _struct_asym.details
    chain_a   A         'Single polypeptide chain'
    ethanol   ethanol   'Cocrystallized ethanol molecule'
    HOH       water     .

Entities classified as polymer, in this instance only the entity identified as A, are further described. First, the overall features of the polypeptide chain:

    _entity_poly.entity_id        A
    _entity_poly.type             polypeptide(L)
    _entity_poly.nstd_chirality   no
    _entity_poly.nstd_linkage     no
    _entity_poly.nstd_monomers    no
    _entity_poly.type_details     'None Available'

These data items indicate that there is nothing non-standard about the entity: no non-standard monomers, linkages or chirality.
Now the components of the polymer entity:
    loop_
    _entity_poly_seq.entity_id
    _entity_poly_seq.num
    _entity_poly_seq.mon_id
    A    1   THR
    A    2   THR
    # [data omitted]
    A   22   PRO
    A   23   GLU
    A   24   ALA
    A   25   LEU
    # [data omitted]
    A   47   ALA
    A   48   ASN

The entity may also exist in other databases, and these references may be cited and described. For the entity designated A, which is defined in Genbank:

    _entity_reference.entity_id       A
    _entity_reference.database_name   Genbank
    _entity_reference.database_code   ?
    _entity_reference.details         ?

Once each component of the asymmetric unit is defined, the details of the secondary structure can be defined using the STRUCT_CONF category:

    loop_
    _struct_conf.id
    _struct_conf.conf_type_id
    _struct_conf.beg_label_comp_id
    _struct_conf.beg_label_asym_id
    _struct_conf.beg_label_seq_id
    _struct_conf.end_label_comp_id
    _struct_conf.end_label_asym_id
    _struct_conf.end_label_seq_id
    _struct_conf.details
    H1   HELX-RHAL   ILE   chain_a    7   PRO   chain_a   19   'HELX-RH3T 17-19'
    H2   HELX-RHAL   GLU   chain_a   23   THR   chain_a   30   'Alpha-N start'
    S1   STRN        CYS   chain_a   32   ILE   chain_a   35   .
    S2   STRN        THR   chain_a    1   CYS   chain_a    4   .
    T1   TURN-TY1P   ARG   chain_a   17   GLY   chain_a   20   .
    T2   TURN-TY1P   PRO   chain_a   41   TYR   chain_a   44   .

These assignments are similar to those made in a PDB file for the record types HELIX, TURN and SHEET; however, the STRUCT_CONF_TYPE category (Table I) specifies the method of assignment, which could, for example, be deduced by the crystallographer from the electron density maps or defined algorithmically:

    loop_
    _struct_conf_type.id
    _struct_conf_type.criteria
    _struct_conf_type.reference
    HELX-RHAL     'author judgement'    .
    STRN          'author judgement'    .
    TURN-TY1P     'author judgement'    .
    # HELX-RHAL   'Kabsch and Sander'   'Biopolymers (1983) 22:2577'

The commented entry at the end is a hypothetical example for a calculated assignment. Data items also exist (Table I) for the description of beta sheets, but are not shown in this introductory example.
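To illustrate how STRUCT_CONF and STRUCT_CONF_TYPE work together, here is a small Python sketch that looks up which conformation records cover a given residue and reports the assignment criteria for each. The values are transcribed from the two listings above; the function name conformations_at and the tuple layout are assumptions made for the example.

```python
# (id, conf_type_id, asym_id, beg_label_seq_id, end_label_seq_id)
STRUCT_CONF = [
    ("H1", "HELX-RHAL", "chain_a", 7, 19),
    ("H2", "HELX-RHAL", "chain_a", 23, 30),
    ("S1", "STRN", "chain_a", 32, 35),
    ("S2", "STRN", "chain_a", 1, 4),
    ("T1", "TURN-TY1P", "chain_a", 17, 20),
    ("T2", "TURN-TY1P", "chain_a", 41, 44),
]

# _struct_conf_type.id -> _struct_conf_type.criteria
STRUCT_CONF_TYPE = {
    "HELX-RHAL": "author judgement",
    "STRN": "author judgement",
    "TURN-TY1P": "author judgement",
}


def conformations_at(asym_id, seq_id):
    """Return (conf id, conf type, criteria) for every range covering seq_id."""
    return [
        (conf_id, conf_type, STRUCT_CONF_TYPE[conf_type])
        for conf_id, conf_type, asym, beg, end in STRUCT_CONF
        if asym == asym_id and beg <= seq_id <= end
    ]


print(conformations_at("chain_a", 18))
print(conformations_at("chain_a", 5))
```

Residue 18 falls in both helix H1 and turn T1, which shows that the ranges are allowed to overlap; residue 5 has no assignment at all.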
Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE categories:
    loop_
    _struct_conn.id
    _struct_conn.conn_type_id
    _struct_conn.ptnr1_label_comp_id
    _struct_conn.ptnr1_label_asym_id
    _struct_conn.ptnr1_label_seq_id
    _struct_conn.ptnr1_label_atom_id
    _struct_conn.ptnr1_role
    _struct_conn.ptnr1_symmetry
    _struct_conn.ptnr2_label_comp_id
    _struct_conn.ptnr2_label_asym_id
    _struct_conn.ptnr2_label_seq_id
    _struct_conn.ptnr2_label_atom_id
    _struct_conn.ptnr2_role
    _struct_conn.ptnr2_symmetry
    _struct_conn.details
    SS1 disulf CYS chain_a  3 SG .        1_555 CYS chain_a 40 SG .        1_555 .
    SS2 disulf CYS chain_a  4 SG .        1_555 CYS chain_a 32 SG .        1_555 .
    # [data omitted]
    HB1 hydrog SER chain_a  6 OG positive 1_555 LEU chain_a  8 O  negative 1_556 .
    HB2 hydrog ARG chain_a 17 N  positive 1_555 ASP chain_a 43 O  negative 1_554 .
    # [data omitted]

These intermolecular interactions are partially specified on PDB CONECT records. However, mmCIF provides an additional level of detail, such that the criteria used to define an interaction may be given using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge and a hydrogen bond:
    loop_
    _struct_conn_type.id
    _struct_conn_type.criteria
    _struct_conn_type.reference
    saltbr   'negative to positive distance > 2.5 Å and < 3.2 Å'   .
    hydrog   'N to O distance > 2.5 Å, < 3.2 Å, NOC angle < 120'   .
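Criteria of this kind are free text in the data file, but they are easy to apply in a program. The Python sketch below tests the hypothetical hydrogen bond criterion quoted above (N to O distance between 2.5 Å and 3.2 Å, NOC angle below 120 degrees) against Cartesian coordinates. The function names, the default thresholds as parameters, and the coordinates are all invented for the illustration and do not come from crambin.

```python
import math


def angle_deg(a, vertex, c):
    """Angle a-vertex-c in degrees."""
    v1 = [p - q for p, q in zip(a, vertex)]
    v2 = [p - q for p, q in zip(c, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def is_hydrogen_bond(n, o, c, d_min=2.5, d_max=3.2, max_angle=120.0):
    """Apply the hypothetical 'hydrog' criterion to atoms N, O and C."""
    return d_min < math.dist(n, o) < d_max and angle_deg(n, o, c) < max_angle


# Made-up coordinates in Angstrom: N...O is 3.0 and the N-O-C angle is 90 degrees.
n_atom = (0.0, 0.0, 0.0)
o_atom = (3.0, 0.0, 0.0)
c_atom = (3.0, 1.2, 0.0)

print(is_hydrogen_bond(n_atom, o_atom, c_atom))
```

Recording the thresholds in STRUCT_CONN_TYPE means a reader can rerun such a test and see why a particular contact was, or was not, listed.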
The protein consists of 2 domains R (red) and L (green). The DNA strands (gold) are complementary given a one base offset.
    loop_
    _struct_biol.id
    _struct_biol.details
    complex
    ; The complex consists of 2 protein domains bound to a 20 base pair DNA
      segment.
    ;
    protein
    ; Each of the 2 protein domains is a single homologous polypeptide chain
      of 71 residues. The first chain is designated L and the second
      designated R.
    ;
    DNA
    ; The two strands (A and B) are complementary given a one base offset.
    ;

The protein/DNA complex, the protein, and the DNA are considered as three separate biological components, each generated from the contents of the asymmetric unit. No crystallographic symmetry need be applied to generate the biologically relevant components.
    loop_
    _struct_biol_gen.biol_id
    _struct_biol_gen.asym_id
    _struct_biol_gen.symmetry
    complex   L   1_555
    complex   R   1_555
    complex   A   1_555
    complex   B   1_555
    protein   L   1_555
    protein   R   1_555
    DNA       A   1_555
    DNA       B   1_555

    loop_
    _entity.id
    _entity.type
    434     polymer
    DNA_A   polymer
    DNA_B   polymer
    water   water

Since the two protein domains are chemically identical, they constitute a single entity, which has been designated 434. The complementary DNA strands are not chemically identical and therefore constitute two separate entities. The contents of the asymmetric unit are then:
    loop_
    _struct_asym.id
    _struct_asym.entity_id
    _struct_asym.details
    L     434     '71 residue polypeptide chain'
    R     434     '71 residue polypeptide chain'
    A     DNA_A   '20 base strand'
    B     DNA_B   '20 base strand'
    HOH   water   'solvent'

Features of the CRO 434 secondary structure and intermolecular contacts can be described in the same way in which crambin was represented and are not repeated.
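Because the categories refer to one another by identifier, a simple consistency check is possible. The Python sketch below uses the CRO 434 values listed above to confirm that every _struct_asym.entity_id names an entity and every _struct_biol_gen.asym_id names a component of the asymmetric unit. It also decodes the symmetry field, assuming the common n_klm convention in which n is the number of the symmetry operation and the digits k, l, m give lattice translations offset by 5, so that 1_555 means the first operation with no translation. The function names are hypothetical.

```python
ENTITY_IDS = {"434", "DNA_A", "DNA_B", "water"}

# _struct_asym.id -> _struct_asym.entity_id
STRUCT_ASYM = {"L": "434", "R": "434", "A": "DNA_A", "B": "DNA_B", "HOH": "water"}

# (_struct_biol_gen.biol_id, _struct_biol_gen.asym_id, _struct_biol_gen.symmetry)
STRUCT_BIOL_GEN = [
    ("complex", "L", "1_555"), ("complex", "R", "1_555"),
    ("complex", "A", "1_555"), ("complex", "B", "1_555"),
    ("protein", "L", "1_555"), ("protein", "R", "1_555"),
    ("DNA", "A", "1_555"), ("DNA", "B", "1_555"),
]


def decode_symmetry(code):
    """Turn 'n_klm' into (operation number, (ta, tb, tc)); assumed convention."""
    operation, _, shifts = code.partition("_")
    if not operation.isdigit() or len(shifts) != 3 or not shifts.isdigit():
        raise ValueError(f"unexpected symmetry code: {code!r}")
    return int(operation), tuple(int(digit) - 5 for digit in shifts)


def check_references(entity_ids, struct_asym, struct_biol_gen):
    """Return a list of dangling references (empty if all are resolved)."""
    problems = []
    for asym_id, entity_id in struct_asym.items():
        if entity_id not in entity_ids:
            problems.append(f"struct_asym {asym_id}: unknown entity {entity_id}")
    for biol_id, asym_id, _symmetry in struct_biol_gen:
        if asym_id not in struct_asym:
            problems.append(f"struct_biol_gen {biol_id}: unknown asym {asym_id}")
    return problems


print(check_references(ENTITY_IDS, STRUCT_ASYM, STRUCT_BIOL_GEN))
print(decode_symmetry("1_555"))
```

Every STRUCT_BIOL_GEN row decodes to operation 1 with zero translation, which matches the statement that no crystallographic symmetry needs to be applied.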
Selenomethionyl ribonuclease H (1RNH) consists of a single polypeptide chain (blue) with 3 methionyl substitutions shown in red. A fourth substitution at the N terminus is not seen in the electron density maps.
In mmCIF terminology monomers are structural building blocks which may or may not polymerize, and therefore comprise amino acids and nucleotides (modified and unmodified) as well as heterogen groups. The category group CHEM_COMP (Table I) permits a full description of monomers, including connectivity and geometry.
Ribonuclease H has 4 methionines substituted by 4 selenomethionines at positions 1, 47, 50, and 142 of the single polypeptide chain (Fig. 5). The N terminal substitution is not observed in the electron density maps at 2.0 Å.
    _chem_comp.id                MSE
    _chem_comp.model_source      '1987 Protin/Prolsq ideals'
    _chem_comp.name              selenomethionine
    _chem_comp.mon_nstd_flag     yes
    _chem_comp.mon_nstd_class    'modified amino acid'
    _chem_comp.mon_nstd_parent   methionine
    _chem_comp.mon_nstd_details
    ; Derived from protein expressed in a selenomethionine rich auxotrophic
      environment.
    ;

and then appears in the sequence representation as:

    loop_
    _entity_poly_seq.entity_id
    _entity_poly_seq.num
    _entity_poly_seq.mon_id
    A   1   MSE
    A   2   LEU
    # [data omitted]
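The link between ENTITY_POLY_SEQ and CHEM_COMP can be followed programmatically. In this short Python sketch, each monomer in the sequence is looked up by its identifier, and the non-standard ones are reported together with their parent. The MSE values are taken from the listing above; the LEU entry, the dictionary layout and the function name nonstandard_monomers are assumptions added so that the example is self-contained.

```python
# _chem_comp.id -> selected _chem_comp items
CHEM_COMP = {
    "MSE": {"name": "selenomethionine", "mon_nstd_flag": "yes",
            "mon_nstd_parent": "methionine"},
    # Assumed entry for a standard residue (not given in the text).
    "LEU": {"name": "leucine", "mon_nstd_flag": "no",
            "mon_nstd_parent": None},
}

# (_entity_poly_seq.entity_id, _entity_poly_seq.num, _entity_poly_seq.mon_id)
ENTITY_POLY_SEQ = [("A", 1, "MSE"), ("A", 2, "LEU")]


def nonstandard_monomers(sequence, chem_comp):
    """Return (entity_id, num, mon_id, parent) for each non-standard monomer."""
    found = []
    for entity_id, num, mon_id in sequence:
        component = chem_comp[mon_id]
        if component["mon_nstd_flag"] == "yes":
            found.append((entity_id, num, mon_id, component["mon_nstd_parent"]))
    return found


print(nonstandard_monomers(ENTITY_POLY_SEQ, CHEM_COMP))
```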