What is CIF? A Gentle Introduction

Last Update Oct. 04, 1995
CIF Editor

Abstract

This page introduces CIF, the Crystallographic Information File, a data representation used by several disciplines (predominantly crystallography) concerned with molecular structure. The basis for this data representation is the Self Defining Text Archive and Retrieval (STAR) definition. STAR is nothing more than a set of syntax rules. Associated with STAR is a Dictionary Definition Language (DDL) from which STAR compliant dictionaries have been developed by several disciplines. From the dictionaries it is possible to define data files which use data items referenced in the dictionaries. The STAR DDL and associated dictionaries are considered an example of metadata: data describing how to represent other data.

The disciplines that have developed dictionaries and the common acronyms by which they are known are as follows (some of this information is taken from the IUCr home page).

There are also several sub-dictionaries, for example, on monomers and non-commensurate structures.

The rationale behind CIF, components of STAR and the basic construct of a STAR dictionary and file are given on this page.

Introduction - Why Metadata?

The statement "the B value for the serine C alpha at position 101 is 31.2" is meaningful to a crystallographer, but may not be understood or may be misinterpreted by someone from a different discipline, not to mention a computer. Defining precisely what a B value represents, as well as a serine, C alpha, and position 101, by means of a dictionary makes this statement explicit. The formalism used to represent these explicit definitions can, for the purposes of this discussion, be considered an example of metadata. This formalism can include, for example, a range of possible values for B and an extensible scheme for declaring properties of a serine. In short, some implications of a good metadata reference model are:

Another way of looking at the function of metadata is that it provides the opportunity to shift knowledge from the domain scientist to a usable machine readable form. For example, if the metadata includes an explicit definition of the interrelationships between data items, a representative data structure can be constructed without any a priori knowledge.

The need for a low-level data representation has been recognized both in other disciplines (American Petroleum Institute, 1992), (IEEE, 1994), (Pfaltz et al., 1992) and in molecular biology through the use of ASN.1 (Ostell, 1990) to represent primary sequence and other forms of data.

Introduction - STAR

The metadata representation used by CIF is based upon the Self Defining Text Archive and Retrieval (STAR) Data Definition Language (DDL) initially developed by Cook and Hall (1992).

Using STAR it is possible to define a Data Definition Language (DDL) and then use that language to define a dictionary of the terms that you wish to be common to everyone in the sub-discipline. Data files can then be created that contain terms used in the dictionaries. This scheme is depicted below.

History - CIF

The Crystallographic Information File (CIF) is based on a subset of STAR and was its first major application. CIF was developed by the International Union of Crystallography (IUCr) Working Party on Crystallographic Information at the behest of the IUCr Commission on Crystallographic Data and the IUCr Commission on Journals. The result of this effort was a core dictionary of data items sufficient for archiving the small molecule crystallographic experiment and its results (Hall et al., 1991). This core dictionary was adopted by the IUCr at its 1990 Congress in Bordeaux. Much of the software used in small molecule crystallography now produces CIFs (data files compliant with the CIF syntax and containing data items found in the CIF dictionaries) directly, and approximately 50% of papers are submitted to Acta Crystallographica Volume C as CIFs. This facilitates the review process, since a paper can be typeset directly from a CIF and the data submitted can be verified using a consistent set of software tools maintained by the journal.

In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to the macromolecular crystallographic experiment (Fitzgerald et al., 1993). The macromolecular dictionary (mmCIF), which encompasses many of the data items from the core dictionary, is in the final stages of development. Not surprisingly, given the complexity of biological macromolecules, the dictionary is extensive and contains over 600 data items unique to macromolecules.

The maintenance of the powder diffraction dictionary, macromolecular dictionary and core small molecule dictionary is overseen by the IUCr appointed COMCIFS committee.

It was recognized early in the development of mmCIF that effective use required good software tools. In turn, attempting to develop good software led to changes in the STAR DDL. This evolutionary process of both DDL and dictionaries continues under the auspices of the IUCr in a manner that is upwardly compatible.

At the time of writing there are two draft DDL dictionaries: DDL 1.3 in which the core dictionary is cast and DDL 2.0.8 in which the macromolecular dictionary is being cast. Eventually it is expected that v1.3 will evolve into a single v2.x encompassing v2.0.8.

In parallel, a number of other major dictionaries have been developed. At this time these are all based on DDL 1.3 or earlier. These are the powder dictionary (CIFdic.p94) and an NMR dictionary. Since the specification of dictionaries is intended to be extensible, it is not surprising that sub-dictionaries have started to appear. This is an obvious strength of this approach: since the dictionaries are STAR compliant, basic software for browsing, editing and parsing a STAR file will immediately work on a new dictionary or data file.

A Few Details on STAR

STAR consists of a simple syntax and a DDL which obeys that syntax; both are then used to define a STAR dictionary. A STAR dictionary, like a STAR compliant data file, consists of tag-value pairs. A STAR file may consist of one or more data blocks, each of which begins with a data_ declaration. Each data block consists of data items, each identified by a name with a leading underscore (the tag) followed by a value for that data item (the value). Repetitive data values may be grouped into a loop_ structure. A data item can exist only once in a data block. Global values are applied to one or more data blocks by way of a global_ declaration.

A save frame is a subset of a data block which may have pointers to other save frames. The frame starts with a save_framename statement, where framename is a name unique within the data block, and is closed with a save_ statement. A save frame may contain data items and loop structures but not other save frames; it may, however, reference other save frames in the same data block using $framename pointers.
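The rules above can be illustrated with a small reader. The following Python is only a minimal sketch, not a reference implementation: the names tokenize and parse_star are hypothetical, and it assumes a reduced form of STAR with data_ blocks, single tag-value pairs, one level of loop_, quoted strings and # comments. Nested loops (stop_), save frames, global_ and multi-line text values are deliberately not handled and raise an error.

```python
import re

# One token per match: a '#' comment, a quoted string, or a bare word.
TOKEN = re.compile(
    r"""(?P<comment>#.*)|'(?P<sq>[^']*)'|"(?P<dq>[^"]*)"|(?P<bare>\S+)"""
)
KEYWORDS = ("data_", "loop_", "save_", "global_", "stop_")


def tokenize(text):
    """Return (token, is_quoted) pairs, skipping '#' comments."""
    tokens = []
    for line in text.splitlines():
        for m in TOKEN.finditer(line):
            if m.group("comment") is not None:
                break  # the rest of the line is a comment
            quoted = m.group("sq") if m.group("sq") is not None else m.group("dq")
            if quoted is not None:
                tokens.append((quoted, True))
            else:
                tokens.append((m.group("bare"), False))
    return tokens


def is_tag(token):
    text, quoted = token
    return not quoted and text.startswith("_")


def is_keyword(token):
    text, quoted = token
    return not quoted and text.lower().startswith(KEYWORDS)


def add_item(block, tag, value):
    """Store one data item, enforcing 'only once per data block'."""
    if block is None:
        raise ValueError(f"{tag} appears before any data_ declaration")
    if tag in block:
        raise ValueError(f"{tag} appears twice in one data block")
    block[tag] = value


def parse_star(text):
    """Return {block_name: {tag: value, or list of values for looped tags}}."""
    tokens = tokenize(text)
    blocks, current, i = {}, None, 0
    while i < len(tokens):
        tok, quoted = tokens[i]
        if not quoted and tok.lower().startswith("data_"):
            current = blocks.setdefault(tok[5:], {})
            i += 1
        elif not quoted and tok.lower() == "loop_":
            i += 1
            tags, values = [], []
            while i < len(tokens) and is_tag(tokens[i]):
                tags.append(tokens[i][0])
                i += 1
            while i < len(tokens) and not (is_tag(tokens[i]) or is_keyword(tokens[i])):
                values.append(tokens[i][0])
                i += 1
            if not tags or not values or len(values) % len(tags):
                raise ValueError("malformed or unsupported loop_")
            # Values are listed row by row; slice them back into columns.
            for k, tag in enumerate(tags):
                add_item(current, tag, values[k::len(tags)])
        elif is_tag(tokens[i]):
            if i + 1 >= len(tokens) or is_tag(tokens[i + 1]) or is_keyword(tokens[i + 1]):
                raise ValueError(f"{tok} has no value")
            add_item(current, tok, tokens[i + 1][0])
            i += 2
        else:
            raise ValueError(f"unexpected or unsupported token {tok!r}")
    return blocks
```

Calling parse_star on a string holding a data block returns a dictionary keyed by block name. Single items map to strings and looped items map to lists, one entry per row of the loop. All values are kept as text, since interpreting them is the job of a dictionary.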

The following is an example of a simple STAR file construct:


#           S 2   O 6
#          /  \  //
#        3/    \//
#         |     |
#         |_____|
#         4     5
#

data_Ex0_1

_define_object_class             molecule
_molecule_name_common            thiobutyrolactone

loop_
    _atom_id
    _atom_type
    _atom_attach_h   
                    1 C 0  2 S 0  3 C 2  4 C 2  5 C 2  6 O 0  

loop_
    _bond_id_1
    _bond_id_2
    _bond_type_mif
                    1 2 s  2 3 s  3 4 s  4 5 s  5 1 s  1 6 d 


_display_scale          50
_display_span_x        150
_display_span_y        100

loop_
    _display_id
    _display_object
    _display_symbol
    _display_colour
    _display_size
    _display_coord_x
    _display_coord_y
    loop_
        _display_conn_id
        _display_conn_symbol
        _display_conn_colour

                 1  null  .      .   .  200 200  2 1b  black stop_
                 2  text  S  yellow  10 150 250  3 1b  black stop_
                 3  null  .      .   .  100 200  4 1b  black stop_
                 4  null  .      .   .  125 150  5 1b  black stop_
                 6  text  O    blue  10 250 250  1 2b  black stop_



A Few Details on CIF

A CIF dictionary or CIF compliant data file uses the STAR syntax with the following restrictions:

The following is an extract from a small molecule CIF data block:

data_P6122
# Sample from a CIF file based upon DDL 1.3
#

_chemical_name_systematic
;   trans-3-Benzoyl-2-(tert-butyl)-4-(iso-butyl)-1,3-oxazolidin-5-one
;

_chemical_compound_source                'tree bark'

_chemical_melting_point                  56

loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_thermal_displace_type
_atom_site_calc_flag
_atom_site_calc_attached_atom
_atom_site_type_symbol
  s  .20200  .79800  .91667  .030(3)  Uij  ?  ?  s
  o  .49800  .49800  .66667  .02520  Uiso  ?  ?  o
  c1  .48800  .09600  .03800  .03170  Uiso  ?  ?  c

The above shows three single data items, e.g. _chemical_melting_point, and a loop structure. There are some items to note here about the STAR syntax. To be a valid data file, each tag should appear in a dictionary. So for example:

data_atom_site_calc_flag
    _name                      '_atom_site_calc_flag'
    _category                    atom_site
    _type                        char
    _list                        yes
    _list_reference            '_atom_site_label'
    loop_ _enumeration
          _enumeration_detail   d     'determined from diffraction measurements'
                                calc  'calculated from molecular geometry'
                                c     'abbreviation for "calc"'
                               dum    'dummy site with meaningless coordinates'
    _enumeration_default        d
    _definition
;              A standard code to signal if the site data has been determined by
               diffraction data or calculated from the geometry of surrounding
               sites, or has been assigned dummy coordinates. The abbreviation
               'c' may be used in place of 'calc'.
;

is the entry for _atom_site_calc_flag.
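
A program can use such an entry to check data items rather than hard coding the permitted values. The Python below is a minimal sketch under stated assumptions: DICTIONARY is a hypothetical in-memory form of two dictionary entries (not the output of any real CIF library), check_item is an illustrative name, and the values ? and . are treated as placeholders for unknown or inapplicable values, following the usual CIF convention.

```python
# Hypothetical in-memory form of dictionary entries, keyed by tag.
# The enumeration for _atom_site_calc_flag is copied from the entry above;
# the _atom_site_label entry is an assumed, simplified stand-in.
DICTIONARY = {
    "_atom_site_calc_flag": {
        "category": "atom_site",
        "type": "char",
        "enumeration": {"d", "calc", "c", "dum"},
        "enumeration_default": "d",
    },
    "_atom_site_label": {"category": "atom_site", "type": "char"},
}

# Assumption: '?' (unknown) and '.' (inapplicable) are always acceptable.
PLACEHOLDERS = {"?", "."}


def check_item(tag, value, dictionary=DICTIONARY):
    """Return a list of problems for one tag-value pair (empty if none)."""
    entry = dictionary.get(tag)
    if entry is None:
        return [f"{tag} is not defined in the dictionary"]
    if value in PLACEHOLDERS:
        return []
    allowed = entry.get("enumeration")
    if allowed is not None and value not in allowed:
        return [f"{tag}: {value!r} is not one of {sorted(allowed)}"]
    return []
```

With this sketch, check_item("_atom_site_calc_flag", "calc") reports no problems, while an unlisted value such as "guess", or a tag missing from the dictionary, produces a one-line message.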

Even without a knowledge of the complete STAR syntax, it should be apparent that the CIF dictionary is itself CIF compliant: each data block characterizes a single data item using components of the Data Definition Language (DDL). These components are themselves described in a separate CIF compliant dictionary, as illustrated for _enumeration_range.


data_enumeration_range    
    _name                      '_enumeration_range'  
    _category                    enumeration
    _type                        char
    _definition
;  The range of values permitted for a defined numerical item. 
   The construction is 'min:max'. If 'max' is omitted then the 
   item can have any value greater than or equal to 'min'.
;
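
The 'min:max' construction is simple enough to interpret in a few lines. The Python below is a sketch only: parse_range and in_range are hypothetical names, both bounds are assumed to be inclusive, and a missing 'min' is treated as an error, since the definition above only describes what happens when 'max' is omitted.

```python
def parse_range(spec):
    """Split a 'min:max' string into (low, high); high is None if omitted."""
    low_text, sep, high_text = spec.partition(":")
    if not sep or not low_text.strip():
        # Assumption: 'min' is mandatory; the definition only covers omitted 'max'.
        raise ValueError(f"not a 'min:max' range: {spec!r}")
    low = float(low_text)
    high = float(high_text) if high_text.strip() else None
    return low, high


def in_range(value, spec):
    """True if value lies within the range (bounds assumed inclusive)."""
    low, high = parse_range(spec)
    return value >= low and (high is None or value <= high)
```

For example, in_range(0.5, "0.0:1.0") is true, in_range(1.5, "0.0:1.0") is false, and with an open upper bound in_range(1000.0, "0.0:") is true.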

Close examination reveals that the DDL is a compromise between the desires of the scientist to define characteristics such as _definition, useful in defining the content of the data, and _list_reference (analogous to a primary key), which defines the relationship between items of data and is useful to programmers.

It would not be a gentle introduction to discuss these issues in detail. But to get the flavor, consider the following:

Content and Context

The descriptor _enumeration_*, which is a shorthand notation for _enumeration, _enumeration_detail, _enumeration_example, and _enumeration_range, provides useful information about a data item to both a crystallographer reading the dictionary and a programmer checking whether a data item present in a data file is valid. The _category descriptor, on the other hand, is useful to the programmer since it logically groups together data items. This grouping might be obvious to a crystallographer, but previously such a relationship would need to be written into a computer program. Now the opportunity exists to read this information from the dictionary rather than hard code it.
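
As a rough illustration of reading the grouping from the dictionary instead of hard coding it, the Python sketch below collects tags by their _category. The function name group_by_category and the in-memory DICTIONARY layout are hypothetical. The categories for _atom_site_calc_flag and _enumeration_range are taken from the entries shown earlier, while the one for _atom_site_label is assumed for the example.

```python
from collections import defaultdict

# Hypothetical in-memory dictionary: tag -> descriptors read from its entry.
DICTIONARY = {
    "_atom_site_calc_flag": {"category": "atom_site"},
    "_atom_site_label": {"category": "atom_site"},  # assumed for illustration
    "_enumeration_range": {"category": "enumeration"},
}


def group_by_category(tags, dictionary=DICTIONARY):
    """Group tags by the category the dictionary assigns to them.

    Tags that the dictionary does not define are collected under None.
    """
    groups = defaultdict(list)
    for tag in tags:
        entry = dictionary.get(tag)
        category = entry["category"] if entry else None
        groups[category].append(tag)
    return dict(groups)
```

A program written this way needs no change when a new category is added to the dictionary, since the grouping is driven entirely by the dictionary contents.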

Where to Now?

We have used the CIF dictionary to illustrate the basic features of STAR and, hopefully, to get across the point that such dictionaries facilitate data exchange and let scientists talk about an item of data, each knowing exactly what is meant.

The picture is more complex (yet richer) when considering the implications of DDL version 2.0. The suggested next step is to further explore the ideas behind STAR by working through the discipline with which you are most familiar.