What is CIF? A Gentle Introduction
Last Update Oct. 04, 1995
This page introduces CIF - Crystallographic Information File - which
is a data representation used by several disciplines (predominantly
crystallography) concerned with
The basis for this data representation is the
Self Defining Text Archive and Retrieval (STAR) definition. STAR is nothing more than a set of
syntax rules. Associated with STAR is a Dictionary Definition Language
(DDL) from which STAR compliant dictionaries have been
developed by several disciplines. From the dictionaries it is
possible to define data files which use data items referenced
in the dictionaries. The STAR DDL and associated dictionaries
are considered an example of metadata, that is, data
describing how to represent other data.
The disciplines that have developed dictionaries and the common
acronyms by which they are known are as follows (some of this information
is taken from the
IUCr home page).
There are also several sub-dictionaries,
for example, on
The rationale behind CIF, components of STAR and the basic construct of a
STAR dictionary and file are given on this page.
Introduction - Why Metadata?
The statement "the B value for the serine C alpha at position 101 is 31.2"
is meaningful to a crystallographer, but may not be understood or may be
misinterpreted by someone from a different discipline, not to
mention a computer. Defining precisely
what a B value represents as well as a serine, C alpha, and position 101 by
means of a dictionary makes this statement explicit. The formalism
used to represent
these explicit definitions can, for the purposes of
this discussion, be considered an example of
metadata. This formalism can include, for example,
a range of possible values for B and
an extensible scheme for declaring properties
of a serine. In short, some implications
of a good metadata reference model are:
- Simple query, browsing and retrieval;
- Consistent data resulting from autonomous validation and verification
(here validation implies
that the data conforms to the metadata specification and
verification implies the data lies within known boundaries);
- Simple and consistent data exchange;
- A unified view of disparate types of data;
- Accommodation of new knowledge as it is discovered;
- Inclusion of procedures (methods) to specify how a particular
item of data is derived or verified.
Another way of looking at the function of metadata is that it provides
the opportunity to shift knowledge
from the domain scientist to a usable machine readable form.
For example, if the metadata includes an explicit definition of the
interrelationships between data items, a representative data structure can
be constructed without any a priori knowledge.
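The distinction drawn in the list above between validation (the data conforms to the metadata specification) and verification (the data lies within known boundaries) can be sketched in Python. This is only an illustration under stated assumptions: the METADATA layout, the item names b_value and residue_name, and the numerical limits are hypothetical and are not taken from any CIF dictionary.

```python
# Hypothetical metadata: each known data item declares a type and,
# optionally, the boundaries within which its values are expected to lie.
# Item names and limits are illustrative assumptions only.
METADATA = {
    "b_value": {"type": float, "min": 0.0, "max": 200.0},
    "residue_name": {"type": str},
}


def validate(name, value):
    """Validation: does the item conform to the metadata specification?"""
    spec = METADATA.get(name)
    return spec is not None and isinstance(value, spec["type"])


def verify(name, value):
    """Verification: does a valid value lie within the known boundaries?"""
    if not validate(name, value):
        return False
    spec = METADATA[name]
    if "min" in spec and value < spec["min"]:
        return False
    if "max" in spec and value > spec["max"]:
        return False
    return True


if __name__ == "__main__":
    # A B value of 31.2 is both valid and verified; 500.0 is valid
    # (a known item of the right type) but fails verification.
    for candidate in (31.2, 500.0):
        print(candidate, validate("b_value", candidate), verify("b_value", candidate))
```

Keeping the two checks separate mirrors the wording of the list: a value can be well formed according to the metadata and still be implausible.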
The need for a low-level data
representation has been recognized both in other disciplines
(American Petroleum Institute, 1992; Pfaltz et al., 1992)
and in molecular biology through the use of
ASN.1 (Ostell, 1990) to represent
primary sequence and other forms of data.
Introduction - STAR
The metadata representation used by CIF is based upon
the Self Defining Text Archive and Retrieval (STAR) Data Definition Language
(DDL) initially developed by
Cook and Hall (1992).
Using STAR it is possible to define a Data Definition Language
and then use that language to define a dictionary of the terms that
you wish to be common to everyone in the sub-discipline. Data files
can then be created that contain terms used in the dictionaries.
This scheme is depicted below.
History - CIF
The Crystallographic Information File (CIF)
is based on a subset of STAR and was its first major application.
CIF was developed by the
International Union of Crystallography (IUCr) Working Party on
Crystallographic Information at the behest of the IUCr Commission on
Crystallographic Data and the IUCr Commission on Journals. The result of
this effort was a core dictionary of data items sufficient for archiving the
small molecule crystallographic experiment and its results
(Hall et al., 1991).
This core dictionary was adopted by the IUCr at its
1990 Congress in Bordeaux. Much of the software used in small molecule
crystallography now produces CIFs (data files compliant with the CIF
syntax and containing data items found in the CIF dictionaries)
directly, and approximately 50% of papers
are submitted to Acta Crystallographica Section C as CIFs. This
facilitates the review process since a paper can be typeset
directly from a CIF
and the data submitted can be verified using a consistent set of
software tools maintained by the journal.
In 1990, the IUCr formed a working group to expand the core dictionary to
include data items relevant to the macromolecular crystallographic experiment
(Fitzgerald et al., 1993).
The macromolecular dictionary (mmCIF), which encompasses many of the
data items from the core dictionary, is in the final stages of
development. Not surprisingly, given the complexity of
biological macromolecules, the dictionary is extensive and contains
over 600 data items unique to macromolecules.
The maintenance of the powder diffraction dictionary, macromolecular
dictionary and core small molecule dictionary is overseen by the IUCr
It was recognized early in the development of mmCIF
that effective use required good software tools. In turn, attempting
to develop good software led to changes in the STAR DDL. This evolutionary
process of both DDL and dictionaries continues under the auspices of the
IUCr in a manner that is upwardly compatible.
At the time of writing there are two draft
DDL dictionaries: DDL 1.3
in which the core dictionary is cast and DDL 2.0.8 in which the
macromolecular dictionary is being cast. Eventually it is expected that
v1.3 will evolve into a single v2.x encompassing v2.0.8.
In parallel a number of other major dictionaries have been developed. At this
time these are all based on DDL 1.3 or earlier. These are the powder
dictionary (CIFdic.p94) and an
Since the specification of dictionaries is intended to be
extensible, it is not surprising that sub-dictionaries have started to
appear. This is an obvious strength of this approach: since the
dictionaries are STAR compliant, basic software for browsing, editing
and parsing a STAR file will immediately work on a new dictionary or
A Few Details on STAR
STAR consists of a simple syntax and a DDL which obeys that syntax;
both are then used to define a STAR dictionary. A STAR dictionary, like a
STAR compliant data file, consists of tag-value pairs. Further, a STAR file
may consist of one
or more data blocks which begin with the declaration data_ .
Each data block
consists of data items, identified by a leading underscore (tag)
followed by a
value for that data item (value). Repetitive data
values may be grouped into a loop_ structure. A data item
can exist only once in a data block. Global values are applied to one or
more data blocks by way of a global_ declaration.
A save frame is
a subset of a data block which may have pointers to other save frames. The
frame starts with a save_framename
statement, where framename is a unique identifying name within the
data block. The save frame is closed with a save_ statement.
A save frame may contain data items and loop structures, but
not other save frames. It may, however, reference
other save frames in the same data block using $framename.
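The structure described above (data blocks, tag-value pairs, loop_ structures and save frames) can be sketched as a small reader in Python. This is a simplified illustration, not a complete STAR parser: it assumes tokens are separated by white space, and it ignores quoted values, comments, nested loops and global_ declarations. The names SAMPLE, is_structural and parse_star are hypothetical, and the uniqueness check for data items is applied separately to each data block and each save frame.

```python
SAMPLE = """
data_example
_title demo
save_frame_a
_partner $frame_b
save_
save_frame_b
_partner $frame_a
save_
loop_
_id _symbol
1 C
2 S
"""


def is_structural(token):
    """True for tokens that open a block, frame or loop, or name a data item."""
    return token == "loop_" or token.startswith(("_", "data_", "save_"))


def parse_star(text):
    """Parse simplified STAR text into nested dictionaries."""
    tokens = text.split()
    blocks = {}
    block = None   # data block currently being filled
    scope = None   # block or save frame that receives the next item
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("data_"):
            block = {"items": {}, "loops": [], "frames": {}}
            blocks[token[len("data_"):]] = block
            scope = block
            i += 1
        elif scope is None:
            raise ValueError("content found before the first data_ declaration")
        elif token == "save_":
            scope = block  # close the current save frame
            i += 1
        elif token.startswith("save_"):
            if scope is not block:
                raise ValueError("save frames cannot be nested")
            scope = {"items": {}, "loops": []}
            block["frames"][token[len("save_"):]] = scope
            i += 1
        elif token == "loop_":
            i += 1
            tags = []
            while i < len(tokens) and tokens[i].startswith("_"):
                tags.append(tokens[i])
                i += 1
            values = []
            while i < len(tokens) and not is_structural(tokens[i]):
                values.append(tokens[i])
                i += 1
            if not tags or len(values) % len(tags):
                raise ValueError("loop values do not fill whole rows")
            rows = [dict(zip(tags, values[j:j + len(tags)]))
                    for j in range(0, len(values), len(tags))]
            scope["loops"].append(rows)
        elif token.startswith("_"):
            if token in scope["items"] or i + 1 >= len(tokens):
                raise ValueError("duplicate or incomplete item " + token)
            scope["items"][token] = tokens[i + 1]
            i += 2
        else:
            raise ValueError("unexpected token " + token)
    return blocks


if __name__ == "__main__":
    print(parse_star(SAMPLE))
```

Note that the $framename references are stored as plain strings here; resolving them to the frames they point at would be a separate step.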
The following is an example of a simple STAR file construct:
# S 2 O 6
# / \ //
# 3/ \//
# | |
# 4 5
1 C 0 2 S 0 3 C 2 4 C 2 5 C 2 6 O 0
1 2 s 2 3 s 3 4 s 4 5 s 5 1 s 1 6 d
1 null . . . 200 200 2 1b black stop_
2 text S yellow 10 150 250 3 1b black stop_
3 null . . . 100 200 4 1b black stop_
4 null . . . 125 150 5 1b black stop_
6 text O blue 10 250 250 1 2b black stop_
A Few Details on CIF
A CIF dictionary or CIF compliant data file uses the STAR syntax with the
following restrictions:
- Only CIF dictionaries use the global declaration;
- Data names cannot be longer than 32 characters;
- Each file record cannot be greater than 80 characters;
- Loops cannot be nested.
The following is an extract from a small molecule CIF data block:
# Sample from a CIF file based upon DDL 1.3
_chemical_compound_source 'tree bark'
s .20200 .79800 .91667 .030(3) Uij ? ? s
o .49800 .49800 .66667 .02520 Uiso ? ? o
c1 .48800 .09600 .03800 .03170 Uiso ? ? c
The above shows three single data items, e.g. _chemical_melting_point,
and a loop structure. There are some items to note here about the
syntax:
- Comments begin with a hash (#) and end at the end of a line.
- Single data values that contain white space can be enclosed either
in single quotes or between semicolons (;), each appearing as the first
character of a line.
- To maintain the integrity of a loop structure, missing values must
still be represented by a question mark (?). Note that a period (.) is used
if the value does not have any meaning within the loop.
- The loop ends when the next tag is encountered.
- The data block is declared by a data_ statement and terminates at the
end of the file or at the next data_ declaration.
To be a valid data file, each tag should appear in a dictionary. So, for
example,
_enumeration_detail d 'determined from diffraction measurements'
calc 'calculated from molecular geometry'
c 'abbreviation for "calc"'
dum 'dummy site with meaningless coordinates'
; A standard code to signal if the site data has been determined by
diffraction data or calculated from the geometry of surrounding
sites, or has been assigned dummy coordinates. The abbreviation
'c' may be used in place of 'calc'.
is the entry for _atom_site_calc_flag.
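A dictionary entry of this kind lets software check a value in a data file without the permitted codes being written into the program logic. The following Python sketch assumes the enumeration codes from the extract above have already been read into a mapping; the names ATOM_SITE_CALC_FLAG and describe_calc_flag are hypothetical. Treating ? and . as acceptable placeholders is an assumption based on the missing-value convention used in loops.

```python
# Enumeration codes and descriptions copied from the dictionary extract
# for _atom_site_calc_flag. In practice they would be read from the
# dictionary file rather than typed in by hand.
ATOM_SITE_CALC_FLAG = {
    "d": "determined from diffraction measurements",
    "calc": "calculated from molecular geometry",
    "c": 'abbreviation for "calc"',
    "dum": "dummy site with meaningless coordinates",
}


def describe_calc_flag(value):
    """Return the description of a permitted code.

    Returns None for the placeholders '?' (missing) and '.' (no meaning),
    and raises ValueError for any code not listed in the enumeration.
    """
    if value in ("?", "."):
        return None
    if value not in ATOM_SITE_CALC_FLAG:
        raise ValueError("code not in enumeration: " + repr(value))
    return ATOM_SITE_CALC_FLAG[value]


if __name__ == "__main__":
    for code in ("d", "calc", "?"):
        print(code, "->", describe_calc_flag(code))
```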
Even without a knowledge of the complete STAR syntax,
it should be apparent that the CIF dictionary is itself CIF compliant with
each data block characterizing a single data item using components of the
Data Definition Language (DDL), themselves described in a separate CIF
compliant dictionary, as illustrated for _enumeration_range.
; The range of values permitted for a defined numerical item.
The construction is 'min:max'. If 'max' is omitted then the
item can have any value greater than or equal to 'min'.
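The 'min:max' construction described in this definition is simple enough to interpret mechanically. The Python sketch below is one possible reading, with hypothetical function names parse_range and in_range. Two points are assumptions rather than statements from the definition: the upper limit is treated as inclusive, and a range with no 'min' is rejected because the definition only describes omitting 'max'.

```python
def parse_range(text):
    """Split a 'min:max' string into (minimum, maximum).

    The maximum is None when 'max' is omitted, meaning "no upper limit".
    """
    low, separator, high = text.partition(":")
    if not separator or not low:
        raise ValueError("expected 'min:max' or 'min:', got " + repr(text))
    return float(low), (float(high) if high else None)


def in_range(value, text):
    """Check a numerical value against a 'min:max' range string.

    Assumption: both limits are inclusive.
    """
    low, high = parse_range(text)
    return value >= low and (high is None or value <= high)


if __name__ == "__main__":
    print(parse_range("0.0:1.0"))
    print(parse_range("0.0:"))
    print(in_range(31.2, "0.0:"))
```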
Close examination reveals that the DDL is a compromise between the
desires of the scientist to define characteristics such
as _definition, useful in defining the content of the
data, and _list_reference (analogous to a primary key), which defines
the relationship between items of data and is useful to programmers.
It would not be a gentle introduction to discuss these issues
in detail. But to get the flavor consider the following:
Content and Context
The descriptor _enumeration_*, which is a shorthand notation
for _enumeration, _enumeration_detail, _enumeration_example,
and _enumeration_range, provides useful information about a
data item to be used by both a crystallographer reading the
dictionary and a programmer trying to check whether a data
item present in a data file is valid.
The _category on the other hand is useful to the programmer
since it logically groups together data items. This information
might be obvious to a crystallographer, but previously such a
relationship would need to be written into a computer program.
Now the opportunity exists to read this information from the
dictionary and not hard code it.
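Reading the grouping from the dictionary rather than hard coding it can be sketched as follows in Python. The sketch assumes the dictionary has already been parsed into a mapping from data item names to their DDL attributes; the DEFINITIONS contents and the function name group_by_category are illustrative assumptions, not an extract from a real dictionary.

```python
from collections import defaultdict

# Illustrative stand-in for parsed dictionary definitions: each data item
# name maps to its DDL attributes, of which only _category is used here.
DEFINITIONS = {
    "_atom_site_calc_flag": {"_category": "atom_site"},
    "_atom_site_label": {"_category": "atom_site"},
    "_chemical_melting_point": {"_category": "chemical"},
    "_chemical_compound_source": {"_category": "chemical"},
}


def group_by_category(definitions):
    """Group data item names by the _category declared in the dictionary."""
    groups = defaultdict(list)
    for name, attributes in definitions.items():
        groups[attributes["_category"]].append(name)
    return dict(groups)


if __name__ == "__main__":
    for category, names in group_by_category(DEFINITIONS).items():
        print(category, names)
```

Because the grouping comes from the definitions themselves, adding a new item to the dictionary changes the result without any change to the program.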
Where to Now?
We have used the CIF dictionary to illustrate the basic features of
STAR and, hopefully, to get across the point that such dictionaries
help to facilitate data exchange and let scientists talk about
an item of data, each knowing exactly what is meant.
The picture is more complex (yet richer) when considering the
implications of DDL version 2.0.
The suggested next step is to further explore the ideas behind STAR
by working through the discipline with which you are most familiar.