The disciplines that have developed dictionaries, and the common acronyms by which they are known, are as follows (some of this information is taken from the IUCr home page).
The rationale behind CIF, components of STAR and the basic construct of a STAR dictionary and file are given on this page.
The need for a low-level data representation has been recognized both in other disciplines (American Petroleum Institute, 1992; IEEE, 1994; Pfaltz et al., 1992) and in molecular biology, through the use of ASN.1 (Ostell, 1990) to represent primary sequence and other forms of data.
Using STAR it is possible to define a Data Definition Language (DDL) and then use that language to define a dictionary of the terms that you wish to be common to everyone in the sub-discipline. Data files can then be created that contain terms used in the dictionaries. This scheme is depicted below.
In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to the macromolecular crystallographic experiment (Fitzgerald et al., 1993). The macromolecular dictionary (mmCIF), which encompasses many of the data items from the core dictionary, is in the final stages of development. Not surprisingly, given the complexity of biological macromolecules, the dictionary is extensive and contains over 600 data items unique to macromolecules.
The maintenance of the powder diffraction dictionary, the macromolecular dictionary and the core small-molecule dictionary is overseen by the IUCr-appointed COMCIFS committee.
It was recognized early in the development of mmCIF that effective use required good software tools. In turn, attempting to develop good software led to changes in the STAR DDL. This evolutionary process of both DDL and dictionaries continues under the auspices of the IUCr in a manner that is upwardly compatible.
At the time of writing there are two draft DDL dictionaries: DDL 1.3 in which the core dictionary is cast and DDL 2.0.8 in which the macromolecular dictionary is being cast. Eventually it is expected that v1.3 will evolve into a single v2.x encompassing v2.0.8.
In parallel, a number of other major dictionaries have been developed. At this time these are all based on DDL 1.3 or earlier. These are the powder dictionary (CIFdic.p94) and an NMR dictionary. Since the specification of dictionaries is intended to be extensible, it is not surprising that sub-dictionaries have started to appear. This is an obvious strength of the approach: because the dictionaries are STAR compliant, basic software for browsing, editing and parsing a STAR file will immediately work on a new dictionary or data file.
The following is an example of a simple STAR file construct:
# S 2 O 6
# / \ //
# 3/ \//
# | |
# |_____|
# 4 5
#
data_Ex0_1
_define_object_class molecule
_molecule_name_common thiobutyrolactone
loop_
_atom_id
_atom_type
_atom_attach_h
1 C 0
2 S 0
3 C 2
4 C 2
5 C 2
6 O 0
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
1 2 s
2 3 s
3 4 s
4 5 s
5 1 s
1 6 d
_display_scale 50
_display_span_x 150
_display_span_y 100
loop_
_display_id
_display_object
_display_symbol
_display_colour
_display_size
_display_coord_x
_display_coord_y
loop_
_display_conn_id
_display_conn_symbol
_display_conn_colour
1 null . . . 200 200 2 1b black stop_
2 text S yellow 10 150 250 3 1b black stop_
3 null . . . 100 200 4 1b black stop_
4 null . . . 125 150 5 1b black stop_
6 text O blue 10 250 250 1 2b black stop_
data_P6122
#   Sample from a CIF file based upon DDL 1.3
#
_chemical_name_systematic
; trans-3-Benzoyl-2-(tert-butyl)-4-(iso-butyl)-1,3-oxazolidin-5-one
;
_chemical_compound_source   'tree bark'
_chemical_melting_point     56
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_thermal_displace_type
_atom_site_calc_flag
_atom_site_calc_attached_atom
_atom_site_type_symbol
s   .20200 .79800 .91667 .030(3) Uij  ? ? s
o   .49800 .49800 .66667 .02520  Uiso ? ? o
c1  .48800 .09600 .03800 .03170  Uiso ? ? c
The above shows three single data items, e.g. _chemical_melting_point, and a loop structure. There are some items to note here about the STAR syntax.
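As a rough illustration of how little machinery the single-item and loop_ constructs need, the following Python sketch reads a trimmed copy of the CIF sample and groups the loop values into rows. It is a minimal sketch, not a conforming STAR or CIF parser: the names `tokenize` and `parse_simple_cif` are hypothetical, and the code assumes a single data block, at most one loop_, no semicolon-delimited text fields, no nested loops, and no '#' character inside a value.

```python
import re

# Trimmed copy of the CIF sample (fewer columns, text field omitted).
CIF_TEXT = """\
data_P6122
#   Sample from a CIF file based upon DDL 1.3
_chemical_compound_source   'tree bark'
_chemical_melting_point     56
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
s  .20200 .79800 .91667
o  .49800 .49800 .66667
c1 .48800 .09600 .03800
"""

# A token is a quoted string or a run of non-blank characters.
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\S+)""")


def tokenize(text):
    """Split text into tokens, dropping '#' comments (naively)."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # sketch assumption: no '#' inside a value
        for match in TOKEN.finditer(line):
            # Exactly one of the three groups matched; keep that one.
            tokens.append(next(g for g in match.groups() if g is not None))
    return tokens


def parse_simple_cif(text):
    """Return (block_name, single_items, loop_rows) for a small CIF fragment."""
    tokens = tokenize(text)
    block, items, loop_names, loop_values = None, {}, [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("data_"):
            block = tok[len("data_"):]
            i += 1
        elif tok == "loop_":
            i += 1
            # First the list of data names ...
            while i < len(tokens) and tokens[i].startswith("_"):
                loop_names.append(tokens[i])
                i += 1
            # ... then the values, until the next name or keyword.
            while i < len(tokens) and not tokens[i].startswith(("_", "data_", "loop_")):
                loop_values.append(tokens[i])
                i += 1
        elif tok.startswith("_"):
            if i + 1 >= len(tokens):
                raise ValueError(f"data name {tok} has no value")
            items[tok] = tokens[i + 1]  # a single name/value pair
            i += 2
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    width = len(loop_names)
    if width and len(loop_values) % width:
        raise ValueError("loop values do not fill a whole number of rows")
    rows = []
    if width:
        rows = [dict(zip(loop_names, loop_values[k:k + width]))
                for k in range(0, len(loop_values), width)]
    return block, items, rows


block, items, rows = parse_simple_cif(CIF_TEXT)
print(block, items)
for row in rows:
    print(row)
```

All values are kept as strings here; deciding that .030(3) is a number with a standard uncertainty is exactly the kind of knowledge that lives in the dictionary rather than in the file syntax.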
data_atom_site_calc_flag
_name '_atom_site_calc_flag'
_category atom_site
_type char
_list yes
_list_reference '_atom_site_label'
loop_ _enumeration
      _enumeration_detail
        d     'determined from diffraction measurements'
        calc  'calculated from molecular geometry'
        c     'abbreviation for "calc"'
        dum   'dummy site with meaningless coordinates'
_enumeration_default d
_definition
; A standard code to signal if the site data has been determined by
diffraction data or calculated from the geometry of surrounding
sites, or has been assigned dummy coordinates. The abbreviation
'c' may be used in place of 'calc'.
;
The above is the dictionary entry for _atom_site_calc_flag.
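To suggest how software might use such an entry, the sketch below checks a value of _atom_site_calc_flag against the enumeration. The Python dict is transcribed by hand from the entry; its layout and the function name `check_value` are assumptions made for illustration and do not come from any CIF toolkit. How '?' and an absent item are treated is also an assumption of this sketch.

```python
# Hand-transcribed from the dictionary entry for _atom_site_calc_flag.
# The dict layout is hypothetical, chosen only for this illustration.
ATOM_SITE_CALC_FLAG = {
    "name": "_atom_site_calc_flag",
    "category": "atom_site",
    "type": "char",
    "enumeration": {
        "d": "determined from diffraction measurements",
        "calc": "calculated from molecular geometry",
        "c": 'abbreviation for "calc"',
        "dum": "dummy site with meaningless coordinates",
    },
    "enumeration_default": "d",
}


def check_value(value, definition):
    """Return the effective value of an enumerated item, or raise ValueError.

    Sketch assumptions: None means the item is absent from the data file and
    the default applies; '?' marks an unknown value and is passed through.
    """
    if value is None:
        return definition["enumeration_default"]
    if value == "?":
        return value
    if value not in definition["enumeration"]:
        allowed = ", ".join(definition["enumeration"])
        raise ValueError(
            f"{definition['name']}: {value!r} is not one of: {allowed}"
        )
    return value


print(check_value("calc", ATOM_SITE_CALC_FLAG))
print(check_value(None, ATOM_SITE_CALC_FLAG))
```

The point of the design is that the checking code knows nothing about atoms or diffraction; everything specific to the item comes from the dictionary entry.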
Even without knowledge of the complete STAR syntax, it should be apparent that the CIF dictionary is itself CIF compliant. Each data block characterizes a single data item using components of the Data Definition Language (DDL), which are themselves described in a separate CIF-compliant dictionary, as illustrated here for _enumeration_range.
data_enumeration_range
_name '_enumeration_range'
_category enumeration
_type char
_definition
; The range of values permitted for a defined numerical item.
The construction is 'min:max'. If 'max' is omitted then the
item can have any value greater than or equal to 'min'.
;
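The 'min:max' construction described in this definition can be read with a few lines of code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the upper bound is treated as inclusive (the definition only spells this out for 'min'), and a missing 'min' is rejected because the definition quoted here does not cover that case.

```python
def parse_enumeration_range(spec):
    """Split a 'min:max' string into (low, high); high is None if omitted."""
    low_text, sep, high_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected 'min:max', got {spec!r}")
    if not low_text.strip():
        # Sketch assumption: 'min' must be present.
        raise ValueError(f"range {spec!r} has no minimum")
    low = float(low_text)
    high = float(high_text) if high_text.strip() else None
    return low, high


def in_range(value, spec):
    """True if value satisfies the range; an omitted max means no upper limit."""
    low, high = parse_enumeration_range(spec)
    # Sketch assumption: the upper bound, when given, is inclusive.
    return value >= low and (high is None or value <= high)


print(in_range(0.5, "0.0:1.0"))
print(in_range(250.0, "0.0:"))
```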
Close examination reveals that the DDL is a compromise between the
desires of the scientist to define characteristics such
as _definition, useful in defining the content of the
data, and _list_reference (analogous to a primary key) which defines
the relationship between items of data and is useful to programmers.
It would not be a gentle introduction to discuss these issues in detail. But to get the flavor consider the following:
The picture is more complex (yet richer) when considering the implications of DDL version 2.0. The suggested next step is to further explore the ideas behind STAR by working through the discipline with which you are most familiar.