Tutorial: Writing a STAR/CIF Compliant Dictionary using DDL 2.x

This tutorial is based on experience in developing a dictionary to describe protein and DNA sequences at the level of detail found in a GenBank entry.

Last Update Oct. 30, 1997 Phil Bourne San Diego Supercomputer Center (SDSC) 


  • Introduction
  • Motivation
  • The Example Dictionary
  • Religious Wars
  • What are STAR, DDL, and CIF?
  • Writing the Sequence Dictionary
  • The Complete Dictionary
  • Parsing Your Dictionary
  • Acknowledgments
  • List of Figures and Tables
    • Figure 1 Relationship Between Datablocks and Dictionary Categories
    • Figure 2 Relationship Between Categories and Category Groups
    • Figure 3 Relationship Between Data Items
    • Table 1 Category Groups and Categories in the Sequence Dictionary


    This tutorial discusses the creation of a data dictionary that is compliant with the Self-defining Text Archival and Retrieval (STAR) language specification (of which the Crystallographic Information File, CIF, is a subset). The dictionary is based upon Dictionary Definition Language (DDL) version 2.x, which was developed to describe the macromolecular Crystallographic Information File (mmCIF) dictionary. The mmCIF dictionary describes in great detail a structure determined by X-ray crystallography or NMR, and the X-ray crystallographic experiment. These various terms are explained subsequently.

    The data dictionary we will use as an example describes DNA and protein sequence data at the level of detail found in a GenBank entry described in FASTA format. The dictionary can be viewed in its entirety at http://www.sdsc.edu/pb/cif/write_dic/sequence970904.txt or via a series of Web pages created using StarHtml. This dictionary was chosen because it covers many features of STAR and the dictionary definition language, yet is relatively simple with respect to content.


    Why develop a data dictionary?
    1. Annotation. If two researchers are talking about a sequence feature table, are they really talking about the same thing? An exact description of what constitutes a sequence feature table is needed. In other words, a dictionary definition for each and every item of data is needed. Depending on the item of data, that implies: a definition; how that item of data relates to other items of data; enumeration of possible values; data type; and so on. Good annotation is of great benefit to researchers who are not familiar with the data, yet wish to use it without misinterpretation.
    2. Support for Databases. There is so much data that the only way to use it effectively is to place it in a database. To avoid losing data through a parser's failure to interpret parts of it when loading the database, the data must be defined rigorously.
    Why should the dictionary be compliant with the STAR and CIF specifications?
    1. The STAR/CIF approach yields a dictionary that is extensible; i.e., a dictionary that can accommodate new items of data as a discipline evolves.
    2. STAR/CIF has already been used to develop scientific dictionaries, notably those for small-molecule and macromolecular crystallography. Therefore, it may be possible to use definitions already written by others. This may seem unimportant at first, but as you begin to develop definitions and see the time and effort involved, work already done becomes more valuable. Consider something as complex as a macromolecular structure. It took several years to finalize a set of definitions which encapsulated the many types of macromolecular structure and their associated features. Over time it may be this description that becomes the most valuable feature of the dictionary.
    3. STAR/CIF permits the inclusion of methods (i.e., software) which are associated with particular items of data. While this is not used widely at present (see Biggs et al., 1997 for an example), it presents some interesting possibilities. At present many programmers write code which interprets the meaning of data in a data file. Those interpretations can differ, leading to inconsistent use of the same item of data. The definitions in the dictionary reduce these inconsistencies, but the availability of software within the dictionary (or at least associated with the dictionary) provides even greater consistency than the definitions alone. In our sequence dictionary we shall see how to include code which, for example, parses specific data items in FASTA format.
    4. Software exists already for performing basic data manipulation of STAR/CIF files.
    5. The STAR/CIF specification is well documented and supported by the community.
    6. STAR/CIF is supported by the International Union of Crystallography (IUCr).

    The Example Dictionary

    We needed to develop a database to support data being collected on the protein kinase family of enzymes (http://www.sdsc.edu/kinases). These data are extensive and cover sequence, structure, enzymology, function, protocols, and so on. Based on experience in developing other databases we realized that a complete description of each item of data to be included was critical. This description should include the attributes (classification, type, enumerated values, etc.) of each item of data and its relationship to other items (formally referred to as a schema). Since we were familiar with STAR/CIF we chose it to form the underlying data representation. We took particular care that data items be as generic as possible, that is, applicable to any protein family and not specific to protein kinases. The task of defining the data items is daunting, so we divided it up into items related to sequence, structure (mostly taken from the mmCIF dictionary), enzymology, function, and global (applicable to all types of data items). For each set of data items we solicited help from experts in that area. It is the sequence dictionary developed by Mike Gribskov and Stella Veretnik which is used in this tutorial.

    Religious Wars

    Why STAR/CIF, what about other data representations?

    There are other data representations that could form the basis for casting a dictionary, notably ASN.1. Certainly ASN.1 has some positive features:

    1. It is an international standard (ISO/IEC and ITU-T) and is not defined by one domain of users, as STAR/CIF is.
    2. There are tools available for use with ASN.1.
    3. It is used extensively by the NCBI; i.e., it works well.
    4. It is a simple yet rigorous data description.
    5. MMDB developed at NCBI has a set of definitions defining macromolecular structure. This includes generally applicable chemical graphs, but lacks a description of the experiment.
    Trying to decide which data representation to use can take on a religious fervour. It is not our intention to endorse one approach over another, but rather to provide a balanced view of what is available. Familiarity with STAR/CIF and the number of data items already defined for the mmCIF dictionary made STAR/CIF a good choice.

    A critical issue is that at some future date we could convert to an ASN.1 (or other) representation and back to STAR/CIF without loss of information. Based on this criterion we believe STAR/CIF and ASN.1 are both suitable candidates for casting a dictionary.

    What are STAR, DDL, and CIF?

    A detailed answer to this question can be found in a series of papers listed at http://ndb.sdsc.edu/NDB/mmcif/references/index.html. An abbreviated answer is given here.

    STAR: A General Language for Describing Information

    Self-defining Text Archival and Retrieval (STAR) is a set of simple syntax rules for defining: a dictionary definition language (DDL), dictionaries based upon the DDL, and data files containing the data items defined in the dictionaries. Thus DDL, dictionaries, and data files are in the same format and are interpretable by the same basic parsers. We will meet these syntax rules as we go along, but here are some basics. An alternative description of STAR syntax is available at http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/syntax.htm. A detailed discussion of the STAR syntax can be found in

    Dictionary Definition Languages (DDLs) Based upon STAR

    We can now use these simple STAR rules to define a language upon which to base our dictionaries. Consider the analogy to a Webster's English Dictionary. The dictionary only uses the English language (cf. STAR rules) and each word is presented in the dictionary alphabetically, followed by a pronunciation and then a definition. A formal definition of how the dictionary is laid out is analogous to the DDL. To date, two such presentation formats have been developed within the crystallography community. These are referred to as DDL v1.x and DDL v2.x, where x represents different versions of the major format.

    DDL v1.x is simple but not rigorous. For example, using DDL 1.x the name given to the data item implies a hierarchy of parts, for example atom_site_Cartn_x. But what is the hierarchy? A crystallographer would know that the logical hierarchy is that Cartn_x is one member of a class of data items called atom_site; however, there is no way for a parser to determine that fact automatically. DDL 2.x addresses this problem by referencing data names as atom_site.Cartn_x. That is, a period separates the category of data items (atom_site) from the members of the category (e.g., Cartn_x). Even this requires inference from the data name and is not rigorous. Hence, as we shall see subsequently, the category to which a data item belongs is defined explicitly as part of the definition of that data item.

    DDL 1.x is currently used to describe small molecule crystallographic data; the core dictionary is expressed using DDL 1.x. DDL v2.x is used to express the macromolecular crystallographic dictionary. DDL v2.x is backward compatible with v1.x. Thus, any data item in a DDL 2.x dictionary which has a DDL 1.x counterpart includes an alias to that data item. Consider our example of an x coordinate as it appears in the macromolecular dictionary:
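
    The naming difference can be made concrete with a small sketch. The helper below is hypothetical (it is not part of any STAR/CIF toolkit) and only shows that a DDL 2.x name can be split mechanically at the period, whereas a DDL 1.x name offers no such split point.

```python
def split_data_name(name):
    """Split a DDL 2.x style data name into (category, item).

    Hypothetical helper for illustration only. As the text notes, a real
    parser should take the category from the item's dictionary definition
    rather than infer it from the name.
    """
    if not name.startswith("_") or "." not in name:
        raise ValueError(f"not a DDL 2.x style data name: {name!r}")
    category, item = name[1:].split(".", 1)
    return category, item


print(split_data_name("_atom_site.Cartn_x"))

# A DDL 1.x name has only underscores, so the category boundary is ambiguous
# (atom / site_Cartn_x? atom_site / Cartn_x?) and the helper refuses it.
try:
    split_data_name("_atom_site_Cartn_x")
except ValueError as err:
    print(err)
```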

        save__atom_site.Cartn_x
        # ... other descriptors omitted ...
        _item_aliases.alias_name    '_atom_site_Cartn_x'
        _item_aliases.dictionary      cif_core.dic
        _item_aliases.version         2.0.1
        save_

    The data name is declared to be the name of the save frame (save_ followed by _atom_site.Cartn_x, hence two underscores concatenated together). On the left are data names defined by the DDL and the values are on the right. This is only a partial description of that data item. We will get to the full description subsequently. This example defines in the macromolecular dictionary a data item _atom_site.Cartn_x which has a counterpart called _atom_site_Cartn_x located in version 2.0.1 of the core dictionary cif_core.dic.

    The alias mechanism is not used in the sequence dictionary that we are about to develop, since none of the data items have been defined in other DDL v1.x dictionaries. It is introduced here to highlight differences between dictionaries cast in DDL 1.x and DDL 2.x.

    In summary, DDL 2.x has all the capabilities of DDL 1.x and more.  Software tools exist that work with files described by both DDLs, with just DDL v1.x, or with just DDL v2.x.  Our dictionary development will be based on DDL v2.x.

    A full description of the DDL can be found at http://ndb.sdsc.edu/NDB/mmcif/ddl/index.html in various formats. They have been prepared by John Westbrook, the original author of  DDL v2.x. There is also a DDL tutorial at http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/dict-struct.htm

    You will need a working knowledge of the DDL to build your dictionary and it would be worthwhile reading at least the DDL tutorial at this point. What follows is an application of the material described in the tutorial.

    Crystallographic Information File (CIF)

    CIF only uses a subset of the STAR syntax for all the crystallographic dictionaries and associated data files. We have chosen to stick with it for the sequence dictionary since it provides maximum usage of existing software. However, some of these restrictions make the representation cumbersome. Important restrictions are:

    Where Do the Data Fit In?

    Where do the data fit in to this STAR/CIF model? At the conclusion of this tutorial we will have a dictionary which contains a set of data items for describing sequences. Our data files (one or more data blocks per file) contain only the names of the data items and the values of those data items, either looped or as single name-value pairs. The data names correspond only to names found in the dictionary. Several tools exist  for validating a data file against a data dictionary. The same tools that parse the dictionary should be able to parse a data file. To see what a macromolecular CIF (mmCIF) data file looks like, go to moose, select a structure using the text option, and then select mmCIF(ASCII) as the display option. This will show your favorite PDB entry as a STAR/CIF compliant mmCIF file.
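
    To make the shape of a data file concrete, here is a minimal Python sketch that reads single name-value pairs and single-level loops. It is an illustration under stated assumptions, not one of the existing STAR/CIF parsers: it ignores semicolon-delimited text fields and save frames, relies on shlex for quoting, and the block name, the value S1, and the data names _citation_author.ordinal and _citation_author.name are assumed for the example.

```python
import shlex

SAMPLE = """
data_example                     # hypothetical data block name
_sequence.id     S1              # hypothetical value
_sequence.type   PROTEIN
loop_
_citation_author.citation_id
_citation_author.ordinal
_citation_author.name
primary  1  'Fitzgerald, P.M.D.'
primary  2  'McKeever, B.M.'
"""


def parse_datablock(text):
    """Return (items, loops) for a tiny subset of STAR/CIF syntax."""
    tokens = shlex.split(text, comments=True)
    items, loops = {}, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("data_"):
            i += 1                                   # data block header
        elif tok == "loop_":
            i += 1
            names = []
            while i < len(tokens) and tokens[i].startswith("_"):
                names.append(tokens[i])
                i += 1
            values = []
            while i < len(tokens) and not tokens[i].startswith(("_", "loop_", "data_")):
                values.append(tokens[i])
                i += 1
            if not names or len(values) % len(names):
                raise ValueError("loop values do not fill whole rows")
            loops.append([dict(zip(names, values[k:k + len(names)]))
                          for k in range(0, len(values), len(names))])
        elif tok.startswith("_"):
            if i + 1 >= len(tokens):
                raise ValueError(f"data name without a value: {tok}")
            items[tok] = tokens[i + 1]               # single name-value pair
            i += 2
        else:
            raise ValueError(f"unexpected token: {tok}")
    return items, loops


items, loops = parse_datablock(SAMPLE)
print(items)
print(loops[0])
```

    Because dictionaries and data files share the same syntax, the same reading loop would in principle be the starting point for a dictionary parser once save frames and text fields are added.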

    Characteristics of the Dictionary

    This section highlights the various characteristics of the dictionary, which are illustrated using segments of the sequence dictionary. In the following section we put these segments together to complete the dictionary.


    The dictionary description and the data items defined in the dictionary are organized into dictionary, category_group, category, sub_category, and finally the data items themselves (item). Each of these levels of organization is described subsequently, starting with the dictionary description.


    Figure 1 Datablock and Dictionary Relationships

    There are two tables (categories) describing the dictionary itself, dictionary and dictionary_history. They in turn have a pointer to a category datablock which describes the datablock itself, which in turn has a pointer to datablock_methods defining methods associated with the datablock. In the sequence dictionary this is written as:

    This introduction to the dictionary introduces a number of features, some of syntax and some of style:

    Category Groups

    The next level of organization is category groups: categories that it makes sense to group together. The organization of category groups, categories, and sub-categories is shown in Figure 2.
    Figure 2 Relationship Between Categories and Category Groups

    We begin with category groups. In the sequence dictionary there are three category groups: inclusive_group, general_group and sequence_group. They are declared by the DDL _category_group_list (itself a category), which identifies the name of the category group, the parent category group from which it was derived, and a description of the category group.

    The inclusive_group, as the name suggests, is the parent of the other category_groups.
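
    The parent relationship between category groups can be pictured as a small tree. The sketch below is illustrative only: the mapping mirrors the three groups named above, and the function name lineage is hypothetical.

```python
# Category group id -> parent group id (None marks the root of the tree).
CATEGORY_GROUPS = {
    "inclusive_group": None,
    "general_group": "inclusive_group",
    "sequence_group": "inclusive_group",
}


def lineage(group_id, groups=CATEGORY_GROUPS):
    """Return the chain from group_id up to the root group."""
    chain = [group_id]
    while groups[chain[-1]] is not None:
        parent = groups[chain[-1]]
        if parent not in groups:
            raise KeyError(f"undeclared parent group: {parent}")
        if parent in chain:
            raise ValueError(f"cycle in category groups at: {parent}")
        chain.append(parent)
    return chain


print(lineage("sequence_group"))
```

    A check of this kind catches a group whose declared parent was never itself declared.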


    Within each category group are one or more categories of data items. Here is an example of expressing a category: The category description is contained within a save frame that begins and ends with save_. Since the name of the category starts with an underscore, there is a double underscore at the beginning, i.e., save__sequence. The category uses new DDL names that we have yet to encounter.


    A sub_category is used to associate a procedure with a subset of data items. A sub_category classification is not used in the sequence dictionary described here, nor is it fully exploited in the macromolecular CIF dictionary, hence only a brief introduction to how it might be used is given here. First a subcategory is declared.
      The methods associated with this subcategory are declared in a sub_category_methods loop. The specific method then appears in the method_list, which is defined subsequently.

    Data Items

    The final level of the organization is the individual data items, which have a number of DDL descriptors as shown in Figure 3. Only a subset is used in describing most data items.
    Figure 3 Relationship Between Items

    Here is a simple data item:


    Expression of Relationships Between Data Items

    It is necessary to express a variety of relationships between data items and categories when using DDL 2.x. Here, we describe common relationships that are found in our example sequence dictionary.
    Parent-child relationships are used to relate tables (loops) where instances of a data item have the same value and occur in different categories. This is necessary because of the restriction resulting from single-level looping (see above). That is, a nested loop, such as the list of authors associated with a single publication, which is in turn part of a list of publications, appears as two separate lists (categories): one for the list of publications and one for the list of authors. A common id links the two; that is, each author has an associated citation id which matches the id of the citation. The _citation.id is the parent, the _citation_author.citation_id the child. Parent-child is a somewhat misleading description. It is simply a bi-directional link between two items of data. The one from which the link is derived is considered the parent and the associated data item the child. Provided one direction of the link is present, the other is inferred. Here is an example citation from a data file, followed by the dictionary description of the associated data items. We begin with the citation category, which loops over a series of citations identified by _citation.id.
               primary  yes
             ; Crystallographic analysis of a complex between human immunodeficiency
               virus type 1 protease and acetyl-pepstatin at 2.0-Angstroms resolution.
             ;
               US  'J. Biol. Chem.'  265  .  14209  14219  1990
               HBCHA3  0021-9258  071  .  .  .
             ; The publication that directly relates to this coordinate set.
             ;
               2  no
             ; Three-dimensional structure of aspartyl-protease from human
               immunodeficiency virus HIV-1.
             ;
               UK  'Nature'  337  .  615  619  1989
               NATUAS  0028-0836  006  .  .  .
             ; Determination of the structure of the unliganded enzyme.
             ;

    We now have a list of authors for all citations described above.

               primary  1  'Fitzgerald, P.M.D.'
               primary  2  'McKeever, B.M.'
               primary  3  'Van Middlesworth, J.F.'
               primary  4  'Springer, J.P.'
               2  5  'Heimbach, J.C.'
               2  6  'Leu, C.-T.'
               2  7  'Herber, W.K.'
               2  8  'Dixon, R.A.F.'
               2  9  'Darke, P.L.'

    Consider how the single data item _citation.id is described in the dictionary.

    Note: since the values of some of the data names shown here are in fact themselves data names, they are placed in single quotes to ensure literal interpretation by the parser.

    The _item_linked DDL category defines this parent-child relationship. That is, the data item _citation.id is declared to be the parent of _citation_author.citation_id, as well as two  other data items. This declaration is not made in the definition of the child data item _citation_author.citation_id (an issue of style).
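
    A loader can use such a parent-child declaration to check that every child value points at an existing parent. The following Python sketch assumes the two loops have already been read into lists of rows; the row keys, the function name orphaned_children, and the deliberately broken third author row are all made up for the illustration.

```python
# Parent category rows (citation) and child category rows (citation_author),
# using values from the example data file above plus one broken row.
citations = [{"id": "primary"}, {"id": "2"}]
citation_authors = [
    {"citation_id": "primary", "ordinal": 1, "name": "Fitzgerald, P.M.D."},
    {"citation_id": "2", "ordinal": 5, "name": "Heimbach, J.C."},
    {"citation_id": "3", "ordinal": 10, "name": "Hypothetical, A.N."},  # no such citation
]


def orphaned_children(parent_rows, parent_key, child_rows, child_key):
    """Return child rows whose link value matches no parent row."""
    parent_values = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in parent_values]


for row in orphaned_children(citations, "id", citation_authors, "citation_id"):
    print("no parent citation for author row:", row)
```

    Only the third author row is reported, since its citation id matches no _citation.id.
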
    Dependent item relationships are important in some instances, although not used in the sequence
    dictionary. These relationships declare that for a given data item in a category, certain other
    data items must be present to make that category valid. So for example, consider the
    mmCIF atom_site category. The data item _atom_site.Cartn_x is only valid if
    _atom_site.Cartn_y and _atom_site.Cartn_z are present.

    The _item_dependent.dependent_name defines other data items that must be present in the category to make this data item valid. There is no inference here. That is, even though _atom_site.Cartn_x declares a dependency on _atom_site.Cartn_y, this does not imply that y is dependent on x unless that is stated explicitly in the definition of _atom_site.Cartn_y.
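
    The one-directional nature of the dependency can be sketched as follows. The table below encodes only the dependency stated in the text for _atom_site.Cartn_x; the function name missing_dependents is hypothetical, and a real validator would read the table from the dictionary.

```python
# Item name -> items that must also be present (declared on that item only).
DEPENDENTS = {
    "_atom_site.Cartn_x": ["_atom_site.Cartn_y", "_atom_site.Cartn_z"],
}


def missing_dependents(present_names, dependents=DEPENDENTS):
    """Map each present item to the dependent items that are absent."""
    present = set(present_names)
    problems = {}
    for name in present_names:
        lacking = [dep for dep in dependents.get(name, []) if dep not in present]
        if lacking:
            problems[name] = lacking
    return problems


# x is present without z, so x is reported.
print(missing_dependents(["_atom_site.Cartn_x", "_atom_site.Cartn_y"]))
# y alone is accepted, because y declares no dependency of its own here.
print(missing_dependents(["_atom_site.Cartn_y"]))
```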

    Association of Methods with Data

    It is possible to associate a specific method (code that operates on the data item) with blocks, categories, subcategories and data items. You can imagine using this for data conversion, data validation, etc. Here, we wish to define Perl code that parses information from a FASTA sequence file and is used to create a STAR/CIF sequence data file. First a list of methods is defined; for the sake of abbreviation only the function name is included here. The _method_list category identifies and describes each available method. To apply a method to, for example, a specific data item, _item_methods.method_id should be included in the description of that data item.

    For a novel use of associating code with data items see Bourne, Biggs and Pu, 1997, ISMB 52-55. There we include in the dictionary pointers for each category. The pointer indicates the relevant module (encapsulated Perl code) that processes that category of data item in the conversion from PDB to CIF format. At run time the converter is generated by reading all the conversion modules. This facilitates code maintenance.

    The advantage of both approaches is that (in principle) all users use the same code. This avoids the situation where hundreds of programmers implement their own mutually inconsistent interpretations of, for example, how to convert a particular item of data in a PDB file to its STAR/CIF data item counterpart.

    Enumeration of Allowed Values

    Some data items can only have a limited set of values. In these cases those values may be enumerated within the data item definition. Here is an example from the sequence dictionary which indicates that a _sequence.type can only have the value PROTEIN, DNA, or RNA. _item_enumeration.value lists the values that this data item can assume and _item_enumeration.detail gives more detail about each enumerated value.
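
    Once the enumeration is available to software, checking a value is a simple set lookup. This is a minimal sketch: the three allowed values come from the text, while the table layout and the function name check_enumeration are assumptions made for the example.

```python
# Data name -> set of allowed values, as would be read from _item_enumeration.value.
ENUMERATIONS = {
    "_sequence.type": {"PROTEIN", "DNA", "RNA"},
}


def check_enumeration(name, value, enumerations=ENUMERATIONS):
    """Return True if value is allowed for name, or if name is not enumerated."""
    allowed = enumerations.get(name)
    return allowed is None or value in allowed


print(check_enumeration("_sequence.type", "PROTEIN"))
print(check_enumeration("_sequence.type", "LIPID"))
```

    Note that the comparison here is case sensitive; whether a real validator should fold case is a design decision this sketch does not settle.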

    Definition of Units and Unit Conversions

    Many data items have units associated with them. It is important to specify these units in the dictionary, and it is helpful to include methods of converting the dictionary-defined units to other commonly used units. Here is an example for expressing a molecular weight in Daltons. Specific unit descriptions must also appear in an _item_units_list. The _item_units_conversion can be used to provide conversion factors between unit types. It is not described here since it is not used in the sequence dictionary.
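
    Although unit conversion is not used in the sequence dictionary, the idea can be sketched briefly. The table below assumes each conversion is recorded as a source unit code, a target unit code, an arithmetic operator, and a factor; the unit codes daltons and kilodaltons and the function name convert are hypothetical.

```python
import operator

# Arithmetic operators a conversion entry may name.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# (from_code, to_code) -> (operator symbol, factor); hypothetical unit codes.
CONVERSIONS = {
    ("daltons", "kilodaltons"): ("/", 1000.0),
    ("kilodaltons", "daltons"): ("*", 1000.0),
}


def convert(value, from_code, to_code, conversions=CONVERSIONS):
    """Convert value between two unit codes using the conversion table."""
    if from_code == to_code:
        return value
    symbol, factor = conversions[(from_code, to_code)]
    return OPERATORS[symbol](value, factor)


print(convert(64500.0, "daltons", "kilodaltons"))
```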

    Data Types

    The data type indicates whether a data item is a real number, an integer, a string, or some other predefined data type. Data typing is important in validating data in STAR/CIF files when, for example, loading the data into a database. The loader will not be able to determine whether the residue name at a given position is correct, but it will be able to check the data type: if it is given as a floating point number rather than a character string then something is wrong! Here is an example we have seen before for declaring a data type for a date. Recall that the data type is also defined in the _item_type_list.
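
    One way a loader might apply data types is to attach a regular expression to each type code and test every value against it. The type codes and patterns below are assumptions chosen for the sketch, not the constructs used in the sequence or mmCIF dictionaries.

```python
import re

# Type code -> regular expression a value of that type must match in full.
ITEM_TYPE_LIST = {
    "int": r"[-+]?[0-9]+",
    "float": r"[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?",
    "yyyy-mm-dd": r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
}


def matches_type(value, type_code, type_list=ITEM_TYPE_LIST):
    """Return True if the string value matches the pattern for type_code."""
    return re.fullmatch(type_list[type_code], value) is not None


print(matches_type("1997-10-30", "yyyy-mm-dd"))
print(matches_type("30 Oct 1997", "yyyy-mm-dd"))
print(matches_type("3.14", "int"))
```

    A pattern check of this kind confirms only the form of a value; it would accept 1997-99-99 as a date, so range checks would have to be layered on top.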

    Definition of Separate Datablocks and Dictionaries

    A datablock delimits a set of name-value pairs. For example, a PDB entry with a single X-ray structure is contained within a single datablock. Each dictionary defined thus far, including the sequence dictionary defined here, is contained within a single datablock. The sequence dictionary datablock is opened by a data_ declaration naming the block and terminated by an end-of-file.

    While an identical data item can exist in each datablock, a concatenation of datablock name and data item makes them unique. What if you wish to merge data items from different data blocks, for example in creating a list of active sites from multiple proteins? How is the source of each atom site identified in the single list of sites?

    The macromolecular dictionary defines a category entry with an associated _entry.id which is a parent of many entry ids associated with other categories, for example _atom_sites.entry_id. The _entry.id matches the datablock name and is mandatory for inclusion in the loop. Thus the source of each atom site in a merged list can be identified.  The entry category is currently not used in the sequence dictionary.
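
    The merging step can be sketched as follows, assuming each datablock has already been read into a list of atom-site rows keyed by the datablock name. The block names 1ABC and 2XYZ, the row keys, and the function name merge_atom_sites are invented for the illustration.

```python
# Datablock name -> atom-site rows from that block (invented data).
datablocks = {
    "1ABC": [{"label_atom_id": "CA", "Cartn_x": 1.0}],
    "2XYZ": [{"label_atom_id": "CA", "Cartn_x": 7.5},
             {"label_atom_id": "CB", "Cartn_x": 8.1}],
}


def merge_atom_sites(blocks):
    """Merge rows from several datablocks, tagging each with its entry id."""
    merged = []
    for block_name, rows in blocks.items():
        for row in rows:
            # The entry id equals the datablock name, so the source of each
            # row survives the merge.
            merged.append({"entry_id": block_name, **row})
    return merged


for row in merge_atom_sites(datablocks):
    print(row)
```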

    Writing the Sequence Dictionary

    With this basic understanding of how definitions are represented in a dictionary using DDL 2.x we are now ready to begin writing our sequence dictionary. As in writing code, the best approach is not to jump right in, but to plan ahead. In planning, the following issues must be considered: We begin by establishing the scope of the first version of our sequence dictionary.

    Defining the Scope

    The final goal is a complete generic description for all items of data associated with any protein family. The description of a single sequence is an important part of that overall goal.
    The basic unit within our scheme is each individual wild type (native) protein. Wild type proteins can be arranged in families based on sequence similarity and the family described. Individual proteins have specific functions and can possibly be linked to a disease or ailment. Based on function there are further sets of descriptors. For example, if this protein is an enzyme, there are many data items associated with the activity of this enzyme. Mutants and post-translationally modified proteins are considered to be derived from the wild type. Each individual sequence and structure must be described along with features shared by groups of structures or sequences.

    Obviously providing these definitions is a monumental task and highlights the importance of scope. To make it more manageable, we have broken the description of a protein family into segments. To date, we have developed a single-structure dictionary (merely a subset of the mmCIF dictionary), an enzyme dictionary (ePIF), and the sequence dictionary that is used as an example here.

    Each of these dictionaries is further subdivided to describe single versus shared features, for example, single sequences versus the features of a multiple sequence alignment. The sequence dictionary described here only covers features of a single sequence. In the future it will be necessary to merge definitions found in multiple dictionaries into a single dictionary. To simplify this task definitions which are obviously common to multiple dictionaries, for example citations, will be maintained as part of a separate universal dictionary and not included in the sequence dictionary described here. However, to make the dictionary usable in a stand alone fashion a global category of data items is included here.

    Given the initial scope for the dictionary, the next step is to identify those category groups that already exist in other dictionaries and that could be used here. These will normally have to be supplemented with new categories and data items as needed.

    There are no sequence-specific data items that we know of that can be used in the current sequence dictionary. (Trying to determine what has been done by others raises the need for a single place where one may explore existing official and unofficial dictionaries.) There are, of course, more generic descriptions of citations etc. that can be used, and these are not discussed here. The macromolecular dictionary does have a sequence description under the category _entity_poly_seq which describes a sequence of monomers; however, since the notion of entities is not developed as part of the sequence dictionary we chose to define a new set of data items for the sequence (at this time the actual sequence of amino acids or nucleotides is described as a single data item).

    One last issue related to scope is that of mandatory versus non-mandatory categories and data items. You need to consider what must be present to make a scientifically meaningful data file and make sure that the appropriate definitions in the dictionary are marked mandatory. (This is apart from data items that must be included to provide a valid loop structure).

    Defining the Atomicity

    Atomicity determines the level of parsing when loading a database or program and the granularity of queries on the data in the database. For example, defining a sequence as a single data item limits what can be validated for that sequence when entering it into the database and what can be returned from the database. Breaking the sequence up into a series of consecutive monomers, where each monomer is described, requires more effort in defining the dictionary but provides a better level of data checking and data retrieval. In other words it is a trade-off. In the current sequence dictionary the sequence is retrieved as a single data item.
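
    The trade-off can be illustrated with a short sketch. Stored as one string, a sequence can only be accepted or rejected as a whole; stored as one row per monomer, each position can be checked and queried. The 20-letter alphabet, the row keys, and the function names are assumptions made for the example.

```python
# One-letter codes for the 20 standard amino acids.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def to_monomer_rows(sequence):
    """Break a sequence string into one row per monomer, numbered from 1."""
    return [{"num": i, "mon": code} for i, code in enumerate(sequence, start=1)]


def invalid_positions(rows, alphabet=PROTEIN_ALPHABET):
    """Return the positions whose monomer code is not in the alphabet."""
    return [row["num"] for row in rows if row["mon"] not in alphabet]


rows = to_monomer_rows("MKTXA")
print(rows[0])
print(invalid_positions(rows))
```

    With the finer atomicity a loader can report exactly which position is suspect, at the cost of a longer dictionary and a larger data file.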

    We are now ready to write the actual dictionary. The first step is to define the category groups and categories within each of those groups. Table 1 provides that information and includes data items for everything which we want to extract from a typical sequence entry in GenBank, PIR, etc. in a FASTA format.

    Choosing Category Groups and Categories

    Most of your time will be spent here since there are as many different opinions on how the data should be organized as there are people involved. The left column shows the category groups and the right column the categories in each group.
    Category group     Category
    xref_group         _xref.id
    general_group      _uid.id
    sequence_group     _sequence.id



    Table 1 Category Groups and Categories in the Sequence Dictionary

    Thus in our dictionary we define only three other category groups, as follows:

    While the order of components is not defined by STAR and no specific style guide exists we follow the general style found in the core and mmCIF dictionaries. Thus components are discussed in the order they will appear in the dictionary. The individual save frames for each data item are listed in alphabetical order.

    Defining the Dictionary

    We begin by declaring the data block and making any comments. There are no official style guides for doing much of what follows. Our approach is to use the mmCIF dictionary as our style guide (which in turn uses the style of the core CIF dictionary). Next come the data items describing the dictionary itself: The dictionary history is more for your use in developing the dictionary than for the users of the dictionary. The dictionary will evolve and the DDL associated with the dictionary history provides an audit trail.

    Defining Category Groups

    All dictionaries have an inclusive_group of which all category groups are members.
    This list of category groups is described in the dictionary as follows: The parent of each of the defined category groups is the inclusive_group. A description is given for each category group.

    Defining Categories

    The next step is to define categories to be contained within each category group. Here is an example of defining a category called sequence. Following the style of the other dictionaries, categories would be presented in alphabetical order as would the data items within each category. Note that for readability bulleted notes have been inserted in the category description.

    Defining Data Items

    Here is a data item for the sequence creation date. Note that:

    Defining Relationships

    Here are two data items describing an identifier, one for the sequence and one for the universal identifier (uid). We have already described most features of the definition. What is new here is the parent-child relationship between these two data items. The uid is an identifier which applies to any object in our system that requires unique identification. It could be a sequence, a structure, a feature of a group of sequences, and so on. Thus, the sequence identifier is only one of a number of children that relate to this parent.

    Defining New Item Types

    If you define a new data type, as we have done in the above example (e.g., PKR_ID), it must be declared in an _item_type_list as follows:

    ## ITEM_TYPE_LIST ##

    By style convention the _item_type_list is placed at the end of the dictionary.

    The Complete Dictionary

    Following the instructions above, it should be possible to develop a complete sequence dictionary, which can be found at http://www.sdsc.edu/pb/cif/write_dic/sequence970904.txt

    Parsing Your Dictionary

    There are various parsers available for checking the syntax of  your dictionary:


    Most of the sequence dictionary was developed by Stella Veretnik and Michael Gribskov. This work was funded by NSF grant 9310154 and a grant from the DOE. Thanks to Mike Gilson for a critical review of the material and to Helen Berman, Paula Fitzgerald, and John Westbrook for organizing the workshop that provided the impetus for the work. The original flow charts (Figs. 1-3) are used by permission and can be found in a description of the DDL (http://ndb.rutgers.edu/NDB/mmcif/ddl/ddl/ddl.html) and in the mmCIF tutorials http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/dict-struct.htm