Tutorial: Writing a STAR/CIF Compliant Dictionary using DDL 2.x
This tutorial is based on experience in developing
a dictionary to describe protein and DNA sequences at the level of detail
found in a GenBank entry.
Last Update Oct. 30, 1997 Phil Bourne San
Diego Supercomputer Center (SDSC)
Introduction
This tutorial discusses the creation of a data dictionary that is compliant
with the Self-defining Text Archival and Retrieval (STAR) language specification
(of which the Crystallographic Information File - CIF - is a subset). The
dictionary is based upon Dictionary Definition Language (DDL) version 2.x
which was developed to describe the macromolecular Crystallographic Information
Files (mmCIF) dictionary. The mmCIF dictionary describes in great detail
a structure determined by X-ray crystallography or NMR, and the X-ray crystallographic
experiment. These various terms are explained subsequently.
The data dictionary we will use as an example describes DNA and protein
sequence data at the level of detail found in a GenBank entry described
in FASTA format. The dictionary can be viewed in its entirety
at http://www.sdsc.edu/pb/cif/write_dic/sequence970904.txt or via a
series of Web pages created using StarHtml.
This dictionary was chosen as it covers many features of STAR and the
dictionary definition language, yet is relatively simple with respect to
content.
Motivation
Why develop a data dictionary?
-
Annotation. If two researchers are
talking about a sequence feature table are they really talking about the
same thing? An exact description of what constitutes a sequence feature
table is needed. In other words, a dictionary definition for each and every
item of data is needed. Depending on the item of data that implies: a definition;
how that item of data relates to other items of data; enumeration of possible
values; data type; and so on. Good annotation is of great benefit
to researchers not familiar with the data, yet wishing to use it without
misinterpretation.
-
Support for Databases. There is so
much data that the only way to use it effectively is by placing it
in a database. To avoid loss of data when placing it in a database through
the parser's failure to interpret parts of the data, that data must be defined
rigorously.
Why should the dictionary be compliant with the STAR and CIF specifications?
-
The STAR/CIF approach yields a dictionary that is extensible;
i.e., a dictionary that can accommodate new items of data as a discipline
evolves.
-
STAR/CIF has already been used to develop
scientific dictionaries, notably those for small-molecular and macromolecular
crystallography. Therefore, it may be possible to use definitions already
defined by others. This may seem unimportant at first, but as you begin
to develop definitions and see the time and effort involved, work already
done becomes more valuable. Consider something as complex as a macromolecular
structure. It took several years to finalize a set of definitions which
encapsulated the many types of macromolecular structure and their associated
features. Over time it may be this description that becomes the most
valuable feature of the dictionary.
-
STAR/CIF permits
the inclusion of methods (i.e.,
software) which are associated with particular items of data. While this
is not used widely at present (see Biggs
et al., 1997 for an example) it presents some interesting possibilities.
At present many programmers write code which interprets the meaning of
data in a data file. Those interpretations can be different leading to
inconsistent use of the same item of data. The definitions in the dictionary
reduce these inconsistencies, but the availability of software within the
dictionary (or at least associated with the dictionary) provides even greater
consistency than the definitions alone. In our sequence dictionary we shall
see how to include code which, for example, parses specific data
items in FASTA format.
-
Software
exists already for performing basic data manipulation of STAR/CIF files.
-
The STAR/CIF specification is well documented
and supported by the community.
-
STAR/CIF is supported by the International
Union of Crystallography (IUCr).
The Example Dictionary
We needed to develop a database to support data being collected on the
protein kinase family of enzymes (http://www.sdsc.edu/kinases).
These data are extensive and cover sequence, structure, enzymology, function,
protocols, and so on. Based on experience in developing other databases
we realized that a complete description of each item of data to be included
was critical. This description should include the attributes (classification,
type, enumerated values, etc.) of each item of data and its relationship
to other items (formally referred to as a schema). Since we were familiar
with STAR/CIF we chose it to form the underlying data representation. We
took particular care that data items be as far as possible generic, that
is applicable to any protein family and not specific to protein kinases.
The task of defining the data items is daunting and we divided it
up into items related to sequence, structure (mostly taken from the mmCIF
dictionary), enzymology, function, and global (applicable to all types
of data items). For each set of data items we solicited help from experts
in that area. It is the sequence dictionary developed by Mike Gribskov
and Stella Veretnick which is used in this tutorial.
Religious Wars
Why STAR/CIF, what about other data representations?
There are other data representations that could form the basis for casting
a dictionary, notably ASN.1. Certainly ASN.1 has some positive features:
-
It is an ISO/ITU-T standard and not defined by one domain of users like STAR/CIF.
-
There are tools available for use with ASN.1.
-
It is used extensively by the NCBI;
i.e., it works well in practice.
-
It is a simple yet rigorous data description.
-
MMDB developed at NCBI has a set of definitions defining macromolecular
structure. This includes generally applicable chemical graphs, but lacks
a description of the experiment.
Trying to decide which data representation to use can take on a religious
fervour. It is not our intention to endorse one approach over another,
but rather to provide a balanced view of what is available. Familiarity
with STAR/CIF and the number of data items already defined for the mmCIF
dictionary made STAR/CIF a good choice.
A critical issue is that at some future date we could convert to an
ASN.1 (or other) representation and back to STAR/CIF without loss of information.
The ability to convert without data loss is what is critical. Based
on this criterion we believe STAR/CIF and ASN.1 to be suitable candidates
for casting a dictionary.
What are STAR, DDL and CIF?
A detailed answer to this question can be found in a series of papers listed
at http://ndb.sdsc.edu/NDB/mmcif/references/index.html.
An abbreviated answer is given here.
STAR: A General Language for Describing Information
Self-defining Text Archival and Retrieval (STAR) is a set of simple syntax
rules for defining: a dictionary definition language (DDL), dictionaries
based upon the DDL, and data files containing the data items defined in
the dictionaries. Thus DDL, dictionaries, and data files are in the same
format and are interpretable by the same basic parsers. We will meet these
syntax rules as we go along, but here are some basics. An alternative description
of STAR syntax is available at http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/syntax.htm.
-
Everything consists of a name-value pair. A name is distinguished
from a value by a leading underscore (_). A name-value pair is referred
to as a data item. Example:
_protein.compound_name_common "cAMP dependent protein kinase"
-
When a single name has multiple values, a loop_ construct
is used. This construct defines a loop over multiple values until the next
name or next loop or next datablock (see below) or end-of-file is found.
Although STAR supports nested loops (with each level of loop terminated
by stop_), CIF does not, and nested loops will not be used
in the dictionary we are building here. This restriction requires
that data items which would otherwise be related via nested loops instead
be explicitly linked with each other in "parent-child" relationships (see
below). The integrity of loops over name-value pairs requires that a place
be held for every item of data. However, it is possible to note that a
specific data item has been deliberately left unspecified by including
a period (.) in its place. When a datum is missing, but would be present
if the data set were complete, its place is held instead by a question
mark (?). Example:
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 1 ARG 1 2 ILE 1 3 CYS 1 4 PHE 1 5 ASN 1 6 HIS 1 7 GLN 1 8 SER 1 9 SER
1 10 GLN 1 11 PRO 1 12 GLN 1 13 THR 1 14 THR 1 15 LYS 1 16 THR 1 17 CYS
1 18 SER 1 19 . 1 20 GLY
Here the period in the last line indicates that the identity of a residue
(_entity_poly_seq.mon_id) has not been specified.
-
There is no restriction on names in STAR syntax other than they begin with
a leading underscore (_). As we shall see the DDL places additional restrictions
on these names.
-
Values are delimited by white space. A single value of multiple words is
contained in single or double quotes. A single value covering multiple
lines is contained within semicolons (;) as the first character of the
line.
-
A series of name-value pairs is contained in a data block which
starts with data_{name}; for example, data_2cpk. A data block
is terminated by an end-of-file or the start of another data block.
-
A save frame is a chunk of information that can be referenced anywhere
within a data block and is introduced by save_{name}. A save frame
is terminated by save_. The save frame is referenced by ${name}
where name is the name of the save frame. We will be using save
frames in developing the dictionary. They are used to delimit the definition
of each data item, but are not referenced anywhere. This is the convention
used in the mmCIF dictionary.
-
A data name can appear only once in a data block.
-
Comments begin with a hash (#) and are terminated by an end-of-line.
-
White space is ignored, except as a delimiter.
-
STAR syntax is case insensitive.
-
There is no implied ordering of save frames, name-value pairs, or datablocks.
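The rules above are simple enough to sketch as a small tokenizer. The following Python is only an illustration under stated assumptions, not a complete STAR parser: the function name tokenize is hypothetical, quoted values are assumed to contain no embedded quote characters, and nested loops and save frame references are not handled.

```python
import re

def tokenize(text):
    """Split STAR/CIF text into (kind, text) tokens (simplified sketch)."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Multi-line value: runs until the next line starting with ';'.
            buf = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                buf.append(lines[i])
                i += 1
            tokens.append(("value", "\n".join(buf).strip()))
            i += 1  # skip the closing ';' line
            continue
        # Quoted values, comments, or bare whitespace-delimited words.
        for m in re.finditer(r"'([^']*)'|\"([^\"]*)\"|(#.*)|(\S+)", line):
            single, double, comment, bare = m.groups()
            if comment is not None:
                break  # a comment runs to the end of the line
            if single is not None or double is not None:
                tokens.append(("value", single if single is not None else double))
            elif bare.startswith("_"):
                tokens.append(("name", bare))
            elif bare.lower() == "loop_" or bare.lower().startswith(("data_", "save_")):
                tokens.append(("keyword", bare))
            else:
                tokens.append(("value", bare))
        i += 1
    return tokens
```

A loop can then be read by collecting the names that follow loop_ and dealing the subsequent values out to them in order, with "." and "?" kept as placeholders.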
A detailed discussion of the STAR syntax can be found in
-
A. Cook and S. R. Hall. The STAR File: a New Format for Electronic Data
Transfer and Archiving. J. Chem. Inf. Comput. Sci. (1991) 31,
326-333.
-
S. R. Hall and N. Spadaccini. The STAR File: Detailed Specifications. J.
Chem. Inf. Comput. Sci. (1994) 34, 505-508.
Dictionary Definition Languages (DDLs) Based upon STAR
We can now use these simple STAR rules to define a language upon which
to base our dictionaries. Consider the analogy to a Webster's English Dictionary.
The dictionary only uses the English Language (cf STAR rules) and each
word is presented in the dictionary alphabetically followed by a pronunciation
and then a definition. A formal definition of how the dictionary is laid
out is analogous to the DDL. To date, two such presentation formats
have been developed within the crystallography community. These are referred
to as DDL v1.x and DDL v2.x, where x represents different versions of the
major format.
DDL v1.x is simple but not rigorous. For example, using DDL 1.x the
name given to the data item implies a hierarchy of parts, for example atom_site_Cartn_x.
But what is the hierarchy? A crystallographer would know that
the logical hierarchy is that Cartn_x is one member of a class of data
items called atom_site; however, there is no way for a parser to automatically
determine that fact. DDL 2.x addresses this problem by writing
data names as atom_site.Cartn_x. That is, a period separates the
category of data items (atom_site) from the members of the category
(e.g., Cartn_x). Even this requires inference from the data name and
is not rigorous. Hence, as we shall see subsequently, the category to which
a data item belongs is defined explicitly as part of the definition of
that data item.
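To make the difference concrete, here is a minimal Python sketch (the function name split_ddl2_name is hypothetical) showing that a DDL 2.x name can be split mechanically at the period, whereas a DDL 1.x name gives a parser nothing to split on. As noted above, the split is still only a convention; the authoritative category is the one declared in the item's definition.

```python
def split_ddl2_name(name):
    """Split a DDL 2.x data name into (category, item) at the first period."""
    if not name.startswith("_") or "." not in name:
        raise ValueError(f"not a DDL 2.x style data name: {name!r}")
    category, item = name[1:].split(".", 1)
    return category, item

print(split_ddl2_name("_atom_site.Cartn_x"))  # ('atom_site', 'Cartn_x')

try:
    split_ddl2_name("_atom_site_Cartn_x")  # DDL 1.x style: no period
except ValueError as err:
    print(err)
```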
DDL 1.x is currently used to describe small molecule crystallographic
data - the core dictionary is expressed using DDL 1.x. DDL v2.x is used
to express the macromolecular crystallographic dictionary. DDL v2.x is
backward compatible with v1.x.
Thus, any data item in a DDL 2.x dictionary which has a DDL 1.x counterpart
includes an alias to that data item. Consider our example of an x coordinate
as it appears in the macromolecular dictionary:
save__atom_site.Cartn_x
.....
_item_aliases.alias_name    '_atom_site_Cartn_x'
_item_aliases.dictionary    cif_core.dic
_item_aliases.version       2.0.1
The data name is declared to be the name of the save frame (save_
followed by _atom_site.Cartn_x, hence two underscores concatenated
together). On the left are data names defined by the DDL and the values
are on the right. This is only a partial description of that data item.
We will get to the full description subsequently. This example defines
in the macromolecular dictionary a data item _atom_site.Cartn_x
which has a counterpart called _atom_site_Cartn_x located in version
2.0.1 of the core dictionary cif_core.dic.
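One way software might use such aliases is to translate the names in a DDL 1.x data file into their DDL 2.x counterparts. The sketch below assumes the alias table has already been extracted from the _item_aliases entries of a dictionary; the table contents are the two aliases that appear in this tutorial, and the function name to_ddl2_names is hypothetical.

```python
# Alias table as it might be collected from _item_aliases.alias_name entries.
ALIASES = {
    "_atom_site_Cartn_x": "_atom_site.Cartn_x",
    "_citation_id": "_citation.id",
}

def to_ddl2_names(names, aliases=ALIASES):
    """Replace each DDL 1.x name by its DDL 2.x counterpart when one is known."""
    return [aliases.get(name, name) for name in names]

print(to_ddl2_names(["_atom_site_Cartn_x", "_sequence.id"]))
```

Names without an alias are passed through unchanged, which matches the situation in the sequence dictionary where no aliases are defined.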
The alias mechanism is not used in the sequence dictionary that we are
about to develop, since none of the data items have been defined in other
DDL v1.x dictionaries. It is introduced here to highlight differences between
dictionaries cast in DDL 1.x and DDL 2.x.
In summary, DDL 2.x has all the capabilities of DDL 1.x and more.
Software tools exist that work with files described by both DDLs, with
just DDL v1.x, or with just DDL v2.x. Our dictionary development
will be based on DDL v2.x.
A full description of the DDL can be found at http://ndb.sdsc.edu/NDB/mmcif/ddl/index.html
in various formats. They have been prepared by John
Westbrook, the original author of DDL v2.x. There is also a DDL
tutorial at
http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/dict-struct.htm
You will need a working knowledge of the DDL to build your dictionary
and it would be worthwhile reading at least the DDL tutorial at this
point. What follows is an application of the material described in the
tutorial.
Crystallographic Information File (CIF)
CIF only uses a subset of the STAR syntax for all the crystallographic
dictionaries and associated data files. We have chosen to stick with it
for the sequence dictionary since it provides maximum usage of existing
software. However, some of these restrictions make the representation cumbersome.
Important restrictions are:
-
CIF restricts lines to 80 characters
-
Nested loops are not permitted
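Both restrictions are easy to check mechanically. The following Python sketch (the function name check_cif_restrictions is hypothetical) flags over-long lines and any use of stop_, which only occurs in nested STAR loops. It is a rough check under the assumption that stop_ does not appear as ordinary text inside a quoted or multi-line value.

```python
def check_cif_restrictions(text, max_len=80):
    """Return (line_number, message) pairs for violations of the CIF limits."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            problems.append((number, "line longer than %d characters" % max_len))
        if "stop_" in line.lower().split():
            problems.append((number, "stop_ found: nested loops are not allowed"))
    return problems
```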
Where Do the Data Fit In?
Where do the data fit in to this STAR/CIF model? At the conclusion of this
tutorial we will have a dictionary which contains a set of data items for
describing sequences. Our data files (one or more data blocks per file)
contain only the names of the data items and the values of those data items,
either looped or as single name-value pairs. The data names correspond
only to names found in the dictionary. Several tools
exist for validating a data file against a data dictionary. The same
tools that parse the dictionary should be able to parse a data file. To
see what a macromolecular CIF (mmCIF) data file looks like, go to moose,
select a structure using the text option, and then select mmCIF(ASCII)
as the display option. This will show your favorite PDB entry as a STAR/CIF
compliant mmCIF file.
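The simplest form of validation is to confirm that every data name used in a data file is defined in the dictionary. A minimal Python sketch follows; the function name unknown_names is hypothetical, and names are compared without regard to case on the basis of the case-insensitivity rule stated earlier.

```python
def unknown_names(data_names, dictionary_names):
    """Return the data names that have no definition in the dictionary."""
    known = {name.lower() for name in dictionary_names}
    return [name for name in data_names if name.lower() not in known]

defined = ["_sequence.id", "_sequence.type", "_sequence.sequence"]
used = ["_sequence.id", "_SEQUENCE.TYPE", "_sequence.colour"]
print(unknown_names(used, defined))  # ['_sequence.colour']
```

Real validation tools go further and check types, enumerations and relationships, which are described in the sections that follow.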
Characteristics of the Dictionary
This section highlights the various characteristics of the dictionary which
are illustrated using segments of the sequence dictionary. In the following
section we put these segments together to complete the dictionary.
Organization
The dictionary description and the data items defined in the dictionary
are organized into dictionary, category_group, category,
sub_category, and finally the data items themselves (item).
Each of these levels of organization is described subsequently, starting with
the dictionary description.
Dictionary
Figure 1 Datablock and Dictionary Relationships
There are two tables (categories) describing the dictionary itself,
dictionary and dictionary_history. They in turn have a pointer
to a category datablock which describes the datablock itself, which
in turn has a pointer to datablock_methods defining methods associated
with the datablock. In the sequence dictionary this is written as:
data_seqdic.s97
###############
## DATABLOCK ##
###############
_datablock.id             seqdic.s97
_datablock.description
; This datablock contains the sequence dictionary developed at the
  San Diego Supercomputer Center
;
################
## DICTIONARY ##
################
_dictionary.title         seqdic.s97
_dictionary.version       0.1.01
_dictionary.datablock_id  seqdic.s97
########################
## DICTIONARY_HISTORY ##
########################
loop_
_dictionary_history.version
_dictionary_history.update
_dictionary_history.revision
0.1.01 1997-08-13
; First production version
;
0.1.02 1997-09-04
; Added method_list and category methods. These are pretty much a guess
  for now.
;
This opening part of the dictionary introduces a number of features, some
of syntax and some of style:
-
Each category begins with a banner of #'s (STAR comments) that marks the
location of the category and makes it easy to find when reading the
dictionary (style).
-
Names on the left are defined in the DDL 2.x dictionary (syntax).
-
The DDL 2.x dictionary indicates that _datablock.id is the parent
of _dictionary.datablock_id and that the relationship is implicit
(not shown). Thus we have two tables (categories) datablock and
dictionary which share a common data value. This data value, although
known by different names defines the parent-child relationship between
the two categories. The relationship is implicit in that the child can
be present without the necessity of having the parent - the parent is implied
(style and syntax).
-
Notice the form of dates (e.g., 1997-08-13).
Date is a data type and is defined explicitly in the dictionary. We will
see how to do that subsequently (syntax).
-
The figure references methods - we will get to that subsequently also (syntax).
Category Groups
The next level of organization is category groups - categories that it
makes sense to group together. The organization of category groups, categories
and sub-categories is:
Figure 2 Relationship Between Categories and Category
Groups
We begin with category groups. In the sequence dictionary there are
three category groups: inclusive_group, general_group and
sequence_group. They are declared by the DDL _category_group_list
(itself a category) which identifies the name of the category group, the
parent category group from which it was derived and a description of the
category group.
#############################
## CATEGORY_GROUP_LIST ##
#############################
loop_
_category_group_list.id
_category_group_list.parent_id
_category_group_list.description
'inclusive_group' .
;
Categories that belong to the macromolecular dictionaries.
;
'general_group' 'inclusive_group'
;
Categories common to all data types (sequences, structures,
enzymes, features) such as uid, citation, etc.
;
'sequence_group' 'inclusive_group'
;
Categories common to protein and nucleic acid sequences.
;
The inclusive_group, as the name suggests, is the parent of the
other category_groups.
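Because each category group names its parent, the groups form a tree that software can walk. The Python sketch below holds the three groups of the sequence dictionary in a table (with None standing in for the "." placeholder) and lists the ancestors of a group. The function name ancestors and the cycle check are illustrative assumptions.

```python
# Parent of each category group, taken from the _category_group_list loop.
GROUP_PARENT = {
    "inclusive_group": None,  # '.' in the dictionary: no parent
    "general_group": "inclusive_group",
    "sequence_group": "inclusive_group",
}

def ancestors(group, parents=GROUP_PARENT):
    """Return the chain of parent groups, nearest first."""
    chain = []
    seen = {group}
    parent = parents.get(group)
    while parent is not None:
        if parent in seen:
            raise ValueError("cycle in category group hierarchy")
        chain.append(parent)
        seen.add(parent)
        parent = parents.get(parent)
    return chain

print(ancestors("sequence_group"))  # ['inclusive_group']
```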
Categories
Within each category group are one or more categories of data items. Here
is an example of expressing a category:
##############
## SEQUENCE ##
##############
save_SEQUENCE
_category.description
;
  Protein/DNA sequence and other relevant information parsed from the
  database entry (cross-references, 'domains' name and location, etc).
  The sequence category contains only the information publicly
  available - local enhancements are entered in the "local" category.
;
_category.id              sequence
_category.mandatory_code  yes
loop_
_category_key.name
'_sequence.id'
'_sequence.sequence'
loop_
_category_group.id
'inclusive_group'
'sequence_group'
loop_
_category_methods.method_id
_category_methods.category_id
'get_sequence_category' sequence
loop_
_category_examples.detail
_category_examples.case
#------------------------------------------------------------------
'Example - Tyrosine_protein kinase DASH/ABL'
;
    loop_
    _sequence.id
    _sequence.type
    _sequence.name
    _sequence.synonym
    _sequence.date_create
    _sequence.update_sequence
    _sequence.update_annotation
    _sequence.citation
    _sequence.length
    _sequence.mol_weight
    _sequence.sequence
    PKRPSEQ012345
    PROTEIN
    ABL_DROME
    'abl, d-abl, dm_abl'
    1986-07-21
    1990-01-01
    1995-11-01
    ;
    RX MEDLINE; 88174728.
    RA HENKEMEYER M.J., BENNETT R.L., GERTLER F.B., HOFFMANN F.M.;
    RN [1]
    RP SEQUENCE FROM N.A.
    RX MEDLINE; 88174728. [NCBI, Geneva]
    RA HENKEMEYER M.J., BENNETT R.L., GERTLER F.B., HOFFMANN F.M.;
    RL MOL. CELL. BIOL. 8:843-853(1988).
    RN [2]
    RP SEQUENCE OF 374-648 FROM N.A.
    RX MEDLINE; 84082064.
    RA HOFFMANN F.M., FRESCO L.D., HOFFMAN-FALK H., SHILO B.-Z.;
    RL CELL 35:393-401(1983).
    ;
    1520
    161836
    ;
    MGAQQGKDRGAHSGGGGSGAPVSCIGLSSSPVASVSPHCISSSSGVSSAPLGGGSTLRGS
    RIKSSSSGVASGSGSGGGGGGSGSGLSQRSGGHKDARCNPTVGLNIFTEHNGTKHSSFRG
    HPGKYHMNLEALLQSRPLPHIPAGSTRPLFWRIAELQQHQQDSGGLGLQGSSLGGGHSST
    .....
    ;
;
save_
The category description is contained within a save frame that begins with
save_SEQUENCE and ends with save_. The save frames of data items, whose names
start with an underscore, instead begin with a double underscore, e.g.
save__sequence.date_create.
The category uses new DDL names that we have yet to encounter.
-
_category.description describes the contents of the category.
-
_category.id is the identifier of the category. By convention the
part of the data name preceding the period corresponds to the _category.id.
-
_category.mandatory_code (with a value of yes, no, or implicit)
indicates whether the category must appear in the data block for that data
block to be valid. It is used here for validation purposes. A data block
without a sequence category (i.e., a sequence) is not valid.
-
_category_key.name defines the data items within the category which
make every tuple (row in the table i.e. loop) unique. These data items
must be included for that row of the table, and the table itself to be
valid.
-
_category_group.id indicates what category group the category belongs
to.
-
_category_examples.detail identifies the particular example.
-
_category_examples.case provides the example. This is useful since
it includes as many of the data items present in the category as possible
and gives insight into what the data files with data items from that category
will look like. Notice the syntax used here. The example is surrounded by
semicolons and any semicolons found in the example itself are not interpreted
as valid syntax since the example has been indented and they are not found
in column one.
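The role of _category_key.name can be illustrated with a short check that every row of a looped category carries its key items and that no two rows share the same key. This is a Python sketch under stated assumptions: each row is held as a dictionary mapping data names to values, and the function name check_category_keys is hypothetical.

```python
def check_category_keys(rows, key_names):
    """Return a list of problems found in the key items of a looped category."""
    problems = []
    seen = {}
    for index, row in enumerate(rows, start=1):
        key = tuple(row.get(name, "?") for name in key_names)
        if any(value in ("?", ".") for value in key):
            problems.append(f"row {index}: missing key item")
            continue
        if key in seen:
            problems.append(f"row {index}: duplicates key of row {seen[key]}")
        else:
            seen[key] = index
    return problems

rows = [
    {"_sequence.id": "S1", "_sequence.sequence": "MGA"},
    {"_sequence.id": "S1", "_sequence.sequence": "MGA"},
    {"_sequence.id": "S2"},
]
print(check_category_keys(rows, ["_sequence.id", "_sequence.sequence"]))
```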
Subcategories
A sub_category is used to associate a procedure with a subset of
data items. A sub_category classification is not used in the sequence
dictionary described here, nor is it fully exploited in the macromolecular
CIF dictionary, hence only a brief introduction of how it might be used
is given here. First a subcategory is declared.
loop_
_sub_category.id
_sub_category.description
'cartesian coordinate'
; The collection of x,y,z ...
;
The methods associated with this subcategory are declared in a sub_category_methods
loop:
loop_
_sub_category_methods.method_id
_sub_category_methods.sub_category_method
'cartesian coordinate'
; Validate ...
;
The specific method then appears in the method_list which is defined subsequently.
Data Items
The final level of the organization is the individual data items, which
have a number of DDL descriptors as shown in Figure 3. Only a subset are
used in describing most data items.
Figure 3 Relationship Between Items
Here is a simple data item:
save__sequence.date_create
_item_description.description
;
  Creation date for the entry in the source database. We suppose that
  not all databases have this so we make it optional.
;
_item.name                '_sequence.date_create'
_item.category_id         sequence
_item.mandatory_code      no
_item_type.code           YYYY-MM-DD
_item_examples.case       '1995-11-01'
save_
Note:
-
The definition of the data item is contained in a save frame.
-
_item_description.description describes the
specific data item.
-
The name is declared explicitly with _item.name.
-
The category to which the data item belongs is declared explicity with
_item.category_id.
-
_item.mandatory_code indicates whether the data item must be present
for a row in a table (loop) to be valid. Values can be yes, no, or implicit.
Implicit means that whether the item is mandatory is implied by another data item
which has a relationship to the current data item.
-
_item_type.code defines the data type of the data item. These data
types are listed in the _item_type_list as follows:
loop_
_item_type_list.code
_item_type_list.primitive_code
_item_type_list.construct
_item_type_list.detail
yyyy-mm-dd char
'[0-9]?[0-9]?[0-9]?[0-9]-[0-9]?[0-9]-[0-9]?[0-9]'
; Standard form of STAR/CIF date
;
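The construct is a regular expression, so a validator can test a value against it directly. The Python sketch below uses the date construct with the middle character class written as [0-9]; the function name matches_type is hypothetical. Note that the check is purely syntactic: it accepts any string of the right shape, including impossible dates such as 9999-99-99.

```python
import re

# Construct for the yyyy-mm-dd type, as listed in _item_type_list.construct.
DATE_CONSTRUCT = "[0-9]?[0-9]?[0-9]?[0-9]-[0-9]?[0-9]-[0-9]?[0-9]"

def matches_type(value, construct=DATE_CONSTRUCT):
    """Return True when the whole value matches the type's construct."""
    return re.fullmatch(construct, value) is not None

print(matches_type("1995-11-01"))     # True
print(matches_type("Oct. 30, 1997"))  # False
```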
Expression of Relationships Between Data Items
It is necessary to express a variety of relationships between data items
and categories when using DDL 2.x. Here, we describe common relationships
that are found in our example sequence dictionary.
Parent-child relationships are used to relate tables (loops)
where instances of a data item have the same value and occur in different
categories. This is necessary because of the restriction resulting from
single-level looping (see above). That is, a nested loop, such as the list
of authors associated with a single publication, which is in turn part
of a list of publications, appears in two separate lists (categories) -
one for the list of publications and one for the list of authors. A common
id links the two - that is, each author has an associated citation
id which matches the citation id. The _citation.id is the parent,
the _citation_author.citation_id the child. Parent-child is a somewhat
misleading description. It is simply a bi-directional link between two
items of data. The one from which the link is derived is considered the
parent and the associated data item the child. Provided one direction of
the link is present the other link is inferred. Here is an example citation
from a data file and it is followed by the dictionary description
of the associated data items. We begin with the citation category which
loops over a series of citations identified by _citation.id.
loop_
_citation.id
_citation.coordinate_linkage
_citation.title
_citation.country
_citation.journal_abbrev
_citation.journal_volume
_citation.journal_issue
_citation.page_first
_citation.page_last
_citation.year
_citation.journal_id_ASTM
_citation.journal_id_ISSN
_citation.journal_id_CSD
_citation.book_title
_citation.book_publisher
_citation.book_id_ISBN
_citation.details
primary yes
;
Crystallographic analysis of a complex between human immunodeficiency
virus type 1 protease and acetyl-pepstatin at 2.0-Angstroms resolution.
;
US 'J. Biol. Chem.' 265 . 14209 14219
1990
HBCHA3 0021-9258 071 . . .
;
The publication that directly relates to this coordinate set.
;
2 no
;
Three-dimensional structure of aspartyl-protease from human
immunodeficiency virus HIV-1.
;
UK 'Nature' 337 . 615 619 1989
NATUAS 0028-0836 006 . . .
;
Determination of the structure of the unliganded enzyme.
;
We now have a list of authors for all citations described above.
loop_
_citation_author.citation_id
_citation_author.ordinal
_citation_author.name
primary 1 'Fitzgerald, P.M.D.'
primary 2 'McKeever, B.M.'
primary 3 'Van Middlesworth, J.F.'
primary 4 'Springer, J.P.'
2 5 'Heimbach, J.C.'
2 6 'Leu, C.-T.'
2 7 'Herber, W.K.'
2 8 'Dixon, R.A.F.'
2 9 'Darke, P.L.'
Consider how the single data item _citation.id is described in
the dictionary.
save__citation.id
_item_description.description
;
The value of _citation.id must uniquely identify a record in the
CITATION list.
The _citation.id 'primary' should be used to indicate the
citation that the author(s) consider to be the most pertinent to
the contents of the data block.
Note that this item need not be a number; it can be any unique
identifier.
;
loop_
_item.name
_item.category_id
_item.mandatory_code
'_citation.id'                   citation         yes
'_citation_author.citation_id'   citation_author  yes
'_citation_editor.citation_id'   citation_editor  yes
'_software.citation_id'          software         yes
_item_aliases.alias_name    '_citation_id'
_item_aliases.dictionary    cif_core.dic
_item_aliases.version       2.0.1
loop_
_item_linked.child_name
_item_linked.parent_name
'_citation_author.citation_id'   '_citation.id'
'_citation_editor.citation_id'   '_citation.id'
'_software.citation_id'          '_citation.id'
_item_type.code             code
loop_
_item_examples.case
'primary'
'1'
'2'
save_
Note: since the values of some of the data names shown here are in fact themselves
data names, they are placed in single quotes to ensure literal interpretation
by the parser.
The _item_linked DDL category defines this parent-child relationship.
That is, the data item _citation.id is declared to be the parent
of _citation_author.citation_id, as well as two other data
items. This declaration is not made in the definition of the child data
item _citation_author.citation_id (an issue of style).
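A validator can use the parent-child declaration to check that every child value refers to an existing parent value. The following Python sketch uses the citation identifiers from the example data file above; the function name orphaned_children is hypothetical, and the value '3' is an invented bad value added only to show a failure.

```python
def orphaned_children(parent_values, child_values):
    """Return child values that do not match any parent value."""
    parents = set(parent_values)
    return sorted({value for value in child_values if value not in parents})

citation_ids = ["primary", "2"]                   # _citation.id
author_citation_ids = ["primary", "primary", "2"]  # _citation_author.citation_id

print(orphaned_children(citation_ids, author_citation_ids))          # []
print(orphaned_children(citation_ids, author_citation_ids + ["3"]))  # ['3']
```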
Dependent item relationships are important in some instances,
although not used in the sequence
dictionary. These relationships declare that for a given data item
in a category, certain other
data items must be present to make that category valid. So for example,
consider the
mmCIF atom_site category. The data item _atom_site.Cartn_x
is only valid if
_atom_site.Cartn_y and _atom_site.Cartn_z are present.
save__atom_site.Cartn_x
_item_description.description
;
The x atom site coordinate in angstroms specified according to
a set of orthogonal Cartesian axes related to the cell axes as
specified by the description given in
_atom_sites.Cartn_transform_axes.
;
_item.name                  '_atom_site.Cartn_x'
_item.category_id           atom_site
_item.mandatory_code        no
_item_aliases.alias_name    '_atom_site_Cartn_x'
_item_aliases.dictionary    cif_core.dic
_item_aliases.version       2.0.1
loop_
_item_dependent.dependent_name
'_atom_site.Cartn_y'
'_atom_site.Cartn_z'
_item_related.related_name  '_atom_site.Cartn_x_esd'
_item_related.function_code associated_esd
_item_sub_category.id       cartesian_coordinate
_item_type.code             float
_item_type_conditions.code  esd
_item_units.code            angstroms
save_
The _item_dependent.dependent_name defines other data items that
must be present in the category to make this data item valid. There is
no inference here. That is, even though _atom_site.Cartn_x declares
a dependency on _atom_site.Cartn_y this does not imply that y is
dependent on x unless stated explicitly in the definition of _atom_site.Cartn_y.
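The one-directional nature of the dependency can be seen in a small Python sketch. The table below holds only the dependency declared in the definition of _atom_site.Cartn_x; the function name missing_dependents is hypothetical.

```python
# Dependencies as declared by _item_dependent.dependent_name for Cartn_x only.
DEPENDENTS = {
    "_atom_site.Cartn_x": ["_atom_site.Cartn_y", "_atom_site.Cartn_z"],
}

def missing_dependents(present_names, dependents=DEPENDENTS):
    """Map each present item to the dependent items that are absent."""
    present = set(present_names)
    problems = {}
    for name in present_names:
        missing = [d for d in dependents.get(name, []) if d not in present]
        if missing:
            problems[name] = missing
    return problems

# x without z is reported; y on its own is not, since y declares nothing here.
print(missing_dependents(["_atom_site.Cartn_x", "_atom_site.Cartn_y"]))
print(missing_dependents(["_atom_site.Cartn_y"]))
```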
Association of Methods with Data
It is possible to associate a specific method -- code that operates on
the data item -- with blocks, categories, subcategories and data items.
You can imagine using this for data conversion, data validation, etc. Here,
we wish to define Perl code that parses information from a FASTA sequence
file and is used to create a STAR/CIF sequence data file. First a list
of methods is defined - for the sake of abbreviation only the function name
is included here.
#################
## METHOD_LIST ##
#################
loop_
_method_list.id
_method_list.details
_method_list.language
_method_list.inline
#-------------------------------------------------------------
'get_sequence_category'
; Parse the items in the sequence
category from a source
database file.
;
PERL
;
make_loop;
get_sequence-id;
get_sequence-type;
get_sequence-name;
get_sequence-synonym;
get_sequence-date_create;
get_sequence-update_sequence;
get_sequence-update_annotation;
get_sequence-citation;
get_sequence-length;
get_sequence-mol_weight;
get_sequence-sequence;
;
The _method_list category identifies and describes each available
method. To apply a method to, for example, a specific data item, _item_methods.method_id
should be included in the description of that data item.
For a novel use of associating code with data items see Bourne,
Biggs and Pu, 1997, ISMB 52-55. There we include in the dictionary
pointers for each category. The pointer indicates the relevant module (encapsulated
Perl code) that processes that category of data item in the conversion
from PDB to CIF format. At run time the converter is generated by reading
all the conversion modules. This facilitates code maintenance.
The advantage of both approaches is that (in principle) all users use
the same code. This avoids the situation where hundreds of programmers
implement their own mutually inconsistent interpretations of, for example,
how to convert a particular item of data in a PDB file to its STAR/CIF
data item counterpart.
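The idea of looking up shared code through the dictionary can be sketched as follows. The methods in this tutorial are written in Perl; the Python below is only an illustration of the dispatch idea, and the names METHODS, ASSOCIATIONS, method, and run_methods, as well as the toy FASTA parser, are all hypothetical.

```python
# Hypothetical registry: method id -> Python callable.
METHODS = {}

def method(method_id):
    """Decorator that registers a function under a _method_list.id value."""
    def register(func):
        METHODS[method_id] = func
        return func
    return register

@method("get_sequence_category")
def get_sequence_category(fasta_text):
    """Toy parser: return the header and joined sequence of one FASTA record."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines()]
    header = lines[0].lstrip(">")
    return {"name": header, "sequence": "".join(lines[1:])}

# Hypothetical association rows taken from a dictionary: (method id, target).
ASSOCIATIONS = [("get_sequence_category", "sequence")]

def run_methods(target, source):
    """Run every registered method that the dictionary associates with target."""
    return [METHODS[mid](source) for mid, tgt in ASSOCIATIONS if tgt == target]

result = run_methods("sequence", ">ABL_DROME\nMGAQQ\nGKDRG\n")
```

Because the program finds its code through the dictionary's association rows, every user who reads the same dictionary runs the same method.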
Enumeration of Allowed Values
Some data items can only have a limited set of values. In these cases those
values may be enumerated within the data item definition. Here is an example
from the sequence dictionary which indicates that a _sequence.type
can only have the value PROTEIN or DNA or RNA.
save__sequence.type
_item_description.description
;
Describes
the type of molecule in this sequence entry. Allowed types
are PROTEIN,
RNA, and DNA.
;
_item.name
'_sequence.type'
_item.category_id
sequence
_item.mandatory_code yes
loop_
_item_enumeration.value
_item_enumeration.detail
DNA
; DNA sequence
;
RNA
; RNA sequence
;
PROTEIN
; protein sequence
;
save_
_item_enumeration.value lists the values that this data item
can assume, and _item_enumeration.detail gives a description of each enumerated value.
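A validator can use the enumeration directly. The following minimal Python sketch assumes that the enumerated values have already been read into a mapping and that matching is exact and case-sensitive; both the name check_enumeration and the case-sensitivity are assumptions, not something the dictionary states.

```python
# Allowed values copied from the _item_enumeration loop for _sequence.type.
SEQUENCE_TYPE_ENUM = {
    "DNA": "DNA sequence",
    "RNA": "RNA sequence",
    "PROTEIN": "protein sequence",
}

def check_enumeration(value, allowed=SEQUENCE_TYPE_ENUM):
    """Return True if value is one of the enumerated values (exact match)."""
    return value in allowed

ok = check_enumeration("PROTEIN")
bad = check_enumeration("peptide")
```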
Definition of Units and Unit Conversions
Many data items have units associated with them. It is important to specify
these units in the dictionary, and it is helpful to include methods of
converting the dictionary defined units to other commonly used units. Here
is an example for expressing a molecular weight in Daltons.
save__sequence.mol_weight
_item_description.description
;
The molecular
weight of the unmodified sequence in daltons.
;
_item.name
'_sequence.mol_weight'
_item.category_id
sequence
_item.mandatory_code no
_item_type.code
int
_item_units.code
Daltons
_item_examples.case '161836'
save_
Specific unit descriptions must also appear in an _item_units_list,
for example:
loop_
_item_units_list.code
_item_units_list.detail
Daltons
'One measure of molecular weight'
The _item_units_conversion can be used to provide conversion factors
between unit types. It is not described here since it is not used in the
sequence dictionary.
Data Types
The data type indicates whether a data item is a real number, an integer,
a string, or some other predefined data type. Data typing is important
in validating data in STAR/CIF files when, for example, loading the data
into a database. The loader will not be able to determine whether the residue
name at a given position is correct, but it will be able to check for the
data type: if it is given as a floating point number rather than a character
string then something is wrong! Here is an example we have seen before
for declaring a data type for a date.
save__sequence.date_create
_item_description.description
;
Creation
date for the entry in the source database. We suppose that
not all
databases have this so we make it optional.
;
_item.name
'_sequence.date_create'
_item.category_id
sequence
_item.mandatory_code
no
_item_type.code
YYYY-MM-DD
_item_examples.case
'1995-11-01'
save_
Recall that the data type is also defined in the
_item_type_list.
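As an illustration of the kind of check a loader could make, here is a minimal Python sketch that tests a value against the YYYY-MM-DD type. The function name is_valid_date is hypothetical, and the choice to reject impossible calendar dates as well as malformed strings is an assumption that goes beyond a pure format check.

```python
import re
from datetime import datetime

def is_valid_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    # Check the layout first: four digits, two digits, two digits.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is None:
        return False
    # Then let the standard library reject dates such as 1995-02-30.
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

good = is_valid_date("1995-11-01")
wrong_type = is_valid_date("3.14")
impossible = is_valid_date("1995-02-30")
```

The loader still cannot tell whether 1995-11-01 is the correct creation date, only that the value has the declared type.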
Definition of Separate Datablocks and Dictionaries
A datablock delimits a set of name-value pairs. For example, a PDB entry
with a single X-ray structure is contained within a single datablock. Each
dictionary defined thus far, and including the sequence dictionary defined
here, is contained within a single datablock. The sequence dictionary
datablock is introduced by the data_seqdic.s97 line at the top of the dictionary
and terminated by an end-of-file.
While an identical data item can exist in each datablock, the concatenation
of datablock name and data item name makes them unique. What if you wish to
merge data items from different datablocks, for example in creating a
list of active sites from multiple proteins? How is the source of each
atom site identified in the single list of sites?
The macromolecular dictionary defines a category entry with an
associated _entry.id which is a parent of many entry ids associated
with other categories, for example _atom_sites.entry_id. The _entry.id
matches the datablock name and is mandatory for inclusion in the loop.
Thus the source of each atom site in a merged list can be identified.
The entry category is currently not used in the sequence dictionary.
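The merging idea can be sketched in a few lines of Python. The datablock names, the row layout, and the function name merge_atom_sites are hypothetical; the only point taken from the text is that each merged row carries the name of the datablock it came from.

```python
def merge_atom_sites(datablocks):
    """Merge per-datablock rows into one list, tagging each row with the
    name of the datablock (its entry id) that it came from."""
    merged = []
    for block_name, rows in datablocks.items():
        for row in rows:
            tagged = dict(row)              # copy so the input is not modified
            tagged["entry_id"] = block_name
            merged.append(tagged)
    return merged

# Two hypothetical datablocks whose sites share the same local id.
blocks = {
    "1ABC": [{"id": "1", "label_atom_id": "CA"}],
    "2XYZ": [{"id": "1", "label_atom_id": "CA"}],
}
merged = merge_atom_sites(blocks)
```

The two rows are identical apart from entry_id, so the pair (entry_id, id) is what keeps them distinguishable in the merged list.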
Writing the Sequence Dictionary
With this basic understanding of how definitions are represented in a dictionary
using DDL 2.x we are now ready to begin writing our sequence dictionary.
As in writing code, the best approach is not to jump right in, but to plan
ahead. In planning, the following issues must be considered:
-
The scope of the dictionary
-
The atomicity of the dictionary
-
What has been done already and can be "borrowed" from other dictionaries
-
The grouping of and relationships between data items in the dictionary
We begin by establishing the scope of the first version of our sequence
dictionary.
Defining the Scope
The final goal is a complete generic description for all items of data
associated with any protein family. The description of a single sequence
is an important part of that overall goal.
The basic unit within our scheme is each individual wild type (native)
protein. Wild type proteins can be arranged in families based on sequence
similarity and the family described. Individual proteins have specific
functions and can possibly be linked to a disease or ailment. Based on
function there are further sets of descriptors. For example, if this protein
is an enzyme, there are many data items associated with the activity of
this enzyme. Mutants and post-translationally modified proteins are considered
to be derived from the wild type. Each individual sequence and structure
must be described along with features shared by groups of structures or
sequences.
Obviously providing these definitions is a monumental task and highlights
the importance of scope. To make it more manageable, we have broken the
description of a protein family into segments. To date, we have developed
a single-structure dictionary (merely a subset of the mmCIF dictionary),
an enzyme dictionary (ePIF),
and the sequence
dictionary that is used as an example here.
Each of these dictionaries is further subdivided to describe single
versus shared features, for example, single sequences versus the features
of a multiple sequence alignment. The sequence dictionary described here
only covers features of a single sequence. In the future it will be necessary
to merge definitions found in multiple dictionaries into a single dictionary.
To simplify this task definitions which are obviously common to multiple
dictionaries, for example citations, will be maintained as part of a separate
universal dictionary and not included in the sequence dictionary described
here. However, to make the dictionary usable in a stand alone fashion a
global category of data items is included here.
Given the initial scope for the dictionary, the next step is to identify
those category groups that already exist in other dictionaries and that
could be used here. These will normally have to be supplemented with new
categories and data items as needed.
There are no sequence-specific data items that we know of that can be
used in the current sequence dictionary. (Trying to determine what has been
done by others raises the need for a single place where one may explore
existing official and unofficial dictionaries.) There are, of course,
more generic descriptions of citations etc. that can be used, and these
are not discussed here. The macromolecular dictionary does have a sequence
description under the category _entity_poly_seq, which describes
a sequence of monomers. However, since the notion of entities is not developed
as part of the sequence dictionary, we chose to define a new set of data
items for the sequence (at this time the actual sequence of amino acids
or nucleotides is described as a single data item).
One last issue related to scope is that of mandatory versus non-mandatory
categories and data items. You need to consider what must be present to
make a scientifically meaningful data file and make sure that the appropriate
definitions in the dictionary are marked mandatory. (This is apart from
data items that must be included to provide a valid loop structure).
Defining the Atomicity
Atomicity determines the level of parsing when loading a database or program and
the granularity of queries on the data in the database. For example, defining
a sequence as a single data item limits what I can validate for that sequence
when entering it into the database and what I can return from the database.
Breaking the sequence up into a series of consecutive monomers where each
monomer is described requires more effort in defining the dictionary but
provides a better level of data checking and data retrieval. In other words
it is a trade-off. In the current sequence dictionary the sequence is retrieved
as a single data item.
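To make the trade-off concrete, the sketch below shows what the finer-grained alternative would look like: a single sequence string broken into one row per monomer. The function name sequence_to_monomers and the row layout are assumptions for illustration; the sequence dictionary itself does not define such rows.

```python
def sequence_to_monomers(sequence):
    """Split a one-letter sequence string into (position, monomer) rows.
    Positions are numbered from 1; whitespace in the input is ignored."""
    residues = "".join(sequence.split())
    return [(i, mon) for i, mon in enumerate(residues, start=1)]

# First ten residues of the example sequence, with a line break in the middle.
rows = sequence_to_monomers("MGAQQ\nGKDRG")
```

With rows like these each monomer could be validated or queried individually, at the cost of a more elaborate dictionary and larger data files.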
We are now ready to write the actual dictionary. The first step is to
define the category groups and categories within each of those groups.
Table 1 provides that information and includes data items for everything
which we want to extract from a typical sequence entry in GenBank, PIR,
etc. in a FASTA format.
Choosing Category Groups and Categories
Most of your time will be spent here since there are as many different
opinions on how the data should be organized as there are people involved.
The left column shows the category groups and the right column the data items,
prefixed by category name, in each group.
| xref_group | _xref.id, _xref.primary, _xref.dbname, _xref.name, _xref.acc, _xref.date, _xref.type, _xref.source_db, _xref.source_accession, _xref.comments |
| general_group | _uid.id, _uid.status, _uid.date_create, _uid.update, _uid.keyword, _uid.description_short, _uid.description_long |
| sequence_group | _sequence.id, _sequence.type, _sequence.name, _sequence.synonym, _sequence.date_create, _sequence.update_sequence, _sequence.update_annotation, _sequence.citation, _sequence.length, _sequence.mol_weight, _sequence.sequence, _sequence.keyword |
| | _local.id, _local.synonym, _local.keyword, _local.comment, _local.description_short, _local.description_long |
| | _feature_table.id, _feature_table.feature_id, _feature_table.feature_name, _feature_table.feature_location, _feature_table.feature_type, _feature_table.feature_source |
Table 1 Category Groups and Categories in the Sequence
Dictionary
Thus, in our dictionary we define only three category groups,
as follows:
-
sequence_group - all features of a given sequence. This contains
the following categories:
-
sequence - the sequence itself,
-
local - locally defined features of the sequence, and
-
feature_table - noteworthy features of the sequence, such as, recognized
sequence motifs, secondary structure predictions, and active sites.
-
xref_group - cross references
-
general_group - categories of a general nature and not restricted
to a description of a single sequence.
While the order of components is not defined by STAR and no specific style
guide exists we follow the general style found in the core and mmCIF dictionaries.
Thus components are discussed in the order they will appear in the dictionary.
The individual save frames for each data item are listed in alphabetical
order.
Defining the Dictionary
We begin by declaring the data block and making any comments. There are
no official style guides for doing much of what follows. Our approach is
to use the mmCIF dictionary as our style guide (which in turn uses the
style of the core CIF dictionary).
################################################################################
# Macromolecular Sequence Dictionary
#
# This dictionary describes protein and nucleic acid sequence data for use
# in a federated database environment. This dictionary is compliant with
# DDL v2.1.
#
# Primary Authors:
# Stella Veretnik (veretnik@sdsc.edu)
# Michael Gribskov (gribskov@sdsc.edu)
# Phil Bourne (bourne@sdsc.edu)
################################################################################
data_seqdic.s97
Next come the data items describing the dictionary itself:
########################
## DICTIONARY_HISTORY ##
########################
loop_
_dictionary_history.version
_dictionary_history.update
_dictionary_history.revision
0.1.01 1997-08-13
; First production version
;
0.1.02 1997-09-04
; Added method_list and category methods. These
are pretty much a guess for
now.
;
The dictionary history is more for your use in developing the dictionary
than for the users of the dictionary. The dictionary will evolve and the
DDL associated with the dictionary history provides an audit trail.
Defining Category Groups
All dictionaries have an inclusive_group of which all category groups
are members.
The list of category groups is described in the dictionary as follows:
#########################
## CATEGORY_GROUP_LIST ##
#########################
loop_
_category_group_list.id
_category_group_list.parent_id
_category_group_list.description
'inclusive_group' .
;
Categories that belong to the macromolecular dictionaries.
;
'general_group'
'inclusive_group'
;
Categories common to all data types (sequences, structures,
enzymes, features) such as uid, citation, etc.
;
'sequence_group' 'inclusive_group'
;
Categories common to protein and nucleic acid sequences.
;
The parent of each of the defined category groups is the inclusive_group.
A description is given for each category group.
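A simple consistency check on this list is that every parent named in _category_group_list.parent_id is itself a declared group. The Python sketch below assumes the loop has been read into (id, parent_id) pairs; the function name undefined_parents is hypothetical.

```python
# (id, parent_id) pairs from the _category_group_list loop above.
# A parent_id of '.' is the CIF "not applicable" value: the group has no parent.
GROUPS = [
    ("inclusive_group", "."),
    ("general_group", "inclusive_group"),
    ("sequence_group", "inclusive_group"),
]

def undefined_parents(groups):
    """Return the ids of groups whose parent_id is not itself a defined group."""
    defined = {gid for gid, _ in groups}
    return [gid for gid, parent in groups
            if parent != "." and parent not in defined]

problems = undefined_parents(GROUPS)
```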
Defining Categories
The next step is to define categories to be contained within each category
group. Here is an example of defining a category called sequence. Following
the style of the other dictionaries, categories would be presented in alphabetical
order as would the data items within each category. Note that for readability
bulleted notes have been inserted in the category description.
##############
## SEQUENCE ##
##############
save__SEQUENCE
_category.description
;
Protein/DNA
sequence and other relevant information parsed from the
database
entry (cross-references, 'domains' name and location, etc).
The sequence
category contains only the information publically
available
- local enhancements are entered in the "local" category.
;
_category.id
sequence
_category.mandatory_code
yes
loop_
_category_key.name
'_sequence.id'
'_sequence.sequence'
-
The category description should be as meaningful as possible
-
The category is defined explicitly by _category.id
-
The category is mandatory since _category.mandatory_code
is yes. That is, the datablock will be invalid if it does
not contain a sequence loop (table).
-
The category has two _category_key.name values. That
is, both the sequence identifier and the sequence must be present in each
row of a loop over multiple sequence entries.
loop_
_category_group.id
'inclusive_group'
'sequence_group'
loop_
_category_methods.method_id
_category_methods.category_id
'get_sequence_category' sequence
-
The fact that this category is part of the inclusive_group and sequence_group
of categories is declared.
-
A method called 'get_sequence_category' is associated with the category.
This method will appear in the method list (defined subsequently). This
method is used to parse a FASTA formatted entry to extract data items of
the sequence category.
loop_
_category_examples.detail
_category_examples.case
#------------------------------------------------------------------
'Example - Tyrosine_protein kinase DASH/ABL'
;
loop_
_sequence.id
_sequence.type
_sequence.name
_sequence.synonym
_sequence.date_create
_sequence.update_sequence
_sequence.update_annotation
_sequence.citation
_sequence.length
_sequence.mol_weight
_sequence.sequence
PKRPSEQ012345
PROTEIN
ABL_DROME
'abl, d-abl, dm_abl'
1986-07-21
1990-01-01
1995-11-01
;
RX MEDLINE; 88174728.
RA HENKEMEYER M.J., BENNETT R.L., GERTLER F.B., HOFFMANN F.M.;
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE; 88174728. [NCBI, Geneva]
RA HENKEMEYER M.J., BENNETT R.L., GERTLER F.B., HOFFMANN F.M.;
RL MOL. CELL. BIOL. 8:843-853(1988).
RN [2]
RP SEQUENCE OF 374-648 FROM N.A.
RX MEDLINE; 84082064.
RA HOFFMANN F.M., FRESCO L.D., HOFFMAN-FALK H., SHILO B.-Z.;
RL CELL 35:393-401(1983).
;
1520
161836
;
MGAQQGKDRGAHSGGGGSGAPVSCIGLSSSPVASVSPHCISSSSGVSSAPLGGGSTLRGS
RIKSSSSGVASGSGSGGGGGGSGSGLSQRSGGHKDARCNPTVGLNIFTEHNGTKHSSFRG
HPGKYHMNLEALLQSRPLPHIPAGSTRPLFWRIAELQQHQQDSGGLGLQGSSLGGGHSST
TSVFESAHRWTSKENLLAPGPEEDDPQLFVALYDFQAGGENQLSLKKGEQVRILSYNKSG
..............
;
;
save_
-
The example is critical to anyone trying to read and understand your dictionary.
The more complete the example -- the more data items you include -- the
better. Since the example is effectively a comment, it is not parsed; any
syntax errors in the example will be missed by software and will only serve
to confuse the human reader of the dictionary. Take care with the syntax of examples.
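The role of the category key declared in this save frame can be illustrated with a small Python sketch. It assumes a loop has been read into a list of row dictionaries and treats the CIF placeholders '.' and '?' as missing values; that treatment and the function name check_category_keys are assumptions for illustration.

```python
# Key items declared by _category_key.name for the sequence category.
KEYS = ["_sequence.id", "_sequence.sequence"]

def check_category_keys(rows, key_items):
    """Return a list of problems: rows missing a key item, or duplicate keys."""
    problems = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        missing = [k for k in key_items if row.get(k) in (None, "", ".", "?")]
        if missing:
            problems.append(f"row {n}: missing key item(s) {missing}")
            continue
        key = tuple(row[k] for k in key_items)
        if key in seen:
            problems.append(f"row {n}: duplicate key {key}")
        seen.add(key)
    return problems

rows = [
    {"_sequence.id": "PKRPSEQ012345", "_sequence.sequence": "MGAQQ"},
    {"_sequence.id": "PKRPSEQ012345", "_sequence.sequence": "MGAQQ"},  # repeats row 1
    {"_sequence.id": "PKRPSEQ012346"},                                 # no sequence
]
problems = check_category_keys(rows, KEYS)
```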
Defining Data Items
Here is a data item for the sequence creation date. Note that:
-
Dates have a declared format of YYYY-MM-DD defined by _item_type.code
-
It is necessary to declare explicitly which category (_item.category_id)
the data item belongs to. This is despite the fact that the
name of the item by convention includes the name of the category.
-
The data item in this example is mandatory (_item.mandatory_code
yes). This means it must be present for the category to be valid.
This can be used as a validation mechanism: if a user does not submit this
data item the data file is not valid. Careful thought needs to be given
to what data items are mandatory and what are optional.
save__sequence.date_create
_item_description.description
;
Creation
date for the entry in the source database. We suppose that
not all
databases have this so we make it optional.
;
_item.name
'_sequence.date_create'
_item.category_id
sequence
_item.mandatory_code
yes
_item_type.code
YYYY-MM-DD
_item_examples.case
'1995-11-01'
save_
Defining Relationships
Here are two data items describing an identifier, one for the sequence
and one for the universal identifier (uid). We have already described most
features of the definition. What is new here is the parent-child relationship
between these two data items. The uid is an identifier which applies to
any object in our system that requires unique identification. It could
be a sequence, a structure, a feature of a group of sequences, and so on.
Thus, the sequence identifier is only one of a number of children that
relate to this parent.
save__uid.id
_item_description.description
;
This is
the accession number, each object in the database must have
one.
The _uid.id provides the mechanism for associating information in
various
categories with a single object.
;
loop_
_item.name
_item.category_id
_item.mandatory_code
'_uid.id'
uid
yes
'_sequence.id'
sequence
yes
'_feature_table.id'
feature_table yes
'_local.id'
local
yes
'_xref.id'
xref
yes
_item_type.code
code
loop_
_item_linked.child_name
_item_linked.parent_name
'_sequence.id'
'_uid.id'
'_feature_table.id'
'_uid.id'
'_local.id'
'_uid.id'
'_xref.id'
'_uid.id'
save_
-
All the parent-child relationships are defined in the parent data item
and only inferred from the child data item.
-
By convention the categories associated with each of the children and whether
they are mandatory are also declared here.
save__sequence.id
_item_description.description
;
UID (accession
number) of this sequence. This is the same as uid.id
;
_item.name
'_sequence.id'
_item.category_id
sequence
_item.mandatory_code yes
_item_type.code
PKR_ID
save_
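The parent-child links declared in save__uid.id lend themselves to a referential check: every child identifier in a data file should also appear as a _uid.id. The Python sketch below assumes the data file has been read into a mapping from item name to a list of values; the function name orphaned_values and the second identifier are hypothetical.

```python
# (child, parent) pairs from the _item_linked loop in save__uid.id.
LINKS = [
    ("_sequence.id", "_uid.id"),
    ("_feature_table.id", "_uid.id"),
    ("_local.id", "_uid.id"),
    ("_xref.id", "_uid.id"),
]

def orphaned_values(data, links=LINKS):
    """data maps item name -> list of values found in a data file.
    Return {child item: [values with no matching parent value]}."""
    orphans = {}
    for child, parent in links:
        parent_values = set(data.get(parent, []))
        missing = [v for v in data.get(child, []) if v not in parent_values]
        if missing:
            orphans[child] = missing
    return orphans

data = {
    "_uid.id": ["PKRPSEQ012345"],
    "_sequence.id": ["PKRPSEQ012345"],
    "_xref.id": ["PKRPSEQ099999"],   # hypothetical value with no parent
}
orphans = orphaned_values(data)
```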
Defining New Item Types
If you define a new data type, as we have done in the above example (PKR_ID),
it must be declared in an _item_type_list as follows:
#####################
## ITEM_TYPE_LIST ##
#####################
loop_
_item_type_list.code
_item_type_list.primitive_code
_item_type_list.construct
_item_type_list.detail
PKR_ID char
' [A-Za-z0-9,.;:"&<>/\{}`~!@#$%]* '
; The data type PKR_ID can be made up
of any alphanumeric character.
;
By style convention the _item_type_list is placed at the end of the dictionary.
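The construct in an _item_type_list entry is a regular expression, so a validator can apply it directly. The Python sketch below assumes the type list has been read into a mapping and that the padding spaces inside the quoted construct are layout only and can be dropped; both are assumptions, as is the function name matches_type.

```python
import re

# Hypothetical in-memory copy of the _item_type_list loop:
# type code -> (primitive code, regular expression construct).
# The construct is copied from the PKR_ID declaration without its padding spaces.
ITEM_TYPES = {
    "PKR_ID": ("char", r'[A-Za-z0-9,.;:"&<>/\{}`~!@#$%]*'),
}

def matches_type(type_code, value, item_types=ITEM_TYPES):
    """Return True if the whole value matches the construct of its declared type."""
    _primitive, construct = item_types[type_code]
    return re.fullmatch(construct, value) is not None

accepted = matches_type("PKR_ID", "PKRPSEQ012345")
rejected = matches_type("PKR_ID", "PKR SEQ")   # a space is not in the character class
```

Read as a Python regular expression, this construct also admits the listed punctuation characters and the empty string, which is broader than the "alphanumeric" wording of the detail text; a stricter construct may be worth considering.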
The Complete Dictionary
Following the instructions above, it should be possible to develop
a complete sequence dictionary. The complete dictionary can be found at http://www.sdsc.edu/pb/cif/write_dic/sequence970904.txt
Parsing Your Dictionary
There are various parsers available for checking the syntax of your
dictionary:
-
CIFPARSE
(C++) will check the dictionary against itself or the dictionary against
any dictionary compliant file.
-
Cyclops
(Fortran) will check the dictionary against itself or the dictionary against
any dictionary compliant file.
-
Parser (Objective
C - Sun executables available) will parse a STAR compliant file.
Acknowledgements
Most of the sequence dictionary was developed by Stella Veretnik and Michael
Gribskov. This work was funded by NSF
grant 9310154 and a grant from the DOE. Thanks to Mike Gilson for a
critical review of the material and to Helen Berman, Paula Fitzgerald,
and John Westbrook for organizing the workshop that provided the impetus
for the work. The original flow charts (Figs 1-3) are used by permission
and can be found in a description of the DDL (http://ndb.rutgers.edu/NDB/mmcif/ddl/ddl/ddl.html)
and in the mmCIF tutorials http://ndb.rutgers.edu/workshop/mmCIF-tutorials/mechanics/dict-struct.htm