Understanding the extent to which macromolecular structure can currently be queried by MMQL is best illustrated by example. However, an explanation of MMQL terminology and syntax is prerequisite. (Figure. Simple query construct.)
Some of the features seen in the query and defined below map directly to classes within MMQLlib (Figure. MMQLlib classes.)
Query - Grouping of valid query Statements and Comments (see below). A query is executed on input data read from a file or specified using PDBquery.
Expression - Specification of the action to perform, comprising a search method and parameters. Expression can be one of the following types: Pattern Expression, Filter Expression, or Operator Expression (Figure. Simple query construct.) The search method is the code executed when the Expression is evaluated.
Variable - Container of the macromolecular data as input to the search and for representing the results of a search, intermediate or final. Each Variable consists of objects according to a four level Set schema comprising Compound, Entity, Subentity and Atom classes (Figure. Set object design.) Variable names start with a "$" character.
Statement - Combination of Variables and Expressions. Each Statement has a single output Variable, and one or more input Variables. How input Variables are interpreted depends on the Expression type and the Expression parameters. Defaults are used if Variables or Specificators (see Pattern below) are not given. Statements are terminated with a semicolon (;).
Selection - Each Selection corresponds to one search hit defined by an Expression. A Variable may have zero, one, or many Selections, depending on how many data sets match the query. When a subsequent query Expression is executed on a Variable, it iterates over all Selections belonging to that Variable; that is, Expressions are applied sequentially to the individual Selections of a Variable. For Expressions involving several input Variables, multilevel iteration is performed and all possible combinations of Selections from the respective Variables are considered.
Command - A set of specialized actions on Variables. At present only READ and WRITE commands are implemented. READ loads data from a file or a database into a Variable and WRITE transfers query results from a Variable to an output file. Commands start with an exclamation (!).
Comment - Lines in the query script which are not interpreted during query execution. Comments start from "%" and end at the end of the line.
Pattern - A powerful subclass of Expression which comprises a pattern name (actually a search method), Specificators and Elements (parameters that modify the search method). Specificators provide search conditions while Elements represent a Pattern as an array of homogeneous objects specific to this Pattern. Pattern names start with a "#" character.
Filter - Subclass of Expression which applies a condition to the attributes of the objects comprising a Variable.
Operator - Subclass of Expression that provides simple operators for use with Variables.
In summary, each query comprises Variables and actions on Variables in the form of Expressions and/or Commands. Starting Variables may be assigned in query scripts using the READ Command. The WRITE Command outputs the results of a query to a text file for perusal or for use by PDBquery.
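The multilevel iteration over Selections described under Selection can be pictured as a Cartesian product. The following Python sketch is only an illustration of that idea; the function name and the representation of a Variable as a plain list of Selections are assumptions, not part of MMQLlib.

```python
from itertools import product

def iterate_selections(*variables):
    """Return every combination of one Selection per input Variable.

    Each Variable is modelled (hypothetically) as a list of Selections.
    """
    return list(product(*variables))

# Two hypothetical Variables holding helix Selections, labelled by name only.
helix1 = ["h1", "h2"]
helix2 = ["h1", "h2", "h3"]
combos = iterate_selections(helix1, helix2)
print(len(combos))  # 2 * 3 = 6 combinations are considered
```

With a single input Variable the product reduces to a simple loop over its Selections, which matches the sequential application described for one-Variable Expressions.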
A final note on terminology, not relating to MMQL but taken from the Crystallographic Information File (CIF) data representation (Fitzgerald et al., 1993), concerns the terms entity and subentity. PDBlib and MMQLlib do not distinguish DNA and proteins at the level of basic representation, hence DNA strands, and polypeptide chains are termed entities, whereas amino acids and nucleotides are termed subentities.
The design of MMQLlib is best understood after considering how MMQL can be used to ask simple questions of the collective features of protein structure. More complex questions are considered in the Discussion.
The following pattern expressions are currently implemented: sequence pattern, property pattern, contact pattern, hydrogen-bonding pattern, secondary structure pattern, dihedral angle pattern, and conformational pattern (Table. Summary of MMQL Patterns.) Formulating a query for each pattern type is outlined as a prerequisite to exploring more complex queries.
The sequence pattern queries combinations of subentity types as defined by Chang et al. (1994), that is, canonical and non-canonical amino acids and nucleotides as well as single heterogen groups. It is implemented as a simple string matching algorithm. Without a more heuristic approach, a problem arises with non-canonical and heterogen groups when searching PDB files, because these groups are not named consistently. A workaround is to use alternative names and wild card symbols for any name. Consistency in data representation will improve with the adoption by the PDB of the macromolecular CIF dictionary (Fitzgerald et al., 1993).
Below are some examples of sequence pattern expressions:
(a) Search for the sequence of subentities ALA-SER-ALA in the data set represented by the Variable $V1 and assign the results to Variable $V2.
$V2 = $V1 #SequencePattern (ALA) (SER) (ALA) ;
(b) Search for ALA-(SER OR THR)-ALA in the Variable $V1 and assign results to the Variable $V2.
$V2 = $V1 #SequencePattern (ALA) (SER,THR) (ALA) ;
(c) Search for ALA-(anything)-(anything)-ALA in the Variable $V1 and assign the results to Variable $V2.
$V2 = $V1 #SequencePattern (ALA) (*) (*) (ALA) ;
Currently variable gaps in a sequence pattern cannot be queried.
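As a rough illustration of the string matching behind the sequence pattern, the Python sketch below matches a fixed-length pattern with alternatives and wild cards against a chain of subentity names. It is not the MMQLlib implementation; the function name and the use of sets for alternatives are assumptions made for the example.

```python
def match_sequence(residues, pattern):
    """Return start indices where the pattern matches consecutive residues.

    pattern is a list of sets of allowed subentity names; a set containing
    "*" matches any name. Variable gaps are not supported, as in the text.
    """
    hits = []
    for start in range(len(residues) - len(pattern) + 1):
        window = residues[start:start + len(pattern)]
        if all("*" in allowed or name in allowed
               for name, allowed in zip(window, pattern)):
            hits.append(start)
    return hits

chain = ["GLY", "ALA", "SER", "ALA", "THR", "ALA", "LEU"]
# Analogue of (ALA) (SER,THR) (ALA): matches at positions 1 and 3.
print(match_sequence(chain, [{"ALA"}, {"SER", "THR"}, {"ALA"}]))
```

Each returned index would correspond to one Selection in the output Variable.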
A secondary structure pattern defines a search for a sequentially ordered set of subentities belonging to a predefined type of conformation as assigned by algorithmic methods or during the structure determination. In the current implementation only the types Alpha and Beta are recognized as taken from the assignments found in the PDB file according to the HELIX and SHEET record types. However, a types Specificator has been implemented such that alternative algorithmic assignments, if coded, could be accessed from MMQL. Any region of the entity containing a non-recognizable secondary structure type is treated as spacing between recognized secondary structure components. The Specificator element allows the selection of a subset of pattern elements even though the search is over the whole pattern. The subset is assigned to a new Variable and can be used subsequently.
Below are two examples of secondary structure pattern expressions:
(a) Search for alpha helices and collect the results in the Variable $A.
$A = $INP #SecondaryStructurePattern (Alpha) ;
(b) Search for beta, alpha, beta patterns and collect the results in the Variable $BAB.
$BAB = $INP #SecondaryStructurePattern (Beta) (Alpha) (Beta) ;
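The treatment of unrecognized regions as spacing can be sketched in Python as follows. The per-subentity assignment list, the use of None for unrecognized types, and the function name are all assumptions for illustration; MMQLlib takes its assignments from the HELIX and SHEET records.

```python
from itertools import groupby

def find_ss_pattern(assignment, pattern):
    """Find an ordered pattern of secondary structure segments.

    assignment: one type per subentity ("Alpha", "Beta", or None).
    Returns a list of hits, each a tuple of (type, start, end) segments
    with inclusive 0-based positions. Runs of None act only as spacing.
    """
    segments, pos = [], 0
    for ss_type, run in groupby(assignment):
        length = len(list(run))
        if ss_type is not None:
            segments.append((ss_type, pos, pos + length - 1))
        pos += length
    hits = []
    for i in range(len(segments) - len(pattern) + 1):
        window = segments[i:i + len(pattern)]
        if [seg[0] for seg in window] == list(pattern):
            hits.append(tuple(window))
    return hits

assignment = (["Beta"] * 4 + [None] * 2 + ["Alpha"] * 6 + [None] * 3
              + ["Beta"] * 5 + [None] + ["Alpha"] * 4)
bab = find_ss_pattern(assignment, ["Beta", "Alpha", "Beta"])
print(bab)
```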
The property pattern permits selection based on the physical properties of subentities. Currently, the properties for which methods are available are volume, polarity, isoelectric point, hydrophobicity, mean exposure, exposure in entity, and exposure in compound. Except for exposure in entity and exposure in compound, values are expressed on a percentage scale as defined by Bogardt et al. (1980).
Values for exposure in entity and exposure in compound, which depend on the subentity environment, are calculated according to Lee and Richards (1971), with percentages representing the actual percentage of exposed surface. The other properties are defined on a relative scale, where the subentity type with the smallest property value is assigned a value of 0 and the largest a value of 100. Specificators for the exposure in entity and exposure in compound property patterns are fragment length (min_length, max_length) and the window size used for averaging (averaging). The default Specificator implies a search for a pattern of 5 to 15 subentities, with the value for each subentity averaged over a window of 5 subentities. That is, the value for each subentity is the average over the subentity in question, the two toward the beginning, and the two toward the end of the polypeptide chain or DNA strand being evaluated.
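The relative 0 to 100 scale and the window averaging described above can be sketched in Python as below. The raw property values are made up, the function names are hypothetical, and truncating the window at chain ends is an assumption, since the text does not say how termini are handled.

```python
def relative_scale(raw):
    """Map raw property values per subentity type onto a 0..100 scale."""
    lo, hi = min(raw.values()), max(raw.values())
    return {name: 100.0 * (value - lo) / (hi - lo)
            for name, value in raw.items()}

def window_average(values, window=5):
    """Average each position over a centred window of the given size.

    Windows are truncated at the ends of the chain (an assumption).
    """
    half = window // 2
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Hypothetical raw values for three subentity types.
scaled = relative_scale({"A": 1.0, "B": 3.0, "C": 2.0})
profile = window_average([0.0, 10.0, 20.0, 30.0, 40.0], window=5)
print(scaled, profile)
```

Setting window=1 leaves the values unchanged, which is the situation in an expression that specifies averaging=1.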
Below are some examples of property pattern expressions:
(a) Search for the default property pattern (hydrophobicity profile) in a data set represented by the Variable $V1 and assign the result to Variable $V2. According to the default Specificator and Element (Table. Summary of MMQL Patterns.), a list of compounds is returned in which a hydrophobic string of residues is detected such that five consecutive residues have greater than 30% hydrophobicity.
$V2 = $V1 #PropertyPattern ;
(b) Search for exposed loops in the Variable $V1 and assign results to the Variable $V2. That is, a string of 5 residues where each of the residues is more than 10% exposed to solvent.
$V2 = $V1 #PropertyPattern property=ExposureInCompound (10.0,100.0) ;
Note that a string of 6 residues where each is more than 10% exposed to solvent would result in two hits. Thus, overall numbers of hits can be misleading and require careful analysis.
(c) Search for positive-negative charge change profiles where a pattern of 5 positive subentities is followed by 5 negative subentities.
$V2 = $V1 #PropertyPattern property=IsoelectricPoint
min_length=10 max_length=10 averaging=1
(50.0,100.0) (50.0,100.0) (50.0,100.0) (50.0,100.0) (50.0,100.0)
(0.0,50.0) (0.0,50.0) (0.0,50.0) (0.0,50.0) (0.0,50.0);
Note that the Elements (in this case percentages) which define the range
of values for each subentity are applied sequentially.
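The sequential application of range Elements, and the earlier remark that a run of 6 qualifying residues yields two hits for a 5-residue pattern, can be illustrated with a short Python sketch. The function name, the inclusive bounds, and the exposure values are assumptions for the example.

```python
def match_property_pattern(values, elements):
    """Return start indices where consecutive values fall inside the
    (low, high) ranges given by elements, applied in order."""
    n = len(elements)
    return [
        start
        for start in range(len(values) - n + 1)
        if all(low <= value <= high
               for value, (low, high) in zip(values[start:start + n], elements))
    ]

# Hypothetical per-residue exposure percentages; positions 1..6 exceed 10%.
exposure = [5.0, 12.0, 30.0, 45.0, 20.0, 15.0, 60.0, 8.0]
loop = [(10.0, 100.0)] * 5
print(match_property_pattern(exposure, loop))  # two overlapping hits
```

Overlapping windows are reported separately here, which is why hit counts need careful interpretation.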
The contact pattern examines the spatial relationship between collections of subentities or atoms. The input Variables specify at least two groups of objects on which to evaluate potential contacts. The groups of objects would typically be the result of a previous query. There is no limit, beyond hardware capability, on the number of input Variables that may be chosen. Thus, to find all subentity-subentity contacts, the input Variable is established by first defining the macromolecule as an ordered set of subentities. This can be done by querying a one-element sequence pattern with a wild card subentity name (example d below).
The threshold Specificator defines the minimum fraction of subentities that must be involved to detect a contact (20% by default) and the atoms Specificator specifies which atoms to use (all by default) when evaluating contacts. The criterion for subentity contact is one pair of atoms at a distance below 5 Å, or below a distance set by the distance Specificator.
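A minimal Python sketch of this contact criterion follows. How the fraction of involved subentities is counted is not spelled out in the text; the sketch assumes it is taken over the subentities of both groups combined. Function names and the toy coordinates are likewise hypothetical.

```python
import math

def residues_touch(res_a, res_b, cutoff=5.0):
    """True if any pair of atoms (x, y, z) is closer than cutoff."""
    return any(math.dist(p, q) < cutoff for p in res_a for q in res_b)

def groups_in_contact(group_a, group_b, cutoff=5.0, threshold=0.20):
    """True if the fraction of subentities touching the other group
    reaches the threshold (fraction over both groups: an assumption)."""
    touched_a = sum(any(residues_touch(r, s, cutoff) for s in group_b)
                    for r in group_a)
    touched_b = sum(any(residues_touch(s, r, cutoff) for r in group_a)
                    for s in group_b)
    fraction = (touched_a + touched_b) / (len(group_a) + len(group_b))
    return fraction >= threshold

# Toy groups: five one-atom subentities each, laid out along the x axis.
helix_a = [[(4.0 * i, 0.0, 0.0)] for i in range(5)]
helix_b = [[(4.0 * i, 4.0, 0.0)] for i in range(5)]
helix_far = [[(4.0 * i, 20.0, 0.0)] for i in range(5)]
print(groups_in_contact(helix_a, helix_b), groups_in_contact(helix_a, helix_far))
```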
One method for specifying a contact pattern between multiple Variables is to specify a contact map as a triangular matrix where each element of the matrix defines a contact or non-contact between supplied input Variables in the order they are listed in the query statement (see example b below).
Below are some examples of contact pattern expressions:
(a) Search for two alpha helices with a default degree of contact, that is, helices for which at least one atom in one subentity of one helix is less than 5 Å from an atom in a subentity of the other helix, in at least 20% of the subentities.
Helices as input Variables are collected in $HELIX1 and $HELIX2 and final query results are collected in the Variable $BUNDLE.
$HELIX1 = $INP #SecondaryStructurePattern (Alpha) ;
$HELIX2 = $HELIX1 ;
$BUNDLE = $HELIX1 $HELIX2 #ContactPattern ;
(b) Search for four alpha helix bundles according to the definition that each helix in the bundle is in contact with two neighbors but does not contact the third. Helices are defined by Variables $HELIX1, $HELIX2, $HELIX3, and $HELIX4 and results are defined in Variable $BUNDLE.
In the simple contact matrix, where 1 represents a default contact as described previously and 0 no contact, the columns represent $HELIX1, $HELIX2, and $HELIX3, respectively and the rows represent $HELIX2, $HELIX3 and $HELIX4, respectively.
$HELIX1 = $INP #SecondaryStructurePattern (Alpha) ;
$HELIX2 = $HELIX1 ;
$HELIX3 = $HELIX1 ;
$HELIX4 = $HELIX1 ;
$BUNDLE = $HELIX1 $HELIX2 $HELIX3 $HELIX4 #ContactPattern
(1,
0, 1,
1, 0, 1) ;
(c) Search for possible salt bridges associated with contacts between (ARG, LYS, HIS) and (GLU, ASP). Input is first collected in Variables $POSITIVE and $NEGATIVE and final query results are collected in the Variable $BRIDGE.
$POSITIVE = $INP #SequencePattern (ARG, LYS, HIS) ;
$NEGATIVE = #SequencePattern (GLU, ASP) ;
$BRIDGE = $POSITIVE $NEGATIVE #ContactPattern ;
(d) Search for all default subentity-subentity contacts. Subentities are first collected in Variables $RESIDUE1 and $RESIDUE2 and contacts in the output Variable $CONTACT.
$RESIDUE1 = $INP #SequencePattern (*) ;
$RESIDUE2 = $RESIDUE1 ;
$CONTACT = $RESIDUE1 $RESIDUE2 #ContactPattern ;
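To make the triangular contact map of example (b) easier to read, the Python sketch below expands it into explicit pairs of Variables, following the row and column convention stated for that example. The function name and the flat list representation of the matrix are assumptions.

```python
def expand_contact_map(names, triangle):
    """Expand a lower-triangular contact map into Variable pairs.

    Row r (0-based) belongs to names[r + 1] and column c to names[c];
    triangle lists the entries row by row (1 = contact, 0 = no contact).
    """
    pairs = {}
    k = 0
    for row in range(1, len(names)):
        for col in range(row):
            pairs[(names[col], names[row])] = bool(triangle[k])
            k += 1
    return pairs

names = ["$HELIX1", "$HELIX2", "$HELIX3", "$HELIX4"]
contact_map = expand_contact_map(names, [1, 0, 1, 1, 0, 1])
for pair, required in contact_map.items():
    print(pair, "contact" if required else "no contact")
```

Read this way, every helix is required to touch exactly two of the other three, which is the four-helix bundle definition used in example (b).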
A hydrogen bonding pattern is a specialized search for pairs of residues in the protein main chain which have specified atoms involved in an interaction. The Specificator method is used to define the method used in determining the interaction. Currently interactions are defined as having an energy below a defined threshold as specified by Lifson et al. (1979).
Specificators also define the relative residue numbers in the pattern sought (period) and the energy threshold (threshold). The hydrogen bond is defined as a Coulombic interaction between four charges, 0.42e, -0.42e, -0.20e, and 0.20e, placed on the C, O, N, and H atoms, respectively, and a bond is assigned when the total energy is below the threshold (default -0.1 kcal/mol).
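The four-charge Coulombic energy can be sketched in Python as a sum over the four atom pairs between the C=O and N-H groups. The conversion factor of about 332 kcal Å/(mol e²) is the conventional one for charges in electron units and distances in Å; its use here, the function names, and the idealized geometry are assumptions, and the sketch follows the criterion that an interaction counts when its energy falls below the threshold.

```python
import math

COULOMB = 332.0  # kcal*Angstrom/(mol*e^2), conventional conversion factor
CHARGES = {"C": 0.42, "O": -0.42, "N": -0.20, "H": 0.20}

def hbond_energy(c, o, n, h):
    """Coulomb energy (kcal/mol) between a C=O group and an N-H group."""
    carbonyl = {"C": c, "O": o}
    amide = {"N": n, "H": h}
    return COULOMB * sum(
        CHARGES[a] * CHARGES[b] / math.dist(pa, pb)
        for a, pa in carbonyl.items()
        for b, pb in amide.items()
    )

def is_hbond(c, o, n, h, threshold=-0.1):
    """True if the interaction energy is below the threshold."""
    return hbond_energy(c, o, n, h) < threshold

# Idealized linear C=O...H-N arrangement with an O...H distance of 2.0 Angstrom.
energy = hbond_energy((0.0, 0.0, 0.0), (1.23, 0.0, 0.0),
                      (4.23, 0.0, 0.0), (3.23, 0.0, 0.0))
print(round(energy, 2))
```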
One or two input Variables are permitted. In the case of a single Variable both sets of relative residue numbers refer to the same fragment defining a search for local hydrogen bonds. If two input Variables are provided, the first and second sets of relative residue numbers will be applied to the first and second Variables, respectively, specifying two independent data sets arbitrarily located. The search does not stop when a single pattern is found, but rather when all appropriate combinations of input fragments have been exhausted.
Below are some examples of hydrogen bonding pattern expressions:
(a) Search for antiparallel beta-like bridges. The input fragment is read from a file called input and contained in the Variable $FRAGMENT1; $FRAGMENT1 is then copied to $FRAGMENT2 and final query results are collected in the Variable $BRIDGE.
$FRAGMENT1 !READ[input] ;
$FRAGMENT2 = $FRAGMENT1 ;
$BRIDGE = $FRAGMENT1 $FRAGMENT2 #HbondingPattern
(1O,3N) (2N,2O)
(2O,2N) (3N,2O) ;
(b) Search for 4-periodic alpha-helices.
The default input Variable, namely all available subentity information,
is used with a period of 1 based on a mainchain hydrogen bond between 1O
and 5N. The period of 1 defines a hydrogen bonding scheme of the form,
(1O .. 5N; 2O .. 6N; 3O .. 7N) and so on.
Results are collected in the variable $HELIX.
$HELIX = #HbondingPattern period=1 (1O,5N) ;
At present only main chain hydrogen bonds can be calculated in proteins, but by tabulating charges for other possible hydrogen bond forming groups this scheme could be extended to protein side chains and polynucleotides. It is possible, however, to specify an anti-parallel hydrogen bonding scheme by specifying a negative period.
A conformational pattern defines a search for a conformation described in terms of coordinates in 3D space. The Specificators types, method, and deviation define the details of the search. The types Specificator can potentially be one of Calpha, MainChain, SideChain, or AllAtoms, and defines the atom group for which a match is to be found; Calpha is the only type implemented at this time. The deviation Specificator defines the permitted level of discrepancy between the actual conformation and the pattern specified.
The difference distance matrix approach (DDM) (Nishikawa et al., 1972, and Kundrot and Richards, 1987) is the only method thus far implemented for structure comparison, although again the Specificator method is reserved for alternative approaches, such as direct superposition, should they be added.
Below is an example of a conformational pattern expression:
(a) Search for the given CA-atom conformation. Results are collected in the variable $CONF.
$CONF = $INP #ConformationalPattern (28.742, 22.179, 27.728) (27.784, 22.386, 31.365) (29.905, 22.152, 34.510) (29.917, 18.296, 34.153) (31.385, 18.107, 30.649) ;
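The idea behind a difference distance matrix comparison can be sketched in Python: build the internal distance matrix of the pattern and of a candidate fragment, subtract them, and summarize the difference. Using the largest absolute entry as the deviation is an assumption, since the text does not define the measure; the function names are hypothetical. The pattern coordinates are those of the example above.

```python
import math

def distance_matrix(coords):
    """All pairwise distances between a list of (x, y, z) points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def ddm_deviation(pattern, fragment):
    """Largest absolute entry of the difference distance matrix."""
    dp, df = distance_matrix(pattern), distance_matrix(fragment)
    return max(abs(x - y)
               for row_p, row_f in zip(dp, df)
               for x, y in zip(row_p, row_f))

pattern = [(28.742, 22.179, 27.728), (27.784, 22.386, 31.365),
           (29.905, 22.152, 34.510), (29.917, 18.296, 34.153),
           (31.385, 18.107, 30.649)]
# A translated copy has identical internal distances.
shifted = [(x + 10.0, y - 3.0, z + 1.5) for x, y, z in pattern]
# Moving the last atom by 3 Angstrom changes the internal geometry.
distorted = pattern[:-1] + [(34.385, 18.107, 30.649)]
print(ddm_deviation(pattern, shifted), ddm_deviation(pattern, distorted))
```

Because only internal distances are compared, no superposition of the two structures is needed.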
A dihedral angle pattern defines a search for conformations with dihedral angles in a specified range. Dihedral angle pattern Elements provide a range of allowable angles by specifying minimum and maximum values, whereas the angles Specificator allows the user to select which dihedral angles to query. The Specificators min_length and max_length indicate the length over which the entity search should be conducted.
Below are examples of dihedral angle pattern expressions. The angles follow the definitions of the IUPAC-IUB Commission on Biochemical Nomenclature (IUPAC-IUB, 1970).
(a) Search for conformations with dihedral angles matching right-handed alpha-helix over 7 or more subentities and collect the results in the Variable $ALPHA.
$ALPHA = $INP #DihedralAnglesPattern min_length=7 max_length=100
(-87.0,-27.0,-77.0,-17.0) ;
The order of angles is defined by the angles Specificator; in this default case the order is (phi, psi, phi, psi).
(b) Search for conformations with dihedral angles matching a beta-strand from an antiparallel beta sheet. Results are collected in the variable $STRAND.
$STRAND = $INP #DihedralAnglesPattern min_length=5 max_length=30
(-169.0,-109.9,105.0,165.0) ;
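A dihedral angle search of this kind can be sketched in Python as a scan for runs of subentities whose angles stay inside the allowed ranges. This sketch reads the four numbers of an Element as (phi_min, phi_max, psi_min, psi_max), which is an assumption inferred from typical helix and strand values; the actual ordering is governed by the angles Specificator. Reporting only maximal runs, and the function name, are also assumptions.

```python
def dihedral_runs(angles, element, min_length, max_length):
    """Return (start, end) index pairs of maximal runs in range.

    angles: list of (phi, psi) per subentity, in degrees.
    element: (phi_min, phi_max, psi_min, psi_max), an assumed ordering.
    """
    phi_lo, phi_hi, psi_lo, psi_hi = element
    runs, start = [], None
    # A trailing sentinel closes a run that reaches the end of the chain.
    for i, (phi, psi) in enumerate(list(angles) + [(None, None)]):
        ok = (phi is not None
              and phi_lo <= phi <= phi_hi
              and psi_lo <= psi <= psi_hi)
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if min_length <= i - start <= max_length:
                runs.append((start, i - 1))
            start = None
    return runs

# Hypothetical chain: 2 strand-like, 8 helix-like, 3 strand-like residues.
chain = [(-120.0, 130.0)] * 2 + [(-60.0, -45.0)] * 8 + [(-120.0, 130.0)] * 3
alpha = (-87.0, -27.0, -77.0, -17.0)
print(dihedral_runs(chain, alpha, min_length=7, max_length=100))
```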
Filter expressions support queries on the actual object attributes as they are described in PDBlib (Chang et al., 1994). Attributes of four object types can be used: Compound, Entity, Subentity, and Atom (Table. Summary of MMQL Filters.). In each filter expression both the object type and the attribute (separated by a period) must be given. Specific operators (see Operator Expressions), namely =, <, and >, and values are available for use with a particular attribute. If no operator is given, the query selects all macromolecular objects possessing the feature described by the attribute (see c below).
Below are examples of filter expressions:
(a) Search for all hemoglobins as defined in the PDB COMPND record type.
$HEMOGLOBIN = $INP Compound.compnd = "HEMOGLOBIN" ;
(b) Search for compounds with a resolution better than 1.5 Å.
$HIRES = $INP Compound.resolution < 1.5 ;
(c) Search for all beta-sheets.
$SHEET = $INP Compound.sheet ;
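The behaviour of filter expressions can be mimicked in a few lines of Python. The dictionaries standing in for Compound objects, the attribute values, and the exact-match reading of "=" are assumptions for illustration; PDBlib objects are of course richer than this.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def apply_filter(objects, attribute, op=None, value=None):
    """Keep objects whose attribute satisfies the condition.

    With no operator, keep objects that possess the feature at all
    (here: a present, non-empty, non-zero attribute).
    """
    if op is None:
        return [o for o in objects if o.get(attribute)]
    return [o for o in objects
            if attribute in o and OPS[op](o[attribute], value)]

# Hypothetical compounds with made-up attribute values.
compounds = [
    {"id": "c1", "compnd": "HEMOGLOBIN", "resolution": 1.4, "sheet": 0},
    {"id": "c2", "compnd": "LYSOZYME", "resolution": 2.0, "sheet": 3},
    {"id": "c3", "compnd": "HEMOGLOBIN", "resolution": 2.5, "sheet": 0},
]
hires = apply_filter(compounds, "resolution", "<", 1.5)
print([c["id"] for c in hires])
```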
This Expression type permits operations on Variables. Operator Expressions, while not querying any biological information, provide a methodology for refining queries. These Expressions are important since the query design has each Variable treated separately at all times, except when the same Variable is used twice or more in a query expression. Available Operator Expression types are presented below.
Copy Variable $V1 to Variable $V2.
$V2 = $V1 ;
The following is a binary expression where $INP1 and $INP2 are input Variables and $OUT is an output Variable and operator is one of &, |, +, > or *.
$OUT = $INP1 operator $INP2 ;
Possible values for operator are described below.
& is a non-symmetric AND. A single loop is performed over $INP1 and data written to $OUT if that data also exists in Variable $INP2.
| is a conditional copy. That is, $OUT gets the contents of Variable $INP1 if any data in $INP1 is found in $INP2.
+ is a logical OR. Variable $OUT gets the combined macromolecular data from Variables $INP1 and $INP2.
> is a logical OR with condition on order in sequence. Variable $OUT gets macromolecular data from $INP1 and $INP2 if data from $INP1 precedes data from $INP2.
* is a Selection adding operator. This operator takes Selections from $INP1 and $INP2 and puts them into $OUT. It does not touch existing Selections but only joins two Selections into one bigger set.
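Three of these binary operators can be approximated in Python if a Variable is modelled as a simple list of data items. This flat model ignores Selections and the four-level Set schema, so it is only a rough sketch under that assumption; the function names are hypothetical, and the > and * operators are left out because they depend on sequence order and Selection structure.

```python
def op_and(inp1, inp2):
    """& : loop once over inp1, keep items that also exist in inp2."""
    return [x for x in inp1 if x in inp2]

def op_conditional_copy(inp1, inp2):
    """| : copy all of inp1 if any of its items is found in inp2."""
    return list(inp1) if any(x in inp2 for x in inp1) else []

def op_or(inp1, inp2):
    """+ : combined data from both inputs, duplicates dropped."""
    return list(dict.fromkeys(list(inp1) + list(inp2)))

a = [1, 2, 3]
b = [3, 4]
print(op_and(a, b), op_conditional_copy(a, b), op_or(a, b))
```

The asymmetry of & shows in the ordering of the result, which always follows the first input.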
The select operator (>>) selects a subset of the subentity data contained in a Variable and assigns it to another Variable. A Variable Expression with a select operator has the following form:
$OUT = $INP >> range ;
where range is a positive or negative integer or two integer values separated by a comma. Positive values mean subentity numbers from the beginning of a selection, negative from the end.
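One possible reading of the range argument is sketched below in Python. It assumes 1-based positions, that a single integer selects one subentity, and that two integers give an inclusive range; none of these details is stated in the text, and the function name is hypothetical.

```python
def select(selection, first, last=None):
    """Pick subentities by position within a Selection.

    Positive numbers count from the beginning (1-based), negative
    numbers from the end; two numbers give an inclusive range.
    """
    def to_index(n):
        return n - 1 if n > 0 else len(selection) + n

    lo = to_index(first)
    hi = lo if last is None else to_index(last)
    return selection[lo:hi + 1]

hit = ["ALA", "SER", "THR", "GLY", "LEU"]
print(select(hit, 1), select(hit, -1), select(hit, 2, 4))
```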
Search Type Specifiers are used in pattern and filter expressions to define how search results are interpreted when assigning an output Variable. Search type specifiers appear between the input variable and the pattern type or filter attribute.
$OUT = $INP search_type_specifier {pattern_type or filter_attribute} ;
& Variable $OUT gets the result of the query. This is the default when no search type specifier is given.
| Variable $OUT gets a copy of Variable $INP if the query condition is matched.
Two Commands, READ and WRITE, are implemented to transfer data between external storage and query Variables. The format of the Commands is as follows:
$Variable(s) !READ[input_file/database] ;
$Variable(s) !WRITE[output_file] ;