Volume 1 Chapter 6 2D Similarity Searching Overview


6.4.1 Basic QUEST

Ex.1 Suppose we wish to conduct a similarity search using p-aminophenol as the query structure.

The search fragment and the associated search question formulation are shown below.

Explanatory notes are included alongside certain instructions.

STOP  16         (For convenience of documentation this low stop limit is chosen.)
SIMI             (SIMI or SIMILAR is equivalent to T1  *CONN)
AT1  N  1  2  E  (nh and E are specified for this and all other atoms.)
AT2  C  3  0  E
AT3  C  2  1  E
AT4  C  2  1  E
AT5  C  3  0  E
AT6  C  2  1  E
AT7  C  2  1  E
AT8  O  1  1  E
BO  1  2  1  A   (This bond is declared to be acyclic, as is bond 5-8.)
BO  2  3  5      (This bond and others of the ring are automatically cyclic since bt=5.)
BO  3  4  5
BO  4  5  5
BO  5  6  5
BO  6  7  5
BO  7  2  5
BO  5  8  1  A
END              (END terminates the connectivity instruction packet in the usual way.)
HITA             (Since all database entries are to be examined we must specify HITA(LL).)
STAR             (The STAR(T) instruction initiates the search process.)
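The atom and bond instructions above can be cross-checked before a search is run. The following Python sketch is not part of QUEST; it assumes that the columns of each AT line are element, number of connected non-hydrogen atoms, and number of hydrogens (nh), and that each BO line gives two atom numbers followed by a bond type. The function names are hypothetical.

```python
from collections import Counter

# AT and BO lines copied from the instruction listing (notes removed).
QUERY = """\
AT1 N 1 2 E
AT2 C 3 0 E
AT3 C 2 1 E
AT4 C 2 1 E
AT5 C 3 0 E
AT6 C 2 1 E
AT7 C 2 1 E
AT8 O 1 1 E
BO 1 2 1 A
BO 2 3 5
BO 3 4 5
BO 4 5 5
BO 5 6 5
BO 6 7 5
BO 7 2 5
BO 5 8 1 A
"""


def parse_query(text):
    """Return (atoms, bonds) under the assumed column meanings."""
    atoms, bonds = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("AT"):
            # atom number -> (element, connected non-H atoms, hydrogens)
            atoms[int(fields[0][2:])] = (fields[1], int(fields[2]), int(fields[3]))
        elif fields[0] == "BO":
            bonds.append((int(fields[1]), int(fields[2]), int(fields[3])))
    return atoms, bonds


def mismatched_atoms(atoms, bonds):
    """List atoms whose declared connection count differs from the bond list."""
    degree = Counter()
    for i, j, _bond_type in bonds:
        degree[i] += 1
        degree[j] += 1
    return [n for n, (_el, ncon, _nh) in atoms.items() if degree[n] != ncon]


def formula(atoms):
    """Count elements, adding the declared hydrogens of every atom."""
    counts = Counter()
    for element, _ncon, nh in atoms.values():
        counts[element] += 1
        counts["H"] += nh
    return dict(counts)


atoms, bonds = parse_query(QUERY)
print(mismatched_atoms(atoms, bonds))
print(formula(atoms))
```

Under these assumptions the bond list agrees with every declared connection count, and the element counts correspond to C6 H7 N1 O1, the formula reported for AMPHOL in the full printout.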

>End of database encountered
 
Finished reading ASER
 QUEST92 started at  6-OCT-1992 15:17:37.13
 database creation date : 22-SEP-1992 09:00:00.00
 Database is DCX compressed
CSD VERSION  5.04
MODE=4
DATA=11111110101000001101111111111111
         Similarity Search     16 out of  16 entries recovered
 Rank   Refcode   Compound Name
 
 1    AMPHOL   p-Aminophenol
 2    AMPHOM02 2-Aminophenol
 3    AMPHOM10 2-Aminophenol
 4    MAMPOL   m-Aminophenol
 5    AMCPHO   2-Amino-4-chlorophenol
 6    AMPHCL   o-Aminophenol hydrochloride
 7    GEBVAK   2-Amino-4-methylphenol
 8    PANISD   p-Anisidine
 9    PANISD01 p-Anisidine
10    ANCPOL   Aniline-2,4,5-trichlorophenol complex
11    ANLPCP   Aniline-pentachlorophenol complex
12    HYQUIN   Hydroquinone
13    HYQUIN02 Quinol
14    HYQUIN05 Hydroquinone
15    JAMKEN   tris(beta-Hydroquinone) xenon clathrate
16    PENDAM   p-Phenylenediamine
 
Results of Similarity Search
----------------------------
Best score     98%
Worst score    76%
average        81%
range          22%
 
Distribution Table for  16 entries in SAVEd database
--------------------------------------------------
 TANI SCORE       Entries  Cumulative
 100  -   94           1       1
  93  -   89           3       4
  88  -   84           0       4
  83  -   79           1       5
  78  -   74          11      16
 
CSD SIMILARITY SEARCH
---------------------
Information for   16  Entries Now Follows
Entries are ordered by decreasing similarity to the input fragment
 
>More(Y/N)?

You are finally asked if you wish to display more information.

The results of the various responses to this prompt are shown below.

Response 1

If you request the full printout then the reference information for each hit is displayed as shown below. This information is also written to the Journal file.

Note that the last item for each hit is *SCOR=m. This is the value of the similarity coefficient for the hit, expressed as a percentage.

>More(Y/N)?Y
>Full printout (Y/N)?Y
---------+---------+---------+---------+---------+---------+---------+---------+
*REFC=AMPHOL // *COMP=p-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=C.J.Brown // *
CODE=1(Acta Crystallogr.) // *VOLU= 4 // *PAGE= 100 // *YEAR=1951 // *SCOR=98
---------+---------+---------+---------+---------+---------+---------+---------+
Type "K"(Keep), "R"(Reject) or "O"(for list of options)P
*REFC=AMPHOL // *COMP=p-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=C.J.Brown // *
CODE=1(Acta Crystallogr.) // *VOLU= 4 // *PAGE= 100 // *YEAR=1951 // *SCOR=98
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=AMPHOM02 // *COMP=2-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=J.D.Korp,I.B
ernal,L.Aven,J.L.Mills // *CODE=195(J.Cryst.Mol.Struct.) // *VOLU= 11 // *PAGE=
117 // *YEAR=1981 // *SCOR=93 //
---------+---------+---------+---------+---------+---------+---------+---------
........................
........................
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=JAMKEN // *COMP=tris(beta-Hydroquinone) xenon clathrate // *FORM=3(C6 H6 O
2),0.866(Xe1) // *AUTH=T.Birchall,C.S.Frampton,G.J.Schrobilgen,J.Valsdottir // *
CODE=591(Acta Cryst.,C (Cr.Str.Comm.)) // *VOLU= 45 // *PAGE= 944 // *YEAR=1989
// *SCOR=76 //
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=PENDAM // *COMP=p-Phenylenediamine // *FORM=C6 H8 N2 // *AUTH=Z.P.Povet'ev
a,Z.V.Zvonkova // *CODE=41(Kristallografiya) // *VOLU= 20 // *PAGE= 69 // *YEAR=
1975 // *SCOR=76 //
---------+---------+---------+---------+---------+---------+---------+---------
 Saving hits in database format now
 

Response 2

If you request the short printout then the similarity coefficients and screen information are tabulated as below.

>More(Y/N)?Y
>Full printout (Y/N)?N
 
Refcode  Rank No.  Entry Screens    Common to Fragment    Similarity Coefficient
AMPHOL     1            44            43                     0.977
AMPHOM02   2            44            42                     0.933
AMPHOM10   3            44            42                     0.933
MAMPOL     4            44            42                     0.933
AMCPHO     5            53            43                     0.811
AMPHCL     6            48            40                     0.784
GEBVAK     7            55            43                     0.782
PANISD     8            49            40                     0.769
PANISD01   9            49            40                     0.769
ANCPOL    10            56            43                     0.768
ANLPCP    11            56            43                     0.768
HYQUIN    12            36            34                     0.756
HYQUIN02  13            36            34                     0.756
HYQUIN05  14            36            34                     0.756
JAMKEN    15            36            34                     0.756
PENDAM    16            36            34                     0.756
 Saving hits in database format now

The top-ranked entry, AMPHOL, is p-aminophenol itself, so you would expect its similarity coefficient to be 1.000. In fact the value is 0.977.

The reason is that AMPHOL contains 44 connectivity screens, as shown in the table, whereas the query fragment contains only 43.

The extra screen in AMPHOL is screen 585, i.e. "Compound contains ONE ring only".

Screens 585-589, as indicated in Appendix 1, are "Miscellaneous ring screens (not dependent on full ring analysis)".

A screen of this type is not set in the query analysis and this leads to the small discrepancy for exact hits.

In spite of this the relative ranking of hits is correct.
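The coefficients in the short printout can be reproduced from the screen counts. The sketch below assumes the standard Tanimoto definition c / (a + b - c), where a is the number of screens in the query (43), b the number in the entry, and c the number common to both. The function name and data layout are illustrative and are not part of QUEST.

```python
def tanimoto(query_screens, entry_screens, common):
    """Standard Tanimoto coefficient computed from screen counts."""
    return common / (query_screens + entry_screens - common)


QUERY_SCREENS = 43  # screens set by the typed-in p-aminophenol fragment

# (refcode, entry screens, screens common to fragment) from the short printout
ROWS = [
    ("AMPHOL", 44, 43),
    ("AMPHOM02", 44, 42),
    ("AMCPHO", 53, 43),
    ("AMPHCL", 48, 40),
    ("GEBVAK", 55, 43),
    ("PANISD", 49, 40),
    ("ANCPOL", 56, 43),
    ("HYQUIN", 36, 34),
]

for refcode, entry, common in ROWS:
    print(f"{refcode:<9} {tanimoto(QUERY_SCREENS, entry, common):.3f}")
```

For AMPHOL this gives 43 / (43 + 44 - 43) = 43/44 = 0.977: all 43 query screens are matched, but the one additional screen in the entry keeps the coefficient below 1.000.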

With the short printout, no reference information is written to the Journal file.

Response 3

If you reply N to the >More prompt then no further information is displayed or written to the Journal file.

Comparison of Tanimoto and Dice Coefficients

For most similarity searches the Tanimoto and Dice coefficients yield the same ranking; only the actual values of the coefficients differ. For the example of p-aminophenol the comparison is as follows:

REFCODE  TANIMOTO  DICE
 
AMPHOL    0.977   0.989
AMPHOM02  0.933   0.966
AMPHOM10  0.933   0.966
MAMPOL    0.933   0.966
AMCPHO    0.811   0.896
AMPHCL    0.784   0.879
GEBVAK    0.782   0.878
PANISD    0.769   0.870
PANISD01  0.769   0.870
ANCPOL    0.768   0.869
ANLPCP    0.768   0.869
HYQUIN    0.756   0.861
HYQUIN02  0.756   0.861
HYQUIN05  0.756   0.861
JAMKEN    0.756   0.861
PENDAM    0.756   0.861
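The Dice column can be checked in the same way. The sketch below assumes the standard definitions, Tanimoto = c / (a + b - c) and Dice = 2c / (a + b), with a = 43 query screens and b, c taken from the short printout of Ex.1. Under these definitions Dice equals 2T / (1 + T), which increases with T, so the ordering of hits is preserved. The function names are illustrative.

```python
def tanimoto(a, b, c):
    """Tanimoto coefficient: common screens over screens set in either set."""
    return c / (a + b - c)


def dice(a, b, c):
    """Dice coefficient: twice the common screens over the sum of both counts."""
    return 2 * c / (a + b)


def dice_from_tanimoto(t):
    """Convert a Tanimoto value to the corresponding Dice value."""
    return 2 * t / (1 + t)


QUERY_SCREENS = 43
ROWS = [
    ("AMPHOL", 44, 43),
    ("AMPHOM02", 44, 42),
    ("AMCPHO", 53, 43),
    ("PANISD", 49, 40),
    ("HYQUIN", 36, 34),
]

print("REFCODE   TANIMOTO  DICE")
for refcode, entry, common in ROWS:
    t = tanimoto(QUERY_SCREENS, entry, common)
    d = dice(QUERY_SCREENS, entry, common)
    # Both routes to the Dice value agree to within rounding error.
    assert abs(d - dice_from_tanimoto(t)) < 1e-12
    print(f"{refcode:<9} {t:.3f}     {d:.3f}")
```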

Ex.2 This illustrates a very useful facility whereby you can specify the query fragment simply by its refcode.

This obviates the need to type a large instruction document for a complex query.

We will repeat the similarity search for p-aminophenol, using its refcode AMPHOL.

The dialogue will be as follows:

At this point the search is initiated and output is displayed as in Ex.1.

It is interesting to compare the output from Ex.1 and Ex.2. The rankings from each run are shown below.

Ranking from Ex.1

 
 Rank   Refcode   Compound Name
 
 1    AMPHOL   p-Aminophenol
 2    AMPHOM02 2-Aminophenol
 3    AMPHOM10 2-Aminophenol
 4    MAMPOL   m-Aminophenol
 5    AMCPHO   2-Amino-4-chlorophenol
 6    AMPHCL   o-Aminophenol hydrochloride
 7    GEBVAK   2-Amino-4-methylphenol
 8    PANISD   p-Anisidine
 9    PANISD01 p-Anisidine
10    ANCPOL   Aniline-2,4,5-trichlorophenol complex   <------
11    ANLPCP   Aniline-pentachlorophenol complex       <------
12    HYQUIN   Hydroquinone
13    HYQUIN02 Quinol
14    HYQUIN05 Hydroquinone
15    JAMKEN   tris(beta-Hydroquinone) xenon clathrate
16    PENDAM   p-Phenylenediamine

Ranking from Ex.2

 Rank   Refcode   Compound Name
 
 1    AMPHOL   p-Aminophenol
 2    AMPHOM02 2-Aminophenol
 3    AMPHOM10 2-Aminophenol
 4    MAMPOL   m-Aminophenol
 5    AMCPHO   2-Amino-4-chlorophenol
 6    AMPHCL   o-Aminophenol hydrochloride
 7    GEBVAK   2-Amino-4-methylphenol
 8    PANISD   p-Anisidine
 9    PANISD01 p-Anisidine
10    HYQUIN   Hydroquinone
11    HYQUIN02 Quinol
12    HYQUIN05 Hydroquinone
13    JAMKEN   tris(beta-Hydroquinone) xenon clathrate
14    PENDAM   p-Phenylenediamine
15    PENDAM01 p-Diaminobenzene                         <-----
16    ARESRC10 2-Aminoresorcinol hydrochloride          <-----

You will notice that hits 10 and 11 in Ex.1 (ANCPOL and ANLPCP) are absent from Ex.2. In their place, two new entries (PENDAM01 and ARESRC10) appear as hits 15 and 16 in Ex.2.

This apparent anomaly is a result of the factor, discussed in Ex.1, relating to screen 585.

Since we have used the refcode AMPHOL as the query fragment the query screen set contains 585.

This results in a perfect match for AMPHOL and, in fact, its similarity coefficient is now 1.000.

Furthermore 585 specifies "Compound contains ONE ring only".

Hits 10 and 11 in Ex.1 each contain two rings whereas hits 15 and 16 in Ex.2 each contain one ring.
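The effect of screen 585 on the ranking can be illustrated with the screen counts from the short printout of Ex.1. The sketch below rests on assumptions that the text does not state directly: that the Tanimoto coefficient is c / (a + b - c), that adding screen 585 to the query raises its count from 43 to 44, that a single-ring entry such as HYQUIN has screen 585 set (so its common count rises by one), and that a two-ring entry such as ANCPOL does not. The Ex.2 values it prints are estimates under those assumptions, not QUEST output.

```python
def tanimoto(query_screens, entry_screens, common):
    """Standard Tanimoto coefficient computed from screen counts."""
    return common / (query_screens + entry_screens - common)


# Ex.1: typed-in fragment, 43 query screens, screen 585 not set.
ex1_ancpol = tanimoto(43, 56, 43)
ex1_hyquin = tanimoto(43, 36, 34)

# Ex.2 (estimated): refcode query, 44 query screens including screen 585.
ex2_ancpol = tanimoto(44, 56, 43)      # assumed two rings: no extra common screen
ex2_hyquin = tanimoto(44, 36, 34 + 1)  # assumed one ring: screen 585 now matches
ex2_amphol = tanimoto(44, 44, 44)      # the query is the entry itself

print(f"Ex.1  ANCPOL {ex1_ancpol:.3f}  HYQUIN {ex1_hyquin:.3f}")
print(f"Ex.2  ANCPOL {ex2_ancpol:.3f}  HYQUIN {ex2_hyquin:.3f}  AMPHOL {ex2_amphol:.3f}")
```

Under these assumptions the two-ring entry scores lower than before while the single-ring entry scores higher, so their relative order reverses. This is consistent with ANCPOL and ANLPCP dropping out of the top 16 in Ex.2.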

Further Notes On Similarity Searching


Volume 1 Chapter 6 2D Similarity Searching in Graphics QUEST3D.