Volume 1 Chapter 6 2D Similarity Searching Overview
The search fragment and the associated search question formulation are shown below.
Explanatory notes are given in parentheses against certain instructions.

STOP 16        (For convenience of documentation this low stop limit is chosen.)
SIMI           (SIMI or SIMILAR is equivalent to T1 *CONN)
AT1 N 1 2 E    (nh and E are specified for this and all other atoms.)
AT2 C 3 0 E
AT3 C 2 1 E
AT4 C 2 1 E
AT5 C 3 0 E
AT6 C 2 1 E
AT7 C 2 1 E
AT8 O 1 1 E
BO 1 2 1 A     (This bond is declared to be acyclic, as is bond 5-8.)
BO 2 3 5       (This bond and others of the ring are automatically cyclic since bt=5.)
BO 3 4 5
BO 4 5 5
BO 5 6 5
BO 6 7 5
BO 7 2 5
BO 5 8 1 A
END            (END terminates the connectivity instruction packet in the usual way.)
HITA           (Since all database entries are to be examined we must specify HITA(LL).)
STAR           (The STAR(T) instruction initiates the search process.)
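The atom and bond lines of the packet can be cross-checked before the search is run. The Python sketch below is not part of QUEST. It assumes that the two numbers on each AT line are the number of attached non-hydrogen atoms followed by nh, and that each BO line gives two atom numbers followed by bt; these field meanings are inferred from the notes above and are not confirmed here. Under those assumptions it checks that the bonds agree with the declared connectivity and derives the formula of the fragment.

```python
from collections import Counter

# The AT and BO lines of the packet above, without the explanatory notes.
PACKET = """\
AT1 N 1 2 E
AT2 C 3 0 E
AT3 C 2 1 E
AT4 C 2 1 E
AT5 C 3 0 E
AT6 C 2 1 E
AT7 C 2 1 E
AT8 O 1 1 E
BO 1 2 1 A
BO 2 3 5
BO 3 4 5
BO 4 5 5
BO 5 6 5
BO 6 7 5
BO 7 2 5
BO 5 8 1 A
"""


def parse_packet(text):
    """Return (atoms, bonds) under the assumed field layout.

    atoms: {atom number: (element, attached non-H atoms, nh)}
    bonds: list of (atom i, atom j, bt)
    """
    atoms, bonds = {}, []
    for line in text.splitlines():
        fields = line.split()
        if fields[0].startswith("AT"):
            number = int(fields[0][2:])
            atoms[number] = (fields[1], int(fields[2]), int(fields[3]))
        elif fields[0] == "BO":
            bonds.append((int(fields[1]), int(fields[2]), int(fields[3])))
    return atoms, bonds


def degree_mismatches(atoms, bonds):
    """List atoms whose bond count differs from the declared connectivity."""
    degree = Counter()
    for i, j, _bt in bonds:
        degree[i] += 1
        degree[j] += 1
    return [n for n, (_el, nca, _nh) in atoms.items() if degree[n] != nca]


def formula(atoms):
    """Element counts of the fragment, hydrogens included."""
    counts = Counter()
    for element, _nca, nh in atoms.values():
        counts[element] += 1
        counts["H"] += nh
    return dict(counts)


atoms, bonds = parse_packet(PACKET)
print(degree_mismatches(atoms, bonds))
print(formula(atoms))
```

If the assumed field layout is right, the mismatch list is empty and the element counts correspond to C6 H7 N1 O1, the formula reported for AMPHOL in the full printout.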
For the Dice coefficient you should type SIMI DICE
When you press <RETURN> at the end of the SIMI instruction you are informed:
maximum of 16 entries will be returned by similarity search
stop limit of 1000 exceeds maximum allowed ( 500)
maximum of 500 entries will be returned by similarity search
Total Number of Screens Set = 43
Scanned 1000 entries, last = ADRMVL
>End of database encountered
Finished reading ASER
QUEST92 started at 6-OCT-1992 15:17:37.13
database creation date : 22-SEP-1992 09:00:00.00
Database is DCX compressed
CSD VERSION 5.04
MODE=4
DATA=11111110101000001101111111111111
Similarity Search 16 out of 16 entries recovered
Rank Refcode Compound Name
1 AMPHOL p-Aminophenol
2 AMPHOM02 2-Aminophenol
3 AMPHOM10 2-Aminophenol
4 MAMPOL m-Aminophenol
5 AMCPHO 2-Amino-4-chlorophenol
6 AMPHCL o-Aminophenol hydrochloride
7 GEBVAK 2-Amino-4-methylphenol
8 PANISD p-Anisidine
9 PANISD01 p-Anisidine
10 ANCPOL Aniline-2,4,5-trichlorophenol complex
11 ANLPCP Aniline-pentachlorophenol complex
12 HYQUIN Hydroquinone
13 HYQUIN02 Quinol
14 HYQUIN05 Hydroquinone
15 JAMKEN tris(beta-Hydroquinone) xenon clathrate
16 PENDAM p-Phenylenediamine
Results of Similarity Search
----------------------------
Best score 98%
Worst score 76%
average 81%
range 22%
Distribution Table for 16 entries in SAVEd database
--------------------------------------------------
TANI SCORE Entries Cumulative
100 - 94 1 1
93 - 89 3 4
88 - 84 0 4
83 - 79 1 5
78 - 74 11 16
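The summary figures and the distribution table can be reproduced from the sixteen similarity coefficients that QUEST reports for these hits in the short printout. The Python sketch below is an illustration only: the score bands are copied from the table rather than derived, and rounding to the nearest whole percentage is an assumption about how the program forms its scores.

```python
# Tanimoto coefficients of the 16 hits, as listed in the short printout.
COEFFICIENTS = [0.977, 0.933, 0.933, 0.933, 0.811, 0.784, 0.782, 0.769,
                0.769, 0.768, 0.768, 0.756, 0.756, 0.756, 0.756, 0.756]

# Score bands copied from the distribution table, as (upper, lower) percent.
BANDS = [(100, 94), (93, 89), (88, 84), (83, 79), (78, 74)]


def to_percent(value):
    """Round a coefficient to a whole-number percentage (assumed rounding)."""
    return int(value * 100 + 0.5)


scores = [to_percent(c) for c in COEFFICIENTS]
best, worst = max(scores), min(scores)
average = to_percent(sum(COEFFICIENTS) / len(COEFFICIENTS))
score_range = best - worst

# Build the (upper, lower, entries, cumulative) rows of the table.
table = []
cumulative = 0
for upper, lower in BANDS:
    count = sum(lower <= s <= upper for s in scores)
    cumulative += count
    table.append((upper, lower, count, cumulative))

print(best, worst, average, score_range)
for upper, lower, count, running in table:
    print(f"{upper:3d} - {lower:2d} {count:3d} {running:3d}")
```

With these inputs the sketch gives a best score of 98, a worst score of 76, an average of 81 and a range of 22, and the band counts 1, 3, 0, 1 and 11, in agreement with the output shown above.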
CSD SIMILARITY SEARCH
---------------------
Information for 16 Entries Now Follows
Entries are ordered by decreasing similarity to the input fragment
>More(Y/N)?
You are finally asked if you wish to display more information.
The results of the various responses to this prompt are shown below.
Response 1
If you request the full printout then the reference information for each hit is displayed as shown below. This information is also written to the Journal file.
Note that the last item for each hit is *SCOR=m. This is the value of the similarity coefficient for the hit.
>More(Y/N)?Y
>Full printout (Y/N)?Y
---------+---------+---------+---------+---------+---------+---------+---------+
*REFC=AMPHOL // *COMP=p-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=C.J.Brown // *
CODE=1(Acta Crystallogr.) // *VOLU= 4 // *PAGE= 100 // *YEAR=1951 // *SCOR=98
---------+---------+---------+---------+---------+---------+---------+---------+
Type "K"(Keep), "R"(Reject) or "O"(for list of options)P
*REFC=AMPHOL // *COMP=p-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=C.J.Brown // *
CODE=1(Acta Crystallogr.) // *VOLU= 4 // *PAGE= 100 // *YEAR=1951 // *SCOR=98
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=AMPHOM02 // *COMP=2-Aminophenol // *FORM=C6 H7 N1 O1 // *AUTH=J.D.Korp,I.B
ernal,L.Aven,J.L.Mills // *CODE=195(J.Cryst.Mol.Struct.) // *VOLU= 11 // *PAGE=
117 // *YEAR=1981 // *SCOR=93 //
---------+---------+---------+---------+---------+---------+---------+---------
........................
........................
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=JAMKEN // *COMP=tris(beta-Hydroquinone) xenon clathrate // *FORM=3(C6 H6 O
2),0.866(Xe1) // *AUTH=T.Birchall,C.S.Frampton,G.J.Schrobilgen,J.Valsdottir // *
CODE=591(Acta Cryst.,C (Cr.Str.Comm.)) // *VOLU= 45 // *PAGE= 944 // *YEAR=1989
// *SCOR=76 //
---------+---------+---------+---------+---------+---------+---------+---------
*REFC=PENDAM // *COMP=p-Phenylenediamine // *FORM=C6 H8 N2 // *AUTH=Z.P.Povet'ev
a,Z.V.Zvonkova // *CODE=41(Kristallografiya) // *VOLU= 20 // *PAGE= 69 // *YEAR=
1975 // *SCOR=76 //
---------+---------+---------+---------+---------+---------+---------+---------
Saving hits in database format now
Response 2
If you request the short printout then the similarity coefficients and screen information are tabulated as below.
>More(Y/N)?Y
>Full printout (Y/N)?N

Refcode     Rank   No. Entry   Common to   Similarity
                   Screens     Fragment    Coefficient
AMPHOL         1      44          43         0.977
AMPHOM02       2      44          42         0.933
AMPHOM10       3      44          42         0.933
MAMPOL         4      44          42         0.933
AMCPHO         5      53          43         0.811
AMPHCL         6      48          40         0.784
GEBVAK         7      55          43         0.782
PANISD         8      49          40         0.769
PANISD01       9      49          40         0.769
ANCPOL        10      56          43         0.768
ANLPCP        11      56          43         0.768
HYQUIN        12      36          34         0.756
HYQUIN02      13      36          34         0.756
HYQUIN05      14      36          34         0.756
JAMKEN        15      36          34         0.756
PENDAM        16      36          34         0.756

Saving hits in database format now
The entry ranked number 1, AMPHOL, is p-aminophenol and you would expect the similarity coefficient to be 1.000, since the query fragment is p-aminophenol. In fact the value is 0.977.
The reason for this is that, as shown in the table, AMPHOL contains 44 connectivity screens whereas the query fragment contains 43.
AMPHOL contains screen 585, i.e. "Compound contains ONE ring only".
Screens 585-589, as indicated in Appendix 1, are "Miscellaneous ring screens (not dependent on full ring analysis)".
A screen of this type is not set in the query analysis and this leads to the small discrepancy for exact hits.
In spite of this the relative ranking of hits is correct.
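The coefficients in the short printout can be recomputed from the screen counts. The Python sketch below assumes the usual definition of the Tanimoto coefficient, c / (a + b - c), where a is the number of screens set for the query (43 in this example), b is the number set for the entry and c is the number common to both. This section does not state the formula, so treat it as an assumption; it does, however, reproduce the printed values for the rows used here.

```python
QUERY_SCREENS = 43  # screens set for the query fragment in Ex.1

# refcode: (entry screens, screens common to fragment, printed coefficient)
SHORT_PRINTOUT = {
    "AMPHOL":   (44, 43, 0.977),
    "AMPHOM02": (44, 42, 0.933),
    "AMCPHO":   (53, 43, 0.811),
    "AMPHCL":   (48, 40, 0.784),
    "GEBVAK":   (55, 43, 0.782),
    "PANISD":   (49, 40, 0.769),
    "ANCPOL":   (56, 43, 0.768),
    "HYQUIN":   (36, 34, 0.756),
}


def tanimoto(query, entry, common):
    """Tanimoto coefficient from screen counts (assumed definition)."""
    return common / (query + entry - common)


recomputed = {
    refcode: round(tanimoto(QUERY_SCREENS, entry, common), 3)
    for refcode, (entry, common, _printed) in SHORT_PRINTOUT.items()
}

for refcode, value in recomputed.items():
    printed = SHORT_PRINTOUT[refcode][2]
    print(f"{refcode:<9} recomputed {value:.3f}  printed {printed:.3f}")
```

For AMPHOL all 43 query screens are matched, but the entry carries one screen more than the query, so the coefficient is 43/44 and rounds to 0.977 rather than 1.000.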
When the short printout is requested, no reference information is written to the Journal file.
Response 3
If you reply N to the >More prompt then no further information is displayed or written to the Journal file.
Comparison of Tanimoto and Dice Coefficients
For most similarity searches the Tanimoto and Dice coefficients yield the same ranking; only the actual values of the coefficients differ. For the example of p-aminophenol the comparison is as follows:
REFCODE     TANIMOTO   DICE
AMPHOL       0.977     0.989
AMPHOM02     0.933     0.966
AMPHOM10     0.933     0.966
MAMPOL       0.933     0.966
AMCPHO       0.811     0.896
AMPHCL       0.784     0.879
GEBVAK       0.782     0.878
PANISD       0.769     0.870
PANISD01     0.769     0.870
ANCPOL       0.768     0.869
ANLPCP       0.768     0.869
HYQUIN       0.756     0.861
HYQUIN02     0.756     0.861
HYQUIN05     0.756     0.861
JAMKEN       0.756     0.861
PENDAM       0.756     0.861
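The two columns can be related through the screen counts given in the short printout. The Python sketch below assumes the usual definitions, Tanimoto = c / (a + b - c) and Dice = 2c / (a + b), with a the query screens, b the entry screens and c the common screens; neither formula is stated in this section. Under these definitions Dice equals 2T / (1 + T), where T is the Tanimoto value, so it rises whenever Tanimoto rises, which is consistent with the identical ranking seen in this example.

```python
import math

QUERY_SCREENS = 43  # screens set for the Ex.1 query fragment

# refcode: (entry screens, screens common to fragment, printed Dice value)
ROWS = {
    "AMPHOL":   (44, 43, 0.989),
    "AMPHOM02": (44, 42, 0.966),
    "AMCPHO":   (53, 43, 0.896),
    "AMPHCL":   (48, 40, 0.879),
    "GEBVAK":   (55, 43, 0.878),
    "PANISD":   (49, 40, 0.870),
    "ANCPOL":   (56, 43, 0.869),
    "HYQUIN":   (36, 34, 0.861),
}


def tanimoto(query, entry, common):
    """Tanimoto coefficient from screen counts (assumed definition)."""
    return common / (query + entry - common)


def dice(query, entry, common):
    """Dice coefficient from screen counts (assumed definition)."""
    return 2 * common / (query + entry)


t = {r: tanimoto(QUERY_SCREENS, e, c) for r, (e, c, _d) in ROWS.items()}
d = {r: dice(QUERY_SCREENS, e, c) for r, (e, c, _d) in ROWS.items()}

# Rank the entries by each coefficient, highest first.
rank_by_tanimoto = sorted(ROWS, key=lambda r: t[r], reverse=True)
rank_by_dice = sorted(ROWS, key=lambda r: d[r], reverse=True)

for refcode in rank_by_tanimoto:
    print(f"{refcode:<9} {t[refcode]:.3f} {d[refcode]:.3f}")
print(rank_by_tanimoto == rank_by_dice)
```

The recomputed Dice values round to the figures in the DICE column for the rows included here, and both orderings are the same.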
Ex.2 This illustrates a very useful facility whereby you can specify the query fragment simply by its refcode.
This obviates the need to type a large instruction document for a complex query.
We will repeat the similarity search for p-aminophenol, using its refcode AMPHOL.
The dialogue will be as follows:
Type SIMI TO REFCODE <R>
maximum of 16 entries will be returned by similarity search
refcode or END>
refcode or END>
refcode AMPHOL will provide definition of similarity search
Type STAR <R>
It is interesting to compare the output from Ex.1 and Ex.2. The rankings from each run are shown below.
Ranking from Ex.1
Rank Refcode Compound Name
1 AMPHOL p-Aminophenol
2 AMPHOM02 2-Aminophenol
3 AMPHOM10 2-Aminophenol
4 MAMPOL m-Aminophenol
5 AMCPHO 2-Amino-4-chlorophenol
6 AMPHCL o-Aminophenol hydrochloride
7 GEBVAK 2-Amino-4-methylphenol
8 PANISD p-Anisidine
9 PANISD01 p-Anisidine
10 ANCPOL Aniline-2,4,5-trichlorophenol complex <------
11 ANLPCP Aniline-pentachlorophenol complex <------
12 HYQUIN Hydroquinone
13 HYQUIN02 Quinol
14 HYQUIN05 Hydroquinone
15 JAMKEN tris(beta-Hydroquinone) xenon clathrate
16 PENDAM p-Phenylenediamine
Ranking from Ex.2
Rank Refcode Compound Name
1 AMPHOL p-Aminophenol
2 AMPHOM02 2-Aminophenol
3 AMPHOM10 2-Aminophenol
4 MAMPOL m-Aminophenol
5 AMCPHO 2-Amino-4-chlorophenol
6 AMPHCL o-Aminophenol hydrochloride
7 GEBVAK 2-Amino-4-methylphenol
8 PANISD p-Anisidine
9 PANISD01 p-Anisidine
10 HYQUIN Hydroquinone
11 HYQUIN02 Quinol
12 HYQUIN05 Hydroquinone
13 JAMKEN tris(beta-Hydroquinone) xenon clathrate
14 PENDAM p-Phenylenediamine
15 PENDAM01 p-Diaminobenzene <-----
16 ARESRC10 2-Aminoresorcinol hydrochloride <-----
You will notice that hits 10 and 11 in Ex.1 (ANCPOL and ANLPCP) are absent from Ex.2. In their place, two new entries (PENDAM01 and ARESRC10) appear as hits 15 and 16 in Ex.2.
This apparent anomaly results from the factor, discussed in Ex.1, relating to screen 585.
Since we have used the refcode AMPHOL as the query fragment the query screen set contains 585.
This results in a perfect match for AMPHOL and, in fact, its similarity coefficient is now 1.000.
Furthermore, 585 specifies "Compound contains ONE ring only".
Hits 10 and 11 in Ex.1 each contain two rings and so do not set this screen, whereas hits 15 and 16 in Ex.2 each contain one ring and therefore share it with the query.
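The effect of screen 585 on the ranking can be illustrated numerically. The Python sketch below rests on several assumptions that are not taken from QUEST output: the Tanimoto coefficient is c / (a + b - c); the Ex.2 query sets the 43 screens of Ex.1 plus screen 585; and the single-ring entries AMPHOL and HYQUIN already have 585 among their entry screens, while the two-ring entry ANCPOL does not. Only the value of 1.000 for AMPHOL is confirmed by the text; the other Ex.2 figures are illustrative.

```python
def tanimoto(query, entry, common):
    """Tanimoto coefficient from screen counts (assumed definition)."""
    return common / (query + entry - common)


EX1_QUERY = 43              # query screens in Ex.1
EX2_QUERY = EX1_QUERY + 1   # assumption: Ex.2 query also sets screen 585

# Ex.1 short printout: refcode -> (entry screens, common screens)
EX1 = {
    "AMPHOL": (44, 43),
    "ANCPOL": (56, 43),
    "HYQUIN": (36, 34),
}

# Assumption: these single-ring entries have screen 585 set.
ONE_RING = {"AMPHOL", "HYQUIN"}


def ex1_coefficient(refcode):
    entry, common = EX1[refcode]
    return tanimoto(EX1_QUERY, entry, common)


def ex2_coefficient(refcode):
    entry, common = EX1[refcode]
    if refcode in ONE_RING:
        common += 1  # screen 585 is now matched by the query
    return tanimoto(EX2_QUERY, entry, common)


for refcode in EX1:
    print(f"{refcode:<7} Ex.1 {ex1_coefficient(refcode):.3f}"
          f"  Ex.2 {ex2_coefficient(refcode):.3f}")
```

Under these assumptions AMPHOL rises to 1.000, HYQUIN rises from 0.756 to about 0.778, and ANCPOL falls from 0.768 to about 0.754. The two-ring complexes therefore drop below the single-ring entries, which is consistent with their disappearance from the top sixteen in Ex.2.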
Further Notes On Similarity Searching
rather than: SIMI TO REFCODE