Exercise 2
Key - Created April, 2001

Also please include your email address in the above information ...
We would like Name and email address ... Thanks

BI141S
Doug Smith

My comments look like this.
A key to the questions is included at the end of the file.

Score: 365 pts

Summary of Grading 
A. Use Netscape and the Web to Find and Extract a SWISS-PROT Sequence       37 pts 
B. Searches using Boolean Operators.                                        64 pts 
C. Looking up Sequences with Entrez.                                        54 pts 
D. Introduction to the GCG package.                                         21 pts
E. Questions.                                                              189 pts 
                                                                    TOTAL 365 pts

did this ...

{A. Use Netscape and the Web to Find and Extract a SWISS-PROT Sequence:}

{1. Turn on Netscape Communicator.}
did this ...

{maneuver via Hypertext links for a while}
did this ...
 
 

{2. Bioinformatics Links: the DNASYSTEM Web Page or the CMS MBR Web Page}

{a. Go to the DNASYSTEM or the CMS MBR Web Page}
did this ...

{Spend a little time browsing the DNASYSTEM Web Page or the CMS MBR Web Page.}
did this ... 3 pts
 

{3. SWISS-PROT and TrEMBL at the ExPASy Web site}

{From the DNASYSTEM Web Page or from the links above, click on "General Sites"}
did this ...

{Access the ExPASy site by clicking on 'ExPASy'.}
did this ...

{Browse a bit to see what's at ExPASy ...}
did this ... 3 pts
 

{4. Find your Protein Sequence in the SWISS-PROT and TrEMBL Protein Databases}

{In "Access to SWISS-PROT and TrEMBL", click on "by description or identification"}
did this ...

{Now enter TWO keywords in the box shown and click "Submit"}
show the two keywords used here ...

{Try a keyword search with TWO words and record your findings; Use your own keywords; do NOT use lambda repressor ! also try REVERSING the two keywords.}
words are the same here ... 3 pts

{Do the same keyword search but with each keyword separately; record your findings}
record findings for:
1) two words            dnaa AND coli        27
2) two words reversed   coli AND dnaa        27
3) word 1               dnaa                245
4) word 2               coli              14762
Exercise does not explicitly ask to explain these results ... 8 pts, 2 for each
 

{5. Copy your Sequence}

{Choose one of the Protein Sequences you found and have a look at it by clicking on the hypertext-linked SwissProt name for the sequence.}
did this ...

{Note the links present to visualize the protein and to learn more about its structural elements. Browse through some of these links.}
should include some statement of links looked at ... 3 pts

{Obtain a copy of the sequence entry}
ok ...

{Save copies both as "Text" and "Source" from both "NiceProt View" and "old SP format"}
did this ...

{Include some or all of the "Text" version of the "old SP format" in your Notebook}
ID   DNAA_ECOLI     STANDARD;      PRT;   467 AA.
AC   P03004; P78122;
DT   21-JUL-1986 (Rel. 01, Created)
DT   01-JUL-1993 (Rel. 26, Last sequence update)
DT   01-OCT-2000 (Rel. 40, Last annotation update)
DE   CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA.
GN   DNAA OR B3702.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
5 pts

{Note the First line in the Entry and the Last line in the Entry. What are these?}
First line is the ID line and last line is the // line after the sequence.  Both lines serve as delimiters of the sequence entry. 6 pts, 3 pts each

{Note also the First Two Letters in each line of the "old SP format".}
Some statement about what these are should be included ... the first two letters define the type of field on that line. 6 pts, 3 pts each
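The role of the two-letter codes can be illustrated with a short sketch that groups the lines of an entry by their code. It uses a shortened copy of the DNAA_ECOLI lines shown above; the function name parse_line_codes is hypothetical and not part of any library, and the sketch assumes the value starts at column 6 of each line, as in the entry above.

```python
from collections import defaultdict

# Shortened copy of the "old SP format" lines recorded above.
ENTRY = """\
ID   DNAA_ECOLI     STANDARD;      PRT;   467 AA.
AC   P03004; P78122;
DT   21-JUL-1986 (Rel. 01, Created)
DT   01-JUL-1993 (Rel. 26, Last sequence update)
DE   CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA.
GN   DNAA OR B3702.
OS   Escherichia coli.
//
"""

def parse_line_codes(text):
    """Group line contents by the two-letter code at the start of each line."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "//":      # end-of-entry line
            break
        if code.strip():      # skip lines without a code (e.g. sequence lines)
            fields[code].append(value)
    return dict(fields)

fields = parse_line_codes(ENTRY)
print(fields["AC"])        # ['P03004; P78122;']
print(len(fields["DT"]))   # 2
```

A code that occurs more than once (DT here) simply collects several values, which is why each code maps to a list.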

{Use COPY-PASTE to COPY the Sequence directly from the Netscape window and PASTE it into your Notebook file for Exercise 1.}
????? why have this here ???  ... Exercise 1 ??? ... already did a COPY-PASTE operation above ...
 
 

{B. Searches using Boolean Operators}

{1. Go to the SWISS-PROT full text search service.}
did this ...

{2. Enter the two keywords into the search box and hit SUBMIT to start the search. How many matches were reported?}
coli AND dnaa 26 hits (25 + 1) 3 pts

{Now reverse the order of the TWO keywords, i.e. "repressor lambda" instead of "lambda repressor."}
same result ... 25 + 1 = 26 hits 3 pts

{Now go back to the SWISS-PROT full text search page and try again using EACH of the two keywords. How many matches were reported?}
3) word 1                                        dnaa                         242 3 pts
4) word 2                                        coli                       12900 3 pts

{Do the above searches with the "append * before and after" option selected}
record findings for:
1) two words            dnaa AND coli        27
2) two words reversed   coli AND dnaa        27
3) word 1               dnaa                245
4) word 2               coli              14762 4 pts

{Construct a table of your results as per the following}
record findings for:              description-ID search               full-text search
keywords        words used       SwissProt TrEMBL  total          SwissProt  TrEMBL  total
--------        ----------       --------- ------  -----          ---------  ------  -----
1) two words    dnaa  coli           0        0       0               0        0       0
              with * appended        * not supported                  0        0       0 4 pts
2) two words    coli  dnaa           0        0       0               1        0       1
   reversed   with * appended        * not supported                  2        0       2 4 pts
3) word 1       dnaa                40       38      78             170       72     242
              with * appended        * not supported                171       74     245 4 pts
4) word 2       coli              4974     4070    9044            7568     5332   12900
              with * appended        * not supported               8193     6569   14762 4 pts

Search for the two words did NOT yield the same results when they were reversed because the search assumes the
two words are adjacent to each other.
 

{3. Do a search with the same two keywords using the AND operator. How many matches did you get? What are the results when you reverse the order of the keywords?}
record findings for:            description-ID search                full-text search
keywords        words used     SwissProt  TrEMBL  total        SwissProt   TrEMBL   total
--------        ----------     ---------  ------  -----        ---------   ------   -----
1) two words   dnaa AND coli      0          0       0             25        1        26
              with * appended       * not supported                26        1        27 4 pts
2) two words   coli AND dnaa      0          0       0             25        1        26
   reversed   with * appended       * not supported                26        1        27 4 pts
Same results are obtained independent of order of the two keywords because the two words can be anywhere.
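The difference between the two behaviours recorded above (adjacent words versus words anywhere) can be sketched on a toy example. The two description lines below are made up for illustration and are not database records, and the matching rules are an assumption about how the servers behave, based only on the hit counts in the tables.

```python
import re

# Made-up description lines, for illustration only (not real database records).
TEXTS = [
    "Nucleotide sequence of the Escherichia coli dnaA gene",
    "dnaA protein of Escherichia coli",
]

def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_hits(first, second):
    """Count texts where `first` is immediately followed by `second`."""
    count = 0
    for text in TEXTS:
        words = tokens(text)
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            count += 1
    return count

def and_hits(first, second):
    """Count texts containing both words anywhere, in any order."""
    return sum(1 for text in TEXTS if {first, second} <= set(tokens(text)))

print(phrase_hits("dnaa", "coli"), phrase_hits("coli", "dnaa"))  # 0 1
print(and_hits("dnaa", "coli"), and_hits("coli", "dnaa"))        # 2 2
```

The phrase counts depend on word order, while the AND counts do not, which is the pattern seen in the two tables.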
 

{Do the same for both the OR and NOT boolean operators, and construct a table for your Notebook similar to the one below.}
                         AND operator                  OR operator            NOT operator
 Search words       SWISS-PROT TrEMBL total    SWISS-PROT TrEMBL total   SWISS-PROT TrEMBL  total
 -----------------  ---------- ------ -----    ---------- ------ -----   ---------- ------  -----
 dnaa coli              25        1     26         7713    5403  13116      145         71    216 4 pts
   with * appended      26        1     27         8338    6642  14980      145         73    218 4 pts
 coli dnaa              25        1     26         7713    5403  13116     7543       5331  12874 4 pts
   with * appended      26        1     27         8338    6642  14980     8167       6568  14735 4 pts
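As a consistency check on the table, the OR and NOT totals can be recomputed from the single-keyword and AND totals. The sketch below uses only the full-text totals recorded in this key (without wildcards); the variable names are my own.

```python
# Full-text totals (SWISS-PROT + TrEMBL) recorded in this key, no wildcards.
dnaa = 242      # single keyword: dnaa
coli = 12900    # single keyword: coli
both = 26       # dnaa AND coli

# Entries found by OR are counted once, so the overlap is subtracted.
either = dnaa + coli - both
# NOT removes the overlap from the first keyword's entries.
dnaa_not_coli = dnaa - both
coli_not_dnaa = coli - both

print(either, dnaa_not_coli, coli_not_dnaa)  # 13116 216 12874
```

These values agree with the OR and NOT totals in the table (13116, 216, and 12874).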
 
 

{Try also parts of keywords. For example, for "lambda repressor", we would try "lam AND rep". Note that you must check the "Prefix and append wildcard '*' to words." box with this server. Try the equivalent to "lam* AND *rep*" for your keywords.}
  8 pts
 
 

{C. Looking up Sequences with NCBI Entrez}

{1. Go to NCBI Entrez and read some of the "Entrez Help" information.}
did this ...

{2. Return to the Entrez Home page using the Netscape "Back" operation and click on "Proteins" to search for a protein sequence.}
did this ...

{3. Enter one of the keywords you used at ExPASy in the search box and click the "Go" button (or just do a RETURN) to start the search.}
should say what keyword was used and results obtained ... 3 pts
keyword 'coli', 74384 hits ...

{4. If the number of matches is greater than 200, use a second and/or third keyword in the <keyword(s)> section together with appropriate Boolean Operators.}
ok ... coli AND dnaA yielded 105 hits ...

{5. Click in turn on each of the four links "Limits", "Preview/Index", "History", "Clipboard". Answer the Exercise Questions on these four links. Redo the AND query with two of your keywords using "Limits" to limit your search to new GenPept entries within the past 5 years. Compare these results with those above.}
did this ... coli AND dnaA, limited to mod dates of 1996-01-01 to 2002-01-01 yielded 90 hits 3 pts
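The exercise used the Entrez web page, but a date-limited query like the one above can also be written as a URL. The sketch below only builds the URL and does not contact the server. It assumes the present-day NCBI E-utilities ESearch service and its db, term, datetype, mindate, and maxdate parameters; the function name build_esearch_url is hypothetical, and hit counts today will not match the 2001 values recorded here.

```python
from urllib.parse import urlencode

# Assumption: the present-day NCBI E-utilities ESearch endpoint.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, mindate, maxdate):
    """Build an ESearch URL for the protein database, limited by modification date."""
    params = {
        "db": "protein",
        "term": term,
        "datetype": "mdat",   # limit by modification date
        "mindate": mindate,   # YYYY/MM/DD
        "maxdate": maxdate,
    }
    return BASE + "?" + urlencode(params)

url = build_esearch_url("coli AND dnaA", "1996/01/01", "2002/01/01")
print(url)
```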

{6. In the "Display <menu>" section, choose different items from the <menu> and click the "Display" button. Briefly describe what the different items are in the <menu>, particularly the "Neighborhood" and "Link" items.}
did this ... 1) different display formats, eg GenPept, ASN.1, Brief 3 pts
2) links to PubMed articles (abstracts), Taxonomy, Structure, Genome, OMIM, etc 3 pts
3) neighbors ... homologues to proteins, nucleotide sequences 3 pts

{7. Return to the <menu> item termed "Summary" and click the "Display" button. Select the boxes of any five (5) of the sequences by clicking on each of the boxes. Select "GenPept" from the <menu> and click on the "Display" button. Record in your Lab Notebook what this has done.}
did this ... this brings up the GenPept entries of the five chosen sequences ... 3 pts

{8. The GenPept sequences are Displayed as HTML. Click on the HTML button, choose "Plain Text", and click on the "Display" button. Record in your Lab Notebook what this has done.}
did this ... this now displays the five chosen seqs as Text with the GenPept annotation ...3 pts

{9. Examine the contents of the GenPept annotation for the first of your Sequences. What are the Fields? Save this GenPept entry to your Lab Notebook.}
did this ... Fields are each type of annotation, eg LOCUS, DEFINITION, ACCESSION, REFERENCE, etc. 3 pts

ID   RNPA_PROMI     STANDARD;      PRT;   119 AA.
AC   P22835;
DT   01-AUG-1991 (Rel. 19, Created)
DT   01-AUG-1991 (Rel. 19, Last sequence update)
DT   01-OCT-2000 (Rel. 40, Last annotation update)
DE   RIBONUCLEASE P PROTEIN COMPONENT (EC 3.1.26.5) (PROTEIN C5) (RNASE P).
GN   RNPA.
OS   Proteus mirabilis.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Proteus.
OX   NCBI_TaxID=584;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=LM1509;
RX   MEDLINE=91033012; PubMed=2172087; [NCBI, ExPASy, EBI, Israel, Japan]
RA   Skovgaard O.;
RT   "Nucleotide sequence of a Proteus mirabilis DNA fragment homologous
RT   to the 60K-rnpA-rpmH-dnaA-dnaN-recF-gyrB region of Escherichia
RT   coli.";
RL   Gene 93:27-34(1990).
CC   -!- FUNCTION: RIBONUCLEASE P GENERATES MATURE TRNA MOLECULES BY
CC       CLEAVING THEIR 5' ENDS. IT CAN CLEAVE ALSO THE 4.5S RNA (BY
CC       SIMILARITY).
CC   -!- CATALYTIC ACTIVITY: ENDONUCLEOLYTIC CLEAVAGE OF RNA, REMOVING
CC       5'-EXTRA-NUCLEOTIDE FROM TRNA PRECURSOR.
CC   -!- MISCELLANEOUS: RNASE P CONSISTS OF A RNA MOIETY (M1, RNPB) AND THE
CC       PROTEIN COMPONENT. BOTH ARE NECESSARY FOR FULL ENZYMATIC ACTIVITY.
CC       HOWEVER, IT IS THE RNA THAT CARRIES THE CATALYTIC SITE.
CC   -!- SIMILARITY: BELONGS TO THE RNPA FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; M58352; AAA83956.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; JQ0731; JQ0731.
DR   InterPro; IPR000100; Ribonuclease_P.
DR   InterPro; Graphical view of domain structure.
DR   Pfam; PF00825; Ribonuclease_P; 1.
DR   PROSITE; PS00648; RIBONUCLEASE_P; 1.
DR   ProDom [Domain structure / List of seq. sharing at least 1 domain]
DR   BLOCKS; P22835.
DR   DOMO; P22835.
DR   PROTOMAP; P22835.
DR   PRESAGE; P22835.
DR   DIP; P22835.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   Hydrolase; Nuclease; Endonuclease; tRNA processing.
SQ   SEQUENCE   119 AA;  14059 MW;  80323E9611F89891 CRC64;
     MVKLAFPREL RLLTPKHFNF VFQQPQRASS PEVTILGRQN ELGHPRIGLT IAKKNVKRAH
     ERNRIKRLAR EYFRLHQHQL PAMDFVVLVR KGVAELDNHQ LTEVLGKLWR RHCRLAQKS
//

{10. Return to the display of your GenPept sequences as HTML, and select from <menu> the "Graphics" option and click on the "Display" button. Describe what you see. What options does the User have? The Graphics display is only of the first of your five sequences; how would you observe the graphics for the fourth sequence?}
did this ..
1. The "Graphics" Option (when it works !!!) shows a map of the Sequence, with its annotated Features displayed on the map ... The sequence is also shown, with annotated Features along the sequence ... 3 pts
2. To observe graphics for the fourth sequence, return to the Summary display, choose only the fourth sequence, display this as GenPept, and select the Graphics display ... 3 pts

{11. Return to a "Display GenPept as HTML" and click on the "Add to Clipboard" button. Describe what happened.}
This places the GenPept version of all 5 sequences into a Clipboard, available for downloading or saving in a file on your local computer ... 3 pts

{12. Choose "Summary" from the <menu> and click on the "Display" button. Note the links to the far right for each of your five GenPept sequences. Briefly describe what each of these links do.}
These provide links to relevant references in PubMed, to Related Sequences (homologues), and to Taxonomy information for each of the organisms from which the sequences were obtained ... 3 pts

{13. Click on "Related Sequences" for one of your five GenPept Sequences. Briefly describe what this does.}
did this for the E. coli dnaK entry, returned 1983 related sequences ...
all are heat shock proteins, presumably homologues of the E. coli dnaK protein ... 3 pts

{14. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and click on "PubMed" for the GenPept Sequence that you saved for your Lab Notebook above. How do the PubMed References that come up compare with those in the GenPept annotation for this sequence? Now click on "Related Articles" for one of these References. How does this compare with the "Related Sequences" for the GenPept Sequence itself?}
1) the initial PubMed refs that come up are the same as annotated in the GenPept entry ...3 pts
2) those that come up under "Related Articles" are those that have similar keywords to the original articles, ie are "related" ... 3 pts

{15. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and do the same for the "Nucleotide", "Genome", and "Taxonomy" links if present. Briefly describe what each does.}
Each brings up related entries in one of these categories: Nucleotide - the cognate DNA sequences for the protein sequences; Genome - the cognate DNA sequences from completely sequenced genomes; Taxonomy - info from the Taxonomy database on the organism encoding each of the protein sequences ... 9 pts, 3 pts for each of these three types of links ...
 
 
 

{D. Introduction to the GCG Package}

{1. Log into the GCG server machine and set up the GCG package. }

Did this as follows:
at %, did:  telnet y4306-su-1   (ssh y4306-su-1  works also ... and is a 'more secure' connection ...)
... logged in
... then did at %:  prep gcg

Some description of what was done should be included here ... ... 3 pts
 

{2. Familiarize yourself with the capabilities of the GCG package. }
Used the Web-based GCG help facilities ...
Also tried using GCG GenHelp on y4306-su-1 ... ... 3 pts

{3. Initialize the GCG graphics system. }
To set graphics output to go to a file in png format, did command at %:   png
To set graphics output to go to the screen,  did command at %:   xwindows
 did this ...3 pts

{4. Make a test plot and save it to your lab report.}

And then did at %:  plottest
Then inserted results of the plottest here: ... ... 3 pts

{5. Format a sequence for use with the GCG package.}
Retrieved the E. coli dnaA gene sequence entry  ECODNAAOP encoding dnaA, dnaN, and rpmH, AccNum J01602
in GenBank format, linked here to NCBI and here to html file in account.
Also retrieved a FASTA-formatted version as text of the 3873 bp sequence entry, linked here.
did this ...3 pts

{6. Format a sequence using a text editor and the reformat command.}
Did this by editing the FASTA-formatted version of the ECODNAAOP file, inserting a .. line between the first, header line and the sequence lines.
did this ...3 pts

This file "dnaaop.edited" was used as input for REFORMAT which yielded a file with the header line as annotation followed by the GCG .. line followed by the sequence with the usual GCG formatting (nuc numbers, spaces, residues in groups of 10); this file is here.
did this ...   3 pts
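The hand-editing step described above (inserting a .. line between the header line and the sequence lines) could also be done with a few lines of code. This is only a sketch: the function name insert_gcg_divider is hypothetical, the example sequence is a made-up placeholder rather than the ECODNAAOP sequence, and the header line is left unchanged because the key does not say whether it was edited.

```python
def insert_gcg_divider(fasta_text):
    """Insert a '..' line between the first (header) line and the sequence lines."""
    lines = fasta_text.splitlines()
    header, sequence_lines = lines[0], lines[1:]
    return "\n".join([header, "..", *sequence_lines]) + "\n"

# Placeholder input: one header line followed by made-up sequence lines.
example = ">placeholder header line\nACGTACGTAC\nTTGACA\n"
edited = insert_gcg_divider(example)
print(edited)
```

The output would then be the kind of file passed to REFORMAT, as was done with "dnaaop.edited".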

The following was not asked for ... ...
Did the same with a SwissProt entry for DnaA protein from Bacillus subtilis: entry DNAA_BACSU
Used FROMEMBL to convert the entry to GCG format, set the graphics to deliver PNG-formatted output to file DNAA_BACSU.PEPPLOT, and ran PEPPLOT.  The output graphics file is as follows:


 

{E. Questions:}

{Answer all of the following questions:}

1. What are some of the links from the ACS Home Page?
Any and all links relevant to the ACS activities: Class Web sites, instructional computing info, system and network status, microcomputer support, etc ...        list any five:       5 pts
2. Why is the ACS Home Page of importance to this course?
It provides ready access to the BIMM 140 / 141 Web pages           3 pts
3. How do the DNASYSTEM Web site and the CMS MBR Web site differ from each other?
DNASYSTEM focuses on Sequence Analysis and Genomics, with sites categorized according to task ... MBR does this and more, including General Biochem, Biotech, General Bioscience, Biomolecular modeling, etc ...    ..3 pts
4. How are these two Web sites similar to each other?
The DNA and Protein Analysis sections are quite similar  ...3 pts
 
5. What are the general categories of Resource links available at CMS MBR?
Protein Analysis, DNA Analysis,  General Biochem, Biotech, General Bioscience, Biomolecular modeling, Comp Bio  ... 3 pts, one each for any two of these
 
6. According to the DNASYSTEM home page, what are the "main sequence databases"?
GenBank, EMBL, PIR, SwissProt, Expasy, TIGR, PDB,
GSDB,  EBI, EC Enzyme ...  5 pts, 1 each for any five
7. Briefly (one sentence each) describe the contents of the following databases: OMIM, ENZYME, Klotho, GDB, PDB, NDB, EPD, ReBase, dbEST.
OMIM:  Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders
ENZYME:  enzyme nomenclature database, describes each enzyme with an EC number
Klotho: part of an attempt to model biological processes, beginning with biochemistry
GDB: The Genome Database (GDB) stores and curates genomic mapping data submitted by researchers worldwide ...
PDB: Protein Data Bank - database of 3D structures of proteins, nucleic acids, and their complexes
NDB: Nucleic acid DataBase - database of 3D structures of nucleic acids
EPD:  Eukaryotic Promoter Database - information about eukaryotic promoters available at EMBL
ReBase:  database of Restriction Enzymes and their properties
dbEST:  NCBI database of EST sequences (Expressed Sequence Tags - cDNA sequences)
9 pts, one point each
8. What is the difference between the SWISS-PROT and TrEMBL Protein databases?
SwissProt has more complete annotation than TrEMBL ... TrEMBL is little more than a translation of EMBL nucleic acid entries, plus some similar annotation ...   3 pts
9. For what reasons might the sequences of protein entries in TrEMBL be incorrect? 4 pts
1) sequence errors in DNA sequencing
2) post-translational modification of the protein
3) non-universal genetic code could be used ... ...
4) gene in DNA sequence incorrectly annotated, e.g. introns not correct, short exon missed, etc 
10. What is meant by "Gene Ontology"?
Gene Ontology refers to the determination of a standard set of terminology for genes, to be used for all organisms ...  3 pts
 
11. Why are "Gene Ontologies" an issue in today's world of whole genome sequencing?
There is now general recognition that a universal scheme for naming of genes, and for their annotation (common keywords used for queries, for example) would be a very good thing ....    3 pts
12. Why are there generally more "hits" found in TrEMBL than in SWISS-PROT?
SwissPROT has more or less "complete" annotation for the proteins ... this takes time and labor, and the SwissPROT workers have not been able to keep up with the rate of submission of new DNA sequences to EMBL, whose translations go directly into TrEMBL ...    3 pts
13. The SwissProt search for two keywords is implicitly a Boolean AND operation. What does this mean?
Entries must contain BOTH keywords in appropriate fields to be reported ...    3 pts
14. Why are there more hits found in the Boolean OR operation than in the Boolean AND operation?
OR reports entries containing either one or the other of the two keywords, as well as all entries reported by the AND operator (those entries containing BOTH keywords) ...    3 pts
15. How do the number of hits found in the Boolean AND and OR operations compare with those found in the Single Keyword searches?
The OR operation will always report at least as many as either single keyword search, since all those found in EACH single Keyword search are reported ... 
The AND operation will always report fewer or the same as a single keyword search, since it reports only those containing BOTH keywords  ...   
4 pts
16. What is the naming convention used for Names of Protein Sequences in SWISS-PROT?
3-4 letter-number name for the Protein, followed by underline, followed by 5 letter designation of the organism.  This latter designation can be a common word for a few common organisms, eg HUMAN, MOUSE, YEAST, but is most often the first 3 letters of the genus plus the first 2 letters of the species, eg DICDI for Dictyostelium discoideum ...    3 pts
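The naming rule described above can be sketched as a small function. The names organism_code and entry_name are hypothetical, and the table of special codes holds only the examples mentioned in this key (HUMAN, MOUSE, YEAST, plus ECOLI from the DNAA_ECOLI entry); the real list of exceptions is longer.

```python
# Special organism codes, limited to the examples that appear in this key.
SPECIAL_CODES = {
    ("Homo", "sapiens"): "HUMAN",
    ("Mus", "musculus"): "MOUSE",
    ("Saccharomyces", "cerevisiae"): "YEAST",
    ("Escherichia", "coli"): "ECOLI",
}

def organism_code(genus, species):
    """Return the special code if one is listed, else 3 genus + 2 species letters."""
    special = SPECIAL_CODES.get((genus, species))
    if special:
        return special
    return (genus[:3] + species[:2]).upper()

def entry_name(protein, genus, species):
    """Join the protein name and the organism code with an underscore."""
    return f"{protein.upper()}_{organism_code(genus, species)}"

print(organism_code("Dictyostelium", "discoideum"))  # DICDI
print(entry_name("rnpA", "Proteus", "mirabilis"))    # RNPA_PROMI
print(entry_name("dnaA", "Escherichia", "coli"))     # DNAA_ECOLI
```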
17. What is the naming convention used for Names of Protein Sequences in TrEMBL?
The Accession Number is used as the Name ...    3 pts
18. What is the difference between the "NiceProt" View of a SWISS-PROT entry and the "original SWISS-PROT format"?
The "original" is a text-only display of the EMBL-style of fields in the flat-file database ...
The "NiceProt" is an html-based Table layout ... looks great on the Web page ... but is terrible to save and use...    6 pts
19. What databases are accessible from sequences displayed by the ExPASy Web site? What other information about your sequence is accessible from this Web site? Examine several sequence entries and their embedded links.
SWISS-PROT provides direct links to MEDLINE, EMBL, Genbank, DDBJ, ECOCYC, PROSITE, PRODOM, SWISS-2DPAGE, PDB, HSSP, SWISS-3DIMAGE, and there may be others. Via SRS, links to SWISSPROT, SWISSNEW, PIR, EMBL, EMNEW, NRL3D, NRSUB, SPTREMBL, REMTREMBL, PDB, HSSP, DSSP, ALI, FSSP, PROSITE, PROSITEDOC, BLOCKS, EPD, ECDC, CPGISLE, PRODOM, FLYGENE, SWISSDOM, SEQANALREF, SEQANALRABS, MEDLINE, FLYREFS, ENZYME, REBASE, LIMB, FLYCLONES, FLYPEOPLE, BIOCAT, FLYWILDSTOCK, FLYABERRATION, UTR, DBEST, DBESTNEW, DBSTS, DBSTSNEW, TFSITE, TFFACTOR, RHDB, RHEXP, RHPANEL and RHMAP are available. For my sequence, CHI1_SOLTU, information from the EC enzyme database, MEDLINE, EMBL, Genbank, DDBJ, PRODOM, and SWISS-2DPAGE was available.
1 pt each up to 10 ... ... ...
10 pts
These probably were NOT links from within your Sequence Entry ... - 3 pts
20. When using boolean operators to combine keywords, does the order of the keywords affect the number of matches found? Briefly explain your answer.
In general no, for single operator AND and OR expressions, the order of keywords does not matter. For the BUTNOT operator, order is important. For more than two keywords, of course, the order could matter. 5 pts
except for the BUTNOT operator ... - 2 pts
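The point about operator order can be checked with sets, treating each keyword as the set of entries that contain it. The entry labels below are made up for illustration; Python's &, |, and - on sets stand in for AND, OR, and NOT (BUTNOT).

```python
# Made-up entry labels: which entries contain each keyword.
dnaa_entries = {"entry1", "entry2", "entry3"}
coli_entries = {"entry2", "entry3", "entry4", "entry5"}

# AND and OR give the same result in either order.
print((dnaa_entries & coli_entries) == (coli_entries & dnaa_entries))  # True
print((dnaa_entries | coli_entries) == (coli_entries | dnaa_entries))  # True

# NOT (set difference) depends on which keyword comes first.
print(sorted(dnaa_entries - coli_entries))  # ['entry1']
print(sorted(coli_entries - dnaa_entries))  # ['entry4', 'entry5']
```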
21. In the "old SP format", what is the function of the first two letters of each line? What do you think these are abbreviations of?
These indicate what the properties are for each field ...    3 pts
These are Unix-like acronyms for each property, eg ID for Identification, AC for Accession Number, DT for Date, GN for Gene Name, RN for Reference Number, CC for Comment, ...    2 pts
22. What are "delimiters"? Does this old SP format have delimiters? If so, what are they? 6 pts
A delimiter is a symbol used to indicate the beginning or end of a database entry in a flatfile database system. (2 pts)
Old SP format has BOTH beginning (ID) and end (//) delimiters ....   
(2 pts each)
 
23. Are there more delimiters in old SP format than needed to distinguish between multiple sequence entries in a single text file (flatfile database)? Briefly explain your answer.
Yes. One actually needs to have EITHER a delimiter to indicate the beginning OR one to indicate the end of an entry ... but NOT both as long as consistency is maintained ... ...    3 pts
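That either delimiter alone is enough can be shown by splitting a small flatfile two ways. The two-entry file below is a shortened version of entries shown earlier in this key; the function names are hypothetical.

```python
# Two shortened entries in one flatfile.
FLATFILE = """\
ID   DNAA_ECOLI     STANDARD;      PRT;   467 AA.
AC   P03004; P78122;
//
ID   RNPA_PROMI     STANDARD;      PRT;   119 AA.
AC   P22835;
//
"""

def split_on_terminator(text):
    """Use only the end delimiter (//) to find entry boundaries."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            entries.append(current)
            current = []
        else:
            current.append(line)
    return entries

def split_on_id_line(text):
    """Use only the start delimiter (ID line) to find entry boundaries."""
    entries = []
    for line in text.splitlines():
        if line.startswith("ID   "):
            entries.append([line])
        elif entries and not line.startswith("//"):
            # The // lines are dropped here only so the two results compare equal.
            entries[-1].append(line)
    return entries

by_end = split_on_terminator(FLATFILE)
by_start = split_on_id_line(FLATFILE)
print(len(by_end), len(by_start), by_end == by_start)  # 2 2 True
```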
24. When two keywords were used with no boolean operators in the SWISS-PROT full text search, why were different numbers of hits reported than using the SP description-id search? Briefly explain the difference in search criteria used by these two search facilities when presented with two keywords joined by a space.
In the "by description-id" search, there is an implicit AND boolean operator assumed when two keywords are used, ie all entries containing BOTH keywords are returned ... ...    3 pts
In the "by full-text" search, however,  NO BOOLEAN operator is assumed, and thus only entries containing both keywords, ONE IMMEDIATELY FOLLOWING THE OTHER, are returned. This also explains the difference in results when one REVERSES the order of the keywords ... ...   3 pts
 
25. What does the wildcard * do when added before and after a keyword?
The wildcard * can stand for any symbols before or after a "keyword", thus expanding the search to return words that are longer than the query keywords, eg *phage returns entries containing "bacteriophage" ...    3 pts
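The effect of the wildcard can be tried on a short word list. The sketch uses Python's shell-style fnmatch patterns as a stand-in for the server's wildcard, which is an assumption; the word list is made up.

```python
from fnmatch import fnmatchcase

# Made-up word list, for illustration only.
WORDS = ["phage", "bacteriophage", "prophage", "macrophages", "repressor"]

def matching(pattern):
    """Return the words that match a pattern in which * stands for any characters."""
    return [word for word in WORDS if fnmatchcase(word, pattern)]

print(matching("phage"))    # ['phage']
print(matching("*phage"))   # ['phage', 'bacteriophage', 'prophage']
print(matching("*phage*"))  # ['phage', 'bacteriophage', 'prophage', 'macrophages']
```

Each added * can only widen the set of matching words, which fits the higher hit counts recorded with the wildcard option selected.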
 
26. Write a query that would find all of the repressor proteins in bacteriophages lambda and P22 (but no others). You need not actually run this query. Why are parentheses a good idea?
repressor AND [lambda OR P22]    3 pts
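Why grouping matters can be seen by evaluating the query both ways on sets. The entry labels below are made up, and the ungrouped version assumes a server that evaluates operators strictly left to right, which is one possible behaviour and not a documented one.

```python
# Made-up entry labels: which entries contain each keyword.
repressor = {"lambda_repressor", "p22_repressor", "lac_repressor"}
lam = {"lambda_repressor", "lambda_integrase"}
p22 = {"p22_repressor", "p22_tailspike"}

# Grouped as intended: repressor AND (lambda OR P22).
grouped = repressor & (lam | p22)

# Evaluated left to right without grouping: (repressor AND lambda) OR P22.
left_to_right = (repressor & lam) | p22

print(sorted(grouped))        # ['lambda_repressor', 'p22_repressor']
print(sorted(left_to_right))  # ['lambda_repressor', 'p22_repressor', 'p22_tailspike']
```

Without the grouping, every P22 entry is returned, whether or not it is a repressor.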
 
27. Write a query that would find all bacteriophage repressor or activator proteins except the repressor from bacteriophage lambda. You need not actually run this query.
[repressor OR activator] NOT lambda ... ... 3 pts
 
28. Briefly explain why the relationships between the boolean operators are true.
The relationships hold because of the overlap characteristics of each of the Boolean operations for the entries containing each of the two keywords  ... see the discussion in Lecture 2 on this ...    3 pts
 
29. What are some of the databases accessed by the NCBI Entrez search system?
PubMed, Nucleotide dbs, GenPept protein dbs, Structure, Genome, OMIM, Domain, Taxonomy dbs ..  5 pts, 1 pt for each of 5 correct answers
 
30. What is a neighborhood as used by NCBI Entrez?
The neighborhood refers to sequences or PubMed references that are closely related to the hit displayed ...
for sequences these are likely to be homologues; for PubMed refs, they are likely to contain similar keywords.   
3 pts
 
31. Why in the Entrez Protein search bacteriophage NOT phage were 7731 matches found?
These are all entries that contain the keyword 'bacteriophage'  BUT do NOT contain the keyword 'phage' per se    3 pts
 
32. Why in these searches were the number of matches for bacteriophage NOT phage and for phage NOT bacteriophage not the same?
The number of entries that contain the keyword 'bacteriophage' but not 'phage' is different from the number of entries that contain the keyword 'phage' but not 'bacteriophage' ... ... 3 pts
 
33. Why were so many matches (31290) found for lambda NOT repressor?
Phage lambda has many sequences that have nothing to do with repressors ...    3 pts
 
34. The same number of matches (46) was found for lambda repressor and for "lambda repressor", and this number is less than the number of matches found for lambda AND repressor. Explain these observations.
This search was done with the 'full text' option, which implicitly reports only entries with the two keywords adjacent to each other ... ...    3 pts
The search for "lambda AND repressor" returns entries containing both keywords but not necessarily adjacent to each other; thus, the previous entries are returned, plus many more ... ...   
3 pts
 
35. What is the meaning of the XOR Boolean operator? Is it supported by NCBI Entrez? Why not?
XOR (exclusive OR) returns entries containing EITHER one of the two keywords but NOT both. ...    3 pts
It is NOT supported by NCBI Entrez, which offers only AND, OR, and NOT ... the same result can be obtained by redoing the query as (A OR B) NOT (A AND B) ...   
3 pts
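XOR in its usual sense (exclusive OR: one keyword or the other, but not both) corresponds to the symmetric difference of two sets, and it can be rebuilt from OR, AND, and NOT. The entry labels below are made up for illustration.

```python
# Made-up entry labels: which entries contain each keyword.
keyword_a = {"entry1", "entry2", "entry3"}
keyword_b = {"entry3", "entry4"}

# Exclusive OR: entries with exactly one of the two keywords.
exclusive = keyword_a ^ keyword_b

# The same result built from OR, AND, and NOT: (A OR B) NOT (A AND B).
rebuilt = (keyword_a | keyword_b) - (keyword_a & keyword_b)

print(sorted(exclusive))     # ['entry1', 'entry2', 'entry4']
print(exclusive == rebuilt)  # True
```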
 
36. What is the purpose of the Limits link in the Search <database> section?
The Limits feature permits the User to further limit the search, eg for entries modified between two particular dates    3 pts
 
37. What is the function of limiting the Fields? What in your ExPASy searches above does this correspond to?
{This is the same question as Question 36 ???}
In ExPASy, one can search by author, by citation, by Accession Number, or by description or ID ... these limitations partially correspond to the NCBI Limits protocols ... ...     3 pts
 
38. What is the default Field used by NCBI Entrez?
By default, All Fields are searched, and the complete range of any given Limit is used ... ...    3 pts
 
39. What is the purpose of the Preview/Index link in the Search <database> section?
This feature at NCBI permits 1) examination of results already obtained from queries already performed, and 2) permits combining new queries with already performed queries ... ... ...    3 pts
 
40. What is the difference between the Preview function and the Index function?
The 'Preview' function permits one to add terms to the query ...
The 'Index' function is used to view terms within a field. These terms are specific indices used for each field.    3 pts
 
41. What is the purpose of the History link in the Search <database> section?
"History" provides a record of the searches already performed.    3 pts
 
42. How does one use the History information in design of subsequent searches?
One can use "History" to combine the results of searches already performed in many ways using the Boolean operators ... One can also use "History" to refine searches by combining a performed search with a new search ... ...    3 pts
 
43. What is the purpose of the Clipboard link in the Search <database> section?
The 'Clipboard' is a temporary place to save the results of a given search ...
Items from the 'Clipboard' can be saved to a file on a local computer disk ...    3 pts
 
44. In a comparison of a GenPept protein sequence entry with a SwissProt protein sequence entry, how do the Fields compare? Which has more annotation? Which has more links to other databases?
They are similar, although SwissProt generally has more annotation and more links to other databases ...    3 pts
 
45. How might the Plain Text display of the GenPept entry documentation be useful compared to the HTML display?
The 'Plain Text' is useful for CUT-PASTE operations and for saving operations ... ...    3 pts
 
46. How would you obtain a display of your five GenPept sequences in FASTA-format?
Select the 5 sequences by clicking in their Select Boxes, choose "FASTA" in the Display menu, and click on the Display button ... ...    3 pts
47. The GenPept format for sequences in Entrez has a lot of information unavailable in the compact FASTA format. One important kind of information is cross-references (DBSOURCE xrefs) to other databases. List the names of at least two sequence databases and two non-sequence databases that are found in Entrez. Note that sequences that originate in SWISS-PROT are likely to have the greatest number of xrefs.
Sequence databases include Genbank, EMBL, Swiss-PROT, and geninfo. Non-sequence databases include MEDLINE, PDB, PROSITE, ECOGENE. There are many possibilities.    5 pts
 
48. What is the required format that a sequence must be in to run the GCG reformat program?
GCG format  ... ...  3 pts
49. Why is the GCG fromembl program used to reformat SWISS-PROT files?
SwissPROT files, in standard text "old" format, are in fact EMBL format   ... ... 3 pts