Exercise 2

{A. Use Netscape and the Web to Find and Extract a SWISS-PROT Sequence:}

{1. Turn on Netscape Communicator.}

{Maneuver via hypertext links for a while.}

{2. Bioinformatics Links: the DNASYSTEM Web Page or the CMS MBR Web Page}

{a. Go to the DNASYSTEM or the CMS MBR Web Page}

{Spend a little time browsing the DNASYSTEM Web Page or the CMS MBR Web Page.}

{3. SWISS-PROT and TrEMBL at the ExPASy Web site}

{From the DNASYSTEM Web Page or from the links above, click on "General Sites".}

{Access the ExPASy site by clicking on "ExPASy".}

{Browse a bit to see what's at ExPASy ...}

{4. Find your Protein Sequence in the SWISS-PROT and TrEMBL Protein Databases}

{In "Access to SWISS-PROT and TrEMBL", click on "by description or identification".}

{Now enter TWO keywords in the box shown and click "Submit"}

{Try a keyword search with TWO words and record your findings. Use your own keywords; do NOT use "lambda repressor"! Also try REVERSING the two keywords.}

{Do the same keyword search but with each keyword separately; record your findings}

{5. Copy your Sequence}

{Choose one of the Protein Sequences you found and have a look at it by clicking on the hypertext-linked SwissProt name for the sequence.}

{Note the links present to visualize the protein and to learn more about its structural elements. Browse through some of these links.}

{Obtain a copy of the sequence entry}

{Save copies both as "Text" and "Source" from both "NiceProt View" and "old SP format"}

{Include some or all of the "Text" version of the "old SP format" in your Notebook}

{Note the First line in the Entry and the Last line in the Entry. What are these?}

{Note also the First Two Letters in each line of the "old SP format".}

{Use COPY-PASTE to COPY the Sequence directly from the Netscape window and PASTE it into your Notebook file for Exercise 1.}
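
{When you examine the first and last lines and the first two letters of each line in the "old SP format", a short script can do the tallying for you. The following is a minimal sketch in Python, not part of the original exercise. The entry text TOY_ENTRY is made up purely to mimic the line layout and is NOT a real SWISS-PROT record; the function names are hypothetical. Paste your own saved "Text" copy in place of the toy entry.}

```python
from collections import Counter

# Made-up toy entry (NOT a real SWISS-PROT record); it only mimics the layout
# in which every line starts with a short prefix.
TOY_ENTRY = """\
ID   TOY_EXAMPLE    STANDARD;      PRT;    12 AA.
AC   X00000;
DE   HYPOTHETICAL TOY PROTEIN.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def tally_line_codes(entry_text):
    """Count how often each two-character line prefix occurs."""
    counts = Counter()
    for line in entry_text.splitlines():
        if line.strip():  # ignore blank lines
            counts[line[:2]] += 1
    return counts


def first_and_last_codes(entry_text):
    """Return the two-character prefixes of the first and last non-blank lines."""
    lines = [ln for ln in entry_text.splitlines() if ln.strip()]
    return lines[0][:2], lines[-1][:2]


if __name__ == "__main__":
    for code, count in tally_line_codes(TOY_ENTRY).items():
        print(repr(code), count)
    print("first/last:", first_and_last_codes(TOY_ENTRY))
```

{Run the sketch on the entry you saved and compare the prefixes it reports with what you see in the Netscape window.}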

{B. Searches using Boolean Operators}

{1. Go to the SWISS-PROT full text search service.}

{2. Enter the two keywords into the search box and hit SUBMIT to start the search. How many matches were reported?}

{Now reverse the order of the TWO keywords, i.e. "repressor lambda" instead of "lambda repressor."}

{Now go back to the SWISS-PROT full text search page and try again using EACH of the two keywords. How many matches were reported?}

{Do the above searches with the "append * before and after" option selected.}

{Construct a table of your results as per the following}

{3. Do a search with the same two keywords using the AND operator. How many matches did you get? What are the results when you reverse the order of the keywords?}

{Do the same for both the OR and NOT boolean operators, and construct a table for your Notebook similar to the one below.}

{Try also parts of keywords. For example, for "lambda repressor", we would try "lam AND rep". Note that you must check the "Prefix and append wildcard '*' to words." box with this server. Try the equivalent to "lam* AND *rep*" for your keywords.}
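
{The Boolean searches above can be pictured as operations on sets of matching entries. The following is a minimal sketch in Python under stated assumptions: the four entries in ENTRIES are invented for illustration, the function name matches is hypothetical, and the whole-word and wildcard matching rules are simplifications that may differ from what the ExPASy server actually does.}

```python
# Hypothetical mini-database: entry name -> description (invented for illustration).
ENTRIES = {
    "ENTRY_A": "lambda repressor protein",
    "ENTRY_B": "lambda integrase",
    "ENTRY_C": "p22 repressor protein",
    "ENTRY_D": "lactose operon repressor",
}


def matches(keyword, wildcard=False):
    """Return the set of entry names whose description matches the keyword.

    Without wildcard, the keyword must equal a whole word.
    With wildcard, it may occur anywhere inside a word (like *keyword*).
    """
    keyword = keyword.lower()
    hits = set()
    for name, text in ENTRIES.items():
        words = text.lower().split()
        if wildcard:
            found = any(keyword in word for word in words)
        else:
            found = keyword in words
        if found:
            hits.add(name)
    return hits


lam = matches("lambda")
rep = matches("repressor")

print("lambda AND repressor:", sorted(lam & rep))   # set intersection
print("lambda OR repressor: ", sorted(lam | rep))   # set union
print("lambda NOT repressor:", sorted(lam - rep))   # set difference
print("repressor NOT lambda:", sorted(rep - lam))
print("*lam* AND *rep*:     ",
      sorted(matches("lam", wildcard=True) & matches("rep", wildcard=True)))
```

{Replace the toy entries and keywords with your own and compare the counts with the tables in your Notebook.}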

{C. Looking up Sequences with NCBI Entrez}

{1. Go to NCBI Entrez and read some of the "Entrez Help" information.}

{2. Return to the Entrez Home page using the Netscape "Back" operation and click on "Proteins" to search for a protein sequence.}

{3. Enter one of the keywords you used at ExPASy in the search box and click the "Go" button (or just do a RETURN) to start the search.}

{4. If the number of matches is greater than 200, use a second and/or third keyword in the <keyword(s)> section together with appropriate Boolean Operators.}

{5. Click in turn on each of the four links "Limits", "Preview/Index", "History", "Clipboard". Answer the Exercise Questions on these four links. Redo the AND query with two of your keywords using "Limits" to limit your search to new GenPept entries within the past 5 years. Compare these results with those above.}

{6. In the "Display <menu>" section, choose different items from the <menu> and click the "Display" button. Briefly describe what the different items are in the <menu>, particularly the "Neighborhood" and "Link" items.}

{7. Return to the <menu> item termed "Summary" and click the "Display" button. Select the boxes of any five (5) of the sequences by clicking on each of the boxes. Select "GenPept" from the <menu> and click on the "Display" button. Record in your Lab Notebook what this has done.}

{8. The GenPept sequences are Displayed as HTML. Click on the HTML button, choose "Plain Text", and click on the "Display" button. Record in your Lab Notebook what this has done.}

{9. Examine the contents of the GenPept annotation for the first of your Sequences. What are the Fields? Save this GenPept entry to your Lab Notebook.}

{10. Return to the display of your GenPept sequences as HTML, and select from <menu> the "Graphics" option and click on the "Display" button. Describe what you see. What options does the User have? The Graphics display is only of the first of your five sequences; how would you observe the graphics for the fourth sequence?}

{11. Return to a "Display GenPept as HTML" and click on the "Add to Clipboard" button. Describe what happened.}

{12. Choose "Summary" from the <menu> and click on the "Display" button. Note the links to the far right for each of your five GenPept sequences. Briefly describe what each of these links do.}

{13. Click on "Related Sequences" for one of your five GenPept Sequences. Briefly describe what this does.}

{14. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and click on "PubMed" for the GenPept Sequence that you saved for your Lab Notebook above. How do the PubMed References that come up compare with those in the GenPept annotation for this sequence? Now click on "Related Articles" for one of these References. How does this compare with the "Related Sequences" for the GenPept Sequence itself?}

{15. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and do the same for the "Nucleotide", "Genome", and "Taxonomy" links if present. Briefly describe what each does.}
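
{The keyword queries typed into the Entrez search box in this section can also be written out as plain strings. The following is a minimal sketch in Python that only builds such a query and a search URL; it does not contact NCBI. It assumes the NCBI E-utilities ESearch address and its "db" and "term" parameters, which are not part of this exercise, and the function names are hypothetical.}

```python
from urllib.parse import urlencode

# Assumed endpoint (NCBI E-utilities ESearch); not mentioned in the exercise itself.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keywords, operator="AND"):
    """Join keywords with one Boolean operator, e.g. 'lambda AND repressor'."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(keywords)


def build_search_url(keywords, operator="AND", database="protein"):
    """Return a URL-encoded search address for the given keywords."""
    query = urlencode({"db": database, "term": build_term(keywords, operator)})
    return f"{ESEARCH}?{query}"


if __name__ == "__main__":
    print(build_term(["lambda", "repressor"]))
    print(build_search_url(["lambda", "repressor"], operator="NOT"))
```

{Writing the query out this way makes it easy to see which keyword comes first when you reverse the order for the NOT operator.}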

{D. Introduction to the GCG Package}

{1. Log into the GCG server machine and set up the GCG package.}

{2. Familiarize yourself with the capabilities of the GCG package.}

{3. Initialize the GCG graphics system.}

{4. Make a test plot and save it to your lab report.}

{5. Format a sequence for use with the GCG package.}

{6. Format a sequence using a text editor and the reformat command.}

{E. Questions:}

{Answer all of the following questions:}

1. What are some of the links from the ACS Home Page?
2. Why is the ACS Home Page of importance to this course?
3. How do the DNASYSTEM Web site and the CMS MBR Web site differ from each other?
4. How are these two Web sites similar to each other?
5. What are the general categories of Resource links available at CMS MBR?
6. According to the DNASYSTEM home page, what are the "main sequence databases"?
7. Briefly (one sentence each) describe the contents of the following databases: OMIM, ENZYME, Klotho, GDB, PDB, NDB, EPD, ReBase, dbEST.
8. What is the difference between the SWISS-PROT and TrEMBL Protein databases?
9. For what reasons might the sequences of protein entries in TrEMBL be incorrect?
10. What is meant by "Gene Ontology"?
11. Why are "Gene Ontologies" an issue in today's world of whole genome sequencing?
12. Why are there generally more "hits" found in TrEMBL than in SWISS-PROT?
13. The SwissProt search for two keywords is implicitly a Boolean AND operation. What does this mean?
14. Why are there more hits found in the Boolean OR operation than in the Boolean AND operation?
15. How do the numbers of hits found in the Boolean AND and OR operations compare with those found in the Single Keyword searches?
16. What is the naming convention used for Names of Protein Sequences in SWISS-PROT?
17. What is the naming convention used for Names of Protein Sequences in TrEMBL?
18. What is the difference between the "NiceProt" View of a SWISS-PROT entry and the "original SWISS-PROT format"?
19. What databases are accessible from sequences displayed by the ExPASy Web site? What other information about your sequence is accessible from this Web site? Examine several sequence entries and their embedded links.
20. When using boolean operators to combine keywords, does the order of the keywords affect the number of matches found? Briefly explain your answer.
21. In the "old SP format", what is the function of the first two letters of each line? What do you think these are abbreviations of?
22. What are "delimiters"? Does this old SP format have delimiters? If so, what are they?
23. Are there more delimiters in old SP format than needed to distinguish between multiple sequence entries in a single text file (flatfile database)? Briefly explain your answer.
24. When two keywords were used with no boolean operators in the SWISS-PROT full text search, why were different numbers of hits reported than using the SP description-id search? Briefly explain the difference in search criteria used by these two search facilities when presented with two keywords joined by a space.
25. What does the wildcard * do when added before and after a keyword?
26. Write a query that would find all of the repressor proteins in bacteriophages lambda and P22 (but no others). You need not actually run this query. Why are parentheses a good idea?
27. Write a query that would find all bacteriophage repressor or activator proteins except the repressor from bacteriophage lambda. You need not actually run this query.
28. Briefly explain why the relationships between the boolean operators are true.
29. What are some of the databases accessed by the NCBI Entrez search system?
30. What is a neighborhood as used by NCBI Entrez?
31. In the Entrez Protein search, why were 7731 matches found for bacteriophage NOT phage?
32. In these searches, why was the number of matches for bacteriophage NOT phage not the same as the number for phage NOT bacteriophage?
33. Why were so many matches (31290) found for lambda NOT repressor?
34. The same number of matches (46) was found for lambda repressor and for "lambda repressor", and this number is less than the number of matches found for lambda AND repressor. Explain these observations.
35. What is the meaning of the XOR Boolean operator? Is it supported by NCBI Entrez? Why not?
36. What is the purpose of the Limits link in the Search <database> section?
37. What is the function of limiting the Fields? What in your ExPASy searches above does this correspond to?
38. What is the default Field used by NCBI Entrez?
39. What is the purpose of the Preview/Index link in the Search <database> section?
40. What is the difference between the Preview function and the Index function?
41. What is the purpose of the History link in the Search <database> section?
42. How does one use the History information in design of subsequent searches?
43. What is the purpose of the Clipboard link in the Search <database> section?
44. In a comparison of a GenPept protein sequence entry with a SwissProt protein sequence entry, how do the Fields compare? Which has more annotation? Which has more links to other databases?
45. How might the Plain Text display of the GenPept entry documentation be useful compared to the HTML display?
46. How would you obtain a display of your five GenPept sequences in FASTA-format?
47. The GenPept format for sequences in Entrez has a lot of information unavailable in the compact FASTA format. One important kind of information is cross-references (DBSOURCE xrefs) to other databases. List the names of at least two sequence databases and two non-sequence databases that are found in Entrez. Note that sequences that originate in SWISS-PROT are likely to have the greatest number of xrefs.
48. What is the required format that a sequence must be in to run the GCG reformat program?
49. Why is the GCG fromembl program used to reformat SWISS-PROT files?