Gribskov & Smith

 BIMM 141 Laboratory

Spring, 2001

Introduction to Bioinformatics

 

 

Exercise 2

Data Resources: Internet and Sequence Databases

Exercise 2 is a further introduction to the Internet, with an emphasis on the use of the Internet to find, use, and extract nucleic acid and protein sequence information. This information is the primary information in the genomes of organisms, forms the basis of the current genome-wide revolution in molecular biology, and is the basis underlying Bioinformatics.

The Objectives of Exercise 2 are:

All of the links used in this and other Exercises can be reached through the DNASYSTEM Web site, and the CMS Molecular Biology Resource (MBR). Both of these sites are compendia of Web sites of relevance to Bioinformatics.

Most of the needed links are also provided here in the Exercise, often with a new, blank NS Navigator page as target.

 

Relevant Articles from the BIMM 140 Course Reader:

The following articles in the Reader are relevant to Exercise 2:

 

Baxevanis-Ouellette, 2nd Edition, Textbook Relevant Chapters:

 

To do Exercise 2:
As described in the Exercise Web page under "Lab Notebook files" and as you did with Exercise 1, you will create your Lab Notebook file for Exercise 2 using Netscape Composer, a program within Netscape Communicator for the creation of Web pages. Your Lab Notebook for Exercise 2 will thus be a Web Page, i.e. a text html file, with images present as separate gif-formatted files. These will be in the exer2 subdirectory of your BIMM 141 account (these subdirectories are created using procedures you learned in Exercise 1).
Specific instructions are as follows (you can do these in a new Netscape Navigator window as you read the instructions in this Netscape Navigator window; use the New command in the File pulldown menu of Navigator):

  1. Login to your BIMM141 account on one of the Unix computers in 4306 York Hall
  2. Turn on Netscape Communicator, which will turn on the browser Netscape Navigator by default
    1. In a Console or Command Line window, type in the command: nscomm
    2. This will bring up the ACS Home Web Page in NS Navigator as the default Home Page
  3. Go to the Exercises part of the Exercises Web page for BIMM 141.
    1. Click on the Class Web Sites link.
    2. Click on the courses link.
    3. Click on the BIMM 140 link for Spring, 2001.
    4. Click on the Exercises Web page for BIMM 141 and go to the Exercises part.
  4. Click on Template for Exercise 2 - this will bring up the Template or Foundation part of the Exercise 2 (all of the text present as {italicized bold text within curly brackets} ).
  5. Save this as a file in your account:
    1. Go to the File pulldown menu of NS Navigator and select: Save Page as ...
    2. Follow instructions in the popup menu; name the saved Page file something like: exer2.html
  6. Open Netscape Composer by choosing Composer from the Communicator pulldown menu.
  7. Close the blank Composer file and Open your saved Exercise 2 Template file.
  8. Use this to construct your Lab Notebook file for Exercise 2 as you actually do Exercise 2.
  9. Answer the Questions at the end of the Exercise as you proceed through the Exercise

Thus, in actually doing an Exercise, you will have at least three windows open simultaneously:

  1. Netscape Composer - the window in which you do your Exercise by making the file exer2.html starting from the saved template for Exercise 2.
  2. Netscape Navigator - first Navigator window, open to this page, the Exercise 2 Web page, for instructions on what to do for Exercise 2.
  3. task window - window for execution of the tasks requested in a given Exercise
    1. For Exercise 2, the only tasks you will perform are tasks at other Web sites such as NCBI and ExPASy; hence this window will be a second Netscape Navigator window. This is turned on as described in Exercise 1 in one of two general ways:
      1. via the New Navigator command from the File menu from within NS Communicator
      2. by again executing the nscomm command in a new Console or Command Line window; to get such an additional Unix Command Line window, do as described in Exercise 1.

Note: some of the Links in the Exercise are set to open a second NS Navigator window automatically for you.

 



BIMM 140: | Main | 140_Info | Syllabus | Lectures | Exams | DNASYSTEM | CMS MBR |
BIMM 141: | Main | 141_Info | Syllabus | Exercises | DNASYSTEM | CMS MBR |



Main Specific Tasks to Perform in Exercise 2:

A. Use Netscape and the Web to Find and Retrieve a SWISS-PROT Protein Sequence
  1. Turn on Netscape Communicator
  2. Bioinformatics Links
    Go to the DNASYSTEM or the CMS MBR Web Pages
    Spend a little time browsing the DNASYSTEM Home Page
  3. SWISS-PROT and TrEMBL at the ExPASy Web Site
    Access the ExPASy site
    Browse a bit to see what's at ExPASy
  4. Find your Protein Sequence
    Choose "by description or identification"
  5. Copy your Sequence
    Once you have located your sequence, have a look at it
    Obtain a copy of the sequence entry
    Use COPY-PASTE to COPY the Sequence from Netscape window and PASTE it into your Notebook
B. Searches using Boolean Operators.
  1. Go to the SWISS-PROT full-text search service.
  2. Do a Two Keywords search.
  3. Search with AND operator.
C. Looking up Sequences with NCBI Entrez.
  1. Go to NCBI Entrez and read the Entrez help information.
  2. Go to protein database for protein search
  3. Search with single keyword used at ExPASy
  4. Add additional keywords as needed
  5. Examine links "Limits", "Preview/Index", "History", "Clipboard"
  6. Examine Display options
  7. Display GenPept entries for five Protein Matches
  8. Examine the "Plain Text" Display Option
  9. Examine the GenPept Annotation and Fields
  10. Examine the "Graphics" Display Option
  11. Examine the "Add to Clipboard" Feature
  12. Links from each GenPept Protein Entry
  13. Examine the "Related Sequences" Link
  14. Examine the "PubMed" Links
  15. Examine the "Nucleotide", "Genome", and "Taxonomy" Links
D. Introduction to the GCG Package.
  1. Log into the GCG server machine and setup the GCG package.
  2. Familiarize yourself with the capabilities of the GCG package.
  3. Initialize the GCG graphics system
  4. Make a test plot and save it to your lab report.
  5. Format a sequence for use with the GCG package.
  6. Format a sequence using a text editor and the reformat command.
E. Questions.

 

Use of Netscape and the Web to Find and Retrieve Sequences

Use of the Web both 1) for access to the data and to various transformed or reinterpreted versions of the data, and 2) for access to software tools for further analysis of the data is a fundamental feature of this revolution in molecular biology. As a specific example, DNA and protein sequence databases are by and large no longer maintained at individual universities. Rather, specific centers such as NCBI, EBI, and ExPASy maintain these sequence databases. New sequences are submitted to these centers, where software developed at the centers makes the new sequences available to the public immediately. Further, these centers have developed software tools for analysis of data in these databases, as well as new ways of interpreting and displaying such data; further developments in these areas continue at a frenetic pace.


{A. Use Netscape and the Web to Find and Extract a SWISS-PROT Sequence:}

We will now use the Web to find a protein sequence at an appropriate Web site and copy the sequence into a file in your account. Netscape Navigator is an example of a Web "browser". Because we will need a forms-capable browser, you should try to use a version of Netscape labelled 4.0 or higher; this is the case for all ACS computers. Microsoft Internet Explorer is as good for most purposes, and in fact better than Netscape for printouts (Netscape appears not to print out all detail in complex ".com" pages). Internet Explorer, however, does not have a program comparable to Netscape Composer. Microsoft has chosen to have separate programs for Web browsing (Internet Explorer, akin to Netscape Navigator), for Web page editing (MS Word, akin to Netscape Composer), and for email communication (Outlook, akin to Netscape Messenger). Current versions of both of these Web Browsers can be downloaded for free from the Web. It is in general best to stay away from the most recent versions of these programs; too many bugs ...

{1. Turn on Netscape Communicator.}

As in Exercise 1 and as per the Instructions above, to turn on Netscape Communicator on one of the Unix computers in 4306 York Hall, use a Console or Command Line window and type in, followed by a RETURN: nscomm

Netscape Communicator is executed, and you will have a window with colored graphics and colored text from the Netscape browser program, Netscape Navigator. Any of the blue items are hypertext items. You can "go to" those items by just clicking on them with the mouse. You did a little of this in Exercise 1 but you can again:

{maneuver via Hypertext links for a while}

Note that the default Home Page for the ACS Unix computers in 4306 York Hall is the ACS Home Page.

If you have not already done so, as in Exercise 1 and as per the Instructions above, turn on Netscape Composer, and use this to work on your Exercise 2, by using a copy of the Exercise 2 Template as a beginning foundation or Template file.

 

{2. Bioinformatics Links: the DNASYSTEM Web Page or the CMS MBR Web Page}

{a. Go to the DNASYSTEM or the CMS MBR Web Page}

The DNASYSTEM Web Page provides a set of Bioinformatics links maintained by Douglas Smith at UCSD.
The CMS MBR (Molecular Biology Resource) is a somewhat similar set of links provided by Chris Smith at SDSC. There are many "Web Pages of links" for Bioinformatics and Molecular Biology; some of these are listed in the above two Web Pages. We will use the DNASYSTEM or CMS MBR links at numerous times in the course.

You can go to these Web Pages by clicking on the above links.
A new NS Navigator page will open; to Close the page, click on Close in the File pulldown menu.

Note that, in the UCSD section, the DNASYSTEM page has links to BIMM 140 and BIMM 141. Note also the link to the Genomics Workshop in this section of the DNASYSTEM Home page. A set of exercises similar to those used in BIMM 141 was used for this Workshop which was given in the summer of 1997.

 

{Spend a little time browsing the DNASYSTEM Web Page or the CMS MBR Web Page.}

Note that the "Contents via Task Classification" takes you rapidly to other places in the same Home Page.

 

{3. SWISS-PROT and TrEMBL at the ExPASy Web site}

We will now look up and retrieve a protein sequence from the SWISS-PROT protein database at the general site ExPASy in Geneva, Switzerland.

Note that item 3 under "Contents" in the DNASYSTEM Home Page concerns Genome Analysis Task Areas, and that the second item on the list of sites is DB Sites. These Categories of Tasks are the same as those given in the Toolbar Contents:

| Contents | General Sites | DB Sites | MainDBs | MultDB Sites | Other DBs | DB Search |
| Nuc Seq | Prot Seq | Mult Seq | 3D | Organism | AceDBs | Genomes | Pathways | ftp Sites |

 

{From the DNASYSTEM Web Page or from the links above, click on "General Sites"}

Several sites are now available for each of the main sequence databases under B. General Web Sites for Genome Analysis.

 

{Access the ExPASy site by clicking on 'ExPASy'.}

NOTE: People from all over the world are now using the Internet all the time. There may be times when you will not be able to access a given site due to heavy usage or due to the site no longer being available. Sites in Europe are often most rapidly accessed late in the afternoon or early evening when most people in Europe are sleeping; those in Japan are most rapidly accessed early in the morning.

If you get through, note that you are now connected to a computer in Geneva, Switzerland!!

{Browse a bit to see what's at ExPASy ...}

 

{4. Find your Protein Sequence in the SWISS-PROT and TrEMBL Protein Databases}

From the ExPASy Home Page, click on the SWISS-PROT and TrEMBL link.

Note the difference between the SWISS-PROT and TrEMBL Protein Databases.

 

{In "Access to SWISS-PROT and TrEMBL", click on "by description or identification"}

This brings up a typical "searchable index".

{Now enter TWO keywords in the box shown and click "Submit"}

For example, if we were trying to find the sequence of bacteriophage Lambda CI repressor, we might enter the following two words:

lambda repressor

Note the space between the two words. This is equivalent to searching the database for entries that have BOTH the word "lambda" AND the word "repressor" in the description line (the DE line in the EMBL or SWISS-PROT style of documentation). This is also known as a Boolean AND operation. The concept of Boolean operations is considered in more detail in the next part of this Exercise 2.
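
The Boolean AND behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the ExPASy server actually runs; the entry names and description lines are made up, and the function names words_in and search_all are hypothetical.

```python
import re

# Made-up description lines, for illustration only (not real database records).
DESCRIPTIONS = {
    "EXAMPLE_1": "Repressor protein CI of bacteriophage lambda",
    "EXAMPLE_2": "Lambda integrase",
    "EXAMPLE_3": "Lactose operon repressor",
}

def words_in(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_all(query, descriptions):
    """Return the names of entries whose description contains EVERY query word."""
    wanted = words_in(query)
    return sorted(name for name, text in descriptions.items()
                  if wanted <= words_in(text))

print(search_all("lambda repressor", DESCRIPTIONS))  # ['EXAMPLE_1']
print(search_all("repressor lambda", DESCRIPTIONS))  # ['EXAMPLE_1']
print(search_all("repressor", DESCRIPTIONS))         # ['EXAMPLE_1', 'EXAMPLE_3']
```

Because the query words are treated as a set, the order of the two keywords makes no difference in this sketch; a real server may or may not behave the same way, which is one of the things you are asked to test.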

{Try a keyword search with TWO words and record your findings;
Use your own keywords; do NOT use lambda repressor !

also try REVERSING the two keywords.}

Search both the SWISS-PROT and TrEMBL databases (both boxes should be left checked...)

Use keyword(s) of relevance to the finding of proteins of specific interest to you ... or any other keywords you wish to use.

Examples of some reasonable keywords are: transcription factor; MAP kinase; globin mouse; cytochrome bacteria; tryptophan coli; DNA polymerase; protease HIV; dehydrogenase cholera; activator human; heat yeast.

Further examples can be found by looking at the relationships between Gene Ontologies currently being developed for specific organisms and across all organisms and the following other Classification Systems:

These lists should provide lots of ideas for keywords.
You could also look at the following: the list of COG proteins, the TIGR gene indices, or the NCBI Gene Nomenclature Resources for specific organisms.


{Do the same keyword search but with each keyword separately; record your findings}

Results of these searches with "lambda" and/or "repressor":

     Search words       SWISS-PROT   TrEMBL
     -----------------  ----------   ------
     lambda repressor         1          3
     repressor lambda         1          3
     repressor              304        402
     lambda                 135         60

Judicious choice of search keywords is always desirable, and takes practice!

You can also use the Find... command in the Edit dropdown menu in NS Navigator to search for the second keyword. Thus, if we had first searched only for the keyword "repressor" and obtained over 200 entries, we could now use "Find..." to search among these entries for "lambda".

 

{5. Copy your Sequence}

{Choose one of the Protein Sequences you found and have a look at it by clicking on the hypertext-linked SwissProt name for the sequence.}

What should now come up is the ExPASy 'NiceProt View' of the SwissProt or TrEMBL entry.
This is a Web 'table' presentation rather than flat text.

Note that the SwissProt sequence entry ITSELF contains hypertext links in it!
Thus, you can go immediately to references that are cited, or to cognate sequences in other databases.
For some of the entries, you can also get 3-D images of the protein if its crystal structure has been determined !!

{Note the links present to visualize the protein and to learn more about its structural elements.
Browse through some of these links.}


{Obtain a copy of the sequence entry}
by using the Save as ... command from the File dropdown menu.

You have the option to save as "text" or as "source".

{Save copies both as "Text" and "Source" from both "NiceProt View" and "old SP format"}


Note the options at the bottom of the entry, as follows:
1. View entry in original SWISS-PROT format.
2. View entry in raw text format (no links)

{Include some or all of the "Text" version of the "old SP format" in your Notebook}

{Note the First line in the Entry and the Last line in the Entry. What are these?}
{Note also the First Two Letters in each line of the "old SP format".}
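
Because every line of the flat-text format begins with a two-letter code, an entry is easy to take apart with a short program. The following Python sketch groups the lines of an entry by their leading code and pulls out the sequence itself. The entry shown is a shortened, made-up example (not a real SWISS-PROT record), and the function names are hypothetical.

```python
# A shortened, made-up entry in the flat-text style (not a real database record).
SAMPLE_ENTRY = """\
ID   EXAMPLE_PROT   STANDARD;      PRT;    12 AA.
AC   X00000;
DE   MADE-UP EXAMPLE PROTEIN.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

def group_by_line_code(entry_text):
    """Map each two-character line code to the list of data fields that follow it."""
    groups = {}
    for line in entry_text.splitlines():
        if not line.strip():
            continue
        code = line[:2]                     # first two characters of the line
        groups.setdefault(code, []).append(line[5:].rstrip())
    return groups

def extract_sequence(entry_text):
    """Join the residue lines found between the SQ line and the closing // line."""
    residues = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            residues.append(line.replace(" ", ""))
    return "".join(residues)

print(list(group_by_line_code(SAMPLE_ENTRY)))
print(extract_sequence(SAMPLE_ENTRY))       # MKTAYIAKQRQI
```

Compare the codes this prints with the ones you see in the entry you saved; a real entry contains many more of them.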

 

Notice that you can also {Use COPY-PASTE to COPY the Sequence directly from the Netscape window and PASTE it into your Notebook file for Exercise 2.}

To do this, when you find output you wish to include in your Lab Notebook, "select" this output by going to the beginning of the output with the mouse cursor, pressing the mouse button and HOLDING IT DOWN while you drag to the end of the output. The "selected" output is then "highlighted".

Now use the COPY function in the EDIT pulldown menu to copy the selected output.
Now go to your Lab Notebook file in NS Composer, place the cursor at the position in the Lab Notebook file where you wish to insert the output, and execute the PASTE function in the EDIT pulldown menu of NS Composer.

You will note that, when you look at your sequence, the lines are of varying lengths. This is because the text is, by default, displayed in a proportional font, in which different letters have different widths.

Sequence entries are best displayed in a monospaced font such as Courier.

In NS Composer or any html editor, you can select the text and convert it to Formatted text;
this usually converts the text to a monospaced font such as Courier.

Also in NS Composer, you can select Fixed Width under Font in the Format pulldown menu.

Note that the ExPASy 'NiceProt View' of the entry does not permit COPY-PASTE operations as easily as the original SWISS-PROT format.

 

 



More Sophisticated Searches

The goal of the second part of this exercise is to introduce some searching techniques that are more sophisticated than simple keyword searches. First we will use search engines that permit combinations of more than one keyword using boolean operators, and then we will move on to a system, NCBI Entrez, that uses a sophisticated cross-indexing system to allow very powerful searches of both sequences and literature.

 

{B. Searches using Boolean Operators}

For this part of the exercise we use the Full-Text Search option at SWISS-PROT. Other services using boolean operators work in a similar way, although there will be some differences in syntax (check the instructions).

{1. Go to the SWISS-PROT full text search service.}

Note the information provided as

Note in particular that several boolean operators (AND, OR, NOT) are supported.
Note also the option to add the wildcard * before and after the keyword.

Use the same pair of keywords here as you used in the above Search operations.

We first repeat what you did above, to see if the full text search yields more results than just the search by description or identification done above.

 

{2. Enter the two keywords into the search box and hit SUBMIT to start the search. How many matches were reported?}

Think about why this number is different than what you found above (if it is).

{Now reverse the order of the TWO keywords, i.e. "repressor lambda" instead of "lambda repressor."}

{Now go back to the SWISS-PROT full text search page and try again using EACH of the two keywords separately. How many matches were reported?}

{Do the above searches with the "append * before and after" option selected}
{Construct a table of your results as per the following}

Results with "lambda repressor", with comparison to the above results:

                         descrip-ID search       full text search
     Search words       SWISS-PROT   TrEMBL     SWISS-PROT   TrEMBL
     -----------------  ----------   ------     ----------   ------
     lambda repressor         1          3            3          0
        with * appended   * not supported             3          0
     repressor lambda         1          3            0          0
        with * appended   * not supported             0          0
     repressor              304        402         1067        819
        with * appended   * not supported          1096        872
     lambda                 135         60          335        581
        with * appended   * not supported           336        583

 

Note that the above table was done using formatted text.

This SWISS-PROT search engine treats multiple keywords as a phrase that must be matched in its entirety.
Note that many search engines interpret multiple keywords as a boolean "OR" operation. They report all entries that match one or the other of the keywords. The other boolean operators are "AND" and "NOT". The AND operator limits the search to entries that match ALL the keywords and the NOT operator selects entries that do not match the keyword.
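
One way to picture the three operators is as operations on sets of matching entries. The Python sketch below uses made-up entry names; it illustrates the logic only and makes no claim about how any particular search engine is implemented.

```python
# Hypothetical sets of entry names matching each keyword (made up for illustration).
lambda_hits = {"E1", "E2", "E3"}
repressor_hits = {"E3", "E4"}

and_hits = lambda_hits & repressor_hits              # entries matching BOTH keywords
or_hits = lambda_hits | repressor_hits               # entries matching EITHER keyword
lambda_not_repressor = lambda_hits - repressor_hits  # "lambda NOT repressor"
repressor_not_lambda = repressor_hits - lambda_hits  # "repressor NOT lambda"

print(sorted(and_hits))              # ['E3']
print(sorted(or_hits))               # ['E1', 'E2', 'E3', 'E4']
print(sorted(lambda_not_repressor))  # ['E1', 'E2']
print(sorted(repressor_not_lambda))  # ['E4']
```

Note that AND and OR give the same result whichever keyword comes first, whereas NOT depends on the order of the keywords.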

 

{3. Do a search with the same two keywords using the AND operator. How many matches did you get? What are the results when you reverse the order of the keywords?}

{Do the same for both the OR and NOT boolean operators, and construct a table for your Notebook similar to the one below.}

Results for the two keywords "lambda" and "repressor" using the full text search SWISS-PROT search engine:

                         AND operator         OR operator          NOT operator
 Search words       SWISS-PROT  TrEMBL   SWISS-PROT   TrEMBL   SWISS-PROT   TrEMBL
 -----------------  ----------  ------   ----------   ------   ----------   ------
 lambda repressor        41        9        1361       1391        294        572
   with * appended       41        9        1391       1446        295        574
 repressor lambda        41        9        1361       1391       1026        810
   with * appended       41        9        1391       1446       1055        863

 

It is often useful to use braces or parentheses (depending on the server) to force the evaluation of the boolean operators in the desired order. Without the parentheses, it's not always clear how expressions with multiple boolean operators are evaluated.
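
The effect of evaluation order can be seen with a small Python sketch using made-up sets of entry names. Python happens to evaluate & (AND) before | (OR); a given search server may follow a different rule, which is exactly why explicit parentheses are safer.

```python
# Hypothetical sets of entry names matching each keyword (made up for illustration).
lam = {"E1", "E2"}
rep = {"E2", "E3"}
phage = {"E3", "E4"}

# Without parentheses, Python applies & before |, so this is lam | (rep & phage).
implicit = lam | rep & phage
and_first = lam | (rep & phage)
or_first = (lam | rep) & phage

print(sorted(implicit))   # ['E1', 'E2', 'E3']
print(sorted(and_first))  # ['E1', 'E2', 'E3']
print(sorted(or_first))   # ['E3']
```

The same three keywords and the same two operators give different answers depending only on where the parentheses are placed.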

{Try also parts of keywords. For example, for "lambda repressor", we would try "lam AND rep". Note that you must check the "Prefix and append wildcard '*' to words." box with this server. Try the equivalent to "lam* AND *rep*" for your keywords.}

Results for the two keywords "lambda" and "repressor" using the full text search SWISS-PROT search engine:

 Search words               SWISS-PROT  TrEMBL  
 -----------------          ----------  ------   
 lambda AND repressor            41         9  
   with * appended               41         9     
 lam AND rep                      1         8   
 *lam* AND *rep*               1511      1627

 

Summary of the Full Text Search of both SWISS-PROT and TrEMBL protein databases:

With no "prefix and append * wildcard":

1. lambda AND repressor:       50 matches
2. lambda OR repressor:      2752 matches
3. lambda NOT repressor:      866 matches
4. repressor NOT lambda:     1836 matches
5. repressor XOR lambda:        0 matches
6. repressor AND lambda:       50 matches
7. NOT repressor NOT lambda:    0 matches

With the "Prefix and append wildcard '*' to words" button turned on:

8. lam AND rep:              3138 matches

Note the following relationships:

a. (1) = (6)
b. (3) + (4) = (2) - (1): 866 + 1836 = 2702; 2752 - 50 = 2702
c. XOR does not work at ExPASy; one should get (5) = (3) + (4) = (2) - (1) = 2702
d. ExPASy also did not like the NOT ... NOT ... search
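
These relationships can be checked with a few lines of arithmetic. The Python sketch below uses only the match counts listed in the summary above and assumes that each database entry is counted once per query.

```python
# Match counts copied from the summary above (SWISS-PROT plus TrEMBL).
both = 50               # (1) lambda AND repressor
either = 2752           # (2) lambda OR repressor
lambda_only = 866       # (3) lambda NOT repressor
repressor_only = 1836   # (4) repressor NOT lambda

# Entries matching exactly one keyword: what XOR should have reported.
exactly_one = lambda_only + repressor_only
print(exactly_one)      # 2702
print(either - both)    # 2702

# Totals for each keyword alone, and the OR count rebuilt from them.
lambda_total = lambda_only + both
repressor_total = repressor_only + both
print(lambda_total, repressor_total)            # 916 1886
print(lambda_total + repressor_total - both)    # 2752
```

The single-keyword totals of 916 and 1886 agree with the sums of the SWISS-PROT and TrEMBL columns for "lambda" and "repressor" in the full text search table, so the reported counts are consistent with one another.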




{C. Looking up Sequences with NCBI Entrez}

The NCBI Entrez browser system is one of the more useful and powerful searching systems available on the Web. Entrez is a sophisticated system that interrelates nucleic acid sequence, protein sequence, macromolecular structure, complete genome, taxonomic, human genetic disease, and literature databases. Entrez uses extensive precalculated indices to speed searches.

A neighborhood (or list of neighbors) is a central concept in Entrez. NCBI precalculates a variety of neighborhoods for all the entries in their databases. For a protein sequence, a neighborhood might be all other sequences that score above a certain level with the BLAST database searching program. For a research paper, a neighborhood might be defined as other papers that share a certain number of keywords in common. Entrez allows one to use these precalculated neighborhoods to quickly locate groups of DNA sequences, protein sequences and structures, and literature references.

 

{1. Go to NCBI Entrez and read some of the "Entrez Help" information.}

You can in general access NCBI from the General Sites section of the DNASYSTEM home page, or from the Sequence and Structure Databases under the DNA Analysis and Molecular Biology section of the CMS MBR home page.

For this Exercise, you can also click on the above link.

Note the extensive nature of the Entrez Help facility. The section Refining Your Search is of particular value.

 

{2. Return to the Entrez Home page using the Netscape "Back" operation and click on "Proteins" to search for a protein sequence.}

The query page looks somewhat like the SWISS-PROT page you looked at above. As you should expect by now, the search keyword goes in the entry box.

Note however the general nature of the Search box and region (an html table). In particular, the <database> in Search <database> for < keyword(s) > permits the User to choose the database in which to search from any of these Web pages. Thus, you really did not need to go to the Proteins Web page to search the protein database.

 

{3. Enter one of the keywords you used at ExPASy in the search box and click the "Go" button (or just do a RETURN) to start the search.}

The response is divided into three main sections, each in grey at the top of the display, plus the matches:

  1. The top Search <database> for <keyword(s)> section. This section is the search section, and is used to refine your search by using boolean operators and additional keywords. One begins the search either by clicking the Go button or by doing a RETURN on your computer; thus it is very rapid to execute additional, refined searches. Note the links to the eight primary types of databases above this section. Note also the four links below: Limits, Preview/Index, History, Clipboard.
  2. The middle Display <menu> section. This section sets the display options and also handles retrieval of desired data to your home computer (the Save, Text, Add to Clipboard buttons). The Details button provides the actual query syntax, something of little interest to most Users.
  3. The bottom Show <number> section. This section is the items per page section, and is used to vary the number of items per page, and to display a specific page.

 

{4. If the number of matches is greater than 200, use a second and/or third keyword in the <keyword(s)> section together with appropriate Boolean Operators.}

To refine your search, you simply use Boolean operators to add additional keywords. This is done in the same Search Protein for <keywords> area at the top of the typical NCBI page. To learn more precisely how Boolean operators etc. are used, look at the Refining Your Search section of the Help facility. You may even wish to open a new NS Navigator window to use for the Help facility.

Here follow some results for keywords "lambda", "repressor", "bacteriophage", and "phage":

 1. lambda                            31514 matches
 2. phage                             13341 matches
 3. bacteriophage                       695 matches
 4. bacteriophage NOT phage            7731 matches
 5. phage NOT bacteriophage            7177 matches
 6. repressor AND phage                 589 matches
 7. repressor NOT lambda               4569 matches
 8. lambda NOT repressor              31290 matches
 9. lambda XOR repressor               syntax error
10. lambda OR repressor               36083 matches
11. lambda AND repressor                224 matches
12. lambda repressor                     46 matches
13. "lambda repressor"                   46 matches
14. lambda AND repressor AND phage       61 matches
15. lambda AND repressor AND CI          47 matches
16. lambda AND repressor AND cro         33 matches

The aim of this search refinement is to limit your search to only proteins of interest to you, usually only 100-200 or so. In the above, the last three or four searches satisfy such aims. The first ten searches all yield an inordinately large number of matches.

Note the speed with which the retrieval is done, even though the databases are large.

 

{5. Click in turn on each of the four links "Limits", "Preview/Index", "History", "Clipboard". Answer the Exercise Questions on these four links. Redo the AND query with two of your keywords using "Limits" to limit your search to new GenPept entries within the past 5 years. Compare these results with those above.}

Basically, the Limits option permits you to "limit" the search to examination of specific database Fields, specific Modification Dates, specific Organisms, etc.

The Preview/Index simplifies creation of more complex Queries by permitting the User to combine the results of simple Queries already performed.

The History link provides information on Queries already performed, assigns numbers to these, and permits the User to use these again in various combinations, thereby creating more complex Queries.

The Clipboard link shows what items have been saved to the Clipboard for subsequent downloading.

For the search lambda AND repressor,

 

{6. In the "Display <menu>" section, choose different items from the <menu> and click the "Display" button. Briefly describe what the different items are in the <menu>, particularly the "Neighborhood" and "Link" items.}

 

{7. Return to the <menu> item termed "Summary" and click the "Display" button. Select the boxes of any five (5) of the sequences by clicking on each of the boxes. Select "GenPept" from the <menu> and click on the "Display" button. Record in your Lab Notebook what this has done.}

This is the way to select specific matches and to display their complete documentation. Note that GenPept is the cognate to GenBank: just as TrEMBL is the protein translation of each of the genes found in the EMBL nucleic acid databases, so too GenPept is the protein translation of each of the genes found in the GenBank nucleic acid databases.

 

{8. The GenPept sequences are Displayed as HTML. Click on the HTML button, choose "Plain Text", and click on the "Display" button. Record in your Lab Notebook what this has done.}

 

{9. Examine the contents of the GenPept annotation for the first of your Sequences. What are the Fields? Save this GenPept entry to your Lab Notebook.}

 

{10. Return to the display of your GenPept sequences as HTML, and select from <menu> the "Graphics" option and click on the "Display" button. Describe what you see. What options does the User have? The Graphics display is only of the first of your five sequences; how would you observe the graphics for the fourth sequence?}

 

{11. Return to a "Display GenPept as HTML" and click on the "Add to Clipboard" button. Describe what happened.}

 

{12. Choose "Summary" from the <menu> and click on the "Display" button. Note the links to the far right for each of your five GenPept sequences. Briefly describe what each of these links do.}

 

{13. Click on "Related Sequences" for one of your five GenPept Sequences. Briefly describe what this does.}

For Lambda Repressor, clicking on Related Sequences brings up 81 GenPept entries, several of which are Lambda Repressor itself. These Related Sequences constitute the Neighborhood for protein database entry items.

Each of these 81 entries in turn has Related Sequences. One can thus pursue extension of the Neighborhood further. The hope (and expectation) is that such operations will "converge" to a set of protein sequences, each of which is a neighbor to others and such that no other database entries are neighbors to these sequences. Depending on the criteria used to define the neighborhood, this set of protein sequences will then be related to the concepts of a protein family, a protein superfamily, and a profile for these proteins. These concepts will be discussed later in BIMM 140.
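
The idea of following Related Sequences until the set stops growing can be sketched as a simple search over a table of neighbors. In the Python sketch below the neighbor table is made up, the function name expand_neighborhood is hypothetical, and nothing here is meant to describe how NCBI actually computes or stores its neighborhoods.

```python
from collections import deque

# Hypothetical precalculated neighbor lists (made-up entry names).
NEIGHBORS = {
    "P1": {"P2", "P3"},
    "P2": {"P1"},
    "P3": {"P1", "P4"},
    "P4": {"P3"},
    "Q1": {"Q2"},
    "Q2": {"Q1"},
}

def expand_neighborhood(start, neighbors):
    """Follow neighbor links from start until no new entries turn up."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for name in neighbors.get(current, ()):
            if name not in seen:
                seen.add(name)
                queue.append(name)
    return seen

print(sorted(expand_neighborhood("P1", NEIGHBORS)))  # ['P1', 'P2', 'P3', 'P4']
print(sorted(expand_neighborhood("Q1", NEIGHBORS)))  # ['Q1', 'Q2']
```

Starting from P1, the expansion "converges" on four entries and never reaches Q1 or Q2; in this toy table those two groups play the role of two separate families.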

 

{14. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and click on "PubMed" for the GenPept Sequence that you saved for your Lab Notebook above. How do the PubMed References that come up compare with those in the GenPept annotation for this sequence? Now click on "Related Articles" for one of these References. How does this compare with the "Related Sequences" for the GenPept Sequence itself?}

The Related Articles provide the Reference-based Neighborhood.

 

{15. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and do the same for the "Nucleotide", "Genome", and "Taxonomy" links if present. Briefly describe what each does.}

The Nucleotide Sequence is mainly the DNA sequence encoding your Protein.
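
The relationship between a coding DNA sequence and its protein can be made concrete with a short Python sketch. It assumes the standard genetic code, a sequence that starts in frame at the first base, and no introns; the function name translate and the nine-base example are illustrative only and are not taken from any database entry.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAA"))  # ATG = M, GCC = A, TAA = stop
```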

 




{D. Introduction to GCG package:}

The GCG (Genetics Computer Group) package is a general-purpose sequence analysis system that runs on UNIX computers. In the next exercise, we will use the GCG package to examine dotplots and alignments. To prepare for that exercise, we introduce the package here.

{1. Log into the GCG server machine and set up the GCG package. }

The GCG package is licensed only to run on a single processor. For our laboratory this is y4306-su-1. Log onto this machine by typing

ssh y4306-su-1
The GCG package uses many UNIX symbols as it runs. These symbols, which allow you to run the various programs by simply typing their names, must be initialized. To do this, type

prep gcg
After a few seconds you will see a banner announcing "Welcome to the Wisconsin package". This is a synonym for the GCG package.

{2. Familiarize yourself with the capabilities of the GCG package. }

Brief descriptions of the programs in the GCG package are available on the GCG website. As you can see, there are MANY programs. Browse through the entries in the online table to get a feeling for the range of operations you can perform. We will use many, but not all, of these capabilities during the course. More detailed information about each program is available locally as HTML pages. Use Netscape or Internet Explorer to open the local file

/software/nonrdist/GCG_10.1/gcghelp/html/unix/users_guide.html

{3. Initialize the GCG graphics system. }

Many, if not most, programs in the GCG package produce graphical output. This output can be viewed on a variety of devices, so you must tell the system which kind of output to produce. You can read about some of the options in Appendix C of the GCG Users Guide. We will most often display graphics on the terminal screen. This requires using an "XWindows" graphics driver. To initialize the XWindows driver, type xwindows. To select the default "color workstation", simply hit return (enter). Note that the GCG system always displays default values within (* *). You can always accept the default value by simply hitting return (enter). Some other graphics drivers that may be useful are the PNG driver (for including graphics in your lab reports) and the PostScript driver (for publications).

{4. Make a test plot and save it to your lab report. }

Use the GCG plottest program to produce a test plot. First view it on the screen using the XWindows driver, then save a copy in PNG format and include it in your notebook.

{5. Format a sequence for use with the GCG package. }

Like many packages, the GCG package requires files to be in a certain format to work with the programs in the system. This applies most particularly to sequence files. The GCG package has a number of utilities that will reformat sequences between various native formats and the GCG format. See the section on "reformatting sequence files" in the Users Guide section on "Using sequences" for more information. Use the GCG program fromembl to reformat the SWISS-PROT sequence you retrieved in this exercise (A5, above).

{6. Format a sequence using a text editor and the reformat command. }

Sometimes you will want to edit a sequence manually, or a sequence will be in an unusual format. In this case you can edit the sequence using any text editor and reformat it using the GCG reformat command. Read the section on "Creating and editing single sequences" in the "Using sequences" section of the GCG Users Guide. Try this out using the sequence you retrieved from Entrez in section C7, above. After reformatting the sequence, run the GCG program pepplot and include the output in your notebook to confirm that the formatting process was successful.
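
If you would rather not remove position numbers and spaces by hand, the cleanup before reformatting can be sketched in Python. This is a minimal sketch and not part of the GCG package. It assumes you have pasted only the sequence lines of a GenPept entry (residues in lowercase blocks, each line starting with a position number) and none of the annotation. The function names and the residues in the example are made up for illustration.

```python
def sequence_letters(raw_text):
    """Keep only the letters of a pasted sequence block, upper-cased.

    Position numbers, spaces, line breaks, and a trailing // are dropped.
    Assumes raw_text holds sequence lines only, with no annotation text.
    """
    return "".join(ch for ch in raw_text if ch.isalpha()).upper()

def wrap(seq, width=60):
    """Break a sequence into lines of at most width residues."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))

# Made-up residues laid out like the sequence block of a flatfile entry.
pasted = """        1 mstkkkpltq eqledarrlk
       21 aiyek
//"""

clean = sequence_letters(pasted)
print(wrap(clean, width=10))
```

The cleaned text still has to be put into whatever form the reformat command expects, as described in the Users Guide; the sketch only removes the numbering and spacing.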



{E. Questions:}

{Answer all of the following questions:}

  1. What are some of the links from the ACS Home Page?
  2. Why is the ACS Home Page of importance to this course?
  3. How do the DNASYSTEM Web site and the CMS MBR Web site differ from each other?
  4. How are these two Web sites similar to each other?
  5. What are the general categories of Resource links available at CMS MBR?
  6. According to the DNASYSTEM home page, what are the "main sequence databases"?
  7. Briefly (one sentence each) describe the contents of the following databases: OMIM, ENZYME, Klotho, GDB, PDB, NDB, EPD, ReBase, dbEST.
  8. What is the difference between the SWISS-PROT and TrEMBL Protein databases?
  9. For what reasons might the sequences of protein entries in TrEMBL be incorrect?
  10. What is meant by "Gene Ontology"?
  11. Why are "Gene Ontologies" an issue in today's world of whole genome sequencing?
  12. Why are there generally more "hits" found in TrEMBL than in SWISS-PROT?
  13. The SwissProt search for two keywords is implicitly a Boolean AND operation. What does this mean?
  14. Why are there more hits found in the Boolean OR operation than in the Boolean AND operation?
  15. How do the numbers of hits found in the Boolean AND and OR operations compare with those found in the Single Keyword searches?
  16. What is the naming convention used for Names of Protein Sequences in SWISS-PROT?
  17. What is the naming convention used for Names of Protein Sequences in TrEMBL?
  18. What is the difference between the "NiceProt" View of a SWISS-PROT entry and the "original SWISS-PROT format"?
  19. What databases are accessible from sequences displayed by the ExPASy Web site? What other information about your sequence is accessible from this Web site? Examine several sequence entries and their embedded links.
  20. When using boolean operators to combine keywords, does the order of the keywords affect the number of matches found? Briefly explain your answer.
  21. In the "old SP format", what is the function of the first two letters of each line? What do you think these are abbreviations of?
  22. What are "delimiters"? Does this old SP format have delimiters? If so, what are they?
  23. Are there more delimiters in old SP format than needed to distinguish between multiple sequence entries in a single text file (flatfile database)? Briefly explain your answer.
  24. When two keywords were used with no boolean operators in the SWISS-PROT full text search, why were different numbers of hits reported than using the SP description-id search? Briefly explain the difference in search criteria used by these two search facilities when presented with two keywords joined by a space.
  25. What does the wildcard * do when added before and after a keyword?
  26. Write a query that would find all of the repressor proteins in bacteriophages lambda and P22 (but no others). You need not actually run this query. Why are parentheses a good idea?
  27. Write a query that would find all bacteriophage repressor or activator proteins except the repressor from bacteriophage lambda. You need not actually run this query.
  28. Briefly explain why the relationships between the boolean operators are true.
  29. What are some of the databases accessed by the NCBI Entrez search system?
  30. What is a neighborhood as used by NCBI Entrez?
  31. Why were 7731 matches found in the Entrez Protein search for bacteriophage NOT phage?
  32. Why were the numbers of matches for bacteriophage NOT phage and for phage NOT bacteriophage not the same in these searches?
  33. Why were so many matches (31290) found for lambda NOT repressor?
  34. The same number of matches (46) was found for lambda repressor and for "lambda repressor", and this number is less than the number of matches found for lambda AND repressor. Explain these observations.
  35. What is the meaning of the XOR Boolean operator? Is it supported by NCBI Entrez? Why not?
  36. What is the purpose of the Limits link in the Search <database> section?
  37. What is the function of limiting the Fields? What in your ExPASy searches above does this correspond to?
  38. What is the default Field used by NCBI Entrez?
  39. What is the purpose of the Preview/Index link in the Search <database> section?
  40. What is the difference between the Preview function and the Index function?
  41. What is the purpose of the History link in the Search <database> section?
  42. How does one use the History information in design of subsequent searches?
  43. What is the purpose of the Clipboard link in the Search <database> section?
  44. In a comparison of a GenPept protein sequence entry with a SwissProt protein sequence entry, how do the Fields compare? Which has more annotation? Which has more links to other databases?
  45. How might the Plain Text display of the GenPept entry documentation be useful compared to the HTML display?
  46. How would you obtain a display of your five GenPept sequences in FASTA-format?
  47. The GenPept format for sequences in Entrez has a lot of information unavailable in the compact FASTA format. One important kind of information is cross-references (DBSOURCE xrefs) to other databases. List the names of at least two sequence databases and two non-sequence databases that are found in Entrez. Note that sequences that originate in SWISS-PROT are likely to have the greatest number of xrefs.
  48. What is the required format that a sequence must be in to run the GCG reformat program?
  49. Why is the GCG fromembl program used to reformat SWISS-PROT files?

 





Latest modification: 2 April, 2001

If you have problems or questions, send email to Michael Gribskov or to Doug Smith