Gribskov & Smith
BIMM 141 Laboratory
Spring, 2001
Introduction to Bioinformatics

Exercise 2 is a further introduction to the Internet, with an emphasis on the use of the Internet to find, use, and extract nucleic acid and protein sequence information. This information is the primary information in the genomes of organisms, forms the basis of the current genome-wide revolution in molecular biology, and is the basis underlying Bioinformatics.
The Objectives of Exercise 2 are:
All of the links used in this and other Exercises can be reached through the DNASYSTEM Web site, and the CMS Molecular Biology Resource (MBR). Both of these sites are compendia of Web sites of relevance to Bioinformatics.
Most of the needed links are also provided here in the Exercise, often with a new, blank NS Navigator page as target.
The following articles in the Reader are relevant to Exercise 2:
Baxevanis-Ouellette, 2nd Edition, Textbook Relevant Chapters:
To do Exercise 2:
As described in the Exercise Web page under "Lab
Notebook files" and as you did with Exercise 1,
you will create your Lab Notebook file for Exercise 2 using
Netscape Composer, a program within Netscape Communicator
for the creation of Web pages. Your Lab Notebook for Exercise
2 will thus be a Web Page, i.e. a text html file, with images
present as separate gif-formatted files. These will be in the
exer2 subdirectory of your BIMM 141 account (these subdirectories
are created using procedures
you learned in Exercise 1).
Specific instructions are as follows (you can do these
in a new Netscape Navigator window as you read
the instructions in this Netscape Navigator window; use
the New command in the File pulldown Navigator
menu):
Thus, in actually doing an Exercise, you will have at least three windows open simultaneously:
Note: some of the Links in the Exercise are set to open a second NS Navigator window automatically for you.
Use of the Web for both 1) access to the data and to various transformed or reinterpreted versions of the data, and 2) access to software tools for further analysis of the data is a fundamental feature of this revolution in molecular biology. As a specific example, DNA and protein sequence databases are by and large no longer maintained at individual universities. Rather, specific centers such as NCBI, EBI, and ExPASy maintain these sequence databases. New sequences are submitted to these centers, where software developed for this purpose immediately makes the new sequences available to the public. Further, these centers have developed software tools for analysis of data in these databases, as well as new ways of interpreting and displaying such data; further developments in these areas continue at a frenetic pace.
We will now use the Web to find a protein sequence at an appropriate Web site and copy the sequence into a file in your account. Netscape Navigator is an example of a Web "browser". Because we will need a forms-capable browser, you should try to use a version of Netscape labelled 4.0 or higher; this is the case for all ACS computers. Microsoft Internet Explorer is as good for most purposes, and in fact better than Netscape for printouts (Netscape appears not to print out all detail in complex ".com" pages). Internet Explorer, however, does not have a program comparable to Netscape Composer. Microsoft has chosen to have separate programs for Web browsing (Internet Explorer, akin to Netscape Navigator), for Web page editing (MS Word, akin to Netscape Composer), and for email communication (Outlook, akin to Netscape Messenger). Current versions of both of these Web Browsers can be downloaded for free from the Web. It is in general best to stay away from the most recent versions of these programs; too many bugs ...
{1. Turn on Netscape Communicator.}
As in Exercise 1 and as per the Instructions above, to turn on Netscape Communicator on one of the Unix computers in 4306 York Hall, use a Console or Command Line window and type in, followed by a RETURN: nscomm
Netscape Communicator is executed, and you will have a window with colored graphics and colored text from the Netscape browser program, Netscape Navigator. Any of the blue items are hypertext items. You can "go to" those items by just clicking on them with the mouse. You did a little of this in Exercise 1 but you can again:
{Maneuver via Hypertext links for a while}
Note that the default Home Page for the ACS Unix computers in 4306 York Hall is the ACS Home Page.
If you have not already done so, as in Exercise 1 and as per the Instructions above, turn on Netscape Composer, and use this to work on your Exercise 2, by using a copy of the Exercise 2 Template as a beginning foundation or Template file.
{2. Bioinformatics Links: the DNASYSTEM Web Page or the CMS MBR Web Page}
{a. Go to the DNASYSTEM or the CMS MBR Web Page}
The DNASYSTEM Web Page provides a set of Bioinformatics
links maintained by Douglas Smith at UCSD.
The CMS MBR (Molecular Biology Resource) is a somewhat
similar set of links provided by Chris Smith at SDSC. There are
many "Web Pages of links" for Bioinformatics and
Molecular Biology; some of these are listed in the above two Web
Pages. We will use the DNASYSTEM or CMS MBR links at numerous times
in the course.
You can go to these Web Pages by clicking on the above links.
A new NS Navigator page will open; to Close the
page, click on Close in the File pulldown menu.
Note that, in the UCSD section, the DNASYSTEM page has links to BIMM 140 and BIMM 141. Note also the link to the Genomics Workshop in this section of the DNASYSTEM Home page. A set of exercises similar to those used in BIMM 141 was used for this Workshop which was given in the summer of 1997.
{Spend a little time browsing the DNASYSTEM Web Page or the CMS MBR Web Page.}
Note that the "Contents via Task Classification" takes you rapidly to other places in the same Home Page.
{3. SWISS-PROT and TrEMBL at the ExPASy Web site}
We will now look up and retrieve a protein sequence from the SWISS-PROT protein database at the general site ExPASy in Geneva, Switzerland.
Note that item 3 under "Contents" in the DNASYSTEM Home Page concerns Genome Analysis Task Areas, and that the second item on the list of sites is DB Sites. These Categories of Tasks are the same as those given in the Toolbar Contents:
| Contents | General Sites | DB Sites | MainDBs | MultDB Sites | Other DBs | DB Search |
| Nuc Seq | Prot Seq | Mult Seq | 3D | Organism | AceDBs | Genomes | Pathways | ftp Sites |
{From the DNASYSTEM Web Page or from the links above, click on "General Sites"}
Several sites are now available for each of the main sequence databases under B. General Web Sites for Genome Analysis.
{Access the ExPASy site by clicking on 'ExPASy'.}
NOTE: People from all over the world are now using the Internet all the time. There may be times when you will not be able to access a given site due to heavy usage or due to the site no longer being available. Sites in Europe are often most rapidly accessed late in the afternoon or early evening when most people in Europe are sleeping; those in Japan are most rapidly accessed early in the morning.
If you get through, note that you are now connected to a computer in Geneva, Switzerland!!
{Browse a bit to see what's at ExPASy ...}
{4. Find your Protein Sequence in the SWISS-PROT and TrEMBL Protein Databases}
From the ExPASy Home Page, click on the SWISS-PROT and TrEMBL link.
Note the difference between the SWISS-PROT and TrEMBL Protein Databases.
{In "Access to SWISS-PROT and TrEMBL", click on "by description or identification"}
This brings up a typical "searchable index".
{Now enter TWO keywords in the box shown and click "Submit"}
For example, if we were trying to find the sequence of bacteriophage Lambda CI repressor, we might enter the following two words:
lambda repressor
Note the space between the two words. This is equivalent to searching the database for entries that have BOTH the word "lambda" AND the word "repressor" in the description line (the DE line in the EMBL or SWISS-PROT style of documentation). This is also known as a Boolean AND operation. The concept of Boolean operations is considered in more detail in the next part of this Exercise 2.
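The Boolean AND behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ExPASy search code: the description lines are invented, and the helper names words_in and matches_all are hypothetical.

```python
import re

def words_in(description: str) -> set:
    """Lower-case the description and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", description.lower()))

def matches_all(description: str, keywords: list) -> bool:
    """Boolean AND: every keyword must occur as a word in the description."""
    present = words_in(description)
    return all(keyword.lower() in present for keyword in keywords)

# Hypothetical description lines, invented for illustration only.
descriptions = [
    "Repressor protein CI (phage lambda)",
    "Lambda-like tail fiber protein",
    "Tryptophan repressor",
]

hits = [d for d in descriptions if matches_all(d, ["lambda", "repressor"])]
reversed_hits = [d for d in descriptions if matches_all(d, ["repressor", "lambda"])]

print(hits)                   # only the entry containing BOTH words
print(hits == reversed_hits)  # keyword order does not matter for AND
```

Because AND only asks whether each word is present, reversing the keywords gives the same entries in this sketch.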
{Try a keyword search with TWO words and record your
findings;
Use your own keywords; do NOT use lambda repressor !
also try REVERSING the two keywords.}
Search both the SWISS-PROT and TrEMBL databases (both boxes should be left checked...)
Use keyword(s) of relevance to the finding of proteins of specific interest to you ... or any other keywords you wish to use.
Examples of some reasonable keywords are: transcription factor; MAP kinase; globin mouse; cytochrome bacteria; tryptophan coli; DNA polymerase; protease HIV; dehydrogenase cholera; activator human; heat yeast.
Further examples can be found by looking at the relationships between Gene Ontologies currently being developed for specific organisms and across all organisms and the following other Classification Systems:
These lists should provide lots of ideas for keywords.
You could also look at the following: list
of COG proteins; TIGR
gene indices, or NCBI
Gene Nomenclature Resources for specific organisms
{Do the same keyword search but with each keyword separately;
record your findings}
Results of these searches with "lambda" and/or "repressor":
Search words         SWISS-PROT   TrEMBL
-----------------    ----------   ------
lambda repressor              1        3
repressor lambda              1        3
repressor                   304      402
lambda                      135       60
Judicious choice of search keywords is always desirable, and takes practice!
You can also use the Find... command in the Edit dropdown menu in NS Navigator to search for the second keyword. Thus, if we had first searched only for the keyword "repressor" and obtained over 200 entries, we could now use "Find..." to search among these entries for "lambda".
{Choose one of the Protein Sequences you found and have a look at it by clicking on the hypertext-linked SwissProt name for the sequence.}
What should now come up is the ExPASy 'NiceProt View'
of the SwissProt or TrEMBL entry.
This is a Web 'table' presentation rather than flat text.
Note that the SwissProt sequence entry ITSELF contains hypertext
links in it!
Thus, you can go immediately to references that are cited, or
to cognate sequences in other databases.
For some of the entries, you can also get 3-D images of the protein
if its crystal structure has been determined !!
{Note the links present to visualize the protein and
to learn more about its structural elements.
Browse through some of these links.}
{Obtain a copy of the sequence entry}
by using the Save as ... command from the File dropdown
menu.
You have the option to save as "text" or as "source".
{Save copies both as "Text" and "Source" from both "NiceProt View" and "old SP format"}
Note the options at the bottom of the entry, as follows:
1. View entry in original SWISS-PROT format.
2. View entry in raw text format (no links)
{Include some or all of the "Text" version of the "old SP format" in your Notebook}
{Note the First line in the Entry and the Last line in
the Entry. What are these?}
{Note also the First Two Letters in each line of the "old
SP format".}
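Once you have saved the "Text" version, the two-letter codes at the start of each line make the entry easy to take apart with a short program. The following Python sketch groups lines by their first two letters. The function name group_by_line_code is hypothetical, and the three-line fragment is invented for illustration; it is not a real database record.

```python
from collections import defaultdict

def group_by_line_code(entry_text: str) -> dict:
    """Map each two-letter line code to the list of data strings on its lines."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2:
            continue  # skip blank or too-short lines
        code, data = line[:2], line[2:].strip()
        groups[code].append(data)
    return dict(groups)

# Invented fragment from the middle of an entry (illustration only).
fragment = """DE   EXAMPLE REPRESSOR PROTEIN.
OS   Example organism.
DE   SECOND DESCRIPTION LINE."""

codes = group_by_line_code(fragment)
print(sorted(codes))  # the distinct line codes found
print(codes["DE"])    # all description-line text, in order
```

Try it on your own saved entry and compare the list of codes with what you noted above.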
Notice that you can also {Use COPY-PASTE to COPY the Sequence directly from the Netscape window and PASTE it into your Notebook file for Exercise 2.}
To do this, when you find output you wish to include in your Lab Notebook, "select" this output by going to the beginning of the output with the cursor, pressing the mouse button and HOLDING IT DOWN while you drag to the end of the output. The "selected" output is then "highlighted".
Now use the COPY function in the EDIT pulldown
menu to copy the selected output.
Now go to your Lab Notebook file in NS Composer, place the cursor
at the position in the Lab Notebook file where you wish to insert
the output, and execute the PASTE function in the EDIT
pulldown menu of NS Composer.
You will note that, when you look at your sequence, the lines are of varying lengths. This is because the text is, by default, displayed in a proportional font, in which different letters have different widths.
Sequence entries are best displayed in a monospaced font such as Courier.
In NS Composer or any html editor, you can select the text
and convert it to Formatted text;
this usually converts the text to a monospaced font such as Courier.
Also in NS Composer, you can select Fixed Width under Font in the Format pulldown menu.
Note that the ExPASy 'NiceProt View' of the entry does not permit COPY-PASTE operations as easily as the original SWISS-PROT format.
The goal of the second part of this exercise is to introduce some searching techniques that are more sophisticated than simple keyword searches. First we will use search engines that permit combinations of more than one keyword using boolean operators, and then we will move on to a system, NCBI Entrez, that uses a sophisticated cross-indexing system to allow very powerful searches of both sequences and literature.
For this part of the exercise we use the Full-Text Search option at SWISS-PROT. Other services using boolean operators work in a similar way, although there will be some differences in syntax (check the instructions).
{1. Go to the SWISS-PROT full text search service.}
Note the information provided as
Note in particular that several boolean operators (AND,
OR, NOT) are supported.
Note also the option to add the wildcard * before and after
the keyword.
Use the same pair of keywords here as you used in the above Search operations.
We first repeat what you did above, to see if the full text search yields more results than just the search by description or identification done above.
{2. Enter the two keywords into the search box and hit SUBMIT to start the search. How many matches were reported?}
Think about why this number is different than what you found above (if it is).
{Now reverse the order of the TWO keywords, i.e. "repressor lambda" instead of "lambda repressor."}
{Now go back to the SWISS-PROT full text search page and try again using EACH of the two keywords. How many matches were reported?}
{Do the above searches with the "append * before
and after" option selected}
{Construct a table of your results as per the following}
Results with "lambda repressor", with comparison to the above results:
                     descrip-ID search         full text search
Search words         SWISS-PROT   TrEMBL       SWISS-PROT   TrEMBL
-----------------    ----------   ------       ----------   ------
lambda repressor              1        3                3        0
  with * appended     * not supported                   3        0
repressor lambda              1        3                0        0
  with * appended     * not supported                   0        0
repressor                   304      402             1067      819
  with * appended     * not supported                1096      872
lambda                      135       60              335      581
  with * appended     * not supported                 336      583
Note that the above table was done using formatted text.
This SWISS-PROT search engine treats multiple keywords as a
phrase that must be matched in its entirety.
Note that many search engines interpret multiple keywords as a
boolean "OR" operation. They report all entries that
match one or the other of the keywords. The other boolean
operators are "AND" and "NOT". The AND operator
limits the search to entries that match ALL the keywords and the
NOT operator selects entries that do not match the keyword.
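One way to picture the three operators is as set operations on the lists of entries that match each keyword. The Python sketch below uses invented entry identifiers (E1 to E4) and is not how any particular server is implemented.

```python
# Hypothetical sets of entry identifiers matching each keyword.
lambda_hits = {"E1", "E2", "E3"}
repressor_hits = {"E3", "E4"}

both = lambda_hits & repressor_hits            # lambda AND repressor
either = lambda_hits | repressor_hits          # lambda OR repressor
lambda_only = lambda_hits - repressor_hits     # lambda NOT repressor
repressor_only = repressor_hits - lambda_hits  # repressor NOT lambda

print(sorted(both))
print(sorted(either))
print(sorted(lambda_only))
print(sorted(repressor_only))
```

Note that AND and OR give the same set whichever keyword comes first, whereas NOT depends on the order of the keywords.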
{3. Do a search with the same two keywords using the AND operator. How many matches did you get? What are the results when you reverse the order of the keywords?}
{Do the same for both the OR and NOT boolean operators, and construct a table for your Notebook similar to the one below.}
Results for the two keywords "lambda" and "repressor" using the full text search SWISS-PROT search engine:
                     AND operator          OR operator           NOT operator
Search words         SWISS-PROT  TrEMBL    SWISS-PROT  TrEMBL    SWISS-PROT  TrEMBL
-----------------    ----------  ------    ----------  ------    ----------  ------
lambda repressor             41       9          1361    1391           294     572
  with * appended            41       9          1391    1446           295     574
repressor lambda             41       9          1361    1391          1026     810
  with * appended            41       9          1391    1446          1055     863
It is often useful to use braces or parentheses (depending on the server) to force the evaluation of the boolean operators in the desired order. Without the parentheses, it's not always clear how expressions with multiple boolean operators are evaluated.
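The effect of grouping can be seen with a small Python sketch that again treats each keyword as a set of matching entries. The identifiers are invented, and the sketch says nothing about the default evaluation order of any real server; it only shows that the two groupings can give different answers.

```python
# Hypothetical sets of entry identifiers matching each keyword.
lam = {"E1", "E2", "E3"}
rep = {"E3", "E4"}
phage = {"E2", "E4", "E5"}

left_first = (lam & rep) | phage   # (lambda AND repressor) OR phage
right_first = lam & (rep | phage)  # lambda AND (repressor OR phage)

print(sorted(left_first))
print(sorted(right_first))
print(left_first == right_first)   # the grouping changes the result
```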
{Try also parts of keywords. For example, for "lambda repressor", we would try "lam AND rep". Note that you must check the "Prefix and append wildcard '*' to words." box with this server. Try the equivalent to "lam* AND *rep*" for your keywords.}
Results for the two keywords "lambda" and "repressor" using the full text search SWISS-PROT search engine:
Search words            SWISS-PROT   TrEMBL
--------------------    ----------   ------
lambda AND repressor            41        9
  with * appended               41        9
lam AND rep                      1        8
*lam* AND *rep*               1511     1627
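The large jump in matches with wildcards is easier to understand with a small experiment. The Python sketch below assumes that the wildcard * behaves like a shell-style pattern applied to individual words; that is an assumption made for illustration, and the server's actual matching rules may differ. The word list and the helper word_matches are invented.

```python
from fnmatch import fnmatchcase

def word_matches(word: str, pattern: str) -> bool:
    """Case-insensitive match where * stands for any run of characters."""
    return fnmatchcase(word.lower(), pattern.lower())

# Invented words, as they might occur in database entries.
words = ["lambda", "lam", "flame", "repressor"]

exact = [w for w in words if word_matches(w, "lam")]    # whole word only
wild = [w for w in words if word_matches(w, "*lam*")]   # any word containing "lam"

print(exact)
print(wild)
```

Under this assumption, a short fragment with wildcards on both sides matches many unrelated words, which is consistent with the much larger counts in the last row of the table.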
Summary of the Full Text Search of both SWISS-PROT and TrEMBL protein databases:
With no "prefix and append * wildcard":
1. lambda AND repressor: 50 matches
2. lambda OR repressor: 2752 matches
3. lambda NOT repressor: 866 matches
4. repressor NOT lambda: 1836 matches
5. repressor XOR lambda: 0 matches
6. repressor AND lambda: 50 matches
7. NOT repressor NOT lambda: 0 matches
With "Prefix and append wildcard '*' to words" button turned on:
8. lam AND rep: 3138 matches
Note the following relationships:
a. (1) = (6)
b. (3) + (4) = (2) - (1): 866 + 1836 = 2702; 2752 - 50 = 2702
c. XOR does not work at ExPASy; one should get (5) = (3) + (4) = (2) - (1) = 2702
d. ExPASy also did not like the NOT ... NOT ... search
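The relationships above can be checked with simple arithmetic on the counts reported in the summary. The short Python sketch below uses only those reported numbers; the variable names are chosen here for illustration.

```python
# Combined SWISS-PROT + TrEMBL counts from the summary (no wildcards).
and_count = 50         # (1) lambda AND repressor
or_count = 2752        # (2) lambda OR repressor
lambda_not_rep = 866   # (3) lambda NOT repressor
rep_not_lambda = 1836  # (4) repressor NOT lambda

# Entries matching exactly one of the two keywords (what XOR should return).
exclusive = lambda_not_rep + rep_not_lambda

print(exclusive)
print(or_count - and_count)
print(exclusive == or_count - and_count)
```

The entries that match exactly one keyword are those counted by OR minus those counted by AND, which is why the two sums must agree.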
The NCBI Entrez browser system is one of the more useful and powerful searching systems available on the Web. Entrez is a sophisticated system that interrelates nucleic acid sequence, protein sequence, macromolecular structure, complete genome, taxonomic, human genetic disease, and literature databases. Entrez uses extensive precalculated indices to speed searches.
A neighborhood (or list of neighbors) is a central concept in Entrez. NCBI precalculates a variety of neighborhoods for all the entries in their databases. For a protein sequence, a neighborhood might be all other sequences that score above a certain level with the BLAST database searching program. For a research paper, a neighborhood might be defined as other papers that share a certain number of keywords in common. Entrez allows one to use these precalculated neighborhoods to quickly locate groups of DNA sequences, protein sequences and structures, and literature references.
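The idea of a precalculated neighborhood can be sketched in Python as follows. This is a toy illustration, not NCBI's procedure: the pairwise scores, the threshold, the protein names P1 to P4, and the function build_neighborhoods are all invented.

```python
from collections import defaultdict

def build_neighborhoods(pair_scores: dict, threshold: float) -> dict:
    """Record, for every entry, the other entries scoring above the threshold."""
    neighbors = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

# Invented pairwise similarity scores between four hypothetical proteins.
pair_scores = {
    ("P1", "P2"): 120.0,
    ("P2", "P3"): 95.0,
    ("P1", "P4"): 20.0,  # below the threshold, so not neighbors
}

neighborhoods = build_neighborhoods(pair_scores, threshold=50.0)
print(sorted(neighborhoods["P2"]))
```

Because the lists are computed once and stored, looking up the neighbors of an entry later is a simple table lookup rather than a new database search.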
{1. Go to NCBI Entrez and read some of the "Entrez Help" information.}
You can in general access NCBI from the General Sites section of the DNASYSTEM home page, or from the Sequence and Structure Databases under the DNA Analysis and Molecular Biology section of the CMS MBR home page.
For this Exercise, you can also click on the above link.
Note the extensive nature of the Entrez Help facility. The section Refining Your Search is of particular value.
{2. Return to the Entrez Home page using the NetScape "Back" operation and click on "Proteins" to search for a protein sequence.}
The query page looks somewhat like the SWISS-PROT page you looked at above. As you should expect by now, the search keyword goes in the entry box.
Note however the general nature of the Search box and region (an html table). In particular, the <database> in Search <database> for < keyword(s) > permits the User to choose the database in which to search from any of these Web pages. Thus, you really did not need to go to the Proteins Web page to search the protein database.
{3. Enter one of the keywords you used at ExPASy in the search box and click the "Go" button (or just do a RETURN) to start the search.}
The response is divided into three main sections, each in grey at the top of the display, plus the matches:
{4. If the number of matches is greater than 200, use a second and/or third keyword in the <keyword(s)> section together with appropriate Boolean Operators.}
To refine your search, you simply use Boolean operators to add additional keywords. This is done in the same Search Protein for <keywords> area at the top of the typical NCBI page. To learn more precisely how Boolean operators etc are used, look at the Refining Your Search section of the Help facility. You may even wish to open a new NS Navigator window, to use for the Help facility.
Here follow some results for keywords "lambda", "repressor", "bacteriophage", and "phage":
1. lambda: 31514 matches
2. phage: 13341 matches
3. bacteriophage: 695 matches
4. bacteriophage NOT phage: 7731 matches
5. phage NOT bacteriophage: 7177 matches
6. repressor AND phage: 589 matches
7. repressor NOT lambda: 4569 matches
8. lambda NOT repressor: 31290 matches
9. lambda XOR repressor: syntax error
10. lambda OR repressor: 36083 matches
11. lambda AND repressor: 224 matches
12. lambda repressor: 46 matches
13. "lambda repressor": 46 matches
14. lambda AND repressor AND phage: 61 matches
15. lambda AND repressor AND CI: 47 matches
16. lambda AND repressor AND cro: 33 matches
The aim of this search refinement is to limit your search to only proteins of interest to you, usually only 100-200 or so. In the above, the last three or four searches satisfy such aims. The first ten searches all yield an inordinately large number of matches.
Note the speed by which the retrieval is done, even though the databases are large.
{5. Click in turn on each of the four links "Limits", "Preview/Index", "History", "Clipboard". Answer the Exercise Questions on these four links. Redo the AND query with two of your keywords using "Limits" to limit your search to new GenPept entries within the past 5 years. Compare these results with those above.}
Basically, the Limits option permits you to "limit" the search to examination of specific database Fields, specific Modification Dates, specific Organisms, etc.
The Preview/Index simplifies creation of more complex Queries by permitting the User to combine the results of simple Queries already performed.
The History link provides information on Queries already performed, assigns numbers to these, and permits the User to use these again in various combinations, thereby creating more complex Queries.
The Clipboard link shows what items have been saved to the Clipboard for subsequent downloading.
For the search lambda AND repressor,
{6. In the "Display <menu>" section, choose different items from the <menu> and click the "Display" button. Briefly describe what the different items are in the <menu>, particularly the "Neighborhood" and "Link" items.}
{7. Return to the <menu> item termed "Summary" and click the "Display" button. Select the boxes of any five (5) of the sequences by clicking on each of the boxes. Select "GenPept" from the <menu> and click on the "Display" button. Record in your Lab Notebook what this has done.}
This is the way to select specific matches and to display their complete documentation. Note that GenPept is the cognate to GenBank: just as TrEMBL is the protein translation of each of the genes found in the EMBL nucleic acid databases, so too GenPept is the protein translation of each of the genes found in the GenBank nucleic acid databases.
{8. The GenPept sequences are Displayed as HTML. Click on the HTML button, choose "Plain Text", and click on the "Display" button. Record in your Lab Notebook what this has done.}
{9. Examine the contents of the GenPept annotation for the first of your Sequences. What are the Fields? Save this GenPept entry to your Lab Notebook.}
{10. Return to the display of your GenPept sequences as HTML, and select from <menu> the "Graphics" option and click on the "Display" button. Describe what you see. What options does the User have? The Graphics display is only of the first of your five sequences; how would you observe the graphics for the fourth sequence?}
{11. Return to a "Display GenPept as HTML" and click on the "Add to Clipboard" button. Describe what happened.}
{12. Choose "Summary" from the <menu> and click on the "Display" button. Note the links to the far right for each of your five GenPept sequences. Briefly describe what each of these links do.}
{13. Click on "Related Sequences" for one of your five GenPept Sequences. Briefly describe what this does.}
For Lambda Repressor, clicking on Related Sequences brings up 81 GenPept entries, several of which are Lambda Repressor itself. These Related Sequences constitute the Neighborhood for protein database entry items.
Each of these 81 entries in turn has Related Sequences. One can thus pursue extension of the Neighborhood further. The hope (and expectation) is that such operations will "converge" to a set of protein sequences, each of which is a neighbor to others and such that no other database entries are neighbors to these sequences. Depending on the criteria used to define the neighborhood, this set of protein sequences will then be related to the concepts of a protein family, a protein superfamily, and a profile for these proteins. These concepts will be discussed later in BIMM 140.
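Repeatedly following Related Sequences until nothing new turns up can be sketched as a simple graph traversal. The Python sketch below is an illustration under stated assumptions: the neighbor table is invented, the function extend_neighborhood is hypothetical, and real neighborhoods need not converge as neatly as this toy example.

```python
from collections import deque

def extend_neighborhood(start: str, neighbors: dict) -> set:
    """Follow neighbor links from start until no new entries are found."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbors.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

# Invented neighbor lists for two unrelated groups of hypothetical proteins.
neighbors = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2"},
    "P9": {"P10"},
    "P10": {"P9"},
}

family = extend_neighborhood("P1", neighbors)
print(sorted(family))
```

The traversal stops when every neighbor of every collected entry is already in the set, which is the "converged" situation described above; P9 and P10 are never reached because no entry in the set is a neighbor of theirs.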
{14. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and click on "PubMed" for the GenPept Sequence that you saved for your Lab Notebook above. How do the PubMed References that come up compare with those in the GenPept annotation for this sequence? Now click on "Related Articles" for one of these References. How does this compare with the "Related Sequences" for the GenPept Sequence itself?}
The Related Articles provide the Reference-based Neighborhood.
{15. Use the Netscape Back command to return to the "Display Summary" page for the five GenPept sequences, and do the same for the "Nucleotide", "Genome", and "Taxonomy" links if present. Briefly describe what each does.}
The Nucleotide Sequence is mainly the DNA sequence encoding your Protein.
{1. Log into the GCG server machine and set up the GCG package. }
The GCG package is licensed only to run on a single processor. For our laboratory this is y4306-su-1. Log onto this machine by typing
ssh y4306-su-1
The GCG package uses many UNIX symbols as it runs. These symbols, which allow you to run the various programs by simply typing their names, must be initialized. To do this, type
prep gcg
After a few seconds you will see a banner announcing "Welcome to the Wisconsin package". This is a synonym for the GCG package.
{2. Familiarize yourself with the capabilities of the GCG package. }
Brief descriptions of the programs in the GCG package are available on the GCG website. As you can see, there are MANY programs. Browse through the entries in the online table to get a feeling for the extent of the operations you can perform. We will use many, but not all, of these capabilities during the course. More detailed information about each program is available as HTML pages locally. Use Netscape or Internet Explorer to open the local file /software/nonrdist/GCG_10.1/gcghelp/html/unix/users_guide.html
{3. Initialize the GCG graphics system. }
Many, if not most, programs in the GCG package produce graphical output. This output can be viewed on a variety of devices, so you must tell the system which output to produce. You can read about some of the options in Appendix C of the GCG users guide. We will most often display graphics to the terminal screen. This requires using an "XWindows" graphics driver. To initialize the XWindows driver, type xwindows. To select the default "color workstation", simply hit return (enter). Note that the GCG system always displays default values within (* *). You can always accept the default value by simply hitting return (enter). Some other graphics drivers that may be useful are the PNG driver (for including graphics in your lab reports), and the postscript driver (for publications).
{4. Make a test plot and save it to your lab report. }
Use the GCG plottest program to produce a test plot. First view it on the screen using the XWindows driver, then save a copy in PNG format and include it in your notebook.
{5. Format a sequence for use with the GCG package. }
Like many packages, the GCG package requires files to be in a certain format to work with the programs in the system. This applies most particularly to sequence files. The GCG package has a number of utilities that will reformat sequences between various native formats and the GCG format. See the section on "reformatting sequence files" in the users guide section on "Using sequences" for more information. Use the GCG program fromembl to reformat the SWISS-PROT sequence you retrieved in this exercise (A5, above).
{6. Format a sequence using a text editor and the reformat command. }
Sometimes you will want to edit a sequence manually, or a sequence will be in an unusual format. In this case you can edit the sequence using any text editor and reformat it using the GCG reformat command. Read the section on "Creating and editing single sequences" in the "Using sequences" section of the GCG Users Guide. Try this out using the sequence you retrieved from Entrez in section C7, above. After reformatting the sequence, run the GCG program pepplot and include the output in your notebook to confirm that the formatting process was successful.
{Answer all of the following questions:}
Latest modification: 2 April, 2001
If you have problems or questions, send email to Michael Gribskov or to Doug Smith