INTERNET RESOURCES FOR MOLECULAR BIOLOGISTS AND PROTEIN CHEMISTS


Lincoln D. Stein and Tau Mu Yi
MIT Genome Center


This summary describes presentations made at the workshop by both authors. As the technology to rapidly identify, clone, and sequence genes advances, sequence analysis software becomes an increasingly important part of the research effort. Often the only clue to a newly-cloned gene's function is its DNA sequence. Until recently, however, sequence analysis software had to be installed locally on a mainframe or workstation. Often these programs were distributed with restrictive license agreements and required regular maintenance and updating. For this reason, these programs were usually installed in academic and commercial computer centers, and a fee was charged for their use.

A revolution has occurred in the past few years. With the advent of the Internet, sequence analysis programs of many sorts have been put on-line. Current Internet-based software enables you to predict the location of exons in genomic sequences, perform rapid similarity searches across protein and nucleotide sequence databases, identify conserved sequence motifs, browse large bibliographic and phenotypic databases, and even obtain three-dimensional crystallographic structures of related proteins. These programs are typically as powerful as the ones distributed for local use, are dynamically updated, and are entirely free to use.

Levels of Internet Access

Internet-based software is based on the client/server model. A client program is run by someone with a question (you). It connects, over the Internet, to a server program. The server reads the query, does the requested calculations, and returns the result to the client. The client then displays the result in human-readable form.
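
The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, both halves run on one machine, and the "requested calculation" (counting G and C bases) is a stand-in for a real analysis.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server side: read one query, do the calculation, return the result."""
    conn, _addr = listener.accept()
    with conn:
        query = conn.recv(4096).decode("ascii")
        # Stand-in calculation (an assumption): count the G and C bases.
        gc = sum(query.upper().count(base) for base in "GC")
        conn.sendall(f"GC bases: {gc} of {len(query)}".encode("ascii"))


def ask_server(host: str, port: int, query: str) -> str:
    """Client side: send the query and return the server's reply as text."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(query.encode("ascii"))
        return conn.recv(4096).decode("ascii")


def demo() -> str:
    """Run a one-shot server and a client on the local machine."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the operating system for any free local port.
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()
    try:
        return ask_server("127.0.0.1", port, "ATGGCC")
    finally:
        worker.join()
        listener.close()
```

Calling demo() starts the server, sends the sequence ATGGCC as the query, and returns the server's one-line answer for the client to display.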

There are roughly three levels of Internet access. The higher your level, the more powerful the client software you can use. The lowest level of access is the ability to send and receive E-mail only. With this level of access, you have the ability to use a limited number of aging server programs. Your E-mail software is the client: you compose an E-mail message formatted according to the server's requirements, mail it to a designated Internet address, and receive the server's response by return mail.

The next level up is log-in access. Here you have an account on some central mainframe or workstation that has Internet access. You use a telecommunications program on a personal computer to log into the mainframe. The range of client software is much larger, but it is limited by and large to text-only interaction. In addition, it is inconvenient to move data from your personal computer to the mainframe and back again.

The highest level of Internet access is a direct TCP/IP connection. Here your personal computer becomes a first-class member of the Internet and can speak directly with servers located around the world. With this level of access, you can run clients that take advantage of the personal computer's graphical user environment to display styled text, pictures, animations, and sounds. In addition, personal computer-based client software is generally a lot more user-friendly than mainframe-based software.

The World Wide Web

Internet servers use different communications protocols to transfer data to and from the client. The various protocols are distinguished both by intended function (e.g., some are more useful for bulk file transfers, while others work best for transmitting small amounts of data interactively) and by historical accident. The four most widely used protocols are:

* FTP (File Transfer Protocol)--bulk transfer of large files. Not particularly interactive. Not particularly searchable.

* NNTP (Net News Transfer Protocol)--distribution of user-contributed messages and news articles over a worldwide bulletin board system.

* Gopher (the university mascot)--a campus-wide information system developed at the University of Minnesota that uses a distributed menu system to link up servers across the Internet. It supports simple text-based database searching.

* HTTP (Hypertext Transfer Protocol)--a protocol that allows clients to display hypertext documents containing styled text, pictures, animations, and sounds. These documents typically contain one or more links, textual and graphic elements that are linked to other Internet documents. Selecting the link instructs the client to download the linked document, which may reside on the same server or may live on a server located in a different country. The HTTP protocol allows for interactive database searching and data analysis programs using fill-out forms: you fill in the blanks and select options from a variety of pop-up menus and checkboxes. The result is returned to you as a hypertext document that you can use as the basis for further queries.
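
When you submit one of the fill-out forms mentioned under HTTP, the blanks and menu choices typically travel to the server as name=value pairs. The following is a rough sketch of that encoding. The server address and the field names are made up for illustration and do not belong to any real service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_form_query(base_url: str, fields: dict) -> str:
    """Append form fields to a URL as an encoded query string."""
    return base_url + "?" + urlencode(fields)


# Hypothetical search form: the host and field names are assumptions.
query_url = build_form_query(
    "http://example.org/cgi-bin/search",
    {"database": "swissprot", "keywords": "zinc finger"},
)

# The server reverses the encoding to recover what was typed into the form.
decoded = parse_qs(urlsplit(query_url).query)
```

Spaces and other special characters are escaped on the way out and restored on the server, so the form contents arrive intact.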

There are individual client programs that speak just one of these protocols, but World Wide Web browsers allow you to use all these protocols (and several others) from a single program without worrying about the details. These browsers provide a one-stop shopping solution to your data analysis needs. Web browsers are available for all makes of personal computer and operating system. The best-known browsers are the public domain Mosaic from the National Center for Supercomputing Applications and the free-but-copyrighted Netscape from the commercial Netscape Communications Corporation. However, there are many more browsers in addition to these two. A variety of Web browsers should be available through your institution's network administrator.

In order to negotiate the maze of protocols and servers, Web browsers identify each document or service on the Internet using a simple naming scheme known as the Uniform Resource Locator (URL) notation. A typical URL looks like this:

http://gc.bcm.tmc.edu:8088/bio/bio_home.html

The URL begins with the name of the protocol followed by a colon, in this case http: for a server that speaks the HTTP protocol. Next, the URL contains a double-slash (//) followed by the server's Internet address and an optional communications port number, in this case the server gc.bcm.tmc.edu at port number 8088. The remainder of the URL is the path to the specific resource, in this case a file located at /bio/bio_home.html. These paths can be quite long and differ in meaning according to the communications protocol. They can point at static files or at programs that will perform a variety of analyses and searches.
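
The same dissection can be done mechanically. Here is a small sketch using Python's standard URL parser; the function name describe_url is just an illustrative choice.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts discussed in the text."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # text before the first colon
        "server": parts.hostname,   # Internet address after the double slash
        "port": parts.port,         # None when the URL names no port
        "path": parts.path,         # path to the specific resource
    }


example = describe_url("http://gc.bcm.tmc.edu:8088/bio/bio_home.html")
```

For the URL above, the protocol is http, the server is gc.bcm.tmc.edu, the port is 8088, and the path is /bio/bio_home.html.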

Fortunately you only have to worry about a handful of URLs. Once you have found a relevant document, it will contain links to other resources. You can travel from one link to the next by pointing and clicking. When you find a resource you'll want to come back to, you can add its URL to a "hotlist" by selecting an option from the browser's menu. You can now travel directly to this resource by selecting it from a menu.

Main Entry Points for Molecular Biologists

There are several URLs that are good starting points for biomolecular researchers. From these sites, you can reach any of the other sites mentioned in this article and many others.

The Biosciences Virtual Library, URL: http://golgi.harvard.edu/. This is a subject catalog-based document that lists biologically-related resources in a hierarchical manner. It is maintained at Harvard University.

The Baylor Biologist's Control Panel, URL: http://gc.bcm.tmc.edu:8088/bio/bio_home.html. This is a collection of pointers to DNA and protein sequence search and analysis software available over the Internet. Much of the software was developed at Baylor itself.

Yahoo, URL: http://www.yahoo.com/Science/Biology/. This is a large catalog of Internet resources that contains a comprehensive listing of biologically-oriented sites. It is not as well-organized as the Virtual Library but tends to contain more pointers.

Pedro's Home Page, URL: http://www.public.iastate.edu/~pedro/rt_1.html. This is a site maintained by someone named Pedro that contains pointers to a large number of protein structural analysis tools.

University of Cambridge, URL: http://www.bio.cam.ac.uk/. This is another good collection of pointers to sequence analysis tools. Because it is located in the UK, its response time will be faster for European researchers than that of US-based sites.

Johns Hopkins University/GDB, URL: http://www.gdb.org/. This is the home site of GDB (Genome Database) and is a good entry point for the genome mapping resources of the Human Genome Project.

Lycos, URL: http://lycos.cs.cmu.edu/. Lycos is a powerful keyword-based Internet search engine. If you can't find what you're looking for in one of the collections above, you can search for potentially relevant documents rapidly using the Lycos index.

Sequence Database Searching

Entrez, URL: http://atlas.nlm.nih.gov:5700/Entrez/. Entrez is an integrated database run by the NCBI (National Center for Biotechnology Information). It contains the nucleotide sequence entries from GenBank, the protein sequences from SWISSPROT and PIR, and the molecular biology subset of the MEDLINE bibliographic database. It can be rapidly searched on any of the fields defined by these databases. Once an entry has been retrieved, you can link to related entries from any of the three databases. For example, you can jump from a nucleotide sequence to its translated protein sequence, and then perform a similarity search to find all related protein entries in SWISSPROT. You can then browse journal articles related to these entries.

SRS, URL: http://www.embl-heidelberg.de/srs/srsc. SRS is a collection of approximately 80 sequence-related databases. In addition to nucleotide and protein databases, SRS contains a number of three-dimensional structure databases, sequence motif databases, and the phenotypic database OMIM (Online Mendelian Inheritance in Man). As with Entrez, you can search any of these databases by keyword and then find related entries in other databases. Because SRS is located in Germany, it will be faster for European researchers than Entrez.

Gene Finding, Sequence Analysis, and Protein Structure

GRAIL, URL: http://avalon.epm.ornl.gov/. GRAIL is the most popular program for identifying protein coding regions within genomic sequence. It relies on a neural network to combine information diagnostic for exons (e.g., codon frequencies and consensus splice sites) to locate regions that encode proteins. Tested on human genomic sequence, the program identifies 60 to 80% of all exons (worse on short exons).

Gene Finder, URL: http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html. Gene Finder is another well-known gene finding program developed at the Baylor College of Medicine. Its accuracy is comparable to that of GRAIL. Gene Finder provides several other services such as ranking the strength of splice sites and attempting to assemble the different exons into a single coding region.

BLAST, URL: http://www.ncbi.nlm.nih.gov/Recipon/blast_search.html or http://specter.dcrt.nih.gov:8004/userblast. BLAST is a powerful sequence similarity searching tool that is essential for the initial characterization of a sequence. You can submit a sequence to BLAST and within a minute it will return all similar sequences in a nucleotide or protein sequence database. This often provides the first hints of the function of a newly cloned gene. BLAST searches are available at a number of sites. The two main sites are the NCBI's server (the first URL address above) and the server provided by GenBank itself (the second URL address).

BLITZ, BEAUTY, and BLASTPAT, URL: http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html. The Baylor College of Medicine (BCM) offers several alternative methods for searching the sequence database. The BCM protein sequence/pattern search page has a link to the BLITZ server at EMBL that uses the Smith-Waterman dynamic programming algorithm to search for related protein sequences. BEAUTY performs a BLAST search against an annotated database containing background information on each sequence about functionally important residues, phosphorylation sites, etc. BLASTPAT uses the BLAST algorithm to search a database of sequence patterns, which is potentially more sensitive than alignment against single sequences.
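
To give a feel for the Smith-Waterman dynamic programming mentioned above, here is a minimal score-only sketch. It is not the BLITZ implementation: the match, mismatch, and gap values are arbitrary assumptions, whereas a production search would normally use an amino acid substitution matrix and more elaborate gap penalties.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score of an alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a fresh local alignment
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,    # gap in sequence b
                score[i][j - 1] + gap,    # gap in sequence a
            )
            best = max(best, score[i][j])
    return best
```

Because every cell is allowed to fall back to zero, the method finds the best-matching region of the two sequences and ignores unrelated flanking sequence, which is what makes it a local rather than a global alignment.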

Protein Structure Prediction, URL: http://www.embl-heidelberg.de/predictprotein/predictprotein.html. EMBL-Heidelberg offers a suite of programs, PredictProtein, that predicts the secondary structure, solvent accessibility pattern, and membrane-spanning regions of a protein. These programs represent the state of the art in protein structure prediction.

PROSITE, URL: http://www.ebi.ac.uk/searches/prosite_input.html. PROSITE is a database of common sequence motifs, many of which have important functional or regulatory roles. Examples include GTP-binding motifs and zinc-finger motifs, as well as consensus phosphorylation and glycosylation sites. The identification in your protein of a motif that performs a specific function can be very informative, but most proteins do not register a hit against the database.
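
A PROSITE motif is written in a compact pattern notation that maps closely onto regular expressions. The sketch below handles only a subset of that notation (x for any residue, square brackets for allowed residues, curly braces for forbidden residues, and parentheses for repeat counts) and uses the N-glycosylation pattern N-{P}-[ST]-{P} as its example. The function names are illustrative, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P".
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "two to four residues of any kind".
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (0-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Invented test sequence; the second N is followed by P and so is rejected.
hits = find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTL")
```

Short patterns like this one match by chance quite often, so a hit is a hint to follow up rather than proof of function.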

BLOCKS, URL: http://www.blocks.fhcrc.org/. BLOCKS is an algorithm that attempts to identify conserved motifs in a protein query sequence by searching a database of aligned target sequences. For example, a match to a set of aligned zinc-finger sequences indicates that your sequence contains a zinc-finger and is probably a DNA-binding protein. BLOCKS is expected to be more sensitive than PROSITE for many motifs, but may not contain as complete a collection. A BLOCKS server is available at the Fred Hutchinson Cancer Research Center.

Protein Data Bank (PDB), URL: http://www.pdb.bnl.gov/. The Protein Data Bank at Brookhaven National Laboratory is a central repository for protein (also RNA and DNA) X-ray crystallographic and NMR structures. You can search the database using a variety of criteria and download the atomic coordinate files. Using public domain tools available through PDB, you can then view the rendered three-dimensional structures.
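
The atomic coordinate files are plain text with one atom per line in fixed columns, so they are easy to read with a short program. The sketch below assumes the standard ATOM record layout (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in columns 31-54). The helper names are hypothetical, and format_atom_line exists only to produce a correctly aligned sample record.

```python
def format_atom_line(serial: int, name: str, res_name: str, chain: str,
                     res_seq: int, x: float, y: float, z: float) -> str:
    """Build a sample ATOM record with every field in its fixed columns."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line: str) -> dict:
    """Pull the main fields out of one ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def read_atoms(lines) -> list:
    """Parse every ATOM or HETATM record and skip all other record types."""
    return [parse_atom_line(line) for line in lines
            if line.startswith(("ATOM", "HETATM"))]


sample = format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614)
atoms = read_atoms(["HEADER    EXAMPLE", sample])
```

Reading by column position rather than splitting on spaces matters here, because neighboring fields can run together without any space between them.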

Making Your Own Data Available on the World Wide Web

It is inexpensive and reasonably easy to set up a World Wide Web site of your own to make publications and data available to other laboratories. The easiest server to set up is a Macintosh-based server known as MacHTTP. It can be downloaded for free at URL: http://www.biap.com/. Its usage terms allow you to use it for free for a period of time. If you continue to use it after this evaluation period you are asked to pay a small fee. MacHTTP is self-configuring. All you have to do is to double-click its icon and watch it run. Any files placed within MacHTTP's folder will become Internet accessible.
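
The idea of a server that simply publishes whatever sits in one folder can be sketched with Python's built-in web server. This is not MacHTTP and is only meant for experimenting on your own machine; the function name is an illustrative choice.

```python
import functools
import http.server
import threading


def start_folder_server(folder: str, port: int = 0):
    """Serve every file under `folder` over HTTP from a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=folder
    )
    # Port 0 lets the operating system choose a free port; the address is
    # limited to this machine so nothing is exposed to the outside world.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After calling start_folder_server on a folder, the chosen port is available as server.server_address[1], and server.shutdown() stops the server when you are done.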

In order to make your site friendly, you'll want to provide some of your data in the form of hypertext documents written in a simple markup language called HTML (Hypertext Markup Language). Many tutorials exist for HTML. Among the best is a book with the somewhat silly title of "Teach Yourself Web Publishing with HTML in a Week" by Laura Lemay (SAMS Publishing).
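
To show how little markup a basic hypertext page needs, here is a sketch that generates one from a title and a list of links. The function name and the page layout are illustrative choices and are not taken from any particular tutorial.

```python
from html import escape


def build_page(title: str, links: list) -> str:
    """Return a minimal HTML page with a heading and a list of links.

    `links` is a list of (label, url) pairs.
    """
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for label, url in links
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body>\n"
        "</html>\n"
    )


page = build_page("Lab Data", [("Protein Data Bank", "http://www.pdb.bnl.gov/")])
```

Saving the returned text in a file whose name ends in .html and placing it in the server's folder is enough to publish it.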

If you are interested in doing more than distributing static documents, for example providing interfaces to databases and data analysis programs, you will need a more sophisticated server, such as those based on Unix or Windows NT. A good source of information on setting up one of these servers is "How to Set Up and Maintain a World Wide Web Site" (Addison-Wesley Publishing), which was written by Lincoln Stein.




Created: 11th September 1995
Last modified: 11th September 1995