Summaries of Presentations at ABRF '97: Techniques at the Genome Proteome Interface


Two of the presentations from the ABRF meeting held in Baltimore, MD on February 9-12, 1997 are summarized here. Additional presentations will be summarized in subsequent issues of ABRF News.


Developing a Genome Attitude

Towards Biology and Medicine

 

John Rush

HHMI/Harvard Medical School

 

Francis Collins delivered the keynote address at ABRF '97: Techniques at the Genome Proteome Interface. Since 1993, Collins has served as Director of the National Center for Human Genome Research (NCHGR) at the National Institutes of Health. This organization is a leader in the U.S. government's effort to sequence the entire human genome by the year 2005. In January 1997, the NCHGR was given institute status and was renamed the National Human Genome Research Institute.

 

Collins' keynote address emphasized that it is critical to bring together scientists with expertise in proteins and nucleic acids if we are to see the fruits of the genome project. We are hurtling towards the time when there will be a complete reference sequence of the human genome. But even now some are concerned that, without close and regular contact, biologists with different areas of expertise will not be able to make good use of all the information that will become available as the genome project is completed. This interaction is important enough that it should be promoted and developed now, so we will be better positioned to take advantage of the many opportunities that lie ahead.

 

The first part of the keynote address was a report card on the status of the genome project: genetic mapping, physical mapping, and sequencing (1). The second part was a preview of the opportunities that extend beyond genome sequencing (2), and how these opportunities might be explored with DNA microarray technologies (3, 4).

Francis Collins, Director of the National Human Genome Research Institute, delivered the keynote address at ABRF '97: Techniques at the Genome Proteome Interface.

The Current Status of the Genome Project

One reason the genome project has been received with enthusiasm by the public and the U.S. Congress is that virtually all diseases have genetic components, even infectious diseases. The genome project is founded on the belief that uncovering the genetic basis of predisposition to disease will be greatly facilitated by obtaining a reference sequence consisting of the 3 billion base pairs and 80,000 genes of the human genome.

 

To appreciate the usefulness of a reference sequence, it is worthwhile to review how disease genes are identified now, without a reference sequence. Currently most disease genes are found by positional cloning (5). In this approach, a disease gene is mapped to an ever smaller region of a chromosome until the region is small enough that it can be cloned and eventually sequenced; comparing sequences for several affected and unaffected individuals results in identification of candidate disease genes. With positional cloning, it is not necessary to know what a disease gene does to locate it.

 

However, positional cloning is a daunting exercise. To map a disease gene, families are identified in which the disease has been inherited by several family members. DNA samples from these family members are then evaluated by linkage analysis with markers that cover all the human chromosomes. If there are enough markers and enough family members are examined, a marker can be found that segregates with the disease, helping to physically locate the disease gene. This mapping process is repeated several times to define the physical region as narrowly as possible. After cloning, a physical map of this small region is constructed, candidate genes are identified, and they are then sequenced to identify the subtle changes that may be responsible for disease.

 

Before the human genome project started in 1990, this approach to identify disease genes was occasionally successful, but only after decade-long efforts. The approach was inefficient and expensive, because researchers always found themselves in unfamiliar areas and were forced to define the basic geographic characteristics of part of a chromosome. The human genome project was initiated because it was recognized that it is better to sequence the entire genome without knowing the parts that are of particular interest now, since eventually all the parts will be of interest.

 

Genetic Maps

When the genome project was initiated, one of its original goals was to obtain 1,500 easy-to-use, universal markers for genetic mapping. This goal was met and surpassed in 1993. There are now more than 10,000 markers, and their sequences are publicly available. The markers are based on microsatellite sequences, simple repeats of di-, tri-, and tetranucleotides; for example, [CA]n where n is more than 15. These simple sequence repeats make useful genetic markers because the exact number of repeats (n in [CA]n) varies among individuals.
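As a rough illustration of why repeat number works as a marker, the sketch below counts the longest uninterrupted [CA]n run in a sequence. The function name and the two example alleles are hypothetical and are not taken from any published marker set.

```python
import re

def longest_ca_repeat(sequence: str) -> int:
    """Return the repeat count n of the longest uninterrupted [CA]n run."""
    runs = re.findall(r"(?:CA)+", sequence.upper())
    return max((len(run) // 2 for run in runs), default=0)

# Two hypothetical alleles of one marker that differ only in repeat count.
allele_a = "GGTT" + "CA" * 16 + "TTGG"
allele_b = "GGTT" + "CA" * 19 + "TTGG"

n_a = longest_ca_repeat(allele_a)
n_b = longest_ca_repeat(allele_b)

# Each extra CA unit lengthens the amplified product by two bases,
# which is what a sizing gel would resolve.
print(n_a, n_b, 2 * (n_b - n_a))
```

In an actual assay the repeat number is inferred from the length of the PCR product rather than read from sequence, so this is only a conceptual model of the variation being scored.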

 

These markers are currently used in PCR assays with organized sets of fluorescently labeled primers. The primers are multiplexed by tagging them with different fluorophores. They are also multiplexed by the lengths of their expected amplification products: primers for different markers but with the same fluorophore tag are designed so their amplification products migrate at discrete, non-overlapping regions in a single gel lane. This makes it possible for individual laboratories to evaluate tens of thousands of markers in a reasonable time frame.
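The two-way multiplexing described here (by fluorophore and by product length) can be expressed as a simple design check: markers that share a dye must not share a size window. The following is a minimal sketch under assumed data; the marker names, dyes, and size ranges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical panel: marker -> (fluorophore, smallest product, largest product), sizes in bases.
panel = {
    "marker_A": ("blue", 100, 140),
    "marker_B": ("blue", 160, 200),
    "marker_C": ("green", 110, 150),
    "marker_D": ("green", 145, 190),
}

def conflicting_pairs(panel):
    """Return pairs of markers that share a dye and have overlapping size ranges."""
    by_dye = defaultdict(list)
    for name, (dye, low, high) in panel.items():
        by_dye[dye].append((name, low, high))
    conflicts = []
    for markers in by_dye.values():
        for (n1, lo1, hi1), (n2, lo2, hi2) in combinations(markers, 2):
            # Two closed intervals overlap when each starts before the other ends.
            if lo1 <= hi2 and lo2 <= hi1:
                conflicts.append((n1, n2))
    return conflicts

print(conflicting_pairs(panel))
```

Here marker_C and marker_D would collide in one lane because they carry the same dye and their size windows overlap, while marker_A and marker_B would not.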

 

Researchers assay whether these markers are inherited along with a disease gene. The availability of these markers recently led to the rapid identification of loci implicated in prostate cancer on chromosome 1 (6) and Parkinson's disease on chromosome 4 (7). Even when a reference sequence is available, these genetic markers will help rapidly pinpoint the chromosomal locations of specific disease genes.

 

Physical Maps

After cloning DNA fragments and before large-scale sequencing, the positional relationships of the cloned fragments are established by physical mapping. Organizing clones in overlapping sets defines how the fragments are connected in chromosomes.

 

Physical maps are built around the concept of the sequence-tagged site (STS), a sequence from a cloned fragment shown by PCR assays to be unique in chromosomal DNA. Overlap is demonstrated by showing that clones share some STSs and, therefore, contain contiguous chromosomal sequences. Physically ordering STSs in this manner provides the scaffold needed for large-scale sequencing.
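The overlap rule (clones that share an STS contain contiguous chromosomal sequence) can be sketched as a grouping problem. The example below joins clones into contigs with a small union-find structure; the clone names, STS names, and the choice of union-find are assumptions made for illustration, not a description of the software used by the mapping centers.

```python
from itertools import combinations

# Hypothetical STS content: clone name -> set of STSs detected in it by PCR.
sts_content = {
    "clone_1": {"sts_a", "sts_b"},
    "clone_2": {"sts_b", "sts_c"},
    "clone_3": {"sts_c", "sts_d"},
    "clone_4": {"sts_x", "sts_y"},
}

def build_contigs(sts_content):
    """Group clones into contigs: clones sharing at least one STS are joined."""
    parent = {clone: clone for clone in sts_content}

    def find(clone):
        # Follow parent links to the group representative, shortening the path as we go.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for a, b in combinations(sts_content, 2):
        if sts_content[a] & sts_content[b]:
            parent[find(a)] = find(b)

    contigs = {}
    for clone in sts_content:
        contigs.setdefault(find(clone), []).append(clone)
    return sorted(sorted(members) for members in contigs.values())

print(build_contigs(sts_content))
```

This sketch only establishes which clones belong together; ordering the STSs within each contig would be a further step.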

 

Physically ordered STSs are preferable to physical maps based on specific clones in libraries, which come and go as better library vectors are developed. In the short history of the genome project, the preferred library vectors have shifted from cosmids to yeast artificial chromosomes (YACs) and most recently to P1 artificial chromosomes (PACs) and bacterial artificial chromosomes (BACs). If the order of STSs is known, clones within a library can be ordered, regardless of the library's source and vector. STSs are viewed as more durable physical map elements than the libraries themselves.

 

A goal of the genome project is to develop one STS for every 100 kilobases of the genome or 30,000 STSs altogether. To accomplish this kind of throughput, laboratories heavily involved in physical mapping rely on robots and other means of automation such as the "genomatron" (8).

 

This physical mapping strategy can be used not only for genomic clones but also for cDNA clones to produce a variation of the STS termed the "expressed sequence tag" (EST). So far there are 30,000-40,000 human cDNA clones, and 16,000 have been mapped. After genetic mapping, if it is found that a chromosomal region of interest contains a known EST, appropriate clones can be readily obtained along with all available physical mapping data, greatly accelerating disease gene identification projects. Although ESTs allow researchers to focus on expressed genomic elements, they tend to under-represent genes that are expressed at low copy number or early in development.

 

Currently, about 97% of the human genome is publicly available in ordered libraries anchored by STSs, so clones containing chromosomal regions of interest can be readily identified and obtained. To facilitate their use, annotated physical maps are available at a WWW site maintained by the National Center for Biotechnology Information (URL: http://www.ncbi.nlm.nih.gov/SCIENCE96/). From that resource, browsers can select one particular chromosome and view up-to-date information on the genes mapped to that chromosome.

 

Sequencing

The ultimate goal of the genome project is to obtain a complete human reference sequence, and the year 2005 is the target date for reaching this milestone. Along the way, it is expected that reference sequences will be obtained for several other organisms that have served as models in experimental biology and that have simpler genomes with higher gene densities than humans. Among the model organisms, the 4.7 Mb sequence of E. coli was finished in January 1997; the 12 Mb sequence of the yeast S. cerevisiae was completed in April 1996; the 100 Mb C. elegans sequence is due at the end of 1998; and the D. melanogaster sequence is expected in the year 2000. Interestingly, about 40% of the human genes now known to be involved in disease have homologues in yeast, a somewhat unanticipated bonus from that genome sequencing effort.

 

The mouse genome is as large as the human genome; and, while there is a commitment to map the mouse genome, there is not yet a commitment to sequence it to completion. However, that commitment is more likely to be made if sequencing costs can be lowered from their current level of 50 cents per base to less than 20 cents per base. Considerations such as these continue to push sequencing technology toward more efficient methods, based on capillary electrophoresis, mass spectrometry, single-molecule sequencing, and, most recently, micro-electromechanical systems, such as DNA arrays on integrated circuit chips.

Developments in sequencing technology will be needed to make good use of a human reference sequence. When it becomes important to put the reference sequence in context by evaluating variations from the reference sequence among individuals and among species, the appetite for sequence information will be insatiable and will have to be met by methods considerably more powerful than the ones currently available. From this perspective, support for continued development of sequencing technologies is thoroughly justifiable.

 

Large-scale sequencing of the human genome is currently in a "ramp up" phase. There are six pilot projects underway, and it is anticipated they will produce 100 Mb of sequence by April 1998. The U.S. is expected to contribute about 60% of the total sequence, and the remainder will originate from genome centers in England, Germany, France, and Japan.

 

Even though large-scale sequencing has just begun, the genome project has already had significant impact on the disease gene discovery process. Although the number of disease genes identified by positional cloning grew slowly from 1986, beginning in 1993 that number began to grow almost exponentially. The most productive year so far was 1996, and 1997 is off to a promising start.

 

Many of the disease genes identified thus far can induce disease on their own. Future efforts are likely to become more focused on polygenic illnesses, brought on by several genetic alterations acting in concert. The most common diseases with significant environmental contributions have this polygenic characteristic. Because each disease gene has a weak effect when acting individually, these disease genes are more difficult to discern and to map. Among this class of diseases, a project to identify the disease genes responsible for type I juvenile onset diabetes is furthest along.

 

In summary, the genetic mapping goal of the genome project has been complete for several years, the physical mapping goal is more than 95% complete, and work toward the sequencing goal is now beginning in earnest (9).

 

The Second Manifestation of the Genome Project

Even though the genome project has not yet been completed, it is not premature to consider taking a "genome attitude" toward other aspects of biology and medicine. In fact, it is necessary to begin thinking now about the technologies and methods that will be needed to make use of all this forthcoming information.

 

Just as chemistry did not end when Mendeleyev completed the periodic table, so completing the human genome sequence will mark a beginning in biology, not an end. Genetic variants within the human population will be analogous to isotopes of elements. Most interesting will be determining how the "elements" work together to make cells and organisms. We will want to understand what genes do and how they contribute to common disorders. So how will we do this?

Understanding Function on a Global Scale

Traditionally, functional studies have taken an isolated, one-gene, one-protein-at-a-time approach, partly from experimental necessity. As genome structure becomes more explicitly defined, it will become possible to explore the functions of genome elements with global methods that reveal how biochemical networks are interconnected. For an early indication of how all this might develop, watch the yeast biologists, because they are now in their post-genome era.

 

For functional studies on a global scale, thousands of RNA transcripts can be analyzed simultaneously using DNA microarrays. With this technology, changes in transcript levels in response to environmental changes (for example, treatment with a therapeutic agent or the onset of infectious disease) can be rapidly evaluated.

 

As an example, DeRisi et al. (10) used high-density arrays to evaluate the effects of a tumor suppressor on gene expression. The arrays were produced by attaching known amounts of PCR products from specific cDNA clones at different but known locations (addresses) on microscope slides. RNA was then isolated from two melanoma cell lines that differed only in that one cell line contained an extra copy of chromosome 6, which encodes a tumor suppressor. The RNA was amplified by reverse transcription-PCR so that the PCR products from one cell line were tagged with a green fluorophore, while the PCR products from the other cell line were tagged with a red fluorophore. These PCR products from two sources were combined in equal amounts (to accentuate differences in expression levels) and hybridized to the arrays of immobilized cDNA probes. Under these conditions, yellow fluorescence (from equal amounts of green and red fluorophores) indicated genes expressed at equivalent levels in both cell lines, green fluorescence indicated higher expression in the malignant cell line, and red fluorescence indicated higher expression in the tumor suppressor cell line.
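The color readout can be thought of as a ratio test on the two fluorescence channels. The sketch below classifies an array feature from its green and red intensities using a log2 ratio; the two-fold threshold, the function name, and the intensity values are assumptions for illustration and are not taken from the DeRisi et al. analysis.

```python
import math

def classify_spot(green: float, red: float, fold_threshold: float = 2.0) -> str:
    """Classify one array feature by its red/green intensity ratio.

    Intensities are assumed to be background-corrected and strictly positive.
    """
    log_ratio = math.log2(red / green)
    cutoff = math.log2(fold_threshold)
    if log_ratio >= cutoff:
        return "red"      # more transcript in the red-labeled sample
    if log_ratio <= -cutoff:
        return "green"    # more transcript in the green-labeled sample
    return "yellow"       # roughly equal expression in both samples

# Hypothetical intensities per gene: (green, red).
spots = {
    "gene_1": (1000.0, 1100.0),
    "gene_2": (400.0, 2000.0),
    "gene_3": (1500.0, 300.0),
}
for name, (green, red) in spots.items():
    print(name, classify_spot(green, red))
```

Working with the logarithm of the ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why it is used here instead of the raw ratio.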

 

In this experiment each microscope slide contained an array of 1,000 probes, but it is possible to make the array features smaller so one array contains tens of thousands of probes. Gene expression can be described qualitatively and also quantitatively, from the intensity of each fluorescent array feature. The detection limit currently corresponds to a transcript abundance of 1 in 300,000 (about 1 copy per cell). This approach enables quantitation of thousands of transcripts simultaneously and comparison of transcripts between two cell lines.

 

The power of this method is in its ability to take apart pathways and determine what is really happening in cells, without prior knowledge and without biases based on what is already known to be involved in function. Methods like this can also reveal the complexity of pathways. Microarrays with 20,000 to 30,000 probes at defined, addressable sites are likely to become available within the next two years.

 

Developments in technology such as this will be just as important during the second manifestation of the genome project as they are in its first manifestation, and biologists, who sometimes undervalue technology, should recognize this now. Just as computers are currently in a phase of exponential development, the technologies used by biologists are expected to develop at a similar pace in the post-genome era.

 

Detecting Mutations

DNA microarrays also provide a powerful technology for detecting mutations. Recently this technology was used to identify mutations in exon 11 of the human breast cancer gene BRCA1 (11).

 

Mutations in BRCA1 and BRCA2 lead to increased risk of breast and ovarian cancer. In most populations, every affected family carries a different BRCA mutation, so in practice both genes must be sequenced in their entirety to define the disease mutation in that pedigree (12). This is made difficult by the sizes of the genes, 5,592 base pairs over 22 exons for BRCA1 and 11,385 base pairs over 27 exons for BRCA2.

 

To rapidly identify mutations in the 3.45-kilobase exon 11 of the BRCA1 gene, microarrays were constructed that contained oligonucleotide probes with defined sequences at known, specific addresses in the array. Using light-mediated synthesis, 90,000 20-base oligonucleotides were synthesized on 1.25 x 1.25 cm chips. For each base in the gene sequence, there were four oligonucleotides on the chip: one corresponding to the wild-type sequence and three corresponding to the possible point mutations at that position.
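The four-probes-per-base design can be sketched as follows. For one interrogated position, the code emits a 20-base window carrying each of the four possible bases at that position. The position of the interrogated base within the probe (`offset`) and the reference string are assumptions made for this sketch; real probes would also be complementary to the labeled target, a detail omitted here for readability.

```python
BASES = "ACGT"

def probe_set(reference: str, position: int, length: int = 20, offset: int = 10):
    """Return four probe sequences interrogating reference[position].

    The window is `length` bases long with the interrogated base at index
    `offset`; the offset value is an assumption made for this sketch.
    """
    start = position - offset
    if start < 0 or start + length > len(reference):
        raise ValueError("position too close to an end of the reference")
    window = reference[start:start + length]
    # One probe per base: the wild-type window plus three single-base substitutions.
    return {base: window[:offset] + base + window[offset + 1:] for base in BASES}

# Arbitrary 30-base sequence used only for illustration.
reference = "ACGTTGCAAGCTTAGGCATCGATCCGTAAG"
probes = probe_set(reference, position=12)

wild_type_base = reference[12]
for base, probe in probes.items():
    label = "wild type" if base == wild_type_base else "substitution"
    print(base, probe, label)
```

Sliding `position` along the whole target yields four probes per base, which is the layout described in the text for one strand.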

 

These arrays can be used as "sequencing" chips, provided three issues are addressed during sample preparation and hybridization. First, RNA-DNA duplexes have more easily controlled hybridization parameters than DNA-DNA duplexes, so samples were prepared by PCR amplification with primers containing T3 and T7 promoters, for subsequent in vitro transcription. Second, long transcripts are likely to hybridize to themselves, so this was minimized by randomly fragmenting the RNA to a size of 50-60 bases. Third, hybridization conditions must be controlled carefully to stringently discriminate against mismatched sequences.

 

A final but easily resolved complication is that when analyzing a mutation from one individual, it can be difficult to distinguish cross-hybridization artifacts from a heterozygous mutation, where two different bases are present at the same position, one from each chromosomal copy of the gene. As described above, interpretation is actually simplified by a two-fluorophore comparison. One individual's mutant sample is tagged with a red fluorophore, and a second individual's wild-type sample is tagged with a green fluorophore. After hybridization, wherever the sequences are the same, the array feature fluoresces yellow (green + red), including cross-hybridizations to mismatched sequences; the mutations then stand out in the array as red spots. Two-fluorophore comparisons place the emphasis on the difference between a mutant sample and a reference wild-type sample, where it belongs.

Chips have also been constructed to contain all possible one-base deletions and all possible one-base insertions and to allow chip "sequencing" from both strands.

 

This technology shows great promise as a clinical tool: it is 100% specific (false positives have not been observed), and even in its early stages of development it can detect up to 90% of mutations previously identified by more traditional and time-consuming sequencing approaches.

 

Developing a View of Population Relatedness

The availability of a reference genome sequence and DNA microarray technology may elucidate certain aspects of human population biology and change the way disease genes are mapped and identified.

 

The human genome contains 3 billion base pairs, and it is thought that a common sequence variant will be found at about every thousandth base pair, or that 3 million common sequence variants exist. Of these, most will occur in non-coding regions, but 1% or about 30,000 might occur in coding regions and affect gene function. In other words, one out of every 2 or 3 genes might have common sequence variants that contribute to disease. This fits well within the current context of human biology. For example, it is already known that apolipoprotein E, which has been shown to have a role in Alzheimer's disease, has three common variants; more examples are likely to be found among the already known electrophoretic variants of other proteins. How will these common sequence variants be identified?
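The estimates in this paragraph follow from simple arithmetic, which the short sketch below reproduces. All input figures come from the summary itself (the 80,000-gene estimate appears in the status section); the variable names are illustrative.

```python
genome_size_bp = 3_000_000_000   # size of the human genome, from the text
variant_spacing_bp = 1_000       # about one common variant per thousand base pairs
coding_fraction = 0.01           # share of variants expected in coding regions
gene_count = 80_000              # gene estimate quoted earlier in the summary

common_variants = genome_size_bp // variant_spacing_bp
coding_variants = round(common_variants * coding_fraction)
genes_per_coding_variant = gene_count / coding_variants

print(common_variants)                      # total common variants
print(coding_variants)                      # variants in coding regions
print(round(genes_per_coding_variant, 1))   # genes per coding variant
```

The last figure, between 2 and 3, is the basis for the statement that one out of every 2 or 3 genes might carry a common variant.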

 

Related to this is the relatively recent appearance of humans as a species. It is estimated that 100,000 years ago, there were only 100,000 of us. Because we are a recent species, there has not been time for complete homogenization of our genomes by genetic recombination. As a first approximation, we are not a collection of 3 billion base pairs that vary independently but instead a collection of about 30,000 segments, each about 100 kilobases long, derived from our common ancestors and uninterrupted by recombination.

 

Each 100 kilobase segment might have only a small number of common alleles, marked by about 100 polymorphisms, one of which is functionally important. These polymorphisms for each segment could be catalogued, and the catalog could be used to find disease genes by association analysis. With a large enough population and catalog of polymorphisms at each segment, it should be possible to associate specific diseases with specific polymorphisms in specific blocks, sidestepping the need to collect family pedigrees and allowing disease gene identification among unrelated individuals. Also, this could become a powerful method for locating genes with weak individual contributions to disease.
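One common way to test for association between a polymorphism and a disease among unrelated individuals is to compare allele counts in cases and controls. The sketch below computes a Pearson chi-square statistic for a 2x2 table; the talk summary does not name a specific test, so this choice, the function name, and the counts are all assumptions.

```python
def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Assumes every row and column total is non-zero.
    """
    table = [[case_with, case_without], [control_with, control_without]]
    total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if allele and disease status were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic

# Hypothetical counts: the allele is seen on 60 of 100 case chromosomes
# and on 40 of 100 control chromosomes.
statistic = chi_square_2x2(60, 40, 40, 60)
print(statistic)
```

A larger statistic indicates a stronger departure from independence. Scanning tens of thousands of segments would also require correcting for the number of tests performed, which this sketch does not attempt.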

 

Since the start of the genome project, genetic maps have evolved from ones based on restriction fragment length polymorphisms (RFLP maps) to those based on a variable number of tandem repeats (VNTR maps) to current maps based on microsatellite markers. The next generation of genetic maps may be built from simple, single-nucleotide polymorphisms in segments, scored with DNA microarray chips instead of gels (13).

 

In summary, there is much to anticipate from the genome project, and many exciting developments will occur over the coming decades. For those who have felt that proteins did not get any respect for a while, your day is coming.

 

References

1. E. Jordan and F.S. Collins. "A march of genetic maps". Nature 380, 111-112 (1996).

2. E.S. Lander. "The New Genomics: Global Views of Biology." Science 274, 536-539 (1996).

3. E.M. Southern. "DNA chips: analyzing sequence by hybridization to oligonucleotides on a large scale." Trends in Genetics 12, 110-115 (1996).

4. Anonymous editorial. "To affinity...and beyond!" Nature Genetics 14, 367-370 (1996).

5. F.S. Collins. "Positional cloning moves from perditional to traditional." Nature Genetics 9, 347-350 (1995).

6. J.R. Smith, D. Freije, J.D. Carpten, H. Gronberg, J. Xu, S.D. Isaacs, M.J. Brownstein, G.S. Bova, H. Guo, P. Bujnovsky, D.R. Nusskern, J.E. Damber, A. Bergh, M. Emanuelson, O.P. Kallioniemi, J. Walker-Daniels, J.E. Bailey-Wilson, T.H. Beaty, D.A. Meyers, P.C. Walsh, F.S. Collins, J.M. Trent, and W.B. Isaacs. "Major susceptibility locus for prostate cancer on chromosome 1 suggested by a genome-wide search." Science 274, 1371-1374 (1996).

7. M.H. Polymeropoulos, J.J. Higgins, L.I. Golbe, W.G. Johnson, S.E. Ide, G. Di Iorio, G. Sanges, E.S. Stenroos, L.T. Pho, A.A. Schaffer, A.M. Lazzarini, R.L. Nussbaum, and R.C. Duvoisin. "Mapping of a gene for Parkinson's disease to chromosome 4q21-q23." Science 274, 1197-1199 (1996).

8. T.J. Hudson, L.D. Stein, S.S. Gerety, J. Ma, A.B. Castle, and 46 others. "An STS-Based Map of the Human Genome." Science 270, 1945-1954 (1995).

9. G.D. Schuler, M.S. Boguski, E.A. Stewart, L.D. Stein, G. Gyapay, and 99 others. "A Gene Map of the Human Genome." Science 274, 540-546 (1996).

10. J. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, and J.M. Trent. "Use of a cDNA microarray to analyse gene expression patterns in human cancer." Nature Genetics 14, 457-460 (1996).

11. J.G. Hacia, L.C. Brody, M.S. Chee, S.P.A. Fodor, and F.S. Collins. "Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-color fluorescence analysis." Nature Genetics 14, 441-447 (1996).

12. F.S. Collins. "BRCA1 Lots of Mutations, Lots of Dilemmas (editorial)." New England Journal of Medicine 334, 186-188 (1996).

13. F.S. Collins. "Sequencing the human genome." Hospital Practice 32, 35-43 (1997).


John Rush may be contacted at the Department of Genetics, HHMI/Harvard Medical School, 200 Longwood Avenue, Boston, MA 02115, Tel: (617) 432-7480, Fax: (617) 432-7440, Email: rush@rascal.med.harvard.edu.




Created: 13th June 1997
Last modified: 13th June 1997