Tarceva Lung Cancer Survival Rate

Dr. Eric Lander: Oh, wow. I want to say a couple of thank-yous and a couple of things. First, to Jeff Trent: it is a tremendous honor to come and give the Trent Lecture. I think it's great naming lectures after people while they're still alive. It's better than coming and giving the Trent Memorial Lecture, to give the Trent Lecture while there's still a Trent to enjoy it, and so I salute you for honoring Jeff for the wonderful thing he did in starting the intramural program in NHGRI.

I very much want to also salute the scientists of NISC, Eric Green and all of the people who have worked with NISC. At the beginning of the day, they stood up and we saluted them, but many more people have come and gone over the course of the day and there's a bunch of new people, so if it's okay, I would like to ask the fantastic scientists who created NISC, who continue to run NISC, and who have done this amazing thing by ensuring that the world's best biomedical campus, the NIH, has a world-class sequencing center. So if I could ask all the people from NISC who are here today, because many of them are, to please stand. I think we want to salute you again.

We have had the pleasure of working with NISC on many, many projects, admiring many other projects, their role in the mammalian genome projects and ENCODE, the mouse genome and Cancer Genome Anatomy, and in brainstorming many projects still on the drawing boards
and soon to get underway. So well done, happy 10th birthday, and we look forward to many, many more and to watching the impact that you have on the NIH and the impact you have on the world continue to grow.

I just want to thank all the preceding speakers, who are good and close friends from the world of genomes, Claire and Richard and Rick and Wylie and Rick and Andy and Evan and David, and many of the things I'll touch on are things they have already touched on, because, in fact, we're all interested in this broad common world of what you can learn from genomes. So, since you'll have heard bits and pieces of all of these amazing ideas from people over the course of the day, what I'm going to try to do is draw them together, run a thread through it, and really address the question of genomic information and what we can learn from it. Because I think the single greatest change over the course of the last 20 years or so in biology is the recognition that biology is, yes, the study of organisms and, yes, the study of molecules and things, but that at its very core it is about information, and that there is genomic information. By genomic I don't necessarily mean the DNA; I mean genome-scale, comprehensive, complete information about all the components of the cell, DNA, RNA, proteins, modifications thereof, and that by laying out all of that information, we can transform the sort of questions that we can address.

All of the speakers today have shown beautiful examples of that, and I'm particularly delighted
to see so many young people, postdocs, graduate students, in the audience, because this is the world you guys are inheriting: a world where it is not just about the experiments on your own bench but the experiments of the entire world, laid out before you to pick through and figure out how to extract the information from. So that's the theme. And I'm going to touch on many different forms of genomic information, if I can.

But the granddaddy of all genomic information projects, of course, was the Human Genome Project. It taught us some very important things. It taught us it wasn't a bad idea to lay out some clear goals. Goal-directed science had a bad name originally, but the idea was that if we thought clearly and there were some things we had to get, it wouldn't be bad to lay out those goals and try to go for them and hold ourselves to them. It also taught us that if we're making a project about information, it was absolutely crucial that that information be completely and freely and immediately available to anybody, because it was simply absurd to think that the people who were producing the information were the only ones who could use it well. We needed to enlist the ideas and the creativity of everybody around the world, in any country, in academia or in industry. And so that was an important lesson that emerged from it. We learned the importance of laying out concrete plans and timelines. There was a plan and timeline laid out for the Human Genome Project over the course of 15 years, and it actually pretty much worked according to plan. There were, you know, lots of innovations along the way,
but there was a sensible plan and we learned how to plan together, including planning in the face, sometimes, of huge uncertainty.

And we learned the importance of collaboration, the importance of international collaboration. The genome project, again, was a kind of granddaddy here, involving six countries and 20 centers, but every project that we talk about has been an international project involving many groups in the United States and many groups in other countries, in this ever-changing mix of centers helping one another to stay at the edge.

In the case of the Human Genome Project, as you all know, a rough draft sequence came out in 2001 and a finished sequence came out in 2003. There was another little lesson there,
finished. "Finished" is a technical term in the world of genomics. It means the vast majority of it, but there are still 300 gaps and that's okay, we're aware of it. Absolute completion shouldn't be the enemy of getting the vast majority of the information out. And there are many things where we can state, and have stated, that we can't quite get the last little bit out, but we can get the first 95, 98 percent out and we should get it out into the hands of as many scientists as possible.

And of course, what's been the impact of it? Well, it's laid out before us the landscape of the human genome. It's a beautiful landscape with all of these interesting mountains and valleys, gene-dense regions, gene-poor regions, all sorts of these striking things.
But the real test has been its impact on medicine. When the Human Genome Project started, only about 70 diseases had been identified molecularly, single-gene Mendelian disorders that had been identified before the Human Genome Project. With the tools that have emerged during the course of the Human Genome Project we're now up to some 2,600 Mendelian conditions for which we know the guilty gene, and people can study them in great detail.

So that was all fun, but that's past history; that was the Human Genome Project. What about beyond the Human Genome Project? What is the agenda today? What are the sorts of things that genome centers, that people around the world, are trying to ensure that we have, and have freely available on the web for everyone?



Well, the Human Genome Project had a goal: know all the sequence in the human genome. "All" is in italics because it means, you know, the vast majority, and don't give me a hard time about the last percentage or so.

Here are some other things. We need to know all the genetic variation in the human population and its relationship to disease. We need to know all the functional elements in the human genome. We have been hearing about these things already from the speakers today. We need to know all the signatures of cellular responses. Cells only know how to do a limited number of things. I don't know if it's 500 or 5,000, but there's a limited number, and we're going to be able to recognize what those things are by some reduced signatures of cellular
responses. We need to be able to modulate all the genes in the genome. We need to know all the mechanisms of cancer, and we need to know similar information about the genomes of all the major infectious agents.

That's a good to-do list, and that is the to-do list not for the 21st century, for goodness' sake; this is the to-do list for the next ten years. And indeed, for those of us involved in this, we know that on more than half the stuff on this list there's already been great progress, and we can begin to start putting checkmarks next to things on this list, because we're quite far along on them, and there's nothing on this list, I think, that should take us more than the next decade or so, with the appropriate interpretation of the word “all.” There
will still be things to discover 30 years from now and all that, but we can get the vast majority out there.

It is helped by what is one of the themes in this symposium, the continuing innovation in technologies. The Human Genome Project was helped greatly by the appearance of first fluorescent sequencing and then capillary sequencing, and then we've had the appearance of all sorts of next-generation and next-generation and next-generation sequencers: these 454s and Solexas, SOLiDs and Helicoses and others, and I won't fuss over what their throughputs and read lengths are, because they're changing every day as people are continuing to improve these machines. But one is getting up to points of gigabases
per run, or perhaps two gigabases per run. I've heard of four gigabases per run on some of these platforms, and there seems to be no reason why those things can't be achieved.

So I want to turn to the topics I was talking about, starting with human genetic variation. Let me take that one first and just describe what has been just a remarkable, remarkable period since the Human Genome Project. Now, as various speakers have referred to, there is a fair amount of polymorphism in the human population. It's actually not that large compared to most mammalian species; they are more polymorphic than we are. But we have about one heterozygous base per thousand bases or so, or 1,300 bases, in the human genome. And if I take a random heterozygous base in you, the probability is greater than 90 percent that it's shared
with other people in this room. That is, the vast majority of the variation in you is common genetic variation. It's not these rare Mendelian things that are private mutations; the vast majority of what you have got is common genetic variation.

And what does it do? Well, we know some examples. It's already been referred to: apolipoprotein E has a common genetic variant, widely referred to, that confers risk of Alzheimer's disease. We have got some other examples, such as a common genetic variant in CCR5 that confers protection against HIV. But we really had no systematic way of looking at what might be the medical implications of common genetic variation. So in 1996, several folks, myself included, began to get very interested in the idea, even before we had
the sequence of the human genome all tidied up, in fact, before we even had most of it, that we needed more than a sequence. We really needed to understand all the common genetic variation in the human population. Well, simple back-of-the-envelope calculations could tell you that there are about 12 million common genetic variants, and the hallucination was this: that one might be able to simply write down all the genetic variants along the top of an Excel spreadsheet, write down all the diseases along the side of the Excel spreadsheet, and human genetics might reduce simply to saying which genetic variants were enriched in which diseases. That would be very nice. It was also kind of a nutty thing ten years ago to think about that, because it implied having 12 million genetic variants,
and we had nothing close to that. It implied being able to genotype these 12 million genetic variants in thousands and thousands of patients. And mind you, near completeness was necessary. If you could only do ten percent of it, well, you'd only catch ten percent of the things you were looking for. You really had to get the whole thing. But as these kinds of genomic information projects have taught us, put one foot in front of another, consistently, and you may be able to build to these goals.

To indicate just how poor the information resources were when we started, one could publish, in fact we did publish, a paper in 1998 entitled "large scale identification of SNPs" that could report 4,000 SNPs and call it large scale. That was just an indication
of where we were at that point. But through efforts like this and others, the idea came along that we should be able to collect SNPs in a systematic fashion. A public/private consortium was put together, the SNP Consortium, in 1999, with what sounded like an ambitious goal: 300,000 SNPs across the genome. That proved quickly to be under-ambitious, as the SNP Consortium within two years reached 1.4 million SNPs. And then as the Human Genome Project came rolling along, it was quickly increased to two million SNPs, three million SNPs, blah, blah, blah, eight million SNPs, something like 10 million SNPs now. The vast majority of the common genetic variation in the population is already in the public databases. If we find a heterozygous site in you, we know empirically that the odds are very good
it is already in the databases.

Now, the problem was still: how are you going to type 10 million SNPs across each patient? Could you get away with less somehow, without sacrificing the information? Well, here some of the ideas from Mendelian diseases became very helpful in organizing the thinking. Some of the Mendelian diseases that occurred in isolated populations with single founder chromosomes reminded us that every mutation occurs on a single ancestral chromosome that has a bunch of polymorphisms on it, and as it's passed down through the generations, recombination whittles away the markers at far distances, but nearby you still have strong correlation amongst the markers that are there. You still have linkage disequilibrium. And you could
use it for mapping. For example, in places like Finland, without even families, just looking across a population of Finns with a rare genetic disease, you could map it by linkage disequilibrium, that signature of ancestral chromosomes.

A very important paper from Mark Daly showed that even in a general European population in Toronto, you could, if you were up close and personal, detect that linkage disequilibrium. He found in a population of patients with Crohn's disease that there was a highly stereotyped pattern of blocks of genetic markers that hung together so well that you only needed a couple of those genetic markers to serve as a proxy for the entire block. And so that gave rise to this notion that if we only knew that correlation structure across
the genome, the haplotype structure across the genome, we'd be able to pick out a mere 300,000 or 400,000 genetic markers and trace inheritance this way.

Well, from a random proposal there of "wouldn't it be good to do that," the community swung into action, and within a year a Haplotype Map Project was launched. Again, the same pattern: multiple countries, multiple centers, clear goals, free information sharing. And by 2006 it was largely completed, and that nice correlation structure is quite evident in this correlogram here across a tiny region of the genome, but the slide goes on all the way across the NIH campus.

Then, you also needed technologies to genotype. Even 300,000 is a big number, but here a variety of different ideas in both the private sector
and the public sector came together to allow multiplexing of one marker, ten markers, a thousand markers. By last year, half a million genetic markers could be simultaneously genotyped on DNA chips. It's up to a million this year. And so suddenly, one had to put up or shut up. One had to actually say: you had the genetic variation of the human population, you had the tools for genotyping across people, why not do it? And many groups around the world have been doing just that for the past year. And it has been an annus mirabilis, 2007, a year of miracles.

Just to give you a graph here of the confirmed common variants involved in common disease: 2000, a single, very interesting report of PPAR gamma and type 2 diabetes.
Crohn's disease, published in 2001. Another diabetes gene in 2003. Age-related macular degeneration in 2005. 2006, several more. 2007, through April, when the tools became available, through August, through September. I don't have October; I'm getting tired of continuing to remake this slide. And it's going to have a lot of trouble fitting on by the end of December. But it's clear that there is an extraordinary explosion right now of disease associations of common genetic variants. And why is that? It's because of the continued investment in infrastructure, in building the tools: Human Genome Projects, SNP Consortiums, HapMap Projects, genotyping arrays. It's the NIH behind many of these things. It's the private sector behind many of these things. It's private/public
partnerships behind these things. But it's the willingness to actually roll up sleeves and create that infrastructure and then make it broadly available to a community.

What are we learning from these sorts of findings already, in what has been just about a year of this? Well, with regard to the common disease, common genetic variant idea, we learned it works. You can find lots of them, and the significance levels are extraordinary. Ten to the minus tenth hardly impresses anybody these days. There are significance levels of ten to the minus 60th, ten to the minus 120th. We're learning that the vast majority of the genes that play a role are not the genes that were prior candidates on anybody's list. It's perhaps no surprise; we knew this from the Mendelian diseases. We were bad guessers.
We're bad guessers about the common diseases as well. And we're also learning that many of the risk factors are not in coding sequences. They are noncoding. They are probably regulatory sequences. So, no shock: we have already heard from the speakers that a significant fraction of the functional stuff in the human genome is noncoding; well, a significant fraction of the variation that affects disease is noncoding. We have our work cut out for us to understand it, but it's in the population, it does affect risk, and it's probably going to be a very good handle into what these things do.

It's revealing new pathways: the complement pathway in macular degeneration; autophagy, involved with multiple loci, in inflammatory bowel disease; beta cell function, and in
particular, all sorts of new things, zinc transporters, et cetera, in type 2 diabetes. It's revealing connections between diseases, already referred to this morning: chromosome 9, this interesting region that has a myocardial infarction risk factor and a type 2 diabetes risk factor very close to each other. What does that mean? They're not the same, they're a little bit apart, but very, very close.

We're learning that the effect sizes may be modest but they may be very important. PPAR gamma: it's only a 1.3-fold increase in your risk, but it happens to be a drug target for a drug that's useful in type 2 diabetes. We're learning that some of these markers, for example, in type 2 diabetes again, can be very useful in a clinical sense for identifying which prediabetic patients will benefit most from early interventions.
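
[Editorial illustration, not part of the lecture.] The association findings described here, with modest effect sizes such as a 1.3-fold risk and very small p-values, come from comparing allele counts between cases and controls. The following is a minimal sketch of one common way to do that, a chi-square test on a 2x2 table of allele counts. The function name and the example counts are hypothetical, and real studies use more careful models (covariates, population structure, multiple-testing correction).

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi2, p_value) for a 2x2 table of allele counts.

    Rows are cases/controls, columns are alternate/reference allele counts.
    Hypothetical helper for illustration only.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Made-up counts: 1,000 cases and 1,000 controls, two alleles each.
odds_ratio, chi2, p_value = allelic_association(600, 1400, 500, 1500)
print(f"OR={odds_ratio:.2f} chi2={chi2:.2f} p={p_value:.2g}")
```

With a fixed odds ratio, the chi-square statistic grows roughly in proportion to sample size, which is one way to see why pooling cohorts turns modest effects into very small p-values.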
We're learning about ethnic variation and health disparities, about 8q24, a risk factor for prostate cancer that is present in all populations but at higher frequency in African Americans and may explain the somewhat higher frequency of prostate cancer in African Americans. We're learning that it's often hard to find the specific gene, the specific allele; a lot of work is going to be needed for that. We'll come back to that. We're learning that more is more. Larger sample sizes will yield even more. I can tell you stories from inflammatory bowel disease that Mark Daly tells me: the first thousand or so patients identified six loci, but when three different groups pooled their data to get 3,000 or 4,000 patients, they got up to something like 30 highly significant loci that come with larger sample
sizes.

We are learning that there's still much more of the genetic variance to explain. We've explained maybe 50 percent of the variation for macular degeneration but perhaps five percent of the variation for type 2 diabetes. Why? Is it that we're missing the genes? Is it epistasis between them? Is it environment? Well, it's only been a year; nobody knows. The dust hasn't come close to settling. These are the sorts of questions.

So what do we need? Well, what we've really learned is we've barely scratched the surface of this. We've scratched the surface, probably, of the genes and barely scratched the surface of the biology. What do we need? Well, three things. Larger samples, and more diverse populations.
Most of the work has gone on in European-derived populations. We know that different alleles are at different frequencies, and you'll have more power to spot different things if the allele frequencies are somewhat different. And so African American populations will reveal different loci, not because there are fundamental differences but because the allele frequency fluctuations between populations make it easier to spot some things; likewise Asian populations, Hispanic populations. This is essential to really being able to do the biology, as well as being able to investigate health disparities.

Beyond that, as several of the speakers, notably Richard Gibbs, referred to this morning, we've only examined some of the range of genetic variation. We have looked only, really, with
these genome-wide association studies at the genomic variants between 50 percent and five percent frequency. Polymorphism in the human population, the word, technically means down to about one percent. Common variation in the human population, segregating variation, that is to say, variation common enough that if you got a thousand patients you'd see it multiple times, enough times to recognize that it was an increased risk factor, runs down another log below that five percent, to at least half a percent, and yet the studies now are not powered to do that. We don't even have catalogs that run down there. And yet we know there's important stuff: Helen Hobbs' beautiful work on PCSK9, with variants in the range of two to three percent, common genetic variation but not yet assayed by the types of maps we're using.
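
[Editorial illustration, not part of the lecture.] The remark that a half-percent variant would still be seen "multiple times" in a thousand patients can be checked with simple arithmetic. The sketch below assumes each person carries two independent copies of the site (a simplification; the function names are hypothetical) and uses a binomial model for the number of copies observed.

```python
from math import comb


def expected_copies(n_people, allele_freq):
    """Expected number of variant allele copies among 2 * n_people chromosomes."""
    return 2 * n_people * allele_freq


def prob_at_least(k, n_people, allele_freq):
    """Binomial probability of seeing at least k copies of the variant.

    Assumes the 2 * n_people chromosomes are independent draws, which is
    an idealization used only for this illustration.
    """
    n = 2 * n_people
    below = sum(
        comb(n, i) * allele_freq**i * (1 - allele_freq) ** (n - i)
        for i in range(k)
    )
    return 1.0 - below


# A 0.5% variant in 1,000 people: about 10 copies expected.
print(expected_copies(1000, 0.005))
print(prob_at_least(5, 1000, 0.005))
```

Seeing a variant several times is only the first requirement; recognizing it as a risk factor also depends on how much its frequency differs between patients and controls.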
We need to have genome-wide projects. There are discussions of whole-genome, thousand-genome projects to collect all that genetic variation, so that in this HapMap-type fashion we can exploit all of that to do common variation studies. For now, as regions come up, people are extremely interested in sequencing those regions to find the lower-frequency variants. But here, since they are, in fact, common enough that we could collect them all, as Richard referred to, let's collect them all.

And then of course, there are rare mutations. There are mutations that are private mutations and they can be very revealing, too. Helen Hobbs has beautifully shown in a population of patients with low HDL that a couple of genes have just too many rare singleton mutations,
and that, too, is a signature, a signature that can't be caught by the common genetic variation, and we need the tools for that.

And I'll take it for granted, but Evan Eichler has made a very good point about this, that the human genome also has much more than SNPs. It has copy number variation in these interesting repeated regions, and we need to be able to put all of that into this pipeline as well and look at the copy number variation across the genome. And for all of it there's a tremendous amount of sequencing that's going to have to go on in the next couple of years, but as with these other projects, I think it's guaranteed to give us the kinds of catalogs and tools we need to drive this problem home. At least to drive it home with regard to finding
genetic variants.

What do they mean? Well, we need tools to connect these genetic variants to physiology. We can't forget that piling up 20 things that might be involved in inflammatory bowel disease, 20 things that might be involved in type 2 diabetes, is, of course, just the start. How are we going to keep up with that pace in the laboratory? Well, I want to turn to some of the things we need for that. So let's put aside all this human genetic variation and collecting it. I'm confident that can happen. What about breathing functional meaning into the genome so we can make sense of this human genetic variation, so we can connect it with disease? So I want to turn to talking a little bit about all the
functional elements in the genome.

Well, there are two different ways that one can approach them that I'll at least mention; there are probably some others. Conservation maps: looking at the portions of the human genome that evolution has voted on as really mattering. And David Haussler has referred to this quite beautifully, that by looking at the patterns of conservation across the genome one can learn a lot about what matters in the genome, even if the mouse knockout doesn't show a phenotype. If evolution tells you it's not willing to change that base, I go with evolution; it knows what it's doing. And then I also want to talk about chromatin state maps, a new kind of map that I think we want to collect a lot of and put on the web. So let's turn to ways of
annotating the human genome so we'll be able to make sense of some of these disease loci.

So, conservation maps. Clearly the first thing after the Human Genome Project was to get the mouse genome done, and many of the people in this room, including folks at NISC, played crucial roles in getting the mouse genome done. And then using that mouse genome, lining it up with the human genome and with a few other genomes, the dog genome, the rat genome; lining up just that first handful of genomes has revealed a number of important things. Genomic comparison has already revealed that the human gene catalog is very different than we thought. It's not the hundred thousand that was in the textbooks a decade ago. It's not even the 30,000 or 40,000 that we all wrote in the human genome paper
back in 2001. It's not even, I think, the 25,000 protein-coding genes that were in the current catalogs last year.

In fact, comparative work from the handful of mammalian species, which Michele Clamp has very nicely shown in a paper coming out very shortly, suggests that the human protein-coding gene count is probably in the neighborhood of about 20,000 to 21,000. The current databases probably only have about 20,400 real protein-coding genes, and much of the rest of the stuff is simply open reading frames that are spurious. I don't have time to go into the arguments, but you can pick that out by comparison. And the number of really primate-specific things is modest, measured in the hundreds, and they are the sort of things that Evan
Eichler talked about, these very exciting gene families that are getting born. There is new stuff, but for the most part the story with protein-coding genes is paring them down and whittling them away.

But even as the coding things are getting pared away, the noncoding things in the genome are really crying out for our attention. They're burgeoning. As you look across the genome, as various speakers referred to, we find that there are patches of conservation, clear conservation, ranging from these ultraconserved elements to smaller binding sites that evolution has lovingly preserved, and that something like two thirds of all the stuff evolution has preserved is this noncoding stuff covering about five percent of the human genome. We know in a few cases that they are regulatory
elements because when you knock them out of a mouse you're able to see that it dysregulates genes nearby, but that's a pretty tough thing to do to annotate half a million elements. Half a million mouse knockouts is daunting even for me to contemplate, which is big.

So the best way to really home in and clean this up is to increase the power of the data, first. With just the human and a mouse or a dog there's a limit to how much you can get, but evolution kindly made many mammals, and by comparing more and more genomes we're able to refine those signals, get rid of the noise, pull up the signal. And so various groups came together, but here I particularly want to credit the folks at NISC, collaborating
with some folks at the Broad, for proposing a concrete program to sequence a large number, about two dozen, mammalian genomes. And that program the NIH launched, involving all of the sequencing centers, with elephants and armadillos and rabbits and bats and cats and hedgehogs and all that, and the project is essentially complete. There are aspects of it still being tidied up, but the vast majority of these data are already freely available on the web. David Haussler has referred to some of this already, and groups around the world are putting together all these two dozen sequences and saying, can we get down not just to 200-base-pair conserved elements but 150, 10; can we pick out 10-base-pair elements, et cetera, and there's just been an explosion of interest among folks who are
comfortable with both genomes and bioinformatics in squeezing out all of the information that evolution was kind enough to leave us from the experiment that's called the mammalian radiation.

So, I'll give you some examples of things that come out if we're looking at genomes. Here's one. I'm fond of this one. If you line up many genomes and you start looking at what's conserved, you find a funny little site here, well, it's not that little, a funny site here that's present about 5,000 times across the human genome, and when it occurs it's very well conserved. What in the world does it mean? So we used that. We took a biotinylated version of that piece of DNA and pulled down protein with it. We took cellular extract and bound it to the biotinylated sequence that
contains that motif there, and found, when we pulled it down and flew it on a mass spec, the CTCF insulator protein, an insulator protein that blocks the spreading of gene expression.

Only about three insulator sites in the human had ever been characterized, but suddenly, maybe the genome has given us 5,000 candidate insulator sites. How are you going to prove that they're really insulators? Are you going to go knock them all out? That's a lot of work. It turns out, again, genomic information can give you a very good clue, right away. Just take all the genes in the genome that are divergently transcribed. If they're divergently transcribed and this thing is an insulator sequence, then when there's an insulator sequence
in the middle, those genes should have uncorrelated gene expression. If there's no insulator sequence, they should have correlated gene expression. Get the public databases, look at their gene expression patterns, and it works. The guys who have this tend to be uncorrelated; the guys who don't tend to be correlated. So you can take that out of the information. Obviously, you want to go do biochemistry after that, but it's very nice to be able to do this because you can do this in an afternoon.

Other things can come out. You can take the things that David refers to, these ultra, ultraconserved sequences way out at the end, or a little less ultraconserved, maybe super conserved or very conserved or something. Take the five percent most conserved sequences
across the genome and see where they are across the genome. And when you do that, you find the following curious fact: the most conserved noncoding sequences across the genome are not near genes. They're in gene deserts, gene-poor regions. But not no genes, just gene poor. What genes are in those gene-poor regions? Developmentally important transcription factors. Almost every one of those 200 regions that have peaks of highly conserved noncoding elements is enriched for developmentally important transcription factors or axon guidance receptors. Half of that very conserved stuff is focused around these regions. They must be very interesting. What do they do?

So we were curious about understanding what was going on that was special at these regions, and
that led us into the second part of the work,chromatin state maps. because we took a guess that maybe chromatin would be one way in whichthose loci were special. and so, we began to explore the chromatin structure of thesefunny regions, and i'll tell you about that now. chromatin structure is enormously complex.histones have these tails that are decorated with all sorts of modifications but for themoment i'll keep it simple and refer to only two histone modifications. one, lysine 4 trimethylation,which i'll color green because it's associated with active genes; and lysine 27 trimethylation,which i'll call a red, because it's been historically associated with inactive genes. one can thengo look and what we did was using chromatin
immunoprecipitation on a microarray, a dnamicroarray for just these special regions of the genome. we began to explore a chromatinstructure of those regions and we found that in mature cells sometimes they had the greenmark, sometimes they had the red mark, sometimes they didn't have any mark, but you never seeboth together, which was consistent with the literature that it was either a green it wasan on or an off, until we looked at embryonic stem cells. and in es cells we found a verycurious phenomenon. right around those developmentally important genes in these regions, we foundthat in embryonic stem cells they were marked with both red and green, both an on and anoff mark, and yet were silent, as if they were poised for either activation or repression,according to which lineage they might go down
into. at least that was our hallucination, there. well, to really look at that in a serious way, one's got to expand to more cell types and expand to the genome. and as rick myers has already referred to, the idea of doing chromatin immunoprecipitation and hybridizing it to a dna array is something that's so 2006, it's really not at all au courant. the right way to do it now is through chromatin immunoprecipitation, get the dna and run it on one of these ultra-high-throughput sequencers that give you little reads and you map them back to the genome. so we did that using a solexa and the data are, as they would say, comparable. the top line is sequencing, the bottom line is a microarray, they look pretty much the same.
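[a rough sketch of the comparison just described, not the actual pipeline used in this work: short reads that have already been mapped are counted in fixed-size windows, and the per-window counts are correlated with a microarray signal over the same windows. the function names, the 1,000-base window and the input shapes are all assumptions made for illustration.]

```python
from collections import Counter


def bin_read_counts(read_starts, bin_size=1000):
    """Count mapped reads per fixed-size genomic window.

    read_starts: iterable of (chromosome, start_position) pairs for reads
    that have already been aligned to the genome (hypothetical input shape).
    """
    counts = Counter()
    for chrom, pos in read_starts:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def compare_to_array(counts, array_signal):
    """Correlate sequencing counts with array intensities over shared windows.

    array_signal: dict mapping (chromosome, window_index) -> intensity
    (an assumed layout, chosen to match bin_read_counts).
    """
    windows = sorted(array_signal)
    seq_profile = [counts.get(w, 0) for w in windows]
    array_profile = [array_signal[w] for w in windows]
    return pearson(seq_profile, array_profile)
```

[a correlation near 1 over the windows covered by the array would be one crude way of saying the two profiles "look pretty much the same"; real analyses would also normalize for read depth and background.]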
and so we could do this across various celltypes and for a variety of different chromatin marks, and i'll summarize a bunch of datafor the following sort of questions. the question we really want to know deeply, we want toknow, how does a cell decide to take up a career? when a cell decides to go from beingan es cell to a fully differentiated cell, it makes a variety of career decisions alongthe way. it loses potential. it makes commitment. we say that in developmental biology, butwhat do we mean by it? what are the molecular correlates of a cell being committed to dosomething, or having the potential, still, to do something? we don't really in developmentalbiology have a clear, crisp way to read out what career decisions have been made, andwhich lie ahead. so what we have been trying
to do is study that with chromatin. and i'll give you a brief summary of where we're at, at the moment, and this will be slightly oversimplifying the data, but it's not a bad description of it. in embryonic stem cells, genes break up into three different categories. there are some at-rich promoters and they're fickle. they come on, they come off in different cell types. they're very fickle and my sense is these guys here come on or off depending on whether there's a transcription factor to turn them on or off. very fickle. 70 percent of the genes have cpg-rich islands and they're housekeeping genes and they're on all the time. 15 percent of the genes, somewhat more than just in those special regions, but highly enriched in those
special regions, are these bivalent genesthat start off in es cells in this bi potential state of red and green and then in differentlineages may go green or red, but we're finding now, sometimes, stay bi potential in someof those lineages. in which lineages do they stay bi potential?stay bi valent? roughly speaking, in those lineages that still have choices ahead involvingthat gene. so if we're looking at myoblast neural cells and fibroblasts, we're talkingabout a gene that's involved in hematopoietic cells, there are no more decisions to be made,it's made a final decision. but a gene involved in differentiation of some neurons still isbi potential here in a neuronal precursor, and a gene involved in differentiation ofadipocytes but not other descendants of fibroblast
is still bi potential there. and so very roughly, and this is the happything of when you only have a limited amount of data you can make a very simple happy model.so the very simple happy model right now is this bivalent mark is an indication of decisionsstill ahead. as we collect more data the model will surely become more complicated, but happily,i don't know enough yet to complicate you with it. but that's kind of the picture. these chromatin state maps are very interesting.they're revealing all sorts of things. here's a gene in embryonic stem cells. the codingregion here is the protocadherin gene that has a zillion different promoters. and youcan see in embryonic stem cells, every one
of these promoters is marked as a bivalentpromoter, independently with a green and a red, except that one, which is just green,and it's the one that's used in embryonic stem cells. you can oh, we also put ctcf,that insulator, on this, and it nicely insulates between each promoter. you can pick out the microrna genes. here'sa microrna. it's very hard to figure out what the primary transcript is for a microrna,but in fact, here is this green mark of activation and this other mark, k36, that identifiestranscribed regions, and it's very easy to pick out this must be the transcript thatresults in this mature microrna. similarly you can find new promoters for genes, foxp1instead of foxp2 that was talked about before.
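[a minimal sketch of how the green/red labels above could be read off per promoter, assuming each promoter already has one enrichment score for lysine 4 trimethylation and one for lysine 27 trimethylation. the threshold of 1.0, the label names and the input layout are hypothetical, not taken from the work described.]

```python
from collections import Counter


def classify_promoter(k4me3, k27me3, threshold=1.0):
    """Label a promoter by which of the two marks passes the threshold.

    k4me3 is the "green" (active-associated) mark and k27me3 the "red"
    (inactive-associated) mark; the threshold is an arbitrary placeholder.
    """
    has_k4 = k4me3 >= threshold
    has_k27 = k27me3 >= threshold
    if has_k4 and has_k27:
        return "bivalent"
    if has_k4:
        return "k4_only"
    if has_k27:
        return "k27_only"
    return "unmarked"


def summarize_states(promoter_scores, threshold=1.0):
    """Tally labels across promoters.

    promoter_scores: dict mapping promoter name -> (k4me3, k27me3) scores.
    """
    return Counter(
        classify_promoter(k4, k27, threshold)
        for k4, k27 in promoter_scores.values()
    )
```

[running the same tally on scores from different cell types would show which promoters stay bivalent and which resolve to one mark, which is the comparison the talk draws between es cells and the more committed lineages.]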
here's a little promoter here, here's the transcript. but in embryonic fibroblasts, there's another promoter being used and you can clearly read off the transcript there. you can read off which allele is being used because you're sequencing, so you can tell polymorphisms between the little reads, and you can tell that in hybrid mice, f1 hybrid mice, the green mark is on one parental chromosome and a different red mark called k9 is on the other parental chromosome. this is imprinted. this is active, that's the imprinted chromosome, you can read it off right away from the chromatin state map. and you can also tell the different alleles here. all of the transcription is occurring here off the castaneus allele, not the 129 allele. and so you can pick out and you can
do this with humans as well. and finally, going back to this human geneticvariation. we have begun to look at marks, the k4 mark, not trimethylation, but dimethylationand monomethylation. these marks, i don't want to confuse you with too many marks, butthese marks are marks that seem to indicate open chromatin and enhancers in particular.they're associated with hypersensitive sites in dna and you can kind of read these offas at least protoenhancer marks. and i put this region up for one reason, which is rememberi said chromosome 9 had this funny bit that was noncoding that was associated with bothmyocardial infarction and type 2 diabetes? it's there. and it's got all sorts of interestingenhancer things over it. now, i know these
enhancers are in a totally irrelevant cell type, they're in hl-60 cancer cells and they're in human umbilical vein cells here. but nonetheless, one can get cell types now and mark up those enhancer structures in more relevant cell types. and my guess is there's a lot of interesting action going on over here in terms of enhancers, and maybe that will help guide us in. anyway, i'm going to quickly, i'll just say and won't really talk about, we have been doing the same thing now with methylation. we have been taking the dna and studying its chromatin structure, its epigenomic structure, with regard to methylation. and you can do this by, you know, some genes have cpg islands which sometimes could become methylated and turn the genes off. and you can study this by treating the dna with bisulfite, and you
can then shotgun sequence. problem is, it's a lot of dna and so we've come up with, and i'll just mention, some interesting tricks where you can slice out one percent of the genome on a gel that contains just the mspI fragments of a certain size, and since mspI cuts at cpgs, these things are highly enriched for cpg islands and you can assay about 90 percent of the cpg islands in the genome by sequencing about one percent of the genome. and you can pick out those regions that have, for example, become highly methylated in developed cells. i'll mention the following fact, which is, when you begin to measure methylation changes as cells develop, you take embryonic stem cells and you develop them into sox1 positive
cells and then to neural precursor cells andastrocytes. there's a huge change of methylation that occurs here. very unmethylated, hugechange to guys becoming methylated in this change, and then they stay the same past there.this got alex meissner, who did this work, beautiful work, very excited. i mention itbecause alex meissner is also very careful. we now think this is a very interesting artifact.we think that now we look at actual cells from tissue, in vivo tissue as opposed tocells being differentiated in cell culture. we don't see this methylation. in fact, itlooks like there's some very important changes in methylation that occur in cell culturein the same cell types but are not occurring in vivo. and this is of interest because theone place where you do see this methylation
is in cancer. there's something very funnygoing on with regard to methylation. i mention this because there's been some talk aboutusing bisulfite sequencing, and we're very excited about to go describe all this andnow it's very clear. there are some very interesting artifacts and i think at the end will tellus more about cancer than development, with regard to methylation. i mention it. anyway, all right. so those are those things.but those are annotating the genome. what about functional tools? what about the kindof genomic information that's going to shed light on cellular circuitry? i want to takea little bit of time and talk about tools for doing that. not for marking up the genomeanymore with variation or marking up with
conservation or marking it up with chromatinstate maps, although i think all of those things are very important and we’ve gotto keep generating them and getting them out on the web, but the tools for somewhat morehigh throughput biology to explore pathways. and so here, i want to describe work of astudent, piyush gupta, to indicate that even the very sensitive cell biological experimentsof a type that you might not think would yield to genomic approaches, can be made to yieldto genomic approaches. so, i'll describe briefly. piyush gupta, whocame to our lab from bob weinberg's lab, he's a cancer person, piyush is and was extremelyinterested in deploying the tools of rnai screening. so rnai is, of course, a fabuloustechnology for knocking out the gene of your
choice, and with a couple of groups, includingour own, we have built genome wide rnai libraries. you can at least imagine the idea of doinggenome wide screens with rnai’s, to find all the genes that might matter in a process.well, the process piyush cared about was to understand the signaling of the erbb2 receptor.he cared a lot about this problem because he was very interested in breast cancer, andbreast cancer comes in five basic groups, as defined by gene expression patterns. twoof them, these first two have very poor prognosis and we need much better therapies for them.and this first class here is has prominent signaling through the erbb2 receptor, andwe need much better therapies for this class. so piyush said, could i use high throughputrnai screenings as a genomic information tool
to tease apart the pathway? now, here's the problem. this phenotype is very subtle. when you add heregulin to cells, breast cancer cells, they start off clustered next to each other, and when you add heregulin, they move apart a little bit and they get a little spiky, they put out filopodia marked by f-actin, they separate a little bit. you can see it, but imagine trying to screen hundreds of thousands of wells for that phenotype. that's not going to be an easy thing to do, but that's what piyush wanted to do. he wanted to say, use a genomic approach to screen for a very subtle cellular phenotype. and here, happily, we had some colleagues who also think genomically but with regard to image analysis, david sabatini and particularly anne carpenter. so the, i think the image takes a long time to come
up. did i get it? yep, there we go. you cansee the cells here without heregulin, with heregulin, have moved apart a little bit andhave got a little blotchy with f actin. this is not a friendly thing to imagine doinga high throughput screen for, but piyush was an optimist. so he took anne carpenter's softwarethat's very good at detecting all sorts of objects, shapes of cells, cell boundarieshere and other funny things, and used it to analyze lots of images and got all sorts ofdimensions, counting f actin puncta, the nearest neighbor, this is cell shaped metrics, etcetera, et cetera. got all of these different readouts of cells and then went away, beingvery smart and mathematical, and attempted to build a classifier. and after three months,this is the negative control here, he was
unable to do it. then he went back to anne carpenter and said,got any other tricks? and anne said, well, we have been working on something called cellclassifier, and it works like this. cell classifier gives you 50 pictures. with your mouse youdrag the ones that you think are in category a over to the left and the ones that are incategory b over to the right, and it goes off and makes up its own rules. based on itsrules, it gives you 50 more pictures, but this time it’s divided them and said, ithink these are as and these are bs, is that what you mean? and you move around the onesit got wrong. it goes away, gives you back. after a couple hours with cell profiler, it'sdoing a mighty fine job. and in fact, it was
able to accomplish in one such sitting, a pretty good classification of cells as either looking like they had been activated by heregulin or not. anyway, to make a long story short, with this he took a high throughput screen involving about a thousand genes in this case with multiple replicates, many hairpins per whatever, and found a number of established genes, lots of new genes, but most interestingly, they fall into very sensible pathways. three pathways that had been known to be involved in erbb2 signaling come out right away, the pi3 kinase, nf-kappab, jak-stat, and one entirely new pathway, jnk3, not previously known to be involved, and it's an interesting pathway because there are inhibitors that have been developed against
jnk3, but for neurodegeneration; maybe they'll have a use here. in addition, recurrent functions come up in neurite extension, cell migration, ligand-induced receptor endocytosis. the vast majority of those genes sort out nicely into different pathways and provide great sense for it. so i bring this up to say that even when you're talking about subtle cellular phenotypes, the genomic approaches can be quite handy and are quite tractable. and these are the sort of things i, at least, am on record as having advised piyush would be a terrible screen, but in fact, turned out to be quite a reasonable screen and you can get a lot of really good pathways emerging out of that. i'll talk about another kind of way to recognize cellular signatures and i'll just, yeah, refer
to that, which is ways of recognizing cellularsignatures based on gene expression. and i just want to describe what's a beautiful projectthat's been continuing to grow of todd golub and justin lamb at the broad whose idea is,we basically want to take any subtle process we're studying, whether it's a disease, theaction of a drug, the action of a gene and put them all in one common language, one linguafranca, that whatever we're working on, the way to talk about it is what is its effecton perturbing rna expression? and if we were to make a big database of that, we would pickup all sorts of connections by putting it in this common language that we would neverotherwise have seen. and they have demonstrated very beautifullythat one can do this. they have put together
now a database of response signatures to numberof human drugs, a couple hundred human drugs now, against numbers of human cell lines,and their idea is this. for any biological signature you want, take your biological signature,run it against the database, kind of googling it, and out will pop the things that are similarto it. any diseased state, any other state, any gene inhibition, see if there are anydrugs or other perturbations that are similar. just show you examples of this. treat ratswith estrogen. paper in the literature treats rats with estrogen, looks at gene expressionchanges in uterus, take those genes that go up and down in response straight out of thepaper in the literature, run it against this connectivity map database, out pops all theknown estrogen analogs. out pops something
that wasn't known to be an estrogen analog but was proven to be an estrogen analog. if you put in the minus of that signature, down when it should be up, up when it should be down, you get the estrogen inhibition here, you get tamoxifen, you get the selective estrogen receptor modulators. so you can read this stuff right out. a beautiful example is they took the signature of leukemia cells that are sensitive to dexamethasone treatment, some are, and leukemia cells that are not sensitive to dexamethasone treatment, some are not, and you get the differential gene signature. toss it into the database, say, “ever seen a drug that looks like it induces the signature of being sensitive to dexamethasone”? and the database pops back and says, the immunosuppressant rapamycin
does that. and then you say, “wow, i wonder if rapamycin doesn't just induce the signature of sensitivity to dexamethasone, but maybe it will actually make cells sensitive to dexamethasone.” and you do the experiment and it does. but who would have thought of using rapamycin? we're certainly not smart enough but a genomic information database is smart enough, that if you simply ask it the question, it will tell you it's the best fit. and similarly, i'm going to skip through this to simply say, in a screening experiment to find small molecules that could block androgen signaling, todd and his colleagues found these two natural products from these two plants that block androgen signaling. had no idea what they did, but of course, you don't need to know anything, you just toss its signature
into the connectivity map and the connectivitymap replies, “boy, that signature looks an awful lot like hsp90 inhibitors.” eventhough your molecules don't resemble any known hsp90 inhibitors, they clearly must be blockingthat pathway and they have gone on to show it is blocking that pathway.what we need, i would say, is, again, genomic information databases. we need to have signaturesof all the fda-approved drugs, of all the rnais, of all the bioactive compounds freelyavailable on the web. how are we going to get that cheap enough? well, we have begunto realize that if we're going to do lots of this, even doing it on microarrays forgene expression is too expensive, but todd is coming up with ways to do this by sequencingand it may be the new sequencing technologies
make this affordable. oh, well, those are ways of doing cellularcircuitry. i'll briefly mention, because it was referred to by rick this morning, we stillgot to know all the mechanisms of cancer. that's the next thing on the list there. verybriefly, mapping the cancer genome is going to be one of the most important things overthe next several years. these chips that let us track polymorphism in the human population,also lets you track deletions and amplifications in cancers. and this, this has become a veryimportant and active thing. and sequencing, it's already been referred to by rick thatfinding individual mutations like egfr mutations in lung cancer has pointed out that thereare subsets of lung cancer that have a distinct
form of the disease that are responsive toparticular drugs like tarceva and iressa. and so a task force at the nci recommendeda couple of years ago, i got to serve on this task force, that there ought to be a significantcancer genome project and that has morphed into this pilot project, the cancer genomeatlas project that is now underway with groups around the country and i think is increasinglyinvolving groups around the world, as it must. the concerns that have sometimes been expressedabout this are either, we already know all the cancer genes, or cancer is hopelesslycomplicated. i don't think either of those positions is justified by the data. i justput up a list of the 21st century cancer genes that have been discovered in major cancershere, and what's really striking is that virtually
all of them have come out of genomic approaches, not prior candidates, that of the druggable genes in common cancers, all have emerged in the 21st century from genomic approaches. that the genomic approaches have pointed us to new kinds of oncogenes we didn't know before, lineage-specific factors like mitf and titf, translocations in epithelial cancers that used to be thought to be confined to blood cancers. and that this is all, as rick wilson said, from screens that have been highly limited to really phosphatases, kinases, et cetera. and what we really need are unbiased genomic screens of the sort that have been talked about today. what is the future of cancer genomics? it will be, get a tumor, get rna and dna from
the tumor and sequence. sequence what? well, in the first instance, by sequencing in limited ways you can get whole-genome copy number and rearrangement. you can sequence all the exomes, as richard gibbs has referred to. you can sequence from cdna, as rick wilson has referred to. you can make chromatin and methylation maps. and all of that, all told, the bill is less than probably 100 million short reads. and 100 million short reads is not such a big deal anymore, or won't be such a big deal anymore in the next couple of years. this isn't re-sequencing the entire cancer genome. the entire cancer genome is probably 3,000 million short reads, which is still unthinkable for the next 12 to 24 months or so, but probably in the not-so-distant future. nobody will fuss over the first couple
of lines, we'll go to the latter but, you know, those of us who are highly practical say the first four lines, those will be the focus for the next five years, and then it will be more and more focused on probably being able to do the whole genome. anyway, genomic information. there are so many kinds of genomic information. there's, of course, all the sequence in the genome, there's all the genetic variation of the population and its relation to disease. all these functional maps emerging from conservation, from chromatin state. these signature maps like connectivity maps that let you look things up. or these tools like rnai inhibitions and databases that are built of the effects of rnai inhibition. all of the cancer mutations, we're just barely at the starting point to that but i predict
we are going to see an explosion of that overthe next five years or so. i haven't talked about it, but claire fraser has referred verymuch to the genomes of all major infectious organisms and really being able to detailthose as well. for the young people in the audience, thisisn't what biology looked like two decades ago. it really was a world where what youdid on your bench was primarily the data you were looking at. now what you do on your benchis the starting point, but of course, it's comparison to everything out there, all thegenomic information out there in the world is at your disposal. we are by no means done.the human genome project, good start, but there's a lot more still to do. there aremany projects here and there are many more
still to go, and i encourage all of you tobe thinking whenever you do any experiment, ask if i'm going to do it more than threetimes what's the genomic resource that would have been helpful for me to have? it is aremarkable, remarkable period we're living through. it still is very much unclear whereand when it will end. we keep thinking maybe it's going to top off, but i see no sign ofit topping off for quite some time to come. well, i want to close by acknowledging theobvious, which is, this is the work of an extraordinary community. i want to acknowledgemy own colleagues at the broad institute, many of them working in many of these areaswho it's been fabulous to work with them. and, and i can't say enough about what a friendlyand collaborative spirit there is there in
boston amongst mit and harvard scientistsand harvard hospital scientists. but i also want to acknowledge something you often don'tacknowledge, which is the extraordinary role of consortia. so much of what i have talkedabout was not the result of any one lab, not any one institute, not any one city but itwas the result of being willing to put together consortia to get things done. and there hasbeen this floating group of consortia, i just put down some of the ones whose data i havereferred to here. human genome project, snp consortia, rnai consortia, all sorts of consortiathat have emerged over the years, and this has become such a powerful way to do sciencein the age of genomic information. and then lastly, i want to make a specialacknowledgement to the sequencing centers.
over the course of now almost 18 years, the sequencing centers have worked together in all sorts of combinations to help try to bring about this revolution and get data out rapidly, and i think we all feel an enormous bond to each other. i want to acknowledge washu and baylor and tigr and sanger and the joint genome institute, the stanford genome center and others, and i particularly want to acknowledge, because it's a birthday party, nisc, for the extraordinary role it has played in making sure that this genomic revolution and genomic information that is happening all over the world is happening in spades here on the campus of the nih. anyway, this has been a great day, a great birthday party. the great thing about celebrating
a first decade, in this case, is that onecan be sure that the next decade is going to be vastly more exciting. so thanks forthe opportunity to kind of tie it all up today and hats off to everybody here for what they'redoing. happy birthday. dr. francis collins:so, we have time for a couple of questions before we adjourn to a reception. while peopleare finding their way, eric, clearly the ability to generate vast amounts of data is outstripping,i think, most people's expectations, although i suppose it shouldn’t be said that we weren'tsort of warned about this. are we going to keep up in terms of the analysis capabilitiesthat we have to put together to make sense out of all this or are we facing a mismatchin terms of algorithms, in terms of trainees?
are we in trouble or is everything just nicelydovetailed? dr. eric lander:oh, golly. well, i have enormous faith, over the long term, in young people. i think it'sclear that the next generation has already figured out that there's no distinction betweenbeing a wet scientist and a dry scientist. they're all recognizing they're damp. thatthey are, they are both. and we're seeing many more people going into biology now whoconsider it [unintelligible] bioinformatics training and such. so if you say, over thecourse of the next 15 years, will the young people lead us into the promised land by virtueof their understanding this new world, you know, us old generation may not fully enterthat promised land, but the new generation
will, and they understand it. now, will theyall fully show up in full force within the next 24 months to deal with the data, or willthere be this deluge of data beyond what the existing training base is? oh yeah, we're going to be just overburdenedwith tons and tons of data. but that's okay. i mean, you know, we'll, we'll manage to extractthe most interesting things that we see in the data so far, and then as more and morepeople come in more things will be extracted. the thing we've got to do is make sure thatthe training programs are there. we've got to make sure i mean, i hardly need to sayit, because this is something i think nih believes deeply in. but nih is the leaderin training in the world here and we've got
to make sure that essentially everybody goinginto biology, even if they think they're going to be a cell biologist, they need some cellularprocess, understands how to connect to this world, and also that we bring in large, largenumbers of people who have real training in mathematics and computer science, etcetera.so in a 15 year time horizon, i think the whole notion of what it means to be a biologistwill change, and the young people here will solve it. in the short term, well, we're justgoing to do all the paddling we can do to stay afloat. dr. francis collins
