LECTURE 1A: SCALE OF THE GENOME AND TRANSCRIPTOME

SLIDE 1
This is Dietrich Stephan from Children's National Medical Center in Washington, D.C. I am pleased to present some background on the Microarray Consortium sponsored by the NIH's National Institute of Neurological Disorders and Stroke and the National Institute of Mental Health. For a better overview of the entire consortium, please refer to the lecture by Sarah Brautigam, which is the first link under this same series. Just to summarize again, the consortium comprises the Centers of Excellence in expression profiling and employs all of the state-of-the-art expression profiling technologies. The ultimate goal is to give extramural investigators access to these technologies so that translational neuroscience research is furthered.

SLIDE 2
The lecture blocks are broken down into short 10-minute mini-lectures. Each mini-lecture is meant to impart an important concept. The speakers below are all geneticists who are adept at expression profiling and the related issues. We aim in this series to introduce these issues so that you can design and implement experiments with the highest likelihood of success. In addition, special thanks go to Dr. Rita Roy and Dr. Allan Goldstein of the George Washington University for assistance with lecture posting. Finally, this lecture series was funded by the NINDS/NIMH array consortium as well as in part through the NHLBI Programs in Genomic Applications.

SLIDE 3
The first two lecture blocks are required prior to use of consortium services, simply to facilitate the design of streamlined proposals. Subsequent lectures, all of which will be optional, will be added to assist in data analysis and interpretation.

SLIDE 4
The learning objectives for the mini-lecture are illustrated here.
This first lecture is focused on giving a better perspective on the scale of the genome and how it lives and breathes to form the transcriptome in individual cells and tissues. A further objective is to introduce the concepts behind the technologies used to assay the transcriptome in disease states. In subsequent lectures you will hear from a speaker on genes, ESTs, and clusters in the transcriptome and how many genes are known, and then a speaker will come back with genome-wide approaches such as SNPs and chips and the different experimental platforms. The third block of mini-lectures will focus on how to use a standard suite of analysis tools, GeneSpring. That lecture is optional, but very useful. Finally, further lectures will describe in detail data normalization options and more sophisticated analysis tools which are available through the consortium.

SLIDE 5
Let us start off with the scale of the genome. You have heard in many places, I'm sure, that there are about three billion base-pairs in the genome and about 30,000 to 100,000 genes. In addition, you have heard that the average gene is somewhere around 50,000 base-pairs. Well, to bring this into perspective, and also to help others like your grandmother or the students in your classes gain a better perspective on the genome, it equates quite well to the Joy of Cooking, which is shown in parallel on the right-hand side of the slide. The Joy of Cooking is a cookbook, and if you look inside it, the different recipes are arranged into chapters such as soups, cakes, etc. To get the same number of recipes in the Joy of Cooking as genes in the genome, you need about 100 volumes of this cookbook. That is a lot of volumes! This many volumes would be about three billion letters long and have about 50,000 recipes in it. This gives you some idea of the magnitude of the sequencing effort we have recently completed! It also gives you some idea of just how complicated the human body and its blueprint are!
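The arithmetic behind the cookbook analogy can be checked in a few lines. This is only a rough sketch using the round numbers quoted in the lecture (three billion letters, about 50,000 recipes, about 100 volumes); the variable names are illustrative, and it assumes the letters are divided evenly among volumes and recipes.

```python
# Round numbers quoted in the lecture; all are approximations.
GENOME_LETTERS = 3_000_000_000   # base-pairs (letters) in the genome
GENES = 50_000                   # genes (recipes) in the analogy
VOLUMES = 100                    # volumes of the cookbook

# Even split of letters and recipes across volumes (an assumption).
letters_per_volume = GENOME_LETTERS // VOLUMES
recipes_per_volume = GENES // VOLUMES
letters_per_recipe = GENOME_LETTERS // GENES

print(letters_per_volume)   # 30000000
print(recipes_per_volume)   # 500
print(letters_per_recipe)   # 60000
```

The 60,000 letters per recipe that falls out of this division is in line with the average gene size of roughly 50,000 base-pairs mentioned above.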
So you can draw a pretty close analogy between the genome, which is a cookbook to make a human being, and the Joy of Cooking, which is a cookbook to make all sorts of things that you know and recognize like soups, cakes, etc. So, in the genome, the three billion base-pairs are three billion letters. Actually the genome is simpler in one sense: it has only four bases, or letters (A, G, C, T), while our alphabet has 26 letters. You can conceptualize genes as specific ingredients to make a component of the body. So with a recipe in the Joy of Cooking you can make a layer cake that has many things that go together like flour, sugar, milk, eggs, whatever, and those are all formed from letters, and those are different components that together make something you recognize as a cake. Similarly, you can take one gene or a number of genes to make some component of your body which you recognize. For example, there might be a couple thousand genes to make a hair on your head and a different couple thousand genes to make a toenail. Now some of those genes are shared between your toenail and hair, but many are different, and you can recognize them as different.

So what is important in the body is that while all cells contain all genes, only certain genes are expressed in certain tissues. So, hair will only use hair genes and toenails will only use toenail genes. Again, that is not very dissimilar from the Joy of Cooking, where certain recipes are made as appropriate for the meal. You don't make every single recipe in a hundred volumes of the Joy of Cooking for one dinner. You only use certain ones. So what is a transcriptome then? You take your toenail or your hair and look at which genes are being used: out of these 50,000 to 100,000 genes, which ones are being used in your hair, which ones are being used in your toenails, and how are they different? The set of genes being used in a given tissue is its transcriptome. Then we can take that one step further and look at disease or normal development or the process of aging.
So, let's take muscle as an example, muscle tissue. We can define the transcriptome of a baby who isn't walking yet, who has just been born but still needs to use their muscle or their heart, and then look at a five-year-old, a twelve-year-old, and an aging individual. Many of the genes are shared and are the same between individuals of these different ages, but many of them are different too, because muscle in a 90-year-old is not the same as muscle in a newborn. So defining the transcriptome is a key feature of these genome-wide approaches, where we are trying to go through and figure out what happens to each gene, each and every one, within a tissue, both in normal development and in disease states. That helps us understand what causes a disease and particularly how to fix it, because if you can understand which genes are being dysregulated, hopefully you can change them or change the down-stream effects. Expression profiling is the method we use to define the transcriptome in a series of samples. Thus, by definition, expression profiling, as we will describe in a second, is the method that we use to look at the expression of each gene individually in this series of samples.

Now in the past when you wanted to look at expression of genes, you picked one gene at a time and you said, "Okay, let's say in cystic fibrosis there is something wrong with a patient's lungs. Let's look at one gene at a time and try to find the one that is defective and causing all the problems." That would be really tedious and almost completely hit or miss. What insight we have about the major defect in CF was figured out by positional cloning, where they found the cystic fibrosis gene, but that initial gene defect causes a whole down-stream cascade of problems. So, even though a specific ion channel defect initiates cystic fibrosis, patients live for many years without any functional CFTR protein. So, it is the down-stream effects that cause the problems.
Expression profiling is particularly important in figuring out these down-stream effects, because down-stream effects can be very complicated and lots of genes may mediate disease pathogenesis. Lack of CFTR can cause ionic changes and many other compensatory ionic changes. Maybe other ion channels try to over-express or under-express to compensate for the lack of this one particular chloride channel, and then you have opportunistic infection by bacteria or yeast or many other microorganisms in the abnormal lung. And then you have the immune response to those infectious agents. So, you can see that so many things pile on, one after the other, in a lung with cystic fibrosis, and studying one gene at a time can be very time-consuming and tedious. In addition, you are never really sure what is primary or secondary, or where you stand in this whole progression of all these things going on in the lung. So, expression profiling reverses the paradigm. Basically you take all the RNA from a piece of tissue and put it on a small glass surface which contains most, many, or all genes of the genome, and you query the sample to say, "OK, for gene number 2675, which encodes a type of collagen, is that gene going up or down as a function of progression in cystic fibrosis?"

SLIDE 6
So, here I give an example of how we can use expression profiling to specifically look at cystic fibrosis as a model disease. You can substitute any neurologic disease of choice into this paradigm. You can take patients that are all homozygous for the same exact mutation; in this case 70% of alleles in cystic fibrosis are deltaF508. But even if you have a group of patients that have the same exact primary defect, lacking CFTR because of deltaF508, you still find patients that show either fast or slow progression: again, the down-stream effects of a similar mutation.
And those down-stream effects can be influenced by environment and by other genes, maybe a polymorphism in a different chloride channel that might be better or worse at compensating for CFTR. So expression profiling lets us look at that. What we will do in this example project is to take bronchial biopsies of clinically mild patients that are all homozygous for deltaF508, and then take a cohort of clinically severe patients that have the same genetic defect but show different down-stream effects. Then we compare the expression profiles of mild versus severe to define the pathophysiology, the down-stream effectors responsible for disease progression in CF.

So one critical thing here is that we have to ask the question, "Well, I just said we are looking at all genes. Do we really have all genes on a chip now, and if not, how many do we have? What is our sensitivity for testing the entire genome at once?" Let's use the Affymetrix GeneChips as a template for this discussion. We will compare the Affymetrix platform, the cDNA arrays, and the oligonucleotide arrays in the next couple of vignettes. So, we won't go over that now, but will just take what is currently available from Affymetrix as far as genes on a little piece of glass. You can currently buy from them a series of two chips which together have ~45,000 genes on them.

Now here a really important distinction is between what is often termed a full-length gene and what is termed an EST. Full-length genes really have a name, and we have some function tied to the gene and its protein product. ESTs are pieces of RNA from various tissues that have been sequenced, and we know that they have probably come from a gene because they are represented as RNA, but we really don't know what they do. They are anonymous. We will go into this in more detail in the next lecture block: What is a full-length gene? What is an EST? And how do we know what the neuro-gene representation is on these various arrays?
Do we have better ascertainment in brain than we do in lung, or better in muscle than we do in toenail? Essentially our dogma is that even if EVERY single gene in the genome is not on the array we are using, the pathway which is relevant will be represented and will light up. Thus, any of the array platforms which we have available within the consortium has an equal likelihood of success in any experimental system.

LECTURE 1B: CONTENT OF GENE MICROARRAYS

SLIDE 1
In this mini-lecture, we will be talking to you about what is on an expression array. Obviously, there are genes on expression arrays, but how are these genes derived? What tissues do they come from? Dietrich talked to you about the scale of the genome in the last mini-lecture. He told you there are an estimated 50,000 different genes contained within three billion base-pairs of genomic DNA. But how do we say there is a gene sitting right there in the genome?

SLIDE 2
That is going to be the topic for this lecture. Primarily, the way we annotate what is a real gene is through generating what are called expressed sequence tags. These expressed sequence tags really define what we put on the expression arrays. So, it is critical to know how the ESTs are derived and what they are. If you buy an Affymetrix GeneChip, you are going to get ~45,000 human genes on this chip as marketed by Affymetrix. What that breaks down to are ~33,000 full-length genes. These are full-length protein coding sequences. There are also about 13,000 elements that are portions of unknown protein coding sequences. They are expressed sequence tags. So, in this lecture we are going to talk about what an EST is and how it is generated, how it compares to a full-length gene or transcript, and how we know whether we have appropriate brain representation on the chips that we are using.

SLIDE 3
Within every cell are three billion letters of genomic DNA and ~50,000 different genes. On the right, you see a metaphase spread.
These are condensed human chromosomes, and by looking at the slide you cannot say there is a gene sitting right here and there is another one sitting right here. We as humans, and the computers that we use, and the tools that we build are not very good at visualizing what a gene is, what a functional unit is, but the cell is really good at it. That is its job.

SLIDE 4
So, a cell has a way of saying, "On this chromosome, in this position right here, we have a gene, and I'm going to start making an RNA transcript, and thus a protein, from this one." So, in the process of making a protein the first thing that happens is transcription. A primary RNA is generated from this gene. It's an exact copy of the genomic DNA. So, as you see on the upper right panel, this primary RNA transcript has introns and exons just as the genomic DNA has. The introns are eventually spliced out, so the blue introns come out and the exons are joined together. In addition, the three-prime end of the transcript, the portion of the transcript which corresponds to the carboxyl terminus of the resultant protein, is polyadenylated: a stretch of A residues is attached to that three-prime end, and the five-prime end is capped. As soon as the splicing, capping, and polyadenylation occur, that transcript is exported from the nucleus, hooked on to a ribosome, and translated into a protein. So, these two terms, transcription and translation, are really critical when we talk about expression analysis, because expression can relate to both of these. When we talk about gene expression in the context of this lecture series, we are going to be talking strictly about transcription of DNA into RNA and the regulation that is associated with that.

SLIDE 5
So, how again do we as humans say, "There is a gene sitting right there on that chromosome"? The way we do it is through generating expressed sequence tags.
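The processing steps described for Slide 4 (splice out the introns, join the exons, add a poly-A tail) can be sketched in a few lines. This is a toy illustration, not a model of real splicing: the sequence, the exon coordinates, and the function name `process_transcript` are made up for the example, and capping is not modeled.

```python
def process_transcript(primary_rna, exons, poly_a_length=20):
    """Join the exon segments of a primary transcript and add a poly-A tail.

    primary_rna: RNA copy of the gene, introns still included.
    exons: list of (start, end) index pairs, end exclusive (hypothetical
           coordinates; a real cell recognizes splice signals instead).
    """
    spliced = "".join(primary_rna[start:end] for start, end in exons)
    return spliced + "A" * poly_a_length


# Toy primary transcript: exon 1, one intron, exon 2.
exon_1 = "AUGGCC"
intron = "GUAAGUUUCAG"
exon_2 = "GAUUGA"
primary = exon_1 + intron + exon_2

exon_coords = [(0, 6), (17, 23)]
mrna = process_transcript(primary, exon_coords, poly_a_length=10)
print(mrna)
```

The poly-A tail added in the last step is what the oligo-dT primer anneals to when ESTs are generated.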
The way this is done is to extract mRNA from a cell or a tissue type of interest and reverse transcribe it into what is called complementary DNA, or cDNA. This is done by adding an oligo-dT primer, which anneals to the polyadenylated three-prime end of the transcript. You anneal that primer to the cellular mRNA and copy the RNA back into a DNA strand using a viral enzyme called reverse transcriptase. Now that you have a copied or complementary DNA, you degrade the RNA and synthesize a second strand of cDNA, so now you have double-stranded DNA complementary to every single mRNA transcript that you originally had in that cell, and you take that double-stranded DNA and you pop it into a vector. Now you can use a vector primer, one single primer, to sequence a ton of individual clones and figure out what transcripts you initially had in your cell or tissue of interest. So, this is called a cDNA library. You are going to have tens or hundreds of thousands of different cDNA molecules, and you can individually sequence all of those and figure out the transcript complement within your cell or tissue.

What is generally done is that a vector primer is used to sequence the vector insert once. So, single-pass sequencing is done on each insert, and this approximately 500 base-pair sequence becomes a unique sequence which defines the three-prime end, usually, of the transcript. What can now be done is that PCR primers can be developed to that 500 base-pair sequence of the gene and used to map that gene to certain locations on chromosomes. Alternatively, the EST sequence can be aligned to the genome sequence, and wherever it matches is a real functional gene. Even if you don't know the full protein coding sequence, you know there is a real gene sitting right there.

SLIDE 6
All of these EST sequences are deposited into a public access database called dbEST. This database is publicly funded by the National Institutes of Health.
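As a rough sketch of the EST procedure just described, the snippet below copies a toy mRNA into first-strand cDNA and then takes a single-pass read from the three-prime end. It is a deliberate simplification under stated assumptions: the function names are hypothetical, the read is reported in the sense of the mRNA, vector sequence is ignored, and trimming the tail with `rstrip` would also remove any genuine A's at the very end of the transcript.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna))


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'.

    Primed from the poly-A tail, so the strand begins with a run of T's.
    """
    return reverse_complement(mrna.replace("U", "T"))


def three_prime_est(mrna, read_length=500):
    """Single-pass read of up to read_length bases from the 3' end.

    Returned as DNA in the sense of the mRNA, with the poly-A tail trimmed.
    """
    dna = mrna.replace("U", "T").rstrip("A")
    return dna[-read_length:]


toy_mrna = "AUGGCCGAUUGC" + "A" * 10
cdna = first_strand_cdna(toy_mrna)
est = three_prime_est(toy_mrna, read_length=5)
print(cdna)
print(est)
```

With a real transcript the default `read_length=500` corresponds to the approximately 500 base-pair single-pass read mentioned above; the short read length here is only to keep the toy example readable.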
Within the database reside the sequencing trace which was generated by the DNA sequencer and annotations: for example, which tissue the RNA was extracted from, how old the individual was that the RNA was extracted from, what vector the cDNA was cloned into, where it was sequenced, etc. All of this sequence, and there are about 8,000,000 ESTs that reside within dbEST, is from various tissues and cells and individuals and cancers, etc. So, there are millions of ESTs, but there are only an estimated ~50,000 genes. So, there is a high amount of redundancy within the EST database.

SLIDE 7
So, multiple ESTs have been sequenced from identical genes. What we have to do is to reduce these ESTs into unique clusters through multiple sequence alignments. This is fairly straightforward in principle; there are very simple algorithms to do this and there are very sophisticated algorithms to do it. In general, you need pretty sophisticated algorithms to do the alignments, because there are quite a few problems associated with doing such a massive assembly and you want high-fidelity EST clustering data.

SLIDE 8
So, the problems that are associated with this process are these. When you are generating your cDNA library, you may have your oligo-dT primer sitting down on genomic DNA which has contaminated your preparation, and priming off this genomic DNA where there happens to be a polyA stretch. What you are going to get is not a transcript but some intervening sequence of DNA. So, there is a significant amount of genomic DNA contamination in the EST databases. In addition, individual transcripts may have stretches of A's which are internal to the transcript, as opposed to at the three-prime end. So, your oligo-dT primer may sit down in the middle of a transcript, and you may start reverse transcribing five-prime off that primer, and the result may not align with the ESTs generated from the three-prime end of that transcript.
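To make the internal-priming problem concrete, here is a minimal sketch that scans a transcript for runs of A's long enough for an oligo-dT primer to anneal. The threshold of 8 consecutive A's and the function name `oligo_dt_sites` are assumptions chosen for illustration, not values from the lecture.

```python
import re


def oligo_dt_sites(transcript, min_run=8):
    """Return (start, end) for every run of at least min_run A's.

    end is exclusive. A run that ends at len(transcript) is the true
    poly-A tail; any earlier run is a potential internal priming site.
    """
    pattern = re.compile("A{%d,}" % min_run)
    return [match.span() for match in pattern.finditer(transcript)]


# Toy transcript with one internal A-rich stretch and a poly-A tail.
transcript = "GCTTAC" + "A" * 9 + "GGCTCT" + "A" * 12
sites = oligo_dt_sites(transcript)
internal = [site for site in sites if site[1] < len(transcript)]
print(sites)     # [(6, 15), (21, 33)]
print(internal)  # [(6, 15)]
```

A cDNA primed from the internal site would cover only the five-prime part of this transcript, which is why its EST would fail to overlap ESTs primed from the true tail.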
So, internal priming will generate multiple different ESTs for the same transcript. In addition, you might have alternate splicing and polyadenylation on your three-prime end, which may cause different ESTs to be generated from the same transcript as well, and finally there are sequencing errors. A lot of the data that has been deposited hasn't been quality checked. It is simply deposited in an automated fashion. Thus, the quality checking stage has to come during the alignment, and that is why it is critical to have sequencer traces on all of these ESTs.

How do you do high-fidelity EST clustering to figure out a non-redundant set of ESTs which define all ~50,000 genes within the genome? All of the traces are quality checked with an algorithm which will actually look at every single base and say, "There is an associated confidence with this base, and when I'm aligning things and I have a low confidence, I might align two things which don't exactly match at this base." Singleton ESTs, or ESTs that are never replicated within the entire database of several million ESTs, are probably junk and are usually discarded and rarely used to define genes. In addition, internal priming from polyA tracts, as we discussed, results in an overestimation of unique genes, and to a large extent this problem can be obviated by looking at mapping data. Finally, if a number of unique EST clusters map to the identical region of the genome, they may reflect alternate splicing or internal priming.

SLIDE 9
So how do you access this data, this non-redundant set of ESTs which define a unique set of ~50,000 different genes encoded in our human genome? There are two major sites for clustering of ESTs in the database. The first is the NCBI site of the NIH, and this set of clusters is called the UniGene set. The second is at the TIGR website.
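The idea of collapsing redundant ESTs into clusters and discarding singletons can be sketched as follows. This is not the UniGene or TIGR procedure: it is a toy single-linkage grouping that puts two ESTs in the same cluster when they share at least one exact substring of length `k`, and it ignores base quality, sequencing errors, and reverse-complement reads. The function names and the value `k=8` are assumptions for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group ESTs (by index) that share at least one exact k-mer."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq, k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


def drop_singletons(clusters):
    """Keep only clusters supported by more than one EST."""
    return [cluster for cluster in clusters if len(cluster) > 1]


ests = [
    "ACGTACGGTCAAGT",   # overlaps the next EST
    "CGGTCAAGTTTGCA",
    "TTGACCATGGCATC",   # seen only once
    "GGGCCCTTTAAAGG",   # seen only once
]
clusters = cluster_ests(ests)
print(clusters)                   # [[0, 1], [2], [3]]
print(drop_singletons(clusters))  # [[0, 1]]
```

A production clustering pipeline would align reads with tolerance for low-confidence bases, as described above, rather than require exact shared substrings.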
The Institute for Genomic Research (TIGR) has a very robust multiple alignment website where they have actually attempted to build full-length protein coding sequences by looking at ESTs generated from not just the three-prime but also the five-prime ends of genes.

SLIDE 10
So how do you get genes in a computer file? You simply click on this database and download a three-prime EST sequence which is shown to be supported by multiple ESTs. It is redundant, and you can therefore have confidence that it is a real gene, and then you can download the gene. Just be aware that periodic reassemblies occur in both of these databases, so that what you downloaded yesterday or last week may not be in the same contig and may not be identified with the same number as it was yesterday or last week. So, the numbering on your Affymetrix array, which was from build 100, which was six months ago for example, may not be what is currently defined as that gene in the current builds. So, be aware that it is very important to look at that.

You can also buy genes in a tube. There are repositories for every single clone which was generated from every EST sequencing effort. So, if you click on an EST in dbEST and you want that clone to be on an array, you can call Research Genetics, for example, and purchase that clone. For somewhere between one and ten dollars, you can get a gene, a single clone that you can print down on a cDNA array, for example. A major issue with genes in a tube, or physical reagents, is that multiple manipulations have occurred between generating the EST library and having a bank that you can draw on. For example, library clones are picked. They are sequenced. They are rearrayed. They are frozen down. They are replicated, and then they are sent off to the bank where they are perhaps replicated again, etc., and so what you are actually buying may not be what you think you are buying.
So, sequence verification of physical clones is critical, and in fact NIH has supplemented a large number of resequencing efforts for just this reason. All of the reagents within the NINDS/NIMH consortium are sequence verified and of the highest quality available.

SLIDE 11
So, now we have genes in computer files. These are clustered ESTs, as we talked about. We also have genes in tubes. You can use these two reagents to develop two different types of expression profiling platforms. The first is Affymetrix arrays, on the left, and the second is spotted arrays, on the right. So, cDNA arrays utilize clones that are purchased. Affy arrays utilize sequences that are in the database and are electronically parsed and built up on the arrays. Oligonucleotide arrays are somewhere between the Affy and the cDNA arrays, and consist of long synthetic single-stranded DNA strands which are synthesized from the EST sequence and robotically spotted onto the array.

LECTURE 1C: SNPs, CHIPS AND PROFILING PLATFORMS

SLIDE 1
There are typically two different methods or types of technologies that fall under genome-wide approaches: one is called SNPs and one is called chips. There is a third strategy, called proteomics, which we do not discuss here so much because it is still emerging and in its infancy.

SLIDE 2
Here we are just focused on the DNA-based approaches, which are SNPs and chips. Now, SNPs stands for single nucleotide polymorphisms, and these are the variations in DNA between what we would call normal individuals. However, already here there is a big problem in defining normal. For example, somebody might respond more poorly to cigarette smoking than somebody else because of different genetic backgrounds that make them more sensitive or predisposed to a specific problem. That includes cancer or muscle weakness, or almost anything you can imagine where the gradation between normal and abnormal is very vague and becomes vaguer all the time.
SNPs are really important for figuring out so-called complex inheritance, where many different polymorphic variations between you and me, between different individuals, might still be considered normal in the larger scheme of things. Yet they still make us respond differently to the environment or to some stimulus like cigarette smoking. Now SNPs are often categorized into different types, and here I've called them functional SNPs or non-synonymous SNPs. Synonymous SNPs are ones that don't change an amino acid, and as was pointed out in the previous lectures, there is a lot of genomic DNA out there, about three billion base-pairs, and only a relatively small subset codes for genes and proteins. We have lots of introns and lots of extragenic material, and that can contain polymorphisms as well, but it is currently assumed that most of those polymorphisms in non-coding sequence do not cause the differences that you see between two individuals or differences in the response to environment. That could be naive, and in the future we might find that synonymous SNPs, ones that don't change an amino acid in a protein, still cause polymorphic or phenotypic variation. But for now, for this lecture, we are only going to look at what are called functional SNPs, which are those that are in a coding sequence of a gene, and furthermore at non-synonymous SNPs, which are those that actually change an amino acid of that gene's protein. So within the gene, you can have both synonymous SNPs, which don't change an amino acid and probably don't affect the function of the protein, and non-synonymous SNPs, which actually change an amino acid and, because they change the protein, at least have a chance of changing the function of that protein and producing a phenotype. Take the example of cystic fibrosis again. Different patients that have the same CFTR mutations can show different progressions of the disease, and SNPs are likely a cause of some of that variability.
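The distinction between synonymous and non-synonymous SNPs can be illustrated with the standard genetic code. The sketch below builds the standard codon table and classifies a single-base change within one codon; the function name `classify_snp` and the example codons are illustrative choices, not taken from the lecture.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varying fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base change in a coding-strand DNA codon.

    position is 0, 1, or 2 within the codon.
    Returns 'synonymous' or 'non-synonymous'.
    """
    variant = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[variant]:
        return "synonymous"
    return "non-synonymous"


print(classify_snp("GAA", 2, "G"))  # GAA and GAG both encode E: synonymous
print(classify_snp("GAG", 1, "T"))  # GAG encodes E, GTG encodes V: non-synonymous
```

Many third-position changes are synonymous because the code is degenerate, which is why only a subset of coding SNPs alter the protein.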
Infectious agents are another cause, but even susceptibility to infectious agents can be driven by SNPs, differences between individuals, and multiple genes.

What the rest of this lecture series focuses on is chips, or microarrays. GeneChips, chips, microarrays, and spotted arrays are all variations on a theme, which is simply to take a gene or gene sequence and put it at a very tiny defined address on a solid support. The solid support is usually a piece of glass, but it can also be a nitrocellulose filter or other material that holds that gene in place so you know exactly where it is. So chips are microarrays of specific sequence. Each spot on the array has an address, and you know what you put there, so it has a known sequence identity. We use chips primarily for expression profiling. Now a quick aside, as it is marked here in italics: you can also use chips for genotyping of SNPs, which means you can have a SNP chip. A SNP chip can tell apart two alleles at a locus. You can put the two different alleles on a chip and say, "OK, does this person's genomic DNA have this variant of that SNP or that second variant of that SNP?" So, you can get genotyping chips, but again those are not in common usage right now, and that is more of a future development of genome-wide approaches. So, for this lecture series, the one type of genome-wide approach that is really widely being used is expression profiling, and that is what we will focus on here.

So expression profiling is done using microarrays, and there are two commonly used platforms. There are other types of platforms that I'll allude to, but they are not quite as commonly used yet. So, I will focus on these two types. These two types really differ in how you put the genes down and how you make the genes in the specific spots. It goes back to what we mentioned in the previous lecture.
You can get genes as a sequence in a computer file, so you can click over on your computer right now at home, hit these genome databases, and see the genes as written code. So, you can use that as the source for generating a microarray. From this electronic data you can generate oligonucleotide arrays by synthesizing strands of DNA either on the array (Affymetrix) or separately and then spotting them on the array (oligonucleotide arrays). Alternatively, you can get the genes in tubes. You can buy from Research Genetics or other sources a solution of a purified gene. So in that tube is just one particular gene or part of a gene as a sequence, and you can actually take that liquid with the gene and spot it at a specific place on a glass or filter support. Arrays made using the genes in tubes are called cDNA spotted arrays. Now let me quickly contrast just the production of these, how they differ, and then in the next couple of slides I will go over the pros and cons of each of these.

SLIDE 3
So we turn to the left-hand column, the Affymetrix arrays. Affymetrix is a company in California, and there is a link for their website, www.affymetrix.com (http://www.affymetrix.com). What they do is simply use computers to take all the sequences from the database that you want represented on your chip, and then they use computers to design specific 25-mer oligonucleotides. These oligos are actually specifically directed against the RNAs, so they are actually sense and antisense. They keep track of that so they know exactly what you are going to end up with in your sample, which strand of DNA, and how exactly you are going to hybridize that to the spot on your chip. In making these you can see that there is really not much human intervention. It is all computers grabbing the sequences from databases and designing 25-mer oligos, and they design many of them so they can tile oligos across the sequence.
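Tiling 25-mer oligos across a sequence, as described above, can be sketched like this. It is a minimal illustration assuming evenly spaced probes; the function name `tile_probes` is hypothetical, and real probe design also weighs factors such as uniqueness and hybridization behavior that are not modeled here.

```python
def tile_probes(target, probe_length=25, n_probes=20):
    """Pick up to n_probes evenly spaced probe_length-mers across target.

    Returns a list of subsequences of target. If the target is too short
    to hold n_probes distinct start positions, fewer probes are returned.
    """
    last_start = len(target) - probe_length
    if last_start < 0 or n_probes < 1:
        return []
    if n_probes == 1 or last_start == 0:
        return [target[:probe_length]]
    n = min(n_probes, last_start + 1)
    # Integer arithmetic keeps start positions distinct and in range.
    starts = [(i * last_start) // (n - 1) for i in range(n)]
    return [target[s:s + probe_length] for s in starts]


target = "ACGT" * 30            # toy 120-base target sequence
probes = tile_probes(target)    # 20 probes of 25 bases each
print(len(probes), len(probes[0]))
print(probes[0])
print(probes[-1])
```

Because neighboring probes overlap, each region of the target is interrogated more than once, which is the source of the redundancy discussed for this platform.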
So, typically Affymetrix chips have multiple assays for every gene, 20 to 40 oligos for each gene. So, instead of just one spot per gene, you have 20 or 40, and on some chips we've made in this lab at the Research Center for Genetic Medicine, we have 100 different oligos directed against different parts of the same gene. So, it is very redundant. Then you take a piece of tissue, extract and label the RNA, and put it on this array containing hundreds of thousands of oligonucleotides, and the computer keeps track of which oligonucleotide corresponds to which part of which gene. And then you simply look to see which elements of the array light up in response to your sample, which correlates directly with how much RNA of that type is in the sample. We will go over that in a second, but what Affymetrix does is provide a measurement of the absolute level of transcription of each gene in your sample, and it does so in a highly redundant manner. So, it has lots of data to play with and looks for consistency within the many oligos per gene. That is one of the advantages. So the two major advantages are, number one, that you don't use humans very much in this process, and number two, that it has lots of redundancy.

Let's compare that to cDNA spotted arrays. This is where you simply buy microtiter plates containing 100 or 400 different genes per plate, and you use robotics to pick up a little bit of solution, usually a few picoliters, a microscopic drop of DNA solution, and put it at a specific place on a glass slide. The typical spotters or arrayers, as they are called, the robots that can pick up solutions, keep track of them, and put them in specific dots, generally can print anywhere between 10,000 and 100,000 different spots on a regular microscope slide. Now, a scientist has much more control over this, because we can print anything that we want using our arrayer, which we have a couple of feet from here.
So, they are all custom-made, robotically spotted PCR products, and what we are spotting is not a computer database; we are spotting a solution or a cloned EST that we buy from a company or make ourselves. Now, here we generally have one or two spots per gene, and this is important because we don't have the redundancy of the Affymetrix arrays, but we have one big piece that is much bigger than the 25 base-pair oligos in Affymetrix. We can put a whole 1000 base-pair piece of a gene on an array, but we only have one of them. So, if there is cross-hybridization to other genes, we have to be concerned about that. We don't have the redundancy, but we might have a lot more sensitivity because we have a bigger piece that we are looking at. Another big difference between spotted arrays and Affymetrix arrays is that spotted arrays provide a relative level of transcript expression between two RNA samples. You label your control RNA with one dye and your experimental RNA with another, one red and one green, you mix them together, and then you see the ratio of red versus green that is hybridizing to the single spot. So, you are always looking at a ratio of expression. To use the example I gave in the earlier lecture, if we take the severely progressing cystic fibrosis patients' lungs and the mildly progressing cystic fibrosis patients' lungs, we take their samples and isolate RNA, we label one with a red dye and one with a green dye, and then we take equal amounts from those two groups of patients, mix them together, put them on one spotted array, and look at the ratio. If there are equal amounts of red and green on a spot, there is no difference in RNA expression of that gene between the clinical states. So, you can see some differences here, because with Affymetrix you only put one sample on one array, and then you database it and that is it, whereas with spotted arrays you are always getting a ratio.
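The red-versus-green ratio for a single spot can be sketched as follows. This is a minimal illustration, assuming background-corrected intensities and an arbitrary two-fold cutoff for calling a difference; the function names and the cutoff are assumptions, not values given in this lecture.

```python
import math


def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


def call_spot(red, green, fold_threshold=2.0):
    """Classify a spot as 'red', 'green', or 'yellow' (roughly equal).

    fold_threshold is an assumed cutoff chosen for illustration.
    """
    m = log2_ratio(red, green)
    cutoff = math.log2(fold_threshold)
    if m >= cutoff:
        return "red"     # higher in the red-labeled sample
    if m <= -cutoff:
        return "green"   # higher in the green-labeled sample
    return "yellow"      # no clear difference between samples
```

Working on a log scale makes a two-fold increase and a two-fold decrease symmetric around zero, which is why the same cutoff is applied in both directions.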
There are pros and cons to this, which we will go over in the next two slides, and which are also available in table format on the NINDS/NIMH array portal site. SLIDE 4 So, here are just some examples of arrays we have done in the Research Center for Genetic Medicine. On the right is a cDNA array that the Stephan lab printed from mouse brain clones, and on the left is an Affymetrix array; in this case I believe it is human muscle that you are seeing hybridized to a stock array. You can already see in the Affymetrix array, where we have boxed it, one gene represented by 40 different oligonucleotides. In this case, the top row is the perfect match and the lower row is a mismatch. So there are intrinsic controls, and good hybridization signals should only be present on the top row, which in this case looks pretty good. You see a lot of the probes hybridizing to the top row and not to the paired mismatch row on the bottom, which is to be expected if the hybridization is specific. Now, to the right you see a cDNA array, and what you see instead are colors of red, yellow, or green. Whenever you see a red dot, it means that that gene is expressed more highly in the RNA sample that was labeled with red. If you see a green dot, it means that it is more highly expressed in the other sample, the one labeled with the green fluor. If you see yellow, it means they are equally expressed. Now, here you can see that most cDNAs are actually printed as duplicates. We printed the same cDNA in two spots to look for consistency between the hybridizations, and if you look at this cDNA array you see that generally the paired spots are pretty consistent. However, I can see one, for example, at about 10 o'clock, where you see one white spot but you don't see a matched spot. So, that would then be looked at, and probably something is wrong with that spot.
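The visual check for consistent duplicate spots can also be done in software. The sketch below flags genes whose two printed spots disagree; the two-fold tolerance, the data layout, and the function name are all assumptions made for illustration.

```python
def flag_inconsistent_duplicates(pairs, max_fold=2.0):
    """Return the names of genes whose duplicate spots disagree.

    pairs maps a gene name to the intensities of its two printed spots.
    A gene is flagged if either spot is missing (intensity <= 0) or if
    the two intensities differ by more than max_fold (assumed tolerance).
    """
    flagged = []
    for name, (a, b) in pairs.items():
        if a <= 0 or b <= 0:
            flagged.append(name)  # one spot dropped out entirely
        elif max(a, b) / min(a, b) > max_fold:
            flagged.append(name)  # spots are present but disagree
    return flagged


# Made-up intensities for three hypothetical genes.
example_spots = {"gene1": (100, 110), "gene2": (500, 0), "gene3": (50, 400)}
suspect = flag_inconsistent_duplicates(example_spots)
```

Flagged spots are candidates for the kind of manual inspection described above rather than automatic exclusion.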
Affymetrix arrays also have such artifacts, which can occur in the form of bubbles, scratches, or even small hairs, and which need to be checked for with every hybridization. SLIDE 5 Now we continue the comparison between these two commonly used platforms. With Affymetrix arrays you generally get about 20,000 genes on a chip. But remember that these are redundant, so there are roughly 20 to 40 oligonucleotides per gene. So, if you multiply 40 oligonucleotides by 20,000 genes, you end up with hundreds of thousands of oligonucleotides (800,000 at the upper end) on a glass area smaller than a centimeter. You can buy multiple chip sets. Right now you can buy a human chip set of two different arrays, each with ~22,000 genes on it. So, together they assay ~45,000 human genes, and you can just buy these chips from Affymetrix. The cost is a bit high, because if you do just one chip it costs about 600 dollars. If you do duplicate chips per sample, which we often do, or want to do the whole two-chip set in human, the cost can quickly go up to 1000 dollars a sample. The equipment used is a fluidics workstation and a scanner, and that generally costs somewhere around 275,000 dollars, which is about twice what an arrayer and scanner for spotted arrays cost. So overall, Affymetrix arrays are a bit more pricey. Aside from cost, the other disadvantage is that the arrays are difficult to customize. As I pointed out, you buy these arrays from Affymetrix premade, so you are pretty much tied to what Affymetrix sells you. Now, you can make custom chips with Affymetrix. We have made a custom muscle chip, but it was really quite expensive. The cost has since come down dramatically, and there is essentially no set-up masking fee if a large number of arrays is purchased. One of the advantages of the factory-made chips is that there is little variation between chips because they are made in factories.
They are all made by photolithography overlays, all driven by computers, and when you buy your U133A human chip from Affymetrix, you can pretty much assume it is very similar, if not identical, to the next U133A chip that either you buy or somebody over in China or Japan or in California buys. That points to one of the big advantages of Affymetrix, which is that it is transportable. Both the data and the chips can be shared between different labs, because they are the same chip. So it is easy to download profiles from one site and compare them to your own profiles, as long as you know what went onto that chip. This is what we are trying to do with the NINDS/NIMH Microarray Consortium website, where we hope to make all this Affymetrix information accessible to the worldwide scientific community. We are doing the same thing with all spotted array data, but it can be a bit more tricky, as there is more variation between arrays. You can quickly imagine that if everybody is putting easily accessible Affymetrix profiles on the web, everybody can grab everyone else's profiles and do incredibly fancy statistical analyses of hundreds or thousands of profiles. In addition, it is most often the clinical or biological sample that is limiting in number, and this limits the power to detect significant expression correlations. By depositing all the data in one place and in a standard format, we hope to facilitate building larger, more powerful data sets. Now let's turn to cDNA spotted arrays on the right. You can print somewhere between 10,000 and 100,000 cDNAs, or spots, per slide, and there are cDNA pools or arrays available that you can buy from companies for human, mouse, rat, or other species. You can make them yourself if you have the materials to do that. The cost is generally less. It is labor intensive to amplify all the cDNAs and print them, but once this is done you can print a ton of slides, so the cost per slide can be relatively small.
There are some places that claim that they can print slides for only one or two dollars a slide. The relatively low cost makes obtaining many data points more feasible. If you are only paying 7$5 dollars an array for a cDNA array, you can much more easily do 100 different samples. Maybe we want to do a hundred different CFTR bronchial biopsies instead of just five. With Affymetrix, by contrast, the higher cost of the analysis can make the bigger experiments cost prohibitive. Another advantage of spotted arrays is that they are completely customizable. The user defines the genes to be printed. We can just go to our arrayer and decide on a whim which cDNAs we are going to print in that particular print run. So, that is a huge advantage. Maybe you only want to look at 10 genes or 100 genes or 1000 genes, so you don't want to print 50,000 on a slide. That is easy to do with spotted arrays, and you can't do it with Affymetrix. One problem with spotted arrays is that there is so much human intervention involved in amplifying, printing, and processing the arrays that there can be some dropout of spots or genes for technical reasons. Remember that with spotted arrays you are always taking the ratio of two samples. You are taking an experimental RNA sample and a control RNA sample, labeling them with two colors, mixing them together, and looking at the ratio. So, if a specific gene drops out, it doesn't really disturb your analysis so much, because it is going to drop out for both your experimental and your control samples. You will just lose that data point, and you will know that the gene is not there. So, it is not that huge a problem, but it is sort of a pain in the neck. Another problem with cDNA spotted arrays is that you have to keep track of all these genes and all these solutions in all these tubes.
Again, humans are involved in multiple steps, and people get confused. Often what you will find is that the wrong gene ends up in the wrong tube, or you are amplifying the wrong thing. Simply because of the human involvement in maintaining these clone sets and then amplifying and printing them, things are not necessarily what you think they are. And, depending on the clone sets or the organisms you are talking about, it is more or less of a problem. So, whenever you deal with cDNA arrays you really have to be intimately familiar with what is in your tube, or see what the quality control for it was. We provide to you through the consortium a reliable set of cDNA arrays which are all sequence verified. SLIDE 6 Affymetrix arrays have a lot of informatics intrinsic to the Affymetrix software, so you end up doing a lot of your data crunching up front. Before you even see a level for a gene, a so-called absolute call, or a difference between an experimental and a control sample, the Affymetrix software has looked at all the probes, subtracted background, looked at the distribution of your hybridization data over all those different oligonucleotides, looked at the chip as a whole, and generated all sorts of statistics and bioinformatics before it outputs a number and says this gene is expressed at some level, called a signal. Spotted arrays are a bit different. After you generate your ratio, you need to look at a lot of different ratios and figure out what your background is and what your specificity is, and you need to do replicate arrays to get a handle on the statistics: how confident you are in a difference between two samples, or how well you have assayed that gene.
So, a generalization, and it is really a gross generalization, since people can give you lots of pros and cons of Affymetrix and spotted arrays as far as bioinformatics goes, and we will go into those in later lectures, is that Affymetrix arrays tend to do more of the bioinformatics up front, whereas spotted arrays do it a lot more retrospectively: you take your data sets as a whole and superimpose the bioinformatics later, after you have already done all your scanning and gotten your numbers. Another advantage of Affymetrix arrays, as I alluded to earlier, is that inter-experiment and inter-lab comparisons are relatively straightforward, because pretty much everybody is using the same chips. Also, specificity is pretty good because you have redundancy in multiple oligos and control oligos; however, we don't really know what the sensitivity of this assay is. In other words, can Affymetrix, with these tiny 25 base-pair oligos, really be as sensitive in detecting low-level transcription as the much larger 1000 or 2000 base-pair probes that are put on cDNA arrays? It is often assumed that cDNA arrays might be the most sensitive, oligonucleotide arrays a happy medium with their 70-mer oligonucleotides, and Affymetrix arrays the least sensitive. The data on sensitivity aren't really out there in the literature yet to support that conclusion, but it is a good possibility that cDNAs are more sensitive. Just be aware that increased signal may be derived from cross-hybridization to these long targets rather than from increased sensitivity! If we turn to the cDNA array column, inter-experiment comparisons can be difficult unless they are done on exactly the same slides, and even then the robot might not have printed the same array exactly the same way twice. For example, some of the pins that actually print the small amount of DNA can get clogged. So, maybe between slide 267 and 268, a pin gets clogged and suddenly you are missing a spot.
So, again, you can usually see a missing spot, but it just complicates the analysis. The second point is the ability to acquire more data points, because one can afford to do more arrays. This goes back to what I said on the previous slide: because spotted arrays are cheaper, you can run a lot more of them on a lot more samples, and you just get more data and more data points. For example, right now we are doing 50 time points in mouse muscle degeneration and regeneration. That whole process would probably be cheaper to do with spotted arrays, where we could look at many more arrays and many more time points. And that opens up novel data analyses as well. Specificity is in question when closely related genes are studied. To take an example in muscle, there are about 10 different myosin heavy chain genes that differ by only a couple of amino acids and a couple of base-pairs. So if you take a large 1000 base-pair cDNA for one myosin heavy chain gene and put it on a spotted cDNA array, the chances are that the myosin heavy chain RNAs from all 10 different genes will hybridize to that same spot. So, because of the large size of the probe, it can be difficult to differentiate closely related genes, and again, sensitivity is not known but might be better than with Affymetrix arrays.