Yale Microarray Database
(YMD)

Janet Hager, Ph.D.
300 George Street
Room 2115
New Haven CT, 06511

Kei-Hoi Cheung, Ph.D.
333 Cedar Street
PO Box 208009
New Haven, CT 06520

 

Yale University School of Medicine

 



David B. Allison, Gary L. Gadbury, Moonseong Heo, José R. Fernández, Cheol-Koo Lee, Tomas A. Prolla, Richard Weindruch. (2002)
A Mixture Model Approach for the Analysis of Microarray Gene Expression Data. Computational Statistics & Data Analysis (in press).

For each gene, we often wish to answer the question "Is the true difference in expression between two or more groups statistically significant?" By "true difference" we mean an average difference in a gene's expression level between the populations from which the sample groups are drawn. However, there are several other important and interesting questions, including: Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups? What is the best estimate of the number of genes for which there is a true difference in expression? What is the confidence interval around that estimate? If we set some threshold above which we declare results for a particular gene 'interesting' and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression, and what proportion are likely to be false leads? What proportion of the genes not declared 'interesting' are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)?

We have developed procedures that allow formal answers to the questions posed above. These procedures are based on the idea that when many statistical tests are conducted, one obtains a distribution of test statistics and corresponding p-values, and that this distribution contains exploitable information. If there is no mean difference in expression between two groups for any gene, the p-values should appear uniformly distributed on the interval [0,1], provided the p-values are independent. In contrast, if there is a true mean difference for some genes, the p-values will tend to cluster closer to zero than to one. By fitting a mixture of a uniform distribution and one or more beta distributions (the beta is just a very flexible distribution used for convenience) to the observed collection of p-values, we can determine whether a mixture model that includes components beyond a uniform distribution fits the data significantly better than a single uniform distribution. If it does, we can answer "yes" to the question "Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups?" and proceed to estimate the number of such genes. Significance testing and construction of confidence intervals are performed via non-parametric bootstrap or simulation procedures. The effect of non-independence of p-values can be assessed via the same procedures, primarily via simulation studies. Once the model is fitted, we can use it to answer additional questions such as "For each gene, given the p-value computed for that gene, what is the probability that there is a true difference in its expression?" The answer is a posterior probability computed using Bayes' rule. Again, by "true difference" we mean a difference at the population level, as opposed to an observed difference in a sample that could be due to either measurement error or sampling variability.
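As a rough illustration of the idea, the following Python sketch fits a two-component mixture (one uniform and one beta component) to a collection of p-values by maximum likelihood, then applies Bayes' rule to obtain per-gene posterior probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the single beta component, the simple compass search used as the optimizer, and the simulated p-values are all hypothetical choices made for this example, and the bootstrap steps for significance testing and confidence intervals are omitted. It uses only the Python standard library.

```python
import math
import random

EPS = 1e-12  # keeps p-values strictly inside (0, 1) so the logs are finite


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p."""
    p = min(max(p, EPS), 1.0 - EPS)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))


def log_likelihood(pvals, lam0, a, b):
    """Log-likelihood of lam0 * Uniform(0, 1) + (1 - lam0) * Beta(a, b).

    A single uniform distribution has log-likelihood exactly 0, so a positive
    value means the mixture describes the p-values better than the uniform.
    """
    return sum(math.log(lam0 + (1.0 - lam0) * beta_pdf(p, a, b)) for p in pvals)


def fit_mixture(pvals, tol=1e-3):
    """Maximize the likelihood with a simple compass search (assumed optimizer).

    Parameters are searched on an unconstrained scale:
    theta = [logit(lam0), log(a), log(b)], each clamped to [-5, 5].
    """
    def unpack(theta):
        lam0 = 1.0 / (1.0 + math.exp(-theta[0]))
        return lam0, math.exp(theta[1]), math.exp(theta[2])

    def objective(theta):
        return log_likelihood(pvals, *unpack(theta))

    theta = [0.0, math.log(0.5), math.log(2.0)]  # arbitrary starting point
    best = objective(theta)
    step = 0.5
    for _ in range(5000):  # hard cap so the sketch always terminates
        if step <= tol:
            break
        improved = False
        for i in range(3):
            for delta in (step, -step):
                trial = list(theta)
                trial[i] = min(max(trial[i] + delta, -5.0), 5.0)
                value = objective(trial)
                if value > best:
                    theta, best, improved = trial, value, True
        if not improved:
            step /= 2.0  # no move helped: refine the step size
    lam0, a, b = unpack(theta)
    return lam0, a, b, best


def posterior_true_difference(p, lam0, a, b):
    """Bayes' rule: P(gene belongs to the beta component | its p-value)."""
    alt = (1.0 - lam0) * beta_pdf(p, a, b)
    return alt / (lam0 + alt)


def threshold_summary(pvals, lam0, a, b, threshold):
    """Average posterior among genes declared 'interesting' (p <= threshold)
    and among the remaining genes (an estimate of the proportion of misses)."""
    post = [(p, posterior_true_difference(p, lam0, a, b)) for p in pvals]
    hits = [q for p, q in post if p <= threshold]
    rest = [q for p, q in post if p > threshold]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(hits), mean(rest)


def simulate_pvalues(n_null, n_alt, seed=0):
    """Hypothetical data: uniform p-values for genes with no true difference,
    Beta(0.5, 5) p-values for genes with a true difference."""
    rng = random.Random(seed)
    return ([rng.random() for _ in range(n_null)]
            + [rng.betavariate(0.5, 5.0) for _ in range(n_alt)])


if __name__ == "__main__":
    pvals = simulate_pvalues(800, 200, seed=1)
    lam0, a, b, loglik = fit_mixture(pvals)
    print("estimated number of genes with a true difference:",
          round((1.0 - lam0) * len(pvals)))
    print("log-likelihood gain over a single uniform:", loglik)
    print("posterior at p = 0.001:", posterior_true_difference(0.001, lam0, a, b))
    print("true among hits, misses among the rest (threshold 0.05):",
          threshold_summary(pvals, lam0, a, b, 0.05))
```

In this sketch the estimated number of genes with a true difference is the fitted weight of the beta component times the number of genes, and the threshold questions are answered by averaging posterior probabilities on each side of the cutoff. Whether the gain in log-likelihood is statistically significant would still have to be judged by the bootstrap or simulation procedures described above.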



Copyright © 2003, Yale University, New Haven, Connecticut, USA. All rights reserved.

Last modified: 28-Mar-2003 (GB)