David B. Allison, Gary L. Gadbury, Moonseong Heo, José R. Fernández,
Cheol-Koo Lee, Tomas A. Prolla, Richard Weindruch. (2002)
A Mixture Model Approach for the Analysis of Microarray Gene
Expression Data. Computational Statistics & Data Analysis (in press).
For each gene, we often wish to answer the question "Is the true
difference in expression between two or more groups statistically significant?"
By "true difference" we mean an average difference in a gene's expression level
between the populations from which the sample groups are drawn. However,
there are several other important and interesting questions, including: Is
there statistically significant evidence that any of the genes
under study exhibit a difference in expression across the groups? What is the
best estimate of the number of genes for which there is a true difference in
expression? What is the confidence interval around that estimate? If we set some
threshold above which we declare results for a particular gene 'interesting' and
worthy of follow-up study, what proportion of those genes are likely to be genes
for which there is a real difference in expression and what proportion are
likely to be false leads? What proportion of those genes not declared
'interesting' are likely to be genes for which there is a real difference in
expression (i.e., misses or false negatives)?
We have developed procedures that allow formal answers to the questions proposed
above. This set of procedures is based on the idea that when many statistical
tests are conducted, one obtains a distribution of test statistics and
corresponding p-values and that there is exploitable information available in
this distribution. If there is no mean difference in expression between two
groups for any gene, p-values should appear uniformly distributed on the
interval [0,1], as long as the p-values are considered independent. In contrast,
if there exists a true mean difference for some genes, the distribution of p-values
will tend to cluster closer to zero than to one. By fitting a mixture of a
uniform distribution and one or more beta distributions (the beta is just a very
flexible distribution used for convenience) to the observed collection of p-values,
we can determine if a mixture model that includes components beyond a uniform
distribution fits the data significantly better than a single uniform
distribution. If this is indeed the case, we can answer "yes" to the question
"Is there statistically significant evidence that any of the genes under study
exhibit a difference in expression across the groups?", and proceed to estimate
the number of such genes. Significance testing and construction of confidence
intervals are performed via non-parametric bootstrap or simulation procedures.
The effect of non-independence of p-values can be assessed via the same
procedures, primarily via simulation studies. Once the model is fitted, we can
use it to answer additional questions such as, "for each gene and for the p-value
that was computed for that gene, what is the probability that the gene is a gene
for which there is a true difference in gene expression?" The answer is a
posterior probability, computed using Bayes' rule.
Again, by "true difference" we mean a difference at the population level as
opposed to simply an observed difference in a sample that could be due to either
measurement error or sampling variability.
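The procedure described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation: it assumes a single beta component, uses a crude grid search in place of a proper maximum likelihood routine, and runs on simulated p-values. The names lam0 (the proportion of genes with no true difference), a and b (the beta parameters), and all function names are hypothetical choices made for this example only. The bootstrap step used for significance testing and confidence intervals is not shown.

```python
import math
import random

EPS = 1e-12  # keeps p-values strictly inside (0, 1) so the logs stay finite


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p."""
    p = min(max(p, EPS), 1.0 - EPS)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))


def mixture_loglik(pvals, lam0, a, b):
    """Log-likelihood of lam0 * Uniform(0, 1) + (1 - lam0) * Beta(a, b)."""
    return sum(math.log(lam0 + (1.0 - lam0) * beta_pdf(p, a, b)) for p in pvals)


def fit_mixture(pvals):
    """Grid-search fit (illustrative only). Returns (loglik, lam0, a, b).

    The grid includes lam0 = 1.0, i.e. the single uniform distribution,
    whose log-likelihood is exactly 0, so the returned value is also the
    log-likelihood gain of the mixture over the uniform-only model.
    """
    lam_grid = [i / 20 for i in range(1, 21)]  # 0.05, 0.10, ..., 1.00
    a_grid = [0.25, 0.5, 0.75]                 # assumed search range
    b_grid = [2.0, 4.0, 8.0]                   # assumed search range
    return max(
        (mixture_loglik(pvals, lam, a, b), lam, a, b)
        for lam in lam_grid for a in a_grid for b in b_grid
    )


def posterior_true_difference(p, lam0, a, b):
    """Bayes' rule: probability that a gene with p-value p has a true difference."""
    alt = (1.0 - lam0) * beta_pdf(p, a, b)
    return alt / (lam0 + alt)


# Simulated data (an assumption for illustration): 400 genes with no
# difference (uniform p-values) and 100 genes whose p-values cluster near zero.
random.seed(0)
pvals = [random.random() for _ in range(400)]
pvals += [random.betavariate(0.5, 4.0) for _ in range(100)]

loglik, lam0, a, b = fit_mixture(pvals)
print("log-likelihood gain over a single uniform:", round(loglik, 2))
print("estimated number of genes with a true difference:",
      round((1.0 - lam0) * len(pvals)))
print("posterior probability at p = 0.001:",
      round(posterior_true_difference(0.001, lam0, a, b), 3))
```

In this sketch the log-likelihood gain is only a descriptive quantity. As noted above, judging whether the gain is statistically significant would require refitting on bootstrap or simulated samples rather than relying on a textbook reference distribution.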