Speakers and Titles
William Bush, Vanderbilt University
Multifactor Dimensionality Reduction for Detecting Epistasis
In the quest for disease susceptibility genes, the reality of
gene-gene interactions creates difficult challenges for many current
statistical approaches. In an attempt to overcome limitations
with current disease gene detection methods, we have previously
developed the Multifactor Dimensionality Reduction (MDR) approach.
In brief, MDR is a method that reduces the dimensionality of multilocus
information to identify genetic variations associated with an
increased risk of disease. This approach takes multilocus genotypes
and develops a model for defining disease risk by pooling high-risk
genotype combinations into one group and low-risk combinations
into another group. Cross validation and permutation testing are
used to identify optimal models.
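As a rough illustration of the reduction step described above, the following sketch pools two-locus genotype combinations into high- and low-risk groups by their case/control ratio; the data layout, the function name mdr_pool, and the threshold of 1.0 are assumptions made for illustration, not the authors' implementation.

    # Illustrative sketch of the MDR pooling step (not the authors' code).
    # For a chosen pair of SNPs, each two-locus genotype combination is
    # labelled high risk when its case/control ratio exceeds a threshold,
    # collapsing the multilocus genotype space into one binary attribute.
    from collections import Counter

    def mdr_pool(genotypes, status, snp_pair, threshold=1.0):
        """genotypes: tuples of genotype codes (one entry per SNP);
        status: 1 for case, 0 for control; snp_pair: indices of two SNPs."""
        cases, controls = Counter(), Counter()
        for g, s in zip(genotypes, status):
            combo = (g[snp_pair[0]], g[snp_pair[1]])
            (cases if s == 1 else controls)[combo] += 1
        # A combination with cases but no controls is treated as high risk here.
        high_risk = {c for c in set(cases) | set(controls)
                     if cases[c] / max(controls[c], 1) > threshold}
        return [1 if (g[snp_pair[0]], g[snp_pair[1]]) in high_risk else 0
                for g in genotypes]

In the full method this pooled attribute is evaluated within cross-validation, and permutation testing assesses the significance of the best model.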
We simulated data using a variety of epistasis models varying
in allele frequency, heritability, and the number of interacting
loci. We estimated power as the number of times that each method
identified the correct functional SNPs for each model out of a
set of 10 total SNPs. Using simulated data, we show that MDR has
high power to detect interactions in sample sizes of 500 cases
and 500 controls, in datasets with 100, 500, 1000, and 5000 SNPs.
This study provides evidence that MDR is a powerful statistical
approach for detecting gene-gene interactions. In addition, a
parallel implementation of the method allows analysis of much
larger datasets. MDR will continue to emerge as a valuable tool
in the study of the genetics of common, complex disease.
The authors of this work are:
William S. Bush, Todd L. Edwards, and Marylyn D. Ritchie
Center for Human Genetics Research, Vanderbilt University Medical
School, Nashville, TN
Zhengxin Chen, University of Nebraska
at Omaha
Thoughts on Foundations of Data Mining
Abstract: As noted in recent IEEE Foundations of Data Mining
(FDM) workshops, data mining has developed vigorously but under
rather ad hoc and vague concepts, and there is a need to explore
various fundamental issues of data mining. In this talk, we discuss
the reasons for studying FDM, the aspects that need to be addressed,
approaches to FDM, and related issues. The material presented in
this talk is based on publications by researchers working in this
area, as well as the speaker's personal comments and observations
on the recent development of FDM.
Simon Gluzman, Generation 5
Extrapolation and Interpolation with Self-Learning Approximants
Principles of extrapolation and interpolation with self-similar
root and factor approximants will be outlined. Based on the application
of approximants and different self-learning mechanisms, we suggest
three different methods for interpolation and extrapolation.
Several examples are presented that can be solved using 1) unsupervised
iterative learning and 2) unsupervised self-consistent learning.
Finally, 3) a supervised learning technique, "regression on
approximants", is discussed and illustrated.
Wenxue Huang, Milorad Krneta, Limin Lin, and Jianhong Wu, Generation 5
Association Bundle Identification for Categorical Data
Wei-Yin Loh, University of Wisconsin,
Madison
Regression Models You Can See
There are numerous techniques for fitting regression models to
data, ranging from classical multiple linear regression to highly
sophisticated approaches such as spline-based, tree-based, rule-based,
neural network, and ensemble methods. Although it is important
in many applications that a regression model be interpretable,
research in this area is mostly driven by prediction accuracy.
It seems almost a fact that the more sophisticated an algorithm,
the less interpretable its models become. In this talk, we discuss
some basic problems that hinder model interpretation and propose
that the most interpretable model is one that can be visualized
graphically. The challenge is how to build such a model without
unduly sacrificing prediction accuracy. We propose one solution
and compare its prediction accuracy with other
methods.
Robert McCulloch, University of
Chicago
Bayesian Additive Regression Trees
BART is a Bayesian "sum-of-trees" model in which each tree is constrained
by a prior to be a weak learner; fitting and inference are accomplished
via an iterative backfitting MCMC algorithm. This model is motivated
by ensemble methods in general, and boosting algorithms in particular.
Like boosting, each weak learner (i.e., each weak tree) contributes
a small amount to the overall model, and the training of a weak
learner is conditional on the estimates for the other weak learners.
The differences from the boosting algorithm are just as striking
as the similarities: BART is defined by a statistical model (a
prior and a likelihood), while boosting is defined by an algorithm.
MCMC is used both to fit the model and to quantify predictive inference.
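A minimal sketch of the backfitting structure described above follows. Greedy shallow regression trees stand in for the weak learners so that the "fit one tree given the estimates of the others" idea is visible; BART itself draws each tree from its posterior via MCMC, and all names and settings here are assumptions.

    # Sketch of a sum-of-trees model fit by iterative backfitting.
    # BART draws each tree from a posterior (prior + likelihood) via MCMC;
    # here greedy shallow trees merely illustrate the conditional structure.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def backfit_sum_of_trees(X, y, n_trees=50, max_depth=2, n_sweeps=5):
        trees = [DecisionTreeRegressor(max_depth=max_depth) for _ in range(n_trees)]
        contributions = np.zeros((n_trees, len(y)))
        for _ in range(n_sweeps):
            for j, tree in enumerate(trees):
                # Partial residual: what the other trees have not explained.
                residual = y - (contributions.sum(axis=0) - contributions[j])
                tree.fit(X, residual)
                contributions[j] = tree.predict(X)
        return trees

    def predict_sum_of_trees(trees, X):
        # The prediction is the sum of the weak learners' contributions.
        return sum(t.predict(X) for t in trees)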
Abdissa Negassa, Albert Einstein
College of Medicine
of Yeshiva University
Tree-based approaches for censored survival data and model
selection
A brief description of tree-based models for censored survival
data will be provided. The performance of computationally inexpensive
model selection criteria will be discussed. It is shown through
a simulation study that no single model selection criterion exhibits
a uniformly superior performance over a wide range of scenarios.
Therefore, a two-stage approach for model selection is suggested
and shown to perform satisfactorily. Examples of medical data
analysis for developing prognostic classification and subgroup
analysis will be presented.
Vito Quaranta, Vanderbilt Integrative
Cancer Biology Center & Department of Cancer Biology, Vanderbilt
University School of Medicine,
Integrating Multiscale Data for Simulating Cancer Invasion
and Metastasis
Cancer research has undergone radical changes in recent times.
Producing information both at the basic and clinical levels is
no longer the issue. Rather, how to handle this information has
become the major obstacle to progress. Intuitive approaches are
no longer feasible. The next big step will be to implement mathematical
modeling approaches to interrogate the enormous amount of data
being produced, and extract useful answers. Quantitative simulation
of clinically relevant cancer situations, based on experimentally
validated mathematical modeling, provides an opportunity for the
researcher and eventually the clinician to address data and information
in the context of well formulated questions and "what if"
scenarios. At the Vanderbilt Integrative Cancer Biology Center
(http://www.vanderbilt.edu/VICBC/) we are implementing a vision
for a web site that will serve as a cancer simulational hub. To
this end, we are combining the expertise of an interdisciplinary group
of scientists, including experimental biologists, clinical oncologists,
chemical and biological engineers, computational biologists, computer
modelers, theoretical and applied mathematicians and imaging scientists.
Currently, the major focus of our Center is to produce quantitative
computer simulations of cancer invasion at a multiplicity of biological
scales. We have several strategies for data collection and modeling
approaches at each of several scales, including the cellular (10^0
cells), multicellular (<10^2 cells), and tissue level (10^6-10^8
cells).
For the cellular scale, simulation of a single cell moving in
an extracellular matrix field is being parameterized with data
from lamellipodia protrusion, cell speed, and haptotaxis. Some of
these data are being collected in novel bioengineered gadgets.
For the multicellular scale, we have adopted the MCF10A three-dimensional
mammosphere system. Several parameters, including proliferation,
apoptosis, and cell-cell adhesion, are being fed into a mathematical
model that simulates mammosphere morphogenesis and realistically
takes into account cell mechanical properties.
At the tissue level, our hybrid discrete-continuous mathematical
model can predict tumor fingering based on individual cell properties.
Therefore, we are parameterizing the hybrid model with data from
the cellular and multicellular scales and are validating the model
by in vivo imaging of tumor formation.
Sam Roweis, University of Toronto
Automatic Visualization and Classification of High Dimensional
Data
Say I give you a dataset of N points in D dimensions, each point
labelled as belonging to one of K classes. Can you find for me
(up to an irrelevant rotation and isotropic scaling) a projection
matrix A (of size d by D) such that in the projected space, nearby
points tend to have the same label? When d=2 or d=3 this is a
data visualization task; when d=D this task is one of learning
(the square root of) a distance metric for nearest neighbour classification.
For intermediate d, we are finding a low-dimensional projection
of the data that preserves the local separation of classes. This
dimensionality reduction simultaneously reduces the computational
load of a nearest neighbour classifier at test time and learns
a (low-rank) metric on the input space.
In this talk, I will present an algorithm for solving this problem,
i.e., for linearly reducing the dimensionality of high-dimensional data
in a way that preserves the consistency of class labels amongst
neighbours. Results on a variety of datasets show that this method
produces good visualizations and that the performance of nearest
neighbour classification after using our dimensionality reduction
technique consistently exceeds the performance after applying
other linear dimensionality reduction methods such as PCA or LDA.
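For readers who want to experiment with this kind of supervised linear dimensionality reduction, the sketch below uses scikit-learn's NeighborhoodComponentsAnalysis, which learns a linear projection from a closely related neighbour-based objective, followed by nearest-neighbour classification in the projected space; the dataset and settings are arbitrary choices, and this is offered as a comparable off-the-shelf tool rather than as the talk's exact algorithm.

    # Sketch: learn a 2-D linear projection that keeps same-label points
    # close, then classify with nearest neighbours in the projected space.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)   # N points in D=64 dimensions, K=10 classes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = make_pipeline(
        NeighborhoodComponentsAnalysis(n_components=2, random_state=0),  # d=2: visualization
        KNeighborsClassifier(n_neighbors=3),
    )
    model.fit(X_tr, y_tr)
    print("k-NN accuracy in the learned 2-D projection:", model.score(X_te, y_te))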
Tim Swartz, Simon Fraser University
Bayesian Analyses for Dyadic Data
Various dyadic data models are extended and unified in a Bayesian
framework where a single value is elicited to complete the prior
specification. Certain situations which have sometimes been problematic
(e.g. incomplete data, non-standard covariates, missing data,
unbalanced data) are easily handled under the proposed class of
Bayesian
models. Inference is straightforward using Markov chain Monte
Carlo methods and is implemented using WinBUGS software. Examples
are provided which highlight the variety of data sets that can
be entertained and the ease with which they can be analysed.
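Although the talk's models are implemented in WinBUGS, the sketch below writes one simple dyadic random-effects model (sender and receiver effects plus dyadic error) in Python with PyMC, purely to make the setting concrete; the model form, priors, and simulated data are illustrative assumptions and not the class of models discussed in the talk.

    # Hedged sketch of a basic dyadic random-effects model
    #   y_ij = mu + a_i + b_j + eps_ij,
    # fit by MCMC. Illustration only; the talk uses WinBUGS and richer models.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    n = 10                                       # number of actors
    i_idx, j_idx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    keep = i_idx != j_idx                        # drop self-dyads
    i_idx, j_idx = i_idx[keep], j_idx[keep]
    y = rng.normal(0.5, 1.0, size=i_idx.size)    # stand-in dyadic outcomes

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 5.0)
        sigma_a = pm.HalfNormal("sigma_a", 1.0)
        sigma_b = pm.HalfNormal("sigma_b", 1.0)
        sigma_e = pm.HalfNormal("sigma_e", 1.0)
        a = pm.Normal("a", 0.0, sigma_a, shape=n)   # sender effects
        b = pm.Normal("b", 0.0, sigma_b, shape=n)   # receiver effects
        theta = mu + a[i_idx] + b[j_idx]
        pm.Normal("y", mu=theta, sigma=sigma_e, observed=y)
        idata = pm.sample(1000, tune=1000, chains=2)  # posterior draws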
Chris Volinsky, AT&T Labs
Modelling Massive Dynamic Graphs
When studying large transactional networks such as telephone
call detail data, credit card transactions, or web clickstream
data, graphs are a convenient and informative way to represent
data. In these graphs, nodes represent the transactors, and edges
the transactions between them. When these edges have a time stamp,
we have a "dynamic graph" where the edges are born and
die through time. I will present a framework for representing
and analyzing dynamic graphs, with a focus on the massive graphs
found in telecommunications and Internet data. The graph is parameterized
with three parameters, defining an approximation to the massive
graph that allows us to prune noise from it. When compared
to using the entire data set, the approximation actually performs
better for certain predictive loss functions. In this talk I will
demonstrate the application of this model to a telecommunications
fraud problem, where we are looking for patterns in the graph
associated with fraud.
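A hedged sketch of one generic way to maintain such a dynamic graph appears below: each period's transactions update edge weights, older activity decays exponentially, and weak edges are pruned so that the retained graph approximates the full history. The decay factor and pruning threshold here are illustrative parameters, not the three-parameter framework of the talk.

    # Sketch: a dynamic transactional graph with exponential aging and pruning.
    # Parameters are illustrative only, not those of the talk's framework.
    from collections import defaultdict

    class DynamicGraph:
        def __init__(self, decay=0.85, prune_below=0.05):
            self.decay = decay
            self.prune_below = prune_below
            self.weights = defaultdict(float)        # (caller, callee) -> weight

        def step(self, transactions):
            """Advance one period; transactions is an iterable of (src, dst) pairs."""
            for edge in list(self.weights):          # age existing edges
                self.weights[edge] *= self.decay
            for src, dst in transactions:            # add this period's activity
                self.weights[(src, dst)] += 1.0
            for edge, w in list(self.weights.items()):
                if w < self.prune_below:             # edges "die" once too weak
                    del self.weights[edge]

    g = DynamicGraph()
    g.step([("a", "b"), ("a", "b"), ("b", "c")])
    g.step([("a", "c")])
    print(sorted(g.weights.items()))

Exponential decay keeps recent behaviour prominent while letting stale edges die out, which matches the birth-and-death view of edges described above.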
Gregory Warnes, Global Research and
Development, PFIZER INC.
Data Mining Opportunities in Pharmaceutical Research
The research activities of the pharmaceutical industry are incredibly
varied and span most areas of chemical and biological research.
Many of these activities are, in essence, data mining problems,
and could benefit substantially from the application of data mining
approaches. In this talk, I will provide an overview of these
research questions and will highlight specific problems where
researchers in data mining can make a substantial impact.
Yale Zhang, Dofasco Inc., Process
Automation
Data Mining in the Steel Industry
The global steel industry today is facing significant pressure
to achieve operational excellence and improve profitability. Large
investments have been made in upgrading instrumentation
and data infrastructures. The expectation of these efforts is
to have the ability to collect, store and analyze industrial data,
which can be used to maximize knowledge discovery for better business/process
decision-making support, and enhance process modeling capability
for performance improvement of process monitoring and control
systems. Unfortunately, new obstacles have emerged making it difficult
to discover valuable knowledge resident in industrial data. These
obstacles include, but are not limited to, huge data volumes, uncertain
data quality, and misalignment between process and quality data.
Using traditional data analysis and modeling tools to analyze
such industrial data can be very time consuming, and even generate
misleading results. Because of these obstacles, much of the industrial
data are in fact not used, or at least, not used efficiently,
which leads to an undesirable situation - data rich, but information
poor. This means a significant amount of valuable process knowledge
is lost, diminishing the return on data infrastructure investments.
In this presentation, the above obstacles in the steel industry
are discussed. Two industrial examples, online monitoring of steel
casting processes for breakout prevention and analysis of steel
surface inspection data, along with future opportunities for data
mining in the steel industry, are presented.
Djamel Zighed, University Lumiere
Lyon 2
Some Enhancements in Decision Trees
In decision tree induction, we basically build a set of successive
partitions of the training set. At each step, we assess the quality
of the generated partition according to specific criteria. Most of
the criteria used are based on information measures such as Shannon's
entropy or Gini's index. From a theoretical point of view, these
criteria are defined on a probability distribution which is generally
unknown. Thus, the probabilities are estimated at each node by the
maximum likelihood estimator, i.e., the observed frequencies. The
sample size at each node decreases as the tree grows, so the probability
estimates become less reliable at the deeper levels of the tree. In
the presentation, we will discuss this issue and propose some more
suitable information measures that are sensitive to the sample size.
We will also discuss some strategies for generating partitions which
lead to a decision lattice instead of a decision tree. Some applications
will be shown.
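To make the estimation issue concrete, the sketch below contrasts the usual plug-in (frequency-based) Shannon entropy and Gini index with a Laplace-smoothed variant whose value depends on the node's sample size; the smoothed measure is only a generic illustration of sample-size sensitivity, not the specific measures proposed in the talk.

    # Split criteria estimated from class counts at a tree node.
    # The plug-in measures ignore how few examples reach the node; the
    # Laplace-smoothed variant shrinks toward uniform as the node shrinks.
    import math

    def shannon_entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def gini_index(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def smoothed_entropy(counts, alpha=1.0):
        # Small nodes are pulled toward the uniform (maximum-entropy) value.
        n, k = sum(counts), len(counts)
        probs = [(c + alpha) / (n + alpha * k) for c in counts]
        return -sum(p * math.log2(p) for p in probs)

    # The same 9:1 split looks equally "pure" to the plug-in measures whether
    # the node holds 10 or 1000 examples; the smoothed measure distinguishes them.
    for counts in [(9, 1), (900, 100)]:
        print(counts, round(shannon_entropy(counts), 3),
              round(gini_index(counts), 3), round(smoothed_entropy(counts), 3))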