Speakers and Titles
William Bush, Vanderbilt University
Multifactor Dimensionality Reduction for Detecting Epistasis
In the quest for disease susceptibility genes, the reality of
gene-gene interactions creates difficult challenges for many current
statistical approaches. In an attempt to overcome limitations
with current disease gene detection methods, we have previously
developed the Multifactor Dimensionality Reduction (MDR) approach.
In brief, MDR is a method that reduces the dimensionality of multilocus
information to identify genetic variations associated with an
increased risk of disease. This approach takes multilocus genotypes
and develops a model for defining disease risk by pooling high-risk
genotype combinations into one group and low-risk combinations
into another group. Cross validation and permutation testing are
used to identify optimal models.
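As a rough illustration of the reduction step described above, the following sketch pools two-locus genotype combinations into high- and low-risk groups by their case/control ratio; the data layout, the function name mdr_pool, and the threshold of 1.0 are assumptions made for illustration, not the authors' implementation.

    # Illustrative sketch of the MDR pooling step (not the authors' code).
    # For a chosen pair of SNPs, each two-locus genotype combination is
    # labelled high risk when its case/control ratio exceeds a threshold,
    # collapsing the multilocus genotype space into one binary attribute.
    from collections import Counter

    def mdr_pool(genotypes, status, snp_pair, threshold=1.0):
        """genotypes: tuples of genotype codes (one entry per SNP);
        status: 1 for case, 0 for control; snp_pair: indices of two SNPs."""
        cases, controls = Counter(), Counter()
        for g, s in zip(genotypes, status):
            combo = (g[snp_pair[0]], g[snp_pair[1]])
            (cases if s == 1 else controls)[combo] += 1
        # A combination with cases but no controls is treated as high risk here.
        high_risk = {c for c in set(cases) | set(controls)
                     if cases[c] / max(controls[c], 1) > threshold}
        return [1 if (g[snp_pair[0]], g[snp_pair[1]]) in high_risk else 0
                for g in genotypes]

In the full method this pooled attribute is evaluated within cross-validation, and permutation testing assesses the significance of the best model.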
We simulated data using a variety of epistasis models varying
in allele frequency, heritability, and the number of interacting
loci. We estimated power as the number of times that each method
identified the correct functional SNPs for each model out of a
set of 10 total SNPs. Using simulated data, we show that MDR has
high power to detect interactions in sample sizes of 500 cases
and 500 controls, in datasets with 100, 500, 1000, and 5000 SNPs.
This study provides evidence that MDR is a powerful statistical
approach for detecting gene-gene interactions. In addition, a
parallel implementation of the method allows analysis of much
larger datasets. MDR will continue to emerge as a valuable tool
in the study of the genetics of common, complex disease.
The authors of this work are:
William S. Bush, Todd L. Edwards, and Marylyn D. Ritchie
Center for Human Genetics Research, Vanderbilt University Medical
School, Nashville, TN
Zhengxin Chen, University of Nebraska
at Omaha
Thoughts on Foundations of Data Mining
Abstract: As noted in recent IEEE Foundations of Data Mining
(FDM) workshops, data mining has developed vigorously but under
rather ad hoc and vague concepts, and there is a need to explore
various fundamental issues of data mining. In this talk, we discuss
the reasons for studying FDM, the aspects that need to be addressed,
approaches to FDM, and related issues. The material presented in
this talk is based on publications by researchers working in this
area, as well as the speaker's personal comments and observations
on the recent development of FDM.
Simon Gluzman, Generation 5
Extrapolation and Interpolation with Self-Learning Approximants
Principles of extrapolation and interpolation with self-similar
root and factor approximants will be outlined. Based on the application
of approximants and different self-learning mechanisms, we suggest
three different methods for interpolation and extrapolation.
Several examples are presented that can be solved using 1) unsupervised
iterative learning and 2) unsupervised self-consistent learning.
Finally, 3) a supervised learning technique, "regression on
approximants", is discussed and illustrated.
Wenxue Huang, Milorad Krneta, Limin Lin, and Jianhong Wu, Generation 5
Association Bundle Identification for Categorical Data
Wei-Yin Loh, University of Wisconsin,
Madison
Regression Models You Can See
There are numerous techniques for fitting regression models to
data, ranging from classical multiple linear regression to highly
sophisticated approaches such as spline-based, tree-based, rule-based,
neural network, and ensemble methods. Although it is important
in many applications that a regression model be interpretable,
research in this area is mostly driven by prediction accuracy.
It seems almost a fact that the more sophisticated an algorithm,
the less interpretable its models become. In this talk, we discuss
some basic problems that hinder model interpretation and propose
that the most interpretable model is one that can be visualized
graphically. The challenge is how to build such a model without
unduly sacrificing prediction accuracy. We propose one solution
and compare its prediction accuracy with other
methods.
Robert McCulloch, University of
Chicago
Bayesian Additive Regression Trees
BART is a Bayesian "sum-of-trees" model in which each tree is constrained
by a prior to be a weak learner; fitting and inference are accomplished
via an iterative backfitting MCMC algorithm. This model is motivated
by ensemble methods in general, and boosting algorithms in particular.
Like boosting, each weak learner (i.e., each weak tree) contributes
a small amount to the overall model, and the training of a weak
learner is conditional on the estimates for the other weak learners.
The differences from the boosting algorithm are just as striking
as the similarities: BART is defined by a statistical model (a
prior and a likelihood), while boosting is defined by an algorithm.
MCMC is used both to fit the model and to quantify predictive inference.
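A minimal sketch of the backfitting structure described above follows. Greedy shallow regression trees stand in for the weak learners so that the "fit one tree given the estimates of the others" idea is visible; BART itself draws each tree from its posterior via MCMC, and all names and settings here are assumptions.

    # Sketch of a sum-of-trees model fit by iterative backfitting.
    # BART draws each tree from a posterior (prior + likelihood) via MCMC;
    # here greedy shallow trees merely illustrate the conditional structure.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def backfit_sum_of_trees(X, y, n_trees=50, max_depth=2, n_sweeps=5):
        trees = [DecisionTreeRegressor(max_depth=max_depth) for _ in range(n_trees)]
        contributions = np.zeros((n_trees, len(y)))
        for _ in range(n_sweeps):
            for j, tree in enumerate(trees):
                # Partial residual: what the other trees have not explained.
                residual = y - (contributions.sum(axis=0) - contributions[j])
                tree.fit(X, residual)
                contributions[j] = tree.predict(X)
        return trees

    def predict_sum_of_trees(trees, X):
        # The prediction is the sum of the weak learners' contributions.
        return sum(t.predict(X) for t in trees)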
Abdissa Negassa, Albert Einstein
College of Medicine
of Yeshiva University
Tree-based approaches for censored survival data and model
selection
A brief description of tree-based models for censored survival
data will be provided. The performance of computationally inexpensive
model selection criteria will be discussed. It is shown through
a simulation study that no single model selection criterion exhibits
a uniformly superior performance over a wide range of scenarios.
Therefore, a two-stage approach for model selection is suggested
and shown to perform satisfactorily. Examples of medical data
analysis for developing prognostic classification and subgroup
analysis will be presented.
Vito Quaranta, Vanderbilt Integrative
Cancer Biology Center & Department of Cancer Biology, Vanderbilt
University School of Medicine,
Integrating Multiscale Data for Simulating Cancer Invasion
and Metastasis
Cancer research has undergone radical changes in recent times.
Producing information both at the basic and clinical levels is
no longer the issue. Rather, how to handle this information has
become the major obstacle to progress. Intuitive approaches are
no longer feasible. The next big step will be to implement mathematical
modeling approaches to interrogate the enormous amount of data
being produced, and extract useful answers. Quantitative simulation
of clinically relevant cancer situations, based on experimentally
validated mathematical modeling, provides an opportunity for the
researcher and eventually the clinician to address data and information
in the context of well formulated questions and "what if"
scenarios. At the Vanderbilt Integrative Cancer Biology Center
(http://www.vanderbilt.edu/VICBC/) we are implementing a vision
for a web site that will serve as a cancer simulational hub. To
this end, we are combining the expertise of an interdisciplinary group
of scientists, including experimental biologists, clinical oncologists,
chemical and biological engineers, computational biologists, computer
modelers, theoretical and applied mathematicians and imaging scientists.
Currently, the major focus of our Center is to produce quantitative
computer simulations of cancer invasion at a multiplicity of biological
scales. We have several strategies for data collection and modeling
approaches at each of several scales, including the cellular (10^0
cells), multicellular (<10^2 cells), and tissue level (10^6-10^8
cells).
For the cellular scale, simulation of a single cell moving in
an extracellular matrix field is being parameterized with data
from lamellipodia protrusion, cell speed, and haptotaxis. Some of
these data are being collected in novel bioengineered gadgets.
For the multicellular scale, we have adopted the MCF10A three-dimensional
mammosphere system. Several parameters, including proliferation,
apoptosis, and cell-cell adhesion, are being fed into a mathematical
model that simulates mammosphere morphogenesis and realistically
takes into account cell mechanical properties.
At the tissue level, our hybrid discrete-continuous mathematical
model can predict tumor fingering based on individual cell properties.
Therefore, we are parameterizing the hybrid model with data from
the cellular and multicellular scales and are validating the model
by in vivo imaging of tumor formation.
Sam Roweis, University of Toronto
Automatic Visualization and Classification of High Dimensional
Data
Say I give you a dataset of N points in D dimensions, each point
labelled as belonging to one of K classes. Can you find for me
(up to an irrelevant rotation and isotropic scaling) a projection
matrix A (of size d by D) such that in the projected space, nearby
points tend to have the same label? When d=2 or d=3 this is a
data visualization task; when d=D this task is one of learning
(the square root of) a distance metric for nearest neighbour classification.
For intermediate d, we are finding a low-dimensional projection
of the data that preserves the local separation of classes. This
dimensionality reduction simultaneously reduces the computational
load of a nearest neighbour classifier at test time and learns
a (low-rank) metric on the input space.
In this talk, I will present an algorithm for solving this problem,
i.e., for linearly reducing the dimensionality of high-dimensional data
in a way that preserves the consistency of class labels amongst
neighbours. Results on a variety of datasets show that this method
produces good visualizations and that the performance of nearest
neighbour classification after using our dimensionality reduction
technique consistently exceeds the performance after applying
other linear dimensionality reduction methods such as PCA or LDA.
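For readers who want to experiment with this kind of supervised linear dimensionality reduction, the sketch below uses scikit-learn's NeighborhoodComponentsAnalysis, which learns a linear projection from a closely related neighbour-based objective, followed by nearest-neighbour classification in the projected space; the dataset and settings are arbitrary choices, and this is offered as a comparable off-the-shelf tool rather than as the talk's exact algorithm.

    # Sketch: learn a 2-D linear projection that keeps same-label points
    # close, then classify with nearest neighbours in the projected space.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)   # N points in D=64 dimensions, K=10 classes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = make_pipeline(
        NeighborhoodComponentsAnalysis(n_components=2, random_state=0),  # d=2: visualization
        KNeighborsClassifier(n_neighbors=3),
    )
    model.fit(X_tr, y_tr)
    print("k-NN accuracy in the learned 2-D projection:", model.score(X_te, y_te))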
Tim Swartz, Simon Fraser University
Bayesian Analyses for Dyadic Data
Various dyadic data models are extended and unified in a Bayesian
framework where a single value is elicited to complete the prior
specification. Certain situations which have sometimes been problematic
(e.g. incomplete data, non-standard covariates, missing data,
unbalanced data) are easily handled under the proposed class of
Bayesian
models. Inference is straightforward using Markov chain Monte
Carlo methods and is implemented using WinBUGS software. Examples
are provided which highlight the variety of data sets that can
be entertained and the ease with which they can be analysed.
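Although the talk's models are implemented in WinBUGS, the sketch below writes one simple dyadic random-effects model (sender and receiver effects plus dyadic error) in Python with PyMC, purely to make the setting concrete; the model form, priors, and simulated data are illustrative assumptions and not the class of models discussed in the talk.

    # Hedged sketch of a basic dyadic random-effects model
    #   y_ij = mu + a_i + b_j + eps_ij,
    # fit by MCMC. Illustration only; the talk uses WinBUGS and richer models.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    n = 10                                       # number of actors
    i_idx, j_idx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    keep = i_idx != j_idx                        # drop self-dyads
    i_idx, j_idx = i_idx[keep], j_idx[keep]
    y = rng.normal(0.5, 1.0, size=i_idx.size)    # stand-in dyadic outcomes

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 5.0)
        sigma_a = pm.HalfNormal("sigma_a", 1.0)
        sigma_b = pm.HalfNormal("sigma_b", 1.0)
        sigma_e = pm.HalfNormal("sigma_e", 1.0)
        a = pm.Normal("a", 0.0, sigma_a, shape=n)   # sender effects
        b = pm.Normal("b", 0.0, sigma_b, shape=n)   # receiver effects
        theta = mu + a[i_idx] + b[j_idx]
        pm.Normal("y", mu=theta, sigma=sigma_e, observed=y)
        idata = pm.sample(1000, tune=1000, chains=2)  # posterior draws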
Chris Volinsky, AT&T Labs
Modelling Massive Dynamic Graphs
When studying large transactional networks such as telephone
call detail data, credit card transactions, or web clickstream
data, graphs are a convenient and informative way to represent
data. In these graphs, nodes represent the transactors, and edges
the transactions between them. When these edges have a time stamp,
we have a "dynamic graph" where the edges are born and
die through time. I will present a framework for representing
and analyzing dynamic graphs, with a focus on the massive graphs
found in telecommunications and Internet data. The graph is parameterized
with three parameters, defining an approximation to the massive
graph that allows us to prune noise from it. When compared
to using the entire data set, the approximation actually performs
better for certain predictive loss functions. In this talk I will
demonstrate the application of this model to a telecommunications
fraud problem, where we are looking for patterns in the graph
associated with fraud.
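A hedged sketch of one generic way to maintain such a dynamic graph appears below: each period's transactions update edge weights, older activity decays exponentially, and weak edges are pruned so that the retained graph approximates the full history. The decay factor and pruning threshold here are illustrative parameters, not the three-parameter framework of the talk.

    # Sketch: a dynamic transactional graph with exponential aging and pruning.
    # Parameters are illustrative only, not those of the talk's framework.
    from collections import defaultdict

    class DynamicGraph:
        def __init__(self, decay=0.85, prune_below=0.05):
            self.decay = decay
            self.prune_below = prune_below
            self.weights = defaultdict(float)        # (caller, callee) -> weight

        def step(self, transactions):
            """Advance one period; transactions is an iterable of (src, dst) pairs."""
            for edge in list(self.weights):          # age existing edges
                self.weights[edge] *= self.decay
            for src, dst in transactions:            # add this period's activity
                self.weights[(src, dst)] += 1.0
            for edge, w in list(self.weights.items()):
                if w < self.prune_below:             # edges "die" once too weak
                    del self.weights[edge]

    g = DynamicGraph()
    g.step([("a", "b"), ("a", "b"), ("b", "c")])
    g.step([("a", "c")])
    print(sorted(g.weights.items()))

Exponential decay keeps recent behaviour prominent while letting stale edges die out, which matches the birth-and-death view of edges described above.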
Gregory Warnes, Global Research and
Development, PFIZER INC.
Data Mining Opportunities in Pharmaceutical Research
The research activities of the pharmaceutical industry are incredibly
varied and span most areas of chemical and biological research.
Many of these activities are, in essence, data mining problems,
and could benefit substantially from the application of data mining
approaches. In this talk, I will provide an overview of these
research questions and will highlight specific problems where
researchers in data mining can make a substantial impact.
Yale Zhang, Dofasco Inc., Process
Automation
Data Mining in the Steel Industry
The global steel industry today is facing significant pressure
to achieve operational excellence and improve profitability. Large
investments have been made in upgrading instrumentation
and data infrastructures. The expectation of these efforts is
to have the ability to collect, store and analyze industrial data,
which can be used to maximize knowledge discovery for better business/process
decision-making support, and enhance process modeling capability
for performance improvement of process monitoring and control
systems. Unfortunately, new obstacles have emerged making it difficult
to discover valuable knowledge resident in industrial data. These
obstacles include, but are not limited to, huge data volumes, uncertain
data quality, and misalignment between process and quality data.
Using traditional data analysis and modeling tools to analyze
such industrial data can be very time consuming, and even generate
misleading results. Because of these obstacles, much of the industrial
data are in fact not used, or at least, not used efficiently,
which leads to an undesirable situation - data rich, but information
poor. This means a significant amount of valuable process knowledge
is lost, diminishing the return on data infrastructure investments.
In this presentation, the above obstacles in the steel industry
are discussed. Two industrial examples, online monitoring of steel
casting processes for breakout prevention and analysis of steel
surface inspection data, along with future opportunities for data
mining in the steel industry, are presented.
Djamel Zighed, University Lumiere
Lyon 2
Some Enhancements in Decision Trees
In decision tree induction, we basically build a set of successive
partitions of the training set. At each step, we assess the quality
of the generated partition according to specific criteria. Most of
the criteria used are based on information measures such as Shannon's
entropy or Gini's index. From a theoretical point of view, these
criteria are defined on a probability distribution which is generally
unknown. Thus, the probabilities are estimated at each node by the
maximum likelihood estimator, i.e., the observed frequencies. The
sample size at each node decreases as the tree grows, so the probability
estimates become less reliable at the deeper levels of the tree. In
the presentation, we will discuss this issue and propose some more
suitable information measures that are sensitive to the sample size.
We will also discuss some strategies for generating partitions which
lead to a decision lattice instead of a decision tree. Some applications
will be shown.
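To make the estimation issue concrete, the sketch below contrasts the usual plug-in (frequency-based) Shannon entropy and Gini index with a Laplace-smoothed variant whose value depends on the node's sample size; the smoothed measure is only a generic illustration of sample-size sensitivity, not the specific measures proposed in the talk.

    # Split criteria estimated from class counts at a tree node.
    # The plug-in measures ignore how few examples reach the node; the
    # Laplace-smoothed variant shrinks toward uniform as the node shrinks.
    import math

    def shannon_entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def gini_index(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def smoothed_entropy(counts, alpha=1.0):
        # Small nodes are pulled toward the uniform (maximum-entropy) value.
        n, k = sum(counts), len(counts)
        probs = [(c + alpha) / (n + alpha * k) for c in counts]
        return -sum(p * math.log2(p) for p in probs)

    # The same 9:1 split looks equally "pure" to the plug-in measures whether
    # the node holds 10 or 1000 examples; the smoothed measure distinguishes them.
    for counts in [(9, 1), (900, 100)]:
        print(counts, round(shannon_entropy(counts), 3),
              round(gini_index(counts), 3), round(smoothed_entropy(counts), 3))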