SCIENTIFIC PROGRAMS AND ACTIVITIES
December 23, 2024
Abstracts
Ana Aizcorbe, Virginia Tech In January, the Bureau of Economic Analysis released a set of statistics for health care spending for the US to supplement those in the official national accounts. These estimates are viewed as experimental in large part because the data sources are novel and not completely understood. This talk will detail some of the challenges faced when trying to combine the large number of observations available in health insurance claims databases with the sound statistical properties of government surveys.
Hélène Bérard, Statistics Canada
Pedro Ferreira, Carnegie Mellon University This talk will describe a series of real-world experiments involving more than half a million households in which individuals participate organically. These experiments have been designed, deployed and analyzed in collaboration with a major European cable provider. In one experiment, I show that likes help correct social bias in the consumption of video-on-demand. In another experiment, I show that word-of-mouth propels the sales of video offered at discounted prices. Still in this context, I will also show that firms have incentives to manipulate recommendations in ways that may not maximize consumer welfare. In a similar context, another experiment shows that video streaming services over the Internet substitute for TV consumption roughly at a 1:1 rate and that users offered access to movies for free for a while do not necessarily stop pirating copyrighted content. Finally, I will also show results from an experiment measuring the economic benefits from proactive churn management targeting groups of users instead of single individuals.
Stephen Fienberg, Carnegie Mellon University The Living Analytics Research Centre (LARC) is a joint research initiative between Singapore Management University (SMU) and Carnegie Mellon University (CMU) focused on consumer and social analytics for the network-centric world. Working with a set of commercial partners and in an on-campus testbed in Singapore, LARC researchers are developing new concepts, methods, and tools that are experiment-driven, closed-loop, more real-time, and practical at societal scale. I will provide an introduction to the LARC research activities and the types of data with which we work. I will also describe a series of big data statistical challenges arising in our work.
For the last 75 years most information about human societies and their economies
has been generated from probability samples of near-universal frames of population
units. These samples were subjected to standardized measurement, the data
cleaned and assembled, and statistical analyses performed for statistical
inference to the full population.
Mike Holland, Center for Urban Science and Progress, New York University
Gizem Korkmaz, Virginia Tech Civil unrest events (protests, strikes, etc.) unfold through complex mechanisms that cannot be fully understood without capturing social, political and economic contexts. People revolt for economic reasons, for democratic rights, to express their discontent. Others are driven by unions, political groups, and some through social media. Modern cultures use a variety of communication tools (e.g., traditional and social media) to coordinate in order to gather a sufficient number of people to raise their voice. By combining a number of diverse data sources, we aim to explore which sources of information carry signals around key events in countries of Latin America. To accommodate dynamic features of social media feeds and their impacts on insurgency prediction, we develop a dynamic linear model based on daily keyword combinations. In addition, due to the large number of so-called n-grams, we employ a sparseness-encouraging prior distribution for the coefficients governing the dynamic linear model. Included in the predictors are significant sets of keywords (~1000) extracted from Twitter, news, and blogs. We also include the volume of requests to Tor, a widely used anonymity network, economic indicators and two political event databases (GDELT and ICEWS). Insurgency prediction is further complicated by the difficulty in assessing the exact nature of the conflict. Our study is further enhanced by the existence of a ground truth measure of conflicts compiled by an independent group of social scientists and experts on Latin America.
Julia Lane, American Institutes for Research The recent Cambridge University Press book "Privacy, Big Data, and the Public Good: Frameworks for Engagement" addresses privacy concerns over the use of big data for commercial and intelligence purposes, and describes how these data can be harnessed for the public good. This presentation provides a summary of the legal, economic and statistical thought that frames the many privacy issues associated with big data use and that are highlighted in the book. The presentation will also provide a summary of the book's practical suggestions for protecting privacy and confidentiality that can help researchers, policymakers and practitioners.
More and more, we see survey firms using respondents from Web panels to conduct their surveys. Most of the time, these panels cannot be considered as probabilistic samples, and consequently, inference from these panels is often questionable. In order to introduce a probabilistic component to Web panels, Rivers (2007) proposed Sample Matching. This method consists of selecting a probabilistic sample from a sampling frame, and then associating this sample with the respondents from the Web panel by using statistical matching. With statistical matching, each individual of the probabilistic sample is matched to one of the respondents of the panel according to some given characteristics, without aiming at having two individuals corresponding exactly to the same person. In this presentation, we will first describe Sample Matching in detail. Second, this method will be compared to Indirect Sampling (Lavallée, 2007). Third, we will present the hypotheses and theoretical justifications underlying Sample Matching. Finally, we will present some examples of the application of the method.
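The matching step described above can be illustrated with a minimal sketch. It is not the procedure of Rivers (2007): the function names, the squared Euclidean distance, and matching with replacement are all assumptions made for illustration, and the covariates are assumed to be numeric and already on comparable scales.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length covariate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_sample_to_panel(sample, panel):
    """For each unit of the probability sample, return the index of the
    closest panel respondent (hypothetical helper; matching is done with
    replacement, so one respondent may be matched to several sampled units)."""
    matches = []
    for unit in sample:
        # min() returns the first index in case of ties.
        best = min(range(len(panel)),
                   key=lambda k: squared_distance(unit, panel[k]))
        matches.append(best)
    return matches


# Illustrative covariates only: (age, indicator variable).
sample = [(30.0, 1.0), (62.0, 0.0)]
panel = [(60.0, 0.0), (25.0, 1.0), (33.0, 1.0)]
print(match_sample_to_panel(sample, panel))
```

The matched panel respondents then stand in for the sampled units they resemble, which is why no pair needs to correspond to the same person.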
Eric Miller, University of Toronto Over the past ten years agent-based microsimulation has increasingly become the state of the art in large-scale urban travel demand forecasting model systems, as well as (to a lesser extent) integrated urban (land-use) modelling. At the same time, big data with transportation and urban form applications are becoming increasingly available. This paper begins by providing an introduction to and overview of urban systems modelling. It then discusses the current and emerging state of big data sources and applications in transportation planning. This leads to a discussion of the opportunities and challenges involved in big data applications within urban systems modelling and some recommendations for directions for future research.
Archan Misra, Singapore Management University
This talk will describe the analytics opportunities and challenges arising from the LiveLabs testbed on the SMU campus, which seeks to utilize longitudinal traces of mobile sensing and usage data to understand human behavior in the physical world. I shall first outline our work on inferring a variety of on-campus physical world behavior, including indoor movement, group-based interactions and queuing episodes. Additionally, I will show how the inferred physical world group interaction context proves useful for multiple higher-level objectives, including (a) establishing the social relationships ("strength of ties") among the participants and (b) identifying anomalous or unusual events in such public venues. Next, I shall demonstrate an operational platform that supports context-based in-situ experimental studies on LiveLabs participants (e.g., understanding the efficacy of promotions targeted to users who are "queuing as a group"). The uncertainty in the inferred context, however, poses a fundamental challenge to establishing the validity of both the analytical insights and the experimental outcomes. I shall conclude by summarizing our current approaches and open issues related to this challenge.
Ron Jarmin, US Census
While agent-based and hybrid dynamic modeling supports formulation of highly
articulated theories of human behavior, including textured dynamic hypotheses
concerning social and environmental influences on decision-making and through-the-skin
coupling of physiological, decision-making and environmental interactions,
traditional data collection tools and survey instruments all too
frequently provide insufficient evidence to thoroughly inform and ground,
much less fully parameterize or calibrate, such models. Fortunately,
the rise of big data -- and especially its enhanced velocity, variety, veracity, and
volume -- provides remarkable volumes of data for researcher, clinical, or
self-insight into health behavior, conditions, and health care service delivery
processes. Unfortunately, absent rich models that connect the observations
to decision-making needs, the value of such data will be severely constrained.
Within this talk, we highlight the synergistic character of dynamic models
and data drawn from such novel data sources -- including our iEpi smart-phone
based sensor, survey and crowdsourcing health platform. We also emphasize
particular avenues by which such data allows for articulated theory building
regarding difficult-to-observe aspects of human behavior. In this regard,
we highlight the synergistic role that dynamic modeling can play in interacting
with a range of machine learning and computational statistics approaches
-- including batch Markov Chain Monte Carlo methods, Particle Filtering, Bayesian
conditional models and Hidden Markov Models.
Jonathan Ozik, Argonne National Laboratory Second, ABMs are increasingly run as ensembles on large computational resources. Large scale scientific workflows are being developed to implement capabilities for methods such as adaptive parametric studies, large scale sensitivity analyses, scaling studies, optimization/metaheuristics, uncertainty quantification, data assimilation, and multi-scale model integration. These methods are targeting both heterogeneous computing resources as well as more specialized supercomputing environments, enabling many thousands to millions of ABM simulations to be run and analyzed concurrently. From this perspective, ABMs can be seen as generators of Big Data. I will present work on the development of large scale scientific workflows for use with ABMs.
Jerry Reiter, Duke University Large-scale databases from the social, behavioral, and economic
Stephanie Shipp, Virginia Tech
Aleksandra (Sesa) Slavkovic, Penn State We propose methods to release and analyze synthetic graphs in order to protect
privacy of individual relationships captured by the social network. Proposed
techniques aim at fitting and estimating a wide class of exponential random
graph models (ERGMs) in a differentially private manner, and thus offer rigorous
privacy guarantees. More specifically, we use the randomized response mechanism
to release networks under ε-edge differential privacy. To maintain utility
for statistical inference, treating the original graph as missing, we propose
a way to use likelihood based inference and Markov chain Monte Carlo (MCMC)
techniques to fit ERGMs to the produced synthetic networks. We demonstrate
the usefulness of the proposed techniques on a real data example. (Joint work
with V. Karwa and P. Krivitsky).
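As a rough illustration of the randomized response step, the sketch below flips each dyad of an undirected graph independently with probability 1/(1 + e^ε), a standard choice under which the released graph satisfies ε-edge differential privacy. The function names and the edge-set representation are assumptions made here, not the authors' implementation, and the subsequent ERGM fitting by MCMC is not shown.

```python
import math
import random


def flip_probability(epsilon):
    """Per-dyad flip probability p, chosen so that (1 - p) / p = exp(epsilon)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response_graph(n_nodes, edges, epsilon, rng=None):
    """Return a synthetic undirected edge set (hypothetical helper).

    Every dyad (i, j) with i < j keeps its true state with probability
    1 - p and is flipped (edge <-> non-edge) with probability p.
    """
    rng = rng or random.Random()
    p = flip_probability(epsilon)
    original = {frozenset(e) for e in edges}
    synthetic = set()
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            present = frozenset((i, j)) in original
            if rng.random() < p:
                present = not present
            if present:
                synthetic.add((i, j))
    return synthetic


# Toy usage: a path graph on four nodes, released with epsilon = 1.
released = randomized_response_graph(4, [(0, 1), (1, 2), (2, 3)], 1.0,
                                     rng=random.Random(0))
print(sorted(released))
```

Smaller values of ε push the flip probability toward 1/2, giving stronger privacy but a noisier released graph, which is why inference has to account for the perturbation rather than treat the synthetic graph as the original.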
Following a brief discussion of each of the terms in the title, the talk
will focus on the aim of timely and accurate descriptions of survey or census
populations. Three issues will be discussed in some detail: (i) the combination
of data from survey and other sources, for example to describe the emerging
market for Vaporized Nicotine Products; (ii) the promise and perils of population
registries as frames; (iii) the challenges of using network structures in
population sampling.
Stanley Wasserman, Indiana University In networks, nodes represent individuals and relational ties among individuals can indicate joint participation in one or more teams. This representation captures the overlapping team membership but unfortunately fails to preserve the team structures. We propose a promising alternative approach: using affiliation networks to represent teams and individuals, with "links" representing team membership. There are multiple models to represent affiliation networks, and this approach is easily adapted to "big network data". The past decade has seen considerable progress in the development of p* (also known as Exponential Random Graph) models to model these relationships. However, given the plethora of parameters afforded by these models, it is increasingly evident that specification of these models is more of an art than a science. Ideally social science theory should guide the identification of parameters that map onto specific hypotheses. However, in the preponderance of cases, extant theories are not sufficiently nuanced to narrow down the selection of specific parameters. Hence there is a pressing need for some exploratory techniques to help guide the specification of theoretically sound hypotheses. In order to address this issue, we propose the use of correspondence analysis, which enables us to incorporate multiple relations and attributes at both individual and team levels. This preempts concerns about independence assumptions. The results from correspondence analysis can be presented visually. With these advantages, correspondence analysis can be used as an important exploratory tool to examine the features of the dataset and the relations among variables of interest. We present the theory for this approach, and illustrate with an example focusing on combat teams from a fantasy-based online game, EverQuest II - a very large network.
We explore the impact of various individual level and team level attributes on team performance while considering team affiliations as well as social relations among individuals. The results of our analysis, in addition to offering important multilevel insights, also serve as a stepping stone for more focused analysis using techniques such as p*/ERGMs.
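For readers unfamiliar with correspondence analysis, the following is a minimal sketch of the textbook computation (singular value decomposition of the standardized residuals) applied to a toy individual-by-team membership table. The function name and the toy data are assumptions for illustration; the sketch assumes NumPy is available and that every row and column has a positive total, and it does not reproduce the multi-relation, multi-attribute analysis described in the abstract.

```python
import numpy as np


def correspondence_analysis(table):
    """Return (row_coords, col_coords, total_inertia) for a nonnegative table.

    Rows are individuals and columns are teams in this illustration.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]     # principal coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    total_inertia = float((s ** 2).sum())
    return row_coords, col_coords, total_inertia


# Toy membership table: individuals 0-1 belong to team A, 2-3 to team B.
membership = [[1, 0], [1, 0], [0, 1], [0, 1]]
rows, cols, inertia = correspondence_analysis(membership)
print(rows[:, 0], inertia)
```

Plotting the first two columns of the row and column coordinates on the same axes gives the visual display mentioned above, with individuals who share team memberships appearing close together.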
Michael Wolfson, University of Ottawa Agent-Based Models (ABMs) form an important tool for social and related policy analysis. In this presentation, we focus on two models at opposite ends of a spectrum: empirical, applied, and complicated but not complex at one end, and more theoretical and complex at the other. The first, data-intensive example is the Statistics Canada LifePaths longitudinal microsimulation model, which has recently been used to inform the current policy debate about the adequacy of Canada's retirement income system. The second is a formally much simpler complex systems model, the Theoretical Health Inequality Model, THIM. Both models are realized as dynamic computer microsimulation models. While LifePaths is highly detailed and complicated, weaving together the results of many diverse statistical analyses, it is not complex in the sense of possibly generating "emergent" phenomena. It provides detailed estimates of the distribution of disposable income Canada's baby boomers can expect 20 years hence, for example. THIM, on the other hand, can be described by only a handful of equations. But because these equations relate variables at the individual, family, neighbourhood and city-wide levels, the behaviour of the system can be far less intuitive. Still, THIM can be used to explore aspects of social policy where much of the needed data simply do not exist. An example is the possible explanation of the observed contingent correlation between income inequality and mortality rates among cities - a correlation observed in the US, but not in Canada.