Current Projects

Benchmarks for Systematic Evaluation of Interactive Big Data Analytics

Line chart of brush speeds over time. Distribution of total interactions performed for different tasks, partitioned by task complexity. Response rates (fraction of queries successfully answered) for five different database management systems.

I am leading a large interdisciplinary research team to derive new database benchmarks from real logs of users manipulating dynamic query interfaces, which display live updates as interactions are performed in real time. Our approach merges new visualization theory on exploratory visual analysis (or EVA) and benchmarking methodology from databases. We carefully designed an exploratory user study with representative EVA tasks, datasets, and users, producing 128 separate session logs. We analyzed these logs to characterize the observed interaction patterns, and based on our analysis, highlighted design implications for database systems to support highly interactive visualization use cases. For example, our results show that when given the option to utilize real-time querying, users take advantage of it to perform long, continuous interactions that generate hundreds of queries per second. Finally, we developed a full end-to-end pipeline for translating our interaction logs to database queries, and evaluated our benchmark on five different database systems. This work will be presented at SIGMOD 2020. We plan to extend this work to generalize our approach and develop concrete interaction models for the future, e.g., by using our logs to train machine learning models to generate realistic interaction sequences.
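To make the log-to-query translation step concrete, here is a minimal sketch of how one logged brush interaction might become a parameterized SQL query, and how a response rate could be computed afterwards. It is not our actual pipeline: the `BrushEvent` fields, the `flights` table, and the rule that a query counts as answered only if it returns within a latency deadline are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BrushEvent:
    """One logged brush movement (hypothetical log schema)."""
    timestamp_ms: int  # when the brush moved
    column: str        # attribute being filtered
    low: float         # lower bound of the brushed range
    high: float        # upper bound of the brushed range


def event_to_sql(event: BrushEvent, table: str = "flights") -> Tuple[str, Tuple[float, float]]:
    """Translate one brush event into a parameterized range query.

    The table name and the COUNT(*) aggregate are assumptions; a real
    dashboard would issue one query per linked view.
    """
    sql = f"SELECT COUNT(*) FROM {table} WHERE {event.column} BETWEEN ? AND ?"
    return sql, (event.low, event.high)


def response_rate(latencies_ms: List[float], deadline_ms: float) -> float:
    """Fraction of queries answered within the deadline (assumed definition)."""
    if not latencies_ms:
        return 0.0
    answered = sum(1 for latency in latencies_ms if latency <= deadline_ms)
    return answered / len(latencies_ms)
```

A continuous brush drag would produce one `BrushEvent` per mouse movement, which is how a single interaction can turn into hundreds of queries per second.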

Characterizing Exploratory Visual Analysis

Distribution of 'think times' for different interactions in Tableau.

Our database benchmarks are also based on my research in understanding the role of exploratory visual analysis in data science. Supporting exploratory visual analysis (EVA) is a central goal of visualization research, and yet our understanding of the process is arguably vague and piecemeal. My research contributes a consistent definition of EVA through a review of the relevant literature, and an empirical evaluation of existing assumptions regarding how analysts perform EVA using Tableau, a popular visual analysis tool. In a study with 27 Tableau users exploring three different datasets, we find striking differences between existing assumptions and the collected data. Participants successfully completed a variety of tasks, with over 80% accuracy across focused tasks with measurably correct answers. The observed cadence of analyses is surprisingly slow compared to popular assumptions from the database community. We find significant overlap in analyses across participants, showing that EVA behaviors can be predictable. Furthermore, we find few structural differences between behavior graphs for open-ended and more focused exploration tasks. This research was presented at EuroVis 2019.
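As a rough illustration of how the cadence of an analysis session can be measured, the sketch below computes 'think times' from a timestamped interaction log. It assumes that think time is the gap between consecutive interactions and that each gap is attributed to the interaction that ends it; the log format and the interaction names are hypothetical, not the format used in the study.

```python
from collections import defaultdict
from statistics import median


def think_times(events):
    """Group gaps between consecutive interactions by interaction type.

    events: list of (timestamp_seconds, interaction_type), sorted by time.
    Returns a dict mapping each interaction type to the list of gaps
    (in seconds) that preceded interactions of that type.
    """
    gaps = defaultdict(list)
    for (prev_time, _), (time, kind) in zip(events, events[1:]):
        gaps[kind].append(time - prev_time)
    return dict(gaps)


# Hypothetical session: four interactions over 50 seconds.
session = [(0.0, "filter"), (12.0, "add_field"), (20.0, "filter"), (50.0, "add_field")]
by_type = think_times(session)
typical = {kind: median(values) for kind, values in by_type.items()}
```

Summaries such as the per-type median in `typical` are one simple way to compare observed cadence against the latency assumptions a system was designed around.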

Understanding User Queries for Optimization

Heatmap of NDSI calculations across the world. Heatmap of NDVI calculations in California-Mexico region. Example diagram for ForeCache tiling scheme.

I developed three visualization systems that exploit knowledge of how users visually explore datasets to implement context-aware database optimizations. ScalaR dynamically adjusts the size of query results output from a DBMS, based on the available screen space to render the results, avoiding rendering issues such as overplotting. ForeCache uses predictive prefetching techniques to support large-scale 2D data browsing. Sculpin combines predictive prefetching, incremental pre-computation, and visualization-aware caching techniques to support interactive visualization of queries executed on large, multidimensional array data. We collaborated with scientists to design and evaluate these systems for exploring satellite sensor data from NASA MODIS.
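The two ideas behind these systems can be sketched in a few lines. The first function shrinks a result to a point budget derived from screen space, in the spirit of ScalaR; the second ranks neighboring tiles for prefetching, in the spirit of ForeCache. Both are simplified illustrations under stated assumptions, not the systems' actual algorithms: averaging consecutive values is only one possible reduction, and the "repeat the last move first" heuristic stands in for ForeCache's prediction models, which are not described here.

```python
import math


def reduce_to_budget(values, max_points):
    """Average consecutive values into at most max_points bins.

    max_points is an assumed budget, e.g. the pixel width of the plot.
    """
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if len(values) <= max_points:
        return list(values)
    bin_size = math.ceil(len(values) / max_points)
    bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]
    return [sum(chunk) / len(chunk) for chunk in bins]


def prefetch_candidates(tile, last_move, grid_width, grid_height):
    """Rank the in-bounds neighbors of a tile for prefetching.

    tile: (x, y) position of the tile the user is viewing.
    last_move: the user's previous pan as (dx, dy), e.g. (0, 1).
    The neighbor in the direction of last_move is listed first
    (a hypothetical momentum heuristic).
    """
    x, y = tile
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    # Stable sort: the move matching last_move sorts first, others keep order.
    moves.sort(key=lambda move: move != last_move)
    return [
        (x + dx, y + dy)
        for dx, dy in moves
        if 0 <= x + dx < grid_width and 0 <= y + dy < grid_height
    ]
```

In both cases the optimization depends on context the database does not normally see: the size of the display in the first, and the user's recent movements in the second.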

Beagle: Supporting Data-Driven Visualization Design

Data available here!

"How common is interactive visualization on the web?" "What is the most popular visualization design?" "How prevalent are pie charts really?" These questions hint at the role of interactive visualization in the real (online) world. In this project, we present our approach to answering these questions, along with our findings. First, we introduce Beagle, which mines the web for SVG-based visualizations and automatically classifies them by type (e.g., bar, pie). With Beagle, we extract over 41,000 visualizations across five different tools and repositories, and classify them with 85% accuracy, across 24 visualization types. Given this visualization collection, we study usage across tools. We find that most visualizations fall under four types: bar charts, line charts, scatter charts, and geographic maps. Though controversial, pie charts are relatively rare for the visualization tools that were studied. Our findings also suggest that the total number of visualization types supported by a given tool could factor into its ease of use. However, this effect appears to be mitigated by providing a variety of diverse expert visualization examples to users. By using a scalable and automated data collection process, Beagle can support a variety of data-driven visualization design techniques, where a large input corpus is used to train machine learning models and extract design heuristics for automated visualization of new and unfamiliar datasets.
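To give a flavor of what classifying SVG-based visualizations involves, the sketch below counts the element types in an SVG document and applies a toy rule to guess the chart type. This is not Beagle's feature set or classifier, neither of which is described here: the element counts and the hand-written thresholds in `guess_type` are assumptions chosen only to show the general shape of the task.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def svg_element_counts(svg_text):
    """Count SVG elements by tag name, ignoring XML namespaces."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    for element in root.iter():
        tag = element.tag.split("}")[-1]  # "{namespace}rect" -> "rect"
        counts[tag] += 1
    return counts


def guess_type(counts):
    """Toy rule (hypothetical): many rects suggest bars, many circles suggest a scatter chart."""
    if counts["rect"] > counts["circle"] and counts["rect"] > counts["path"]:
        return "bar"
    if counts["circle"] > counts["rect"]:
        return "scatter"
    return "other"


example = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    "<rect/><rect/><rect/><circle/>"
    "</svg>"
)
features = svg_element_counts(example)
label = guess_type(features)
```

A learned classifier would replace `guess_type` with a model trained on labeled examples, using counts like these alongside richer features such as element positions and styles.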

Past Projects

StreamTrace: Making Sense of Temporal Queries with Interactive Visualization

As real-time monitoring and analysis become increasingly important, researchers and developers turn to data stream management systems (DSMSs) for fast, efficient ways to pose temporal queries over their datasets. However, these systems are inherently complex, and even database experts find it difficult to understand the behavior of DSMS queries. To help analysts better understand these temporal queries, we developed StreamTrace, an interactive visualization tool that breaks down, step by step, how a temporal query processes a given dataset. The design of StreamTrace is based on input from expert DSMS users; we evaluated the system with a lab study of programmers who were new to streaming queries. Results from the study demonstrate that StreamTrace can help users to verify that queries behave as expected and to isolate the regions of a query that may be causing unexpected results. This project was completed during my internship at Microsoft Research in 2014, and was presented at the CHI 2016 conference.

Exploring Medical Waveform Data with ForeCache and BigDAWG

ForeCache was connected to BigDAWG to support interactive exploration of medical patient data from the MIMIC II dataset, and was presented as part of the BigDAWG demo at the VLDB 2015 conference. BigDAWG is a federated database supporting multiple data models to power a variety of analysis and exploration use cases.