Research

The bottleneck to scientific discovery is no longer data acquisition, but data analysis. This shift can be attributed to advances in data acquisition technology: high-throughput lab techniques, remote sensing platforms, and high-resolution computational modeling. While the technology and resources necessary to collect or generate such data en masse are becoming widely available, the technology to manage and analyze the data has not kept pace. Traditionally, each data acquisition activity was coupled to a specific hypothesis, but now researchers collect data en masse---they "download the world"---exchanging the problem of extracting knowledge from the environment for one of extracting knowledge from a database.
Management of very large or very complex science data. Data-intensive scalable computing, scientific databases, visualization, mashups, integration of ad hoc science data.
Current Projects

Myria: Easy Analytics as a Service
Extracting knowledge from Big Data today is a high-touch business, requiring a human expert who deeply understands the application domain as well as a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but report that the majority of their time is spent scaling and optimizing relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. The Myria project focuses on building a new system for Big Data management, called MyriaDB, that is both fast and flexible; offering this system as a cloud service; and addressing the theoretical and systems challenges associated with Big Data management as a cloud service. In addition to the core system, we are building a platform, a common interface, and a suite of common optimizations for working with Big Data systems. This middleware layer is intended to facilitate experimental evaluations, but also to uncover the optimizations and abstractions that transcend specific systems and system implementations. We use Datalog-like languages as a common interface and as a tool for reasoning about these optimizations and abstractions.
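The routine data-manipulation steps named above can be made concrete with a short sketch. The dataset, field names, and sentinel convention below are all hypothetical, and the code is plain Python rather than anything Myria-specific; it simply illustrates the clean-filter-join-group-extract sequence that dominates analysts' time before any modeling begins.

```python
# Illustrative sketch (not Myria code): the routine data-manipulation
# steps -- clean/filter, join, group, extract features -- that precede
# the actual statistical analysis. All data below are invented.

samples = [
    {"id": 1, "site": "A", "temp": 11.2},
    {"id": 2, "site": "B", "temp": 9.8},
    {"id": 3, "site": "A", "temp": 12.5},
    {"id": 4, "site": "C", "temp": -999.0},   # sentinel for missing data
]
sites = {"A": "estuary", "B": "shelf", "C": "estuary"}

# 1. Clean/filter: drop sentinel values.
clean = [s for s in samples if s["temp"] > -100]

# 2. Join: attach site metadata to each sample.
joined = [{**s, "habitat": sites[s["site"]]} for s in clean]

# 3. Group + aggregate: mean temperature per habitat, as a simple feature.
groups = {}
for row in joined:
    groups.setdefault(row["habitat"], []).append(row["temp"])
features = {h: sum(v) / len(v) for h, v in groups.items()}

print(features)
```

Each of these steps is trivially expressible in a Datalog- or SQL-like language; what makes them expensive in practice is scaling and optimizing them over distributed data, which is exactly the part Myria aims to automate.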
We also have other projects related to Big Data management. See our overview page.
And be sure to see our CSE-wide efforts in Big Data.
Horizon: Visual Data Analytics in the Cloud
This work led to the HaLoop system, and has also led to new features for handling unstructured grids in the OPeNDAP data management software popular in the oceanography community.
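HaLoop's target workloads are iterative: each pass repeats a join-and-union over the results of the previous pass until nothing changes. The toy below shows that fixpoint loop for transitive closure over a graph; the edge data are invented and the code is plain in-memory Python, not HaLoop code, but the loop body corresponds to the step a HaLoop-style system runs as a cached, repeated MapReduce job.

```python
# Toy fixpoint computation of transitive closure (hypothetical data).
# Each iteration is one join (path ⋈ edge) plus a union -- the kind of
# loop-invariant-heavy iteration that HaLoop caches across passes.

edge = {(1, 2), (2, 3), (3, 4)}

path = set(edge)                      # base case: every edge is a path
while True:
    # join: extend each known path by one edge
    new = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
    if new <= path:                   # fixpoint: nothing new derived
        break
    path |= new                       # union the new facts in

print(sorted(path))
```

On a distributed platform, `edge` is the loop-invariant input; caching it across iterations (rather than rereading it every pass, as vanilla MapReduce would) is the core HaLoop optimization.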
SQLShare: Database-as-a-Service for Long Tail Science
Our approach is to provide a basic system for querying data in the cloud (using Microsoft SQL Azure), then explore a set of smart services to streamline and automate analysis. Specifically: 1) queries are saved as views and can be shared with others for collaborative analysis, 2) we derive automatic starter queries directly from the data to bootstrap analysis, 3) we derive dashboards ("mashups") directly from the data to automate visual analysis, 4) we are working to translate English fragments into SQL fragments to assist SQL novices, and 5) we are building on previous work here at UW on SQL autocomplete features.
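The first two of these services are easy to sketch. The snippet below uses Python's built-in sqlite3 as a stand-in for SQL Azure (the table and column names are invented): it saves a query as a named view that collaborators can build on, and derives a starter query directly from the stored schema.

```python
import sqlite3

# Illustrative sketch only: SQLShare runs on Microsoft SQL Azure, but the
# same two ideas -- saving a query as a named view, and deriving a
# "starter query" from the data's own schema -- can be shown with sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casts (depth REAL, salinity REAL, station TEXT)")
conn.executemany("INSERT INTO casts VALUES (?, ?, ?)",
                 [(5.0, 31.2, "n1"), (50.0, 33.9, "n1"), (5.0, 30.8, "n2")])

# (1) Save a query as a view so others can reference it by name.
conn.execute("CREATE VIEW surface_casts AS "
             "SELECT * FROM casts WHERE depth < 10")

# (2) Derive a starter query directly from the table's schema.
cols = [r[1] for r in conn.execute("PRAGMA table_info(casts)")]
starter = f"SELECT {', '.join(cols)} FROM casts LIMIT 10"

rows = conn.execute("SELECT COUNT(*) FROM surface_casts").fetchone()[0]
print(starter)   # SELECT depth, salinity, station FROM casts LIMIT 10
print(rows)      # 2
```

The point of both services is the same: the user never has to write boilerplate SQL from a blank page, because the system generates it from data that is already there.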
Our motivation is long tail science. In contrast to "big science" projects such as the Large Synoptic Survey Telescope and the Large Hadron Collider, the challenge in the long tail of science is not only data volume, but data complexity. Projects in oceanography or the life sciences may involve cleaning and integrating data from hundreds of heterogeneous sources. Although sheer scale is not typically the defining feature of these data sources, the volumes involved are not insignificant: in the life sciences, for example, a modern short-read sequencer can generate a terabyte per day. At the University of Washington, approximately ten of these sequencers are in use on campus, and 20 more are scheduled to be purchased in the next few years. Low-cost, high-throughput mass spectrometry, microarray, and flow cytometry platforms are similarly poised to produce exponential growth in data volumes over the next few years. Read more...
This project is supported by a Moore Foundation Grant and a 2010 Jim Gray Seed Grant from Microsoft Research.

Parallel Datalog on New Computing Platforms
Building on our work on HaLoop, we are developing a Datalog interface to massively parallel platforms including HaLoop/Hadoop, the Cray XMT, and Microsoft's Daytona Platform on the Azure cloud. The Cray XMT supports massive multithreading --- millions of simultaneous threads accessing shared memory at low latency --- eliminating dependence on a deep cache hierarchy for performance. PNNL is exploring the XMT as a platform for a graph database. While the XMT has proven capabilities in graph processing, a general-purpose semantic database necessarily involves "conventional" computation in addition to massively thread-parallel computation. No existing query language insulates the user from this heterogeneity by transparently splitting a query into conventional and XMT components; we are designing a prototype language with this property.

GridFields: Algebraic Manipulation of Unstructured Meshes
The large datasets produced by simulations typically have a grid structure that is not amenable to storage within traditional database systems. We've developed an algebra of GridFields that allows convenient manipulation of grid-structured datasets, much as the relational algebra allows convenient manipulation of table-structured data. This work originated in the context of CMOP, the NSF Science and Technology Center for Coastal Margin Observation and Prediction.
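To give a flavor of the algebraic style, here is a minimal, hypothetical GridField-like class supporting a restrict-style operator that keeps only the cells whose bound data satisfy a predicate. This is an illustrative toy under invented names and data, not the actual GridFields implementation, which defines a richer set of operators over grids of arbitrary dimension.

```python
# Toy sketch of a GridField-style "restrict" operation: filter the cells
# of an unstructured mesh by a predicate on the data bound to them.
# Class name, cell representation, and data are all hypothetical.

class GridField:
    def __init__(self, cells, values):
        # cells: node-index tuples of an unstructured mesh
        # values: one bound datum per cell, positionally aligned
        self.cells = cells
        self.values = values

    def restrict(self, pred):
        """Return a new GridField containing only cells whose value passes pred."""
        kept = [(c, v) for c, v in zip(self.cells, self.values) if pred(v)]
        return GridField([c for c, _ in kept], [v for _, v in kept])

mesh = GridField(cells=[(0, 1, 2), (1, 2, 3), (2, 3, 4)],
                 values=[12.1, 35.0, 34.2])          # e.g. salinity per cell
saline = mesh.restrict(lambda s: s > 30)
print(saline.cells)    # [(1, 2, 3), (2, 3, 4)]
```

The appeal of the algebraic approach is that operations like this compose: the result is itself a grid-structured dataset, just as the result of a relational operator is itself a table.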
This work is supported by a subcontract from the Woods Hole Oceanographic Institution via the NSF-funded Ocean Observatories Initiative and an NSF EAGER award.
Data Pricing

I am a Co-PI on the Data Pricing project.
Teaching

Data-Intensive Computing in the Cloud, Spring 2012, University of Washington Educational Outreach
CS599c: Scientific Data Management, Spring 2010, University of Washington, with Magda Balazinska
CS410/510: Scientific Data Management, Summer 2006, Portland State University
Bio

Bill Howe is the Director of Research for Scalable Data Analytics at the UW eScience Institute and holds an Affiliate Assistant Professor appointment in Computer Science & Engineering, where he studies data management, analytics, and visualization systems for science applications. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, has had two papers elected to VLDB Journal's "Best of Conference" issues (2004 and 2010), and co-authored what are currently the most-cited papers from both VLDB 2010 and SIGMOD 2012. Howe serves on the program and organizing committees for a number of conferences in the area of databases and scientific data management, and serves on the Science Advisory Board of the SciDB project. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.
Ph.D., Computer Science, Portland State University, 2006
I've been working with databases since 1995, when I worked for Delta Air Lines as a co-op in their Technical Operations facility. When I graduated from Georgia Tech, I went to work for Deloitte Consulting, designing and building enterprise client-server applications, specifically Customer Relationship Management (CRM) systems with Siebel. After Deloitte and before graduate school, I worked as an independent contractor at Microsoft and at companies ranging from newly deregulated telecommunications carriers to providers of oil field exploration services.