Bill Howe

Associate Director, eScience Institute
Affiliate Assistant Professor, Department of Computer Science & Engineering
University of Washington

Office: 450 Paul G. Allen Center
Phone: 206-221-9261
Fax: 206-543-2969
Email: billhowe at cs dot washington dot edu
NSF biosketch (PDF)
OpenSSH RSA public key

Address:
Box 352350
Computer Science & Engineering
University of Washington
Seattle, WA 98195-2350

Research

The bottleneck to scientific discovery is no longer data acquisition, but data analysis. This shift can be attributed to advances in data acquisition technology: high-throughput lab techniques, remote sensing platforms, and high-resolution computational modeling. While the technology and resources needed to collect or generate data at high rates are becoming widely available, the technology to manage and analyze those data has not kept pace. Traditionally, each data acquisition activity was coupled to a specific hypothesis; now researchers collect data en masse ("downloading the world"), exchanging the problem of how to extract knowledge from the environment for the problem of how to extract knowledge from a database.

Research Topics

Management of very large or very complex science data. Data-intensive scalable computing, scientific databases, visualization, mashups, integration of ad hoc science data.

Current Projects

Myria: Easy Analytics as a Service
Extracting knowledge from Big Data today is a high-touch business: it requires a human expert who deeply understands both the application domain and a growing ecosystem of complex distributed systems and advanced statistical methods. These experts are hired in part for their statistical expertise, but report that the majority of their time is spent scaling and optimizing relatively basic data manipulation tasks in preparation for the actual statistical analysis or machine learning step: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. The Myria project focuses on building a new Big Data management system, MyriaDB, that is both fast and flexible, offering it as a cloud service, and addressing the theoretical and systems challenges that arise in delivering Big Data management as a cloud service. In addition to the core system, we are building a platform, a common interface, and a suite of common optimizations for working with Big Data systems. This middleware layer is intended to facilitate experimental evaluations, but also to uncover the optimizations and abstractions that transcend specific systems and system implementations. We use Datalog-like languages as a common interface and as a tool for reasoning about these optimizations and abstractions.

We also have other projects related to Big Data management. See our overview page.

And be sure to see our CSE-wide efforts in Big Data.

Horizon: Visual Data Analytics in the Cloud
I am the lead PI on two NSF grants exploring how cloud computing can support interactive, visual, exploratory science. Through an NSF Cluster Exploratory grant, and in partnership with visualization experts at the University of Utah, we are exploring the use of MapReduce as a common framework for both scalable data processing and scalable visualization. Through an NSF EAGER grant, I am developing a new visualization algebra for use with the Microsoft Azure platform. The core goal of both projects is to allow scientists to analyze terabytes of data in the cloud as efficiently, conveniently, and deeply as they can analyze megabytes of data on their laptops.

This work led to the HaLoop system (currently the most cited paper from VLDB 2010), and has also led to new features for handling unstructured grids in the OPeNDAP data management software popular in the oceanography community.

Read more...

SQLShare: Database-as-a-Service for Long Tail Science
Informed by the dataspace abstraction proposed by Halevy, Franklin, and Maier, we are developing a platform for ad hoc databases called SQLShare that allows a user to bootstrap a collaborative database environment just by uploading data, writing queries, and sharing the results.

Our approach is to provide a basic system for querying data in the cloud (using Microsoft SQL Azure), then to layer on a set of smart services that streamline and automate analysis. Specifically:

  • Queries are saved as views and can be shared with others for collaborative analysis.
  • We derive automatic "starter" queries directly from the data to bootstrap analysis.
  • We derive dashboards ("mashups") directly from the data to automate visual analysis.
  • We are working to translate English fragments into SQL fragments to assist SQL novices.
  • We are building on previous work here at UW on SQL autocomplete features.
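
To make the view-based sharing model concrete, here is a minimal sketch in SQL. The table, view, and column names are invented for illustration rather than drawn from a real SQLShare dataset, and in SQLShare itself the upload and sharing steps happen through the web interface rather than raw DDL.

    -- Hypothetical example: a researcher uploads a CSV of CTD casts,
    -- which becomes queryable as the table ctd_casts.
    -- A collaborator's question ("average surface salinity per station")
    -- is answered with an ordinary query:
    SELECT station, AVG(salinity) AS avg_salinity
    FROM ctd_casts
    WHERE depth_m BETWEEN 0 AND 10
    GROUP BY station;

    -- Saving the query as a named view lets others build on the result
    -- without rewriting (or even reading) the original logic:
    CREATE VIEW surface_salinity_by_station AS
    SELECT station, AVG(salinity) AS avg_salinity
    FROM ctd_casts
    WHERE depth_m BETWEEN 0 AND 10
    GROUP BY station;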

Our motivation is long tail science. In contrast to "big science" projects such as the Large Synoptic Survey Telescope and the Large Hadron Collider, the challenge in the long tail of science is not only data volume but data complexity. Projects in oceanography or the life sciences may involve cleaning and integrating data from hundreds of heterogeneous sources. Although sheer scale is not typically the defining feature of these data sources, the volumes involved are not insignificant: in the life sciences, for example, a modern short-read sequencer can generate a terabyte per day. At the University of Washington, approximately ten of these sequencers are in use on campus, and twenty more are scheduled to be purchased in the next few years. Low-cost, high-throughput mass spectrometry, microarrays, and flow cytometry are similarly poised to produce exponential growth in data volumes. Read more...

This project is supported by a Moore Foundation Grant and a 2010 Jim Gray Seed Grant from Microsoft Research.

Hybrid Graph-Relational Query Systems
I am a Co-PI on the Myria project with Dan Suciu and Magda Balazinska, one of only eight five-year, multi-million-dollar NSF Big Data awards. The goal of the project is to establish a production-quality online data analytics service targeting domain researchers rather than IT professionals. My work focuses on a new language interface based on Datalog that can insulate the user from systems-level decision-making (e.g., how many machines to use, how to partition the data), yet can still express the complex analytics tasks required for modern data-driven discovery (e.g., graph traversals and machine learning tasks as well as conventional relational queries). Recent work has focused on code generation techniques for Datalog: given a query, emit minimal C++ code that implements it, then compare its performance against mature database systems. Initial results show a factor of five speedup, suggesting that the overhead of relational databases can be avoided without throwing out the relational model. This compilation framework also allows us to target a number of backends beyond Myria and straight C++, including HaLoop/Hadoop and the Grappa system developed here at UW.
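
To make this concrete, the kind of graph traversal mentioned above is, in Datalog, the classic two-rule reachability program: reach(x,y) :- edge(x,y) and reach(x,z) :- reach(x,y), edge(y,z). The sketch below expresses the same recursion in standard SQL using a recursive common table expression; the edge table and node ids are invented for illustration, and this is not Myria's actual surface syntax.

    -- Hypothetical edge relation edge(src, dst); find every node reachable from node 1.
    WITH RECURSIVE reach(src, dst) AS (
        SELECT src, dst FROM edge            -- base case: direct edges
        UNION
        SELECT r.src, e.dst                  -- recursive case: extend a path by one edge
        FROM reach AS r JOIN edge AS e ON r.dst = e.src
    )
    SELECT DISTINCT dst FROM reach WHERE src = 1;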

GridFields: Algebraic Manipulation of Unstructured Meshes
The large datasets produced by simulations typically have a grid structure that is not amenable to storage within traditional database systems. We've developed an algebra of GridFields that allows convenient manipulation of grid-structured datasets, much as the relational algebra allows convenient manipulation of table-structured data. This work originated in the context of CMOP, the NSF Science and Technology Center for Coastal Margin Observation and Prediction.

This work is supported by a subcontract from the Woods Hole Oceanographic Institution via the NSF-funded Ocean Observatories Initiative, and by an NSF EAGER award.

Data Pricing
I am a Co-PI on the Data Pricing project.

SciDB
I am on the Science Advisory Board for the SciDB project, representing requirements from the environmental modeling community.

Teaching

Data-Intensive Computing in the Cloud, Spring 2012, University of Washington Educational Outreach
CS599c: Scientific Data Management, Spring 2010, University of Washington, with Magda Balazinska
CS410/510: Scientific Data Management, Summer 2006, Portland State University

Publications

DBLP

Google Scholar

Selected Talks

Slideshare

Bio

Bill Howe is the Associate Director of the UW eScience Institute and holds an Affiliate Assistant Professor appointment in Computer Science & Engineering, where he studies data management, analytics, and visualization systems for science applications. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, has had two papers selected for VLDB Journal's "Best of Conference" issues (2004 and 2010), and co-authored what are currently the most-cited papers from both VLDB 2010 and SIGMOD 2012. Howe serves on the program and organizing committees for a number of conferences in the area of databases and scientific data management, and serves on the Science Advisory Board of the SciDB project. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.

Professional Service

  • Program Committee, eScience 2014
  • Program Committee, SSDBM 2013
  • Program Committee, LDAV 2013
  • Chair, Workshop on HPC meets Databases, co-located with Supercomputing 2012
  • Program Committee, ICDE 2013
  • Program Committee, PVLDB 2012-2013
  • Demo Co-chair, SSDBM 2012
  • Program Committee, ScienceCloud 2012
  • Chair, Workshop on HPC meets Databases, co-located with Supercomputing 2011
  • Editorial Board, Journal of Data Semantics
  • Organizing Committee, XLDB 2011
  • Program Committee, LDAV 2011
  • Program Committee, ScienceCloud 2011
  • Co-Chair, Workshop on Array Databases
  • Registration Chair, SSDBM 2011
  • Program Committee, SSDBM 2011
  • Demonstrations Program Committee, SIGMOD 2011
  • Program Committee, EDBT 2010
  • Program Committee, SSDBM 2010
  • Program Committee, IIMAS Workshop, 2008
  • Reviewer, VLDB Journal, 2007
  • Program Committee, dg.o 2006
  • Program Committee, dg.o 2005
  • Demonstrations Program Committee, SIGMOD 2005
  • Student Session Program Committee, dg.o 2004

Professional Background

Ph.D., Computer Science, Portland State University, 2006
B.S., Industrial and Systems Engineering, Georgia Tech, 1999

I've been working with databases since 1995, when I worked for Delta Air Lines as a co-op in their Technical Operations facility. When I graduated from Georgia Tech, I went to work for Deloitte Consulting designing and building enterprise client-server applications, specifically Customer Relationship Management (CRM) systems built on Siebel. After Deloitte and before graduate school, I worked as an independent contractor for Microsoft and for companies ranging from newly deregulated telecommunications carriers to providers of oil field exploration services.
