Assignment 4: Cloud Databases for Ad Hoc Analysis

Relational databases are often carefully engineered by large teams to be tightly coupled to one or more applications. But today, relational database management systems (RDBMS) may increasingly play a role for ad hoc collaborative data analysis, where a database is created quickly, on demand, to provide interactive query capabilities in support of a particular task. These databases may be thrown out or completely redesigned after the initial analysis task is completed.

Cloud computing offers an ideal platform for these transient, ad hoc, task-specific collaborative databases, since it neatly sidesteps questions of hardware, licensing, maintenance, and integration with existing infrastructure.

In this assignment, you will create an ad hoc database in two cloud platforms and use it to analyze the logs of a hadoop cluster. You will be given data in a raw format and be expected to create a schema, load the data, and answer some questions using SQL.

The datasets

The datasets we will use represent log information from a Hadoop cluster. There are two datasets, both in tab-separated value format:
  1. hadoop_jobs.tsv
  2. hadoop_tasks.tsv
In CSV:
  1. hadoop_jobs.csv
  2. hadoop_tasks.csv
The first line of each file provides column names. The type of each column is up to you to determine. One approach that I find useful in practice is to just assume that everything is a string to simplify data loading, then use views (saved, named queries) to cast the data to the appropriate data type. This approach gives you more control, lets you experiment with different schemas, and helps minimize painful loading errors. However, feel free to use whatever schema you find appropriate.

Part 1: Microsoft SQL Azure

Part 2: Set up an Amazon RDS instance

Part 3: Data analysis

Part 3: OPTIONAL: Fusion Tables

Try to upload the data to Google Fusion Tables and express as many of the queries above as you can.

Part 4: What to turn in

  1. Answers to the above questions.
  2. Three bullets comparing and contrasting Amazon RDS and SQL Azure. What problems did you run into with one or the other? Was anything surprising?
  3. OPTIONAL: Measure the performance of data loading and/or one or more queries above and report results between SQL Azure and Amazon RDS
  4. OPTIONAL: 1-2 sentences about your experience with fusion tables. Did the data upload cleanly? Were you able to answer the queries? Is fusion tables a viable option for these ad hoc data analysis tasks?

Other Resources

Paper describing SQL Azure
Overview of Getting Started with SQL Azure
Tools that work with SQL Azure