Assignment 4: Cloud Databases for Ad Hoc Analysis

Relational databases are often carefully engineered by large teams to be tightly coupled to one or more applications. But today, relational database management systems (RDBMS) may increasingly play a role for ad hoc collaborative data analysis, where a database is created quickly, on demand, to provide interactive query capabilities in support of a particular task. These databases may be thrown out or completely redesigned after the initial analysis task is completed.

Cloud computing offers an ideal platform for these transient, ad hoc, task-specific collaborative databases, since it neatly sidesteps questions of hardware, licensing, maintenance, and integration with existing infrastructure.

In this assignment, you will create an ad hoc database in two cloud platforms and use it to analyze the logs of a hadoop cluster. You will be given data in a raw format and be expected to create a schema, load the data, and answer some questions using SQL.

The datasets

The datasets we will use represent log information from a Hadoop cluster. There are two datasets, both in tab-separated value format:

In CSV:

The first line of each file provides column names. The type of each column is up to you to determine. One approach that I find useful in practice is to just assume that everything is a string to simplify data loading, then use views (saved, named queries) to cast the data to the appropriate data type. This approach gives you more control, lets you experiment with different schemas, and helps minimize painful loading errors. However, feel free to use whatever schema you find appropriate.

Part 1: Microsoft SQL Azure

Set up Microsoft SQL Azure instance

Create a new server following the instructions here

Create a database following Task 1 here

Create a schema to hold our test dataset

You may choose to use the Database Manager for SQL Azure, as described in Task 2 here

Or you can use SQL Management Studio, as described in Task 3 of the same exercise

Load the data

You can try these tools that work with SQL Azure
You can launch a Windows instance on Amazon, copy the data there, and load using SQL Server Management Studio.
Alternatively, I use pymssql to talk to SQL Server databases. Sometimes there are some version incompatibilities, however.
Another choice is to use freetds directly, which is the library underlying pymssql, and includes a sql client.
Another choice is to use an ODBC driver for linux, then use any database library you prefer that can talk through ODBC.

Part 2: Set up an Amazon RDS instance

Launch an RDS instance.

getting started guide.

Create a schema using your favorite MySQL client.
Load the data

Part 3: Data analysis

What timespan does this set of jobs cover?

SELECT min(submitTime), max(submitTime) FROM hadoopjobs;

What proportion of jobs failed? (status != 0)
What is the average number of mappers and reducers per job (tasktype=m vs. tasktype=r), ignoring failed tasks? (Join the two tables on jobid)
What is the average number of failed tasks per successful job? (status = 0 means success for both tasks and jobs)
OPTIONAL: How long did the top 10 longest-running jobs take to run?
OPTIONAL: For each of the top 10 longest-running jobs, how long did its longest-running task take to run?
OPTIONAL: What was the maximum number of concurrently running jobs? (One naive solution: For each job, count up the number of jobs that overlap it in time.)

Part 3: OPTIONAL: Fusion Tables

Try to upload the data to Google Fusion Tables and express as many of the queries above as you can.

Part 4: What to turn in

Answers to the above questions.
Three bullets comparing and contrasting Amazon RDS and SQL Azure. What problems did you run into with one or the other? Was anything surprising?
OPTIONAL: Measure the performance of data loading and/or one or more queries above and report results between SQL Azure and Amazon RDS
OPTIONAL: 1-2 sentences about your experience with fusion tables. Did the data upload cleanly? Were you able to answer the queries? Is fusion tables a viable option for these ad hoc data analysis tasks?

Other Resources

Paper describing SQL Azure
Overview of Getting Started with SQL Azure
Tools that work with SQL Azure