Assignment 4: Cloud Databases for Ad Hoc Analysis
Relational databases are often carefully engineered by large teams to be tightly coupled to one or more applications. But today, relational database management systems (RDBMS) may increasingly play a role for ad hoc collaborative data analysis, where a database is created quickly, on demand, to provide interactive query capabilities in support of a particular task. These databases may be thrown out or completely redesigned after the initial analysis task is completed.
Cloud computing offers an ideal platform for these transient, ad hoc, task-specific collaborative databases, since it neatly sidesteps questions of hardware, licensing, maintenance, and integration with existing infrastructure.
In this assignment, you will create an ad hoc database in two cloud platforms and use it to analyze the logs of a hadoop cluster. You will be given data in a raw format and be expected to create a schema, load the data, and answer some questions using SQL.
The datasets
The datasets we will use represent log information from a Hadoop cluster.
There are two datasets, both in tab-separated value format:
- hadoop_jobs.tsv
- hadoop_tasks.tsv
In CSV:
- hadoop_jobs.csv
- hadoop_tasks.csv
The first line of each file provides column names. The type of each column is up to you to determine. One approach that I find useful in practice is to just assume that everything is a string to simplify data loading, then use views (saved, named queries) to cast the data to the appropriate data type. This approach gives you more control, lets you experiment with different schemas, and helps minimize painful loading errors. However, feel free to use whatever schema you find appropriate.
Part 1: Microsoft SQL Azure
Part 2: Set up an Amazon RDS instance
- Launch an RDS instance.
You can use this getting started guide.
- Create a schema using your favorite MySQL client.
Clients exist for both Windows and Linux.
- Load the data
Using your favorite mechanism. Ask other students for help if you need it.
Part 3: Data analysis
- What timespan does this set of jobs cover?
One solution, depending on how you've organized your schema:
SELECT min(submitTime), max(submitTime) FROM hadoopjobs;
What proportion of jobs failed? (status != 0)
What is the average number of mappers and reducers per job (tasktype=m vs. tasktype=r), ignoring failed tasks? (Join the two tables on jobid)
What is the average number of failed tasks per successful job? (status = 0 means success for both tasks and jobs)
OPTIONAL: How long did the top 10 longest-running jobs take to run?
OPTIONAL: For each of the top 10 longest-running jobs, how long did its longest-running task take to run?
OPTIONAL: What was the maximum number of concurrently running jobs? (One naive solution: For each job, count up the number of jobs that overlap it in time.)
Part 3: OPTIONAL: Fusion Tables
Try to upload the data to Google Fusion Tables and express as many of the queries above as you can.
Part 4: What to turn in
- Answers to the above questions.
- Three bullets comparing and contrasting Amazon RDS and SQL Azure. What problems did you run into with one or the other? Was anything surprising?
- OPTIONAL: Measure the performance of data loading and/or one or more queries above and report results between SQL Azure and Amazon RDS
- OPTIONAL: 1-2 sentences about your experience with fusion tables. Did the data upload cleanly? Were you able to answer the queries? Is fusion tables a viable option for these ad hoc data analysis tasks?
Other Resources
Paper describing SQL Azure
Overview of Getting Started with SQL Azure
Tools that work with SQL Azure