Scalable and Data-Intensive Computing in the Cloud

Instructor:  Bill Howe, PhD

Email:  billhowe@cs.washington.edu

Phone:   (206) 616-5828

Office:   Allen 450

Period:   Mar 26 - Jun 4, 2012

Class Meeting Time: Mondays, 6:00-9:00 p.m.

Class Meeting Place: ST 359

Web:    http://www.cs.washington.edu/homes/billhowe/bigdatacloud

Office Hours:    by appointment

Moodle Site: http://moodle.extn.washington.edu/course/view.php?id=1731

 

Course Overview: We will explore the technology landscape at the intersection of “big data” and cloud computing.  Articles in the NYTimes, Science, Nature, the Economist, a variety of reports from the federal government and innumerable blog posts have made the case that technology supporting the deep analysis of massive datasets – big data – is now a critical enabling technology for business and research in all fields.  Cloud computing has been a catalyst for this trend, democratizing access to the infrastructure required to build big data applications.  Scalable data platforms are increasingly deployed in the cloud or offered directly by cloud providers. 

 

The course will be technology driven.  We will consider relational databases (specifically in the context of cloud computing), the Hadoop ecosystem and its variants, other NoSQL platforms emphasizing low-latency access, and more.  We will work directly with a selected set of these platforms, compare and contrast their relative strengths and weaknesses, and characterize the problems they are designed to solve.

 

Learning Objectives: By the end of this course, students will be able to:

 

Course Structure: Each class will consist of a 1-hour lecture, a 1-hour case study and demonstration of a specific system, and 1 hour of discussion and hands-on work.  Each week, the lecture will cover a category of scalable data platform, and the demonstration will examine a representative example from that category in detail.  Students will be asked to complete a hands-on homework assignment based on the material presented in class and, in some cases, to come prepared to discuss assigned reading.  The reading assignments will generally be research papers from relevant computer science conferences.

 

Student Assessment: Assignments: 80%, Participation: 20%. Each assignment will be due one week after it is assigned, by the start of class. Participation will be a combination of attendance and discussion involvement; in-class and online involvement will both contribute. Assignments will typically not be graded in terms of correct/incorrect answers, but students will be expected to demonstrate effort and insight. In this course, discussion between students is not discouraged; the goal is to learn as much as possible in a short time, and discussion is a very efficient way to do this.  Some assignments may be completed in groups, depending on students’ experience level.  In these cases, a portion of the grade will be based on peer review by one’s group members.

 

Textbook: None.  All materials will be on the web.  We will use a combination of slides, documentation for the selected systems covered in class, some custom material, and some relevant research papers.

 

Prerequisite:  The assignments will involve example-oriented programming in various languages, possibly including Java, C#, and Python.  You will NOT be expected to be proficient in any of these languages, but you will be expected to “think computationally.”  Specifically, you will work through examples, answer questions about the code, and generally be “brave” with respect to learning new technology – read the documentation, ask questions, try experiments, make some educated guesses, etc.

Specific Course Topics and Tentative Schedule:

Week 1: Cloud computing review, big data overview, problem types, history and context, methodology

Case study: Amazon EC2, S3

Reading: None

Slides (pptx) (3MB)

Assignment 1: Cloud Storage Write Speed Shootout

1) Set up accounts

2) http://escience.washington.edu/get-help-now/get-started-amazon-web-services

3) http://www.windowsazure.com/en-us/pricing/free-trial/

4) Cloud Storage Write Speed Shootout
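As a rough illustration of what a write-speed measurement can look like, the sketch below times repeated writes of a fixed payload and reports throughput in MB/s. It is not the assignment's reference solution. The names measure_write_throughput and write_to_local_file are hypothetical, and the local temporary file is only a stand-in: for the actual comparison, write_fn would be replaced by an upload call for each cloud storage service.

```python
import os
import tempfile
import time


def measure_write_throughput(write_fn, payload, repeats=3):
    """Return the best observed throughput in MB/s for write_fn(payload)."""
    best_seconds = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        write_fn(payload)
        elapsed = time.perf_counter() - start
        best_seconds = min(best_seconds, elapsed)
    megabytes = len(payload) / (1024 * 1024)
    # Guard against a zero reading from a very coarse timer.
    return megabytes / max(best_seconds, 1e-9)


def write_to_local_file(payload):
    """Hypothetical stand-in for a cloud upload: write to a temp file and fsync."""
    with tempfile.NamedTemporaryFile(delete=True) as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())


if __name__ == "__main__":
    data = os.urandom(4 * 1024 * 1024)  # 4 MB of random bytes
    rate = measure_write_throughput(write_to_local_file, data)
    print(f"local disk: {rate:.1f} MB/s")
```

Taking the best of several runs reduces the effect of one-off delays; reporting the median or the full distribution is an equally reasonable choice when network variability is itself of interest.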

 

Week 2: Hadoop, Elastic MapReduce (Hadoop vs. RDBMS)

Case study: Elastic MapReduce

Slides (pptx)

Reading:

http://www.cse.nd.edu/~dthain/courses/cse598z/spring2010/benchmarks-sigmod09.pdf

http://www.cs.princeton.edu/courses/archive/spring11/cos448/web/docs/week10_reading2.pdf

Assignment 2: Elastic MapReduce Sample

 

Week 3: Iterative Applications: HaLoop, Spark, Daytona

Guest Lecturer: Jaliya Ekanayake

Case study: Daytona

Reading: http://www.cs.washington.edu/homes/billhowe/pubs/HaLoop.pdf

Assignment: Run the k-means algorithm on SeaFlow data using Daytona.
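For orientation, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) on 2-D points, assuming the caller supplies the initial centroids. It does not use Daytona or the SeaFlow data; the function name and sample points are made up for illustration. The outer loop is the part that makes the algorithm "iterative": on a plain MapReduce platform, each pass is typically run as a separate job.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points; returns the final centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(cluster),
                                      sum(ys) / len(cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    sample = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(sample, [(0, 0), (10, 10)]))
```

The assignment step is a natural "map" (each point is processed independently) and the update step a natural "reduce" (points are aggregated per centroid), which is one way to see how the algorithm fits the platforms covered this week.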

 

Week 4: Relational Databases in the cloud: MS SQL Azure, RDS, database.com, Google Cloud SQL

Case study: SQL Azure (and SQLShare)

Slides (pptx)

Reading: Paper on SQL Azure

Reading: Erik Meijer's article on CoSQL vs. NoSQL

Assignment 4: Cloud Databases for Ad Hoc Analysis

 

Week 5: Low-Latency NoSQL: Google Bigtable / Apache HBase

Slides (pptx)

Case study: CouchDB

Reading: Google Bigtable

Assignment: No assignment this week! Catch up on Assignments 3 and 4.

 

Week 6: DynamoDB, integrating AWS data services, high performance cloud computing. (Guest lecture: Jaime Kinney, Amazon Web Services)

Slides (pptx)

Case study: DynamoDB

Assignment 6: High Performance Cloud Computing

 

Week 7: Google Tools

Slides (pptx)

Case study: Google BigQuery

Reading: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf

Assignment: Google BigQuery

 

Week 8: Engineering Issues, Project planning data

Slides (pptx)

Slides

Assignment: Designing for Big Data

Week 9: Transactional applications in the Cloud, Scalable NoSQL vs. Scalable SQL

Survey of scalable SQL and NoSQL systems

Slides (pptx)

"NewSQL" systems: MonetDB, Vertica, VoltDB, JustOne DB, NimbusDB

Case Study: MonetDB

Assignment: Project Development

 

Week 10: Mini-Project Presentations, Review

 

The instructor reserves the right to alter the syllabus if circumstances dictate.

Other Resources

Reading: http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf