Phone: (206) 616-5828
Office: Allen 450
Period: Mar 26 - Jun 4, 2012
Class Meeting Time: Mondays, 6:00-9:00 p.m.
Class Meeting Place: ST 359
Web: http://www.cs.washington.
Office Hours: by appointment
Moodle Site: http://moodle.extn.washington.edu/course/view.php?id=1731
Course Overview: We will explore the technology landscape at the intersection of “big data” and cloud computing. Articles in the NYTimes, Science, Nature, the Economist, a variety of reports from the federal government and innumerable blog posts have made the case that technology supporting the deep analysis of massive datasets – big data – is now a critical enabling technology for business and research in all fields. Cloud computing has been a catalyst for this trend, democratizing access to the infrastructure required to build big data applications. Scalable data platforms are increasingly deployed in the cloud or offered directly by cloud providers.
The course will be technology driven. We will consider relational databases (specifically in the context of cloud computing), the Hadoop ecosystem and its variants, other NoSQL platforms emphasizing low-latency access, and more. We will work directly with a selected set of these platforms, compare and contrast their relative strengths and weaknesses, and characterize the problems they are designed to solve.
Learning Objectives: By the end of this course, students will be able to:
Course Structure: Each class will consist of a 1-hour lecture, a 1-hour case study and demonstration of a specific system, and 1 hour of discussion and hands-on work. Each week, we will cover a category of scalable data platform in the lecture and examine a representative example from that category in detail through a demonstration. Students will be asked to complete a hands-on homework assignment based on the material presented in class and, in some cases, to come prepared to discuss assigned reading. The reading assignments will generally be research papers from relevant computer science conferences.
Student Assessment: Assignments: 80%, Participation: 20%. Each assignment will be due one week after it is assigned, by the start of class. Participation will be a combination of attendance and discussion involvement; in-class and online involvement will both contribute. Assignments will typically not be graded in terms of correct/incorrect answers, but students will be expected to demonstrate effort and insight. In this course, discussion between students is encouraged; the goal is to learn as much as possible in a short time, and discussion is a very efficient way to do this. Some assignments may be completed in groups, depending on students’ experience level. In these cases, a portion of the grade will be based on peer review by one’s group members.
Textbook: None. All materials will be on the web. We will use a combination of slides, documentation for the selected systems covered in class, some custom material, and some relevant research papers.
Prerequisite: The assignments will involve example-oriented programming in various languages, possibly including Java, C#, and Python. You will NOT be expected to be proficient in any of these languages, but you will be expected to “think computationally.” Specifically, you will work through examples, answer questions about the code, and generally be “brave” with respect to learning new technology – read the documentation, ask questions, try experiments, make some educated guesses, etc.
Week 1: Cloud computing review, big data overview, problem types, history and context, methodology
Case study: Amazon EC2, S3
Reading: None
Assignment 1: Cloud Storage Write Speed Shootout
1) Set up accounts
2) http://escience.washington.
3) http://www.windowsazure.com/
4) Cloud Storage Write Speed Shootout
Week 2: Hadoop, Elastic MapReduce (Hadoop vs. RDBMS)
Case study: Elastic MapReduce
Reading:
http://www.cse.nd.edu/~dthain/
http://www.cs.princeton.edu/
Assignment 2: Elastic MapReduce Sample
Week 3: Iterative Applications. HaLoop, Spark, Daytona
Guest Lecturer: Jaliya Ekanayake
Case study: Daytona
Reading: http://www.cs.washington.edu/
Assignment: Run the k-means algorithm on SeaFlow data using Daytona.
Week 4: Relational Databases in the cloud: MS SQL Azure, RDS, database.com, Google Cloud SQL
Case study: SQL Azure (and SQLShare)
Reading: Paper on SQL Azure
Reading: Erik Meijer's article on CoSQL vs. NoSQL
Assignment 4: Cloud Databases for Ad Hoc Analysis
Week 5: Low Latency NoSQL: Google Big Table / Apache HBase
Case study: CouchDB
Reading: Google Big Table
Assignment: No assignment this week! Catch up on assignments 3 and 4.
Week 6: DynamoDB, integrating AWS data services, high performance cloud computing. (Guest lecture: Jaime Kinney, Amazon Web Services)
Case study: DynamoDB
Assignment 6: High Performance Cloud Computing
Week 7: Google Tools
Case study: Google BigQuery
Reading: http://infolab.stanford.edu/~
Week 8: Engineering Issues, Project planning data
Assignment: Designing for Big Data
Week 9: Transactional applications in the Cloud, Scalable NoSQL vs. Scalable SQL
Survey of scalable SQL and NoSQL systems
"NewSQL" systems: MonetDB, Vertica, VoltDB, JustOne DB, NimbusDB
Case study: MonetDB
Assignment: Project Development
Week 10: Mini-Project Presentations, Review
The instructor reserves the right to alter the syllabus if circumstances dictate.
Reading:
http://db.csail.mit.edu/