Assignment 2: MapReduce, Elastic MapReduce, and Declarative Languages

Overview You will run several jobs on Elastic MapReduce and compare how performance changes as you vary job parameters.

Due Date Monday April 16, 6pm

What to turn in Responses to the items labeled QUESTION X. These are conceptual or quantitative questions; in some cases you may be asked to create a plot to visualize some data.

Part 0: Design MapReduce Algorithms

Based on what you learned in class, design a MapReduce algorithm for two of the following tasks. You can describe each algorithm in any combination of English and pseudocode. The idea is to describe the basic roles of the Map function and the Reduce function, not to fret over details. Be brief and convincing. (For the general shape of such an algorithm, see the word-count sketch after this list.)
  1. Inverted index. Given a set of documents, an inverted index is a dataset with key = word and value = list of document ids. Assume you are given a dataset where key = document id and value = document text.
  2. Relational join, i.e., SELECT * FROM Order, LineItem WHERE Order.order_id = LineItem.order_id. (Hint: Treat the two tables Order and LineItem as one big concatenated bag of records.)
  3. Social Network. Consider a simple social network dataset, where key = person and value = some friend of that person. Describe a MapReduce algorithm to count the number of friends each person has.
  4. Social Network (harder). Use the same dataset as the previous task. The relationship "friend" is often symmetric: if I am your friend, then you are my friend. Describe a MapReduce algorithm to check whether this property holds. Generate a list of all non-symmetric friend relationships.
  5. Bioinformatics. Consider a set of sequences where key = sequence id and value = a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA.... Describe an algorithm to trim the last 10 characters from each read, then remove any duplicates generated. (Hint: It's not all that different from the Social Network example.)
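To make the Map/Reduce contract concrete without solving any of the graded tasks, here is a minimal word-count sketch in Python. The simulate_mapreduce driver is a hypothetical stand-in for the framework's shuffle phase; on a real cluster, Hadoop performs the grouping between Map and Reduce for you.

    # Minimal word-count sketch of the MapReduce programming model.
    # The driver below only simulates the framework's shuffle/grouping
    # step; it is a teaching aid, not how Hadoop is actually invoked.
    from collections import defaultdict

    def map_fn(doc_id, text):
        # Input key: document id; input value: document text.
        # Output: one (word, 1) pair per word occurrence.
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Input key: a word; input value: all counts emitted for that word.
        # Output: (word, total number of occurrences).
        yield (word, sum(counts))

    def simulate_mapreduce(records, map_fn, reduce_fn):
        groups = defaultdict(list)
        for key, value in records:
            for out_key, out_value in map_fn(key, value):
                groups[out_key].append(out_value)  # the "shuffle": group by key
        for key, values in sorted(groups.items()):
            for result in reduce_fn(key, values):
                yield result

    docs = [("d1", "to be or not to be"), ("d2", "to see or not")]
    print(list(simulate_mapreduce(docs, map_fn, reduce_fn)))
    # [('be', 2), ('not', 2), ('or', 2), ('see', 1), ('to', 3)]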
QUESTION 0.0: Briefly describe a MapReduce algorithm for one of the tasks above. Use the form below.

QUESTION 0.1: Briefly describe a MapReduce algorithm for another one of the tasks above. Use the form below.
Map:
  Input Key:
  Input Value:
  Output Key:
  Output Value:
  Description:
Reduce:
  Input Key:
  Input Value:
  Output Key:
  Output Value:
  Description:

Part 1: Run a Job using Elastic MapReduce and Pig

Steps:
  1. Create a bucket to hold the results.
    1. Log in to the AWS console.
    2. Click the S3 tab.
    3. Click "Create Bucket" at the upper left.
    4. Follow the instructions.
    (Alternatively, and perhaps preferably, do this programmatically using Python and boto! The loaddata_s3.py script from assignment 1 includes the code to create a bucket; a minimal boto sketch also appears at the end of this part. If you take this route, let me know when you turn in the assignment.)
  2. Create and Execute the Job Flow. (If you run into trouble or want more detail, see the detailed instructions from Amazon.)
    1. In the AWS console, click the Elastic MapReduce tab.
    2. Click Create New Job Flow
    3. Enter a name for the job flow ("Assignment 2, test")
    4. Select "Run a sample application", then select "Apache Log Reports". Click Continue.
    5. Leave Script Location and Input Location with the default values. Modify Output Location to replace "<yourbucket>" with the name of the bucket you created in the first step. Click Continue.
    6. Leave Master Instance Group, Core Instance Group, and Task Instance Group with their default values. Click Continue.
    7. QUESTION 1: The Task Instance Group Instance Count defaults to 0. What kinds of applications would benefit from a non-zero value? Why do you think the default is 0?
    8. On the next page, check the box marked "Enable Debugging." Enter a path where the log files will be stored, for example, "<yourbucket>/log". Use the defaults for the other options. Click Continue.
    9. On the next page, leave all the defaults (Proceed with No Bootstrap Actions). Click Continue.
    10. Click Create Job Flow
  3. Inspect the Results
    1. Click Refresh in the Job Flows tab. The State will be STARTING. Once the job finishes, record the elapsed time.
    QUESTION 2: How long did your job take?
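If you took the programmatic route in step 1, the bucket can be created with a few lines of boto, in the style of loaddata_s3.py from assignment 1. This is a minimal sketch, assuming your AWS credentials are already configured for boto; the bucket name below is a placeholder and must be globally unique.

    # Minimal sketch: create an S3 bucket with boto (assumes AWS
    # credentials are configured, e.g., via ~/.boto or the environment).
    import boto

    conn = boto.connect_s3()
    bucket_name = "yourbucket-assignment2"  # placeholder: must be globally unique
    bucket = conn.create_bucket(bucket_name)
    print(bucket.name)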

Part 2: Repeat the experiment, scaling out

  1. Repeat the experiment, but increase the Core Instance Group Instance Count to 8 in step 6. Important: Change the output location to a unique name, or your job will fail! For example, append "/scaleout" to the end of the path. (For a programmatic version of both scaling experiments, see the boto sketch at the end of Part 3.)
  2. QUESTION 3: How long did the scale out job take?

Part 3: Repeat the experiment, scaling up

  1. Repeat the experiment, but increase Core Instance Group Instance Type to "m1.large". Important: Change the output location to a unique name, or your job will fail! For example, append "/scaleup" to the end of the path.
  2. QUESTION 4: How long did the scale up job take?
    QUESTION 5: Which method seems to improve performance? Why?
    QUESTION 6: This sample dataset is very small. How might the performance characteristics of scale out and scale up change if the dataset was much larger?
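For those who scripted Part 1 with boto, both scaling experiments can also be launched programmatically. The sketch below is a hypothetical skeleton, not the graded console procedure: all S3 paths are placeholders (copy the real Script Location and Input Location values from the EMR console), it assumes boto's EMR module and its PigStep helper, and num_instances counts the master node as well as the core nodes.

    # Hypothetical sketch: launch the sample job flow with boto's EMR API
    # and time it, varying one scaling knob at a time. All S3 paths are
    # placeholders; copy the real ones from the EMR console.
    import time
    import boto.emr
    from boto.emr.step import PigStep

    def run_and_time(name, num_instances, slave_instance_type, suffix):
        conn = boto.emr.connect_to_region("us-east-1")
        step = PigStep(
            name,
            pig_file="s3://<script-location>",  # placeholder: console's Script Location
            pig_args=["-p", "INPUT=s3://<input-location>",  # placeholder
                      "-p", "OUTPUT=s3://<yourbucket>/" + suffix],
        )
        jobflow_id = conn.run_jobflow(
            name=name,
            log_uri="s3://<yourbucket>/log",  # placeholder
            master_instance_type="m1.small",
            slave_instance_type=slave_instance_type,  # scale UP by changing this
            num_instances=num_instances,              # scale OUT by changing this
            steps=[step],
        )
        start = time.time()
        # Poll until the job flow reaches a terminal state.
        while conn.describe_jobflow(jobflow_id).state not in (
                "COMPLETED", "FAILED", "TERMINATED"):
            time.sleep(30)
        return time.time() - start

    # Remember: each run needs a unique output path, hence the suffix.
    print(run_and_time("baseline", 3, "m1.small", "baseline"))  # 1 master + 2 core
    print(run_and_time("scaleout", 9, "m1.small", "scaleout"))  # 1 master + 8 core
    print(run_and_time("scaleup", 3, "m1.large", "scaleup"))    # larger core nodes

Note that timing this way includes cluster provisioning, so the numbers may differ from the elapsed time shown in the console.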