Assignment 8: Big Data Design

 

You will describe a project of interest to you and design a cloud-based solution for it.

 

1) Describe the data.  The 3 Vs:

•      Variety: What data types are you working with?  Relations? Documents/Emails? Images/video? XML? Social networks / graphs? Geospatial?

•      Volume: How much data do you expect to have? Which sources are the heavy hitters?

•      Velocity: How is the data updated?  Never (static dataset)? Rarely (historical data warehouse)? Daily (log files, typical data warehouse)? Continuously (sensor feed, Twitter)? Real-time (financial apps)?

 

Sometimes the alliteration gets out of control, and you'll see other Vs: Veracity (how reliable is the data?), Variability (how often does the data structure change? Sometimes this instead refers to how precise the data is), Vulnerability (security issues), and Value.

 

Example: Sloan Digital Sky Survey

 

Variety: Images, object catalog (relational)

Volume: 20-100TB

Velocity: Yearly data releases, mostly static

 

 

2) Describe the workload.

•      Write down 5-10 typical questions you expect to be able to answer efficiently using these data.  (Plain English is fine; you do not need SQL or any other query language.)

•      Describe the users.  What skills or interfaces will you need to design toward?  Customers (polished GUIs with fast response times)? IT (Java, SQL, Python)? Analysts (SAS, MATLAB, Excel)? Management (dashboards, paper reports)?

 

Example: Sloan Digital Sky Survey

 

"Find all galaxies with unsaturated pixels within 1 arcsecond of a given point in the sky (right ascension and declination)"

"Find all elliptical galaxies with spectra that have an anomalous emission line"

"Provide a list of moving objects consistent with an asteroid"

"Find all objects within 1' of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m. (Magnitudes are logarithms, so these are ratios.)  This is a gravitational lens query"
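
To make the last example question concrete, here is a minimal Python sketch of the test it implies for a single pair of objects. It is an illustration only, not how the survey itself runs the query. Several things are assumptions: the record layout (dicts with "ra", "dec", and "mags" keys), the function names, the reading of "very similar colors" as "each color index differs between the two objects by less than 0.05 magnitudes", and the reading of the third color as r-i. Positions are in degrees, and 1' is taken as 60 arcseconds.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula).

    Inputs are in degrees; the result is in arcseconds.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def similar_colors(mags_a, mags_b, tol=0.05):
    """True if the colors u-g, g-r, r-i of both objects agree within tol.

    mags_a and mags_b are hypothetical dicts keyed by band name.
    """
    color_pairs = (("u", "g"), ("g", "r"), ("r", "i"))
    return all(
        abs((mags_a[x] - mags_a[y]) - (mags_b[x] - mags_b[y])) < tol
        for x, y in color_pairs
    )


def lens_candidate(obj_a, obj_b, max_sep_arcsec=60.0, tol=0.05):
    """Combine both conditions: close on the sky and similar in color."""
    sep = angular_separation_arcsec(
        obj_a["ra"], obj_a["dec"], obj_b["ra"], obj_b["dec"]
    )
    return sep <= max_sep_arcsec and similar_colors(
        obj_a["mags"], obj_b["mags"], tol
    )
```

Checking every pair this way is quadratic in the number of objects, which is one reason a question like this drives the choice of spatial index or partitioning scheme in your design.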

 

For more information, see ftp://ftp.research.microsoft.com/pub/tr/tr-99-30.pdf

 

3) Design a cloud-based solution to manage this data. 

(a) Draw a diagram of a proposed cloud-based solution.  Clearly label all cloud services used, and provide a one-line justification of each.  Make sure to indicate basic data flow – where the data comes from and where it is used. 

 

(b) Draw a SECOND diagram proposing an alternative solution, relying on a different set of services.  This one may be an on-site (non-cloud or private cloud) solution, or it may be an alternative cloud solution.  Try to have no more than one component in common with your first solution.

 

4) Estimate the cost of your two solutions. 

 

Use cost calculators discussed in class if appropriate.

 

Turn in just 1-2 sentences on each.  Mention a) the estimated cost, and b) which component of the system seems to drive the cost.  Storage?  Transmission? Compute?

 

Feel free to include or ignore development costs or level of effort estimates as you see fit.
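
As a rough illustration of the arithmetic behind a cost estimate, the Python sketch below multiplies monthly usage by unit prices and reports which component dominates. Everything in it is hypothetical: the class and function names are invented for this example, and the prices are placeholders rather than quotes from any provider. Substitute figures from the cost calculators discussed in class.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    storage_tb: float      # data kept in cloud storage
    egress_tb: float       # data transferred out to users
    compute_hours: float   # total instance-hours of compute


@dataclass
class UnitPrices:
    # Placeholder prices in dollars; NOT real vendor pricing.
    storage_per_tb_month: float
    egress_per_tb: float
    compute_per_hour: float


def monthly_cost(usage, prices):
    """Return a per-component cost breakdown in dollars per month."""
    return {
        "storage": usage.storage_tb * prices.storage_per_tb_month,
        "transmission": usage.egress_tb * prices.egress_per_tb,
        "compute": usage.compute_hours * prices.compute_per_hour,
    }


def cost_driver(breakdown):
    """Name of the component with the largest monthly cost."""
    return max(breakdown, key=breakdown.get)


if __name__ == "__main__":
    usage = MonthlyUsage(storage_tb=100, egress_tb=5, compute_hours=2000)
    prices = UnitPrices(storage_per_tb_month=20.0,
                        egress_per_tb=90.0,
                        compute_per_hour=0.50)
    breakdown = monthly_cost(usage, prices)
    print(breakdown, sum(breakdown.values()), cost_driver(breakdown))
```

Running the same function with a second set of usage numbers and prices gives a like-for-like comparison of your two solutions.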