Assignment 7: Google BigQuery

 

 

Step 1:  Sign up for BigQuery

 

https://developers.google.com/bigquery/docs/getting-started

 

You do need to enable billing, which means that a payment method needs to be established, but the first 100GB of processing each month is free, and we will remain well under that total.

 

We will also upload some data, but it will be very small – the cost will be approximately

 

Step 2: Run some queries

 

NCDC Global Summary of the Day Dataset

 

Each row represents one day from one station.  More information: https://developers.google.com/bigquery/docs/dataset-gsod

 

The table name is Òpublicdata:samples.gsodÓ

 

Use the query browser to answer the following questions.

 

Query 1: On how many days did some station detect a tornado?  (hint: there is a ÒtornadoÓ column, and you want to do some grouping by year, month, day)

 

Query 2: For station 846310, find the maximum mean_temp for the month of June across all years.

 

 

Step 3: Upload some data

 

The Google sample data page claims the following:

 

ÒThe data stored in this table is denormalized from the original data, so that each row contains complete information about the sample, rather than requiring a join to on station_number to get State, latitude, longitude, etc.Ó

 

This doesnÕt appear to be true.  The metadata for each station, including latitude and longitude, are not available.  So weÕre going to upload the data and compute the join.

 

3.1 Get the data:

http://www.cs.washington.edu/homes/billhowe/bigdatacloud/lecture7/station_data.csv



3.2 From the query browser tool, click the down arrow icon at the left above the list of datasets.

3.3 Choose "Create New Dataset" and give it a name.

3.4 Once the dataset is created, hover over it, and click the plus sign. The "Create Table" dialog opens.

3.5 Enter a name, and use the following string as the schema:
USAF:string,WBAN:string,STATIONNAME:string,CTRY:string,ST:string,CALL:string,LAT:string,LON:string,ELEV:string,BEGIN:string,END:string
3.6 Click OK. Wait a bit, then click Job History to check progress.

Step 4: Write the Join Query

 

Join publicdata:samples.gsod with your new table on station id to compute the maximum mean_temp for any station with a name containing ÒLYONÓ

 

 

 

 

 

Other resources:

 

Sample tables available: https://developers.google.com/bigquery/docs/sample-tables

 

You may find this tutorial relevant: https://developers.google.com/bigquery/docs/hello_bigquery_gui

 

 

You can use the bq command line tool to manipulate Google BigQuery

https://developers.google.com/bigquery/docs/cli_tool

 

ItÕs a python tool, so use your python environment from week 1. 

 

You should be able to use easy_install like this:

$ sudo easy_install http://google-bigquery-tools.googlecode.com/files/bigquery-2.0.4.tar.gz