Assignment 7: Google BigQuery
Step
1: Sign up for BigQuery
https://developers.google.com/bigquery/docs/getting-started
You do need to enable billing, which means that a payment method needs to be established, but the first 100GB of processing each month is free, and we will remain well under that total.
We will also upload some data, but it will be very small – the cost will be approximately
Step
2: Run some queries
NCDC Global Summary of the Day Dataset
Each row represents one day from one station. More information: https://developers.google.com/bigquery/docs/dataset-gsod
The table name is Òpublicdata:samples.gsodÓ
Use the query browser to answer the following questions.
Query 1: On how many days did some station detect a tornado? (hint: there is a ÒtornadoÓ column, and you want to do some grouping by year, month, day)
Query 2: For station 846310, find the maximum mean_temp for the month of June across all years.
Step
3: Upload some data
The Google sample data page claims the following:
ÒThe data stored in
this table is denormalized from the original data, so
that each row contains complete information about the sample, rather than
requiring a join to on station_number to
get State, latitude, longitude, etc.Ó
This doesnÕt appear to be true. The metadata for each station, including latitude and longitude, are not available. So weÕre going to upload the data and compute the join.
3.1 Get the data:
http://www.cs.washington.edu/homes/billhowe/bigdatacloud/lecture7/station_data.csv
USAF:string,WBAN:string,STATIONNAME:string,CTRY:string,ST:string,CALL:string,LAT:string,LON:string,ELEV:string,BEGIN:string,END:string3.6 Click OK. Wait a bit, then click Job History to check progress.
Step
4: Write the Join Query
Join publicdata:samples.gsod with your new table on station id to compute the maximum mean_temp for any station with a name containing ÒLYONÓ
Other
resources:
Sample tables available: https://developers.google.com/bigquery/docs/sample-tables
You may find this tutorial relevant: https://developers.google.com/bigquery/docs/hello_bigquery_gui
You can use the bq command line
tool to manipulate Google BigQuery
https://developers.google.com/bigquery/docs/cli_tool
ItÕs a python tool, so use your python environment from week
1.
You should be able to use easy_install
like this:
$ sudo easy_install
http://google-bigquery-tools.googlecode.com/files/bigquery-2.0.4.tar.gz