Spring, 2020
 

Tuesdays / Thursdays, 11:30-12:50pm, Zoom! (Originally MEB 242) 

Instructor:
Byron Boots
email: bboots@cs.washington.edu
office: Bill and Melinda Gates Center (CSE2) 210
office hours: Tuesdays 1:00-3:00

TAs:

  • Anqi Li
  • Mohak Bhardwaj
TA Office Hours: Mondays 10am-Noon, Thursdays 9:30-11:30am

Contact: cse599W-staff@cs.washington.edu. Please communicate with the instructor and TAs ONLY THROUGH THIS EMAIL (unless there is a reason for privacy).

Announcements and links: Zoom lectures will be posted via Canvas. Discussions and questions related to the material and assignments will take place on Piazza.

A growing number of state-of-the-art systems, including field robots, acrobatic aerial vehicles, walking robots, and computer programs for games (Chess, Hex, Go, StarCraft), rely on machine learning to make decisions. The machine learning problems in these domains represent a fundamental departure from traditional classification and regression problems. The learner must contend with: a) the effect of their own actions on the world; b) sequential decision making and credit assignment; and c) the tradeoff between exploration and exploitation. In the past ten years, the understanding of these problems has developed dramatically. One key to this advance has been a tight integration of learning methods with optimization techniques, and we will focus on this integration throughout the course.

This course is directed at graduate students who want to build adaptive software that interacts with the world. Although much of the material will be driven by robotics applications, anyone interested in applying learning to decision-making, or in complex adaptive systems more generally, is welcome.

Suggested readings will be posted in the schedule below.

Prerequisites
 

This is an advanced course, so familiarity with basic ideas from probability, machine learning, and decision making/control will be helpful. The course is project driven, so prototyping skills in languages such as C, C++, Python, and MATLAB will also be important. Creative thought and enthusiasm are required.

Schedule
03/31/20  Overview
    Reading: UW Student Conduct Code: Academic Misconduct
    HW: Sign Up for Piazza

04/02/20  MDPs, Value Iteration
    Reading: Notes on Markov Decision Problems; MDP Slides -- Dan Klein

04/07/20  Value Iteration (continued)
    Reading: Notes on Markov Decision Problems; How to Design Good Tetris Players; Probabilistic Robotics, Chapter 14
    HW: Think About Projects!

04/09/20  Q-Functions, Policy Iteration
    Reading: Notes on Policy Iteration; Policy Iteration Slides -- Dan Klein

04/14/20  The Linear Quadratic Regulator
    Reading: Notes on Linear Quadratic Regulators; LQR Slides -- Pieter Abbeel; RL for Helicopter Flight
    HW: HW1

04/16/20  Time Varying Systems, Affine Quadratic Regulation, Tracking with LQR
    Reading: Notes on Linear Quadratic Regulators; Sequential Compositions of Behaviors; Speeding Up Dynamic Programming; LQR Trees

04/21/20  Iterative LQR, Receding Horizon Control / Model Predictive Control
    Reading: Receding Horizon DDP; Differentiable MPC in Pytorch; An Online Learning Approach to Model Predictive Control

04/23/20  Inverse Optimal Control
    Reading: Notes on Imitation Learning; Maximum Entropy IOC

04/28/20  Fitted Q-Iteration
    Due: HW1
    Reading: Notes on Approximate Dynamic Programming; Learning to Drive a Real Car; Generalization in RL; Stable Function Approximation
    HW: HW2

04/30/20  Approximate Policy Iteration
    Due: Project Proposal
    Reading: Notes on Approximate Dynamic Programming; API Survey

05/05/20  TD Learning, Eligibility Traces
    Reading: Notes on TD, Q-Learning; Sutton & Barto: Ch. 6

05/07/20  SARSA, Q-Learning, Replay Buffers
    Reading: Notes on TD, Q-Learning; Deep Q-Learning

05/12/20  Brute Force Simulation-Based Policy Search: Cross Entropy, Nelder Mead
    Due: HW2
    Reading: Notes on Black Box Optimization; Nelder Mead -- Wikipedia; PEGASUS; CEM; Optimization Stories
    HW: HW3

05/14/20  Backpropagation
    Reading: Notes on Backpropagation; Deep Learning: Ch. 6; Blog Post on the Adjoint Method; Fluid Control

05/19/20  Policy Gradients, Actor Critic
    Reading: Notes on Policy Gradients; Sutton & Barto: Ch. 13; REINFORCE; Policy Gradient Methods -- Sutton et al.; Policy Gradient Slides -- Levine

05/21/20  Natural Policy Gradient
    Reading: Notes on Policy Gradients; Natural Policy Gradient; Covariant Policy Search; Natural Actor Critic; Trust Region Policy Optimization; Actor Critic Slides -- Levine

05/26/20  Online Learning, Imitation Learning, DAgger, AggreVaTeD
    Reading: Notes on Imitation Learning; DAgger; AggreVaTeD

05/28/20  Iterative Learning Control
    Due: HW3
    Reading: Notes on Iterative Learning Control; Using Inaccurate Models in RL; DAgger for SysID

06/02/20  Student Project Presentations

06/04/20  Student Project Presentations

06/10/20  Project Report due at midnight
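
The first few lectures (MDPs, value iteration, Q-functions) build on the Bellman backup for finite MDPs. As a concrete reference point, here is a minimal value-iteration sketch in Python; the two-state MDP, its transition probabilities, and its rewards are made up purely for illustration and are not taken from any course material.

    # Value iteration on a tiny, made-up MDP (illustration only).
    # P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
    import numpy as np

    P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
         1: {0: [(0, 1.0)],           1: [(1, 0.5), (0, 0.5)]}}
    R = {0: {0: 0.0, 1: 1.0},
         1: {0: 0.0, 1: 2.0}}
    gamma = 0.95                       # discount factor

    V = np.zeros(2)
    for _ in range(1000):
        # Bellman backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[sp] for sp, p in P[s][a]) for a in (0, 1))
            for s in (0, 1)
        ])
        if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the backup has converged
            V = V_new
            break
        V = V_new

    print("Converged values:", V)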

Grading
 

Final grades will be based on course projects (40%) and homework assignments (60%).

Typesetting your homework solutions in LaTeX is required.
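
If you have not typeset homework in LaTeX before, a minimal article-class skeleton along the following lines is enough to get started; the title, section headings, and the example equation are placeholders, not a course-provided template.

    \documentclass[11pt]{article}
    \usepackage{amsmath, amssymb, graphicx}

    \title{CSE 599 Homework 1}            % placeholder title
    \author{Your Name (collaborators: list them here)}
    \date{\today}

    \begin{document}
    \maketitle

    \section*{Problem 1}
    Example of displayed math (the value-iteration backup):
    \[
      V_{k+1}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Big].
    \]

    \end{document}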

Late homework policy: Assignments are due at the beginning of class on the day that they are due. You will be allowed 3 total late days without penalty for the entire semester. Please use these wisely, and plan ahead for conferences, travel, deadlines, etc. Once those days are used, you will be penalized according to the following policy:

  • Homework is worth full credit at the beginning of class on the due date.
  • It is worth half credit for the next 48 hours.
  • It is worth zero credit after that.

Collaboration on homework: I expect that each student will conduct themselves with integrity. You are researchers-in-training, and I expect that you understand proper attribution and the importance of intellectual honesty. Unless otherwise specified, homeworks will be done individually and each student must hand in their own assignment. It is acceptable, however, for students to collaborate in figuring out answers and helping each other understand the underlying concepts. When collaborating, the "whiteboard policy" is in effect: You may discuss assignments on a whiteboard, but, at the end of a discussion, the whiteboard must be erased, and you must not transcribe or take with you anything that has been written on the board during your discussion. You must be able to reproduce the results solely on your own after any such discussion. Finally, you must write the names of the students you collaborated with on each homework.

Audit policy: If you wish to audit the course, you must either:

  • Do two homework assignments, or
  • Do the course project.

Disclaimer: I reserve the right to modify any of these plans as need be during the course of the class; however, I won't do anything capriciously, anything I do change won't be too drastic, and you'll be informed as far in advance as possible.

Projects
 

The course project is an opportunity for you to deeply explore one (or several) of the techniques covered in class and apply them to a problem that is of interest to you. Since the projects require a substantial amount of work, you may form groups of up to three students. The research topic is up to you, as long as it makes use of adaptive control or RL methods.

Project proposals: Your proposal should be 2-3 pages, and it should introduce the problem you are trying to solve, the approach you will take, and also address the following questions:

  • What are some impacts of this research?
  • What is novel about the approach you are taking?
  • How do learning and/or probabilistic inference techniques play a key role?
  • What is your metric for success?
  • What are key technical issues you will have to confront? Are there any other big challenges?
  • What software or datasets will you use?
  • What is your timeline? Include specific targets for the progress report.

Note on current research: You may use your current research as a course project, as long as you explore a new aspect of the problem; you may not reuse previous results. Your proposal should clearly state which novel part you will be tackling in your course project.

Final presentations: You’ll present your findings to the class at the end of the semester. Presentations must follow these rules:

  • No more than 5 minutes! There will be a hard cutoff.
  • No more than 5 slides, excluding the title slide.
  • Every group member must speak.
  • You must send me a copy of the slides in advance (10am on Tuesday).
  • Don't "decorate" your slides with equations. If there is an equation, I expect you to explain every variable.
  • Don't read your slides / show lots of text. Slides should contain brief, salient points.

Final Report: The final report will consist of one deliverable:

  1. Written report: This is the detailed report of your approach and findings. You should re-state the problem you are solving and your approach, and summarize your results. The report should be no longer than a NeurIPS paper (8 pages including figures and tables), but a shorter and more concrete report is preferred.

Sample Projects: You should connect RL to your own research, if possible. If you don't want to do that or the connection is hard to make, here are some examples of projects that might be appropriate. These are just suggestions!

  • Train autonomous cars to navigate in CARLA
  • Learn how to race in OpenAI Gym with deep Q-learning (a minimal Q-learning sketch follows this list)
  • Imitation learning for aerial vehicle control
  • Train a StarCraft II agent to win a minigame
  • Train a reactive controller to avoid obstacles in FlightGoggles
  • Consider how to safely explore in RL
  • Control a MuSHR car with MPC
  • Show how curriculum learning can help with difficult games
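
As a starting point for Gym-based ideas like the Q-learning suggestion above, here is a minimal tabular Q-learning sketch. FrozenLake-v0, the hyperparameters, and the episode count are arbitrary placeholder choices, and the snippet assumes the reset/step interface that OpenAI Gym used in Spring 2020 (step returns four values).

    # Tabular Q-learning on a small discrete Gym environment (illustrative sketch only).
    import numpy as np
    import gym

    env = gym.make("FrozenLake-v0")        # any small discrete environment works
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, eps = 0.1, 0.99, 0.1     # arbitrary learning rate, discount, exploration

    for episode in range(5000):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # one-step Q-learning update toward the bootstrapped target
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    print("Greedy policy:", np.argmax(Q, axis=1))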

Example Environments:

Here are some environments that you can use for training an RL agent. You are by no means required to use any of these simulation environments. If you find other environments, feel free to share them on Piazza.

Acknowledgements
 

Assignments, lectures, and ideas on this syllabus are partially adapted from Drew Bagnell's course at Carnegie Mellon University. I would like to thank Drew for helpful discussions and access to his course materials.