
Last Commit
Jul. 25, 2017
Jan. 23, 2017


A course on reinforcement learning in the wild. Taught on-campus at HSE and Yandex SDA (in Russian) and maintained to be friendly to online students (in both English and Russian).


  • Optimize for the curious. For all the materials that aren’t covered in detail, there are links to more information and related materials (D. Silver/Sutton/blogs/whatever). Assignments will have bonus sections if you want to dig deeper.
  • Practicality first. Everything essential to solving reinforcement learning problems is worth mentioning. We won't shy away from covering tricks and heuristics. For every major idea there should be a lab that lets you “feel” it on a practical problem.
  • Git-course. Know a way to make the course better? Noticed a typo in a formula? Made the code more readable? Made a version for an alternative framework? You're awesome! Pull-request it!

Coordinates and useful links



  • week0 Welcome to the MDP
    • Lecture: RL problems around us. Markov decision process. Simple solutions through combinatorial optimization.
    • Seminar: FrozenLake with genetic algorithms
      • Homework description - ./week0/
      • HSE Homework deadline: 23.59 1.02.17
      • YSDA Homework deadline: 23.59 19.02.17
  • week1 Cross-entropy method and Monte Carlo algorithms
    • Lecture: Cross-entropy method in general and for RL. Extension to continuous state & action space. Limitations.
    • Seminar: Tabular CEM for Taxi-v0, deep CEM for box2d environments.
      • HSE homework deadline: 23.59 15.02.17
      • YSDA homework deadline: 23.59 26.02.17
  • week2 Temporal Difference

    • Lecture: Discounted reward MDP. Value iteration. Q-learning. Temporal difference vs Monte Carlo.
    • Seminar: Tabular Q-learning
      • Homework description - see ./week2/
      • HSE homework deadline: 23.59 15.02.17
      • YSDA homework deadline: 23.59 8.03.17
  • week3 Value-based algorithms

    • Lecture: SARSA. Off-policy vs on-policy algorithms. N-step algorithms. Eligibility traces.
    • Seminar: Q-learning vs SARSA vs expected value SARSA in the wild
    • Homework description
      • HSE homework deadline 23.59 22.02.17
  • week3.5 Deep learning recap

    • Lecture: deep learning, convolutional nets, batchnorm, dropout, data augmentation and all that stuff.
    • Seminar: Theano/Lasagne on MNIST, simple deep Q-learning with CartPole (a TF version contribution is welcome)
    • Homework - convnets on MNIST or simple deep Q-learning
      • HSE homework deadline 23.59 1.03.17
  • week4 Approximate reinforcement learning

    • Lecture: Infinite/continuous state space. Value function approximation. Convergence conditions. Multiple agents trick.
    • Seminar: Approximate Q-learning with experience replay. (CartPole, Acrobot, Doom)
    • Homework - convnets on MNIST or simple deep Q-learning
      • HSE homework deadline 23.59 8.03.17
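
The off-policy vs on-policy distinction from weeks 2 and 3 comes down to which value the update bootstraps from. Below is a minimal sketch in plain Python, not taken from the course notebooks: the function names, the dictionary-based Q-table, and the toy numbers are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])


def q_learning_target(q, reward, next_state, n_actions, gamma, done):
    """Off-policy target: bootstrap from the best action in the next state."""
    if done:
        return reward
    return reward + gamma * max(q[(next_state, a)] for a in range(n_actions))


def sarsa_target(q, reward, next_state, next_action, gamma, done):
    """On-policy target: bootstrap from the action actually taken next."""
    if done:
        return reward
    return reward + gamma * q[(next_state, next_action)]


def td_update(q, state, action, target, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])


# Toy Q-table (hypothetical values): in state 1, action 1 looks better.
q = defaultdict(float)
q[(1, 0)] = 1.0
q[(1, 1)] = 3.0

# One transition: s=0, a=0, r=1.0, s'=1; the agent then explores with a'=0.
target_q = q_learning_target(q, 1.0, 1, n_actions=2, gamma=0.9, done=False)
target_sarsa = sarsa_target(q, 1.0, 1, next_action=0, gamma=0.9, done=False)

td_update(q, 0, 0, target_q, alpha=0.5)
greedy_action = epsilon_greedy(q, 1, 2, epsilon=0.0, rng=random.Random(0))
```

With these numbers the Q-learning target uses the greedy next value (3.0) while the SARSA target uses the value of the exploratory action (1.0), so the two methods diverge exactly when the behaviour policy explores.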

Future lectures:

  • week5 Deep reinforcement learning (coming 6.03.2017)

    • Lecture: Deep Q-learning/SARSA/whatever. Heuristics & motivation behind them: experience replay, target networks, double/dueling/bootstrap DQN, etc.
    • Seminar: Double DQN, Dueling DQN, experience replay on Atari
  • week6 Policy gradient methods (coming 13.03.2017)

    • Lecture: Motivation for policy-based methods, policy gradient, log-derivative trick, REINFORCE/cross-entropy method, variance theorem (advantage), advantage actor-critic (incl. n-step advantage), off-policy actor-critic (off-PAC), natural gradients (briefly), continuous action space (teaser).
    • Seminar: A2C vs Q-learning for MountainCar/Doom, entropy regularization & tricks, simple demo with continuous action spaces

Somewhere here comes an RNN crash course.

  • week7 Partially observable MDPs (coming 20.03.2017)

    • Lecture: POMDP intro. Model-based solvers. RNN solvers. RNN tricks: attention, problems with normalization methods, pre-training.
    • Seminar: Deep Kung-Fu & Doom with recurrent A2C vs feedforward A2C
  • week i+1 Trust Region Policy Optimization.

    • Lecture: Trust region policy optimization in detail.
    • Seminar: Approximate TRPO vs approximate Q-learning for gym box2d envs (robotics-themed)
  • week i+1 RL in Large/Continuous action spaces.

    • Lecture: Continuous action space MDPs. Model-based approach (NAF). Actor-critic approach (DPG, SVG). Trust Region Policy Optimization. Large discrete action space problem. Action embedding.
    • Seminar: Classic Control and BipedalWalker with DDPG vs qNAF.
  • week i+1 Advanced exploration methods: intrinsic motivation

    • Lecture: Augmented rewards. Heuristics (UNREAL, density-based models), formal approach: information maximizing exploration. Model-based tricks (see also MCTS).
    • Seminar: VIME vs epsilon-greedy for Go9x9 (bonus 19x19)
  • week i+1 Advanced exploration methods: probabilistic approach.

    • Lecture: Improved exploration methods (quantile-based, etc.). Bayesian approach. Case study: Contextual bandits for RTB.
    • Seminar: Bandits
  • week i+1 Case studies I

    • Lecture: Reinforcement Learning as a general way to optimize non-differentiable loss. KL(p||q) vs KL(q||p). Case study: machine translation, speech synthesis, conversation models.
    • Seminar: Optimizing Levenshtein distance with seq2seq for g2p
  • week i+1 Hierarchical MDP

    • Lecture: MDP vs real world. Sparse and delayed rewards. When Q-learning fails. Hierarchical MDP. Hierarchy as temporal abstraction. MDP with symbolic reasoning.
    • Seminar: Hierarchical RL for Atari games with rare rewards (starting from pre-trained DQN)
  • week i+1 Case studies II

    • Lecture: Direct policy optimization: finance. Inverse Reinforcement Learning: personalized medical treatment, robotics.
    • Seminar: Portfolio optimization as POMDP.
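
For readers who have not seen the log-derivative trick listed under week6, here is a short sketch of the standard identity. The notation is an assumption and not taken from the lecture slides: \tau is a trajectory, R(\tau) its total reward, and \pi_\theta(\tau) the probability of that trajectory under the policy with parameters \theta.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
   = \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau
   && \text{since } \nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
\end{aligned}
```

The last line is an expectation over trajectories sampled from the current policy, so it can be estimated from rollouts without differentiating through the environment, which is what makes REINFORCE-style estimators possible.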

Course staff

Course materials and teaching by