DQN (Deep Q-Network) is one of the most famous deep reinforcement learning algorithms; it combined deep learning with reinforcement learning and made a strong impression when it was introduced. In this note, the basic idea of DQN is covered along with an implementation.

Q-Learning

Before talking about DQN, we shall discuss Q-learning first. What does Q-learning learn? Q-learning is a value-based method whose purpose is to learn a value function. To achieve this goal, we can adopt the $q$ value, which is the action-value function:

$$q(S_t,A_t) = \mathbb{E} [R_{t+1} + \gamma R_{t+2}+\gamma^2 R_{t+3} + \dots \mid S_t,A_t]$$

This value represents the expected return of taking action $A_t$ in state $S_t$. The most primitive way to store Q-values is a table, where each entry corresponds to a state-action pair. Here is an illustration:

|         | Action 1     | Action 2     | Action 3     |
|---------|--------------|--------------|--------------|
| State 1 | $q(s_1,a_1)$ | $q(s_1,a_2)$ | $q(s_1,a_3)$ |
| State 2 | $q(s_2,a_1)$ | $q(s_2,a_2)$ | $q(s_2,a_3)$ |

And the learning process is simple: visit different states and update $q$ with the received rewards.

```mermaid
graph TD;
    a[initialize Q]-->b[fetch initial state];
    b-->c[choose action with epsilon-greedy];
    c-->d[get reward & update q-value];
    d--again-->c;
```

Q-learning is actually trying to approximate $q^*$, the optimal $q$, and it updates the current $q$ with the following rule:

$$q(S,A)\leftarrow q(S,A)+\alpha[R+\gamma \max_{A'} q(S',A')-q(S,A)]$$

$$\alpha: \text{learning rate},\quad \gamma: \text{discount (decay) factor}$$
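As a concrete illustration of this update rule, here is a minimal tabular Q-learning sketch in Python/NumPy. The table sizes, hyperparameter values, and function names are illustrative assumptions, not taken from the note.

```python
import numpy as np

# Hypothetical sizes and hyperparameters; in practice they come from the task.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))  # initialize the Q table

def epsilon_greedy(state):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_update(s, a, r, s_next):
    """q(S,A) <- q(S,A) + alpha * [R + gamma * max_A' q(S',A') - q(S,A)]"""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```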

DQN

As you may find in the table illustration above, the table itself is limited. If the number of states is huge, as with the pixel observations of even a simple Atari game, the table becomes correspondingly huge. As the dimensionality of the state, and with it the number of possible states, grows, a table is far from capable of holding all the $q$ values. This is the so-called curse of dimensionality.

Fortunately, we can use a neural network, which can approximate functions and produce values from high-dimensional inputs, as our $q$ function. The only thing we need to deal with is the update rule.

The target of $q$ is what we want $q$ to become: $R_{t+1} + \gamma \max_{A'} q_w(S',A')$, where $w$ denotes the parameters of the neural network. The current value is $q_w(S,A)$. We need to pull these two as close together as possible, so we construct a loss function:

$$L(w) = \mathbb{E}\left[\left(q_w(S,A) - (R_{t+1} + \gamma \max_{A'} q_w(S',A'))\right)^2\right]$$
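As a sketch of what "using a neural network as the $q$ function" can look like, here is a small PyTorch MLP and the corresponding loss. The architecture, layer sizes, and the assumption of a flat state vector are illustrative choices; a convolutional network would be used for raw Atari frames.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def naive_dqn_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Squared error between q_w(S,A) and R + gamma * max_A' q_w(S',A'),
    using the same network for both terms (no target network yet)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # q_w(S,A)
    target = r + gamma * q_net(s_next).max(dim=1).values        # R + gamma * max_A' q_w(S',A')
    # detach() so gradients only flow through q_w(S,A), not through the target
    return ((q_sa - target.detach()) ** 2).mean()
```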

Replay Buffer

Replay buffer is a technique used in the original paper, Playing Atari with Deep Reinforcement Learning. The idea of using a replay buffer is to 1) break the correlation between consecutive samples, so that the training data better satisfies the i.i.d. assumption of supervised learning, and 2) improve the efficiency of learning by re-using experiences.

The implementation maintains a buffer, which stores tuples $(S_t,A_t,R_{t+1},S_{t+1})$.
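A minimal replay buffer sketch built on Python's `collections.deque`; the capacity, the uniform-sampling API, and the extra `done` flag (needed in practice to stop bootstrapping at episode ends) are assumptions beyond what the note states.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (S_t, A_t, R_{t+1}, S_{t+1}, done) tuples and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```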

Target Network

As we have mentioned, the goal of updating the network is to bring the target and the current value close together. However, the same $q$ is used to compute both the target and the current value, which can easily lead to unstable training because the target shifts with every parameter update. To avoid this problem, we can use two networks and keep one fixed during training, namely $q_w(S,A)$ and $q_{w'}(S,A)$.

We first modify the loss as follows for a single $S$-$A$ pair: $$L(w) = \left(q_w(S,A) - (R_{t+1} + \gamma \max_{A'} q_{w'}(S',A'))\right)^2$$

The first network, $q_w(S,A)$, is updated in the usual manner with gradient descent. The second network, with parameters $w'$, is the fixed one: it is only updated every fixed number of steps, at which point its parameters are synchronized with $w$.
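A minimal PyTorch sketch of this loss and of the periodic synchronization, reusing the `QNetwork` sketch above; the `done` masking and the use of `torch.no_grad()` are practical additions that the note does not spell out.

```python
import copy
import torch

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """(q_w(S,A) - (R + gamma * max_A' q_w'(S',A')))^2, averaged over the batch.
    `done` zeroes out the bootstrap term at episode ends (an addition beyond the note)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # q_w(S,A)
    with torch.no_grad():                                    # the target network stays fixed
        max_next = target_net(s_next).max(dim=1).values      # max_A' q_w'(S',A')
        target = r + gamma * (1.0 - done) * max_next
    return ((q_sa - target) ** 2).mean()

def make_target(q_net):
    """Create the target network w' as a copy of the online network w."""
    return copy.deepcopy(q_net)

def sync_target(q_net, target_net):
    """w' <- w: synchronize the target network every fixed number of steps."""
    target_net.load_state_dict(q_net.state_dict())
```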

DQN Algorithm

Now we are able to construct DQN:

```mermaid
graph TD;
    a[get initial state]-->b[select action with epsilon-greedy];
    b-->c[take action and record in buffer];
    c-->d{enough data?};
    d--No-->b;
    d--Yes-->e[sample N transitions from buffer];
    e-->f[compute loss & update NN1];
    f--periodically-->g[update NN2];
```
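A training-loop sketch that puts the pieces above together (the `QNetwork`, `ReplayBuffer`, `dqn_loss`, and `sync_target` sketches). The older Gym-style environment interface, where `env.reset()` returns an observation and `env.step()` returns `(obs, reward, done, info)`, and all hyperparameter values are assumptions.

```python
import numpy as np
import torch

def train_dqn(env, q_net, target_net, buffer, episodes=500, batch_size=64,
              gamma=0.99, lr=1e-3, epsilon=0.1, sync_interval=1000):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False                  # get initial state
        while not done:
            # select action with epsilon-greedy
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1))
            # take action and record the transition in the buffer
            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            step += 1
            # enough data? sample N transitions and update NN1
            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                loss = dqn_loss(
                    q_net, target_net,
                    torch.as_tensor(np.array(s), dtype=torch.float32),
                    torch.as_tensor(a, dtype=torch.int64),
                    torch.as_tensor(r, dtype=torch.float32),
                    torch.as_tensor(np.array(s2), dtype=torch.float32),
                    torch.as_tensor(d, dtype=torch.float32),
                    gamma,
                )
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # periodically update NN2 (the target network)
            if step % sync_interval == 0:
                sync_target(q_net, target_net)
```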