Next: MATLAB package description Up: MLB_Exercises_2012 Previous: RL game [3* P]

Policy Gradient Methods: Swimmer [4 P]

**Figure 3:** 3 link wwimmer task: The simulated snake-like robot should swim fast and energy efficient.

In this task you have to learn optimal policies for the swimmer (see Figure 3) using different policy gradient methods. You have to compare two algorithms to compute the gradient $\nabla_{\theta} J(\theta_h)$ , namely Finite Differences and Likelihood Ratio. The robot is a 3-link (2-joints) snake-like robot swimming in the water. It has two actuators, you have to learn how to use these actuators to swim as fast as possible in a given direction. The model, the policy and the reward function are already given in the provided matlab package swimmer.zip⁵. The used policy is a stochastic Gaussian policy implemented by a Dynamic Movement Primitive (DMP). The DMP uses 6 centers per joint. As we deal with a periodic movement the phase variable of the DMP is also periodic. The policy itself is a stochastic policy which adds noise to the velocity variable of the DMP, i.e.

$\displaystyle \pi(\dot{\mathbf{y}}\vert \mathbf{z}, x; \mathbf{b}) = \mathcal{N}(\dot{\mathbf{y}}\vert \Phi(x) \mathbf{b} + \mathbf{z}, \sigma^2 \mathbf{I}),$

where $\mathbf{b}$ is the parameter vector (also denoted as $\theta$ in the further description) of the policy which we have to learn. The DMP itself is already implemented so you do not have to deal with that, the MATLAB package provides you with all information you need to calculate the gradients. The reward function (already implemented) is given by $r_t = 10^{-2} v_x - 10^{-6} \mathbf{u}^2$ , where

is the velocity in x-direction and $\mathbf{u}$ is the used torque.

Subsections

Next: MATLAB package description Up: MLB_Exercises_2012 Previous: RL game [3* P]

Haeusler Stefan 2013-01-16