
In this task you have to learn optimal policies for the swimmer (see Figure 3) using different policy gradient methods.
You have to compare three algorithms to compute the gradient , namely Finite Differences, Likelihood Ratio and the Natural Policy Gradient. The robot is a 3link (2joints) snakelike robot swimming in the water. He has two actuators, you have to learn how to use these actuators to swim as fast as possible in a given direction. The model, the policy and the reward function are already given in the provided matlab package swimmer.zip^{7}. The used policy is a stochastic Gaussian policy implemented by a Dynamic Movement Primitive (DMP). The DMP uses 6 centers per joint. As we deal with a periodic movement the phase variable of the DMP is also periodic. The policy itself is a stochastic policy which adds noise to the velocity variable of the DMP, i.e.
where is the parameter vector (also denoted as in the further description) of the policy which we have to learn. The DMP itself is already implemented so you do not have to deal with that, the matlab package provides you with all information you need to calculate the gradients. The reward function (already implemented) is given by , where is the velocity in xdirection and is the used torque.