
Reward Weighted Regression: Cannon Warfare [3 P]

Figure 4: Extremely powerful cannon.

Learn how to shoot a cannon at different target locations and under changing wind conditions with reward weighted regression.

Use the Matlab function shootcannon.m provided in cannon.zip to simulate a cannon shot. The function takes the initial angle and velocity of the cannonball as parameters. In addition, you have to provide the current wind strength $ w_S$ . The function returns the impact position (1D) $ x_I$ and the duration $ T$ of the flight of the cannonball. Now we want to use reward weighted regression to learn to shoot at targets at different distances under different wind conditions. Hence, we want to learn a policy $ \pi_{\beta}(\alpha, v \vert x_T, w_S)$ which chooses the optimal initial angle and velocity of the cannonball given the target position $ x_T$ and the wind strength $ w_S$ .
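For example, a single shot can be simulated as in the following minimal sketch; the calling convention is assumed from the description above, so check shootcannon.m itself:

% Simulate one shot (argument and output order assumed from the text above)
alpha = pi/4;                            % initial angle
v     = 5;                               % initial velocity
w_S   = 0.3;                             % current wind strength
[x_I, T] = shootcannon(alpha, v, w_S);   % impact position and flight duration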

Valid target positions lie in the range $ [1;9]$ , and the wind strength can be located in the interval $ [0;1]$ . The initial shooting angle has to lie in the interval $ [0; \pi/2]$ , and the initial shooting velocity in the range $ [1;10]$ .

Use a $ 10 \times 10$ normalized RBF network as the linear feature representation $ \Phi(x_T, w_S)$ of your policy. Use $ r = \exp(- 20 (x_I - x_T)^2 - 2 T)$ as the reward function, where $ x_I$ is the impact position, $ x_T$ is the target position and $ T$ is the duration of the flight (we punish longer flights because we want to destroy the target as fast as possible).
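A normalized RBF feature vector could, for example, be computed as in the following sketch; placing the centres on a regular grid over the task space and the chosen bandwidths are assumptions:

% rbf_features.m - normalized RBF features on a 10x10 grid of centres
% (centre grid and bandwidths sx, sw are assumptions)
function phi = rbf_features(x_T, w_S)
  [cx, cw] = meshgrid(linspace(1, 9, 10), linspace(0, 1, 10));
  cx = cx(:); cw = cw(:);                   % 100 centre coordinates
  sx = 0.5; sw = 0.06;                      % bandwidths (assumed)
  a   = exp(-(x_T - cx).^2 / (2*sx^2) - (w_S - cw).^2 / (2*sw^2));
  phi = a / sum(a);                         % normalization: features sum to 1
end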

Use a Gaussian policy $ \pi_\beta(\alpha, v \vert x_T, w_S) = \mathcal{N}([\alpha, v]\vert \Phi(x_T, w_S) \beta, \Sigma)$ with constant exploration $ \Sigma = \begin{bmatrix} 0.15 & 0 \\ 0 & 6.25 \end{bmatrix}$ . In order to learn how to shoot the cannon, use the following procedure:
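As an illustration only (not necessarily the prescribed procedure), one reward weighted regression training run could look like the sketch below, using the rbf_features helper from above; the number of rollouts per iteration, the number of iterations, the initialisation of $ \beta$ and the clipping of sampled actions are assumptions:

% Sketch of reward weighted regression training (batch size, iteration count,
% initialisation and action clipping are assumptions)
Sigma = [0.15 0; 0 6.25];                 % fixed exploration covariance
R     = chol(Sigma);                      % for sampling: a = mu + randn(1,2)*R
beta  = repmat([pi/4, 5.5], 100, 1);      % start at mid-range actions (assumed)
N     = 500;                              % rollouts per iteration (assumed)
for it = 1:100                            % number of iterations (assumed)
    Phi = zeros(N, 100); A = zeros(N, 2); r = zeros(N, 1);
    for n = 1:N
        x_T = 1 + 8*rand(); w_S = rand();            % sample a task
        phi = rbf_features(x_T, w_S);
        mu  = phi' * beta;                           % policy mean [alpha, v]
        a   = mu + randn(1, 2) * R;                  % exploratory action
        a(1) = min(max(a(1), 0), pi/2);              % clip angle (assumed)
        a(2) = min(max(a(2), 1), 10);                % clip velocity (assumed)
        [x_I, T] = shootcannon(a(1), a(2), w_S);     % simulate the shot
        Phi(n, :) = phi'; A(n, :) = a;
        r(n) = exp(-20*(x_I - x_T)^2 - 2*T);         % reward
    end
    W    = diag(r);                                   % reward weights
    beta = (Phi'*W*Phi + 1e-6*eye(100)) \ (Phi'*W*A); % weighted least squares
end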

Report your training performance. Also investigate the resulting policy and its error as a function of the target position and the wind strength. Are you able to improve the performance by using a different feature representation?
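For instance, the error of the mean policy could be inspected on a grid of tasks as in the following sketch; the grid resolution and the use of the absolute impact error are assumptions:

% Sketch: impact error of the mean policy over a grid of tasks
[xg, wg] = meshgrid(linspace(1, 9, 50), linspace(0, 1, 50));
err = zeros(size(xg));
for i = 1:numel(xg)
    phi = rbf_features(xg(i), wg(i));
    mu  = phi' * beta;                           % deterministic (mean) action
    [x_I, ~] = shootcannon(mu(1), mu(2), wg(i));
    err(i) = abs(x_I - xg(i));                   % distance to the target
end
surf(xg, wg, err); xlabel('target position x_T'); ylabel('wind strength w_S'); zlabel('error');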


Haeusler Stefan 2011-01-25