
Reward Weighted Regression: Cannon Warfare [3 P]

Figure 4: Extremely powerful cannon.

Learn how to shoot a cannon at different target locations and under changing wind conditions with reward weighted regression.

Use the Matlab function shootcannon.m provided in cannon.zip to simulate a cannon shot. The function takes the initial angle and velocity of the cannonball as parameters. In addition, you have to provide the current wind strength $ w_S$ . The function returns the impact position (1D) $ x_I$ and the duration $ T$ of the flight of the cannonball. Now we want to use reward weighted regression to learn to shoot at targets at different distances under different wind conditions. Hence, we want to learn a policy $ \pi_{\beta}(\alpha, v \vert x_T, w_S)$ which chooses the optimal initial angle and velocity of the cannonball given the target position $ x_T$ and the wind strength $ w_S$ .
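For example, a single shot can be simulated as in the following minimal sketch; the calling convention is assumed from the description above, so check shootcannon.m itself:

% Simulate one shot (argument and output order assumed from the text above)
alpha = pi/4;                            % initial angle
v     = 5;                               % initial velocity
w_S   = 0.3;                             % current wind strength
[x_I, T] = shootcannon(alpha, v, w_S);   % impact position and flight duration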

Valid target positions lie in the range $ [1;9]$ , and the wind strength can be located in the interval $ [0;1]$ . The initial shooting angle has to lie in the interval $ [0; \pi/2]$ , and the initial shooting velocity in the range $ [1;10]$ .

Use a $ 10 \times 10$ normalized RBF network as the linear feature representation $ \Phi(x_T, w_S)$ of your policy. Use $ r = \exp(- 20 (x_I - x_T)^2 - 2 T)$ as the reward function, where $ x_I$ is the impact position, $ x_T$ is the target position and $ T$ is the duration of the flight (we punish longer flights because we want to destroy the target as fast as possible).
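A normalized RBF feature vector could, for example, be computed as in the following sketch; placing the centres on a regular grid over the task space and the chosen bandwidths are assumptions:

% rbf_features.m - normalized RBF features on a 10x10 grid of centres
% (centre grid and bandwidths sx, sw are assumptions)
function phi = rbf_features(x_T, w_S)
  [cx, cw] = meshgrid(linspace(1, 9, 10), linspace(0, 1, 10));
  cx = cx(:); cw = cw(:);                   % 100 centre coordinates
  sx = 0.5; sw = 0.06;                      % bandwidths (assumed)
  a   = exp(-(x_T - cx).^2 / (2*sx^2) - (w_S - cw).^2 / (2*sw^2));
  phi = a / sum(a);                         % normalization: features sum to 1
end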

Use a Gaussian policy $ \pi_\beta(\alpha, v \vert x_T, w_S) = \mathcal{N}([\alpha, v]\vert \Phi(x_T, w_S) \beta, \Sigma)$ with constant exploration $ \Sigma = \begin{bmatrix} 0.15 & 0 \\ 0 & 6.25 \end{bmatrix}$ . In order to learn how to shoot the cannon, use the following procedure:
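As an illustration only (not necessarily the prescribed procedure), one reward weighted regression training run could look like the sketch below, using the rbf_features helper from above; the number of rollouts per iteration, the number of iterations, the initialisation of $ \beta$ and the clipping of sampled actions are assumptions:

% Sketch of reward weighted regression training (batch size, iteration count,
% initialisation and action clipping are assumptions)
Sigma = [0.15 0; 0 6.25];                 % fixed exploration covariance
R     = chol(Sigma);                      % for sampling: a = mu + randn(1,2)*R
beta  = repmat([pi/4, 5.5], 100, 1);      % start at mid-range actions (assumed)
N     = 500;                              % rollouts per iteration (assumed)
for it = 1:100                            % number of iterations (assumed)
    Phi = zeros(N, 100); A = zeros(N, 2); r = zeros(N, 1);
    for n = 1:N
        x_T = 1 + 8*rand(); w_S = rand();            % sample a task
        phi = rbf_features(x_T, w_S);
        mu  = phi' * beta;                           % policy mean [alpha, v]
        a   = mu + randn(1, 2) * R;                  % exploratory action
        a(1) = min(max(a(1), 0), pi/2);              % clip angle (assumed)
        a(2) = min(max(a(2), 1), 10);                % clip velocity (assumed)
        [x_I, T] = shootcannon(a(1), a(2), w_S);     % simulate the shot
        Phi(n, :) = phi'; A(n, :) = a;
        r(n) = exp(-20*(x_I - x_T)^2 - 2*T);         % reward
    end
    W    = diag(r);                                   % reward weights
    beta = (Phi'*W*Phi + 1e-6*eye(100)) \ (Phi'*W*A); % weighted least squares
end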

Report your training performance. Also investigate the resulting policy and its error as a function of the target position and the wind strength. Are you able to improve the performance by using a different feature representation?
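For instance, the error of the mean policy could be inspected on a grid of tasks as in the following sketch; the grid resolution and the use of the absolute impact error are assumptions:

% Sketch: impact error of the mean policy over a grid of tasks
[xg, wg] = meshgrid(linspace(1, 9, 50), linspace(0, 1, 50));
err = zeros(size(xg));
for i = 1:numel(xg)
    phi = rbf_features(xg(i), wg(i));
    mu  = phi' * beta;                           % deterministic (mean) action
    [x_I, ~] = shootcannon(mu(1), mu(2), wg(i));
    err(i) = abs(x_I - xg(i));                   % distance to the target
end
surf(xg, wg, err); xlabel('target position x_T'); ylabel('wind strength w_S'); zlabel('error');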


Haeusler Stefan 2011-01-25