Next: Bayesian networks [1+1* P] Up: MLB_Exercises_2010 Previous: Policy Gradient Methods

# Reward Weighted Regression: Cannon Warfare [3 P]

Learn how to shoot a cannon at different target locations and under changing wind conditions with reward-weighted regression.

Use the MATLAB function shootcannon.m provided in cannon.zip to simulate a cannon shot. The function takes the initial angle and the initial velocity of the cannonball as parameters; in addition, you have to provide the current wind strength. The function returns the impact position (1D) and the flight duration of the cannonball. We now want to use reward-weighted regression to learn to shoot at targets at different distances under different wind conditions. Hence, we want to learn a policy that chooses the optimal initial angle and velocity of the cannonball given the target position and the wind strength.
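The exact dynamics implemented in shootcannon.m are not reproduced here; as a stand-in, the following sketch assumes a point mass launched from the origin under gravity, with wind modeled as a constant horizontal acceleration (the function name, wind model, and constants are assumptions, not the exercise's actual code):

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def shoot_cannon(angle, velocity, wind):
    """Hypothetical Python stand-in for shootcannon.m.

    angle    -- initial launch angle in radians
    velocity -- initial speed of the cannonball
    wind     -- signed horizontal wind acceleration

    Returns (impact_position, flight_duration).
    """
    vx = velocity * math.cos(angle)
    vy = velocity * math.sin(angle)
    t = 2.0 * vy / G                    # time until the ball returns to y = 0
    x = vx * t + 0.5 * wind * t ** 2    # horizontal distance incl. wind drift
    return x, t
```

With this simplified model, the wind does not change the flight duration, only the impact position; the real simulator may well behave differently.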

Valid target positions lie in a given range, and the wind strength in a given interval. The initial shooting angle and the initial shooting velocity are likewise each restricted to a given interval.

Use a normalized RBF network as the linear feature representation of your policy. Use a reward function that decreases with the distance between the impact position and the target position, and with the flight duration (we punish longer flights because we want to destroy the target as fast as possible).
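The feature map and reward could be sketched as follows; the RBF centres, the bandwidth, the exponential reward shape, and the time-penalty coefficient are all placeholder choices, not the values from the exercise:

```python
import numpy as np

def rbf_features(state, centers, width):
    """Normalized Gaussian RBF activations for one 2-D state
    (target position, wind strength).  The normalization makes
    the features sum to one."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    phi = np.exp(-d2 / (2.0 * width ** 2))
    return phi / np.sum(phi)

def reward(impact, target, duration, c_time=0.1):
    """Placeholder reward: maximal (1.0) when the ball hits the target
    instantly, decaying with squared miss distance and flight time."""
    return np.exp(-(impact - target) ** 2 - c_time * duration)
```

A bounded positive reward of this form is convenient for reward-weighted regression, since the rewards are used directly as regression weights.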

Use a Gaussian policy with constant exploration variance. In order to learn how to shoot the cannon, use the following procedure:

• Generate 10 cannon shots with random target positions and wind strengths (uniformly distributed) using the current policy. Add this experience to your training data.
• Re-estimate the policy parameters using reward-weighted regression (on the whole training data).
• Evaluate your policy. In order to do so, calculate the mean reward over 30 randomly picked initial states. Note that you should always use the same set of 30 states for each evaluation. In addition, do not use any exploration during evaluation (just use the means of the Gaussian policy), and do not add this experience to your training data.
• Repeat the whole procedure at least 500 times (to end up with 5000 training examples).

Report your training performance, and investigate the resulting policy and its error as a function of the target position and the wind strength. Are you able to improve the performance by using a different feature representation?

Haeusler Stefan 2011-01-25