PROBLEM SET Handed Out: 4/23/2007 Due:5/15/2007 Temporal Difference Learning Figure 1 (see description below) shows the state space for a Markov Decision Process in part (a). The state transition probabilities defined with respect to some policy, PI, are shown in part (b) of the figure. So, for example, if the agent is in state q(i) then action PI(i) is taken with probability 0.8 whereas alternative actions are taken with probability 0.2. Use the temporal difference learning algorithm to determine the utility function, U(i), as follows. First, choose some policy PI(i). The obvious one will suffice. Based on this policy and the indicated state transition probabilities, generate 100 trials or training sequences using a Monte Carlo technique. Next, define a reward function, R(i), by assigning the values indicated in part(a) to the final states and the constant -0.04 to all other states. Then run your program for the Temporal Difference Algorithm on your training data until some reasonable convergence criterion is satisfied. Finally, compare your values of the U(i) to those given in figure 2 (again, see description below) and explain any differences. Figure 1: A simple 4x3 environment that presents the agent with a sequential decision problem. On the right is a illustration of the transition model of the environment: the "intended" outcome occurs with probability 0,8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. The two terminal states have reward +1 and -1 respectively, and all other states have a reward of -0.04 . Figure 2: The utilities of the states in the 4x3 world, calculates with gamma=1 and R(s)=-0.04 for nonterminal states.