Return to Index
Operations Research Models and Methods
Computation Section
Subunit Dynamic Programming Examples
 - Doors Solution

The MDP model uses the DP Solver to find solutions. The figure below shows the final solution. This is the steady-state solution obtained with the policy-interation method. Columns K and L hold the optimum policies for each state. Column O holds the total prizes associated with starting in each state. After state (0,0) and excepting state (3,3), the values are all slightly less than 1 because of the small penalties for using the blocking actions.


The optimum solution blocks all transitions into state (3,3).



Fixing the solution at the optimum, we performed 20 value iterations with the walker starting at (3,0). As the steps progress, more and more probability gets harvested at the final state (0,0).


After 20 steps almost 80% of the probability is at the final state. Eventually the final state will have almost all the probability. Not considering the final state, the system has the characteristic of a transient MDP.

  A Markov Chain analysis of the optimum solution shows that the expected number of steps for the system to reach the final state from state (3,0) is 13.25. The path to the final state passes through the exit state (0,0)


Discounting the Prize


The example to this point has the goal of minimizing the probability that the walker reaches the cell (3,3). The solution accomplishes that very well by using four doors to obtain a zero probability. With a nonzero discount factor, the prize loses value at each step that does not reach the state (0,0). When the discount rate is greater than 0, the tendency would be place doors in a manner that hurries the collection of the prize.

We modify the example to introduce a discount rate of 1% per step. All the move penalties are set to zero so as not to confuse the effects of the discount factor with the penalty. After 20 value iterations, the solution of the revised problem is below.


The solution has a blocking action for every state except the two exit states. The probabilities shown in column P are the state probabilities after 20 steps starting from state (3,0). Over 93% of the probability has reach the final state compared to 78% with no discounting.


Subsequent evaluation of the first passage times reveals that the first passage times to the final state are reduced for all starting states. In particular the expected first passage time for the state (3,0) is reduced to 8 from its value of 13.25 without discounting.



  This page has demonstrated a MDP model for the random walk. Inspired by the work of Dimitrov and Morton, we have constructed the blocking action for the MDP. Answers comparable to those of the paper were obtained using our Excel add-ins.
Return to Top

tree roots

Operations Research Models and Methods
by Paul A. Jensen
Copyright 2004 - All rights reserved