CSE150B Final Exam

1 Classical Search

Question 1 (7 Points). The undirected graph in Figure 1-(Left) has four nodes (A, B, C, D). The number on each edge indicates the cost of going through the edge – e.g., going from A to B has a cost of 5, same as going from B to A. Set node A as the start node and D as the goal node. Answer the following questions and explain your reasoning.

1. In the correct Dijkstra algorithm, if we visit a node that is already in the frontier, we need to compare the cost of this node along the current path with its cost in the frontier (the last two lines in the pseudocode on Slide 36 of this chapter, 1 Classical Search.pdf). Now, on this given graph, if we do not do that comparison and replacement (i.e., remove the last line in the pseudocode), what is the path from A to D that this wrong modification of the Dijkstra algorithm will return? Explain the steps of how you obtained it.

2. What is the path that will be returned by the correct Dijkstra algorithm instead?

3. In your PA1, we agreed that it was ok not to do this frontier check and replacement, and still always get the optimal path from the wrong modification of the Dijkstra algorithm. Explain why this was ok for the graphs in PA1 and how they differ from the graph in Figure 1. In your explanation, you can use the fact that the Dijkstra algorithm always pops nodes in order of non-decreasing path cost.

4. Define the following heuristic function with heuristic values h(A) = 3, h(B) = 1, h(C) = 7, h(D) = 0. Is h a consistent heuristic function, and why?

5. What is the path that the A* algorithm will return from A to D under this heuristic h?
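For reference, below is a minimal Python sketch of the two Dijkstra variants discussed in parts 1 and 2. The graph encoding, the function name, and the replace_in_frontier flag are illustrative assumptions, not the slide pseudocode or the PA1 code.

    import heapq

    def dijkstra(graph, start, goal, replace_in_frontier=True):
        """Dijkstra over an undirected weighted graph given as
        {node: {neighbor: edge_cost}}. With replace_in_frontier=False, a node
        already in the frontier is never updated with a cheaper path, which is
        the 'wrong modification' described in part 1."""
        frontier = [(0, start, [start])]   # entries are (path cost, node, path)
        frontier_cost = {start: 0}         # cheapest cost recorded per frontier node
        explored = set()
        while frontier:
            cost, node, path = heapq.heappop(frontier)
            if node in explored:
                continue                   # stale entry; a cheaper copy was popped earlier
            if node == goal:
                return cost, path
            explored.add(node)
            for neighbor, weight in graph[node].items():
                if neighbor in explored:
                    continue
                new_cost = cost + weight
                if neighbor not in frontier_cost:
                    frontier_cost[neighbor] = new_cost
                    heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
                elif replace_in_frontier and new_cost < frontier_cost[neighbor]:
                    # the frontier check and replacement: keep the cheaper path
                    frontier_cost[neighbor] = new_cost
                    heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
        return None

Tracing this by hand on the Figure 1 graph (whose edge costs are not reproduced here) is one way to check your answers to parts 1 and 2.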
2 Adversarial Search

Question 2 (3 Points). Consider the game in Figure 1-(Right). There are two players: the column player and the row player. The column player goes first and chooses either the first column (C1) or the second (C2); after that, the row player chooses either the first row (R1) or the second (R2). After these two steps, the game ends, and the pair of numbers located by these two choices is the outcome of the game. In each pair of numbers, the first number is the reward that the column player will receive, and the second number is what the row player will receive. For instance, if the column player chooses C1 and then the row player chooses R1, the outcome is (10, 2), which means the column player will receive a reward of 10 and the row player a reward of 2.

Assume that both players are rational and want to maximize their own reward (no minimizers). What should each of their actions be, and what is the final outcome of the game under those actions?

In your answer, draw the game tree that is similar to the minimax tree, except that now the two players are no longer the max and min players (just annotate their nodes as the column and row players). Explain your reasoning on the tree. I hope you realize this is how the algorithms we talked about for minimax can be applied to games where the players are not directly adversarial to each other (not zero-sum).
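To connect this to the minimax machinery, here is a hedged Python sketch of backward induction for a two-move, non-zero-sum game of this shape. Only the outcome (C1, R1) = (10, 2) is given above; the other payoff entries below are hypothetical placeholders, not the values in Figure 1-(Right).

    # Payoffs are (column player's reward, row player's reward).
    # Only (C1, R1) comes from the text; the rest are hypothetical placeholders.
    payoffs = {
        ("C1", "R1"): (10, 2),
        ("C1", "R2"): (0, 0),    # placeholder
        ("C2", "R1"): (0, 0),    # placeholder
        ("C2", "R2"): (0, 0),    # placeholder
    }

    def backward_induction(payoffs):
        # The row player moves second and maximizes their own (second) reward
        # given the column already chosen.
        best_row = {col: max(("R1", "R2"), key=lambda row: payoffs[(col, row)][1])
                    for col in ("C1", "C2")}
        # The column player moves first, anticipates the row player's reply,
        # and maximizes their own (first) reward over the induced outcomes.
        best_col = max(("C1", "C2"), key=lambda col: payoffs[(col, best_row[col])][0])
        return best_col, best_row[best_col], payoffs[(best_col, best_row[best_col])]

    print(backward_induction(payoffs))

With the true Figure 1 payoffs filled in, this is the same back-up reasoning your game tree should display.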
3 Markov Decision Processes and Reinforcement Learning

Consider the following MDP:

• State space: S = {s1, s2}
• Actions: A(s1) = {a1, a2} and A(s2) = {a1, a2}.
• Transition model:
  – P(s1|s1, a1) = 0.2, P(s2|s1, a1) = 0.8, P(s1|s1, a2) = 0.8, P(s2|s1, a2) = 0.2
  – P(s1|s2, a1) = 0.7, P(s2|s2, a1) = 0.3, P(s1|s2, a2) = 0.3, P(s2|s2, a2) = 0.7
• Rewards: R(s1) = 1 and R(s2) = −2.
• Discount factor: γ = 0.8.
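As a reading aid for the transition notation P(s′|s, a), here is one possible Python encoding of this MDP; the dictionary layout is an illustrative choice, not part of the question.

    # States and available actions.
    states = ["s1", "s2"]
    actions = {"s1": ["a1", "a2"], "s2": ["a1", "a2"]}

    # P[(s, a)][s_next] = probability of landing in s_next after taking a in s.
    P = {
        ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
        ("s1", "a2"): {"s1": 0.8, "s2": 0.2},
        ("s2", "a1"): {"s1": 0.7, "s2": 0.3},
        ("s2", "a2"): {"s1": 0.3, "s2": 0.7},
    }

    # Rewards depend only on the current state; gamma is the discount factor.
    R = {"s1": 1.0, "s2": -2.0}
    gamma = 0.8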
Question 3 (2 Points). Draw a graph that fully represents the MDP. In the graph, show the states, actions, Q-states (i.e., state-action pairs), and rewards. Label transitions with probabilities on the appropriate edges in the graph.
Question 4 (3 Points). Perform two steps of value iteration from the initial guess V0(s1) = V0(s2) = 0. That is, perform two steps of the Bellman update Vi+1 ← B(Vi) (check the notation in the slides). Note that each Vi is a vector of the values of both states, i.e., V0 = (V0(s1), V0(s2)). So you should show how to compute V1 and V2, where each Vi is a vector of two numbers.
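For reference, a minimal sketch of the Bellman update is given below, assuming the convention in which the reward depends only on the current state, i.e., Vi+1(s) = R(s) + γ max_a Σ_{s′} P(s′|s, a) Vi(s′); if the slides place the reward differently, adjust accordingly. It reuses the encoding from the sketch after the MDP definition and is not a substitute for the hand calculation the question asks for.

    # Same MDP encoding as in the sketch after the MDP definition.
    P = {("s1", "a1"): {"s1": 0.2, "s2": 0.8}, ("s1", "a2"): {"s1": 0.8, "s2": 0.2},
         ("s2", "a1"): {"s1": 0.7, "s2": 0.3}, ("s2", "a2"): {"s1": 0.3, "s2": 0.7}}
    R, gamma = {"s1": 1.0, "s2": -2.0}, 0.8

    def bellman_update(V):
        # V_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s, a) * V_i(s')
        return {s: R[s] + gamma * max(sum(p * V[s_next] for s_next, p in P[(s, a)].items())
                                      for a in ("a1", "a2"))
                for s in ("s1", "s2")}

    V = {"s1": 0.0, "s2": 0.0}   # V0
    for i in (1, 2):             # two Bellman updates give V1 and V2
        V = bellman_update(V)
        print(f"V{i} =", V)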
Question 5 (2 Points). Roughly how many iterations will it take for value iteration to converge to within 0.1 error of the optimal values? That is, suppose the optimal values are V∗ = (V∗(s1), V∗(s2)), and we want |Vk(s1) − V∗(s1)| < 0.1 and |Vk(s2) − V∗(s2)| < 0.1; how large does k need to be for the values to converge under this threshold? Explain your reasoning mathematically; no implementation is involved.
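One standard way to reason about the convergence rate (a sketch, not necessarily the exact argument expected in your answer) uses the fact that the Bellman update B is a γ-contraction in the max norm, together with a crude bound on the initial error that follows from V0 = 0:

    \|V_k - V^*\|_\infty \le \gamma^k \,\|V_0 - V^*\|_\infty \le \gamma^k \cdot \frac{\max_s |R(s)|}{1 - \gamma}

With γ = 0.8 and max_s |R(s)| = 2 here, it then suffices to take k large enough that the right-hand side drops below 0.1.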
Question 6 (2 Points). Suppose we refer to the MDP defined above as M1, and write V∗_{M1}(s1), V∗_{M1}(s2) to represent the optimal values of M1. Now define another MDP M2 that is exactly the same as M1 except that the rewards are now R_{M2}(s1) = 2 and R_{M2}(s2) = −4. Namely, M2 differs from M1 only in that the reward on each state is twice as large. We write the optimal values of M2 as V∗_{M2}(s1) and V∗_{M2}(s2). Now, suppose you know what V∗_{M1}(s1) and V∗_{M1}(s2) are (which you don't, and you do not need to implement anything to find out); then what should V∗_{M2}(s1) and V∗_{M2}(s2) be? That is, write down the values of V∗_{M2}(s1) and V∗_{M2}(s2) using the values of V∗_{M1}(s1) and V∗_{M1}(s2). Explain your reasoning.
Question 7 (4 Points). We now perform Q-learning in the MDP defined at the beginning, i.e., M1, without accessing the transition probabilities. We initialize all Q-values to zero, Q(si, ai) = 0. For simplicity we will use a fixed learning rate parameter α = 0.1 in all temporal-difference updates (so it does not change with the number of visits). We start from the initial state s2, and suppose we perform/observe the following three steps in the learning process in the MDP:

(Step 1) At the initial state s2, we take the action a1, and then observe that we are transitioned back to s2 by the MDP.

(Step 2) We then take the action a1 again from s2, and observe that we are now transitioned to s1 by the MDP.

(Step 3) Now at the state s1, we take the action a1 yet again, and then observe that we are transitioned to state s2 by the MDP.
After each of these three steps, we perform a temporal-difference update for the Q-value of the relevant state-action pair. Now answer the following questions.

1. What values do we have for Q(s1, a1) and Q(s2, a1) now, after these three steps of updates? Write down how you obtained them.

2. Suppose from here we will use the ε-greedy strategy with ε = 0.3, which means that with probability ε we take an arbitrary action (each of the two actions is chosen equally likely in this case), and with probability 1 − ε we choose the best action according to the current Q-values. Now that we are in s2 after Step 3, what is the probability of seeing the transition (s2, a1, s1) in the next step? That is, calculate the probability of the event "according to the ε-greedy policy, we obtained the action a1 in the current state s2, and after applying this action, the MDP puts us in s1 as the next state."

3. If instead of the ε-greedy policy we take the greedy policy that always chooses the action that maximizes the Q-values in each step, what is the probability of seeing (s2, a1, s1) in the next step?
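For reference, here is a minimal sketch of the temporal-difference update and the ε-greedy choice described above, assuming the state-reward convention of this MDP, so the sample for a transition (s, a, s′) is R(s) + γ max_{a′} Q(s′, a′); if the slides place the reward differently, adjust accordingly. This is illustrative only, not the required hand calculation.

    import random

    alpha, gamma, epsilon = 0.1, 0.8, 0.3
    R = {"s1": 1.0, "s2": -2.0}
    Q = {(s, a): 0.0 for s in ("s1", "s2") for a in ("a1", "a2")}

    def td_update(Q, s, a, s_next):
        # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R(s) + gamma * max_a' Q(s_next, a'))
        sample = R[s] + gamma * max(Q[(s_next, "a1")], Q[(s_next, "a2")])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

    def epsilon_greedy_action(Q, s):
        # With probability epsilon pick an arbitrary action (uniformly at random);
        # otherwise pick the action with the highest current Q-value in state s.
        if random.random() < epsilon:
            return random.choice(["a1", "a2"])
        return max(("a1", "a2"), key=lambda a: Q[(s, a)])

    # Example: the update for Step 1 above would be td_update(Q, "s2", "a1", "s2").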
4 Monte Carlo Tree Search

Question 8 (2 Points). Consider the simple game of betting over the blue and red coins as shown in the slides for this chapter. Explain why, from a regret perspective, choosing one of the two coins uniformly at random at each round is the same as randomly picking one coin at the very beginning and sticking with it. You should use the definition of regret over N rounds of play of the game in your explanation.
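For reference, one common way to write the expected regret after N rounds (a sketch; the slides' notation may differ slightly): if μ∗ is the expected per-round reward of the best single coin and r_t is the reward received at round t, then

    \mathrm{Regret}_N = N\,\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{N} r_t\Big]

Both strategies in the question can be compared by computing the expected total reward E[Σ r_t] under each.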
Question 9 (Extra 2 Points). This question is an advanced topic for extra credit, and it can be confusing if you are not familiar with the notion of i.i.d. samples. In that case there is no need to attempt it.

In MCTS we use the Upper Confidence Bound (UCB) strategy to make decisions that balance exploration and exploitation. We explained that for multi-armed bandit problems such as the two-coin betting game, the UCB strategy achieves logarithmic regret. That result needs to assume that we can draw i.i.d. samples for each coin, because the reward distribution of each coin does not change over time.

Give a concrete example to explain why this may not be the case in the MCTS setting. That is, design a minimax tree such that the win rate distributions for the choices at the root node change over time as the MCTS algorithm expands the tree. Recall that the win rate distribution at a node is what we use roll-outs from that node to perform Monte Carlo estimation of.

The mathematical detail does not need to be very precise; just focus on explaining what the MCTS algorithm does on the tree that you design, so that it is clear that the win rate distribution for the options at the root node may shift over time.
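For context, the UCB-style selection rule referred to above typically scores each option by its empirical mean plus an exploration bonus. A hedged Python sketch is below; the exploration constant c and the tie handling are illustrative choices, and the slides may use a slightly different form.

    import math

    def ucb_choice(means, counts, total_plays, c=math.sqrt(2)):
        """Pick the index with the largest UCB1 score:
        empirical mean + c * sqrt(ln(total_plays) / plays of that option).
        Options that have never been played are tried first."""
        for i, n in enumerate(counts):
            if n == 0:
                return i                  # play every option at least once
        scores = [means[i] + c * math.sqrt(math.log(total_plays) / counts[i])
                  for i in range(len(means))]
        return max(range(len(scores)), key=lambda i: scores[i])

The logarithmic-regret guarantee for this rule assumes each option's rewards are i.i.d. draws from a fixed distribution, which is exactly the assumption Question 9 asks you to break in the MCTS setting.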