Is the optimal policy always stochastic if the environment is also stochastic?
Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?
Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).
reinforcement-learning stochastic-policy deterministic-policy policy environment
asked Feb 15 at 13:20 by nbro, edited Feb 15 at 14:22
Here's a related question: mathoverflow.net/q/44677.
– nbro, Feb 23 at 15:40
4 Answers
I would say no.
For example, consider the multi-armed bandit problem: you have $n$ arms, each of which gives you a reward (1 point, for example) with some probability $p_i$, for $i$ between $1$ and $n$. This is a simple stochastic environment: it has only one state, but it is still an environment.
But the optimal policy is to always choose the arm with the highest $p_i$, because that maximises the expected reward. So the optimal policy is deterministic, not stochastic.
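To make this concrete, here is a minimal sketch in Python (the three reward probabilities are made up for illustration): the deterministic greedy policy that always pulls the best arm gets the highest average reward, even though every individual pull is random, while a stochastic (uniform) policy does strictly worse in expectation.

```python
import random

# Hypothetical reward probabilities for a 3-armed Bernoulli bandit.
p = [0.2, 0.5, 0.8]

def pull(arm):
    """Stochastic environment: the chosen arm pays 1 with probability p[arm], else 0."""
    return 1 if random.random() < p[arm] else 0

def greedy_policy():
    """Deterministic optimal policy: always pick the arm with the highest p_i."""
    return max(range(len(p)), key=lambda i: p[i])

def uniform_policy():
    """A stochastic policy, for comparison: pick an arm uniformly at random."""
    return random.randrange(len(p))

def average_reward(policy, episodes=100_000):
    return sum(pull(policy()) for _ in range(episodes)) / episodes

# The greedy average tends to ~0.8, the uniform average to ~0.5.
print(average_reward(greedy_policy), average_reward(uniform_policy))
```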
On the other hand, if you are in an environment where you play against other agents (a game-theoretic setting), your optimal policy may well need to be stochastic (think of a poker game, for example).
answered Feb 15 at 13:54 by Adrien Forbu, edited Feb 15 at 16:04 by nbro
Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
– nbro, Feb 15 at 14:03
@nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators; that is assumed to be impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model or treat it as a POMDP).
– Neil Slater, Feb 15 at 14:07
@NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
– nbro, Feb 15 at 14:11
@nbro: That may change your decisions, but it is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic (always use the best arm), but you don't know which arm that is. This is about exploration vs exploitation. You could perhaps phrase that as having "an optimal policy for exploring a bandit problem". That is not the terminology used in e.g. Sutton & Barto, but perhaps some practitioners do say that, I don't know.
– Neil Slater, Feb 15 at 14:15
Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
– nbro, Feb 15 at 16:07
Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?
No.
An optimal policy is generally deterministic unless:
Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.
There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or the prisoner's dilemma (a toy sketch of the scissors/paper/stone case follows below).
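As a toy illustration of the second case, here is a small sketch (Python; the payoff table is just the standard win/lose/draw scoring) showing that every deterministic scissors/paper/stone policy has worst-case payoff -1, whereas the uniform stochastic policy has expected payoff 0 against any opponent action.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

# Any deterministic policy (always playing the same action) is exploitable:
# the opponent can reply with the action that beats it, for a payoff of -1.
for fixed in ACTIONS:
    worst = min(payoff(fixed, reply) for reply in ACTIONS)
    print(f"deterministic '{fixed}': worst-case payoff {worst}")        # always -1

# The uniform random (stochastic) policy has expected payoff 0 against any
# opponent action, so its worst case beats that of every deterministic policy.
for reply in ACTIONS:
    expected = sum(payoff(mine, reply) for mine in ACTIONS) / 3
    print(f"uniform policy vs '{reply}': expected payoff {expected}")   # always 0.0
```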
Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).
That seems reasonable, but you can take that intuition further with any method based on a value function:
If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.
The above statement is just a natural language re-statement of the Bellman optimality equation:
$$v^*(s) = \max_a \sum_{r,s'} p(r,s'|s,a)\,\big(r+\gamma v^*(s')\big)$$
i.e. the optimal values are obtained by always choosing the action that maximises the reward plus the discounted value of the next step. The $\max_a$ operation is deterministic (if necessary, you can break ties for the maximum value deterministically, e.g. with an ordered list of actions).
Therefore, any environment that can be modelled by an MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.
It is possible in such an environment that no stochastic policy is optimal at all (i.e. if you add any randomness to the deterministic optimal policy, the policy becomes strictly worse). However, when two or more actions tie for the maximum value in one or more states, there are multiple equivalent optimal deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.
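To see the determinism fall out of the Bellman equation in practice, here is a minimal value-iteration sketch on an invented stochastic MDP (the states, transition probabilities and rewards below are made up purely for illustration). The greedy policy extracted at the end is a plain state-to-action map, even though every transition is random.

```python
# Hypothetical 3-state, 2-action MDP with stochastic transitions.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"left":  [(0.8, 0, 0.0), (0.2, 1, 1.0)],
        "right": [(0.9, 1, 1.0), (0.1, 2, 5.0)]},
    1: {"left":  [(1.0, 0, 0.0)],
        "right": [(0.7, 2, 5.0), (0.3, 1, 1.0)]},
    2: {"left":  [(1.0, 2, 0.0)],   # state 2 is absorbing
        "right": [(1.0, 2, 0.0)]},
}
GAMMA = 0.9

def q_value(V, s, a):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

# Value iteration: repeatedly apply the Bellman optimality backup.
V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}

# Acting greedily w.r.t. the optimal values gives a deterministic policy.
policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
print(policy)   # e.g. {0: 'right', 1: 'right', 2: 'left'}
```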
answered Feb 15 at 13:47 by Neil Slater, edited Feb 15 at 15:13
"It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
– nbro, Feb 15 at 14:14
@nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing).
– Neil Slater, Feb 15 at 14:24
I understand now. You're saying that there is always an optimal deterministic policy, and that a stochastic policy derived from it will likely be worse than the optimal deterministic policy.
– nbro, Feb 15 at 14:39
@nbro: Yes, that's it.
– Neil Slater, Feb 15 at 15:12
I'm thinking of a probability landscape in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow you to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that this would also be true in a stochastic environment.
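For what it's worth, here is a minimal sketch of the kind of stochastic acceptance rule this answer alludes to: a Metropolis-style step on an invented one-dimensional reward landscape. Worse moves are sometimes accepted, which is what lets the search escape a local optimum; note this is about searching for a good solution, not about the form of the optimal policy itself.

```python
import math
import random

def reward(x):
    """Invented multi-modal landscape: local optimum near x = -1, global optimum near x = 2."""
    return math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

x, temperature = -1.0, 0.5          # start at the local optimum
for _ in range(10_000):
    proposal = x + random.gauss(0, 0.3)
    # Metropolis rule: always accept improvements, sometimes accept worse moves.
    accept_prob = math.exp(min(0.0, reward(proposal) - reward(x)) / temperature)
    if random.random() < accept_prob:
        x = proposal

print(round(x, 1))                  # usually ends up near the global optimum at x = 2
```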
answered Feb 15 at 15:58 by Jonathan Moore
I didn't refer to "good" policies, but "optimal" policies. This is different.
– nbro, Feb 15 at 16:08
The form of a solution in a stochastic environment is always a hedge across all states.
answered Feb 16 at 2:37 by ron dembo