Is the optimal policy always stochastic if the environment is also stochastic?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












6












$begingroup$


Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?



Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).










share|improve this question











$endgroup$











  • $begingroup$
    Here's a related question: mathoverflow.net/q/44677.
    $endgroup$
    – nbro
    Feb 23 at 15:40















6












$begingroup$


Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?



Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).










share|improve this question











$endgroup$











  • $begingroup$
    Here's a related question: mathoverflow.net/q/44677.
    $endgroup$
    – nbro
    Feb 23 at 15:40













6












6








6


2



$begingroup$


Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?



Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).










share|improve this question











$endgroup$




Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?



Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).







reinforcement-learning stochastic-policy deterministic-policy policy environment






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 15 at 14:22







nbro

















asked Feb 15 at 13:20









nbronbro

1,543622




1,543622











  • $begingroup$
    Here's a related question: mathoverflow.net/q/44677.
    $endgroup$
    – nbro
    Feb 23 at 15:40
















  • $begingroup$
    Here's a related question: mathoverflow.net/q/44677.
    $endgroup$
    – nbro
    Feb 23 at 15:40















$begingroup$
Here's a related question: mathoverflow.net/q/44677.
$endgroup$
– nbro
Feb 23 at 15:40




$begingroup$
Here's a related question: mathoverflow.net/q/44677.
$endgroup$
– nbro
Feb 23 at 15:40










4 Answers
4






active

oldest

votes


















5












$begingroup$

I would say no.



For example, consider the multi-armed bandit problem. So, you have $n$ arms which all have a probability of giving you a reward (1 point, for example), $p_i$, $i$ being between 1 and $n$. This is a simple stochastic environment: this is a one state environment, but it is still an environment.



But obviously the optimal policy is to choose the arm with the highest $p_i$. So this is not a stochastic policy.



Obviously, if you are in an environment where you play against other agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).






share|improve this answer











$endgroup$












  • $begingroup$
    Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
    $endgroup$
    – nbro
    Feb 15 at 14:03






  • 2




    $begingroup$
    @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:07











  • $begingroup$
    @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
    $endgroup$
    – nbro
    Feb 15 at 14:11






  • 2




    $begingroup$
    @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
    $endgroup$
    – Neil Slater
    Feb 15 at 14:15











  • $begingroup$
    Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
    $endgroup$
    – nbro
    Feb 15 at 16:07


















2












$begingroup$


Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?




No.



An optimal policy is generally deterministic unless:



  • Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.


  • There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or prisoner's dilemma.



Intuitively, if the environment is deterministic (that is, if the agent is in a state 𝑠 and takes action 𝑎, then the next state 𝑠′ is always the same, not matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).




That seems reasonable, but you can take that intuition further with any method based on a value function:



If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.



The above statement is just a natural language re-statement of the Bellman optimality equation:



$$v^*(s) = textmax_a sum_r,s'p(r,s'|s,a)(r+gamma v^*(s'))$$



i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of next step. The $textmax_a$ operation is deterministic (if necessary you can break ties for max value deterministically with e.g. an ordered list of actions).



Therefore, any environment that can be modelled by a MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.



It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy will become strictly worse). However, when there are ties for maximum value for one or more actions in one or more states then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.






share|improve this answer











$endgroup$












  • $begingroup$
    "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
    $endgroup$
    – nbro
    Feb 15 at 14:14






  • 1




    $begingroup$
    @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:24






  • 1




    $begingroup$
    I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
    $endgroup$
    – nbro
    Feb 15 at 14:39










  • $begingroup$
    @nbro: Yes, that's it.
    $endgroup$
    – Neil Slater
    Feb 15 at 15:12


















0












$begingroup$

I'm thinking of a probability landscape, in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.






share|improve this answer









$endgroup$












  • $begingroup$
    I didn't refer to "good" policies, but "optimal" policies. This is different.
    $endgroup$
    – nbro
    Feb 15 at 16:08


















-2












$begingroup$

The form of a solution in a stochastic environment is always a hedge across allstates






share|improve this answer









$endgroup$












    Your Answer





    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "658"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f10591%2fis-the-optimal-policy-always-stochastic-if-the-environment-is-also-stochastic%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    5












    $begingroup$

    I would say no.



    For example, consider the multi-armed bandit problem. So, you have $n$ arms which all have a probability of giving you a reward (1 point, for example), $p_i$, $i$ being between 1 and $n$. This is a simple stochastic environment: this is a one state environment, but it is still an environment.



    But obviously the optimal policy is to choose the arm with the highest $p_i$. So this is not a stochastic policy.



    Obviously, if you are in an environment where you play against other agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).






    share|improve this answer











    $endgroup$












    • $begingroup$
      Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
      $endgroup$
      – nbro
      Feb 15 at 14:03






    • 2




      $begingroup$
      @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:07











    • $begingroup$
      @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
      $endgroup$
      – nbro
      Feb 15 at 14:11






    • 2




      $begingroup$
      @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
      $endgroup$
      – Neil Slater
      Feb 15 at 14:15











    • $begingroup$
      Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
      $endgroup$
      – nbro
      Feb 15 at 16:07















    5












    $begingroup$

    I would say no.



    For example, consider the multi-armed bandit problem. So, you have $n$ arms which all have a probability of giving you a reward (1 point, for example), $p_i$, $i$ being between 1 and $n$. This is a simple stochastic environment: this is a one state environment, but it is still an environment.



    But obviously the optimal policy is to choose the arm with the highest $p_i$. So this is not a stochastic policy.



    Obviously, if you are in an environment where you play against other agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).






    share|improve this answer











    $endgroup$












    • $begingroup$
      Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
      $endgroup$
      – nbro
      Feb 15 at 14:03






    • 2




      $begingroup$
      @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:07











    • $begingroup$
      @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
      $endgroup$
      – nbro
      Feb 15 at 14:11






    • 2




      $begingroup$
      @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
      $endgroup$
      – Neil Slater
      Feb 15 at 14:15











    • $begingroup$
      Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
      $endgroup$
      – nbro
      Feb 15 at 16:07













    5












    5








    5





    $begingroup$

    I would say no.



    For example, consider the multi-armed bandit problem. So, you have $n$ arms which all have a probability of giving you a reward (1 point, for example), $p_i$, $i$ being between 1 and $n$. This is a simple stochastic environment: this is a one state environment, but it is still an environment.



    But obviously the optimal policy is to choose the arm with the highest $p_i$. So this is not a stochastic policy.



    Obviously, if you are in an environment where you play against other agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).






    share|improve this answer











    $endgroup$



    I would say no.



    For example, consider the multi-armed bandit problem. So, you have $n$ arms which all have a probability of giving you a reward (1 point, for example), $p_i$, $i$ being between 1 and $n$. This is a simple stochastic environment: this is a one state environment, but it is still an environment.



    But obviously the optimal policy is to choose the arm with the highest $p_i$. So this is not a stochastic policy.



    Obviously, if you are in an environment where you play against other agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Feb 15 at 16:04









    nbro

    1,543622




    1,543622










    answered Feb 15 at 13:54









    Adrien ForbuAdrien Forbu

    1114




    1114











    • $begingroup$
      Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
      $endgroup$
      – nbro
      Feb 15 at 14:03






    • 2




      $begingroup$
      @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:07











    • $begingroup$
      @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
      $endgroup$
      – nbro
      Feb 15 at 14:11






    • 2




      $begingroup$
      @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
      $endgroup$
      – Neil Slater
      Feb 15 at 14:15











    • $begingroup$
      Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
      $endgroup$
      – nbro
      Feb 15 at 16:07
















    • $begingroup$
      Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
      $endgroup$
      – nbro
      Feb 15 at 14:03






    • 2




      $begingroup$
      @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:07











    • $begingroup$
      @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
      $endgroup$
      – nbro
      Feb 15 at 14:11






    • 2




      $begingroup$
      @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
      $endgroup$
      – Neil Slater
      Feb 15 at 14:15











    • $begingroup$
      Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
      $endgroup$
      – nbro
      Feb 15 at 16:07















    $begingroup$
    Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
    $endgroup$
    – nbro
    Feb 15 at 14:03




    $begingroup$
    Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose the arm $i$.
    $endgroup$
    – nbro
    Feb 15 at 14:03




    2




    2




    $begingroup$
    @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:07





    $begingroup$
    @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators, that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat as a POMDP)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:07













    $begingroup$
    @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
    $endgroup$
    – nbro
    Feb 15 at 14:11




    $begingroup$
    @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play.
    $endgroup$
    – nbro
    Feb 15 at 14:11




    2




    2




    $begingroup$
    @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
    $endgroup$
    – Neil Slater
    Feb 15 at 14:15





    $begingroup$
    @nbro: That may change your decisions, but is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic, about using the best arm, but you don't know it. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem" perhaps. Not the terminology used in e.g. Sutton & Barto, but perhaps some parctioners do say that, I don't know . . .
    $endgroup$
    – Neil Slater
    Feb 15 at 14:15













    $begingroup$
    Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
    $endgroup$
    – nbro
    Feb 15 at 16:07




    $begingroup$
    Why do you say that the environment contains only one state in the "multi-armed bandit problem"?
    $endgroup$
    – nbro
    Feb 15 at 16:07













    2












    $begingroup$


    Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?




    No.



    An optimal policy is generally deterministic unless:



    • Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.


    • There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or prisoner's dilemma.



    Intuitively, if the environment is deterministic (that is, if the agent is in a state 𝑠 and takes action 𝑎, then the next state 𝑠′ is always the same, not matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).




    That seems reasonable, but you can take that intuition further with any method based on a value function:



    If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.



    The above statement is just a natural language re-statement of the Bellman optimality equation:



    $$v^*(s) = textmax_a sum_r,s'p(r,s'|s,a)(r+gamma v^*(s'))$$



    i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of next step. The $textmax_a$ operation is deterministic (if necessary you can break ties for max value deterministically with e.g. an ordered list of actions).



    Therefore, any environment that can be modelled by a MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.



    It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy will become strictly worse). However, when there are ties for maximum value for one or more actions in one or more states then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.






    share|improve this answer











    $endgroup$












    • $begingroup$
      "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
      $endgroup$
      – nbro
      Feb 15 at 14:14






    • 1




      $begingroup$
      @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:24






    • 1




      $begingroup$
      I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
      $endgroup$
      – nbro
      Feb 15 at 14:39










    • $begingroup$
      @nbro: Yes, that's it.
      $endgroup$
      – Neil Slater
      Feb 15 at 15:12















    2












    $begingroup$


    Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?




    No.



    An optimal policy is generally deterministic unless:



    • Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.


    • There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or prisoner's dilemma.



    Intuitively, if the environment is deterministic (that is, if the agent is in a state 𝑠 and takes action 𝑎, then the next state 𝑠′ is always the same, not matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).




    That seems reasonable, but you can take that intuition further with any method based on a value function:



    If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.



    The above statement is just a natural language re-statement of the Bellman optimality equation:



    $$v^*(s) = textmax_a sum_r,s'p(r,s'|s,a)(r+gamma v^*(s'))$$



    i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of next step. The $textmax_a$ operation is deterministic (if necessary you can break ties for max value deterministically with e.g. an ordered list of actions).



    Therefore, any environment that can be modelled by a MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.



    It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy will become strictly worse). However, when there are ties for maximum value for one or more actions in one or more states then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.






    share|improve this answer











    $endgroup$












    • $begingroup$
      "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
      $endgroup$
      – nbro
      Feb 15 at 14:14






    • 1




      $begingroup$
      @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:24






    • 1




      $begingroup$
      I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
      $endgroup$
      – nbro
      Feb 15 at 14:39










    • $begingroup$
      @nbro: Yes, that's it.
      $endgroup$
      – Neil Slater
      Feb 15 at 15:12













    2












    2








    2





    $begingroup$


    Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?




    No.



    An optimal policy is generally deterministic unless:



    • Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.


    • There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or prisoner's dilemma.



    Intuitively, if the environment is deterministic (that is, if the agent is in a state 𝑠 and takes action 𝑎, then the next state 𝑠′ is always the same, not matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).




    That seems reasonable, but you can take that intuition further with any method based on a value function:



    If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.



    The above statement is just a natural language re-statement of the Bellman optimality equation:



    $$v^*(s) = textmax_a sum_r,s'p(r,s'|s,a)(r+gamma v^*(s'))$$



    i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of next step. The $textmax_a$ operation is deterministic (if necessary you can break ties for max value deterministically with e.g. an ordered list of actions).



    Therefore, any environment that can be modelled by a MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.



    It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy will become strictly worse). However, when there are ties for maximum value for one or more actions in one or more states then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.






    share|improve this answer











    $endgroup$




    Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?




    No.



    An optimal policy is generally deterministic unless:



    • Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.


    • There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or prisoner's dilemma.



    Intuitively, if the environment is deterministic (that is, if the agent is in a state 𝑠 and takes action 𝑎, then the next state 𝑠′ is always the same, not matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).




    That seems reasonable, but you can take that intuition further with any method based on a value function:



    If you have found an optimal value function, then acting greedily with respect to it is the optimal policy.



    The above statement is just a natural language re-statement of the Bellman optimality equation:



    $$v^*(s) = textmax_a sum_r,s'p(r,s'|s,a)(r+gamma v^*(s'))$$



    i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of next step. The $textmax_a$ operation is deterministic (if necessary you can break ties for max value deterministically with e.g. an ordered list of actions).



    Therefore, any environment that can be modelled by a MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.



    It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy will become strictly worse). However, when there are ties for maximum value for one or more actions in one or more states then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Feb 15 at 15:13

























    answered Feb 15 at 13:47









    Neil SlaterNeil Slater

    5,9281417




    5,9281417











    • $begingroup$
      "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
      $endgroup$
      – nbro
      Feb 15 at 14:14






    • 1




      $begingroup$
      @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:24






    • 1




      $begingroup$
      I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
      $endgroup$
      – nbro
      Feb 15 at 14:39










    • $begingroup$
      @nbro: Yes, that's it.
      $endgroup$
      – Neil Slater
      Feb 15 at 15:12
















    • $begingroup$
      "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
      $endgroup$
      – nbro
      Feb 15 at 14:14






    • 1




      $begingroup$
      @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
      $endgroup$
      – Neil Slater
      Feb 15 at 14:24






    • 1




      $begingroup$
      I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
      $endgroup$
      – nbro
      Feb 15 at 14:39










    • $begingroup$
      @nbro: Yes, that's it.
      $endgroup$
      – Neil Slater
      Feb 15 at 15:12















    $begingroup$
    "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
    $endgroup$
    – nbro
    Feb 15 at 14:14




    $begingroup$
    "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy?
    $endgroup$
    – nbro
    Feb 15 at 14:14




    1




    1




    $begingroup$
    @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:24




    $begingroup$
    @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing)
    $endgroup$
    – Neil Slater
    Feb 15 at 14:24




    1




    1




    $begingroup$
    I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
    $endgroup$
    – nbro
    Feb 15 at 14:39




    $begingroup$
    I understand now. You're saying that there's always a deterministic policy, then a policy which is stochastic and derived from the deterministic policy will likely be worse than the optimal deterministic policy.
    $endgroup$
    – nbro
    Feb 15 at 14:39












    $begingroup$
    @nbro: Yes, that's it.
    $endgroup$
    – Neil Slater
    Feb 15 at 15:12




    $begingroup$
    @nbro: Yes, that's it.
    $endgroup$
    – Neil Slater
    Feb 15 at 15:12











    0












    $begingroup$

    I'm thinking of a probability landscape, in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.






    share|improve this answer









    $endgroup$












    • $begingroup$
      I didn't refer to "good" policies, but "optimal" policies. This is different.
      $endgroup$
      – nbro
      Feb 15 at 16:08















    0












    $begingroup$

    I'm thinking of a probability landscape, in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.






    share|improve this answer









    $endgroup$












    • $begingroup$
      I didn't refer to "good" policies, but "optimal" policies. This is different.
      $endgroup$
      – nbro
      Feb 15 at 16:08













    0












    0








    0





    $begingroup$

    I'm thinking of a probability landscape, in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.






    share|improve this answer









    $endgroup$



    I'm thinking of a probability landscape, in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Feb 15 at 15:58









    Jonathan MooreJonathan Moore

    1363




    1363











    • $begingroup$
      I didn't refer to "good" policies, but "optimal" policies. This is different.
      $endgroup$
      – nbro
      Feb 15 at 16:08
















    • $begingroup$
      I didn't refer to "good" policies, but "optimal" policies. This is different.
      $endgroup$
      – nbro
      Feb 15 at 16:08















    $begingroup$
    I didn't refer to "good" policies, but "optimal" policies. This is different.
    $endgroup$
    – nbro
    Feb 15 at 16:08




    $begingroup$
    I didn't refer to "good" policies, but "optimal" policies. This is different.
    $endgroup$
    – nbro
    Feb 15 at 16:08











    -2












    $begingroup$

    The form of a solution in a stochastic environment is always a hedge across allstates






    share|improve this answer









    $endgroup$

















      -2












      $begingroup$

      The form of a solution in a stochastic environment is always a hedge across allstates






      share|improve this answer









      $endgroup$















        -2












        -2








        -2





        $begingroup$

        The form of a solution in a stochastic environment is always a hedge across allstates






        share|improve this answer









        $endgroup$



        The form of a solution in a stochastic environment is always a hedge across allstates







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Feb 16 at 2:37









        ron demboron dembo

        1




        1



























            draft saved

            draft discarded
















































            Thanks for contributing an answer to Artificial Intelligence Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid


            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.

            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f10591%2fis-the-optimal-policy-always-stochastic-if-the-environment-is-also-stochastic%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown






            Popular posts from this blog

            How to check contact read email or not when send email to Individual?

            How many registers does an x86_64 CPU actually have?

            Nur Jahan