Contents
What is the optimal action-value function?
The optimal action-value function gives the value of committing to a particular first action (here, hitting with the driver in the golf example) and thereafter taking whichever actions are best. In the corresponding value contours, the driver's contour reaches still farther out and includes the starting tee.
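In the standard notation (with discount factor $\gamma$; stated here for reference, not taken from the answer above), the optimal action-value function satisfies the Bellman optimality equation:

$$
q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
$$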
What is the difference between a deterministic and a stochastic policy?
A policy can be either deterministic or stochastic. A deterministic policy is a policy that maps each state to an action: you give it a state and the function returns an action to take. A stochastic policy, on the other hand, outputs a probability distribution over actions for each state.
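A minimal sketch of the distinction (the states, actions, and probabilities below are made up purely for illustration):

```python
import numpy as np

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.5, "right": 0.5},
}

def act_stochastic(state, rng=None):
    rng = rng or np.random.default_rng()
    actions, probs = zip(*stochastic_policy[state].items())
    return rng.choice(actions, p=probs)

print(act_deterministic("s0"))  # always "left"
print(act_stochastic("s1"))     # "left" or "right", each with probability 0.5
```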
Is the optimal policy always deterministic?
The $\max_a$ operation is deterministic (if necessary, ties for the maximum value can be broken deterministically, e.g. by following a fixed ordering of the actions). Therefore, any environment that can be modelled as an MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy that is deterministic.
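As a sketch, extracting such a deterministic greedy policy from a Q-table might look like this (the Q-values and the action ordering are illustrative only, not from the text):

```python
import numpy as np

# Illustrative Q-table: rows are states, columns are actions in a fixed order.
actions = ["up", "down", "left", "right"]
Q = np.array([
    [1.0, 0.5, 1.0, 0.2],   # state 0: "up" and "left" tie for the maximum
    [0.1, 0.9, 0.3, 0.9],   # state 1: "down" and "right" tie
])

def greedy_policy(q_row):
    # np.argmax returns the first index achieving the maximum, so ties are
    # broken deterministically by the fixed ordering of `actions`.
    return actions[int(np.argmax(q_row))]

for s, q_row in enumerate(Q):
    print(f"state {s}: {greedy_policy(q_row)}")
# state 0: up
# state 1: down
```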
When is an optimal policy not deterministic?
An optimal policy is generally deterministic unless important state information is missing (i.e. the problem is a POMDP). For example, in a map where the agent is not allowed to know its exact location or to remember previous states, the state it is given may not be enough to disambiguate between locations.
When does the optimal policy need to be stochastic?
If the goal is to reach a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck in states it cannot tell apart. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), yet still require a stochastic policy to solve it.
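A toy illustration (the corridor layout and observation aliasing are invented for this sketch, not taken from the answer): two pairs of cells look identical to the agent, so any deterministic policy can get stuck bouncing off a wall, while a 50/50 random choice eventually reaches the goal from either side.

```python
import numpy as np

# Toy aliased corridor: cells 0..4, goal in the middle (cell 2).
# The agent only observes "end" (cells 0 and 4), "mid" (cells 1 and 3),
# or "goal" (cell 2), so it cannot tell the left side from the right side.
GOAL = 2

def observe(cell):
    if cell == GOAL:
        return "goal"
    return "end" if cell in (0, 4) else "mid"

def step(cell, action):          # action is -1 (left) or +1 (right)
    return min(4, max(0, cell + action))

def run(policy, start, rng, max_steps=50):
    cell = start
    for t in range(max_steps):
        if cell == GOAL:
            return t             # number of steps taken to reach the goal
        cell = step(cell, policy(observe(cell), rng))
    return None                  # got stuck

rng = np.random.default_rng(0)

# Deterministic policy: always move right, whatever the observation.
det = lambda obs, rng: +1
# Stochastic policy: move left or right with equal probability.
sto = lambda obs, rng: rng.choice([-1, +1])

for start in (0, 4):
    print(start, run(det, start, rng), run(sto, start, rng))
# Starting from cell 4, the deterministic policy never reaches the goal (None),
# while the stochastic one reaches it from both starting cells.
```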
What is the difference between a stochastic and a deterministic policy?
A deterministic policy can be interpreted as a stochastic policy that assigns probability $1$ to one of the available actions (and $0$ to the remaining actions), for each state.
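Written out in the usual notation (not part of the original answer): if $\mu$ is a deterministic policy, the equivalent stochastic policy is

$$
\pi(a \mid s) = \begin{cases} 1 & \text{if } a = \mu(s) \\ 0 & \text{otherwise.} \end{cases}
$$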