Computational Intelligence WS24/25

Exercise Sheet 1 — October 31st, 2024

1 Simple Grid Environment

Consider the simple grid world example below, where an agent can move between tiles but is not allowed to stand still. Starting from the initial position (S), its purpose is to reach the goal tile (G). The agent can move between all numbered tiles that are neither walls (outside the grid) nor blocked (⊗). Execution stops upon reaching the goal tile.

(i) Task

Consider the small grid-world shown in Example 1a. Define a set of possible observations $\mathcal{O}$ and a set of possible actions $\mathcal{A}$ in accordance with the definition given in the lecture. For your definition, what are $|\mathcal{O}|$ and $|\mathcal{A}|$?

Note: Recall that for any finite set $S$ we write $|S|$ for the number of elements in $S$. For infinite sets $S$, we also write $|S| = \infty$.
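As an illustration (not the required answer), observation and action sets for a grid world could be encoded in Python as follows; the concrete tile layout used here is assumed, not taken from Example 1a:

```python
# Sketch only: a hypothetical encoding of observations and actions for a
# small grid world. The tile numbering of Example 1a may differ.
observations = {1, 2, 3, 4, 5, 6, 7, 8}    # assumed: one observation per walkable tile number
actions = {"up", "down", "left", "right"}  # the agent must move, so there is no "stay" action

print(len(observations))  # |O| for this hypothetical layout
print(len(actions))       # |A| = 4
```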

(ii)

Similarly, for the grid world in Example 1a, define a goal predicate $g$ that accepts a policy $\pi$ which (a) finds the way from the start tile (S) to the goal tile (G) in the optimal number of steps and (b) only steps on tile numbers that are strictly greater than any of the tile numbers visited before. Then provide a solution path, i.e., a policy $\pi^*$ for which $g(\pi^*)$ holds. Which of the goal classes we have covered in the lecture could $g$ fall under?
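As a purely illustrative sketch (assuming the solution path is represented as the list of visited tile numbers, which the task does not mandate), condition (b) could be checked like this:

```python
# Sketch only: checks condition (b) for a path given as a list of visited tile
# numbers, e.g. path = [2, 5, 7, 9]. Condition (a) would additionally require
# comparing len(path) against the optimal step count for Example 1a.
def strictly_increasing(path):
    return all(a < b for a, b in zip(path, path[1:]))
```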

(iii)

Now consider the grid world shown in Example 1b. For the given solution trajectory (red line), give the corresponding sequence of observations and the corresponding sequence of actions. What is the response of the goal predicate $g$ in this case?

(iv)

Recall that in the lecture we are covering simple (policy) search approaches, one of them being random search (see Algorithm 1 below) with $n$ attempts, i.e., $n$ sampled policies. Let us assume that random sampling can only produce policies that always execute valid actions, i.e., it never produces sequences of actions that would lead the agent to run into a wall or a blocked field. What is the probability of the solution policy $\pi^*$ being found by random search in the grid world of Example 1b? Briefly explain your answer.

Algorithm 1 (random search (policy)). Let $\mathcal{A}$ be a set of actions. Let $\mathcal{O}$ be a set of observations. Let $\mathcal{G}$ be a space of goal predicates on policy functions. Let $g \in \mathcal{G}$ be a goal predicate. We assume that the policy space $\Pi$ can be sampled from, i.e., $\mathrm{sample}(\Pi)$ returns a random element from $\Pi$. Random search for $n$ samples is then given via the function $\mathrm{randomsearch}(g, n)$, which draws $\pi_1, \ldots, \pi_n$ via $\mathrm{sample}(\Pi)$ and returns the first $\pi_i$ for which $g(\pi_i)$ holds, or reports failure if no such $\pi_i$ exists.

Hint: You can assume that the probability that a random policy executes a valid action $a_t$ at time step $t$ from the set of valid actions $\mathcal{A}_t$ at time step $t$ is given via $P(a_t) = \frac{1}{|\mathcal{A}_t|}$.
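A minimal Python sketch of this random-search procedure, assuming policies are plain functions from observations to actions; `sample_policy` and the toy data below are placeholder names, not part of the exercise:

```python
import random

def random_search(g, sample_policy, n):
    """Sketch of Algorithm 1: draw n random policies and return one that
    satisfies the goal predicate g, or None if all attempts fail."""
    for _ in range(n):
        policy = sample_policy()   # assumed: returns a random element of the policy space
        if g(policy):              # assumed: g maps a policy to True/False
            return policy
    return None

# Hypothetical sampler: pick one of the valid actions per observation uniformly,
# matching the hint that each valid action has probability 1/|A_t|.
def sample_policy_example():
    valid_actions = {"o1": ["up", "right"], "o2": ["down"]}  # assumed toy data
    choice = {o: random.choice(acts) for o, acts in valid_actions.items()}
    return lambda obs: choice[obs]
```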

(v)

Finally, provide a tile numbering and a goal position in Template 1c such that the probability from (iv) is guaranteed to be 1, and explain why. You may set one blocking tile ⊗ (on any tile except the initial position (S)) that cannot be traversed.


2 Squirrel Environment

For scientific purposes, we want to deploy a SquirrelBot, i.e., a small robotic agent that is able to drive across soil and dig for nuts. It can observe its exact location $p_t \in \mathbb{R}^2$ on a continuous 2D plane representing the accessible soil. In the same plane it can also observe a marked target location $m \in \mathbb{R}^2$ that it wants to navigate to. The value of $m$ is provided by a MemoryAgent that tries to remember all locations where nuts are buried, but to the SquirrelBot that location (like its own location $p_t$) is just part of its observation. The SquirrelBot can execute an action of the form $a_t = (\Delta_t, d_t)$, with a driving offset $\Delta_t \in \mathbb{R}^2$ and a dig flag $d_t \in \{0, 1\}$, once per time step. The action is resolved by the environment by updating the robot's own location by $\Delta_t$ and then digging at the new location if $d_t = 1$. However, all actions that attempt to drive a distance greater than 1 (i.e., $\|\Delta_t\| > 1$) per time step are completely ignored by the environment.
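An illustrative sketch of these dynamics in Python; the tuple encoding of positions and actions, as well as all names, are assumptions rather than part of the exercise:

```python
import math

def environment_step(position, action):
    """Sketch of the described dynamics: action = (delta, dig) with delta in R^2.
    Moves that would drive a distance greater than 1 are ignored entirely."""
    (dx, dy), dig = action
    if math.hypot(dx, dy) > 1.0:
        return position, False                       # action ignored by the environment
    new_position = (position[0] + dx, position[1] + dy)
    # if dig: the environment would dig at new_position (effect omitted in this sketch)
    return new_position, dig
```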

(i)

Assume that the complete state of the system is given by the position $p_t$ of the SquirrelBot at time step $t$, the position $m$ of the marked target location, and a flag $d_t$ marking if the SquirrelBot attempted to dig (after driving) at time step $t$, i.e., the whole system generates a sequence of states $s_0, s_1, \ldots, s_T$ with $s_t = (p_t, m, d_t)$ for some fixed maximum episode length $T$. Give a goal predicate $g$ so that $g(s_0, \ldots, s_T)$ holds if the agent has at one point in time attempted to dig at a location nearer than 1 to the target location $m$.

Note: The state reflects all information contained in both the agent and the environment, as opposed to the observations which may only contain parts of it. Since there is nothing more we could know about the policy, we can thus also define goal predicates to be evaluated on that system state.
Hint: You can use the function $\mathrm{dist}$ to compute the Euclidean distance between two points in $\mathbb{R}^2$, i.e., $\mathrm{dist}(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$.
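A minimal Python sketch of the state representation and the distance hint; the names `State` and `dist` are illustrative choices, not prescribed by the sheet:

```python
import math
from typing import NamedTuple, Tuple

class State(NamedTuple):
    """One system state s_t = (p_t, m, d_t), with the field names assumed here."""
    position: Tuple[float, float]  # p_t: SquirrelBot location
    target: Tuple[float, float]    # m: marked target location
    dug: bool                      # d_t: did the bot attempt to dig this step?

def dist(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Euclidean distance in R^2, matching the hint."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# A goal predicate as asked in (i) would then map a list of State objects to a
# bool, e.g. g(states) -> bool, checking the dig-near-target condition.
```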

(ii)

Given a SquirrelBot with the same state information $s_0, \ldots, s_T$, now also give a goal predicate $g'$ so that $g'(s_0, \ldots, s_T)$ holds if the agent has, presumably running out of robot patience, attempted to dig at every second step for exactly 10 steps, coming closer to the goal each time and reaching the goal by digging (within a distance of less than 0.01) on the final, 10th of these steps, thus completing the trajectory.

(iii)

Assume that the whole plane of soil is without obstacles and thus easily navigable for the SquirrelBot. Give a policy that always fulfills the goal predicate eventually, regardless of the initial state. Also give the policy's type signature.
Hint: You do not need to construct the fastest such policy.
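One simple strategy, sketched in Python under the state and action encoding assumed above (step at most a distance of 1 straight towards the target, then dig once close enough); this is only a sketch, not the policy intended by the lecture:

```python
import math

def squirrel_policy(position, target):
    """Sketch: drive (at most distance 1) straight towards the target and dig
    once we are within distance 1 of it. Names and encoding are assumed."""
    dx, dy = target[0] - position[0], target[1] - position[1]
    d = math.hypot(dx, dy)
    if d <= 1.0:
        return ((dx, dy), True)        # drive the remaining distance and dig there
    return ((dx / d, dy / d), False)   # unit step towards the target, no digging
```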


3 Running Example: Vacuum World

For the various classes of the goal hierarchy it may be intuitively helpful to implement the running examples in code as we develop them formally.

(i) Implement in Python a simple Vacuum World as we have seen it in the lecture. The implementation should adequately model the concepts of observations and actions, so that a policy can observe the world and act on it. A randomly acting policy will suffice initially.
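A possible starting skeleton, purely as a sketch; the exact Vacuum World interface from the lecture may differ:

```python
import random

# Sketch of a two-tile Vacuum World: the agent observes its tile and whether
# that tile is dirty, and can move left/right or suck. All details are assumed.
ACTIONS = ["left", "right", "suck"]

class VacuumWorld:
    def __init__(self):
        self.position = random.choice(["A", "B"])
        self.dirty = {"A": True, "B": True}

    def observe(self):
        return (self.position, self.dirty[self.position])

    def step(self, action):
        if action == "suck":
            self.dirty[self.position] = False
        elif action == "left":
            self.position = "A"
        elif action == "right":
            self.position = "B"

def random_policy(observation):
    return random.choice(ACTIONS)
```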

(ii) In the lecture we have discussed the following goal predicate :

$g(\pi)$ should hold iff the agent does not execute the same action for all observations,

or more formally:

$$g(\pi) \;:\Longleftrightarrow\; \exists\, o_1, o_2 \in \mathcal{O} : \pi(o_1) \neq \pi(o_2).$$

In your Python implementation, collect a number of actions and observations for a policy of your choice, then implement and verify the goal predicate $g$.
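A sketch of what collecting and checking could look like, reusing the skeleton above; the trajectory-based check is only one possible reading of the predicate:

```python
# Sketch: run the random policy for a few steps, record (observation, action)
# pairs, and check that the agent did not always answer with the same action.
world = VacuumWorld()
trace = []
for _ in range(20):
    obs = world.observe()
    act = random_policy(obs)
    trace.append((obs, act))
    world.step(act)

def g(trace):
    actions = {act for _, act in trace}
    return len(actions) > 1   # more than one distinct action was executed

print(g(trace))
```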