V. CASE STUDY AND EXPERIMENTS
We require an abstract representation of the human’s commands
as a strategy in order to use our synthesis approach in a shared
control scenario. We now discuss how such strategies may be
obtained using inverse reinforcement learning, and we report on
the case studies.
A. Experimental setting
We consider two scenarios, the first of which is the wheelchair
scenario from Fig. 1. We model the wheelchair scenario inside
an interactive Python environment.
In the second scenario, we use a tool called AMASE, which
simulates multi-unmanned aerial vehicle (UAV) missions. Its
graphical user interface allows humans to send commands to
one or multiple vehicles at run time. It includes three main
programs: a simulator, a data playback tool, and a scenario
setup tool.
We use the model checker PRISM [19] to verify whether the
computed strategies satisfy the specification. We use the LP
solver Gurobi [14] to check the feasibility of the LP problems
given in Section IV. We also implemented the greedy approach
for strategy repair from [15].
B. Data collection
We asked five participants to accomplish tasks in the wheelchair
scenario. The goal is to move the wheelchair to a target cell in
the gridworld while never occupying the same cell as the moving
obstacle. Similarly, three participants performed the surveillance
task in the AMASE environment.
From the data obtained from each participant, we compute
an individual randomized human strategy σh via maximum-
entropy inverse reinforcement learning (MEIRL) [28]. Refer-
ence [16] uses inverse reinforcement learning to reason about
the human’s commands in a shared control scenario from the
human’s demonstrations; however, it lacks formal guarantees
on the robot’s execution. In [25], inverse reinforcement learning
is used to distinguish human intents with respect to different
tasks. We show the workflow of the case study in Fig. 4.
In our setting, each sample is one particular command of the
participant, and we assume that the participant issues each
command in order to satisfy the specification. Under this
assumption, we can use Hoeffding’s inequality to bound, in
terms of the number of samples, the probability that the
resulting strategy deviates from the actual intent; see [27] for
details. Using these bounds, we can determine the number of
commands required to obtain an approximation of typical human
behavior. The probability of a deviation from the human behavior
is given by O(exp(−nε²)), where n is the number of commands
from the human and ε is the upper bound on the deviation
between the probability of satisfying the specification with the
true human strategy and the probability obtained with the
strategy computed by inverse reinforcement learning. For
example, to ensure an upper bound ε = 0.05 on this deviation
with a probability of 0.99, we require 1060 demonstrations from
the human.

Fig. 4. The setting of the case study for the shared control simulation.
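The sample size of 1060 can be recovered from the two-sided Hoeffding bound 2·exp(−2nε²) ≤ 1 − c; the sketch below assumes this standard form of the inequality (the exact constants in the exponent are our assumption, as the text above only states the asymptotic form):

```python
import math

def required_samples(eps, confidence):
    """Smallest n such that 2 * exp(-2 * n * eps**2) <= 1 - confidence,
    i.e., the two-sided Hoeffding bound on an eps-deviation holds with
    the requested confidence."""
    return math.ceil(math.log(2.0 / (1.0 - confidence)) / (2.0 * eps ** 2))

# eps = 0.05 with confidence 0.99 reproduces the figure in the text:
print(required_samples(0.05, 0.99))  # -> 1060
```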
We design the blending function by assigning a lower weight
to the human strategy at states where the human strategy
yields a lower probability of reaching the target set. Using
this function, we create the autonomy strategy σa and pass it
(together with the blending function) back to the environment.
Note that the blended strategy σha satisfies the specification
by Theorem 1.
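The blending design just described can be sketched as follows; the linear weighting, the dictionary representation of strategies, and all names are illustrative assumptions rather than the exact function used in the experiments:

```python
def blending_weights(reach_prob, lo=0.1, hi=0.9):
    """Assign the human strategy a weight b(s) that grows with the
    probability of reaching the target set under the human strategy,
    so that low-probability states defer more to the autonomy strategy."""
    return {s: lo + (hi - lo) * p for s, p in reach_prob.items()}

def blend(sigma_h, sigma_a, b):
    """Blended strategy: sigma_ha(s, a) = b(s) * sigma_h(s, a)
    + (1 - b(s)) * sigma_a(s, a)."""
    return {s: {a: b[s] * sigma_h[s][a] + (1.0 - b[s]) * sigma_a[s][a]
                for a in sigma_h[s]}
            for s in sigma_h}

# Toy example with two states and two actions:
sigma_h = {"s0": {"left": 0.8, "right": 0.2}, "s1": {"left": 0.5, "right": 0.5}}
sigma_a = {"s0": {"left": 0.1, "right": 0.9}, "s1": {"left": 0.5, "right": 0.5}}
b = blending_weights({"s0": 0.2, "s1": 0.9})   # s0: low reach probability
sigma_ha = blend(sigma_h, sigma_a, b)
```

Since each weight b(s) lies in [0, 1], every σha(s, ·) remains a probability distribution over actions.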
C. Gridworld example

The size of the gridworld in Fig. 1 is variable, and we
generate a number of randomly moving (e.g., the vacuum
cleaner) and stationary obstacles. An agent (e.g., the wheelchair)
moves in the gridworld according to the commands from a
human. For the gridworld scenario, we construct an MDP in
which the states represent the positions of the agent and the
obstacles, and the actions induce changes in the position of
the agent.
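A minimal sketch of such a product MDP’s transition function, assuming a state is a pair (agent cell, obstacle cell), the agent moves deterministically (clipped at the grid border), and the moving obstacle steps to a uniformly random in-grid neighbour; the exact dynamics of the experiments may differ:

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def successors(state, action, n):
    """Successor distribution of the product MDP on an n-by-n grid:
    the action moves the agent deterministically, while the obstacle
    moves to each in-grid neighbour cell with equal probability."""
    (ar, ac), (br, bc) = state
    dr, dc = ACTIONS[action]
    agent = (min(max(ar + dr, 0), n - 1), min(max(ac + dc, 0), n - 1))
    obstacle_moves = [(br + r, bc + c) for r, c in ACTIONS.values()
                      if 0 <= br + r < n and 0 <= bc + c < n]
    p = 1.0 / len(obstacle_moves)
    return [((agent, obstacle), p) for obstacle in obstacle_moves]
```

Enumerating all agent/obstacle placements gives the product state space over which the MDP is defined; the state counts reported below refer to this construction.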
The safety specification requires that the agent reach a target
cell without crashing into an obstacle, with a probability of at
least β ∈ [0, 1]; formally, P≥β(¬crash U target).
First, we report results for one particular participant in a
gridworld scenario with an 8× 8 grid and one moving obstacle.
The resulting MDP has 2,304 states. We compute the human
strategy using MEIRL with features, i.e., the components of
the human’s cost function, given by the distance to the obstacle
and to the goal state.
We instantiate the safety specification with β = 0.7, which
means the target should be reached with a probability of at
least 0.7. The human strategy σh induces a probability of 0.546
of satisfying the specification; that is, it does not satisfy the
specification. We compute the repaired strategy σha using the
greedy and the QCP approaches, and both strategies satisfy the
specification by inducing a probability of satisfying the
specification larger than β. On the one hand, the maximum
deviation between the human strategy σh and σha is 0.15 with
the LP approach, which implies that the strategies of the human
and the autonomy protocol deviate by at most 15% over all
states and actions. On the other hand, the maximum deviation
between the human strategy σh and the blended strategy σha
is 0.03 with the QCP approach. These results show that the
QCP approach computes a blended strategy that is more similar
to the human strategy than the one obtained with the LP
approach.
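The deviation measure used in this comparison can be computed directly from the two strategies; a short sketch, assuming strategies are stored as nested state → action → probability dictionaries:

```python
def max_deviation(sigma_1, sigma_2):
    """Maximal deviation: the maximum over all states s and actions a
    of |sigma_1(s, a) - sigma_2(s, a)|."""
    return max(abs(sigma_1[s][a] - sigma_2[s][a])
               for s in sigma_1 for a in sigma_1[s])

# Toy example (hypothetical values, not the experimental strategies):
sigma_h  = {"s0": {"a": 0.75, "b": 0.25}}
sigma_ha = {"s0": {"a": 0.50, "b": 0.50}}
print(max_deviation(sigma_h, sigma_ha))  # -> 0.25
```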
(a) Strategy σh (b) Strategy σha (c) Strategy σa
Fig. 5. Graphical representation of the obtained human, blended, and autonomy
strategies in the grid.
TABLE I
SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

grid      states    trans.     LP synth.   δLP     QCP synth.   δQCP
8× 8      2,304     36,864     14.12       0.15    31.49        0.03
10× 10    3,600     57,600     23.80       0.24    44.61        0.04
12× 12    14,400    230,400    250.78      0.33    452.27       0.05
Finally, to assess the scalability of our approach, consider
Table I. We generated MDPs for gridworlds with different
numbers of states and obstacles. We list the number of states
in the MDP (labeled “states”) and the number of transitions
(labeled “trans.”). We report the time that the synthesis process
took with the LP approach and the QCP approach (labeled “LP
synth.” and “QCP synth.”), which includes the time to solve
the LPs or QCPs, measured in seconds. It also includes the
model checking time using PRISM for the LP approach. To
represent the optimality of the synthesis, we list the maximal
deviation between the repaired strategy and the human strategy
for the LP and QCP approaches (labeled “δLP” and “δQCP”).
In all of the examples, we observe that the QCP approach yields
autonomy strategies with less deviation from the human strategy
while having computation times similar to those of the LP
approach.
D. UAV mission planning
Similar to the gridworld scenario, we generate an MDP in
which the states denote the positions of the agents and the
obstacles in an AMASE scenario. Consider the AMASE
scenario in Fig. 6. In this scenario, the specification, or
mission, of the agent (blue UAV) is to keep surveilling the
green regions (labeled w1, w2, w3) while avoiding restricted
operating zones (labeled ROZ1, ROZ2) and enemy agents
(purple and green UAVs). As we consider reachability problems,
we asked the participants to visit the regions in a sequence,
i.e., visiting the first region, then the second, and then the third
region. After visiting the third region, the task is to visit the
first region again to perform the surveillance.

For example, if the last visited region is w3, then the
safety specification in this scenario is P≥β((¬crash ∧
¬ROZ) U target), where ROZ denotes visiting the ROZ areas
and target denotes visiting w1.
We synthesize the autonomy protocol on the AMASE
scenario with two enemy agents, which induces an MDP with
15,625 states. We use the same blending function and the same
threshold β = 0.7 as in the gridworld example. The features
used to compute the human strategy with MEIRL are given by
the distance to the closest ROZ, to the enemy agents, and to
the target.

(a) An AMASE simulator. (b) The GUI of AMASE.
Fig. 6. An example of an AMASE scenario.
The human strategy σh induces a probability of 0.163 of
satisfying the specification, so it violates the specification, as
in the gridworld example. Similarly, we compute the repaired
strategy σha with the LP and the QCP approaches, and both
strategies satisfy the specification. On the one hand, the
maximum deviation between σh and σha is 0.42 with the LP
approach, which means the strategies of the human and the
autonomy protocol differ significantly in some states of the
MDP. On the other hand, the QCP approach yields a repaired
strategy σha that is more similar to the human strategy σh,
with a maximum deviation of 0.06. The synthesis procedure
takes 481.31 seconds with the LP approach and 749.18 seconds
with the QCP approach, showing the trade-off between the two
approaches: the LP approach computes a feasible solution
faster, but the resulting blended strategy may be less similar
to the human strategy than the one computed by the QCP
approach.
VI. CONCLUSION AND CRITIQUE
We introduced a formal approach to synthesize an autonomy
protocol in a shared control setting subject to probabilistic
temporal logic specifications. The proposed approach utilizes
inverse reinforcement learning to compute an abstraction of
a human’s behavior as a randomized strategy in a Markov
decision process. We designed an autonomy protocol such
that the robot behavior satisfies safety and performance
specifications given in probabilistic temporal logic. We also
ensured that the resulting robot behavior is as similar to the
behavior induced by the human’s commands as possible. We
synthesized the robot behavior using quasiconvex programming.
We showed the practical usability of our approach through
case studies involving autonomous wheelchair navigation and
unmanned aerial vehicle planning.
There are a number of limitations and possible extensions
of the proposed approach. First of all, we computed a
globally optimal strategy by bisection, which requires checking
the feasibility of a number of linear programming problems. A
convex formulation of the shared control synthesis problem
would make computing the globally optimal strategy more
efficient.
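The bisection scheme referred to above can be sketched as follows, abstracting the LP feasibility check behind a callable (in our experiments this role is played by Gurobi on the LPs of Section IV); names and default bounds are illustrative:

```python
def min_deviation(lp_feasible, lo=0.0, hi=1.0, tol=1e-4):
    """Bisection for the quasiconvex problem: find, up to tolerance tol,
    the smallest deviation bound delta for which lp_feasible(delta)
    holds, i.e., for which a repaired strategy within maximal deviation
    delta still satisfies the specification."""
    if not lp_feasible(hi):
        return None  # infeasible even with the loosest deviation bound
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lp_feasible(mid):
            hi = mid  # feasible: try a tighter deviation bound
        else:
            lo = mid  # infeasible: relax the deviation bound
    return hi
```

Quasiconvexity guarantees that feasibility is monotone in the deviation bound, which is what makes the bisection sound; each iteration halves the search interval, so the number of LP feasibility checks is logarithmic in 1/tol.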
We assumed that the human’s commands are consistent
throughout the whole execution, i.e., the human issues each
command to satisfy the specification. This assumption also
implies that the human does not consider assistance from the
robot while providing commands, and in particular does not
adapt the strategy to the assistance. It may be possible to extend
the approach to handle inconsistent commands by utilizing
additional side information, such as the task specifications.
Finally, in order to generalize the proposed approach to
other task domains, it is worth exploring transfer learning [21]
techniques. Such techniques would allow us to handle different
scenarios without relearning the human strategy from the
human’s commands.