V. CASE STUDY AND EXPERIMENTS

We require an abstract representation of the human’s commands

as a strategy to use our synthesis approach in a shared control

scenario, We now discuss how such strategies may be obtained

using inverse reinforcement learning and report on case study

results.

A. Experimental setting

We consider two scenarios, the first of which is the wheelchair

scenario from Fig. 1. We model the wheelchair scenario inside

an interactive Python environment.

In the second scenario, we use a tool called AMASE1, which

is used to simulate multi-unmanned aircraft vehicles (UAV)

missions. Its graphical user interfaces allow humans to send

commands to the one or multiple vehicles at run time. It

includes three main programs: a simulator, a data playback

tool, and a scenario setup tool.

We use the model checker PRISM 19 to verify if the

computed strategies satisfy the specification. We use the LP

solver Gurobi 14 to check the feasibility of the LP problems

that is given in Section IV. We also implemented the greedy

approach for strategy repair in 15.

B. Data collection

We asked five participants to accomplish tasks in the wheelchair

scenario. The goal is moving the wheelchair to a target cell in

gridworld while never occupying the same cell as the moving

obstacle. Similarly, three participants performed the surveillance

task in the AMASE environment.

From the data obtained from each participant, we compute

an individual randomized human strategy ?h via maximum-

entropy inverse reinforcement learning (MEIRL) 28. Refer-

ence 16 uses inverse reinforcement learning to reason about

the human’s commands in a shared control scenario from

human’s demonstrations. However, they lack formal guarantees

on the robot’s execution. In 25, inverse reinforcement learning

is used to distinguish human intents with respect to different

tasks. We show the work flow of the case study in Figure 4.

In our setting, we denote each sample as one particular

command of the participant, and we assume that the participant

issues the command to satisfy the specification. Under this

assumption, we can bound the probability of a possible

deviation from the actual intent with respect to the number of

samples using Hoeffding’s inequality for the resulting strategy,

see 27 for details. Using these bounds, we can determine

the required number of command to get an approximation

of a typical human behavior. The probability of a possible

deviation from the human behavior is given by O(exp(?n?2)),

where n is the number of commands from the human and ?

is the upper bound on the deviation between the probability

of satisfying the specification with the true human strategy

and the probability obtained by the strategy that is computed

by inverse reinforcement learning. For example, to ensure an

upper bound ? = 0.05 on the deviation of the probability

1https://github.com/afrl-rq/OpenAMASE

Fig. 4. The setting of the case study for the shared control simulation.

of satisfying the specification with a probability of 0.99, we

require 1060 demonstrations from the human.

We design the blending function by assigning a lower weight

in the human strategy at states where the human strategy

yields a lower probability of reaching the target set. Using

this function, we create the autonomy strategy ?a and pass

(together with the blending function) back to the environment.

Note that the blended strategy ?ha satisfies the specification,

by Theorem 1.

C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we

generate a number of randomly moving (e.g., the vacuum

cleaner) and stationary obstacles. An agent (e.g., the wheelchair)

moves in the gridworld according to the commands from a

human. For the gridworld scenario, we construct an MDP and

the states of the MDP represents the positions of the agent and

the obstacles. The actions in the MDP induce changes in the

position of the agent.

The safety specification specifies that the agent has to reach

a target cell while not crashing into an obstacle with a certain

probability ? ? 0, 1, formally P??(¬crash U target).

First, we report results for one particular participant in a

gridworld scenario with a 8× 8 grid and one moving obstacle.

The resulting MDP has 2304 states. We compute the human

strategy using MEIRL with the features, e. g., the components

of the cost function of the human, giving the distance to the

obstacle and the goal state.

We instantiate the safety specification with ? = 0.7, which

means the target should be reached with at least a probability

of 0.7. The human strategy ?h induces a probability of 0.546

to satisfy the specification. That is, it does not satisfy the

specification.

We compute the repaired strategy ?ha using the greedy and

the QCP approach, and both strategies satisfies the specification

by inducing a probability of satisfying the specification larger

than ?. On the one hand, the maximum deviation between

the human strategy ?h and ?ha is 0.15 with the LP approach,

which implies that the strategy of the human and the autonomy

protocol deviates at most 15% for all states and actions. On

the other hand, the maximum deviation between the human

strategy ?h and the blended strategy ?ha is 0.03 with the QCP

approach. The results shows that the QCP approach computes

a blended strategy that induces more similar strategy to the

human strategy compared to the LP approach.

(a) Strategy ?h (b) Strategy ?ah (c) Strategy ?a

Fig. 5. Graphical representation of the obtained human, blended, and autonomy

strategy in the grid.

TABLE I

SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

grid states trans. LP synth. ?LP QCP synth. ?QCP

8× 8 2, 304 36, 864 14.12 0.15 31.49 0.03

10× 10 3, 600 57, 600 23.80 0.24 44.61 0.04

12× 12 14, 400 230, 400 250.78 0.33 452.27 0.05

To finally assess the scalability of our approach, consider

Table I. We generated MDPs for different gridworlds with

different number of states and number of obstacles. We list

the number of states in the MDP (labeled as “states”) and

the number of transitions (labeled as “trans”). We report on

the time that the synthesis process took with the LP approach

and QCP approach (labeled as “LP synth.” and “QCP synth”),

which includes the time of solving the LP or QCPs measured

in seconds. It also includes the model checking times using

PRISM for the LP approach. To represent the optimality of the

synthesis, we list the maximal deviation between the repaired

strategy and the human strategy for the LP and QCP approach

(labeled as “?LP” and “?QCP”). In all of the examples, we

observe that the strategies with the QCP approach yields

autonomy strategies with less deviation to the human strategy

while having similar computation time with the LP approach.

D. UAV mission planning

Similar to the gridworld scenario, we generate an MDP, in

which that of the states MDP denotes the position of the agents

and the obstacles in a AMASE scenario. Consider an AMASE

scenario in Fig. 6. In this scenario, the specification or the

mission of the agent (blue UAV) is to keep surveilling the

green regions (labeled as w1, w2, w3) while avoiding restricted

operating zones (labeled as “ROZ1, ROZ2”) and enemy agents

(purple and green UAVs). As we consider reachability problems,

we asked the participants to visit the regions in a sequence,

i.e., visiting the first region, then second, and then the third

region. After visiting the third region, the task is to visit the

first region again to perform the surveillance.

For example, if the last visited region is w3, then the

safety specification in this scenario is P??((¬crash ?

¬ROZ) U target), where ROZ is to visit the ROZ areas

and target is visiting w1.

We synthesize the autonomy protocol on the AMASE

scenario with two enemy agents that induces an MDP with

15625 states. We use the same blending function and same

(a) An AMASE simulator. (b) The GUI of AMASE.

Fig. 6. An example of an AMASE scenario.

threshold ? = 0.7 as in the gridworld example. The features

to compute the human strategy with MEIRL are given by

the distance to the closest ROZ, enemy agents and the target

region.

The human strategy ?h induces a probability of 0.163 to

satisfy the specification, and it violates the specification like

in the gridworld example. Similar to the gridworld example,

we compute the repaired strategy ?ha with the LP and the

QCP approach, and both strategies satisfy the specification. On

the one hand, the maximum deviation between ?h and ?ha is

0.42 with the LP approach, which means the strategies of the

human and the autonomy protocol are significantly different in

some states of the MDP. On the other hand, the QCP approach

yields a repaired strategy ?ha that is more similar to the human

strategy ?h with the maximum deviation being 0.06. The time

of the synthesis procedure with the LP approach is 481.31

seconds and the computation time with the QCP approach

is 749.18 seconds, showing the trade-offs between the LP

approach and the QCP approach. We see that, the LP approach

can compute a feasible solution slightly faster, however the

resulting blended strategy may be less similar to the human

strategy compared to the QCP approach.

VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesize an autonomy

protocol in a shared control setting subject to probabilistic

temporal logic specifications. The proposed approach utilizes

inverse reinforcement learning to compute an abstraction of

a human’s behavior as a randomized strategy in a Markov

decision process. We designed an autonomy protocol such

that the robot behavior satisfies safety and performance

specifications given in probabilistic temporal logic. We also

ensured that the resulting robot behavior is as similar to the

behavior induced by the human’s commands as possible. We

synthesized the robot behavior using quasiconvex programming.

We showed the practical usability of our approach through

case studies involving autonomous wheelchair navigation and

unmanned aerial vehicle planning.

There is a number of limitations and also possible extensions

of the proposed approach. First of all, we computed an

globally optimal strategy by bisection, which requires checking

feasibility of a number of linear programming problems. A

convex formulation of the shared control synthesis problem

would make computing the globally optimal strategy more

efficient.

We assumed that the human’s commands are consistent

through the whole execution, i. e., the human issues each

command to satisfy the specification. Also, this assumption

implies the human does not consider assistance from the robot

while providing commands – and in particular, the human does

not adapt the strategy to the assistance. It may be possible

to extend the approach to handle non-consistent commands

by utilizing additional side information, such as the task

specifications.

Finally, in order to generalize the proposed approach to

other task domains, it is worth to explore transfer learning 21

techniques. Such techniques will allow us to handle different

scenarios without requiring to relearn the human strategy from

the human’s commands.