Review
Inverse Reinforcement Learning as the Algorithmic Basis for
Theory of Mind: Current Methods and Open Problems
Jaime Ruiz-Serra and Michael S. Harré *
Modelling and Simulation Research Group, School of Computer Science, Faculty of Engineering,
The University of Sydney, Sydney, NSW 2006, Australia
* Correspondence: michael.harre@sydney.edu.au
Abstract: Theory of mind (ToM) is the psychological construct by which we model another’s internal
mental states. Through ToM, we adjust our own behaviour to best suit a social context, and therefore
it is essential to our everyday interactions with others. In adopting an algorithmic (rather than a
psychological or neurological) approach to ToM, we gain insights into cognition that will aid us in
building more accurate models for the cognitive and behavioural sciences, as well as enable artificial
agents to be more proficient in social interactions as they become more embedded in our everyday
lives. Inverse reinforcement learning (IRL) is a class of machine learning methods by which to infer
the preferences (rewards as a function of state) of a decision maker from its behaviour (trajectories
in a Markov decision process). IRL can provide a computational approach for ToM, as recently
outlined by Jara-Ettinger, but this will require a better understanding of the relationship between ToM
concepts and existing IRL methods at the algorithmic level. Here, we provide a review of prominent
IRL algorithms and their formal descriptions, and discuss the applicability of IRL concepts as the
algorithmic basis of a ToM in AI.
Keywords: social cognition; theory of mind; inverse reinforcement learning; artificial intelligence;
cognitive science
task that requires inferring beliefs and intentions of other players from natural language
to negotiate and coordinate with them—by the Cicero AI agent [7]. A key step in their
approach is to model an agent’s action choices by assuming it simultaneously attempts
to maximise the expected value of an action given other players’ actions and minimise
the difference between its action choices and those of a model learned from human behaviour data,
in a reward function that is structurally similar to Equation (41) below. As artificial agents
become more embedded in the world, not only will humans need to take an intentional
stance toward them [8], but artificial agents will need to have that same ability toward
others [9–11]. With these points in mind, the ability to socially integrate AI is important for
its future development, and ToM will play a central role in achieving it.
An important part of such an AI ToM is the specification of desires (goals), beliefs,
and intentions that are fundamental to how agents make choices [12]. Early work in the
intersection between psychology and AI (published the same year as the first paper on
ToM [13]) provided algorithmic methods by which to infer goals and plan structures from
actions, by using linguistic descriptions of action sequences [14] and later extended to
account for differing beliefs between the actor (i.e., the agent whose internal states are
being modelled) and observer (i.e., the agent doing the modelling) [15]. Others showed
semantic representations of the relationship between intentions and beliefs [16]. Notable
work by Yoshida et al. [17] proposed a Game ToM model wherein the value function in
a Markov decision process (MDP) is defined over the joint state spaces of all agents in
the environment. This leads to a recursive optimisation of the joint value function in
each agent up to a certain level of sophistication. Under the assumption that the rewards
are the same for and known by all agents, instead of inferring rewards from behaviour,
it is sufficient to infer the others’ levels of sophistication in order to act strategically. More
recent and oft-cited computational implementations of ToM, Bayesian ToM [18,19] and
Machine ToM [20], seek to recover agent goals as well as their beliefs in an MDP setting.
A class of machine learning methods that is particularly designed to operate in the MDP
framework is inverse reinforcement learning (IRL), the objective of which is to infer the
reward function of an agent from its state–action trajectories. The potential suitability
of preference learning, and IRL in particular, as a computational approach for ToM was
recently outlined by Langley et al. [21] and Jara-Ettinger [22], respectively. IRL has seen
a recent resurgence of interest, with multiple reviews of methods appearing in the last
few years [23–27]. Simultaneously, a growing body of research focuses on computational
approaches to modelling other agents [28–30]. In spite of these contributions, a better
understanding of the relationship between ToM concepts and existing IRL methods at the
algorithmic level is required to adopt IRL as the algorithmic basis of ToM.
Here we provide a review of prominent IRL algorithms and their formal descrip-
tions and discuss the applicability of IRL concepts as foundations for an algorithmic ToM.
Section 2 provides background on IRL, including the conceptual formulation of the prob-
lem, its foundations on reinforcement learning (RL), important concepts and notation,
and its relation to ToM. Section 3 explains the connection between desires and rewards
and reviews algorithmic approaches to two issues that arise: how to discriminate between
different reward functions that equally explain observed behaviour (Section 3.1), and how
to characterise the reward function in the context of the problem (Section 3.2). Section 4
discusses the importance of beliefs in the IRL problem and their interpretation in this
context as relating to transition dynamics (Section 4.1) and state observability (Section 4.2).
Section 5 covers methods that relate to the intentions of an agent, including how suboptimal
behaviour (Section 5.1) and multiple intentions (Section 5.2) are accounted for. Section 6
highlights important and promising considerations for expanding IRL and making it more
suitable as an algorithmic approach to ToM.
2. Background
RL algorithms learn to optimise agent actions given observations of the state of the
agent’s environment, with respect to a reward function. This reward function is the most
succinct representation of a task. We use the terms “reward” and “utility” interchangeably.
Utility has more general connotations and widespread use in economics and game theory,
whereas reward is more common in AI, and specifically in RL. Conceptual efforts in
economics led to the development of theories of rational multiobjective decision making
based on the attributes of each available choice, in what is known as multiattribute utility
theory. A central pillar in this line of work is the quantification of the decision maker’s
preferences [31]. Russell [32] called attention to the lack of work on the computational
aspects of this problem, which he framed in machine learning terms as the dual of RL and named
it IRL. The task was characterised as follows.
Given (1) measurements of an agent’s behaviour over time, in a variety of circumstances,
(2) if needed, measurements of the sensory inputs to that agent, (3) if available, a model of
the environment.
Determine the reward function being optimised.
Under the principle of rationality, a rational agent’s behaviour is driven by a tendency
to optimise for its desires given its beliefs. The intentional stance invokes this principle to
attribute causality for behaviour to mental states [33]. An agent’s reward function is the
driver of its behaviour and can therefore operate as a representation of its desires. On the
other hand, the agent’s beliefs about the world inform what behaviour is appropriate or
feasible, and play a crucial role in planning toward fulfilling its desires. IRL may serve as
an algorithmic paradigm for inferring the mental states (beliefs, desires) of others based on
their observed behaviour (i.e., ToM) [22].
where d_{π,T,t} is the state–action distribution at time t, a result of the agent’s policy and
the environment’s transition probabilities, with s_0 drawn from D. The Bellman equation
provides a way to recursively compute the values of states under a policy. For a given MDP,
V^π satisfies

V^π(s) = ∑_{a∈A} π(a|s) [ R(s, a) + γ ∑_{s′∈S} T(s, a, s′) V^π(s′) ]   ∀ s ∈ S,   (2)

and the corresponding action-value function

Q^π(s, a) = R(s, a) + γ ∑_{s′∈S} T(s, a, s′) V^π(s′)   ∀ s ∈ S, a ∈ A,   (3)

which defines the cumulative reward to be expected from performing action a while in
state s, and can be used to obtain an optimal policy π*(s) ∈ arg max_{a∈A} Q^π(s, a; R) for a
given R.
A useful representation of a policy is its discounted state–action visitation distribution,
or occupancy. A policy π and its occupancy measure µπ can be used interchangeably for
a given environment—occupancy provides a representation of the policy as influenced
by the transition dynamics of the environment. By employing Kronecker delta notation
(δij = 1 if i = j, 0 otherwise), the occupancy is
μ^π(s, a) = E_{(s_t,a_t)∼d_{π,T,t}} [ ∑_{t=0}^{∞} γ^t δ_{s_t s} δ_{a_t a} | D, T, π ]   (4)
and is sufficiently defined through the linear Bellman flow constraints [34]
A feature matrix F ∈ R^{d×|S×A|} can be employed to encapsulate the features for each
state–action pair, F_{(·,s,a)} = φ(s, a), resulting in v^π = F μ^π. Under a linear approximation
of the rewards, V^π (Equation (1)) can alternatively be obtained from features (through
linearity of expectations) with

V^π = θ^T v^π = θ^T F μ^π.   (7)
Feature counts from a given trajectory τ provide a compact representation of the trajectory

Φ_j = Φ(τ_j) = ∑_{t=0}^{H_j} γ^t φ(s_t, a_t) ∈ R^d.   (8)
The measurements of the behaviour of the agent of interest (here actor or expert)
over time are given as demonstrations D, usually taking the form of an unordered set
D = {τ_j^π}_{j=1}^{m} of sequential paths (i.e., trajectories) τ_j^π = ((s, a)_t)_{t=0}^{H_j} of length H_j + 1.
Additional information may be provided in the demonstrations, including feature matrices,
occupancies, etc.
Empirical estimates of occupancy and feature expectations can be computed as average
counts from the observed trajectories

μ̃^π(s, a | τ_j) = (1/(H_j + 1)) ∑_{t=0}^{H_j} γ^t δ_{s_t s} δ_{a_t a},   (9)
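To make the estimate in Equation (9) concrete, the following is a minimal sketch in Python for a tabular MDP; the trajectory format (a list of (state, action) index pairs) and the function name are illustrative assumptions rather than part of any specific IRL implementation.

```python
# Minimal sketch of the empirical occupancy estimate of Equation (9) for a tabular MDP.
# Assumption: a trajectory is a list of (state_index, action_index) pairs.
import numpy as np

def empirical_occupancy(trajectory, n_states, n_actions, gamma=0.99):
    mu = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(trajectory):
        mu[s, a] += gamma ** t          # gamma^t * delta_{s_t s} * delta_{a_t a}
    return mu / len(trajectory)         # divide by H_j + 1 (the trajectory length)

# Averaging over a demonstration set D gives the expert's empirical occupancy:
# mu_E = np.mean([empirical_occupancy(tau, S, A) for tau in D], axis=0)
```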
∑_{a∈A} π(a|s) = 1   ∀ s ∈ S,
E[Q^π(s, a)] = ∑_{a∈A} π(a|s) Q^π(s, a)   ∀ s ∈ S,
π(a|s) = Pr(a | s, R, π) = (1/Z) exp(α Q^π(s, a; R)),   (11)

with normalising constant, or partition function, Z(s, R, π, α) = ∑_{a′∈A} exp(α Q^π(s, a′; R))
and (negative) potential energy Qπ (s, a; R). The hyperparameter α serves as an inverse
temperature parameter defining the steepness of the policy distribution, or how “greedy”
for optimal Q-values it is. This greediness may be understood as the level of rationality
attributed to the agent by the ToM observer. This distribution is known as the Softmax
function in the machine learning literature.
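As an illustration of Equation (11), the sketch below computes a Boltzmann (Softmax) policy from a tabular Q-function; the array shapes and the value of α used in the example are assumptions for illustration.

```python
# Sketch of the Boltzmann action-choice model of Equation (11) for a tabular Q-function
# Q of shape (n_states, n_actions); alpha is the inverse temperature (rationality level).
import numpy as np

def boltzmann_policy(Q: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    logits = alpha * Q
    logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Example: larger alpha makes the policy greedier for high Q-values.
Q = np.array([[1.0, 0.5], [0.2, 0.9]])
print(boltzmann_policy(Q, alpha=5.0))
```

In the limit α → ∞ the policy approaches the deterministic arg max policy, whereas α → 0 yields uniformly random action choices.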
A maximum entropy optimal policy can be obtained through the soft Q-function

Q_soft^π(s, a) = R(s, a) + γ ∑_{s′∈S} T(s, a, s′) V_soft^π(s′),
V_soft^π(s) = log ∑_{a∈A} exp Q_soft^π(s, a),   (12)
where the value function is defined through the LogSumExp function [39]. Actions with
higher Q-values reduce regret, which rational agents are expected to act in accordance with.
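The soft Q-function of Equation (12) can be computed by a fixed-point iteration analogous to value iteration. The sketch below assumes a tabular MDP with a reward array R of shape [S, A] and a transition tensor T of shape [S, A, S]; the function name and iteration budget are illustrative assumptions.

```python
# Sketch of soft (MaxEnt) value iteration for Equation (12).
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, T, gamma=0.95, n_iters=200):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * T @ V            # soft Bellman backup, shape [S, A]
        V = logsumexp(Q, axis=1)         # V_soft(s) = log sum_a exp Q_soft(s, a)
    policy = np.exp(Q - V[:, None])      # the corresponding MaxEnt Boltzmann policy
    return Q, V, policy
```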
The Boltzmann distribution provides a smooth parametric model for the action choice
distribution in a given state (i.e., a policy) that is shaped by the Q-function at the state.
These qualities, along with the alignment with MaxEnt and the degree of freedom in the
temperature hyperparameter, are desirable attributes in modelling rational agents. For
these reasons, a sizable proportion of IRL algorithms resort to a Boltzmann assumption
when characterising the policy (e.g., Sections 3.1.1.6, 3.1.2.2, 3.1.2.3, 3.1.3.2 and 3.1.6.3). It
is common practice to assume the Q-function is given in a converged state or obtained
through dynamic programming (value iteration) or RL methods (e.g., Q-learning). In the
ToM interpretation, the accuracy of the Q-function with respect to the true values of the
MDP encodes, in part, the accuracy of the agent’s beliefs—a core mental attitude of ToM,
as discussed in Section 4. Another core mental attitude in models of ToM is desires,
which are encoded as rewards in rational agent models. In the following section, we
review IRL algorithms whose emphasis is on recovering these rewards, and discuss how
they can provide an effective computational approach to inferring an agent’s desires from
their behaviour.
(T_{a_1} − T_a)(I − γ T_{a_1})^{−1} R ⪰ 0   ∀ a ∈ A \ {a_1},
|R(s)| ≤ R_max   ∀ s ∈ S,

with

|θ_i| ≤ 1,   i = 1, 2, . . . , d,

p(x) = x if x ≥ 0, and p(x) = 2x if x < 0,
where p is a penalty function for states in which π is not optimal under R̂. The penalty
weight value of 2 in p is arbitrarily chosen. The original paper asserts that results were not
sensitive to this value.
Finally, further generalising the method, they introduced an algorithm to find R̂ such
that a policy π to be determined maximises V π when a set D of trajectories τ through S is
given in lieu of π E . This algorithm requires (i) a way to approximate V π (as above), (ii) a
way to find an optimal policy πk under any R (techniques from the RL literature can be
employed to this end), and (iii) the ability to simulate trajectories starting from s0 under
policy π in the MDP.
Approximations of the feature expectations and the value function can be obtained
by performing m Monte Carlo trajectories of length H under π, and averaging over their
values (for a large H, the difference as compared with an infinite time horizon is negligible):
ṽ^π(s_0) = (1/m) ∑_{j=1}^{m} ∑_{t=0}^{H_j} γ^t φ(s_t^{(j)})   (16)
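A minimal sketch of the Monte Carlo estimate in Equation (16) follows; the representation of trajectories as lists of visited states and the feature map signature are assumptions for illustration.

```python
# Sketch of the sample-based feature expectation estimate of Equation (16).
# Assumption: phi(s) returns a d-dimensional NumPy vector; each trajectory is a list of states.
import numpy as np

def estimate_feature_expectations(trajectories, phi, d, gamma=0.99):
    v = np.zeros(d)
    for tau in trajectories:
        for t, s in enumerate(tau):
            v += (gamma ** t) * phi(s)
    return v / len(trajectories)         # average over the m sampled trajectories
```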
In the context of ToM, the given trajectories represent the observer’s knowledge of
the actor’s behaviour. The longer the trajectories and the larger the set of trajectories
(hyperparameters H and m, respectively), the better the observer can be said to know
the actor. The resultant R̂ is the observer’s model of what drives the actor’s behaviour
(i.e., its utilities), which may be used in conjunction with policy estimates π ∈ Π (i.e., its
probabilities) to predict its future behaviour. In Figure 1, we group these two variables
together conceptually as the observer’s model of the agent (green, dashed outline). The
rationality of the actor is based on these two sources of information [41]. The observer
requires a model of the environment (MDP\R) to be able to estimate the model of the
agent. This is a sensible requirement for any agent. In Algorithm 1, it is assumed to
be completely faithful to the real environment (Figure 1, yellow with dashed and solid
outlines, respectively).
One outstanding question is the meaning of the basis functions, or environment fea-
tures, φi in the context of ToM. We place them conceptually within the observer, as depicted
in Figure 1 (orange). The cardinality d of the space Θ in which we perform the linear
approximation of the reward function, and thus its expressivity, depends on how many
features the observer makes use of. Intuitively, they stand for the perceptual acuity of the
observer—the number of different “stimuli” the observer can differentiate amongst and
attribute value to. They are likely to differ from those of the actor—if, indeed, the actor has
such features in the first place; it may not know its subutilities and may simply be guided by its
reward function. Simple examples of features in the scenario of an agent crossing the road include
whether there is a car present, the speed of the car, the state of the pedestrian crossing
traffic lights, etc. Moreover, not only the features, but the state observations themselves
may differ between the actor and the observer (e.g., first-person vs. third-person point of
view). In Ng and Russell [40] they are “given” and fixed.
As new trajectories are observed, the same algorithm can be used to update the weights
if the current θ and πk are used instead of randomly initialising them. We call attention
to the fact that, although this was not stated in the algorithm as presented, it can yield
the set of policies π ∈ Π, as well as their respective ṽπ , Ṽ π , and different reward function
estimates R̂ under which each of the policies was optimised.
Figure 1. Diagram of the max-margin IRL algorithm (see Algorithms 1 and 2). Given trajectories τ E ,
the observer constructs a model of the actor comprising a policy π and reward function R (dashed
green), employing a model of the environment (i.e., a model of the MDP\R, dashed yellow, which is
usually assumed to be a priori known by the observer and equal to the actual environment, yellow)
to generate candidate trajectories τ π . Both trajectories are compared (blue) with the aid of features
φ (orange) that are intrinsic to the observer to update the weights θ. The weights characterise the
reward function in conjunction with the features. Iteratively repeating this process yields a suitable
reward function.
i.e., ψ* is the mixed policy that maximises V^ψ − V^{π_E} for the worst-case possibility for θ*,
a sensible constraint because θ ∗ is unknown. This allows for a zero-sum game formulation,
though only abstractly, so the “players” are not the observer and actor, but the rewards
and the policy (this is the foundational concept of adversarial IRL methods, reviewed in
Section 3.1.7). “Min player” sets the reward by choosing θ, and “max player” chooses a
mixed policy ψ, adversarially. As such, the game can be defined via a d × |Π| game matrix
with G(i, k) = v^k(i) − v^E(i), where i indexes over the feature dimensions d and k over the
policies in Π, the space of policies π. From this, we have
in Von Neumann’s minimax form [44]. The 0 lower bound is explained as follows. The
stricter constraint setting θ ∈ ΘC is equivalent to assuming all the features “got the sign
right” in relation to how they contribute to the reward (because the weights are all positive).
This assumption results in ψ* having higher value than π_E when v^{ψ*} ⪰ v^{π_E}, regardless of
the value of the actual weights θ*.
To solve this optimisation problem, they adapt the multiplicative weights algorithm
from [45]. This algorithm has two main steps. (1) Given min player “strategy” θ, find
ψ∗ = arg maxψ∈Ψ θ T Gψ (i.e., find an optimal policy in the MDP with known R, through
any MDP solver); (2) Given max player “strategy” ψ, compute (θ̇ (i) )T Gψ for each of the d
pure (i.e., one-hot) strategies θ̇ (i) (i.e., compute the feature expectations v ∈ Rd of the given
policy ψ, which can be done by solving d systems of linear equations, or approximated
iteratively). These steps in Algorithm 3 are equivalent to the projection algorithm from [42].
The complexity of these steps scales with the size of MDP\R, and not with G. A similar step
appears in the upper-bound approximation in [46] (Algorithm 4, line 12).
The mixed policy returned by the Multiplicative Weights AL algorithm consists of
a uniform distribution over estimated policies π̂ that are ε_π-optimal, meaning |V(π̂) −
V(π*)| ≤ ε_π. The game matrix G is slightly modified and makes use of ε_v-good feature
expectation estimates, meaning ||v̂ − v^π||_∞ ≤ ε_v.
subject to
v_i ≤ F_i (μ^π − μ^E),   i = 1, . . . , d,   (20)
with resulting stationary policy
π(a|s) = μ^π(s, a) / ∑_{a′∈A} μ^π(s, a′).   (21)
Empirical estimates for occupancy values can be obtained from given trajectories by using
Equation (9).
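Equation (21) can be implemented directly; the sketch below assumes a tabular occupancy array and defaults to a uniform policy in states with zero occupancy, a choice made here purely for illustration.

```python
# Sketch of Equation (21): recovering a stationary policy from an occupancy measure
# mu of shape [S, A].
import numpy as np

def policy_from_occupancy(mu):
    totals = mu.sum(axis=1, keepdims=True)
    uniform = 1.0 / mu.shape[1]
    # Normalise rows with support; fall back to uniform where occupancy is zero.
    return np.where(totals > 0, mu / np.maximum(totals, 1e-12), uniform)
```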
The last three methods we reviewed [34,42,43] are instances of AL. Although the
objective in AL is to learn a policy that resembles the expert’s, as opposed to learning the
reward function, AL and IRL are largely overlapping and share core techniques, specifically
in the two main tasks of policy estimation from observed behaviour (goal of AL), and the
inference of rewards from a given policy (goal of IRL). Knowing an agent’s policy may also
be considered a form of ToM, as it is internal to the agent and reflects its intentions/modus
operandi. The use of these two core tasks in ToM may be better understood through a
simile with the “theory theory” and the simulation theory accounts of mentalising. The
theory theory perspective assumes that we make inferences about hidden mental states
through logic and abstraction, as we do in the natural sciences for the unobservable
causal phenomena of the world. This is similar to the reward learning approach. In the
“simulation theory” account, mental states are represented through perspective-taking, by
using our own cognitive resources to simulate another’s [47,48]. This is similar to the AL
approach (e.g., [42,49]), as well as the less-sophisticated behavioural cloning (BC), whereby
agents learn state–action mappings through supervised learning (with the limitation in
applicability to observed state–action pairs only). The simulation account can be extended
to IRL. For example, in [50] the observer models a human’s reward function by proposing
counterfactual scenarios.
The form of the loss function imposes a margin by which the solution obtained is
better than any other possible solutions. Using the occupancy measures from the expert
policy has the effect of making rewards for high occupancy state–action pairs larger, which
in turn encourages similarity between the policies, as well as discouraging degenerate
solutions [51]. The optimisation of the weights θ is performed through gradient descent by
using the subgradient of the objective function
g_θ = (1/m) ∑_{j=1}^{m} q β_j ((θ^T F_j + l_j^T) μ_θ − θ^T F_j μ_j)^{q−1} F_j (μ_θ − μ_j) + λθ,   (23)
descent on the loss function. Others have taken a similar approach of matching the expert’s
state occupancy through gradient techniques [54].
Combining the likelihood from the demonstrations with a given prior over rewards
P(R), we can obtain the posterior Pr(R | D) ∝ Pr(D | R) P(R).
Two solution methods are used to find this posterior in the literature: gradient-based
methods are used to directly find an (approximate) maximum a posteriori estimate for R,
and Markov chain Monte Carlo (MCMC) methods to approximate the entire posterior dis-
tribution of R [55]; more recently, variational inference-based methods have been proposed
to this end, e.g., [56–58].
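As a schematic illustration of the MCMC route, the sketch below implements a generic Metropolis–Hastings sampler over reward vectors with a Gaussian prior; the `log_likelihood` callable (e.g., the Boltzmann choice model of Equation (11) evaluated on the demonstrated state–action pairs, which in general requires solving the MDP for each candidate R), the proposal, and all names are assumptions, not the specific samplers proposed in the literature.

```python
# Sketch of Metropolis-Hastings sampling from Pr(R | D) ∝ Pr(D | R) P(R) (Bayesian IRL).
# Assumptions: log_likelihood(R, demos) scores the demonstrations under reward vector R;
# the prior is an isotropic Gaussian; the proposal is a symmetric Gaussian random walk.
import numpy as np

def mh_reward_sampler(log_likelihood, demos, n_states, n_samples=1000,
                      step=0.1, prior_std=1.0, seed=0):
    rng = np.random.default_rng(seed)

    def log_posterior(R):
        return log_likelihood(R, demos) - 0.5 * np.sum((R / prior_std) ** 2)

    R = np.zeros(n_states)
    lp = log_posterior(R)
    samples = []
    for _ in range(n_samples):
        R_prop = R + step * rng.normal(size=n_states)   # symmetric proposal
        lp_prop = log_posterior(R_prop)
        if np.log(rng.uniform()) < lp_prop - lp:        # accept with prob min(1, ratio)
            R, lp = R_prop, lp_prop
        samples.append(R.copy())
    return np.array(samples)                            # approximate posterior samples of R
```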
applicable, for closer correspondence with IRL. The algorithms are equivalent if γ = 1 and
φ(s) = 0 for all s that are not leaf nodes.
Their approach models the likelihood of state–action pairs as the occupancy under
the expert’s policy Pr((st , at )| R) = µ E (st , at ) with a Boltzmann distribution assumption
(Section 2.3). Unlike its use in [53] (Section 3.1.1.6), here the normalisation in the likelihood
has to be done over all (st , at ) ∈ τ, which may be intractable depending on the state
space. Fortunately, because the algorithm only uses ratios of the densities, the normalising
constant Z can be discarded, and the resultant likelihood is
For the prior Pr(R), they invoke the principle of maximum entropy to assume the
rewards are independent and identically distributed. Three different prior distribution
candidates are proposed (see the sketch following the list):
• for a prior-agnostic context, a uniform distribution over [−R_max, R_max]^{|S|} or an improper
prior Pr(R) = 1 over R^{|S|};
• for real-world MDPs with parsimonious reward structures, a Gaussian or Laplacian
prior (over R^{|S|}); and
• for planning-type problems, where most states can be expected to have low or negative
rewards, with some having high rewards, a Beta distribution (over R^{|S|}).
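The sketch below writes the three candidate priors as log-densities over a reward vector; the hyperparameters and, in particular, the rescaling of rewards to (0, 1) for the Beta case are illustrative assumptions rather than the parameterisation of the original work.

```python
# Sketch of the three prior families listed above, as log-densities over a reward vector R.
import numpy as np
from scipy.stats import norm, laplace, beta as beta_dist

def log_prior_uniform(R, R_max):
    # Uniform over [-R_max, R_max]^{|S|}: constant inside the box, -inf outside.
    return 0.0 if np.all(np.abs(R) <= R_max) else -np.inf

def log_prior_gaussian(R, sigma=1.0):
    return norm.logpdf(R, scale=sigma).sum()

def log_prior_laplace(R, b=1.0):
    return laplace.logpdf(R, scale=b).sum()

def log_prior_beta(R, R_max, a=0.5, b=0.5):
    # Assumption: rewards are rescaled from [-R_max, R_max] to (0, 1) before applying Beta.
    x = np.clip((R + R_max) / (2.0 * R_max), 1e-6, 1 - 1e-6)
    return beta_dist.logpdf(x, a, b).sum()
```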
With this statistical model and the usual Boltzmann action choice probability as-
sumption (Section 2.3), they derive two MH algorithms: direct sampling from the joint
posterior distribution Pr(π, ρ|τ ), and a hybrid Gibbs sampler procedure with a reward
sequence augmentation of the model with Pr(rt |st , at , ρ E ). These algorithms do not require
the demonstrations to be optimal, and are capable of finding policies that outperform the
agent’s actual policy with respect to its reward function, as well as revealing policies that
perform better than those recovered with previous IRL methods.
requirement of matching feature counts, thereby resolving the ambiguity. It attributes equal
probabilities to trajectories with equal rewards, and exponentially higher probabilities to
trajectories with higher rewards, and does so globally over the trajectories (as opposed to
locally at the action level as is the case in [49,59]).
For observed trajectory realisations τ^E_{j=1:m}, this is equivalent to maximising the like-
lihood of the observed trajectories under a maximum entropy exponential family of
distributions (exp(θ^T Φ(τ)))_{θ∈Θ}. Thus, learning from observation entails finding θ* =
arg max_θ L(θ), where

L(θ) = ∑_{j=1}^{m} log Pr(τ_j^E | θ, T).   (33)
With a partition function Z assumed constant for all (s, a, s′), and assuming the ef-
fects of transition dynamics on behaviour are negligible, the distribution of interest for
nondeterministic MDPs (which extends trivially to deterministic MDPs) is

Pr(τ | θ, T) = ∑_{(s,a,s′)∈τ} T(s, a, s′) exp(θ^T Φ(τ)) / Z(θ, s, a, s′)
            ≈ (exp(θ^T Φ(τ)) / Z(θ, T)) ∏_{(s,a,s′)∈τ} T(s, a, s′).   (34)
The gradient of L(θ ) is the difference between the average feature counts from ob-
served trajectories and the expected feature counts over all trajectories in the MDP. The lat-
ter can be expressed equivalently taking the expectation over states in the MDP instead,
requiring the state visitation frequencies µθ (s)
∇L(θ) = (1/m) ∑_{j=1}^{m} ∑_{s_t∈τ_j^E} φ(s_t) − ∑_τ Pr(τ | θ, T) ∑_{s∈τ} φ(s)
       = (1/m) ∑_{j=1}^{m} ∑_{s_t∈τ_j^E} φ(s_t) − ∑_{s∈S} μ_θ(s) φ(s).
Thus, for the optimal θ, the feature expectations over the MDP match the empirical feature
expectations from the observed trajectories. The state visitation frequencies µθ (s) for an infi-
nite time horizon can be approximated for a large time horizon H by using a sample-based
algorithm (Algorithm 6). The above is equivalent to calculating the feature expectations
ṽπ with γ = 1; thus, here too we try to minimise the difference between trajectory values
between observed trajectories and trajectories from parameterised policy, but avoid actually
computing the policy in favor of using state occupancies obtained with Algorithm 6.
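The sketch below puts these pieces together for a small tabular MDP with state-only features: a soft backward pass yields the MaxEnt policy for the current θ, a forward pass yields the state visitation frequencies μ_θ(s) in the spirit of Algorithm 6, and θ is updated with the gradient above. Array shapes, the horizon H, and all names are illustrative assumptions, not the original implementation of Ziebart et al.

```python
# Compact sketch of MaxEnt IRL for a tabular MDP with state-only features.
# F: [d, S] feature matrix; T: [S, A, S] transition tensor; D0: [S] initial-state
# distribution; demos: list of trajectories, each a list of state indices.
import numpy as np
from scipy.special import logsumexp

def maxent_irl(F, T, D0, demos, H=50, lr=0.05, n_iters=100):
    d, S = F.shape
    # Empirical feature counts averaged over the demonstrations (gamma = 1).
    f_emp = np.mean([sum(F[:, s] for s in tau) for tau in demos], axis=0)
    theta = np.zeros(d)
    for _ in range(n_iters):
        R = F.T @ theta                            # linear reward over states
        # Backward pass: soft value iteration gives the MaxEnt policy.
        V = np.zeros(S)
        for _ in range(H):
            Q = R[:, None] + T @ V                 # [S, A]
            V = logsumexp(Q, axis=1)
        pi = np.exp(Q - V[:, None])                # pi(a|s)
        # Forward pass: accumulate expected state visitation frequencies mu_theta(s).
        mu_t, mu = D0.copy(), np.zeros(S)
        for _ in range(H):
            mu += mu_t
            mu_t = np.einsum('s,sa,sak->k', mu_t, pi, T)
        theta += lr * (f_emp - F @ mu)             # gradient ascent on L(theta)
    return theta
```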
This approach is resilient to expert behaviour being suboptimal (cf. Section 5.1), as well
as the stochasticity of the environment. Although the algorithm is efficient by using all paths
below a fixed length, in their experiments with taxi driver path data Ziebart et al. [52] work
within a smaller, fixed class of reasonable trajectories resulting in significant improvements
in speed.
wherein future state variable outcomes have no effect on preceding variables. The state
transition dynamics follow the Markov property, and thus have a causally conditioned
probability T(S_{1:H} ‖ A_{1:H−1}) = ∏_{t=1}^{H} T(S_t | S_{t−1}, A_{t−1}). An agent’s policy can also be mod-
elled as a causally conditioned probability distribution, though the factors in the product of
probabilities may not be Markovian, so π(A_{1:H} ‖ S_{1:H}) = ∏_{t=1}^{H} π(A_t | A_{1:t−1}, S_{1:t}). The goal
in this framing of the IRL problem is to find the maximum causal entropy (MaxCausalEnt)
policy estimator

π̂* = arg max_{π̂(A_{1:H} ‖ S_{1:H})} E_τ[− log π̂(a_{1:H} ‖ s_{1:H})],   (37)
such that
Eτ [Φ(τ )] = Φ̃(τE ),
Assuming, as is usually the case, that the features decompose linearly in time makes the
optimisation much simpler. The distribution (first-order Markovian policy) that optimises
this constrained problem is a Boltzmann policy (Section 2.3) and takes the recursive form
π_θ(A_t | S_t) = Z_{A_t|S_t,θ} / Z_{S_t,θ},
log Z_{A_t|S_t,θ} = θ^T φ(S_t, A_t) + ∑_{S_{t+1}} P(S_{t+1} | S_t, A_t) log Z_{S_{t+1},θ}   (38)
3.1.3.3. Extensions
Though the MaxEnt approach was groundbreaking and has been adopted as the de
facto canonical model for IRL, it shares shortcomings with other previous methods, such as
reliance on the feature map φ being given, and on a defined model T of the environment’s
transition dynamics. Work that addresses these issues is presented in Sections 3.2 and 4.1,
respectively. Boularias et al. [68] provide a model-free method based on minimising
the relative entropy (KL-divergence) between the empirical distribution of trajectories
produced by a baseline policy and the distribution of demonstrated trajectories produced
by a learned policy. With p(τ ) = Pr(τ ) defined over the space of possible trajectories,
and pπ,T (τ ) = Pr(τ |π, T ) the probability of a trajectory under a policy and transition
dynamics, the objective is
min_p ∑_τ p(τ) ln ( p(τ) / p^{π,T}(τ) )   (39)

with constraints

|E_τ[Φ_i(τ)] − Φ̃_i(τ_E)| ≤ ε_i   ∀ i = 1, . . . , d,
p(τ) ≥ 0   ∀ τ,   (40)
∑_τ p(τ) = 1.
The objective is minimised through stochastic gradient descent, and this method is
capable of learning from small demonstration samples. More recently, Snoswell et al. [69]
provide a model-free MaxEnt IRL method based on a unified view of the MaxEnt and
relative entropy methods that is capable of handling trajectories of variable lengths (with
time complexity linear in longest trajectory length), state-dependent action spaces, and non-
linear reward characterisations (Section 3.2). An approach that is similar to MaxEnt IRL
and extends to continuous time and continuous state and action spaces is presented in [70].
Others have explored the use of semisupervised techniques by including unsupervised
trajectories in addition to expert trajectories in training [71]. Connections of MaxEnt IRL
with GAN and energy-based models have been drawn [72].
The MaxCausalEnt IRL method has been improved by including both (labelled) suc-
cessful and failed demonstrations [73], and by considering its performance degradation
as a result of diverging transition dynamics models in the agent and observer [74]. Its
connections to other methods from econometrics have been studied under a unified per-
spective [75].
A desirability function z(s) = exp(−V (s)) is used to define the optimal control dynamics
π*(s′ | s) = Pr(s′ | s) z(s′) / ∑_ζ Pr(ζ | s) z(ζ).   (42)
When the demonstration sample size is larger than the number of states, the method
can recover the value function analytically, as the MLE of the unconstrained, convex function
which is uniquely defined. The policy and reward function are subsequently recovered
from the value function estimate through z(s). When the size of the demonstration sample
is smaller than the number of states, the (negative) likelihood can be optimised with respect
to z(s) instead, although the resulting function is nonconvex and its optimisation is slower
and susceptible to local minima.
The value function can be represented as a look-up table or approximated as a linear
function of features. Additionally, they suggest a method to automatically initialise and
adapt the features in continuous space, employing Gaussian radial basis function kernels.
A further potential advantage of this method is that it does not require trajectories (s, a),
operating over state transitions (s, s0 ) instead.
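The optimal control law of Equation (42) is straightforward to compute once a value function estimate is available; the sketch below assumes tabular passive dynamics and is only illustrative.

```python
# Sketch of Equation (42): optimally controlled transitions in an LMDP.
# P: [S, S] passive dynamics Pr(s'|s); V: [S] value function estimate.
import numpy as np

def lmdp_control(P, V):
    z = np.exp(-V)                        # desirability z(s) = exp(-V(s))
    unnorm = P * z[None, :]               # Pr(s'|s) * z(s')
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # normalise over s'
```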
Under passive dynamics, Pr(τ | s_0) = ∏_{t=1}^{H} Pr(s_t | s_{t−1}) is the probability of a trajectory.
For the same trajectory to occur when the control dynamics are applied, the probability is

Pr(τ | s_0, π) = Pr(τ | s_0) exp(−∑_{t=0}^{H} r(s_t)) / z(s_0).   (44)
Note the similarity with MaxEnt IRL. Under uniform passive dynamics, MaxEnt IRL
is an equivalent approach for LMDP.
A key contribution of this work is removing the requirement of knowing the transi-
tion dynamics by approximating R through regression, as the second step in the process.
Although the regressors (s, a) are provided, this requires a response variable r̂, obtained
from the Bellman equation with
The resultant dataset is D_R = {(s_t, a_t), r̂_t}_t. However, samples for state–actions that
differ from the expert’s (s_k, a′ ≠ a_k) are needed to reduce the regression error. The authors
address this with a synthetic augmentation of the regression dataset with artificial samples
((s_t = s_k, a′), r_lo)_{t, ∀a′≠π_E(s_t)}. The reward for these samples is set to ensure it is always lower
than that of the expert’s samples: r_lo = min_k r̂_k − 1.
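A minimal sketch of this synthetic augmentation follows; the triple-based data format is an assumption for illustration.

```python
# Sketch of augmenting the regression dataset D_R with artificial low-reward samples
# for non-expert actions. Assumption: expert samples are (state, action, r_hat) triples.
def augment_regression_data(expert_samples, actions):
    r_lo = min(r for _, _, r in expert_samples) - 1.0   # always below expert rewards
    augmented = list(expert_samples)
    for s, a_expert, _ in expert_samples:
        for a in actions:
            if a != a_expert:
                augmented.append((s, a, r_lo))
    return augmented
```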
R(s, a) = Q*(s, a) − γ ∑_{s′∈S} T(s, a, s′) ∑_{a′∈A} π*(s′, a′) Q*(s′, a′)   (47)
shows a one-to-one relationship between Q-functions and rewards. If the transition dynam-
ics are known, all we need to obtain a valid R is the Q-function, because the optimal policy
is assumed to be either deterministic (Equation (3)) or Boltzmann (Equation (11)). Given
the demonstrations and a prior on policies we can obtain an empirical Bayesian estimate of
the policy π̂ (s, a). If the optimal policy is known for a given state, we have Q(s, ·) = 0 for
actions that are suboptimal and uniform across the optimal actions. If the optimal policy is
noisy, Q(s, a) = log(π̂(s, a))/α + V(s), and we can set V arbitrarily. If no information is available
for the policy at a given state, invoking the advantage function A(s, a) = Q(s, a) − V (s) we
have multiple degrees of freedom: arbitrary V (s), and A(s, ·) constrained to be ≤ 0 and
have at least one zero-valued element because every state has at least one optimal action.
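Equation (47) translates directly into a few lines for the tabular case; the array shapes below are illustrative assumptions.

```python
# Sketch of Equation (47): recovering R from a Q-function, transition tensor T ([S, A, S])
# and policy pi ([S, A]), all tabular NumPy arrays.
import numpy as np

def reward_from_q(Q, T, pi, gamma=0.95):
    V = (pi * Q).sum(axis=1)      # V(s') = sum_{a'} pi(a'|s') Q(s', a')
    return Q - gamma * (T @ V)    # R(s, a) = Q(s, a) - gamma * sum_{s'} T(s,a,s') V(s')
```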
c(θ) = ∑_{j=1}^{M} R_j(τ)   (48)
where τ is a trajectory under parameters θ (e.g., generated by the optimal policy for Rθ ) in
the same MDP as the demonstrations. Any off-the-shelf policy search method can be used
to optimise θ, with the authors employing the covariance matrix adaptation evolutionary
strategy (CMA-ES) optimiser.
discerns between synthetic and expert trajectories. The two networks are trained adver-
sarially, resulting in a reward function approximator R̂θ and a policy πω . This shares
similarities with the earlier approach in [43] (Section 3.1.1.3).
The adversarial IRL approach has been extended for metalearning [85,87] (Section 6);
improved with an information bottleneck [88], semantic rewards [89], or end-to-end differ-
entiability through self-attention [90]; and adapted to language-conditioned tasks [91].
In this subsection, we have outlined the many approaches that have been proposed
to discriminate between the several reward functions that could explain a given set of
behavioural demonstrations. Maximum margin methods do so by attempting to maximise
the margin between the chosen reward function and any other alternatives (Section 3.1.1).
Probabilistic methods interpret the rewards as a random variable and the state–action pair
demonstrations as evidence, framing the problem as Bayesian inference to obtain a posterior
distribution over rewards (Section 3.1.2). This is extended in maximum entropy methods,
which seek to account for interdependencies between action choices at the trajectory level
to provide a more accurate way to select from plausible reward functions (Section 3.1.3). A
gradient method is proposed to obtain a MAP estimate of the rewards without needing to
integrate over the entire solution space, showing how most previous methods can be unified
under this perspective (Section 3.1.4). Others, by approximating the environment by using
the LMDP construct, are able to recover the reward function without needing the actions to
be given in the demonstrations (Section 3.1.5). A class of more direct methods exploit the
algebraic definition of the IRL problem to find solutions by means of optimisation techniques
(Section 3.1.6). Finally, adversarial methods train a synthetic policy to generate trajectories
and a discriminator to discern between expert and synthetic trajectories, converging into a
useful reward approximator (Section 3.1.7). All of these approaches assume the solution
space for reward functions is defined a priori. In what ways may this solution space be
defined? In other words, how may these reward functions be characterised?
In Levine et al. [94], the observer learns a regression tree over S to represent the reward
function, with the branching determined by (binary) feature primitives φ(0) (s) ∈ {0, 1}d0 ,
yielding features φ that are logical combinations of these primitives. This way, instead of
minimising a measure of deviation from expert demonstrations as in previous methods,
their algorithm discovers regions of the state–action space where the expressiveness of the
features is insufficient with respect to R, and updates the features accordingly, by iteratively
alternating between an R optimisation step and a φ fitting step. The tree has d leaf nodes
each containing a set of states Si ⊆ S , for i = 1, . . . , d. The features can be interpreted
as indicator functions φi (s) = I (s ∈ Si ). Features deeper in the tree are more complex
combinations of feature primitives.
For the optimisation step, R is constrained by D, because the optimal policy under
R must be consistent with the demonstrations, and by the current features φ, so that R must
minimise the sum-of-squares error with its projection onto the feature space. The projection
is performed by means of G_{Rφ} ∈ R^{d×|S|} and G_{φR} ∈ R^{|S|×d}, defined to be
G_{Rφ}(S_i, s) = |S_i|^{−1} if s ∈ S_i, and 0 otherwise;   G_{φR}(s, S_i) = 1 if s ∈ S_i, and 0 otherwise,
so that the vector G_{φR} G_{Rφ} R ∈ R^{|S|} encodes the reward for each state, computed as the
average reward over the states in the S_i that s belongs to. They set the optimisation step as
a sparse quadratic program

min_{R, R_φ, V}  (1/(|S||A|)) ||R − G_{φR} R_φ||_2^2 + (λ/K) ||N R_φ||_1,   (52)

such that

R_φ = G_{Rφ} R,
V(s) = R(s, a) + γ ∑_{s′} T(s, a, s′) V(s′)   ∀ (s, a) ∈ D,
V(s) ≥ R(s, a) + γ ∑_{s′} T(s, a, s′) V(s′) + ε   ∀ s ∈ D, (s, a) ∉ D,
V(s) ≥ R(s, a) + γ ∑_{s′} T(s, a, s′) V(s′)   ∀ s ∉ D,
where the regularisation term discourages similar features from taking new values by
employing a sparse matrix N ∈ R^{K×d} of feature distances where each row k out of K =
d(d − 1)/2 corresponds to a pair of features, and N_{k,i} = −N_{k,i′} = ∆(φ_i, φ_{i′}). The use of
the ℓ1-penalty for this term is justified by the preference for potentially mergeable features
to be very similar to each other, rather than having minimal distance to all others. In the
feature optimisation step, a reward function candidate is computed at each node with

R̂(s, a) = |S_i|^{−1} ∑_{s∈S_i} R(s, a) if s ∈ S_i, and R(s, a) otherwise,   (53)
and the corresponding optimal policy is trained with value iteration. If the optimal policy for
R̂ is consistent with D, the node is set as a leaf node, R ← R̂, and the iteration terminates. The
feature distance measure ∆(φi , φi0 ) is defined to be proportional to the depth of the deepest
common parent node for φi and φi0 and acts as a measure against overfitting. Additionally,
the maximum allowed depth of the tree is increased with each iteration.
Their algorithm reaches convergence in very few iterations consistently. It does not
scale to continuous space because it needs to enumerate all s ∈ S for the optimisation
step, though approximation techniques may be used to construct a tractable set of con-
straints to allow for this. Incorporating priors in the fitting step may make learning more
efficient. Other regression techniques (including neural networks) can be used instead of
regression trees.
A limitation of the above nonlinear reward function methods is that they assume
optimal demonstrations. Two concurrent but differing works [95,96] leverage Gaussian
processes (GP) to learn nonlinear reward functions of the features that do not require the
expert behaviour to be optimal. Furthermore, unlike the above methods, which use the
max-margin heuristic to discriminate between reward functions, they are probabilistic.
Jin et al. [95] extend [42]’s projection method to continuous spaces by using kernels (GP).
The use of kernel machines has issues with scalability, with complexity increasing with the
amount of data, and requires large numbers of training samples for tasks with high
variability in the reward structure [97]. Grounded on the MaxEnt perspective, the algo-
rithm in Levine et al. [96] learns a reward function and a kernel function by means of a
probabilistic model of the demonstrations and a GP prior on rewards. The learned kernel
function comprises feature weights that capture the relevance of each feature to the agent’s
reward function, an important capability from the ToM perspective. Though they use the
mean posterior of the learned reward distribution, they suggest that the entire distribution
could be used for different exploration/exploitation tradeoffs in policies, or to elicit more
information for regions of high uncertainty. Because it is linear in state, it may not converge
in large spaces. This was addressed in subsequent work by local approximation of the
reward function likelihood [98].
Kim and Park [99] extend the original AL method [42] with kernels (reproducing
kernels), simplifying the training and making it robust to local optima and both robust to
and efficient with small demonstration samples.
Choi and Kim [100] propose a nonparametric method to construct the features based on
Bayesian IRL. These features are again constructed from logical combinations of primitives.
The number of features does not need to be defined beforehand. The prior is an Indian
buffet process (IBP).
An alternative proposed by Michini and How [101] is to partition the demonstra-
tions into smaller subtrajectories to simplify the complexity requirements of the reward
function approximator. Interpreting them as subgoals, simpler reward functions are then
obtained for each. They contribute a Bayesian, nonparametric algorithm that automates
the partitioning based on a generative model. With a Chinese restaurant process (CRP)
prior, the number of partitions does not need to be predetermined and is unbounded. This
has a number of advantages. A subgoal may be as simple as a single state or
feature, so sparse reward functions can be obtained through this method. It also removes
sequential dependencies, making it robust to changes in the initial conditions and better
able to handle cyclic trajectories.
Metelli et al. [82] induce the features which, taken as basis functions, span the subspace
of reward functions for which the policy gradient is zero (i.e., under which the policy is
optimal). The reward function under which deviations from the demonstrations incur the
highest penalty is selected from this subspace.
The advent of deep architectures provided a way to learn reward functions directly
from “raw” state representations (such as images). Wulfmeier et al. [97] leverage neural networks
trained through backpropagation, under the MaxEnt paradigm, to approximate complex,
nonlinear reward functions. The features may be learned by the network (e.g., convolu-
tional NN for visual states), without having to rely on handcrafted (given) feature functions.
Neural networks aptly scale to complex reward structures in large state spaces. As the
computational complexity of this method does not increase with the number of demon-
strations, it is suitable for lifelong learning—a desideratum for ToM-IRL. However, it
requires access to the MDP to train a policy at each iteration. Wulfmeier et al. [102] extend
their deep MaxEnt IRL approach [97] with new architectures for more complex environ-
ments. Their approach is shown to be scalable to large demonstration datasets. Similarly,
Bogdanovic et al. [103] demonstrate learning to play simple video games in pixel state
spaces from expert demonstrations with deep AL. They also show that their method can be
extended with an approach similar to [79] to retrieve the reward function [104]. NN are
also used to approximate R in [105], avoiding the need to solve the MDP at each iteration.
Others propose a binomial logistic regression classifier-based method to learn the value
and (nonlinear) reward functions without needing to solve the forward MDP [106].
Training models through Bayesian variational inference has been successful in un-
covering nonlinear reward functions. Jin et al. [56] employ deep GP to concurrently learn
abstract representations of state features and the reward function. The reward function
is modelled as a zero-mean GP prior as in [96], and representations are learned through
stacked latent GP layers. Bayesian neural networks (BNN) are finite-dimensional equiv-
alents to GP. Roa-Vicens et al. [57] apply BNN to solve the IRL problem by exploiting
their ability to robustly and efficiently characterise a reward function from point estimates
obtained by MaxCausalEnt. The process consists of an inference step optimising the likeli-
hood of the demonstrations to obtain point estimates of the rewards, and a learning step
that uses the point estimates to train a BNN mapping features to rewards.
Two approximations to MaxCausalEnt IRL for tasks with unknown dynamics have
been proposed: Finn et al. [107] address these issues with an adversarial, sample-based
approximation algorithm for MaxEnt IRL that is capable of learning nonlinear reward
functions as well as efficiently scaling to continuous, high-dimensional state spaces, with-
out relying on a transition dynamics model. Fu et al. introduce adversarial IRL (AIRL) [86].
Focusing on scalability to large, high-dimensional tasks with unknown dynamics, their
algorithm obtains reward functions with robustness to changes in environment dynamics,
thereby being able to generalise better beyond training. Following [97], they use a NN
as a reward function approximator (i.e., there is no need for feature map). Furthermore,
by estimating the gradient through sampling, it does not require a transition dynamics
model to be given (but it requires the MDP to simulate in).
In this subsection, we have seen the important role that features play in characterising
the reward function. The expressivity of the reward function has a direct dependency on
the complexity of the features and their relationships. It is important for our discussion
to note that the features in the algorithms belong, phenomenologically, in the observer.
Though the agent’s decision making does indeed depend on features—things in the world
that can be perceived by it—there may or may not be an overlap in the features that the
agent and the observer perceive, depending on how the problem is framed conceptually.
As a simple illustrative example, consider a blind person walking on the street. As they
navigate by using tactile and auditory features, one may infer their “reward” function
(e.g., where they want to go) based on visual features that they are certainly not making use
of. Future IRL approaches in the context of ToM could benefit from preemptively selecting
features based on the type the agent is perceived to be, as a form of perspective-taking. This
could be achieved by means of the priors that some of the algorithms above have available
as “stored sources” of information, to be used in combination with “immediate sources”
observed from the external world [108]. Sections 4.1 and 4.2 provide support for this point
of view.
Learning an agent’s exact R* is usually not possible, nor is it necessary, because the
purpose of knowing R is to act strategically in the context of a particular interaction [46]. This is
supported by Samuelson’s Theory of Revealed Preferences [37], which states that consumer
behaviour is the most reliable indicator of their preferences (read utilities or rewards).
The identifiability of the reward function has been flagged as a fundamental
problem in IRL since its conception, but only recently has a formal analysis of the problem been
undertaken. Kim et al. [109] formalise the problem and show its relation to properties of
the MDP, providing algorithms to establish whether an MDP’s rewards are identifiable.
This analysis is extended by Cao et al. [110], finding that a single reward function is not
identifiable even if the optimal policy is fully known, and that because the value function
parameterises the reward space, it is all that is required in conjunction with the optimal
policy to recover a suitable reward function (cf. Section 3.1.6.1). Interestingly, they also
show that, in the absence of a value function, rewards can be uniquely identified up to a
constant if a policy under different discount factors or transition dynamics is given. This
highlights the importance of parameters of the MDP (γ, T) beyond the reward function.
These recent findings ought to be incorporated into any new IRL algorithms.
observing the agent in tasks with known rewards and subsequently learning the parameter-
isation θ = (θ j=1:m , θ E ) (requiring the real transition dynamics to be known) [120]. Others
perform reward learning with biased beliefs about dynamics [121], study degradation in
performance as the transition functions differ between actor and observer [74], or study
the impacts of changes in the environment dynamics [122,123]. In earlier work, internal
dynamics models are learnt from demonstrations without learning the reward, in a subset
of tasks with linear-Gaussian dynamics and quadratic rewards [124], or selecting from a
discrete set of candidate models [125,126].
5. Agent’s Intentions
Intentions reflect an agent’s commitment to acting, guided by their beliefs, toward
states of the world that align with their desires. In the MDP formulation, intentions manifest
as the selection of actions so as to maximise expected utility, i.e., in the agent’s policy. Here
we take a brief look at IRL methods that have taken the agent’s intentions into account, be
it through considering potentially suboptimal behaviour with respect to the true rewards
(Section 5.1), or the possibility that agents have multiple intentions (Section 5.2).
state spaces (e.g., robotics), others propose metalearning the reward function parameters,
finding a parameterisation for each of the provided demonstrations by assuming the reward
weights are close to the mean of the weights over the tasks [139].
Nonparametric methods do not require the number of different reward functions to
be known beforehand. Bayesian nonparametric methods have been proposed to achieve
this, extending previous parametric clustering methods with the structured generalisation
of Bayesian IRL from [62] in [140], or using a Dirichlet process mixture model to draw
cluster assignments and reward functions for each cluster through a MCMC algorithm,
with the ability to transfer modelled information to new observations [141]. MaxEnt
methods combining Dirichlet process-based clustering of demonstrations have also been
proposed, including a gradient-based solution based on a Lagrangian relaxation of the re-
sulting nonlinear optimisation problem [142], and employing a deep reward network [143].
Others extend this thinking to continuous action spaces via the path integral MaxEnt
method from [70] with hierarchical clustering [144]. A more recent method based on con-
textual MDPs is able to learn from different experts with nonstationary policies without an
assumption of optimality, employing subgradient-based optimisation [145].
As we have seen, mentalising complex agents in the real world will require algorithms
that can handle discrepancies between intentions and behaviour—manifesting as subop-
timal behaviour with respect to the true reward function, as well as the possibility that
agents have multiple intentions they behave in accordance with, and whose number may
be unknown. Despite the number of operational issues that IRL needs to overcome to be
a practicable algorithmic basis for ToM, our review has shown that there is a wealth of
methods that aptly address each or some combination of them. What remains to be accom-
plished is the development of methods capable of modelling not only desires, but beliefs
and intentions too, and to do so in large and complex spaces with degrees of uncertainty.
For these methods to be truly effective, they ought to heed the considerations in the fol-
lowing section, toward which valuable contributions have been made independently, as
we outline.
6. Further Considerations
Having reviewed the main approaches to IRL and how they relate to desires, beliefs,
and intentions, here we outline some remaining important considerations and open chal-
lenges. IRL approaches to ToM need to be able to handle vast, complex state spaces to be
successful in the real world. A number of methods made advances toward extrapolating to
a large state space from demonstrations in a small subset of the space, through minimising
the relative entropy between the observed and a learned policy’s trajectories [68], by using
local approximations of the reward function [98], or employing deep neural networks
(DNN) as reward function approximators [102,146] or feature encoders [147]. Others scale
Bayesian IRL to large state spaces, through approximate variational inference (with the
additional advantage of not requiring it to solve the forward MDP at each iteration) [58],
or leveraging multiple RL algorithms with different configurations as approximators to
create a multifidelity Bayesian optimization framework [148]. Recent work has shown
promising results in large state spaces such as pixel inputs in the Atari suite [133,147,149],
or real-world driving data [69,150].
The number of demonstrations required for accurate modelling is an important con-
sideration for inference, and thus for ToM algorithms. Feature expectation estimates can
be obtained from observed trajectories, and (under a linear reward parameterisation) the
number of expert demonstrations required scales with the dimensions of the features d,
but not with |S| or the complexity of π E . Estimating the actor’s policy from empirical
averages has the advantage of not requiring a transition dynamics model to be known. On
the other hand, it requires large amounts of trajectory data to be accurate, as well as being
limited to states that are visited by the actor. This may be addressed through synthetic data,
such as generating trajectories from the learned reward function mimicking the expert to
generalise the expert’s actions to unseen state space regions [35]. Sample efficiency also
affects adaptation to new tasks (e.g., new agents, or new contexts). Metalearning methods
seek to uncover the structural similarities of different tasks to be able to more readily
adapt to new tasks. Metalearning has been used to learn effective initialisations of the reward network
parameters in AIRL [87,139], to learn the similarities between tasks to build a prior,
showing good performance in navigation tasks from pixels [151], and to adapt to small, state-only
demonstration samples by conditioning the function approximators on context [152]. These,
however, require known transition dynamics, a shortcoming addressed by disentangling
the reward function from the environment dynamics through probabilistic embeddings,
adapting to different tasks from single demonstrations through conditioning the rewards
and policy on a latent context variable [85].
Another important consideration is how noisy or incomplete the observations are.
The effect of noise in features can be mitigated by propagating information between
states [153]. Incomplete trajectories, for example due to occlusion, have been addressed
with a generalisation of the MaxEnt IRL approach [154]. Work has also gone into establishing
whether an observation is sufficient to recover a (linear) reward function, allowing
new information to be included incrementally and irrelevant features to be identified [155].
Both occlusions in trajectories and noisy perception by the observer are addressed with an
approach grounded in Bayesian IRL (Section 3.1.2.2) and the MAP inference generalisation
(Section 3.1.4) [156]. In more realistic settings, the actions at each timestep may not be
available and the observer may need to work with state-only trajectories. This is known as
the imitation from observation (IfO) problem (see [157] for a recent review). Some existing
IRL algorithms naturally extend to this setting, e.g., [76,149,152]. IfO considers challenges
relevant to ToM, such as perceptual encoding (vision, proprioception) [158], embodiment
mismatch [159], and differences in viewpoint [160].
A clear direction for expansion of IRL methods, and especially so for their applicability
as algorithmic ToM, is to settings where the observer not only observes but also acts in
the environment or is able to communicate with the actor. In the cooperative setting,
although the reward function may be common to both, the policies must complement
each other in maximising rewards [161]. Incorporating human feedback in the learning
process can be done by querying the expert for actions at specific states [55], correcting
suboptimal behaviour as it occurs [162], providing pairwise preferences between segments
of trajectories [163], or evaluating counterfactual (“what if?”) scenarios generated by the
observer and thus reducing the number of interactions with the environment [50]. Human
expertise can also be used to teach features to the observer [164].
The natural and most promising extension of the participative setting is to interactions
where both agents have ToM abilities. This gives rise to game-theoretic considerations,
as both agents model each other's strategies. In psychological game theory, payoffs associated
with emotions such as guilt or anger are operationalised into the utility function,
going beyond the material-payoff-based utility of classical game theory [29]. This framework
has been used to predict behaviour in cooperative games [165] and to model the
perception of others' intentions [166,167]. Emotions and mental states are closely interrelated,
and computational ToM approaches may benefit from incorporating empathy and
affective mentalising, as well as providing a foundation for developing standalone models
thereof [168,169].
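As a toy example of how an emotion-dependent payoff might be operationalised, the sketch below augments a material payoff with a guilt term that penalises letting the other player fall short of what the agent believes the other expected to receive. Both the functional form and the guilt_sensitivity parameter are illustrative assumptions rather than the specific models used in [29,165].

```python
def guilt_augmented_utility(material_payoff, believed_other_expectation,
                            other_actual_payoff, guilt_sensitivity=0.5):
    """Toy psychological-game-theoretic utility.

    The agent's utility is its material payoff minus a guilt penalty that
    grows with how far the other player's actual payoff falls below what
    the agent believes the other player expected to receive.
    """
    let_down = max(believed_other_expectation - other_actual_payoff, 0.0)
    return material_payoff - guilt_sensitivity * let_down

# The same material payoff is worth less when the other player is let down.
print(guilt_augmented_utility(3.0, believed_other_expectation=2.0, other_actual_payoff=0.0))
print(guilt_augmented_utility(3.0, believed_other_expectation=2.0, other_actual_payoff=2.0))
```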
The overlap between IRL and game theory is studied in the game identification
literature in econometrics [170,171], wherein the payoffs are estimated from behaviour
analytically. Others do so algorithmically, by employing the game-theoretic concept of regret
in conjunction with MaxEnt [172], by using efficient linear programming in succinct games [173],
or by learning both the system dynamics and the reward function of multiagent nonzero-sum
multistage games [174]. The extension from two-player games to the multiagent setting
(e.g., [118,175–177]) is nontrivial and may result in emergent phenomena, particularly when
sophisticated ToM is present.
7. Conclusions
We have provided background on the IRL problem, reviewed the main algorithmic
approaches with their formal descriptions, and discussed the applicability of IRL concepts
as the algorithmic basis of a ToM in AI. The main goal in the IRL problem is to retrieve
the reward function that best explains an agent’s behaviour—the agent’s desires in the
ToM context. The foremost challenge in IRL comes from it being an ill-posed problem:
a policy may be optimal under any number of reward functions, including degenerate
ones; therefore, algorithms must incorporate heuristics to discriminate between solutions.
Another important consideration is how the reward function is characterised, usually as a
function of features of the environment; different approaches have been taken to define these
features and the structure of the function.
Some IRL methods also address other core ToM attitudes: beliefs, both about the environmental
dynamics (modelled as the transition probabilities) and about the states (in the form of
observations); and intentions, through the treatment of observed behaviour that may be
suboptimal with respect to the agent's true goals and through the modelling of multiple intentions.
Further considerations have been addressed in IRL algorithms, including the size and
complexity of the state space, sample efficiency and robustness to noisy or incomplete
observations, the participation of the observer in the environment and its game-theoretical
and recursive consequences, ToM as introspection, the incorporation of prior knowledge
and structured representations, and the environments and benchmarks available for further
research and applications.
As demonstrated in this review, the IRL framework encapsulates the core elements of
ToM succinctly while providing enough flexibility for a wide variety of solution methods
and extensions to be developed. As such, it holds great promise as a cradle for the
algorithmic basis of a ToM in AI.
References
1. Frith, C.; Frith, U. Theory of Mind. Curr. Biol. 2005, 15, R644–R645. [CrossRef] [PubMed]
2. Dennett, D.C. Précis of The Intentional Stance. Behav. Brain Sci. 1988, 11, 495–505. [CrossRef]
3. Shevlin, H.; Halina, M. Apply Rich Psychological Terms in AI with Care. Nat. Mach. Intell. 2019, 1, 165–167. [CrossRef]
4. Mitchell, J.P. Mentalizing and Marr: An Information Processing Approach to the Study of Social Cognition. Brain Res. 2006,
1079, 66–75. [CrossRef]
5. Lockwood, P.L.; Apps, M.A.J.; Chang, S.W.C. Is There a ‘Social’ Brain? Implementations and Algorithms. Trends Cogn. Sci. 2020,
24, 802–813. [CrossRef]
6. Rusch, T.; Steixner-Kumar, S.; Doshi, P.; Spezio, M.; Gläscher, J. Theory of Mind and Decision Science: Towards a Typology of
Tasks and Computational Models. Neuropsychologia 2020, 146, 107488. [CrossRef]
7. Bakhtin, A.; Brown, N.; Dinan, E.; Farina, G.; Flaherty, C.; Fried, D.; Goff, A.; Gray, J.; Hu, H.; Jacob, A.P.; et al. Human-Level Play
in the Game of Diplomacy by Combining Language Models with Strategic Reasoning. Science 2022, 378, 1067–1074. [CrossRef]
8. Perez-Osorio, J.; Wykowska, A. Adopting the Intentional Stance toward Natural and Artificial Agents. Philos. Psychol. 2020,
33, 369–395. [CrossRef]
9. Harré, M.S. Information Theory for Agents in Artificial Intelligence, Psychology, and Economics. Entropy 2021, 23, 310. [CrossRef]
10. Williams, J.; Fiore, S.M.; Jentsch, F. Supporting Artificial Social Intelligence With Theory of Mind. Front. Artif. Intell. 2022,
5, 750763. [CrossRef]
11. Ho, M.K.; Saxe, R.; Cushman, F. Planning with Theory of Mind. Trends Cogn. Sci. 2022, 26, 959–971. [CrossRef]
12. Cohen, P.R.; Levesque, H.J. Intention Is Choice with Commitment. Artif. Intell. 1990, 42, 213–261. [CrossRef]
13. Premack, D.; Woodruff, G. Does the Chimpanzee Have a Theory of Mind? Behav. Brain Sci. 1978, 1, 515–526. [CrossRef]
14. Schmidt, C.F.; Sridharan, N.S.; Goodson, J.L. The Plan Recognition Problem: An Intersection of Psychology and Artificial
Intelligence. Artif. Intell. 1978, 11, 45–83. [CrossRef]
15. Pollack, M.E. A Model of Plan Inference That Distinguishes between the Beliefs of Actors and Observers. In Proceedings of the
24th Annual Meeting on Association for Computational Linguistics (ACL ’86), New York, NY, USA, 24–27 June 1986; pp. 207–214.
[CrossRef]
16. Konolige, K.; Pollack, M.E. A Representationalist Theory of Intention. In Proceedings of the 13th International Joint Conference
on Artifical Intelligence (IJCAI ’93), Chambery, France, 28 August–3 September 1993; Morgan Kaufmann Publishers Inc.: San
Francisco, CA, USA, 1993; Volume 1, pp. 390–395.
17. Yoshida, W.; Dolan, R.J.; Friston, K.J. Game Theory of Mind. PLoS Comput. Biol. 2008, 4, e1000254. [CrossRef]
18. Baker, C.; Saxe, R.; Tenenbaum, J. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution. In Proceedings of the
Annual Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; pp. 2469–2474.
19. Baker, C.L.; Jara-Ettinger, J.; Saxe, R.; Tenenbaum, J. Rational Quantitative Attribution of Beliefs, Desires and Percepts in Human
Mentalizing. Nat. Hum. Behav. 2017, 1, 64. [CrossRef]
20. Rabinowitz, N.; Perbet, F.; Song, F.; Zhang, C.; Eslami, S.M.A.; Botvinick, M. Machine Theory of Mind. In Proceedings of the 35th
International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4218–4227.
21. Langley, C.; Cirstea, B.I.; Cuzzolin, F.; Sahakian, B.J. Theory of Mind and Preference Learning at the Interface of Cognitive Science,
Neuroscience, and AI: A Review. Front. Artif. Intell. 2022, 5, 62. [CrossRef]
22. Jara-Ettinger, J. Theory of Mind as Inverse Reinforcement Learning. Curr. Opin. Behav. Sci. 2019, 29, 105–110. [CrossRef]
23. Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J.A.; Abbeel, P.; Peters, J. An Algorithmic Perspective on Imitation Learning. Found. Trends Robot. 2018, 7, 1–179. [CrossRef]
24. Ab Azar, N.; Shahmansoorian, A.; Davoudi, M. From Inverse Optimal Control to Inverse Reinforcement Learning: A Historical
Review. Annu. Rev. Control 2020, 50, 119–138. [CrossRef]
25. Arora, S.; Doshi, P. A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. Artif. Intell. 2021,
297, 103500. [CrossRef]
26. Shah, S.I.H.; De Pietro, G. An Overview of Inverse Reinforcement Learning Techniques. Intell. Environ. 2021, 29, 202–212.
[CrossRef]
27. Adams, S.; Cody, T.; Beling, P.A. A Survey of Inverse Reinforcement Learning. Artif. Intell. Rev. 2022, 55, 4307–4346. [CrossRef]
28. Albrecht, S.V.; Stone, P. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. Artif.
Intell. 2018, 258, 66–95. [CrossRef]
29. González, B.; Chang, L.J. Computational Models of Mentalizing. In The Neural Basis of Mentalizing; Gilead, M., Ochsner, K.N.,
Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 299–315. [CrossRef]
30. Kennington, C. Understanding Intention for Machine Theory of Mind: A Position Paper. In Proceedings of the 31st IEEE
International Conference on Robot and Human Interactive Communication (RO-MAN), Naples, Italy, 29 August–2 September
2022; pp. 450–453. [CrossRef]
31. Keeney, R.L. Multiattribute Utility Analysis—A Brief Survey. In Systems Theory in the Social Sciences: Stochastic and Control Systems
Pattern Recognition Fuzzy Analysis Simulation Behavioral Models; Bossel, H., Klaczko, S., Müller, N., Eds.; Interdisciplinary Systems
Research/Interdisziplinäre Systemforschung, Birkhäuser: Basel, Switzerland, 1976; pp. 534–550. [CrossRef]
32. Russell, S. Learning Agents for Uncertain Environments (Extended Abstract). In Proceedings of the Eleventh Annual Conference
on Computational Learning Theory (COLT ’98), Madison, WI, USA, 24–26 July 1998; Association for Computing Machinery: New
York, NY, USA, 1998; pp. 101–103. [CrossRef]
33. Baker, C.L.; Tenenbaum, J.B.; Saxe, R.R. Bayesian Models of Human Action Understanding. In Proceedings of the 18th
International Conference on Neural Information Processing Systems (NIPS ’05), Vancouver, BC, Canada, 5–8 December 2005; MIT
Press: Cambridge, MA, USA, 2005; pp. 99–106.
34. Syed, U.; Bowling, M.; Schapire, R.E. Apprenticeship Learning Using Linear Programming. In Proceedings of the 25th
International Conference on Machine Learning (ICML ’08), Helsinki, Finland, 5–9 July 2008; Association for Computing Machinery:
New York, NY, USA, 2008; pp. 1032–1039. [CrossRef]
35. Boularias, A.; Chaib-draa, B. Apprenticeship Learning with Few Examples. Neurocomputing 2013, 104, 83–96. [CrossRef]
36. Carmel, D.; Markovitch, S. Learning Models of the Opponent’s Strategy in Game Playing. In Proceedings of the AAAI Fall
Symposium on Games: Planing and Learning, Raleigh, NC, USA, 22–24 October 1993; pp. 140–147.
37. Samuelson, P.A. A Note on the Pure Theory of Consumer’s Behaviour. Economica 1938, 5, 61–71. [CrossRef]
38. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620–630. [CrossRef]
39. Ziebart, B.D.; Bagnell, J.A.; Dey, A.K. Modeling Interaction via the Principle of Maximum Causal Entropy. In Proceedings of
the 27th International Conference on International Conference on Machine Learning (ICML ’10), Haifa, Israel, 21–24 June 2010;
Omnipress: Madison, WI, USA, 2010; pp. 1255–1262.
40. Ng, A.Y.; Russell, S.J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference
on Machine Learning (ICML ’00), Stanford, CA, USA, 29 June–2 July 2000; Morgan Kaufmann Publishers Inc.: San Francisco, CA,
USA, 2000; pp. 663–670.
41. Chajewska, U.; Koller, D. Utilities as Random Variables: Density Estimation and Structure Discovery. In Proceedings of the
Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI ’00), Stanford, CA, USA, 30 June–3 July 2000. [CrossRef]
42. Abbeel, P.; Ng, A.Y. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-First International
Conference on Machine Learning (ICML ’04), Banff, AB, Canada, 4–8 July 2004; Association for Computing Machinery: New
York, NY, USA, 2004. [CrossRef]
43. Syed, U.; Schapire, R.E. A Game-Theoretic Approach to Apprenticeship Learning. In Proceedings of the Advances in Neural
Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.;
Curran Associates, Inc.: Red Hook, NY, USA, 2007; Volume 20.
44. Von Neumann, J. On the Theory of Parlor Games. Math. Ann. 1928, 100, 295–320.
45. Freund, Y.; Schapire, R.E. Adaptive Game Playing Using Multiplicative Weights. Games Econ. Behav. 1999, 29, 79–103. [CrossRef]
46. Chajewska, U.; Koller, D.; Ormoneit, D. Learning an Agent’s Utility Function by Observing Behavior. In Proceedings of the
Eighteenth International Conference on Machine Learning (ICML ’01), Williamstown, MA, USA, 28 June–1 July 2001; Morgan
Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001; pp. 35–42.
47. Gallese, V.; Goldman, A. Mirror Neurons and the Simulation Theory of Mind-Reading. Trends Cogn. Sci. 1998, 2, 493–501.
[CrossRef]
48. Shanton, K.; Goldman, A. Simulation Theory. WIREs Cogn. Sci. 2010, 1, 527–538. [CrossRef]
49. Ratliff, N.D.; Bagnell, J.A.; Zinkevich, M.A. Maximum Margin Planning. In Proceedings of the 23rd International Conference
on Machine Learning (ICML ’06), Pittsburgh, PA, USA, 25–29 June 2006; ACM Press: Pittsburgh, PA, USA, 2006; pp. 729–736.
[CrossRef]
50. Reddy, S.; Dragan, A.; Levine, S.; Legg, S.; Leike, J. Learning Human Objectives by Evaluating Hypothetical Behavior. In
Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 8020–8029.
51. Neu, G.; Szepesvári, C. Training Parsers by Inverse Reinforcement Learning. Mach. Learn. 2009, 77, 303. [CrossRef]
52. Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd
National Conference on Artificial Intelligence-Volume 3 (AAAI ’08), Chicago, IL, USA, 13–17 July 2008; AAAI Press: Chicago, IL,
USA, 2008; pp. 1433–1438.
53. Neu, G.; Szepesvári, C. Apprenticeship Learning Using Inverse Reinforcement Learning and Gradient Methods. In Proceedings
of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI ’07), Vancouver, BC, Canada, 19–22 July 2007;
AUAI Press: Arlington, VA, USA, 2007; pp. 295–302.
54. Ni, T.; Sikchi, H.; Wang, Y.; Gupta, T.; Lee, L.; Eysenbach, B. F-IRL: Inverse Reinforcement Learning via State Marginal Matching.
In Proceedings of the 2020 Conference on Robot Learning, Virtual Event, 16–18 November 2020; pp. 529–551.
55. Lopes, M.; Melo, F.; Montesano, L. Active Learning for Reward Estimation in Inverse Reinforcement Learning. In Proceedings
of the 2009 European Conference on Machine Learning and Knowledge Discovery in Databases-Volume Part II (ECMLPKDD ’09), Bled,
Slovenia, 7–11 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 31–46.
56. Jin, M.; Damianou, A.; Abbeel, P.; Spanos, C. Inverse Reinforcement Learning via Deep Gaussian Process. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence (UAI), Sydney, Australia, 11–15 August 2017; p. 10.
57. Roa-Vicens, J.; Chtourou, C.; Filos, A.; Rullan, F.; Gal, Y.; Silva, R. Towards Inverse Reinforcement Learning for Limit Order
Book Dynamics. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15
June 2019. [CrossRef]
58. Chan, A.J.; Schaar, M. Scalable Bayesian Inverse Reinforcement Learning. In Proceedings of the 2021 International Conference on
Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
59. Ramachandran, D.; Amir, E. Bayesian Inverse Reinforcement Learning. In Proceedings of the International Joint Conference on
Artificial Intelligence (IJCAI ’07), Hyderabad, India, 6–12 January 2007; pp. 2586–2591.
60. Choi, J.; Kim, K.e. MAP Inference for Bayesian Inverse Reinforcement Learning. In Proceedings of the Advances in Neural
Information Processing Systems, Granada, Spain, 12–15 December 2011; Curran Associates, Inc.: Red Hook, NY, USA, 2011.
61. Melo, F.S.; Lopes, M.; Ferreira, R. Analysis of Inverse Reinforcement Learning with Perturbed Demonstrations. In Proceedings of
the 19th European Conference on Artificial Intelligence, Lisbon, Portugal, 16–20 August 2010; pp. 349–354.
62. Rothkopf, C.A.; Dimitrakakis, C. Preference Elicitation and Inverse Reinforcement Learning. In Proceedings of the Machine Learning
and Knowledge Discovery in Databases (ECMLPKDD ’11), Athens, Greece, 5–9 September 2011; Gunopulos, D., Hofmann, T., Malerba,
D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 34–48. [CrossRef]
63. Ziebart, B.D.; Bagnell, J.A.; Dey, A.K. The Principle of Maximum Causal Entropy for Estimating Interacting Processes. IEEE Trans.
Inf. Theory 2013, 59, 1966–1980. [CrossRef]
64. Kramer, G. Directed Information for Channels with Feedback. Ph.D. Thesis, Swiss Federal Institute of Technology, Zurich,
Switzerland, 1998.
65. Bloem, M.; Bambos, N. Infinite Time Horizon Maximum Causal Entropy Inverse Reinforcement Learning. In Proceedings of the
53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014; pp. 4911–4916. [CrossRef]
66. Zhou, Z.; Bloem, M.; Bambos, N. Infinite Time Horizon Maximum Causal Entropy Inverse Reinforcement Learning. IEEE Trans.
Autom. Control 2018, 63, 2787–2802. [CrossRef]
67. Ziebart, B.D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. Thesis, Carnegie
Mellon University, Pittsburgh, PA, USA, 2010.
68. Boularias, A.; Kober, J.; Peters, J. Relative Entropy Inverse Reinforcement Learning. In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Ft. Lauderdale, FL, USA, 11–13
April 2011; pp. 182–189.
69. Snoswell, A.J.; Singh, S.P.N.; Ye, N. Revisiting Maximum Entropy Inverse Reinforcement Learning: New Perspectives and
Algorithms. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI ’20), Canberra, ACT,
Australia, 1–4 December 2020; pp. 241–249. [CrossRef]
70. Aghasadeghi, N.; Bretl, T. Maximum Entropy Inverse Reinforcement Learning in Continuous State Spaces with Path Integrals. In
Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30
September 2011; pp. 1561–1566. [CrossRef]
71. Audiffren, J.; Valko, M.; Lazaric, A.; Ghavamzadeh, M. Maximum Entropy Semi-Supervised Inverse Reinforcement Learning. In
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July
2015; pp. 3315–3321.
72. Finn, C.; Christiano, P.; Abbeel, P.; Levine, S. A Connection between Generative Adversarial Networks, Inverse Reinforcement
Learning, and Energy-Based Models. arXiv 2016, arXiv:1611.03852. [CrossRef]
73. Shiarlis, K.; Messias, J.; Whiteson, S. Inverse Reinforcement Learning from Failure. In Proceedings of the 2016 International
Conference on Autonomous Agents & Multiagent Systems (AAMAS ’16), Singapore, 9–13 May 2016; International Foundation
for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2016; pp. 1060–1068.
74. Viano, L.; Huang, Y.T.; Kamalaruban, P.; Weller, A.; Cevher, V. Robust Inverse Reinforcement Learning under Transition Dynamics
Mismatch. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Event, 6–14 December 2021;
Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 25917–25931.
75. Sanghvi, N.; Usami, S.; Sharma, M.; Groeger, J.; Kitani, K. Inverse Reinforcement Learning with Explicit Policy Estimates. In
Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; pp. 9472–9480. [CrossRef]
76. Dvijotham, K.; Todorov, E. Inverse Optimal Control with Linearly-Solvable MDPs. In Proceedings of the 27th International
Conference on International Conference on Machine Learning (ICML ’10), Haifa, Israel, 21–24 June 2010; Omnipress: Madison,
WI, USA, 2010; pp. 335–342.
77. Todorov, E. Linearly-Solvable Markov Decision Problems. In Advances in Neural Information Processing Systems 19, Proceedings of
the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; Schölkopf, B.,
Platt, J.C., Hofmann, T., Eds.; MIT Press: Cambridge, MA, USA, 2006; pp. 1369–1376.
78. Klein, E.; Geist, M.; Piot, B.; Pietquin, O. Inverse Reinforcement Learning through Structured Classification. In Proceedings of the
Advances in Neural Information Processing Systems (NeurIPS ’12), Lake Tahoe, NV, USA, 3–8 December 2012; Curran Associates,
Inc.: Red Hook, NY, USA, 2012; Volume 25.
79. Klein, E.; Piot, B.; Geist, M.; Pietquin, O. A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning. In
Proceedings of the Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, 23–27 September 2013; Lecture
Notes in Computer Science; Blockeel, H., Kersting, K., Nijssen, S., Železný, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013;
pp. 1–16. [CrossRef]
80. Doerr, A.; Ratliff, N.; Bohg, J.; Toussaint, M.; Schaal, S. Direct Loss Minimization Inverse Optimal Control. In Proceedings of the
Robotics: Science and Systems Conference, Rome, Italy, 13–17 July 2015.
81. Pirotta, M.; Restelli, M. Inverse Reinforcement Learning through Policy Gradient Minimization. In Proceedings of the Thirtieth
AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
82. Metelli, A.M.; Pirotta, M.; Restelli, M. Compatible Reward Inverse Reinforcement Learning. In Proceedings of the Advances in
Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA,
2017; Volume 30.
83. Ho, J.; Ermon, S. Generative Adversarial Imitation Learning. In Proceedings of the Advances in Neural Information Processing
Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29.
84. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014;
Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
85. Yu, L.; Yu, T.; Finn, C.; Ermon, S. Meta-Inverse Reinforcement Learning with Probabilistic Context Variables. In Proceedings of
the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.:
Red Hook, NY, USA, 2019; Volume 32.
86. Fu, J.; Luo, K.; Levine, S. Learning Robust Rewards with Adverserial Inverse Reinforcement Learning. In Proceedings of the 6th
International Conference on Learning Representations (ICLR ’18), Vancouver, BC, Canada, 30 April–3 May 2018.
87. Wang, P.; Li, H.; Chan, C.Y. Meta-Adversarial Inverse Reinforcement Learning for Decision-making Tasks. In Proceedings of the
2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 12632–12638.
[CrossRef]
88. Peng, X.B.; Kanazawa, A.; Toyer, S.; Abbeel, P.; Levine, S. Variational Discriminator Bottleneck: Improving Imitation Learning,
Inverse RL, and GANs by Constraining Information Flow. In Proceedings of the 7th International Conference on Learning
Representations, New Orleans, LA, USA, 6–9 May 2019.
89. Wang, P.; Liu, D.; Chen, J.; Li, H.; Chan, C.Y. Decision Making for Autonomous Driving via Augmented
Adversarial Inverse Reinforcement Learning. In Proceedings of the 2021 IEEE International Conference on Robotics and
Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [CrossRef]
90. Sun, J.; Yu, L.; Dong, P.; Lu, B.; Zhou, B. Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model. IEEE
Robot. Autom. Lett. 2021, 6, 1880–1886. [CrossRef]
91. Zhou, L.; Small, K. Inverse Reinforcement Learning with Natural Language Goals. In Proceedings of the AAAI Conference on
Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [CrossRef]
92. Ratliff, N.; Bradley, D.; Bagnell, J.; Chestnutt, J. Boosting Structured Prediction for Imitation Learning. In Proceedings of the
Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 4–9 December 2006; MIT Press: Cambridge, MA,
USA, 2006; Volume 19.
93. Ratliff, N.D.; Silver, D.; Bagnell, J.A. Learning to Search: Functional Gradient Techniques for Imitation Learning. Auton. Robot
2009, 27, 25–53. [CrossRef]
94. Levine, S.; Popovic, Z.; Koltun, V. Feature Construction for Inverse Reinforcement Learning. In Proceedings of the Advances in
Neural Information Processing Systems (NeurIPS ’10), Vancouver, BC, Canada, 6–11 December 2010; Curran Associates, Inc.: Red
Hook, NY, USA, 2010; Volume 23.
95. Jin, Z.J.; Qian, H.; Zhu, M.L. Gaussian Processes in Inverse Reinforcement Learning. In Proceedings of the 2010 International
Conference on Machine Learning and Cybernetics (ICMLC ’10), Qingdao, China, 11–14 July 2010; Volume 1, pp. 225–230.
[CrossRef]
96. Levine, S.; Popovic, Z.; Koltun, V. Nonlinear Inverse Reinforcement Learning with Gaussian Processes. In Proceedings of the
Advances in Neural Information Processing Systems, Granada, Spain, 12–17 December 2011; Curran Associates, Inc.: Red Hook,
NY, USA, 2011; Volume 24.
97. Wulfmeier, M.; Ondruska, P.; Posner, I. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv 2015, arXiv:1507.04888.
[CrossRef]
98. Levine, S.; Koltun, V. Continuous Inverse Optimal Control with Locally Optimal Examples. In Proceedings of the 29th
International Conference on Machine Learning (ICML ’12), Edinburgh, Scotland, 26 June–1 July 2012; Omnipress: Madison, WI,
USA, 2012; pp. 475–482.
99. Kim, K.E.; Park, H.S. Imitation Learning via Kernel Mean Embedding. In Proceedings of the AAAI Conference on Artificial
Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [CrossRef]
100. Choi, J.; Kim, K.E. Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning. In Proceedings of the
Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI ’13), Beijing, China, 3–9 August 2013; p. 7.
101. Michini, B.; How, J.P. Bayesian Nonparametric Inverse Reinforcement Learning. In Proceedings of the Machine Learning and
Knowledge Discovery in Databases, Bristol, UK, 24–28 September 2012; Lecture Notes in Computer Science; Flach, P.A., De Bie, T.,
Cristianini, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 148–163. [CrossRef]
102. Wulfmeier, M.; Wang, D.Z.; Posner, I. Watch This: Scalable Cost-Function Learning for Path Planning in Urban Environments.
In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of
Korea, 9–14 October 2016; pp. 2089–2095. [CrossRef]
103. Bogdanovic, M.; Markovikj, D.; Denil, M.; de Freitas, N. Deep Apprenticeship Learning for Playing Video Games. In Papers from
the 2015 AAAI Workshop; AAAI Technical Report WS-15-10; The AAAI Press: Palo Alto, CA, USA, 2015.
104. Markovikj, D. Deep Apprenticeship Learning for Playing Games. Master’s Thesis, University of Oxford, Oxford, UK, 2014.
105. Xia, C.; El Kamel, A. Neural Inverse Reinforcement Learning in Autonomous Navigation. Robot. Auton. Syst. 2016, 84, 1–14.
[CrossRef]
106. Uchibe, E. Model-Free Deep Inverse Reinforcement Learning by Logistic Regression. Neural. Process Lett. 2018, 47, 891–905.
[CrossRef]
107. Finn, C.; Levine, S.; Abbeel, P. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. In Proceedings of
the 33rd International Conference on International Conference on Machine Learning (ICML ’16), New York, NY, USA, 19–24 June
2016; Volume 48, pp. 49–58.
108. Achim, A.M.; Guitton, M.; Jackson, P.L.; Boutin, A.; Monetta, L. On What Ground Do We Mentalize? Characteristics of Current
Tasks and Sources of Information That Contribute to Mentalizing Judgments. Psychol. Assess. 2013, 25, 117–126. [CrossRef]
[PubMed]
109. Kim, K.; Garg, S.; Shiragur, K.; Ermon, S. Reward Identification in Inverse Reinforcement Learning. In Proceedings of the 38th
International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 5496–5505.
110. Cao, H.; Cohen, S.; Szpruch, L. Identifiability in Inverse Reinforcement Learning. In Proceedings of the Advances in Neural
Information Processing Systems, Virtual Event, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021;
Volume 34, pp. 12362–12373.
111. Tauber, S.; Steyvers, M. Using Inverse Planning and Theory of Mind for Social Goal Inference. In Proceedings of the 33rd Annual
Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; Volume 1, pp. 2480–2485.
112. Rust, J. Structural Estimation of Markov Decision Processes. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands,
1994; Volume 4, pp. 3081–3143. [CrossRef]
113. Damiani, A.; Manganini, G.; Metelli, A.M.; Restelli, M. Balancing Sample Efficiency and Suboptimality in Inverse Reinforcement
Learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022;
pp. 4618–4629.
114. Jarboui, F.; Perchet, V. A Generalised Inverse Reinforcement Learning Framework. arXiv 2021, arXiv:2105.11812. [CrossRef]
115. Bogert, K.; Doshi, P. Toward Estimating Others’ Transition Models under Occlusion for Multi-Robot IRL. In Proceedings of the
Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
116. Ramponi, G.; Likmeta, A.; Metelli, A.M.; Tirinzoni, A.; Restelli, M. Truly Batch Model-Free Inverse Reinforcement Learning about
Multiple Intentions. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Virtual
Event, 26–28 August 2020; pp. 2359–2369.
117. Xue, W.; Lian, B.; Fan, J.; Kolaric, P.; Chai, T.; Lewis, F.L. Inverse Reinforcement Q-Learning Through Expert Imitation for
Discrete-Time Systems. IEEE Trans. Neural Netw. Learn. Syst. 2021. [CrossRef] [PubMed]
118. Donge, V.S.; Lian, B.; Lewis, F.L.; Davoudi, A. Multi-Agent Graphical Games with Inverse Reinforcement Learning. IEEE Trans.
Control. Netw. Syst. 2022. [CrossRef]
119. Herman, M.; Gindele, T.; Wagner, J.; Schmitt, F.; Burgard, W. Inverse Reinforcement Learning with Simultaneous Estimation of
Rewards and Dynamics. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain,
9–11 May 2016; pp. 102–110.
120. Reddy, S.; Dragan, A.; Levine, S. Where Do You Think You’ Re Going? Inferring Beliefs about Dynamics from Behavior. In
Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Curran
Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
121. Gong, Z.; Zhang, Y. What Is It You Really Want of Me? Generalized Reward Learning with Biased Beliefs about Domain Dynamics.
Proc. AAAI Conf. Artif. Intell. 2020, 34, 2485–2492. [CrossRef]
122. Munzer, T.; Piot, B.; Geist, M.; Pietquin, O.; Lopes, M. Inverse Reinforcement Learning in Relational Domains. In Proceedings of
the 24th International Conference on Artificial Intelligence (IJCAI ’15), Buenos Aires, Argentina, 25–31 July 2015; AAAI Press:
Palo Alto, CA, USA, 2015; pp. 3735–3741.
123. Chae, J.; Han, S.; Jung, W.; Cho, M.; Choi, S.; Sung, Y. Robust Imitation Learning against Variations in Environment Dynamics. In
Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 2828–2852.
124. Golub, M.; Chase, S.; Yu, B. Learning an Internal Dynamics Model from Control Demonstration. In Proceedings of the 30th
International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 606–614.
125. Rafferty, A.N.; LaMar, M.M.; Griffiths, T.L. Inferring Learners’ Knowledge From Their Actions. Cogn. Sci. 2015, 39, 584–618.
[CrossRef]
126. Rafferty, A.N.; Jansen, R.A.; Griffiths, T.L. Using Inverse Planning for Personalized Feedback. In Proceedings of the 9th
International Conference on Educational Data Mining, Raleigh, NC, USA, 29 June–2 July 2016; p. 6.
127. Choi, J.; Kim, K.E. Inverse Reinforcement Learning in Partially Observable Environments. J. Mach. Learn. Res. 2011, 12, 691–730.
128. Baker, C.L.; Saxe, R.; Tenenbaum, J.B. Action Understanding as Inverse Planning. Cognition 2009, 113, 329–349. [CrossRef]
129. Nielsen, T.D.; Jensen, F.V. Learning a Decision Maker’s Utility Function from (Possibly) Inconsistent Behavior. Artif. Intell. 2004,
160, 53–78. [CrossRef]
130. Zheng, J.; Liu, S.; Ni, L.M. Robust Bayesian Inverse Reinforcement Learning with Sparse Behavior Noise. In Proceedings of the
Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI ’14), Québec City, QC, Canada, 27–31 July 2014; AAAI Press:
Palo Alto, CA, USA, 2014; pp. 2198–2205.
131. Lian, B.; Xue, W.; Lewis, F.L.; Chai, T. Inverse Reinforcement Learning for Adversarial Apprentice Games. IEEE Trans. Neural
Netw. 2021. [CrossRef]
132. Noothigattu, R.; Yan, T.; Procaccia, A.D. Inverse Reinforcement Learning From Like-Minded Teachers. Proc. AAAI Conf. Artif.
Intell. 2021, 35, 9197–9204. [CrossRef]
133. Brown, D.; Goo, W.; Nagarajan, P.; Niekum, S. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement
Learning from Observations. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA,
9–15 June 2019; pp. 783–792.
134. Armstrong, S.; Mindermann, S. Occam’ s Razor Is Insufficient to Infer the Preferences of Irrational Agents. In Proceedings of the
Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: Red
Hook, NY, USA, 2018; Volume 31.
135. Ranchod, P.; Rosman, B.; Konidaris, G. Nonparametric Bayesian Reward Segmentation for Skill Discovery Using Inverse
Reinforcement Learning. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 471–477. [CrossRef]
136. Henderson, P.; Chang, W.D.; Bacon, P.L.; Meger, D.; Pineau, J.; Precup, D. OptionGAN: Learning Joint Reward-Policy Options
Using Generative Adversarial Inverse Reinforcement Learning. In Proceedings of the 32nd AAAI Conference on Artificial
Intelligence, New Orleans, LA, USA, 2–7 February 2018. [CrossRef]
137. Babeş-Vroman, M.; Marivate, V.; Subramanian, K.; Littman, M. Apprenticeship Learning about Multiple Intentions. In
Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML ’11), Bellevue, WA,
USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 897–904.
138. Likmeta, A.; Metelli, A.M.; Ramponi, G.; Tirinzoni, A.; Giuliani, M.; Restelli, M. Dealing with Multiple Experts and Non-
Stationarity in Inverse Reinforcement Learning: An Application to Real-Life Problems. Mach. Learn. 2021, 110, 2541–2576.
[CrossRef]
139. Gleave, A.; Habryka, O. Multi-Task Maximum Entropy Inverse Reinforcement Learning. arXiv 2018, arXiv:1805.08882. [CrossRef]
140. Dimitrakakis, C.; Rothkopf, C.A. Bayesian Multitask Inverse Reinforcement Learning. In Proceedings of the Recent Advances in
Reinforcement Learning—9th European Workshop (EWRL), Athens, Greece, 9–11 September 2011; Lecture Notes in Computer Science;
Sanner, S., Hutter, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 273–284. [CrossRef]
141. Choi, J.; Kim, K.e. Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions. In Proceedings
of the Advances in Neural Information Processing Systems (NeurIPS ’12), Lake Tahoe, NV, USA, 3–8 December 2012; Curran
Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25.
142. Arora, S.; Doshi, P.; Banerjee, B. Min-Max Entropy Inverse RL of Multiple Tasks. In Proceedings of the 2021 IEEE International
Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 12639–12645. [CrossRef]
143. Bighashdel, A.; Meletis, P.; Jancura, P.; Dubbelman, G. Deep Adaptive Multi-Intention Inverse Reinforcement Learning.
ECML/PKDD 2021, 2021, 206–221. [CrossRef]
144. Almingol, J.; Montesano, L. Learning Multiple Behaviours Using Hierarchical Clustering of Rewards. In Proceedings of the 2015
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015;
pp. 4608–4613. [CrossRef]
145. Belogolovsky, S.; Korsunsky, P.; Mannor, S.; Tessler, C.; Zahavy, T. Inverse Reinforcement Learning in Contextual MDPs. Mach.
Learn. 2021, 110, 2295–2334. [CrossRef]
146. Sharifzadeh, S.; Chiotellis, I.; Triebel, R.; Cremers, D. Learning to Drive Using Inverse Reinforcement Learning and Deep
Q-Networks. In Proceedings of the NIPS Workshop on Deep Learning for Action and Interaction. arXiv 2017, arXiv:1612.03653. [CrossRef]
147. Brown, D.; Coleman, R.; Srinivasan, R.; Niekum, S. Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences.
In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 12–18 July 2020; pp. 1165–1177.
148. Imani, M.; Ghoreishi, S.F. Scalable Inverse Reinforcement Learning Through Multifidelity Bayesian Optimization. IEEE Trans.
Neural Netw. Learn. Syst. 2022, 33, 4125–4132. [CrossRef]
149. Garg, D.; Chakraborty, S.; Cundy, C.; Song, J.; Ermon, S. IQ-Learn: Inverse Soft-Q Learning for Imitation. In Proceedings of the
Advances in Neural Information Processing Systems, Virtual Event, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY,
USA, 2021; Volume 34, pp. 4028–4039.
150. Liu, S.; Jiang, H.; Chen, S.; Ye, J.; He, R.; Sun, Z. Integrating Dijkstra’s Algorithm into Deep Inverse Reinforcement Learning for
Food Delivery Route Planning. Transp. Res. Part E Logist. Transp. Rev. 2020, 142, 102070. [CrossRef]
151. Xu, K.; Ratner, E.; Dragan, A.; Levine, S.; Finn, C. Learning a Prior over Intent via Meta-Inverse Reinforcement Learning.
In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019;
pp. 6952–6962.
152. Seyed Ghasemipour, S.K.; Gu, S.S.; Zemel, R. SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional
Policies. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December
2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
153. Boularias, A.; Krömer, O.; Peters, J. Structured Apprenticeship Learning. In Proceedings of the Machine Learning and Knowledge
Discovery in Databases, Bristol, UK, 24–28 September 2012; Lecture Notes in Computer Science; Flach, P.A., De Bie, T., Cristianini, N.,
Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 227–242. [CrossRef]
154. Bogert, K.; Doshi, P. Multi-Robot Inverse Reinforcement Learning under Occlusion with Estimation of State Transitions. Artif.
Intell. 2018, 263, 46–73. [CrossRef]
155. Jin, W.; Kulić, D.; Mou, S.; Hirche, S. Inverse Optimal Control from Incomplete Trajectory Observations. Int. J. Robot. Res. 2021,
40, 848–865. [CrossRef]
156. Suresh, P.S.; Doshi, P. Marginal MAP Estimation for Inverse RL under Occlusion with Observer Noise. In Proceedings of the
Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022; pp. 1907–1916.
157. Torabi, F.; Warnell, G.; Stone, P. Recent Advances in Imitation Learning from Observation. In Proceedings of the Electronic
Proceedings of IJCAI (IJCAI ’19), Macao, China, 10–16 August 2019; pp. 6325–6331.
158. Das, N.; Bechtle, S.; Davchev, T.; Jayaraman, D.; Rai, A.; Meier, F. Model-Based Inverse Reinforcement Learning from Visual
Demonstrations. In Proceedings of the 2020 Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 1930–1942.
159. Zakka, K.; Zeng, A.; Florence, P.; Tompson, J.; Bohg, J.; Dwibedi, D. XIRL: Cross-embodiment Inverse Reinforcement Learning. In
Proceedings of the 5th Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022; pp. 537–546.
160. Liu, Y.; Gupta, A.; Abbeel, P.; Levine, S. Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context
Translation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia,
21–25 May 2018; pp. 1118–1125. [CrossRef]
161. Hadfield-Menell, D.; Russell, S.J.; Abbeel, P.; Dragan, A. Cooperative Inverse Reinforcement Learning. In Proceedings of the
Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates, Inc.: Red Hook,
NY, USA, 2016; Volume 29.
162. Amin, K.; Jiang, N.; Singh, S. Repeated Inverse Reinforcement Learning. In Proceedings of the Advances in Neural Information
Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
163. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from Human Preferences. In
Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran
Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
164. Bobu, A.; Wiggert, M.; Tomlin, C.; Dragan, A.D. Inducing Structure in Reward Learning by Learning Features. Int. J. Robot. Res.
2022, 41, 497–518. [CrossRef]
165. Chang, L.J.; Smith, A. Social Emotions and Psychological Games. Curr. Opin. Behav. Sci. 2015, 5, 133–140. [CrossRef]
166. Rabin, M. Incorporating Fairness into Game Theory and Economics. Am. Econ. Rev. 1993, 83, 1281–1302.
167. Falk, A.; Fehr, E.; Fischbacher, U. On the Nature of Fair Behavior. Econ. Inq. 2003, 41, 20–26. [CrossRef]
168. Preckel, K.; Kanske, P.; Singer, T. On the Interaction of Social Affect and Cognition: Empathy, Compassion and Theory of Mind.
Curr. Opin. Behav. Sci. 2018, 19, 1–6. [CrossRef]
169. Ong, D.C.; Zaki, J.; Goodman, N.D. Computational Models of Emotion Inference in Theory of Mind: A Review and Roadmap.
Top. Cogn. Sci. 2019, 11, 338–357. [CrossRef]
170. Lise, W. Estimating a Game Theoretic Model. Comput. Econ. 2001, 18, 141–157. [CrossRef]
171. Bajari, P.; Hong, H.; Ryan, S.P. Identification and Estimation of a Discrete Game of Complete Information. Econometrica 2010,
78, 1529–1568. [CrossRef]
172. Waugh, K.; Ziebart, B.D.; Bagnell, J.A. Computational Rationalization: The Inverse Equilibrium Problem. In Proceedings of the
28th International Conference on International Conference on Machine Learning (ICML ’11), Bellevue, WA, USA, 28 June–2 July
2011; Omnipress: Madison, WI, USA, 2011; pp. 1169–1176.
173. Kuleshov, V.; Schrijvers, O. Inverse Game Theory: Learning Utilities in Succinct Games. In Proceedings of the Web and Internet
Economics, Amsterdam, The Netherlands, 9–12 December 2015; Lecture Notes in Computer Science; Markakis, E., Schäfer, G., Eds.;
Springer: Berlin/Heidelberg, Germany, 2015; pp. 413–427. [CrossRef]
174. Cao, K.; Xie, L. Game-Theoretic Inverse Reinforcement Learning: A Differential Pontryagin’s Maximum Principle Approach.
IEEE Trans. Neural Netw. Learn. Syst. 2022. [CrossRef]
175. Natarajan, S.; Kunapuli, G.; Judah, K.; Tadepalli, P.; Kersting, K.; Shavlik, J. Multi-Agent Inverse Reinforcement Learning. In
Proceedings of the 2010 Ninth International Conference on Machine Learning and Applications (ICMLA ’10), Washington, DC,
USA, 12–14 December 2010; pp. 395–400. [CrossRef]
176. Reddy, T.S.; Gopikrishna, V.; Zaruba, G.; Huber, M. Inverse Reinforcement Learning for Decentralized Non-Cooperative
Multiagent Systems. In Proceedings of the 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (IEEE
SMC ’12), Seoul, Republic of Korea, 14–17 October 2012; pp. 1930–1935. [CrossRef]
177. Chen, Y.; Zhang, L.; Liu, J.; Hu, S. Individual-Level Inverse Reinforcement Learning for Mean Field Games. arXiv 2022,
arXiv:2202.06401. [CrossRef]
178. Harré, M.S. What Can Game Theory Tell Us about an AI ‘Theory of Mind’? Games 2022, 13, 46. [CrossRef]
179. Wellman, H.M.; Miller, J.G. Including Deontic Reasoning as Fundamental to Theory of Mind. Hum. Dev. 2008, 51, 105–135. [CrossRef]
180. Sanfey, A.G. Social Decision-Making: Insights from Game Theory and Neuroscience. Science 2007, 318, 598–602. [CrossRef]
181. Adolphs, R. The Social Brain: Neural Basis of Social Knowledge. Annu. Rev. Psychol. 2009, 60, 693–716. [CrossRef]
182. Peterson, J.C.; Bourgin, D.D.; Agrawal, M.; Reichman, D.; Griffiths, T.L. Using Large-Scale Experiments and Machine Learning to
Discover Theories of Human Decision-Making. Science 2021, 372, 1209–1214. [CrossRef]
183. Gershman, S.J.; Gerstenberg, T.; Baker, C.L.; Cushman, F.A. Plans, Habits, and Theory of Mind. PLoS ONE 2016, 11, e0162246.
[CrossRef]
184. Harsanyi, J.C. Games with Incomplete Information Played by “Bayesian” Players, I–III. Part III. The Basic Probability Distribution
of the Game. Manag. Sci. 1968, 14, 486–502. [CrossRef]
185. Conway, J.R.; Catmur, C.; Bird, G. Understanding Individual Differences in Theory of Mind via Representation of Minds, Not
Mental States. Psychon. Bull. Rev. 2019, 26, 798. [CrossRef]
186. Velez-Ginorio, J.; Siegel, M.H.; Tenenbaum, J.; Jara-Ettinger, J. Interpreting Actions by Attributing Compositional Desires. In
Proceedings of the 39th Annual Meeting of the Cognitive Science Society, London, UK, 26–29 July 2017.
187. Sun, L.; Zhan, W.; Tomizuka, M. Probabilistic Prediction of Interactive Driving Behavior via Hierarchical Inverse Reinforcement
Learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA,
4–7 November 2018; pp. 2111–2117. [CrossRef]
188. Kolter, J.; Abbeel, P.; Ng, A. Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion. In Proceedings
of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; Curran Associates, Inc.:
Red Hook, NY, USA, 2007; Volume 20.
189. Natarajan, S.; Joshi, S.; Tadepalli, P.; Kersting, K.; Shavlik, J. Imitation Learning in Relational Domains: A Functional-Gradient
Boosting Approach. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona,
Spain, 16–22 July 2011.
190. Okal, B.; Gilbert, H.; Arras, K.O. Efficient Inverse Reinforcement Learning Using Adaptive State-Graphs. In Proceedings of the
Robotics: Science and Systems XI Conference (RSS ’15), Rome, Italy, 13–17 July 2015; p. 2.
191. Gao, X.; Gong, R.; Zhao, Y.; Wang, S.; Shu, T.; Zhu, S.C. Joint Mind Modeling for Explanation Generation in Complex Human-
Robot Collaborative Tasks. In Proceedings of the 2020 29th IEEE International Conference on Robot and Human Interactive
Communication (RO-MAN), Naples, Italy, 31 August–4 September 2020; pp. 1119–1126. [CrossRef]
192. Bard, N.; Foerster, J.N.; Chandar, S.; Burch, N.; Lanctot, M.; Song, H.F.; Parisotto, E.; Dumoulin, V.; Moitra, S.; Hughes, E.; et al.
The Hanabi Challenge: A New Frontier for AI Research. Artif. Intell. 2020, 280, 103216. [CrossRef]
193. Heidecke, J. Evaluating the Robustness of GAN-Based Inverse Reinforcement Learning Algorithms. Master’s Thesis, Universitat
Politècnica de Catalunya, Barcelona, Spain, 2019.
194. Snoswell, A.J.; Singh, S.P.N.; Ye, N. LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning. arXiv 2021,
arXiv:2106.01777. [CrossRef]
195. Toyer, S.; Shah, R.; Critch, A.; Russell, S. The MAGICAL Benchmark for Robust Imitation. arXiv 2020, arXiv:2011.00401. [CrossRef]
196. Waade, P.T.; Enevoldsen, K.C.; Vermillet, A.Q.; Simonsen, A.; Fusaroli, R. Introducing Tomsup: Theory of Mind Simulations
Using Python. Behav. Res. Methods 2022. [CrossRef] [PubMed]
197. Conway, J.R.; Bird, G. Conceptualizing Degrees of Theory of Mind. Proc. Natl. Acad. Sci. USA 2018, 115, 1408–1410. [CrossRef]
[PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.