Stackelberg games

1. Summary: key questions in Stackelberg games

1.1. Basics

  • The game has one attacker and one defender. The defender moves first (the leader); the attacker moves second (the follower).

    • The attacker is model-based: it has its own objective ("optimal function"). For example, the attacker may have a fixed optimal function (a linear or nonlinear equation) or, in an MDP setting, its own optimal policy (both are referred to as the optimal function below). In general the attacker knows the defender's strategy and maximizes its own payoff given that strategy.
    • The defender has some knowledge of the attacker; ideally it knows the attacker's optimal function exactly. The harder case is when the defender has only partial or inaccurate knowledge of the attacker. The defender's strategy is built on this knowledge: the question is how to use what is known about the attacker to optimize the defender's own strategy and thereby minimize the attacker's payoff. The defender may use a pure strategy or a mixed strategy (committing to a probability distribution over multiple options, which is generally more advantageous); see the sketch after this list.
  • The attacker's optimal function:

    • It can be fully given, partially given, learned through a statistical model (PAC model), etc.
  • Rounds: a Stackelberg game can be single-round, i.e., the defender fixes a strategy at the start, deploys it, the attacker then acts, and the defender's strategy is evaluated by the attacker's payoff. It can also be a repeated game: the defender deploys a strategy, the attacker acts, the defender deploys a new strategy on top of what it observed, and so on.

    • In the repeated setting the defender needs to consider:
      1. The attacker's play over the next several rounds when choosing the current strategy (analogous to how RL credits later reward back to earlier decisions).
      2. Further, if it has only partial knowledge of the attacker's optimal function, it must balance exploration and exploitation while still guaranteeing some payoff (i.e., how to learn the attacker's function).
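For the basic single-round setting where the follower's payoffs are fully known, the leader's optimal commitment can be computed with the standard multiple-LPs formulation (Conitzer and Sandholm, "Computing the optimal strategy to commit to"): one LP per follower action, keeping the best. The sketch below assumes this textbook setting; the matrices `L`, `F` and the function name are illustrative, not taken from any paper summarized here.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_commitment(L, F):
    """Leader's optimal mixed strategy to commit to (multiple-LPs method).

    L[i, j], F[i, j] are the leader's and follower's payoffs when the leader plays
    pure strategy i and the follower plays j. For each follower action j, solve an
    LP maximising the leader's expected payoff over mixed strategies x, subject to
    j being a best response of the follower; keep the best feasible solution.
    """
    m, n = L.shape
    best_value, best_x = -np.inf, None
    for j in range(n):
        c = -L[:, j]                                     # maximise x . L[:, j]
        rows = [F[:, k] - F[:, j] for k in range(n) if k != j]
        A_ub = np.array(rows) if rows else None          # x . F[:, k] <= x . F[:, j]
        b_ub = np.zeros(len(rows)) if rows else None
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.ones((1, m)), b_eq=[1.0],  # x is a probability vector
                      bounds=[(0, 1)] * m)
        if res.success and -res.fun > best_value:
            best_value, best_x = -res.fun, res.x
    return best_x, best_value
```

Each LP restricts attention to the region of the simplex where follower action j is a best response; the leader's payoff is linear there, so the overall optimum is the best of the per-region optima.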

1.2. What to look at in each model

  • Rounds / game type (given): single round? Repeated game?
  • Follower/attacker knowledge (given): usually the follower fully knows the leader's strategy (e.g., the probabilities of a mixed strategy) but not the realized action, and computes its best response under its own already-determined type.
  • Strategy (given): one follower or several? A fixed optimal function (a single type), or multiple types (one type fixed from the options at the start, or a type drawn with some probability each round)?
  • Defender/leader knowledge (given): ideally the follower's optimal function is known exactly. If the follower has multiple types, the defender may know which types exist but not which one the follower chose. Given that, how is the strategy decided, directly or indirectly (e.g., via Monte Carlo)?
  • Algorithm for solving the strategy: a mixed strategy or an optimal function. Given the setup above, how is the leader's optimal strategy derived?

1.3. Practical applications

  • What if the environment the defender and attacker operate in is stochastic?
  • What can serve as the optimality criterion? E.g.:
    1. A factory producing goods: profit.
    2. Security patrols at airports / for wildlife protection.
    3. Network attack and defense: how to lure the attacker into a honeypot.

1.4. Problem classification?

  • Stackelberg security games (SSG), Tambe 2012
  • Green security games (GSG), Fang 2015, Nguyen 2015; machine-learning-based approaches evolved from these.
  • PAC model of learning

2. [20%]Efficient algorithms to solve Bayesian Stackelberg games for security applications

Paruchuri et al., AAAI 2008

  • Rounds / game type (given): single round
  • Follower/attacker knowledge (given): fully knows the probability with which the leader uses each strategy, but not the realized action; computes its best response accordingly.
  • Strategy (given): one follower whose type is fixed from multiple possible types.
  • Defender/leader knowledge (given): knows the follower has multiple types, and possibly which types exist, but not which one the follower chose.
  • Algorithm for solving the strategy: compute the optimal mixed strategy from the distribution over follower types, using
    1. DOBSS (Decomposed Optimal Bayesian Stackelberg Solver), exact, or
    2. ASAP (Agent Security via Approximate Policies), an approximate procedure.
  • Traditional method: conversion to a normal-form game via the Harsanyi transformation (a brute-force sketch in this spirit follows after this list).
  • Proposed methods:
    • Exact method: DOBSS (Decomposed Optimal Bayesian Stackelberg Solver)
    • Approximate procedure: ASAP (Agent Security via approximate Policies)
  • Focus on experimental results with human subjects.
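The sketch below is not DOBSS itself (which solves a single mixed-integer program); it is a brute-force analogue of the Harsanyi-transformation baseline: enumerate one pure response per follower type and solve one LP per combination, which makes explicit the exponential blow-up DOBSS is designed to avoid. The matrices `Ls`, `Fs`, the prior `probs`, and the function name are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def bayesian_commitment(Ls, Fs, probs):
    """Brute-force Bayesian Stackelberg sketch (not DOBSS).

    Ls[t], Fs[t] are the leader's and follower's payoff matrices against follower
    type t; probs[t] is the prior probability of type t. For every combination of
    pure responses (one per type, each type assumed to have >= 2 actions) solve one
    LP and keep the best feasible solution.
    """
    m = Ls[0].shape[0]
    best_value, best_x = -np.inf, None
    for resp in itertools.product(*[range(F.shape[1]) for F in Fs]):
        # objective: expected leader payoff over types, given the responses `resp`
        c = -sum(p * L[:, j] for p, L, j in zip(probs, Ls, resp))
        A_ub, b_ub = [], []
        for F, j in zip(Fs, resp):            # j must be a best response for its type
            for k in range(F.shape[1]):
                if k != j:
                    A_ub.append(F[:, k] - F[:, j])
                    b_ub.append(0.0)
        res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                      A_eq=np.ones((1, m)), b_eq=[1.0], bounds=[(0, 1)] * m)
        if res.success and -res.fun > best_value:
            best_value, best_x = -res.fun, res.x
    return best_x, best_value
```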

3. [30%]Learning and approximating the optimal strategy to commit to

Joshua Letchford et al., ADT 2011

“The authors develop an elegant method for choosing leader strategies to uncover the follower preferences in as few rounds as possible.” — Playing repeated Stackelberg games with unknown opponents

  • Rounds / game type (given): repeated game
  • Follower/attacker knowledge (given): fully knows the leader's strategy and computes its best response to it.
  • Strategy (given): has a fixed optimal function.
  • Defender/leader knowledge (given): knows nothing about the follower.
  • Algorithm for solving the strategy: commit to suitable mixed strategies, observe the follower's responses, and from them determine the optimal strategy; this result is positive. The result for learning approximately optimal Stackelberg strategies is negative. (A toy probing sketch follows below.)
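A toy illustration of the paper's starting point: query the follower with committed mixed strategies and use the observed best responses to map out best-response regions of the leader's simplex. This is naive sampling, not the authors' query-efficient procedure; the oracle name `best_response` and the Dirichlet sampling are assumptions.

```python
import numpy as np

def probe_best_response_regions(best_response, num_leader_actions, num_samples=500, seed=0):
    """Sample leader mixed strategies from the simplex, query the follower's
    best-response oracle (one committed round of play per query), and group the
    sampled strategies by the response they induced. Each group is a sample-based
    picture of one best-response region."""
    rng = np.random.default_rng(seed)
    regions = {}
    for _ in range(num_samples):
        x = rng.dirichlet(np.ones(num_leader_actions))   # random point of the simplex
        regions.setdefault(best_response(x), []).append(x)
    return regions
```

Within each region the leader's expected payoff is linear in the committed mixed strategy, so the leader can optimize separately inside each region it has identified.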

4. [30%]Playing repeated Stackelberg games with unknown opponents

Marecki et al., AAMAS 2012

  • Goal: balance exploration of the follower's payoff structure against exploitation of this knowledge to optimize, on the fly, the leader's expected cumulative reward-to-go.
  • Rounds / game type (given): repeated game
  • Follower/attacker knowledge (given): in each round, knows the probability with which the leader uses each strategy but not the realized action; computes its best response accordingly.
  • Strategy (given): has a fixed optimal function.
  • Defender/leader knowledge (given): knows nothing about the follower.
  • Algorithm for solving the strategy: search with MCTS (Monte Carlo tree search) plus pruning, with a constant balancing exploration and exploitation (the repeated game is equivalent to an MDP). Drawbacks: only a single follower type is handled, and the action space is discrete.
  • The follower always sticks to its own fixed utility function, which is unknown to the leader. The leader commits to mixed strategies; the probability of executing each pure strategy is known to the follower. The leader computes an upper confidence bound over its actions (each leader action induces a best reaction from the follower). A tunable constant controls the trade-off between exploration and exploitation (see the UCB1 sketch after this list).

  • Exploration and exploitation are handled by MCTS with pruning.

  • Drawbacks

    • Only a single attacker type is discussed (only one set of …); the follower sticks to one pure strategy (utility function) the whole time.
    • The action space is discrete.
  • Detail: the algorithm does not try to model the follower's cost/utility function directly; it uses Monte Carlo sampling to estimate the utility value of each leader/follower action pair.

  • Preliminary: Stackelberg game (same as in "Learning and approximating the optimal strategy to commit to").

  • Monte-Carlo tree search: the implementation of MCTS makes use of the UCT algorithm (Bandit based Monte-Carlo planning)

  • Basic idea of abbreviated proof of convergence of the algorithm: map a repeated Bayesian Stackelberg game to an equivalent MDP.

  • Pruning:

    1. The leader identifies its own current strategy.
    2. Find the strategies that have not yet been employed.
    3. Compare them to the expected value of the current strategies.
    4. Remove those unemployed strategies whose expected value is guaranteed not to exceed the current expected value.
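The selection rule used inside UCT is UCB1; below is a generic sketch of that rule (not the paper's implementation), with `c` playing the role of the tunable exploration/exploitation constant mentioned above.

```python
import math

def ucb1_choice(stats, total_plays, c=1.4):
    """UCB1 rule used by UCT: pick the action maximising
    mean value + c * sqrt(ln(total plays) / visits).
    `stats` maps each action to a (value_sum, visits) pair; `c` is the tunable
    constant trading off exploration against exploitation."""
    best_action, best_score = None, float("-inf")
    for action, (value_sum, visits) in stats.items():
        if visits == 0:
            return action                    # always try unvisited actions first
        score = value_sum / visits + c * math.sqrt(math.log(total_plays) / visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```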

5. [30%]Commitment Without Regrets: Online Learning in Stackelberg Security Games

Maria-Florina Balcan et al., EC 2015

  • SSG: Stackelberg security game
  • Rounds / game type (given): repeated game
  • Follower/attacker knowledge (given):
  • Strategy (given): the attacker/follower is chosen adversarially from some known set $K$ of types (the follower's type comes from this set).
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy: online learning to handle uncertainty about the attackers; the leader uses the history (previous attackers' responses) to decide its mixed strategy in each new round.
  • Preliminary:

    • Time horizon

    • Set of targets

    • A defender with:

      • Resources (a set).
      • Schedules: a schedule specifies whether each target will be protected (0/1).
      • Assignment function: defines whether a resource is assigned to a schedule, i.e., whether each element of the resource set is allocated to some schedule.
      • Strategy: pure or mixed.
        • Pure strategies can be written as a 0/1 table with one row per strategy and one column per target: an entry is 1 iff that target is covered by some resource in that pure strategy (see the sketch at the end of this section).
        • A mixed strategy adds a probability vector indicating the probability of playing each pure strategy.

    • A set of all attacker types

      • An adversary selects a sequence of attackers 𝕒 such that, in every round, the attacker's type lies in the known set $K$.
      • The set $K$ is known to the defender; the sequence of attackers is unknown.
    • Utilities: payoffs for the defender and the attacker.

  • Problem formulation: how a defender can effectively protect targets in a repeated Stackelberg security game when the sequence of attackers is not known to him.

    • Full information: after an attack, the defender learns which type the attacker was.
    • Partial information
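A small sketch of the strategy representation above: pure strategies as 0/1 coverage rows, a mixed strategy as a probability vector over them, and the usual SSG-style payoffs that interpolate between a target's covered and uncovered values. The array names and the covered/uncovered payoff split are illustrative assumptions.

```python
import numpy as np

def coverage_probabilities(pure_strategies, mix):
    """Per-target coverage probability: `pure_strategies` is a 0/1 matrix with one
    row per pure strategy and one column per target; `mix` is the probability of
    playing each row."""
    return mix @ pure_strategies

def expected_utilities(pure_strategies, mix, attacked_target,
                       u_def_cov, u_def_unc, u_att_cov, u_att_unc):
    """Defender/attacker expected payoffs when `attacked_target` is attacked: the
    usual SSG interpolation between the target's covered and uncovered payoffs."""
    c = coverage_probabilities(pure_strategies, mix)[attacked_target]
    u_def = c * u_def_cov[attacked_target] + (1 - c) * u_def_unc[attacked_target]
    u_att = c * u_att_cov[attacked_target] + (1 - c) * u_att_unc[attacked_target]
    return u_def, u_att
```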

6. [30%]Optimal Network Security Hardening Using Attack Graph Games

Karel Durkota et al., IJCAI 2015

  • Introduces a novel game-theoretic model of network hardening using honeypots that extends the existing line of Stackelberg security games by combining two elements:
    1. A compact representation of strategies for attacking computer networks, called attack graphs.
    2. The defender uses deception instead of directly allocating resources to harden targets.
  • Rounds / game type (given):
  • Follower/attacker knowledge (given):
  • Strategy (given):
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy:
  • Dependency attack graphs (AG) and attacker’s strategies:

    • AND/OR graph consisting of fact nodes F (OR) and action nodes A (AND)
    • Every action has:
      1. Preconditions: a set of facts that must be true before the action can be performed.
      2. Effects: a set of facts that become true if the action is performed successfully. Preconditions and effects are represented by edges in the attack graph.
      • A probability of being performed successfully.
      • A cost, which the attacker pays regardless of whether the action succeeds; it represents the time and resources needed to perform the action.
      • A set of host types the action interacts with; once the attack has started, the attacker only interacts with hosts of that type.
      • A designated terminate/stop action.
    • Use the common monotonicity assumption that once a fact becomes true during an attack, it can never become false again as an effect of any action.
    • Optimal attack policy: linear plans produced by classical planners are not sufficient, as they cannot represent the attacker's behavior after action failures. The policy must account for
      1. how likely each action is to be executed successfully, and
      2. the order of execution.
  • Solving AG

    • The MDP for the attack graph AG is a tuple with:
      1. The set of actions.
      2. The set of states, where a state consists of: the set of actions that a rational attacker can still perform, the set of achieved facts, and the set of host types the attacker has interacted with so far.
        • There is a designated initial MDP state.
      3. The transition function: the probability that performing an action in a state leads to a given next state. There are three possible outcomes:
        1. The action interacts with a honeypot and the attack is terminated.
        2. The action does not interact with a honeypot and succeeds.
        3. The action does not interact with a honeypot and fails.
      4. The reward function: the attacker's reward for performing an action in a state and reaching the resulting next state.
    • The optimal policy in this MDP is computed by backward induction based on depth-first search, enhanced by dynamic programming (see the sketch below).
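A minimal sketch (not the paper's solver) of backward induction by memoized depth-first search over an attack-graph-style MDP. The callbacks `actions(state)` and `outcomes(state, a)`, which returns `(probability, reward, next_state)` triples, and hashable states are assumptions about how the attack-graph model is exposed.

```python
def optimal_attack_value(initial_state, actions, outcomes):
    """Value of the optimal attack policy from `initial_state`.

    Backward induction via memoized depth-first search: the value of a state is the
    best of the terminate action (value 0) and, for every available action, the
    expected immediate reward plus the value of the resulting state.
    """
    memo = {}

    def value(s):
        if s in memo:
            return memo[s]
        best = 0.0                              # value of the terminate/stop action
        for a in actions(s):
            q = sum(p * (r + value(s2)) for p, r, s2 in outcomes(s, a))
            best = max(best, q)
        memo[s] = best
        return best

    return value(initial_state)
```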

7. [10%]Learning adversary behavior in security games: a PAC model perspective

Arunesh Sinha et al., AAMAS 2016

  • The defender first learns the response function of the adversary (adversary behavior), and then optimizes against the learned response.
  • Contribution:
    1. Uses the Probably Approximately Correct (PAC) model to analyze the learning problem at hand.
    2. An analysis of the SUQR (Subjective Utility Quantal Response) model of bounded-rationality adversary behavior used in SSGs (a response-probability sketch follows at the end of this section).

      SUQR is the best known model of bounded rationality in SSGs, resulting in multiple deployed applications.

    3. Includes an entirely new behavioral model specified by the non-parametric Lipschitz (NPL) class of response functions for SSGs.
    4. Converts the PAC learning guarantee into a bound on the utility the defender derives when planning her strategy based on the learned adversary behavior model.
  • Rounds / game type (given):
  • Follower/attacker knowledge (given):
  • Strategy (given):
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy:
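The SUQR response mentioned in contribution 2, written out as code: the attacker picks target i with probability proportional to the exponential of a weighted sum of the coverage probability, the attacker's reward, and the attacker's penalty at that target. The argument names are illustrative; the weight vector `w` is what the defender learns.

```python
import numpy as np

def suqr_attack_probabilities(coverage, reward, penalty, w):
    """SUQR response: attack probability of target i is proportional to
    exp(w1 * x_i + w2 * R_i + w3 * P_i), where x_i is the coverage probability,
    R_i the attacker's reward if the target is unprotected, P_i the penalty if it
    is protected, and w = (w1, w2, w3) are the subjective-utility weights."""
    s = w[0] * coverage + w[1] * reward + w[2] * penalty
    e = np.exp(s - s.max())                 # subtract max for numerical stability
    return e / e.sum()
```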

8. [0%]Three strategies to success: Learning adversary models in security games

Nika Haghtalab et al., IJCAI 2016

  • Problem formulation:
    • A set of targets.
    • The defender's action space is defined by a set of vectors for protecting the targets.
    • The attacker attacks each target with a non-adaptive probability.
    • Goal: the defender learns the utility function of the attacker.
  • Rounds / game type (given):
  • Follower/attacker knowledge (given):
  • Strategy (given):
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy:

9. [10%]Learning to play Stackelberg security games

Avrim Blum et al., Improving homeland security decisions 2017

  • Rounds / game type (given): repeated game
  • Follower/attacker knowledge (given):
  • Strategy (given): different types; a single attacker type is drawn from a known distribution over all possible types.
  • Defender/leader knowledge (given): the attacker's utilities are unknown.
  • Algorithm for solving the strategy: gathers information about the attacker purely by observing the attacker's responses to mixed strategies; Monte Carlo tree search and no-regret learning (a generic no-regret sketch follows below).
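A generic no-regret building block of the kind referred to above: multiplicative weights over a finite set of candidate defender strategies. This is not the chapter's specific algorithm; `payoff_of` and all names are assumptions.

```python
import numpy as np

def multiplicative_weights(payoff_of, num_candidates, num_rounds, eta=0.1):
    """Multiplicative-weights (no-regret) loop. Each round, play the normalised
    weights as a distribution over candidate strategies, observe every candidate's
    payoff for that round (a vector with entries in [0, 1], via `payoff_of(t)`),
    and reweight exponentially."""
    w = np.ones(num_candidates)
    cumulative = 0.0
    for t in range(num_rounds):
        p = w / w.sum()                  # distribution actually played in round t
        payoffs = payoff_of(t)
        cumulative += p @ payoffs        # expected payoff collected this round
        w *= np.exp(eta * payoffs)       # up-weight candidates that did well
    return cumulative, w / w.sum()
```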

10. [30%]Adaptive learning in two-player Stackelberg games with continuous action sets

Guosong Yang et al., CDC 2019

  • Rounds / game type (given): repeated game
  • Follower/attacker knowledge (given):
  • Strategy (given): belongs to a known family of functions parameterized by an unknown parameter vector.
  • Defender/leader knowledge (given): has partial knowledge of the follower's action set and cost function.
  • Algorithm for solving the strategy: an adaptive learning algorithm that simultaneously estimates the unknown parameter from the follower's past actions and minimizes the leader's cost. The paper also considers the case where the parameterized function known to the leader does not perfectly match the follower's actual strategy.
  • Detail: the follower's actual strategy is assumed to belong to a known parameterized family of functions (this introduces little loss of generality, since such a strategy can be approximated on a compact set, up to arbitrary precision, as a finite weighted sum of a preselected class of basis functions). For example, an RBF model is used as the function approximator, and gradients are used to improve the RBF fit (see the sketch below).

  • Example: link-flooding denial-of-service (DoS) attacks such as the Crossfire attack.

  • RBF: radial basis function network. [1] Moody, J. and Darken, C., 1988. Learning with Localized Receptive Fields, pp. 133-143. Yale Univ., Department of Computer Science.

  • Setup: the leader (player 1) and the follower (player 2) each have an action set and a cost function; the costs depend on the combination of the action chosen by player 1 and the response action of player 2. The follower's strategy is an unknown function of the leader's action. In each round the leader's goal is to:
    1. Construct a parameter estimate that approaches the actual parameter, and
    2. Adjust its action with a gradient descent method to minimize its predicted cost, thereby solving the optimization problem.
  • The follower plays from its set of best responses against the leader's action.

  • The follower's response function belongs to a parameterized family of functions.
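A hypothetical sketch of the adaptive scheme described above: model the follower's response with an RBF expansion, re-fit the weights from observed (leader action, follower action) pairs by least squares, and take a gradient step on the leader's predicted cost. Scalar actions, the Gaussian basis, and all names (`rbf_features`, `leader_cost`, `centers`) are assumptions, not the paper's construction.

```python
import numpy as np

def rbf_features(u, centers, width=1.0):
    """Gaussian RBF features of a scalar leader action u."""
    return np.exp(-((u - centers) ** 2) / (2 * width ** 2))

def fit_response(history, centers):
    """Least-squares estimate of RBF weights from observed (u_leader, u_follower) pairs."""
    U = np.array([rbf_features(u, centers) for u, _ in history])
    y = np.array([v for _, v in history])
    w, *_ = np.linalg.lstsq(U, y, rcond=None)
    return w

def leader_step(u, w, centers, leader_cost, lr=0.05, eps=1e-3):
    """One gradient-descent step on the leader's predicted cost J1(u, r_hat(u)),
    where r_hat is the current RBF estimate of the follower's response."""
    def predicted_cost(x):
        r_hat = rbf_features(x, centers) @ w        # estimated follower response
        return leader_cost(x, r_hat)
    grad = (predicted_cost(u + eps) - predicted_cost(u - eps)) / (2 * eps)
    return u - lr * grad
```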


Guosong Yang et al., 2020

  • Stackelberg game model, a tuple consisting of:
    • the action/strategy of the router (leader/defender),
    • the action of the attacker,
    • the routing cost,
    • the attack cost.
  • Rounds / game type (given):
  • Follower/attacker knowledge (given):
  • Strategy (given):
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy:
----

12. [0%]End-to-End Game-Focused Learning of Adversary Behavior in Security Games

Andrew Perrault et al., AAAI 2020

  • Rounds / game type (given):
  • Follower/attacker knowledge (given):
  • Strategy (given): an attack function q.
  • Defender/leader knowledge (given):
  • Algorithm for solving the strategy: train a prediction model using an estimate of the defender's expected utility as the loss function (see the sketch below).
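A toy end-to-end sketch of the idea in the table row: fit the parameters of an attack function q by descending a loss equal to the negative estimated defender expected utility of the decision induced by the prediction, rather than a pure prediction loss. The softmax form of q, the proportional-coverage decision rule, and all names are stand-ins, not the paper's architecture.

```python
import numpy as np

def predicted_q(features, theta):
    """Assumed attack function q: softmax over a linear score of target features."""
    s = features @ theta
    e = np.exp(s - s.max())
    return e / e.sum()

def decision_loss(theta, features, observed_q, u_def_cov, u_def_unc):
    """Negative defender expected utility of the coverage induced by the prediction,
    evaluated against the empirically observed attack distribution."""
    coverage = predicted_q(features, theta)                       # stand-in decision rule:
    u_def = coverage * u_def_cov + (1 - coverage) * u_def_unc    # cover where attacks are predicted
    return -(observed_q @ u_def)

def fit(theta, features, observed_q, u_def_cov, u_def_unc, lr=0.1, steps=300, eps=1e-4):
    """Fit theta (a float vector) by finite-difference gradient descent on the decision loss."""
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta); d[i] = eps
            grad[i] = (decision_loss(theta + d, features, observed_q, u_def_cov, u_def_unc)
                       - decision_loss(theta - d, features, observed_q, u_def_cov, u_def_unc)) / (2 * eps)
        theta = theta - lr * grad
    return theta
```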