Stackelberg games
1. Summary: key questions in Stackelberg games
1.1. Basics:
- The game has one attacker and one defender. The defender moves first (the leader); the attacker moves second (the follower).
- The attacker should be model-based: it has its own optimal function. For example, the attacker may have some optimal function (a linear or nonlinear equation) or, in an MDP setting, its own optimal policy (below, "optimal function" covers both). In general the attacker knows the defender's strategy and maximizes its own payoff under that premise.
- The defender has some knowledge of the attacker; ideally it knows the attacker's optimal function exactly. In the more advanced setting the defender has only partial or inaccurate knowledge of the attacker. The defender's strategy is built on this knowledge: the question is how to use what is known about the attacker to optimize the defender's own strategy and thereby minimize the attacker's payoff. The defender may play a pure strategy or a mixed strategy (multiple types, one selected with some probability, which is generally more advantageous).
- The attacker's optimal function:
  - It may be fully given, partially given, learned through a statistical model (a PAC model), etc.
- Rounds: a Stackelberg game may be single-round, i.e., the defender fixes a strategy up front, deploys it, the attacker then acts, and the defender's strategy is evaluated by the attacker's payoff. It may also be multi-round (a repeated game): the defender deploys a strategy, the attacker acts, the defender deploys a new strategy on top of that, and so on.
- In the multi-round case the defender needs to consider:
  - The attacker's play over the next several rounds (comparable to how RL weights rewards over time with a discount) when choosing the current strategy.
  - Further, if the defender has only partial knowledge of the attacker's optimal function, it must balance exploitation against exploration while still guaranteeing a certain payoff (i.e., how to learn the attacker's function); a minimal sketch follows this list.
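A minimal sketch of the exploration-vs-exploitation balance mentioned above, using a UCB1-style rule over a handful of candidate defender strategies. Everything here (the candidate strategies, the simulated attacker response in `observe_payoff`, the constant `c`) is an illustrative stand-in, not taken from any of the papers below.

```python
import math
import random

def ucb1_choice(counts, means, t, c=1.0):
    """Pick the strategy index maximizing observed mean payoff + exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play every strategy once before relying on the bonus
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]))

def run(num_rounds=200, num_strategies=3):
    counts = [0] * num_strategies
    means = [0.0] * num_strategies
    # Hypothetical attacker response: each defender strategy has an unknown mean payoff.
    observe_payoff = lambda i: random.gauss([0.2, 0.5, 0.3][i], 0.1)
    for t in range(1, num_rounds + 1):
        i = ucb1_choice(counts, means, t)
        r = observe_payoff(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
    return means, counts

if __name__ == "__main__":
    print(run())
```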
1.2. Modeling aspects
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Does the game end after one round, or is it multi-round (a repeated game)? |
| Follower / attacker | Knowledge (setting) | Usually knows the leader's strategy completely (e.g., the probabilities of a mixed strategy) but not the concrete action, and computes its best response under its fixed type. |
| | Strategy (setting) | One follower or several? A fixed optimal function (a single type), or multiple types (one fixed at the start, or one drawn with some probability each round)? |
| Defender / leader | Knowledge (setting) | Ideally, knows the follower's optimal function exactly. If the follower has multiple types, the leader may know which types exist but not which one the follower chose; the question is then how (e.g., via Monte Carlo methods) to determine the strategy, directly or indirectly. |
| | Algorithm for the strategy | A mixed strategy or an optimal function: given the settings above, how is the leader's optimal strategy derived? |
1.3. Practical applications
- What if the environment the defender and attacker operate in is stochastic?
- What can serve as the optimality criterion? E.g.:
  - A factory producing goods: profit.
  - Security patrols for airports or wildlife.
  - Network-security attack and defense: how to steer the attacker into a honeypot.
1.4. Problem categories
- Stackelberg security game (SSG), Tambe 2012
- Green security game (GSG), Fang 2015, Nguyen 2015, with machine learning involved.
- PAC model of learning
2. [20%]Efficient algorithms to solve Bayesian Stackelberg games for security applications
Paruchuri et al., AAAI 2008
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Single round |
| Follower / attacker | Knowledge (setting) | Fully knows the probabilities with which the leader plays each strategy, but not the concrete action; computes its best response accordingly. |
| | Strategy (setting) | One follower, whose type is fixed from among multiple types. |
| Defender / leader | Knowledge (setting) | Knows the follower has multiple types and possibly which types exist, but not which one the follower chose. |
| | Algorithm for the strategy | Compute the optimal mixed strategy from the distribution over follower types. 1. DOBSS (Decomposed Optimal Bayesian Stackelberg Solver) 2. Approximate procedure: ASAP (Agent Security via Approximate Policies) |
- Traditional method: conversion to a normal-form game via the Harsanyi transformation.
- Proposed methods:
- Exact method: DOBSS (Decomposed Optimal Bayesian Stackelberg Solver)
- Approximate procedure: ASAP (Agent Security via Approximate Policies)
- Focus on experimental results with human subjects.
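A toy sketch of the Harsanyi transformation mentioned in the bullets above: the follower types are collapsed into a single transformed follower whose pure strategies pick one action per type, with payoffs taken in expectation over the type distribution. All payoff numbers and the type distribution below are illustrative, not from the paper.

```python
import itertools
import numpy as np

type_prob = {"t1": 0.6, "t2": 0.4}                        # P(follower type), made up
leader_actions = [0, 1]                                    # leader pure strategies
follower_actions = [0, 1]                                  # follower pure strategies

# Illustrative payoff matrices per type: payoff[type][leader action, follower action].
leader_payoff = {"t1": np.array([[2.0, 4.0], [1.0, 3.0]]),
                 "t2": np.array([[1.0, 0.0], [5.0, 2.0]])}
follower_payoff = {"t1": np.array([[1.0, 0.0], [0.0, 2.0]]),
                   "t2": np.array([[0.0, 1.0], [2.0, 0.0]])}

types = list(type_prob)
# Transformed follower strategies: one action per type, i.e. the cross product.
transformed = list(itertools.product(follower_actions, repeat=len(types)))

L = np.zeros((len(leader_actions), len(transformed)))      # expected leader payoff
F = np.zeros_like(L)                                       # expected follower payoff
for i in leader_actions:
    for j, combo in enumerate(transformed):
        for t, a in zip(types, combo):
            L[i, j] += type_prob[t] * leader_payoff[t][i, a]
            F[i, j] += type_prob[t] * follower_payoff[t][i, a]

print("transformed follower strategies:", transformed)
print("expected leader payoffs:\n", L)
```

With k types and m follower actions the transformed follower has m^k pure strategies; avoiding this exponential blow-up by decomposing over types is the motivation for DOBSS.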
3. [30%]Learning and approximating the optimal strategy to commit to
Joshua Letchford et al., ADT 2011
“The authors develop an elegant method for choosing leader strategies to uncover the follower preferences in as few rounds as possible.” — Playing repeated Stackelberg games with unknown opponents
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Repeated game |
| Follower / attacker | Knowledge (setting) | Fully knows the leader's strategy and computes its best response. |
| | Strategy (setting) | Has a fixed optimal function. |
| Defender / leader | Knowledge (setting) | Knows nothing about the follower. |
| | Algorithm for the strategy | Play certain mixed strategies, observe the follower's responses, and from these determine the optimal strategy (a positive result). The result on approximately optimal Stackelberg strategies, however, is negative. |
4. [30%]Playing repeated Stackelberg games with unknown opponents
Marecki et al., AAMAS 2012
- To balance the exploration of the follower payoff structure vs. the exploitation of this knowledge to optimize on-the-fly the expected cumulative reward-to-go of the leader.
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Repeated game |
| Follower / attacker | Knowledge (setting) | Each round, knows the probabilities with which the leader plays each strategy, but not the concrete action; computes its best response accordingly. |
| | Strategy (setting) | Has a fixed optimal function. |
| Defender / leader | Knowledge (setting) | Knows nothing about the follower. |
| | Algorithm for the strategy | Search with MCTS (Monte Carlo tree search) plus pruning; a constant balances exploration and exploitation (the game is then equivalent to an MDP). Drawbacks: handles only a single follower type, and actions are discrete. |
- The follower always sticks to its own fixed utility function, which is unknown to the leader. The leader plays a mixed strategy; the probabilities of executing each pure strategy are known to the follower. An upper confidence bound is computed (over the follower's best reaction to the action the leader plays), with a tunable constant controlling the tradeoff between exploration and exploitation.
- Exploration and exploitation are carried out via MCTS and pruning.
- Drawbacks:
  - Only a single attacker type is discussed (only one payoff set); the follower sticks to one pure strategy (utility function) the whole time.
  - Actions are discrete.
- Detail: the follower's cost/utility function is not modeled directly; Monte Carlo sampling is used to estimate the utility value of each leader/follower action pair.
- Preliminaries: Stackelberg game (same as in "Learning and approximating the optimal strategy to commit to").
- Monte Carlo tree search: the implementation of MCTS uses the UCT algorithm (bandit-based Monte Carlo planning); see the sketch after this list.
- Basic idea of the abbreviated convergence proof: map a repeated Bayesian Stackelberg game to an equivalent MDP.
- Pruning:
  - The leader identifies its own strategies.
  - It finds strategies that have not yet been employed.
  - It compares them to the expected value of the current strategies.
  - It removes unemployed strategies whose expected value is guaranteed not to exceed the current expected value.
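A minimal sketch, not the authors' implementation: UCB/UCT-style selection over a small discretized set of leader strategies, combined with the pruning idea above (never try an as-yet-unplayed strategy whose payoff upper bound provably cannot beat the best value already achieved). The follower model (`simulate_round`), the payoff bounds, and all numbers are hypothetical placeholders.

```python
import math
import random

EXPLORATION_C = 1.4  # tunable constant trading off exploration vs. exploitation

def simulate_round(strategy_idx):
    """Placeholder for one round of play: the leader commits to this strategy,
    the (unknown) follower best-responds, and the leader observes a payoff."""
    true_means = [0.6, 0.4, 0.3]
    return random.gauss(true_means[strategy_idx], 0.05)

def choose(stats, upper_bounds, t):
    """UCT-style choice with pruning of provably dominated, unplayed strategies."""
    best_mean = max((m for m, n in stats if n > 0), default=-math.inf)
    for i, (m, n) in enumerate(stats):
        if n == 0:
            if upper_bounds[i] <= best_mean:
                continue  # pruned: cannot exceed what we already achieve
            return i
    played = [i for i, (m, n) in enumerate(stats) if n > 0]
    return max(played, key=lambda i: stats[i][0]
               + EXPLORATION_C * math.sqrt(math.log(t) / stats[i][1]))

def run(rounds=300):
    stats = [(0.0, 0)] * 3                # (mean payoff, plays) per candidate strategy
    upper_bounds = [1.0, 0.9, 0.45]       # assumed known payoff upper bounds per strategy
    for t in range(1, rounds + 1):
        i = choose(stats, upper_bounds, t)
        r = simulate_round(i)
        m, n = stats[i]
        stats[i] = (m + (r - m) / (n + 1), n + 1)
    return stats

if __name__ == "__main__":
    print(run())  # the third strategy typically stays pruned (bound 0.45 < best mean)
```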
5. [30%]Commitment Without Regrets: Online Learning in Stackelberg Security Games
Maria-Florina Balcan et al., EC 2015
- SSG: Stackelberg security game
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Repeated game |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | The attacker/follower type is chosen adversarially from some known set $K$ (the follower's type comes from a known set). |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | Use online learning to deal with uncertainty about attackers: the leader uses the history (previous attackers' responses) to decide the mixed strategy for each new round. |
- Preliminaries:
  - Time horizon.
  - Set of targets.
  - A defender with:
    - Resources (a set).
    - Schedules: a schedule specifies whether each target is protected (0/1).
    - An assignment function: defines whether a resource is assigned to a schedule, i.e., whether each element of the resource set is allocated to some schedule.
    - A strategy, pure or mixed. Pure strategies can be shown as a 0/1 table, where an entry is 1 iff the target is covered by some resource in that pure strategy; a mixed strategy adds a probability vector indicating the probability of playing each pure strategy (see the sketch after this list).

      | | Target 1 | Target 2 |
      | --- | --- | --- |
      | Strategy 1 | 0/1 | 0/1 |
      | Strategy 2 | 0/1 | 0/1 |

  - A set of all attacker types. An adversary selects a sequence of attackers, one per round, from this set; the set of types is known to the defender, but which type arrives in each round is unknown.
  - Utilities: payoffs for the defender and the attacker.
- Problem formulation: how a defender can effectively protect targets in a repeated Stackelberg security game when the sequence of attackers is not known to him.
  - Full information: after being attacked, the defender learns which type the attacker was.
  - Partial information.
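A minimal sketch of the pure/mixed-strategy representation above, with made-up numbers: pure strategies as 0/1 coverage rows, a mixed strategy as a probability vector over them, and the usual covered/uncovered utility form per target. The payoff matrices and the best-responding attacker here are illustrative, not the paper's.

```python
import numpy as np

pure = np.array([[1, 0, 1],          # pure strategy 1: which targets are covered
                 [0, 1, 1],          # pure strategy 2
                 [1, 1, 0]])         # pure strategy 3
mix = np.array([0.5, 0.3, 0.2])      # probability of playing each pure strategy

coverage = mix @ pure                # marginal probability that each target is covered
print("coverage per target:", coverage)

# Illustrative utilities: rows are targets, columns are (covered, uncovered).
def_util = np.array([[ 1.0, -3.0], [ 2.0, -1.0], [ 0.5, -2.0]])
atk_util = np.array([[-1.0,  4.0], [-2.0,  1.0], [-0.5,  3.0]])

def expected_utilities(cov, target):
    c = cov[target]
    u_def = c * def_util[target, 0] + (1 - c) * def_util[target, 1]
    u_atk = c * atk_util[target, 0] + (1 - c) * atk_util[target, 1]
    return u_def, u_atk

# A best-responding attacker of this (made-up) type attacks the target that
# maximizes its own expected utility given the leader's coverage.
best_target = max(range(pure.shape[1]), key=lambda t: expected_utilities(coverage, t)[1])
print("attacked target:", best_target, "->", expected_utilities(coverage, best_target))
```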
6. [30%]Optimal Network Security Hardening Using Attack Graph Games
Karel Durkota et al., IJCAI 2015
- Introduces a novel game-theoretic model of network hardening using honeypots that extends the existing line of Stackelberg security games by combining two elements:
  - It adopts a compact representation of strategies for attacking computer networks, called attack graphs.
  - The defender uses deception instead of directly allocating resources to harden targets.
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | |
- Dependency attack graphs (AGs) and attacker strategies:
  - An AND/OR graph consisting of fact nodes F (OR) and action nodes A (AND).
  - Every action has:
    - Preconditions: a set of facts that must be true before the action can be performed.
    - Effects: a set of facts that become true if the action is performed successfully. Preconditions and effects are represented by edges in the attack graph.
    - An associated probability of being performed successfully.
    - An associated cost, which the attacker pays regardless of whether the action succeeds; it represents the time and resources the attacker needs to perform the action.
    - A set of host types it acts on; after starting the attack, the attacker only interacts with hosts of that type.
    - A dedicated terminate/stop action is also available.
  - Use the common monotonicity assumption that once a fact becomes true during an attack, it can never become false again as an effect of any action.
  - Optimal attack policy: linear plans that may be provided by classical planners are not sufficient, as they cannot represent the attacker's behavior after action failures. The policy needs to capture:
    - How likely each action is to be executed.
    - The order of execution.
- Solving the AG:
  - The MDP for an attack graph AG is a tuple of states, actions, transition probabilities, and rewards, where:
    - The actions are the actions of the attack graph.
    - A state consists of the set of actions that a rational attacker can still perform, the set of achieved facts, and the set of host types the attacker has interacted with so far.
    - The initial MDP state has all actions still available, no achieved facts, and no host types interacted with.
    - The transition function gives the probability that performing an action in a state leads to a given next state; there are three possible outcomes:
      - The action interacts with a honeypot: the attack is terminated.
      - The action does not interact with a honeypot and is successful.
      - The action does not interact with a honeypot and is not successful.
    - The reward function gives the attacker's reward for performing an action in a state and reaching a next state.
  - Compute the optimal policy in this MDP using backward induction based on depth-first search, enhanced with dynamic programming (see the sketch below).
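A minimal sketch of the backward-induction idea just described, on a toy two-action attack graph (preconditions and honeypot interactions are omitted, and all numbers are invented): a state is the pair (remaining actions, achieved facts), memoization plays the role of the dynamic-programming enhancement, and the monotonicity assumption lets achieved facts only grow.

```python
from functools import lru_cache

ACTIONS = {
    # name: (success_prob, cost, effects, reward)
    "exploit_web": (0.7, 1.0, frozenset({"web"}), 4.0),
    "exploit_db":  (0.5, 2.0, frozenset({"db"}),  8.0),
}

@lru_cache(maxsize=None)
def value(remaining, facts):
    """Optimal expected attacker value from this state (backward induction)."""
    best = 0.0  # the terminate/stop action yields 0
    for name in remaining:
        p, cost, effects, reward = ACTIONS[name]
        rest = remaining - {name}
        # Monotonicity: achieved facts never become false again.
        succeed = reward + value(rest, facts | effects)
        fail = value(rest, facts)
        best = max(best, -cost + p * succeed + (1 - p) * fail)
    return best

if __name__ == "__main__":
    start = (frozenset(ACTIONS), frozenset())
    print("optimal expected attacker value:", value(*start))
```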
7. [10%]Learning adversary behavior in security games: a PAC model perspective
Arunesh Sinha et al., AAMAS 2016
- The defender first learns the response function of the adversary (adversary behavior), and then optimizes against the learned response.
- Contribution:
- Use Probably Approximately Correct (PAC) model to analyze the learning problem at hand.
- An analysis of the SUQR (Subjective Utility Quantal Response) model of bounded rationality adversary behavior used in SSGs.
SUQR is the best known model of bounded rationality in SSGs, resulting in multiple deployed applications.
- Includes an entirely new behavioral model specified by the non-parametric Lipschitz (NPL) class of response functions for SSGs.
- Convert PAC learning guarantee into a bound on the utility derived by the defender when planning her strategy based on the learned adversary behavior model.
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | |
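A minimal sketch of an SUQR-style response function as it is commonly written in the SSG literature that this paper analyzes: the attack probability for each target is a softmax of a subjective utility that is linear in the defender's coverage and the attacker's reward and penalty. The weights and target features below are illustrative, not values learned in the paper.

```python
import numpy as np

def suqr_attack_probs(coverage, attacker_reward, attacker_penalty, w=(-8.0, 0.5, 0.3)):
    """Probability of attacking each target: softmax of a subjective utility
    that is linear in coverage, attacker reward, and attacker penalty."""
    w1, w2, w3 = w
    su = w1 * coverage + w2 * attacker_reward + w3 * attacker_penalty
    e = np.exp(su - su.max())          # numerically stabilized softmax
    return e / e.sum()

coverage = np.array([0.6, 0.3, 0.1])   # defender coverage probabilities (made up)
reward = np.array([5.0, 3.0, 8.0])     # attacker reward if the attack succeeds
penalty = np.array([-2.0, -1.0, -4.0]) # attacker penalty if caught
print(suqr_attack_probs(coverage, reward, penalty))
```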
8. [0%]Three strategies to success: Learning adversary models in security games
Nika Haghtalab et al., IJCAI 2016
- Problem formulation:
  - A set of targets.
  - The defender's action space is defined by a set of vectors for protecting the targets.
  - The attacker attacks a target based on a non-adaptive probability.
  - Goal: the defender learns the utility function of the attacker.
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | |
9. [10%]Learning to play Stackelberg security games
Avrim Blum et al., Improving homeland security decisions 2017
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Repeated game |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | Different types; a single attacker type is drawn from a known distribution over all possible types. |
| Defender / leader | Knowledge (setting) | Unknown attacker utilities |
| | Algorithm for the strategy | Gathers information about the attacker purely by observing the attacker's responses to mixed strategies; Monte Carlo tree search and no-regret learning. |
10. [30%]Adaptive learning in two-player Stackelberg games with continuous action sets
Guosong Yang et al., CDC 2019
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | Repeated game |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | Belongs to a known family of functions parameterized by an unknown parameter vector. |
| Defender / leader | Knowledge (setting) | Has partial knowledge of the follower's action set and cost function. |
| | Algorithm for the strategy | An adaptive learning algorithm that simultaneously estimates the unknown parameter from the follower's past actions and minimizes the leader's cost. Also considers the case where the parameterized function known to the leader does not perfectly match the follower's actual strategy. |
- Detail: the follower's actual strategy is assumed to belong to a known parameterized family of functions (which introduces little loss of generality, since the strategy can be approximated on a compact set up to arbitrary precision as a finite weighted sum of a preselected class of basis functions), e.g., using an RBF model as the function approximator for the utility of the best leader/follower action pair; gradients are then used to improve the RBF (see the sketch at the end of this section).
- Example: link-flooding denial-of-service (DoS) attacks such as the Crossfire attack.
- RBF: radial basis function network. [1] Moody, J. and Darken, C., 1988. Learning with localized receptive fields (pp. 133-143). Yale Univ., Department of Computer Science.
| | Leader (player 1) | Follower (player 2) |
| --- | --- | --- |
| Action set | | |
| Cost function | (player 1 chooses the action) | |
| Strategy | Goal: | unknown function |
- The algorithm:
  - Construct a parameter estimate that approaches the actual parameter.
  - Adjust the leader's action via gradient descent to minimize its predicted cost, thereby solving the optimization problem.
- Best responses: the follower plays a best response from the set of best responses against the leader's action, and its strategy belongs to a parameterized family of functions.
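A minimal sketch in the spirit of the two steps above, with invented functions and numbers: the leader models the follower's (scalar) response as a weighted sum of RBF basis functions of the leader's action, fits the weights by least squares from observed (leader action, follower action) pairs, and then runs gradient descent on its predicted cost. This is an illustration of the idea, not the paper's algorithm or notation.

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 7)                 # preselected RBF centers
width = 0.15

def phi(u):
    """RBF features of the leader action u."""
    return np.exp(-((u - centers) ** 2) / (2 * width ** 2))

def true_follower(u):
    # Unknown to the leader; used only to generate observations in this toy.
    return 0.8 * np.sin(3 * u) + 0.1

# 1) Parameter estimation: least-squares fit of the RBF weights from past rounds.
past_u = np.random.uniform(0, 1, size=40)
past_v = np.array([true_follower(u) for u in past_u])
Phi = np.vstack([phi(u) for u in past_u])
theta, *_ = np.linalg.lstsq(Phi, past_v, rcond=None)

def predicted_follower(u):
    return phi(u) @ theta

def leader_cost(u, v):
    # Illustrative cost: small follower response, while staying near action 0.5.
    return v ** 2 + 0.1 * (u - 0.5) ** 2

# 2) Leader update: numerical gradient descent on the predicted cost.
u = 0.2
for _ in range(200):
    eps = 1e-4
    g = (leader_cost(u + eps, predicted_follower(u + eps))
         - leader_cost(u - eps, predicted_follower(u - eps))) / (2 * eps)
    u = float(np.clip(u - 0.05 * g, 0.0, 1.0))

print("chosen leader action:", u, "predicted follower response:", predicted_follower(u))
```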
11. [0%]Modeling and mitigating link-flooding distributed denial-of-service attacks via learning in Stackelberg games
Guosong Yang et al., 2020
- Stackelberg game model tuple, consisting of:
  - the action/strategy of the router (leader/defender),
  - the action of the attacker,
  - the routing cost,
  - the attack cost.
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | |
12. [0%]End-to-End Game-Focused Learning of Adversary Behavior in Security Games
Andrew Perrault et al., AAAI 2020
| Item | Aspect | Explanation |
| --- | --- | --- |
| Environment / game type | Rounds (setting) | |
| Follower / attacker | Knowledge (setting) | |
| | Strategy (setting) | Attack function q. |
| Defender / leader | Knowledge (setting) | |
| | Algorithm for the strategy | Train a prediction model using an estimate of defender expected utility as the loss function (see the sketch below). |
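A minimal sketch of the training signal described in the table above, with made-up data and a toy attack model: rather than scoring the attack-function model q purely on predictive accuracy, the model parameters are scored by the defender expected utility obtained when planning coverage against the predicted q. The planner, the features, and all numbers are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_targets, budget = 5, 2.0
def_util = rng.uniform(-5, 0, n_targets)      # defender utility if a target is attacked uncovered
features = rng.normal(size=n_targets)         # one feature per target

def predicted_q(coverage, w):
    """Toy attack model: softmax over targets, less attractive when covered."""
    score = w[0] * features + w[1] * coverage
    e = np.exp(score - score.max())
    return e / e.sum()

def plan_coverage(w, steps=50):
    """Plan against the predicted q: simple heuristic projected ascent on defender utility."""
    c = np.full(n_targets, budget / n_targets)
    for _ in range(steps):
        q = predicted_q(c, w)
        grad = -q * def_util                   # push coverage where predicted attacks hurt most
        c = np.clip(c + 0.1 * grad, 0, 1)
        c *= budget / c.sum()                  # renormalize to the coverage budget
    return c

def game_focused_loss(w, empirical_q):
    """Negative defender expected utility of the planned coverage, evaluated
    under the empirically observed attack distribution."""
    c = plan_coverage(w)
    return -np.sum(empirical_q * (1 - c) * def_util)

empirical_q = rng.dirichlet(np.ones(n_targets))   # stand-in for observed attack frequencies
w = np.array([0.5, -1.0])
print("game-focused loss:", game_focused_loss(w, empirical_q))
```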