1. An adaptive performance modeling tool for GPU architectures
Sara S. Baghsorkhi et al., PPoPP 2010
- progress: 40%, link
1.1. Main work
- Designed a model that provides performance information to an auto-tuning compiler (which compiler?) and assists it in narrowing down the search (i.e., pruning the search space). "We introduce an abstract interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel"
- How does it avoid depending on a specific GPU architecture / high-level programming interface?
- PDG: used for performance evaluation; a framework to represent control and data dependences for each program operation
- work flow graph
- key factors
1.2. Terms
- SPMD: Single-Program Multiple-Data
- SIMD: Single-Instruction, Multiple-Data
- thread granularities: NVIDIA thread-blocks, ATI groups, OpenCL work-groups; threads within a thread-block are grouped into warps
- SM: streaming multiprocessor
- PDG: program dependence graph
- WLP: warp-level parallelism (threads within a thread-block are grouped into warps)
- DLP: data-level parallelism
- TLP: thread-level parallelism
- ILP: instruction-level parallelism
- NUMblocks: the number of active thread-blocks on a streaming multiprocessor
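To make NUMblocks concrete, here is a minimal sketch of how warps per block and active blocks per SM relate. The per-SM limits below are hypothetical assumptions (real GPUs also bound NUMblocks by registers and shared memory usage):

```python
# Hypothetical per-SM limits (assumptions, not from the paper);
# real values depend on the GPU generation.
WARP_SIZE = 32
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 32

def warps_per_block(threads_per_block):
    """Threads within a thread-block are grouped into warps of WARP_SIZE."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division

def num_blocks(threads_per_block):
    """NUMblocks: active thread-blocks on one SM, bounded here only by the
    block-count and warp-count limits (ignoring registers / shared memory)."""
    return min(MAX_BLOCKS_PER_SM,
               MAX_WARPS_PER_SM // warps_per_block(threads_per_block))
```

For example, 256-thread blocks form 8 warps each, so at most 4 such blocks fit under a 32-warp limit.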
1.3. Performance Model
- Parallelism at multiple levels
- kernel level -> warp level -> thread level
```mermaid
graph TD;
    kernel[kernel]
    warp[warp level:<br />GPUs attempt to reduce memory latency by exploiting the data-level parallelism]
    thread[thread level:<br />instruction-level parallelism can still improve performance by partially covering intra-warp stalls]
    kernel-->warp
    warp-->thread
```
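A toy illustration of the warp-level idea above (not the paper's model, and the cycle counts are assumptions): with enough concurrently schedulable warps, an SM can cover a global memory stall, because the other warps keep issuing instructions while one warp waits.

```python
def memory_latency_hidden(num_warps, issue_cycles_per_warp, mem_latency_cycles):
    """While one warp stalls on global memory, the remaining warps issue
    instructions; the stall is fully covered if their combined work
    spans the memory latency."""
    return (num_warps - 1) * issue_cycles_per_warp >= mem_latency_cycles
```

With 24 active warps that each have 20 cycles of independent work, a 400-cycle memory latency is covered; with only 8 warps it is not.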
1.4. Work flow graph
- What to take into consideration:
- SIMD pipeline latency, global memory latency
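The two factors above can be combined into a back-of-the-envelope per-warp time estimate. This is a sketch with assumed parameters, not the paper's actual work-flow-graph computation:

```python
def kernel_cycles(n_compute, simd_latency, n_mem, mem_latency, exposed_fraction):
    """Rough cycle estimate for one warp's path through a work flow graph:
    each compute node costs the SIMD pipeline latency, and each memory node
    adds only the fraction of global memory latency NOT hidden by other
    warps (exposed_fraction in [0, 1])."""
    return n_compute * simd_latency + n_mem * mem_latency * exposed_fraction
```

For instance, 100 compute nodes at 4 cycles each plus 10 memory nodes of 400-cycle latency, of which 75% is hidden by warp-level parallelism, gives 400 + 1000 = 1400 cycles.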
1.5. Measurement
- execution time