1. An adaptive performance modeling tool for GPU architectures

Sara S. Baghsorkhi et al., PPoPP 2010

  • progress: 40%, link

1.1. 主要工作

  • Designed a model that provides performance information to an auto-tuning compiler (which compiler? unclear), assisting it in pruning the search space. "We introduce an abstract interpretation of a GPU kernel, the work flow graph, based on which we estimate the execution time of a GPU kernel"

  • How the model avoids depending on a specific GPU architecture / high-level programming interface

  • PDG: used for performance evaluation; a framework to represent the control and data dependences of each program operation

  • work flow graph

  • key factors
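The work flow graph idea above can be sketched as a small PDG-like structure: nodes are operation groups weighted by estimated latency, and kernel time is estimated along the critical path. Node names, latencies, and the traversal rule are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a "work flow graph": latency-weighted nodes
# with successor edges, estimated by walking the critical path.
from dataclasses import dataclass, field

@dataclass
class WfgNode:
    name: str
    latency: float            # assumed cycle cost of this operation group
    succs: list = field(default_factory=list)

def estimate_path_cycles(node):
    """Estimate cycles along the longest (critical) path from `node`."""
    if not node.succs:
        return node.latency
    return node.latency + max(estimate_path_cycles(s) for s in node.succs)

# Toy kernel: global load -> compute -> store (latencies are made up)
store = WfgNode("store", latency=4)
comp  = WfgNode("compute", latency=24, succs=[store])
load  = WfgNode("load", latency=400, succs=[comp])

print(estimate_path_cycles(load))  # 428
```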

1.2. Term

  • SPMD: Single-Program Multiple-Data

  • SIMD: Single-Instruction, Multiple-Data

  • thread granularities

    NVIDIA: thread-block
    ATI: group
    OpenCL: work-group

    Threads within a thread-block are grouped into warps

  • SM: streaming multiprocessor

  • PDG: program dependence graph

  • WLP: Warp-level parallelism (threads within a thread-block are grouped into warps)

  • DLP: data-level parallelism

  • TLP: Thread-level parallelism

  • ILP: instruction-level parallelism

  • NUMblocks: the number of active thread-blocks on a streaming multiprocessor
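The terms above can be tied together with a toy calculation: warps per thread-block follow from the warp size, and NUMblocks is bounded by per-SM resources. The constants and helper names here are assumptions for illustration (real GPUs also bound NUMblocks by registers and shared memory, not just a thread limit).

```python
# Illustrative helpers (assumed names/constants) relating warp size,
# thread-block size, and active blocks per SM (NUMblocks).
WARP_SIZE = 32                 # NVIDIA warp size
MAX_THREADS_PER_SM = 1024      # assumed per-SM thread limit

def warps_per_block(block_threads):
    """Warps needed for one thread-block (ceiling division)."""
    return -(-block_threads // WARP_SIZE)

def num_blocks(block_threads):
    """Active thread-blocks per SM, limited only by the thread cap here."""
    return MAX_THREADS_PER_SM // block_threads

print(warps_per_block(256))  # 8 warps per thread-block
print(num_blocks(256))       # 4 active blocks per SM
```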

1.3. Performance Model

  • Parallelism at multiple levels
    • kernel level -> warp level -> thread level
graph TD;

	kernel[kernel]
	warp[warp level:<br />GPUs attempt to reduce memory latency by exploiting the data-level parallelism]
	thread[thread level:<br />instruction-level parallelism can still improve performance by partially covering intra-warp stalls]
	
	kernel-->warp
	warp-->thread
	

1.4. Work flow graph

  • What to take into consideration:
    • SIMD pipeline latency, global memory latency
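A minimal sketch of how these two factors might combine (an assumed first-order formula, not the paper's exact model): memory latency is hidden by other warps' computation, so the exposed stall per warp is roughly the memory latency minus the overlapping compute of the remaining active warps.

```python
# Toy latency-hiding estimate (assumed model): per warp-iteration,
# kernel time = SIMD compute time + memory latency not covered by
# the other active warps' computation.
def estimate_cycles(compute_cycles, mem_latency, active_warps):
    overlap = compute_cycles * (active_warps - 1)  # other warps compute
    exposed = max(mem_latency - overlap, 0)        # uncovered stall
    return compute_cycles + exposed

# With 8 active warps, a 400-cycle global load is only partially hidden:
print(estimate_cycles(compute_cycles=40, mem_latency=400, active_warps=8))  # 160
```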

1.5. Measurement

  • Execution time