clang  6.0.0svn
CGOpenMPRuntimeNVPTX.cpp File Reference
#include "CGOpenMPRuntimeNVPTX.h"
#include "clang/AST/DeclOpenMP.h"
#include "CodeGenFunction.h"
#include "clang/AST/StmtOpenMP.h"
Include dependency graph for CGOpenMPRuntimeNVPTX.cpp:

Go to the source code of this file.

Classes

struct  CopyOptionsTy
 

Enumerations

enum  OpenMPRTLFunctionNVPTX
 
enum  MachineConfiguration : unsigned
 GPU Configuration: This information can be derived from CUDA registers; however, providing compile-time constants helps generate more efficient code. More...
 
enum  NamedBarrier : unsigned
 
enum  CopyAction : unsigned
 

Functions

static llvm::Value * getNVPTXWarpSize (CodeGenFunction &CGF)
 Get the GPU warp size. More...
 
static llvm::Value * getNVPTXThreadID (CodeGenFunction &CGF)
 Get the id of the current thread on the GPU. More...
 
static llvm::Value * getNVPTXWarpID (CodeGenFunction &CGF)
 Get the id of the warp in the block. More...
 
static llvm::Value * getNVPTXLaneID (CodeGenFunction &CGF)
 Get the id of the current lane in the warp. More...
 
static llvm::Value * getNVPTXNumThreads (CodeGenFunction &CGF)
 Get the maximum number of threads in a block of the GPU. More...
 
static void getNVPTXCTABarrier (CodeGenFunction &CGF)
 Get barrier to synchronize all threads in a block. More...
 
static void getNVPTXBarrier (CodeGenFunction &CGF, int ID, llvm::Value *NumThreads)
 Get barrier #ID to synchronize selected (multiple of warp size) threads in a CTA. More...
 
static void syncCTAThreads (CodeGenFunction &CGF)
 Synchronize all GPU threads in a block. More...
 
static void syncParallelThreads (CodeGenFunction &CGF, llvm::Value *NumThreads)
 Synchronize worker threads in a parallel region. More...
 
static llvm::Value * getThreadLimit (CodeGenFunction &CGF, bool IsInSpmdExecutionMode=false)
 Get the value of the thread_limit clause in the teams directive. More...
 
static llvm::Value * getMasterThreadID (CodeGenFunction &CGF)
 Get the thread id of the OMP master thread. More...
 
static CGOpenMPRuntimeNVPTX::ExecutionMode getExecutionModeForDirective (CodeGenModule &CGM, const OMPExecutableDirective &D)
 
static void setPropertyExecutionMode (CodeGenModule &CGM, StringRef Name, CGOpenMPRuntimeNVPTX::ExecutionMode Mode)
 
static llvm::Value * createRuntimeShuffleFunction (CodeGenFunction &CGF, QualType ElemTy, llvm::Value *Elem, llvm::Value *Offset)
 This function creates calls to one of two shuffle functions to copy variables between lanes in a warp. More...
 
static void emitReductionListCopy (CopyAction Action, CodeGenFunction &CGF, QualType ReductionArrayTy, ArrayRef< const Expr *> Privates, Address SrcBase, Address DestBase, CopyOptionsTy CopyOptions={nullptr, nullptr, nullptr})
 Emit instructions to copy a Reduce list, which contains partially aggregated values, in the specified direction. More...
 
static llvm::Value * emitReduceScratchpadFunction (CodeGenModule &CGM, ArrayRef< const Expr *> Privates, QualType ReductionArrayTy, llvm::Value *ReduceFn)
 This function emits a helper that loads data from the scratchpad array and (optionally) reduces it with the input operand. More...
 
static llvm::Value * emitCopyToScratchpad (CodeGenModule &CGM, ArrayRef< const Expr *> Privates, QualType ReductionArrayTy)
 This function emits a helper that stores reduced data from the team master to a scratchpad array in global memory. More...
 
static llvm::Value * emitInterWarpCopyFunction (CodeGenModule &CGM, ArrayRef< const Expr *> Privates, QualType ReductionArrayTy)
 This function emits a helper that gathers Reduce lists from the first lane of every active warp to lanes in the first warp. More...
 
static llvm::Value * emitShuffleAndReduceFunction (CodeGenModule &CGM, ArrayRef< const Expr *> Privates, QualType ReductionArrayTy, llvm::Value *ReduceFn)
 Emit a helper that reduces data across two OpenMP threads (lanes) in the same warp. More...
 

Enumeration Type Documentation

◆ CopyAction

enum CopyAction : unsigned

Definition at line 1084 of file CGOpenMPRuntimeNVPTX.cpp.

◆ MachineConfiguration

enum MachineConfiguration : unsigned

GPU Configuration: This information can be derived from CUDA registers; however, providing compile-time constants helps generate more efficient code.

For all practical purposes this is fine because the configuration is the same for all known NVPTX architectures.

Definition at line 134 of file CGOpenMPRuntimeNVPTX.cpp.

◆ NamedBarrier

enum NamedBarrier : unsigned

Definition at line 145 of file CGOpenMPRuntimeNVPTX.cpp.

◆ OpenMPRTLFunctionNVPTX

Definition at line 24 of file CGOpenMPRuntimeNVPTX.cpp.

Function Documentation

◆ createRuntimeShuffleFunction()

static llvm::Value* createRuntimeShuffleFunction ( CodeGenFunction & CGF,
QualType  ElemTy,
llvm::Value *  Elem,
llvm::Value *  Offset 
)
static

◆ emitCopyToScratchpad()

static llvm::Value* emitCopyToScratchpad ( CodeGenModule & CGM,
ArrayRef< const Expr *>  Privates,
QualType  ReductionArrayTy 
)
static

◆ emitInterWarpCopyFunction()

static llvm::Value* emitInterWarpCopyFunction ( CodeGenModule & CGM,
ArrayRef< const Expr *>  Privates,
QualType  ReductionArrayTy 
)
static

This function emits a helper that gathers Reduce lists from the first lane of every active warp to lanes in the first warp.

void inter_warp_copy_func(void* reduce_data, num_warps)
   shared smem[warp_size];
   For all data entries D in reduce_data:
      If (I am the first lane in each warp)
         Copy my local D to smem[warp_id]
      sync
      if (I am the first warp)
         Copy smem[thread_id] to my local D
      sync

Definition at line 1524 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenTypes::arrangeBuiltinFunctionDeclaration(), clang::CodeGen::CodeGenFunction::Builder, clang::CodeGen::CodeGenFunction::ConvertTypeForMem(), clang::Create(), clang::CodeGen::CodeGenFunction::createBasicBlock(), clang::cuda_shared, clang::CodeGen::CodeGenFunction::disableDebugInfo(), clang::CodeGen::CodeGenFunction::EmitBlock(), clang::CodeGen::CodeGenFunction::EmitLoadOfScalar(), clang::CodeGen::CodeGenFunction::EmitStoreOfScalar(), clang::CodeGen::CodeGenFunction::FinishFunction(), clang::CodeGen::CodeGenFunction::GetAddrOfLocalVar(), clang::CodeGen::CodeGenModule::getContext(), clang::CodeGen::CodeGenTypes::GetFunctionType(), clang::CodeGen::CodeGenModule::getModule(), getNVPTXLaneID(), getNVPTXThreadID(), getNVPTXWarpID(), getNVPTXWarpSize(), clang::CodeGen::CodeGenTypeCache::getPointerAlign(), clang::CodeGen::CodeGenTypeCache::getPointerSize(), clang::CodeGen::Address::getType(), clang::CodeGen::CodeGenModule::getTypes(), clang::CodeGen::CodeGenTypeCache::Int64Ty, clang::InternalLinkage, clang::ImplicitParamDecl::Other, clang::CodeGen::CodeGenModule::SetInternalFunctionAttributes(), clang::CodeGen::CodeGenFunction::StartFunction(), and syncParallelThreads().

◆ emitReduceScratchpadFunction()

static llvm::Value* emitReduceScratchpadFunction ( CodeGenModule & CGM,
ArrayRef< const Expr *>  Privates,
QualType  ReductionArrayTy,
llvm::Value *  ReduceFn 
)
static

This function emits a helper that loads data from the scratchpad array and (optionally) reduces it with the input operand.

load_and_reduce(local, scratchpad, index, width, should_reduce)
   reduce_data remote;
   for elem in remote:
      remote.elem = Scratchpad[elem_id][index]
   if (should_reduce)
      local = local @ remote
   else
      local = remote

Definition at line 1312 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenTypes::arrangeBuiltinFunctionDeclaration(), clang::CodeGen::CodeGenFunction::Builder, clang::CodeGen::CodeGenFunction::ConvertTypeForMem(), clang::Create(), clang::CodeGen::CodeGenFunction::createBasicBlock(), clang::CodeGen::CodeGenFunction::CreateMemTemp(), clang::CodeGen::CodeGenFunction::disableDebugInfo(), clang::CodeGen::CodeGenFunction::EmitBlock(), clang::CodeGen::CodeGenFunction::EmitCallOrInvoke(), clang::CodeGen::CodeGenFunction::EmitLoadOfScalar(), emitReductionListCopy(), clang::CodeGen::CodeGenFunction::FinishFunction(), clang::CodeGen::CodeGenFunction::GetAddrOfLocalVar(), clang::CodeGen::CodeGenModule::getContext(), clang::CodeGen::CodeGenTypes::GetFunctionType(), clang::CodeGen::CodeGenModule::getModule(), clang::CodeGen::CodeGenTypeCache::getPointerAlign(), clang::CodeGen::CodeGenModule::getTypes(), clang::InternalLinkage, clang::ImplicitParamDecl::Other, clang::CodeGen::CodeGenModule::SetInternalFunctionAttributes(), clang::CodeGen::CodeGenTypeCache::SizeTy, clang::CodeGen::CodeGenFunction::StartFunction(), and clang::CodeGen::CodeGenTypeCache::VoidPtrTy.

◆ emitReductionListCopy()

static void emitReductionListCopy ( CopyAction  Action,
CodeGenFunction & CGF,
QualType  ReductionArrayTy,
ArrayRef< const Expr *>  Privates,
Address  SrcBase,
Address  DestBase,
CopyOptionsTy  CopyOptions = {nullptr, nullptr, nullptr} 
)
static

Emit instructions to copy a Reduce list, which contains partially aggregated values, in the specified direction.

Definition at line 1106 of file CGOpenMPRuntimeNVPTX.cpp.

Referenced by emitCopyToScratchpad(), emitReduceScratchpadFunction(), and emitShuffleAndReduceFunction().

◆ emitShuffleAndReduceFunction()

static llvm::Value* emitShuffleAndReduceFunction ( CodeGenModule & CGM,
ArrayRef< const Expr *>  Privates,
QualType  ReductionArrayTy,
llvm::Value *  ReduceFn 
)
static

Emit a helper that reduces data across two OpenMP threads (lanes) in the same warp.

It uses shuffle instructions to copy over data from a remote lane's stack. The reduction algorithm performed is specified by the fourth parameter.

Algorithm Versions.
Full Warp Reduce (argument value 0): This algorithm assumes that all 32 lanes are active and gathers data from these 32 lanes, producing a single resultant value.
Contiguous Partial Warp Reduce (argument value 1): This algorithm assumes that only a contiguous subset of lanes are active. This happens for the last warp in a parallel region when the user-specified num_threads is not an integer multiple of 32. This contiguous subset always starts with the zeroth lane.
Partial Warp Reduce (argument value 2): This algorithm gathers data from any number of lanes at any position. All reduced values are stored in the lowest possible lane.
The set of problems every algorithm addresses is a superset of those addressable by algorithms with a lower version number. Overhead increases as algorithm version increases.

Terminology
Reduce element: Reduce element refers to the individual data field with primitive data types to be combined and reduced across threads.
Reduce list: Reduce list refers to a collection of local, thread-private reduce elements.
Remote Reduce list: Remote Reduce list refers to a collection of remote (relative to the current thread) reduce elements.

We distinguish between three states of threads that are important to the implementation of this function.
Alive threads: Threads in a warp executing the SIMT instruction, as distinguished from threads that are inactive due to divergent control flow.
Active threads: The minimal set of threads that has to be alive upon entry to this function. The computation is correct iff active threads are alive. Some threads are alive but not active because they do not contribute to the computation in any useful manner. Turning them off may introduce control flow overheads without any tangible benefits.
Effective threads: In order to comply with the argument requirements of the shuffle function, we must keep all lanes holding data alive. But at most half of them perform value aggregation; we refer to this half of threads as effective. The other half is simply handing off its data.

Procedure
Value shuffle: In this step active threads transfer data from higher lane positions in the warp to lower lane positions, creating the Remote Reduce list.
Value aggregation: In this step, effective threads combine their thread-local Reduce list with the Remote Reduce list and store the result in the thread-local Reduce list.
Value copy: In this step, we deal with the assumption made by algorithm 2 (i.e. the contiguity assumption). When we have an odd number of lanes active, say 2k+1, only k threads will be effective and therefore k new values will be produced. However, the Reduce list owned by the (2k+1)th thread is ignored in the value aggregation. Therefore we copy the Reduce list from the (2k+1)th lane to the (k+1)th lane so that the contiguity assumption still holds.

Definition at line 1774 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenTypes::arrangeBuiltinFunctionDeclaration(), clang::CodeGen::CodeGenFunction::Builder, clang::CodeGen::CodeGenFunction::ConvertTypeForMem(), clang::Create(), clang::CodeGen::CodeGenFunction::createBasicBlock(), clang::CodeGen::CodeGenFunction::CreateMemTemp(), clang::CodeGen::CodeGenFunction::disableDebugInfo(), clang::CodeGen::CodeGenFunction::EmitBlock(), clang::CodeGen::CodeGenFunction::EmitCallOrInvoke(), clang::CodeGen::CodeGenFunction::EmitLoadOfScalar(), emitReductionListCopy(), clang::CodeGen::CodeGenFunction::FinishFunction(), clang::CodeGen::CodeGenFunction::GetAddrOfLocalVar(), clang::CodeGen::CodeGenModule::getContext(), clang::CodeGen::CodeGenTypes::GetFunctionType(), clang::CodeGen::CodeGenModule::getModule(), clang::CodeGen::Address::getPointer(), clang::CodeGen::CodeGenTypeCache::getPointerAlign(), clang::CodeGen::CodeGenModule::getTypes(), clang::InternalLinkage, clang::ImplicitParamDecl::Other, clang::CodeGen::CodeGenModule::SetInternalFunctionAttributes(), clang::CodeGen::CodeGenFunction::StartFunction(), and clang::CodeGen::CodeGenTypeCache::VoidPtrTy.

◆ getExecutionModeForDirective()

static CGOpenMPRuntimeNVPTX::ExecutionMode getExecutionModeForDirective ( CodeGenModule & CGM,
const OMPExecutableDirective & D 
)
static

◆ getMasterThreadID()

static llvm::Value* getMasterThreadID ( CodeGenFunction & CGF)
static

Get the thread id of the OMP master thread.

The master thread id is the first thread (lane) of the last warp in the GPU block. Warp size is assumed to be some power of 2. Thread ids are 0-indexed. E.g.: if NumThreads is 33, the master id is 32; if NumThreads is 64, the master id is 32; if NumThreads is 1024, the master id is 992.

Definition at line 240 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenTypes::arrangeNullaryFunction(), clang::CodeGen::CodeGenFunction::Builder, clang::Create(), clang::CodeGen::CodeGenTypes::GetFunctionType(), clang::CodeGen::CodeGenModule::getModule(), getNVPTXNumThreads(), getNVPTXWarpSize(), clang::CodeGen::CodeGenModule::getTypes(), clang::InternalLinkage, and clang::CodeGen::CodeGenModule::SetInternalFunctionAttributes().

◆ getNVPTXBarrier()

static void getNVPTXBarrier ( CodeGenFunction & CGF,
int  ID,
llvm::Value *  NumThreads 
)
static

Get barrier #ID to synchronize selected (multiple of warp size) threads in a CTA.

Definition at line 202 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::Builder, clang::CodeGen::CodeGenFunction::CGM, clang::CodeGen::CodeGenFunction::EmitRuntimeCall(), and clang::CodeGen::CodeGenModule::getModule().

Referenced by syncParallelThreads().

◆ getNVPTXCTABarrier()

static void getNVPTXCTABarrier ( CodeGenFunction & CGF)
static

Get barrier to synchronize all threads in a block.

Definition at line 195 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::CGM, clang::CodeGen::CodeGenFunction::EmitRuntimeCall(), and clang::CodeGen::CodeGenModule::getModule().

Referenced by syncCTAThreads().

◆ getNVPTXLaneID()

static llvm::Value* getNVPTXLaneID ( CodeGenFunction & CGF)
static

Get the id of the current lane in the Warp.

We assume that the warp size is 32, which is always the case on the NVPTX device, to generate more efficient code.

Definition at line 180 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::Builder, and getNVPTXThreadID().

Referenced by emitInterWarpCopyFunction().

◆ getNVPTXNumThreads()

static llvm::Value* getNVPTXNumThreads ( CodeGenFunction & CGF)
static

Get the maximum number of threads in a block of the GPU.

Definition at line 187 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::CGM, clang::CodeGen::CodeGenFunction::EmitRuntimeCall(), and clang::CodeGen::CodeGenModule::getModule().

Referenced by getMasterThreadID(), and getThreadLimit().

◆ getNVPTXThreadID()

static llvm::Value* getNVPTXThreadID ( CodeGenFunction & CGF)
static

◆ getNVPTXWarpID()

static llvm::Value* getNVPTXWarpID ( CodeGenFunction & CGF)
static

Get the id of the warp in the block.

We assume that the warp size is 32, which is always the case on the NVPTX device, to generate more efficient code.

Definition at line 172 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::Builder, and getNVPTXThreadID().

Referenced by emitInterWarpCopyFunction().

◆ getNVPTXWarpSize()

static llvm::Value* getNVPTXWarpSize ( CodeGenFunction & CGF)
static

◆ getThreadLimit()

static llvm::Value* getThreadLimit ( CodeGenFunction & CGF,
bool  IsInSpmdExecutionMode = false 
)
static

Get the value of the thread_limit clause in the teams directive.

For the 'generic' execution mode, the runtime encodes thread_limit in the launch parameters, always starting thread_limit+warpSize threads per CTA. The threads in the last warp are reserved for master execution. For the 'spmd' execution mode, all threads in a CTA are part of the team.

Definition at line 224 of file CGOpenMPRuntimeNVPTX.cpp.

References clang::CodeGen::CodeGenFunction::Builder, getNVPTXNumThreads(), and getNVPTXWarpSize().

◆ setPropertyExecutionMode()

static void setPropertyExecutionMode ( CodeGenModule & CGM,
StringRef  Name,
CGOpenMPRuntimeNVPTX::ExecutionMode  Mode 
)
static

◆ syncCTAThreads()

static void syncCTAThreads ( CodeGenFunction & CGF)
static

Synchronize all GPU threads in a block.

Definition at line 212 of file CGOpenMPRuntimeNVPTX.cpp.

References getNVPTXCTABarrier().

Referenced by clang::CodeGen::CGOpenMPRuntimeNVPTX::emitParallelCall().

◆ syncParallelThreads()

static void syncParallelThreads ( CodeGenFunction & CGF,
llvm::Value *  NumThreads 
)
static

Synchronize worker threads in a parallel region.

Definition at line 215 of file CGOpenMPRuntimeNVPTX.cpp.

References getNVPTXBarrier().

Referenced by emitInterWarpCopyFunction().