HELP
What is this?
The Intel GPU Occupancy tool computes the theoretical GPU Occupancy for Intel GPU devices based on GPU Kernel code parameters. The Occupancy value indicates how efficiently the kernel utilizes the available GPU hardware resources. The tool also generates graphs that help optimize GPU Occupancy.
How to use?
- Input GPU Device type
- Input GPU Kernel code parameters
- Analyze GPU Occupancy output values and graphs
Detailed Instructions
- Input GPU Device type
- From the dropdown, select the GPU family and change "XVE Count" for your GPU (XVE Count: the number of Xe Vector Engines in the GPU). To find the XVE Count for your GPU, check the GPU hardware specification, or run "clinfo" in a terminal and look for the "Max compute units" value.
- Alternatively, you can select the "PCI ID of GPU Device" option to find your GPU.
- Run "lspci | grep Display" in a terminal to find your GPU devices and their PCI IDs.
- Find your GPU name and the corresponding PCI ID in this Intel GPU Hardware List.
- Note: If your GPU is not found using the PCI ID, please report a bug and use the other method instead, selecting your GPU family.
- Input GPU Kernel code parameters
Once the GPU is selected, the tool loads default values for the GPU Kernel code's Global size, Work-Group size, Sub-Group size and Local Memory size, then computes the GPU Occupancy values and generates graphs.
Change the values of input parameters based on your GPU Kernel code to recalculate the Occupancy:
- Global size: Workload size of the GPU Kernel, i.e. the total number of work-items executed by the GPU Kernel code. Change this to your Kernel workload size. The value can be 1-, 2- or 3-dimensional. Examples: 4096 | 1024,1024 | 256,256,256
- Work-Group size: Group of work-items that executes within an Xe-Core; set in GPU Kernel code using sycl::nd_range. Setting this value too low results in dispatching too many Work-Groups to the GPU, which causes dispatch and synchronization overhead and may affect performance negatively.
- Sub-Group size: The compiler automatically sets the SIMD/sub-group size, unless it is explicitly set in GPU Kernel code using [[intel::reqd_sub_group_size(SIMD_SIZE)]]. [More Details]
- Local Memory size: Total local memory allocated by GPU Kernel code using sycl::local_accessor. Local Memory is shared by all Work-Groups dispatched to an Xe-Core. Allocating too much local memory may limit the number of Work-Groups that can execute concurrently on an Xe-Core. This may reduce Xe-Core Occupancy; however, performance may or may not be negatively affected.
- Barrier Usage: Select "YES" if the kernel code uses sycl::group_barrier, either explicitly or implicitly within a group algorithm or library function. The GPU has a fixed number of barrier registers per Xe-Core, so the number of Work-Groups that can execute concurrently on an Xe-Core may be limited if the GPU Kernel code uses barriers.
- Analyze GPU Occupancy output values and graphs
Focus on optimizing the Xe-Core Occupancy value by changing the input parameters of the GPU Kernel code. The Peak and Average GPU Occupancy values simply scale with the number of Xe-Cores in the GPU and the global size of the workload.
- Xe-Core Occupancy: Theoretical Occupancy of a GPU's Xe-Core, determined by the usage of the available thread contexts in an Xe-Core. A Work-Group is dispatched to an Xe-Core on the GPU; multiple Work-Groups can be dispatched to the same Xe-Core if there are sufficient resources. The parameters affecting Xe-Core Occupancy are Work-Group size, SIMD sub-group size and local memory usage. [More Details]
- GPU Occupancy: Theoretical Occupancy of the GPU hardware when executing the entire GPU Kernel workload on all Xe-Cores of the GPU. This depends on the Xe-Core Occupancy, the number of Xe-Cores in the GPU and the workload size. Depending on the GPU Kernel workload size, the GPU may require multiple waves of execution. Both a Peak GPU Occupancy value and an Average GPU Occupancy value are calculated.
- Graph - Impact of varying Work-Group: This graph shows how the Xe-Core Occupancy varies for different Work-Group sizes with all other parameters constant. This information can be used to set an appropriate Work-Group size in GPU Kernel code to maximize Occupancy.
- Graph - Impact of varying Local Memory: This graph shows how the Xe-Core Occupancy varies for different Local Memory sizes with all other parameters constant. This information can be used to allocate an appropriate amount of Local Memory in GPU Kernel code to maximize Occupancy.
- Graph - GPU Occupancy: This graph shows how the GPU Kernel executes the entire workload. Execution may happen in multiple waves depending on the number of Work-Groups and the available GPU resources. The graph shows the Peak GPU Occupancy value and the Average GPU Occupancy value.
Acronyms/Definitions
GPU | Graphics Processing Unit
HW | Hardware
Xe | Intel GPU architecture name
XVE | Xe Vector Engine (formerly known as EU, or Execution Unit); the smallest thread-level building block of the GPU, consisting of Arithmetic Logic Units
SLM | Shared Local Memory
Xe-Core | Compute building block of the GPU, consisting of XVEs, SLM and other HW resources
SYCL | Open-source, cross-platform programming model
WG | Work-Group
SG | SIMD/Sub-Group
Disclaimer
- The GPU Occupancy Calculator gives a theoretical estimate of GPU Occupancy; actual occupancy on the hardware may differ slightly due to factors such as kernel scheduling variations, cache hits/misses and other hardware conditions.
- Note that higher occupancy does not always translate to higher performance (for example in some GPU kernels, higher local memory usage may reduce occupancy but may give better performance).
- Occupancy reported by tools like Intel VTune Profiler and Intel Advisor Roofline is actual Occupancy and may differ from the theoretical value computed by the GPU Occupancy Calculator.