Collective Operations
Important
Intel® SHMEM does not yet support teams-based collectives. All collectives must operate on the world team.
Important
All collective operations must complete before another kernel calls collective operations.
Important
A collective call must be either host-initiated on all PEs or device-initiated on all PEs. For example, a program that initiates a collective operation from the host on some PEs but from the device on other PEs has undefined behavior.
ISHMEM_BARRIER_ALL
Registers the arrival of a PE at a barrier and blocks the PE until all other PEs arrive at the barrier and all local updates and remote memory updates are completed.
-
void ishmem_barrier_all()
Callable from the host and device.
Description:
The ishmem_barrier_all routine is a mechanism for synchronizing all PEs in the world team at once.
This routine blocks the calling PE until all PEs have called ishmem_barrier_all.
Prior to synchronizing with other PEs, ishmem_barrier_all ensures completion of all previously issued memory stores, and of all local and remote memory updates issued via ishmem AMO and RMA routine calls such as ishmem_int_add, ishmem_put_nbi, and ishmem_get_nbi.
ISHMEMX_BARRIER_ALL_WORK_GROUP
Registers the arrival of a PE at a barrier and blocks the PE until all other PEs arrive at the barrier and all local updates and remote memory updates are completed.
-
template<typename Group>
void ishmemx_barrier_all_work_group(const Group &group)
- Parameters:
group – The SYCL group or sub_group on which to collectively perform the barrier operation.
Callable from the device.
Description:
The ishmemx_barrier_all_work_group routine is a mechanism for synchronizing all PEs.
Unlike ishmem_barrier_all, ishmemx_barrier_all_work_group allows for the device threads within group to cooperate towards the barrier operation.
This may be more performant; for example, when ishmem_barrier_all requires all device threads in the kernel to invoke RMA operations.
This routine blocks the calling PE until all PEs have called ishmemx_barrier_all_work_group.
All threads in group must call the routine with identical arguments.
ISHMEM_SYNC_ALL
Registers the arrival of a PE at a synchronization point and suspends execution until all other PEs arrive at the synchronization point.
-
void ishmem_sync_all()
Callable from the host and the device.
Description:
This routine blocks the calling PE until all PEs have called ishmem_sync_all.
In contrast with the ishmem_barrier_all routines, ishmem_sync_all only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
ISHMEMX_SYNC_ALL_WORK_GROUP
Registers the arrival of a PE at a synchronization point and suspends execution until all other PEs arrive at the synchronization point.
-
template<typename Group>
void ishmemx_sync_all_work_group(const Group &group)
- Parameters:
group – The SYCL group or sub_group on which to collectively perform the sync operation.
Callable from the device.
Description:
This routine blocks the calling PE until all PEs have called ishmemx_sync_all_work_group.
In contrast with the ishmem_sync_all routine, ishmemx_sync_all_work_group allows for the device threads within group to cooperate towards the sync operation.
This may be more performant; for example, when ishmem_sync_all requires all device threads in the kernel to invoke RMA operations.
ishmemx_sync_all_work_group only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
All threads in group must call the routine with identical arguments.
ISHMEM_ALLTOALL
Exchanges a fixed amount of contiguous data blocks between all pairs of PEs participating in the collective routine.
-
int ishmem_TYPENAME_alltoall(TYPE *dest, const TYPE *source, size_t nelems)
-
int ishmem_alltoallmem(void *dest, const void *source, size_t nelems)
- Parameters:
dest – Symmetric address of a data object large enough to receive the combined total of nelems elements from each PE. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of a data object that contains nelems elements of data for each PE, ordered according to destination PE. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements to exchange for each PE. For ishmem_alltoallmem, elements are bytes.
- Returns:
zero on successful local completion; otherwise, nonzero.
Callable from the host and device.
Description:
The ishmem_alltoall routines are collective routines. Each PE participating in the operation exchanges nelems data elements with all other PEs participating in the operation.
The size of a data element is 8 bits for ishmem_alltoallmem.
The data being sent and received are stored in a contiguous symmetric data object. The total size of each PE’s source object and dest object is nelems times the size of an element times N, where N equals the number of PEs participating in the operation. The source object contains N blocks of data (where the size of each block is defined by nelems) and each block of data is sent to a different PE.
The same dest and source arrays, and same value for nelems must be passed by all PEs that participate in the collective.
Given a PE i that is the ith PE participating in the operation and a PE j that is the jth PE participating in the operation, PE i sends the jth block of its source object to the ith block of the dest object of PE j.
All PEs must participate in the collective.
Before any PE calls an ishmem_alltoall routine, the following conditions must be ensured:
The dest data object on all PEs is ready to accept the ishmem_alltoall data.
The source data object on all PEs is ready to send.
Otherwise, the behavior is undefined.
Upon return from an ishmem_alltoall routine, the following are true for the local PE:
Its dest symmetric data object is completely updated.
The data has been copied out of the source data object.
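The block exchange described above can be modeled on a single host without Intel® SHMEM. The sketch below only illustrates the indexing: model_alltoall and the Buffers alias are hypothetical names, and one std::vector per PE stands in for each symmetric data object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of the all-to-all data movement; it does not call Intel SHMEM.
// source[i] and dest[i] stand in for the symmetric buffers owned by PE i.
using Buffers = std::vector<std::vector<int>>;

// Returns the dest buffers that result when every modeled PE contributes its source.
Buffers model_alltoall(const Buffers &source, std::size_t nelems) {
    const std::size_t npes = source.size();
    // Each dest buffer holds nelems elements from each of the npes PEs.
    Buffers dest(npes, std::vector<int>(npes * nelems, 0));
    for (std::size_t i = 0; i < npes; ++i) {
        // Each source buffer must hold one block of nelems elements per PE.
        assert(source[i].size() == npes * nelems);
        for (std::size_t j = 0; j < npes; ++j) {
            // PE i sends block j of its source to block i of PE j's dest.
            for (std::size_t k = 0; k < nelems; ++k) {
                dest[j][i * nelems + k] = source[i][j * nelems + k];
            }
        }
    }
    return dest;
}
```

With two PEs and nelems equal to 2, source buffers {0, 1, 2, 3} on PE 0 and {10, 11, 12, 13} on PE 1 yield dest buffers {0, 1, 10, 11} on PE 0 and {2, 3, 12, 13} on PE 1.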
ISHMEMX_ALLTOALL_WORK_GROUP
Exchanges a fixed amount of contiguous data blocks between all pairs of PEs participating in the collective routine.
In the functions below, TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types.
-
template<typename Group>
int ishmemx_TYPENAME_alltoall_work_group(TYPE *dest, const TYPE *source, size_t nelems, const Group &group)
-
template<typename Group>
int ishmemx_alltoallmem_work_group(void *dest, const void *source, size_t nelems, const Group &group)
- Parameters:
dest – Symmetric address of a data object large enough to receive the combined total of nelems elements from each PE. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of a data object that contains nelems elements of data for each PE, ordered according to destination PE. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements to exchange for each PE. For ishmemx_alltoallmem_work_group, elements are bytes.
group – The SYCL group or sub_group on which to collectively perform the all-to-all operation.
- Returns:
zero on successful local completion; otherwise, nonzero.
Callable from the device.
Description:
The ishmemx_alltoall_work_group routines have semantics and requirements similar to those of the ishmem_alltoall routines.
In contrast with the ishmem_alltoall routines, ishmemx_alltoall_work_group allows for the device threads within group to cooperate towards the all-to-all operation.
This may be more performant; for example, when ishmem_alltoall requires all device threads in the kernel to invoke RMA operations.
This routine blocks the calling PE until all PEs have called ishmemx_alltoall_work_group.
ishmemx_alltoall_work_group only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
All threads in group must call the routine with identical arguments.
ISHMEM_BROADCAST
Broadcasts a block of data from one PE to one or more destination PEs.
Below, TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types.
-
int ishmem_TYPENAME_broadcast(TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
-
int ishmem_broadcastmem(void *dest, const void *source, size_t nelems, int PE_root)
- Parameters:
dest – Symmetric address of the destination data object. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of the source data object. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements in the source and dest arrays. For ishmem_broadcastmem, elements are bytes.
PE_root – The PE from which the data is copied.
- Returns:
zero on successful local completion; otherwise, nonzero.
Callable from the host and device.
Description:
The broadcast routines are collective routines across all PEs. They copy the source data object on the PE specified by PE_root to the dest data object on the PEs participating in the collective operation. The same dest and source data objects and the same value of PE_root must be passed by all PEs participating in the collective operation.
For broadcasts:
The dest object is updated on all PEs.
All PEs must participate in the operation.
The value of the PE_root argument must be the same on all PEs.
The value of PE_root must be between 0 and PE_size - 1.
Before any PE calls a broadcast routine, the following conditions must be ensured:
The dest array on all PEs participating in the broadcast is ready to accept the broadcast data.
Otherwise, the behavior is undefined.
Upon return from a broadcast routine, the following are true for the local PE:
The dest data object is updated.
The source data object may be safely reused.
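The copy performed by a broadcast can be sketched as a host-only model, assuming one std::vector per PE in place of the symmetric objects. model_broadcast and Buffers are hypothetical names that are not part of Intel® SHMEM; the sketch illustrates only the data movement and the PE_root range rule, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of a broadcast; it does not call Intel SHMEM.
// source[p] and dest[p] stand in for the symmetric objects on PE p.
using Buffers = std::vector<std::vector<int>>;

// Copies nelems elements of the root's source into the dest of every modeled PE.
void model_broadcast(Buffers &dest, const Buffers &source, std::size_t nelems, int pe_root) {
    const int npes = static_cast<int>(source.size());
    // PE_root must name an existing PE.
    assert(pe_root >= 0 && pe_root < npes);
    assert(dest.size() == source.size());
    const std::vector<int> &root_src = source[static_cast<std::size_t>(pe_root)];
    assert(root_src.size() >= nelems);
    for (std::vector<int> &d : dest) {
        assert(d.size() >= nelems);
        for (std::size_t k = 0; k < nelems; ++k) {
            d[k] = root_src[k];
        }
    }
}
```

Elements of dest beyond the first nelems are left unchanged, which mirrors the fact that the routine transfers exactly nelems elements.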
ISHMEMX_BROADCAST_WORK_GROUP
Broadcasts a block of data from one PE to one or more destination PEs.
Below, TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types.
-
template<typename Group>
int ishmemx_TYPENAME_broadcast_work_group(TYPE *dest, const TYPE *source, size_t nelems, int PE_root, const Group &group)
-
template<typename Group>
int ishmemx_broadcastmem_work_group(void *dest, const void *source, size_t nelems, int PE_root, const Group &group)
- Parameters:
dest – Symmetric address of the destination data object. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of the source data object. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements in the source and dest arrays. For ishmemx_broadcastmem_work_group, elements are bytes.
PE_root – The PE from which the data is copied.
group – The SYCL group or sub_group on which to collectively perform the broadcast operation.
- Returns:
zero on successful local completion; otherwise, nonzero.
Callable from the device.
Description:
The ishmemx_broadcast_work_group and ishmemx_broadcastmem_work_group routines have semantics and requirements similar to those of the ishmem_broadcast routines.
In contrast with the ishmem_broadcast routines, ishmemx_broadcast_work_group and ishmemx_broadcastmem_work_group allow for the device threads within group to cooperate towards the broadcast operation.
This routine blocks the calling PE until all PEs have called ishmemx_broadcast_work_group.
ishmemx_broadcast_work_group only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
All threads in group must call the routine with identical arguments.
ISHMEM_COLLECT, ISHMEM_FCOLLECT
Concatenates blocks of data from multiple PEs to an array in every PE participating in the collective routine.
In the functions below, TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types.
-
int ishmem_TYPENAME_collect(TYPE *dest, const TYPE *source, size_t nelems)
-
int ishmem_TYPENAME_fcollect(TYPE *dest, const TYPE *source, size_t nelems)
-
int ishmem_collectmem(void *dest, const void *source, size_t nelems)
-
int ishmem_fcollectmem(void *dest, const void *source, size_t nelems)
- Parameters:
dest – Symmetric address of an array large enough to accept the concatenation of the source arrays on all participating PEs. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of the source data object. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements in the source array. For ishmem_[f]collectmem, elements are bytes.
- Returns:
Zero on successful local completion. Nonzero otherwise.
Callable from the host and device.
Description:
The ishmem_collect and ishmem_fcollect routines perform a collective operation to concatenate nelems data items from the source array into the dest array, over all PEs in processor number order.
The collected result is written to the dest array for all PEs. The same dest and source arrays must be passed by all PEs that participate in the operation.
The ishmem_fcollect routines require that nelems be the same value in all participating PEs, while the ishmem_collect routines allow nelems to vary from PE to PE.
Upon return from a collective routine, the following are true for the local PE:
The dest array is updated and the source array may be safely reused.
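The difference between the two routines can be illustrated with a host-only model, assuming one std::vector per PE in place of each symmetric source array. model_collect and model_fcollect are hypothetical names used only for this sketch; they show the concatenation order and the nelems rule, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of collect; it does not call Intel SHMEM.
// source[p] stands in for the source array on PE p, whose length is that PE's nelems.
std::vector<int> model_collect(const std::vector<std::vector<int>> &source) {
    std::vector<int> dest;
    // Contributions are appended in PE number order, so the offset of PE p's
    // data is the sum of nelems over all lower-numbered PEs.
    for (const std::vector<int> &s : source) {
        dest.insert(dest.end(), s.begin(), s.end());
    }
    return dest;
}

// fcollect adds the requirement that every PE contributes the same nelems.
std::vector<int> model_fcollect(const std::vector<std::vector<int>> &source, std::size_t nelems) {
    for (const std::vector<int> &s : source) {
        assert(s.size() == nelems);
    }
    return model_collect(source);
}
```

Every PE receives the same concatenated result, so the dest array on each PE must be large enough for the sum of all contributions.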
ISHMEMX_[F]COLLECT_WORK_GROUP
Concatenates blocks of data from multiple PEs to an array in every PE participating in the collective routine.
In the functions below, TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types.
-
template<typename Group>
int ishmemx_TYPENAME_collect_work_group(TYPE *dest, const TYPE *source, size_t nelems, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_fcollect_work_group(TYPE *dest, const TYPE *source, size_t nelems, const Group &group)
-
template<typename Group>
int ishmemx_collectmem_work_group(void *dest, const void *source, size_t nelems, const Group &group)
-
template<typename Group>
int ishmemx_fcollectmem_work_group(void *dest, const void *source, size_t nelems, const Group &group)
- Parameters:
dest – Symmetric address of an array large enough to accept the concatenation of the source arrays on all participating PEs. The type of dest should match the TYPE and TYPENAME according to the table of Standard RMA types.
source – Symmetric address of the source data object. The type of source should match the TYPE and TYPENAME according to the table of Standard RMA types.
nelems – The number of elements in the source array. For ishmemx_[f]collectmem_work_group, elements are bytes.
group – The SYCL group or sub_group on which to collectively perform the collect operation.
- Returns:
Zero on successful local completion. Nonzero otherwise.
Callable from the device.
Description:
The ishmemx_[f]collect_work_group routines have semantics and requirements similar to those of the ishmem_[f]collect routines.
In contrast with the ishmem_[f]collect routines, ishmemx_[f]collect_work_group allows for the device threads within group to cooperate towards the collect operation.
This may be more performant; for example, when ishmem_collect requires all device threads in the kernel to invoke RMA operations.
This routine blocks the calling PE until all PEs have called ishmemx_[f]collect_work_group.
ishmemx_[f]collect_work_group only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
All threads in group must call the routine with identical arguments.
ISHMEM_REDUCE
Reduction Types, Names, and Supporting Operations:
TYPE | TYPENAME | Operations Supporting TYPE |
---|---|---|
char | char | MAX, MIN, SUM, PROD |
signed char | schar | MAX, MIN, SUM, PROD |
short | short | MAX, MIN, SUM, PROD |
int | int | MAX, MIN, SUM, PROD |
long | long | MAX, MIN, SUM, PROD |
long long | longlong | MAX, MIN, SUM, PROD |
ptrdiff_t | ptrdiff | MAX, MIN, SUM, PROD |
unsigned char | uchar | AND, OR, XOR, MAX, MIN, SUM, PROD |
unsigned short | ushort | AND, OR, XOR, MAX, MIN, SUM, PROD |
unsigned int | uint | AND, OR, XOR, MAX, MIN, SUM, PROD |
unsigned long | ulong | AND, OR, XOR, MAX, MIN, SUM, PROD |
unsigned long long | ulonglong | AND, OR, XOR, MAX, MIN, SUM, PROD |
int8_t | int8 | AND, OR, XOR, MAX, MIN, SUM, PROD |
int16_t | int16 | AND, OR, XOR, MAX, MIN, SUM, PROD |
int32_t | int32 | AND, OR, XOR, MAX, MIN, SUM, PROD |
int64_t | int64 | AND, OR, XOR, MAX, MIN, SUM, PROD |
uint8_t | uint8 | AND, OR, XOR, MAX, MIN, SUM, PROD |
uint16_t | uint16 | AND, OR, XOR, MAX, MIN, SUM, PROD |
uint32_t | uint32 | AND, OR, XOR, MAX, MIN, SUM, PROD |
uint64_t | uint64 | AND, OR, XOR, MAX, MIN, SUM, PROD |
size_t | size | AND, OR, XOR, MAX, MIN, SUM, PROD |
float | float | MAX, MIN, SUM, PROD |
double | double | MAX, MIN, SUM, PROD |
The following functions perform reduction operations across all PEs.
In the functions below, TYPE is one of the reduction types and has a corresponding TYPENAME specified by Table Reduction Types, Names, and Supporting Operations.
-
int ishmem_TYPENAME_and_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_or_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_xor_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_max_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_min_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_sum_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
-
int ishmem_TYPENAME_prod_reduce(TYPE *dest, const TYPE *source, size_t nreduce)
- Parameters:
dest – Symmetric address of an array, of length nreduce elements, to receive the result of the reduction routines. The type of dest should match the TYPE and TYPENAME according to the table of Reduction Types.
source – Symmetric address of an array, of length nreduce elements, that contains one element for each separate reduction routine. The type of source should match the TYPE and TYPENAME according to the table of Reduction Types.
nreduce – The number of elements in the dest and source arrays. nreduce must be of type size_t and have the same value across all PEs.
- Returns:
Zero on successful local completion. Nonzero otherwise.
Callable from the host and device.
Description:
ishmem reduction routines are collective routines over all PEs that compute one or more reductions across symmetric arrays.
A reduction performs an associative binary routine across a set of values.
The nreduce argument determines the number of separate reductions to perform. The source array on all PEs provides one element for each reduction. The results of the reductions are placed in the dest array on all PEs.
The source and dest arguments must either be the same symmetric address, or two different symmetric addresses corresponding to buffers that do not overlap in memory. That is, they must be completely overlapping or completely disjoint.
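The overlap rule above can be checked with a small helper before a reduction is issued. reduce_buffers_ok is a hypothetical name introduced for this sketch and is not part of Intel® SHMEM; it assumes that comparing the integer values of the two addresses is meaningful on the target platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when dest and source are either the same address (an in-place
// reduction) or describe two ranges of nreduce elements that do not overlap.
template <typename T>
bool reduce_buffers_ok(const T *dest, const T *source, std::size_t nreduce) {
    if (dest == source) {
        return true;  // completely overlapping
    }
    const auto d = reinterpret_cast<std::uintptr_t>(dest);
    const auto s = reinterpret_cast<std::uintptr_t>(source);
    const std::uintptr_t bytes = nreduce * sizeof(T);
    // Completely disjoint: one range ends at or before the start of the other.
    return d + bytes <= s || s + bytes <= d;
}

// Example guard that a caller might place in front of a reduction call.
template <typename T>
void require_valid_reduce_buffers(const T *dest, const T *source, std::size_t nreduce) {
    assert(reduce_buffers_ok(dest, source, nreduce));
}
```

For an array buf of eight int values and nreduce equal to 4, the pairs (buf, buf) and (buf, buf + 4) pass the check, while (buf, buf + 2) fails because the two ranges partially overlap.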
Before any PE calls a reduction routine, the following conditions must be ensured:
The dest array on all PEs participating in the reduction is ready to accept the results of the reduction.
Otherwise, the behavior is undefined.
Upon return from a reduction routine, the following are true for the local PE:
The dest array is updated and the source array may be safely reused.
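The role of nreduce can be sketched with a host-only model, assuming one std::vector per PE in place of each symmetric source array. model_reduce is a hypothetical name used only for illustration; the operator is passed as a callable, whereas the library selects it through the routine name (for example sum or max).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of a reduction; it does not call Intel SHMEM.
// source[p] stands in for the symmetric source array on PE p.
template <typename T, typename Op>
std::vector<T> model_reduce(const std::vector<std::vector<T>> &source,
                            std::size_t nreduce, Op op) {
    assert(!source.empty());
    for (const std::vector<T> &s : source) {
        assert(s.size() >= nreduce);
    }
    // Start from PE 0's contribution, then fold in the remaining PEs.
    std::vector<T> result(source[0].begin(),
                          source[0].begin() + static_cast<std::ptrdiff_t>(nreduce));
    for (std::size_t p = 1; p < source.size(); ++p) {
        for (std::size_t k = 0; k < nreduce; ++k) {
            // Element k is an independent reduction across all PEs.
            result[k] = op(result[k], source[p][k]);
        }
    }
    return result;
}
```

With three PEs holding {1, 2}, {3, 4}, and {5, 6} and nreduce equal to 2, a sum reduction gives {9, 12} and a max reduction gives {5, 6}; every PE receives the same result in its dest array.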
ISHMEMX_REDUCE_WORK_GROUP
The following functions perform reduction operations across all PEs.
In the functions below, TYPE is one of the reduction types and has a corresponding TYPENAME specified by Table Reduction Types, Names, and Supporting Operations.
-
template<typename Group>
int ishmemx_TYPENAME_and_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_or_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_xor_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_max_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_min_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_sum_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
-
template<typename Group>
int ishmemx_TYPENAME_prod_reduce_work_group(TYPE *dest, const TYPE *source, size_t nreduce, const Group &group)
- Parameters:
dest – Symmetric address of an array, of length nreduce elements, to receive the result of the reduction routines. The type of dest should match the TYPE and TYPENAME according to the table of Reduction Types.
source – Symmetric address of an array, of length nreduce elements, that contains one element for each separate reduction routine. The type of source should match the TYPE and TYPENAME according to the table of Reduction Types.
nreduce – The number of elements in the dest and source arrays. nreduce must be of type size_t and have the same value across all PEs.
group – The SYCL group or sub_group on which to collectively perform the reduction operation.
- Returns:
Zero on successful local completion. Nonzero otherwise.
Callable from the device.
Description:
The ishmemx_reduce_work_group routines have semantics and requirements similar to those of the ishmem_reduce routines.
In contrast with the ishmem_reduce routines, ishmemx_reduce_work_group allows for the device threads within group to cooperate towards the reduction operation.
This may be more performant; for example, when ishmem_reduce requires all device threads in the kernel to invoke RMA operations.
This routine blocks the calling PE until all PEs have called ishmemx_reduce_work_group.
ishmemx_reduce_work_group only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ishmem routines.
All threads in group must call the routine with identical arguments.
Important
For the reduction operations sum and prod, the order of reduction may not be the same across all PEs, so the results for floating-point datatypes may differ slightly. This is because floating-point addition and multiplication are not associative operations.
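The effect of reduction order on floating-point results can be reproduced on the host without Intel® SHMEM. The sketch below is an illustration only: sum_in_order and check_order_dependence are hypothetical names, and the values are chosen to exaggerate a difference that is usually much smaller in practice.

```cpp
#include <cassert>
#include <vector>

// Sums the values from left to right, in exactly the order given.
float sum_in_order(const std::vector<float> &values) {
    float total = 0.0f;
    for (float v : values) {
        total += v;
    }
    return total;
}

// Shows that two orderings of the same three values produce different sums.
void check_order_dependence() {
    // The large values cancel first, so the final 1.0f is preserved.
    const float a = sum_in_order({1e20f, -1e20f, 1.0f});
    // 1.0f is absorbed when added to 1e20f, so the sum collapses to 0.0f.
    const float b = sum_in_order({1.0f, 1e20f, -1e20f});
    assert(a == 1.0f);
    assert(b == 0.0f);
}
```

If exact agreement across PEs matters, one option is to reduce in an integer or fixed-point representation, where addition is associative.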