Implement global atomics and a memory barrier in the standard library.

This checkin provides the standard set of atomic operations and a memory barrier in the ispc standard library.  Both signed and unsigned 32- and 64-bit integer types are supported.
Matt Pharr
2011-07-04 17:20:42 +01:00
parent 24f47b300d
commit 5bcc611409
13 changed files with 364 additions and 9 deletions


@@ -121,7 +121,7 @@ objs/stdlib-c.o: objs/stdlib-c.cpp
objs/stdlib_ispc.cpp: stdlib.ispc
@echo Creating C++ source from $<
@$(CPP) -DISPC=1 -DPI=3.1415926536 $< | ./stdlib2cpp.py > $@
@$(CLANG) -E -x c -DISPC=1 -DPI=3.1415926536 $< -o - | ./stdlib2cpp.py > $@
objs/stdlib_ispc.o: objs/stdlib_ispc.cpp
@echo Compiling $<


@@ -1,9 +1,9 @@
=== v1.0.3 === (not yet released)
=== v1.0.3 === (4 July 2011)
ispc now has a built-in pre-processor (from LLVM's clang compiler).
(Thanks to Pete Couperus!) It is therefore no longer necessary to use
cl.exe for preprocessing before on Windows; the MSVC proejct files for the
examples have been updated accordingly.
(Thanks to Pete Couperus for this patch!) It is therefore no longer
necessary to use cl.exe for preprocessing on Windows; the MSVC project
files for the examples have been updated accordingly.
There is another variant of the shuffle() function in the standard
library: "<type> shuffle(<type> v0, <type> v1, int permute)", where the
@@ -11,8 +11,15 @@ permutation vector indexes over the concatenation of the two vectors
(e.g. the value 0 corresponds to the first element of v0, the value
2*programCount-1 corresponds to the last element of v1, etc.)
ispc now supports the usual range of atomic operations (add, subtract, min,
max, and, or, and xor) as well as atomic swap and atomic compare and
exchange. There is also a facility for inserting memory fences. See the
"Atomic Operations and Memory Fences" section of the user's guide
(http://ispc.github.com/ispc.html#atomic-operations-and-memory-fences) for
more information.
There are now both 'signed' and 'unsigned' variants of the standard library
functions like packed_load_active() that that references to arrays of
functions like packed_load_active() that take references to arrays of
signed int32s and unsigned int32s respectively. (The
{load_from,store_to}_{int8,int16}() functions have similarly been augmented
to have both 'signed' and 'unsigned' variants.)


@@ -1,6 +1,6 @@
#!/bin/bash
rst2html ispc.txt > ispc.html
rst2html.py ispc.txt > ispc.html
#rst2latex --section-numbering --documentclass=article --documentoptions=DIV=9,10pt,letterpaper ispc.txt > ispc.tex
#pdflatex ispc.tex


@@ -76,6 +76,7 @@ Contents:
+ `Output Functions`_
+ `Cross-Program Instance Operations`_
+ `Packed Load and Store Operations`_
+ `Atomic Operations and Memory Fences`_
+ `Low-Level Bits`_
* `Interoperability with the Application`_
@@ -1811,6 +1812,69 @@ where the ``i`` th element of ``x`` has been replaced with the value ``v``
int insert(int x, uniform int i, uniform int v)
Atomic Operations and Memory Fences
-----------------------------------
The usual range of atomic memory operations is provided in ``ispc``. As an
example, consider the 32-bit integer atomic add routine:
::
int32 atomic_add_global(reference uniform int32 val, int32 delta)
The semantics are the expected ones for an atomic add function: the value
"val" has the value "delta" added to it atomically, and the old value of
"val" is returned from the function. (Thus, if multiple processors
simultaneously issue atomic adds to the same memory location, the adds will
be serialized by the hardware so that the correct result is computed in the
end.)
One thing to note is that the value being added to here is a
``uniform`` integer, while the increment amount and the return value are
``varying``. In other words, the semantics are that each running program
instance individually issues the atomic operation with its own ``delta``
value and gets the previous value of ``val`` back in return.
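For example, the following sketch (purely illustrative; the ``count``, ``out``,
``value``, and ``appendResult`` names are hypothetical and not part of the
standard library) uses an atomic add to have each running program instance
claim its own slot in an output array:
::
uniform int32 count = 0;   // hypothetical shared slot counter
void appendResult(uniform float out[], float value) {
    // Each program instance gets back a distinct old value of "count",
    // so each instance writes to its own element of "out".
    int32 slot = atomic_add_global(count, 1);
    out[slot] = value;
}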
Here are the declarations of the ``int32`` variants of these functions.
There are also ``int64`` equivalents as well as variants that take
``unsigned`` ``int32`` and ``int64`` values.
::
int32 atomic_add_global(reference uniform int32 val, int32 value)
int32 atomic_subtract_global(reference uniform int32 val, int32 value)
int32 atomic_min_global(reference uniform int32 val, int32 value)
int32 atomic_max_global(reference uniform int32 val, int32 value)
int32 atomic_and_global(reference uniform int32 val, int32 value)
int32 atomic_or_global(reference uniform int32 val, int32 value)
int32 atomic_xor_global(reference uniform int32 val, int32 value)
int32 atomic_swap_global(reference uniform int32 val, int32 newval)
There is also an atomic "compare and exchange" function; it atomically
compares the value in "val" to "compare"--if they match, it assigns
"newval" to "val". In either case, the old value of "val" is returned.
(As with the other atomic operations, there are also ``unsigned`` and
64-bit variants of this function.)
::
int32 atomic_compare_exchange_global(reference uniform int32 val,
int32 compare, int32 newval)
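As a purely illustrative sketch (the ``initialized`` flag and the
``doInitialization()`` function are hypothetical), compare and exchange can be
used to guarantee that a piece of one-time setup is performed by exactly one
program instance:
::
uniform int32 initialized = 0;   // hypothetical flag: 0 means "not yet done"
// Only the instance that sees the old value 0 wins the exchange and does
// the setup; every other instance sees 1 and skips it.
if (atomic_compare_exchange_global(initialized, 0, 1) == 0)
    doInitialization();          // hypothetical helper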
``ispc`` also has a standard library routine that inserts a memory barrier
into the code; it ensures that all memory reads and writes issued prior to the
barrier complete before any reads or writes after the barrier are issued.
See the `Linux kernel documentation on memory barriers`_ for an excellent
writeup on the need for and use of memory barriers in multi-threaded
code.
.. _Linux kernel documentation on memory barriers: http://www.kernel.org/doc/Documentation/memory-barriers.txt
::
void memory_barrier();
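As an illustrative sketch of its use (the ``data``, ``index``, ``result``, and
``ready`` variables below are hypothetical), a barrier can ensure that a data
write is visible before a flag announcing its availability is set:
::
data[index] = result;   // write the payload
memory_barrier();       // make sure the write to "data" completes...
ready = 1;              // ...before the "ready" flag becomes visible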
Low-Level Bits
--------------


@@ -59,9 +59,9 @@
<ItemGroup>
<CustomBuild Include="stdlib.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">cl /EP /TP %(Filename).ispc /DISPC=1 /DPI=3.1415926535 | python stdlib2cpp.py &gt; gen-stdlib.cpp</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">clang -E -x c %(Filename).ispc -DISPC=1 -DPI=3.1415926535 | python stdlib2cpp.py &gt; gen-stdlib.cpp</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">gen-stdlib.cpp</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">cl /EP /TP %(Filename).ispc /DISPC=1 /DPI=3.1415926535 | python stdlib2cpp.py &gt; gen-stdlib.cpp</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">clang -E -x c %(Filename).ispc -DISPC=1 -DPI=3.1415926535 | python stdlib2cpp.py &gt; gen-stdlib.cpp</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">gen-stdlib.cpp</Outputs>
<Message Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">Building gen-stdlib.cpp</Message>
<Message Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">Building gen-stdlib.cpp</Message>


@@ -295,6 +295,75 @@ static inline uniform int packed_store_active(uniform int a[], uniform int start
return __packed_store_active(a, start, vals, __mask);
}
///////////////////////////////////////////////////////////////////////////
// Atomics and memory barriers
static inline void memory_barrier() {
__memory_barrier();
}
#define DEFINE_ATOMIC_OP(TA,TB,OPA,OPB) \
static inline TA atomic_##OPA##_global(uniform reference TA ref, TA value) { \
memory_barrier(); \
TA ret = __atomic_##OPB##_##TB##_global(ref, value, __mask); \
memory_barrier(); \
return ret; \
}
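// As a purely illustrative aid, the first instantiation below,
// DEFINE_ATOMIC_OP(int32,int32,add,add), expands to (approximately):
//
//     static inline int32 atomic_add_global(uniform reference int32 ref, int32 value) {
//         memory_barrier();
//         int32 ret = __atomic_add_int32_global(ref, value, __mask);
//         memory_barrier();
//         return ret;
//     }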
DEFINE_ATOMIC_OP(int32,int32,add,add)
DEFINE_ATOMIC_OP(int32,int32,subtract,sub)
DEFINE_ATOMIC_OP(int32,int32,min,min)
DEFINE_ATOMIC_OP(int32,int32,max,max)
DEFINE_ATOMIC_OP(int32,int32,and,and)
DEFINE_ATOMIC_OP(int32,int32,or,or)
DEFINE_ATOMIC_OP(int32,int32,xor,xor)
DEFINE_ATOMIC_OP(int32,int32,swap,swap)
// For everything but atomic min and max, we can use the same
// implementations for unsigned as for signed.
DEFINE_ATOMIC_OP(unsigned int32,int32,add,add)
DEFINE_ATOMIC_OP(unsigned int32,int32,subtract,sub)
DEFINE_ATOMIC_OP(unsigned int32,uint32,min,umin)
DEFINE_ATOMIC_OP(unsigned int32,uint32,max,umax)
DEFINE_ATOMIC_OP(unsigned int32,int32,and,and)
DEFINE_ATOMIC_OP(unsigned int32,int32,or,or)
DEFINE_ATOMIC_OP(unsigned int32,int32,xor,xor)
DEFINE_ATOMIC_OP(unsigned int32,int32,swap,swap)
DEFINE_ATOMIC_OP(int64,int64,add,add)
DEFINE_ATOMIC_OP(int64,int64,subtract,sub)
DEFINE_ATOMIC_OP(int64,int64,min,min)
DEFINE_ATOMIC_OP(int64,int64,max,max)
DEFINE_ATOMIC_OP(int64,int64,and,and)
DEFINE_ATOMIC_OP(int64,int64,or,or)
DEFINE_ATOMIC_OP(int64,int64,xor,xor)
DEFINE_ATOMIC_OP(int64,int64,swap,swap)
// For everything but atomic min and max, we can use the same
// implementations for unsigned as for signed.
DEFINE_ATOMIC_OP(unsigned int64,int64,add,add)
DEFINE_ATOMIC_OP(unsigned int64,int64,subtract,sub)
DEFINE_ATOMIC_OP(unsigned int64,uint64,min,umin)
DEFINE_ATOMIC_OP(unsigned int64,uint64,max,umax)
DEFINE_ATOMIC_OP(unsigned int64,int64,and,and)
DEFINE_ATOMIC_OP(unsigned int64,int64,or,or)
DEFINE_ATOMIC_OP(unsigned int64,int64,xor,xor)
DEFINE_ATOMIC_OP(unsigned int64,int64,swap,swap)
#define ATOMIC_DECL_CMPXCHG(TA, TB) \
static inline TA atomic_compare_exchange_global( \
uniform reference TA ref, TA oldval, TA newval) { \
memory_barrier(); \
TA ret = __atomic_compare_exchange_##TB##_global(ref, oldval, newval, __mask); \
memory_barrier(); \
return ret; \
}
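// Similarly (for illustration only), ATOMIC_DECL_CMPXCHG(int32, int32) below
// expands to (approximately):
//
//     static inline int32 atomic_compare_exchange_global(
//             uniform reference int32 ref, int32 oldval, int32 newval) {
//         memory_barrier();
//         int32 ret = __atomic_compare_exchange_int32_global(ref, oldval, newval, __mask);
//         memory_barrier();
//         return ret;
//     }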
ATOMIC_DECL_CMPXCHG(int32, int32)
ATOMIC_DECL_CMPXCHG(unsigned int32, int32)
ATOMIC_DECL_CMPXCHG(int64, int64)
ATOMIC_DECL_CMPXCHG(unsigned int64, int64)
///////////////////////////////////////////////////////////////////////////
// Load/store from/to 8/16-bit types

stdlib.m4 (131 lines added)

@@ -405,6 +405,95 @@ forloop(i, 1, eval($1-1), `
}
')
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; global_atomic
;; Defines the implementation of a function that handles the mapping from
;; an ispc atomic function to the underlying LLVM intrinsics. Specifically,
;; the function handles looping over the active lanes, calling the underlying
;; scalar atomic intrinsic for each one, and assembling the vector result.
;;
;; Takes four parameters:
;; $1: vector width of the target
;; $2: operation being performed (w.r.t. LLVM atomic intrinsic names)
;; (add, sub...)
;; $3: return type of the LLVM atomic (e.g. i32)
;; $4: return type of the LLVM atomic, in ispc naming parlance (e.g. int32)
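;; For example (an illustrative instantiation), on a 4-wide target
;; global_atomic(4, add, i32, int32) declares the scalar intrinsic
;; @llvm.atomic.load.add.i32.p0i32 and defines @__atomic_add_int32_global,
;; which takes an i32 pointer, a <4 x i32> value vector, and a <4 x i32> mask,
;; and returns the per-lane old values as a <4 x i32>.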
define(`global_atomic', `
declare $3 @llvm.atomic.load.$2.$3.p0$3($3 * %ptr, $3 %delta)
define internal <$1 x $3> @__atomic_$2_$4_global($3 * %ptr, <$1 x $3> %val,
<$1 x i32> %mask) nounwind alwaysinline {
%rptr = alloca <$1 x $3>
%rptr32 = bitcast <$1 x $3> * %rptr to $3 *
per_lane($1, <$1 x i32> %mask, `
%v_LANE_ID = extractelement <$1 x $3> %val, i32 LANE
%r_LANE_ID = call $3 @llvm.atomic.load.$2.$3.p0$3($3 * %ptr, $3 %v_LANE_ID)
%rp_LANE_ID = getelementptr $3 * %rptr32, i32 LANE
store $3 %r_LANE_ID, $3 * %rp_LANE_ID')
%r = load <$1 x $3> * %rptr
ret <$1 x $3> %r
}
')
;; Macro to declare the function that implements the swap atomic.
;; Takes three parameters:
;; $1: vector width of the target
;; $2: llvm type of the vector elements (e.g. i32)
;; $3: ispc type of the elements (e.g. int32)
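;; For example (illustrative instantiation), global_swap(4, i32, int32)
;; defines @__atomic_swap_int32_global for a 4-wide target.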
define(`global_swap', `
declare $2 @llvm.atomic.swap.$2.p0$2($2 * %ptr, $2 %val)
define <$1 x $2> @__atomic_swap_$3_global($2* %ptr, <$1 x $2> %val,
<$1 x i32> %mask) nounwind alwaysinline {
%rptr = alloca <$1 x $2>
%rptr32 = bitcast <$1 x $2> * %rptr to $2 *
per_lane($1, <$1 x i32> %mask, `
%val_LANE_ID = extractelement <$1 x $2> %val, i32 LANE
%r_LANE_ID = call $2 @llvm.atomic.swap.$2.p0$2($2 * %ptr, $2 %val_LANE_ID)
%rp_LANE_ID = getelementptr $2 * %rptr32, i32 LANE
store $2 %r_LANE_ID, $2 * %rp_LANE_ID')
%r = load <$1 x $2> * %rptr
ret <$1 x $2> %r
}
')
;; Similarly, macro to declare the function that implements the compare/exchange
;; atomic. Takes three parameters:
;; $1: vector width of the target
;; $2: llvm type of the vector elements (e.g. i32)
;; $3: ispc type of the elements (e.g. int32)
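;; For example (illustrative instantiation), global_atomic_exchange(4, i32, int32)
;; defines @__atomic_compare_exchange_int32_global for a 4-wide target.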
define(`global_atomic_exchange', `
declare $2 @llvm.atomic.cmp.swap.$2.p0$2($2 * %ptr, $2 %cmp, $2 %val)
define <$1 x $2> @__atomic_compare_exchange_$3_global($2* %ptr, <$1 x $2> %cmp,
<$1 x $2> %val, <$1 x i32> %mask) nounwind alwaysinline {
%rptr = alloca <$1 x $2>
%rptr32 = bitcast <$1 x $2> * %rptr to $2 *
per_lane($1, <$1 x i32> %mask, `
%cmp_LANE_ID = extractelement <$1 x $2> %cmp, i32 LANE
%val_LANE_ID = extractelement <$1 x $2> %val, i32 LANE
%r_LANE_ID = call $2 @llvm.atomic.cmp.swap.$2.p0$2($2 * %ptr, $2 %cmp_LANE_ID,
$2 %val_LANE_ID)
%rp_LANE_ID = getelementptr $2 * %rptr32, i32 LANE
store $2 %r_LANE_ID, $2 * %rp_LANE_ID')
%r = load <$1 x $2> * %rptr
ret <$1 x $2> %r
}
')
define(`stdlib_core', `
@@ -543,6 +632,48 @@ define internal float @__stdlib_pow(float, float) nounwind readnone alwaysinline
%r = call float @powf(float %0, float %1)
ret float %r
}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; atomics and memory barriers
declare void @llvm.memory.barrier(i1 %loadload, i1 %loadstore, i1 %storeload,
i1 %storestore, i1 %device)
define internal void @__memory_barrier() nounwind readnone alwaysinline {
;; see http://llvm.org/bugs/show_bug.cgi?id=2829. It seems like we
;; only get an MFENCE on x86 if "device" is true, but IMHO we should
;; also get one in the case where the first 4 args are true but it is false.
;; So we just always set "device" to true...
call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 true)
ret void
}
global_atomic($1, add, i32, int32)
global_atomic($1, sub, i32, int32)
global_atomic($1, and, i32, int32)
global_atomic($1, or, i32, int32)
global_atomic($1, xor, i32, int32)
global_atomic($1, min, i32, int32)
global_atomic($1, max, i32, int32)
global_atomic($1, umin, i32, uint32)
global_atomic($1, umax, i32, uint32)
global_atomic($1, add, i64, int64)
global_atomic($1, sub, i64, int64)
global_atomic($1, and, i64, int64)
global_atomic($1, or, i64, int64)
global_atomic($1, xor, i64, int64)
global_atomic($1, min, i64, int64)
global_atomic($1, max, i64, int64)
global_atomic($1, umin, i64, uint64)
global_atomic($1, umax, i64, uint64)
global_swap($1, i32, int32)
global_swap($1, i64, int64)
global_atomic_exchange($1, i32, int32)
global_atomic_exchange($1, i64, int64)
')
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

tests/atomics-1.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform unsigned int32 s = 0;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_add_global(s, 1);
RET[programIndex] = reduce_add(b);
}
export void result(uniform float RET[]) {
RET[programIndex] = reduce_add(programIndex);
}

tests/atomics-2.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform int64 s = 0;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_add_global(s, 1);
RET[programIndex] = reduce_add(b);
}
export void result(uniform float RET[]) {
RET[programIndex] = reduce_add(programIndex);
}

tests/atomics-3.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform int32 s = 0xff;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_xor_global(s, 0xfffffff0);
RET[programIndex] = s;
}
export void result(uniform float RET[]) {
RET[programIndex] = 0xff;
}

tests/atomics-4.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform int32 s = 0;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_or_global(s, (1<<programIndex));
RET[programIndex] = s;
}
export void result(uniform float RET[]) {
RET[programIndex] = (1<<programCount)-1;
}

tests/atomics-5.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform int32 s = 0xbeef;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_swap_global(s, programIndex);
RET[programIndex] = reduce_max(b);
}
export void result(uniform float RET[]) {
RET[programIndex] = 0xbeef;
}

tests/atomics-6.ispc (new file, 14 lines)

@@ -0,0 +1,14 @@
export uniform int width() { return programCount; }
uniform int32 s = 2;
export void f_f(uniform float RET[], uniform float aFOO[]) {
float a = aFOO[programIndex];
float b = atomic_compare_exchange_global(s, programIndex, a*1000);
RET[programIndex] = s;
}
export void result(uniform float RET[]) {
RET[programIndex] = 3000;
}