There's now a SOA variability class (in addition to uniform,
varying, and unbound variability); the SOA factor must be a
positive power of 2.
When applied to a type, the leaf elements of the type (i.e.
atomic types, pointer types, and enum types) are widened out
into arrays of the given SOA factor. For example, given
struct Point { float x, y, z; };
Then "soa<8> Point" has a memory layout of "float x[8], y[8],
z[8]".
Furthermore, array indexing syntax has been augmented so that
when indexing into arrays of SOA-variability data, the two-stage
indexing (first into the array of soa<> elements and then into
the leaf arrays of SOA data) is performed automatically.
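As a short sketch of how this reads in source code (the array size and
the accessor function here are illustrative, not from the compiler):

    struct Point { float x, y, z; };
    soa<8> Point pts[128];        // laid out as 16 groups of x[8], y[8], z[8]

    float getX(int i) {           // i is a varying index
        return pts[i].x;          // both indexing stages happen automatically
    }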
There are two related optimizations that happen now. (These
currently only apply for gathers where the mask is known to be
all on, and to gathers that are accessing 32-bit sized elements,
but both of these may be generalized in the future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only either map
it to a general gather (one scalar load per SIMD lane), or an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory.)
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads. Further, we now generate code that shuffles
these loads around. Doing fewer, larger loads in this manner, when
possible, can be more efficient.
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
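An illustrative (hypothetical) case where this helps is reading AOS data
across the program instances:

    struct Vertex { float x, y, z; };
    uniform Vertex verts[1024];   // AOS layout: x0 y0 z0 x1 y1 z1 ...

    float sumComponents() {
        // Three gathers from contiguous AOS memory; the coalescing pass can
        // often replace the per-lane loads with a few wide vector loads
        // plus shuffles.
        float px = verts[programIndex].x;
        float py = verts[programIndex].y;
        float pz = verts[programIndex].z;
        return px + py + pz;
    }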
When the --fuzz-test command-line option is given, the input program
will be randomly perturbed by the lexer in an effort to trigger
assertions or crashes in the compiler (neither of which should ever
happen, even for malformed programs.)
Switches with both uniform and varying "switch" expressions are
supported. Switch statements with varying expressions and very
large numbers of labels may not perform well; some issues to be
filed shortly will track opportunities for improving these.
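A minimal sketch of a switch with a varying expression (the function is
hypothetical); each case body runs with the execution mask set to the
program instances whose value matched that label:

    float classify(int x) {       // x is varying
        switch (x & 3) {
        case 0:  return 0.;
        case 1:  return 1.;
        default: return -1.;
        }
    }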
ispc now supports goto, but only under uniform control flow--i.e.
it must be possible for the compiler to statically determine that
all program instances will follow the goto. An error is issued at
compile time if a goto is used when this is not the case.
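A minimal sketch of the supported form (hypothetical function):

    void spin() {
        uniform int i = 0;
    retry:
        ++i;
        if (i < 3)                // uniform test: all instances branch together
            goto retry;
    }

If the test guarding the goto were varying, the compiler could not prove
that all program instances follow it, and a compile-time error would be
issued.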
We want to be able to late-bind on whether the mask is i32 or i1
values, so if there's any chance of ambiguity, we emit code that does
the "GEP from a NULL base pointer" trick to compute the value later in
compilation.
When used, these targets end up with calls to undefined functions for all
of the various special vector stuff ispc needs to compile ispc programs
(masked store, gather, min/max, sqrt, etc.).
These targets are not yet useful for anything, but are a step toward
having an option to emit C++ code with calls out to intrinsics.
Reorganized the directory structure a bit and put the LLVM bitcode used
to define target-specific stuff (as well as some generic built-ins stuff)
into a builtins/ directory.
Note that for building on Windows, it's now necessary to set an LLVM_VERSION
environment variable (with values like LLVM_2_9, LLVM_3_0, LLVM_3_1svn, etc.)
For now this target just uses the same builtins-*.ll files as the
regular AVX1 target. Once the gather intrinsic is available from
LLVM, we'll want to have custom target files that call out to that
for gathers. (The integer min/max intrinsics should be wired up to
the __{min,max}_varying_{int,uint}*() builtins at that point as
well.)
Pointers can be either uniform or varying, and behave correspondingly.
e.g.: "uniform float * varying" is a varying pointer to uniform float
data in memory, and "float * uniform" is a uniform pointer to varying
data in memory. Like other types, pointers are varying by default.
Pointer-based expressions, & and *, sizeof, ->, pointer arithmetic,
and the array/pointer duality all bahave as in C. Array arguments
to functions are converted to pointers, also like C.
There is a built-in NULL for a null pointer value; conversion from
compile-time constant 0 values to NULL still needs to be implemented.
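A short sketch of the new pointer types in use (the array size and
function are illustrative):

    uniform float a[32];

    float readSome() {
        uniform float * uniform pu = &a[0];              // uniform pointer
        uniform float * varying pv = &a[programIndex];   // one address per instance
        float f = *pv;            // dereferencing a varying pointer is a gather
        float g = pv[8];          // array/pointer duality, as in C
        return f + g + *pu;
    }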
Other changes:
- Syntax for references has been updated to be C++ style; a useful
warning is now issued if the "reference" keyword is used.
- It is now illegal to pass a varying lvalue as a reference parameter
to a function; references are essentially uniform pointers.
This case had previously been handled via special-case call-by-value-return
code. That path has been removed now that varying pointers
are available to handle this use case (and much more).
- Some stdlib routines have been updated to take pointers as
arguments where appropriate (e.g. prefetch and the atomics).
A number of others still need attention.
- All of the examples have been updated.
- Many new tests.
TODO: documentation
Previously, to compute the size of objects and the offsets of elements
within structs, we were using the trick of doing a getelementptr
from a NULL base pointer and then casting the result to an int32/64.
However, since we actually know the target we're compiling for at
compile time, we can use corresponding methods from TargetData to
get these values directly.
This mostly cleans up code, but may also help the optimizations that
lower gathers/scatters to loads/stores work better in the presence
of structures.
Both uniform and varying function pointers are supported; when a function
is called through a varying function pointer, each unique function pointer
value across the running program instances is called once for the set of
active program instances that want to call it.
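A minimal sketch (the functions here are hypothetical):

    float doubleIt(float x) { return 2. * x; }
    float negate(float x)   { return -x; }

    float apply(float v) {
        // Varying function pointer: instances may hold different values.
        float (*fp)(float) = (programIndex & 1) ? negate : doubleIt;
        // Each distinct pointer value is called once, with the mask set
        // to the active instances that selected it.
        return fp(v);
    }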
Be better about tracking the full extent of expressions in the parser;
this leads to more intelligible error messages when we indicate where
exactly the error happened.
The stuff in decl.h/decl.cpp is messy, largely due to its close mapping
to C-style variable declarations. This checkin has updated code throughout
all of the declaration statement, variable, and function code that operates
on symbols and types directly. Thus, Decl*-related code is now localized
to decl.h/decl.cpp and the parser.
Issue #13.
Added AST and Function classes.
Now, we parse the whole file and build up the AST for all of the
functions in the Module before we emit IR for the functions (vs. before,
when we generated IR along the way as we parsed the source file.)
If a flag along the lines of "--target=sse4,avx-x2" is provided on the command-line,
then the program will be compiled for each of the given targets, with a separate
output file generated for each one. Further, an output file with dispatch functions
that check the current system's CPU and then choose the best available variant
is also created.
Issue #11.
Go back to running both sides of 'if' statements with masking and without
branching if we can determine that the code is relatively simple (as per
the simple cost model), and is safe to run even if the mask is 'all off'.
This gives a bit of a performance improvement for some of the examples
(most notably, the ray tracer), and is the code that one wants generated
in this case anyhow.
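For instance, with a simple, side-effect-free "if" like the following
(hypothetical) one, both branches can be run under the mask without any
branching:

    float absOrHalf(float x) {
        if (x < 0.)
            x = -x;               // both branches are cheap and safe to run
        else                      // even for lanes where the mask is off
            x = 0.5 * x;
        return x;
    }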
The cost model is currently only used to decide whether it's worth doing an
"are all lanes running" check at the start of functions--for small
functions, it's not worth the overhead.
The cost is estimated relatively early in compilation (e.g. before
we know if an array access is a scatter/gather or not, before
constant folding, etc.), so there are many known shortcomings.
If no CPU is specified, use the host CPU type, not just a default of "nehalem".
Provide better features strings to the LLVM target machinery.
-> Thus ensuring that LLVM doesn't generate SSE>2 instructions for the SSE2
target (Fixes issue #82).
-> Slight code improvements now that cmovs are used in generated code
Use the LLVM popcnt intrinsic for the SSE2 target now (it generates code
that doesn't use the popcnt instruction, now that we properly tell LLVM
which instructions are and aren't available for SSE2.)
Set the Module's target appropriately when it's first created.
Compile separate 32- and 64-bit versions of the builtins-c bitcode
and load the appropriate one based on the target we're compiling
for.