With AOS data, we can often coalesce the gather accesses for the main
part of foreach loops and only fail on the last iterations, where the mask
is not all on (since the coalescing code doesn't handle mixed masks yet).
Before, we'd report success with coalescing and then also report that
gathers were needed for the same accesses that were coalesced, which was
a) confusing, and b) didn't accurately represent what was going on for the
majority of the loop iterations.
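For example (a hypothetical sketch; the struct and trip count are made up):

    struct Color { float r, g, b; };

    export void luminance(uniform Color c[], uniform float out[]) {
        // With an 8-wide gang, iterations 0-23 run with the mask all on
        // and the AOS accesses can be coalesced; the final iterations
        // (24-26) run with a partial mask and still fall back to gathers.
        foreach (i = 0 ... 27) {
            out[i] = 0.3f * c[i].r + 0.6f * c[i].g + 0.1f * c[i].b;
        }
    }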
Clean up the API so the caller no longer has to pass in a vector used to
track PHI nodes; that's now done internally.
Handle casts in lValuesAreEqual().
For cases where it turns out that we just need the first element of
a vector (e.g. because we've determined that all of the values are
equal), it's often more efficient to only compute that one value
with scalar operations than to compute the whole vector's worth and
then just use one value. This function tries to rewrite a vector
computation to the scalar equivalent, if possible.
(Partial work-around for http://llvm.org/bugs/show_bug.cgi?id=11775.)
Note that sometimes this is the wrong thing to do (if we need the entire
vector value for other purposes, for example).
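A hypothetical ispc-level case where this applies (names made up): the
index below is varying but provably all-equal, so only the first element
of the address vector is needed and the lookup and its index math can be
done with scalar operations.

    export void lookup_one(uniform float table[], uniform float out[],
                           uniform int n, uniform int which) {
        foreach (i = 0 ... n) {
            int j = which;  // varying, but every element holds the same value
            // All elements of j are equal, so only the first element of the
            // computed address vector is needed; the access can be a single
            // scalar load broadcast to the lanes rather than a full gather.
            out[i] = table[j];
        }
    }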
The main issue is that the coalescing optimizations end up generating a
number of smaller vector ops (e.g. 4-wide and 8-wide ops on the 16-wide
generic target), which the examples/intrinsics implementations don't
currently support.
This fixes a number of failing tests for now; it may be worth
generalizing the stuff in examples/intrinsics at some point, since as a
general principle the coalescing optimizations are still desirable (e.g.
when generating LLVM IR output).
Issue #175.
There are two related optimizations that happen now. (These currently
only apply to gathers where the mask is known to be all on and that
access 32-bit elements, but both of these may be generalized in the
future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only either map
it to a general gather (one scalar load per SIMD lane), or an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory).
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads. Further, we now generate code that shuffles
these loads around. Doing fewer, larger loads in this manner, when
possible, can be more efficient.
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
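As an illustration (hypothetical code), consecutive AOS struct members
read in a foreach loop:

    struct Point { float x, y, z; };

    export void sum_components(uniform Point pts[], uniform float out[],
                               uniform int count) {
        foreach (i = 0 ... count) {
            // Each of pts[i].x, pts[i].y, and pts[i].z starts out as a
            // gather of 32-bit elements; analyzed together, the three
            // gathers read a contiguous range of memory, so they can be
            // replaced with a few wide vector loads plus shuffles.
            out[i] = pts[i].x + pts[i].y + pts[i].z;
        }
    }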
Previously, we'd pick one lane and generate a regular store for its value.
This was the wrong thing to do, since we should also have been checking
that the mask was on for the lane that was chosen. This bug didn't
become evident until the scalar target was added, since many stores fall
into this case with that target.
Now, we just leave those as regular scatters.
Fixes most of the failing tests for the scalar target listed in issue #167.
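This presumably concerns scatters where every lane targets the same
location; a hypothetical sketch of such a store:

    export void mark_slot(uniform int flags[], uniform int n, uniform int tag) {
        foreach (i = 0 ... n) {
            int slot = tag;   // varying, but the same value in every lane
            if (flags[i] < 0)
                // A scatter where all lanes target the same location;
                // picking an arbitrary lane and emitting a plain store is
                // only correct if that lane's mask bit is known to be on,
                // so it is now left as a regular scatter.
                flags[slot] = 1;
        }
    }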
Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.
Be sure to report that all values are equal in one-element vectors in
LLVMVectorValuesAllEqual().
Issue #166.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then carefully emit code for the addressing calculation so that these
constant offsets match the patterns LLVM uses to detect this case; in
many cases the constant offsets then end up directly encoded in the
instruction's addressing calculation, saving the arithmetic instructions
that would otherwise compute them.
Improves performance of stencil by ~15%. Other workloads unchanged.
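A hypothetical example of a gather whose offsets have a constant
component:

    export void copy_shifted(uniform float src[], uniform float dst[],
                             uniform int idx[], uniform int n) {
        foreach (i = 0 ... n) {
            int j = idx[i];
            // The offsets for src[j + 3] include a constant part
            // (3 * sizeof(float) = 12 bytes); that constant is now split
            // out and passed separately to the gather implementation so it
            // can be folded into the load's addressing calculation.
            dst[i] = src[j + 3];
        }
    }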
Effectively, the patterns that detected when the offsets of a gather or
scatter in base+offsets form were actually a multiple of 2/4/8 were no
longer working.
This change not only fixes this, but also expands the set of patterns
that are matched. For example, given offsets of the form 4*v1 + 16*v2,
it identifies a scale of 4 and new offsets of v1 + 4*v2.
This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
When the shift amount is known to be the same for all lanes, we now emit
calls to potentially-specialized functions for the left/right shifts that
take a single integer value for the shift amount. These in turn can be
matched to the corresponding intrinsics for the SSE target.
Issue #145.
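For example (hypothetical code), a shift whose amount is the same for
every program instance:

    export void shift_all(uniform int a[], uniform int n, uniform int amount) {
        foreach (i = 0 ... n) {
            // The shift count is a single integer shared by all lanes, so
            // this can call a specialized shift routine that maps to the
            // SSE shift-by-scalar intrinsics rather than doing a general
            // per-lane-variable shift.
            a[i] = a[i] << amount;
        }
    }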
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
    uniform float a[21];
    foreach (i = 0 ... 21)
        a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; with this change in
place, those tweaks keep compile times to ~5s, versus ~45s without them.
Fixes issue #160.
If we call a function pointer, CallInst::getCalledFunction() returns NULL; we
need to be careful about this case when we're matching various function calls
in optimization passes.
(Fixes a crash.)
We now recognize patterns like (ptr + offset1 + offset2) as being
cases we can handle with the base_offsets variants of the gather/scatter
functions. (This can come up with multidimensional array indexing,
for example.)
Issue #150.
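A hypothetical example with two-dimensional indexing:

    static uniform float grid[16][16];

    export void gather2d(uniform int rows[], uniform int cols[],
                         uniform float out[], uniform int count) {
        foreach (i = 0 ... count) {
            int r = rows[i];
            int c = cols[i];
            // grid[r][c] computes an address of the form
            // (ptr + row offset + column offset), which is now recognized
            // and handled with the base_offsets gather variants.
            out[i] = grid[r][c];
        }
    }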
When flattening chains of insertelement instructions, we didn't
handle the case where the initial insertelement was to a constant
vector (with one value set and the other values undef).
Also generalized the "do all of the instances access the same location"
check to handle the case where some of them are accessing undef
locations; these are ignored in this check, as they should correspond to the
mask being off for that lane anyway.
Fixes issue #149.
The compiler now supports an --emit-c++ option, which generates generic
vector C++ code. To actually compile this code, the user must provide
C++ code that implements a variety of types and operations (e.g. adding
two floating-point vector values together, comparing them, etc).
There are two examples of this required code in examples/intrinsics:
generic-16.h is a "generic" 16-wide implementation that does everything
required with scalar math; it's useful for demonstrating the requirements
of the implementation. Then, sse4.h shows a simple implementation of an
SSE4 target that maps the emitted function calls to SSE intrinsics.
When using these example implementations with the ispc test suite,
all but one or two tests pass with gcc and clang on Linux and OSX.
There are currently ~10 failures with icc on Linux, and ~50 failures with
MSVC 2010. (To be fixed in coming days.)
Performance varies: when running the examples through the sse4.h
target, some (e.g. options) have the same performance as when compiled
with --target=sse4 from ispc directly, while noise is 12% slower, rt is
26% slower, and aobench is 2.2x slower. The details of this haven't yet
been carefully investigated, but will be in coming days as well.
Issue #92.
Stop using the PassManagerBuilder and instead add all of the passes
directly in code here. This currently doesn't change behavior, but was
useful when experimenting with disabling the SROA pass when compiling to
generic targets.
This pass handles the "all on" and "all off" mask cases appropriately.
Also renamed load_masked stuff in built-ins to masked_load for consistency with
masked_store.
When used, these targets end up with calls to undefined functions for all
of the various special vector operations ispc needs in order to compile
programs (masked store, gather, min/max, sqrt, etc.).
These targets are not yet useful for anything, but are a step toward
having an option to emit C++ code with calls out to intrinsics.
Reorganized the directory structure a bit and put the LLVM bitcode used
to define target-specific stuff (as well as some generic built-ins stuff)
into a builtins/ directory.
Note that for building on Windows, it's now necessary to set an
LLVM_VERSION environment variable (with values like LLVM_2_9, LLVM_3_0,
LLVM_3_1svn, etc.).
When loading from an address that's computed by adding two registers
together, x86 can scale one of them by 2, 4, or 8, for free as part
of the addressing calculation. This change makes the code generated
for gather and scatter use this.
For the cases where a gather/scatter is based on a base pointer and
an integer offset vector, the GSImprovementsPass looks to see if the
integer offsets are being computed as 2/4/8 times some other value.
If so, it extracts the 2x/4x/8x part and leaves the rest as the
offsets. The {gather,scatter}_base_offsets_* functions take an i32
scale factor parameter and carefully generate IR so that it hits
LLVM's pattern matching for these scales.
This is a particular win on AVX, since it saves us two 4-wide integer
multiplies.
Noise runs 14% faster with this.
Issue #132.
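For example (hypothetical code), an indirect read of 32-bit elements:

    export void permute(uniform int src[], uniform int perm[],
                        uniform int dst[], uniform int n) {
        foreach (i = 0 ... n) {
            int j = perm[i];
            // The byte offsets for src[j] are 4*j; the pass pulls out the
            // factor of 4 as the scale and leaves j as the offsets, so the
            // generated loads can use x86 scaled addressing (base + 4*index)
            // rather than multiplying the offset vector explicitly.
            dst[i] = src[j];
        }
    }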