merge with sm35

Evghenii
2014-01-06 13:53:02 +01:00
189 changed files with 0 additions and 131201 deletions

@@ -1,167 +0,0 @@
====================
ISPC Examples README
====================
This directory has a number of sample ispc programs. Before building them
(on any system), install the appropriate ispc compiler binary into a
directory in your path. Then, if you're running Windows, open the
"examples.sln" file and build from there. For building under Linux/OSX,
there are makefiles in each directory that build the examples individually.
Almost all of them benchmark ispc implementations of the given computation
against regular serial C++ implementations, printing out a comparison of
the runtimes and the speedup delivered by ispc. It may be instructive to
do a side-by-side diff of the C++ and ispc implementations of these
algorithms to learn more about writing ispc code.
AOBench
=======
This is an ISPC implementation of the "AO bench" benchmark
(http://syoyo.wordpress.com/2009/01/26/ao-bench-is-evolving/). The command
line arguments are:
ao (num iterations) (xres) (yres)
It executes the program for the given number of iterations, rendering an
(xres x yres) image each time and measuring the computation time with both
serial and ispc implementations.
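The "run several iterations and keep the fastest time" pattern used by these benchmarks can be sketched as follows. This is a minimal illustration, not the code shipped with the examples: the helper names min_time_ms and speedup are hypothetical, and it measures wall-clock milliseconds with std::chrono rather than the cycle counter in timing.h.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run `work` `iterations` times and return the
// fastest wall-clock time in milliseconds. Taking the minimum filters
// out runs disturbed by other activity on the machine.
template <typename F>
double min_time_ms(int iterations, F work) {
    assert(iterations > 0);
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> dt = t1 - t0;
        best = std::min(best, dt.count());
    }
    return best;
}

// Speedup of a candidate implementation relative to a baseline.
// Both times must be in the same unit for the ratio to be meaningful.
inline double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

The serial and ispc paths would each be wrapped in a callable and timed with the same helper, so that the reported ratio compares like with like.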
AOBench_Instrumented
====================
This version of AO Bench is compiled with the --instrument ispc compiler
flag. This causes the compiler to emit calls to a (user-supplied)
ISPCInstrument() function at interesting places in the compiled code. An
example implementation of this function that counts the number of times the
callback is made and records some statistics about control flow coherence
is provided in the instrument.cpp file.
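A callback of this kind might look like the sketch below. The ISPCInstrument() signature shown (function name, note string, line number, 64-bit lane mask) is an assumption based on the ispc documentation and may differ between compiler versions, so check it against instrument.cpp. The CallSiteStats structure and the g_stats map are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-call-site counters (not taken from instrument.cpp).
struct CallSiteStats {
    uint64_t calls = 0;        // how often the callback fired at this site
    uint64_t activeLanes = 0;  // total number of set bits over all masks seen
    uint64_t allOff = 0;       // calls where no program instance was active
};

static std::map<std::string, CallSiteStats> g_stats;

// Count the set bits in the execution mask.
static int count_bits(uint64_t v) {
    int n = 0;
    while (v) {
        v &= v - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// ASSUMED signature; verify against the ispc version you build with.
extern "C" void ISPCInstrument(const char *fn, const char *note, int line,
                               uint64_t mask) {
    assert(fn != nullptr && note != nullptr);
    const std::string key =
        std::string(fn) + ":" + std::to_string(line) + " " + note;
    CallSiteStats &s = g_stats[key];
    s.calls += 1;
    s.activeLanes += count_bits(mask);
    if (mask == 0)
        s.allOff += 1;
}
```

Dividing activeLanes by calls gives the average number of active program instances at a site, which is one rough measure of control flow coherence.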
Deferred
========
This is an extensive example of using ispc for efficient
deferred shading of scenes with thousands of lights; it's an implementation
of the algorithm that Johan Andersson described at SIGGRAPH 2009,
implemented by Andrew Lauritzen and Jefferson Montgomery. The basic idea
is that a pre-rendered G-buffer is partitioned into tiles, and in each
tile, the set of lights that contribute to the tile is first computed.
Then, the pixels in the tile are shaded using just those light
sources. (See slides 19-29 of
http://s09.idav.ucdavis.edu/talks/04-JAndersson-ParallelFrostbite-Siggraph09.pdf
for more details on the algorithm.)
This directory includes three implementations of the algorithm:
- An ispc implementation that first does a static partitioning of the
screen into tiles to parallelize across the CPU cores. Within each tile
ispc kernels provide highly efficient implementations of the light
culling and shading calculations.
- A "best practices" serial C++ implementation. This implementation does a
dynamic partitioning of the screen, refining tiles with significant Z
depth complexity (these tiles often have a large number of lights that
affect them). Within each final tile, the pixels are shaded using
regular C++ code.
- If the Cilk extensions are available in your compiler, an ispc
implementation that uses Cilk will also be built.
(See http://software.intel.com/en-us/articles/intel-cilk-plus/). Like
the "best practices" serial implementation, this version does dynamic
tile partitioning for better load balancing and then uses ispc for the
light culling and shading.
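The static tile partitioning and per-tile light culling step can be sketched in plain C++ as below. This is a simplified illustration under stated assumptions, not the example's implementation: each light is reduced to a hypothetical screen-space bounding circle (ScreenLight), whereas the real code culls against the depth range stored in the G-buffer as well.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical simplification: a light's influence projected to a
// screen-space circle.
struct ScreenLight {
    float x, y, radius;
};

// True if the circle touches the axis-aligned rectangle [x0,x1] x [y0,y1].
static bool overlaps(const ScreenLight &l, float x0, float y0, float x1,
                     float y1) {
    const float cx = std::max(x0, std::min(l.x, x1));  // closest point in rect
    const float cy = std::max(y0, std::min(l.y, y1));
    const float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.radius * l.radius;
}

// Partition a width x height screen into square tiles of side `tile`
// and return, for each tile (row-major), the indices of the lights
// that may contribute to it.
std::vector<std::vector<int>> cull_lights(int width, int height, int tile,
                                          const std::vector<ScreenLight> &lights) {
    assert(width > 0 && height > 0 && tile > 0);
    const int nx = (width + tile - 1) / tile;
    const int ny = (height + tile - 1) / tile;
    std::vector<std::vector<int>> perTile(nx * ny);
    for (int ty = 0; ty < ny; ++ty) {
        for (int tx = 0; tx < nx; ++tx) {
            const float x0 = float(tx * tile), y0 = float(ty * tile);
            const float x1 = float(std::min(tx * tile + tile, width));
            const float y1 = float(std::min(ty * tile + tile, height));
            for (int i = 0; i < int(lights.size()); ++i)
                if (overlaps(lights[i], x0, y0, x1, y1))
                    perTile[ty * nx + tx].push_back(i);
        }
    }
    return perTile;
}
```

Shading then only loops over the short per-tile list instead of all lights in the scene, which is where the savings for scenes with thousands of lights come from.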
GMRES
=====
An implementation of the generalized minimal residual method for solving
sparse matrix equations.
(http://en.wikipedia.org/wiki/Generalized_minimal_residual_method)
Mandelbrot
==========
Mandelbrot set generation. This example is extensively documented at the
http://ispc.github.com/example.html page.
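For orientation, a serial version of the computation might look like the following sketch. It follows the usual escape-time formulation; the function names and the row-major output layout are assumptions for illustration rather than a copy of the example's source.

```cpp
#include <cassert>
#include <vector>

// Number of iterations before z = z*z + c leaves the radius-2 disk,
// capped at maxIter.
static int mandel(float cRe, float cIm, int maxIter) {
    float zRe = cRe, zIm = cIm;
    int i;
    for (i = 0; i < maxIter; ++i) {
        if (zRe * zRe + zIm * zIm > 4.f)
            break;
        const float newRe = zRe * zRe - zIm * zIm;
        const float newIm = 2.f * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    return i;
}

// Evaluate the iteration count for every pixel of a width x height grid
// covering [x0,x1) x [y0,y1) in the complex plane (row-major output).
std::vector<int> mandelbrot_serial(float x0, float y0, float x1, float y1,
                                   int width, int height, int maxIter) {
    assert(width > 0 && height > 0);
    const float dx = (x1 - x0) / width;
    const float dy = (y1 - y0) / height;
    std::vector<int> out(width * height);
    for (int j = 0; j < height; ++j)
        for (int i = 0; i < width; ++i)
            out[j * width + i] = mandel(x0 + i * dx, y0 + j * dy, maxIter);
    return out;
}
```

Every pixel is independent of the others, which is why the inner loop maps naturally onto SIMD program instances in the ispc version.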
Mandelbrot_tasks
================
Implementation of Mandelbrot set generation that also parallelizes across
cores using tasks. Under Windows, a simple task system built on
Microsoft's Concurrency Runtime is used (see tasks_concrt.cpp). On OSX, a
task system based on Grand Central Dispatch is used (tasks_gcd.cpp), and on
Linux, a pthreads-based task system is used (tasks_pthreads.cpp). When
using tasks with ispc, no task system is mandated; the user is free to plug
in any task system they want, for ease of interoperating with existing task
systems.
Noise
=====
This example has an implementation of Ken Perlin's procedural "noise"
function, as described in his 2002 "Improving Noise" SIGGRAPH paper.
Options
=======
This program implements both the Black-Scholes and Binomial options pricing
models in both ispc and regular serial C++ code.
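As a rough sketch of the two models, the code below prices a European call with the Black-Scholes closed form and with a Cox-Ross-Rubinstein binomial tree. The function names, the choice of a call option, and the use of double precision are assumptions made for illustration; the example program's own parameterization may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Cumulative distribution function of the standard normal distribution.
static double cnd(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call.
// S: spot, X: strike, T: years to expiry, r: risk-free rate, v: volatility.
double black_scholes_call(double S, double X, double T, double r, double v) {
    assert(S > 0 && X > 0 && T > 0 && v > 0);
    const double d1 =
        (std::log(S / X) + (r + 0.5 * v * v) * T) / (v * std::sqrt(T));
    const double d2 = d1 - v * std::sqrt(T);
    return S * cnd(d1) - X * std::exp(-r * T) * cnd(d2);
}

// Binomial (Cox-Ross-Rubinstein) price of the same European call.
double binomial_call(double S, double X, double T, double r, double v,
                     int steps) {
    assert(steps > 0);
    const double dt = T / steps;
    const double u = std::exp(v * std::sqrt(dt));   // up factor
    const double d = 1.0 / u;                       // down factor
    const double disc = std::exp(-r * dt);          // one-step discount
    const double pu = (std::exp(r * dt) - d) / (u - d);  // risk-neutral up prob.
    std::vector<double> V(steps + 1);
    for (int j = 0; j <= steps; ++j)                // payoffs at expiry
        V[j] = std::max(0.0, S * std::pow(u, 2 * j - steps) - X);
    for (int i = steps - 1; i >= 0; --i)            // roll back through the tree
        for (int j = 0; j <= i; ++j)
            V[j] = disc * ((1.0 - pu) * V[j] + pu * V[j + 1]);
    return V[0];
}
```

With enough steps the binomial price approaches the closed-form value, which makes the pair a convenient cross-check when porting either model to ispc.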
Perfbench
=========
This runs a number of microbenchmarks to measure system performance and
code generation quality.
RT
==
This is a simple ray tracer; it reads in camera parameters and a bounding
volume hierarchy and renders the scene from the given viewpoint. The
command line arguments are:
rt <scene base name>
where <scene base name> is one of "cornell", "teapot", or "sponza".
The implementation originally derives from the bounding volume hierarchy
and triangle intersection code from pbrt; see the pbrt source code and/or
"Physically Based Rendering" book for more about the basic algorithmic
details.
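To give a flavor of the triangle intersection step, here is a sketch of a barycentric ray-triangle test in the Moller-Trumbore style. It is written from the general technique, not copied from pbrt or from this example, and the V3 type and function names are hypothetical.

```cpp
#include <cassert>

struct V3 {
    float x, y, z;
};

static V3 sub(const V3 &a, const V3 &b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}
static V3 cross(const V3 &a, const V3 &b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}
static float dot(const V3 &a, const V3 &b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns true and writes the hit distance to tHit if the ray org + t*dir
// hits triangle (p0,p1,p2) with 0 < t < tMax.
bool ray_triangle(const V3 &org, const V3 &dir, const V3 &p0, const V3 &p1,
                  const V3 &p2, float tMax, float &tHit) {
    assert(tMax > 0.f);
    const V3 e1 = sub(p1, p0), e2 = sub(p2, p0);
    const V3 s1 = cross(dir, e2);
    const float divisor = dot(s1, e1);
    if (divisor == 0.f)
        return false;  // ray is parallel to the triangle plane
    const float inv = 1.f / divisor;
    const V3 d = sub(org, p0);
    const float b1 = dot(d, s1) * inv;  // first barycentric coordinate
    if (b1 < 0.f || b1 > 1.f)
        return false;
    const V3 s2 = cross(d, e1);
    const float b2 = dot(dir, s2) * inv;  // second barycentric coordinate
    if (b2 < 0.f || b1 + b2 > 1.f)
        return false;
    const float t = dot(e2, s2) * inv;
    if (t <= 0.f || t >= tMax)
        return false;
    tHit = t;
    return true;
}
```

In a BVH traversal, tMax would be the closest hit found so far, so that triangles behind an earlier hit are rejected cheaply.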
Simple
======
This is a simple "hello world" type program that shows a ~10 line
application program calling out to a ~5 line ispc program to do a simple
computation.
Sort
====
This is a bucket sort of 32-bit unsigned integers.
By default 1000000 random elements get sorted.
Call ./sort N in order to sort N elements instead.
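One common way to bucket-sort 32-bit keys is to make several passes over 8-bit digits with 256 buckets per pass (a least-significant-digit radix sort). The sketch below shows that scheme in serial C++; it is an assumption that this resembles the example, whose actual bucketing strategy may differ, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort 32-bit unsigned keys in place using four stable passes over
// 8-bit digits, each distributing keys into 256 buckets.
void bucket_sort_u32(std::vector<uint32_t> &keys) {
    std::vector<uint32_t> tmp(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        // count[b + 1] first holds the size of bucket b ...
        std::array<std::size_t, 257> count{};
        for (uint32_t k : keys)
            ++count[((k >> shift) & 0xFF) + 1];
        // ... and after the prefix sum count[b] is the start of bucket b.
        for (int b = 0; b < 256; ++b)
            count[b + 1] += count[b];
        // Stable scatter into the buckets.
        for (uint32_t k : keys)
            tmp[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(tmp);
    }
    assert(std::is_sorted(keys.begin(), keys.end()));
}
```

The counting and scatter loops are the parts a parallel version would split across tasks, with each task building a private histogram before the prefix sum.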
Volume
======
Ray-marching volume rendering with a single-scattering lighting model. To
run it, specify a camera parameter file and a volume density file, e.g.:
volume camera.dat density_highres.vol
(See, e.g. Chapters 11 and 16 of "Physically Based Rendering" for
information about the algorithm implemented here.) The volume data set
included here was generated by the example implementation of the "Wavelet
Turbulence for Fluid Simulation" SIGGRAPH 2008 paper by Kim et
al. (http://www.cs.cornell.edu/~tedkim/WTURB/)
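The core ray-marching loop can be sketched as follows. This is a deliberately simplified illustration: the density is supplied as a hypothetical function of the ray parameter t, and the light reaching each sample is assumed to be unattenuated, whereas a full single-scattering renderer would also march from each sample toward the light.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct MarchResult {
    float radiance;       // light scattered toward the eye along the ray
    float transmittance;  // fraction of background light that survives
};

// March from t0 to t1 in `steps` equal segments through a medium whose
// density along the ray is given by `density`. `sigma` is the extinction
// coefficient per unit density; `lightIntensity` is the (assumed
// unoccluded) light arriving at every sample.
MarchResult ray_march(const std::function<float(float)> &density, float t0,
                      float t1, int steps, float sigma, float lightIntensity) {
    assert(steps > 0 && t1 > t0);
    const float dt = (t1 - t0) / steps;
    float T = 1.f;  // transmittance accumulated so far
    float L = 0.f;  // radiance accumulated so far
    for (int i = 0; i < steps; ++i) {
        const float t = t0 + (i + 0.5f) * dt;  // sample at segment midpoint
        const float d = density(t);
        if (d <= 0.f)
            continue;  // empty space contributes nothing
        L += T * sigma * d * dt * lightIntensity;  // in-scattered light
        T *= std::exp(-sigma * d * dt);            // Beer-Lambert attenuation
    }
    return {L, T};
}
```

For a constant density the transmittance reduces to exp(-sigma * density * (t1 - t0)), which is a handy sanity check for any implementation.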

@@ -1,2 +0,0 @@
ao
*.ppm

@@ -1,8 +0,0 @@
EXAMPLE=ao
CPP_SRC=ao.cpp ao_serial.cpp
ISPC_SRC=ao1.ispc
ISPC_IA_TARGETS=avx
ISPC_ARM_TARGETS=neon
include ../common.mk

@@ -1,56 +0,0 @@
PROG=ao_cu
ISPC_SRC=ao1.ispc
CXX_SRC=ao_cu.cpp
CXX=g++
CXXFLAGS=-O3 -I$(CUDATK)/include
LD=g++
LDFLAGS=-lcuda
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
LLVM = $(HOME)/usr/local/llvm/bin-3.3
PTXGEN = $(HOME)/ptxgen
PTXGEN += -opt=3
PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
LLVM32DIS=$(LLVM32)/bin/llvm-dis
##.SUFFIXES: .bc .o .ptx .cu
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(ISPC_BC) $(PROG)
$(CXX_OBJ) : kernel.ptx
$(PROG): $(CXX_OBJ) kernel.ptx
/bin/cp kernel.ptx __kernels.ptx
$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc_nvptx64.bc: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
%.ptx: %.bc
$(PTXGEN) $< > $@
# $(LLVM32DIS) $<
# $(PTXGEN) `basename $< .bc`.ll > $@
kernel.ptx: $(PTXSRC)
cat $^ > kernel.ptx
clean:
/bin/rm -rf *.ptx *.bc *.ll $(PROG)

@@ -1,37 +0,0 @@
PROG=ao_mic
ISPC_SRC=ao.ispc
CXX_SRC=ao.cpp ../tasksys.cpp
CXX=icc
CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
LD=icc
LDFLAGS=-mmic -openmp
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
.SUFFIXES: .o .cpp
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(PROG)
$(PROG): $(ISPC_OBJ) $(CXX_OBJ)
$(LD) -o $@ $^ $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc.o: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $<
$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp -c
clean:
/bin/rm -rf *_ispc_zmm.cpp *.o $(PROG)

@@ -1,204 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#ifdef __linux__
#include <malloc.h>
#endif
#include <math.h>
#include <map>
#include <string>
#include <algorithm>
#include <sys/types.h>
#include "ao_ispc.h"
using namespace ispc;
#include "../timing.h"
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
#define NSUBSAMPLES 2
extern void ao_serial(int w, int h, int nsubsamples, float image[]);
static unsigned int test_iterations;
static unsigned int width, height;
static unsigned char *img;
static float *fimg;
static unsigned char
clamp(float f)
{
int i = (int)(f * 255.5);
if (i < 0) i = 0;
if (i > 255) i = 255;
return (unsigned char)i;
}
static void
savePPM(const char *fname, int w, int h)
{
for (int y = 0; y < h; y++) {
for (int x = 0; x < w; x++) {
img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
}
}
FILE *fp = fopen(fname, "wb");
if (!fp) {
perror(fname);
exit(1);
}
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", w, h);
fprintf(fp, "255\n");
fwrite(img, w * h * 3, 1, fp);
fclose(fp);
printf("Wrote image file %s\n", fname);
}
int main(int argc, char **argv)
{
if (argc != 4) {
printf ("%s\n", argv[0]);
printf ("Usage: ao [num test iterations] [width] [height]\n");
getchar();
exit(-1);
}
else {
test_iterations = atoi(argv[1]);
width = atoi (argv[2]);
height = atoi (argv[3]);
}
// Allocate space for output images
img = new unsigned char[width * height * 3];
fimg = new float[width * height * 3];
//
// Run the ispc path, test_iterations times, and report the minimum
// time for any of them.
//
double minTimeISPC = 1e30;
#if 0
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
assert(NSUBSAMPLES == 2);
reset_and_start_timer();
ao_ispc(width, height, NSUBSAMPLES, fimg);
double t = get_elapsed_mcycles();
minTimeISPC = std::min(minTimeISPC, t);
}
// Report results and save image
printf("[aobench ispc]:\t\t\t[%.3f] million cycles (%d x %d image)\n",
minTimeISPC, width, height);
savePPM("ao-ispc.ppm", width, height);
#endif
//
// Run the ispc + tasks path, test_iterations times, and report the
// minimum time for any of them.
//
double minTimeISPCTasks = 1e30;
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
assert(NSUBSAMPLES == 2);
reset_and_start_timer();
const double t0 = rtc();
ao_ispc_tasks(width, height, NSUBSAMPLES, fimg);
double t = 1e3*(rtc() - t0); // wall-clock milliseconds (replaces get_elapsed_mcycles())
minTimeISPCTasks = std::min(minTimeISPCTasks, t);
}
// Report results and save image
printf("[aobench ispc + tasks]:\t\t[%.3f] msec (%d x %d image)\n",
minTimeISPCTasks, width, height);
savePPM("ao-ispc-tasks.ppm", width, height);
// NOTE: this early return makes the serial path below unreachable.
return 0;
//
// Run the serial path, again test_iteration times, and report the
// minimum time.
//
double minTimeSerial = 1e30;
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
reset_and_start_timer();
ao_serial(width, height, NSUBSAMPLES, fimg);
double t = get_elapsed_mcycles();
minTimeSerial = std::min(minTimeSerial, t);
}
// Report more results, save another image...
printf("[aobench serial]:\t\t[%.3f] million cycles (%d x %d image)\n", minTimeSerial,
width, height);
printf("\t\t\t\t(%.2fx speedup from ISPC, %.2fx speedup from ISPC + tasks)\n",
minTimeSerial / minTimeISPC, minTimeSerial / minTimeISPCTasks);
savePPM("ao-serial.ppm", width, height);
return 0;
}

@@ -1,424 +0,0 @@
// -*- mode: c++ -*-
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
*/
#define NAO_SAMPLES 8
//#define M_PI 3.1415926535f
#define vec Float3
struct Float3
{
float x,y,z;
__device__ friend Float3 operator+(const Float3 a, const Float3 b)
{
Float3 c;
c.x = a.x+b.x;
c.y = a.y+b.y;
c.z = a.z+b.z;
return c;
}
__device__ friend Float3 operator-(const Float3 a, const Float3 b)
{
Float3 c;
c.x = a.x-b.x;
c.y = a.y-b.y;
c.z = a.z-b.z;
return c;
}
__device__ friend Float3 operator/(const Float3 a, const Float3 b)
{
Float3 c;
c.x = a.x/b.x;
c.y = a.y/b.y;
c.z = a.z/b.z;
return c;
}
__device__ friend Float3 operator/(const float a, const Float3 b)
{
Float3 c;
c.x = a/b.x;
c.y = a/b.y;
c.z = a/b.z;
return c;
}
__device__ friend Float3 operator*(const Float3 a, const Float3 b)
{
Float3 c;
c.x = a.x*b.x;
c.y = a.y*b.y;
c.z = a.z*b.z;
return c;
}
__device__ friend Float3 operator*(const Float3 a, const float b)
{
Float3 c;
c.x = a.x*b;
c.y = a.y*b;
c.z = a.z*b;
return c;
}
};
///////////////////////////////////////////////////////////////////////////
// RNG stuff
struct RNGState {
unsigned int z1, z2, z3, z4;
};
__device__
static inline unsigned int random(RNGState * state)
{
unsigned int b;
b = ((state->z1 << 6) ^ state->z1) >> 13;
state->z1 = ((state->z1 & 4294967294U) << 18) ^ b;
b = ((state->z2 << 2) ^ state->z2) >> 27;
state->z2 = ((state->z2 & 4294967288U) << 2) ^ b;
b = ((state->z3 << 13) ^ state->z3) >> 21;
state->z3 = ((state->z3 & 4294967280U) << 7) ^ b;
b = ((state->z4 << 3) ^ state->z4) >> 12;
state->z4 = ((state->z4 & 4294967168U) << 13) ^ b;
return (state->z1 ^ state->z2 ^ state->z3 ^ state->z4);
}
__device__
static inline float frandom(RNGState * state)
{
unsigned int irand = random(state);
irand &= (1ul<<23)-1;
return __int_as_float(0x3F800000 | irand)-1.0f;
}
__device__
static inline void seed_rng(RNGState * state,
unsigned int seed) {
state->z1 = seed;
state->z2 = seed ^ 0xbeeff00d;
state->z3 = ((seed & 0xfffful) << 16) | (seed >> 16);
state->z4 = (((seed & 0xfful) << 24) | ((seed & 0xff00ul) << 8) |
((seed & 0xff0000ul) >> 8) | (seed & 0xff000000ul) >> 24);
}
#define programCount 32
#define programIndex (threadIdx.x & 31)
#define taskIndex0 (blockIdx.x*4 + (threadIdx.x >> 5))
#define taskCount0 (gridDim.x*4)
#define taskIndex1 (blockIdx.y)
#define taskCount1 (gridDim.y)
#define warpIdx (threadIdx.x >> 5)
struct Isect {
float t;
vec p;
vec n;
int hit;
};
struct Sphere {
vec center;
float radius;
};
struct Plane {
vec p;
vec n;
};
struct Ray {
vec org;
vec dir;
};
__device__
static inline float dot(vec a, vec b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
__device__
static inline vec vcross(vec v0, vec v1) {
vec ret;
ret.x = v0.y * v1.z - v0.z * v1.y;
ret.y = v0.z * v1.x - v0.x * v1.z;
ret.z = v0.x * v1.y - v0.y * v1.x;
return ret;
}
__device__
static inline void vnormalize(vec &v) {
float len2 = dot(v, v);
float invlen = rsqrt(len2);
v = v*invlen;
}
__device__
static inline void
ray_plane_intersect(Isect &isect,const Ray &ray, const Plane &plane) {
float d = -dot(plane.p, plane.n);
float v = dot(ray.dir, plane.n);
if (abs(v) < 1.0e-17)
return;
else {
float t = -(dot(ray.org, plane.n) + d) / v;
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = plane.n;
}
}
}
__device__
static inline void
ray_sphere_intersect(Isect &isect,const Ray &ray, const Sphere &sphere) {
vec rs = ray.org - sphere.center;
float B = dot(rs, ray.dir);
float C = dot(rs, rs) - sphere.radius * sphere.radius;
float D = B * B - C;
if (D > 0.) {
float t = -B - sqrt(D);
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = isect.p - sphere.center;
vnormalize(isect.n);
}
}
}
__device__
static inline void
orthoBasis(vec basis[3], vec n) {
basis[2] = n;
basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
if ((n.x < 0.6) && (n.x > -0.6)) {
basis[1].x = 1.0;
} else if ((n.y < 0.6) && (n.y > -0.6)) {
basis[1].y = 1.0;
} else if ((n.z < 0.6) && (n.z > -0.6)) {
basis[1].z = 1.0;
} else {
basis[1].x = 1.0;
}
basis[0] = vcross(basis[1], basis[2]);
vnormalize(basis[0]);
basis[1] = vcross(basis[2], basis[0]);
vnormalize(basis[1]);
}
__device__
static inline float
ambient_occlusion(Isect &isect, const Plane &plane, const Sphere spheres[3],
RNGState &rngstate) {
float eps = 0.0001f;
vec p; //, n;
vec basis[3];
float occlusion = 0.0;
p = isect.p + isect.n * eps;
orthoBasis(basis, isect.n);
const int ntheta = NAO_SAMPLES;
const int nphi = NAO_SAMPLES;
for ( int j = 0; j < ntheta; j++) {
for ( int i = 0; i < nphi; i++) {
Ray ray;
Isect occIsect;
float theta = sqrt(frandom(&rngstate));
float phi = 2.0f * M_PI * frandom(&rngstate);
float x = cos(phi) * theta;
float y = sin(phi) * theta;
float z = sqrt(1.0 - theta * theta);
// local -> global
float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
ray.org = p;
ray.dir.x = rx;
ray.dir.y = ry;
ray.dir.z = rz;
occIsect.t = 1.0e+17;
occIsect.hit = 0;
for ( int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(occIsect, ray, spheres[snum]);
ray_plane_intersect (occIsect, ray, plane);
if (occIsect.hit) occlusion += 1.0;
}
}
occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
return occlusion;
}
/* Compute the image for the tile [x0,x1) x [y0,y1), for an overall image
of width w and height h.
*/
__device__
static inline void ao_tile(
int x0, int x1,
int y0, int y1,
int w, int h,
int nsubsamples,
float image[])
{
const Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
const Sphere spheres[3] = {
{ { -2.0f, 0.0f, -3.5f }, 0.5f },
{ { -0.5f, 0.0f, -3.0f }, 0.5f },
{ { 1.0f, 0.0f, -2.2f }, 0.5f } };
RNGState rngstate;
seed_rng(&rngstate, programIndex + (y0 << (programIndex & 31)));
float invSamples = 1.f / nsubsamples;
for ( int y = y0; y < y1; y++)
for ( int xb = x0; xb < x1; xb += programCount)
{
const int x = xb + programIndex;
const int offset = 3 * (y * w + x);
float res = 0.0f;
for ( int u = 0; u < nsubsamples; u++)
for ( int v = 0; v < nsubsamples; v++)
{
float du = (float)u * invSamples, dv = (float)v * invSamples;
// Figure out x,y pixel in NDC
float px = (x + du - (w / 2.0f)) / (w / 2.0f);
float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
float ret = 0.f;
Ray ray;
Isect isect;
ray.org.x = 0.0f;
ray.org.y = 0.0f;
ray.org.z = 0.0f;
// Poor man's perspective projection
ray.dir.x = px;
ray.dir.y = py;
ray.dir.z = -1.0;
vnormalize(ray.dir);
isect.t = 1.0e+17;
isect.hit = 0;
for ( int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(isect, ray, spheres[snum]);
ray_plane_intersect(isect, ray, plane);
// Note use of 'coherent' if statement; the set of rays we
// trace will often all hit or all miss the scene
if (isect.hit) {
ret = ambient_occlusion(isect, plane, spheres, rngstate);
ret *= invSamples * invSamples;
res += ret;
}
}
if (x < x1) // guard this lane's pixel (not the block start) against the tile's right edge
{
image[offset ] = res;
image[offset+1] = res;
image[offset+2] = res;
}
}
}
#define TILEX 64
#define TILEY 4
extern "C"
__global__
void ao_task( int width, int height,
int nsubsamples, float image[])
{
if (taskIndex0 >= taskCount0) return;
if (taskIndex1 >= taskCount1) return;
const int x0 = taskIndex0 * TILEX;
const int x1 = min(x0 + TILEX, width);
const int y0 = taskIndex1 * TILEY;
const int y1 = min(y0 + TILEY, height);
ao_tile(x0,x1,y0,y1, width, height, nsubsamples, image);
}
#if 1
extern "C"
__global__
void ao_ispc_tasks(
int w, int h, int nsubsamples,
float image[])
{
const int ntilex = (w+TILEX-1)/TILEX;
const int ntiley = (h+TILEY-1)/TILEY;
const int nbx = (ntilex-1)/4 + 1;
const int nby = ntiley;
const int nbz = 1;
const dim3 blocks (nbx, nby, nbz);
if (threadIdx.x == 0)
ao_task<<<blocks, 128>>>(w,h,nsubsamples,image);
cudaDeviceSynchronize();
}
#endif

@@ -1,272 +0,0 @@
// -*- mode: c++ -*-
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
*/
#define NAO_SAMPLES 8
#define M_PI 3.1415926535f
typedef float<3> vec;
struct Isect {
float t;
vec p;
vec n;
int hit;
};
struct Sphere {
vec center;
float radius;
};
struct Plane {
vec p;
vec n;
};
struct Ray {
vec org;
vec dir;
};
static inline float dot(vec a, vec b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
static inline vec vcross(vec v0, vec v1) {
vec ret;
ret.x = v0.y * v1.z - v0.z * v1.y;
ret.y = v0.z * v1.x - v0.x * v1.z;
ret.z = v0.x * v1.y - v0.y * v1.x;
return ret;
}
static inline void vnormalize(vec &v) {
float len2 = dot(v, v);
float invlen = rsqrt(len2);
v *= invlen;
}
static void
ray_plane_intersect(Isect &isect, Ray &ray, uniform Plane &plane) {
float d = -dot(plane.p, plane.n);
float v = dot(ray.dir, plane.n);
cif (abs(v) < 1.0e-17)
return;
else {
float t = -(dot(ray.org, plane.n) + d) / v;
cif ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = plane.n;
}
}
}
static inline void
ray_sphere_intersect(Isect &isect, Ray &ray, uniform Sphere &sphere) {
vec rs = ray.org - sphere.center;
float B = dot(rs, ray.dir);
float C = dot(rs, rs) - sphere.radius * sphere.radius;
float D = B * B - C;
cif (D > 0.) {
float t = -B - sqrt(D);
cif ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + t * ray.dir;
isect.n = isect.p - sphere.center;
vnormalize(isect.n);
}
}
}
static void
orthoBasis(vec basis[3], vec n) {
basis[2] = n;
basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
if ((n.x < 0.6) && (n.x > -0.6)) {
basis[1].x = 1.0;
} else if ((n.y < 0.6) && (n.y > -0.6)) {
basis[1].y = 1.0;
} else if ((n.z < 0.6) && (n.z > -0.6)) {
basis[1].z = 1.0;
} else {
basis[1].x = 1.0;
}
basis[0] = vcross(basis[1], basis[2]);
vnormalize(basis[0]);
basis[1] = vcross(basis[2], basis[0]);
vnormalize(basis[1]);
}
static float
ambient_occlusion(Isect &isect, uniform Plane &plane, uniform Sphere spheres[3],
RNGState &rngstate) {
float eps = 0.0001f;
vec p, n;
vec basis[3];
float occlusion = 0.0;
p = isect.p + eps * isect.n;
orthoBasis(basis, isect.n);
static const uniform int ntheta = NAO_SAMPLES;
static const uniform int nphi = NAO_SAMPLES;
for (uniform int j = 0; j < ntheta; j++) {
for (uniform int i = 0; i < nphi; i++) {
Ray ray;
Isect occIsect;
float theta = sqrt(frandom(&rngstate));
float phi = 2.0f * M_PI * frandom(&rngstate);
float x = cos(phi) * theta;
float y = sin(phi) * theta;
float z = sqrt(1.0 - theta * theta);
// local -> global
float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
ray.org = p;
ray.dir.x = rx;
ray.dir.y = ry;
ray.dir.z = rz;
occIsect.t = 1.0e+17;
occIsect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(occIsect, ray, spheres[snum]);
ray_plane_intersect (occIsect, ray, plane);
if (occIsect.hit) occlusion += 1.0;
}
}
occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
return occlusion;
}
/* Compute the image for the scanlines from [y0,y1), for an overall image
of width w and height h.
*/
static void ao_scanlines(uniform int y0, uniform int y1, uniform int w,
uniform int h, uniform int nsubsamples,
uniform float image[]) {
static uniform Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
static uniform Sphere spheres[3] = {
{ { -2.0f, 0.0f, -3.5f }, 0.5f },
{ { -0.5f, 0.0f, -3.0f }, 0.5f },
{ { 1.0f, 0.0f, -2.2f }, 0.5f } };
RNGState rngstate;
seed_rng(&rngstate, programIndex + (y0 << (programIndex & 15)));
float invSamples = 1.f / nsubsamples;
foreach_tiled(y = y0 ... y1, x = 0 ... w,
u = 0 ... nsubsamples, v = 0 ... nsubsamples) {
float du = (float)u * invSamples, dv = (float)v * invSamples;
// Figure out x,y pixel in NDC
float px = (x + du - (w / 2.0f)) / (w / 2.0f);
float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
float ret = 0.f;
Ray ray;
Isect isect;
ray.org = 0.f;
// Poor man's perspective projection
ray.dir.x = px;
ray.dir.y = py;
ray.dir.z = -1.0;
vnormalize(ray.dir);
isect.t = 1.0e+17;
isect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(isect, ray, spheres[snum]);
ray_plane_intersect(isect, ray, plane);
// Note use of 'coherent' if statement; the set of rays we
// trace will often all hit or all miss the scene
cif (isect.hit) {
ret = ambient_occlusion(isect, plane, spheres, rngstate);
ret *= invSamples * invSamples;
int offset = 3 * (y * w + x);
atomic_add_local(&image[offset], ret);
atomic_add_local(&image[offset+1], ret);
atomic_add_local(&image[offset+2], ret);
}
}
}
export void ao_ispc(uniform int w, uniform int h, uniform int nsubsamples,
uniform float image[]) {
ao_scanlines(0, h, w, h, nsubsamples, image);
}
static void task ao_task(uniform int width, uniform int height,
uniform int nsubsamples, uniform float image[]) {
ao_scanlines(taskIndex, taskIndex+1, width, height, nsubsamples, image);
}
export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples,
uniform float image[]) {
launch[h] ao_task(w, h, nsubsamples, image);
}

@@ -1,302 +0,0 @@
// -*- mode: c++ -*-
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
*/
#define NAO_SAMPLES 8
#define M_PI 3.1415926535f
typedef float<3> vec;
struct Isect {
float t;
vec p;
vec n;
int hit;
};
struct Sphere {
vec center;
float radius;
};
struct Plane {
vec p;
vec n;
};
struct Ray {
vec org;
vec dir;
};
static inline float dot(vec a, vec b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
static inline vec vcross(vec v0, vec v1) {
vec ret;
ret.x = v0.y * v1.z - v0.z * v1.y;
ret.y = v0.z * v1.x - v0.x * v1.z;
ret.z = v0.x * v1.y - v0.y * v1.x;
return ret;
}
static inline void vnormalize(vec &v) {
float len2 = dot(v, v);
float invlen = rsqrt(len2);
v *= invlen;
}
static inline void
ray_plane_intersect(Isect &isect, Ray &ray, uniform Plane &plane) {
float d = -dot(plane.p, plane.n);
float v = dot(ray.dir, plane.n);
if (abs(v) < 1.0e-17)
return;
else {
float t = -(dot(ray.org, plane.n) + d) / v;
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = plane.n;
}
}
}
void
ray_sphere_intersect(Isect &isect, Ray &ray, uniform Sphere &sphere) {
vec rs = ray.org - sphere.center;
float B = dot(rs, ray.dir);
float C = dot(rs, rs) - sphere.radius * sphere.radius;
float D = B * B - C;
if (D > 0.) {
float t = -B - sqrt(D);
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + t * ray.dir;
isect.n = isect.p - sphere.center;
vnormalize(isect.n);
}
}
}
static inline void
orthoBasis(vec basis[3], vec n) {
basis[2] = n;
basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
if ((n.x < 0.6) && (n.x > -0.6)) {
basis[1].x = 1.0;
} else if ((n.y < 0.6) && (n.y > -0.6)) {
basis[1].y = 1.0;
} else if ((n.z < 0.6) && (n.z > -0.6)) {
basis[1].z = 1.0;
} else {
basis[1].x = 1.0;
}
basis[0] = vcross(basis[1], basis[2]);
vnormalize(basis[0]);
basis[1] = vcross(basis[2], basis[0]);
vnormalize(basis[1]);
}
static inline float
ambient_occlusion(Isect &isect, uniform Plane &plane, uniform Sphere spheres[3],
RNGState &rngstate) {
float eps = 0.0001f;
vec p, n;
vec basis[3];
float occlusion = 0.0;
p = isect.p + eps * isect.n;
orthoBasis(basis, isect.n);
static const uniform int ntheta = NAO_SAMPLES;
static const uniform int nphi = NAO_SAMPLES;
for (uniform int j = 0; j < ntheta; j++) {
for (uniform int i = 0; i < nphi; i++) {
Ray ray;
Isect occIsect;
float theta = sqrt(frandom(&rngstate));
float phi = 2.0f * M_PI * frandom(&rngstate);
float x = cos(phi) * theta;
float y = sin(phi) * theta;
float z = sqrt(1.0 - theta * theta);
// local -> global
float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
ray.org = p;
ray.dir.x = rx;
ray.dir.y = ry;
ray.dir.z = rz;
occIsect.t = 1.0e+17;
occIsect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(occIsect, ray, spheres[snum]);
ray_plane_intersect (occIsect, ray, plane);
if (occIsect.hit) occlusion += 1.0;
}
}
occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
return occlusion;
}
/* Compute the image for the tile [x0,x1) x [y0,y1), for an overall image
of width w and height h.
*/
static inline void ao_tile(
uniform int x0, uniform int x1,
uniform int y0, uniform int y1,
uniform int w, uniform int h,
uniform int nsubsamples,
uniform float image[])
{
uniform Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
uniform Sphere spheres[3] = {
{ { -2.0f, 0.0f, -3.5f }, 0.5f },
{ { -0.5f, 0.0f, -3.0f }, 0.5f },
{ { 1.0f, 0.0f, -2.2f }, 0.5f } };
RNGState rngstate;
seed_rng(&rngstate, programIndex + (y0 << (programIndex & 31)));
float invSamples = 1.f / nsubsamples;
foreach_tiled (y = y0 ... y1, x = x0 ... x1)
{
const int offset = 3 * (y * w + x);
float res = 0.0f;
for (uniform int u = 0; u < nsubsamples; u++)
for (uniform int v = 0; v < nsubsamples; v++)
{
float du = (float)u * invSamples, dv = (float)v * invSamples;
// Figure out x,y pixel in NDC
float px = (x + du - (w / 2.0f)) / (w / 2.0f);
float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
float ret = 0.f;
Ray ray;
Isect isect;
ray.org = 0.f;
// Poor man's perspective projection
ray.dir.x = px;
ray.dir.y = py;
ray.dir.z = -1.0;
vnormalize(ray.dir);
isect.t = 1.0e+17;
isect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(isect, ray, spheres[snum]);
ray_plane_intersect(isect, ray, plane);
// The set of rays we trace will often all hit or all miss the scene,
// so this branch is usually coherent across program instances
if (isect.hit) {
ret = ambient_occlusion(isect, plane, spheres, rngstate);
ret *= invSamples * invSamples;
res += ret;
}
}
//if (x < x1)
{
image[offset ] = res;
image[offset+1] = res;
image[offset+2] = res;
}
}
}
#define TILEX 64
#define TILEY 4
/* unless task/export is specified all functions
* are generated as mangled "__device__" functions
*/
/* task will generate mangled "__global__" function only */
void task ao_task(uniform int width, uniform int height,
uniform int nsubsamples, uniform float image[])
{
if (taskIndex0 >= taskCount0) return;
if (taskIndex1 >= taskCount1) return;
const uniform int x0 = taskIndex0 * TILEX;
const uniform int x1 = min(x0 + TILEX, width);
const uniform int y0 = taskIndex1 * TILEY;
const uniform int y1 = min(y0 + TILEY, height);
ao_tile(x0,x1,y0,y1, width, height, nsubsamples, image);
}
/* export will generate unmangled "extern "C" __global__" and mangled "__device__" */
export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples,
uniform float image[])
{
const uniform int ntilex = (w+TILEX-1)/TILEX;
const uniform int ntiley = (h+TILEY-1)/TILEY;
launch[ntilex,ntiley] ao_task(w, h, nsubsamples, image);
sync;
}
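
The tiling arithmetic used by ao_ispc_tasks and ao_task above can be sketched in plain C++. This is only an illustration: the names TileRange, tileCount and tileRange are hypothetical and do not appear in the original sources, and kTileX/kTileY simply mirror the TILEX/TILEY defines.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical constants mirroring "#define TILEX 64" and "#define TILEY 4".
constexpr int kTileX = 64;
constexpr int kTileY = 4;

// Half-open pixel ranges [x0,x1) x [y0,y1) covered by one tile.
struct TileRange { int x0, x1, y0, y1; };

// Number of tiles needed to cover `extent` pixels (ceiling division),
// the same expression as (w+TILEX-1)/TILEX in ao_ispc_tasks.
inline int tileCount(int extent, int tile) {
    assert(extent >= 0 && tile > 0);
    return (extent + tile - 1) / tile;
}

// Pixel range for tile (tx, ty). The last tile in each dimension is
// clamped to the image size, as ao_task does with min().
inline TileRange tileRange(int tx, int ty, int width, int height) {
    assert(tx >= 0 && ty >= 0);
    TileRange r;
    r.x0 = tx * kTileX;
    r.x1 = std::min(r.x0 + kTileX, width);
    r.y0 = ty * kTileY;
    r.y1 = std::min(r.y0 + kTileY, height);
    return r;
}
```

For a 100 x 10 image this gives a 2 x 3 grid of tiles, and the tile at (1, 2) covers x in [64,100) and y in [8,10), so partial tiles at the right and bottom edges never write outside the image.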


@@ -1,510 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#ifdef __linux__
#include <malloc.h>
#endif
#include <math.h>
#include <map>
#include <string>
#include <algorithm>
#include <sys/types.h>
//#include "ao1_ispc.h"
//using namespace ispc;
#include "../timing.h"
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
/******************************/
#include <cassert>
#include <iostream>
#include <cuda.h>
#include "drvapi_error_string.h"
#define checkCudaErrors(err) __checkCudaErrors (err, __FILE__, __LINE__)
// These are the inline versions for all of the SDK helper functions
void __checkCudaErrors(CUresult err, const char *file, const int line) {
if(CUDA_SUCCESS != err) {
std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
<< getCudaDrvErrorString(err) << "\" from file <" << file
<< ">, line " << line << "\n";
exit(-1);
}
}
/**********************/
/* Basic CUDriver API */
CUcontext context;
void createContext(const int deviceId = 0)
{
CUdevice device;
int devCount;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGetCount(&devCount));
assert(devCount > 0);
checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
char name[128];
checkCudaErrors(cuDeviceGetName(name, 128, device));
std::cout << "Using CUDA Device [0]: " << name << "\n";
int devMajor, devMinor;
checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
std::cout << "Device Compute Capability: "
<< devMajor << "." << devMinor << "\n";
if (devMajor < 2) {
std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
exit(1);
}
// Create driver context
checkCudaErrors(cuCtxCreate(&context, 0, device));
}
void destroyContext()
{
checkCudaErrors(cuCtxDestroy(context));
}
CUmodule loadModule(const char * module)
{
const double t0 = rtc();
CUmodule cudaModule;
// in this branch we use compilation with parameters
#if 0
unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
// set up pointer to set the Maximum # of registers for a particular kernel
jitOptions[0] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 64;
jitOptVals[0] = (void *)(size_t)jitRegCount;
#if 0
{
jitNumOptions = 3;
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024;
jitOptVals[0] = (void *)(size_t)jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up pointer to set the Maximum # of registers for a particular kernel
jitOptions[2] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 32;
jitOptVals[2] = (void *)(size_t)jitRegCount;
}
#endif
checkCudaErrors(cuModuleLoadDataEx(&cudaModule, module,jitNumOptions, jitOptions, (void **)jitOptVals));
#else
CUlinkState CUState;
CUlinkState *lState = &CUState;
const int nOptions = 7;
CUjit_option options[nOptions];
void* optionVals[nOptions];
float walltime;
const unsigned int logSize = 32768;
char error_log[logSize],
info_log[logSize];
void *cuOut;
size_t outSize;
int myErr = 0;
// Setup linker options
// Return walltime from JIT compilation
options[0] = CU_JIT_WALL_TIME;
optionVals[0] = (void*) &walltime;
// Pass a buffer for info messages
options[1] = CU_JIT_INFO_LOG_BUFFER;
optionVals[1] = (void*) info_log;
// Pass the size of the info buffer
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
optionVals[2] = (void*)(size_t) logSize;
// Pass a buffer for error message
options[3] = CU_JIT_ERROR_LOG_BUFFER;
optionVals[3] = (void*) error_log;
// Pass the size of the error buffer
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
optionVals[4] = (void*)(size_t) logSize;
// Make the linker verbose
options[5] = CU_JIT_LOG_VERBOSE;
optionVals[5] = (void*) 1;
// Max # of registers per thread
options[6] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 64;
optionVals[6] = (void *)(size_t)jitRegCount;
// Create a pending linker invocation
checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
#if 0
if (sizeof(void *)==4)
{
// Load the PTX from the string myPtx32
printf("Loading myPtx32[] program\n");
// PTX May also be loaded from file, as per below.
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
}
else
#endif
{
// Load the PTX from the string myPtx (64-bit)
fprintf(stderr, "Loading ptx..\n");
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
// Only continue if the previous step succeeded, so the first failure is the one reported
if (myErr == CUDA_SUCCESS)
myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, "libcudadevrt.a", 0,0,0);
// PTX May also be loaded from file, as per below.
// myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
}
// Complete the linker step
if (myErr == CUDA_SUCCESS)
myErr = cuLinkComplete(*lState, &cuOut, &outSize);
if ( myErr != CUDA_SUCCESS )
{
// Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above.
fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
assert(0);
}
// Linker walltime and info_log were requested in options above.
fprintf(stderr, "CUDA Link Completed in %f ms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
// Load resulting cuBin into module
checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
// Destroy the linker invocation
checkCudaErrors(cuLinkDestroy(*lState));
#endif
fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
return cudaModule;
}
void unloadModule(CUmodule &cudaModule)
{
checkCudaErrors(cuModuleUnload(cudaModule));
}
CUfunction getFunction(CUmodule &cudaModule, const char * function)
{
CUfunction cudaFunction;
checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
return cudaFunction;
}
CUdeviceptr deviceMalloc(const size_t size)
{
CUdeviceptr d_buf;
checkCudaErrors(cuMemAllocManaged(&d_buf, size, CU_MEM_ATTACH_GLOBAL));
return d_buf;
}
void deviceFree(CUdeviceptr d_buf)
{
checkCudaErrors(cuMemFree(d_buf));
}
void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
{
checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
}
void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
{
checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
}
#define deviceLaunch(func,params) \
checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_EQUAL)); \
checkCudaErrors( \
cuLaunchKernel( \
(func), \
1,1,1, \
32, 1, 1, \
0, NULL, (params), NULL \
));
typedef CUdeviceptr devicePtr;
/**************/
#include <vector>
std::vector<char> readBinary(const char * filename)
{
std::vector<char> buffer;
FILE *fp = fopen(filename, "rb");
if (!fp )
{
fprintf(stderr, "file %s not found\n", filename);
assert(0);
}
#if 0
char c;
while ((c = fgetc(fp)) != EOF)
buffer.push_back(c);
#else
fseek(fp, 0, SEEK_END);
const unsigned long long size = ftell(fp); /*calc the size needed*/
fseek(fp, 0, SEEK_SET);
buffer.resize(size);
if (size == 0 || fread(&buffer[0], sizeof(char), size, fp) != size){ /* empty file, or count of read bytes != calculated size of file -> ERROR */
fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
exit(1);
}
#endif
fclose(fp);
fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
return buffer;
}
extern "C"
{
void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
{
return NULL;
}
double CUDALaunch(
void **handlePtr,
const char * func_name,
void **func_args)
{
std::vector<char> module_str = readBinary("__kernels.ptx");
module_str.push_back('\0'); // loadModule() calls strlen() on the PTX, so it must be NUL-terminated
const char * module = &module_str[0];
CUmodule cudaModule = loadModule(module);
CUfunction cudaFunction = getFunction(cudaModule, func_name);
const double t0 = rtc();
deviceLaunch(cudaFunction, func_args);
checkCudaErrors(cuStreamSynchronize(0));
const double dt = rtc() - t0;
unloadModule(cudaModule);
return dt;
}
void CUDASync(void *handle)
{
checkCudaErrors(cuStreamSynchronize(0));
}
void ISPCSync(void *handle)
{
checkCudaErrors(cuStreamSynchronize(0));
}
void CUDAFree(void *handle)
{
}
}
/******************************/
#define NSUBSAMPLES 2
extern void ao_serial(int w, int h, int nsubsamples, float image[]);
static unsigned int test_iterations;
static unsigned int width, height;
static unsigned char *img;
static float *fimg;
static unsigned char
clamp(float f)
{
int i = (int)(f * 255.5);
if (i < 0) i = 0;
if (i > 255) i = 255;
return (unsigned char)i;
}
static void
savePPM(const char *fname, int w, int h)
{
for (int y = 0; y < h; y++) {
for (int x = 0; x < w; x++) {
img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
}
}
FILE *fp = fopen(fname, "wb");
if (!fp) {
perror(fname);
exit(1);
}
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", w, h);
fprintf(fp, "255\n");
fwrite(img, w * h * 3, 1, fp);
fclose(fp);
printf("Wrote image file %s\n", fname);
}
int main(int argc, char **argv)
{
if (argc != 4) {
printf ("%s\n", argv[0]);
printf ("Usage: ao [num test iterations] [width] [height]\n");
getchar();
exit(-1);
}
else {
test_iterations = atoi(argv[1]);
width = atoi (argv[2]);
height = atoi (argv[3]);
}
// Allocate space for output images
img = new unsigned char[width * height * 3];
fimg = new float[width * height * 3];
//
// Run the ispc path, test_iterations times, and report the minimum
// time for any of them.
//
double minTimeISPC = 1e30;
#if 0
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
assert(NSUBSAMPLES == 2);
reset_and_start_timer();
ao_ispc(width, height, NSUBSAMPLES, fimg);
double t = get_elapsed_mcycles();
minTimeISPC = std::min(minTimeISPC, t);
}
// Report results and save image
printf("[aobench ispc]:\t\t\t[%.3f] million cycles (%d x %d image)\n",
minTimeISPC, width, height);
savePPM("ao-ispc.ppm", width, height);
#endif
/*******************/
createContext();
/*******************/
devicePtr d_fimg = deviceMalloc(width*height*3*sizeof(float));
//
// Run the ispc + tasks path, test_iterations times, and report the
// minimum time for any of them.
//
double minTimeISPCTasks = 1e30;
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
assert(NSUBSAMPLES == 2);
memcpyH2D(d_fimg, fimg, width*height*3*sizeof(float));
reset_and_start_timer();
#if 0
const double t0 = rtc();
ao_ispc_tasks(
width,
height,
NSUBSAMPLES,
(float*)d_fimg);
// double t = (rtc() - t0); //get_elapsed_mcycles();
#else
const char * func_name = "ao_ispc_tasks";
int arg_1 = width;
int arg_2 = height;
int arg_3 = NSUBSAMPLES;
void *func_args[] = {&arg_1, &arg_2, &arg_3, (float*)&d_fimg};
const double t = 1e3*CUDALaunch(NULL, func_name, func_args);
#endif
minTimeISPCTasks = std::min(minTimeISPCTasks, t);
}
memcpyD2H(fimg, d_fimg, width*height*3*sizeof(float));
// Report results and save image
printf("[aobench ispc + tasks]:\t\t[%.3f] ms (%d x %d image)\n",
minTimeISPCTasks, width, height);
savePPM("ao-cuda.ppm", width, height);
/*******************/
destroyContext();
/*******************/
return 0; // early exit: the serial path below is never reached in this CUDA build
//
// Run the serial path, again test_iteration times, and report the
// minimum time.
//
double minTimeSerial = 1e30;
for (unsigned int i = 0; i < test_iterations; i++) {
memset((void *)fimg, 0, sizeof(float) * width * height * 3);
reset_and_start_timer();
ao_serial(width, height, NSUBSAMPLES, fimg);
double t = get_elapsed_mcycles();
minTimeSerial = std::min(minTimeSerial, t);
}
// Report more results, save another image...
printf("[aobench serial]:\t\t[%.3f] million cycles (%d x %d image)\n", minTimeSerial,
width, height);
printf("\t\t\t\t(%.2fx speedup from ISPC, %.2fx speedup from ISPC + tasks)\n",
minTimeSerial / minTimeISPC, minTimeSerial / minTimeISPCTasks);
savePPM("ao-serial.ppm", width, height);
return 0;
}
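
The "run test_iterations times and report the minimum" pattern used in main() above can be written as a small reusable helper. The following is a sketch under stated assumptions, not the timing code of the original: minElapsedMs is a hypothetical name, and it uses std::chrono::steady_clock in place of the rtc() and reset_and_start_timer() helpers from the listing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `iterations` times and returns the
// fastest single run in milliseconds. Taking the minimum filters out
// runs disturbed by one-off costs such as cold caches or scheduling.
inline double minElapsedMs(unsigned iterations, const std::function<void()> &work) {
    assert(iterations > 0);
    double best = 1e30; // same sentinel as minTimeISPCTasks in main()
    for (unsigned i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

Because the result is in milliseconds of wall-clock time, it should be labelled as such when printed; it is not comparable to the cycle counts returned by get_elapsed_mcycles().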


@@ -1,314 +0,0 @@
// -*- mode: c++ -*-
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdlib.h>
#include <math.h>
#ifdef _MSC_VER
static long long drand48_x = 0x1234ABCD330E;
static inline void srand48(int x) {
drand48_x = x ^ (x << 16);
}
static inline double drand48() {
drand48_x = drand48_x * 0x5DEECE66D + 0xB;
return (drand48_x & 0xFFFFFFFFFFFF) * (1.0 / 281474976710656.0);
}
#endif // _MSC_VER
#ifdef _MSC_VER
__declspec(align(16))
#endif
struct vec {
vec() { x=y=z=pad=0.; }
vec(float xx, float yy, float zz) { x = xx; y = yy; z = zz; pad = 0.f; }
vec operator*(float f) const { return vec(x*f, y*f, z*f); }
vec operator+(const vec &f2) const {
return vec(x+f2.x, y+f2.y, z+f2.z);
}
vec operator-(const vec &f2) const {
return vec(x-f2.x, y-f2.y, z-f2.z);
}
vec operator*(const vec &f2) const {
return vec(x*f2.x, y*f2.y, z*f2.z);
}
float x, y, z;
float pad;
}
#ifndef _MSC_VER
__attribute__ ((aligned(16)))
#endif
;
inline vec operator*(float f, const vec &v) { return vec(f*v.x, f*v.y, f*v.z); }
#define NAO_SAMPLES 8
#ifdef M_PI
#undef M_PI
#endif
#define M_PI 3.1415926535f
struct Isect {
float t;
vec p;
vec n;
int hit;
};
struct Sphere {
vec center;
float radius;
};
struct Plane {
vec p;
vec n;
};
struct Ray {
vec org;
vec dir;
};
static inline float dot(const vec &a, const vec &b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
static inline vec vcross(const vec &v0, const vec &v1) {
vec ret;
ret.x = v0.y * v1.z - v0.z * v1.y;
ret.y = v0.z * v1.x - v0.x * v1.z;
ret.z = v0.x * v1.y - v0.y * v1.x;
return ret;
}
static inline void vnormalize(vec &v) {
float len2 = dot(v, v);
float invlen = 1.f / sqrtf(len2);
v = v * invlen;
}
static inline void
ray_plane_intersect(Isect &isect, Ray &ray,
Plane &plane) {
float d = -dot(plane.p, plane.n);
float v = dot(ray.dir, plane.n);
if (fabsf(v) < 1.0e-17f)
return;
else {
float t = -(dot(ray.org, plane.n) + d) / v;
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = plane.n;
}
}
}
static inline void
ray_sphere_intersect(Isect &isect, Ray &ray,
Sphere &sphere) {
vec rs = ray.org - sphere.center;
float B = dot(rs, ray.dir);
float C = dot(rs, rs) - sphere.radius * sphere.radius;
float D = B * B - C;
if (D > 0.) {
float t = -B - sqrtf(D);
if ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + t * ray.dir;
isect.n = isect.p - sphere.center;
vnormalize(isect.n);
}
}
}
static inline void
orthoBasis(vec basis[3], const vec &n) {
basis[2] = n;
basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
if ((n.x < 0.6f) && (n.x > -0.6f)) {
basis[1].x = 1.0;
} else if ((n.y < 0.6f) && (n.y > -0.6f)) {
basis[1].y = 1.0;
} else if ((n.z < 0.6f) && (n.z > -0.6f)) {
basis[1].z = 1.0;
} else {
basis[1].x = 1.0;
}
basis[0] = vcross(basis[1], basis[2]);
vnormalize(basis[0]);
basis[1] = vcross(basis[2], basis[0]);
vnormalize(basis[1]);
}
static float
ambient_occlusion(Isect &isect, Plane &plane,
Sphere spheres[3]) {
float eps = 0.0001f;
vec p, n;
vec basis[3];
float occlusion = 0.0;
p = isect.p + eps * isect.n;
orthoBasis(basis, isect.n);
static const int ntheta = NAO_SAMPLES;
static const int nphi = NAO_SAMPLES;
for (int j = 0; j < ntheta; j++) {
for (int i = 0; i < nphi; i++) {
Ray ray;
Isect occIsect;
float theta = sqrtf(drand48());
float phi = 2.0f * M_PI * drand48();
float x = cosf(phi) * theta;
float y = sinf(phi) * theta;
float z = sqrtf(1.0f - theta * theta);
// local -> global
float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
ray.org = p;
ray.dir.x = rx;
ray.dir.y = ry;
ray.dir.z = rz;
occIsect.t = 1.0e+17f;
occIsect.hit = 0;
for (int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(occIsect, ray, spheres[snum]);
ray_plane_intersect (occIsect, ray, plane);
if (occIsect.hit) occlusion += 1.f;
}
}
occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
return occlusion;
}
/* Compute the image for the scanlines from [y0,y1), for an overall image
of width w and height h.
*/
static void ao_scanlines(int y0, int y1, int w, int h, int nsubsamples,
float image[]) {
static Plane plane = { vec(0.0f, -0.5f, 0.0f), vec(0.f, 1.f, 0.f) };
static Sphere spheres[3] = {
{ vec(-2.0f, 0.0f, -3.5f), 0.5f },
{ vec(-0.5f, 0.0f, -3.0f), 0.5f },
{ vec(1.0f, 0.0f, -2.2f), 0.5f } };
srand48(y0);
for (int y = y0; y < y1; ++y) {
for (int x = 0; x < w; ++x) {
int offset = 3 * (y * w + x);
for (int u = 0; u < nsubsamples; ++u) {
for (int v = 0; v < nsubsamples; ++v) {
float px = (x + (u / (float)nsubsamples) - (w / 2.0f)) / (w / 2.0f);
float py = -(y + (v / (float)nsubsamples) - (h / 2.0f)) / (h / 2.0f);
float ret = 0.f;
Ray ray;
Isect isect;
ray.org = vec(0.f, 0.f, 0.f);
ray.dir.x = px;
ray.dir.y = py;
ray.dir.z = -1.0f;
vnormalize(ray.dir);
isect.t = 1.0e+17f;
isect.hit = 0;
for (int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(isect, ray, spheres[snum]);
ray_plane_intersect(isect, ray, plane);
if (isect.hit)
ret = ambient_occlusion(isect, plane, spheres);
// Update image for AO for this ray
image[offset+0] += ret;
image[offset+1] += ret;
image[offset+2] += ret;
}
}
// Normalize image pixels by number of samples taken per pixel
image[offset+0] /= nsubsamples * nsubsamples;
image[offset+1] /= nsubsamples * nsubsamples;
image[offset+2] /= nsubsamples * nsubsamples;
}
}
}
void ao_serial(int w, int h, int nsubsamples,
float image[]) {
ao_scanlines(0, h, w, h, nsubsamples, image);
}
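
The direction sampling inside ambient_occlusion above takes two uniform random numbers, places a point on the unit disk, and lifts it onto the hemisphere around the local z axis, which yields cosine-weighted directions. A minimal standalone sketch follows; Dir and cosineSampleHemisphere are hypothetical names, and the variable the listing calls theta is named r here because it is a disk radius rather than an angle.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical plain direction type; the listing uses its own vec struct.
struct Dir { float x, y, z; };

// Map two uniform random numbers u1, u2 in [0,1] to a unit direction on
// the hemisphere around +z, using the same formulas as ambient_occlusion.
inline Dir cosineSampleHemisphere(float u1, float u2) {
    assert(u1 >= 0.0f && u1 <= 1.0f);
    const float kPi = 3.1415926535f;
    const float r = std::sqrt(u1);       // radius of the point on the unit disk
    const float phi = 2.0f * kPi * u2;   // angle of the point on the unit disk
    Dir d;
    d.x = std::cos(phi) * r;
    d.y = std::sin(phi) * r;
    // Project up onto the hemisphere; max() guards against rounding below zero.
    d.z = std::sqrt(std::max(0.0f, 1.0f - r * r));
    return d;
}
```

The returned vector always has unit length, since x*x + y*y = r*r and z*z = 1 - r*r. In the listing it is then rotated from this local frame into world space with the basis produced by orthoBasis.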


@@ -1,180 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<ItemGroup>
<ClCompile Include="ao.cpp" />
<ClCompile Include="ao_serial.cpp" />
<ClCompile Include="../tasksys.cpp" />
</ItemGroup>
<ItemGroup>
<CustomBuild Include="ao.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4,avx
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4,avx
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4,avx
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4,avx
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
</CustomBuild>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}</ProjectGuid>
<Keyword>Win32Proj</Keyword>
<RootNamespace>aobench</RootNamespace>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>ao</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ExecutablePath);$(ProjectDir)..\..</ExecutablePath>
<TargetName>ao</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>ao</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>ao</TargetName>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>

@@ -1,370 +0,0 @@
/*
* Copyright 1993-2012 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
#ifndef _DRVAPI_ERROR_STRING_H_
#define _DRVAPI_ERROR_STRING_H_
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <cuda.h> // CUresult, used by getCudaDrvErrorString()
// Error Code string definitions here
typedef struct
{
char const *error_string;
int error_id;
} s_CudaErrorStr;
/**
* Error codes
*/
static s_CudaErrorStr sCudaDrvErrorString[] =
{
/**
* The API call returned with no errors. In the case of query calls, this
* can also mean that the operation being queried is complete (see
* ::cuEventQuery() and ::cuStreamQuery()).
*/
{ "CUDA_SUCCESS", 0 },
/**
* This indicates that one or more of the parameters passed to the API call
* is not within an acceptable range of values.
*/
{ "CUDA_ERROR_INVALID_VALUE", 1 },
/**
* The API call failed because it was unable to allocate enough memory to
* perform the requested operation.
*/
{ "CUDA_ERROR_OUT_OF_MEMORY", 2 },
/**
* This indicates that the CUDA driver has not been initialized with
* ::cuInit() or that initialization has failed.
*/
{ "CUDA_ERROR_NOT_INITIALIZED", 3 },
/**
* This indicates that the CUDA driver is in the process of shutting down.
*/
{ "CUDA_ERROR_DEINITIALIZED", 4 },
/**
* This indicates profiling APIs are called while application is running
* in visual profiler mode.
*/
{ "CUDA_ERROR_PROFILER_DISABLED", 5 },
/**
* This indicates profiling has not been initialized for this context.
* Call cuProfilerInitialize() to resolve this.
*/
{ "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
/**
* This indicates profiler has already been started and probably
* cuProfilerStart() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
/**
* This indicates profiler has already been stopped and probably
* cuProfilerStop() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },
/**
* This indicates that no CUDA-capable devices were detected by the installed
* CUDA driver.
*/
{ "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
/**
* This indicates that the device ordinal supplied by the user does not
* correspond to a valid CUDA device.
*/
{ "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
/**
* This indicates that the device kernel image is invalid. This can also
* indicate an invalid CUDA module.
*/
{ "CUDA_ERROR_INVALID_IMAGE", 200 },
/**
* This most frequently indicates that there is no context bound to the
* current thread. This can also be returned if the context passed to an
* API call is not a valid handle (such as a context that has had
* ::cuCtxDestroy() invoked on it). This can also be returned if a user
* mixes different API versions (e.g. a 3010 context with 3020 API calls).
* See ::cuCtxGetApiVersion() for more details.
*/
{ "CUDA_ERROR_INVALID_CONTEXT", 201 },
/**
* This indicates that the context being supplied as a parameter to the
* API call was already the active context.
* \deprecated
* This error return is deprecated as of CUDA 3.2. It is no longer an
* error to attempt to push the active context via ::cuCtxPushCurrent().
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
/**
* This indicates that a map or register operation has failed.
*/
{ "CUDA_ERROR_MAP_FAILED", 205 },
/**
* This indicates that an unmap or unregister operation has failed.
*/
{ "CUDA_ERROR_UNMAP_FAILED", 206 },
/**
* This indicates that the specified array is currently mapped and thus
* cannot be destroyed.
*/
{ "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
/**
* This indicates that the resource is already mapped.
*/
{ "CUDA_ERROR_ALREADY_MAPPED", 208 },
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*/
{ "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
/**
* This indicates that a resource has already been acquired.
*/
{ "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
/**
* This indicates that a resource is not mapped.
*/
{ "CUDA_ERROR_NOT_MAPPED", 211 },
/**
* This indicates that a mapped resource is not available for access as an
* array.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
/**
* This indicates that a mapped resource is not available for access as a
* pointer.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
/**
* This indicates that an uncorrectable ECC error was detected during
* execution.
*/
{ "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
/**
* This indicates that the ::CUlimit passed to the API call is not
* supported by the active device.
*/
{ "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
/**
* This indicates that the ::CUcontext passed to the API call can
* only be bound to a single CPU thread at a time but is already
* bound to a CPU thread.
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
/**
* This indicates that peer access is not supported across the given
* devices.
*/
{ "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217},
/**
* This indicates that the device kernel source is invalid.
*/
{ "CUDA_ERROR_INVALID_SOURCE", 300 },
/**
* This indicates that the file specified was not found.
*/
{ "CUDA_ERROR_FILE_NOT_FOUND", 301 },
/**
* This indicates that a link to a shared object failed to resolve.
*/
{ "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
/**
* This indicates that initialization of a shared object failed.
*/
{ "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
/**
* This indicates that an OS call failed.
*/
{ "CUDA_ERROR_OPERATING_SYSTEM", 304 },
/**
* This indicates that a resource handle passed to the API call was not
* valid. Resource handles are opaque types like ::CUstream and ::CUevent.
*/
{ "CUDA_ERROR_INVALID_HANDLE", 400 },
/**
* This indicates that a named symbol was not found. Examples of symbols
* are global/constant variable names, texture names, and surface names.
*/
{ "CUDA_ERROR_NOT_FOUND", 500 },
/**
* This indicates that asynchronous operations issued previously have not
* completed yet. This result is not actually an error, but must be indicated
* differently than ::CUDA_SUCCESS (which indicates completion). Calls that
* may return this value include ::cuEventQuery() and ::cuStreamQuery().
*/
{ "CUDA_ERROR_NOT_READY", 600 },
/**
* An exception occurred on the device while executing a kernel. Common
* causes include dereferencing an invalid device pointer and accessing
* out of bounds shared memory. The context cannot be used, so it must
* be destroyed (and a new one should be created). All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_FAILED", 700 },
/**
* This indicates that a launch did not occur because it did not have
* appropriate resources. This error usually indicates that the user has
* attempted to pass too many arguments to the device kernel, or the
* kernel launch specifies too many threads for the kernel's register
* count. Passing arguments of the wrong size (e.g. a 64-bit pointer
* when a 32-bit int is expected) is equivalent to passing too many
* arguments and can also result in this error.
*/
{ "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
/**
* This indicates that the device kernel took too long to execute. This can
* only occur if timeouts are enabled - see the device attribute
* ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
* context cannot be used (and must be destroyed similar to
* ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
* this context are invalid and must be reconstructed if the program is to
* continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
/**
* This error indicates a kernel launch that uses an incompatible texturing
* mode.
*/
{ "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
/**
* This error indicates that a call to ::cuCtxEnablePeerAccess() is
* trying to re-enable peer access to a context which has already
* had peer access to it enabled.
*/
{ "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
/**
* This error indicates that ::cuCtxDisablePeerAccess() is
* trying to disable peer access which has not been enabled yet
* via ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
/**
* This error indicates that the primary context for the specified device
* has already been initialized.
*/
{ "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
/**
* This error indicates that the context current to the calling thread
* has been destroyed using ::cuCtxDestroy, or is a primary context which
* has not yet been initialized.
*/
{ "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
/**
* A device-side assert triggered during kernel execution. The context
* cannot be used anymore, and must be destroyed. All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_ASSERT", 710 },
/**
* This error indicates that the hardware resources required to enable
* peer access have been exhausted for one or more of the devices
* passed to ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_TOO_MANY_PEERS", 711 },
/**
* This error indicates that the memory range passed to ::cuMemHostRegister()
* has already been registered.
*/
{ "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
/**
* This error indicates that the pointer passed to ::cuMemHostUnregister()
* does not correspond to any currently registered memory region.
*/
{ "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
/**
* This error indicates that the attempted operation is not permitted.
*/
{ "CUDA_ERROR_NOT_PERMITTED", 800 },
/**
* This error indicates that the attempted operation is not supported
* on the current system or device.
*/
{ "CUDA_ERROR_NOT_SUPPORTED", 801 },
/**
* This indicates that an unknown internal error has occurred.
*/
{ "CUDA_ERROR_UNKNOWN", 999 },
{ NULL, -1 }
};
// This is just a linear search through the array, since the error_ids do not
// always occur consecutively.
const char *getCudaDrvErrorString(CUresult error_id)
{
    int index = 0;

    while (sCudaDrvErrorString[index].error_id != error_id &&
           sCudaDrvErrorString[index].error_id != -1)
    {
        index++;
    }

    if (sCudaDrvErrorString[index].error_id == error_id)
        return (const char *)sCudaDrvErrorString[index].error_string;
    else
        return (const char *)"CUDA_ERROR not found!";
}
#endif
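
The lookup in this header is a sentinel-terminated linear search. The following is a minimal, self-contained sketch of that pattern and not part of the NVIDIA header: `ErrorEntry`, `kDemoErrors`, `lookupErrorName` and `isKnownError` are hypothetical names, and the table holds only a few of the entries so that it compiles without the CUDA headers.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical miniature of the error table: a few entries plus the
// { NULL, -1 } style sentinel that terminates the search.
struct ErrorEntry {
    const char *name;
    int id;
};

static const ErrorEntry kDemoErrors[] = {
    { "CUDA_SUCCESS", 0 },
    { "CUDA_ERROR_INVALID_VALUE", 1 },
    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
    { "CUDA_ERROR_UNKNOWN", 999 },
    { nullptr, -1 }  // sentinel: marks the end of the table
};

static const char kNotFound[] = "CUDA_ERROR not found!";

// Linear search that stops at either a matching id or the sentinel.
// Ids are sparse (0..8, 100, 101, 200, ...), so direct indexing is not possible.
static const char *lookupErrorName(const ErrorEntry *table, int id) {
    assert(table != nullptr);
    for (const ErrorEntry *e = table; e->name != nullptr; ++e) {
        if (e->id == id)
            return e->name;
    }
    return kNotFound;
}

// Convenience wrapper: true if the id has an entry in the table.
static bool isKnownError(const ErrorEntry *table, int id) {
    return std::strcmp(lookupErrorName(table, id), kNotFound) != 0;
}
```

One deliberate difference in this sketch: it detects the end of the table by the null name rather than by the id -1, so looking up -1 yields the "not found" string instead of matching the sentinel entry and returning its null pointer.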

@@ -1,2 +0,0 @@
ao
*.ppm

@@ -1,26 +0,0 @@
CXX=clang++ -m64
CXXFLAGS=-Iobjs/ -g3 -Wall
ISPC=ispc
ISPCFLAGS=-O2 --instrument --arch=x86-64 --target=sse2

default: ao

.PHONY: dirs clean

dirs:
	/bin/mkdir -p objs/

clean:
	/bin/rm -rf objs *~ ao

ao: objs/ao.o objs/instrument.o objs/ao_ispc.o ../tasksys.cpp
	$(CXX) $(CXXFLAGS) -o $@ $^ -lm -lpthread

objs/%.o: %.cpp dirs
	$(CXX) $< $(CXXFLAGS) -c -o $@

# ao.cpp includes the generated header, so it must exist before ao.o is built.
objs/ao.o: objs/ao_instrumented_ispc.h

objs/%_instrumented_ispc.h objs/%_ispc.o: %.ispc dirs
	$(ISPC) $(ISPCFLAGS) $< -o objs/$*_ispc.o -h objs/$*_instrumented_ispc.h

@@ -1,131 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#ifdef __linux__
#include <malloc.h>
#endif
#include <math.h>
#include <map>
#include <string>
#include <algorithm>
#include <sys/types.h>
#include "ao_instrumented_ispc.h"
using namespace ispc;
#include "instrument.h"
#include "../timing.h"
#define NSUBSAMPLES 2
static unsigned int test_iterations;
static unsigned int width, height;
static unsigned char *img;
static float *fimg;
static unsigned char
clamp(float f)
{
int i = (int)(f * 255.5);
if (i < 0) i = 0;
if (i > 255) i = 255;
return (unsigned char)i;
}
static void
savePPM(const char *fname, int w, int h)
{
for (int y = 0; y < h; y++) {
for (int x = 0; x < w; x++) {
img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
}
}
FILE *fp = fopen(fname, "wb");
if (!fp) {
perror(fname);
exit(1);
}
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", w, h);
fprintf(fp, "255\n");
fwrite(img, w * h * 3, 1, fp);
fclose(fp);
printf("Wrote image file %s\n", fname);
}
int main(int argc, char **argv)
{
    if (argc != 4) {
        printf ("%s\n", argv[0]);
        printf ("Usage: ao [num test iterations] [width] [height]\n");
        getchar();
        exit(-1);
    }
    else {
        // test_iterations is parsed but not used below: this instrumented
        // build renders the image once.
        test_iterations = atoi(argv[1]);
        width = atoi (argv[2]);
        height = atoi (argv[3]);
    }

    // Allocate space for output images
    img = new unsigned char[width * height * 3];
    fimg = new float[width * height * 3];

    ao_ispc(width, height, NSUBSAMPLES, fimg);

    savePPM("ao-ispc.ppm", width, height);

    ISPCPrintInstrument();

    delete[] img;
    delete[] fimg;
    return 0;
}
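
The output path of this file converts the float image to bytes and writes it as a binary PPM. As a rough illustration only, the sketch below isolates the two pieces of logic involved; `toByte` and `ppmHeader` are hypothetical helper names that do not exist in ao.cpp.

```cpp
#include <cassert>
#include <string>

// Quantize one float channel to a byte the way clamp() in ao.cpp does:
// scale by 255.5, truncate toward zero, then clamp to [0, 255].
static unsigned char toByte(float f) {
    int i = static_cast<int>(f * 255.5f);
    if (i < 0) i = 0;
    if (i > 255) i = 255;
    return static_cast<unsigned char>(i);
}

// Build the text header of a binary (P6) PPM file: magic number,
// dimensions, and maximum channel value, each terminated by a newline.
// The raw RGB bytes (3 per pixel, row-major) follow directly after it.
static std::string ppmHeader(int w, int h) {
    assert(w > 0 && h > 0);
    return "P6\n" + std::to_string(w) + " " + std::to_string(h) + "\n255\n";
}
```

Because the header is followed by raw bytes, the file has to be opened in binary mode, which is why savePPM() passes "wb" to fopen().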

@@ -1,333 +0,0 @@
// -*- mode: c++ -*-
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
*/
#define NAO_SAMPLES 8
#define M_PI 3.1415926535f
typedef float<3> vec;
struct Isect {
float t;
vec p;
vec n;
int hit;
};
struct Sphere {
vec center;
float radius;
};
struct Plane {
vec p;
vec n;
};
struct Ray {
vec org;
vec dir;
};
static inline float dot(vec a, vec b) {
return a.x * b.x + a.y * b.y + a.z * b.z;
}
static inline vec vcross(vec v0, vec v1) {
vec ret;
ret.x = v0.y * v1.z - v0.z * v1.y;
ret.y = v0.z * v1.x - v0.x * v1.z;
ret.z = v0.x * v1.y - v0.y * v1.x;
return ret;
}
static inline void vnormalize(vec &v) {
float len2 = dot(v, v);
float invlen = rsqrt(len2);
v *= invlen;
}
static inline void
ray_plane_intersect(Isect &isect, Ray &ray, Plane &plane) {
float d = -dot(plane.p, plane.n);
float v = dot(ray.dir, plane.n);
cif (abs(v) < 1.0e-17)
return;
else {
float t = -(dot(ray.org, plane.n) + d) / v;
cif ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + ray.dir * t;
isect.n = plane.n;
}
}
}
static inline void
ray_sphere_intersect(Isect &isect, Ray &ray, Sphere &sphere) {
vec rs = ray.org - sphere.center;
float B = dot(rs, ray.dir);
float C = dot(rs, rs) - sphere.radius * sphere.radius;
float D = B * B - C;
cif (D > 0.) {
float t = -B - sqrt(D);
cif ((t > 0.0) && (t < isect.t)) {
isect.t = t;
isect.hit = 1;
isect.p = ray.org + t * ray.dir;
isect.n = isect.p - sphere.center;
vnormalize(isect.n);
}
}
}
static inline void
orthoBasis(vec basis[3], vec n) {
basis[2] = n;
basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
if ((n.x < 0.6) && (n.x > -0.6)) {
basis[1].x = 1.0;
} else if ((n.y < 0.6) && (n.y > -0.6)) {
basis[1].y = 1.0;
} else if ((n.z < 0.6) && (n.z > -0.6)) {
basis[1].z = 1.0;
} else {
basis[1].x = 1.0;
}
basis[0] = vcross(basis[1], basis[2]);
vnormalize(basis[0]);
basis[1] = vcross(basis[2], basis[0]);
vnormalize(basis[1]);
}
static inline float
ambient_occlusion(Isect &isect, Plane &plane, Sphere spheres[3],
RNGState &rngstate) {
float eps = 0.0001f;
vec p, n;
vec basis[3];
float occlusion = 0.0;
p = isect.p + eps * isect.n;
orthoBasis(basis, isect.n);
static const uniform int ntheta = NAO_SAMPLES;
static const uniform int nphi = NAO_SAMPLES;
for (uniform int j = 0; j < ntheta; j++) {
for (uniform int i = 0; i < nphi; i++) {
Ray ray;
Isect occIsect;
float theta = sqrt(frandom(&rngstate));
float phi = 2.0f * M_PI * frandom(&rngstate);
float x = cos(phi) * theta;
float y = sin(phi) * theta;
float z = sqrt(1.0 - theta * theta);
// local -> global
float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
ray.org = p;
ray.dir.x = rx;
ray.dir.y = ry;
ray.dir.z = rz;
occIsect.t = 1.0e+17;
occIsect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(occIsect, ray, spheres[snum]);
ray_plane_intersect (occIsect, ray, plane);
if (occIsect.hit) occlusion += 1.0;
}
}
occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
return occlusion;
}
/* Compute the image for the scanlines from [y0,y1), for an overall image
of width w and height h.
*/
static void ao_scanlines(uniform int y0, uniform int y1, uniform int w,
uniform int h, uniform int nsubsamples,
uniform float image[]) {
static Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
static Sphere spheres[3] = {
{ { -2.0f, 0.0f, -3.5f }, 0.5f },
{ { -0.5f, 0.0f, -3.0f }, 0.5f },
{ { 1.0f, 0.0f, -2.2f }, 0.5f } };
RNGState rngstate;
seed_rng(&rngstate, programIndex + (y0 << (programIndex & 15)));
// Compute the mapping between the 'programCount'-wide program
// instances running in parallel and samples in the image.
//
// For now, we'll always take four samples per pixel, so start by
// initializing du and dv with offsets into subpixel samples. We'll
// take care of further updating du and dv for the case where we're
// doing more than 4 program instances in parallel shortly.
uniform float uSteps[4] = { 0, 1, 0, 1 };
uniform float vSteps[4] = { 0, 0, 1, 1 };
float du = uSteps[programIndex % 4] / nsubsamples;
float dv = vSteps[programIndex % 4] / nsubsamples;
// Now handle the case where we are able to do more than one pixel's
// worth of work at once. nx records the number of pixels in the x
// direction we do per iteration and ny the number in y.
uniform int nx = 1, ny = 1;
// FIXME: We actually need ny to be 1 regardless of the decomposition,
// since the task decomposition is one scanline high.
if (programCount == 8) {
// Do two pixels at once in the x direction
nx = 2;
if (programIndex >= 4)
// And shift the offsets for the second pixel's worth of work
++du;
}
else if (programCount == 16) {
nx = 4;
ny = 1;
if (programIndex >= 4 && programIndex < 8)
++du;
if (programIndex >= 8 && programIndex < 12)
du += 2;
if (programIndex >= 12)
du += 3;
}
// Now loop over all of the pixels, stepping in x and y as calculated
// above. (Assumes that nx divides w and ny divides y1 - y0.)
for (uniform int y = y0; y < y1; y += ny) {
for (uniform int x = 0; x < w; x += nx) {
// Figure out x,y pixel in NDC
float px = (x + du - (w / 2.0f)) / (w / 2.0f);
float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
float ret = 0.f;
Ray ray;
Isect isect;
ray.org = 0.f;
// Poor man's perspective projection
ray.dir.x = px;
ray.dir.y = py;
ray.dir.z = -1.0;
vnormalize(ray.dir);
isect.t = 1.0e+17;
isect.hit = 0;
for (uniform int snum = 0; snum < 3; ++snum)
ray_sphere_intersect(isect, ray, spheres[snum]);
ray_plane_intersect(isect, ray, plane);
// Note use of 'coherent' if statement; the set of rays we
// trace will often all hit or all miss the scene
cif (isect.hit)
ret = ambient_occlusion(isect, plane, spheres, rngstate);
// This is a little grungy; we have results for
// programCount-worth of values. Because we're doing 2x2
// subsamples, we need to peel them off in groups of four,
// average the four values for each pixel, and update the
// output image.
//
// Store the varying value to a uniform array of the same size.
// See the discussion about communication among program
// instances in the ispc user's manual for more discussion on
// this idiom.
uniform float retArray[programCount];
retArray[programIndex] = ret;
// offset to the first pixel in the image
uniform int offset = 3 * (y * w + x);
for (uniform int p = 0; p < programCount; p += 4, offset += 3) {
// Get the four sample values for this pixel
uniform float sumret = retArray[p] + retArray[p+1] + retArray[p+2] +
retArray[p+3];
// Normalize by number of samples taken
sumret /= nsubsamples * nsubsamples;
// Store result in the image
image[offset+0] = sumret;
image[offset+1] = sumret;
image[offset+2] = sumret;
}
}
}
}
export void ao_ispc(uniform int w, uniform int h, uniform int nsubsamples,
uniform float image[]) {
ao_scanlines(0, h, w, h, nsubsamples, image);
}
static void task ao_task(uniform int width, uniform int height,
uniform int nsubsamples, uniform float image[]) {
ao_scanlines(taskIndex, taskIndex+1, width, height, nsubsamples, image);
}
export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples,
uniform float image[]) {
launch[h] ao_task(w, h, nsubsamples, image);
}
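
The comments in ao_scanlines describe how program instances map to 2x2 subsamples and how their results are folded back into pixels. The serial C++ sketch below restates that mapping for illustration; it is not part of the example. `LaneOffset`, `laneOffsets` and `averagePerPixel` are hypothetical names, and the lane count stands in for ispc's `programCount` (4, 8 or 16).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sub-pixel offset assigned to one program instance (lane).
struct LaneOffset {
    float du;  // horizontal offset, in pixels
    float dv;  // vertical offset, in pixels
};

// Mirrors the du/dv setup in ao_scanlines: lanes are grouped in fours,
// each group covers one pixel with a 2x2 subsample pattern, and each
// later group is shifted one more pixel to the right (nx = lanes / 4).
static std::vector<LaneOffset> laneOffsets(int programCount, int nsubsamples) {
    assert(programCount % 4 == 0 && nsubsamples > 0);
    static const float uSteps[4] = { 0, 1, 0, 1 };
    static const float vSteps[4] = { 0, 0, 1, 1 };
    std::vector<LaneOffset> out(programCount);
    for (int lane = 0; lane < programCount; ++lane) {
        out[lane].du = uSteps[lane % 4] / nsubsamples + lane / 4;
        out[lane].dv = vSteps[lane % 4] / nsubsamples;
    }
    return out;
}

// Mirrors the reduction at the end of the inner loop: every four
// consecutive lane results are summed and normalized into one pixel value.
static std::vector<float> averagePerPixel(const std::vector<float> &laneResults,
                                          int nsubsamples) {
    assert(laneResults.size() % 4 == 0 && nsubsamples > 0);
    std::vector<float> pixels;
    for (std::size_t p = 0; p < laneResults.size(); p += 4) {
        float sum = laneResults[p] + laneResults[p + 1] +
                    laneResults[p + 2] + laneResults[p + 3];
        pixels.push_back(sum / (nsubsamples * nsubsamples));
    }
    return pixels;
}
```

Like the ispc code, the grouping in fours only matches the normalization when nsubsamples is 2, which is the value ao.cpp passes as NSUBSAMPLES.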

@@ -1,174 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<ItemGroup>
<ClCompile Include="ao.cpp" />
<ClCompile Include="instrument.cpp" />
<ClCompile Include="../tasksys.cpp" />
</ItemGroup>
<ItemGroup>
<CustomBuild Include="ao.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --arch=x86 --instrument --target=sse2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --instrument --target=sse2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --arch=x86 --instrument --target=sse2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --instrument --target=sse2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
</CustomBuild>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}</ProjectGuid>
<Keyword>Win32Proj</Keyword>
<RootNamespace>aobench_instrumented</RootNamespace>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>

View File

@@ -1,94 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "instrument.h"
#include <stdio.h>
#include <assert.h>
#include <string>
#include <map>
struct CallInfo {
CallInfo() { count = laneCount = allOff = 0; }
int count;
int laneCount;
int allOff;
};
static std::map<std::string, CallInfo> callInfo;
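// Count the set bits in the 64-bit execution mask. The argument is unsigned
// so the mask is not truncated and the right shift always terminates.
int countbits(uint64_t i) {
int ret = 0;
while (i) {
if (i & 0x1)
++ret;
i >>= 1;
}
return ret;
}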
// Callback function that ispc compiler emits calls to when --instrument
// command-line flag is given while compiling.
void
ISPCInstrument(const char *fn, const char *note, int line, uint64_t mask) {
char sline[16];
sprintf(sline, "%04d", line);
std::string s = std::string(fn) + std::string("(") + std::string(sline) +
std::string(") - ") + std::string(note);
// Find or create a CallInfo instance for this callsite.
CallInfo &ci = callInfo[s];
// And update its statistics...
++ci.count;
if (mask == 0)
++ci.allOff;
ci.laneCount += countbits(mask);
}
void
ISPCPrintInstrument() {
// When program execution is done, go through the stats and print them
// out. (This function is called by ao.cpp).
std::map<std::string, CallInfo>::iterator citer = callInfo.begin();
while (citer != callInfo.end()) {
CallInfo &ci = citer->second;
// NOTE: the divisor assumes a 4-wide gang; adjust it for wider targets.
float activePct = 100.f * ci.laneCount / (4.f * ci.count);
float allOffPct = 100.f * ci.allOff / ci.count;
printf("%s: %d calls (%d / %.2f%% all off!), %.2f%% active lanes\n",
citer->first.c_str(), ci.count, ci.allOff, allOffPct,
activePct);
++citer;
}
}
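
The statistics above reduce to counting set bits in the 64-bit execution mask and dividing by the number of lanes. A minimal standalone sketch of that bookkeeping follows. The names countLanes and activePercent and the gangWidth parameter are hypothetical and not part of the original file, which hard-codes 4 lanes.

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits in a 64-bit lane mask (one bit per active program instance).
static int countLanes(uint64_t mask) {
    int n = 0;
    while (mask) {
        mask &= mask - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Hypothetical helper: percentage of active lanes over `calls` invocations
// for a gang of `gangWidth` program instances.
static float activePercent(int laneCount, int calls, int gangWidth) {
    assert(gangWidth > 0);
    if (calls == 0)
        return 0.f;  // no calls recorded yet, so avoid dividing by zero
    return 100.f * laneCount / (float(gangWidth) * calls);
}
```

For example, the masks 0xF, 0x3, 0x0 and 0x1 on a 4-wide target give 7 active lanes over 4 calls, so activePercent(7, 4, 4) evaluates to 43.75.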

View File

@@ -1,45 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef INSTRUMENT_H
#define INSTRUMENT_H 1
#include <stdint.h>
extern "C" {
void ISPCInstrument(const char *fn, const char *note, int line, uint64_t mask);
}
void ISPCPrintInstrument();
#endif // INSTRUMENT_H

View File

@@ -1,98 +0,0 @@
TASK_CXX=../tasksys.cpp
TASK_LIB=-lpthread
TASK_OBJ=objs/tasksys.o
CXX=icc -openmp
CXXFLAGS+=-Iobjs/ -O2
CC=icc -openmp
CCFLAGS+=-Iobjs/ -O2
LIBS=-lm $(TASK_LIB) -lstdc++
ISPC=ispc
ISPC_FLAGS+=-O2
ISPC_FLAGS+=--opt=fast-math --math-lib=default
ISPC_HEADER=objs/$(ISPC_SRC:.ispc=_ispc.h)
ARCH:=$(shell uname -m | sed -e s/x86_64/x86/ -e s/i686/x86/ -e s/arm.*/arm/ -e s/sa110/arm/)
ifeq ($(ARCH),x86)
# ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=)_ispc.o $(ISPC_SRC:.ispc=)_ispc_sse2.o \
$(ISPC_SRC:.ispc=)_ispc_sse4.o $(ISPC_SRC:.ispc=)_ispc_avx.o)
ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=)_ispc.o )
ISPC_TARGETS=$(ISPC_IA_TARGETS)
ARCH_BIT:=$(shell getconf LONG_BIT)
ifeq ($(ARCH_BIT),32)
ISPC_FLAGS += --arch=x86
CXXFLAGS += -m32
CCFLAGS += -m32
else
ISPC_FLAGS += --arch=x86-64
CXXFLAGS += -m64
CCFLAGS += -m64
endif
else ifeq ($(ARCH),arm)
ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=_ispc.o))
ISPC_TARGETS=$(ISPC_ARM_TARGETS)
else
$(error Unknown architecture $(ARCH) from uname -m)
endif
CPP_OBJS=$(addprefix objs/, $(CPP_SRC:.cpp=.o))
CC_OBJS=$(addprefix objs/, $(CC_SRC:.c=.o))
OBJS=$(CPP_OBJS) $(CC_OBJS) $(TASK_OBJ) $(ISPC_OBJS)
default: $(EXAMPLE)
all: $(EXAMPLE) $(EXAMPLE)-sse4 $(EXAMPLE)-generic16 $(EXAMPLE)-scalar
.PHONY: dirs clean
dirs:
/bin/mkdir -p objs/
objs/%.cpp objs/%.o objs/%.h: dirs
clean:
/bin/rm -rf objs *~ $(EXAMPLE) $(EXAMPLE)-sse4 $(EXAMPLE)-generic16 $(EXAMPLE)-scalar ref test
$(EXAMPLE): $(OBJS)
$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
objs/%.o: %.cpp dirs $(ISPC_HEADER)
$(CXX) $< $(CXXFLAGS) -c -o $@
objs/%.o: %.c dirs $(ISPC_HEADER)
$(CC) $< $(CCFLAGS) -c -o $@
objs/%.o: ../%.cpp dirs
$(CXX) $< $(CXXFLAGS) -c -o $@
objs/$(EXAMPLE).o: objs/$(EXAMPLE)_ispc.h
objs/%_ispc.h objs/%_ispc.o objs/%_ispc_sse2.o objs/%_ispc_sse4.o objs/%_ispc_avx.o: %.ispc
$(ISPC) $(ISPC_FLAGS) --target=$(ISPC_TARGETS) $< -o objs/$*_ispc.o -h objs/$*_ispc.h
objs/$(ISPC_SRC:.ispc=)_sse4.cpp: $(ISPC_SRC)
$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-4 --emit-c++ --c++-include-file=sse4.h
objs/$(ISPC_SRC:.ispc=)_sse4.o: objs/$(ISPC_SRC:.ispc=)_sse4.cpp
$(CXX) -I../intrinsics -msse4.2 $< $(CXXFLAGS) -c -o $@
$(EXAMPLE)-sse4: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_sse4.o
$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
objs/$(ISPC_SRC:.ispc=)_generic16.cpp: $(ISPC_SRC)
$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-16 --emit-c++ --c++-include-file=generic-16.h
objs/$(ISPC_SRC:.ispc=)_generic16.o: objs/$(ISPC_SRC:.ispc=)_generic16.cpp
$(CXX) -I../intrinsics $< $(CXXFLAGS) -c -o $@
$(EXAMPLE)-generic16: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_generic16.o
$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
objs/$(ISPC_SRC:.ispc=)_scalar.o: $(ISPC_SRC)
$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-1
$(EXAMPLE)-scalar: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_scalar.o
$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)

View File

@@ -1,280 +0,0 @@
#pragma once
/******************************/
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
/******************************/
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <cuda.h>
#include "drvapi_error_string.h"
#define checkCudaErrors(err) __checkCudaErrors (err, __FILE__, __LINE__)
// These are the inline versions for all of the SDK helper functions
static inline void __checkCudaErrors(CUresult err, const char *file, const int line) {
if(CUDA_SUCCESS != err) {
std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
<< getCudaDrvErrorString(err) << "\" from file <" << file
<< ">, line " << line << "\n";
exit(-1);
}
}
/******************************/
/**** Basic CUDriver API ****/
/******************************/
CUcontext context;
static void createContext(
const int deviceId = 0,
const size_t stackLimit = 4*1024,
const size_t heapLimit = 1024*1024*1024
)
{
CUdevice device;
int devCount;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGetCount(&devCount));
assert(devCount > 0);
// Fall back to device 0 if the requested id is out of range
const int devId = deviceId < devCount ? deviceId : 0;
checkCudaErrors(cuDeviceGet(&device, devId));
char name[128];
checkCudaErrors(cuDeviceGetName(name, 128, device));
std::cout << "Using CUDA Device [" << devId << "]: " << name << "\n";
int devMajor, devMinor;
checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
std::cout << "Device Compute Capability: "
<< devMajor << "." << devMinor << "\n";
if (devMajor < 2) {
std::cerr << "ERROR: Device " << devId << " is not SM 2.0 or greater\n";
exit(1);
}
// Create driver context
checkCudaErrors(cuCtxCreate(&context, 0, device));
#if 0
size_t limit;
checkCudaErrors(cuCtxGetLimit(&limit, CU_LIMIT_STACK_SIZE));
fprintf(stderr, " stack_limit= %llu KB\n", limit/1024);
checkCudaErrors(cuCtxGetLimit(&limit, CU_LIMIT_MALLOC_HEAP_SIZE));
fprintf(stderr, " heap_limit= %llu KB\n", limit/1024);
checkCudaErrors(cuCtxSetLimit(CU_LIMIT_STACK_SIZE,stackLimit));
checkCudaErrors(cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE,heapLimit));
#endif
}
static void destroyContext()
{
checkCudaErrors(cuCtxDestroy(context));
}
static CUmodule loadModule(
const char * module,
const int maxrregcount = 64,
const char cudadevrt_lib[] = "libcudadevrt.a",
const size_t log_size = 32768,
const bool print_log = true
)
{
const double t0 = rtc();
CUmodule cudaModule;
// in this branch we use compilation with parameters
CUlinkState CUState;
CUlinkState *lState = &CUState;
const int nOptions = 8;
CUjit_option options[nOptions];
void* optionVals[nOptions];
float walltime;
size_t logSize = log_size;
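char error_log[logSize],
info_log[logSize];
// Start with empty strings so the logs are printable even if the linker writes nothing
error_log[0] = info_log[0] = '\0';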
void *cuOut;
size_t outSize;
int myErr = 0;
// Setup linker options
// Return walltime from JIT compilation
options[0] = CU_JIT_WALL_TIME;
optionVals[0] = (void*) &walltime;
// Pass a buffer for info messages
options[1] = CU_JIT_INFO_LOG_BUFFER;
optionVals[1] = (void*) info_log;
// Pass the size of the info buffer
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
optionVals[2] = (void*) logSize;
// Pass a buffer for error message
options[3] = CU_JIT_ERROR_LOG_BUFFER;
optionVals[3] = (void*) error_log;
// Pass the size of the error buffer
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
optionVals[4] = (void*) logSize;
// Make the linker verbose
options[5] = CU_JIT_LOG_VERBOSE;
optionVals[5] = (void*) 1;
// Max # of registers per thread
options[6] = CU_JIT_MAX_REGISTERS;
int jitRegCount = maxrregcount;
optionVals[6] = (void *)(size_t)jitRegCount;
// Caching
options[7] = CU_JIT_CACHE_MODE;
optionVals[7] = (void *)CU_JIT_CACHE_OPTION_CA;
// Create a pending linker invocation
checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
#if 0
if (sizeof(void *)==4)
{
// Load the PTX from the string myPtx32
printf("Loading myPtx32[] program\n");
// PTX May also be loaded from file, as per below.
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
}
else
#endif
{
// Load the PTX from the string passed in as `module`
if (print_log)
fprintf(stderr, "Loading ptx..\n");
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
// Only continue if the previous step succeeded, so an earlier error is not overwritten.
if (myErr == CUDA_SUCCESS)
myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, cudadevrt_lib, 0,0,0);
// PTX May also be loaded from file, as per below.
// myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
}
// Complete the linker step
if (myErr == CUDA_SUCCESS)
myErr = cuLinkComplete(*lState, &cuOut, &outSize);
if ( myErr != CUDA_SUCCESS )
{
// Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above.
fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
assert(0);
}
// Linker walltime and info_log were requested in options above.
if (print_log)
fprintf(stderr, "CUDA Link Completed in %fms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
// Load resulting cuBin into module
checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
// Destroy the linker invocation
checkCudaErrors(cuLinkDestroy(*lState));
if (print_log)
fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
return cudaModule;
}
static void unloadModule(CUmodule &cudaModule)
{
checkCudaErrors(cuModuleUnload(cudaModule));
}
static CUfunction getFunction(CUmodule &cudaModule, const char * function)
{
CUfunction cudaFunction;
checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
return cudaFunction;
}
static CUdeviceptr deviceMalloc(const size_t size)
{
CUdeviceptr d_buf;
checkCudaErrors(cuMemAlloc(&d_buf, size));
return d_buf;
}
static void deviceFree(CUdeviceptr d_buf)
{
checkCudaErrors(cuMemFree(d_buf));
}
static void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
{
checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
}
static void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
{
checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
}
#define deviceLaunch(func,params) \
do { \
checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_SHARED)); \
checkCudaErrors( \
cuLaunchKernel( \
(func), \
1,1,1, \
32, 1, 1, \
0, NULL, (params), NULL \
)); \
} while (0)
typedef CUdeviceptr devicePtr;
/**************/
#include <vector>
static std::vector<char> readBinary(const char * filename, const bool print_size = false)
{
std::vector<char> buffer;
FILE *fp = fopen(filename, "rb");
if (!fp )
{
fprintf(stderr, "file %s not found\n", filename);
exit(1);
}
fseek(fp, 0, SEEK_END);
const long size = ftell(fp); /*calc the size needed*/
fseek(fp, 0, SEEK_SET);
if (size < 0){ /*ERROR detection if the file size could not be determined*/
fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);
exit(1);
}
buffer.resize(size);
if (size > 0 && fread(&buffer[0], sizeof(char), size, fp) != (size_t)size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
exit(1);
}
fclose(fp);
if (print_size)
fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
// NUL-terminate the data: loadModule() calls strlen() on it.
buffer.push_back('\0');
return buffer;
}
static double CUDALaunch(
void **handlePtr,
const char * func_name,
void **func_args,
const bool print_log = true,
const int maxrregcount = 64,
const char kernel_file[] = "__kernels.ptx",
const char cudadevrt_lib[] = "libcudadevrt.a",
const int log_size = 32768)
{
const std::vector<char> module_str = readBinary(kernel_file, print_log);
const char * module = &module_str[0];
CUmodule cudaModule = loadModule(module, maxrregcount, cudadevrt_lib, log_size, print_log);
CUfunction cudaFunction = getFunction(cudaModule, func_name);
checkCudaErrors(cuStreamSynchronize(0));
const double t0 = rtc();
deviceLaunch(cudaFunction, func_args);
checkCudaErrors(cuStreamSynchronize(0));
const double dt = rtc() - t0;
unloadModule(cudaModule);
return dt;
}
/******************************/

View File

@@ -1,8 +0,0 @@
EXAMPLE=deferred_shading
CPP_SRC=common.cpp main.cpp
ISPC_SRC=kernels1.ispc
ISPC_IA_TARGETS=avx
ISPC_FLAGS=--opt=fast-math
include ../common.mk

View File

@@ -1,55 +0,0 @@
PROG=main_cu
ISPC_SRC=kernels1.ispc
CXX_SRC=main_cu.cpp common.cpp
CXX=g++
CXXFLAGS=-O3 -I$(CUDATK)/include
LD=g++
LDFLAGS=-lcuda
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
LLVM = $(HOME)/usr/local/llvm/bin-3.3
PTXGEN = $(HOME)/ptxgen
PTXGEN += -opt=3
PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
LLVM32DIS=$(LLVM32)/bin/llvm-dis
.SUFFIXES: .bc .o .ptx .cu _ispc_nvptx64.bc
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(PROG)
$(PROG): $(CXX_OBJ) kernel.ptx
/bin/cp kernel.ptx __kernels.ptx
$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc_nvptx64.bc: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
%.ptx: %.bc
$(PTXGEN) $< > $@
# $(LLVM32DIS) $<
# $(PTXGEN) `basename $< .bc`.ll > $@
kernel.ptx: $(PTXSRC)
cat $^ > kernel.ptx
clean:
/bin/rm -rf *.ptx *.bc *.ll *.o $(PROG)

View File

@@ -1,37 +0,0 @@
PROG=main_mic
ISPC_SRC=kernels1.ispc
CXX_SRC=main.cpp ../tasksys.cpp common.cpp
CXX=icc
CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
LD=icc
LDFLAGS=-mmic -openmp
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
.SUFFIXES: .o .cpp
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(PROG)
$(PROG): $(ISPC_OBJ) $(CXX_OBJ)
$(LD) -o $@ $^ $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc.o: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $<
$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp -c
clean:
/bin/rm -rf *_ispc_zmm.cpp *.o $(PROG)

View File

@@ -1,211 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define ISPC_IS_WINDOWS
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#include <fcntl.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <stdint.h>
#include <algorithm>
#include <assert.h>
#include <vector>
#ifdef ISPC_IS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
#ifdef ISPC_IS_LINUX
#include <malloc.h>
#endif
#include "deferred.h"
#include "../timing.h"
///////////////////////////////////////////////////////////////////////////
static void *
lAlignedMalloc(size_t size, int32_t alignment) {
#ifdef ISPC_IS_WINDOWS
return _aligned_malloc(size, alignment);
#endif
#ifdef ISPC_IS_LINUX
return memalign(alignment, size);
#endif
#ifdef ISPC_IS_APPLE
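// Reserve `alignment` extra bytes: the adjustment below can be as large as
// `alignment` when amem is already aligned.
void *mem = malloc(size + alignment + sizeof(void*));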
char *amem = ((char*)mem) + sizeof(void*);
amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
(alignment - 1)));
((void**)amem)[-1] = mem;
return amem;
#endif
}
static void
lAlignedFree(void *ptr) {
#ifdef ISPC_IS_WINDOWS
_aligned_free(ptr);
#endif
#ifdef ISPC_IS_LINUX
free(ptr);
#endif
#ifdef ISPC_IS_APPLE
free(((void**)ptr)[-1]);
#endif
}
Framebuffer::Framebuffer(int width, int height) {
nPixels = width*height;
r = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
g = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
b = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
}
Framebuffer::~Framebuffer() {
lAlignedFree(r);
lAlignedFree(g);
lAlignedFree(b);
}
void
Framebuffer::clear() {
memset(r, 0, nPixels);
memset(g, 0, nPixels);
memset(b, 0, nPixels);
}
InputData *
CreateInputDataFromFile(const char *path) {
FILE *in = fopen(path, "rb");
if (!in) return 0;
InputData *input = new InputData;
// Load header
if (fread(&input->header, sizeof(ispc::InputHeader), 1, in) != 1) {
fprintf(stderr, "Premature EOF reading file \"%s\"\n", path);
delete input;
fclose(in);
return NULL;
}
fprintf(stderr, " numLights= %d\n", input->header.numLights);
// Load data chunk and update pointers
input->chunk = (uint8_t *)lAlignedMalloc(input->header.inputDataChunkSize,
ALIGNMENT_BYTES);
if (fread(input->chunk, input->header.inputDataChunkSize, 1, in) != 1) {
fprintf(stderr, "Premature EOF reading file \"%s\"\n", path);
lAlignedFree(input->chunk);
delete input;
fclose(in);
return NULL;
}
input->arrays.zBuffer =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaZBuffer]];
input->arrays.normalEncoded_x =
(uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaNormalEncoded_x]];
input->arrays.normalEncoded_y =
(uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaNormalEncoded_y]];
input->arrays.specularAmount =
(uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaSpecularAmount]];
input->arrays.specularPower =
(uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaSpecularPower]];
input->arrays.albedo_x =
(uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_x]];
input->arrays.albedo_y =
(uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_y]];
input->arrays.albedo_z =
(uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_z]];
input->arrays.lightPositionView_x =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_x]];
input->arrays.lightPositionView_y =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_y]];
input->arrays.lightPositionView_z =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_z]];
input->arrays.lightAttenuationBegin =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightAttenuationBegin]];
input->arrays.lightColor_x =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_x]];
input->arrays.lightColor_y =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_y]];
input->arrays.lightColor_z =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_z]];
input->arrays.lightAttenuationEnd =
(float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightAttenuationEnd]];
fclose(in);
return input;
}
void DeleteInputData(InputData *input) {
lAlignedFree(input->chunk);
}
void WriteFrame(const char *filename, const InputData *input,
const Framebuffer &framebuffer) {
// Deswizzle and copy to RGB output
// Doesn't need to be fast... only happens once
size_t imageBytes = 3 * input->header.framebufferWidth *
input->header.framebufferHeight;
uint8_t* framebufferAOS = (uint8_t *)lAlignedMalloc(imageBytes, ALIGNMENT_BYTES);
memset(framebufferAOS, 0, imageBytes);
for (int i = 0; i < input->header.framebufferWidth *
input->header.framebufferHeight; ++i) {
framebufferAOS[3 * i + 0] = framebuffer.r[i];
framebufferAOS[3 * i + 1] = framebuffer.g[i];
framebufferAOS[3 * i + 2] = framebuffer.b[i];
}
// Write out simple PPM file
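FILE *out = fopen(filename, "wb");
if (!out) {
fprintf(stderr, "Unable to open \"%s\" for writing\n", filename);
lAlignedFree(framebufferAOS);
return;
}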
fprintf(out, "P6 %d %d 255\n", input->header.framebufferWidth,
input->header.framebufferHeight);
fwrite(framebufferAOS, imageBytes, 1, out);
fclose(out);
lAlignedFree(framebufferAOS);
}
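
The malloc-based branch of lAlignedMalloc() above over-allocates, rounds the address up, and stashes the original pointer just below the aligned block so it can be freed later. A minimal portable sketch of that technique follows. The names alignedMallocPortable and alignedFreePortable are hypothetical, and the sketch assumes the alignment is a power of two no smaller than sizeof(void *).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocate `size` bytes aligned to `alignment` using plain malloc.
// Assumes `alignment` is a power of two and at least sizeof(void *).
static void *alignedMallocPortable(size_t size, size_t alignment) {
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    // Extra room for the round-up plus one pointer slot for the raw address.
    void *raw = malloc(size + alignment + sizeof(void *));
    if (!raw)
        return nullptr;
    uintptr_t start = reinterpret_cast<uintptr_t>(raw) + sizeof(void *);
    // Round up to the next multiple of `alignment` (no change if already aligned).
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t(alignment) - 1);
    // Remember the pointer malloc returned, directly below the aligned block.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

// Free a block obtained from alignedMallocPortable().
static void alignedFreePortable(void *ptr) {
    if (ptr)
        free(reinterpret_cast<void **>(ptr)[-1]);
}
```

Rounding up with a mask leaves an already-aligned address untouched, so the adjustment is at most alignment - 1 bytes and always fits in the extra space that was reserved.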

View File

@@ -1,108 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef DEFERRED_H
#define DEFERRED_H
// Currently tile widths must be a multiple of SIMD width (i.e. 8 for ispc sse4x2)!
#define MIN_TILE_WIDTH 64
#define MIN_TILE_HEIGHT 16
#define MAX_LIGHTS 1024
enum InputDataArraysEnum {
idaZBuffer = 0,
idaNormalEncoded_x,
idaNormalEncoded_y,
idaSpecularAmount,
idaSpecularPower,
idaAlbedo_x,
idaAlbedo_y,
idaAlbedo_z,
idaLightPositionView_x,
idaLightPositionView_y,
idaLightPositionView_z,
idaLightAttenuationBegin,
idaLightColor_x,
idaLightColor_y,
idaLightColor_z,
idaLightAttenuationEnd,
idaNum
};
#ifndef ISPC
#include <stdint.h>
#include "kernels1_ispc.h"
#define ALIGNMENT_BYTES 64
#define MAX_LIGHTS 1024
#define VISUALIZE_LIGHT_COUNT 0
struct InputData
{
ispc::InputHeader header;
ispc::InputDataArrays arrays;
uint8_t *chunk;
};
struct Framebuffer {
Framebuffer(int width, int height);
~Framebuffer();
void clear();
uint8_t *r, *g, *b;
private:
int nPixels;
Framebuffer(const Framebuffer &);
Framebuffer &operator=(const Framebuffer &);
};
InputData *CreateInputDataFromFile(const char *path);
void DeleteInputData(InputData *input);
void WriteFrame(const char *filename, const InputData *input,
const Framebuffer &framebuffer);
void InitDynamicC(InputData *input);
void InitDynamicCilk(InputData *input);
void DispatchDynamicC(InputData *input, Framebuffer *framebuffer);
void DispatchDynamicCilk(InputData *input, Framebuffer *framebuffer);
#endif // !ISPC
#endif // DEFERRED_H

View File

@@ -1,178 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{87f53c53-957e-4e91-878a-bc27828fb9eb}</ProjectGuid>
<Keyword>Win32Proj</Keyword>
<RootNamespace>mandelbrot</RootNamespace>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemGroup>
<ClCompile Include="common.cpp" />
<ClCompile Include="dynamic_c.cpp" />
<ClCompile Include="dynamic_cilk.cpp" />
<ClCompile Include="main.cpp" />
<ClCompile Include="../tasksys.cpp" />
</ItemGroup>
<ItemGroup>
<CustomBuild Include="kernels.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
</CustomBuild>
</ItemGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>


@@ -1,370 +0,0 @@
/*
* Copyright 1993-2012 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
#ifndef _DRVAPI_ERROR_STRING_H_
#define _DRVAPI_ERROR_STRING_H_
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <cuda.h> // CUresult, used by getCudaDrvErrorString() below
// Error Code string definitions here
typedef struct
{
char const *error_string;
int error_id;
} s_CudaErrorStr;
/**
* Error codes
*/
static s_CudaErrorStr sCudaDrvErrorString[] =
{
/**
* The API call returned with no errors. In the case of query calls, this
* can also mean that the operation being queried is complete (see
* ::cuEventQuery() and ::cuStreamQuery()).
*/
{ "CUDA_SUCCESS", 0 },
/**
* This indicates that one or more of the parameters passed to the API call
* is not within an acceptable range of values.
*/
{ "CUDA_ERROR_INVALID_VALUE", 1 },
/**
* The API call failed because it was unable to allocate enough memory to
* perform the requested operation.
*/
{ "CUDA_ERROR_OUT_OF_MEMORY", 2 },
/**
* This indicates that the CUDA driver has not been initialized with
* ::cuInit() or that initialization has failed.
*/
{ "CUDA_ERROR_NOT_INITIALIZED", 3 },
/**
* This indicates that the CUDA driver is in the process of shutting down.
*/
{ "CUDA_ERROR_DEINITIALIZED", 4 },
/**
* This indicates profiling APIs are called while application is running
* in visual profiler mode.
*/
{ "CUDA_ERROR_PROFILER_DISABLED", 5 },
/**
* This indicates profiling has not been initialized for this context.
* Call cuProfilerInitialize() to resolve this.
*/
{ "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
/**
* This indicates profiler has already been started and probably
* cuProfilerStart() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
/**
* This indicates profiler has already been stopped and probably
* cuProfilerStop() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },
/**
* This indicates that no CUDA-capable devices were detected by the installed
* CUDA driver.
*/
{ "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
/**
* This indicates that the device ordinal supplied by the user does not
* correspond to a valid CUDA device.
*/
{ "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
/**
* This indicates that the device kernel image is invalid. This can also
* indicate an invalid CUDA module.
*/
{ "CUDA_ERROR_INVALID_IMAGE", 200 },
/**
* This most frequently indicates that there is no context bound to the
* current thread. This can also be returned if the context passed to an
* API call is not a valid handle (such as a context that has had
* ::cuCtxDestroy() invoked on it). This can also be returned if a user
* mixes different API versions (i.e. 3010 context with 3020 API calls).
* See ::cuCtxGetApiVersion() for more details.
*/
{ "CUDA_ERROR_INVALID_CONTEXT", 201 },
/**
* This indicates that the context being supplied as a parameter to the
* API call was already the active context.
* \deprecated
* This error return is deprecated as of CUDA 3.2. It is no longer an
* error to attempt to push the active context via ::cuCtxPushCurrent().
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
/**
* This indicates that a map or register operation has failed.
*/
{ "CUDA_ERROR_MAP_FAILED", 205 },
/**
* This indicates that an unmap or unregister operation has failed.
*/
{ "CUDA_ERROR_UNMAP_FAILED", 206 },
/**
* This indicates that the specified array is currently mapped and thus
* cannot be destroyed.
*/
{ "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
/**
* This indicates that the resource is already mapped.
*/
{ "CUDA_ERROR_ALREADY_MAPPED", 208 },
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*/
{ "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
/**
* This indicates that a resource has already been acquired.
*/
{ "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
/**
* This indicates that a resource is not mapped.
*/
{ "CUDA_ERROR_NOT_MAPPED", 211 },
/**
* This indicates that a mapped resource is not available for access as an
* array.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
/**
* This indicates that a mapped resource is not available for access as a
* pointer.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
/**
* This indicates that an uncorrectable ECC error was detected during
* execution.
*/
{ "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
/**
* This indicates that the ::CUlimit passed to the API call is not
* supported by the active device.
*/
{ "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
/**
* This indicates that the ::CUcontext passed to the API call can
* only be bound to a single CPU thread at a time but is already
* bound to a CPU thread.
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
/**
* This indicates that peer access is not supported across the given
* devices.
*/
{ "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217},
/**
* This indicates that the device kernel source is invalid.
*/
{ "CUDA_ERROR_INVALID_SOURCE", 300 },
/**
* This indicates that the file specified was not found.
*/
{ "CUDA_ERROR_FILE_NOT_FOUND", 301 },
/**
* This indicates that a link to a shared object failed to resolve.
*/
{ "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
/**
* This indicates that initialization of a shared object failed.
*/
{ "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
/**
* This indicates that an OS call failed.
*/
{ "CUDA_ERROR_OPERATING_SYSTEM", 304 },
/**
* This indicates that a resource handle passed to the API call was not
* valid. Resource handles are opaque types like ::CUstream and ::CUevent.
*/
{ "CUDA_ERROR_INVALID_HANDLE", 400 },
/**
* This indicates that a named symbol was not found. Examples of symbols
* are global/constant variable names, texture names, and surface names.
*/
{ "CUDA_ERROR_NOT_FOUND", 500 },
/**
* This indicates that asynchronous operations issued previously have not
* completed yet. This result is not actually an error, but must be indicated
* differently than ::CUDA_SUCCESS (which indicates completion). Calls that
* may return this value include ::cuEventQuery() and ::cuStreamQuery().
*/
{ "CUDA_ERROR_NOT_READY", 600 },
/**
* An exception occurred on the device while executing a kernel. Common
* causes include dereferencing an invalid device pointer and accessing
* out of bounds shared memory. The context cannot be used, so it must
* be destroyed (and a new one should be created). All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_FAILED", 700 },
/**
* This indicates that a launch did not occur because it did not have
* appropriate resources. This error usually indicates that the user has
* attempted to pass too many arguments to the device kernel, or the
* kernel launch specifies too many threads for the kernel's register
* count. Passing arguments of the wrong size (i.e. a 64-bit pointer
* when a 32-bit int is expected) is equivalent to passing too many
* arguments and can also result in this error.
*/
{ "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
/**
* This indicates that the device kernel took too long to execute. This can
* only occur if timeouts are enabled - see the device attribute
* ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
* context cannot be used (and must be destroyed similar to
* ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
* this context are invalid and must be reconstructed if the program is to
* continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
/**
* This error indicates a kernel launch that uses an incompatible texturing
* mode.
*/
{ "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
/**
* This error indicates that a call to ::cuCtxEnablePeerAccess() is
* trying to re-enable peer access to a context which has already
* had peer access to it enabled.
*/
{ "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
/**
* This error indicates that ::cuCtxDisablePeerAccess() is
* trying to disable peer access which has not been enabled yet
* via ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
/**
* This error indicates that the primary context for the specified device
* has already been initialized.
*/
{ "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
/**
* This error indicates that the context current to the calling thread
* has been destroyed using ::cuCtxDestroy, or is a primary context which
* has not yet been initialized.
*/
{ "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
/**
* A device-side assert triggered during kernel execution. The context
* cannot be used anymore, and must be destroyed. All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_ASSERT", 710 },
/**
* This error indicates that the hardware resources required to enable
* peer access have been exhausted for one or more of the devices
* passed to ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_TOO_MANY_PEERS", 711 },
/**
* This error indicates that the memory range passed to ::cuMemHostRegister()
* has already been registered.
*/
{ "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
/**
* This error indicates that the pointer passed to ::cuMemHostUnregister()
* does not correspond to any currently registered memory region.
*/
{ "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
/**
* This error indicates that the attempted operation is not permitted.
*/
{ "CUDA_ERROR_NOT_PERMITTED", 800 },
/**
* This error indicates that the attempted operation is not supported
* on the current system or device.
*/
{ "CUDA_ERROR_NOT_SUPPORTED", 801 },
/**
* This indicates that an unknown internal error has occurred.
*/
{ "CUDA_ERROR_UNKNOWN", 999 },
{ NULL, -1 }
};
// This is just a linear search through the array, since the error_id values
// are not always consecutive
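// static: the function is defined in this header, so give it internal linkage
// to avoid duplicate symbols when several translation units include it
static const char * getCudaDrvErrorString(CUresult error_id)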
{
int index = 0;
while (sCudaDrvErrorString[index].error_id != error_id &&
sCudaDrvErrorString[index].error_id != -1)
{
index++;
}
if (sCudaDrvErrorString[index].error_id == error_id)
return (const char *)sCudaDrvErrorString[index].error_string;
else
return (const char *)"CUDA_ERROR not found!";
}
#endif


@@ -1,870 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "deferred.h"
#include "kernels_ispc.h"
#include <algorithm>
#include <stdint.h>
#include <assert.h>
#include <math.h>
#include <stdlib.h> // malloc/free, used by the Apple aligned-allocation path
#ifdef _MSC_VER
#define ISPC_IS_WINDOWS
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#ifdef ISPC_IS_LINUX
#include <malloc.h>
#endif // ISPC_IS_LINUX
// Currently tile widths must be a multiple of the SIMD width (e.g. 8 for the ispc sse4-x2 target)!
//#define MIN_TILE_WIDTH 16
//#define MIN_TILE_HEIGHT 16
#define DYNAMIC_TREE_LEVELS 5
// If this is set to 1 then the result will be identical to the static version
#define DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE 1
static void *
lAlignedMalloc(size_t size, int32_t alignment) {
#ifdef ISPC_IS_WINDOWS
return _aligned_malloc(size, alignment);
#endif
#ifdef ISPC_IS_LINUX
return memalign(alignment, size);
#endif
#ifdef ISPC_IS_APPLE
void *mem = malloc(size + (alignment-1) + sizeof(void*));
char *amem = ((char*)mem) + sizeof(void*);
amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
(alignment - 1)));
((void**)amem)[-1] = mem;
return amem;
#endif
}
static void
lAlignedFree(void *ptr) {
#ifdef ISPC_IS_WINDOWS
_aligned_free(ptr);
#endif
#ifdef ISPC_IS_LINUX
free(ptr);
#endif
#ifdef ISPC_IS_APPLE
free(((void**)ptr)[-1]);
#endif
}
static void
ComputeZBounds(int tileStartX, int tileEndX,
int tileStartY, int tileEndY,
// G-buffer data
float zBuffer[],
int gBufferWidth,
// Camera data
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Output
float *minZ, float *maxZ)
{
// Find Z bounds
float laneMinZ = cameraFar;
float laneMaxZ = cameraNear;
for (int y = tileStartY; y < tileEndY; ++y) {
for (int x = tileStartX; x < tileEndX; ++x) {
// Unproject depth buffer Z value into view space
float z = zBuffer[(y * gBufferWidth + x)];
float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
// Work out Z bounds for our samples
// Avoid considering skybox/background or otherwise invalid pixels
if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
laneMinZ = std::min(laneMinZ, viewSpaceZ);
laneMaxZ = std::max(laneMaxZ, viewSpaceZ);
}
}
}
*minZ = laneMinZ;
*maxZ = laneMaxZ;
}
static void
ComputeZBoundsRow(int tileY, int tileWidth, int tileHeight,
int numTilesX, int numTilesY,
// G-buffer data
float zBuffer[],
int gBufferWidth,
// Camera data
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Output
float minZArray[],
float maxZArray[])
{
for (int tileX = 0; tileX < numTilesX; ++tileX) {
float minZ, maxZ;
ComputeZBounds(tileX * tileWidth, tileX * tileWidth + tileWidth,
tileY * tileHeight, tileY * tileHeight + tileHeight,
zBuffer, gBufferWidth, cameraProj_33, cameraProj_43,
cameraNear, cameraFar, &minZ, &maxZ);
minZArray[tileX] = minZ;
maxZArray[tileX] = maxZ;
}
}
class MinMaxZTree
{
public:
// Currently (min) tile dimensions must divide gBuffer dimensions evenly
// Levels must be small enough that neither dimension goes below one tile
MinMaxZTree(
int tileWidth, int tileHeight, int levels,
int gBufferWidth, int gBufferHeight)
: mTileWidth(tileWidth), mTileHeight(tileHeight), mLevels(levels)
{
mNumTilesX = gBufferWidth / mTileWidth;
mNumTilesY = gBufferHeight / mTileHeight;
// Allocate arrays
mMinZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
mMaxZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
for (int i = 0; i < mLevels; ++i) {
int x = NumTilesX(i);
int y = NumTilesY(i);
assert(x > 0);
assert(y > 0);
// NOTE: If the following two asserts fire it probably means that
// the base tile dimensions do not evenly divide the G-buffer dimensions
assert(x * (mTileWidth << i) >= gBufferWidth);
assert(y * (mTileHeight << i) >= gBufferHeight);
mMinZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
mMaxZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
}
}
void Update(float *zBuffer, int gBufferPitchInElements,
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar)
{
for (int tileY = 0; tileY < mNumTilesY; ++tileY) {
ComputeZBoundsRow(tileY, mTileWidth, mTileHeight, mNumTilesX, mNumTilesY,
zBuffer, gBufferPitchInElements,
cameraProj_33, cameraProj_43, cameraNear, cameraFar,
mMinZArrays[0] + (tileY * mNumTilesX),
mMaxZArrays[0] + (tileY * mNumTilesX));
}
// Generate other levels
for (int level = 1; level < mLevels; ++level) {
int destTilesX = NumTilesX(level);
int destTilesY = NumTilesY(level);
int srcLevel = level - 1;
int srcTilesX = NumTilesX(srcLevel);
int srcTilesY = NumTilesY(srcLevel);
for (int y = 0; y < destTilesY; ++y) {
for (int x = 0; x < destTilesX; ++x) {
int srcX = x << 1;
int srcY = y << 1;
// NOTE: Ugly branches to deal with non-multiple dimensions at some levels
// TODO: SSE branchless min/max is probably better...
float minZ = mMinZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
float maxZ = mMaxZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
if (srcX + 1 < srcTilesX) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY) * srcTilesX +
(srcX + 1)]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY) * srcTilesX +
(srcX + 1)]);
if (srcY + 1 < srcTilesY) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX + 1)]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX + 1)]);
}
}
if (srcY + 1 < srcTilesY) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX )]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX )]);
}
mMinZArrays[level][y * destTilesX + x] = minZ;
mMaxZArrays[level][y * destTilesX + x] = maxZ;
}
}
}
}
~MinMaxZTree() {
for (int i = 0; i < mLevels; ++i) {
lAlignedFree(mMinZArrays[i]);
lAlignedFree(mMaxZArrays[i]);
}
lAlignedFree(mMinZArrays);
lAlignedFree(mMaxZArrays);
}
int Levels() const { return mLevels; }
// These round UP, so beware that the last tile for a given level may not be completely full
// TODO: Verify this...
int NumTilesX(int level = 0) const { return (mNumTilesX + (1 << level) - 1) >> level; }
int NumTilesY(int level = 0) const { return (mNumTilesY + (1 << level) - 1) >> level; }
int TileWidth(int level = 0) const { return (mTileWidth << level); }
int TileHeight(int level = 0) const { return (mTileHeight << level); }
float MinZ(int level, int tileX, int tileY) const {
return mMinZArrays[level][tileY * NumTilesX(level) + tileX];
}
float MaxZ(int level, int tileX, int tileY) const {
return mMaxZArrays[level][tileY * NumTilesX(level) + tileX];
}
private:
int mTileWidth;
int mTileHeight;
int mLevels;
int mNumTilesX;
int mNumTilesY;
// One array for each "level" in the tree
float **mMinZArrays;
float **mMaxZArrays;
};
static MinMaxZTree *gMinMaxZTree = 0;
void InitDynamicC(InputData *input) {
gMinMaxZTree =
new MinMaxZTree(MIN_TILE_WIDTH, MIN_TILE_HEIGHT, DYNAMIC_TREE_LEVELS,
input->header.framebufferWidth,
input->header.framebufferHeight);
}
/* We're going to split a tile into 4 sub-tiles. This function
reclassifies the tile's lights with respect to the sub-tiles. */
static void
SplitTileMinMax(
int tileMidX, int tileMidY,
// Subtile data (00, 10, 01, 11)
float subtileMinZ[],
float subtileMaxZ[],
// G-buffer data
int gBufferWidth, int gBufferHeight,
// Camera data
float cameraProj_11, float cameraProj_22,
// Light Data
int lightIndices[],
int numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Outputs
int subtileIndices[],
int subtileIndicesPitch,
int subtileNumLights[]
)
{
float gBufferScale_x = 0.5f * (float)gBufferWidth;
float gBufferScale_y = 0.5f * (float)gBufferHeight;
float frustumPlanes_xy[2] = { -(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y) };
float frustumPlanes_z[2] = { tileMidX - gBufferScale_x,
tileMidY - gBufferScale_y };
for (int i = 0; i < 2; ++i) {
// Normalize
float norm = 1.f / sqrtf(frustumPlanes_xy[i] * frustumPlanes_xy[i] +
frustumPlanes_z[i] * frustumPlanes_z[i]);
frustumPlanes_xy[i] *= norm;
frustumPlanes_z[i] *= norm;
}
// Initialize
int subtileLightOffset[4];
subtileLightOffset[0] = 0 * subtileIndicesPitch;
subtileLightOffset[1] = 1 * subtileIndicesPitch;
subtileLightOffset[2] = 2 * subtileIndicesPitch;
subtileLightOffset[3] = 3 * subtileIndicesPitch;
for (int i = 0; i < numLights; ++i) {
int lightIndex = lightIndices[i];
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
// Test lights again against subtile z bounds
bool inFrustum[4];
inFrustum[0] = (light_positionView_z - subtileMinZ[0] >= light_attenuationEndNeg) &&
(subtileMaxZ[0] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[1] = (light_positionView_z - subtileMinZ[1] >= light_attenuationEndNeg) &&
(subtileMaxZ[1] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[2] = (light_positionView_z - subtileMinZ[2] >= light_attenuationEndNeg) &&
(subtileMaxZ[2] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[3] = (light_positionView_z - subtileMinZ[3] >= light_attenuationEndNeg) &&
(subtileMaxZ[3] - light_positionView_z >= light_attenuationEndNeg);
float dx = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
float dy = light_positionView_z * frustumPlanes_z[1] +
light_positionView_y * frustumPlanes_xy[1];
if (fabsf(dx) > light_attenuationEnd) {
bool positiveX = dx > 0.0f;
inFrustum[0] = inFrustum[0] && positiveX; // 00 subtile
inFrustum[1] = inFrustum[1] && !positiveX; // 10 subtile
inFrustum[2] = inFrustum[2] && positiveX; // 01 subtile
inFrustum[3] = inFrustum[3] && !positiveX; // 11 subtile
}
if (fabsf(dy) > light_attenuationEnd) {
bool positiveY = dy > 0.0f;
inFrustum[0] = inFrustum[0] && positiveY; // 00 subtile
inFrustum[1] = inFrustum[1] && positiveY; // 10 subtile
inFrustum[2] = inFrustum[2] && !positiveY; // 01 subtile
inFrustum[3] = inFrustum[3] && !positiveY; // 11 subtile
}
if (inFrustum[0])
subtileIndices[subtileLightOffset[0]++] = lightIndex;
if (inFrustum[1])
subtileIndices[subtileLightOffset[1]++] = lightIndex;
if (inFrustum[2])
subtileIndices[subtileLightOffset[2]++] = lightIndex;
if (inFrustum[3])
subtileIndices[subtileLightOffset[3]++] = lightIndex;
}
subtileNumLights[0] = subtileLightOffset[0] - 0 * subtileIndicesPitch;
subtileNumLights[1] = subtileLightOffset[1] - 1 * subtileIndicesPitch;
subtileNumLights[2] = subtileLightOffset[2] - 2 * subtileIndicesPitch;
subtileNumLights[3] = subtileLightOffset[3] - 3 * subtileIndicesPitch;
}
static inline float
dot3(float x, float y, float z, float a, float b, float c) {
return (x*a + y*b + z*c);
}
static inline void
normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
float n = 1.f / sqrtf(x*x + y*y + z*z);
ox = x * n;
oy = y * n;
oz = z * n;
}
static inline float
Unorm8ToFloat32(uint8_t u) {
return (float)u * (1.0f / 255.0f);
}
static inline uint8_t
Float32ToUnorm8(float f) {
return (uint8_t)(f * 255.0f);
}
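// Fast half -> float conversion. Only normalized, finite halves convert
// correctly: zero, denormals, Inf and NaN are not special-cased.
static inline float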
half_to_float_fast(uint16_t h) {
uint32_t hs = h & (int32_t)0x8000u; // Pick off sign bit
uint32_t he = h & (int32_t)0x7C00u; // Pick off exponent bits
uint32_t hm = h & (int32_t)0x03FFu; // Pick off mantissa bits
// sign
uint32_t xs = ((uint32_t) hs) << 16;
// Exponent: remove the half-precision bias (15), then add the single-precision bias (127)
int32_t xes = ((int32_t) (he >> 10)) - 15 + 127;
// Exponent
uint32_t xe = (uint32_t) (xes << 23);
// Mantissa
uint32_t xm = ((uint32_t) hm) << 13;
uint32_t bits = (xs | xe | xm);
float *fp = reinterpret_cast<float *>(&bits);
return *fp;
}
static void
ShadeTileC(
int32_t tileStartX, int32_t tileEndX,
int32_t tileStartY, int32_t tileEndY,
int32_t gBufferWidth, int32_t gBufferHeight,
const ispc::InputDataArrays &inputData,
// Camera data
float cameraProj_11, float cameraProj_22,
float cameraProj_33, float cameraProj_43,
// Light list
int32_t tileLightIndices[],
int32_t tileNumLights,
// UI
bool visualizeLightCount,
// Output
uint8_t framebuffer_r[],
uint8_t framebuffer_g[],
uint8_t framebuffer_b[]
)
{
if (tileNumLights == 0 || visualizeLightCount) {
uint8_t c = (uint8_t)(std::min(tileNumLights << 2, 255));
for (int32_t y = tileStartY; y < tileEndY; ++y) {
for (int32_t x = tileStartX; x < tileEndX; ++x) {
int32_t framebufferIndex = (y * gBufferWidth + x);
framebuffer_r[framebufferIndex] = c;
framebuffer_g[framebufferIndex] = c;
framebuffer_b[framebufferIndex] = c;
}
}
} else {
float twoOverGBufferWidth = 2.0f / gBufferWidth;
float twoOverGBufferHeight = 2.0f / gBufferHeight;
for (int32_t y = tileStartY; y < tileEndY; ++y) {
float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
for (int32_t x = tileStartX; x < tileEndX; ++x) {
int32_t gBufferOffset = y * gBufferWidth + x;
// Reconstruct position and (negative) view vector from G-buffer
float surface_positionView_x, surface_positionView_y, surface_positionView_z;
float Vneg_x, Vneg_y, Vneg_z;
float z = inputData.zBuffer[gBufferOffset];
// Compute screen/clip-space position
// NOTE: Mind DX11 viewport transform and pixel center!
float positionScreen_x = (0.5f + (float)(x)) *
twoOverGBufferWidth - 1.0f;
// Unproject depth buffer Z value into view space
surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
surface_positionView_x = positionScreen_x * surface_positionView_z /
cameraProj_11;
surface_positionView_y = positionScreen_y * surface_positionView_z /
cameraProj_22;
// We actually end up with a vector pointing *at* the
// surface (i.e. the negative view vector)
normalize3(surface_positionView_x, surface_positionView_y,
surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
// Reconstruct normal from G-buffer
float surface_normal_x, surface_normal_y, surface_normal_z;
float normal_x = half_to_float_fast(inputData.normalEncoded_x[gBufferOffset]);
float normal_y = half_to_float_fast(inputData.normalEncoded_y[gBufferOffset]);
float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
float m = sqrtf(4.0f * f - 1.0f);
surface_normal_x = m * (4.0f * normal_x - 2.0f);
surface_normal_y = m * (4.0f * normal_y - 2.0f);
surface_normal_z = 3.0f - 8.0f * f;
// Load other G-buffer parameters
float surface_specularAmount =
half_to_float_fast(inputData.specularAmount[gBufferOffset]);
float surface_specularPower =
half_to_float_fast(inputData.specularPower[gBufferOffset]);
float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
float lit_x = 0.0f;
float lit_y = 0.0f;
float lit_z = 0.0f;
for (int32_t tileLightIndex = 0; tileLightIndex < tileNumLights;
++tileLightIndex) {
int32_t lightIndex = tileLightIndices[tileLightIndex];
// Gather light data relevant to initial culling
float light_positionView_x =
inputData.lightPositionView_x[lightIndex];
float light_positionView_y =
inputData.lightPositionView_y[lightIndex];
float light_positionView_z =
inputData.lightPositionView_z[lightIndex];
float light_attenuationEnd =
inputData.lightAttenuationEnd[lightIndex];
// Compute light vector
float L_x = light_positionView_x - surface_positionView_x;
float L_y = light_positionView_y - surface_positionView_y;
float L_z = light_positionView_z - surface_positionView_z;
float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
// Clip at end of attenuation
float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
if (distanceToLight2 < light_attenuationEnd2) {
float distanceToLight = sqrtf(distanceToLight2);
float distanceToLightRcp = 1.f / distanceToLight;
L_x *= distanceToLightRcp;
L_y *= distanceToLightRcp;
L_z *= distanceToLightRcp;
// Start computing brdf
float NdotL = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, L_x, L_y, L_z);
// Clip back facing
if (NdotL > 0.0f) {
float light_attenuationBegin =
inputData.lightAttenuationBegin[lightIndex];
// Light distance attenuation (linstep)
float lightRange = (light_attenuationEnd - light_attenuationBegin);
float falloffPosition = (light_attenuationEnd - distanceToLight);
float attenuation = std::min(falloffPosition / lightRange, 1.0f);
float H_x = (L_x - Vneg_x);
float H_y = (L_y - Vneg_y);
float H_z = (L_z - Vneg_z);
normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
float NdotH = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, H_x, H_y, H_z);
NdotH = std::max(NdotH, 0.0f);
float specular = powf(NdotH, surface_specularPower);
float specularNorm = (surface_specularPower + 2.0f) *
(1.0f / 8.0f);
float specularContrib = surface_specularAmount *
specularNorm * specular;
float k = attenuation * NdotL * (1.0f + specularContrib);
float light_color_x = inputData.lightColor_x[lightIndex];
float light_color_y = inputData.lightColor_y[lightIndex];
float light_color_z = inputData.lightColor_z[lightIndex];
float lightContrib_x = surface_albedo_x * light_color_x;
float lightContrib_y = surface_albedo_y * light_color_y;
float lightContrib_z = surface_albedo_z * light_color_z;
lit_x += lightContrib_x * k;
lit_y += lightContrib_y * k;
lit_z += lightContrib_z * k;
}
}
}
// Gamma correct
float gamma = 1.0f / 2.2f;
lit_x = powf(std::min(std::max(lit_x, 0.0f), 1.0f), gamma);
lit_y = powf(std::min(std::max(lit_y, 0.0f), 1.0f), gamma);
lit_z = powf(std::min(std::max(lit_z, 0.0f), 1.0f), gamma);
framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
}
}
}
}
void
ShadeDynamicTileRecurse(InputData *input, int level, int tileX, int tileY,
int *lightIndices, int numLights,
Framebuffer *framebuffer) {
const MinMaxZTree *minMaxZTree = gMinMaxZTree;
// If we have few enough lights or this is the base case (last level), shade
// this full tile directly
if (level == 0 || numLights < DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE) {
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
int startX = tileX * width;
int startY = tileY * height;
int endX = std::min(input->header.framebufferWidth, startX + width);
int endY = std::min(input->header.framebufferHeight, startY + height);
// Skip entirely offscreen tiles
if (endX > startX && endY > startY) {
ShadeTileC(startX, endX, startY, endY,
input->header.framebufferWidth, input->header.framebufferHeight,
input->arrays,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
input->header.cameraProj[2][2], input->header.cameraProj[3][2],
lightIndices, numLights, VISUALIZE_LIGHT_COUNT,
framebuffer->r, framebuffer->g, framebuffer->b);
}
}
else {
// Otherwise, subdivide and 4-way recurse using X and Y splitting planes
// Move down a level in the tree
--level;
tileX <<= 1;
tileY <<= 1;
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
// Work out splitting coords
int midX = (tileX + 1) * width;
int midY = (tileY + 1) * height;
// Read subtile min/max data
// NOTE: We must be sure to handle out-of-bounds access here since
// sometimes we'll only have 1 or 2 subtiles for non-pow-2
// framebuffer sizes.
bool rightTileExists = (tileX + 1 < minMaxZTree->NumTilesX(level));
bool bottomTileExists = (tileY + 1 < minMaxZTree->NumTilesY(level));
// NOTE: Order is 00, 10, 01, 11
// Set defaults up to cull all lights if the tile doesn't exist (offscreen)
float minZ[4] = {input->header.cameraFar, input->header.cameraFar,
input->header.cameraFar, input->header.cameraFar};
float maxZ[4] = {input->header.cameraNear, input->header.cameraNear,
input->header.cameraNear, input->header.cameraNear};
minZ[0] = minMaxZTree->MinZ(level, tileX, tileY);
maxZ[0] = minMaxZTree->MaxZ(level, tileX, tileY);
if (rightTileExists) {
minZ[1] = minMaxZTree->MinZ(level, tileX + 1, tileY);
maxZ[1] = minMaxZTree->MaxZ(level, tileX + 1, tileY);
if (bottomTileExists) {
minZ[3] = minMaxZTree->MinZ(level, tileX + 1, tileY + 1);
maxZ[3] = minMaxZTree->MaxZ(level, tileX + 1, tileY + 1);
}
}
if (bottomTileExists) {
minZ[2] = minMaxZTree->MinZ(level, tileX, tileY + 1);
maxZ[2] = minMaxZTree->MaxZ(level, tileX, tileY + 1);
}
// Cull lights into subtile lists
#ifdef ISPC_IS_WINDOWS
__declspec(align(ALIGNMENT_BYTES))
#endif
int subtileLightIndices[4][MAX_LIGHTS]
#ifndef ISPC_IS_WINDOWS
__attribute__ ((aligned(ALIGNMENT_BYTES)))
#endif
;
int subtileNumLights[4];
SplitTileMinMax(midX, midY, minZ, maxZ,
input->header.framebufferWidth, input->header.framebufferHeight,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
lightIndices, numLights, input->arrays.lightPositionView_x,
input->arrays.lightPositionView_y, input->arrays.lightPositionView_z,
input->arrays.lightAttenuationEnd,
subtileLightIndices[0], MAX_LIGHTS, subtileNumLights);
// Recurse into subtiles
ShadeDynamicTileRecurse(input, level, tileX , tileY,
subtileLightIndices[0], subtileNumLights[0],
framebuffer);
ShadeDynamicTileRecurse(input, level, tileX + 1, tileY,
subtileLightIndices[1], subtileNumLights[1],
framebuffer);
ShadeDynamicTileRecurse(input, level, tileX , tileY + 1,
subtileLightIndices[2], subtileNumLights[2],
framebuffer);
ShadeDynamicTileRecurse(input, level, tileX + 1, tileY + 1,
subtileLightIndices[3], subtileNumLights[3],
framebuffer);
}
}
static int
IntersectLightsWithTileMinMax(
int tileStartX, int tileEndX,
int tileStartY, int tileEndY,
// Tile data
float minZ,
float maxZ,
// G-buffer data
int gBufferWidth, int gBufferHeight,
// Camera data
float cameraProj_11, float cameraProj_22,
// Light Data
int numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Output
int tileLightIndices[]
)
{
float gBufferScale_x = 0.5f * (float)gBufferWidth;
float gBufferScale_y = 0.5f * (float)gBufferHeight;
float frustumPlanes_xy[4];
float frustumPlanes_z[4];
// This one is totally constant over the whole screen... worth pulling it up at all?
float frustumPlanes_xy_v[4] = { -(cameraProj_11 * gBufferScale_x),
(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y),
-(cameraProj_22 * gBufferScale_y) };
float frustumPlanes_z_v[4] = { tileEndX - gBufferScale_x,
-tileStartX + gBufferScale_x,
tileEndY - gBufferScale_y,
-tileStartY + gBufferScale_y };
for (int i = 0; i < 4; ++i) {
float norm = 1.f / sqrtf(frustumPlanes_xy_v[i] * frustumPlanes_xy_v[i] +
frustumPlanes_z_v[i] * frustumPlanes_z_v[i]);
frustumPlanes_xy_v[i] *= norm;
frustumPlanes_z_v[i] *= norm;
frustumPlanes_xy[i] = frustumPlanes_xy_v[i];
frustumPlanes_z[i] = frustumPlanes_z_v[i];
}
int tileNumLights = 0;
for (int lightIndex = 0; lightIndex < numLights; ++lightIndex) {
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
float d = light_positionView_z - minZ;
bool inFrustum = (d >= light_attenuationEndNeg);
d = maxZ - light_positionView_z;
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
if (!inFrustum)
continue;
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
d = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[1] +
light_positionView_x * frustumPlanes_xy[1];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[2] +
light_positionView_y * frustumPlanes_xy[2];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[3] +
light_positionView_y * frustumPlanes_xy[3];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// Pack and store intersecting lights
if (inFrustum)
tileLightIndices[tileNumLights++] = lightIndex;
}
return tileNumLights;
}
void
ShadeDynamicTile(InputData *input, int level, int tileX, int tileY,
Framebuffer *framebuffer) {
const MinMaxZTree *minMaxZTree = gMinMaxZTree;
// Get Z min/max for this tile
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
float minZ = minMaxZTree->MinZ(level, tileX, tileY);
float maxZ = minMaxZTree->MaxZ(level, tileX, tileY);
int startX = tileX * width;
int startY = tileY * height;
int endX = std::min(input->header.framebufferWidth, startX + width);
int endY = std::min(input->header.framebufferHeight, startY + height);
// This is a root tile, so first do a full 6-plane cull
#ifdef ISPC_IS_WINDOWS
__declspec(align(ALIGNMENT_BYTES))
#endif
int lightIndices[MAX_LIGHTS]
#ifndef ISPC_IS_WINDOWS
__attribute__ ((aligned(ALIGNMENT_BYTES)))
#endif
;
int numLights = IntersectLightsWithTileMinMax(
startX, endX, startY, endY, minZ, maxZ,
input->header.framebufferWidth, input->header.framebufferHeight,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
MAX_LIGHTS, input->arrays.lightPositionView_x,
input->arrays.lightPositionView_y, input->arrays.lightPositionView_z,
input->arrays.lightAttenuationEnd, lightIndices);
// Now kick off the recursive process for this tile
ShadeDynamicTileRecurse(input, level, tileX, tileY, lightIndices,
numLights, framebuffer);
}
void
DispatchDynamicC(InputData *input, Framebuffer *framebuffer)
{
MinMaxZTree *minMaxZTree = gMinMaxZTree;
// Update min/max Z tree
minMaxZTree->Update(input->arrays.zBuffer, input->header.framebufferWidth,
input->header.cameraProj[2][2], input->header.cameraProj[3][2],
input->header.cameraNear, input->header.cameraFar);
int rootLevel = minMaxZTree->Levels() - 1;
int rootTilesX = minMaxZTree->NumTilesX(rootLevel);
int rootTilesY = minMaxZTree->NumTilesY(rootLevel);
int rootTiles = rootTilesX * rootTilesY;
for (int g = 0; g < rootTiles; ++g) {
uint32_t tileY = g / rootTilesX;
uint32_t tileX = g % rootTilesX;
ShadeDynamicTile(input, rootLevel, tileX, tileY, framebuffer);
}
}
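The `half_to_float_fast` routine earlier in this file rebiases the exponent (subtract 15, add 127) and shifts the mantissa left by 13 bits, which is only valid for normalized half values. The sketch below restates that arithmetic with two deliberate changes that are assumptions of this example, not part of the original file: the bits are copied with `memcpy` instead of being read through a cast pointer, and zero is handled explicitly. The names `bits_to_float` and `half_to_float_checked` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret 32 bits as a float by copying the bytes,
// rather than dereferencing a reinterpret_cast'ed pointer.
static inline float bits_to_float(uint32_t bits) {
    static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Hypothetical variant of the conversion: accepts zero and normalized halves.
// Denormals, infinities and NaNs are treated as precondition violations here.
static inline float half_to_float_checked(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;   // sign bit moves to bit 31
    uint32_t exp5 = (uint32_t)(h >> 10) & 0x1Fu;     // 5-bit biased exponent
    uint32_t mant = (uint32_t)(h & 0x03FFu);         // 10-bit mantissa

    if (exp5 == 0 && mant == 0)
        return bits_to_float(sign);                  // signed zero

    assert(exp5 != 0 && exp5 != 31);                 // normalized values only

    uint32_t exp8 = exp5 - 15 + 127;                 // rebias: half -> single
    return bits_to_float(sign | (exp8 << 23) | (mant << 13));
}
```

For instance, the half bit pattern `0x3C00` has a biased exponent of 15 and a zero mantissa, so it rebiases to 127 and yields the single-precision pattern `0x3F800000`.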


@@ -1,398 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef __cilk
#include "deferred.h"
#include "kernels_ispc.h"
#include <algorithm>
#include <assert.h>
#ifdef _MSC_VER
#define ISPC_IS_WINDOWS
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#ifdef ISPC_IS_LINUX
#include <malloc.h>
#endif // ISPC_IS_LINUX
// Currently tile widths must be a multiple of SIMD width (i.e. 8 for ispc sse4x2)!
//#define MIN_TILE_WIDTH 64
//#define MIN_TILE_HEIGHT 16
#define DYNAMIC_TREE_LEVELS 5
// If this is set to 1 then the result will be identical to the static version
#define DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE 1
static void *
lAlignedMalloc(size_t size, int32_t alignment) {
#ifdef ISPC_IS_WINDOWS
return _aligned_malloc(size, alignment);
#endif
#ifdef ISPC_IS_LINUX
return memalign(alignment, size);
#endif
#ifdef ISPC_IS_APPLE
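// Reserve a full `alignment` bytes of slack: the adjustment below advances
// amem by between 1 and `alignment` bytes, never by 0.
void *mem = malloc(size + alignment + sizeof(void*));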
char *amem = ((char*)mem) + sizeof(void*);
amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
(alignment - 1)));
((void**)amem)[-1] = mem;
return amem;
#endif
}
static void
lAlignedFree(void *ptr) {
#ifdef ISPC_IS_WINDOWS
_aligned_free(ptr);
#endif
#ifdef ISPC_IS_LINUX
free(ptr);
#endif
#ifdef ISPC_IS_APPLE
free(((void**)ptr)[-1]);
#endif
}
class MinMaxZTreeCilk
{
public:
// Currently (min) tile dimensions must divide gBuffer dimensions evenly
// Levels must be small enough that neither dimension goes below one tile
MinMaxZTreeCilk(
int tileWidth, int tileHeight, int levels,
int gBufferWidth, int gBufferHeight)
: mTileWidth(tileWidth), mTileHeight(tileHeight), mLevels(levels)
{
mNumTilesX = gBufferWidth / mTileWidth;
mNumTilesY = gBufferHeight / mTileHeight;
// Allocate arrays
mMinZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
mMaxZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
for (int i = 0; i < mLevels; ++i) {
int x = NumTilesX(i);
int y = NumTilesY(i);
assert(x > 0);
assert(y > 0);
// NOTE: If the following two asserts fire it probably means that
// the base tile dimensions do not evenly divide the G-buffer dimensions
assert(x * (mTileWidth << i) >= gBufferWidth);
assert(y * (mTileHeight << i) >= gBufferHeight);
mMinZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
mMaxZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
}
}
void Update(float *zBuffer, int gBufferPitchInElements,
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar)
{
// Compute level 0 in parallel. The outer loop is here since we use Cilk
_Cilk_for (int tileY = 0; tileY < mNumTilesY; ++tileY) {
ispc::ComputeZBoundsRow(tileY,
mTileWidth, mTileHeight, mNumTilesX, mNumTilesY,
zBuffer, gBufferPitchInElements,
cameraProj_33, cameraProj_43, cameraNear, cameraFar,
mMinZArrays[0] + (tileY * mNumTilesX),
mMaxZArrays[0] + (tileY * mNumTilesX));
}
// Generate other levels
// NOTE: We currently don't use ispc here since it's sort of an
// awkward gather-based reduction. Using SSE odd pack/unpack
// instructions might actually work here when we need to optimize.
for (int level = 1; level < mLevels; ++level) {
int destTilesX = NumTilesX(level);
int destTilesY = NumTilesY(level);
int srcLevel = level - 1;
int srcTilesX = NumTilesX(srcLevel);
int srcTilesY = NumTilesY(srcLevel);
_Cilk_for (int y = 0; y < destTilesY; ++y) {
for (int x = 0; x < destTilesX; ++x) {
int srcX = x << 1;
int srcY = y << 1;
// NOTE: Ugly branches to deal with non-multiple dimensions at some levels
// TODO: SSE branchless min/max is probably better...
float minZ = mMinZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
float maxZ = mMaxZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
if (srcX + 1 < srcTilesX) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY) * srcTilesX +
(srcX + 1)]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY) * srcTilesX +
(srcX + 1)]);
if (srcY + 1 < srcTilesY) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX + 1)]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX + 1)]);
}
}
if (srcY + 1 < srcTilesY) {
minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX )]);
maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
(srcX )]);
}
mMinZArrays[level][y * destTilesX + x] = minZ;
mMaxZArrays[level][y * destTilesX + x] = maxZ;
}
}
}
}
~MinMaxZTreeCilk() {
for (int i = 0; i < mLevels; ++i) {
lAlignedFree(mMinZArrays[i]);
lAlignedFree(mMaxZArrays[i]);
}
lAlignedFree(mMinZArrays);
lAlignedFree(mMaxZArrays);
}
int Levels() const { return mLevels; }
// These round UP, so beware that the last tile for a given level may not be completely full
// TODO: Verify this...
int NumTilesX(int level = 0) const { return (mNumTilesX + (1 << level) - 1) >> level; }
int NumTilesY(int level = 0) const { return (mNumTilesY + (1 << level) - 1) >> level; }
int TileWidth(int level = 0) const { return (mTileWidth << level); }
int TileHeight(int level = 0) const { return (mTileHeight << level); }
float MinZ(int level, int tileX, int tileY) const {
return mMinZArrays[level][tileY * NumTilesX(level) + tileX];
}
float MaxZ(int level, int tileX, int tileY) const {
return mMaxZArrays[level][tileY * NumTilesX(level) + tileX];
}
private:
int mTileWidth;
int mTileHeight;
int mLevels;
int mNumTilesX;
int mNumTilesY;
// One array for each "level" in the tree
float **mMinZArrays;
float **mMaxZArrays;
};
static MinMaxZTreeCilk *gMinMaxZTreeCilk = 0;
void InitDynamicCilk(InputData *input) {
gMinMaxZTreeCilk =
new MinMaxZTreeCilk(MIN_TILE_WIDTH, MIN_TILE_HEIGHT, DYNAMIC_TREE_LEVELS,
input->header.framebufferWidth,
input->header.framebufferHeight);
}
static void
ShadeDynamicTileRecurse(InputData *input, int level, int tileX, int tileY,
int *lightIndices, int numLights,
Framebuffer *framebuffer) {
const MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
// If we have few enough lights or this is the base case (last level), shade
// this full tile directly
if (level == 0 || numLights < DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE) {
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
int startX = tileX * width;
int startY = tileY * height;
int endX = std::min(input->header.framebufferWidth, startX + width);
int endY = std::min(input->header.framebufferHeight, startY + height);
// Skip entirely offscreen tiles
if (endX > startX && endY > startY) {
ispc::ShadeTile(
startX, endX, startY, endY,
input->header.framebufferWidth, input->header.framebufferHeight,
input->arrays,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
input->header.cameraProj[2][2], input->header.cameraProj[3][2],
lightIndices, numLights, VISUALIZE_LIGHT_COUNT,
framebuffer->r, framebuffer->g, framebuffer->b);
}
}
else {
// Otherwise, subdivide and 4-way recurse using X and Y splitting planes
// Move down a level in the tree
--level;
tileX <<= 1;
tileY <<= 1;
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
// Work out splitting coords
int midX = (tileX + 1) * width;
int midY = (tileY + 1) * height;
// Read subtile min/max data
// NOTE: We must be sure to handle out-of-bounds access here since
// sometimes we'll only have 1 or 2 subtiles for non-pow-2
// framebuffer sizes.
bool rightTileExists = (tileX + 1 < minMaxZTree->NumTilesX(level));
bool bottomTileExists = (tileY + 1 < minMaxZTree->NumTilesY(level));
// NOTE: Order is 00, 10, 01, 11
// Set defaults up to cull all lights if the tile doesn't exist (offscreen)
float minZ[4] = {input->header.cameraFar, input->header.cameraFar,
input->header.cameraFar, input->header.cameraFar};
float maxZ[4] = {input->header.cameraNear, input->header.cameraNear,
input->header.cameraNear, input->header.cameraNear};
minZ[0] = minMaxZTree->MinZ(level, tileX, tileY);
maxZ[0] = minMaxZTree->MaxZ(level, tileX, tileY);
if (rightTileExists) {
minZ[1] = minMaxZTree->MinZ(level, tileX + 1, tileY);
maxZ[1] = minMaxZTree->MaxZ(level, tileX + 1, tileY);
if (bottomTileExists) {
minZ[3] = minMaxZTree->MinZ(level, tileX + 1, tileY + 1);
maxZ[3] = minMaxZTree->MaxZ(level, tileX + 1, tileY + 1);
}
}
if (bottomTileExists) {
minZ[2] = minMaxZTree->MinZ(level, tileX, tileY + 1);
maxZ[2] = minMaxZTree->MaxZ(level, tileX, tileY + 1);
}
// Cull lights into subtile lists
#ifdef ISPC_IS_WINDOWS
__declspec(align(ALIGNMENT_BYTES))
#endif
int subtileLightIndices[4][MAX_LIGHTS]
#ifndef ISPC_IS_WINDOWS
__attribute__ ((aligned(ALIGNMENT_BYTES)))
#endif
;
int subtileNumLights[4];
ispc::SplitTileMinMax(midX, midY, minZ, maxZ,
input->header.framebufferWidth, input->header.framebufferHeight,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
lightIndices, numLights, input->arrays.lightPositionView_x,
input->arrays.lightPositionView_y, input->arrays.lightPositionView_z,
input->arrays.lightAttenuationEnd,
subtileLightIndices[0], MAX_LIGHTS, subtileNumLights);
// Recurse into subtiles
_Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX , tileY,
subtileLightIndices[0], subtileNumLights[0],
framebuffer);
_Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX + 1, tileY,
subtileLightIndices[1], subtileNumLights[1],
framebuffer);
_Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX , tileY + 1,
subtileLightIndices[2], subtileNumLights[2],
framebuffer);
ShadeDynamicTileRecurse(input, level, tileX + 1, tileY + 1,
subtileLightIndices[3], subtileNumLights[3],
framebuffer);
}
}
static void
ShadeDynamicTile(InputData *input, int level, int tileX, int tileY,
Framebuffer *framebuffer) {
const MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
// Get Z min/max for this tile
int width = minMaxZTree->TileWidth(level);
int height = minMaxZTree->TileHeight(level);
float minZ = minMaxZTree->MinZ(level, tileX, tileY);
float maxZ = minMaxZTree->MaxZ(level, tileX, tileY);
int startX = tileX * width;
int startY = tileY * height;
int endX = std::min(input->header.framebufferWidth, startX + width);
int endY = std::min(input->header.framebufferHeight, startY + height);
// This is a root tile, so first do a full 6-plane cull
#ifdef ISPC_IS_WINDOWS
__declspec(align(ALIGNMENT_BYTES))
#endif
int lightIndices[MAX_LIGHTS]
#ifndef ISPC_IS_WINDOWS
__attribute__ ((aligned(ALIGNMENT_BYTES)))
#endif
;
int numLights = ispc::IntersectLightsWithTileMinMax(
startX, endX, startY, endY, minZ, maxZ,
input->header.framebufferWidth, input->header.framebufferHeight,
input->header.cameraProj[0][0], input->header.cameraProj[1][1],
MAX_LIGHTS, input->arrays.lightPositionView_x,
input->arrays.lightPositionView_y, input->arrays.lightPositionView_z,
input->arrays.lightAttenuationEnd, lightIndices);
// Now kick off the recursive process for this tile
ShadeDynamicTileRecurse(input, level, tileX, tileY, lightIndices,
numLights, framebuffer);
}
void
DispatchDynamicCilk(InputData *input, Framebuffer *framebuffer)
{
MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
// Update min/max Z tree
minMaxZTree->Update(input->arrays.zBuffer, input->header.framebufferWidth,
input->header.cameraProj[2][2], input->header.cameraProj[3][2],
input->header.cameraNear, input->header.cameraFar);
// Launch the "root" tiles. Ideally these should at least fill the
// machine... at the moment we have a static number of "levels" to the
// mip tree but it might make sense to compute it based on the width of
// the machine.
int rootLevel = minMaxZTree->Levels() - 1;
int rootTilesX = minMaxZTree->NumTilesX(rootLevel);
int rootTilesY = minMaxZTree->NumTilesY(rootLevel);
int rootTiles = rootTilesX * rootTilesY;
_Cilk_for (int g = 0; g < rootTiles; ++g) {
uint32_t tileY = g / rootTilesX;
uint32_t tileX = g % rootTilesX;
ShadeDynamicTile(input, rootLevel, tileX, tileY, framebuffer);
}
}
#endif // __cilk
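The min/max Z tree in this file builds each coarser level by reducing up to four tiles of the level below, and its tile counts round up so that a partially covered last tile still exists. The following is a minimal standalone sketch of that idea for the minimum only, under the assumption of a row-major tile layout; `NumTilesAtLevel` and `ReduceMinLevel` are hypothetical names introduced for illustration and are not functions from the original source.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Tiles along one axis at `level`, rounding up (same formula as NumTilesX/Y).
static inline int NumTilesAtLevel(int baseTiles, int level) {
    assert(baseTiles > 0 && level >= 0);
    return (baseTiles + (1 << level) - 1) >> level;
}

// Reduce a row-major grid of per-tile minima to the next coarser level.
// Each destination tile covers a 2x2 block of source tiles; source tiles
// that fall outside the grid are simply skipped.
static std::vector<float> ReduceMinLevel(const std::vector<float> &src,
                                         int srcTilesX, int srcTilesY) {
    assert((int)src.size() == srcTilesX * srcTilesY);
    const int dstTilesX = (srcTilesX + 1) >> 1;
    const int dstTilesY = (srcTilesY + 1) >> 1;
    std::vector<float> dst(dstTilesX * dstTilesY);
    for (int y = 0; y < dstTilesY; ++y) {
        for (int x = 0; x < dstTilesX; ++x) {
            // The top-left source tile always exists.
            float m = src[(2 * y) * srcTilesX + (2 * x)];
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const int sx = 2 * x + dx;
                    const int sy = 2 * y + dy;
                    if (sx < srcTilesX && sy < srcTilesY)
                        m = std::min(m, src[sy * srcTilesX + sx]);
                }
            }
            dst[y * dstTilesX + x] = m;
        }
    }
    return dst;
}
```

With five base tiles along an axis, the rounded-up counts per level are 5, 3, 2 and 1. The bounds check inside the inner loops plays the same role as the explicit `srcX + 1 < srcTilesX` and `srcY + 1 < srcTilesY` branches in `Update`.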


@@ -1,761 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "deferred.h"
#include <stdio.h>
#include <assert.h>
#define programCount 32
#define programIndex (threadIdx.x & 31)
#define taskIndex (blockIdx.x*4 + (threadIdx.x >> 5))
#define taskCount (gridDim.x*4)
#define warpIdx (threadIdx.x >> 5)
#define int32 int
#define int16 short
#define int8 char
__device__ static inline float clamp(float v, float low, float high)
{
return min(max(v, low), high);
}
struct InputDataArrays
{
float *zBuffer;
unsigned int16 *normalEncoded_x; // half float
unsigned int16 *normalEncoded_y; // half float
unsigned int16 *specularAmount; // half float
unsigned int16 *specularPower; // half float
unsigned int8 *albedo_x; // unorm8
unsigned int8 *albedo_y; // unorm8
unsigned int8 *albedo_z; // unorm8
float *lightPositionView_x;
float *lightPositionView_y;
float *lightPositionView_z;
float *lightAttenuationBegin;
float *lightColor_x;
float *lightColor_y;
float *lightColor_z;
float *lightAttenuationEnd;
};
struct InputHeader
{
float cameraProj[4][4];
float cameraNear;
float cameraFar;
int32 framebufferWidth;
int32 framebufferHeight;
int32 numLights;
int32 inputDataChunkSize;
int32 inputDataArrayOffsets[idaNum];
};
///////////////////////////////////////////////////////////////////////////
// Common utility routines
__device__
static inline float
dot3(float x, float y, float z, float a, float b, float c) {
return (x*a + y*b + z*c);
}
#if 0
static __shared__ int shdata_full[128];
template<typename T, int N>
struct Uniform
{
T data[(N+programCount-1)/programCount];
volatile T *shdata;
__device__ inline Uniform()
{
shdata = ((T*)shdata_full) + warpIdx*32;
}
__device__ inline int2 get_chunk(const int i) const
{
const int elem = i & (programCount - 1);
const int chunk = i >> 5;
shdata[programIndex] = chunk;
shdata[ elem] = chunk;
return make_int2(shdata[programIndex], elem);
}
__device__ inline const T get(const int i) const
{
const int2 idx = get_chunk(i);
return __shfl(data[idx.x], idx.y);
}
__device__ inline void set(const bool active, const int i, T value)
{
const int2 idx = get_chunk(i);
const int chunkIdx = idx.x;
const int elemIdx = idx.y;
shdata[programIndex] = data[chunkIdx];
if (active) shdata[elemIdx] = value;
data[chunkIdx] = shdata[programIndex];
}
};
#elif 1
template<typename T, int N>
struct Uniform
{
union
{
T *data;
int32_t ptr[2];
};
__device__ inline Uniform()
{
if (programIndex == 0)
data = (T*)malloc(N*sizeof(T));
ptr[0] = __shfl(ptr[0], 0);
ptr[1] = __shfl(ptr[1], 0);
}
__device__ inline ~Uniform()
{
if (programIndex == 0)
free(data);
}
__device__ inline const T get(const int i) const
{
return data[i];
}
__device__ inline T* get_ptr(const int i) {return &data[i]; }
__device__ inline void set(const bool active, const int i, T value)
{
if (active)
data[i] = value;
}
};
#else
__shared__ int shdata_full[4*MAX_LIGHTS];
template<typename T, int N>
struct Uniform
{
volatile T *shdata;
__device__ Uniform()
{
shdata = (T*)&shdata_full[warpIdx*MAX_LIGHTS];
}
__device__ inline const T get(const int i) const
{
return shdata[i];
}
__device__ inline void set(const bool active, const int i, T value)
{
if (active)
shdata[i] = value;
}
};
#endif
__device__
static inline void
normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
float n = rsqrt(x*x + y*y + z*z);
ox = x * n;
oy = y * n;
oz = z * n;
}
__device__ inline
static float reduce_min(float value)
{
#pragma unroll
for (int i = 4; i >= 0; i--)
value = fminf(value, __shfl_xor(value, 1<<i, 32));
return value;
}
__device__ inline
static float reduce_max(float value)
{
#pragma unroll
for (int i = 4; i >= 0; i--)
value = fmaxf(value, __shfl_xor(value, 1<<i, 32));
return value;
}
#if 0
__device__ inline
static int reduce_sum(int value)
{
#pragma unroll
for (int i = 4; i >=0; i--)
value += __shfl_xor(value, 1<<i, 32);
return value;
}
static __device__ __forceinline__ uint shfl_scan_add_step(uint partial, uint up_offset)
{
uint result;
asm(
"{.reg .u32 r0;"
".reg .pred p;"
"shfl.up.b32 r0|p, %1, %2, 0;"
"@p add.u32 r0, r0, %3;"
"mov.u32 %0, r0;}"
: "=r"(result) : "r"(partial), "r"(up_offset), "r"(partial));
return result;
}
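// NOTE: despite its name, this returns the exclusive prefix sum
// (the inclusive warp scan minus this lane's own value).
static __device__ __forceinline__ int inclusive_scan_warp(const int value)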
{
uint sum = value;
#pragma unroll
for(int i = 0; i < 5; ++i)
sum = shfl_scan_add_step(sum, 1 << i);
return sum - value;
}
#endif
static __device__ __forceinline__ int lanemask_lt()
{
int mask;
asm("mov.u32 %0, %lanemask_lt;" : "=r" (mask));
return mask;
}
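// Returns (x, y) where x is the number of lanes in the warp for which p is
// true and y is the number of such lanes with a lower lane index than the caller.
static __device__ __forceinline__ int2 warpBinExclusiveScan(const bool p)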
{
const int b = __ballot(p);
return make_int2(__popc(b), __popc(b & lanemask_lt()));
}
__device__ static inline
int packed_store_active(bool active, int* ptr, int value)
{
const int2 res = warpBinExclusiveScan(active);
const int idx = res.y;
const int nactive = res.x;
if (active)
ptr[idx] = value;
return nactive;
}
__device__
static inline float
Unorm8ToFloat32(unsigned int8 u) {
return (float)u * (1.0f / 255.0f);
}
__device__
static inline unsigned int8
Float32ToUnorm8(float f) {
return (unsigned int8)(f * 255.0f);
}
__device__
static inline void
ComputeZBounds(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
// G-buffer data
float zBuffer[],
int32 gBufferWidth,
// Camera data
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Output
float &minZ,
float &maxZ
)
{
// Find Z bounds
float laneMinZ = cameraFar;
float laneMaxZ = cameraNear;
for ( int32 y = tileStartY; y < tileEndY; ++y) {
for ( int xb = tileStartX; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
if (x >= tileEndX) break;
// Unproject depth buffer Z value into view space
float z = zBuffer[y * gBufferWidth + x];
float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
// Work out Z bounds for our samples
// Avoid considering skybox/background or otherwise invalid pixels
if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
laneMinZ = min(laneMinZ, viewSpaceZ);
laneMaxZ = max(laneMaxZ, viewSpaceZ);
}
}
}
minZ = reduce_min(laneMinZ);
maxZ = reduce_max(laneMaxZ);
}
__device__
static inline int32
IntersectLightsWithTileMinMax(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
// Tile data
float minZ,
float maxZ,
// G-buffer data
int32 gBufferWidth, int32 gBufferHeight,
// Camera data
float cameraProj_11, float cameraProj_22,
// Light Data
int32 numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Output
Uniform<int,MAX_LIGHTS> &tileLightIndices
)
{
float gBufferScale_x = 0.5f * (float)gBufferWidth;
float gBufferScale_y = 0.5f * (float)gBufferHeight;
float frustumPlanes_xy[4] = {
-(cameraProj_11 * gBufferScale_x),
(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y),
-(cameraProj_22 * gBufferScale_y) };
float frustumPlanes_z[4] = {
tileEndX - gBufferScale_x,
-tileStartX + gBufferScale_x,
tileEndY - gBufferScale_y,
-tileStartY + gBufferScale_y };
for ( int i = 0; i < 4; ++i) {
float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] +
frustumPlanes_z[i] * frustumPlanes_z[i]);
frustumPlanes_xy[i] *= norm;
frustumPlanes_z[i] *= norm;
}
int32 tileNumLights = 0;
for ( int lightIndexB = 0; lightIndexB < numLights; lightIndexB += programCount)
{
const int lightIndex = lightIndexB + programIndex;
if (lightIndex >= numLights) break;
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
float d = light_positionView_z - minZ;
bool inFrustum = (d >= light_attenuationEndNeg);
d = maxZ - light_positionView_z;
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// This seems better than "cif (!inFrustum) ccontinue;" here, since we
// don't actually need to mask the rest of this function; this is
// just a greedy early-out. We could also structure all of this as
// nested if() statements, but this is a bit easier to read.
if (__ballot(inFrustum) > 0)
{
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
d = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[1] +
light_positionView_x * frustumPlanes_xy[1];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[2] +
light_positionView_y * frustumPlanes_xy[2];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[3] +
light_positionView_y * frustumPlanes_xy[3];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// Pack and store intersecting lights
const bool active = inFrustum && lightIndex < numLights;
#if 0
if (__ballot(active) > 0)
tileNumLights += packed_store_active(active, tileLightIndices.get_ptr(tileNumLights), lightIndex);
#else
if (__ballot(active) > 0)
{
const int2 res = warpBinExclusiveScan(active);
const int idx = tileNumLights + res.y;
const int nactive = res.x;
tileLightIndices.set(active, idx, lightIndex);
tileNumLights += nactive;
}
#endif
}
}
return tileNumLights;
}
__device__
static inline int32
IntersectLightsWithTile(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
int32 gBufferWidth, int32 gBufferHeight,
// G-buffer data
float zBuffer[],
// Camera data
float cameraProj_11, float cameraProj_22,
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Light Data
int32 numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Output
Uniform<int,MAX_LIGHTS> &tileLightIndices
)
{
float minZ, maxZ;
ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
minZ, maxZ);
int32 tileNumLights = IntersectLightsWithTileMinMax(
tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
numLights, light_positionView_x_array, light_positionView_y_array,
light_positionView_z_array, light_attenuationEnd_array,
tileLightIndices);
return tileNumLights;
}
__device__
static inline void
ShadeTile(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
int32 gBufferWidth, int32 gBufferHeight,
const InputDataArrays &inputData,
// Camera data
float cameraProj_11, float cameraProj_22,
float cameraProj_33, float cameraProj_43,
// Light list
Uniform<int,MAX_LIGHTS> &tileLightIndices,
int32 tileNumLights,
// UI
bool visualizeLightCount,
// Output
unsigned int8 framebuffer_r[],
unsigned int8 framebuffer_g[],
unsigned int8 framebuffer_b[]
)
{
if (tileNumLights == 0 || visualizeLightCount) {
unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
for ( int32 y = tileStartY; y < tileEndY; ++y) {
for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
if (x >= tileEndX) continue;
int32 framebufferIndex = (y * gBufferWidth + x);
framebuffer_r[framebufferIndex] = c;
framebuffer_g[framebufferIndex] = c;
framebuffer_b[framebufferIndex] = c;
}
}
} else {
float twoOverGBufferWidth = 2.0f / gBufferWidth;
float twoOverGBufferHeight = 2.0f / gBufferHeight;
for ( int32 y = tileStartY; y < tileEndY; ++y) {
float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
// if (x >= tileEndX) break;
int32 gBufferOffset = y * gBufferWidth + x;
// Reconstruct position and (negative) view vector from G-buffer
float surface_positionView_x, surface_positionView_y, surface_positionView_z;
float Vneg_x, Vneg_y, Vneg_z;
float z = inputData.zBuffer[gBufferOffset];
// Compute screen/clip-space position
// NOTE: Mind DX11 viewport transform and pixel center!
float positionScreen_x = (0.5f + (float)(x)) *
twoOverGBufferWidth - 1.0f;
// Unproject depth buffer Z value into view space
surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
surface_positionView_x = positionScreen_x * surface_positionView_z /
cameraProj_11;
surface_positionView_y = positionScreen_y * surface_positionView_z /
cameraProj_22;
// We actually end up with a vector pointing *at* the
// surface (i.e. the negative view vector)
normalize3(surface_positionView_x, surface_positionView_y,
surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
// Reconstruct normal from G-buffer
float surface_normal_x, surface_normal_y, surface_normal_z;
asm("// half2float //");
float normal_x = __half2float(inputData.normalEncoded_x[gBufferOffset]);
float normal_y = __half2float(inputData.normalEncoded_y[gBufferOffset]);
asm("// half2float //");
float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
float m = sqrt(4.0f * f - 1.0f);
surface_normal_x = m * (4.0f * normal_x - 2.0f);
surface_normal_y = m * (4.0f * normal_y - 2.0f);
surface_normal_z = 3.0f - 8.0f * f;
// Load other G-buffer parameters
float surface_specularAmount =
__half2float(inputData.specularAmount[gBufferOffset]);
float surface_specularPower =
__half2float(inputData.specularPower[gBufferOffset]);
float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
float lit_x = 0.0f;
float lit_y = 0.0f;
float lit_z = 0.0f;
for ( int32 tileLightIndex = 0; tileLightIndex < tileNumLights;
++tileLightIndex) {
int32 lightIndex = tileLightIndices.get(tileLightIndex);
// Gather light data relevant to initial culling
float light_positionView_x =
__ldg(&inputData.lightPositionView_x[lightIndex]);
float light_positionView_y =
__ldg(&inputData.lightPositionView_y[lightIndex]);
float light_positionView_z =
__ldg(&inputData.lightPositionView_z[lightIndex]);
float light_attenuationEnd =
__ldg(&inputData.lightAttenuationEnd[lightIndex]);
// Compute light vector
float L_x = light_positionView_x - surface_positionView_x;
float L_y = light_positionView_y - surface_positionView_y;
float L_z = light_positionView_z - surface_positionView_z;
float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
// Clip at end of attenuation
float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
if (distanceToLight2 < light_attenuationEnd2) {
float distanceToLight = sqrt(distanceToLight2);
// HLSL "rcp" is allowed to be fairly inaccurate
float distanceToLightRcp = 1.0f/distanceToLight;
L_x *= distanceToLightRcp;
L_y *= distanceToLightRcp;
L_z *= distanceToLightRcp;
// Start computing brdf
float NdotL = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, L_x, L_y, L_z);
// Clip back facing
if (NdotL > 0.0f) {
float light_attenuationBegin =
inputData.lightAttenuationBegin[lightIndex];
// Light distance attenuation (linstep)
float lightRange = (light_attenuationEnd - light_attenuationBegin);
float falloffPosition = (light_attenuationEnd - distanceToLight);
float attenuation = min(falloffPosition / lightRange, 1.0f);
float H_x = (L_x - Vneg_x);
float H_y = (L_y - Vneg_y);
float H_z = (L_z - Vneg_z);
normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
float NdotH = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, H_x, H_y, H_z);
NdotH = max(NdotH, 0.0f);
float specular = pow(NdotH, surface_specularPower);
float specularNorm = (surface_specularPower + 2.0f) *
(1.0f / 8.0f);
float specularContrib = surface_specularAmount *
specularNorm * specular;
float k = attenuation * NdotL * (1.0f + specularContrib);
float light_color_x = inputData.lightColor_x[lightIndex];
float light_color_y = inputData.lightColor_y[lightIndex];
float light_color_z = inputData.lightColor_z[lightIndex];
float lightContrib_x = surface_albedo_x * light_color_x;
float lightContrib_y = surface_albedo_y * light_color_y;
float lightContrib_z = surface_albedo_z * light_color_z;
lit_x += lightContrib_x * k;
lit_y += lightContrib_y * k;
lit_z += lightContrib_z * k;
}
}
}
// Gamma correct
// These pows are pretty slow right now, but we can do
// something faster if really necessary to squeeze every
// last bit of performance out of it
float gamma = 1.0f / 2.2f;
lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
}
}
}
}
///////////////////////////////////////////////////////////////////////////
// Static decomposition
__global__ void
RenderTile( int num_groups_x, int num_groups_y,
const InputHeader *inputHeaderPtr,
const InputDataArrays *inputDataPtr,
int visualizeLightCount,
// Output
unsigned int8 framebuffer_r[],
unsigned int8 framebuffer_g[],
unsigned int8 framebuffer_b[]) {
if (taskIndex >= taskCount) return;
const InputHeader inputHeader = *inputHeaderPtr;
const InputDataArrays inputData = *inputDataPtr;
int32 group_y = taskIndex / num_groups_x;
int32 group_x = taskIndex % num_groups_x;
int32 tile_start_x = group_x * MIN_TILE_WIDTH;
int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
int framebufferWidth = inputHeader.framebufferWidth;
int framebufferHeight = inputHeader.framebufferHeight;
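// Note: these locals use 0-based matrix indices, while the callee
// parameters they are passed to use 1-based names (cameraProj_11, ...,
// cameraProj_43).
float cameraProj_00 = inputHeader.cameraProj[0][0];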
float cameraProj_11 = inputHeader.cameraProj[1][1];
float cameraProj_22 = inputHeader.cameraProj[2][2];
float cameraProj_32 = inputHeader.cameraProj[3][2];
// Light intersection: figure out which lights illuminate this tile.
Uniform<int,MAX_LIGHTS> tileLightIndices; // Light list for the tile
#if 1
int numTileLights =
IntersectLightsWithTile(tile_start_x, tile_end_x,
tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight,
inputData.zBuffer,
cameraProj_00, cameraProj_11,
cameraProj_22, cameraProj_32,
inputHeader.cameraNear, inputHeader.cameraFar,
MAX_LIGHTS,
inputData.lightPositionView_x,
inputData.lightPositionView_y,
inputData.lightPositionView_z,
inputData.lightAttenuationEnd,
tileLightIndices);
// And now shade the tile, using the lights in tileLightIndices
ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight, inputData,
cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
tileLightIndices, numTileLights, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
#endif
}
extern "C" __global__ void
RenderStatic( InputHeader inputHeaderPtr[],
InputDataArrays inputDataPtr[],
int visualizeLightCount,
// Output
unsigned int8 framebuffer_r[],
unsigned int8 framebuffer_g[],
unsigned int8 framebuffer_b[]) {
const InputHeader inputHeader = *inputHeaderPtr;
const InputDataArrays inputData = *inputDataPtr;
int num_groups_x = (inputHeader.framebufferWidth +
MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
int num_groups_y = (inputHeader.framebufferHeight +
MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
int num_groups = num_groups_x * num_groups_y;
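// Launch a task to render each tile, each of which is MIN_TILE_WIDTH
// by MIN_TILE_HEIGHT pixels. Each block has 128 threads (4 warps), and
// the grid is rounded up so that there are at least num_groups warps.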
if (programIndex == 0)
RenderTile<<<(num_groups+4-1)/4,128>>>(num_groups_x, num_groups_y,
inputHeaderPtr, inputDataPtr, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
cudaDeviceSynchronize();
}
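
The packed store in the CUDA kernel above (`warpBinExclusiveScan` feeding `tileLightIndices.set`) compacts the indices of the lights that pass the frustum test using a warp ballot and population counts. The following is a minimal host-side C++ sketch of that idea, not the kernel's actual code: it emulates one 32-lane warp with plain loops, and the names `kWarpSize`, `popcount32`, `ballot`, `lanemaskLt`, and `packedStoreActive` are hypothetical stand-ins for `__popc`, `__ballot`, `%lanemask_lt`, and the device-side packed store.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed warp width; the kernel above relies on 32-lane warps.
constexpr int kWarpSize = 32;

// Stand-in for CUDA's __popc: counts the set bits in v.
static int popcount32(std::uint32_t v) {
    int n = 0;
    for (; v != 0; v &= v - 1) ++n;  // each step clears the lowest set bit
    return n;
}

// Stand-in for __ballot: bit i of the result is set when lane i is active.
static std::uint32_t ballot(const std::array<bool, kWarpSize>& active) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane)
        if (active[lane]) mask |= (1u << lane);
    return mask;
}

// Stand-in for %lanemask_lt: bits set for every lane strictly below `lane`.
static std::uint32_t lanemaskLt(int lane) {
    assert(lane >= 0 && lane < kWarpSize);
    return (1u << lane) - 1u;
}

// Emulates one warp-wide packed store: each active lane appends its value
// to `out` at an offset equal to the number of active lanes below it.
// Returns the number of active lanes (the amount `out` grew by).
static int packedStoreActive(const std::array<bool, kWarpSize>& active,
                             const std::array<int, kWarpSize>& value,
                             std::vector<int>& out) {
    const std::uint32_t mask = ballot(active);
    const int nactive = popcount32(mask);
    const std::size_t base = out.size();
    out.resize(base + static_cast<std::size_t>(nactive));
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (!active[lane]) continue;
        // Exclusive prefix count of active lanes, as in warpBinExclusiveScan.
        const int idx = popcount32(mask & lanemaskLt(lane));
        out[base + static_cast<std::size_t>(idx)] = value[lane];
    }
    return nactive;
}
```

For example, if lanes 0, 5, 10, ..., 30 are active and lane `i` holds the value `100 + i`, a single call appends the seven values 100, 105, ..., 130 to `out` in lane order and returns 7. On the device the loop over lanes disappears: every lane evaluates its own `idx` in parallel from the same ballot mask, which is why no shared counter or atomic is needed.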

View File

@@ -1,675 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "deferred.h"
struct InputDataArrays
{
float *zBuffer;
unsigned int16 *normalEncoded_x; // half float
unsigned int16 *normalEncoded_y; // half float
unsigned int16 *specularAmount; // half float
unsigned int16 *specularPower; // half float
unsigned int8 *albedo_x; // unorm8
unsigned int8 *albedo_y; // unorm8
unsigned int8 *albedo_z; // unorm8
float *lightPositionView_x;
float *lightPositionView_y;
float *lightPositionView_z;
float *lightAttenuationBegin;
float *lightColor_x;
float *lightColor_y;
float *lightColor_z;
float *lightAttenuationEnd;
};
struct InputHeader
{
float cameraProj[4][4];
float cameraNear;
float cameraFar;
int32 framebufferWidth;
int32 framebufferHeight;
int32 numLights;
int32 inputDataChunkSize;
int32 inputDataArrayOffsets[idaNum];
};
///////////////////////////////////////////////////////////////////////////
// Common utility routines
static inline float
dot3(float x, float y, float z, float a, float b, float c) {
return (x*a + y*b + z*c);
}
static inline void
normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
float n = rsqrt(x*x + y*y + z*z);
ox = x * n;
oy = y * n;
oz = z * n;
}
static inline float
Unorm8ToFloat32(unsigned int8 u) {
return (float)u * (1.0f / 255.0f);
}
static inline unsigned int8
Float32ToUnorm8(float f) {
return (unsigned int8)(f * 255.0f);
}
static void
ComputeZBounds(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
// G-buffer data
uniform float zBuffer[],
uniform int32 gBufferWidth,
// Camera data
uniform float cameraProj_33, uniform float cameraProj_43,
uniform float cameraNear, uniform float cameraFar,
// Output
uniform float &minZ,
uniform float &maxZ
)
{
// Find Z bounds
float laneMinZ = cameraFar;
float laneMaxZ = cameraNear;
for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
foreach (x = tileStartX ... tileEndX)
{
// Unproject depth buffer Z value into view space
float z = zBuffer[y * gBufferWidth + x];
float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
// Work out Z bounds for our samples
// Avoid considering skybox/background or otherwise invalid pixels
if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
laneMinZ = min(laneMinZ, viewSpaceZ);
laneMaxZ = max(laneMaxZ, viewSpaceZ);
}
}
}
minZ = reduce_min(laneMinZ);
maxZ = reduce_max(laneMaxZ);
}
export uniform int32
IntersectLightsWithTileMinMax(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
// Tile data
uniform float minZ,
uniform float maxZ,
// G-buffer data
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
// Light Data
uniform int32 numLights,
uniform float light_positionView_x_array[],
uniform float light_positionView_y_array[],
uniform float light_positionView_z_array[],
uniform float light_attenuationEnd_array[],
// Output
uniform int32 tileLightIndices[]
)
{
uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
uniform float frustumPlanes_xy[4] = {
-(cameraProj_11 * gBufferScale_x),
(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y),
-(cameraProj_22 * gBufferScale_y) };
uniform float frustumPlanes_z[4] = {
tileEndX - gBufferScale_x,
-tileStartX + gBufferScale_x,
tileEndY - gBufferScale_y,
-tileStartY + gBufferScale_y };
for (uniform int i = 0; i < 4; ++i) {
uniform float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] +
frustumPlanes_z[i] * frustumPlanes_z[i]);
frustumPlanes_xy[i] *= norm;
frustumPlanes_z[i] *= norm;
}
uniform int32 tileNumLights = 0;
foreach (lightIndex = 0 ... numLights)
{
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
float d = light_positionView_z - minZ;
bool inFrustum = (d >= light_attenuationEndNeg);
d = maxZ - light_positionView_z;
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// This seems better than "cif (!inFrustum) ccontinue;" here, since we
// don't actually need to mask the rest of this function; this is
// just a greedy early-out. We could also structure all of this as
// nested if() statements, but this is a bit easier to read.
if (any(inFrustum)) {
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
d = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[1] +
light_positionView_x * frustumPlanes_xy[1];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[2] +
light_positionView_y * frustumPlanes_xy[2];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[3] +
light_positionView_y * frustumPlanes_xy[3];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// Pack and store intersecting lights
const bool active = inFrustum && lightIndex < numLights;
if (any(active))
tileNumLights += packed_store_active(active, &tileLightIndices[tileNumLights], lightIndex);
}
}
return tileNumLights;
}
static uniform int32
IntersectLightsWithTile(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
// G-buffer data
uniform float zBuffer[],
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
uniform float cameraProj_33, uniform float cameraProj_43,
uniform float cameraNear, uniform float cameraFar,
// Light Data
uniform int32 numLights,
uniform float light_positionView_x_array[],
uniform float light_positionView_y_array[],
uniform float light_positionView_z_array[],
uniform float light_attenuationEnd_array[],
// Output
uniform int32 tileLightIndices[]
)
{
uniform float minZ, maxZ;
ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
minZ, maxZ);
uniform int32 tileNumLights = IntersectLightsWithTileMinMax(
tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
numLights, light_positionView_x_array, light_positionView_y_array,
light_positionView_z_array, light_attenuationEnd_array,
tileLightIndices);
return tileNumLights;
}
export void
ShadeTile(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
uniform InputDataArrays &inputData,
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
uniform float cameraProj_33, uniform float cameraProj_43,
// Light list
uniform int32 tileLightIndices[],
uniform int32 tileNumLights,
// UI
uniform bool visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]
)
{
if (tileNumLights == 0 || visualizeLightCount) {
uniform unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
foreach (x = tileStartX ... tileEndX)
{
int32 framebufferIndex = (y * gBufferWidth + x);
framebuffer_r[framebufferIndex] = c;
framebuffer_g[framebufferIndex] = c;
framebuffer_b[framebufferIndex] = c;
}
}
} else {
uniform float twoOverGBufferWidth = 2.0f / gBufferWidth;
uniform float twoOverGBufferHeight = 2.0f / gBufferHeight;
for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
uniform float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
foreach (x = tileStartX ... tileEndX) {
int32 gBufferOffset = y * gBufferWidth + x;
// Reconstruct position and (negative) view vector from G-buffer
float surface_positionView_x, surface_positionView_y, surface_positionView_z;
float Vneg_x, Vneg_y, Vneg_z;
float z = inputData.zBuffer[gBufferOffset];
// Compute screen/clip-space position
// NOTE: Mind DX11 viewport transform and pixel center!
float positionScreen_x = (0.5f + (float)(x)) *
twoOverGBufferWidth - 1.0f;
// Unproject depth buffer Z value into view space
surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
surface_positionView_x = positionScreen_x * surface_positionView_z /
cameraProj_11;
surface_positionView_y = positionScreen_y * surface_positionView_z /
cameraProj_22;
// We actually end up with a vector pointing *at* the
// surface (i.e. the negative view vector)
normalize3(surface_positionView_x, surface_positionView_y,
surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
// Reconstruct normal from G-buffer
float surface_normal_x, surface_normal_y, surface_normal_z;
float normal_x = half_to_float(inputData.normalEncoded_x[gBufferOffset]);
float normal_y = half_to_float(inputData.normalEncoded_y[gBufferOffset]);
float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
float m = sqrt(4.0f * f - 1.0f);
surface_normal_x = m * (4.0f * normal_x - 2.0f);
surface_normal_y = m * (4.0f * normal_y - 2.0f);
surface_normal_z = 3.0f - 8.0f * f;
// Load other G-buffer parameters
float surface_specularAmount =
half_to_float(inputData.specularAmount[gBufferOffset]);
float surface_specularPower =
half_to_float(inputData.specularPower[gBufferOffset]);
float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
float lit_x = 0.0f;
float lit_y = 0.0f;
float lit_z = 0.0f;
for (uniform int32 tileLightIndex = 0; tileLightIndex < tileNumLights;
++tileLightIndex) {
uniform int32 lightIndex = tileLightIndices[tileLightIndex];
// Gather light data relevant to initial culling
uniform float light_positionView_x =
inputData.lightPositionView_x[lightIndex];
uniform float light_positionView_y =
inputData.lightPositionView_y[lightIndex];
uniform float light_positionView_z =
inputData.lightPositionView_z[lightIndex];
uniform float light_attenuationEnd =
inputData.lightAttenuationEnd[lightIndex];
// Compute light vector
float L_x = light_positionView_x - surface_positionView_x;
float L_y = light_positionView_y - surface_positionView_y;
float L_z = light_positionView_z - surface_positionView_z;
float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
// Clip at end of attenuation
float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
cif (distanceToLight2 < light_attenuationEnd2) {
float distanceToLight = sqrt(distanceToLight2);
// HLSL "rcp" is allowed to be fairly inaccurate
float distanceToLightRcp = rcp(distanceToLight);
L_x *= distanceToLightRcp;
L_y *= distanceToLightRcp;
L_z *= distanceToLightRcp;
// Start computing brdf
float NdotL = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, L_x, L_y, L_z);
// Clip back facing
cif (NdotL > 0.0f) {
uniform float light_attenuationBegin =
inputData.lightAttenuationBegin[lightIndex];
// Light distance attenuation (linstep)
float lightRange = (light_attenuationEnd - light_attenuationBegin);
float falloffPosition = (light_attenuationEnd - distanceToLight);
float attenuation = min(falloffPosition / lightRange, 1.0f);
float H_x = (L_x - Vneg_x);
float H_y = (L_y - Vneg_y);
float H_z = (L_z - Vneg_z);
normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
float NdotH = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, H_x, H_y, H_z);
NdotH = max(NdotH, 0.0f);
float specular = pow(NdotH, surface_specularPower);
float specularNorm = (surface_specularPower + 2.0f) *
(1.0f / 8.0f);
float specularContrib = surface_specularAmount *
specularNorm * specular;
float k = attenuation * NdotL * (1.0f + specularContrib);
uniform float light_color_x = inputData.lightColor_x[lightIndex];
uniform float light_color_y = inputData.lightColor_y[lightIndex];
uniform float light_color_z = inputData.lightColor_z[lightIndex];
float lightContrib_x = surface_albedo_x * light_color_x;
float lightContrib_y = surface_albedo_y * light_color_y;
float lightContrib_z = surface_albedo_z * light_color_z;
lit_x += lightContrib_x * k;
lit_y += lightContrib_y * k;
lit_z += lightContrib_z * k;
}
}
}
// Gamma correct
// These pows are pretty slow right now, but we can do
// something faster if really necessary to squeeze every
// last bit of performance out of it
float gamma = 1.0 / 2.2f;
lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
}
}
}
}
///////////////////////////////////////////////////////////////////////////
// Static decomposition
task void
RenderTile(uniform int num_groups_x, uniform int num_groups_y,
uniform InputHeader &inputHeader,
uniform InputDataArrays &inputData,
uniform int visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]) {
uniform int32 group_y = taskIndex / num_groups_x;
uniform int32 group_x = taskIndex % num_groups_x;
uniform int32 tile_start_x = group_x * MIN_TILE_WIDTH;
uniform int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
uniform int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
uniform int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
uniform int framebufferWidth = inputHeader.framebufferWidth;
uniform int framebufferHeight = inputHeader.framebufferHeight;
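// Note: these locals use 0-based matrix indices, while the callee
// parameters they are passed to use 1-based names (cameraProj_11, ...,
// cameraProj_43).
uniform float cameraProj_00 = inputHeader.cameraProj[0][0];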
uniform float cameraProj_11 = inputHeader.cameraProj[1][1];
uniform float cameraProj_22 = inputHeader.cameraProj[2][2];
uniform float cameraProj_32 = inputHeader.cameraProj[3][2];
// Light intersection: figure out which lights illuminate this tile.
uniform int tileLightIndices[MAX_LIGHTS]; // Light list for the tile
uniform int numTileLights =
IntersectLightsWithTile(tile_start_x, tile_end_x,
tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight,
inputData.zBuffer,
cameraProj_00, cameraProj_11,
cameraProj_22, cameraProj_32,
inputHeader.cameraNear, inputHeader.cameraFar,
MAX_LIGHTS,
inputData.lightPositionView_x,
inputData.lightPositionView_y,
inputData.lightPositionView_z,
inputData.lightAttenuationEnd,
tileLightIndices);
// And now shade the tile, using the lights in tileLightIndices
ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight, inputData,
cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
tileLightIndices, numTileLights, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
}
export void
RenderStatic(uniform InputHeader &inputHeader,
uniform InputDataArrays &inputData,
uniform int visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]) {
uniform int num_groups_x = (inputHeader.framebufferWidth +
MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
uniform int num_groups_y = (inputHeader.framebufferHeight +
MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
uniform int num_groups = num_groups_x * num_groups_y;
// Launch a task to render each tile, each of which is MIN_TILE_WIDTH
// by MIN_TILE_HEIGHT pixels.
launch[num_groups] RenderTile(num_groups_x, num_groups_y,
inputHeader, inputData, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
}
///////////////////////////////////////////////////////////////////////////
// Routines for dynamic decomposition path
// This computes the z min/max range for a whole row worth of tiles.
export void
ComputeZBoundsRow(
uniform int32 tileY,
uniform int32 tileWidth, uniform int32 tileHeight,
uniform int32 numTilesX, uniform int32 numTilesY,
// G-buffer data
uniform float zBuffer[],
uniform int32 gBufferWidth,
// Camera data
uniform float cameraProj_33, uniform float cameraProj_43,
uniform float cameraNear, uniform float cameraFar,
// Output
uniform float minZArray[],
uniform float maxZArray[]
)
{
for (uniform int32 tileX = 0; tileX < numTilesX; ++tileX) {
uniform float minZ, maxZ;
ComputeZBounds(
tileX * tileWidth, tileX * tileWidth + tileWidth,
tileY * tileHeight, tileY * tileHeight + tileHeight,
zBuffer, gBufferWidth,
cameraProj_33, cameraProj_43, cameraNear, cameraFar,
minZ, maxZ);
minZArray[tileX] = minZ;
maxZArray[tileX] = maxZ;
}
}
// Reclassifies the lights with respect to four sub-tiles when we refine a tile.
// numLights need not be a multiple of programCount here, but the input and output arrays
// must be large enough that programCount-wide loads and stores stay in bounds.
export void
SplitTileMinMax(
uniform int32 tileMidX, uniform int32 tileMidY,
// Subtile data (00, 10, 01, 11)
uniform float subtileMinZ[],
uniform float subtileMaxZ[],
// G-buffer data
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
// Light Data
uniform int32 lightIndices[],
uniform int32 numLights,
uniform float light_positionView_x_array[],
uniform float light_positionView_y_array[],
uniform float light_positionView_z_array[],
uniform float light_attenuationEnd_array[],
// Outputs
uniform int32 subtileIndices[],
uniform int32 subtileIndicesPitch,
uniform int32 subtileNumLights[]
)
{
uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
uniform float frustumPlanes_xy[2] = { -(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y) };
uniform float frustumPlanes_z[2] = { tileMidX - gBufferScale_x,
tileMidY - gBufferScale_y };
// Normalize
uniform float norm[2] = { rsqrt(frustumPlanes_xy[0] * frustumPlanes_xy[0] +
frustumPlanes_z[0] * frustumPlanes_z[0]),
rsqrt(frustumPlanes_xy[1] * frustumPlanes_xy[1] +
frustumPlanes_z[1] * frustumPlanes_z[1]) };
frustumPlanes_xy[0] *= norm[0];
frustumPlanes_xy[1] *= norm[1];
frustumPlanes_z[0] *= norm[0];
frustumPlanes_z[1] *= norm[1];
// Initialize
uniform int32 subtileLightOffset[4];
subtileLightOffset[0] = 0 * subtileIndicesPitch;
subtileLightOffset[1] = 1 * subtileIndicesPitch;
subtileLightOffset[2] = 2 * subtileIndicesPitch;
subtileLightOffset[3] = 3 * subtileIndicesPitch;
foreach (i = 0 ... numLights) {
int32 lightIndex = lightIndices[i];
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
// Test lights against subtile z bounds
bool inFrustum[4];
inFrustum[0] = (light_positionView_z - subtileMinZ[0] >= light_attenuationEndNeg) &&
(subtileMaxZ[0] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[1] = (light_positionView_z - subtileMinZ[1] >= light_attenuationEndNeg) &&
(subtileMaxZ[1] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[2] = (light_positionView_z - subtileMinZ[2] >= light_attenuationEndNeg) &&
(subtileMaxZ[2] - light_positionView_z >= light_attenuationEndNeg);
inFrustum[3] = (light_positionView_z - subtileMinZ[3] >= light_attenuationEndNeg) &&
(subtileMaxZ[3] - light_positionView_z >= light_attenuationEndNeg);
float dx = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
float dy = light_positionView_z * frustumPlanes_z[1] +
light_positionView_y * frustumPlanes_xy[1];
cif (abs(dx) > light_attenuationEnd) {
bool positiveX = dx > 0.0f;
inFrustum[0] = inFrustum[0] && positiveX; // 00 subtile
inFrustum[1] = inFrustum[1] && !positiveX; // 10 subtile
inFrustum[2] = inFrustum[2] && positiveX; // 01 subtile
inFrustum[3] = inFrustum[3] && !positiveX; // 11 subtile
}
cif (abs(dy) > light_attenuationEnd) {
bool positiveY = dy > 0.0f;
inFrustum[0] = inFrustum[0] && positiveY; // 00 subtile
inFrustum[1] = inFrustum[1] && positiveY; // 10 subtile
inFrustum[2] = inFrustum[2] && !positiveY; // 01 subtile
inFrustum[3] = inFrustum[3] && !positiveY; // 11 subtile
}
// Pack and store intersecting lights
// TODO: Experiment with a loop here instead
cif (inFrustum[0])
subtileLightOffset[0] +=
packed_store_active(&subtileIndices[subtileLightOffset[0]],
lightIndex);
cif (inFrustum[1])
subtileLightOffset[1] +=
packed_store_active(&subtileIndices[subtileLightOffset[1]],
lightIndex);
cif (inFrustum[2])
subtileLightOffset[2] +=
packed_store_active(&subtileIndices[subtileLightOffset[2]],
lightIndex);
cif (inFrustum[3])
subtileLightOffset[3] +=
packed_store_active(&subtileIndices[subtileLightOffset[3]],
lightIndex);
}
subtileNumLights[0] = subtileLightOffset[0] - 0 * subtileIndicesPitch;
subtileNumLights[1] = subtileLightOffset[1] - 1 * subtileIndicesPitch;
subtileNumLights[2] = subtileLightOffset[2] - 2 * subtileIndicesPitch;
subtileNumLights[3] = subtileLightOffset[3] - 3 * subtileIndicesPitch;
}
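SplitTileMinMax keeps one running offset per subtile, each starting at k * subtileIndicesPitch, and advances it by the count returned from packed_store_active. The following is a minimal scalar C++ sketch of that compaction step only. The helper name packStoreActive and the use of std::vector are assumptions made for illustration; they are not part of ispc or of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the compaction step: every value whose flag is set is
// written contiguously starting at out[offset]; the number of values written
// is returned so the caller can advance its running offset.
// packStoreActive is a hypothetical helper, not an ispc API.
inline std::int32_t packStoreActive(std::vector<std::int32_t>& out,
                                    std::size_t offset,
                                    const std::vector<std::int32_t>& values,
                                    const std::vector<bool>& active) {
    assert(values.size() == active.size());
    std::int32_t written = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (active[lane]) {
            // The destination must have room for every active lane.
            assert(offset + written < out.size());
            out[offset + written] = values[lane];
            ++written;
        }
    }
    return written;
}
```

Calling it once per subtile with offset = k * pitch + count[k], and adding the return value to count[k], would reproduce the four-list layout; the final counts correspond to subtileNumLights.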


@@ -1,556 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef __NVPTX__
#warning "emitting DEVICE code"
#define programCount warpSize()
#define programIndex laneIndex()
#define taskIndex blockIndex0()
#define taskCount blockCount0()
#define cif if
#else
#warning "emitting HOST code"
#endif
#include "deferred.h"
struct InputDataArrays
{
float *zBuffer;
unsigned int16 *normalEncoded_x; // half float
unsigned int16 *normalEncoded_y; // half float
unsigned int16 *specularAmount; // half float
unsigned int16 *specularPower; // half float
unsigned int8 *albedo_x; // unorm8
unsigned int8 *albedo_y; // unorm8
unsigned int8 *albedo_z; // unorm8
float *lightPositionView_x;
float *lightPositionView_y;
float *lightPositionView_z;
float *lightAttenuationBegin;
float *lightColor_x;
float *lightColor_y;
float *lightColor_z;
float *lightAttenuationEnd;
};
struct InputHeader
{
float cameraProj[4][4];
float cameraNear;
float cameraFar;
int32 framebufferWidth;
int32 framebufferHeight;
int32 numLights;
int32 inputDataChunkSize;
int32 inputDataArrayOffsets[idaNum];
};
///////////////////////////////////////////////////////////////////////////
// Common utility routines
static inline float
dot3(float x, float y, float z, float a, float b, float c) {
return (x*a + y*b + z*c);
}
static inline void
normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
float n = rsqrt(x*x + y*y + z*z);
ox = x * n;
oy = y * n;
oz = z * n;
}
static inline float
Unorm8ToFloat32(unsigned int8 u) {
return (float)u * (1.0f / 255.0f);
}
static inline unsigned int8
Float32ToUnorm8(float f) {
return (unsigned int8)(f * 255.0f);
}
static inline void
ComputeZBounds(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
// G-buffer data
uniform float zBuffer[],
uniform int32 gBufferWidth,
// Camera data
uniform float cameraProj_33, uniform float cameraProj_43,
uniform float cameraNear, uniform float cameraFar,
// Output
uniform float &minZ,
uniform float &maxZ
)
{
// Find Z bounds
float laneMinZ = cameraFar;
float laneMaxZ = cameraNear;
for (uniform int32 y = tileStartY; y < tileEndY; ++y)
foreach (x = tileStartX ... tileEndX)
{
// Unproject depth buffer Z value into view space
float z = zBuffer[y * gBufferWidth + x];
float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
// Work out Z bounds for our samples
// Avoid considering skybox/background or otherwise invalid pixels
if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
laneMinZ = min(laneMinZ, viewSpaceZ);
laneMaxZ = max(laneMaxZ, viewSpaceZ);
}
}
minZ = reduce_min(laneMinZ);
maxZ = reduce_max(laneMaxZ);
}
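ComputeZBounds unprojects each depth sample with viewSpaceZ = cameraProj_43 / (z - cameraProj_33), ignores samples outside [cameraNear, cameraFar), and reduces the per-lane minima and maxima. A scalar C++ sketch of the same logic may help when comparing against the vectorized version. The struct ZBounds and the function name computeZBoundsScalar are assumptions for illustration, and the sketch walks a flat list of samples rather than a tile rectangle.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ZBounds { float minZ; float maxZ; };  // hypothetical result type

// Scalar reference for the depth-bounds pass over a list of depth samples.
inline ZBounds computeZBoundsScalar(const std::vector<float>& zBuffer,
                                    float proj33, float proj43,
                                    float cameraNear, float cameraFar) {
    assert(cameraNear < cameraFar);
    // Start with an inverted range, as the listing does.
    ZBounds b{cameraFar, cameraNear};
    for (float z : zBuffer) {
        // Unproject the depth-buffer value into view space.
        float viewZ = proj43 / (z - proj33);
        // Skip background or otherwise invalid samples.
        if (viewZ < cameraFar && viewZ >= cameraNear) {
            b.minZ = std::min(b.minZ, viewZ);
            b.maxZ = std::max(b.maxZ, viewZ);
        }
    }
    return b;
}
```

If no sample passes the range test, the result stays inverted (minZ equal to cameraFar, maxZ equal to cameraNear), which is also what the listing would produce for a tile that contains only background.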
static inline uniform int32
IntersectLightsWithTileMinMax(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
// Tile data
uniform float minZ,
uniform float maxZ,
// G-buffer data
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
// Light Data
uniform int32 numLights,
uniform float light_positionView_x_array[],
uniform float light_positionView_y_array[],
uniform float light_positionView_z_array[],
uniform float light_attenuationEnd_array[],
// Output
uniform int32 tileLightIndices[]
)
{
uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
uniform float frustumPlanes_xy[4] = {
-(cameraProj_11 * gBufferScale_x),
(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y),
-(cameraProj_22 * gBufferScale_y) };
uniform float frustumPlanes_z[4] = {
tileEndX - gBufferScale_x,
-tileStartX + gBufferScale_x,
tileEndY - gBufferScale_y,
-tileStartY + gBufferScale_y };
for (uniform int i = 0; i < 4; ++i) {
uniform float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] +
frustumPlanes_z[i] * frustumPlanes_z[i]);
frustumPlanes_xy[i] *= norm;
frustumPlanes_z[i] *= norm;
}
uniform int32 tileNumLights = 0;
foreach (lightIndex = 0 ... numLights)
{
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
float d = light_positionView_z - minZ;
bool inFrustum = (d >= light_attenuationEndNeg);
d = maxZ - light_positionView_z;
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// This seems better than cif(!inFrustum) ccontinue; here since we
// don't actually need to mask the rest of this function - this is
// just a greedy early-out. Could also structure all of this as
// nested if() statements, but this is a bit easier to read
if (any(inFrustum))
{
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
d = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[1] +
light_positionView_x * frustumPlanes_xy[1];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[2] +
light_positionView_y * frustumPlanes_xy[2];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[3] +
light_positionView_y * frustumPlanes_xy[3];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// Pack and store intersecting lights
const bool active = inFrustum && lightIndex < numLights;
if(any(active))
tileNumLights += packed_store_active(active, &tileLightIndices[tileNumLights], lightIndex);
}
}
return tileNumLights;
}
static inline uniform int32
IntersectLightsWithTile(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
// G-buffer data
uniform float zBuffer[],
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
uniform float cameraProj_33, uniform float cameraProj_43,
uniform float cameraNear, uniform float cameraFar,
// Light Data
uniform int32 numLights,
uniform float light_positionView_x_array[],
uniform float light_positionView_y_array[],
uniform float light_positionView_z_array[],
uniform float light_attenuationEnd_array[],
// Output
uniform int32 tileLightIndices[]
)
{
uniform float minZ, maxZ;
ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
minZ, maxZ);
uniform int32 tileNumLights = IntersectLightsWithTileMinMax(
tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
numLights, light_positionView_x_array, light_positionView_y_array,
light_positionView_z_array, light_attenuationEnd_array,
tileLightIndices);
return tileNumLights;
}
static inline void
ShadeTile(
uniform int32 tileStartX, uniform int32 tileEndX,
uniform int32 tileStartY, uniform int32 tileEndY,
uniform int32 gBufferWidth, uniform int32 gBufferHeight,
const uniform InputDataArrays &inputData,
// Camera data
uniform float cameraProj_11, uniform float cameraProj_22,
uniform float cameraProj_33, uniform float cameraProj_43,
// Light list
uniform int32 tileLightIndices[],
uniform int32 tileNumLights,
// UI
uniform bool visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]
)
{
if (tileNumLights == 0 || visualizeLightCount) {
uniform unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
for (uniform int32 y = tileStartY; y < tileEndY; ++y)
foreach (x = tileStartX ... tileEndX)
{
int32 framebufferIndex = (y * gBufferWidth + x);
framebuffer_r[framebufferIndex] = c;
framebuffer_g[framebufferIndex] = c;
framebuffer_b[framebufferIndex] = c;
}
} else {
uniform float twoOverGBufferWidth = 2.0f / gBufferWidth;
uniform float twoOverGBufferHeight = 2.0f / gBufferHeight;
for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
uniform float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
foreach (x = tileStartX ... tileEndX) {
int32 gBufferOffset = y * gBufferWidth + x;
// Reconstruct position and (negative) view vector from G-buffer
float surface_positionView_x, surface_positionView_y, surface_positionView_z;
float Vneg_x, Vneg_y, Vneg_z;
float z = inputData.zBuffer[gBufferOffset];
// Compute screen/clip-space position
// NOTE: Mind DX11 viewport transform and pixel center!
float positionScreen_x = (0.5f + (float)(x)) *
twoOverGBufferWidth - 1.0f;
// Unproject depth buffer Z value into view space
surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
surface_positionView_x = positionScreen_x * surface_positionView_z /
cameraProj_11;
surface_positionView_y = positionScreen_y * surface_positionView_z /
cameraProj_22;
// We actually end up with a vector pointing *at* the
// surface (i.e. the negative view vector)
normalize3(surface_positionView_x, surface_positionView_y,
surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
// Reconstruct normal from G-buffer
float surface_normal_x, surface_normal_y, surface_normal_z;
float normal_x = half_to_float(inputData.normalEncoded_x[gBufferOffset]);
float normal_y = half_to_float(inputData.normalEncoded_y[gBufferOffset]);
float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
float m = sqrt(4.0f * f - 1.0f);
surface_normal_x = m * (4.0f * normal_x - 2.0f);
surface_normal_y = m * (4.0f * normal_y - 2.0f);
surface_normal_z = 3.0f - 8.0f * f;
// Load other G-buffer parameters
float surface_specularAmount =
half_to_float(inputData.specularAmount[gBufferOffset]);
float surface_specularPower =
half_to_float(inputData.specularPower[gBufferOffset]);
float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
float lit_x = 0.0f;
float lit_y = 0.0f;
float lit_z = 0.0f;
for (uniform int32 tileLightIndex = 0; tileLightIndex < tileNumLights;
++tileLightIndex) {
uniform int32 lightIndex = tileLightIndices[tileLightIndex];
// Gather light data relevant to initial culling
uniform float light_positionView_x =
inputData.lightPositionView_x[lightIndex];
uniform float light_positionView_y =
inputData.lightPositionView_y[lightIndex];
uniform float light_positionView_z =
inputData.lightPositionView_z[lightIndex];
uniform float light_attenuationEnd =
inputData.lightAttenuationEnd[lightIndex];
// Compute light vector
float L_x = light_positionView_x - surface_positionView_x;
float L_y = light_positionView_y - surface_positionView_y;
float L_z = light_positionView_z - surface_positionView_z;
float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
// Clip at end of attenuation
float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
cif (distanceToLight2 < light_attenuationEnd2) {
float distanceToLight = sqrt(distanceToLight2);
// HLSL "rcp" is allowed to be fairly inaccurate
float distanceToLightRcp = rcp(distanceToLight);
L_x *= distanceToLightRcp;
L_y *= distanceToLightRcp;
L_z *= distanceToLightRcp;
// Start computing brdf
float NdotL = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, L_x, L_y, L_z);
// Clip back facing
cif (NdotL > 0.0f) {
uniform float light_attenuationBegin =
inputData.lightAttenuationBegin[lightIndex];
// Light distance attenuation (linstep)
float lightRange = (light_attenuationEnd - light_attenuationBegin);
float falloffPosition = (light_attenuationEnd - distanceToLight);
float attenuation = min(falloffPosition / lightRange, 1.0f);
float H_x = (L_x - Vneg_x);
float H_y = (L_y - Vneg_y);
float H_z = (L_z - Vneg_z);
normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
float NdotH = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, H_x, H_y, H_z);
NdotH = max(NdotH, 0.0f);
float specular = pow(NdotH, surface_specularPower);
float specularNorm = (surface_specularPower + 2.0f) *
(1.0f / 8.0f);
float specularContrib = surface_specularAmount *
specularNorm * specular;
float k = attenuation * NdotL * (1.0f + specularContrib);
uniform float light_color_x = inputData.lightColor_x[lightIndex];
uniform float light_color_y = inputData.lightColor_y[lightIndex];
uniform float light_color_z = inputData.lightColor_z[lightIndex];
float lightContrib_x = surface_albedo_x * light_color_x;
float lightContrib_y = surface_albedo_y * light_color_y;
float lightContrib_z = surface_albedo_z * light_color_z;
lit_x += lightContrib_x * k;
lit_y += lightContrib_y * k;
lit_z += lightContrib_z * k;
}
}
}
// Gamma correct
// These pows are pretty slow right now, but we can do
// something faster if really necessary to squeeze every
// last bit of performance out of it
float gamma = 1.0 / 2.2f;
lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
}
}
}
}
///////////////////////////////////////////////////////////////////////////
// Static decomposition
void task
RenderTile(uniform int num_groups_x, uniform int num_groups_y,
const uniform InputHeader inputHeaderPtr[],
const uniform InputDataArrays inputDataPtr[],
uniform int visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]) {
if (taskIndex >= taskCount) return;
const uniform InputHeader inputHeader = *inputHeaderPtr;
const uniform InputDataArrays inputData = *inputDataPtr;
uniform int32 group_y = taskIndex / num_groups_x;
uniform int32 group_x = taskIndex % num_groups_x;
uniform int32 tile_start_x = group_x * MIN_TILE_WIDTH;
uniform int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
uniform int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
uniform int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
uniform int framebufferWidth = inputHeader.framebufferWidth;
uniform int framebufferHeight = inputHeader.framebufferHeight;
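// Note: these names use 0-based matrix indices, while the callee parameters
// (cameraProj_11, _22, _33, _43) use 1-based names for the same entries.
uniform float cameraProj_00 = inputHeader.cameraProj[0][0];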
uniform float cameraProj_11 = inputHeader.cameraProj[1][1];
uniform float cameraProj_22 = inputHeader.cameraProj[2][2];
uniform float cameraProj_32 = inputHeader.cameraProj[3][2];
// Light intersection: figure out which lights illuminate this tile.
#if 1
uniform int * uniform tileLightIndices = uniform new uniform int [MAX_LIGHTS];
#else
uniform int tileLightIndices[MAX_LIGHTS]; // Light list for the tile
#endif
#if 1
uniform int numTileLights =
IntersectLightsWithTile(tile_start_x, tile_end_x,
tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight,
inputData.zBuffer,
cameraProj_00, cameraProj_11,
cameraProj_22, cameraProj_32,
inputHeader.cameraNear, inputHeader.cameraFar,
MAX_LIGHTS,
inputData.lightPositionView_x,
inputData.lightPositionView_y,
inputData.lightPositionView_z,
inputData.lightAttenuationEnd,
tileLightIndices);
// And now shade the tile, using the lights in tileLightIndices
ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight, inputData,
cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
tileLightIndices, numTileLights, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
#endif
#if 1
delete tileLightIndices;
#endif
}
export void
RenderStatic(uniform InputHeader inputHeaderPtr[],
uniform InputDataArrays inputDataPtr[],
uniform int visualizeLightCount,
// Output
uniform unsigned int8 framebuffer_r[],
uniform unsigned int8 framebuffer_g[],
uniform unsigned int8 framebuffer_b[]) {
const uniform InputHeader inputHeader = *inputHeaderPtr;
const uniform InputDataArrays inputData = *inputDataPtr;
uniform int num_groups_x = (inputHeader.framebufferWidth +
MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
uniform int num_groups_y = (inputHeader.framebufferHeight +
MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
uniform int num_groups = num_groups_x * num_groups_y;
// Launch a task to render each tile, each of which is MIN_TILE_WIDTH
// by MIN_TILE_HEIGHT pixels.
launch[num_groups] RenderTile(num_groups_x, num_groups_y,
inputHeaderPtr, inputDataPtr, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
sync;
}
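RenderStatic rounds the framebuffer size up to whole tiles and launches one task per tile; RenderTile then recovers the tile rectangle from taskIndex. A small C++ sketch of that index arithmetic follows. The names ceilDiv, TileRect and tileForTask are assumptions for illustration, and the tile size of 16 in the usage note is only an example value, since deferred.h is not shown here.

```cpp
#include <cassert>

struct TileRect { int startX, endX, startY, endY; };  // hypothetical type

// Number of tiles needed to cover n pixels with tiles of size d.
inline int ceilDiv(int n, int d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Maps a linear task index to its tile rectangle, row-major over tiles.
// Like the listing, this does not clamp the end coordinates to the
// framebuffer, so it assumes dimensions that are multiples of the tile size.
inline TileRect tileForTask(int taskIndex, int fbWidth, int tileW, int tileH) {
    int groupsX = ceilDiv(fbWidth, tileW);
    int groupX = taskIndex % groupsX;
    int groupY = taskIndex / groupsX;
    return { groupX * tileW, groupX * tileW + tileW,
             groupY * tileH, groupY * tileH + tileH };
}
```

For a 1280-pixel-wide framebuffer and 16x16 tiles there are 80 tiles per row, so task 81 maps to the second tile of the second row.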


@@ -1,659 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "deferred.h"
#include <stdio.h>
#include <assert.h>
#define programCount 32
#define programIndex (threadIdx.x & 31)
#define taskIndex (blockIdx.x*4 + (threadIdx.x >> 5))
#define taskCount (gridDim.x*4)
#define warpIdx (threadIdx.x >> 5)
#define int32 int
#define int16 short
#define int8 char
__device__ static inline float clamp(float v, float low, float high)
{
return min(max(v, low), high);
}
struct InputDataArrays
{
float *zBuffer;
unsigned int16 *normalEncoded_x; // half float
unsigned int16 *normalEncoded_y; // half float
unsigned int16 *specularAmount; // half float
unsigned int16 *specularPower; // half float
unsigned int8 *albedo_x; // unorm8
unsigned int8 *albedo_y; // unorm8
unsigned int8 *albedo_z; // unorm8
float *lightPositionView_x;
float *lightPositionView_y;
float *lightPositionView_z;
float *lightAttenuationBegin;
float *lightColor_x;
float *lightColor_y;
float *lightColor_z;
float *lightAttenuationEnd;
};
struct InputHeader
{
float cameraProj[4][4];
float cameraNear;
float cameraFar;
int32 framebufferWidth;
int32 framebufferHeight;
int32 numLights;
int32 inputDataChunkSize;
int32 inputDataArrayOffsets[idaNum];
};
///////////////////////////////////////////////////////////////////////////
// Common utility routines
__device__
static inline float
dot3(float x, float y, float z, float a, float b, float c) {
return (x*a + y*b + z*c);
}
#if 0
template<typename T, int N>
struct Uniform
{
static __shared__ T shdata[128];
T data[(N-1)/programCount+1];
__device__ inline const T get(const int i) const
{
const int elemIdx = i & (programCount-1);
const int chunkIdx = i >> 5;
return __shfl(data[chunkIdx], elemIdx);
}
__device__ inline void set(const int i, const T value) const
{
const int elemIdx = i & (programCount-1);
const int chunkIdx = i >> 5;
shdata[elemIdx] = value;
data[chunkIdx] = shdata[programIndex];
}
}
#endif
__device__
static inline void
normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
float n = rsqrt(x*x + y*y + z*z);
ox = x * n;
oy = y * n;
oz = z * n;
}
__device__ inline
static float reduce_min(float value)
{
#pragma unroll
for (int i = 4; i >=0; i--)
value = min(value, __shfl_xor(value, 1<<i, 32));
return value;
}
__device__ inline
static float reduce_max(float value)
{
#pragma unroll
for (int i = 4; i >=0; i--)
value = max(value, __shfl_xor(value, 1<<i, 32));
return value;
}
__device__ inline
static int reduce_sum(int value)
{
#pragma unroll
for (int i = 4; i >=0; i--)
value += __shfl_xor(value, 1<<i, 32);
return value;
}
static __device__ __forceinline__ uint shfl_scan_add_step(uint partial, uint up_offset)
{
uint result;
asm(
"{.reg .u32 r0;"
".reg .pred p;"
"shfl.up.b32 r0|p, %1, %2, 0;"
"@p add.u32 r0, r0, %3;"
"mov.u32 %0, r0;}"
: "=r"(result) : "r"(partial), "r"(up_offset), "r"(partial));
return result;
}
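// Despite the name, this returns the exclusive prefix sum: the inclusive
// scan result minus the lane's own value.
static __device__ __forceinline__ int inclusive_scan_warp(const int value)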
{
uint sum = value;
#pragma unroll
for(int i = 0; i < 5; ++i)
sum = shfl_scan_add_step(sum, 1 << i);
return sum - value;
}
static __device__ __forceinline__ int lanemask_lt()
{
int mask;
asm("mov.u32 %0, %%lanemask_lt;" : "=r" (mask));
return mask;
}
static __device__ __forceinline__ int2 warpBinExclusiveScan(const bool p)
{
const unsigned int b = __ballot(p);
return make_int2(__popc(b & lanemask_lt()), __popc(b));
}
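warpBinExclusiveScan combines a warp ballot with a population count: the bits below the calling lane give that lane's exclusive prefix, and the full mask gives the warp total. The host-side C++ sketch below emulates this for a single lane so the arithmetic can be checked without a GPU. The names ScanResult and warpBinExclusiveScanHost are assumptions for illustration, and the ballot mask is passed in explicitly instead of being produced by __ballot.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

struct ScanResult { int exclusivePrefix; int total; };  // hypothetical type

// Emulates the ballot/popcount scan for one lane of a 32-lane warp.
// Bit i of `ballot` is the predicate of lane i.
inline ScanResult warpBinExclusiveScanHost(std::uint32_t ballot, int lane) {
    assert(lane >= 0 && lane < 32);
    // Mask with a 1 for every lane strictly below `lane`.
    std::uint32_t lanemaskLt = (1u << lane) - 1u;
    int prefix = static_cast<int>(std::bitset<32>(ballot & lanemaskLt).count());
    int total = static_cast<int>(std::bitset<32>(ballot).count());
    return {prefix, total};
}
```

Each active lane can then write to base + exclusivePrefix, and the base advances by total, which is the pattern the light-list compaction in this file relies on.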
__device__
static inline float
Unorm8ToFloat32(unsigned int8 u) {
return (float)u * (1.0f / 255.0f);
}
__device__
static inline unsigned int8
Float32ToUnorm8(float f) {
return (unsigned int8)(f * 255.0f);
}
__device__
static inline void
ComputeZBounds(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
// G-buffer data
float zBuffer[],
int32 gBufferWidth,
// Camera data
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Output
float &minZ,
float &maxZ
)
{
// Find Z bounds
float laneMinZ = cameraFar;
float laneMaxZ = cameraNear;
for ( int32 y = tileStartY; y < tileEndY; ++y) {
for ( int xb = tileStartX; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
if (x >= tileEndX) break;
// Unproject depth buffer Z value into view space
float z = zBuffer[y * gBufferWidth + x];
float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
// Work out Z bounds for our samples
// Avoid considering skybox/background or otherwise invalid pixels
if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
laneMinZ = min(laneMinZ, viewSpaceZ);
laneMaxZ = max(laneMaxZ, viewSpaceZ);
}
}
}
minZ = reduce_min(laneMinZ);
maxZ = reduce_max(laneMaxZ);
}
__device__
static inline int32
IntersectLightsWithTileMinMax(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
// Tile data
float minZ,
float maxZ,
// G-buffer data
int32 gBufferWidth, int32 gBufferHeight,
// Camera data
float cameraProj_11, float cameraProj_22,
// Light Data
int32 numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Output
int32 tileLightIndices[]
)
{
float gBufferScale_x = 0.5f * (float)gBufferWidth;
float gBufferScale_y = 0.5f * (float)gBufferHeight;
float frustumPlanes_xy[4] = {
-(cameraProj_11 * gBufferScale_x),
(cameraProj_11 * gBufferScale_x),
(cameraProj_22 * gBufferScale_y),
-(cameraProj_22 * gBufferScale_y) };
float frustumPlanes_z[4] = {
tileEndX - gBufferScale_x,
-tileStartX + gBufferScale_x,
tileEndY - gBufferScale_y,
-tileStartY + gBufferScale_y };
for ( int i = 0; i < 4; ++i) {
float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] +
frustumPlanes_z[i] * frustumPlanes_z[i]);
frustumPlanes_xy[i] *= norm;
frustumPlanes_z[i] *= norm;
}
int32 tileNumLights = 0;
for ( int lightIndexB = 0; lightIndexB < numLights; lightIndexB += programCount)
{
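// NOTE: in the last iteration lightIndex can be >= numLights. The loads below
// assume the light arrays are padded to a multiple of programCount; such lanes
// are deactivated before the store.
const int lightIndex = lightIndexB + programIndex;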
float light_positionView_z = light_positionView_z_array[lightIndex];
float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
float light_attenuationEndNeg = -light_attenuationEnd;
float d = light_positionView_z - minZ;
bool inFrustum = (d >= light_attenuationEndNeg);
d = maxZ - light_positionView_z;
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// This seems better than cif(!inFrustum) ccontinue; here since we
// don't actually need to mask the rest of this function - this is
// just a greedy early-out. Could also structure all of this as
// nested if() statements, but this is a bit easier to read
int active = 0;
if ((inFrustum)) {
float light_positionView_x = light_positionView_x_array[lightIndex];
float light_positionView_y = light_positionView_y_array[lightIndex];
d = light_positionView_z * frustumPlanes_z[0] +
light_positionView_x * frustumPlanes_xy[0];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[1] +
light_positionView_x * frustumPlanes_xy[1];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[2] +
light_positionView_y * frustumPlanes_xy[2];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
d = light_positionView_z * frustumPlanes_z[3] +
light_positionView_y * frustumPlanes_xy[3];
inFrustum = inFrustum && (d >= light_attenuationEndNeg);
// Pack and store intersecting lights
#if 0
if (inFrustum) {
tileNumLights += packed_store_active(&tileLightIndices[tileNumLights],
lightIndex);
}
#else
if (inFrustum)
{
active = 1;
}
#endif
}
#if 1
if (lightIndex >= numLights)
active = 0;
#if 0
const int idx = tileNumLights + inclusive_scan_warp(active);
const int nactive = reduce_sum(active);
#else
const int2 res = warpBinExclusiveScan(active);
const int idx = tileNumLights + res.x;
const int nactive = res.y;
#endif
if (active)
tileLightIndices[idx] = lightIndex;
tileNumLights += nactive;
#endif
}
return tileNumLights;
}
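// The compaction at the end of the light loop above (a scan over the per-lane
// `active` flags followed by a scattered store) can be modelled on the host.
// What follows is a minimal serial sketch, not the kernel's implementation:
// the names compactLanes, laneActive and laneValue are hypothetical, and the
// plain loop stands in for the warp-wide scan done by warpBinExclusiveScan().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of one warp-wide compaction step.
// laneActive[i] is nonzero when lane i wants to store laneValue[i].
// Appends the values of the active lanes to `out`, preserving lane order,
// and returns how many were appended (the role of `nactive` in the kernel).
static int compactLanes(const std::vector<int> &laneActive,
                        const std::vector<int> &laneValue,
                        std::vector<int> &out) {
    assert(laneActive.size() == laneValue.size());
    const std::size_t base = out.size();   // plays the role of tileNumLights
    out.resize(base + laneActive.size());  // reserve room for the worst case
    int offset = 0;                        // exclusive prefix sum of the flags
    for (std::size_t lane = 0; lane < laneActive.size(); ++lane) {
        if (laneActive[lane]) {
            out[base + offset] = laneValue[lane];
            ++offset;
        }
    }
    out.resize(base + offset);             // drop the unused slots
    return offset;
}
```

// Each active lane writes to base + (number of active lanes before it), so the
// output stays densely packed and in lane order without any atomics.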
__device__
static inline int32
IntersectLightsWithTile(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
int32 gBufferWidth, int32 gBufferHeight,
// G-buffer data
float zBuffer[],
// Camera data
float cameraProj_11, float cameraProj_22,
float cameraProj_33, float cameraProj_43,
float cameraNear, float cameraFar,
// Light Data
int32 numLights,
float light_positionView_x_array[],
float light_positionView_y_array[],
float light_positionView_z_array[],
float light_attenuationEnd_array[],
// Output
int32 tileLightIndices[]
)
{
float minZ, maxZ;
ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
minZ, maxZ);
int32 tileNumLights = IntersectLightsWithTileMinMax(
tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
MAX_LIGHTS, light_positionView_x_array, light_positionView_y_array,
light_positionView_z_array, light_attenuationEnd_array,
tileLightIndices);
return tileNumLights;
}
__device__
static inline void
ShadeTile(
int32 tileStartX, int32 tileEndX,
int32 tileStartY, int32 tileEndY,
int32 gBufferWidth, int32 gBufferHeight,
const InputDataArrays &inputData,
// Camera data
float cameraProj_11, float cameraProj_22,
float cameraProj_33, float cameraProj_43,
// Light list
int32 tileLightIndices[],
int32 tileNumLights,
// UI
bool visualizeLightCount,
// Output
unsigned int8 framebuffer_r[],
unsigned int8 framebuffer_g[],
unsigned int8 framebuffer_b[]
)
{
if (tileNumLights == 0 || visualizeLightCount) {
unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
for ( int32 y = tileStartY; y < tileEndY; ++y) {
for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
if (x >= tileEndX) continue;
int32 framebufferIndex = (y * gBufferWidth + x);
framebuffer_r[framebufferIndex] = c;
framebuffer_g[framebufferIndex] = c;
framebuffer_b[framebufferIndex] = c;
}
}
} else {
float twoOverGBufferWidth = 2.0f / gBufferWidth;
float twoOverGBufferHeight = 2.0f / gBufferHeight;
for ( int32 y = tileStartY; y < tileEndY; ++y) {
float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
{
const int x = xb + programIndex;
// if (x >= tileEndX) break;
int32 gBufferOffset = y * gBufferWidth + x;
// Reconstruct position and (negative) view vector from G-buffer
float surface_positionView_x, surface_positionView_y, surface_positionView_z;
float Vneg_x, Vneg_y, Vneg_z;
float z = inputData.zBuffer[gBufferOffset];
// Compute screen/clip-space position
// NOTE: Mind DX11 viewport transform and pixel center!
float positionScreen_x = (0.5f + (float)(x)) *
twoOverGBufferWidth - 1.0f;
// Unproject depth buffer Z value into view space
surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
surface_positionView_x = positionScreen_x * surface_positionView_z /
cameraProj_11;
surface_positionView_y = positionScreen_y * surface_positionView_z /
cameraProj_22;
// We actually end up with a vector pointing *at* the
// surface (i.e. the negative view vector)
normalize3(surface_positionView_x, surface_positionView_y,
surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
// Reconstruct normal from G-buffer
float surface_normal_x, surface_normal_y, surface_normal_z;
float normal_x = __half2float(inputData.normalEncoded_x[gBufferOffset]);
float normal_y = __half2float(inputData.normalEncoded_y[gBufferOffset]);
float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
float m = sqrt(4.0f * f - 1.0f);
surface_normal_x = m * (4.0f * normal_x - 2.0f);
surface_normal_y = m * (4.0f * normal_y - 2.0f);
surface_normal_z = 3.0f - 8.0f * f;
// Load other G-buffer parameters
float surface_specularAmount =
__half2float(inputData.specularAmount[gBufferOffset]);
float surface_specularPower =
__half2float(inputData.specularPower[gBufferOffset]);
float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
float lit_x = 0.0f;
float lit_y = 0.0f;
float lit_z = 0.0f;
for ( int32 tileLightIndex = 0; tileLightIndex < tileNumLights;
++tileLightIndex) {
int32 lightIndex = tileLightIndices[tileLightIndex];
// Gather light data relevant to initial culling
float light_positionView_x =
inputData.lightPositionView_x[lightIndex];
float light_positionView_y =
inputData.lightPositionView_y[lightIndex];
float light_positionView_z =
inputData.lightPositionView_z[lightIndex];
float light_attenuationEnd =
inputData.lightAttenuationEnd[lightIndex];
// Compute light vector
float L_x = light_positionView_x - surface_positionView_x;
float L_y = light_positionView_y - surface_positionView_y;
float L_z = light_positionView_z - surface_positionView_z;
float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
// Clip at end of attenuation
float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
if (distanceToLight2 < light_attenuationEnd2) {
float distanceToLight = sqrt(distanceToLight2);
// HLSL "rcp" is allowed to be fairly inaccurate
float distanceToLightRcp = 1.0f/distanceToLight;
L_x *= distanceToLightRcp;
L_y *= distanceToLightRcp;
L_z *= distanceToLightRcp;
// Start computing brdf
float NdotL = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, L_x, L_y, L_z);
// Clip back facing
if (NdotL > 0.0f) {
float light_attenuationBegin =
inputData.lightAttenuationBegin[lightIndex];
// Light distance attenuation (linstep)
float lightRange = (light_attenuationEnd - light_attenuationBegin);
float falloffPosition = (light_attenuationEnd - distanceToLight);
float attenuation = min(falloffPosition / lightRange, 1.0f);
float H_x = (L_x - Vneg_x);
float H_y = (L_y - Vneg_y);
float H_z = (L_z - Vneg_z);
normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
float NdotH = dot3(surface_normal_x, surface_normal_y,
surface_normal_z, H_x, H_y, H_z);
NdotH = max(NdotH, 0.0f);
float specular = pow(NdotH, surface_specularPower);
float specularNorm = (surface_specularPower + 2.0f) *
(1.0f / 8.0f);
float specularContrib = surface_specularAmount *
specularNorm * specular;
float k = attenuation * NdotL * (1.0f + specularContrib);
float light_color_x = inputData.lightColor_x[lightIndex];
float light_color_y = inputData.lightColor_y[lightIndex];
float light_color_z = inputData.lightColor_z[lightIndex];
float lightContrib_x = surface_albedo_x * light_color_x;
float lightContrib_y = surface_albedo_y * light_color_y;
float lightContrib_z = surface_albedo_z * light_color_z;
lit_x += lightContrib_x * k;
lit_y += lightContrib_y * k;
lit_z += lightContrib_z * k;
}
}
}
// Gamma correct
// These pows are pretty slow right now, but we can do
// something faster if really necessary to squeeze every
// last bit of performance out of it
float gamma = 1.0f / 2.2f;
lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
}
}
}
}
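// The per-pixel position reconstruction in ShadeTile() above can be checked in
// isolation with a scalar version. This is a sketch under stated assumptions:
// ViewPos and unprojectPixel are hypothetical names, and the projection terms
// are passed in directly instead of being read from an InputHeader.

```cpp
#include <cassert>

// Hypothetical helper type: a view-space position.
struct ViewPos { float x, y, z; };

// Scalar version of the reconstruction done per pixel in ShadeTile().
// proj11/proj22/proj33/proj43 mirror the kernel parameters
// cameraProj_11, cameraProj_22, cameraProj_33 and cameraProj_43.
static ViewPos unprojectPixel(int x, int y, float depth,
                              int width, int height,
                              float proj11, float proj22,
                              float proj33, float proj43) {
    assert(width > 0 && height > 0);
    // Pixel centre -> [-1, 1] screen space; y is flipped as in the kernel.
    const float screenX = (0.5f + (float)x) * (2.0f / width) - 1.0f;
    const float screenY = -(((0.5f + (float)y) * (2.0f / height)) - 1.0f);
    ViewPos p;
    p.z = proj43 / (depth - proj33);  // undo the projection's depth mapping
    p.x = screenX * p.z / proj11;     // scale screen x back into view space
    p.y = screenY * p.z / proj22;     // scale screen y back into view space
    return p;
}
```

// For a 2 x 2 buffer, pixel (0, 0) has its centre at screen position
// (-0.5, 0.5); with depth 0.5, proj33 = 0 and proj43 = 2 the view-space z is 4.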
///////////////////////////////////////////////////////////////////////////
// Static decomposition
extern "C" __global__ void
RenderTile( int num_groups_x, int num_groups_y,
const InputHeader *inputHeaderPtr,
const InputDataArrays *inputDataPtr,
int visualizeLightCount,
// Output
unsigned int8 framebuffer_r[],
unsigned int8 framebuffer_g[],
unsigned int8 framebuffer_b[]) {
if (taskIndex >= taskCount) return;
const InputHeader inputHeader = *inputHeaderPtr;
const InputDataArrays inputData = *inputDataPtr;
int32 group_y = taskIndex / num_groups_x;
int32 group_x = taskIndex % num_groups_x;
int32 tile_start_x = group_x * MIN_TILE_WIDTH;
int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
int framebufferWidth = inputHeader.framebufferWidth;
int framebufferHeight = inputHeader.framebufferHeight;
float cameraProj_00 = inputHeader.cameraProj[0][0];
float cameraProj_11 = inputHeader.cameraProj[1][1];
float cameraProj_22 = inputHeader.cameraProj[2][2];
float cameraProj_32 = inputHeader.cameraProj[3][2];
// Light intersection: figure out which lights illuminate this tile.
#if 0
int tileLightIndices[MAX_LIGHTS]; // Light list for the tile
#else
__shared__ int tileLightIndicesFull[4*MAX_LIGHTS]; // Light list for the tile
int *tileLightIndices = &tileLightIndicesFull[warpIdx*MAX_LIGHTS];
#endif
int numTileLights =
IntersectLightsWithTile(tile_start_x, tile_end_x,
tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight,
inputData.zBuffer,
cameraProj_00, cameraProj_11,
cameraProj_22, cameraProj_32,
inputHeader.cameraNear, inputHeader.cameraFar,
MAX_LIGHTS,
inputData.lightPositionView_x,
inputData.lightPositionView_y,
inputData.lightPositionView_z,
inputData.lightAttenuationEnd,
tileLightIndices);
// And now shade the tile, using the lights in tileLightIndices
ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
framebufferWidth, framebufferHeight, inputData,
cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
tileLightIndices, numTileLights, visualizeLightCount,
framebuffer_r, framebuffer_g, framebuffer_b);
}
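// RenderTile() above maps a linear task index to a rectangular tile. The same
// arithmetic as a small host-side sketch; TileBounds and tileForTask are
// hypothetical names, and the tile size is passed as a parameter instead of
// using MIN_TILE_WIDTH / MIN_TILE_HEIGHT.

```cpp
#include <cassert>

// Hypothetical helper type: half-open pixel bounds [start, end) of one tile.
struct TileBounds { int startX, endX, startY, endY; };

// Row-major decomposition: tasks are numbered left to right, then top to bottom.
static TileBounds tileForTask(int taskIndex, int numGroupsX,
                              int tileWidth, int tileHeight) {
    assert(taskIndex >= 0 && numGroupsX > 0);
    const int groupY = taskIndex / numGroupsX;  // tile row
    const int groupX = taskIndex % numGroupsX;  // tile column
    TileBounds t;
    t.startX = groupX * tileWidth;
    t.startY = groupY * tileHeight;
    t.endX = t.startX + tileWidth;
    t.endY = t.startY + tileHeight;
    return t;
}
```

// With 4 tile columns of 16 x 16 pixels, task 5 lands in column 1, row 1 and
// therefore covers x in [16, 32) and y in [16, 32).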


@@ -1,156 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define ISPC_IS_WINDOWS
#define NOMINMAX
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#include <fcntl.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <stdint.h>
#include <algorithm>
#include <assert.h>
#include <vector>
#ifdef ISPC_IS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
#include "deferred.h"
#include "../timing.h"
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
///////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv) {
if (argc != 2) {
printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
return 1;
}
InputData *input = CreateInputDataFromFile(argv[1]);
if (!input) {
printf("Failed to load input file \"%s\"!\n", argv[1]);
return 1;
}
Framebuffer framebuffer(input->header.framebufferWidth,
input->header.framebufferHeight);
#if 0
InitDynamicC(input);
#ifdef __cilk
InitDynamicCilk(input);
#endif // __cilk
#endif
int nframes = 5;
double ispcCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
const double t0 = rtc();
for (int j = 0; j < nframes; ++j)
ispc::RenderStatic(&input->header, &input->arrays,
VISUALIZE_LIGHT_COUNT,
framebuffer.r, framebuffer.g, framebuffer.b);
double mcycles = 1000*(rtc() - t0) / nframes;
ispcCycles = std::min(ispcCycles, mcycles);
}
printf("[ispc static + tasks]:\t\t[%.3f] million cycles to render "
"%d x %d image\n", ispcCycles,
input->header.framebufferWidth, input->header.framebufferHeight);
WriteFrame("deferred-ispc-static.ppm", input, framebuffer);
return 0;
#if 0
#ifdef __cilk
double dynamicCilkCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
const double t0 = rtc();
for (int j = 0; j < nframes; ++j)
DispatchDynamicCilk(input, &framebuffer);
double mcycles = 1000*(rtc() - t0) / nframes;
dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
}
printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n",
dynamicCilkCycles);
WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
#endif // __cilk
double serialCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
const double t0 = rtc();
for (int j = 0; j < nframes; ++j)
DispatchDynamicC(input, &framebuffer);
double mcycles = 1000*(rtc() - t0) / nframes;
serialCycles = std::min(serialCycles, mcycles);
}
printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n",
serialCycles);
WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
#ifdef __cilk
printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n",
serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
#else
printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
#endif // __cilk
#endif
DeleteInputData(input);
return 0;
}
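// The benchmark loop in main() above keeps the best of five trials, each
// averaged over nframes renders. Since rtc() returns seconds, the value stored
// in mcycles is milliseconds per frame even though the message says "million
// cycles". A minimal sketch of the same pattern follows; it swaps
// gettimeofday() for std::chrono::steady_clock, and bestMsPerFrame is a
// hypothetical name that does not appear in the original sources.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` `frames` times per trial, for `trials` trials, and returns the
// smallest average time per frame in milliseconds.
template <typename Work>
static double bestMsPerFrame(Work work, int trials, int frames) {
    assert(trials > 0 && frames > 0);
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < trials; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int j = 0; j < frames; ++j)
            work();
        const std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - t0;
        const double perFrame = dt.count() / frames;
        if (perFrame < best)
            best = perFrame;  // keep the fastest trial
    }
    return best;
}
```

// Taking the minimum over trials filters out warm-up and scheduling noise; a
// monotonic clock also avoids jumps when the wall-clock time is adjusted.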


@@ -1,518 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define ISPC_IS_WINDOWS
#define NOMINMAX
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#include <fcntl.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <stdint.h>
#include <algorithm>
#include <assert.h>
#include <vector>
#ifdef ISPC_IS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
#include "kernels1_ispc.h"
#include "deferred.h"
#include "../timing.h"
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
/******************************/
#include <cassert>
#include <iostream>
#include <cuda.h>
#include "drvapi_error_string.h"
#define checkCudaErrors(err) __checkCudaErrors (err, __FILE__, __LINE__)
// These are the inline versions for all of the SDK helper functions
void __checkCudaErrors(CUresult err, const char *file, const int line) {
if(CUDA_SUCCESS != err) {
std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
<< getCudaDrvErrorString(err) << "\" from file <" << file
<< ">, line " << line << "\n";
exit(-1);
}
}
/**********************/
/* Basic CUDriver API */
CUcontext context;
void createContext(const int deviceId = 0)
{
CUdevice device;
int devCount;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGetCount(&devCount));
assert(devCount > 0);
checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
char name[128];
checkCudaErrors(cuDeviceGetName(name, 128, device));
std::cout << "Using CUDA Device [0]: " << name << "\n";
int devMajor, devMinor;
checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
std::cout << "Device Compute Capability: "
<< devMajor << "." << devMinor << "\n";
if (devMajor < 2) {
std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
exit(1);
}
// Create driver context
checkCudaErrors(cuCtxCreate(&context, 0, device));
const size_t stackLimit = 4*1024;
// const size_t heapLimit = 1024*1024*1024;
checkCudaErrors(cuCtxSetLimit(CU_LIMIT_STACK_SIZE,stackLimit));
// checkCudaErrors(cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE,heapLimit));
}
void destroyContext()
{
checkCudaErrors(cuCtxDestroy(context));
}
CUmodule loadModule(const char * module)
{
const double t0 = rtc();
CUmodule cudaModule;
// in this branch we use compilation with parameters
#if 0
unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
// set up pointer to set the Maximum # of registers for a particular kernel
jitOptions[0] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 64;
jitOptVals[0] = (void *)(size_t)jitRegCount;
#if 0
{
jitNumOptions = 3;
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024;
jitOptVals[0] = (void *)(size_t)jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up pointer to set the Maximum # of registers for a particular kernel
jitOptions[2] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 32;
jitOptVals[2] = (void *)(size_t)jitRegCount;
}
#endif
checkCudaErrors(cuModuleLoadDataEx(&cudaModule, module,jitNumOptions, jitOptions, (void **)jitOptVals));
#else
CUlinkState CUState;
CUlinkState *lState = &CUState;
const int nOptions = 8;
CUjit_option options[nOptions];
void* optionVals[nOptions];
float walltime;
const unsigned int logSize = 32768;
char error_log[logSize],
info_log[logSize];
void *cuOut;
size_t outSize;
int myErr = 0;
// Setup linker options
// Return walltime from JIT compilation
options[0] = CU_JIT_WALL_TIME;
optionVals[0] = (void*) &walltime;
// Pass a buffer for info messages
options[1] = CU_JIT_INFO_LOG_BUFFER;
optionVals[1] = (void*) info_log;
// Pass the size of the info buffer
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
optionVals[2] = (void*)(size_t) logSize;
// Pass a buffer for error message
options[3] = CU_JIT_ERROR_LOG_BUFFER;
optionVals[3] = (void*) error_log;
// Pass the size of the error buffer
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
optionVals[4] = (void*)(size_t) logSize;
// Make the linker verbose
options[5] = CU_JIT_LOG_VERBOSE;
optionVals[5] = (void*) 1;
// Max # of registers per thread
options[6] = CU_JIT_MAX_REGISTERS;
int jitRegCount = 48;
optionVals[6] = (void *)(size_t)jitRegCount;
// Caching
options[7] = CU_JIT_CACHE_MODE;
optionVals[7] = (void *)CU_JIT_CACHE_OPTION_CA;
// Create a pending linker invocation
checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
#if 0
if (sizeof(void *)==4)
{
// Load the PTX from the string myPtx32
printf("Loading myPtx32[] program\n");
// PTX May also be loaded from file, as per below.
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
}
else
#endif
{
// Load the PTX from the string myPtx (64-bit)
fprintf(stderr, "Loading ptx..\n");
myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
// Add the device runtime only if the PTX was accepted, so an earlier error is not overwritten
if (myErr == CUDA_SUCCESS)
myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, "libcudadevrt.a", 0,0,0);
// PTX May also be loaded from file, as per below.
// myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
}
// Complete the linker step, unless an earlier step already failed
if (myErr == CUDA_SUCCESS)
myErr = cuLinkComplete(*lState, &cuOut, &outSize);
if ( myErr != CUDA_SUCCESS )
{
// Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above.
fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
assert(0);
}
// Linker walltime and info_log were requested in options above.
fprintf(stderr, "CUDA Link Completed in %fms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
// Load resulting cuBin into module
checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
// Destroy the linker invocation
checkCudaErrors(cuLinkDestroy(*lState));
#endif
fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
return cudaModule;
}
void unloadModule(CUmodule &cudaModule)
{
checkCudaErrors(cuModuleUnload(cudaModule));
}
CUfunction getFunction(CUmodule &cudaModule, const char * function)
{
CUfunction cudaFunction;
checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
return cudaFunction;
}
CUdeviceptr deviceMalloc(const size_t size)
{
CUdeviceptr d_buf;
checkCudaErrors(cuMemAlloc(&d_buf, size));
return d_buf;
}
void deviceFree(CUdeviceptr d_buf)
{
checkCudaErrors(cuMemFree(d_buf));
}
void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
{
checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
}
void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
{
checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
}
#define deviceLaunch(func,params) \
checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_L1)); \
checkCudaErrors( \
cuLaunchKernel( \
(func), \
1,1,1, \
32, 1, 1, \
0, NULL, (params), NULL \
));
typedef CUdeviceptr devicePtr;
/**************/
#include <vector>
std::vector<char> readBinary(const char * filename)
{
std::vector<char> buffer;
FILE *fp = fopen(filename, "rb");
if (!fp )
{
fprintf(stderr, "file %s not found\n", filename);
assert(0);
}
#if 0
int c; /* fgetc() returns int so that EOF can be told apart from a valid byte */
while ((c = fgetc(fp)) != EOF)
buffer.push_back(c);
#else
fseek(fp, 0, SEEK_END);
const unsigned long long size = ftell(fp); /*calc the size needed*/
fseek(fp, 0, SEEK_SET);
buffer.resize(size);
if (size == 0){ /*ERROR detection if file == empty*/
fprintf(stderr, "Error: The file %s is empty \n",filename);
exit(1);
}
else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
exit(1);
}
#endif
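fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
buffer.push_back('\0'); /* NUL-terminate: loadModule() calls strlen() on the PTX text */
fclose(fp);
return buffer;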
}
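// loadModule() applies strlen() to the PTX text, so whatever reads the file has
// to hand back a NUL-terminated buffer. A minimal sketch of such a reader using
// std::ifstream instead of fopen()/fread(); readTextFileZ is a hypothetical
// name, and returning an empty vector on failure is an assumption of this
// sketch (readBinary() above asserts or exits instead).

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Returns the file contents followed by a single '\0', or an empty vector
// when the file cannot be opened.
static std::vector<char> readTextFileZ(const std::string &path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::vector<char> buffer;
    if (!in)
        return buffer;  // caller decides how to report the failure
    buffer.assign(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
    buffer.push_back('\0');  // make &buffer[0] usable as a C string
    assert(buffer.back() == '\0');
    return buffer;
}
```

// The stream closes itself when `in` goes out of scope, so there is no file
// handle to release by hand on any of the return paths.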
extern "C"
{
double CUDALaunch(
void **handlePtr,
const char * func_name,
void **func_args)
{
const std::vector<char> module_str = readBinary("__kernels.ptx");
const char * module = &module_str[0];
CUmodule cudaModule = loadModule(module);
CUfunction cudaFunction = getFunction(cudaModule, func_name);
const double t0 = rtc();
deviceLaunch(cudaFunction, func_args);
checkCudaErrors(cuStreamSynchronize(0));
const double dt = rtc() - t0;
unloadModule(cudaModule);
return dt;
}
}
/******************************/
///////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv) {
if (argc != 2) {
printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
return 1;
}
InputData *input = CreateInputDataFromFile(argv[1]);
if (!input) {
printf("Failed to load input file \"%s\"!\n", argv[1]);
return 1;
}
Framebuffer framebuffer(input->header.framebufferWidth,
input->header.framebufferHeight);
// InitDynamicC(input);
#if 0
#ifdef __cilk
InitDynamicCilk(input);
#endif // __cilk
#endif
/*******************/
createContext();
/*******************/
devicePtr d_header = deviceMalloc(sizeof(ispc::InputHeader));
devicePtr d_arrays = deviceMalloc(sizeof(ispc::InputDataArrays));
const int buffsize = input->header.framebufferWidth*input->header.framebufferHeight;
devicePtr d_r = deviceMalloc(buffsize);
devicePtr d_g = deviceMalloc(buffsize);
devicePtr d_b = deviceMalloc(buffsize);
for (int i = 0; i < buffsize; i++)
framebuffer.r[i] = framebuffer.g[i] = framebuffer.b[i] = 0;
ispc::InputDataArrays dh_arrays;
{
devicePtr d_chunk = deviceMalloc(input->header.inputDataChunkSize);
memcpyH2D(d_chunk, input->chunk, input->header.inputDataChunkSize);
dh_arrays.zBuffer = (float*)(d_chunk + input->header.inputDataArrayOffsets[idaZBuffer]);
dh_arrays.normalEncoded_x =
(uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaNormalEncoded_x]);
fprintf(stderr, "%p %p \n",
dh_arrays.zBuffer, dh_arrays.normalEncoded_x);
fprintf(stderr, " diff= %d %d \n",
input->header.inputDataArrayOffsets[idaZBuffer],
input->header.inputDataArrayOffsets[idaNormalEncoded_x]);
dh_arrays.normalEncoded_y =
(uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaNormalEncoded_y]);
dh_arrays.specularAmount =
(uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaSpecularAmount]);
dh_arrays.specularPower =
(uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaSpecularPower]);
dh_arrays.albedo_x =
(uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_x]);
dh_arrays.albedo_y =
(uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_y]);
dh_arrays.albedo_z =
(uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_z]);
dh_arrays.lightPositionView_x =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_x]);
dh_arrays.lightPositionView_y =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_y]);
dh_arrays.lightPositionView_z =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_z]);
dh_arrays.lightAttenuationBegin =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightAttenuationBegin]);
dh_arrays.lightColor_x =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_x]);
dh_arrays.lightColor_y =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_y]);
dh_arrays.lightColor_z =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_z]);
dh_arrays.lightAttenuationEnd =
(float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightAttenuationEnd]);
}
memcpyH2D(d_header, &input->header, sizeof(ispc::InputHeader));
memcpyH2D(d_arrays, &dh_arrays, sizeof(ispc::InputDataArrays));
memcpyH2D(d_r, framebuffer.r, buffsize);
memcpyH2D(d_g, framebuffer.g, buffsize);
memcpyH2D(d_b, framebuffer.b, buffsize);
int nframes = 5;
double ispcCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
const double t0 = rtc();
double dt = 0.0;
for (int j = 0; j < nframes; ++j)
{
const char * func_name = "RenderStatic";
int light_count = VISUALIZE_LIGHT_COUNT;
void *func_args[] = {
&d_header,
&d_arrays,
&light_count,
&d_r,
&d_g,
&d_b};
dt += CUDALaunch(NULL, func_name, func_args);
}
//double mcycles = 1000*(rtc() - t0) / nframes;
double mcycles = 1000*dt / nframes;
fprintf(stderr, "dt= %g\n", mcycles);
ispcCycles = std::min(ispcCycles, mcycles);
}
memcpyD2H(framebuffer.r, d_r, buffsize);
memcpyD2H(framebuffer.g, d_g, buffsize);
memcpyD2H(framebuffer.b, d_b, buffsize);
printf("[ispc cuda]:\t\t[%.3f] million cycles to render "
"%d x %d image\n", ispcCycles,
input->header.framebufferWidth, input->header.framebufferHeight);
WriteFrame("deferred-cuda.ppm", input, framebuffer);
/*******************/
destroyContext();
/*******************/
return 0;
#if 0
#ifdef __cilk
double dynamicCilkCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
reset_and_start_timer();
for (int j = 0; j < nframes; ++j)
DispatchDynamicCilk(input, &framebuffer);
double mcycles = get_elapsed_mcycles() / nframes;
dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
}
printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n",
dynamicCilkCycles);
WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
#endif // __cilk
double serialCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
reset_and_start_timer();
for (int j = 0; j < nframes; ++j)
DispatchDynamicC(input, &framebuffer);
double mcycles = get_elapsed_mcycles() / nframes;
serialCycles = std::min(serialCycles, mcycles);
}
printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n",
serialCycles);
WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
#ifdef __cilk
printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n",
serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
#else
printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
#endif // __cilk
#endif
DeleteInputData(input);
return 0;
}


@@ -1,163 +0,0 @@
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define ISPC_IS_WINDOWS
#define NOMINMAX
#elif defined(__linux__)
#define ISPC_IS_LINUX
#elif defined(__APPLE__)
#define ISPC_IS_APPLE
#endif
#include <fcntl.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <stdint.h>
#include <algorithm>
#include <assert.h>
#include <vector>
#ifdef ISPC_IS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
#include "deferred.h"
#include "kernels_ispc.h"
#include "../timing.h"
#include <sys/time.h>
static inline double rtc(void)
{
struct timeval Tvalue;
double etime;
struct timezone dummy;
gettimeofday(&Tvalue,&dummy);
etime = (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
return etime;
}
///////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv) {
if (argc != 2) {
printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
return 1;
}
InputData *input = CreateInputDataFromFile(argv[1]);
if (!input) {
printf("Failed to load input file \"%s\"!\n", argv[1]);
return 1;
}
Framebuffer framebuffer(input->header.framebufferWidth,
input->header.framebufferHeight);
#if 0
InitDynamicC(input);
#ifdef __cilk
InitDynamicCilk(input);
#endif // __cilk
#endif
const int buffsize = input->header.framebufferWidth*input->header.framebufferHeight;
for (int i = 0; i < buffsize; i++)
framebuffer.r[i] = framebuffer.g[i] = framebuffer.b[i] = 0;
int nframes = 5;
double ispcCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
const double t0 = rtc();
for (int j = 0; j < nframes; ++j)
ispc::RenderStatic(&input->header, &input->arrays,
input->header,
VISUALIZE_LIGHT_COUNT,
framebuffer.r, framebuffer.g, framebuffer.b);
// rtc() returns seconds, so this is milliseconds per frame, not cycles.
double mcycles = 1000*(rtc() - t0) / nframes;
ispcCycles = std::min(ispcCycles, mcycles);
}
printf("[ispc static + tasks]:\t\t[%.3f] ms to render "
"%d x %d image\n", ispcCycles,
input->header.framebufferWidth, input->header.framebufferHeight);
WriteFrame("deferred-ispc-static.ppm", input, framebuffer);
// Fall through to the cleanup below so that the input data is released.
#if 0
#ifdef __cilk
double dynamicCilkCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
reset_and_start_timer();
for (int j = 0; j < nframes; ++j)
DispatchDynamicCilk(input, &framebuffer);
double mcycles = get_elapsed_mcycles() / nframes;
dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
}
printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n",
dynamicCilkCycles);
WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
#endif // __cilk
double serialCycles = 1e30;
for (int i = 0; i < 5; ++i) {
framebuffer.clear();
reset_and_start_timer();
for (int j = 0; j < nframes; ++j)
DispatchDynamicC(input, &framebuffer);
double mcycles = get_elapsed_mcycles() / nframes;
serialCycles = std::min(serialCycles, mcycles);
}
printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n",
serialCycles);
WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
#ifdef __cilk
printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n",
serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
#else
printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
#endif // __cilk
#endif
DeleteInputData(input);
return 0;
}
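
The benchmark loop above keeps the best of several timed trials. A minimal sketch of the same min-of-N pattern follows, using std::chrono instead of gettimeofday() so that it also builds where <sys/time.h> is unavailable. The helper name best_ms_per_frame and its parameters are hypothetical and not part of the original example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper (not in the original source): runs `work` `frames`
// times per trial, repeats for `trials` trials, and returns the best
// (minimum) wall-clock time per frame in milliseconds.
template <typename Work>
double best_ms_per_frame(Work work, int trials, int frames) {
    using Clock = std::chrono::steady_clock;
    double best = 1e30;  // same "very large" starting value as the loop above
    for (int t = 0; t < trials; ++t) {
        const auto t0 = Clock::now();
        for (int f = 0; f < frames; ++f)
            work();
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - t0;
        best = std::min(best, elapsed.count() / frames);
    }
    return best;
}
```

Taking the minimum rather than the mean discards trials disturbed by other system activity, which appears to be the intent of the original loop. A caller would pass a lambda that wraps the render call.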

View File

@@ -1,370 +0,0 @@
/*
* Copyright 1993-2012 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
#ifndef _DRVAPI_ERROR_STRING_H_
#define _DRVAPI_ERROR_STRING_H_
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// Error Code string definitions here
typedef struct
{
char const *error_string;
int error_id;
} s_CudaErrorStr;
/**
* Error codes
*/
static s_CudaErrorStr sCudaDrvErrorString[] =
{
/**
* The API call returned with no errors. In the case of query calls, this
* can also mean that the operation being queried is complete (see
* ::cuEventQuery() and ::cuStreamQuery()).
*/
{ "CUDA_SUCCESS", 0 },
/**
* This indicates that one or more of the parameters passed to the API call
* is not within an acceptable range of values.
*/
{ "CUDA_ERROR_INVALID_VALUE", 1 },
/**
* The API call failed because it was unable to allocate enough memory to
* perform the requested operation.
*/
{ "CUDA_ERROR_OUT_OF_MEMORY", 2 },
/**
* This indicates that the CUDA driver has not been initialized with
* ::cuInit() or that initialization has failed.
*/
{ "CUDA_ERROR_NOT_INITIALIZED", 3 },
/**
* This indicates that the CUDA driver is in the process of shutting down.
*/
{ "CUDA_ERROR_DEINITIALIZED", 4 },
/**
* This indicates profiling APIs are called while application is running
* in visual profiler mode.
*/
{ "CUDA_ERROR_PROFILER_DISABLED", 5 },
/**
* This indicates profiling has not been initialized for this context.
* Call cuProfilerInitialize() to resolve this.
*/
{ "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
/**
* This indicates profiler has already been started and probably
* cuProfilerStart() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
/**
* This indicates profiler has already been stopped and probably
* cuProfilerStop() is incorrectly called.
*/
{ "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },
/**
* This indicates that no CUDA-capable devices were detected by the installed
* CUDA driver.
*/
{ "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
/**
* This indicates that the device ordinal supplied by the user does not
* correspond to a valid CUDA device.
*/
{ "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
/**
* This indicates that the device kernel image is invalid. This can also
* indicate an invalid CUDA module.
*/
{ "CUDA_ERROR_INVALID_IMAGE", 200 },
/**
* This most frequently indicates that there is no context bound to the
* current thread. This can also be returned if the context passed to an
* API call is not a valid handle (such as a context that has had
* ::cuCtxDestroy() invoked on it). This can also be returned if a user
* mixes different API versions (i.e. 3010 context with 3020 API calls).
* See ::cuCtxGetApiVersion() for more details.
*/
{ "CUDA_ERROR_INVALID_CONTEXT", 201 },
/**
* This indicates that the context being supplied as a parameter to the
* API call was already the active context.
* \deprecated
* This error return is deprecated as of CUDA 3.2. It is no longer an
* error to attempt to push the active context via ::cuCtxPushCurrent().
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
/**
* This indicates that a map or register operation has failed.
*/
{ "CUDA_ERROR_MAP_FAILED", 205 },
/**
* This indicates that an unmap or unregister operation has failed.
*/
{ "CUDA_ERROR_UNMAP_FAILED", 206 },
/**
* This indicates that the specified array is currently mapped and thus
* cannot be destroyed.
*/
{ "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
/**
* This indicates that the resource is already mapped.
*/
{ "CUDA_ERROR_ALREADY_MAPPED", 208 },
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*/
{ "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
/**
* This indicates that a resource has already been acquired.
*/
{ "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
/**
* This indicates that a resource is not mapped.
*/
{ "CUDA_ERROR_NOT_MAPPED", 211 },
/**
* This indicates that a mapped resource is not available for access as an
* array.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
/**
* This indicates that a mapped resource is not available for access as a
* pointer.
*/
{ "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
/**
* This indicates that an uncorrectable ECC error was detected during
* execution.
*/
{ "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
/**
* This indicates that the ::CUlimit passed to the API call is not
* supported by the active device.
*/
{ "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
/**
* This indicates that the ::CUcontext passed to the API call can
* only be bound to a single CPU thread at a time but is already
* bound to a CPU thread.
*/
{ "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
/**
* This indicates that peer access is not supported across the given
* devices.
*/
{ "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217},
/**
* This indicates that the device kernel source is invalid.
*/
{ "CUDA_ERROR_INVALID_SOURCE", 300 },
/**
* This indicates that the file specified was not found.
*/
{ "CUDA_ERROR_FILE_NOT_FOUND", 301 },
/**
* This indicates that a link to a shared object failed to resolve.
*/
{ "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
/**
* This indicates that initialization of a shared object failed.
*/
{ "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
/**
* This indicates that an OS call failed.
*/
{ "CUDA_ERROR_OPERATING_SYSTEM", 304 },
/**
* This indicates that a resource handle passed to the API call was not
* valid. Resource handles are opaque types like ::CUstream and ::CUevent.
*/
{ "CUDA_ERROR_INVALID_HANDLE", 400 },
/**
* This indicates that a named symbol was not found. Examples of symbols
* are global/constant variable names, texture names, and surface names.
*/
{ "CUDA_ERROR_NOT_FOUND", 500 },
/**
* This indicates that asynchronous operations issued previously have not
* completed yet. This result is not actually an error, but must be indicated
* differently than ::CUDA_SUCCESS (which indicates completion). Calls that
* may return this value include ::cuEventQuery() and ::cuStreamQuery().
*/
{ "CUDA_ERROR_NOT_READY", 600 },
/**
* An exception occurred on the device while executing a kernel. Common
* causes include dereferencing an invalid device pointer and accessing
* out of bounds shared memory. The context cannot be used, so it must
* be destroyed (and a new one should be created). All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_FAILED", 700 },
/**
* This indicates that a launch did not occur because it did not have
* appropriate resources. This error usually indicates that the user has
* attempted to pass too many arguments to the device kernel, or the
* kernel launch specifies too many threads for the kernel's register
* count. Passing arguments of the wrong size (i.e. a 64-bit pointer
* when a 32-bit int is expected) is equivalent to passing too many
* arguments and can also result in this error.
*/
{ "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
/**
* This indicates that the device kernel took too long to execute. This can
* only occur if timeouts are enabled - see the device attribute
* ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
* context cannot be used (and must be destroyed similar to
* ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
* this context are invalid and must be reconstructed if the program is to
* continue using CUDA.
*/
{ "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
/**
* This error indicates a kernel launch that uses an incompatible texturing
* mode.
*/
{ "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
/**
* This error indicates that a call to ::cuCtxEnablePeerAccess() is
* trying to re-enable peer access to a context which has already
* had peer access to it enabled.
*/
{ "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
/**
* This error indicates that ::cuCtxDisablePeerAccess() is
* trying to disable peer access which has not been enabled yet
* via ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
/**
* This error indicates that the primary context for the specified device
* has already been initialized.
*/
{ "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
/**
* This error indicates that the context current to the calling thread
* has been destroyed using ::cuCtxDestroy, or is a primary context which
* has not yet been initialized.
*/
{ "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
/**
* A device-side assert triggered during kernel execution. The context
* cannot be used anymore, and must be destroyed. All existing device
* memory allocations from this context are invalid and must be
* reconstructed if the program is to continue using CUDA.
*/
{ "CUDA_ERROR_ASSERT", 710 },
/**
* This error indicates that the hardware resources required to enable
* peer access have been exhausted for one or more of the devices
* passed to ::cuCtxEnablePeerAccess().
*/
{ "CUDA_ERROR_TOO_MANY_PEERS", 711 },
/**
* This error indicates that the memory range passed to ::cuMemHostRegister()
* has already been registered.
*/
{ "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
/**
* This error indicates that the pointer passed to ::cuMemHostUnregister()
* does not correspond to any currently registered memory region.
*/
{ "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
/**
* This error indicates that the attempted operation is not permitted.
*/
{ "CUDA_ERROR_NOT_PERMITTED", 800 },
/**
* This error indicates that the attempted operation is not supported
* on the current system or device.
*/
{ "CUDA_ERROR_NOT_SUPPORTED", 801 },
/**
* This indicates that an unknown internal error has occurred.
*/
{ "CUDA_ERROR_UNKNOWN", 999 },
{ NULL, -1 }
};
// This is just a linear search through the array, since the error_ids are not
// always consecutive
const char * getCudaDrvErrorString(CUresult error_id)
{
int index = 0;
while (sCudaDrvErrorString[index].error_id != error_id &&
sCudaDrvErrorString[index].error_id != -1)
{
index++;
}
if (sCudaDrvErrorString[index].error_id == error_id)
return (const char *)sCudaDrvErrorString[index].error_string;
else
return (const char *)"CUDA_ERROR not found!";
}
#endif
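
The lookup in the header above is a linear scan over a table that ends in a sentinel entry. A minimal stand-alone sketch of that pattern follows. It assumes a hypothetical three-entry subset of the table and takes a plain int so that it compiles without the CUDA headers. The names ErrorStr, kErrorTable, and lookup_error_string are illustrative only.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical three-entry subset of the table, for illustration only.
struct ErrorStr {
    const char *error_string;
    int error_id;
};

static const ErrorStr kErrorTable[] = {
    { "CUDA_SUCCESS", 0 },
    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
    { "CUDA_ERROR_UNKNOWN", 999 },
    { nullptr, -1 }  // sentinel: marks the end of the table
};

// Scans until the sentinel; ids are sparse, so direct indexing is not possible.
static const char *lookup_error_string(int error_id) {
    for (int i = 0; kErrorTable[i].error_id != -1; ++i) {
        if (kErrorTable[i].error_id == error_id)
            return kErrorTable[i].error_string;
    }
    return "CUDA_ERROR not found!";
}
```

One difference from the original: this sketch never matches the sentinel itself, so an id of -1 yields the fallback string, whereas the original function would return the sentinel's NULL string in that case.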

View File

@@ -1,136 +0,0 @@

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 2010
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "simple", "simple\simple.vcxproj", "{947C5311-8B78-4D05-BEE4-BCF342D4B367}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "rt", "rt\rt.vcxproj", "{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "aobench", "aobench\aobench.vcxproj", "{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "mandelbrot", "mandelbrot\mandelbrot.vcxproj", "{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "options", "options\options.vcxproj", "{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "mandelbrot_tasks", "mandelbrot_tasks\mandelbrot_tasks.vcxproj", "{E80DA7D4-AB22-4648-A068-327307156BE6}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "aobench_instrumented", "aobench_instrumented\aobench_instrumented.vcxproj", "{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "noise", "noise\noise.vcxproj", "{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "volume", "volume_rendering\volume.vcxproj", "{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "stencil", "stencil\stencil.vcxproj", "{2EF070A1-F62F-4E6A-944B-88D140945C3C}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "deferred_shading", "deferred\deferred_shading.vcxproj", "{87F53C53-957E-4E91-878A-BC27828FB9EB}"
EndProject
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "perfbench", "perfbench\perfbench.vcxproj", "{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Win32 = Debug|Win32
Debug|x64 = Debug|x64
Release|Win32 = Release|Win32
Release|x64 = Release|x64
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|Win32.ActiveCfg = Debug|Win32
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|Win32.Build.0 = Debug|Win32
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|x64.ActiveCfg = Debug|x64
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|x64.Build.0 = Debug|x64
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|Win32.ActiveCfg = Release|Win32
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|Win32.Build.0 = Release|Win32
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|x64.ActiveCfg = Release|x64
{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|x64.Build.0 = Release|x64
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|Win32.ActiveCfg = Debug|Win32
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|Win32.Build.0 = Debug|Win32
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|x64.ActiveCfg = Debug|x64
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|x64.Build.0 = Debug|x64
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|Win32.ActiveCfg = Release|Win32
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|Win32.Build.0 = Release|Win32
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|x64.ActiveCfg = Release|x64
{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|x64.Build.0 = Release|x64
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|Win32.ActiveCfg = Debug|Win32
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|Win32.Build.0 = Debug|Win32
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|x64.ActiveCfg = Debug|x64
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|x64.Build.0 = Debug|x64
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|Win32.ActiveCfg = Release|Win32
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|Win32.Build.0 = Release|Win32
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|x64.ActiveCfg = Release|x64
{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|x64.Build.0 = Release|x64
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|Win32.ActiveCfg = Debug|Win32
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|Win32.Build.0 = Debug|Win32
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|x64.ActiveCfg = Debug|x64
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|x64.Build.0 = Debug|x64
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|Win32.ActiveCfg = Release|Win32
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|Win32.Build.0 = Release|Win32
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|x64.ActiveCfg = Release|x64
{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|x64.Build.0 = Release|x64
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|Win32.ActiveCfg = Debug|Win32
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|Win32.Build.0 = Debug|Win32
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|x64.ActiveCfg = Debug|x64
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|x64.Build.0 = Debug|x64
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|Win32.ActiveCfg = Release|Win32
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|Win32.Build.0 = Release|Win32
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|x64.ActiveCfg = Release|x64
{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|x64.Build.0 = Release|x64
{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|Win32.ActiveCfg = Debug|Win32
{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|Win32.Build.0 = Debug|Win32
{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|x64.ActiveCfg = Debug|x64
{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|x64.Build.0 = Debug|x64
{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|Win32.ActiveCfg = Release|Win32
{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|Win32.Build.0 = Release|Win32
{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|x64.ActiveCfg = Release|x64
{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|x64.Build.0 = Release|x64
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|Win32.ActiveCfg = Debug|Win32
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|Win32.Build.0 = Debug|Win32
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|x64.ActiveCfg = Debug|x64
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|x64.Build.0 = Debug|x64
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|Win32.ActiveCfg = Release|Win32
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|Win32.Build.0 = Release|Win32
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|x64.ActiveCfg = Release|x64
{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|x64.Build.0 = Release|x64
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|Win32.ActiveCfg = Debug|Win32
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|Win32.Build.0 = Debug|Win32
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|x64.ActiveCfg = Debug|x64
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|x64.Build.0 = Debug|x64
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|Win32.ActiveCfg = Release|Win32
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|Win32.Build.0 = Release|Win32
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|x64.ActiveCfg = Release|x64
{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|x64.Build.0 = Release|x64
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|Win32.ActiveCfg = Debug|Win32
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|Win32.Build.0 = Debug|Win32
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|x64.ActiveCfg = Debug|x64
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|x64.Build.0 = Debug|x64
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|Win32.ActiveCfg = Release|Win32
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|Win32.Build.0 = Release|Win32
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|x64.ActiveCfg = Release|x64
{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|x64.Build.0 = Release|x64
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|Win32.ActiveCfg = Debug|Win32
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|Win32.Build.0 = Debug|Win32
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|x64.ActiveCfg = Debug|x64
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|x64.Build.0 = Debug|x64
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|Win32.ActiveCfg = Release|Win32
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|Win32.Build.0 = Release|Win32
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|x64.ActiveCfg = Release|x64
{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|x64.Build.0 = Release|x64
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|Win32.ActiveCfg = Debug|Win32
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|Win32.Build.0 = Debug|Win32
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|x64.ActiveCfg = Debug|x64
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|x64.Build.0 = Debug|x64
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|Win32.ActiveCfg = Release|Win32
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|Win32.Build.0 = Release|Win32
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|x64.ActiveCfg = Release|x64
{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|x64.Build.0 = Release|x64
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|Win32.ActiveCfg = Debug|Win32
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|Win32.Build.0 = Debug|Win32
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|x64.ActiveCfg = Debug|x64
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|x64.Build.0 = Debug|x64
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|Win32.ActiveCfg = Release|Win32
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|Win32.Build.0 = Release|Win32
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|x64.ActiveCfg = Release|x64
{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|x64.Build.0 = Release|x64
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
EndGlobal

View File

@@ -1,9 +0,0 @@
EXAMPLE=gmres
CPP_SRC=algorithm.cpp main.cpp matrix.cpp
CC_SRC=mmio.c
ISPC_SRC=matrix.ispc
ISPC_IA_TARGETS=sse2,sse4-x2,avx-x2
ISPC_ARM_TARGETS=neon
include ../common.mk

View File

@@ -1,231 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*===========================================================================*\
|* Includes
\*===========================================================================*/
#include "algorithm.h"
#include <math.h>   // sqrt, copysign, fabs
#include <stdio.h>  // fprintf
#include <stdlib.h> // exit
#include "debug.h"
/*===========================================================================*\
|* GMRES
\*===========================================================================*/
/* upper_triangular_right_solve:
* ----------------------------
* Given upper triangular matrix R and rhs vector b, solve for
* x. This "solve" ignores the rows, columns of R that are greater than the
* dimensions of x.
*/
void upper_triangular_right_solve (const DenseMatrix &R, const Vector &b, Vector &x)
{
// Dimensionality check
ASSERT(R.rows() >= b.size());
ASSERT(R.cols() >= x.size());
ASSERT(b.size() >= x.size());
int max_row = x.size() - 1;
// first solve step:
x[max_row] = b[max_row] / R(max_row, max_row);
for (int row = max_row - 1; row >= 0; row--) {
double xi = b[row];
for (int col = max_row; col > row; col--)
xi -= x[col] * R(row, col);
x[row] = xi / R(row, row);
}
}
/* create_rotation (used in gmres):
* -------------------------------
* Construct a Givens rotation to zero out the lowest non-zero entry in a partially
* factored Hessenberg matrix. Note that the previous Givens rotations should be
* applied to this column before creating a new rotation.
*/
void create_rotation (const DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn)
{
double a = H(col, col);
double b = H(col + 1, col);
double r;
if (b == 0) {
Cn[col] = copysign(1, a);
Sn[col] = 0;
}
else if (a == 0) {
Cn[col] = 0;
Sn[col] = copysign(1, b);
}
else {
r = sqrt(a*a + b*b);
Sn[col] = -b / r;
Cn[col] = a / r;
}
}
/* Applies the 'col'th Givens rotation stored in vectors Sn and Cn to the 'col'th
* column of the DenseMatrix M. (Previous columns don't need the rotation applied b/c
* presumably, the first col-1 columns are already upper triangular, and so their
* entries in the col and col+1 rows are 0.)
*/
void apply_rotation (DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn)
{
double c = Cn[col];
double s = Sn[col];
double tmp = c * H(col, col) - s * H(col+1, col);
H(col+1, col) = s * H(col, col) + c * H(col+1, col);
H(col, col) = tmp;
}
/* Applies the 'col'th Givens rotation to the vector.
*/
void apply_rotation (Vector &v, size_t col, Vector &Cn, Vector &Sn)
{
double a = v[col];
double b = v[col + 1];
double c = Cn[col];
double s = Sn[col];
v[col] = c * a - s * b;
v[col + 1] = s * a + c * b;
}
/* Applies the first 'col' Givens rotations to the newly-created column
* of H. (Leaves other columns alone.)
*/
void update_column (DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn)
{
for (size_t i = 0; i < col; i++) {
double c = Cn[i];
double s = Sn[i];
double t = c * H(i,col) - s * H(i+1,col);
H(i+1, col) = s * H(i,col) + c * H(i+1,col);
H(i, col) = t;
}
}
/* After a new column has been added to the Hessenberg matrix, factor it back into
* an upper-triangular matrix by:
* - applying the previous Givens rotations to the new column
* - computing the new Givens rotation to make the column upper triangular
* - applying the new Givens rotation to the column, and
* - applying the new Givens rotation to the solution vector
*/
void update_qr_decomp (DenseMatrix &H, Vector &s, size_t col, Vector &Cn, Vector &Sn)
{
update_column( H, col, Cn, Sn);
create_rotation(H, col, Cn, Sn);
apply_rotation( H, col, Cn, Sn);
apply_rotation( s, col, Cn, Sn);
}
void gmres (const Matrix &A, const Vector &b, Vector &x, int num_iters, double max_err)
{
DEBUG_PRINT("gmres starting!\n");
x.zero();
ASSERT(A.rows() == A.cols());
DenseMatrix Qstar(num_iters + 1, A.rows());
DenseMatrix H(num_iters + 1, num_iters);
// arrays for storing parameters of givens rotations
Vector Sn(num_iters);
Vector Cn(num_iters);
// array for storing the rhs projected onto the Hessenberg matrix's column space
Vector G(num_iters+1);
G.zero();
double beta = b.norm();
G[0] = beta;
// temp vector, stores Aqi
Vector w(A.rows());
w.copy(b);
w.normalize();
Qstar.set_row(0, w);
int iter = 0;
Vector temp(A.rows(), false);
double rel_err = 0.0; // initialized so the error message below is defined even if num_iters == 0
while (iter < num_iters)
{
// w = Aqi
Qstar.row(iter, temp);
A.multiply(temp, w);
// construct ith column of H, i+1th row of Qstar:
for (int row = 0; row <= iter; row++) {
Qstar.row(row, temp);
H(row, iter) = temp.dot(w);
w.add_ax(-H(row, iter), temp);
}
H(iter+1, iter) = w.norm();
w.divide(H(iter+1, iter));
Qstar.set_row(iter+1, w);
update_qr_decomp (H, G, iter, Cn, Sn);
rel_err = fabs(G[iter+1] / beta);
if (rel_err < max_err)
break;
if (iter % 100 == 0)
DEBUG_PRINT("Iter %d: %f err\n", iter, rel_err);
iter++;
}
if (iter == num_iters) {
fprintf(stderr, "Error: gmres failed to converge in %d iterations (relative err: %f)\n", num_iters, rel_err);
exit(-1);
}
// We've reached an acceptable solution (?):
DEBUG_PRINT("gmres completed in %d iterations (rel. resid. %f, max %f)\n", iter + 1, rel_err, max_err);
Vector y(iter+1);
upper_triangular_right_solve(H, G, y);
for (int i = 0; i < iter + 1; i++) {
Qstar.row(i, temp);
x.add_ax(y[i], temp);
}
}
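
The two numerical building blocks used by gmres() above, the Givens rotation and the upper-triangular back substitution, can be checked in isolation. A minimal sketch follows, using plain doubles and a row-major std::vector in place of the DenseMatrix and Vector classes, whose definitions are not shown here. The names Givens, make_givens, rotate, and back_substitute are hypothetical. The sign convention (c = a / r, s = -b / r) is copied from create_rotation().

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-alone helper mirroring create_rotation():
// c = a / r, s = -b / r, with r = sqrt(a^2 + b^2).
struct Givens { double c, s; };

static Givens make_givens(double a, double b) {
    if (b == 0.0) return { std::copysign(1.0, a), 0.0 };
    if (a == 0.0) return { 0.0, std::copysign(1.0, b) };
    const double r = std::sqrt(a * a + b * b);
    return { a / r, -b / r };
}

// Rotates the pair (a, b) in place: a' = c*a - s*b, b' = s*a + c*b.
// With the rotation built from the same (a, b), b' becomes zero.
static void rotate(const Givens &g, double &a, double &b) {
    const double ta = g.c * a - g.s * b;
    b = g.s * a + g.c * b;
    a = ta;
}

// Solves R x = b for an upper-triangular n x n matrix R stored row-major,
// working from the last row upwards as upper_triangular_right_solve() does.
static std::vector<double> back_substitute(const std::vector<double> &R,
                                           const std::vector<double> &b) {
    const int n = static_cast<int>(b.size());
    std::vector<double> x(n);
    for (int row = n - 1; row >= 0; --row) {
        double xi = b[row];
        for (int col = n - 1; col > row; --col)
            xi -= x[col] * R[row * n + col];
        x[row] = xi / R[row * n + row];
    }
    return x;
}
```

For the pair (3, 4) the rotation has c = 0.6 and s = -0.8, and applying it gives (5, 0) up to rounding, which is the elimination of the subdiagonal entry that update_qr_decomp() relies on.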

View File

@@ -1,50 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __ALGORITHM_H__
#define __ALGORITHM_H__
#include "matrix.h"
/* Generalized Minimal Residual Method:
* -----------------------------------
* Takes a square matrix and an rhs and uses GMRES to find an estimate for x.
* The specified error is relative.
*/
void gmres (const Matrix &A, const Vector &b, Vector &x, int num_iters, double err);
#endif

File diff suppressed because it is too large Load Diff

View File

@@ -1,55 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __DEBUG_H__
#define __DEBUG_H__
#include <cassert>
#include <cstdio> // printf, used by DEBUG_PRINT
/**************************************************************\
| Macros
\**************************************************************/
#define DEBUG
#ifdef DEBUG
#define ASSERT(expr) assert(expr)
#define DEBUG_PRINT(...) printf(__VA_ARGS__)
#else
#define ASSERT(expr)
#define DEBUG_PRINT(...)
#endif
#endif

View File

@@ -1,79 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "matrix.h"
#include "algorithm.h"
#include "util.h"
#include <cmath>
#include "../timing.h"
int main (int argc, char **argv)
{
if (argc < 4) {
printf("usage: %s <input-matrix> <input-rhs> <output-file>\n", argv[0]);
return -1;
}
double gmres_cycles = 0.0;
DEBUG_PRINT("Loading A...\n");
Matrix *A = CRSMatrix::matrix_from_mtf(argv[1]);
if (A == NULL)
return -1;
DEBUG_PRINT("... size: %lu\n", (unsigned long) A->cols());
DEBUG_PRINT("Loading b...\n");
Vector *b = Vector::vector_from_mtf(argv[2]);
if (b == NULL)
return -1;
Vector x(A->cols());
DEBUG_PRINT("Beginning gmres...\n");
// Time the solve so that the cycle count printed below is meaningful
reset_and_start_timer();
gmres(*A, *b, x, A->cols() / 2, .01);
gmres_cycles = get_elapsed_mcycles();
// Write result out to file
x.to_mtf(argv[argc-1]);
// Compute residual (double-check)
#ifdef DEBUG
Vector bprime(b->size());
A->multiply(x, bprime);
Vector resid(bprime.size(), &(bprime[0]));
resid.subtract(*b);
DEBUG_PRINT("residual error check: %lg\n", resid.norm() / b->norm());
#endif
// Print profiling results
DEBUG_PRINT("-- Total mcycles to solve : %.03f --\n", gmres_cycles);
}
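
The residual check at the end of main() computes the relative residual ||A*x - b|| / ||b||. The following is a minimal standalone sketch of that quantity, not the example's own code: it assumes a dense row-major matrix stored in a std::vector, and the names norm2 and relative_residual are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of a vector.
double norm2(const std::vector<double> &v) {
    double sum = 0.0;
    for (double e : v)
        sum += e * e;
    return std::sqrt(sum);
}

// Relative residual ||A*x - b|| / ||b|| for a dense row-major matrix A
// with b.size() rows and x.size() columns. Hypothetical helper, not part
// of the example sources.
double relative_residual(const std::vector<double> &A,
                         const std::vector<double> &x,
                         const std::vector<double> &b) {
    const std::size_t rows = b.size(), cols = x.size();
    assert(A.size() == rows * cols);
    std::vector<double> resid(rows);
    for (std::size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (std::size_t j = 0; j < cols; j++)
            sum += A[i * cols + j] * x[j];
        resid[i] = sum - b[i];   // (A*x - b)[i]
    }
    return norm2(resid) / norm2(b);
}
```

A value near the tolerance passed to gmres (.01 in main()) would suggest the solver converged as requested; the sketch does not guard against a zero right-hand side.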

View File

@@ -1,246 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/**************************************************************\
| Includes
\**************************************************************/
#include "matrix.h"
#include "matrix_ispc.h"
extern "C" {
#include "mmio.h"
}
/**************************************************************\
| DenseMatrix methods
\**************************************************************/
void DenseMatrix::multiply (const Vector &v, Vector &r) const
{
// Dimensionality check
ASSERT(v.size() == cols());
ASSERT(r.size() == rows());
for (int i = 0; i < rows(); i++)
r[i] = v.dot(entries + i * num_cols);
}
const Vector *DenseMatrix::row (size_t row) const {
return new Vector(num_cols, entries + row * num_cols, true);
}
void DenseMatrix::row (size_t row, Vector &r) {
r.entries = entries + row * cols();
r._size = cols();
}
void DenseMatrix::set_row(size_t row, const Vector &v)
{
ASSERT(v.size() == num_cols);
memcpy(entries + row * num_cols, v.entries, num_cols * sizeof(double));
}
/**************************************************************\
| CRSMatrix Methods
\**************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <algorithm>
struct entry {
int row;
int col;
double val;
};
bool compare_entries(struct entry i, struct entry j) {
if (i.row < j.row)
return true;
if (i.row > j.row)
return false;
return i.col < j.col;
}
#define ERR_OUT(...) { fprintf(stderr, __VA_ARGS__); return NULL; }
CRSMatrix *CRSMatrix::matrix_from_mtf (char *path) {
FILE *f;
MM_typecode matcode;
int m, n, nz;
if ((f = fopen(path, "r")) == NULL)
ERR_OUT("Error: %s does not name a valid/readable file.\n", path);
if (mm_read_banner(f, &matcode) != 0)
ERR_OUT("Error: Could not process Matrix Market banner.\n");
if (mm_is_complex(matcode))
ERR_OUT("Error: Application does not support complex numbers.\n")
if (mm_is_dense(matcode))
ERR_OUT("Error: supplied matrix is dense (should be sparse.)\n");
if (!mm_is_matrix(matcode))
ERR_OUT("Error: %s does not encode a matrix.\n", path)
if (mm_read_mtx_crd_size(f, &m, &n, &nz) != 0)
ERR_OUT("Error: could not read matrix size from file.\n");
if (m != n)
ERR_OUT("Error: Application does not support non-square matrices.\n");
std::vector<struct entry> entries;
entries.resize(nz);
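for (int i = 0; i < nz; i++) {
// Stop on a short or malformed entry instead of using garbage values
if (fscanf(f, "%d %d %lg\n", &entries[i].row, &entries[i].col, &entries[i].val) != 3) {
fclose(f);
ERR_OUT("Error: could not read entry %d of %s.\n", i + 1, path);
}
// Adjust from 1-based to 0-based
entries[i].row--;
entries[i].col--;
}
fclose(f);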
sort(entries.begin(), entries.end(), compare_entries);
CRSMatrix *M = new CRSMatrix(m, n, nz);
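int cur_row = -1;
for (int i = 0; i < nz; i++) {
while (entries[i].row > cur_row)
M->row_offsets[++cur_row] = i;
M->entries[i] = entries[i].val;
M->columns[i] = entries[i].col;
}
// Rows after the last nonzero are empty; point their offsets past the end
// so that multiply() does not truncate the last populated row
while (++cur_row < m)
M->row_offsets[cur_row] = nz;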
return M;
}
Vector *Vector::vector_from_mtf (char *path) {
FILE *f;
MM_typecode matcode;
int m, n, nz;
if ((f = fopen(path, "r")) == NULL)
ERR_OUT("Error: %s does not name a valid/readable file.\n", path);
if (mm_read_banner(f, &matcode) != 0)
ERR_OUT("Error: Could not process Matrix Market banner.\n");
if (mm_is_complex(matcode))
ERR_OUT("Error: Application does not support complex numbers.\n")
if (mm_is_dense(matcode)) {
if (mm_read_mtx_array_size(f, &m, &n) != 0)
ERR_OUT("Error: could not read matrix size from file.\n");
} else {
if (mm_read_mtx_crd_size(f, &m, &n, &nz) != 0)
ERR_OUT("Error: could not read matrix size from file.\n");
}
if (n != 1)
ERR_OUT("Error: %s does not describe a vector.\n", path);
Vector *x = new Vector(m);
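bool ok = true;
if (mm_is_dense(matcode)) {
double val;
for (int i = 0; ok && i < m; i++) {
ok = (fscanf(f, "%lg\n", &val) == 1);
if (ok)
(*x)[i] = val;
}
}
else {
x->zero();
double val;
int row;
int col;
for (int i = 0; ok && i < nz; i++) {
// Reject short reads and row indices outside 1..m
ok = (fscanf(f, "%d %d %lg\n", &row, &col, &val) == 3) && row >= 1 && row <= m;
if (ok)
(*x)[row-1] = val;
}
}
fclose(f);
if (!ok) {
delete x;
ERR_OUT("Error: could not read vector entries from %s.\n", path);
}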
return x;
}
#define ERR(...) { fprintf(stderr, __VA_ARGS__); exit(-1); }
void Vector::to_mtf (char *path) {
FILE *f;
MM_typecode matcode;
mm_initialize_typecode(&matcode);
mm_set_matrix(&matcode);
mm_set_real(&matcode);
mm_set_dense(&matcode);
mm_set_general(&matcode);
if ((f = fopen(path, "w")) == NULL)
ERR("Error: cannot open/write to %s\n", path);
mm_write_banner(f, matcode);
mm_write_mtx_array_size(f, size(), 1);
for (int i = 0; i < size(); i++)
fprintf(f, "%lg\n", entries[i]);
fclose(f);
}
void CRSMatrix::multiply (const Vector &v, Vector &r) const
{
ASSERT(v.size() == cols());
ASSERT(r.size() == rows());
for (int row = 0; row < rows(); row++)
{
int row_offset = row_offsets[row];
int next_offset = ((row + 1 == rows()) ? _nonzeroes : row_offsets[row + 1]);
double sum = 0;
for (int i = row_offset; i < next_offset; i++)
{
sum += v[columns[i]] * entries[i];
}
r[row] = sum;
}
}
void CRSMatrix::zero ( )
{
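// Keep one (zero) offset per row so that multiply() remains valid
// on the emptied matrix
entries.clear();
columns.clear();
row_offsets.assign(rows(), 0);
_nonzeroes = 0;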
}
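
The conversion from sorted coordinate entries to CRS row offsets, as done in matrix_from_mtf, can be sketched in isolation. This is a minimal sketch under stated assumptions rather than the example's implementation: the names Coo, build_row_offsets and crs_multiply are hypothetical, entries are assumed to be sorted by row with 0-based indices, and the sketch stores rows + 1 offsets (the class above stores one per row and special-cases the last row).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical coordinate-format entry with 0-based indices.
struct Coo { int row; int col; double val; };

// Returns num_rows + 1 offsets so that row r owns entries
// [offsets[r], offsets[r + 1]). Empty rows get equal neighbouring offsets.
std::vector<int> build_row_offsets(const std::vector<Coo> &sorted, int num_rows) {
    std::vector<int> offsets(num_rows + 1, 0);
    for (const Coo &e : sorted) {
        assert(e.row >= 0 && e.row < num_rows);
        offsets[e.row + 1]++;                  // count entries per row
    }
    for (int r = 0; r < num_rows; r++)
        offsets[r + 1] += offsets[r];          // prefix sum turns counts into offsets
    return offsets;
}

// Sparse matrix-vector product r = A * v using the offsets built above.
std::vector<double> crs_multiply(const std::vector<Coo> &sorted,
                                 const std::vector<int> &offsets,
                                 const std::vector<double> &v) {
    std::vector<double> r(offsets.size() - 1, 0.0);
    for (std::size_t row = 0; row + 1 < offsets.size(); row++)
        for (int i = offsets[row]; i < offsets[row + 1]; i++)
            r[row] += sorted[i].val * v[sorted[i].col];
    return r;
}
```

Storing the extra trailing offset removes the last-row special case in the multiply loop, at the cost of one additional int.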

View File

@@ -1,279 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __MATRIX_H__
#define __MATRIX_H__
/**************************************************************\
| Includes
\**************************************************************/
#include <cstring> // size_t, memcpy
#include <cstdlib> // malloc, free
#include <cmath> // sqrt
#include <vector>
#include "debug.h"
#include "matrix_ispc.h"
class DenseMatrix;
/**************************************************************\
| Vector class
\**************************************************************/
class Vector {
public:
static Vector *vector_from_mtf(char *path);
void to_mtf (char *path);
Vector(size_t size, bool alloc_mem=true)
{
shared_ptr = false;
_size = size;
if (alloc_mem)
entries = (double *) malloc(sizeof(double) * _size);
else {
shared_ptr = true;
entries = NULL;
}
}
Vector(size_t size, double *content, bool share_ptr=false)
{
_size = size;
if (share_ptr) {
entries = content;
shared_ptr = true;
}
else {
shared_ptr = false;
entries = (double *) malloc(sizeof(double) * _size);
memcpy(entries, content, sizeof(double) * _size);
}
}
~Vector() { if (!shared_ptr) free(entries); }
const double & operator [] (size_t index) const
{
ASSERT(index < _size);
return *(entries + index);
}
double &operator [] (size_t index)
{
ASSERT(index < _size);
return *(entries + index);
}
bool operator == (const Vector &v) const
{
if (v.size() != _size)
return false;
for (size_t i = 0; i < _size; i++)
if (entries[i] != v[i])
return false;
return true;
}
size_t size() const {return _size; }
double dot (const Vector &b) const
{
ASSERT(b.size() == this->size());
return ispc::vector_dot(entries, b.entries, size());
}
double dot (const double * const b) const
{
return ispc::vector_dot(entries, b, size());
}
void zero ()
{
ispc::zero(entries, size());
}
double norm () const { return sqrt(dot(entries)); } // sqrt, not sqrtf: keep double precision
void normalize () { this->divide(this->norm()); }
void add (const Vector &a)
{
ASSERT(size() == a.size());
ispc::vector_add(entries, a.entries, size());
}
void subtract (const Vector &s)
{
ASSERT(size() == s.size());
ispc::vector_sub(entries, s.entries, size());
}
void multiply (double scalar)
{
ispc::vector_mult(entries, scalar, size());
}
void divide (double scalar)
{
ispc::vector_div(entries, scalar, size());
}
// Note: x may be longer than *(this)
void add_ax (double a, const Vector &x) {
ASSERT(x.size() >= size());
ispc::vector_add_ax(entries, a, x.entries, size());
}
// Note that copy only copies the first size() elements of the
// supplied vector, i.e. the supplied vector can be longer than
// this one. This is useful in least squares calculations.
void copy (const Vector &other) {
ASSERT(other.size() >= size());
memcpy(entries, other.entries, size() * sizeof(double));
}
friend class DenseMatrix;
private:
size_t _size;
bool shared_ptr;
double *entries;
};
/**************************************************************\
| Matrix base class
\**************************************************************/
class Matrix {
friend class Vector;
public:
Matrix(size_t size_r, size_t size_c)
{
num_rows = size_r;
num_cols = size_c;
}
virtual ~Matrix(){} // virtual so derived matrices can be deleted through a Matrix *
size_t rows() const { return num_rows; }
size_t cols() const { return num_cols; }
virtual void multiply (const Vector &v, Vector &r) const = 0;
virtual void zero () = 0;
protected:
size_t num_rows;
size_t num_cols;
};
/**************************************************************\
| DenseMatrix class
\**************************************************************/
class DenseMatrix : public Matrix {
friend class Vector;
public:
DenseMatrix(size_t size_r, size_t size_c) : Matrix(size_r, size_c)
{
entries = (double *) malloc(size_r * size_c * sizeof(double));
}
DenseMatrix(size_t size_r, size_t size_c, const double *content) : Matrix (size_r, size_c)
{
entries = (double *) malloc(size_r * size_c * sizeof(double));
memcpy(entries, content, size_r * size_c * sizeof(double));
}
virtual void multiply (const Vector &v, Vector &r) const;
double &operator () (unsigned int r, unsigned int c)
{
return *(entries + r * num_cols + c);
}
const double &operator () (unsigned int r, unsigned int c) const
{
return *(entries + r * num_cols + c);
}
const Vector *row(size_t row) const;
void row(size_t row, Vector &r);
void set_row(size_t row, const Vector &v);
virtual void zero() { ispc::zero(entries, rows() * cols()); }
void copy (const DenseMatrix &other)
{
ASSERT(rows() == other.rows());
ASSERT(cols() == other.cols());
memcpy(entries, other.entries, rows() * cols() * sizeof(double));
}
private:
double *entries;
bool shared_ptr;
};
/**************************************************************\
| CRSMatrix (compressed row storage, a sparse matrix format)
\**************************************************************/
class CRSMatrix : public Matrix {
public:
CRSMatrix (size_t size_r, size_t size_c, size_t nonzeroes) :
Matrix(size_r, size_c)
{
_nonzeroes = nonzeroes;
entries.resize(nonzeroes);
columns.resize(nonzeroes);
row_offsets.resize(size_r);
}
virtual void multiply(const Vector &v, Vector &r) const;
virtual void zero();
static CRSMatrix *matrix_from_mtf (char *path);
private:
unsigned int _nonzeroes;
std::vector<double> entries;
std::vector<int> row_offsets;
std::vector<int> columns;
};
#endif

View File

@@ -1,122 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/**************************************************************\
| General
\**************************************************************/
export void zero (uniform double data[],
uniform int size)
{
foreach (i = 0 ... size)
data[i] = 0.0;
}
/**************************************************************\
| Vector helpers
\**************************************************************/
export void vector_add (uniform double a[],
const uniform double b[],
const uniform int size)
{
foreach (i = 0 ... size)
a[i] += b[i];
}
export void vector_sub (uniform double a[],
const uniform double b[],
const uniform int size)
{
foreach (i = 0 ... size)
a[i] -= b[i];
}
export void vector_mult (uniform double a[],
const uniform double b,
const uniform int size)
{
foreach (i = 0 ... size)
a[i] *= b;
}
export void vector_div (uniform double a[],
const uniform double b,
const uniform int size)
{
foreach (i = 0 ... size)
a[i] /= b;
}
export void vector_add_ax (uniform double r[],
const uniform double a,
const uniform double x[],
const uniform int size)
{
foreach (i = 0 ... size)
r[i] += a * x[i];
}
export uniform double vector_dot (const uniform double a[],
const uniform double b[],
const uniform int size)
{
varying double sum = 0.0;
foreach (i = 0 ... size)
sum += a[i] * b[i];
return reduce_add(sum);
}
/**************************************************************\
| Matrix helpers
\**************************************************************/
export void sparse_multiply (const uniform double entries[],
const uniform int columns[],
const uniform int row_offsets[],
const uniform int rows,
const uniform int cols,
const uniform int nonzeroes,
const uniform double v[],
uniform double r[])
{
foreach (row = 0 ... rows) {
int row_offset = row_offsets[row];
int next_offset = ((row + 1 == rows) ? nonzeroes : row_offsets[row+1]);
double sum = 0;
for (int j = row_offset; j < next_offset; j++)
sum += v[columns[j]] * entries[j];
r[row] = sum;
}
}
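
The vector_dot routine above accumulates into a varying sum, so each program instance keeps its own partial sum, and reduce_add combines them at the end. A scalar C++ analogue can make that pattern explicit. This is only an illustrative sketch: kLanes is a hypothetical gang width, dot_with_lanes is a hypothetical name, and the real mapping of iterations to lanes is up to the ispc compiler and target.

```cpp
#include <cassert>

// Hypothetical gang width; the real width depends on the ispc target.
const int kLanes = 4;

// Dot product that mimics a per-lane accumulator followed by a
// horizontal reduction (the role reduce_add plays in ispc).
double dot_with_lanes(const double *a, const double *b, int size) {
    assert(size >= 0);
    double partial[kLanes] = {0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < size; i++)
        partial[i % kLanes] += a[i] * b[i];   // lane i % kLanes handles element i
    double sum = 0.0;
    for (int lane = 0; lane < kLanes; lane++)
        sum += partial[lane];                 // horizontal add across lanes
    return sum;
}
```

Because the additions are grouped per lane, the result can differ in the last bits from a strictly sequential sum for general floating-point inputs.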

View File

@@ -1,511 +0,0 @@
/*
* Matrix Market I/O library for ANSI C
*
* See http://math.nist.gov/MatrixMarket for details.
*
*
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include "mmio.h"
int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
double **val_, int **I_, int **J_)
{
FILE *f;
MM_typecode matcode;
int M, N, nz;
int i;
double *val;
int *I, *J;
if ((f = fopen(fname, "r")) == NULL)
return -1;
if (mm_read_banner(f, &matcode) != 0)
{
printf("mm_read_unsymmetric_sparse: Could not process Matrix Market banner ");
printf(" in file [%s]\n", fname);
return -1;
}
if ( !(mm_is_real(matcode) && mm_is_matrix(matcode) &&
mm_is_sparse(matcode)))
{
fprintf(stderr, "Sorry, this application does not support ");
fprintf(stderr, "Matrix Market type: [%s]\n",
mm_typecode_to_str(matcode));
return -1;
}
/* find out size of sparse matrix: M, N, nz .... */
if (mm_read_mtx_crd_size(f, &M, &N, &nz) !=0)
{
fprintf(stderr, "read_unsymmetric_sparse(): could not parse matrix size.\n");
return -1;
}
*M_ = M;
*N_ = N;
*nz_ = nz;
/* reserve memory for matrices */
I = (int *) malloc(nz * sizeof(int));
J = (int *) malloc(nz * sizeof(int));
val = (double *) malloc(nz * sizeof(double));
*val_ = val;
*I_ = I;
*J_ = J;
/* NOTE: when reading in doubles, ANSI C requires the use of the "l" */
/* specifier as in "%lg", "%lf", "%le", otherwise errors will occur */
/* (ANSI C X3.159-1989, Sec. 4.9.6.2, p. 136 lines 13-15) */
for (i=0; i<nz; i++)
{
fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i]);
I[i]--; /* adjust from 1-based to 0-based */
J[i]--;
}
fclose(f);
return 0;
}
int mm_is_valid(MM_typecode matcode)
{
if (!mm_is_matrix(matcode)) return 0;
if (mm_is_dense(matcode) && mm_is_pattern(matcode)) return 0;
if (mm_is_real(matcode) && mm_is_hermitian(matcode)) return 0;
if (mm_is_pattern(matcode) && (mm_is_hermitian(matcode) ||
mm_is_skew(matcode))) return 0;
return 1;
}
int mm_read_banner(FILE *f, MM_typecode *matcode)
{
char line[MM_MAX_LINE_LENGTH];
char banner[MM_MAX_TOKEN_LENGTH];
char mtx[MM_MAX_TOKEN_LENGTH];
char crd[MM_MAX_TOKEN_LENGTH];
char data_type[MM_MAX_TOKEN_LENGTH];
char storage_scheme[MM_MAX_TOKEN_LENGTH];
char *p;
mm_clear_typecode(matcode);
if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL)
return MM_PREMATURE_EOF;
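/* field widths are MM_MAX_TOKEN_LENGTH - 1 so long tokens cannot overflow the buffers */
if (sscanf(line, "%63s %63s %63s %63s %63s", banner, mtx, crd, data_type,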
storage_scheme) != 5)
return MM_PREMATURE_EOF;
for (p=mtx; *p!='\0'; *p=tolower(*p),p++); /* convert to lower case */
for (p=crd; *p!='\0'; *p=tolower(*p),p++);
for (p=data_type; *p!='\0'; *p=tolower(*p),p++);
for (p=storage_scheme; *p!='\0'; *p=tolower(*p),p++);
/* check for banner */
if (strncmp(banner, MatrixMarketBanner, strlen(MatrixMarketBanner)) != 0)
return MM_NO_HEADER;
/* first field should be "mtx" */
if (strcmp(mtx, MM_MTX_STR) != 0)
return MM_UNSUPPORTED_TYPE;
mm_set_matrix(matcode);
/* second field describes whether this is a sparse matrix (in coordinate
storage) or a dense array */
if (strcmp(crd, MM_SPARSE_STR) == 0)
mm_set_sparse(matcode);
else
if (strcmp(crd, MM_DENSE_STR) == 0)
mm_set_dense(matcode);
else
return MM_UNSUPPORTED_TYPE;
/* third field */
if (strcmp(data_type, MM_REAL_STR) == 0)
mm_set_real(matcode);
else
if (strcmp(data_type, MM_COMPLEX_STR) == 0)
mm_set_complex(matcode);
else
if (strcmp(data_type, MM_PATTERN_STR) == 0)
mm_set_pattern(matcode);
else
if (strcmp(data_type, MM_INT_STR) == 0)
mm_set_integer(matcode);
else
return MM_UNSUPPORTED_TYPE;
/* fourth field */
if (strcmp(storage_scheme, MM_GENERAL_STR) == 0)
mm_set_general(matcode);
else
if (strcmp(storage_scheme, MM_SYMM_STR) == 0)
mm_set_symmetric(matcode);
else
if (strcmp(storage_scheme, MM_HERM_STR) == 0)
mm_set_hermitian(matcode);
else
if (strcmp(storage_scheme, MM_SKEW_STR) == 0)
mm_set_skew(matcode);
else
return MM_UNSUPPORTED_TYPE;
return 0;
}
int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz)
{
/* fprintf returns the number of characters written, or a negative value on error */
if (fprintf(f, "%d %d %d\n", M, N, nz) < 0)
return MM_COULD_NOT_WRITE_FILE;
else
return 0;
}
int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz )
{
char line[MM_MAX_LINE_LENGTH];
int num_items_read;
/* set return null parameter values, in case we exit with errors */
*M = *N = *nz = 0;
/* now continue scanning until you reach the end-of-comments */
do
{
if (fgets(line,MM_MAX_LINE_LENGTH,f) == NULL)
return MM_PREMATURE_EOF;
}while (line[0] == '%');
/* line[] is either blank or has M,N, nz */
if (sscanf(line, "%d %d %d", M, N, nz) == 3)
return 0;
else
do
{
num_items_read = fscanf(f, "%d %d %d", M, N, nz);
if (num_items_read == EOF) return MM_PREMATURE_EOF;
}
while (num_items_read != 3);
return 0;
}
int mm_read_mtx_array_size(FILE *f, int *M, int *N)
{
char line[MM_MAX_LINE_LENGTH];
int num_items_read;
/* set return null parameter values, in case we exit with errors */
*M = *N = 0;
/* now continue scanning until you reach the end-of-comments */
do
{
if (fgets(line,MM_MAX_LINE_LENGTH,f) == NULL)
return MM_PREMATURE_EOF;
}while (line[0] == '%');
/* line[] is either blank or has M, N */
if (sscanf(line, "%d %d", M, N) == 2)
return 0;
else /* we have a blank line */
do
{
num_items_read = fscanf(f, "%d %d", M, N);
if (num_items_read == EOF) return MM_PREMATURE_EOF;
}
while (num_items_read != 2);
return 0;
}
int mm_write_mtx_array_size(FILE *f, int M, int N)
{
/* fprintf returns the number of characters written, or a negative value on error */
if (fprintf(f, "%d %d\n", M, N) < 0)
return MM_COULD_NOT_WRITE_FILE;
else
return 0;
}
/*-------------------------------------------------------------------------*/
/******************************************************************/
/* use when I[], J[], and val[] are already allocated */
/******************************************************************/
int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
double val[], MM_typecode matcode)
{
int i;
if (mm_is_complex(matcode))
{
for (i=0; i<nz; i++)
if (fscanf(f, "%d %d %lg %lg", &I[i], &J[i], &val[2*i], &val[2*i+1])
!= 4) return MM_PREMATURE_EOF;
}
else if (mm_is_real(matcode))
{
for (i=0; i<nz; i++)
{
if (fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i])
!= 3) return MM_PREMATURE_EOF;
}
}
else if (mm_is_pattern(matcode))
{
for (i=0; i<nz; i++)
if (fscanf(f, "%d %d", &I[i], &J[i])
!= 2) return MM_PREMATURE_EOF;
}
else
return MM_UNSUPPORTED_TYPE;
return 0;
}
int mm_read_mtx_crd_entry(FILE *f, int *I, int *J,
double *real, double *imag, MM_typecode matcode)
{
if (mm_is_complex(matcode))
{
if (fscanf(f, "%d %d %lg %lg", I, J, real, imag)
!= 4) return MM_PREMATURE_EOF;
}
else if (mm_is_real(matcode))
{
if (fscanf(f, "%d %d %lg\n", I, J, real)
!= 3) return MM_PREMATURE_EOF;
}
else if (mm_is_pattern(matcode))
{
if (fscanf(f, "%d %d", I, J) != 2) return MM_PREMATURE_EOF;
}
else
return MM_UNSUPPORTED_TYPE;
return 0;
}
/************************************************************************
mm_read_mtx_crd() fills M, N, nz, array of values, and return
type code, e.g. 'MCRS'
if matrix is complex, values[] is of size 2*nz,
(nz pairs of real/imaginary values)
************************************************************************/
int mm_read_mtx_crd(char *fname, int *M, int *N, int *nz, int **I, int **J,
double **val, MM_typecode *matcode)
{
int ret_code;
FILE *f;
if (strcmp(fname, "stdin") == 0) f=stdin;
else
if ((f = fopen(fname, "r")) == NULL)
return MM_COULD_NOT_READ_FILE;
if ((ret_code = mm_read_banner(f, matcode)) != 0)
return ret_code;
if (!(mm_is_valid(*matcode) && mm_is_sparse(*matcode) &&
mm_is_matrix(*matcode)))
return MM_UNSUPPORTED_TYPE;
if ((ret_code = mm_read_mtx_crd_size(f, M, N, nz)) != 0)
return ret_code;
*I = (int *) malloc(*nz * sizeof(int));
*J = (int *) malloc(*nz * sizeof(int));
*val = NULL;
if (mm_is_complex(*matcode))
{
*val = (double *) malloc(*nz * 2 * sizeof(double));
ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
*matcode);
if (ret_code != 0) return ret_code;
}
else if (mm_is_real(*matcode))
{
*val = (double *) malloc(*nz * sizeof(double));
ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
*matcode);
if (ret_code != 0) return ret_code;
}
else if (mm_is_pattern(*matcode))
{
ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
*matcode);
if (ret_code != 0) return ret_code;
}
if (f != stdin) fclose(f);
return 0;
}
int mm_write_banner(FILE *f, MM_typecode matcode)
{
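char *str = mm_typecode_to_str(matcode);
int ret_code;
if (str == NULL)
return MM_UNSUPPORTED_TYPE;
ret_code = fprintf(f, "%s %s\n", MatrixMarketBanner, str);
free(str);
/* fprintf returns the number of characters written, or a negative value on error */
if (ret_code < 0)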
return MM_COULD_NOT_WRITE_FILE;
else
return 0;
}
int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
double val[], MM_typecode matcode)
{
FILE *f;
int i;
if (strcmp(fname, "stdout") == 0)
f = stdout;
else
if ((f = fopen(fname, "w")) == NULL)
return MM_COULD_NOT_WRITE_FILE;
/* print banner followed by typecode */
fprintf(f, "%s ", MatrixMarketBanner);
fprintf(f, "%s\n", mm_typecode_to_str(matcode));
/* print matrix sizes and nonzeros */
fprintf(f, "%d %d %d\n", M, N, nz);
/* print values */
if (mm_is_pattern(matcode))
for (i=0; i<nz; i++)
fprintf(f, "%d %d\n", I[i], J[i]);
else
if (mm_is_real(matcode))
for (i=0; i<nz; i++)
fprintf(f, "%d %d %20.16g\n", I[i], J[i], val[i]);
else
if (mm_is_complex(matcode))
for (i=0; i<nz; i++)
fprintf(f, "%d %d %20.16g %20.16g\n", I[i], J[i], val[2*i],
val[2*i+1]);
else
{
if (f != stdout) fclose(f);
return MM_UNSUPPORTED_TYPE;
}
if (f !=stdout) fclose(f);
return 0;
}
/**
* Create a new copy of a string s. strdup() is a common routine, but
* not part of ANSI C, so mm_strdup() is included here. Used by mm_typecode_to_str().
*
*/
char *mm_strdup(const char *s)
{
int len = strlen(s);
char *s2 = (char *) malloc((len+1)*sizeof(char));
return strcpy(s2, s);
}
char *mm_typecode_to_str(MM_typecode matcode)
{
char buffer[MM_MAX_LINE_LENGTH];
char *types[4];
char *mm_strdup(const char *);
/* check for MTX type; without it types[0] would be left uninitialized */
if (mm_is_matrix(matcode))
types[0] = MM_MTX_STR;
else
return NULL;
/* check for CRD or ARR matrix */
if (mm_is_sparse(matcode))
types[1] = MM_SPARSE_STR;
else
if (mm_is_dense(matcode))
types[1] = MM_DENSE_STR;
else
return NULL;
/* check for element data type */
if (mm_is_real(matcode))
types[2] = MM_REAL_STR;
else
if (mm_is_complex(matcode))
types[2] = MM_COMPLEX_STR;
else
if (mm_is_pattern(matcode))
types[2] = MM_PATTERN_STR;
else
if (mm_is_integer(matcode))
types[2] = MM_INT_STR;
else
return NULL;
/* check for symmetry type */
if (mm_is_general(matcode))
types[3] = MM_GENERAL_STR;
else
if (mm_is_symmetric(matcode))
types[3] = MM_SYMM_STR;
else
if (mm_is_hermitian(matcode))
types[3] = MM_HERM_STR;
else
if (mm_is_skew(matcode))
types[3] = MM_SKEW_STR;
else
return NULL;
sprintf(buffer,"%s %s %s %s", types[0], types[1], types[2], types[3]);
return mm_strdup(buffer);
}
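
The write helpers in this file decide success from the return value of fprintf. That value is the number of characters written (negative on error), not the number of converted arguments, which is the convention of the scanf family. The sketch below illustrates the difference; size_line_length and write_size_line are hypothetical names, and snprintf is used only because it follows the same return convention and is easy to inspect.

```cpp
#include <cstdio>

// Number of characters that "%d %d %d\n" produces for the given values.
// snprintf follows the same return convention as fprintf.
int size_line_length(int M, int N, int nz) {
    char buf[64];
    return std::snprintf(buf, sizeof(buf), "%d %d %d\n", M, N, nz);
}

// Hypothetical writer: returns 0 on success, -1 if fprintf reports an error.
int write_size_line(std::FILE *f, int M, int N, int nz) {
    return (std::fprintf(f, "%d %d %d\n", M, N, nz) < 0) ? -1 : 0;
}
```

For a 3 x 3 matrix with 5 nonzeroes the line "3 3 5" plus newline is 6 characters long, so comparing the return value with the argument count 3 would not indicate whether the write succeeded.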

View File

@@ -1,135 +0,0 @@
/*
* Matrix Market I/O library for ANSI C
*
* See http://math.nist.gov/MatrixMarket for details.
*
*
*/
#ifndef MM_IO_H
#define MM_IO_H
#define MM_MAX_LINE_LENGTH 1025
#define MatrixMarketBanner "%%MatrixMarket"
#define MM_MAX_TOKEN_LENGTH 64
typedef char MM_typecode[4];
#include <stdio.h>
char *mm_typecode_to_str(MM_typecode matcode);
int mm_read_banner(FILE *f, MM_typecode *matcode);
int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz);
int mm_read_mtx_array_size(FILE *f, int *M, int *N);
int mm_write_banner(FILE *f, MM_typecode matcode);
int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz);
int mm_write_mtx_array_size(FILE *f, int M, int N);
/********************* MM_typecode query functions ***************************/
#define mm_is_matrix(typecode) ((typecode)[0]=='M')
#define mm_is_sparse(typecode) ((typecode)[1]=='C')
#define mm_is_coordinate(typecode)((typecode)[1]=='C')
#define mm_is_dense(typecode) ((typecode)[1]=='A')
#define mm_is_array(typecode) ((typecode)[1]=='A')
#define mm_is_complex(typecode) ((typecode)[2]=='C')
#define mm_is_real(typecode) ((typecode)[2]=='R')
#define mm_is_pattern(typecode) ((typecode)[2]=='P')
#define mm_is_integer(typecode) ((typecode)[2]=='I')
#define mm_is_symmetric(typecode)((typecode)[3]=='S')
#define mm_is_general(typecode) ((typecode)[3]=='G')
#define mm_is_skew(typecode) ((typecode)[3]=='K')
#define mm_is_hermitian(typecode)((typecode)[3]=='H')
int mm_is_valid(MM_typecode matcode); /* too complex for a macro */
/********************* MM_typecode modify functions ***************************/
#define mm_set_matrix(typecode) ((*typecode)[0]='M')
#define mm_set_coordinate(typecode) ((*typecode)[1]='C')
#define mm_set_array(typecode) ((*typecode)[1]='A')
#define mm_set_dense(typecode) mm_set_array(typecode)
#define mm_set_sparse(typecode) mm_set_coordinate(typecode)
#define mm_set_complex(typecode)((*typecode)[2]='C')
#define mm_set_real(typecode) ((*typecode)[2]='R')
#define mm_set_pattern(typecode)((*typecode)[2]='P')
#define mm_set_integer(typecode)((*typecode)[2]='I')
#define mm_set_symmetric(typecode)((*typecode)[3]='S')
#define mm_set_general(typecode)((*typecode)[3]='G')
#define mm_set_skew(typecode) ((*typecode)[3]='K')
#define mm_set_hermitian(typecode)((*typecode)[3]='H')
#define mm_clear_typecode(typecode) ((*typecode)[0]=(*typecode)[1]= \
(*typecode)[2]=' ',(*typecode)[3]='G')
#define mm_initialize_typecode(typecode) mm_clear_typecode(typecode)
/********************* Matrix Market error codes ***************************/
#define MM_COULD_NOT_READ_FILE 11
#define MM_PREMATURE_EOF 12
#define MM_NOT_MTX 13
#define MM_NO_HEADER 14
#define MM_UNSUPPORTED_TYPE 15
#define MM_LINE_TOO_LONG 16
#define MM_COULD_NOT_WRITE_FILE 17
/******************** Matrix Market internal definitions ********************
MM_matrix_typecode: 4-character sequence

                   object     sparse/    data        storage
                              dense      type        scheme
string position:   [0]        [1]        [2]         [3]

Matrix typecode:   M(atrix)   C(oord)    R(eal)      G(eneral)
                              A(rray)    C(omplex)   H(ermitian)
                                         P(attern)   S(ymmetric)
                                         I(nteger)   K(skew)
***********************************************************************/
#define MM_MTX_STR "matrix"
#define MM_ARRAY_STR "array"
#define MM_DENSE_STR "array"
#define MM_COORDINATE_STR "coordinate"
#define MM_SPARSE_STR "coordinate"
#define MM_COMPLEX_STR "complex"
#define MM_REAL_STR "real"
#define MM_INT_STR "integer"
#define MM_GENERAL_STR "general"
#define MM_SYMM_STR "symmetric"
#define MM_HERM_STR "hermitian"
#define MM_SKEW_STR "skew-symmetric"
#define MM_PATTERN_STR "pattern"
/* high level routines */
int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
double val[], MM_typecode matcode);
int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
double val[], MM_typecode matcode);
int mm_read_mtx_crd_entry(FILE *f, int *I, int *J, double *real, double *img,
MM_typecode matcode);
int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
double **val_, int **I_, int **J_);
#endif

View File

@@ -1,53 +0,0 @@
/*
Copyright (c) 2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __UTIL_H__
#define __UTIL_H__
#include <stdio.h>
#include "matrix.h"
inline void printMatrix (DenseMatrix &M, const char *name) {
printf("Matrix %s:\n", name);
for (int row = 0; row < M.rows(); row++) {
printf("row %2d: ", row + 1);
for (int col = 0; col < M.cols(); col++)
printf("%6f ", M(row, col));
printf("\n");
}
printf("\n");
}
#endif

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -1,86 +0,0 @@
#define __ZMM64BIT__
#include "knc-i1x8.h"
/* The following tests fail because, on KNC, the native vec8_i32 and vec8_float types are 512 bits wide, not 256.
*
* Using test compiler: Intel(r) SPMD Program Compiler (ispc), 1.4.5dev (build commit d68dbbc7bce74803 @ 20130919, LLVM 3.3)
* Using C/C++ compiler: icpc (ICC) 14.0.0 20130728
*
*/
/* knc-i1x8unsafe_fast.h fails:
* ----------------------------
1 / 1206 tests FAILED compilation:
./tests/ptr-assign-lhs-math-1.ispc
33 / 1206 tests FAILED execution:
./tests/array-gather-simple.ispc
./tests/array-gather-vary.ispc
./tests/array-multidim-gather-scatter.ispc
./tests/array-scatter-vary.ispc
./tests/atomics-5.ispc
./tests/atomics-swap.ispc
./tests/cfor-array-gather-vary.ispc
./tests/cfor-gs-improve-varying-1.ispc
./tests/cfor-struct-gather-2.ispc
./tests/cfor-struct-gather-3.ispc
./tests/cfor-struct-gather.ispc
./tests/gather-struct-vector.ispc
./tests/global-array-4.ispc
./tests/gs-improve-varying-1.ispc
./tests/half-1.ispc
./tests/half-3.ispc
./tests/half.ispc
./tests/launch-3.ispc
./tests/launch-4.ispc
./tests/masked-scatter-vector.ispc
./tests/masked-struct-scatter-varying.ispc
./tests/new-delete-6.ispc
./tests/ptr-24.ispc
./tests/ptr-25.ispc
./tests/short-vec-15.ispc
./tests/struct-gather-2.ispc
./tests/struct-gather-3.ispc
./tests/struct-gather.ispc
./tests/struct-ref-lvalue.ispc
./tests/struct-test-118.ispc
./tests/struct-vary-index-expr.ispc
./tests/typedef-2.ispc
./tests/vector-varying-scatter.ispc
*/
/* knc-i1x8.h fails:
* ----------------------------
1 / 1206 tests FAILED compilation:
./tests/ptr-assign-lhs-math-1.ispc
3 / 1206 tests FAILED execution:
./tests/half-1.ispc
./tests/half-3.ispc
./tests/half.ispc
*/
/* knc-i1x8.h fails:
* ----------------------------
1 / 1206 tests FAILED compilation:
./tests/ptr-assign-lhs-math-1.ispc
4 / 1206 tests FAILED execution:
./tests/half-1.ispc
./tests/half-3.ispc
./tests/half.ispc
./tests/test-141.ispc
*/
/* generic-16.h fails (knc-i1x8.h & knc-i1x16.h are derived from it):
* ----------------------------
1 / 1206 tests FAILED compilation:
./tests/ptr-assign-lhs-math-1.ispc
6 / 1206 tests FAILED execution:
./tests/func-overload-max.ispc
./tests/half-1.ispc
./tests/half-3.ispc
./tests/half.ispc
./tests/test-141.ispc
./tests/test-143.ispc
*/

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -1,3 +0,0 @@
mandelbrot
*.ppm
objs

View File

@@ -1,8 +0,0 @@
EXAMPLE=mandelbrot
CPP_SRC=mandelbrot.cpp mandelbrot_serial.cpp
ISPC_SRC=mandelbrot.ispc
ISPC_IA_TARGETS=sse2,sse4-x2,avx-x2
ISPC_ARM_TARGETS=neon
include ../common.mk

Binary file not shown.

Binary file not shown.

View File

@@ -1,118 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <algorithm>
#include "../timing.h"
#include "mandelbrot_ispc.h"
using namespace ispc;
extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
int width, int height, int maxIterations,
int output[]);
/* Write a PPM image file with the image of the Mandelbrot set */
static void
writePPM(int *buf, int width, int height, const char *fn) {
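FILE *fp = fopen(fn, "wb");
if (!fp) {
fprintf(stderr, "Unable to open %s for writing\n", fn);
return;
}
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");
for (int i = 0; i < width*height; ++i) {
// Map the iteration count to colors by just alternating between
// two greys.
// unsigned char: 240 does not fit in a signed char.
unsigned char c = (buf[i] & 0x1) ? 240 : 20;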
for (int j = 0; j < 3; ++j)
fputc(c, fp);
}
fclose(fp);
printf("Wrote image file %s\n", fn);
}
int main() {
unsigned int width = 768;
unsigned int height = 512;
float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;
int maxIterations = 256;
int *buf = new int[width*height];
//
// Compute the image using the ispc implementation; report the minimum
// time of three runs.
//
double minISPC = 1e30;
for (int i = 0; i < 3; ++i) {
reset_and_start_timer();
mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, buf);
double dt = get_elapsed_mcycles();
minISPC = std::min(minISPC, dt);
}
printf("[mandelbrot ispc]:\t\t[%.3f] million cycles\n", minISPC);
writePPM(buf, width, height, "mandelbrot-ispc.ppm");
// Clear out the buffer
for (unsigned int i = 0; i < width * height; ++i)
buf[i] = 0;
//
// And run the serial implementation 3 times, again reporting the
// minimum time.
//
double minSerial = 1e30;
for (int i = 0; i < 3; ++i) {
reset_and_start_timer();
mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
double dt = get_elapsed_mcycles();
minSerial = std::min(minSerial, dt);
}
printf("[mandelbrot serial]:\t\t[%.3f] million cycles\n", minSerial);
writePPM(buf, width, height, "mandelbrot-serial.ppm");
printf("\t\t\t\t(%.2fx speedup from ISPC)\n", minSerial/minISPC);
delete[] buf;
return 0;
}

View File

@@ -1,78 +0,0 @@
/*
Copyright (c) 2010-2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
static inline int mandel(float c_re, float c_im, int count) {
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {
if (z_re * z_re + z_im * z_im > 4.)
break;
float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
unmasked {
z_re = c_re + new_re;
z_im = c_im + new_im;
}
}
return i;
}
export void mandelbrot_ispc(uniform float x0, uniform float y0,
uniform float x1, uniform float y1,
uniform int width, uniform int height,
uniform int maxIterations,
uniform int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
for (uniform int j = 0; j < height; j++) {
// foreach processes programCount values of i in parallel on each
// iteration, and also handles the case where width is not a multiple
// of programCount.
foreach (i = 0 ... width) {
// Figure out the position on the complex plane to compute the
// number of iterations at. Note that the x values are
// different across different program instances, since i is a
// varying value that differs in each program instance.
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = j * width + i;
output[index] = mandel(x, y, maxIterations);
}
}
}

View File

@@ -1,175 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}</ProjectGuid>
<Keyword>Win32Proj</Keyword>
<RootNamespace>mandelbrot</RootNamespace>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemGroup>
<ClCompile Include="mandelbrot.cpp" />
<ClCompile Include="mandelbrot_serial.cpp" />
</ItemGroup>
<ItemGroup>
<CustomBuild Include="mandelbrot.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
</CustomBuild>
</ItemGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>

View File

@@ -1,68 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
static int mandel(float c_re, float c_im, int count) {
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
void mandelbrot_serial(float x0, float y0, float x1, float y1,
int width, int height, int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; ++i) {
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}

Binary file not shown.

View File

@@ -1,843 +0,0 @@
//
// Generated by LLVM NVPTX Back-End
//
.version 3.1
.target sm_35, texmode_independent
.address_size 64
// .globl __vselect_i8
// @__vselect_i8
.func (.param .align 1 .b8 func_retval0[1]) __vselect_i8(
.param .align 1 .b8 __vselect_i8_param_0[1],
.param .align 1 .b8 __vselect_i8_param_1[1],
.param .align 4 .b8 __vselect_i8_param_2[4]
)
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u32 %r0, [__vselect_i8_param_2];
setp.eq.s32 %p0, %r0, 0;
ld.param.u8 %rc0, [__vselect_i8_param_0];
ld.param.u8 %rc1, [__vselect_i8_param_1];
selp.b16 %rc0, %rc0, %rc1, %p0;
st.param.b8 [func_retval0+0], %rc0;
ret;
}
// .globl __vselect_i16
.func (.param .align 2 .b8 func_retval0[2]) __vselect_i16(
.param .align 2 .b8 __vselect_i16_param_0[2],
.param .align 2 .b8 __vselect_i16_param_1[2],
.param .align 4 .b8 __vselect_i16_param_2[4]
) // @__vselect_i16
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u32 %r0, [__vselect_i16_param_2];
setp.eq.s32 %p0, %r0, 0;
ld.param.u16 %rs0, [__vselect_i16_param_0];
ld.param.u16 %rs1, [__vselect_i16_param_1];
selp.b16 %rs0, %rs0, %rs1, %p0;
st.param.b16 [func_retval0+0], %rs0;
ret;
}
// .globl __vselect_i64
.func (.param .align 8 .b8 func_retval0[8]) __vselect_i64(
.param .align 8 .b8 __vselect_i64_param_0[8],
.param .align 8 .b8 __vselect_i64_param_1[8],
.param .align 4 .b8 __vselect_i64_param_2[4]
) // @__vselect_i64
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u32 %r0, [__vselect_i64_param_2];
setp.eq.s32 %p0, %r0, 0;
ld.param.u64 %rl0, [__vselect_i64_param_0];
ld.param.u64 %rl1, [__vselect_i64_param_1];
selp.b64 %rl0, %rl0, %rl1, %p0;
st.param.b64 [func_retval0+0], %rl0;
ret;
}
// .globl __aos_to_soa4_float1
.func __aos_to_soa4_float1(
.param .align 4 .b8 __aos_to_soa4_float1_param_0[4],
.param .align 4 .b8 __aos_to_soa4_float1_param_1[4],
.param .align 4 .b8 __aos_to_soa4_float1_param_2[4],
.param .align 4 .b8 __aos_to_soa4_float1_param_3[4],
.param .b64 __aos_to_soa4_float1_param_4,
.param .b64 __aos_to_soa4_float1_param_5,
.param .b64 __aos_to_soa4_float1_param_6,
.param .b64 __aos_to_soa4_float1_param_7
) // @__aos_to_soa4_float1
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u64 %rl0, [__aos_to_soa4_float1_param_4];
ld.param.u64 %rl1, [__aos_to_soa4_float1_param_5];
ld.param.u64 %rl2, [__aos_to_soa4_float1_param_6];
ld.param.u64 %rl3, [__aos_to_soa4_float1_param_7];
ld.param.f32 %f0, [__aos_to_soa4_float1_param_0];
ld.param.f32 %f1, [__aos_to_soa4_float1_param_1];
ld.param.f32 %f2, [__aos_to_soa4_float1_param_2];
ld.param.f32 %f3, [__aos_to_soa4_float1_param_3];
st.f32 [%rl0], %f0;
st.f32 [%rl1], %f1;
st.f32 [%rl2], %f2;
st.f32 [%rl3], %f3;
ret;
}
// .globl __soa_to_aos4_float1
.func __soa_to_aos4_float1(
.param .align 4 .b8 __soa_to_aos4_float1_param_0[4],
.param .align 4 .b8 __soa_to_aos4_float1_param_1[4],
.param .align 4 .b8 __soa_to_aos4_float1_param_2[4],
.param .align 4 .b8 __soa_to_aos4_float1_param_3[4],
.param .b64 __soa_to_aos4_float1_param_4,
.param .b64 __soa_to_aos4_float1_param_5,
.param .b64 __soa_to_aos4_float1_param_6,
.param .b64 __soa_to_aos4_float1_param_7
) // @__soa_to_aos4_float1
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u64 %rl0, [__soa_to_aos4_float1_param_4];
ld.param.u64 %rl1, [__soa_to_aos4_float1_param_5];
ld.param.u64 %rl2, [__soa_to_aos4_float1_param_6];
ld.param.u64 %rl3, [__soa_to_aos4_float1_param_7];
ld.param.f32 %f0, [__soa_to_aos4_float1_param_0];
ld.param.f32 %f1, [__soa_to_aos4_float1_param_1];
ld.param.f32 %f2, [__soa_to_aos4_float1_param_2];
ld.param.f32 %f3, [__soa_to_aos4_float1_param_3];
st.f32 [%rl0], %f0;
st.f32 [%rl1], %f1;
st.f32 [%rl2], %f2;
st.f32 [%rl3], %f3;
ret;
}
// .globl __aos_to_soa3_float1
.func __aos_to_soa3_float1(
.param .align 4 .b8 __aos_to_soa3_float1_param_0[4],
.param .align 4 .b8 __aos_to_soa3_float1_param_1[4],
.param .align 4 .b8 __aos_to_soa3_float1_param_2[4],
.param .b64 __aos_to_soa3_float1_param_3,
.param .b64 __aos_to_soa3_float1_param_4,
.param .b64 __aos_to_soa3_float1_param_5
) // @__aos_to_soa3_float1
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u64 %rl0, [__aos_to_soa3_float1_param_3];
ld.param.u64 %rl1, [__aos_to_soa3_float1_param_4];
ld.param.u64 %rl2, [__aos_to_soa3_float1_param_5];
ld.param.f32 %f0, [__aos_to_soa3_float1_param_0];
ld.param.f32 %f1, [__aos_to_soa3_float1_param_1];
ld.param.f32 %f2, [__aos_to_soa3_float1_param_2];
st.f32 [%rl0], %f0;
st.f32 [%rl1], %f1;
st.f32 [%rl2], %f2;
ret;
}
// .globl __soa_to_aos3_float1
.func __soa_to_aos3_float1(
.param .align 4 .b8 __soa_to_aos3_float1_param_0[4],
.param .align 4 .b8 __soa_to_aos3_float1_param_1[4],
.param .align 4 .b8 __soa_to_aos3_float1_param_2[4],
.param .b64 __soa_to_aos3_float1_param_3,
.param .b64 __soa_to_aos3_float1_param_4,
.param .b64 __soa_to_aos3_float1_param_5
) // @__soa_to_aos3_float1
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.u64 %rl0, [__soa_to_aos3_float1_param_3];
ld.param.u64 %rl1, [__soa_to_aos3_float1_param_4];
ld.param.u64 %rl2, [__soa_to_aos3_float1_param_5];
ld.param.f32 %f0, [__soa_to_aos3_float1_param_0];
ld.param.f32 %f1, [__soa_to_aos3_float1_param_1];
ld.param.f32 %f2, [__soa_to_aos3_float1_param_2];
st.f32 [%rl0], %f0;
st.f32 [%rl1], %f1;
st.f32 [%rl2], %f2;
ret;
}
// .globl __rsqrt_varying_double
.func (.param .align 8 .b8 func_retval0[8]) __rsqrt_varying_double(
.param .align 8 .b8 __rsqrt_varying_double_param_0[8]
) // @__rsqrt_varying_double
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0:
ld.param.f64 %fl0, [__rsqrt_varying_double_param_0];
rsqrt.approx.f64 %fl0, %fl0;
st.param.f64 [func_retval0+0], %fl0;
ret;
}
// .globl mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_
.func mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_0,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_1,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_2,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_3,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_4,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_5,
.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_6,
.param .b64 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_7,
.param .align 4 .b8 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_8[4]
) // @mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0: // %allocas
ld.param.f32 %f0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_0];
ld.param.f32 %f1, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_1];
ld.param.f32 %f3, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_2];
ld.param.f32 %f2, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_3];
ld.param.u32 %r0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_4];
ld.param.u32 %r1, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_5];
ld.param.u32 %r2, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_6];
ld.param.u64 %rl0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_7];
ld.param.u32 %r3, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_8];
setp.lt.s32 %p0, %r3, 0;
sub.f32 %f3, %f3, %f0;
cvt.rn.f32.s32 %f4, %r0;
sub.f32 %f2, %f2, %f1;
cvt.rn.f32.s32 %f5, %r1;
div.rn.f32 %f2, %f2, %f5;
div.rn.f32 %f3, %f3, %f4;
@%p0 bra BB8_9;
// BB#1: // %for_test110.preheader
setp.lt.s32 %p0, %r1, 1;
@%p0 bra BB8_45;
// BB#2: // %outer_not_in_extras140.preheader.lr.ph
setp.gt.s32 %p0, %r2, 0;
mov.u32 %r3, 0;
selp.b32 %r4, -1, 0, %p0;
shl.b32 %r5, %r0, 2;
mov.u32 %r6, %r3;
BB8_3: // %outer_not_in_extras140.preheader
// =>This Loop Header: Depth=1
// Child Loop BB8_41 Depth 2
// Child Loop BB8_43 Depth 2
// Child Loop BB8_38 Depth 2
// Child Loop BB8_33 Depth 3
setp.lt.s32 %p0, %r0, 1;
@%p0 bra BB8_4;
// BB#31: // %foreach_full_body120.lr.ph
// in Loop: Header=BB8_3 Depth=1
setp.lt.s32 %p0, %r4, 0;
mov.u32 %r7, %r0;
mov.u32 %r8, %r3;
@%p0 bra BB8_32;
bra.uni BB8_43;
BB8_32: // in Loop: Header=BB8_3 Depth=1
mov.u64 %rl1, 0;
cvt.rn.f32.s32 %f4, %r6;
fma.rn.f32 %f4, %f2, %f4, %f1;
mul.lo.s32 %r7, %r6, %r0;
BB8_38: // %for_loop.i380.lr.ph.us
// Parent Loop BB8_3 Depth=1
// => This Loop Header: Depth=2
// Child Loop BB8_33 Depth 3
cvt.u32.u64 %r8, %rl1;
cvt.rn.f32.s32 %f5, %r8;
fma.rn.f32 %f5, %f3, %f5, %f0;
mov.u32 %r10, 0;
mov.u32 %r12, %r4;
mov.u32 %r11, %r10;
mov.u32 %r9, %r10;
mov.f32 %f7, %f5;
mov.f32 %f6, %f4;
BB8_33: // %for_loop.i380.us
// Parent Loop BB8_3 Depth=1
// Parent Loop BB8_38 Depth=2
// => This Inner Loop Header: Depth=3
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p0, %f9, 0f40800000;
selp.b32 %r13, %r12, 0, %p0;
or.b32 %r11, %r13, %r11;
shr.u32 %r13, %r11, 31;
shr.u32 %r14, %r12, 31;
setp.eq.s32 %p0, %r13, %r14;
@%p0 bra BB8_34;
bra.uni BB8_35;
BB8_34: // in Loop: Header=BB8_33 Depth=3
mov.u32 %r12, %r10;
bra.uni BB8_36;
BB8_35: // %not_all_continued_or_breaked.i394.us
// in Loop: Header=BB8_33 Depth=3
mul.f32 %f9, %f6, %f6;
not.b32 %r13, %r11;
and.b32 %r12, %r12, %r13;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f5, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f4;
mov.f32 %f7, %f8;
BB8_36: // %for_step.i363.us
// in Loop: Header=BB8_33 Depth=3
setp.ne.s32 %p0, %r12, 0;
selp.u32 %r13, 1, 0, %p0;
add.s32 %r9, %r9, %r13;
setp.lt.s32 %p0, %r9, %r2;
selp.b32 %r12, %r12, 0, %p0;
setp.lt.s32 %p0, %r12, 0;
@%p0 bra BB8_33;
// BB#37: // %mandel___vyfvyfvyi.exit395.us
// in Loop: Header=BB8_38 Depth=2
add.s32 %r8, %r8, %r7;
shl.b32 %r8, %r8, 2;
cvt.s64.s32 %rl2, %r8;
add.s64 %rl2, %rl2, %rl0;
st.u32 [%rl2], %r9;
add.s64 %rl1, %rl1, 1;
cvt.u32.u64 %r8, %rl1;
setp.eq.s32 %p0, %r8, %r0;
@%p0 bra BB8_44;
bra.uni BB8_38;
BB8_43: // %mandel___vyfvyfvyi.exit395
// Parent Loop BB8_3 Depth=1
// => This Inner Loop Header: Depth=2
cvt.s64.s32 %rl1, %r8;
add.s64 %rl1, %rl1, %rl0;
mov.u32 %r9, 0;
st.u32 [%rl1], %r9;
add.s32 %r8, %r8, 4;
add.s32 %r7, %r7, -1;
setp.eq.s32 %p0, %r7, 0;
@%p0 bra BB8_44;
bra.uni BB8_43;
BB8_4: // %partial_inner_all_outer156
// in Loop: Header=BB8_3 Depth=1
@%p0 bra BB8_44;
// BB#5: // %partial_inner_only197
// in Loop: Header=BB8_3 Depth=1
setp.gt.s32 %p0, %r0, 0;
mov.u32 %r8, 0;
fma.rn.f32 %f4, %f3, 0f00000000, %f0;
cvt.rn.f32.s32 %f5, %r6;
fma.rn.f32 %f5, %f2, %f5, %f1;
selp.b32 %r7, %r4, 0, %p0;
setp.lt.s32 %p1, %r7, 0;
mov.u32 %r10, %r4;
mov.u32 %r9, %r8;
mov.u32 %r7, %r8;
mov.f32 %f7, %f4;
mov.f32 %f6, %f5;
@%p1 bra BB8_41;
bra.uni BB8_6;
BB8_41: // %for_loop.i
// Parent Loop BB8_3 Depth=1
// => This Inner Loop Header: Depth=2
selp.b32 %r11, %r10, 0, %p0;
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p1, %f9, 0f40800000;
selp.b32 %r12, %r10, 0, %p1;
or.b32 %r9, %r12, %r9;
selp.b32 %r12, %r9, 0, %p0;
shr.u32 %r12, %r12, 31;
shr.u32 %r11, %r11, 31;
setp.eq.s32 %p1, %r12, %r11;
@%p1 bra BB8_42;
bra.uni BB8_39;
BB8_42: // in Loop: Header=BB8_41 Depth=2
mov.u32 %r10, %r8;
bra.uni BB8_40;
BB8_39: // %not_all_continued_or_breaked.i
// in Loop: Header=BB8_41 Depth=2
mul.f32 %f9, %f6, %f6;
not.b32 %r11, %r9;
and.b32 %r10, %r10, %r11;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f4, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f5;
mov.f32 %f7, %f8;
BB8_40: // %for_step.i
// in Loop: Header=BB8_41 Depth=2
setp.ne.s32 %p1, %r10, 0;
selp.u32 %r11, 1, 0, %p1;
add.s32 %r7, %r7, %r11;
setp.lt.s32 %p1, %r7, %r2;
selp.b32 %r10, %r10, 0, %p1;
selp.b32 %r11, %r10, 0, %p0;
setp.gt.s32 %p1, %r11, -1;
@%p1 bra BB8_7;
bra.uni BB8_41;
BB8_6: // in Loop: Header=BB8_3 Depth=1
mov.u32 %r7, %r8;
BB8_7: // %mandel___vyfvyfvyi.exit
// in Loop: Header=BB8_3 Depth=1
setp.lt.s32 %p0, %r0, 1;
@%p0 bra BB8_44;
// BB#8: // %pl_dolane.i
// in Loop: Header=BB8_3 Depth=1
mul.lo.s32 %r8, %r6, %r0;
shl.b32 %r8, %r8, 2;
cvt.s64.s32 %rl1, %r8;
add.s64 %rl1, %rl1, %rl0;
st.u32 [%rl1], %r7;
BB8_44: // %foreach_reset128
// in Loop: Header=BB8_3 Depth=1
add.s32 %r6, %r6, 1;
add.s32 %r3, %r3, %r5;
setp.eq.s32 %p0, %r6, %r1;
@%p0 bra BB8_45;
bra.uni BB8_3;
BB8_9: // %for_test.preheader
setp.lt.s32 %p0, %r1, 1;
@%p0 bra BB8_45;
// BB#10: // %outer_not_in_extras.preheader.lr.ph
setp.gt.s32 %p0, %r2, 0;
mov.u32 %r3, 0;
selp.b32 %r4, -1, 0, %p0;
shl.b32 %r5, %r0, 2;
mov.u32 %r6, %r3;
BB8_11: // %outer_not_in_extras.preheader
// =>This Loop Header: Depth=1
// Child Loop BB8_23 Depth 2
// Child Loop BB8_20 Depth 2
// Child Loop BB8_19 Depth 2
// Child Loop BB8_14 Depth 3
setp.lt.s32 %p0, %r0, 1;
@%p0 bra BB8_28;
// BB#12: // %foreach_full_body.lr.ph
// in Loop: Header=BB8_11 Depth=1
setp.lt.s32 %p0, %r4, 0;
mov.u32 %r7, %r0;
mov.u32 %r8, %r3;
@%p0 bra BB8_13;
bra.uni BB8_20;
BB8_13: // in Loop: Header=BB8_11 Depth=1
mov.u64 %rl1, 0;
cvt.rn.f32.s32 %f4, %r6;
fma.rn.f32 %f4, %f2, %f4, %f1;
mul.lo.s32 %r7, %r6, %r0;
BB8_19: // %for_loop.i281.lr.ph.us
// Parent Loop BB8_11 Depth=1
// => This Loop Header: Depth=2
// Child Loop BB8_14 Depth 3
cvt.u32.u64 %r8, %rl1;
cvt.rn.f32.s32 %f5, %r8;
fma.rn.f32 %f5, %f3, %f5, %f0;
mov.u32 %r10, 0;
mov.u32 %r12, %r4;
mov.u32 %r11, %r10;
mov.u32 %r9, %r10;
mov.f32 %f7, %f5;
mov.f32 %f6, %f4;
BB8_14: // %for_loop.i281.us
// Parent Loop BB8_11 Depth=1
// Parent Loop BB8_19 Depth=2
// => This Inner Loop Header: Depth=3
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p0, %f9, 0f40800000;
selp.b32 %r13, %r12, 0, %p0;
or.b32 %r11, %r13, %r11;
shr.u32 %r13, %r11, 31;
shr.u32 %r14, %r12, 31;
setp.eq.s32 %p0, %r13, %r14;
@%p0 bra BB8_15;
bra.uni BB8_16;
BB8_15: // in Loop: Header=BB8_14 Depth=3
mov.u32 %r12, %r10;
bra.uni BB8_17;
BB8_16: // %not_all_continued_or_breaked.i295.us
// in Loop: Header=BB8_14 Depth=3
mul.f32 %f9, %f6, %f6;
not.b32 %r13, %r11;
and.b32 %r12, %r12, %r13;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f5, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f4;
mov.f32 %f7, %f8;
BB8_17: // %for_step.i264.us
// in Loop: Header=BB8_14 Depth=3
setp.ne.s32 %p0, %r12, 0;
selp.u32 %r13, 1, 0, %p0;
add.s32 %r9, %r9, %r13;
setp.lt.s32 %p0, %r9, %r2;
selp.b32 %r12, %r12, 0, %p0;
setp.lt.s32 %p0, %r12, 0;
@%p0 bra BB8_14;
// BB#18: // %mandel___vyfvyfvyi.exit296.us
// in Loop: Header=BB8_19 Depth=2
add.s32 %r8, %r8, %r7;
shl.b32 %r8, %r8, 2;
cvt.s64.s32 %rl2, %r8;
add.s64 %rl2, %rl2, %rl0;
st.u32 [%rl2], %r9;
add.s64 %rl1, %rl1, 1;
cvt.u32.u64 %r8, %rl1;
setp.eq.s32 %p0, %r8, %r0;
@%p0 bra BB8_27;
bra.uni BB8_19;
BB8_20: // %mandel___vyfvyfvyi.exit296
// Parent Loop BB8_11 Depth=1
// => This Inner Loop Header: Depth=2
cvt.s64.s32 %rl1, %r8;
add.s64 %rl1, %rl1, %rl0;
mov.u32 %r9, 0;
st.u32 [%rl1], %r9;
add.s32 %r8, %r8, 4;
add.s32 %r7, %r7, -1;
setp.eq.s32 %p0, %r7, 0;
@%p0 bra BB8_27;
bra.uni BB8_20;
BB8_28: // %partial_inner_all_outer
// in Loop: Header=BB8_11 Depth=1
@%p0 bra BB8_27;
// BB#29: // %partial_inner_only
// in Loop: Header=BB8_11 Depth=1
setp.gt.s32 %p0, %r0, 0;
mov.u32 %r8, 0;
fma.rn.f32 %f4, %f3, 0f00000000, %f0;
cvt.rn.f32.s32 %f5, %r6;
fma.rn.f32 %f5, %f2, %f5, %f1;
selp.b32 %r7, %r4, 0, %p0;
setp.lt.s32 %p1, %r7, 0;
mov.u32 %r10, %r4;
mov.u32 %r9, %r8;
mov.u32 %r7, %r8;
mov.f32 %f7, %f4;
mov.f32 %f6, %f5;
@%p1 bra BB8_23;
bra.uni BB8_30;
BB8_23: // %for_loop.i332
// Parent Loop BB8_11 Depth=1
// => This Inner Loop Header: Depth=2
selp.b32 %r11, %r10, 0, %p0;
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p1, %f9, 0f40800000;
selp.b32 %r12, %r10, 0, %p1;
or.b32 %r9, %r12, %r9;
selp.b32 %r12, %r9, 0, %p0;
shr.u32 %r12, %r12, 31;
shr.u32 %r11, %r11, 31;
setp.eq.s32 %p1, %r12, %r11;
@%p1 bra BB8_24;
bra.uni BB8_21;
BB8_24: // in Loop: Header=BB8_23 Depth=2
mov.u32 %r10, %r8;
bra.uni BB8_22;
BB8_21: // %not_all_continued_or_breaked.i346
// in Loop: Header=BB8_23 Depth=2
mul.f32 %f9, %f6, %f6;
not.b32 %r11, %r9;
and.b32 %r10, %r10, %r11;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f4, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f5;
mov.f32 %f7, %f8;
BB8_22: // %for_step.i313
// in Loop: Header=BB8_23 Depth=2
setp.ne.s32 %p1, %r10, 0;
selp.u32 %r11, 1, 0, %p1;
add.s32 %r7, %r7, %r11;
setp.lt.s32 %p1, %r7, %r2;
selp.b32 %r10, %r10, 0, %p1;
selp.b32 %r11, %r10, 0, %p0;
setp.gt.s32 %p1, %r11, -1;
@%p1 bra BB8_25;
bra.uni BB8_23;
BB8_30: // in Loop: Header=BB8_11 Depth=1
mov.u32 %r7, %r8;
BB8_25: // %mandel___vyfvyfvyi.exit347
// in Loop: Header=BB8_11 Depth=1
setp.lt.s32 %p0, %r0, 1;
@%p0 bra BB8_27;
// BB#26: // %pl_dolane.i452
// in Loop: Header=BB8_11 Depth=1
mul.lo.s32 %r8, %r6, %r0;
shl.b32 %r8, %r8, 2;
cvt.s64.s32 %rl1, %r8;
add.s64 %rl1, %rl1, %rl0;
st.u32 [%rl1], %r7;
BB8_27: // %foreach_reset
// in Loop: Header=BB8_11 Depth=1
add.s32 %r6, %r6, 1;
add.s32 %r3, %r3, %r5;
setp.eq.s32 %p0, %r6, %r1;
@%p0 bra BB8_45;
bra.uni BB8_11;
BB8_45: // %for_exit
ret;
}
// .globl mandelbrot_ispc
.func mandelbrot_ispc(
.param .b32 mandelbrot_ispc_param_0,
.param .b32 mandelbrot_ispc_param_1,
.param .b32 mandelbrot_ispc_param_2,
.param .b32 mandelbrot_ispc_param_3,
.param .b32 mandelbrot_ispc_param_4,
.param .b32 mandelbrot_ispc_param_5,
.param .b32 mandelbrot_ispc_param_6,
.param .b64 mandelbrot_ispc_param_7
) // @mandelbrot_ispc
{
.reg .pred %p<396>;
.reg .s16 %rc<396>;
.reg .s16 %rs<396>;
.reg .s32 %r<396>;
.reg .s64 %rl<396>;
.reg .f32 %f<396>;
.reg .f64 %fl<396>;
// BB#0: // %allocas
ld.param.u32 %r0, [mandelbrot_ispc_param_5];
setp.lt.s32 %p0, %r0, 1;
@%p0 bra BB9_18;
// BB#1: // %outer_not_in_extras.preheader.lr.ph
ld.param.f32 %f0, [mandelbrot_ispc_param_0];
ld.param.f32 %f1, [mandelbrot_ispc_param_1];
ld.param.f32 %f3, [mandelbrot_ispc_param_2];
ld.param.f32 %f2, [mandelbrot_ispc_param_3];
ld.param.u32 %r1, [mandelbrot_ispc_param_4];
ld.param.u32 %r2, [mandelbrot_ispc_param_6];
ld.param.u64 %rl0, [mandelbrot_ispc_param_7];
sub.f32 %f3, %f3, %f0;
cvt.rn.f32.s32 %f4, %r1;
sub.f32 %f2, %f2, %f1;
cvt.rn.f32.s32 %f5, %r0;
div.rn.f32 %f2, %f2, %f5;
div.rn.f32 %f3, %f3, %f4;
setp.gt.s32 %p0, %r2, 0;
mov.u32 %r3, 0;
selp.b32 %r4, -1, 0, %p0;
BB9_2: // %outer_not_in_extras.preheader
// =>This Loop Header: Depth=1
// Child Loop BB9_13 Depth 2
// Child Loop BB9_4 Depth 2
// Child Loop BB9_9 Depth 3
setp.lt.s32 %p0, %r1, 1;
@%p0 bra BB9_19;
// BB#3: // %foreach_full_body.lr.ph
// in Loop: Header=BB9_2 Depth=1
mov.u64 %rl1, 0;
cvt.rn.f32.s32 %f4, %r3;
fma.rn.f32 %f4, %f2, %f4, %f1;
mul.lo.s32 %r5, %r3, %r1;
BB9_4: // %foreach_full_body
// Parent Loop BB9_2 Depth=1
// => This Loop Header: Depth=2
// Child Loop BB9_9 Depth 3
setp.lt.s32 %p0, %r4, 0;
cvt.u32.u64 %r6, %rl1;
cvt.rn.f32.s32 %f5, %r6;
fma.rn.f32 %f5, %f3, %f5, %f0;
mov.u32 %r8, 0;
mov.u32 %r10, %r4;
mov.u32 %r9, %r8;
mov.u32 %r7, %r8;
mov.f32 %f7, %f5;
mov.f32 %f6, %f4;
@%p0 bra BB9_9;
bra.uni BB9_5;
BB9_9: // %for_loop.i281
// Parent Loop BB9_2 Depth=1
// Parent Loop BB9_4 Depth=2
// => This Inner Loop Header: Depth=3
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p0, %f9, 0f40800000;
selp.b32 %r11, %r10, 0, %p0;
or.b32 %r9, %r11, %r9;
shr.u32 %r11, %r9, 31;
shr.u32 %r12, %r10, 31;
setp.eq.s32 %p0, %r11, %r12;
@%p0 bra BB9_10;
bra.uni BB9_7;
BB9_10: // in Loop: Header=BB9_9 Depth=3
mov.u32 %r10, %r8;
bra.uni BB9_8;
BB9_7: // %not_all_continued_or_breaked.i295
// in Loop: Header=BB9_9 Depth=3
mul.f32 %f9, %f6, %f6;
not.b32 %r11, %r9;
and.b32 %r10, %r10, %r11;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f5, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f4;
mov.f32 %f7, %f8;
BB9_8: // %for_step.i264
// in Loop: Header=BB9_9 Depth=3
setp.ne.s32 %p0, %r10, 0;
selp.u32 %r11, 1, 0, %p0;
add.s32 %r7, %r7, %r11;
setp.lt.s32 %p0, %r7, %r2;
selp.b32 %r10, %r10, 0, %p0;
setp.gt.s32 %p0, %r10, -1;
@%p0 bra BB9_6;
bra.uni BB9_9;
BB9_5: // in Loop: Header=BB9_4 Depth=2
mov.u32 %r7, %r8;
BB9_6: // %mandel___vyfvyfvyi.exit296
// in Loop: Header=BB9_4 Depth=2
add.s32 %r6, %r6, %r5;
shl.b32 %r6, %r6, 2;
cvt.s64.s32 %rl2, %r6;
add.s64 %rl2, %rl2, %rl0;
st.u32 [%rl2], %r7;
add.s64 %rl1, %rl1, 1;
cvt.u32.u64 %r6, %rl1;
setp.eq.s32 %p0, %r6, %r1;
@%p0 bra BB9_17;
bra.uni BB9_4;
BB9_19: // %partial_inner_all_outer
// in Loop: Header=BB9_2 Depth=1
@%p0 bra BB9_17;
// BB#20: // %partial_inner_only
// in Loop: Header=BB9_2 Depth=1
setp.gt.s32 %p0, %r1, 0;
mov.u32 %r6, 0;
fma.rn.f32 %f4, %f3, 0f00000000, %f0;
cvt.rn.f32.s32 %f5, %r3;
fma.rn.f32 %f5, %f2, %f5, %f1;
selp.b32 %r5, %r4, 0, %p0;
setp.lt.s32 %p1, %r5, 0;
mov.u32 %r8, %r4;
mov.u32 %r7, %r6;
mov.u32 %r5, %r6;
mov.f32 %f7, %f4;
mov.f32 %f6, %f5;
@%p1 bra BB9_13;
bra.uni BB9_21;
BB9_13: // %for_loop.i332
// Parent Loop BB9_2 Depth=1
// => This Inner Loop Header: Depth=2
selp.b32 %r9, %r8, 0, %p0;
mul.f32 %f8, %f7, %f7;
fma.rn.f32 %f9, %f6, %f6, %f8;
setp.gtu.f32 %p1, %f9, 0f40800000;
selp.b32 %r10, %r8, 0, %p1;
or.b32 %r7, %r10, %r7;
selp.b32 %r10, %r7, 0, %p0;
shr.u32 %r10, %r10, 31;
shr.u32 %r9, %r9, 31;
setp.eq.s32 %p1, %r10, %r9;
@%p1 bra BB9_14;
bra.uni BB9_11;
BB9_14: // in Loop: Header=BB9_13 Depth=2
mov.u32 %r8, %r6;
bra.uni BB9_12;
BB9_11: // %not_all_continued_or_breaked.i346
// in Loop: Header=BB9_13 Depth=2
mul.f32 %f9, %f6, %f6;
not.b32 %r9, %r7;
and.b32 %r8, %r8, %r9;
sub.f32 %f8, %f8, %f9;
add.f32 %f8, %f4, %f8;
add.f32 %f7, %f7, %f7;
fma.rn.f32 %f6, %f6, %f7, %f5;
mov.f32 %f7, %f8;
BB9_12: // %for_step.i313
// in Loop: Header=BB9_13 Depth=2
setp.ne.s32 %p1, %r8, 0;
selp.u32 %r9, 1, 0, %p1;
add.s32 %r5, %r5, %r9;
setp.lt.s32 %p1, %r5, %r2;
selp.b32 %r8, %r8, 0, %p1;
selp.b32 %r9, %r8, 0, %p0;
setp.gt.s32 %p1, %r9, -1;
@%p1 bra BB9_15;
bra.uni BB9_13;
BB9_21: // in Loop: Header=BB9_2 Depth=1
mov.u32 %r5, %r6;
BB9_15: // %mandel___vyfvyfvyi.exit347
// in Loop: Header=BB9_2 Depth=1
setp.lt.s32 %p0, %r1, 1;
@%p0 bra BB9_17;
// BB#16: // %pl_dolane.i
// in Loop: Header=BB9_2 Depth=1
mul.lo.s32 %r6, %r3, %r1;
shl.b32 %r6, %r6, 2;
cvt.s64.s32 %rl1, %r6;
add.s64 %rl1, %rl1, %rl0;
st.u32 [%rl1], %r5;
BB9_17: // %foreach_reset
// in Loop: Header=BB9_2 Depth=1
add.s32 %r3, %r3, 1;
setp.eq.s32 %p0, %r3, %r0;
@%p0 bra BB9_18;
bra.uni BB9_2;
BB9_18: // %for_exit
ret;
}

Binary file not shown.

Binary file not shown.

@@ -1,2 +0,0 @@
mandelbrot
*.ppm

@@ -1,8 +0,0 @@
EXAMPLE=mandelbrot_tasks
CPP_SRC=mandelbrot_tasks.cpp mandelbrot_tasks_serial.cpp
ISPC_SRC=mandelbrot_tasks.ispc
ISPC_IA_TARGETS=avx
ISPC_ARM_TARGETS=neon
include ../common.mk

@@ -1,164 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <algorithm>
#include <stdlib.h>
#include <string.h>
#include "../timing.h"
#include "mandelbrot_ispc.h"
using namespace ispc;
#include <sys/time.h>
// Return the current wall-clock time in seconds (microsecond resolution).
double rtc(void)
{
struct timeval Tvalue;
// The timezone argument is obsolete; POSIX says to pass NULL.
gettimeofday(&Tvalue, NULL);
return (double) Tvalue.tv_sec +
1.e-6*((double) Tvalue.tv_usec);
}
extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
int width, int height, int maxIterations,
int output[]);
/* Write a PPM image file with the image of the Mandelbrot set */
static void
writePPM(int *buf, int width, int height, const char *fn) {
FILE *fp = fopen(fn, "wb");
if (fp == NULL) {
// Could not open the output file; report the error and skip writing.
perror(fn);
return;
}
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");
for (int i = 0; i < width*height; ++i) {
// Map the iteration count to colors by just alternating between
// two greys.
unsigned char c = (buf[i] & 0x1) ? 240 : 20;
for (int j = 0; j < 3; ++j)
fputc(c, fp);
}
fclose(fp);
printf("Wrote image file %s\n", fn);
}
static void usage() {
fprintf(stderr, "usage: mandelbrot_tasks [--scale=<factor>]\n");
exit(1);
}
int main(int argc, char *argv[]) {
unsigned int width = 1536;
unsigned int height = 1024;
float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;
if (argc == 1)
;
else if (argc == 2) {
if (strncmp(argv[1], "--scale=", 8) == 0) {
float scale = atof(argv[1] + 8);
if (scale <= 0.f)
usage();
width *= scale;
height *= scale;
// round up to multiples of 16
width = (width + 0xf) & ~0xf;
height = (height + 0xf) & ~0xf;
}
else
usage();
}
else
usage();
int maxIterations = 512;
int *buf = new int[width*height];
//
// Compute the image using the ispc implementation; report the minimum
// time of three runs.
//
double minISPC = 1e30;
for (int i = 0; i < 3; ++i) {
// Clear out the buffer (use a distinct index so the outer i is not shadowed)
for (unsigned int j = 0; j < width * height; ++j)
buf[j] = 0;
reset_and_start_timer();
double t0 = rtc();
mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, buf);
double dt = rtc() - t0; // wall-clock seconds, not get_elapsed_mcycles()
minISPC = std::min(minISPC, dt);
}
printf("[mandelbrot ispc+tasks]:\t[%.3f] seconds\n", minISPC);
writePPM(buf, width, height, "mandelbrot-ispc.ppm");
//
// And run the serial implementation 3 times, again reporting the
// minimum time.
//
#if 0
double minSerial = 1e30;
for (int i = 0; i < 3; ++i) {
// Clear out the buffer
for (unsigned int j = 0; j < width * height; ++j)
buf[j] = 0;
// Time with rtc() so the result is in the same units as minISPC
double t0 = rtc();
mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
double dt = rtc() - t0;
minSerial = std::min(minSerial, dt);
}
printf("[mandelbrot serial]:\t\t[%.3f] seconds\n", minSerial);
writePPM(buf, width, height, "mandelbrot-serial.ppm");
// The speedup is only meaningful when the serial run above is enabled
printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", minSerial/minISPC);
#endif
delete[] buf;
return 0;
}

@@ -1,86 +0,0 @@
/*
Copyright (c) 2010-2012, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
static inline int
mandel(float c_re, float c_im, int count) {
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {
if (z_re * z_re + z_im * z_im > 4.)
break;
float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
unmasked {
z_re = c_re + new_re;
z_im = c_im + new_im;
}
}
return i;
}
/* Task to compute the Mandelbrot iterations for a band of `span`
consecutive scanlines, starting at row taskIndex * span.
*/
task void
mandelbrot_scanline(uniform float x0, uniform float dx,
uniform float y0, uniform float dy,
uniform int width, uniform int height,
uniform int span,
uniform int maxIterations, uniform int output[]) {
uniform int ystart = taskIndex * span;
uniform int yend = min((taskIndex+1) * span, (unsigned int)height);
foreach (yi = ystart ... yend, xi = 0 ... width) {
float x = x0 + xi * dx;
float y = y0 + yi * dy;
int index = yi * width + xi;
output[index] = mandel(x, y, maxIterations);
}
}
export void
mandelbrot_ispc(uniform float x0, uniform float y0,
uniform float x1, uniform float y1,
uniform int width, uniform int height,
uniform int maxIterations, uniform int output[]) {
uniform float dx = (x1 - x0) / width;
uniform float dy = (y1 - y0) / height;
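uniform int span = 4;
// Round the task count up so that trailing rows are still covered when
// height is not a multiple of span; each task clamps yend to height.
launch[(height + span - 1) / span] mandelbrot_scanline(x0, dx, y0, dy, width, height, span,
maxIterations, output);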
}

@@ -1,180 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{E80DA7D4-AB22-4648-A068-327307156BE6}</ProjectGuid>
<Keyword>Win32Proj</Keyword>
<RootNamespace>mandelbrot_tasks</RootNamespace>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>mandelbrot_tasks</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<LinkIncremental>true</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>mandelbrot_tasks</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>mandelbrot_tasks</TargetName>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<LinkIncremental>false</LinkIncremental>
<ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
<TargetName>mandelbrot_tasks</TargetName>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<ClCompile>
<PrecompiledHeader>
</PrecompiledHeader>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<IntrinsicFunctions>true</IntrinsicFunctions>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<PrecompiledHeader>
</PrecompiledHeader>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
<FloatingPointModel>Fast</FloatingPointModel>
</ClCompile>
<Link>
<SubSystem>Console</SubSystem>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
</Link>
</ItemDefinitionGroup>
<ItemGroup>
<ClCompile Include="mandelbrot_tasks.cpp" />
<ClCompile Include="mandelbrot_tasks_serial.cpp" />
<ClCompile Include="../tasksys.cpp" />
</ItemGroup>
<ItemGroup>
<CustomBuild Include="mandelbrot_tasks.ispc">
<FileType>Document</FileType>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
</Command>
<Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
</Command>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
<Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
</CustomBuild>
</ItemGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>

View File

@@ -1,68 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
static int mandel(float c_re, float c_im, int count) {
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
void mandelbrot_serial(float x0, float y0, float x1, float y1,
int width, int height, int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; ++i) {
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}

View File

@@ -1,2 +0,0 @@
mandelbrot
*.ppm

View File

@@ -1,8 +0,0 @@
EXAMPLE=mandelbrot_tasks3d
CPP_SRC=mandelbrot_tasks3d.cpp mandelbrot_tasks_serial.cpp
ISPC_SRC=mandelbrot_tasks3d.ispc
ISPC_IA_TARGETS=avx
ISPC_ARM_TARGETS=neon
include ../common.mk

View File

@@ -1,59 +0,0 @@
PROG=mandel_cu
ISPC_SRC=mandelbrot_tasks3d.ispc
CXX_SRC=mandel_cu.cpp mandelbrot_tasks_serial.cpp
CXX=g++
CXXFLAGS=-O3 -I$(CUDATK)/include
LD=g++
LDFLAGS=-lcuda
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
LLVM = $(HOME)/usr/local/llvm/bin-3.3
PTXGEN = $(HOME)/ptxgen
PTXGEN += -opt=3
PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
LLVM32DIS=$(LLVM32)/bin/llvm-dis
.SUFFIXES: .bc .o .ptx .cu _ispc_nvptx64.bc
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(ISPC_BC) $(PROG)
CUDART:
cd _cuobj && make
g++ -o mandel_cu_nvcc mandel_cu.cpp -I$(CUDATK)/include -lcuda mandelbrot_tasks_serial.cpp -L./_cuobj -lmandel_cudart -lcudart -L$(CUDATK)/lib64 -D_CUDART_ -lcudadevrt
$(CXX_OBJ) : kernel.ptx
$(PROG): $(CXX_OBJ) kernel.ptx
/bin/cp kernel.ptx __kernels.ptx
$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc_nvptx64.bc: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
%.ptx: %.bc
$(LLVM32DIS) $<
$(PTXGEN) `basename $< .bc`.ll > $@
kernel.ptx: $(PTXSRC)
cat $^ > kernel.ptx
clean:
/bin/rm -rf *.ptx *.bc *.ll $(PROG)

View File

@@ -1,37 +0,0 @@
PROG=mandelbrot_mic
ISPC_SRC=mandelbrot_tasks3d.ispc
CXX_SRC=mandelbrot_tasks3d.cpp ../tasksys.cpp
CXX=icc
CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
LD=icc
LDFLAGS=-mmic -openmp
ISPC=ispc
ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
.SUFFIXES: .o .cpp
ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
all: $(PROG)
$(PROG): $(ISPC_OBJ) $(CXX_OBJ)
$(LD) -o $@ $^ $(LDFLAGS)
%.o: %.cpp
$(CXX) $(CXXFLAGS) -o $@ -c $<
%_ispc.o: %.ispc
$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $<
$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp -c
clean:
/bin/rm -rf *_ispc_zmm.cpp *.o $(PROG)

View File

@@ -1,15 +0,0 @@
FILE=mandel
LIB=lib$(FILE)_cudart.a
all: $(LIB)
$(LIB) : $(FILE).cu
nvcc -dc $(FILE).cu -arch=sm_35 -Xptxas=-v -dryrun 2>&1 | sed 's/\#\$$//g'|awk '{ if ($$1 == "cicc") print "cp ../__kernels.ptx " $$NF; else print $$0 }' > run.sh
sh run.sh
nvcc -dlink -o $(FILE)_dlink.o $(FILE).o -lcudadevrt -arch=sm_35
nvcc $(FILE).o $(FILE)_dlink.o --lib -o lib$(FILE)_cudart.a
clean:
/bin/rm -f *.o *.a run.sh

View File

@@ -1,22 +0,0 @@
extern "C" static inline int __device__ mandel___vyfvyfvyi_(float c_re, float c_im, int count) {}
extern "C" void __global__ mandelbrot_scanline___unfunfunfunfuniuniuniuniuniun_3C_uni_3E_( float x0, float dx,
float y0, float dy,
int width, int height,
int xspan, int yspan,
int maxIterations, int output[]) {}
extern "C" void __global__ mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_( float x0, float y0,
float x1, float y1,
int width, int height,
int maxIterations, int output[]) { }
extern "C"
void mandelbrot_ispc(float x0, float y0,
float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_<<<1,32>>>
(x0,y0,x1,y1,width,height,maxIterations,output);
cudaDeviceSynchronize();
}

View File

@@ -1,6 +0,0 @@
#!/bin/sh
ptxas -arch=sm_35 -c -o kernel.gpu.o kernel_cu.ptx
fatbinary -arch=sm_35 -create kernel.fatbin -elf kernel.gpu.o
nvcc -arch=sm_35 -Xptxas="-v" -dc kernel_driver.cu -lcudadevrt
nvcc -arch=sm_35 -Xptxas="-v" -dlink -o mandel_nvcc.o kernel.fatbin kernel_driver.o -rdc=true -lcudadevrt

View File

@@ -1,321 +0,0 @@
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <algorithm>
#include <string.h>
#include <cuda.h>
#include <vector>
#include <cassert>
#include "drvapi_error_string.h"
#define checkCudaErrors(err) __checkCudaErrors (err, __FILE__, __LINE__)
// These are the inline versions for all of the SDK helper functions
void __checkCudaErrors(CUresult err, const char *file, const int line) {
if(CUDA_SUCCESS != err) {
std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
<< getCudaDrvErrorString(err) << "\" from file <" << file
<< ">, line " << line << "\n";
exit(-1);
}
}
/**********************/
/* Basic CUDriver API */
CUcontext context;
void createContext(const int deviceId = 0)
{
CUdevice device;
int devCount;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGetCount(&devCount));
assert(devCount > 0);
checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
char name[128];
checkCudaErrors(cuDeviceGetName(name, 128, device));
std::cout << "Using CUDA Device [0]: " << name << "\n";
int devMajor, devMinor;
checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
std::cout << "Device Compute Capability: "
<< devMajor << "." << devMinor << "\n";
if (devMajor < 2) {
std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
exit(1);
}
// Create driver context
checkCudaErrors(cuCtxCreate(&context, 0, device));
}
void destroyContext()
{
checkCudaErrors(cuCtxDestroy(context));
}
CUmodule loadModule(const char * module)
{
CUmodule cudaModule;
checkCudaErrors(cuModuleLoadData(&cudaModule, module));
return cudaModule;
}
void unloadModule(CUmodule &cudaModule)
{
checkCudaErrors(cuModuleUnload(cudaModule));
}
CUfunction getFunction(CUmodule &cudaModule, const char * function)
{
CUfunction cudaFunction;
checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
return cudaFunction;
}
CUdeviceptr deviceMalloc(const size_t size)
{
CUdeviceptr d_buf;
checkCudaErrors(cuMemAlloc(&d_buf, size));
return d_buf;
}
void deviceFree(CUdeviceptr d_buf)
{
checkCudaErrors(cuMemFree(d_buf));
}
void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
{
checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
}
void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
{
checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
}
#define deviceLaunch(func,nbx,nby,nbz,params) \
checkCudaErrors( \
cuLaunchKernel( \
(func), \
(nbx), (nby), (nbz), \
32, 1, 1, \
0, NULL, (params), NULL \
));
typedef CUdeviceptr devicePtr;
/**************/
extern "C"
{
#if 0
struct ModuleManager
{
private:
typedef std::pair<std::string, CUmodule> ModulePair;
typedef std::map <std::string, CUmodule> ModuleMap;
ModuleMap module_list;
ModuleMap::iterator findModule(const char * module_name)
{
return module_list.find(std::string(module_name));
}
public:
CUmodule loadModule(const char * module_name, const char * module_data)
{
const ModuleMap::iterator it = findModule(module_name);
if (it == module_list.end())
{
CUmodule cudaModule = ::loadModule(module_data);
module_list.insert(std::make_pair(std::string(module_name), cudaModule));
return cudaModule;
}
return it->second;
}
void unloadModule(const char * module_name)
{
ModuleMap::iterator it = findModule(module_name);
if (it != module_list.end())
module_list.erase(it);
}
};
#endif
void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
{
return NULL;
}
void CUDALaunch(
void **handlePtr,
const char * module_name,
const char * module,
const char * func_name,
void **func_args,
int countx, int county, int countz)
{
CUmodule cudaModule = loadModule(module);
CUfunction cudaFunction = getFunction(cudaModule, func_name);
deviceLaunch(cudaFunction, countx, county, countz, func_args);
unloadModule(cudaModule);
}
void CUDASync(void *handle)
{
checkCudaErrors(cuStreamSynchronize(0));
}
void CUDAFree(void *handle)
{
}
}
/********************/
/* Write a PPM image file with the image of the Mandelbrot set */
static void
writePPM(int *buf, int width, int height, const char *fn)
{
FILE *fp = fopen(fn, "wb");
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");
for (int i = 0; i < width*height; ++i) {
// Map the iteration count to colors by just alternating between
// two greys.
char c = (buf[i] & 0x1) ? 240 : 20;
for (int j = 0; j < 3; ++j)
fputc(c, fp);
}
fclose(fp);
printf("Wrote image file %s\n", fn);
}
std::vector<char> readBinary(const char * filename)
{
std::vector<char> buffer;
FILE *fp = fopen(filename, "rb");
if (!fp )
{
fprintf(stderr, "file %s not found\n", filename);
assert(0);
}
#if 0
int c; /* int, so that EOF can be told apart from a valid byte */
while ((c = fgetc(fp)) != EOF)
buffer.push_back(c);
#else
fseek(fp, 0, SEEK_END);
const unsigned long long size = ftell(fp); /*calc the size needed*/
fseek(fp, 0, SEEK_SET);
buffer.resize(size);
if (size == 0){ /* error detection: the file is empty */
fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);
exit(1);
}
else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
exit(1);
}
#endif
fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
return buffer;
}
static void usage()
{
fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
exit(1);
}
extern "C"
void mandelbrot_ispc(
float x0, float y0,
float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int xspan = 16; /* make sure it is big enough to avoid false-sharing */
int yspan = 4;
const int nbx = width/xspan;
const int nby = height/yspan;
const int nbz = 1;
fprintf(stderr ," nbx= %d nby= %d nbtot= %d \n", nbx, nby, nbx*nby);
#if 0
launch [nbx,nby]
mandelbrot_scanline(x0, dx, y0, dy, width, height, xspan, yspan,
maxIterations, output);
#endif
// const std::vector<char> cubin = readBinary("cuLaunch.cubin");
const std::vector<char> cubin = readBinary("cuLaunch.ptx");
void *params[] = {&x0, &dx, &y0, &dy, &width, &height, &xspan, &yspan, &maxIterations, &output};
CUDALaunch(
NULL, //void **handlePtr,
"module_01", // const char * module_name,
&cubin[0], //const char * module,
"mandelbrot_scanline", //const char * func_name,
params, //void **func_args,
nbx,nby,nbz); //int countx, int county, int countz)
CUDASync(NULL);
}
int main(int argc, char *argv[])
{
unsigned int width = 1536;
unsigned int height = 1024;
float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;
if (argc == 1)
;
else if (argc == 2) {
if (strncmp(argv[1], "--scale=", 8) == 0) {
float scale = atof(argv[1] + 8);
if (scale == 0.f)
usage();
width *= scale;
height *= scale;
// round up to multiples of 16
width = (width + 0xf) & ~0xf;
height = (height + 0xf) & ~0xf;
}
else
usage();
}
else
usage();
/*******************/
createContext();
/*******************/
int maxIterations = 512;
int *h_buf = new int[width*height];
for (unsigned int i = 0; i < width*height; i++)
h_buf[i] = 0;
const size_t bufsize = sizeof(int)*width*height;
devicePtr d_buf = deviceMalloc(bufsize);
memcpyH2D(d_buf, h_buf, bufsize);
mandelbrot_ispc(x0,y0,x1,y1,width, height, maxIterations, (int*)d_buf);
memcpyD2H(h_buf, d_buf, bufsize);
deviceFree(d_buf);
writePPM(h_buf, width, height, "mandelbrot-cuda.ppm");
/*******************/
destroyContext();
/*******************/
return 0;
}

View File

@@ -1,73 +0,0 @@
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
extern "C" __device__ void PTXmandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
float,float,float,float,uint32_t,uint32_t,uint32_t,uint32_t,uint32_t,uint64_t);
extern "C"
__global__ void mandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
float param0,
float param1,
float param2,
float param3,
uint32_t param4,
uint32_t param5,
uint32_t param6,
uint32_t param7,
uint32_t param8,
uint64_t param9)
{
PTXmandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
param0, param1, param2, param3, param4, param5, param6, param7, param8, param9);
}
extern "C" __device__ void PTXmandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
float param0,
float param1,
float param2,
float param3,
uint32_t param4,
uint32_t param5,
uint32_t param6,
uint64_t param7,
char param8);
extern "C"
__global__ void mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
float param0,
float param1,
float param2,
float param3,
uint32_t param4,
uint32_t param5,
uint32_t param6,
uint64_t param7,
char param8)
{
PTXmandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
param0,param1,param2,param3,param4,param5,param6,param7,param8);
}
extern "C" __device__ void PTXmandelbrot_ispc(
float param0,
float param1,
float param2,
float param3,
uint32_t param4,
uint32_t param5,
uint32_t param6,
uint64_t param7);
extern "C"
__global__ void mandelbrot_ispc(
float param0,
float param1,
float param2,
float param3,
uint32_t param4,
uint32_t param5,
uint32_t param6,
uint64_t param7)
{
PTXmandelbrot_ispc(
param0,param1,param2,param3,param4,param5,param6,param7);
}

View File

@@ -1,352 +0,0 @@
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <algorithm>
#include <string.h>
#include <cuda.h>
#include <vector>
#include <cassert>
#include "drvapi_error_string.h"
#define checkCudaErrors(err) __checkCudaErrors (err, __FILE__, __LINE__)
// These are the inline versions for all of the SDK helper functions
void __checkCudaErrors(CUresult err, const char *file, const int line) {
if(CUDA_SUCCESS != err) {
std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
<< getCudaDrvErrorString(err) << "\" from file <" << file
<< ">, line " << line << "\n";
exit(-1);
}
}
/**********************/
/* Basic CUDriver API */
CUcontext context;
void createContext(const int deviceId = 0)
{
CUdevice device;
int devCount;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGetCount(&devCount));
assert(devCount > 0);
checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
char name[128];
checkCudaErrors(cuDeviceGetName(name, 128, device));
std::cout << "Using CUDA Device [0]: " << name << "\n";
int devMajor, devMinor;
checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
std::cout << "Device Compute Capability: "
<< devMajor << "." << devMinor << "\n";
if (devMajor < 2) {
std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
exit(1);
}
// Create driver context
checkCudaErrors(cuCtxCreate(&context, 0, device));
}
void destroyContext()
{
checkCudaErrors(cuCtxDestroy(context));
}
CUmodule loadModule(const char * module)
{
CUmodule cudaModule;
checkCudaErrors(cuModuleLoadData(&cudaModule, module));
return cudaModule;
}
void unloadModule(CUmodule &cudaModule)
{
checkCudaErrors(cuModuleUnload(cudaModule));
}
CUfunction getFunction(CUmodule &cudaModule, const char * function)
{
CUfunction cudaFunction;
checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
return cudaFunction;
}
CUdeviceptr deviceMalloc(const size_t size)
{
CUdeviceptr d_buf;
checkCudaErrors(cuMemAlloc(&d_buf, size));
return d_buf;
}
void deviceFree(CUdeviceptr d_buf)
{
checkCudaErrors(cuMemFree(d_buf));
}
void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
{
checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
}
void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
{
checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
}
#define deviceLaunch(func,nbx,nby,nbz,params) \
checkCudaErrors( \
cuLaunchKernel( \
(func), \
(nbx), (nby), (nbz), \
32, 1, 1, \
0, NULL, (params), NULL \
));
typedef CUdeviceptr devicePtr;
/**************/
extern "C"
{
#if 0
struct ModuleManager
{
private:
typedef std::pair<std::string, CUmodule> ModulePair;
typedef std::map <std::string, CUmodule> ModuleMap;
ModuleMap module_list;
ModuleMap::iterator findModule(const char * module_name)
{
return module_list.find(std::string(module_name));
}
public:
CUmodule loadModule(const char * module_name, const char * module_data)
{
const ModuleMap::iterator it = findModule(module_name);
if (it == module_list.end())
{
CUmodule cudaModule = ::loadModule(module_data);
module_list.insert(std::make_pair(std::string(module_name), cudaModule));
return cudaModule;
}
return it->second;
}
void unloadModule(const char * module_name)
{
ModuleMap::iterator it = findModule(module_name);
if (it != module_list.end())
module_list.erase(it);
}
};
#endif
void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
{
#if 0
fprintf(stderr, " ptr= %p\n", *handlePtr);
fprintf(stderr, " size= %d\n", (int)size);
fprintf(stderr, " alignment= %d\n", (int)alignment);
fprintf(stderr, " ------- \n\n");
#endif
return NULL;
}
void CUDALaunch(
void **handlePtr,
const char * module_name,
const char * module,
const char * func_name,
void **func_args,
int countx, int county, int countz)
{
assert(module_name != NULL);
assert(module != NULL);
assert(func_name != NULL);
assert(func_args != NULL);
#if 1
CUmodule cudaModule = loadModule(module);
CUfunction cudaFunction = getFunction(cudaModule, func_name);
deviceLaunch(cudaFunction, countx, county, countz, func_args);
unloadModule(cudaModule);
#else
fprintf(stderr, " handle= %p\n", *handlePtr);
fprintf(stderr, " count= %d %d %d\n", countx, county, countz);
fprintf(stderr, " module_name= %s \n", module_name);
fprintf(stderr, " func_name= %s \n", func_name);
// fprintf(stderr, " ptx= %s \n", module);
fprintf(stderr, " x0= %g \n", *((float*)(func_args[0])));
fprintf(stderr, " dx= %g \n", *((float*)(func_args[1])));
fprintf(stderr, " y0= %g \n", *((float*)(func_args[2])));
fprintf(stderr, " dy= %g \n", *((float*)(func_args[3])));
fprintf(stderr, " w= %d \n", *((int*)(func_args[4])));
fprintf(stderr, " h= %d \n", *((int*)(func_args[5])));
fprintf(stderr, " xs= %d \n", *((int*)(func_args[6])));
fprintf(stderr, " ys= %d \n", *((int*)(func_args[7])));
fprintf(stderr, " maxit= %d \n", *((int*)(func_args[8])));
fprintf(stderr, " ptr= %p \n", *((int**)(func_args[9])));
fprintf(stderr, " ------- \n\n");
#endif
}
void CUDASync(void *handle)
{
checkCudaErrors(cuStreamSynchronize(0));
}
void ISPCSync(void *handle)
{
}
void CUDAFree(void *handle)
{
}
}
/********************/
/* Write a PPM image file with the image of the Mandelbrot set */
static void
writePPM(int *buf, int width, int height, const char *fn)
{
FILE *fp = fopen(fn, "wb");
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");
for (int i = 0; i < width*height; ++i) {
// Map the iteration count to colors by just alternating between
// two greys.
char c = (buf[i] & 0x1) ? 240 : 20;
for (int j = 0; j < 3; ++j)
fputc(c, fp);
}
fclose(fp);
printf("Wrote image file %s\n", fn);
}
std::vector<char> readBinary(const char * filename)
{
std::vector<char> buffer;
FILE *fp = fopen(filename, "rb");
if (!fp )
{
fprintf(stderr, "file %s not found\n", filename);
assert(0);
}
#if 0
int c; /* int, so that EOF can be told apart from a valid byte */
while ((c = fgetc(fp)) != EOF)
buffer.push_back(c);
#else
fseek(fp, 0, SEEK_END);
const unsigned long long size = ftell(fp); /*calc the size needed*/
fseek(fp, 0, SEEK_SET);
buffer.resize(size);
if (size == 0){ /* error detection: the file is empty */
fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);
exit(1);
}
else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
exit(1);
}
#endif
fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
return buffer;
}
static void usage()
{
fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
exit(1);
}
extern "C"
void mandelbrot_ispc(
float x0, float y0,
float x1, float y1,
int width, int height,
int maxIterations, int output[])
#if 1
;
#else
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int xspan = 32; /* make sure it is big enough to avoid false-sharing */
int yspan = 4;
const int nbx = width/xspan;
const int nby = height/yspan;
const int nbz = 1;
fprintf(stderr ," nbx= %d nby= %d nbtot= %d \n", nbx, nby, nbx*nby);
// const std::vector<char> cubin = readBinary("cuLaunch.cubin");
const std::vector<char> cubin = readBinary("cuLaunch.ptx");
void *params[] = {&x0, &dx, &y0, &dy, &width, &height, &xspan, &yspan, &maxIterations, &output};
CUDALaunch(
NULL, //void **handlePtr,
"module_01", // const char * module_name,
&cubin[0], //const char * module,
"mandelbrot_scanline", //const char * func_name,
params, //void **func_args,
nbx,nby,nbz); //int countx, int county, int countz)
CUDASync(NULL);
}
#endif
int main(int argc, char *argv[])
{
unsigned int width = 1536;
unsigned int height = 1024;
float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;
if (argc == 1)
;
else if (argc == 2) {
if (strncmp(argv[1], "--scale=", 8) == 0) {
float scale = atof(argv[1] + 8);
if (scale == 0.f)
usage();
width *= scale;
height *= scale;
// round up to multiples of 16
width = (width + 0xf) & ~0xf;
height = (height + 0xf) & ~0xf;
}
else
usage();
}
else
usage();
/*******************/
createContext();
/*******************/
int maxIterations = 512;
int *h_buf = new int[width*height];
for (unsigned int i = 0; i < width*height; i++)
h_buf[i] = 0;
const size_t bufsize = sizeof(int)*width*height;
devicePtr d_buf = deviceMalloc(bufsize);
memcpyH2D(d_buf, h_buf, bufsize);
mandelbrot_ispc(x0,y0,x1,y1,width, height, maxIterations, (int*)d_buf);
memcpyD2H(h_buf, d_buf, bufsize);
deviceFree(d_buf);
writePPM(h_buf, width, height, "mandelbrot-cuda.ppm");
/*******************/
destroyContext();
/*******************/
return 0;
}

View File

@@ -1,177 +0,0 @@
/*
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#define NOMINMAX
#pragma warning (disable: 4244)
#pragma warning (disable: 4305)
#endif
#include <stdio.h>
#include <algorithm>
#include <string.h>
#include "../timing.h"
#include "../cuda_ispc.h"
#ifdef _CUDART_
extern "C"
void mandelbrot_ispc(float x0, float y0,
float x1, float y1,
int width, int height,
int maxIterations, int output[]);
#endif
extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
int width, int height, int maxIterations,
int output[]);
/* Write a PPM image file with the image of the Mandelbrot set */
static void
writePPM(int *buf, int width, int height, const char *fn) {
FILE *fp = fopen(fn, "wb");
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");
for (int i = 0; i < width*height; ++i) {
// Map the iteration count to colors by just alternating between
// two greys.
char c = (buf[i] & 0x1) ? 240 : 20;
for (int j = 0; j < 3; ++j)
fputc(c, fp);
}
fclose(fp);
printf("Wrote image file %s\n", fn);
}
static void usage() {
fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
exit(1);
}
int main(int argc, char *argv[]) {
unsigned int width = 1536;
unsigned int height = 1024;
float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;
if (argc == 1)
;
else if (argc == 2) {
if (strncmp(argv[1], "--scale=", 8) == 0) {
float scale = atof(argv[1] + 8);
if (scale == 0.f)
usage();
width *= scale;
height *= scale;
// round up to multiples of 16
width = (width + 0xf) & ~0xf;
height = (height + 0xf) & ~0xf;
}
else
usage();
}
else
usage();
/*******************/
createContext();
/*******************/
int maxIterations = 512;
int *buf = new int[width*height];
for (unsigned int i = 0; i < width*height; i++)
buf[i] = 0;
const size_t bufsize = sizeof(int)*width*height;
devicePtr d_buf = deviceMalloc(bufsize);
memcpyH2D(d_buf, buf, bufsize);
//
// Compute the image using the ispc implementation; report the minimum
// time of three runs.
//
double minISPC = 1e30;
#if 1
for (int i = 0; i < 3; ++i) {
// Clear out the buffer
for (unsigned int i = 0; i < width * height; ++i)
buf[i] = 0;
reset_and_start_timer();
#ifdef _CUDART_
const double t0 = rtc();
mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, (int*)d_buf);
double dt = 1e3*(rtc() - t0); //get_elapsed_mcycles();
#else
const char * func_name = "mandelbrot_ispc__export";
void *func_args[] = {&x0, &y0, &x1, &y1, &width, &height, &maxIterations, &d_buf};
const double dt = 1e3*CUDALaunch(NULL, func_name, func_args);
#endif
minISPC = std::min(minISPC, dt);
}
#endif
memcpyD2H(buf, d_buf, bufsize);
deviceFree(d_buf);
printf("[mandelbrot ispc+tasks]:\t[%.3f] million cycles\n", minISPC);
writePPM(buf, width, height, "mandelbrot-cuda.ppm");
//
// And run the serial implementation 3 times, again reporting the
// minimum time.
//
double minSerial = 1e30;
for (int i = 0; i < 3; ++i) {
// Clear out the buffer
for (unsigned int i = 0; i < width * height; ++i)
buf[i] = 0;
reset_and_start_timer();
const double t0 = rtc();
mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
double dt = 1e3*(rtc() - t0); //get_elapsed_mcycles(); same scaling as the ispc timing above
minSerial = std::min(minSerial, dt);
}
printf("[mandelbrot serial]:\t\t[%.3f] million cycles\n", minSerial);
writePPM(buf, width, height, "mandelbrot-serial.ppm");
printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", minSerial/minISPC);
return 0;
}

View File

@@ -1,53 +0,0 @@
#include <stdio.h>
#define blockIndex0 (blockIdx.x*4 + (threadIdx.x >> 5))
#define blockIndex1 (blockIdx.y)
#define vectorWidth (32)
#define vectorIndex (threadIdx.x & 31)
int __device__ __forceinline__
mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {
if (z_re * z_re + z_im * z_im > 4.0f)
break;
float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.0f * z_re * z_im;
{
z_re = c_re + new_re;
z_im = c_im + new_im;
}
}
return i;
}
extern "C"
__global__ void mandelbrot_scanline(
float x0, float dx,
float y0, float dy,
int width, int height,
int xspan, int yspan,
int maxIterations, int output[])
{
const int xstart = blockIndex0 * xspan;
const int xend = min(xstart + xspan, width);
const int ystart = blockIndex1 * yspan;
const int yend = min(ystart + yspan, height);
for (int yi = ystart; yi < yend; yi++)
for (int xi = xstart; xi < xend; xi += vectorWidth)
{
const float x = x0 + (xi + vectorIndex) * dx;
const float y = y0 + yi * dy;
const int res = mandel(x,y,maxIterations);
const int index = yi * width + (xi + vectorIndex);
if (xi + vectorIndex < xend)
output[index] = res;
}
}

Some files were not shown because too many files have changed in this diff