merge with sm35

2014-01-06 13:53:02 +01:00
parent 546f9cb409 ef9e212eec
commit bf8a16b0e1
189 changed files with 0 additions and 131201 deletions
--- a/examples_cuda/README.txt
+++ b/examples_cuda/README.txt
@@ -1,167 +0,0 @@
-====================
-ISPC Examples README
-====================
-
-This directory has a number of sample ispc programs.  Before building them
-(on any system), install the appropriate ispc compiler binary into a
-directory in your path.  Then, if you're running Windows, open the
-"examples.sln" file and build from there.  For building under Linux/OSX,
-there are makefiles in each directory that build the examples individually.
-
-Almost all of them benchmark ispc implementations of the given computation
-against regular serial C++ implementations, printing out a comparison of
-the runtimes and the speedup delivered by ispc.  It may be instructive to
-do a side-by-side diff of the C++ and ispc implementations of these
-algorithms to learn more about writing ispc code.
-
- 
-AOBench
-=======
-
-This is an ISPC implementation of the "AO bench" benchmark
-(http://syoyo.wordpress.com/2009/01/26/ao-bench-is-evolving/).  The command
-line arguments are:
-
-ao (num iterations) (xres) (yres)
-
-It executes the program for the given number of iterations, rendering an
-(xres x yres) image each time and measuring the computation time with both
-serial and ispc implementations.
-
-
-AOBench_Instrumented
-====================
-
-This version of AO Bench is compiled with the --instrument ispc compiler
-flag.  This causes the compiler to emit calls to a (user-supplied)
-ISPCInstrument() function at interesting places in the compiled code.  An
-example implementation of this function that counts the number of times the
-callback is made and records some statistics about control flow coherence
-is provided in the instrument.cpp file.
-
-
-Deferred
-========
-
-This is an extensive example of using ispc for efficient
-deferred shading of scenes with thousands of lights; it's an implementation
-of the algorithm that Johan Andersson described at SIGGRAPH 2009,
-implemented by Andrew Lauritzen and Jefferson Montgomery.  The basic idea
-is that a pre-rendered G-buffer is partitioned into tiles, and in each
-tile, the set of lights that contribute to the tile is first computed.
-Then, the pixels in the tile are shaded using just those light
-sources. (See slides 19-29 of
-http://s09.idav.ucdavis.edu/talks/04-JAndersson-ParallelFrostbite-Siggraph09.pdf
-for more details on the algorithm.)
-
-This directory includes three implementations of the algorithm:
-
- An ispc implementation that first does a static partitioning of the
-  screen into tiles to parallelize across the CPU cores.  Within each tile
-  ispc kernels provide highly efficient implementations of the light
-  culling and shading calculations.
- A "best practices" serial C++ implementation.  This implementation does a
-  dynamic partitioning of the screen, refining tiles with significant Z
-  depth complexity (these tiles often have a large number of lights that
-  affect them).  Within each final tile, the pixels are shaded using
-  regular C++ code.
- If the Cilk extensions are available in your compiler, an ispc
-  implementation that uses Cilk will also be built.
-  (See http://software.intel.com/en-us/articles/intel-cilk-plus/).  Like 
-  the "best practices" serial implementation, this version does dynamic
-  tile partitioning for better load balancing and then uses ispc for the
-  light culling and shading.
-
-
-GMRES
-=====
-
-An implementation of the generalized minimal residual method for solving
-sparse matrix equations.
-(http://en.wikipedia.org/wiki/Generalized_minimal_residual_method)
-
-
-Mandelbrot
-==========
-
-Mandelbrot set generation.  This example is extensively documented at the
-http://ispc.github.com/example.html page.
-
-
-Mandelbrot_tasks
-================
-
-Implementation of Mandelbrot set generation that also parallelizes across
-cores using tasks.  Under Windows, a simple task system built on
-Microsoft's Concurrency Runtime is used (see tasks_concrt.cpp).  On OSX, a
-task system based on Grand Central Dispatch is used (tasks_gcd.cpp), and on
-Linux, a pthreads-based task system is used (tasks_pthreads.cpp).  When
-using tasks with ispc, no task system is mandated; the user is free to plug
-in any task system they want, for ease of interoperating with existing task
-systems.
-
-
-Noise
-=====
-
-This example has an implementation of Ken Perlin's procedural "noise"
-function, as described in his 2002 "Improving Noise" SIGGRAPH paper.
-
- 
-Options
-=======
-
-This program implements both the Black-Scholes and Binomial options pricing
-models in both ispc and regular serial C++ code.
-
-
-Perfbench
-=========
-
-This runs a number of microbenchmarks to measure system performance and
-code generation quality.
-
-
-RT
-==
-
-This is a simple ray tracer; it reads in camera parameters and a bounding
-volume hierarchy and renders the scene from the given viewpoint.  The
-command line arguments are:
-
-rt <scene base name>
-
-Where <scene base name> is one of "cornell", "teapot", or "sponza".
-
-The implementation originally derives from the bounding volume hierarchy
-and triangle intersection code from pbrt; see the pbrt source code and/or
-"Physically Based Rendering" book for more about the basic algorithmic
-details.
-
-
-Simple
-======
-
-This is a simple "hello world" type program that shows a ~10 line
-application program calling out to a ~5 line ispc program to do a simple
-computation.
-
-Sort
-====
-This is a bucket sort of 32 bit unsigned integers.
-By default 1000000 random elements get sorted.
-Call ./sort N in order to sort N elements instead.
-
-Volume
-======
-
-Ray-marching volume rendering, with single scattering lighting model.  To
-run it, specify a camera parameter file and a volume density file, e.g.:
-
-volume camera.dat density_highres.vol
-
-(See, e.g. Chapters 11 and 16 of "Physically Based Rendering" for
-information about the algorithm implemented here.)  The volume data set
-included here was generated by the example implementation of the "Wavelet
-Turbulence for Fluid Simulation" SIGGRAPH 2008 paper by Kim et
-al. (http://www.cs.cornell.edu/~tedkim/WTURB/)
--- a/examples_cuda/aobench/.gitignore
+++ b/examples_cuda/aobench/.gitignore
@@ -1,2 +0,0 @@
-ao
-*.ppm
--- a/examples_cuda/aobench/Makefile
+++ b/examples_cuda/aobench/Makefile
@@ -1,8 +0,0 @@
-
-EXAMPLE=ao
-CPP_SRC=ao.cpp ao_serial.cpp
-ISPC_SRC=ao1.ispc
-ISPC_IA_TARGETS=avx
-ISPC_ARM_TARGETS=neon
-
-include ../common.mk
--- a/examples_cuda/aobench/Makefile_gpu
+++ b/examples_cuda/aobench/Makefile_gpu
@@ -1,56 +0,0 @@
-PROG=ao_cu
-ISPC_SRC=ao1.ispc
-CXX_SRC=ao_cu.cpp 
-
-CXX=g++
-CXXFLAGS=-O3 -I$(CUDATK)/include
-LD=g++
-LDFLAGS=-lcuda
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
-
-LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
-LLVM   = $(HOME)/usr/local/llvm/bin-3.3
-PTXGEN = $(HOME)/ptxgen
-PTXGEN += -opt=3
-PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
-
-LLVM32DIS=$(LLVM32)/bin/llvm-dis
-
-##.SUFFIXES: .bc .o .ptx .cu
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
-PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(ISPC_BC) $(PROG)
-
-
-$(CXX_OBJ) : kernel.ptx
-$(PROG): $(CXX_OBJ) kernel.ptx
-	/bin/cp kernel.ptx __kernels.ptx
-	$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-
-%_ispc_nvptx64.bc: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
-
-%.ptx: %.bc
-	$(PTXGEN)  $< > $@
-#	$(LLVM32DIS) $<
-#	$(PTXGEN)  `basename $< .bc`.ll > $@
-
-kernel.ptx: $(PTXSRC)
-	cat $^ > kernel.ptx
-
-clean: 
-	/bin/rm -rf *.ptx *.bc *.ll $(PROG)
-
-	 
-
--- a/examples_cuda/aobench/Makefile_knc
+++ b/examples_cuda/aobench/Makefile_knc
@@ -1,37 +0,0 @@
-PROG=ao_mic
-ISPC_SRC=ao.ispc
-CXX_SRC=ao.cpp  ../tasksys.cpp
-
-CXX=icc
-CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
-LD=icc
-LDFLAGS=-mmic -openmp
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
-
-.SUFFIXES: .o .cpp
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(PROG)
-
-
-
-$(PROG): $(ISPC_OBJ) $(CXX_OBJ) 
-	$(LD) -o $@ $^ $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-%_ispc.o: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $< 
-	$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp  -c
-
-clean: 
-	/bin/rm -rf *_ispc_zmm.cpp *.o  $(PROG)
-
-	 
-
--- a/examples_cuda/aobench/ao.cpp
+++ b/examples_cuda/aobench/ao.cpp
@@ -1,204 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <assert.h>
-#ifdef __linux__
-#include <malloc.h>
-#endif
-#include <math.h>
-#include <map>
-#include <string>
-#include <algorithm>
-#include <sys/types.h>
-
-#include "ao_ispc.h"
-using namespace ispc;
-
-#include "../timing.h"
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-
-#define NSUBSAMPLES        2
-
-extern void ao_serial(int w, int h, int nsubsamples, float image[]);
-
-static unsigned int test_iterations;
-static unsigned int width, height;
-static unsigned char *img;
-static float *fimg;
-
-
-static unsigned char
-clamp(float f)
-{
-    int i = (int)(f * 255.5);
-
-    if (i < 0) i = 0;
-    if (i > 255) i = 255;
-
-    return (unsigned char)i;
-}
-
-
-static void
-savePPM(const char *fname, int w, int h)
-{
-    for (int y = 0; y < h; y++) {
-        for (int x = 0; x < w; x++)  {
-            img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
-            img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
-            img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
-        }
-    }
-
-    FILE *fp = fopen(fname, "wb");
-    if (!fp) {
-        perror(fname);
-        exit(1);
-    }
-
-    fprintf(fp, "P6\n");
-    fprintf(fp, "%d %d\n", w, h);
-    fprintf(fp, "255\n");
-    fwrite(img, w * h * 3, 1, fp);
-    fclose(fp);
-    printf("Wrote image file %s\n", fname);
-}
-
-
-int main(int argc, char **argv)
-{
-    if (argc != 4) {
-        printf ("%s\n", argv[0]);
-        printf ("Usage: ao [num test iterations] [width] [height]\n");
-        getchar();
-        exit(-1);
-    }
-    else {
-        test_iterations = atoi(argv[1]);
-        width = atoi (argv[2]);
-        height = atoi (argv[3]);
-    }
-
-    // Allocate space for output images
-    img = new unsigned char[width * height * 3];
-    fimg = new float[width * height * 3];
-
-    //
-    // Run the ispc path, test_iterations times, and report the minimum
-    // time for any of them.
-    //
-    double minTimeISPC = 1e30;
-#if 0
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        assert(NSUBSAMPLES == 2);
-
-        reset_and_start_timer();
-        ao_ispc(width, height, NSUBSAMPLES, fimg);
-        double t = get_elapsed_mcycles();
-        minTimeISPC = std::min(minTimeISPC, t);
-    }
-
-    // Report results and save image
-    printf("[aobench ispc]:\t\t\t[%.3f] million cycles (%d x %d image)\n", 
-           minTimeISPC, width, height);
-    savePPM("ao-ispc.ppm", width, height); 
-#endif
-
-    //
-    // Run the ispc + tasks path, test_iterations times, and report the
-    // minimum time for any of them.
-    //
-    double minTimeISPCTasks = 1e30;
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        assert(NSUBSAMPLES == 2);
-
-        reset_and_start_timer();
-        const double t0 = rtc();
-        ao_ispc_tasks(width, height, NSUBSAMPLES, fimg);
-        double t = 1e3*(rtc() - t0); //get_elapsed_mcycles();
-        minTimeISPCTasks = std::min(minTimeISPCTasks, t);
-    }
-
-    // Report results and save image
-    printf("[aobench ispc + tasks]:\t\t[%.3f] ms (%d x %d image)\n", 
-           minTimeISPCTasks, width, height);
-    savePPM("ao-ispc-tasks.ppm", width, height); 
-    return 0;
-
-    //
-    // Run the serial path, again test_iteration times, and report the
-    // minimum time.
-    //
-    double minTimeSerial = 1e30;
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        reset_and_start_timer();
-        ao_serial(width, height, NSUBSAMPLES, fimg);
-        double t = get_elapsed_mcycles();
-        minTimeSerial = std::min(minTimeSerial, t);
-    }
-
-    // Report more results, save another image...
-    printf("[aobench serial]:\t\t[%.3f] million cycles (%d x %d image)\n", minTimeSerial, 
-           width, height);
-    printf("\t\t\t\t(%.2fx speedup from ISPC, %.2fx speedup from ISPC + tasks)\n", 
-           minTimeSerial / minTimeISPC, minTimeSerial / minTimeISPCTasks);
-    savePPM("ao-serial.ppm", width, height); 
-        
-    return 0;
-}
--- a/examples_cuda/aobench/ao.cu
+++ b/examples_cuda/aobench/ao.cu
@@ -1,424 +0,0 @@
-// -*- mode: c++ -*-
-/*
-   Copyright (c) 2010-2011, Intel Corporation
-   All rights reserved.
-
-   Redistribution and use in source and binary forms, with or without
-   modification, are permitted provided that the following conditions are
-met:
-
- * Redistributions of source code must retain the above copyright
- notice, this list of conditions and the following disclaimer.
-
- * Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions and the following disclaimer in the
- documentation and/or other materials provided with the distribution.
-
- * Neither the name of Intel Corporation nor the names of its
- contributors may be used to endorse or promote products derived from
- this software without specific prior written permission.
-
-
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
- IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
- TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
- PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
- OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
- */
-/*
-   Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
-   */
-
-#define NAO_SAMPLES		8
-//#define M_PI 3.1415926535f
-
-#define vec Float3
-struct Float3
-{
-  float x,y,z;
-
-  __device__ friend Float3 operator+(const Float3 a, const Float3 b)
-  {
-    Float3 c;
-    c.x = a.x+b.x;
-    c.y = a.y+b.y;
-    c.z = a.z+b.z;
-    return c;
-  }
-  __device__ friend Float3 operator-(const Float3 a, const Float3 b)
-  {
-    Float3 c;
-    c.x = a.x-b.x;
-    c.y = a.y-b.y;
-    c.z = a.z-b.z;
-    return c;
-  }
-  __device__ friend Float3 operator/(const Float3 a, const Float3 b)
-  {
-    Float3 c;
-    c.x = a.x/b.x;
-    c.y = a.y/b.y;
-    c.z = a.z/b.z;
-    return c;
-  }
-  __device__ friend Float3 operator/(const float a, const Float3 b)
-  {
-    Float3 c;
-    c.x = a/b.x;
-    c.y = a/b.y;
-    c.z = a/b.z;
-    return c;
-  }
-  __device__ friend Float3 operator*(const Float3 a, const Float3 b)
-  {
-    Float3 c;
-    c.x = a.x*b.x;
-    c.y = a.y*b.y;
-    c.z = a.z*b.z;
-    return c;
-  }
-  __device__ friend Float3 operator*(const Float3 a, const float b)
-  {
-    Float3 c;
-    c.x = a.x*b;
-    c.y = a.y*b;
-    c.z = a.z*b;
-    return c;
-  }
-};
-
-///////////////////////////////////////////////////////////////////////////
-// RNG stuff
-
-struct RNGState {
-    unsigned int z1, z2, z3, z4;
-};
-
-__device__
-static inline unsigned int random(RNGState * state)
-{
-    unsigned int b;
-
-    b  = ((state->z1 << 6) ^ state->z1) >> 13;
-    state->z1 = ((state->z1 & 4294967294U) << 18) ^ b;
-    b  = ((state->z2 << 2) ^ state->z2) >> 27; 
-    state->z2 = ((state->z2 & 4294967288U) << 2) ^ b;
-    b  = ((state->z3 << 13) ^ state->z3) >> 21;
-    state->z3 = ((state->z3 & 4294967280U) << 7) ^ b;
-    b  = ((state->z4 << 3) ^ state->z4) >> 12;
-    state->z4 = ((state->z4 & 4294967168U) << 13) ^ b;
-    return (state->z1 ^ state->z2 ^ state->z3 ^ state->z4);
-}
-
-
-__device__
-static inline float frandom(RNGState * state)
-{
-    unsigned int irand = random(state);
-    irand &= (1ul<<23)-1;
-    return __int_as_float(0x3F800000 | irand)-1.0f;
-}
-
-__device__
-static inline void seed_rng(RNGState * state, 
-                            unsigned int seed) {
-    state->z1 = seed;
-    state->z2 = seed ^ 0xbeeff00d;
-    state->z3 = ((seed & 0xfffful) << 16) | (seed >> 16);
-    state->z4 = (((seed & 0xfful) << 24) | ((seed & 0xff00ul)  << 8) |
-                 ((seed & 0xff0000ul) >> 8) | (seed & 0xff000000ul) >> 24);
-}
-
-
-#define programCount 32
-#define programIndex (threadIdx.x & 31)
-#define taskIndex0 (blockIdx.x*4 + (threadIdx.x >> 5))
-#define taskCount0 (gridDim.x*4)
-#define taskIndex1 (blockIdx.y)
-#define taskCount1 (gridDim.y)
-#define warpIdx (threadIdx.x >> 5)
-
-struct Isect {
-  float      t;
-  vec        p;
-  vec        n;
-  int        hit; 
-};
-
-struct Sphere {
-  vec        center;
-  float      radius;
-};
-
-struct Plane {
-  vec    p;
-  vec    n;
-};
-
-struct Ray {
-  vec org;
-  vec dir;
-};
-
-__device__
-static inline float dot(vec a, vec b) {
-  return a.x * b.x + a.y * b.y + a.z * b.z;
-}
-
-__device__
-static inline vec vcross(vec v0, vec v1) {
-  vec ret;
-  ret.x = v0.y * v1.z - v0.z * v1.y;
-  ret.y = v0.z * v1.x - v0.x * v1.z;
-  ret.z = v0.x * v1.y - v0.y * v1.x;
-  return ret;
-}
-
-__device__
-static inline void vnormalize(vec &v) {
-  float len2 = dot(v, v);
-  float invlen = rsqrt(len2);
-  v = v*invlen;
-}
-
-
-__device__
-static inline void
-ray_plane_intersect(Isect &isect,const  Ray &ray, const  Plane &plane) {
-  float d = -dot(plane.p, plane.n);
-  float v = dot(ray.dir, plane.n);
-
-  if (abs(v) < 1.0e-17) 
-    return;
-  else {
-    float t = -(dot(ray.org, plane.n) + d) / v;
-
-    if ((t > 0.0) && (t < isect.t)) {
-      isect.t = t;
-      isect.hit = 1;
-      isect.p = ray.org + ray.dir * t;
-      isect.n = plane.n;
-    }
-  }
-}
-
-
-__device__
-static inline void
-ray_sphere_intersect(Isect &isect,const  Ray &ray, const Sphere &sphere) {
-  vec rs = ray.org - sphere.center;
-
-  float B = dot(rs, ray.dir);
-  float C = dot(rs, rs) - sphere.radius * sphere.radius;
-  float D = B * B - C;
-
-  if (D > 0.) {
-    float t = -B - sqrt(D);
-
-    if ((t > 0.0) && (t < isect.t)) {
-      isect.t = t;
-      isect.hit = 1;
-      isect.p = ray.org +  ray.dir * t;
-      isect.n = isect.p - sphere.center;
-      vnormalize(isect.n);
-    }
-  }
-}
-
-
-__device__
-static inline void
-orthoBasis(vec basis[3], vec n) {
-  basis[2] = n;
-  basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
-
-  if ((n.x < 0.6) && (n.x > -0.6)) {
-    basis[1].x = 1.0;
-  } else if ((n.y < 0.6) && (n.y > -0.6)) {
-    basis[1].y = 1.0;
-  } else if ((n.z < 0.6) && (n.z > -0.6)) {
-    basis[1].z = 1.0;
-  } else {
-    basis[1].x = 1.0;
-  }
-
-  basis[0] = vcross(basis[1], basis[2]);
-  vnormalize(basis[0]);
-
-  basis[1] = vcross(basis[2], basis[0]);
-  vnormalize(basis[1]);
-}
-
-
-__device__
-static inline float
-ambient_occlusion(Isect &isect,  const Plane &plane, const  Sphere spheres[3],
-    RNGState &rngstate) {
-  float eps = 0.0001f;
-  vec p; //, n;
-  vec basis[3];
-  float occlusion = 0.0;
-
-  p = isect.p + isect.n * eps;
-
-  orthoBasis(basis, isect.n);
-
-  const  int ntheta = NAO_SAMPLES;
-  const  int nphi   = NAO_SAMPLES;
-  for ( int j = 0; j < ntheta; j++) {
-    for ( int i = 0; i < nphi; i++) {
-      Ray ray;
-      Isect occIsect;
-
-      float theta = sqrt(frandom(&rngstate));
-      float phi   = 2.0f * M_PI * frandom(&rngstate);
-      float x = cos(phi) * theta;
-      float y = sin(phi) * theta;
-      float z = sqrt(1.0 - theta * theta);
-
-      // local -> global
-      float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
-      float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
-      float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
-
-      ray.org = p;
-      ray.dir.x = rx;
-      ray.dir.y = ry;
-      ray.dir.z = rz;
-
-      occIsect.t   = 1.0e+17;
-      occIsect.hit = 0;
-
-      for ( int snum = 0; snum < 3; ++snum)
-        ray_sphere_intersect(occIsect, ray, spheres[snum]); 
-      ray_plane_intersect (occIsect, ray, plane); 
-
-      if (occIsect.hit) occlusion += 1.0;
-    }
-  }
-
-  occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
-  return occlusion;
-}
-
-
-/* Compute the image for the tile [x0,x1) x [y0,y1), for an overall image
-   of width w and height h.
-   */
-__device__
-static inline void ao_tile(
-     int x0,  int x1,
-     int y0,  int y1, 
-     int w,  int h,
-     int nsubsamples, 
-     float image[]) 
-{
-  const  Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
-  const  Sphere spheres[3] = {
-    { { -2.0f, 0.0f, -3.5f }, 0.5f },
-    { { -0.5f, 0.0f, -3.0f }, 0.5f },
-    { { 1.0f, 0.0f, -2.2f }, 0.5f } };
-  RNGState rngstate;
-
-  seed_rng(&rngstate, programIndex + (y0 << (programIndex & 31)));
-  float invSamples = 1.f / nsubsamples;
-  for ( int y = y0; y < y1; y++)
-    for ( int xb = x0; xb < x1; xb += programCount)
-    {
-      const int x = xb + programIndex;
-      const int offset = 3 * (y * w + x);
-      float res = 0.0f;
-
-      for ( int u = 0; u < nsubsamples; u++)
-        for ( int v = 0; v < nsubsamples; v++)
-        {
-          float du = (float)u * invSamples, dv = (float)v * invSamples;
-
-          // Figure out x,y pixel in NDC
-          float px =  (x + du - (w / 2.0f)) / (w / 2.0f);
-          float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
-          float ret = 0.f;
-          Ray ray;
-          Isect isect;
-
-          ray.org.x = 0.0f;
-          ray.org.y = 0.0f;
-          ray.org.z = 0.0f;
-
-          // Poor man's perspective projection
-          ray.dir.x = px;
-          ray.dir.y = py;
-          ray.dir.z = -1.0;
-          vnormalize(ray.dir);
-
-          isect.t   = 1.0e+17;
-          isect.hit = 0;
-
-          for ( int snum = 0; snum < 3; ++snum)
-            ray_sphere_intersect(isect, ray, spheres[snum]);
-          ray_plane_intersect(isect, ray, plane);
-
-          // Note use of 'coherent' if statement; the set of rays we
-          // trace will often all hit or all miss the scene
-          if (isect.hit) {
-            ret = ambient_occlusion(isect, plane, spheres, rngstate);
-            ret *= invSamples * invSamples;
-            res += ret;
-          }
-        }
-
-      if (x < x1)  // skip lanes that fall past the right edge of the tile
-      {
-        image[offset  ] = res;
-        image[offset+1] = res;
-        image[offset+2] = res;
-      }
-
-    }
-}
-
-
-
-#define TILEX 64
-#define TILEY 4
-
-extern "C"
-__global__
-void ao_task( int width,  int height, 
-     int nsubsamples,  float image[]) 
-{
-  if (taskIndex0 >= taskCount0) return;
-  if (taskIndex1 >= taskCount1) return;
-
-  const  int x0 = taskIndex0 * TILEX;
-  const  int x1 = min(x0 + TILEX, width);
-
-  const  int y0 = taskIndex1 * TILEY;
-  const  int y1 = min(y0 + TILEY, height);
-  ao_tile(x0,x1,y0,y1, width, height, nsubsamples, image);
-}
-
-#if 1
-extern "C"
-__global__
-void ao_ispc_tasks(
-    int w, int h, int nsubsamples, 
-    float image[]) 
-{
-  const int ntilex = (w+TILEX-1)/TILEX;
-  const int ntiley = (h+TILEY-1)/TILEY;
-  const int nbx = (ntilex-1)/4 + 1;
-  const int nby =  ntiley;
-  const int nbz = 1;
-  const dim3 blocks (nbx, nby, nbz);
-  if (threadIdx.x == 0)
-    ao_task<<<blocks, 128>>>(w,h,nsubsamples,image);
-  cudaDeviceSynchronize();
-}
-#endif
--- a/examples_cuda/aobench/ao.ispc
+++ b/examples_cuda/aobench/ao.ispc
@@ -1,272 +0,0 @@
-// -*- mode: c++ -*-
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-/*
-  Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
-*/
-
-#define NAO_SAMPLES		8
-#define M_PI 3.1415926535f
-
-typedef float<3> vec;
-
-struct Isect {
-    float      t;
-    vec        p;
-    vec        n;
-    int        hit; 
-};
-
-struct Sphere {
-    vec        center;
-    float      radius;
-};
-
-struct Plane {
-    vec    p;
-    vec    n;
-};
-
-struct Ray {
-    vec org;
-    vec dir;
-};
-
-static inline float dot(vec a, vec b) {
-    return a.x * b.x + a.y * b.y + a.z * b.z;
-}
-
-static inline vec vcross(vec v0, vec v1) {
-    vec ret;
-    ret.x = v0.y * v1.z - v0.z * v1.y;
-    ret.y = v0.z * v1.x - v0.x * v1.z;
-    ret.z = v0.x * v1.y - v0.y * v1.x;
-    return ret;
-}
-
-static inline void vnormalize(vec &v) {
-    float len2 = dot(v, v);
-    float invlen = rsqrt(len2);
-    v *= invlen;
-}
-
-
-static void
-ray_plane_intersect(Isect &isect, Ray &ray, uniform Plane &plane) {
-    float d = -dot(plane.p, plane.n);
-    float v = dot(ray.dir, plane.n);
-
-    cif (abs(v) < 1.0e-17) 
-        return;
-    else {
-        float t = -(dot(ray.org, plane.n) + d) / v;
-
-        cif ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + ray.dir * t;
-            isect.n = plane.n;
-        }
-    }
-}
-
-
-static inline void
-ray_sphere_intersect(Isect &isect, Ray &ray, uniform Sphere &sphere) {
-    vec rs = ray.org - sphere.center;
-
-    float B = dot(rs, ray.dir);
-    float C = dot(rs, rs) - sphere.radius * sphere.radius;
-    float D = B * B - C;
-
-    cif (D > 0.) {
-        float t = -B - sqrt(D);
-
-        cif ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + t * ray.dir;
-            isect.n = isect.p - sphere.center;
-            vnormalize(isect.n);
-        }
-    }
-}
-
-
-static void
-orthoBasis(vec basis[3], vec n) {
-    basis[2] = n;
-    basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
-
-    if ((n.x < 0.6) && (n.x > -0.6)) {
-        basis[1].x = 1.0;
-    } else if ((n.y < 0.6) && (n.y > -0.6)) {
-        basis[1].y = 1.0;
-    } else if ((n.z < 0.6) && (n.z > -0.6)) {
-        basis[1].z = 1.0;
-    } else {
-        basis[1].x = 1.0;
-    }
-
-    basis[0] = vcross(basis[1], basis[2]);
-    vnormalize(basis[0]);
-
-    basis[1] = vcross(basis[2], basis[0]);
-    vnormalize(basis[1]);
-}
-
-
-static float
-ambient_occlusion(Isect &isect, uniform Plane &plane, uniform Sphere spheres[3],
-                  RNGState &rngstate) {
-    float eps = 0.0001f;
-    vec p, n;
-    vec basis[3];
-    float occlusion = 0.0;
-
-    p = isect.p + eps * isect.n;
-
-    orthoBasis(basis, isect.n);
-
-    static const uniform int ntheta = NAO_SAMPLES;
-    static const uniform int nphi   = NAO_SAMPLES;
-    for (uniform int j = 0; j < ntheta; j++) {
-        for (uniform int i = 0; i < nphi; i++) {
-            Ray ray;
-            Isect occIsect;
-
-            float theta = sqrt(frandom(&rngstate));
-            float phi   = 2.0f * M_PI * frandom(&rngstate);
-            float x = cos(phi) * theta;
-            float y = sin(phi) * theta;
-            float z = sqrt(1.0 - theta * theta);
-
-            // local -> global
-            float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
-            float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
-            float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
-
-            ray.org = p;
-            ray.dir.x = rx;
-            ray.dir.y = ry;
-            ray.dir.z = rz;
-
-            occIsect.t   = 1.0e+17;
-            occIsect.hit = 0;
-
-            for (uniform int snum = 0; snum < 3; ++snum)
-                ray_sphere_intersect(occIsect, ray, spheres[snum]); 
-            ray_plane_intersect (occIsect, ray, plane); 
-
-            if (occIsect.hit) occlusion += 1.0;
-        }
-    }
-
-    occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
-    return occlusion;
-}
-
-
-/* Compute the image for the scanlines from [y0,y1), for an overall image
-   of width w and height h.
- */
-static void ao_scanlines(uniform int y0, uniform int y1, uniform int w, 
-                         uniform int h,  uniform int nsubsamples, 
-                         uniform float image[]) {
-    static uniform Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
-    static uniform Sphere spheres[3] = {
-        { { -2.0f, 0.0f, -3.5f }, 0.5f },
-        { { -0.5f, 0.0f, -3.0f }, 0.5f },
-        { { 1.0f, 0.0f, -2.2f }, 0.5f } };
-    RNGState rngstate;
-
-    seed_rng(&rngstate, programIndex + (y0 << (programIndex & 15)));
-    float invSamples = 1.f / nsubsamples;
-
-    foreach_tiled(y = y0 ... y1, x = 0 ... w, 
-                  u = 0 ... nsubsamples, v = 0 ... nsubsamples) {
-        float du = (float)u * invSamples, dv = (float)v * invSamples;
-
-        // Figure out x,y pixel in NDC
-        float px =  (x + du - (w / 2.0f)) / (w / 2.0f);
-        float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
-        float ret = 0.f;
-        Ray ray;
-        Isect isect;
-
-        ray.org = 0.f;
-
-        // Poor man's perspective projection
-        ray.dir.x = px;
-        ray.dir.y = py;
-        ray.dir.z = -1.0;
-        vnormalize(ray.dir);
-
-        isect.t   = 1.0e+17;
-        isect.hit = 0;
-
-        for (uniform int snum = 0; snum < 3; ++snum)
-            ray_sphere_intersect(isect, ray, spheres[snum]);
-        ray_plane_intersect(isect, ray, plane);
-
-        // Note use of 'coherent' if statement; the set of rays we
-        // trace will often all hit or all miss the scene
-        cif (isect.hit) {
-            ret = ambient_occlusion(isect, plane, spheres, rngstate);
-            ret *= invSamples * invSamples;
-
-            int offset = 3 * (y * w + x);
-            atomic_add_local(&image[offset], ret);
-            atomic_add_local(&image[offset+1], ret);
-            atomic_add_local(&image[offset+2], ret);
-        }
-    }
-}
-
-
-export void ao_ispc(uniform int w, uniform int h, uniform int nsubsamples, 
-                    uniform float image[]) {
-    ao_scanlines(0, h, w, h, nsubsamples, image);
-}
-
-
-static void task ao_task(uniform int width, uniform int height, 
-                         uniform int nsubsamples, uniform float image[]) {
-    ao_scanlines(taskIndex, taskIndex+1, width, height, nsubsamples, image);
-}
-
-
-export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples, 
-                          uniform float image[]) {
-    launch[h] ao_task(w, h, nsubsamples, image);
-}
--- a/examples_cuda/aobench/ao1.ispc
+++ b/examples_cuda/aobench/ao1.ispc
@@ -1,302 +0,0 @@
-// -*- mode: c++ -*-
-/*
-   Copyright (c) 2010-2011, Intel Corporation
-   All rights reserved.
-
-   Redistribution and use in source and binary forms, with or without
-   modification, are permitted provided that the following conditions are
-met:
-
- * Redistributions of source code must retain the above copyright
- notice, this list of conditions and the following disclaimer.
-
- * Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions and the following disclaimer in the
- documentation and/or other materials provided with the distribution.
-
- * Neither the name of Intel Corporation nor the names of its
- contributors may be used to endorse or promote products derived from
- this software without specific prior written permission.
-
-
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
- IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
- TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
- PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
- OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
- */
-/*
-   Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
-   */
-
-#define NAO_SAMPLES		8
-#define M_PI 3.1415926535f
-
-typedef float<3> vec;
-
-struct Isect {
-  float      t;
-  vec        p;
-  vec        n;
-  int        hit; 
-};
-
-struct Sphere {
-  vec        center;
-  float      radius;
-};
-
-struct Plane {
-  vec    p;
-  vec    n;
-};
-
-struct Ray {
-  vec org;
-  vec dir;
-};
-
-static inline float dot(vec a, vec b) {
-  return a.x * b.x + a.y * b.y + a.z * b.z;
-}
-
-static inline vec vcross(vec v0, vec v1) {
-  vec ret;
-  ret.x = v0.y * v1.z - v0.z * v1.y;
-  ret.y = v0.z * v1.x - v0.x * v1.z;
-  ret.z = v0.x * v1.y - v0.y * v1.x;
-  return ret;
-}
-
-static inline void vnormalize(vec &v) {
-  float len2 = dot(v, v);
-  float invlen = rsqrt(len2);
-  v *= invlen;
-}
-
-
-static inline void
-ray_plane_intersect(Isect &isect, Ray &ray, uniform Plane &plane) {
-  float d = -dot(plane.p, plane.n);
-  float v = dot(ray.dir, plane.n);
-
-  if (abs(v) < 1.0e-17) 
-    return;
-  else {
-    float t = -(dot(ray.org, plane.n) + d) / v;
-
-    if ((t > 0.0) && (t < isect.t)) {
-      isect.t = t;
-      isect.hit = 1;
-      isect.p = ray.org + ray.dir * t;
-      isect.n = plane.n;
-    }
-  }
-}
-
-
-void
-ray_sphere_intersect(Isect &isect, Ray &ray, uniform Sphere &sphere) {
-  vec rs = ray.org - sphere.center;
-
-  float B = dot(rs, ray.dir);
-  float C = dot(rs, rs) - sphere.radius * sphere.radius;
-  float D = B * B - C;
-
-  if (D > 0.) {
-    float t = -B - sqrt(D);
-
-    if ((t > 0.0) && (t < isect.t)) {
-      isect.t = t;
-      isect.hit = 1;
-      isect.p = ray.org + t * ray.dir;
-      isect.n = isect.p - sphere.center;
-      vnormalize(isect.n);
-    }
-  }
-}
-
-
-static inline void
-orthoBasis(vec basis[3], vec n) {
-  basis[2] = n;
-  basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
-
-  if ((n.x < 0.6) && (n.x > -0.6)) {
-    basis[1].x = 1.0;
-  } else if ((n.y < 0.6) && (n.y > -0.6)) {
-    basis[1].y = 1.0;
-  } else if ((n.z < 0.6) && (n.z > -0.6)) {
-    basis[1].z = 1.0;
-  } else {
-    basis[1].x = 1.0;
-  }
-
-  basis[0] = vcross(basis[1], basis[2]);
-  vnormalize(basis[0]);
-
-  basis[1] = vcross(basis[2], basis[0]);
-  vnormalize(basis[1]);
-}
-
-
-static inline float
-ambient_occlusion(Isect &isect, uniform Plane &plane, uniform Sphere spheres[3],
-    RNGState &rngstate) {
-  float eps = 0.0001f;
-  vec p, n;
-  vec basis[3];
-  float occlusion = 0.0;
-
-  p = isect.p + eps * isect.n;
-
-  orthoBasis(basis, isect.n);
-
-  static const uniform int ntheta = NAO_SAMPLES;
-  static const uniform int nphi   = NAO_SAMPLES;
-  for (uniform int j = 0; j < ntheta; j++) {
-    for (uniform int i = 0; i < nphi; i++) {
-      Ray ray;
-      Isect occIsect;
-
-      float theta = sqrt(frandom(&rngstate));
-      float phi   = 2.0f * M_PI * frandom(&rngstate);
-      float x = cos(phi) * theta;
-      float y = sin(phi) * theta;
-      float z = sqrt(1.0 - theta * theta);
-
-      // local -> global
-      float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
-      float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
-      float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
-
-      ray.org = p;
-      ray.dir.x = rx;
-      ray.dir.y = ry;
-      ray.dir.z = rz;
-
-      occIsect.t   = 1.0e+17;
-      occIsect.hit = 0;
-
-      for (uniform int snum = 0; snum < 3; ++snum)
-        ray_sphere_intersect(occIsect, ray, spheres[snum]); 
-      ray_plane_intersect (occIsect, ray, plane); 
-
-      if (occIsect.hit) occlusion += 1.0;
-    }
-  }
-
-  occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
-  return occlusion;
-}
-
-
-/* Compute the image for the tile [x0,x1) x [y0,y1), for an overall image
-   of width w and height h.
-   */
-static  inline void ao_tile(
-    uniform int x0, uniform int x1,
-    uniform int y0, uniform int y1, 
-    uniform int w, uniform int h,
-    uniform int nsubsamples, 
-    uniform float image[]) 
-{
-  uniform Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
-  uniform Sphere spheres[3] = {
-    { { -2.0f, 0.0f, -3.5f }, 0.5f },
-    { { -0.5f, 0.0f, -3.0f }, 0.5f },
-    { { 1.0f, 0.0f, -2.2f }, 0.5f } };
-  RNGState rngstate;
-
-  seed_rng(&rngstate, programIndex + (y0 << (programIndex & 31)));
-  float invSamples = 1.f / nsubsamples;
-  foreach_tiled (y = y0 ... y1, x = x0 ... x1)
-  {
-    const int offset = 3 * (y * w + x);
-    float res = 0.0f;
-
-    for (uniform int u = 0; u < nsubsamples; u++)
-      for (uniform int v = 0; v < nsubsamples; v++)
-      {
-        float du = (float)u * invSamples, dv = (float)v * invSamples;
-
-        // Figure out x,y pixel in NDC
-        float px =  (x + du - (w / 2.0f)) / (w / 2.0f);
-        float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
-        float ret = 0.f;
-        Ray ray;
-        Isect isect;
-
-        ray.org = 0.f;
-
-        // Poor man's perspective projection
-        ray.dir.x = px;
-        ray.dir.y = py;
-        ray.dir.z = -1.0;
-        vnormalize(ray.dir);
-
-        isect.t   = 1.0e+17;
-        isect.hit = 0;
-
-        for (uniform int snum = 0; snum < 3; ++snum)
-          ray_sphere_intersect(isect, ray, spheres[snum]);
-        ray_plane_intersect(isect, ray, plane);
-
-        // Plain 'if' (not the coherent 'cif'); the set of rays we
-        // trace will often all hit or all miss the scene
-        if (isect.hit) {
-          ret = ambient_occlusion(isect, plane, spheres, rngstate);
-          ret *= invSamples * invSamples;
-          res += ret;
-        }
-      }
-
-    //if (x < x1)
-    {
-      image[offset  ] = res;
-      image[offset+1] = res;
-      image[offset+2] = res;
-    }
-  }
-}
-
-
-
-#define TILEX 64
-#define TILEY 4
-
-/* unless task/export is specified all functions
- * are generated as mangled "__device__" functions  
- */
-
-/* task will generate mangled "__global__" function only */
-void task ao_task(uniform int width, uniform int height, 
-    uniform int nsubsamples, uniform float image[]) 
-{
-  if (taskIndex0 >= taskCount0) return;
-  if (taskIndex1 >= taskCount1) return;
-
-  const uniform int x0 = taskIndex0 * TILEX;
-  const uniform int x1 = min(x0 + TILEX, width);
-
-  const uniform int y0 = taskIndex1 * TILEY;
-  const uniform int y1 = min(y0 + TILEY, height);
-  ao_tile(x0,x1,y0,y1, width, height, nsubsamples, image);
-}
-
-
-/* export generates both an unmangled extern "C" "__global__" function and a mangled "__device__" function */
-export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples, 
-    uniform float image[]) 
-{
-  const uniform int ntilex = (w+TILEX-1)/TILEX;
-  const uniform int ntiley = (h+TILEY-1)/TILEY;
-  launch[ntilex,ntiley] ao_task(w, h, nsubsamples, image);
-  sync;
-}
--- a/examples_cuda/aobench/ao_cu.cpp
+++ b/examples_cuda/aobench/ao_cu.cpp
@@ -1,510 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <assert.h>
-#ifdef __linux__
-#include <malloc.h>
-#endif
-#include <math.h>
-#include <map>
-#include <string>
-#include <algorithm>
-#include <sys/types.h>
-
-//#include "ao1_ispc.h"
-//using namespace ispc;
-
-#include "../timing.h"
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-/******************************/ 
-#include <cassert>
-#include <iostream>
-#include <cuda.h>
-#include "drvapi_error_string.h"
-
-#define checkCudaErrors(err)  __checkCudaErrors (err, __FILE__, __LINE__)
-// These are the inline versions for all of the SDK helper functions
-void __checkCudaErrors(CUresult err, const char *file, const int line) {
-  if(CUDA_SUCCESS != err) {
-    std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
-           << getCudaDrvErrorString(err) << "\" from file <" << file
-           << ">, line " << line << "\n";
-    exit(-1);
-  }
-}
-
-/**********************/
-/* Basic CUDriver API */
-CUcontext context;
-
-void createContext(const int deviceId = 0)
-{
-  CUdevice device;
-  int devCount;
-  checkCudaErrors(cuInit(0));
-  checkCudaErrors(cuDeviceGetCount(&devCount));
-  assert(devCount > 0);
-  checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
-
-  char name[128];
-  checkCudaErrors(cuDeviceGetName(name, 128, device));
-  std::cout << "Using CUDA Device [0]: " << name << "\n";
-
-  int devMajor, devMinor;
-  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
-  std::cout << "Device Compute Capability: " 
-    << devMajor << "." << devMinor << "\n";
-  if (devMajor < 2) {
-    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
-    exit(1); 
-  }
-
-  // Create driver context
-  checkCudaErrors(cuCtxCreate(&context, 0, device));
-}
-void destroyContext()
-{
-  checkCudaErrors(cuCtxDestroy(context));
-}
-
-CUmodule loadModule(const char * module)
-{
-  const double t0 = rtc();
-  CUmodule cudaModule;
-  // in this branch we use compilation with parameters
-
-#if 0
-  unsigned int jitNumOptions = 1;
-  CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
-  void **jitOptVals = new void*[jitNumOptions];
-  // set up pointer to set the Maximum # of registers for a particular kernel
-  jitOptions[0] = CU_JIT_MAX_REGISTERS;
-  int jitRegCount = 64;
-  jitOptVals[0] = (void *)(size_t)jitRegCount;
-#if 0
-
-  {
-    jitNumOptions = 3;
-    // set up size of compilation log buffer
-    jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
-    int jitLogBufferSize = 1024;
-    jitOptVals[0] = (void *)(size_t)jitLogBufferSize;
-
-    // set up pointer to the compilation log buffer
-    jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
-    char *jitLogBuffer = new char[jitLogBufferSize];
-    jitOptVals[1] = jitLogBuffer;
-
-    // set up pointer to set the Maximum # of registers for a particular kernel
-    jitOptions[2] = CU_JIT_MAX_REGISTERS;
-    int jitRegCount = 32;
-    jitOptVals[2] = (void *)(size_t)jitRegCount;
-  }
-#endif
-
-  checkCudaErrors(cuModuleLoadDataEx(&cudaModule, module,jitNumOptions, jitOptions, (void **)jitOptVals));
-#else
-  CUlinkState  CUState;
-  CUlinkState *lState = &CUState;
-  const int nOptions = 7;
-    CUjit_option options[nOptions];
-    void* optionVals[nOptions];
-    float walltime;
-    const unsigned int logSize = 32768;
-    char error_log[logSize],
-         info_log[logSize];
-    void *cuOut;
-    size_t outSize;
-    int myErr = 0;
-
-    // Setup linker options
-    // Return walltime from JIT compilation
-    options[0] = CU_JIT_WALL_TIME;
-    optionVals[0] = (void*) &walltime;
-    // Pass a buffer for info messages
-    options[1] = CU_JIT_INFO_LOG_BUFFER;
-    optionVals[1] = (void*) info_log;
-    // Pass the size of the info buffer
-    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
-    optionVals[2] = (void*)(size_t) logSize;
-    // Pass a buffer for error message
-    options[3] = CU_JIT_ERROR_LOG_BUFFER;
-    optionVals[3] = (void*) error_log;
-    // Pass the size of the error buffer
-    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
-    optionVals[4] = (void*)(size_t) logSize;
-    // Make the linker verbose
-    options[5] = CU_JIT_LOG_VERBOSE;
-    optionVals[5] = (void*) 1;
-    // Max # of registers per thread
-    options[6] = CU_JIT_MAX_REGISTERS;
-    int jitRegCount = 64;
-    optionVals[6] = (void *)(size_t)jitRegCount;
-
-    // Create a pending linker invocation
-    checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
-
-#if 0
-    if (sizeof(void *)==4)
-    {
-        // Load the PTX from the string myPtx32
-        printf("Loading myPtx32[] program\n");
-        // PTX May also be loaded from file, as per below.
-        myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
-    }
-    else
-#endif
-    {
-        // Load the PTX from the string 'module' (64-bit)
-        fprintf(stderr, "Loading ptx..\n");
-        myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
-        myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, "libcudadevrt.a", 0,0,0); 
-        // PTX May also be loaded from file, as per below.
-        // myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
-    }
-
-    // Complete the linker step
-    myErr = cuLinkComplete(*lState, &cuOut, &outSize);
-
-    if ( myErr != CUDA_SUCCESS )
-    {
-      // Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above. 
-      fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
-      assert(0);
-    }    
-
-    // Linker walltime and info_log were requested in options above.
-    fprintf(stderr, "CUDA Link Completed in %f ms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
-
-    // Load resulting cuBin into module
-    checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
-
-    // Destroy the linker invocation
-    checkCudaErrors(cuLinkDestroy(*lState));
-#endif
-  fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
-  return cudaModule;
-}
-void unloadModule(CUmodule &cudaModule)
-{
-  checkCudaErrors(cuModuleUnload(cudaModule));
-}
-
-CUfunction getFunction(CUmodule &cudaModule, const char * function)
-{
-  CUfunction cudaFunction;
-  checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
-  return cudaFunction;
-}
-  
-CUdeviceptr deviceMalloc(const size_t size)
-{
-  CUdeviceptr d_buf;
-  checkCudaErrors(cuMemAllocManaged(&d_buf, size, CU_MEM_ATTACH_GLOBAL));
-  return d_buf;
-}
-void deviceFree(CUdeviceptr d_buf)
-{
-  checkCudaErrors(cuMemFree(d_buf));
-}
-void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
-}
-void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
-}
-#define deviceLaunch(func,params) \
-  checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_EQUAL)); \
-  checkCudaErrors( \
-      cuLaunchKernel( \
-        (func), \
-        1,1,1, \
-        32, 1, 1, \
-        0, NULL, (params), NULL \
-        ));
-
-typedef CUdeviceptr devicePtr;
-
-
-/**************/
-#include <vector>
-std::vector<char> readBinary(const char * filename)
-{
-  std::vector<char> buffer;
-  FILE *fp = fopen(filename, "rb");
-  if (!fp )
-  {
-    fprintf(stderr, "file %s not found\n", filename);
-    assert(0);
-  }
-#if 0
-  char c;
-  while ((c = fgetc(fp)) != EOF)
-    buffer.push_back(c);
-#else
-  fseek(fp, 0, SEEK_END); 
-  const unsigned long long size = ftell(fp);         /*calc the size needed*/
-  fseek(fp, 0, SEEK_SET); 
-  buffer.resize(size);
-
-  if (fp == NULL){ /*ERROR detection if file == empty*/
-    fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);           
-    exit(1);
-  }
-  else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
-    fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
-    exit(1);
-  }
-#endif
-  fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
-  return buffer;
-}
-
-extern "C" 
-{
-  void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
-  {
-    return NULL;
-  }
-  double CUDALaunch(
-      void **handlePtr, 
-      const char * func_name,
-      void **func_args)
-  {
-    const std::vector<char> module_str = readBinary("__kernels.ptx");
-    const char *  module = &module_str[0];
-    CUmodule   cudaModule   = loadModule(module);
-    CUfunction cudaFunction = getFunction(cudaModule, func_name);
-    const double t0 = rtc();
-    deviceLaunch(cudaFunction, func_args);
-    checkCudaErrors(cuStreamSynchronize(0));
-    const double dt = rtc() - t0;
-    unloadModule(cudaModule);
-    return dt;
-  }
-  void CUDASync(void *handle)
-  {
-    checkCudaErrors(cuStreamSynchronize(0));
-  }
-  void ISPCSync(void *handle)
-  {
-    checkCudaErrors(cuStreamSynchronize(0));
-  }
-  void CUDAFree(void *handle)
-  {
-  }
-}
-/******************************/
-
-
-#define NSUBSAMPLES        2
-
-extern void ao_serial(int w, int h, int nsubsamples, float image[]);
-
-static unsigned int test_iterations;
-static unsigned int width, height;
-static unsigned char *img;
-static float *fimg;
-
-
-static unsigned char
-clamp(float f)
-{
-    int i = (int)(f * 255.5);
-
-    if (i < 0) i = 0;
-    if (i > 255) i = 255;
-
-    return (unsigned char)i;
-}
-
-
-static void
-savePPM(const char *fname, int w, int h)
-{
-    for (int y = 0; y < h; y++) {
-        for (int x = 0; x < w; x++)  {
-            img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
-            img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
-            img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
-        }
-    }
-
-    FILE *fp = fopen(fname, "wb");
-    if (!fp) {
-        perror(fname);
-        exit(1);
-    }
-
-    fprintf(fp, "P6\n");
-    fprintf(fp, "%d %d\n", w, h);
-    fprintf(fp, "255\n");
-    fwrite(img, w * h * 3, 1, fp);
-    fclose(fp);
-    printf("Wrote image file %s\n", fname);
-}
-
-
-int main(int argc, char **argv)
-{
-    if (argc != 4) {
-        printf ("%s\n", argv[0]);
-        printf ("Usage: ao [num test iterations] [width] [height]\n");
-        getchar();
-        exit(-1);
-    }
-    else {
-        test_iterations = atoi(argv[1]);
-        width = atoi (argv[2]);
-        height = atoi (argv[3]);
-    }
-
-    // Allocate space for output images
-    img = new unsigned char[width * height * 3];
-    fimg = new float[width * height * 3];
-
-    //
-    // Run the ispc path, test_iterations times, and report the minimum
-    // time for any of them.
-    //
-    double minTimeISPC = 1e30;
-#if 0
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        assert(NSUBSAMPLES == 2);
-
-        reset_and_start_timer();
-        ao_ispc(width, height, NSUBSAMPLES, fimg);
-        double t = get_elapsed_mcycles();
-        minTimeISPC = std::min(minTimeISPC, t);
-    }
-
-    // Report results and save image
-    printf("[aobench ispc]:\t\t\t[%.3f] million cycles (%d x %d image)\n", 
-           minTimeISPC, width, height);
-    savePPM("ao-ispc.ppm", width, height); 
-#endif
-
-    /*******************/
-  createContext();
-  /*******************/
-  devicePtr d_fimg = deviceMalloc(width*height*3*sizeof(float));
-
-    //
-    // Run the ispc + tasks path, test_iterations times, and report the
-    // minimum time for any of them.
-    //
-    double minTimeISPCTasks = 1e30;
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        assert(NSUBSAMPLES == 2);
-        memcpyH2D(d_fimg, fimg, width*height*3*sizeof(float));
-
-        reset_and_start_timer();
-#if 0
-        const double t0 = rtc();
-        ao_ispc_tasks(
-            width, 
-            height, 
-            NSUBSAMPLES, 
-            (float*)d_fimg);
-//        double t = (rtc() - t0); //get_elapsed_mcycles();
-#else
-        const char * func_name = "ao_ispc_tasks";
-        int arg_1 = width;
-        int arg_2 = height;
-        int arg_3 = NSUBSAMPLES;
-        void *func_args[] = {&arg_1, &arg_2, &arg_3, (float*)&d_fimg};
-        const double t = 1e3*CUDALaunch(NULL, func_name, func_args);
-#endif
-        minTimeISPCTasks = std::min(minTimeISPCTasks, t);
-    }
-
-    memcpyD2H(fimg, d_fimg, width*height*3*sizeof(float));
-
-    // Report results and save image
-    printf("[aobench ispc + tasks]:\t\t[%.3f] ms (%d x %d image)\n", 
-           minTimeISPCTasks, width, height);
-    savePPM("ao-cuda.ppm", width, height); 
-  /*******************/
-  destroyContext();
-  /*******************/
-    return 0;
-
-    //
-    // Run the serial path, again test_iterations times, and report the
-    // minimum time.
-    //
-    double minTimeSerial = 1e30;
-    for (unsigned int i = 0; i < test_iterations; i++) {
-        memset((void *)fimg, 0, sizeof(float) * width * height * 3);
-        reset_and_start_timer();
-        ao_serial(width, height, NSUBSAMPLES, fimg);
-        double t = get_elapsed_mcycles();
-        minTimeSerial = std::min(minTimeSerial, t);
-    }
-
-    // Report more results, save another image...
-    printf("[aobench serial]:\t\t[%.3f] million cycles (%d x %d image)\n", minTimeSerial, 
-           width, height);
-    printf("\t\t\t\t(%.2fx speedup from ISPC, %.2fx speedup from ISPC + tasks)\n", 
-           minTimeSerial / minTimeISPC, minTimeSerial / minTimeISPCTasks);
-    savePPM("ao-serial.ppm", width, height); 
-        
-    return 0;
-}
--- a/examples_cuda/aobench/ao_serial.cpp
+++ b/examples_cuda/aobench/ao_serial.cpp
@@ -1,314 +0,0 @@
-// -*- mode: c++ -*-
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-/*
-  Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdlib.h>
-#include <math.h>
-
-#ifdef _MSC_VER
-static long long drand48_x = 0x1234ABCD330E;
-
-static inline void srand48(int x) {
-    drand48_x = x ^ (x << 16);
-}
-
-static inline double drand48() {
-    drand48_x = drand48_x * 0x5DEECE66D + 0xB;
-    return (drand48_x & 0xFFFFFFFFFFFF) * (1.0 / 281474976710656.0);
-}
-#endif // _MSC_VER
-
-#ifdef _MSC_VER
-__declspec(align(16)) 
-#endif
-struct vec {
-    vec() { x=y=z=pad=0.; }
-    vec(float xx, float yy, float zz) { x = xx; y = yy; z = zz; }
-
-    vec operator*(float f) const { return vec(x*f, y*f, z*f); }
-    vec operator+(const vec &f2) const { 
-        return vec(x+f2.x, y+f2.y, z+f2.z); 
-    }
-    vec operator-(const vec &f2) const { 
-        return vec(x-f2.x, y-f2.y, z-f2.z); 
-    }
-    vec operator*(const vec &f2) const { 
-        return vec(x*f2.x, y*f2.y, z*f2.z); 
-    }
-    float x, y, z;
-    float pad;
-}
-#ifndef _MSC_VER
-__attribute__ ((aligned(16)))
-#endif
-;
-inline vec operator*(float f, const vec &v) { return vec(f*v.x, f*v.y, f*v.z); }
-
-
-#define NAO_SAMPLES		8
-
-#ifdef M_PI
-#undef M_PI
-#endif
-#define M_PI 3.1415926535f
-
-struct Isect {
-    float      t;
-    vec        p;
-    vec        n;
-    int        hit; 
-};
-
-struct Sphere {
-    vec        center;
-    float      radius;
-
-};
-
-struct Plane {
-    vec    p;
-    vec    n;
-};
-
-struct Ray {
-    vec org;
-    vec dir;
-};
-
-static inline float dot(const vec &a, const vec &b) {
-    return a.x * b.x + a.y * b.y + a.z * b.z;
-}
-
-static inline vec vcross(const vec &v0, const vec &v1) {
-    vec ret;
-    ret.x = v0.y * v1.z - v0.z * v1.y;
-    ret.y = v0.z * v1.x - v0.x * v1.z;
-    ret.z = v0.x * v1.y - v0.y * v1.x;
-    return ret;
-}
-
-static inline void vnormalize(vec &v) {
-    float len2 = dot(v, v);
-    float invlen = 1.f / sqrtf(len2);
-    v = v * invlen;
-}
-
-
-static inline void
-ray_plane_intersect(Isect &isect, Ray &ray, 
-                    Plane &plane) {
-    float d = -dot(plane.p, plane.n);
-    float v = dot(ray.dir, plane.n);
-
-    if (fabsf(v) < 1.0e-17f) 
-        return;
-    else {
-        float t = -(dot(ray.org, plane.n) + d) / v;
-
-        if ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + ray.dir * t;
-            isect.n = plane.n;
-        }
-    }
-}
-
-
-static inline void
-ray_sphere_intersect(Isect &isect, Ray &ray, 
-                     Sphere &sphere) {
-    vec rs = ray.org - sphere.center;
-
-    float B = dot(rs, ray.dir);
-    float C = dot(rs, rs) - sphere.radius * sphere.radius;
-    float D = B * B - C;
-
-    if (D > 0.) {
-        float t = -B - sqrtf(D);
-
-        if ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + t * ray.dir;
-            isect.n = isect.p - sphere.center;
-            vnormalize(isect.n);
-        }
-    }
-}
-
-
-static inline void
-orthoBasis(vec basis[3], const vec &n) {
-    basis[2] = n;
-    basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
-
-    if ((n.x < 0.6f) && (n.x > -0.6f)) {
-        basis[1].x = 1.0;
-    } else if ((n.y < 0.6f) && (n.y > -0.6f)) {
-        basis[1].y = 1.0;
-    } else if ((n.z < 0.6f) && (n.z > -0.6f)) {
-        basis[1].z = 1.0;
-    } else {
-        basis[1].x = 1.0;
-    }
-
-    basis[0] = vcross(basis[1], basis[2]);
-    vnormalize(basis[0]);
-
-    basis[1] = vcross(basis[2], basis[0]);
-    vnormalize(basis[1]);
-}
-
-
-static float
-ambient_occlusion(Isect &isect, Plane &plane, 
-                  Sphere spheres[3]) {
-    float eps = 0.0001f;
-    vec p, n;
-    vec basis[3];
-    float occlusion = 0.0;
-
-    p = isect.p + eps * isect.n;
-
-    orthoBasis(basis, isect.n);
-
-    static const int ntheta = NAO_SAMPLES;
-    static const int nphi   = NAO_SAMPLES;
-    for (int j = 0; j < ntheta; j++) {
-        for (int i = 0; i < nphi; i++) {
-            Ray ray;
-            Isect occIsect;
-
-            float theta = sqrtf(drand48());
-            float phi   = 2.0f * M_PI * drand48();
-            float x = cosf(phi) * theta;
-            float y = sinf(phi) * theta;
-            float z = sqrtf(1.0f - theta * theta);
-
-            // local -> global: rotate the sample direction into the world basis
-            float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
-            float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
-            float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
-
-            ray.org = p;
-            ray.dir.x = rx;
-            ray.dir.y = ry;
-            ray.dir.z = rz;
-
-            occIsect.t   = 1.0e+17f;
-            occIsect.hit = 0;
-
-            for (int snum = 0; snum < 3; ++snum)
-                ray_sphere_intersect(occIsect, ray, spheres[snum]); 
-            ray_plane_intersect (occIsect, ray, plane); 
-
-            if (occIsect.hit) occlusion += 1.f;
-        }
-    }
-
-    occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
-    return occlusion;
-}
-
-
-/* Compute the image for the scanlines from [y0,y1), for an overall image
-   of width w and height h.
- */
-static void ao_scanlines(int y0, int y1, int w, int h, int nsubsamples,
-                         float image[]) {
-    static Plane plane = { vec(0.0f, -0.5f, 0.0f), vec(0.f, 1.f, 0.f) };
-    static Sphere spheres[3] = {
-        { vec(-2.0f, 0.0f, -3.5f), 0.5f },
-        { vec(-0.5f, 0.0f, -3.0f), 0.5f },
-        { vec(1.0f, 0.0f, -2.2f), 0.5f } };
-
-    srand48(y0);
-    
-    for (int y = y0; y < y1; ++y) {
-        for (int x = 0; x < w; ++x)  {
-            int offset = 3 * (y * w + x);
-            for (int u = 0; u < nsubsamples; ++u) {
-                for (int v = 0; v < nsubsamples; ++v) {
-                    float px = (x + (u / (float)nsubsamples) - (w / 2.0f)) / (w / 2.0f);
-                    float py = -(y + (v / (float)nsubsamples) - (h / 2.0f)) / (h / 2.0f);
-                    float ret = 0.f;
-                    Ray ray;
-                    Isect isect;
-
-                    ray.org = vec(0.f, 0.f, 0.f);
-
-                    ray.dir.x = px;
-                    ray.dir.y = py;
-                    ray.dir.z = -1.0f;
-                    vnormalize(ray.dir);
-
-                    isect.t   = 1.0e+17f;
-                    isect.hit = 0;
-
-                    for (int snum = 0; snum < 3; ++snum)
-                        ray_sphere_intersect(isect, ray, spheres[snum]);
-                    ray_plane_intersect(isect, ray, plane);
-
-                    if (isect.hit)
-                        ret = ambient_occlusion(isect, plane, spheres);
-
-                    // Update image for AO for this ray
-                    image[offset+0] += ret;
-                    image[offset+1] += ret;
-                    image[offset+2] += ret;
-                }
-            }
-            // Normalize image pixels by number of samples taken per pixel
-            image[offset+0] /= nsubsamples * nsubsamples;
-            image[offset+1] /= nsubsamples * nsubsamples;
-            image[offset+2] /= nsubsamples * nsubsamples;
-        }
-    }
-}
-
-
-void ao_serial(int w, int h, int nsubsamples, 
-               float image[]) {
-    ao_scanlines(0, h, w, h, nsubsamples, image);
-}
--- a/examples_cuda/aobench/aobench.vcxproj
+++ b/examples_cuda/aobench/aobench.vcxproj
@@ -1,180 +0,0 @@
-<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <ItemGroup Label="ProjectConfigurations">
-    <ProjectConfiguration Include="Debug|Win32">
-      <Configuration>Debug</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Debug|x64">
-      <Configuration>Debug</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|Win32">
-      <Configuration>Release</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|x64">
-      <Configuration>Release</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-  </ItemGroup>
-  <ItemGroup>
-    <ClCompile Include="ao.cpp" />
-    <ClCompile Include="ao_serial.cpp" />
-    <ClCompile Include="../tasksys.cpp" />
-  </ItemGroup>
-  <ItemGroup>
-    <CustomBuild Include="ao.ispc">
-      <FileType>Document</FileType>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4,avx
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4,avx
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4,avx
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4,avx
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-    </CustomBuild>
-  </ItemGroup>
-  <PropertyGroup Label="Globals">
-    <ProjectGuid>{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}</ProjectGuid>
-    <Keyword>Win32Proj</Keyword>
-    <RootNamespace>aobench</RootNamespace>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
-  <ImportGroup Label="ExtensionSettings">
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <PropertyGroup Label="UserMacros" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>ao</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ExecutablePath);$(ProjectDir)..\..</ExecutablePath>
-    <TargetName>ao</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>ao</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>ao</TargetName>
-  </PropertyGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
-  <ImportGroup Label="ExtensionTargets">
-  </ImportGroup>
-</Project>
--- a/examples_cuda/aobench/drvapi_error_string.h
+++ b/examples_cuda/aobench/drvapi_error_string.h
@@ -1,370 +0,0 @@
-/*
- * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
- *
- * Please refer to the NVIDIA end user license agreement (EULA) associated
- * with this source code for terms and conditions that govern your use of
- * this software. Any use, reproduction, disclosure, or distribution of
- * this software and related documentation outside the terms of the EULA
- * is strictly prohibited.
- *
- */
- 
-#ifndef _DRVAPI_ERROR_STRING_H_
-#define _DRVAPI_ERROR_STRING_H_
-
-#include <stdio.h>
-#include <string.h>
-#include <stdlib.h>
-
-// Error Code string definitions here
-typedef struct
-{
-    char const *error_string;
-    int  error_id;
-} s_CudaErrorStr;
-
-/**
- * Error codes
- */
-static s_CudaErrorStr sCudaDrvErrorString[] =
-{
-    /**
-     * The API call returned with no errors. In the case of query calls, this
-     * can also mean that the operation being queried is complete (see
-     * ::cuEventQuery() and ::cuStreamQuery()).
-     */
-    { "CUDA_SUCCESS", 0 },
-
-    /**
-     * This indicates that one or more of the parameters passed to the API call
-     * is not within an acceptable range of values.
-     */
-    { "CUDA_ERROR_INVALID_VALUE", 1 },
-
-    /**
-     * The API call failed because it was unable to allocate enough memory to
-     * perform the requested operation.
-     */
-    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
-
-    /**
-     * This indicates that the CUDA driver has not been initialized with
-     * ::cuInit() or that initialization has failed.
-     */
-    { "CUDA_ERROR_NOT_INITIALIZED", 3 },
-
-    /**
-     * This indicates that the CUDA driver is in the process of shutting down.
-     */
-    { "CUDA_ERROR_DEINITIALIZED", 4 },
-
-    /**
-     * This indicates that profiling APIs were called while the application
-     * is running in visual profiler mode.
-     */
-    { "CUDA_ERROR_PROFILER_DISABLED", 5 },
-    /**
-     * This indicates that profiling has not been initialized for this context.
-     * Call cuProfilerInitialize() to resolve this.
-     */
-    { "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
-    /**
-     * This indicates that the profiler has already been started, probably
-     * because cuProfilerStart() was called incorrectly.
-     */
-    { "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
-    /**
-     * This indicates that the profiler has already been stopped, probably
-     * because cuProfilerStop() was called incorrectly.
-     */
-    { "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },
-    /**
-     * This indicates that no CUDA-capable devices were detected by the installed
-     * CUDA driver.
-     */
-    { "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
-
-    /**
-     * This indicates that the device ordinal supplied by the user does not
-     * correspond to a valid CUDA device.
-     */
-    { "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
-
-
-    /**
-     * This indicates that the device kernel image is invalid. This can also
-     * indicate an invalid CUDA module.
-     */
-    { "CUDA_ERROR_INVALID_IMAGE", 200 },
-
-    /**
-     * This most frequently indicates that there is no context bound to the
-     * current thread. This can also be returned if the context passed to an
-     * API call is not a valid handle (such as a context that has had
-     * ::cuCtxDestroy() invoked on it). This can also be returned if a user
-     * mixes different API versions (e.g. 3010 context with 3020 API calls).
-     * See ::cuCtxGetApiVersion() for more details.
-     */
-    { "CUDA_ERROR_INVALID_CONTEXT", 201 },
-
-    /**
-     * This indicates that the context being supplied as a parameter to the
-     * API call was already the active context.
-     * \deprecated
-     * This error return is deprecated as of CUDA 3.2. It is no longer an
-     * error to attempt to push the active context via ::cuCtxPushCurrent().
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
-
-    /**
-     * This indicates that a map or register operation has failed.
-     */
-    { "CUDA_ERROR_MAP_FAILED", 205 },
-
-    /**
-     * This indicates that an unmap or unregister operation has failed.
-     */
-    { "CUDA_ERROR_UNMAP_FAILED", 206 },
-
-    /**
-     * This indicates that the specified array is currently mapped and thus
-     * cannot be destroyed.
-     */
-    { "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
-
-    /**
-     * This indicates that the resource is already mapped.
-     */
-    { "CUDA_ERROR_ALREADY_MAPPED", 208 },
-
-    /**
-     * This indicates that there is no kernel image available that is suitable
-     * for the device. This can occur when a user specifies code generation
-     * options for a particular CUDA source file that do not include the
-     * corresponding device configuration.
-     */
-    { "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
-
-    /**
-     * This indicates that a resource has already been acquired.
-     */
-    { "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
-
-    /**
-     * This indicates that a resource is not mapped.
-     */
-    { "CUDA_ERROR_NOT_MAPPED", 211 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as an
-     * array.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as a
-     * pointer.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
-
-    /**
-     * This indicates that an uncorrectable ECC error was detected during
-     * execution.
-     */
-    { "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
-
-    /**
-     * This indicates that the ::CUlimit passed to the API call is not
-     * supported by the active device.
-     */
-    { "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
-
-    /**
-     * This indicates that the ::CUcontext passed to the API call can
-     * only be bound to a single CPU thread at a time but is already 
-     * bound to a CPU thread.
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
-
-    /**
-     * This indicates that peer access is not supported across the given
-     * devices.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217 },
-
-    /**
-     * This indicates that the device kernel source is invalid.
-     */
-    { "CUDA_ERROR_INVALID_SOURCE", 300 },
-
-    /**
-     * This indicates that the file specified was not found.
-     */
-    { "CUDA_ERROR_FILE_NOT_FOUND", 301 },
-
-    /**
-     * This indicates that a link to a shared object failed to resolve.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
-
-    /**
-     * This indicates that initialization of a shared object failed.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
-
-    /**
-     * This indicates that an OS call failed.
-     */
-    { "CUDA_ERROR_OPERATING_SYSTEM", 304 },
-
-
-    /**
-     * This indicates that a resource handle passed to the API call was not
-     * valid. Resource handles are opaque types like ::CUstream and ::CUevent.
-     */
-    { "CUDA_ERROR_INVALID_HANDLE", 400 },
-
-
-    /**
-     * This indicates that a named symbol was not found. Examples of symbols
-     * are global/constant variable names, texture names, and surface names.
-     */
-    { "CUDA_ERROR_NOT_FOUND", 500 },
-
-
-    /**
-     * This indicates that asynchronous operations issued previously have not
-     * completed yet. This result is not actually an error, but must be indicated
-     * differently than ::CUDA_SUCCESS (which indicates completion). Calls that
-     * may return this value include ::cuEventQuery() and ::cuStreamQuery().
-     */
-    { "CUDA_ERROR_NOT_READY", 600 },
-
-
-    /**
-     * An exception occurred on the device while executing a kernel. Common
-     * causes include dereferencing an invalid device pointer and accessing
-     * out of bounds shared memory. The context cannot be used, so it must
-     * be destroyed (and a new one should be created). All existing device
-     * memory allocations from this context are invalid and must be
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_FAILED", 700 },
-
-    /**
-     * This indicates that a launch did not occur because it did not have
-     * appropriate resources. This error usually indicates that the user has
-     * attempted to pass too many arguments to the device kernel, or the
-     * kernel launch specifies too many threads for the kernel's register
-     * count. Passing arguments of the wrong size (i.e. a 64-bit pointer
-     * when a 32-bit int is expected) is equivalent to passing too many
-     * arguments and can also result in this error.
-     */
-    { "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
-
-    /**
-     * This indicates that the device kernel took too long to execute. This can
-     * only occur if timeouts are enabled - see the device attribute
-     * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
-     * context cannot be used (and must be destroyed similar to
-     * ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
-     * this context are invalid and must be reconstructed if the program is to
-     * continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
-
-    /**
-     * This error indicates a kernel launch that uses an incompatible texturing
-     * mode.
-     */
-    { "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
-    
-    /**
-     * This error indicates that a call to ::cuCtxEnablePeerAccess() is
-     * trying to re-enable peer access to a context which has already
-     * had peer access to it enabled.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
-
-    /**
-     * This error indicates that ::cuCtxDisablePeerAccess() is 
-     * trying to disable peer access which has not been enabled yet 
-     * via ::cuCtxEnablePeerAccess(). 
-     */
-    { "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
-
-    /**
-     * This error indicates that the primary context for the specified device
-     * has already been initialized.
-     */
-    { "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
-
-    /**
-     * This error indicates that the context current to the calling thread
-     * has been destroyed using ::cuCtxDestroy, or is a primary context which
-     * has not yet been initialized.
-     */
-    { "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
-
-    /**
-     * A device-side assert triggered during kernel execution. The context
-     * cannot be used anymore, and must be destroyed. All existing device 
-     * memory allocations from this context are invalid and must be 
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_ASSERT", 710 },
-
-    /**
-     * This error indicates that the hardware resources required to enable
-     * peer access have been exhausted for one or more of the devices 
-     * passed to ::cuCtxEnablePeerAccess().
-     */
-    { "CUDA_ERROR_TOO_MANY_PEERS", 711 },
-
-    /**
-     * This error indicates that the memory range passed to ::cuMemHostRegister()
-     * has already been registered.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
-
-    /**
-     * This error indicates that the pointer passed to ::cuMemHostUnregister()
-     * does not correspond to any currently registered memory region.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
-
-    /**
-     * This error indicates that the attempted operation is not permitted.
-     */
-    { "CUDA_ERROR_NOT_PERMITTED", 800 },
-
-    /**
-     * This error indicates that the attempted operation is not supported
-     * on the current system or device.
-     */
-    { "CUDA_ERROR_NOT_SUPPORTED", 801 },
-
-    /**
-     * This indicates that an unknown internal error has occurred.
-     */
-    { "CUDA_ERROR_UNKNOWN", 999 },
-    { NULL, -1 }
-};
-
-// This is just a linear search through the array, since the error IDs are
-// not consecutive; the { NULL, -1 } entry marks the end of the table.
-static inline const char * getCudaDrvErrorString(CUresult error_id)
-{
-    int index = 0;
-    while (sCudaDrvErrorString[index].error_id != error_id && 
-           sCudaDrvErrorString[index].error_id != -1)
-    {
-        index++;
-    }
-    if (sCudaDrvErrorString[index].error_id == error_id)
-        return (const char *)sCudaDrvErrorString[index].error_string;
-    else
-        return (const char *)"CUDA_ERROR not found!";
-}
-
-#endif
--- a/examples_cuda/aobench_instrumented/.gitignore
+++ b/examples_cuda/aobench_instrumented/.gitignore
@@ -1,2 +0,0 @@
-ao
-*.ppm
--- a/examples_cuda/aobench_instrumented/Makefile
+++ b/examples_cuda/aobench_instrumented/Makefile
@@ -1,26 +0,0 @@
-
-CXX=clang++ -m64
-CXXFLAGS=-Iobjs/ -g3 -Wall
-ISPC=ispc
-ISPCFLAGS=-O2 --instrument --arch=x86-64 --target=sse2
-
-default: ao
-
-.PHONY: dirs clean
-
-dirs:
-	/bin/mkdir -p objs/
-
-clean:
-	/bin/rm -rf objs *~ ao
-
-ao: objs/ao.o objs/instrument.o objs/ao_ispc.o ../tasksys.cpp
-	$(CXX) $(CXXFLAGS) -o $@ $^ -lm -lpthread
-
-objs/%.o: %.cpp dirs
-	$(CXX) $< $(CXXFLAGS) -c -o $@
-
-objs/ao.o: objs/ao_instrumented_ispc.h
-
-objs/%_instrumented_ispc.h objs/%_ispc.o: %.ispc dirs
-	$(ISPC) $(ISPCFLAGS) $< -o objs/$*_ispc.o -h objs/$*_instrumented_ispc.h
--- a/examples_cuda/aobench_instrumented/ao.cpp
+++ b/examples_cuda/aobench_instrumented/ao.cpp
@@ -1,131 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <assert.h>
-#ifdef __linux__
-#include <malloc.h>
-#endif
-#include <math.h>
-#include <map>
-#include <string>
-#include <algorithm>
-#include <sys/types.h>
-
-#include "ao_instrumented_ispc.h"
-using namespace ispc;
-
-#include "instrument.h"
-#include "../timing.h"
-
-#define NSUBSAMPLES        2
-
-static unsigned int test_iterations;
-static unsigned int width, height;
-static unsigned char *img;
-static float *fimg;
-
-
-static unsigned char
-clamp(float f)
-{
-    int i = (int)(f * 255.5);
-
-    if (i < 0) i = 0;
-    if (i > 255) i = 255;
-
-    return (unsigned char)i;
-}
-
-
-static void
-savePPM(const char *fname, int w, int h)
-{
-    for (int y = 0; y < h; y++) {
-        for (int x = 0; x < w; x++)  {
-            img[3 * (y * w + x) + 0] = clamp(fimg[3 *(y * w + x) + 0]);
-            img[3 * (y * w + x) + 1] = clamp(fimg[3 *(y * w + x) + 1]);
-            img[3 * (y * w + x) + 2] = clamp(fimg[3 *(y * w + x) + 2]);
-        }
-    }
-
-    FILE *fp = fopen(fname, "wb");
-    if (!fp) {
-        perror(fname);
-        exit(1);
-    }
-
-    fprintf(fp, "P6\n");
-    fprintf(fp, "%d %d\n", w, h);
-    fprintf(fp, "255\n");
-    fwrite(img, w * h * 3, 1, fp);
-    fclose(fp);
-    printf("Wrote image file %s\n", fname);
-}
-
-
-
-int main(int argc, char **argv)
-{
-    if (argc != 4) {
-        printf ("%s\n", argv[0]);
-        printf ("Usage: ao [num test iterations] [width] [height]\n");
-        getchar();
-        exit(-1);
-    }
-    else {
-        test_iterations = atoi(argv[1]);
-        width = atoi (argv[2]);
-        height = atoi (argv[3]);
-    }
-
-    // Allocate space for output images
-    img = new unsigned char[width * height * 3];
-    fimg = new float[width * height * 3];
-
-    ao_ispc(width, height, NSUBSAMPLES, fimg);
-
-    savePPM("ao-ispc.ppm", width, height); 
-
-    ISPCPrintInstrument();
-
-    return 0;
-}
--- a/examples_cuda/aobench_instrumented/ao.ispc
+++ b/examples_cuda/aobench_instrumented/ao.ispc
@@ -1,333 +0,0 @@
-// -*- mode: c++ -*-
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-/*
-  Based on Syoyo Fujita's aobench: http://code.google.com/p/aobench
-*/
-
-#define NAO_SAMPLES		8
-#define M_PI 3.1415926535f
-
-typedef float<3> vec;
-
-struct Isect {
-    float      t;
-    vec        p;
-    vec        n;
-    int        hit; 
-};
-
-struct Sphere {
-    vec        center;
-    float      radius;
-
-};
-
-struct Plane {
-    vec    p;
-    vec    n;
-};
-
-struct Ray {
-    vec org;
-    vec dir;
-};
-
-static inline float dot(vec a, vec b) {
-    return a.x * b.x + a.y * b.y + a.z * b.z;
-}
-
-static inline vec vcross(vec v0, vec v1) {
-    vec ret;
-    ret.x = v0.y * v1.z - v0.z * v1.y;
-    ret.y = v0.z * v1.x - v0.x * v1.z;
-    ret.z = v0.x * v1.y - v0.y * v1.x;
-    return ret;
-}
-
-static inline void vnormalize(vec &v) {
-    float len2 = dot(v, v);
-    float invlen = rsqrt(len2);
-    v *= invlen;
-}
-
-
-static inline void
-ray_plane_intersect(Isect &isect, Ray &ray, Plane &plane) {
-    float d = -dot(plane.p, plane.n);
-    float v = dot(ray.dir, plane.n);
-
-    cif (abs(v) < 1.0e-17) 
-        return;
-    else {
-        float t = -(dot(ray.org, plane.n) + d) / v;
-
-        cif ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + ray.dir * t;
-            isect.n = plane.n;
-        }
-    }
-}
-
-
-static inline void
-ray_sphere_intersect(Isect &isect, Ray &ray, Sphere &sphere) {
-    vec rs = ray.org - sphere.center;
-
-    float B = dot(rs, ray.dir);
-    float C = dot(rs, rs) - sphere.radius * sphere.radius;
-    float D = B * B - C;
-
-    cif (D > 0.) {
-        float t = -B - sqrt(D);
-
-        cif ((t > 0.0) && (t < isect.t)) {
-            isect.t = t;
-            isect.hit = 1;
-            isect.p = ray.org + t * ray.dir;
-            isect.n = isect.p - sphere.center;
-            vnormalize(isect.n);
-        }
-    }
-}
-
-
-static inline void
-orthoBasis(vec basis[3], vec n) {
-    basis[2] = n;
-    basis[1].x = 0.0; basis[1].y = 0.0; basis[1].z = 0.0;
-
-    if ((n.x < 0.6) && (n.x > -0.6)) {
-        basis[1].x = 1.0;
-    } else if ((n.y < 0.6) && (n.y > -0.6)) {
-        basis[1].y = 1.0;
-    } else if ((n.z < 0.6) && (n.z > -0.6)) {
-        basis[1].z = 1.0;
-    } else {
-        basis[1].x = 1.0;
-    }
-
-    basis[0] = vcross(basis[1], basis[2]);
-    vnormalize(basis[0]);
-
-    basis[1] = vcross(basis[2], basis[0]);
-    vnormalize(basis[1]);
-}
-
-
-static inline float
-ambient_occlusion(Isect &isect, Plane &plane, Sphere spheres[3], 
-                  RNGState &rngstate) {
-    float eps = 0.0001f;
-    vec p, n;
-    vec basis[3];
-    float occlusion = 0.0;
-
-    p = isect.p + eps * isect.n;
-
-    orthoBasis(basis, isect.n);
-
-    static const uniform int ntheta = NAO_SAMPLES;
-    static const uniform int nphi   = NAO_SAMPLES;
-    for (uniform int j = 0; j < ntheta; j++) {
-        for (uniform int i = 0; i < nphi; i++) {
-            Ray ray;
-            Isect occIsect;
-
-            float theta = sqrt(frandom(&rngstate));
-            float phi   = 2.0f * M_PI * frandom(&rngstate);
-            float x = cos(phi) * theta;
-            float y = sin(phi) * theta;
-            float z = sqrt(1.0 - theta * theta);
-
-            // local -> global
-            float rx = x * basis[0].x + y * basis[1].x + z * basis[2].x;
-            float ry = x * basis[0].y + y * basis[1].y + z * basis[2].y;
-            float rz = x * basis[0].z + y * basis[1].z + z * basis[2].z;
-
-            ray.org = p;
-            ray.dir.x = rx;
-            ray.dir.y = ry;
-            ray.dir.z = rz;
-
-            occIsect.t   = 1.0e+17;
-            occIsect.hit = 0;
-
-            for (uniform int snum = 0; snum < 3; ++snum)
-                ray_sphere_intersect(occIsect, ray, spheres[snum]); 
-            ray_plane_intersect (occIsect, ray, plane); 
-
-            if (occIsect.hit) occlusion += 1.0;
-        }
-    }
-
-    occlusion = (ntheta * nphi - occlusion) / (float)(ntheta * nphi);
-    return occlusion;
-}
-
-
-/* Compute the image for the scanlines from [y0,y1), for an overall image
-   of width w and height h.
- */
-static void ao_scanlines(uniform int y0, uniform int y1, uniform int w, 
-                         uniform int h,  uniform int nsubsamples, 
-                         uniform float image[]) {
-    static Plane plane = { { 0.0f, -0.5f, 0.0f }, { 0.f, 1.f, 0.f } };
-    static Sphere spheres[3] = {
-        { { -2.0f, 0.0f, -3.5f }, 0.5f },
-        { { -0.5f, 0.0f, -3.0f }, 0.5f },
-        { { 1.0f, 0.0f, -2.2f }, 0.5f } };
-    RNGState rngstate;
-
-    seed_rng(&rngstate, programIndex + (y0 << (programIndex & 15)));
-
-    // Compute the mapping between the 'programCount'-wide program
-    // instances running in parallel and samples in the image.  
-    //
-    // For now, we'll always take four samples per pixel, so start by
-    // initializing du and dv with offsets into subpixel samples.  We'll
-    // take care of further updating du and dv for the case where we're
-    // doing more than 4 program instances in parallel shortly.
-    uniform float uSteps[4] = { 0, 1, 0, 1 };
-    uniform float vSteps[4] = { 0, 0, 1, 1 };
-    float du = uSteps[programIndex % 4] / nsubsamples;
-    float dv = vSteps[programIndex % 4] / nsubsamples;
-
-    // Now handle the case where we are able to do more than one pixel's
-    // worth of work at once.  nx records the number of pixels in the x
-    // direction we do per iteration and ny the number in y.
-    uniform int nx = 1, ny = 1;
-
-    // FIXME: We actually need ny to be 1 regardless of the decomposition,
-    // since the task decomposition is one scanline high.
-
-    if (programCount == 8) {
-        // Do two pixels at once in the x direction
-        nx = 2;
-        if (programIndex >= 4) 
-            // And shift the offsets for the second pixel's worth of work
-            ++du;
-    }
-    else if (programCount == 16) {
-        nx = 4;
-        ny = 1;
-        if (programIndex >= 4 && programIndex < 8)
-            ++du;
-        if (programIndex >= 8 && programIndex < 12)
-            du += 2;
-        if (programIndex >= 12)
-            du += 3;
-    }
-
-    // Now loop over all of the pixels, stepping in x and y as calculated
-    // above.  (Assumes that ny divides y and nx divides x...)
-    for (uniform int y = y0; y < y1; y += ny) {
-        for (uniform int x = 0; x < w; x += nx)  {
-            // Figure out x,y pixel in NDC
-            float px =  (x + du - (w / 2.0f)) / (w / 2.0f);
-            float py = -(y + dv - (h / 2.0f)) / (h / 2.0f);
-            float ret = 0.f;
-            Ray ray;
-            Isect isect;
-
-            ray.org = 0.f;
-
-            // Poor man's perspective projection
-            ray.dir.x = px;
-            ray.dir.y = py;
-            ray.dir.z = -1.0;
-            vnormalize(ray.dir);
-
-            isect.t   = 1.0e+17;
-            isect.hit = 0;
-
-            for (uniform int snum = 0; snum < 3; ++snum)
-                ray_sphere_intersect(isect, ray, spheres[snum]);
-            ray_plane_intersect(isect, ray, plane);
-
-            // Note use of 'coherent' if statement; the set of rays we
-            // trace will often all hit or all miss the scene
-            cif (isect.hit)
-                ret = ambient_occlusion(isect, plane, spheres, rngstate);
-
-            // This is a little grungy; we have results for
-            // programCount-worth of values.  Because we're doing 2x2
-            // subsamples, we need to peel them off in groups of four,
-            // average the four values for each pixel, and update the
-            // output image.
-            //
-            // Store the varying value to a uniform array of the same size.
-            // See the section on communication among program instances
-            // in the ispc user's manual for more detail on
-            // this idiom.
-            uniform float retArray[programCount];
-            retArray[programIndex] = ret;
-
-            // offset to the first pixel in the image
-            uniform int offset = 3 * (y * w + x);
-            for (uniform int p = 0; p < programCount; p += 4, offset += 3) {
-                // Get the four sample values for this pixel
-                uniform float sumret = retArray[p] + retArray[p+1] + retArray[p+2] +
-                    retArray[p+3];
-
-                // Normalize by number of samples taken
-                sumret /= nsubsamples * nsubsamples; 
-                
-                // Store result in the image
-                image[offset+0] = sumret;
-                image[offset+1] = sumret;
-                image[offset+2] = sumret;
-            }
-        }
-    }
-}
-
-
-export void ao_ispc(uniform int w, uniform int h, uniform int nsubsamples, 
-                    uniform float image[]) {
-    ao_scanlines(0, h, w, h, nsubsamples, image);
-}
-
-
-static void task ao_task(uniform int width, uniform int height, 
-                         uniform int nsubsamples, uniform float image[]) {
-    ao_scanlines(taskIndex, taskIndex+1, width, height, nsubsamples, image);
-}
-
-
-export void ao_ispc_tasks(uniform int w, uniform int h, uniform int nsubsamples, 
-                          uniform float image[]) {
-    launch[h] ao_task(w, h, nsubsamples, image);
-}
--- a/examples_cuda/aobench_instrumented/aobench_instrumented.vcxproj
+++ b/examples_cuda/aobench_instrumented/aobench_instrumented.vcxproj
@@ -1,174 +0,0 @@
-<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <ItemGroup Label="ProjectConfigurations">
-    <ProjectConfiguration Include="Debug|Win32">
-      <Configuration>Debug</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Debug|x64">
-      <Configuration>Debug</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|Win32">
-      <Configuration>Release</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|x64">
-      <Configuration>Release</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-  </ItemGroup>
-  <ItemGroup>
-    <ClCompile Include="ao.cpp" />
-    <ClCompile Include="instrument.cpp" />
-    <ClCompile Include="../tasksys.cpp" />
-  </ItemGroup>
-  <ItemGroup>
-    <CustomBuild Include="ao.ispc">
-      <FileType>Document</FileType>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --arch=x86 --instrument --target=sse2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --instrument --target=sse2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --arch=x86 --instrument --target=sse2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename)_instrumented.obj -h $(TargetDir)%(Filename)_instrumented_ispc.h --instrument --target=sse2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename)_instrumented.obj;$(TargetDir)%(Filename)_instrumented_ispc.h</Outputs>
-    </CustomBuild>
-  </ItemGroup>
-  <PropertyGroup Label="Globals">
-    <ProjectGuid>{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}</ProjectGuid>
-    <Keyword>Win32Proj</Keyword>
-    <RootNamespace>aobench_instrumented</RootNamespace>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
-  <ImportGroup Label="ExtensionSettings">
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <PropertyGroup Label="UserMacros" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <PreBuildEventUseInBuild>true</PreBuildEventUseInBuild>
-  </PropertyGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
-  <ImportGroup Label="ExtensionTargets">
-  </ImportGroup>
-</Project>
--- a/examples_cuda/aobench_instrumented/instrument.cpp
+++ b/examples_cuda/aobench_instrumented/instrument.cpp
@@ -1,94 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#include "instrument.h"
-#include <stdio.h>
-#include <assert.h>
-#include <string>
-#include <map>
-
-struct CallInfo {
-    CallInfo() { count = laneCount = allOff = 0; }
-    int count;
-    int laneCount;
-    int allOff;
-};
-
-static std::map<std::string, CallInfo> callInfo;
-
-int countbits(int i) {
-    int ret = 0;
-    while (i) {
-        if (i & 0x1)
-            ++ret;
-        i >>= 1;
-    }
-    return ret;
-}
-
-
-// Callback function that ispc compiler emits calls to when --instrument
-// command-line flag is given while compiling.
-void
-ISPCInstrument(const char *fn, const char *note, int line, uint64_t mask) {
-    char sline[16];
-    sprintf(sline, "%04d", line);
-    std::string s = std::string(fn) + std::string("(") + std::string(sline) +
-        std::string(") - ") + std::string(note);
-
-    // Find or create a CallInfo instance for this callsite.
-    CallInfo &ci = callInfo[s];
-
-    // And update its statistics... 
-    ++ci.count;
-    if (mask == 0)
-        ++ci.allOff;
-    ci.laneCount += countbits(mask);
-}
-
-
-void
-ISPCPrintInstrument() {
-    // When program execution is done, go through the stats and print them
-    // out.  (This function is called by ao.cpp).
-    std::map<std::string, CallInfo>::iterator citer = callInfo.begin();
-    while (citer != callInfo.end()) {
-        CallInfo &ci = citer->second;
-        float activePct = 100.f * ci.laneCount / (4.f * ci.count);
-        float allOffPct = 100.f * ci.allOff / ci.count;
-        printf("%s: %d calls (%d / %.2f%% all off!), %.2f%% active lanes\n",
-               citer->first.c_str(), ci.count, ci.allOff, allOffPct,
-               activePct);
-        ++citer;
-    }
-}
--- a/examples_cuda/aobench_instrumented/instrument.h
+++ b/examples_cuda/aobench_instrumented/instrument.h
@@ -1,45 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-*/
-
-#ifndef INSTRUMENT_H
-#define INSTRUMENT_H 1
-
-#include <stdint.h>
-
-extern "C" {
-    void ISPCInstrument(const char *fn, const char *note, int line, uint64_t mask);
-}
-
-void ISPCPrintInstrument();
-
-#endif // INSTRUMENT_H
--- a/examples_cuda/common.mk
+++ b/examples_cuda/common.mk
@@ -1,98 +0,0 @@
-
-TASK_CXX=../tasksys.cpp
-TASK_LIB=-lpthread
-TASK_OBJ=objs/tasksys.o
-
-CXX=icc -openmp
-CXXFLAGS+=-Iobjs/ -O2
-CC=icc -openmp
-CCFLAGS+=-Iobjs/ -O2
-
-LIBS=-lm $(TASK_LIB) -lstdc++
-ISPC=ispc
-ISPC_FLAGS+=-O2 
-ISPC_FLAGS+=--opt=fast-math --math-lib=default
-ISPC_HEADER=objs/$(ISPC_SRC:.ispc=_ispc.h)
-
-ARCH:=$(shell uname -m | sed -e s/x86_64/x86/ -e s/i686/x86/ -e s/arm.*/arm/ -e s/sa110/arm/)
-
-ifeq ($(ARCH),x86)
-#  ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=)_ispc.o $(ISPC_SRC:.ispc=)_ispc_sse2.o \
-	$(ISPC_SRC:.ispc=)_ispc_sse4.o $(ISPC_SRC:.ispc=)_ispc_avx.o)
-  ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=)_ispc.o )
-  ISPC_TARGETS=$(ISPC_IA_TARGETS)
-  ARCH_BIT:=$(shell getconf LONG_BIT)
-  ifeq ($(ARCH_BIT),32)
-    ISPC_FLAGS += --arch=x86
-    CXXFLAGS += -m32
-    CCFLAGS += -m32
-  else
-    ISPC_FLAGS += --arch=x86-64
-    CXXFLAGS += -m64
-    CCFLAGS += -m64
-  endif
-else ifeq ($(ARCH),arm)
-  ISPC_OBJS=$(addprefix objs/, $(ISPC_SRC:.ispc=_ispc.o))
-  ISPC_TARGETS=$(ISPC_ARM_TARGETS)
-else
-  $(error Unknown architecture $(ARCH) from uname -m)
-endif
-
-CPP_OBJS=$(addprefix objs/, $(CPP_SRC:.cpp=.o))
-CC_OBJS=$(addprefix objs/, $(CC_SRC:.c=.o))
-OBJS=$(CPP_OBJS) $(CC_OBJS) $(TASK_OBJ) $(ISPC_OBJS)
-
-default: $(EXAMPLE)
-
-all: $(EXAMPLE) $(EXAMPLE)-sse4 $(EXAMPLE)-generic16 $(EXAMPLE)-scalar
-
-.PHONY: dirs clean
-
-dirs:
-	/bin/mkdir -p objs/
-
-objs/%.cpp objs/%.o objs/%.h: dirs
-
-clean:
-	/bin/rm -rf objs *~ $(EXAMPLE) $(EXAMPLE)-sse4 $(EXAMPLE)-generic16 ref test
-
-$(EXAMPLE): $(OBJS)
-	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
-
-objs/%.o: %.cpp dirs $(ISPC_HEADER)
-	$(CXX) $< $(CXXFLAGS) -c -o $@
-
-objs/%.o: %.c dirs $(ISPC_HEADER)
-	$(CC) $< $(CCFLAGS) -c -o $@
-
-objs/%.o: ../%.cpp dirs
-	$(CXX) $< $(CXXFLAGS) -c -o $@
-
-objs/$(EXAMPLE).o: objs/$(EXAMPLE)_ispc.h
-
-objs/%_ispc.h objs/%_ispc.o objs/%_ispc_sse2.o objs/%_ispc_sse4.o objs/%_ispc_avx.o: %.ispc
-	$(ISPC) $(ISPC_FLAGS) --target=$(ISPC_TARGETS) $< -o objs/$*_ispc.o -h objs/$*_ispc.h
-
-objs/$(ISPC_SRC:.ispc=)_sse4.cpp: $(ISPC_SRC)
-	$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-4 --emit-c++ --c++-include-file=sse4.h
-
-objs/$(ISPC_SRC:.ispc=)_sse4.o: objs/$(ISPC_SRC:.ispc=)_sse4.cpp
-	$(CXX) -I../intrinsics -msse4.2 $< $(CXXFLAGS) -c -o $@
-
-$(EXAMPLE)-sse4: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_sse4.o
-	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
-
-objs/$(ISPC_SRC:.ispc=)_generic16.cpp: $(ISPC_SRC)
-	$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-16 --emit-c++ --c++-include-file=generic-16.h
-
-objs/$(ISPC_SRC:.ispc=)_generic16.o: objs/$(ISPC_SRC:.ispc=)_generic16.cpp
-	$(CXX) -I../intrinsics $< $(CXXFLAGS) -c -o $@
-
-$(EXAMPLE)-generic16: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_generic16.o
-	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
-
-objs/$(ISPC_SRC:.ispc=)_scalar.o: $(ISPC_SRC)
-	$(ISPC) $(ISPC_FLAGS) $< -o $@ --target=generic-1
-
-$(EXAMPLE)-scalar: $(CPP_OBJS) objs/$(ISPC_SRC:.ispc=)_scalar.o
-	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS)
--- a/examples_cuda/cuda_ispc.h
+++ b/examples_cuda/cuda_ispc.h
@@ -1,280 +0,0 @@
-#pragma once
-
-/******************************/
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-/******************************/
-
-#include <cassert>
-#include <iostream>
-#include <cuda.h>
-#include "drvapi_error_string.h"
-
-#define checkCudaErrors(err)  __checkCudaErrors (err, __FILE__, __LINE__)
-// These are the inline versions for all of the SDK helper functions
-void __checkCudaErrors(CUresult err, const char *file, const int line) {
-  if(CUDA_SUCCESS != err) {
-    std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
-           << getCudaDrvErrorString(err) << "\" from file <" << file
-           << ">, line " << line << "\n";
-    exit(-1);
-  }
-}
-
-
-/******************************/
-/****  Basic CUDriver API  ****/
-/******************************/
-
-CUcontext context;
-
-static void createContext(
-    const int deviceId = 0, 
-    const size_t stackLimit = 4*1024,
-    const size_t heapLimit = 1024*1024*1024
-    )
-{
-  CUdevice device;
-  int devCount;
-  checkCudaErrors(cuInit(0));
-  checkCudaErrors(cuDeviceGetCount(&devCount));
-  assert(devCount > 0);
-  checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
-
-  char name[128];
-  checkCudaErrors(cuDeviceGetName(name, 128, device));
-  std::cout << "Using CUDA Device [0]: " << name << "\n";
-
-  int devMajor, devMinor;
-  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
-  std::cout << "Device Compute Capability: " 
-    << devMajor << "." << devMinor << "\n";
-  if (devMajor < 2) {
-    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
-    exit(1); 
-  }
-
-  // Create driver context
-  checkCudaErrors(cuCtxCreate(&context, 0, device));
-#if 0
-  size_t limit;
-  checkCudaErrors(cuCtxGetLimit(&limit, CU_LIMIT_STACK_SIZE));
-  fprintf(stderr, " stack_limit= %llu KB\n", limit/1024);
-  checkCudaErrors(cuCtxGetLimit(&limit, CU_LIMIT_MALLOC_HEAP_SIZE));
-  fprintf(stderr, " heap_limit= %llu KB\n", limit/1024);
-  checkCudaErrors(cuCtxSetLimit(CU_LIMIT_STACK_SIZE,stackLimit));
-  checkCudaErrors(cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE,heapLimit));
-#endif
-}
-static void destroyContext()
-{
-  checkCudaErrors(cuCtxDestroy(context));
-}
-
-static CUmodule loadModule(
-    const char * module,
-    const int maxrregcount = 64,
-    const char cudadevrt_lib[] = "libcudadevrt.a",
-    const size_t log_size = 32768,
-    const bool print_log = true
-    )
-{
-  const double t0 = rtc();
-  CUmodule cudaModule;
-  // in this branch we use compilation with parameters
-
-  CUlinkState  CUState;
-  CUlinkState *lState = &CUState;
-  const int nOptions = 8;
-  CUjit_option options[nOptions];
-  void* optionVals[nOptions];
-  float walltime;
-  size_t logSize = log_size;
-  char error_log[logSize],
-       info_log[logSize];
-  void *cuOut;
-  size_t outSize;
-  int myErr = 0;
-
-  // Setup linker options
-  // Return walltime from JIT compilation
-  options[0] = CU_JIT_WALL_TIME;
-  optionVals[0] = (void*) &walltime;
-  // Pass a buffer for info messages
-  options[1] = CU_JIT_INFO_LOG_BUFFER;
-  optionVals[1] = (void*) info_log;
-  // Pass the size of the info buffer
-  options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
-  optionVals[2] = (void*) logSize;
-  // Pass a buffer for error message
-  options[3] = CU_JIT_ERROR_LOG_BUFFER;
-  optionVals[3] = (void*) error_log;
-  // Pass the size of the error buffer
-  options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
-  optionVals[4] = (void*) logSize;
-  // Make the linker verbose
-  options[5] = CU_JIT_LOG_VERBOSE;
-  optionVals[5] = (void*) 1;
-  // Max # of registers per thread
-  options[6] = CU_JIT_MAX_REGISTERS;
-  int jitRegCount = maxrregcount;
-  optionVals[6] = (void *)(size_t)jitRegCount;
-  // Caching
-  options[7] = CU_JIT_CACHE_MODE;
-  optionVals[7] = (void *)CU_JIT_CACHE_OPTION_CA;
-  // Create a pending linker invocation
-
-  // Create a pending linker invocation
-  checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
-
-#if 0
-  if (sizeof(void *)==4)
-  {
-    // Load the PTX from the string myPtx32
-    printf("Loading myPtx32[] program\n");
-    // PTX May also be loaded from file, as per below.
-    myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
-  }
-  else
-#endif
-  {
-    // Load the PTX from the string myPtx (64-bit)
-    if (print_log)
-      fprintf(stderr, "Loading ptx..\n");
-    myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
-    myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, cudadevrt_lib, 0,0,0); 
-    // PTX May also be loaded from file, as per below.
-    // myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
-  }
-
-  // Complete the linker step
-  myErr = cuLinkComplete(*lState, &cuOut, &outSize);
-
-  if ( myErr != CUDA_SUCCESS )
-  {
-    // Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above. 
-    fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
-    assert(0);
-  }    
-
-  // Linker walltime and info_log were requested in options above.
- if (print_log)
-   fprintf(stderr, "CUDA Link Completed in %fms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
-
- // Load resulting cuBin into module
- checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
-
- // Destroy the linker invocation
- checkCudaErrors(cuLinkDestroy(*lState));
- if (print_log)
-   fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
- return cudaModule;
-}
-static void unloadModule(CUmodule &cudaModule)
-{
-  checkCudaErrors(cuModuleUnload(cudaModule));
-}
-
-static CUfunction getFunction(CUmodule &cudaModule, const char * function)
-{
-  CUfunction cudaFunction;
-  checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
-  return cudaFunction;
-}
-
-static CUdeviceptr deviceMalloc(const size_t size)
-{
-  CUdeviceptr d_buf;
-  checkCudaErrors(cuMemAlloc(&d_buf, size));
-  return d_buf;
-}
-static void deviceFree(CUdeviceptr d_buf)
-{
-  checkCudaErrors(cuMemFree(d_buf));
-}
-static void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
-}
-static void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
-}
-#define deviceLaunch(func,params) \
-  checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_SHARED)); \
-checkCudaErrors( \
-    cuLaunchKernel( \
-      (func), \
-      1,1,1, \
-      32, 1, 1, \
-      0, NULL, (params), NULL \
-      ));
-
-typedef CUdeviceptr devicePtr;
-
-
-/**************/
-#include <vector>
-static std::vector<char> readBinary(const char * filename, const bool print_size = false)
-{
-  std::vector<char> buffer;
-  FILE *fp = fopen(filename, "rb");
-  if (!fp )
-  {
-    fprintf(stderr, "file %s not found\n", filename);
-    assert(0);
-  }
-  fseek(fp, 0, SEEK_END); 
-  const unsigned long long size = ftell(fp);         /* file size in bytes */
-  fseek(fp, 0, SEEK_SET); 
-  /* One extra byte keeps the buffer NUL-terminated; loadModule() calls strlen() on it. */
-  buffer.resize(size + 1, '\0');
-
-  /* Read the whole file; a short read is an error. */
-  const size_t nread = fread(&buffer[0], sizeof(char), size, fp);
-  fclose(fp);
-  if (nread != size) {
-    fprintf(stderr, "Error: There was an error reading the file %s \n", filename);
-    exit(1);
-  }
-  if (print_size)
-    fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
-  return buffer;
-}
-
-static double CUDALaunch(
-    void **handlePtr, 
-    const char * func_name,
-    void **func_args,
-    const bool print_log = true,
-    const int maxrregcount = 64,
-    const char kernel_file[] = "__kernels.ptx",
-    const char cudadevrt_lib[] = "libcudadevrt.a",
-    const int log_size = 32768)
-{
-  const std::vector<char> module_str = readBinary(kernel_file, print_log);
-  const char *  module = &module_str[0];
-  CUmodule   cudaModule   = loadModule(module, maxrregcount, cudadevrt_lib, log_size, print_log);
-  CUfunction cudaFunction = getFunction(cudaModule, func_name);
-  checkCudaErrors(cuStreamSynchronize(0));
-  const double t0 = rtc();
-  deviceLaunch(cudaFunction, func_args);
-  checkCudaErrors(cuStreamSynchronize(0));
-  const double dt = rtc() - t0;
-  unloadModule(cudaModule);
-  return dt;
-}
-/******************************/
-
--- a/examples_cuda/deferred/Makefile
+++ b/examples_cuda/deferred/Makefile
@@ -1,8 +0,0 @@
-
-EXAMPLE=deferred_shading
-CPP_SRC=common.cpp main.cpp
-ISPC_SRC=kernels1.ispc
-ISPC_IA_TARGETS=avx
-ISPC_FLAGS=--opt=fast-math
-
-include ../common.mk
--- a/examples_cuda/deferred/Makefile_gpu
+++ b/examples_cuda/deferred/Makefile_gpu
@@ -1,55 +0,0 @@
-PROG=main_cu
-ISPC_SRC=kernels1.ispc
-CXX_SRC=main_cu.cpp common.cpp
-
-CXX=g++
-CXXFLAGS=-O3 -I$(CUDATK)/include
-LD=g++
-LDFLAGS=-lcuda
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
-
-LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
-LLVM   = $(HOME)/usr/local/llvm/bin-3.3
-PTXGEN = $(HOME)/ptxgen
-PTXGEN += -opt=3
-PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
-
-LLVM32DIS=$(LLVM32)/bin/llvm-dis
-
-.SUFFIXES: .bc .o .ptx .cu _ispc_nvptx64.bc
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
-PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(PROG)
-
-
-$(PROG): $(CXX_OBJ) kernel.ptx
-	/bin/cp kernel.ptx __kernels.ptx
-	$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-
-%_ispc_nvptx64.bc: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
-
-%.ptx: %.bc
-	$(PTXGEN)  $< > $@
-# $(LLVM32DIS) $<
-# $(PTXGEN)  `basename $< .bc`.ll > $@
-
-kernel.ptx: $(PTXSRC)
-	cat $^ > kernel.ptx
-
-clean: 
-	/bin/rm -rf *.ptx *.bc *.ll $(PROG)
-
-	 
-
--- a/examples_cuda/deferred/Makefile_knc
+++ b/examples_cuda/deferred/Makefile_knc
@@ -1,37 +0,0 @@
-PROG=main_mic
-ISPC_SRC=kernels1.ispc
-CXX_SRC=main.cpp  ../tasksys.cpp common.cpp
-
-CXX=icc
-CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
-LD=icc
-LDFLAGS=-mmic -openmp
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
-
-.SUFFIXES: .o .cpp
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(PROG)
-
-
-
-$(PROG): $(ISPC_OBJ) $(CXX_OBJ) 
-	$(LD) -o $@ $^ $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-%_ispc.o: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $< 
-	$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp  -c
-
-clean: 
-	/bin/rm -rf *_ispc_zmm.cpp *.o  $(PROG)
-
-	 
-
--- a/examples_cuda/deferred/common.cpp
+++ b/examples_cuda/deferred/common.cpp
@@ -1,211 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define ISPC_IS_WINDOWS
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#include <fcntl.h>
-#include <float.h>
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <sys/types.h>
-#include <stdint.h>
-#include <algorithm>
-#include <assert.h>
-#include <vector>
-#ifdef ISPC_IS_WINDOWS
-  #define WIN32_LEAN_AND_MEAN
-  #include <windows.h>
-#endif
-#ifdef ISPC_IS_LINUX
-  #include <malloc.h>
-#endif
-#include "deferred.h"
-#include "../timing.h"
-
-///////////////////////////////////////////////////////////////////////////
-
-static void *
-lAlignedMalloc(size_t size, int32_t alignment) {
-#ifdef ISPC_IS_WINDOWS
-    return _aligned_malloc(size, alignment);
-#endif
-#ifdef ISPC_IS_LINUX
-    return memalign(alignment, size);
-#endif
-#ifdef ISPC_IS_APPLE
-    void *mem = malloc(size + (alignment-1) + sizeof(void*));
-    char *amem = ((char*)mem) + sizeof(void*);
-    amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
-                                        (alignment - 1)));
-    ((void**)amem)[-1] = mem;
-    return amem;
-#endif
-}
-
-
-static void
-lAlignedFree(void *ptr) {
-#ifdef ISPC_IS_WINDOWS
-    _aligned_free(ptr);
-#endif
-#ifdef ISPC_IS_LINUX
-    free(ptr);
-#endif
-#ifdef ISPC_IS_APPLE
-    free(((void**)ptr)[-1]);
-#endif
-}
-
-
-Framebuffer::Framebuffer(int width, int height) {
-    nPixels = width*height;
-    r = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
-    g = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
-    b = (uint8_t *)lAlignedMalloc(nPixels, ALIGNMENT_BYTES);
-}
-
-
-Framebuffer::~Framebuffer() {
-    lAlignedFree(r);
-    lAlignedFree(g);
-    lAlignedFree(b);
-}
-
-
-void
-Framebuffer::clear() {
-    memset(r, 0, nPixels);
-    memset(g, 0, nPixels);
-    memset(b, 0, nPixels);
-}
-
-
-InputData *
-CreateInputDataFromFile(const char *path) {
-    FILE *in = fopen(path, "rb");
-    if (!in) return 0;
-
-    InputData *input = new InputData;
-
-    // Load header
-    if (fread(&input->header, sizeof(ispc::InputHeader), 1, in) != 1) {
-        fprintf(stderr, "Premature EOF reading file \"%s\"\n", path);
-        return NULL;
-    }
-    fprintf(stderr, " numLights= %d\n", input->header.numLights);
-
-    // Load data chunk and update pointers
-    input->chunk = (uint8_t *)lAlignedMalloc(input->header.inputDataChunkSize, 
-                                             ALIGNMENT_BYTES);
-    if (fread(input->chunk, input->header.inputDataChunkSize, 1, in) != 1) {
-        fprintf(stderr, "Premature EOF reading file \"%s\"\n", path);
-        return NULL;
-    }
-    
-    input->arrays.zBuffer =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaZBuffer]];
-    input->arrays.normalEncoded_x =
-        (uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaNormalEncoded_x]];
-    input->arrays.normalEncoded_y =
-        (uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaNormalEncoded_y]];
-    input->arrays.specularAmount =
-        (uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaSpecularAmount]];
-    input->arrays.specularPower =
-        (uint16_t *)&input->chunk[input->header.inputDataArrayOffsets[idaSpecularPower]];
-    input->arrays.albedo_x =
-        (uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_x]];
-    input->arrays.albedo_y =
-        (uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_y]];
-    input->arrays.albedo_z =
-        (uint8_t *)&input->chunk[input->header.inputDataArrayOffsets[idaAlbedo_z]];
-    input->arrays.lightPositionView_x =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_x]];
-    input->arrays.lightPositionView_y =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_y]];
-    input->arrays.lightPositionView_z =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightPositionView_z]];
-    input->arrays.lightAttenuationBegin =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightAttenuationBegin]];
-    input->arrays.lightColor_x =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_x]];
-    input->arrays.lightColor_y =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_y]];
-    input->arrays.lightColor_z =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightColor_z]];
-    input->arrays.lightAttenuationEnd =
-        (float *)&input->chunk[input->header.inputDataArrayOffsets[idaLightAttenuationEnd]];
-
-    fclose(in);
-    return input;
-}
-
-
-void DeleteInputData(InputData *input) {
-    lAlignedFree(input->chunk);
-}
-
-
-void WriteFrame(const char *filename, const InputData *input,
-                const Framebuffer &framebuffer) {
-    // Deswizzle and copy to interleaved RGB output (3 bytes per pixel)
-    // Doesn't need to be fast... only happens once
-    size_t imageBytes = 3 * input->header.framebufferWidth * 
-        input->header.framebufferHeight;
-    uint8_t* framebufferAOS = (uint8_t *)lAlignedMalloc(imageBytes, ALIGNMENT_BYTES);
-    memset(framebufferAOS, 0, imageBytes);
-
-    for (int i = 0; i < input->header.framebufferWidth * 
-                        input->header.framebufferHeight; ++i) {
-        framebufferAOS[3 * i + 0] = framebuffer.r[i];
-        framebufferAOS[3 * i + 1] = framebuffer.g[i];
-        framebufferAOS[3 * i + 2] = framebuffer.b[i];
-    }
-    
-    // Write out simple PPM file
-    FILE *out = fopen(filename, "wb");
-    fprintf(out, "P6 %d %d 255\n", input->header.framebufferWidth, 
-            input->header.framebufferHeight);
-    fwrite(framebufferAOS, imageBytes, 1, out);
-    fclose(out);
-
-    lAlignedFree(framebufferAOS);
-}
--- a/examples_cuda/deferred/data/pp1280x720.bin
+++ b/examples_cuda/deferred/data/pp1280x720.bin
--- a/examples_cuda/deferred/data/pp1920x1200.bin
+++ b/examples_cuda/deferred/data/pp1920x1200.bin
--- a/examples_cuda/deferred/deferred.h
+++ b/examples_cuda/deferred/deferred.h
@@ -1,108 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifndef DEFERRED_H
-#define DEFERRED_H
-
-// Currently tile widths must be a multiple of SIMD width (i.e. 8 for ispc sse4x2)!
-#define MIN_TILE_WIDTH 64
-#define MIN_TILE_HEIGHT 16
-#define MAX_LIGHTS 1024
-
-enum InputDataArraysEnum {
-    idaZBuffer = 0,
-    idaNormalEncoded_x,
-    idaNormalEncoded_y,
-    idaSpecularAmount,
-    idaSpecularPower,
-    idaAlbedo_x,
-    idaAlbedo_y,
-    idaAlbedo_z,
-    idaLightPositionView_x,
-    idaLightPositionView_y,
-    idaLightPositionView_z,
-    idaLightAttenuationBegin,
-    idaLightColor_x,
-    idaLightColor_y,
-    idaLightColor_z,
-    idaLightAttenuationEnd,
-
-    idaNum
-};
-
-#ifndef ISPC
-
-#include <stdint.h>
-#include "kernels1_ispc.h"
-
-#define ALIGNMENT_BYTES 64
-
-#define MAX_LIGHTS 1024
-
-#define VISUALIZE_LIGHT_COUNT 0
-
-struct InputData
-{
-    ispc::InputHeader header;
-    ispc::InputDataArrays arrays;
-    uint8_t *chunk;
-};
-
-
-struct Framebuffer {
-    Framebuffer(int width, int height);
-    ~Framebuffer();
-
-    void clear();
-
-    uint8_t *r, *g, *b;
-
-private:
-    int nPixels;
-    Framebuffer(const Framebuffer &);
-    Framebuffer &operator=(const Framebuffer &);
-};
-
-
-InputData *CreateInputDataFromFile(const char *path);
-void DeleteInputData(InputData *input);
-void WriteFrame(const char *filename, const InputData *input,
-                const Framebuffer &framebuffer);
-void InitDynamicC(InputData *input);
-void InitDynamicCilk(InputData *input);
-void DispatchDynamicC(InputData *input, Framebuffer *framebuffer);
-void DispatchDynamicCilk(InputData *input, Framebuffer *framebuffer);
-
-#endif // !ISPC
-
-#endif // DEFERRED_H
--- a/examples_cuda/deferred/deferred_shading.vcxproj
+++ b/examples_cuda/deferred/deferred_shading.vcxproj
@@ -1,178 +0,0 @@
-<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <ItemGroup Label="ProjectConfigurations">
-    <ProjectConfiguration Include="Debug|Win32">
-      <Configuration>Debug</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Debug|x64">
-      <Configuration>Debug</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|Win32">
-      <Configuration>Release</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|x64">
-      <Configuration>Release</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-  </ItemGroup>
-  <PropertyGroup Label="Globals">
-    <ProjectGuid>{87f53c53-957e-4e91-878a-bc27828fb9eb}</ProjectGuid>
-    <Keyword>Win32Proj</Keyword>
-    <RootNamespace>mandelbrot</RootNamespace>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
-  <ImportGroup Label="ExtensionSettings">
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <PropertyGroup Label="UserMacros" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemGroup>
-    <ClCompile Include="common.cpp" />
-    <ClCompile Include="dynamic_c.cpp" />
-    <ClCompile Include="dynamic_cilk.cpp" />
-    <ClCompile Include="main.cpp" />
-    <ClCompile Include="../tasksys.cpp" />
-  </ItemGroup>
-  <ItemGroup>
-    <CustomBuild Include="kernels.ispc">
-      <FileType>Document</FileType>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-    </CustomBuild>
-  </ItemGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
-  <ImportGroup Label="ExtensionTargets">
-  </ImportGroup>
-</Project>
--- a/examples_cuda/deferred/drvapi_error_string.h
+++ b/examples_cuda/deferred/drvapi_error_string.h
@@ -1,370 +0,0 @@
-/*
- * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
- *
- * Please refer to the NVIDIA end user license agreement (EULA) associated
- * with this source code for terms and conditions that govern your use of
- * this software. Any use, reproduction, disclosure, or distribution of
- * this software and related documentation outside the terms of the EULA
- * is strictly prohibited.
- *
- */
- 
-#ifndef _DRVAPI_ERROR_STRING_H_
-#define _DRVAPI_ERROR_STRING_H_
-
-#include <stdio.h>
-#include <string.h>
-#include <stdlib.h>
-
-// Error Code string definitions here
-typedef struct
-{
-    char const *error_string;
-    int  error_id;
-} s_CudaErrorStr;
-
-/**
- * Error codes
- */
-static s_CudaErrorStr sCudaDrvErrorString[] =
-{
-    /**
-     * The API call returned with no errors. In the case of query calls, this
-     * can also mean that the operation being queried is complete (see
-     * ::cuEventQuery() and ::cuStreamQuery()).
-     */
-    { "CUDA_SUCCESS", 0 },
-
-    /**
-     * This indicates that one or more of the parameters passed to the API call
-     * is not within an acceptable range of values.
-     */
-    { "CUDA_ERROR_INVALID_VALUE", 1 },
-
-    /**
-     * The API call failed because it was unable to allocate enough memory to
-     * perform the requested operation.
-     */
-    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
-
-    /**
-     * This indicates that the CUDA driver has not been initialized with
-     * ::cuInit() or that initialization has failed.
-     */
-    { "CUDA_ERROR_NOT_INITIALIZED", 3 },
-
-    /**
-     * This indicates that the CUDA driver is in the process of shutting down.
-     */
-    { "CUDA_ERROR_DEINITIALIZED", 4 },
-
-    /**
-     * This indicates profiling APIs are called while application is running
-     * in visual profiler mode. 
-    */
-    { "CUDA_ERROR_PROFILER_DISABLED", 5 },
-    /**
-     * This indicates profiling has not been initialized for this context. 
-     * Call cuProfilerInitialize() to resolve this. 
-    */
-    { "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
-    /**
-     * This indicates profiler has already been started and probably
-     * cuProfilerStart() is incorrectly called.
-    */
-    { "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
-    /**
-     * This indicates profiler has already been stopped and probably
-     * cuProfilerStop() is incorrectly called.
-    */
-    { "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },  
-    /**
-     * This indicates that no CUDA-capable devices were detected by the installed
-     * CUDA driver.
-     */
-    { "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
-
-    /**
-     * This indicates that the device ordinal supplied by the user does not
-     * correspond to a valid CUDA device.
-     */
-    { "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
-
-
-    /**
-     * This indicates that the device kernel image is invalid. This can also
-     * indicate an invalid CUDA module.
-     */
-    { "CUDA_ERROR_INVALID_IMAGE", 200 },
-
-    /**
-     * This most frequently indicates that there is no context bound to the
-     * current thread. This can also be returned if the context passed to an
-     * API call is not a valid handle (such as a context that has had
-     * ::cuCtxDestroy() invoked on it). This can also be returned if a user
-     * mixes different API versions (i.e. 3010 context with 3020 API calls).
-     * See ::cuCtxGetApiVersion() for more details.
-     */
-    { "CUDA_ERROR_INVALID_CONTEXT", 201 },
-
-    /**
-     * This indicated that the context being supplied as a parameter to the
-     * API call was already the active context.
-     * \deprecated
-     * This error return is deprecated as of CUDA 3.2. It is no longer an
-     * error to attempt to push the active context via ::cuCtxPushCurrent().
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
-
-    /**
-     * This indicates that a map or register operation has failed.
-     */
-    { "CUDA_ERROR_MAP_FAILED", 205 },
-
-    /**
-     * This indicates that an unmap or unregister operation has failed.
-     */
-    { "CUDA_ERROR_UNMAP_FAILED", 206 },
-
-    /**
-     * This indicates that the specified array is currently mapped and thus
-     * cannot be destroyed.
-     */
-    { "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
-
-    /**
-     * This indicates that the resource is already mapped.
-     */
-    { "CUDA_ERROR_ALREADY_MAPPED", 208 },
-
-    /**
-     * This indicates that there is no kernel image available that is suitable
-     * for the device. This can occur when a user specifies code generation
-     * options for a particular CUDA source file that do not include the
-     * corresponding device configuration.
-     */
-    { "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
-
-    /**
-     * This indicates that a resource has already been acquired.
-     */
-    { "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
-
-    /**
-     * This indicates that a resource is not mapped.
-     */
-    { "CUDA_ERROR_NOT_MAPPED", 211 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as an
-     * array.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as a
-     * pointer.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
-
-    /**
-     * This indicates that an uncorrectable ECC error was detected during
-     * execution.
-     */
-    { "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
-
-    /**
-     * This indicates that the ::CUlimit passed to the API call is not
-     * supported by the active device.
-     */
-    { "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
-
-    /**
-     * This indicates that the ::CUcontext passed to the API call can
-     * only be bound to a single CPU thread at a time but is already 
-     * bound to a CPU thread.
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
-
-    /**
-     * This indicates that peer access is not supported across the given
-     * devices.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217 },
-
-    /**
-     * This indicates that the device kernel source is invalid.
-     */
-    { "CUDA_ERROR_INVALID_SOURCE", 300 },
-
-    /**
-     * This indicates that the file specified was not found.
-     */
-    { "CUDA_ERROR_FILE_NOT_FOUND", 301 },
-
-    /**
-     * This indicates that a link to a shared object failed to resolve.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
-
-    /**
-     * This indicates that initialization of a shared object failed.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
-
-    /**
-     * This indicates that an OS call failed.
-     */
-    { "CUDA_ERROR_OPERATING_SYSTEM", 304 },
-
-
-    /**
-     * This indicates that a resource handle passed to the API call was not
-     * valid. Resource handles are opaque types like ::CUstream and ::CUevent.
-     */
-    { "CUDA_ERROR_INVALID_HANDLE", 400 },
-
-
-    /**
-     * This indicates that a named symbol was not found. Examples of symbols
-     * are global/constant variable names, texture names, and surface names.
-     */
-    { "CUDA_ERROR_NOT_FOUND", 500 },
-
-
-    /**
-     * This indicates that asynchronous operations issued previously have not
-     * completed yet. This result is not actually an error, but must be indicated
-     * differently than ::CUDA_SUCCESS (which indicates completion). Calls that
-     * may return this value include ::cuEventQuery() and ::cuStreamQuery().
-     */
-    { "CUDA_ERROR_NOT_READY", 600 },
-
-
-    /**
-     * An exception occurred on the device while executing a kernel. Common
-     * causes include dereferencing an invalid device pointer and accessing
-     * out of bounds shared memory. The context cannot be used, so it must
-     * be destroyed (and a new one should be created). All existing device
-     * memory allocations from this context are invalid and must be
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_FAILED", 700 },
-
-    /**
-     * This indicates that a launch did not occur because it did not have
-     * appropriate resources. This error usually indicates that the user has
-     * attempted to pass too many arguments to the device kernel, or the
-     * kernel launch specifies too many threads for the kernel's register
-     * count. Passing arguments of the wrong size (i.e. a 64-bit pointer
-     * when a 32-bit int is expected) is equivalent to passing too many
-     * arguments and can also result in this error.
-     */
-    { "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
-
-    /**
-     * This indicates that the device kernel took too long to execute. This can
-     * only occur if timeouts are enabled - see the device attribute
-     * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
-     * context cannot be used (and must be destroyed similar to
-     * ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
-     * this context are invalid and must be reconstructed if the program is to
-     * continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
-
-    /**
-     * This error indicates a kernel launch that uses an incompatible texturing
-     * mode.
-     */
-    { "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
-    
-    /**
-     * This error indicates that a call to ::cuCtxEnablePeerAccess() is
-     * trying to re-enable peer access to a context which has already
-     * had peer access to it enabled.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
-
-    /**
-     * This error indicates that ::cuCtxDisablePeerAccess() is 
-     * trying to disable peer access which has not been enabled yet 
-     * via ::cuCtxEnablePeerAccess(). 
-     */
-    { "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
-
-    /**
-     * This error indicates that the primary context for the specified device
-     * has already been initialized.
-     */
-    { "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
-
-    /**
-     * This error indicates that the context current to the calling thread
-     * has been destroyed using ::cuCtxDestroy, or is a primary context which
-     * has not yet been initialized.
-     */
-    { "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
-
-    /**
-     * A device-side assert triggered during kernel execution. The context
-     * cannot be used anymore, and must be destroyed. All existing device 
-     * memory allocations from this context are invalid and must be 
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_ASSERT", 710 },
-
-    /**
-     * This error indicates that the hardware resources required to enable
-     * peer access have been exhausted for one or more of the devices 
-     * passed to ::cuCtxEnablePeerAccess().
-     */
-    { "CUDA_ERROR_TOO_MANY_PEERS", 711 },
-
-    /**
-     * This error indicates that the memory range passed to ::cuMemHostRegister()
-     * has already been registered.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
-
-    /**
-     * This error indicates that the pointer passed to ::cuMemHostUnregister()
-     * does not correspond to any currently registered memory region.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
-
-    /**
-     * This error indicates that the attempted operation is not permitted.
-     */
-    { "CUDA_ERROR_NOT_PERMITTED", 800 },
-
-    /**
-     * This error indicates that the attempted operation is not supported
-     * on the current system or device.
-     */
-    { "CUDA_ERROR_NOT_SUPPORTED", 801 },
-
-    /**
-     * This indicates that an unknown internal error has occurred.
-     */
-    { "CUDA_ERROR_UNKNOWN", 999 },
-    { NULL, -1 }
-};
-
-// This is just a linear search through the array, since the error_ids are
-// not always consecutive
-const char * getCudaDrvErrorString(CUresult error_id)
-{
-    int index = 0;
-    while (sCudaDrvErrorString[index].error_id != error_id && 
-           sCudaDrvErrorString[index].error_id != -1)
-    {
-        index++;
-    }
-    if (sCudaDrvErrorString[index].error_id == error_id)
-        return (const char *)sCudaDrvErrorString[index].error_string;
-    else
-        return (const char *)"CUDA_ERROR not found!";
-}
-
-#endif
--- a/examples_cuda/deferred/dynamic_c.cpp
+++ b/examples_cuda/deferred/dynamic_c.cpp
@@ -1,870 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#include "deferred.h"
-#include "kernels_ispc.h"
-#include <algorithm>
-#include <stdint.h>
-#include <assert.h>
-#include <math.h>
-
-#ifdef _MSC_VER
-#define ISPC_IS_WINDOWS
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#ifdef ISPC_IS_LINUX
-#include <malloc.h>
-#endif // ISPC_IS_LINUX
-
-// Currently tile widths must be a multiple of SIMD width (i.e. 8 for ispc sse4x2)!
-//#define MIN_TILE_WIDTH 16
-//#define MIN_TILE_HEIGHT 16
-
-
-#define DYNAMIC_TREE_LEVELS 5
-// If this is set to 1 then the result will be identical to the static version
-#define DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE 1
-
-static void *
-lAlignedMalloc(size_t size, int32_t alignment) {
-#ifdef ISPC_IS_WINDOWS
-    return _aligned_malloc(size, alignment);
-#endif
-#ifdef ISPC_IS_LINUX
-    return memalign(alignment, size);
-#endif
-#ifdef ISPC_IS_APPLE
-    void *mem = malloc(size + (alignment-1) + sizeof(void*));
-    char *amem = ((char*)mem) + sizeof(void*);
-    amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
-                                        (alignment - 1)));
-    ((void**)amem)[-1] = mem;
-    return amem;
-#endif
-}
-
-
-static void
-lAlignedFree(void *ptr) {
-#ifdef ISPC_IS_WINDOWS
-    _aligned_free(ptr);
-#endif
-#ifdef ISPC_IS_LINUX
-    free(ptr);
-#endif
-#ifdef ISPC_IS_APPLE
-    free(((void**)ptr)[-1]);
-#endif
-}
-
-
-static void
-ComputeZBounds(int tileStartX, int tileEndX,
-               int tileStartY, int tileEndY,
-               // G-buffer data
-               float zBuffer[],
-               int gBufferWidth,
-               // Camera data
-               float cameraProj_33, float cameraProj_43,
-               float cameraNear, float cameraFar,
-               // Output
-               float *minZ, float *maxZ)
-{
-    // Find Z bounds
-    float laneMinZ = cameraFar;
-    float laneMaxZ = cameraNear;
-    for (int y = tileStartY; y < tileEndY; ++y) {
-        for (int x = tileStartX; x < tileEndX; ++x) {
-            // Unproject depth buffer Z value into view space
-            float z = zBuffer[(y * gBufferWidth + x)];
-            float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
-
-            // Work out Z bounds for our samples
-            // Avoid considering skybox/background or otherwise invalid pixels
-            if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
-                laneMinZ = std::min(laneMinZ, viewSpaceZ);
-                laneMaxZ = std::max(laneMaxZ, viewSpaceZ);
-            }
-        }
-    }
-    *minZ = laneMinZ;
-    *maxZ = laneMaxZ;
-}
-
-
-static void
-ComputeZBoundsRow(int tileY, int tileWidth, int tileHeight,
-                  int numTilesX, int numTilesY,
-                  // G-buffer data
-                  float zBuffer[],
-                  int gBufferWidth,
-                  // Camera data
-                  float cameraProj_33, float cameraProj_43,
-                  float cameraNear, float cameraFar,
-                  // Output
-                  float minZArray[],
-                  float maxZArray[])
-{
-    for (int tileX = 0; tileX < numTilesX; ++tileX) {
-        float minZ, maxZ;
-        ComputeZBounds(tileX * tileWidth, tileX * tileWidth + tileWidth,
-                       tileY * tileHeight, tileY * tileHeight + tileHeight,
-                       zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, 
-                       cameraNear, cameraFar, &minZ, &maxZ);
-        minZArray[tileX] = minZ;
-        maxZArray[tileX] = maxZ;
-    }
-}
-
-
-class MinMaxZTree
-{
-public:
-    // Currently (min) tile dimensions must divide gBuffer dimensions evenly
-    // Levels must be small enough that neither dimension goes below one tile
-    MinMaxZTree(
-        int tileWidth, int tileHeight, int levels,
-        int gBufferWidth, int gBufferHeight)
-        : mTileWidth(tileWidth), mTileHeight(tileHeight), mLevels(levels)
-    {
-        mNumTilesX = gBufferWidth / mTileWidth;
-        mNumTilesY = gBufferHeight / mTileHeight;
-        
-        // Allocate arrays
-        mMinZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
-        mMaxZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
-        for (int i = 0; i < mLevels; ++i) {
-            int x = NumTilesX(i);
-            int y = NumTilesY(i);
-            assert(x > 0);
-            assert(y > 0);
-            // NOTE: If the following two asserts fire it probably means that
-            // the base tile dimensions do not evenly divide the G-buffer dimensions
-            assert(x * (mTileWidth << i) >= gBufferWidth);
-            assert(y * (mTileHeight << i) >= gBufferHeight);
-            mMinZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
-            mMaxZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
-        }
-    }
-
-    void Update(float *zBuffer, int gBufferPitchInElements,
-        float cameraProj_33, float cameraProj_43,
-        float cameraNear, float cameraFar)
-    {
-        for (int tileY = 0; tileY < mNumTilesY; ++tileY) {
-            ComputeZBoundsRow(tileY, mTileWidth, mTileHeight, mNumTilesX, mNumTilesY,
-                              zBuffer, gBufferPitchInElements,
-                              cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-                              mMinZArrays[0] + (tileY * mNumTilesX),
-                              mMaxZArrays[0] + (tileY * mNumTilesX));
-        }
-
-        // Generate other levels
-        for (int level = 1; level < mLevels; ++level) {
-            int destTilesX = NumTilesX(level);
-            int destTilesY = NumTilesY(level);
-            int srcLevel = level - 1;
-            int srcTilesX = NumTilesX(srcLevel);
-            int srcTilesY = NumTilesY(srcLevel);
-            for (int y = 0; y < destTilesY; ++y) {
-                for (int x = 0; x < destTilesX; ++x) {
-                    int srcX = x << 1;
-                    int srcY = y << 1;
-                    // NOTE: Ugly branches to deal with non-multiple dimensions at some levels
-                    // TODO: SSE branchless min/max is probably better...
-                    float minZ = mMinZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
-                    float maxZ = mMaxZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
-                    if (srcX + 1 < srcTilesX) {
-                        minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY) * srcTilesX + 
-                                                                    (srcX + 1)]);
-                        maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY) * srcTilesX +
-                                                                    (srcX + 1)]);
-                        if (srcY + 1 < srcTilesY) {
-                            minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                        (srcX + 1)]);
-                            maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                        (srcX + 1)]);
-                        }
-                    }
-                    if (srcY + 1 < srcTilesY) {
-                        minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                    (srcX    )]);
-                        maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                    (srcX    )]);
-                    }
-                    mMinZArrays[level][y * destTilesX + x] = minZ;
-                    mMaxZArrays[level][y * destTilesX + x] = maxZ;
-                }
-            }
-        }
-    }
-
-    ~MinMaxZTree() {
-        for (int i = 0; i < mLevels; ++i) {
-            lAlignedFree(mMinZArrays[i]);
-            lAlignedFree(mMaxZArrays[i]);
-        }
-        lAlignedFree(mMinZArrays);
-        lAlignedFree(mMaxZArrays); 
-    }
-
-    int Levels() const { return mLevels; }
-
-    // These round UP, so beware that the last tile for a given level may not be completely full
-    // TODO: Verify this...
-    int NumTilesX(int level = 0) const { return (mNumTilesX + (1 << level) - 1) >> level; }
-    int NumTilesY(int level = 0) const { return (mNumTilesY + (1 << level) - 1) >> level; }
-    int TileWidth(int level = 0) const { return (mTileWidth << level); }
-    int TileHeight(int level = 0) const { return (mTileHeight << level); }
-
-    float MinZ(int level, int tileX, int tileY) const {
-        return mMinZArrays[level][tileY * NumTilesX(level) + tileX];
-    }
-    float MaxZ(int level, int tileX, int tileY) const {
-        return mMaxZArrays[level][tileY * NumTilesX(level) + tileX];
-    }
-
-private:
-    int mTileWidth;
-    int mTileHeight;
-    int mLevels;
-    int mNumTilesX;
-    int mNumTilesY;
-
-    // One array for each "level" in the tree
-    float **mMinZArrays;
-    float **mMaxZArrays;
-};
-
-static MinMaxZTree *gMinMaxZTree = 0;
-
-void InitDynamicC(InputData *input) {
-    gMinMaxZTree = 
-        new MinMaxZTree(MIN_TILE_WIDTH, MIN_TILE_HEIGHT, DYNAMIC_TREE_LEVELS,
-                        input->header.framebufferWidth, 
-                        input->header.framebufferHeight);
-}
-
-
-/* We're going to split a tile into 4 sub-tiles.  This function
-   reclassifies the tile's lights with respect to the sub-tiles. */
-static void
-SplitTileMinMax(
-    int tileMidX, int tileMidY,
-    // Subtile data (00, 10, 01, 11)
-    float subtileMinZ[],
-    float subtileMaxZ[],
-    // G-buffer data
-    int gBufferWidth, int gBufferHeight,
-    // Camera data
-    float cameraProj_11, float cameraProj_22,
-    // Light Data
-    int lightIndices[],
-    int numLights,
-    float light_positionView_x_array[],
-    float light_positionView_y_array[],
-    float light_positionView_z_array[],
-    float light_attenuationEnd_array[],
-    // Outputs
-    int subtileIndices[],
-    int subtileIndicesPitch,
-    int subtileNumLights[]
-    )
-{
-    float gBufferScale_x = 0.5f * (float)gBufferWidth;
-    float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-    float frustumPlanes_xy[2] = { -(cameraProj_11 * gBufferScale_x),
-                                   (cameraProj_22 * gBufferScale_y) };
-    float frustumPlanes_z[2] = { tileMidX - gBufferScale_x,
-                                 tileMidY - gBufferScale_y };
-
-    for (int i = 0; i < 2; ++i) {
-        // Normalize
-        float norm = 1.f / sqrtf(frustumPlanes_xy[i] * frustumPlanes_xy[i] + 
-                                 frustumPlanes_z[i] * frustumPlanes_z[i]);
-        frustumPlanes_xy[i] *= norm;
-        frustumPlanes_z[i] *= norm;
-    }
-
-    // Initialize
-    int subtileLightOffset[4];
-    subtileLightOffset[0] = 0 * subtileIndicesPitch;
-    subtileLightOffset[1] = 1 * subtileIndicesPitch;
-    subtileLightOffset[2] = 2 * subtileIndicesPitch;
-    subtileLightOffset[3] = 3 * subtileIndicesPitch;
-
-    for (int i = 0; i < numLights; ++i) {
-        int lightIndex = lightIndices[i];
-
-        float light_positionView_x = light_positionView_x_array[lightIndex];
-        float light_positionView_y = light_positionView_y_array[lightIndex];
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-        
-        // Test lights again against subtile z bounds
-        bool inFrustum[4];
-        inFrustum[0] = (light_positionView_z - subtileMinZ[0] >= light_attenuationEndNeg) &&
-            (subtileMaxZ[0] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[1] = (light_positionView_z - subtileMinZ[1] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[1] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[2] = (light_positionView_z - subtileMinZ[2] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[2] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[3] = (light_positionView_z - subtileMinZ[3] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[3] - light_positionView_z >= light_attenuationEndNeg);
-
-        float dx = light_positionView_z * frustumPlanes_z[0] + 
-            light_positionView_x * frustumPlanes_xy[0];
-        float dy = light_positionView_z * frustumPlanes_z[1] +
-            light_positionView_y * frustumPlanes_xy[1];
-        
-        if (fabsf(dx) > light_attenuationEnd) {
-            bool positiveX = dx > 0.0f;
-            inFrustum[0] = inFrustum[0] &&  positiveX;    // 00 subtile
-            inFrustum[1] = inFrustum[1] && !positiveX;    // 10 subtile
-            inFrustum[2] = inFrustum[2] &&  positiveX;    // 01 subtile
-            inFrustum[3] = inFrustum[3] && !positiveX;    // 11 subtile
-        }
-        if (fabsf(dy) > light_attenuationEnd) {
-            bool positiveY = dy > 0.0f;
-            inFrustum[0] = inFrustum[0] &&  positiveY;    // 00 subtile
-            inFrustum[1] = inFrustum[1] &&  positiveY;    // 10 subtile
-            inFrustum[2] = inFrustum[2] && !positiveY;    // 01 subtile
-            inFrustum[3] = inFrustum[3] && !positiveY;    // 11 subtile
-        }
-
-        if (inFrustum[0])
-            subtileIndices[subtileLightOffset[0]++] = lightIndex;
-        if (inFrustum[1])
-            subtileIndices[subtileLightOffset[1]++] = lightIndex;
-        if (inFrustum[2])
-            subtileIndices[subtileLightOffset[2]++] = lightIndex;
-        if (inFrustum[3])
-            subtileIndices[subtileLightOffset[3]++] = lightIndex;
-    }
-
-    subtileNumLights[0] = subtileLightOffset[0] - 0 * subtileIndicesPitch;
-    subtileNumLights[1] = subtileLightOffset[1] - 1 * subtileIndicesPitch;
-    subtileNumLights[2] = subtileLightOffset[2] - 2 * subtileIndicesPitch;
-    subtileNumLights[3] = subtileLightOffset[3] - 3 * subtileIndicesPitch;
-}
-
-
-static inline float
-dot3(float x, float y, float z, float a, float b, float c) {
-    return (x*a + y*b + z*c);
-}
-
-
-static inline void
-normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
-    float n = 1.f / sqrtf(x*x + y*y + z*z);
-    ox = x * n;
-    oy = y * n;
-    oz = z * n;
-}
-
-
-static inline float
-Unorm8ToFloat32(uint8_t u) {
-    return (float)u * (1.0f / 255.0f);
-}
-
-
-static inline uint8_t
-Float32ToUnorm8(float f) {
-    return (uint8_t)(f * 255.0f);
-}
-
-
-static inline float
-half_to_float_fast(uint16_t h) {
-    uint32_t hs = h & (int32_t)0x8000u;  // Pick off sign bit
-    uint32_t he = h & (int32_t)0x7C00u;  // Pick off exponent bits
-    uint32_t hm = h & (int32_t)0x03FFu;  // Pick off mantissa bits
-
-    // sign
-    uint32_t xs = ((uint32_t) hs) << 16; 
-    // Exponent: unbias the halfp, then bias the single
-    int32_t xes = ((int32_t) (he >> 10)) - 15 + 127; 
-    // Exponent
-    uint32_t xe = (uint32_t) (xes << 23);
-    // Mantissa
-    uint32_t xm = ((uint32_t) hm) << 13; 
-
-    uint32_t bits = (xs | xe | xm);
-    float *fp = reinterpret_cast<float *>(&bits);
-    return *fp;
-}
-
-
-static void
-ShadeTileC(
-    int32_t tileStartX, int32_t tileEndX,
-    int32_t tileStartY, int32_t tileEndY,
-    int32_t gBufferWidth, int32_t gBufferHeight,
-    const ispc::InputDataArrays &inputData,
-    // Camera data
-    float cameraProj_11, float cameraProj_22,
-    float cameraProj_33, float cameraProj_43,
-    // Light list
-    int32_t tileLightIndices[],
-    int32_t tileNumLights,
-    // UI
-    bool visualizeLightCount,
-    // Output
-    uint8_t framebuffer_r[],
-    uint8_t framebuffer_g[],
-    uint8_t framebuffer_b[]
-    )
-{
-    if (tileNumLights == 0 || visualizeLightCount) {
-        uint8_t c = (uint8_t)(std::min(tileNumLights << 2, 255));
-        for (int32_t y = tileStartY; y < tileEndY; ++y) {
-            for (int32_t x = tileStartX; x < tileEndX; ++x) {
-                int32_t framebufferIndex = (y * gBufferWidth + x);
-                framebuffer_r[framebufferIndex] = c;
-                framebuffer_g[framebufferIndex] = c;
-                framebuffer_b[framebufferIndex] = c;
-            }
-        }
-    } else {
-        float twoOverGBufferWidth = 2.0f / gBufferWidth;
-        float twoOverGBufferHeight = 2.0f / gBufferHeight;
-        
-        for (int32_t y = tileStartY; y < tileEndY; ++y) {
-            float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
-
-            for (int32_t x = tileStartX; x < tileEndX; ++x) {
-                int32_t gBufferOffset = y * gBufferWidth + x;
-                
-                // Reconstruct position and (negative) view vector from G-buffer
-                float surface_positionView_x, surface_positionView_y, surface_positionView_z;
-                float Vneg_x, Vneg_y, Vneg_z;
-
-                float z = inputData.zBuffer[gBufferOffset];
-
-                // Compute screen/clip-space position
-                // NOTE: Mind DX11 viewport transform and pixel center!
-                float positionScreen_x = (0.5f + (float)(x)) * 
-                    twoOverGBufferWidth - 1.0f;
-
-                // Unproject depth buffer Z value into view space
-                surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
-                surface_positionView_x = positionScreen_x * surface_positionView_z / 
-                    cameraProj_11;
-                surface_positionView_y = positionScreen_y * surface_positionView_z / 
-                    cameraProj_22;
-                
-                // We actually end up with a vector pointing *at* the
-                // surface (i.e. the negative view vector)
-                normalize3(surface_positionView_x, surface_positionView_y, 
-                           surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
-
-                // Reconstruct normal from G-buffer
-                float surface_normal_x, surface_normal_y, surface_normal_z;
-                float normal_x = half_to_float_fast(inputData.normalEncoded_x[gBufferOffset]);
-                float normal_y = half_to_float_fast(inputData.normalEncoded_y[gBufferOffset]);
-                    
-                float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
-                float m = sqrtf(4.0f * f - 1.0f);
-                    
-                surface_normal_x = m * (4.0f * normal_x - 2.0f);
-                surface_normal_y = m * (4.0f * normal_y - 2.0f);
-                surface_normal_z = 3.0f - 8.0f * f;
-
-                // Load other G-buffer parameters
-                float surface_specularAmount = 
-                    half_to_float_fast(inputData.specularAmount[gBufferOffset]);
-                float surface_specularPower  = 
-                    half_to_float_fast(inputData.specularPower[gBufferOffset]);
-                float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
-                float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
-                float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
-                
-                float lit_x = 0.0f;
-                float lit_y = 0.0f;
-                float lit_z = 0.0f;
-                for (int32_t tileLightIndex = 0; tileLightIndex < tileNumLights; 
-                     ++tileLightIndex) {
-                    int32_t lightIndex = tileLightIndices[tileLightIndex];
-                                        
-                    // Gather light data relevant to initial culling
-                    float light_positionView_x = 
-                        inputData.lightPositionView_x[lightIndex];
-                    float light_positionView_y = 
-                        inputData.lightPositionView_y[lightIndex];
-                    float light_positionView_z = 
-                        inputData.lightPositionView_z[lightIndex];
-                    float light_attenuationEnd = 
-                        inputData.lightAttenuationEnd[lightIndex];
-                    
-                    // Compute light vector
-                    float L_x = light_positionView_x - surface_positionView_x;
-                    float L_y = light_positionView_y - surface_positionView_y;
-                    float L_z = light_positionView_z - surface_positionView_z;
-
-                    float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
-                    
-                    // Clip at end of attenuation
-                    float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
-
-                    if (distanceToLight2 < light_attenuationEnd2) {
-                        float distanceToLight = sqrtf(distanceToLight2);
-
-                        float distanceToLightRcp = 1.f / distanceToLight;
-                        L_x *= distanceToLightRcp;
-                        L_y *= distanceToLightRcp;
-                        L_z *= distanceToLightRcp;
-
-                        // Start computing brdf
-                        float NdotL = dot3(surface_normal_x, surface_normal_y, 
-                                           surface_normal_z, L_x, L_y, L_z);
-                    
-                        // Clip back facing
-                        if (NdotL > 0.0f) {
-                            float light_attenuationBegin = 
-                                inputData.lightAttenuationBegin[lightIndex];
-
-                            // Light distance attenuation (linstep)
-                            float lightRange = (light_attenuationEnd - light_attenuationBegin);
-                            float falloffPosition = (light_attenuationEnd - distanceToLight);
-                            float attenuation = std::min(falloffPosition / lightRange, 1.0f);
-
-                            float H_x = (L_x - Vneg_x);
-                            float H_y = (L_y - Vneg_y);
-                            float H_z = (L_z - Vneg_z);
-                            normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
-                    
-                            float NdotH = dot3(surface_normal_x, surface_normal_y, 
-                                               surface_normal_z, H_x, H_y, H_z);
-                            NdotH = std::max(NdotH, 0.0f);
-
-                            float specular = powf(NdotH, surface_specularPower);
-                            float specularNorm = (surface_specularPower + 2.0f) * 
-                                (1.0f / 8.0f);
-                            float specularContrib = surface_specularAmount * 
-                                specularNorm * specular;
-
-                            float k = attenuation * NdotL * (1.0f + specularContrib);
-                    
-                            float light_color_x = inputData.lightColor_x[lightIndex];
-                            float light_color_y = inputData.lightColor_y[lightIndex];
-                            float light_color_z = inputData.lightColor_z[lightIndex];
-
-                            float lightContrib_x = surface_albedo_x * light_color_x;
-                            float lightContrib_y = surface_albedo_y * light_color_y;
-                            float lightContrib_z = surface_albedo_z * light_color_z;
-
-                            lit_x += lightContrib_x * k;
-                            lit_y += lightContrib_y * k;
-                            lit_z += lightContrib_z * k;
-                        }
-                    }
-                }
-
-                // Gamma correct
-                float gamma = 1.0 / 2.2f;
-                lit_x = powf(std::min(std::max(lit_x, 0.0f), 1.0f), gamma);
-                lit_y = powf(std::min(std::max(lit_y, 0.0f), 1.0f), gamma);
-                lit_z = powf(std::min(std::max(lit_z, 0.0f), 1.0f), gamma);
-                
-                framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
-                framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
-                framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
-            }
-        }
-    }
-}
-
-
-void
-ShadeDynamicTileRecurse(InputData *input, int level, int tileX, int tileY, 
-                        int *lightIndices, int numLights, 
-                        Framebuffer *framebuffer) {
-    const MinMaxZTree *minMaxZTree = gMinMaxZTree;
-    
-    // If we have few enough lights or this is the base case (last level), shade
-    // this full tile directly
-    if (level == 0 || numLights < DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE) {
-        int width = minMaxZTree->TileWidth(level);
-        int height = minMaxZTree->TileHeight(level);
-        int startX = tileX * width;
-        int startY = tileY * height;
-        int endX = std::min(input->header.framebufferWidth, startX + width);
-        int endY = std::min(input->header.framebufferHeight, startY + height);
-        
-        // Skip entirely offscreen tiles
-        if (endX > startX && endY > startY) {
-            ShadeTileC(startX, endX, startY, endY,
-                       input->header.framebufferWidth, input->header.framebufferHeight,
-                       input->arrays,
-                       input->header.cameraProj[0][0], input->header.cameraProj[1][1], 
-                       input->header.cameraProj[2][2], input->header.cameraProj[3][2],
-                       lightIndices, numLights, VISUALIZE_LIGHT_COUNT, 
-                       framebuffer->r, framebuffer->g, framebuffer->b);
-        }
-    } 
-    else {
-        // Otherwise, subdivide and 4-way recurse using X and Y splitting planes
-        // Move down a level in the tree
-        --level;
-        tileX <<= 1;
-        tileY <<= 1;
-        int width = minMaxZTree->TileWidth(level);
-        int height = minMaxZTree->TileHeight(level);
-
-        // Work out splitting coords
-        int midX = (tileX + 1) * width;
-        int midY = (tileY + 1) * height;
-
-        // Read subtile min/max data
-        // NOTE: We must be sure to handle out-of-bounds access here since
-        // sometimes we'll only have 1 or 2 subtiles for non-pow-2
-        // framebuffer sizes.
-        bool rightTileExists = (tileX + 1 < minMaxZTree->NumTilesX(level));
-        bool bottomTileExists = (tileY + 1 < minMaxZTree->NumTilesY(level));
-
-        // NOTE: Order is 00, 10, 01, 11
-        // Set defaults up to cull all lights if the tile doesn't exist (offscreen)
-        float minZ[4] = {input->header.cameraFar, input->header.cameraFar, 
-                         input->header.cameraFar, input->header.cameraFar};
-        float maxZ[4] = {input->header.cameraNear, input->header.cameraNear, 
-                         input->header.cameraNear, input->header.cameraNear};
-
-        minZ[0] = minMaxZTree->MinZ(level, tileX, tileY);
-        maxZ[0] = minMaxZTree->MaxZ(level, tileX, tileY);
-        if (rightTileExists) {
-            minZ[1] = minMaxZTree->MinZ(level, tileX + 1, tileY);
-            maxZ[1] = minMaxZTree->MaxZ(level, tileX + 1, tileY);
-            if (bottomTileExists) {
-                minZ[3] = minMaxZTree->MinZ(level, tileX + 1, tileY + 1);
-                maxZ[3] = minMaxZTree->MaxZ(level, tileX + 1, tileY + 1);
-            }
-        }
-        if (bottomTileExists) {
-            minZ[2] = minMaxZTree->MinZ(level, tileX, tileY + 1);
-            maxZ[2] = minMaxZTree->MaxZ(level, tileX, tileY + 1);
-        }
-
-        // Cull lights into subtile lists
-#ifdef ISPC_IS_WINDOWS
-        __declspec(align(ALIGNMENT_BYTES)) 
-#endif
-            int subtileLightIndices[4][MAX_LIGHTS]
-#ifndef ISPC_IS_WINDOWS
-            __attribute__ ((aligned(ALIGNMENT_BYTES)))
-#endif
-;
-        int subtileNumLights[4];
-        SplitTileMinMax(midX, midY, minZ, maxZ,
-            input->header.framebufferWidth, input->header.framebufferHeight, 
-            input->header.cameraProj[0][0], input->header.cameraProj[1][1],
-            lightIndices, numLights, input->arrays.lightPositionView_x, 
-            input->arrays.lightPositionView_y, input->arrays.lightPositionView_z, 
-            input->arrays.lightAttenuationEnd,
-            subtileLightIndices[0], MAX_LIGHTS, subtileNumLights);
-        
-        // Recurse into subtiles
-        ShadeDynamicTileRecurse(input, level, tileX    , tileY, 
-                                subtileLightIndices[0], subtileNumLights[0],
-                                framebuffer);
-        ShadeDynamicTileRecurse(input, level, tileX + 1, tileY,
-                                subtileLightIndices[1], subtileNumLights[1],
-                                framebuffer);
-        ShadeDynamicTileRecurse(input, level, tileX    , tileY + 1,
-                                subtileLightIndices[2], subtileNumLights[2],
-                                framebuffer);
-        ShadeDynamicTileRecurse(input, level, tileX + 1, tileY + 1,
-                                subtileLightIndices[3], subtileNumLights[3],
-                                framebuffer);
-    }
-}
-
-
-static int
-IntersectLightsWithTileMinMax(
-    int tileStartX, int tileEndX,
-    int tileStartY, int tileEndY,
-    // Tile data
-    float minZ,
-    float maxZ,
-    // G-buffer data
-    int gBufferWidth, int gBufferHeight,
-    // Camera data
-    float cameraProj_11, float cameraProj_22,
-    // Light Data
-    int numLights,
-    float light_positionView_x_array[],
-    float light_positionView_y_array[],
-    float light_positionView_z_array[],
-    float light_attenuationEnd_array[],
-    // Output
-    int tileLightIndices[]
-    )
-{
-    float gBufferScale_x = 0.5f * (float)gBufferWidth;
-    float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-    float frustumPlanes_xy[4];
-    float frustumPlanes_z[4];
-
-    // This one is totally constant over the whole screen... worth pulling it up at all?
-    float frustumPlanes_xy_v[4] = { -(cameraProj_11 * gBufferScale_x),
-                                    (cameraProj_11 * gBufferScale_x),
-                                    (cameraProj_22 * gBufferScale_y),
-                                    -(cameraProj_22 * gBufferScale_y) };
-    
-    float frustumPlanes_z_v[4] = {  tileEndX - gBufferScale_x,
-                                    -tileStartX + gBufferScale_x,
-                                    tileEndY - gBufferScale_y,
-                                    -tileStartY + gBufferScale_y };
-
-    for (int i = 0; i < 4; ++i) {
-        float norm = 1.f / sqrtf(frustumPlanes_xy_v[i] * frustumPlanes_xy_v[i] + 
-                                 frustumPlanes_z_v[i] * frustumPlanes_z_v[i]);
-        frustumPlanes_xy_v[i] *= norm;
-        frustumPlanes_z_v[i] *= norm;
-
-        frustumPlanes_xy[i] = frustumPlanes_xy_v[i];
-        frustumPlanes_z[i] = frustumPlanes_z_v[i];
-    }
-
-    int tileNumLights = 0;
-
-    for (int lightIndex = 0; lightIndex < numLights; ++lightIndex) {
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-
-        float d = light_positionView_z - minZ;
-        bool inFrustum = (d >= light_attenuationEndNeg);
-
-        d = maxZ - light_positionView_z;
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-        if (!inFrustum) 
-            continue;
-
-        float light_positionView_x = light_positionView_x_array[lightIndex];
-        float light_positionView_y = light_positionView_y_array[lightIndex];
-
-        d = light_positionView_z * frustumPlanes_z[0] + 
-            light_positionView_x * frustumPlanes_xy[0];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[1] + 
-            light_positionView_x * frustumPlanes_xy[1];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[2] + 
-            light_positionView_y * frustumPlanes_xy[2];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[3] + 
-            light_positionView_y * frustumPlanes_xy[3];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-        // Pack and store intersecting lights
-        if (inFrustum)
-            tileLightIndices[tileNumLights++] = lightIndex;
-    }
-
-    return tileNumLights;
-}
-
-
-void
-ShadeDynamicTile(InputData *input, int level, int tileX, int tileY,
-                 Framebuffer *framebuffer) {
-    const MinMaxZTree *minMaxZTree = gMinMaxZTree;
-
-    // Get Z min/max for this tile
-    int width = minMaxZTree->TileWidth(level);
-    int height = minMaxZTree->TileHeight(level);
-    float minZ = minMaxZTree->MinZ(level, tileX, tileY);
-    float maxZ = minMaxZTree->MaxZ(level, tileX, tileY);
-
-    int startX = tileX * width;
-    int startY = tileY * height;
-    int endX = std::min(input->header.framebufferWidth, startX + width);
-    int endY = std::min(input->header.framebufferHeight, startY + height);
-
-    // This is a root tile, so first do a full 6-plane cull
-#ifdef ISPC_IS_WINDOWS
-    __declspec(align(ALIGNMENT_BYTES)) 
-#endif
-        int lightIndices[MAX_LIGHTS]
-#ifndef ISPC_IS_WINDOWS
-        __attribute__ ((aligned(ALIGNMENT_BYTES)))
-#endif
-;
-    int numLights = IntersectLightsWithTileMinMax(
-        startX, endX, startY, endY,    minZ, maxZ,
-        input->header.framebufferWidth, input->header.framebufferHeight,
-        input->header.cameraProj[0][0], input->header.cameraProj[1][1],
-        MAX_LIGHTS, input->arrays.lightPositionView_x, 
-        input->arrays.lightPositionView_y, input->arrays.lightPositionView_z, 
-        input->arrays.lightAttenuationEnd, lightIndices);
-
-    // Now kick off the recursive process for this tile
-    ShadeDynamicTileRecurse(input, level, tileX, tileY, lightIndices, 
-                            numLights, framebuffer);
-}
-
-
-void
-DispatchDynamicC(InputData *input, Framebuffer *framebuffer)
-{
-    MinMaxZTree *minMaxZTree = gMinMaxZTree;
-        
-    // Update min/max Z tree
-    minMaxZTree->Update(input->arrays.zBuffer, input->header.framebufferWidth,
-        input->header.cameraProj[2][2], input->header.cameraProj[3][2], 
-        input->header.cameraNear, input->header.cameraFar);
-
-    int rootLevel = minMaxZTree->Levels() - 1;
-    int rootTilesX = minMaxZTree->NumTilesX(rootLevel);
-    int rootTilesY = minMaxZTree->NumTilesY(rootLevel);
-    int rootTiles = rootTilesX * rootTilesY;
-    for (int g = 0; g < rootTiles; ++g) {
-        uint32_t tileY = g / rootTilesX;
-        uint32_t tileX = g % rootTilesX;
-        ShadeDynamicTile(input, rootLevel, tileX, tileY, framebuffer);
-    }
-}
--- a/examples_cuda/deferred/dynamic_cilk.cpp
+++ b/examples_cuda/deferred/dynamic_cilk.cpp
@@ -1,398 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef __cilk
-
-#include "deferred.h"
-#include "kernels_ispc.h"
-#include <algorithm>
-#include <assert.h>
-
-#ifdef _MSC_VER
-#define ISPC_IS_WINDOWS
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#ifdef ISPC_IS_LINUX
-#include <malloc.h>
-#endif // ISPC_IS_LINUX
-
-// Currently tile widths must be a multiple of SIMD width (i.e. 8 for ispc sse4x2)!
-//#define MIN_TILE_WIDTH 64
-//#define MIN_TILE_HEIGHT 16
-
-
-#define DYNAMIC_TREE_LEVELS 5
-// If this is set to 1 then the result will be identical to the static version
-#define DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE 1
-
-static void *
-lAlignedMalloc(size_t size, int32_t alignment) {
-#ifdef ISPC_IS_WINDOWS
-    return _aligned_malloc(size, alignment);
-#endif
-#ifdef ISPC_IS_LINUX
-    return memalign(alignment, size);
-#endif
-#ifdef ISPC_IS_APPLE
-    void *mem = malloc(size + (alignment-1) + sizeof(void*));
-    char *amem = ((char*)mem) + sizeof(void*);
-    amem = amem + uint32_t(alignment - (reinterpret_cast<uint64_t>(amem) &
-                                        (alignment - 1)));
-    ((void**)amem)[-1] = mem;
-    return amem;
-#endif
-}
-
-
-static void
-lAlignedFree(void *ptr) {
-#ifdef ISPC_IS_WINDOWS
-    _aligned_free(ptr);
-#endif
-#ifdef ISPC_IS_LINUX
-    free(ptr);
-#endif
-#ifdef ISPC_IS_APPLE
-    free(((void**)ptr)[-1]);
-#endif
-}
-
-
-class MinMaxZTreeCilk
-{
-public:
-    // Currently (min) tile dimensions must divide gBuffer dimensions evenly
-    // Levels must be small enough that neither dimension goes below one tile
-    MinMaxZTreeCilk(
-        int tileWidth, int tileHeight, int levels,
-        int gBufferWidth, int gBufferHeight)
-        : mTileWidth(tileWidth), mTileHeight(tileHeight), mLevels(levels)
-    {
-        mNumTilesX = gBufferWidth / mTileWidth;
-        mNumTilesY = gBufferHeight / mTileHeight;
-        
-        // Allocate arrays
-        mMinZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
-        mMaxZArrays = (float **)lAlignedMalloc(sizeof(float *) * mLevels, 16);
-        for (int i = 0; i < mLevels; ++i) {
-            int x = NumTilesX(i);
-            int y = NumTilesY(i);
-            assert(x > 0);
-            assert(y > 0);
-            // NOTE: If the following two asserts fire it probably means that
-            // the base tile dimensions do not evenly divide the G-buffer dimensions
-            assert(x * (mTileWidth << i) >= gBufferWidth);
-            assert(y * (mTileHeight << i) >= gBufferHeight);
-            mMinZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
-            mMaxZArrays[i] = (float *)lAlignedMalloc(sizeof(float) * x * y, 16);
-        }
-    }
-
-    void Update(float *zBuffer, int gBufferPitchInElements,
-        float cameraProj_33, float cameraProj_43,
-        float cameraNear, float cameraFar)
-    {
-        // Compute level 0 in parallel. The outer loop is here since we use Cilk
-        _Cilk_for (int tileY = 0; tileY < mNumTilesY; ++tileY) {
-            ispc::ComputeZBoundsRow(tileY,
-                mTileWidth, mTileHeight, mNumTilesX, mNumTilesY,
-                zBuffer, gBufferPitchInElements,
-                cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-                mMinZArrays[0] + (tileY * mNumTilesX),
-                mMaxZArrays[0] + (tileY * mNumTilesX));
-        }
-
-        // Generate other levels
-        // NOTE: We currently don't use ispc here since it's sort of an
-        // awkward gather-based reduction. Using SSE odd pack/unpack
-        // instructions might actually work here when we need to optimize.
-        for (int level = 1; level < mLevels; ++level) {
-            int destTilesX = NumTilesX(level);
-            int destTilesY = NumTilesY(level);
-            int srcLevel = level - 1;
-            int srcTilesX = NumTilesX(srcLevel);
-            int srcTilesY = NumTilesY(srcLevel);
-            _Cilk_for (int y = 0; y < destTilesY; ++y) {
-                for (int x = 0; x < destTilesX; ++x) {
-                    int srcX = x << 1;
-                    int srcY = y << 1;
-                    // NOTE: Ugly branches to deal with non-multiple dimensions at some levels
-                    // TODO: SSE branchless min/max is probably better...
-                    float minZ = mMinZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
-                    float maxZ = mMaxZArrays[srcLevel][(srcY) * srcTilesX + (srcX)];
-                    if (srcX + 1 < srcTilesX) {
-                        minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY) * srcTilesX + 
-                                                                    (srcX + 1)]);
-                        maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY) * srcTilesX +
-                                                                    (srcX + 1)]);
-                        if (srcY + 1 < srcTilesY) {
-                            minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                        (srcX + 1)]);
-                            maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                        (srcX + 1)]);
-                        }
-                    }
-                    if (srcY + 1 < srcTilesY) {
-                        minZ = std::min(minZ, mMinZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                    (srcX    )]);
-                        maxZ = std::max(maxZ, mMaxZArrays[srcLevel][(srcY + 1) * srcTilesX +
-                                                                    (srcX    )]);
-                    }
-                    mMinZArrays[level][y * destTilesX + x] = minZ;
-                    mMaxZArrays[level][y * destTilesX + x] = maxZ;
-                }
-            }
-        }
-    }
-
-    ~MinMaxZTreeCilk() {
-        for (int i = 0; i < mLevels; ++i) {
-            lAlignedFree(mMinZArrays[i]);
-            lAlignedFree(mMaxZArrays[i]);
-        }
-        lAlignedFree(mMinZArrays);
-        lAlignedFree(mMaxZArrays); 
-    }
-
-    int Levels() const { return mLevels; }
-
-    // These round UP, so beware that the last tile for a given level may not be completely full
-    // TODO: Verify this...
-    int NumTilesX(int level = 0) const { return (mNumTilesX + (1 << level) - 1) >> level; }
-    int NumTilesY(int level = 0) const { return (mNumTilesY + (1 << level) - 1) >> level; }
-    int TileWidth(int level = 0) const { return (mTileWidth << level); }
-    int TileHeight(int level = 0) const { return (mTileHeight << level); }
-
-    float MinZ(int level, int tileX, int tileY) const {
-        return mMinZArrays[level][tileY * NumTilesX(level) + tileX];
-    }
-    float MaxZ(int level, int tileX, int tileY) const {
-        return mMaxZArrays[level][tileY * NumTilesX(level) + tileX];
-    }
-
-private:
-    int mTileWidth;
-    int mTileHeight;
-    int mLevels;
-    int mNumTilesX;
-    int mNumTilesY;
-
-    // One array for each "level" in the tree
-    float **mMinZArrays;
-    float **mMaxZArrays;
-};
-
-static MinMaxZTreeCilk *gMinMaxZTreeCilk = 0;
-
-void InitDynamicCilk(InputData *input) {
-    gMinMaxZTreeCilk = 
-        new MinMaxZTreeCilk(MIN_TILE_WIDTH, MIN_TILE_HEIGHT, DYNAMIC_TREE_LEVELS,
-                            input->header.framebufferWidth, 
-                            input->header.framebufferHeight);
-}
-
-
-static void
-ShadeDynamicTileRecurse(InputData *input, int level, int tileX, int tileY, 
-                        int *lightIndices, int numLights, 
-                        Framebuffer *framebuffer) {
-    const MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
-    
-    // If we have few enough lights or this is the base case (last level), shade
-    // this full tile directly
-    if (level == 0 || numLights < DYNAMIC_MIN_LIGHTS_TO_SUBDIVIDE) {
-        int width = minMaxZTree->TileWidth(level);
-        int height = minMaxZTree->TileHeight(level);
-        int startX = tileX * width;
-        int startY = tileY * height;
-        int endX = std::min(input->header.framebufferWidth, startX + width);
-        int endY = std::min(input->header.framebufferHeight, startY + height);
-        
-        // Skip entirely offscreen tiles
-        if (endX > startX && endY > startY) {
-            ispc::ShadeTile(
-                startX, endX, startY, endY,
-                input->header.framebufferWidth, input->header.framebufferHeight,
-                input->arrays,
-                input->header.cameraProj[0][0], input->header.cameraProj[1][1], 
-                input->header.cameraProj[2][2], input->header.cameraProj[3][2],
-                lightIndices, numLights, VISUALIZE_LIGHT_COUNT, 
-                framebuffer->r, framebuffer->g, framebuffer->b);
-        }
-    } 
-    else {
-        // Otherwise, subdivide and 4-way recurse using X and Y splitting planes
-        // Move down a level in the tree
-        --level;
-        tileX <<= 1;
-        tileY <<= 1;
-        int width = minMaxZTree->TileWidth(level);
-        int height = minMaxZTree->TileHeight(level);
-
-        // Work out splitting coords
-        int midX = (tileX + 1) * width;
-        int midY = (tileY + 1) * height;
-
-        // Read subtile min/max data
-        // NOTE: We must be sure to handle out-of-bounds access here since
-        // sometimes we'll only have 1 or 2 subtiles for non-pow-2
-        // framebuffer sizes.
-        bool rightTileExists = (tileX + 1 < minMaxZTree->NumTilesX(level));
-        bool bottomTileExists = (tileY + 1 < minMaxZTree->NumTilesY(level));
-
-        // NOTE: Order is 00, 10, 01, 11
-        // Set defaults up to cull all lights if the tile doesn't exist (offscreen)
-        float minZ[4] = {input->header.cameraFar, input->header.cameraFar, 
-                         input->header.cameraFar, input->header.cameraFar};
-        float maxZ[4] = {input->header.cameraNear, input->header.cameraNear, 
-                         input->header.cameraNear, input->header.cameraNear};
-
-        minZ[0] = minMaxZTree->MinZ(level, tileX, tileY);
-        maxZ[0] = minMaxZTree->MaxZ(level, tileX, tileY);
-        if (rightTileExists) {
-            minZ[1] = minMaxZTree->MinZ(level, tileX + 1, tileY);
-            maxZ[1] = minMaxZTree->MaxZ(level, tileX + 1, tileY);
-            if (bottomTileExists) {
-                minZ[3] = minMaxZTree->MinZ(level, tileX + 1, tileY + 1);
-                maxZ[3] = minMaxZTree->MaxZ(level, tileX + 1, tileY + 1);
-            }
-        }
-        if (bottomTileExists) {
-            minZ[2] = minMaxZTree->MinZ(level, tileX, tileY + 1);
-            maxZ[2] = minMaxZTree->MaxZ(level, tileX, tileY + 1);
-        }
-
-        // Cull lights into subtile lists
-#ifdef ISPC_IS_WINDOWS
-        __declspec(align(ALIGNMENT_BYTES)) 
-#endif
-            int subtileLightIndices[4][MAX_LIGHTS]
-#ifndef ISPC_IS_WINDOWS
-            __attribute__ ((aligned(ALIGNMENT_BYTES)))
-#endif
-;
-        int subtileNumLights[4];
-        ispc::SplitTileMinMax(midX, midY, minZ, maxZ,
-            input->header.framebufferWidth, input->header.framebufferHeight, 
-            input->header.cameraProj[0][0], input->header.cameraProj[1][1],
-            lightIndices, numLights, input->arrays.lightPositionView_x, 
-            input->arrays.lightPositionView_y, input->arrays.lightPositionView_z, 
-            input->arrays.lightAttenuationEnd,
-            subtileLightIndices[0], MAX_LIGHTS, subtileNumLights);
-        
-        // Recurse into subtiles
-        _Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX    , tileY, 
-                                            subtileLightIndices[0], subtileNumLights[0],
-                                            framebuffer);
-        _Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX + 1, tileY,
-                                            subtileLightIndices[1], subtileNumLights[1],
-                                            framebuffer);
-        _Cilk_spawn ShadeDynamicTileRecurse(input, level, tileX    , tileY + 1,
-                                            subtileLightIndices[2], subtileNumLights[2],
-                                            framebuffer);
-        ShadeDynamicTileRecurse(input, level, tileX + 1, tileY + 1,
-                                subtileLightIndices[3], subtileNumLights[3],
-                                framebuffer);
-    }
-}
-
-
-static void
-ShadeDynamicTile(InputData *input, int level, int tileX, int tileY,
-                 Framebuffer *framebuffer) {
-    const MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
-
-    // Get Z min/max for this tile
-    int width = minMaxZTree->TileWidth(level);
-    int height = minMaxZTree->TileHeight(level);
-    float minZ = minMaxZTree->MinZ(level, tileX, tileY);
-    float maxZ = minMaxZTree->MaxZ(level, tileX, tileY);
-
-    int startX = tileX * width;
-    int startY = tileY * height;
-    int endX = std::min(input->header.framebufferWidth, startX + width);
-    int endY = std::min(input->header.framebufferHeight, startY + height);
-
-    // This is a root tile, so first do a full 6-plane cull
-#ifdef ISPC_IS_WINDOWS
-    __declspec(align(ALIGNMENT_BYTES)) 
-#endif
-        int lightIndices[MAX_LIGHTS]
-#ifndef ISPC_IS_WINDOWS
-        __attribute__ ((aligned(ALIGNMENT_BYTES)))
-#endif
-;
-    int numLights = ispc::IntersectLightsWithTileMinMax(
-        startX, endX, startY, endY,    minZ, maxZ,
-        input->header.framebufferWidth, input->header.framebufferHeight,
-        input->header.cameraProj[0][0], input->header.cameraProj[1][1],
-        MAX_LIGHTS, input->arrays.lightPositionView_x, 
-        input->arrays.lightPositionView_y, input->arrays.lightPositionView_z, 
-        input->arrays.lightAttenuationEnd, lightIndices);
-
-    // Now kick off the recursive process for this tile
-    ShadeDynamicTileRecurse(input, level, tileX, tileY, lightIndices, 
-                            numLights, framebuffer);
-}
-
-
-void
-DispatchDynamicCilk(InputData *input, Framebuffer *framebuffer)
-{
-    MinMaxZTreeCilk *minMaxZTree = gMinMaxZTreeCilk;
-        
-    // Update min/max Z tree
-    minMaxZTree->Update(input->arrays.zBuffer, input->header.framebufferWidth,
-        input->header.cameraProj[2][2], input->header.cameraProj[3][2], 
-        input->header.cameraNear, input->header.cameraFar);
-
-    // Launch the "root" tiles.  Ideally these should at least fill the
-    // machine... at the moment we have a static number of "levels" to the
-    // mip tree but it might make sense to compute it based on the width of
-    // the machine.
-    int rootLevel = minMaxZTree->Levels() - 1;
-    int rootTilesX = minMaxZTree->NumTilesX(rootLevel);
-    int rootTilesY = minMaxZTree->NumTilesY(rootLevel);
-    int rootTiles = rootTilesX * rootTilesY;
-    _Cilk_for (int g = 0; g < rootTiles; ++g) {
-        uint32_t tileY = g / rootTilesX;
-        uint32_t tileX = g % rootTilesX;
-        ShadeDynamicTile(input, rootLevel, tileX, tileY, framebuffer);
-    }
-}
-
-#endif // __cilk
--- a/examples_cuda/deferred/kernels.cu
+++ b/examples_cuda/deferred/kernels.cu
@@ -1,761 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#include "deferred.h"
-#include <stdio.h>
-#include <assert.h>
-
-#define programCount 32
-#define programIndex (threadIdx.x & 31)
-#define taskIndex (blockIdx.x*4 + (threadIdx.x >> 5))
-#define taskCount (gridDim.x*4)
-#define warpIdx (threadIdx.x >> 5)
-
-#define int32 int
-#define int16 short
-#define int8 char
-
-__device__ static inline float clamp(float v, float low, float high) 
-{
-      return min(max(v, low), high);
-}
-
-struct InputDataArrays
-{
-    float *zBuffer;
-    unsigned int16 *normalEncoded_x; // half float
-    unsigned int16 *normalEncoded_y; // half float
-    unsigned int16 *specularAmount; // half float
-    unsigned int16 *specularPower; // half float
-    unsigned int8 *albedo_x; // unorm8
-    unsigned int8 *albedo_y; // unorm8
-    unsigned int8 *albedo_z; // unorm8
-    float *lightPositionView_x;
-    float *lightPositionView_y;
-    float *lightPositionView_z;
-    float *lightAttenuationBegin;
-    float *lightColor_x;
-    float *lightColor_y;
-    float *lightColor_z;
-    float *lightAttenuationEnd;
-};
-
-struct InputHeader
-{
-    float cameraProj[4][4];
-    float cameraNear;
-    float cameraFar;
-
-    int32 framebufferWidth;
-    int32 framebufferHeight;
-    int32 numLights;
-    int32 inputDataChunkSize;
-    int32 inputDataArrayOffsets[idaNum];
-};
-
-
-///////////////////////////////////////////////////////////////////////////
-// Common utility routines
-
-__device__
-static inline float
-dot3(float x, float y, float z, float a, float b, float c) {
-    return (x*a + y*b + z*c);
-}
-
-
-#if 0
-static __shared__ int shdata_full[128];
-template<typename T, int N>
-struct Uniform
-{
-  T data[(N+programCount-1)/programCount];
-  volatile T *shdata;
-
-  __device__ inline Uniform()
-  {
-    shdata = ((T*)shdata_full) + warpIdx*32;
-  }
-
-  __device__ inline int2 get_chunk(const int i) const
-  {
-    const int elem  = i & (programCount - 1);
-    const int chunk = i >> 5;
-    shdata[programIndex] = chunk;
-    shdata[        elem] = chunk;
-    return make_int2(shdata[programIndex], elem);
-  }
-
-  __device__ inline const T get(const int i) const
-  {
-    const int2 idx = get_chunk(i);
-    return __shfl(data[idx.x], idx.y);
-  }
-  
-  __device__ inline void set(const bool active, const int i, T value) 
-  {
-    const int2 idx = get_chunk(i);
-    const int chunkIdx = idx.x;
-    const int elemIdx = idx.y;
-    shdata[programIndex] = data[chunkIdx];
-    if (active) shdata[elemIdx] = value;
-    data[chunkIdx] = shdata[programIndex];
-  }
-};
-#elif 1
-template<typename T, int N>
-struct Uniform
-{
-  union
-  {
-    T *data;
-    int32_t ptr[2];
-  };
-
-  __device__ inline Uniform()
-  {
-    if (programIndex == 0)
-      data = (T*)malloc(N*sizeof(T));
-    ptr[0] = __shfl(ptr[0], 0);
-    ptr[1] = __shfl(ptr[1], 0);
-  }
-  __device__ inline ~Uniform()
-  {
-    if (programIndex == 0)
-      free(data);
-  }
-
-  __device__ inline const T get(const int i) const
-  {
-    return data[i];
-  }
-  
-  __device__ inline T* get_ptr(const int i) {return &data[i]; }
-  __device__ inline void set(const bool active, const int i, T value) 
-  {
-    if (active)
-      data[i] = value;
-  }
-};
-
-#else
-__shared__ int shdata_full[4*MAX_LIGHTS];
-template<typename T, int N>
-struct Uniform
-{
-  volatile T *shdata;
-
-  __device__ Uniform()
-  {
-    shdata = (T*)&shdata_full[warpIdx*MAX_LIGHTS];
-  }
-
-  __device__ inline const T get(const int i) const
-  {
-    return shdata[i];
-  }
-  
-  __device__ inline void set(const bool active, const int i, T value) 
-  {
-    if (active)
-      shdata[i] = value;
-  }
-};
-#endif
-
-
-__device__
-static inline void
-normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
-    float n = rsqrt(x*x + y*y + z*z);
-    ox = x * n;
-    oy = y * n;
-    oz = z * n;
-}
-
-__device__ inline
-static float reduce_min(float value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value = fminf(value, __shfl_xor(value, 1<<i, 32));
-  return value;
-}
-__device__ inline
-static float reduce_max(float value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value = fmaxf(value, __shfl_xor(value, 1<<i, 32));
-  return value;
-}
-
-#if 0
-__device__ inline
-static int reduce_sum(int value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value +=  __shfl_xor(value, 1<<i, 32);
-  return value;
-}
-static __device__ __forceinline__ uint shfl_scan_add_step(uint partial, uint up_offset)
-{
-  uint result;
-  asm(
-      "{.reg .u32 r0;"
-      ".reg .pred p;"
-      "shfl.up.b32 r0|p, %1, %2, 0;"
-      "@p add.u32 r0, r0, %3;"
-      "mov.u32 %0, r0;}"
-      : "=r"(result) : "r"(partial), "r"(up_offset), "r"(partial));
-  return result;
-}
-static __device__ __forceinline__ int inclusive_scan_warp(const int value)
-{
-  uint sum = value;
-#pragma unroll
-  for(int i = 0; i < 5; ++i)
-    sum = shfl_scan_add_step(sum, 1 << i);
-  return sum - value;
-}
-#endif
-
-
-static __device__ __forceinline__ int lanemask_lt()
-{
-  int mask;
-  asm("mov.u32 %0, %lanemask_lt;" : "=r" (mask));
-  return mask;
-}
-static __device__ __forceinline__ int2 warpBinExclusiveScan(const bool p)
-{
-  const int b = __ballot(p);
-  return make_int2(__popc(b), __popc(b & lanemask_lt()));
-}
-  __device__ static inline 
-int packed_store_active(bool active, int* ptr, int value)
-{
-  const int2 res = warpBinExclusiveScan(active);
-  const int idx = res.y;
-  const int nactive = res.x;
-  if (active)
-    ptr[idx] = value;
-  return nactive;
-}
-
-
-
-
-
-__device__
-static inline float
-Unorm8ToFloat32(unsigned int8 u) {
-    return (float)u * (1.0f / 255.0f);
-}
-
-
-__device__
-static inline unsigned int8
-Float32ToUnorm8(float f) {
-    return (unsigned int8)(f * 255.0f);
-}
-
-
-__device__
-static inline void
-ComputeZBounds(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-    // G-buffer data
-     float zBuffer[],
-     int32 gBufferWidth,
-    // Camera data
-     float cameraProj_33,  float cameraProj_43,
-     float cameraNear,  float cameraFar,
-    // Output
-     float &minZ,
-     float &maxZ
-    )
-{
-    // Find Z bounds
-    float laneMinZ = cameraFar;
-    float laneMaxZ = cameraNear;
-    for ( int32 y = tileStartY; y < tileEndY; ++y) {
-        for ( int xb = tileStartX; xb < tileEndX; xb += programCount)
-        {
-          const int x = xb + programIndex;
-          if (x >= tileEndX) break;
-            // Unproject depth buffer Z value into view space
-            float z = zBuffer[y * gBufferWidth + x];
-            float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
-
-            // Work out Z bounds for our samples
-            // Avoid considering skybox/background or otherwise invalid pixels
-            if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
-                laneMinZ = min(laneMinZ, viewSpaceZ);
-                laneMaxZ = max(laneMaxZ, viewSpaceZ);
-            }
-        }
-    }
-    minZ = reduce_min(laneMinZ);
-    maxZ = reduce_max(laneMaxZ);
-}
-
-
-__device__
-static inline  int32
-IntersectLightsWithTileMinMax(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-    // Tile data
-     float minZ,
-     float maxZ,
-    // G-buffer data
-     int32 gBufferWidth,  int32 gBufferHeight,
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-    // Light Data
-     int32 numLights,
-     float light_positionView_x_array[],
-     float light_positionView_y_array[],
-     float light_positionView_z_array[],
-     float light_attenuationEnd_array[],
-    // Output
-     Uniform<int,MAX_LIGHTS> &tileLightIndices
-    )
-{
-     float gBufferScale_x = 0.5f * (float)gBufferWidth;
-     float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-     float frustumPlanes_xy[4] = {
-        -(cameraProj_11 * gBufferScale_x),
-         (cameraProj_11 * gBufferScale_x),
-         (cameraProj_22 * gBufferScale_y),
-        -(cameraProj_22 * gBufferScale_y) };
-     float frustumPlanes_z[4] = {
-         tileEndX - gBufferScale_x,
-        -tileStartX + gBufferScale_x,
-         tileEndY - gBufferScale_y,
-        -tileStartY + gBufferScale_y };
-
-    for ( int i = 0; i < 4; ++i) {
-         float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] + 
-                                   frustumPlanes_z[i] * frustumPlanes_z[i]);
-        frustumPlanes_xy[i] *= norm;
-        frustumPlanes_z[i] *= norm;
-    }
-
-     int32 tileNumLights = 0;
-
-    for ( int lightIndexB = 0; lightIndexB < numLights; lightIndexB += programCount)
-    {
-      const int lightIndex = lightIndexB + programIndex;
-      if (lightIndex >= numLights) break;
-
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-
-        float d = light_positionView_z - minZ;
-        bool inFrustum = (d >= light_attenuationEndNeg);
-
-        d = maxZ - light_positionView_z;
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-        // This seems better than cif(!inFrustum) ccontinue; here since we
-        // don't actually need to mask the rest of this function - this is
-        // just a greedy early-out.  Could also structure all of this as
-        // nested if() statements, but this is a bit easier to read
-        if (__ballot(inFrustum) > 0) 
-        {
-            float light_positionView_x = light_positionView_x_array[lightIndex];
-            float light_positionView_y = light_positionView_y_array[lightIndex];
-
-            d = light_positionView_z * frustumPlanes_z[0] + 
-                light_positionView_x * frustumPlanes_xy[0];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[1] + 
-                light_positionView_x * frustumPlanes_xy[1];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[2] + 
-                light_positionView_y * frustumPlanes_xy[2];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[3] + 
-                light_positionView_y * frustumPlanes_xy[3];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-            // Pack and store intersecting lights
-            const bool active = inFrustum && lightIndex < numLights;
-#if 0
-            if (__ballot(active) > 0)
-              tileNumLights += packed_store_active(active, tileLightIndices.get_ptr(tileNumLights), lightIndex);
-#else
-            if (__ballot(active) > 0)
-            {
-              const int2 res = warpBinExclusiveScan(active);
-              const int idx = tileNumLights + res.y;
-              const int nactive = res.x;
-              tileLightIndices.set(active, idx, lightIndex);
-              tileNumLights += nactive;
-            }
-#endif
-        }
-    }
-
-    return tileNumLights;
-}
-
-
-__device__
-static inline   int32
-IntersectLightsWithTile(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-     int32 gBufferWidth,  int32 gBufferHeight,
-    // G-buffer data
-     float zBuffer[],
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-     float cameraProj_33,  float cameraProj_43,
-     float cameraNear,  float cameraFar,
-    // Light Data
-     int32 numLights,
-     float light_positionView_x_array[],
-     float light_positionView_y_array[],
-     float light_positionView_z_array[],
-     float light_attenuationEnd_array[],
-    // Output
-     Uniform<int,MAX_LIGHTS> &tileLightIndices
-    )
-{
-     float minZ, maxZ;
-    ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
-        zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-        minZ, maxZ);
-
-
-     int32 tileNumLights = IntersectLightsWithTileMinMax(
-        tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
-        gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
-        MAX_LIGHTS, light_positionView_x_array, light_positionView_y_array, 
-        light_positionView_z_array, light_attenuationEnd_array,
-        tileLightIndices);
-
-    return tileNumLights;
-}
-
-
-__device__
-static inline void
-ShadeTile(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-     int32 gBufferWidth,  int32 gBufferHeight,
-    const  InputDataArrays &inputData,
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-     float cameraProj_33,  float cameraProj_43,
-    // Light list
-     Uniform<int,MAX_LIGHTS> &tileLightIndices,
-     int32 tileNumLights,
-    // UI
-     bool visualizeLightCount,
-    // Output
-     unsigned int8 framebuffer_r[],
-     unsigned int8 framebuffer_g[],
-     unsigned int8 framebuffer_b[]
-    )
-{
-    if (tileNumLights == 0 || visualizeLightCount) {
-         unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
-        for ( int32 y = tileStartY; y < tileEndY; ++y) {
-            for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
-            { 
-              const int x = xb + programIndex;
-              if (x >= tileEndX) continue;
-                int32 framebufferIndex = (y * gBufferWidth + x);
-                framebuffer_r[framebufferIndex] = c;
-                framebuffer_g[framebufferIndex] = c;
-                framebuffer_b[framebufferIndex] = c;
-            }
-        }
-    } else {
-         float twoOverGBufferWidth = 2.0f / gBufferWidth;
-         float twoOverGBufferHeight = 2.0f / gBufferHeight;
-        
-        for ( int32 y = tileStartY; y < tileEndY; ++y) {
-             float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
-
-            for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
-            { 
-              const int x = xb + programIndex;
-//              if (x >= tileEndX) break;
-                int32 gBufferOffset = y * gBufferWidth + x;
-                
-                // Reconstruct position and (negative) view vector from G-buffer
-                float surface_positionView_x, surface_positionView_y, surface_positionView_z;
-                float Vneg_x, Vneg_y, Vneg_z;
-
-                float z = inputData.zBuffer[gBufferOffset];
-
-                // Compute screen/clip-space position
-                // NOTE: Mind DX11 viewport transform and pixel center!
-                float positionScreen_x = (0.5f + (float)(x)) * 
-                    twoOverGBufferWidth - 1.0f;
-
-                // Unproject depth buffer Z value into view space
-                surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
-                surface_positionView_x = positionScreen_x * surface_positionView_z / 
-                    cameraProj_11;
-                surface_positionView_y = positionScreen_y * surface_positionView_z / 
-                    cameraProj_22;
-                
-                // We actually end up with a vector pointing *at* the
-                // surface (i.e. the negative view vector)
-                normalize3(surface_positionView_x, surface_positionView_y, 
-                           surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
-
-                // Reconstruct normal from G-buffer
-                float surface_normal_x, surface_normal_y, surface_normal_z;
-                asm("// half2float //");
-                float normal_x = __half2float(inputData.normalEncoded_x[gBufferOffset]);
-                float normal_y = __half2float(inputData.normalEncoded_y[gBufferOffset]);
-                asm("// half2float //");
-                    
-                float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
-                float m = sqrt(4.0f * f - 1.0f);
-                    
-                surface_normal_x = m * (4.0f * normal_x - 2.0f);
-                surface_normal_y = m * (4.0f * normal_y - 2.0f);
-                surface_normal_z = 3.0f - 8.0f * f;
-
-                // Load other G-buffer parameters
-                float surface_specularAmount = 
-                    __half2float(inputData.specularAmount[gBufferOffset]);
-                float surface_specularPower  = 
-                    __half2float(inputData.specularPower[gBufferOffset]);
-                float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
-                float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
-                float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
-                
-                float lit_x = 0.0f;
-                float lit_y = 0.0f;
-                float lit_z = 0.0f;
-                for ( int32 tileLightIndex = 0; tileLightIndex < tileNumLights; 
-                     ++tileLightIndex) {
-                     int32 lightIndex = tileLightIndices.get(tileLightIndex);
-                                        
-                    // Gather light data relevant to initial culling
-                     float light_positionView_x = 
-                        __ldg(&inputData.lightPositionView_x[lightIndex]);
-                     float light_positionView_y = 
-                        __ldg(&inputData.lightPositionView_y[lightIndex]);
-                     float light_positionView_z = 
-                        __ldg(&inputData.lightPositionView_z[lightIndex]);
-                     float light_attenuationEnd = 
-                        __ldg(&inputData.lightAttenuationEnd[lightIndex]);
-                    
-                    // Compute light vector
-                    float L_x = light_positionView_x - surface_positionView_x;
-                    float L_y = light_positionView_y - surface_positionView_y;
-                    float L_z = light_positionView_z - surface_positionView_z;
-
-                    float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
-                    
-                    // Clip at end of attenuation
-                    float light_attenutaionEnd2 = light_attenuationEnd * light_attenuationEnd;
-
-                    if (distanceToLight2 < light_attenutaionEnd2) {                    
-                        float distanceToLight = sqrt(distanceToLight2);
-
-                        // HLSL "rcp" is allowed to be fairly inaccurate
-                        float distanceToLightRcp = 1.0f/distanceToLight;
-                        L_x *= distanceToLightRcp;
-                        L_y *= distanceToLightRcp;
-                        L_z *= distanceToLightRcp;
-
-                        // Start computing brdf
-                        float NdotL = dot3(surface_normal_x, surface_normal_y, 
-                                           surface_normal_z, L_x, L_y, L_z);
-                    
-                        // Clip back facing
-                        if (NdotL > 0.0f) {
-                             float light_attenuationBegin = 
-                                inputData.lightAttenuationBegin[lightIndex];
-
-                            // Light distance attenuation (linstep)
-                            float lightRange = (light_attenuationEnd - light_attenuationBegin);
-                            float falloffPosition = (light_attenuationEnd - distanceToLight);
-                            float attenuation = min(falloffPosition / lightRange, 1.0f);
-
-                            float H_x = (L_x - Vneg_x);
-                            float H_y = (L_y - Vneg_y);
-                            float H_z = (L_z - Vneg_z);
-                            normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
-                    
-                            float NdotH = dot3(surface_normal_x, surface_normal_y, 
-                                               surface_normal_z, H_x, H_y, H_z);
-                            NdotH = max(NdotH, 0.0f);
-
-                            float specular = pow(NdotH, surface_specularPower);
-                            float specularNorm = (surface_specularPower + 2.0f) * 
-                                (1.0f / 8.0f);
-                            float specularContrib = surface_specularAmount * 
-                                specularNorm * specular;
-
-                            float k = attenuation * NdotL * (1.0f + specularContrib);
-                    
-                             float light_color_x = inputData.lightColor_x[lightIndex];
-                             float light_color_y = inputData.lightColor_y[lightIndex];
-                             float light_color_z = inputData.lightColor_z[lightIndex];
-
-                            float lightContrib_x = surface_albedo_x * light_color_x;
-                            float lightContrib_y = surface_albedo_y * light_color_y;
-                            float lightContrib_z = surface_albedo_z * light_color_z;
-
-                            lit_x += lightContrib_x * k;
-                            lit_y += lightContrib_y * k;
-                            lit_z += lightContrib_z * k;
-                        }
-                    }
-                }
-
-                // Gamma correct
-                // These pows are pretty slow right now, but we can do
-                // something faster if really necessary to squeeze every
-                // last bit of performance out of it
-                float gamma = 1.0 / 2.2f;
-                lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
-                lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
-                lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
-                
-                framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
-                framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
-                framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
-            }
-        }
-    }
-}
-
-
-///////////////////////////////////////////////////////////////////////////
-// Static decomposition
-
-__global__ void
-RenderTile( int num_groups_x,  int num_groups_y,
-           const  InputHeader *inputHeaderPtr,
-           const  InputDataArrays *inputDataPtr,
-            int visualizeLightCount,
-           // Output
-            unsigned int8 framebuffer_r[],
-            unsigned int8 framebuffer_g[],
-            unsigned int8 framebuffer_b[]) {
-  if (taskIndex >= taskCount) return;
-
-  const  InputHeader inputHeader = *inputHeaderPtr;
-  const  InputDataArrays inputData = *inputDataPtr;
-     int32 group_y = taskIndex / num_groups_x;
-     int32 group_x = taskIndex % num_groups_x;
-
-     int32 tile_start_x = group_x * MIN_TILE_WIDTH;
-     int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
-     int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
-     int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
-
-     int framebufferWidth = inputHeader.framebufferWidth;
-     int framebufferHeight = inputHeader.framebufferHeight;
-     float cameraProj_00 = inputHeader.cameraProj[0][0];
-     float cameraProj_11 = inputHeader.cameraProj[1][1];
-     float cameraProj_22 = inputHeader.cameraProj[2][2];
-     float cameraProj_32 = inputHeader.cameraProj[3][2];
-
-    // Light intersection: figure out which lights illuminate this tile.
-     Uniform<int,MAX_LIGHTS> tileLightIndices;  // Light list for the tile
-#if 1
-     int numTileLights = 
-        IntersectLightsWithTile(tile_start_x, tile_end_x, 
-                                tile_start_y, tile_end_y,
-                                framebufferWidth, framebufferHeight,
-                                inputData.zBuffer,
-                                cameraProj_00, cameraProj_11,
-                                cameraProj_22, cameraProj_32,
-                                inputHeader.cameraNear, inputHeader.cameraFar,
-                                MAX_LIGHTS,
-                                inputData.lightPositionView_x, 
-                                inputData.lightPositionView_y, 
-                                inputData.lightPositionView_z, 
-                                inputData.lightAttenuationEnd,
-                                tileLightIndices);
-
-    // And now shade the tile, using the lights in tileLightIndices
-    ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
-              framebufferWidth, framebufferHeight, inputData,
-              cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
-              tileLightIndices, numTileLights, visualizeLightCount, 
-              framebuffer_r, framebuffer_g, framebuffer_b);
-#endif
-}
-
-
-extern "C" __global__ void
-RenderStatic( InputHeader inputHeaderPtr[],
-              InputDataArrays inputDataPtr[],
-              int visualizeLightCount,
-             // Output
-              unsigned int8 framebuffer_r[],
-              unsigned int8 framebuffer_g[],
-              unsigned int8 framebuffer_b[]) {
-
-  const  InputHeader inputHeader = *inputHeaderPtr;
-  const  InputDataArrays inputData = *inputDataPtr;
-
-
-     int num_groups_x = (inputHeader.framebufferWidth + 
-                                MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
-     int num_groups_y = (inputHeader.framebufferHeight + 
-                                MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
-     int num_groups = num_groups_x * num_groups_y;
-
-    // Launch a task to render each tile, each of which is MIN_TILE_WIDTH
-    // by MIN_TILE_HEIGHT pixels.
-     if (programIndex == 0)
-       RenderTile<<<(num_groups+4-1)/4,128>>>(num_groups_x, num_groups_y,
-           inputHeaderPtr, inputDataPtr, visualizeLightCount,
-           framebuffer_r, framebuffer_g, framebuffer_b);
-     cudaDeviceSynchronize();
-}
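Note on the removed kernels.cu (a sketch, not part of the deleted file): the `warpBinExclusiveScan` / `tileLightIndices.set` path compacts the indices of lights that survive culling by combining a warp ballot with a population count over the lower-numbered lanes. The host-side C++ below emulates that idea for a single warp of up to 32 lanes so it can be run without a GPU. The names `ballot`, `popcount32`, and `packActive` are hypothetical and chosen for illustration only, and the sequential loop over lanes stands in for the hardware's lock-step execution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a 32-lane warp ballot:
// bit i of the result is set when lane i's predicate is true.
static uint32_t ballot(const std::vector<bool>& pred) {
    assert(pred.size() <= 32);
    uint32_t mask = 0;
    for (std::size_t lane = 0; lane < pred.size(); ++lane)
        if (pred[lane]) mask |= (uint32_t{1} << lane);
    return mask;
}

// Portable population count (number of set bits).
static int popcount32(uint32_t v) {
    int n = 0;
    for (; v != 0; v &= v - 1) ++n;  // clears the lowest set bit each step
    return n;
}

// Append values[lane] to out for every active lane, preserving lane order.
// Each lane's slot is base + popcount(ballot & bits-below-this-lane), i.e.
// the exclusive prefix count that the deleted kernel derived from
// %lanemask_lt. Returns the number of active lanes.
static int packActive(const std::vector<bool>& active,
                      const std::vector<int>& values,
                      std::vector<int>& out) {
    assert(active.size() == values.size());
    const uint32_t b = ballot(active);
    const std::size_t base = out.size();
    out.resize(base + popcount32(b));
    for (std::size_t lane = 0; lane < active.size(); ++lane) {
        const uint32_t laneMaskLt = (uint32_t{1} << lane) - 1u;  // bits below lane
        if (active[lane])
            out[base + popcount32(b & laneMaskLt)] = values[lane];
    }
    return popcount32(b);
}
```

Calling `packActive` once per batch of lanes and accumulating its return value plays the role of `tileNumLights += nactive` in the removed code. On a GPU the same pattern would be written with the warp vote and `__popc` intrinsics rather than a loop; on CUDA 9 and later that means `__ballot_sync` in place of the `__ballot` call used in the deleted source.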
--- a/examples_cuda/deferred/kernels.ispc
+++ b/examples_cuda/deferred/kernels.ispc
@@ -1,675 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#include "deferred.h"
-
-struct InputDataArrays
-{
-    float *zBuffer;
-    unsigned int16 *normalEncoded_x; // half float
-    unsigned int16 *normalEncoded_y; // half float
-    unsigned int16 *specularAmount; // half float
-    unsigned int16 *specularPower; // half float
-    unsigned int8 *albedo_x; // unorm8
-    unsigned int8 *albedo_y; // unorm8
-    unsigned int8 *albedo_z; // unorm8
-    float *lightPositionView_x;
-    float *lightPositionView_y;
-    float *lightPositionView_z;
-    float *lightAttenuationBegin;
-    float *lightColor_x;
-    float *lightColor_y;
-    float *lightColor_z;
-    float *lightAttenuationEnd;
-};
-
-struct InputHeader
-{
-    float cameraProj[4][4];
-    float cameraNear;
-    float cameraFar;
-
-    int32 framebufferWidth;
-    int32 framebufferHeight;
-    int32 numLights;
-    int32 inputDataChunkSize;
-    int32 inputDataArrayOffsets[idaNum];
-};
-
-
-///////////////////////////////////////////////////////////////////////////
-// Common utility routines
-
-static inline float
-dot3(float x, float y, float z, float a, float b, float c) {
-    return (x*a + y*b + z*c);
-}
-
-
-static inline void
-normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
-    float n = rsqrt(x*x + y*y + z*z);
-    ox = x * n;
-    oy = y * n;
-    oz = z * n;
-}
-
-
-static inline float
-Unorm8ToFloat32(unsigned int8 u) {
-    return (float)u * (1.0f / 255.0f);
-}
-
-
-static inline unsigned int8
-Float32ToUnorm8(float f) {
-    return (unsigned int8)(f * 255.0f);
-}
-
-
-static void
-ComputeZBounds(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    // G-buffer data
-    uniform float zBuffer[],
-    uniform int32 gBufferWidth,
-    // Camera data
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    uniform float cameraNear, uniform float cameraFar,
-    // Output
-    uniform float &minZ,
-    uniform float &maxZ
-    )
-{
-    // Find Z bounds
-    float laneMinZ = cameraFar;
-    float laneMaxZ = cameraNear;
-    for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
-        foreach (x = tileStartX ... tileEndX) 
-        {
-            // Unproject depth buffer Z value into view space
-            float z = zBuffer[y * gBufferWidth + x];
-            float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
-
-            // Work out Z bounds for our samples
-            // Avoid considering skybox/background or otherwise invalid pixels
-            if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
-                laneMinZ = min(laneMinZ, viewSpaceZ);
-                laneMaxZ = max(laneMaxZ, viewSpaceZ);
-            }
-        }
-    }
-    minZ = reduce_min(laneMinZ);
-    maxZ = reduce_max(laneMaxZ);
-}
-
-
-export uniform int32
-IntersectLightsWithTileMinMax(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    // Tile data
-    uniform float minZ,
-    uniform float maxZ,
-    // G-buffer data
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    // Light Data
-    uniform int32 numLights,
-    uniform float light_positionView_x_array[],
-    uniform float light_positionView_y_array[],
-    uniform float light_positionView_z_array[],
-    uniform float light_attenuationEnd_array[],
-    // Output
-    uniform int32 tileLightIndices[]
-    )
-{
-    uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
-    uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-    uniform float frustumPlanes_xy[4] = {
-        -(cameraProj_11 * gBufferScale_x),
-         (cameraProj_11 * gBufferScale_x),
-         (cameraProj_22 * gBufferScale_y),
-        -(cameraProj_22 * gBufferScale_y) };
-    uniform float frustumPlanes_z[4] = {
-         tileEndX - gBufferScale_x,
-        -tileStartX + gBufferScale_x,
-         tileEndY - gBufferScale_y,
-        -tileStartY + gBufferScale_y };
-
-    for (uniform int i = 0; i < 4; ++i) {
-        uniform float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] + 
-                                   frustumPlanes_z[i] * frustumPlanes_z[i]);
-        frustumPlanes_xy[i] *= norm;
-        frustumPlanes_z[i] *= norm;
-    }
-
-    uniform int32 tileNumLights = 0;
-
-    foreach (lightIndex = 0 ... numLights) 
-    {
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-
-        float d = light_positionView_z - minZ;
-        bool inFrustum = (d >= light_attenuationEndNeg);
-
-        d = maxZ - light_positionView_z;
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-        // This seems better than cif(!inFrustum) ccontinue; here since we
-        // don't actually need to mask the rest of this function - this is
-        // just a greedy early-out.  Could also structure all of this as
-        // nested if() statements, but this is a bit easier to read
-      if (any(inFrustum)) {
-        float light_positionView_x = light_positionView_x_array[lightIndex];
-        float light_positionView_y = light_positionView_y_array[lightIndex];
-
-        d = light_positionView_z * frustumPlanes_z[0] + 
-          light_positionView_x * frustumPlanes_xy[0];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[1] + 
-          light_positionView_x * frustumPlanes_xy[1];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[2] + 
-          light_positionView_y * frustumPlanes_xy[2];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[3] + 
-          light_positionView_y * frustumPlanes_xy[3];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        // Pack and store intersecting lights
-        const bool active = inFrustum && lightIndex < numLights;
-
-        if (any(active))
-          tileNumLights += packed_store_active(active, &tileLightIndices[tileNumLights], lightIndex);
-      }
-    }
-
-    return tileNumLights;
-}
-
-
-static uniform int32
-IntersectLightsWithTile(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    // G-buffer data
-    uniform float zBuffer[],
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    uniform float cameraNear, uniform float cameraFar,
-    // Light Data
-    uniform int32 numLights,
-    uniform float light_positionView_x_array[],
-    uniform float light_positionView_y_array[],
-    uniform float light_positionView_z_array[],
-    uniform float light_attenuationEnd_array[],
-    // Output
-    uniform int32 tileLightIndices[]
-    )
-{
-    uniform float minZ, maxZ;
-    ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
-        zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-        minZ, maxZ);
-
-    uniform int32 tileNumLights = IntersectLightsWithTileMinMax(
-        tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
-        gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
-        MAX_LIGHTS, light_positionView_x_array, light_positionView_y_array, 
-        light_positionView_z_array, light_attenuationEnd_array,
-        tileLightIndices);
-
-    return tileNumLights;
-}
-
-
-export void
-ShadeTile(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    uniform InputDataArrays &inputData,
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    // Light list
-    uniform int32 tileLightIndices[],
-    uniform int32 tileNumLights,
-    // UI
-    uniform bool visualizeLightCount,
-    // Output
-    uniform unsigned int8 framebuffer_r[],
-    uniform unsigned int8 framebuffer_g[],
-    uniform unsigned int8 framebuffer_b[]
-    )
-{
-    if (tileNumLights == 0 || visualizeLightCount) {
-        uniform unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
-        for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
-            foreach (x = tileStartX ... tileEndX) 
-            { 
-                int32 framebufferIndex = (y * gBufferWidth + x);
-                framebuffer_r[framebufferIndex] = c;
-                framebuffer_g[framebufferIndex] = c;
-                framebuffer_b[framebufferIndex] = c;
-            }
-        }
-    } else {
-        uniform float twoOverGBufferWidth = 2.0f / gBufferWidth;
-        uniform float twoOverGBufferHeight = 2.0f / gBufferHeight;
-        
-        for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
-            uniform float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
-
-            foreach (x = tileStartX ... tileEndX) {
-                int32 gBufferOffset = y * gBufferWidth + x;
-                
-                // Reconstruct position and (negative) view vector from G-buffer
-                float surface_positionView_x, surface_positionView_y, surface_positionView_z;
-                float Vneg_x, Vneg_y, Vneg_z;
-
-                float z = inputData.zBuffer[gBufferOffset];
-
-                // Compute screen/clip-space position
-                // NOTE: Mind DX11 viewport transform and pixel center!
-                float positionScreen_x = (0.5f + (float)(x)) * 
-                    twoOverGBufferWidth - 1.0f;
-
-                // Unproject depth buffer Z value into view space
-                surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
-                surface_positionView_x = positionScreen_x * surface_positionView_z / 
-                    cameraProj_11;
-                surface_positionView_y = positionScreen_y * surface_positionView_z / 
-                    cameraProj_22;
-                
-                // We actually end up with a vector pointing *at* the
-                // surface (i.e. the negative view vector)
-                normalize3(surface_positionView_x, surface_positionView_y, 
-                           surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
-
-                // Reconstruct normal from G-buffer
-                float surface_normal_x, surface_normal_y, surface_normal_z;
-                float normal_x = half_to_float(inputData.normalEncoded_x[gBufferOffset]);
-                float normal_y = half_to_float(inputData.normalEncoded_y[gBufferOffset]);
-                    
-                float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
-                float m = sqrt(4.0f * f - 1.0f);
-                    
-                surface_normal_x = m * (4.0f * normal_x - 2.0f);
-                surface_normal_y = m * (4.0f * normal_y - 2.0f);
-                surface_normal_z = 3.0f - 8.0f * f;
-
-                // Load other G-buffer parameters
-                float surface_specularAmount = 
-                    half_to_float(inputData.specularAmount[gBufferOffset]);
-                float surface_specularPower  = 
-                    half_to_float(inputData.specularPower[gBufferOffset]);
-                float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
-                float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
-                float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
-                
-                float lit_x = 0.0f;
-                float lit_y = 0.0f;
-                float lit_z = 0.0f;
-                for (uniform int32 tileLightIndex = 0; tileLightIndex < tileNumLights; 
-                     ++tileLightIndex) {
-                    uniform int32 lightIndex = tileLightIndices[tileLightIndex];
-                                        
-                    // Gather light data relevant to initial culling
-                    uniform float light_positionView_x = 
-                        inputData.lightPositionView_x[lightIndex];
-                    uniform float light_positionView_y = 
-                        inputData.lightPositionView_y[lightIndex];
-                    uniform float light_positionView_z = 
-                        inputData.lightPositionView_z[lightIndex];
-                    uniform float light_attenuationEnd = 
-                        inputData.lightAttenuationEnd[lightIndex];
-                    
-                    // Compute light vector
-                    float L_x = light_positionView_x - surface_positionView_x;
-                    float L_y = light_positionView_y - surface_positionView_y;
-                    float L_z = light_positionView_z - surface_positionView_z;
-
-                    float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
-                    
-                    // Clip at end of attenuation
-                    float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
-
-                    cif (distanceToLight2 < light_attenuationEnd2) {
-                        float distanceToLight = sqrt(distanceToLight2);
-
-                        // HLSL "rcp" is allowed to be fairly inaccurate
-                        float distanceToLightRcp = rcp(distanceToLight);
-                        L_x *= distanceToLightRcp;
-                        L_y *= distanceToLightRcp;
-                        L_z *= distanceToLightRcp;
-
-                        // Start computing brdf
-                        float NdotL = dot3(surface_normal_x, surface_normal_y, 
-                                           surface_normal_z, L_x, L_y, L_z);
-                    
-                        // Clip back facing
-                        cif (NdotL > 0.0f) {
-                            uniform float light_attenuationBegin = 
-                                inputData.lightAttenuationBegin[lightIndex];
-
-                            // Light distance attenuation (linstep)
-                            float lightRange = (light_attenuationEnd - light_attenuationBegin);
-                            float falloffPosition = (light_attenuationEnd - distanceToLight);
-                            float attenuation = min(falloffPosition / lightRange, 1.0f);
-
-                            float H_x = (L_x - Vneg_x);
-                            float H_y = (L_y - Vneg_y);
-                            float H_z = (L_z - Vneg_z);
-                            normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
-                    
-                            float NdotH = dot3(surface_normal_x, surface_normal_y, 
-                                               surface_normal_z, H_x, H_y, H_z);
-                            NdotH = max(NdotH, 0.0f);
-
-                            float specular = pow(NdotH, surface_specularPower);
-                            float specularNorm = (surface_specularPower + 2.0f) * 
-                                (1.0f / 8.0f);
-                            float specularContrib = surface_specularAmount * 
-                                specularNorm * specular;
-
-                            float k = attenuation * NdotL * (1.0f + specularContrib);
-                    
-                            uniform float light_color_x = inputData.lightColor_x[lightIndex];
-                            uniform float light_color_y = inputData.lightColor_y[lightIndex];
-                            uniform float light_color_z = inputData.lightColor_z[lightIndex];
-
-                            float lightContrib_x = surface_albedo_x * light_color_x;
-                            float lightContrib_y = surface_albedo_y * light_color_y;
-                            float lightContrib_z = surface_albedo_z * light_color_z;
-
-                            lit_x += lightContrib_x * k;
-                            lit_y += lightContrib_y * k;
-                            lit_z += lightContrib_z * k;
-                        }
-                    }
-                }
-
-                // Gamma correct
-                // These pows are pretty slow right now, but we can do
-                // something faster if really necessary to squeeze every
-                // last bit of performance out of it
-                float gamma = 1.0 / 2.2f;
-                lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
-                lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
-                lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
-                
-                framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
-                framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
-                framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
-            }
-        }
-    }
-}
-
-
-///////////////////////////////////////////////////////////////////////////
-// Static decomposition
-
-task void
-RenderTile(uniform int num_groups_x, uniform int num_groups_y,
-           uniform InputHeader &inputHeader,
-           uniform InputDataArrays &inputData,
-           uniform int visualizeLightCount,
-           // Output
-           uniform unsigned int8 framebuffer_r[],
-           uniform unsigned int8 framebuffer_g[],
-           uniform unsigned int8 framebuffer_b[]) {
-    uniform int32 group_y = taskIndex / num_groups_x;
-    uniform int32 group_x = taskIndex % num_groups_x;
-    uniform int32 tile_start_x = group_x * MIN_TILE_WIDTH;
-    uniform int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
-    uniform int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
-    uniform int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
-
-    uniform int framebufferWidth = inputHeader.framebufferWidth;
-    uniform int framebufferHeight = inputHeader.framebufferHeight;
-    uniform float cameraProj_00 = inputHeader.cameraProj[0][0];
-    uniform float cameraProj_11 = inputHeader.cameraProj[1][1];
-    uniform float cameraProj_22 = inputHeader.cameraProj[2][2];
-    uniform float cameraProj_32 = inputHeader.cameraProj[3][2];
-
-    // Light intersection: figure out which lights illuminate this tile.
-    uniform int tileLightIndices[MAX_LIGHTS];  // Light list for the tile
-    uniform int numTileLights = 
-        IntersectLightsWithTile(tile_start_x, tile_end_x, 
-                                tile_start_y, tile_end_y,
-                                framebufferWidth, framebufferHeight,
-                                inputData.zBuffer,
-                                cameraProj_00, cameraProj_11,
-                                cameraProj_22, cameraProj_32,
-                                inputHeader.cameraNear, inputHeader.cameraFar,
-                                MAX_LIGHTS,
-                                inputData.lightPositionView_x, 
-                                inputData.lightPositionView_y, 
-                                inputData.lightPositionView_z, 
-                                inputData.lightAttenuationEnd,
-                                tileLightIndices);
-
-    // And now shade the tile, using the lights in tileLightIndices
-    ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
-              framebufferWidth, framebufferHeight, inputData,
-              cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
-              tileLightIndices, numTileLights, visualizeLightCount, 
-              framebuffer_r, framebuffer_g, framebuffer_b);
-}
-
-
-export void
-RenderStatic(uniform InputHeader &inputHeader,
-             uniform InputDataArrays &inputData,
-             uniform int visualizeLightCount,
-             // Output
-             uniform unsigned int8 framebuffer_r[],
-             uniform unsigned int8 framebuffer_g[],
-             uniform unsigned int8 framebuffer_b[]) {
-    uniform int num_groups_x = (inputHeader.framebufferWidth + 
-                                MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
-    uniform int num_groups_y = (inputHeader.framebufferHeight + 
-                                MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
-    uniform int num_groups = num_groups_x * num_groups_y;
-
-    // Launch a task to render each tile, each of which is MIN_TILE_WIDTH
-    // by MIN_TILE_HEIGHT pixels.
-    launch[num_groups] RenderTile(num_groups_x, num_groups_y,
-                                  inputHeader, inputData, visualizeLightCount,
-                                  framebuffer_r, framebuffer_g, framebuffer_b);
-}
-
-
-///////////////////////////////////////////////////////////////////////////
-// Routines for dynamic decomposition path
-
-// This computes the z min/max range for a whole row worth of tiles.
-export void
-ComputeZBoundsRow(
-    uniform int32 tileY,
-    uniform int32 tileWidth, uniform int32 tileHeight,
-    uniform int32 numTilesX, uniform int32 numTilesY,
-    // G-buffer data
-    uniform float zBuffer[],
-    uniform int32 gBufferWidth,
-    // Camera data
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    uniform float cameraNear, uniform float cameraFar,
-    // Output
-    uniform float minZArray[],
-    uniform float maxZArray[]
-    )
-{
-    for (uniform int32 tileX = 0; tileX < numTilesX; ++tileX) {
-        uniform float minZ, maxZ;
-        ComputeZBounds(
-            tileX * tileWidth, tileX * tileWidth + tileWidth,
-            tileY * tileHeight, tileY * tileHeight + tileHeight,
-            zBuffer, gBufferWidth,
-            cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-            minZ, maxZ);
-        minZArray[tileX] = minZ;
-        maxZArray[tileX] = maxZ;
-    }
-}
-
-
-// Reclassifies the lights with respect to four sub-tiles when we refine a tile.
-// numLights need not be a multiple of programCount here, but the input and output arrays
-// should be able to handle programCount-sized load/stores.
-export void
-SplitTileMinMax(
-    uniform int32 tileMidX, uniform int32 tileMidY,
-    // Subtile data (00, 10, 01, 11)
-    uniform float subtileMinZ[],
-    uniform float subtileMaxZ[],
-    // G-buffer data
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    // Light Data
-    uniform int32 lightIndices[],
-    uniform int32 numLights,
-    uniform float light_positionView_x_array[],
-    uniform float light_positionView_y_array[],
-    uniform float light_positionView_z_array[],
-    uniform float light_attenuationEnd_array[],
-    // Outputs
-    uniform int32 subtileIndices[],
-    uniform int32 subtileIndicesPitch,
-    uniform int32 subtileNumLights[]
-    )
-{
-    uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
-    uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-    uniform float frustumPlanes_xy[2] = { -(cameraProj_11 * gBufferScale_x),
-                                           (cameraProj_22 * gBufferScale_y) };
-    uniform float frustumPlanes_z[2] = { tileMidX - gBufferScale_x,
-                                         tileMidY - gBufferScale_y };
-
-    // Normalize
-    uniform float norm[2] = { rsqrt(frustumPlanes_xy[0] * frustumPlanes_xy[0] + 
-                                    frustumPlanes_z[0] * frustumPlanes_z[0]),
-                              rsqrt(frustumPlanes_xy[1] * frustumPlanes_xy[1] + 
-                                    frustumPlanes_z[1] * frustumPlanes_z[1]) };
-    frustumPlanes_xy[0] *= norm[0];
-    frustumPlanes_xy[1] *= norm[1];
-    frustumPlanes_z[0] *= norm[0];
-    frustumPlanes_z[1] *= norm[1];
-
-    // Initialize
-    uniform int32 subtileLightOffset[4];
-    subtileLightOffset[0] = 0 * subtileIndicesPitch;
-    subtileLightOffset[1] = 1 * subtileIndicesPitch;
-    subtileLightOffset[2] = 2 * subtileIndicesPitch;
-    subtileLightOffset[3] = 3 * subtileIndicesPitch;
-
-    foreach (i = 0 ... numLights) {
-        int32 lightIndex = lightIndices[i];
-
-        float light_positionView_x = light_positionView_x_array[lightIndex];
-        float light_positionView_y = light_positionView_y_array[lightIndex];
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-        
-        // Test lights against subtile z bounds
-        bool inFrustum[4];
-        inFrustum[0] = (light_positionView_z - subtileMinZ[0] >= light_attenuationEndNeg) &&
-            (subtileMaxZ[0] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[1] = (light_positionView_z - subtileMinZ[1] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[1] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[2] = (light_positionView_z - subtileMinZ[2] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[2] - light_positionView_z >= light_attenuationEndNeg);
-        inFrustum[3] = (light_positionView_z - subtileMinZ[3] >= light_attenuationEndNeg) && 
-            (subtileMaxZ[3] - light_positionView_z >= light_attenuationEndNeg);
-
-        float dx = light_positionView_z * frustumPlanes_z[0] + 
-            light_positionView_x * frustumPlanes_xy[0];
-        float dy = light_positionView_z * frustumPlanes_z[1] +
-            light_positionView_y * frustumPlanes_xy[1];
-        
-        cif (abs(dx) > light_attenuationEnd) {
-            bool positiveX = dx > 0.0f;
-            inFrustum[0] = inFrustum[0] &&  positiveX;    // 00 subtile
-            inFrustum[1] = inFrustum[1] && !positiveX;    // 10 subtile
-            inFrustum[2] = inFrustum[2] &&  positiveX;    // 01 subtile
-            inFrustum[3] = inFrustum[3] && !positiveX;    // 11 subtile
-        }
-        cif (abs(dy) > light_attenuationEnd) {
-            bool positiveY = dy > 0.0f;
-            inFrustum[0] = inFrustum[0] &&  positiveY;    // 00 subtile
-            inFrustum[1] = inFrustum[1] &&  positiveY;    // 10 subtile
-            inFrustum[2] = inFrustum[2] && !positiveY;    // 01 subtile
-            inFrustum[3] = inFrustum[3] && !positiveY;    // 11 subtile
-        }
-
-        // Pack and store intersecting lights
-        // TODO: Experiment with a loop here instead
-        cif (inFrustum[0])
-            subtileLightOffset[0] += 
-            packed_store_active(&subtileIndices[subtileLightOffset[0]],
-                                lightIndex);
-        cif (inFrustum[1])
-            subtileLightOffset[1] += 
-            packed_store_active(&subtileIndices[subtileLightOffset[1]],
-                                lightIndex);
-        cif (inFrustum[2])
-            subtileLightOffset[2] += 
-            packed_store_active(&subtileIndices[subtileLightOffset[2]], 
-                                lightIndex);
-        cif (inFrustum[3])
-            subtileLightOffset[3] += 
-            packed_store_active(&subtileIndices[subtileLightOffset[3]], 
-                                lightIndex);
-    }
-
-    subtileNumLights[0] = subtileLightOffset[0] - 0 * subtileIndicesPitch;
-    subtileNumLights[1] = subtileLightOffset[1] - 1 * subtileIndicesPitch;
-    subtileNumLights[2] = subtileLightOffset[2] - 2 * subtileIndicesPitch;
-    subtileNumLights[3] = subtileLightOffset[3] - 3 * subtileIndicesPitch;
-}
--- a/examples_cuda/deferred/kernels1.ispc
+++ b/examples_cuda/deferred/kernels1.ispc
@@ -1,556 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef __NVPTX__
-#warning "emitting DEVICE code"
-#define programCount warpSize()
-#define programIndex laneIndex()
-#define taskIndex    blockIndex0()
-#define taskCount    blockCount0()
-#define cif          if
-#else
-#warning "emitting HOST code"
-#endif
-
-
-#include "deferred.h"
-
-struct InputDataArrays
-{
-    float *zBuffer;
-    unsigned int16 *normalEncoded_x; // half float
-    unsigned int16 *normalEncoded_y; // half float
-    unsigned int16 *specularAmount; // half float
-    unsigned int16 *specularPower; // half float
-    unsigned int8 *albedo_x; // unorm8
-    unsigned int8 *albedo_y; // unorm8
-    unsigned int8 *albedo_z; // unorm8
-    float *lightPositionView_x;
-    float *lightPositionView_y;
-    float *lightPositionView_z;
-    float *lightAttenuationBegin;
-    float *lightColor_x;
-    float *lightColor_y;
-    float *lightColor_z;
-    float *lightAttenuationEnd;
-};
-
-struct InputHeader
-{
-    float cameraProj[4][4];
-    float cameraNear;
-    float cameraFar;
-
-    int32 framebufferWidth;
-    int32 framebufferHeight;
-    int32 numLights;
-    int32 inputDataChunkSize;
-    int32 inputDataArrayOffsets[idaNum];
-};
-
-
-///////////////////////////////////////////////////////////////////////////
-// Common utility routines
-
-static inline float
-dot3(float x, float y, float z, float a, float b, float c) {
-    return (x*a + y*b + z*c);
-}
-
-
-static inline void
-normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
-    float n = rsqrt(x*x + y*y + z*z);
-    ox = x * n;
-    oy = y * n;
-    oz = z * n;
-}
-
-
-static inline float
-Unorm8ToFloat32(unsigned int8 u) {
-    return (float)u * (1.0f / 255.0f);
-}
-
-
-static inline unsigned int8
-Float32ToUnorm8(float f) {
-    return (unsigned int8)(f * 255.0f);
-}
-
-
-static inline void
-ComputeZBounds(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    // G-buffer data
-    uniform float zBuffer[],
-    uniform int32 gBufferWidth,
-    // Camera data
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    uniform float cameraNear, uniform float cameraFar,
-    // Output
-    uniform float &minZ,
-    uniform float &maxZ
-    )
-{
-    // Find Z bounds
-    float laneMinZ = cameraFar;
-    float laneMaxZ = cameraNear;
-    for (uniform int32 y = tileStartY; y < tileEndY; ++y) 
-      foreach (x = tileStartX ... tileEndX) 
-      {
-        // Unproject depth buffer Z value into view space
-        float z = zBuffer[y * gBufferWidth + x];
-        float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
-
-        // Work out Z bounds for our samples
-        // Avoid considering skybox/background or otherwise invalid pixels
-        if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
-          laneMinZ = min(laneMinZ, viewSpaceZ);
-          laneMaxZ = max(laneMaxZ, viewSpaceZ);
-        }
-      }
-    minZ = reduce_min(laneMinZ);
-    maxZ = reduce_max(laneMaxZ);
-}
-
-
-static inline uniform int32
-IntersectLightsWithTileMinMax(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    // Tile data
-    uniform float minZ,
-    uniform float maxZ,
-    // G-buffer data
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    // Light Data
-    uniform int32 numLights,
-    uniform float light_positionView_x_array[],
-    uniform float light_positionView_y_array[],
-    uniform float light_positionView_z_array[],
-    uniform float light_attenuationEnd_array[],
-    // Output
-    uniform int32 tileLightIndices[]
-    )
-{
-    uniform float gBufferScale_x = 0.5f * (float)gBufferWidth;
-    uniform float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-    uniform float frustumPlanes_xy[4] = {
-        -(cameraProj_11 * gBufferScale_x),
-         (cameraProj_11 * gBufferScale_x),
-         (cameraProj_22 * gBufferScale_y),
-        -(cameraProj_22 * gBufferScale_y) };
-    uniform float frustumPlanes_z[4] = {
-         tileEndX - gBufferScale_x,
-        -tileStartX + gBufferScale_x,
-         tileEndY - gBufferScale_y,
-        -tileStartY + gBufferScale_y };
-
-    for (uniform int i = 0; i < 4; ++i) {
-        uniform float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] + 
-                                   frustumPlanes_z[i] * frustumPlanes_z[i]);
-        frustumPlanes_xy[i] *= norm;
-        frustumPlanes_z[i] *= norm;
-    }
-
-    uniform int32 tileNumLights = 0;
-
-    foreach (lightIndex = 0 ... numLights) 
-    {
-      float light_positionView_z = light_positionView_z_array[lightIndex];
-      float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-      float light_attenuationEndNeg = -light_attenuationEnd;
-
-      float d = light_positionView_z - minZ;
-      bool inFrustum = (d >= light_attenuationEndNeg);
-
-      d = maxZ - light_positionView_z;
-      inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-      // This seems better than cif(!inFrustum) ccontinue; here since we
-      // don't actually need to mask the rest of this function - this is
-      // just a greedy early-out.  Could also structure all of this as
-      // nested if() statements, but this is a bit easier to read
-      if (any(inFrustum)) 
-      {
-        float light_positionView_x = light_positionView_x_array[lightIndex];
-        float light_positionView_y = light_positionView_y_array[lightIndex];
-
-        d = light_positionView_z * frustumPlanes_z[0] + 
-          light_positionView_x * frustumPlanes_xy[0];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[1] + 
-          light_positionView_x * frustumPlanes_xy[1];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[2] + 
-          light_positionView_y * frustumPlanes_xy[2];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        d = light_positionView_z * frustumPlanes_z[3] + 
-          light_positionView_y * frustumPlanes_xy[3];
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-        // Pack and store intersecting lights
-        const bool active = inFrustum && lightIndex < numLights;
-
-        if(any(active))
-          tileNumLights += packed_store_active(active, &tileLightIndices[tileNumLights], lightIndex);
-      }
-    }
-
-    return tileNumLights;
-}
-
-
-static inline  uniform int32
-IntersectLightsWithTile(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    // G-buffer data
-    uniform float zBuffer[],
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    uniform float cameraNear, uniform float cameraFar,
-    // Light Data
-    uniform int32 numLights,
-    uniform float light_positionView_x_array[],
-    uniform float light_positionView_y_array[],
-    uniform float light_positionView_z_array[],
-    uniform float light_attenuationEnd_array[],
-    // Output
-    uniform int32 tileLightIndices[]
-    )
-{
-    uniform float minZ, maxZ;
-    ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
-        zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-        minZ, maxZ);
-
-    uniform int32 tileNumLights = IntersectLightsWithTileMinMax(
-        tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
-        gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
-        MAX_LIGHTS, light_positionView_x_array, light_positionView_y_array, 
-        light_positionView_z_array, light_attenuationEnd_array,
-        tileLightIndices);
-
-    return tileNumLights;
-}
-
-
-static inline void
-ShadeTile(
-    uniform int32 tileStartX, uniform int32 tileEndX,
-    uniform int32 tileStartY, uniform int32 tileEndY,
-    uniform int32 gBufferWidth, uniform int32 gBufferHeight,
-    const uniform InputDataArrays &inputData,
-    // Camera data
-    uniform float cameraProj_11, uniform float cameraProj_22,
-    uniform float cameraProj_33, uniform float cameraProj_43,
-    // Light list
-    uniform int32 tileLightIndices[],
-    uniform int32 tileNumLights,
-    // UI
-    uniform bool visualizeLightCount,
-    // Output
-    uniform unsigned int8 framebuffer_r[],
-    uniform unsigned int8 framebuffer_g[],
-    uniform unsigned int8 framebuffer_b[]
-    )
-{
-    if (tileNumLights == 0 || visualizeLightCount) {
-        uniform unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
-        for (uniform int32 y = tileStartY; y < tileEndY; ++y) 
-          foreach (x = tileStartX ... tileEndX) 
-          { 
-            int32 framebufferIndex = (y * gBufferWidth + x);
-            framebuffer_r[framebufferIndex] = c;
-            framebuffer_g[framebufferIndex] = c;
-            framebuffer_b[framebufferIndex] = c;
-          }
-    } else {
-        uniform float twoOverGBufferWidth = 2.0f / gBufferWidth;
-        uniform float twoOverGBufferHeight = 2.0f / gBufferHeight;
-        
-        for (uniform int32 y = tileStartY; y < tileEndY; ++y) {
-          uniform float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
-
-          foreach (x = tileStartX ... tileEndX) {
-            int32 gBufferOffset = y * gBufferWidth + x;
-
-            // Reconstruct position and (negative) view vector from G-buffer
-            float surface_positionView_x, surface_positionView_y, surface_positionView_z;
-            float Vneg_x, Vneg_y, Vneg_z;
-
-            float z = inputData.zBuffer[gBufferOffset];
-
-            // Compute screen/clip-space position
-            // NOTE: Mind DX11 viewport transform and pixel center!
-            float positionScreen_x = (0.5f + (float)(x)) * 
-              twoOverGBufferWidth - 1.0f;
-
-            // Unproject depth buffer Z value into view space
-            surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
-            surface_positionView_x = positionScreen_x * surface_positionView_z / 
-              cameraProj_11;
-            surface_positionView_y = positionScreen_y * surface_positionView_z / 
-              cameraProj_22;
-
-            // We actually end up with a vector pointing *at* the
-            // surface (i.e. the negative view vector)
-            normalize3(surface_positionView_x, surface_positionView_y, 
-                surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
-
-            // Reconstruct normal from G-buffer
-            float surface_normal_x, surface_normal_y, surface_normal_z;
-            float normal_x = half_to_float(inputData.normalEncoded_x[gBufferOffset]);
-            float normal_y = half_to_float(inputData.normalEncoded_y[gBufferOffset]);
-
-            float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
-            float m = sqrt(4.0f * f - 1.0f);
-
-            surface_normal_x = m * (4.0f * normal_x - 2.0f);
-            surface_normal_y = m * (4.0f * normal_y - 2.0f);
-            surface_normal_z = 3.0f - 8.0f * f;
-
-            // Load other G-buffer parameters
-            float surface_specularAmount = 
-              half_to_float(inputData.specularAmount[gBufferOffset]);
-            float surface_specularPower  = 
-              half_to_float(inputData.specularPower[gBufferOffset]);
-            float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
-            float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
-            float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
-
-            float lit_x = 0.0f;
-            float lit_y = 0.0f;
-            float lit_z = 0.0f;
-            for (uniform int32 tileLightIndex = 0; tileLightIndex < tileNumLights; 
-                ++tileLightIndex) {
-              uniform int32 lightIndex = tileLightIndices[tileLightIndex];
-
-              // Gather light data relevant to initial culling
-              uniform float light_positionView_x = 
-                inputData.lightPositionView_x[lightIndex];
-              uniform float light_positionView_y = 
-                inputData.lightPositionView_y[lightIndex];
-              uniform float light_positionView_z = 
-                inputData.lightPositionView_z[lightIndex];
-              uniform float light_attenuationEnd = 
-                inputData.lightAttenuationEnd[lightIndex];
-
-              // Compute light vector
-              float L_x = light_positionView_x - surface_positionView_x;
-              float L_y = light_positionView_y - surface_positionView_y;
-              float L_z = light_positionView_z - surface_positionView_z;
-
-              float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
-
-              // Clip at end of attenuation
-              float light_attenutaionEnd2 = light_attenuationEnd * light_attenuationEnd;
-
-              cif (distanceToLight2 < light_attenutaionEnd2) {                    
-                float distanceToLight = sqrt(distanceToLight2);
-
-                // HLSL "rcp" is allowed to be fairly inaccurate
-                float distanceToLightRcp = rcp(distanceToLight);
-                L_x *= distanceToLightRcp;
-                L_y *= distanceToLightRcp;
-                L_z *= distanceToLightRcp;
-
-                // Start computing brdf
-                float NdotL = dot3(surface_normal_x, surface_normal_y, 
-                    surface_normal_z, L_x, L_y, L_z);
-
-                // Clip back facing
-                cif (NdotL > 0.0f) {
-                  uniform float light_attenuationBegin = 
-                    inputData.lightAttenuationBegin[lightIndex];
-
-                  // Light distance attenuation (linstep)
-                  float lightRange = (light_attenuationEnd - light_attenuationBegin);
-                  float falloffPosition = (light_attenuationEnd - distanceToLight);
-                  float attenuation = min(falloffPosition / lightRange, 1.0f);
-
-                  float H_x = (L_x - Vneg_x);
-                  float H_y = (L_y - Vneg_y);
-                  float H_z = (L_z - Vneg_z);
-                  normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
-
-                  float NdotH = dot3(surface_normal_x, surface_normal_y, 
-                      surface_normal_z, H_x, H_y, H_z);
-                  NdotH = max(NdotH, 0.0f);
-
-                  float specular = pow(NdotH, surface_specularPower);
-                  float specularNorm = (surface_specularPower + 2.0f) * 
-                    (1.0f / 8.0f);
-                  float specularContrib = surface_specularAmount * 
-                    specularNorm * specular;
-
-                  float k = attenuation * NdotL * (1.0f + specularContrib);
-
-                  uniform float light_color_x = inputData.lightColor_x[lightIndex];
-                  uniform float light_color_y = inputData.lightColor_y[lightIndex];
-                  uniform float light_color_z = inputData.lightColor_z[lightIndex];
-
-                  float lightContrib_x = surface_albedo_x * light_color_x;
-                  float lightContrib_y = surface_albedo_y * light_color_y;
-                  float lightContrib_z = surface_albedo_z * light_color_z;
-
-                  lit_x += lightContrib_x * k;
-                  lit_y += lightContrib_y * k;
-                  lit_z += lightContrib_z * k;
-                }
-              }
-            }
-
-            // Gamma correct
-            // These pows are pretty slow right now, but we can do
-            // something faster if really necessary to squeeze every
-            // last bit of performance out of it
-            float gamma = 1.0 / 2.2f;
-            lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
-            lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
-            lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
-
-            framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
-            framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
-            framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
-          }
-        }
-    }
-}
-
-
-///////////////////////////////////////////////////////////////////////////
-// Static decomposition
-
-void task
-RenderTile(uniform int num_groups_x, uniform int num_groups_y,
-           const  uniform InputHeader inputHeaderPtr[],
-           const  uniform InputDataArrays inputDataPtr[],
-           uniform int visualizeLightCount,
-           // Output
-           uniform unsigned int8 framebuffer_r[],
-           uniform unsigned int8 framebuffer_g[],
-           uniform unsigned int8 framebuffer_b[]) {
-  if (taskIndex >= taskCount) return;
-  const  uniform InputHeader inputHeader = *inputHeaderPtr;
-  const  uniform InputDataArrays inputData = *inputDataPtr;
-
-    uniform int32 group_y = taskIndex / num_groups_x;
-    uniform int32 group_x = taskIndex % num_groups_x;
-    uniform int32 tile_start_x = group_x * MIN_TILE_WIDTH;
-    uniform int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
-    uniform int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
-    uniform int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
-
-    uniform int framebufferWidth = inputHeader.framebufferWidth;
-    uniform int framebufferHeight = inputHeader.framebufferHeight;
-    uniform float cameraProj_00 = inputHeader.cameraProj[0][0];
-    uniform float cameraProj_11 = inputHeader.cameraProj[1][1];
-    uniform float cameraProj_22 = inputHeader.cameraProj[2][2];
-    uniform float cameraProj_32 = inputHeader.cameraProj[3][2];
-
-    // Light intersection: figure out which lights illuminate this tile.
-#if 1
-    uniform int * uniform tileLightIndices = uniform new uniform int [MAX_LIGHTS];
-#else
-    uniform int tileLightIndices[MAX_LIGHTS];  // Light list for the tile
-#endif
-#if 1
-    uniform int numTileLights = 
-        IntersectLightsWithTile(tile_start_x, tile_end_x, 
-                                tile_start_y, tile_end_y,
-                                framebufferWidth, framebufferHeight,
-                                inputData.zBuffer,
-                                cameraProj_00, cameraProj_11,
-                                cameraProj_22, cameraProj_32,
-                                inputHeader.cameraNear, inputHeader.cameraFar,
-                                MAX_LIGHTS,
-                                inputData.lightPositionView_x, 
-                                inputData.lightPositionView_y, 
-                                inputData.lightPositionView_z, 
-                                inputData.lightAttenuationEnd,
-                                tileLightIndices);
-
-    // And now shade the tile, using the lights in tileLightIndices
-    ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
-              framebufferWidth, framebufferHeight, inputData,
-              cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
-              tileLightIndices, numTileLights, visualizeLightCount, 
-              framebuffer_r, framebuffer_g, framebuffer_b);
-#endif
-#if 1
-    delete tileLightIndices;
-#endif
-}
-
-
-export void
-RenderStatic(uniform InputHeader inputHeaderPtr[],
-             uniform InputDataArrays inputDataPtr[],
-             uniform int visualizeLightCount,
-             // Output
-             uniform unsigned int8 framebuffer_r[],
-             uniform unsigned int8 framebuffer_g[],
-             uniform unsigned int8 framebuffer_b[]) {
-
-  const uniform InputHeader inputHeader = *inputHeaderPtr;
-  const uniform InputDataArrays inputData = *inputDataPtr;
-
-
-    uniform int num_groups_x = (inputHeader.framebufferWidth + 
-                                MIN_TILE_WIDTH - 1) / MIN_TILE_WIDTH;
-    uniform int num_groups_y = (inputHeader.framebufferHeight + 
-                                MIN_TILE_HEIGHT - 1) / MIN_TILE_HEIGHT;
-    uniform int num_groups = num_groups_x * num_groups_y;
-
-    // Launch a task to render each tile, each of which is MIN_TILE_WIDTH
-    // by MIN_TILE_HEIGHT pixels.
-    launch[num_groups] RenderTile(num_groups_x, num_groups_y,
-                                  inputHeaderPtr, inputDataPtr, visualizeLightCount,
-                                  framebuffer_r, framebuffer_g, framebuffer_b);
-    sync;
-}
-
-
-
--- a/examples_cuda/deferred/kernels_shared.cu
+++ b/examples_cuda/deferred/kernels_shared.cu
@@ -1,659 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#include "deferred.h"
-#include <stdio.h>
-#include <assert.h>
-
-#define programCount 32
-#define programIndex (threadIdx.x & 31)
-#define taskIndex (blockIdx.x*4 + (threadIdx.x >> 5))
-#define taskCount (gridDim.x*4)
-#define warpIdx (threadIdx.x >> 5)
-
-#define int32 int
-#define int16 short
-#define int8 char
-
-__device__ static inline float clamp(float v, float low, float high) 
-{
-      return min(max(v, low), high);
-}
-
-struct InputDataArrays
-{
-    float *zBuffer;
-    unsigned int16 *normalEncoded_x; // half float
-    unsigned int16 *normalEncoded_y; // half float
-    unsigned int16 *specularAmount; // half float
-    unsigned int16 *specularPower; // half float
-    unsigned int8 *albedo_x; // unorm8
-    unsigned int8 *albedo_y; // unorm8
-    unsigned int8 *albedo_z; // unorm8
-    float *lightPositionView_x;
-    float *lightPositionView_y;
-    float *lightPositionView_z;
-    float *lightAttenuationBegin;
-    float *lightColor_x;
-    float *lightColor_y;
-    float *lightColor_z;
-    float *lightAttenuationEnd;
-};
-
-struct InputHeader
-{
-    float cameraProj[4][4];
-    float cameraNear;
-    float cameraFar;
-
-    int32 framebufferWidth;
-    int32 framebufferHeight;
-    int32 numLights;
-    int32 inputDataChunkSize;
-    int32 inputDataArrayOffsets[idaNum];
-};
-
-
-///////////////////////////////////////////////////////////////////////////
-// Common utility routines
-
-__device__
-static inline float
-dot3(float x, float y, float z, float a, float b, float c) {
-    return (x*a + y*b + z*c);
-}
-
-
-#if 0
-template<typename T, int N>
-struct Uniform
-{
-  static __shared__ T shdata[128];
-  T data[(N-1)/programCount+1];
-
-  __device__ inline const T get(const int i) const
-  {
-    const int  elemIdx = i & (programCount-1);
-    const int chunkIdx = i >> 5;
-    return __shfl(data[chunkIdx], elemIdx);
-  }
-  
-  __device__ inline void set(const int i, const T value) const
-  {
-    const int  elemIdx = i & (programCount-1);
-    const int chunkIdx = i >> 5;
-    shdata[elemIdx] = value;
-    data[chunkIdx]  = shdata[programIndex];
-  }
-}
-#endif
-
-
-__device__
-static inline void
-normalize3(float x, float y, float z, float &ox, float &oy, float &oz) {
-    float n = rsqrt(x*x + y*y + z*z);
-    ox = x * n;
-    oy = y * n;
-    oz = z * n;
-}
-
-__device__ inline
-static float reduce_min(float value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value = min(value, __shfl_xor(value, 1<<i, 32));
-  return value;
-}
-__device__ inline
-static float reduce_max(float value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value = max(value, __shfl_xor(value, 1<<i, 32));
-  return value;
-}
-__device__ inline
-static int reduce_sum(int value)
-{
-#pragma unroll
-  for (int i = 4; i >=0; i--)
-    value +=  __shfl_xor(value, 1<<i, 32);
-  return value;
-}
-static __device__ __forceinline__ uint shfl_scan_add_step(uint partial, uint up_offset)
-{
-  uint result;
-  asm(
-      "{.reg .u32 r0;"
-      ".reg .pred p;"
-      "shfl.up.b32 r0|p, %1, %2, 0;"
-      "@p add.u32 r0, r0, %3;"
-      "mov.u32 %0, r0;}"
-      : "=r"(result) : "r"(partial), "r"(up_offset), "r"(partial));
-  return result;
-}
-static __device__ __forceinline__ int inclusive_scan_warp(const int value)
-{
-  uint sum = value;
-#pragma unroll
-  for(int i = 0; i < 5; ++i)
-    sum = shfl_scan_add_step(sum, 1 << i);
-  return sum - value;
-}
-
-
-static __device__ __forceinline__ int lanemask_lt()
-{
-  int mask;
-  asm("mov.u32 %0, %lanemask_lt;" : "=r" (mask));
-  return mask;
-}
-static __device__ __forceinline__ int2 warpBinExclusiveScan(const bool p)
-{
-  const unsigned int b = __ballot(p);
-  return make_int2(__popc(b & lanemask_lt()), __popc(b));
-}
-
-
-
-
-
-__device__
-static inline float
-Unorm8ToFloat32(unsigned int8 u) {
-    return (float)u * (1.0f / 255.0f);
-}
-
-
-__device__
-static inline unsigned int8
-Float32ToUnorm8(float f) {
-    return (unsigned int8)(f * 255.0f);
-}
-
-
-__device__
-static inline void
-ComputeZBounds(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-    // G-buffer data
-     float zBuffer[],
-     int32 gBufferWidth,
-    // Camera data
-     float cameraProj_33,  float cameraProj_43,
-     float cameraNear,  float cameraFar,
-    // Output
-     float &minZ,
-     float &maxZ
-    )
-{
-    // Find Z bounds
-    float laneMinZ = cameraFar;
-    float laneMaxZ = cameraNear;
-    for ( int32 y = tileStartY; y < tileEndY; ++y) {
-        for ( int xb = tileStartX; xb < tileEndX; xb += programCount)
-        {
-          const int x = xb + programIndex;
-          if (x >= tileEndX) break;
-            // Unproject depth buffer Z value into view space
-            float z = zBuffer[y * gBufferWidth + x];
-            float viewSpaceZ = cameraProj_43 / (z - cameraProj_33);
-
-            // Work out Z bounds for our samples
-            // Avoid considering skybox/background or otherwise invalid pixels
-            if ((viewSpaceZ < cameraFar) && (viewSpaceZ >= cameraNear)) {
-                laneMinZ = min(laneMinZ, viewSpaceZ);
-                laneMaxZ = max(laneMaxZ, viewSpaceZ);
-            }
-        }
-    }
-    minZ = reduce_min(laneMinZ);
-    maxZ = reduce_max(laneMaxZ);
-}
-
-
-__device__
-static inline  int32
-IntersectLightsWithTileMinMax(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-    // Tile data
-     float minZ,
-     float maxZ,
-    // G-buffer data
-     int32 gBufferWidth,  int32 gBufferHeight,
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-    // Light Data
-     int32 numLights,
-     float light_positionView_x_array[],
-     float light_positionView_y_array[],
-     float light_positionView_z_array[],
-     float light_attenuationEnd_array[],
-    // Output
-     int32 tileLightIndices[]
-    )
-{
-     float gBufferScale_x = 0.5f * (float)gBufferWidth;
-     float gBufferScale_y = 0.5f * (float)gBufferHeight;
-        
-     float frustumPlanes_xy[4] = {
-        -(cameraProj_11 * gBufferScale_x),
-         (cameraProj_11 * gBufferScale_x),
-         (cameraProj_22 * gBufferScale_y),
-        -(cameraProj_22 * gBufferScale_y) };
-     float frustumPlanes_z[4] = {
-         tileEndX - gBufferScale_x,
-        -tileStartX + gBufferScale_x,
-         tileEndY - gBufferScale_y,
-        -tileStartY + gBufferScale_y };
-
-    for ( int i = 0; i < 4; ++i) {
-         float norm = rsqrt(frustumPlanes_xy[i] * frustumPlanes_xy[i] + 
-                                   frustumPlanes_z[i] * frustumPlanes_z[i]);
-        frustumPlanes_xy[i] *= norm;
-        frustumPlanes_z[i] *= norm;
-    }
-
-     int32 tileNumLights = 0;
-
-    for ( int lightIndexB = 0; lightIndexB < numLights; lightIndexB += programCount)
-    {
-      const int lightIndex = lightIndexB + programIndex;
-
-        float light_positionView_z = light_positionView_z_array[lightIndex];
-        float light_attenuationEnd = light_attenuationEnd_array[lightIndex];
-        float light_attenuationEndNeg = -light_attenuationEnd;
-
-        float d = light_positionView_z - minZ;
-        bool inFrustum = (d >= light_attenuationEndNeg);
-
-        d = maxZ - light_positionView_z;
-        inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-        // This seems better than cif(!inFrustum) ccontinue; here since we
-        // don't actually need to mask the rest of this function - this is
-        // just a greedy early-out.  Could also structure all of this as
-        // nested if() statements, but this is a bit easier to read
-        int active = 0;
-        if ((inFrustum)) {
-            float light_positionView_x = light_positionView_x_array[lightIndex];
-            float light_positionView_y = light_positionView_y_array[lightIndex];
-
-            d = light_positionView_z * frustumPlanes_z[0] + 
-                light_positionView_x * frustumPlanes_xy[0];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[1] + 
-                light_positionView_x * frustumPlanes_xy[1];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[2] + 
-                light_positionView_y * frustumPlanes_xy[2];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-
-            d = light_positionView_z * frustumPlanes_z[3] + 
-                light_positionView_y * frustumPlanes_xy[3];
-            inFrustum = inFrustum && (d >= light_attenuationEndNeg);
-        
-            // Pack and store intersecting lights
-#if 0
-            if (inFrustum) {
-                tileNumLights += packed_store_active(&tileLightIndices[tileNumLights], 
-                                                     lightIndex);
-            }
-#else
-            if (inFrustum)
-            {
-              active = 1;
-            }
-#endif
-        }
-#if 1
-        if (lightIndex >= numLights) 
-          active = 0;
-
-#if 0
-        const int idx = tileNumLights + inclusive_scan_warp(active);
-        const int nactive = reduce_sum(active);
-#else
-        const int2 res = warpBinExclusiveScan(active);
-        const int idx = tileNumLights + res.x;
-        const int nactive = res.y;
-#endif
-        if (active)
-          tileLightIndices[idx] = lightIndex;
-        tileNumLights += nactive;
-#endif
-    }
-
-    return tileNumLights;
-}
-
-
-__device__
-static inline   int32
-IntersectLightsWithTile(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-     int32 gBufferWidth,  int32 gBufferHeight,
-    // G-buffer data
-     float zBuffer[],
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-     float cameraProj_33,  float cameraProj_43,
-     float cameraNear,  float cameraFar,
-    // Light Data
-     int32 numLights,
-     float light_positionView_x_array[],
-     float light_positionView_y_array[],
-     float light_positionView_z_array[],
-     float light_attenuationEnd_array[],
-    // Output
-     int32 tileLightIndices[]
-    )
-{
-     float minZ, maxZ;
-    ComputeZBounds(tileStartX, tileEndX, tileStartY, tileEndY,
-        zBuffer, gBufferWidth, cameraProj_33, cameraProj_43, cameraNear, cameraFar,
-        minZ, maxZ);
-
-
-     int32 tileNumLights = IntersectLightsWithTileMinMax(
-        tileStartX, tileEndX, tileStartY, tileEndY, minZ, maxZ,
-        gBufferWidth, gBufferHeight, cameraProj_11, cameraProj_22,
-        MAX_LIGHTS, light_positionView_x_array, light_positionView_y_array, 
-        light_positionView_z_array, light_attenuationEnd_array,
-        tileLightIndices);
-
-    return tileNumLights;
-}
-
-
-__device__
-static inline void
-ShadeTile(
-     int32 tileStartX,  int32 tileEndX,
-     int32 tileStartY,  int32 tileEndY,
-     int32 gBufferWidth,  int32 gBufferHeight,
-    const  InputDataArrays &inputData,
-    // Camera data
-     float cameraProj_11,  float cameraProj_22,
-     float cameraProj_33,  float cameraProj_43,
-    // Light list
-     int32 tileLightIndices[],
-     int32 tileNumLights,
-    // UI
-     bool visualizeLightCount,
-    // Output
-     unsigned int8 framebuffer_r[],
-     unsigned int8 framebuffer_g[],
-     unsigned int8 framebuffer_b[]
-    )
-{
-    if (tileNumLights == 0 || visualizeLightCount) {
-         unsigned int8 c = (unsigned int8)(min(tileNumLights << 2, 255));
-        for ( int32 y = tileStartY; y < tileEndY; ++y) {
-            for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
-            { 
-              const int x = xb + programIndex;
-              if (x >= tileEndX) continue;
-                int32 framebufferIndex = (y * gBufferWidth + x);
-                framebuffer_r[framebufferIndex] = c;
-                framebuffer_g[framebufferIndex] = c;
-                framebuffer_b[framebufferIndex] = c;
-            }
-        }
-    } else {
-         float twoOverGBufferWidth = 2.0f / gBufferWidth;
-         float twoOverGBufferHeight = 2.0f / gBufferHeight;
-        
-        for ( int32 y = tileStartY; y < tileEndY; ++y) {
-             float positionScreen_y = -(((0.5f + y) * twoOverGBufferHeight) - 1.f);
-
-            for ( int xb = tileStartX ; xb < tileEndX; xb += programCount)
-            { 
-              const int x = xb + programIndex;
-//              if (x >= tileEndX) break;
-                int32 gBufferOffset = y * gBufferWidth + x;
-                
-                // Reconstruct position and (negative) view vector from G-buffer
-                float surface_positionView_x, surface_positionView_y, surface_positionView_z;
-                float Vneg_x, Vneg_y, Vneg_z;
-
-                float z = inputData.zBuffer[gBufferOffset];
-
-                // Compute screen/clip-space position
-                // NOTE: Mind DX11 viewport transform and pixel center!
-                float positionScreen_x = (0.5f + (float)(x)) * 
-                    twoOverGBufferWidth - 1.0f;
-
-                // Unproject depth buffer Z value into view space
-                surface_positionView_z = cameraProj_43 / (z - cameraProj_33);
-                surface_positionView_x = positionScreen_x * surface_positionView_z / 
-                    cameraProj_11;
-                surface_positionView_y = positionScreen_y * surface_positionView_z / 
-                    cameraProj_22;
-                
-                // We actually end up with a vector pointing *at* the
-                // surface (i.e. the negative view vector)
-                normalize3(surface_positionView_x, surface_positionView_y, 
-                           surface_positionView_z, Vneg_x, Vneg_y, Vneg_z);
-
-                // Reconstruct normal from G-buffer
-                float surface_normal_x, surface_normal_y, surface_normal_z;
-                float normal_x = __half2float(inputData.normalEncoded_x[gBufferOffset]);
-                float normal_y = __half2float(inputData.normalEncoded_y[gBufferOffset]);
-                    
-                float f = (normal_x - normal_x * normal_x) + (normal_y - normal_y * normal_y);
-                float m = sqrt(4.0f * f - 1.0f);
-                    
-                surface_normal_x = m * (4.0f * normal_x - 2.0f);
-                surface_normal_y = m * (4.0f * normal_y - 2.0f);
-                surface_normal_z = 3.0f - 8.0f * f;
-
-                // Load other G-buffer parameters
-                float surface_specularAmount = 
-                    __half2float(inputData.specularAmount[gBufferOffset]);
-                float surface_specularPower  = 
-                    __half2float(inputData.specularPower[gBufferOffset]);
-                float surface_albedo_x = Unorm8ToFloat32(inputData.albedo_x[gBufferOffset]);
-                float surface_albedo_y = Unorm8ToFloat32(inputData.albedo_y[gBufferOffset]);
-                float surface_albedo_z = Unorm8ToFloat32(inputData.albedo_z[gBufferOffset]);
-                
-                float lit_x = 0.0f;
-                float lit_y = 0.0f;
-                float lit_z = 0.0f;
-                for ( int32 tileLightIndex = 0; tileLightIndex < tileNumLights; 
-                     ++tileLightIndex) {
-                     int32 lightIndex = tileLightIndices[tileLightIndex];
-                                        
-                    // Gather light data relevant to initial culling
-                     float light_positionView_x = 
-                        inputData.lightPositionView_x[lightIndex];
-                     float light_positionView_y = 
-                        inputData.lightPositionView_y[lightIndex];
-                     float light_positionView_z = 
-                        inputData.lightPositionView_z[lightIndex];
-                     float light_attenuationEnd = 
-                        inputData.lightAttenuationEnd[lightIndex];
-                    
-                    // Compute light vector
-                    float L_x = light_positionView_x - surface_positionView_x;
-                    float L_y = light_positionView_y - surface_positionView_y;
-                    float L_z = light_positionView_z - surface_positionView_z;
-
-                    float distanceToLight2 = dot3(L_x, L_y, L_z, L_x, L_y, L_z);
-                    
-                    // Clip at end of attenuation
-                    float light_attenuationEnd2 = light_attenuationEnd * light_attenuationEnd;
-
-                    if (distanceToLight2 < light_attenuationEnd2) {                    
-                        float distanceToLight = sqrt(distanceToLight2);
-
-                        // HLSL "rcp" is allowed to be fairly inaccurate
-                        float distanceToLightRcp = 1.0f/distanceToLight;
-                        L_x *= distanceToLightRcp;
-                        L_y *= distanceToLightRcp;
-                        L_z *= distanceToLightRcp;
-
-                        // Start computing brdf
-                        float NdotL = dot3(surface_normal_x, surface_normal_y, 
-                                           surface_normal_z, L_x, L_y, L_z);
-                    
-                        // Clip back facing
-                        if (NdotL > 0.0f) {
-                             float light_attenuationBegin = 
-                                inputData.lightAttenuationBegin[lightIndex];
-
-                            // Light distance attenuation (linstep)
-                            float lightRange = (light_attenuationEnd - light_attenuationBegin);
-                            float falloffPosition = (light_attenuationEnd - distanceToLight);
-                            float attenuation = min(falloffPosition / lightRange, 1.0f);
-
-                            float H_x = (L_x - Vneg_x);
-                            float H_y = (L_y - Vneg_y);
-                            float H_z = (L_z - Vneg_z);
-                            normalize3(H_x, H_y, H_z, H_x, H_y, H_z);
-                    
-                            float NdotH = dot3(surface_normal_x, surface_normal_y, 
-                                               surface_normal_z, H_x, H_y, H_z);
-                            NdotH = max(NdotH, 0.0f);
-
-                            float specular = pow(NdotH, surface_specularPower);
-                            float specularNorm = (surface_specularPower + 2.0f) * 
-                                (1.0f / 8.0f);
-                            float specularContrib = surface_specularAmount * 
-                                specularNorm * specular;
-
-                            float k = attenuation * NdotL * (1.0f + specularContrib);
-                    
-                             float light_color_x = inputData.lightColor_x[lightIndex];
-                             float light_color_y = inputData.lightColor_y[lightIndex];
-                             float light_color_z = inputData.lightColor_z[lightIndex];
-
-                            float lightContrib_x = surface_albedo_x * light_color_x;
-                            float lightContrib_y = surface_albedo_y * light_color_y;
-                            float lightContrib_z = surface_albedo_z * light_color_z;
-
-                            lit_x += lightContrib_x * k;
-                            lit_y += lightContrib_y * k;
-                            lit_z += lightContrib_z * k;
-                        }
-                    }
-                }
-
-                // Gamma correct
-                // These pows are pretty slow right now, but we can do
-                // something faster if really necessary to squeeze every
-                // last bit of performance out of it
-                float gamma = 1.0f / 2.2f;
-                lit_x = pow(clamp(lit_x, 0.0f, 1.0f), gamma);
-                lit_y = pow(clamp(lit_y, 0.0f, 1.0f), gamma);
-                lit_z = pow(clamp(lit_z, 0.0f, 1.0f), gamma);
-                
-                framebuffer_r[gBufferOffset] = Float32ToUnorm8(lit_x);
-                framebuffer_g[gBufferOffset] = Float32ToUnorm8(lit_y);
-                framebuffer_b[gBufferOffset] = Float32ToUnorm8(lit_z);
-            }
-        }
-    }
-}
-
-
-///////////////////////////////////////////////////////////////////////////
-// Static decomposition
-
-extern "C" __global__ void
-RenderTile( int num_groups_x,  int num_groups_y,
-           const  InputHeader *inputHeaderPtr,
-           const  InputDataArrays *inputDataPtr,
-            int visualizeLightCount,
-           // Output
-            unsigned int8 framebuffer_r[],
-            unsigned int8 framebuffer_g[],
-            unsigned int8 framebuffer_b[]) {
-  if (taskIndex >= taskCount) return;
-
-  const  InputHeader inputHeader = *inputHeaderPtr;
-  const  InputDataArrays inputData = *inputDataPtr;
-     int32 group_y = taskIndex / num_groups_x;
-     int32 group_x = taskIndex % num_groups_x;
-
-     int32 tile_start_x = group_x * MIN_TILE_WIDTH;
-     int32 tile_start_y = group_y * MIN_TILE_HEIGHT;
-     int32 tile_end_x = tile_start_x + MIN_TILE_WIDTH;
-     int32 tile_end_y = tile_start_y + MIN_TILE_HEIGHT;
-
-     int framebufferWidth = inputHeader.framebufferWidth;
-     int framebufferHeight = inputHeader.framebufferHeight;
-     float cameraProj_00 = inputHeader.cameraProj[0][0];
-     float cameraProj_11 = inputHeader.cameraProj[1][1];
-     float cameraProj_22 = inputHeader.cameraProj[2][2];
-     float cameraProj_32 = inputHeader.cameraProj[3][2];
-
-    // Light intersection: figure out which lights illuminate this tile.
-#if 0
-     int tileLightIndices[MAX_LIGHTS];  // Light list for the tile
-#else
-     __shared__ int tileLightIndicesFull[4*MAX_LIGHTS];  // Light list for the tile
-     int *tileLightIndices = &tileLightIndicesFull[warpIdx*MAX_LIGHTS];
-#endif
-     int numTileLights = 
-        IntersectLightsWithTile(tile_start_x, tile_end_x, 
-                                tile_start_y, tile_end_y,
-                                framebufferWidth, framebufferHeight,
-                                inputData.zBuffer,
-                                cameraProj_00, cameraProj_11,
-                                cameraProj_22, cameraProj_32,
-                                inputHeader.cameraNear, inputHeader.cameraFar,
-                                MAX_LIGHTS,
-                                inputData.lightPositionView_x, 
-                                inputData.lightPositionView_y, 
-                                inputData.lightPositionView_z, 
-                                inputData.lightAttenuationEnd,
-                                tileLightIndices);
-
-    // And now shade the tile, using the lights in tileLightIndices
-    ShadeTile(tile_start_x, tile_end_x, tile_start_y, tile_end_y,
-              framebufferWidth, framebufferHeight, inputData,
-              cameraProj_00, cameraProj_11, cameraProj_22, cameraProj_32,
-              tileLightIndices, numTileLights, visualizeLightCount, 
-              framebuffer_r, framebuffer_g, framebuffer_b);
-}
-
-
--- a/examples_cuda/deferred/main.cpp
+++ b/examples_cuda/deferred/main.cpp
@@ -1,156 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define ISPC_IS_WINDOWS
-#define NOMINMAX
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#include <fcntl.h>
-#include <float.h>
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <sys/types.h>
-#include <stdint.h>
-#include <algorithm>
-#include <assert.h>
-#include <vector>
-#ifdef ISPC_IS_WINDOWS
-  #define WIN32_LEAN_AND_MEAN
-  #include <windows.h>
-#endif
-#include "deferred.h"
-#include "../timing.h"
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-///////////////////////////////////////////////////////////////////////////
-
-int main(int argc, char** argv) {
-    if (argc != 2) {
-        printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
-        return 1;
-    }
-
-    InputData *input = CreateInputDataFromFile(argv[1]);
-    if (!input) {
-        printf("Failed to load input file \"%s\"!\n", argv[1]);
-        return 1;
-    }
-
-    Framebuffer framebuffer(input->header.framebufferWidth,
-                            input->header.framebufferHeight);
-
-#if 0
-    InitDynamicC(input);
-#ifdef __cilk
-    InitDynamicCilk(input);
-#endif // __cilk
-#endif
-
-    int nframes = 5;
-    double ispcCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        const double t0 = rtc();
-        for (int j = 0; j < nframes; ++j)
-            ispc::RenderStatic(&input->header, &input->arrays,
-                               VISUALIZE_LIGHT_COUNT,
-                               framebuffer.r, framebuffer.g, framebuffer.b);
-        double mcycles = 1000*(rtc() - t0) / nframes;
-        ispcCycles = std::min(ispcCycles, mcycles);
-    }
-    printf("[ispc static + tasks]:\t\t[%.3f] ms to render "
-           "%d x %d image\n", ispcCycles,
-           input->header.framebufferWidth, input->header.framebufferHeight);
-    WriteFrame("deferred-ispc-static.ppm", input, framebuffer);
-    return 0;
-#if 0
-
-#ifdef __cilk
-    double dynamicCilkCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        const double t0 = rtc();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicCilk(input, &framebuffer);
-        double mcycles = 1000*(rtc() - t0) / nframes;
-        dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
-    }
-    printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n", 
-           dynamicCilkCycles);
-    WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
-#endif // __cilk
-
-    double serialCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        const double t0 = rtc();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicC(input, &framebuffer);
-        double mcycles = 1000*(rtc() - t0) / nframes;
-        serialCycles = std::min(serialCycles, mcycles);
-    }
-    printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n", 
-           serialCycles);
-    WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
-
-#ifdef __cilk
-    printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n", 
-           serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
-#else
-    printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
-#endif // __cilk
-#endif
-
-    DeleteInputData(input);
-
-    return 0;
-}
--- a/examples_cuda/deferred/main_cu.cpp
+++ b/examples_cuda/deferred/main_cu.cpp
@@ -1,518 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define ISPC_IS_WINDOWS
-#define NOMINMAX
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#include <fcntl.h>
-#include <float.h>
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <sys/types.h>
-#include <stdint.h>
-#include <algorithm>
-#include <assert.h>
-#include <vector>
-#ifdef ISPC_IS_WINDOWS
-  #define WIN32_LEAN_AND_MEAN
-  #include <windows.h>
-#endif
-#include "kernels1_ispc.h"
-#include "deferred.h"
-#include "../timing.h"
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-/******************************/ 
-#include <cassert>
-#include <iostream>
-#include <cuda.h>
-#include "drvapi_error_string.h"
-
-#define checkCudaErrors(err)  __checkCudaErrors (err, __FILE__, __LINE__)
-// These are the inline versions for all of the SDK helper functions
-void __checkCudaErrors(CUresult err, const char *file, const int line) {
-  if(CUDA_SUCCESS != err) {
-    std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
-           << getCudaDrvErrorString(err) << "\" from file <" << file
-           << ">, line " << line << "\n";
-    exit(-1);
-  }
-}
-
-/**********************/
-/* Basic CUDriver API */
-CUcontext context;
-
-void createContext(const int deviceId = 0)
-{
-  CUdevice device;
-  int devCount;
-  checkCudaErrors(cuInit(0));
-  checkCudaErrors(cuDeviceGetCount(&devCount));
-  assert(devCount > 0);
-  checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
-
-  char name[128];
-  checkCudaErrors(cuDeviceGetName(name, 128, device));
-  std::cout << "Using CUDA Device [0]: " << name << "\n";
-
-  int devMajor, devMinor;
-  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
-  std::cout << "Device Compute Capability: " 
-    << devMajor << "." << devMinor << "\n";
-  if (devMajor < 2) {
-    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
-    exit(1); 
-  }
-
-  // Create driver context
-  checkCudaErrors(cuCtxCreate(&context, 0, device));
-    const size_t stackLimit = 4*1024;
- //   const size_t heapLimit = 1024*1024*1024;
-  checkCudaErrors(cuCtxSetLimit(CU_LIMIT_STACK_SIZE,stackLimit));
-//  checkCudaErrors(cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE,heapLimit));
-}
-void destroyContext()
-{
-  checkCudaErrors(cuCtxDestroy(context));
-}
-
-CUmodule loadModule(const char * module)
-{
-  const double t0 = rtc();
-  CUmodule cudaModule;
-  // in this branch we use compilation with parameters
-
-#if 0
-  unsigned int jitNumOptions = 1;
-  CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
-  void **jitOptVals = new void*[jitNumOptions];
-  // set up pointer to set the Maximum # of registers for a particular kernel
-  jitOptions[0] = CU_JIT_MAX_REGISTERS;
-  int jitRegCount = 64;
-  jitOptVals[0] = (void *)(size_t)jitRegCount;
-#if 0
-
-  {
-    jitNumOptions = 3;
-    // set up size of compilation log buffer
-    jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
-    int jitLogBufferSize = 1024;
-    jitOptVals[0] = (void *)(size_t)jitLogBufferSize;
-
-    // set up pointer to the compilation log buffer
-    jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
-    char *jitLogBuffer = new char[jitLogBufferSize];
-    jitOptVals[1] = jitLogBuffer;
-
-    // set up pointer to set the Maximum # of registers for a particular kernel
-    jitOptions[2] = CU_JIT_MAX_REGISTERS;
-    int jitRegCount = 32;
-    jitOptVals[2] = (void *)(size_t)jitRegCount;
-  }
-#endif
-
-  checkCudaErrors(cuModuleLoadDataEx(&cudaModule, module,jitNumOptions, jitOptions, (void **)jitOptVals));
-#else
-  CUlinkState  CUState;
-  CUlinkState *lState = &CUState;
-  const int nOptions = 8;
-    CUjit_option options[nOptions];
-    void* optionVals[nOptions];
-    float walltime;
-    const unsigned int logSize = 32768;
-    char error_log[logSize],
-         info_log[logSize];
-    void *cuOut;
-    size_t outSize;
-    int myErr = 0;
-
-    // Setup linker options
-    // Return walltime from JIT compilation
-    options[0] = CU_JIT_WALL_TIME;
-    optionVals[0] = (void*) &walltime;
-    // Pass a buffer for info messages
-    options[1] = CU_JIT_INFO_LOG_BUFFER;
-    optionVals[1] = (void*) info_log;
-    // Pass the size of the info buffer
-    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
-    optionVals[2] = (void*)(size_t) logSize;
-    // Pass a buffer for error message
-    options[3] = CU_JIT_ERROR_LOG_BUFFER;
-    optionVals[3] = (void*) error_log;
-    // Pass the size of the error buffer
-    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
-    optionVals[4] = (void*)(size_t) logSize;
-    // Make the linker verbose
-    options[5] = CU_JIT_LOG_VERBOSE;
-    optionVals[5] = (void*) 1;
-    // Max # of registers per thread
-    options[6] = CU_JIT_MAX_REGISTERS;
-    int jitRegCount = 48;
-    optionVals[6] = (void *)(size_t)jitRegCount;
-    // Caching
-    options[7] = CU_JIT_CACHE_MODE;
-    optionVals[7] = (void *)CU_JIT_CACHE_OPTION_CA;
-    // Create a pending linker invocation
-    checkCudaErrors(cuLinkCreate(nOptions,options, optionVals, lState));
-
-#if 0
-    if (sizeof(void *)==4)
-    {
-        // Load the PTX from the string myPtx32
-        printf("Loading myPtx32[] program\n");
-        // PTX May also be loaded from file, as per below.
-        myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)myPtx32, strlen(myPtx32)+1, 0, 0, 0, 0);
-    }
-    else
-#endif
-    {
-        // Load the PTX from the in-memory string passed in as `module`
-        fprintf(stderr, "Loading ptx..\n");
-        myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void*)module, strlen(module)+1, 0, 0, 0, 0);
-        myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_LIBRARY, "libcudadevrt.a", 0,0,0); 
-        // PTX May also be loaded from file, as per below.
-        // myErr = cuLinkAddFile(*lState, CU_JIT_INPUT_PTX, "myPtx64.ptx",0,0,0);
-    }
-
-    // Complete the linker step
-    myErr = cuLinkComplete(*lState, &cuOut, &outSize);
-
-    if ( myErr != CUDA_SUCCESS )
-    {
-      // Errors will be put in error_log, per CU_JIT_ERROR_LOG_BUFFER option above. 
-      fprintf(stderr,"PTX Linker Error:\n%s\n",error_log);
-      assert(0);
-    }    
-
-    // Linker walltime and info_log were requested in options above.
-    fprintf(stderr, "CUDA Link Completed in %fms [ %g ms]. Linker Output:\n%s\n",walltime,1e3*(rtc() - t0),info_log);
-
-    // Load resulting cuBin into module
-    checkCudaErrors(cuModuleLoadData(&cudaModule, cuOut));
-
-    // Destroy the linker invocation
-    checkCudaErrors(cuLinkDestroy(*lState));
-#endif
-  fprintf(stderr, " loadModule took %g ms \n", 1e3*(rtc() - t0));
-  return cudaModule;
-}
-void unloadModule(CUmodule &cudaModule)
-{
-  checkCudaErrors(cuModuleUnload(cudaModule));
-}
-
-CUfunction getFunction(CUmodule &cudaModule, const char * function)
-{
-  CUfunction cudaFunction;
-  checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
-  return cudaFunction;
-}
-  
-CUdeviceptr deviceMalloc(const size_t size)
-{
-  CUdeviceptr d_buf;
-  checkCudaErrors(cuMemAlloc(&d_buf, size));
-  return d_buf;
-}
-void deviceFree(CUdeviceptr d_buf)
-{
-  checkCudaErrors(cuMemFree(d_buf));
-}
-void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
-}
-void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
-}
-#define deviceLaunch(func,params) \
-  checkCudaErrors(cuFuncSetCacheConfig((func), CU_FUNC_CACHE_PREFER_L1)); \
-  checkCudaErrors( \
-      cuLaunchKernel( \
-        (func), \
-        1,1,1, \
-        32, 1, 1, \
-        0, NULL, (params), NULL \
-        ));
-
-
-typedef CUdeviceptr devicePtr;
-
-
-/**************/
-#include <vector>
-std::vector<char> readBinary(const char * filename)
-{
-  std::vector<char> buffer;
-  FILE *fp = fopen(filename, "rb");
-  if (!fp )
-  {
-    fprintf(stderr, "file %s not found\n", filename);
-    assert(0);
-  }
-#if 0
-  int c;  // fgetc() returns int; a plain char cannot reliably hold EOF
-  while ((c = fgetc(fp)) != EOF)
-    buffer.push_back(c);
-#else
-  fseek(fp, 0, SEEK_END); 
-  const unsigned long long size = ftell(fp);         /*calc the size needed*/
-  fseek(fp, 0, SEEK_SET); 
-  buffer.resize(size + 1);  // extra zero byte: loadModule() calls strlen() on this data
-
-  if (fp == NULL){ /* defensive re-check; fp was already validated above */
-    fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);           
-    exit(1);
-  }
-  else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
-    fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
-    exit(1);
-  }
-#endif
-  fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
-  return buffer;
-}
-
-extern "C" 
-{
-  double CUDALaunch(
-      void **handlePtr, 
-      const char * func_name,
-      void **func_args)
-  {
-    const std::vector<char> module_str = readBinary("__kernels.ptx");
-    const char *  module = &module_str[0];
-    CUmodule   cudaModule   = loadModule(module);
-    CUfunction cudaFunction = getFunction(cudaModule, func_name);
-    const double t0 = rtc();
-    deviceLaunch(cudaFunction, func_args);
-    checkCudaErrors(cuStreamSynchronize(0));
-    const double dt = rtc() - t0;
-    unloadModule(cudaModule);
-    return dt;
-  }
-}
-/******************************/
-
-///////////////////////////////////////////////////////////////////////////
-
-int main(int argc, char** argv) {
-    if (argc != 2) {
-        printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
-        return 1;
-    }
-
-    InputData *input = CreateInputDataFromFile(argv[1]);
-    if (!input) {
-        printf("Failed to load input file \"%s\"!\n", argv[1]);
-        return 1;
-    }
-
-    Framebuffer framebuffer(input->header.framebufferWidth,
-                            input->header.framebufferHeight);
-
-//    InitDynamicC(input);
-#if 0
-#ifdef __cilk
-    InitDynamicCilk(input);
-#endif // __cilk
-#endif
-  
-    /*******************/
-  createContext();
-  /*******************/
-
-  devicePtr d_header = deviceMalloc(sizeof(ispc::InputHeader));
-  devicePtr d_arrays = deviceMalloc(sizeof(ispc::InputDataArrays));
-  const int buffsize = input->header.framebufferWidth*input->header.framebufferHeight;
-  devicePtr d_r      = deviceMalloc(buffsize);
-  devicePtr d_g      = deviceMalloc(buffsize);
-  devicePtr d_b      = deviceMalloc(buffsize);
-
-  for (int i = 0; i < buffsize; i++)
-    framebuffer.r[i] = framebuffer.g[i] = framebuffer.b[i] = 0;
-
-  
-  ispc::InputDataArrays dh_arrays;
-  {
-    devicePtr d_chunk = deviceMalloc(input->header.inputDataChunkSize);
-    memcpyH2D(d_chunk, input->chunk, input->header.inputDataChunkSize);
-
-    dh_arrays.zBuffer = (float*)(d_chunk + input->header.inputDataArrayOffsets[idaZBuffer]);
-    dh_arrays.normalEncoded_x =
-        (uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaNormalEncoded_x]);
-    fprintf(stderr, "%p %p \n",
-        dh_arrays.zBuffer, dh_arrays.normalEncoded_x);
-    fprintf(stderr, " diff= %d  %d \n", 
-        input->header.inputDataArrayOffsets[idaZBuffer],
-        input->header.inputDataArrayOffsets[idaNormalEncoded_x]);
-
-    dh_arrays.normalEncoded_y =
-        (uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaNormalEncoded_y]);
-    dh_arrays.specularAmount =
-        (uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaSpecularAmount]);
-    dh_arrays.specularPower =
-        (uint16_t *)(d_chunk+input->header.inputDataArrayOffsets[idaSpecularPower]);
-    dh_arrays.albedo_x =
-        (uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_x]);
-    dh_arrays.albedo_y =
-        (uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_y]);
-    dh_arrays.albedo_z =
-        (uint8_t *)(d_chunk+input->header.inputDataArrayOffsets[idaAlbedo_z]);
-    dh_arrays.lightPositionView_x =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_x]);
-    dh_arrays.lightPositionView_y =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_y]);
-    dh_arrays.lightPositionView_z =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightPositionView_z]);
-    dh_arrays.lightAttenuationBegin =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightAttenuationBegin]);
-    dh_arrays.lightColor_x =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_x]);
-    dh_arrays.lightColor_y =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_y]);
-    dh_arrays.lightColor_z =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightColor_z]);
-    dh_arrays.lightAttenuationEnd =
-        (float *)(d_chunk+input->header.inputDataArrayOffsets[idaLightAttenuationEnd]);
-  }
-
-  memcpyH2D(d_header, &input->header, sizeof(ispc::InputHeader));
-  memcpyH2D(d_arrays, &dh_arrays, sizeof(ispc::InputDataArrays));
-  memcpyH2D(d_r, framebuffer.r, buffsize);
-  memcpyH2D(d_g, framebuffer.g, buffsize);
-  memcpyH2D(d_b, framebuffer.b, buffsize);
-
-
-    int nframes = 5;
-    double ispcCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        const double t0 = rtc();
-        double dt = 0.0;
-        for (int j = 0; j < nframes; ++j)
-        {
-          const char * func_name = "RenderStatic";
-          int light_count = VISUALIZE_LIGHT_COUNT;
-          void *func_args[] = {
-            &d_header, 
-            &d_arrays,
-            &light_count,
-            &d_r, 
-            &d_g, 
-            &d_b};
-          dt += CUDALaunch(NULL, func_name, func_args);
-        }
-        //double mcycles = 1000*(rtc() - t0) / nframes;
-        double mcycles = 1000*dt / nframes;
-        fprintf(stderr, "dt= %g\n", mcycles);
-        ispcCycles = std::min(ispcCycles, mcycles);
-    }
-
-    memcpyD2H(framebuffer.r, d_r, buffsize);
-    memcpyD2H(framebuffer.g, d_g, buffsize);
-    memcpyD2H(framebuffer.b, d_b, buffsize);
-
-    printf("[ispc cuda]:\t\t[%.3f] ms to render "
-           "%d x %d image\n", ispcCycles,
-           input->header.framebufferWidth, input->header.framebufferHeight);
-    WriteFrame("deferred-cuda.ppm", input, framebuffer);
-
-  /*******************/
-  destroyContext();
-  /*******************/
-    return 0;
-
-#if 0
-
-#ifdef __cilk
-    double dynamicCilkCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        reset_and_start_timer();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicCilk(input, &framebuffer);
-        double mcycles = get_elapsed_mcycles() / nframes;
-        dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
-    }
-    printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n", 
-           dynamicCilkCycles);
-    WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
-#endif // __cilk
-
-    double serialCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        reset_and_start_timer();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicC(input, &framebuffer);
-        double mcycles = get_elapsed_mcycles() / nframes;
-        serialCycles = std::min(serialCycles, mcycles);
-    }
-    printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n", 
-           serialCycles);
-    WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
-
-#ifdef __cilk
-    printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n", 
-           serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
-#else
-    printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
-#endif // __cilk
-#endif
-
-    DeleteInputData(input);
-
-    return 0;
-}
--- a/examples_cuda/deferred/main_host.cpp
+++ b/examples_cuda/deferred/main_host.cpp
@@ -1,163 +0,0 @@
-/*
-  Copyright (c) 2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define ISPC_IS_WINDOWS
-#define NOMINMAX
-#elif defined(__linux__)
-#define ISPC_IS_LINUX
-#elif defined(__APPLE__)
-#define ISPC_IS_APPLE
-#endif
-
-#include <fcntl.h>
-#include <float.h>
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <sys/types.h>
-#include <stdint.h>
-#include <algorithm>
-#include <assert.h>
-#include <vector>
-#ifdef ISPC_IS_WINDOWS
-  #define WIN32_LEAN_AND_MEAN
-  #include <windows.h>
-#endif
-#include "deferred.h"
-#include "kernels_ispc.h"
-#include "../timing.h"
-
-#include <sys/time.h>
-static inline double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-///////////////////////////////////////////////////////////////////////////
-
-int main(int argc, char** argv) {
-    if (argc != 2) {
-        printf("usage: deferred_shading <input_file (e.g. data/pp1280x720.bin)>\n");
-        return 1;
-    }
-
-    InputData *input = CreateInputDataFromFile(argv[1]);
-    if (!input) {
-        printf("Failed to load input file \"%s\"!\n", argv[1]);
-        return 1;
-    }
-
-    Framebuffer framebuffer(input->header.framebufferWidth,
-                            input->header.framebufferHeight);
-
-#if 0
-    InitDynamicC(input);
-#ifdef __cilk
-    InitDynamicCilk(input);
-#endif // __cilk
-#endif
-
-  const int buffsize = input->header.framebufferWidth*input->header.framebufferHeight;
-  for (int i = 0; i < buffsize; i++)
-    framebuffer.r[i] = framebuffer.g[i] = framebuffer.b[i] = 0;
-
-    int nframes = 5;
-    double ispcCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        const double t0 = rtc();
-        for (int j = 0; j < nframes; ++j)
-            ispc::RenderStatic(&input->header, &input->arrays,
-                                input->header,
-                               VISUALIZE_LIGHT_COUNT,
-                               framebuffer.r, framebuffer.g, framebuffer.b);
-        double mcycles = 1000*(rtc() - t0) / nframes;
-        ispcCycles = std::min(ispcCycles, mcycles);
-    }
-    printf("[ispc static + tasks]:\t\t[%.3f] million cycles to render "
-           "%d x %d image\n", ispcCycles,
-           input->header.framebufferWidth, input->header.framebufferHeight);
-    WriteFrame("deferred-ispc-static.ppm", input, framebuffer);
-    return 0;
-
-#if 0
-
-#ifdef __cilk
-    double dynamicCilkCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        reset_and_start_timer();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicCilk(input, &framebuffer);
-        double mcycles = get_elapsed_mcycles() / nframes;
-        dynamicCilkCycles = std::min(dynamicCilkCycles, mcycles);
-    }
-    printf("[ispc + Cilk dynamic]:\t\t[%.3f] million cycles to render image\n", 
-           dynamicCilkCycles);
-    WriteFrame("deferred-ispc-dynamic.ppm", input, framebuffer);
-#endif // __cilk
-
-    double serialCycles = 1e30;
-    for (int i = 0; i < 5; ++i) {
-        framebuffer.clear();
-        reset_and_start_timer();
-        for (int j = 0; j < nframes; ++j)
-            DispatchDynamicC(input, &framebuffer);
-        double mcycles = get_elapsed_mcycles() / nframes;
-        serialCycles = std::min(serialCycles, mcycles);
-    }
-    printf("[C++ serial dynamic, 1 core]:\t[%.3f] million cycles to render image\n", 
-           serialCycles);
-    WriteFrame("deferred-serial-dynamic.ppm", input, framebuffer);
-
-#ifdef __cilk
-    printf("\t\t\t\t(%.2fx speedup from static ISPC, %.2fx from Cilk+ISPC)\n", 
-           serialCycles/ispcCycles, serialCycles/dynamicCilkCycles);
-#else
-    printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", serialCycles/ispcCycles);
-#endif // __cilk
-#endif
-
-    DeleteInputData(input);
-
-    return 0;
-}
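Note on the timing loop in the removed main_host.cpp: it keeps the minimum, over five trials, of the average per-frame wall-clock time. `1000*(rtc() - t0)` is in milliseconds, although the printf labels the figure "million cycles". A minimal sketch of the same best-of-N pattern follows, assuming `std::chrono::steady_clock` in place of `gettimeofday`. The helper name `bestMillisPerCall` and its parameters are hypothetical and not part of the removed sources.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the removed file): runs `work` `inner`
// times per trial, repeats for `trials` trials, and returns the smallest
// average wall-clock time per call, in milliseconds.
template <typename Work>
double bestMillisPerCall(Work work, int trials, int inner)
{
    using Clock = std::chrono::steady_clock;
    double best = 1e30;  // same "very large" starting value as the original loop
    for (int t = 0; t < trials; ++t) {
        const Clock::time_point t0 = Clock::now();
        for (int j = 0; j < inner; ++j)
            work();
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - t0;
        best = std::min(best, elapsed.count() / inner);
    }
    return best;
}
```

Taking the minimum rather than the mean discards trials disturbed by other activity on the machine. A monotonic clock is used here because an adjustment to the system time cannot make an interval come out negative.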
--- a/examples_cuda/drvapi_error_string.h
+++ b/examples_cuda/drvapi_error_string.h
@@ -1,370 +0,0 @@
-/*
- * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
- *
- * Please refer to the NVIDIA end user license agreement (EULA) associated
- * with this source code for terms and conditions that govern your use of
- * this software. Any use, reproduction, disclosure, or distribution of
- * this software and related documentation outside the terms of the EULA
- * is strictly prohibited.
- *
- */
- 
-#ifndef _DRVAPI_ERROR_STRING_H_
-#define _DRVAPI_ERROR_STRING_H_
-
-#include <stdio.h>
-#include <string.h>
-#include <stdlib.h>
-
-// Error Code string definitions here
-typedef struct
-{
-    char const *error_string;
-    int  error_id;
-} s_CudaErrorStr;
-
-/**
- * Error codes
- */
-static s_CudaErrorStr sCudaDrvErrorString[] =
-{
-    /**
-     * The API call returned with no errors. In the case of query calls, this
-     * can also mean that the operation being queried is complete (see
-     * ::cuEventQuery() and ::cuStreamQuery()).
-     */
-    { "CUDA_SUCCESS", 0 },
-
-    /**
-     * This indicates that one or more of the parameters passed to the API call
-     * is not within an acceptable range of values.
-     */
-    { "CUDA_ERROR_INVALID_VALUE", 1 },
-
-    /**
-     * The API call failed because it was unable to allocate enough memory to
-     * perform the requested operation.
-     */
-    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
-
-    /**
-     * This indicates that the CUDA driver has not been initialized with
-     * ::cuInit() or that initialization has failed.
-     */
-    { "CUDA_ERROR_NOT_INITIALIZED", 3 },
-
-    /**
-     * This indicates that the CUDA driver is in the process of shutting down.
-     */
-    { "CUDA_ERROR_DEINITIALIZED", 4 },
-
-    /**
-     * This indicates profiling APIs are called while application is running
-     * in visual profiler mode. 
-    */
-    { "CUDA_ERROR_PROFILER_DISABLED", 5 },
-    /**
-     * This indicates profiling has not been initialized for this context. 
-     * Call cuProfilerInitialize() to resolve this. 
-    */
-    { "CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6 },
-    /**
-     * This indicates profiler has already been started and probably
-     * cuProfilerStart() is incorrectly called.
-    */
-    { "CUDA_ERROR_PROFILER_ALREADY_STARTED", 7 },
-    /**
-     * This indicates profiler has already been stopped and probably
-     * cuProfilerStop() is incorrectly called.
-    */
-    { "CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8 },  
-    /**
-     * This indicates that no CUDA-capable devices were detected by the installed
-     * CUDA driver.
-     */
-    { "CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100 },
-
-    /**
-     * This indicates that the device ordinal supplied by the user does not
-     * correspond to a valid CUDA device.
-     */
-    { "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", 101 },
-
-
-    /**
-     * This indicates that the device kernel image is invalid. This can also
-     * indicate an invalid CUDA module.
-     */
-    { "CUDA_ERROR_INVALID_IMAGE", 200 },
-
-    /**
-     * This most frequently indicates that there is no context bound to the
-     * current thread. This can also be returned if the context passed to an
-     * API call is not a valid handle (such as a context that has had
-     * ::cuCtxDestroy() invoked on it). This can also be returned if a user
-     * mixes different API versions (i.e. 3010 context with 3020 API calls).
-     * See ::cuCtxGetApiVersion() for more details.
-     */
-    { "CUDA_ERROR_INVALID_CONTEXT", 201 },
-
-    /**
-     * This indicates that the context being supplied as a parameter to the
-     * API call was already the active context.
-     * \deprecated
-     * This error return is deprecated as of CUDA 3.2. It is no longer an
-     * error to attempt to push the active context via ::cuCtxPushCurrent().
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202 },
-
-    /**
-     * This indicates that a map or register operation has failed.
-     */
-    { "CUDA_ERROR_MAP_FAILED", 205 },
-
-    /**
-     * This indicates that an unmap or unregister operation has failed.
-     */
-    { "CUDA_ERROR_UNMAP_FAILED", 206 },
-
-    /**
-     * This indicates that the specified array is currently mapped and thus
-     * cannot be destroyed.
-     */
-    { "CUDA_ERROR_ARRAY_IS_MAPPED", 207 },
-
-    /**
-     * This indicates that the resource is already mapped.
-     */
-    { "CUDA_ERROR_ALREADY_MAPPED", 208 },
-
-    /**
-     * This indicates that there is no kernel image available that is suitable
-     * for the device. This can occur when a user specifies code generation
-     * options for a particular CUDA source file that do not include the
-     * corresponding device configuration.
-     */
-    { "CUDA_ERROR_NO_BINARY_FOR_GPU", 209 },
-
-    /**
-     * This indicates that a resource has already been acquired.
-     */
-    { "CUDA_ERROR_ALREADY_ACQUIRED", 210 },
-
-    /**
-     * This indicates that a resource is not mapped.
-     */
-    { "CUDA_ERROR_NOT_MAPPED", 211 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as an
-     * array.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212 },
-
-    /**
-     * This indicates that a mapped resource is not available for access as a
-     * pointer.
-     */
-    { "CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213 },
-
-    /**
-     * This indicates that an uncorrectable ECC error was detected during
-     * execution.
-     */
-    { "CUDA_ERROR_ECC_UNCORRECTABLE", 214 },
-
-    /**
-     * This indicates that the ::CUlimit passed to the API call is not
-     * supported by the active device.
-     */
-    { "CUDA_ERROR_UNSUPPORTED_LIMIT", 215 },
-
-    /**
-     * This indicates that the ::CUcontext passed to the API call can
-     * only be bound to a single CPU thread at a time but is already 
-     * bound to a CPU thread.
-     */
-    { "CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216 },
-
-    /**
-     * This indicates that peer access is not supported across the given
-     * devices.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217 },
-
-    /**
-     * This indicates that the device kernel source is invalid.
-     */
-    { "CUDA_ERROR_INVALID_SOURCE", 300 },
-
-    /**
-     * This indicates that the file specified was not found.
-     */
-    { "CUDA_ERROR_FILE_NOT_FOUND", 301 },
-
-    /**
-     * This indicates that a link to a shared object failed to resolve.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302 },
-
-    /**
-     * This indicates that initialization of a shared object failed.
-     */
-    { "CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303 },
-
-    /**
-     * This indicates that an OS call failed.
-     */
-    { "CUDA_ERROR_OPERATING_SYSTEM", 304 },
-
-
-    /**
-     * This indicates that a resource handle passed to the API call was not
-     * valid. Resource handles are opaque types like ::CUstream and ::CUevent.
-     */
-    { "CUDA_ERROR_INVALID_HANDLE", 400 },
-
-
-    /**
-     * This indicates that a named symbol was not found. Examples of symbols
-     * are global/constant variable names, texture names, and surface names.
-     */
-    { "CUDA_ERROR_NOT_FOUND", 500 },
-
-
-    /**
-     * This indicates that asynchronous operations issued previously have not
-     * completed yet. This result is not actually an error, but must be indicated
-     * differently than ::CUDA_SUCCESS (which indicates completion). Calls that
-     * may return this value include ::cuEventQuery() and ::cuStreamQuery().
-     */
-    { "CUDA_ERROR_NOT_READY", 600 },
-
-
-    /**
-     * An exception occurred on the device while executing a kernel. Common
-     * causes include dereferencing an invalid device pointer and accessing
-     * out of bounds shared memory. The context cannot be used, so it must
-     * be destroyed (and a new one should be created). All existing device
-     * memory allocations from this context are invalid and must be
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_FAILED", 700 },
-
-    /**
-     * This indicates that a launch did not occur because it did not have
-     * appropriate resources. This error usually indicates that the user has
-     * attempted to pass too many arguments to the device kernel, or the
-     * kernel launch specifies too many threads for the kernel's register
-     * count. Passing arguments of the wrong size (i.e. a 64-bit pointer
-     * when a 32-bit int is expected) is equivalent to passing too many
-     * arguments and can also result in this error.
-     */
-    { "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701 },
-
-    /**
-     * This indicates that the device kernel took too long to execute. This can
-     * only occur if timeouts are enabled - see the device attribute
-     * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
-     * context cannot be used (and must be destroyed similar to
-     * ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
-     * this context are invalid and must be reconstructed if the program is to
-     * continue using CUDA.
-     */
-    { "CUDA_ERROR_LAUNCH_TIMEOUT", 702 },
-
-    /**
-     * This error indicates a kernel launch that uses an incompatible texturing
-     * mode.
-     */
-    { "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703 },
-    
-    /**
-     * This error indicates that a call to ::cuCtxEnablePeerAccess() is
-     * trying to re-enable peer access to a context which has already
-     * had peer access to it enabled.
-     */
-    { "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704 },
-
-    /**
-     * This error indicates that ::cuCtxDisablePeerAccess() is 
-     * trying to disable peer access which has not been enabled yet 
-     * via ::cuCtxEnablePeerAccess(). 
-     */
-    { "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705 },
-
-    /**
-     * This error indicates that the primary context for the specified device
-     * has already been initialized.
-     */
-    { "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708 },
-
-    /**
-     * This error indicates that the context current to the calling thread
-     * has been destroyed using ::cuCtxDestroy, or is a primary context which
-     * has not yet been initialized.
-     */
-    { "CUDA_ERROR_CONTEXT_IS_DESTROYED", 709 },
-
-    /**
-     * A device-side assert triggered during kernel execution. The context
-     * cannot be used anymore, and must be destroyed. All existing device 
-     * memory allocations from this context are invalid and must be 
-     * reconstructed if the program is to continue using CUDA.
-     */
-    { "CUDA_ERROR_ASSERT", 710 },
-
-    /**
-     * This error indicates that the hardware resources required to enable
-     * peer access have been exhausted for one or more of the devices 
-     * passed to ::cuCtxEnablePeerAccess().
-     */
-    { "CUDA_ERROR_TOO_MANY_PEERS", 711 },
-
-    /**
-     * This error indicates that the memory range passed to ::cuMemHostRegister()
-     * has already been registered.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712 },
-
-    /**
-     * This error indicates that the pointer passed to ::cuMemHostUnregister()
-     * does not correspond to any currently registered memory region.
-     */
-    { "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713 },
-
-    /**
-     * This error indicates that the attempted operation is not permitted.
-     */
-    { "CUDA_ERROR_NOT_PERMITTED", 800 },
-
-    /**
-     * This error indicates that the attempted operation is not supported
-     * on the current system or device.
-     */
-    { "CUDA_ERROR_NOT_SUPPORTED", 801 },
-
-    /**
-     * This indicates that an unknown internal error has occurred.
-     */
-    { "CUDA_ERROR_UNKNOWN", 999 },
-    { NULL, -1 }
-};
-
-// This is just a linear search through the array, since the error_id's are not
-// always occurring consecutively
-const char * getCudaDrvErrorString(CUresult error_id)
-{
-    int index = 0;
-    while (sCudaDrvErrorString[index].error_id != error_id && 
-           sCudaDrvErrorString[index].error_id != -1)
-    {
-        index++;
-    }
-    if (sCudaDrvErrorString[index].error_id == error_id)
-        return (const char *)sCudaDrvErrorString[index].error_string;
-    else
-        return (const char *)"CUDA_ERROR not found!";
-}
-
-#endif
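Note on the removed drvapi_error_string.h: the lookup is a linear scan over a sentinel-terminated table, which suits error ids that are sparse (0 to 8, then 100, 200, and so on). A minimal sketch of the pattern follows, with a plain `int` standing in for `CUresult` and only a handful of the entries. The names `ErrorId`, `ErrorEntry`, `kErrorTable` and `errorName` are hypothetical. Unlike the header, this version stops on the NULL name rather than on id -1. In the header as written, an id of -1 appears to match the sentinel entry and return its NULL string.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the driver API's CUresult enum, which the
// real header expects the including file to provide.
typedef int ErrorId;

struct ErrorEntry {
    const char *name;  // NULL marks the sentinel entry
    ErrorId id;
};

// A few entries copied from the table above, plus the sentinel.
static const ErrorEntry kErrorTable[] = {
    { "CUDA_SUCCESS", 0 },
    { "CUDA_ERROR_INVALID_VALUE", 1 },
    { "CUDA_ERROR_OUT_OF_MEMORY", 2 },
    { "CUDA_ERROR_LAUNCH_FAILED", 700 },
    { "CUDA_ERROR_UNKNOWN", 999 },
    { NULL, -1 }
};

// Linear search through the table. The loop ends on the NULL name, so an
// unknown id (including -1) always yields the fallback string.
static const char *errorName(ErrorId id)
{
    for (const ErrorEntry *e = kErrorTable; e->name != NULL; ++e) {
        if (e->id == id)
            return e->name;
    }
    return "CUDA_ERROR not found!";
}
```

A caller would typically pass the result straight to a logging call, for example `fprintf(stderr, "%s\n", errorName(code))`, which is safe here because the function never returns NULL.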
--- a/examples_cuda/examples.sln
+++ b/examples_cuda/examples.sln
@@ -1,136 +0,0 @@
-
-Microsoft Visual Studio Solution File, Format Version 11.00
-# Visual Studio 2010
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "simple", "simple\simple.vcxproj", "{947C5311-8B78-4D05-BEE4-BCF342D4B367}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "rt", "rt\rt.vcxproj", "{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "aobench", "aobench\aobench.vcxproj", "{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "mandelbrot", "mandelbrot\mandelbrot.vcxproj", "{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "options", "options\options.vcxproj", "{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "mandelbrot_tasks", "mandelbrot_tasks\mandelbrot_tasks.vcxproj", "{E80DA7D4-AB22-4648-A068-327307156BE6}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "aobench_instrumented", "aobench_instrumented\aobench_instrumented.vcxproj", "{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "noise", "noise\noise.vcxproj", "{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "volume", "volume_rendering\volume.vcxproj", "{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "stencil", "stencil\stencil.vcxproj", "{2EF070A1-F62F-4E6A-944B-88D140945C3C}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "deferred_shading", "deferred\deferred_shading.vcxproj", "{87F53C53-957E-4E91-878A-BC27828FB9EB}"
-EndProject
-Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "perfbench", "perfbench\perfbench.vcxproj", "{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}"
-EndProject
-Global
-	GlobalSection(SolutionConfigurationPlatforms) = preSolution
-		Debug|Win32 = Debug|Win32
-		Debug|x64 = Debug|x64
-		Release|Win32 = Release|Win32
-		Release|x64 = Release|x64
-	EndGlobalSection
-	GlobalSection(ProjectConfigurationPlatforms) = postSolution
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|Win32.ActiveCfg = Debug|Win32
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|Win32.Build.0 = Debug|Win32
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|x64.ActiveCfg = Debug|x64
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Debug|x64.Build.0 = Debug|x64
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|Win32.ActiveCfg = Release|Win32
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|Win32.Build.0 = Release|Win32
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|x64.ActiveCfg = Release|x64
-		{947C5311-8B78-4D05-BEE4-BCF342D4B367}.Release|x64.Build.0 = Release|x64
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|Win32.ActiveCfg = Debug|Win32
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|Win32.Build.0 = Debug|Win32
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|x64.ActiveCfg = Debug|x64
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Debug|x64.Build.0 = Debug|x64
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|Win32.ActiveCfg = Release|Win32
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|Win32.Build.0 = Release|Win32
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|x64.ActiveCfg = Release|x64
-		{E787BC3F-2D2E-425E-A64D-4721E2FF3DC9}.Release|x64.Build.0 = Release|x64
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|Win32.ActiveCfg = Debug|Win32
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|Win32.Build.0 = Debug|Win32
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|x64.ActiveCfg = Debug|x64
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Debug|x64.Build.0 = Debug|x64
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|Win32.ActiveCfg = Release|Win32
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|Win32.Build.0 = Release|Win32
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|x64.ActiveCfg = Release|x64
-		{F29204CA-19DF-4F3C-87D5-03F4EEDAAFEB}.Release|x64.Build.0 = Release|x64
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|Win32.ActiveCfg = Debug|Win32
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|Win32.Build.0 = Debug|Win32
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|x64.ActiveCfg = Debug|x64
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Debug|x64.Build.0 = Debug|x64
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|Win32.ActiveCfg = Release|Win32
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|Win32.Build.0 = Release|Win32
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|x64.ActiveCfg = Release|x64
-		{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}.Release|x64.Build.0 = Release|x64
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|Win32.ActiveCfg = Debug|Win32
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|Win32.Build.0 = Debug|Win32
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|x64.ActiveCfg = Debug|x64
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Debug|x64.Build.0 = Debug|x64
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|Win32.ActiveCfg = Release|Win32
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|Win32.Build.0 = Release|Win32
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|x64.ActiveCfg = Release|x64
-		{8C7B5D29-1E76-44E6-BBB8-09830E5DEEAE}.Release|x64.Build.0 = Release|x64
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|Win32.ActiveCfg = Debug|Win32
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|Win32.Build.0 = Debug|Win32
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|x64.ActiveCfg = Debug|x64
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Debug|x64.Build.0 = Debug|x64
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|Win32.ActiveCfg = Release|Win32
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|Win32.Build.0 = Release|Win32
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|x64.ActiveCfg = Release|x64
-		{E80DA7D4-AB22-4648-A068-327307156BE6}.Release|x64.Build.0 = Release|x64
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|Win32.ActiveCfg = Debug|Win32
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|Win32.Build.0 = Debug|Win32
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|x64.ActiveCfg = Debug|x64
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Debug|x64.Build.0 = Debug|x64
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|Win32.ActiveCfg = Release|Win32
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|Win32.Build.0 = Release|Win32
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|x64.ActiveCfg = Release|x64
-		{B3B4AE3D-6D5A-4CF9-AF5B-43CF2131B958}.Release|x64.Build.0 = Release|x64
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|Win32.ActiveCfg = Debug|Win32
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|Win32.Build.0 = Debug|Win32
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|x64.ActiveCfg = Debug|x64
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Debug|x64.Build.0 = Debug|x64
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|Win32.ActiveCfg = Release|Win32
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|Win32.Build.0 = Release|Win32
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|x64.ActiveCfg = Release|x64
-		{0E0886D8-8B5E-4EAF-9A21-91E63DAF81FD}.Release|x64.Build.0 = Release|x64
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|Win32.ActiveCfg = Debug|Win32
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|Win32.Build.0 = Debug|Win32
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|x64.ActiveCfg = Debug|x64
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Debug|x64.Build.0 = Debug|x64
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|Win32.ActiveCfg = Release|Win32
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|Win32.Build.0 = Release|Win32
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|x64.ActiveCfg = Release|x64
-		{DEE5733A-E93E-449D-9114-9BFFCAEB4DF9}.Release|x64.Build.0 = Release|x64
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|Win32.ActiveCfg = Debug|Win32
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|Win32.Build.0 = Debug|Win32
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|x64.ActiveCfg = Debug|x64
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Debug|x64.Build.0 = Debug|x64
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|Win32.ActiveCfg = Release|Win32
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|Win32.Build.0 = Release|Win32
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|x64.ActiveCfg = Release|x64
-		{2EF070A1-F62F-4E6A-944B-88D140945C3C}.Release|x64.Build.0 = Release|x64
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|Win32.ActiveCfg = Debug|Win32
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|Win32.Build.0 = Debug|Win32
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|x64.ActiveCfg = Debug|x64
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Debug|x64.Build.0 = Debug|x64
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|Win32.ActiveCfg = Release|Win32
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|Win32.Build.0 = Release|Win32
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|x64.ActiveCfg = Release|x64
-		{87F53C53-957E-4E91-878A-BC27828FB9EB}.Release|x64.Build.0 = Release|x64
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|Win32.ActiveCfg = Debug|Win32
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|Win32.Build.0 = Debug|Win32
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|x64.ActiveCfg = Debug|x64
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Debug|x64.Build.0 = Debug|x64
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|Win32.ActiveCfg = Release|Win32
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|Win32.Build.0 = Release|Win32
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|x64.ActiveCfg = Release|x64
-		{D923BB7E-A7C8-4850-8FCF-0EB9CE35B4E8}.Release|x64.Build.0 = Release|x64
-	EndGlobalSection
-	GlobalSection(SolutionProperties) = preSolution
-		HideSolutionNode = FALSE
-	EndGlobalSection
-EndGlobal
--- a/examples_cuda/gmres/Makefile
+++ b/examples_cuda/gmres/Makefile
@@ -1,9 +0,0 @@
-
-EXAMPLE=gmres
-CPP_SRC=algorithm.cpp main.cpp matrix.cpp
-CC_SRC=mmio.c
-ISPC_SRC=matrix.ispc
-ISPC_IA_TARGETS=sse2,sse4-x2,avx-x2
-ISPC_ARM_TARGETS=neon
-
-include ../common.mk
--- a/examples_cuda/gmres/algorithm.cpp
+++ b/examples_cuda/gmres/algorithm.cpp
@@ -1,231 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-/*===========================================================================*\
-|* Includes
-\*===========================================================================*/
-#include "algorithm.h"
-#include "stdio.h"
-#include "debug.h"
-
-
-/*===========================================================================*\
-|* GMRES
-\*===========================================================================*/
-/* upper_triangular_right_solve:
- * ----------------------------
- * Given upper triangular matrix R and rhs vector b, solve for
- * x.  This "solve" ignores the rows, columns of R that are greater than the
- * dimensions of x.
- */
-void upper_triangular_right_solve (const DenseMatrix &R, const Vector &b, Vector &x) 
-{
-    // Dimensionality check
-    ASSERT(R.rows() >= b.size());
-    ASSERT(R.cols() >= x.size());
-    ASSERT(b.size() >= x.size());
-
-    int max_row = x.size() - 1;
-
-    // first solve step:
-    x[max_row] = b[max_row] / R(max_row, max_row);
-
-    for (int row = max_row - 1; row >= 0; row--) {
-        double xi = b[row];
-        for (int col = max_row; col > row; col--)
-            xi -= x[col] * R(row, col);
-        x[row] = xi / R(row, row);
-    }
-}
-
-/* create_rotation (used in gmres):
- * -------------------------------
- * Construct a Givens rotation to zero out the lowest non-zero entry in a partially
- * factored Hessenberg matrix.  Note that the previous Givens rotations should be
- * applied to this column before creating a new rotation.
- */
-void create_rotation (const DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn) 
-{
-    double a = H(col,     col);
-    double b = H(col + 1, col);
-    double r;
-
-    if (b == 0) {
-        Cn[col] = copysign(1, a);
-        Sn[col] = 0;
-    } 
-    else if (a == 0) {
-        Cn[col] = 0;
-        Sn[col] = copysign(1, b);
-    }
-    else {
-        r       = sqrt(a*a + b*b);
-        Sn[col] = -b / r;
-        Cn[col] =  a / r;
-    }
-}
-
-/* Applies the 'col'th Givens rotation stored in vectors Sn and Cn to the 'col'th 
- * column of the DenseMatrix M.  (Previous columns don't need the rotation applied b/c
- * presumably, the first col-1 columns are already upper triangular, and so their
- * entries in the col and col+1 rows are 0.)
- */
-void apply_rotation (DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn) 
-{
-    double c = Cn[col];
-    double s = Sn[col];
-    double tmp    = c * H(col, col) - s * H(col+1, col);
-    H(col+1, col) = s * H(col, col) + c * H(col+1, col);
-    H(col,   col) = tmp;
-}
-
-/* Applies the 'col'th Givens rotation to the vector.
- */
-void apply_rotation (Vector &v, size_t col, Vector &Cn, Vector &Sn) 
-{
-    double a = v[col];
-    double b = v[col + 1];
-
-    double c = Cn[col];
-    double s = Sn[col];
-
-    v[col]     = c * a - s * b;
-    v[col + 1] = s * a + c * b;
-}
-
-/* Applies the first 'col' Givens rotations to the newly-created column
- * of H.  (Leaves other columns alone.)
- */
-void update_column (DenseMatrix &H, size_t col, Vector &Cn, Vector &Sn) 
-{
-    for (size_t i = 0; i < col; i++) {
-        double c    = Cn[i];
-        double s    = Sn[i];
-        double t    = c * H(i,col) - s * H(i+1,col);
-        H(i+1, col) = s * H(i,col) + c * H(i+1,col);
-        H(i,   col) = t;
-    }
-}
-
-/* After a new column has been added to the Hessenberg matrix, factor it back into
- * an upper-triangular matrix by:
- * - applying the previous Givens rotations to the new column
- * - computing the new Givens rotation to make the column upper triangular
- * - applying the new Givens rotation to the column, and
- * - applying the new Givens rotation to the solution vector
- */
-void update_qr_decomp (DenseMatrix &H, Vector &s, size_t col, Vector &Cn, Vector &Sn)
-{
-    update_column(  H, col, Cn, Sn);
-    create_rotation(H, col, Cn, Sn);
-    apply_rotation( H, col, Cn, Sn);
-    apply_rotation( s, col, Cn, Sn);
-}
-
-void gmres (const Matrix &A, const Vector &b, Vector &x, int num_iters, double max_err)  
-{
-    DEBUG_PRINT("gmres starting!\n");
-    x.zero();
-
-    ASSERT(A.rows() == A.cols());
-    DenseMatrix Qstar(num_iters + 1, A.rows());
-    DenseMatrix H(num_iters + 1, num_iters);
-
-    // arrays for storing parameters of givens rotations
-    Vector Sn(num_iters);
-    Vector Cn(num_iters);
-
-    // array for storing the rhs projected onto the Hessenberg matrix's column space
-    Vector G(num_iters+1);
-    G.zero();
-
-    double beta = b.norm();
-    G[0] = beta;
-
-    // temp vector, stores Aqi
-    Vector w(A.rows());
-
-    w.copy(b);
-    w.normalize();
-    Qstar.set_row(0, w);
-
-    int iter = 0;
-    Vector temp(A.rows(), false);
-    double rel_err;
-
-    while (iter < num_iters) 
-    {
-        // w = Aqi
-        Qstar.row(iter, temp);
-        A.multiply(temp, w);
-
-        // construct ith column of H, i+1th row of Qstar:        
-        for (int row = 0; row <= iter; row++) {
-            Qstar.row(row, temp);
-            H(row, iter) = temp.dot(w);
-            w.add_ax(-H(row, iter), temp);
-        }
-
-        H(iter+1, iter) = w.norm();
-        w.divide(H(iter+1, iter));
-        Qstar.set_row(iter+1, w);
-
-        update_qr_decomp (H, G, iter, Cn, Sn);
-
-        rel_err = fabs(G[iter+1] / beta);
-
-        if (rel_err < max_err)
-            break;
-
-        if (iter % 100 == 0)
-            DEBUG_PRINT("Iter %d: %f err\n", iter, rel_err);
-
-        iter++;
-    }
-
-    if (iter == num_iters) {
-        fprintf(stderr, "Error: gmres failed to converge in %d iterations (relative err: %f)\n", num_iters, rel_err);
-        exit(-1);
-    }
-
-    // We've reached an acceptable solution (?):
-
-    DEBUG_PRINT("gmres completed in %d iterations (rel. resid. %f, max %f)\n", iter + 1, rel_err, max_err);
-    Vector y(iter+1);
-    upper_triangular_right_solve(H, G, y);
-    for (int i = 0; i < iter + 1; i++) {
-        Qstar.row(i, temp);
-        x.add_ax(y[i], temp);
-    }
-}
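Note on the Givens rotations used above: each rotation is chosen so that applying it to the pair (H(col, col), H(col+1, col)) zeroes the subdiagonal entry. The following is a minimal standalone sketch of that idea, not the code from this commit. The names `Givens`, `make_givens`, and `rotate` are hypothetical, and `std::hypot` is swapped in for `sqrt(a*a + b*b)`. The sign convention (`s = -b / r`, `c = a / r`) follows the general branch of `create_rotation`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: the cosine/sine pair of one Givens rotation.
struct Givens {
    double c;
    double s;
};

// Choose c and s so that rotating (a, b) yields (r, 0) with r = sqrt(a^2 + b^2).
// Uses the same sign convention as the general branch of create_rotation.
Givens make_givens(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    double r = std::hypot(a, b);
    if (r == 0.0)
        return {1.0, 0.0};  // nothing to eliminate; use the identity rotation
    return {a / r, -b / r};
}

// Apply the rotation in place, mirroring apply_rotation for a vector pair.
void rotate(const Givens &g, double &a, double &b) {
    double t = g.c * a - g.s * b;
    b = g.s * a + g.c * b;
    a = t;
}
```

With this convention the rotated pair is (r, 0): the first component becomes c*a - s*b = (a*a + b*b) / r = r, and the second becomes s*a + c*b = (-b*a + a*b) / r = 0.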
--- a/examples_cuda/gmres/algorithm.h
+++ b/examples_cuda/gmres/algorithm.h
@@ -1,50 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#ifndef __ALGORITHM_H__
-#define __ALGORITHM_H__
-
-#include "matrix.h"
-
-
-/* Generalized Minimal Residual Method:
- * -----------------------------------
- * Takes a square matrix and an rhs and uses GMRES to find an estimate for x.
- * The specified error is relative.
- */
-void gmres (const Matrix &A, const Vector &b, Vector &x, int num_iters, double err);
-
-
-
-#endif
--- a/examples_cuda/gmres/data/c-18/c-18.mtx
+++ b/examples_cuda/gmres/data/c-18/c-18.mtx
--- a/examples_cuda/gmres/data/c-18/c-18_b.mtx
+++ b/examples_cuda/gmres/data/c-18/c-18_b.mtx
--- a/examples_cuda/gmres/data/c-21/c-21.mtx
+++ b/examples_cuda/gmres/data/c-21/c-21.mtx
--- a/examples_cuda/gmres/data/c-21/c-21_b.mtx
+++ b/examples_cuda/gmres/data/c-21/c-21_b.mtx
--- a/examples_cuda/gmres/data/c-22/c-22.mtx
+++ b/examples_cuda/gmres/data/c-22/c-22.mtx
--- a/examples_cuda/gmres/data/c-22/c-22_b.mtx
+++ b/examples_cuda/gmres/data/c-22/c-22_b.mtx
--- a/examples_cuda/gmres/data/c-25/c-25.mtx
+++ b/examples_cuda/gmres/data/c-25/c-25.mtx
--- a/examples_cuda/gmres/data/c-25/c-25_b.mtx
+++ b/examples_cuda/gmres/data/c-25/c-25_b.mtx
--- a/examples_cuda/gmres/debug.h
+++ b/examples_cuda/gmres/debug.h
@@ -1,55 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#ifndef __DEBUG_H__
-#define __DEBUG_H__
-
-#include <cassert>
-#include <cstdio>  // printf, used by DEBUG_PRINT
-
-/**************************************************************\
-| Macros
-\**************************************************************/
-#define DEBUG
-
-#ifdef DEBUG
-#define ASSERT(expr) assert(expr)
-#define DEBUG_PRINT(...) printf(__VA_ARGS__)
-#else
-#define ASSERT(expr)
-#define DEBUG_PRINT(...)
-#endif
-
-
-#endif
--- a/examples_cuda/gmres/main.cpp
+++ b/examples_cuda/gmres/main.cpp
@@ -1,79 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#include "matrix.h"
-#include "algorithm.h"
-#include "util.h"
-#include <cmath>
-#include "../timing.h"
-
-
-int main (int argc, char **argv) 
-{
-    if (argc < 4) {
-        printf("usage: %s <input-matrix> <input-rhs> <output-file>\n", argv[0]);
-        return -1;
-    }
-
-    double gmres_cycles = 0.0;  // never measured here; initialized so the print below is defined
-
-    DEBUG_PRINT("Loading A...\n");
-    Matrix *A = CRSMatrix::matrix_from_mtf(argv[1]);
-    if (A == NULL) 
-        return -1;
-    DEBUG_PRINT("... size: %lu\n", A->cols());
-
-    DEBUG_PRINT("Loading b...\n");
-    Vector *b = Vector::vector_from_mtf(argv[2]);
-    if (b == NULL)
-        return -1;
-
-    Vector x(A->cols());
-    DEBUG_PRINT("Beginning gmres...\n");
-    gmres(*A, *b, x, A->cols() / 2, .01);
-
-    // Write result out to file
-    x.to_mtf(argv[argc-1]);
-
-    // Compute residual (double-check)
-#ifdef DEBUG
-    Vector bprime(b->size());
-    A->multiply(x, bprime);
-    Vector resid(bprime.size(), &(bprime[0]));
-    resid.subtract(*b);
-    DEBUG_PRINT("residual error check: %lg\n", resid.norm() / b->norm());
-#endif
-    // Print profiling results
-    DEBUG_PRINT("-- Total mcycles to solve : %.03f --\n", gmres_cycles);
-}
--- a/examples_cuda/gmres/matrix.cpp
+++ b/examples_cuda/gmres/matrix.cpp
@@ -1,246 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-/**************************************************************\
-| Includes
-\**************************************************************/
-#include "matrix.h"
-#include "matrix_ispc.h"
-
-extern "C" {
-#include "mmio.h"
-}
-
-/**************************************************************\
-| DenseMatrix methods
-\**************************************************************/
-void DenseMatrix::multiply (const Vector &v, Vector &r) const 
-{
-    // Dimensionality check
-    ASSERT(v.size() == cols());
-    ASSERT(r.size() == rows());
-
-    for (int i = 0; i < rows(); i++)
-        r[i] = v.dot(entries + i * num_cols);
-}
-
-const Vector *DenseMatrix::row (size_t row) const {
-    return new Vector(num_cols, entries + row * num_cols, true);
-}
-
-void DenseMatrix::row (size_t row, Vector &r) {
-    r.entries = entries + row * cols();
-    r._size   = cols();
-}
-
-void DenseMatrix::set_row(size_t row, const Vector &v) 
-{
-    ASSERT(v.size() == num_cols);
-    memcpy(entries + row * num_cols, v.entries, num_cols * sizeof(double));
-}
-
-
-/**************************************************************\
-| CRSMatrix Methods
-\**************************************************************/
-#include <stdio.h>
-#include <stdlib.h>
-#include <vector>
-#include <algorithm>
-
-
-struct entry {
-    int row;
-    int col;
-    double val;
-};
-
-bool compare_entries(struct entry i, struct entry j) {
-    if (i.row < j.row)
-        return true;
-    if (i.row > j.row)
-        return false;
-
-    return i.col < j.col;
-}
-
-#define ERR_OUT(...) { fprintf(stderr, __VA_ARGS__); return NULL; }
-
-CRSMatrix *CRSMatrix::matrix_from_mtf (char *path) {
-    FILE *f;
-    MM_typecode matcode;
-
-    int m, n, nz;
-
-    if ((f = fopen(path, "r")) == NULL) 
-        ERR_OUT("Error: %s does not name a valid/readable file.\n", path);
-
-    if (mm_read_banner(f, &matcode) != 0)
-        ERR_OUT("Error: Could not process Matrix Market banner.\n");
-
-    if (mm_is_complex(matcode)) 
-        ERR_OUT("Error: Application does not support complex numbers.\n")
-
-    if (mm_is_dense(matcode))
-        ERR_OUT("Error: supplied matrix is dense (should be sparse.)\n");
-
-    if (!mm_is_matrix(matcode))
-        ERR_OUT("Error: %s does not encode a matrix.\n", path)
-
-    if (mm_read_mtx_crd_size(f, &m, &n, &nz) != 0)
-        ERR_OUT("Error: could not read matrix size from file.\n");
-
-    if (m != n)
-        ERR_OUT("Error: Application does not support non-square matrices.\n");
-
-    std::vector<struct entry> entries;
-    entries.resize(nz);
-
-    for (int i = 0; i < nz; i++) {
-        fscanf(f, "%d %d %lg\n", &entries[i].row, &entries[i].col, &entries[i].val);
-        // Adjust from 1-based to 0-based
-        entries[i].row--;
-        entries[i].col--;
-    }
-
-    sort(entries.begin(), entries.end(), compare_entries);
-
-    CRSMatrix *M = new CRSMatrix(m, n, nz);
-    int cur_row = -1;
-    for (int i = 0; i < nz; i++) {
-        while (entries[i].row > cur_row)
-            M->row_offsets[++cur_row] = i;
-        M->entries[i] = entries[i].val;
-        M->columns[i] = entries[i].col;
-    }
-
-    return M;
-}
-
-Vector *Vector::vector_from_mtf (char *path) {
-    FILE *f;
-    MM_typecode matcode;
-
-    int m, n, nz;
-
-    if ((f = fopen(path, "r")) == NULL) 
-        ERR_OUT("Error: %s does not name a valid/readable file.\n", path);
-
-    if (mm_read_banner(f, &matcode) != 0)
-        ERR_OUT("Error: Could not process Matrix Market banner.\n");
-
-    if (mm_is_complex(matcode)) 
-        ERR_OUT("Error: Application does not support complex numbers.\n")
-
-    if (mm_is_dense(matcode)) {
-        if (mm_read_mtx_array_size(f, &m, &n) != 0)
-            ERR_OUT("Error: could not read matrix size from file.\n");
-    } else {
-        if (mm_read_mtx_crd_size(f, &m, &n, &nz) != 0)
-            ERR_OUT("Error: could not read matrix size from file.\n");
-    }
-    if (n != 1)
-        ERR_OUT("Error: %s does not describe a vector.\n", path);
-
-    Vector *x = new Vector(m);
-
-    if (mm_is_dense(matcode)) {
-        double val;
-        for (int i = 0; i < m; i++) {
-            fscanf(f, "%lg\n", &val);
-            (*x)[i] = val;
-        }
-    }
-    else {
-        x->zero();
-        double val;
-        int row;
-        int col;
-        for (int i = 0; i < nz; i++) {
-            fscanf(f, "%d %d %lg\n", &row, &col, &val);
-            (*x)[row-1] = val;
-        }
-    }
-    return x;
-}
-
-#define ERR(...) { fprintf(stderr, __VA_ARGS__); exit(-1); }
-
-void Vector::to_mtf (char *path) {
-    FILE *f;
-    MM_typecode matcode;
-
-    mm_initialize_typecode(&matcode);
-    mm_set_matrix(&matcode);
-    mm_set_real(&matcode);
-    mm_set_dense(&matcode);
-    mm_set_general(&matcode);
-
-    if ((f = fopen(path, "w")) == NULL)
-        ERR("Error: cannot open/write to %s\n", path);
-
-    mm_write_banner(f, matcode);
-    mm_write_mtx_array_size(f, size(), 1);
-    for (int i = 0; i < size(); i++)
-        fprintf(f, "%lg\n", entries[i]);
-
-    fclose(f);
-}
-
-void CRSMatrix::multiply (const Vector &v, Vector &r) const
-{
-    ASSERT(v.size() == cols());
-    ASSERT(r.size() == rows());
-
-    for (int row = 0; row < rows(); row++) 
-    {
-        int row_offset = row_offsets[row];
-        int next_offset = ((row + 1 == rows()) ? _nonzeroes : row_offsets[row + 1]);
-
-        double sum = 0;
-        for (int i = row_offset; i < next_offset; i++)
-        {
-            sum += v[columns[i]] * entries[i];
-        }
-        r[row] = sum;
-    }
-}
-
-void CRSMatrix::zero ( ) 
-{
-    entries.clear();
-    row_offsets.clear();
-    columns.clear();
-    _nonzeroes = 0;
-}
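Note on the compressed row storage used by `CRSMatrix`: nonzeroes are sorted by (row, column), and an offsets array records where each row starts. The sketch below illustrates the format in isolation and is not the class from this commit. `Triplet`, `Csr`, `build_csr`, and `multiply` are hypothetical names, and it assumes the common layout with rows + 1 offsets, whereas the class above stores one offset per row and special-cases the last row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical coordinate-format entry (0-based indices).
struct Triplet {
    int row;
    int col;
    double val;
};

// Hypothetical CSR container using rows + 1 offsets.
struct Csr {
    std::size_t rows;
    std::vector<int> row_offsets;  // row r occupies [row_offsets[r], row_offsets[r + 1])
    std::vector<int> columns;
    std::vector<double> values;
};

// Build CSR from entries already sorted by (row, col).
Csr build_csr(std::size_t rows, const std::vector<Triplet> &sorted) {
    Csr m;
    m.rows = rows;
    m.row_offsets.assign(rows + 1, 0);
    for (const Triplet &t : sorted) {
        assert(t.row >= 0 && static_cast<std::size_t>(t.row) < rows);
        m.row_offsets[t.row + 1]++;  // count entries per row
    }
    for (std::size_t r = 0; r < rows; r++)
        m.row_offsets[r + 1] += m.row_offsets[r];  // prefix sum gives row starts
    for (const Triplet &t : sorted) {
        m.columns.push_back(t.col);
        m.values.push_back(t.val);
    }
    return m;
}

// Sparse matrix-vector product r = M * v.
std::vector<double> multiply(const Csr &m, const std::vector<double> &v) {
    std::vector<double> r(m.rows, 0.0);
    for (std::size_t row = 0; row < m.rows; row++) {
        double sum = 0.0;
        for (int i = m.row_offsets[row]; i < m.row_offsets[row + 1]; i++)
            sum += m.values[i] * v[m.columns[i]];
        r[row] = sum;
    }
    return r;
}
```

Storing rows + 1 offsets means empty rows, including trailing ones, need no special handling in the multiply loop.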
--- a/examples_cuda/gmres/matrix.h
+++ b/examples_cuda/gmres/matrix.h
@@ -1,279 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#ifndef __MATRIX_H__
-#define __MATRIX_H__
-
-/**************************************************************\
-| Includes
-\**************************************************************/
-#include <cstring> // size_t, memcpy
-#include <cstdlib> // malloc, free
-#include <cmath>   // sqrt
-#include <vector>
-
-#include "debug.h"
-#include "matrix_ispc.h"
-
-
-class DenseMatrix;
-/**************************************************************\
-| Vector class
-\**************************************************************/
-class Vector {
- public:
-    static Vector *vector_from_mtf(char *path);
-    void to_mtf (char *path);
-
-    Vector(size_t size, bool alloc_mem=true) 
-        {
-            shared_ptr = false;
-            _size      = size;
-			
-            if (alloc_mem)
-                entries = (double *) malloc(sizeof(double) * _size);
-            else {
-                shared_ptr = true;
-                entries    = NULL;
-            }
-        }
-
-    Vector(size_t size, double *content, bool share_ptr=false) 
-        {
-            _size = size;
-            if (share_ptr) {
-                entries = content;
-                shared_ptr = true;
-            }
-            else {
-                shared_ptr = false;
-                entries = (double *) malloc(sizeof(double) * _size);
-                memcpy(entries, content, sizeof(double) * _size);
-            }
-        }
-
-    ~Vector() { if (!shared_ptr) free(entries); }
-
-    const double & operator [] (size_t index) const 
-    { 
-        ASSERT(index < _size); 
-        return *(entries + index); 
-    }
-
-    double &operator [] (size_t index) 
-    {
-        ASSERT(index < _size);
-        return *(entries + index);
-    }
-
-    bool operator == (const Vector &v) const 
-    {
-        if (v.size() != _size)
-            return false;
-
-        for (int i = 0; i < _size; i++)
-            if (entries[i] != v[i])
-                return false;
-
-        return true;
-    }
-
-    size_t size() const {return _size; }
-
-    double dot (const Vector &b) const 
-    {
-        ASSERT(b.size() == this->size());
-        return ispc::vector_dot(entries, b.entries, size());
-    }
-
-    double dot (const double * const b) const 
-    {
-        return ispc::vector_dot(entries, b, size());
-    }
-
-    void zero () 
-    {
-        ispc::zero(entries, size()); 
-    }
-
-    double norm () const { return sqrt(dot(entries)); }
-
-    void normalize () { this->divide(this->norm()); }
-
-    void add (const Vector &a) 
-    {
-        ASSERT(size() == a.size());
-        ispc::vector_add(entries, a.entries, size());
-    }
-
-    void subtract (const Vector &s)
-    {
-        ASSERT(size() == s.size());
-        ispc::vector_sub(entries, s.entries, size());
-    }
-
-    void multiply (double scalar) 
-    {
-        ispc::vector_mult(entries, scalar, size());
-    }
-
-    void divide (double scalar) 
-    {
-        ispc::vector_div(entries, scalar, size());
-    }
-
-    // Note: x may be longer than *(this)
-    void add_ax (double a, const Vector &x) {
-        ASSERT(x.size() >= size());
-        ispc::vector_add_ax(entries, a, x.entries, size());
-    }
-
-    // Note that copy only copies the first size() elements of the
-    // supplied vector, i.e. the supplied vector can be longer than
-    // this one.  This is useful in least squares calculations.
-    void copy (const Vector &other) {
-        ASSERT(other.size() >= size());
-        memcpy(entries, other.entries, size() * sizeof(double));
-    }
-
-    friend class DenseMatrix;
-
- private:
-    size_t  _size;
-    bool     shared_ptr;
-    double  *entries;
-};
-
-
-/**************************************************************\
-| Matrix base class
-\**************************************************************/
-class Matrix {
-    friend class Vector;
-	
- public:
-    Matrix(size_t size_r, size_t size_c) 
-        { 
-            num_rows = size_r; 
-            num_cols = size_c; 
-        }
-    ~Matrix(){}
-
-    size_t rows() const { return num_rows; }
-    size_t cols() const { return num_cols; }
-
-    virtual void multiply (const Vector &v, Vector &r) const = 0;
-    virtual void zero () = 0;
-
- protected:
-    size_t num_rows;
-    size_t num_cols;
-};
-
-/**************************************************************\
-| DenseMatrix class
-\**************************************************************/
-class DenseMatrix : public Matrix { 
-    friend class Vector;
-
- public:
- DenseMatrix(size_t size_r, size_t size_c) : Matrix(size_r, size_c) 
-        {
-            entries = (double *) malloc(size_r * size_c * sizeof(double));
-        }
-
- DenseMatrix(size_t size_r, size_t size_c, const double *content) : Matrix (size_r, size_c)
-        {
-            entries = (double *) malloc(size_r * size_c * sizeof(double));
-            memcpy(entries, content, size_r * size_c * sizeof(double));
-        }
-
-    virtual void multiply (const Vector &v, Vector &r) const;
-
-    double &operator () (unsigned int r, unsigned int c)
-    {
-        return *(entries + r * num_cols + c);
-    }
-
-    const double &operator () (unsigned int r, unsigned int c) const
-    {
-        return *(entries + r * num_cols + c);			
-    }
-
-    const Vector *row(size_t row) const;
-    void          row(size_t row, Vector &r);
-    void      set_row(size_t row, const Vector &v);
-
-    virtual void zero() { ispc::zero(entries, rows() * cols()); }
-
-    void copy (const DenseMatrix &other) 
-    {
-        ASSERT(rows() == other.rows());
-        ASSERT(cols() == other.cols());
-        memcpy(entries, other.entries, rows() * cols() * sizeof(double));
-    }
-
- private:
-    double *entries;
-    bool shared_ptr;
-};
-
-/**************************************************************\
-| CRSMatrix (compressed row storage, a sparse matrix format)
-\**************************************************************/
-class CRSMatrix : public Matrix { 
- public:
-    CRSMatrix (size_t size_r, size_t size_c, size_t nonzeroes) :
-    Matrix(size_r, size_c) 
-        {
-            _nonzeroes = nonzeroes;
-            entries.resize(nonzeroes);
-            columns.resize(nonzeroes);
-            row_offsets.resize(size_r);
-        }
-
-    virtual void multiply(const Vector &v, Vector &r) const;
-
-    virtual void zero();
-
-    static CRSMatrix *matrix_from_mtf (char *path);
-
- private:
-    unsigned int        _nonzeroes;
-    std::vector<double>  entries;
-    std::vector<int>     row_offsets;
-    std::vector<int>     columns;
-};
-
-#endif
--- a/examples_cuda/gmres/matrix.ispc
+++ b/examples_cuda/gmres/matrix.ispc
@@ -1,122 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-/**************************************************************\
-| General
-\**************************************************************/
-export void zero (uniform double data[],
-                  uniform int size)
-{
-    foreach (i = 0 ... size)
-        data[i] = 0.0;
-}
-
-
-/**************************************************************\
-| Vector helpers
-\**************************************************************/
-export void vector_add (uniform double a[], 
-                        const uniform double b[], 
-                        const uniform int size) 
-{
-    foreach (i = 0 ... size)
-        a[i] += b[i];
-}
-
-export void vector_sub (uniform double a[], 
-                        const uniform double b[], 
-                        const uniform int size) 
-{
-    foreach (i = 0 ... size)
-        a[i] -= b[i];
-}
-
-export void vector_mult (uniform double a[],
-                         const uniform double b,
-                         const uniform int size)
-{
-    foreach (i = 0 ... size)
-        a[i] *= b;
-}
-
-export void vector_div (uniform double a[],
-                        const uniform double b,
-                        const uniform int size)
-{
-    foreach (i = 0 ... size)
-        a[i] /= b;
-}
-
-export void vector_add_ax (uniform double r[],
-                           const uniform double a,
-                           const uniform double x[],
-                           const uniform int    size)
-{
-    foreach (i = 0 ... size)
-        r[i] += a * x[i];
-}
-
-export uniform double vector_dot (const uniform double a[],
-                                  const uniform double b[],
-                                  const uniform int size)
-{
-    varying double sum = 0.0;
-    foreach (i = 0 ... size)
-        sum += a[i] * b[i];
-    return reduce_add(sum);
-}
-
-/**************************************************************\
-| Matrix helpers
-\**************************************************************/
-export void sparse_multiply (const uniform double entries[],
-                             const uniform int    columns[],
-                             const uniform int    row_offsets[],
-                             const uniform int rows,
-                             const uniform int cols,
-                             const uniform int nonzeroes,
-                             const uniform double v[],
-                             uniform double r[]) 
-{
-    foreach (row = 0 ... rows) {
-        int row_offset = row_offsets[row];
-        int next_offset = ((row + 1 == rows) ? nonzeroes : row_offsets[row+1]);
-
-        double sum = 0;
-        for (int j = row_offset; j < next_offset; j++)
-            sum += v[columns[j]] * entries[j];
-        r[row] = sum;
-    }
-}
-
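Note on `vector_dot`: the ispc version accumulates into a `varying` sum, so each program instance keeps its own partial sum, and `reduce_add` combines them at the end. The plain C++ sketch below emulates that pattern and is only an illustration. `lanewise_dot` and the gang width `kLanes = 4` are assumptions, and the mapping of elements to lanes is a simplification of how `foreach` distributes work.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical gang width; the real value depends on the ispc target.
const std::size_t kLanes = 4;

// Dot product with one partial sum per lane, followed by a final reduction.
double lanewise_dot(const double *a, const double *b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    double partial[kLanes] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < n; i++)
        partial[i % kLanes] += a[i] * b[i];  // element i is handled by lane i % kLanes
    double sum = 0.0;
    for (std::size_t lane = 0; lane < kLanes; lane++)
        sum += partial[lane];  // plays the role of reduce_add
    return sum;
}
```

Because the additions happen in a different order than in a serial loop, the result can differ from a serial dot product in the last bits for general floating-point inputs.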
--- a/examples_cuda/gmres/mmio.c
+++ b/examples_cuda/gmres/mmio.c
@@ -1,511 +0,0 @@
-/* 
-*   Matrix Market I/O library for ANSI C
-*
-*   See http://math.nist.gov/MatrixMarket for details.
-*
-*
-*/
-
-
-#include <stdio.h>
-#include <string.h>
-#include <stdlib.h>
-#include <ctype.h>
-
-#include "mmio.h"
-
-int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
-                double **val_, int **I_, int **J_)
-{
-    FILE *f;
-    MM_typecode matcode;
-    int M, N, nz;
-    int i;
-    double *val;
-    int *I, *J;
- 
-    if ((f = fopen(fname, "r")) == NULL)
-            return -1;
- 
- 
-    if (mm_read_banner(f, &matcode) != 0)
-    {
-        printf("mm_read_unsymmetric_sparse: Could not process Matrix Market banner ");
-        printf(" in file [%s]\n", fname);
-        return -1;
-    }
- 
- 
- 
-    if ( !(mm_is_real(matcode) && mm_is_matrix(matcode) &&
-            mm_is_sparse(matcode)))
-    {
-        fprintf(stderr, "Sorry, this application does not support ");
-        fprintf(stderr, "Matrix Market type: [%s]\n",
-                mm_typecode_to_str(matcode));
-        return -1;
-    }
- 
-    /* find out size of sparse matrix: M, N, nz .... */
- 
-    if (mm_read_mtx_crd_size(f, &M, &N, &nz) !=0)
-    {
-        fprintf(stderr, "read_unsymmetric_sparse(): could not parse matrix size.\n");
-        return -1;
-    }
- 
-    *M_ = M;
-    *N_ = N;
-    *nz_ = nz;
- 
-    /* reserve memory for matrices */
- 
-    I = (int *) malloc(nz * sizeof(int));
-    J = (int *) malloc(nz * sizeof(int));
-    val = (double *) malloc(nz * sizeof(double));
- 
-    *val_ = val;
-    *I_ = I;
-    *J_ = J;
- 
-    /* NOTE: when reading in doubles, ANSI C requires the use of the "l"  */
-    /*   specifier as in "%lg", "%lf", "%le", otherwise errors will occur */
-    /*  (ANSI C X3.159-1989, Sec. 4.9.6.2, p. 136 lines 13-15)            */
- 
-    for (i=0; i<nz; i++)
-    {
-        fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i]);
-        I[i]--;  /* adjust from 1-based to 0-based */
-        J[i]--;
-    }
-    fclose(f);
- 
-    return 0;
-}
-
-int mm_is_valid(MM_typecode matcode)
-{
-    if (!mm_is_matrix(matcode)) return 0;
-    if (mm_is_dense(matcode) && mm_is_pattern(matcode)) return 0;
-    if (mm_is_real(matcode) && mm_is_hermitian(matcode)) return 0;
-    if (mm_is_pattern(matcode) && (mm_is_hermitian(matcode) || 
-                mm_is_skew(matcode))) return 0;
-    return 1;
-}
-
-int mm_read_banner(FILE *f, MM_typecode *matcode)
-{
-    char line[MM_MAX_LINE_LENGTH];
-    char banner[MM_MAX_TOKEN_LENGTH];
-    char mtx[MM_MAX_TOKEN_LENGTH]; 
-    char crd[MM_MAX_TOKEN_LENGTH];
-    char data_type[MM_MAX_TOKEN_LENGTH];
-    char storage_scheme[MM_MAX_TOKEN_LENGTH];
-    char *p;
-
-
-    mm_clear_typecode(matcode);  
-
-    if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL) 
-        return MM_PREMATURE_EOF;
-
-    if (sscanf(line, "%s %s %s %s %s", banner, mtx, crd, data_type, 
-        storage_scheme) != 5)
-        return MM_PREMATURE_EOF;
-
-    for (p=mtx; *p!='\0'; *p=tolower(*p),p++);  /* convert to lower case */
-    for (p=crd; *p!='\0'; *p=tolower(*p),p++);  
-    for (p=data_type; *p!='\0'; *p=tolower(*p),p++);
-    for (p=storage_scheme; *p!='\0'; *p=tolower(*p),p++);
-
-    /* check for banner */
-    if (strncmp(banner, MatrixMarketBanner, strlen(MatrixMarketBanner)) != 0)
-        return MM_NO_HEADER;
-
-    /* first field should be "mtx" */
-    if (strcmp(mtx, MM_MTX_STR) != 0)
-        return  MM_UNSUPPORTED_TYPE;
-    mm_set_matrix(matcode);
-
-
-    /* second field describes whether this is a sparse matrix (in coordinate
-            storage) or a dense array */
-
-
-    if (strcmp(crd, MM_SPARSE_STR) == 0)
-        mm_set_sparse(matcode);
-    else
-    if (strcmp(crd, MM_DENSE_STR) == 0)
-            mm_set_dense(matcode);
-    else
-        return MM_UNSUPPORTED_TYPE;
-    
-
-    /* third field */
-
-    if (strcmp(data_type, MM_REAL_STR) == 0)
-        mm_set_real(matcode);
-    else
-    if (strcmp(data_type, MM_COMPLEX_STR) == 0)
-        mm_set_complex(matcode);
-    else
-    if (strcmp(data_type, MM_PATTERN_STR) == 0)
-        mm_set_pattern(matcode);
-    else
-    if (strcmp(data_type, MM_INT_STR) == 0)
-        mm_set_integer(matcode);
-    else
-        return MM_UNSUPPORTED_TYPE;
-    
-
-    /* fourth field */
-
-    if (strcmp(storage_scheme, MM_GENERAL_STR) == 0)
-        mm_set_general(matcode);
-    else
-    if (strcmp(storage_scheme, MM_SYMM_STR) == 0)
-        mm_set_symmetric(matcode);
-    else
-    if (strcmp(storage_scheme, MM_HERM_STR) == 0)
-        mm_set_hermitian(matcode);
-    else
-    if (strcmp(storage_scheme, MM_SKEW_STR) == 0)
-        mm_set_skew(matcode);
-    else
-        return MM_UNSUPPORTED_TYPE;
-        
-
-    return 0;
-}
-
-int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz)
-{
-    if (fprintf(f, "%d %d %d\n", M, N, nz) < 0)  /* fprintf returns a character count, negative on error */
-        return MM_COULD_NOT_WRITE_FILE;
-    else 
-        return 0;
-}
-
-int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz )
-{
-    char line[MM_MAX_LINE_LENGTH];
-    int num_items_read;
-
-    /* set return null parameter values, in case we exit with errors */
-    *M = *N = *nz = 0;
-
-    /* now continue scanning until you reach the end-of-comments */
-    do 
-    {
-        if (fgets(line,MM_MAX_LINE_LENGTH,f) == NULL) 
-            return MM_PREMATURE_EOF;
-    }while (line[0] == '%');
-
-    /* line[] is either blank or has M,N, nz */
-    if (sscanf(line, "%d %d %d", M, N, nz) == 3)
-        return 0;
-        
-    else
-    do
-    { 
-        num_items_read = fscanf(f, "%d %d %d", M, N, nz); 
-        if (num_items_read == EOF) return MM_PREMATURE_EOF;
-    }
-    while (num_items_read != 3);
-
-    return 0;
-}
-
-
-int mm_read_mtx_array_size(FILE *f, int *M, int *N)
-{
-    char line[MM_MAX_LINE_LENGTH];
-    int num_items_read;
-    /* set return null parameter values, in case we exit with errors */
-    *M = *N = 0;
-	
-    /* now continue scanning until you reach the end-of-comments */
-    do 
-    {
-        if (fgets(line,MM_MAX_LINE_LENGTH,f) == NULL) 
-            return MM_PREMATURE_EOF;
-    }while (line[0] == '%');
-
-    /* line[] is either blank or has M, N */
-    if (sscanf(line, "%d %d", M, N) == 2)
-        return 0;
-        
-    else /* we have a blank line */
-    do
-    { 
-        num_items_read = fscanf(f, "%d %d", M, N); 
-        if (num_items_read == EOF) return MM_PREMATURE_EOF;
-    }
-    while (num_items_read != 2);
-
-    return 0;
-}
-
-int mm_write_mtx_array_size(FILE *f, int M, int N)
-{
-    if (fprintf(f, "%d %d\n", M, N) < 0)  /* fprintf returns a character count, negative on error */
-        return MM_COULD_NOT_WRITE_FILE;
-    else 
-        return 0;
-}
-
-
-
-/*-------------------------------------------------------------------------*/
-
-/******************************************************************/
-/* use when I[], J[], and val[] are already allocated             */
-/******************************************************************/
-
-int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
-        double val[], MM_typecode matcode)
-{
-    int i;
-    if (mm_is_complex(matcode))
-    {
-        for (i=0; i<nz; i++)
-            if (fscanf(f, "%d %d %lg %lg", &I[i], &J[i], &val[2*i], &val[2*i+1])
-                != 4) return MM_PREMATURE_EOF;
-    }
-    else if (mm_is_real(matcode))
-    {
-        for (i=0; i<nz; i++)
-        {
-            if (fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i])
-                != 3) return MM_PREMATURE_EOF;
-
-        }
-    }
-
-    else if (mm_is_pattern(matcode))
-    {
-        for (i=0; i<nz; i++)
-            if (fscanf(f, "%d %d", &I[i], &J[i])
-                != 2) return MM_PREMATURE_EOF;
-    }
-    else
-        return MM_UNSUPPORTED_TYPE;
-
-    return 0;
-        
-}
-
-int mm_read_mtx_crd_entry(FILE *f, int *I, int *J,
-        double *real, double *imag, MM_typecode matcode)
-{
-    if (mm_is_complex(matcode))
-    {
-            if (fscanf(f, "%d %d %lg %lg", I, J, real, imag)
-                != 4) return MM_PREMATURE_EOF;
-    }
-    else if (mm_is_real(matcode))
-    {
-            if (fscanf(f, "%d %d %lg\n", I, J, real)
-                != 3) return MM_PREMATURE_EOF;
-
-    }
-
-    else if (mm_is_pattern(matcode))
-    {
-            if (fscanf(f, "%d %d", I, J) != 2) return MM_PREMATURE_EOF;
-    }
-    else
-        return MM_UNSUPPORTED_TYPE;
-
-    return 0;
-        
-}
-
-
-/************************************************************************
-    mm_read_mtx_crd()  fills M, N, nz, and the array of values, and returns
-                        the type code, e.g. 'MCRS'
-
-                        if matrix is complex, values[] is of size 2*nz,
-                            (nz pairs of real/imaginary values)
-************************************************************************/
-
-int mm_read_mtx_crd(char *fname, int *M, int *N, int *nz, int **I, int **J, 
-        double **val, MM_typecode *matcode)
-{
-    int ret_code;
-    FILE *f;
-
-    if (strcmp(fname, "stdin") == 0) f=stdin;
-    else
-    if ((f = fopen(fname, "r")) == NULL)
-        return MM_COULD_NOT_READ_FILE;
-
-
-    if ((ret_code = mm_read_banner(f, matcode)) != 0)
-        return ret_code;
-
-    if (!(mm_is_valid(*matcode) && mm_is_sparse(*matcode) && 
-            mm_is_matrix(*matcode)))
-        return MM_UNSUPPORTED_TYPE;
-
-    if ((ret_code = mm_read_mtx_crd_size(f, M, N, nz)) != 0)
-        return ret_code;
-
-
-    *I = (int *)  malloc(*nz * sizeof(int));
-    *J = (int *)  malloc(*nz * sizeof(int));
-    *val = NULL;
-
-    if (mm_is_complex(*matcode))
-    {
-        *val = (double *) malloc(*nz * 2 * sizeof(double));
-        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, 
-                *matcode);
-        if (ret_code != 0) return ret_code;
-    }
-    else if (mm_is_real(*matcode))
-    {
-        *val = (double *) malloc(*nz * sizeof(double));
-        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, 
-                *matcode);
-        if (ret_code != 0) return ret_code;
-    }
-
-    else if (mm_is_pattern(*matcode))
-    {
-        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, 
-                *matcode);
-        if (ret_code != 0) return ret_code;
-    }
-
-    if (f != stdin) fclose(f);
-    return 0;
-}
-
-int mm_write_banner(FILE *f, MM_typecode matcode)
-{
-    char *str = mm_typecode_to_str(matcode);
-    int ret_code;
-
-    ret_code = fprintf(f, "%s %s\n", MatrixMarketBanner, str);
-    free(str);
-    if (ret_code < 0)  /* fprintf returns a character count, negative on error */
-        return MM_COULD_NOT_WRITE_FILE;
-    else
-        return 0;
-}
-
-int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
-        double val[], MM_typecode matcode)
-{
-    FILE *f;
-    int i;
-
-    if (strcmp(fname, "stdout") == 0) 
-        f = stdout;
-    else
-    if ((f = fopen(fname, "w")) == NULL)
-        return MM_COULD_NOT_WRITE_FILE;
-    
-    /* print banner followed by typecode */
-    fprintf(f, "%s ", MatrixMarketBanner);
-    fprintf(f, "%s\n", mm_typecode_to_str(matcode));
-
-    /* print matrix sizes and nonzeros */
-    fprintf(f, "%d %d %d\n", M, N, nz);
-
-    /* print values */
-    if (mm_is_pattern(matcode))
-        for (i=0; i<nz; i++)
-            fprintf(f, "%d %d\n", I[i], J[i]);
-    else
-    if (mm_is_real(matcode))
-        for (i=0; i<nz; i++)
-            fprintf(f, "%d %d %20.16g\n", I[i], J[i], val[i]);
-    else
-    if (mm_is_complex(matcode))
-        for (i=0; i<nz; i++)
-            fprintf(f, "%d %d %20.16g %20.16g\n", I[i], J[i], val[2*i], 
-                        val[2*i+1]);
-    else
-    {
-        if (f != stdout) fclose(f);
-        return MM_UNSUPPORTED_TYPE;
-    }
-
-    if (f !=stdout) fclose(f);
-
-    return 0;
-}
-  
-
-/**
-*  Create a new copy of a string s.  strdup() is a common routine, but
-*  not part of ANSI C, so mm_strdup() is provided here.  Used by mm_typecode_to_str().
-*
-*/
-char *mm_strdup(const char *s)
-{
-	int len = strlen(s);
-	char *s2 = (char *) malloc((len+1)*sizeof(char));
-	return strcpy(s2, s);
-}
-
-char  *mm_typecode_to_str(MM_typecode matcode)
-{
-    char buffer[MM_MAX_LINE_LENGTH];
-    char *types[4];
-	char *mm_strdup(const char *);
-    int error =0;
-
-    /* check for MTX type */
-    if (mm_is_matrix(matcode)) 
-        types[0] = MM_MTX_STR;
-    else
-        error=1;
-
-    /* check for CRD or ARR matrix */
-    if (mm_is_sparse(matcode))
-        types[1] = MM_SPARSE_STR;
-    else
-    if (mm_is_dense(matcode))
-        types[1] = MM_DENSE_STR;
-    else
-        return NULL;
-
-    /* check for element data type */
-    if (mm_is_real(matcode))
-        types[2] = MM_REAL_STR;
-    else
-    if (mm_is_complex(matcode))
-        types[2] = MM_COMPLEX_STR;
-    else
-    if (mm_is_pattern(matcode))
-        types[2] = MM_PATTERN_STR;
-    else
-    if (mm_is_integer(matcode))
-        types[2] = MM_INT_STR;
-    else
-        return NULL;
-
-
-    /* check for symmetry type */
-    if (mm_is_general(matcode))
-        types[3] = MM_GENERAL_STR;
-    else
-    if (mm_is_symmetric(matcode))
-        types[3] = MM_SYMM_STR;
-    else 
-    if (mm_is_hermitian(matcode))
-        types[3] = MM_HERM_STR;
-    else 
-    if (mm_is_skew(matcode))
-        types[3] = MM_SKEW_STR;
-    else
-        return NULL;
-
-    sprintf(buffer,"%s %s %s %s", types[0], types[1], types[2], types[3]);
-    return mm_strdup(buffer);
-
-}
--- a/examples_cuda/gmres/mmio.h
+++ b/examples_cuda/gmres/mmio.h
@@ -1,135 +0,0 @@
-/* 
-*   Matrix Market I/O library for ANSI C
-*
-*   See http://math.nist.gov/MatrixMarket for details.
-*
-*
-*/
-
-#ifndef MM_IO_H
-#define MM_IO_H
-
-#define MM_MAX_LINE_LENGTH 1025
-#define MatrixMarketBanner "%%MatrixMarket"
-#define MM_MAX_TOKEN_LENGTH 64
-
-typedef char MM_typecode[4];
-
-#include <stdio.h>
-
-char *mm_typecode_to_str(MM_typecode matcode);
-
-int mm_read_banner(FILE *f, MM_typecode *matcode);
-int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz);
-int mm_read_mtx_array_size(FILE *f, int *M, int *N);
-
-int mm_write_banner(FILE *f, MM_typecode matcode);
-int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz);
-int mm_write_mtx_array_size(FILE *f, int M, int N);
-
-
-/********************* MM_typecode query functions ***************************/
-
-#define mm_is_matrix(typecode)	((typecode)[0]=='M')
-
-#define mm_is_sparse(typecode)	((typecode)[1]=='C')
-#define mm_is_coordinate(typecode)((typecode)[1]=='C')
-#define mm_is_dense(typecode)	((typecode)[1]=='A')
-#define mm_is_array(typecode)	((typecode)[1]=='A')
-
-#define mm_is_complex(typecode)	((typecode)[2]=='C')
-#define mm_is_real(typecode)		((typecode)[2]=='R')
-#define mm_is_pattern(typecode)	((typecode)[2]=='P')
-#define mm_is_integer(typecode) ((typecode)[2]=='I')
-
-#define mm_is_symmetric(typecode)((typecode)[3]=='S')
-#define mm_is_general(typecode)	((typecode)[3]=='G')
-#define mm_is_skew(typecode)	((typecode)[3]=='K')
-#define mm_is_hermitian(typecode)((typecode)[3]=='H')
-
-int mm_is_valid(MM_typecode matcode);		/* too complex for a macro */
-
-
-/********************* MM_typecode modify functions ***************************/
-
-#define mm_set_matrix(typecode)	((*typecode)[0]='M')
-#define mm_set_coordinate(typecode)	((*typecode)[1]='C')
-#define mm_set_array(typecode)	((*typecode)[1]='A')
-#define mm_set_dense(typecode)	mm_set_array(typecode)
-#define mm_set_sparse(typecode)	mm_set_coordinate(typecode)
-
-#define mm_set_complex(typecode)((*typecode)[2]='C')
-#define mm_set_real(typecode)	((*typecode)[2]='R')
-#define mm_set_pattern(typecode)((*typecode)[2]='P')
-#define mm_set_integer(typecode)((*typecode)[2]='I')
-
-
-#define mm_set_symmetric(typecode)((*typecode)[3]='S')
-#define mm_set_general(typecode)((*typecode)[3]='G')
-#define mm_set_skew(typecode)	((*typecode)[3]='K')
-#define mm_set_hermitian(typecode)((*typecode)[3]='H')
-
-#define mm_clear_typecode(typecode) ((*typecode)[0]=(*typecode)[1]= \
-									(*typecode)[2]=' ',(*typecode)[3]='G')
-
-#define mm_initialize_typecode(typecode) mm_clear_typecode(typecode)
-
-
-/********************* Matrix Market error codes ***************************/
-
-
-#define MM_COULD_NOT_READ_FILE	11
-#define MM_PREMATURE_EOF		12
-#define MM_NOT_MTX				13
-#define MM_NO_HEADER			14
-#define MM_UNSUPPORTED_TYPE		15
-#define MM_LINE_TOO_LONG		16
-#define MM_COULD_NOT_WRITE_FILE	17
-
-
-/******************** Matrix Market internal definitions ********************
-
-   MM_matrix_typecode: 4-character sequence
-
-                     object    sparse/    data         storage
-                               dense      type         scheme
-
-   string position:  [0]       [1]        [2]          [3]
-
-   Matrix typecode:  M(atrix)  C(oord)    R(eal)       G(eneral)
-                               A(rray)    C(omplex)    H(ermitian)
-                                          P(attern)    S(ymmetric)
-                                          I(nteger)    K (skew)
-
- ***********************************************************************/
-
-#define MM_MTX_STR		"matrix"
-#define MM_ARRAY_STR	"array"
-#define MM_DENSE_STR	"array"
-#define MM_COORDINATE_STR "coordinate" 
-#define MM_SPARSE_STR	"coordinate"
-#define MM_COMPLEX_STR	"complex"
-#define MM_REAL_STR		"real"
-#define MM_INT_STR		"integer"
-#define MM_GENERAL_STR  "general"
-#define MM_SYMM_STR		"symmetric"
-#define MM_HERM_STR		"hermitian"
-#define MM_SKEW_STR		"skew-symmetric"
-#define MM_PATTERN_STR  "pattern"
-
-
-/*  high level routines */
-
-int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
-		 double val[], MM_typecode matcode);
-int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
-		double val[], MM_typecode matcode);
-int mm_read_mtx_crd_entry(FILE *f, int *I, int *J, double *real, double *img,
-			MM_typecode matcode);
-
-int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
-                double **val_, int **I_, int **J_);
-
-
-
-#endif
--- a/examples_cuda/gmres/util.h
+++ b/examples_cuda/gmres/util.h
@@ -1,53 +0,0 @@
-/*
-  Copyright (c) 2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-#ifndef __UTIL_H__
-#define __UTIL_H__
-
-#include <stdio.h>
-#include "matrix.h"
-
-
-inline void printMatrix (DenseMatrix &M, const char *name) {
-    printf("Matrix %s:\n", name);
-    for (int row = 0; row < M.rows(); row++) {
-        printf("row %2d: ", row + 1);
-        for (int col = 0; col < M.cols(); col++)
-            printf("%6f ", M(row, col));
-        printf("\n");
-    }
-    printf("\n");
-}
-
-#endif
--- a/examples_cuda/intrinsics/generic-16.h
+++ b/examples_cuda/intrinsics/generic-16.h
--- a/examples_cuda/intrinsics/generic-32.h
+++ b/examples_cuda/intrinsics/generic-32.h
--- a/examples_cuda/intrinsics/generic-64.h
+++ b/examples_cuda/intrinsics/generic-64.h
--- a/examples_cuda/intrinsics/knc-i1x16.h
+++ b/examples_cuda/intrinsics/knc-i1x16.h
--- a/examples_cuda/intrinsics/knc-i1x8.h
+++ b/examples_cuda/intrinsics/knc-i1x8.h
--- a/examples_cuda/intrinsics/knc-i1x8unsafe_fast.h
+++ b/examples_cuda/intrinsics/knc-i1x8unsafe_fast.h
@@ -1,86 +0,0 @@
-#define __ZMM64BIT__
-#include "knc-i1x8.h"
-
-/* The following tests fail because on KNC the native vec8_i32 and vec8_float types are 512 bits, not 256 bits, in size.
- *
- *  Using test compiler: Intel(r) SPMD Program Compiler (ispc), 1.4.5dev (build commit d68dbbc7bce74803 @ 20130919, LLVM 3.3)
- *  Using C/C++ compiler: icpc (ICC) 14.0.0 20130728
- *
- */
-
-/* knc-i1x8unsafe_fast.h fails: 
- * ----------------------------
-1 / 1206 tests FAILED compilation:
-	./tests/ptr-assign-lhs-math-1.ispc
-33 / 1206 tests FAILED execution:
-	./tests/array-gather-simple.ispc
-	./tests/array-gather-vary.ispc
-	./tests/array-multidim-gather-scatter.ispc
-	./tests/array-scatter-vary.ispc
-	./tests/atomics-5.ispc
-	./tests/atomics-swap.ispc
-	./tests/cfor-array-gather-vary.ispc
-	./tests/cfor-gs-improve-varying-1.ispc
-	./tests/cfor-struct-gather-2.ispc
-	./tests/cfor-struct-gather-3.ispc
-	./tests/cfor-struct-gather.ispc
-	./tests/gather-struct-vector.ispc
-	./tests/global-array-4.ispc
-	./tests/gs-improve-varying-1.ispc
-	./tests/half-1.ispc
-	./tests/half-3.ispc
-	./tests/half.ispc
-	./tests/launch-3.ispc
-	./tests/launch-4.ispc
-	./tests/masked-scatter-vector.ispc
-	./tests/masked-struct-scatter-varying.ispc
-	./tests/new-delete-6.ispc
-	./tests/ptr-24.ispc
-	./tests/ptr-25.ispc
-	./tests/short-vec-15.ispc
-	./tests/struct-gather-2.ispc
-	./tests/struct-gather-3.ispc
-	./tests/struct-gather.ispc
-	./tests/struct-ref-lvalue.ispc
-	./tests/struct-test-118.ispc
-	./tests/struct-vary-index-expr.ispc
-	./tests/typedef-2.ispc
-	./tests/vector-varying-scatter.ispc
-*/
-
-/* knc-i1x8.h fails: 
- * ----------------------------
-1 / 1206 tests FAILED compilation:
-	./tests/ptr-assign-lhs-math-1.ispc
-3 / 1206 tests FAILED execution:
-	./tests/half-1.ispc
-	./tests/half-3.ispc
-	./tests/half.ispc
-*/
-
-/* knc-i1x8.h fails: 
- * ----------------------------
-1 / 1206 tests FAILED compilation:
-        ./tests/ptr-assign-lhs-math-1.ispc
-4 / 1206 tests FAILED execution:
-        ./tests/half-1.ispc
-        ./tests/half-3.ispc
-        ./tests/half.ispc
-        ./tests/test-141.ispc
-*/
-
-/* generic-16.h fails (knc-i1x8.h and knc-i1x16.h are derived from this header):
- * ----------------------------
-1 / 1206 tests FAILED compilation:
-        ./tests/ptr-assign-lhs-math-1.ispc
-6 / 1206 tests FAILED execution:
-        ./tests/func-overload-max.ispc
-        ./tests/half-1.ispc
-        ./tests/half-3.ispc
-        ./tests/half.ispc
-        ./tests/test-141.ispc
-        ./tests/test-143.ispc
-*/
-
-
-
--- a/examples_cuda/intrinsics/knc.h
+++ b/examples_cuda/intrinsics/knc.h
--- a/examples_cuda/intrinsics/knc2x.h
+++ b/examples_cuda/intrinsics/knc2x.h
--- a/examples_cuda/intrinsics/sse4.h
+++ b/examples_cuda/intrinsics/sse4.h
--- a/examples_cuda/mandelbrot/.gitignore
+++ b/examples_cuda/mandelbrot/.gitignore
@@ -1,3 +0,0 @@
-mandelbrot
-*.ppm
-objs
--- a/examples_cuda/mandelbrot/Makefile
+++ b/examples_cuda/mandelbrot/Makefile
@@ -1,8 +0,0 @@
-
-EXAMPLE=mandelbrot
-CPP_SRC=mandelbrot.cpp mandelbrot_serial.cpp
-ISPC_SRC=mandelbrot.ispc
-ISPC_IA_TARGETS=sse2,sse4-x2,avx-x2
-ISPC_ARM_TARGETS=neon
-
-include ../common.mk
--- a/examples_cuda/mandelbrot/avx.out
+++ b/examples_cuda/mandelbrot/avx.out
--- a/examples_cuda/mandelbrot/avx1.out
+++ b/examples_cuda/mandelbrot/avx1.out
--- a/examples_cuda/mandelbrot/mandelbrot.cpp
+++ b/examples_cuda/mandelbrot/mandelbrot.cpp
@@ -1,118 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <algorithm>
-#include "../timing.h"
-#include "mandelbrot_ispc.h"
-using namespace ispc;
-
-extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
-                              int width, int height, int maxIterations,
-                              int output[]);
-
-/* Write a PPM image file with the image of the Mandelbrot set */
-static void
-writePPM(int *buf, int width, int height, const char *fn) {
-    FILE *fp = fopen(fn, "wb");
-    fprintf(fp, "P6\n");
-    fprintf(fp, "%d %d\n", width, height);
-    fprintf(fp, "255\n");
-    for (int i = 0; i < width*height; ++i) {
-        // Map the iteration count to colors by just alternating between
-        // two greys.
-        char c = (buf[i] & 0x1) ? 240 : 20;
-        for (int j = 0; j < 3; ++j)
-            fputc(c, fp);
-    }
-    fclose(fp);
-    printf("Wrote image file %s\n", fn);
-}
-
-
-int main() {
-    unsigned int width = 768;
-    unsigned int height = 512;
-    float x0 = -2;
-    float x1 = 1;
-    float y0 = -1;
-    float y1 = 1;
-
-    int maxIterations = 256;
-    int *buf = new int[width*height];
-
-    //
-    // Compute the image using the ispc implementation; report the minimum
-    // time of three runs.
-    //
-    double minISPC = 1e30;
-    for (int i = 0; i < 3; ++i) {
-        reset_and_start_timer();
-        mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, buf);
-        double dt = get_elapsed_mcycles();
-        minISPC = std::min(minISPC, dt);
-    }
-
-    printf("[mandelbrot ispc]:\t\t[%.3f] million cycles\n", minISPC);
-    writePPM(buf, width, height, "mandelbrot-ispc.ppm");
-
-    // Clear out the buffer
-    for (unsigned int i = 0; i < width * height; ++i)
-        buf[i] = 0;
-
-    // 
-    // And run the serial implementation 3 times, again reporting the
-    // minimum time.
-    //
-    double minSerial = 1e30;
-    for (int i = 0; i < 3; ++i) {
-        reset_and_start_timer();
-        mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
-        double dt = get_elapsed_mcycles();
-        minSerial = std::min(minSerial, dt);
-    }
-
-    printf("[mandelbrot serial]:\t\t[%.3f] million cycles\n", minSerial);
-    writePPM(buf, width, height, "mandelbrot-serial.ppm");
-
-    printf("\t\t\t\t(%.2fx speedup from ISPC)\n", minSerial/minISPC);
-
-    return 0;
-}
--- a/examples_cuda/mandelbrot/mandelbrot.ispc
+++ b/examples_cuda/mandelbrot/mandelbrot.ispc
@@ -1,78 +0,0 @@
-/*
-  Copyright (c) 2010-2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-static inline int mandel(float c_re, float c_im, int count) {
-    float z_re = c_re, z_im = c_im;
-    int i;
-    for (i = 0; i < count; ++i) {
-        if (z_re * z_re + z_im * z_im > 4.)
-            break;
-
-        float new_re = z_re*z_re - z_im*z_im;
-        float new_im = 2.f * z_re * z_im;
-        unmasked {
-            z_re = c_re + new_re;
-            z_im = c_im + new_im;
-        }
-    }
-
-    return i;
-}
-
-export void mandelbrot_ispc(uniform float x0, uniform float y0, 
-                            uniform float x1, uniform float y1,
-                            uniform int width, uniform int height, 
-                            uniform int maxIterations,
-                            uniform int output[])
-{
-    float dx = (x1 - x0) / width;
-    float dy = (y1 - y0) / height;
-
-    for (uniform int j = 0; j < height; j++) {
-        // Note that foreach processes programCount values of i in parallel
-        // on each iteration and masks off any leftover lanes, so width
-        // need not be a multiple of programCount.
-        foreach (i = 0 ... width) {
-            // Figure out the position on the complex plane to compute the
-            // number of iterations at.  Note that the x values are
-            // different across different program instances, since i is a
-            // varying value inside foreach: each program instance gets its
-            // own value of i.
-            float x = x0 + i * dx;
-            float y = y0 + j * dy;
-
-            int index = j * width + i;
-            output[index] = mandel(x, y, maxIterations);
-        }
-    }
-}
--- a/examples_cuda/mandelbrot/mandelbrot.vcxproj
+++ b/examples_cuda/mandelbrot/mandelbrot.vcxproj
@@ -1,175 +0,0 @@
-<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <ItemGroup Label="ProjectConfigurations">
-    <ProjectConfiguration Include="Debug|Win32">
-      <Configuration>Debug</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Debug|x64">
-      <Configuration>Debug</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|Win32">
-      <Configuration>Release</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|x64">
-      <Configuration>Release</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-  </ItemGroup>
-  <PropertyGroup Label="Globals">
-    <ProjectGuid>{6D3EF8C5-AE26-407B-9ECE-C27CB988D9C1}</ProjectGuid>
-    <Keyword>Win32Proj</Keyword>
-    <RootNamespace>mandelbrot</RootNamespace>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
-  <ImportGroup Label="ExtensionSettings">
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <PropertyGroup Label="UserMacros" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-  </PropertyGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemGroup>
-    <ClCompile Include="mandelbrot.cpp" />
-    <ClCompile Include="mandelbrot_serial.cpp" />
-  </ItemGroup>
-  <ItemGroup>
-    <CustomBuild Include="mandelbrot.ispc">
-      <FileType>Document</FileType>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-    </CustomBuild>
-  </ItemGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
-  <ImportGroup Label="ExtensionTargets">
-  </ImportGroup>
-</Project>
--- a/examples_cuda/mandelbrot/mandelbrot_serial.cpp
+++ b/examples_cuda/mandelbrot/mandelbrot_serial.cpp
@@ -1,68 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-static int mandel(float c_re, float c_im, int count) {
-    float z_re = c_re, z_im = c_im;
-    int i;
-    for (i = 0; i < count; ++i) {
-        if (z_re * z_re + z_im * z_im > 4.f)
-            break;
-
-        float new_re = z_re*z_re - z_im*z_im;
-        float new_im = 2.f * z_re * z_im;
-        z_re = c_re + new_re;
-        z_im = c_im + new_im;
-    }
-
-    return i;
-}
-
-void mandelbrot_serial(float x0, float y0, float x1, float y1,
-                       int width, int height, int maxIterations,
-                       int output[])
-{
-    float dx = (x1 - x0) / width;
-    float dy = (y1 - y0) / height;
-
-    for (int j = 0; j < height; j++) {
-        for (int i = 0; i < width; ++i) {
-            float x = x0 + i * dx;
-            float y = y0 + j * dy;
-
-            int index = (j * width + i);
-            output[index] = mandel(x, y, maxIterations);
-        }
-    }
-}
-
--- a/examples_cuda/mandelbrot/out.o
+++ b/examples_cuda/mandelbrot/out.o
--- a/examples_cuda/mandelbrot/out.ptx
+++ b/examples_cuda/mandelbrot/out.ptx
@@ -1,843 +0,0 @@
-//
-// Generated by LLVM NVPTX Back-End
-//
-
-.version 3.1
-.target sm_35, texmode_independent
-.address_size 64
-
-	// .globl	__vselect_i8
-                                        // @__vselect_i8
-.func  (.param .align 1 .b8 func_retval0[1]) __vselect_i8(
-	.param .align 1 .b8 __vselect_i8_param_0[1],
-	.param .align 1 .b8 __vselect_i8_param_1[1],
-	.param .align 4 .b8 __vselect_i8_param_2[4]
-)
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u32 	%r0, [__vselect_i8_param_2];
-	setp.eq.s32 	%p0, %r0, 0;
-	ld.param.u8 	%rc0, [__vselect_i8_param_0];
-	ld.param.u8 	%rc1, [__vselect_i8_param_1];
-	selp.b16 	%rc0, %rc0, %rc1, %p0;
-	st.param.b8	[func_retval0+0], %rc0;
-	ret;
-}
-
-	// .globl	__vselect_i16
-.func  (.param .align 2 .b8 func_retval0[2]) __vselect_i16(
-	.param .align 2 .b8 __vselect_i16_param_0[2],
-	.param .align 2 .b8 __vselect_i16_param_1[2],
-	.param .align 4 .b8 __vselect_i16_param_2[4]
-)                                       // @__vselect_i16
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u32 	%r0, [__vselect_i16_param_2];
-	setp.eq.s32 	%p0, %r0, 0;
-	ld.param.u16 	%rs0, [__vselect_i16_param_0];
-	ld.param.u16 	%rs1, [__vselect_i16_param_1];
-	selp.b16 	%rs0, %rs0, %rs1, %p0;
-	st.param.b16	[func_retval0+0], %rs0;
-	ret;
-}
-
-	// .globl	__vselect_i64
-.func  (.param .align 8 .b8 func_retval0[8]) __vselect_i64(
-	.param .align 8 .b8 __vselect_i64_param_0[8],
-	.param .align 8 .b8 __vselect_i64_param_1[8],
-	.param .align 4 .b8 __vselect_i64_param_2[4]
-)                                       // @__vselect_i64
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u32 	%r0, [__vselect_i64_param_2];
-	setp.eq.s32 	%p0, %r0, 0;
-	ld.param.u64 	%rl0, [__vselect_i64_param_0];
-	ld.param.u64 	%rl1, [__vselect_i64_param_1];
-	selp.b64 	%rl0, %rl0, %rl1, %p0;
-	st.param.b64	[func_retval0+0], %rl0;
-	ret;
-}
-
-	// .globl	__aos_to_soa4_float1
-.func __aos_to_soa4_float1(
-	.param .align 4 .b8 __aos_to_soa4_float1_param_0[4],
-	.param .align 4 .b8 __aos_to_soa4_float1_param_1[4],
-	.param .align 4 .b8 __aos_to_soa4_float1_param_2[4],
-	.param .align 4 .b8 __aos_to_soa4_float1_param_3[4],
-	.param .b64 __aos_to_soa4_float1_param_4,
-	.param .b64 __aos_to_soa4_float1_param_5,
-	.param .b64 __aos_to_soa4_float1_param_6,
-	.param .b64 __aos_to_soa4_float1_param_7
-)                                       // @__aos_to_soa4_float1
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u64 	%rl0, [__aos_to_soa4_float1_param_4];
-	ld.param.u64 	%rl1, [__aos_to_soa4_float1_param_5];
-	ld.param.u64 	%rl2, [__aos_to_soa4_float1_param_6];
-	ld.param.u64 	%rl3, [__aos_to_soa4_float1_param_7];
-	ld.param.f32 	%f0, [__aos_to_soa4_float1_param_0];
-	ld.param.f32 	%f1, [__aos_to_soa4_float1_param_1];
-	ld.param.f32 	%f2, [__aos_to_soa4_float1_param_2];
-	ld.param.f32 	%f3, [__aos_to_soa4_float1_param_3];
-	st.f32 	[%rl0], %f0;
-	st.f32 	[%rl1], %f1;
-	st.f32 	[%rl2], %f2;
-	st.f32 	[%rl3], %f3;
-	ret;
-}
-
-	// .globl	__soa_to_aos4_float1
-.func __soa_to_aos4_float1(
-	.param .align 4 .b8 __soa_to_aos4_float1_param_0[4],
-	.param .align 4 .b8 __soa_to_aos4_float1_param_1[4],
-	.param .align 4 .b8 __soa_to_aos4_float1_param_2[4],
-	.param .align 4 .b8 __soa_to_aos4_float1_param_3[4],
-	.param .b64 __soa_to_aos4_float1_param_4,
-	.param .b64 __soa_to_aos4_float1_param_5,
-	.param .b64 __soa_to_aos4_float1_param_6,
-	.param .b64 __soa_to_aos4_float1_param_7
-)                                       // @__soa_to_aos4_float1
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u64 	%rl0, [__soa_to_aos4_float1_param_4];
-	ld.param.u64 	%rl1, [__soa_to_aos4_float1_param_5];
-	ld.param.u64 	%rl2, [__soa_to_aos4_float1_param_6];
-	ld.param.u64 	%rl3, [__soa_to_aos4_float1_param_7];
-	ld.param.f32 	%f0, [__soa_to_aos4_float1_param_0];
-	ld.param.f32 	%f1, [__soa_to_aos4_float1_param_1];
-	ld.param.f32 	%f2, [__soa_to_aos4_float1_param_2];
-	ld.param.f32 	%f3, [__soa_to_aos4_float1_param_3];
-	st.f32 	[%rl0], %f0;
-	st.f32 	[%rl1], %f1;
-	st.f32 	[%rl2], %f2;
-	st.f32 	[%rl3], %f3;
-	ret;
-}
-
-	// .globl	__aos_to_soa3_float1
-.func __aos_to_soa3_float1(
-	.param .align 4 .b8 __aos_to_soa3_float1_param_0[4],
-	.param .align 4 .b8 __aos_to_soa3_float1_param_1[4],
-	.param .align 4 .b8 __aos_to_soa3_float1_param_2[4],
-	.param .b64 __aos_to_soa3_float1_param_3,
-	.param .b64 __aos_to_soa3_float1_param_4,
-	.param .b64 __aos_to_soa3_float1_param_5
-)                                       // @__aos_to_soa3_float1
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u64 	%rl0, [__aos_to_soa3_float1_param_3];
-	ld.param.u64 	%rl1, [__aos_to_soa3_float1_param_4];
-	ld.param.u64 	%rl2, [__aos_to_soa3_float1_param_5];
-	ld.param.f32 	%f0, [__aos_to_soa3_float1_param_0];
-	ld.param.f32 	%f1, [__aos_to_soa3_float1_param_1];
-	ld.param.f32 	%f2, [__aos_to_soa3_float1_param_2];
-	st.f32 	[%rl0], %f0;
-	st.f32 	[%rl1], %f1;
-	st.f32 	[%rl2], %f2;
-	ret;
-}
-
-	// .globl	__soa_to_aos3_float1
-.func __soa_to_aos3_float1(
-	.param .align 4 .b8 __soa_to_aos3_float1_param_0[4],
-	.param .align 4 .b8 __soa_to_aos3_float1_param_1[4],
-	.param .align 4 .b8 __soa_to_aos3_float1_param_2[4],
-	.param .b64 __soa_to_aos3_float1_param_3,
-	.param .b64 __soa_to_aos3_float1_param_4,
-	.param .b64 __soa_to_aos3_float1_param_5
-)                                       // @__soa_to_aos3_float1
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.u64 	%rl0, [__soa_to_aos3_float1_param_3];
-	ld.param.u64 	%rl1, [__soa_to_aos3_float1_param_4];
-	ld.param.u64 	%rl2, [__soa_to_aos3_float1_param_5];
-	ld.param.f32 	%f0, [__soa_to_aos3_float1_param_0];
-	ld.param.f32 	%f1, [__soa_to_aos3_float1_param_1];
-	ld.param.f32 	%f2, [__soa_to_aos3_float1_param_2];
-	st.f32 	[%rl0], %f0;
-	st.f32 	[%rl1], %f1;
-	st.f32 	[%rl2], %f2;
-	ret;
-}
-
-	// .globl	__rsqrt_varying_double
-.func  (.param .align 8 .b8 func_retval0[8]) __rsqrt_varying_double(
-	.param .align 8 .b8 __rsqrt_varying_double_param_0[8]
-)                                       // @__rsqrt_varying_double
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:
-	ld.param.f64 	%fl0, [__rsqrt_varying_double_param_0];
-	rsqrt.approx.f64 	%fl0, %fl0;
-	st.param.f64	[func_retval0+0], %fl0;
-	ret;
-}
-
-	// .globl	mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_
-.func mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_0,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_1,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_2,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_3,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_4,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_5,
-	.param .b32 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_6,
-	.param .b64 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_7,
-	.param .align 4 .b8 mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_8[4]
-)                                       // @mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:                                // %allocas
-	ld.param.f32 	%f0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_0];
-	ld.param.f32 	%f1, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_1];
-	ld.param.f32 	%f3, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_2];
-	ld.param.f32 	%f2, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_3];
-	ld.param.u32 	%r0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_4];
-	ld.param.u32 	%r1, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_5];
-	ld.param.u32 	%r2, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_6];
-	ld.param.u64 	%rl0, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_7];
-	ld.param.u32 	%r3, [mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E__param_8];
-	setp.lt.s32 	%p0, %r3, 0;
-	sub.f32 	%f3, %f3, %f0;
-	cvt.rn.f32.s32 	%f4, %r0;
-	sub.f32 	%f2, %f2, %f1;
-	cvt.rn.f32.s32 	%f5, %r1;
-	div.rn.f32 	%f2, %f2, %f5;
-	div.rn.f32 	%f3, %f3, %f4;
-	@%p0 bra 	BB8_9;
-// BB#1:                                // %for_test110.preheader
-	setp.lt.s32 	%p0, %r1, 1;
-	@%p0 bra 	BB8_45;
-// BB#2:                                // %outer_not_in_extras140.preheader.lr.ph
-	setp.gt.s32 	%p0, %r2, 0;
-	mov.u32 	%r3, 0;
-	selp.b32 	%r4, -1, 0, %p0;
-	shl.b32 	%r5, %r0, 2;
-	mov.u32 	%r6, %r3;
-BB8_3:                                  // %outer_not_in_extras140.preheader
-                                        // =>This Loop Header: Depth=1
-                                        //     Child Loop BB8_41 Depth 2
-                                        //     Child Loop BB8_43 Depth 2
-                                        //     Child Loop BB8_38 Depth 2
-                                        //       Child Loop BB8_33 Depth 3
-	setp.lt.s32 	%p0, %r0, 1;
-	@%p0 bra 	BB8_4;
-// BB#31:                               // %foreach_full_body120.lr.ph
-                                        //   in Loop: Header=BB8_3 Depth=1
-	setp.lt.s32 	%p0, %r4, 0;
-	mov.u32 	%r7, %r0;
-	mov.u32 	%r8, %r3;
-	@%p0 bra 	BB8_32;
-	bra.uni 	BB8_43;
-BB8_32:                                 //   in Loop: Header=BB8_3 Depth=1
-	mov.u64 	%rl1, 0;
-	cvt.rn.f32.s32 	%f4, %r6;
-	fma.rn.f32 	%f4, %f2, %f4, %f1;
-	mul.lo.s32 	%r7, %r6, %r0;
-BB8_38:                                 // %for_loop.i380.lr.ph.us
-                                        //   Parent Loop BB8_3 Depth=1
-                                        // =>  This Loop Header: Depth=2
-                                        //       Child Loop BB8_33 Depth 3
-	cvt.u32.u64 	%r8, %rl1;
-	cvt.rn.f32.s32 	%f5, %r8;
-	fma.rn.f32 	%f5, %f3, %f5, %f0;
-	mov.u32 	%r10, 0;
-	mov.u32 	%r12, %r4;
-	mov.u32 	%r11, %r10;
-	mov.u32 	%r9, %r10;
-	mov.f32 	%f7, %f5;
-	mov.f32 	%f6, %f4;
-BB8_33:                                 // %for_loop.i380.us
-                                        //   Parent Loop BB8_3 Depth=1
-                                        //     Parent Loop BB8_38 Depth=2
-                                        // =>    This Inner Loop Header: Depth=3
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p0, %f9, 0f40800000;
-	selp.b32 	%r13, %r12, 0, %p0;
-	or.b32  	%r11, %r13, %r11;
-	shr.u32 	%r13, %r11, 31;
-	shr.u32 	%r14, %r12, 31;
-	setp.eq.s32 	%p0, %r13, %r14;
-	@%p0 bra 	BB8_34;
-	bra.uni 	BB8_35;
-BB8_34:                                 //   in Loop: Header=BB8_33 Depth=3
-	mov.u32 	%r12, %r10;
-	bra.uni 	BB8_36;
-BB8_35:                                 // %not_all_continued_or_breaked.i394.us
-                                        //   in Loop: Header=BB8_33 Depth=3
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r13, %r11;
-	and.b32  	%r12, %r12, %r13;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f5, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f4;
-	mov.f32 	%f7, %f8;
-BB8_36:                                 // %for_step.i363.us
-                                        //   in Loop: Header=BB8_33 Depth=3
-	setp.ne.s32 	%p0, %r12, 0;
-	selp.u32 	%r13, 1, 0, %p0;
-	add.s32 	%r9, %r9, %r13;
-	setp.lt.s32 	%p0, %r9, %r2;
-	selp.b32 	%r12, %r12, 0, %p0;
-	setp.lt.s32 	%p0, %r12, 0;
-	@%p0 bra 	BB8_33;
-// BB#37:                               // %mandel___vyfvyfvyi.exit395.us
-                                        //   in Loop: Header=BB8_38 Depth=2
-	add.s32 	%r8, %r8, %r7;
-	shl.b32 	%r8, %r8, 2;
-	cvt.s64.s32 	%rl2, %r8;
-	add.s64 	%rl2, %rl2, %rl0;
-	st.u32 	[%rl2], %r9;
-	add.s64 	%rl1, %rl1, 1;
-	cvt.u32.u64 	%r8, %rl1;
-	setp.eq.s32 	%p0, %r8, %r0;
-	@%p0 bra 	BB8_44;
-	bra.uni 	BB8_38;
-BB8_43:                                 // %mandel___vyfvyfvyi.exit395
-                                        //   Parent Loop BB8_3 Depth=1
-                                        // =>  This Inner Loop Header: Depth=2
-	cvt.s64.s32 	%rl1, %r8;
-	add.s64 	%rl1, %rl1, %rl0;
-	mov.u32 	%r9, 0;
-	st.u32 	[%rl1], %r9;
-	add.s32 	%r8, %r8, 4;
-	add.s32 	%r7, %r7, -1;
-	setp.eq.s32 	%p0, %r7, 0;
-	@%p0 bra 	BB8_44;
-	bra.uni 	BB8_43;
-BB8_4:                                  // %partial_inner_all_outer156
-                                        //   in Loop: Header=BB8_3 Depth=1
-	@%p0 bra 	BB8_44;
-// BB#5:                                // %partial_inner_only197
-                                        //   in Loop: Header=BB8_3 Depth=1
-	setp.gt.s32 	%p0, %r0, 0;
-	mov.u32 	%r8, 0;
-	fma.rn.f32 	%f4, %f3, 0f00000000, %f0;
-	cvt.rn.f32.s32 	%f5, %r6;
-	fma.rn.f32 	%f5, %f2, %f5, %f1;
-	selp.b32 	%r7, %r4, 0, %p0;
-	setp.lt.s32 	%p1, %r7, 0;
-	mov.u32 	%r10, %r4;
-	mov.u32 	%r9, %r8;
-	mov.u32 	%r7, %r8;
-	mov.f32 	%f7, %f4;
-	mov.f32 	%f6, %f5;
-	@%p1 bra 	BB8_41;
-	bra.uni 	BB8_6;
-BB8_41:                                 // %for_loop.i
-                                        //   Parent Loop BB8_3 Depth=1
-                                        // =>  This Inner Loop Header: Depth=2
-	selp.b32 	%r11, %r10, 0, %p0;
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p1, %f9, 0f40800000;
-	selp.b32 	%r12, %r10, 0, %p1;
-	or.b32  	%r9, %r12, %r9;
-	selp.b32 	%r12, %r9, 0, %p0;
-	shr.u32 	%r12, %r12, 31;
-	shr.u32 	%r11, %r11, 31;
-	setp.eq.s32 	%p1, %r12, %r11;
-	@%p1 bra 	BB8_42;
-	bra.uni 	BB8_39;
-BB8_42:                                 //   in Loop: Header=BB8_41 Depth=2
-	mov.u32 	%r10, %r8;
-	bra.uni 	BB8_40;
-BB8_39:                                 // %not_all_continued_or_breaked.i
-                                        //   in Loop: Header=BB8_41 Depth=2
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r11, %r9;
-	and.b32  	%r10, %r10, %r11;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f4, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f5;
-	mov.f32 	%f7, %f8;
-BB8_40:                                 // %for_step.i
-                                        //   in Loop: Header=BB8_41 Depth=2
-	setp.ne.s32 	%p1, %r10, 0;
-	selp.u32 	%r11, 1, 0, %p1;
-	add.s32 	%r7, %r7, %r11;
-	setp.lt.s32 	%p1, %r7, %r2;
-	selp.b32 	%r10, %r10, 0, %p1;
-	selp.b32 	%r11, %r10, 0, %p0;
-	setp.gt.s32 	%p1, %r11, -1;
-	@%p1 bra 	BB8_7;
-	bra.uni 	BB8_41;
-BB8_6:                                  //   in Loop: Header=BB8_3 Depth=1
-	mov.u32 	%r7, %r8;
-BB8_7:                                  // %mandel___vyfvyfvyi.exit
-                                        //   in Loop: Header=BB8_3 Depth=1
-	setp.lt.s32 	%p0, %r0, 1;
-	@%p0 bra 	BB8_44;
-// BB#8:                                // %pl_dolane.i
-                                        //   in Loop: Header=BB8_3 Depth=1
-	mul.lo.s32 	%r8, %r6, %r0;
-	shl.b32 	%r8, %r8, 2;
-	cvt.s64.s32 	%rl1, %r8;
-	add.s64 	%rl1, %rl1, %rl0;
-	st.u32 	[%rl1], %r7;
-BB8_44:                                 // %foreach_reset128
-                                        //   in Loop: Header=BB8_3 Depth=1
-	add.s32 	%r6, %r6, 1;
-	add.s32 	%r3, %r3, %r5;
-	setp.eq.s32 	%p0, %r6, %r1;
-	@%p0 bra 	BB8_45;
-	bra.uni 	BB8_3;
-BB8_9:                                  // %for_test.preheader
-	setp.lt.s32 	%p0, %r1, 1;
-	@%p0 bra 	BB8_45;
-// BB#10:                               // %outer_not_in_extras.preheader.lr.ph
-	setp.gt.s32 	%p0, %r2, 0;
-	mov.u32 	%r3, 0;
-	selp.b32 	%r4, -1, 0, %p0;
-	shl.b32 	%r5, %r0, 2;
-	mov.u32 	%r6, %r3;
-BB8_11:                                 // %outer_not_in_extras.preheader
-                                        // =>This Loop Header: Depth=1
-                                        //     Child Loop BB8_23 Depth 2
-                                        //     Child Loop BB8_20 Depth 2
-                                        //     Child Loop BB8_19 Depth 2
-                                        //       Child Loop BB8_14 Depth 3
-	setp.lt.s32 	%p0, %r0, 1;
-	@%p0 bra 	BB8_28;
-// BB#12:                               // %foreach_full_body.lr.ph
-                                        //   in Loop: Header=BB8_11 Depth=1
-	setp.lt.s32 	%p0, %r4, 0;
-	mov.u32 	%r7, %r0;
-	mov.u32 	%r8, %r3;
-	@%p0 bra 	BB8_13;
-	bra.uni 	BB8_20;
-BB8_13:                                 //   in Loop: Header=BB8_11 Depth=1
-	mov.u64 	%rl1, 0;
-	cvt.rn.f32.s32 	%f4, %r6;
-	fma.rn.f32 	%f4, %f2, %f4, %f1;
-	mul.lo.s32 	%r7, %r6, %r0;
-BB8_19:                                 // %for_loop.i281.lr.ph.us
-                                        //   Parent Loop BB8_11 Depth=1
-                                        // =>  This Loop Header: Depth=2
-                                        //       Child Loop BB8_14 Depth 3
-	cvt.u32.u64 	%r8, %rl1;
-	cvt.rn.f32.s32 	%f5, %r8;
-	fma.rn.f32 	%f5, %f3, %f5, %f0;
-	mov.u32 	%r10, 0;
-	mov.u32 	%r12, %r4;
-	mov.u32 	%r11, %r10;
-	mov.u32 	%r9, %r10;
-	mov.f32 	%f7, %f5;
-	mov.f32 	%f6, %f4;
-BB8_14:                                 // %for_loop.i281.us
-                                        //   Parent Loop BB8_11 Depth=1
-                                        //     Parent Loop BB8_19 Depth=2
-                                        // =>    This Inner Loop Header: Depth=3
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p0, %f9, 0f40800000;
-	selp.b32 	%r13, %r12, 0, %p0;
-	or.b32  	%r11, %r13, %r11;
-	shr.u32 	%r13, %r11, 31;
-	shr.u32 	%r14, %r12, 31;
-	setp.eq.s32 	%p0, %r13, %r14;
-	@%p0 bra 	BB8_15;
-	bra.uni 	BB8_16;
-BB8_15:                                 //   in Loop: Header=BB8_14 Depth=3
-	mov.u32 	%r12, %r10;
-	bra.uni 	BB8_17;
-BB8_16:                                 // %not_all_continued_or_breaked.i295.us
-                                        //   in Loop: Header=BB8_14 Depth=3
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r13, %r11;
-	and.b32  	%r12, %r12, %r13;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f5, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f4;
-	mov.f32 	%f7, %f8;
-BB8_17:                                 // %for_step.i264.us
-                                        //   in Loop: Header=BB8_14 Depth=3
-	setp.ne.s32 	%p0, %r12, 0;
-	selp.u32 	%r13, 1, 0, %p0;
-	add.s32 	%r9, %r9, %r13;
-	setp.lt.s32 	%p0, %r9, %r2;
-	selp.b32 	%r12, %r12, 0, %p0;
-	setp.lt.s32 	%p0, %r12, 0;
-	@%p0 bra 	BB8_14;
-// BB#18:                               // %mandel___vyfvyfvyi.exit296.us
-                                        //   in Loop: Header=BB8_19 Depth=2
-	add.s32 	%r8, %r8, %r7;
-	shl.b32 	%r8, %r8, 2;
-	cvt.s64.s32 	%rl2, %r8;
-	add.s64 	%rl2, %rl2, %rl0;
-	st.u32 	[%rl2], %r9;
-	add.s64 	%rl1, %rl1, 1;
-	cvt.u32.u64 	%r8, %rl1;
-	setp.eq.s32 	%p0, %r8, %r0;
-	@%p0 bra 	BB8_27;
-	bra.uni 	BB8_19;
-BB8_20:                                 // %mandel___vyfvyfvyi.exit296
-                                        //   Parent Loop BB8_11 Depth=1
-                                        // =>  This Inner Loop Header: Depth=2
-	cvt.s64.s32 	%rl1, %r8;
-	add.s64 	%rl1, %rl1, %rl0;
-	mov.u32 	%r9, 0;
-	st.u32 	[%rl1], %r9;
-	add.s32 	%r8, %r8, 4;
-	add.s32 	%r7, %r7, -1;
-	setp.eq.s32 	%p0, %r7, 0;
-	@%p0 bra 	BB8_27;
-	bra.uni 	BB8_20;
-BB8_28:                                 // %partial_inner_all_outer
-                                        //   in Loop: Header=BB8_11 Depth=1
-	@%p0 bra 	BB8_27;
-// BB#29:                               // %partial_inner_only
-                                        //   in Loop: Header=BB8_11 Depth=1
-	setp.gt.s32 	%p0, %r0, 0;
-	mov.u32 	%r8, 0;
-	fma.rn.f32 	%f4, %f3, 0f00000000, %f0;
-	cvt.rn.f32.s32 	%f5, %r6;
-	fma.rn.f32 	%f5, %f2, %f5, %f1;
-	selp.b32 	%r7, %r4, 0, %p0;
-	setp.lt.s32 	%p1, %r7, 0;
-	mov.u32 	%r10, %r4;
-	mov.u32 	%r9, %r8;
-	mov.u32 	%r7, %r8;
-	mov.f32 	%f7, %f4;
-	mov.f32 	%f6, %f5;
-	@%p1 bra 	BB8_23;
-	bra.uni 	BB8_30;
-BB8_23:                                 // %for_loop.i332
-                                        //   Parent Loop BB8_11 Depth=1
-                                        // =>  This Inner Loop Header: Depth=2
-	selp.b32 	%r11, %r10, 0, %p0;
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p1, %f9, 0f40800000;
-	selp.b32 	%r12, %r10, 0, %p1;
-	or.b32  	%r9, %r12, %r9;
-	selp.b32 	%r12, %r9, 0, %p0;
-	shr.u32 	%r12, %r12, 31;
-	shr.u32 	%r11, %r11, 31;
-	setp.eq.s32 	%p1, %r12, %r11;
-	@%p1 bra 	BB8_24;
-	bra.uni 	BB8_21;
-BB8_24:                                 //   in Loop: Header=BB8_23 Depth=2
-	mov.u32 	%r10, %r8;
-	bra.uni 	BB8_22;
-BB8_21:                                 // %not_all_continued_or_breaked.i346
-                                        //   in Loop: Header=BB8_23 Depth=2
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r11, %r9;
-	and.b32  	%r10, %r10, %r11;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f4, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f5;
-	mov.f32 	%f7, %f8;
-BB8_22:                                 // %for_step.i313
-                                        //   in Loop: Header=BB8_23 Depth=2
-	setp.ne.s32 	%p1, %r10, 0;
-	selp.u32 	%r11, 1, 0, %p1;
-	add.s32 	%r7, %r7, %r11;
-	setp.lt.s32 	%p1, %r7, %r2;
-	selp.b32 	%r10, %r10, 0, %p1;
-	selp.b32 	%r11, %r10, 0, %p0;
-	setp.gt.s32 	%p1, %r11, -1;
-	@%p1 bra 	BB8_25;
-	bra.uni 	BB8_23;
-BB8_30:                                 //   in Loop: Header=BB8_11 Depth=1
-	mov.u32 	%r7, %r8;
-BB8_25:                                 // %mandel___vyfvyfvyi.exit347
-                                        //   in Loop: Header=BB8_11 Depth=1
-	setp.lt.s32 	%p0, %r0, 1;
-	@%p0 bra 	BB8_27;
-// BB#26:                               // %pl_dolane.i452
-                                        //   in Loop: Header=BB8_11 Depth=1
-	mul.lo.s32 	%r8, %r6, %r0;
-	shl.b32 	%r8, %r8, 2;
-	cvt.s64.s32 	%rl1, %r8;
-	add.s64 	%rl1, %rl1, %rl0;
-	st.u32 	[%rl1], %r7;
-BB8_27:                                 // %foreach_reset
-                                        //   in Loop: Header=BB8_11 Depth=1
-	add.s32 	%r6, %r6, 1;
-	add.s32 	%r3, %r3, %r5;
-	setp.eq.s32 	%p0, %r6, %r1;
-	@%p0 bra 	BB8_45;
-	bra.uni 	BB8_11;
-BB8_45:                                 // %for_exit
-	ret;
-}
-
-	// .globl	mandelbrot_ispc
-.func mandelbrot_ispc(
-	.param .b32 mandelbrot_ispc_param_0,
-	.param .b32 mandelbrot_ispc_param_1,
-	.param .b32 mandelbrot_ispc_param_2,
-	.param .b32 mandelbrot_ispc_param_3,
-	.param .b32 mandelbrot_ispc_param_4,
-	.param .b32 mandelbrot_ispc_param_5,
-	.param .b32 mandelbrot_ispc_param_6,
-	.param .b64 mandelbrot_ispc_param_7
-)                                       // @mandelbrot_ispc
-{
-	.reg .pred %p<396>;
-	.reg .s16 %rc<396>;
-	.reg .s16 %rs<396>;
-	.reg .s32 %r<396>;
-	.reg .s64 %rl<396>;
-	.reg .f32 %f<396>;
-	.reg .f64 %fl<396>;
-
-// BB#0:                                // %allocas
-	ld.param.u32 	%r0, [mandelbrot_ispc_param_5];
-	setp.lt.s32 	%p0, %r0, 1;
-	@%p0 bra 	BB9_18;
-// BB#1:                                // %outer_not_in_extras.preheader.lr.ph
-	ld.param.f32 	%f0, [mandelbrot_ispc_param_0];
-	ld.param.f32 	%f1, [mandelbrot_ispc_param_1];
-	ld.param.f32 	%f3, [mandelbrot_ispc_param_2];
-	ld.param.f32 	%f2, [mandelbrot_ispc_param_3];
-	ld.param.u32 	%r1, [mandelbrot_ispc_param_4];
-	ld.param.u32 	%r2, [mandelbrot_ispc_param_6];
-	ld.param.u64 	%rl0, [mandelbrot_ispc_param_7];
-	sub.f32 	%f3, %f3, %f0;
-	cvt.rn.f32.s32 	%f4, %r1;
-	sub.f32 	%f2, %f2, %f1;
-	cvt.rn.f32.s32 	%f5, %r0;
-	div.rn.f32 	%f2, %f2, %f5;
-	div.rn.f32 	%f3, %f3, %f4;
-	setp.gt.s32 	%p0, %r2, 0;
-	mov.u32 	%r3, 0;
-	selp.b32 	%r4, -1, 0, %p0;
-BB9_2:                                  // %outer_not_in_extras.preheader
-                                        // =>This Loop Header: Depth=1
-                                        //     Child Loop BB9_13 Depth 2
-                                        //     Child Loop BB9_4 Depth 2
-                                        //       Child Loop BB9_9 Depth 3
-	setp.lt.s32 	%p0, %r1, 1;
-	@%p0 bra 	BB9_19;
-// BB#3:                                // %foreach_full_body.lr.ph
-                                        //   in Loop: Header=BB9_2 Depth=1
-	mov.u64 	%rl1, 0;
-	cvt.rn.f32.s32 	%f4, %r3;
-	fma.rn.f32 	%f4, %f2, %f4, %f1;
-	mul.lo.s32 	%r5, %r3, %r1;
-BB9_4:                                  // %foreach_full_body
-                                        //   Parent Loop BB9_2 Depth=1
-                                        // =>  This Loop Header: Depth=2
-                                        //       Child Loop BB9_9 Depth 3
-	setp.lt.s32 	%p0, %r4, 0;
-	cvt.u32.u64 	%r6, %rl1;
-	cvt.rn.f32.s32 	%f5, %r6;
-	fma.rn.f32 	%f5, %f3, %f5, %f0;
-	mov.u32 	%r8, 0;
-	mov.u32 	%r10, %r4;
-	mov.u32 	%r9, %r8;
-	mov.u32 	%r7, %r8;
-	mov.f32 	%f7, %f5;
-	mov.f32 	%f6, %f4;
-	@%p0 bra 	BB9_9;
-	bra.uni 	BB9_5;
-BB9_9:                                  // %for_loop.i281
-                                        //   Parent Loop BB9_2 Depth=1
-                                        //     Parent Loop BB9_4 Depth=2
-                                        // =>    This Inner Loop Header: Depth=3
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p0, %f9, 0f40800000;
-	selp.b32 	%r11, %r10, 0, %p0;
-	or.b32  	%r9, %r11, %r9;
-	shr.u32 	%r11, %r9, 31;
-	shr.u32 	%r12, %r10, 31;
-	setp.eq.s32 	%p0, %r11, %r12;
-	@%p0 bra 	BB9_10;
-	bra.uni 	BB9_7;
-BB9_10:                                 //   in Loop: Header=BB9_9 Depth=3
-	mov.u32 	%r10, %r8;
-	bra.uni 	BB9_8;
-BB9_7:                                  // %not_all_continued_or_breaked.i295
-                                        //   in Loop: Header=BB9_9 Depth=3
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r11, %r9;
-	and.b32  	%r10, %r10, %r11;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f5, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f4;
-	mov.f32 	%f7, %f8;
-BB9_8:                                  // %for_step.i264
-                                        //   in Loop: Header=BB9_9 Depth=3
-	setp.ne.s32 	%p0, %r10, 0;
-	selp.u32 	%r11, 1, 0, %p0;
-	add.s32 	%r7, %r7, %r11;
-	setp.lt.s32 	%p0, %r7, %r2;
-	selp.b32 	%r10, %r10, 0, %p0;
-	setp.gt.s32 	%p0, %r10, -1;
-	@%p0 bra 	BB9_6;
-	bra.uni 	BB9_9;
-BB9_5:                                  //   in Loop: Header=BB9_4 Depth=2
-	mov.u32 	%r7, %r8;
-BB9_6:                                  // %mandel___vyfvyfvyi.exit296
-                                        //   in Loop: Header=BB9_4 Depth=2
-	add.s32 	%r6, %r6, %r5;
-	shl.b32 	%r6, %r6, 2;
-	cvt.s64.s32 	%rl2, %r6;
-	add.s64 	%rl2, %rl2, %rl0;
-	st.u32 	[%rl2], %r7;
-	add.s64 	%rl1, %rl1, 1;
-	cvt.u32.u64 	%r6, %rl1;
-	setp.eq.s32 	%p0, %r6, %r1;
-	@%p0 bra 	BB9_17;
-	bra.uni 	BB9_4;
-BB9_19:                                 // %partial_inner_all_outer
-                                        //   in Loop: Header=BB9_2 Depth=1
-	@%p0 bra 	BB9_17;
-// BB#20:                               // %partial_inner_only
-                                        //   in Loop: Header=BB9_2 Depth=1
-	setp.gt.s32 	%p0, %r1, 0;
-	mov.u32 	%r6, 0;
-	fma.rn.f32 	%f4, %f3, 0f00000000, %f0;
-	cvt.rn.f32.s32 	%f5, %r3;
-	fma.rn.f32 	%f5, %f2, %f5, %f1;
-	selp.b32 	%r5, %r4, 0, %p0;
-	setp.lt.s32 	%p1, %r5, 0;
-	mov.u32 	%r8, %r4;
-	mov.u32 	%r7, %r6;
-	mov.u32 	%r5, %r6;
-	mov.f32 	%f7, %f4;
-	mov.f32 	%f6, %f5;
-	@%p1 bra 	BB9_13;
-	bra.uni 	BB9_21;
-BB9_13:                                 // %for_loop.i332
-                                        //   Parent Loop BB9_2 Depth=1
-                                        // =>  This Inner Loop Header: Depth=2
-	selp.b32 	%r9, %r8, 0, %p0;
-	mul.f32 	%f8, %f7, %f7;
-	fma.rn.f32 	%f9, %f6, %f6, %f8;
-	setp.gtu.f32 	%p1, %f9, 0f40800000;
-	selp.b32 	%r10, %r8, 0, %p1;
-	or.b32  	%r7, %r10, %r7;
-	selp.b32 	%r10, %r7, 0, %p0;
-	shr.u32 	%r10, %r10, 31;
-	shr.u32 	%r9, %r9, 31;
-	setp.eq.s32 	%p1, %r10, %r9;
-	@%p1 bra 	BB9_14;
-	bra.uni 	BB9_11;
-BB9_14:                                 //   in Loop: Header=BB9_13 Depth=2
-	mov.u32 	%r8, %r6;
-	bra.uni 	BB9_12;
-BB9_11:                                 // %not_all_continued_or_breaked.i346
-                                        //   in Loop: Header=BB9_13 Depth=2
-	mul.f32 	%f9, %f6, %f6;
-	not.b32 	%r9, %r7;
-	and.b32  	%r8, %r8, %r9;
-	sub.f32 	%f8, %f8, %f9;
-	add.f32 	%f8, %f4, %f8;
-	add.f32 	%f7, %f7, %f7;
-	fma.rn.f32 	%f6, %f6, %f7, %f5;
-	mov.f32 	%f7, %f8;
-BB9_12:                                 // %for_step.i313
-                                        //   in Loop: Header=BB9_13 Depth=2
-	setp.ne.s32 	%p1, %r8, 0;
-	selp.u32 	%r9, 1, 0, %p1;
-	add.s32 	%r5, %r5, %r9;
-	setp.lt.s32 	%p1, %r5, %r2;
-	selp.b32 	%r8, %r8, 0, %p1;
-	selp.b32 	%r9, %r8, 0, %p0;
-	setp.gt.s32 	%p1, %r9, -1;
-	@%p1 bra 	BB9_15;
-	bra.uni 	BB9_13;
-BB9_21:                                 //   in Loop: Header=BB9_2 Depth=1
-	mov.u32 	%r5, %r6;
-BB9_15:                                 // %mandel___vyfvyfvyi.exit347
-                                        //   in Loop: Header=BB9_2 Depth=1
-	setp.lt.s32 	%p0, %r1, 1;
-	@%p0 bra 	BB9_17;
-// BB#16:                               // %pl_dolane.i
-                                        //   in Loop: Header=BB9_2 Depth=1
-	mul.lo.s32 	%r6, %r3, %r1;
-	shl.b32 	%r6, %r6, 2;
-	cvt.s64.s32 	%rl1, %r6;
-	add.s64 	%rl1, %rl1, %rl0;
-	st.u32 	[%rl1], %r5;
-BB9_17:                                 // %foreach_reset
-                                        //   in Loop: Header=BB9_2 Depth=1
-	add.s32 	%r3, %r3, 1;
-	setp.eq.s32 	%p0, %r3, %r0;
-	@%p0 bra 	BB9_18;
-	bra.uni 	BB9_2;
-BB9_18:                                 // %for_exit
-	ret;
-}
-
--- a/examples_cuda/mandelbrot/out.s
+++ b/examples_cuda/mandelbrot/out.s
--- a/examples_cuda/mandelbrot/out1.o
+++ b/examples_cuda/mandelbrot/out1.o
--- a/examples_cuda/mandelbrot_tasks/.gitignore
+++ b/examples_cuda/mandelbrot_tasks/.gitignore
@@ -1,2 +0,0 @@
-mandelbrot
-*.ppm
--- a/examples_cuda/mandelbrot_tasks/Makefile
+++ b/examples_cuda/mandelbrot_tasks/Makefile
@@ -1,8 +0,0 @@
-
-EXAMPLE=mandelbrot_tasks
-CPP_SRC=mandelbrot_tasks.cpp mandelbrot_tasks_serial.cpp
-ISPC_SRC=mandelbrot_tasks.ispc
-ISPC_IA_TARGETS=avx
-ISPC_ARM_TARGETS=neon
-
-include ../common.mk
--- a/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.cpp
+++ b/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.cpp
@@ -1,164 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <algorithm>
-#include <string.h>
-#include "../timing.h"
-#include "mandelbrot_ispc.h"
-using namespace ispc;
-
-#include <sys/time.h>
-
-
-double rtc(void)
-{
-  struct timeval Tvalue;
-  double etime;
-  struct timezone dummy;
-
-  gettimeofday(&Tvalue,&dummy);
-  etime =  (double) Tvalue.tv_sec +
-    1.e-6*((double) Tvalue.tv_usec);
-  return etime;
-}
-
-extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
-    int width, int height, int maxIterations,
-    int output[]);
-
-/* Write a PPM image file with the image of the Mandelbrot set */
-static void
-writePPM(int *buf, int width, int height, const char *fn) {
-  FILE *fp = fopen(fn, "wb");
-  fprintf(fp, "P6\n");
-  fprintf(fp, "%d %d\n", width, height);
-  fprintf(fp, "255\n");
-  for (int i = 0; i < width*height; ++i) {
-    // Map the iteration count to colors by just alternating between
-    // two greys.
-    char c = (buf[i] & 0x1) ? 240 : 20;
-    for (int j = 0; j < 3; ++j)
-      fputc(c, fp);
-  }
-  fclose(fp);
-  printf("Wrote image file %s\n", fn);
-}
-
-
-static void usage() {
-  fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
-  exit(1);
-}
-
-int main(int argc, char *argv[]) {
-  unsigned int width = 1536;
-  unsigned int height = 1024;
-  float x0 = -2;
-  float x1 = 1;
-  float y0 = -1;
-  float y1 = 1;
-
-  if (argc == 1)
-    ;
-  else if (argc == 2) {
-    if (strncmp(argv[1], "--scale=", 8) == 0) {
-      float scale = atof(argv[1] + 8);
-      if (scale == 0.f)
-        usage();
-      width *= scale;
-      height *= scale;
-      // round up to multiples of 16
-      width = (width + 0xf) & ~0xf;
-      height = (height + 0xf) & ~0xf;
-    }
-    else 
-      usage();
-  }
-  else
-    usage();
-
-  int maxIterations = 512;
-  int *buf = new int[width*height];
-
-  //
-  // Compute the image using the ispc implementation; report the minimum
-  // time of three runs.
-  //
-  double minISPC = 1e30;
-  for (int i = 0; i < 3; ++i) {
-    // Clear out the buffer
-    for (unsigned int i = 0; i < width * height; ++i)
-      buf[i] = 0;
-    reset_and_start_timer();
-    double t0 = rtc();
-    mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, buf);
-    double dt = rtc() - t0; //get_elapsed_mcycles();
-    minISPC = std::min(minISPC, dt);
-  }
-
-  printf("[mandelbrot ispc+tasks]:\t[%.3f] seconds\n", minISPC);
-  writePPM(buf, width, height, "mandelbrot-ispc.ppm");
-
-
-  // 
-  // And run the serial implementation 3 times, again reporting the
-  // minimum time.
-  //
-  double minSerial = 1e30;
-#if 0
-  for (int i = 0; i < 3; ++i) {
-    // Clear out the buffer
-    for (unsigned int i = 0; i < width * height; ++i)
-      buf[i] = 0;
-    reset_and_start_timer();
-    mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
-    double dt = get_elapsed_mcycles();
-    minSerial = std::min(minSerial, dt);
-  }
-
-  printf("[mandelbrot serial]:\t\t[%.3f] million cycles\n", minSerial);
-  writePPM(buf, width, height, "mandelbrot-serial.ppm");
-
-  printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", minSerial/minISPC);
-#endif
-
-  return 0;
-}
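Review note on the `--scale` handling in `main()` above: the scaled width and height are rounded up to multiples of 16 with `(n + 0xf) & ~0xf`. A minimal sketch of that arithmetic follows. The helper names `roundUp16` and `scaledDimension` are hypothetical (the original does this inline), and the `scale > 0` precondition is an assumption, since the original only rejects a scale of exactly zero.

```cpp
#include <cassert>

// Hypothetical helper names; the original does this inline in main().

// Round n up to the next multiple of 16 (n itself if already a multiple).
// Adding 15 and clearing the low four bits works because 16 is a power of two.
unsigned int roundUp16(unsigned int n) {
    return (n + 0xfu) & ~0xfu;
}

// Scale a base dimension and round the result up to a multiple of 16.
// The float product is truncated toward zero, as in `width *= scale`.
// Assumption: scale is positive (the original only checks for zero).
unsigned int scaledDimension(unsigned int base, float scale) {
    assert(scale > 0.f);
    unsigned int scaled = static_cast<unsigned int>(base * scale);
    return roundUp16(scaled);
}
```

For the default 1536 x 1024 image, `scaledDimension(1536, 0.5f)` gives 768, which is already a multiple of 16 and is left unchanged.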
--- a/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.ispc
+++ b/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.ispc
@@ -1,86 +0,0 @@
-/*
-  Copyright (c) 2010-2012, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-static inline int
-mandel(float c_re, float c_im, int count) {
-    float z_re = c_re, z_im = c_im;
-    int i;
-    for (i = 0; i < count; ++i) {
-        if (z_re * z_re + z_im * z_im > 4.)
-            break;
-
-        float new_re = z_re*z_re - z_im*z_im;
-        float new_im = 2.f * z_re * z_im;
-        unmasked {
-            z_re = c_re + new_re;
-            z_im = c_im + new_im;
-        }
-    }
-
-    return i;
-}
-
-
-/* Task to compute the Mandelbrot iterations for a single scanline.
- */
-task void
-mandelbrot_scanline(uniform float x0, uniform float dx, 
-                    uniform float y0, uniform float dy,
-                    uniform int width, uniform int height, 
-                    uniform int span,
-                    uniform int maxIterations, uniform int output[]) {
-    uniform int ystart = taskIndex * span;
-    uniform int yend = min((taskIndex+1) * span, (unsigned int)height);
-
-    foreach (yi = ystart ... yend, xi = 0 ... width) {
-        float x = x0 + xi * dx;
-        float y = y0 + yi * dy;
-
-        int index = yi * width + xi;
-        output[index] = mandel(x, y, maxIterations);
-    }
-}
-                               
-
-export void
-mandelbrot_ispc(uniform float x0, uniform float y0, 
-                uniform float x1, uniform float y1,
-                uniform int width, uniform int height, 
-                uniform int maxIterations, uniform int output[]) {
-    uniform float dx = (x1 - x0) / width;
-    uniform float dy = (y1 - y0) / height;
-    uniform int span = 4;
-
-    launch[height/span] mandelbrot_scanline(x0, dx, y0, dy, width, height, span,
-                                            maxIterations, output);
-}
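Review note on the task decomposition above: `launch[height/span]` uses integer division, so when `height` is not a multiple of `span` the trailing rows are not covered by any task, and the `min(...)` clamp on `yend` never takes effect. The host program rounds `height` up to a multiple of 16 when `--scale` is given, so the shipped configuration is not affected. The C++ sketch below shows the row-range arithmetic with a ceiling division instead. `RowRange`, `taskCount`, and `taskRows` are hypothetical names introduced for illustration, not part of the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type and names; not part of the original example.
struct RowRange {
    int ystart;  // first row handled by the task (inclusive)
    int yend;    // one past the last row handled by the task
};

// Number of tasks needed to cover `height` rows in chunks of `span` rows.
// Ceiling division, so a final partial chunk still gets its own task.
int taskCount(int height, int span) {
    assert(height >= 0 && span > 0);
    return (height + span - 1) / span;
}

// Row range for one task, mirroring ystart/yend in mandelbrot_scanline.
// Both ends are clamped to `height` so the last task may be shorter.
RowRange taskRows(int taskIndex, int span, int height) {
    assert(taskIndex >= 0 && span > 0);
    RowRange r;
    r.ystart = std::min(taskIndex * span, height);
    r.yend = std::min((taskIndex + 1) * span, height);
    return r;
}
```

With `height = 10` and `span = 4`, `height/span` launches 2 tasks covering rows 0 to 7, while `taskCount(10, 4)` gives 3 and the third task covers rows 8 and 9.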
--- a/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.vcxproj
+++ b/examples_cuda/mandelbrot_tasks/mandelbrot_tasks.vcxproj
@@ -1,180 +0,0 @@
-<?xml version="1.0" encoding="utf-8"?>
-<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
-  <ItemGroup Label="ProjectConfigurations">
-    <ProjectConfiguration Include="Debug|Win32">
-      <Configuration>Debug</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Debug|x64">
-      <Configuration>Debug</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|Win32">
-      <Configuration>Release</Configuration>
-      <Platform>Win32</Platform>
-    </ProjectConfiguration>
-    <ProjectConfiguration Include="Release|x64">
-      <Configuration>Release</Configuration>
-      <Platform>x64</Platform>
-    </ProjectConfiguration>
-  </ItemGroup>
-  <PropertyGroup Label="Globals">
-    <ProjectGuid>{E80DA7D4-AB22-4648-A068-327307156BE6}</ProjectGuid>
-    <Keyword>Win32Proj</Keyword>
-    <RootNamespace>mandelbrot_tasks</RootNamespace>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>true</UseDebugLibraries>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
-    <ConfigurationType>Application</ConfigurationType>
-    <UseDebugLibraries>false</UseDebugLibraries>
-    <WholeProgramOptimization>true</WholeProgramOptimization>
-    <CharacterSet>Unicode</CharacterSet>
-  </PropertyGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
-  <ImportGroup Label="ExtensionSettings">
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
-    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
-  </ImportGroup>
-  <PropertyGroup Label="UserMacros" />
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>mandelbrot_tasks</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <LinkIncremental>true</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>mandelbrot_tasks</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>mandelbrot_tasks</TargetName>
-  </PropertyGroup>
-  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <LinkIncremental>false</LinkIncremental>
-    <ExecutablePath>$(ProjectDir)..\..;$(ExecutablePath)</ExecutablePath>
-    <TargetName>mandelbrot_tasks</TargetName>
-  </PropertyGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
-    <ClCompile>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <WarningLevel>Level3</WarningLevel>
-      <Optimization>Disabled</Optimization>
-      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
-    <ClCompile>
-      <WarningLevel>Level3</WarningLevel>
-      <PrecompiledHeader>
-      </PrecompiledHeader>
-      <Optimization>MaxSpeed</Optimization>
-      <FunctionLevelLinking>true</FunctionLevelLinking>
-      <IntrinsicFunctions>true</IntrinsicFunctions>
-      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>$(TargetDir)</AdditionalIncludeDirectories>
-      <FloatingPointModel>Fast</FloatingPointModel>
-    </ClCompile>
-    <Link>
-      <SubSystem>Console</SubSystem>
-      <GenerateDebugInformation>true</GenerateDebugInformation>
-      <EnableCOMDATFolding>true</EnableCOMDATFolding>
-      <OptimizeReferences>true</OptimizeReferences>
-    </Link>
-  </ItemDefinitionGroup>
-  <ItemGroup>
-    <ClCompile Include="mandelbrot_tasks.cpp" />
-    <ClCompile Include="mandelbrot_tasks_serial.cpp" />
-    <ClCompile Include="../tasksys.cpp" />
-  </ItemGroup>
-  <ItemGroup>
-    <CustomBuild Include="mandelbrot_tasks.ispc">
-      <FileType>Document</FileType>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --arch=x86 --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Command Condition="'$(Configuration)|$(Platform)'=='Release|x64'">ispc -O2 %(Filename).ispc -o $(TargetDir)%(Filename).obj -h $(TargetDir)%(Filename)_ispc.h --target=sse2,sse4-x2,avx-x2
-</Command>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-      <Outputs Condition="'$(Configuration)|$(Platform)'=='Release|x64'">$(TargetDir)%(Filename).obj;$(TargetDir)%(Filename)_sse2.obj;$(TargetDir)%(Filename)_sse4.obj;$(TargetDir)%(Filename)_avx.obj;$(TargetDir)%(Filename)_ispc.h</Outputs>
-    </CustomBuild>
-  </ItemGroup>
-  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
-  <ImportGroup Label="ExtensionTargets">
-  </ImportGroup>
-</Project>
--- a/examples_cuda/mandelbrot_tasks/mandelbrot_tasks_serial.cpp
+++ b/examples_cuda/mandelbrot_tasks/mandelbrot_tasks_serial.cpp
@@ -1,68 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-
-static int mandel(float c_re, float c_im, int count) {
-    float z_re = c_re, z_im = c_im;
-    int i;
-    for (i = 0; i < count; ++i) {
-        if (z_re * z_re + z_im * z_im > 4.f)
-            break;
-
-        float new_re = z_re*z_re - z_im*z_im;
-        float new_im = 2.f * z_re * z_im;
-        z_re = c_re + new_re;
-        z_im = c_im + new_im;
-    }
-
-    return i;
-}
-
-void mandelbrot_serial(float x0, float y0, float x1, float y1,
-                       int width, int height, int maxIterations,
-                       int output[])
-{
-    float dx = (x1 - x0) / width;
-    float dy = (y1 - y0) / height;
-
-    for (int j = 0; j < height; j++) {
-        for (int i = 0; i < width; ++i) {
-            float x = x0 + i * dx;
-            float y = y0 + j * dy;
-
-            int index = (j * width + i);
-            output[index] = mandel(x, y, maxIterations);
-        }
-    }
-}
-
--- a/examples_cuda/mandelbrot_tasks3d/.gitignore
+++ b/examples_cuda/mandelbrot_tasks3d/.gitignore
@@ -1,2 +0,0 @@
-mandelbrot
-*.ppm
--- a/examples_cuda/mandelbrot_tasks3d/Makefile
+++ b/examples_cuda/mandelbrot_tasks3d/Makefile
@@ -1,8 +0,0 @@
-
-EXAMPLE=mandelbrot_tasks3d
-CPP_SRC=mandelbrot_tasks3d.cpp mandelbrot_tasks_serial.cpp
-ISPC_SRC=mandelbrot_tasks3d.ispc
-ISPC_IA_TARGETS=avx
-ISPC_ARM_TARGETS=neon
-
-include ../common.mk
--- a/examples_cuda/mandelbrot_tasks3d/Makefile_gpu
+++ b/examples_cuda/mandelbrot_tasks3d/Makefile_gpu
@@ -1,59 +0,0 @@
-PROG=mandel_cu
-ISPC_SRC=mandelbrot_tasks3d.ispc
-CXX_SRC=mandel_cu.cpp  mandelbrot_tasks_serial.cpp
-
-CXX=g++
-CXXFLAGS=-O3 -I$(CUDATK)/include
-LD=g++
-LDFLAGS=-lcuda
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=nvptx64 --opt=fast-math
-
-LLVM32 = $(HOME)/usr/local/llvm/bin-3.2
-LLVM   = $(HOME)/usr/local/llvm/bin-3.3
-PTXGEN = $(HOME)/ptxgen
-PTXGEN += -opt=3
-PTXGEN += -ftz=1 -prec-div=0 -prec-sqrt=0 -fma=1
-
-LLVM32DIS=$(LLVM32)/bin/llvm-dis
-
-.SUFFIXES: .bc .o .ptx .cu _ispc_nvptx64.bc
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-ISPC_BC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.bc)
-PTXSRC=$(ISPC_SRC:%.ispc=%_ispc_nvptx64.ptx)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(ISPC_BC) $(PROG)
-
-CUDART:
-	cd _cuobj && make
-	g++ -o  mandel_cu_nvcc mandel_cu.cpp  -I$(CUDATK)/include  -lcuda mandelbrot_tasks_serial.cpp -L./_cuobj -lmandel_cudart -lcudart -L$(CUDATK)/lib64  -D_CUDART_ -lcudadevrt
-
-
-$(CXX_OBJ) : kernel.ptx
-$(PROG): $(CXX_OBJ) kernel.ptx
-	/bin/cp kernel.ptx __kernels.ptx
-	$(LD) -o $@ $(CXX_OBJ) $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-
-%_ispc_nvptx64.bc: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-llvm -o `basename $< .ispc`_ispc_nvptx64.bc -h `basename $< .ispc`_ispc.h $<
-
-%.ptx: %.bc
-	$(LLVM32DIS) $<
-	$(PTXGEN)  `basename $< .bc`.ll > $@
-
-kernel.ptx: $(PTXSRC)
-	cat $^ > kernel.ptx
-
-clean: 
-	/bin/rm -rf *.ptx *.bc *.ll $(PROG)
-
-	 
-
--- a/examples_cuda/mandelbrot_tasks3d/Makefile_knc
+++ b/examples_cuda/mandelbrot_tasks3d/Makefile_knc
@@ -1,37 +0,0 @@
-PROG=mandelbrot_mic
-ISPC_SRC=mandelbrot_tasks3d.ispc
-CXX_SRC=mandelbrot_tasks3d.cpp  ../tasksys.cpp
-
-CXX=icc
-CXXFLAGS=-O3 -I$(CUDATK)/include -mmic -openmp
-LD=icc
-LDFLAGS=-mmic -openmp
-
-ISPC=ispc
-ISPCFLAGS=-O3 --math-lib=default --target=generic-16 --c++-include-file=../intrinsics/knc-i1x16.h --opt=fast-math
-
-.SUFFIXES: .o .cpp
-
-
-ISPC_OBJ=$(ISPC_SRC:%.ispc=%_ispc.o)
-CXX_OBJ=$(CXX_SRC:%.cpp=%.o)
-
-all: $(PROG)
-
-
-
-$(PROG): $(ISPC_OBJ) $(CXX_OBJ) 
-	$(LD) -o $@ $^ $(LDFLAGS)
-
-%.o: %.cpp
-	$(CXX) $(CXXFLAGS)  -o $@ -c $<
-
-%_ispc.o: %.ispc
-	$(ISPC) $(ISPCFLAGS) --emit-c++ -o `basename $< .ispc`_ispc_zmm.cpp -h `basename $< .ispc`_ispc.h $< 
-	$(CXX) $(CXXFLAGS) -o $@ `basename $< .ispc`_ispc_zmm.cpp  -c
-
-clean: 
-	/bin/rm -rf *_ispc_zmm.cpp *.o  $(PROG)
-
-	 
-
--- a/examples_cuda/mandelbrot_tasks3d/_cuobj/Makefile
+++ b/examples_cuda/mandelbrot_tasks3d/_cuobj/Makefile
@@ -1,15 +0,0 @@
-FILE=mandel
-
-LIB=lib$(FILE)_cudart.a
-all: $(LIB)
-
-
-$(LIB) : $(FILE).cu
-	nvcc -dc $(FILE).cu -arch=sm_35 -Xptxas=-v -dryrun 2>&1 | sed 's/\#\$$//g'|awk '{ if ($$1 == "cicc") print "cp ../__kernels.ptx " $$NF; else print $$0 }' > run.sh
-	sh run.sh
-	nvcc -dlink -o $(FILE)_dlink.o $(FILE).o -lcudadevrt -arch=sm_35
-	nvcc $(FILE).o $(FILE)_dlink.o --lib -o lib$(FILE)_cudart.a
-
-clean: 
-	/bin/rm -f *.o *.a run.sh
-
--- a/examples_cuda/mandelbrot_tasks3d/_cuobj/mandel.cu
+++ b/examples_cuda/mandelbrot_tasks3d/_cuobj/mandel.cu
@@ -1,22 +0,0 @@
-extern "C" static inline int __device__ mandel___vyfvyfvyi_(float c_re, float c_im, int count) {}
-extern "C" void __global__ mandelbrot_scanline___unfunfunfunfuniuniuniuniuniun_3C_uni_3E_( float x0,  float dx, 
-    float y0,  float dy,
-    int width,  int height, 
-    int xspan,  int yspan,
-    int maxIterations,  int output[]) {}
-extern "C" void __global__ mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_( float x0,  float y0, 
-    float x1,  float y1,
-    int width,  int height, 
-    int maxIterations,  int output[]) { }
-
-extern "C"
-void mandelbrot_ispc(float x0, float y0, 
-    float x1, float y1,
-    int width, int height, 
-    int maxIterations, int output[])
-{
-  mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_<<<1,32>>>
-    (x0,y0,x1,y1,width,height,maxIterations,output);
-  cudaDeviceSynchronize();
-}
-
--- a/examples_cuda/mandelbrot_tasks3d/compile.sh
+++ b/examples_cuda/mandelbrot_tasks3d/compile.sh
@@ -1,6 +0,0 @@
-#!/bin/sh
-ptxas -arch=sm_35 -c -o kernel.gpu.o kernel_cu.ptx       
-fatbinary -arch=sm_35 -create kernel.fatbin -elf kernel.gpu.o 
-nvcc -arch=sm_35 -Xptxas="-v" -dc  kernel_driver.cu   -lcudadevrt
-nvcc -arch=sm_35 -Xptxas="-v" -dlink -o mandel_nvcc.o kernel.fatbin kernel_driver.o  -rdc=true -lcudadevrt
-
--- a/examples_cuda/mandelbrot_tasks3d/cuLaunch.cpp
+++ b/examples_cuda/mandelbrot_tasks3d/cuLaunch.cpp
@@ -1,321 +0,0 @@
-#include <stdio.h>
-#include <stdlib.h>
-#include <iostream>
-#include <algorithm>
-#include <string.h>
-#include <cuda.h>
-#include <vector>
-#include <cassert>
-#include "drvapi_error_string.h"
-
-#define checkCudaErrors(err)  __checkCudaErrors (err, __FILE__, __LINE__)
-// These are the inline versions for all of the SDK helper functions
-void __checkCudaErrors(CUresult err, const char *file, const int line) {
-  if(CUDA_SUCCESS != err) {
-    std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
-           << getCudaDrvErrorString(err) << "\" from file <" << file
-           << ">, line " << line << "\n";
-    exit(-1);
-  }
-}
-
-
-/**********************/
-/* Basic CUDriver API */
-CUcontext context;
-
-void createContext(const int deviceId = 0)
-{
-  CUdevice device;
-  int devCount;
-  checkCudaErrors(cuInit(0));
-  checkCudaErrors(cuDeviceGetCount(&devCount));
-  assert(devCount > 0);
-  checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
-
-  char name[128];
-  checkCudaErrors(cuDeviceGetName(name, 128, device));
-  std::cout << "Using CUDA Device [0]: " << name << "\n";
-
-  int devMajor, devMinor;
-  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
-  std::cout << "Device Compute Capability: " 
-    << devMajor << "." << devMinor << "\n";
-  if (devMajor < 2) {
-    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
-    exit(1); 
-  }
-
-  // Create driver context
-  checkCudaErrors(cuCtxCreate(&context, 0, device));
-}
-void destroyContext()
-{
-  checkCudaErrors(cuCtxDestroy(context));
-}
-
-CUmodule loadModule(const char * module)
-{
-  CUmodule cudaModule;
-  checkCudaErrors(cuModuleLoadData(&cudaModule, module));
-  return cudaModule;
-}
-void unloadModule(CUmodule &cudaModule)
-{
-  checkCudaErrors(cuModuleUnload(cudaModule));
-}
-
-CUfunction getFunction(CUmodule &cudaModule, const char * function)
-{
-  CUfunction cudaFunction;
-  checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
-  return cudaFunction;
-}
-  
-CUdeviceptr deviceMalloc(const size_t size)
-{
-  CUdeviceptr d_buf;
-  checkCudaErrors(cuMemAlloc(&d_buf, size));
-  return d_buf;
-}
-void deviceFree(CUdeviceptr d_buf)
-{
-  checkCudaErrors(cuMemFree(d_buf));
-}
-void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
-}
-void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
-}
-#define deviceLaunch(func,nbx,nby,nbz,params) \
-  checkCudaErrors( \
-      cuLaunchKernel( \
-        (func), \
-        (nbx), (nby), (nbz), \
-        32, 1, 1, \
-        0, NULL, (params), NULL \
-        ));
-
-typedef CUdeviceptr devicePtr;
-
-
-/**************/
-
-extern "C" 
-{
-#if 0
-  struct ModuleManager
-  {
-    private:
-      typedef std::pair<std::string, CUmodule> ModulePair;
-      typedef std::map <std::string, CUmodule> ModuleMap;
-      ModuleMap module_list;
-
-      ModuleMap::iterator findModule(const char * module_name)
-      {
-        return module_list.find(std::string(module_name));
-      }
-
-    public:
-      // Load each module once and cache it by name; needs <map> and <string> if enabled.
-      CUmodule loadModule(const char * module_name, const char * module_data)
-      {
-        const ModuleMap::iterator it = findModule(module_name);
-        if (it == module_list.end())
-        {
-          CUmodule cudaModule = ::loadModule(module_data);
-          module_list.insert(std::make_pair(std::string(module_name), cudaModule));
-          return cudaModule;
-        }
-        return it->second;
-      }
-      void unloadModule(const char * module_name)
-      {
-        ModuleMap::iterator it = findModule(module_name);
-        if (it != module_list.end())
-          module_list.erase(it);
-      }
-  };
-#endif
-
-  void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
-  {
-    return NULL;
-  }
-  void CUDALaunch(
-      void **handlePtr, 
-      const char * module_name,
-      const char * module, 
-      const char * func_name,
-      void **func_args, 
-      int countx, int county, int countz)
-  {
-    CUmodule   cudaModule   = loadModule(module);
-    CUfunction cudaFunction = getFunction(cudaModule, func_name);
-    deviceLaunch(cudaFunction, countx, county, countz, func_args);
-    unloadModule(cudaModule);
-  }
-  void CUDASync(void *handle)
-  {
-    checkCudaErrors(cuStreamSynchronize(0));
-  }
-  void CUDAFree(void *handle)
-  {
-  }
-}
-
-/********************/
-
-
-/* Write a PPM image file with the image of the Mandelbrot set */
-static void
-writePPM(int *buf, int width, int height, const char *fn) 
-{
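-  FILE *fp = fopen(fn, "wb");
-  if (!fp) { perror(fn); return; }  // cannot write the image; skip it
-  // PPM header: magic number, image dimensions, maximum channel value.
-  fprintf(fp, "P6\n%d %d\n255\n", width, height);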
-  for (int i = 0; i < width*height; ++i) {
-    // Map the iteration count to colors by just alternating between
-    // two greys.
-    char c = (buf[i] & 0x1) ? 240 : 20;
-    for (int j = 0; j < 3; ++j)
-      fputc(c, fp);
-  }
-  fclose(fp);
-  printf("Wrote image file %s\n", fn);
-}
-
-std::vector<char> readBinary(const char * filename)
-{
-  std::vector<char> buffer;
-  FILE *fp = fopen(filename, "rb");
-  if (!fp )
-  {
-    fprintf(stderr, "file %s not found\n", filename);
-    exit(1);
-  }
-#if 0
-  int c;  /* int, not char, so EOF is distinguishable from a data byte */
-  while ((c = fgetc(fp)) != EOF)
-    buffer.push_back(c);
-#else
-  fseek(fp, 0, SEEK_END);
-  const unsigned long long size = ftell(fp);         /*calc the size needed*/
-  fseek(fp, 0, SEEK_SET);
-  buffer.resize(size + 1, '\0');  /* trailing NUL: cuModuleLoadData expects NUL-terminated PTX text */
-
-  if (size == 0){ /* reject an empty file */
-    fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);           
-    exit(1);
-  }
-  else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
-    fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
-    exit(1);
-  }
-#endif
-  fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
-  return buffer;
-}
-
-
-static void usage() 
-{
-  fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
-  exit(1);
-}
-
-extern "C"
-void mandelbrot_ispc(
-     float x0,  float y0, 
-     float x1,  float y1,
-     int width,  int height, 
-     int maxIterations,  int output[]) 
-{
-  float dx = (x1 - x0) / width;
-  float dy = (y1 - y0) / height;
-  int xspan = 16;  /* make sure it is big enough to avoid false-sharing */
-  int yspan = 4; 
-
-  const int nbx = width/xspan;
-  const int nby = height/yspan;
-  const int nbz = 1;
-
-  fprintf(stderr ," nbx= %d nby= %d  nbtot= %d \n", nbx, nby, nbx*nby);
-   
-#if 0
-  launch [nbx,nby]
-      mandelbrot_scanline(x0, dx, y0, dy, width, height, xspan, yspan,
-                          maxIterations, output);
-#endif
-
-  //    const std::vector<char> cubin = readBinary("cuLaunch.cubin");
-  const std::vector<char> cubin = readBinary("cuLaunch.ptx");
-  void *params[] = {&x0, &dx, &y0, &dy, &width, &height, &xspan, &yspan, &maxIterations, &output};
-  CUDALaunch(
-      NULL, //void **handlePtr, 
-      "module_01", // const char * module_name,
-      &cubin[0], //const char * module, 
-      "mandelbrot_scanline", //const char * func_name,
-      params, //void **func_args, 
-      nbx,nby,nbz); //int countx, int county, int countz)
-  CUDASync(NULL);
-}
-
-int main(int argc, char *argv[]) 
-{
-  unsigned int width = 1536;
-  unsigned int height = 1024;
-  float x0 = -2;
-  float x1 = 1;
-  float y0 = -1;
-  float y1 = 1;
-
-  if (argc == 1)
-    ;
-  else if (argc == 2) {
-    if (strncmp(argv[1], "--scale=", 8) == 0) {
-      float scale = atof(argv[1] + 8);
-      if (scale == 0.f)
-        usage();
-      width *= scale;
-      height *= scale;
-      // round up to multiples of 16
-      width = (width + 0xf) & ~0xf;
-      height = (height + 0xf) & ~0xf;
-    }
-    else 
-      usage();
-  }
-  else
-    usage();
-
-  /*******************/
-  createContext();
-  /*******************/
-
-  int maxIterations = 512;
-  int *h_buf = new int[width*height];
-  for (unsigned int i = 0; i < width*height; i++)
-    h_buf[i] = 0;
-
-  const size_t bufsize = sizeof(int)*width*height;
-  devicePtr d_buf = deviceMalloc(bufsize);
-  memcpyH2D(d_buf, h_buf, bufsize);
-
-  mandelbrot_ispc(x0,y0,x1,y1,width, height, maxIterations, (int*)d_buf);
-
-  memcpyD2H(h_buf, d_buf, bufsize);
-  deviceFree(d_buf);
-
-  writePPM(h_buf, width, height, "mandelbrot-cuda.ppm");
-
-  /*******************/
-  destroyContext();
-  /*******************/
-
-  return 0;
-}
--- a/examples_cuda/mandelbrot_tasks3d/kernel_driver.cu
+++ b/examples_cuda/mandelbrot_tasks3d/kernel_driver.cu
@@ -1,73 +0,0 @@
-typedef unsigned int uint32_t;
-typedef unsigned long long uint64_t;
-
-extern "C" __device__ void PTXmandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
-    float,float,float,float,uint32_t,uint32_t,uint32_t,uint32_t,uint32_t,uint64_t);
-
-extern "C"
-__global__ void mandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
-    float param0, 
-    float param1, 
-    float param2, 
-    float param3, 
-    uint32_t param4, 
-    uint32_t param5, 
-    uint32_t param6, 
-    uint32_t param7, 
-    uint32_t param8, 
-    uint64_t param9) 
-{
-  PTXmandelbrot_scanline___UM_unfunfunfunfuniuniuniuniuniun_3C_uni_3E_(
-      param0, param1, param2, param3, param4, param5, param6, param7, param8, param9);
-}
-
-extern "C" __device__ void PTXmandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
-	float param0,
-	float param1,
-	float param2,
-	float param3,
-	uint32_t param4,
-	uint32_t param5,
-	uint32_t param6,
-	uint64_t param7,
-	char param8);
-
-extern "C"
-__global__ void mandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
-	float param0,
-	float param1,
-	float param2,
-	float param3,
-	uint32_t param4,
-	uint32_t param5,
-	uint32_t param6,
-	uint64_t param7,
-	char param8)
-{
- PTXmandelbrot_ispc___unfunfunfunfuniuniuniun_3C_uni_3E_(
-     param0,param1,param2,param3,param4,param5,param6,param7,param8);
-}
-
-extern "C" __device__ void PTXmandelbrot_ispc(
-	float param0,
-	float param1,
-	float param2,
-	float param3,
-	uint32_t param4,
-	uint32_t param5,
-	uint32_t param6,
-	uint64_t param7);
-extern "C"
-__global__ void mandelbrot_ispc(
-	float param0,
-	float param1,
-	float param2,
-	float param3,
-	uint32_t param4,
-	uint32_t param5,
-	uint32_t param6,
-	uint64_t param7)
-{
- PTXmandelbrot_ispc(
-     param0,param1,param2,param3,param4,param5,param6,param7);
-}
--- a/examples_cuda/mandelbrot_tasks3d/mandel.cpp
+++ b/examples_cuda/mandelbrot_tasks3d/mandel.cpp
@@ -1,352 +0,0 @@
-#include <stdio.h>
-#include <stdlib.h>
-#include <iostream>
-#include <algorithm>
-#include <string.h>
-#include <cuda.h>
-#include <vector>
-#include <cassert>
-#include "drvapi_error_string.h"
-
-#define checkCudaErrors(err)  __checkCudaErrors (err, __FILE__, __LINE__)
-// These are the inline versions for all of the SDK helper functions
-void __checkCudaErrors(CUresult err, const char *file, const int line) {
-  if(CUDA_SUCCESS != err) {
-    std::cerr << "checkCudaErrors() Driver API error = " << err << " \""
-           << getCudaDrvErrorString(err) << "\" from file <" << file
-           << ">, line " << line << "\n";
-    exit(-1);
-  }
-}
-
-
-/**********************/
-/* Basic CUDriver API */
-CUcontext context;
-
-void createContext(const int deviceId = 0)
-{
-  CUdevice device;
-  int devCount;
-  checkCudaErrors(cuInit(0));
-  checkCudaErrors(cuDeviceGetCount(&devCount));
-  assert(devCount > 0);
-  checkCudaErrors(cuDeviceGet(&device, deviceId < devCount ? deviceId : 0));
-
-  char name[128];
-  checkCudaErrors(cuDeviceGetName(name, 128, device));
-  std::cout << "Using CUDA Device [0]: " << name << "\n";
-
-  int devMajor, devMinor;
-  checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
-  std::cout << "Device Compute Capability: " 
-    << devMajor << "." << devMinor << "\n";
-  if (devMajor < 2) {
-    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
-    exit(1); 
-  }
-
-  // Create driver context
-  checkCudaErrors(cuCtxCreate(&context, 0, device));
-}
-void destroyContext()
-{
-  checkCudaErrors(cuCtxDestroy(context));
-}
-
-CUmodule loadModule(const char * module)
-{
-  CUmodule cudaModule;
-  checkCudaErrors(cuModuleLoadData(&cudaModule, module));
-  return cudaModule;
-}
-void unloadModule(CUmodule &cudaModule)
-{
-  checkCudaErrors(cuModuleUnload(cudaModule));
-}
-
-CUfunction getFunction(CUmodule &cudaModule, const char * function)
-{
-  CUfunction cudaFunction;
-  checkCudaErrors(cuModuleGetFunction(&cudaFunction, cudaModule, function));
-  return cudaFunction;
-}
-  
-CUdeviceptr deviceMalloc(const size_t size)
-{
-  CUdeviceptr d_buf;
-  checkCudaErrors(cuMemAlloc(&d_buf, size));
-  return d_buf;
-}
-void deviceFree(CUdeviceptr d_buf)
-{
-  checkCudaErrors(cuMemFree(d_buf));
-}
-void memcpyD2H(void * h_buf, CUdeviceptr d_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyDtoH(h_buf, d_buf, size));
-}
-void memcpyH2D(CUdeviceptr d_buf, void * h_buf, const size_t size)
-{
-  checkCudaErrors(cuMemcpyHtoD(d_buf, h_buf, size));
-}
-#define deviceLaunch(func,nbx,nby,nbz,params) \
-  checkCudaErrors( \
-      cuLaunchKernel( \
-        (func), \
-        (nbx), (nby), (nbz), \
-        32, 1, 1, \
-        0, NULL, (params), NULL \
-        ));
-
-typedef CUdeviceptr devicePtr;
-
-
-/**************/
-
-extern "C" 
-{
-#if 0
-  struct ModuleManager
-  {
-    private:
-      typedef std::pair<std::string, CUmodule> ModulePair;
-      typedef std::map <std::string, CUmodule> ModuleMap;
-      ModuleMap module_list;
-
-      ModuleMap::iterator findModule(const char * module_name)
-      {
-        return module_list.find(std::string(module_name));
-      }
-
-    public:
-      // Load each module once and cache it by name; needs <map> and <string> if enabled.
-      CUmodule loadModule(const char * module_name, const char * module_data)
-      {
-        const ModuleMap::iterator it = findModule(module_name);
-        if (it == module_list.end())
-        {
-          CUmodule cudaModule = ::loadModule(module_data);
-          module_list.insert(std::make_pair(std::string(module_name), cudaModule));
-          return cudaModule;
-        }
-        return it->second;
-      }
-      void unloadModule(const char * module_name)
-      {
-        ModuleMap::iterator it = findModule(module_name);
-        if (it != module_list.end())
-          module_list.erase(it);
-      }
-  };
-#endif
-
-  void *CUDAAlloc(void **handlePtr, int64_t size, int32_t alignment)
-  {
-#if 0
-    fprintf(stderr, " ptr= %p\n", *handlePtr);
-    fprintf(stderr, " size= %d\n", (int)size);
-    fprintf(stderr, " alignment= %d\n", (int)alignment);
-    fprintf(stderr, " ------- \n\n");
-#endif
-    return NULL;
-  }
-  void CUDALaunch(
-      void **handlePtr, 
-      const char * module_name,
-      const char * module, 
-      const char * func_name,
-      void **func_args, 
-      int countx, int county, int countz)
-  {
-    assert(module_name != NULL);
-    assert(module != NULL);
-    assert(func_name != NULL);
-    assert(func_args != NULL);
-#if 1
-    CUmodule   cudaModule   = loadModule(module);
-    CUfunction cudaFunction = getFunction(cudaModule, func_name);
-    deviceLaunch(cudaFunction, countx, county, countz, func_args);
-    unloadModule(cudaModule);
-#else
-    fprintf(stderr, " handle= %p\n", *handlePtr);
-    fprintf(stderr, " count= %d %d %d\n", countx, county, countz);
-
-    fprintf(stderr, " module_name= %s \n", module_name);
-    fprintf(stderr, " func_name= %s \n", func_name);
-//    fprintf(stderr, " ptx= %s \n", module);
-    fprintf(stderr, " x0= %g  \n", *((float*)(func_args[0])));
-    fprintf(stderr, " dx= %g  \n", *((float*)(func_args[1])));
-    fprintf(stderr, " y0= %g  \n", *((float*)(func_args[2])));
-    fprintf(stderr, " dy= %g  \n", *((float*)(func_args[3])));
-    fprintf(stderr, " w= %d  \n", *((int*)(func_args[4])));
-    fprintf(stderr, " h= %d  \n", *((int*)(func_args[5])));
-    fprintf(stderr, " xs= %d  \n", *((int*)(func_args[6])));
-    fprintf(stderr, " ys= %d  \n", *((int*)(func_args[7])));
-    fprintf(stderr, " maxit= %d  \n", *((int*)(func_args[8])));
-    fprintf(stderr, " ptr= %p  \n", *((int**)(func_args[9])));
-    fprintf(stderr, " ------- \n\n");
-#endif
-  }
-  void CUDASync(void *handle)
-  {
-    checkCudaErrors(cuStreamSynchronize(0));
-  }
-  void ISPCSync(void *handle)
-  {
-  }
-  void CUDAFree(void *handle)
-  {
-  }
-}
-
-/********************/
-
-
-/* Write a PPM image file with the image of the Mandelbrot set */
-static void
-writePPM(int *buf, int width, int height, const char *fn) 
-{
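-  FILE *fp = fopen(fn, "wb");
-  if (!fp) { perror(fn); return; }  // cannot write the image; skip it
-  // PPM header: magic number, image dimensions, maximum channel value.
-  fprintf(fp, "P6\n%d %d\n255\n", width, height);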
-  for (int i = 0; i < width*height; ++i) {
-    // Map the iteration count to colors by just alternating between
-    // two greys.
-    char c = (buf[i] & 0x1) ? 240 : 20;
-    for (int j = 0; j < 3; ++j)
-      fputc(c, fp);
-  }
-  fclose(fp);
-  printf("Wrote image file %s\n", fn);
-}
-
-std::vector<char> readBinary(const char * filename)
-{
-  std::vector<char> buffer;
-  FILE *fp = fopen(filename, "rb");
-  if (!fp )
-  {
-    fprintf(stderr, "file %s not found\n", filename);
-    exit(1);
-  }
-#if 0
-  int c;  /* int, not char, so EOF is distinguishable from a data byte */
-  while ((c = fgetc(fp)) != EOF)
-    buffer.push_back(c);
-#else
-  fseek(fp, 0, SEEK_END);
-  const unsigned long long size = ftell(fp);         /*calc the size needed*/
-  fseek(fp, 0, SEEK_SET);
-  buffer.resize(size + 1, '\0');  /* trailing NUL: cuModuleLoadData expects NUL-terminated PTX text */
-
-  if (size == 0){ /* reject an empty file */
-    fprintf(stderr, "Error: There was an Error reading the file %s \n",filename);           
-    exit(1);
-  }
-  else if (fread(&buffer[0], sizeof(char), size, fp) != size){ /* if count of read bytes != calculated size of .bin file -> ERROR*/
-    fprintf(stderr, "Error: There was an Error reading the file %s \n", filename);
-    exit(1);
-  }
-#endif
-  fprintf(stderr, " read buffer of size= %d bytes \n", (int)buffer.size());
-  return buffer;
-}
-
-
-static void usage() 
-{
-  fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
-  exit(1);
-}
-
-extern "C"
-void mandelbrot_ispc(
-     float x0,  float y0, 
-     float x1,  float y1,
-     int width,  int height, 
-     int maxIterations,  int output[]) 
-#if 1
-;
-#else
-{
-  float dx = (x1 - x0) / width;
-  float dy = (y1 - y0) / height;
-  int xspan = 32;  /* make sure it is big enough to avoid false-sharing */
-  int yspan = 4; 
-
-  const int nbx = width/xspan;
-  const int nby = height/yspan;
-  const int nbz = 1;
-
-  fprintf(stderr ," nbx= %d nby= %d  nbtot= %d \n", nbx, nby, nbx*nby);
-   
-  //    const std::vector<char> cubin = readBinary("cuLaunch.cubin");
-  const std::vector<char> cubin = readBinary("cuLaunch.ptx");
-  void *params[] = {&x0, &dx, &y0, &dy, &width, &height, &xspan, &yspan, &maxIterations, &output};
-  CUDALaunch(
-      NULL, //void **handlePtr, 
-      "module_01", // const char * module_name,
-      &cubin[0], //const char * module, 
-      "mandelbrot_scanline", //const char * func_name,
-      params, //void **func_args, 
-      nbx,nby,nbz); //int countx, int county, int countz)
-  CUDASync(NULL);
-}
-#endif
-
-int main(int argc, char *argv[]) 
-{
-  unsigned int width = 1536;
-  unsigned int height = 1024;
-  float x0 = -2;
-  float x1 = 1;
-  float y0 = -1;
-  float y1 = 1;
-
-  if (argc == 1)
-    ;
-  else if (argc == 2) {
-    if (strncmp(argv[1], "--scale=", 8) == 0) {
-      float scale = atof(argv[1] + 8);
-      if (scale == 0.f)
-        usage();
-      width *= scale;
-      height *= scale;
-      // round up to multiples of 16
-      width = (width + 0xf) & ~0xf;
-      height = (height + 0xf) & ~0xf;
-    }
-    else 
-      usage();
-  }
-  else
-    usage();
-
-  /*******************/
-  createContext();
-  /*******************/
-
-  int maxIterations = 512;
-  int *h_buf = new int[width*height];
-  for (unsigned int i = 0; i < width*height; i++)
-    h_buf[i] = 0;
-
-  const size_t bufsize = sizeof(int)*width*height;
-  devicePtr d_buf = deviceMalloc(bufsize);
-  memcpyH2D(d_buf, h_buf, bufsize);
-
-  mandelbrot_ispc(x0,y0,x1,y1,width, height, maxIterations, (int*)d_buf);
-
-  memcpyD2H(h_buf, d_buf, bufsize);
-  deviceFree(d_buf);
-
-  writePPM(h_buf, width, height, "mandelbrot-cuda.ppm");
-
-  /*******************/
-  destroyContext();
-  /*******************/
-
-  return 0;
-}
--- a/examples_cuda/mandelbrot_tasks3d/mandel_cu.cpp
+++ b/examples_cuda/mandelbrot_tasks3d/mandel_cu.cpp
@@ -1,177 +0,0 @@
-/*
-  Copyright (c) 2010-2011, Intel Corporation
-  All rights reserved.
-
-  Redistribution and use in source and binary forms, with or without
-  modification, are permitted provided that the following conditions are
-  met:
-
-    * Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-
-    * Neither the name of Intel Corporation nor the names of its
-      contributors may be used to endorse or promote products derived from
-      this software without specific prior written permission.
-
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-   IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-   PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
-   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
-*/
-
-#ifdef _MSC_VER
-#define _CRT_SECURE_NO_WARNINGS
-#define NOMINMAX
-#pragma warning (disable: 4244)
-#pragma warning (disable: 4305)
-#endif
-
-#include <stdio.h>
-#include <algorithm>
-#include <string.h>
-#include "../timing.h"
-
-#include "../cuda_ispc.h"
-#ifdef _CUDART_
-extern "C"
-void mandelbrot_ispc(float x0, float y0, 
-    float x1, float y1,
-    int width, int height, 
-    int maxIterations, int output[]);
-#endif
-
-
-extern void mandelbrot_serial(float x0, float y0, float x1, float y1,
-    int width, int height, int maxIterations,
-    int output[]);
-
-/* Write a PPM image file with the image of the Mandelbrot set */
-static void
-writePPM(int *buf, int width, int height, const char *fn) {
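-  FILE *fp = fopen(fn, "wb");
-  if (!fp) { perror(fn); return; }  // cannot write the image; skip it
-  // PPM header: magic number, image dimensions, maximum channel value.
-  fprintf(fp, "P6\n%d %d\n255\n", width, height);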
-  for (int i = 0; i < width*height; ++i) {
-    // Map the iteration count to colors by just alternating between
-    // two greys.
-    char c = (buf[i] & 0x1) ? 240 : 20;
-    for (int j = 0; j < 3; ++j)
-      fputc(c, fp);
-  }
-  fclose(fp);
-  printf("Wrote image file %s\n", fn);
-}
-
-
-static void usage() {
-  fprintf(stderr, "usage: mandelbrot [--scale=<factor>]\n");
-  exit(1);
-}
-
-int main(int argc, char *argv[]) {
-  unsigned int width = 1536;
-  unsigned int height = 1024;
-  float x0 = -2;
-  float x1 = 1;
-  float y0 = -1;
-  float y1 = 1;
-
-  if (argc == 1)
-    ;
-  else if (argc == 2) {
-    if (strncmp(argv[1], "--scale=", 8) == 0) {
-      float scale = atof(argv[1] + 8);
-      if (scale == 0.f)
-        usage();
-      width *= scale;
-      height *= scale;
-      // round up to multiples of 16
-      width = (width + 0xf) & ~0xf;
-      height = (height + 0xf) & ~0xf;
-    }
-    else 
-      usage();
-  }
-  else
-    usage();
-
-  /*******************/
-  createContext();
-  /*******************/
-
-  int maxIterations = 512;
-  int *buf = new int[width*height];
-
-  for (unsigned int i = 0; i < width*height; i++)
-    buf[i] = 0;
-  const size_t bufsize = sizeof(int)*width*height;
-  devicePtr d_buf = deviceMalloc(bufsize);
-  memcpyH2D(d_buf, buf, bufsize);
-
-  //
-  // Compute the image using the ispc implementation; report the minimum
-  // time of three runs.
-  //
-  double minISPC = 1e30;
-#if 1
-  for (int i = 0; i < 3; ++i) {
-    // Clear out the buffer
-    for (unsigned int i = 0; i < width * height; ++i)
-      buf[i] = 0;
-    reset_and_start_timer();
-#ifdef _CUDART_
-    const double t0 = rtc();
-    mandelbrot_ispc(x0, y0, x1, y1, width, height, maxIterations, (int*)d_buf);
-    double dt = 1e3*(rtc() - t0); //get_elapsed_mcycles();
-#else
-    const char * func_name = "mandelbrot_ispc__export";
-    void *func_args[] = {&x0, &y0, &x1, &y1, &width, &height, &maxIterations, &d_buf};
-    const double dt = 1e3*CUDALaunch(NULL, func_name, func_args);
-#endif
-    minISPC = std::min(minISPC, dt);
-  }
-#endif
-
-  memcpyD2H(buf, d_buf, bufsize);
-  deviceFree(d_buf);
-
-  printf("[mandelbrot ispc+tasks]:\t[%.3f] million cycles\n", minISPC);
-  writePPM(buf, width, height, "mandelbrot-cuda.ppm");
-
-
-  // 
-  // And run the serial implementation 3 times, again reporting the
-  // minimum time.
-  //
-  double minSerial = 1e30;
-  for (int i = 0; i < 3; ++i) {
-    // Clear out the buffer
-    for (unsigned int i = 0; i < width * height; ++i)
-      buf[i] = 0;
-    reset_and_start_timer();
-    const double t0 = rtc();
-    mandelbrot_serial(x0, y0, x1, y1, width, height, maxIterations, buf);
-    double dt = 1e3*(rtc() - t0); // same 1e3 scale as the ispc timing above, so the speedup ratio is consistent
-    minSerial = std::min(minSerial, dt);
-  }
-
-  printf("[mandelbrot serial]:\t\t[%.3f] million cycles\n", minSerial);
-  writePPM(buf, width, height, "mandelbrot-serial.ppm");
-
-  printf("\t\t\t\t(%.2fx speedup from ISPC + tasks)\n", minSerial/minISPC);
-
-  return 0;
-}
--- a/examples_cuda/mandelbrot_tasks3d/mandel_task_cu.cu
+++ b/examples_cuda/mandelbrot_tasks3d/mandel_task_cu.cu
@@ -1,53 +0,0 @@
-#include <stdio.h>
-#define blockIndex0 (blockIdx.x*4 + (threadIdx.x >> 5))
-#define blockIndex1 (blockIdx.y)
-#define vectorWidth (32)
-#define vectorIndex (threadIdx.x & 31)
-
-  int __device__ __forceinline__
-mandel(float c_re, float c_im, int count) 
-{
-  float z_re = c_re, z_im = c_im;
-  int i;
-  for (i = 0; i < count; ++i) {
-    if (z_re * z_re + z_im * z_im > 4.0f)
-      break;
-
-    float new_re = z_re*z_re - z_im*z_im;
-    float new_im = 2.0f * z_re * z_im;
-    {
-      z_re = c_re + new_re;
-      z_im = c_im + new_im;
-    }
-  }
-
-  return i;
-}
-
-extern "C"
-__global__ void mandelbrot_scanline(
-    float x0, float dx, 
-    float y0, float dy,
-    int width, int height, 
-    int xspan, int yspan,
-    int maxIterations, int output[]) 
-{
-  const int xstart = blockIndex0 * xspan;
-  const int xend   = min(xstart  + xspan, width);
-
-  const int ystart = blockIndex1 * yspan;
-  const int yend   = min(ystart  + yspan, height);
-
-  for (int yi = ystart; yi < yend; yi++)
-    for (int xi = xstart; xi < xend; xi += vectorWidth)
-    {
-      const float x = x0 + (xi + vectorIndex) * dx;
-      const float y = y0 +  yi              * dy;
-
-      const int res = mandel(x,y,maxIterations);
-      const int index = yi * width + (xi + vectorIndex);
-      if (xi + vectorIndex < xend)
-        output[index] = res;
-    }
-}
-