1. builtins/target-nvptx64.ll to write, now it is just a copy of target-generic-1.ll
2. add __global__ & __device__ scope
2. make code work for a single cuda thread
3. use tasks to work as a block grid and programIndex as laneIdx, programCount as warpSize
4. ... and more...