to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases. When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases. The user is transferring the cod e from one hardware to another When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases. The target hardware is faster than the the source hardware . User expects the code to run at least 30-40% faster. Motivating Example When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases. The code ran 2x slower on the more powerful hardware