CUDA Toolkit 3.1 includes the following updates and additions:
- GPUDirect gives third-party devices direct access to CUDA memory
- Support for 16-way concurrency allows up to 16 different kernels to run at the same time on Fermi-architecture GPUs
- Runtime / Driver interoperability lets applications mix and match the CUDA Driver API with the CUDA C Runtime and math libraries via buffer sharing and context migration
- New language features added to CUDA C / C++:
  - Support for printf() in device code
  - Support for function pointers and recursion, which makes it easier to port many existing algorithms to Fermi GPUs
- Unified Visual Profiler now supports both CUDA C/C++ and OpenCL, and adds CUDA Driver API tracing
- Math library performance improvements, including:
  - Improved performance of selected transcendental functions from the log, pow, erf, and gamma families
  - Significant improvements in double-precision FFT performance on Fermi-architecture GPUs for 2^n transform sizes
  - Streaming API now supported in CUBLAS for overlapping copy and compute operations
  - CUFFT real-to-complex (R2C) and complex-to-real (C2R) optimizations for 2^n data sizes
  - Improved performance for GEMV and SYMV subroutines in CUBLAS
  - Optimized double-precision implementations of divide and reciprocal routines for the Fermi architecture
- New and updated SDK code samples demonstrating:
  - Function pointers in CUDA C/C++ kernels
  - OpenCL / Direct3D buffer sharing
  - A Hidden Markov Model in OpenCL
  - A Microsoft Excel GPGPU example showing how to run an Excel function on the GPU
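
The new CUDA C/C++ language features listed above (device-side printf(), function pointers, and recursion) can be sketched in one small kernel. This is a minimal illustration, not code from the SDK samples: the names BinaryOp, add, mul, factorial, and demo are hypothetical, and it assumes a Fermi-class or newer GPU (compute capability 2.0+). It calls cudaDeviceSynchronize(), which comes from later toolkit releases; toolkits of the 3.1 era used cudaThreadSynchronize() for the same purpose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical names for illustration; none come from the release notes.
typedef int (*BinaryOp)(int, int);

__device__ int add(int a, int b) { return a + b; }
__device__ int mul(int a, int b) { return a * b; }

// Recursive device function (needs compute capability 2.0 or higher).
__device__ int factorial(int n)
{
    return n <= 1 ? 1 : n * factorial(n - 1);
}

__global__ void demo(int useMul)
{
    int tid = static_cast<int>(threadIdx.x);

    // Select a device function through a function pointer at run time.
    BinaryOp op = useMul ? mul : add;
    int combined = op(tid, 3);

    // printf() called directly from device code.
    printf("thread %d: op(tid, 3) = %d, %d! = %d\n",
           tid, combined, tid, factorial(tid));
}

int main()
{
    demo<<<1, 4>>>(1);  // one block of four threads, using mul

    // Catch launch-time errors (e.g. an unsupported architecture).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Wait for the kernel; this also flushes the device printf buffer.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Build it with nvcc, targeting an architecture that supports these features. Device printf() output is buffered and only appears on the host after a synchronizing call, which is why the sketch synchronizes before exiting.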