Performance

Tuning CUDA with the GPU Memory Hierarchy · 2024-11-27
Global memory, shared memory, and registers each have distinct latency and bandwidth. Kernel performance depends on matching the access pattern to each level.
Computer Architecture: A Quantitative Approach (6th ed.)
Computer Systems: A Programmer's Perspective (3rd ed.)
Improving the Scalability and Performance of a Rails Application: A Case Study with Consul