


Although TCUs are prevalent and promise an increase in performance and/or energy efficiency, they suffer from over-specialization: only matrix-multiplication operations are supported. This limits their applicability to general algorithms and confines them to narrowly specialized libraries and application domains.

For matrix multiplication, the NVIDIA Volta architecture achieves an 8× throughput increase, with each Streaming Multiprocessor (SM) capable of performing 1024 half-precision operations per cycle using the TCUs, compared to 128 half-precision operations per cycle without them. This is enabled by the fact that NVIDIA's Volta GPUs dedicate a large portion of the SM processing unit (or subcore) chip area to TCUs, as shown in Figure 1.

Currently, algorithms other than general matrix-matrix multiplication (GEMM) do not utilize the TCUs, resulting in idle TCUs and low chip utilization for these algorithms.

Figure 1: Each processing block (subcore) in the NVIDIA Tesla V100 PCI-E architecture contains 2 TCUs. In total, 640 TCUs are available, achieving a theoretical peak of 113 TFLOPS.

A marquee feature of NVIDIA's Volta (Tesla V100) architecture is its TCUs: programmable matrix-multiply-and-accumulate hardware units, called Tensor Cores by NVIDIA (we use TCU and Tensor Core interchangeably in this paper). Figure 1 illustrates the processing block within each SM, with the V100 containing 80 SMs and each SM having 4 processing blocks. In turn, each processing block contains 2 Tensor Cores, for a total of 640 Tensor Cores on the V100, achieving a 12× throughput improvement over the previous-generation Tesla P100.
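These figures can be sanity-checked with a back-of-the-envelope calculation (ours, not from the text above), using the documented fact that each Volta Tensor Core performs a 4×4×4 matrix multiply-accumulate per cycle, i.e. 64 fused multiply-adds or 128 floating-point operations:

8 Tensor Cores/SM × 128 FLOPs/cycle = 1024 half-precision FLOPs per cycle per SM

640 Tensor Cores × 128 FLOPs/cycle × 1.38 GHz ≈ 113 TFLOPS

which is consistent with both the per-SM throughput and the chip-wide peak quoted above.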

This section first describes the current usage of the NVIDIA TCUs, then details the current TCU API and presents some evaluation results that motivate our work.

Figure 4: General matrix-matrix multiplication (GEMM) performance using Tensor Cores for both half- (a) and mixed- (b) precision on a V100 PCI-E GPU with a clock frequency of 1380 MHz and a 113 TFLOPS peak performance. (a) GEMM with half-precision input and half-precision output. (b) Mixed-precision GEMM with half-precision input and single-precision output. The inputs are square matrices with variable ⟨ M, N, K ⟩ dimensions. The optimized and naïve WMMA GEMM algorithms are described in the text.

Listing 1 shows a simple CUDA kernel that computes a ⟨ 16, 16, 16 ⟩ matrix multiplication within a warp using the WMMA API. The API supports 3 kinds of matrices, matrix_a (A), matrix_b (B), and accumulator (C or D), each with its own internal data layout (the mapping between individual matrix elements and their residing threads is purposely opaque and undocumented) as well as its own loading, storing, and computing semantics. Users specify both the data type and the ⟨ M, N, K ⟩ shape of the fragments; for the matrix_a and matrix_b kinds, users also specify whether the matrix is in column- or row-major order. The kernel first declares the matrix fragments and then initializes the matrix_c elements to zero by broadcasting the scalar value 0 into each index of the fragment. Users specify the stride between rows using the leading dimension and load the data from either shared or global memory. Once the data is loaded, users perform the matrix multiplication operation and store the results. We will discuss how we alleviate some of these constraints in Section 6.1.
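Listing 1 itself is not reproduced here; the following is a minimal sketch of such a kernel under stated assumptions: half-precision inputs and accumulator, row-major A and column-major B, a single warp computing the full 16×16 tile, and illustrative names (wmma_16x16x16, lda, ldb, ldc) that are not taken from the paper.

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, half *c,
                              int lda, int ldb, int ldc) {
  // Declare the fragments: matrix kind, <M, N, K> shape, element type, and
  // (for matrix_a / matrix_b) the storage order.
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

  // Initialize the accumulator by broadcasting 0 into every element.
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // Load A and B; lda and ldb are the leading dimensions (stride between
  // consecutive rows or columns) in either shared or global memory.
  wmma::load_matrix_sync(a_frag, a, lda);
  wmma::load_matrix_sync(b_frag, b, ldb);

  // Perform the warp-synchronous multiply-accumulate: c = a * b + c.
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

  // Store the 16x16 result tile back to memory.
  wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}

A host would launch this with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(dA, dB, dC, 16, 16, 16) on 256-element device buffers, and the code must be compiled for compute capability 7.0 or higher (e.g. nvcc -arch=sm_70).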
