How to Write a Basic Kernel

Learn how to write a basic kernel.

Basic elements in a kernel (primfunc)
Declare function parameters
Loop iterations
A complete kernel demo

Basic Elements in a Kernel (primfunc)

A kernel is a function that performs a specific computation on a set of input data.

In Compass DSL, we write a primfunc to implement a kernel.

For a typical function in C, the key elements are:

Function parameters
Function body

Here are the typical elements in a primfunc:

Function parameters
Function body
- loop nests
- computations

For more information, see TVM Doc: tensor program abstraction.

In Compass DSL, all primfunc should be decorated with @S.prim_func.

Declare Function Parameters

We use S.ptr(dtype) to declare pointer parameters.

S.ptr(dtype) + S.match_buffer(with shape): For both one-dimension and multi-dimension data.
S.ptr(dtype): For 1-Dimension data, you can skip the match_buffer step.

# for 1-dimension data, without match_buffer is OK
@S.prim_func
def add_func1(A: S.ptr("int32", "global"), n:S.int32):
    for i in range(n):
        A[i] = A[i] + 2

The generated c_code is:

__kernel void add_func1(__global int* A, int n) {
  for (int i = 0; i < n; ++i) {
    A[i] = (A[i] + 2);
  }
}

# for 2-dimension data, use S.ptr(dtype) + S.match_buffer(with shape)
@S.prim_func
def matrix_transpose(A: S.ptr("int8", "global"), B: S.ptr("int8", "global"), h: S.int32, w: S.int32):
    a = S.match_buffer(A, shape=(h, w))
    b = S.match_buffer(B, shape=(w, h))

    for ih, iw in S.grid(h, w):
        b[iw, ih] = a[ih, iw]

Loop Iterations

Use python range directly.

@S.prim_func
def matrix_transpose(A: S.ptr("int8", "global"), B: S.ptr("int8", "global")):
    a = S.match_buffer(A, shape=(2, 3))
    b = S.match_buffer(B, shape=(3, 2))

    for h in range(3):
        for w in range(2):
            b[h, w] = a[w, h]

Use S.grid syntactic sugar in TensorIR to write multiple nested iterators.

@S.prim_func
def matrix_transpose(A: S.ptr("int8", "global"), B: S.ptr("int8", "global")):
    a = S.match_buffer(A, shape=(2, 3))
    b = S.match_buffer(B, shape=(3, 2))

    for h, w in S.grid(3, 2):
        b[h, w] = a[w, h]

A Complete Kernel Demo

Here is an example with the matrix transpose kernel.

import numpy as np
from tvm import aipu
from tvm.aipu import script as S

@S.prim_func
def matrix_transpose(A: S.ptr("int8", "global"), B: S.ptr("int8", "global"), h: S.int32, w: S.int32):
    a = S.match_buffer(A, shape=(h, w))
    b = S.match_buffer(B, shape=(w, h))

    for ih, iw in S.grid(h, w):
        b[iw, ih] = a[ih, iw]


def test_func():
    bm = aipu.tir.BuildManager()
    ex = bm.build(matrix_transpose)
    print(ex.c_code)

    h = 2
    w = 3
    a = np.array(list(range(h * w)), dtype="int8")
    b = np.zeros((h * w,), dtype="int8")
    ex(a, b, h, w)
    print(a, b)

if __name__ == "__main__":
    test_func()

The generated c code will be:

__kernel void matrix_transpose(__global char* a, __global char* b, int h, int w) {
  for (int ih = 0; ih < h; ++ih) {
    for (int iw = 0; iw < w; ++iw) {
      b[((iw * h) + ih)] = a[((ih * w) + iw)];
    }
  }
}

The test input and output data is:

a = [0 1 2 3 4 5]
b = [0 3 1 4 2 5]

For more information about how to use the DSL language to write the function body computation, see Language Basics.