Tutorial 1: Static Add

In this tutorial, you will write a simple vector addition using Compass DSL. You will learn about:

  • The basic workflow of writing a static-operator

  • How to use LSRAM for inputs and outputs

  • How to use DMA to move data between LSRAM and DDR

  • How to split data on 4 TECs

  • How to vectorize using vector-builtin

To begin with a simple example, the data shape of input and output tensors in this tutorial is [2048].

Inputs & Outputs

  • Inputs:

    • Tensor(a,shape=[2048], dtype=”float16”)

    • Tensor(b,shape=[2048], dtype=”float16”)

  • Output:

    • Tensor(c,shape=[2048], dtype=”float16”)

So you can write the primfunc like this:

n = 2048
dtype = "float16"

@S.prim_func
def add_static(a: S.ptr(dtype, "global"), b: S.ptr(dtype, "global"), c: S.ptr(dtype, "global")):
    # func body
    ...

Use LSRAM

The input tensors and ouput tensor are on DDR with “global” scope. You can use LSRAM (Local Memory for each TEC) for faster data access.

You can allocate a big block of LSRAM and split into small parts for different tensors.

  • In this case, allocate an LSRAM with 512*sizeof(float16)

  • Split the LSRAM into two parts:

    • The first half of lsram for input a: lsram_a with size 256

    • The second half of lsram for input b: lsram_b with size 256

  • Reuse lsram_a for lsram_c

The code in Compass DSL:

    lsram = S.alloc_buffer([512], dtype, scope="lsram") # lsram buffer
    lsram_a = lsram.addr_of(0)                          # ptr of first half of lsram
    lsram_b = lsram.addr_of(256)                        # ptr of second half of lsram

Split Data for 4 TECs

The vector shapes n = 2048. Using 4 TECs for computation, each TEC computes 512 elements.

  • TEC0: [0:512]

  • TEC1: [512:1024]

  • TEC2: [1024:1536]

  • TEC3: [1536:2048]

For the code in Compass DSL, use a for loop: for ti in S.tec_range(TEC_NUM), where ti is the tec_id of each_tec.

Here is the script code:

    # TEC
    # ==================================================
    # n = 2048, NUM_TEC = 4
    # each tec compute n/TEC_NUM = 2048/4 = 512 elements
    # offset of each tec is: ti* 512

    TEC_NUM = 4
    # tec
    for ti in S.tec_range(TEC_NUM):
        len_t = 512  # 2048/TEC_NUM
        off_t = ti * len_t

        # tec data ptr
        # tec_a: a + off_t
        # tec_b: b + off_t
        # tec_c: c + off_t

You can also write in another way:

    TEC_NUM = S.get_local_size()
    ti = S.get_local_id()

Use DMA to Move Data

You can use S.dma_copy(dst, src, num_elements) to move data between DDR and LSRAM. In this case, you move input from DDR to LSRAM, compute on LSRAM, then move output from LSRAM to DDR.

    S.dma_copy(lsram_a, tec_a, 256)  # use DMA mov 256 elements from ddr_tec_a -> lsram_a
    S.dma_copy(lsram_b, tec_b, 256)  # use DMA mov 256 elements from ddr_tec_b -> lsram_b

    ...

    S.dma_copy(tec_c, lsram_a, 256)  # use DMA mov 256 elements from lsram_a -> ddr_tec_c

If the data size of each TEC exceeds the limit of LSRAM size of each TEC, you have to move the data from DDR to LSRAM in multiple loops:

loop_l = tec_n / LSRAM_SIZE

    for li in range(loop_l):
        S.dma_copy(lsram_a, tec_a + li * LSRAM_SIZE, LSRAM_SIZE)
        S.dma_copy(lsram_b, tec_b + li * LSRAM_SIZE, LSRAM_SIZE)

        ...

        S.dma_copy(tec_c + li * LSRAM_SIZE, lsram_a, LSRAM_SIZE)

Non-divisible cases are not considered in this simple example. You can see dynamic_add for non-divisible cases.

Vectorize

In this case, the data type is “float16”, for AIPU_X2, the vector register is 256-bit, thus the float16 vector type is “float16x16”. Vector_lane = 16

  • There are 256 elements on lsram_a

  • Each vector computes 16 elements

  • loop_num = 256/16 = 16

The code is:

# vectorized
# ==================================================
# dtype = float16, vector_lane = 16
# we have 256 elements on lsram_a
# each vector compute 16 elements
# the loop_num = 256/16 = 16
for vi in range(16):
    # vload a vector float16x16 from lsram_a
    # vload a vector float16x16 from lsram_b
    va = S.vload(lsram_a + vi * 16)  #can also write: va = lsram_a[vi*16:vi*16+16]
    vb = S.vload(lsram_b + vi * 16)
    # vector addition
    vc = S.vadd(va, vb)   # can also use: vc = va + vb
    # store the vector into lsram
    # we reuse lsram_a for output c
    S.vstore(vc, lsram_a + vi * 16)

Build and Run Kernel

The AIPU BuildManager is a user interface for DSL program compilation.

To declare a BuildManager, the target is required. The default target is X2_1204.

You can then call the build method of bm to build the kernel into an executor.

def test_static_add():
    # build the kernel
    bm = aipu.tir.BuildManager(target="X2_1204")
    ex = bm.build(add_static)

    # input data
    a = rand(n, dtype, low=-100, high=100)
    b = rand(n, dtype, low=-100, high=100)

    # run python simulator
    py_out = np.zeros((n,), dtype=dtype)
    add_static(a, b, py_out)

    # run AIPU simulator
    aipu_out = np.zeros((n,), dtype=dtype)
    ex(a, b, aipu_out)

    # verify result
    testing.assert_allclose(py_out, gt_out, atol=1e-4)
    testing.assert_allclose(aipu_out, gt_out, atol=1e-4)

Complete Code

You can find the sample code in PYTHON_PACKAGE_PATH/tvm/aipu/samples/dsl/tutorial_1_static_add.py. The placeholder PYTHON_PACKAGE_PATH represents the location where you install the Compass DSL Python package, in general, it will be something like ~/.local/lib/python3.8/site-packages.