Tutorial 1: Static Add
In this tutorial, you will write a simple vector addition using Compass DSL. You will learn about:
The basic workflow of writing a static-operator
How to use LSRAM for inputs and outputs
How to use DMA to move data between LSRAM and DDR
How to split data on 4 TECs
How to vectorize using vector-builtin
To begin with a simple example, the data shape of input and output tensors in this tutorial is [2048]
.
Inputs & Outputs
Inputs:
Tensor(a,shape=[2048], dtype=”float16”)
Tensor(b,shape=[2048], dtype=”float16”)
Output:
Tensor(c,shape=[2048], dtype=”float16”)
So you can write the primfunc like this:
n = 2048
dtype = "float16"
@S.prim_func
def add_static(a: S.ptr(dtype, "global"), b: S.ptr(dtype, "global"), c: S.ptr(dtype, "global")):
# func body
...
Use LSRAM
The input tensors and ouput tensor are on DDR with “global” scope. You can use LSRAM (Local Memory for each TEC) for faster data access.
You can allocate a big block of LSRAM and split into small parts for different tensors.
In this case, allocate an LSRAM with 512*sizeof(float16)
Split the LSRAM into two parts:
The first half of lsram for input a: lsram_a with size 256
The second half of lsram for input b: lsram_b with size 256
Reuse lsram_a for lsram_c
The code in Compass DSL:
lsram = S.alloc_buffer([512], dtype, scope="lsram") # lsram buffer
lsram_a = lsram.addr_of(0) # ptr of first half of lsram
lsram_b = lsram.addr_of(256) # ptr of second half of lsram
Split Data for 4 TECs
The vector shapes n = 2048. Using 4 TECs for computation, each TEC computes 512 elements.
TEC0: [0:512]
TEC1: [512:1024]
TEC2: [1024:1536]
TEC3: [1536:2048]
For the code in Compass DSL, use a for loop: for ti in S.tec_range(TEC_NUM)
, where ti is the tec_id of each_tec.
Here is the script code:
# TEC
# ==================================================
# n = 2048, NUM_TEC = 4
# each tec compute n/TEC_NUM = 2048/4 = 512 elements
# offset of each tec is: ti* 512
TEC_NUM = 4
# tec
for ti in S.tec_range(TEC_NUM):
len_t = 512 # 2048/TEC_NUM
off_t = ti * len_t
# tec data ptr
# tec_a: a + off_t
# tec_b: b + off_t
# tec_c: c + off_t
You can also write in another way:
TEC_NUM = S.get_local_size()
ti = S.get_local_id()
Use DMA to Move Data
You can use S.dma_copy(dst, src, num_elements)
to move data between DDR and LSRAM. In this case, you move input from DDR to LSRAM, compute on LSRAM, then move output from LSRAM to DDR.
S.dma_copy(lsram_a, tec_a, 256) # use DMA mov 256 elements from ddr_tec_a -> lsram_a
S.dma_copy(lsram_b, tec_b, 256) # use DMA mov 256 elements from ddr_tec_b -> lsram_b
...
S.dma_copy(tec_c, lsram_a, 256) # use DMA mov 256 elements from lsram_a -> ddr_tec_c
If the data size of each TEC exceeds the limit of LSRAM size of each TEC, you have to move the data from DDR to LSRAM in multiple loops:
loop_l = tec_n / LSRAM_SIZE
for li in range(loop_l):
S.dma_copy(lsram_a, tec_a + li * LSRAM_SIZE, LSRAM_SIZE)
S.dma_copy(lsram_b, tec_b + li * LSRAM_SIZE, LSRAM_SIZE)
...
S.dma_copy(tec_c + li * LSRAM_SIZE, lsram_a, LSRAM_SIZE)
Non-divisible cases are not considered in this simple example. You can see dynamic_add for non-divisible cases.
Vectorize
In this case, the data type is “float16”, for AIPU_X2, the vector register is 256-bit, thus the float16 vector type is “float16x16”. Vector_lane = 16
There are 256 elements on lsram_a
Each vector computes 16 elements
loop_num = 256/16 = 16
The code is:
# vectorized
# ==================================================
# dtype = float16, vector_lane = 16
# we have 256 elements on lsram_a
# each vector compute 16 elements
# the loop_num = 256/16 = 16
for vi in range(16):
# vload a vector float16x16 from lsram_a
# vload a vector float16x16 from lsram_b
va = S.vload(lsram_a + vi * 16) #can also write: va = lsram_a[vi*16:vi*16+16]
vb = S.vload(lsram_b + vi * 16)
# vector addition
vc = S.vadd(va, vb) # can also use: vc = va + vb
# store the vector into lsram
# we reuse lsram_a for output c
S.vstore(vc, lsram_a + vi * 16)
Build and Run Kernel
The AIPU BuildManager is a user interface for DSL program compilation.
To declare a BuildManager, the target
is required. The default target is X2_1204
.
You can then call the build
method of bm to build the kernel into an executor.
def test_static_add():
# build the kernel
bm = aipu.tir.BuildManager(target="X2_1204")
ex = bm.build(add_static)
# input data
a = rand(n, dtype, low=-100, high=100)
b = rand(n, dtype, low=-100, high=100)
# run python simulator
py_out = np.zeros((n,), dtype=dtype)
add_static(a, b, py_out)
# run AIPU simulator
aipu_out = np.zeros((n,), dtype=dtype)
ex(a, b, aipu_out)
# verify result
testing.assert_allclose(py_out, gt_out, atol=1e-4)
testing.assert_allclose(aipu_out, gt_out, atol=1e-4)
Complete Code
You can find the sample code in PYTHON_PACKAGE_PATH/tvm/aipu/samples/dsl/tutorial_1_static_add.py
.
The placeholder PYTHON_PACKAGE_PATH
represents the location where you install the Compass DSL
Python package, in general, it will be something like ~/.local/lib/python3.8/site-packages
.