How to Debug with Python

Compass DSL provides two implementations for each interface both in TVM and Python. The reason of providing the Python interface is that it is easy to debug kernel implementation through Python.

There are many advantages using Python for debugging. On one hand, the traditional OpenCL debugging process has some more complicated steps and the code generated by TVM is relatively poor in readability. On the other hand, Python debugging makes it easier to interact with users and can be easily integrated with IDEs.

This section will explain in detail how to debug with Python through a practical example.

Prepare Kernel Function

First, we prepare a kernel function like:

@S.prim_func
def cast_from_i8_to_i32(inp: S.ptr("int8", "global"), out: S.ptr("int32", "global"), size: S.i32):
    tec_num = S.get_local_size()
    tid = S.get_local_id()

    per_tec_size = S.ceildiv(size, tec_num)
    ofs = tid * per_tec_size
    if ofs >= size:
        return
    each_tec_size = S.clip(size - ofs, min_val=0, max_val=per_tec_size)
    loop = each_tec_size / 32

    inp_offset = ofs
    out_offset = ofs
    for i in range(loop):
        data = S.vload(inp + inp_offset)
        inp_offset += 32

        out_ll = S.vxtl(S.vxtl(data))
        S.vstore(out_ll, out + out_offset)
        out_offset += 8

        out_lh = S.vxth(S.vxtl(data))
        S.vstore(out_lh, out + out_offset)
        out_offset += 8

        out_hl = S.vxtl(S.vxth(data))
        S.vstore(out_hl, out + out_offset)
        out_offset += 8

        out_hh = S.vxth(S.vxth(data))
        S.vstore(out_hh, out + out_offset)
        out_offset += 8

This kernel function casts the int8 input to int32 output. It is a bit long and does not allow you to quickly go through the code to find problems.

Prepare Test Case

We can write a test case like:

def test():
    size = 128
    a = rand(size, dtype="int8")
    gt_out = a.astype("int32")

    bm = aipu.tir.BuildManager()
    ex = bm.build(cast_from_i8_to_i32)

    # run on pysim
    py_out = np.zeros(size, "int32")
    cast_from_i8_to_i32(a, py_out, size)
    testing.assert_allclose(py_out, gt_out)

    # run on simulator
    aipu_out = np.zeros(size, "int32")
    ex.run(a, aipu_out, size)
    testing.assert_allclose(aipu_out, gt_out)

There are two ways to run this function, one is on pysim and the other is on AIPU simulator or hardware. If you want to run it on pysim, just call the kernel function with a name like:

cast_from_i8_to_i32(a, py_out, size)

This call will run entirely on Python, which means that we can use all the debugging methods available on Python.

Debug with Python

When we run this test case, everything works fine. However if we change the size to 132, things change. This means that there must be something in our program that was not considered well, but it is unlikely that we can see it directly from the code.

It would be troublesome to debug with OpenCL at this time, but fortunately we can debug directly on Python. By default, pysim runs in multi-threaded mode. It can also be manually set to single-threaded running. Follow the instructions in the document PySim.

We are able to add breakpoints in a variety of ways, including:

IDE UI, e.g., VS Code, PyCharm.
Python debugger command, e.g., pdb, pudb.
Insert code “breakpoint()”, “import pdb; pdb.set_trace()”.

For this case we add a breakpoint in the kernel function:

    for i in range(loop):

After adding this breakpoint, the test case we just debugged will stop at the breakpoint. After the program stops, we can view the values of each variable.

tec_num = 4
per_tec_size = 33
loop = 1

It is clear that every loop deals with 32 data, and there are 33 data here but only one loop. Therefore what we should do is to consider casting 8 elements each iteration of the loop, and adding a mask for load and store of tail elements.

@S.prim_func
def cast_from_i8_to_i32(inp: S.ptr("int8", "global"), out: S.ptr("int32", "global"), size: S.i32):
    tec_num = S.get_local_size()
    tid = S.get_local_id()

    per_tec_size = S.ceildiv(size, tec_num)
    ofs = tid * per_tec_size
    if ofs >= size:
        return
    each_tec_size = S.clip(size - ofs, min_val=0, max_val=per_tec_size)
    loop = each_tec_size // 8
    tail = each_tec_size % 8

    inp_offset = ofs
    out_offset = ofs
    for i in range(loop):
        data = S.vload(inp + inp_offset)
        inp_offset += 8

        out_ll = S.vxtl(S.vxtl(data))
        S.vstore(out_ll, out + out_offset)
        out_offset += 8

    if tail != 0:
        tail_mask = S.tail_mask(tail, 8)
        data = S.vload(cur_buf + inp_offset, mask=tail_mask)
        out_ll = S.vxtl(S.vxtl(data))
        S.vstore(out_ll, out + out_offset, mask=tail_mask)

Summary

We implemented pysim with the same behavior as OpenCL to facilitate debugging. We can use pysim to easily debug logical errors in the program.

This document shows how to debug through pysim through a practical use case.