Memory

The memory-related part of the IR APIs.

tvm.aipu.script.ir.memory.ptr(dtype, scope='private')

Annotates a function parameter as a pointer.

Parameters

dtype: Union[str, DataType]

The data type of the data that the pointer points to.

  • Scalar dtype: int8, uint8, int16, uint16, int32, uint32, float16, float32, void

  • Vector dtype: int8x32, uint8x32, int16x16, uint16x16, int32x8, uint32x8, float16x16, float32x8

scope: Optional[str]

The memory space of the data that the pointer points to. The valid choices are listed below.

  • global: Represents the global DDR space of Address Space Extension region ID (ASID) 0.

  • global.1: Represents the global DDR space of ASID 1.

  • global.2: Represents the global DDR space of ASID 2.

  • global.3: Represents the global DDR space of ASID 3.

  • private: Represents the stack space of each TEC.

  • lsram: Represents the local SRAM space of each TEC.

  • shared: Represents the SRAM space shared between all TECs in the same core.

  • constant: Represents the global constant DDR space.

Returns

ret: Pointer

The pointer instance.

Examples

@S.prim_func
def func(a: S.ptr("i8", "global"), b: S.ptr("i8", "global"), n: S.i32):
    for i in range(n):
        b[i] = a[i]

tvm.aipu.script.ir.memory.match_buffer(pointer, shape)

Matches a pointer with the specified shape.

Parameters

pointer: Pointer

The data pointer to be matched.

shape: Union[List[PrimExpr], Tuple[PrimExpr], PrimExpr, Integral]

The data shape to match.

Returns

ret: Buffer

The matched buffer.
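The shaped view can be pictured as index arithmetic over the flat data behind the pointer. The sketch below is plain Python rather than the DSL, and it assumes a row-major layout, which this reference does not state explicitly. The helpers flat_index and transpose_flat are hypothetical names used only for illustration.

```python
def flat_index(i, j, shape):
    """Row-major offset of element [i, j] in a buffer of the given 2D shape."""
    rows, cols = shape
    assert 0 <= i < rows and 0 <= j < cols
    return i * cols + j


def transpose_flat(src, h, w):
    """Emulate b[j, i] = a[i, j] on flat lists viewed as (h, w) and (w, h)."""
    dst = [None] * (h * w)
    for i in range(h):
        for j in range(w):
            dst[flat_index(j, i, (w, h))] = src[flat_index(i, j, (h, w))]
    return dst


a = [0, 1, 2, 3, 4, 5]        # viewed as shape (2, 3)
b = transpose_flat(a, 2, 3)   # viewed as shape (3, 2)
print(b)
```

Under the row-major assumption, a[1, 2] is the sixth element of the flat data, and the transposed result interleaves the two source rows.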

Examples

# 2D transpose demo b[j, i] = a[i, j]
@S.prim_func
def func(A: S.ptr("int8", "global"), B: S.ptr("int8", "global"), h: S.int32, w: S.int32):
    a = S.match_buffer(A, shape=(h, w))
    b = S.match_buffer(B, shape=(w, h))

    for i, j in S.grid(h, w):
        b[j, i] = a[i, j]

tvm.aipu.script.ir.memory.alloc(shape, dtype, scope='private')

Allocates a buffer with the given shape, dtype, and scope, and returns a pointer to it.

Parameters

shape: Union[List[int], Tuple[int], int]

The shape of the buffer.

dtype: Union[str, DataType]

The data type of the buffer elements. Can be a scalar dtype or a vector dtype.

scope: Optional[str]

The memory space in which the data is allocated. The valid choices are listed below.

  • private: Represents the stack space of each TEC.

  • lsram: Represents the local SRAM space of each TEC.

  • shared: Represents the SRAM space shared between all TECs in the same core.

Returns

ret: Pointer

The allocated buffer pointer.

Examples

lsram_a = S.alloc(1024, "int8", scope="lsram")
S.dma_copy(lsram_a, ddr_a, 1024)

lsram_a = S.alloc(256, "float16x16", scope="shared")
lsram_a = S.alloc((256,), "int8", scope="lsram")
lsram_a = S.alloc([1024], "int8", scope="lsram")
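The examples above mix scalar and vector dtypes, so the same shape can occupy very different amounts of memory. A minimal plain-Python sketch of that arithmetic follows. It assumes the shape counts elements of the given dtype and that dtype strings follow the forms in the dtype lists of this section; dtype_bytes and alloc_bytes are hypothetical helpers, not part of the API.

```python
import re
from math import prod


def dtype_bytes(dtype):
    """Bytes per element for strings such as "int8" or "float16x16"."""
    m = re.fullmatch(r"(?:u?int|float)(\d+)(?:x(\d+))?", dtype)
    if m is None:
        raise ValueError(f"unsupported dtype string: {dtype!r}")
    bits = int(m.group(1))
    lanes = int(m.group(2)) if m.group(2) else 1
    return bits * lanes // 8


def alloc_bytes(shape, dtype):
    """Footprint of a buffer whose shape is an int, a list, or a tuple."""
    dims = (shape,) if isinstance(shape, int) else tuple(shape)
    return prod(dims) * dtype_bytes(dtype)


print(alloc_bytes(1024, "int8"))
print(alloc_bytes(256, "float16x16"))
print(alloc_bytes([32, 32], "int32"))
```

Every vector dtype listed for ptr works out to 32 bytes per element under this model, which is a quick sanity check when sizing lsram or shared buffers.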

tvm.aipu.script.ir.memory.alloc_buffer(shape, dtype, scope='private')

Allocates a buffer with the given shape, dtype, and scope, and returns the allocated buffer.

Parameters

shape: Union[List[int], Tuple[int], int]

The shape of the buffer.

dtype: Union[str, DataType]

The data type of the buffer elements. Can be a scalar dtype or a vector dtype.

scope: Optional[str]

The memory space in which the data is allocated. The valid choices are listed below.

  • private: Represents the stack space of each TEC.

  • lsram: Represents the local SRAM space of each TEC.

  • shared: Represents the SRAM space shared between all TECs in the same core.

Returns

ret: Buffer

The allocated buffer.

Examples

lsram_a = S.alloc_buffer([32, 32], dtype, scope="lsram")
S.dma_copy(lsram_a, ddr_a, 32 * 32)
b = lsram_a[10, 5] + 2

tvm.aipu.script.ir.memory.alloc_const(shape, dtype, data)

Allocates constant data.

Parameters

shape: Union[List[PrimExpr], Tuple[PrimExpr]]

The shape of the const buffer.

dtype: Union[str, DataType]

The data type of the const buffer elements.

data: np.array

The data of const buffer.

Returns

ret: Buffer

The allocated buffer.

Examples

import numpy as np

# The "lut_data" can be created in a pure Python environment at compile time.
lut_data = np.array(list(range(512)), dtype="float16")

@S.prim_func
def func(inp: S.ptr("fp16", "global"), out: S.ptr("fp16", "global")):
    lut = S.alloc_const((512,), "float16", lut_data)
    ...

tvm.aipu.script.ir.memory.vload(ptr, mask=None, lanes=None, stride=1)

Load a vector from contiguous or strided memory addresses.

  • The inactive elements of result vector are set to zero.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

Parameters

ptr: Pointer

The pointer that stores the base memory address.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.

lanes: Optional[int]

The number of lanes of the result vector dtype. If omitted, it is determined automatically from the type of the input address.

stride: Optional[Union[PrimExpr, int]]

The stride between loaded elements: one element is taken every stride elements.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

va = S.vload(ptr_a)
va = S.vload(ptr_a, mask="3T5F")
va = S.vload(ptr_a, mask=S.tail_mask(n, 8))
va = S.vload(ptr_a, mask="1T31F", lanes=32)
va = S.vload(ptr_a, mask="16T16F", lanes=32, stride=4)
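The string masks in these examples read as run-length encodings, for example "3T5F" as three active lanes followed by five inactive ones, and "1T31F" as one active lane followed by 31 inactive ones. That reading is inferred from the examples, not stated by this reference. Under that assumption, the following plain-Python sketch models a masked, strided load on a flat list; parse_mask and vload_ref are hypothetical names, and the check that the mask length equals the lane count is a choice of this model.

```python
import re


def parse_mask(text, lanes):
    """Expand a run-length mask string such as "3T5F" into a list of bools."""
    if re.fullmatch(r"(?:\d*[TF])+", text) is None:
        raise ValueError(f"malformed mask string: {text!r}")
    mask = []
    for count, flag in re.findall(r"(\d*)([TF])", text):
        mask.extend([flag == "T"] * (int(count) if count else 1))
    if len(mask) != lanes:
        raise ValueError(f"mask has {len(mask)} lanes, expected {lanes}")
    return mask


def vload_ref(memory, base, lanes, mask=None, stride=1):
    """Model of a masked, strided load: inactive lanes read as zero."""
    active = [True] * lanes if mask is None else parse_mask(mask, lanes)
    return [memory[base + i * stride] if active[i] else 0 for i in range(lanes)]


memory = list(range(100, 140))
print(vload_ref(memory, 0, 8))
print(vload_ref(memory, 0, 8, mask="3T5F"))
print(vload_ref(memory, 2, 8, mask="T7F", stride=4))
```

In this model, inactive lanes never touch memory, and each active lane i reads the element at base + i * stride.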

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vload, __vload_stride

tvm.aipu.script.ir.memory.vstore(value, ptr, mask=None, stride=1)

Store active elements of value to contiguous or strided memory addresses.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

Parameters

value: PrimExpr

The vector that needs to be stored.

ptr: Pointer

The pointer that stores the base memory address.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.

stride: Optional[Union[PrimExpr, int]]

The stride between stored elements: one element is stored every stride elements.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

S.vstore(va, ptr_a)
S.vstore(va, ptr_a, mask="3T5F")
S.vstore(va, ptr_a, mask=S.tail_mask(n, 8))
S.vstore(va, ptr_a, mask="T7F", stride=4)
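A common use of the masked form is writing the last partial vector of an array whose length is not a multiple of the lane count. The sketch below models that pattern in plain Python. It assumes a tail mask activates the first n lanes, which is an inference from the name S.tail_mask and not something this reference states; tail_mask_ref and vstore_ref are hypothetical names.

```python
def tail_mask_ref(n, lanes):
    """Model of a tail mask: the first n lanes active, the rest inactive."""
    return [i < n for i in range(lanes)]


def vstore_ref(value, memory, base, mask=None, stride=1):
    """Model of a masked, strided store: inactive lanes leave memory untouched."""
    active = [True] * len(value) if mask is None else mask
    for i, lane in enumerate(value):
        if active[i]:
            memory[base + i * stride] = lane


LANES = 8
src = list(range(1, 12))   # 11 elements: one full vector plus a tail of 3
dst = [0] * 16
for start in range(0, len(src), LANES):
    remaining = min(LANES, len(src) - start)
    # Pad the last chunk to a full vector; the mask keeps the padding out of dst.
    chunk = src[start:start + LANES] + [0] * (LANES - remaining)
    vstore_ref(chunk, dst, start, mask=tail_mask_ref(remaining, LANES))
print(dst)
```

Because inactive lanes are skipped entirely, the elements of dst beyond the eleventh are never written by the second store.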

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vstore, __vstore_stride

tvm.aipu.script.ir.memory.vload_gather(ptr, indices, mask=None)

Load a vector from discrete memory addresses. The addresses are calculated by ptr + indices * elem_size_in_byte.

  • The inactive elements of result vector are set to zero.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

Parameters

ptr: Pointer

The pointer that stores the base memory address.

indices: PrimExpr

The indices used to calculate the discrete memory addresses. It must be a 16-bit integer vector, and its length determines the length of the result vector.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

va = S.vload_gather(ptr_a, vb)
va = S.vload_gather(ptr_a, vb, mask="T7F")
va = S.vload_gather(ptr_a, vb, mask=S.tail_mask(n, 8))
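The address formula ptr + indices * elem_size_in_byte can be checked with a small plain-Python model. The sketch below computes the per-lane byte addresses and then performs the equivalent gather on a flat list, where an element offset stands in for the byte address. Both function names are hypothetical, and the mask is passed as a list of bools for simplicity.

```python
def gather_byte_addresses(base_addr, indices, elem_size_in_byte):
    """Byte address of each lane: base + index * element size."""
    return [base_addr + idx * elem_size_in_byte for idx in indices]


def vload_gather_ref(memory, base, indices, mask=None):
    """Model of a gather on a flat list: inactive lanes read as zero."""
    active = [True] * len(indices) if mask is None else mask
    return [memory[base + idx] if on else 0 for idx, on in zip(indices, active)]


memory = [10 * i for i in range(16)]
indices = [3, 1, 7, 7]
print(gather_byte_addresses(0x1000, indices, 2))
print(vload_gather_ref(memory, 0, indices))
print(vload_gather_ref(memory, 0, indices, mask=[True, False, True, True]))
```

Repeated indices are harmless for a gather: both lanes simply read the same element.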

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vload_gather

tvm.aipu.script.ir.memory.vstore_scatter(value, ptr, indices, mask=None)

Store active elements of value to discrete memory addresses. The addresses are calculated by ptr + indices * elem_size_in_byte.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

Parameters

value: PrimExpr

The vector that needs to be stored.

ptr: Pointer

The pointer that stores the base memory address.

indices: PrimExpr

The indices used to calculate the discrete memory addresses. It must be a 16-bit integer vector, and its length must be equal to that of value.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

S.vstore_scatter(va, ptr_a, vb)
S.vstore_scatter(va, ptr_a, vb, mask="3T5F")
S.vstore_scatter(va, ptr_a, vb, mask=S.tail_mask(n, 8))
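A plain-Python model of the scatter, working in element offsets on a flat list, is sketched below. The name vstore_scatter_ref is hypothetical and the mask is passed as a list of bools. The model writes lanes in order, so with duplicate indices the last lane wins; that ordering is a property of this sketch only, since this reference does not say how the hardware resolves duplicates.

```python
def vstore_scatter_ref(value, memory, base, indices, mask=None):
    """Model of a scatter: active lane i writes value[i] to memory[base + indices[i]]."""
    if len(indices) != len(value):
        raise ValueError("indices and value must have the same length")
    active = [True] * len(value) if mask is None else mask
    for lane, idx, on in zip(value, indices, active):
        if on:
            memory[base + idx] = lane


full = [0] * 8
vstore_scatter_ref([1, 2, 3, 4], full, 0, [6, 0, 3, 5])
print(full)

partial = [0] * 8
vstore_scatter_ref([1, 2, 3, 4], partial, 0, [6, 0, 3, 5],
                   mask=[True, True, False, False])
print(partial)
```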

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vstore_scatter

tvm.aipu.script.ir.memory.dma_copy(dst, src, width, src_stride=None, times=1, dst_stride=None)

Copy the specified number of elements from the source address to the destination address via DMA.

Parameters

dst: Pointer

The pointer that stores the destination memory address.

src: Pointer

The pointer that stores the source memory address.

width: Union[PrimExpr, int]

The number of elements to transfer in each stride jump.

src_stride: Optional[Union[PrimExpr, int]]

The number of source elements to advance for each stride jump. None means the same value as width, i.e., the source is read contiguously.

times: Optional[Union[PrimExpr, int]]

The total number of stride jumps.

dst_stride: Optional[Union[PrimExpr, int]]

The number of destination elements to advance for each stride jump. None means the same value as width, i.e., the destination is written contiguously.

Notes

  • The pointer types of src and dst must be the same.

  • All scope combinations of src and dst are supported except the following:

    • The scope of src is lsram and the scope of dst is lsram or shared.

    • The scope of src is shared and the scope of dst is lsram or shared.

Examples

# The 1D scenario.
S.dma_copy(ptr_a, ptr_b, 16)

# The 2D scenario. Transfer all "@" in "@@@@@xxx@@@@@xxx@@@@@xxx@@@@@xxx" and store them
# continuously in destination.
S.dma_copy(ptr_a, ptr_b, width=5, src_stride=8, times=4)
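The interplay of width, src_stride, times, and dst_stride can be modelled on flat Python lists. The sketch below follows the parameter descriptions above, with all offsets counted in elements; dma_copy_ref is a hypothetical name, and the model says nothing about scopes or hardware timing.

```python
def dma_copy_ref(dst, src, width, src_stride=None, times=1, dst_stride=None):
    """Model of the strided copy on flat Python lists (offsets in elements)."""
    src_stride = width if src_stride is None else src_stride
    dst_stride = width if dst_stride is None else dst_stride
    for t in range(times):
        for k in range(width):
            dst[t * dst_stride + k] = src[t * src_stride + k]


# Gather 5 of every 8 source elements and pack them contiguously.
src = list("abcdeXXXfghijXXXklmnoXXXpqrstXXX")
packed = ["."] * 20
dma_copy_ref(packed, src, width=5, src_stride=8, times=4)
print("".join(packed))

# Read contiguously and spread the rows out in the destination.
spread = ["."] * 16
dma_copy_ref(spread, list("abcdef"), width=3, times=2, dst_stride=8)
print("".join(spread))
```

The first call mirrors the 2D scenario above, with letters in place of "@" so the packing order is visible.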

tvm.aipu.script.ir.memory.async_dma_copy(dst, src, width, src_stride=None, times=1, dst_stride=None, event=None)

Copy the specified number of elements from the source address to the destination address asynchronously via DMA.

Once the DMA configuration is finished, this API returns and the code that follows is executed immediately, while the data is still being transferred via DMA.

Parameters

dst: Pointer

The pointer that stores the destination memory address.

src: Pointer

The pointer that stores the source memory address.

width: Union[PrimExpr, int]

The number of elements to transfer in each stride jump.

src_stride: Optional[Union[PrimExpr, int]]

The number of source elements to advance for each stride jump. None means the same value as width, i.e., the source is read contiguously.

times: Optional[Union[PrimExpr, int]]

The total number of stride jumps.

dst_stride: Optional[Union[PrimExpr, int]]

The number of destination elements to advance for each stride jump. None means the same value as width, i.e., the destination is written contiguously.

event: PrimExpr

The event to be triggered when the entire data transfer is complete. Note that if the event is in use elsewhere, the DMA hardware is blocked until that use triggers the event, and only then does the data transfer start. The API S.wait_events can be used to wait for the DMA operation to finish.

Notes

  • The pointer types of src and dst must be the same.

  • All scope combinations of src and dst are supported except the following:

    • The scope of src is lsram and the scope of dst is lsram or shared.

    • The scope of src is shared and the scope of dst is lsram or shared.

Examples

ev0 = S.alloc_events(1)

# The 1D scenario.
S.async_dma_copy(ptr_a, ptr_b, 16, event=ev0)
vc = va + vb
S.wait_events(ev0)

# The 2D scenario. Transfer all "@" in "@@@@@xxx@@@@@xxx@@@@@xxx@@@@@xxx" and store them
# continuously in destination.
S.async_dma_copy(ptr_a, ptr_b, width=5, src_stride=8, times=4, event=ev0)
vc = va + vb
S.wait_events(ev0)
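The issue, overlap, then wait pattern in these examples has a rough host-side analogy in Python threads. The sketch below is only an analogy: a background thread stands in for the DMA engine and threading.Event stands in for the DSL event. It does not model the hardware blocking behaviour described for event, and async_copy is a hypothetical name.

```python
import threading


def async_copy(dst, src, count, event):
    """Start a background copy and return at once; set event when it is done."""
    def worker():
        dst[:count] = src[:count]
        event.set()

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


src = list(range(16))
dst = [0] * 16
done = threading.Event()

thread = async_copy(dst, src, 16, done)              # returns without waiting
vc = [x + y for x, y in zip([1, 2, 3], [4, 5, 6])]   # independent work overlaps
done.wait()                                          # counterpart of waiting on the event
thread.join()
print(vc, dst == src)
```

As in the DSL examples, the destination is only read after the wait, and the overlapped computation does not depend on the data being copied.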

tvm.aipu.script.ir.memory.dma_transpose2d(dst, src, row, col, dst_stride=None, src_stride=None)

Transposes 2D data via DMA.

Parameters

dst: Pointer

The pointer that stores the destination memory address.

src: Pointer

The pointer that stores the source memory address.

row: Union[PrimExpr, int]

The number of rows of the input 2D data [row, col].

col: Union[PrimExpr, int]

The number of columns of the input 2D data [row, col].

dst_stride: Optional[Union[PrimExpr, int]]

The width stride of the output 2D data. Defaults to row.

src_stride: Optional[Union[PrimExpr, int]]

The width stride of the input 2D data. Defaults to col.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.
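The roles of row, col, and the two strides can be pictured with a plain-Python model on flat lists. The sketch assumes each stride is the distance, in elements, between the starts of consecutive rows of the respective 2D data, which is one reading of "width stride" and not confirmed by this reference. The name dma_transpose2d_ref is hypothetical.

```python
def dma_transpose2d_ref(dst, src, row, col, dst_stride=None, src_stride=None):
    """Model of a 2D transpose: dst[c][r] = src[r][c] with row pitches in elements."""
    dst_stride = row if dst_stride is None else dst_stride
    src_stride = col if src_stride is None else src_stride
    for r in range(row):
        for c in range(col):
            dst[c * dst_stride + r] = src[r * src_stride + c]


# Dense case: a 2 x 3 input becomes a 3 x 2 output.
dense = [None] * 6
dma_transpose2d_ref(dense, [0, 1, 2, 3, 4, 5], row=2, col=3)
print(dense)

# Padded case: input rows are 4 elements apart, output rows are 3 elements apart.
padded = [-1] * 9
dma_transpose2d_ref(padded, [0, 1, 2, -1, 3, 4, 5, -1], row=2, col=3,
                    dst_stride=3, src_stride=4)
print(padded)
```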

Examples

S.dma_transpose2d(dst, src, 8, 64)

tvm.aipu.script.ir.memory.dma_upsample(dst, src, h_scale, w_scale, c, w, src_c_stride=None, dst_c_stride=None, dst_w_stride=None)

Upsamples data from the source src to the destination dst via DMA. Two transfer directions between src and dst are supported:

  • global -> [lsram, shared]

  • [lsram, shared] -> global

Note:

  • Each call of dma_upsample physically processes one 2D input surface in WC layout, not a 3D input.

  • To upsample a 3D input in HWC layout by h_scale and w_scale along the H and W dimensions respectively, call dma_upsample H times, where H is the size of the H dimension of the input.

Parameters

dst: Pointer

The pointer that stores the destination memory address.

src: Pointer

The pointer that stores the source memory address.

h_scale: Union[PrimExpr, int]

The scale in the h direction.

w_scale: Union[PrimExpr, int]

The scale in the w direction.

c: Union[PrimExpr, int]

The c of each move on the source.

w: Union[PrimExpr, int]

The w of each move on the source.

src_c_stride: Optional[Union[PrimExpr, int]]

The c stride of each move on the source. Defaults to c.

dst_c_stride: Optional[Union[PrimExpr, int]]

The c stride of each move on the destination. Defaults to c.

dst_w_stride: Optional[Union[PrimExpr, int]]

The w stride of each move on the destination. Defaults to w_scale * w.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.
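The destination layout drawn in the Examples for this function can be reproduced with a plain-Python model on flat lists. The index arithmetic below is inferred from those diagrams and the parameter defaults, so treat it as an assumption, not a specification; dma_upsample_ref is a hypothetical name and "?" marks elements the model never writes.

```python
def dma_upsample_ref(dst, src, h_scale, w_scale, c, w,
                     src_c_stride=None, dst_c_stride=None, dst_w_stride=None):
    """Model of upsampling one WC surface into an HWC block (offsets in elements)."""
    src_c_stride = c if src_c_stride is None else src_c_stride
    dst_c_stride = c if dst_c_stride is None else dst_c_stride
    dst_w_stride = w_scale * w if dst_w_stride is None else dst_w_stride
    for h in range(h_scale):
        for wi in range(w):
            for ws in range(w_scale):
                out_row = h * dst_w_stride + wi * w_scale + ws
                for ci in range(c):
                    dst[out_row * dst_c_stride + ci] = src[wi * src_c_stride + ci]


# Case0 from the Examples: dense source and destination.
dense = ["?"] * (2 * 6 * 3)
dma_upsample_ref(dense, [0, 1, 2, 3, 4, 5], h_scale=2, w_scale=3, c=3, w=2)

# Case1 from the Examples: padded source rows and padded destination.
padded = ["?"] * (2 * 7 * 5)
dma_upsample_ref(padded, [0, 1, 2, "?", 3, 4, 5, "?"], h_scale=2, w_scale=3,
                 c=3, w=2, src_c_stride=4, dst_c_stride=5, dst_w_stride=7)
for start in range(0, len(padded), 5):
    print(padded[start:start + 5])
```

Each source pixel is repeated w_scale times along W, and the whole upsampled row is repeated h_scale times along H, with the strides leaving the padding untouched.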

Examples

# Case0: an easy use case
S.dma_upsample(dst=ddr_ptr, src=sram_ptr, h_scale=2, w_scale=3, c=3, w=2)

# Source 2D data in WC layout(w=2, c=3)
# [
#   [0 1 2],
#   [3 4 5],
# ]
# Upsampled destination 3D data in HWC layout
# [
#   [
#     [0 1 2],
#     [0 1 2],
#     [0 1 2],
#
#     [3 4 5],
#     [3 4 5],
#     [3 4 5],
#   ],
#   [
#     [0 1 2],
#     [0 1 2],
#     [0 1 2],
#
#     [3 4 5],
#     [3 4 5],
#     [3 4 5],
#   ],
# ]

# Case1: a comprehensive use case
S.dma_upsample(dst=ddr_ptr, src=sram_ptr, h_scale=2, w_scale=3,
               c=3, w=2, src_c_stride=4, dst_c_stride=5, dst_w_stride=7)

# Source 2D data in WC layout(c=3, w=2, src_c_stride=4)
# [
#   [0 1 2 ?],
#   [3 4 5 ?],
# ]
# Upsampled destination 3D data in HWC layout(dst_w_stride=7, dst_c_stride=5)
# [
#   [
#     [0 1 2 ? ?],
#     [0 1 2 ? ?],
#     [0 1 2 ? ?],
#
#     [3 4 5 ? ?],
#     [3 4 5 ? ?],
#     [3 4 5 ? ?],
#
#     [? ? ? ? ?],
#   ],
#   [
#     [0 1 2 ? ?],
#     [0 1 2 ? ?],
#     [0 1 2 ? ?],
#
#     [3 4 5 ? ?],
#     [3 4 5 ? ?],
#     [3 4 5 ? ?],
#
#     [? ? ? ? ?],
#   ],
# ]

tvm.aipu.script.ir.memory.dma_memset(ptr, value, num)

Fills the first num elements at ptr with the specified value.

Parameters

ptr: Pointer

The pointer that stores the base memory address.

value: Union[PrimExpr, int]

The value to set. It must have the same dtype as the data that ptr points to.

num: Union[PrimExpr, int]

The number of scalar elements to set.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

S.dma_memset(ptr_a, 0, 128)

tvm.aipu.script.ir.memory.flush_cache(invalidate=True)

Flushes the whole level 1 data cache by writing back to DDR.

Parameters

invalidate: bool

Whether to invalidate the cached data.

Examples

S.flush_cache()
S.flush_cache(invalidate=False)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __flush_cache