Memory
The memory part of IR APIs.
- tvm.aipu.script.ir.memory.ptr(dtype, scope='private')
Annotates a function parameter as a pointer.
Parameters
- dtype: Union[str, DataType]
The data type of the data that the pointer points to.
Scalar dtype: int8, uint8, int16, uint16, int32, uint32, float16, float32, void
Vector dtype: int8x32, uint8x32, int16x16, uint16x16, int32x8, uint32x8, float16x16, float32x8
- scope: Optional[str]
The memory space of the data that the pointer points to. The valid choices are listed below.
global: Represents the global DDR space of Address Space Extension region ID (ASID) 0.
global.1: Represents the global DDR space of ASID 1.
global.2: Represents the global DDR space of ASID 2.
global.3: Represents the global DDR space of ASID 3.
private: Represents the stack space of each TEC.
lsram: Represents the local SRAM space of each TEC.
shared: Represents the SRAM space shared between all TECs in the same core.
constant: Represents the global constant DDR space.
Returns
- ret: Pointer
The pointer instance.
Examples
    @S.prim_func
    def func(a: S.ptr("i8", "global"), b: S.ptr("i8", "global"), n: S.i32):
        for i in range(n):
            b[i] = a[i]
- tvm.aipu.script.ir.memory.match_buffer(pointer, shape)
Matches a pointer with the specified shape.
Parameters
- pointer: Pointer
The data pointer to be matched.
- shape: Union[List[PrimExpr], Tuple[PrimExpr], PrimExpr, Integral]
The data shape to match.
Returns
- ret: Buffer
The matched buffer.
Examples
    # 2D transpose demo: b[j, i] = a[i, j]
    @S.prim_func
    def func(A: S.ptr("int8", "global"), B: S.ptr("int8", "global"), h: S.int32, w: S.int32):
        a = S.match_buffer(A, shape=(h, w))
        b = S.match_buffer(B, shape=(w, h))
        for i, j in S.grid(h, w):
            b[j, i] = a[i, j]
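What the shape matching buys you can be sketched in plain Python (this is not Compass DSL code). The sketch assumes a row-major layout, which the text above does not state explicitly, and `flat_index` and `transpose_flat` are hypothetical helper names used only for illustration.

```python
def flat_index(i, j, width):
    """Row-major offset of element [i, j] in a 2D view with `width` columns."""
    return i * width + j


def transpose_flat(a, h, w):
    """Transpose flat row-major data `a` of shape (h, w) into shape (w, h)."""
    b = [0] * (h * w)
    for i in range(h):
        for j in range(w):
            # The output view has shape (w, h), so its row length is h.
            b[flat_index(j, i, h)] = a[flat_index(i, j, w)]
    return b


a = [1, 2, 3, 4, 5, 6]  # shape (2, 3): [[1, 2, 3], [4, 5, 6]]
print(transpose_flat(a, 2, 3))  # [1, 4, 2, 5, 3, 6]
```

With a matched buffer, the index arithmetic done by `flat_index` here is what the `a[i, j]` and `b[j, i]` accesses express directly.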
- tvm.aipu.script.ir.memory.alloc(shape, dtype, scope='private')
Allocates a buffer with the given shape, dtype, and scope, and returns a pointer to it.
Parameters
- shape: Union[List[int], Tuple[int], int]
The shape of the buffer.
- dtype: Union[str, DataType]
The data type of the buffer elements. Can be a scalar dtype or a vector dtype.
- scope: Optional[str]
The memory space in which the data is allocated. The valid choices are listed below.
private: Represents the stack space of each TEC.
lsram: Represents the local SRAM space of each TEC.
shared: Represents the SRAM space shared between all TECs in the same core.
Returns
- ret: Pointer
The allocated buffer pointer.
Examples
    lsram_a = S.alloc(1024, "int8", scope="lsram")
    S.dma_copy(lsram_a, ddr_a, 1024)

    lsram_a = S.alloc(256, "float16x16", scope="shared")
    lsram_a = S.alloc((256,), "int8", scope="lsram")
    lsram_a = S.alloc([1024], "int8", scope="lsram")
- tvm.aipu.script.ir.memory.alloc_buffer(shape, dtype, scope='private')
Allocates a buffer with the given shape, dtype, and scope, and returns the allocated buffer.
Parameters
- shape: Union[List[int], Tuple[int], int]
The shape of the buffer.
- dtype: Union[str, DataType]
The data type of the buffer elements. Can be a scalar dtype or a vector dtype.
- scope: Optional[str]
The memory space in which the data is allocated. The valid choices are listed below.
private: Represents the stack space of each TEC.
lsram: Represents the local SRAM space of each TEC.
shared: Represents the SRAM space shared between all TECs in the same core.
Returns
- ret: Buffer
The allocated buffer.
Examples
    lsram_a = S.alloc_buffer([32, 32], dtype, scope="lsram")
    S.dma_copy(lsram_a, ddr_a, 32 * 32)
    b = lsram_a[10, 5] + 2
- tvm.aipu.script.ir.memory.alloc_const(shape, dtype, data)
Allocates constant data.
Parameters
- shape: Union[List[PrimExpr], Tuple[PrimExpr]]
The shape of the const buffer.
- dtype: Union[str, DataType]
The data type of the const buffer elements.
- data: np.array
The data of the const buffer.
Returns
- ret: Buffer
The allocated buffer.
Examples
    # The "lut_data" can be created in pure Python environment during compile time.
    lut_data = np.array(list(range(512)), dtype="float16")

    @S.prim_func
    def func(inp: S.ptr("fp16", "global"), out: S.ptr("fp16", "global")):
        lut = S.alloc_const((512,), "float16", lut_data)
        ...
- tvm.aipu.script.ir.memory.vload(ptr, mask=None, lanes=None, stride=1)
Load a vector from contiguous or strided memory addresses.
The inactive elements of result vector are set to zero.
The feature Flexible Width Vector is supported.
The feature Multiple Width Vector is supported.
Parameters
- ptr: Pointer
The pointer that stores the base memory address.
- mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]
The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.
- lanes: Optional[int]
The number of lanes of the result vector dtype. If omitted, it is determined automatically from the type of the input address.
- stride: Optional[Union[PrimExpr, int]]
The stride between loaded elements. One element is taken at every stride interval.
Returns
- ret: PrimExpr
The result expression.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    va = S.vload(ptr_a)
    va = S.vload(ptr_a, mask="3T5F")
    va = S.vload(ptr_a, mask=S.tail_mask(n, 8))
    va = S.vload(ptr_a, mask="1T31F", lanes=32)
    va = S.vload(ptr_a, mask="16T16F", lanes=32, stride=4)
See Also
Zhouyi Compass OpenCL Programming Guide: __vload, __vload_stride
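The masked, strided load described above can be modeled in plain Python as a rough reference (this is not the Compass DSL). Two points are assumptions rather than facts from the text: the mask string is read as run-length encoded ("3T5F" meaning three active lanes followed by five inactive ones, with a missing count meaning 1), and the stride is counted in elements. `parse_mask` and `vload_model` are hypothetical names.

```python
def parse_mask(mask, lanes):
    """Expand a run-length mask string such as "3T5F" into a list of bools.

    Assumption: each run is an optional count followed by T or F, and a
    missing count means 1, so "T7F" is one active lane then seven inactive.
    """
    if mask is None:
        return [True] * lanes
    bits, count = [], ""
    for ch in mask:
        if ch.isdigit():
            count += ch
        else:
            bits.extend([ch == "T"] * (int(count) if count else 1))
            count = ""
    assert len(bits) == lanes, "mask length must equal the lane count"
    return bits


def vload_model(memory, base, lanes, mask=None, stride=1):
    """Load `lanes` elements starting at `base`, one every `stride` elements."""
    active = parse_mask(mask, lanes)
    # Inactive lanes are not read from memory and come back as zero.
    return [memory[base + i * stride] if active[i] else 0 for i in range(lanes)]


memory = list(range(100, 132))
print(vload_model(memory, 0, 8, mask="3T5F"))  # [100, 101, 102, 0, 0, 0, 0, 0]
print(vload_model(memory, 0, 8, stride=4))     # [100, 104, 108, 112, 116, 120, 124, 128]
```

The zero fill for inactive lanes mirrors the statement that inactive elements of the result vector are set to zero.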
- tvm.aipu.script.ir.memory.vstore(value, ptr, mask=None, stride=1)
Store active elements of value to contiguous or strided memory addresses.
The feature Flexible Width Vector is supported.
The feature Multiple Width Vector is supported.
Parameters
- value: PrimExpr
The vector that needs to be stored.
- ptr: Pointer
The pointer that stores the base memory address.
- mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]
The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.
- stride: Optional[Union[PrimExpr, int]]
The stride between stored elements. One element is stored at every stride interval.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    S.vstore(va, ptr_a)
    S.vstore(va, ptr_a, mask="3T5F")
    S.vstore(va, ptr_a, mask=S.tail_mask(n, 8))
    S.vstore(va, ptr_a, mask="T7F", stride=4)
See Also
Zhouyi Compass OpenCL Programming Guide: __vstore, __vstore_stride
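As a rough plain-Python model of the store semantics (not Compass DSL code), the sketch below writes only the active lanes. It assumes that the stride is counted in elements and that memory behind inactive lanes is left untouched, which the wording "store active elements" suggests but does not spell out. `vstore_model` is a hypothetical name.

```python
def vstore_model(memory, value, base, mask=None, stride=1):
    """Store the active lanes of `value` into `memory`, one every `stride` elements."""
    if mask is None:
        mask = [True] * len(value)
    for i, (v, active) in enumerate(zip(value, mask)):
        if active:
            memory[base + i * stride] = v
        # Inactive lanes are skipped, so the old memory content survives.
    return memory


memory = [-1] * 16
vstore_model(memory, [1, 2, 3, 4], base=0, mask=[True, True, False, True], stride=4)
print(memory)  # [1, -1, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1, 4, -1, -1, -1]
```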
- tvm.aipu.script.ir.memory.vload_gather(ptr, indices, mask=None)
Load a vector from discrete memory addresses. The addresses are calculated by ptr + indices * elem_size_in_byte.
The inactive elements of result vector are set to zero.
The feature Flexible Width Vector is supported.
The feature Multiple Width Vector is supported.
Parameters
- ptr: Pointer
The pointer that stores the base memory address.
- indices: PrimExpr
The indices used to calculate the discrete memory addresses. It must be a 16-bit integer vector, and its length determines the length of the result vector.
- mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]
The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.
Returns
- ret: PrimExpr
The result expression.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    va = S.vload_gather(ptr_a, vb)
    va = S.vload_gather(ptr_a, vb, mask="T7F")
    va = S.vload_gather(ptr_a, vb, mask=S.tail_mask(n, 8))
See Also
Zhouyi Compass OpenCL Programming Guide: __vload_gather
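The address rule ptr + indices * elem_size_in_byte means each index is an element offset from the base pointer. A minimal plain-Python model of that rule follows (not Compass DSL code). The list stands in for element-addressed memory, so the byte scaling disappears, and `vload_gather_model` is a hypothetical name.

```python
def vload_gather_model(memory, base, indices, mask=None):
    """Gather memory[base + idx] for every active lane; inactive lanes yield 0."""
    if mask is None:
        mask = [True] * len(indices)
    # The result has as many lanes as there are indices.
    return [memory[base + idx] if active else 0
            for idx, active in zip(indices, mask)]


memory = [10, 11, 12, 13, 14, 15, 16, 17]
print(vload_gather_model(memory, 0, [7, 0, 3, 3]))  # [17, 10, 13, 13]
print(vload_gather_model(memory, 2, [1, 0, 5, 2],
                         mask=[True, False, True, True]))  # [13, 0, 17, 14]
```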
- tvm.aipu.script.ir.memory.vstore_scatter(value, ptr, indices, mask=None)
Store active elements of value to discrete memory addresses. The addresses are calculated by ptr + indices * elem_size_in_byte.
The feature Flexible Width Vector is supported.
The feature Multiple Width Vector is supported.
Parameters
- value: PrimExpr
The vector that needs to be stored.
- ptr: Pointer
The pointer that stores the base memory address.
- indices: PrimExpr
The indices used to calculate the discrete memory addresses. It must be a 16-bit integer vector, and its length should be equal to that of value.
- mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]
The predication mask that indicates which elements of the vector are active for the operation. None means all elements are active.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    S.vstore_scatter(va, ptr_a, vb)
    S.vstore_scatter(va, ptr_a, vb, mask="3T5F")
    S.vstore_scatter(va, ptr_a, vb, mask=S.tail_mask(n, 8))
See Also
Zhouyi Compass OpenCL Programming Guide: __vstore_scatter
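The scatter store is the mirror image of the gather load, and can be modeled the same way in plain Python (not Compass DSL code). The sketch treats memory as an element-addressed list and writes lanes in order. What the hardware does when two active lanes share an index is not stated above, so the model's last-writer-wins behavior is only an assumption. `vstore_scatter_model` is a hypothetical name.

```python
def vstore_scatter_model(memory, value, base, indices, mask=None):
    """Write value[i] to memory[base + indices[i]] for every active lane."""
    if mask is None:
        mask = [True] * len(value)
    for v, idx, active in zip(value, indices, mask):
        if active:
            memory[base + idx] = v
    return memory


memory = [0] * 8
vstore_scatter_model(memory, [1, 2, 3, 4], 0, [6, 1, 4, 2],
                     mask=[True, True, False, True])
print(memory)  # [0, 2, 4, 0, 0, 0, 1, 0]
```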
- tvm.aipu.script.ir.memory.dma_copy(dst, src, width, src_stride=None, times=1, dst_stride=None)
Copy the specified number of elements from the source address to the destination address via DMA.
Parameters
- dst: Pointer
The pointer that stores the destination memory address.
- src: Pointer
The pointer that stores the source memory address.
- width: Union[PrimExpr, int]
The number of elements to be transferred within one stride jump.
- src_stride: Optional[Union[PrimExpr, int]]
The number of source elements to jump over for each stride jump. None means it equals the value of width, i.e., load from the source memory address continuously.
- times: Optional[Union[PrimExpr, int]]
The total number of stride jumps.
- dst_stride: Optional[Union[PrimExpr, int]]
The number of destination elements to jump over for each stride jump. None means it equals the value of width, i.e., store to the destination memory address continuously.
Notes
The pointer types of src and dst must be the same.
Only the following scope combinations of src and dst are not supported.
The scope of src is lsram and the scope of dst is lsram or shared.
The scope of src is shared and the scope of dst is lsram or shared.
Examples
    # The 1D scenario.
    S.dma_copy(ptr_a, ptr_b, 16)

    # The 2D scenario. Transfer all "@" in "@@@@@xxx@@@@@xxx@@@@@xxx@@@@@xxx" and store them
    # continuously in destination.
    S.dma_copy(ptr_a, ptr_b, width=5, src_stride=8, times=4)
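The interplay of width, the two strides, and times in the 2D scenario can be checked with a small plain-Python model (not Compass DSL code). It assumes that each stride is the distance, in elements, between the starts of consecutive rows, which is how the "@@@@@xxx" example reads. `dma_copy_model` is a hypothetical name.

```python
def dma_copy_model(dst, src, width, src_stride=None, times=1, dst_stride=None):
    """Copy `times` rows of `width` elements, stepping each side by its stride."""
    # A stride of None falls back to `width`, i.e. a contiguous access.
    src_stride = width if src_stride is None else src_stride
    dst_stride = width if dst_stride is None else dst_stride
    for t in range(times):
        for k in range(width):
            dst[t * dst_stride + k] = src[t * src_stride + k]
    return dst


src = list(range(32))  # 4 rows of 8 elements; the first 5 of each row are wanted
dst = [None] * 20
dma_copy_model(dst, src, width=5, src_stride=8, times=4)
print(dst)  # [0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 16, 17, 18, 19, 20, 24, 25, 26, 27, 28]
```

Elements 5 to 7 of each source row play the role of the "x" characters and are skipped, while the destination is filled without gaps.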
- tvm.aipu.script.ir.memory.async_dma_copy(dst, src, width, src_stride=None, times=1, dst_stride=None, event=None)
Copy the specified number of elements from the source address to the destination address asynchronously via DMA.
Once the DMA configuration is finished, this API returns and the code that follows it is executed immediately, while the data is still being transferred via DMA.
Parameters
- dst: Pointer
The pointer that stores the destination memory address.
- src: Pointer
The pointer that stores the source memory address.
- width: Union[PrimExpr, int]
The number of elements to be transferred within one stride jump.
- src_stride: Optional[Union[PrimExpr, int]]
The number of source elements to jump over for each stride jump. None means it equals the value of width, i.e., load from the source memory address continuously.
- times: Optional[Union[PrimExpr, int]]
The total number of stride jumps.
- dst_stride: Optional[Union[PrimExpr, int]]
The number of destination elements to jump over for each stride jump. None means it equals the value of width, i.e., store to the destination memory address continuously.
- event: PrimExpr
The event to be triggered when the entire data transmission is completed. Note that if the event is in use by others, the DMA hardware is blocked until the event is triggered by them, and only then does the data transmission start. The API S.wait_events can be used to wait for the DMA operation to finish.
Notes
The pointer types of src and dst must be the same.
Only the following scope combinations of src and dst are not supported.
The scope of src is lsram and the scope of dst is lsram or shared.
The scope of src is shared and the scope of dst is lsram or shared.
Examples
    ev0 = S.alloc_events(1)

    # The 1D scenario.
    S.async_dma_copy(ptr_a, ptr_b, 16, event=ev0)
    vc = va + vb
    S.wait_events(ev0)

    # The 2D scenario. Transfer all "@" in "@@@@@xxx@@@@@xxx@@@@@xxx@@@@@xxx" and store them
    # continuously in destination.
    S.async_dma_copy(ptr_a, ptr_b, width=5, src_stride=8, times=4, event=ev0)
    vc = va + vb
    S.wait_events(ev0)
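The start, overlap, wait pattern in the example has a loose host-side analogy in Python's standard threading module. The sketch below is only that analogy: a worker thread stands in for the DMA engine and a `threading.Event` stands in for the event, and it does not model the case where the event is already in use by others. `async_copy_model` is a hypothetical name.

```python
import threading


def async_copy_model(dst, src, n, event):
    """Start copying n elements on a worker thread and return immediately."""
    def worker():
        dst[:n] = src[:n]
        event.set()  # Signal completion, like the DMA event being triggered.

    threading.Thread(target=worker).start()


src = list(range(16))
dst = [0] * 16
done = threading.Event()

async_copy_model(dst, src, 16, done)
total = sum(x * x for x in range(1000))  # Independent work overlaps the copy.
done.wait()                              # Plays the role of S.wait_events(ev0).
print(dst == src)  # True
```

As in the DSL example, the destination data must only be read after the wait returns.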
- tvm.aipu.script.ir.memory.dma_transpose2d(dst, src, row, col, dst_stride=None, src_stride=None)
Transposes 2D data via DMA.
Parameters
- dst: Pointer
The pointer that stores the destination memory address.
- src: Pointer
The pointer that stores the source memory address.
- row: Union[PrimExpr, int]
The number of rows of the input 2D data [row, col].
- col: Union[PrimExpr, int]
The number of columns of the input 2D data [row, col].
- dst_stride: Optional[Union[PrimExpr, int]]
The width stride of the output 2D data. Defaults to row.
- src_stride: Optional[Union[PrimExpr, int]]
The width stride of the input 2D data. Defaults to col.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    S.dma_transpose2d(dst, src, 8, 64)
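How the two strides enter the transpose can be sketched in plain Python (not Compass DSL code). The model assumes that each width stride is the distance, in elements, between consecutive rows of the respective 2D data, which matches the stated defaults of row and col. `dma_transpose2d_model` is a hypothetical name.

```python
def dma_transpose2d_model(dst, src, row, col, dst_stride=None, src_stride=None):
    """Transpose a [row, col] source into a [col, row] destination."""
    dst_stride = row if dst_stride is None else dst_stride
    src_stride = col if src_stride is None else src_stride
    for r in range(row):
        for c in range(col):
            # Source element [r, c] lands at destination element [c, r].
            dst[c * dst_stride + r] = src[r * src_stride + c]
    return dst


# A 2x3 source stored with a row pitch of 4; -1 marks padding that is skipped.
src = [1, 2, 3, -1,
       4, 5, 6, -1]
dst = [0] * 6
dma_transpose2d_model(dst, src, row=2, col=3, src_stride=4)
print(dst)  # [1, 4, 2, 5, 3, 6]
```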
- tvm.aipu.script.ir.memory.dma_upsample(dst, src, h_scale, w_scale, c, w, src_c_stride=None, dst_c_stride=None, dst_w_stride=None)
Upsamples data from the source src to the destination dst via DMA. Two directions between src and dst are supported:
1. global -> [lsram, shared].
2. [lsram, shared] -> global.
Note:
1. Each call of dma_upsample physically operates on a single 2D input surface in WC layout, not on a 3D input.
2. To upsample a 3D input with h_scale and w_scale for the H and W dimensions respectively, call dma_upsample H times, where H is the H dimension of the input in HWC layout.
Parameters
- dst: Pointer
The pointer that stores the destination memory address.
- src: Pointer
The pointer that stores the source memory address.
- h_scale: Union[PrimExpr, int]
The scale in the h direction.
- w_scale: Union[PrimExpr, int]
The scale in the w direction.
- c: Union[PrimExpr, int]
The c of each move on the source.
- w: Union[PrimExpr, int]
The w of each move on the source.
- src_c_stride: Optional[Union[PrimExpr, int]]
The c stride of each move on the source. Defaults to c.
- dst_c_stride: Optional[Union[PrimExpr, int]]
The c stride of each move on the destination. Defaults to c.
- dst_w_stride: Optional[Union[PrimExpr, int]]
The w stride of each move on the destination. Defaults to w_scale * w.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
    # Case0: an easy use case
    S.dma_upsample(dst=ddr_ptr, src=sram_ptr, h_scale=2, w_scale=3, c=3, w=2)
    # Source 2D data in WC layout(w=2, c=3)
    # [
    #   [0 1 2],
    #   [3 4 5],
    # ]
    # Upsampled destination 3D data in HWC layout
    # [
    #   [
    #     [0 1 2],
    #     [0 1 2],
    #     [0 1 2],
    #
    #     [3 4 5],
    #     [3 4 5],
    #     [3 4 5],
    #   ],
    #   [
    #     [0 1 2],
    #     [0 1 2],
    #     [0 1 2],
    #
    #     [3 4 5],
    #     [3 4 5],
    #     [3 4 5],
    #   ],
    # ]

    # Case1: a comprehensive use case
    S.dma_upsample(dst=ddr_ptr, src=sram_ptr, h_scale=2, w_scale=3, c=3, w=2,
                   src_c_stride=4, dst_c_stride=5, dst_w_stride=7)
    # Source 2D data in WC layout(c=3, w=2, src_c_stride=4)
    # [
    #   [0 1 2 ?],
    #   [3 4 5 ?],
    # ]
    # Upsampled destination 3D data in HWC layout(dst_w_stride=7, dst_c_stride=5)
    # [
    #   [
    #     [0 1 2 ? ?],
    #     [0 1 2 ? ?],
    #     [0 1 2 ? ?],
    #
    #     [3 4 5 ? ?],
    #     [3 4 5 ? ?],
    #     [3 4 5 ? ?],
    #
    #     [? ? ? ? ?],
    #   ],
    #   [
    #     [0 1 2 ? ?],
    #     [0 1 2 ? ?],
    #     [0 1 2 ? ?],
    #
    #     [3 4 5 ? ?],
    #     [3 4 5 ? ?],
    #     [3 4 5 ? ?],
    #
    #     [? ? ? ? ?],
    #   ],
    # ]
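The two cases above can be reproduced with a small plain-Python model (not Compass DSL code). The stride units are an interpretation of Case1 rather than something stated outright: src_c_stride and dst_c_stride are taken as the element pitch of one pixel, and dst_w_stride as the number of pixels per destination row. `dma_upsample_model` is a hypothetical name.

```python
def dma_upsample_model(dst, src, h_scale, w_scale, c, w,
                       src_c_stride=None, dst_c_stride=None, dst_w_stride=None):
    """Replicate each source pixel w_scale times along W and h_scale times along H."""
    src_c_stride = c if src_c_stride is None else src_c_stride
    dst_c_stride = c if dst_c_stride is None else dst_c_stride
    dst_w_stride = w_scale * w if dst_w_stride is None else dst_w_stride
    for hh in range(h_scale):
        for wi in range(w):
            for ws in range(w_scale):
                out_w = wi * w_scale + ws
                dst_base = (hh * dst_w_stride + out_w) * dst_c_stride
                src_base = wi * src_c_stride
                for ci in range(c):
                    dst[dst_base + ci] = src[src_base + ci]
    return dst


# Case0: w=2, c=3, h_scale=2, w_scale=3 with default strides.
src = [0, 1, 2, 3, 4, 5]
dst = [None] * (2 * 6 * 3)
dma_upsample_model(dst, src, h_scale=2, w_scale=3, c=3, w=2)
print(dst[:18])  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5]
print(dst[18:] == dst[:18])  # True
```

Entries that are never written stay at their initial value, which corresponds to the "?" positions in Case1.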
- tvm.aipu.script.ir.memory.dma_memset(ptr, value, num)
Fills the first num elements starting at ptr with the specified value.
Parameters
- ptr: Pointer
The pointer that stores the base memory address.
- value: Union[PrimExpr, int]
The value to be set. It must have the same dtype as the data that ptr points to.
- num: Union[PrimExpr, int]
The number of scalar elements that need to be set.
Supported DType
“int8/16/32”, “uint8/16/32”, “float16/32”.
Examples
S.dma_memset(ptr_a, 0, 128)
- tvm.aipu.script.ir.memory.flush_cache(invalidate=True)
Flushes the whole level 1 data cache by writing back to DDR.
Parameters
- invalidate: bool
Whether to invalidate the cached data or not.
Examples
    S.flush_cache()
    S.flush_cache(invalidate=False)
See Also
Zhouyi Compass OpenCL Programming Guide: __flush_cache