Arithmetic

The arithmetic part of the IR APIs.

tvm.aipu.script.ir.arithmetic.vadd(x, y, mask=None, saturate=False, out_sign=None, r=None)

Computes the addition on active elements of x with the corresponding elements of y.

  • The inactive elements of result vector are determined by r.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: 1  2  3  4  5  6   7   8
   y: 1  2  3  4  5  6   7   8
mask: T  T  T  T  F  F   T   T
   z: 9  8  7  6  4  3   2   1

 out = S.vadd(x, y, mask, r=z)
 out: 2  4  6  8  4  3  14  16

Parameters

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

saturate: Optional[bool]

Whether the result needs to be saturated or not.

out_sign: Optional[str]

Specifies whether the output is signed or unsigned. It is only needed for integer operations. None means the same as the operands, in which case the operands must have the same sign; "u" means unsigned and "s" means signed.

r: Optional[Union[PrimExpr, int, float]]

Provides the values of the inactive elements in the result vector. If it is a scalar, it will be automatically broadcast. None means the inactive elements of the result vector are undefined.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

vc = S.vadd(va, vb)
vc = S.vadd(va, 3)
vc = S.vadd(va, vb, saturate=True)
vc = S.vadd(va, vb, out_sign="u")
vc = S.vadd(va, vb, mask=S.tail_mask(n, 8))
vc = S.vadd(va, vb, mask="3T5F", r=vb)
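
The masking and r behavior shown in the diagram above can be modelled in plain Python. This is only a reference sketch of the documented semantics, not the Compass implementation: vadd_ref and its list-based vectors are hypothetical, undefined lanes are represented as None, and the saturation range assumes int8 lanes. Non-saturating overflow is not modelled.

```python
def vadd_ref(x, y, mask=None, saturate=False, r=None, lo=-128, hi=127):
    """Reference model of a masked element-wise add (assumed int8 range)."""
    n = len(x)
    # Scalars are broadcast to full vectors, as the parameter text describes.
    if not isinstance(y, list):
        y = [y] * n
    if r is not None and not isinstance(r, list):
        r = [r] * n
    mask = [True] * n if mask is None else mask
    out = []
    for i in range(n):
        if mask[i]:
            s = x[i] + y[i]
            if saturate:
                s = max(lo, min(hi, s))  # clamp to the assumed int8 range
            out.append(s)
        else:
            # Inactive lane: value comes from r, or is undefined (None).
            out.append(None if r is None else r[i])
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
mask = [True, True, True, True, False, False, True, True]
z = [9, 8, 7, 6, 4, 3, 2, 1]
print(vadd_ref(x, x, mask, r=z))  # [2, 4, 6, 8, 4, 3, 14, 16]
```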

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vadd, __vadds

tvm.aipu.script.ir.arithmetic.vaddh(x, y)

Adds every two adjacent elements within vector x and within vector y, then concatenates the results: the sums from x are placed in the lower half of the result and the sums from y in the upper half.

  • The feature Multiple Width Vector is supported.

  x:  9   1   8   2   7   3  6  4
  y:  9   4   8   3   7   2  6  1

out = S.vaddh(x, y)
out: 10  10  10  10  13  11  9  7

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”.

Examples

vc = S.vaddh(va, vb)
vc = S.vaddh(va, 3)
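
The pairwise layout in the diagram above can be checked with a small plain-Python model. The name vaddh_ref is hypothetical and the sketch ignores overflow; it only illustrates where each sum lands.

```python
def vaddh_ref(x, y):
    """Reference model: pairwise sums of x fill the low half, those of y the high half."""
    def pair_sums(v):
        # Sum elements (0, 1), (2, 3), ... of one input vector.
        return [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return pair_sums(x) + pair_sums(y)


x = [9, 1, 8, 2, 7, 3, 6, 4]
y = [9, 4, 8, 3, 7, 2, 6, 1]
print(vaddh_ref(x, y))  # [10, 10, 10, 10, 13, 11, 9, 7]
```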

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vaddh

tvm.aipu.script.ir.arithmetic.vabs(x, mask=None, saturate=False)

Computes the absolute value of every active element of vector x.

  • The inactive elements of result vector are undefined.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: 1  -2  -3  4  -5  6  -127  -128
mask: T   T   T  T   F  F    T     T

 out = S.vabs(x, mask)
 out: 1   2   3  4   ?  ?   127  -128

 out = S.vabs(x, mask, saturate=True)
 out: 1   2   3  4   ?  ?   127   127

Parameters

x: PrimExpr

The operand, a vector.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

saturate: Optional[bool]

Whether the result needs to be saturated or not.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

vc = S.vabs(va)
vc = S.vabs(va, saturate=True)
vc = S.vabs(va, mask="3T5F")
vc = S.vabs(va, mask=S.tail_mask(n, 8))
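
The effect of saturate on the most negative value, as in the diagram above, can be sketched in plain Python. The function vabs_ref is hypothetical, assumes signed integer lanes (int8 by default, matching the diagram), and uses None for undefined lanes.

```python
def vabs_ref(x, mask=None, saturate=False, bits=8):
    """Reference model of masked absolute value on signed integer lanes."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    mask = [True] * len(x) if mask is None else mask
    out = []
    for v, active in zip(x, mask):
        if not active:
            out.append(None)  # inactive lanes are undefined
        elif v == lo:
            # abs(lo) is not representable: saturate to hi, otherwise it wraps to lo.
            out.append(hi if saturate else lo)
        else:
            out.append(abs(v))
    return out


x = [1, -2, -3, 4, -5, 6, -127, -128]
mask = [True, True, True, True, False, False, True, True]
print(vabs_ref(x, mask))                 # [1, 2, 3, 4, None, None, 127, -128]
print(vabs_ref(x, mask, saturate=True))  # [1, 2, 3, 4, None, None, 127, 127]
```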

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vabs, __vabss

tvm.aipu.script.ir.arithmetic.vsub(x, y, mask=None, saturate=False, out_sign=None, r=None)

Computes the subtraction on active elements of x with the corresponding elements of y.

  • The inactive elements of result vector are determined by r.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: 1  2  3  4  5  6  7  8
   y: 1  1  1  1  2  2  2  2
mask: T  T  T  T  F  F  T  T
   z: 9  8  7  6  4  3  2  1

 out = S.vsub(x, y, mask, r=z)
 out: 0  1  2  3  4  3  5  6

Parameters

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

saturate: Optional[bool]

Whether the result needs to be saturated or not.

out_sign: Optional[str]

Specifies whether the output is signed or unsigned. It is only needed for integer operations. None means the same as the operands, in which case the operands must have the same sign; "u" means unsigned and "s" means signed.

r: Optional[Union[PrimExpr, int, float]]

Provides the values of the inactive elements in the result vector. If it is a scalar, it will be automatically broadcast. None means the inactive elements of the result vector are undefined.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

vc = S.vsub(va, vb)
vc = S.vsub(va, 3)
vc = S.vsub(va, vb, saturate=True)
vc = S.vsub(va, vb, out_sign="u")
vc = S.vsub(va, vb, mask=S.tail_mask(n, 8))
vc = S.vsub(va, vb, mask="3T5F", r=vb)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vsub, __vsubs

tvm.aipu.script.ir.arithmetic.vsubh(x, y)

Subtracts the second element from the first in every pair of adjacent elements within vector x and within vector y, then concatenates the results: the differences from x are placed in the lower half of the result and the differences from y in the upper half.

  • The feature Multiple Width Vector is supported.

  x: 9  1  8  2  7  3  6  4
  y: 9  4  8  3  7  2  6  1

out = S.vsubh(x, y)
out: 8  6  4  2  5  5  5  5

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”.

Examples

vc = S.vsubh(va, vb)
vc = S.vsubh(va, 3)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vsubh

tvm.aipu.script.ir.arithmetic.vmul(x, y, mask=None, out_sign=None, r=None)

Computes the multiplication on active elements of x with the corresponding elements of y.

  • The inactive elements of result vector are determined by r.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: 1  2  3   4  5  6   7   8
   y: 1  2  3   4  5  6   7   8
mask: T  T  T   T  F  F   T   T
   z: 9  8  7   6  4  3   2   1

 out = S.vmul(x, y, mask, r=z)
 out: 1  4  9  16  4  3  49  64

Parameters

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

out_sign: Optional[str]

Specifies whether the output is signed or unsigned. It is only needed for integer operations. None means the same as the operands, in which case the operands must have the same sign; "u" means unsigned and "s" means signed.

r: Optional[Union[PrimExpr, int, float]]

Provides the values of the inactive elements in the result vector. If it is a scalar, it will be automatically broadcast. None means the inactive elements of the result vector are undefined.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int16/32”, “uint16/32”, “float16/32”.

Examples

vc = S.vmul(va, vb)
vc = S.vmul(va, 3)
vc = S.vmul(va, vb, out_sign="u")
vc = S.vmul(va, vb, mask="3T5F")
vc = S.vmul(va, vb, mask=S.tail_mask(n, 8))
vc = S.vmul(va, vb, mask="T7F", r=vb)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmul, __vmulh

tvm.aipu.script.ir.arithmetic.vmull(x, y, mask=None, out_sign=None, r=None)

Computes the multiplication on active elements in the low half of x with the corresponding elements of y. The result elements are widened: 8-bit to 16-bit, or 16-bit to 32-bit.

  • The inactive elements of result vector are determined by r.

 x(i16x16): 1  2  3  4  5  6   7  8  9  0  1  2   3  4   5  6
 y(i16x16): 1  2  3  4  5  6   7  8  9  0  1  2   3  4   5  6
      mask: T  T  T  T  F  F   T  T  T  T  T  T   F  F   F  F
  z(i32x8): 9     8     7      6     5     4      3      2

out = S.vmull(x, y, mask, r=z)
out(i32x8): 1     4     9     16     5     4     49     64

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

out_sign: Optional[str]

Specifies whether the output is signed or unsigned. It is only needed for integer operations. None means the same as the operands, in which case the operands must have the same sign; "u" means unsigned and "s" means signed.

r: Optional[Union[PrimExpr, int, float]]

Provides the values of the inactive elements in the result vector. If it is a scalar, it will be automatically broadcast. None means the inactive elements of the result vector are undefined.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”.

Examples

vc = S.vmull(va, vb)
vc = S.vmull(va, 3)
vc = S.vmull(va, vb, out_sign="u")
vc = S.vmull(va, vb, mask="3T5F")
vc = S.vmull(va, vb, mask=S.tail_mask(n, 8))
vc = S.vmull(va, vb, mask="T7F", r=vb)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmul

tvm.aipu.script.ir.arithmetic.vmulh(x, y, mask=None, out_sign=None, r=None)

Computes the multiplication on active elements in the high half of x with the corresponding elements of y. The result elements are widened: 8-bit to 16-bit, or 16-bit to 32-bit.

  • The inactive elements of result vector are determined by r.

 x(i16x16):  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6
 y(i16x16):  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6
      mask:  T  T  T  T  F  F  T  T  T  T  T  T  F  F  F  F
  z(i32x8):  9     8     7     6     5     4     3     2

out = S.vmulh(x, y, mask, r=z)
out(i32x8): 81     0     1     4     5     4     3     2

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

out_sign: Optional[str]

Specifies whether the output is signed or unsigned. It is only needed for integer operations. None means the same as the operands, in which case the operands must have the same sign; "u" means unsigned and "s" means signed.

r: Optional[Union[PrimExpr, int, float]]

Provides the values of the inactive elements in the result vector. If it is a scalar, it will be automatically broadcast. None means the inactive elements of the result vector are undefined.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”.

Examples

vc = S.vmulh(va, vb)
vc = S.vmulh(va, 3)
vc = S.vmulh(va, vb, out_sign="u")
vc = S.vmulh(va, vb, mask="3T5F")
vc = S.vmulh(va, vb, mask=S.tail_mask(n, 8))
vc = S.vmulh(va, vb, mask="T7F", r=vb)
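
The low-half and high-half selection of vmull and vmulh can be modelled with one plain-Python helper. This is a sketch of the behavior shown in the two diagrams, not the actual implementation; vmul_half_ref and its high flag are hypothetical. It assumes the mask is indexed by input lane and r by output lane, as the diagrams suggest.

```python
def vmul_half_ref(x, y, mask=None, r=None, high=False):
    """Reference model of a widening multiply on the low or high half of x and y."""
    half = len(x) // 2
    start = half if high else 0
    mask = [True] * len(x) if mask is None else mask
    out = []
    for i in range(half):
        lane = start + i
        if mask[lane]:
            # Python ints do not overflow, which mirrors the widened result type.
            out.append(x[lane] * y[lane])
        else:
            out.append(None if r is None else r[i])
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6]
mask = [c == "T" for c in "TTTTFFTTTTTTFFFF"]
z = [9, 8, 7, 6, 5, 4, 3, 2]
print(vmul_half_ref(x, x, mask, r=z))             # [1, 4, 9, 16, 5, 4, 49, 64]
print(vmul_half_ref(x, x, mask, r=z, high=True))  # [81, 0, 1, 4, 5, 4, 3, 2]
```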

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmulh

tvm.aipu.script.ir.arithmetic.vdiv(x, y, mask=None)

Computes the division on active elements of x with the corresponding elements of y.

  • The inactive elements of result vector are undefined.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: 1  4  9  16  25  36  49  64
   y: 1  2  3   4   5   6   7   8
mask: T  T  F   F   T   T   T   T

 out = S.vdiv(x, y, mask)
 out: 1  2  ?   ?   5   6   7   8

Parameters

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float32”.

Examples

vc = S.vdiv(va, vb)
vc = S.vdiv(va, 3)
vc = S.vdiv(va, vb, mask="3T5F")
vc = S.vdiv(va, vb, mask=S.tail_mask(n, 8))

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vdiv

tvm.aipu.script.ir.arithmetic.vmod(x, y, mask=None)

Computes the remainder on active elements of x with the corresponding elements of y.

  • The inactive elements of result vector are undefined.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x:  5   6  7  8  -5  -6  -7  -8
   y: -2  -5  4  3   3   4  -3  -4
mask:  T   T  T  F   F   T   T   T

 out = S.vmod(x, y, mask)
 out:  1   1  3  ?   ?  -2  -1   0

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”.

Examples

vc = S.vmod(va, vb)
vc = S.vmod(va, 3)
vc = S.vmod(va, vb, mask="3T5F")
vc = S.vmod(va, vb, mask=S.tail_mask(n, 8))
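
In the diagram above the remainder takes the sign of x (for example, -6 mod 4 gives -2), which differs from Python's % operator (-6 % 4 is 2). The following plain-Python sketch reproduces the diagram under that reading. The name vmod_ref is hypothetical, and division by zero is not handled.

```python
def vmod_ref(x, y, mask=None):
    """Reference model of a masked remainder whose sign follows the dividend."""
    mask = [True] * len(x) if mask is None else mask
    out = []
    for a, b, active in zip(x, y, mask):
        if not active:
            out.append(None)  # inactive lanes are undefined
            continue
        # Remainder of the magnitudes, then restore the sign of the dividend.
        rem = abs(a) % abs(b)
        out.append(-rem if a < 0 else rem)
    return out


x = [5, 6, 7, 8, -5, -6, -7, -8]
y = [-2, -5, 4, 3, 3, 4, -3, -4]
mask = [True, True, True, False, False, True, True, True]
print(vmod_ref(x, y, mask))  # [1, 1, 3, None, None, -2, -1, 0]
```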

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmod

tvm.aipu.script.ir.arithmetic.vdot(x, y, mask=None)

Computes the dot product of every two adjacent elements of x with the corresponding elements of y. The elements of the result vector are saturated.

  • The inactive elements of result vector are undefined.

    x(i16x16): x0  x1  x2  x3  x4  x5  x6  x7       ...  x14  x15
    y(i16x16): y0  y1  y2  y3  y4  y5  y6  y7       ...  y14  y15
mask(boolx16):  F   F   T   F   F   T   T   T       ...   F    T

   out = S.vdot(x, y, mask)
   out(i32x8):  ?      x2*y2   x5*y5   x6*y6+x7*y7  ...  x15*y15

Parameters

x, y: Union[PrimExpr, int]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

# Only integer cases are supported:
case  result dtype  x.dtype   y.dtype
1     "int16"       "int8"    "int8"
2     "int16"       "int8"    "uint8"
3     "int16"       "uint8"   "int8"
4     "uint16"      "uint8"   "uint8"

5     "int32"       "int16"   "int16"
6     "int32"       "int16"   "uint16"
7     "int32"       "uint16"  "int16"
8     "uint32"      "uint16"  "uint16"

Examples

out0 = S.vdot(x, y)
out1 = S.vdot(x, 3, mask)
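
The pairwise grouping, masking and saturation can be sketched in plain Python. The function vdot_ref is a hypothetical reference model, not the Compass implementation. It assumes an int32 result (cases 5 to 7 in the table above) and treats a pair with no active lane as undefined (None), as the first result element of the diagram suggests.

```python
def vdot_ref(x, y, mask=None, lo=-2**31, hi=2**31 - 1):
    """Reference model of a saturating 2-lane dot product (assumed int32 result)."""
    n = len(x)
    if not isinstance(y, list):
        y = [y] * n  # broadcast a scalar operand
    mask = [True] * n if mask is None else mask
    out = []
    for g in range(0, n, 2):
        lanes = [i for i in (g, g + 1) if mask[i]]
        if not lanes:
            out.append(None)  # no active lane in this pair
        else:
            total = sum(x[i] * y[i] for i in lanes)
            out.append(max(lo, min(hi, total)))  # saturate the result
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
mask = [c == "T" for c in "FFTFFTTT"]
print(vdot_ref(x, x, mask))  # [None, 9, 36, 113]
```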

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vdot

tvm.aipu.script.ir.arithmetic.vqdot(x, y, mask=None)

Computes the dot product of every four adjacent elements of x with the corresponding elements of y. The elements of the result vector are saturated.

  • The inactive elements of result vector are undefined.

     x(i8x32): x0  x1  x2  x3  x4  x5  x6  x7           ...  x28  x29  x30  x31
     y(i8x32): y0  y1  y2  y3  y4  y5  y6  y7           ...  y28  y29  y30  y31
mask(boolx32):  T   F   F   T   T   T   T   T           ...   F    F    F    T

   out = S.vqdot(x, y, mask)
   out(i32x8): x0*y0+x3*y3     x4*y4+x5*y5+x6*y6+x7*y7  ...  x31*y31


   x(fp16x16): x0  x1  x2  x3  x4  x5                  x6  x7  ...  x14  x15
   y(fp16x16): y0  y1  y2  y3  y4  y5                  y6  y7  ...  y14  y15
mask(boolx16):  F   F   T   F   T   T                   T   T  ...   F    T

  out = S.vqdot(x, y, mask)
  out(fp32x8): x2*y2    ?      x4*y4+x5*y5+x6*y6+x7*y7  ?      ...   ?
  # For float, results are stored at even indices; the values at odd indices are undefined.

Parameters

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

Supported integer cases:
case  result dtype  x.dtype    y.dtype
1     "int32"       "int8"     "int8"
2     "int32"       "int8"     "uint8"
3     "int32"       "uint8"    "int8"
4     "uint32"      "uint8"    "uint8"

Supported floating cases:
case  result dtype  x.dtype    y.dtype
1     "float32"     "float16"  "float16"

Examples

out0 = S.vqdot(x, y)
out1 = S.vqdot(x, 3, mask)
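
The difference between the integer and float result layouts can be illustrated with a plain-Python model. The function vqdot_ref and its is_float flag are hypothetical. Saturation is not modelled, and a group with no active lane is assumed to be undefined (None).

```python
def vqdot_ref(x, y, mask=None, is_float=False):
    """Reference model of a 4-lane dot product and its result layout."""
    mask = [True] * len(x) if mask is None else mask
    out = []
    for g in range(0, len(x), 4):
        lanes = [i for i in range(g, g + 4) if mask[i]]
        out.append(sum(x[i] * y[i] for i in lanes) if lanes else None)
        if is_float:
            out.append(None)  # odd result indices are undefined for float
    return out


mask = [c == "T" for c in "FFTFTTTT"]
xi = [1, 2, 3, 4, 5, 6, 7, 8]
xf = [float(v) for v in xi]
print(vqdot_ref(xi, [1] * 8, mask))                   # [3, 26]
print(vqdot_ref(xf, [1.0] * 8, mask, is_float=True))  # [3.0, None, 26.0, None]
```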

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vqdot

tvm.aipu.script.ir.arithmetic.vdpa(acc, x, y, mask=None)

Multiplies the active elements of x with the corresponding elements of y, sums every two adjacent products, and adds each sum to the corresponding element of acc.

   acc(i32x8): a0      a1              a2        ...  a7
    x(i16x16): x0  x1  x2  x3          x4  x5    ...  x14  x15
    y(i16x16): y0  y1  y2  y3          y4  y5    ...  y14  y15
mask(boolx16):  F   F   T   T           T   F    ...   F    T

   out = S.vdpa(acc, x, y, mask)
   out(i32x8): a0      a1+x2*y2+x3*y3  a2+x4*y4  ...  a7+x15*y15

Parameters

acc: PrimExpr

The accumulator register. It must be initialized.

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

# Only integer cases are supported:
case  acc.dtype  x.dtype   y.dtype
1     "int16"    "int8"    "int8"
2     "int16"    "int8"    "uint8"
3     "int16"    "uint8"   "int8"
4     "uint16"   "uint8"   "uint8"

5     "int32"    "int16"   "int16"
6     "int32"    "int16"   "uint16"
7     "int32"    "uint16"  "int16"
8     "uint32"   "uint16"  "uint16"

Examples

acc = S.int32x8(0)
out = S.vdpa(acc, x, y)

acc = S.int32x8(0)
out = S.vdpa(acc, x, y, mask)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vdpa

tvm.aipu.script.ir.arithmetic.vqdpa(acc, x, y, mask=None)

Multiplies the active elements of x with the corresponding elements of y, sums every four adjacent products, and adds each sum to the corresponding element of acc.

   acc(i32x8): a0              a1              a2                ...   a7
     x(i8x32): x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11  ...  x28  x29  x30  x31
     y(i8x32): y0  y1  y2  y3  y4  y5  y6  y7  y8  y9  y10  y11  ...  y28  y29  y30  y31
mask(boolx32):  T   F   F   F   T   F   T   F   F   T   F    F   ...   T    F    F    T

   out = S.vqdpa(acc, x, y, mask)
   out(i32x8): a0+x0*y0        a1+x4*y4+x6*y6  a2+x9*y9          ...  a7+x28*y28+x31*y31


  acc(fp32x8): a0              a1       a2                          a3      ...   a7
   x(fp16x16): x0  x1          x2   x3  x4  x5                      x6  x7  ...  x14  x15
   y(fp16x16): y0  y1          y2   y3  y4  y5                      y6  y7  ...  y14  y15
mask(boolx16):  T   F           F    T   T   T                       T   T  ...   F    T

  out = S.vqdpa(acc, x, y, mask)
  out(fp32x8): a0+x0*y0+x3*y3  a1       a2+x4*y4+x5*y5+x6*y6+x7*y7  a3      ...   a7
  # For float, results are stored at even indices; odd indices keep the value of "acc".

Parameters

acc: PrimExpr

The accumulator register. It must be initialized.

x, y: Union[PrimExpr, int, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

Supported integer cases:
case  acc.dtype  x.dtype    y.dtype
1     "int32"    "int8"     "int8"
2     "int32"    "int8"     "uint8"
3     "int32"    "uint8"    "int8"
4     "uint32"   "uint8"    "uint8"

Supported floating cases:
case  acc.dtype  x.dtype    y.dtype
1     "float32"  "float16"  "float16"

Examples

acc = S.int32x8(0)
out = S.vqdpa(acc, x, y)

acc = S.int32x8(0)
out = S.vqdpa(acc, x, y, mask)
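
The accumulation pattern of vqdpa, and of vdpa with a group of two, can be sketched in plain Python. The helper dpa_ref and its group and stride parameters are hypothetical: stride=2 models the float layout from the diagram above, where results go to even indices and odd indices keep acc. Overflow handling is not modelled.

```python
def dpa_ref(acc, x, y, mask=None, group=4, stride=1):
    """Reference model: add the active products of each lane group to acc."""
    mask = [True] * len(x) if mask is None else mask
    out = list(acc)  # elements that receive no product keep their acc value
    for j in range(len(x) // group):
        for i in range(j * group, (j + 1) * group):
            if mask[i]:
                out[j * stride] += x[i] * y[i]
    return out


x = list(range(1, 13))
mask = [c == "T" for c in "TFFFTFTFFTFF"]
print(dpa_ref([10, 20, 30], x, [1] * 12, mask))  # [11, 32, 40]
# Float layout: two groups of four lanes, results at indices 0 and 2.
print(dpa_ref([1.0, 2.0, 3.0, 4.0], [1.0] * 8, [0.5] * 8, stride=2))  # [3.0, 2.0, 5.0, 4.0]
```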

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vqdpa

tvm.aipu.script.ir.arithmetic.vrpadd(x, mask=None)

Computes the sum of all active elements of x and places the result in the lowest element of the result vector.

  • The remaining upper elements of result vector are undefined.

   x:  0  1  2  3  4  5  6  7
mask:  T  T  T  T  T  T  F  T

 out = S.vrpadd(x, mask)
 out: 22  ?  ?  ?  ?  ?  ?  ?

Parameters

x: PrimExpr

The operand.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

out = S.vrpadd(x)
out = S.vrpadd(x, mask)
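
A minimal plain-Python model of the reduction, assuming list-based vectors and using None for the undefined upper elements. The name vrpadd_ref is hypothetical.

```python
def vrpadd_ref(x, mask=None):
    """Reference model: sum of the active lanes in element 0, the rest undefined."""
    mask = [True] * len(x) if mask is None else mask
    total = sum(v for v, active in zip(x, mask) if active)
    return [total] + [None] * (len(x) - 1)


x = [0, 1, 2, 3, 4, 5, 6, 7]
mask = [True, True, True, True, True, True, False, True]
print(vrpadd_ref(x, mask))  # [22, None, None, None, None, None, None, None]
```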

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vrpadd

tvm.aipu.script.ir.arithmetic.vmml(ptr, x, y)

Performs a mixed-precision 4x4 matrix multiplication of the float16x16 vectors x (row-major) and y (column-major). The result pointer ptr is the address of a row-major float32x16. The behavior is the same as ptr[:?] = matrix_multiply(x, y).

                      y(fp16x16):  y0   y4   y8  y12
                                   y1   y5   y9  y13
                                   y2   y6  y10  y14
                                   y3   y7  y11  y15

x(fp16x16):  x0   x1   x2   x3     a0   a1   a2   a3 :ptr(fp32x16)
             x4   x5   x6   x7     a4   a5   a6   a7
             x8   x9  x10  x11     a8   a9  a10  a11
            x12  x13  x14  x15    a12  a13  a14  a15

S.vmml(ptr, x, y)

# Detailed computation for each result element:
a0 = x0*y0 + x1*y1 + x2*y2 + x3*y3
a1 = x0*y4 + x1*y5 + x2*y6 + x3*y7
...
a9 = x8*y4 + x9*y5 + x10*y6 + x11*y7
...
a15 = x12*y12 + x13*y13 + x14*y14 + x15*y15

Parameters

ptr: Pointer

The pointer holding the memory address where the result will be stored. It can be a scalar or vector float32 pointer; the memory it points to must be large enough to hold a row-major 4x4 float32 matrix.

x: PrimExpr

The operand x, a float16x16 vector representing a 4x4 fp16 matrix in row-major order.

y: PrimExpr

The operand y, a float16x16 vector representing a 4x4 fp16 matrix in column-major order.

Supported DType

“float16”.

Examples

# The "vc_fp32_ptr" can be a scalar or vector float32 pointer, as long as the memory space
# that it points to is large enough to store 4x4 float32 data.
S.vmml(vc_fp32_ptr, va_fp16x16, vb_fp16x16)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmml

tvm.aipu.script.ir.arithmetic.vmma(acc_ptr, x, y)

Performs a mixed-precision 4x4 matrix multiplication and accumulation for the float16x16 vectors x (row-major) and y (column-major). The result pointer acc_ptr is the address of a row-major float32x16. The behavior is the same as acc_ptr[:?] += matrix_multiply(x, y).

                      y(fp16x16):  y0   y4   y8  y12
                                   y1   y5   y9  y13
                                   y2   y6  y10  y14
                                   y3   y7  y11  y15

x(fp16x16):  x0   x1   x2   x3     a0   a1   a2   a3 :acc_ptr(fp32x16)
             x4   x5   x6   x7     a4   a5   a6   a7
             x8   x9  x10  x11     a8   a9  a10  a11
            x12  x13  x14  x15    a12  a13  a14  a15

S.vmma(acc_ptr, x, y)

# Detailed computation for each result element:
a0 += x0*y0 + x1*y1 + x2*y2 + x3*y3
a1 += x0*y4 + x1*y5 + x2*y6 + x3*y7
...
a9 += x8*y4 + x9*y5 + x10*y6 + x11*y7
...
a15 += x12*y12 + x13*y13 + x14*y14 + x15*y15

Parameters

acc_ptr: Pointer

The pointer holding the memory address where the result will be accumulated. It can be a scalar or vector float32 pointer; the memory it points to must be large enough to hold a row-major 4x4 float32 matrix.

x: PrimExpr

The operand x, a float16x16 vector representing a 4x4 fp16 matrix in row-major order.

y: PrimExpr

The operand y, a float16x16 vector representing a 4x4 fp16 matrix in column-major order.

Supported DType

“float16”.

Examples

# The "vc_fp32_ptr" can be a scalar or vector float32 pointer, as long as the memory space
# that it points to is large enough to store 4x4 float32 data.
S.vmma(vc_fp32_ptr, va_fp16x16, vb_fp16x16)
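
The index arithmetic behind the "Detailed computation" lists of vmml and vmma can be written out in plain Python. This is a sketch only: vmm_ref is a hypothetical name, the flat lists stand in for the vectors and the result memory, and precision effects of fp16 inputs are ignored.

```python
def vmm_ref(dst, x, y, accumulate=False):
    """Reference model: dst (row-major 4x4) = or += X (row-major) @ Y (column-major)."""
    for row in range(4):
        for col in range(4):
            # Row-major X: element (row, k) is x[4*row + k].
            # Column-major Y: element (k, col) is y[4*col + k].
            dot = sum(x[4 * row + k] * y[4 * col + k] for k in range(4))
            idx = 4 * row + col
            dst[idx] = dst[idx] + dot if accumulate else dot
    return dst


v = [float(i) for i in range(16)]
a = vmm_ref([0.0] * 16, v, v)
print(a[0], a[1], a[9])  # 14.0 38.0 214.0
```

With accumulate=True the same products are added to the existing contents of dst, which corresponds to the += form given for vmma.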

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmma

tvm.aipu.script.ir.arithmetic.fma(acc, x, y, mask=None)

Performs a floating-point multiply-add on every active element of the inputs.

  • The scalar situation where all of acc, x and y are scalar is also supported.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

 acc: a0     a1        a2     ...     a7
   x: x0     x1        x2     ...     x7
   y: y0     y1        y2     ...     y7
mask:  F      T         T     ...      T

 out = S.fma(acc, x, y, mask)
 out: a0  a1+x1*y1  a2+x2*y2  ...  a7+x7*y7

Parameters

acc: PrimExpr

The accumulator register. It must be initialized.

x, y: Union[PrimExpr, float]

The operands. If it is a scalar in the vector situation, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“float32”.

Examples

acc = S.float32x8(10)
out = S.fma(acc, x, y, mask)

scalar_out = S.fma(scalar_acc, scalar_x, scalar_y)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vfma, __fma

tvm.aipu.script.ir.arithmetic.vfmae(acc, x, y, mask=None)

Performs a floating-point multiply-add on the active even-indexed elements of x and y.

 acc(fp32x8): a0      a1        a2        ...   a7
  x(fp16x16): x0  x1  x2  x3    x4  x5    ...  x14  x15
  y(fp16x16): y0  y1  y2  y3    y4  y5    ...  y14  y15
mask(boolx8):  F       T         T        ...   T

 out = S.vfmae(acc, x, y, mask)
 out(fp32x8): a0      a1+x2*y2  a2+x4*y4  ...  a7+x14*y14

Parameters

acc: PrimExpr

The accumulator register. It must be initialized.

x, y: Union[PrimExpr, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

# Only floating-point cases are supported:
case  acc.dtype  x.dtype    y.dtype
1     "float32"  "float16"  "float16"

Examples

acc = S.float32x8(10)
out = S.vfmae(acc, x, y)

acc = S.float32x8(10)
out = S.vfmae(acc, x, y, mask)

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vfmae

tvm.aipu.script.ir.arithmetic.vfmao(acc, x, y, mask=None)

Performs a floating-point multiply-add on the active odd-indexed elements of x and y.

 acc(fp32x8): a0      a1        a2        ...   a7
  x(fp16x16): x0  x1  x2  x3    x4  x5    ...  x14  x15
  y(fp16x16): y0  y1  y2  y3    y4  y5    ...  y14  y15
mask(boolx8):  F       T         T        ...   T

 out = S.vfmao(acc, x, y, mask)
 out(fp32x8): a0      a1+x3*y3  a2+x5*y5  ...  a7+x15*y15

Parameters

acc: PrimExpr

The accumulator register. It must be initialized.

x, y: Union[PrimExpr, float]

The operands. If either one is a scalar, it will be automatically broadcast.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

# Only floating-point cases are supported:
case  acc.dtype  x.dtype    y.dtype
1     "float32"  "float16"  "float16"

Examples

acc = S.float32x8(10)
out = S.vfmao(acc, x, y)

acc = S.float32x8(10)
out = S.vfmao(acc, x, y, mask)
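
The even/odd lane selection of vfmae and vfmao can be compared side by side in plain Python. The helper vfma_half_ref and its odd flag are hypothetical. The mask is indexed by result element, as in the diagrams, and rounding behavior is not modelled.

```python
def vfma_half_ref(acc, x, y, mask=None, odd=False):
    """Reference model of vfmae (odd=False) and vfmao (odd=True)."""
    mask = [True] * len(acc) if mask is None else mask
    offset = 1 if odd else 0
    out = []
    for j, a in enumerate(acc):
        i = 2 * j + offset  # input lane feeding result element j
        out.append(a + x[i] * y[i] if mask[j] else a)
    return out


acc = [10.0] * 4
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0] * 8
mask = [False, True, True, True]
print(vfma_half_ref(acc, x, y, mask))            # [10.0, 16.0, 20.0, 24.0]
print(vfma_half_ref(acc, x, y, mask, odd=True))  # [10.0, 18.0, 22.0, 26.0]
```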

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vfmao

tvm.aipu.script.ir.arithmetic.vrint(x, mask=None)

Computes the rounding on active elements of x.

  • The inactive elements of result vector are undefined.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

   x: -0.4  0.2  1.4  1.5  1.6  1.8  1.9  2.01
mask:   T    T    T    T    T    F    T    T

 out = S.vrint(x, mask)
 out: -0.0  0.0  1.0  2.0  2.0   ?   2.0  2.0

Parameters

x: PrimExpr

The operand, a vector.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“float16/32”.

Examples

vc = S.vrint(va)
vc = S.vrint(va, mask="3T5F")
vc = S.vrint(va, mask=S.tail_mask(n, 8))

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vrint

tvm.aipu.script.ir.arithmetic.clip(x, min_val, max_val, mask=None)

Clips the active elements of x to the range given by the corresponding elements of min_val and max_val.

  • The scalar situation where all of x, min_val, max_val are scalar is also supported.

  • The inactive elements of result vector are set to the corresponding elements of x.

  • The feature Flexible Width Vector is supported.

  • The feature Multiple Width Vector is supported.

      x: 1  3  4  9  4  4  8  8
min_val: 3  3  3  3  5  5  5  5
max_val: 8  8  8  8  7  7  7  7
   mask: T  T  T  T  F  T  F  T

out = S.clip(x, min_val, max_val, mask)
    out: 3  3  4  8  4  5  8  7

Parameters

x, min_val, max_val: Union[PrimExpr, int, float]

The operands. If any of them is a scalar, it will be automatically broadcast. Note that min_val must be less than max_val.

mask: Optional[Union[Tuple[bool], List[bool], numpy.ndarray[bool], str, PrimExpr]]

The predication mask to indicate which elements of the vector are active for the operation. None means all elements are active.

Returns

ret: PrimExpr

The result expression.

Supported DType

“int8/16/32”, “uint8/16/32”, “float16/32”.

Examples

b = S.clip(a, -10, 10)
vc = S.clip(va, vb, vc)
vc = S.clip(va, 3, 30)
vc = S.clip(va, vb, vc, mask="3T5F")
vc = S.clip(va, vb, vc, mask=S.tail_mask(n, 8))
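
The clipping rule, including the pass-through of inactive elements, can be modelled in plain Python. The function clip_ref is a hypothetical reference sketch over list-based vectors, not the actual implementation.

```python
def clip_ref(x, min_val, max_val, mask=None):
    """Reference model: clamp active lanes to [min_val, max_val], pass inactive lanes through."""
    n = len(x)
    # Broadcast scalar bounds to full vectors.
    lo = min_val if isinstance(min_val, list) else [min_val] * n
    hi = max_val if isinstance(max_val, list) else [max_val] * n
    mask = [True] * n if mask is None else mask
    return [min(max(v, l), h) if m else v for v, l, h, m in zip(x, lo, hi, mask)]


x = [1, 3, 4, 9, 4, 4, 8, 8]
min_val = [3, 3, 3, 3, 5, 5, 5, 5]
max_val = [8, 8, 8, 8, 7, 7, 7, 7]
mask = [True, True, True, True, False, True, False, True]
print(clip_ref(x, min_val, max_val, mask))  # [3, 3, 4, 8, 4, 5, 8, 7]
```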

See Also

  • Zhouyi Compass OpenCL Programming Guide: __vmax, __vmin, __vsel, __vclt, __vcgt