In multimedia, we often write vector (SIMD) assembly implementations of computationally expensive functions to make our software faster. At a high level, there are three basic approaches to writing assembly optimizations (for any architecture):
- intrinsics;
- inline assembly;
- hand-written assembly.
Inline assembly is typically disliked because of its poor readability and portability. Intrinsics hide complex details (such as register allocation and stack memory) from you, which makes writing optimizations easier, but they typically carry a performance penalty compared to hand-written (or inline) assembly, because of poor code generation by compilers. x86inc.asm is a helper originally developed by various x264 developers, licensed under the ISC license, which aims to make writing hand-written assembly on x86 easier. Other than x264, x86inc.asm is also used in libvpx, libaom, x265 and FFmpeg.
This guide is intended to serve as an introduction to x86inc.asm for developers (somewhat) familiar with x86 vector assembly (either intrinsics or hand-written assembly), but not familiar with x86inc.asm itself.
Basic example of how to use x86inc.asm
To explain how this works, it’s probably best to start with an example. Imagine the following C function to calculate the SAD (sum of absolute differences) of a 16×16 block (this would typically be invoked as a distortion metric in a motion search):
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t pixel;

static unsigned sad_16x16_c(const pixel *src, ptrdiff_t src_stride,
                            const pixel *dst, ptrdiff_t dst_stride)
{
    unsigned sum = 0;
    int y, x;

    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++)
            sum += abs(src[x] - dst[x]);
        src += src_stride;
        dst += dst_stride;
    }

    return sum;
}
In x86inc.asm syntax, this would look like this:
%include "x86inc.asm"
SECTION .text
INIT_XMM sse2
cglobal sad_16x16, 4, 7, 5, src, src_stride, dst, dst_stride, \
src_stride3, dst_stride3, cnt
lea src_stride3q, [src_strideq*3]
lea dst_stride3q, [dst_strideq*3]
mov cntd, 4
pxor m0, m0
.loop:
mova m1, [srcq+src_strideq*0]
mova m2, [srcq+src_strideq*1]
mova m3, [srcq+src_strideq*2]
mova m4, [srcq+src_stride3q]
lea srcq, [srcq+src_strideq*4]
psadbw m1, [dstq+dst_strideq*0]
psadbw m2, [dstq+dst_strideq*1]
psadbw m3, [dstq+dst_strideq*2]
psadbw m4, [dstq+dst_stride3q]
lea dstq, [dstq+dst_strideq*4]
paddw m1, m2
paddw m3, m4
paddw m0, m1
paddw m0, m3
dec cntd
jg .loop
movhlps m1, m0
paddw m0, m1
movd eax, m0
RET
That’s a whole lot of stuff. The critical things to understand in the above example are:
- functions (symbols) are declared by cglobal;
- we don’t refer to vector registers by their official typed name (e.g. mm0, xmm0, ymm0 or zmm0), but by a templated name (m0) which is generated in INIT_*;
- we use mova, not movdqa, to move data between registers or from/to memory;
- we don’t refer to general-purpose registers by their official name (e.g. rdi or edi), but by a templated name (e.g. srcq), which is generated (and populated) in cglobal, unless it is to store return values (eax);
- we use RET (not ret!) to return;
- in your build system, this file is treated like any other hand-written assembly file, so you’d build an object file from it using nasm or yasm.
Let’s explore and understand all of this in more detail.
Understanding INIT_*, cglobal, DEFINE_ARGS and RET
INIT_* indicates what type of vector registers we want to use: MMX (mm0), SSE (xmm0), AVX (ymm0) or AVX-512 (zmm0). The invocation also allows us to target a specific CPU instruction set (e.g. ssse3 or avx2). This has various features:
- templated vector register names (m0) which mirror a specific class of registers (mm0, xmm0, ymm0 or zmm0);
- templated instruction names (e.g. psadbw) which can warn if we’re using instructions unsupported by the chosen instruction set (e.g. pmulhrsw in SSE2 functions);
- templated instruction names also hide VEX coding (vpsadbw vs. psadbw) when targeting AVX;
- aliases for moving data to or from full aligned (mova, which translates to movq for mm, (v)movdqa for xmm or vmovdqa for ymm registers), full unaligned (movu) or half (movh) vector registers;
- convenience aliases mmsize and gprsize to indicate the size (in bytes) of vector and general-purpose registers.
For example, to write an AVX2 function using ymm registers, you’d use INIT_YMM avx2. To write an SSSE3 function using xmm registers, you’d use INIT_XMM ssse3. To write an “extended MMX” (the integer instructions introduced as part of SSE) function using mm registers, you’d use INIT_MMX mmxext. Finally, for AVX-512, you use INIT_ZMM avx512.
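To make this concrete, here is a minimal sketch (the copy_block16/copy_block32 functions are hypothetical, purely for illustration) showing how the same templated line assembles differently depending on the preceding INIT_*:
INIT_XMM sse2
cglobal copy_block16, 2, 2, 1, dst, src
    mova       m0, [srcq]      ; assembles to movdqa xmm0, [...]; mmsize == 16
    mova       [dstq], m0      ; assumes dst is suitably aligned
    RET

INIT_YMM avx2
cglobal copy_block32, 2, 2, 1, dst, src
    mova       m0, [srcq]      ; assembles to vmovdqa ymm0, [...]; mmsize == 32
    mova       [dstq], m0
    RET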
cglobal indicates a single function declaration. This has various features:
- portably declaring a global (exported) symbol with the project's name prefix and an instruction-set suffix;
- portably making callee-save general-purpose registers available (by pushing their contents to the stack);
- portably loading function arguments from the stack into general-purpose registers;
- portably making callee-save xmm vector registers available (on Win64);
- generating named and numbered general-purpose register aliases whose mapping to native registers is optimized for each particular target platform;
- allocating aligned stack memory (see “Using stack memory” below).
The sad_16x16 function declared above, for example, had the following cglobal line:
cglobal sad_16x16, 4, 7, 5, src, src_stride, dst, dst_stride, \
                            src_stride3, dst_stride3, cnt
Using the first argument (sad_16x16), this creates a global symbol <prefix>_sad_16x16_sse2() which is accessible from C functions elsewhere. The prefix is a project-wide setting defined in your nasm/yasm build flags (-Dprefix=name).
Using the third argument, it requests 7 general-purpose registers (GPRs):
- on x86-32, where only the first 3 GPRs are caller-save, this would push the contents of the other 4 GPRs to the stack, so that we have 7 GPRs available in the function body;
- on unix64/win64, we have 7 or more caller-save GPRs available, so no GPR contents are pushed to the stack.
Using the second argument, 4 GPRs are indicated to be loaded with function arguments upon function entry:
- on x86-32, where all function arguments are passed on the stack, this means that we mov each argument from the stack into the appropriate register;
- on unix64/win64, the first 6/4 arguments are passed in GPRs (rdi, rsi, rdx, rcx, r8, r9 on unix64; rcx, rdx, r8, r9 on win64), so no actual mov instructions are needed in this particular case.
This should also explain why we want to use templated register names instead of their native names (e.g. rdi), since we want the src variable to be kept in rcx on win64 and in rdi on unix64. At this level inside x86inc.asm, these registers have numbered aliases (r0, r1, r2, etc.). The specific order in which each numbered register is associated with a native register on a given target platform depends on the function-call argument order mandated by the ABI, then caller-save registers, and lastly callee-save registers.
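For instance, inside the sad_16x16 function above, the following three lines would assemble to the same instruction on unix64 (the numbered and native forms are shown only for illustration; in practice you would write the named form):
    lea        srcq, [srcq+src_strideq*4]   ; named aliases (portable)
    lea        r0, [r0+r1*4]                ; the numbered aliases they map to
    lea        rdi, [rdi+rsi*4]             ; what both expand to on unix64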
Lastly, using the fourth argument, we indicate that we’ll be using 5 xmm registers. On win64, if this number is larger than 6 (or 22 for AVX-512), we’ll be using callee-save xmm registers and thus have to back up their contents to the stack. On other platforms, this value is ignored.
The remaining arguments are named aliases for each GPR (e.g. src will refer to r0, etc.). For each named or numbered register, you’ll notice a suffix (e.g. the q suffix in srcq). A full list of suffixes:
- q: qword (on 64-bit) or dword (on 32-bit); note how q is missing from the numbered aliases, e.g. on unix64 rdi = r0 = srcq, but on 32-bit eax = r0 = srcq;
- d: dword, e.g. on unix64 eax = r6d = cntd, but on 32-bit ebp = r6d = cntd;
- w: word, e.g. on unix64 ax = r6w = cntw, but on 32-bit bp = r6w = cntw;
- b: (low) byte in a word, e.g. on unix64 al = r6b = cntb;
- h: high byte in a word, e.g. on unix64 ah = r6h = cnth;
- m: the stack location of this variable, if any, else the dword register alias (e.g. on 32-bit [esp + stack_offset + 4] = r0m = srcm);
- mp: similar to m, but using a qword register alias and memory size indicator (e.g. on unix64 qword [rsp + stack_offset + 8] = r6mp = cntmp, but on 32-bit dword [rsp + stack_offset + 28] = r6mp = cntmp).
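As a small illustrative sketch (the function and argument names here are hypothetical), mixing suffixes in a function that reads bytes from a buffer might look like this:
cglobal suffix_example, 2, 3, 0, buf, count, tmp
    movzx      tmpd, byte [bufq]    ; q: pointer-sized address, d: 32-bit destination
    add        tmpw, countw         ; w: 16-bit halves of the same GPRs
    mov        byte [bufq+1], tmpb  ; b: low byte of the "tmp" GPR
    RET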
DEFINE_ARGS is a way to rename the named registers defined with the last arguments of cglobal. It allows you to re-use the same physical/numbered general-purpose registers under a different name, typically implying a different purpose, and thus allows for more readable code without requiring more general-purpose registers than strictly necessary.
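As a hedged sketch (the function and its arguments are made up for illustration): a function might consume its mode argument early, after which DEFINE_ARGS lets the same register be reused under a more meaningful name:
cglobal filter_example, 3, 4, 0, dst, stride, mode, cnt
    ; ... use moded here to select a code path; afterwards the value is dead ...
    DEFINE_ARGS dst, stride, tmp, cnt   ; r2 is now named "tmp" instead of "mode"
    mov        cntd, 4
.loop:
    ; ... tmpq is available as scratch here ...
    dec        cntd
    jg .loop
    RET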
RET restores any callee-save registers (GPR or vector) that were pushed to the stack by cglobal, and undoes any additional stack memory allocated for custom use (see “Using stack memory” below). It will also emit vzeroupper to clear the upper halves of the ymm registers (if the function was declared with INIT_YMM or higher). At the end, it emits ret to return to the caller.
Templating functions for multiple register types
At this point, it should be fairly obvious why we use templated names (r0 or src instead of rdi (unix64), rcx (win64) or eax (32-bit)) for general-purpose registers: portability. However, we have not yet explained why we use templated names for vector registers (m0 instead of mm0, xmm0, ymm0 or zmm0). The reason for this is function templating. Let’s go back to the sad_16x16 function above and turn it into a template:
%macro SAD_FN 2 ; width, height
cglobal sad_%1x%2, 4, 7, 5, src, src_stride, dst, dst_stride, \
                            src_stride3, dst_stride3, cnt
    lea        src_stride3q, [src_strideq*3]
    lea        dst_stride3q, [dst_strideq*3]
    mov        cntd, %2 / 4
    pxor       m0, m0
.loop:
    mova       m1, [srcq+src_strideq*0]
    mova       m2, [srcq+src_strideq*1]
    mova       m3, [srcq+src_strideq*2]
    mova       m4, [srcq+src_stride3q]
    lea        srcq, [srcq+src_strideq*4]
    psadbw     m1, [dstq+dst_strideq*0]
    psadbw     m2, [dstq+dst_strideq*1]
    psadbw     m3, [dstq+dst_strideq*2]
    psadbw     m4, [dstq+dst_stride3q]
    lea        dstq, [dstq+dst_strideq*4]
    paddw      m1, m2
    paddw      m3, m4
    paddw      m0, m1
    paddw      m0, m3
    dec        cntd
    jg .loop
%if mmsize >= 16
%if mmsize >= 32
    vextracti128 xm1, m0, 1
    paddw      xm0, xm1
%endif
    movhlps    xm1, xm0
    paddw      xm0, xm1
%endif
    movd       eax, xm0
    RET
%endmacro
INIT_MMX mmxext
SAD_FN 8, 4
SAD_FN 8, 8
SAD_FN 8, 16
INIT_XMM sse2
SAD_FN 16, 8
SAD_FN 16, 16
SAD_FN 16, 32
INIT_YMM avx2
SAD_FN 32, 16
SAD_FN 32, 32
SAD_FN 32, 64
This indeed generates 9 functions, covering square and rectangular block sizes for each register type. More could be done to reduce binary size, but for the purposes of this tutorial, the take-home message is that we can use the same source code to write functions for multiple vector register types (mm0, xmm0, ymm0 or zmm0).
Some readers may at this point notice that x86inc.asm allows a programmer to use non-VEX instruction names (such as psadbw) for VEX instructions (such as vpsadbw), even though ymm registers can only be operated on by VEX-coded instructions. This is indeed the case, and is discussed in more detail in “AVX three-operand instruction emulation” below. You’ll also notice that we use xm0 at the end of the function; this allows us to explicitly access xmm registers in functions otherwise templated for ymm registers, while still correctly mapping to mm registers in MMX functions.
Templating functions for multiple instruction sets
We can also use this same approach to template multiple variants of a function that differ in instruction set. Consider, for example, the pabsw instruction, which was added in SSSE3. You could use templates to write two versions of a SAD function for 10- or 12-bit-per-component pictures (typedef uint16_t pixel in the C code above):
%macro ABSW 2 ; dst/src, tmp
%if cpuflag(ssse3)
    pabsw      %1, %1
%else
    pxor       %2, %2
    psubw      %2, %1
    pmaxsw     %1, %2
%endif
%endmacro

%macro SAD_8x8_FN 0
cglobal sad_8x8, 4, 7, 6, src, src_stride, dst, dst_stride, \
                          src_stride3, dst_stride3, cnt
    lea        src_stride3q, [src_strideq*3]
    lea        dst_stride3q, [dst_strideq*3]
    mov        cntd, 2
    pxor       m0, m0
.loop:
    mova       m1, [srcq+src_strideq*0]
    mova       m2, [srcq+src_strideq*1]
    mova       m3, [srcq+src_strideq*2]
    mova       m4, [srcq+src_stride3q]
    lea        srcq, [srcq+src_strideq*4]
    psubw      m1, [dstq+dst_strideq*0]
    psubw      m2, [dstq+dst_strideq*1]
    psubw      m3, [dstq+dst_strideq*2]
    psubw      m4, [dstq+dst_stride3q]
    lea        dstq, [dstq+dst_strideq*4]
    ABSW       m1, m5
    ABSW       m2, m5
    ABSW       m3, m5
    ABSW       m4, m5
    paddw      m1, m2
    paddw      m3, m4
    paddw      m0, m1
    paddw      m0, m3
    dec        cntd
    jg .loop
    movhlps    m1, m0
    paddw      m0, m1
    pshuflw    m1, m0, q1032
    paddw      m0, m1
    pshuflw    m1, m0, q0001 ; qNNNN is base4 notation for pshuf* imm8 arguments
    paddw      m0, m1
    movd       eax, m0
    movzx      eax, ax
    RET
%endmacro
INIT_XMM sse2
SAD_8x8_FN
INIT_XMM ssse3
SAD_8x8_FN
Altogether, function templating makes for easily maintainable code, as long as developers understand how the templating works and what its goals are. Templating can hide part of a function’s complexity in macros, which can be confusing and thus make code harder to understand; at the same time, it significantly reduces source code duplication, which makes code maintenance easier.
AVX three-operand instruction emulation
One of the key features introduced as part of AVX (independent of ymm registers) is VEX encoding, which allows three-operand instructions. Since VEX instructions (e.g. vpsadbw) in x86inc.asm are defined from the non-VEX versions (e.g. psadbw) for templating purposes, the three-operand instruction support actually exists in the non-VEX versions, too. Therefore, code like this (to interleave vertically adjacent pixels) is actually valid:
[..]
    mova       m0, [srcq+src_strideq*0]
    mova       m1, [srcq+src_strideq*1]
    punpckhbw  m2, m0, m1
    punpcklbw  m0, m1
[..]
An AVX version of this function (using xmm vector registers, through INIT_XMM avx) would translate this literally to the following on unix64:
[..]
    vmovdqa    xmm0, [rdi]
    vmovdqa    xmm1, [rdi+rsi]
    vpunpckhbw xmm2, xmm0, xmm1
    vpunpcklbw xmm0, xmm0, xmm1
[..]
On the other hand, an SSE2 version of the same source code (through INIT_XMM sse2) would translate literally to the following on unix64:
[..]
    movdqa     xmm0, [rdi]
    movdqa     xmm1, [rdi+rsi]
    movdqa     xmm2, xmm0
    punpckhbw  xmm2, xmm1
    punpcklbw  xmm0, xmm1
[..]
In practice, as demonstrated earlier, AVX/VEX emulation also allows using the same (templated) source code for pre-AVX functions and AVX functions.
Using SWAP
Another noteworthy feature in x86inc.asm is SWAP, an assembler-time, instruction-less register swapper. The typical use case for this is to stay numerically consistent in how data is ordered across vector registers. As an example, SWAP 0, 1 would switch the assembler’s internal meaning of the templated vector register names m0 and m1. Before, m0 might refer to xmm0 and m1 to xmm1; afterwards, m0 would refer to xmm1 and m1 to xmm0.
SWAP can take more than 2 arguments, in which case SWAP is invoked sequentially on each subsequent pair of numbers. For example, SWAP 1, 2, 3, 4 would be identical to SWAP 1, 2, followed by SWAP 2, 3, and finally SWAP 3, 4.
The ordering is reset at the beginning of each function (in cglobal).
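A sketch of a typical use (hypothetical butterfly code): the sum and difference end up in registers in an order that later code does not expect, and SWAP renames them without emitting any move:
    ; m0 = a, m1 = b
    mova       m2, m0
    paddw      m0, m1      ; m0 = a + b
    psubw      m2, m1      ; m2 = a - b
    SWAP       1, 2        ; later code expects the difference in m1; after this,
                           ; "m1" refers to the register holding a - b (no instruction emitted)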
Using stack memory
Using stack memory in hand-written assembly is relatively easy if all you want is unaligned data, or if the platform provides aligned stack access at each function’s entry point. For example, on Linux/Mac (32-bit and 64-bit) and Windows (64-bit), the stack upon function entry is guaranteed by the ABI to be 16-byte aligned. Unfortunately, sometimes we need 32-byte alignment (for aligned vmovdqa access to ymm register contents), or we need aligned memory on 32-bit Windows. Therefore, x86inc.asm provides portable stack alignment. The fourth (optional) numeric argument to cglobal is the number of bytes of stack memory you need. If this value is 0 or missing, no stack memory is allocated. If the alignment constraint (mmsize) is greater than what the platform guarantees, the stack is manually aligned.
If the stack was manually aligned, there are two ways to restore the original (pre-alignment) stack pointer in RET, each of which has implications for the function body/implementation:
- if we saved the original stack pointer in a general-purpose register (GPR), we will still have access to the m and mp named register aliases in the function body. However, it means that one GPR (the one holding the original stack pointer) is not available for other uses in our function body. Specifically, on 32-bit, this would limit the number of available GPRs in the function body to 6. To use this option, specify a positive stack size:
cglobal sad_16x16, 4, 7, 5, 64, src, src_stride, dst, dst_stride, \
src_stride3, dst_stride3, cnt
- if we saved the original stack pointer on the (post-alignment) stack, we will not have access to the m or mp named register aliases in the function body, but we don’t occupy a GPR either. To use this option, specify a negative stack size. Note how we write the negative number as 0 - 64, not just -64, because of a bug in some older versions of yasm:
cglobal sad_16x16, 4, 7, 5, 0 - 64, src, src_stride, dst, dst_stride, \
src_stride3, dst_stride3, cnt
After stack memory allocation, it can be accessed using [rsp] (which is an alias for [esp] on 32-bit).
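As a brief sketch of the first (positive stack size) variant, a hypothetical function that needs 32 bytes of aligned scratch space could spill and reload a vector register like this:
INIT_YMM avx2
cglobal stack_example, 2, 3, 2, 32, dst, src
    mova       m0, [srcq]
    mova       [rsp], m0       ; rsp is aligned to mmsize (32 bytes) here
    ; ... clobber m0 with other work ...
    mova       m0, [rsp]       ; reload the saved contents
    mova       [dstq], m0
    RET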
Conclusion
This post was meant as an introduction to x86inc.asm, an x86 hand-written assembly helper, for developers (somewhat) familiar with hand-written assembly or intrinsics. It was inspired by an earlier guide called x264asm.