Using SSE2 instructions
by Richard Russell, updated May 2015
SSE2 instructions are supported by the ASMLIB2 library, and that will generally be the most appropriate way to incorporate them in a program. However, using the library has one significant disadvantage: the resulting program cannot be straightforwardly compiled, because the SSE2 instructions will not be accepted by the cruncher. To work around this issue the assembler code must be placed in a separate file (with an extension other than .BBC) which is executed at run time, for example:
CALL "mysse2code.bba"
(the file should have a RETURN as the last statement).
Whilst this solution is relatively straightforward, it is arguably inconvenient, especially if the amount of assembler code is small. There is an alternative way of assembling many of the SSE2 instructions which does not require the use of a library and which allows the program to be compiled conventionally: add a word qualifier to the equivalent MMX instruction. So for example the instruction:
punpcklbw xmm0,xmm1
can be assembled as follows:
punpcklbw word mm0,mm1 ; punpcklbw xmm0,xmm1
The full set of SSE2 instructions which can be assembled in this way is as follows:
punpcklbw word mm0,mm1 ; punpcklbw xmm0,xmm1
punpcklwd word mm0,mm1 ; punpcklwd xmm0,xmm1
punpckldq word mm0,mm1 ; punpckldq xmm0,xmm1
punpckhbw word mm0,mm1 ; punpckhbw xmm0,xmm1
punpckhwd word mm0,mm1 ; punpckhwd xmm0,xmm1
punpckhdq word mm0,mm1 ; punpckhdq xmm0,xmm1
packsswb word mm0,mm1 ; packsswb xmm0,xmm1
packssdw word mm0,mm1 ; packssdw xmm0,xmm1
packuswb word mm0,mm1 ; packuswb xmm0,xmm1
pcmpgtb word mm0,mm1 ; pcmpgtb xmm0,xmm1
pcmpgtw word mm0,mm1 ; pcmpgtw xmm0,xmm1
pcmpgtd word mm0,mm1 ; pcmpgtd xmm0,xmm1
pcmpeqb word mm0,mm1 ; pcmpeqb xmm0,xmm1
pcmpeqw word mm0,mm1 ; pcmpeqw xmm0,xmm1
pcmpeqd word mm0,mm1 ; pcmpeqd xmm0,xmm1
pshufw word mm0,mm1,5 ; pshufd xmm0,xmm1,5
psrlw word mm0,5 ; psrlw xmm0,5
psrld word mm0,5 ; psrld xmm0,5
psrlq word mm0,5 ; psrlq xmm0,5
psrlw word mm0,mm1 ; psrlw xmm0,xmm1
psrld word mm0,mm1 ; psrld xmm0,xmm1
psrlq word mm0,mm1 ; psrlq xmm0,xmm1
psraw word mm0,5 ; psraw xmm0,5
psrad word mm0,5 ; psrad xmm0,5
psraw word mm0,mm1 ; psraw xmm0,xmm1
psrad word mm0,mm1 ; psrad xmm0,xmm1
psllw word mm0,5 ; psllw xmm0,5
pslld word mm0,5 ; pslld xmm0,5
psllq word mm0,5 ; psllq xmm0,5
psllw word mm0,mm1 ; psllw xmm0,xmm1
pslld word mm0,mm1 ; pslld xmm0,xmm1
psllq word mm0,mm1 ; psllq xmm0,xmm1
pinsrw word mm0,[esi],5 ; pinsrw xmm0,[esi],5
pextrw word [esi],mm0,5 ; pextrw [esi],xmm0,5
pavgb word mm0,mm1 ; pavgb xmm0,xmm1
pavgw word mm0,mm1 ; pavgw xmm0,xmm1
pmullw word mm0,mm1 ; pmullw xmm0,xmm1
pmulhuw word mm0,mm1 ; pmulhuw xmm0,xmm1
pmulhw word mm0,mm1 ; pmulhw xmm0,xmm1
movntq word [edi],mm1 ; movntdq [edi],xmm1
pmaddwd word mm0,mm1 ; pmaddwd xmm0,xmm1
psadbw word mm0,mm1 ; psadbw xmm0,xmm1
maskmovq word mm0,mm1 ; maskmovdqu xmm0,xmm1
movd word mm0,[esi] ; movd xmm0,[esi]
movd word [edi],mm0 ; movd [edi],xmm0
movq word mm0,[esi] ; movdqa xmm0,[esi]
movq word [edi],mm0 ; movdqa [edi],xmm0
psubusb word mm0,mm1 ; psubusb xmm0,xmm1
psubusw word mm0,mm1 ; psubusw xmm0,xmm1
psubsb word mm0,mm1 ; psubsb xmm0,xmm1
psubsw word mm0,mm1 ; psubsw xmm0,xmm1
psubb word mm0,mm1 ; psubb xmm0,xmm1
psubw word mm0,mm1 ; psubw xmm0,xmm1
psubd word mm0,mm1 ; psubd xmm0,xmm1
paddusb word mm0,mm1 ; paddusb xmm0,xmm1
paddusw word mm0,mm1 ; paddusw xmm0,xmm1
paddsb word mm0,mm1 ; paddsb xmm0,xmm1
paddsw word mm0,mm1 ; paddsw xmm0,xmm1
paddb word mm0,mm1 ; paddb xmm0,xmm1
paddw word mm0,mm1 ; paddw xmm0,xmm1
paddd word mm0,mm1 ; paddd xmm0,xmm1
pminub word mm0,mm1 ; pminub xmm0,xmm1
pminsw word mm0,mm1 ; pminsw xmm0,xmm1
pmaxub word mm0,mm1 ; pmaxub xmm0,xmm1
pmaxsw word mm0,mm1 ; pmaxsw xmm0,xmm1
pand word mm0,mm1 ; pand xmm0,xmm1
pandn word mm0,mm1 ; pandn xmm0,xmm1
por word mm0,mm1 ; por xmm0,xmm1
pxor word mm0,mm1 ; pxor xmm0,xmm1
In addition the MOVDQU instruction (unaligned move) may be assembled as follows:
repe movq mm0,[esi] ; movdqu xmm0,[esi]
repe movq [edi],mm0 ; movdqu [edi],xmm0
Wherever mm0 or mm1 (xmm0 or xmm1) is shown, any of the eight registers mm0 to mm7 (xmm0 to xmm7) may be used instead. In many cases a memory reference can be used instead of the mm1 register in the 'source' field.
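The forms listed above might be combined into a complete routine along the following lines. This is only a minimal, untested sketch: the names code%, a%, b% and add16 are invented for illustration, and it assumes a processor that supports SSE2. The unaligned MOVDQU form is used for the loads and the store because memory reserved with DIM is not guaranteed to be 16-byte aligned, which MOVDQA requires.

```bbcbasic
      REM Sketch only: add two 16-byte arrays using SSE2, without a library.
      REM code%, a%, b% and add16 are illustrative names, not from the article.
      DIM code% 63, a% 15, b% 15

      FOR pass% = 0 TO 2 STEP 2
        P% = code%
        [OPT pass%
        .add16
        mov esi,a%                ; address of first array, assembled as an immediate
        mov edi,b%                ; address of second array
        repe movq mm0,[esi]       ; movdqu xmm0,[esi]
        repe movq mm1,[edi]       ; movdqu xmm1,[edi]
        paddb word mm0,mm1        ; paddb xmm0,xmm1
        repe movq [edi],mm0       ; movdqu [edi],xmm0
        ret
        ]
      NEXT pass%

      REM Fill the arrays, run the routine, and show one result byte.
      FOR i% = 0 TO 15
        a%?i% = i%
        b%?i% = 100
      NEXT
      CALL add16
      PRINT b%?5
```

If the routine behaves as intended, each byte of b% should end up holding 100 plus the corresponding byte of a%. Because no library is involved, a program written this way should be compilable in the conventional manner described above.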