Using SSE2 instructions
by Richard Russell, updated May 2015
SSE2 instructions are supported by the ASMLIB2 library, and that will generally be the most appropriate way to incorporate them in a program. However, using the library has one significant disadvantage: the resulting program cannot be straightforwardly compiled, because the SSE2 instructions will not be accepted by the cruncher. To work around this issue the assembler code must be placed in a separate file (with an extension other than .BBC) which is executed at run time, for example:
CALL "mysse2code.bba"
(the file should have a RETURN as the last statement).
Whilst this solution is relatively straightforward, it is arguably inconvenient, especially if the amount of assembler code is small. There is an alternative way of assembling many of the SSE2 instructions which does not require the use of a library and which allows the program to be compiled conventionally: add a word qualifier to the equivalent MMX instruction. So, for example, the instruction:
punpcklbw xmm0,xmm1
can be assembled as follows:
punpcklbw word mm0,mm1 ; punpcklbw xmm0,xmm1
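At the machine-code level the 128-bit integer SSE2 instructions share their opcodes with the 64-bit MMX instructions and differ only by a leading 66h operand-size prefix, which is presumably what the word qualifier causes the assembler to emit. The following Python sketch (not BBC BASIC; the helper names are invented purely for illustration) builds the bytes for the example above under that assumption:

```python
# Illustrative only: hand-encodes the register-to-register form.
# Assumption: the 'word' qualifier emits the 66h operand-size prefix.
OPERAND_SIZE_PREFIX = 0x66  # selects the XMM form of an MMX opcode
PUNPCKLBW = 0x60            # second opcode byte, after the 0Fh escape


def modrm_reg_reg(dest, src):
    """ModRM byte for a register-to-register operation (mod = 11b)."""
    return 0xC0 | (dest << 3) | src


def encode_mmx(opcode, dest, src):
    """Bytes for an MMX instruction such as punpcklbw mm<dest>,mm<src>."""
    return bytes([0x0F, opcode, modrm_reg_reg(dest, src)])


def encode_sse2(opcode, dest, src):
    """Same opcode with the 66h prefix: operates on xmm<dest>,xmm<src>."""
    return bytes([OPERAND_SIZE_PREFIX]) + encode_mmx(opcode, dest, src)


mmx_bytes = encode_mmx(PUNPCKLBW, 0, 1)    # punpcklbw mm0,mm1
sse2_bytes = encode_sse2(PUNPCKLBW, 0, 1)  # punpcklbw xmm0,xmm1
print(mmx_bytes.hex(" "), "->", sse2_bytes.hex(" "))
```

The two encodings are identical apart from the first byte, which is why an assembler that knows only the MMX mnemonics can still generate the SSE2 form.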
The full set of SSE2 instructions which can be assembled in this way is as follows:
      punpcklbw word mm0,mm1 ; punpcklbw xmm0,xmm1
      punpcklwd word mm0,mm1 ; punpcklwd xmm0,xmm1
      punpckldq word mm0,mm1 ; punpckldq xmm0,xmm1
      punpckhbw word mm0,mm1 ; punpckhbw xmm0,xmm1
      punpckhwd word mm0,mm1 ; punpckhwd xmm0,xmm1
      punpckhdq word mm0,mm1 ; punpckhdq xmm0,xmm1
      packsswb word mm0,mm1  ; packsswb xmm0,xmm1
      packssdw word mm0,mm1  ; packssdw xmm0,xmm1
      packuswb word mm0,mm1  ; packuswb xmm0,xmm1
      pcmpgtb word mm0,mm1   ; pcmpgtb xmm0,xmm1
      pcmpgtw word mm0,mm1   ; pcmpgtw xmm0,xmm1
      pcmpgtd word mm0,mm1   ; pcmpgtd xmm0,xmm1
      pcmpeqb word mm0,mm1   ; pcmpeqb xmm0,xmm1
      pcmpeqw word mm0,mm1   ; pcmpeqw xmm0,xmm1
      pcmpeqd word mm0,mm1   ; pcmpeqd xmm0,xmm1
      pshufw word mm0,mm1,5  ; pshufd xmm0,xmm1,5
      psrlw word mm0,5       ; psrlw xmm0,5
      psrld word mm0,5       ; psrld xmm0,5
      psrlq word mm0,5       ; psrlq xmm0,5
      psrlw word mm0,mm1     ; psrlw xmm0,xmm1
      psrld word mm0,mm1     ; psrld xmm0,xmm1
      psrlq word mm0,mm1     ; psrlq xmm0,xmm1
      psraw word mm0,5       ; psraw xmm0,5
      psrad word mm0,5       ; psrad xmm0,5
      psraw word mm0,mm1     ; psraw xmm0,xmm1
      psrad word mm0,mm1     ; psrad xmm0,xmm1
      psllw word mm0,5       ; psllw xmm0,5
      pslld word mm0,5       ; pslld xmm0,5
      psllq word mm0,5       ; psllq xmm0,5
      psllw word mm0,mm1     ; psllw xmm0,xmm1
      pslld word mm0,mm1     ; pslld xmm0,xmm1
      psllq word mm0,mm1     ; psllq xmm0,xmm1
      pinsrw word mm0,[esi],5 ; pinsrw xmm0,[esi],5
      pextrw word [esi],mm0,5 ; pextrw [esi],xmm0,5
      pavgb word mm0,mm1     ; pavgb xmm0,xmm1
      pavgw word mm0,mm1     ; pavgw xmm0,xmm1
      pmullw word mm0,mm1    ; pmullw xmm0,xmm1
      pmulhuw word mm0,mm1   ; pmulhuw xmm0,xmm1
      pmulhw word mm0,mm1    ; pmulhw xmm0,xmm1
      movntq word [edi],mm1  ; movntdq [edi],xmm1
      pmaddwd word mm0,mm1   ; pmaddwd xmm0,xmm1
      psadbw word mm0,mm1    ; psadbw xmm0,xmm1
      maskmovq word mm0,mm1  ; maskmovdqu xmm0,xmm1
      movd word mm0,[esi]    ; movd xmm0,[esi]
      movd word [edi],mm0    ; movd [edi],xmm0
      movq word mm0,[esi]    ; movdqa xmm0,[esi]
      movq word [edi],mm0    ; movdqa [edi],xmm0
      psubusb word mm0,mm1   ; psubusb xmm0,xmm1
      psubusw word mm0,mm1   ; psubusw xmm0,xmm1
      psubsb word mm0,mm1    ; psubsb xmm0,xmm1
      psubsw word mm0,mm1    ; psubsw xmm0,xmm1
      psubb word mm0,mm1     ; psubb xmm0,xmm1
      psubw word mm0,mm1     ; psubw xmm0,xmm1
      psubd word mm0,mm1     ; psubd xmm0,xmm1
      paddusb word mm0,mm1   ; paddusb xmm0,xmm1
      paddusw word mm0,mm1   ; paddusw xmm0,xmm1
      paddsb word mm0,mm1    ; paddsb xmm0,xmm1
      paddsw word mm0,mm1    ; paddsw xmm0,xmm1
      paddb word mm0,mm1     ; paddb xmm0,xmm1
      paddw word mm0,mm1     ; paddw xmm0,xmm1
      paddd word mm0,mm1     ; paddd xmm0,xmm1
      pminub word mm0,mm1    ; pminub xmm0,xmm1
      pminsw word mm0,mm1    ; pminsw xmm0,xmm1
      pmaxub word mm0,mm1    ; pmaxub xmm0,xmm1
      pmaxsw word mm0,mm1    ; pmaxsw xmm0,xmm1
      pand word mm0,mm1      ; pand xmm0,xmm1
      pandn word mm0,mm1     ; pandn xmm0,xmm1
      por word mm0,mm1       ; por xmm0,xmm1
      pxor word mm0,mm1      ; pxor xmm0,xmm1
In addition the MOVDQU instruction (unaligned move) may be assembled as follows:
      repe movq mm0,[esi]    ; movdqu xmm0,[esi]
      repe movq [edi],mm0    ; movdqu [edi],xmm0
In all cases, where mm0 or mm1 (xmm0 or xmm1) is shown, any of the eight registers may be used instead. In many cases a memory reference can be used instead of the mm1 register in the 'source' field.
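The aligned and unaligned 128-bit moves differ from MMX movq only in their prefix byte: 66h for movdqa (assumed here to be what the word qualifier emits) and F3h, the repe prefix, for movdqu. Choosing a different register changes only the 'reg' field of the ModRM byte. A small Python sketch of those encodings, with hypothetical helper names and covering only the simple [esi] and [edi] style of addressing:

```python
# Illustrative only: hand-encodes the load and store forms shown above.
PREFIX_ALIGNED = 0x66    # assumed output of the 'word' qualifier (movdqa)
PREFIX_UNALIGNED = 0xF3  # the repe prefix byte (movdqu)
MOVQ_LOAD = 0x6F         # movq mm,[mem]  (after the 0Fh escape byte)
MOVQ_STORE = 0x7F        # movq [mem],mm
ESI, EDI = 6, 7          # register numbers used in the ModRM byte


def modrm_indirect(reg, base):
    """ModRM byte for a [base] operand with no displacement (mod = 00b)."""
    if base in (4, 5):
        # esp and ebp need extra bytes in this mode; not handled here.
        raise ValueError("base register not supported by this sketch")
    return (reg << 3) | base


def encode_move(prefix, opcode, reg, base):
    """Prefix byte, 0Fh escape, opcode and ModRM for one 128-bit move."""
    return bytes([prefix, 0x0F, opcode, modrm_indirect(reg, base)])


load_u = encode_move(PREFIX_UNALIGNED, MOVQ_LOAD, 0, ESI)   # movdqu xmm0,[esi]
store_u = encode_move(PREFIX_UNALIGNED, MOVQ_STORE, 0, EDI)  # movdqu [edi],xmm0
load_a = encode_move(PREFIX_ALIGNED, MOVQ_LOAD, 3, ESI)     # movdqa xmm3,[esi]
print(load_u.hex(" "), "|", store_u.hex(" "), "|", load_a.hex(" "))
```

Note that movdqa requires its memory operand to be aligned to a 16-byte boundary, whereas movdqu does not.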
