Arithmetic Instruction

Embedded Processor Architecture

Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012

Arithmetic Instructions

The arithmetic instructions define the set of operations performed by the processor's Arithmetic Logic Unit (ALU). The arithmetic instructions are further classified into binary, decimal, logical, shift/rotate, and bit/byte manipulation instructions.

Binary Operations

The binary arithmetic instructions perform basic binary integer computations on byte, word, and doubleword integers located in memory and/or the general-purpose registers, as described in Table 5.4.

Table 5.4. Binary Arithmetic Operation Instructions

Instruction Mnemonic Example Description
ADD ADD EAX, EAX Add the contents of EAX to EAX
ADC ADC EAX, EAX Add with carry
SUB SUB EAX, 0002h Subtract 2 from the register
SBB SBB EBX, 0002h Subtract with borrow
MUL MUL EBX Unsigned multiply EAX by EBX; results in EDX:EAX
DIV DIV EBX Unsigned divide
INC INC [EAX] Increment the value at memory address EAX by 1
DEC DEC EAX Decrement EAX by 1
NEG NEG EAX Two's complement negation

Decimal Operations

The decimal arithmetic instructions perform decimal arithmetic on binary coded decimal (BCD) data, as described in Table 5.5. BCD is not used as much as it has been in the past, but it still remains relevant for some financial and industrial applications.

Table 5.5. Decimal Operation Instructions (Subset)

Instruction Mnemonic Example Description
DAA DAA Decimal adjust after addition
DAS DAS Decimal adjust AL after subtraction. Adjusts the result of the subtraction of two packed BCD values to create a packed BCD result
AAA AAA ASCII adjust after addition. Adjusts the sum of two unpacked BCD values to create an unpacked BCD result
AAS AAS ASCII adjust after subtraction. Adjusts the result of the subtraction of two unpacked BCD values to create an unpacked BCD result
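To make the adjust-after-addition rule concrete, the following C sketch models a packed BCD add followed by a DAA-style correction. It is illustrative only and is not taken from the text: the helper name bcd_add_daa is invented, and the flag handling is simplified (the real instruction tracks the auxiliary-carry and carry flags in EFLAGS).

#include <stdint.h>
#include <stdio.h>

/* Model of a packed-BCD add followed by a DAA-style adjustment.
 * Each byte holds two decimal digits (0x42 represents 42).
 * Illustrative sketch only; not the actual hardware behavior. */
static uint8_t bcd_add_daa(uint8_t a, uint8_t b)
{
    unsigned sum = a + b;                        /* plain binary add, as ADD would do */
    int aux = ((a & 0x0F) + (b & 0x0F)) > 0x0F;  /* nibble carry (auxiliary carry)    */
    if ((sum & 0x0F) > 9 || aux)
        sum += 0x06;                             /* fix the low decimal digit         */
    if (sum > 0x99)
        sum += 0x60;                             /* fix the high decimal digit        */
    return (uint8_t)sum;                         /* carry out of the byte is dropped  */
}

int main(void)
{
    printf("%02X\n", bcd_add_daa(0x27, 0x35));   /* packed BCD 27 + 35 -> prints 62 */
    return 0;
}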

Logical Operations

The logical instructions perform basic AND, OR, XOR, and NOT logical operations on byte, word, and doubleword values, as described in Table 5.6.

Table 5.6. Logical Operation Instructions

Instruction Mnemonic Example Description
AND AND EAX, 0ffffh Performs bitwise logical AND
OR OR EAX, 0fffffff0h Performs bitwise logical OR
XOR XOR EBX, 0fffffff0h Performs bitwise logical XOR
NOT NOT [EAX] Performs bitwise logical NOT

Shift Rotate Operations

The shift and rotate instructions shift and rotate the bits in word and doubleword operands. Table 5.7 shows some examples.

Table 5.7. Shift and Rotate Instructions

Instruction Mnemonic Example Description
SAR SAR EAX, 4h Shifts arithmetic right
SHR SHR EAX, 1 Shifts logical right
SAL/SHL SAL EAX, 1 Shifts arithmetic left/shifts logical left
SHRD SHRD EAX, EBX, 4 Shifts right double
SHLD SHLD EAX, EBX, 4 Shifts left double
ROR ROR EAX, 4h Rotates right
ROL ROL EAX, 4h Rotates left
RCR RCR EAX, 4h Rotates through carry right
RCL RCL EAX, 4h Rotates through carry left

The arithmetic shift operations are often used in power-of-2 arithmetic (such as a multiply by 2), as the instructions are much faster than the equivalent multiply or divide operation.
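As a quick C illustration of this point (a sketch, not taken from the text), a left shift by n multiplies by 2^n and an arithmetic right shift divides a signed value by 2^n:

#include <stdio.h>

int main(void)
{
    int x = 20;
    /* Multiply by 4: compilers typically emit a shift (SHL/SAL) here rather than a multiply. */
    int times4 = x << 2;    /* 20 * 4 = 80 */
    /* Divide by 8 with an arithmetic shift (SAR); note that for negative values
     * this rounds toward negative infinity, unlike C's / operator. */
    int div8 = x >> 3;      /* 20 / 8 = 2 */
    printf("%d %d\n", times4, div8);
    return 0;
}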

Bit/Byte Operations

Bit instructions test and modify individual bits in word and doubleword operands, as described in Table 5.8. Byte instructions set the value of a byte operand to indicate the status of flags in the EFLAGS register.

Table 5.8. Bit/Byte Operation Instructions

Instruction Mnemonic Example Description
BT BT EAX, 4h Bit test. Stores selected bit in Carry flag
BTS BTS EAX, 4h Bit test and set. Stores selected bit in Carry flag and sets the bit
BTR BTR EAX, 4h Bit test and reset. Stores selected bit in Carry flag and clears the bit
BTC BTC EAX, 4h Bit test and complement. Stores selected bit in Carry flag and complements the bit
BSF BSF EBX, [EAX] Bit scan forward. Searches the source operand (second operand) for the least significant set bit (1 bit)
BSR BSR EBX, [EAX] Bit scan reverse. Searches the source operand (second operand) for the most significant set bit (1 bit)
SETE/SETZ SETE EAX Conditional set byte if equal/set byte if zero
TEST TEST EAX, 0ffffffffh Logical compare. Computes the bit-wise logical AND of the first operand (source 1 operand) and the second operand (source 2 operand) and sets the SF, ZF, and PF status flags according to the result
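A short C model (a sketch only, not from the text; the helper names are invented) shows the semantics that BT and BTS implement: report the selected bit, as BT copies it to the Carry flag, and for BTS also set it.

#include <stdint.h>
#include <stdio.h>

/* C models of BT and BTS: the return value stands in for the Carry flag. */
static int bit_test(uint32_t value, unsigned bit)
{
    return (value >> bit) & 1u;
}

static int bit_test_and_set(uint32_t *value, unsigned bit)
{
    int old = (*value >> bit) & 1u;   /* what BT/BTS would place in CF */
    *value |= (1u << bit);            /* BTS then sets the bit         */
    return old;
}

int main(void)
{
    uint32_t flags = 0x10;
    printf("bit 4 = %d\n", bit_test(flags, 4));                /* 1                */
    printf("old bit 0 = %d\n", bit_test_and_set(&flags, 0));   /* 0, flags is 0x11 */
    printf("flags = 0x%X\n", flags);
    return 0;
}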

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123914903000059

Instruction Sets

Joseph Yiu, in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2009

4.3.3 Assembler Language: Processing Data

The Cortex-M3 provides many different instructions for data processing. A few basic ones are introduced here. Many data operation instructions can have multiple instruction formats. For example, an ADD instruction can operate between two registers or between one register and an immediate data value:

ADD   R0, R0, R1   ; R0 = R0 + R1

ADDS   R0, R0, #0x12   ; R0 = R0 + 0x12

ADD.W R0, R1, R2   ; R0 = R1 + R2

These are all ADD instructions, but they have different syntaxes and binary coding.

With the traditional Thumb instruction syntax, when 16-bit Thumb code is used, an ADD instruction can modify the flags in the PSR. However, 32-bit Thumb-2 code can either change the flags or keep them unchanged. To separate the two different operations, the S suffix should be used if the following operation depends on the flags:

ADD.W   R0, R1, R2 ; Flag unchanged

ADDS.W R0, R1, R2 ; Flag change

Aside from ADD instructions, the arithmetic functions that the Cortex-M3 supports include subtract (SUB), multiply (MUL), and unsigned and signed divide (UDIV/SDIV). Table 4.18 shows some of the most commonly used arithmetic instructions.

Table 4.18. Examples of Arithmetic Instructions

Instruction Operation
ADD Rd, Rn, Rm   ; Rd = Rn + Rm ADD operation
ADD Rd, Rd, Rm   ; Rd = Rd + Rm
ADD Rd, #immed   ; Rd = Rd + #immed
ADD Rd, Rn, #immed   ; Rd = Rn + #immed
ADC Rd, Rn, Rm   ; Rd = Rn + Rm + carry ADD with carry
ADC Rd, Rd, Rm   ; Rd = Rd + Rm + carry
ADC Rd, #immed   ; Rd = Rd + #immed + carry
ADDW Rd, Rn, #immed   ; Rd = Rn + #immed ADD register with 12-bit immediate value
SUB Rd, Rn, Rm   ; Rd = Rn − Rm SUBTRACT
SUB Rd, #immed   ; Rd = Rd − #immed
SUB Rd, Rn, #immed   ; Rd = Rn − #immed
SBC Rd, Rm   ; Rd = Rd − Rm − borrow SUBTRACT with borrow (not carry)
SBC.W Rd, Rn, #immed ; Rd = Rn − #immed − borrow
SBC.W Rd, Rn, Rm   ; Rd = Rn − Rm − borrow
RSB.W Rd, Rn, #immed ; Rd = #immed − Rn Reverse subtract
RSB.W Rd, Rn, Rm   ; Rd = Rm − Rn
MUL Rd, Rm   ; Rd = Rd * Rm Multiply
MUL.W Rd, Rn, Rm   ; Rd = Rn * Rm
UDIV Rd, Rn, Rm   ; Rd = Rn/Rm Unsigned and signed divide
SDIV Rd, Rn, Rm   ; Rd = Rn/Rm

These instructions can be used with or without the "S" suffix to determine whether the APSR should be updated. In most cases, if UAL syntax is selected and the "S" suffix is not used, the 32-bit version of the instruction will be selected, as most of the 16-bit Thumb instructions update the APSR.

The Cortex-M3 also supports 32-bit multiply instructions and multiply accumulate instructions that give 64-bit results. These instructions support signed or unsigned values (see Table 4.19).

Table 4.19. 32-Bit Multiply Instructions

Instruction Operation
SMULL RdLo, RdHi, Rn, Rm ; {RdHi,RdLo} = Rn * Rm 32-bit multiply instructions for signed values
SMLAL RdLo, RdHi, Rn, Rm ; {RdHi,RdLo} += Rn * Rm
UMULL RdLo, RdHi, Rn, Rm ; {RdHi,RdLo} = Rn * Rm 32-bit multiply instructions for unsigned values
UMLAL RdLo, RdHi, Rn, Rm ; {RdHi,RdLo} += Rn * Rm

Another group of data processing instructions consists of the logical operation instructions such as AND and ORR (OR), and the shift and rotate functions. Table 4.20 shows some of the most commonly used logical instructions. These instructions can be used with or without the "S" suffix to determine whether the APSR should be updated. If UAL syntax is used and the "S" suffix is not used, the 32-bit version of the instruction will be selected, as all of the 16-bit logic operation instructions update the APSR.

Table 4.20. Logic Operation Instructions

Instruction Operation
AND Rd, Rn   ; Rd = Rd & Rn Bitwise AND
AND.W Rd, Rn, #immed ; Rd = Rn & #immed
AND.W Rd, Rn, Rm   ; Rd = Rn & Rm
ORR Rd, Rn   ; Rd = Rd | Rn Bitwise OR
ORR.W Rd, Rn, #immed ; Rd = Rn | #immed
ORR.W Rd, Rn, Rm   ; Rd = Rn | Rm
BIC Rd, Rn   ; Rd = Rd & (~Rn) Bit clear
BIC.W Rd, Rn, #immed ; Rd = Rn & (~#immed)
BIC.W Rd, Rn, Rm   ; Rd = Rn & (~Rm)
ORN.W Rd, Rn, #immed ; Rd = Rn | (~#immed) Bitwise OR NOT
ORN.W Rd, Rn, Rm   ; Rd = Rn | (~Rm)
EOR Rd, Rn   ; Rd = Rd ^ Rn Bitwise Exclusive OR
EOR.W Rd, Rn, #immed ; Rd = Rn ^ #immed
EOR.W Rd, Rn, Rm   ; Rd = Rn ^ Rm

The Cortex-M3 provides rotate and shift instructions. In some cases, the rotate operation can be combined with other operations (for example, in memory address offset calculation for load/store instructions). For standalone rotate/shift operations, the instructions shown in Table 4.21 are provided. Again, the 32-bit version of the instruction is used if the "S" suffix is not used and if UAL syntax is used.

Table 4.21. Shift and Rotate Instructions

Instruction Operation
ASR Rd, Rn, #immed   ; Rd = Rn >> immed Arithmetic shift right
ASR Rd, Rn   ; Rd = Rd >> Rn
ASR.W Rd, Rn, Rm   ; Rd = Rn >> Rm
LSL Rd, Rn, #immed   ; Rd = Rn << immed Logical shift left
LSL Rd, Rn   ; Rd = Rd << Rn
LSL.W Rd, Rn, Rm   ; Rd = Rn << Rm
LSR Rd, Rn, #immed   ; Rd = Rn >> immed Logical shift right
LSR Rd, Rn   ; Rd = Rd >> Rn
LSR.W Rd, Rn, Rm   ; Rd = Rn >> Rm
ROR Rd, Rn   ; Rd rot by Rn Rotate right
ROR.W Rd, Rn, #immed ; Rd = Rn rot by immed
ROR.W Rd, Rn, Rm   ; Rd = Rn rot by Rm
RRX.W Rd, Rn   ; {C, Rd} = {Rn, C} Rotate right extended

In UAL syntax, the rotate and shift operations can also update the carry flag if the S suffix is used (and they always update the carry flag if 16-bit Thumb code is used). See Figure 4.1.

FIGURE 4.1. Shift and Rotate Instructions.

If the shift or rotate operation shifts the register by multiple bit positions, the value of the carry flag C will be the last bit that shifts out of the register.

Why Is There Rotate Right But No Rotate Left?

The rotate left operation can be replaced by a rotate right operation with a different rotate offset. For instance, a rotate left by 4 bits can be written as a rotate right by 28 bits, which gives the same result and takes the same amount of time to execute.
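A small C sketch (not from the text; the helper name ror32 is invented) makes the equivalence easy to verify:

#include <stdint.h>
#include <stdio.h>

/* Rotate right by n bits; a rotate left by n is the same as a rotate
 * right by (32 - n), which is why only ROR is needed. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

int main(void)
{
    uint32_t v = 0x12345678u;
    /* Rotate left by 4 expressed as rotate right by 28. */
    printf("0x%08X\n", ror32(v, 28));   /* prints 0x23456781 */
    return 0;
}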

For conversion of signed data from byte or half word to word, the Cortex-M3 provides the two instructions shown in Table 4.22. Both 16-bit and 32-bit versions are available. The 16-bit version can only access low registers.

Table 4.22. Sign Extend Instructions

Instruction Operation
SXTB Rd, Rm ; Rd = signext(Rm[7:0]) Sign extend byte data into word
SXTH Rd, Rm ; Rd = signext(Rm[15:0]) Sign extend half word data into word
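In C, the same sign extension can be expressed with casts, as in the sketch below (illustrative only; the function names sxtb and sxth are just labels borrowed from the instruction mnemonics):

#include <stdint.h>
#include <stdio.h>

/* C equivalents of the behavior in Table 4.22: take the low 8 or 16 bits
 * of a word and sign extend them to 32 bits. */
static int32_t sxtb(uint32_t rm) { return (int8_t)(rm & 0xFFu); }
static int32_t sxth(uint32_t rm) { return (int16_t)(rm & 0xFFFFu); }

int main(void)
{
    printf("0x%08X\n", (uint32_t)sxtb(0x000000F0u));  /* 0xFFFFFFF0 */
    printf("0x%08X\n", (uint32_t)sxth(0x00008001u));  /* 0xFFFF8001 */
    return 0;
}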

Another group of data processing instructions is used for reversing data bytes in a register (see Table 4.23). These instructions are usually used for conversion between little endian and big endian data. See Figure 4.2. Both 16-bit and 32-bit versions are available. The 16-bit version can only access low registers.

Table 4.23. Data Reverse Ordering Instructions

Instruction Operation
REV Rd, Rn   ; Rd = rev(Rn) Reverse bytes in word
REV16 Rd, Rn ; Rd = rev16(Rn) Reverse bytes in each half word
REVSH Rd, Rn ; Rd = revsh(Rn) Reverse bytes in bottom half word and sign extend the result

FIGURE 4.2. Operation of Reverse Instructions.
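The byte reordering can be modeled in C as shown below (a sketch, not from the text; modern compilers often recognize such code and emit the REV instructions directly):

#include <stdint.h>
#include <stdio.h>

/* C models of REV and REV16 from Table 4.23, as used for
 * little endian/big endian conversion. */
static uint32_t rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

static uint32_t rev16(uint32_t x)
{
    /* Reverse the bytes within each half word independently. */
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

int main(void)
{
    printf("0x%08X\n", rev(0x12345678u));    /* 0x78563412 */
    printf("0x%08X\n", rev16(0x12345678u));  /* 0x34127856 */
    return 0;
}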

The final group of data processing instructions is for bit field processing. They include the instructions shown in Table 4.24. Examples of these instructions are provided in a later part of this chapter.

Table 4.24. Bit Field Processing and Manipulation Instructions

Instruction Operation
BFC.W Rd, #<lsb>, #<width> Clear bit field within a register
BFI.W Rd, Rn, #<lsb>, #<width> Insert bit field into a register
CLZ.W Rd, Rn Count leading zeros
RBIT.W Rd, Rn Reverse bit order in register
SBFX.W Rd, Rn, #<lsb>, #<width> Copy bit field from source and sign extend it
UBFX.W Rd, Rn, #<lsb>, #<width> Copy bit field from source register
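As a preview of what these instructions compute, the C sketch below models unsigned bit field extract (UBFX) and bit field insert (BFI). It is illustrative only and is not the example code referred to above.

#include <stdint.h>
#include <stdio.h>

/* C models of UBFX and BFI from Table 4.24. */
static uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (rn >> lsb) & mask;
}

static uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = ((width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u)) << lsb;
    return (rd & ~mask) | ((rn << lsb) & mask);
}

int main(void)
{
    printf("0x%X\n", ubfx(0x12345678u, 8, 8));          /* 0x56       */
    printf("0x%08X\n", bfi(0xFFFFFFFFu, 0xA5u, 4, 8));  /* 0xFFFFFA5F */
    return 0;
}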

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781856179638000077

The Linux/ARM embedded platform

Jason D. Bakos, in Embedded Systems, 2016

1.13 Basic ARM Instruction Set

This section provides a concise summary of a basic subset of the ARM instruction set. The information provided here is just enough to get you started writing basic ARM assembly programs, and does not include any specialized instructions, such as system instructions and those related to coprocessors. Note that in the following tables, the instruction mnemonics are shown in uppercase, but can be written in uppercase or lowercase.

1.13.1 Integer arithmetic instructions

Table 1.4 shows a list of integer arithmetic instructions. All of these support conditional execution, and all will update the status register when the S suffix is specified. Some of these instructions—those with "operand2"—support the flexible second operand as described earlier in this chapter. This allows these instructions to have either a register, a shifted register, or an immediate as the second operand.

Table 1.4. Integer Arithmetic Instructions

Instruction Description Functionality
ADC{S}{<cond>} Rd, Rn, operand2 Add with carry R[Rd] = R[Rn] + operand2 + C flag
ADD{S}{<cond>} Rd, Rn, operand2 Add R[Rd] = R[Rn] + operand2
MLA{S}{<cond>} Rd, Rn, Rm, Ra Multiply-accumulate R[Rd] = R[Rn] * R[Rm] + R[Ra]
MUL{S}{<cond>} Rd, Rn, Rm Multiply R[Rd] = R[Rn] * R[Rm]
RSB{S}{<cond>} Rd, Rn, operand2 Reverse subtract R[Rd] = operand2 - R[Rn]
RSC{S}{<cond>} Rd, Rn, operand2 Reverse subtract with carry R[Rd] = operand2 - R[Rn] - not(C flag)
SBC{S}{<cond>} Rd, Rn, operand2 Subtract with carry R[Rd] = R[Rn] - operand2 - not(C flag)
SMLAL{S}{<cond>} RdLo, RdHi, Rn, Rm Signed multiply accumulate long R[RdHi] = upper32bits(R[Rn] * R[Rm]) + R[RdHi]
R[RdLo] = lower32bits(R[Rn] * R[Rm]) + R[RdLo]
SMULL{S}{<cond>} RdLo, RdHi, Rn, Rm Signed multiply long R[RdHi] = upper32bits(R[Rn] * R[Rm])
R[RdLo] = lower32bits(R[Rn] * R[Rm])
SUB{S}{<cond>} Rd, Rn, operand2 Subtract R[Rd] = R[Rn] - operand2
UMLAL{S}{<cond>} RdLo, RdHi, Rn, Rm Unsigned multiply accumulate long R[RdHi] = upper32bits(R[Rn] * R[Rm]) + R[RdHi]
R[RdLo] = lower32bits(R[Rn] * R[Rm]) + R[RdLo]
UMULL{S}{<cond>} RdLo, RdHi, Rn, Rm Unsigned multiply long R[RdHi] = upper32bits(R[Rn] * R[Rm])
R[RdLo] = lower32bits(R[Rn] * R[Rm])

1.13.2 Bitwise logical instructions

Table 1.5 shows a list of bitwise logical instructions. All of these support conditional execution, all can update the flags when the S suffix is specified, and all support a flexible second operand.

Table 1.5. Integer Bitwise Logical Instructions

Instruction Description Functionality
AND{S}{<cond>} Rd, Rn, operand2 Bitwise AND R[Rd] = R[Rn] & operand2
BIC{S}{<cond>} Rd, Rn, operand2 Bit clear R[Rd] = R[Rn] & not operand2
EOR{S}{<cond>} Rd, Rn, operand2 Bitwise XOR R[Rd] = R[Rn] ^ operand2
ORR{S}{<cond>} Rd, Rn, operand2 Bitwise OR R[Rd] = R[Rn] | operand2

1.13.3 Shift instructions

Table 1.6 shows a list of shift instructions. All of these support conditional execution, and all can update the flags when the S suffix is specified, but note that these instructions do not support the flexible second operand.

Table 1.6. Integer Shift Instructions

Instruction Description Functionality
ASR{S}{<cond>} Rd, Rn, Rs/#sh Arithmetic shift right R[Rd] = (int)R[Rn] >> (R[Rs] or #sh)
allowed shift amount 1-32
LSR{S}{<cond>} Rd, Rn, Rs/#sh Logical shift right R[Rd] = (unsigned int)R[Rn] >> (R[Rs] or #sh)
allowed shift amount 1-32
LSL{S}{<cond>} Rd, Rn, Rs/#sh Logical shift left R[Rd] = R[Rn] << (R[Rs] or #sh)
allowed shift amount 0-31
ROR{S}{<cond>} Rd, Rn, Rs/#sh Rotate right R[Rd] = rotate R[Rn] right by (R[Rs] or #sh) bits
allowed shift amount 1-31
RRX{S}{<cond>} Rd, Rm Shift right by 1 bit
The old carry flag is shifted into R[Rd] bit 31
If used with the S suffix, the old bit 0 is placed in the carry flag

1.13.4 Move instructions

Table 1.7 shows a list of data movement instructions. The most useful of these is the MOV instruction, since its flexible second operand allows for loading immediates and register shifting.

Table 1.7. Data Movement Instructions

Instruction Description Functionality
MOV{S}{<cond>} Rd, operand2 Move R[Rd] = operand2
MRS{<cond>} Rd, CPSR Move status register or saved status register to GPR R[Rd] = CPSR
R[Rd] = SPSR
MRS{<cond>} Rd, SPSR
MSR{<cond>} CPSR_f, #imm Move to status register from ARM register fields is one of:
_c, _x, _s, _f
MSR{<cond>} SPSR_f, #imm
MSR{<cond>} CPSR_<fields>, Rm
MSR{<cond>} SPSR_<fields>, Rm
MVN{S}{<cond>} Rd, operand2 Move 1's complement R[Rd] = not operand2

1.13.5 Load and store instructions

Table 1.8 shows a list of load and store instructions. The LDR/STR instructions are ARM's bread-and-butter load and store instructions. The memory address can be specified using any of the addressing modes described earlier in this chapter.

Table 1.8. ARM Load and Store Instructions

Instruction Description Functionality
LDM{cond} <address mode> Rn{!}, <reg list in braces> Load multiple Loads multiple registers from consecutive words starting at R[Rn]
Bang (!) will autoincrement the base register
Address mode:
IA = increment after
IB = increment before
DA = decrement after
DB = decrement before
Example:
LDMIA r2!, {r3,r5-r7}
LDR{cond}{B|H|SB|SH} Rd, <address> Load register Loads from memory into Rd.
Optional size specifiers:
B = byte
H = halfword
SB = signed byte
SH = signed halfword
STM{cond} <address mode> Rn, <registers> Store multiple Stores multiple registers
Bang (!) will autoincrement the base register
Address mode:
IA = increment after
IB = increment before
DA = decrement after
DB = decrement before
Example:
STMIA r2!, {r3,r5-r7}
STR{cond}{B|H} Rd, <address> Store register Stores Rd into memory.
Optional size specifiers:
B = byte
H = halfword
SWP{cond}{B} Rd, Rm, [Rn] Swap Swap a word (or byte) between registers and memory

The LDR instruction can also be used to load symbols into base registers, e.g., "ldr r1,=data".

The LDM and STM instructions can load and store multiple registers and are often used for accessing the stack.

1.13.6 Comparison instructions

Table 1.9 lists comparison instructions. These instructions are used to set the condition flags, which are used by conditional instructions, most often conditional branches.

Table 1.9. Comparison Instructions

Instruction Description Functionality
CMN{<cond>} Rn, Rm Compare negative Sets flags based on comparison between R[Rn] and -R[Rm]
CMP{<cond>} Rn, Rm Compare Sets flags based on comparison between R[Rn] and R[Rm]
TEQ{cond} Rn, Rm Test equivalence Tests for equivalence without affecting the V flag
TST{cond} Rn, Rm Test Performs a bitwise AND of two registers and updates the flags

1.13.7 Branch instructions

Table 1.10 lists two branch instructions. The BX (branch exchange) instruction is used when branching to register values, most often for branching to the link register to return from functions. When using this instruction, the LSB of the target register specifies whether the processor will be in ARM mode or Thumb mode after the branch is taken.

Table 1.10. Branch Instructions

Instruction Description Functionality
B{L}{cond} <target> Branch Branches (and optionally links in register r14) to label
B{L}X{cond} Rm Branch and exchange Branches (and optionally links in register r14) to register. Bit 0 of the register specifies whether the instruction set mode will be standard or Thumb upon branching

1.13.8 Floating-point instructions

There are two types of floating-point instructions: the Vector Floating Point (VFP) instructions and the NEON instructions.

ARMv6 processors such as the Raspberry Pi (gen1)'s ARM11 support only VFP instructions. Newer architectures such as ARMv7 support only NEON instructions. The most common floating-point operations map to both a VFP instruction and a NEON instruction. For example, the VFP instruction FADDS and the NEON instruction VADD.F32 (when used with s-registers) both perform a single precision floating point add.

The NEON instruction set is more extensive than the VFP instruction set, so while most VFP instructions have an equivalent NEON instruction, there are many NEON instructions that perform operations not possible with VFP instructions.

In order to describe floating point and single instruction, multiple data (SIMD) programming techniques that are applicable to both the ARM11 and ARM Cortex processors, this section and Chapter 2 will cover both VFP and NEON instructions.

Table 1.11 lists the VFP and NEON versions of commonly used floating-point instructions. Like the integer arithmetic instructions, most floating-point instructions support conditional execution, but there is a separate set of flags for floating-point instructions located in the 32-bit floating-point status and control register (FPSCR). NEON instructions use only bits 31 down to 27 of this register, while VFP instructions use additional bit fields.

Table 1.11. Floating-Point Instructions

VFP Instruction Equivalent NEON Instruction Description
FADD[S|D]{cond} Fd, Fn, Fm VADD.[F32|F64] Fd, Fn, Fm Single and double precision add
FSUB[S|D]{cond} Fd, Fn, Fm VSUB.[F32|F64] Fd, Fn, Fm Single and double precision subtract
FMUL[S|D]{cond} Fd, Fn, Fm VMUL.[F32|F64] Fd, Fn, Fm Single and double precision multiply and multiply-and-negate
FNMUL[South|D]{cond} Fd, Fn, Fm VNMUL.[F32|F64] Fd, Fn, Fm
FDIV[S|D]{cond} Fd, Fn, Fm VDIV.[F32|F64] Fd, Fn, Fm Unmarried and double precision divide
FABS[S|D]{cond} Fd, Fm VABS.[F32|F64] Fd, Fn, Fm Single and double precision accented value
FNEG[Southward|D]{cond} Fd, Fm VNEG.[F32|F64] Fd, Fn, Fm Unmarried and double precision negate
FSQRT[S|D]{cond} Fd, Fm VSQRT.[F32|F64] Fd, Fn, Fm Single and double precision square root
FCVTSD{cond} Fd, Fm VCVT.F32.F64 Fd, Fm Convert double precision to single precision
FCVTDS{cond} Fd, Fm VCVT.F64.F32 Fd, Fm Convert single precision to double precision
VCVT.[S|U][32|sixteen].[F32|F64], #fbits Fd, Fm Convert floating indicate to stock-still point
VCVT.[F32|F64].[S|U][32|sixteen],#fbits Fd, Fm, #fbits Convert floating point to fixed point
FMAC[Southward|D]{cond} Fd, Fn, Fm VMLA.[F32|F64] Fd, Fn, Fm Single and double precision floating point multiply-accrue, calculates Rd   =   Fn * Fm   +   Fd
There are similar instructions that negate the contents of Fd, Fn, or both prior to use, for case, FNMSC[Due south|D], VNMLS[.F32|.F64]
FLD[Due south|D]{cond} Fd,&lt;address   &gt; VLDR{cond} Rd, &lt;   accost   &gt; Unmarried and double precision floating bespeak load/store
FST[S|D]{cond} Fd,&lt;accost   &gt; LSTR{cond} Rd, &lt;   accost   &gt;
FLDMI[Southward|D]{cond}   &lt;   address   &gt;, &lt;   FPRegs   &gt; VLDM{cond} Rn{!}, &lt;   FPRegs   &gt; Single and double precision floating indicate load/store multiple
FSTMI[S|D]{cond}   &lt;   address   &gt;, &lt;   FPRegs   &gt; VSTM{cond} Rn{!}, &lt;   FPRegs   &gt;
FMRX{cond} Rd FMRX Rd Move from/to floating bespeak condition and control
FMXR{cond} Rm FMXR Rm register
FCPY[Due south|D]{cond} Fd,Fm VMOV{cond} Fd,Fm Copy floating point register

Floating-point instructions use a separate set of registers from the integer instructions. ARMv6/VFP provides 32 floating-point registers, used as 32 individual single-precision registers named s0-s31 or as 16 double-precision registers named d0-d15.

ARMv7/NEON provides 64 floating-point registers, which can be used in many more ways, such as:

64 single-precision registers named s0-s63,

32 two-element single-precision registers named d0-d31,

16 four-element single-precision registers named q0-q15,

32 double-precision registers named d0-d31, and

16 two-element double-precision registers named q0-q15.

In both VFP and NEON, register d0 consumes the same physical space as registers s0 and s1, and register d1 consumes the same space as registers s2 and s3.
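The overlap can be pictured with a C union, as in the conceptual sketch below. This only models the storage layout described above; it is not how the registers are actually accessed from C.

#include <stdio.h>

/* One double-precision register occupies the same storage as a pair of
 * single-precision registers (d0 overlaps s0 and s1). */
union d_reg {
    float  s[2];   /* like s0 and s1 */
    double d;      /* like d0        */
};

int main(void)
{
    union d_reg r;
    r.s[0] = 1.0f;
    r.s[1] = 2.0f;
    printf("%u bytes shared by the two views\n", (unsigned)sizeof r);  /* 8 */
    return 0;
}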

Values in floating-point registers can be exchanged with general-purpose registers, and there is hardware support for type conversion between single precision, double precision, and integer.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128003428000018

Pixel Shader Reference

Ron Fosner, in Real-Time Shader Programming, 2003

Note:

If you used the D3DTOP_ADDSIGNED2X texture operation in one of your DirectX texture stages, the signed scaling modifier performs the same operation.

Rules for using signed source scaling:

For use only with arithmetic instructions.

Cannot be combined with the invert modifier.

Initial data outside the [0, 1] range may produce undefined results.

source scale 2X

PS 1.4 The scale by two modifier is used for shifting the range of the input register from the [0, 1] range to the [−1, +1] range, typically when you want to use the full signed range of which registers are capable. The scale by two modifier is indicated by adding a _x2 suffix to a register. Essentially, the modifier multiplies the register values by 2 before they are used. The source register values are unchanged.

Rules for using scale by two:

For use only with arithmetic instructions.

Cannot be combined with the invert modifier.

Available for PS 1.4 shaders only.

source replication/selection

Just as vertex shaders let you select the particular elements of a source register to use, so do pixel shaders, with some differences. You can select only a single element, and that element will be replicated to all channels. You specify a channel to replicate by adding a .n suffix to the register, where n is r, g, b, or a (or x, y, z, or w).

SOURCE REGISTER SELECTORS
REGISTER SWIZZLE
PS version .rrrr .gggg .bbbb .aaaa .gbra .brga .abgr
1.0 x
1.1 x x
1.2 x x
1.3 x x
1.4 phase 1 x x x x
1.4 phase 2 x x x x
2.0 x x x x x x x

texture register modifiers ps 1.4 only

PS 1.4 PS 1.4 has its own set of modifiers for texture instructions. Since only the texcrd and texld instructions are used to load or sample textures with PS 1.4, these modifiers are unique to those instructions. Note that you can interchange .rgba syntax with .xyzw syntax; thus –dz is the same as –db.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978155860853550010X

Arithmetic optimization and the Linux Framebuffer

Jason D. Bakos, in Embedded Systems, 2016

3.7 Fixed-Point Performance

As compared to floating point, using fixed point reduces the latency after each arithmetic instruction at the cost of additional instructions required for rounding and radix point management, although if the overhead code contains sufficient instruction-level parallelism, the impact of these additional instructions on throughput may not be substantial.
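The C sketch below shows the kind of overhead being described, using an assumed Q16.16 format: the 64-bit product, the rounding constant, and the corrective shift are exactly the rounding and radix point management instructions a fixed-point build must add.

#include <stdint.h>
#include <stdio.h>

typedef int32_t q16_16;          /* assumed fixed-point format: 16 integer, 16 fraction bits */
#define Q_ONE (1 << 16)

static q16_16 q_mul(q16_16 a, q16_16 b)
{
    int64_t p = (int64_t)a * b;  /* 32 x 32 -> 64-bit product */
    p += 1 << 15;                /* round to nearest          */
    return (q16_16)(p >> 16);    /* restore the radix point   */
}

int main(void)
{
    q16_16 x = (q16_16)(1.5 * Q_ONE);    /* 1.5  */
    q16_16 y = (q16_16)(2.25 * Q_ONE);   /* 2.25 */
    printf("%f\n", q_mul(x, y) / (double)Q_ONE);   /* 3.375 */
    return 0;
}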

On the other hand, for graphics applications like the image transformation that require frequent conversions between floating point and integer, using fixed point may result in a reduction of executed instructions.

In fact, when compared to the floating-point implementation on the Raspberry Pi, the fixed-point implementation achieves approximately the same CPI and cache miss rate, but decreases the number of instructions per pixel from 225 to 160. This resulted in a speedup of throughput of approximately 40%.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128003428000031

Overview of Digital Signal Processing Algorithms

Robert Oshana, in DSP Software Development Techniques for Embedded and Real-Time Systems, 2006

Basic Software Implementation

The implementation of a FIR filter is straightforward; it's just a weighted moving average. Any processor with decent arithmetic instructions or a math library can perform the necessary computations. The real constraint is speed. Many general-purpose processors can't perform the calculations fast enough to generate real-time output from real-time input. This is why a DSP is used.

A dedicated hardware solution like a DSP has two major speed advantages over a general-purpose processor. A DSP has multiple arithmetic units, which can all be working in parallel on individual terms of the weighted average. A DSP architecture also has data paths that closely mirror the data movements used by the FIR filter. The delay line in a DSP automatically aligns the current window of samples with the appropriate coefficients, which increases throughput considerably. The results of the multiplications automatically flow to the accumulating adders, further increasing efficiency.

DSP architectures provide these optimizations and concurrency opportunities in a programmable processor. DSP processors have multiple arithmetic units that can be used in parallel, which closely mimics the parallelism in the filtering algorithm. These DSPs also tend to have special data movement operations. These operations can "shift" data among special-purpose registers in the DSP. DSP processors almost always have special compound instructions (like a multiply and accumulate, or MAC, operation) that allow data to flow directly from a multiplier into an accumulator without explicit control intervention (Figure 4.17). This is why a DSP can perform one of these MAC operations in one clock cycle. A significant part of learning to use a particular DSP processor efficiently is learning how to exploit these special features.

Figure 4.17. DSPs have optimized MAC instructions to perform multiply and accumulate operations very quickly

In a DSP context, a "MAC" is the operation of multiplying a coefficient by the corresponding delayed data sample and accumulating the result. FIR filters usually require one MAC per tap.
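The inner loop that the MAC hardware accelerates looks like the following C sketch (illustrative only; a real filter would also shift the delay line after each output sample):

#include <stdio.h>

#define NUM_TAPS 4

/* One output sample of an N-tap FIR filter: a weighted moving average
 * computed as one multiply-accumulate (MAC) per tap. */
static float fir_sample(const float coeff[], const float delay[], int taps)
{
    float acc = 0.0f;
    for (int i = 0; i < taps; i++)
        acc += coeff[i] * delay[i];   /* the MAC: multiply, then accumulate */
    return acc;
}

int main(void)
{
    const float coeff[NUM_TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };  /* 4-tap average  */
    const float delay[NUM_TAPS] = { 1.0f, 2.0f, 3.0f, 4.0f };      /* current window */
    printf("%f\n", fir_sample(coeff, delay, NUM_TAPS));            /* 2.5 */
    return 0;
}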

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780750677592500065

Scalable parallel execution

Mark Ebersole, in Programming Massively Parallel Processors (Third Edition), 2017

3.7 Thread Scheduling and Latency Tolerance

Thread scheduling is strictly an implementation concept. Thus, it must be discussed in the context of specific hardware implementations. In the majority of implementations to date, a block assigned to an SM is further divided into 32-thread units called warps. The size of warps is implementation-specific. Warps are not part of the CUDA specification; nevertheless, knowledge of warps can be helpful in understanding and optimizing the performance of CUDA applications on particular generations of CUDA devices. The size of warps is a property of a CUDA device, which is in the warpSize field of the device query variable (dev_prop in this case).

The warp is the unit of thread scheduling in SMs. Fig. 3.13 shows the partitioning of blocks into warps in an implementation. Each warp consists of 32 threads of consecutive threadIdx values: threads 0 through 31 form the first warp, 32 through 63 the second warp, and so on. In this case, three blocks—Block 1, Block 2, and Block 3—are assigned to an SM. Each of the three blocks is further divided into warps for scheduling purposes.

Figure 3.13. Blocks are partitioned into warps for thread scheduling.

We can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In Fig. 3.13, if each block has 256 threads, we can determine that each block has 256/32 or 8 warps. With three blocks in each SM, we have 8 × 3 = 24 warps in each SM.

An SM is designed to execute all threads in a warp following the Single Instruction, Multiple Data (SIMD) model—i.e., at any instant in time, one instruction is fetched and executed for all threads in the warp. This situation is illustrated in Fig. 3.13 with a single instruction fetch/dispatch shared among execution units (SPs) in the SM. These threads will apply the same instruction to different portions of the data. Consequently, all threads in a warp will always have the same execution timing.

Fig. 3.13 also shows a number of hardware Streaming Processors (SPs) that actually execute instructions. In general, there are fewer SPs than the threads assigned to each SM; i.e., each SM has only enough hardware to execute instructions from a small subset of all threads assigned to the SM at any point in time. In early GPU designs, each SM can execute only one instruction for a single warp at any given instant. In recent designs, each SM can execute instructions for a small number of warps at any point in time. In either case, the hardware can execute instructions for only a small subset of all warps in the SM. A legitimate question is why we need to have so many warps in an SM if it can only execute a small subset of them at any instant. The answer is that this is how CUDA processors efficiently execute long-latency operations, such as global memory accesses.

When an instruction to be executed by a warp needs to wait for the result of a previously initiated long-latency operation, the warp is not selected for execution. Instead, another resident warp that is no longer waiting for results will be selected for execution. If more than one warp is ready for execution, a priority mechanism is used to select one for execution. This mechanism of filling the latency time of operations with work from other threads is often called "latency tolerance" or "latency hiding" (see "Latency Tolerance" sidebar).

Warp scheduling is also used for tolerating other types of operation latencies, such as pipelined floating-point arithmetic and branch instructions. Given a sufficient number of warps, the hardware will likely find a warp to execute at any point in time, thus making full use of the execution hardware in spite of these long-latency operations. The selection of ready warps for execution avoids introducing idle or wasted time into the execution timeline, which is referred to as zero-overhead thread scheduling. With warp scheduling, the long waiting time of warp instructions is "hidden" by executing instructions from other warps. This ability to tolerate long-latency operations is the main reason GPUs do not dedicate nearly as much chip area to cache memories and branch prediction mechanisms as CPUs do. Thus, GPUs can dedicate more of their chip area to floating-point execution resources.

Latency Tolerance

Latency tolerance is also needed in various everyday situations. For instance, in post offices, each person trying to ship a package should ideally have filled out all necessary forms and labels before going to the service counter. Instead, some people wait for the service desk clerk to tell them which form to fill out and how to fill out the form.

When there is a long line in front of the service desk, the productivity of the service clerks has to be maximized. Letting a person fill out the form in front of the clerk while everyone waits is not an efficient approach. The clerk should be assisting the other customers who are waiting in line while the person fills out the form. These other customers are "ready to go" and should not be blocked by the customer who needs more time to fill out a form.

Thus, a good clerk would politely ask the first customer to step aside to fill out the form while he/she can serve other customers. In the majority of cases, the first customer will be served as soon as that customer completes the form and the clerk finishes serving the current customer, instead of that customer going to the end of the line.

We can think of these post office customers as warps and the clerk as a hardware execution unit. The customer who needs to fill out the form corresponds to a warp whose continued execution is dependent on a long-latency operation.

We are now ready for a simple exercise.3 Assume that a CUDA device allows up to 8 blocks and 1024 threads per SM, whichever becomes a limitation first. Furthermore, it allows up to 512 threads in each block. For image blur, should we use 8 × 8, 16 × 16, or 32 × 32 thread blocks? To answer the question, we can analyze the pros and cons of each choice. If we use 8 × 8 blocks, each block would have only 64 threads. We will need 1024/64 = 16 blocks to fully occupy an SM. However, each SM can only allow up to 8 blocks; thus, we will end up with only 64 × 8 = 512 threads in each SM. This limited number implies that the SM execution resources will likely be underutilized because fewer warps will be available to schedule around long-latency operations.

The 16 × 16 blocks result in 256 threads per block, implying that each SM can take 1024/256 = 4 blocks. This number is within the 8-block limitation and is a good configuration as it will allow us a full thread capacity in each SM and a maximal number of warps for scheduling around the long-latency operations. The 32 × 32 blocks would give 1024 threads in each block, which exceeds the 512 threads per block limitation of this device. Only 16 × 16 blocks allow a maximal number of threads assigned to each SM.
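The same block-size analysis can be reproduced with a few lines of plain C, shown below as a sketch; the limits are the ones assumed in the exercise, not properties of any particular device.

#include <stdio.h>

int main(void)
{
    const int max_blocks_per_sm     = 8;
    const int max_threads_per_sm    = 1024;
    const int max_threads_per_block = 512;
    const int dims[] = { 8, 16, 32 };

    for (int i = 0; i < 3; i++) {
        int threads_per_block = dims[i] * dims[i];
        if (threads_per_block > max_threads_per_block) {
            printf("%2dx%2d: %d threads per block exceeds the device limit\n",
                   dims[i], dims[i], threads_per_block);
            continue;
        }
        int blocks = max_threads_per_sm / threads_per_block;
        if (blocks > max_blocks_per_sm)
            blocks = max_blocks_per_sm;          /* block-count limit kicks in */
        int threads = blocks * threads_per_block;
        printf("%2dx%2d: %d blocks, %4d threads, %2d warps per SM\n",
               dims[i], dims[i], blocks, threads, threads / 32);
    }
    return 0;
}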

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128119860000030

INTRODUCTION TO THE ARM INSTRUCTION SET

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

3.9 SUMMARY

In this chapter we covered the ARM instruction set. All ARM instructions are 32 bits in length. The arithmetic, logical, comparison, and move instructions can all use the inline barrel shifter, which preprocesses the second register Rm before it enters the ALU.

The ARM instruction set has three types of load-store instructions: single-register load-store, multiple-register load-store, and swap. The multiple load-store instructions provide the push-pop operations on the stack. The ARM-Thumb Procedure Call Standard (ATPCS) defines the stack as being a full descending stack.

The software interrupt instruction causes a software interrupt that forces the processor into SVC mode; this instruction invokes privileged operating system routines. The program status register instructions write and read to the cpsr and spsr. There are also special pseudoinstructions that optimize the loading of 32-bit constants.

The ARMv5E extensions include count leading zeros, saturation, and improved multiply instructions. The count leading zeros instruction counts the number of binary zeros before the first binary one. Saturation handles arithmetic calculations that overflow a 32-bit integer value. The improved multiply instructions provide better flexibility in multiplying 16-bit values.

Most ARM instructions can be conditionally executed, which can dramatically reduce the number of instructions required to perform a specific algorithm.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500046

Smarter systems and the PIC 18F2420

Tim Wilmshurst, in Designing Embedded Systems with PIC Microcontrollers (Second Edition), 2010

New instructions

Finally, there are many instructions that are just plainly new. These derive in many cases from enhanced hardware or memory addressing techniques. Significant among the arithmetic instructions is the multiply, available as mulwf (multiply W and f) and mullw (multiply W and literal). These invoke the hardware multiplier, seen already in Figure 13.2. Multiplier and multiplicand are viewed as unsigned, and the result is placed in the registers PRODH and PRODL. It is worth noting that the multiply instructions cause no change to the Status flags, even though a zero result is possible.
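A C model of the result layout (a sketch only; the variable names merely echo the register names) shows how the 16-bit product splits across PRODH and PRODL:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t w = 200, f = 100;
    uint16_t product = (uint16_t)w * f;   /* unsigned 8 x 8 -> 16-bit: 20000 = 0x4E20 */
    uint8_t prodh = product >> 8;         /* 0x4E, as PRODH would hold */
    uint8_t prodl = product & 0xFF;       /* 0x20, as PRODL would hold */
    printf("PRODH=0x%02X PRODL=0x%02X\n", prodh, prodl);
    return 0;
}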

Other important additions to the instruction set are a whole block of Table Read and Write instructions, data transfer to and from the Stack, and a good selection of conditional branch instructions, which build upon the increased number of status flags in the Status register. There are also instructions that contribute to conditional branching. These include the group of compares, for example cpfseq, and the test instruction, tstfsz.

A useful new move instruction is movff, which gives a direct move from one memory location to another. This codes in two words and takes two cycles to execute. Therefore, its advantage over the two 16 Series instructions which it replaces may seem slight. It does, however, save the value of the W register from being overwritten.

Some of these new instructions will be explored in the program example and exercises of Section 13.10.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781856177504100174

Using CUDA in Practice

Shane Cook, in CUDA Programming, 2013

Memory versus operations tradeoff

With most algorithms it's possible to trade an increased memory footprint for a decreased execution time. It depends significantly on the speed of memory versus the cost and number of arithmetic instructions being traded.

There are implementations of AES that simply expand the substitution, shift rows left, and mix columns operations to a series of lookups. With a 32-bit processor, this requires a 4 K constant table and a small number of lookup and bitwise operations. Provided the 4 K lookup table remains in the cache, the execution time is greatly reduced using such a method on most processors. We will, however, at least initially implement the full algorithm before we look to this type of optimization.
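The general pattern of the tradeoff, independent of AES, is sketched below in C: a 256-byte table replaces a per-bit loop when counting set bits. This is only an illustration of trading memory for operations, not part of the AES implementation discussed in the text.

#include <stdint.h>
#include <stdio.h>

static uint8_t popcount_table[256];   /* the "memory" cost: 256 bytes */

static void build_table(void)
{
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)   /* the "operations" version */
            bits += (i >> b) & 1;
        popcount_table[i] = (uint8_t)bits;
    }
}

static int popcount32(uint32_t x)     /* the lookup version: four table reads */
{
    return popcount_table[x & 0xFF] + popcount_table[(x >> 8) & 0xFF] +
           popcount_table[(x >> 16) & 0xFF] + popcount_table[x >> 24];
}

int main(void)
{
    build_table();
    printf("%d\n", popcount32(0xF0F0F0F0u));   /* 16 */
    return 0;
}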

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124159334000077