Está en la página 1de 44

Cortex-M4

Architecture and
ASM Programming

Chapter

Introduction
In this chapter programming the Cortex-M4 in assembly and C
will be introduced. Preference will be given to explaining code
development for the STM32F4 Discovery and LPC4088 Quick
Start. The basis for the material presented in this chapter is the
course notes from the ARM LiB program1.

Overview
Cortex-M4 Memory Map
Cortex-M4 Memory Map
Bit-band Operations
Cortex-M4 Program Image and Endianness
ARM Cortex-M4 Processor Instruction Set
ARM and Thumb Instruction Set
Cortex-M4 Instruction Set
1. LiB Low-level Embedded NXP LPC4088 Quick Start

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Memory Map


The Cortex-M4 processor has 4 GB of memory address space
Support for bit-band operation (detailed later)
The 4GB memory space is architecturally defined as a number of regions
Each region is given for recommended usage
Easy for software programmer to port between different
devices
Nevertheless, despite of the default memory map, the actual
usage of the memory map can also be flexibly defined by the
user, except some fixed memory addresses, such as internal
private peripheral bus

32

ECE 5655/4655 Real-Time DSP

ECE 5655/4655 Real-Time DSP


Code

Peripherals

Mainly used for on-chip peripherals


e.g. AHB, APB peripherals

Mainly used for program code


e.g. on-chip FLASH

External RAM

Mainly used for external memories


e.g. external DDR, FLASH, LCD

SRAM

External device

Mainly used for external peripherals


e.g. SD card

Mainly used for data memory


e.g. on-chip SRAM, SDRAM

Private Peripheral Bus


(PPB)

Vendor specific
Memory

Private peripherals
e.g. NVIC, SCS

Reserved for other purposes

0xFFFFFFFF

0x00000000

0x20000000
0x1FFFFFFF

0x40000000
0x3FFFFFFF

0x60000000
0x5FFFFFFF

0xA0000000
0x9FFFFFFF

0xE0000000
0xDFFFFFFF

0xE0100000
0xE00FFFFF

512MB

512MB

512MB

1GB

1GB

512MB

Instrumentation trace macrocell

Data watchpoint and trace unit

Reserved
Fetch patch and breakpoint unit

System Control Space, including


Nested Vectored Interrupt
Controller (NVIC)

Reserved

Trace port interface unit

Embedded trace macrocell

External PPB

ROM table

Internal PPB

External PPB

M4 Memory Map (cont.)

M4 Memory Map (cont.)

33

Chapter 3 Cortex-M4 Architecture and ASM Programming

M4 Memory Map (cont.)


Code Region
Primarily used to store program code
Can also be used for data memory
On-chip memory, such as on-chip FLASH
SRAM Region
Primarily used to store data, such as heaps and stacks
Can also be used for program code
On-chip memory; despite its name SRAM, the actual
device could be SRAM, SDRAM or other types
Peripheral Region
Primarily used for peripherals, such as Advanced Highperformance Bus (AHB) or Advanced Peripheral Bus
(APB) peripherals
External RAM Region
Primarily used to store large data blocks, or memory
caches
Off-chip memory, slower than on-chip SRAM region
External Device Region
Primarily used to map to external devices
Off-chip devices, such as SD card
Internal Private Peripheral Bus (PPB)
34

ECE 5655/4655 Real-Time DSP

Cortex-M4 Memory Map Example

Used inside the processor core for internal control


Within PPB, a special range of memory is defined as System Control Space (SCS)
The Nested Vectored Interrupt Controller (NVIC) is part of
SCS

Cortex-M4 Memory Map Example


Chip
Silicon
Cortex-M4

PPB

SCS

NVIC
Debug Ctrl

AHB bus
On-chip FLASH
(Code Region)

On-chip SRAM
(SRAM Region)

External memory interface


(External RAM Region)

External SRAM,
FLASH

External LCD

ECE 5655/4655 Real-Time DSP

Timer UART

GPIO

Peripheral Region

External device interface


(External Device Region)

SD card

35

Chapter 3 Cortex-M4 Architecture and ASM Programming

Bit-band Operations
Bit-band operation allows a single load/store operation to
access a single bit in the memory, for example, to change a
single bit of one 32-bit data:
Normal operation without bit-band (read-modify-write)
Read the value of 32-bit data
Modify a single bit of the 32-bit value (keep other bits
unchanged)
Write the value back to the address
Bit-band operation
Directly write a single bit (0 or 1) to the bit-band alias
address of the data
Bit-band alias address
Each bit-band alias address is mapped to a real data
address
When writing to the bit-band alias address, only a single
bit of the data will be changed

36

ECE 5655/4655 Real-Time DSP

Bit-band Operation Example

Bit-band Operation Example


For example, in order to set bit[3] in word data in address
0x20000000:
;Read-Modify-Write Operation

;Bit-band Operation

LDR
LDR
ORR.W
STR

LDR
MOV
STR

R1,
R0,
R0,
R0,

=0x20000000
[R1]
#0x8
[R1]

;Setup address
;Read
;Modify bit
;Write back

R1, =0x2200000C
R0, #1
R0, [R1]

;Setup address
;Load data
;Write

Read-Modify-Write operation
Read the real data address (0x20000000)
Modify the desired bit (retain other bits unchanged)
Write the modified data back
Bit-band operation
Directly set the bit by writing 1 to address 0x2200000C,
which is the alias address of the fourth bit of the 32-bit
data at 0x20000000
In effect, this single instruction is mapped to 2 bus transfers: read data from 0x20000000 to the buffer, and then
write to 0x20000000 from the buffer with bit [3] set

ECE 5655/4655 Real-Time DSP

37

Chapter 3 Cortex-M4 Architecture and ASM Programming

Bit-band Alias Address


Each bit of the 32-bit data is one-to-one mapped to the bit-band
alias address
For example, the fourth bit (bit [3]) of the data at
0x20000000 is mapped to the bit-band alias address at
0x2200000C
Hence, to set bit [3] of the data at 0x20000000, we only
need to write 1 to address 0x2200000C
In Cortex-M4, there are two pre-defined bit-band alias
regions: one for SRAM region, and one for peripherals
region
Real 32-bit data
address

0x20000008
0x20000004
0x20000000

Bit-band alias address

0x22000100
0x22000080
0x22000000
0x2200000C
0x22000018

38

ECE 5655/4655 Real-Time DSP

Bit-band Alias Address (cont.)

Bit-band Alias Address (cont.)


SRAM region
32MB memory space (0x22000000 0x23FFFFFF) is
used as the bit-band alias region for 1MB data
(0x20000000 0x200FFFFF)
Peripherals region
32MB memory space (0x42000000 0x43FFFFFF) is
used as the bit-band alias region for 1MB data
(0x40000000 0x400FFFFF)
0x43FFFFFF

32MB Bit-band alias


0x42000000
0x41FFFFFF
0x40100000
0x40000000

31MB non-bit-band region


1MB Bit-band region

0x23FFFFFF

External RAM
Peripherals

0x20100000
0x20000000

512MB
0x40000000
0x3FFFFFFF

32MB Bit-band alias


0x22000000
0x21FFFFFF

0x60000000
0x5FFFFFFF

SRAM
31MB non-bit-band region
1MB Bit-band region

ECE 5655/4655 Real-Time DSP

512MB
0x20000000
0x1FFFFFFF

Code

512MB
0x00000000

39

Chapter 3 Cortex-M4 Architecture and ASM Programming

Benefits of Bit-Band Operations


Faster bit operations
Fewer instructions
Atomic operation, avoid hazards
For example, if an interrupt is triggered and served during
the Read-Modify-Write operations, and the interrupt service routine modifies the same data, a data conflict will
occur
Interrupt Service Routine
Read data at 0x00

Modify bit [1]

Write data back

Interrupt
occurs
Read data at 0x00
Main program

310

Interrupt
returns

Modify bit [1]

Write data back


Bit [1] modified by ISR is overwritten
by the main program

ECE 5655/4655 Real-Time DSP

Cortex-M4 Program Image

Cortex-M4 Program Image


The program image in Cortex-M4 contains
Vector table -- includes the starting addresses of exceptions
(vectors) and the value of the main stack point (MSP);
C start-up routine;
Program code application code and data;
C library code program codes for C library functions
Code region

Program
Image

Start-up routine &


Program code &
C library code

Vector table
0x00000000

ECE 5655/4655 Real-Time DSP

External Interrupts
SysTick
PendSV
Reserved
Debug monitor
SVCall
Reserved
Usage fault
Bus fault
MemManage fault
Hard fault vector
NMI vector
Reset vector
Initial MSP value

311

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Program Image (cont)


After Reset, the processor:
First reads the initial MSP value;
Then reads the reset vector;
Branches to the start of the programme execution address
(reset handler);
Subsequently executes program instructions
Reset

Fetch initial value for MSP


(Read address 0x00000000)

Fetch reset vector


(Read address 0x00000004)

Fetch 1st instruction


(Read address of reset vector)

Fetch 2nd instruction


(Read subsequent instructions)

312

ECE 5655/4655 Real-Time DSP

Cortex-M4 Endianness

Cortex-M4 Endianness
Endian refers to the order of bytes stored in memory
Little endian: lowest byte of a word-size data is stored in
bit 0 to bit 7
Big endian: lowest byte of a word-size data is stored in bit
24 to bit 31
Cortex-M4 supports both little endian and big endian
However, Endianness only exists in the hardware level
Address

[31:24]

[23:16]

[15:8]

[7:0]

Byte3

Byte2

Byte1

Byte0

Byte1

Byte0

Byte1

Byte0

0x00000008

Word 3

Byte3

0x00000004

Byte2
Word 2

Byte3

0x00000000

Byte2
Word 1

Little endian 32-bit memory


[31:24]

[23:16]

[15:8]

[7:0]

Byte0

Byte1

Byte2

Byte3

Byte2

Byte3

Byte2

Byte3

Word 3

Byte0

Byte1
Word 2

Byte0

Byte1
Word 1

Big endian 32-bit memory

ECE 5655/4655 Real-Time DSP

313

Chapter 3 Cortex-M4 Architecture and ASM Programming

ARM and Thumb Instruction Set


Early ARM instruction set
32-bit instruction set, called the ARM instructions
Powerful and good performance
Larger program memory compared to 8-bit and 16-bit processors
Larger power consumption
Thumb-1 instruction set
16-bit instruction set, first used in ARM7TDMI processor
in 1995
Provides a subset of the ARM instructions, giving better
code density compared to 32-bit RISC architecture
Code size is reduced by ~30%, but performance is also
reduced by ~20%

314

ECE 5655/4655 Real-Time DSP

ARM and Thumb Instruction Set (cont.)

ARM and Thumb Instruction Set (cont.)


Mix of ARM and Thumb-1 Instruction sets
Benefit from both 32-bit ARM (high performance) and 16bit Thumb-1 (high code density)
A multiplexer is used to switch between two states: ARM
state (32-bit) and Thumb state (16-bit), which requires a
switching overhead
0
Incoming
Instructions

Thumb remap
to ARM

ARM
Instruction
decoder

Instructions
Executing

T bit, 0: select ARM,


1: select Thumb

Thumb-2 instruction set


Consists of both 32-bit Thumb instructions and original 16bit Thumb-1 instruction sets
Compared to 32-bit ARM instructions set, code size is
reduced by ~26%, while keeping a similar performance
Capable of handling all processing requirements in one operation state

ECE 5655/4655 Real-Time DSP

315

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set


Cortex-M4 processor
ARMv7-M architecture
Supports 32-bit Thumb-2 instructions
Possible to handle all processing requirements in one operation state (Thumb state)
Compared with traditional ARM processors (use state
switching), advantages include:
* No state switching overhead both execution time and instruction space are saved
* No need to separate ARM code and Thumb code source files,
which makes the development and maintenance of software
easier
* Easier to get optimized efficiency and performance

316

ECE 5655/4655 Real-Time DSP

Cortex-M4 Instruction Set (cont.)

Cortex-M4 Instruction Set (cont.)


ARM assembly syntax:
label
mnemonic operand1,operand2, ; Comments

Label is used as a reference to an address location;


Mnemonic is the name of the instruction;
Operand1 is the destination of the operation;
Operand2 is normally the source of the operation;
Comments are written after ; , which does not affect the
program;
For example
MOVS R3, #0x11;Set register R3 to 0x11

Note that the assembly code can be assembled by either


ARM assembler (armasm) or assembly tools from a variety of vendors (e.g. GNU tool chain). When using GNU
tool chain, the syntax for labels and comments is slightly
different

ECE 5655/4655 Real-Time DSP

317

318
Rd, #lsb, #width
Rd, Rn, #lsb, #width
{Rd,} Rn, Op2

BFC

BFI

BIC, BICS

Rm

label

BX

Rd, Rm, <Rs|#n>

ASR, ASRS

Rm

{Rd,} Rn, Op2

AND, ANDS

BLX

Rd, label

ADR

label

{Rd,} Rn, #imm12

ADD, ADDW

BL

{Rd,} Rn, Op2

ADD, ADDS

#imm

{Rd,} Rn, Op2

ADC, ADCS

BKPT

Operands

Mnemonic

Branch indirect

Branch indirect with Link

Branch with Link

Breakpoint

Bit Clear

Bit Field Insert

Bit Field Clear

Branch

Arithmetic Shift Right

Logical AND

Load PC-relative Address

Add

Add

Add with Carry

Brief description

N,Z,C

N,Z,C

N,Z,C

N,Z,C,V

N,Z,C,V

N,Z,C,V

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables

ECE 5655/4655 Real-Time DSP

ECE 5655/4655 Real-Time DSP


Rn, Op2
Rn, Op2
i
i

CMN

CMP

CPSID

CPSIE

Data Synchronization Barrier


Exclusive OR

{Rd,} Rn, Op2


-

DSB

EOR, EORS

ISB

Instruction Synchronization Barrier

Data Memory Barrier

DMB

Change Processor State, Enable Interrupts

Change Processor State, Disable Interrupts

Compare

Compare Negative

Count Leading Zeros

Rd, Rm

Compare and Branch if Zero

CLZ

Rn, label

CBZ

Compare and Branch if Non Zero

Clear Exclusive

Rn, label

CBNZ

Brief description

CLREX

Operands

Mnemonic

N,Z,C

N,Z,C,V

N,Z,C,V

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

319

320
Rt, [Rn, #offset]

Rt, [Rn, #offset]

LDREX

LDRH, LDRHT

Rt, Rt2, [Rn, #offset]

LDRD

Rt, [Rn]

Rt, [Rn, #offset]

LDRB, LDRBT

LDREXH

Rt, [Rn, #offset]

LDR

Rt, [Rn]

Rn{!}, reglist

LDMFD, LDMIA

LDREXB

Rn{!}, reglist

LDMDB, LDMEA

Load Register with Halfword

Load Register Exclusive with Halfword

Load Register Exclusive with Byte

Load Register Exclusive

Load Register with two bytes

Load Register with byte

Load Register with word

Load Multiple registers, increment after

Load Multiple registers, decrement before

Load Multiple registers, increment after

Rn{!}, reglist

LDM

Brief description
If-Then condition block

Operands

IT

Mnemonic

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
Rt, [Rn, #offset]
Rt, [Rn, #offset]
Rt, [Rn, #offset]
Rd, Rm, <Rs|#n>
Rd, Rm, <Rs|#n>
Rd, Rn, Rm, Ra
Rd, Rn, Rm, Ra
Rd, Op2
Rd, #imm16
Rd, #imm16
Rd, spec_reg
spec_reg, Rm

Mnemonic

LDRSB, LDRSBT

LDRSH, LDRSHT

LDRT

LSL, LSLS

ECE 5655/4655 Real-Time DSP

LSR, LSRS

MLA

MLS

MOV, MOVS

MOVT

MOVW, MOV

MRS

MSR

Move from general register to Special Register

Move from Special Register to general register

Move 16-bit constant

Move Top

Move

Multiply and Subtract, 32-bit result

Multiply with Accumulate, 32-bit result

Logical Shift Right

Logical Shift Left

Load Register with word

Load Register with Signed Halfword

Load Register with Signed Byte

Brief description

N,Z,C,V

N,Z,C

N,Z,C

N,Z,C

N,Z,C

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

321

322
{Rd,} Rn, Op2
{Rd, } Rn, Rm, Op2
reglist
reglist
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm

ORR, ORRS

PKHTB, PKHBT

POP

PUSH

QADD

QADD16

QADD8

Saturating Add 8

Saturating Add 16

Saturating double and Add

Push registers onto stack

Pop registers from stack

Pack Halfword

Logical OR

Logical OR NOT

{Rd,} Rn, Op2

Move NOT

ORN, ORNS

Rd, Op2

MVN, MVNS

Multiply, 32-bit result

No Operation

{Rd,} Rn, Rm

MUL, MULS

Brief description

NOP

Operands

Mnemonic

N,Z,C

N,Z,C

N,Z,C

N,Z

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
Rd, Rn
Rd, Rn
Rd, Rn
Rd, Rn
Rd, Rm, <Rs|#n>

Mnemonic

QASX

QDADD

QDSUB

QSAX

ECE 5655/4655 Real-Time DSP

QSUB

QSUB16

QSUB8

RBIT

REV

REV16

REVSH

ROR, RORS

Rotate Right

Reverse byte order in bottom halfword and sign extend

Reverse byte order in each halfword

Reverse byte order in a word

Reverse Bits

Saturating Subtract 8

Saturating Subtract 16

Saturating Subtract

Saturating Subtract and Add with Exchange

Saturating double and Subtract

Saturating Add

Saturating Add and Subtract with Exchange

Brief description

N,Z,C

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

323

324
Rd, Rm
{Rd,} Rn, Op2
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd, } Rn, Rm
{Rd,} Rn, Op2
Rd, Rn, #lsb, #width
{Rd,} Rn, Rm

RRX, RRXS

RSB, RSBS

SADD16

SADD8

SASX

SBC, SBCS

SBFX

SDIV

{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm

SHADD16

SHADD8

SHASX

SEV

Operands

Mnemonic

Signed Halving Add and Subtract with Exchange

Signed Halving Add 8

Signed Halving Add 16

Send Event

Signed Divide

Signed Bit Field Extract

Subtract with Carry

Signed Add and Subtract with Exchange

Signed Add 8

Signed Add 16

Reverse Subtract

Rotate Right with Extend

Brief description

N,Z,C,V

GE

GE

GE

N,Z,C,V

N,Z,C

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
Rd, Rn, Rm, Ra
Rd, Rn, Rm, Ra
RdLo, RdHi, Rn, Rm
RdLo, RdHi, Rn, Rm
RdLo, RdHi, Rn, Rm
Rd, Rn, Rm, Ra
Rd, Rn, Rm, Ra
RdLo, RdHi, Rn, Rm
Rd, Rn, Rm, Ra

Mnemonic

SHSAX

SHSUB16

SHSUB8

SMLABB, SMLABT, SMLATB,


SMLATT

ECE 5655/4655 Real-Time DSP

SMLAD, SMLADX

SMLAL

SMLALBB, SMLALBT, SMLALTB,


SMLALTT

SMLALD, SMLALDX

SMLAWB, SMLAWT

SMLSD

SMLSLD

SMMLA

Signed Most significant word Multiply Accumulate

Signed Multiply Subtract Long Dual

Signed Multiply Subtract Dual

Signed Multiply Accumulate, word by halfword

Signed Multiply Accumulate Long Dual

Signed Multiply Accumulate Long, halfwords

Signed Multiply with Accumulate (32 x 32 + 64), 64-bit


result

Signed Multiply Accumulate Dual

Signed Multiply Accumulate Long (halfwords)

Signed Halving Subtract 8

Signed Halving Subtract 16

Signed Halving Subtract and Add with Exchange

Brief description

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

325

326
Operands
Rd, Rn, Rm, Ra
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
RdLo, RdHi, Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
Rd, #n, Rm {,shift #s}
Rd, #n, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm

Mnemonic

SMMLS, SMMLR

SMMUL, SMMULR

SMUAD

SMULBB, SMULBT SMULTB,


SMULTT

SMULL

SMULWB, SMULWT

SMUSD, SMUSDX

SSAT

SSAT16

SSAX

SSUB16

SSUB8

Signed Subtract 8

Signed Subtract 16

Signed Subtract and Add with Exchange

Signed Saturate 16

Signed Saturate

Signed dual Multiply Subtract

Signed Multiply word by halfword

Signed Multiply (32 x 32), 64-bit result

Signed Multiply (halfwords)

Signed dual Multiply Add

Signed Most significant word Multiply

Signed Most significant word Multiply Subtract

Brief description

GE

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
Rn{!}, reglist
Rn{!}, reglist
Rn{!}, reglist
Rt, [Rn, #offset]
Rt, [Rn, #offset]
Rt, Rt2, [Rn, #offset]
Rd, Rt, [Rn, #offset]
Rd, Rt, [Rn]
Rd, Rt, [Rn]
Rt, [Rn, #offset]
Rt, [Rn, #offset]
{Rd,} Rn, Op2

Mnemonic

STM

STMDB, STMEA

STMFD, STMIA

STR

ECE 5655/4655 Real-Time DSP

STRB, STRBT

STRD

STREX

STREXB

STREXH

STRH, STRHT

STRT

SUB, SUBS

Subtract

Store Register word

Store Register Halfword

Store Register Exclusive Halfword

Store Register Exclusive Byte

Store Register Exclusive

Store Register two words

Store Register byte

Store Register word

Store Multiple registers, increment after

Store Multiple registers, decrement before

Store Multiple registers, increment after

Brief description

N,Z,C,V

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

327

328
{Rd,} Rm {,ROR #n}
{Rd,} Rm {,ROR #n}
{Rd,} Rm {,ROR #n}
[Rn, Rm]

SXTB16

SXTB

SXTH

TBB

Rn, Op2

{Rd,} Rn, Rm,{,ROR #}

SXTAH

TST

{Rd,} Rn, Rm,{,ROR #}

SXTAB16

Rn, Op2

{Rd,} Rn, Rm,{,ROR #}

SXTAB

TEQ

#imm

SVC

[Rn, Rm, LSL #1]

{Rd,} Rn, #imm12

SUB, SUBW

TBH

Operands

Mnemonic

Test

Test Equivalence

Table Branch Halfword

Table Branch Byte

Sign extend a halfword

Sign extend a byte

Signed Extend Byte 16

Extend 16 bits to 32 and add

Dual extend 8 bits to 16 and add

Extend 8 bits to 32 and add

Supervisor Call

Subtract

Brief description

N,Z,C

N,Z,C

N,Z,C,V

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
Rd, Rn, #lsb, #width
{Rd,} Rn, Rm
RdLo, RdHi, Rn, Rm

Mnemonic

UADD16

UADD8

USAX

UHADD16

ECE 5655/4655 Real-Time DSP

UHADD8

UHASX

UHSAX

UHSUB16

UHSUB8

UBFX

UDIV

UMAAL

Unsigned Multiply Accumulate Accumulate Long (32 x 32


+ 32 +32), 64-bit result

Unsigned Divide

Unsigned Bit Field Extract

Unsigned Halving Subtract 8

Unsigned Halving Subtract 16

Unsigned Halving Subtract and Add with Exchange

Unsigned Halving Add and Subtract with Exchange

Unsigned Halving Add 8

Unsigned Halving Add 16

Unsigned Subtract and Add with Exchange

Unsigned Add 8

Unsigned Add 16

Brief description

GE

GE

GE

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

329

330
Operands
RdLo, RdHi, Rn, Rm
RdLo, RdHi, Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm, Ra
Rd, #n, Rm {,shift #s}
Rd, #n, Rm

Mnemonic

UMLAL

UMULL

UQADD16

UQADD8

UQASX

UQSAX

UQSUB16

UQSUB8

USAD8

USADA8

USAT

USAT16

Unsigned Saturate 16

Unsigned Saturate

Unsigned Sum of Absolute Differences and Accumulate

Unsigned Sum of Absolute Differences

Unsigned Saturating Subtract 8

Unsigned Saturating Subtract 16

Unsigned Saturating Subtract and Add with Exchange

Unsigned Saturating Add and Subtract with Exchange

Unsigned Saturating Add 8

Unsigned Saturating Add 16

Unsigned Multiply (32 x 32), 64-bit result

Unsigned Multiply with Accumulate (32 x 32 + 64), 64-bit


result

Brief description

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm
{Rd,} Rn, Rm,{,ROR #}
{Rd,} Rn, Rm,{,ROR #}
{Rd,} Rn, Rm,{,ROR #}
{Rd,} Rm {,ROR #n}
{Rd,} Rm {,ROR #n}
{Rd,} Rm {,ROR #n}
Sd, Sm
{Sd,} Sn, Sm
Sd, <Sm | #0.0>

Mnemonic

UASX

USUB16

USUB8

UXTAB

ECE 5655/4655 Real-Time DSP

UXTAB16

UXTAH

UXTB

UXTB16

UXTH

VABS.F32

VADD.F32

VCMP.F32

Compare two floating-point registers, or one floatingpoint register and zero

Floating-point Add

Floating-point Absolute

Zero extend a Halfword

Unsigned Extend Byte 16

Zero extend a Byte

Rotate, unsigned extend and Add Halfword

Rotate, dual extend 8 bits to 16 and Add

Rotate, extend 8 bits to 32 and Add

Unsigned Subtract 8

Unsigned Subtract 16

Unsigned Add and Subtract with Exchange

Brief description

FPSCR

GE

GE

GE

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

331

332
Operands
Sd, <Sm | #0.0>
Sd, Sm
Sd, Sd, #fbits
Sd, Sm
Sd, Sm
Sd, Sm
{Sd,} Sn, Sm
{Sd,} Sn, Sm
{Sd,} Sn, Sm
{Sd,} Sn, Sm
{Sd,} Sn, Sm
Rn{!}, list

Mnemonic

VCMPE.F32

VCVT.S32.F32

VCVT.S16.F32

VCVTR.S32.F32

VCVT<B|H>.F32.F16

VCVTT<B|T>.F32.F16

VDIV.F32

VFMA.F32

VFNMA.F32

VFMS.F32

VFNMS.F32

VLDM.F<32|64>

FPSCR

Compare two floating-point registers, or one floatingpoint register and zero with Invalid Operation check

Load Multiple extension registers

Floating-point Fused Negate Multiply Subtract

Floating-point Fused Multiply Subtract

Floating-point Fused Negate Multiply Accumulate

Floating-point Fused Multiply Accumulate

Floating-point Divide

Converts single-precision register to half-precision

Converts half-precision value to single-precision

Convert between floating-point and integer with


rounding

Convert between floating-point and fixed point

Convert between floating-point and integer

Flags

Brief description

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

Operands
<Dd|Sd>, [Rn]
{Sd,} Sn, Sm
{Sd,} Sn, Sm
Sd, #imm
Sd, Sm
Sn, Rt
Sm, Sm1, Rt, Rt2
Dd[x], Rt
Rt, Dn[x]
Rt, FPSCR
FPSCR, Rt
{Sd,} Sn, Sm

Mnemonic

VLDR.F<32|64>

VLMA.F32

VLMS.F32

VMOV.F32

ECE 5655/4655 Real-Time DSP

VMOV

VMOV

VMOV

VMOV

VMOV

VMRS

VMSR

VMUL.F32

Floating-point Multiply

Move to FPSCR from ARM Core register

Move FPSCR to ARM core register or APSR

Copy scalar to ARM core register

Copy ARM core register to scalar

Copy 2 ARM core registers to 2 single precision

Copy ARM core register to single precision

Floating-point Move register

Floating-point Move immediate

Floating-point Multiply Subtract

Floating-point Multiply Accumulate

Load an extension register from memory

Brief description

FPSCR

N,Z,C,V

Flags

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)

333

334
{Sd,} Sn, Sm
list
list
Sd, Sm
Rn{!}, list
Sd, [Rn]
{Sd,} Sn, Sm

VNMUL

VPOP

VPUSH

VSQRT.F32

VSTM

VSTR.F<32|64>

VSUB.F<32|64>

Floating-point Subtract

Stores an extension register to memory

Floating-point register Store Multiple

Calculates floating-point Square Root

Push extension registers

Pop extension registers

Floating-point Multiply

Floating-point Multiply and Subtract

Note: full explanation of each instruction can be found in Cortex-M4 Devices Generic User Guide (Ref-4)

Wait For Interrupt

Sd, Sn, Sm

VNMLS.F32

Floating-point Multiply and Add

WFI

Sd, Sn, Sm

VNMLA.F32

Floating-point Negate

Wait For Event

Sd, Sm

VNEG.F32

Brief description

WFE

Operands

Mnemonic

Flags

Chapter 3 Cortex-M4 Architecture and ASM Programming

Cortex-M4 Instruction Set Tables (cont.)

ECE 5655/4655 Real-Time DSP

ECE 5655/4655 Real-Time DSP


Description
Update APSR (flags)

Condition execution
e.g. EQ= equal, NE= not equal, LT= less than

Suffix

EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS,
GE, LT, GT, LE
BNE label

ADDS R1, #0x21

Example

Branch to the label if not equal

Add 0x21 to R1 and update APSR

Example explanation

Cortex-M4 Instruction Set Tables

Cortex-M4 Instruction Set Tables (cont.)


Cortex-M4 Suffix
Some instructions can be followed by suffixes to update
processor flags or execute the instruction on a certain condition

335

Chapter 3 Cortex-M4 Architecture and ASM Programming

C Calling Assembly
For real-time DSP applications the most common scenario
involving assembly code writing, if needed at all, will be C calling assembly. In simple terms the rules are:

Formally, the ARM Architecture Procedure Call Standard


(AAPCS) defines:
Which registers must be saved and restored
How to call procedures
How to return from procedures
AAPCS Register Use Conventions
Make it easier to create modular, isolated and integrated code
Scratch registers are not expected to be preserved upon
returning from a called subroutine
This applies to r0r3
Preserved (variable) registers are expected to have their
original values upon returning from a called subroutine
This applies to r4r8, r10r11
Use PUSH {r4,..} and POP {r4,...}

336

ECE 5655/4655 Real-Time DSP

ECE 5655/4655 Real-Time DSP

r8
r7
r6
r5
r4
r3
r2
r1
r0

r9

r15
r14
r13
r12
r11
r10

v5
v4
v3
v2
v1
a4
a3
a2
a1

v8
v7

Role in the procedure call standard

The Program Counter.


The Link Register.
The Stack Pointer.
The Intra-Procedure-call scratch register.
Variable-register 8.
Variable-register 7.
Platform register. The meaning of this register is defined
v6,SB,TR
by the platform standard.
Variable-register 5.
Variable register 4.
Variable register 3.
Variable register 2.
Variable register 1.
Argument / scratch register 4.
Argument / scratch register 3.
Argument / result / scratch register 2.
Argument / result / scratch register 1.

PC
LR
SP
IP

Register Synonym Special

Dont need to be saved. May be


used for arguments, results, or
temporary values.

Must be saved, restored by


callee-procedure it may modify
them.
Calling subroutine expects
these to retain their value.

Must be saved, restored by


callee-procedure it may modify
them.

C Calling Assembly

AAPCS Core Register Use

337

Chapter 3 Cortex-M4 Architecture and ASM Programming

Example: Vector Norm Squared


In this example we will be computing the squared length of a
vector using 16-bit (int16_t) signed numbers. In mathematical
terms we are finding
A

An

(3.1)

A = A1 AN

(3.2)

n=1

where

is an N -dimensional vector (column or row vector).


The solution will be obtained in two different ways:
Conventional C programming
Cortex-M assembly
Optimization is not a concern at this point
The focus here is to see by way of a simple example, how to
call a C routine from C (obvious), and how to call an assembly routine from C

338

ECE 5655/4655 Real-Time DSP

Example: Vector Norm Squared

C Version
We implement this simple routine in C using a declared vector length N and vector contents in the array v
The C source, which includes the called function
norm_sq_c is given below:
/******************************************************
Vector norm-squared routine in C
******************************************************/
int main(void){
int16_t x = 0;
int16_t v[5] = {1,2,3,6,7};
...
x = norm_sq_c (v, 5);// call c function
sprintf(my_debug, "Norm: The answer is %d\n", x);
TM_USART_Puts(USART6, my_debug);
...
}
int16_t norm_sq_c(int16_t* v, int16_t n)
{
int16_t i;
int16_t out = 0;
for(i=0; i<n; i++)
{
out += v[i]*v[i];
}
return out;
}

ECE 5655/4655 Real-Time DSP

339

Chapter 3 Cortex-M4 Architecture and ASM Programming

The expected answer is 1 + 4 + 9 + 36 + 49 = 99


From CoolTerm

The cycle count comparison follow assembly version


Assembly Version
The parent C routine is the following:
/******************************************************
Vector norm-squared routine in assembly
******************************************************/
extern int16_t norm_sq_asm(int16_t *x, int16_t n);
int main()
{
int16_t x = 0;
int16_t v[5] = {1,2,3,6,7};
...
zInt_sq = simple_sqrt(zInt);
sprintf(my_debug, "uint16_t SQRT of %d is %d\n",
zInt, zInt_sq);
TM_USART_Puts(USART6, my_debug);
...
}
; File demo_asm.s
PRESERVE8 ; Preserve 8 byte stack alignment
THUMB
; indicate THUMB code is used
AREA |.text|, CODE, READONLY;Start of the CODE area
EXPORT norm_sq_asm
norm_sq_asm FUNCTION
; Input array address: R0
340

ECE 5655/4655 Real-Time DSP

Example: Vector Norm Squared

; Number of elements: R1
MOVS R2, R0 ; move the address in R0 to R2
MOVS R0, #0 ; initialize the result
sum_loop
LDRSH R3, [R2],#0x2; load int16_t value pointed to
; by R2 into R3, then increment
MLA R0, R3, R3, R0; sq & accum in one step (faster)
SUBS R1, R1, #1; R1 = R1 - 1, decrement the count
CMP R1, #0
; compare to 0 and set Z register
BNE sum_loop; branch if compare not zero
BX LR
; return R0
ENDFUNC
END
; End of file

From just the C source it is not obvious that the function prototype for norm_asm is actually an assembly routine
The answer is again 99
From CoolTerm

Performance Comparison
In the Keil IDE debugger we set break points around the
function to b timed:

ECE 5655/4655 Real-Time DSP

341

Chapter 3 Cortex-M4 Architecture and ASM Programming

cycles
86
time
0.47us

norm_sq_asm

cycles
86
time
0.51us

norm_sq_c with O3

norm_sq_c with O0

Then make note of the States and Sec in the registers window:

cycles
49
time
0.29us
BEST!

Example: Unsigned Integer Square Root1


uint32_t zInt = 64;
uint32_t zInt_sq;
...
arm_status status;
// uint16_t Square root experiment
zInt_sq = simple_sqrt(zInt);
sprintf(my_debug, "uint16_t SQRT of %d is %d\n",
zInt, zInt_sq);
TM_USART_Puts(USART6, my_debug);
; in demo_asm.s
PRESERVE8 ; Preserve 8 byte stack alignment
THUMB
; indicate THUMB code is used
AREA |.text|, CODE, READONLY; Start of the CODE area
EXPORT simple_sqrt
simple_sqrt FUNCTION
; Input : R0
; Output : R0 (square root result)
1. Yiu Chapter 20, p. 664.
342

ECE 5655/4655 Real-Time DSP

Example: Unsigned Integer Square Root

MOVW R1, #0x8000 ; R1 = 0x00008000


MOVS R2, #0 ; Initialize result
simple_sqrt_loop
ADDS R2, R2, R1 ; M = (M + N)
MUL R3, R2, R2 ; R3 = M^2
CMP R3, R0 ; If M^2 > Input
IT HI
; Greater Than
SUBHI R2, R2, R1 ; M = (M - N)
LSRS R1, R1, #1 ; N = N >> 1
BNE simple_sqrt_loop
MOV R0, R2 ; Copy to R0 and return
BX LR
; Return
ENDFUNC

Function Flow Chart

ECE 5655/4655 Real-Time DSP

343

Chapter 3 Cortex-M4 Architecture and ASM Programming

Sample Results
For an input of 64 the output is 8, as expected
cycles
125

time
0.74us

For an input of 99 the output is 9 (81 is closest to 99), as


expected
From CoolTerm

Useful Resources
Architecture Reference Manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html

Cortex-M4 Technical Reference Manual:


http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_processor_r0p1_trm.pdf

Cortex-M4 Devices Generic User Guide:


http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf

344

ECE 5655/4655 Real-Time DSP