Foreword

This documentation was created out of the need to document new Game Boy Advance hardware research findings.

It does not currently aim to replace existing documentation, as that would be a massive undertaking. Instead it aims to complement existing documentation and to correct it where it is inaccurate.

The target audience is mainly emulator developers and anyone curious to learn how the Game Boy Advance really operates. For now, homebrew programmers will in most cases be better served by more comprehensive documentation.

Other resources

GBATEK — the most comprehensive GBA documentation, by Martin Korth (NO$GBA developer)
Tonc — GBA programming tutorial, by cearn
gbadoc — new community-driven GBA documentation, by the GBADev community (WIP)

Timers

The Game Boy Advance has four 16-bit hardware timers (TM0 - TM3 or generally TM[x]).

Each timer is made up of a 16-bit counter, a 16-bit reload value and a control register.

IO registers

Overview

Address	Name	Description
0x04000100	TM0D	Timer #0 counter (on read) and reload (on write) value
0x04000102	TM0CNT	Timer #0 control register
0x04000104	TM1D	Timer #1 counter (on read) and reload (on write) value
0x04000106	TM1CNT	Timer #1 control register
0x04000108	TM2D	Timer #2 counter (on read) and reload (on write) value
0x0400010A	TM2CNT	Timer #2 control register
0x0400010C	TM3D	Timer #3 counter (on read) and reload (on write) value
0x0400010E	TM3CNT	Timer #3 control register
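The register layout above is regular, so the addresses can be derived from the timer index. The following is a minimal sketch; the names `TIMER_BASE` and `timer_register_addresses` are made up for illustration and are not taken from any existing code base.

```python
TIMER_BASE = 0x04000100  # address of TM0D, taken from the table above


def timer_register_addresses(x):
    """Return the (TM[x]D, TM[x]CNT) addresses for timer x (0..3)."""
    if not 0 <= x <= 3:
        raise ValueError("timer index must be in the range 0..3")
    data_address = TIMER_BASE + 4 * x  # each timer occupies four bytes
    return data_address, data_address + 2
```

For example, `timer_register_addresses(2)` yields the TM2D and TM2CNT addresses listed in the table.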

TM[x]D

Reads return the current 16-bit counter value. Writes set the 16-bit reload value.

TM[x]CNT

Bit(s)	R/W	Name	Description
0 - 1	RW	Clock Divider	Select a clock divider frequency (see Clock Divider)
2	RW	Clock Select ¹	0 = clock divider output, 1 = on TM[x-1] overflow
3 - 5	0	Unused
6	RW	IRQ enable	0 = IRQ disabled, 1 = IRQ enabled
7	RW	Enable	0 = disabled, 1 = enabled
8 - 15	0	Unused
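As an illustration of the bit layout above, an emulator might decode a TM[x]CNT value roughly as follows. This is only a sketch: `TimerControl` and `decode_tmcnt` are hypothetical names, and forcing Clock Select to zero for TM0 follows the footnote on that bit.

```python
from typing import NamedTuple


class TimerControl(NamedTuple):
    clock_divider: int  # bits 0 - 1
    clock_select: bool  # bit 2 (True = count on TM[x-1] overflow)
    irq_enable: bool    # bit 6
    enable: bool        # bit 7


def decode_tmcnt(value, timer_index):
    """Split a 16-bit TM[x]CNT value into its documented fields."""
    clock_select = bool(value & 0x04)
    if timer_index == 0:
        # For TM0 this bit always reads zero (clock divider output is used).
        clock_select = False
    return TimerControl(
        clock_divider=value & 0x03,
        clock_select=clock_select,
        irq_enable=bool(value & 0x40),
        enable=bool(value & 0x80),
    )
```

Bits 3 - 5 and 8 - 15 are unused and therefore ignored by this sketch.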

Clock Divider

The 16 MiHz (16,777,216 Hz) system clock is divided into four frequencies using a single clock divider.

When a timer uses the system clock (meaning Clock Select is zero), it runs at one of those frequencies:

Value	Frequency	Divisor
0	16 MiHz	1
1	256 KiHz	64
2	64 KiHz	256
3	16 KiHz	1024

The relation between frequency and divisor is: Frequency = 16 MiHz / Divisor
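The relation above can be checked with a few lines of code. This is a small sketch; the constant and function names are illustrative assumptions.

```python
SYSTEM_CLOCK_HZ = 16 * 1024 * 1024  # 16 MiHz = 16,777,216 Hz
DIVISORS = (1, 64, 256, 1024)       # indexed by the Clock Divider value (0..3)


def timer_frequency_hz(divider_value):
    """Frequency = 16 MiHz / Divisor for the given Clock Divider value."""
    return SYSTEM_CLOCK_HZ // DIVISORS[divider_value & 3]
```

Because all divisors are powers of two, the integer division is exact.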

Functionality

When the enable bit (TM[x]CNT.bit7) changes from 0 to 1 the reload value is loaded into the counter.

While the enable bit is set, the counter is incremented every time its input clock pulses:

when Clock Select = 0: at the frequency selected via Clock Divider.
when Clock Select = 1: when TM[x-1] ticks and overflows (the overflow flag of TM[x-1] is connected as input clock).

When the 16-bit counter overflows:

the reload value is loaded into the counter
if IRQ enable (TM[x]CNT.bit6) is set: an IRQ is requested in IF bit 3 + x
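The behaviour described in this section can be modelled roughly as shown below. This is a simplified sketch, not a cycle-accurate implementation: the `Timer` class and `pulse_timer` function are hypothetical, the one-cycle register write delay and the clock divider phase are ignored, and requested interrupts are simply returned as a bit mask with bit 3 + x set for timer x.

```python
class Timer:
    """Hypothetical model of one timer; all names are illustrative."""

    def __init__(self, reload=0, cascade=False, irq_enable=False):
        self.reload = reload & 0xFFFF
        self.counter = 0
        self.cascade = cascade        # Clock Select (TM[x]CNT.bit2)
        self.irq_enable = irq_enable  # TM[x]CNT.bit6
        self.enabled = False          # TM[x]CNT.bit7

    def set_enable(self, enable):
        if enable and not self.enabled:
            self.counter = self.reload  # a 0 -> 1 edge loads the reload value
        self.enabled = enable

    def pulse(self):
        """Apply one input clock pulse; return True if the counter overflowed."""
        if not self.enabled:
            return False
        self.counter = (self.counter + 1) & 0xFFFF
        if self.counter != 0:
            return False
        self.counter = self.reload  # overflow: reload the counter
        return True


def pulse_timer(timers, x):
    """Pulse timer x, follow the cascade chain, return requested IRQ bits."""
    irq_bits = 0
    while x < len(timers) and timers[x].pulse():
        if timers[x].irq_enable:
            irq_bits |= 1 << (3 + x)
        x += 1
        if x < len(timers) and not timers[x].cascade:
            break  # the next timer is driven by the clock divider instead
    return irq_bits
```

In this model, `pulse_timer` is only called for timers with Clock Select = 0 whenever the clock divider produces a pulse; cascaded timers are clocked implicitly through the overflow chain.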

Timing notes

Changes to TM[x]D and TM[x]CNT take one cycle to apply².
The clock divider appears to generate pulses every n cycles (where n is the divisor) relative to system startup (meaning a timer can tick in less than n cycles after it was enabled). This however hasn't been thoroughly researched yet.
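If the clock divider does indeed pulse every n cycles relative to system startup, the delay until the next pulse could be estimated as in the following sketch. The function name is made up, and the model inherits the uncertainty stated above; it also ignores the one-cycle delay for register writes.

```python
def cycles_until_next_pulse(cycles_since_startup, divisor):
    """Cycles until the next divider pulse, assuming pulses occur whenever
    the cycle count since startup is a multiple of the divisor."""
    return divisor - (cycles_since_startup % divisor)
```

Under this assumption a timer enabled 1000 cycles after startup with divisor 64 would see its first pulse after 24 cycles rather than 64.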

¹ For TM0 this bit always reads zero (the clock divider output is forcibly used).

² See timer/reload and timer/start-stop.

PPU

The following sections describe the RAM access patterns that the PPU employs to render backgrounds, sprites and to composite the final output image.

Background rendering — BG VRAM accesses in BG modes #0 to #5
Sprite rendering — OAM and OBJ VRAM accesses
Layer compositing — PRAM accesses

Background rendering

The PPU renders background pixels during all visible scanlines (0 to 159). In each scanline it renders all background pixels for that same scanline; unlike sprite rendering, background rendering is not done one scanline ahead.

For every background (BG0 - BG3) there is a wait/idle period of a number of clock cycles before the PPU begins rendering that background according to the VRAM access patterns laid out below. The wait period is usually 31 clock cycles long; however, it may be shorter for text-mode backgrounds that use horizontal scrolling, as described in more detail in the Mode 0 section.

Once the wait period for a given background has passed, the PPU renders pixels until it enters the horizontal blanking period (h-blank) (cycle #1005 appears to be the last cycle that may fetch data). If all 240 visible pixels have been rendered before the PPU reaches the h-blank period, it does not stop rendering early: it keeps rendering pixels (which are discarded) until the h-blank period.

The hardware fetches on average one background pixel (for every enabled background) every four cycles. For affine tilemap and bitmap modes this is the true rate at which pixels are fetched; for text-mode backgrounds, however, the access pattern is slightly more complicated.

Legend

- = no fetch
M = fetch map entry (8-bit for affine maps or 16-bit for text-mode maps)
T = fetch tile data
B = fetch bitmap (8-bit palette index or 16-bit color)

Mode 0 (BG0 - BG3 text mode)

Text-mode backgrounds are rendered tile-by-tile. Within 32 cycles a single tilemap-entry is rendered for each enabled background, starting with the left-most visible tilemap-entry. Normally 30¹ tilemap-entries are rendered per scanline. However when sub-tile² horizontal scrolling is used then an additional tilemap-entry needs to be rendered. Notice that while in this case the first and last tile are partially off-screen, hardware still fetches the full data for all eight pixels. The off-screen pixels simply are discarded.

Normally rendering for text-mode backgrounds starts at 31 cycles into the scanline. When a background has sub-tile scrolling², then its start time is shifted closer to the start of the scanline by four cycles per sub-tile scrolled pixel. That is to say, the start time can be calculated from the BG[x]HOFS register: 31 - 4 * (BG[x]HOFS mod 8). This is presumably done to ensure that the actually visible pixels are always available in time to be consumed by the composite step.
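The start time formula and the tilemap-entry count can be written down directly. This is a small sketch with made-up function names; `hofs` stands for the value of BG[x]HOFS.

```python
def text_bg_start_cycle(hofs):
    """Cycle within the scanline at which a text-mode BG starts rendering."""
    return 31 - 4 * (hofs % 8)


def text_bg_entries_per_scanline(hofs):
    """30 tilemap-entries normally, 31 when sub-tile scrolling is used."""
    return 30 if hofs % 8 == 0 else 31
```

For instance, a background with BG[x]HOFS = 9 has one pixel of sub-tile scrolling, so it starts four cycles earlier (cycle 27) and renders 31 tilemap-entries.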

To render a tilemap-entry the relevant 16-bit descriptor is first fetched from the tilemap. The hardware then performs another two (4BPP) or four (8BPP) 16-bit VRAM fetches to read a full eight pixel line from the tile data.

4BPP access patterns

BG0: M--- T--- ---- ---- ---- T--- ---- ----
BG1: -M-- -T-- ---- ---- ---- -T-- ---- ----
BG2: --M- --T- ---- ---- ---- --T- ---- ----
BG3: ---M ---T ---- ---- ---- ---T ---- ----

8BPP access patterns

BG0: M--- T--- ---- T--- ---- T--- ---- T---
BG1: -M-- -T-- ---- -T-- ---- -T-- ---- -T--
BG2: --M- --T- ---- --T- ---- --T- ---- --T-
BG3: ---M ---T ---- ---T ---- ---T ---- ---T
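The 4BPP and 8BPP patterns above follow a simple rule: within each 32-cycle block, BG[n] uses slot n of a four-cycle group, with the map fetch in group 0 and tile fetches in groups 1 and 5 (4BPP) or 1, 3, 5 and 7 (8BPP). The sketch below reproduces the pattern strings from that rule; the function name is hypothetical and the rule is merely read off the patterns shown above.

```python
def text_bg_pattern(bg, eight_bpp):
    """Return the 32-cycle VRAM access pattern for text-mode BG `bg` (0..3)."""
    slots = ["-"] * 32
    slots[bg] = "M"  # map entry fetch in the first four-cycle group
    tile_groups = (1, 3, 5, 7) if eight_bpp else (1, 5)
    for group in tile_groups:
        slots[4 * group + bg] = "T"  # tile data fetch
    return " ".join("".join(slots[i:i + 4]) for i in range(0, 32, 4))
```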

Mode 1 (BG0 - BG1 text mode; BG2 affine tilemap)

Presumably BG0 and BG1 work like in Mode 0 and BG2 works like in Mode 2.

Mode 2 (BG2 - BG3 affine tilemap)

Affine tilemap backgrounds are rendered pixel-by-pixel, starting with the left-most screen pixel.

Every four cycles one BG2 and one BG3 pixel is fetched. In the first cycle the respective BG3 tilemap-entry is fetched. In the second cycle the corresponding tile data is fetched. The third and fourth cycles repeat the same process for BG2.

Even if the current tilemap pixel coordinate is out-of-bounds and wraparound is disabled in BG[x]CNT, the pixel will still be fetched (the fetched data will be discarded).

BG2: --MT --MT --MT --MT --MT --MT --MT --MT
BG3: MT-- MT-- MT-- MT-- MT-- MT-- MT-- MT--

Mode 3 - 5 (BG2 affine bitmap modes)

The bitmap background (BG2) is rendered pixel-by-pixel, starting with the left-most screen pixel.

Every four cycles a single pixel is fetched from the BG2 bitmap. This is the case even if the current bitmap pixel coordinate is out-of-bounds of the bitmap (the fetched data will be discarded).

BG2: ---B ---B ---B ---B ---B ---B ---B ---B

¹ 240 pixels / 8 pixels per tile = 30 tilemap-entries.

² This means that BG[x]HOFS is not divisible by eight.

Sprite rendering

The PPU renders sprites during all visible scanlines (0 to 159) and also during the last scanline (227) of the vertical blanking period. Sprites are rendered one scanline ahead. This means sprite rendering for line 0 starts during line 227 and rendering for line 1 in line 0 and so on.

Sprite rendering for the current scanline starts at cycle #40 of the previous scanline and continues either until the horizontal blanking period of that previous scanline (if DISPCNT.bit5 = 1) or until cycle #40 of the current scanline (if DISPCNT.bit5 = 0).

The sprites are rendered in order from the lowest OAM entry (OAM #0) to the highest entry (OAM #127). The rendering is done with two pipeline stages: an OAM attribute/matrix fetch stage and a VRAM pixel fetch stage. Both stages only access OAM/VRAM every two cycles; on odd cycles neither stage accesses memory.

OAM fetch stage

For every OAM entry the OAM fetch stage first fetches sprite attributes #0 and #1 (single 32-bit read) in cycle #0. It then decides based on the attributes if the sprite is enabled and vertically intersects the scanline that is rendered. If this is the case then it fetches attribute #2 in cycle #2 and, if the sprite is affine, the four 2x2 matrix components in cycles #4, #6, #8 and #10 ¹.

Once this pipeline stage has identified an OAM entry that should be rendered, it activates the VRAM fetch stage. The OAM fetch stage is stalled for all active VRAM fetch stage cycles except for the first and the next-to-last cycles. That is enough to prefetch OAM attributes #0 and #1 for the next OAM entry and either attribute #2 for the next OAM entry or attributes #0 and #1 for the OAM entry after.

VRAM fetch stage

The VRAM fetch stage fetches tile data for every sprite pixel in the rendered scanline.

For regular sprites, (width / 2) 16-bit VRAM accesses are performed (one access every two cycles). With each access two pixels are rendered (even for 4BPP tile data).

For affine sprites this stage does not perform a VRAM access during the first two render cycles. It then performs width VRAM accesses (one access every two cycles). With each access a single pixel is rendered. Notice that using the 'double area' feature doubles the number of VRAM accesses.
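The access counts described above translate into the following rough cost model for the VRAM fetch stage. This is a sketch with hypothetical function names; it counts only the VRAM fetch stage of a single sprite and does not model the overlap with the OAM fetch stage.

```python
def sprite_vram_accesses(width, affine, double_area=False):
    """Number of VRAM accesses needed for one sprite on one scanline."""
    if not affine:
        return width // 2  # two pixels per 16-bit access
    # Affine sprites fetch one pixel per access; 'double area' doubles it.
    return 2 * width if double_area else width


def sprite_vram_stage_cycles(width, affine, double_area=False):
    """Cycles the VRAM fetch stage is active for one sprite."""
    idle_cycles = 2 if affine else 0  # affine: no access in the first two cycles
    return idle_cycles + 2 * sprite_vram_accesses(width, affine, double_area)
```

For an 8 pixel wide sprite this gives 4 accesses over 8 cycles for the regular case and 8 accesses over 18 cycles for the affine case, which matches the number of V entries per sprite in the examples of this section.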

Examples

Legend

- = no fetch
A01 = fetch OBJ Attribute 0 and Attribute 1 (single 32-bit OAM read)
A2 = fetch OBJ Attribute 2
PA = fetch matrix entry #0
PB = fetch matrix entry #1
PC = fetch matrix entry #2
PD = fetch matrix entry #3
V = fetch VRAM tile data

OAM #0 - OAM #3 rendered, non-affine and 8 pixels wide, OAM #4 - OAM #127 disabled/culled

Cycle	OAM #0	OAM #1	OAM #2	OAM #3	OAM #4	OAM #5
0	A01	-	-	-	-	-
2	A2	-	-	-	-	-
4	V	A01	-	-	-	-
6	V	-	-	-	-	-
8	V	-	-	-	-	-
10	V	A2	-	-	-	-
12	-	V	A01	-	-	-
14	-	V	-	-	-	-
16	-	V	-	-	-	-
18	-	V	A2	-	-	-
20	-	-	V	A01	-	-
22	-	-	V	-	-	-
24	-	-	V	-	-	-
26	-	-	V	A2	-	-
28	-	-	-	V	A01	-
30	-	-	-	V	-	-
32	-	-	-	V	-	-
34	-	-	-	V	-	A01

OAM #0 - OAM #3 rendered, affine and 8 pixels wide, OAM #4 - OAM #127 disabled/culled

Cycle	OAM #0	OAM #1	OAM #2	OAM #3	OAM #4	OAM #5
0	A01	-	-	-	-	-
2	A2	-	-	-	-	-
4	PA	-	-	-	-	-
8	PB	-	-	-	-	-
10	PC	-	-	-	-	-
12	PD	-	-	-	-	-
14	-	A01	-	-	-	-
16	V	-	-	-	-	-
18	V	-	-	-	-	-
20	V	-	-	-	-	-
22	V	-	-	-	-	-
24	V	-	-	-	-	-
26	V	-	-	-	-	-
28	V	-	-	-	-	-
30	V	A2	-	-	-	-
32	-	PA	-	-	-	-
34	-	PB	-	-	-	-
36	-	PC	-	-	-	-
38	-	PD	-	-	-	-
40	-	-	A01	-	-	-
42	-	V	-	-	-	-
44	-	V	-	-	-	-
46	-	V	-	-	-	-
48	-	V	-	-	-	-
50	-	V	-	-	-	-
52	-	V	-	-	-	-
54	-	V	-	-	-	-
56	-	V	A2	-	-	-
58	-	-	PA	-	-	-
60	-	-	PB	-	-	-
62	-	-	PC	-	-	-
64	-	-	PD	-	-	-
66	-	-	-	A01	-	-
68	-	-	V	-	-	-
70	-	-	V	-	-	-
72	-	-	V	-	-	-
74	-	-	V	-	-	-
76	-	-	V	-	-	-
78	-	-	V	-	-	-
80	-	-	V	-	-	-
82	-	-	V	A2	-	-
84	-	-	-	PA	-	-
86	-	-	-	PB	-	-
88	-	-	-	PC	-	-
90	-	-	-	PD	-	-
92	-	-	-	-	A01	-
94	-	-	-	V	-	-
96	-	-	-	V	-	-
98	-	-	-	V	-	-
100	-	-	-	V	-	-
102	-	-	-	V	-	-
104	-	-	-	V	-	-
106	-	-	-	V	-	-
108	-	-	-	V	-	A01

¹ The order in which the matrix components are fetched is not known yet.

Layer compositing

Compositing of the background, sprite and backdrop layers begins after a wait period of 46 cycles. The PPU then composites an output pixel every four cycles. The last pixel will be completed on cycle #1006 (46 + 4 * 240 = 1006). It does this during all visible scanlines (0 to 159) and always composites pixels for the current scanline.
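The compositing schedule can be expressed as a function of the screen x-coordinate. This is a sketch under the assumption that pixel x occupies the four cycles starting at 46 + 4 * x; the function names are illustrative.

```python
COMPOSITE_WAIT_CYCLES = 46
CYCLES_PER_PIXEL = 4


def composite_start_cycle(x):
    """First cycle of the four-cycle slot in which pixel x is composited."""
    return COMPOSITE_WAIT_CYCLES + CYCLES_PER_PIXEL * x


def composite_end_cycle(x):
    """Cycle at which compositing of pixel x is complete."""
    return COMPOSITE_WAIT_CYCLES + CYCLES_PER_PIXEL * (x + 1)
```

With 240 visible pixels, `composite_end_cycle(239)` evaluates to 46 + 4 * 240 = 1006.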

For each output pixel the top two opaque layers for that pixel are extracted. Then up to two Palette RAM (PRAM) accesses are performed, to resolve palette indices from the background or sprite layers to 16-bit colors or to get the backdrop color. The first PRAM access resolves the color for the top-most layer and the second PRAM access resolves the color for the second top-most layer.

For both layers, there are distinctive conditions that decide whether the respective PRAM access will be performed or not.

Top-most layer condition

not (PPU is in Mode 3 and the layer is BG2)

Second top-most layer condition

not (PPU is in Mode 3 and the layer is BG2)
Alpha-blending is enabled
The top-most layer is selected as the first blend target
The second top-most layer is selected as the second blend target

Notice that for both layers the access is not performed when the PPU is in Mode 3 and the layer is BG2. This simply is because the BG2 bitmap in Mode 3 does not use palettes and fetches 16-bit colors directly from VRAM.
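The two conditions can be sketched as predicates. Note the assumption made here: the items listed for the second top-most layer are treated as all having to hold at once, which the text above does not state explicitly. Function and parameter names are hypothetical, and layers are identified by plain strings such as "BG2" or "OBJ".

```python
def top_layer_pram_access(mode, layer):
    """PRAM access for the top-most layer (skipped for the Mode 3 bitmap)."""
    return not (mode == 3 and layer == "BG2")


def second_layer_pram_access(mode, layer, alpha_blending,
                             top_is_first_target, second_is_second_target):
    """PRAM access for the second top-most layer.

    Assumption: every listed condition must be true for the access to occur.
    """
    return (not (mode == 3 and layer == "BG2")
            and alpha_blending
            and top_is_first_target
            and second_is_second_target)
```

In this model a pixel without alpha-blending triggers at most one PRAM access (the A slot), and the B slot of the access pattern stays idle.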

PRAM access pattern

A-B- A-B- A-B- A-B- ...

- = no fetch
A = top-most layer PRAM fetch
B = second top-most layer PRAM fetch

NBA hardware documentation