Embedded MIPS Development with the Nintendo 64
Ryan Underwood
Abstract
Using Nintendo's 64-bit console, we explore the intricacies and design decisions involved in developing software for a console or embedded platform. This is a work in progress!
1 Introduction
Embedded development, while drastically simplified compared to systems programming for a general purpose platform, presents many challenges to attain a superior product at minimal cost. The first set of challenges we can identify are purely physical. Obviously, the speed of the CPU and the size of memory are the two main decisions that impact the cost per unit. Also necessary to consider is what user interface is desired, and the appropriate I/O hardware to accommodate each aspect of the interface. Size must be considered, and not only in terms of material cost; who wants to lug around a bulky digital music player?
Power consumption is critical; it presents a three-pronged blade to threaten effective platform design. The first issue is simply the rate at which the system consumes power. This is a problem especially for battery-powered devices, but extends as a problem of scalability too; greater power consumption simply means greater operating cost to the end user and thus a lesser value assigned to the product. Power consumption is determined by the components selected for the platform, as well as how software chooses to utilize those components. The second issue is the heat that is dissipated by the switching of silicon gates. Excessive heat generation produces undesirable traits for the user, and can also lead to premature component failures. Usually, power consumption and heat dissipation of a system are strongly linked, and frequently result from a hardware design strategy that is excessive in nature. The last issue is that of reliability. Designs which employ a high rate of power consumption must have safety features such as filters scattered throughout the system to prevent erroneous behavior under load, and require higher or tighter component tolerances in areas such as the power supply. All this leads to extra design time, more potential sources of design error, and extra unit cost.
Among other factors, the number of components and their complexity affect size and power consumption the most. Therefore, the hardware designer is faced with a difficult tradeoff. Essentially, he must select which features of a general purpose computing platform must be omitted in his embedded platform in order to reduce costs, while allowing the features that will provide the user with the most value to remain. What makes this decision even more difficult is that knowing which features will provide the most value requires knowledge of how software designers will make use of the platform. Frequently, the wrong features are cut and less important features remain, incurring unnecessary cost on the manufacturer and thwarting attempts at elegant software design.
The focus of this text is on software development where the hardware platform is a given. Targeting a fixed and known hardware platform fits well with development for set-top boxes or game consoles, but also with the rising popularity of single-board computers and System-On-a-Chip (SOC) solutions. In this paradigm, the hardware is mass produced so that it may be sold at a very low cost, and individual providers develop the software to run on the mass produced platform. The hardware/software combination together completes the product that is to be shipped to end users.
Our focus is on development for a game console, the Nintendo 64 (hereafter referred to as Ultra 64 or N64), which had a multi-year mass market lifespan. Consequently, the install base is very large and consists primarily of users who purchased the console to be able to use game software designed for it. The knowledge we gain in developing a software platform for this console can be extended to embedded development in general; the only difference is that in an embedded product, the platform and its software are designed to be inseparable. If the N64 were an embedded platform, it would have shipped with the program cartridge moulded to the console.
2 Taking Stock of the Hardware
As software designers, we will be working intimately with our chosen hardware platform. Therefore, it is essential to know with as much precision as possible what the details of our hardware architecture are. Sometimes hardware manufacturers or licensed third parties offer complete SDKs (Software Development Kits) for their platforms that ease the bootstrapping of a project; usually an experienced C programmer will be able to make use of an SDK to drastically cut down time-to-market. However, the SDK usually contains proprietary information obtained or developed at cost to the platform designer, which manifests itself to the software developer as a per-unit licensing fee or the unavailability of source code. (See Section 3 for more information on SDKs.) For this discussion, we will focus on developing software solutions without a third-party SDK, and on developing our own SDK to use in-house or to license to third parties.
The information we need to know about our platform boils down to three categories:
- How to program the processor(s)
- How to program peripherals
- How to execute our code on the target
This information can be gleaned (rarely) from marketing materials, or more usually a designer's handbook. In a limited fashion, it can also be derived from observation or from reverse engineering.
The N64's hardware features are as follows:
- MIPS R4300i RISC 64-bit embedded processor, 93.75 MHz
- Reality CoProcessor (RCP), 62.5 MHz
- 4MB Rambus RDRAM (8MB with memory expansion)
- 4 peripheral ports
- Cartridge/system bus interface
MIPS provides a public specification for the MIPS III instruction set, which the R4300i implements, as well as for the R4300i processor specifically. Therefore, programming this processor should be no problem, assuming no customization has been made to it.
The rest of the system is documented only privately, in the Nintendo SDK HTML manuals and 'man' pages. These are available to SDK licensees only. Furthermore, these documents typically only cover the Nintendo SDK operating system interfaces, and do not go into much detail about the underlying software<->hardware interface. Information about the rest of the system has been derived by members of the "Dextrose" group and message board, by N64 emulator authors, by many unnamed and defunct groups around the globe, and by commercial interests who develop unofficial N64 development platforms to be sold at a much lower cost than the official ones. Through the efforts of these disparate (and only occasionally cooperating) groups, unofficial programming information for nearly all of the N64's peripherals, its coprocessor, and its memory/register map have been produced. Nintendo is unable to claim any intellectual property rights on this independently derived information, so we use it freely in this document. We use symbolic constant names compatible with the official SDK, so that individuals already familiar with the SDK can more easily follow along.
Note: see the Sega v. Accolade and Nintendo v. Tengen court cases regarding hardware lock-outs.
2.1 Programming the N64 CPU
The N64 CPU is an NEC VR4300, which is a clone of the MIPS R4300i, a low power version of the R4300, which is itself a low cost version of the R4200. The main differences between the VR4300 and the R4200 are that the VR4300 does not support cache parity, implements only a 32-bit data bus (SysAD), and supports only a 32-bit (4GB) address space instead of the 36-bit (64GB) address space of the R4200. The R4300, like all R4000 processors, implements the MIPS III instruction set. The CPU runs at 93.75 MHz PClock (MasterClock*1.5). It can execute one instruction per clock cycle and has a 5-stage static pipeline. It has a 16KB instruction cache and an 8KB write-back data cache, both non-parity. At the nominal 93.75MHz clock speed, the VR4300 attains 125 MIPS and a score of 60/45 on SPECint/fp92, while dissipating 1.8 watts on average, a very modest amount. It has fixed-width instructions and a clearly specified interface for up to four coprocessors, two of which are implemented on-board (CP0, the MMU, and CP1, the FPU). The N64 configures the CPU to run in big-endian mode, but it is capable of running in either endian mode, and can be switched at runtime between 32-bit and 64-bit addressing. It can be switched to reduced power mode to slow the processor to 1/4 of its nominal clock speed at any time. The memory management unit (MMU) is powered off when not in use, saving power.
2.1.1 Addressing and Cacheability
The CPU has three modes of execution, kernel, supervisor, and user mode, each with a different memory map. Since a normal program is in complete control of the machine, we will exclusively be operating in kernel mode. (If we were writing a general purpose operating system, we would implement the operating system in kernel mode, and processes would be executed in user mode. kseg0 and kseg1 would be inaccessible to user code.) We will also use 32-bit addressing instead of 64-bit; since the N64 is physically limited to 8 megabytes of system memory, 64-bit pointers would only be a waste of space.
An important effect of the mode of the processor is in determining the system's memory map. According to the R4300 datasheet, there are five regions of memory when in Kernel mode:
- 0x00000000-0x7FFFFFFF: kuseg (TLB-mapped 2G physical)
- 0x80000000-0x9FFFFFFF: kseg0 (direct-mapped 512MB, cached)
- 0xA0000000-0xBFFFFFFF: kseg1 (direct-mapped 512MB, uncached)
- 0xC0000000-0xDFFFFFFF: ksseg (TLB-mapped)
- 0xE0000000-0xFFFFFFFF: kseg3 (TLB-mapped)
We are concerned mainly with the first three regions. The power-on configuration causes the physical memory of the N64 to be mirrored in kseg0 and kseg1. kuseg, ksseg, and kseg3 are unmapped and cannot be accessed until mapped. Since we do not need more than a 512MB virtual address space on the N64, we will use kseg0 addressing when the CPU needs to access main memory. kseg1 is useful because in some cases, bypassing the CPU's cache is desirable. The RCP, on the other hand, does not cache or mirror memory, so physical addresses are required when programming the RCP to access memory. Usually we will use the following guidelines for selecting memory addressing modes:
- Use kseg0 (cached) virtual addresses in general.
- Use kseg1 (uncached) virtual addresses when the CPU must access memory-mapped hardware registers.
- Use physical addresses when informing the RCP of a memory location (i.e. for a DMA transaction).
When peripherals are programmed to perform DMA reads, they will always access the uncached version of the data directly from memory. Therefore, if a cached region (kseg0 or kuseg) is written to, the CPU cache line corresponding to that region must be flushed before using that memory for operations in the RCP or other peripherals. Conversely, when a peripheral performs a DMA write to memory, the CPU cache line corresponding to that region must be invalidated to ensure two things: that reads from cached memory reflect the updated data, and that a subsequent CPU cache flush doesn't overwrite the new data from the peripheral.
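The flush and invalidate rules above operate on whole cache lines, so a buffer that is not line-aligned touches one more line than its length suggests. The following is a minimal sketch in C of walking the lines that overlap a buffer, assuming the 16-byte data cache line size of the R4300. The names dcache_for_each_line and cache_line_op are hypothetical, and the callback stands in for the MIPS cache instruction, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data cache line size in bytes (16 on the R4300). */
#define DCACHE_LINE_SIZE 16u

/* Hypothetical callback: performs one cache operation (write-back or
 * invalidate) on the line containing addr. On real hardware this would
 * wrap the MIPS `cache` instruction; it is left abstract in this sketch. */
typedef void (*cache_line_op)(uint32_t addr, void *ctx);

/* Apply op to every data cache line overlapping [addr, addr + len).
 * Returns the number of lines visited. op may be NULL to count only. */
size_t dcache_for_each_line(uint32_t addr, size_t len,
                            cache_line_op op, void *ctx)
{
    if (len == 0)
        return 0;
    /* The range must not run past the end of the 32-bit address space. */
    assert((uint64_t)addr + len - 1u <= UINT32_MAX);

    uint32_t line = addr & ~(DCACHE_LINE_SIZE - 1u);
    uint32_t last = (uint32_t)(addr + len - 1u) & ~(DCACHE_LINE_SIZE - 1u);
    size_t count = 0;

    for (;;) {
        if (op)
            op(line, ctx);
        count++;
        if (line == last)
            break;
        line += DCACHE_LINE_SIZE;
    }
    return count;
}
```

A 16-byte buffer starting at an address ending in 0x8 spans two lines, so both must be written back before a DMA read, or invalidated after a DMA write.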
It is possible to disable the kseg0 caching to aid in debugging. In order to do this, set the low three bits of the CP0 config to 2 instead of the default 3.
Remember that if the low 29 bits of two kseg0 or kseg1 addresses are identical, their physical location is identical. Changing the upper three bits only changes the CPU's access semantics (cached/uncached and TLB-mapped or direct-mapped). For example, the PIF-ROM can be found at two seemingly different locations: 0x9FC00000 and 0xBFC00000. These two locations are actually one and the same, because the low-order 29 bits in the two addresses are identical. They only differ in the access semantics. Note that even though 0x1FC00000, taken as a virtual address, lies in the kuseg address space, it cannot be accessed there until it has been mapped by the user.
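The relationship between kseg0, kseg1, and physical addresses can be sketched as a handful of helpers in C. The function and macro names below are hypothetical (the official SDK uses its own macros for this); only the segment bases and the 29-bit mask come from the memory map above.

```c
#include <assert.h>
#include <stdint.h>

#define KSEG0_BASE 0x80000000u  /* direct-mapped, cached   */
#define KSEG1_BASE 0xA0000000u  /* direct-mapped, uncached */
#define KSEG2_BASE 0xC0000000u  /* start of the TLB-mapped kernel segments */
#define PHYS_MASK  0x1FFFFFFFu  /* low 29 bits select the physical location */

/* Physical address behind a kseg0 or kseg1 virtual address. */
uint32_t virt_to_phys(uint32_t vaddr)
{
    /* TLB-mapped segments cannot be translated by masking alone. */
    assert(vaddr >= KSEG0_BASE && vaddr < KSEG2_BASE);
    return vaddr & PHYS_MASK;
}

/* Cached CPU view of a physical address (general use). */
uint32_t phys_to_kseg0(uint32_t paddr)
{
    assert(paddr <= PHYS_MASK);
    return paddr | KSEG0_BASE;
}

/* Uncached CPU view of a physical address (hardware registers). */
uint32_t phys_to_kseg1(uint32_t paddr)
{
    assert(paddr <= PHYS_MASK);
    return paddr | KSEG1_BASE;
}
```

With these, both PIF-ROM addresses reduce to 0x1FC00000, and a physical register base such as 0x04600000 becomes 0xA4600000 when the CPU needs uncached access to it.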
2.1.2 Delay Slots
For the highest performing code, we must observe the necessity of delay slots on the MIPS architecture. Delay slots are required by the design in order to prevent pipeline stalls. A delay slot is specified when:
- A conditional branch is taken.
  The instruction in the delay slot is executed, unless the branch instruction is of "Likely" type, in which case the instruction in the delay slot is not executed if the branch is not taken.
- A register load is performed.
  If the instruction in the delay slot is dependent on the register being loaded, an interlock condition occurs and the processor must stall until the register load has completed.
- A register load is performed from a coprocessor.
  Since there is no interlock on coprocessor loads, the destination register is not filled until after the next instruction has executed! (This is similar to pre-MIPS IV behavior, but limited to from-coprocessor loads only.)
Not observing the branch or load delay slots will cause seemingly-mysterious performance declines or unwanted program behavior.
2.1.3 Exception Handling
The CPU reset/NMI vector is mapped to 0xBFC00000, which is 0x1FC00000 in the physical address space. As we will later see, the reset condition is always handled by an embedded ROM (PIF-ROM) in the N64 which manages the lock-out chip. The vector can be changed later for purposes of NMI or soft-reset (generated by the Reset button on the console), but we can't do anything about a hard reset; the PIF-ROM will always be executed. (FIXME right?)
The general exception vector is at 0xBFC00380 when BEV is set (see section 2.1.5), and can be used for all other purposes. When an exception condition is encountered, the CPU branches to 0xBFC00380; the interrupted program counter is saved in the EPC register, and the Cause register records what happened.
2.1.4 Code Generation
Assembling MIPS instructions to binary can be performed using the GNU binutils package, configured for the mips or mips64 targets. 32-bit MIPS code can be executed on the N64, but 64-bit code gives higher performance with calculations involving large values. We will later use the GNU C compiler to generate MIPS code, but it is much easier for now to simply build binutils (for the assembler) than to build a complete C/C++ toolchain. In addition, running a C program requires start-up code, which we have not developed yet, to initialize the hardware and stack pointer and to invoke main().
2.1.5 MMU (CP0)
Here we list the CP0 registers that interest us. There are more, but the others deal with memory management (TLB), which we are not particularly concerned with unless we are writing an OS.
R9 (Count) Timer count register
R11 (Compare) Timer compare value register
R12 (SR) Status Register
R13 (Cause) Exception Cause
R14 (EPC) Exception Saved PC
R16 (Config) Config Register
R18 (WatchLo) address trap lower bits
R19 (WatchHi) address trap upper bits
R30 (ErrorEPC) Reset/NMI saved PC
Status register (SR) fields:
RP - Reduced Power - Bit 27 (1 = on)
FR - Floating-point Register - Bit 26 (1 = on)
RE - Reverse Endian - Bit 25 (1 = on)
BEV - Bootstrap Exception Vector - Bit 22 (1 = on)
SR - Is set if Soft Reset/NMI occurred - Bit 20
KSU - Set to 00 for kernel mode - Bits 3-4
ERL - Is set if a CPU error occurred - Bit 2
EXL - Is set if an exception occurred - Bit 1
  Read the Cause register to figure out what happened.
IE - Interrupt Enable - Bit 0 (1 = on)
ErrorEPC register: Reset/NMI saved PC.
EPC register: Exception saved PC.
WatchHi/WatchLo registers: used to trap memory accesses.
Config register fields:
EC - Bits 28-30 - System Clock Ratio (1:1 = 110, 1.5:1 = 111, 2:1 = 000, 3:1 = 001), read-only
BE - Bit 15 - Big Endian (0 => LE, 1 => BE)
K0 - Bits 0-2 - kseg0 cacheability (010 => noncacheable, 011 => cacheable)
Count and Compare registers: when Count equals Compare, IP7 is set, which causes an interrupt if IE is set. Write to Compare to clear the interrupt. Count increments at 1/2 PClock and rolls over.
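The bit positions listed above translate directly into masks. The sketch below, in C, shows three small calculations built on them: switching the K0 field of a Config value between cached and uncached, deciding whether interrupts are currently enabled, and computing a Compare value some milliseconds in the future. All names are hypothetical. Moving values to and from CP0 needs the mfc0/mtc0 instructions, which are not shown, and the tick rate assumes the nominal 93.75 MHz PClock.

```c
#include <assert.h>
#include <stdint.h>

/* Status register (CP0 R12) bits, per the list above. */
#define SR_IE   (1u << 0)   /* interrupt enable */
#define SR_EXL  (1u << 1)   /* exception level  */
#define SR_ERL  (1u << 2)   /* error level      */
#define SR_BEV  (1u << 22)  /* bootstrap exception vectors */

/* Config register (CP0 R16) K0 field: kseg0 cacheability. */
#define CONFIG_K0_MASK     0x7u
#define CONFIG_K0_UNCACHED 0x2u
#define CONFIG_K0_CACHED   0x3u

/* Count runs at PClock / 2 = 46.875 MHz = 46875 ticks per millisecond. */
#define COUNT_TICKS_PER_MS 46875u

/* Return config with its K0 field replaced by k0; other bits untouched. */
uint32_t config_with_k0(uint32_t config, uint32_t k0)
{
    assert(k0 <= CONFIG_K0_MASK);
    return (config & ~CONFIG_K0_MASK) | k0;
}

/* Interrupts are taken only when IE is set and neither EXL nor ERL is
 * set (standard MIPS III behavior). */
int sr_interrupts_enabled(uint32_t sr)
{
    return (sr & SR_IE) != 0 && (sr & (SR_EXL | SR_ERL)) == 0;
}

/* Compare value that fires ms milliseconds after count_now. The sum
 * wraps modulo 2^32, matching the rollover of Count. */
uint32_t compare_after_ms(uint32_t count_now, uint32_t ms)
{
    return count_now + ms * COUNT_TICKS_PER_MS;
}
```

Because Count rolls over, the wrapped sum is the right value to write to Compare even when the deadline lies past the rollover point.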
2.1.6 FPU (CP1)
2.2 Programming the RCP Coprocessor Unit
The RCP is interfaced to the CPU as a standard MIPS coprocessor unit (CP2) and connected to the CPU via the 32-bit SysAD interface that the R4300 provides. The RCP is clocked at 62.5 MHz (equal to MasterClock) and is capable of 500 MFLOPS at that speed. It is manufactured on a 0.35 micron process. The RCP is connected to CPU interrupt 0 (INT0).
The various functional subunits of the RCP can be accessed through memory-mapped registers. The available subunits are:
- RAC (Rambus ASIC Cell)
  A memory controller IP for the Rambus RDRAM system memory. This could be off-the-shelf depending on the design.
- RSP (Reality Signal Processor)
  A custom R4000-like and DSP-like processor with 64-bit instruction width, 32-bit scalar register width, 16-bit integer SIMD capability, and a Harvard microarchitecture with separated 4KB instruction memory (IMEM) of 512 64-bit words and 4KB data memory (DMEM) of 256 128-bit words. In typical usage, it reads the next operation from a task list, uses DMA to fetch the microcode and data for the next task, and executes the microcode. Typical tasks are to preprocess graphics commands for dispatch to the RDP, or to process sound data that is then written to the Audio Interface. The RSP has its own set of internal coprocessors: its own CP0 is the RDP, and CP1 is the 16-bit integer SIMD Vector Unit (VU). While they are accessed through the MIPS coprocessor interface, the RDP and VU are not intended to be at all compatible with the CP0 (MMU) and CP1 (FPU) that would be found on a MIPS CPU.
- RDP (Reality Display Processor)
  The rasterizing engine of the N64. The RDP receives commands from the RSP or a memory buffer, filters and manipulates the image data, and writes the final image data to an off-screen framebuffer in main memory.
- VI (Video Interface)
  The N64's framebuffer interface. It displays the contents of the on-screen framebuffer onto the external video display.
- AI (Audio Interface)
  Plays digital audio samples via DMA.
- PI (Peripheral Interface)
  Connects cartridges and external hardware units to the N64.
- MI (MIPS Interface)
  A control interface for CPU-related functionality, such as masking interrupts or determining the source of an interrupt.
- RI (RDRAM Interface)
  The RCP's interface to the RAC and RDRAM system memory.
- SI (Serial Interface)
  Responsible for accessing devices connected to the controller ports.
2.3 Programming the RSP (Reality Signal Processor)
2.4 Programming the RDP (Reality Display Processor)
4KB Texture memory (one 32x32 RGBA texture)
Features:
- Alpha Transparency (8-bit)
- Anti-Aliasing
- Bilinear/Trilinear Filtering/Interpolation (and Point Sampling)
- Culling/Level of Detail Management
- Dithering
- Environment Mapping
- Fog
- Mipmapping
- Perspective Correct Texture Mapping
- Shading (Flat/Gouraud)
- Specular Reflection/Shiny Surfaces (Metal Mario)
- Trilinear Mipmap Interpolation
- Z-Buffering
2.5 Interfacing System RAM
The Rambus RDRAM memory used in the N64 has a narrow 9-bit interface to the Rambus ASIC Cell (RAC) in the RCP, and is capable of 562.5 MB/s burst bandwidth. It implements parity checking, so it should generate an NMI when a parity error is detected (FIXME but R4300 doesn't support parity?). The RAM is mapped at 0x00000000, and 64MB of address space is reserved for it. However, only 63MB of this address space can be used for physical memory, since the last 1MB is reserved for the RAC registers.
2.5.1 Addressing
The N64 console ships with 4MB of memory onboard, and a required bus terminator called a "Jumper Pak" is installed in the memory expansion slot. A 4MB upgrade can be purchased and installed for a total of 8MB system memory. The following maps can be used depending on the amount of memory installed:
- For cached access to 4MB memory, use 0x80000000-0x803FFFFF.
- For cached access to 8MB memory, use 0x80000000-0x807FFFFF.
- For uncached access to 4MB memory, use 0xA0000000-0xA03FFFFF.
- For uncached access to 8MB memory, use 0xA0000000-0xA07FFFFF.
2.5.2 Memory Detection
To check how much memory is installed, we consult the address 0x80000318. It corresponds to osMemSize from the SDK, and contains the size of memory (0x00400000 for 4MB or 0x00800000 for 8MB). The PIF-ROM code detects the size of memory and sets the value at that address before jumping to program code. If we wish to detect the size of memory ourselves instead of using the value the PIF-ROM gave us, we can use the following algorithm:
- Start at address (0xA0400000 - 4) and a memory count of 4MB.
- Write a 32-bit word to the address, read it back, and compare it to the value written.
- If it is the same, add 0x100000 to the address and 1MB to the memory count.
- Repeat until the memory count is equal to the desired amount of memory. If we are only verifying the (non)existence of an officially-released memory expansion pack, stop when the memory count is equal to 8MB.
Note that an attempt to access a non-existent physical memory address will result in an exception, so in order to prevent a CPU crash when performing memory detection, an appropriate general exception handler must be installed.
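The detection loop can be sketched in C as follows. This is one reading of the steps above rather than a definitive implementation: it assumes the base 4MB is present and probes the last word of each additional megabyte through kseg1. The hardware access is hidden behind a hypothetical callback, mem_probe_fn, so the logic can be exercised on a host; simulated_probe is a stand-in for testing only. On the console, the probe would perform the uncached write and read-back, and the exception handler mentioned above would have to turn a bus error into a failed probe.

```c
#include <assert.h>
#include <stdint.h>

#define MB          0x100000u
#define KSEG1_BASE  0xA0000000u

/* Hypothetical probe: writes a test word at addr, reads it back, and
 * returns nonzero if the value matches. */
typedef int (*mem_probe_fn)(uint32_t addr, void *ctx);

/* Grow the memory count from base_size in 1MB steps for as long as the
 * last word of the next megabyte reads back correctly, up to max_size.
 * Sizes are in bytes and must be whole megabytes. */
uint32_t detect_memory_size(uint32_t base_size, uint32_t max_size,
                            mem_probe_fn probe, void *ctx)
{
    assert(probe != 0);
    assert(base_size % MB == 0 && max_size % MB == 0);
    assert(base_size <= max_size);

    uint32_t size = base_size;
    while (size < max_size) {
        /* Last 32-bit word of the megabyte just above the current count. */
        uint32_t addr = KSEG1_BASE + size + MB - 4u;
        if (!probe(addr, ctx))
            break;
        size += MB;
    }
    return size;
}

/* Host-side stand-in for the hardware probe: ctx points to the simulated
 * amount of installed memory in bytes. */
int simulated_probe(uint32_t addr, void *ctx)
{
    uint32_t installed = *(const uint32_t *)ctx;
    return (addr - KSEG1_BASE) < installed;
}
```

Calling detect_memory_size(4 * MB, 8 * MB, probe, ctx) covers the case of checking only for the official expansion pack.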
2.5.3 Access Strategy
The N64's Rambus RDRAM system memory has a high burst bandwidth but trades access latency for this bandwidth. Compared to SDRAM of the same speed, accessing main memory in a random fashion is exceptionally slow. A poster at the Beyond3d.com forums claimed that it would take 64 clocks (640ns) to initiate memory access, which is a crippling latency when combined with the lack of prefetch and read-around-write capabilities [beyond3d-forum]. (benchmark)
Also, since the memory controller is part of the RCP and can only be accessed through the RCP, the data width of memory access is limited to the 32-bit SysAD interface between the CPU and RCP; storing a 64-bit longword to memory will thus require two transfers to the RCP, while storing a value of size 32 bits or less requires only one transfer. (benchmark) Additionally, the R4300 CPU will be locked out from accessing memory whenever the RCP's internal bus is busy, causing program slowdown. (benchmark while e.g. PI DMA is going on)
To avoid this latency, programs that run on the R4300 CPU should be optimized in a fashion such that in both passing procedure parameters and for procedure temporary (stack) variables, registers are primarily utilized. Thanks to the MIPS architecture providing 32 64-bit general-purpose registers (GPR), there are ample registers to use for this purpose.
The n32 Application Binary Interface (ABI) developed by SGI leverages this advantage at the compiler and linker level. When this ABI is employed, up to eight 64-bit words, whether integer, float, struct, or union members, are passed in registers. Any data in excess of the eight 64-bit words is then additionally passed on the stack. Inside a function, up to 18 integer registers minus the number of register-passed integer arguments are available, and up to 26 floating-point registers minus the number of register-passed FP arguments are available without allocating additional stack space in main memory for local temporary variables.
If register variables are allocated by the caller and its callers, they will be saved to the stack before the call, even if the subroutine parameters fit into the allocated registers. For this reason, recursive functions and other algorithms with deep call graphs, as well as algorithms with high space complexity, are, all other things equal, a poor choice for the N64; eventually the data cache will be exhausted and the stack will spill over into the main memory that is extremely slow when accessed by the CPU. Functions written in a way such that they can be inlined into the caller, on the other hand, both leverage the high register count of the R4300 and have less data cache impact, but caution must be exercised such that excessive use of inline code inside a loop does not thrash the relatively small instruction cache. (Instruction cache misses are almost as expensive as data cache misses, only being somewhat mitigated by the R4300 design, which prefetches two instructions at a time to save power.) As standard practice goes, pass variables by reference when possible, and minimize a subroutine's copying of its parameters into stack-allocated temporary variables by simply passing the variable by value instead.
Always use DMA transfers back and forth to main memory when large amounts of data must be read from or written to peripherals. Do not read a whole data structure into memory to then simply write a (possibly modified) copy to another memory-mapped location! This is what DMA is for. Trying to do all of this on the CPU will be extraordinarily slow due to the 32-bit interface between the CPU and RCP, high latency of RDRAM, and internal bus competition with the RCP. When main memory must be read from or stored to, try to design programs such that memory is accessed in bursts of contiguous reads or writes that are smaller than the data cache size. Then use the cache flush command to write the cache out to memory all at once, instead of waiting for individual cache lines to be automatically flushed to memory by the LRU algorithm. (FIXME test is this really faster?)
Also remember that the CPU has a 16KB instruction cache (4096 32-bit instruction words) organized with 32-byte lines, and an 8KB data cache (2048 32-bit or 1024 64-bit data words) organized with 16-byte lines. If, in a given program mode, your program's call graph and data structures can be made to fit within these cache limits, main memory accesses will be avoided, providing a tremendous speedup. Try profiling the amount of stack and heap that is used in a given program mode to see if your program is thrashing the cache.
Programming the RCP presents a different set of challenges. While the RCP has a full 64-bit data bus internally, the RSP and RDP cannot access main memory directly; either the RSP program must make a DMA request in order to transfer data between DMEM/IMEM/TMEM and main memory, or it must request that the CPU copy the data on its behalf, which is slow. To get the most performance out of the CPU and RCP, a program must thus employ an efficient DMA strategy.
2.5.4 Configuration
The RDRAM memory controller (RAC) configuration registers are 32-bit words mapped starting at 0x03F00000. Usually these registers are configured by firmware.
Address Symbolic Name
0xA3F00000 RDRAM_CONFIG
0xA3F00004 RDRAM_DEVICE_ID
0xA3F00008 RDRAM_DELAY
0xA3F0000C RDRAM_MODE
0xA3F00010 RDRAM_REF_INTERVAL
0xA3F00014 RDRAM_REF_ROW
0xA3F00018 RDRAM_RAS_INTERVAL
0xA3F0001C RDRAM_MIN_INTERVAL
0xA3F00020 RDRAM_ADDR_SELECT
0xA3F00024 RDRAM_DEVICE_MANUF
0xA3F00028-0xA3FFFFFF Unknown/Unused
2.5.5 Programming the RI (RDRAM Interface)
These registers are located in the RCP and are of 32-bit width. Usually these registers are configured by firmware. The RI registers are mapped at 0x04700000.
Address     Symbolic Name    Description
0xA4700000  RI_MODE
0xA4700004  RI_CONFIG
0xA4700008  RI_CURRENT_LOAD
0xA470000C  RI_SELECT
0xA4700010  RI_REFRESH
0xA4700014  RI_LATENCY
0xA4700018  RI_RERROR        d0 => nack error, d1 => ack error
0xA470001C  RI_WERROR        Write to this register to clear RI_RERROR.
2.6 Programming the MI (MIPS Interface)
This interface acts as the "glue" between the CPU and the RCP. Its chief purpose is to determine the source of an interrupt. MI registers are mapped at 0x04300000.
Address Symbolic Name
0xA4300000 MI_MODE
0xA4300004 MI_VERSION
0xA4300008 MI_INTR
0xA430000C MI_INTR_MASK
describe various interrupt sources and masks.
2.7 Programming the VI (Video Interface)
The VI is the frame buffer interface for the N64. It controls the resolution, color depth, and physical characteristics of the video display. On each screen redraw, it is responsible for reading the contents of the display from a given region in system memory (the frame buffer) and updating the image on the monitor. Resolutions from 256x224 to 640x480 can be used with 16 or 32-bit color depth. (PAL systems can use a 768x576 resolution FIXME?) What about MPAL?
The VI's registers are mapped at 0x04400000.
2.8 AI (Audio Interface)
This unit produces analog audio output from digital audio samples. It supports up to 48kHz samples and 4 to 16-bit sample resolution. It uses DMA to play samples directly from system memory. A sample can be queued for play while one is currently playing. Unfortunately there is no mixing of samples, so we must employ our own mixing routines. Typically, mixing of multiple PCM streams, as well as sound manipulation such as ADPCM decompression, wavetable synthesis, and special effects (Voice, Pitch Shifting, Gain and Pan, Reverb and Chorus) are performed by an RSP microcode task.
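Since the AI plays a single stream, any mixing has to happen before the samples reach it. As noted, this is typically done in RSP microcode; the following is only a CPU-side sketch in C of the underlying idea, summing two signed 16-bit PCM streams and clamping the result so that loud passages saturate instead of wrapping around. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit sample range. */
int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Mix n samples of streams a and b into out. out may alias a or b. */
void mix_pcm16(int16_t *out, const int16_t *a, const int16_t *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Widen before adding so the sum cannot overflow. */
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        out[i] = saturate16(sum);
    }
}
```

The mixed buffer would then be queued to the AI like any other sample buffer, after writing it back from the data cache.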
2.9 Programming the PI (Parallel Peripheral Interface)
The PI is used to attach program cartridges, development units, or other peripheral devices to the system. Besides cartridge data, the serial data lines to the cartridge EEPROM and lockout (CIC) chip run through it, and the left and right audio outputs are cloned onto it. It can be physically accessed both via the cartridge port on the top and the 64DD port on the bottom. It is utilized in software via simple memory-mapped I/O as well as DMA transfers to and from system memory. It is connected to INT1 of the CPU.
The PI bus requires an embedded address decoder in any attached device, since the address and data are multiplexed onto the same lines.
Address                Symbolic Name    Description
0xA4600000             PI_DRAM_ADDR     Target or source in RDRAM of PI DMA transaction & 0x1FFFFFFF (64-bit aligned)
0xA4600004             PI_CART_ADDR     Target or source in PI domain of PI DMA transaction (16-bit aligned)
0xA4600008             PI_RD_LEN        Final offset of data to be read from PI bus
0xA460000C             PI_WR_LEN        Final offset of data to be written to PI bus
0xA4600010             PI_STATUS        Read { error | I/O busy | DMA busy }; Write { Clear interrupt | Reset and abort }
0xA4600014             PI_BSD_DOM1_LAT  PI Domain 1 device latency
0xA4600018             PI_BSD_DOM1_PWD  PI Domain 1 device R/W strobe pulse width
0xA460001C             PI_BSD_DOM1_PGS  PI Domain 1 device page size
0xA4600020             PI_BSD_DOM1_RLS  PI Domain 1 device R/W release duration
0xA4600024             PI_BSD_DOM2_LAT  PI Domain 2 device latency
0xA4600028             PI_BSD_DOM2_PWD  PI Domain 2 device R/W strobe pulse width
0xA460002C             PI_BSD_DOM2_PGS  PI Domain 2 device page size
0xA4600030             PI_BSD_DOM2_RLS  PI Domain 2 device R/W release duration
0xA4600034-0xA46FFFFF  Unknown/Unused
The purpose of two separate domains is so that access timings for two PI devices can simultaneously be set. Domain 1 is used to access cartridge ROM memory, whether physical, 64DD, emulated as in the V64, or emulated as in the Indy development board, while domain 2 is used to access other PI memory devices, such as SRAM and FlashRAM. The domain 1 registers are programmed by the bootstrap ROM with the values in the first 32-bit word of the cartridge header. The timings for domain 2 must be programmed by software based on what type of memory device the software expects to see on the bus (typically the one that is included with the cartridge).
The PI will show I/O busy if the device is on the bus or if another I/O operation has not yet completed. Before writing to PI registers, software must wait for I/O busy to clear, or reset the PI if excessive time has elapsed. Also, before starting a DMA transfer, software must check if the PI is busy with an existing DMA request. A DMA request submitted while I/O or DMA is busy will be ignored. The I/O busy and DMA busy conditions can both be checked by reading PI offset 0x10.
When the PI has completed a DMA transfer, it will assert an interrupt. This will both set the interrupt flag in the RCP and, if not masked out, trigger a CPU interrupt request. Another condition that will cause an interrupt to be asserted is if the PI is reset while busy. In order to not lose future PI interrupts, the interrupt flag must be cleared before proceeding by setting bit 2 of PI offset 0x10.
Programmed I/O can be used to read from and write to the mapped PI memory locations, but this method is quite slow and requires careful timing. Also, there are mapping collisions in the PI memory space, such as the PIF ROM/RAM and the CD64 registers, which are avoided when using DMA. Using DMA is the fastest and most reliable method. To perform a PI DMA transfer:
- Write the starting DMA physical RDRAM address to PI offset 0, ensuring 64-bit alignment
- Write the starting DMA address in mapped PI space to PI offset 4, ensuring 16-bit alignment
- To transfer from RDRAM to the PI device, write the final offset from the starting DMA address (the transfer length minus one) to PI offset 8; to transfer from the PI device to RDRAM, write it to PI offset 0xC instead. This write triggers the start of DMA.
- Invalidate CPU cache lines corresponding to any memory region that was written to, unless uncached memory (0xA0000000-0xBFFFFFFF) is always used in the program.
- If PI interrupts are enabled, wait for an interrupt; otherwise, wait for the PI to become idle by reading offset 0x10 and ANDing it with 3 until the result is zero. When the PI generates an interrupt and/or reads idle, the DMA operation has completed.
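The steps above can be sketched in C as follows. This is a minimal polling sketch, not a definitive driver: the names `pi_wait_idle` and `pi_dma_to_rdram` are hypothetical, the register block is passed in as a pointer (on hardware it would be `(volatile uint32_t *)0xA4600000`), and the direction chosen (offset 0xC for a transfer from PI space into RDRAM) follows what the boot loader annotation later in this document does. Cache invalidation is left as a comment because it is CPU-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Word indices into the PI register block (byte offset / 4). */
enum {
    PI_DRAM_ADDR = 0x00 / 4, /* physical RDRAM address, 64-bit aligned */
    PI_CART_ADDR = 0x04 / 4, /* address in mapped PI space, 16-bit aligned */
    PI_RD_LEN    = 0x08 / 4, /* final offset; starts RDRAM -> PI device DMA */
    PI_WR_LEN    = 0x0C / 4, /* final offset; starts PI device -> RDRAM DMA */
    PI_STATUS    = 0x10 / 4  /* read: I/O busy and DMA busy in the low 2 bits */
};

#define PI_BUSY_MASK 0x3u

/* Spin until neither I/O busy nor DMA busy is reported. */
void pi_wait_idle(volatile uint32_t *pi)
{
    while (pi[PI_STATUS] & PI_BUSY_MASK) {
        /* a real program would reset the PI after a timeout */
    }
}

/* Copy len bytes from mapped PI space into RDRAM using DMA. */
void pi_dma_to_rdram(volatile uint32_t *pi, uint32_t rdram_phys,
                     uint32_t pi_addr, uint32_t len)
{
    assert((rdram_phys & 7u) == 0); /* 64-bit alignment */
    assert((pi_addr & 1u) == 0);    /* 16-bit alignment */
    assert(len > 0);

    pi_wait_idle(pi);               /* requests made while busy are ignored */
    pi[PI_DRAM_ADDR] = rdram_phys;
    pi[PI_CART_ADDR] = pi_addr;
    pi[PI_WR_LEN]    = len - 1;     /* this write starts the transfer */
    pi_wait_idle(pi);

    /* Invalidate CPU cache lines covering the destination here,
       unless the program only uses uncached addresses. */
}
```

Passing the register base as a parameter keeps the routine testable on a host machine, where an ordinary array can stand in for the hardware registers.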
PI memory locations are as follows:
Start address End address Length Domain Description
0x06000000 0x07FFFFFF 32MB 1 Indy GIO card (???)
0x10000000 0x1FBFFFFF 252MB 1 0-252MB Cartridge
0x1FD00000 0x7FFFFFFF 1539MB 1 Rest of Cartridge
0x05000000 0x05FFFFFF 16MB 2 ???
0x08000000 0x0FFFFFFF 128MB 2 SRAM/FlashRAM
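As an illustration of the table above, a program that accepts arbitrary PI addresses might want to know which set of timing registers governs an access. The helper below is a hypothetical sketch (the name `pi_domain_for` is not from any SDK); it simply encodes the address ranges listed in the table and returns 0 for addresses the table does not cover, such as the PIF ROM/RAM window.

```c
#include <stdint.h>

/* Return the PI domain (1 or 2) that the table assigns to a physical
   address, or 0 if the address falls outside every listed range. */
int pi_domain_for(uint32_t phys)
{
    if (phys >= 0x05000000u && phys <= 0x05FFFFFFu) return 2; /* unknown device */
    if (phys >= 0x06000000u && phys <= 0x07FFFFFFu) return 1; /* Indy GIO card (?) */
    if (phys >= 0x08000000u && phys <= 0x0FFFFFFFu) return 2; /* SRAM/FlashRAM */
    if (phys >= 0x10000000u && phys <= 0x1FBFFFFFu) return 1; /* cartridge */
    if (phys >= 0x1FD00000u && phys <= 0x7FFFFFFFu) return 1; /* rest of cartridge */
    return 0;
}
```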
Known PI timings for various peripherals are as follows:
RLS PGS PWD LAT Type
0x803 0x8037 0x803712 0x40 Cartridge
0x02 0x0D 0x0C 0x05 SRAM
Indy
FlashRAM
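Programming the domain 2 timings for SRAM could look like the sketch below. It uses the SRAM row of the table above. The offset of PI_BSD_DOM2_RLS (0x30) is taken from the register list; the offsets used here for the LAT, PWD, and PGS registers (0x24, 0x28, 0x2C) are an assumption that they immediately precede it, so check them against the PI register table before relying on this. The function and type names are hypothetical.

```c
#include <stdint.h>

/* Word indices into the PI register block (byte offset / 4). */
enum {
    PI_STATUS       = 0x10 / 4,
    PI_BSD_DOM2_LAT = 0x24 / 4, /* assumed offset */
    PI_BSD_DOM2_PWD = 0x28 / 4, /* assumed offset */
    PI_BSD_DOM2_PGS = 0x2C / 4, /* assumed offset */
    PI_BSD_DOM2_RLS = 0x30 / 4,
    PI_REG_WORDS    = 0x34 / 4
};

struct pi_timing {
    uint32_t rls, pgs, pwd, lat;
};

/* SRAM row of the timing table: RLS, PGS, PWD, LAT. */
const struct pi_timing PI_TIMING_SRAM = {
    .rls = 0x02, .pgs = 0x0D, .pwd = 0x0C, .lat = 0x05
};

/* Program the domain 2 timing registers once the PI is idle. */
void pi_set_domain2_timing(volatile uint32_t *pi, const struct pi_timing *t)
{
    while (pi[PI_STATUS] & 0x3u) {
        /* wait for I/O busy and DMA busy to clear before writing */
    }
    pi[PI_BSD_DOM2_LAT] = t->lat;
    pi[PI_BSD_DOM2_PWD] = t->pwd;
    pi[PI_BSD_DOM2_PGS] = t->pgs;
    pi[PI_BSD_DOM2_RLS] = t->rls;
}
```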
Programming the SI (Serial Interface)
Programming the PIF (Serial Peripheral InterFace)
PIF-NUS (U6) has 2K of internal storage, divided between RAM and ROM. It is chiefly responsible for input device management. The 64-byte PIF-RAM (JoyChannel RAM) can be used to examine the current state of input devices, and commands can be written there for the PIF to execute. In its ROM region, it stores a small bootstrap program (PIF-ROM) that the MIPS CPU's reset vector (0xBFC00000) is pointed to at power-on. The PIF-ROM also plays a role in the hardware lockout scheme and will be discussed at length in that section. The PIF is connected to INT2 of the CPU as well as to the CPU NMI (Non Maskable Interrupt) and RST (Reset) pins.
Don't confuse this with the PI which is also referred to as a "Peripheral Interface".
The PIF memory map is as follows:
- 0xBFC00000-0xBFC007BF: PIF-ROM
- 0xBFC007C0-0xBFC007FB: PIF-RAM
- 0xBFC007FC-0xBFC007FF: PIF Status Register
To read or write to the PIF, the programmer will normally request a SI DMA transfer (described in section "Programming-the-SI"). It is possible to read from and write to the PIF RAM directly, and in fact this is done by boot code. But, the PIF has concurrent access to this region, and reads or writes may thus produce an inconsistent state, especially if the process accessing the PIF RAM is managed by an operating system. A DMA transfer transfers the entire PIF RAM in one read or write, ensuring that the state of the data structures it contains remains consistent with respect to both the reader process and the PIF hardware.
7FC Read 0x80 ?
Write 0x10 ?
2.12 Human Interface Devices
controller, rumble pak, can we use keyboard/mouse?
2.13 Memory Devices
cart eeprom/4xeep/flash/sram
controller paks - paged (switching) vs linear. what does "page count" measure? 4X - 492 pages
how to program a 4X linear memory pak?
ds1/dx256 passthrough savers
The Z64 can only access the EEPROM and not the CIC chip (for they have the same data and clock lines, but a different chip select line).
SRAM/FlashRAM is on Cartridge Domain 2 Address 2
2.14 Audio and Video Output
2.15 Lockout and Booting Considerations
2.15.1 Lockout hardware
PIF ROM ensures that the boot code has not been tampered with, according to magic values stored in the cartridge's lockout chip (CIC) manufactured by Nintendo. The boot code ensures that the program has not been tampered with according to the checksum in the cartridge header.
There are a few ways to work with this scheme. If the response from the lockout chip to the PIF chip can be cloned, then the PIF ROM check can be defeated and arbitrary cartridge hardware can be used. The stored program would still, however, have to satisfy the conditions of the PIF ROM code and PIF chip in order for it to be booted. This means that its boot code, as hashed by the PIF ROM code, must present a value to the PIF chip that it agrees with. The cloned lockout chip would then have to possess knowledge about the specific stored program in order to know what response the PIF chip is expecting. This scheme would cross a software-hardware boundary and as a result be very difficult to implement. The user could be expected to know which response his software requires and use e.g. a switch to select from among those available, but as new security chips are introduced, an FPGA configuration or a microcontroller firmware would have to be updated by the user, making this an expensive and support-heavy method.
If, on the other hand, the results of the PIF ROM algorithm for a given lockout chip can be cloned, a payload bootcode written for any lockout chip could be executed by a program loader, as long as the program loader itself is written for the specific lockout chip that is present. This is in fact precisely what a boot emulator does; see section "Boot-emulators". This solution is the most popular since a boot emulator is pure software that can be easily updated even by inexperienced users. Development devices which include a BIOS like the CD64 and Z64 even employ their own custom boot emulators as a standard feature. The boot emulator is written for the most common NUS-CIC-6102 lockout chip, and will successfully run a payload bootcode that is written for any lockout chip, as long as the boot emulator has prior knowledge of the results of the PIF ROM algorithm for that particular lockout chip.
Another variant on the boot emulator approach is possible. Since there is in practice a 1:1 mapping between lockout chips and cartridge boot code -- making such boot code de facto canonical -- and the start of program code is identical for any given boot code, it is possible that the machine state is always the same for any given lockout chip whenever its (canonical) boot code jumps to the program code. If this is the case, the machine state at that point can be captured, and when the boot emulator begins to execute the payload program, it would then simply restore the known machine state before jumping to the known start of payload program code, rather than mimicking the results of the PIF ROM algorithm and then executing the payload boot code. While novel, it is unclear how this approach would improve on the standard boot emulator approach. It would only be of use if there were some benefit found in not executing the payload boot code under some circumstance.
A final method to work around the PIF ROM lockout check is to find a bootcode that causes a hash collision. It is unlikely that the Nintendo algorithm is free of collisions, so it should be possible to create a replacement bootcode consisting of nothing more than 1) a routine that simply copies the program to RAM and jumps to the program start address, and 2) junk padding crafted to create a hash collision.
While the PIF ROM check can only be worked around and not exploited, since the only external inputs it depends upon are inside hardware, the boot code both is running from RAM and depends on a user input -- the program start address -- and can thus be exploited. When the boot code, running from RAM, copies part of the program to RAM in order to verify it against the results of the PIF ROM, the program start address can be crafted such that the boot code routine overwrites the rest of itself with arbitrary user code, which evades the already-performed PIF ROM check. The user code loader can then re-do the copy operation that the boot code would have performed, this time substituting the correct start address, preparing the hardware, and then jumping to the start address without verifying the checksum. The drawback of this approach is that the start address in the cartridge header remains incorrect for the rest of the program execution. If a black-box program code utilizes this address in some way, it will need to be modified or worked around. One workaround is for the user code loader to set a CPU breakpoint on this address so that any reads will trap back to the user code to be handled appropriately.
It is unclear what the practical advantage of this approach is, since for each boot code used in existing software, a different start address would have to be set in the cartridge header according to the address of the code in RAM that must be overwritten to disable the checksum verification. This doesn't save the user any more time over the current method, which is to run a program to modify the checksums in the cartridge header to match those expected by the current boot code (if known). It only obviates the need to know how to calculate the checksums, which is a meager savings indeed since the checksum code itself can be easily analyzed and mimicked.
2.15.2 PIF / Bootstrap ROM
The main task the PIF ROM performs is to ensure that a stored program is authorized for execution by the lockout interface in the PIF chip. It performs a verification procedure over the boot code before assuming that it is authorized and jumping to it.
The PIF ROM is located at 0xBFC00000, which is the reset vector for the R4300 CPU. The program counter is immediately set to this value at power on.
The first part of the PIF ROM procedure checks for bit 0x80 at PIF RAM offset $4C (probably to check if the PIF is finished resetting), then loads a word from PIF offset $7E4 in PIF RAM. (Bit 19 determines whether program memory is located at 0xB0000000 (normal), or at 0xA6000000 (N64DD?). It is then stored to $7E44 in RAM. Bit 18 is stored to location $7E54 in RAM (fixme: expansion pak?). The second byte is then stored to $7E50 in RAM; this byte is used to seed the bootcode's checksum. Bit 17 is stored to location $7E4C in RAM (fixme: expansion pak?). 1 is stored to $7E48. A pointer to the function following $138 in PIF ROM is stored to location $7E5C in RAM. (Possibly this could be called by program code as a secondary lockout check if ROM is not hidden by the final actions of PIF ROM?) The low byte is multiplied by 0x6C078965, the low part taken and added to one, and this value used to seed the lockout algorithm. The initial values of s2, s1, and s0 are stored to $7E40, $7E3C, and $7E38 respectively. The value at PIF RAM offset $4C is read, or'd with 0x10, and stored back once SI no longer shows busy I/O.)
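The seed computation described above (low byte multiplied by 0x6C078965, low 32 bits kept, one added) is small enough to write out. This is a sketch of the arithmetic only; the function name `lockout_seed` is hypothetical, and the argument stands for the word the PIF ROM loads from PIF RAM.

```c
#include <stdint.h>

/* Seed for the lockout algorithm: (low byte * 0x6C078965 + 1) mod 2^32.
   Unsigned 32-bit arithmetic in C wraps, which keeps only the low part. */
uint32_t lockout_seed(uint32_t pif_word)
{
    uint32_t low_byte = pif_word & 0xFFu;
    return low_byte * 0x6C078965u + 1u;
}
```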
The default values of $03,$0F,$FF,$FF are used for the PI timing registers RLS,PGS,PWD,LAT. It then reads the first four bytes from the cartridge and uses these values -- which are the same in all released commercial cartridges, 80,37,12,40 -- to set the permanent values of the PI timing registers, which are $40, $803712, $8037, and $803 for LAT, PWD, PGS, and RLS respectively. It reads RDP_STATUS to check if RSP DMEM DMA shows busy (0x01), and if so, it repeatedly checks the pipe busy flag (0x20) until it is clear. It then uses programmed I/O to copy the memory range 0x40 until 0x1000 from the cartridge (the program boot-code) to the same offsets in RSP data memory (DMEM).
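The four values quoted above can be reproduced from the header word 0x80371240 with a mask and three right shifts. The sketch below is an inference from those values, not a transcription of the PIF ROM: it assumes the values are written to LAT, PWD, PGS, and RLS in that order, and that shift amounts of 0, 8, 16, and 20 bits are used because they are the ones that yield $40, $803712, $8037, and $803. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Words written to the domain 1 timing registers (assumed order). */
struct pi_dom1_words {
    uint32_t lat, pwd, pgs, rls;
};

/* Derive the register words from the first 32-bit word of the
   cartridge header. The shift amounts are inferred, not documented. */
struct pi_dom1_words pi_dom1_from_header(uint32_t header_word)
{
    struct pi_dom1_words w;
    w.lat = header_word & 0xFFu;
    w.pwd = header_word >> 8;
    w.pgs = header_word >> 16;
    w.rls = header_word >> 20;
    return w;
}
```

If each register keeps only as many bits as its default value suggests ($FF, $FF, $0F, $03), the effective timings for a commercial cartridge would be $40, $12, $7, and $3; this is likewise an inference.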
Once the program boot-code is copied to RSP DMEM, the lockout algorithm starts by considering offset $40, which is the first byte of the program boot-code.
Subroutines:
- $13C (hashing routine?)
- $1e8 (main loop)
- $380 should be CPU exception vector, but it is the middle of a routine. OS must set an exception vector.
- $550 multu(x, y, *hi, *lo)
- $56C ("returns" to bootcode in RSP DMEM)
When the algorithm is finished running over the program boot-code, then the CPU registers are loaded with the results as well as a byte from the PIF RAM and a function pointer into PIF ROM offset $138 (sub $13C). An output word from the algorithm is stored to PIF RAM offset $30, 5 cycles are waited and SI is quiesced, a reply word is loaded from PIF RAM offset $3C, a second output word from the algorithm is stored to PIF RAM offset $34, SI is quiesced, the second word is or'd with $20 and stored to PIF RAM offset $4C, then a delay is executed with a periodic check for bit $80 at PIF RAM offset $4C. When this condition is true, SI is quiesced, and the value at PIF RAM offset $4C is or'd with $40 and stored back. Finally, the boot code is executed from its location in RSP DMEM.
2.15.3 Program header
When the bootstrap code (described below) is executed, it eventually jumps to a start address to begin execution of the main program module. The bootstrap code compatible with the NUS-CIC-6103 and NUS-CIC-6106 lock-out chips (FIXME FIXME FIXME semantics) expects the code in a different location in memory, and thus the start address must be modified. (Does it really expect it? If so, why must it be placed in the cart header?)
- For NUS-CIC-6103 programs, add 0x100000 to the start address.
- For NUS-CIC-6106 programs, add 0x200000 to the start address.
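A tool that writes the cartridge header could apply these adjustments as in the sketch below. The enum and function names are hypothetical, and only the three lock-out chips discussed here are covered; other chips are treated as needing no adjustment, which is an assumption.

```c
#include <stdint.h>

enum cic_type { CIC_6102, CIC_6103, CIC_6106 };

/* Return the start address to store in the cartridge header for a
   program that is to be loaded at `start` by the given boot code. */
uint32_t header_start_address(uint32_t start, enum cic_type cic)
{
    switch (cic) {
    case CIC_6103: return start + 0x100000u;
    case CIC_6106: return start + 0x200000u;
    default:       return start; /* 6102: stored unmodified */
    }
}
```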
2.15.4 NUS-CIC-6102 boot code in RSP DMEM
- Zero coprocessor exception and timer registers
- If RDRAM reg C == 0 {
- store s3-s7 passed from PIF ROM to stack
- t0 := A4700000
- t2 := A3F80000
- t3 := A3F00000
- t4 := A4300000
- RI_CONFIG_REG = $40 / current control enable (0)
- Delay 24000 cycles
- RI_CURRENT_LOAD = 0
- RI_SELECT = $14 / receive select / transmit select
- RI_MODE = 0
- Delay 12 cycles
- RI_MODE = $0E (stop T active, stop R active, operating mode 0x2)
- Delay 64 cycles
- MI_MODE = $010F (set init mode, initial code length = $F)
- RDRAM_DELAY = 0x18082838
- RDRAM_REF_ROW = 0
- RDRAM_DEVICE_ID = 0x80000000
- t5 := t6 := t8 := s2 := s7 := 0
- t7 := t9 := a2 := A3F00000
- s4 := s6 := a3 := A0000000
- fp = sp
- if MI_VERSION != 0x01010101 { / if version >=2 rcp
- s0 = $400
- s1 = $A3F08000 /what is this reg?
- }
- else { / version 1 rcp
- s0 = $200
- s1 = $A3F04000 /what is this reg?
- }
- $168:
- *(s1+4) = 0 /what is this reg?
- s5 = &RDRAM_MODE / for $778 func?
- if (call($778) != 0) { / else goto $25C
- tmp1 = ret
- MI_MODE = 0x2000 / RDRAM reg mode
- tmp2 := t3 := RDRAM_CONFIG & 0xf0ff0000
- MI_MODE = 0x1000 / clear RDRAM reg
- if (tmp2 == 0xB0190000) {
- t8 += 0x08000000
- t9 += s0*2 (either $200 or $400 depending on version)
- s4 = s6 = A0200000 / 2MB RAM
- s2 = 1
- }
- else {
- s4 = A0100000 / 1MB RAM
- } /RDRAM_CONFIG
- MI_MODE = 0x2000 / RDRAM reg mode
- t1 = RDRAM_DEVICE_MANUF & 0xffff
- k0 = RDRAM_CONFIG
- MI_MODE = 0x1000 / clear RDRAM reg
- if (t1 == 0x500 && (k0 & 0x01000000) == 0 ) {
- RDRAM_RAS_INTERVAL = 0x101c0a04
- }
- else {
- RDRAM_RAS_INTERVAL = 0x080c1204
- }
- t6 += 0x08000000
- t7 += s0*2
- if ++t5 < 8 goto $168
- }
- $25C:
- A3F8000C = 0xC4000000
- A3F80004 = 0x80000000
- sp = fp
- v1 = 0
- $274:
- if (tmp2 == 0xB0090000) { / RDRAM_CONFIG
- *(s1+4) = t8
- call($a40)(tmp1, 1)
- / Confusing code follows, trace in debugger
- t0 = *s6 / what's s6 here, a load from hw regs?
- t0 = 0x00080000 / overwrite t0
- t0 += s6
- t1 = *t0
- t0 = *s6 / but s6 hasn't changed
- t0 = 0x00080000 / overwrite t0
- t0 += s6
- t1 = *t0 / but nothing has changed, unless s6 points to volatile
- t6 += 0x400
- t9 += s0
- s6 += 0x10
- *(s1 + 4) = s7
- }
- else {
- *(s1 + 4) = s7
- s5 = a2 + 12
- call($a40)(tmp1, 1)
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x8 /overwrite?
- t1 = *t0
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x10 /overwrite?
- t1 = *t0
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x18 /overwrite?
- t1 = *t0
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x8 /overwrite?
- t1 = *t0
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x10 /overwrite?
- t1 = *t0
- t0 = *a3 (A0000000?) / uncached ram
- t0 = a3 + 0x18 /overwrite?
- t1 = *t0
- s7 += 0x800
- a2 += s0*2
- a3 += 0x20
- v1++
- if (v1 < t5) goto $274
- s2 <<= 19
- *A4700010 = 0x00063634 | s2
- t1 = *A4700010 / ??? overwrites with this read, then gets overwritten again
- *A0000318 = s6 & 0x0FFFFFFF; / osMemSize!
- sp = fp
- load s3,s4,s5,s6,s7 from stack
- store i-cache tag 0/0 for 16k of main memory in 32 byte chunks starting at 0x80000000
- store d-cache tag 0/0 for 8k of main memory in 16 byte chunks starting at 0x80000000
- } / tmp2 == 0xB0090000
- } / call($778)
- } / reg 0c == 0
- else {
- (maybe if this was a soft reset?), invalidate 512 32-byte i-cache lines and 512 16-byte d-cache lines in RDRAM / why invalidate d-cache here but store tag 0/0 for dcache above?
- }
- $458:
- Copy routine at $4C0-$774 in bootcode (lockout finale and program loader) to uncached RAM, address zero, and jump to it
- sub $778-$87c {
- save regs to stack and call $880
- } / sub $778
- sub $880-$904 {
- t1 = t3 = t4 = 0
- for (t4 = 0; t4 < 64;) {
- v0 = 0
- a0 = t4
- call($90C)(s4)
- if (v0 > 0) {
- k0 = lo((v0 - t1) * t4)
- t1 = v0
- t3 += k0
- }
- t4++
- if (t1 >= 0x50) continue for
- a0 = t3 * 3
- a0 *= 4;
- a0 -= t3;
- a0 *= 2;
- a0 -= 880
- call($980)(a0, a1, s5)
- break
- } / loop t4
- } / sub $880
- sub $90c-$978 {
- v0 = 0
- a1 = 2
- call ($a40)(a0, a1, s5)
- for (fp = 0; fp < 10; fp++) {
- *(s4 + 4) = 0xFFFFFFFF
- v1 = 0xFFFFFFFF
- *s4 = 0xFFFFFFFF
- *s4 = 0xFFFFFFFF
- gp = 0
- v1 = 0x0000FFFF
- while ((v1 & 1) && gp++ < 8, v1 >>= 1) {
- v0++
- }
- } / fp loop
- } / sub $90c
- sub $980-$a38 {
- / Looks like protection code?
- } / sub $980
- sub $a40-$ac8 (a0, a1, s5) {
- a0 &= 0xff
- a0 ^= 0x3f
- t7 = 0x46000000
- if (a1 == 1) {
- t7 |= 0x80000000
- }
- t7 |= (a0 & 1) << 6
- t7 |= (a0 & 2) << 13
- t7 |= (a0 & 4) << 20
- t7 |= (a0 & 8) << 4
- t7 |= (a0 & 0x10) << 11
- t7 |= (a0 & 0x20) << 18
- *s5 = t7 / writes to RDRAM_REGS
- if (a1 == 1) {
- *MI_MODE = 0
- }
- } / sub $a40
- sub $ad0-$b64 (a0, s5) {
- MI_MODE = 0x2000 / MI set RDRAM
- load *s5
- MI_MODE = 0x1000 / MI clear RDRAM
- k0 = (*s5 & 0x40) >> 6
- k0 |= (*s5 & 0x4000) >> 13
- k0 |= (*s5 & 0x00400000) >> 20
- k0 |= (*s5 & 0x80) >> 4
- k0 |= (*s5 & 0x8000) >> 11
- k0 |= (*s5 & 0x00800000) >> 18
- msb(*a0) = k0 & 0xff;
- } / sub $ad0
- followed by unknown stuff at $b70-$fec
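The bit shuffle performed by sub $a40 is easier to follow as a pure function. The sketch below transcribes the pseudocode for that routine into C and returns the word that would be written through s5, leaving out the register writes themselves. The function name `rdram_mode_word` is hypothetical, and what the individual bits mean to the RDRAM is not established here.

```c
#include <stdint.h>

/* Build the word that sub $a40 writes to the RDRAM register at *s5.
   a0 is a 6-bit value; a1 == 1 additionally sets the top bit. */
uint32_t rdram_mode_word(uint32_t a0, uint32_t a1)
{
    uint32_t t7 = 0x46000000u;

    a0 = (a0 & 0xFFu) ^ 0x3Fu;      /* invert the low six bits */
    if (a1 == 1) {
        t7 |= 0x80000000u;
    }
    /* Scatter the six bits of a0 across three bytes of t7. */
    t7 |= (a0 & 0x01u) << 6;        /* bit 0 -> bit 6  */
    t7 |= (a0 & 0x02u) << 13;       /* bit 1 -> bit 14 */
    t7 |= (a0 & 0x04u) << 20;       /* bit 2 -> bit 22 */
    t7 |= (a0 & 0x08u) << 4;        /* bit 3 -> bit 7  */
    t7 |= (a0 & 0x10u) << 11;       /* bit 4 -> bit 15 */
    t7 |= (a0 & 0x20u) << 18;       /* bit 5 -> bit 23 */
    return t7;
}
```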
2.15.5 NUS-CIC-6102 loader stub in RAM
This loader is copied to RDRAM address zero from $4C0-$774 in the 6102 bootcode, and then executes from RDRAM to load the first 1MB of the program, verify its integrity, and execute it. Pseudocode:
- Read the 3rd word of cart header, which is the program start address in RAM (which must be past the end of this routine).
- Wait for PI I/O to quiesce.
- Use PI DMA to copy 1MB to the start address in RAM starting at cartridge offset 0x1000.
- Wait for PI DMA to complete.
- Starting at program start address in RAM, perform checksum seeded with a1=s6*0x5D588B65 over first 0x100000 (1MB) bytes of program. (FIXME: s6 comes from the PIF?) Checksum routine places results in a3 and s0.
- If a3 is not equal to the first checksum word in the cart header, or if s0 is not equal to the second checksum word in the cart header, loop forever (halt).
- If RSP has a PC other than zero, set RSP single step and clear halt and then set the RSP PC to zero.
- Set RSP halt and clear all other RSP conditions.
- Unmask all MI interrupts and clear each one individually.
- Place various parameters to be passed to program on stack at 0xA0000300. (FIXME: Describe)
- Write zeros to entire RSP DMEM and IMEM regions
- Jump to program start address in RAM (from cart header)
Annotation:
*A4600000 (PI DRAM_ADDR) = 3rd word of Cart header (start address?) & 0x1FFFFFFF
while ((*A4600010 & 2 )) { loop } / wait for PI no I/O busy
*A4600004 = 0x10001000 / PI DMA cart address
*A460000C = 0x000FFFFF / PI DMA write length (1MB)
do { wait 28 cycles } while ((*A4600010 & 1)) / wait for PI no DMA busy
a0 = 3rd word of cart header (boot address)
a1 = s6 * 0x5D588B65
ra to stack, s0 to stack
ra = 0x00100000 / length to checksum over starting at cart header
t1 = a0
start checksum loop {
v0 = *t1 / pick up next word in program code?
math stuff
} / checksum loop
if a3 != *0xB0000010 / 5th word of cart header checksum?
then loop forever
if s0 != *0xB0000014 / 6th word of cart header checksum?
then loop forever
690:
/ OS setup
s0 = tmp1
ra = tmp?
if (*A4080000 != 0) { / RSP PC reg
*A4040010 = 0x41 / set single step, clear halt?
*A4080000 = 0 / PC = 0
}
*A4040010 = 0x00AAAAAE / set halt, clear broke, clear intr, clear sstep, clear all sigs
*A430000C = 0x555 / enable all interrupts
*A4800018 = 0 / clear SI interrupt
*A450000C = 0 / clear AI interrupt
*A4300000 = 0x800 / clear DP interrupt
*A4600010 = 2 / clear PI interrupt
*A0000314 = s7 / A0000300 is initial stack?
*A000030C = s5
*A0000304 = s3
*A0000300 = s4
if (s3 != 0) {
*A0000308 = 0xA6000000 / devkit?
}
else {
*A0000308 = 0xB0000000 / cart
}
Clear RSP DMEM
Clear RSP IMEM
Jump to start address in cart header
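The values that the loader stub derives before and after its PI DMA can be collected in a few lines of C. This is a sketch of the arithmetic in the annotation above only: the checksum loop itself is not reproduced because its body is not documented here, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Register values the loader writes for its 1MB PI DMA. */
struct loader_dma {
    uint32_t dram_addr; /* written to A4600000 */
    uint32_t cart_addr; /* written to A4600004 */
    uint32_t wr_len;    /* written to A460000C */
};

/* Derive the DMA parameters from the start address in the cart header. */
struct loader_dma loader_dma_params(uint32_t header_start)
{
    struct loader_dma d;
    d.dram_addr = header_start & 0x1FFFFFFFu; /* strip segment bits */
    d.cart_addr = 0x10001000u;                /* cartridge offset 0x1000 */
    d.wr_len    = 0x000FFFFFu;                /* 1MB minus one */
    return d;
}

/* Checksum seed a1 = s6 * 0x5D588B65, keeping the low 32 bits. */
uint32_t loader_checksum_seed(uint32_t s6)
{
    return s6 * 0x5D588B65u;
}
```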
2.15.6 Program code
Once the bootstrap code has transferred execution to our main program module, we are mostly in business. There is one remaining task to perform before we do anything else, and that is to OR the word at address 0xBFC007FC with 0x08. This address is located at the end of the PIF RAM, and it enables the soft-reset/NMI generation of the console's reset button. If this is not performed, the console will automatically reset in a few seconds. (FIXME WHY?)
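In C, this first task is a single read-modify-write. The sketch below takes the register address as a parameter so that it can be exercised off-target; on the console the argument would be `(volatile uint32_t *)0xBFC007FC`. The function name is hypothetical, and the sketch uses direct access rather than SI DMA, which is an assumption about what is acceptable this early in program startup.

```c
#include <stdint.h>

/* OR the PIF status word with 0x08, as the program must do
   shortly after the boot code hands over control. */
void pif_enable_reset_button(volatile uint32_t *pif_status)
{
    *pif_status |= 0x08u;
}
```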
There is one snag we might encounter, and that is that the boot code has only transferred the first 1MB of our program image to RAM. If the program code is larger than 1MB, then we must execute a PI DMA ourselves to transfer its remainder. This situation may manifest itself as seemingly random crashes on large projects, because the CPU will eventually jump to a program address that has not yet been loaded into RAM.
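The parameters for that second transfer follow directly from the first one. The sketch below assumes the program knows its own total code size (for example from a linker symbol) and that the boot code loaded exactly 0x100000 bytes from cartridge offset 0x1000; the struct and function names are hypothetical.

```c
#include <stdint.h>

#define BOOT_SEGMENT_LEN 0x100000u   /* bytes already loaded by the boot code */
#define CART_CODE_BASE   0x10001000u /* PI address of cartridge offset 0x1000 */

struct rest_dma {
    uint32_t dram_addr; /* physical RDRAM destination */
    uint32_t cart_addr; /* source address in PI space */
    uint32_t wr_len;    /* transfer length minus one */
};

/* Fill in the PI DMA parameters for the part of the image beyond the
   first 1MB. Returns 0 if nothing remains to be loaded, 1 otherwise. */
int rest_of_image(uint32_t start_addr, uint32_t image_len, struct rest_dma *out)
{
    if (image_len <= BOOT_SEGMENT_LEN) {
        return 0;
    }
    out->dram_addr = (start_addr & 0x1FFFFFFFu) + BOOT_SEGMENT_LEN;
    out->cart_addr = CART_CODE_BASE + BOOT_SEGMENT_LEN;
    out->wr_len    = image_len - BOOT_SEGMENT_LEN - 1u;
    return 1;
}
```

The routine that issues this DMA must itself reside within the first 1MB of the image, since nothing beyond that point is in RAM yet.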
The program start address must be after the end of the bootcode stub, if any, that executes from RAM, or the stub will be wholly or partially obliterated, causing boot failure. Also, when linking, the linker must have knowledge of the program start address if it is not offset zero. For example, if the program thinks it is being loaded at address zero, but in reality has been copied by the bootcode to address 0x400, any jumps to absolute addresses will jump to an incorrect location. For correct operation, the linker must take these addresses and add the start address.
2.15.7 Boot emulators
key - don't trash the structures that the PIF ROM sets up. (true?) This is why early 6105 boot emulators had problems determining whether a memory expansion pack was installed or not.
- detect boot code of cart image
- write proper cic value to $22
- jump to game's boot code (or is it start address?)
can we instead simply:
- pi dma 1mb of program into ram
- jump to start address
2.15.8 Soft resetting
2.15.9 Other notes
There is a fixed memory location somewhere that can be used to preserve 64 bytes of information across an NMI/soft reset but not a hard reset (osAppNMIBuffer). Also, anything outside of the 1MB boot segment will remain untouched after a soft reboot.
3 Setting up GNU Binutils for N64
3.1 Introduction
We will be developing for N64 using the free GNU toolchain and a Debian GNU/Linux host machine. The basic components of a toolchain are the utilities used to manipulate object files of machine language code, such as an assembler, a linker, a librarian/archiver, a profiler, and programs to copy to and from various object formats and to examine symbols within a particular object file. Later, we will add C and C++ compilers, standard C/C++ libraries, and a source-level debugger to complete the toolchain.
GNU Binutils can be downloaded at:
The most recent binutils release as of this writing is 2.15 (May 2004).
Note that Debian provides a package named 'toolchain-source' that can expedite building a set of toolchain packages from scratch. This is typically an easier but less universal approach. Occasionally toolchain-source becomes unmaintained and out-of-date, so building from the upstream sources may be preferable in those cases.
3.2 Configuring and Building Binutils
Once we have downloaded the file (in bzip2 format), execute the following commands to unpack it and change into its directory:
- $ tar jxf binutils-2.15.tar.bz2
- $ cd binutils-2.15
Next, we will configure the package for building. First, we need to define a few standard options for building the toolchain.
- Build system
The type of system that the package is being built on.
- Host system
The type of system that the built package will run on.
- Target system
The type of system that the built package will produce object code for.
Usually, and especially in this case, the Build and Host systems will be the same. We are building the package on an i386 host to execute on an i386 host, and those will be automatically detected by the configure process. However, the target system is our N64.
The toolchain configuration options expect a canonical system name. We can use the included 'config.guess' script to find the canonical system name of the host, but since we cannot run this script on the target, we must reasonably guess a canonical system name for the target.
A canonical system name is comprised of the following components, separated by dashes:
- Hardware architecture (including endianness)
- System vendor (or 'unknown', or nothing)
- Operating system or object file format
For our N64, the hardware architecture is 'mips' (little endian would be 'mipsel'). For discussion of 64-bit issues, see "64-Bit-Issues". We don't really need to worry about them right now except that our toolchain should be configured as 'mips64' and not 'mips'.
The system vendor should be 'unknown' since the GNU toolchain has no pre-existing support for it. However, it turns out that we need the system vendor to be 'linux' in order to (coupled with 'mips64' arch) obtain a toolchain capable of using the desired 'n32' ABI (see "64-Bit-Issues").
Since we are not building object code to run under an operating system, we could define an object file format instead. We will use ELF (Executable and Linkable Format), which is 'elf' in the canonical name. ELF is a suitably flexible object format in general, but using ELF means that our toolchain will also be compatible with the official SDK libraries, which are in ELF format. A licensed developer might wish to use a toolchain built here to build a program with code from the official SDK.
Putting the pieces together, we have:
- mips64-linux
- or
- mips-unknown-linux-elf
Our configure command is thus:
- $ ./configure --prefix=/usr/local/n64 --program-prefix=n64- --target=mips64-linux
We define a program-prefix for the programs in our N64 toolchain, so that our binaries do not have name conflicts with any other toolchain already on the system.
After Binutils is configured, simply issue 'make' to complete the build. If there are no errors, issue 'make install', and the fundamental pieces of our toolchain will have been installed in /usr/local/n64.
3.3 Binutils Components
- n64-ar - Archiver, for combining object files into libraries
- n64-as - Assembler, to encode assembly source files into machine instructions
- n64-ld - Linker, to combine separate object files into a program and resolve symbol references between objects
- n64-nm - Lists symbols and interfaces in object files
- n64-objcopy - Translates between different object file formats
- n64-objdump - Display information about an object file, including its disassembly
- n64-ranlib - Create symbol index in a library
- n64-size - Display size of object file sections
- n64-strings - Search object for NUL-terminated ASCII strings
- n64-strip - Remove profiling information, debugging information, and/or symbols from an object
Our exposure to these tools will primarily be limited to ar, as, ld, objdump, and strip. However, all of these tools are useful in various circumstances, since they are nearly the only resources we have to easily manipulate binary object files.
3.4 Using Binutils
You can add the N64 toolchain to your executable search path:
- $ PATH=$PATH:/usr/local/n64/bin
Then, for example to invoke the assembler, you can just type 'n64-as' instead of '/usr/local/n64/bin/n64-as'.
Assemble the following test program, passing -mips3 to the assembler. This will verify that the assembler works as expected:
For developing your own programs in assembly language, see "The MIPS64 Architecture For Programmers", volumes I and II.
4 Bootstrapping the Development Environment
4.1 Building a N64 Program Image
4.1.1 Header
4.1.2 Boot Code
4.1.3 Checksum
ld script
5 Using and Extending a Hardware Simulator
The multi-system emulator GXemul (0.4.6.3)
- useful for fast prototyping
- cheaper than dedicated simulation hw (INDY device or PSY-Q SCSI thing)
- requires a fast CPU to simulate the target in realtime
- must emulate peripherals in software
- does not require boot rom from N64, since we have existing boot emulators to work from
Can be incrementally developed by attempting to run known-working software and filling in what is missing one piece at a time. Real machines have to be built according to specifications and tested for adherence to them.
- src/devices - Drivers for emulated hardware components
- src/machines - Initialization and definitions for a particular system
- src/prom - Emulation of system state after boot ROM has executed
In src/machines/machine_n64.c:
- Build memory map
- Register device emulators with IRQ machine[0].cpu[0].X (2-6)
- Emulate PIF ROM (and possibly boot-code for inserted cart) if user has not provided PIF ROM
General gxemul setup
- Edit src/include/machine.h and add a new define for your system to be used in MACHINE_REGISTER(x)
- Create a new CPU entry (R4300) according to those defined in src/include/mips_cpu_types.h; use R4000 as a template and refer to cpu_mips.h. (Cache sizes in mips_cpu_types.h are powers of 2.)
- Add code to initialize CP0 for the R4300 into src/cpus/cpu_mips_coproc.c
In src/devices/dev_n64_rcp.c:
- Implement memory access, tick, and interrupt routines for all system devices
- Register all devices with gxemul
- Instantiate devices from machine_n64.c with string arguments for I/O range and interrupts
Edit src/devices/Makefile.skel and src/machines/Makefile.skel, adding the corresponding .o for whatever .c file you created. Then run ./configure && make. Run ./gxemul -H and you should see your N64 port in the list. If you have dumped an image of the PIF/Bootstrap ROM from your N64, you can try executing it. Otherwise, try executing your own test program that is written according to this manual. Fill in hardware functionality as required.
Make sure the bootstrap rom is the last ROM specified if PROM emulation is not implemented or disabled.
gxemul -En64 -viJQT 0xffffffffb0000000:DX-BE12B.ROM 0xffffffffbfc00000:pifntsc.raw
6 Building A C/C++ Toolchain
6.1 64-Bit Issues
The N64 CPU is capable of using its general purpose registers in 64-bit mode by setting a CPU flag. However, just doing this does not provide any advantage to software, since the compiler may not be emitting instructions that utilize 64-bit registers. The question is, under what circumstances do we wish to actually utilize 64-bit code?
64-bit code does not necessarily run faster than 32-bit code. 64-bit register loads take more instructions, increasing the storage size of the executable code and decreasing cache efficiency. 64-bit constants also increase the size of the executable code. So we want to only emit 64-bit code when it makes sense for our application.
First of all, it would be nonsense to use 64-bit pointers in the N64, since its entire memory map, including hardware, fits in a 32-bit address space.
Since the GNU C compiler fixes the default size of the pointer to be equal to the size of the long, it would be more convenient to have 32-bit longs, because then we would not have to explicitly declare pointers to be 32-bit.
The base of existing open source software typically assumes that a long is 32 bits, since that's what most machines today use. So this choice would also eliminate issues of porting third-party code to the N64.
We thus have several advantages to choosing an ILP32 (integer/long/pointer = 32-bit) target over an LP64 (long, pointer = 64-bit) target. Furthermore, we can still utilize 64-bit integers in any special cases by using the 'long long' type and passing -mips3 to the compiler (enabling MIPS III ISA, required for instructions that utilize 64-bit registers).
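As an example of such a special case, a full 32x32-bit unsigned multiply needs a 64-bit intermediate even though every pointer and long in the program stays 32 bits wide. The sketch below is illustrative only; its signature mirrors the multu(x, y, *hi, *lo) helper noted in the PIF ROM subroutine list, but the function name and body are hypothetical.

```c
#include <stdint.h>

/* Multiply two 32-bit values and return the high and low halves of
   the 64-bit product. Only the intermediate uses 'long long'. */
void multu32(uint32_t x, uint32_t y, uint32_t *hi, uint32_t *lo)
{
    unsigned long long product = (unsigned long long)x * y;
    *hi = (uint32_t)(product >> 32);
    *lo = (uint32_t)product;
}
```

Because the 64-bit type is confined to one local variable, the rest of the program keeps the code-size and cache benefits of 32-bit data.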
The next challenge will be to choose an ABI (Application Binary Interface). The ABI essentially defines the size of C data types as well as establishing a convention for function calls with respect to what arguments are passed in registers, what is passed on the stack and where on the stack it is located, and where to find the function's return value. An ABI is what allows us to link object modules together and expect them to work. We will also need to respect the ABI when mixing C modules with pure assembly modules (usually C modules with inline assembly are more convenient).
MIPS defines four ABIs - 'o32', 'n32', 'o64', and 'n64', where the 'o' means 'old' and the 'n' means 'new', and where the default 32-bit ABI is 'o32' and the default 64-bit ABI is 'n64'. (The name is a coincidence.) The 'o32' ABI is the first MIPS ABI. It is ILP32 and only MIPS I/II instructions are allowed. It also allows only 16 floating point registers to be used, hampering modern FP coprocessors. The 'o64' ABI is obsolete and irrelevant for this discussion. The 'n64' ABI was the 64-bit ABI that replaced 'o32'. It is LP64, allows 32 FP registers to be used, and allows MIPS III/IV instructions. However, this ABI has all the disadvantages of 64-bit architectures noted above. As a compromise, the 'n32' ABI was developed. Like the 'n64' ABI, it allows 32 FP registers as well as MIPS III/IV instructions. However, it is ILP32, which turns out to be much more efficient on an embedded system like the N64.
So we will want to build our application code with the 'n32' ABI, which is done by specifying -mabi=n32 to the GNU compiler. This implies the -mips3 option. (The GNU compiler's default is to compile for MIPS I and the 'o32' [-mabi=32] ABI. If the toolchain is configured as noted below, the 'n32' ABI will be the default.)
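It is easy to end up building with a different ABI than intended, so a compile-time check can be useful. The sketch below assumes that the compiler predefines _MIPS_SIM together with _ABIO32, _ABIN32, and _ABI64 on MIPS targets, as GCC is understood to do; verify this against your own toolchain by dumping its predefined macros before relying on it. The function name mips_abi_name is hypothetical.

```c
/* Reports which MIPS ABI this file was compiled for.
 * Assumption: the compiler predefines _MIPS_SIM and the _ABI* constants on
 * MIPS targets. On any other compiler or target, the last branch is taken. */
const char *mips_abi_name(void)
{
#if defined(_MIPS_SIM) && defined(_ABIN32) && (_MIPS_SIM == _ABIN32)
    return "n32";
#elif defined(_MIPS_SIM) && defined(_ABIO32) && (_MIPS_SIM == _ABIO32)
    return "o32";
#elif defined(_MIPS_SIM) && defined(_ABI64) && (_MIPS_SIM == _ABI64)
    return "n64";
#else
    return "unknown";
#endif
}
```

The same preprocessor conditions could guard an #error directive instead, so that a module written for 'n32' refuses to build under any other ABI.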
The proprietary Nintendo ELF libraries are built for MIPS II with the 'o32' ABI. If you want to link your code with the Nintendo libraries, you will need to compile your code with the -mips2 option (which implies -mabi=32, for 'o32').
For a GNU configure tuple, the 'mips' target implies a default 'o32' ABI. The problem with configuring like this is that 'n32' is then not even available as a target ABI. Current binutils (2.15) requires a mips64-linux target in order for the 'n32' ABI to be enabled, in which case it becomes the default. So we will use 'mips64' as the architecture and 'linux' as the operating system, even though neither particularly makes sense in this scenario.
Remember that if you are using functions written in assembly language, you must obey the ABI that your application is using. It would also be a good idea to assemble with the appropriate -mips[1-4] and -mabi=? options, so that the assembler can verify that you haven't accidentally used an instruction that is invalid for the ABI and ISA that you are using. You can find the information on the 'n32' ABI in the following document:
"MIPSpro N32 ABI Handbook" (SGI document number: 007-2816-005).
Note: Instead of using -mips3 when compiling, you can use -march=vr4300. This option implies -mips3, additionally enables any tuning for that specific CPU, and doesn't work around errata not present in that CPU like a generic -mips[1-4] option would. (A generic option like -mips3 must support any CPU that implements that ISA level, no matter what bugs it has, so that option will implement a union of all workarounds for CPUs implementing that ISA level.) We can use this CPU-specific option for code targeting the N64, because we know exactly what CPU is used on the target, and that our code will never need to run on any other CPU.
6.2 Building the GNU C Compiler Without A Target OS
We will be building what is known as a 'bootstrap' C compiler. It does not have support for threads or shared libraries. However, it does not require a C library, which at this point is the situation we are in. We will also wait until later to build the C++ compiler, since building it requires a working C library. Once the bootstrap compiler is built, we will use it to build a C library that either requires (glibc) or does not require (newlib) an operating system. Once that C library is built, we will then build a C++ compiler, and also rebuild the C compiler, with thread and shared library support if the target operating system supports it. The C++ compiler and rebuilding of the C compiler can be skipped if there is no necessity for those features on the target (e.g., if the bootstrap compiler is sufficient to produce the application code for the target).
./configure --prefix=/afs/icequake.net/dist/arch/i386_linux24/toolchain --program-prefix=n64- --target=mips64-linux --disable-threads --disable-nls --disable-shared --enable-languages=c --without-headers --with-newlib
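Because the bootstrap compiler has no C library behind it, code built with it must supply even the most basic routines itself. As a sketch of what that looks like, here are stand-ins for memset() and memcpy() that depend only on <stddef.h>, a header supplied by the compiler rather than by a C library. The names n64_memset and n64_memcpy are hypothetical.

```c
#include <stddef.h>  /* size_t: supplied by the compiler, not by a C library */

/* Hypothetical stand-in for memset(), usable before any C library exists. */
void *n64_memset(void *dst, int value, size_t len)
{
    unsigned char *p = dst;
    while (len--)
        *p++ = (unsigned char)value;
    return dst;
}

/* Hypothetical stand-in for memcpy(); the two regions must not overlap. */
void *n64_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len--)
        *d++ = *s++;
    return dst;
}
```

Byte-at-a-time loops are the simplest correct form; a word-at-a-time version would be faster but has to deal with alignment, which is left out here.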
7 Building A Reusable Library of Functionality
host - general functions + includes with constants + host specific functions
target - general functions + includes with constants + target specific functions (communications slave, inline asm, etc)
8 Performance Evaluation and Benchmarking
8.1 Bottlenecks
8.1.1 Video
While the N64's graphics hardware is reasonably fast, it is still not very competitive even with the Sony PlayStation, which was released nearly two years ahead of it. The two video bottlenecks are:
- Triangle throughput
The N64 has a claimed throughput of 150,000 fully lit and transformed polygons per second. While this sounds like a lot, it really isn't, especially in complex scenes. At 30 fps, that is only 5,000 polygons per frame. Using flat or Gouraud shading can help this throughput. We don't know if this number includes texture mapping, so we will test this. Note that if the RSP is being used for audio processing, polygon throughput will suffer; not only will the RSP be dividing its computation time between two tasks, but the RSP microcode must also be reloaded every time it changes duties.
- Texture memory
The RDP has 4KB of texture memory, which is precisely enough for one 32x32 32-bit RGBA texture. We can make more efficient use of this memory by using smaller textures or scaling down the color depth, but image quality will suffer. Unfortunately, we cannot avoid thrashing on such a small amount of memory, so we must be creative with our texture usage.
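Both budgets above are simple arithmetic, and it can help to keep them in code so that art assets and scene complexity are checked against them. The following sketch uses hypothetical helper names and counts only raw texel storage; it does not model any layout constraints the hardware may place on texture memory.

```c
#include <assert.h>

#define TMEM_BYTES 4096u  /* 4KB of RDP texture memory, per the text */

/* Polygons available per frame for a given per-second throughput and frame rate. */
unsigned polygons_per_frame(unsigned polygons_per_second, unsigned frames_per_second)
{
    assert(frames_per_second > 0);
    return polygons_per_second / frames_per_second;
}

/* Raw storage, in bytes, for a width x height texture at the given bits per texel
 * (rounded up to a whole byte). */
unsigned texture_bytes(unsigned width, unsigned height, unsigned bits_per_texel)
{
    return (width * height * bits_per_texel + 7u) / 8u;
}

/* Returns 1 if the raw texel data fits in texture memory, else 0. */
int fits_in_tmem(unsigned width, unsigned height, unsigned bits_per_texel)
{
    return texture_bytes(width, height, bits_per_texel) <= TMEM_BYTES;
}
```

For example, polygons_per_frame(150000, 30) gives 5000, and texture_bytes(32, 32, 32) gives exactly 4096, matching the figures in the text. Dropping to 16 bits per texel halves the storage, so a 64x32 texture at that depth also comes to 4096 bytes.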
8.1.2 Audio Hardware
The N64 AI (Audio Interface) is only a very simplistic audio processor compared to those of the competing Sega Saturn and Sony PlayStation. It has a set of registers which, at any time, can have at most one sample queued while playing one sample from memory. Samples can be 4 to 16-bit, up to 48 kHz, raw or ADPCM format. Any mixing, synthesis, or effects processing must be done by the CPU or RSP in real-time, at an incremental performance cost for each channel or effect in use.
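To make the per-channel cost concrete, here is a minimal sketch of a software mixer for signed 16-bit samples, of the kind the CPU would have to run before handing a buffer to the AI. It is plain portable C with hypothetical names, and it does not touch any AI registers. The inner loop runs once per channel per output sample, which is exactly the incremental cost described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate sum to the signed 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Mix 'channels' input buffers of 'frames' samples each into 'out'.
 * Work done is proportional to channels * frames. */
void mix_channels(int16_t *out, const int16_t *const *in,
                  size_t channels, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        int32_t sum = 0;
        for (size_t c = 0; c < channels; c++)
            sum += in[c][i];
        out[i] = saturate16(sum);
    }
}
```

Saturating instead of letting the sum wrap avoids the loud artifacts that overflow would otherwise produce when several channels peak together.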
8.1.3 Unified Memory
The N64's unified memory architecture, while convenient to program, presents an additional performance issue: peripherals must contend for memory access. When only using one peripheral at a time, this should not be a problem at all. But consider this scenario: we are playing a sound buffer on the AI, and a PI DMA is performed to transfer more sound or model data into memory, while the RSP writes an RDP command list to a FIFO buffer in memory, and the RDP reads from that buffer and writes to the back framebuffer... and on top of that, the VI must read the entire contents of the front framebuffer up to 60 times per second in order to refresh the display!
This sounds like a contrived scenario, but it really is not; this is what is going on inside the N64 at all times when a typical game program is running. The problem is that we could spend years trying to optimize each line of code to ensure that hardware does not waste time waiting on memory access, but that would most likely yield only a marginal improvement compared to the amount of time spent on execution analysis trials. Instead, we will try to single out some worst-case memory access patterns through benchmarking, and optimize out those cases in library code so that all applications will benefit.
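To put a rough number on just one of the memory clients in that scenario, the display refresh traffic can be estimated from the framebuffer geometry. The helper below is a sketch with a hypothetical name, and the resolutions and pixel sizes used with it are illustrative figures, not values taken from the text.

```c
#include <stdint.h>

/* Bytes per second that must be read from memory to scan out a framebuffer
 * of the given size at the given refresh rate. */
uint64_t scanout_bytes_per_second(uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel, uint32_t refresh_hz)
{
    return (uint64_t)width * height * bytes_per_pixel * refresh_hz;
}
```

Assuming a 320x240 framebuffer with 2 bytes per pixel refreshed 60 times per second, this comes to 9,216,000 bytes per second before any other device has touched memory. Similar estimates for the other clients give a first idea of which access patterns are worth benchmarking.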
8.2 Benchmark Design
9 Porting RedBoot/eCos
RedBoot can be used as a general purpose debug loader, or specifically as a Linux bootloader.
10 Porting Linux
10.1 The kernel
10.1.1 ucLinux
Kernel messages -> screen, or debug stub (gameshark, cd64, z64, etc)
10.1.2 framebuffer
10.1.3 startup code
Interrupts: Linux-MIPS porting guide
10.2 Userland
10.2.1 cramfs
10.2.2 busybox (as init replacement)
10.2.3 newlib
10.2.4 dynamically linked busybox
10.3 Drivers
10.3.1 Input
We will support the use of a joystick and/or a hacked mouse. Explore keyboard input. Force feedback (rumble).
10.3.2 DirectFB OpenGL
10.3.3 ALSA Audio
10.3.4 Cartridge/peripheral interface
On devices with PI registers, we should be able to access the Zip disk or CD filesystems transparently, and to use the communications port of peripherals. We should have an interface to dump the cart and to write to the PI safely.
10.3.5 Memory technology devices (cart memory, memory packs)
10.3.6 Serial interface
Can the controller interface be used as a general serial interface (so that we can, e.g., get a terminal or run PPP over an Adaptoid)? We need to find out how to get SI interrupts.
10.3.7 Power management
Is R4300's low power mode supported? Can we enable it when the machine has been idle?
10.4 Userspace code examples
10.4.1 Universal APIs (OpenGL, ALSA)
10.4.2 System-specific APIs (cart dumping)
10.5 Preparing a Linux-N64 program for market
11 Porting KallistiOS
12 Complete Hardware Reference
13 Peripheral Hardware Reference
13.1 UFO/Success CD64
13.1.1 CD64 Hardware and Operation
How does it let the cartridge talk for CIC and bootcode, but then override the bus to deliver BIOS code when in BIOS mode, but not in "Play cart" mode?
13.1.2 Upgrading and Modifying the CD64 Hardware
13.1.3 Interfacing the CD64 from a N64 program
Example of writing a replacement CD64 BIOS.
13.1.4 CD64 PI Register Map
13.1.5 Communicating with the CD64 BIOS and Ghemor from a PC
13.2 Mr. Backup Z64
13.2.1 Z64 Hardware and Operation
HW1/HW2/HW3 differences
13.2.2 Upgrading and Modifying the Z64 Hardware
RAM upgrade (512KB -> 1MB)
HD/Zip250/Orb upgrade
Connecting IRQ/DMA lines
NIC/parallel port/commslink/VGA addons
keyboard connector
13.2.3 Interfacing the Z64 from a N64 program
Example of writing a replacement Z64 BIOS.
13.2.4 Z64 PI Register Map
13.2.5 Modifying the Z64 Operating System
13.2.6 Communicating with the Z64 from a PC
13.3 Bung Doctor V64
13.3.1 Doctor V64 Hardware and Operation
13.3.2 Upgrading and Modifying the Doctor V64 Hardware
13.3.3 Modifying the Doctor V64 Operating System
13.3.4 Communicating with the Doctor V64 from a PC
13.4 Bung Doctor V64jr
13.4.1 Doctor V64jr Hardware and Operation
13.4.2 Upgrading and Modifying the Doctor V64jr Hardware
13.4.3 Interfacing the Doctor V64jr from a N64 program
13.4.4 Communicating with the Doctor V64jr from a PC
13.5 Valery's PV-Backup
13.6 Ultra64Pro RAM Card
8-64MB SDRAM/EDO Battery backup
PC connection (ECP or USB)
R/W registers mapped to N64 PI to change mode and N64<->PC comms
built in save emulation for SRAM, 4K/16K EEP <->controller pak file (compress SRAM to fit normal pak) and flashram <->4X controller pak
BIOS with save management and configuration runs if no file loaded
CIC passthrough or emulation
13.7 Wildcard 64
13.8 Datel Game Shark Pro
13.8.1 Loading and Booting N64 Code
13.8.2 Communicating with the Game Shark from a PC
13.8.3 Communicating with the PC from a N64 program
13.9 Wishtech Adaptoid
13.10 Generic SI<->RS232 adapter (PIC?)
13.11 Tristar 64
13.12 GB Hunter/Game Booster 2in1
13.13 64DD Disk Drive ("Leo")
13.14 PSY-Q Development Board and Suite
13.15 Indy Emulator (Official)
13.16 N64 Country Adapters (Passport, etc)
Do they have a program/bootemu or are they only a physical adapter? Semantics of booting PAL vs NTSC (lockout chip), PAL-fixing, cart interchangeability