Jumptables in 68k
How to use Jmp Tables in 68k
Hello, I found this text in AmigaGuide format in my asm directory on my harddisk, and it should be usefull in fargo to, so I thought I share it with the rest of the fargo people. I have not tested any of it in a68k, but I thought it would be better to send I up now, when I still remember.
BTW, I don't know who wrote it in the first place.
/ Oscar Lindberg
f96osli@dd.chalmers.se
---------
Many times, when I first started assembly language coding, I would pour over a piece of code that I just needed to squeeze a couple more clock cycles out of. The code that, when a raster check is done, only over runs the frame by one raster. I still do that, but the code that I find myself looking at in that way now, would have taken at least two whole frames back then. A large part of this is a direct result of a simple observation that I made one day. The ideal piece of code is one that, with very few exceptions, makes no branches. Why? Well, not only are branches slow instructions, but on machines with a cache, in many cases, it will flush the cache.
What did I do about this? Well, the first, and foremost, step is to really look at the code. The end result of the code really has to be analyzed. Take for example the following PSet() routine.
* a0: byte in bit plane to be set.
* d0: bit in byte...
* d1: offset to next plane
* d2: color
PSet: btst #0,d2 ; is this bit to be set?
beq.b .1 ; nope
bset.b d0,(a0) ; set the bit
.1: adda.l d1,a0 ; next plane
btst #1,d2 ; is this bit to be set?
beq.b .2 ; nope
bset.b d0,(a0) ; set the bit
.2: adda.l d1,a0 ; next plane
btst #2,d2 ; is this bit to be set?
beq.b .3 ; nope
bset.b d0,(a0) ; set the bit
.3: rts
This is by far the worst case. This routine would take 92 clock cycles in the best case! It could be improved by replacing the btst's with lsr's and the beq's with bcc's, but the code is still slow. The killer: the branches.
This introduces the one branch method. It is essentially a construct that everyone has used in HLL's forever: the case statement. If one wanted to write the above code in C, they wouldn't use three if's, would they? Let's replace the above code with a simple table lookup and see what kind of speed increase it gives.
* a0: byte in bit plane to be set.
* d0: bit in byte...
* d1: offset to next plane
* d2: color
PSet: add.w d2,d2 ; word offset
add.w d2,d2 ; long offset
move.l Table(pc,d2.w),a1 ; addr. of correct function
jmp (a1) ; call it!
Table: dc.l .0,.1,.2,.3
.3: bset.b d0,(a0,d1) ; bpl1
adda.l d1,a0 ; a0->bpl1
.2: bset.b d0,(a0,d1) ; bpl1 or bpl2
.1: bset.b d0,(a0) ; bpl0
.0: rts
While this version accomplishes the something, it does it in much, much less time. This time it only takes 50 clock cycles, 54% of before. This time the killer isn't the branches, it's memory references. The (d8,PC,Xn) addressing mode takes 14 cycles on long data, but only 10 on word or byte data. We can improve this timing and get rid of the add's by making the table a byte offset. This will cause a penalty on the jmp, but it will be well justified.
* a0: byte in bit plane to be set.
* d0: bit in byte...
* d1: offset to next plane
* d2: color
PSet: move.b Table(pc,d2.w),d2 ; offset of function
jmp Table(pc,d2) ; call it!
Table: dc.b .0-Table,.1-Table,.2-Table,.3-Table
even
.3: bset.b d0,(a0) ; bpl0
adda.l d1,a0 ; a0->bpl1
.2: bset.b d0,(a0,d1) ; bpl1 or bpl2
.1: bset.b d0,(a0) ; bpl0 or bpl1
.0: rts
This routine is now reduced to a mere 44 clock cycles. Saving only six cycles my not seem like much, but in a time critical part of code it can make all the difference in the world. This still isn't the fastest that this routine could be, but the optimizations that remain are beyond the scope of this article. The bottom line is, the shortest distance between two points is a straight line!