Some hints on writing applications for the Falcon DSP
1. How to pick a suitable subject for the DSP
In general, the DSP is very well suited for multiplication and additions and is not very good at bit operations. So for instance a GIF-viewer won't gain much from using the DSP (a lot of bit operations), whereas a JPEG viewer will (JPEG needs matrix multiplications). Realtime audio and video manipulations are major areas of interest, complicated floating point calculations another. Another speed consideration is the overhead involved in getting data into the DSP and getting it out; therefore a floating point emulator using the DSP is possible, but can't be expected to give a large speed gain. If you can pack your calculations together into a DSP program then large gains are possible. The limited number range of the DSP has to be considered here as well; by proper scaling most calculations are possible however, with respect to ordinarily needed precision.
2. How to write a DSP program
You will need some documentation about the DSP and how it is tied into the Falcon for this. This is contained in the developers kit, but it has been covered very well in the German and Dutch Atari magazines also. For sound modification (a major application for the Falcon DSP) some knowledge about the Falcon sound system is needed as well. Sources: developers kit, magazines. The 'DSP56000/DSP56001 Digital Signal Processor User's Manual' from Motorola is almost a 'must-have' for a serious DSP programmer. It has a full coverage of the instruction set (not included in the developers kit!) and a lot of hardware ins and outs. It is probably for sale (around dfl 100 ?) at the importers of Motorola: in Holland that is Diode, or EBV Elektronik in Maarssenbroek. DSP Memory map: a memory map for the Falcon DSP is included in the developers kit. There are a few minor mistakes in this and therefore a revised version is included in this document.
Code always starts with a jump instruction at P:$0. The rest of memory up to P:$40 is reserved for interrupt vectors, so don't put your program there... To put the assembler to full use its documentation can be studied, revealing macro assembly, conditional assembly, buffer allocation etc.
Some example programs are included in this archive; you can study them to see how to put data in memory, how to organize code in memory, and how to use the host interface.
3. How to assemble a DSP program
I use the Atari Development kit for the Falcon myself; this kit contains Motorola's DSP Assembler and linker. A public domain assembler exists, but I haven't been able to put that one to good use yet (didn't try much either). After assembling you need to convert the code into the Atari-specific .LOD format; this is done with the CLDLOD.TTP utility program. This program needs output redirection to get a useful .LOD file (!). Therefore your commandline interpreter has to support output redirection; most of them don't. After this you can trash the .CLD file.
To do this automatically you can write a batch file along these lines:
cd d:\pc\projects\mandel
\dsp\asm56000.ttp -a -b -l -our mndl_56k.asm
\dsp\cldlod.ttp mndl_56k.cld >mndl_56k.lod
rm mndl_56k.cld
I use a Gulam shell that I found in an GNU C++ archive. Batch files for each project are stored in the project directory and copied to the Gulam directory when needed. From my Pure C compiler I then 'E'xecute Gulam and the batchfile is processed automatically. The only drawback is that Gulam can't execute an 'exit' from a batchfile, so this command has to be typed in manually. Also, Pure C can't take the status of the DSP file into account when performing a 'make', so you will have to look after that yourself.
To be sure the correct batchfile has been copied to the Gulam directory a line consisting of:
MSG 'Assembling ...'
is included in all my sourcefiles. The assembler normally does not give a clue about what file is processed, but this will. The switches in the batchfile mean 'absolute code' (.CLD output, no linking necessary), 'make object file' (what else), 'print a listing' (to disk) and 'signal unresolved references' (those are not tolerable in a project of a single assembly file ...).
4. How to run a DSP program
Use the DSP XBIOS functions to load the .LOD file into the DSP and to provide it with parameters. The developers kit includes new XBIOS function definitions that should be used to approach the Falcon DSP. Clever written programs won't be affected much by the overhead these routines put upon them. An example program is included in this archive that shows how to use these routines ideally and how to avoid some pitfalls. The latest Pure C version (1.1) contains bindings and online help for these functions.
Be careful with the host interface routines: they will probably hang the computer if the program in the DSP and your host routine don't agree on the number of parameters that should be transferred either way.
5. How to debug a DSP program
The developers kit provides you with DSPDebug, which gives you ample opportunity to examine your code's behaviour. If you don't have it, you will have to stick to putting debugging statements into your program and outputting values to the host interface. Some common sense and persistence is useful here as well...
6. How to write fast DSP programs
First of all: get your program to work well! Don't optimize it before you have validated your algorithm. You should think about how it can be optimized beforehand, though.
Very important: put all of your code into internal program RAM if possible to avoid memory wait states. Most programs will fit in the 512 word internal program RAM. Although the external RAM is zero waitstate, there is a penalty for accessing it twice in a single instruction, because there is only one external data bus. The internal buses are all separate however, so the instruction MACR in the loop:
[ ORG P:$40 ; fetch instructions from internal memory
MOVE #$00,R0 ; point into internal X-memory
MOVE #$00,R4 ; point into internal Y-memory ]
CLR A X:(R0)+,X0 Y:(R4)+,Y0 ; init loop
DO #256,loop_end
MACR X0,Y0,A X:(R0)+,X0 Y:(R4)+,Y0 ; calc sum of products
loop_end:
executes in the minimum 2 clock cycles! This means a burst rate of 48 MIPS (or 80 MIPS if you count multiply, accumulate, round separately :-) ). One of the memory references (P, X, Y) might even have been to external memory without affecting execution speed. The powerful instruction set and the cyclic buffers of the DSP56001 give it even more power than other DSP's.
In the DSP, you have to watch for opportunities to do parallel moves. Use the circular buffer modes to the full to avoid register loads and tests. Use the DO instruction for the same purpose; decrementing and testing registers is not meant for that and therefore awkward. Use short jumps and short immediate register loads as much as possible. Between host and DSP the amount of data transferred should be minimized. If possible have the host do something useful (calculating new parameters, drawing on the screen, providing a user interface) instead of busy-waiting for the DSP, to make full use of the multiprocessor system.
If large data transfers are necessary, there are a few ways to enable these. Basically there are two connections between the DSP and main memory, the host interface and the SSI (through the sound multiplexer chip). The SSI path is limited to about 1 MB per second, which can be accomplished by setting the master clock of the multiplexer to 32 MHz and using 4 stereo 16 bit channels (4 * 2 * 2 bytes per frame). The speed of host interface communication is mainly dependent on the host. Normally this is done by the host processor either by OS calls or from an interrupt routine. By programming the Blitter as a DMA device a much higher transfer rate could be accomplished at the cost of non-portable low level programming (the DMA chip cannot be programmed to transfer data from or to the DSP). Speed of main memory will always limit the transfer rate to less than 25 MB/sec (...), but the DSP cannot provide or accept more than 24 MB per second at the very most. A loop like:
DO #N_PARAMS,get_loop ; loop for N_PARAMS parameters
wait_data: JCLR #0,X:<<HSR,wait_data ; busy wait for host data
MOVEP X:<<HRX,X:(R0)+ ; store host data in X memory
get_loop:
would be the fastest I could provide, and would execute in 6 + 4 = 10 cycles, and therefore has a burst rate of 3.2 MW/s, or 9.6 MB/s at most. Data has to be provided to/accepted from the 8-bit wide host port at 9.6 MB/s for this; it is very well possible that neither the main processor nor the Blitter is up to that (I have no data on that).
8. How to convert data between the DSP and the host processor
Short and long integer data are readily transferred between the DSP and the host processor by means of the OS system calls. For long integers the range has to be limited to between 0x800000 and 0x7FFFFF; floating point numbers between -1 and 1 can be multiplied by 0x800000 to copy them into the DSP. This covers the 24- bit data format of the DSP very well; the double width 48-bit format however is more difficult to accommodate. The common floating point format with 64 bits mantissa is wide enough, but conversion is trickier. For data transfer to and from the DSP some routines have been written, both for use without and with FPU, which are included in this archive (DBL2DSP.S and testprogram TDBL2DSP.C, CPU and FPU versions through different project files). These are provided in this archive; to compile them on your system some pathnames may have to be adjusted.
9. DSP memory map
P X Y Address
+--------------+ +--------------+ +--------------+ $FFFF
: : | Int. I/O | | (Ext. I/O) | (64)
: : +--------------+ +--------------+ $FFC0
: : : : : :
: : : : : :
: : : : : :
: : : : : :
/ / / / / /
: : : : : :
: : : : : :
: : : : : :
: : : : : :
: : : : : :
+--------------+ +--------------+ +--------------+ $8000
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | +--------------+ +--------------+ $4000
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
+--------------+ +--------------+ +--------------+ $0200
| Internal | | Log table or | | Sine table or| (256)
| | | external mem.| | external mem.|
| program | +--------------+ +--------------+ $0100
| | | Internal X | | Internal Y | (256)
| memory | | memory | | memory |
+--------------+ +--------------+ +--------------+ $0000
Notice the mirrors in X- and Y-memory and the relationship between external P-and X/Y-memory. The mirrors are there for a purpose: the first 256 (or 512, if dataroms enabled) bytes in the high mirror are unique; in the low mirror they are masked out by the internal data memories. The high mirrors are very convenient for large circular buffers, which have to start on an even larger address boundary (e.g. a 15K circular buffer can only start on $4000, $8000, $C000 if you don't want to include internal X/Y memory).
There is a continuous 16K block available at $4000 in both X- and Y-memories, so the total amount of memory available is 16k (X) + 16k (Y) + 512 (P) + 256 (X) + 256 (Y) words = 33 k words = 99 kbyte.
Send any questions (only about this program!) and comments to:
Robert Jan Ridder
De Sikkel 37
5384 HT Heesch
The Netherlands
or leave a message to me on Atari Benelux BBS (Holland ..-31-3473-77584). Job offers from the United States appreciated.
text provided by Earx / FUN