Copy Link
Add to Bookmark
Report
Xine - issue #5 - Phile 114
Ú-----------------------------¿
| Xine - issue #5 - Phile 114 |
À-----------------------------Ù
C to assembly, language point of view
-------------------------------------
Intro
-------
Many of you think c is useless for viruswriter. Many poeple wich start
vxing handle better c language than assembly one. But the true reason where C
is locked for virus is the dependency of the compiler, wich block some special
manipulation. But the use of a C compiler is not totally senseless. If the
optimization is not perfect, it can build a code very proach and similar to
the assembly one, if you follow some rules. Of course, you will loose a few
size, but you will gain by stability, portability, and finally by coding much
fastly your nice littles routines, that you may customize once in assembly.
The first step, code parser, analyzer
---------------------------------------
a) syntax analyzer
------------------
the major improvent in hll is the ability of analysing syntaxic expression, and
evaluate them, things that the processor can't do. Syntax analyzer is quite
common to all language, wich result in fact in a series of mathematical
operation of variable and known value. The best way I did to make a syntax
analyzer was using a binary tree and recursivity.
so, lets assume the simple case: b+c +
/ \
B C
for this case, you need to have: b+c*d +
/ \
B *
/ \
C D
the second part of the plus operation has been changed, and it's now an
expression himself, for such operation, you have to take care of operator
priority, the * is prior to the +, then you have to evaluate c*d before
making the +.
now to see why recusivity and binary tree are usefull, imagine you have
(b+c*d)+a , a part of the expression three will stay the same, but in the
upper of it, you have to add a "+a", after you then try (b+c*d)-(a*b+d)
note, it's not important here if it's B or a value, it's just symbolic.
b) language needings
--------------------
You need also to know what type of data you are manipulating, in fact, their
"class" are different, and this have many repercussion later in compiling.
Normally, assembler programmer handle just nice little integers, wich is quite
suffisant. Boolean are also integers, but the majority of the test are to see
if it's zero or not. and not using the complex processor flags, wich seems
better but much problematic for boolean operation as exemple.
Each language have a same squeletton, a language handle test, jump, function
calls, and boucle. For C, it's easy, if true (non zero as seen) , execute next
instruction. The next instruction can be a group of instruction, and then you
introduce instruction block.
Instruction block have also to be seen as a directory with a main directory.
classical instruction are subdirectory of instruction. Then for that you have
to handle the text file a certain way, line per line. So you are forced to
reconstruct the main body.
c) memory, variable, the magic key ?
------------------------------------
The only extensible zone for the processor is the memory, so, the other
extensible number variable make perfect the linking between memory and
constants.
Now we have two type of data, the global and local ones, global have to be
stored with the general program, when local have to be allocated, in C,
the compiler take care of allocation, deallocate when the block finish, wich
define the need of a simple and fast memory allocator/deallocator.
for that, the stack is quite perfect, as often seen in program, the mov ebp,esp
save the actual stack state. The job is evaluating for each constant a dynamic
place in the stack.
C have also defined pregenerated new and delete wich are just typical
malloc(sizeof), not to be implemented in the dynamic calculation
The second step, the meta-code
--------------------------------
Translating an expression to the processor is not as easy as you can think,
because there's today two style of processor, the accumulators based one and
the stack based one, ie CPU/FPU. Often the accumulator cpu can act with
some manipulation as FPU, but it make larger the code, memory and data
manipulation.
First, for CPU, you work often with a working register, the accumulator one on
intels as exemple.
lets take our first exemple b + c*d, so, analysing the tree, we look that way
load c
mul d
save #1
load b
add #1
then, now the accumulator equal the expression, watch that here the expression
have just one operator, because we have no predefined processor, then we have
assume we work with a very restricted one.
there's three type of opperation, load save and operations. The problem is
assuming the first save/add
now let look on a stack operation how it will work
push b
push d
push c
mul
plus
The principe of stack processor is that the result is allways push on top of
the stack, wich equal the accumulator of the 1st type.
Third step, translating the code
--------------------------------
Translating the code to assembly output is much easy, because meta code are a
series of macro, these macro can be directly changed into small groups of
instruction, lets see:
load c -> mov eax,[esp+c]
mul d -> mov ecx,[esp+d], xor edx,edx, mul ecx
save #1 -> mov [esp+#1],eax // # are dynamically alocated
load b -> mov eax,[esp+b]
add #1 -> add eax,[esp+#1]
and for the second one, lets see
push b -> fld qdword ptr [esp+b]
push d -> fld qdword ptr [esp+d]
push c -> fld qdword ptr [esp+c]
mul -> fmul, fxchg, fstp *
plus -> fadd, fxchg, fstp *
we fstp * mean we send ST(0) to trash, can be as exemple sub esp,8 , fstp [esp],
add esp,8
Conclusion
----------
I hope you found that interresting, I hope to see multiplatform viruses soon :)