Biweekly Malware Challenge #5.1: Emulating PrivateLoader Strings


Aim

This challenge was technically split into two parts, but the main focus was on the second: understanding the network protocol PrivateLoader uses to retrieve the core module. The first part was to decrypt the strings, revealing the C2 addresses, which would assist in locating the networking functionality. While the approach I took doesn’t decrypt every string within the binary, it does decrypt the strings needed to begin analysing the network protocol.

So, how did I approach this challenge?

Approach

Rather than putting together a hacky IDA Python string decryption function, I decided to put together a hacky emulation script. Firstly, I haven’t touched the Unicorn Engine before, and secondly, it seems like a pretty useful thing to know, especially for dealing with malware that uses stack strings with slightly differing algorithms. As mentioned, the script we develop doesn’t decrypt all strings within the binary, but it does decrypt the API names used within network communications, allowing us to start reversing the network protocol.

Analysis

The first thing we spot within the WinMain function is some kind of stack string setup: encoded/encrypted DWORDs are moved onto the stack, then decrypted using the pxor instruction together with xmm0, a 128-bit register. Both are fairly distinctive. After the pxor, the string is passed into LoadLibraryA, which tells us that while API hashing isn’t used, string encryption is used to obscure the imported API.
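To make the operation concrete, here is a minimal sketch of what a single pxor does to one 16 byte chunk of a stack string. The key and the plaintext are made up for illustration and are not taken from the sample.

```python
def pxor(a: bytes, b: bytes) -> bytes:
    """XOR two 16-byte blocks, mirroring pxor on a 128-bit register."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("pxor operates on 128-bit (16-byte) operands")
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical key and plaintext, chosen only for illustration.
key = bytes(range(0x10, 0x20))
plaintext = b"wininet.dll".ljust(16, b"\x00")

encrypted = pxor(plaintext, key)   # the kind of data moved onto the stack
decrypted = pxor(encrypted, key)   # the result left in xmm0 after pxor

print(decrypted.rstrip(b"\x00").decode())   # wininet.dll
```

Because XOR is its own inverse, the same instruction both hides and recovers the string, and anything longer than 16 bytes needs more than one pxor.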

Skimming through some of the functions within the sample, it is clear this functionality occurs throughout. There are some slight variations, however, likely because xmm0 only holds 16 bytes at a time. Certain blocks contain two sets of pxor instructions, so our emulation approach might not be as reliable in these cases.

There are several options when it comes to emulation, such as Flare-Emu which is a wrapper for Unicorn with some additional features (as well as being a disassembler plugin), Qiling, which is built on top of Unicorn but provides handling for more than just CPU instructions, or Dumpulator which can be used to emulate code within minidump files. In this case, we will be using the core Unicorn Engine along with the Capstone Engine, to learn how the internals function.

The nice thing about Unicorn is we can emulate blocks of code, rather than executing an entire binary file, and as our string decryption functions all work with CPU operations rather than API calls, we should be able to decrypt the bulk of the strings.

Our script will be made up of three core functions: the main function, the emulate function, and the convert function (used to convert a file offset to a virtual address).

The first requirement is to get the script to locate all instances of string decryption. To do that, we will leverage the fairly distinctive set of operations performed when setting up the stack prior to the pxor instructions, and create a regular expression to match on those operations. We could develop one to locate every single pxor, however we want the emulator to emulate the actual stack setup rather than just the pxor, as without it the emulation will result in either junk data or some sort of error.

We will, however, first search for the pxor instruction itself, so that we can compare the blocks around each instance and find the bytes they have in common. This is fairly simple, as in every instance pxor is performed with xmm0 as the first operand and an ebp-relative stack location as the second. Therefore, only two bytes change within the 8 byte instruction, so a quick search in IDA for the following:

66 0F EF 85 ?? ?? FF FF

reveals 105 pxor instructions within the loader.
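The same byte pattern can be searched for outside IDA. The sketch below runs it as a Python regular expression over a small synthetic buffer. The buffer and the helper name find_pxor_offsets are assumptions for illustration, and a scan of the raw file is not guaranteed to give the same count as IDA's search of the loaded database.

```python
import re

# 66 0F EF 85 ?? ?? FF FF  ->  pxor xmm0, [ebp + disp32], small negative offset
# re.DOTALL is needed so that "." also matches the byte 0x0A.
PXOR_PATTERN = re.compile(rb"\x66\x0F\xEF\x85..\xFF\xFF", re.DOTALL)

def find_pxor_offsets(data: bytes) -> list:
    """Return the offset of every match of the pxor pattern in data."""
    return [m.start() for m in PXOR_PATTERN.finditer(data)]

# Synthetic buffer: two pxor instructions separated by filler bytes.
sample = (
    b"\x90\x90"
    + bytes.fromhex("660FEF8558FEFFFF")
    + b"\xCC"
    + bytes.fromhex("660FEF850AFDFFFF")   # displacement contains 0x0A
)

print(find_pxor_offsets(sample))   # [2, 11]
```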

Searching through a few of these blocks, I noticed there were a few similarities near the “entry point” of the decryption block, in that ecx, edx, and eax would be zeroed out while stack variables were set up:

xor    ecx, ecx
mov    [ebp-1A8h], cl
xor    edx, edx
mov    [ebp-1A9h], dl
xor    eax, eax
mov    [ebp-1AAh], al

This pattern also appeared in the majority of other stack string decryption blocks, although it would differ slightly, for example by swapping the registers around so that eax is zeroed first, then ecx, and finally edx.

The actual opcodes for this set of instructions are also fairly simple, and are as follows:

33 C9
88 8D 58 FE FF FF
33 D2
88 95 57 FE FF FF
33 C0
88 85 56 FE FF FF
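As a quick sanity check, the mov encodings above can be decoded by hand: the second byte is the ModRM byte, which selects the 8-bit source register and the [ebp + disp32] addressing form, and the last four bytes are a little-endian signed displacement. The helper below is a small sketch written for this post, not a general x86 decoder.

```python
import struct

REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]

def decode_mov_ebp_disp32(encoded: bytes):
    """Decode `88 /r` with an [ebp + disp32] operand into (register, disp)."""
    if len(encoded) != 6 or encoded[0] != 0x88:
        raise ValueError("expected a 6-byte instruction starting with 0x88")
    modrm = encoded[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm != 0b101:
        raise ValueError("ModRM byte does not describe [ebp + disp32]")
    (disp,) = struct.unpack("<i", encoded[2:])   # signed, little-endian
    return REG8[reg], disp

for raw in ("888D58FEFFFF", "889557FEFFFF", "888556FEFFFF"):
    reg, disp = decode_mov_ebp_disp32(bytes.fromhex(raw))
    print(f"mov [ebp{disp:+#x}], {reg}")
```

The trailing FF FF in each instruction is simply the upper half of a small negative displacement, which is why those two bytes stay constant across blocks.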

In another instance, the opcodes were as follows:

33 D2
88 95 C7 FC FF FF
33 C0
88 85 C6 FC FF FF
33 C9
88 8D C5 FC FF FF

This clearly indicates which opcodes are specific to the instruction, and which are dependent on the operands used; converting the series of bytes to an IDA searchable byte pattern, we get the following:

33 ??
88 ?? ?? ?? FF FF
33 ??
88 ?? ?? ?? FF FF
33 ??
88 ?? ?? ?? FF FF

Running this search within IDA, we get a total of 78 results – a lot less than the previous 105.
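The wildcard pattern above was worked out by eye, but the same comparison is easy to script. The following sketch (derive_pattern is a name made up for this example) compares the two opcode sequences byte by byte and emits ?? wherever they differ.

```python
def derive_pattern(*samples: bytes) -> str:
    """Build an IDA-style byte pattern, using ?? where the samples disagree."""
    if len({len(s) for s in samples}) != 1:
        raise ValueError("all samples must have the same length")
    parts = []
    for column in zip(*samples):
        # A byte is kept only if every sample has the same value there.
        parts.append(f"{column[0]:02X}" if len(set(column)) == 1 else "??")
    return " ".join(parts)

first = bytes.fromhex("33C9888D58FEFFFF" "33D2889557FEFFFF" "33C0888556FEFFFF")
second = bytes.fromhex("33D28895C7FCFFFF" "33C08885C6FCFFFF" "33C9888DC5FCFFFF")

print(derive_pattern(first, second))
```

With only two samples the result is only as general as those samples. Feeding in more blocks would widen the pattern wherever a further byte turns out to vary.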

Additionally, running a quick regex search with Python strangely reveals even fewer: a total of 63 matches. Taking a look through some of the non-matching functions, I noticed there was another possible “entry” block format, which can be seen below:

xor    eax, eax
mov    [esp-2Ah], al
xor    ecx, ecx
mov    [esp-2Bh], cl
xor    edx, edx
mov    [esp-2Ch], dl

Instead of using the ebp register, it was utilising the esp register, which took up 3 bytes instead of 6. I don’t believe IDA supports searching byte patterns with variable lengths, so I converted it directly to a regular expression.

\x33.\x88.{2,5}\x33.\x88.{2,5}\x33.

Running this against the binary, it picked up a total of 75 matches, and while that did not include every string within the binary, it is enough to start developing a script.
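One detail worth keeping in mind when running byte regexes in Python: by default "." does not match the byte 0x0A, so any block containing that byte in a wildcard position is silently skipped unless re.DOTALL is passed. This may contribute to Python reporting fewer hits than IDA, although that has not been checked against the sample. The sketch below uses three hand-assembled byte sequences, none lifted from the binary, to show the variable-length pattern matching both forms and the effect of the flag.

```python
import re

ENTRY = rb"\x33.\x88.{2,5}\x33.\x88.{2,5}\x33."

plain = re.compile(ENTRY)
dotall = re.compile(ENTRY, re.DOTALL)

# Hand-assembled examples (assumptions, not bytes from the sample):
blocks = {
    # 6-byte ebp + disp32 form
    "ebp": bytes.fromhex("33C9888D58FEFFFF33D2889557FEFFFF33C0"),
    # shorter esp-relative form with 8-bit displacements
    "esp": bytes.fromhex("33C0884424D633C9884C24D533D2"),
    # ebp form whose displacement happens to contain 0x0A
    "newline": bytes.fromhex("33C9888D0AFEFFFF33D2889509FEFFFF33C0"),
}

for name, blob in blocks.items():
    print(name, bool(plain.search(blob)), bool(dotall.search(blob)))
```

Only the third block behaves differently: it is missed by the default pattern and found once re.DOTALL is set.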

When it came to cutting down each string decryption block to only execute the required instructions (primarily all the necessary pxor instructions and possible mov/lea instructions to read from memory), I tried a few different approaches, such as:

  • setting up a hook to disassemble each line as it was emulated, checking if the instruction was lea or pxor, and if so setting a variable so that the hook would read from the register in question
  • using regex to locate the last pxor instruction within a selected region of memory
  • emulating the entire block and reading the stack after completion

Unfortunately these came with their own issues, such as an added requirement to parse and reformat strings, stripping out unnecessary bytes, and so on. Instead, I decided to take the following path:

  • Using the matches we discovered, take a large region of memory from that start point, e.g. 1000 bytes
  • Disassemble each assembly instruction checking for push or call instructions
    • For each push instruction, check if a register is pushed to the stack – this is likely the decrypted string being pushed to an API call
    • If we encounter a call instruction, iterate back through the previously disassembled instructions and look for a lea or movups instruction, before parsing out the destination register, which could potentially contain the decrypted string
    • Once one of the above instructions has been located, set the cut-off for the code block; the call instructions will not execute properly within Unicorn, and the push instructions should lead us to the right place
  • Pass the truncated code block and destination register to the emulation function, which should execute the code block, and once completed, attempt to read from the specified destination register before displaying (hopefully) the decrypted string

This does have issues with certain code blocks: in some cases a call to aulldiv or aullmul is thrown into the code, which cannot be executed without implementing those routines for Unicorn. It does, however, decrypt nearly all of the strings we need.

Setting up this particular function is pretty simple, though not the cleanest Python code in the world.

def main():

    md = Cs(CS_ARCH_X86, CS_MODE_32)
    pe = pefile.PE("sample.bin")

    with open("sample.bin", "rb") as f:
        binaryData = f.read()

    #entry_block = rb'\x33.\x88...\xFF\xFF\x33.\x88...\xFF\xFF\x33'
    #entry_block = rb'\x8A.{5}\x88.{5}\x8A'
    entry_block = rb'\x33.\x88.{2,5}\x33.\x88.{2,5}\x33.'

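    entryOffsets = []

    # re.DOTALL lets "." match 0x0A too; without it, blocks containing that byte are missed
    for m in re.compile(entry_block, re.DOTALL).finditer(binaryData):
        entryOffsets.append(m.start())

    for offset in entryOffsets:

        # virtual address of the block as a hex string, for cross referencing in IDA
        fileOffset = hex(convertOffsetToVA(pe, offset))
        dataBlock = binaryData[offset:offset + 4000]

        # reset per block so a failed parse cannot reuse the previous block's values
        dataBlockEnd = None
        destRegister = None

        lastInstructions = []
        for j in md.disasm(dataBlock, 0):
            
            if j.mnemonic == "push":
                if j.op_str not in ["eax", "edx", "ebx", "ecx"]:
                    lastInstructions.append(j)
                    continue
                else:
                    # a register push is 1 byte, so the push itself stays in the block
                    dataBlockEnd = j.address + 1
                    destRegister = j.op_str
                    break
        
            if j.mnemonic == "call":

                for insn in reversed(lastInstructions):

                    if insn.mnemonic == "lea":
                        # include the lea so the destination register is actually loaded
                        dataBlockEnd = insn.address + insn.size
                        destRegister = insn.op_str.split(",")[0]
                        break

                    elif insn.mnemonic == "movups" and "[" in insn.op_str:
                        # pretty hacky method: take whatever sits between the brackets
                        dataBlockEnd = insn.address + insn.size
                        destRegister = insn.op_str.split("[")[1].split("]")[0]
                        break
                         
                break

            lastInstructions.append(j)

        # nothing usable was found in this block
        if dataBlockEnd is None or destRegister is None:
            continue

        dataBlock = dataBlock[:dataBlockEnd]

        emulateBlock(fileOffset, dataBlock, destRegister)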


if __name__ == "__main__":
    main()
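The movups branch above takes everything between the square brackets, which only yields a bare register name when the operand has no displacement. A slightly more careful parse could split the operand into a base register and a signed displacement, as in the sketch below. It assumes Capstone's Intel-syntax operand strings look like "xmmword ptr [ebp - 0x30], xmm0", and parse_mem_operand is a hypothetical helper, not part of the original script.

```python
import re

# Matches "[reg]", "[reg + disp]" or "[reg - disp]"; anything more complex
# (for example an index register) is deliberately rejected.
MEM_OPERAND = re.compile(
    r"\[\s*(?P<base>[a-z]{2,3})\s*"
    r"(?:(?P<sign>[+-])\s*(?P<disp>0x[0-9a-f]+|\d+))?\s*\]"
)

def parse_mem_operand(op_str: str):
    """Return (base_register, displacement), or None if no simple operand."""
    m = MEM_OPERAND.search(op_str)
    if m is None:
        return None
    disp = int(m.group("disp"), 0) if m.group("disp") else 0
    if m.group("sign") == "-":
        disp = -disp
    return m.group("base"), disp

print(parse_mem_operand("xmmword ptr [ebp - 0x30], xmm0"))   # ('ebp', -48)
print(parse_mem_operand("xmm0, xmmword ptr [eax]"))          # ('eax', 0)
```

With the base and displacement separated, the emulator could read the register and add the displacement before calling mem_read, rather than relying on the operand being a plain register.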

With the main function defined, it’s time to move on to the convert function, which uses the pefile module to walk the sections of the binary, comparing the provided offset against each section’s PointerToRawData until it finds the section the offset belongs to. From there, the file offset is added to the section’s VirtualAddress and the section’s PointerToRawData is subtracted, giving an RVA to which the image base is added. The reason we’re doing this is to cross reference the decrypted strings within IDA, so we can copy an address, jump to it, and rename any variables.

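def convertOffsetToVA(pe, fileOffset):

    # find the section whose raw data contains the file offset
    for section in pe.sections:
        sectionStart = section.PointerToRawData
        if sectionStart <= fileOffset < sectionStart + section.SizeOfRawData:
            offsetVA = fileOffset + section.VirtualAddress - sectionStart
            return offsetVA + pe.OPTIONAL_HEADER.ImageBase

    # no section matched (headers or overlay); only meaningful for header offsets
    return fileOffset + pe.OPTIONAL_HEADER.ImageBase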
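The arithmetic is easier to follow with concrete numbers. The sketch below applies the same calculation to a made-up section table, so it runs without pefile; the section values and image base are assumptions chosen for illustration.

```python
from typing import NamedTuple, Optional

class Section(NamedTuple):
    name: str
    virtual_address: int       # RVA of the section once loaded
    pointer_to_raw_data: int   # file offset of the section's data
    size_of_raw_data: int

def offset_to_va(sections, image_base: int, file_offset: int) -> Optional[int]:
    """Translate a file offset into a virtual address, or None if unmapped."""
    for s in sections:
        start = s.pointer_to_raw_data
        if start <= file_offset < start + s.size_of_raw_data:
            return image_base + s.virtual_address + (file_offset - start)
    return None

# Hypothetical layout, not taken from the sample.
sections = [
    Section(".text", 0x1000, 0x400, 0x5000),
    Section(".rdata", 0x6000, 0x5400, 0x1000),
]

print(hex(offset_to_va(sections, 0x400000, 0x1234)))   # 0x401e34
```

An offset that sits exactly on a section's PointerToRawData belongs to that section rather than the one before it, which is why the comparison is written as a range check.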

Finally, all that is left is the emulate function. I found this to be surprisingly simple to set up, considering the topic of binary emulation is a fairly complex one. Unicorn provides a great sample file to follow along with, and a post by Jason Reaves is also extremely helpful, covering a similar method to decrypt BazarLoader strings.

We need to first allocate 2 regions of memory in this case, one for the main code block, and one for the stack, which we can do using mem_map. As the sample decrypts strings using the stack, I created a fairly large region of memory, to prevent any possible issues arising during emulation.

ADDRESS = 0x40000000
STACK_ADDRESS = 0x90000

mu = Uc(UC_ARCH_X86, UC_MODE_32)

# create memory for main code block
mu.mem_map(ADDRESS, 4 * 1024 * 1024)

# create memory for stack
mu.mem_map(STACK_ADDRESS, 4096*15)

From there, we want to write the code block to memory using mem_write, before setting the esp and ebp registers to point to the stack addresses, which can be done using reg_write. At this point, we’re ready to start emulating, so we will call emu_start, making sure to encapsulate it within a try-except block, to display any possible errors.

mu.mem_write(ADDRESS, dataBlock)

# setup stack variables
mu.reg_write(UC_X86_REG_ESP, STACK_ADDRESS + 4096*3)
mu.reg_write(UC_X86_REG_EBP, STACK_ADDRESS + 4096*3)

try:    
    # begin emulation
    mu.emu_start(ADDRESS, ADDRESS+len(dataBlock))
except UcError as e:
    #print (e)
    pass
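Before relying on the layout above, it is worth checking the numbers. Unicorn's mem_map expects both the address and the size to be 4 KB aligned, and the stack locals seen earlier (down to ebp-339h) need to land inside the mapped stack. The sketch below does that check with plain arithmetic, so it runs without Unicorn; the helper names are made up for this example.

```python
PAGE_SIZE = 0x1000

STACK_ADDRESS = 0x90000
STACK_SIZE = 4096 * 15
FRAME = STACK_ADDRESS + 4096 * 3   # value written to both esp and ebp

def is_page_aligned(value: int) -> bool:
    return value % PAGE_SIZE == 0

def in_stack(addr: int, length: int = 1) -> bool:
    """True if [addr, addr + length) lies inside the mapped stack region."""
    return STACK_ADDRESS <= addr and addr + length <= STACK_ADDRESS + STACK_SIZE

# Displacements taken from the two opcode listings earlier in the post.
for disp in (-0x1A8, -0x1AA, -0x339):
    print(hex(FRAME + disp), in_stack(FRAME + disp))

print(is_page_aligned(STACK_ADDRESS), is_page_aligned(STACK_SIZE))
print(hex(FRAME - STACK_ADDRESS))   # room available below ebp for locals
```

With these values there are 0x3000 bytes below the frame pointer, so a block that used locals deeper than that would fault during emulation and need a larger offset.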

Once emulation has ended, we want to read from the destination register we discovered through parsing. This can be done using reg_read. As the string is likely longer than 4 bytes, the register is going to be a pointer to a region of memory, possibly on the stack – so, we will use mem_read to read 100 bytes from the region of memory pointed to by the register. Make sure this is also in a try-except block, due to a number of issues that may arise such as invalid memory pointers.

register_lookup = {
    "eax" : UC_X86_REG_EAX,
    "edx" : UC_X86_REG_EDX,
    "ecx" : UC_X86_REG_ECX,
    "ebx" : UC_X86_REG_EBX,
    "ebp" : UC_X86_REG_EBP
}

# read from destination register - should be a pointer
r_reg = mu.reg_read(register_lookup[endRegister])

try:
    # read from memory using pointer from r_reg
    stringData = mu.mem_read(r_reg, 100)

    print(fileOffset + " " + stringData.decode())

except (UcError, UnicodeDecodeError):
    # unmapped pointer or non-text bytes: skip this block
    pass
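Reading a fixed 100 bytes means the buffer usually holds the string, a terminator, and then whatever else was on the stack. One possible clean-up step, sketched below, is to cut at the first NUL and discard anything that is not printable ASCII. The function name and the minimum length are assumptions for illustration and are not part of the script in this post.

```python
import string

# Printable ASCII, minus vertical tab and form feed.
PRINTABLE = set(string.printable.encode()) - set(b"\x0b\x0c")

def extract_cstring(buf, min_len: int = 3):
    """Return the NUL-terminated ASCII string at the start of buf, or None."""
    raw = bytes(buf).split(b"\x00", 1)[0]
    if len(raw) < min_len or any(b not in PRINTABLE for b in raw):
        return None
    return raw.decode("ascii")

# Simulated mem_read result: a string, a terminator, then leftover stack data.
buffer = bytearray(b"InternetReadFile\x00\xde\xad\xbe\xef" + bytes(79))

print(extract_cstring(buffer))            # InternetReadFile
print(extract_cstring(b"\xff\xfe\x00"))   # None
```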

At this point all of the scripting has been completed, so we can put the code together and execute it, to see the results. The full code can be seen at the end of this post.

Upon executing the script, we can see clearly decrypted strings, including mostly API calls, but also some malware specific strings such as the HOST string, the IP address 85[.]202[.]169[.]116, as well as a user agent. The most important strings here though are the networking API calls, so HttpSendRequestA, InternetReadFile, WinHttpOpen, and so on, as this will help us in locating the communications routine within the sample.

From here, there are a few possible options on further development, such as linking up with IDA or another disassembler to automatically rename any string pointers, add comments, or just in general improve the functionality. For now, this should be fine, as we’ve managed to get a step further to reversing the network protocol – so that brings an end to this part of the challenge!

Within the next part, we will be continuing our analysis, specifically focusing on the network protocol to identify how the main payload is retrieved.

In the meantime, feel free to share your write-up within the Discord channel or via your own blog post!

import re
import pefile
from unicorn import *
from capstone import *
from unicorn.x86_const import *

register_lookup = {
    "eax" : UC_X86_REG_EAX,
    "edx" : UC_X86_REG_EDX,
    "ecx" : UC_X86_REG_ECX,
    "ebx" : UC_X86_REG_EBX,
    "ebp" : UC_X86_REG_EBP
}

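def convertOffsetToVA(pe, fileOffset):

    # find the section whose raw data contains the file offset
    for section in pe.sections:
        sectionStart = section.PointerToRawData
        if sectionStart <= fileOffset < sectionStart + section.SizeOfRawData:
            offsetVA = fileOffset + section.VirtualAddress - sectionStart
            return offsetVA + pe.OPTIONAL_HEADER.ImageBase

    # no section matched (headers or overlay); only meaningful for header offsets
    return fileOffset + pe.OPTIONAL_HEADER.ImageBase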


def emulateBlock(fileOffset, dataBlock, endRegister):

    ADDRESS = 0x40000000
    STACK_ADDRESS = 0x90000

    mu = Uc(UC_ARCH_X86, UC_MODE_32)

    # create memory for main code block
    mu.mem_map(ADDRESS, 4 * 1024 * 1024)

    # create memory for stack
    mu.mem_map(STACK_ADDRESS, 4096*15)

    # write code block to memory
    mu.mem_write(ADDRESS, dataBlock)

    # setup stack variables
    mu.reg_write(UC_X86_REG_ESP, STACK_ADDRESS + 4096*3)
    mu.reg_write(UC_X86_REG_EBP, STACK_ADDRESS + 4096*3)

    try:    
        # begin emulation
        mu.emu_start(ADDRESS, ADDRESS+len(dataBlock))
    except UcError as e:
        #print (e)
        pass

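    # the parsed operand may not be a register we know how to read
    if endRegister not in register_lookup:
        return

    # read from destination register - should be a pointer
    r_reg = mu.reg_read(register_lookup[endRegister])

    try:
        # read from memory using pointer from r_reg
        stringData = mu.mem_read(r_reg, 100)
        print(fileOffset + " " + stringData.decode())
    except (UcError, UnicodeDecodeError):
        # unmapped pointer or non-text bytes: skip this block
        pass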

def main():

    md = Cs(CS_ARCH_X86, CS_MODE_32)
    pe = pefile.PE("sample.bin")

    with open("sample.bin", "rb") as f:
        binaryData = f.read()

    #entry_block = rb'\x33.\x88...\xFF\xFF\x33.\x88...\xFF\xFF\x33'
    #entry_block = rb'\x8A.{5}\x88.{5}\x8A'
    entry_block = rb'\x33.\x88.{2,5}\x33.\x88.{2,5}\x33.'

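    entryOffsets = []

    # re.DOTALL lets "." match 0x0A too; without it, blocks containing that byte are missed
    for m in re.compile(entry_block, re.DOTALL).finditer(binaryData):
        entryOffsets.append(m.start())

    for offset in entryOffsets:

        # virtual address of the block as a hex string, for cross referencing in IDA
        fileOffset = hex(convertOffsetToVA(pe, offset))
        dataBlock = binaryData[offset:offset + 4000]

        # reset per block so a failed parse cannot reuse the previous block's values
        dataBlockEnd = None
        destRegister = None

        lastInstructions = []
        for j in md.disasm(dataBlock, 0):
            
            if j.mnemonic == "push":
                if j.op_str not in ["eax", "edx", "ebx", "ecx"]:
                    lastInstructions.append(j)
                    continue
                else:
                    # a register push is 1 byte, so the push itself stays in the block
                    dataBlockEnd = j.address + 1
                    destRegister = j.op_str
                    break
        
            if j.mnemonic == "call":

                for insn in reversed(lastInstructions):

                    if insn.mnemonic == "lea":
                        # include the lea so the destination register is actually loaded
                        dataBlockEnd = insn.address + insn.size
                        destRegister = insn.op_str.split(",")[0]
                        break

                    elif insn.mnemonic == "movups" and "[" in insn.op_str:
                        # pretty hacky method: take whatever sits between the brackets
                        dataBlockEnd = insn.address + insn.size
                        destRegister = insn.op_str.split("[")[1].split("]")[0]
                        break

                break

            lastInstructions.append(j)

        # nothing usable was found in this block
        if dataBlockEnd is None or destRegister is None:
            continue

        dataBlock = dataBlock[:dataBlockEnd]

        emulateBlock(fileOffset, dataBlock, destRegister)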


if __name__ == "__main__":
    main()