You need to learn how to program to advance your malware analysis skills, but don’t know where to start. This post will outline what aspects you should focus on to have a direct and immediate impact on your ability to reverse engineer malware so that you don’t waste time on concepts with very little return on investment. If you don’t know which programming languages you should learn, see our previous post that answers that question, “Best Programming Languages To Learn For Malware Analysis”.
Core Programming Languages
From our previous post about the best programming languages to learn, we identified 3 core languages you should learn for malware analysis:
- C
- x86_64 assembly
- Python
We will discuss the concepts to focus on for just these 3 languages.
End Goal
Before we talk about the actual concepts to learn, we need to identify what our end goal is. If you don’t know your goal, you won’t know when to stop and you will likely spend needless time going off on tangents.
For malware analysis, you primarily need to learn to read programming code. There is a significant difference in the skills needed to write code versus read it. To write code, you need to know exactly how the APIs work, the performance implications of using different options, how to code efficiently, etc. You need to know how to solve a problem using programming.
But we aren’t writing malware. We are analyzing it to determine functionality and identify Indicators of Compromise (IOCs). Those decisions mentioned in writing code were made by the malware author. You don’t have to remember what API best accomplishes a given task. You only need to recognize what is being accomplished by an API you find in code and determine its significance.
In addition to reading malware binaries, you also need to read example source code to understand different ways a malicious actor could program capabilities. Malware analysis is largely pattern recognition. You read the malicious assembly code to view the APIs used along with the general conditional logic. Then you compare that to all the programming patterns you know to determine the most likely capability you are looking at. If you don’t know multiple ways to program a given capability, it will be much more difficult to recognize it when you see it in malware.
The exception is Python. Here, you will actually want to learn to program in Python in addition to reading code. That’s because one of the main reasons to learn Python is to write analysis scripts that will aid your malware analysis efforts. You will still need to read malware written in scripting languages, but you have the additional goal of writing code to accomplish tasks that are difficult or time consuming to perform manually.
How Much To Learn
The rest of this article will list the specifics of what you should learn, but that still leaves a question about how much you should learn for each topic. There is no easy way to define this. Ultimately, your level of understanding is very much tied to your level of malware analysis expertise. It will be very difficult to reach the top reverse engineering levels if you only have a basic understanding of programming. But if you spend all your time gaining a mastery of programming concepts before working to analyze any malware, you will still be a beginner malware analyst. We will provide a suggestion to help optimize your efforts, but remember it is just an initial suggestion. In the end, you will know when you need to dig deeper into a topic or when you can consider it complete based on how many questions arise while you are reverse engineering a piece of malware.
For each topic area, we suggest you start with the depth covered in an introductory book on programming. That level will give you enough background to start reading example code. As you read more example code, new questions will come up, and you should dive deeper into the topic to answer each one. When you stop coming across new questions, you have learned the topic sufficiently. Letting real questions from actual examples drive your learning keeps you from going unnecessarily deep into any one topic and spending time that would be better used on a different topic area.
Core Concepts For All Languages
Each programming language will have unique aspects to focus on, but there are some fundamentals you need to know that span all languages.
- Variables
- Comparison operators (is equal / less than / greater than)
- Conditional logic and loops (If, If-Else statements, For / While loops, Switch statements)
- Bitwise Operations (xor, shift right, shift left, and, or)
- Function arguments: passing by value vs by reference
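To make the bitwise operations in the list above concrete, here is a minimal Python sketch. The helper names (`xor_bytes`, `rol32`, `ror32`) and the sample values are made up for illustration. Rotates are not in the list, but they are built from the shifts, `or`, and `and`, so they make a useful exercise. Python integers have no fixed width, so the sketch masks results to 32 bits to mimic a register.

```python
MASK32 = 0xFFFFFFFF  # keep results within a 32-bit register width


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


def rol32(value: int, count: int) -> int:
    """Rotate a 32-bit value left, built from shift left, shift right, or, and."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right, the mirror image of rol32."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & MASK32


encoded = xor_bytes(b"cmd.exe", 0x5A)
print(xor_bytes(encoded, 0x5A))          # b'cmd.exe'
print(hex((0x80000001 << 1) & MASK32))   # 0x2  (top bit shifted out)
print(hex(0x80000001 >> 1))              # 0x40000000
print(hex(rol32(0x80000001, 1)))         # 0x3  (top bit wraps around)
```

The XOR round trip is the property to remember: the same operation with the same key both encodes and decodes.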
Core Concepts For The C Programming Language
C programming has these additional concepts you will want to understand.
- Pointers
- Structures
- Specific APIs
For the APIs, you will want to learn those used to perform capabilities typically found in malware. You will still come across unknown functions that you have to look up, but you want to understand the basic functionality of the APIs malware uses most. These common APIs will account for roughly 80% to 90% of the library calls you see, so a general understanding of what they do will dramatically speed up your analysis by leaving only a small number of library functions to research.
A list of common capabilities malware performs is below. You should know the various APIs that can be used to perform these capabilities. Focus on the Windows library APIs, but be familiar with the standard C library APIs as well, e.g. WriteFile vs fwrite. If you search the internet for how to program these capabilities, you will find examples written with different methods, so you can learn all of the relevant library functions.
- File operations (searching for, reading from, writing to, deleting)
- Registry operations (searching for, reading from, writing to, deleting)
- Process manipulation (searching for, creating, reading from, writing to, terminating)
- Service manipulation (creating, starting, querying)
- Network operations (building a socket, connecting to a server, reading / writing a packet)
- String manipulation (getting the string length, finding a substring, copying a string, building a formatted string)
- Miscellaneous operations (Allocating memory, zeroing out memory, sleeping, starting a thread, loading a DLL, resolving an export)
We have included a list with many of the common APIs used for these capabilities in an appendix at the end of this post as a reference. But realize the appendix is just the starting list and you should still research the programming capabilities to see how the APIs are used and any alternative APIs.
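One way to practice this kind of recognition is to group the API names you see in a sample by capability. The sketch below is a hypothetical helper, not a real tool. The table holds only a handful of names from the appendix, and the import list is invented. It also assumes the Windows convention that many string-taking APIs are exported with an `A` (ANSI) or `W` (wide) suffix, which the appendix names omit.

```python
# Hypothetical lookup table; the API names are a small subset of the appendix.
CAPABILITIES = {
    "file": {"CreateFile", "WriteFile", "ReadFile", "DeleteFile", "fopen", "fwrite"},
    "registry": {"RegCreateKey", "RegSetValue", "RegQueryValueEx", "RegDeleteKey"},
    "process": {"OpenProcess", "CreateProcess", "WriteProcessMemory", "TerminateProcess"},
    "service": {"OpenSCManager", "CreateService", "StartService"},
    "network": {"WSAStartup", "socket", "connect", "send", "recv", "InternetOpen"},
}
KNOWN = set().union(*CAPABILITIES.values())


def normalize(api_name: str) -> str:
    """Drop a trailing A/W suffix only when doing so yields a known API name."""
    if api_name not in KNOWN and api_name[-1:] in ("A", "W") and api_name[:-1] in KNOWN:
        return api_name[:-1]
    return api_name


def classify(imports):
    """Group imported API names by the capability they suggest."""
    found = {}
    for name in imports:
        base = normalize(name)
        for capability, apis in CAPABILITIES.items():
            if base in apis:
                found.setdefault(capability, []).append(name)
    return found


# Invented import list standing in for what a disassembler would show.
sample_imports = ["CreateFileW", "WriteFile", "RegSetValueA", "Sleep", "connect"]
print(classify(sample_imports))
```

Names the table does not know, such as `Sleep` here, are simply left out, which mirrors the functions you would still have to look up by hand.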
Core Concepts For x86_64 Assembly
Reading assembly is at the heart of malware reverse engineering. A decompiler may be able to provide you a C-like representation of the assembly, but it won’t always be available or accurate. The assembly will always be available, so if you can read it, you will always be able to reverse engineer the malware.
First, start out with learning about registers. You need to learn about the:
- General purpose registers
- Flags register
- Instruction pointer register
Make sure to learn the different access methods, e.g. referencing specific bytes of a register (al vs ah vs ax vs eax vs rax).
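The register names above are different-sized views of the same 64 bits. A minimal Python sketch, using an arbitrary example value and hypothetical helper names, models each view with a mask and a shift. It also shows the x86_64 rule that writing a 32-bit register clears the upper half of the 64-bit register, while 8-bit and 16-bit writes leave the other bits alone.

```python
rax = 0x1122334455667788  # arbitrary example value

eax = rax & 0xFFFFFFFF    # low 32 bits
ax = rax & 0xFFFF         # low 16 bits
al = rax & 0xFF           # low 8 bits
ah = (rax >> 8) & 0xFF    # bits 8 through 15

print(hex(eax), hex(ax), hex(ah), hex(al))  # 0x55667788 0x7788 0x77 0x88


def write_al(rax_value: int, value: int) -> int:
    """Model 'mov al, value': only the low byte changes."""
    return (rax_value & ~0xFF) | (value & 0xFF)


def write_eax(rax_value: int, value: int) -> int:
    """Model 'mov eax, value': the upper 32 bits of rax are zeroed."""
    return value & 0xFFFFFFFF


print(hex(write_al(rax, 0xFF)))   # 0x11223344556677ff
print(hex(write_eax(rax, 0x1)))   # 0x1
```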
Next, you should learn the essential instructions. There are hundreds of instructions in the x86_64 instruction set, but luckily 95% of the instructions used in malware come from a small subset of instructions. You will want to memorize exactly what each of these core instructions does. When you first start reading the instructions, it will be like a new language. Even when you know what the instruction does, you will need to “sound out” the assembly instructions. With a little practice, you will move to sight reading the instructions, and eventually you will be able to skim the assembly.
The core instructions to learn are as follows:
- push / pop
- call / ret
- mov / lea / brackets operator ([])
- cmp / test
- and
- xor
- inc / dec
- add / sub
- div / idiv / mul / imul
- shr / sar / shl / sal
- nop
- jmp
- jcc (jz / jnz / je / jne / ja / jae / jb / jbe / jg / jge / jl / jle)
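A detail that often trips up new readers of the jcc family is that `ja` / `jb` treat the operands of the preceding `cmp` as unsigned, while `jg` / `jl` treat them as signed. The Python sketch below models only the outcome for 32-bit operands. A real CPU decides from the flags register rather than by comparing values directly, and the function names are named after the instructions purely for illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as a two's complement signed number."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value


def ja(a: int, b: int) -> bool:
    """Would 'cmp a, b' followed by 'ja' jump? Unsigned comparison."""
    return (a & MASK32) > (b & MASK32)


def jg(a: int, b: int) -> bool:
    """Would 'cmp a, b' followed by 'jg' jump? Signed comparison."""
    return to_signed32(a) > to_signed32(b)


def zf_after_cmp(a: int, b: int) -> bool:
    """cmp subtracts without storing the result; ZF is set when a == b."""
    return ((a - b) & MASK32) == 0


def zf_after_test(a: int, b: int) -> bool:
    """test ANDs without storing the result; ZF is set when no bits overlap."""
    return ((a & b) & MASK32) == 0


eax, ebx = 0xFFFFFFFF, 1
print(ja(eax, ebx))          # True: 4294967295 is above 1
print(jg(eax, ebx))          # False: -1 is not greater than 1
print(zf_after_test(0, 0))   # True: 'test eax, eax' with eax == 0, so jz jumps
```

The same bit pattern takes a different branch depending on which jump follows the comparison, so the choice of jump tells you whether the author's variable was signed.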
After that, you should review the calling conventions. You don’t need to know every aspect of the conventions, but you should be familiar with how arguments are passed to functions and what the volatile / non-volatile registers are. Focus on the following calling conventions:
- cdecl
- stdcall (standard calling convention)
- fastcall
- x64 convention
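As a memory aid for how arguments are passed, the sketch below assumes that "x64 convention" means the Microsoft x64 calling convention used on Windows and covers only integer and pointer arguments. The function name is hypothetical. The first four arguments travel in registers, and the rest sit on the stack above the 32 bytes of shadow space the caller reserves. The offsets are relative to rsp at the call site, just before the call instruction executes.

```python
MS_X64_ARG_REGISTERS = ("rcx", "rdx", "r8", "r9")

# Short summary of the 32-bit conventions listed above, for comparison.
X86_CONVENTIONS = {
    "cdecl": "arguments pushed right to left, caller cleans the stack",
    "stdcall": "arguments pushed right to left, callee cleans the stack",
    "fastcall": "first two arguments in ecx and edx, rest on the stack",
}


def ms_x64_arg_location(index: int) -> str:
    """Where the zero-based integer or pointer argument lives at the call site."""
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if index < len(MS_X64_ARG_REGISTERS):
        return MS_X64_ARG_REGISTERS[index]
    # Stack arguments start right after the 0x20 bytes of shadow space.
    offset = 0x20 + 8 * (index - len(MS_X64_ARG_REGISTERS))
    return f"[rsp+{offset:#x}]"


for i in range(6):
    print(i, ms_x64_arg_location(i))
```

When you see values loaded into rcx, rdx, r8, and r9 right before a call, read them in that order as the first four arguments of the function being called.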
Lastly, you should review how the following branching statements from C programs look in assembly:
- if / if-else statements
- for / while loops
- switch statements
Core Concepts For Python
As we mentioned, Python is the outlier in required concepts because here you will want to be able to write programs in addition to reading them. On top of the core concepts, you should learn the following Python-specific ideas.
- Collections (lists, dictionaries, sets)
- List comprehension
- Converting between strings and byte arrays
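Converting between strings and bytes tends to cause the most friction in analysis scripts, so here is a minimal sketch of the common round trips. The URL is a placeholder, and the choice of encodings is an assumption about what you will typically meet (UTF-8 text and the UTF-16LE "wide" strings Windows programs use).

```python
text = "http://example.com"  # placeholder value

raw = text.encode("utf-8")         # str -> bytes
wide = text.encode("utf-16-le")    # str -> wide-character bytes
print(raw.decode("utf-8") == text)     # True
print(wide[:8])                        # b'h\x00t\x00t\x00p\x00'
print(wide.decode("utf-16-le") == text)  # True

# Hex strings are a common way to paste bytes into a script.
hex_string = raw.hex()
print(bytes.fromhex(hex_string) == raw)  # True

# bytes is immutable; bytearray allows patching in place.
buf = bytearray(raw)
buf[0] ^= 0x20                     # flips the case bit: 'h' becomes 'H'
print(bytes(buf))                  # b'Http://example.com'

# List comprehension: keep only the printable ASCII characters of a blob.
blob = b"\x00\x01abc\xff"
printable = "".join([chr(b) for b in blob if 0x20 <= b < 0x7F])
print(printable)                   # abc
```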
After you know the fundamental concepts, you should also know how to code a number of general tasks which will be frequently used in analysis scripts.
- Read in a file, both as text and as binary data
- Write data to a file, both as text and binary data
- Check if a string is contained in a list / dictionary / set
- Build a formatted string consisting of a mix of hardcoded strings and variables
- Use regular expressions
- Perform simple crypto operations: bitwise xor, and, or, add, subtract, shift right / left
- Perform RC4, AES, and RSA crypto using an external library such as PyCryptodome
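Several of the tasks above often appear together in one small script. The following sketch is an illustration only. The file names, the single-byte XOR key, the embedded URL, and the regular expression are all invented, and it uses only the standard library, so the RC4, AES, and RSA item is not covered. It writes an encoded blob to a temporary file, reads it back as binary, decodes it, pulls out URLs with a regular expression, and writes a text report.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: "http://" or "https://" followed by printable ASCII.
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def xor_decode(data: bytes, key: int) -> bytes:
    """Single-byte XOR; the same call encodes and decodes."""
    return bytes(b ^ key for b in data)


def find_urls(data: bytes) -> list:
    """Return every byte string in data that looks like a URL."""
    return URL_PATTERN.findall(data)


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for an encoded configuration blob carved out of a sample.
    sample = Path(workdir) / "sample.bin"
    config = b"\x00\x01http://c2.example.test/gate.php\x00\xff"
    sample.write_bytes(xor_decode(config, 0x41))       # write binary data

    blob = sample.read_bytes()                         # read binary data
    urls = find_urls(xor_decode(blob, 0x41))

    # Build formatted strings and write them out as text.
    report = Path(workdir) / "report.txt"
    lines = [f"url: {u.decode('ascii')}" for u in urls]
    report.write_text("\n".join(lines), encoding="utf-8")
    report_text = report.read_text(encoding="utf-8")   # read text data

print(report_text)  # url: http://c2.example.test/gate.php
```

In a real script the key and the blob would come from your analysis of the sample. The structure of read, decode, search, and report stays the same.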
Final Thoughts
Once you have a basic understanding of the core concepts talked about here, your focus should be on learning all the different ways these concepts can be used. Remember, we are not writing malware, but reading it. That means you want to know as many different ways as possible these concepts can be used so that you recognize them in malware. You should spend considerable time researching different programming examples to see all the variations of how these concepts get used. The broader your knowledge base, the quicker you will be able to analyze malware.
Appendix – Common Capabilities API List
File Manipulation APIs
- CreateFile
- WriteFile
- ReadFile
- MoveFile
- DeleteFile
- CopyFile
- FindFirstFile
- FindNextFile
- fopen
- fread
- fscanf
- fgetc
- fgets
- fwrite
- fputc
- fputs
- fprintf
Registry Manipulation APIs
- RegCreateKey
- RegQueryValueEx
- RegGetValue
- RegEnumValue
- RegSetValue
- RegSetKeyValue
- RegDeleteValue
- RegDeleteKey
- RegDeleteKeyValue
Process Manipulation APIs
- OpenProcess
- CreateProcess
- CreateProcessAsUser
- CreateToolhelp32Snapshot
- Process32First
- Process32Next
- ReadProcessMemory
- WriteProcessMemory
- TerminateProcess
- ExitProcess
Service Manipulation APIs
- StartServiceCtrlDispatcher
- RegisterServiceCtrlHandler
- SetServiceStatus
- OpenSCManager
- CreateService
- StartService
- OpenService
- QueryServiceConfig
- ChangeServiceConfig
Network Manipulation APIs
- WSAStartup
- socket
- connect
- bind
- listen
- accept
- send
- recv
- inet_addr
- htons
- ntohs
- closesocket
- shutdown
- gethostname
- gethostbyname
- InternetOpen
- InternetOpenUrl
- InternetConnect
- InternetReadFile
- InternetWriteFile
- HttpOpenRequest
- HttpQueryInfo
- HttpSendRequest
- HttpAddRequestHeaders
- WinHttpOpen
- WinHttpAddRequestHeaders
- WinHttpOpenRequest
- WinHttpQueryDataAvailable
- WinHttpConnect
- WinHttpQueryHeaders
- WinHttpSendRequest
- WinHttpReceiveResponse
- WinHttpReadData
- WinHttpWriteData
String Manipulation APIs
- sprintf
- strstr
- strcpy
- strcat
- strlen
Miscellaneous APIs
- VirtualAlloc / malloc
- memset
- memcpy
- LoadLibrary
- GetProcAddress
- CreateThread
- WaitForSingleObject
- Sleep
- GetTickCount