Learning to program is a requirement for advancing your malware analysis skills, but there are dozens of languages and it’s hard to know where to start. This post details the fundamental programming languages you should focus on, along with the reasons why, so you can start your malware reverse engineering education.
Why Learn Programming Languages
Learning to program is a significant time investment, so you should know why you need these skills before starting. Malware reverse engineering is basically source code review, except that instead of reading the code in a high-level language like C, you’re reading it in low-level assembly language. Ultimately, you’re still just reading code. If you don’t know how to program, you won’t have the knowledge necessary to interpret the significance of what you’re reading. You won’t be able to work out what an unknown function does by recognizing common programming algorithms, and you won’t know where best to look for Indicators of Compromise (IOCs). You can do very simple triage analysis without understanding programming, but you’ll never reach advanced levels without the basics.
Core Programming Languages
There are three main languages you should focus on:
- C
- x86-64 assembly
- Python
C is the first language to learn because the majority of malware is written in C or C++. Additionally, many malware samples written in other languages rely on the same fundamental concepts as C. So by learning to program in C first, you will have a solid foundation for understanding many different malware samples.
x86-64 assembly is the next language to learn. Assembly code contains the instructions the computer actually executes to carry out a statement written in a higher-level language like C. So while you may see an assignment like “x = 49” in C, the computer will actually execute an assembly instruction like “mov eax, 49”. There are numerous assembly instruction sets, but x86-64 is the one used by most Windows computers and by the majority of compiled malware.
You may wonder: if assembly instructions map to C code, why can’t you just translate the assembly back to C so that you don’t have to learn both languages? This is called decompiling, and the catch is that the mapping is not exact. When you write a C program and compile it into a binary, the compiler translates the C statements into a set of assembly instructions. But the translation to x86-64 assembly is one-to-many: there are often several instruction sequences the compiler can choose from, and different pieces of source code can end up as the same instructions. That means decompiling the assembly back to the exact original source code is impossible. But that doesn’t mean you can’t decompile the assembly code back to equivalent source code.
That’s exactly what decompilers attempt to do: translate assembly instructions back into source code that is logically equivalent to the original. But for reasons more complex than is worth digging into here, this is a difficult problem to get right. Tools such as IDA Pro and Ghidra have decompilers that do a decent job, but they still produce numerous errors and simply fail on some code. They will likely improve as time goes on, but will probably always be inexact.
That means if you want to be sure you can reverse engineer a compiled binary, the only reliable way is to learn to read the assembly. If you skip this step and rely only on decompilers, you will be limited in what you can analyze. You will probably be able to review the majority of files, but you will be out of luck with the advanced, high-impact malware.
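To make the one-to-many translation described above more concrete, here is a minimal sketch in Python. It is a toy model of a single 32-bit register, not a real x86-64 emulator, and the helper names (`run`, `results_for`, `CANDIDATES`) are our own inventions for illustration. It assumes three instruction choices a compiler could plausibly make for the C statement `x = x * 2;` and shows that they all leave the same value in the register, which is why a decompiler cannot tell which form the original source used.

```python
# Toy model of one 32-bit register. This is NOT a real x86-64 emulator:
# CPU flags and every other register are ignored.
MASK32 = 0xFFFFFFFF


def run(instructions, eax):
    """Apply a list of (mnemonic, operand) pairs to a single register value."""
    for mnemonic, operand in instructions:
        if mnemonic == "add_self":    # models: add eax, eax
            eax = (eax + eax) & MASK32
        elif mnemonic == "shl":       # models: shl eax, imm
            eax = (eax << operand) & MASK32
        elif mnemonic == "imul":      # models: imul eax, eax, imm
            eax = (eax * operand) & MASK32
        else:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
    return eax


# Hypothetical instruction choices for the C statement `x = x * 2;`
CANDIDATES = {
    "add eax, eax": [("add_self", None)],
    "shl eax, 1": [("shl", 1)],
    "imul eax, eax, 2": [("imul", 2)],
}


def results_for(value):
    """Run every candidate sequence on the same starting value."""
    return {text: run(code, value) for text, code in CANDIDATES.items()}


if __name__ == "__main__":
    for text, result in results_for(21).items():
        print(f"{text:<18} -> eax = {result}")
```

All three candidates produce the same result for any starting value, including ones that overflow 32 bits, so the compiled output alone does not reveal how the original statement was written.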
Python is the last core language to learn. A scripting language is important for two reasons. First, it lets you automate common analysis tasks which would be very time consuming or difficult to perform manually. Second, there is a whole class of malware which is written in scripting languages rather than compiled into an executable.
Why Python, though, and not another scripting language like Perl or PowerShell? This is not because we are Python fanatics. It’s because many of the main reverse engineering tools have a built-in scripting component which uses Python. IDA Pro has IDAPython, which lets you create analysis scripts in Python. Ghidra uses Jython so that you can also create analysis scripts in Python. These are two of the main malware analysis tools used by advanced analysts. The fact that both use Python for scripting means you gain far more from learning Python than from learning other scripting languages. In the beginning, you will probably only use scripts written by other analysts in these disassemblers. But over time, you will need to modify scripts and create your own if you want to work on the harder implants in a timely manner.
Once you learn Python, reading other scripting languages will be relatively easy. Scripting languages share many of the same foundational concepts, so while the APIs and syntax will be a little different, moving from one language to another is usually straightforward. Whether you’re trying to analyze a malicious PowerShell script or port a cool analysis script you found in Perl, it should be manageable once you know how to program in Python.
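As an illustration of the kind of analysis task a short script can automate, here is a minimal sketch that pulls printable strings out of raw bytes and looks for URL-like and IPv4-like IOCs. The function names and the regular expressions are our own assumptions and are deliberately loose; a real triage script would need stricter patterns and handling for wide (UTF-16) strings. The sample bytes are made up and use reserved documentation addresses.

```python
import re

# Runs of 6 or more printable ASCII bytes, similar in spirit to the `strings` utility.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")
# Deliberately loose patterns, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_strings(data):
    """Return every run of printable ASCII found in a bytes object."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def extract_iocs(data):
    """Return de-duplicated, sorted URL-like and IPv4-like strings."""
    urls, ips = [], []
    for text in extract_strings(data):
        urls.extend(URL_PATTERN.findall(text))
        ips.extend(IPV4_PATTERN.findall(text))
    return {"urls": sorted(set(urls)), "ipv4": sorted(set(ips))}


if __name__ == "__main__":
    # Made-up bytes standing in for part of a binary file.
    sample = (
        b"\x00\x01MZ\x90\x00http://example.test/payload.bin"
        b"\x00\xff\xfe192.0.2.10\x00junk"
    )
    print(extract_iocs(sample))
```

To try it on a real file, you could read it with `open(path, "rb").read()` and pass the result to `extract_iocs`.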
Niche Programming Languages
The core languages we talked about will set you up for success with the majority of malware, but there are several languages you will want to learn depending on what niche malware you look at. The good news is that the core languages provide a solid foundation which will dramatically decrease the time needed to learn any additional programming languages.
For Android malware, you will want to learn Java. This is because Android apps are commonly written in Java and, unlike malware compiled for Windows computers, compiled Java code can usually be decompiled back to Java quite accurately. So the tools you use to read Android malware will be able to show you a close representation of the original source code.
For maldocs, or malware delivered through MS Office documents, you’ll want to learn Visual Basic for Applications (VBA) and PowerShell. VBA is the scripting language MS Office products use for macros. And because PowerShell is usually available on the computers that open MS Office documents, many maldocs use VBA to decrypt custom PowerShell scripts, which then run and download next-stage implants.
Go (also called golang) malware is a newer niche that is starting to appear more frequently. It’s still a relatively small percentage of malicious files, but Go binaries are structured very differently from other compiled Windows malware. So if you want to expand your abilities, it doesn’t hurt to at least learn the basics of how Go malware is structured so you can reverse engineer it when you encounter it.
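To give a feel for what unwrapping a hidden PowerShell script can look like, here is a minimal sketch. It does not cover the custom decryption described above. Instead it handles a simpler, related case: PowerShell’s `-EncodedCommand` switch, which takes a script as Base64-encoded UTF-16LE text. The function names are our own, and the encoded value is a harmless command built just for the example.

```python
import base64


def decode_encoded_command(b64_text):
    """Decode a PowerShell -EncodedCommand argument (Base64 of UTF-16LE text)."""
    return base64.b64decode(b64_text).decode("utf-16-le")


def encode_command(script):
    """Inverse helper, used here only to build a harmless example value."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


if __name__ == "__main__":
    blob = encode_command("Write-Output 'hello'")
    print(blob)
    print(decode_encoded_command(blob))
```

Real maldocs often stack several layers (string concatenation, XOR, compression) on top of this, so treat the sketch as one step in a longer unwrapping process.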
What Concepts To Focus On
Now you know what programming languages you should learn to become a malware analyst, but learning to program is a large endeavour that can take years to finish. You don’t want to spend time focusing on niche concepts that won’t help you reverse engineer malware. But how do you know what to learn and what to skip?
Luckily, you don’t need to know everything about a programming language to effectively read malicious code, but outlining the core concepts is no small task. Because of that, we wrote a follow-on post to answer just this question: “Required Programming Concepts To Learn For Malware Analysts”.