When it comes to exploiting Adobe Acrobat Reader, malware authors have two advantages. First, history shows that new vulnerabilities are discovered in Acrobat Reader every year, allowing for buffer overflow attacks and code execution. In 2016, twenty new vulnerabilities (CVEs) were logged against Acrobat Reader. Of these, seventeen allowed for code execution, providing brand-new attack vectors for malware droppers to exploit.
The second advantage for malware authors is that security patches are typically not applied in a timely manner, leaving attack vectors viable long after fixes have been published. In Cyber Security Trends: Aiming Ahead of the Target to Increase Security in 2017, John Pescatore states that “while attacks that exploit zero-day vulnerabilities tend to get the most press coverage, data shows that attacks that exploit well-known vulnerabilities cause the vast majority of business damage”.
There are many examples of PDF malware droppers in the wild. In this article, we will examine an Adobe Collab.collectEmailInfo() buffer overflow sample that was originally submitted to VirusTotal in March 2010. Sadly, even though this sample is over six years old, it was still being submitted to VirusTotal as recently as December 2016. This activity shows prolonged interest in a six-year-old exploit. What does this mean? Are security patches not being applied?
Let’s take a guided tour of how this sample works…
Before we start analyzing the sample, it will be useful to gain a high-level understanding of the PDF file format specification. The file format description below is intended to provide a brief overview/refresher. If needed, please reference the Adobe PDF file format specification for a more thorough understanding. Also, Didier Stevens provides an extremely good description of the Physical and Logical Structure of PDF Files.
As shown in the diagram above, PDFs are composed of a Header, Body, Xref Table and a Trailer. The bulk of the PDF file typically resides in the body section. The body is composed of one or more numbered objects (e.g., 1 0 obj, 2 0 obj in the diagram above). These objects can contain dictionary elements (e.g., Type, Catalog, Pages, OpenAction in object 1 in the diagram above). Optionally, objects can contain embedded streams (e.g., as shown contained within the stream/endstream tags in object 2 in the diagram above).
The sample text in the diagram above illustrates a representative example of a PDF encoding (the stream contents in the sample were abbreviated for the sake of clarity/conciseness). Aside from encoded/compressed streams within the document, PDF files are composed of printable text characters and can be viewed in a text editor. However, a text editor does not provide a cleartext view of the encoded/compressed streams. As a result, many analysts use tools such as pdf-parser.py, written by Didier Stevens, to view a comprehensive decode of a PDF file.
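To make the object layout concrete, here is a minimal Python sketch that lists the numbered objects in a raw PDF and notes which ones carry a stream. It is not a real PDF parser (it ignores the Xref table and could be fooled by binary stream data), and the SAMPLE bytes and the list_objects() name are made up for illustration.

```python
import re

# Matches "N G obj ... endobj"; DOTALL lets the body span several lines.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

def list_objects(pdf_bytes):
    """Return (object number, generation, has_stream) for each object found."""
    found = []
    for m in OBJ_RE.finditer(pdf_bytes):
        body = m.group(3)
        has_stream = b"stream" in body and b"endstream" in body
        found.append((int(m.group(1)), int(m.group(2)), has_stream))
    return found

# Hypothetical two-object file body, loosely modeled on the structure described above.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 3 0 R /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)

print(list_objects(SAMPLE))  # [(1, 0, False), (2, 0, True)]
```

For real samples, a dedicated tool such as pdf-parser.py remains the better choice; the sketch only shows what "numbered objects with optional streams" looks like in practice.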
Now that we have seen a glimpse of what a simple PDF file encoding looks like, let’s start analyzing the real thing. The concepts described above will make much more sense as we see them used in an actual sample.
ANALYZING THE DOCUMENT
The malware dropper that we are analyzing uses four distinct stages during its attack as shown in the diagram below. Each stage is progressively more complex and ultimately builds towards the actual infection.
The third stage of the attack contains an embedded shellcode payload. This stage exploits a buffer overflow in Adobe Reader via the Collab.collectEmailInfo() vulnerability in order to hijack the instruction pointer register (EIP) and run the shellcode. This shellcode downloads malware from the Internet, saves it to disk and starts it as the last stage of the attack.
Why are so many stages used during this attack? Why not simply skip the first three stages and auto-start stage four directly as the document is opened? Why all the complexity? The answer is simple… obfuscation. The attacker is making it as difficult as possible to analyze and understand the code (manually or programmatically). If we cannot understand it, we cannot make a conclusive determination that it is malicious.
The infection is now complete. Let’s examine how each of these stages works…
In the first stage of the attack, an OpenAction is used to automatically start an object stream when the document is first opened, as shown in the PDF encoding below. The highlighted section of the PDF shows that object 1 contains an OpenAction which targets object 4 to be started automatically (e.g., 4 0 R is a reference to object 4).
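Spotting this kind of auto-start reference can be automated. The following is a rough Python sketch, assuming the OpenAction is written as an indirect reference (N G R) as in this sample; an OpenAction can also be an inline dictionary, which this sketch deliberately does not handle. The function name and the catalog bytes are illustrative only.

```python
import re

def open_action_target(pdf_bytes):
    """Return the object number referenced by /OpenAction, or None."""
    m = re.search(rb"/OpenAction\s+(\d+)\s+(\d+)\s+R\b", pdf_bytes)
    return int(m.group(1)) if m else None

# Hypothetical catalog object resembling the one described above.
catalog = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"

print(open_action_target(catalog))  # 4
```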
Looking at the object stream in object 5 as shown below reveals a FlateDecode stream. Since it is compressed/encoded via the Deflate algorithm, it will need to be decompressed/decoded in order to view it properly. This can be achieved programmatically in Java via classes such as InflaterInputStream in the java.util.zip package, or by using open source tools as described earlier.
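The same decompression is easy to try in Python with the standard zlib module. This is a minimal sketch under the assumption that the object uses a single FlateDecode filter with no predictor; the sample object and its harmless script text are fabricated here purely so the example can run on its own.

```python
import re
import zlib

def inflate_stream(obj_bytes):
    """Extract the bytes between stream/endstream and inflate them."""
    m = re.search(rb"stream\r?\n(.*?)endstream", obj_bytes, re.DOTALL)
    if m is None:
        raise ValueError("no stream found in object")
    # decompressobj() stops at the end of the Deflate data, so the end-of-line
    # bytes that precede "endstream" are ignored instead of raising an error.
    return zlib.decompressobj().decompress(m.group(1))

# Hypothetical stand-in for a compressed JavaScript stream.
script = b"app.alert('hello');"
packed = zlib.compress(script)
obj = (
    b"5 0 obj\n<< /Filter /FlateDecode /Length "
    + str(len(packed)).encode("ascii")
    + b" >>\nstream\n" + packed + b"\nendstream\nendobj\n"
)

print(inflate_stream(obj) == script)  # True
```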
Stage 1B: Hex-Decode Annotation Payload and Run It
Let’s walk through each of these steps one by one and see if there is a payload in the annotations…
Prior to grabbing a payload via annotations in the PDF document, there needs to be a guarantee that all annotations can be detected. The syncAnnotScan() function shown below does just that. It guarantees that all annotations have been scanned such that a complete list will be returned via getAnnots().
Once the annotations have been scanned, the annotations within the document can be accessed via the getAnnots() function. The code shown below gets all of the annotations on the first page (nPage: 0) of the document as described in the Adobe documentation. This sample document consists of only one page, so all annotations are contained on the first page.
We see that object 3 contains the Annots tag as shown below. This is the annotations definition that will be returned via the getAnnots() function. In this sample, the annotation references two objects. The first refers to object 6 (e.g., 6 0 R). The second annotation refers to object 8 (e.g., 8 0 R). Line 16 in the code shown above retrieves the object references and stores them into the variable pr as a two-element annotation array (element 0 -> 6 0 R, element 1 -> 8 0 R). We need to remember the pr variable’s purpose as we continue our analysis.
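An analyst can recover the same two-element list statically, without running the script. Here is a small Python sketch that pulls the object numbers out of an Annots array; the page bytes are a hypothetical reconstruction of object 3, and annot_refs() is an illustrative name.

```python
import re

def annot_refs(page_obj):
    """Return the object numbers listed in a page's /Annots array, in order."""
    m = re.search(rb"/Annots\s*\[(.*?)\]", page_obj, re.DOTALL)
    if m is None:
        return []
    return [int(num) for num, _gen in re.findall(rb"(\d+)\s+(\d+)\s+R\b", m.group(1))]

# Hypothetical page object with the two annotation references described above.
page = b"3 0 obj\n<< /Type /Page /Annots [6 0 R 8 0 R] >>\nendobj\n"

refs = annot_refs(page)
print(refs)     # [6, 8]
print(refs[1])  # 8, the element used in stage one
```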
If we look at object 8 in the PDF file as shown below, we see that its /Subj dictionary element is an object reference to object 9. Let’s go look at object 9 and see what we can find…
Looking at line 30 below, we can see that the decoded payload buffer variable is passed into the app[fnc + ‘l’](buffer) function. Analyzing the code at lines 2, 21 and 29, we can see that the fnc variable was built such that its value is currently set to ‘eva’. As ‘l’ is added to the fnc variable as shown in line 30, the resulting effective code is app[‘eval’](buffer). This is an alternate way of invoking eval().
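Both ideas in this stage, hex decoding and a function name assembled at run time, can be mimicked in a few lines of Python. This is only an analogy: the sample payload is a harmless made-up string, and the call site is handled as plain text to show why a naive search for the word eval would miss it.

```python
def hex_decode(hex_text):
    """Turn a string of hex digit pairs back into the text it encodes."""
    return bytes.fromhex(hex_text).decode("latin-1")

# Hypothetical, harmless stand-in for the hex-encoded annotation payload.
encoded = "app.alert('1');".encode("latin-1").hex()
decoded = hex_decode(encoded)
print(encoded)
print(decoded)

# The function name is built from pieces, so it never appears literally.
fnc = "e" + "v" + "a"
call_site = "app[fnc + 'l'](buffer)"
print(fnc + "l")            # eval
print("eval" in call_site)  # False
```

A signature that only looks for the literal text eval( in the decoded script would not fire on this call site, which is presumably why the author wrote it this way.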
The payload is launched on line 30 and the next stage of the attack begins…
Stage 2: Custom-Decode Annotation Payload and Run It
The main differences between this stage and stage one are that a different annotation payload is retrieved, and this decode logic is customized rather than leveraging standard hex decode logic. In fact, almost the entire code block shown is dedicated to decoding the payload. The sdcK_To5s() function is primarily one big obfuscated decode loop. One big, ugly, decode loop.
Let’s walk through this logic and see the differences…
The first difference is shown in the logic below. The pr annotation array that was built in stage one is referenced in order to get the payload that will be used for this stage. As you can see in lines 14 and 16 below, this time it references annotation array index 0 rather than 1. Similar to stage one, it references the Subject dictionary element to get its payload – just from a different object reference this time.
If you remember from the Annots definition as shown below, the first element (index 0) is a reference to object 6. Object 6 contains the payload that is used in this stage as opposed to object 8 that was used in stage one.
Similar to what we saw in the first stage, the object initially referenced directly in Annots ultimately points to a different object. Object 6 shown below contains a Subj dictionary entry which references object 7. Object 7 is ultimately our payload for this stage.
As shown below, object 7 contains a FlateDecode stream, which is most likely the payload for this stage. Good… we are getting closer…
After we FlateDecode the stream, we can see the payload as shown below. This payload is obviously not a hex-encoded stream (obvious because we can see characters in the stream that don’t fall within the [0-9, A-F] range of valid hex characters). We also know it’s not a decimal or octal payload.
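The "is this plain hex?" check can be written down as a quick test. The following is a minimal Python sketch with an illustrative function name; it accepts lowercase hex digits as well and ignores whitespace, which is an assumption about how such payloads are usually laid out.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F

def looks_hex_encoded(text):
    """True if text is a non-empty, even-length run of hex digits."""
    stripped = "".join(text.split())  # drop spaces and line breaks
    if not stripped or len(stripped) % 2 != 0:
        return False
    return all(ch in HEX_DIGITS for ch in stripped)

print(looks_hex_encoded("6170702e616c657274"))  # True
print(looks_hex_encoded("61zz70!!"))            # False
```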
This payload may be trickier to decode than what we encountered in stage one. The good news… since it is trickier to decode, it probably has a more valuable payload.
So, how are we going to decode this payload? Let’s let the malware logic do it for us…
One trick used by analysts is to replace the eval() function such that it prints the decoded results rather than executing them. As we can see in line 96 below, the decoded payload (the attack) is ultimately launched via eval(). As we analyze the code, we need to be careful that this line never executes.
To decode the payload, we copy the sdcK_To5s() function (our big ugly decode loop) onto an isolated test machine and change it such that eval() behaves as a print() function. The code shown below was left completely intact except for two changes.
The first change was to add the line of code eval=print. This effectively is a reassignment such that eval() no longer works. We are safe. Rather than executing the payload, it will print it to the screen.
The second change was to declare the new variable subjectPayload and assign it the value of the object stream in object 7 (e.g., 222606805…). The code shown below abbreviates this value for conciseness in the screenshot. The actual program would obviously include the full value.
Now we can safely decode our payload and see the results…
We have not discussed in depth that the decode loop logic is “tamper-resistant”. This means that the decoder uses the actual source code of the decode loop and the source code length as key values in the decode algorithm. This is done via the arguments.callee logic listed in the original code block above. This means that if the source code were altered (e.g., we re-write it to cleanse the code for readability, etc.), decoding would fail. We were successful leveraging the decode logic for one reason. All of our code changes were outside of the function, leaving the original function intact.
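The effect of keying a decoder on its own source text can be illustrated with a toy model. The Python sketch below is not the sample’s algorithm: the SHA-256 key derivation, the XOR step, and every name in it are assumptions chosen only to show why even a whitespace-only cleanup of the function breaks decoding.

```python
import hashlib

def derive_key(source_text):
    """Toy key that depends on both the source text and its length."""
    material = source_text + str(len(source_text))
    return hashlib.sha256(material.encode("utf-8")).digest()

def xor_with_key(data, key):
    """XOR each byte with the repeating key (the same call encodes and decodes)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Hypothetical decoder source, before and after an analyst tidies it up.
original_src = "function d(s){/* decode loop */}"
tidied_src = "function d(s) { /* decode loop */ }"

secret = b"next stage"
encoded = xor_with_key(secret, derive_key(original_src))

print(xor_with_key(encoded, derive_key(original_src)) == secret)  # True
print(xor_with_key(encoded, derive_key(tidied_src)) == secret)    # False
```

This is why the safe approach described above leaves the function body untouched and makes all changes outside of it.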
Stage two is completed.
We are getting closer to the actual exploit. Let’s continue onward…
Stage 3: Buffer Overflow and Launch ShellCode Payload
Let’s examine the code and see if we can figure out what it’s doing…
The first thing that catches our eye is the shellcode payload highlighted below. Why do we think it’s a shellcode payload? There are several clues.
Second, the payload is unescaped (converted back to the actual bytes represented by the hexadecimal values) in line 82 shown below. For the given payload, this would result in many non-printable characters. It is unlikely that the unescaped payload is used for any benign display purposes. Instead, it is more likely to be shellcode.
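To see what unescaping produces, an analyst can reproduce it offline. The sketch below assumes the payload uses the %uXXXX form commonly seen in JavaScript shellcode, where each token is one 16-bit value that ends up in memory low byte first; the source does not show the exact encoding, so treat both the format and the unescape_u() helper as assumptions.

```python
import re

def unescape_u(escaped):
    """Convert %uXXXX tokens into the bytes they occupy in memory."""
    out = bytearray()
    for hexword in re.findall(r"%u([0-9a-fA-F]{4})", escaped):
        # Each token is a 16-bit code unit, stored little-endian.
        out += int(hexword, 16).to_bytes(2, "little")
    return bytes(out)

print(unescape_u("%u9090%u9090").hex())  # 90909090
print(unescape_u("%u4241").hex())        # 4142 (note the byte swap)
```

Most of the resulting bytes are non-printable, which matches the observation above that the unescaped value is unlikely to serve any display purpose.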
Additional evidence that we are analyzing a buffer overflow attack is that we see signs of addresses typically used when overwriting the instruction pointer register (EIP) as highlighted below.
Memory that is allocated dynamically is stored on the heap as opposed to the stack. A buffer overflow style attack typically involves dynamically allocating (“spraying”) the heap with large combinations of NOP slides and the actual shellcode payload. This results in many pairs of NOP+shellcode chunks throughout the heap located at the higher memory addresses (e.g., 0x0a0a0a0a, 0x0c0c0c0c). The overflow attack will then attempt to exploit a vulnerability to overwrite the EIP with 0x0c0c0c0c such that the next instruction to execute points to a NOP slide within one of these chunks. When all works correctly, the exploited host application walks directly into the shellcode payload within the heap and starts executing it.
To overwrite the EIP, typically a long series of 0x0c0c0c0c values is strung together and passed into an exploitable section of code (code that fails to bounds-check memory copies, allowing an overflow to overwrite the value that is later loaded into the EIP register).
So… do we see any evidence that the 0x0c0c0c0c pattern is heavily duplicated and passed into an exploitable section of code? The answer is yes. Lines 96 and 99-101 below show that 0c0c0c0c is unescaped and then duplicated into a long sequence of the pattern repeated many times. The highlighted section below then shows that this repeated pattern of 0c0c0c0c is passed in the msg argument to Collab.collectEmailInfo().
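Heavy repetition of a four-byte value is simple to measure once the script’s strings have been converted to bytes. The following Python sketch counts the longest run of a repeated unit; the threshold an analyst would treat as suspicious is a judgment call and is not taken from the sample.

```python
import re

def longest_run(data, unit):
    """Return the largest number of back-to-back repeats of unit in data."""
    pattern = re.compile(b"(?:" + re.escape(unit) + b")+")
    best = 0
    for m in pattern.finditer(data):
        best = max(best, len(m.group(0)) // len(unit))
    return best

UNIT = b"\x0c\x0c\x0c\x0c"

# Hypothetical buffer: a little noise around 50 repeats of the pattern.
buf = b"AA" + UNIT * 50 + b"BB"

print(longest_run(buf, UNIT))       # 50
print(longest_run(b"hello", UNIT))  # 0
```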
Finally, we found our exploit! Collab.collectEmailInfo() contains a well-known vulnerability in Adobe Reader that can be used via buffer overflow attacks to run arbitrary shellcode. This is exactly what is happening in our example. Even if the vulnerability were not already known and documented, we would be extremely suspicious of Collab.collectEmailInfo() based on our analysis. We have seen enough to theorize we are dealing with a shellcode payload and move on to the next stage of our analysis. We see obvious signs of a shellcode payload (e.g., 9090…). We see obvious signs of commonly used EIP overflow values (e.g., 0c0c0c0c). We see the EIP overflow value being highly duplicated into a long string and passed into a well-known vulnerability. Next step… Let’s examine the shellcode payload and see what it does…
Stage 4: Download Malware and Infect Target Machine
All of the steps described in stages 1-3 have led up to this moment of attack. The entire purpose of all of the previous stages was to run this small set of machine code as shown below. Think about it… for this payload to be successful, it will need to be able to load networking libraries, establish a network connection to a remote malware-hosting site, download malware, save it to disk and ultimately run it. All of this logic needs to be contained within this small shellcode payload.
Let’s see what it contains…
Already, the payload is interesting. Looking at the printable strings in the payload above, we can see a URL artifact indicating a likely malware-hosting site. Remember that, in order for the shellcode payload to be successful, it must establish a network connection to a remote malware-hosting site. Most likely, this URL is the malware-hosting site.
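Pulling printable strings out of a binary blob is what the Unix strings utility does, and it takes only a few lines to approximate. In the Python sketch below the blob and its URL are placeholders invented for the example (the .invalid domain cannot resolve), and the minimum length of six characters is an arbitrary choice.

```python
import re

def printable_strings(data, min_len=6):
    """Return runs of printable ASCII of at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}")
    return [m.group(0).decode("ascii") for m in pattern.finditer(data)]

# Hypothetical blob: a few opcode-like bytes around a placeholder URL.
blob = b"\x90\x90\xeb\x10" + b"http://example.invalid/a.exe" + b"\x00\xff\xd0"

print(printable_strings(blob))  # ['http://example.invalid/a.exe']
```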
To get a definitive answer as to what the shellcode does, we have several options to choose from. The best (although most difficult) option is to paste this shellcode into a debugger (e.g., OllyDbg) and step through it as it executes. This would need to be done on the operating system that is targeted by the shellcode (e.g., Windows). As a result, there is a risk of infection during this debug process and precautions need to be taken (e.g., work on an isolated machine/network in a virtualized environment). Also, let’s face it… stepping through shellcode in a debugger is not the easiest task.
An alternate option would be to leverage a tool such as sctest that is part of the libemu package (the analysis shown below leverages sctest preinstalled on REMnux). This tool provides basic x86 emulation and can be run on a CentOS platform (no risk of infection). The tool provides a quick and easy method to view a system call trace along with the arguments and return values for each call. It obviously isn’t as robust as a full-blown debug session, but most times it provides enough information to understand the nature of the attack.
The binary payload shown above was placed in a file named payload.raw and redirected into sctest for analysis. The screen shot below shows the results of running the unescaped payload through an sctest session.
As you can see from the results above, the shellcode payload is conclusively malicious. The shellcode performs the following steps:
- Calls LoadLibrary to dynamically load the urlmon library. This library contains many useful networking functions.
- Calls URLDownloadToFile to download malware from the hosting site. The malware is saved locally to c:\tmp\cFoN.exe
- Calls WinExec to launch the malware from c:\tmp\cFoN.exe
- Calls ExitProcess to terminate the launcher
To protect against this type of attack:
- Keep all web browsers and Adobe packages up to date with the latest security patches
- Do not open documents from untrusted sources
Thanks for reading… You can also connect with me on Twitter at @kd_cybersec.