Adobe Portable Document Format (PDF) files remain one of attackers’ favorite vehicles for installing (dropping) malware on a victim machine. The use of PDF files as malware droppers is largely driven by the popularity of the format as a ubiquitous means of sharing documents, combined with its support for JavaScript, embedded streams and a long history of PDF viewer vulnerabilities. Simply put, a malware author has traditionally been able to exploit known vulnerabilities in PDF viewers with relative ease, and with a high level of confidence that a victim will open the document and become infected.

When it comes to exploiting Adobe Acrobat Reader, the malware author has two advantages. First, history shows that new vulnerabilities are discovered in Acrobat Reader each year, allowing for buffer overflow attacks and code execution. In 2016, there were twenty new vulnerabilities (CVEs) logged against Acrobat Reader. Of these, seventeen allowed for code execution, providing brand new attack vectors for malware droppers to exploit.

The second advantage for malware authors is that security patches are typically not applied in a timely manner, leaving attack vectors viable long after fixes have been published. In Cyber Security Trends: Aiming Ahead of the Target to Increase Security in 2017, John Pescatore states that “while attacks that exploit zero-day vulnerabilities tend to get the most press coverage, data shows that attacks that exploit well-known vulnerabilities cause the vast majority of business damage”.

There are many examples of PDF malware droppers in the wild. In this article, we will examine an Adobe Collab.collectEmailInfo() buffer overflow sample that was originally submitted to VirusTotal in March 2010. Sadly, even though this sample is over six years old, it was still being submitted to VirusTotal as recently as December 2016. This activity shows a prolonged interest in a six-year-old exploit. What does this mean? A likely explanation is that security patches are still not being applied.

This sample was chosen for analysis due to its popularity, its longevity and the multiple stages of attack used by the malware author. It leverages multiple stages of JavaScript and shellcode payloads, and ultimately a buffer overflow exploit, to drop the malware. For those interested in learning how PDF-based droppers accomplish their goal, this is a perfect case study to examine.

Let’s take a guided tour of how this sample works…



Before we start analyzing the sample, it will be useful to gain a high-level understanding of the PDF file format specification.  The file format description below is intended to provide a brief overview/refresher.  If needed, please reference the Adobe PDF file format specification for a more thorough understanding.  Also, Didier Stevens provides an extremely good description of the Physical and Logical Structure of PDF Files.

As shown in the diagram above, a PDF consists of a Header, Body, Xref Table and Trailer. The bulk of the PDF file typically resides in the body section. The body consists of one or more numbered objects (e.g., 1 0 obj, 2 0 obj in the diagram above). These objects can contain dictionary elements (e.g., Type, Catalog, Pages, OpenAction in object 1 in the diagram above). Optionally, objects can contain embedded streams (e.g., the content between the stream/endstream keywords in object 2 in the diagram above).

The streams may or may not be encoded/compressed via one or more cascading filters.  For example, the diagram above shows that the stream in object 2 has been encoded/compressed via the FlateDecode filter.  This stream would need to be decoded/decompressed in order to restore the original content to a cleartext state.  Traditionally, these streams are prime choices for malware authors to place JavaScript code and other malicious payloads.  These streams can be automatically launched via Dictionary tags within an object (e.g., the OpenAction dictionary tag in object 1 above automatically launches the stream defined in object 2 as the document is opened).

The sample text in the diagram above illustrates a representative example of a PDF encoding (the stream contents in the sample were abbreviated for the sake of clarity and conciseness). Aside from encoded/compressed streams within the document, PDF files consist of printable text characters and can be viewed in a text editor. However, viewing a PDF in a text editor does not provide a cleartext view of the encoded/compressed streams. As a result, many analysts leverage tools such as pdf-parser.py, written by Didier Stevens, to view a comprehensive decode of a PDF file.
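To make the layout described above concrete, here is a minimal sketch that embeds a tiny PDF-like body in a Python string and lists its numbered objects with a regular expression. The sample text, the helper name list_objects and the regular expression are all illustrative assumptions (this is not the analyzed sample), and the OpenAction pointing straight at a stream mirrors the simplified description above. Real PDFs have many edge cases that a dedicated tool such as pdf-parser.py handles properly.

```python
import re

# Hypothetical, heavily abbreviated PDF text (NOT the analyzed sample).
SAMPLE_PDF = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 3 0 R /OpenAction 2 0 R >>
endobj
2 0 obj
<< /Length 11 >>
stream
hello world
endstream
endobj
trailer
<< /Root 1 0 R >>
%%EOF
"""

# Matches "N G obj ... endobj" and captures the object number and its body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(pdf_bytes):
    """Return {object_number: raw_object_body} for each numbered object."""
    return {int(m.group(1)): m.group(3).strip() for m in OBJ_RE.finditer(pdf_bytes)}


objects = list_objects(SAMPLE_PDF)
for number, body in objects.items():
    kind = "contains a stream" if b"stream" in body else "dictionary only"
    print(number, kind)
```

A sketch like this is only good for orientation; it ignores the Xref table entirely and simply scans the body for object boundaries.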

Now that we have seen a glimpse of what a simple PDF file encoding looks like, let’s start analyzing the real thing.  The concepts described above will make much more sense as we see them used in an actual sample.



Analysis Overview

The malware dropper that we are analyzing uses four distinct stages during its attack as shown in the diagram below.  Each stage is progressively more complex and ultimately builds towards the actual infection.

The first and second stages of the attack are very similar. Each stage simply decodes a JavaScript payload from within an object stream and launches it via a JavaScript eval() function. The primary difference between them is how the payload is decoded. The first stage uses a plain hex decoder, which is easy to analyze. The second stage uses custom decode logic, which is more difficult to analyze.

The third stage of the attack contains an embedded shellcode payload.  This stage exploits a buffer overflow in Adobe Reader via the Collab.collectEmailInfo() vulnerability in order to hijack the instruction pointer register (EIP) and run the shellcode.  This shellcode downloads malware from the Internet, saves it to disk and starts it as the last stage of the attack.

Why are so many stages used during this attack?  Why not simply skip the first three stages and auto-start stage four directly as the document is opened?  Why all the complexity?  The answer is simple… obfuscation.  The attacker is making it as difficult as possible to analyze and understand the code (manually or programmatically).  If we cannot understand it, we cannot make a conclusive determination that it is malicious.

Once stage four has run, the infection is complete. Let’s examine how each of these stages works…


Stage 1A: OpenAction Launches JavaScript Object Stream

In the first stage of the attack, an OpenAction is used to automatically start an object stream when the document is first opened, as shown in the PDF encoding below. The highlighted section of the PDF shows that object 1 contains an OpenAction which targets object 4 to be started automatically (4 0 R is a reference to object 4).

Object 4, as shown below, contains a PDF dictionary element (/JS 5 0 R) that references a JavaScript stream in object 5. The stream in object 5 contains the actual JavaScript that we need to analyze.

Looking at the object stream in object 5, as shown below, reveals a FlateDecode stream. Since it is compressed via the Deflate algorithm, it needs to be decompressed in order to view it properly. This can be achieved programmatically in Java via classes such as InflaterInputStream in java.util.zip, or by using open source tools as described earlier.
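As a minimal sketch of the same decompression step in Python, the standard zlib module can inflate the bytes found between the stream and endstream keywords. The helper name inflate_stream and the stand-in data are assumptions for illustration; the sample’s real stream bytes are not reproduced here.

```python
import zlib


def inflate_stream(raw):
    """Decompress the raw bytes of a stream that uses the /FlateDecode filter."""
    return zlib.decompress(raw)


# Hypothetical stand-in for a compressed JavaScript stream (not the sample's bytes).
original = b"var fnc = 'ev';"
compressed = zlib.compress(original)

print(inflate_stream(compressed).decode("ascii"))
```

If a stream declares several cascading filters, each one has to be undone in order; this sketch covers only the single FlateDecode case.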

The decompressed stream shown below reveals the full listing of the JavaScript code from the FlateDecode stream.  This is the code that is automatically started as the document is first opened.

Stage 1B: Hex-Decode Annotation Payload and Run It

The JavaScript run in stage one (shown above) is fairly simple.  Its only goal is to decode a follow-on JavaScript payload, and run it via the JavaScript eval() function.

The reason for additional code being placed in a follow-on payload is purely for obfuscation.  The malware author is attempting to make it as difficult as possible to analyze the malicious logic.  This first stage JavaScript is suspicious, not malicious.  The next stage will more likely be malicious…

This follow-on JavaScript payload is contained in an object stream referenced via a PDF annotation tag (Annot). We suspect this because we see references in lines 7 and 16 in the above code to Acrobat’s syncAnnotScan() and getAnnots() API functions. We then see the payload returned from getAnnots() being processed in what appears to be a decode loop (lines 22-25). The syncAnnotScan() and getAnnots() API functions are specifically used to access annotation tags in PDF documents.

Let’s walk through each of these steps one by one and see if there is a payload in the annotations…

Prior to grabbing a payload via annotations in the PDF document, there needs to be a guarantee that all annotations can be detected.  The syncAnnotScan() function shown below does just that.  It guarantees that all annotations have been scanned such that a complete list will be returned via getAnnots().

Once the annotations have been scanned, the annotations within the document can be accessed via the getAnnots() function. The code shown below gets all of the annotations on the first page (nPage: 0) of the document, as described in the Adobe documentation. This sample document consists of only one page, so all annotations are contained on the first page.

We see that object 3 contains the Annots tag as shown below.  This is the annotations definition that will be returned via the getAnnots() function.  In this sample, the annotation references two objects.  The first refers to object 6 (e.g., 6 0 R).  The second annotation refers to object 8 (e.g., 8 0 R).  Line 16 in the code shown above retrieves the object references and stores them into the variable pr as a two-element annotation array (element 0 -> 6 0 R, element 1 -> 8 0 R).  We need to remember the pr variable’s purpose as we continue our analysis.

The next step is to get the actual JavaScript payload via these annotations…

Line 17 in the code shown below accesses the annotation array and retrieves the Subject dictionary element from the object reference at index 1. Since pr[num] in the code below points to the second element (index 1) of the pr annotation array, we know from the annotation definition shown above (6 0 R 8 0 R) that it is referring to the Subject dictionary element from object 8. As a result, the payload in the object 8 Subject field is placed in the JavaScript variable sum. Our theory is that the sum variable now contains our encoded payload for the next stage.

If we look at object 8 in the PDF file as shown below, we see that its /Subj dictionary element is an object reference to object 9. Let’s go look at object 9 and see what we can find…

Object 9 as shown below contains a FlateDecode stream.  That’s good news.  This is most likely the payload that the JavaScript is trying to access.
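The chain of references just described (Annots array, then annotation object, then Subj reference) can be sketched as follows. The object bodies below are hypothetical reconstructions based only on the references named in the text, and the helper names annotation_refs and subject_ref are assumptions; the sample’s real dictionaries contain more entries.

```python
import re

# Hypothetical object bodies mirroring the references described in the text.
OBJECTS = {
    3: b"<< /Type /Page /Annots [6 0 R 8 0 R] >>",
    6: b"<< /Type /Annot /Subj 7 0 R >>",
    8: b"<< /Type /Annot /Subj 9 0 R >>",
}

# An indirect reference looks like "N G R"; capture the object number N.
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R\b")


def annotation_refs(page_body):
    """Return the object numbers listed in the page's /Annots array."""
    annots = re.search(rb"/Annots\s*\[(.*?)\]", page_body, re.DOTALL)
    return [int(n) for n in REF_RE.findall(annots.group(1))] if annots else []


def subject_ref(annot_body):
    """Return the object number referenced by /Subj, or None if absent."""
    match = re.search(rb"/Subj\s+(\d+)\s+\d+\s+R\b", annot_body)
    return int(match.group(1)) if match else None


pr = annotation_refs(OBJECTS[3])       # plays the role of the pr array
print(pr)
print(subject_ref(OBJECTS[pr[1]]))     # index 1 is the stage-one payload
```

Following the reference at index 1 leads to object 9, matching the manual walk above.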

If we decode/decompress the FlateDecode stream as shown below, it reveals that it contains a hex-encoded payload.  This is the actual value that gets set into the sum variable in the JavaScript listed above.  The sum=pr[num].subject code shown earlier, results in the hex value shown below being set into the sum variable.

This hex payload is then decoded via the lines shown below and placed into the buffer variable.  The individual hex values are converted to their corresponding ASCII equivalent via the JavaScript String.fromCharCode() function and added to a decoded buffer in lines 23-25.
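A minimal Python sketch of this kind of hex decode loop is shown below. It assumes the payload is a contiguous run of two-character hex pairs; the sample’s exact loop is only visible in the screenshots, and the function name hex_decode and the short test string are illustrative.

```python
def hex_decode(encoded):
    """Convert each two-character hex pair into the character it encodes."""
    return "".join(
        chr(int(encoded[i:i + 2], 16)) for i in range(0, len(encoded), 2)
    )


# Harmless illustrative input, not taken from the sample.
sample = "616c6572742831293b"
print(hex_decode(sample))
```

Here chr(int(pair, 16)) plays the same role that String.fromCharCode() plays in the JavaScript.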

Now that the payload is decoded, it can be executed in JavaScript via the eval() function. At first glance, we don’t see an eval() function in the code listing. Malware authors often obfuscate this function call so that it is more difficult to conclusively determine that the code is malicious. That is the case in this example.

Looking at line 30 below, we can see that the decoded payload buffer variable is passed into the app[fnc + 'l'](buffer) function. Analyzing the code at lines 2, 21 and 29, we can see that the fnc variable was built such that its value is currently set to 'eva'. As 'l' is added to the fnc variable as shown in line 30, the resulting effective code is app['eval'](buffer). This is an alternate way of invoking eval().
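The bracket-lookup trick can be mirrored in Python with a dictionary standing in for Acrobat’s app object. Everything here is an assumption for illustration: the way the fragments of fnc are split is invented, and the stand-in for eval only records its argument rather than running anything.

```python
captured = []

# Stand-in for Acrobat's app object; this "eval" only records its argument.
app = {"eval": captured.append}

fnc = "ev"     # fragments assembled in separate places (split is hypothetical)
fnc += "a"
buffer = "decoded stage-two script"

# Same shape as app[fnc + 'l'](buffer); the key resolves to "eval" at run time.
app[fnc + "l"](buffer)

print(fnc + "l")
print(captured)
```

Because the string "eval" never appears as a direct call, a simple text search for eval( would miss the invocation.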

The payload is launched on line 30 and the next stage of the attack begins…

Stage one is complete. In summary, a JavaScript object stream in object 5 (dereferenced via object 4) was auto-started via an OpenAction. This JavaScript used an Annots dictionary entry to find a hex-encoded payload in object stream 9 (dereferenced via object 8). This hex-encoded payload was decoded and run via the eval() function. Let’s go see what the hex-encoded payload does in stage 2…


Stage 2: Custom-Decode Annotation Payload and Run It

A full listing of the decoded JavaScript payload launched at the end of stage one is shown below.

At first glance, the listing is a bit overwhelming.  In reality, the code is very similar to the code from stage one.   It uses a PDF annotation to get a JavaScript payload.  It decodes the payload and runs it via eval().

The main differences between this stage and stage one are that a different annotation payload is retrieved, and this decode logic is customized rather than leveraging standard hex decode logic.  In fact, almost the entire code block shown is dedicated to decoding the payload.  The sdcK_To5s() function is primarily one big obfuscated decode loop.  One big, ugly, decode loop.

Let’s walk through this logic and see the differences…

The first difference is shown in the logic below.  The pr annotation array that was built in stage one is referenced in order to get the payload that will be used for this stage.  As you can see in lines 14 and 16 below, this time it references annotation array index 0 rather than 1.  Similar to stage one, it references the Subject dictionary element to get its payload – just from a different object reference this time.

If you remember from the Annots definition as shown below, the first element (index 0) is a reference to object 6.  Object 6 contains the payload that is used in this stage as opposed to object 8 that was used in stage one.

Similar to what we saw in the first stage, the object initially referenced directly in Annots ultimately points to a different object.  Object 6 shown below contains a Subj dictionary entry which references object 7.  Object 7 is ultimately our payload for this stage.

As shown below, object 7 contains a FlateDecode stream, which is most likely the payload for this stage.  Good… we are getting closer…

After we FlateDecode the stream, we can see the payload as shown below. This payload is obviously not a hex-encoded stream (we can see characters in the stream that fall outside the [0-9, A-F] range of valid hex characters). We also know it’s not a decimal or octal payload.

This payload may be trickier to decode than what we encountered in stage one.  The good news… since it is trickier to decode, it probably has a more valuable payload.

So, how are we going to decode this payload? Let’s let the malware logic do it for us…

One trick used by analysts is to replace the eval() function such that it prints the decoded results rather than executing them.  As we can see in line 96 below, the decoded payload (the attack) is ultimately launched via eval().  As we analyze the code, we need to be careful that this line never executes.

To decode the payload, we copy the sdcK_To5s() function (our big ugly decode loop) onto an isolated test machine and change it such that eval() behaves as a print() function.  The code shown below was left completely intact except for two changes.

The first change was to add the line of code eval=print.  This effectively is a reassignment such that eval() no longer works.  We are safe.  Rather than executing the payload, it will print it to the screen.

The second change was to declare the new variable subjectPayload and assign it the value of the object stream in object 7, the stage-two payload identified above (e.g., 222606805…). The code shown below abbreviates this value for conciseness in the screenshot. The actual program would obviously include the full value.
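The eval=print substitution has a direct analogue in Python, sketched below under clearly stated assumptions. The function decode_and_launch is a hypothetical stand-in for the untouched decode function (here it uses a simple hex decode rather than the sample’s custom logic), and the name eval is rebound to a recorder so the decoded text is captured rather than executed.

```python
def decode_and_launch(encoded):
    # Stand-in for the untouched decode function: decode, then call eval().
    decoded = bytes.fromhex(encoded).decode("ascii")
    return eval(decoded)


captured = []

# The only change lives OUTSIDE the function: rebind the name eval so the
# decoded text is recorded instead of executed. This deliberately shadows
# the built-in, which is normally bad practice but is the point of the trick.
eval = captured.append

decode_and_launch("616c6572742831293b")
print(captured)
```

As in the JavaScript case, the function body is left byte-for-byte intact, which matters for decoders that inspect their own source.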

Now we can safely decode our payload and see the results…


Running the decode script shown above yields the output shown below.  We see output that appears to be a JavaScript payload.  Also we can fairly easily see a shellcode payload (e.g., 90909090…) towards the bottom of the output.  We will discuss how this shellcode payload is used as we walk through stage three of the attack.

One detail we have not yet discussed is that the decode loop is “tamper-resistant”. The decoder uses the actual source code of the decode function, and the length of that source code, as key values in the decode algorithm. This is done via the arguments.callee logic in the original code block above. As a consequence, if the source code were altered (e.g., rewritten for readability), decoding would fail. Our use of the decode logic succeeded for one reason: all of our code changes were outside of the function, leaving the original function intact.
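To illustrate why editing the function breaks decoding, here is a toy model of a self-keyed decoder. It is not the sample’s algorithm: the key derivation (source length plus the sum of character codes, reduced to one byte), the XOR step and the placeholder source string are all assumptions chosen to keep the idea visible.

```python
# Placeholder text standing in for what arguments.callee.toString() would return.
DECODER_SOURCE = "function d(s){/* original decode loop text */}"


def derive_key(source):
    """Toy key: source length plus the sum of its character codes, as one byte."""
    return (len(source) + sum(ord(ch) for ch in source)) % 256


def xor_bytes(data, key):
    """XOR every byte with the key (the same call encodes and decodes)."""
    return bytes(b ^ key for b in data)


ciphertext = xor_bytes(b"stage three", derive_key(DECODER_SOURCE))

# Decoding with the untouched source recovers the payload.
print(xor_bytes(ciphertext, derive_key(DECODER_SOURCE)))

# "Cleaning up" the function (here: one added space) changes the key,
# so the same ciphertext no longer decodes to the original text.
tampered = DECODER_SOURCE.replace("{", "{ ")
print(xor_bytes(ciphertext, derive_key(tampered)))
```

In this toy scheme a single added space shifts the key by 33, which is enough to garble every byte of the output.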

As mentioned above, the resulting decoded JavaScript payload is ultimately executed via eval() as shown in line 96 below.

Stage two is complete.

In summary, the JavaScript started at the end of stage one used an Annots entry to find a custom-encoded payload from object stream 7 (dereferenced via object 6).  This payload was decoded using custom logic and run via the JavaScript eval() function.

We are getting closer to the actual exploit. Let’s continue onward…

Stage 3: Buffer Overflow and Launch ShellCode Payload

The full listing of the JavaScript payload decoded in stage two is listed below.  Unlike the previous two stages, this code appears to do more than just grab a JavaScript payload and launch it.  We can see obvious signs of shellcode payloads in lines 80-81.  This implies that the JavaScript is likely to try some sort of heap spray or buffer overflow in order to run this shellcode.

Let’s examine the code and see if we can figure out what it’s doing…

The first thing that catches our eye is the shellcode payload highlighted below.  Why do we think it’s a shellcode payload?  There are several clues.

First, the payload contains sequences of hex 90 characters. Hexadecimal 90 is outside the printable range of ASCII characters. As such, the payload is obviously not JavaScript. On the Intel x86 CPU family, hexadecimal 90 is the opcode for a NOP (a benign instruction that has no effect). NOPs are typically used in buffer overflow attacks as NOP slides. The NOP slide provides a margin of error when hijacking the instruction pointer register (EIP) to target a shellcode payload: the memory address that overwrites the EIP does not have to be exact. As long as that address lands within the NOP slide region, the series of NOP instructions will execute harmlessly until it eventually reaches (slides to) the actual machine code that follows. The payload looks like shellcode.
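From the analyst’s side, spotting a NOP slide can be as simple as looking for a long run of 0x90 bytes in a decoded buffer. The sketch below is one way to do that; the function name longest_nop_run and the test buffer are illustrative, and a long run of 0x90 is only a hint, not proof, that a buffer holds shellcode.

```python
def longest_nop_run(data):
    """Return (offset, length) of the longest run of 0x90 bytes in data."""
    best_offset, best_length = -1, 0
    run_start = None
    for index, byte in enumerate(data):
        if byte == 0x90:
            if run_start is None:
                run_start = index          # a new run begins here
            length = index - run_start + 1
            if length > best_length:
                best_offset, best_length = run_start, length
        else:
            run_start = None               # run broken by a non-NOP byte
    return best_offset, best_length


# Illustrative buffer: two filler bytes, a 16-byte NOP run, then other bytes.
buffer = b"\x41\x42" + b"\x90" * 16 + b"\xcc\x90\x90"
print(longest_nop_run(buffer))
```

When no 0x90 byte is present, the function returns (-1, 0).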

Second, the payload is unescaped (converted back to the actual bytes represented by the hexadecimal values) in line 82 shown below.  For the given payload, this would result in many non-printable characters.  It is unlikely that the unescaped payload is used for any benign display purposes. Instead, it is more likely to be shellcode.

Additional evidence that we are analyzing a buffer overflow attack is that we see signs of addresses typically used when overwriting the instruction pointer register (EIP) as highlighted below.

Memory that is allocated dynamically is stored on the heap as opposed to the stack.  A buffer overflow style attack typically involves dynamically allocating (“spraying”) the heap with large combinations of NOP slides and the actual shellcode payload.  This results in many pairs of NOP+shellcode chunks throughout the heap located at the higher memory addresses (e.g., 0x0a0a0a0a, 0x0c0c0c0c). The overflow attack will then attempt to exploit a vulnerability to overwrite the EIP with 0x0c0c0c0c such that the next instruction to execute points to a NOP slide within one of these chunks. When all works correctly, the exploited host application walks directly into the shellcode payload within the heap and starts executing it.

To overwrite the EIP, typically a long series of 0x0c0c0c0c bytes are strung together and passed into an exploitable section of code (code that fails to bounds check memory copies – allowing an overflow into the EIP register).

So… do we see any evidence that the 0x0c0c0c0c pattern is heavily duplicated and passed in to an exploitable section of code? The answer is yes. Lines 96 and 99-101 below show that 0c0c0c0c is unescaped and then duplicated into a long sequence of the pattern repeated many times. The highlighted section below then shows that this repeated pattern of 0c0c0c0c is passed in the msg argument to Collab.collectEmailInfo().

Finally, we have found our exploit! Collab.collectEmailInfo() is a JavaScript API in Adobe Reader with a well-known vulnerability that can be exploited via a buffer overflow to run arbitrary shellcode. This is exactly what is happening in our example. Even if the vulnerability were not already known and documented, we would be extremely suspicious of Collab.collectEmailInfo() based on our analysis.

We have seen enough to theorize that we are dealing with a shellcode payload and to move on to the next stage of our analysis. We see obvious signs of a shellcode payload (e.g., 9090…). We see obvious signs of commonly used EIP overflow values (e.g., 0c0c0c0c). We see the EIP overflow value being duplicated into a long string and passed into a function with a well-known vulnerability.

Next step: let’s examine the shellcode payload and see what it does…

Stage 4: Download Malware and Infect Target Machine

All of the steps described in stages 1-3 have led up to this moment of attack. The entire purpose of all of the previous stages was to run the small set of machine code shown below. Think about it… for this payload to be successful, it needs to load networking libraries, establish a network connection to a remote malware-hosting site, download malware, save it to disk and ultimately run it. All of this logic needs to be contained within this small shellcode payload.

Let’s see what it contains…

The first step in analyzing the payload is to convert it from its current hexadecimal escape sequences into a binary payload. This is done in JavaScript via the unescape function as described earlier. Basically, the two hex bytes in each escaped pair are swapped (x86 is little-endian, so the low byte is stored first) and each byte is converted to its raw value, which displays as the equivalent ASCII character where one exists. The resulting conversion is shown below.
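A minimal Python sketch of this conversion is shown below. It assumes the payload uses the %uXXXX escape form commonly seen in this style of exploit (the article’s screenshots are not reproduced here), and the function name unescape_to_bytes and the short test string are illustrative.

```python
import re
import struct


def unescape_to_bytes(escaped):
    """Turn '%uXXXX' sequences into raw bytes, low byte first (little-endian)."""
    out = bytearray()
    for code in re.findall(r"%u([0-9a-fA-F]{4})", escaped):
        # "<H" packs a 16-bit unsigned value in little-endian byte order,
        # which is what produces the apparent byte swap.
        out += struct.pack("<H", int(code, 16))
    return bytes(out)


# %u4241 becomes the bytes 0x41 0x42 ("AB"): the pair is swapped.
print(unescape_to_bytes("%u9090%u4241").hex())
```

The resulting bytes can be written to a file for further analysis with other tools. This sketch silently skips anything that is not a %uXXXX sequence.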

Already, the payload is interesting.  Looking at the printable strings in the payload above, we can see a URL artifact indicating a likely malware-hosting site.  Remember that, in order for the shellcode payload to be successful, it must establish a network connection to a remote malware-hosting site.  Most likely, this URL is the malware-hosting site.
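Pulling printable strings out of a binary payload, in the spirit of the Unix strings utility, is easy to sketch in Python. The function name printable_strings, the minimum length of six and the test blob (which uses a reserved .invalid host name, not the sample’s URL) are all illustrative assumptions.

```python
import re


def printable_strings(data, min_length=6):
    """Return runs of printable ASCII that are at least min_length bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Illustrative blob: a few opcode-like bytes surrounding a placeholder URL.
blob = b"\x90\x90\xeb\x10http://example.invalid/a.exe\x00\xff\xd0"
print(printable_strings(blob))
```

Strings recovered this way are leads rather than conclusions; shellcode can just as easily store its URL in an encoded form that this approach would not reveal.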

To get a definitive answer as to what the shellcode does, we have several options. The best (although most difficult) option is to paste this shellcode into a debugger (e.g., OllyDbg) and step through it as it executes. This would need to be done on the operating system targeted by the shellcode (e.g., Windows). As a result, there is a risk of infection during the debug process, and precautions need to be taken (e.g., work on an isolated machine/network in a virtualized environment). Also, let’s face it… stepping through shellcode in a debugger is not the easiest task.

An alternate option is to leverage a tool such as sctest, which is part of the libemu package (the analysis shown below uses sctest preinstalled on REMnux). This tool provides basic x86 emulation and runs on Linux, so the Windows-targeted shellcode is emulated rather than executed natively, which greatly reduces the risk of infection. The tool provides a quick and easy method to view a system call trace along with the arguments and return values for each call. It obviously isn’t as robust as a full-blown debug session, but most of the time it provides enough information to understand the nature of the attack.

The binary payload shown above was placed in a file named payload.raw and redirected into sctest for analysis. The screenshot below shows the results of running the unescaped payload through an sctest session.

As you can see from the results above, the shellcode payload is conclusively malicious. The shellcode performs the following steps:

  • Calls LoadLibrary to dynamically load the urlmon library.  This library contains many useful networking functions.
  • Calls URLDownloadToFile to download malware from the hosting site.  The malware is saved locally to c:\tmp\cFoN.exe
  • Calls WinExec to launch the malware from c:\tmp\cFoN.exe
  • Calls ExitProcess to terminate the launcher



Ultimately, this sample exploited an Adobe Collab.collectEmailInfo() buffer overflow vulnerability in order to execute a shellcode payload that was contained in one of the JavaScript streams within the PDF. Once executed, the shellcode connected to a remote hosting site to download, install and run malware. The various stages preceding the buffer overflow exploit were merely layers of obfuscation to make detection of the buffer overflow more difficult.

Each stage of the attack was slightly more sophisticated than the previous one. Stage one merely launched a hex-encoded JavaScript payload extracted via a PDF Annotation reference. Similarly, stage two extracted a JavaScript payload from a PDF Annotation reference and launched it. However, the payload in stage two was decoded far more cleverly, via custom “tamper-resistant” logic. Stage three contained the shellcode payload and exploited the buffer overflow. Stage four reached out over the Internet and performed the actual download (drop) of the malware.

Lessons Learned

  • Keep all web browsers and Adobe packages up to date with the latest security patches
  • Do not open documents from untrusted sources

Thanks for reading… You can also connect with me on Twitter at @kd_cybersec.

