Evaluating Forensic Tools: Beyond the GUI vs Text Flame War

One of the good old flamewars that comes up every now and again is which category of tools is “better”: graphical, console (i.e. interactive text-based), or command-line?

Each interface mechanism has its pros and cons, and when evaluating a tool, the interface can affect its usability. For instance, displaying certain types of information (e.g. all of the picture files in a specific directory) naturally lends itself to a graphical environment. On the other hand, it’s important to me to be able to control the tool from the keyboard (reaching for the mouse can often slow you down). The idea that graphical tools “waste CPU cycles” is pretty moot given the speed of current processors, and much forensic work focuses on data sifting and analysis, which is heavily tied to I/O throughput.

Text-based tools, however, do have the issue that paging through large chunks of data can be somewhat tedious (this is where I like using the scroll wheel, the ever-cool two-finger scroll on a MacBook, or even the “less” command).

To me, there are more important issues than the type of interface (graphical or text-based). Specifically, some of the things I focus on are:

  • What can the tool do?
  • What are the limitations of the tool?
  • How easy is it to automate the tool (getting data into the tool, controlling execution of the tool, and getting data out of the tool)?

The first two items really focus on the analysis capabilities of the tool (which can be “make-or-break” decisions by themselves), and the last item (really three items rolled into one) focuses on the automation capabilities of the tool.

The automation capabilities are often important because no single tool does everything, and an analyst’s toolkit is composed of a series of tools that have differing capabilities. Being able to automate the different tools in your toolkit (and being able to transfer data between multiple tools) is often a huge time saver.

Many tools have built-in scripting capabilities: ProDiscover has ProScript, EnCase has EnScript, etc. Command-line tools can typically be “wrapped” by another language. Autopsy, for example, is a Perl script that wraps the various Sleuthkit tools. While it is useful to be able to automate the execution of a tool, it’s also useful to be able to automate the import and export of data. Being able to programmatically extract the results of one tool and feed them as input to another (or process them further) allows you to combine multiple tools in your toolkit.
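
As a rough illustration of what “wrapping” a command-line tool can look like, here is a minimal Python sketch that drives Sleuthkit’s fls and hands each line of output to whatever the next step in your toolkit happens to be. The flags shown are standard fls options, but the (deliberately naive) handling of the output is an assumption; adjust it to your version’s output format.

    # Minimal sketch: wrap a command-line forensic tool so its output can
    # be fed to the next tool in the kit. Assumes Sleuthkit's fls is on
    # the PATH; output handling is intentionally simplistic.
    import subprocess
    import sys

    def list_files(image_path):
        # fls -r -p: list file system entries recursively, with full paths
        proc = subprocess.run(["fls", "-r", "-p", image_path],
                              capture_output=True, text=True, check=True)
        return proc.stdout.splitlines()

    if __name__ == "__main__":
        for entry in list_files(sys.argv[1]):
            # Each line describes one file system entry; from here it can be
            # filtered, loaded into a database, passed to another tool, etc.
            print(entry)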

So, to me, when evaluating a forensic tool, the capabilities (and limitations) and the ease of automation are often more important than the interface.

Copying 1s and 0s

I’ve been asked a few times over the past weeks about making multiple copies of disk images. Specifically, if I were to make a copy of a copy of a disk image, would the “quality” degrade? The short answer is no. It boils down to copying information in a digital format (as opposed to an analog format). Let’s say I write down the following string of ones and zeros on a 3×5 card:

101101

Now, if I take out another 3×5 card and copy over (by hand) the same string of ones and zeros, I now have two cards with the string:

101101

Finally, if I took out a third 3×5 card and copied yet again (by hand) the same string (from the second card) I would have three cards with the string:

101101

Assuming I had good handwriting, then each copy would be legible and have the same string of ones and zeros. I could continue on indefinitely in this manner, and each time the new 3×5 card would not suffer in “quality”.

However, instead of copying (by hand) the string of ones and zeros, I could have photocopied the first 3×5 card. This would yield the original 3×5 card plus a (slightly) degraded copy. If I then photocopied that copy, I would get a further degraded (third) copy.

So, copying images (or any digital information) verbatim (i.e. using a lossless process) doesn’t degrade the quality of the information: read a “one” from the source, write a “one” to the destination. Where you might run into trouble is if the source or destination media has (physical) errors. So it’s always a good idea to verify your media before imaging. It’s also a good idea to use a tool that tells you about (and, even better, logs) any errors it encounters during the imaging process.
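
As a quick, concrete way to confirm that a verbatim copy really is identical, the following Python sketch compares cryptographic hashes of a source image and its copy; if the hashes match, the copy is bit-for-bit identical. The file names are placeholders.

    # Sketch: verify that a copy of an image is bit-for-bit identical to
    # the source by comparing SHA-256 hashes. File names are placeholders.
    import hashlib

    def sha256_of(path, chunk_size=1024 * 1024):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    if sha256_of("evidence.dd") == sha256_of("evidence_copy.dd"):
        print("copy verified: hashes match")
    else:
        print("MISMATCH: the copy is not identical to the source")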

Exhibits from deposition of RIAA’s expert available online

As an update to the previous post, the exhibits from the deposition are available at:

Recording Industry vs The People blog.

Transcript of deposition of RIAA’s expert available online

In UMG v. Lindor, the RIAA’s expert was deposed on February 23rd 2007. A PDF copy of the transcript is available at ilrweb.com.

Source: Recording Industry vs The People blog.

Planting evidence

The other day, Dimitris left a comment asking about how to determine if someone has altered the BIOS clock and placed a new file on the file system. In essence, this is “planting evidence”.

So, what might the side effects of this type of activity be? It’s difficult (if not impossible) to give an exact answer that fits every case, especially since there are different ways of planting evidence in this scenario. For example, the suspect could:

  1. Power down the system
  2. Change the BIOS clock
  3. Power the system back up
  4. Place the file on the system
  5. Power the system back down
  6. Fix the BIOS clock

Alternatively, the suspect could also:

  1. Yank the power cord
  2. Update the BIOS clock
  3. Boot up with a bootable CD
  4. Place the file on the system
  5. Yank the power cord (again)
  6. Fix the BIOS clock

It’s also possible to change the clock while the system is running. Again, it’s a different scenario with different possible side effects. Really, this needs to be evaluated on a case-by-case basis. However, the fundamental idea is to look for evidence of events that are somehow (directly or indirectly) tied to time, and see if you can find inconsistencies (a “glitch in the matrix”, if you will). The evidence can be physical as well as digital. For example, if the “victim” was having lunch with the CEO and CIO at the time the document was supposedly placed on the system, then the “victim” likely has a good alibi.

For digital evidence, there are three basic aspects of computing devices to consider:

  • the code they run (e.g. programs such as anti-virus software),
  • the management code (e.g. operating system related aspects), and
  • the communication between computing devices (e.g. the network).

Dimitris mentioned the event log, so we’ll start with operating system related aspects (which fall under the category of host-based forensics). There are a number of operating-system-specific events that are tied to time. I suspect the event log is often referred to because it is difficult to modify. Each time the event log service is started and stopped (when a Windows machine is booted up and shut down), an entry is made in the event log. Another time-related artifact on a Windows system is the System Restore functionality. If a restore point had been created while the suspect was planting evidence, you may find an additional restore point with time-related events from the future.
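
To make the “look for inconsistencies” idea a little more concrete, here is a small, hypothetical Python sketch. It assumes you have already exported event log records to a CSV file with a record number and a generated-time column (the column names and time format are assumptions, not a standard); records whose timestamps run backwards relative to their record numbers get flagged for a closer look.

    # Hypothetical sketch: flag event records whose timestamps run backwards
    # relative to their record numbers. Assumes an exported CSV with
    # "RecordNumber" and "TimeGenerated" columns (names/format are assumptions).
    import csv
    from datetime import datetime

    def find_time_anomalies(csv_path, time_format="%Y-%m-%d %H:%M:%S"):
        anomalies = []
        previous = None
        with open(csv_path, newline="") as f:
            rows = sorted(csv.DictReader(f), key=lambda r: int(r["RecordNumber"]))
        for row in rows:
            current = datetime.strptime(row["TimeGenerated"], time_format)
            if previous is not None and current < previous:
                # A later record carries an earlier timestamp: the clock
                # appears to have moved backwards between these events.
                anomalies.append(row)
            previous = current
        return anomalies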

Another aspect to examine would be the programs that are run on the system, which falls under the category of code forensics. For instance, if anti-virus software were configured to run at boot up, examining the anti-virus log might show time-related anomalies. Under the category of code forensics, you might also want to examine the file that is purported to have been “planted”. For example, Microsoft Word documents typically contain metadata such as the last time the file was opened, by whom, where the file has been, etc. If the suspect were to boot up the system with a bootable CD and place a document that way, examining the suspect file(s) may be one of the more fruitful options.

Don’t forget about the communication between computing devices, which falls under the category of network forensics. For example, if the system was booted up, did it grab an IP address via DHCP? If so, the DHCP server may have logs of this activity. Other common footprints of a system on a network include checking for system updates (e.g. with an update server), NetBIOS announcements, synchronizing time with a time server, etc. This alone can be a good reason to pull logs from network devices during an incident: even if you are sure the network devices weren’t directly involved in the incident, their log files could still contain useful information.

Now, I’m sure there are some folks who will say “well, this can all be faked or modified”. Yes, that’s true; if it’s digital, at some point it can be overwritten. First, realize that I didn’t intend for this to be an all-inclusive checklist of things to look for. Second, especially if the system was booted using the host operating system, there are a myriad of actions that take place, and accounting for every single one of them can be a daunting task. I’m not saying that it’s impossible, just that it is likely to be very difficult. In essence, it comes down to how well the suspect covered their tracks, and how thoroughly you are able to examine the system.

At this point, I am interested in hearing from others, where might they look? For instance, the registry could hold useful information. Thoughts?

Caught in the act…

I had lunch at a lounge today (I’m friends with the owners, and they have free WiFi), and when I went to pay my bill, I had an interesting surprise. The waitress looked at her computer screen and said “Something’s happening”. Well, she’s not the most computer literate, so I took a look at the screen, and sure enough, someone was editing a text-based configuration file (using DOS edit) that contained, amongst other things, prices for the various meal items at the restaurant. A quick call to the owner confirmed that this WAS the company that has a service contract with the restaurant to maintain their restaurant software. Still, for a moment I was wondering if this was one of those rare few times you get to watch an attacker in action. Well, not this time; perhaps next time.

How digital forensics relates to computing

A lot of people are aware that there is some inherent connection between digital forensics and computing. I’m going to attempt to explain my understanding of how the two relate. However, before we dive into digital forensics, we should clear up some misconceptions about what computing is (and perhaps what it is not).

Ask yourself the question “What is computing?” When I ask folks this question (or some variant of it), I get responses ranging from programming to abstract definitions about Turing machines. Well, it turns out that in 1989 some folks over at the ACM came up with a definition that works out quite well. The final report, titled “Computing as a Discipline”, provides a short definition:

The discipline of computing is the systematic study of algorithmic processes that describe and transform information … The fundamental question underlying all of computing is, “What can be (efficiently) automated?”

This definition can be a bit abstract for those who aren’t already familiar with computing. In essence, the term “algorithmic processes” means algorithms. That’s right, algorithms are at the heart of computing. A (well-defined) algorithm is essentially a finite set of clearly articulated steps to accomplish some task. The task the algorithm is trying to accomplish can vary. So computing is about algorithms whose tasks are to describe and transform information.

When we implement an algorithm on a computing device, we have a computer program. So a computer program is really just the implementation of some algorithm for a specific computing device. The computing device could be a physical machine (e.g. an Intel architecture) or an abstract model (e.g. a Turing machine). When we implement an algorithm for a specific computing device, we’re really just translating the algorithm into a form the computing device can understand. To make this more concrete, take Microsoft Word as an example. Microsoft Word is a computer program: a slew of computer instructions encoded in a specific format. The computer instructions tell a computing device (e.g. the processor) what to do with information (the contents of the Word document). The computer instructions are a finite set of clearly articulated steps (described in a format the processor understands) to accomplish some task (editing the Word document).

There is one other concept to deal with before focusing on digital forensics, and that is how algorithms work with information. In order for an algorithm to describe and transform information, the information has to be encoded in some manner. For example, the letter “A” can be encoded (in ASCII) as the number 0x41 (65). The number 0x41 can then be represented in binary as 01000001. This binary number can then be encoded as the different positions of magnets and stored on a hard disk. Implicit in the structure of the algorithm is how it decodes this representation of information. This means that given just the raw encoding of information (e.g. a stream of bits), we don’t know what information is represented; we still need to understand (to some degree) how the information is used by the algorithm. I blogged about this a bit in a previous post, “Information Context (a.k.a. Code/Data Duality)“.
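
To see that encoding chain at work, here is what the same transformations look like at a Python prompt:

    >>> ord("A")                 # the ASCII value of "A"
    65
    >>> hex(ord("A"))            # the same value in hexadecimal
    '0x41'
    >>> format(ord("A"), "08b")  # and as eight binary digits
    '01000001'
    >>> chr(0b01000001)          # decoding the bits back into a character
    'A'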

So how does this relate to digital forensics? Simple: digital forensics is the application of knowledge of various aspects of computing to answer legal questions. It’s been common (in practice) to extend the definition of digital forensics to answer certain types of non-legal questions (e.g. policy violations in a corporate setting).

Think for a moment about what we do in digital forensics:

  • Collection of digital evidence: We collect the representation of information from computing devices.
  • Preservation of digital evidence: We take steps to minimize the alteration of the information we collect from computing devices.
  • Analysis of digital evidence: We apply our understanding of computer programs (algorithmic processes) to interpret and understand the information we collected. We then use our interpretation and understanding of the information to arrive at conclusions, using deductive and inductive logic.
  • Reporting: We relate our findings of the analysis of information collected from computing devices, to others.

Metadata can also be explained in terms of computing. Looking back at the definition for the discipline of computing, realize there are two general categories of information:

  • information that gets described and transformed by the algorithm
  • auxiliary information used by the algorithm when the steps of the algorithm are carried out

The former (information that gets described and transformed) can be called “content”, while the latter (auxiliary information used by the algorithm when executed) can be called “metadata”. Viewed from this perspective, metadata is particular to a specific algorithm (computer program), and what is content to one algorithm could be metadata to another.

Again, an example can help make this a bit clearer. Let’s go back to our Microsoft Word document. From the perspective of Microsoft Word, the content would be the text the user typed. The metadata would be the font information and attributes, revision history, etc. So, to Microsoft Word, the document contains both content and metadata. However, from the perspective of the file system, the Word document is the content, and things such as the location of the file, security attributes, etc. are all metadata. So what Microsoft Word considers to be metadata and content is just content to the file system.
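
A small Python sketch can illustrate the file system’s side of this split: the bytes of the document (everything Word sees, content and metadata alike) are read as plain content, while os.stat() returns what the file system tracks about the file. The path is a placeholder.

    # Sketch: the same file viewed two ways. The path is a placeholder.
    import os

    path = "report.doc"

    with open(path, "rb") as f:
        content = f.read()      # to the file system, all of this is content

    st = os.stat(path)          # file system metadata about the file
    print("size in bytes:", st.st_size)
    print("last modified:", st.st_mtime)
    print("last accessed:", st.st_atime)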

Hopefully this helps explain what computing is, and how digital forensics relates.

The basics of how programs are compiled and executed

Well, the post “The basics of how digital forensics tools work” seemed to be fairly popular, even earning a place on Digg. This post focuses on the basics of how a program gets compiled and loaded into memory when it is executed. It’s useful background for code analysis (reverse engineering), and is aimed at those who aren’t already familiar with compilers. The basic reasoning is that if you understand (at least at a high level) how compilers work, it will help when analyzing compiled programs. So, if you’re already familiar with code analysis, this post isn’t really aimed at you. If, however, you’re new to the field of reverse engineering (specifically code analysis), this post is for you.

Compiling is program transformation

From an abstract perspective, compiling a program is really just transforming the program from one language into another. The “original” language that the program is written in is commonly called the source language (e.g. C, C++, Python, Java, Pascal, etc.). The program as it is written in the source language is called the source code. The “destination” language, the language that the program is translated into, is commonly called the target language. So compiling a program is essentially translating it from the source language to the target language.

The “typical” notion of compiling a program is transforming it from a higher-level language (e.g. C, C++, Visual Basic, etc.) into an executable file (PE, ELF, etc.). In this case the source language is the higher-level language, and the target language is machine code (the byte code representation of assembly). Realize, however, that going from source code to an executable file involves more than just transforming source code into machine code. When you run an executable file, the operating system needs to set up an environment (process space) for the code (contained in the executable file) to run inside of. For instance, the operating system needs to know what external libraries will be used, what parts of memory should be marked executable (i.e. can contain directly executable machine code), as well as where in memory to start executing code. Two additional types of programs, linkers and loaders, accomplish these tasks.
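
As a rough illustration of those stages, the following Python sketch drives a typical toolchain (gcc) one step at a time; it assumes gcc and a hello.c in the current directory. The -S and -c flags are the standard “stop after this stage” switches.

    # Sketch: the classic build pipeline, one stage at a time.
    # Assumes gcc and a hello.c in the current directory.
    import subprocess

    def run(cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["gcc", "-S", "hello.c", "-o", "hello.s"])  # source -> assembly (compiler proper)
    run(["gcc", "-c", "hello.s", "-o", "hello.o"])  # assembly -> object code (assembler)
    run(["gcc", "hello.o", "-o", "hello"])          # object code -> executable (linker)
    # The loader's work happens later, at run time, when the OS executes ./hello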

Compiler front and back ends

Typically a compiler is composed of two primary components: the front end and the back end. The front end typically takes in the source code, analyzes its structure (doing things such as checking for errors), and creates an intermediate representation of the source code (suitable for the back end). The back end takes the output of the front end (the intermediate representation), optionally performs optimization, and translates the intermediate representation into the target language. When compiling a program, it is common for a compiler’s back end to generate human-readable assembly code (mnemonics) and then invoke an assembler to translate the assembly code into its byte code representation, which is suitable for a processor.

Realize, it’s not an “absolute” requirement that a compiler be divided into front and back ends. It is certainly possible to create a compiler that translates directly from source code to the target language. There are, however, benefits to the front/back end division, such as reuse, ease of development, etc.

Linkers and Loaders

The compiler took our source code and translated it into executable machine instructions, but we don’t yet have an executable file, just one or more files that contain executable code and data. These files are typically called object code, and in many instances they aren’t suitable to stand on their own. There are (at least) three high-level tasks that still need to be performed:

  1. We still need to handle referencing dependencies, such as variables and functions (possibly in external code libraries.) This is called symbol resolution.
  2. We still need to arrange the object code into a single file, making sure separate pieces do not overlap, and adjusting code (as necessary) to reflect the new locations. This is called relocation.
  3. When we execute the program, we still need to set up an environment for it to run in, as well as load the code from disk into RAM. This is called program loading.

Conceptually, the line between linkers and loaders tends to blur near the middle. Linkers tend to focus on the first item (symbol resolution) while loaders tend to focus on the third item (program loading). The second item (relocation) can be handled by either a linker or loader, or even both. While linkers and loaders are often separate programs, there do exist single linker-loader programs which combine the functionality.

Linkers

The primary job of a linker is symbol resolution, that is, resolving references to entities in the code. For example, a linker might be responsible for replacing references to the variable X with a memory address. The output from a linker (at compile time) typically includes an executable file, a map of the different components in the executable file (which facilitates future linking and loading), and (optionally) debugging information. The different components that a linker generates don’t always have to be separate files; they could all be contained in different parts of the executable file.

Sometimes you’ll hear references to statically or dynamically linked programs. Both of these refer to how the different pieces of object code are linked together at compile time (i.e. prior to program loading). Statically linked programs contain all of the object code they need in the executable file. Dynamically linked programs don’t contain all of the object code they need; instead they contain enough information so that, at a later time, the needed object code (and symbols) can be found and made accessible. Since statically linked programs contain all the object code and information they need, they tend to be larger.
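
A quick way to see the difference is to build the same trivial program both ways and compare the results. The sketch below assumes gcc, a hello.c, and static versions of the C library are available.

    # Sketch: contrast dynamic and static linking of the same program.
    # Assumes gcc, hello.c, and static libraries are available.
    import os
    import subprocess

    subprocess.run(["gcc", "hello.c", "-o", "hello_dynamic"], check=True)
    subprocess.run(["gcc", "-static", "hello.c", "-o", "hello_static"], check=True)

    for exe in ("hello_dynamic", "hello_static"):
        print(exe, os.path.getsize(exe), "bytes")
    # The statically linked binary carries its library code with it, so it is
    # noticeably larger; running "ldd hello_dynamic" would list the shared
    # libraries the dynamic linker must locate at load time.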

There is another interesting aspect of linkers: there are both linkers that work at compile time and linkers that perform symbol resolution during program loading, or even at run time. Linkers that perform their work at compile time are called compile time linkers, while linkers that perform their work at load and run time are called dynamic linkers. Don’t confuse dynamic linkers with dynamically linked executable files. The information in a dynamically linked executable is used by a dynamic linker. One is your code (the dynamically linked executable); the other helps your code run properly (the dynamic linker).

Loaders

There are two primary functions typically assigned to loaders: creating and configuring an environment for your program to execute in, and loading your program into that environment (which includes starting it). During the loading process, the dynamic linker may come into play to help resolve symbols in dynamically linked programs.

When creating and configuring the environment, the loader needs information that was generated by the compile time linker, such as a map of which sections of memory should be marked as executable (i.e. can contain directly executable code), as well as where various pieces of code and data should reside in memory. When loading a dynamically linked program into memory, the loader also needs to load the libraries into the process environment. This loading can happen when the environment is first created and configured, or at run time.

In the process of creating and configuring the environment, the loader transfers the code (and initialized data) from the executable file on disk into the environment. After the code and data have been transferred into memory (and any necessary modifications for load-time relocation have been made), the code is started. In essence, the way the program is started is to tell the processor to execute the first instruction of the code (in memory). The address of that first instruction is called the entry point, and is typically set by the compile time linker.
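
For the curious, the entry point is easy to see for yourself. Here is a minimal Python sketch that reads it straight out of an ELF header (little-endian files only; the path is just an example):

    # Sketch: read the entry point (e_entry) from an ELF header.
    # Handles little-endian files only; the path is just an example.
    import struct

    def elf_entry_point(path):
        with open(path, "rb") as f:
            header = f.read(32)
        assert header[:4] == b"\x7fELF", "not an ELF file"
        if header[4] == 2:                                   # ELFCLASS64
            return struct.unpack_from("<Q", header, 24)[0]   # 8-byte e_entry at offset 24
        return struct.unpack_from("<I", header, 24)[0]       # 4-byte e_entry (32-bit ELF)

    print(hex(elf_entry_point("/bin/ls")))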

Further reading

Well, that about wraps it up for this introduction. There are some good books out there on compilers, linkers, and loaders. The classic book on compilers is Compilers: Principles, Techniques, and Tools. I have the first edition, and it is a bit heavy on the computer science theory (although I just ordered the second edition, which was published in August 2006). The classic book on linkers and loaders is Linkers and Loaders. It too is a bit more abstract, but considerably lighter on the theory. If you’re examining code in a Windows environment, I’d also recommend reading (at least parts of) Microsoft Windows Internals, Fourth Edition.

Self replicating software – Part 4 – The difference between worms and viruses

This is the fourth installment in the series on self-replicating software. This post deals with worms (a subset of computer viruses).

Briefly, a computer virus is a program that infects other programs with an optionally mutated copy of itself. This is the basic definition that Fred Cohen (the “father” of computer viruses) used in “Computer Viruses – Theory and Experiments.” If you look at the previous posts in this category, we examined how/why viruses can exist by way of the recursion theorem, as well as a few other methods.

A computer worm is a virus that spreads, where each new infection continues spreading without the need for human intervention (e.g. loading and executing the newly infected file). In essence, a computer worm is a virus that, after infection, starts the newly created (optionally mutated) copy of itself. Cohen states this (and a more formal definition) in “A Formal Definition of Computer Worms and Some Related Results.”

Since computer worms are a subset of viruses, many of the same theories apply, including applications of the recursion theorem. What is interesting about computer worms is the potential to spread very quickly due to their inherent automation.

Realize that this definition of a computer worm focuses on the spreading behavior of the malicious code, not the method that is used for spreading. This leads us to some interesting anomalies with different definitions of computer worms. I’ve found the following definitions of computer worms used at various places:

  1. Worms are programs that replicate themselves from system to system without the use of a host file. This is in contrast to viruses, which require the spreading of an infected host file. (Symantec, Wikipedia, Cisco, et al.)
  2. A worm is a small piece of software that uses computer networks and security holes to replicate itself. (HowStuffWorks)
  3. A worm self-propagates (Common response I’ve heard from various security folks)

The first definition is used by a number of folks, including some big vendors. If you look at Cohen’s definition of viruses, there is no requirement that the victim program (the one that gets infected) exist. If the victim program doesn’t exist, then the mutated version of the virus is a standalone program. If that program is started by a human then it’s a virus; if it is started automatically then it’s a worm. Think of a virus that comes in the form of an infected executable (i.e. it has a host file) and then “drops” a copy of itself as a standalone. Another possible scenario would be a standalone program that infects another file. By the first definition, would these be classified as viruses or worms? (Hint: the definition doesn’t cover this type of scenario.)

The second definition basically requires that a worm spread between two or more computer systems. Again, per Cohen, this isn’t a requirement. A worm can spread between processes on the same machine. The worm hasn’t crossed into another machine, yet the code is still propagating without the need for human intervention.

The last definition is a bit ambiguous, which is why I tend to avoid it. The ambiguity comes from the fact that “self-propagating” doesn’t say anything about human intervention. Under one interpretation, a virus is self-propagating in that the viral code is what copies itself (i.e. a user doesn’t need to copy a file around). Under another interpretation, only a worm is self-propagating, since it not only propagates but continues to propagate itself without intervention.

I recommend reading the Cohen papers mentioned earlier. They’re a bit heavy on the math if you aren’t a computer science / math type, although they do explain what is going on.

Two tools to help debug shellcode

Here are two small tools to help debug/analyze shellcode. The goal of both tools is to provide an executable environment for the shellcode. Shellcode is usually intended to run in the context of a running process, and by itself lacks the environment typically provided by an executable.

The first tool, make_loader.py, is a Python script which takes the name of a file containing shellcode and outputs a compilable C file with the shellcode embedded. If you compile the output, the resulting executable runs the shellcode.

The second tool, run_shellcode, is a C program (you have to compile it) which, at run time, loads shellcode from disk into memory (and then transfers execution to the shellcode). A neat feature of this tool is that it can be completely controlled by a configuration file, meaning you only need to load the file once into a debugger; you can examine new shellcode by changing the configuration file.

Both tools allow you to specify whether you want to automatically trap to the debugger (typically via an int 3), and to skip over a number of bytes in the file that contains the shellcode. The automatic debugger trap is nice so you don’t always have to explicitly set a breakpoint. The skip is nice if the shellcode doesn’t sit at the start of the file and you don’t want to bother stripping out the unnecessary bytes. Think Wireshark “Follow TCP Stream” with a “Save”.
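
This isn’t the actual make_loader.py, just a minimal sketch of the idea behind it: read raw shellcode from a file and emit a compilable C program that embeds the bytes and jumps to them. The int 3 and skip options described above are omitted, and on modern systems you may need to mark the memory executable (e.g. with mprotect or a compiler-specific section attribute) before the call.

    # Sketch of the idea behind a shellcode loader generator (not the
    # actual make_loader.py): read raw shellcode, emit a C program that
    # embeds it and transfers execution to it.
    import sys

    TEMPLATE = """\
    unsigned char shellcode[] = "%s";

    int main(void)
    {
        ((void (*)(void))shellcode)();   /* jump into the shellcode */
        return 0;
    }
    """

    with open(sys.argv[1], "rb") as f:
        data = f.read()

    escaped = "".join("\\x%02x" % byte for byte in data)
    print(TEMPLATE % escaped)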

An alternative to these tools is shellcode2exe, although I didn’t feel like installing PHP (and a web server).

Here are the files….
run_shellcode.c 1.0 make_loader.py 1.0