In a previous post I covered “The basics of how digital forensics tools work.” In that post, I mentioned that one of the steps an analysis tool has to do is to translate a stream of bytes into usable structures. This is the first in a series of three posts that examines this step (translating from a stream of bytes to usable structures) in more detail. In this post I’ll introduce the different phases that a tool (or human if they’re that unlucky) goes through when recovering digital evidence. The second post will go into more detail about each phase. Finally, the third post will show an example of translating a series of bytes into a usable data structure for a FAT file system directory entry.
Data Structures, Data Organization, and Digital Evidence
Data structures are central to computer science, and consequently bear importance to digital forensics. In The Art of Computer Programming, Volume 1: Fundamental Algorithms (3rd Edition), Donald Knuth provides the following definition for a data structure:
Data Structure: A table of data including structural relationships
In this sense, a “table of data” refers to how a data structure is composed. This definition does not imply that arrays are the only data structure (which would exclude other structures such as linked lists.) The units of information that compose a data structure are often referred to as fields. That is to say, a data structure is composed of one or more fields, where each field contains information, and the fields are adjacent (next) to each other in memory (RAM, hard disk, USB drive, etc.)
The information the fields contain falls into one of two categories: the data a user wishes to represent (e.g. the contents of a file), or the structural relationships (e.g. a pointer to the next item in a linked list.) It’s useful to think of the former (data) as data, and the latter (structural relationships) as metadata, although the line between the two is not always clear and depends on the context of interpretation. What may be considered data from one perspective may be considered metadata from another. An example of this would be a Microsoft Word document, which from a file system perspective is data. However, from the perspective of Microsoft Word, the file contains both data (the text) as well as metadata (the formatting, revision history, etc.)
The design of a data structure not only includes the order of the fields, but also the higher level design goals for the programs which access and manipulate the data structures. For instance, efficiency has long been a desirable aspect of many computer programs. With society’s increased dependence on computers, other higher level design goals such as security, multiple access, etc. have also become desirable. As a result, many data structures contain fields to accommodate these goals.
Another important aspect in computing is how to access and manipulate the data structures and their related fields. Knuth defines this under the term “data organization”:
Data Organization: A way to represent information in a data structure, together with algorithms that access and/or modify the structure.
An example of this would be a field that contains the bytes 0x68, 0x65, 0x6C, 0x6C, and 0x6F. One way to interpret these bytes is as the ASCII string “hello”. In another interpretation, these bytes can be the integer number 448378203247 (decimal), if they are read as a single big-endian value. Which one is it? Well, there are scenarios where either could be correct. To answer the question of correct interpretation requires information beyond just the data structure and field layout, hence the term data organization. Even with self-describing data structures, information about how to access and manipulate the “self-describing” parts (e.g. type “1” means this is a string) is still needed.
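To make the ambiguity concrete, here is a minimal Python sketch that reads the same five bytes in more than one way. The variable names are made up for illustration, and the choice of byte order for the integer is an assumption: big-endian is the one that yields the number quoted above, and little-endian gives a different number again.

```python
# The five bytes from the example above.
field = bytes([0x68, 0x65, 0x6C, 0x6C, 0x6F])

# Interpretation 1: each byte is an ASCII character.
as_text = field.decode("ascii")

# Interpretation 2: the bytes form one unsigned integer, big-endian.
as_big_endian = int.from_bytes(field, byteorder="big")

# Interpretation 3: the same integer type, but little-endian byte order.
as_little_endian = int.from_bytes(field, byteorder="little")

print(as_text)        # hello
print(as_big_endian)  # 448378203247
print(as_little_endian)
```

Nothing in the bytes themselves says which reading is right. That knowledge has to come from the data organization.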
So where does all this information for data organization (and data structures) come from? There are a few common sources. Perhaps the first would be a document from the organization that designed the data structures and the software that accesses and manipulates them. This could be either a formal specification, or one or more informal documents (e.g. entries in a knowledge base.) Another source would be reverse engineering the code that accesses and manipulates the data structures.
If you’ve read through all of this, you might be asking “So how does this relate to digital forensics?” The idea is that data structures are a type of digital evidence. Realize that the term “digital evidence” is somewhat overloaded. In one context, a disk image is digital evidence (i.e. what was collected during evidence acquisition), and in another context, an email extracted from a disk image is digital evidence. This series focuses on the latter, digital evidence extracted from a stream of bytes. Typically this would occur during the analysis phase, although (especially with activities such as verification) this can occur prior to the evidence acquisition phase.
The 5 Phases
Now that we’ve talked about what data structures are and how they relate to digital forensics, let’s see how to put this to use with our forensic tools. What we’re about to do is describe five abstract phases, meaning not all tools implement them directly, and some tools don’t focus on all five phases. These phases can also serve as a methodology for recovering data structures, should you happen to be in the business of writing digital forensic tools.
The five phases are location, extraction, decoding, interpretation, and reconstruction. The results of each phase are used as input for the next phase, in a linear fashion.
An example will help clarify each phase. Consider the recovery of a FAT directory entry from a disk image. The first task would be to locate the desired directory entry, which could be accomplished through different mechanisms such as calculation or iteration. The next task is to extract the various fields of the data structure, such as the name, the date and time stamps, the attributes, etc. After the fields have been extracted, fields where individual bits represent sub fields can be decoded. In the example of the directory entry, this would be the attributes field, which denotes whether a file is considered hidden, to be archived, a directory, etc. Once all of the fields have been extracted and decoded, they can be interpreted. For instance, the seconds field of a FAT time stamp is really the seconds divided by two, so the value must be multiplied by two. Finally, the data structure can be reconstructed using the facilities of the language of your choice, such as the time class in Python.
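The walkthrough above can be sketched in Python. This is a minimal illustration under stated assumptions, not a complete parser: the 32-byte layout follows the commonly documented FAT short directory entry, only the last-write time stamp is recovered, the short name is assumed to be ASCII, and all of the function names (locate, extract, decode, interpret, reconstruct, recover_entry) are hypothetical. Long file names, deleted entries, and invalid dates are not handled.

```python
import struct
from datetime import datetime

ENTRY_SIZE = 32  # a FAT directory entry is 32 bytes long
# Little-endian layout: name, attributes, reserved, creation tenths,
# creation time, creation date, access date, cluster high,
# write time, write date, cluster low, file size.
ENTRY_FORMAT = "<11sBBBHHHHHHHI"

ATTRIBUTE_BITS = {
    0x01: "read_only", 0x02: "hidden", 0x04: "system",
    0x08: "volume_label", 0x10: "directory", 0x20: "archive",
}


def locate(dir_offset, index):
    """Phase 1: find the entry by calculation (offset arithmetic)."""
    return dir_offset + index * ENTRY_SIZE


def extract(image, offset):
    """Phase 2: pull the raw fields out of the byte stream."""
    f = struct.unpack_from(ENTRY_FORMAT, image, offset)
    return {"name": f[0], "attributes": f[1],
            "write_time": f[8], "write_date": f[9], "size": f[11]}


def decode(raw):
    """Phase 3: split fields whose individual bits are sub fields."""
    t, d = raw["write_time"], raw["write_date"]
    return {
        "name": raw["name"],
        "size": raw["size"],
        "attributes": {label for bit, label in ATTRIBUTE_BITS.items()
                       if raw["attributes"] & bit},
        "hour": t >> 11, "minute": (t >> 5) & 0x3F, "seconds_div2": t & 0x1F,
        "year_offset": d >> 9, "month": (d >> 5) & 0x0F, "day": d & 0x1F,
    }


def interpret(dec):
    """Phase 4: compute real values from the decoded ones."""
    out = dict(dec)
    out["second"] = dec["seconds_div2"] * 2    # stored as seconds / 2
    out["year"] = 1980 + dec["year_offset"]    # stored relative to 1980
    out["name"] = dec["name"].decode("ascii")  # assumes an ASCII short name
    return out


def reconstruct(val):
    """Phase 5: build a structure native to the language."""
    return {
        "name": val["name"],
        "size": val["size"],
        "attributes": val["attributes"],
        "modified": datetime(val["year"], val["month"], val["day"],
                             val["hour"], val["minute"], val["second"]),
    }


def recover_entry(image, dir_offset, index):
    """Run the five phases in order, each feeding the next."""
    offset = locate(dir_offset, index)
    return reconstruct(interpret(decode(extract(image, offset))))
```

Keeping each phase as its own function mirrors the linear flow: the output of one is the only input of the next, so any phase can be swapped out or inspected on its own.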
There are a few interesting points to note with recovery of data structures using the above methodology. First, not all tools go through all phases, at least not directly. For instance, file carving doesn’t directly care about data structures. Depending on how you look at it, file carving really does go through all five phases; it just uses an identity function. In addition, file carving does care about (parts of) data structures: it cares about the fields that contain “user information”, not about the rest of the fields. In fact, much file carving is done with a built-in assumption about the data structure: that the fields that contain “user information” are stored in contiguous locations.
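The contiguity assumption is easy to see in a minimal header/footer carver. The sketch below is an illustration only, not how any particular tool works. It assumes JPEG data, which begins with the bytes FF D8 FF and ends with FF D9, and it assumes everything between the two markers is the file’s content, stored in one contiguous run. The function name carve_jpegs is hypothetical.

```python
JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"


def carve_jpegs(image):
    """Return (offset, data) pairs for each header-to-footer span found."""
    results = []
    start = image.find(JPEG_HEADER)
    while start != -1:
        # Look for the first footer after this header.
        end = image.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break  # header without a footer: nothing more to carve
        end += len(JPEG_FOOTER)
        # Contiguity assumption: every byte in between belongs to the file.
        results.append((start, image[start:end]))
        start = image.find(JPEG_HEADER, end)
    return results
```

If the file is fragmented, or a footer byte sequence happens to appear inside the image data, this approach returns the wrong bytes, which is exactly the cost of ignoring the rest of the data structure.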
Another interesting point is the distinction between extraction, decoding, and interpretation. Briefly, extraction and decoding focus on extracting information (from a stream of bytes and from already extracted bytes, respectively), whereas interpretation focuses on computation using extracted and decoded information. The next post will go into these distinctions in more depth.
A third and subtler point comes from the transition of data structures between different types of memory, notably from RAM to a secondary storage device such as a hard disk or USB thumb drive. Some structural information may not survive the transition from RAM, and as a result is lost. For instance, a linked list data structure, which typically contains a pointer field to the next element in the list, may not record the pointer field when being written to disk. More often than not, such information isn’t necessary to read consistent data structures from disk, otherwise the data organization mechanism wouldn’t really be consistent and reliable. However, if an analysis scenario does require such information (it’s theoretically possible), the data structures would have to come directly from RAM, as opposed to after they’ve been written to disk. This problem doesn’t stem from the five phases, but instead stems from a loss of information during the transition from RAM to disk.
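As an illustration of this loss, consider a hypothetical linked list whose nodes are written to disk as consecutive fixed-size records. The Node class, the function names, and the 4-byte little-endian record format below are all assumptions made for this sketch. They do not describe any real file format.

```python
import struct


class Node:
    """In-memory list element: a value (data) plus a reference (structure)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def write_list(head):
    """Serialize the values only; the 'next' references are never written."""
    out = bytearray()
    node = head
    while node is not None:
        out += struct.pack("<I", node.value)
        node = node.next
    return bytes(out)


def read_list(data):
    """Rebuild the values; their order is implied by position in the stream."""
    return [value for (value,) in struct.iter_unpack("<I", data)]
```

The on-disk form is still consistent and readable, because the order of the elements is implied by their position. What it no longer holds is the pointer values themselves, so anything that depended on where the nodes lived in RAM cannot be recovered from the disk copy.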
In the next post, we’ll cover each phase in more depth, and examine some of the different activities that can occur at each phase.