Deductive and Inductive reasoning

One thing that I see on a fairly regular basis is confusion between deductive and inductive reasoning. Both types of reasoning play different roles in investigations/forensics/science/etc. The difference between the two is sometimes hard to define. Here are two common defintions:

1. With deductive reasoning, the conclusions are contained, whether explicit or implicit, in the premises. With inductive reasoning, the conclusions go beyond what is contained in the premises.

2. The conclusions arrived at using (correct) deductive logic are necessarily true, meaning they must be true. The conclusions arrived at using inductive logic, are not necessarily true, although they may be.

An example might clarify things (taken from a philosophy class I took years ago):

  1. If I study, I will get an A on the exam (premise)
  2. I studied (premise)
  3. Therefore I got an A on the exam (conclusion)

In this case, since I studied, I got an A on the exam. The conclusion (I got an A on the exam) is contained implicitly in 1 and 2. For the geeks in us, here is a proof:

  1. If I study, then I will get an A on the exam [ IF A then B ]
  2. I studied [ A ]
  3. Therefore I got an A on the exam [ B ] (modus ponens on 1 and 2)

With inductive reasoning however:

  1. If I study, then I will get an A on the exam (premise)
  2. I got an A on the exam (premise)
  3. Therefore I studied (conclusion)

Just because I got an A on the exam doesn’t imply I studied, I could have cheated. For the geeks in us, here is an (incorrect) proof:

  1. If I study, then I will get an A on the exam [ IF B then C ]
  2. I got an A on the exam [ C ]
  3. Therefore I studied (no logical argument, no B)

The key in these examples in is parts 1 and 2. With deductive reasoning we had B and followed the If chain [(IF B then C) ^ B yields C]. With the inductive reasoning we have no B. In terms of logic this is confusing an “if” statement with an “if and only if”, where the former requires one direction of truth and the latter requires two directions of truth.

So how does this play into investigations/forensics/etc.? The idea is to be careful the the conclusions drawn. For instance, (relating back to the blog post about context) if an examiner finds the string “hacker” on a hard disk, the hit doesn’t necessarily mean that a “hacker” was on the system, nor does it necessarily mean that “hacker” tools were used. The data around the string would (hopefully) provide more context. Although even the presence of “hacker” tools doesn’t mean that the suspect actually used them, nor does it necessarily mean that the suspect even introduced them to the system. These types of questions are often raised with “The Trojan Defense”.

One (common) misunderstanding of deductive and inductive reasoning is with our legal system. Our legal system depends heavily on inductive reasoning (inferences). For instance take the case with Keith Jones testifying at the UBS trial. Keith Jones testified about what was found on UBS systems, various different sources of logs (e.g. WTMP logs, provider logs, etc.) and his analysis of the information. Does this prove with 100% certainty that the suspect (Duronio) actually committed the crime? No it doesn’t. However with a substantial amount of evidence, a jury could reach the conclusion that the standard of “beyond a reasonable doubt” has been met.

Another example of deductive vs. inductive reasoning is with “courts approving digital forensic tools”. First courts aren’t in the business of approving digital forensics tools. They may allow a person to testify about the use and conclusions drawn using the tools. This is fundamentally different from saying “tool XYZ is approved by this court”. The reasoning for allowing an examiner to testify using the results obtained from a tool typically involves a trusted third party. Essentially one (or more) third parties comes to a conclusion about the correctness of a tool. So the decision about allowing the original examiner to testify about the results found using the tool depends on what a third party thinks. This leads to the question: Just because a third party thinks so, does it mean it’s guaranteed to be true? Perhaps yes, perhaps no. [Note: I’m not commenting about any specific digital forensics tool, this could apply to any situation involving any type of tool or even process. This is one type of review used when considering whether or not to allow a scientific tool/process/technique into court.]

Information Context (a.k.a Code/Data Duality)

One concept that pervades digital forensics, reverse engineering, exploit analysis, even computing theory is that in order to fully understand information, you need to know the context the information is used in.

For example, categorize the following four pieces of information as either code or data:

1) push 0x6F6C6C65
2) “hello” (without the quotes)
3) 448378203247
4) 110100001100101011011000110110001101111

Some common answers might be:
1) Code (x86 instruction)
2) Data (string)
3) Data (integer)
4) Code or Data (unknown)

Really, all four are encoded internally (at least on an intel architecture) the same way, so they’re all code and they’re all data. Inside the metal box, they all have the same binary representation. The key is to know the context in which they’re used. If you thought #1 was code, this is probably based on past experience (e.g. having seen x86 assembly before). If the four ascii bytes ‘h’ ‘e’ ‘l’ ‘l’ ‘o’ were in a code segment, then it is possible that they were instructions, however it is also that they were data. This is because code and data can be interspersed even in a code segment (as is often done by compilers.)

Note: Feel free to skip the following paragraph if you’re not really interested in the theory behind why this is possible. The application of information context is in the rest of the post.
The reason code and data can be interspersed is often attributed to the Von Neumann architecture, which stores both code and data in a common memory area. Other computing architectures, such as the Harvard architecture have seperate memory storage areas for code and data. However, this is an implementation issue, as information context (or lack thereof) can also be seen in Turing machines. In a Universal Turing Machine (UTM), you can’t distinguish between code and data if you encode both the instructions to the UTM and the data to the UTM with the same set (or subset) of tape symbols. Both the instructions for the UTM and the data for the UTM sit on the tape, and if they’re encoded with the same set (or subset) of tape symbols, then just by looking at any given set of symbols, there isn’t enough information about how the symbols are used to determine if they’re instructions (e.g. move the tape head) or data. This type of UTM would be more along the lines of a Von Neumann machine. A UTM which was hard-coded (via the transition function) to use different symbols for instructions and data would be more along the lines of a Harvard machine.

One area where information context greatly determines understanding of code and data is in reverse engineering, specifically executable packers/compressors/encryption/etc. (Note: executable packers/compressors/encryption/etc. are really examples of self-modifying code, so this concept extends to self-modifying code as well). Here’s an abstract view of how common block-based executable encryption works (e.g. ASPack, etc.):

| Original executable | — Packer —> | Encrypted data |

The encrypted data is then combined with a small decryption stub resulting in an encrypted executable that looks something like:

| Decryption stub | Encrypted data |

When the encrypted executable is run, the decryption stub decrypts the encrypted data, yielding something that can look like:

| Decryption stub | Encrypted data | Original executable |

(Note: this is an abstract view, there are of course implementation details such as import tables, etc. It’s also possible to have the encrypted data decrypted in place).

The encrypted data is really code, just in a different representation. The same concept applies when you zip / gzip / bzip / etc. an executable from disk.

Another example where the duality between code and data comes into play is with exploits. Take for example SEH overwrites. An SEH overwrite is a mechanism for gaining control of execution (typically during a stack based buffer overflow) without using the return pointer. SEH stands for Structured Exception Handling, an error/exception handling mechanism used on the Windows operating system. One of the structures used to implement SEH sits on the stack, and can be overwritten during a stack-based buffer overflow. The structure consists of two pointers, and looks something like this:

| Pointer to previous SEH structure | Pointer to handler code |

Now in theory, a pointer, regardless of what it points to, is considered data. The value of a pointer is the address of something in memory, hence data. During an exploit that gains control by overwriting the SEH structure, the pointer to the previous SEH structure is typically overwritten with a jump instruction (often a jump to shellcode.) The pointer to handler code is typically overwritten with the address of a set of pop+pop+ret instructions. The SEH structure now looks like:

| Jump (to shellcode) | Pointer to pop+pop+ret |

The next thing that happens is to trigger an exception so that the pop+pop+ret executes. Due to the way the stack is set up when the exection handling code (now pop+pop+ret) is called, the ret transfers control to the jump (to shellcode). The code/data duality comes into play since this was originally a pointer (to the previous SEH structure) but is now code (a jump). In the context of the operating system, it treats this as data, but in the context of a pop+pop+ret it is treated as code.

Information context goes beyond just code/data duality. Take for example string searches (a.k.a. keyword searches) which are commonly used in digital forensic examinations to generate new leads to investigate. Assume a suspect is accused of stealing classified documents. Searching a suspect’s media (hard disk, usb drive, digital camera, etc.) for the string “Classified” (amongst other things) could seem prudent. However a hit on the word “Classified” does not necessarily suggest guilt, but it does warrant further investigation. For instance the hit for “Classified” could come from a classified document, or the hit could come from a cached HTML page for “Classified Ads”. Again the key is to understand the context in which the hit comes from.

To summarize this post in a single sentence: Information context can be vital in understanding/analyzing what you’re looking at.