Based Binary Analysis Implementation Manual: Möbius Strip Reverse Engineering
Contents
1 Python Basics
1.1 Basic Data Structures
1.1.1 Lists
1.1.2 Tuples
1.1.3 Sets
1.1.4 Dictionaries
1.1.5 Iterating over Containers
1.2 Transforming Containers
1.2.1 Map
1.2.2 Filter
1.2.3 Reduce
1.2.4 Lambda Functions
1.3 Comprehensions
1.4 Iterators
1.4.1 Combining Iterators
1.4.2 Mapping Iterators
1.4.3 Filtering Iterators
1.4.4 Combining Map, Filter, and Reduce
1.5 Python Classes
1.5.1 Inheritance
1.5.2 Inheritance Hierarchies
1.5.3 Constructors and Destructors
1.5.4 Deriving from object
1.5.5 Class Properties
1.6 Simulating Enumerations
1.7 Exceptions
1.8 Package Structure
1.8.1 Inter-Module Dependencies in the Single-Directory Setting
1.8.2 Invoking Modules in the Single-Directory Setting
1.8.3 Multiple Directories
1.8.4 Inter-Module Dependencies with Multiple Directories
1.8.5 Invoking Modules with Multiple Directories
1.9 Testing
1.9.1 Testing for Type Errors
1.9.2 Python’s unittest Module
1.9.3 Programming Exercises in This Course
1.10 Documentation
1.11 Foreign Functions and ctypes
2 X86
2.1 Meta-Data: X86MetaData.py
2.2 Representing Instructions and Operands: X86.py
2.2.1 Representing X86 Operands
3 Reference Material
3.1 X86
3.1.1 MOD R/M-16
3.1.2 MOD R/M-32
3.1.3 MOD R/M-32 SIB
3.1.4 AOTDL
3.1.5 DECDL
This document, accompanying the SMT-based binary program analysis training course,
describes the design and implementation of a fully-functional binary analysis frame-
work written in Python. Along the way, we shall also introduce programming concepts
used while writing programs that analyze other programs.
1 Python Basics
1.1 Basic Data Structures
1.1.1 Lists
Lists are one of the most basic types in Python. The built-in function len returns
their length.
>>> x = [1,2]
>>> y = [3,6,9,12]
>>> z = ["String",None,True]
>>> len(y)
4
>>>
>>> x == y
False
>>> x == [1,2]
True
>>>
To insert elements into a list, the API provides two interfaces: a mutable interface
(whose functions modify an existing list in place), and an immutable one (whose func-
tions create new lists).
• Mutable:
– Appending elements: l.append(e)
– Concatenating lists: l1.extend(l2)
• Immutable:
– Appending elements: l + [e]
– Concatenating lists: l1 + l2
>>> x+y
[1, 2, 3, 6, 9, 12]
>>> x.append(3)
>>> x
[1, 2, 3]
>>> y.extend([15,18])
>>> y
[3, 6, 9, 12, 15, 18]
>>>
They are indexable, including from the end if a negative number is supplied.
>>> x = [1,2]
>>> y = [3,6,9,12]
>>> x[0]
1
>>> x[-1],y[-2]
(2, 9)
>>>
You can unpack lists by providing as many variables as there are elements in the list.
>>> l = [1,2,3]
>>> c1,c2,c3 = l
>>> c1
1
>>>
You can take slices of lists. For positive integers a and b, each expression below
retrieves the elements described:
• y[a:b]: position a up to (but not including) position b. If a >= b, the result is [].
• y[a:-b]: position a up to the end of the list, not including the last b elements.
>>> z = y[1:3]
>>> z
[6, 9]
You can create a list of a specified length, whose contents are identical, using the *
(multiplication) syntax:
>>> z = [None]*5
>>> z
[None, None, None, None, None]
1.1.2 Tuples
Tuples collect a fixed number of objects into a single, immutable object.
>>> x = (1,2)
>>> y = (3,6,9,12)
>>> z = ("String",None,True)
>>> len(z)
3
>>>
They are indexable, including from the end if a negative number is supplied.
>>> x[1]+y[0]
5
>>>
>>> tuple([1,2,3,4])
(1, 2, 3, 4)
>>>
You can unpack tuples by providing as many variables as there are elements in the
tuple.
>>> t = (1,2,3)
>>> c1,c2,c3 = t
>>> c1
1
>>>
If you have a function that takes as many arguments as there are elements in a tu-
ple, you can "unpack" the tuple conveniently with the * (star) syntax while calling the
function. (Technically, the star syntax also works for lists.)
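The star syntax can be sketched as follows. The function add3 is a hypothetical helper introduced only for this illustration; it is not part of the course code.

```python
def add3(a, b, c):
    """Hypothetical helper that takes exactly three arguments."""
    return a + b + c

t = (1, 2, 3)
l = [10, 20, 30]

# The star unpacks the container into three separate positional arguments.
from_tuple = add3(*t)   # same as add3(1, 2, 3)
from_list = add3(*l)    # the star syntax also works for lists
print(from_tuple)
print(from_list)
```

The number of elements must match the number of parameters; otherwise Python raises a TypeError.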
1.1.3 Sets
Sets are collections of items, where only one of a given item can be present in the set.
In other words, duplicates are removed automatically upon creating a new set or adding
an element to an existing one.
As with lists in section 1.1.1, Python provides mutable and immutable APIs for ma-
nipulating sets.
• Mutable API:
– s1.update(s2) (alternatively s1 |= s2).
– s1.intersection_update(s2) (alternatively s1 &= s2).
– Adding an element x to a set: s1.add(x).
– Removing an element x from a set: s1.remove(x).
• Immutable API:
– s3 = s1.union(s2) (alternatively s3 = s1 | s2).
– s3 = s1.intersection(s2) (alternatively s3 = s1 & s2).
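A brief sketch of both APIs, using small example sets chosen only for illustration:

```python
s1 = set([1, 2, 3])
s2 = set([3, 4])

# Immutable API: new sets are created; s1 and s2 are left unchanged.
s3 = s1.union(s2)
s4 = s1 & s2

# Mutable API: s1 is modified in place.
s1.add(3)        # 3 is already present, so nothing changes
s1.update(s2)    # s1 now also contains 4
s1.remove(1)
print(sorted(s1))
```

Note that s1.remove(x) raises a KeyError if x is not in the set.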
1.1.4 Dictionaries
Dictionaries allow objects (called values) to be associated with other objects (called
keys).
• The key and value objects need not have the same types as other elements in the
dictionary. For example, a single dictionary may map string keys to boolean
values and, at the same time, map an integer key to a string value.
>>> d = {1:2,3:4}
>>> print 1 in d
True
>>> print d[1]
2
>>> d[1] = 3
>>> print d[1]
3
>>> d.update([(1,4),(2,0)])
>>> print d
{1: 4, 2: 0, 3: 4}
>>> del d[1]
>>> print d
{2: 0, 3: 4}
>>>
• You can over-write the existing value associated with a key with the same syntax
as adding a mapping between a key and a value.
• Given a list l of (key,value) tuples, you can use the dict method d.update(l)
to add all of them to the dictionary d simultaneously.
Python will raise a KeyError exception if you try to retrieve an element that does not
exist within a dictionary. (Exceptions are discussed in more depth in section 1.7.)
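The points above can be sketched as follows. The keys and values are arbitrary examples; the try/except construct used to observe the KeyError is explained in section 1.7.

```python
d = {}
d["lock"] = True       # string key, boolean value
d[7] = "seven"         # integer key, string value
d[7] = "SEVEN"         # same syntax over-writes the existing value
d.update([("rep", False), (8, "eight")])   # add several mappings at once

# Retrieving a key that does not exist raises KeyError.
try:
    d["missing"]
    found = True
except KeyError:
    found = False
print(found)
```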
Note that in the case of a dictionary, only the keys will be retrieved with an ordinary
for expression. If you want the keys and the values, use the dict method items to
obtain a list of them.
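A small sketch of both forms of iteration. The results are sorted only to make the order predictable for this illustration.

```python
d = {1: 2, 3: 4}

# An ordinary for expression over a dictionary yields the keys only.
keys = sorted(k for k in d)

# items() yields (key, value) pairs, which can be unpacked in the loop.
pairs = sorted((k, v) for k, v in d.items())
print(keys)
print(pairs)
```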
[Figure: map applies the function f to every element x1, ..., x6 of the input container and collects the results in the output list.]
1.2.2 Filter
The function filter(f,container) creates a new container from those elements for
which the function f returns True.
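[Figure: filter applies f to each element x1, ..., x6 of the input container; the output container holds only those elements, here x1, x3, and x5, for which f returns True.]
[Figure 1: reduce applied to the container 1 2 3 4 5, with intermediate results such as 1 = f(0,1) and 6 = f(3,3).]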
1.2.3 Reduce
The function reduce(f,container,initial) “reduces” a container into one element
by repeatedly applying f to the elements. An example of its action can be seen in
figure 1.
• If initial is provided (i.e., not None), the sequence of actions is roughly this:
1. result = f(initial,container[0])
2. result = f(result,container[1])
3. result = f(result,container[2])
4. · · ·
5. result = f(result,container[len(container)-1])
• If initial is not provided (i.e., is None), the sequence of actions is roughly this:
1. result = f(container[0],container[1])
2. result = f(result,container[2])
3. · · ·
4. result = f(result,container[len(container)-1])
• If initial is not provided and the container has one element, return that element
as-is.
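The sequence of actions listed above can be written out as a plain function. The name reduce_sketch is hypothetical, and the sketch follows the description in this section (treating None as "not provided") rather than reproducing the library's actual implementation.

```python
from functools import reduce  # also available as a built-in on Python 2

def reduce_sketch(f, container, initial=None):
    """Illustrative re-implementation of the steps listed above."""
    items = list(container)
    if initial is None:
        if not items:
            raise TypeError("reduce of empty container with no initial value")
        # No initial value: start from the first element.
        result, rest = items[0], items[1:]
    else:
        result, rest = initial, items
    for x in rest:
        result = f(result, x)
    return result

add = lambda a, b: a + b
total = reduce_sketch(add, [1, 2, 3, 4, 5], 0)
single = reduce_sketch(add, [7])   # one element, returned as-is
print(total)
```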
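# Named function passed to filter
def f(i):
    return i&1 == 0
filter(f,[1,2,3])

# Equivalent lambda function
filter(lambda i: i&1==0,[1,2,3])

# Named function passed to reduce
def f(a,b):
    return a+b
reduce(f,[1,2],0)

# Equivalent lambda function
reduce(lambda a,b: a+b,[1,2],0)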
1.3 Comprehensions
Comprehensions are a very concise way to create new container objects from existing it-
erable objects. Python supports list, set, and dictionary comprehensions. The basic form
of a list comprehension uses the syntax [ expression for element in iterable ].
Some examples follow; note that they could be written using map.
Comprehensions for all types of containers optionally allow elements from the iter-
able to be ignored by placing an if-condition after the for expression. The following
examples could also be written using both map and filter.
Set comprehensions are identical to list comprehensions, except they use curly brack-
ets ({,}) instead of square ones ([,]).
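The forms described in this section can be sketched as follows; the list int_list and the expressions are arbitrary examples.

```python
int_list = [1, 2, 3, 4]

# List comprehension; equivalent to map(lambda x: 2*x, int_list).
doubled = [2 * x for x in int_list]

# With an if-condition; equivalent to filtering for even numbers first.
even_squares = [x * x for x in int_list if x & 1 == 0]

# Set comprehension: curly brackets, duplicates collapse automatically.
low_bits = {x & 1 for x in int_list}

# Dictionary comprehension: key:value before the for expression.
squares = {x: x * x for x in int_list}
print(doubled)
```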
1.4 Iterators
In section 1.1.5, we showed how standard containers like lists, tuples, sets, and dic-
tionaries supported iterating over the contents using syntax like for x in int_list:.
Iterators, like the container transformation functions and comprehensions discussed
previously, are powerful tools that can simplify your programming considerably. An
iterator produces its elements one at a time, on demand, so it need not hold an entire
sequence in memory. The functions in the standard itertools module are implemented
in C in the standard interpreter, so they are often faster than equivalent hand-written
loops. The itertools module allows iterators to be combined and manipulated in a
variety of interesting and useful ways.
>>> import itertools
>>> x = [1,2]
>>> y = [3,4]
>>> g = itertools.chain.from_iterable([x,y])
>>> for i in g:
... print i,
1 2 3 4
Another operation we may wish to perform upon two iterators is to create all pairs of
their possible outputs (known in mathematics as the Cartesian product).
itertools.product implements this idea, as shown below.
>>> x = [1,2]
>>> y = [3,4]
>>> g = itertools.product(x,y)
>>> for i in g:
... print i,
(1, 3) (1, 4) (2, 3) (2, 4)
>>>
In fact, you can pass more than two iterators to itertools.product to obtain all
tuples of the iterators’ possible outputs.
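For instance, with three input lists (chosen arbitrarily here), itertools.product yields triples:

```python
import itertools

# Every combination of one element from each of the three lists.
triples = list(itertools.product([1, 2], [3, 4], [5, 6]))
print(len(triples))
```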
>>> x = [1,2]
>>> y = [3,4]
>>> g = itertools.product(x,y)
>>>
>>> # Compare to below
>>> # Parentheses (one parameter, a tuple)
>>> g1 = itertools.imap(lambda (x,y): x+y,g)
>>> for i in g1:
... print i,
4 5 5 6
itertools also provides starmap. starmap assumes that the iterator yields a tuple,
which it then “unpacks” using the * (star) operator (recalling section 1.1.2), and then
calls the provided mapping function with multiple arguments.
>>> g = itertools.product(x,y)
>>>
>>> # Compare from above
>>> # No parentheses (two parameters)
>>> g2 = itertools.starmap(lambda x,y: x+y,g)
>>> for i in g2:
... print i,
4 5 5 6
>>> x = [1,2,3,4,5,6,7,8]
>>> g = itertools.ifilter(lambda x:x&1==0,x)
>>> for i in g:
... print i,
...
2 4 6 8
ifilterfalse(f,i) works the same way, except it keeps those elements for which f
returns False.
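The names imap, ifilter, and ifilterfalse exist only in Python 2. As a hedged sketch for readers running a newer interpreter: in Python 3 the built-in map and filter are already lazy, and ifilterfalse is named filterfalse. The fallback below is our own illustration and is not part of the course code.

```python
import itertools

try:
    from itertools import imap, ifilter, ifilterfalse    # Python 2
except ImportError:
    imap, ifilter = map, filter                          # Python 3 built-ins are lazy
    from itertools import filterfalse as ifilterfalse

is_even = lambda x: x & 1 == 0

pairs = itertools.product([1, 2], [3, 4])
sums = list(itertools.starmap(lambda x, y: x + y, pairs))
evens = list(ifilter(is_even, range(1, 9)))
odds = list(ifilterfalse(is_even, range(1, 9)))
print(sums)
```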
One thing to notice in the snippet above is that the class method declaration for
DoSomething takes an object called self as its first parameter and an integer i as its
second parameter. However, when we invoke the method using b.DoSomething(5),
we only supply the integer parameter i. Behind the scenes, the line b.DoSomething(5)
passes b as the self parameter before the others. Hence, when Basic.DoSomething
executes, self references b.
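A minimal sketch of this behavior follows. The body of DoSomething is an assumption made for illustration; only the names Basic, DoSomething, b, and i come from the text.

```python
class Basic(object):
    def DoSomething(self, i):
        # self is the object on which the method was invoked.
        return (self, i)

b = Basic()

# The usual call: b is passed as self behind the scenes.
via_object = b.DoSomething(5)

# The same call with self supplied explicitly.
via_class = Basic.DoSomething(b, 5)
print(via_object[1])
```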
Classes can store data inside of them. To access the data member data, class methods
must reference self in doing so, as in self.data. Code outside of the class can also
refer to data members within an object, as in h.data in the last line of the snippet below.
Note that languages like C++ allow the programmer to control which code can access
class data members. In C++, data members declared public can be accessed freely by
code outside of the class, those declared private can only be accessed by members
of the class itself, and those declared protected can only be accessed by members of
the class itself or derived classes. Python’s support for restricting access to class data
members is virtually non-existent. All class data members in Python are the equivalent
of C++’s public members.
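A sketch of a class that stores data. The method names Store and Retrieve match the HasData class used in this section; the values are arbitrary.

```python
class HasData(object):
    def Store(self, v):
        self.data = v        # class methods go through self
    def Retrieve(self):
        return self.data

h = HasData()
h.Store(5)
retrieved = h.Retrieve()

# Code outside the class can read and write the member directly.
h.data = 6
direct = h.data
print(direct)
```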
As an aside, although the first argument of a class method is typically called self,
this is by convention only: that parameter can be renamed to anything, such as the
shorter name s. The following example is equivalent to the declaration of HasData in
the snippet above, with self renamed to s.
class HasData(object):
    def Store(s,v):
        s.data = v
    def Retrieve(s):
        return s.data
1.5.1 Inheritance
In the previous discussion on how to declare classes, we used declarations like
class HasData(object):. Technically, this means that HasData is derived from an-
other class called object. Alternatively, we say that HasData inherits the class methods
from object. We illustrate classes that derive from something other than object in the
example below with the classes Base and Derived.
Note the declaration of "Derived" above: class Derived(Base). This means that
the class Derived will automatically contain all of the functions defined upon the Base
class. This is why we can call the Interface method on Derived, despite Derived not
explicitly providing it: Derived contains that method by virtue of inheriting from Base
(which does contain that method).
When a derived class provides different implementations for methods defined in
base classes, it is said to override those methods. In Python 2.7 new-style classes,
invoking class methods will always invoke the overridden versions. This point mer-
its careful investigation. We observed that Derived doesn’t explicitly implement the
method Interface; rather, it inherits it from Base. When we invoke d.Interface(),
the method Base.Interface invokes self.InternalFunction(). Which version of
InternalFunction is invoked – Base’s, or Derived’s? The output gives the answer:
Derived’s. The rule is that, if a derived class overrides methods that are invoked by base
methods, the overridden methods will be executed rather than the ones in the base class.
If you are familiar with C++, this point can be understood by saying that, in Python, all
class methods are considered virtual.
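The behavior described above can be sketched as follows. The class and method names follow the text; the method bodies and returned strings are assumptions made for illustration.

```python
class Base(object):
    def InternalFunction(self):
        return "Base.InternalFunction"
    def Interface(self):
        # Which InternalFunction runs depends on the type of self.
        return self.InternalFunction()

class Derived(Base):
    # Overrides InternalFunction; Interface is inherited from Base.
    def InternalFunction(self):
        return "Derived.InternalFunction"

b = Base()
d = Derived()
print(b.Interface())
print(d.Interface())
```

Calling d.Interface() runs Base.Interface, which in turn dispatches to the overriding method in Derived.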
Derived classes can still call methods from base classes, using a slightly unusual syn-
tax: ClassName.MethodName(self,args). There is an example in the snippet below:
Base.InternalFunction(self). Note that the class instance self is supplied as an
argument, something that is not done in normal circumstances.
Also note in the snippet above that Derived2 is derived from Derived rather than
Base. Since Derived derives from Base, Derived2 inherits from both.
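A sketch of calling a base-class method explicitly. The class names follow the text; the method bodies are assumptions made for illustration.

```python
class Base(object):
    def InternalFunction(self):
        return "Base"

class Derived(Base):
    def InternalFunction(self):
        return "Derived"

class Derived2(Derived):
    def InternalFunction(self):
        # Explicitly invoke Base's version, passing self by hand.
        base_result = Base.InternalFunction(self)
        return "Derived2 after " + base_result

d2 = Derived2()
result = d2.InternalFunction()
print(result)
```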
[Figure 3: inheritance hierarchy in which Derived and OtherDerived derive from Base.]
Along these lines, Python provides a built-in function called isinstance(x,y), which
returns True if and only if the class type of object x is y itself or is derived, somewhere
along the line, from class type y. This can be understood graphically as stating that, for
isinstance(x,y) to be true, there must be a path from the class type of x to class y in
the inheritance hierarchy. In figure 3, we can see that there is a path from every class
to Base, but there is no path from Derived or DerivedDerived to OtherDerived, nor
vice versa. The results of the isinstance function, as shown in table 1, confirm this
graphical intuition.
>>> d = Derived()
>>> isinstance(d,Base)
True
>>> isinstance(d,OtherDerived)
False
>>>
isinstance can also take a tuple of class types as its second parameter, as in
isinstance(x,(Derived,OtherDerived)). It will return True if x is an instance of
any of those class types.
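A sketch of isinstance over a small hierarchy. That DerivedDerived derives from Derived is an assumption suggested by its name; the empty class bodies are for illustration only.

```python
class Base(object): pass
class Derived(Base): pass
class DerivedDerived(Derived): pass    # assumed parent: Derived
class OtherDerived(Base): pass

dd = DerivedDerived()
od = OtherDerived()

path_to_base = isinstance(dd, Base)                  # path exists
no_path = isinstance(dd, OtherDerived)               # no path
either = isinstance(od, (Derived, OtherDerived))     # tuple form
print(path_to_base)
```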
The programmer is free to declare the constructor as taking as many parameters as
desired. Note that only the constructor in the class that is being instantiated will be
called. If the programmer wishes to call the constructor of a base class, they should
do so as previously discussed in section 1.5.1: by explicitly naming the base class in a
statement such as BaseClass.__init__(self,1).
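A sketch of this rule. The classes Child and Forgetful and their members are hypothetical names introduced for this illustration.

```python
class BaseClass(object):
    def __init__(self, n):
        self.n = n

class Child(BaseClass):
    def __init__(self, name):
        # Explicitly run the base-class constructor.
        BaseClass.__init__(self, 1)
        self.name = name

class Forgetful(BaseClass):
    def __init__(self):
        # BaseClass.__init__ is never called, so self.n is never set.
        pass

c = Child("c")
f = Forgetful()
print(c.n)
```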
Destructors are the opposite of constructors. The class method __del__ will be called
when the object is garbage collected. However, Python does not guarantee when, or
even whether, this happens; for example, objects that still exist when the interpreter
exits may never have __del__ called. Fortunately, in languages with garbage
collection, destructors are rarely necessary. Only one of our classes shall employ a
destructor, and it will be provided. Thus, we need not speak more of destructors.
object is used. If the class does declare a method def __str__(self) that returns a
string describing the object, we gain the ability to print our objects using the built-in
facilities.
Having a unified interface for this functionality is very useful, but it does become
tedious to implement these methods for each class.
To see what happens when we don’t implement the methods in the previous table,
let’s revisit section 1.1.4, the introduction to dictionaries. Internally, Python uses a hash
table data structure to implement a dictionary. To look up an item k in the dictionary d
(i.e., d[k]), Python:
Note also how the dictionary is printed in response to the statement d: b is dis-
played as <__main__.A object at 0x02256D50>: ’b’. This is the default behavior
of object.__repr__(). On the last line, print d yields something similar. This is
the default behavior of object.__str__(). We will want to implement __repr__ and
__str__ methods for many of our objects.
After defining these functions to do sensible things, we obtain the results that we
desire.
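A sketch of a class that implements these methods. The class name A follows the text; the name attribute and the method bodies are assumptions made for illustration.

```python
class A(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        # Equal objects must produce equal hashes.
        return hash(self.name)
    def __eq__(self, other):
        return isinstance(other, A) and self.name == other.name
    def __ne__(self, other):
        return not self == other    # Python 2 does not derive != from ==
    def __repr__(self):
        return "A(%r)" % (self.name,)
    def __str__(self):
        return self.name

d = {A("b"): "b"}
looked_up = d[A("b")]    # a different but equal object finds the entry
shown = repr(d)
print(shown)
```

Without __hash__ and __eq__, the lookup with a second A("b") object would raise a KeyError, because the default implementations distinguish objects by identity.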
After the introductory material, all classes shall come with these uninteresting boiler-
plate methods already provided.
Overriding __call__(self,arg1,arg2,...) in a class allows instances of
that class to be called, as though the objects were functions. Most commonly, this
functionality is used to provide something that behaves like a constructor; i.e., calling
an existing object will return a new object of the same type. We will use this at several
points to simplify our designs.
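A minimal sketch of this pattern. The classes Point and SpecialPoint are hypothetical and serve only to show that calling an object yields a new object of the same type.

```python
class Point(object):
    def __init__(self, x):
        self.x = x
    def __call__(self, x):
        # type(self) is the actual class of the object being called,
        # so derived classes produce objects of their own type.
        return type(self)(x)

class SpecialPoint(Point):
    pass

p = SpecialPoint(1)
q = p(2)        # "calls" the object p
print(q.x)
```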
class Word(object):
    def __init__(s,i):
        s.i = i & 0xFFFF
The value is masked with 0xFFFF upon construction, but after that, the programmer
must remember to manually mask any subsequent values stored into Word.i. We could
define a method to set i:
class Word(object):
    def __init__(s,i):
        s.set(i)
    def set(s,i):
        s.i = i & 0xFFFF
However, the user can still access the i variable directly, thereby potentially allowing
invalid values.
Python presents an elegant solution for access control of data items through what
are called properties. A property is accessed with the same syntax as a variable, but
reading or writing it invokes a method. When the user tries to get the value of the
property (as in print o.i), a getter class method is invoked. When the user tries to set
the value of the property (as in o.i = 1), a setter method is invoked. The user accesses
properties identically to variables, while the class enforces its constraints behind the
scenes.
The code below illustrates an implementation of properties. def i(self) is prefixed
with @property, which makes it the getter for the property i. def i(self,i) is
prefixed with @i.setter, which makes it the setter for the property i. Internally, these
methods hold the value in a variable called _i. The setter masks the value before storing
it.
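A sketch of the Word class written with a property, following the description above. The test values are arbitrary.

```python
class Word(object):
    def __init__(self, i):
        self.i = i              # goes through the setter below

    @property
    def i(self):
        return self._i

    @i.setter
    def i(self, i):
        self._i = i & 0xFFFF    # mask to 16 bits before storing

w = Word(0x12345)
first = w.i
w.i = -1
second = w.i
print(hex(first))
```

Properties of this kind require a new-style class, i.e., one that derives from object.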
enum_upper uses upper-case names for the element strings. enum_specialstr al-
lows the user to provide custom strings for each enumeration element:
1.7 Exceptions
Exceptions are a common construct in most programming languages, used to signify
that something has gone wrong while the program was executing. Python uses the
terminology raising to describe the act of signalling an exception. An example of Python’s
interpreter signalling an exception follows.
Python implements exceptions via classes. The basic type of exception is a class
called Exception, from which other types of exceptions derive. The standard library
provides an entire class hierarchy of exceptions for ordinary circumstances, a small
selection of which is shown in table 2. To raise an exception, use the raise statement
followed by the exception constructor, as in raise TypeError.
If you are performing some operation that may raise an exception, you may wish to
catch and handle the exception. Let’s say you want to return the ith element of a list
l. Indexing into the list by i may raise the IndexError exception if i is outside of the
length of the list. Place the code that may raise the exception in a try block as shown
below. If an exception occurs, Python will inspect the type of the exception, look at
the types of the exceptions specified (in-order) by the except blocks following the try
block, and execute the code corresponding to the first exception type that matches. After
executing a try block, regardless of whether an exception was thrown (i.e., whether any
of the except blocks executed), it will execute the finally block if one is present.
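The control flow just described can be sketched as follows. The function element_or_none and the log list are hypothetical names used to make the order of execution visible.

```python
def element_or_none(l, i):
    """Return (l[i] or None, list of blocks that executed)."""
    log = []
    try:
        result = l[i]
    except TypeError:
        # Checked first: raised when i is not an integer.
        log.append("TypeError")
        result = None
    except IndexError:
        # Checked second: raised when i is outside the list.
        log.append("IndexError")
        result = None
    finally:
        # Runs whether or not an exception occurred.
        log.append("finally")
    return result, log

ok = element_or_none([1, 2, 3], 1)
too_far = element_or_none([1, 2, 3], 5)
bad_type = element_or_none([1, 2, 3], "a")
print(ok)
```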
Sometimes, you may wish to catch an exception, take some action (such as printing
something to the console), and then re-raise the same exception. You can accomplish
this using the raise statement with no arguments:
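A sketch of re-raising. The function checked_index and the log list are hypothetical names introduced for this illustration.

```python
def checked_index(l, i, log):
    try:
        return l[i]
    except IndexError:
        # Take some action, then re-raise the same exception.
        log.append("index %r out of range" % (i,))
        raise

log = []
try:
    checked_index([1], 3, log)
    reraised = False
except IndexError:
    reraised = True
print(log)
```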
Since Python uses a class called Exception to handle exceptions internally, pro-
grammers can derive their own classes from Exception to create custom exceptions.
Derivatives of Exception should support __str__ at a minimum, and generally do not
require more than this.
class MyException(Exception):
    def __init__(s,string):
        s.str = string
    def __str__(s):
        # The first parameter is named s here, so refer to s rather than self.
        return s.str
4. A imports everything from B, using a statement like from B import *. This option
should be used with care, for the reasons discussed in the last item: module A
may contain items with the same names as those in B, and importing everything
would cause confusion as to which of several objects with the same name is
being referenced.
if __name__=="__main__":
    doSomething()
The module can take command-line arguments if it serves as some utility that the user
may invoke directly. Another common option, used when it is not meaningful for the
module to be invoked as a stand-alone program, is to run tests for the module’s func-
tionality within such an if-statement. (We shall discuss testing soon, in section 1.9.)
Once our code no longer fits comfortably into a single directory, we might elect to
structure it into multiple directories, as shown for example in figure 4. (Note that this
directory listing is hypothetical and does not correspond to the structure of the code in
this course.) Each directory starting from the package root that contains Python code (or
contains sub-directories with Python code) must contain a file called __init__.py,
even if that file is empty (which it typically is, and all __init__.py files in this
course will be empty).
1. Modules frequently refer to modules located in the same physical directory. For
example, we might imagine that X86Decode.py will rely upon X86.py, both of
which are located in the same directory (namely, X86). It is legal, albeit verbose, to
write the statement from Pandemic.X86.X86 import *. To reduce the tedium,
the default behavior of import X is to first check the directory in which the mod-
ule performing the import resides when looking for X.py. Hence, X86Decode.py
can simply use from X86 import *, as in the single-directory scenario.
2. Another frequent case is when a module imports a module from the parent
directory. For example, we can imagine that TestX86Decode.py will rely upon
X86Decode.py in the X86 directory, which is one level above the Test directory.
As before, it is legal to use a fully-qualified module path in TestX86Decode.py
such as import Pandemic.X86.X86Decode. As before, this is tedious and Python
offers a shortcut. Using two dots in a from-import statement, as in
from .. import X86Decode, will cause import to check the directory above
the one in which the importing module resides for the named module. (The form
import ..X86Decode is not valid syntax.) Each additional dot refers to one more
directory level up, provided that the path stays inside the package. Relative
imports of this kind only work when the importing module is itself loaded as
part of the package.
to invoke that module. For example, from the package directory, a command like
\python27\python.exe -m Pandemic.X86.Test.TestX86Decoder (note
the lack of a .py extension in this command) will cause the code in TestX86Decoder.py
to run.
1.9 Testing
Testing is a large and enormously important subject in software engineering. Testing
provides some degree of confidence that the program does what it is supposed to do.
It can reveal the presence of errors, although not guarantee their absence. Many books
have been written on the topic of testing alone. We shall not re-hash them here. A piece
of code that invokes another piece of code and performs a check that the result is as
expected will be referred to as a test case. A collection of test cases shall be known as
a test suite.
The code in this course will be heavily tested to guarantee some level of quality and
correctness, as well as to provide programming exercises for the student. You are given a
mostly-complete SMT-based binary program analysis framework where certain sections
of the code have been replaced by statements such as
raise ExerciseError("bitblast_Add"), and you will be expected to write those
code snippets. Each exercise is accompanied by a test case; the exercise is complete
when the test suite passes.
Any method upon a class derived from TestCase whose name begins with the let-
ters test is considered to be a test. The tests should perform their checks using the
unittest class methods as partially shown in table 3.
In this course, our test suites shall reside in modules that contain a single class derived
from TestCase. To invoke a test suite, we follow the lead of section 1.8.5. From the root
directory of the project, you will run a command such as
python -m unittest Pandemic.Solvers.Test.TestSMTSolver to in-
voke the tests related to the SMT solver. The command just listed will run all of the tests
in the module Pandemic.Solvers.Test.TestSMTSolver and report those tests
that have:
• Produced an error, meaning raised an exception that was not expected by the test.
• Failed, meaning that the code ran without raising any exceptions, but did not
produce the correct output.
• Passed, meaning that the code behaves properly (according to the test).
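The three outcomes can be reproduced with a small, self-contained sketch. The ExerciseError class here is a hypothetical stand-in for the one used in the course, and the functions under test are invented for illustration.

```python
import unittest

class ExerciseError(Exception):
    """Hypothetical stand-in for the course's ExerciseError."""

def unfinished(a, b):
    raise ExerciseError("unfinished")   # exercise not yet solved

def wrong_add(a, b):
    return a - b                        # runs, but gives the wrong answer

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_error(self):
        self.assertEqual(unfinished(1, 2), 3)   # error: unexpected exception
    def test_failure(self):
        self.assertEqual(wrong_add(1, 2), 3)    # failure: wrong result
    def test_pass(self):
        self.assertEqual(add(1, 2), 3)          # pass
    def helper(self):
        # Not a test: the name does not begin with "test".
        raise RuntimeError("never run")

suite = unittest.TestLoader().loadTestsFromTestCase(TestAdd)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun)
```

Here the suite is run programmatically so that the counts can be inspected; on the command line, python -m unittest reports the same three categories.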
It shall be your job as a student to erase the line containing the raise statement and
replace it with a proper solution as indicated by the surrounding comments and/or the
exercise manual. You will test your solutions as in the previous section 1.9.2, by invok-
ing the unit tests for the relevant module (and the pertinent module shall be indicated in
the exercise manual). Interpret the results of the unit tests as follows:
• All exercises initially produce an error, since they are implemented by raising an
ExerciseError exception. If you have removed the ExerciseError exception
and the test still produces an error, this corresponds to an error in your code, the
details of which shall be contained in the text describing the exception.
• If a test fails, this means that the code executed without raising any exceptions,
but that its result was incorrect. I.e., your code contained an error.
By default, the unittest module runs all test cases contained within the
module under test, i.e., it will report the errors and failures for all components of the ex-
ercise. Since the exercises in this course often contain multiple tests (corresponding to
multiple parts of the code that you must complete), this means that the test suite will pro-
duce a lot of output the first few times it is executed. To reduce the amount of output, it
is advisable to use the “fail fast” feature of the unittest module, which causes it to stop
as soon as the first error or failure is encountered. This mode is enabled by specifying the
-f flag on the command line, as in
python -m unittest -f Pandemic.Solvers.Test.TestSMTSolver.
1.10 Documentation
Documentation is an important part of any software project. High-quality documentation
should have the following properties:
• It provides high-level insight, alleviating the need for the user to read the source
code to get the gist of a particular class, function, or data item. Every module
(i.e., .py file) should begin with a description of what purpose it serves within
the project.
• It is thorough. Every class, function, and data item that is meant to be usable
outside of a module is described in the documentation.
• It contains links into the source code, so that the user can quickly peruse the
implementation of something if they desire more information than what the doc-
umentation provides.
• The documentation itself is contained within the source code. This way, the docu-
mentation can be automatically generated by running a tool directly on the source
code. Thus, the documentation never gets out-of-date with respect to the soft-
ware, unless a lazy programmer decides not to update the documentation when
they change something important.
The code within this course has documentation in-line, i.e., within the declaration of
classes, class methods, functions, and data items. We use the Sphinx Documentation
Generator project to automatically parse the documentation from the code and build
HTML files that can be perused externally. You should keep ./docs/_build/index.html
open in your browser at all times to ease your cognitive burden as a programmer.
• Declare data types and data values that will be laid out in memory precisely the
same way as they would be in C. For example, the code
(ctypes.c_uint32 * 4)(1,2,3,4) will allocate a piece of memory correspond-
ing to four uint32 types (i.e., 32-bit unsigned integers) from C laid out contigu-
ously in memory with no gaps, and where the values are 1, 2, 3, and 4.
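A brief sketch of such an array, using only the ctypes standard library module. The variable names are our own.

```python
import ctypes

# Four contiguous 32-bit unsigned integers, as in a C array uint32_t[4].
arr = (ctypes.c_uint32 * 4)(1, 2, 3, 4)

size = ctypes.sizeof(arr)    # total size in bytes
values = list(arr)           # read the elements back as Python integers

# ctypes integer types do no overflow checking: values wrap to 32 bits.
wrapped = ctypes.c_uint32(-1).value
print(size)
```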
We shall return to the ctypes library when our implementation requires it.
2 X86
lock      add         word ptr [eax],   bx
Prefix    Mnemonic    Operand #1        Operand #2
1. Zero or more optional prefixes (dictating sizes, atomicity behavior (LOCK), repe-
tition (REP), etc.),
Accordingly, our Python API will expose a class called Instruction that contains:
Five classes derive directly from Operand; we examine each in turn in the following
subsections. The interfaces for these classes are summarized in table 4, and explored in
detail in subsequent sections.
1. Register
2. Immediate
3. FarTarget
4. MemExpr
5. JccTarget
In X86 machine code, registers are often represented by an associated integer 0-7 (in
the corresponding order in which they are listed in the table above). We take this into
account in the design of the Register class to simplify encoding and decoding.
• When encoding instructions, we will need to retrieve the integer for a given
Register object. We accomplish this with a Register class method IntValue(),
which returns the number.
• When decoding instructions, we will need to create the correct Register ob-
ject for a given integer. It would be convenient to be able to create register
objects simply by passing a number to the constructor. By default, Register
object constructors accept an enumeration element such as Al. We achieve our
desires of convenience using an extra parameter in the Register constructor:
__init__(self,value,adjust_value=False). If adjust_value is True, the
parameter value is interpreted as an integer, and is adjusted into an enumeration
element by adding the base element of the enum. For example, constructing the
object Gb(1,True) yields Gb(Cl).
• We use a further trick to simplify decoding. We shall see later that, when it is time
to decode a register operand, we shall already have a Register-derived object of
the correct type available to us (let’s call that object o). We will want to create a
new object of the same type with the correct register value corresponding to an
integer obtained during decoding. We override in Register one of Python’s built-
in object methods, __call__(self,int), which allows objects to be called,
as in o(1). Our __call__ implementation constructs a new object of the same
type as o with the register value specified by the integer int. While decoding,
we can simply “call” o with that integer to create the proper Register object.
For example, with o = Gb(Al), the code snippet o(1) yields Gb(Cl). This is an
instance of the factory method design pattern in software engineering.
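The three conveniences just described can be sketched as follows. This is a simplified illustration and not the library's actual code: the enumeration is simulated with consecutive integers starting at an arbitrary base, and the class attribute base and the __eq__ method are assumptions added so that the example is self-contained.

```python
# Hypothetical stand-in for the simulated enumeration: the eight 8-bit
# registers occupy consecutive integers starting at an arbitrary base.
Al, Cl, Dl, Bl, Ah, Ch, Dh, Bh = range(16, 24)

class Register:
    base = None  # first enumeration element of this register family

    def __init__(self, value, adjust_value=False):
        # With adjust_value set, value is a machine-code number (0-7) and is
        # converted into an enumeration element by adding the base element.
        self.value = value + self.base if adjust_value else value

    def IntValue(self):
        # The number used in machine code: the offset from the base element.
        return self.value - self.base

    def __call__(self, n):
        # Factory: a new object of the same type for register number n.
        return type(self)(n, True)

    def __eq__(self, other):
        return type(self) is type(other) and self.value == other.value

class Gb(Register):
    base = Al

o = Gb(Al)
print(Gb(1, True) == Gb(Cl), o(1) == Gb(Cl), o(1).IntValue())
```

Because __call__ uses type(self), the same inherited method works unchanged for every class derived from Register.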
Immediates are stored internally in a field called value. This is a guarded integer object,
as previously discussed in section 2.1. That is, upon construction (and upon any later
modification), the integer value is truncated to the proper size.
For reasons similar to those described in the previous section 2.2.2, we simplify
the decoder by overriding Python object’s __call__ method. Given an object i de-
rived from Immediate (as illustrated in the table), we can call i(n) to obtain a new
Immediate object of the same type as i, whose integer value is initialized to n.
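A minimal sketch of this behavior is given below. The guarded integer is approximated here with a Python property that masks the value on every assignment; the real library uses its own guarded integer class, and the bits attribute and the Ib class are assumptions made for the example.

```python
class Immediate:
    bits = None  # width of the immediate in bits, set by each subclass

    def __init__(self, value):
        self.value = value  # goes through the guarded setter below

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # Guard: truncate on construction and on every later modification.
        self._value = v & ((1 << self.bits) - 1)

    def __call__(self, n):
        # Factory: a new immediate of the same type, initialized to n.
        return type(self)(n)

class Ib(Immediate): bits = 8
class Iw(Immediate): bits = 16
class Id(Immediate): bits = 32

i = Iw(0x0)
j = i(0x12345)  # a new Iw whose value has been truncated to 16 bits
print(type(j).__name__, hex(j.value))
```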
The JccTarget class is not very noteworthy, except to say that it is not a perfect fit for
the call instruction. Consider call eax, represented as Instruction([],Call,Gd(Eax)),
and notice that the return address is not represented within the instruction. We address
this issue after instruction decoding, in section 2.12.1.
The classes AP32 and AP16 hold their data members (the numeric segment and offset)
in guarded integers, which are accessible through the class properties Seg and Off. AP32
uses 32-bit offsets, while AP16 uses 16-bit offsets. Both use 16-bit segments.
• Seg: the enumeration element corresponding to the segment into which the access
is taking place.
• size: the enumeration element corresponding to the size of the access. Mb desig-
nates 8-bit, Mw 16-bit, Md 32-bit, etc.
• Disp: a displacement (differs in size for 16 vs. 32-bit expressions, and may be
None).
• The class method DefaultSeg() returns the segment that would normally be used
for this particular combination of base register and displacement, if no segment
prefixes were specified.
Mem32 also exposes a data item (class property) called ScaleFac. MOD R/M-16 expres-
sions cannot have scale factors, so this data was not placed into the common MemExpr
class. ScaleFac is a 2-bit integer denoting the scale factor (such as 4 in the expression
[eax+ebx*4]). In the table below, note that a scale factor of 1 << x is represented
by x.
ScaleFac (binary)   Scale Factor
00                  1
01                  2
10                  4
11                  8
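The relationship in the table is a single shift in each direction. The two helper functions below are hypothetical names introduced only to illustrate the conversion.

```python
def scale_from_field(scalefac):
    # Convert the 2-bit ScaleFac field into the multiplier it denotes.
    if not 0 <= scalefac <= 3:
        raise ValueError("ScaleFac is a 2-bit field")
    return 1 << scalefac

def field_from_scale(scale):
    # Convert a multiplier (1, 2, 4, or 8) back into the 2-bit field.
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1, 2, 4, or 8")
    return scale.bit_length() - 1

print([scale_from_field(x) for x in range(4)])
```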
• We can run our code inside of IDA and use the IDA API function get_byte() as a
byte stream.
• We can use a PE-parsing library and provide bytes from an executable file.
• We can use our code in conjunction with a debugger library, and use whatever
facilities it offers to read bytes out of memory.
The instruction and MOD R/M decoders shall be written so that they
obtain bytes as needed from a byte stream object. X86ByteStream.py defines the
interface in a class called StreamObj, whose methods are shown in table 5. Any in-
put source that can be accessed in terms of a class derived from StreamObj can be used
seamlessly within our framework; such classes need only override GetByteInternal(),
which defaults to retrieving a byte from an array (called bytes) held within the object.
One note about the StreamObj class is that, since X86-32 instructions cannot exceed
16 bytes in length, Byte() will throw an InvalidInstruction exception if more than
16 bytes have been consumed since the last call to SetPos() (and Word() and Dword()
are implemented in terms of Byte(), so they too can throw exceptions).
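The following sketch shows one way such a class could be organized. It uses only the method names mentioned above; the constructor signature, the pos and consumed fields, and the little-endian assembly in Word() and Dword() are assumptions for the sake of a runnable example, so consult table 5 for the real interface.

```python
class InvalidInstruction(Exception):
    pass

class StreamObj:
    MAX_LEN = 16  # the limit quoted in the text

    def __init__(self, data=b"", pos=0):
        self.bytes = list(data)  # default backing store
        self.SetPos(pos)

    def SetPos(self, pos):
        # Moving the cursor starts a new instruction, so reset the counter.
        self.pos = pos
        self.consumed = 0

    def GetByteInternal(self):
        # Derived classes override this to read from IDA, a PE file, etc.
        return self.bytes[self.pos]

    def Byte(self):
        if self.consumed >= self.MAX_LEN:
            raise InvalidInstruction("more than 16 bytes consumed")
        b = self.GetByteInternal()
        self.pos += 1
        self.consumed += 1
        return b

    def Word(self):
        lo = self.Byte()
        return lo | (self.Byte() << 8)

    def Dword(self):
        lo = self.Word()
        return lo | (self.Word() << 16)

s = StreamObj(bytes(range(32)))
print(hex(s.Byte()), hex(s.Word()), hex(s.Dword()))
```

Since Word() and Dword() go through Byte(), the length check lives in exactly one place.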
all three components work and their results agree for that particular instruction. We will
create a random instruction generator in section 2.8, and thus we can ensure test
coverage for all components on all parts of the instruction set. This testing is what
gives us confidence in the quality of our X86 library.
Two notes. First, the reader might have the idea that the reverse process could also
work for testing: generate random bytes, decode them, encode them, and check that
the bytes are the same. This does not work well because some instructions can be
encoded in multiple different ways. Secondly, note that, while our method of testing
achieves high code coverage and gives us reasonable confidence in our library, in fact,
the tests are not checking for “correctness” of our library per se. The tests ensure that
the encoder and decoder are consistent. Our tests will not detect situations where the
encoding and decoding components both contain the same error (i.e., instructions are
encoded erroneously, and decoded in the same erroneous way).
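The first note can be made concrete with add eax, ebx, which has two equally valid two-byte encodings. The sketch below builds both by hand; the helper modrm and the constants EAX and EBX are written out here for illustration and are not part of the library.

```python
def modrm(mod, ggg, rm):
    # Pack the three MOD R/M fields into one byte: MOD(2) GGG(3) RM(3).
    return (mod << 6) | (ggg << 3) | rm

EAX, EBX = 0, 3  # register numbers as used in machine code

# Opcode 01 is ADD r/m32, r32: the destination eax goes in RM, ebx in GGG.
enc_a = bytes([0x01, modrm(3, EBX, EAX)])

# Opcode 03 is ADD r32, r/m32: the destination eax goes in GGG, ebx in RM.
enc_b = bytes([0x03, modrm(3, EAX, EBX)])

print(enc_a.hex(), enc_b.hex())
```

A decoder maps both byte sequences to the same instruction, but an encoder can emit only one of them, so a decode-then-encode round trip would report a spurious mismatch for whichever form the encoder does not choose.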
80 7F                        bx     si      7Fh
Encoded Memory Expression    Base   Index   Displacement
• Decode(self,stream), which decodes the raw MOD R/M bytes provided by the
stream object and (as in section 2.3) sets the fields listed just previously.
• Interpret(self), which examines the sub-parts of the MOD R/M and returns a
quadruple (base,index,disp,dispsize), where:
– base is the base register for the memory expression (or None).
– index is its index register (or None).
– disp is its displacement (or None).
– dispsize is the number of bytes in the displacement (0, 1, or 2).
• Encode(self), which encodes the MOD R/M parts into a list of bytes.
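The three methods can be sketched for MOD R/M-16 as shown below, following the reference table in section 3.1.1. This is an illustration under several assumptions: registers are represented by plain strings, the displacement is kept in a field named Disp, a lone SI, DI, BP, or BX is reported as the base, and ListStream is a minimal stand-in for a real stream object.

```python
class ListStream:
    # Minimal stand-in for a byte stream with Byte() and Word().
    def __init__(self, data):
        self.data, self.pos = list(data), 0
    def Byte(self):
        b = self.data[self.pos]
        self.pos += 1
        return b
    def Word(self):
        lo = self.Byte()
        return lo | (self.Byte() << 8)

# (base, index) for each RM value, per the MOD R/M-16 reference table.
RM16 = [("bx", "si"), ("bx", "di"), ("bp", "si"), ("bp", "di"),
        ("si", None), ("di", None), ("bp", None), ("bx", None)]

class ModRM16:
    def Decode(self, stream):
        b = stream.Byte()
        self.MOD, self.GGG, self.RM = b >> 6, (b >> 3) & 7, b & 7
        self.Disp = None
        if self.MOD == 1:
            self.Disp = stream.Byte()
        elif self.MOD == 2 or (self.MOD == 0 and self.RM == 6):
            self.Disp = stream.Word()

    def Interpret(self):
        if self.MOD == 3:
            raise ValueError("MOD 3 denotes a register, not memory")
        if self.MOD == 0 and self.RM == 6:
            return (None, None, self.Disp, 2)  # [sword]: displacement only
        base, index = RM16[self.RM]
        return (base, index, self.Disp, (0, 1, 2)[self.MOD])

    def Encode(self):
        out = [(self.MOD << 6) | (self.GGG << 3) | self.RM]
        if self.Disp is not None:
            size = self.Interpret()[3]
            out += [(self.Disp >> (8 * i)) & 0xFF for i in range(size)]
        return out

m = ModRM16()
m.Decode(ListStream([0x42, 0x7F]))  # [BP+SI+sbyte] with displacement 7Fh
print(m.Interpret(), m.Encode())
```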
Following the declaration of the classes for AOTDL is an array, AOTtoAOTDL, which
assigns an element of this language to each abstract operand type. Table 9 shows some
examples. The right-hand columns are the actual Python objects that are used to repre-
sent the abstract operand types on the left.
• s: if the operand type varies with the OPSIZE prefix, use the smaller size if this
variable is True.
• a: if the operand type varies with the ADDRSIZE prefix, use the smaller size if this
variable is True.
To actually generate a random operand, we retrieve the AOTDL description for the
operand type, and use pattern-matching to decide which logic should execute. The
pattern-matching rules and the associated random operand generation logic (the function
generate(a)) are shown in table 8. The reader who feels confident with pattern-matching
should skip the following bulleted list and read the table instead.
• The AOTDL element Exact(o) specifies that o is the only legal value for some
abstract operand type. Therefore, to “randomly generate” a legal element of this
abstract operand type, we only have one choice: o itself. We return o as-is.
• ExactSeg(o) specifies that the only legal value for some abstract operand type
is a memory expression o, whose segment may vary. Therefore, our choices in
random generation are the exact copies of the memory expression o, with differ-
ent segments. Recall from section 2.2.6 that we overrode __call__(seg) in the
MemExpr base class in order to produce a new memory expression that is an exact
duplicate of the current one, except that its segment is set to seg. Thus, we can
simply “call” o, as in return o(rnd_seg()).
• GPart(o) AOTDL elements specify registers that have the same type as the pa-
rameter o. Recall from section 2.2.2 that we overrode __call__(int) in the
Register base class in order to produce a new register of the same type with
the register enumeration element corresponding to the integer int. Most regis-
ter types have eight possible values, so they can be handled with the same logic,
namely return o(rnd_regno()). Register types with fewer than eight values,
such as SegReg, are handled individually.
generate a register or a memory expression. Register types are handled exactly as just
described under GPart. For memory expressions, if the randomly-generated boolean
a is True, we generate a Mem16 object with random components; otherwise,
we generate a Mem32 object.
• SignedImm(o) AOTDL elements describe 8-bit constant values that are sign-extended
to larger values. o can describe immediate constants or jump targets. We must in-
spect o’s type to determine which case applies.
– If o is an X86 Immediate object (such as Iw(0x0)), we generate a random 8-
bit constant and sign-extend it to the larger size. As before, having overridden
Immediate.__call__(value) simplifies the task.
– If o is a JccTarget, return a JccTarget with randomly-generated taken and
not-taken addresses.
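The Immediate case of SignedImm amounts to picking a random byte and sign-extending it. A minimal sketch follows; the function names are ours, and the result is returned as a plain unsigned integer rather than wrapped in an Immediate object.

```python
import random

def sign_extend8(b, bits):
    # Sign-extend the 8-bit value b to the given width, as an unsigned integer.
    mask = (1 << bits) - 1
    return (b - 0x100) & mask if b & 0x80 else b

def random_signed_imm(bits, rng=random):
    # Random 8-bit constant, sign-extended to the larger operand size.
    return sign_extend8(rng.randrange(256), bits)

print(hex(sign_extend8(0x7F, 32)), hex(sign_extend8(0x80, 32)))
```

In the library, the resulting integer would then be handed to the Immediate object's __call__ to obtain an operand of the right type.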
Implementing random operand generation is easy with the visitor pattern. Our pre-
vious visitor classes have all derived from Visitor, which was the generic visitor im-
plementation for when the visit method should take one operand. This algorithm uses
two parameters: the abstract operand type, and the triple of booleans (m,s,a). Thus we
derive our random operand generator, X86RandomOperand, from Visitor2 rather than
Visitor. We override the MakeMethodName method to produce names such that roughly
every row of the table has one single method responsible for implementing it. These
methods have names like visit_Immediate_MemExpr, visit_Immediate_FarTarget,
and visit_GPart_SegReg.
Clients of the random operand generator should call the method gen(aop,(m,s,a)),
which takes care of retrieving the X86AOTDL object associated with the abstract operand
type aop prior to calling visit. The logic in the visit_ methods is more or less exactly
as described in the table.
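To illustrate the dispatch mechanism, here is a small sketch of a two-argument visitor whose method names are produced by an overridable MakeMethodName. It is our own reconstruction of the idea, not the library's Visitor2, and the classes GPart, Gb, and SegReg below are empty placeholders.

```python
class Visitor2:
    def MakeMethodName(self, a, b):
        # Default naming: one method per pair of argument types.
        return "visit_%s_%s" % (type(a).__name__, type(b).__name__)

    def Default(self, a, b):
        raise NotImplementedError(self.MakeMethodName(a, b))

    def visit(self, a, b):
        # Look the method up by name; fall back to Default if it is absent.
        method = getattr(self, self.MakeMethodName(a, b), self.Default)
        return method(a, b)

# Placeholder AOTDL element and register classes for the demonstration.
class GPart:
    def __init__(self, o):
        self.o = o
class Gb: pass
class SegReg: pass

class DemoGenerator(Visitor2):
    def MakeMethodName(self, a, flags):
        # Name methods after the AOTDL element and the type it wraps.
        return "visit_%s_%s" % (type(a).__name__, type(a.o).__name__)

    def visit_GPart_SegReg(self, a, flags):
        return "segment register: handled individually"

    def Default(self, a, flags):
        return "generic eight-valued register"

g = DemoGenerator()
flags = (False, False, False)  # the triple (m,s,a)
print(g.visit(GPart(SegReg()), flags))
print(g.visit(GPart(Gb()), flags))
```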
Generating random instructions is just as simple.
1. Pick a random X86 mnemonic enumeration element, mnem, and retrieve its list of
encodings from the encoder table.
2. Choose a random encoding, and retrieve its list of abstract operand type enumer-
ation elements, aotlist.
5. Return Instruction([],mnem,op1,op2,op3).
Figure 7: typecheck_x86(a,x)
a                         x              Type-Checking Logic             Return Value
Exact(o)                  x              x == o                          MATCHES
ExactSeg(o)               MemExpr(_)     x == o                          MATCHES
ExactSeg(o)               MemExpr(_)     x == o(x.Seg)                   SegPFX(x.Seg)
GPart(g)                  x              type(g) == type(x)              MATCHES
RegOrMem(r,m)             Register(y)    type(r) == type(y)              MATCHES
RegOrMem(r,m)             Mem32(_)       m.size == x.size                AddrPFX(False)
RegOrMem(r,m)             Mem16(_)       m.size == x.size                AddrPFX(True)
ImmEnc(Immediate(i))      Immediate(y)   type(i) == type(y)              MATCHES
ImmEnc(JccTarget(_))      JccTarget(_)                                   MATCHES
ImmEnc(FarTarget(_))      AP16(_)                                        AddrPFX(False)
ImmEnc(FarTarget(_))      AP32(_)                                        AddrPFX(True)
ImmEnc(MemExpr(_))        Mem32(_)       x.BaseReg == None and           AddrPFX(False)
                                         x.IndexReg == None
ImmEnc(MemExpr(_))        Mem16(_)       x.BaseReg == None and           AddrPFX(True)
                                         x.IndexReg == None
SignedImm(JccTarget(_))   JccTarget(_)                                   MATCHES
SignedImm(Id(_))          Id(i)          i <= 0x7F or i >= 0xFFFFFF80    MATCHES
SignedImm(Iw(_))          Iw(i)          i <= 0x7F or i >= 0xFF80        MATCHES
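The two SignedImm rules for immediates test whether an unsigned value is the sign-extension of some 8-bit constant. A minimal sketch of that test, with a function name of our choosing, is:

```python
def fits_signed8(i, bits):
    # True if the unsigned bits-wide value i equals a sign-extended byte.
    top = (1 << bits) - 0x80  # 0xFF80 for 16 bits, 0xFFFFFF80 for 32 bits
    return 0 <= i <= 0x7F or top <= i < (1 << bits)

print(fits_signed8(0x7F, 32), fits_signed8(0x80, 32), fits_signed8(0xFFFFFF80, 32))
```

Values are treated as unsigned because immediates are held in guarded integers that have already been truncated to the operand size.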
• sizeo: if None, this operand is not affected by the presence of the OPSIZE prefix.
If True or False, the operand respectively requires the presence or absence of the
OPSIZE prefix.
return None
All of the valid type-checking rules are listed in figure 7, apart from the SizePrefix and
AddrPrefix AOTDL elements just discussed. Any other combination of operands and
operand types is illegal. Such combinations are routed to the Visitor2 method Default,
which we override to return a value indicating that there was no match. The client should
call the method check(aop,opnd), where aop is an abstract operand type enumeration
element, and opnd is an X86 operand derived from Operand.
Some abstract operand types (AOTs) and their AOTDL translations are shown in table 9.
See the exercise manual for exercises involving type-checking X86 operands.
2.10 Encoding
AOTDL pays off again handsomely in developing the X86 encoder. X86Encoder.py
declares the X86Encoder class, which is what the slides called the encoder context, and
whose interface is shown in table 10.
The EncodeInstruction class method is the interface through which clients encode
single instructions. It is a very direct implementation of the instruction encoding process
as laid out in the slides.
2. For each encoding, ask the X86TypeChecker object tc held within the encoder
context whether the instruction matches it. If a valid encoding enc is
found, retrieve the required prefixes, store them into the variables held in the
encoder context, and copy enc’s stem bytes.
class Native16(X86Enc):
    def Encode(self,enc):
        enc.sizepfx = True
class ModRMGroup(X86Enc):
    def Encode(self,enc):
        enc.ModRM.GGG = self.ggg
of x86op and the X86AOTDL entry corresponding to aop, and invokes a special-
ized function for each case. The encoding rules are duplicated from the slides in
figure 8.
5. Concatenate the instruction parts, yielding X86 machine code. At this stage in
the encoding process, all of the information has been collected in order to encode
the instruction. We simply concatenate the prefixes, stem, optional MOD R/M, and
optional immediate operand bytes, and return them as a list.
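The final concatenation step can be sketched as below. The function assemble and its parameters are hypothetical; immediates are modeled as (value, size-in-bytes) pairs, mirroring the immediates.append(value,size) calls in figure 8, and are written out in little-endian order as X86 requires.

```python
def le_bytes(value, size):
    # Little-endian byte list of the given length.
    return [(value >> (8 * i)) & 0xFF for i in range(size)]

def assemble(prefixes, stem, modrm=None, immediates=()):
    # Concatenate prefixes, stem, optional MOD R/M bytes, and immediates.
    out = list(prefixes) + list(stem)
    if modrm is not None:
        out += list(modrm)
    for value, size in immediates:
        out += le_bytes(value, size)
    return out

# lock add word ptr [eax], bx: LOCK (F0), OPSIZE (66), stem 01, MOD R/M 18.
print([hex(b) for b in assemble([0xF0, 0x66], [0x01], [0x18])])

# add eax, 12345678h: stem 05 followed by a 4-byte immediate.
print([hex(b) for b in assemble([], [0x05], None, [(0x12345678, 4)])])
```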
Figure 8: encode_x86(a,x)
a                           x              Encoder Logic
Exact(o)                    _
ExactSeg(o)                 _
GPart(g)                    _              ModRM.GGG = x.IntValue()
RegOrMem(r,m)               Register(y)    ModRM.RM = y.IntValue()
                                           ModRM.MOD = 3
RegOrMem(r,m)               Mem32(_)       EncodeModRM32(x)
RegOrMem(r,m)               Mem16(_)       EncodeModRM16(x)
ImmEnc(Immediate(_))        Id(i)          immediates.append(i,4)
ImmEnc(Immediate(_))        Iw(i)          immediates.append(i,2)
ImmEnc(FarTarget(_))        AP32(s,o)      immediates.append(s,2)
                                           immediates.append(o,4)
ImmEnc(FarTarget(_))        AP16(s,o)      immediates.append(s,2)
                                           immediates.append(o,2)
ImmEnc(MemExpr(_))          Mem32(_)       immediates.append(x.Disp,4)
ImmEnc(MemExpr(_))          Mem16(_)       immediates.append(x.Disp,2)
SignedImm(Immediate(_))     _              immediates.append(x.value,2)
ImmEnc(JccTarget(t,f))      JccTarget(_)   immediates.append(t - next_addr,4)
SignedImm(JccTarget(t,f))   JccTarget(_)   immediates.append(t - next_addr,4)
SizePrefix(y,n)             _              if sizepfx: encode_x86(y,x)
                                           else: encode_x86(n,x)
AddrPrefix(y,n)             _              if addrpfx: encode_x86(y,x)
                                           else: encode_x86(n,x)
At the heart of our X86 decoder engine is a representation of the opcode maps from
the Intel manuals, as partially shown in figure 9. In general, when writing code that
deals with X86 machine code, it is best to represent the data in a way that is similar to
the way it is laid out in the manuals. This aids faithfulness to the manuals, makes it
easy to verify that the transcribed data matches them, and eases maintenance. In
addition, the pervasive use of tables makes the code easier to write and harder to get
wrong.
The slides laid out a simple description language for representing the entries of op-
code maps. X86DecodeTable.py contains the Python class declarations for the
DECDL decoder table description language, all of which derive from X86DECDL. The
language, and our treatment of it, are very similar to AOTDL and its accompanying
classes as previously described in section 2.7. Those classes and their data members are shown
in table 11.
In addition to the data members, each of these classes exports a method called
decode(decoder). The discussion of these methods is left to the next section.
After the declarations of these classes, the bulk of X86DecodeTable.py is the
array decoding_table, of length 1024. It contains one such X86DECDL object per in-
struction stem.
2. Consume the stem. X86 employs so-called escape bytes as part of its variable-
length instruction encoding scheme. Instruction stems may begin with 0F, 0F
38, or 0F 3A. The class method DecodeStem(first_byte) consumes up to two
more bytes from the stream object, and returns an integer from 00-FF for instruc-
tions with no escape bytes, from 100-1FF for instructions that began with 0F,
from 200-2FF for instructions that began with 0F 38, and from 300-3FF for
instructions that began with 0F 3A.
3. Retrieve decoder entry from decoder table. Given the stem returned previously,
use it as an index into decoding_table from X86DecodeTable.py. This
gives us an X86DECDL object.
4. Process decoder entry. Table 13 shows the logic for decoding DECDL entries. This
logic is implemented within the decode method of the X86DECDL classes. Most
of the DECDL entries simply query the state of the X86Decoder object and call
the decode method on some X86DECDL object held within. The X86DECDL class
Invalid raises an InvalidInstruction exception. Direct(m,o) X86DECDL ob-
jects correspond to instructions with mnemonic m and list of abstract operand type
enumeration elements o. In this case, we return m,o.
5. Decode operands. If we have reached this point in the code, we have a mnemonic
and a list of abstract operand types. As usual, operand decoding is implemented
via a visitor pattern within the X86Decoder class. We invoke visit on each
abstract operand type in the list o to create the actual X86 Operand objects. The
decoding rules are shown in figure 10.
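Steps 2 through 4 can be sketched in a few lines. The version below is a simplification under stated assumptions: DecodeStem is written as a free function that receives a callable next_byte in place of the stream object, mnemonics are plain strings, and the table is populated with only two entries (90 for nop and 0F A2 for cpuid) for demonstration.

```python
class InvalidInstruction(Exception):
    pass

def DecodeStem(first_byte, next_byte):
    # Map the stem bytes onto a single index in the range 000-3FF.
    if first_byte != 0x0F:
        return first_byte           # no escape byte: 00-FF
    second = next_byte()
    if second == 0x38:
        return 0x200 | next_byte()  # 0F 38 xx: 200-2FF
    if second == 0x3A:
        return 0x300 | next_byte()  # 0F 3A xx: 300-3FF
    return 0x100 | second           # 0F xx: 100-1FF

class Invalid:
    def decode(self, decoder):
        raise InvalidInstruction()

class Direct:
    def __init__(self, mnem, aots):
        self.mnem, self.aots = mnem, aots
    def decode(self, decoder):
        # Nothing to query: return the mnemonic and abstract operand types.
        return self.mnem, self.aots

# One entry per stem; nearly everything is left Invalid in this sketch.
decoding_table = [Invalid()] * 1024
decoding_table[0x090] = Direct("Nop", [])
decoding_table[0x1A2] = Direct("Cpuid", [])

rest = iter([0xA2])
stem = DecodeStem(0x0F, lambda: next(rest))
print(hex(stem), decoding_table[stem].decode(None))
```

Folding the escape bytes into the index is what lets a single flat array of length 1024 cover all four opcode maps.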
MOD R/M object is held in a class member called _modrm, which is initially set to None.
Client code accesses the MOD R/M information via a class property, ModRM. Whenever
the instruction decoding code references ModRM, the getter method checks to see whether
_modrm is None, and returns _modrm directly if not. If _modrm was None, the getter
decodes either a MOD R/M-16 or MOD R/M-32 depending upon the value of addrpfx,
i.e. whether an ADDRSIZE prefix was consumed during prefix decoding, stores it into
_modrm, and returns it. As before, this allows for seamless handling of MOD R/M.
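The lazy getter described above follows a common caching pattern, sketched here. The class name and the two decoder callables are placeholders, and we assume that addrpfx being True selects MOD R/M-16.

```python
class DecoderSketch:
    def __init__(self, addrpfx, decode16, decode32):
        self.addrpfx = addrpfx
        self._decode16, self._decode32 = decode16, decode32
        self._modrm = None  # nothing decoded yet

    @property
    def ModRM(self):
        # Decode on first access only; later accesses reuse the stored object.
        if self._modrm is None:
            decode = self._decode16 if self.addrpfx else self._decode32
            self._modrm = decode()
        return self._modrm

calls = []
d = DecoderSketch(True,
                  lambda: calls.append(16) or "modrm16",
                  lambda: calls.append(32) or "modrm32")
print(d.ModRM, d.ModRM, calls)
```

Instructions that never touch ModRM never consume a MOD R/M byte from the stream, which is the point of deferring the decode.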
table 14. To create this information, we simply inspect an instruction’s mnemonic and
operand types after decoding and create the required class.
After decoding an instruction, we inspect its mnemonic and operands to determine its
control flow characteristics. This logic is shown in table 15, and implemented in class
method CreateFlow.
3 Reference Material
3.1 X86
3.1.1 MOD R/M-16
r8 AL CL DL BL AH CH DH BH
r16 AX CX DX BX SP BP SI DI
r32 EAX ECX EDX EBX ESP EBP ESI EDI
mm MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7
xmm XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7
ymm YMM0 YMM1 YMM2 YMM3 YMM4 YMM5 YMM6 YMM7
sreg ES CS SS DS FS GS
ccc CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7
ddd DR0 DR1 DR2 DR3 DR4 DR5 DR6 DR7
digit 0 1 2 3 4 5 6 7
ggg 000 001 010 011 100 101 110 111
effective address mod R/M value of mod R/M byte (hex)
[BX+SI] 00 000 00 08 10 18 20 28 30 38
[BX+DI] 001 01 09 11 19 21 29 31 39
[BP+SI] 010 02 0A 12 1A 22 2A 32 3A
[BP+DI] 011 03 0B 13 1B 23 2B 33 3B
[SI] 100 04 0C 14 1C 24 2C 34 3C
[DI] 101 05 0D 15 1D 25 2D 35 3D
[sword] 110 06 0E 16 1E 26 2E 36 3E
[BX] 111 07 0F 17 1F 27 2F 37 3F
[BX+SI+sbyte] 01 000 40 48 50 58 60 68 70 78
[BX+DI+sbyte] 001 41 49 51 59 61 69 71 79
[BP+SI+sbyte] 010 42 4A 52 5A 62 6A 72 7A
[BP+DI+sbyte] 011 43 4B 53 5B 63 6B 73 7B
[SI+sbyte] 100 44 4C 54 5C 64 6C 74 7C
[DI+sbyte] 101 45 4D 55 5D 65 6D 75 7D
[BP+sbyte] 110 46 4E 56 5E 66 6E 76 7E
[BX+sbyte] 111 47 4F 57 5F 67 6F 77 7F
[BX+SI+sword] 10 000 80 88 90 98 A0 A8 B0 B8
[BX+DI+sword] 001 81 89 91 99 A1 A9 B1 B9
[BP+SI+sword] 010 82 8A 92 9A A2 AA B2 BA
[BP+DI+sword] 011 83 8B 93 9B A3 AB B3 BB
[SI+sword] 100 84 8C 94 9C A4 AC B4 BC
[DI+sword] 101 85 8D 95 9D A5 AD B5 BD
[BP+sword] 110 86 8E 96 9E A6 AE B6 BE
[BX+sword] 111 87 8F 97 9F A7 AF B7 BF
AL/AX/EAX/MM0/XMM0/YMM0 11 000 C0 C8 D0 D8 E0 E8 F0 F8
CL/CX/ECX/MM1/XMM1/YMM1 001 C1 C9 D1 D9 E1 E9 F1 F9
DL/DX/EDX/MM2/XMM2/YMM2 010 C2 CA D2 DA E2 EA F2 FA
BL/BX/EBX/MM3/XMM3/YMM3 011 C3 CB D3 DB E3 EB F3 FB
AH/SP/ESP/MM4/XMM4/YMM4 100 C4 CC D4 DC E4 EC F4 FC
CH/BP/EBP/MM5/XMM5/YMM5 101 C5 CD D5 DD E5 ED F5 FD
DH/SI/ESI/MM6/XMM6/YMM6 110 C6 CE D6 DE E6 EE F6 FE
BH/DI/EDI/MM7/XMM7/YMM7 111 C7 CF D7 DF E7 EF F7 FF
3.1.4 AOTDL
Language Element            Meaning
aotdl := Exact x86op        x86op exactly.
       | ExactSeg x86op     x86op memory expression whose segment may vary.
3.1.5 DECDL
Language Element                 Meaning
decdl := Direct mnem [aotdl]     Mnemonic mnem, operands [aotdl].
       | Invalid                 Illegal instruction.