L05 Distributed System


277 - Introduction - DSM basics

Welcome back. We now launch into the study of distributed systems. This is an exciting field. On a
personal note, I started my academic career as a researcher of distributed systems in the mid-80s.
There are many parallels between distributed systems and parallel systems. What fundamentally
distinguishes a distributed system from a parallel system is:
• the individual autonomy for the nodes of a distributed system as compared to a parallel system,
• and the fact that the interconnection network that connects all the nodes in a distributed system
is wide open to the world, as opposed to being confined within a rack or a room or a box.
However, as the feature size of transistors in silicon continues to shrink due to advances in process
technology and breakthroughs in VLSI technology, many of the issues that are considered to be in the
domain of distributed systems are surfacing even within a single chip, but I digress.
In this lesson we are going to learn the fundamental communication mechanisms in distributed
systems and what an operating system has to do to make communication efficient.
As in the previous lessons, you will see the symbiotic relationship between hardware in the form of
networking gear and the operating system software stack, particularly the protocol stack, to make the
communication efficient.
We'll start this lesson module with a definition and a shared understanding of what we mean by a
distributed system. But first, a quiz to get you started.
278 - What is a Distributed System Question

So, I want to know: what do you understand by a distributed system? I am going to give you three
choices.
• The first choice says that a distributed system is a collection of nodes connected by a Local Area
Network or a Wide Area Network.
• The second choice says that a distributed system is one in which communication happens only via
messages.
• The third choice concerns timing. For events happening on the same node, like A and B here, the
time between them is the event execution time, Te. An event that spans nodes is a communication
event: node N1 sends a message to node N2, for example the communication from A to C, and the
time for that is the message communication time, Tm. So the third choice is saying that the
communication time Tm is significantly larger than the event execution time Te.

279 - What is a Distributed System Solution

If you chose all three, you're right on. We'll talk about why all of these choices make perfect sense.
280 - Distributed Systems Definition

• So a distributed system is a collection of nodes which are interconnected by a Local Area
Network or a Wide Area Network. The Local Area Network may be implemented using twisted
pair, coaxial cable, or optical fiber. A Wide Area Network could be implemented using satellite
communication, microwave links, and so on. The media access protocols that may be used for
communication among these nodes on a Local Area Network or a Wide Area Network may be
ATM, Ethernet, and so on and so forth. That's sort of the picture of what a distributed system is.
• There's no physical memory that is shared between any nodes of the distributed system. So the
only way nodes can communicate with one another is by sending messages on the local area
network to one another.
• There is event computation time Te, which is the time it takes a single node to do some
meaningful processing. A node may also communicate with other nodes in the system and we
have Tm as the communication time or the messaging time. So the third property of the
distributed system is that Tm is significantly larger than Te.

So these are the three properties of a distributed system.


• They are connected by some sort of local area network or wide area network, forming a
collection of nodes.
• There is no physically shared memory, so the only way the nodes can communicate with one
another is via messages that are sent between them using the network.
• The third property is the fact that the message communication time is significantly larger than
event computation time within a node.
You probably remember a good friend, Leslie Lamport. I introduced him to you when we talked
about parallel systems, and I said that we will see him again. In parallel systems, he's the one who gave
us the notion of sequential consistency, and the same person that we're going to be talking about in this
lecture.
In particular Lamport has a definition for a distributed system and the definition of a distributed
system verbatim goes like this: a system is distributed if the message transmission time, Tm, is not
negligible compared to the time between events in a single process.
What is the implication of this definition? Interestingly, even a cluster is a distributed system by this
definition. We've been talking about clusters a lot when we discussed parallel systems and I told you
that clusters are the workhorses of data centers today.
• Even a cluster is a distributed system by this definition because processors have become
blazingly fast, so the event computation has shrunk quite a bit.
• On the other hand the message communication time is also becoming better but not as fast as the
computation time that happens on a single processor and therefore even on a cluster which is all
contained in a single rack in a data center, the message transmission time is significantly more
than the event time.
So even a cluster is a distributed system by this definition.
The importance of this inequality is in the design of algorithms that are going to span the nodes of the
network. In structuring applications that run on the distributed nodes of such a system, one has to be
very careful to make sure that the computation time in the algorithms you're designing is significantly
more than the communication time. Otherwise, we are not going to reap the benefits of parallelism,
because most of the time will be spent communicating.
281 - A Fun Example

We're going to look at a fun example. This is me, and I'm going to India for Christmas holidays. And
I'm going to use Expedia to make the airline reservation.
1. So I'm sending a message to Expedia, saying, hey, make a reservation for me, a-to-b.
2. Expedia chooses to make the reservation using Delta, so it sends a message, c-to-d.
3. Delta confirms by sending this message to Expedia, e-to-f, that yes, Kishore's reservation is in.
4. Once Expedia has received this confirmation from Delta, it sends me a message, g-to-h and this
message is telling me that I've got the airline reservation booked.
5. Then, I'm directly contacting Delta asking for my preference for food, i-to-j. Fortunately, it's an
international trip, so I'm going to get a little bit more than peanuts on the Delta flight to India.
6. Delta confirms that yes, you have your meal preference. That's the message k-to-l, that confirms
that I have my meal preference, I'm all set.

So everything that I've described here is what you probably do on a routine basis. All of this makes
logical sense, right? There are several beliefs that are ingrained in this picture here about the ordering
of events in the distributed system that makes all of this work.
• One belief that we have is that processes are sequential. The events that we see happening in a
given process are totally ordered within a single process.
• The other belief is that you cannot have the receipt of a message before the send is complete,
right? You have to send the message before it can be received. In other words, the receipt of a
message, which is b here, has to happen after the message is sent from a.
So those are the core beliefs that we have about what is happening with events in a distributed system.
• That events within the process are sequential.
• Across processes, when you have communication events, “send” happens before “receive”.
And we call these beliefs “a → b”, as the “happened before”, relationship.
282 - Happened Before Relationship

So let's dig a little deeper into what we mean by the “Happened Before” relationship.
I'm going to denote the “Happened Before” relationship with this arrow, A → B.

What this notation is implying is one of two things.


• If A and B are in the same process, this gives us a belief that A must have happened before B.
• If A and B are not in the same process, then there must be a communication event that connects
A and B. In other words, A is the send event of a message and B is the receipt of that same
message; then A happened before B, where A is the sender of the message and B is the receiver
of the message.
So, this is the implication of saying that an event in a distributed system A happened before B, and
these events can be anywhere in the system.
If we are asserting that A happened before B, what we are implying is one of these two possibilities.
One is that A and B are events in the same process; the other is that A is the act of sending a message
and B is the act of receiving that same message on a different node of the distributed system.
The other property of the happened before relationship is that it is transitive. If we're asserting that
there is A → B, and B → C. The implication is this relationship is transitive so A → C.
283 - Relation Question

Consider the following set of events on node N1 and node N2.


• N1 is sending a message and the act of sending the message, the event associated with that is F.
• And A is the act of receiving the same message on node N2.
• And B is another event on node N1 that textually follows the send event F.
• And G is another event on node N2 that textually follows the receive event A.
Now the question for you, what can you say about the relationship between the events A and B?

284 - Relation Solution

The right answer, it's neither.


That is, you cannot say anything about the order between A and B, given what I'm showing you here.
We'll explain more about this in the continuation of this lecture.
285 - Happened Before Relation (cont)

Now that we understand the" happened before" relationship and the transitivity of" happened before"
relationship, I also want to introduce this notion of concurrent events, “A || B”

Concurrent events are events in which there is no apparent relationship between the events.
So, for instance, A is an event on one node, and B is an event on another node. Since A and B are not
events on the same node, we cannot apply the sequential-process condition to say that there is an
ordering between A and B. By the same token, since there is no communication between A and B, they
are not connected events either.
So, in other words, we cannot say anything about the ordering of A and B in the distributed system;
without a chain of communication connecting them, the events are concurrent.
That's the nature of the game: these processes are executing asynchronously with respect to one
another.
The important point is to recognize which events are connected by the “happened before”
relationship and which events are concurrent. Once you understand these two concepts, you can build
robust distributed systems and applications.
One of the banes of distributed programs is synchronization bugs, communication bugs, and timing bugs.
This is a classic example of a timing bug: you may mentally assume that A happened before B, but that
may not hold, because these two events are concurrent.
286 - Identifying Events Question

Now this quiz is an open-ended quiz. In it, I am giving you the same example of me purchasing a trip to
go to India for the Christmas holidays. I am showing you all the communication events, indicated by
these lines here: A to B, C to D, and so on. Within each process, the events are shown in their textual
order.
The question is to identify all the events, that are connected by the happened-before relationship.

287 - Identifying Events Solution


288 - Example of Event Ordering

Returning to our original example of me ordering a ticket to go to India via Expedia and Delta.

Let's now identify all the events that are connected directly by the happened-before relationship.
• Within my process, textual ordering implies sequential order, i.e. A → H, H → I, and I → L.
• Similarly, within Expedia's process, B → C → F → G → M, and in Delta, D → E → J → K.
Then there are the communication events, which directly relate events happening in two different
processes, e.g. A → B, C → D, etc.
Next are the transitive relationships. For instance, what is the relationship between event E and event A?
Well, since we have A → B, B → C, C → D and D → E, we can deduce based on transitivity that A → E.
Finally, let's look at concurrent events.
• For example, we have G → H and G → M, but we know nothing about the order between H and
M, thus they are concurrent events, H || M.
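
To make this bookkeeping concrete, here is a minimal sketch (my own illustration, not part of the
lecture) that records the direct happened-before edges of this example and checks, by reachability,
whether two events are ordered or concurrent. The event labels follow the example above; the helper
function names are mine.

# Direct happened-before edges: process order plus send -> receive messages.
edges = {
    # my process
    ("A", "H"), ("H", "I"), ("I", "L"),
    # Expedia's process
    ("B", "C"), ("C", "F"), ("F", "G"), ("G", "M"),
    # Delta's process
    ("D", "E"), ("E", "J"), ("J", "K"),
    # communication events (send -> receive)
    ("A", "B"), ("C", "D"), ("E", "F"), ("G", "H"), ("I", "J"), ("K", "L"),
}

def happened_before(x, y, edges):
    """True if x -> y via the transitive closure of the direct edges."""
    frontier, seen = {x}, set()
    while frontier:
        node = frontier.pop()
        seen.add(node)
        for (a, b) in edges:
            if a == node and b not in seen:
                if b == y:
                    return True
                frontier.add(b)
    return False

def relate(x, y, edges):
    if happened_before(x, y, edges):
        return f"{x} -> {y}"
    if happened_before(y, x, edges):
        return f"{y} -> {x}"
    return f"{x} || {y}"   # concurrent: no path in either direction

print(relate("A", "E", edges))   # A -> E (by transitivity)
print(relate("H", "M", edges))   # H || M (concurrent)

The reachability check is exactly the transitivity argument: A → E falls out of chaining the direct edges,
while H and M have no connecting path in either direction.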
289 - Introduction - Lamport’s Clock

Now that you have learned the basics of distributed systems, in particular the happened-before
relationship, we are ready to talk about Lamport's clock.
290 - Lamport’s Logical Clock

What does each node in the distributed system know?


• Every node knows its own events, meaning, what are the computational events that are
happening in its own execution.
• It also knows about its communication events with the rest of the nodes in the distributed
system.
• For instance, if this process P-i sends a message to process P-j, that's a send event and P-i knows
it. Similarly when P-j gets the message, it is a receive event. P-j knows about that.
• Meanwhile, P-j has no idea about the local computational events of P-i.

Lamport's logical clock builds on this very simple idea of associating a timestamp with every one
of the events that are happening in every process in the entire distributed system.
How are we going to do that?
• Well, we're going to have a local clock, C-i and C-j. The local clock can be anything (a counter
or a timestamp).

• For instance here I have C-i(a) = 2 and this counter monotonically increases as we pile up
events in our system.

• Because a → b, we must have C-i(b) > C-i(a), and in this case C-i(b) = 4.
Now, what about these communication events?
• For example, in P-i we have already associated a timestamp with the communication event C-
i(a) = 2.

• In P-j, it will associate receive event with a timestamp as well, C-j(d).

• First obviously we must have C-j(d) > C-i(a).

• What else does C-j(d) depend on? Well, it depends on other things that are happening in P-j so I
need to know the current state of my local counter. For instance, in my execution shown here, I
haven't done anything meaningful yet so my local counter is still zero.

• So when this message comes, that's the first time I'm going to do something meaningful in P-j.

• But C-j(d) has to be greater than C-i(a) = 2, so I cannot simply use my local counter; I associate
the timestamp C-j(d) = 3 upon the receipt of this message.
These are basically the two conditions that we have talked about.
• First, if the two events a and b are in the same process and I know that a → b, we must have C-
i(a) < C-i(b).

• Second, if event a happens to be a send event on some process and event d happens to be the
receive event on another process, then we must have C-i(a) < C-j(d).
Next we have a very interesting question: what about the timestamps of events that are happening
concurrently in a distributed system?
291 - Events Question

Let's say in the distributed system, there are two events.


I don't know where they are happening. There's an event called a and there's an event called b.
I look at the record of all the timestamps associated with the events, I see that the timestamp associated
with a is less than the timestamp associated with b.
Does that mean that
• a → b?
• b → a?
• a → b with some conditions?

292 - Events Solution

The conditional choice is the right one, and the reason is that the timestamps themselves don't give the
whole story; all we have are the partial orders of events happening on the individual nodes in the
system.
And we'll elaborate that a little bit more when we describe Lamport's logical clock in its entirety.
293 - Logical Clock Conditions

So now we're ready to describe the conditions for the logical clock proposed by Lamport.

• First, if a → b in the same process, we have C-i(a) < C-i(b), i.e. there is a logical clock on every
node of the distributed system that monotonically increases as events happen in that process.

• Second, the message receipt time must be greater than the send time, i.e. C-i(a) < C-j(d), where
C-j(d) = max(C-i(a) + 1, C-j(current)).

• Lastly, if the events are concurrent, in this case, b and d, they can have arbitrary timestamps.

There is an important implication, i.e. we cannot infer x → y based on C-i(x) < C-j(y).
This means that Lamport's logical clock gives us a partial order of events happening in the entire
distributed system.
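
As a rough sketch of these rules (my own illustration, not lecture code; the class and method names are
assumptions, and I use the common formulation where the receive timestamp is the maximum of the
piggybacked timestamp and the local clock, plus one):

class LamportProcess:
    def __init__(self, name):
        self.name = name
        self.clock = 0           # local logical clock C-i

    def local_event(self):
        self.clock += 1          # condition 1: monotonically increasing
        return self.clock

    def send(self):
        self.clock += 1          # the send is itself an event
        return self.clock        # timestamp piggybacked on the message

    def receive(self, msg_timestamp):
        # condition 2: the receive timestamp exceeds both the send timestamp
        # and anything this process has seen locally
        self.clock = max(msg_timestamp, self.clock) + 1
        return self.clock

# Re-creating the lecture's numbers: P-i does one event, then sends at 2;
# P-j has done nothing yet, so the receive gets timestamp 3.
pi, pj = LamportProcess("P-i"), LamportProcess("P-j")
pi.local_event()            # C-i = 1
t_a = pi.send()             # event a: C-i(a) = 2
t_d = pj.receive(t_a)       # event d: C-j(d) = max(2, 0) + 1 = 3
print(t_a, t_d)             # 2 3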
294 - Need For a Total Order

Is partial order good enough for constructing deterministic distributed algorithms?


• It turns out, it may be sufficient for many situations. The airline reservation example would
work fine with a partial order of events dictated by Lamport's clock.

• But there are situations where there may be a need for a total order of events in the distributed
system.

Here is an example to illustrate the need for a total order.


• I have, one car and my wife, son and daughter share this single car.
• We want to be able to make local decision on who gets dibs on using the car at any point of
time, using Lamport's clock.
• Whenever we want to use the car, we text everyone with a timestamp.
• How do we pick the winner? Locally, each of us looks at the timestamps of the requests from the
others and our own request; whoever has the earliest timestamp wins.
• What’s more, we also have a tie-breaker in case requests carry the same timestamp, i.e. age
wins.
You can see, through this example, that there is a need for total order in decision making when you
have a distributed system.
• You want to make local decisions without bothering anyone, based on information that you
have.

• But you have to make that local decision unambiguously, because you cannot have both my
son and wife thinking that they have the car at the same time. That'll be a problem. So, whenever
there is a tie, we have to break that, and that's the need for the total order.
295 - Lamport’s Total Order

So, having seen the need for a total order and unambiguous decision making in a distributed system,
let's now introduce Lamport's Total Order formulation, denoted by “=>”.

For two events a and b in different processes, a => b is true if


• C-i(a) < C-j(b), OR
• C-i(a) == C-j(b), and there is some arbitrary other function (a tie-breaker, e.g. using the process
ID) that helps us unambiguously decide which event precedes the other.
One issue with the tie-breaker is that it is an arbitrary function.
So there is no single total order and the single total order depends on the choice of the tie-breaker.
The other important point to understand is that we have to trust the timestamps associated with the
respective events.
Once we have derived the total order, the timestamps are meaningless; we don't care about them
anymore. The whole idea of having these logical timestamps, creating a partial order, and deriving a
total order from that partial order is so that we can get one particular total order. Once we have the
total order, the timestamps have served their purpose.
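
A tiny sketch of how such a total order can be derived in practice (my own illustration; the event tuples
below are made-up values): sort by the Lamport timestamp and break ties with the process ID, the same
tie-breaker used in the quiz that follows.

events = [
    # (process_id, logical_timestamp, label) -- illustrative values
    (2, 1, "P2.a"),
    (1, 1, "P1.a"),
    (3, 2, "P3.a"),
    (1, 2, "P1.b"),
]

# Smaller timestamp first; on a tie, smaller process ID wins.
total_order = sorted(events, key=lambda e: (e[1], e[0]))
print([label for _, _, label in total_order])
# ['P1.a', 'P2.a', 'P1.b', 'P3.a']

A different tie-breaker would produce a different, but equally valid, total order.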
296 - Total Order Question

So we have three processes P1, P2 and P3 and the events happening in these processes.
I want you to derive a total order.
In deriving the total order, we're going to use the process ID to break ties: the smaller the process ID,
the higher the priority.

297 - Total Order Solution


298 - Distributed ME Lock Algorithm

Now let's put Lamport's clock to work for implementing a distributed Mutual Exclusion (ME) lock
algorithm; it is going to be very similar to the car-sharing example that I showed you before.
Notice that we've talked about locks in a shared memory multiprocessor, where we have shared
memory to implement the lock. But in a distributed system, we don't have shared memory.
So we have to implement a mutual exclusion lock using Lamport's logical clock. Any process that
needs to acquire this lock will send a message to all the other processes, and the intent to get the lock
may emanate simultaneously from several processes.
The algorithm is as follows.
• Every process has a queue data structure. Every process has its own private queue and the
private queue is ordered by the happened-before relationship (timestamp).

• To request a lock, a process sends a message to all the other processes: I want this lock and my
timestamp is here, e.g. P-1(2).

• The process also puts this request in its own queue.

• When another process receives the request, it puts the request into its own local queue, which is
ordered by the timestamp information, the smallest timestamp being at the top of the queue.

• The receiving process also sends an acknowledgment back to the requester.

• Now, what happens when there is a tie? Well, when we have a tie, we break the tie by giving
priority to the process that has a lower process ID.

• Basically that's how this algorithm works and every process can unambiguously make a decision
as to where to place an incoming request in the queue.
So an example of the state of the queue is as shown. The thing that should jump out at you immediately
is that the states of the queues are not the same in all the processes.
• For instance P-1’s queue contains its request that it generated at time 2.
• But P-2 and P-n have not seen it yet.
• Is it possible that the queues can be inconsistent with one another?
Of course it is possible. The reason is a message is going to take some time to reach the other nodes
in the distributed system. All the messages may not take the same amount of time to traverse a
network. It is possible that P-1's message is still in transit while P-2’s and P-n’s messages have already
got into queues of all the processes.
The whole purpose of this exercise is to unambiguously get the mutual exclusion lock for some
process. Now how does a process know that it has the lock?
Two things have to be true for me to think that I have gotten the lock.
• First thing, my request is at the top of the queue.
• The second thing is I've received acknowledgments from all the other nodes in the system, so
that we know all processes’ queues are consistent.
• In this example, P-2 and P-n have sent their acknowledgments back to P-1, and P-1 has received
all the acknowledgments.
• I've also received lock requests from P-2 and P-n and they are later than mine.
• So overall, this is how the lock requests are ordered: P-1(2), P-n(5) and P-2(10).
So the two conditions I'm going to look for to make a decision locally:
• I have the lock if my request is at the top of the queue
• I've received acknowledgement OR a late lock request from all the other nodes in the system.
Say that I haven't received the acknowledgement for my request from P-2 and P-n. Can I go ahead
and assume I have the lock? Yes, I can. Why? Because even though these guys have not sent me the
acknowledgment yet, I've received lock requests from them, with timestamps 5 and 10 respectively.
Therefore I can make an unambiguous decision that my lock request precedes all the other lock
requests at this point of time.
I'm sure you've figured it out already: since we are following Lamport's clock in this mutual
exclusion lock algorithm, the ACK message will have a later timestamp than the timestamp associated
with the request itself. So you can see that adding a way of deriving a total order from the partial order
given by Lamport's clock allows a process to unambiguously make a decision locally, based on the
state of its local queue, as to whether it has the lock or not.

Now let's talk about how I go about releasing the lock.


• So if I want to release the lock, I will send an unlock message to all the other guys.
• The first thing that I do, of course, is remove the entry that I have at the top of my own queue.
• Then, I am going to send an unlock message to everybody else.
• When the peers receive the unlock message, they will remove the corresponding entry from
their respective queues.
So P-1 is done and other processes can use the same decision making process to figure out whether
they are the winners for getting the lock next.
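
Here is a simplified sketch of the per-process decision logic (my own single-address-space simulation,
not lecture code; message sending is only indicated in comments, and the names are assumptions).
Requests are (timestamp, pid) pairs, so the queue ordering is exactly the Lamport total order with the
process ID as the tie-breaker.

import heapq

class MEProcess:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers                          # pids of all other processes
        self.queue = []                             # priority queue of (ts, pid)
        self.latest_seen = {p: 0 for p in peers}    # highest timestamp seen per peer
        self.my_request = None

    def request_lock(self, ts):
        self.my_request = (ts, self.pid)
        heapq.heappush(self.queue, self.my_request)
        # in the real algorithm we would now send (ts, pid) to every peer

    def on_peer_request(self, ts, pid):
        heapq.heappush(self.queue, (ts, pid))
        self.latest_seen[pid] = max(self.latest_seen[pid], ts)
        # in the real algorithm we would send an ACK back to pid

    def on_ack(self, ts, pid):
        self.latest_seen[pid] = max(self.latest_seen[pid], ts)

    def have_lock(self):
        # condition 1: my request is at the top of my queue
        # condition 2: every peer has sent me something (an ACK or a later
        #              lock request) stamped after my own request
        return (bool(self.queue) and self.queue[0] == self.my_request and
                all(self.latest_seen[p] > self.my_request[0] for p in self.peers))

    def release_lock(self):
        heapq.heappop(self.queue)                   # remove my own entry
        self.my_request = None
        # in the real algorithm we would send UNLOCK to every peer

# The lecture's scenario: P-1 requests at timestamp 2, then sees later requests
# from P-n (5) and P-2 (10); those later requests double as acknowledgments.
p1 = MEProcess(pid="P1", peers=["P2", "Pn"])
p1.request_lock(ts=2)
p1.on_peer_request(ts=5, pid="Pn")
p1.on_peer_request(ts=10, pid="P2")
print(p1.have_lock())    # True: top of the queue, and later messages from all peers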
299 - Distributed ME Lock Algorithm - Correctness

So, we can talk about the correctness of the distributed mutual exclusion lock algorithm. The
correctness is based partially on some assumptions and partially on the construction.
The construction is that the queues are totally ordered by Lamport's logical clocks and the PID to
break the ties.
It is also based on the following assumptions.
• The first assumption is that messages between any two processes arrive in order. Messages
don't crisscross each other. If I send a message and I send another message, the first message is
going to reach the destination first, second message is going to reach the destination second.

• The second assumption is that there is no loss of messages. So every message that is sent is
definitely received.
300 - Messages Complexity Question

So the question for you is, how many messages are exchanged among all the nodes for each lock
acquisition followed by a lock release?
Every process makes a lock request following the algorithm that we just described, uses the lock in its
critical section, and once it is done with that, it unlocks by sending messages again, as you saw.
The question is how many messages are exchanged among all the nodes for each lock acquisition
followed by a lock release?
That is, combination of lock plus unlock. How many messages does it generate?

301 - Messages Complexity Solution

The right answer is 3x(N-1), where N is the number of nodes in the distributed system.
302 - Message Complexity

So let's look at the messaging complexity of the mutual exclusion lock algorithm.
• When a process makes a lock request, it sends N-1 request messages to its N-1 peers.
• In response to these request messages, every peer is going to acknowledge it, which will be
another N-1 messages.
• Then the process has the lock and works in the critical section. Upon lock release, once again,
we're going to send N-1 unlock messages to all the peers, but no acknowledgment is needed for
the unlock message.
So in total we have 3x(N-1) message complexity in the distributed mutual exclusion lock algorithm.

That begs the question: can we do better? The answer is yes. I will know where my lock request stands
relative to a peer's
• if my peer has acknowledged it, OR
• if my peer sent me a lock request with a later timestamp before I got its acknowledgment.
So a later lock request can stand in for the explicit ACK. When I am releasing my lock, instead of a
BROADCAST to all peers, I only need to check my local queue and SIGNAL the next process in my queue.
If there is no process in the queue, I just do nothing until a new lock request comes in.
The distributed mutual exclusion lock problem has been a fertile ground for researchers to think
about new algorithms that shave the message complexity even further, and I invite you to look at the
related literature.
303 - Real World Scenario

So far, we've been dealing with this funny virtual time or logical time. But there are many real-world
scenarios where this logical time may not be good enough.

Let's say that I owe you some money.


• I'm telling you on the phone that I'm going to credit my account at 5 p.m., and so any time after
5 p.m., you can withdraw money from my bank, and we'll be square.
• Now you're a nice guy, so you want to give me some leeway. So you tell your branch that at
8:00 PM, debit from Kishore's account the money that he owes me. So your branch is going to
basically do a debit call to the central bank server asking for the money that is owed by Kishore
to be transferred to your account.
• It all sounds good, right? But it turns out that your branch's local time is far ahead of real time,
while my branch and the central bank are both in sync with real time. So your branch thought it
was making the call to the central bank at 8 pm, but the real time might be only 4 pm, and your
request would be declined.
• So your local branch has a clock drifting. It's either going faster or slower than the real time.
This is a real world scenario that you have to worry about.
• One common cause is individual clock drift: a given clock might not tick exactly once per real
second, i.e. clocks differ in accuracy.
• Even with the same type of clock, there might be small variations from clock to clock. So one
clock can tick at a particular rate and another clock can tick at a different rate, and that is a
second source of discrepancy.
These anomalies are nasty things that we have to avoid, in order to make sure that in the real world we
can have some guarantees about what goes on.
304 - Lamport’s Physical Clock

So that brings us to Lamport's Physical Clock, and the notation we're going to use for that is this
funny symbol here “ |→ ”

So, in physical time, in real time, event a happened before event b, a |→ b. This is an absolute order,
and we want to be able to infer C-i(a) < C-j(b) from this absolute order, where C-i and C-j are now
physical (real-time) clocks.
In order to ensure that the real time associated with these events give you this guarantee, you have to
have certain conditions associated with the clocks on the machines P-i and P-j, i.e. physical clock
conditions, PC.
• The first condition, PC1, is a bound on the individual clock drift, i.e. each individual clock cannot
drift by much: |dC-i(t)/dt - 1| < k. A “perfect clock” has dC-i/dt = 1, and our clocks should be as
close to that as possible. Here kappa (k) is the bound on the individual clock drift, and it should
be very small.

• The second condition, PC2, is that the mutual drift between the clocks on different nodes of the
distributed system should be very small, less than epsilon: |C-i(t) - C-j(t)| < e for any two nodes
i and j. That is, the difference between the time I read on my clock and the time I read on
somebody else's clock should be very small.
So kappa (k) and epsilon (e) are the two important parameters in the physical clock conditions.
Intuitively, we're going to argue that the absolute values of the individual clock drift and the mutual
clock drift have to be negligible compared to the inter-process communication time.
305 - IPC Time and Clock Drift

So what we're going to look at now is the relationship between the Inter-Process Communication
(IPC) time and both the individual and the mutual clock drift that I described to you.

Suppose u (mu) is the shortest possible IPC time in the system, and we have an IPC communication
event a-to-b, where a is the send event on P-i and b is the receive event on P-j, so that a |→ b in real
time. What conditions need to be met to ensure that the physical clocks respect this order, i.e. that
C-i(a) < C-j(b)?

• Communication event: we need C-i(a) < C-j(b).
• If the send happens at real time t, the receive happens no earlier than real time t + u. So it is
enough to guarantee, for all t: C-j(t + u) - C-i(t) > 0.
• First-order (Taylor) expansion: C-j(t + u) ≈ C-j(t) + u * dC-j(t)/dt, so the condition becomes
C-j(t) + u * dC-j(t)/dt - C-i(t) > 0, i.e. u * dC-j(t)/dt > C-i(t) - C-j(t).
• Using PC1: u * (1 - k) < u * dC-j(t)/dt < u * (1 + k), so the left-hand side is at least u * (1 - k).
• Using PC2: -e < C-i(t) - C-j(t) < e, so the right-hand side is less than e.
• Therefore it suffices that u * (1 - k) > e, i.e. e / (1 - k) < u.

In words: the mutual clock drift, scaled by the individual drift bound, must be smaller than the shortest
inter-process communication time.


306 - Real World Example (cont)

Let's return to our earlier example.

• Say the mutual clock drift between your branch and mine is e = 5 hr (when my clock reads 3 pm,
your clock reads 8 pm), and the shortest IPC time is u = 2 hr. Your transaction issued at your
local 8 pm is really issued at 3 pm real time and reaches the central bank at about 5 pm real time,
whereas my 5 pm deposit does not reach the central bank until about 7 pm real time. So your
withdrawal is still declined; note that e/(1-k) < u is violated here.
• On the other hand, if u = 2 hr and e = 1 hr, your local 8 pm is no earlier than 7 pm real time, and
your request reaches the central bank around 9 pm real time. By that time, my 5 pm deposit
(which arrives by about 7 pm) is already available in the central bank.
So the takeaway is that in constructing distributed applications which depend on real time, it is
important to pay attention to the individual clock drift, the mutual clock drift, and the IPC time.
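
As a quick numeric sanity check (my own illustration, not from the lecture), the two scenarios above
can be tested directly against the e/(1-k) < u condition derived earlier:

def anomaly_possible(e_hours, u_hours, k=0.0):
    """True if the mutual drift e can reorder causally related events
    separated by at least the minimum IPC time u."""
    return not (e_hours / (1.0 - k) < u_hours)

print(anomaly_possible(e_hours=5, u_hours=2))   # True: 5-hour drift, 2-hour IPC
print(anomaly_possible(e_hours=1, u_hours=2))   # False: drift within bounds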

307 - Conclusion

Lamport's clock serves as the theoretical underpinning for achieving deterministic execution in
distributed systems, despite the non-determinism that exists due to the vagaries of the network, drifts in
the clocks, and so on.
It's nice that we can come up with conditions that need to be satisfied, for both logical and physical
clocks, in order to make sure that we have deterministic executions and avoid anomalous behaviors.
In the next part, we will discuss techniques for making the OS communication software stack
efficient for dealing with network communication.
308 - Introduction - Efficient Communication Kernel

Lamport's clock gave a fundamental ordering mechanism for events in a distributed system.
This theoretical basis is becoming more important in this day and age, when so many everyday
services, email, social networks, e-commerce, and now even online education are becoming distributed.
Incidentally, this is the same Lamport who gave us a way to order memory accesses in a shared
memory multiprocessor, through the sequential consistency memory model.
In the next part of the lesson we turn our attention to more practical matters in distributed systems.
Specifically, given that network communication is the key to performance for distributed services, the
operating system has to strive hard to reduce the latency incurred in the system software for network
services.
Lamport's clock serves as the theoretical underpinning for achieving deterministic execution in a
distributed system, despite the non-determinism that exists due to the vagaries of a network.
In this lesson, we will discuss techniques for making the OS software stack efficient for network
communication. Both by looking at application interface to the kernel as well as inside the kernel in the
protocol stack itself, but first a quiz.

Paper:
C.A. Thekkath and H.M. Levy, " Limits to Low-Latency Communications on High Speed Networks ",
ACM Transactions on Computer Systems, May 1993. Paywall
309 - Latency Question

Let's say it takes me one minute to go from my office to the classroom, but the hallway is wide enough
that five of us can walk side by side from my office to the classroom.
The question to you is to illustrate the difference between latency and throughput.
The question is, what is the latency incurred by me to get to the classroom?
What is the throughput achieved if I walk with four other students side-by-side to get to the classroom?

310 - Latency Solution

This is just a fun quiz to get you thinking about latency and throughput.
I am sure that most of you would have gotten this: the time to get to the classroom from my office is
1 minute, so that's the latency I'm going to observe getting to my classroom.
The interesting thing is the throughput. The throughput is 5 per minute, because the hallway is wide
enough for 5 of us to walk side by side.
The important thing I want you to see is that the latency is not the inverse of throughput.
Further, if tomorrow I widen the hallway to make 10 people walk side by side, that's going to increase
the throughput, but it does nothing to the latency. The latency remains 1 min while the throughput is
10/min now.
311 - Latency vs Throughput

It's important to understand these two concepts of latency and throughput.


• Latency is the elapsed time for an event.
• Throughput is the number of events that can be executed per unit time. Bandwidth is a measure
of throughput. Higher bandwidth does not necessarily imply lower latency.

RPC is the basis for client-server based distributed systems, and its performance is crucial. There are two
components to the latency observed for message communication in a distributed system: the
hardware overhead and the software overhead.
• The hardware overhead depends on how the network is interfaced to the computer. Typically
you have a network controller that interfaces the network to the CPU. The network controller
operates by moving the bits of the message from the node’s system memory into network
controller’s private buffer. This can be done via DMA or Programmed I/O. The network
controller then puts the bits out on the wire.

• The software overhead is what the operating system tacks on to the hardware overhead of
moving the bits out onto the network. So if you think about the latency as a whole for doing a
network transmission, there is the software overhead incurred in the layers of the operating
system to make the message available in the memory of the processor, ready for transmission.
The focus of OS designers, of course, is to take what the hardware gives you and think about how
you can reduce the software overhead.
312 - Components of RPC Latency

Let's now discuss the components of the RPC latency.


1. It starts with a client call, which includes setting up the arguments for the call and calling into the
kernel. The kernel validates the call, and then the client stub marshals the arguments into a
network packet and sets up the controller to actually do the network transmission.
2. The second part of the latency is the controller latency. This is the part where the controller
gets the message from system memory into its private buffer and then puts the message out
on the wire. For controller latency, you can only take what you are given by the hardware.
3. The third part of the latency is the time on the wire. This depends on the distance between the
client and the server, the bandwidth between the source and the destination, and also the number
of intermediate routers along the path and so on.
4. Then the message arrives at the destination in the form of an interrupt to the node’s OS.
The interrupt has to be handled by the OS; this includes moving the bits from the wire into the
controller buffer and from the controller buffer into system memory.
5. Then we can set up the server procedure to serve the original call; this includes locating and
dispatching the server procedure. We also have to unmarshal the network packet into the
actual arguments of the server procedure. Now the server procedure is ready to execute.
That is already five steps from the time the client makes an RPC call to the point of actually executing
the call. So even though it looks like a simple procedure call from the client's point of view, there is all
this latency to be incurred in executing a remote procedure call.
6. Then the server procedure runs and does all the work; this part is not under the control of the OS.
Once the server procedure has completed execution, the server does steps (2) and (3) to send the
results to the client, and the client handles the incoming result packet similarly to step (4).
7. Finally, the client OS gets the result packet, re-dispatches the client, and sets it up so that the
client can receive the results and restart execution where it left off.
313 - Sources of Overhead on RPC

Now that we understand the components of RPC latency, let's understand the sources of overhead that
creep in while carrying out all the different functions, going from the client to the server and back to
the client.
As far as the client is concerned, this looks like an innocuous procedure call, right? So it just says, I
want to call a procedure S.foo(), and here are the arguments. Unfortunately, this call is not a simple
procedure call but a remote procedure call.
The sources of overhead that creep into a remote procedure call are
• marshaling and data copying
• control transfer
• protocol processing
So we'll look at each one of these things in more detail.
Now how can we reduce the overhead in the kernel?
What we want to do is think about what the hardware gives you and make use of what the hardware
gives you to reduce the latency incurred for each of these RPC latency components.
314 - Marshaling and Data Copying

The biggest overhead in marshaling is the data copying. Potentially, during marshaling, there could be
3 copies.
1. First, the client stub serializes all the RPC arguments into contiguous bytes in memory,
i.e. an RPC message, which the kernel can then send out on the wire just like any other message.
This is the first copy: from the client process's stack into the RPC message.
2. This RPC message is in user space, and the kernel has to make a copy of the RPC message into
the kernel buffer. This is the second copy.
3. Next the OS can invoke the network controller. The network controller will move the bits from
the kernel buffer into the network controller’s internal buffer (e.g. using DMA). This is the third
copy, which is done by the network controller.
These are the three copies involved in marshaling the arguments of the RPC call before it can be put
out on the wire. How do we reduce the number of copies?
The copy from the system memory into the network controller is a hardware action. It is unavoidable
so we're going to live with it.
Then can we eliminate the first copy by the client stub? Well, if the client stub can access the kernel
buffer directly, we can eliminate this first copy.
This can be done by installing the client stub in the kernel when the client binds with the server,
at bind time (instantiation time). So a synthesized marshaling procedure is installed in the kernel for
each client-server relationship. Then, every time the client makes the RPC call, the stub can be invoked
within the kernel to convert the arguments on the stack into a network message and put it directly into
the kernel buffer.
So this will reduce the two software copies down to one copy. But the problem is that dumping code
into the kernel may not be palatable, so this solution works only if the RPC service is trusted by the
kernel.
315 - Marshaling and Data Copying (cont)

An alternative to dumping code into the kernel is to leave the stub in user space itself, but have a
structured mechanism for communication between the client stub and the kernel.

This can be done with a shared descriptor. Recall what I told you earlier: the kernel has no idea of the
semantics of the RPC call. The shared descriptor can be used as a vehicle for the stub to tell the
kernel about the data structures that need to be passed as arguments.
• So, for instance, if the RPC call has four arguments, then this descriptor has four entries, and each
entry describes the address and length of one of the four arguments in memory.

• The kernel doesn't have to know the semantics of these data items. All it needs to know is the
starting address and the length of each data item.

• Note that the stub doesn't have to tell the kernel the specific type of the data (int, float, etc). All
that the stub is doing is saying here is the starting address for an argument, and here is the length
of that argument.

• The descriptor provides the kernel with the layout of the arguments on the user stack, and the
kernel can then gather those data items into a contiguous packet in the kernel buffer.
This is the second way to reduce the number of software copies from two to one.
So we can either push the client stub into the kernel, or use a shared descriptor to inform the kernel
of the data structure layout on the user stack. Both of these allow us to reduce the total number of
copies from three down to two.
The same optimization can be applied to the server-side marshaling of the RPC execution results as
well.
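
To make the shared-descriptor idea concrete, here is a toy sketch (my own illustration, not kernel code;
the argument values and function names are made up). The stub records only (offset, length) pairs for
each argument, and the "kernel" side gathers them into one contiguous packet without an intermediate
user-space RPC message:

# Arguments as the client stub laid them out; their contents are opaque to the kernel.
arg_memory = bytearray()
descriptor = []                      # list of (offset, length) entries, one per argument
for arg in (b"flight=DL123", b"date=2023-12-20", b"seat=32A"):
    descriptor.append((len(arg_memory), len(arg)))
    arg_memory += arg

def kernel_gather(memory, descriptor):
    """The 'kernel side': build one contiguous packet straight into the kernel
    buffer using only the starting addresses and lengths in the descriptor."""
    packet = bytearray()
    for offset, length in descriptor:
        packet += memory[offset:offset + length]
    return bytes(packet)

print(kernel_gather(arg_memory, descriptor))

The kernel never interprets the bytes; it only needs where each argument starts and how long it is,
which is exactly the point made above.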
316 - Control Transfer

The second source of overhead is the control transfer overhead. It is the context switches that have to
happen in order to execute an RPC call and return.
Let's say on the client machine the client C is making an RPC call.
1. We know that the semantics of RPC is that the client C is blocked until the results are returned.
So the client OS will context switch to some other process, e.g. C1, to keep the CPU busy. This
is the first context switch, C → C1.
2. The RPC call is sent out and reaches the server machine. At this point, the server machine might
be executing some arbitrary process, e.g. S1. So the server kernel has to context switch to the
particular server process (e.g. S) to handle this incoming RPC call. This is the second context
switch, S1 → S.
3. Then the server procedure executes. Once it has completed execution and sent the results out,
the server OS will switch to some other process S2 (again, to keep the CPU busy). This is the third
context switch, S → S2.
4. Then the RPC result is sent back to the client machine. Similarly, the client machine could be
executing some process C2 at this moment. Once the client OS sees that the results have come
back, it will reschedule the RPC-calling process C so that it can receive the results and continue
with its execution. This is the fourth context switch, C2 → C.
Which context switches are critical?
• The C → C1 context switch is only there to make sure that the client machine is not being under-
utilized while C is blocked. So this context switch is not critical from the point of view of the
RPC latency.
• When the message comes to the server, the S1 → S context switch is crucial because we need to
schedule the server process S to fulfill the RPC request. So this is an essential part of the RPC
latency.
• Similarly, S → S2 is only to make sure that the server machine is not under-utilized when the
server is done with the RPC call. This is not in the critical path of RPC latency.
• Finally when the result comes back to the client machine, the kernel has to switch to this client
C, so that it can process the results and continue with its execution. So this C2 → C context
switch, is again in the critical path of the RPC latency.
317 - Control Transfer (cont)

If you look at the RPC call,


• There are two context switches in the critical path of RPC latency, S1 → S and C2 → C.

• The context switch C → C1, which is to keep the client machine utilized, can be overlapped with
the network communication, i.e. we can defer and do it while the RPC call is in transmission on
the wire.

• Similarly, S → S2 can be deferred & overlapped with the return-result network transmission.
So, we can hide the latency of the two non-critical context switches by overlapping them with the
network communication.
Of course we are greedy, so can we reduce the context switches to one?
First for C → C1,
• If the server procedure is going to execute for a long time, then C will be blocked for a long time,
so it might be a good thing to context switch to make sure that the client machine is fully utilized.

• But if this RPC call will be returned very soon, i.e. we are on a local area network (little time on
the wire) and the server procedure executes very fast, then we might want to just let C spin,
instead of switching to C1.
However, S1 → S cannot be avoided. That's the necessary evil. We'll incur that.
As a result, in some cases we can probably get rid of the C → C1 context switch and only bear the
latency of the S1 → S context switch on the server.
(well, why not let the server spin on S as well... ¯\_(ツ)_/¯ )
318 - Protocol Processing

319 - Protocol Processing (cont)

Now the fourth component that adds to the latency of the RPC transmission: protocol processing.
What transport protocol should we use for RPC? This is where we want to see how we can exploit
the hardware capability.
If we are working in a LAN, which is usually considered reliable, then we can focus on reducing the
latency instead of worrying too much about reliability. That's the idea behind the techniques that we're
going to look at next.
Let's think about all the things that could go wrong in message transmission, and see why some of those
things may not be that important, given that we have a reliable local area network.
The first thing is that you send a message and it might get lost, so we usually need an ACK to ensure
the message is delivered.
• Messages can be lost in a WAN because they have to go through several different routers, they
may be queued in the routers, packets may be lost on the wire, and so on. But that's not
something you have to worry about in a LAN.
• So the assumption that messages will not get lost suggests that there is no need for low-level
acknowledgements. Why? Usually in network transmission, we send an ACK to the sender to
confirm that the message was received. But in the RPC case, the semantics of RPC say that the
act of receiving the RPC call results in the server procedure executing and the result coming
back. The result itself serves as the ACK.
• Therefore we don't need low-level ACKs. If, for example, the request fails to arrive at the server,
or the result fails to come back, the client can just re-send the request. The high-level semantics
of RPC itself serves as the way to coordinate between the client and the server.
The second thing is in message transmission on the Internet, we worry about messages getting
corrupted.
• Not maliciously or anything like that; just due to the vagaries of the network, messages may get
corrupted on the wire that connects the source and the destination.
• For that reason, it's typical to employ checksum in the messages to indicate the integrity of the
message.
• The checksum is usually computed in software, appended to the message, and sent on the wire.
• But in a LAN, things are reliable, so we don't have to incur this extra overhead. Just use the
hardware checksum for packet integrity if it is available, and don't add an extra layer of software
checksumming in the protocol processing.
The third source of overhead that comes about in message transmission is once again related to
the fact that messages may get lost in transmission.
• In order to make sure that messages are not lost in transmission, you usually buffer the packets.
So that if the message is lost in transmission, you can re-transmit the packet.
• Now once again think about the semantics of RPC. The client has made the call and the client is
blocked. Since the client is blocked, we don't need to buffer the message on the client's side.
• If the message gets lost for some reason, i.e. you don't hear back the result of the RPC from the
destination within a certain amount of time, you can just re-send the RPC call.
• Therefore you don't have to buffer the client-side RPC message: since the client is blocked, the
message can be reconstructed and the call re-sent.
• So client-side buffering is something that you can get rid of, once again because the LAN is
reliable.
The next source of overhead, similar to client side buffering, happens on the server side.
• The server is sending the results, and the results may get lost. The LAN is reliable, but a message
could still be lost.
• We do want to buffer on the server side, because unlike the client, which can just re-send the
request, the server would have to perform all the computation over again if the result is lost
during network transmission. Re-executing the server procedure may induce more latency than
simply buffering the result packet.
• There is still an optimization: the buffering on the server side can be overlapped with the
transmission of the message. In other words, once the result has been computed by the server
procedure, go ahead and send it; while you are sending the result back to the client, do the
buffering.
So removing low-level ACKs, employing the hardware checksum and removing the software checksum,
eliminating client-side buffering, and overlapping the server-side buffering with the result transmission
are optimizations that you can do in protocol processing in a LAN environment.
The LAN is reliable, so we don't have to focus so much on the reliability of message transmission; we
focus instead on the latency, and how we can reduce it by making the protocol processing lean and
mean.
320 - Conclusion

To recap what we said, the sources of RPC latency are the following.
• Marshaling and data copying
• Context switches
• Protocol processing in order to send the packet on the wire.
These are all the things that are happening in software. Those are the things that as OS designers, we
have a chance to do something about.
What we saw were techniques that we can employ to reduce the number of copies, to reduce the
number of context switches, and to make the protocol processing lean and mean so that the latency
involved in RPC is reduced to as minimum as possible from the software side.
We are going to take whatever the hardware gives us. If the hardware gives us the ability to do DMA
from the client buffer, we'll use that; if it doesn't, then we have to live with the extra copy.
321 - Introduction - Active Networks

In the previous lesson, we learned some tricks we can employ to optimize the RPC communication in
the local area network from the point of view of reducing communication latency.
Of course, user interactions go beyond the LAN to the WAN. Once a packet leaves your local node,
the primary issue is to route the packet reliably and quickly to the destination. Routing is part of the
functionality of the network layer of the OS protocol stack.
What happens to a packet once it leaves your node? Well, routers are the intermediate hardware
between your node and the destination and they have routing tables that help them to move the packet
towards the desired destination node by doing a table look-up. The routing tables evolve over time, since
the Internet itself is evolving continually. That's the big picture.
There are lots of fascinating details which you can learn in a course that is dedicated to computer
networking. For the next part of the lesson on distributed systems, we want to ask the question, what can
be done in the intermediate routers to accommodate the Quality of Service needs of individual packet
flows through the network? Or in other words, can we make the routers enroute to the destinations smarter?
The specific thought experiment we are going to discuss is called active networks. Then, we will
connect the dots from active networks to the current state of the art, which is referred to as software
defined networking.
Thus far in the course, we've been focusing on specializing operating system services for a single
processor, or a multi-core or a parallel system, or a local area network. In this lesson, we will take this
idea of specializing to the WAN. Specifically, we will study the idea of providing quality of service for
network communication in an operating system by making the network active.
Note:
David Wetherall, " Active Networks: Vision and Reality: Lessons from a Capsule-based System ", 17th
ACM Symposium on Operating System Principles, OS Review, Volume 33, Number 5, Dec. 1999.
322 - Routing on the Internet

Normally, what happens when a packet is routed is this: at the source node, you create a network
packet, go through the layers of the software stack on the sending node, and send the packet out on the
network. This network packet has a desired destination, and it has to go through a whole number of
intermediate routers to get to that destination. The routers on the Internet that are intermediate between
the source and the destination don't inspect the packet for its contents or anything like that. All they do
is look at the destination of that packet and figure out the next hop to send the packet to. Every router
has a routing table and makes the routing decision by doing a table lookup: given a particular
destination, what is the next hop?
So in other words, the routers en route from the source to the destination are simply forwarding packets.
The nodes are passive. Now, what does it mean to make the nodes ACTIVE? What we mean by
making a node active is that the next hop for sending the packet towards its destination is not determined
by a simple table lookup, but by the router executing code. In other words, in addition to the
payload intended for the destination, the packet also carries code with it, and that code is executed by the
router to make the routing decision.
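To make the contrast concrete, here is a minimal sketch in Java of the two kinds of forwarding. The class and method names (Packet, ActiveCode, chooseNextHop, and so on) are made up for illustration; they are not the interfaces of any real router.

```java
// Minimal sketch (hypothetical names) contrasting passive and active forwarding.
import java.util.HashMap;
import java.util.Map;

class Packet {
    String destination;          // desired destination address
    byte[] payload;              // data intended for the destination
    ActiveCode code;             // null for an ordinary, passive packet
}

interface ActiveCode {
    // Executed by an active router to decide where the packet goes next.
    String chooseNextHop(Packet p, Map<String, String> routingTable);
}

class Router {
    private final Map<String, String> routingTable = new HashMap<>();

    String forward(Packet p) {
        if (p.code == null) {
            // Passive router: a pure table lookup, destination -> next hop.
            return routingTable.get(p.destination);
        }
        // Active router: the packet carries code; executing it makes the decision.
        return p.code.chooseNextHop(p, routingTable);
    }
}
```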
This idea sounds really clever because it can provide customized service for each network flow. Every
network flow can have its own way of choosing the desired route. In other words, this is an
opportunity to virtualize my network traffic flow, independent of other network flows.
This should be very familiar to you, because we've been talking about customizing OS services in
SPIN, exokernel, and so on.
But the problem we're talking about here is much harder, because the network is wide open. Our
network traffic is flowing through the public Internet infrastructure, and we are talking about
specializing the processing for every network flow, independent of the others.
There are lots of challenges. How to write such code that can be distributed and sent over the wire so
that routers can execute it? Who can write such code? How can we be sure that the injected code does
not break the network or other network flows?
323 - An Example

Let me give you an example to motivate why this vision of active networks is both intriguing and
interesting.
You may all know that Diwali is a big festival in India, just like Christmas is in the western world. Let's
say that I am sending Diwali greetings electronically to my siblings, who are in India.
• What I can do is to send a greeting message to each of my siblings individually. So there will be
N messages going out on the Internet from source to destination.
• A nicer approach would be, given that all my siblings are clustered in one corner of the globe, it
would be nice if I could send just one message traversing the Internet.
• As the message gets close to the destination of where my siblings are, the router demultiplexes
my message and sends it to all my siblings.
Obviously, the second method is more frugal in terms of using network resources. I don't have to send N
messages. I only need to send one message and then as it gets close to the destination, an active node
takes this one message, recognizes it, demultiplexes it and sends it to all the recipients of this message.
Of course, we can generalize this idea and say that this idea of active router is going to be spread out
throughout the Internet, so that even if my siblings are distributed all over the world, I could still send a
single message from my source, and the message gets demultiplexed along the way depending on where
all the eventual recipients are for this particular message that starts from me.
So in other words, we can sprinkle this intelligence that is in this one particular router to all the routers
on the Internet. That way we are making the entire Internet an active network. That's the vision behind
active networks where the nodes on the internet will be able to look at the message and figure out what
to do with it in terms of routing decisions.
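Here is a hypothetical sketch of that edge-node demultiplexing, just to make the idea concrete; the class names and the delivery mechanism are invented for illustration.

```java
// Hypothetical sketch of the "one message in, N messages out" idea: an active
// edge node demultiplexes a single greeting capsule to all nearby recipients.
import java.util.List;

class GreetingCapsule {
    String greetingText;
    List<String> recipients;     // e.g., addresses of all the siblings

    GreetingCapsule(String text, List<String> recipients) {
        this.greetingText = text;
        this.recipients = recipients;
    }
}

class ActiveEdgeNode {
    void onArrival(GreetingCapsule capsule) {
        // One capsule crossed the wide-area network; only here, near the
        // destinations, is it fanned out into individual messages.
        for (String recipient : capsule.recipients) {
            sendLocally(recipient, capsule.greetingText);
        }
    }

    private void sendLocally(String recipient, String text) {
        System.out.println("Delivering to " + recipient + ": " + text);
    }
}
```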
324 - How to Implement the Vision

In order to implement this vision,


• The OS should provide quality of service (QoS) APIs to the application.
• The application can use those APIs to give QoS hints to the operating system, for example, that this
particular network flow has certain real-time constraints because it carries video data, and so forth.
• The OS will use those QoS constraints as hints to synthesize executable code and put that code
into the packet.
• So the data packet will include the destination IP, the code, and the payload (a minimal sketch of
such a packet follows after this list).
• If the Internet routers en route are capable, they will execute the embedded code and make intelligent
routing decisions accordingly, instead of passively looking up a table.
That's a rough roadmap of how we can take this vision and try to implement it.
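A minimal sender-side sketch of this roadmap is shown below. Everything in it (the QosHints structure, the synthesizeRoutingCode step, the ActivePacket layout) is a hypothetical illustration of the shape of the idea, not a real OS API.

```java
// Hypothetical sender-side sketch: QoS hints in, self-routing packet out.
class QosHints {
    boolean realTime;            // e.g., the flow carries video
    int maxLatencyMillis;
}

class ActivePacket {
    String destinationIp;        // where the payload should end up
    byte[] routingCode;          // code the routers execute to pick the next hop
    byte[] payload;              // the application data
}

class ActiveNetworkOs {
    // The application hands the OS its payload plus QoS hints ...
    ActivePacket send(String destIp, byte[] payload, QosHints hints) {
        ActivePacket p = new ActivePacket();
        p.destinationIp = destIp;
        // ... and the OS synthesizes routing code from those hints and
        // embeds it in the packet alongside the payload.
        p.routingCode = synthesizeRoutingCode(hints);
        p.payload = payload;
        return p;                // handed to the protocol stack for transmission
    }

    private byte[] synthesizeRoutingCode(QosHints hints) {
        return new byte[0];      // placeholder; real code synthesis is the hard part
    }
}
```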
• One problem with carrying out the vision is that changing the operating system is non-trivial,
especially the protocol stack (TCP/IP has several hundred thousand lines of code).
• The second part of the challenge is that the network routers are not open. We cannot expect that
every router on the Internet is capable of executing the code that I'm going to slap onto this
payload and making intelligent routing decisions.
So there is an impedance mismatch between the vision and the implementation that I've sketched right
here.
325 - ANTS Toolkit

The ANTS (Active Node Transfer System) toolkit took a different approach to show the utility of the
vision. Since modifying the protocol stack is non-trivial, the ANTS toolkit is made as an application-
level package.
• The application programmer hands the payload and QoS constraints to ANTS.
• ANTS adds an ANTS header to the payload to make a capsule (ANTS header + payload).
• The capsule is given to the OS protocol stack, which is just a normal protocol stack.
• The OS adds the IP header to the capsule; the capsule appears as a normal payload as far as
the OS is concerned.
• The packet (IP header + capsule) will then traverse the network.
• If a normal node picks up the packet, it will just use its routing table to forward it to the next hop.
• If the node is an active node, it can actually process the ANTS header. For example, if this packet
needs to be demultiplexed and sent along two different routes, the active node will make that intelligent
routing decision.
So that's the idea, that we can push one of the pain points out of the OS, into an enhanced toolkit that
lives above the OS kernel. So that's sort of the ANTS toolkit vision. That's one part.
Now, the second part is the fact that not all of the Internet routers can process the specialized code in the
capsule. So what we do is keep the active nodes only at the edge of the network. The core IP network
is unchanged, and all of the magic happens only at the edge of the network.
So, if I go back to my example of sending greetings to my siblings, only the edge nodes have to do
the magic in order to take my original message and process the code to deliver it to multiple destinations.
326 - ANTS Capsule and API

Let's dig a little deeper and look at the structure of the ANTS capsule as well as the ANTS APIs.
• The data packet consists of three parts.

• The original IP header, used for routing the packet towards the destination if a node is a normal node.

• The payload which is generated by the application.

• In the middle is the ANTS header and two fields are particularly important: the type field and the
prev field.

• The type field is a way to identify the code that has to be executed to process this capsule; it is an
MD5 hash of that code.

• The prev field is the identity of the upstream node that successfully processed the capsule of this
type.
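Here is a minimal sketch of that capsule layout. The field names are illustrative; the real ANTS capsule format differs in detail, but the key point holds: the type field is a cryptographic hash of the capsule-processing code, not the code itself.

```java
// Sketch of the capsule layout (hypothetical field names).
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class AntsHeader {
    byte[] type;                 // MD5 fingerprint of the capsule-processing code
    String prev;                 // identity of the upstream node that last
                                 // successfully processed a capsule of this type
}

class Capsule {
    byte[] ipHeader;             // ordinary IP header: normal nodes route on this
    AntsHeader antsHeader;       // only active nodes look inside this
    byte[] payload;              // generated by the application

    // The type field is not the code itself; it is a cryptographic hash of it.
    static byte[] typeOf(byte[] capsuleCode) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(capsuleCode);
    }
}
```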
Next, let's talk about the API that the ANTS toolkit provides.
Functions such as routeForNode() and deliverToApp().
• The most important function of the ANTS toolkit is forwarding packets through the network
intelligently.
• This is the set of API calls that allows you to do routing of the capsule through the network and
virtualize the network regardless of the actual physical topology.
The second part of the API is for manipulating the so-called soft-store.
• As stated previously, the type field is only a hash of the code.
• The soft-store is the place where the code of each type is actually stored.
• So put(), get(), remove() can be used to manipulate objects in the soft-store and those objects are
anything that is important for personalizing the network flow for capsules of each type.
• An obvious candidate for storing in the soft-store is the code associated with each type.
• Things like computed hints can also be put in the soft-store, which is about the state of the
network and can be used for future processing of capsules of the same type.
The third category of API that's available is querying the node for interesting tidbits about the state of
the network or details about that node itself.
• For instance, what is the identity of the node that I'm currently at?
• What is the local time at this node?
The key takeaway is that the ANTS API is a very minimal set of APIs.
Remember that the routers are in the public Internet. The router programs that execute at a router node
have to have certain important characteristics.
1. It has to be easy to program.
2. It should be easy to debug, maintain, and understand.
3. It should be very quick, because we are talking about routing packets, and so the router program
should not take a long time to do its route processing.
So this very simple API set allows you to generate very simple router programs that are easy to program,
easy to debug, easy to maintain and understand. The program itself is pretty small and will not take much
time to process.
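To make this concrete, here is what a tiny router program written against an ANTS-like API might look like. The interface below is a simplified stand-in: the real ANTS method signatures differ, but the flavor (a short evaluate routine that calls routeForNode, deliverToApp, and the soft-store operations) is the same.

```java
// Illustrative capsule-processing routine against an ANTS-like API.
// All signatures here are simplified stand-ins, not the real ANTS interfaces.
interface Node {
    String address();                                 // query: who am I?
    long time();                                      // query: local time at this node
    void routeForNode(Object capsule, String dest);   // forward toward dest
    void deliverToApp(Object capsule, int port);      // hand to a local application
    Object get(Object key);                           // soft-store lookup
    void put(Object key, Object value);               // soft-store insert
}

class MulticastCapsule {
    String[] destinations;       // where copies of this capsule should go
    int appPort;

    // Small, quick, and easy to reason about: that is the point of the minimal API.
    boolean evaluate(Node here) {
        for (String dest : destinations) {
            if (dest.equals(here.address())) {
                here.deliverToApp(this, appPort);     // one copy is for us
            } else {
                here.routeForNode(this, dest);        // forward the rest
            }
        }
        return true;
    }
}
```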
327 - Capsule Implementation

Now let's talk about implementation of the capsule, and in particular what are the actions taken on capsule
arrival at a particular node.
• As mentioned, the capsule does not contain the actual code; the code is passed by reference, i.e.,
via the type field.
• The type field is generated as an MD5 hash of the code by the ANTS toolkit. (MD5 was not yet
broken at the time of the paper.)
• The point is that type field is a cryptographically strong fingerprint derived from the capsule code
and it serves as a reference for the code.
When a node receives a capsule, if the node has seen capsules of this type before,
• It is quite likely that this node already has the code for this type in its soft-store.
• The node can simply fetch and execute the corresponding code and proceed with forwarding this
capsule on towards its desired destination.
If this node has not seen a capsule of this type before, it won't have the code for this type.
• In this case, the node will use the prev-node field in the ANTS header and ask the previous node
for code of this type of capsule.
• Since the previous node has already processed this capsule, it is very likely to have the code and
it can send the code to the current node.
• Then the current node can use this code to process the capsule.
The key take away is that the first packet that comes to a node may not find the code for its type, and
we have to do a little bit of heavy lifting to retrieve the code from the previous node and store the code
locally. (Locality!)
How do I believe that the code that I got from the previous node is the code that corresponds to this type?
• This is where the cryptographically strong fingerprint comes into play.
• When it retrieves the code from the previous node, it will compute the fingerprint of the code and
see if the fingerprint matches the type field of the capsule.
• If yes, then it knows that this code is genuine.
• If not, then obviously somebody is trying to spoof my node by giving it bogus code, and I will reject
it. So code spoofing can be avoided this way.
Once I get the first packet of a particular type, it will be very likely that there will be more capsules
of this type coming in the future. So I'm going to save the corresponding code in the soft store for future
use.

What if I go back to the previous node and the previous node does not have the code that corresponds to
this type, either?
• So the action would be to simply drop the capsule.
• Once the capsule is dropped, higher-level ACK protocol for this particular network flow will
indicate to the source that something did not get through, and that source node can retransmit that
capsule.
• This is exactly the same thing that happens with IP routing on the Internet.
• That's the same semantics used for capsule processing as well, because we rely on a higher-level
transport protocol that sits on top of the network protocol to do end-to-end acknowledgement
and make sure that all the packets have actually arrived.
• Therefore, at the level of capsule processing, we can simply drop the capsule.

Now, why doesn’t the previous node have the code for the capsule? (it just forwarded it to me!)
• The soft store of a router node is limited in capacity, and it is not going to give all of that capacity
to a single network flow.
• It only gives a part of its storage to the network flow corresponding to a particular capsule
type. So the capsule code may be evicted every once in a while, and it is possible that the
code originally stored in the soft store was thrown away and replaced by code for other
capsule types.
• This is particularly likely if the request for the code comes at a much later time, because there is
usually a time limit associated with the validity of things kept in the soft store. So if the
code request comes in very late, the corresponding code may have been thrown away already.
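Putting the whole arrival-time behavior together, here is a sketch of what an active node might do, reusing the Capsule and AntsHeader classes from the earlier sketch. All of the helper names are hypothetical, and the real ANTS node logic is more involved (timeouts, TTLs on soft-store entries, and so on).

```java
// Sketch of the arrival-time actions described above (hypothetical helpers).
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ActiveNode {
    // Soft store: capsule type (hex of the MD5 hash) -> code that processes it.
    private final Map<String, byte[]> softStore = new HashMap<>();

    void onCapsuleArrival(Capsule c) throws Exception {
        String typeKey = hex(c.antsHeader.type);
        byte[] code = softStore.get(typeKey);

        if (code == null) {
            // First capsule of this type we have seen: ask the upstream node
            // named in the prev field for the code (heavy lifting, done once).
            code = requestCodeFrom(c.antsHeader.prev, typeKey);
            if (code == null) {
                drop(c);          // prev node evicted it too; rely on end-to-end
                return;           // retransmission by the higher-level protocol
            }
            // Guard against code spoofing: the fingerprint of the code we got
            // must match the type field carried in the capsule.
            byte[] fingerprint = MessageDigest.getInstance("MD5").digest(code);
            if (!Arrays.equals(fingerprint, c.antsHeader.type)) {
                drop(c);          // bogus code; reject it
                return;
            }
            softStore.put(typeKey, code);   // cache for the capsules that follow
        }
        execute(code, c);         // process and forward the capsule
    }

    private byte[] requestCodeFrom(String prevNode, String typeKey) { return null; }
    private void execute(byte[] code, Capsule c) { }
    private void drop(Capsule c) { }
    private static String hex(byte[] b) {
        StringBuilder s = new StringBuilder();
        for (byte x : b) s.append(String.format("%02x", x & 0xff));
        return s.toString();
    }
}
```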
328 - Potential Applications

So, how useful are active networks? There are lots of potential applications that can be built using the
active networks paradigm.
In particular, whenever we desire certain ways to virtualize the behavior of the network, the active
networks become very useful. For instance,
• Protocol Independent Multicast
• Reliable Multicast
• Congestion notification
• Private IP
• Anycasting.
As you can see from this list, the things that you want to do using active networks are things that are
related to network layer functionality, not the high-level application functionality.
In particular, it is useful for building applications that are difficult to deploy on the internet.
• When you rely on passive routing, it is entirely an administrative set up and the administrative set
up tends to mirror the physical set up.
• But, for your particular network flow, you may want a set up that is different from the physical
set up.
• In my greeting-message example, that kind of multicasting is specific to a network flow.
So we are, in some sense, overlaying our own desires on top of the physical topology of the
Internet by using the active networks paradigm.
The key properties of active network applications are that such applications should be expressible,
compact, fast, and not rely on all nodes being active.
These are key things to note in building applications that live on top of the active networks.
So all of these suggest, once again, that active networks are for providing network-layer functionality,
not end-application functionality. Whatever you want done in the network layer is what you can
orchestrate using the active networks paradigm.
329 - Pros and Cons of Active Networks

Let's talk about the pros and cons of active networks.


The pro is flexibility. From an application's perspective, you can ignore the physical
layout of the network and slap your own network flow virtualization on top of the physical
infrastructure.
However, one concern is protection threats. The routing infrastructure carries network flows for
me, you, and other people. Just like in an OS, we need to make sure that one process does not do anything
malicious to other processes. Same way, my network flow should not do anything detrimental to your
network flow on the internet. So that's what we mean by protection threats. There are some safeguards
in the ANTS toolkit to address these protection threats.
• The first is the run-time safety of the ANTS program that runs on a router node. It is ensured
by implementing ANTS itself in Java and using the Java sandboxing technique on the router node.
This way, anything that the router code does for capsule processing is limited to the Java sandbox
in which it executes, and it cannot affect the other network flows that are flowing through
the same routing fabric.
• The second protection threat is code spoofing. We are talking about code being injected into the
router, so naturally we need to make sure that the code is not malicious or spoofed. As
shown earlier, ANTS uses a robust fingerprint associated with the code and always computes and
checks the fingerprint for code downloaded from the previous node.
• The third concern is the integrity of the soft state. The soft store in a router node is limited in
size, and you don't want a particular network flow to arbitrarily consume all of the soft state. So
there is a restricted API in ANTS that safeguards the soft state from this protection
threat.
These protection threats are concerns, but at least in the ANTS toolkit, there are solutions to ensure that
these protection threats are not show-stoppers for active networks.
The second concern that one might have is resource management threats.
• Because we are executing code at a router, the result of that code execution could be that I
proliferate packets on the internet.
• In my greeting message example, at some point that one message becomes N messages. So in
some sense, we can start flooding the network with capsule processing.
• Is that a threat? Well, it is. But the Internet is already susceptible to this kind of resource management
threat; yes, capsules add to it, but it is not anything new.
• On the other hand, we can ask whether, at a given node, a capsule will consume more resources than it
should. And this again comes back to the safeguards in the ANTS toolkit. The APIs are
restricted, and therefore the amount of resources that you can consume at a node is fairly limited.
So there is sort of a mixed answer to this resource management concern.
• At a given node, the resource management concern doesn't quite exist as long as you adhere to the
restricted API of ANTS.
• And the second concern, that the capsules may flood the network, can happen, but it is not an active-
network-specific problem. We all experience spam on the Internet already, so this is not adding
any new problem, though it is perhaps exacerbating an existing one.
330 - Roadblocks Question

In your opinion, what can be roadblocks to the Active Network's Vision?

331 - Roadblocks Solution

The right answers are the first two boxes here.


Now clearly, if we want to do anything in the router, we need buy-in from the router vendors. Convincing
the router makers to open up the network so that I can dump some code into their routers and execute it
is a big challenge.
The second is also a big challenge. If you look at the traffic on the Internet today, it's humongous.
There's so much traffic on the Internet, and this is the reason why routers are dumb animals: all the routing
happens in hardware. The Internet core, the routing fabric, operates at huge speeds, because
even at the edge of the network today we are already seeing gigabit speeds. That means the core
of the network has to handle hundreds of gigabits per second of packet processing, and
therefore it is important that the core of the network be blazingly fast. Software routing is not going
to be able to match that speed.
Now, do active networks make the Internet more vulnerable? Not really, because the Internet is
already vulnerable. Perhaps this adds to it, but it does not make the Internet particularly more vulnerable
than it already is.
And the last choice regarding code spoofing, as long as we make sure that the fingerprint associated
with the code, that is going to be used for processing the capsule when it arrives at a node, is
cryptographically strong, then we can make sure that this code spoofing does not happen.
332 - Feasible

So let's talk about the feasibility of the vision of active networks.


The reality is that router makers like Cisco are loath to open up the network.
• The idea of active networks is very fascinating. We can be frugal about the resources that we use
on the Internet for different network flows, and we can actually virtualize the physical
infrastructure by slapping our own network flows on top of it. But in reality, it's not going to be
feasible, given that we would have to open up the network. So it's going to be feasible only at the
edge of the network.
Secondly, when we use active networks, we are talking about executing code in a router to determine the
routing decision, i.e., software routing.
• Software routing cannot match the hardware routing speed.
• At the core of the network, there's so much traffic flowing that you really want to do this in hardware.
• So once again, this argues that active network is only feasible at the edge of the network.
Finally there are social and psychological reasons why active networks is maybe a little bit hard to
digest. It is hard for the user community to accept arbitrary code executing in a public routing fabric.
• If my traffic is flowing through the network and the router is going to actually execute some code
in order to process my packet, that worries me.
• Already, we talk a lot about privacy and the fact that in corporate networks and university
networks, we are losing a lot of privacy. People are watching what's going on.
• Now, saying that the routers are going to do something intelligent, some smart processing of packets,
might be a socially and psychologically unacceptable proposition.
These are the reasons why it is difficult to sell the idea of active networks to the WAN. On the other
hand, the idea of virtualizing the network flow is very appealing. If you put together the two thoughts:
we can virtualize the network and the active networks is only feasible at the edge of the network, that
brings up a very interesting proposition, which I am going to mention in my concluding remarks.
333 - Conclusion

Active networks was way ahead of its time, and there was not a killer app to justify this particular line of
thought.
Further, active networks focused more on safety and less on performance, so in the 90s it seemed more
like a solution looking for a problem.
But difficulties with network management, the rise of virtualization, the right hardware support, and data
center and cloud computing have all given active networks a new lease on life in the form of Software
Defined Networking, or SDN for short.
Specifically, cloud computing promotes a model of utility computing where multiple tenants
(businesses) can host their respective corporate networks simultaneously on the same computational
resources of a data center.
Not that this is going to ever happen, but imagine Coke and Pepsi corporate networks running on the
same data center resources. What this means is there is a need for perfect isolation of the network traffic
of one business from another, even though each of the network traffic is flowing on the same physical
infrastructure.
This calls for virtualization of the physical network itself, and hence the term, software defined
networking. You will learn more about SDN if you take a companion course on networking that is offered
in this same program.
334 - Introduction - Ensemble Project

By now, I'm sure you have recognized that an operating system is a complex software system.
Especially the protocol stack that sits within an operating system is several hundred thousand lines of
code. Developing such a complex software system, protocol stack in particular, to meet specs and also
deliver very good performance is no mean challenge.
Let's skip over to the other side of the fence and see how complex hardware systems (e.g., a CPU chip
with billions of transistors on it) are built. VLSI technology uses a component-based approach
to building large and complex hardware systems.
Can we mimic the same method for building large complex software systems? That is, rather than
start with a clean slate, can we reuse software components? It's very tempting, since component-based
design will be easier to test and optimize at the individual component level, and it will also allow for easy
evolution and extension through the addition and deletion of components. The idea of component-based
design is orthogonal to the structure of the OS that we've seen in previous lessons. Be it a
monolithic kernel or a microkernel-based design, the idea of component-based design can be applied to
each subsystem within an OS.
Of course, there are challenges. Can this lead to inefficiencies in performance due to the additional
component level function calls? Could it lead to loss of locality when we cross boundaries of components?
Could it lead to unnecessary redundancies, such as copying? Is it possible to get the advantages of
component based design without losing performance?
The short answer is 'yes', if you can put theory and practice together.
We will explore this topic more in this lesson. The idea is to synthesize Network Protocol Stack from
components. Cornell’s Ensemble project is used as a backdrop to illustrate how to build complex
systems by using a component based approach.
Liu, Kreitz, van Renesse, Hickey, Hayden, Birman, Constable, "Building Reliable High Performance
Communication Systems from Components", 17th ACM Symposium on Operating Systems Principles, OS
Review, Volume 33, Number 5, Dec. 1999.
335 - The Big Picture

The idea is to put theory and practice together in the design cycle. Theory is good for expressing abstract
specifications of the system at the level of individual components.
For the first part of the design cycle, namely specification of what we want to build,
• A theoretical framework called I/O automata (IOA) is used and it has C-like syntax. So it is
intuitive to write specification and system requirements.
• The composition operator in IOA allows expressing specification of an entire subsystem that we
want to build.
• For example, if you want to build a TCP/IP protocol stack, then all of the functional relationship
between the components of the subsystem can be expressed using this powerful specification
language primitives, available in IOA.
The second part of the design cycle is to convert the specification in IOA to executable code.
• The programming language used is Oriented Categorical Abstract Machine Language, OCaml.
• The formal semantics (precise mathematical semantics) of OCaml is a nice complement to the
specification declared in IOA.
• OCaml is an object-oriented language as well as a functional programming language, with no
side effects.
• The code generated by OCaml can be as efficient as C code. This is super important when you're
developing operating systems because you care about performance.
Next we need to optimize the code.
• We are doing component-based design, so the result will be highly unoptimized. There is a lot of cruft
at the interfaces between these components, which are assembled like Lego blocks.
• Nuprl is a theorem-proving framework that can generate optimized code functionally equivalent
to the original code.
• Nuprl is able to “understand” both the IOA specifications and the OCaml code, and can rewrite
the code for the purpose of optimization.
336 - Digging Deeper - Specification

What I'm showing you here is a software engineering road map to synthesizing a complex system. In
particular, what I focused on is building a TCP/IP network protocol stack using this methodology.
Specifications of communication systems range along an axis from specifying the behavior of a system
to specifying its properties.
• When specifying the behavior, we describe how the system reacts to events. For example, we
may specify that the system sends an ACK in response to a data message.
• When specifying properties, we describe logical predicates on the possible executions of the
system. An example of a property is that messages are always delivered in the order in which they
were sent (FIFO).
Both kinds of specifications are important.
• The properties describe the system at the highest level. Since the properties do not specify how
to implement a protocol, they are easy to compose.

• Behavioral specification provides the connection to the code by describing how to implement the
properties.
Behavioral specifications can be either concrete or abstract.
• Abstract specifications are nondeterministic descriptions that use global variables, and are
therefore not executable. The advantage of abstract specifications is that they are simple, and that
global or distributed properties such as FIFOness can be easily derived.

• Concrete specifications can be directly mapped onto executable code. The advantage of a
concrete specification is that an implementation can be easily derived from it.
Below, when using the term specification, we will mean behavioral specification unless otherwise noted.
At the top level, the specification is described using a variant of IOA as the specification language.
The properties of the abstract specification are derived by PROOF. This is facilitated by the I/O
Automata theoretical framework.
The concrete behavioral specification is derived from the abstract specification by a process called
REFINEMENT. This involves designing a protocol that implements the abstract requirements.
• For instance, we can have a LossyNetwork() described at the abstract behavior specification level.
• At the concrete specification level, we will describe how a FIFOProtocol() can be implemented
in this LossyNetwork().
• A concrete specification only involves state and events local to a single participant in the protocol,
while abstract specifications are global.
From the concrete behavioral spec, we get to the actual implementation using the OCaml programming
language.
Between the implementation and the concrete behavioral spec, there is not a whole lot of difference.
It is really the scheduling of the operations that we want in the concrete behavioral spec that is being
detailed.
• I already mentioned some of the reasons why they chose OCaml as the implementation vehicle.
It is a functional programming language, it has formal semantics, and these lead to compact code.

• It has high level operations, data structures, and features like automatic garbage collection,
automatic memory allocation, and marshalling and unmarshalling of arguments.

• This is very important because we are building a complex system, from a specification, using a
component based design approach.

• Just like you take Lego blocks to build a toy, we are taking components and meshing them together
to get the complex system implementation. When we do that, we are necessarily going across
different components and we have to adhere to the interface specifications of those components.
Thus, we have to marshal and unmarshal the arguments when we go between these components.
OCaml has facilities for marshalling and unmarshalling built into the language framework, which
makes it an ideal vehicle for component-based design.

• The programmability of OCaml is similar to C, and the definition of the primitives in OCaml makes
it a good vehicle for developing verifiable code.
337 - Digging Deeper - Implementation

At this point, we have an unoptimized implementation of the behavioral spec. Remember
that in component-based design, there's going to be a lot of cruft at the component interfaces.
One word of caution, though. I mentioned that using IOA is fantastic from the point of view of proving that
the properties we want in the original subsystem are actually met by the behavioral spec.
But the path that we took, going from abstract spec to concrete spec to implementation, only
ends up with an unoptimized version of an executable implementation.
In other words, this leg of the design exercise (proving that the spec meets the properties we set out
for the original subsystem) in no way guarantees that those properties are actually present in the
implementation. There's no easy way to show that the OCaml implementation is the same as the IOA
specification.
This brings to mind a famous quote attributed to Donald Knuth: "Beware of bugs in the above
code; I have only proved it correct, not tried it." So even an expert in developing algorithms is saying that
there's no easy way to prove that the code faithfully reproduces the algorithm.
That's sort of the same thing that happens here. We have an abstract behavioral spec, we can prove
properties about it, and we can convince ourselves that whatever we set out for the subsystem is
actually met in terms of the specification. But there is no way to prove that the implementation
faithfully reproduces the behavioral spec.
This is the software engineering roadmap for synthesizing a complex system, starting from the behavioral
spec all the way to an unoptimized implementation of the system.
• Those are the first two pieces of the puzzle, namely specification and implementation.
• The third piece of the puzzle of course is going from this unoptimized version to the optimized
version.
338 - Digging Deeper - Optimization

As I mentioned earlier, for going from the implementation to optimization, we once again turn to the
theoretical framework.
In this case, we're going to use this Nuprl theorem proving framework.
• The Nuprl theorem proving framework is a vehicle by which you can make some assertions and
prove some theorems about the equality of optimized and unoptimized code.

• First of all, you start with this OCaml code, which is unoptimized and there is a tool that converts
this unoptimized OCaml code to Nuprl Code. This is once again an unoptimized version of the
original OCaml code, but it is a Nuprl code.

• This theorem prover framework can be used to convert this Nuprl code to optimized Nuprl code,
through a whole series of optimization theorems. The framework ensures that the optimized Nuprl
code is equivalent to the unoptimized Nuprl code. So as far as the operating system design is
concerned, we're going to treat this as magic. For the purposes of this lesson, we're not going to
go into the theoretical details of how this theorem proving framework does its work.

• Then there is another tool that converts this optimized Nuprl code back into the optimized OCaml
code and we are ready to deploy this.
So this is sort of the design cycle going one full round, going from specification to implementation,
implementation to optimization and from optimization to deployable code that we can then take and put
it on the system.
339 - Putting the Methodology to Work

Now we can look at how to synthesize a TCP/IP protocol stack using this methodology.
• Starting with IOA, we specify the protocol in all the detail that we want, abstractly.

• We can prove that this abstract specification meets the properties using the IOA framework.

• From abstract spec, through a whole bunch of refinements, we can get to concrete spec.

Next we need to get an unoptimized OCaml implementation from the concrete spec, which is done using
the Ensemble suite of micro-protocols.
• Why use Ensemble? Well, remember, our goal is to mimic the methodology used in building real
large-scale circuits.

• Building a billion-transistor CPU chip takes a component-based design approach, i.e., taking
components of hardware structures that have already been implemented and assembling them
together in order to realize this big mammoth chip.

• And we are trying to do the same thing to build complex software systems. So the starting point
must be a set of components, and that's what the Ensemble suite of micro-protocols (ESM) gives you.
Think about the TCP/IP protocol: it has a lot of features in it, and each of those features requires a non-trivial
amount of code to build.
• For example, sliding window management, flow control, congestion control, and scatter/gather of
packets to assemble messages into units that can be delivered to the application. ESM has
components for each of those features.

• If you recall the paper we read earlier by Thekkath and Levy about optimizing RPC in a LAN,
we said that one-size-fits-all is not a good way to think about designing complex software systems.
Depending on what the environment is, you have to adapt the layers of the system.

• Therefore, even though the TCP/IP protocol is well laid out in terms of what the functional
requirements are, whether we need ALL the components in a particular implementation of TCP/IP
is something for the designer to decide.

• And ESM gives you the freedom to mix and match the components that make sense in the
specific environment for which you are building a protocol stack.
That's the reason for using ESM.
• It has about 60 micro-protocols that make up the whole Ensemble suite.

• The Ensemble micro-protocols are written in OCaml and facilitate component-based design of a
complex system such as a TCP/IP protocol stack.

• The micro-protocols have well-defined interfaces that allow composition.

• Every micro-protocol has an interface for the layers that sit on top of it and for the layers below it.

• This is exactly the kind of component you want: you assemble these components layer by layer
to get the functionality that you intend to build.
Just to reiterate the original goal: we want to mimic VLSI design in building a complex
software system. These well-defined interfaces of the Ensemble micro-protocols facilitate component-
based design.
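Ensemble itself is written in OCaml; the Java sketch below only illustrates the layering idea with invented names: every micro-protocol exposes the same interface to the layer above and the layer below, so layers can be stacked like Lego blocks and composed into a stack.

```java
// Illustrative (not Ensemble's actual) layered micro-protocol interface.
import java.util.List;

interface MicroProtocol {
    byte[] down(byte[] msg);   // transform on the send path (e.g., add a header)
    byte[] up(byte[] msg);     // transform on the receive path (e.g., strip it)
}

class ProtocolStack {
    private final List<MicroProtocol> layers;   // topmost layer first

    ProtocolStack(List<MicroProtocol> layers) { this.layers = layers; }

    // Send: the message trickles down through every composed layer in turn
    // (flow control, sliding window, fragmentation, ... whichever were chosen).
    byte[] send(byte[] appMessage) {
        byte[] m = appMessage;
        for (MicroProtocol layer : layers) m = layer.down(m);
        return m;                                // goes out on the wire
    }

    // Receive: the same layers are traversed in the opposite order.
    byte[] receive(byte[] wireMessage) {
        byte[] m = wireMessage;
        for (int i = layers.size() - 1; i >= 0; i--) m = layers.get(i).up(m);
        return m;                                // delivered to the application
    }
}
```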
340 - How to Optimize the Protocol Stack

Given a behavioral spec for a protocol, say TCP/IP, can a protocol stack be synthesized as a composition
of the Ensemble micro-protocols?
Given 60 micro-protocols in the Ensemble suite, there are way too many combinations for a brute-force
approach.
The Ensemble paper used a heuristic algorithm to synthesize the stack based on the desired behavioral
spec and designer's knowledge of the micro-protocols. The result is a protocol stack which is functional,
but not optimized for performance.
Of course, as operating system designers, we're always worried about performance. In particular, the
fact that we've assembled this protocol stack like Lego blocks, putting together all these micro-protocols
layer-by-layer leads to inefficiency.
There are clear boundaries/interfaces between these layers/components. To cross the component
boundaries, you may have to copy parameters and arguments specified by the interfaces between
components and so on.
So we have to do the extra work to optimize the component-based design for performance reasons.
There are several sources that can be optimized.
• For instance I mentioned that OCaml has implicit garbage collection. It is good that it has implicit
garbage collection as a fallback, but we don't want to use it all the time. We want to be explicit
about how we manage our own memory, which is more efficient.

• I mentioned that OCaml has ability for marshaling and unmarshalling arguments to go across
layers. But, when you're going across layers, these things can add overheads and this is another
source of optimization. We can avoid the marshalling and unmarshalling across the layers by
collapsing the layers.

• Another opportunity that exists especially in networking systems is the fact that there's going to
be communication and computation. If you think about TCP/IP Protocol, it has to buffer the
packets that are being sent out, in case the packets are lost. This is where we can overlap this
buffering with the actual transmission.

• Another opportunity is compressing the header. When we have this layering, at every layer it
might add a new header specific to that layer. Those headers may have common fields (e.g. size
of the packet, checksum and so on) and we can eliminate those common fields when we go across
these layers.

• Another thing that we always have to worry about is making sure that the code we execute
fits into the caches, something we've talked about all along. Locality
enhancement, making sure that the working set of the code executing on a processor
fits into the cache, is very important. So we can identify common code paths across different layers
of the protocol stack and co-locate those common code paths. In this way we can enhance
locality for processing.
So there are lots of opportunities for optimization, but doing it by hand is tedious.
How do we automate the process of optimization so that we don't have to do it manually?
341 - NuPrl to the Rescue - CCP and bypass

To understand the way Nuprl interprets a protocol, it is useful to think of a protocol as a function.
Such a function
• takes the state of the protocol (the collected variables maintained by the protocol) and an input
event (a user operation, an arriving message, an expiring timer, ...), and
• produces an updated state and a list of output events.
This function can be optimized if something is known about an input event and the state of the protocol.
We express this knowledge by a so-called Common Case Predicate (CCP): a Boolean function on the
state of a protocol and an input event.
For example, a CCP may be true if
• the event is a Deliver event AND
• the low end of the receiver’s sliding window is equal to the sequence number in the event (in other
words, this is the next expected packet to arrive, and it was not lost or reordered).
If a message has a CCP = TRUE, that message may be delivered and the low end of the window moved
up, without a need for buffering.
CCPs are specified by the programmer of a protocol, and are typically determined from run-time
statistics.
So the basic Nuprl’s idea of optimizing code is to find the CCP that allows to skip some code path and
simply the processing, i.e. a protocol a bypass.
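Here is a sketch of that CCP example with hypothetical types: a protocol layer maps (state, input event) to an updated state plus output events, and the CCP is just a boolean function over the same inputs.

```java
// Sketch of the common case predicate (CCP) example above (hypothetical types).
class SlidingWindowState {
    int windowLow;               // sequence number of the next expected packet
}

class InputEvent {
    enum Kind { DELIVER, TIMER, USER_OP }
    Kind kind;
    int sequenceNumber;
}

class FifoLayer {
    // CCP: true when this is the next expected, in-order packet -- nothing was
    // lost or reordered, so no buffering is needed.
    static boolean commonCase(SlidingWindowState s, InputEvent e) {
        return e.kind == InputEvent.Kind.DELIVER
                && e.sequenceNumber == s.windowLow;
    }

    // Bypass for the common case: deliver immediately and slide the window up.
    // (The full, general-case code path would buffer and reorder packets.)
    static void bypass(SlidingWindowState s, InputEvent e) {
        s.windowLow = e.sequenceNumber + 1;   // the updated state
        // ... emit a single "deliver to the layer above" output event ...
    }
}
```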
342 - NuPrl to the Rescue - Static and Dynamic Optimization

So how does Nuprl optimize the code?


The first step, static optimization, is done on Ensemble's protocol stack. This requires that a
Nuprl expert and an OCaml expert sit together and, in a semi-automatic manner, go layer by layer through the
protocol stack and identify what code transformations can be applied to optimize each layer. We are not
going across layers; we only focus on the bypass within one layer. The kinds of code transformation
include:
• Function inlining and symbolic evaluation
• Directed equality substitution
• Context-dependent simplifications.
Performing these steps in a logical proof environment guarantees that the generated bypass code is
equivalent to the original code of the layer, if the CCP holds. In most cases this means that about 100-
300 lines of code have been reduced to a single update of the layer’s state and a single event to be passed
to the next layer.
In contrast to individual layers, application protocol stacks cannot be optimized a priori, as thousands
of possible configurations can be generated with the Ensemble toolkit. The application developer has
little or no knowledge about Ensemble’s code. Therefore, an optimization of an application stack has to
be completely automatic.
So suppose we have a CCP which checks the sequence number (protocol state) of an arriving packet (input
event).
• If the CCP is true, we can execute this bypass code and completely skip all these intermediate
layers of protocol and go directly to the upper layer perhaps all the way up to the application.
• If the CCP is not true, then we have to do the normal processing of giving the packet to this micro-
protocol, layer-by-layer.
We can derive a CCP for each layer and possibly end up collapsing/skipping multiple layers like
this. A good CCP "collapses" all of these layers into a single predicate.
That's the beauty of the dynamic optimization, and it's completely automated.
This comes from the power of the theorem proving framework of NuPrl, which guarantees the
equivalence of the bypass code to the layers of protocol that it is replacing.
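The resulting dispatch at the level of the whole stack might look like the sketch below (reusing the InputEvent type from the previous sketch; the other names are hypothetical): if the collapsed predicate holds, the generated bypass code runs; otherwise the packet goes through the micro-protocol layers one by one.

```java
// Sketch of the stack-level bypass dispatch (hypothetical names throughout).
class StackState { /* collected state of all the layers in the stack */ }

class BypassDispatcher {
    void onPacket(StackState state, InputEvent event) {
        if (ccpForWholeStack(state, event)) {
            // Common case: one state update and one event straight to the app,
            // generated by Nuprl and proven equivalent to the layered code.
            bypassToApplication(state, event);
        } else {
            // Uncommon case: fall back to normal, layer-by-layer processing
            // through the composed micro-protocols.
            processLayerByLayer(state, event);
        }
    }

    private boolean ccpForWholeStack(StackState s, InputEvent e) { return false; }
    private void bypassToApplication(StackState s, InputEvent e) { }
    private void processLayerByLayer(StackState s, InputEvent e) { }
}
```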
Then we are ready to convert the optimized Nuprl code back to OCaml, and the resulting OCaml code is the optimized one.
A word of caution, however. There's a difference between optimization and verification. All that the
Nuprl framework is doing is optimization, not verifying whether the OCaml code adheres to the
behavioral spec in IOA.
So this exercise has shown the path to synthesizing complex system software, starting from specification
to implementation to optimization, putting theory and practice together.

343 - Conclusion

As operating system designers, the natural question that comes up is:


Okay, all this sounds good. But do I lose out on performance for the convenience of component-based
design?
This is the same question that came up when we wanted to go for a microkernel-based design, away
from a monolithic design.
The Cornell experiment takes it one step further and argues for synthesizing individual subsystems of
an OS from modular components. Just like putting together Lego blocks to get the desired functionality.
I encourage you to read the paper from Cornell in full, which shows that this methodology, applied to
one specific subsystem, results in a performance competitive implementation of the protocol stack,
compared to a monolithic implementation.
