Epilogues, Canaries, and Buffer Overflows

Last week we looked at how the stack works and how stack frames are built during function prologues. Now it's time to look at the inverse process as stack frames are destroyed in function epilogues. Let's bring back our friend add.c:

Simple Add Program - add.c
int add(int a, int b)
int result = a + b;
return result;

int main(int argc)
int answer;
answer = add(40, 2);

We're executing line 4, right after the assignment of a + b into result. This is what happens:

The first instruction is redundant and a little silly because we know eax is already equal to result, but this is what you get with optimization turned off. The leave instruction then runs, doing two tasks for the price of one: it resets esp to point to the start of the current stack frame, and then restores the saved ebp value. These two operations are logically distinct and thus are broken up in the diagram, but they happen atomically if you're tracing with a debugger.

After leave runs the previous stack frame is restored. The only vestige of the call to add is the return address on top of the stack. It contains the address of the instruction in main that must run after add is done. The ret instruction takes care of it: it pops the return address into the eip register, which points to the next instruction to be executed. The program has now returned to main, which resumes:

main copies the return value from add into local variable answer and then runs its own epilogue, which is identical to any other. Again the only peculiarity in main is that the saved ebp is null, since it is the first stack frame in our code. In the last step, execution has been returned to the C runtime (libc), which will exit to the operating system. Here's a diagram with the full return sequence for those who need it.

You now have an excellent grasp of how the stack operates, so let's have some fun and look at one of the most infamous hacks of all time: exploiting the stack buffer overflow. Here is a vulnerable program:

Vulnerable Program - buffer.c
void doRead()
char buffer[28];

int main(int argc)

The code above uses gets to read from standard input. gets keeps reading until it encounters a newline or end of file. Here's what the stack looks like after a string has been read:

The problem here is that gets is unaware of buffer's size: it will blithely keep reading input and stuffing data into the stack beyond buffer, obliterating the saved ebp value, return address, and whatever else is below. To exploit this behavior, attackers craft a precise payload and feed it into the program. This is what the stack looks like during an attack, after the call to gets:

The basic idea is to provide malicious assembly code to be executed and overwrite the return address on the stack to point to that code. It is a bit like a virus invading a cell, subverting it, and introducing some RNA to further its goals.

And like a virus, the exploit's payload has many notable features. It starts with several nop instructions to increase the odds of successful exploitation. This is because the return address is absolute and must be guessed, since attackers don't know exactly where in the stack their code will be stored. But as long as they land on a nop, the exploit works: the processor will execute the nops until it hits the instructions that do work.

The exec /bin/sh symbolizes raw assembly instructions that execute a shell (imagine for example that the vulnerability is in a networked program, so the exploit might provide shell access to the system). The idea of feeding raw assembly to a program expecting a command or user input is shocking at first, but that's part of what makes security research so fun and mind-expanding. To give you an idea of how weird things get, sometimes the vulnerable program calls tolower or toupper on its inputs, forcing attackers to write assembly instructions whose bytes do not fall into the range of upper- or lower-case ascii letters.

Finally, attackers repeat the guessed return address several times, again to tip the odds ever in their favor. By starting on a 4-byte boundary and providing multiple repeats, they are more likely to overwrite the original return address on the stack.

Thankfully, modern operating systems have a host of protections against buffer overflows, including non-executable stacks and stack canaries. The "canary" name comes from the canary in a coal mine expression, an addition to computer science's rich vocabulary. In the words of Steve McConnell:

Computer science has some of the most colorful language of any field. In what other field can you walk into a sterile room, carefully controlled at 68°F, and find viruses, Trojan horses, worms, bugs, bombs, crashes, flames, twisted sex changers, and fatal errors?

Steve McConnellCode Complete 2

At any rate, here's what a stack canary looks like:

Canaries are implemented by the compiler. For example, GCC's stack-protector option causes canaries to be used in any function that is potentially vulnerable. The function prologue loads a magic value into the canary location, and the epilogue makes sure the value is intact. If it's not, a buffer overflow (or bug) likely happened and the program is aborted via __stack_chk_fail. Due to their strategic location on the stack, canaries make the exploitation of stack buffer overflows much harder.

This finishes our journey within the depths of the stack. We don't want to delve too greedily and too deep. Next week we'll go up a notch in abstraction to take a good look at recursion, tail calls and other tidbits, probably using Google's V8. To end this epilogue and prologue talk, I'll close with a cherished quote inscribed on a monument in the American National Archives:

Journey to the Stack, Part I

Earlier we've explored the anatomy of a program in memory, the landscape of how our programs run in a computer. Now we turn to the call stack, the work horse in most programming languages and virtual machines. Along the way we'll meet fantastic creatures like closures, recursion, and buffer overflows. But the first step is a precise picture of how the stack operates.

The stack is so important because it keeps track of the functions running in a program, and functions are in turn the building blocks of software. In fact, the internal operation of programs is normally very simple. It consists mostly of functions pushing data onto and popping data off the stack as they call each other, while allocating memory on the heap for data that must survive across function calls. This is true for both low-level C software and VM-based languages like JavaScript and C#. A solid grasp of this reality is invaluable for debugging, performance tuning and generally knowing what the hell is going on.

When a function is called, a stack frame is created to support the function's execution. The stack frame contains the function's local variables and the arguments passed to the function by its caller. The frame also contains housekeeping information that allows the called function (the callee) to return to the caller safely. The exact contents and layout of the stack vary by processor architecture and function call convention. In this post we look at Intel x86 stacks using C-style function calls (cdecl). Here's a single stack frame sitting live on top of the stack:

Right away, three CPU registers burst into the scene. The stack pointer, esp, points to the top of the stack. The top is always occupied by the last item that was pushed onto the stack but has not yet been popped off, just as in a real-world stack of plates or $100 bills.

The address stored in esp constantly changes as stack items are pushed and popped, such that it always points to the last item. Many CPU instructions automatically update esp as a side effect, and it's impractical to use the stack without this register.

In the Intel architecture, as in most, the stack grows towards lower memory addresses. So the "top" is the lowest memory address in the stack containing live data: local_buffer in this case. Notice there's nothing vague about the arrow from esp to local_buffer. This arrow means business: it points specifically to the first byte occupied by local_buffer because that is the exact address stored in esp.

The second register tracking the stack is ebp, the base pointer or frame pointer. It points to a fixed location within the stack frame of the function currently running and provides a stable reference point (base) for access to arguments and local variables. ebp changes only when a function call begins or ends. Thus we can easily address each item in the stack as an offset from ebp, as shown in the diagram.

Unlike esp, ebp is mostly maintained by program code with little CPU interference. Sometimes there are performance benefits in ditching ebp altogether, which can be done via compiler flags. The Linux kernel is one example where this is done.

Finally, the eax register is used by convention to transfer return values back to the caller for most C data types.

Now let's inspect the data in our stack frame. These diagram shows precise byte-for-byte contents as you'd see in a debugger, with memory growing left-to-right, top-to-bottom. Here it is:

The local variable local_buffer is a byte array containing a null-terminated ascii string, a staple of C programs. The string was likely read from somewhere, for example keyboard input or a file, and it is 7 bytes long. Since local_buffer can hold 8 bytes, there's 1 free byte left. The content of this byte is unknown because in the stack's infinite dance of pushes and pops, you never know what memory holds unless you write to it. Since the C compiler does not initialize the memory for a stack frame, contents are undetermined

  • and somewhat random - until written to. This has driven some into madness.

Moving on, local1 is a 4-byte integer and you can see the contents of each byte. It looks like a big number, with all those zeros following the 8, but here your intuition leads you astray.

Intel processors are little endian machines, meaning that numbers in memory start with the little end first. So the least significant byte of a multi-byte number is in the lowest memory address. Since that is normally shown leftmost, this departs from our usual representation of numbers. It helps to know that this endian talk is borrowed from Gulliver's Travels: just as folks in Lilliput eat their eggs starting from the little end, Intel processors eat their numbers starting from the little byte.

So local1 in fact holds the number 8, as in the legs of an octopus. param1, however, has a value of 2 in the second byte position, so its mathematical value is 2 * 256 = 512 (we multiply by 256 because each place value ranges from 0 to 255). Meanwhile, param2 is carrying weight at 1 * 256 * 256 = 65536.

The housekeeping data in this stack frame consists of two crucial pieces: the address of the previous stack frame (saved ebp) and the address of the instruction to be executed upon the function's exit (the return address). Together, they make it possible for the function to return sanely and for the program to keep running along.

Now let's see the birth of a stack frame to build a clear mental picture of how this all works together. Stack growth is puzzling at first because it happens in the opposite direction you'd expect. For example, to allocate 8 bytes on the stack one subtracts 8 from esp, and subtraction is an odd way to grow something.

Let's take a simple C program:

Simple Add Program - add.c
int add(int a, int b)
int result = a + b;
return result;

int main(int argc)
int answer;
answer = add(40, 2);

Suppose we run this in Linux without command-line parameters. When you run a C program, the first code to actually execute is in the C runtime library, which then calls our main function. The diagrams below show step-by-step what happens as the program runs. Each diagram links to GDB output showing the state of memory and registers. You may also see the GDB commands used and the whole GDB output. Here we go:

Steps 2 and 3, along with 4 below, are the function prologue, which is common to nearly all functions: the current value of ebp is saved to the top of the stack, and then esp is copied to ebp, establishing a new frame. main's prologue is like any other, but with the peculiarity that ebp is zeroed out when the program starts.

If you were to inspect the stack below argc (to the right) you'd find more data, including pointers to the program name and command-line parameters (the traditional C argv), plus pointers to Unix environment variables and their actual contents. But that's not important here, so the ball keeps rolling towards the add() call:

After main subtracts 12 from esp to get the stack space it needs, it sets the values for a and b. Values in memory are shown in hex and little-endian format, as you'd see in a debugger. Once parameter values are set, main calls add and it starts running:

Now there's some excitement! We get another prologue, but this time you can see clearly how the stack frames form a linked list, starting at ebp and going down the stack. This is how debuggers and Exception objects in higher-level languages get their stack traces. You can also see the much more typical catching up of ebp to esp when a new frame is born. And again, we subtract from esp to get more stack space.

There's also the slightly weird reversal of bytes when the ebp register value is copied to memory. What's happening here is that registers don't really have endianness: there are no "growing addresses" inside a register as there are for memory. Thus by convention debuggers show register values in the most natural format to humans: most significant to least significant digits. So the results of the copy in a little-endian machine appear reversed in the usual left-to-right notation for memory. I want the diagrams to provide an accurate picture of what you find in the trenches, so there you have it.

With the hard part behind us, we add:

There are guest register appearances to help out with the addition, but otherwise no alarms and no surprises. add did its job, and at this point the stack action would go in reverse, but we'll save that for next time.

Anybody who's read this far deserves a souvenir, so I've made a large diagram showing all the steps combined in a fit of nerd pride.

It looks tame once it's all laid out. Those little boxes help a lot. In fact, little boxes are the chief tool of computer science. I hope the pictures and register movements provide an intuitive mental picture that integrates stack growth and memory contents. Up close, our software doesn't look too far from a simple Turing machine.

This concludes the first part of our stack tour. There's some more byte spelunking ahead, and then it's on to see higher level programming concepts built on this foundation. See you next week.

Page Cache, the Affair Between Memory and Files

Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.

Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5 GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.

Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. The first read goes like this:

Reading and the page cache

After 12KB have been read, render's heap and the relevant page frames look thus:

Non-mapped file read

This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4KB page frames are now in the page cache storing part of scene.dat. People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4KB chunks. If you read a single byte from a file, the whole 4KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256KB views analogous to pages in the Linux page cache.

Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes cpu time and hurts the cpu caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:

Mapped file read

When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the Unix Environment. You might also save large amounts of physical memory, depending on the nature of your application.

As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too, it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.

Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM, it's actually good behavior because in a way free physical memory is a waste. Better use as much of the stuff for caching as possible.

Due to the page cache architecture, when a program calls write() bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes your writes will never make it, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read ahead where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise(), readahead(), Windows cache hints ). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or NO_BUFFERING in Windows, something database software often does.

A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:

The Copy-On-Write mechanism

The read-only page table entries shown above do not mean the mapping is read only, they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up in the disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.

Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it, it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.

Mapping virtual memory to physical memory

This concludes our 3-part series on memory fundamentals. I hope the series was useful and provided you with a good mental model of these OS topics.


The Thing King

I hope the previous post explained virtual memory adequately, but I must admit I held back a much better explanation, which I first saw in Expert C Programming. It wasn't written by the book's author, Peter van der Linden, but rather by Jeff Berryman in 1972. Here goes:

The Thing King and the Paging Game

This note is a formal non-working paper of the Project MAC Computer Systems Research Division. It should be reproduced and distributed wherever levity is lacking, and may be referenced at your own risk in other publications.


  1. Each player gets several million things.
  2. Things are kept in crates that hold 4096 things each. Things in the same crate are called crate-mates.
  3. Crates are stored either in the workshop or the warehouses. The workshop is almost always too small to hold all the crates.
  4. There is only one workshop but there may be several warehouses. Everybody shares them.
  5. Each thing has its own thing number.
  6. What you do with a thing is to zark it. Everybody takes turns zarking.
  7. You can only zark your things, not anybody else's.
  8. Things can only be zarked when they are in the workshop.
  9. Only the Thing King knows whether a thing is in the workshop or in a warehouse.
  10. The longer a thing goes without being zarked, the grubbier it is said to become.
  11. The way you get things is to ask the Thing King. He only gives out things by the crateful. This is to keep the royal overhead down.
  12. The way you zark a thing is to give its thing number. If you give the number of a thing that happens to be in a workshop it gets zarked right away. If it is in a warehouse, the Thing King packs the crate containing your thing back into the workshop. If there is no room in the workshop, he first finds the grubbiest crate in the workshop, whether it be yours or somebody else's, and packs it off with all its crate-mates to a warehouse. In its place he puts the crate containing your thing. Your thing then gets zarked and you never know that it wasn't in the workshop all along.
  13. Each player's stock of things have the same numbers as everybody else's. The Thing King always knows who owns what thing and whose turn it is, so you can't ever accidentally zark somebody else's thing even if it has the same thing number as one of yours.


  1. Traditionally, the Thing King sits at a large, segmented table and is attended to by pages (the so-called "table pages") whose job it is to help the king remember where all the things are and who they belong to.
  2. One consequence of Rule 13 is that everybody's thing numbers will be similar from game to game, regardless of the number of players.
  3. The Thing King has a few things of his own, some of which move back and forth between workshop and warehouse just like anybody else's, but some of which are just too heavy to move out of the workshop.
  4. With the given set of rules, oft-zarked things tend to get kept mostly in the workshop while little-zarked things stay mostly in a warehouse. This is efficient stock control.

Long Live the Thing King!

Update: Alex pointed out the difficulties of measuring grubbiness in a comment below.


How The Kernel Manages Your Memory

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:

Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two work horses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:

Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.

A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.

The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3GB of user space in 4KB pages:

4KB Pages Virtual User Space

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process' page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.

Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them, they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4 GB. The other PTE fields are for another day, as is Physical Address Extension.

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.

Virtual memory doesn't store anything, it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:

Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.

Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:

Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.

A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says "sure", and it creates or updates the appropriate VMA. But it does not actually honor the request right away, it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:

Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.

When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.

Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.

This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.


Anatomy of a Program in Memory

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I'll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.

Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:

Kernel/User Memory Split

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:

Process Switch Effects on Virtual Memory

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:

Flexible Process Address Space Layout In Linux

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.

The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents - a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the cpu caches, speeding up access. Each thread in a process gets its own stack.

It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it's appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.

Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.

Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call (implementation) or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. 'Large' means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().

Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.

If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call (implementation) to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs' chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:

Fragmented Heap

Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.

The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program's binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!

The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo - a 4-byte memory address - live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here's a diagram showing these segments and our example variables:

ELF Binary Image Mapped Into Memory

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what 'area' really means. Also, sometimes people say "data segment" meaning all of data + bss + heap.

You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the "flexible" layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that's not the case, Linux reverts back to the "classic" layout shown below:

Classic Process Address Space Layout In Linux

That's it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we'll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.


Getting Physical With Memory

When trying to understand complex systems, you can often learn a lot by stripping away abstractions and looking at their lowest levels. In that spirit we take a look at memory and I/O ports in their simplest and most fundamental level: the interface between the processor and bus. These details underlie higher level topics like thread synchronization and the need for the Core i7. Also, since I'm a programmer I ignore things EE people care about. Here's our friend the Core 2 again:

Physical Memory Access

A Core 2 processor has 775 pins, about half of which only provide power and carry no data. Once you group the pins by functionality, the physical interface to the processor is surprisingly simple. The diagram shows the key pins involved in a memory or I/O port operation: address lines, data pins, and request pins. These operations take place in the context of a transaction on the front side bus. FSB transactions go through 5 phases: arbitration, request, snoop, response, and data. Throughout these phases, different roles are played by the components on the FSB, which are called agents. Normally the agents are all the processors plus the northbridge.

We only look at the request phase in this post, in which 2 packets are output by the request agent, who is usually a processor. Here are the juiciest bits of the first packet, output by the address and request pins:

FSB Request Phase, Packet A

The address lines output the starting physical memory address for the transaction. We have 33 bits but they are interpreted as bits 35-3 of an address in which bits 2-0 are zero. Hence we have a 36-bit address, aligned to 8 bytes, for a total of 64GB addressable physical memory. This has been the case since the Pentium Pro. The request pins specify what type of transaction is being initiated; in I/O requests the address pins specify an I/O port rather than a memory address. After the first packet is output, the same pins transmit a second packet in the subsequent bus clock cycle:

FSB Request Phase, Packet B

The attribute signals are interesting: they reflect the 5 types of memory caching behavior available in Intel processors. By putting this information on the FSB, the request agent lets other processors know how this transaction affects their caches, and how the memory controller (northbridge) should behave. The processor determines the type of a given memory region mainly by looking at page tables, which are maintained by the kernel.

Typically kernels treat all RAM memory as write-back, which yields the best performance. In write-back mode the unit of memory access is the cache line, 64 bytes in the Core 2. If a program reads a single byte in memory, the processor loads the whole cache line that contains that byte into the L2 and L1 caches. When a program writes to memory, the processor only modifies the line in the cache, but does not update main memory. Later, when it becomes necessary to post the modified line to the bus, the whole cache line is written at once. So most requests have 11 in their length field, for 64 bytes. Here's a read example in which the data is not in the caches:

Memory Read Sequence Diagram

Some of the physical memory range in an Intel computer is mapped to devices like hard drives and network cards instead of actual RAM memory. This allows drivers to communicate with their devices by writing to and reading from memory. The kernel marks these memory regions as uncacheable in the page tables. Accesses to uncacheable memory regions are reproduced in the bus exactly as requested by a program or driver. Hence it's possible to read or write single bytes, words, and so on. This is done via the byte enable mask in packet B above.

The primitives discussed here have many implications. For example:

  1. Performance-sensitive applications should try to pack data that is accessed together into the same cache line. Once the cache line is loaded, further reads are much faster and extra RAM accesses are avoided.
  2. Any memory access that falls within a single cache line is guaranteed to be atomic (assuming write-back memory). Such an access is serviced by the processor's L1 cache and the data is read or written all at once; it cannot be affected halfway by other processors or threads. In particular, 32-bit and 64-bit operations that don't cross cache line boundaries are atomic.
  3. The front bus is shared by all agents, who must arbitrate for bus ownership before they can start a transaction. Moreover, all agents must listen to all transactions in order to maintain cache coherence. Thus bus contention becomes a severe problem as more cores and processors are added to Intel computers. The Core i7 solves this by having processors attached directly to memory and communicating in a point-to-point rather than broadcast fashion.

These are the highlights of physical memory requests; the bus will surface again later in connection with locking, multi-threading, and cache coherence. The first time I saw FSB packet descriptions I had a huge "ahhh!" moment so I hope someone out there gets the same benefit. In the next post we'll go back up the abstraction ladder to take a thorough look at virtual memory.

[22 Comments](/comments/physical-memory.html)

Cache: a place for concealment and safekeeping

This post shows briefly how CPU caches are organized in modern Intel processors. Cache discussions often lack concrete examples, obfuscating the simple concepts involved. Or maybe my pretty little head is slow. At any rate, here's half the story on how a Core 2 L1 cache is accessed:

Selecting an L1 cache set (row)

The unit of data in the cache is the line, which is just a contiguous chunk of bytes in memory. This cache uses 64-byte lines. The lines are stored in cache banks or ways, and each way has a dedicated directory to store its housekeeping information. You can imagine each way and its directory as columns in a spreadsheet, in which case the rows are the sets. Then each cell in the way column contains a cache line, tracked by the corresponding cell in the directory. This particular cache has 64 sets and 8 ways, hence 512 cells to store cache lines, which adds up to 32KB of space.

In this cache's view of the world, physical memory is divided into 4KB physical pages. Each page has 4KB / 64 bytes == 64 cache lines in it. When you look at a 4KB page, bytes 0 through 63 within that page are in the first cache line, bytes 64-127 in the second cache line, and so on. The pattern repeats for each page, so the 3rd line in page 0 is different than the 3rd line in page 1.

In a fully associative cache any line in memory can be stored in any of the cache cells. This makes storage flexible, but it becomes expensive to search for cells when accessing them. Since the L1 and L2 caches operate under tight constraints of power consumption, physical space, and speed, a fully associative cache is not a good trade off in most scenarios.

Instead, this cache is set associative, which means that a given line in memory can only be stored in one specific set (or row) shown above. So the first line of any physical page (bytes 0-63 within a page) must be stored in row 0, the second line in row 1, etc. Each row has 8 cells available to store the cache lines it is associated with, making this an 8-way associative set. When looking at a memory address, bits 11-6 determine the line number within the 4KB page and therefore the set to be used. For example, physical address 0x800010a0 has 000010 in those bits so it must be stored in set 2.

But we still have the problem of finding which cell in the row holds the data, if any. That's where the directory comes in. Each cached line is tagged by its corresponding directory cell; the tag is simply the number for the page where the line came from. The processor can address 64GB of physical RAM, so there are 64GB / 4KB == 224 of these pages and thus we need 24 bits for our tag. Our example physical address 0x800010a0 corresponds to page number 524,289. Here's the second half of the story:

Finding cache line by matching tags

Since we only need to look in one set of 8 ways, the tag matching is very fast; in fact, electrically all tags are compared simultaneously, which I tried to show with the arrows. If there's a valid cache line with a matching tag, we have a cache hit. Otherwise, the request is forwarded to the L2 cache, and failing that to main system memory. Intel builds large L2 caches by playing with the size and quantity of the ways, but the design is the same. For example, you could turn this into a 64KB cache by adding 8 more ways. Then increase the number of sets to 4096 and each way can store 256KB. These two modifications would deliver a 4MB L2 cache. In this scenario, you'd need 18 bits for the tags and 12 for the set index; the physical page size used by the cache is equal to its way size.

If a set fills up, then a cache line must be evicted before another one can be stored. To avoid this, performance-sensitive programs try to organize their data so that memory accesses are evenly spread among cache lines. For example, suppose a program has an array of 512-byte objects such that some objects are 4KB apart in memory. Fields in these objects fall into the same lines and compete for the same cache set. If the program frequently accesses a given field (e.g., the vtable by calling a virtual method), the set will likely fill up and the cache will start trashing as lines are repeatedly evicted and later reloaded. Our example L1 cache can only hold the vtables for 8 of these objects due to set size. This is the cost of the set associativity trade-off: we can get cache misses due to set conflicts even when overall cache usage is not heavy. However, due to the relative speeds in a computer, most apps don't need to worry about this anyway.

A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the paging unit to obtain the physical page address used for the cache tags. By contrast, the set index comes from the least significant bits of the linear address and is used without translation (bits 11-6 in our example). Hence the L1 cache is physically tagged but virtually indexed, helping the CPU to parallelize lookup operations. Because the L1 way is never bigger than an MMU page, a given physical memory location is guaranteed to be associated with the same set even with virtual indexing. L2 caches, on the other hand, must be physically tagged and physically indexed because their way size can be bigger than MMU pages. But then again, by the time a request gets to the L2 cache the physical address was already resolved by the L1 cache, so it works out nicely.

Finally, a directory cell also stores the state of its corresponding cached line. A line in the L1 code cache is either Invalid or Shared (which means valid, really). In the L1 data cache and the L2 cache, a line can be in any of the 4 MESI states: Modified, Exclusive, Shared, or Invalid. Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2 cache. These states will play a part in later posts about threading, locking, and that kind of stuff. Next time we'll look at the front side bus and how memory access really works. This is going to be memory week.

Update: Dave brought up direct-mapped caches in a comment below. They're basically a special case of set-associative caches that have only one way. In the trade-off spectrum, they're the opposite of fully associative caches: blazing fast access, lots of conflict misses.


Counting Infinity

Infinity is a fascinating idea and it is behind some of the most beautiful results in mathematics. Much of this beauty is accessible to everyone, brought to us by the brilliant Georg Cantor using simple arguments that require little math. I hope to show some of this goodness to people who haven't seen it before. Here we go.

Imagine a shepherd tending a flock of a few dozen sheep. In the morning, the sheep get out of the farm to do sheepish things. In the evening, the shepherd wants to make sure he got all his sheep back. One problem: he can't count. What's a shepherd to do? One solution would be to keep track of the sheep as they go out. For example, he could throw a pebble into a bucket for each sheep leaving the farm. In the evening, a pebble goes out for each returned sheep. At any given time, the number of pebbles in the bucket is exactly equal to that of outstanding sheep. You can picture the sets of sheep and pebbles like so:

Bijection of sheep and pebbles

Each sheep is paired with a distinct pebble and there are no left overs on either side. In set theory this is a bijection. Even though the sheep have not been counted, we know the sets have the same number of elements. They are equivalent in a way, not because you eat pebbles or throw sheep, though I suppose you could, but because of the bijection. If you accept this premise, which is reasonable enough, then you're in for some fun. Interesting things happen when we use bijections to compare sets nobody can count: the infinite sets. For example, let's compare the natural numbers (1, 2, 3...) to the even natural numbers (2, 4, 6...). It sounds like we should have less even numbers, but lo and behold, we get this:

Bijection of naturals and even naturals

Every natural number has been paired with a distinct even number. No left overs. Strangely, these sets are equivalent. It turns out many sets are equivalent to the natural numbers; we call these sets denumerable. For example, the set of Turing Machines is denumerable because there are infinitely many machines and they can each be fully described by a distinct natural number. The integers are yet another example:

Bijection of naturals and integers

Some people see these results as trivial: their intuitive notion is that any "infinity" must be the same. Others see a paradox. How can a part of something be equivalent to the whole? Let's try to give everyone a paradox by finding a set that is not denumerable. How about the rationals? Surely there must be more fractions than natural numbers! Right? Cantor probably thought so, until he discovered a procedure to enumerate the rationals:

Counting Rationals

Starting at the top and following the pattern, this procedure hits each rational number once. As it moves along, we can associate successive natural numbers with each hit. Here are the first few pairings:

Natural and Rational Pairings

It is pretty clear that the arrangement above covers all of the rationals, but the crucial point is that Cantor's pattern will reach any given rational in a finite number of steps. The zigzag is important: if we go off in a single direction (say, across the first row to the right) then we get stuck in the infinitude of that row alone, and there would be rationals left over. The zigzag delivers us from that evil. This argument does not produce an equation for a bijection, but since it builds a listing that contains every rational number, it shows they are denumerable. Conversely, if a set is denumerable its elements can be put into a listing where each element is paired with a natural number.

(As an aside, the 1999 paper Recounting the rationals establishes a precise bijection by using a tree to generate all fractions in reduced form. The tree has all sorts of magical properties, cool stuff. It's an accessible paper, plus Brent Yorgey wrote a great walk through of it.)

All of our sets so far have been denumerable, except for the sheep who were finite. At this point, Cantor might have wondered if maybe there is only one infinity. But then again, if you build a number line using the sets we have seen so far, you end up with holes. The naturals are mere dots on the number line, most fractions fall in a hole:

A Hole In the Naturals

The rationals, on the other hand, are dense. Between any two given rationals, there is another rational (an infinite number of them actually). Yet the rational number line also has holes, which are the irrational numbers. Below is a famous irrational number, on whose account its discoverer was allegedly drowned at sea, shown along the rational number line:

A Hole In the Rationals

Irrational numbers cannot be expressed as a fraction; moreover, when written out as numbers using positional notation (say, as decimal numbers) their digits never settle into any kind of periodic pattern. Fractions, by contrast, eventually settle into a pattern when written out (e.g., 0.333..., 0.25000...). All of the rationals plus all of the irrational numbers make up the real numbers. And those guys form a number line free of holes. Any point in the number line is a real number, any crazy sequence of digits you come up with is a real number. It's the continuum. You might get the impression that irrational numbers are rare, oddballs among the familiar rationals, but that's not the case at all. The sparse number line above is an attempt at visualizing that fact, in addition to showcasing my superior preschool MS Paint skillz. The reals are a vast ocean of irrationals pointed here and there by a fraction, and we'll see why.

So now Cantor pits the real numbers against the naturals. He shows that even a section of the reals - the interval between 0 and 1 - is not denumerable. He uses a proof by contradiction, which is a common way to show that something cannot be true. It goes like this: suppose there is a way to enumerate the reals between 0 and 1. Then there is a listing containing every real number in that interval, each one associated with a different natural number. It might look like this:

Cantor's diagonal

The actual numbers shown above are not important, since the complete list is infinite and has every real number in the interval. Some of the reals up there are clearly rationals and have settled into a periodic pattern (0.250... and 0.500...). The others look pretty random, they could be irrational or maybe they haven't settled into their periodic patterns yet. Either way, for each number the digits go on indefinitely, rationals in a pattern (possibly of zeros) and irrationals randomly.

Notice how the numbers in red form a diagonal, which is itself infinite. Let's build a number using that diagonal. Going down the list, for each row we pick a digit that is different than the digit in red. For example, we could add 1 to each digit (9 becomes 0, no carry). You can use any rule as long as the digit is different. Here are the first few digits we get by adding 1 to the red diagonal digits:

Diagonal paradox

The digits of this number p are random and go on infinitely; it is a real number. But due to how we have defined p, it is different from every other number in the list. It differs in at least one digit. Even though the list is infinite, this number is not in it! But then, this list was supposed to have all real numbers so we have a contradiction. Hence our initial assumption must be false; the reals are not denumerable.

The notion of different infinities was a revolutionary one in mathematics. It is a stunning result which ignited the mother of all mathematician flame wars. This was actually the second proof Cantor offered for the non-denumerability of the continuum. The first proof is also valid, but slightly more technical. Cantor's diagonal argument however is amazingly simple and highly original, as brilliant as its conclusion. It went on to feature prominently in other intellectual landmarks. Alan Turing used it while proving that a Turing Machine cannot predict the behavior of other machines (specifically, predict whether another machine is "cycle-free", which is now known as the Halting Problem). Gödel used it in his Incompleteness Theorem discussed in Gödel, Escher, Bach. Diagonalization is crucial in recursion theory and complexity theory.

In order to capture these new concepts, Cantor proposed cardinality as a measure of how numerous a set is. A finite set's cardinality is simply how many elements it has, say 49 for our set of sheep. But the cardinality of an infinite set is expressed by a transfinite number. To make them more esoteric, Cantor used the Hebrew letter aleph for these numbers. By definition, the naturals took aleph zero, Aleph zero, for their cardinality. Every denumerable set has cardinality Aleph zero. The cardinality of the reals was defined as c, and Cantor spent years trying to discover whether c = Aleph one, that is, whether the cardinality of the reals is the 'next bigger one' after the naturals. So that explains the name in Mona Lisa Overdrive and also the handle for Aleph One, who wrote the classic paper on stack buffer overflows during the Phrack renaissance.

This concludes our first foray into the infinities. There's way too much to talk about. If there's interest I'll write a follow up to see where else infinity leads us. Iterative rather than upfront blogging :) Meanwhile, Journey Through Genius is a great book that does for many theorems, including Cantor's, what I did on this post. Only much better. Thanks for reading!

[31 Comments](/comments/counting-infinity.html)

Performance is a Science

For 2,000 years scholars held that heavier objects fall faster than lighter ones, partly because Aristotle couldn't be bothered to take 2 minutes to experiment. Hell, he even wrote that men have more teeth than women. Isn't that crazy? And yet, people often rely on this kind of fact-free reasoning to arrive at conclusions about computer performance, among other things. Worse, they spend their IT budgets or sacrifice code clarity based on these flawed ideas. In reality computers are far too complex for anyone to handle performance problems by "reasoning" alone.

Galileo On Pisa (This story is not true by the way)

Think about a routine in a modern jitted language. Right off the bat you face hidden magic like type coercion, boxing, and unboxing. Even if you know the language intimately, unknowns are introduced as your code is optimized first by the compiler, then again by the JIT compiler. It is then fed to the CPU, where optimizations such as branch prediction, memory prefetching and caching have drastic performance implications. What's worse, much of the above can and does change between different versions of compilers, runtimes, and processors. Your ability to predict what is going to happen is limited indeed.

To take another example, consider a user thinking of RAID-0 to boost performance. Whether there are any gains depends on a host of variables. What are the patterns of the I/O workload? Is it dominated by seeks and random operations, or is there a lot of streaming going on? Reads or writes? How does the kernel I/O scheduler play into it? How smart are the RAID controller and drivers? How will a journaling file system impact performance given the need for write barriers? What stripe sizes and file system block sizes will be used? There are way too many interdependent factors and interactions for speculative analysis. Even kernel developers are stumped by surprising and counterintuitive performance results.

Measurement is the only way to go. Without it, you're in the speculation realm of performance tuning, the kingdom of fools and the deluded. But even measurement has its problems. Maybe you're investigating a given algorithm by running it thousands of times in a row and timing the results. Is that really a valid test? By doing so you are measuring a special case where the caches are always hot. Do the conclusions hold in practice? Most importantly, do you know what percentage of time is spent in that algorithm in the normal use of the application? Is it even worth optimizing?

LHC - CMS Detector

Or say you've got a fancy new RAID-0 set up. You run some benchmark that writes large globs of data to the disk and see that your sustained write throughput is twice that of a single disk. Sounds great, too bad it has no bearing on most real-world workloads. The problem with the naive timing test and the benchmark is that they are synthetic measurements. They are scarcely better than speculation.

To tackle performance you must make accurate measurements of real-world workloads and obtain quantitative data. Thus we as developers must be proficient using performance measurement tools. For code this usually means profiling so you know exactly where time is being spent as your app runs. When dealing with complex applications, you may need to build instrumentation to collect enough data. Tools like Cachegrind can help paint a fuller picture of reality.

For website load times and networks you might use tools like WireShark and Fiddler, as Google did for GMail. In databases, use SQL profiling to figure out how much CPU, reading, and writing each query is consuming; these are more telling than the time a query takes to run since the query might be blocked or starved for resources, in which case elapsed time doesn't mean much. Locks and who is blocking who are also crucial in a database. When looking at a whole system, use your OS tools to record things such as CPU usage, disk queue length, I/Os per second, I/O completion times, swapping activity, and memory usage.

In sum, do what it takes to obtain good data and rely on it. I'm big on empiricism overall, but in performance it is everything. Don't trust hearsay, don't assume that what held in version 1 is still true for version 2, question common wisdom and blog posts like this one. We all make comical mistakes, even Aristotle did. Naturally, it takes theory and analysis to decide what to measure, how to interpret it, and how to make progress. You need real-world measurement plus reasoning. Like science.