- Terminology
- Overview of page table
- Memory Layout
- Functions of interest {Part 1 : building blocks}
- Functions of interest {Part 2 : core functions}
- How all of this works
- Final Words
- Acknowledgement
- VA : Virtual Address
- PA : Physical Address
- PD : Page Directory
- PDE : Page Directory Entry
- PT : Page Table
- PTE : Page Table Entry
- PPN : Physical Page Number
All process works on virtual address. Machine’s RAM is physical address. Page Table maps virtual address to physical address.
Xv6 does this in 2 steps.
Some things to note for xv6 :
- 32 bit virtual address (so, virtual address space 4GB)
- page size of 4KB
- The CPU register
CR3
contains a pointer to the outer page directory of the current running process.
xv6 does not do demand paging, so there is no concept of virtual memory. That is, all valid pages of a process are always allocated physical pages.
Every page table in xv6 has mappings for user pages as well as kernel pages. The part of the page table dealing with kernel pages is the same across all processes.
- virtual memory >
KERNBASE
(0x8000 0000) is for kernel - always mapped as kernel-mode only
- check
PTE_U
fag - protection fault for user-mode programs to access
- check
- physical memory address
N
is mapped toKERNBASE+N
or0:PHYSTOP
toKERNBASE:KERNBASE+PHYSTOP
- Note that although the size of physical memory is 4 GB, only 2 GB can be used by Xv6
- kernel code loaded into contiguous physical addresses
Each process has separate page table. For any process, user memory VA range is 0:KERNBASE
where KERNBASE
is 0x80000000 i.e. 2 GB of memory is available to process.
- The kernel code doesn’t exactly begin at
KERNBASE
, but a bit later atKERNLINK
, to leave some space at the start of memory for I/O devices. Next comes the kernel code+read-only data from the kernel binary. Apart from the memory set aside for kernel code and I/O devices, the remaining memory is in the form of free pages managed by the kernel. When any user process requests for memory to build up its user part of the address space, the kernel allocates memory to the user process from this free space list. That is, most physical memory can be mapped twice, once into the kernel part of the address space of a process, and once into the user part.
-
V2P(a)
(virtual to physical)- convert kernel address
a
to physical address- subtract
KERNBASE
(0x8000 0000)
- subtract
#define V2P(a) (((uint) (a)) - KERNBASE)
- convert kernel address
-
P2V(a)
(physical to virtual)- convert physical address
a
to kernel address- add
KERNBASE
(0x8000 0000)
- add
#define P2V(a) ((void *)(((char *) (a)) + KERNBASE))
- convert physical address
What it does is round the address up as a multiple of page number i.e. do a CEIL type operation.
#define PGROUNDUP(sz) (((sz)+PGSIZE-1) & ~(PGSIZE-1))
PGROUNDUP(620)
→ ((620 + (1024 -1)) & ~(1023)) → 1024
another related function is PGROUNDDOWN
, works similarly. just makes the address round down as a multiple of page number
#define PGROUNDDOWN(a) (((a)) & ~(PGSIZE-1))
PGROUNDDOWN(2400)
→ (2400 & ~(1023)) → 2048
u
here stands for user- basically OS loads the user process information to run it
- loads process's Page Table to
%cr3
This function is responsible to return an address of one new, currently unused page (4096 byte) in RAM. kalloc
removes first free page from kmem
and returns its (virtual!) address, where kmem
points to the head of a list of free (that is, available) pages of memory.
Failure : If it returns 0, that means there are no available unused pages currently.
Main job of this function is to get the content of 2nd level PTE. This is such an important function that we will go line by line.
static pte_t *
walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
pde_t *pde;
pte_t *pgtab;
pde = &pgdir[PDX(va)]; // PDX(va) returns the first 10 bit. pgdir is level 1 page table. So, pgdir[PDX(va)] is level 1 PTE where there is PPN and offset
if(*pde & PTE_P){ // not NULL and Present
pgtab = (pte_t*)P2V(PTE_ADDR(*pde)); // PTE_ADDR return the first 20 bit or PPN. PPN is converted to VPN for finding 2nd level PTE. pgtab is level 2 page table
} else {
if(!alloc || (pgtab = (pte_t*)kalloc()) == 0)
return 0;
// Make sure all those PTE_P bits are zero.
memset(pgtab, 0, PGSIZE);
// The permissions here are overly generous, but they can
// be further restricted by the permissions in the page table
// entries, if necessary.
*pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
}
return &pgtab[PTX(va)]; // PDX(va) returns the second 10 bit. So, pgtab[PTX(va)] is level 2 PTE where there is PPN and offset
}
Recall that
// +--------10------+-------10-------+---------12----------+
// | Page Directory | Page Table | Offset within Page |
// | Index | Index | |
// +----------------+----------------+---------------------+
// \--- PDX(va) --/ \--- PTX(va) --/
-
PDX(va)
returns the first 10 bit.pgdir
is level 1 page table. So,pgdir[PDX(va)]
is level 1 PTE where there is PPN and offsetpde = &pgdir[PDX(va)];
-
PTE_ADDR
return the first 20 bit or PPN. PPN is converted to VPN usingP2V
for finding 2nd level PTE.pgtab
is level 2 page tablepgtab = (pte_t*)P2V(PTE_ADDR(*pde));
-
PDX(va)
returns the second 10 bit. So,pgtab[PTX(va)]
is level 2 PTE where there is PPN and offset&pgtab[PTX(va)];
-
return NULL if not trying to make new page table otherwise use
kalloc
to allocate itif(!alloc || (pgtab = (pte_t*)kalloc()) == 0) return 0;
-
clear the new second-level page table. Make sure all those
PTE_P
bits are zero.memset(pgtab, 0, PGSIZE);
-
now that we have made the 2nd level page table, we have to put that address in 1st level page table. plus add some flags
*pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
static int
mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
char *a, *last;
pte_t *pte;
a = (char*)PGROUNDDOWN((uint)va);
last = (char*)PGROUNDDOWN(((uint)va) + size - 1);
for(;;){
if((pte = walkpgdir(pgdir, a, 1)) == 0)
return -1;
if(*pte & PTE_P)
panic("remap");
*pte = pa | perm | PTE_P;
if(a == last)
break;
a += PGSIZE;
pa += PGSIZE;
}
return 0;
}
As the name suggests, mappages
adds a mapping from the virtual address (starting from given VA
to VA+SIZE
) to the physical address (starting from given PA
to PA+SIZE
)
-
for each virtual page in range, get its page table entry. if 2nd level not present allocate it (recall
alloc=1
case ofwalkpgdir
). failure happens if runs out of memoryif((pte = walkpgdir(pgdir, a, 1)) == 0) return -1;
-
make sure it’s not already set
if(*pte & PTE_P) panic("remap");
-
set page table entry to valid value pointing to physical page at
PA
with specified permission (perm
) andP
for present*pte = pa | perm | PTE_P; // pa is first 20 bit, perm is flags, PTE_P is done to mark as present
-
advance to next physical page (
PA
) and next virtual page (VA
)a += PGSIZE; pa += PGSIZE;
Allocates user virtual memory (page tables and physical memory) to grow process. This function is responsible to increase the user's virtual memory in a specific page directory from oldsz
to newsz
. This function used for initial allocation plus expanding heap on request
int
allocuvm(pde_t *pgdir, uint oldsz, uint newsz)
{
char *mem;
uint a;
if(newsz >= KERNBASE)
return 0;
if(newsz < oldsz)
return oldsz;
a = PGROUNDUP(oldsz);
for(; a < newsz; a += PGSIZE){
mem = kalloc();
if(mem == 0){
cprintf("allocuvm out of memory\n");
deallocuvm(pgdir, newsz, oldsz);
return 0;
}
memset(mem, 0, PGSIZE);
if(mappages(pgdir, (char*)a, PGSIZE, V2P(mem), PTE_W|PTE_U) < 0){
cprintf("allocuvm out of memory (2)\n");
deallocuvm(pgdir, newsz, oldsz);
kfree(mem);
return 0;
}
}
return newsz;
}
Returns newsz
if succeeded, 0 otherwise.
-
walks the virtual address space between the old size and new size in page-sized chunks. For each new logical page to be created, it allocates a new free page from the kernel, and adds a mapping from the virtual address to the physical address by calling
mappages
-
allocate a new, zero page
mem = kalloc();
-
add page to second-level page table, also do the allocation. recall that
mappages
callwalkpgdir
withalloc = 1
mappages(pgdir, (char*)a, PGSIZE, v2p(mem), PTE_W|PTE_U);
-
There are indeed 2 cases where this function can fail:
Case 1:
kalloc
(kernel allocation) function failed. This function is responsible to return an address of a new, currently unused page in RAM. If it returns 0, that means there are no available unused pages currently.Case 2:
mappages
function failed. This function is responsible of making the new allocated page to be accessible by the process who uses the given page directory by mapping that page with the next virtual address available in the page directory. If this function fails that means it failed in doing so, probably due to the page directory being already full.In both cases,
allocuvm
didn't managed to increase the user's memory to the size requested, Therefore,deallocuvm
is undoing all allocations until the point of failure, so the virtual memory will remain unchanged, and returns an error it self.
deallocuvm looks at all the logical pages from the (bigger) old size of the process to the (smaller) new size, locates the corresponding physical pages, frees them up, and zeroes out the corresponding PTE as well.
Once a child process is allocated, its memory image is setup as a complete copy of the parent’s memory image by a call to copyuvm
setupkvm
: sets up kernel virtual memory- walks through the entire address space of the parent in page-sized chunks, gets
the physical address of every page of the parent using a call to
walkpgdir
- allocates a new physical page for the child using
kalloc
- copies the contents of the parent’s page into the child’s page, adds an entry to the child’s page table using
mappages
- returns the child’s page table
- has similar failure cases as discussed in
allocuvm
Now that we are done with the function basics, let’s focus on how things actually work. Most of the findings and intuition gained are from a shit ton of debug statements and by reading a few books and notes. So, take it with a grain of salt.
The first user process is the init
function. It is initiated by userinit
inside main
of main.c
.
Let’s focus on this first.
//PAGEBREAK: 32
// Set up first user process.
void
userinit(void)
{
struct proc *p;
extern char _binary_initcode_start[], _binary_initcode_size[];
p = allocproc();
initproc = p;
if((p->pgdir = setupkvm()) == 0)
panic("userinit: out of memory?");
inituvm(p->pgdir, _binary_initcode_start, (int)_binary_initcode_size);
p->sz = PGSIZE;
memset(p->tf, 0, sizeof(*p->tf));
p->tf->cs = (SEG_UCODE << 3) | DPL_USER;
p->tf->ds = (SEG_UDATA << 3) | DPL_USER;
p->tf->es = p->tf->ds;
p->tf->ss = p->tf->ds;
p->tf->eflags = FL_IF;
p->tf->esp = PGSIZE;
p->tf->eip = 0; // beginning of initcode.S
safestrcpy(p->name, "initcode", sizeof(p->name));
p->cwd = namei("/");
// this assignment to p->state lets other cores
// run this process. the acquire forces the above
// writes to be visible, and the lock is also needed
// because the assignment might not be atomic.
acquire(&ptable.lock);
p->state = RUNNABLE;
release(&ptable.lock);
}
Now let’s see what things are done step by step
-
allocproc
: for every processpid
needs to be assigned,proc
structure needs to be initialized. all of this is done here. for theinit
processpid
is returned 1 byallocproc
kalloc
:allocproc
callskalloc
which sets up data on the kernel stack
-
setupkvm
: creates kernel page table of thisinit
process -
inituvm
: allocates one page of physical memory, copies the init executable into that memory, sets up a page table entry (PTE) for the first page of the user virtual address space.Let’s also keep track of how many pages are being allocated where, this will be key to our understanding when we want to find what which pages are being deallocated in
deallocuvm
in later part.PAGE ALLOCATED HERE = 1
now the init executable that has been loaded runs, which to my understanding points to this part inside code
SYS_exec
in turn calls exec
. The exec
function looks like this -
int
exec(char *path, char **argv)
{
char *s, *last;
int i, off;
uint argc, sz, sp, ustack[3+MAXARG+1];
struct elfhdr elf;
struct inode *ip;
struct proghdr ph;
pde_t *pgdir, *oldpgdir;
struct proc *curproc = myproc();
begin_op();
if((ip = namei(path)) == 0){
end_op();
cprintf("exec: fail\n");
return -1;
}
ilock(ip);
pgdir = 0;
// Check ELF header
if(readi(ip, (char*)&elf, 0, sizeof(elf)) != sizeof(elf))
goto bad;
if(elf.magic != ELF_MAGIC)
goto bad;
if((pgdir = setupkvm()) == 0)
goto bad;
// Load program into memory.
sz = 0;
for(i=0, off=elf.phoff; i<elf.phnum; i++, off+=sizeof(ph)){
if(readi(ip, (char*)&ph, off, sizeof(ph)) != sizeof(ph))
goto bad;
if(ph.type != ELF_PROG_LOAD)
continue;
if(ph.memsz < ph.filesz)
goto bad;
if(ph.vaddr + ph.memsz < ph.vaddr)
goto bad;
if((sz = allocuvm(pgdir, sz, ph.vaddr + ph.memsz)) == 0)
goto bad;
if(ph.vaddr % PGSIZE != 0)
goto bad;
if(loaduvm(pgdir, (char*)ph.vaddr, ip, ph.off, ph.filesz) < 0)
goto bad;
}
iunlockput(ip);
end_op();
ip = 0;
// Allocate two pages at the next page boundary.
// Make the first inaccessible. Use the second as the user stack.
sz = PGROUNDUP(sz);
if((sz = allocuvm(pgdir, sz, sz + 2*PGSIZE)) == 0)
goto bad;
clearpteu(pgdir, (char*)(sz - 2*PGSIZE));
sp = sz;
// Push argument strings, prepare rest of stack in ustack.
for(argc = 0; argv[argc]; argc++) {
if(argc >= MAXARG)
goto bad;
sp = (sp - (strlen(argv[argc]) + 1)) & ~3;
if(copyout(pgdir, sp, argv[argc], strlen(argv[argc]) + 1) < 0)
goto bad;
ustack[3+argc] = sp;
}
ustack[3+argc] = 0;
ustack[0] = 0xffffffff; // fake return PC
ustack[1] = argc;
ustack[2] = sp - (argc+1)*4; // argv pointer
sp -= (3+argc+1) * 4;
if(copyout(pgdir, sp, ustack, (3+argc+1)*4) < 0)
goto bad;
// Save program name for debugging.
for(last=s=path; *s; s++)
if(*s == '/')
last = s+1;
safestrcpy(curproc->name, last, sizeof(curproc->name));
// Commit to the user image.
oldpgdir = curproc->pgdir;
curproc->pgdir = pgdir;
curproc->sz = sz;
curproc->tf->eip = elf.entry; // main
curproc->tf->esp = sp;
switchuvm(curproc);
freevm(oldpgdir);
return 0;
bad:
if(pgdir)
freevm(pgdir);
if(ip){
iunlockput(ip);
end_op();
}
return -1;
}
-
setupkvm
: The thing that I understood so far is that, the 1 page allocated insideinituvm
is responsible for executing this exec function. Whatever this exec function now wants to execute, will be in a separate new kernel page table, hence asetupkvm
is called once more. (what I exactly mean here will be more clear when I talk about other user processes) -
allocuvm
: allocates pages for the executable it wants to run, here 1 page is sufficient forinit
PAGE ALLOCATED HERE = 1
-
loaduvm
: loads the executable from disk into the newly allocated pages inallocuvm
-
allocuvm
: the new memory image so far only has executable code and data, now we also need stack space. Rather than allocating 1 page for the stack, it allocates 2. The 2nd one is the actual stack, 1st one serves as a guard page. This is the only page in the user’s memory which is marked as present but not user-accessible (PTE_U
is cleared)Reason for having a guard page : To guard a stack growing off the stack page, xv6 places a guard page right below the stack. As the guard page is not mapped, if the stack runs off the stack page, the hardware will generate an exception because it cannot translate the faulting address.
Strings containing the command-line arguments, as well as an array of pointers to them, are at the very top of the stack
PAGE ALLOCATED HERE = 2
-
trap frame update : It is important to note that exec does not replace/reallocate the kernel stack. The exec system call only replaces the user part of the memory image, and does nothing to the kernel part. And if you think about it, there is no way the process can replace the kernel stack, because the process is executing in kernel mode on the kernel stack itself, and has important information like the trap frame stored on it.
Recall that a process that makes the exec system call has moved into kernel mode to service the software interrupt of the system call. Normally, when the process moves back to user mode again (by popping the trap frame on the kernel stack), it is expected to return to the instruction after the system call. However, in the case of exec, the process doesn’t have to return to the instruction after exec when it gets the CPU next, but instead must start executing the new executable it just loaded from disk. So, the code in exec changes the return address in the trap frame to point to the entry address of the binary. It also sets the stack pointer in the trap frame to point to the top of the newly created user stack.
curproc->tf->eip = elf.entry; // main curproc->tf->esp = sp;
-
switchuvm
: Finally, once all these operations succeed,exec
switches page tables to start using the new memory image. That is, it writes the address of the new page table into the CR3 register, so that it can start accessing the new memory image when it goes back to userspace. -
freevm
: the process that called exec i.e.init
, was still using the old memory image. we free the old memory image that it was pointing to after updating-
deallocuvm
: The pages that are deleted here are the pages that were created ininituvm
. So,PAGE DEALLOCATED HERE = 1
-
The new memory image created by exec looks like this
At this point, the process that called exec
can start executing on the new memory image
when it returns from trap
Note : exec waits until the end to do this switch of page tables, because if anything went wrong in the system call, exec returns from trap into the old memory image and prints out an error
Apart from init
, all other processes are created by the fork system call. When the init
process runs, it executes the init executable, whose main function forks a shell (sh
) and starts listening to the user
-
allocproc
: returnspid = 2
forsh
-
copyuvm
: once a child process is allocated, its memory image is setup as a complete copy of the parent’s memory. This is done usingcopyuvm
.setupkvm
: inside copyuvm, a new kernel page table is created, and parent memory is allocated here.
Recall that 3 pages were created in
exec
ofinit
. These 3 pages are copied herePAGES ALLOCATED HERE = 3
The description of exec for shell is the same as discussed for init
. So lets just find out how many pages are allocated where
-
setupkvm
-
allocuvm
The executable that
sh
wants to run needs more memory thaninit
needed. So 2 pages are allocated here.PAGES ALLOCATED HERE = 2
-
loaduvm
-
allocuvm
-
freevm
-
deallocuvm
The pages that were allocated (copied) usingcopyuvm
are deallocated as they are pointing to old page directoryPAGE DEALLOCATED HERE = 3
-
Now let’s look at a general user process. Every other user process will fork sh
. Let’s look at echo
-
allocproc
-
copyuvm
: here 4 page will be copied (2 + 2 pages that were allocated insidesh
exec)PAGE ALLOCATED HERE = 4
After fork, growproc
is called to grow userspace memory image. growproc
can be called by the sbrk
system call. It basically increases the heap size
PAGE ALLOCATED HERE = 8
-
allocuvm
PAGE ALLOCATED HERE = 1
-
loaduvm
-
allocuvm
PAGE ALLOCATED HERE = 2
-
freevm
-
deallocuvm
: 4+8 pages deleted of old memory imagePAGE DEALLOCATED HERE = 12
-
freevm
-
deallocuvm
: 1+2 pages deleted of new memory imagePAGE DEALLOCATED HERE = 3
-
Finally, an overview of all the things said in a picture - because ofcourse a picture speaks more than a thousand words!
Thanks for reading this far ! I am not sure how much helpful this has been for you, but I hope you can start navigating the memory management part of the xv6 codebase after reading this. Go through this writeup multiple times if you don't undersand anything and if you have some more time to spare, please go through this doc. A lot of the part of this writeup has been taken or inspired from this.
May the force be with you! And if you liked the write-up, you can support me by buying me a cup of coffee!