\chapter{File system}
\label{CH:FS}
% is it process or kernel thread?
%
% the logging text (and some of the buffer text) assumes the reader
% knows a fair amount about how inodes and directories work,
% but they are introduced later.
%
% have to decide on processor vs CPU.
%
% be sure to say buffer, not block
%
% Mount
The purpose of a file system is to organize and store data. File systems
typically support sharing of data among users and applications, as well as
\indextext{persistence}
so that data is still available after a reboot.
The xv6 file system provides Unix-like files, directories, and pathnames
(see Chapter~\ref{CH:UNIX}), and stores its data on a virtio disk for
persistence. The file system addresses
several challenges:
\begin{itemize}
\item The file system needs on-disk data structures to represent the tree
of named directories and files, to record the identities of the
blocks that hold each file's content, and to record which areas
of the disk are free.
\item The file system must support
\indextext{crash recovery}.
That is, if a crash (e.g., power failure) occurs, the file system must
still work correctly after a restart. The risk is that a crash might
interrupt a sequence of updates and leave inconsistent on-disk data
structures (e.g., a block that is both used in a file and marked free).
\item Different processes may operate on the file system at the same time,
so the file-system code must coordinate to maintain invariants.
\item Accessing a disk is orders of magnitude slower than accessing
memory, so the file system must maintain an in-memory cache of
popular blocks.
\end{itemize}
The rest of this chapter explains how xv6 addresses these challenges.
%%
%% -------------------------------------------
%%
\section{Overview}
The xv6 file system implementation is
organized in seven layers, shown in
Figure~\ref{fig:fslayer}.
The disk layer reads and writes blocks on a virtio hard drive.
The buffer cache layer caches disk blocks and synchronizes access to them,
making sure that only one kernel thread at a time can modify the
data stored in any particular block. The logging layer allows higher
layers to wrap updates to several blocks in a
\indextext{transaction},
and ensures that the blocks are updated atomically in the
face of crashes (i.e., all of them are updated or none).
The inode layer provides individual files, each represented as an
\indextext{inode}
with a unique i-number
and some blocks holding the file's data. The directory
layer implements each directory as a special kind of
inode whose content is a sequence of directory entries, each of which contains a
file's name and i-number.
The pathname layer provides
hierarchical path names like
\lstinline{/usr/rtm/xv6/fs.c},
and resolves them with recursive lookup.
The file descriptor layer abstracts many Unix resources (e.g., pipes, devices,
files, etc.) using the file system interface, simplifying the lives of
application programmers.
\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/fslayer.pdf}
\caption{Layers of the xv6 file system.}
\label{fig:fslayer}
\end{figure}
Disk hardware traditionally presents the data on the
disk as a numbered sequence of 512-byte
\textit{blocks}
\index{block}
(also called
\textit{sectors}):
\index{sector}
sector 0 is the first 512 bytes, sector 1 is the next, and so on. The block size
that an operating system uses for its file system may be different from the
sector size that a disk uses, but typically the block size is a multiple of the
sector size. Xv6 holds copies of blocks that it has read into memory
in objects of type
\lstinline{struct buf}
\lineref{kernel/buf.h:/^struct.buf/}.
The
data stored in this structure is sometimes out of sync with the disk: it might have
not yet been read in from disk (the disk is working on it but hasn't returned
the sector's content yet), or it might have been updated by software
but not yet written to the disk.
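As a concrete sketch, the buffer structure looks roughly like the declaration in \lstinline{kernel/buf.h} (the \lstinline{sleeplock} stand-in here is simplified for illustration):
\begin{lstlisting}[]
struct sleeplock { int locked; };   // simplified stand-in
typedef unsigned int  uint;
typedef unsigned char uchar;
#define BSIZE 1024                  // file-system block size

struct buf {
  int valid;          // has data been read from disk?
  int disk;           // does the disk "own" the buffer?
  uint dev;           // device number
  uint blockno;       // block number on the device
  struct sleeplock lock;
  uint refcnt;        // how many kernel references exist
  struct buf *prev;   // LRU cache list
  struct buf *next;
  uchar data[BSIZE];  // in-memory copy of the block
};
\end{lstlisting}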
The file system must have a plan for where it stores inodes and
content blocks on the disk.
To do so, xv6 divides the disk into several
sections, as
Figure~\ref{fig:fslayout} shows.
The file system does not use
block 0 (it holds the boot sector). Block 1 is called the
\indextext{superblock};
it contains metadata about the file system (the file system size in blocks, the
number of data blocks, the number of inodes, and the number of blocks in the
log). Blocks starting at 2 hold the log. After the log are the inodes, with multiple inodes per block. After
those come bitmap blocks tracking which data blocks are in use.
The remaining blocks are data blocks; each is either marked
free in the bitmap block, or holds content for a file or directory.
The superblock is filled in by a separate program, called
\indexcode{mkfs},
which builds an initial file system.
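The superblock's contents, and the layout arithmetic that \lstinline{mkfs} performs, can be sketched as follows (field names follow \lstinline{kernel/fs.h}; the example sizes in the usage below are illustrative, not necessarily xv6's defaults):
\begin{lstlisting}[]
typedef unsigned int uint;

struct superblock {
  uint magic;       // must be FSMAGIC
  uint size;        // size of file-system image (blocks)
  uint nblocks;     // number of data blocks
  uint ninodes;     // number of inodes
  uint nlog;        // number of log blocks
  uint logstart;    // block number of first log block
  uint inodestart;  // block number of first inode block
  uint bmapstart;   // block number of first free-bitmap block
};

// mkfs-style layout: boot block, superblock, log, inodes, bitmap, data
void layout(struct superblock *sb, uint nlog, uint ninodeblocks)
{
  sb->logstart = 2;                      // blocks 0 and 1 are reserved
  sb->inodestart = sb->logstart + nlog;
  sb->bmapstart = sb->inodestart + ninodeblocks;
}
\end{lstlisting}
For example, with 30 log blocks and 13 inode blocks, the inodes start at block 32 and the bitmap at block 45.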
The rest of this chapter discusses each layer, starting with the
buffer cache.
Look out for situations where well-chosen abstractions at lower layers
ease the design of higher ones.
%%
%% -------------------------------------------
%%
\section{Buffer cache layer}
\label{s:bcache}
The buffer cache has two jobs: (1) synchronize access to disk blocks to ensure
that only one copy of a block is in memory and that only one kernel thread at a time
uses that copy; (2) cache popular blocks so that they don't need to be re-read from
the slow disk. The code is in
\lstinline{bio.c}.
The main interface exported by the buffer cache consists of
\indexcode{bread}
and
\indexcode{bwrite};
the former obtains a
\indextext{buf}
containing a copy of a block which can be read or modified in memory, and the
latter writes a modified buffer to the appropriate block on the disk.
A kernel thread must release a buffer by calling
\indexcode{brelse}
when it is done with it.
The buffer cache uses a per-buffer sleep-lock to ensure
that only one thread at a time uses each buffer
(and thus each disk block);
\lstinline{bread}
returns a locked buffer, and
\lstinline{brelse}
releases the lock.
The buffer cache has a fixed number of buffers to hold disk blocks,
which means that if the file system asks for a block that is not
already in the cache, the buffer cache must recycle a buffer currently
holding some other block. The buffer cache recycles the
least recently used buffer for the new block. The assumption is that
the least recently used buffer is the one least likely to be used
again soon.
\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/fslayout.pdf}
\caption{Structure of the xv6 file system. }
\label{fig:fslayout}
\end{figure}
%%
%% -------------------------------------------
%%
\section{Code: Buffer cache}
The buffer cache is a doubly-linked list of buffers.
The function
\indexcode{binit},
called by
\indexcode{main}
\lineref{kernel/main.c:/binit/},
initializes the list with the
\indexcode{NBUF}
buffers in the static array
\lstinline{buf}
\linerefs{kernel/bio.c:/Create.linked.list/,/^..}/}.
All other access to the buffer cache refer to the linked list via
\indexcode{bcache.head},
not the
\lstinline{buf}
array.
A buffer has two state fields associated with it.
The field
\indexcode{valid}
indicates that the buffer contains a copy of the block.
The field \indexcode{disk}
indicates that the buffer content has been handed to
the disk, which may change the buffer (e.g., write
data from the disk into \lstinline{data}).
\lstinline{bread}
\lineref{kernel/bio.c:/^bread/}
calls
\indexcode{bget}
to get a buffer for the given sector
\lineref{kernel/bio.c:/b.=.bget/}.
If the buffer needs to be read from disk,
\lstinline{bread}
calls
\indexcode{virtio_disk_rw}
to do that before returning the buffer.
\lstinline{bget}
\lineref{kernel/bio.c:/^bget/}
scans the buffer list for a buffer with the given device and sector numbers
\linerefs{kernel/bio.c:/Is.the.block.already/,/^..}/}.
If there is such a buffer,
\indexcode{bget}
acquires the sleep-lock for the buffer.
\lstinline{bget}
then returns the locked buffer.
If there is no cached buffer for the given sector,
\indexcode{bget}
must make one, possibly reusing a buffer that held
a different sector.
It scans the buffer list a second time, looking for a buffer
that is not in use (\lstinline{b->refcnt == 0});
any such buffer can be used.
\lstinline{bget}
edits the buffer metadata to record the new device and sector number
and acquires its sleep-lock.
Note that the assignment
\lstinline{b->valid = 0}
ensures that
\lstinline{bread}
will read the block data from disk
rather than incorrectly using the buffer's previous contents.
It is important that there is at most one cached buffer per
disk sector, to ensure that readers see writes, and because the
file system uses locks on buffers for synchronization.
\lstinline{bget}
ensures this invariant by holding the
\lstinline{bcache.lock}
continuously from the first loop's check of whether the
block is cached through the second loop's declaration that
the block is now cached (by setting
\lstinline{dev},
\lstinline{blockno},
and
\lstinline{refcnt}).
This causes the check for a block's presence and (if not
present) the designation of a buffer to hold the block to
be atomic.
It is safe for
\lstinline{bget}
to acquire the buffer's sleep-lock outside of the
\lstinline{bcache.lock}
critical section,
since the non-zero
\lstinline{b->refcnt}
prevents the buffer from being re-used for a different disk block.
The sleep-lock protects reads
and writes of the block's buffered content, while the
\lstinline{bcache.lock}
protects information about which blocks are cached.
If all the buffers are busy, then too many processes are
simultaneously executing file system calls;
\lstinline{bget}
panics.
A more graceful response might be to sleep until a buffer became free,
though there would then be a possibility of deadlock.
Once
\indexcode{bread}
has read the disk (if needed) and returned the
buffer to its caller, the caller has
exclusive use of the buffer and can read or write the data bytes.
If the caller does modify the buffer, it must call
\indexcode{bwrite}
to write the changed data to disk before releasing the buffer.
\lstinline{bwrite}
\lineref{kernel/bio.c:/^bwrite/}
calls
\indexcode{virtio_disk_rw}
to talk to the disk hardware.
When the caller is done with a buffer,
it must call
\indexcode{brelse}
to release it.
(The name
\lstinline{brelse},
a shortening of
b-release,
is cryptic but worth learning:
it originated in Unix and is used in BSD, Linux, and Solaris too.)
\lstinline{brelse}
\lineref{kernel/bio.c:/^brelse/}
releases the sleep-lock and
moves the buffer
to the front of the linked list
\linerefs{kernel/bio.c:/b->next->prev.=.b->prev/,/bcache.head.next.=.b/}.
Moving the buffer causes the
list to be ordered by how recently the buffers were used (meaning released):
the first buffer in the list is the most recently used,
and the last is the least recently used.
The two loops in
\lstinline{bget}
take advantage of this:
the scan for an existing buffer must process the entire list
in the worst case, but checking the most recently used buffers
first (starting at
\lstinline{bcache.head}
and following
\lstinline{next}
pointers) will reduce scan time when there is good locality of reference.
The scan to pick a buffer to reuse picks the least recently used
buffer by scanning backward
(following
\lstinline{prev}
pointers).
%%
%% -------------------------------------------
%%
\section{Logging layer}
One of the most interesting problems in file system design is crash
recovery. The problem arises because many file-system operations
involve multiple writes to the disk, and a crash after a subset of the
writes may leave the on-disk file system in an inconsistent state. For
example, suppose a crash occurs during file truncation (setting
the length of a file to zero and freeing its content blocks).
Depending on the order of the disk writes, the crash
may either leave an inode with a reference
to a content block that is marked free,
or it may leave an allocated but unreferenced content block.
The latter is relatively benign, but an inode that refers to a freed
block is likely to cause serious problems after a reboot. After reboot, the
kernel might allocate that block to another file, and now we have two different
files pointing unintentionally to the same block. If xv6 supported
multiple users, this situation could be a security problem, since the
old file's owner would be able to read and write blocks in the
new file, owned by a different user.
Xv6 solves the problem of crashes during file-system operations with a
simple form of logging. An xv6 system call does not directly write
the on-disk file system data structures. Instead, it places a
description of all the disk writes it wishes to make in a
\indextext{log}
on the disk. Once the system call has logged all of its writes, it writes a
special
\indextext{commit}
record to the disk indicating that the log contains
a complete operation. At that point the system call copies the writes
to the on-disk file system data structures. After those writes have
completed, the system call erases the log on disk.
If the system should crash and reboot, the file-system code recovers
from the crash as follows, before running any processes. If the log is
marked as containing a complete operation, then the recovery code
copies the writes to where they belong in the on-disk file system. If
the log is not marked as containing a complete operation, the recovery
code ignores the log. The recovery code finishes by erasing
the log.
Why does xv6's log solve the problem of crashes during file system
operations? If the crash occurs before the operation commits, then the
log on disk will not be marked as complete, the recovery code will
ignore it, and the state of the disk will be as if the operation had
not even started. If the crash occurs after the operation commits,
then recovery will replay all of the operation's writes, perhaps
repeating them if the operation had started to write them to the
on-disk data structure. In either case, the log makes operations
atomic with respect to crashes: after recovery, either all of the
operation's writes appear on the disk, or none of them appear.
%%
%%
%%
\section{Log design}
The log resides at a known fixed location, specified in the superblock.
It consists of a header block followed by a sequence
of updated block copies (``logged blocks'').
The header block contains an array of sector
numbers, one for each of the logged blocks, and
the count of log blocks.
The count in the header block on disk is either
zero, indicating that there is no transaction in the log,
or non-zero, indicating that the log contains a complete committed
transaction with the indicated number of logged blocks.
Xv6 writes the header
block when a transaction commits, but not before, and sets the
count to zero after copying the logged blocks to the file system.
Thus a crash midway through a transaction will result in a
count of zero in the log's header block; a crash after a commit
will result in a non-zero count.
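The on-disk header, and the single test that recovery performs on it, are small; a sketch mirroring \lstinline{struct logheader} in \lstinline{kernel/log.c} (the \lstinline{LOGSIZE} value is assumed here):
\begin{lstlisting}[]
#define LOGSIZE 30        // max logged blocks (assumed value)

struct logheader {
  int n;                  // 0: log empty; >0: committed transaction
  int block[LOGSIZE];     // home disk address of each logged block
};

// recovery's only decision: replay or ignore
int committed(struct logheader *lh)
{
  return lh->n > 0;
}
\end{lstlisting}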
Each system call's code indicates the start and end of the sequence of
writes that must be atomic with respect to crashes.
To allow concurrent execution of file-system operations
by different processes,
the logging system can accumulate the writes
of multiple system calls into one transaction.
Thus a single commit may involve the writes of multiple
complete system calls.
To avoid splitting a system call across transactions, the logging system
only commits when no file-system system calls are underway.
The idea of committing several transactions together is known as
\indextext{group commit}.
Group commit reduces the number of disk operations
because it amortizes the fixed cost of a commit over multiple
operations.
Group commit also hands the disk system more concurrent writes
at the same time, perhaps allowing the disk to write
them all during a single disk rotation.
Xv6's virtio driver doesn't support this kind of
\indextext{batching},
but xv6's file system design allows for it.
Xv6 dedicates a fixed amount of space on the disk to hold the log.
The total number of blocks written by the system calls in a
transaction must fit in that space.
This has two consequences.
No single system call
can be allowed to write more distinct blocks than there is space
in the log. This is not a problem for most system calls, but two
of them can potentially write many blocks:
\indexcode{write}
and
\indexcode{unlink}.
A large file write may write many data blocks and many bitmap blocks
as well as an inode block; unlinking a large file might write many
bitmap blocks and an inode.
Xv6's write system call breaks up large writes into multiple smaller
writes that fit in the log,
and
\lstinline{unlink}
doesn't cause problems because in practice the xv6 file system uses
only one bitmap block.
The other consequence of limited log space
is that the logging system cannot allow a system call to start
unless it is certain that the system call's writes will
fit in the space remaining in the log.
%%
%%
%%
\section{Code: logging}
A typical use of the log in a system call looks like this:
\begin{lstlisting}[]
begin_op();
...
bp = bread(...);
bp->data[...] = ...;
log_write(bp);
...
end_op();
\end{lstlisting}
\indexcode{begin_op}
\lineref{kernel/log.c:/^begin.op/}
waits until
the logging system is not currently committing, and until
there is enough unreserved log space to hold
the writes from this call.
\lstinline{log.outstanding}
counts the number of system calls that have reserved log
space; the total reserved space is
\lstinline{log.outstanding}
times
\lstinline{MAXOPBLOCKS}.
Incrementing
\lstinline{log.outstanding}
both reserves space and prevents a commit
from occurring during this system call.
The code conservatively assumes that each system call might write up to
\lstinline{MAXOPBLOCKS}
distinct blocks.
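The admission test can be expressed on its own; this sketch pulls the reservation check out of \lstinline{begin_op} (constant values as assumed here, in the spirit of \lstinline{kernel/param.h}):
\begin{lstlisting}[]
#define MAXOPBLOCKS 10              // max blocks one op may write
#define LOGSIZE (MAXOPBLOCKS*3)     // total log blocks (assumed)

// May a new file-system call start now?  'logged' is the number of
// blocks already in the log, 'outstanding' the calls in progress.
int can_begin_op(int committing, int outstanding, int logged)
{
  if(committing)
    return 0;                       // wait for commit to finish
  if(logged + (outstanding + 1) * MAXOPBLOCKS > LOGSIZE)
    return 0;                       // this op might exhaust the log
  return 1;
}
\end{lstlisting}
If either test fails, the real \lstinline{begin_op} sleeps and retries rather than returning zero.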
\indexcode{log_write}
\lineref{kernel/log.c:/^log.write/}
acts as a proxy for
\indexcode{bwrite}.
It records the block's sector number in memory,
reserving it a slot in the log on disk,
and pins the buffer in the block cache
to prevent the block cache from evicting it.
The block must stay in the cache until committed:
until then, the cached copy is the only record
of the modification; it cannot be written to
its place on disk until after commit;
and other reads in the same transaction must
see the modifications.
\lstinline{log_write}
notices when a block is written multiple times during a single
transaction, and allocates that block the same slot in the log.
This optimization is often called
\indextext{absorption}.
It is common that, for example, the disk block containing inodes
of several files is written several times within a transaction. By absorbing
several disk writes into one, the file system can save log space and
can achieve better performance because only one copy of the disk block must be
written to disk.
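Absorption itself is only a few lines; a self-contained sketch of the slot-assignment logic in \lstinline{log_write} (the real function also pins the buffer and holds \lstinline{log.lock}):
\begin{lstlisting}[]
// Record blockno in the in-memory header, reusing the slot if the
// block was already logged this transaction (absorption).
// Returns the slot index; *n is the current log length.
int log_record(int *block, int *n, int blockno)
{
  for(int i = 0; i < *n; i++)
    if(block[i] == blockno)   // already logged: absorb
      return i;
  block[*n] = blockno;        // claim a new slot
  return (*n)++;
}
\end{lstlisting}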
\indexcode{end_op}
\lineref{kernel/log.c:/^end.op/}
first decrements the count of outstanding system calls.
If the count is now zero, it commits the current
transaction by calling
\lstinline{commit()}.
There are four stages in this process.
\lstinline{write_log()}
\lineref{kernel/log.c:/^write.log/}
copies each block modified in the transaction from the buffer
cache to its slot in the log on disk.
\lstinline{write_head()}
\lineref{kernel/log.c:/^write.head/}
writes the header block to disk: this is the
commit point, and a crash after the write will
result in recovery replaying the transaction's writes from the log.
\indexcode{install_trans}
\lineref{kernel/log.c:/^install_trans/}
reads each block from the log and writes it to the proper
place in the file system.
Finally
\lstinline{end_op}
writes the log header with a count of zero;
this has to happen before the next transaction starts writing
logged blocks, so that a crash doesn't result in recovery
using one transaction's header with the subsequent transaction's
logged blocks.
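In outline, the commit sequence looks like this (a sketch of \lstinline{commit} in \lstinline{kernel/log.c}; the ordering of the four steps is what matters):
\begin{lstlisting}[]
static void
commit(void)
{
  if(log.lh.n > 0){
    write_log();      // 1. copy modified blocks from cache to log
    write_head();     // 2. write header: the real commit point
    install_trans();  // 3. copy logged blocks to home locations
    log.lh.n = 0;
    write_head();     // 4. erase the transaction from the log
  }
}
\end{lstlisting}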
\indexcode{recover_from_log}
\lineref{kernel/log.c:/^recover_from_log/}
is called from
\indexcode{initlog}
\lineref{kernel/log.c:/^initlog/},
which is called from \indexcode{fsinit} \lineref{kernel/fs.c:/^fsinit/} during boot before the first user process runs
\lineref{kernel/proc.c:/fsinit/}.
It reads the log header, and mimics the actions of
\lstinline{end_op}
if the header indicates that the log contains a committed transaction.
An example use of the log occurs in
\indexcode{filewrite}
\lineref{kernel/file.c:/^filewrite/}.
The transaction looks like this:
\begin{lstlisting}[]
begin_op();
ilock(f->ip);
r = writei(f->ip, ...);
iunlock(f->ip);
end_op();
\end{lstlisting}
This code is wrapped in a loop that breaks up large writes into individual
transactions of just a few sectors at a time, to avoid overflowing
the log. The call to
\indexcode{writei}
writes many blocks as part of this
transaction: the file's inode, one or more bitmap blocks, and some data
blocks.
%%
%%
%%
\section{Code: Block allocator}
File and directory content is stored in disk blocks,
which must be allocated from a free pool.
Xv6's block allocator
maintains a free bitmap on disk, with one bit per block.
A zero bit indicates that the corresponding block is free;
a one bit indicates that it is in use.
The program
\lstinline{mkfs}
sets the bits corresponding to the boot sector, superblock, log blocks, inode
blocks, and bitmap blocks.
The block allocator provides two functions:
\indexcode{balloc}
allocates a new disk block, and
\indexcode{bfree}
frees a block.
The loop in
\lstinline{balloc}
at
\lineref{kernel/fs.c:/^..for.b.=.0/}
considers every block, starting at block 0 up to
\lstinline{sb.size},
the number of blocks in the file system.
It looks for a block whose bitmap bit is zero,
indicating that it is free.
If
\lstinline{balloc}
finds such a block, it updates the bitmap
and returns the block.
For efficiency, the loop is split into two
pieces.
The outer loop reads each block of bitmap bits.
The inner loop checks all
Bits-Per-Block (\lstinline{BPB})
bits in a single bitmap block.
The race that might occur if two processes try to allocate
a block at the same time is prevented by the fact that
the buffer cache only lets one process use any one bitmap block at a time.
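The inner loop's bit manipulation can be extracted as a self-contained sketch (in xv6, \lstinline{BPB} is \lstinline{BSIZE*8}; the marking step here follows \lstinline{balloc}'s logic):
\begin{lstlisting}[]
// Scan a bitmap block for a zero (free) bit; if one is found,
// set the bit and return its index, otherwise return -1.
int bitmap_alloc(unsigned char *bitmap, int nbits)
{
  for(int bi = 0; bi < nbits; bi++){
    int m = 1 << (bi % 8);
    if((bitmap[bi/8] & m) == 0){   // is the block free?
      bitmap[bi/8] |= m;           // mark it in use
      return bi;
    }
  }
  return -1;
}
\end{lstlisting}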
\lstinline{bfree}
\lineref{kernel/fs.c:/^bfree/}
finds the right bitmap block and clears the right bit.
Again the exclusive use implied by
\lstinline{bread}
and
\lstinline{brelse}
avoids the need for explicit locking.
As with much of the code described in the remainder of this chapter,
\lstinline{balloc}
and
\lstinline{bfree}
must be called inside a transaction.
%%
%% -------------------------------------------
%%
\section{Inode layer}
The term
\indextext{inode}
can have one of two related meanings.
It might refer to the on-disk data structure containing
a file's size and list of data block numbers.
Or ``inode'' might refer to an in-memory inode, which contains
a copy of the on-disk inode as well as extra information needed
within the kernel.
The on-disk inodes
are packed into a contiguous area
of disk called the inode blocks.
Every inode is the same size, so it is easy, given a
number n, to find the nth inode on the disk.
In fact, this number n, called the inode number or i-number,
is how inodes are identified in the implementation.
The on-disk inode is defined by a
\indexcode{struct dinode}
\lineref{kernel/fs.h:/^struct.dinode/}.
The
\lstinline{type}
field distinguishes between files, directories, and special
files (devices).
A type of zero indicates that an on-disk inode is free.
The
\lstinline{nlink}
field counts the number of directory entries that
refer to this inode, in order to recognize when the
on-disk inode and its data blocks should be freed.
The
\lstinline{size}
field records the number of bytes of content in the file.
The
\lstinline{addrs}
array records the block numbers of the disk blocks holding
the file's content.
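A sketch of the on-disk inode and the i-number arithmetic (fields as in \lstinline{kernel/fs.h}; \lstinline{NDIRECT} and the 1024-byte block size are xv6's values):
\begin{lstlisting}[]
typedef unsigned int uint;
#define NDIRECT 12

struct dinode {
  short type;             // file, directory, device; 0 means free
  short major, minor;     // device numbers (devices only)
  short nlink;            // number of directory entries pointing here
  uint size;              // file size in bytes
  uint addrs[NDIRECT+1];  // data block numbers (last one is indirect)
};

#define IPB (1024 / sizeof(struct dinode))   // inodes per block
// block containing inode i, given the first inode block
#define IBLOCK(i, inodestart) ((i) / IPB + (inodestart))
\end{lstlisting}
With 64-byte inodes, 16 fit in each 1024-byte block, so inode 17 lives in the second inode block.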
The kernel keeps the set of active inodes in memory
in a table called \indexcode{itable};
\indexcode{struct inode}
\lineref{kernel/file.h:/^struct.inode/}
is the in-memory copy of a
\lstinline{struct}
\lstinline{dinode}
on disk.
The kernel stores an inode in memory only if there are
C pointers referring to that inode. The
\lstinline{ref}
field counts the number of C pointers referring to the
in-memory inode, and the kernel discards the inode from
memory if the reference count drops to zero.
The
\indexcode{iget}
and
\indexcode{iput}
functions acquire and release pointers to an inode,
modifying the reference count.
Pointers to an inode can come from file descriptors,
current working directories, and transient kernel code
such as
\lstinline{exec}.
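The in-memory inode adds bookkeeping fields in front of the copied disk inode; a sketch following \lstinline{struct inode} in \lstinline{kernel/file.h} (the \lstinline{sleeplock} stand-in is simplified):
\begin{lstlisting}[]
typedef unsigned int uint;
#define NDIRECT 12
struct sleeplock { int locked; };   // simplified stand-in

struct inode {
  uint dev;               // device number
  uint inum;              // inode number
  int ref;                // count of C pointers to this entry
  struct sleeplock lock;  // protects everything below
  int valid;              // has the disk inode been read in?

  short type;             // copy of the disk inode
  short major, minor;
  short nlink;
  uint size;
  uint addrs[NDIRECT+1];
};
\end{lstlisting}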
There are four lock or lock-like mechanisms in xv6's
inode code.
\lstinline{itable.lock}
protects the invariant that an inode is present in the inode table
at most once, and the invariant that an in-memory inode's
\lstinline{ref}
field counts the number of in-memory pointers to the inode.
Each in-memory inode has a
\lstinline{lock}
field containing a
sleep-lock, which ensures exclusive access to the
inode's fields (such as file length) as well as to the
inode's file or directory content blocks.
An inode's
\lstinline{ref},
if it is greater than zero, causes the system to maintain
the inode in the table, and not re-use the table entry for
a different inode.
Finally, each inode contains a
\lstinline{nlink}
field (stored on disk, and copied in memory when the inode is in memory) that
counts the number of directory entries that refer to a file;
xv6 won't free an inode if its link count is greater than zero.
A
\lstinline{struct}
\lstinline{inode}
pointer returned by
\lstinline{iget()}
is guaranteed to be valid until the corresponding call to
\lstinline{iput()};
the inode won't be deleted, and the memory referred to
by the pointer won't be re-used for a different inode.
\lstinline{iget()}
provides non-exclusive access to an inode, so that
there can be many pointers to the same inode.
Many parts of the file-system code depend on this behavior of
\lstinline{iget()},
both to hold long-term references to inodes (as open files
and current directories) and to prevent races while avoiding
deadlock in code that manipulates multiple inodes (such as
pathname lookup).
The
\lstinline{struct}
\lstinline{inode}
that
\lstinline{iget}
returns may not have any useful content.
In order to ensure it holds a copy of the on-disk
inode, code must call
\indexcode{ilock}.
This locks the inode (so that no other process can
\lstinline{ilock}
it) and reads the inode from the disk,
if it has not already been read.
\lstinline{iunlock}
releases the lock on the inode.
Separating acquisition of inode pointers from locking
helps avoid deadlock in some situations, for example during
directory lookup.
Multiple processes can hold a C pointer to an inode
returned by
\lstinline{iget},
but only one process can lock the inode at a time.
The inode table only stores inodes to which kernel code
or data structures hold C pointers.
Its main job is synchronizing access by multiple processes.
The inode table also happens to cache frequently-used inodes, but
caching is secondary; if an inode is used frequently, the buffer cache will probably
keep it in memory.
Code that modifies an in-memory inode writes it to disk with
\lstinline{iupdate}.
%%
%% -------------------------------------------
%%
\section{Code: Inodes}
To allocate a new inode (for example, when creating a file),
xv6 calls
\indexcode{ialloc}
\lineref{kernel/fs.c:/^ialloc/}.
\lstinline{ialloc}
is similar to
\indexcode{balloc}:
it loops over the inode structures on the disk, one block at a time,
looking for one that is marked free.
When it finds one, it claims it by writing the new
\lstinline{type}
to the disk and then returns an entry from the inode table
with the tail call to
\indexcode{iget}
\lineref{kernel/fs.c:/return.iget\(dev..inum\)/}.
The correct operation of
\lstinline{ialloc}
depends on the fact that only one process at a time
can be holding a reference to
\lstinline{bp}:
\lstinline{ialloc}
can be sure that some other process does not
simultaneously see that the inode is available
and try to claim it.
\lstinline{iget}
\lineref{kernel/fs.c:/^iget/}
looks through the inode table for an active entry (\lstinline{ip->ref}
\lstinline{>}
\lstinline{0})
with the desired device and inode number.
If it finds one, it returns a new reference to that inode
\linerefs{kernel/fs.c:/^....if.ip->ref.>.0/,/^....}/}.
As
\indexcode{iget}
scans, it records the position of the first empty slot
\linerefs{kernel/fs.c:/^....if.empty.==.0/,/empty.=.ip/},
which it uses if it needs to allocate a table entry.
Code must lock the inode using
\indexcode{ilock}
before reading or writing its metadata or content.
\lstinline{ilock}
\lineref{kernel/fs.c:/^ilock/}
uses a sleep-lock for this purpose.
Once
\indexcode{ilock}
has exclusive access to the inode, it reads the inode
from disk (more likely, the buffer cache) if needed.
The function
\indexcode{iunlock}
\lineref{kernel/fs.c:/^iunlock/}
releases the sleep-lock,
which may wake any processes
that are sleeping on it.
\lstinline{iput}
\lineref{kernel/fs.c:/^iput/}
releases a C pointer to an inode
by decrementing the reference count
\lineref{kernel/fs.c:/^..ip->ref--/}.
If this is the last reference, the inode's
slot in the inode table is now free and can be re-used
for a different inode.
If
\indexcode{iput}
sees that there are no C pointer references to an inode
and that the inode has no links to it (occurs in no
directory), then the inode and its data blocks must
be freed.
\lstinline{iput}
calls
\indexcode{itrunc}
to truncate the file to zero bytes, freeing the data blocks;
sets the inode type to 0 (unallocated);
and writes the inode to disk
\lineref{kernel/fs.c:/inode.has.no.links.and/}.
The locking protocol in
\indexcode{iput}
in the case in which it frees the inode deserves a closer look.
One danger is that a concurrent thread might be waiting in
\lstinline{ilock}
to use this inode (e.g., to read a file or list a directory),
and won't be prepared to find that the inode is no longer
allocated. This can't happen because there is no way for
a system call to get a pointer to an in-memory inode if it has
no links to it and
\lstinline{ip->ref}
is one. That one reference is the reference owned by the
thread calling
\lstinline{iput}.
It's true that
\lstinline{iput}
checks that the reference count is one outside of its
\lstinline{itable.lock}
critical section, but at that point the link
count is known to be zero, so no thread will try
to acquire a new reference.
The other main danger is that a concurrent call to
\lstinline{ialloc}
might choose the same inode that
\lstinline{iput}
is freeing.
This can happen only after the
\lstinline{iupdate}
writes the disk so that the inode has type zero.
This race is benign; the allocating thread will politely wait
to acquire the inode's sleep-lock before reading or writing
the inode, at which point
\lstinline{iput}
is done with it.
\lstinline{iput()}
can write to the disk. This means that any system call that uses the file
system may write to the disk, because the system call may be the last one
holding a reference to the file. Even calls like
\lstinline{read()}
that appear to be read-only may end up calling
\lstinline{iput()}.
This, in turn, means that even read-only system calls
must be wrapped in transactions if they use the file system.
There is a challenging interaction between
\lstinline{iput()}
and crashes.
\lstinline{iput()}
doesn't truncate a file immediately when the link count for the file
drops to zero, because some process might still hold a reference to the inode in
memory: a process might still be reading and writing to the file, because it
successfully opened it. But, if a crash happens before the last process closes
the file descriptor for the file, then the file will be marked allocated on disk
but no directory entry will point to it.
File systems handle this case in one of two ways. The simple solution is that on
recovery, after reboot, the file system scans the whole file system for files
that are marked allocated but have no directory entry pointing to them. If it
finds any such files, it frees them.
The second solution doesn't require scanning the file system. In this solution,
the file system records on disk (e.g., in the super block) the inode number of
each file whose link count has dropped to zero but whose reference count isn't zero.
When the file system later frees such a file, once its reference count reaches
zero, it removes that inode from the on-disk list. On recovery, the
file system frees any file in the list.
Xv6 implements neither solution, which means that inodes may be marked allocated
on disk even though they are no longer in use. Over time, xv6 therefore risks
running out of disk space.
%%
%%
%%
\section{Code: Inode content}
\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/inode.pdf}
\caption{The representation of a file on disk.}
\label{fig:inode}
\end{figure}
The on-disk inode structure,
\indexcode{struct dinode},
contains a size and an array of block numbers (see
Figure~\ref{fig:inode}).
The inode data is found in the blocks listed
in the
\lstinline{dinode}'s
\lstinline{addrs}
array.
The first
\indexcode{NDIRECT}
blocks of data are listed in the first
\lstinline{NDIRECT}
entries in the array; these blocks are called
\indextext{direct blocks}.
The next
\indexcode{NINDIRECT}
blocks of data are listed not in the inode
but in a data block called the
\indextext{indirect block}.
The last entry in the
\lstinline{addrs}
array gives the address of the indirect block.
Thus the first 12~kB
(\lstinline{NDIRECT}
$\times$
\indexcode{BSIZE})
of a file can be loaded from blocks listed in the inode,
while the next
256~kB
(\lstinline{NINDIRECT}
$\times$
\lstinline{BSIZE})
can only be loaded after consulting the indirect block.
This is a good on-disk representation but a
complex one for clients.
The function
\indexcode{bmap}
manages the representation so that higher-level routines, such as
\indexcode{readi}
and
\indexcode{writei},
which we will see shortly, do not need to manage this complexity.
\lstinline{bmap}
returns the disk block number of the
\lstinline{bn}'th
data block for the inode
\lstinline{ip}.
If
\lstinline{ip}
does not have such a block yet,
\lstinline{bmap}
allocates one.
The function
\indexcode{bmap}
\lineref{kernel/fs.c:/^bmap/}
begins by picking off the easy case: the first
\indexcode{NDIRECT}
blocks are listed in the inode itself
\linerefs{kernel/fs.c:/^..if.bn.<.NDIRECT/,/^..}/}.
The next
\indexcode{NINDIRECT}