-
Notifications
You must be signed in to change notification settings - Fork 11
/
linux_cpu_scheduler.lyx
3497 lines (3023 loc) · 109 KB
/
linux_cpu_scheduler.lyx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass article
\begin_preamble
\usepackage{ae}
\end_preamble
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default
\layout Title
Understanding the Linux 2.6.8.1 CPU Scheduler
\layout Author
By Josh Aas
\newline
©2005 Silicon Graphics, Inc.
(SGI)
\layout Standard
\begin_inset LatexCommand \tableofcontents{}
\end_inset
\layout Section
Introduction
\layout Subsection
Paper Overview
\layout Standard
Linux kernel development is relatively fast-paced given the size and complexity
of the code base.
This is because of its widespread adoption by hobbyists, home users, businesses
(including many Fortune 500 companies), and educational institutions.
The Linux kernel mailing list (LKML, a mailing list for kernel developers),
as of summer 2004, averages about 300 messages per day from between 50
and 100 different developers.
These numbers do not include most architecture-specific discussions, which
happen on separate lists.
In the year before August 1st, 2004, over 16,000 patches of widely varying
sizes were committed to the official Linux kernel
\begin_inset LatexCommand \cite{key-10}
\end_inset
.
This pace of development has led to a situation where very few of the kernel's
major components are adequately documented at the implementation level.
\layout Standard
This lack of documentation makes it more difficult for new contributors,
students, researchers, and even veteran contributors to understand the
Linux kernel's implementation.
For all of these people, implementation-level documentation of the Linux
kernel provides many benefits.
Obviously, those who wish to contribute to the Linux kernel must have a
fairly good understanding of its actual implementation.
But why is it valuable for students and researchers to understand the Linux
kernel at the implementation level? Isn't the theory behind it or a general
idea of what is going on enough? Since the Linux kernel is "developed with
a strong practical emphasis more than a theoretical one"
\begin_inset LatexCommand \cite{key-9}
\end_inset
, many decisions are made in reaction to Linux's real-world performance.
This means that it is quite common for Linux's implementation to diverge
from theoretical foundations; when this happens, it is usually for a good
reason.
Understanding deployed algorithms, the reasoning behind divergences from
theory, and the weaknesses in theories that real-world applications bring
to light is essential for the development of future algorithms.
\layout Standard
For the reasons listed above, Linux needs documentation specific to its
implementation, not just the theory that may or may not have at one time
been the basis for the design choices made by its developers.
This paper on the Linux 2.6.8.1 scheduler was inspired by Mel Gorman's thesis
on the Linux virtual memory (VM) system
\begin_inset LatexCommand \cite{key-9}
\end_inset
, which current Linux VM developers probably reference and value more than
any other piece of documentation on the subject.
\layout Standard
The goal of this paper is to provide in-depth documentation of the Linux
2.6.8.1 CPU scheduler.
This documentation will hopefully be of use to kernel developers who must
work with the code, as well as students and researchers who wish to understand
the implementation of a real, working scheduler.
Hopefully this paper will greatly reduce the amount of time required to
gain a detailed understanding of how the Linux 2.6.8.1 scheduler works.
In the same way that Mr.
Gorman's documentation of the Linux 2.4.20 VM system is still very helpful
in understanding the VM system in the Linux 2.6.x series of kernels, it is
hoped that this paper will remain relevant for many versions of the Linux
kernel beyond 2.6.8.1.
\layout Subsection
Linux Kernel Literature
\layout Standard
While the Linux kernel lacks up-to-date code-level documentation, there
is a reasonable amount of higher-level and introductory documentation available.
Any of the following literature is highly recommended reading for those
who wish to gain a basic knowledge of kernel internals.
\layout Standard
\emph on
Linux Kernel Development
\emph default
by Robert Love (a highly respected Linux kernel hacker) was released in
2004
\begin_inset LatexCommand \cite{key-7}
\end_inset
.
It covers the Linux 2.6.x kernel series, and as of fall 2004 it is perhaps
the only book to do so (most others cover Linux 2.4.x and earlier).
At 332 pages, it is quite manageable as a book to read page-by-page and
to use as a reference.
It gives a general overview of each of the Linux kernel's components, and
helps to illustrate how they fit together.
It contains a well-written overview of the Linux 2.6.x scheduler.
\layout Standard
Robert Love's
\emph on
Linux Kernel Development
\emph default
may be the only book available that covers the Linux 2.6.x kernel, but there
are several books available about the Linux 2.4.x kernel that may be helpful
in understanding many components of the Linux 2.6.x kernels (some component
have not changed drastically).
Books providing such coverage include:
\layout Itemize
\emph on
Understand The Linux Kernel
\emph default
by Daniel Bovet and Marco Cesati.
O'Reilly, 2003.
\layout Itemize
\emph on
Linux Device Drivers
\emph default
by Alessandro Rubini and Jonathan Corbet.
O'Reilly, 2001.
\layout Itemize
\emph on
IA-64 Linux Kernel
\emph default
by David Mosberger and Stephane Eranian.
Prentice Hall PTR, 2002.
\layout Itemize
\emph on
Understanding The Linux Virtual Memory Manager
\emph default
by Mel Gorman.
2004.
\newline
(
\emph on
http://www.skynet.ie/~mel/projects/vm/
\emph default
)
\layout Standard
The Linux Documentation Project (
\emph on
http://www.tldp.org/
\emph default
) is another good source of documentation.
It contains documents covering many different aspects of Linux distributions
and the Linux kernel.
\layout Standard
Archives of all past conversation on the official Linux kernel development
mailing list (LKML) are available on many web sites.
Simply search for
\begin_inset Quotes eld
\end_inset
LKML archive
\begin_inset Quotes erd
\end_inset
using a search engine such as Google (
\emph on
http://www.google.com
\emph default
/).
LKML should be read liberally and posted to conservatively.
\layout Standard
Last but not least, the documentation distributed with the kernel source
itself is quite helpful.
It can be found in the
\family typewriter
Documentation/
\family default
directory.
\layout Standard
Unfortunately, Linux documentation covering kernels prior to the 2.6.x series
will be of minimal use in understanding the scheduler described in this
paper because the scheduler was heavily modified between the 2.4.x and 2.6.x
kernel series.
\layout Subsection
Typographic Conventions
\layout Standard
New concepts and URLs are
\emph on
italicized
\emph default
.
Binaries, commands, and package names are in
\series bold
bold
\series default
.
Code, macros, and file paths are in a
\family typewriter
constant-width font
\family default
.
Paths to included files will be written with brackets around them (e.g.
\family typewriter
<linux/sched.h>
\family default
); these files can be found in the
\family typewriter
include/
\family default
directory of the Linux kernel source code.
All paths are rooted in the Linux kernel source code unless otherwise noted.
Fields in a structure are referred to with an arrow pointing from the structure
to the field (e.g.
\family typewriter
structure->field
\family default
).
\layout Subsection
About this Document
\layout Standard
This document was written in LaTeX using the LyX editor on SuSE Linux 9.x
and Mac OS X 10.3.x.
It is made available in HTML, PDF, and LaTeX form.
It can be downloaded from the author's web site (
\emph on
http://josh.trancesoftware.com/linux
\emph default
/).
\layout Subsection
Companion CD
\layout Standard
The companion disc included with this document includes the full source
code of the Linux 2.6.8.1 kernel, a patch to add in-depth comments to the
scheduler code, and a digital copy of this document.
The disc is an ISO-9660 formatted CD that should work in any modern operating
system.
To apply the scheduler comments patch, move it to the directory
\family typewriter
kernel/
\family default
in your Linux source code,
\series bold
cd
\series default
into that directory, and run the following command:
\layout Standard
\series bold
patch -p0 < sched_comments.patch
\layout Section
Linux Kernel Source Code
\layout Subsection
Getting the Source
\layout Standard
The Linux kernel source code is an essential resource for learning about
the kernel.
In attempting to gain a detailed understanding of the kernel, no paper
can entirely replace reading the code itself.
This paper will refer to it heavily.
The Linux kernel source code is available at The Linux Kernel Archives
(
\emph on
http://www.kernel.org
\emph default
).
The main page of the kernel archive lists the latest release from each
kernel series, including complete source code, upgrade patches, and change
logs.
All released versions of the Linux kernel are available on the archive's
FTP site (
\emph on
ftp://ftp.kernel.org/
\emph default
).
\layout Subsection
Kernel Versioning
\layout Standard
Linux kernels have version numbers in the form W.X.Y.Z.
The W position is rarely incremented - only when an extremely significant
change has been made to the kernel, such that a considerable amount of
software that works on one version won't work on another.
This has only happened once in the history of Linux (thus the "2" at the
beginning of the kernel version this paper focuses on, 2.6.8.1).
\layout Standard
The X position denotes the kernel
\emph on
series
\emph default
.
An even series indicates a stable release series, and an odd series denotes
a development release series.
Historically, the series number is incremented every couple of years.
Development of older series' continues as long as there is interest.
For example - though Linux 2.0 was originally released in June of 1996,
version 2.0.40 was released in February of 2004 (largely by/for people who
want to continue to support older hardware).
\layout Standard
The Y position is the
\emph on
version number
\emph default
, which is normally incremented for every release.
Often it is the last position in a kernel version (e.g.
2.6.7), but occasionally there is a need to fix something critical in a release.
In such cases the Z position is incremented.
The first instance of this happening was the release of the 2.6.8.1 kernel.
The 2.6.8 kernel contains a very serious bug in its Network File System (NFS)
implementation.
This was discovered very soon after its release, and thus 2.6.8.1 was released
containing little more than a fix for that specific bug.
\layout Subsection
Code Organization
\layout Standard
There are quite a few subdirectories within each Linux source code package.
Subdirectories that it would be most helpful to know about while reading
this paper are:
\layout Description
Documentation a directory containing lots of good documentation on kernel
internals and the development process
\layout Description
arch a directory containing architecture-specific code; it contains one
subdirectory for each supported architecture (e.g.
i386, ia64, ppc64...)
\layout Description
include a directory containing header files
\layout Description
kernel a directory containing the main kernel code
\layout Description
mm a directory containing the kernel's memory management code
\layout Section
Overview of Processes and Threads
\layout Standard
It is important to have a decent understanding of both processes and threads
before learning about schedulers.
Explaining processes and threads in depth is outside of the scope of this
document, thus only a summary of the things that one must know about them
is provided here.
Readers of this document are strongly encouraged to gain an in-depth understand
ing of processes and threads from another source.
Excellent sources are listed in the bibliography
\begin_inset LatexCommand \cite{key-4,key-6,key-7,key-8}
\end_inset
.
\layout Subsection
Programs and Processes
\layout Standard
A
\emph on
program
\emph default
is a combination of instructions and data put together to perform a task
when executed.
A
\emph on
process
\emph default
is an instance of a program (what one might call a
\begin_inset Quotes eld
\end_inset
running
\begin_inset Quotes erd
\end_inset
program).
An analogy is that programs are like classes in languages like C++ and
Java, and processes are like objects (instantiated instances of classes).
Processes are an abstraction created to embody the state of a program during
its execution.
This means keeping track of the data that is associated with a thread or
threads of execution, which includes variables, hardware state (e.g.
registers and the program counter, etc...), and the contents of an address
space
\begin_inset Foot
collapsed true
\layout Standard
An address space is the set of memory addresses that a process is allowed
to read and/or write to.
\end_inset
\begin_inset LatexCommand \cite{key-2}
\end_inset
.
\layout Subsection
Threads
\layout Standard
A process can have multiple threads of execution that work together to accomplis
h its goals.
These threads of execution are aptly named
\emph on
threads
\emph default
.
A kernel must keep track of each thread's stack and hardware state, or
whatever is necessary to track a single flow of execution within a process.
Usually threads share address spaces, but they do not have to (often they
merely overlap).
It is important to remember that only one thread may be executing on a
CPU at any given time, which is basically the reason kernels have CPU scheduler
s.
An example of multiple threads within a process can be found in most web
browsers.
Usually at least one thread exists to handle user interface events (like
stopping a page load), one thread exists to handle network transactions,
and one thread exists to render web pages.
\layout Subsection
Scheduling
\layout Standard
Multitasking kernels (like Linux) allow more than one process to exist at
any given time, and furthermore each process is allowed to run as if it
were the only process on the system.
Processes do not need to be aware of any other processes unless they are
explicitly designed to be.
This makes programs easier to develop, maintain, and port
\begin_inset LatexCommand \cite{key-2}
\end_inset
.
Though each CPU in a system can execute only one thread within a process
at a time, many threads from many processes appear to be executing at the
same time.
This is because threads are scheduled to run for very short periods of
time and then other threads are given a chance to run.
A kernel's scheduler enforces a thread scheduling policy, including when,
for how long, and in some cases where (on
\emph on
Symmetric Multiprocessing (SMP)
\emph default
systems) threads can execute.
Normally the scheduler runs in its own thread, which is woken up by a timer
interrupt.
Otherwise it is invoked via a system call or another kernel thread that
wishes to yield the CPU.
A thread will be allowed to execute for a certain amount of time, then
a context switch to the scheduler thread will occur, followed by another
context switch to a thread of the scheduler's choice.
This cycle continues, and in this way a certain policy for CPU usage is
carried out.
\layout Subsection
CPU and I/O-bound Threads
\layout Standard
Threads of execution tend to be either
\emph on
CPU-bound
\emph default
or
\emph on
I/O-bound
\emph default
(Input/Output bound).
That is, some threads spend a lot of time using the CPU to do computations,
and others spend a lot of time waiting for relatively slow I/O operations
to complete.
For example - a thread that is sequencing DNA will be CPU bound.
A thread taking input for a word processing program will be I/O-bound as
it spends most of its time waiting for a human to type.
It is not always clear whether a thread should be considered CPU or I/O
bound.
The best a scheduler can do is guess, if it cares at all.
Many schedulers do care about whether or not a thread should be considered
CPU or I/O bound, and thus techniques for classifying threads as one or
the other are important parts of schedulers.
\layout Standard
Schedulers tend to give I/O-bound threads priority access to CPUs.
Programs that accept human input tend to be I/O-bound - even the fastest
typist has a considerable amount of time between each keystroke during
which the program he or she is interacting with is simply waiting.
It is important to give programs that interact with humans priority since
a lack of speed and responsiveness is more likely to be perceived when
a human is expecting an immediate response than when a human is waiting
for some large job to finish.
\layout Standard
It is also beneficial to the system as a whole to give priority to programs
that are I/O-bound but not because of human input
\begin_inset Foot
collapsed false
\layout Standard
It is fortunate that both human-interactive and non-human-interactive I/O
activity should be awarded a higher priority since there is really no way
to tell at the scheduler level what I/O was human-initiated and what was
not.
The scheduler does not know whether a program is blocked waiting for keyboard
input or it is blocked waiting for data from a hard drive.
\end_inset
.
Because I/O operations usually take a long time it is good to get them
started as early as possible.
For example, a program that needs a piece of data from a hard disk has
a long wait ahead before it gets its data.
Kicking off the data request as quickly as possible frees up the CPU to
work on something else during the request and helps the program that submitted
the data request to be able to move on as quickly as possible.
Essentially, this comes down to parallelizing system resources as efficiently
as possible.
A hard drive can seek data while a CPU works on something else, so having
both resources working as early and often as possible is beneficial.
Many CPU operations can be performed in the time it takes to get data from
a hard drive.
\layout Subsection
Context Switching
\layout Standard
\emph on
Context switching
\emph default
is the process of switching from one thread of execution to another.
This involves saving the state of the CPU's registers and loading a new
state, flushing caches, and changing the current virtual memory map.
Context switches on most architectures are a relatively expensive operation
and as such they are avoided as much as possible.
Quite a bit of actual work can be done during the time it takes to perform
a context switch.
How context switching is handled is highly architecture-dependent and is
not really part of a kernel's scheduler, though the way it is done can
greatly influence a scheduler's design.
Context switching code in the Linux kernel is defined in the files
\family typewriter
include/asm-[arch]/mmu_context.h
\family default
(change current virtual memory mapping) and
\family typewriter
include/asm-[arch]/system.h
\family default
(perform CPU context switch, e.g.
PC and general registers).
\layout Subsection
Linux Processes/Threads
\layout Standard
Linux takes a unique approach to implementing the process and thread abstraction
s.
In Linux, all threads are simply processes that might share certain resources.
Instead of being something different than a thread or a group of threads,
a process in Linux is simply a group of threads that share something called
a
\family typewriter
\emph on
thread group ID (TGID)
\family default
\emph default
and whatever resources are necessary.
In order to reconcile Linux's treatment of processes and threads with the
terms themselves, the term
\begin_inset Quotes eld
\end_inset
task
\begin_inset Quotes erd
\end_inset
will be used from here on to mean a Linux thread - it does not mean thread
in the POSIX sense.
\begin_inset Quotes eld
\end_inset
Process
\begin_inset Quotes erd
\end_inset
or
\begin_inset Quotes eld
\end_inset
thread
\begin_inset Quotes erd
\end_inset
will be used only when the difference really matters.
In the Linux task structure
\family typewriter
task_struct
\family default
(one of which exists for each thread), the TGID that is a process's POSIX
PID is stored as
\family typewriter
[task_struct]->tgid
\family default
.
Linux assigns unique
\begin_inset Quotes eld
\end_inset
PID
\begin_inset Quotes erd
\end_inset
s to every thread (
\family typewriter
[task_struct]->pid
\family default
), but the (POSIX) PID that most people think of is really a task's TGID.
It is worth mentioning that this model, combined with certain tricks like
a COW (Copy On Write) forking algorithm
\begin_inset Foot
collapsed true
\layout Standard
Normally a call to
\family typewriter
fork()
\family default
causes a copy of the caller's resources to be created and labeled as a
child.
Copy On Write means that the resources are not actually copied until the
child's resources differ from the parent's (i.e.
the child or parent tries to write to some shared data).
Even then, only the differing resources are copied and thus no longer shared.
This saves time in the usual case where
\family typewriter
fork()
\family default
is immediately followed by a call to
\family typewriter
exec()
\family default
because if
\family typewriter
fork()
\family default
did not use COW, a copy of the parent's executable data (text section)
would be created only to be overwritten by new data taken in during the
\family typewriter
exec()
\family default
call.
\end_inset
causes process and thread spawning to be very fast and efficient in Linux,
whereas spawning a process is much more expensive than spawning threads
\begin_inset Foot
collapsed true
\layout Standard
Operating systems that differentiate between process and thread spawning
often referred to threads as lightweight processes (LWPs).
\end_inset
on many other operating systems (e.g.
BSD UNIX® and Microsoft® Windows®).
\layout Standard
Unfortunately, further details about Linux's process and thread implementation
would be out of the scope of this paper.
It is only important to know that Linux considers processes to be merely
groups of threads and does not differentiate between the two.
Because of this, Linux schedules threads only, essentially ignoring what
POSIX processes they belong to.
\layout Section
Linux Scheduling Goals
\layout Subsection
Linux's Target Market(s) And Their Effects on its Scheduler
\layout Standard
An operating system's scheduling algorithm is largely determined by its
target market, and vice-versa.
Understanding an operating system's target market helps to explain its
scheduling goals, and thus its scheduling algorithm.
\layout Standard
Linux was originally created by Linus Torvalds for use on his personal computer.
However, despite its origins, Linux has become known as a server operating
system.
There are many reasons for this, not the least of which is the fact that
most software designed to run on top of the Linux kernel is meant for users
with a relatively high skill level or inherits design qualities targeting
more skilled users.
This led to Linux's notoriously complex and unrefined graphical user interface
options (compared to Apple® and Microsoft® operating systems) and subsequent
relegation to the server room.
Linux's exposure in the server market guided its development along the
lines of the one market that it initially succeeded in.
Linux's prowess as a server operating system is nowadays perhaps matched
only by a few operating systems such as Sun's Solaris and IBM's AIX.
However, cost and legal advantages are causing many companies to replace
both of those operating systems with Linux as well.
\layout Standard
While Linux has made a name for itself in the server operating systems arena,
many users and developers believe that it can also be a success on the
desktop.
In the last several years, there has been a push to optimize the Linux
kernel for the desktop market.
Perhaps the biggest step in that direction was the scheduler written by
Ingo Molnar for the 2.6.x kernel series.
Molnar designed his scheduler with the desktop and the server market in
mind, and as a result desktop performance is much improved in Linux distributio
ns based on 2.6.x kernels.
Targeting both the server and the desktop market imposes particularly heavy
demands on the kernel's scheduler, and thus the Linux kernel's scheduler
is an interesting case study in how to please two very different markets
at the same time.
\layout Subsection
Efficiency
\layout Standard
An important goal for the Linux scheduler is efficiency.
This means that it must try to allow as much real work as possible to be
done while staying within the restraints of other requirements.
For example - since context switching is expensive, allowing tasks to run
for longer periods of time increases efficiency.
Also, since the scheduler's code is run quite often, its own speed is an
important factor in scheduling efficiency.
The code making scheduling decisions should run as quickly and efficiently
as possible.
Efficiency suffers for the sake of other goals such as interactivity, because
interactivity essentially means having more frequent context switches.
However, once all other requirements have been met, overall efficiency
is the most important goal for the scheduler.
\layout Subsection
Interactivity
\layout Standard
Interactivity is an important goal for the Linux scheduler, especially given
the growing effort to optimize Linux for desktop environments.
Interactivity often flies in the face of efficiency, but it is very important
nonetheless.
An example of interactivity might be a keystroke or mouse click.
Such events usually require a quick response (i.e.
the thread handling them should be allowed to execute very soon) because
users will probably notice and be annoyed if they do not see some result
from their action almost immediately.
Users don't expect a quick response when, for example, they are compiling
programs or rendering high-resolution images.
They are unlikely to notice if something like compiling the Linux kernel
takes an extra twenty seconds.
Schedulers used for interactive computing should be designed in such a
way that they respond to user interaction within a certain time period.
Ideally, this should be a time period that is imperceptible to users and
thus gives the impression of an immediate response.
\layout Subsection
Fairness and Preventing Starvation
\layout Standard
It is important for tasks to be treated with a certain degree of fairness,
including the stipulation that no thread ever starves.
Starvation happens when a thread is not allowed to run for an unacceptably
long period of time due to the prioritization of other threads over it.
Starvation must not be allowed to happen, though certain threads should
be allowed to have a considerably higher priority level than others based
on user-defined values and/or heuristic indicators.
Somehow, threads that are approaching the starvation threshold (which is
generally defined by a scheduler's implementors) must get a significant
priority boost or one-time immediate preemption before they starve.
Fairness does not mean that every thread should have the same degree of
access to CPU time with the same priority, but it means that no thread
should ever starve or be able to trick the scheduler into giving it a higher
priority or more CPU time than it ought to have.
\layout Subsection
SMP Scheduling
\layout Standard
Since the Linux kernel supports multiprocessing, its scheduler must work
(and work well for that matter) when scheduling tasks across more than
one CPU on the same motherboard.
This means keeping track of which tasks are running on which CPUs, making
sure any given task is not executing on more than one CPU at a time, and
in general doing all of the accounting necessary to efficiently schedule
tasks across multiple CPUs.
Since all CPUs generally access the same memory and system resources, the
scheduler is primarily concerned with making the best use of processor
time.
There is little reason to prefer one CPU over another in terms of choosing
where to schedule a task.
The most conspicuous consideration is caching - by scheduling a given task
on the same CPU as often as possible, the likelihood of that CPU's cache
being hot increases.
\layout Subsection
SMT Scheduling
\layout Standard
The Linux kernel supports scheduling multiple threads on a single
\emph on
Symmetric Multi-Threading (SMT)
\emph default
chip.
While the concept of SMT has been around for some time, Intel's Hyper-Threading
(HT) technology made SMT technology mainstream.
Essentially, each physical SMT chip can have more than one virtual processor
with the caveat that the virtual processors share certain resources (e.g.
some types of cache).
Because certain resources are shared, virtual processors should not be
treated in the same way that regular processors are.
\layout Subsection
NUMA Scheduling
\layout Standard
The Linux kernel supports
\emph on
Non-Uniform Memory Access (NUMA)
\emph default
, which means it can run a single system image across more than one node
if such hardware is present (essentially a node is defined as a motherboard).
At a hardware level, a node is something like a traditional uniprocessor
or multiprocessor machine in that it has its own CPU(s) and memory.
However, NUMA systems treat multiple nodes as parts of a single system
running a single system image (i.e.
one instance of the Linux kernel).
This is usually accomplished through some sort of high-speed interconnect
(such as SGI's NUMAlink technology), which connects nodes at a more of
a motherboard level than at a networking level.
This means that all CPUs are capable of executing any thread, and all of
the memory across nodes is accessible via the same address space (i.e.
any CPU can allocate memory on any node on the system).
NUMA support involves being aware of cache issues similar to those in SMP
scheduling, but can also include issues of memory locality (i.e.
if a CPU is executing a thread which is allocating memory from a local
memory bank, it would be inefficient to move the thread across the system
as memory requests would take longer to fulfill).
Perhaps the biggest issue that a scheduler supporting NUMA needs to tackle
is the possibility of having far more CPUs to schedule on than most SMP
systems.
Common SMP systems might have anywhere from 2-8 processors, but NUMA systems
might have hundreds of processors.
At the time of this writing, SGI is shipping NUMA systems containing 512
processors.
This is the largest number of processors any company has been able to run
under a single Linux system image, and the limit to which the Linux 2.6.8.1
scheduler has been stretched.
\layout Subsection
Soft Real-Time Scheduling
\layout Standard
The Linux scheduler supports soft
\emph on
real-time (RT)
\emph default
scheduling.
This means that it can effectively schedule tasks that have strict timing
requirements.
However, while the Linux 2.6.x kernel is usually capable of meeting very
strict RT scheduling deadlines, it does not guarantee that deadlines will
be met.
RT tasks are assigned special scheduling modes and the scheduler gives
them priority over any other task on the system.
RT scheduling modes include a first-in-first-out (FIFO) mode which allows
RT tasks to run to completion on a first-come-first-serve basis, and a
round-robin scheduling mode that schedules RT tasks in a round-robin fashion
while essentially ignoring non-RT tasks on the system.
\layout Subsection
Scheduling Performance Perspectives
\layout Standard
In terms of schedulers, there is no single definition of performance that
fits everyone's needs; that is, there is not a single performance goal
for the Linux scheduler to strive for.
The many definitions of good scheduling performance often lead to a give-and-ta
ke situation, such that improving performance in one sense hurts performance
in another.
Some improvements to the Linux scheduler help performance all-around, but
such improvements are getting more and more hard to come by.
A good example of a give-and-take performance issue is desktop vs.
server vs.
\emph on
high performance computing (HPC)
\emph default
performance.
\layout Standard
The most important performance metric for desktop users is perceived performance
- that is, how fast does a machine seem to respond to requests such as
mouse clicks and key presses.
If a user is compiling a kernel in the background and typing in a word
processor in the foreground, he or she is unlikely to notice if the kernel
compile takes an extra minute because it is constantly interrupted by the
word processor responding to keystrokes.
What matters most to the users is that when he or she presses a key, the
word processor inserts and displays the desired character as quickly as
possible.
This entails a CPU making a context switch to the word processor's thread
as soon as possible after the user presses a key.
In order for this to happen, the currently running thread must either give
up the processor before its timeslice is up, or its timeslice must be short
enough that the delay between the time the keystroke happens and the timeslice
ends is imperceptible to the user.
Since context switching is expensive, context switches must be minimized