<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc, which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC1982 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1982.xml">
<!ENTITY RFC5304 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5304.xml">
<!ENTITY RFC5310 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5310.xml">
<!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
<!--
<!ENTITY RFC4655 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4655.xml">
-->
<!ENTITY RFC5301 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5301.xml">
<!ENTITY RFC5306 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5306.xml">
<!ENTITY RFC5308 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5308.xml">
<!ENTITY RFC5309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5309.xml">
<!ENTITY RFC5120 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5120.xml">
<!ENTITY RFC7602 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7602.xml">
<!ENTITY RFC7938 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7938.xml">
<!--
<!ENTITY RFC7855 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7855.xml">
-->
<!ENTITY RFC2328 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2328.xml">
<!ENTITY RFC5303 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5303.xml">
<!ENTITY RFC0826 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.0826.xml">
<!ENTITY RFC2131 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2131.xml">
<!ENTITY RFC8415 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8415.xml">
<!ENTITY RFC3626 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3626.xml">
<!ENTITY RFC2365 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2365.xml">
<!ENTITY RFC4291 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4291.xml">
<!ENTITY RFC4861 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4861.xml">
<!ENTITY RFC4862 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4862.xml">
<!ENTITY RFC5082 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5082.xml">
<!ENTITY RFC5549 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5549.xml">
<!ENTITY RFC5881 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5881.xml">
<!ENTITY RFC5709 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5709.xml">
<!ENTITY RFC5905 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5905.xml">
<!ENTITY RFC6518 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6518.xml">
<!ENTITY RFC7752 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7752.xml">
<!ENTITY RFC7987 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7987.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC8200 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8200.xml">
<!ENTITY RFC8202 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8202.xml">
<!-- SR removed
<!ENTITY RFC8402 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8402.xml">
-->
<!ENTITY RFC8505 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8505.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-ietf-rift-rift-10" ipr="trust200902">
<!-- category values: std, bcp, info, exp, and historic
ipr values: full3667, noModification3667, noDerivatives3667
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<front>
<!-- The abbreviated title is used in the page header - it is
only necessary if the
full title is longer than 39 characters -->
<title abbrev="RIFT">RIFT: Routing in Fat Trees</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Tony Przygienda" initials="A." surname="Przygienda" role="editor">
<organization>Juniper</organization>
<address>
<postal>
<street>1137 Innovation Way
</street>
<city>Sunnyvale</city>
<region>CA
</region>
<code/>
<country>USA
</country>
</postal>
<phone/>
<facsimile/>
<email>[email protected]
</email>
<uri/>
</address>
</author>
<author fullname="Alankar Sharma" initials="A"
surname="Sharma">
<organization>Comcast</organization>
<address>
<postal>
<street>1800 Bishops Gate Blvd</street>
<city>Mount Laurel</city>
<region>NJ</region>
<code>08054</code>
<country>US</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author initials="P" surname="Thubert" fullname="Pascal Thubert">
<organization abbrev="Cisco">Cisco Systems, Inc</organization>
<address>
<postal>
<street>Building D</street>
<street>45 Allee des Ormes - BP1200 </street>
<city>MOUGINS - Sophia Antipolis</city>
<code>06254</code>
<country>FRANCE</country>
</postal>
<phone>+33 497 23 26 34</phone>
<email>[email protected]</email>
</address>
</author>
<author fullname="Bruno Rijsman" initials="Bruno" surname="Rijsman">
<organization>Individual</organization>
<address>
<postal>
<street></street>
<city></city>
<region></region>
<code></code>
<country></country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname="Dmitry Afanasiev" initials="Dmitry" surname="Afanasiev">
<organization>Yandex</organization>
<address>
<postal>
<street></street>
<city></city>
<region></region>
<code></code>
<country></country>
</postal>
<email>[email protected]</email>
</address>
</author>
<date year="2020"/>
<!-- If the month and year are both specified and are the current ones, xml2rfc will fill
in the current day for you. If only the current year is specified, xml2rfc will fill
in the current day and month for you. If the year is not the current one, it is
necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the
purpose of calculating the expiry date). With drafts it is normally sufficient to
specify just the year. -->
<!-- Meta-data Declarations -->
<area>Routing</area>
<workgroup>RIFT Working Group</workgroup>
<!-- WG name at the upper left corner of the doc,
IETF is fine for individual submissions.
If this element is not present, the default is "Network Working Group",
which is used by the RFC Editor as a nod to the history of the IETF. -->
<!-- Keywords will be incorporated into HTML output
files in a meta tag but they have no effect on text or nroff
output. If you submit your draft to the RFC Editor, the
keywords will be used for the search engine. -->
<abstract>
<t>This document defines a
specialized, dynamic routing protocol for
Clos and fat-tree network topologies, optimized towards minimization of
configuration and operational
complexity. The protocol
<list style="symbols">
<t>supports configuration-free,
fully automated construction of fat-tree topologies based
on detection of links,
</t>
<t>minimizes the amount of routing
state held at each level,</t>
<t>automatically prunes and load balances
topology
flooding exchanges over a sufficient subset of links,
</t>
<t>supports
automatic disaggregation of prefixes on link and node failures to
prevent black-holing and suboptimal routing,
</t>
<t>allows traffic steering and
re-routing policies,
</t>
<t>allows loop-free non-ECMP forwarding,
</t>
<t>automatically re-balances traffic towards the spines based on
available bandwidth, and finally
</t>
<t>provides
mechanisms to synchronize a limited key-value data-store that
can be used after protocol convergence to, e.g.,
bootstrap higher levels of functionality on nodes.
</t>
</list>
</t>
</abstract>
</front>
<middle>
<section title="Authors">
<t>
This work is the product of a group of individuals, all of whom are to
be considered major contributors regardless of whether
their names appear in the limited boilerplate author list.
</t>
<texttable anchor="authors" style="none" title="RIFT Authors">
<ttcol></ttcol><ttcol></ttcol><ttcol></ttcol><ttcol></ttcol><ttcol></ttcol>
<c>Tony Przygienda, Ed.</c><c>|</c><c>Alankar Sharma</c><c>|</c><c>Pascal Thubert</c>
<c>Juniper Networks</c> <c>|</c><c>Comcast</c> <c>|</c><c>Cisco</c>
<c></c><c></c><c></c><c></c><c></c>
<c>Bruno Rijsman</c> <c>|</c><c>Ilya Vershkov</c> <c>|</c><c>Dmitry Afanasiev</c>
<c>Individual</c> <c>|</c><c>Mellanox</c> <c>|</c><c>Yandex</c>
<c></c><c></c><c></c><c></c><c></c>
<c>Don Fedyk</c> <c>|</c><c>Alia Atlas</c> <c>|</c><c>John Drake</c>
<c>Individual</c> <c>|</c><c>Individual</c> <c>|</c><c>Juniper</c>
</texttable>
</section>
<section title="Introduction">
<!--<t> ANISOTROPIC protocol could be used to describe RIFT in contrary to
uniform information distribution</t>-->
<t><xref
target="CLOS">Clos</xref> and <xref
target="FATTREE">Fat-Tree</xref> topologies
have gained prominence in today's networking, primarily as a
result of
the paradigm shift towards a centralized data-center based
architecture that is poised to deliver a majority of
computation and storage services
in the future.
Today's routing protocols were originally geared towards
networks with an irregular topology and a low degree of connectivity,
but given that
they were the only available options,
several
attempts to apply those protocols to Clos have been made.
Most successfully
<xref
target="RFC4271">BGP</xref> <xref
target="RFC7938"></xref>
has been extended for this purpose, not so much due to its inherent
suitability but rather because of the perceived ease of
modifying BGP and the inherent difficulties of
<xref
target="DIJKSTRA">link-state</xref>
based protocols in optimizing topology exchange and converging quickly in
large scale densely meshed topologies. The incumbent protocols normally
require extensive configuration or provisioning during bring-up and
re-dimensioning. This tends to be viable only for organizations with
the corresponding networking operation skills and budgets.
For many IP fabric
builders a desirable protocol would be one that auto-configures itself
and deals with failures and misconfigurations with a minimum of human
intervention. Such a solution would allow local IP fabric bandwidth to
be consumed in a 'standard component' fashion, i.e. provisioned much
faster and operated at much lower cost than today, much like compute or storage
is consumed already.
</t>
<t>
Looking at the problem through the lens of data center
requirements, RIFT addresses challenges in IP fabric routing
not through an incremental
modification of either a link-state (distributed computation)
or distance-vector (diffused computation) approach but rather through a
mixture of both, colloquially best described as "link-state towards
the spine" and "distance vector towards the leaves". In other words, "bottom" levels
flood their link-state information in
the "northern" direction while each node generates, under normal
conditions, a "default route" and floods it in the "southern" direction.
This type of protocol naturally allows for
highly desirable aggregation. Alas, such
aggregation could blackhole
traffic in cases of misconfiguration or while failures are being
resolved, or even cause partial network partitioning, and this
has to be addressed by some adequate mechanism.
The approach RIFT takes
is described
in <xref target="disaggregate"/> and is
based on automatic, sufficient disaggregation of prefixes in case
of link and node failures.</t>
<t>For the visually oriented reader, <xref target="first-simple"/>
presents a first-level simplified view of the resulting information
and routes on a RIFT fabric. The top of the fabric holds
in its link-state database the nodes below it and the routes to
them. In the second row of the
database table we indicate that partial information about other nodes in
the same level is available as well. The details of how this is
achieved are postponed for the moment. When we look at the
"bottom" of the fabric, the leaves, we see that the topology is
basically empty and they only hold a load-balanced default route
to the next level under normal conditions.
</t>
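<t>
As a non-normative illustration only, the following Python sketch
reproduces the route tables annotated in <xref target="first-simple"/>
for the three-level example fabric (leaves A and B, middle nodes C and D,
top nodes E and F). The names SOUTH_LINKS, north_neighbors,
reachable_south and routes are hypothetical helpers introduced for this
sketch and are not part of the protocol; the sketch ignores flooding,
failures and disaggregation and simply assumes that every node learns a
default route from its northbound neighbors and host routes from its
southbound neighbors.
</t>
<t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
# Toy model of the example fabric; hypothetical names, not RIFT procedures.
# Each node maps to the nodes it is cabled to one level further south.
SOUTH_LINKS = {
    "E": ["C", "D"],
    "F": ["C", "D"],
    "C": ["A", "B"],
    "D": ["A", "B"],
    "A": [],
    "B": [],
}


def north_neighbors(node):
    """Nodes one level up that list `node` among their southbound links."""
    return sorted(n for n, south in SOUTH_LINKS.items() if node in south)


def reachable_south(node):
    """`node` itself plus everything reachable by only going south."""
    seen = {node}
    stack = [node]
    while stack:
        for child in SOUTH_LINKS[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def routes(node):
    """Default route northbound, host routes southbound (assumed model)."""
    table = {}
    north = north_neighbors(node)
    if north:
        # "distance vector towards the leaves": only a default route.
        table["0/0"] = north
    for child in SOUTH_LINKS[node]:
        # "link-state towards the spine": full view of what lies below.
        for dest in reachable_south(child):
            table.setdefault(dest + "/32", []).append(child)
    return {prefix: sorted(hops) for prefix, hops in table.items()}


if __name__ == "__main__":
    for name in ("A", "C", "E"):
        print(name, routes(name))
```
]]>
</artwork>
</figure>
</t>
<t>
In this sketch a leaf ends up with nothing but a load-balanced default
route, while a top node holds a host route for every node below it,
which is the asymmetry the figure is meant to convey.
</t>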
<t>The balance of this document details a dedicated IP fabric
routing protocol, fills in the
specification details and ultimately includes resulting
security considerations.
</t>
<t>
<figure align="center" anchor="first-simple"
title="RIFT information distribution">
<artwork align="center" type="ascii-art"><![CDATA[
. [A,B,C,D]
. [E]
. +-----+ +-----+
. | E | | F | A/32 @ [C,D]
. +-+-+-+ +-+-+-+ B/32 @ [C,D]
. | | | | C/32 @ C
. | | +-----+ | D/32 @ D
. | | | |
. | +------+ |
. | | | |
. [A,B] +-+---+ | | +---+-+ [A,B]
. [D] | C +--+ +-+ D | [C]
. +-+-+-+ +-+-+-+
. 0/0 @ [E,F] | | | | 0/0 @ [E,F]
. A/32 @ A | | +-----+ | A/32 @ A
. B/32 @ B | | | | B/32 @ B
. | +------+ |
. | | | |
. +-+---+ | | +---+-+
. | A +--+ +-+ B |
. 0/0 @ [C,D] +-----+ +-----+ 0/0 @ [C,D]
]]>
</artwork>
</figure>
</t>
<section title="Requirements Language">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC8174">RFC 8174</xref>.</t>
</section>
</section>
<section title="Reference Frame">
<section title="Terminology" toc="default" anchor="glossary">
<t>
This section presents the terminology used in this document.
It is assumed that the reader is thoroughly familiar with the
terms and concepts used in <xref target="RFC2328">OSPF</xref>
and <xref target="ISO10589-Second-Edition">IS-IS</xref>, <xref target="ISO10589"/>
as well as the corresponding
graph-theoretical concepts of shortest path first <xref
target="DIJKSTRA">(SPF)</xref> computation and DAGs.
</t>
<t>
<list style='hanging'>
<t hangText="Crossbar:">
Physical arrangement of ports in a switching matrix without
implying any further scheduling or buffering disciplines.
</t>
<t hangText="Clos/Fat Tree:">
This document uses the terms Clos and Fat Tree interchangeably;
both always refer to a folded spine-and-leaf topology with possibly multiple
Points of Delivery (PoDs) and one or multiple Top of Fabric (ToF) planes. Several modifications such as leaf-to-leaf
shortcuts and multiple level shortcuts are possible and described further in
the document.
</t>
<t hangText="Directed Acyclic Graph (DAG):">A finite directed graph with no directed cycles (loops).
If all links in a Clos are considered to be directed either towards the top or towards the bottom,
each of the two resulting graphs is a DAG.
</t>
<t hangText="Folded Spine-and-Leaf:">
In case Clos fabric input and output stages are analogous, the fabric can be
"folded" to build a "superspine" or top which we will call Top of Fabric (ToF)
in this document.
</t>
<t hangText="Level:"> Clos and Fat Tree networks are
topologically partially ordered graphs and 'level' denotes the set of nodes at the
same height
in such a network, where the bottom level (leaf) is the level with lowest
value.
A node has links to nodes one level down and/or one level up.
Under some circumstances, a node may have links to nodes at
the same level.
As a footnote: Clos terminology
often uses the concept of "stage", but due to the
folded nature of the Fat Tree we do not use it,
to prevent misunderstandings.</t>
<t hangText="Superspine vs. Aggregation and Spine vs. Edge/Leaf:">
Traditional level names in a 5-stage folded Clos for Level 2, 1 and 0 respectively. We
normalize this language to talk about top-of-fabric (ToF), top-of-pod (ToP)
and leaves.
</t>
<t hangText="Zero Touch Provisioning (ZTP):">
Optional RIFT mechanism which allows node levels to be derived automatically
based on minimal configuration (only the ToF property has to be provisioned on the corresponding nodes).
</t>
<t hangText="Point of Delivery (PoD):">A self-contained
vertical slice or subset of a Clos or Fat Tree network
containing normally only level 0
and level 1 nodes. A node in a PoD communicates with
nodes in other PoDs via the Top-of-Fabric. We number PoDs to
distinguish them and use PoD #0 to denote an "undefined" PoD.
</t>
<t hangText="Top of PoD (ToP):">
The set of nodes that provide intra-PoD communication and have
northbound adjacencies outside of the PoD, i.e. are at the
"top" of the PoD.
</t>
<t hangText="Top of Fabric (ToF):">
The set of nodes that provide inter-PoD communication and have
no northbound adjacencies, i.e. are at the "very top" of the fabric.
ToF nodes do not belong
to any PoD and are assigned the "undefined"
PoD value to indicate
the equivalent of "any" PoD.
</t>
<t hangText="Spine:">Any node north of the leaves and south of the top-of-fabric nodes. Multiple
layers of spines in a PoD are possible.
</t>
<t hangText="Leaf:">A node without southbound adjacencies. Its level
is 0 (except cases where it is deriving its level via ZTP and
is running without
LEAF_ONLY which will be explained in <xref target="ZTP"/>).
</t>
<t hangText="Top-of-fabric Plane or Partition:">In large fabrics top-of-fabric switches may not
have enough ports to aggregate all switches south of them and
with that, the ToF is 'split' into multiple independent planes.
The introduction and <xref target="Planes"/> explain the concept in more detail.
A plane is a subset of ToF nodes that see each other through
south reflection or E-W links.
</t>
<t hangText="Radix:">The radix of a switch is the number of
switching ports it provides. It is sometimes called
fanout as well.</t>
<t hangText="North Radix:">Ports cabled northbound to higher level nodes.</t>
<t hangText="South Radix:">Ports cabled southbound to lower level nodes.</t>
<t hangText="South/Southbound and North/Northbound (Direction):">
When describing protocol
elements and procedures,
we will use the directionality
of the compass in different situations,
i.e., 'south' or 'southbound' means
moving
towards the bottom of the Clos or Fat Tree network
and 'north' or 'northbound' means moving towards
the top of the Clos or Fat Tree network.
</t>
<t hangText="Northbound Link:">
A link to a node one level up or in other words, one
level further north.
</t>
<t hangText="Southbound Link:">
A link to a node one level down or in other words, one
level further south.
</t>
<t hangText="East-West Link:">A link between
two nodes at the same level. East-West
links are normally not part of Clos or
"fat-tree" topologies.
</t>
<t hangText="Leaf shortcuts (L2L):"> East-West links at
leaf level
will need to be differentiated from East-West links at
other levels.
</t>
<t hangText="Routing on the host (RotH):">Modern data center
architecture variant where servers/leaves are multi-homed and
consequently participate in routing.
</t>
<t hangText="Northbound representation:">Subset of topology
information flooded
towards higher levels of the fabric.
</t>
<t hangText="Southbound representation:">Subset of topology
information sent
towards a lower level.
</t>
<t hangText="South Reflection:">Often abbreviated just as
"reflection" it defines a mechanism where South Node TIEs
are "reflected" from the level south back up north to allow
nodes in the same level
without E-W links to "see" each other's node TIEs.</t>
<t hangText="TIE:">This is an acronym for a "Topology
Information Element". TIEs are exchanged between RIFT nodes to
describe parts of a network such as links and address prefixes,
in a fashion similar to IS-IS LSPs or OSPF LSAs.
A TIE always has a direction and a type. We will talk about
North TIEs (sometimes abbreviated as N-TIEs) when talking about
TIEs in the
northbound representation
and South TIEs (sometimes abbreviated as S-TIEs)
for the southbound equivalent. TIEs have different types such
as node and prefix TIEs.
</t>
<t hangText="Node TIE:">This is an acronym for a
"Node Topology Information Element" that contains all
adjacencies
the node discovered and
information about the node itself. A Node TIE should NOT be confused with
an N-TIE since "node" defines the type of TIE rather than
its direction.
</t>
<t hangText="Prefix TIE:">This is an acronym for a "Prefix Topology
Information Element" and it contains all prefixes
directly attached to
this node in case of a North TIE and in case of South TIE the necessary
default routes the node advertises
southbound.
</t>
<t hangText="Key Value TIE:">A South TIE that is carrying a set of
key value pairs <xref target="DYNAMO"/>.
It can be used to distribute information in the southbound
direction within
the protocol.
</t>
<t hangText="TIDE:">Topology Information Description Element,
equivalent to CSNP in IS-IS.</t>
<t hangText="TIRE:">Topology
Information Request Element, equivalent to PSNP in IS-IS. It
can both
confirm received TIEs and request missing TIEs.</t>
<t hangText="De-aggregation/Disaggregation:">
Process in which a node decides to
advertise more specific prefixes Southwards, either positively to
attract the corresponding traffic, or negatively to repel it.
Disaggregation is performed to prevent black-holing and suboptimal
routing to the more specific prefixes.</t>
<t hangText="LIE:">This is an acronym for a
"Link Information Element",
largely equivalent to HELLOs in IGPs and exchanged over
all the links between systems running RIFT to form three way adjacencies.
</t>
<t hangText="Flood Repeater (FR):">A node can designate one or more northbound
neighbor nodes to be flood repeaters. The flood repeaters are responsible for
flooding northbound TIEs further north. They are
similar to MPR in OLSR. The document
sometimes calls them flood leaders as well.
</t>
<t hangText="Bandwidth Adjusted Distance (BAD):">
Each RIFT node
can calculate the amount of northbound bandwidth
available towards a node
compared to other nodes at the same level and
can modify the
route distance accordingly to allow the lower level to
adjust its load balancing towards spines.</t>
<t hangText="Overloaded:">Applies to a node advertising the
`overload` attribute as set. The semantics closely
follow the meaning of the same attribute
in <xref target="ISO10589-Second-Edition"/>.</t>
<t hangText="Interface:">A layer 3 entity over which RIFT
control packets are exchanged.
</t>
<t hangText="Three-Way Adjacency:">RIFT tries to form a unique adjacency
over an interface and exchange local configuration and
necessary ZTP information. An adjacency is only advertised in node TIEs and
used for computations after
it has achieved three-way state, i.e. both routers have reflected each other in
LIEs including relevant security information. LIEs before three-way state is
reached may carry ZTP related information already.</t>
<t hangText="Bi-directional Adjacency:">
A bidirectional adjacency is an adjacency where the nodes on both sides of the
adjacency have advertised it in their node TIEs with the correct levels and system
IDs. Bidirectionality is checked in different algorithms to decide whether the link should be
included.
</t>
<t hangText="Neighbor:">Once a three-way adjacency has been
formed a neighborship relationship contains the neighbor's
properties. Multiple adjacencies can be formed to a remote node
via parallel interfaces but such adjacencies do NOT share
a neighbor structure. Saying "neighbor" is thus equivalent to
saying "a three-way adjacency".</t>
<t hangText="Cost:">The term signifies the weighted distance between
two neighbors.</t>
<t hangText="Distance:">Sum of costs (bounded by infinite distance)
between two nodes.</t>
<t hangText="Shortest-Path First (SPF):">A well-known graph algorithm
attributed to Dijkstra that establishes a tree of shortest paths
from a source to destinations on the graph. We use the SPF acronym, due to its
familiarity, as a general term for the node reachability calculations
RIFT can employ to ultimately calculate routes, of which Dijkstra's algorithm is one.
</t>
<t hangText="North SPF (N-SPF):">A reachability calculation that progresses northbound,
for example an SPF that uses South Node TIEs only. Normally it progresses a single hop
only and installs default routes.
</t>
<t hangText="South SPF (S-SPF):">A reachability calculation that progresses southbound,
for example an SPF that uses North Node TIEs only.
</t>
<t hangText="Security Envelope:">RIFT packets are flooded within an authenticated
security envelope that
protects the integrity of the information a node accepts.
</t>
</list>
</t>
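<t>
As a non-normative illustration of the "Northbound Link", "Southbound
Link", "East-West Link" and "Distance" definitions above, the following
Python sketch classifies a link from the levels of its two endpoints and
sums link costs with a cap. The function names and the value chosen for
INFINITE_DISTANCE are assumptions made for this sketch only; the
terminology above does not fix a numeric value for infinite distance.
</t>
<t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
# Assumed placeholder value; the glossary only says distance is bounded.
INFINITE_DISTANCE = 0xFFFFFFFF


def link_direction(local_level, remote_level):
    """Classify a link as seen from the local node, per the glossary.

    Returns None for level differences the three definitions do not
    cover (for example multiple-level shortcuts).
    """
    if remote_level == local_level + 1:
        return "north"
    if remote_level == local_level - 1:
        return "south"
    if remote_level == local_level:
        return "east-west"
    return None


def distance(costs):
    """Sum of per-link costs between two nodes, capped at infinity."""
    return min(sum(costs), INFINITE_DISTANCE)


if __name__ == "__main__":
    print(link_direction(0, 1))   # leaf looking at a level-1 node
    print(distance([1, 2, 3]))
```
]]>
</artwork>
</figure>
</t>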
</section>
<section title="Topology">
<t>
<figure align="center" anchor="pic-topo-three"
title="A three level spine-and-leaf topology">
<artwork align="center"><![CDATA[
. +--------+ +--------+ ^ N
. |ToF 21| |ToF 22| |
.Level 2 ++-+--+-++ ++-+--+-++ <-*-> E/W
. | | | | | | | | |
. P111/2| |P121 | | | | S v
. ^ ^ ^ ^ | | | |
. | | | | | | | |
. +--------------+ | +-----------+ | | | +---------------+
. | | | | | | | |
. South +-----------------------------+ | | ^
. | | | | | | | All TIEs
. 0/0 0/0 0/0 +-----------------------------+ |
. v v v | | | | |
. | | +-+ +<-0/0----------+ | |
. | | | | | | | |
.+-+----++ optional +-+----++ ++----+-+ ++-----++
.| | E/W link | | | | | |
.|Spin111+----------+Spin112| |Spin121| |Spin122|
.+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
. | | | South | | | |
. | +---0/0--->-----+ 0/0 | +----------------+ |
. 0/0 | | | | | | |
. | +---<-0/0-----+ | v | +--------------+ | |
. v | | | | | | |
.+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
.| | (L2L) | | | | Level 0 | |
.|Leaf111~~~~~~~~~~~~Leaf112| |Leaf121| |Leaf122|
.+-+-----+ +-+---+-+ +--+--+-+ +-+-----+
. + + \ / + +
. Prefix111 Prefix112 \ / Prefix121 Prefix122
. multi-homed
. Prefix
.+---------- Pod 1 ---------+ +---------- Pod 2 ---------+
]]>
</artwork>
</figure>
</t>
<t>
<figure align="center" anchor="partitioned-spine"
title="Topology with multiple planes">
<artwork align="center"><![CDATA[
.+--------+ +--------+ +--------+ +--------+
.|ToF A1| |ToF B1| |ToF B2| |ToF A2|
.++-+-----+ ++-+-----+ ++-+-----+ ++-+-----+
. | | | | | | | |
. | | | | | +---------------+
. | | | | | | | |
. | | | +-------------------------+ |
. | | | | | | | |
. | +-----------------------+ | | | |
. | | | | | | | |
. | | +---------+ | +---------+ | |
. | | | | | | | |
. | +---------------------------------+ | |
. | | | | | | | |
.++-+-----+ ++-+-----+ +--+-+---+ +----+-+-+
.|Spine111| |Spine112| |Spine121| |Spine122|
.+-+---+--+ ++----+--+ +-+---+--+ ++---+---+
. | | | | | | | |
. | +--------+ | | +--------+ |
. | | | | | | | |
. | -------+ | | | +------+ | |
. | | | | | | | |
.+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
.|Leaf111| |Leaf112| |Leaf121| |Leaf122|
.+-------+ +-------+ +-------+ +-------+
]]>
</artwork>
</figure>
</t>
<t>
              This document uses the topology in <xref target="pic-topo-three"/> (commonly called a fat
              tree/network in modern IP fabric considerations
              <xref target="VAHDAT08"/>
              as a homonym of the
              <xref target="FATTREE">original definition of the term</xref>)
              in all further considerations.
              The figure depicts
              a generic "single plane fat-tree"; the concepts explained using
              three levels
              apply by induction to further levels and higher degrees
              of connectivity. This document also deals
              with designs that
              provide only sparser connectivity and "partitioned spines",
              as shown in <xref target="partitioned-spine"/>
              and explained further in <xref target="Planes"/>.
</t>
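            <t>
              As an informal illustration of the difference between the two
              figures, the sketch below models the links between the top level
              and the spines as a set of pairs and checks whether they form a
              single plane. It is a minimal sketch, not part of the protocol:
              the helper names (full_mesh, is_single_plane, planes) are
              hypothetical, and the assignment of spines to planes A and B is
              an assumption made for the example rather than a transcription
              of the drawing.
            </t>
            <t>
              <figure align="center">
                <artwork align="left"><![CDATA[
```python
from itertools import product


def full_mesh(tofs, spines):
    """Return the ToF-to-spine links of one fully meshed plane."""
    return set(product(tofs, spines))


def is_single_plane(tofs, spines, links):
    """True if every ToF has a link to every top-level spine."""
    return all((t, s) in links for t in tofs for s in spines)


def planes(tofs, links):
    """Group ToF nodes by the set of spines they attach to."""
    groups = {}
    for tof in tofs:
        reach = frozenset(s for (t, s) in links if t == tof)
        groups.setdefault(reach, []).append(tof)
    return list(groups.values())


SPINES = ["Spine111", "Spine112", "Spine121", "Spine122"]

# Single plane: both ToF nodes attach to all four spines.
SINGLE_TOFS = ["ToF 21", "ToF 22"]
SINGLE = full_mesh(SINGLE_TOFS, SPINES)

# Assumed partitioning (illustration only): plane A serves the
# first spine of each PoD, plane B the second one.
MULTI_TOFS = ["ToF A1", "ToF A2", "ToF B1", "ToF B2"]
MULTI = (
    full_mesh(["ToF A1", "ToF A2"], ["Spine111", "Spine121"])
    | full_mesh(["ToF B1", "ToF B2"], ["Spine112", "Spine122"])
)

print(is_single_plane(SINGLE_TOFS, SPINES, SINGLE))
print(is_single_plane(MULTI_TOFS, SPINES, MULTI))
print(planes(MULTI_TOFS, MULTI))
```
]]>
                </artwork>
              </figure>
            </t>
            <t>
              Grouping ToF nodes by the set of spines they reach is one simple
              way to recover the planes from a link list; with the assumed
              links above it yields one group per plane.
            </t>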
</section>
</section>
<!-- based on Hardwick IAB review
<section anchor="reqs" title="Requirement Considerations">
<t>
<xref
                target="RFC7938"></xref> gives the original set of requirements,
augmented here based upon recent experience in the
operation of fat-tree
networks.
</t>
<t>
<list style='format REQ%d: ' >
<t>The control protocol should discover the physical
links automatically
and be able to detect cabling that
violates fat-tree topology constraints.
It must react accordingly to such mis-cabling attempts,
at a minimum
preventing adjacencies between nodes from being
formed and traffic
                  from being forwarded on those mis-cabled links. For example,
                  connecting a leaf to a spine at level 2 should be
                  detected and ideally prevented.
</t>
                <t>A node without any configuration besides default values
should come up at the correct level
in any PoD it is introduced into. Optionally,
it must be possible to
configure nodes to restrict their participation to
the PoD(s) targeted at any level.
</t>
                <t>Optionally, the protocol should allow provisioning of IP
                  fabrics where the
                  individual
                  switches carry no configuration information and
                  all derive their
                  level from a "seed". Observe that this requirement
                  may collide with the desire
                  to detect cabling misconfiguration; consequently,
                  only one of the two requirements
                  can be fully met in a chosen configuration mode.
</t>
<t>
The solution should allow for minimum size routing
information base and forwarding
tables at leaf level for speed, cost and simplicity
                  reasons. Keeping excessive information away
                  from leaf nodes simplifies operation, lowers the cost of
                  the underlay, and allows scaling and introducing proper multi-homing
                  down to the server level. The routing solution should allow for
easy instantiation of multiple routing planes. Coupled
with mobility defined in <xref target="mobreq"/>
this should allow for "light-weight" overlays
on an IP fabric with e.g. native IPv6 mobility support.
</t>
                <t>A very high degree of ECMP must be
supported. Maximum ECMP is currently understood as the most
efficient
routing approach to maximize the throughput of switching
fabrics <xref target="MAKSIC2013"/>.
</t>
                <t>Non-equal-cost anycast must be supported to allow for
easy and robust multi-homing of services without regressing to
careful balancing of link costs.
</t>
<t>Traffic engineering should be allowed by modification of
prefixes and/or their next-hops.
</t>
<t>The solution should allow for access to link states of
the whole topology
to enable efficient support for modern control
architectures like <xref
target="RFC7855">SPRING</xref> or
<xref target="RFC4655">PCE</xref>.
</t>
<t>The solution should easily accommodate opaque data to
be carried throughout the topology to subsets of nodes.
This can be used
for many purposes, one of them being a key-value
store that allows
                  bootstrapping of nodes right at the time of
                  topology discovery. Another use is distributing
                  MAC to L3 address bindings from the leaves northbound,
                  e.g. in the case of DHCP.
</t>
<t>Nodes should be taken out and introduced into production
                  with minimum
                  wait times and a minimum of "shaking" of the network, i.e.
                  the radius of propagation (often called "blast radius")
                  of changed information should be as small as feasible.
</t>
<t>The protocol should allow for maximum aggregation of carried
routing information while at the same time automatically
de-aggregating
the prefixes to prevent black-holing in case of failures.
The de-aggregation
should support maximum possible ECMP/N-ECMP remaining
after failure.
</t>
<t>Reducing the scope of communication needed throughout
the network on link and state
failure, as well as reducing advertisements of
repeating or idiomatic information in
stable state is highly desirable since it leads to
better stability and faster convergence behavior.
</t>
                <t>Under normal, fully converged conditions,
once a packet is forwarded along a link in a "southbound" direction,
it must not take any further "northbound"
links (Valley Free Routing).
Taking a path
through the spine in cases where a shorter
path is available is highly undesirable
(Bow Tying).
</t>
                <t>Parallel links between the same set of
nodes must be distinguishable for SPF, failure and
traffic engineering
purposes. </t>
<t> The protocol must support interfaces sharing the same address.
                Specifically, it must operate in the presence of
                unnumbered links (even parallel ones) and/or links of a single node
                being configured with the same addresses.</t>
                <t>It would be desirable to achieve fast re-balancing of flows when links,
                especially towards the spines, are lost or provisioned, without regressing to
                per-flow traffic engineering, which introduces a significant amount of complexity
                while possibly not being reactive enough to account for short-lived flows.
</t>
<t anchor="mobreq">The control plane should be able to unambiguously determine the current
point of attachment (which port on which leaf node) of a prefix, even in a
context of fast mobility, e.g., when the prefix is a host address on a
wireless node that 1) may associate to any of multiple access points (APs)
                    that are attached to different ports on the same leaf node or to different
leaf nodes, and 2) may move and reassociate several times to a different access point
within a sub-second period.
</t>
<t>The protocol must provide
security mechanisms that allow the operator to restrict nodes,
especially leaf nodes without proper credentials, from forming a three-way adjacency and
participating in routing.
</t>
</list>
</t>
<t>
              The following list represents non-requirements:
</t>