<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="no"?>
<?rfc subcompact="no"?>
<?rfc authorship="yes"?>
<?rfc tocappendix="yes"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="info" ipr='trust200902' tocInclude="true" obsoletes="" updates="" consensus="true" submissionType="IETF" xml:lang="en" version="3" docName="draft-ietf-rift-applicability-06" >
<front>
<title abbrev='RIFT Applicability Statement'>RIFT Applicability</title>
<author fullname='Yuehua Wei' initials='Yuehua' surname='Wei' role='editor' >
<organization>ZTE Corporation</organization>
<address>
<postal>
<street>No.50, Software Avenue</street>
<city>Nanjing</city>
<region/>
<code>210012</code>
<country>China</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Zheng Zhang' initials='Zheng' surname='Zhang'>
<organization>ZTE Corporation</organization>
<address>
<postal>
<street>No.50, Software Avenue</street>
<city>Nanjing</city>
<region/>
<code>210012</code>
<country>China</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Dmitry Afanasiev' initials='Dmitry' surname='Afanasiev'>
<organization>Yandex</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country/>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Pascal Thubert' initials='P.' surname='Thubert'>
<organization abbrev='Cisco Systems'>Cisco Systems, Inc</organization>
<address>
<postal>
<country>FRANCE</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Tom Verhaeg' initials='Tom' surname='Verhaeg'>
<organization>Juniper Networks</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country/>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname='Jaroslaw Kowalczyk' initials='Jaroslaw' surname='Kowalczyk'>
<organization>Orange Polska</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country/>
</postal>
<email>[email protected]</email>
</address>
</author>
<date/>
<area>Routing</area>
<workgroup>RIFT WG</workgroup>
<keyword>RIFT</keyword>
<abstract>
<t>
This document discusses the properties, applicability, and operational considerations
of RIFT in different network scenarios. It is intended as a
rough guide to how RIFT can be deployed to simplify routing operations in
Clos topologies and their variations.
</t>
</abstract>
</front>
<!-- ***** MIDDLE MATTER ***** -->
<middle>
<section><name>Introduction</name>
<t>This document discusses the properties and applicability of
<xref target='I-D.ietf-rift-rift'>"Routing in Fat Trees"</xref> (RIFT) in
different deployment scenarios and highlights the operational simplicity of the
technology compared to traditional routing solutions.
It also documents special considerations when RIFT is used with or without overlays and/or controllers, and how RIFT identifies topology mis-cablings and reroutes around node and link failures.
</t>
</section>
<section><name>Problem Statement of Routing in Modern IP Fabric Fat Tree Networks</name>
<t><xref target="CLOS">Clos</xref> and <xref target="FATTREE">fat tree</xref> topologies have gained prominence in today's networking, primarily as a result of the paradigm shift towards centralized, data-center-based architectures that deliver a majority of computation and storage services.
</t>
<t>Current routing protocols were designed for networks with
irregular topologies, isotropic properties, and a low degree of connectivity.
When applied to Fat Tree topologies:
</t>
<ul>
<li>They tend to need extensive configuration or provisioning during bring up
and re-dimensioning.</li>
<li>All nodes, including spine and leaf nodes, learn the entire network topology and
routing information, which is, in fact, not needed on the leaf nodes during normal
operation.</li>
<!--
<li>Significant link-state PDUs (LSPs) flooding duplication between
spine nodes and leaf nodes occurs during network bring up and topology updates.
</li>
<li>This
consumes both CPU and link bandwidth resources which prevents the
use of cheaper hardware at the lower levels (leaf and spine) and reduces the scalability and reactivity.of the network.</li>
-->
<li>They flood significant amounts of duplicate link-state information between spine
and leaf nodes during topology updates and convergence events, which consumes
additional CPU and link bandwidth.
This may impact the stability and scalability of the fabric, make the fabric less
reactive to failures, and prevent the use of cheaper hardware at the lower levels
(i.e., spine and leaf nodes).
</li>
</ul>
</section>
<section><name>Applicability of RIFT to Clos IP Fabrics</name>
<t>
The remainder of this document assumes that the reader is
familiar with the
terms and concepts used in the <xref target='RFC2328'>OSPF</xref>
and <xref target='ISO10589-Second-Edition'>IS-IS</xref> link-state protocols. The sections of <xref target='I-D.ietf-rift-rift'>RIFT</xref> outline the requirements of routing in IP fabrics and RIFT protocol concepts.
</t>
<section><name>Overview of RIFT</name>
<t>
RIFT is a dynamic routing protocol that is tailored for use in Clos, Fat-Tree, and other anisotropic topologies.
A core property of RIFT is that its operation is
sensitive to the structure of the fabric - it is anisotropic. RIFT
acts as a link-state protocol when "pointing north" - advertising southwards
routes to northwards peer routers (parents) through flooding and database synchronization - but
operates hop-by-hop like a distance-vector protocol
when "pointing south" - typically advertising a fabric default route directed
towards the Top of Fabric (ToF, aka superspine) to southwards peer routers (children).
</t>
<t>
The fabric default is typically the default route, as described in
Section 3.2.3.8 "Southbound Default Route Origination" of
<xref target='I-D.ietf-rift-rift'>RIFT</xref>.
The ToF nodes may alternatively originate more specific prefixes (P') southbound
instead of the default route. In such a scenario, all addresses carried within
the RIFT domain MUST be contained within P', and it is possible for a leaf that
acts as a gateway to the Internet to advertise the default route instead.
</t>
<t>RIFT floods flat link-state information northbound only, so that each level
obtains the full topology of the levels south of it. That information is never flooded
east-west or back south again. As a result, a top-tier node has the full set of prefixes from
the Shortest Path First (SPF) calculation.
</t>
<t>In the southbound direction, the protocol operates like a "fully summarizing,
unidirectional" path-vector protocol or, rather, a distance-vector protocol with implicit split horizon. Routing information, normally just the default route, propagates one hop south and is "re-advertised" by nodes at the next lower level.
</t>
<figure align='center' anchor='pic-rift'><name>RIFT overview</name>
<artwork align='center'><![CDATA[
+-----------+ +-----------+
| ToF | | ToF | LEVEL 2
+ +-----+--+--+ +-+--+------+
| | | | | | | | | ^
+ | | | +-------------------------+ |
Distance | +-------------------+ | | | | |
Vector | | | | | | | | +
South | | | | +--------+ | | | Link-State
+ | | | | | | | | Flooding
| | | +-------------+ | | | North
v | | | | | | | | +
+-+--+-+ +------+ +-------+ +--+--+-+ |
|SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1
+ ++----++ ++---+-+ +--+--+-+ ++----+-+ |
+ | | | | | | | | | ^ N
Distance | +-------+ | | +--------+ | | | E
Vector | | | | | | | | | +------>
South | +-------+ | | | +-------+ | | | |
+ | | | | | | | | | +
v ++--++ +-+-++ ++-+-+ +-+--++ +
|LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0
+----+ +----+ +----+ +-----+
]]></artwork>
</figure>
<t>A spine node has only the information necessary for its level, which is all
destinations south of the node based on the SPF calculation, the default route, and
potential disaggregated routes.
</t>
<t>RIFT combines the advantages of both link-state and distance-vector protocols:
</t>
<ul>
<li>Fastest possible convergence</li>
<li>Automatic detection of topology</li>
<li>Minimal routes/info on Top-of-Rack (ToR) switches, aka leaf nodes</li>
<li>High degree of ECMP</li>
<li>Fast de-commissioning of nodes</li>
<li>Maximum propagation speed with flexible prefixes in an update</li>
</ul>
<t>There are therefore two types of link-state database: the "north representation", made up of North Topology Information Elements (N-TIEs), and the "south representation", made up of South Topology Information Elements (S-TIEs). The N-TIEs contain a link-state topology
description of the lower levels, and the S-TIEs simply carry default routes for the lower
levels.
</t>
<t>RIFT also eliminates major disadvantages of link-state and distance-vector protocols with:
</t>
<ul>
<li>Reduced and balanced flooding</li>
<li>Automatic neighbor detection</li>
</ul>
<t>To achieve this, RIFT builds on the art of IGPs, not only OSPF and IS-IS but also MANET and IoT, to provide unique features:
</t>
<ul>
<li>Automatic (positive or negative) route disaggregation of northwards routes upon fallen leaves</li>
<li>Recursive operation in the case of negative route disaggregation </li>
<li>Anisotropic routing that extends a principle seen in <xref target='RFC6550'>RPL</xref> to wide superspines</li>
<li>Optimal Flooding Reduction that derives from the concept of a "multipoint relay" (MPR) found in <xref target='RFC3626'>OLSR</xref> and
balances the flooding load over northbound links and nodes.</li>
</ul>
<t>Additional advantages that are unique to RIFT are listed below, the details of which can be found in <xref target='I-D.ietf-rift-rift'>RIFT</xref>.
</t>
<ul>
<li>True Zero Touch Provisioning (ZTP)</li>
<li>Minimal blast radius on failures</li>
<li>Can utilize all paths through the fabric without looping</li>
<li>Simple leaf implementation that can scale down to servers</li>
<li>Key-Value store</li>
<li>Horizontal links used for protection only</li>
<li>Supports non-equal cost multipath (NECMP) and can replace multi-chassis link aggregation group (MLAG or MC-LAG)</li>
</ul>
</section>
<section><name>Applicable Topologies</name>
<t>
Although RIFT is specified primarily for "proper" Clos or Fat Tree topologies,
the protocol natively supports Points of Delivery (PoD) concepts, which, strictly speaking, are not found in the original Clos concept.
</t>
<t>Further, the specification explains and supports the operation of multi-plane
Clos variants, where the protocol recommends the use of inter-plane rings at the
Top-of-Fabric level. These rings allow the topology views of the different planes to be reconciled,
which makes negative disaggregation viable in case of failures within a plane.
These observations hold not only for RIFT but also in the generic
case of dynamic routing on Clos variants with multiple planes and failures
in bi-sectional bandwidth, especially on the leaves.
</t>
<section><name>Horizontal Links</name>
<t>
RIFT is not limited to pure Clos divided into PoD and multi-planes but
supports horizontal (East-West) links below the top of fabric level. Those links
are used only for last resort northbound routes when a spine loses all its
northbound links or cannot compute a default route through them.
</t>
<t>A possible configuration is a "ring" of horizontal links
at a level. In the presence of such a "ring" at any level (except the Top of Fabric (ToF) level),
neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a "ring-based protection"
scheme, since such a computation would necessarily have to deal
with the breaking of "loops" in the Dijkstra sense,
an application for which RIFT is not intended.
</t>
<t>Full-mesh connectivity between nodes on the same level can be employed.
This allows N-SPF to let any node that has lost
all its northbound adjacencies still participate in northbound
forwarding, as long as any of the other
nodes in the level is northbound connected.
</t>
</section>
<section><name>Vertical Shortcuts</name>
<t>
Through relaxations of the specified adjacency forming rules, RIFT implementations can be extended to support vertical "shortcuts" as
proposed by, e.g., <xref target='I-D.white-distoptflood'/>. The RIFT specification
itself does not provide the exact details, since the resulting solution suffers from
either a much larger blast radius with increased flooding volumes or,
in the case of maximum aggregation routing, bow-tie problems.
</t>
</section>
<section><name>Generalizing to any Directed Acyclic Graph</name>
<t>
RIFT is an anisotropic routing protocol, meaning that it has a sense of direction (northbound, southbound, east-west) and that it operates differently depending on the direction.
</t>
<ul>
<li>
Northbound, RIFT operates as a link-state protocol, whereby the control packets are first reflooded all the way north and only interpreted later. All the individual fine-grained routes are advertised.
</li>
<li>
<t>
Southbound, RIFT operates as a distance-vector protocol, whereby the control packets are flooded only one-hop, interpreted, and the consequence of that computation is what gets flooded one more hop south. In the most common use-cases, a ToF node can reach most of the prefixes in the fabric. If that is the case, the ToF node advertises the fabric default and disaggregates the prefixes that it cannot reach. On the other hand, a ToF node that can reach only a small subset of the prefixes in the fabric will preferably advertise those prefixes and refrain from aggregating.
</t>
<t>
In the general case, what gets advertised south is, in more detail:
</t>
<ol>
<li>A fabric default that aggregates all the prefixes that are reachable within the fabric, and that could be a default route or a prefix that is dedicated to this particular fabric.
</li>
<li>The loopback addresses of the northbound nodes, e.g., for inband management.
</li>
<li>The disaggregated prefixes for the dynamic exceptions to the fabric default, advertised to route around the black hole that may form.
</li>
</ol>
</li>
<li>East-West routing can optionally be used, with specific restrictions. It is used when a sibling has access to the fabric default but this node does not.
</li>
</ul>
<t>
A Directed Acyclic Graph (DAG) provides a sense of north (the direction of the DAG) and of south (the reverse), which can be used to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has only incoming edges is a ToF node.
</t><t>
There are a number of caveats though:
</t>
<ul>
<li>The DAG structure must exist before RIFT starts, so there is a need for a companion protocol to establish the logical DAG structure.
</li>
<li>A generic DAG does not have a sense of east and west. The operation specified for east-west links and the southbound reflection between nodes are not applicable.
Also ZTP will derive a sense of depth that will eliminate some links. Variations of ZTP could be derived to meet specific objectives, e.g., make it so that most routers have at least 2 parents to reach the ToF.
</li>
<li>
RIFT applies to any Destination-Oriented DAG (DODAG) where there's only one ToF node and the problem of disaggregation does not exist. In that case, RIFT
operates very much like RPL <xref target='RFC6550'/>, but using Link State for southbound routes (downwards in RPL's terms).
For an arbitrary DAG with multiple destinations (ToFs) the way disaggregation happens has to be considered.
</li>
<li>Positive disaggregation expects that most of the ToF nodes reach most of the leaves, so disaggregation is the exception as opposed to the rule. When this is no longer true, it makes sense to turn off disaggregation and route between the ToF nodes over a ring, a full mesh, a transit network, or a form of area zero. There again, this operation is similar to RPL operating as a single DODAG with a virtual root.
</li>
<li>
In order to aggregate and disaggregate routes, RIFT requires that all the ToF nodes share the full knowledge of the prefixes in the fabric.
This can be achieved with a ring as suggested by the RIFT main specification, by some preconfiguration, or by synchronization with a common repository where all the active prefixes are registered.
</li>
</ul>
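<t>
As a rough illustration of the DAG view described above, the following Python sketch identifies ToF nodes as the vertices that have no northbound (outgoing) edge. The node names, the north_edges mapping, and the tof_nodes function are hypothetical and chosen for illustration only; they are not part of the RIFT specification.
</t>
<sourcecode type="python"><![CDATA[
def tof_nodes(north_edges):
    """Return the nodes that have no northbound (outgoing) edge.

    north_edges maps each node to the set of its northbound
    neighbors (its parents). This data model is an assumption
    made for illustration only.
    """
    nodes = set(north_edges)
    for parents in north_edges.values():
        nodes.update(parents)
    # A node without any northbound edge sits at the top of the DAG.
    return sorted(n for n in nodes if not north_edges.get(n))


# Hypothetical three-level fabric: leaves -> spines -> ToF nodes.
fabric = {
    "leaf1": {"spine1", "spine2"},
    "leaf2": {"spine1", "spine2"},
    "spine1": {"tof1", "tof2"},
    "spine2": {"tof1", "tof2"},
}

print(tof_nodes(fabric))
]]></sourcecode>
<t>
When the function returns a single node, the graph is a DODAG and the disaggregation problem discussed above does not arise.
</t>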
</section>
<section anchor="onastick"><name>Reachability of Internal Nodes in the Fabric</name>
<t>RIFT does not require that nodes have reachable addresses in the fabric,
though it is clearly desirable for operational purposes. Under normal operating
conditions this can be easily achieved by injecting the node's loopback
address into North and South Prefix TIEs or other implementation specific
mechanisms.
</t>
<t>
Special considerations arise when a node loses all northbound adjacencies,
but is not at the top of the fabric. These are outside the scope of this
document and could be discussed in a separate document.
</t>
</section>
</section>
<section><name>Use Cases</name>
<section><name>Data Center Topologies</name>
<section><name>Data Center Fabrics</name>
<t>
RIFT is suited for underlay routing in data center (DC) IP fabrics, the vast majority of which seem to be currently (and
for the foreseeable future)
Clos architectures. It significantly simplifies the operation and deployment
of such fabrics, as described in <xref target='opex'/>, compared to
extensive proprietary provisioning and operational solutions.
</t>
</section>
<section><name>Adaptations to Other Proposed Data Center Topologies</name>
<figure align='center' anchor='levelshortcuts'><name>Level Shortcut</name>
<artwork align='center'><![CDATA[
. +-----+ +-----+
. | | | |
.+-+ S0 | | S1 |
.| ++---++ ++---++
.| | | | |
.| | +------------+ |
.| | | +------------+ |
.| | | | |
.| ++-+--+ +--+-++
.| | | | |
.| | A0 | | A1 |
.| +-+--++ ++---++
.| | | | |
.| | +------------+ |
.| | +-----------+ | |
.| | | | |
.| +-+-+-+ +--+-++
.+-+ | | |
. | L0 | | L1 |
. +-----+ +-----+
]]>
</artwork>
</figure>
<t>
RIFT is not strictly limited to Clos topologies. The protocol only
requires a sense of "compass rose directionality" either achieved
through configuration or derivation of levels.
So, conceptually, shortcuts between levels could be included.
<xref target="levelshortcuts"/> depicts an example of a shortcut
between levels. In this example, sub-optimal routing will
occur when traffic is sent from L0 to L1 via S0's
default route and back down through A0 or A1.
In order to ensure that only default routes from A0 or A1
are used, all leaves would be required to install each other's routes.
</t>
<t>
While various technical and operational challenges may require the use of such modifications,
discussion of those topics is outside the scope of this document.
</t>
</section>
</section>
<section><name>Metro Fabrics</name>
<t>
The demand for bandwidth is increasing steadily, driven primarily by
environments close to
content producers (server farm connections via DC fabrics) but also by
those in proximity to content consumers.
Consumers are often clustered in metro areas with their own network
architectures that can benefit
from simplified, regular Clos structures and hence from RIFT.
</t>
</section>
<section><name>Building Cabling</name>
<t>
Commercial edifices are often cabled in topologies that are
either Clos or its isomorphic equivalents. The
Clos can grow rather high with many floors. That presents a challenge
for traditional routing protocols (except BGP and the by now largely
phased-out PNNI), which do not support
an arbitrary number of levels, whereas RIFT does so naturally. Moreover, due to the limited sizes of forwarding tables in the network elements used in building cabling, the minimal FIB size that RIFT maintains under normal conditions is cost-effective in terms of hardware and operational costs.
</t>
</section>
<section><name>Internal Router Switching Fabrics</name>
<t>
It is common in high-speed communications switching and routing
devices to use fabrics when a crossbar is not feasible due to cost,
head-of-line blocking,
or size trade-offs. Normally such fabrics are not self-healing or rely
on 1:/+1 protection schemes, but it is conceivable to use RIFT to
operate Clos fabrics that can deal effectively with interconnection
or subsystem failures in such modules. RIFT is also not IP specific, and
hence any link addressing connecting internal device subnets is
conceivable.
</t>
</section>
<section><name>CloudCO</name>
<t>
The Cloud Central Office (CloudCO) is a new stage of the telecom Central Office. It takes advantage of Software Defined Networking (SDN) and Network Function Virtualization (NFV) in conjunction with general purpose hardware to optimize current networks.
The following figure illustrates this architecture at a high level. It describes a single instance or macro-node of CloudCO that provides a number of Value Added Services (VAS), a Broadband Access Abstraction (BAA), and virtualized network services. An Access I/O module faces a CloudCO access node and the Customer Premises Equipment (CPE) behind it. A Network I/O module faces the core network. The two I/O modules are interconnected by a leaf and spine fabric <xref target='TR-384'/>.
</t>
<figure align='center' anchor='pic-CloudCO'><name>An example of CloudCO architecture</name>
<artwork align='center'><![CDATA[
+---------------------+ +----------------------+
| Spine | | Spine |
| Switch | | Switch |
+------+---+------+-+-+ +--+-+-+-+-----+-------+
| | | | | | | | | | | |
| | | | | +-------------------------------+ |
| | | | | | | | | | | |
| | | | +-------------------------+ | | |
| | | | | | | | | | | |
| | +----------------------+ | | | | | | | |
| | | | | | | | | | | |
| +---------------------------------+ | | | | | | |
| | | | | | | | | | | |
| | | +-----------------------------+ | | | | |
| | | | | | | | | | | |
| | | | | +--------------------+ | | | |
| | | | | | | | | | | |
+--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+
|L | | Leaf | | Leaf | | Leaf | | Leaf | |L |
|S | | Switch | | Switch | | Switch | | Switch| |S |
++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++
| | | | | | | | | | | | | |
| +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ |
| |Compute | |Compute | | Compute | |Compute| |
| |Node | |Node | | Node | |Node | |
| +--------+ +--------+ +----------+ +-------+ |
| || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || |
| |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| |
| || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || |
| |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| |
| || VAS7 || || VAS4 || || vIGMP || ||BAA || |
| |--------| |--------| |----------| |-------| |
| +--------+ +--------+ +----------+ +-------+ |
| |
++-----------+ +---------++
|Network I/O | |Access I/O|
+------------+ +----------+
]]>
</artwork>
</figure>
<t>
The Spine-Leaf architecture deployed inside CloudCO meets the network requirements of being adaptable, agile, scalable, and dynamic.
</t>
</section>
</section>
</section>
<section anchor='opex'><name>Operational Considerations</name>
<t>
RIFT presents the opportunity for organizations building and operating
IP fabrics to simplify their operation and deployments while achieving
many desirable
properties of dynamic routing on such a substrate:
</t>
<ul>
<li>
RIFT only floods routing information to the devices that absolutely need it. The RIFT design follows a philosophy of minimum blast radius and minimum necessary epistemological scope, which leads to good scaling properties while delivering maximum reactiveness.
</li>
<li>
RIFT allows for extensive Zero Touch Provisioning within the protocol.
In its most extreme version, RIFT does not rely on any specific addressing
and, for IP fabrics, can operate using <xref target='RFC4861'>IPv6 ND</xref> only.
</li>
<li>
RIFT has provisions to detect common IP fabric mis-cabling scenarios.
</li>
<li>
RIFT automatically negotiates Bidirectional Forwarding Detection (BFD) per link. This allows IP and <xref target='RFC7130'>micro-BFD</xref> to replace Link Aggregation Groups (LAGs), which hide bandwidth
imbalances in case of constituent failures. Further automatic link validation
techniques similar to <xref target='RFC5357'/> could be supported as well.
</li>
<li>
RIFT inherently solves many difficult problems associated with the use of
traditional routing topologies with dense meshes and high degrees of ECMP by
including automatic bandwidth balancing, flood reduction and automatic
disaggregation on failures while providing maximum aggregation of prefixes
in default scenarios.
</li>
<li>
RIFT reduces FIB size towards the bottom of the IP fabric, where most nodes
reside. This allows for cheaper hardware on the edges and for the introduction
of modern IP fabric architectures that encompass, e.g., server multi-homing.
</li>
<li>RIFT provides valley-free
routing and is therefore loop free. This allows the use of any such valley-free
path
in bi-sectional fabric bandwidth between two destinations irrespective of their
metrics, which can be used to balance load on the fabric in different ways.
</li>
<li>
RIFT includes a key-value distribution mechanism
which allows for many future applications
such as automatic provisioning of basic overlay services or automatic key
roll-overs over whole fabrics.
</li>
<li>
RIFT is designed for minimum delay in case of prefix mobility on the fabric. In
conjunction with <xref target='RFC8505'/>, RIFT can differentiate anycast advertisements from mobility events and retain only the most recent advertisement in the latter case.
</li>
<li>
Many further operational and design points collected over many years of
routing protocol deployments have been incorporated in RIFT such as
fast flooding rates, protection of information lifetimes and operationally
easily recognizable remote ends of links and node names.
</li>
</ul>
<section><name>South Reflection</name>
<t>South reflection is a mechanism by which South Node TIEs are "reflected"
back up north, allowing nodes in the same level without East-West links to "see"
each other.
</t>
<t>For example, in <xref target='pic-suboptimal'/>, Spine111, Spine112, Spine121, and Spine122 each reflect Node S-TIEs
from ToF21 to ToF22. Likewise, each of them reflects Node
S-TIEs from ToF22 to ToF21. As a result, ToF22 and ToF21 see each other's
node information as level 2 nodes.
</t>
<t>In an equivalent fashion, as the result of the south reflection between Spine121-Leaf121-Spine122
and Spine121-Leaf122-Spine122, Spine121 and Spine122 know of each other at
level 1.
</t>
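<t>
The effect of south reflection can be sketched in a few lines of Python. The sketch below is illustrative only: the reflect_south_node_ties function and its adjacency map are hypothetical, it assumes that every spine is attached to both ToF nodes, and it models only which same-level nodes become visible to each other, not the actual TIE exchange.
</t>
<sourcecode type="python"><![CDATA[
def reflect_south_node_ties(north_neighbors):
    """Compute which same-level nodes each upper node learns about.

    north_neighbors maps a lower-level node to its northbound
    neighbors. Each lower node "reflects" the South Node TIE it
    received from one northbound neighbor to all the others.
    """
    seen = {}
    for uppers in north_neighbors.values():
        for receiver in uppers:
            others = seen.setdefault(receiver, set())
            others.update(u for u in uppers if u != receiver)
    return seen


# Assumed spine-to-ToF adjacencies: each spine attaches to both ToFs.
adjacencies = {
    "Spine111": ["ToF21", "ToF22"],
    "Spine112": ["ToF21", "ToF22"],
    "Spine121": ["ToF21", "ToF22"],
    "Spine122": ["ToF21", "ToF22"],
}

print(reflect_south_node_ties(adjacencies))
]]></sourcecode>
<t>
The same function applied to leaf-to-spine adjacencies, such as Leaf121 and Leaf122 both attaching to Spine121 and Spine122, shows the two spines learning of each other at level 1.
</t>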
</section>
<section><name>Suboptimal Routing on Link Failures</name>
<figure align='center' anchor='pic-suboptimal'><name>Suboptimal routing upon link failure use case</name>
<artwork align='center'><![CDATA[
+--------+ +--------+
| ToF21 | | ToF22 | LEVEL 2
++--+-+-++ ++-+--+-++
| | | | | | | +
| | | | | | | linkTS8
+-------------+ | +-+linkTS3+-+ | | | +-------------+
| | | | | | + |
| +----------------------------+ | linkTS7 |
| | | | + + + |
| | | +-------+linkTS4+------------+ |
| | | + + | | |
| | | +------------+--+ | |
| | | | | linkTS6 | |
+-+----+-+ +-----+--+ ++--------+ +-+----+-+
|Spine111| |Spine112| |Spine121 | |Spine122| LEVEL 1
+-+---+--+ +----+---+ +-+---+---+ +-+---+--+
| | | | | | | |
| +--------------+ | + ++XX+linkSL6+---+ +
| | | | linkSL5 | | linkSL8
| +------------+ | | + +---+linkSL7+-+ | +
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-+-----+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
]]></artwork>
</figure>
<t>As shown in <xref target='pic-suboptimal'/>, as the result of the south reflection between
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
Spine122 know of each other at level 1.</t>
<t>Without a disaggregation mechanism, when linkSL6 fails, a packet from
Leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 and then go
down through linkTS4 and linkSL8 to Leaf122, or go up through linkSL5 to linkTS6
and then go down through linkTS8 and linkSL8 to Leaf122, based on the pure default route.
This is a case of suboptimal routing, or bow-tieing.</t>
<t>With a disaggregation mechanism, when linkSL6 fails, Spine122 detects the
failure from the reflected Node S-TIE of Spine121. Based on the
disaggregation algorithm provided by RIFT, Spine122 explicitly advertises
Prefix122 in a Disaggregated Prefix S-TIE, PrefixesElement(prefix122, cost 1). A packet
from Leaf121 to Prefix122 is then sent only over linkSL7, following the longest-prefix
match on Prefix122, and goes directly down through linkSL8 to Leaf122.
</t>
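<t>The effect of the disaggregated route on Leaf121 can be sketched with a toy
longest-prefix-match lookup. This is an illustration only: the address
192.0.2.0/24 standing in for Prefix122, the dictionary-based FIB, and the
function name lookup are assumptions made for this example and are not defined by RIFT.</t>
<sourcecode type='python'><![CDATA[
```python
import ipaddress


def lookup(fib, destination):
    """Return the next hops of the longest prefix in fib covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in fib if addr in prefix]
    if not matches:
        return []
    best = max(matches, key=lambda prefix: prefix.prefixlen)
    return fib[best]


DEFAULT = ipaddress.ip_network("0.0.0.0/0")
# Hypothetical value standing in for Prefix122 (documentation range).
PREFIX122 = ipaddress.ip_network("192.0.2.0/24")

# Leaf121 with the pure default route: both uplinks are candidates.
fib_before = {DEFAULT: ["linkSL5", "linkSL7"]}

# Leaf121 after Spine122 advertises the more specific Prefix122.
fib_after = dict(fib_before)
fib_after[PREFIX122] = ["linkSL7"]

print(lookup(fib_before, "192.0.2.10"))   # prints ['linkSL5', 'linkSL7']
print(lookup(fib_after, "192.0.2.10"))    # prints ['linkSL7']
print(lookup(fib_after, "198.51.100.1"))  # prints ['linkSL5', 'linkSL7']
```
]]></sourcecode>
<t>With only the default route, the lookup returns both uplinks. Once the more
specific route is installed, traffic to Prefix122 resolves to linkSL7 only,
while all other destinations keep using the default.</t>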
</section>
<section><name>Black-Holing on Link Failures</name>
<figure align='center' anchor='pic-blackhole'><name>Black-holing upon link failure use case</name>
<artwork align='center'><![CDATA[
+--------+ +--------+
| ToF 21 | | ToF 22 | LEVEL 2
++-+--+-++ ++-+--+-++
| | | | | | | +
| | | | | | | linkTS8
+--------------+ | +-+linkTS3+X+ | | | +--------------+
linkTS1 | | | | | + |
+ +-----------------------------+ | linkTS7 |
| | + | + + + |
| | linkTS2 +-------+linkTS4+X+----------+ |
| + + + + | | |
| linkTS5 +-+ +------------+--+ | |
| + | | | linkTS6 | |
+-+----+-+ +-+----+-+ ++-------+ +-+-----++
|Spine111| |Spine112| |Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ +-+---+--+ +-+---+--+
| | | | | | | |
+ +---------------+ | + +---+linkSL6+---+ +
linkSL1 | | | linkSL5 | | linkSL8
+ +--+linkSL3+--+ | | + +---+linkSL7+-+ | +
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-+-----+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
]]></artwork>
</figure>
<t>This scenario illustrates how a double link failure can lead to
black-holing.</t>
<t>Without a disaggregation mechanism, when linkTS3 and linkTS4 both fail,
packets from Leaf111 to Prefix122 suffer 50% black-holing based
on the pure default route. A packet that goes up through linkSL1 and
linkTS1 would have to go down through linkTS3 or linkTS4, so it is dropped. A
packet that goes up through linkSL3 and linkTS2 would also have to go down through
linkTS3 or linkTS4 and is dropped as well. This is a case of black-holing.</t>
<t>With a disaggregation mechanism, when linkTS3 and linkTS4 both fail, ToF22
detects the failure from the Node S-TIE of ToF21 reflected by
Spine111 and Spine112. Based on the disaggregation algorithm
provided by RIFT, ToF22 explicitly originates an S-TIE with Prefix121 and
Prefix122, which is flooded to Spine111, Spine112, Spine121, and Spine122.</t>
<t>A packet from Leaf111 to Prefix122 is then no longer routed to linkTS1 or
linkTS2. It is routed only to linkTS5
or linkTS7, following the longest-prefix match on Prefix122.</t>
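<t>The decision taken by ToF22 can be approximated as follows. This is a
simplified sketch and not the disaggregation algorithm as specified by RIFT:
it assumes that ToF22 already knows, for each prefix, the set of children it
uses as next hops, and that the reflected Node S-TIE of ToF21 has been
reduced to a set of child names. The function and variable names are hypothetical.</t>
<sourcecode type='python'><![CDATA[
```python
def prefixes_to_disaggregate(next_hops_by_prefix, peer_children):
    """Return the prefixes that a same-level peer can no longer reach.

    next_hops_by_prefix: prefix -> set of children this node uses to reach it.
    peer_children: children listed in the peer's reflected Node S-TIE.
    """
    return sorted(
        prefix
        for prefix, next_hops in next_hops_by_prefix.items()
        if next_hops.isdisjoint(peer_children)
    )


# View of ToF22 (hypothetical data mirroring the figure).
tof22_routes = {
    "Prefix111": {"Spine111", "Spine112"},
    "Prefix112": {"Spine111", "Spine112"},
    "Prefix121": {"Spine121", "Spine122"},
    "Prefix122": {"Spine121", "Spine122"},
}

# After linkTS3 and linkTS4 fail, ToF21 only lists these south adjacencies.
tof21_children = {"Spine111", "Spine112"}

print(prefixes_to_disaggregate(tof22_routes, tof21_children))
# prints ['Prefix121', 'Prefix122']
```
]]></sourcecode>
<t>A prefix is selected only when the peer has lost every child through which
it could be reached, which matches the double failure shown in the figure: a
single failure of linkTS3 or linkTS4 would leave ToF21 with one usable path and
select nothing.</t>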
</section>
<section><name>Zero Touch Provisioning (ZTP)</name>
<t>
RIFT is designed to require minimal configuration in order to simplify its operation and avoid human errors. Based on that minimal information, Zero Touch Provisioning (ZTP) autoconfigures the key operational parameters of all RIFT nodes: the SystemID of the node, which must be unique in the RIFT network, and the level of the node in the Fat Tree, which determines which peers are northward "parents" and which are southward "children".
</t>
<t>
ZTP is always on, but its decisions can be overridden when a network administrator prefers to impose their own configuration. In that case, it is the responsibility of the administrator to ensure that the configured parameters are correct,
in other words that the SystemID of each node is unique and that the administratively set levels truly reflect the relative position of the nodes in the fabric. It is
recommended to let ZTP configure the network. When that is not done, it is recommended to
configure the level of all nodes, except those that are forced as leaves, to avoid an undesirable interaction between ZTP and the manual configuration.
</t>
<t>ZTP requires that the administrator point out the Top-of-Fabric (ToF) nodes to set the baseline from which the fabric topology is derived. The ToF nodes are configured with the TOP_OF_FABRIC flag and act as the initial 'seeds' that other ZTP nodes need to derive their level in the topology.
ZTP computes the level of each node based on the Highest Available Level (HAL)
of the potential parent(s) nearest that baseline, which represents the superspine.
In a way, RIFT can be seen as a distance-vector protocol that computes a set of feasible successors towards the superspine and autoconfigures the rest of the topology.
</t>
<t>
The autoconfiguration mechanism computes a global maximum of levels by diffusion.
Each node then derives its level from the Link Information Elements (LIEs) received from its
neighbors, and each node (with the possible exception of configured leaves) tries to
attach at the highest possible point in the fabric. This guarantees that even if the diffusion front reaches a node from "below" faster
than from "above", the node will greedily abandon an already negotiated level derived from nodes
topologically below it and properly peer with the nodes above.
</t>
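<t>The diffusion of levels from the ToF seeds can be sketched as a fixed-point
computation. The sketch below is a deliberate simplification and not the ZTP
state machine of RIFT: it assumes that a node takes its level as the highest
level offered by a neighbor minus one, that neighbors at level 0 make no usable offer,
and that the ToF nodes sit at level 2 as in the figures. The function name
derive_levels and the link list are hypothetical.</t>
<sourcecode type='python'><![CDATA[
```python
def derive_levels(links, configured):
    """Derive node levels from configured seeds by repeated HAL evaluation.

    links: iterable of (node_a, node_b) pairs.
    configured: node -> level for nodes whose level is set by configuration.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    levels = dict(configured)
    changed = True
    while changed:
        changed = False
        for node in sorted(neighbors):
            if node in configured:
                continue
            # Only neighbors with a known level above 0 make a usable offer.
            offers = [levels[n] for n in neighbors[node] if levels.get(n, 0) > 0]
            if not offers:
                continue
            derived = max(offers) - 1  # highest available level minus one
            if levels.get(node) != derived:
                levels[node] = derived
                changed = True
    return levels


links = [
    ("ToF21", "Spine111"), ("ToF21", "Spine112"),
    ("ToF22", "Spine111"), ("ToF22", "Spine112"),
    ("Spine111", "Leaf111"), ("Spine111", "Leaf112"),
    ("Spine112", "Leaf111"), ("Spine112", "Leaf112"),
]
levels = derive_levels(links, {"ToF21": 2, "ToF22": 2})
print(levels["Spine111"], levels["Leaf111"])  # prints 1 0
```
]]></sourcecode>
<t>Because a node always takes the highest offer it sees, a level learned early
from a lower neighbor is replaced as soon as a higher offer arrives, which is
the greedy behavior described above.</t>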
<t>
The achieved equilibrium can be disturbed massively when all the nodes with the highest level either leave or enter the domain (with some finer distinctions not explained further).
It is therefore recommended that each node be multi-homed towards nodes with respective HAL offerings. Fortunately, this is the natural state of things for the topology variants considered in RIFT.
</t>
<t>
A RIFT node may also be configured to confine it to the leaf role with the LEAF_ONLY flag. A leaf node can also be configured to support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either case the node cannot be TOP_OF_FABRIC and its level cannot be configured. RIFT will fully configure the node's level after it is attached to the topology and ensure that the node is at the "bottom of the hierarchy" (southernmost).
</t>
</section>
<section><name>Mis-cabling Examples</name>
<figure align='center' anchor='single-plane-mis-cabling'><name>A single plane mis-cabling example</name>
<artwork align='center'><![CDATA[
+----------------+ +-----------------+
| ToF21 | +------+ ToF22 | LEVEL 2
+-------+----+---+ | +----+---+--------+
| | | | | | | | |
| | | +----------------------------+ |
| +---------------------------+ | | | |
| | | | | | | | |
| | | | +-----------------------+ | |
| | +------------------------+ | | |
| | | | | | | | |
+-+---+--+ +-+---+--+ | +--+---+-+ +--+---+-+
|Spine111| |Spine112| | |Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | |
| +---------+ | link-M | +---------+ |
| | | | | | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
]]></artwork>
</figure>
<t><xref target='single-plane-mis-cabling'/> shows a single plane mis-cabling example. It is a perfect Fat Tree fabric except for link-M, which connects Leaf112 to ToF22.
</t>
<t>The RIFT control protocol can discover the physical links automatically and detect cabling that violates Fat Tree topology constraints.
It reacts accordingly to such mis-cabling attempts, at a minimum preventing adjacencies from being formed between nodes and traffic from being forwarded on the mis-cabled links.
In this scenario, Leaf112 will use link-M to derive its level (unless it is configured as a leaf) and can report its links to Spine111 and Spine112 as mis-cabled, unless the implementation
allows horizontal links.
</t>
<t><xref target='multi-plane-mis-cabling'/> shows a multiple plane mis-cabling example. Since Leaf112 and Spine121 belong to two different PoDs, the adjacency between Leaf112 and Spine121 cannot be formed. link-W would be detected as mis-cabled and its use prevented.
</t>
<figure align='center' anchor='multi-plane-mis-cabling'><name>A multiple plane mis-cabling example</name>
<artwork align='center'><![CDATA[
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2
+-------+ +-------+ +-------+ +-------+
| | | | | | | |
| | | +-----------------+ | | |
| +--------------------------+ | | | |
| | | | | | | |
| +------+ | | | +------+ |
| | +-----------------+ | | | | |
| | | +--------------------------+ | |
| A | | B | | A | | B |
+-----+--+ +-+---+--+ +--+---+-+ +--+-----+
|Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | |
| +---------+ | | | +---------+ |
| | | | link-W | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
+--------PoD#1----------+ +---------PoD#2---------+
]]></artwork>
</figure>
<t>RIFT provides an optional level determination procedure in its Zero Touch Provisioning mode. Nodes in the fabric without
a configured level determine it automatically. However, this can have counter-intuitive consequences.
One extreme failure scenario is depicted in <xref target='Fallen-spine'/>: if all northbound links of Spine11 fail at the same time,
Spine11 negotiates a lower level than Leaf111 and Leaf112.
</t>
<t>To prevent such a scenario, where leaves would be expected to act as switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
Since level -1 is invalid, Spine11 would not derive a valid level from the topology in <xref target='Fallen-spine'/>. It will be isolated from the whole fabric,
and it would be up to the leaves to declare the links towards such a spine as mis-cabled.
</t>
<figure align='center' anchor='Fallen-spine'><name>Fallen spine</name>
<artwork align='center'><![CDATA[
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF A1| |ToF A2|
+-------+ +-------+ +-------+ +-------+
| | | | | |
| +-------+ | | |
+ + | | ====> | |
X X +------+ | +------+ |
+ + | | | |
+----+--+ +-+-----+ +-+-----+
|Spine11| |Spine12| |Spine12|
+-+---+-+ ++----+-+ ++----+-+
| | | | | |
| +---------+ | | |
| | | | | |
| +-------+ | | +-------+ |
| | | | | |
+-+---+-+ +--+--+-+ +-----+-+ +-----+-+
|Leaf111| |Leaf112| |Leaf111| |Leaf112|
+-------+ +-------+ +-+-----+ +-+-----+
| |
| +--------+
| |
+-+---+-+
|Spine11|
+-------+
]]></artwork>
</figure>
</section>
<section><name>Positive vs. Negative Disaggregation</name>
<t>
Disaggregation is the procedure whereby <xref target='I-D.ietf-rift-rift'/>
advertises a more specific route southwards as an exception to the
aggregated fabric-default north. Disaggregation is useful when a prefix
within the aggregation is reachable via some of the parents but not the
others at the same level of the fabric.
It is mandatory when the level is the ToF since a ToF node that cannot reach
a prefix becomes a black hole for that prefix.
The hard problem is to know which prefixes are reachable by whom.
</t>
<t>
In the general case, <xref target='I-D.ietf-rift-rift'/> solves that
problem by interconnecting the ToF nodes, so that the ToF nodes can exchange the full list
of prefixes that exist in the fabric and determine when a ToF node lacks
reachability to an existing prefix. This requires additional ports at the
ToF, typically 2 ports per ToF node to form a ToF-spanning ring.
<xref target='I-D.ietf-rift-rift'/> also defines the southbound reflection
procedure that enables a parent to explore the direct connectivity of its
peers, meaning their own parents and children; based on the advertisements
received from the shared parents and children, it may enable the parent to
infer the prefixes its peers can reach.
</t>
<t>
When a parent lacks reachability to a prefix, it may disaggregate the prefix
negatively, i.e., advertise that this parent can be used to reach any prefix
in the aggregation except that one. The Negative Disaggregation signaling is
simple and functions transitively from ToF to top-of-pod (ToP) and then from ToP to Leaf.
But it is hard for a parent to determine which prefix it needs to disaggregate,
because it does not know what it does not know; as a result, a
spanning ring at the ToF is required to operate the Negative Disaggregation.
Also, though it is only an implementation problem, programming the
FIB is complex compared to normal routes and may incur recursions.
</t>
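<t>The added FIB complexity can be illustrated with a minimal sketch of how a
child might resolve a negative advertisement: the next hops for the excluded
prefix are those of the covering aggregate minus the parents that advertised
the exclusion. This is an assumption-laden illustration, not an implementation
of the RIFT procedure; the parent names ToP1 and ToP2 and the function name
resolve_negative are hypothetical.</t>
<sourcecode type='python'><![CDATA[
```python
def resolve_negative(covering_next_hops, negative_adverts):
    """Map each excluded prefix to the next hops that remain usable.

    covering_next_hops: parents used for the covering aggregate (default).
    negative_adverts: prefix -> set of parents that advertised it negatively.
    """
    return {
        prefix: sorted(set(covering_next_hops) - excluded)
        for prefix, excluded in negative_adverts.items()
    }


default_next_hops = ["ToP1", "ToP2"]
# ToP2 signals that it can reach everything in the aggregate except Prefix122.
negative_adverts = {"Prefix122": {"ToP2"}}

print(resolve_negative(default_next_hops, negative_adverts))
# prints {'Prefix122': ['ToP1']}
```
]]></sourcecode>
<t>The resulting entry depends on the next hops of the covering route, so any
change to that route requires the dependent entries to be recomputed, which
is where the recursion mentioned above comes from.</t>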
<t>
The more classical alternative is, for the parents that can reach a prefix
that peers at the same level cannot, to advertise a more specific route to
that prefix. This leverages the normal longest prefix match in the FIB, and
does not require a special implementation. But as opposed to the Negative
Disaggregation, the Positive Disaggregation is difficult and inefficient to
operate transitively.
</t>
<t>
Transitivity is not needed to a grandchild if all its parents received the
Positive Disaggregation, meaning that they shall all avoid the black hole;
when that is the case, they collectively build a ceiling that protects the
grandchild. But until then, a parent that received a Positive Disaggregation
may believe that some peers are lacking the reachability and readvertise too
early, or defer and maintain a black hole situation longer than necessary.
</t>
<t>
In a non-partitioned fabric, all the ToF nodes see one another through the
reflection and can determine whether one is missing a child. In that case it is
possible to compute the prefixes that the peer cannot reach and disaggregate
positively without a ToF-spanning ring. The ToF nodes can also ascertain
that the ToP nodes are connected each to at least a ToF node that can still
reach the prefix, meaning that the transitive operation is not required.
</t>
<t>
The bottom line is that in a fabric that is partitioned
(e.g., using multiple planes) and/or where the ToP nodes are not guaranteed
to always form a ceiling for their children, it is
mandatory to use the Negative Disaggregation.