
DARD: Distributed Adaptive Routing for Datacenter Networks

Xin Wu Xiaowei Yang


Dept. of Computer Science, Duke University Duke-CS-TR-2011-01

ABSTRACT
Datacenter networks typically have many paths connecting each host pair to achieve high bisection bandwidth for arbitrary communication patterns. Fully utilizing the bisection bandwidth may require flows between the same source and destination pair to take different paths to avoid hot spots. However, the existing routing protocols have little support for load-sensitive adaptive routing. We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to move traffic from overloaded paths to underloaded paths without central coordination. We use an OpenFlow implementation and simulations to show that DARD can effectively use a datacenter network's bisection bandwidth under both static and dynamic traffic patterns. It outperforms previous solutions based on random path selection by 10%, and performs similarly to previous work that assigns flows to paths using a centralized controller. We use competitive game theory to show that DARD's path selection algorithm makes progress in every step and converges to a Nash equilibrium in finite steps. Our evaluation results suggest that DARD can achieve a close-to-optimal solution in practice.

1. INTRODUCTION

Datacenter network applications, e.g., MapReduce and network storage, often demand high intra-cluster bandwidth [11] to transfer data among distributed components. This is because the components of an application cannot always be placed on machines close to each other (e.g., within a rack) for two main reasons. First, applications may share common services provided by the datacenter network, e.g., DNS, search, and storage. These services are not necessarily placed in nearby machines. Second, the auto-scaling feature offered by a datacenter network [1, 5] allows an application to create dynamic instances when its workload increases. Where those instances will be placed depends on machine availability, and is not guaranteed to be close to the application's other instances. Therefore, it is important for a datacenter network to have high bisection bandwidth to avoid hot spots between any pair of hosts. To achieve this goal, today's datacenter networks often use commodity Ethernet switches to form multi-rooted tree topologies [21] (e.g., fat-tree [10] or Clos topology [16]) that have multiple equal-cost paths connecting any host pair. A flow (a flow refers to a TCP connection in this paper) can use an alternative path if one path is overloaded.

However, legacy transport protocols such as TCP lack the ability to dynamically select paths based on traffic load. To overcome this limitation, researchers have advocated a variety of dynamic path selection mechanisms to take advantage of the multiple paths connecting any host pair. At a high level, these mechanisms fall into two categories: centralized dynamic path selection, and distributed traffic-oblivious load balancing. A representative example of centralized path selection is Hedera [11], which uses a central controller to compute an optimal flow-to-path assignment based on dynamic traffic load. Equal-Cost Multi-Path forwarding (ECMP) [19] and VL2 [16] are examples of traffic-oblivious load balancing. With ECMP, routers hash flows based on flow identifiers to multiple equal-cost next hops. VL2 [16] uses edge switches to forward a flow to a randomly selected core switch to achieve valiant load balancing [23]. Each of these two design paradigms has merit and improves the available bisection bandwidth between a host pair in a datacenter network. Yet each has its limitations. A centralized path selection approach introduces a potential scaling bottleneck and a centralized point of failure. When a datacenter scales to a large size, the control traffic sent to and from the controller may congest the link that connects the controller to the rest of the datacenter network. Distributed traffic-oblivious load balancing scales well to large datacenter networks, but may create hot spots, as its flow assignment algorithms do not consider the dynamic traffic load on each path. In this paper, we aim to explore the design space that uses end-to-end distributed load-sensitive path selection to fully use a datacenter's bisection bandwidth. This design paradigm has a number of advantages. First, placing the path selection logic at an end system rather than inside a switch facilitates deployment, as it does not require special hardware or replacing commodity switches. One can also upgrade or extend the path selection logic later by applying software patches rather than upgrading switching hardware. Second, a distributed design can be more robust and scale better than a centralized approach. This paper presents DARD, a lightweight, distributed, end-system-based path selection system for datacenter networks. DARD's design goal is to fully utilize bisection bandwidth and dynamically balance the traffic among the multiple paths between any host pair. A key design challenge DARD faces is how to achieve dynamic distributed load balancing.

Unlike in a centralized approach, with DARD no end system or router has a global view of the network. Each end system can only select a path based on its local knowledge, which makes close-to-optimal load balancing a challenging problem. To address this challenge, DARD uses a selfish path selection algorithm that provably converges to a Nash equilibrium in finite steps (Appendix B). Our experimental evaluation shows that the equilibrium's gap to the optimal solution is small. To facilitate path selection, DARD uses hierarchical addressing to represent an end-to-end path with a pair of source and destination addresses, as in [27]. Thus, an end system can switch paths by switching addresses. We have implemented a DARD prototype on DeterLab [6] and an ns-2 simulator. We use static traffic patterns to show that DARD converges to a stable state in two to three control intervals. We use dynamic traffic patterns to show that DARD outperforms ECMP, VL2 and TeXCP, and that its performance gap to centralized scheduling is small. Under dynamic traffic patterns, DARD maintains stable link utilization. About 90% of the flows change their paths fewer than 4 times in their life cycles. Evaluation results also show that the bandwidth taken by DARD's control traffic is bounded by the size of the topology. DARD is a scalable and stable end-host-based approach to load-balancing datacenter traffic. We make every effort to leverage existing infrastructures and to make DARD practically deployable. The rest of this paper is organized as follows. Section 2 introduces background knowledge and discusses related work. Section 3 describes DARD's design goals and system components. In Section 4, we introduce the system implementation details. We evaluate DARD in Section 5. Section 6 concludes our work.

Figure 1: A multi-rooted tree topology for a datacenter network. The aggregation layer's oversubscription ratio is defined as BW_down / BW_up.

We briefly describe what a fat-tree topology is. Figure 2 shows a fat-tree topology example. A p-pod fat-tree topology (in Figure 2, p = 4) has p pods in the horizontal direction. It uses 5p^2/4 p-port switches and supports non-blocking communication among p^3/4 end hosts. A pair of end hosts in different pods have p^2/4 equal-cost paths connecting them. Once the two end hosts choose a core switch as the intermediate node, the path between them is uniquely determined.
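These counts follow directly from the topology's structure. The short sketch below (an illustrative helper, not part of DARD) computes them for a given pod count p.

    def fattree_parameters(p):
        """Return the switch, host, and path counts of a p-pod fat-tree.

        Uses the formulas from the text: 5p^2/4 p-port switches,
        p^3/4 end hosts, and p^2/4 equal-cost paths between hosts
        in different pods."""
        assert p % 2 == 0, "a fat-tree pod count is even"
        return {
            "switches": 5 * p * p // 4,       # ToR + aggregation + core
            "hosts": p ** 3 // 4,             # end hosts
            "inter_pod_paths": p * p // 4,    # one per core switch
        }

    # The paper's test bed uses p = 4; the larger simulations use p = 16.
    print(fattree_parameters(4))   # {'switches': 20, 'hosts': 16, 'inter_pod_paths': 4}
    print(fattree_parameters(16))  # {'switches': 320, 'hosts': 1024, 'inter_pod_paths': 64}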

2. BACKGROUND AND RELATED WORK

Figure 2: A 4-pod fat-tree topology.

In this section, we first briefly introduce what a datacenter network looks like and then discuss related work.

2.1 Datacenter Topologies


Recent proposals [10, 16, 24] suggest using multi-rooted tree topologies to build datacenter networks. Figure 1 shows a 3-stage multi-rooted tree topology. The topology has three vertical layers: Top-of-Rack (ToR), aggregation, and core. A pod is a management unit. It represents a replicable building block consisting of a number of servers and switches sharing the same power and management infrastructure. An important design parameter of a datacenter network is the oversubscription ratio at each layer of the hierarchy, which is computed as a layer's downstream bandwidth divided by its upstream bandwidth, as shown in Figure 1. The oversubscription ratio is usually designed to be larger than one, assuming that not all downstream devices will be active concurrently. We design DARD to work for arbitrary multi-rooted tree topologies. But for ease of exposition, we mostly use the fat-tree topology to illustrate DARD's design, unless otherwise noted.

In this paper, we use the term elephant flow to refer to a continuous TCP connection longer than a threshold defined in the number of transferred bytes. We discuss how to choose this threshold in Section 3.

2.2 Related Work


Related work falls into three broad categories: adaptive path selection mechanisms, end-host-based multipath transmission, and traffic engineering protocols. Adaptive path selection. Adaptive path selection mechanisms [11, 16, 19] can be further divided into centralized and distributed approaches. Hedera [11] adopts a centralized approach at the granularity of a flow. In Hedera, edge switches detect and report elephant flows to a centralized controller. The controller calculates a path arrangement and periodically updates switches' routing tables. Hedera can almost fully utilize a network's bisection bandwidth, but a recent datacenter traffic measurement suggests that this centralized approach needs parallelism and fast route computation to support dynamic traffic patterns [12].

Equal-Cost Multi-Path forwarding (ECMP) [19] is a distributed flow-level path selection approach. An ECMP-enabled switch is configured with multiple next hops for a given destination and forwards a packet according to a hash of selected fields of the packet header. It can split traffic to each destination across multiple paths. Since packets of the same flow share the same hash value, they take the same path from the source to the destination and maintain the packet order. However, multiple large flows can collide on their hash values and congest an output port [11]. VL2 [16] is another distributed path selection mechanism. Different from ECMP, it places the path selection logic at edge switches. In VL2, an edge switch first forwards a flow to a randomly selected core switch, which then forwards the flow to the destination. As a result, multiple elephant flows can still collide on the same output port, as in ECMP. DARD belongs to the distributed adaptive path selection family. It differs from ECMP and VL2 in two key aspects. First, its path selection algorithm is load sensitive. If multiple elephant flows collide on the same path, the algorithm will shift flows from the collided path to more lightly loaded paths. Second, it places the path selection logic at end systems rather than at switches to facilitate deployment. A path selection module running at an end system monitors path state and switches paths according to path load (Section 3.5). A datacenter network can deploy DARD by upgrading its end systems' software stack rather than updating switches. Multipath transport protocols. A different design paradigm, multipath TCP (MPTCP) [26], enables an end system to simultaneously use multiple paths to improve TCP throughput. However, it requires applications to use MPTCP rather than legacy TCP to take advantage of underutilized paths. In contrast, DARD is transparent to applications, as it is implemented as a path selection module under the transport layer (Section 3). Therefore, legacy applications need not upgrade to take advantage of multiple paths in a datacenter network. Traffic engineering protocols. Traffic engineering protocols such as TeXCP [20] are originally designed to balance traffic in an ISP network, but can be adopted by datacenter networks. However, because these protocols are not end-to-end solutions, they forward traffic along different paths at the granularity of a packet rather than a TCP flow. Therefore, they can cause TCP packet reordering, harming a TCP flow's performance. In addition, different from DARD, they also place the path selection logic at switches and therefore require upgrading switches.

DARD's essential goal is to effectively utilize a datacenter's bisection bandwidth with practically deployable mechanisms and limited control overhead. We elaborate on the design goals in more detail.

1. Efficiently utilizing the bisection bandwidth. Given the large bisection bandwidth in datacenter networks, we aim to take advantage of the multiple paths connecting each host pair and fully utilize the available bandwidth. Meanwhile, we desire to prevent any systematic design risk that may cause packet reordering and decrease the system goodput.

2. Fairness among elephant flows. We aim to provide fairness among elephant flows so that concurrent elephant flows can evenly share the available bisection bandwidth. We focus our work on elephant flows for two reasons. First, existing work shows that ECMP and VL2 already perform well on scheduling a large number of short flows [16]. Second, elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of the largest flows [16]).

3. Lightweight and scalable. We aim to design a lightweight and scalable system. We desire to avoid a centralized scaling bottleneck and minimize the amount of control traffic and computation needed to fully utilize the bisection bandwidth.

4. Practically deployable. We aim to make DARD compatible with existing datacenter infrastructures so that it can be deployed without significant modifications or upgrades of existing infrastructures.

3.2 Overview
In this section, we present an overview of the DARD design. DARD uses three key mechanisms to meet the above system design goals. First, it uses a lightweight distributed end-system-based path selection algorithm to move flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (Section 3.5). Second, it uses hierarchical addressing to facilitate efficient path selection (Section 3.3). Each end system can use a pair of source and destination addresses to represent an end-to-end path, and vary paths by varying addresses. Third, DARD places the path selection logic at an end system to facilitate practical deployment, as a datacenter network can upgrade its end systems by applying software patches. It only requires that a switch support the OpenFlow protocol, and such switches are commercially available [?]. Figure 3 shows DARD's system components and how it works. Since we choose to place the path selection logic at an end system, a switch in DARD has only two functions: (1) it forwards packets to the next hop according to a pre-configured routing table; (2) it keeps track of the Switch State (SS, defined in Section 3.4) and replies to end systems' Switch State Requests (SSR). Our design implements this function using the OpenFlow protocol. An end system has three DARD components, as shown in Figure 3: the Elephant Flow Detector, the Path State Monitor, and the Path Selector.

3. DARD DESIGN

In this section, we describe DARD's design. We first highlight the system design goals. Then we present an overview of the system. We present more design details in the following subsections.

3.1 Design Goals

Figure 3: DARD's system overview. There are multiple paths connecting each source and destination pair. DARD is a distributed system running on every end host. It has three components. The Elephant Flow Detector detects elephant flows. The Path State Monitor monitors the traffic load on each path by periodically querying the switches. The Path Selector moves flows from overloaded paths to underloaded paths.

The Elephant Flow Detector monitors all the outgoing flows and treats a flow as an elephant once its size grows beyond a threshold. We use 100KB as the threshold in our implementation, because according to a recent study, more than 85% of flows in a datacenter are less than 100KB [16]. The Path State Monitor sends SSRs to the switches on all the paths and assembles the SS replies into Path States (PS, as defined in Section 3.4). The path state indicates the load on each path. Based on both the path state and the detected elephant flows, the Path Selector periodically moves flows from overloaded paths to underloaded paths. The rest of this section presents more design details of DARD, including how to use hierarchical addressing to select paths at an end system (Section 3.3), how to actively monitor all paths' state in a scalable fashion (Section 3.4), and how to move flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (Section 3.5).

3.3 Addressing and Routing


To fully utilize the bisection bandwidth and, at the same time, to prevent retransmissions caused by packet reordering (Goal 1), we allow a flow to take different paths in its life cycle to reach the destination. However, one flow can use only one path at any given time. Since we are exploring the design space of putting as much control logic as possible at the end hosts, we decided to leverage the datacenter's hierarchical structure to enable an end host to actively select paths for a flow. A datacenter network is usually constructed as a multi-rooted tree. Take Figure 4 as an example.

All the switches and end hosts highlighted by the solid circles form a tree rooted at core_1. Three other similar trees exist in the same topology. This strictly hierarchical structure facilitates adaptive routing through some customized addressing rules [10]. We borrow the idea from NIRA [27] to split an end-to-end path into uphill and downhill segments and encode a path in the source and destination addresses. In DARD, each core switch obtains a unique prefix and then allocates non-overlapping subdivisions of the prefix to each of its sub-trees. The sub-trees recursively allocate non-overlapping subdivisions of their prefixes to lower hierarchies. By this hierarchical prefix allocation, each network device receives multiple IP addresses, each of which represents the device's position in one of the trees. As shown in Figure 4, we use core_i to refer to the i-th core and aggr_ij to refer to the j-th aggregation switch in the i-th pod. We follow the same rule to interpret ToR_ij for the top-of-rack switches and E_ij for the end hosts. We use the device names prefixed with the letter P and delimited by dots to illustrate how prefixes are allocated along the hierarchies. The first core is allocated the prefix P_core1. It then allocates non-overlapping prefixes P_core1.P_aggr11 and P_core1.P_aggr21 to two of its sub-trees. The sub-tree rooted at aggr_11 will further allocate four prefixes to lower hierarchies. For a general multi-rooted tree topology, the datacenter operators can generate a similar address assignment scheme and allocate the prefixes along the topology hierarchies. In case more IP addresses than network cards are assigned to each end host, we propose to use IP aliases to configure multiple IP addresses on one network interface. The latest operating systems support a large number of IP aliases per network interface, e.g., Linux kernel 2.6 sets the limit to 256K IP aliases per interface [3]. Windows NT 4.0 has no limitation on the number of IP addresses that can be bound to a network interface [4]. One nice property of this hierarchical addressing is that a host address uniquely encodes the sequence of upper-level switches that allocate that address, e.g., in Figure 4, E_11's address P_core1.P_aggr11.P_ToR11.P_E11 uniquely encodes the address allocation sequence core_1 -> aggr_11 -> ToR_11. A source and destination address pair can further uniquely identify a path, e.g., in Figure 4, we can use the source and destination pair highlighted by dotted circles to uniquely encode the dotted path from E_11 to E_21 through core_1. We call the partial path encoded by the source address the uphill path and the partial path encoded by the destination address the downhill path. To move a flow to a different path, we can simply use a different source and destination address combination without dynamically reconfiguring the routing tables. To forward a packet, each switch stores a downhill table and an uphill table, as described in [27]. The uphill table keeps the entries for the prefixes allocated from upstream switches, and the downhill table keeps the entries for the prefixes allocated to the downstream switches.

Figure 4: DARD's addressing and routing. E_11's address P_core1.P_aggr11.P_ToR11.P_E11 encodes the uphill path ToR_11-aggr_11-core_1. E_21's address P_core1.P_aggr21.P_ToR21.P_E21 encodes the downhill path core_1-aggr_21-ToR_21.

Table 1 shows the switch aggr_11's downhill and uphill tables. The port indexes are marked in Figure 4. When a packet arrives, a switch first looks up the destination address in the downhill table using the longest prefix matching algorithm. If a match is found, the packet is forwarded to the corresponding downstream switch. Otherwise, the switch looks up the source address in the uphill table to forward the packet to the corresponding upstream switch. A core switch only has the downhill table.
Table 1: aggr_11's downhill and uphill routing tables.

Downhill table:
  Prefix                        Port
  P_core1.P_aggr11.P_ToR11      1
  P_core1.P_aggr11.P_ToR12      2
  P_core2.P_aggr11.P_ToR11      1
  P_core2.P_aggr11.P_ToR12      2

Uphill table:
  Prefix                        Port
  P_core1                       3
  P_core2                       4
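To make the two-stage lookup described above concrete, the sketch below implements the forwarding decision in plain Python. It is a simplified illustration: the prefix strings, the table encoding, and longest-prefix matching on dot-separated labels are assumptions made for readability, not DARD's actual switch implementation (which uses OpenFlow flow tables).

    def longest_prefix_match(address, table):
        """Return the port of the longest prefix in `table` that is a
        label-wise prefix of `address`, or None if nothing matches."""
        addr_labels = address.split(".")
        best_port, best_len = None, -1
        for prefix, port in table.items():
            labels = prefix.split(".")
            if addr_labels[:len(labels)] == labels and len(labels) > best_len:
                best_port, best_len = port, len(labels)
        return best_port

    def forward(src, dst, downhill, uphill):
        """Downhill table first; fall back to the uphill table (Section 3.3)."""
        port = longest_prefix_match(dst, downhill)
        return port if port is not None else longest_prefix_match(src, uphill)

    # aggr_11's tables from Table 1 (illustrative encoding of the prefixes).
    downhill = {
        "P_core1.P_aggr11.P_ToR11": 1, "P_core1.P_aggr11.P_ToR12": 2,
        "P_core2.P_aggr11.P_ToR11": 1, "P_core2.P_aggr11.P_ToR12": 2,
    }
    uphill = {"P_core1": 3, "P_core2": 4}

    # A packet from E_11 heading up toward core_1 matches only the uphill table.
    print(forward("P_core1.P_aggr11.P_ToR11.P_E11",
                  "P_core1.P_aggr21.P_ToR21.P_E21", downhill, uphill))  # 3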

Switches in the middle will forward the packet according to the encapsulated packet header. When the packet arrives at the destination, it will be decapsulated and passed to upper layer protocols.

3.4 Path Monitoring


To achieve load-sensitive path selection at an end host, DARD informs every end host of the traffic load in the network. Each end host then selects paths for its outbound elephant flows accordingly. From a high-level perspective, there are two options to inform an end host of the network traffic load. The first is a mechanism similar to NetFlow [15], in which traffic is logged at the switches and stored at a centralized server; we can then either pull or push the log to the end hosts. The second is an active measurement mechanism, in which each end host actively probes the network and collects replies. TeXCP [20] chooses the second option. Since we desire DARD not to rely on any conceptually centralized component, to prevent any potential scalability issue, an end host in DARD also uses active probes to monitor the traffic load in the network. This function is performed by the Path State Monitor shown in Figure 3. This section first describes a straw-man design of the Path State Monitor which enables an end host to monitor the traffic load in the network. We then improve the straw-man design by decreasing the control traffic. We first define some terms before describing the straw-man design. We use C_l to denote output link l's capacity and N_l to denote the number of elephant flows on output link l. We define link l's fair share S_l = C_l / N_l as the bandwidth each elephant flow would get if the flows fairly shared that link (S_l = 0 if N_l = 0). Output link l's Link State (LS_l) is defined as the triple [C_l, N_l, S_l]. A switch r's Switch State (SS_r) is defined as {LS_l | l is r's output link}. A path p refers to the set of links that connect a source and destination ToR switch pair. If link l has the smallest S_l among all the links on path p, we use LS_l to represent p's Path State (PS_p). The Path State Monitor is a DARD component running on an end host and is responsible for monitoring the states of all the paths to other hosts. In the straw-man design, each switch keeps track of its switch state (SS) locally.
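The following sketch shows how an end host could assemble a path state from collected switch states, following the definitions above. The data layout (dictionaries keyed by link and switch names) is an assumption made for illustration.

    def fair_share(capacity, num_elephants):
        """S_l = C_l / N_l, or 0 when the link carries no elephant flows."""
        return capacity / num_elephants if num_elephants > 0 else 0.0

    def link_state(capacity, num_elephants):
        """LS_l = [C_l, N_l, S_l]."""
        return {"C": capacity, "N": num_elephants,
                "S": fair_share(capacity, num_elephants)}

    def path_state(path_links, switch_states):
        """PS_p: the link state with the smallest fair share along path p.

        `switch_states` maps a switch name to its SS, i.e. a dict from
        output-link name to LS_l; `path_links` lists (switch, link) pairs."""
        states = [switch_states[sw][link] for sw, link in path_links]
        return min(states, key=lambda ls: ls["S"])

    # Example: a two-hop path where the aggregation uplink is the bottleneck.
    switch_states = {
        "ToR11":  {"up1": link_state(1e9, 2)},   # 2 elephants share 1 Gbps
        "aggr11": {"up1": link_state(1e9, 5)},   # 5 elephants share 1 Gbps
    }
    print(path_state([("ToR11", "up1"), ("aggr11", "up1")], switch_states))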

In fact, the downhill-uphill lookup is not necessary for a fat-tree topology, since a core switch in a fat-tree uniquely determines the entire path. However, not all multi-rooted trees share this property, e.g., a Clos network. The downhill-uphill lookup modifies the current switch forwarding algorithm. However, an increasing number of switches support highly customized forwarding policies. In our implementation we use OpenFlow-enabled switches to support this forwarding algorithm. All switches' uphill and downhill tables are automatically configured during their initialization. These configurations are static unless the topology changes. Each network component is also assigned a location-independent IP address, its ID, which uniquely identifies the component and is used for making TCP connections. The mapping from IDs to underlying IP addresses is maintained by a DNS-like system and cached locally. To deliver a packet from a source to a destination, the source encapsulates the packet with a proper source and destination address pair.

Figure 5: The set of switches a source sends switch state requests to.

The Path State Monitor of an end host periodically sends Switch State Requests (SSR) to every switch and assembles the switch state replies into path states (PS). These path states indicate the traffic load on each path. This straw-man design requires every switch to have a customized flow counter and the capability of replying to SSRs. These two functions are already supported by OpenFlow-enabled switches [17]. In the straw-man design, every end host periodically sends SSRs to all the switches in the topology, and the switches reply to the requests. The control traffic in every control interval (we discuss this control interval in Section 5) can be estimated using formula (1), where pkt_size refers to the sum of the request and response packet sizes:

control_traffic = num_of_servers * num_of_switches * pkt_size    (1)
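As a rough illustration of formula (1), the snippet below plugs in the p = 16 fat-tree used later in the evaluation (1024 servers, 320 switches) and an assumed per-probe packet size; the 80-byte figure is only a placeholder (Section 5.3.4 reports 48-byte and 32-byte control payloads for DARD), so the result is an order-of-magnitude estimate, not a measured number.

    def strawman_control_traffic(num_servers, num_switches, pkt_size_bytes):
        """Formula (1): bytes of probe traffic per control interval."""
        return num_servers * num_switches * pkt_size_bytes

    # p = 16 fat-tree: 1024 servers, 320 switches; assumed 80 B request+reply payload.
    bytes_per_interval = strawman_control_traffic(1024, 320, 80)
    print(bytes_per_interval / 1e6, "MB per control interval")  # ~26 MB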

Algorithm: selfish path selection

 1: for each src-dst ToR switch pair P do
 2:   max_index = 0; max_S = 0.0;
 3:   min_index = 0; min_S = ∞;
 4:   for each i in [1, P.PV.length] do
 5:     if P.FV[i] > 0 and max_S < P.PV[i].S then
 6:       max_S = P.PV[i].S;
 7:       max_index = i;
 8:     else if min_S > P.PV[i].S then
 9:       min_S = P.PV[i].S;
10:       min_index = i;
11:     end if
12:   end for
13: end for
14: if max_index ≠ min_index then
15:   estimation = P.PV[max_index].bandwidth / (P.PV[max_index].flow_numbers + 1)
16:   if estimation − min_S > δ then
17:     move one elephant flow from path min_index to path max_index
18:   end if
19: end if

A link's fair share is its bandwidth divided by the number of elephant flows on it. A path's fair share is defined as the smallest link fair share along the path. Given an elephant flow's elastic traffic demand and the small delays in datacenters, elephant flows tend to fully and fairly utilize their bottlenecks. As a result, moving one flow from a path with a small fair share to a path with a large fair share pushes both the small and large fair shares toward the middle and thus improves fairness to some extent. Based on this observation, we propose DARD's path selection algorithm, whose high-level idea is to enable every end host to selfishly increase the minimum fair share it can observe. The selfish path selection algorithm above illustrates one round of the path selection process. In DARD, every source and destination pair maintains two vectors: the path state vector (PV), whose i-th item is the state of the i-th path, and the flow vector (FV), whose i-th item is the number of elephant flows the source is sending along the i-th path. Line 15 estimates the fair share of the max_index-th path if another elephant flow were moved to it. The δ in line 16 is a positive threshold used to decide whether to move a flow. If we set δ to 0, line 16 ensures the algorithm will not decrease the global minimum fair share. If we set δ larger than 0, the algorithm converges as soon as the estimation is close enough to the current minimum fair share. In general, a small δ will evenly split elephant flows among different paths, and a large δ will accelerate the algorithm's convergence. Existing work shows that load-sensitive adaptive routing protocols can lead to oscillations and instability [22]. Figure 6 shows an example of how an oscillation might happen using the selfish path selection algorithm. There are three source and destination pairs: (E1, E2), (E3, E4) and (E5, E6).
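The listing below is a runnable sketch of one round of this selection, written from the pseudocode and its explanation rather than from DARD's source code. It interprets the FV check as "only move a flow away from a path on which this host actually has an elephant flow"; the data layout, function name, and example numbers are illustrative assumptions.

    import math

    def selfish_path_selection(path_states, flow_vector, delta=10e6):
        """One round of selfish path selection for one src-dst ToR pair.

        path_states[i] : (bottleneck_bandwidth, elephant_count) of path i
        flow_vector[i] : how many of our own elephant flows use path i
        delta          : improvement threshold (the paper uses 10 Mbps)
        Returns (from_path, to_path) if a flow should move, else None."""
        def fair_share(bw, n):
            return bw / n if n > 0 else 0.0   # S_l = 0 when no elephants (Section 3.4)

        # Path to move a flow away from: smallest fair share among the
        # paths on which we actually have an elephant flow to move.
        candidates_from = [i for i, f in enumerate(flow_vector) if f > 0]
        if not candidates_from:
            return None
        min_i = min(candidates_from, key=lambda i: fair_share(*path_states[i]))
        min_s = fair_share(*path_states[min_i])

        # Path to move it to: largest fair share overall.
        max_i = max(range(len(path_states)), key=lambda i: fair_share(*path_states[i]))
        if max_i == min_i:
            return None

        bw, n = path_states[max_i]
        estimation = bw / (n + 1)            # line 15: fair share after the move
        if estimation - min_s > delta:       # line 16: only move if it helps by > delta
            return (min_i, max_i)            # line 17
        return None

    # Path 0 is congested (5 elephants on a 1 Gbps bottleneck); path 1 is lightly loaded.
    print(selfish_path_selection([(1e9, 5), (1e9, 1)], flow_vector=[3, 0]))  # (0, 1)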

Even though the above control traffic is bounded by the size of the topology, we can still improve the straw-man design by decreasing the control traffic. There are two intuitions for the optimizations. First, if an end host is not sending any elephant flow, it is not necessary for the end host to monitor the traffic load, since DARD is designed to select paths only for elephant flows. Second, as shown in Figure 5, E21 is sending an elephant flow to E31. The switches highlighted by the dotted circles are the ones E21 needs to send SSRs to. The remaining switches are not on any path from the source to the destination. We do not highlight the destination ToR switch (ToR31), because its output link (the one connected to E31) is shared by all four paths from the source to the destination, so DARD cannot move flows away from that link anyway. Based on these two observations, we limit the number of switches each source sends SSRs to. For any multi-rooted tree topology with three hierarchies, this set of switches includes (1) the source ToR switch, (2) the aggregation switches directly connected to the source ToR switch, (3) all the core switches, and (4) the aggregation switches directly connected to the destination ToR switch.
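The probe set can be enumerated directly from the topology. The helper below sketches this for a three-layer multi-rooted tree; the adjacency representation and switch names are assumptions for illustration.

    def probe_set(src_tor, dst_tor, tor_to_aggrs, core_switches):
        """Switches a source probes for one destination (Section 3.4):
        its own ToR, the aggregation switches above the source and
        destination ToRs, and all core switches."""
        switches = {src_tor}
        switches.update(tor_to_aggrs[src_tor])      # aggrs above the source ToR
        switches.update(core_switches)              # all cores
        switches.update(tor_to_aggrs[dst_tor])      # aggrs above the destination ToR
        return switches

    # A 4-pod fat-tree fragment: each ToR connects to the two aggrs in its pod.
    tor_to_aggrs = {"ToR21": {"aggr21", "aggr22"}, "ToR31": {"aggr31", "aggr32"}}
    cores = {"core1", "core2", "core3", "core4"}
    print(sorted(probe_set("ToR21", "ToR31", tor_to_aggrs, cores)))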

3.5 Path Selection


As shown in Figure 3, the Path Selector running on an end host takes the detected elephant flows and the path states as input and periodically moves flows from overloaded paths to underloaded paths. We desire a stable and distributed algorithm to improve the system's efficiency. This section first introduces the intuition behind DARD's path selection and then describes the algorithm in detail. In Section 3.4 we define a link's fair share to be the link's bandwidth divided by the number of elephant flows on it.

Each of the pairs has two paths and two elephant flows. The source in each pair runs the path state monitor and the path selector independently, without knowing the other two's behaviors. In the beginning, the shared path (the link switch1-switch2) has no elephant flows on it. According to the selfish path selection algorithm, the three sources will all move flows to it, increasing the number of elephant flows on the shared path to three, larger than one. This will in turn cause the three sources to move flows away from the shared path. The three sources repeat this process, causing permanent oscillation and bandwidth underutilization.

Figure 6: Path oscillation example.

The reason for path oscillation is that different sources move flows to under-utilized paths in a synchronized manner. As a result, in DARD, the interval between two adjacent flow movements of the same end host consists of a fixed span of time and a random span of time. According to the evaluation in Section 5.3.3, simply adding this randomness to the control interval can prevent path oscillation.
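A minimal sketch of the randomized control interval follows, with the concrete constants taken from the implementation in Section 4.1 (5 s fixed plus up to 5 s random); the function name is illustrative.

    import random

    def next_flow_movement_delay(fixed_s=5.0, random_max_s=5.0):
        """Fixed span plus a random span, so that end hosts do not move
        flows in lockstep and trigger the oscillation of Figure 6."""
        return fixed_s + random.uniform(0.0, random_max_s)

    print(next_flow_movement_delay())  # e.g. 7.3 seconds until the next move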

NOX is often used as a centralized controller for OpenFlow-enabled networks; however, DARD does not rely on it. We use NOX only once, to initialize the static flow tables. Each link's bandwidth is configured as 100 Mbps. A daemon program runs on every end host. It has the three components shown in Figure 3. The Elephant Flow Detector leverages TCPTrack [9] at each end host to monitor TCP connections and detects an elephant flow if a TCP connection grows beyond 100KB [16]. The Path State Monitor tracks the fair share of all the equal-cost paths connecting the source and destination ToR switches, as described in Section 3.4. It queries switches for their states using the aggregate flow statistics interfaces provided by the OpenFlow infrastructure [8]. The query interval is set to 1 second. This interval causes an acceptable amount of control traffic, as shown in Section 5.3.4. We leave exploring the impact of varying this query interval to future work. The Path Selector moves elephant flows from overloaded paths to underloaded paths according to the selfish path selection algorithm, where we set δ to 10 Mbps. This number is a tradeoff between maximizing the minimum flow rate and fast convergence. The flow movement interval is 5 seconds plus a random time from [0s, 5s]. Because a significant fraction of elephant flows last for more than 10s [21], this interval setting prevents an elephant flow from completing without ever having the chance to be moved to a less congested path, and at the same time, this conservative interval setting limits the frequency of flow movement. We use Linux IP-in-IP tunneling as the encapsulation/decapsulation module. All the mappings from IDs to underlying IP addresses are kept at every end host.

4.2 Simulator
To evaluate DARD's performance in larger topologies, we build a DARD simulator on ns-2, which captures the system's packet-level behavior. The simulator supports the fat-tree, the Clos network [16], and 3-tier topologies whose oversubscription ratio is larger than 1 [2]. The topology and traffic patterns are passed in as Tcl configuration files. A link's bandwidth is 1 Gbps and its delay is 0.01 ms. The queue size is set to the delay-bandwidth product. TCP New Reno is used as the transport protocol. We use the same settings as the test bed for the rest of the parameters.

4. IMPLEMENTATION

To test DARD's performance in real datacenter networks, we implemented a prototype and deployed it in a 4-pod fat-tree topology on DeterLab [6]. We also implemented a simulator on ns-2 to evaluate DARD's performance on different types of topologies.

4.1 Test Bed


We set up a fat-tree topology using 4-port PCs acting as the switches and configure IP addresses according to the hierarchical addressing scheme described in Section 3.3. All PCs run the Ubuntu 10.04 LTS standard image. All switches run OpenFlow 1.0. An OpenFlow-enabled switch allows us to access and customize the forwarding table. It also maintains per-flow and per-port statistics. Different vendors, e.g., Cisco and HP, have implemented OpenFlow in their products. We implement our prototype on the existing OpenFlow platform to show that DARD is practical and readily deployable. We implement a NOX [18] component to configure all switches' flow tables during their initialization. This component allocates the downhill table to OpenFlow's flow table 0 and the uphill table to OpenFlow's flow table 1 to enforce a higher priority for the downhill table. All entries are set to be permanent.

5. EVALUATION
This section describes the evaluation of DARD using the DeterLab test bed and ns-2 simulation. We focus this evaluation on four aspects. (1) Can DARD fully utilize the bisection bandwidth and prevent elephant flows from colliding at hot spots? (2) How fast does DARD converge to a stable state given different static traffic patterns? (3) Will DARD's distributed algorithm cause any path oscillation? (4) How much control overhead does DARD introduce to a datacenter?

5.1 Traffic Patterns


Due to the absence of commercial datacenter network traces,

we use the three traffic patterns introduced in [10] for both our test bed and simulation evaluations. (1) Stride, where an end host with index E_ij sends elephant flows to the end host with index E_kj, where k = ((i + 1) mod num_pods) + 1. This traffic pattern emulates the worst case, where a source and a destination are in different pods. As a result, the traffic stresses the links between the core and the aggregation layers. (2) Staggered(ToRP, PodP), where an end host sends elephant flows to another end host connected to the same ToR switch with probability ToRP, to any other end host in the same pod with probability PodP, and to end hosts in different pods with probability 1 - ToRP - PodP. In our evaluation ToRP is 0.5 and PodP is 0.3. This traffic pattern emulates the case where an application's instances are close to each other and most intra-cluster traffic stays in the same pod or even under the same ToR switch. (3) Random, where an end host sends elephant flows to any other end host in the topology with uniform probability. This traffic pattern emulates an average case where applications are randomly placed in datacenters. The above three traffic patterns can be either static or dynamic. Static traffic refers to a set of permanent elephant flows from the source to the destination. Dynamic traffic means the elephant flows between a source and a destination start at different times. The elephant flows transfer large files of different sizes. Two key parameters for the dynamic traffic are the flow inter-arrival time and the file size. According to [21], the distribution of inter-arrival times between two flows at an end host has periodic modes spaced by 15ms. Given that 20% of the flows are elephants [21], we set the inter-arrival time between two elephant flows to 75ms. Because 99% of the flows are smaller than 100MB and 90% of the bytes are in flows between 100MB and 1GB [21], we set an elephant flow's size to be uniformly distributed between 100MB and 1GB. We do not include any short TCP flows in the evaluation because elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of the largest flows [16, 21]). We leave the short flows' impact on DARD's performance to future work.
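To make the three patterns concrete, the sketch below picks a destination for one sender in the way the descriptions above suggest; the host indexing (pods numbered from 1), helper names, and uniform choices are illustrative assumptions rather than the exact workload generator.

    import random

    def stride_destination(i, j, num_pods):
        """Stride pattern: host E_ij sends to E_kj with
        k = ((i + 1) mod num_pods) + 1  (pods indexed from 1)."""
        k = ((i + 1) % num_pods) + 1
        return (k, j)

    def staggered_destination(same_tor, same_pod, others, tor_p=0.5, pod_p=0.3):
        """Staggered(ToRP, PodP): same ToR w.p. 0.5, same pod w.p. 0.3,
        otherwise a host in a different pod."""
        r = random.random()
        if r < tor_p:
            return random.choice(same_tor)
        if r < tor_p + pod_p:
            return random.choice(same_pod)
        return random.choice(others)

    def random_destination(all_other_hosts):
        """Random pattern: uniform over all other hosts."""
        return random.choice(all_other_hosts)

    # E.g., in a 4-pod fat-tree, host E_2,3 under the stride pattern sends to:
    print(stride_destination(2, 3, num_pods=4))  # (4, 3)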

In the hope of smoothing out collisions through randomness, our flow-level VLB implementation randomly re-picks a core every 10 seconds for each elephant flow. This 10s control interval is set to be roughly the same as DARD's control interval; we do not explore other choices. We refer to this implementation as periodical VLB (pVLB). We do not implement other approaches, e.g., Hedera, TeXCP and MPTCP, in the test bed; we compare DARD with these approaches in simulation. Figure 7 shows the comparison of DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. DARD outperforms both ECMP and pVLB. One observation is that the bisection bandwidth gap between DARD and the other two approaches increases in the order of staggered, random and stride. This is because flows through the core have more path diversity than flows inside a pod. Compared with ECMP and pVLB, DARD's strategic path selection can converge to a better flow allocation than simply relying on randomness.
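The sketch below illustrates the two baselines as described above: a static hash of the flow's addresses and ports for ECMP, and a periodic random core re-pick for pVLB. The hash function and tuple layout are illustrative, not the exact test-bed code.

    import random

    def ecmp_path(src_ip, dst_ip, src_port, dst_port, num_paths):
        """Static hash-based ECMP: within one run, the same flow always
        hashes to the same path."""
        return hash((src_ip, dst_ip, src_port, dst_port)) % num_paths

    class PeriodicalVLB:
        """Flow-level VLB that re-picks a random core for each elephant flow
        every `interval` seconds (10 s in the test bed)."""
        def __init__(self, num_cores, interval=10.0):
            self.num_cores, self.interval = num_cores, interval
            self.assignment = {}   # flow id -> (core, time of last pick)

        def core_for(self, flow_id, now):
            core, picked_at = self.assignment.get(flow_id, (None, -self.interval))
            if core is None or now - picked_at >= self.interval:
                core = random.randrange(self.num_cores)
                self.assignment[flow_id] = (core, now)
            return core

    pvlb = PeriodicalVLB(num_cores=4)
    print(ecmp_path("10.0.1.2", "10.2.1.2", 43512, 80, num_paths=4))
    print(pvlb.core_for("flowA", now=0.0), pvlb.core_for("flowA", now=12.0))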

Figure 7: DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. Measured on the 4-pod fat-tree test bed.

5.2 Test Bed Results


To evaluate whether DARD can fully utilize the bisection bandwidth, we use the static traffic patterns, and for each source and destination pair, a TCP connection transfers an infinite file. We constantly measure the incoming bandwidth at every end host. The experiment lasts for one minute. We use the results from the middle 40 seconds to calculate the average bisection bandwidth. We also implement a static hash-based ECMP and a modified version of flow-level VLB in the test bed. In the ECMP implementation, a flow is forwarded according to a hash of the source and destination's IP addresses and TCP ports. Because flow-level VLB randomly chooses a core switch to forward a flow, it can also introduce collisions at output ports, as ECMP does.

We also measure DARD's large file transfer times under dynamic traffic patterns. We vary each source-destination pair's flow generating rate from 1 to 10 flows per second. Each elephant flow is a TCP connection transferring a 128MB file. We use a fixed file length for all the flows, instead of lengths uniformly distributed between 100MB and 1GB, because we need to differentiate whether finishing a flow earlier is due to a better path selection or a smaller file. The experiment lasts for five minutes. We track the start and the end time of every elephant flow and calculate the average file transfer time. We run the same experiment with ECMP and calculate the improvement of DARD over ECMP using formula (2), where avg_T_ECMP is the average file transfer time using ECMP and avg_T_DARD is the average file transfer time using DARD:

improvement = (avg_T_ECMP - avg_T_DARD) / avg_T_ECMP    (2)
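Formula (2) is the usual relative improvement; a one-line helper (illustrative) makes the sign convention explicit: positive values mean DARD finished transfers faster.

    def improvement(avg_t_ecmp, avg_t_dard):
        """Relative improvement of DARD over ECMP, formula (2)."""
        return (avg_t_ecmp - avg_t_dard) / avg_t_ecmp

    # If ECMP takes 20 s on average and DARD takes 17 s, DARD improves by 15%.
    print(improvement(20.0, 17.0))  # 0.15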

Figure 8 shows the improvement vs. the flow generating rate under different traffic patterns. For the stride traffic pattern, DARD outperforms ECMP because DARD moves flows from overloaded paths to underloaded ones and increases the minimum flow throughput in every step.

We find that random traffic and staggered traffic share an interesting pattern. When the flow generating rate is low, ECMP and DARD have almost the same performance because bandwidth is over-provisioned. As the flow generating rate increases, cross-pod flows congest the switch-to-switch links, in which case DARD reallocates the flows sharing the same bottleneck and improves the average file transfer time. When the flow generating rate becomes even higher, the host-switch links are occupied by flows within the same pod and thus become the bottlenecks, in which case DARD helps little.

However, under the staggered traffic pattern, Hedera achieves less bisection bandwidth than DARD. This is because current Hedera only schedules the flows going through the core. When intra-pod traffic is dominant, Hedera degrades to ECMP. We believe this issue can be addressed by introducing new neighbor-generating functions in Hedera. MPTCP outperforms DARD by more completely exploring the path diversity. However, it achieves less bisection bandwidth than Hedera. We suspect this is because the current MPTCP ns-2 implementation does not support MPTCP-level retransmission; thus, lost packets are always retransmitted on the same path regardless of how congested the path is. We leave a further comparison between DARD and MPTCP to future work.

Figure 8: File transfer improvement. Measured on the test bed.

5.3 Simulation Results


To fully understand DARD's advantages and disadvantages, besides comparing DARD with ECMP and pVLB, we also compare DARD with Hedera, TeXCP and MPTCP in our simulations. We implement both the demand-estimation and the simulated annealing algorithms described in Hedera and set its scheduling interval to 5 seconds [11]. In the TeXCP implementation, each ToR switch pair maintains the utilizations of all the available paths connecting the two of them by periodic probing (the default probe interval is 200ms; however, since the RTT in a datacenter is on the order of 1ms or even smaller, we decrease this probe interval to 10ms). The control interval is five times the probe interval [20]. We do not implement the flowlet [25] mechanism in the simulator. As a result, each ToR switch schedules traffic at the packet level. We use MPTCP's ns-2 implementation [7] to compare with DARD. Each MPTCP connection uses all the simple paths connecting the source and the destination.

Figure 9: Bisection bandwidth under different static traffic patterns. Simulated on a p = 16 fat-tree.

We also measure the large file transfer time to compare DARD with other approaches on a fat-tree topology with 1024 hosts (p = 16). We assign each elephant flow from a dynamic random traffic pattern a unique index and compare its transmission times under different traffic scheduling approaches. Each experiment lasts for 120s in ns-2. We define T_i^m to be flow i's transmission time when the underlying traffic scheduling approach is m, e.g., DARD or ECMP. We use every T_i^ECMP as the reference and calculate the improvement of file transfer time for every traffic scheduling approach according to formula (3). The improvement_i^ECMP is therefore 0 for all flows.

improvement_i^m = (T_i^ECMP - T_i^m) / T_i^ECMP    (3)

5.3.1 Performance Improvement


We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16) to evaluate whether DARD can fully utilize the bisection bandwidth in a larger topology. Figure 9 shows the result. DARD achieves higher bisection bandwidth than both ECMP and pVLB under all three traffic patterns. This is because DARD monitors all the paths connecting a source and destination pair and moves flows to underloaded paths to smooth out the collisions caused by ECMP. TeXCP and DARD achieve similar bisection bandwidth; we compare these two approaches in detail later. As a centralized method, Hedera outperforms DARD under both the stride and random traffic patterns.

Figure 10 shows the CDF of the above improvement. ECMP and pVLB have essentially the same performance, since ECMP transfers half of the flows faster than pVLB and the other half slower. Hedera outperforms MPTCP because Hedera achieves higher bisection bandwidth. Even though DARD and TeXCP achieve the same bisection bandwidth under the dynamic random traffic pattern (Figure 9), DARD still outperforms TeXCP. We further measure every elephant flow's retransmission rate, defined as the number of retransmitted packets over the number of unique packets. Figure 11 shows that TeXCP has a higher retransmission rate than DARD. In other words, even though TeXCP can achieve a high bisection bandwidth,

some of its packets are retransmitted because of reordering, and thus its goodput is not as high as DARD's.


Figure 12: DARD converges to a stable state in 2 or 3 control intervals given static traffic patterns. Simulated on a p = 16 fat-tree.

Figure 10: Improvement of large file transfer time under the dynamic random traffic pattern. Simulated on a p = 16 fat-tree.

Figure 13 shows the link utilizations on the 8 output ports of the first core switch. We can see that after the initial oscillation the link utilizations stabilize.

Figure 11: DARD and TeXCP's TCP retransmission rates. Simulated on a p = 16 fat-tree.

Figure 13: The first core switch's output port utilizations under the dynamic random traffic pattern. Simulated on a p = 8 fat-tree.

5.3.2 Convergence Speed


As a distributed path selection algorithm, DARD provably converges to a Nash equilibrium (Appendix B). However, if the convergence takes a significant amount of time, the network will be underutilized during the process. As a result, we measure how fast DARD converges to a stable state, in which every flow stops changing paths. We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16). For each source and destination pair, we vary the number of elephant flows from 1 to 64, which is the number of core switches. We start these elephant flows simultaneously and track the time when all the flows stop changing paths. Figure 12 shows the CDF of DARD's convergence time, from which we can see that DARD converges in less than 25s in more than 80% of the cases. Given that DARD's control interval at each end host is roughly 10s, the entire system converges in less than three control intervals.

5.3.3 Stability
Load-sensitive adaptive routing can lead to oscillation. The main reason is that different sources move flows to underloaded paths in a highly synchronized manner. To prevent this oscillation, DARD adds a random span of time to each end host's control interval. This section evaluates the effects of this simple mechanism. We use the dynamic random traffic pattern on a fat-tree with 128 end hosts (p = 8) and track the output link utilizations at the core, since a core's output link is usually the bottleneck for an inter-pod elephant flow [11].

However, we cannot simply conclude that DARD does not cause path oscillations, because link utilization is an aggregated metric and misses each individual flow's behavior. We first disable the random time span added to the control interval and log every flow's path selection history. We find that even though the link utilizations are stable, certain flows are constantly moved between two paths, e.g., one 512MB elephant flow is moved between two paths 23 times in its life cycle. This indicates that path oscillation does exist in load-sensitive adaptive routing. After many attempts, we choose to add a random span of time to the control interval to address the above problem. Figure 14 shows the CDF of how many times flows change their paths in their life cycles. For the staggered traffic, around 90% of the flows stick to their original path assignment. This indicates that when most of the flows are within the same pod or even the same ToR switch, the bottleneck is most likely located at the host-switch links, in which case little path diversity exists. On the other hand, for the stride traffic, where all flows are inter-pod, around 50% of the flows do not change their paths, and the other 50% change their paths fewer than 4 times. This small number of path changes indicates that DARD is stable and no flow changes its path back and forth.

5.3.4 Control Overhead


To evaluate DARD's communication overhead, we trace the control messages for both DARD and Hedera on a fat-tree with 128 hosts (p = 8) under the static random traffic pattern.

6. CONCLUSION

Figure 14: CDF of the number of times flows change their paths under dynamic traffic patterns. Simulated on a p = 8 fat-tree.

This paper proposes DARD, a readily deployable, lightweight distributed adaptive routing system for datacenter networks. DARD allows each end host to selfishly move elephant flows from overloaded paths to underloaded paths. Our analysis shows that this algorithm converges to a Nash equilibrium in finite steps. Test bed emulation and ns-2 simulation show that DARD outperforms random flow-level scheduling when the bottlenecks are not at the edge, and outperforms centralized scheduling when intra-pod traffic is dominant.

DARD's communication overhead is mainly introduced by the periodic probes, including both queries from hosts and replies from switches. This communication overhead is bounded by the size of the topology, because in the worst case the system needs to process all-pair probes. For Hedera's centralized scheduling, however, ToR switches report elephant flows to the centralized controller and the controller further updates some switches' flow tables. As a result, the communication overhead is bounded by the number of flows. Figure 15 shows how much bandwidth is taken by control messages given different numbers of elephant flows. As the number of elephant flows increases, there are three stages. In the first stage (between 0K and 1.5K in this example), DARD's control messages take less bandwidth than Hedera's. The reason is mainly that Hedera's control messages are larger than DARD's (in Hedera, the payload of a message from a ToR switch to the controller is 80 bytes and the payload of a message from the controller to a switch is 72 bytes; the corresponding numbers are 48 bytes and 32 bytes for DARD). In the second stage (between 1.5K and 3K in this example), DARD's control messages take more bandwidth, because the sources are probing for the states of all the paths to their destinations. In the third stage (more than 3K in this example), DARD's probe traffic is bounded by the topology size. However, Hedera's communication overhead does not increase proportionally to the number of elephant flows. That is mainly because when the traffic pattern is dense enough, even centralized scheduling cannot easily find an improved flow allocation, and thus few messages are sent from the controller to the switches.

7. ACKNOWLEDGEMENT
This material is based upon work supported by the National Science Foundation under Grant No. 1040043.

8. REFERENCES
[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2.
[2] Cisco Data Center Infrastructure 2.5 Design Guide. http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCI_SRND_2_5_book.html.
[3] IP alias limitation in Linux kernel 2.6. http://lxr.free-electrons.com/source/net/core/dev.c#L935.
[4] IP alias limitation in Windows NT 4.0. http://support.microsoft.com/kb/149426.
[5] Microsoft Windows Azure. http://www.microsoft.com/windowsazure.
[6] DeterLab. http://www.isi.deterlab.net/.
[7] ns-2 implementation of MPTCP. http://www.jp.nishida.org/mptcp/.
[8] OpenFlow switch specification, version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[9] TCPTrack. http://www.rhythm.cx/~steve/devel/tcptrack/.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63-74, 2008.
[11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, Apr. 2010.
[12] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th Annual Conference on Internet Measurement, IMC '10, pages 267-280, New York, NY, USA, 2010. ACM.
[13] J.-Y. Le Boudec. Rate adaptation, congestion control and fairness: A tutorial, 2000.
[14] C. Busch and M. Magdon-Ismail. Atomic routing games on maximum congestion. Theor. Comput. Sci., 410(36):3337-3347, 2009.
[15] B. Claise. Cisco Systems NetFlow Services Export Version 9. RFC 3954, 2004.
[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39(4):51-62, 2009.
[17] T. Greene. Researchers show off advanced network control technology. http://www.networkworld.com/news/2008/102908-openflow.html.
[18] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: towards an operating system for networks. SIGCOMM Comput. Commun. Rev., 38(3):105-110, 2008.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992, 2000.
[20] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In Proc. ACM SIGCOMM, 2005.
[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In IMC '09: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 202-208, New York, NY, USA, 2009. ACM.
[22] A. Khanna and J. Zinky. The revised ARPANET routing metric. In Symposium Proceedings on Communications Architectures & Protocols, SIGCOMM '89, pages 45-56, New York, NY, USA, 1989. ACM.
[23] M. Kodialam, T. V. Lakshman, and S. Sengupta. Efficient and robust routing of highly variable traffic. In Proceedings of the Third Workshop on Hot Topics in Networks (HotNets-III), 2004.
[24] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun. Rev., 39(4):39-50, 2009.
[25] S. Sinha, S. Kandula, and D. Katabi. Harnessing TCP's burstiness using flowlet switching. In 3rd ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets), San Diego, CA, November 2004.
[26] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, implementation and evaluation of congestion control for multipath TCP. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), Berkeley, CA, USA, 2011. USENIX Association.

Figure 15: DARD's and Hedera's communication overhead. Simulated on a p = 8 fat-tree.

[27] X. Yang. NIRA: a new Internet routing architecture. In FDNA '03: Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture, pages 301-312, New York, NY, USA, 2003. ACM.

Appendix A. EXPLANATION OF THE OBJECTIVE


We assume TCP is the dominant transport protocol in the datacenter, and that it tries to achieve max-min fairness when combined with fair queuing. Each end host moves flows from overloaded paths to underloaded ones to increase its observed minimum fair share (link l's fair share is defined as the link capacity, C_l, divided by the number of elephant flows, N_l). This section explains why, given a max-min fair bandwidth allocation, the global minimum fair share is a lower bound of the global minimum flow rate; thus increasing the minimum fair share actually increases the global minimum flow rate.

Theorem 1. Given a max-min fair bandwidth allocation, for any network topology and any traffic pattern, the global minimum fair share is a lower bound of the global minimum flow rate.

First we define a bottleneck link according to [13]. A link l is a bottleneck for a flow f if and only if (a) link l is fully utilized, and (b) flow f has the maximum rate among all the flows using link l. Given a max-min fair bandwidth allocation, link l_i has the fair share S_i = C_i / N_i. Suppose link l_0 has the minimum fair share S_0, flow f has the minimum flow rate min_rate, and link l_f is flow f's bottleneck. Theorem 1 claims min_rate >= S_0. We prove this theorem by contradiction. According to the bottleneck definition, min_rate is the maximum flow rate on link l_f, and thus C_f / N_f <= min_rate. Suppose min_rate < S_0; then we get

C_f / N_f < S_0    (A1)

FS_f(s) = PS_p(s), where flow f is using path p. The notation s^{-k} refers to the strategy s without p^{f_k}_{i_k}, i.e., [p^{f_1}_{i_1}, ..., p^{f_{k-1}}_{i_{k-1}}, p^{f_{k+1}}_{i_{k+1}}, ..., p^{f_{|F|}}_{i_{|F|}}]. (s^{-k}, p^{f_k}_i) refers to the strategy [p^{f_1}_{i_1}, ..., p^{f_{k-1}}_{i_{k-1}}, p^{f_k}_i, p^{f_{k+1}}_{i_{k+1}}, ..., p^{f_{|F|}}_{i_{|F|}}]. Flow f_k is locally optimal in strategy s if

FS_{f_k}(s).S >= FS_{f_k}((s^{-k}, p^{f_k}_i)).S    (B1)

Inequality (A1) contradicts S_0 being the minimum fair share. As a result, the minimum fair share is a lower bound of the global minimum flow rate. In DARD, every end host tries to increase its observed minimum fair share in each round; thus the global minimum fair share increases, and so does the global minimum flow rate.
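The contradiction can be written as a single chain of inequalities; the following LaTeX fragment restates the argument above, with S_f denoting the fair share of flow f's bottleneck link l_f.

    % Proof of Theorem 1 as one chain of inequalities.
    % Assume, for contradiction, that min_rate < S_0.
    \begin{align*}
      S_f = \frac{C_f}{N_f} \;\le\; \text{min\_rate} \;<\; S_0 ,
    \end{align*}
    % i.e., the bottleneck link l_f would have a fair share strictly below S_0,
    % contradicting the choice of S_0 as the global minimum fair share.
    % Hence min_rate >= S_0.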

Appendix B. CONVERGENCE PROOF

We now formalize DARD's flow scheduling algorithm and prove that it converges to a Nash equilibrium in a finite number of steps. The proof is a special case of a congestion game [14], which is defined as (F, G, {p^f}_{f in F}). F is the set of all the flows. G = (V, E) is a directed graph. p^f is the set of paths that can be used by flow f. A strategy s = [p^{f_1}_{i_1}, p^{f_2}_{i_2}, ..., p^{f_{|F|}}_{i_{|F|}}] is a collection of paths, in which the i_k-th path in p^{f_k}, denoted p^{f_k}_{i_k}, is used by flow f_k. For a strategy s and a link j, the link state LS_j(s) is a triple (C_j, N_j, S_j), as defined in Section 3.4. For a path p, the path state PS_p(s) is the link state with the smallest fair share over all links in p. The system state SysS(s) is the link state with the smallest fair share over all links in E. A flow state FS_f(s) is the corresponding path state, i.e., FS_f(s) = PS_p(s), where flow f uses path p.

Flow f_k is locally optimal in strategy s if inequality (B1) holds for all p^{f_k}_i in p^{f_k}. A Nash equilibrium is a state where all flows are locally optimal. A strategy s* is globally optimal if, for any strategy s, SysS(s*).S >= SysS(s).S.

Theorem 2. If there is no synchronized flow scheduling, the selfish path selection algorithm increases the minimum fair share round by round and converges to a Nash equilibrium in a finite number of steps. The globally optimal strategy is also a Nash equilibrium strategy.

For a strategy s, the state vector SV(s) = [v_0(s), v_1(s), v_2(s), ...], where v_k(s) is the number of links whose fair share lies in [k*δ, (k+1)*δ), and δ is a positive parameter, e.g., 10 Mbps, used to cluster links into groups. As a result, sum_k v_k(s) = |E|. A small δ groups the links at a fine granularity and increases the minimum fair share, while a large δ improves the convergence speed. Suppose s and s' are two strategies, with SV(s) = [v_0(s), v_1(s), v_2(s), ...] and SV(s') = [v_0(s'), v_1(s'), v_2(s'), ...]. We define s = s' when v_k(s) = v_k(s') for all k >= 0, and s < s' when there exists some K such that v_K(s) < v_K(s') and v_k(s) <= v_k(s') for all k < K. It is easy to show that, given three strategies s, s' and s'', if s <= s' and s' <= s'', then s <= s''. Given a congestion game (F, G, {p^f}_{f in F}) and δ, there are only a finite number of state vectors. According to the definitions of = and <, we can find at least one strategy s* that is the smallest, i.e., for any strategy s, s* <= s. It is easy to see that this s* has the largest minimum fair share, or has the fewest links with the minimum fair share, and is thus globally optimal. If only one flow f selfishly changes its route to improve its fair share, making the strategy change from s to s', this action decreases the number of links with small fair shares and increases the number of links with larger fair shares; in other words, s' < s. This indicates that asynchronous and selfish flow movements actually increase the global minimum fair share round by round until all flows reach their locally optimal states. Since the number of state vectors is finite, the number of steps needed to converge to a Nash equilibrium is finite. Moreover, because s* is the smallest strategy, no flow can make a further movement to decrease s*, i.e., every flow is in its locally optimal state. Hence this globally optimal strategy s* is also a Nash equilibrium strategy.
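The sketch below illustrates the state vector and the ordering used in the proof; the link representation and δ value are illustrative assumptions.

    from collections import Counter

    def state_vector(link_fair_shares, delta=10e6, num_buckets=200):
        """SV(s): bucket k counts links whose fair share lies in
        [k*delta, (k+1)*delta)."""
        counts = Counter(int(s // delta) for s in link_fair_shares)
        return [counts.get(k, 0) for k in range(num_buckets)]

    def strictly_smaller(sv_a, sv_b):
        """s < s': at the first bucket where the vectors differ, s has
        fewer links (earlier buckets hold the smaller fair shares)."""
        for va, vb in zip(sv_a, sv_b):
            if va < vb:
                return True
            if va > vb:
                return False
        return False

    # Moving a flow off a congested link raises that link's fair share,
    # shifting one link from a low bucket to a higher one.
    before = state_vector([50e6, 200e6, 400e6])   # one link near 50 Mbps
    after  = state_vector([80e6, 200e6, 330e6])   # after one selfish move
    print(strictly_smaller(after, before))        # True: the strategy improved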
