PODC/SPAA Tutorial

Sunday, June 28, 1998
15:00 - 17:00

Algorithmic Problems in Internet Research
George Varghese
Washington University in St. Louis

The first part of this tutorial starts by describing the basic Internet Protocols (e.g., TCP/IP), goes on to describe some of the newer protocols being proposed (e.g., RSVP, Mobile IP, IP Multicast), and some of the newer problems for which protocols must be designed (e.g., accounting, reliable multicast, combining multicast, mobility, and policy routing). The tutorial describes some of the problems with existing protocols, and the need for careful designs and proofs for the newer protocols. The PODC community may be interested in making contributions in this area.

The second part of the tutorial will describe specific bottlenecks in Internet implementations that are important today. Effective solutions in this area appear to require somewhat different thinking from that suggested by standard algorithmic approaches. The tutorial will also outline how parallel approaches may be useful for some of the harder problems (e.g., filter matching, weighted fair queuing) faced by routers today. Such problems may be of interest to the SPAA community.

Part 1: Internet Protocols, Old and New

We start with the basic routing protocols that undergird the Internet. Several Internet domains (e.g., your university or company campus) use a Bellman-Ford variant called RIP to compute shortest paths within the domain. RIP can exhibit interesting behavior when nodes fail. Some domains use a protocol called OSPF which is based on all nodes reporting neighbor information to all other nodes in the domain. The domains are connected by a backbone consisting of multiple Internet Service Providers (ISPs). Since the providers may have differing goals, simple textbook shortest path routing is insufficient. Instead, backbone routing requires policy routing by which the possibly conflicting goals of the providers are resolved to provide loop-free routes. It would be nice to have a correctness proof for BGP, the currently used policy routing protocol.

Many existing Local Area Networks allow a single copy of a message to be sent to multiple recipients on the same wire, almost as an accident of the technology. The simplicity and efficiency of this model has led to IP multicast by which a sender can send a single message to multiple receivers spread over the Internet. Multicast is currently embodied in the Internet via the Multicast Backbone (MBONE), which is a virtual network of multicast capable routers connected by tunnels. The first multicast protocol, DVMRP, is based on a clever algorithmic idea called reverse path forwarding. Because DVMRP effectively computes a separate multicast tree for each source in a group, a simpler approach called Core based Trees (CBT) has been proposed which uses a shared tree for each group. CBT involves choosing a core router for each group which serves as the root of the group multicast tree. More recent proposals like PIM involve dynamic switching between CBT and DVMRP-like protocols. The behaviors of PIM and CBT, especially during failures, are not well understood and could benefit from careful models and proofs. Internet designers are also grappling with the issues of distributed multicast address allocation and the integration of multicast, mobility, and policy routing.

With the prevalence of laptops and PDAs, it is important to route packets to mobile users. Mobile IP does so using a home agent that proxies for the roaming user and forwards packets for the user to the new location. While this is a simple model, there are interesting interactions between (say) multicasting and mobility that are worthy of study. More fundamentally, are there radically different approaches to mobile routing than mobile IP (which is a simple overlay on top of ordinary IP)? A particular problem is that both multicasting and mobility break down the notion of hierarchical addressing, by which a group of contiguously located nodes are given addresses that have a common pattern. Hierarchical addressing has, thus far, been our only approach to building scalable global networks. The Internet is planning a transition to the next generation (IPv6). Besides the obvious need for expanded addresses, there is a chance to redesign the basic infrastructure.

Currently, the Internet does not provide guarantees. Your file transfer may usually take milliseconds but can take several minutes during periods of congestion. However, once users begin to pay for service and start using applications like Internet video (for which a late packet is as good as a lost packet), the Internet may have to provide service guarantees for certain traffic types. RSVP is a proposed standard by which receivers can make reservations. There are interesting issues in resource reservation such as the granularity of reservations, scalability, and heterogeneity. The harder QoS routing problem asks whether one can make efficient routing decisions based not just on shortest paths, but also on other metrics such as bandwidth, latency, and error rate. It seems clear that reservations must be paid for; this introduces the dimension of accounting. Are there reasonable accounting structures that make economic sense and yet do not require routers to do a lot of extra processing? Finally, sceptics ask whether guarantees can be more simply provided by overprovisioning in a world of cheap bandwidth.

Layered over the Internet Routing protocols, are the Transport Protocols, TCP and UDP. TCP is akin to certified mail and comes with a return receipt, while UDP is like ordinary postal mail. The basic sliding window and connection management strategies used by TCP are well studied. Of greater interest are the congestion control strategies that have been embodied into TCP that allows a sender to slow down in response to congestion. While there are hordes of simulation studies, a probablistic or competitive analysis of such algorithms would be of great interest.

Traditional transport protocols like TCP work between a sender and a receiver. The availability of IP multicast together with applications that wish to send content reliably to thousands of receivers (e.g., publishing, content push) has led to strong commercial interest in reliable multicast protocols. The seminal protocol in this field is the Scalable Reliable Multicast (SRM) Protocol that relies on negative acknowledgements and randomized timers for scalability. Recently, Cisco has been proposing the use of Pragmatic General Multicast (PGM) which enlists router assistance to aggregate negative acknowledgements. Do these protocols work in all cases? Can we describe precisely the guarantees they offer, especially in the face of failures? Are there better protocols?

Besides a list of interesting protocols and problems, the first part of this tutorial will also attempt to articulate some important goals for Internet protocols. Often theoretically sound protocols are rejected by the community because they do not conform to the Internet design philosophy. Goals for new protocols include scalability, adaptability, incremental deployment, handling heterogeneity, modularity, and compatibility with all existing protocols.

Part 2: Internet Implementation Bottlenecks

We describe a set of well defined algorithmic problems. Effective solutions to these problems can allow increased flexibility for routers while allowing them to keep up with increased traffic. With link speeds in the backbone already increasing to 622 Mbit/sec to keep up with demand, it is important for packet processing to keep up with higher link speeds.

First, prefixes are used in the Internet today to aggregate forwarding entries at routers. Such aggregation reduces memory and control traffic, but requires that every router perform a longest matching prefix to process each received packet. Internet backbone routers have a database of roughly 40,000 prefixes, each of which is a binary string of anywhere from 8 to 32 bits. When your computer sends an internet message to say Joe@JoeHost, your computer asks something akin to directory assistance (called DNS or the Domain Name Service) to translate the host name (JoeHost) to a 32 bit Internet address. Each packet then carries a 32 bit IP destination address D; a router must find the best matching prefix corresponding to D in its database of prefixes in (hopefully) hundreds of nanoseconds.

Standard solutions do not work well. Caching, the old warhorse of system designers, does not do well because cache hit rates are often low because of lack of locality possibly caused by Web surfing. Standard solutions based on Patricia tries require up to 32 accesses to memory. The last year has seen a number of new solutions to this problem. One solution is based on using multibit tries but using a form of node compression which allows the database to fit into a small amount of memory in the machine cache (in software) or SRAM (in hardware). A second solution is based on doing binary search on the set of possible prefix lengths; this allows a worst case time of 5 memory access for 32 bit IPv4 addresses and 7 accesses for 128 bit IPv6 addresses. Some vendors have also proposed Tag Switching and IP Switching; the basic idea in these proposals is to finesse the need for a lookup by having the previous hop router pass an index into the forwarding table of the next hop router.

To complicate matters further, there is a recent trend to provide what is sometimes called Level 4 Switching or "service differentiation". For example, this can be used to give important traffic preferential treatment, to reserve bandwidth for traffic flowing between two company sites, and to block traffic from dangerous external sites as in firewalls. This requires packet forwarding decisions to be made not just on the destination IP address but possibly also based on TCP headers (e.g., port numbers can be used to infer the application type), routing source addresses, and even application headers.

In its most general form, the database consists of a sequence of filters that specify matches on packet fields (matches could be exact matches, prefix matches, or range matches). In case a given packet matches multiple filters, we wish to return the first matching filter. Since IP lookups at Gigabit rates are considered hard, solving the packet filtering problem at high speeds will be an important challenge. We note that most firewalls today, which use only a small number of filter rules, are rather slow.

A third interesting problem is that of scheduling at output links to provide service guarantees. Consider the example of a number of chefs who share an oven; some of the chefs prepare fast food and require a guaranteed response time, others make restaurant meals and need adequate response times, and still others make frozen dinners in bulk for which throughput is more important. In our metaphor, the oven corresponds to an output link at a router, the fast food chef corresponds to say video, the restaurant chef to say remote login, and the frozen food chef to File Transfer.

The problem is to have a fast scheduler at the router, with decision times comparable to a lookup time, that can provide guaranteed response times for delay critical traffic, and yet provide fair throughput for other kinds of traffic. The problem is distinguished from standard real time schedulers such as EDF by the throughput fairness requirement. There are a number of good solutions such as Weighted Fair Queuing and Worst Case Weighted Fair Queuing, but finding faster solutions is still an interesting research problem.

Besides these three important problems, there are a number of other algorithmic problems that arise in protocol implementations. These include computing fast checksums, load balancing, and sequence number bookkeeping. Besides the specific problems and potential opportunities for interesting parallel solutions, this tutorial will highlight why good solutions for such problems sometimes differ from standard algorithmic practice. First, it is important to take advantage of system structure to either leverage (or altogether finesse) algorithmic problems whenever possible. Second, the high cost of accessing memory versus cache implies that the most important speed metric is the number of memory accesses; this in turn leads to restructuring data structures for better memory locality. Third, insertion costs are sometimes less important than search costs; this allows solutions that use large amounts of precomputation. Finally, when designing nanosecond algorithms, constant factors are crucial; these can only be determined by actual implementation.

The above list of problems and directions is a personal list that is clearly conditioned by the presentor's background and interests. However, the major point is that there are interesting algorithmic problems in Internet research. With the popularity of the Internet, good solutions to these problems --- based on sound theory but validated by implementation and experiments --- can make a difference.

This page maintained by Gil Neiger.
Last modified: April 28, 1998