TAPS Working Group                                             P. Tiesel
Internet-Draft                                               T. Enghardt
Intended status: Informational                                 TU Berlin
Expires: January 3, 2019                                   July 02, 2018


A Socket Intents Prototype for the BSD Socket API - Experiences, Lessons
                      Learned and Considerations
             draft-tiesel-taps-socketintents-bsdsockets-02

Abstract

   This document describes a prototype implementation of Socket Intents [I-D.tiesel-taps-socketintents] for the BSD Socket API as an illustrative example of how Socket Intents could be implemented. It describes the experiences gained with the prototype and the lessons learned from trying to extend the BSD Socket API.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).
   Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Prototype Architecture
   3.  Multiple Access Manager
     3.1.  Policy
     3.2.  Path characteristics data collectors
   4.  Socket Intents Representation
   5.  The Socket Intents API Variants
     5.1.  Classic API / muacc_context
       5.1.1.  muacc_getaddrinfo()
       5.1.2.  muacc_socket()
       5.1.3.  muacc_setsockopt()
       5.1.4.  muacc_connect()
       5.1.5.  muacc_close()
     5.2.  Classic API / getaddrinfo
     5.3.  Socketconnect API
   6.  API Implementation Experiences & Lessons Learned
     6.1.  The Missing Link to Name Resolution
     6.2.  File Descriptors Considered Harmful
     6.3.  Asynchronous API Anarchy
     6.4.  Here Be Dragons hiding in Shadow Structures
   7.  Conclusion
   8.  Acknowledgments
   9.  References
     9.1.  Informative References
     9.2.  URIs
   Appendix A.  API Usage Examples
     A.1.  Usage Example of the Classic / muacc_context API
     A.2.  Usage Example of the Classic / getaddrinfo API
     A.3.  Usage Example of the Socketconnect API
   Appendix B.  Changes
     B.1.  Since -01
     B.2.  Since -00
   Authors' Addresses

1.  Introduction

   With the proliferation of devices that have multiple paths to the Internet and an increasing number of transport protocols available, the number of transport options to serve a communication unit explodes. Requiring each application to implement its own heuristic or strategy for choosing from this overwhelming set of transport options puts a huge burden on the application developer.
   Thus, the decisions regarding all transport options mentioned so far should be supported and, if requested by the application, automated within the transport layer.

   Socket Intents [I-D.tiesel-taps-socketintents] allow an application to express what it knows, assumes, expects, or wants to prioritize regarding its own network communication. This information can then be used by the OS to perform destination selection, path selection, and transport protocol stack instance selection.

   Our Socket Intents prototype for the BSD Socket API is a first attempt to automate transport option selection within the OS. It primarily targets path and destination address selection and tries to stay as close as possible to the semantics of the BSD Socket API. The prototype mostly excludes the problem of transport protocol stack instance selection, which is discussed in more detail in [I-D.tiesel-taps-communitgrany].

   We implemented the prototype as a wrapper for the BSD Socket API that communicates with a central Multiple Access Manager, which makes the actual decisions and can optimize across applications. The whole implementation comprises about 15k lines of C code. The code is available on GitHub [1] under a BSD license.

   This document describes our Socket Intents prototype for the BSD Socket API. It details important aspects of the implementation and the API variants we developed over time based on lessons learned. Finally, it summarizes these lessons and points out why the BSD Socket API is not particularly well suited to integrating automated transport protocol stack instance selection. Furthermore, it describes the limitations for destination address and path selection within the BSD Socket API.

2.  Prototype Architecture

   The Socket Intents prototype consists of the following components, also shown in Figure 1:

   o  The Socket Intents API, a BSD Socket API wrapper for applications to use, including a representation of the actual Socket Intents.

   o  The Socket Intents Library, which implements the Socket Intents API. It sends requests to the Multiple Access Manager, e.g., before establishing a connection, and gets back a response indicating which interface to use.

   o  The Multiple Access Manager (MAM), a daemon which is informed about all application requests and has knowledge of the available network interfaces.

   o  The Policy, a dynamically loaded library hosted by the MAM. It chooses which of the available interfaces to use based on the available knowledge about them and the Socket Intents.

   o  Data collectors that reside inside the MAM and provide information such as bandwidth usage, smoothed RTT estimates, and RSSI for wireless links to the policy.

   +------------------------+
   |      Application       |
   |                        |                   +-------------------+
   +-{ Socket Intents API }-+   (MAM Request)   | Multiple Access   |
   |                        | ----------------> | Manager           |
   |     Socket Intents     |  (MAM Response)   | +---------------+ |
   |        Library         | <---------------- | |    Policy     | |
   +------------------------+                   | +---------------+ |
   |      BSD Sockets       |                   | |Data Collectors| |
   +------------------------+                   +-+---------------+-+

          Figure 1: Components of the Socket Intents Prototype

3.  Multiple Access Manager

   The Multiple Access Manager (MAM) is the central transport option selection instance on a host. It is realized as a daemon that runs in userspace and receives requests from each application that uses the Socket Intents Library.

   The MAM hosts the Policy, which is the actual decision-making component, e.g., deciding which source address and therefore which source interface to use.
   Upon events, such as an application requesting to resolve a name or to connect a socket (see Section 5 for details), the Socket Intents Library issues a MAM request, and the MAM invokes a callback to the policy (see Section 3.1 for details), which can either communicate its decision right away or defer it, e.g., when it has to wait for the results of name resolution. The results and decisions are communicated back to the Socket Intents Library through the MAM response, where they are applied to the actual socket; see also Figure 1.

   To support the policy, the MAM maintains a list of IP prefixes that are configured on the local interfaces and available for outgoing communication. As destination address selection and path selection are highly dependent on each other, the MAM integrates DNS resolution and maintains separate resolver configurations per prefix (see [ANRW17-MH] for further discussion of multiple PvDs and DNS resolution). Furthermore, the MAM includes data collectors which periodically gather statistics on the available paths; see Section 3.2 for details.

3.1.  Policy

   In the Socket Intents prototype, the Policy implements the decision logic for selecting among available transport options. In our current implementation, only one policy can be active at a given time. We implement different interchangeable policies as dynamically loaded libraries, which are hosted by the Multiple Access Manager (MAM); see Figure 1. When launching the MAM, the user has to choose a policy and supply a policy configuration, which can contain additional information to configure the policy.
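   Because policies are interchangeable, dynamically loaded libraries, each one has to export some interface that the MAM can call. The following C sketch shows one possible shape of such an interface. The names "policy_ops", "prefix_view", "sample_policy", and the two-valued category enum are hypothetical and not taken from the prototype; a MAM-like host could resolve the exported symbol after loading the library, e.g., with dlopen() and dlsym().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one local prefix as a policy sees it. */
struct prefix_view {
    const char *name;         /* label of the prefix or interface */
    double min_srtt_ms;       /* e.g. minimum SRTT from the data collectors */
    double max_bytes_per_s;   /* e.g. maximum observed bytes per second */
};

/* Hypothetical, reduced set of intent values used only in this sketch. */
enum category_sketch { CAT_QUERY, CAT_BULK };

/* Hypothetical interface a dynamically loaded policy could export. */
struct policy_ops {
    const char *policy_name;
    /* Returns the index of the preferred prefix, or -1 if there is none. */
    int (*choose_prefix)(enum category_sketch cat,
                         const struct prefix_view *prefixes, size_t n);
};

/* Queries go to the prefix with the lowest SRTT estimate,
 * bulk transfers to the prefix with the highest observed rate. */
static int choose_by_category(enum category_sketch cat,
                              const struct prefix_view *p, size_t n)
{
    assert(p != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        /* Short-circuit: p[best] is only read once best is valid. */
        int better = (best < 0) ||
            (cat == CAT_QUERY
                 ? p[i].min_srtt_ms < p[best].min_srtt_ms
                 : p[i].max_bytes_per_s > p[best].max_bytes_per_s);
        if (better)
            best = (int)i;
    }
    return best;
}

/* Exported symbol the host would look up after loading the library. */
const struct policy_ops sample_policy = {
    .policy_name = "category-sketch",
    .choose_prefix = choose_by_category,
};
```

   Keeping the decision logic behind a small table of function pointers is one way to let the host swap policies without recompiling it; the selection rule itself (lowest SRTT for queries, highest observed rate for bulk transfers) is only an illustrative assumption.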
   Examples of policy configuration include:

   o  A list of IP prefixes configured on local interfaces to consider as sources for the communication

   o  Name server(s) to use for each of the IP prefixes

   o  Preferences to instrument the policy, e.g., the default prefix to use

   The policy is initialized with this configuration and then waits for the callback of an incoming MAM request.

   Upon a callback, the policy can use information from the MAM request, such as Socket Intents, and information available within the MAM, such as recently measured path characteristics (see Section 3.2), to make decisions.

   Policy decisions can include:

   o  The source address(es) used for name resolution

   o  How to order the results of name resolution (i.e., preferring certain IP addresses over others)

   o  Picking an IP protocol version

   o  Picking a transport protocol (note that in our current implementation, we are constrained by the Socket API, so our policy cannot override the transport protocol chosen by an application)

   o  Setting socket options (e.g., disabling Nagle's algorithm for TCP)

   o  Choosing a source address for the outgoing communication

   o  Reusing a socket from a given socket set (only for the API variant described in Section 5.3)

   Note that in our current implementation, the policy is a piece of code which can in principle execute arbitrary instructions. We assume this is acceptable for an experimental platform, but would prefer an abstract description, such as a domain-specific language, for a production system.

3.2.  Path characteristics data collectors

   The data collectors are implemented as a component of the MAM, within a callback that is executed periodically, e.g., every 100 ms.
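   As an illustration of what a single collector run might compute, the following C sketch reduces the SRTT samples of the current TCP connections using one prefix to the minimum and median values that this section uses as last-mile latency estimates. The function and struct names are hypothetical. How the samples are obtained is platform-dependent and not shown; on Linux, for example, per-connection RTT estimates are exposed through the TCP_INFO socket option.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort() comparator for doubles in ascending order. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical per-prefix aggregate produced by one collector run. */
struct srtt_summary {
    double min_ms;     /* estimate for last-mile latency */
    double median_ms;  /* alternate, outlier-robust estimate */
};

/* Sorts the SRTT samples (one per current TCP connection on the prefix)
 * in place and summarizes them. Returns -1 if there are no samples. */
static int summarize_srtt(double *samples, size_t n, struct srtt_summary *out)
{
    assert(out != NULL);
    if (n == 0)
        return -1;
    qsort(samples, n, sizeof samples[0], cmp_double);
    out->min_ms = samples[0];
    /* Even sample count: average of the two middle values. */
    out->median_ms = (n % 2) ? samples[n / 2]
                             : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
    return 0;
}
```

   For example, samples of 30, 10, 20, and 40 ms yield a minimum of 10 ms and a median of 25 ms; unlike the minimum, the median is not dominated by a single connection that happens to be unusually fast.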
   When this callback is invoked, the MAM passively gathers statistics about the current usage and properties of the available local interfaces and stores them in per-interface or per-network-prefix data structures.

   Measured properties include:

   o  Minimum Smoothed Round Trip Time (SRTT) of current TCP connections using a network prefix, as an estimate for last-mile latency

   o  Median SRTT of current TCP connections using a network prefix, as an alternate estimate for last-mile latency

   o  Median of Round Trip Time variations within connections

   o  Median variation of Smoothed Round Trip Times across connections

   o  Median of the percentage of segments deemed lost among all transmitted segments of current TCP connections, as an estimate of upstream packet loss

   o  Maximum transmitted and received bytes per second over an interface within the last 5 minutes, as an estimate for maximum available bandwidth

   o  On 802.11 interfaces, the Received Signal Strength Indicator (RSSI) of the last received frame on that interface, as an estimate for reception strength

   o  On 802.11 interfaces, the modulation rate of the last received and the last transmitted unicast data frame on that interface, as an estimate for the available data transmission rate on the first hop

   o  On 802.11 interfaces, the latest Channel Utilization as parsed from a Beacon frame, as an estimate of congestion on the wireless medium

   See [ANRW18-Metrics] for more discussion of the gathered metrics.

   When a policy callback is invoked, the policy can use the latest measured properties to guide its decisions; see Section 3.1.

   Note that we do not perform active measurements from within the MAM, to avoid overhead.

4.  Socket Intents Representation

   As described in [I-D.tiesel-taps-socketintents], Socket Intents are pieces of information about upcoming traffic.
   An application can share the information that it has available through the Socket Intents API.

   In our implementation, Socket Intents are represented as socket options for getsockopt()/setsockopt() on their own socket option level (SOL_INTENTS).

   For some of the API variants, we had to introduce socket option lists, i.e., data structures that can hold multiple socket options and therefore multiple Socket Intents.

   Which of these representations is actually used depends on the API variant; see Section 5.

5.  The Socket Intents API Variants

   The Socket Intents API is a wrapper around the BSD Socket API. It sends requests to the Multiple Access Manager (MAM) at certain events, e.g., before a connection is established, and applies the suggestions that it gets from the MAM, e.g., to bind to a certain local interface or to set a certain socket option.

   There are three variants of this API, each trying to fit a different concept:

   o  The Classic API with muacc_context, see Section 5.1, attempts to stick as closely as possible to the call sequence of BSD Sockets.

   o  The second variant of the classic API does all transport option selection in "getaddrinfo()", see Section 5.2. This variant tries to simplify the implementation without deviating too much from the usage of BSD Sockets. It minimizes the changes to the BSD Socket API, but adds overhead for the application.

   o  The "socketconnect" API, see Section 5.3, tries to automate as much functionality as possible and adds support for automating connection caching. It replaces the usual sequence of BSD Socket API calls with a single call.

5.1.  Classic API / muacc_context

   In the first variant, we add a parameter called "muacc_context" to the BSD Socket API calls and to getaddrinfo().
   This parameter holds properties provided by the socket calls and retains them across function calls to enable automation of the connection properties by our Socket Intents prototype. The shadow data structures behind the "muacc_context" parameter are initialized by the API wrapper at the time of the first call (which we assume to be muacc_getaddrinfo() most of the time) with most of their fields empty. Then, within each call to our modified Socket API, they are filled with data.

   Properties include:

   o  Socket file descriptor

   o  API calls that were already performed on this context

   o  Domain, type, and protocol of the socket

   o  Remote hostname

   o  Remote address

   o  Hints for resolving the remote address

   o  Local address to bind to that the application requested

   o  Local address to bind to that the MAM suggested

   o  Current socket options that were set

   o  Socket options suggested by the MAM

5.1.1.  muacc_getaddrinfo()

   This function resolves a host name or service to an addrinfo data structure, usually containing an IP address and port. Internally, the Socket Intents prototype sends a "getaddrinfo" request to the MAM, which performs the name resolution. It can, e.g., resolve the name over multiple available interfaces at the same time and then order the results according to a policy decision, or only return results obtained over a specific interface.

   SIGNATURE:

      int muacc_getaddrinfo(muacc_context_t *ctx, const char *hostname,
                            const char *servname,
                            const struct addrinfo *hints,
                            struct addrinfo **res)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection and retains them across function calls. This function is mostly called with an empty context, which is then filled within the function.
   hostname:  Remote host name to be resolved

   servname:  Remote service to be resolved

   hints:  Hints for resolving the name

   res:  Data structure for the result of name resolution

   RETURN VALUE:

   Returns 0 on success, or an error code as provided by getaddrinfo().

5.1.2.  muacc_socket()

   This function creates a socket file descriptor just like the regular socket() call.

   SIGNATURE:

      int muacc_socket(muacc_context_t *ctx, int domain, int type,
                       int protocol)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection and retains them across function calls. This function is mostly called after muacc_getaddrinfo(), since domain, type, and protocol can depend on the type of the resolved address.

   domain:  Domain of the socket

   type:  Type of the socket

   protocol:  Protocol of the socket

   RETURN VALUE:

   Returns a file descriptor of the new socket on success, or -1 on failure.

5.1.3.  muacc_setsockopt()

   This call sets socket options (including Socket Intents). For Socket Intents, this function can be called with a valid "muacc_context" and an invalid file descriptor (-1) to provide assertional hints to "muacc_getaddrinfo()".

   SIGNATURE:

      int muacc_setsockopt(muacc_context_t *ctx, int socket, int level,
                           int option_name, const void *option_value,
                           socklen_t option_len)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection and retains them across function calls. This function is mostly called to set Intents as socket options within the context.

   socket:  Socket file descriptor

   level:  Level of the socket option to set

   option_name:  Name of the socket option to set

   option_value:  Value of the socket option to set

   option_len:  Length of option_value

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.1.4.  muacc_connect()

   Like the regular connect() call, but also binds to the source address selected by the Socket Intents Policy and applies the socket options suggested by the Socket Intents Policy.

   SIGNATURE:

      int muacc_connect(muacc_context_t *ctx, int socket,
                        const struct sockaddr *address,
                        socklen_t address_len)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection and retains them across function calls. This function is mostly called after all Socket Intents for this connection have been set via muacc_setsockopt().

   socket:  Socket file descriptor

   address:  Remote address to connect to

   address_len:  Length of the remote address

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.1.5.  muacc_close()

   Like the regular close() call, but also cleans up the state held in the shadow structures behind "muacc_context".

   SIGNATURE:

      int muacc_close(muacc_context_t *ctx, int socket)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection and retains them across function calls. This function deinitializes and releases the context.

   socket:  Socket file descriptor

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.2.  Classic API / getaddrinfo

   In this variant, Socket Intents are passed directly to "getaddrinfo()" as part of the "hints" parameter. The name resolution is done by the MAM, which makes all decisions and stores them in the "result" data structure as a list of options ordered by preference. Subsequently, applications can use this information for calls to the unmodified BSD Socket API or other APIs. We provide helpers to apply all socket options from the "result" data structure.
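   A helper that applies a list of socket options to a freshly created socket could look like the following C sketch. The "sockopt_entry" list type and "apply_sockopts()" are hypothetical stand-ins for the prototype's own "struct socketopt" and helpers, whose definitions are not given here. Intent options on a private level such as SOL_INTENTS are not understood by the kernel, so they would presumably need to be filtered out before such a helper runs.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical socket option list entry; the prototype's own
 * struct socketopt may be defined differently. */
struct sockopt_entry {
    int level;                    /* e.g. SOL_SOCKET, IPPROTO_TCP */
    int optname;                  /* e.g. SO_REUSEADDR, TCP_NODELAY */
    const void *optval;
    socklen_t optlen;
    struct sockopt_entry *next;   /* NULL terminates the list */
};

/* Applies every option in the list to fd, in order, and stops at the
 * first failure. Returns 0 on success, or -1 with errno set by
 * setsockopt(). An empty (NULL) list trivially succeeds. */
static int apply_sockopts(int fd, const struct sockopt_entry *list)
{
    for (const struct sockopt_entry *o = list; o != NULL; o = o->next) {
        assert(o->optval != NULL && o->optlen > 0);
        if (setsockopt(fd, o->level, o->optname, o->optval, o->optlen) != 0)
            return -1;
    }
    return 0;
}
```

   An application using this variant would call such a helper once per connection attempt, after creating the socket from a result entry and before connect().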
   All relevant information is stored in our addrinfo struct (see Figure 2).

   SIGNATURE:

      int muacc_ai_getaddrinfo(const char *hostname, const char *service,
                               const struct muacc_addrinfo *hints,
                               struct muacc_addrinfo **result)

   ARGUMENTS:

   hostname:  Remote host name to be resolved

   service:  Remote service to be resolved

   hints:  Hints for resolving the name. Contents include family, socket type, protocol, socket options (including Socket Intents for this socket/connection), and the local address to bind to.

   result:  Data structure for the result of name resolution

   RETURN VALUE:

   Returns 0 on success, or an error code as provided by getaddrinfo().

   /** Extended version of the standard library's struct addrinfo
    *
    * This is used both as hint and as result from the
    * muacc_ai_getaddrinfo() function. This structure
    * differs from struct addrinfo only in the three members
    * ai_sockopts, ai_bindaddrlen, and ai_bindaddr.
    */
   struct muacc_addrinfo {
       int ai_flags;
       int ai_family;
       int ai_socktype;
       int ai_protocol;

       /** Not included in struct addrinfo. Purpose:
        * 1. If the structure is given to muacc_ai_getaddrinfo()
        *    as hints, it carries the Socket Intents that influence
        *    the MAM's source and destination as well as transport
        *    protocol selection.
        * 2. The socket options recommended by the MAM are
        *    returned through this attribute.
        */
       struct socketopt *ai_sockopts;

       int ai_addrlen;
       struct sockaddr *ai_addr;
       char *ai_canonname;

       /** Not included in struct addrinfo.
        * Length of ai_bindaddr.
        */
       int ai_bindaddrlen;
       /** Not included in struct addrinfo.
        * Contains the address that the MAM recommends binding to.
        */
       struct sockaddr *ai_bindaddr;

       struct muacc_addrinfo *ai_next;
   };

            Figure 2: Definition of the muacc_addrinfo struct

   Appendix A.2 shows an example usage of the classic API with most functionality in getaddrinfo.

5.3.  Socketconnect API

   In this API variant, we move the functionality of resolving a hostname and connecting to the resulting address into one function called "socketconnect()". This API makes it possible to call socketconnect() not only for each connection, but also to multiplex messages across multiple existing sockets.

   This function returns a file descriptor of a connected socket for the application to use. This socket can either be a newly created one or a socket that existed previously and is now being reused. Furthermore, a socket can belong to a socket set, i.e., a set of sockets with a common destination and service. These sockets may, e.g., be bound to different local addresses, but are treated as interchangeable by the API implementation. So if the application passes a socket file descriptor to this function, it may get back a different file descriptor to a socket from the same set, e.g., to use the connection over a different local interface for its following communication.

   SIGNATURE:

      int socketconnect(int *socket, const char *host, size_t hostlen,
                        const char *serv, size_t servlen,
                        struct socketopt *sockopts, int domain, int type,
                        int proto)

   ARGUMENTS:

   socket:  Existing socket file descriptor as a representative of a socket set, "-1" to create a new socket, or "0" to automatically try to find a suitable socket set

   host:  Remote hostname to be resolved

   hostlen:  Length of the remote hostname

   serv:  Remote service or port

   servlen:  Length of the remote service

   sockopts:  List of socket options, including Socket Intents

   domain:  Domain of the socket

   type:  Type of the socket

   proto:  Protocol of the socket

   RETURN VALUE:

   Returns 0 on success if the socket is from an existing socket set, 1 on success if the socket was newly created, or -1 on failure.

   Appendix A.3 shows an example usage of the Socketconnect API.

6.  API Implementation Experiences & Lessons Learned

   While designing and implementing the different parts of the system as described in this document, we faced several challenges. In the Multiple Access Manager, discovering the currently available paths and statistics about their performance turned out to be quite complex and had to be implemented in a partially platform-dependent way. However, the most challenging parts were the Socket Intents API and Library, on which we focus in the following sections.

6.1.  The Missing Link to Name Resolution

   Transport option selection is most useful if crucial information, such as Socket Intents or other socket options, is available as early as possible, i.e., for name resolution. The primary problem here is the order of the function calls that are involved in name resolution, destination selection, protocol selection, and path selection, and how they are linked.

   In the classic BSD Socket API, most functions either take a socket file descriptor as an argument or return one, and thus link different function calls to the same flow. However, "getaddrinfo()" is not linked to a socket file descriptor, and it is typically called before the socket is created. At this point, it is not yet possible to set a socket option, because the socket does not exist yet.

   Consequently, across BSD Socket API calls, several choices are made before it is possible to set a Socket Intent: A call to "getaddrinfo()" returns a linked list of "addrinfo" structs, where each entry contains an "ai_family" (IP version), the pair of "ai_socktype" and "ai_protocol" (transport protocol), and a "sockaddr" struct containing an IP address and port to connect to. Then a socket of the given family, type, and protocol is created.
   Only after this has been done can socket options be set on the socket, but at this point the destination, IP version, and transport protocol are already fixed. Before calling "connect()", only the path to be used (i.e., the local address to bind to) can still be chosen, but the available paths and which one to prefer may be constrained by the choice of destination.

   The three variants described in Section 5 work around this problem in different ways:

   o  The approach in Section 5.2 places the whole automation of transport option selection into the "getaddrinfo()" function. The results are returned in an extended "addrinfo" struct and have to be applied manually by the application for each connection attempt, including binding to a source address representing the selected path and applying all socket options provided in a list.

   o  The approach in Section 5.1 adds a context to all socket-related and name-resolution-related API calls.

   o  The approach in Section 5.3 puts all functionality into one call.

   All of these approaches add the missing link between name resolution and the other parts of the API, but add a lot of state keeping, either to the API, where the application developer has to manage it, or to the Socket Intents Library.

6.2.  File Descriptors Considered Harmful

   When using BSD sockets, file descriptors are the abstraction for network flows. Depending on the transport protocol used, their semantics change, and these file handles represent streams (SOCK_STREAM), associations (SOCK_DGRAM), or network interfaces (SOCK_RAW). This does not provide a unified API, but is merely an artifact of squeezing networking into the "Everything is a file" UNIX philosophy.
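   The change in semantics can be observed with a local socket pair. In the following C sketch (the helper name is hypothetical), two 2-byte messages are sent with separate send() calls and then read with a single recv(): on a SOCK_DGRAM socket, that recv() returns exactly one message, whereas on a SOCK_STREAM socket the two writes may be delivered together, since a byte stream carries no message boundaries.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends "ab" and "cd" as two separate send() calls over a local socket
 * pair of the given type, then performs one recv() with room for both.
 * Returns the number of bytes that single recv() delivered, or -1. */
static ssize_t first_recv_len(int type)
{
    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    int sv[2];
    char buf[16];
    ssize_t n = -1;
    if (socketpair(AF_UNIX, type, 0, sv) != 0)
        return -1;
    if (send(sv[0], "ab", 2, 0) == 2 && send(sv[0], "cd", 2, 0) == 2)
        n = recv(sv[1], buf, sizeof buf, 0);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

   With SOCK_DGRAM the result is 2; with SOCK_STREAM it can be up to 4. An application written against one socket type therefore cannot be switched transparently to a protocol with the other semantics, which is what automated protocol stack instance selection would require.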
   File descriptors are not a good abstraction for automated protocol stack instance selection, as applications have to adapt to semantics that change with the transport protocol chosen, e.g., whether message boundaries are preserved.

   File descriptors are not a good abstraction for destination instance selection and path selection either. Once a socket has been created, its protocol stack instance is fixed, so selecting a path by binding to a local address and connecting to a destination instance is only possible using this protocol stack instance. If such a connection attempt fails, it is possible to retry using another path and destination, but changing the protocol stack instance requires creating a new socket with a different file descriptor.

   For further discussion of other asynchronous I/O weirdness with file descriptors, see the end of Section 6.3.

6.3.  Asynchronous API Anarchy

   Network I/O is asynchronous, but asynchronous I/O within the POSIX file system API is hard to use. There are at least three different asynchronous I/O APIs on each operating system, e.g., "select()", "poll()", and a platform-specific mechanism such as epoll on Linux or kqueue on BSD systems.

   To implement asynchronous I/O for our Socket Intents prototype, we wrapped one of the asynchronous I/O APIs that is available on most platforms: "select()". To make Socket Intents accessible to more applications and on more platforms, a production-grade system would need to wrap all asynchronous I/O APIs and implement most of the socket creation logic, path selection, and connection logic within these wrappers. However, mixing asynchronous I/O and multithreading may lead to unintuitive behavior, e.g., calling our prototype's "select()" from different threads could lead to anything from deadlocks to busy waiting.
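   The pattern that such "select()"-based wrappers have to intercept is the standard non-blocking connect: the application starts "connect()", waits until "select()" reports the socket as writable, and then reads SO_ERROR to learn whether the attempt succeeded. The following C sketch shows that sequence for IPv4; "connect_async()" and "wait_connected()" are hypothetical helper names, not part of the prototype.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Waits until a non-blocking connect() on fd has finished.
 * Returns 0 if the connection was established, -1 otherwise. */
static int wait_connected(int fd, int timeout_ms)
{
    if (fd >= FD_SETSIZE)
        return -1;  /* select() cannot watch descriptors this large */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout_ms / 1000,
                          .tv_usec = (timeout_ms % 1000) * 1000 };
    if (select(fd + 1, NULL, &wfds, NULL, &tv) != 1)
        return -1;  /* timeout or select() error */

    /* Writability only means the attempt is over; SO_ERROR says how. */
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0 || err != 0)
        return -1;
    return 0;
}

/* Starts a non-blocking TCP connect() to dst and waits for it with
 * select(). Returns the connected socket, or -1 on error or timeout. */
static int connect_async(const struct sockaddr_in *dst, int timeout_ms)
{
    assert(dst != NULL && timeout_ms >= 0);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags >= 0 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0 &&
        (connect(fd, (const struct sockaddr *)dst, sizeof *dst) == 0 ||
         errno == EINPROGRESS) &&
        wait_connected(fd, timeout_ms) == 0)
        return fd;
    close(fd);
    return -1;
}
```

   A wrapper that performs transport option selection would have to run its own logic, possibly including a MAM round trip, between these steps without blocking the application, for every asynchronous I/O API it supports.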
Another issue is that we use Unix domain sockets to communicate between our Multiple Access Manager and the Socket Intents API library called by the application. We therefore need to make sure that the application does not block on communication with the Multiple Access Manager.

The problems with file descriptors get even worse in the asynchronous case. If a Socket API call should return immediately, it needs to provide the application with a reference to a flow that has not yet been fully set up, i.e., a reference to a "future" socket. An implementation of such an asynchronous API has to return an unconnected socket file descriptor. The application then calls, e.g., "select()" on it and starts using it once it becomes readable or writable.

If the destination, path, and transport protocol have not yet been chosen at this point, the returned file descriptor might not yet have the final address family and transport protocol. When the implementation later creates the final socket of the right type, it can move it onto the descriptor number of the originally returned file descriptor using "dup2()". This procedure can easily lead to time-of-check to time-of-use confusion.

To make things even worse, the application can copy the "future" file descriptor using "dup()". This is rarely useful for sockets, but in combination with file descriptors used as "futures" it leads to unexpected behavior. A copy made before the final socket was installed keeps referring to the placeholder socket instead of the final one.

6.4. Here Be Dragons Hiding in Shadow Structures

The API variants described in Section 5.3 and Section 5.1 need to keep a lot of state in shadow structures, because this state cannot otherwise be passed between the Socket API calls. This state needs to be cleaned up when the last copy of the file descriptor is closed, or when the last socket held for reuse has timed out. In addition, access to these shadow structures has to be thread-safe.
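The "future" socket procedure from Section 6.3 can be illustrated with a minimal C sketch. The same sketch shows why state keyed by descriptor number is fragile. It assumes a POSIX system. The helpers "sock_type()" and "resolve_future()" are hypothetical and not taken from the prototype.

The sketch proceeds in three steps:

o A placeholder TCP socket is handed out first.

o Later, the library creates the final socket (here a UDP socket) and moves it onto the same descriptor number with "dup2()".

o A copy that the application made with "dup()" in the meantime still refers to the placeholder.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return the socket type (SOCK_STREAM,
 * SOCK_DGRAM, ...) behind fd, or -1 on error. */
static int sock_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return -1;
    return type;
}

/* Hypothetical library step: destination, path and protocol have now
 * been chosen, so create the final socket and move it onto the
 * descriptor number handed out earlier as a "future".  dup2() closes
 * whatever future_fd referred to before.  Returns future_fd or -1. */
static int resolve_future(int future_fd, int family, int type)
{
    int real = socket(family, type, 0);

    if (real < 0)
        return -1;
    if (dup2(real, future_fd) < 0) {
        close(real);
        return -1;
    }
    close(real);   /* from now on only future_fd refers to it */
    return future_fd;
}
```

After resolve_future(), getsockopt() with SO_TYPE reports SOCK_DGRAM for the original descriptor number. The application's earlier copy still reports SOCK_STREAM. A shadow structure indexed by descriptor number would have to track such copies to know when the last one is closed.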
Implementing both the cleanup and the thread-safe access has turned out to be extremely error-prone. In addition, the system libraries contain a large amount of unspecified behavior and platform-dependent extensions. These issues guarantee that an implementation of transport option selection that integrates nicely with BSD Sockets will come with many limitations. Such an implementation will also not be portable across POSIX-compliant operating systems.

7. Conclusion

Adding transport option selection to BSD Sockets is hard. The API calls are not designed to defer making and applying choices until the moment when all information needed for transport option selection is available.

Suppose transport option selection is limited to the granularity that BSD Sockets typically provide today (TCP connections and UDP associations). In that case, the API variant described in Section 5.2 seems to be a good compromise, even though it forces the application to try all candidates itself (either sequentially or partially in parallel). This variant is easily deployable, but does not automate techniques like connection caching or HTTP pipelining.

The most versatile API variant, described in Section 5.3, implements connection caching at the transport layer. This comes at the cost of heavily modifying existing applications. Given the unnecessary complexity of the file I/O integration of BSD Sockets, it seems easier, where feasible, to move to a completely different system such as [I-D.trammell-taps-post-sockets].

8. Acknowledgments

The API variant described in Section 5.2 was originally drafted and implemented by Tobias Kaiser (mail@tb-kaiser.de [2]) as part of his bachelor's thesis.

This work has been supported by Leibniz Prize project funds of DFG - German Research Foundation: Gottfried Wilhelm Leibniz-Preis 2011 (FKZ FE 570/4-1).

9. References

9.1. Informative References

[ANRW17-MH]
   Tiesel, P., May, B., and A. Feldmann, "Multi-Homed on a Single Link", Proceedings of the 2016 workshop on Applied Networking Research Workshop - ANRW 16, DOI 10.1145/2959424.2959434, 2016.

[ANRW18-Metrics]
   "Metrics for access network selection (ANRW 2018)", n.d.

[I-D.tiesel-taps-communitgrany]
   Tiesel, P. and T. Enghardt, "Communication Units Granularity Considerations for Multi-Path Aware Transport Selection", draft-tiesel-taps-communitgrany-02 (work in progress), May 2018.

[I-D.tiesel-taps-socketintents]
   Tiesel, P., Enghardt, T., and A. Feldmann, "Socket Intents", draft-tiesel-taps-socketintents-01 (work in progress), October 2017.

[I-D.trammell-taps-post-sockets]
   Trammell, B., Perkins, C., Pauly, T., Kuehlewind, M., and C. Wood, "Post Sockets, An Abstract Programming Interface for the Transport Layer", draft-trammell-taps-post-sockets-03 (work in progress), October 2017.

[RFC2119]
   Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC6824]
   Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, "TCP Extensions for Multipath Operation with Multiple Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, <https://www.rfc-editor.org/info/rfc6824>.

[RFC7413]
   Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, <https://www.rfc-editor.org/info/rfc7413>.

[RFC7556]
   Anipko, D., Ed., "Multiple Provisioning Domain Architecture", RFC 7556, DOI 10.17487/RFC7556, June 2015, <https://www.rfc-editor.org/info/rfc7556>.

9.2. URIs

[1] https://github.com/fg-inet/socket-intents/

[2] mailto:mail@tb-kaiser.de

Appendix A. API Usage Examples

A.1. Usage Example of the Classic / muacc_context API

In this example, a client application sets up a connection to a remote host and sends data to it. It specifies two Socket Intents on this connection: the Category "Bulk Transfer" and a File Size of 1 MB.
   #include <netdb.h>    // struct addrinfo
   #include <unistd.h>   // write()
   // ... plus the header(s) of the Socket Intents client library [1]

   #define LENGTH_OF_DATA 1048576

   int main(void)
   {
       // Create and initialize a context to retain information across
       // function calls.
       muacc_context_t ctx;
       muacc_init_context(&ctx);

       int fd = -1;
       struct addrinfo *result = NULL;

       // Data to send later. The buffer is static because a 1 MB array
       // may not fit on the stack; static storage is zero-initialized.
       static char buf[LENGTH_OF_DATA];

       // Set Socket Intents for this connection. Note that "fd" is still
       // invalid: the socket does not need to exist yet, as the Socket
       // Intents prototype just stores the Intent in the muacc_context
       // data structure.
       enum intent_category category = INTENT_BULKTRANSFER;
       muacc_setsockopt(&ctx, fd, SOL_INTENTS,
           INTENT_CATEGORY, &category, sizeof(category));

       int filesize = LENGTH_OF_DATA;
       muacc_setsockopt(&ctx, fd, SOL_INTENTS,
           INTENT_FILESIZE, &filesize, sizeof(filesize));

       // Resolve a host name. This involves a request to the MAM, which
       // can automatically choose a suitable local interface or other
       // parameters for the DNS request and set other parameters, such as
       // the preferred address family or transport protocol.
       muacc_getaddrinfo(&ctx, "example.org", NULL, NULL, &result);
       if (result == NULL)
           return 1;   // name resolution failed

       // Create the socket with the address family, type, and protocol
       // obtained by getaddrinfo.
       fd = muacc_socket(&ctx, result->ai_family, result->ai_socktype,
           result->ai_protocol);
       if (fd < 0)
           return 1;

       // Connect the socket to the remote endpoint as determined by
       // getaddrinfo. This involves another request to the MAM, which may
       // at this point, e.g., choose to bind the socket to a local IP
       // address before connecting it.
       muacc_connect(&ctx, fd, result->ai_addr, result->ai_addrlen);

       // Send data to the remote host over the socket. A robust
       // application would check the return value and loop, as write()
       // may transfer fewer bytes than requested.
       write(fd, buf, LENGTH_OF_DATA);

       // Close the socket. This de-initializes any data that was stored
       // within the muacc_context.
       muacc_close(&ctx, fd);
       return 0;
   }

A.2. Usage Example of the Classic / getaddrinfo API

As in Appendix A.1, the application sets the Intents "Category" and "File Size".

   #include <sys/socket.h>   // socket()
   #include <unistd.h>       // write(), close()
   // ... plus the header(s) of the Socket Intents client library [1]

   #define LENGTH_OF_DATA 1048576

   int main(void)
   {
       // Define the Intents as a linked list of socket options
       enum intent_category category = INTENT_BULKTRANSFER;
       int filesize = LENGTH_OF_DATA;

       struct socketopt intents = { .level = SOL_INTENTS,
           .optname = INTENT_CATEGORY, .optval = &category, .next = NULL };
       struct socketopt filesize_intent = { .level = SOL_INTENTS,
           .optname = INTENT_FILESIZE, .optval = &filesize, .next = NULL };

       intents.next = &filesize_intent;

       // Data to send later (static: a 1 MB array may not fit on the
       // stack; static storage is zero-initialized)
       static char buf[LENGTH_OF_DATA];

       struct muacc_addrinfo intent_hints = { .ai_flags = 0,
           .ai_family = AF_INET, .ai_socktype = SOCK_STREAM, .ai_protocol = 0,
           .ai_sockopts = &intents, .ai_addr = NULL, .ai_addrlen = 0,
           .ai_bindaddr = NULL, .ai_bindaddrlen = 0, .ai_next = NULL };

       struct muacc_addrinfo *result = NULL;

       muacc_ai_getaddrinfo("example.org", NULL, &intent_hints, &result);
       if (result == NULL)
           return 1;   // name resolution failed

       // Create and connect the socket, using the information obtained
       // through getaddrinfo
       int fd = socket(result->ai_family, result->ai_socktype,
           result->ai_protocol);
       if (fd < 0) {
           muacc_ai_freeaddrinfo(result);
           return 1;
       }
       muacc_ai_simple_connect(fd, result);

       // Send data to the remote host over the socket, then close it.
       // A robust application would check write()'s return value and
       // loop on short writes.
       write(fd, buf, LENGTH_OF_DATA);
       close(fd);

       muacc_ai_freeaddrinfo(result);
       return 0;
   }

A.3. Usage Example of the Socketconnect API

As in Appendix A.1, the application sets the Intents "Category" and "File Size". As we pass "-1" as the socket, we do not reuse existing connections.
   #include <sys/socket.h>   // AF_INET, SOCK_STREAM
   #include <unistd.h>       // write()
   // ... plus the header(s) of the Socket Intents client library [1]

   #define LENGTH_OF_DATA 1048576

   int main(void)
   {
       // Define the Intents as a linked list of socket options
       enum intent_category category = INTENT_BULKTRANSFER;
       int filesize = LENGTH_OF_DATA;

       struct socketopt intents = { .level = SOL_INTENTS,
           .optname = INTENT_CATEGORY, .optval = &category, .next = NULL };
       struct socketopt filesize_intent = { .level = SOL_INTENTS,
           .optname = INTENT_FILESIZE, .optval = &filesize, .next = NULL };

       intents.next = &filesize_intent;

       // Data to send later (static: a 1 MB array may not fit on the
       // stack; static storage is zero-initialized)
       static char buf[LENGTH_OF_DATA];

       // -1: do not reuse an existing connection
       int fd = -1;

       // Get a socket that is connected to the given host and service,
       // with the given Intents. Host and service are passed together
       // with their lengths (11 and 2 characters).
       socketconnect(&fd, "example.org", 11, "80", 2, &intents, AF_INET,
           SOCK_STREAM, 0);

       // Send data to the remote host over the socket. A robust
       // application would check write()'s return value and loop on
       // short writes.
       write(fd, buf, LENGTH_OF_DATA);

       // Close the socket and tear down the data structure kept for it
       // in the library
       socketclose(fd);
       return 0;
   }

Appendix B. Changes

B.1. Since -01

o Updated list of gathered path characteristics

o Reordered start of Policy section to make it clearer

B.2. Since -00

o Fixed authors' affiliations and funding

o Fixed acknowledgments

Authors' Addresses

Philipp S. Tiesel
TU Berlin
Marchstr. 23
Berlin
Germany

Email: philipp@inet.tu-berlin.de

Theresa Enghardt
TU Berlin
Marchstr. 23
Berlin
Germany

Email: theresa@inet.tu-berlin.de