Network Working Group                                           S. Whyte
Internet-Draft                                               Google Inc.
Intended status: Informational                                  M. Hines
Expires: April 24, 2014                                        W. Kumari
                                                            Google, Inc.
                                                        October 21, 2013

                  Bulk Network Data Collection System
              draft-swhyte-i2rs-data-collection-system-00

Abstract

   Collecting large amounts of data from network infrastructure devices
   has never been easy.
   Existing methods generate CPU and memory loads that may be
   unacceptable, produce output that varies across implementations and
   can be difficult to parse, and are often difficult to scale.  I2RS
   programmatic interfacing with the routing system may exacerbate this
   problem: state needs to be collected from nodes and fed to consumers
   participating in the control plane that may not be physically close
   to the nodes.  This state includes not only control plane
   information, but also elements of the data plane that have a direct
   impact on control plane behavior, such as traffic engineering.

   This document outlines a set of use cases requiring a flexible
   framework to collect routing system data, and the features and
   functionality needed to make such a framework useful for these use
   cases.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 24, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Desired functionality . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Database Model  . . . . . . . . . . . . . . . . . . . . .   4
     2.2.  Pub-Sub . . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.3.  Capability Negotiation  . . . . . . . . . . . . . . . . .   5
     2.4.  Format Agnostic . . . . . . . . . . . . . . . . . . . . .   5
     2.5.  Transport Options . . . . . . . . . . . . . . . . . . . .   5
     2.6.  Filtering . . . . . . . . . . . . . . . . . . . . . . . .   5
     2.7.  Timestamps  . . . . . . . . . . . . . . . . . . . . . . .   6
     2.8.  Introspection . . . . . . . . . . . . . . . . . . . . . .   6
     2.9.  Registration  . . . . . . . . . . . . . . . . . . . . . .   6
   3.  Use cases . . . . . . . . . . . . . . . . . . . . . . . . . .   6
     3.1.  Push  . . . . . . . . . . . . . . . . . . . . . . . . . .   7
       3.1.1.  Interface counters  . . . . . . . . . . . . . . . . .   7
       3.1.2.  Thresholds  . . . . . . . . . . . . . . . . . . . . .   7
       3.1.3.  Streaming . . . . . . . . . . . . . . . . . . . . . .   7
     3.2.  Pull  . . . . . . . . . . . . . . . . . . . . . . . . . .   7
       3.2.1.  Interface counters  . . . . . . . . . . . . . . . . .   8
       3.2.2.  RIB Dump  . . . . . . . . . . . . . . . . . . . . . .   8
       3.2.3.  Arbitrary data collection . . . . . . . . . . . . . .   8
     3.3.  Dynamic subscriptions . . . . . . . . . . . . . . . . . .   8
   4.  Subscriber versus consumer  . . . . . . . . . . . . . . . . .   8
     4.1.  Remapping . . . . . . . . . . . . . . . . . . . . . . . .   8
   5.  Errors  . . . . . . . . . . . . . . . . . . . . . . . . . . .   9
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   9
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  10
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  10
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   Managing and monitoring a network requires getting state out of it.
   As the saying goes, you cannot manage what you do not measure.
   Currently there is only a limited set of tools for getting data off
   of network nodes, and they do not lend themselves to programmatic
   access.

   The primary tool today is SNMP.  SNMP can be used both to push data
   off a node (via traps/notifications) and to pull data off the box
   (via queries).  SNMP queries have a variety of issues, not the least
   of which is that the protocol specification requires data structures
   to be created on demand on network nodes, structures that do not
   match how the device's operating system stores the same data.
   Fixing this problem has the immediate benefit of reducing CPU and
   memory consumption on the monitored network devices, greatly
   increasing the deployability and relevance of a solution.  SNMP
   traps/notifications suffer from a lack of introspection; the network
   management system (NMS) must be preconfigured to understand what
   information is being reported.

   Other tools include CLI scraping and Syslog.  CLI scraping is a
   low-level pull mechanism and essentially the opposite of
   programmatic access.
   Any change in CLI implementation, whether it is a simple whitespace
   correction, a re-ordering of configuration stanzas, a typographical
   fix, or even a change of units, can require a rewrite of monitoring
   software.  This is compounded by the fact that there is no
   standardized CLI specification, so a network with multiple vendors
   requires such a rewrite for each vendor's CLI change.

   Syslog is another way to push data off of a network node.  Syslog
   has been around a long time, and while current standards provide for
   structured data output, very few implementations of it currently
   exist on network nodes.  For the most part, NMSes must be trained
   how to consume and interpret each different implementation of
   syslog.

2.  Desired functionality

   Collecting large data sets with high frequency and resolution, with
   minimal impact on a device's CPU and memory, is the primary
   objective.  Aspects of the overall data collection system, such as
   availability, reliability, or scaling, are out of scope, as they
   deal with the data once it has left the network node.

   We focus only on getting data off the node in an easily
   machine-parsable format.

2.1.  Database Model

   A database model is desired, whereby a network node can describe the
   data it has available and the structure of that data.  This gives
   the implementor the ability to present a database model that aligns
   with the node's internal data structure implementations.  The NMS
   consumes and understands the database model only after it has been
   trained to do so by incorporating a published version of the
   database model from the vendor.

   It should be noted that all of the existing data collection methods
   outlined earlier require explicit knowledge of the method's
   implementation for integration into an NMS.
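   As a non-normative illustration, a vendor-published database model
   might be represented and walked as sketched below.  Every table and
   field name here is an assumption made purely for this example;
   nothing in this draft specifies the model's contents or encoding.

```python
import json

# A hypothetical vendor-published database model: the node describes
# the data it has available and the structure of that data.  All
# table and field names below are invented for illustration.
NODE_MODEL = {
    "interfaces": {
        "keyed_by": "name",
        "counters": ["in_octets", "out_octets", "in_errors"],
    },
    "rib": {
        "keyed_by": "prefix",
        "attributes": ["next_hop", "protocol", "metric"],
    },
}

def subscribable_items(model):
    """Flatten the model into (table, field) pairs an NMS could select."""
    pairs = []
    for table, spec in model.items():
        for field in spec.get("counters", []) + spec.get("attributes", []):
            pairs.append((table, field))
    return pairs

# The NMS "incorporates" the published model simply by walking it.
print(json.dumps(subscribable_items(NODE_MODEL)))
```

   The point of the sketch is only that the node, not the NMS, is the
   source of truth for what can be collected and how it is organized.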
   We do not propose a solution that eliminates this requirement;
   as existing implementations show, eliminating the heterogeneity of
   the data is not required.  Rather, capability negotiation and the
   flexible formats and transports outlined below are desired, enabling
   the primary objective of getting large data sets off the nodes with
   as little impact as possible.

2.2.  Pub-Sub

   An underlying pub-sub model is desired for a variety of reasons.  It
   provides a security model for authorization, it supports
   intermediaries that allow the system to scale as needed, and it
   provides both push and pull methods of data distribution.

   In the context of this draft, a pub-sub model is a general concept
   indicating information flow.  Specific system details are obviously
   critical, yet they belong in a data model document.  The high-level
   desire is to have network nodes act as publishers, with an NMS
   implementing subscribers.  Conceptually, they are connected by a
   message bus, a layer of indirection between the publishers and
   subscribers.  Having a message bus allows publisher fan-in,
   subscriber fan-out, and a number of other useful features outside
   the scope of this document.  The message bus is frequently referred
   to as a broker in pub-sub models.

   Having a message bus abstraction also allows for considerable
   flexibility in NMS design.  The placement of brokers in the network,
   their redundancy, availability, and scaling per publisher or
   subscriber can all be tailored to suit an individual network's
   needs, from extremely simple (flat) to extremely complex with
   multiple layers of hierarchy.  Many implementations of pub-sub
   models exist, scaling both in the number of subscriptions and in the
   number of messages, both of which should be considered carefully in
   the I2RS context.

2.3.  Capability Negotiation

   Capability negotiation allows a node to inform a subscriber of a
   number of options.
   Two extremely important options are the transport protocols and the
   formats supported.  Other aspects, such as security options and
   error handling, would also be negotiated during this phase.

   The capability negotiation phase is done via a control channel
   opened for the purpose of registering subscriptions with the node.
   This control channel should be TCP.

2.4.  Format Agnostic

   From the I2RS perspective, this framework should be format agnostic.
   If a node advertises the ability to present data in XML and the
   subscriber agrees, then XML can be used.  Other formats of interest
   are JSON, HTML, and protobufs.  There is even interest in
   /proc/net-formatted output, which would help an NMS based on this
   framework integrate into existing server configuration management
   systems.

   [Editor note: even ASN.1 should be an acceptable format.  This would
   potentially allow an extremely easy deployment into an existing
   SNMP-based NMS.]

2.5.  Transport Options

   Because the focus of this framework ends at getting data off the box
   as quickly as possible, implementations should have the freedom to
   choose a transport that meets their system design needs and not be
   restricted by a specific format.

   During the negotiation phase, a node should advertise all the
   transport options it provides and allow the subscriber to select
   what it needs.

   Given that the time-value of different data elements coming off the
   node can be quite different, it should be possible to request
   multiple transports and associate each subscription with the
   transport protocol of choice.

2.6.  Filtering

   Once a network node has provided its database model to a subscriber,
   the subscriber needs a way to select parts of the model for
   subscription, and it needs to be able to request multiple
   subscriptions at a time.
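   As a non-normative sketch, such selection and bucketing might look
   like the following.  The slash-separated path syntax and the
   parameter names (frequency_s, transport, mode) are assumptions made
   for illustration only, not part of any specified format.

```python
# Filtering sketch: select items from a node's advertised model and
# bucket them by collection parameters.  All paths and parameter
# names below are invented for illustration.
MODEL_ITEMS = [
    "interfaces/eth0/in_octets",
    "interfaces/eth0/out_octets",
    "interfaces/eth1/in_octets",
    "rib/ipv4/updates",
]

def subscribe(items, selector, **params):
    """Build one subscription: matching items plus collection parameters."""
    return {
        "items": [i for i in items if selector in i],
        "params": params,
    }

# Two subscriptions requested at the same time, each bucketed with its
# own frequency, transport, and push/streaming mode.
counters = subscribe(MODEL_ITEMS, "in_octets",
                     frequency_s=30, transport="tcp", mode="push")
rib_feed = subscribe(MODEL_ITEMS, "rib/",
                     transport="udp", mode="streaming")

print(len(counters["items"]), rib_feed["params"]["mode"])
```

   Note that the filter itself is independent of the model's contents:
   the same selection mechanism applies to whatever the node advertises.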
   This framework should provide a standard filtering mechanism so
   that, independent of the database model's structure and contents, a
   subscriber can select interesting items to collect and bucket them
   based on standard parameters such as frequency of collection,
   underlying transport required, whether the data is to be pushed or
   pulled, and whether collection is streaming or one-shot.

2.7.  Timestamps

   Every piece of data collected by this framework needs an associated
   timestamp indicating when the node made it available for collection.
   This is not required on a per-variable basis; for example, data
   organized into a table only requires a timestamp associated with the
   table.

   This is not to say that additional timestamps are not useful for
   certain data sets, nor that timestamps with other semantics (for
   example, collection time versus advertisement time) cannot be used,
   but rather that those additional timestamps are better placed in the
   database model supported by the device.

2.8.  Introspection

   This framework should support introspection of the database model.
   Introspection provides support for data verification, easier
   inclusion of legacy data, and easier merging of data streams.

2.9.  Registration

   After capabilities and a database model have been exchanged, and a
   filter used to select the elements of the model to subscribe to, the
   framework should support a standard way to register for all the data
   desired, using whatever capabilities were advertised by the node.

   Once registration is complete, the control channel can be closed.
   Ensuring that subscriptions are correct, complete, and replicated or
   not is up to the overall system, not the network node.

3.  Use cases

   The following example use cases outline the utility of subscribing
   to data with different parameters.

3.1.  Push

   Pushing data off the box can be done synchronously at fixed
   intervals, or asynchronously in an ad-hoc fashion.  All pushed data
   is set up via registered subscriptions.

3.1.1.  Interface counters

   Interface counters provide a use case demonstrating the need to push
   data off of a network node at specific intervals.  In this proposed
   framework, a node would advertise its database model, including all
   the interfaces it has to offer and what it can count on each.  A
   subscriber would select the interfaces and counters it is interested
   in via a filter, use the filter to group them according to available
   parameters, and register with the node to have them published at
   agreed-upon intervals.

3.1.2.  Thresholds

   Another use case demonstrating a push capability is thresholding.
   Assuming a node advertises the capability to record and track a
   threshold for a particular data type, it would use the registered
   subscription to push the relevant data to the subscriber whenever
   the threshold was crossed.  As an example, a subscriber may want to
   set a threshold for memory consumed: if the available device memory
   falls below the threshold, the subscriber should be informed so that
   the operator can investigate the issue manually or programmatically.

3.1.3.  Streaming

   Streaming data, such as RIB information, will be critical to
   supporting I2RS functionality.  In this use case, a subscriber may
   desire to have all updates to a RIB streamed into the collection
   system in as close to real time as possible.

3.2.  Pull

   Pulling data off the node will always be a one-shot function.  As
   such, it is probably the most heavy-handed way to get data into the
   collection system, as it requires all the overhead of setting up and
   tearing down the control channel, exchanging the database model,
   creating a filter, and receiving the data.
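   As an illustration, that one-shot pull sequence might be sketched as
   follows.  FakeNode stands in for a real publisher; every class,
   method, and path name here is invented for illustration and is not
   defined by this draft.

```python
# One-shot pull lifecycle sketch: open the control channel, fetch the
# model, filter it, collect the data once, and tear everything down.
class FakeNode:
    def __init__(self):
        self._data = {"interfaces/eth0/in_octets": 12345,
                      "rib/ipv4/routes": 842}
        self.channel_open = False

    def open_control_channel(self):      # control channel setup
        self.channel_open = True

    def database_model(self):            # model exchange
        return list(self._data)

    def one_shot(self, paths):           # collected and sent immediately
        return {p: self._data[p] for p in paths}

    def close_control_channel(self):     # teardown: node keeps no state
        self.channel_open = False

def pull_once(node, selector):
    """Full pull lifecycle: setup, model, filter, collect, teardown."""
    node.open_control_channel()
    paths = [p for p in node.database_model() if selector in p]
    data = node.one_shot(paths)
    node.close_control_channel()
    return data

print(pull_once(FakeNode(), "in_octets"))
```

   The sketch makes the overhead visible: every step of the lifecycle
   is repeated for each pull, and the node retains no state afterwards.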
   Nevertheless, it can be a valuable option and should be supported.

   N.B.: it is certainly possible to cache requests on publishers and
   have them "replayed" via a subscription identifier.  However, the
   capability to track the state required to do so may not be available
   on a node, and doing so is somewhat counter to the overall goal of
   minimizing impact on the node.  Having this capability as an
   optional parameter of a database model is worth exploring.

3.2.1.  Interface counters

   This is similar to the interface counter example above, except that
   in this case the registration includes a parameter indicating the
   data should be collected immediately and sent only once.

3.2.2.  RIB Dump

   Getting a snapshot of the node's current RIB can be useful for a
   variety of reasons.  Similar to collecting RIB information above, in
   this example the subscriber would register for a one-shot dump of
   the RIB, collected and sent immediately.

3.2.3.  Arbitrary data collection

   Once the NMS understands a node's database model, it should be able
   to register for one-shot collection of any subset of that model.
   Given the overheads involved, this would best be restricted to
   one-off collection needs, such as troubleshooting, but the use case
   need is solid.

3.3.  Dynamic subscriptions

   This framework should support dynamic subscription capabilities for
   pre-existing monitoring protocols that currently require static
   configuration.  For example, if a node's database model indicates
   that it supports IPFIX, a subscriber should be able to set up a
   streaming IPFIX feed using the standard registration process
   outlined above.  BMP and the like should also be available via this
   mechanism.

4.  Subscriber versus consumer

   It should be noted that because the overall data collection system
   architecture is out of scope, it is opaque to this framework whether
   a subscriber is also the consumer of the data.
   In order to maximize design options, including the scalability of
   the overall system, both options should be supported.

4.1.  Remapping

   Remapping, in this context, is the ability to modify a node's
   database model and request that the modified model be used in
   subscriptions.  While this has interesting properties, it strays far
   from the primary objective of getting data off of nodes as fast as
   possible with as little impact as possible, and thus should be
   considered out of scope.

5.  Errors

   Errors happen.  Many classes of errors and their handling are
   already well understood and need not be reiterated here.  There are
   certainly failure modes that may be unique to I2RS or this
   framework, however, and we should be prepared to incorporate
   solutions for those.

   For example, providing a method for a node and a subscriber to agree
   on resolution steps after defined error events would be very useful.
   A subscriber may want certain subscriptions to be available for
   pulling if the push mechanism fails.

   There may also be value in defining how a subscriber can probe the
   transport layer, such that publisher responses can assist in
   troubleshooting protocol-specific failures.

   The framework needs to support standardized handling of stale data.
   This class of error will largely be related to handling changes and
   exceptions in the database models exchanged; for example, what
   happens when a node's physical configuration changes and part of an
   existing subscription becomes invalid.  Similar thought needs to be
   given to logical changes, such as the disappearance of a BGP
   speaker.

6.  IANA Considerations

   This document makes no request of the IANA.

7.  Security Considerations

   I2RS provides security requirements; any security requirements
   raised by this framework should be encompassed there.

   [TODO(WK, SW): This section needs more work / text]

8.  Acknowledgements

   The authors wish to acknowledge the contributions of a number of
   folk, including

   [TODO(WK, SW): Remember to add folk!]

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [DeBoer]   De Boer, M. and J. Bosma, "Discovering Path MTU black
              holes on the Internet using RIPE Atlas", July 2012, .

Authors' Addresses

   Scott Whyte
   Google Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   USA

   Email: swhyte@google.com

   Marcus Hines
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   USA

   Email: hines@google.com

   Warren Kumari
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   USA

   Email: warren@kumari.net