Network Working Group                                         C. Bormann
Internet-Draft                                   Universitaet Bremen TZI
Obsoletes: 7049 (if approved)                                 P. Hoffman
Intended status: Standards Track                                   ICANN
Expires: 28 March 2021                                 24 September 2020

              Concise Binary Object Representation (CBOR)
                       draft-ietf-cbor-7049bis-15

Abstract

   The Concise Binary Object Representation (CBOR) is a data format
   whose design goals include the possibility of extremely small code
   size, fairly small message size, and extensibility without the need
   for version negotiation. These design goals make it different from
   earlier binary serializations such as ASN.1 and MessagePack.

   This document is a revised edition of RFC 7049, with editorial
   improvements, added detail, and fixed errata. This revision formally
   obsoletes RFC 7049, while keeping full compatibility of the
   interchange format from RFC 7049. It does not create a new version
   of the format.

Contributing

   This note is to be removed before publishing as an RFC.

   This document is being worked on in the CBOR Working Group. Please
   contribute on the mailing list there, or in the GitHub repository for
   this draft: https://github.com/cbor-wg/CBORbis

   The charter for the CBOR Working Group says that the WG will update
   RFC 7049 to fix verified errata. Security issues and clarifications
   may be addressed, but changes to this document will ensure backward
   compatibility for popular deployed codebases. This document will be
   targeted at becoming an Internet Standard.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 28 March 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document. Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Objectives
     1.2.  Terminology
   2.  CBOR Data Models
     2.1.  Extended Generic Data Models
     2.2.  Specific Data Models
   3.  Specification of the CBOR Encoding
     3.1.  Major Types
     3.2.  Indefinite Lengths for Some Major Types
       3.2.1.  The "break" Stop Code
       3.2.2.  Indefinite-Length Arrays and Maps
       3.2.3.  Indefinite-Length Byte Strings and Text Strings
       3.2.4.  Summary of indefinite-length use of major types
     3.3.  Floating-Point Numbers and Values with No Content
     3.4.  Tagging of Items
       3.4.1.  Standard Date/Time String
       3.4.2.  Epoch-based Date/Time
       3.4.3.  Bignums
       3.4.4.  Decimal Fractions and Bigfloats
       3.4.5.  Content Hints
         3.4.5.1.  Encoded CBOR Data Item
         3.4.5.2.  Expected Later Encoding for CBOR-to-JSON Converters
         3.4.5.3.  Encoded Text
       3.4.6.  Self-Described CBOR
   4.  Serialization Considerations
     4.1.  Preferred Serialization
     4.2.  Deterministically Encoded CBOR
       4.2.1.  Core Deterministic Encoding Requirements
       4.2.2.  Additional Deterministic Encoding Considerations
       4.2.3.  Length-first Map Key Ordering
   5.  Creating CBOR-Based Protocols
     5.1.  CBOR in Streaming Applications
     5.2.  Generic Encoders and Decoders
     5.3.  Validity of Items
       5.3.1.  Basic validity
       5.3.2.  Tag validity
     5.4.  Validity and Evolution
     5.5.  Numbers
     5.6.  Specifying Keys for Maps
       5.6.1.  Equivalence of Keys
     5.7.  Undefined Values
   6.  Converting Data between CBOR and JSON
     6.1.  Converting from CBOR to JSON
     6.2.  Converting from JSON to CBOR
   7.  Future Evolution of CBOR
     7.1.  Extension Points
     7.2.  Curating the Additional Information Space
   8.  Diagnostic Notation
     8.1.  Encoding Indicators
   9.  IANA Considerations
     9.1.  Simple Values Registry
     9.2.  Tags Registry
     9.3.  Media Type ("MIME Type")
     9.4.  CoAP Content-Format
     9.5.  The +cbor Structured Syntax Suffix Registration
   10. Security Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Examples of Encoded CBOR Data Items
   Appendix B.  Jump Table for Initial Byte
   Appendix C.  Pseudocode
   Appendix D.  Half-Precision
   Appendix E.  Comparison of Other Binary Formats to CBOR's Design
           Objectives
     E.1.  ASN.1 DER, BER, and PER
     E.2.  MessagePack
     E.3.  BSON
     E.4.  MSDTP: RFC 713
     E.5.  Conciseness on the Wire
   Appendix F.  Well-formedness errors and examples
     F.1.  Examples for CBOR data items that are not well-formed
   Appendix G.  Changes from RFC 7049
     G.1.  Errata processing, clerical changes
     G.2.  Changes in IANA considerations
     G.3.  Changes in suggestions and other informational components
   Acknowledgements
   Authors' Addresses

1.  Introduction

   There are hundreds of standardized formats for binary representation
   of structured data (also known as binary serialization formats). Of
   those, some are for specific domains of information, while others are
   generalized for arbitrary data. In the IETF, probably the best-known
   formats in the latter category are ASN.1's BER and DER [ASN.1].

   The format defined here follows some specific design goals that are
   not well met by current formats. The underlying data model is an
   extended version of the JSON data model [RFC8259]. It is important
   to note that this is not a proposal that the grammar in RFC 8259 be
   extended in general, since doing so would cause a significant
   backwards incompatibility with already deployed JSON documents.
   Instead, this document simply defines its own data model that starts
   from JSON.

   Appendix E lists some existing binary formats and discusses how well
   they do or do not fit the design objectives of the Concise Binary
   Object Representation (CBOR).

   This document is a revised edition of [RFC7049], with editorial
   improvements, added detail, and fixed errata. This revision formally
   obsoletes RFC 7049, while keeping full compatibility of the
   interchange format from RFC 7049. It does not create a new version
   of the format.

1.1.  Objectives

   The objectives of CBOR, roughly in decreasing order of importance,
   are:

   1.  The representation must be able to unambiguously encode most
       common data formats used in Internet standards.

       *  It must represent a reasonable set of basic data types and
          structures using binary encoding. "Reasonable" here is
          largely influenced by the capabilities of JSON, with the major
          addition of binary byte strings. The structures supported are
          limited to arrays and trees; loops and lattice-style graphs
          are not supported.

       *  There is no requirement that all data formats be uniquely
          encoded; that is, it is acceptable that the number "7" might
          be encoded in multiple different ways.

   2.  The code for an encoder or decoder must be able to be compact in
       order to support systems with very limited memory, processor
       power, and instruction sets.

       *  An encoder and a decoder need to be implementable in a very
          small amount of code (for example, in class 1 constrained
          nodes as defined in [RFC7228]).

       *  The format should use contemporary machine representations of
          data (for example, not requiring binary-to-decimal
          conversion).

   3.  Data must be able to be decoded without a schema description.

       *  Similar to JSON, encoded data should be self-describing so
          that a generic decoder can be written.

   4.  The serialization must be reasonably compact, but data
       compactness is secondary to code compactness for the encoder and
       decoder.

       *  "Reasonable" here is bounded by JSON as an upper bound in
          size, and by the implementation complexity limiting how much
          effort can go into achieving that compactness. Using either
          general compression schemes or extensive bit-fiddling violates
          the complexity goals.

   5.  The format must be applicable to both constrained nodes and high-
       volume applications.

       *  This means it must be reasonably frugal in CPU usage for both
          encoding and decoding. This is relevant both for constrained
          nodes and for potential usage in applications with a very high
          volume of data.

   6.  The format must support all JSON data types for conversion to and
       from JSON.

       *  It must support a reasonable level of conversion as long as
          the data represented is within the capabilities of JSON. It
          must be possible to define a unidirectional mapping towards
          JSON for all types of data.

   7.  The format must be extensible, and the extended data must be
       decodable by earlier decoders.

       *  The format is designed for decades of use.

       *  The format must support a form of extensibility that allows
          fallback so that a decoder that does not understand an
          extension can still decode the message.

       *  The format must be able to be extended in the future by later
          IETF standards.

1.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The term "byte" is used in its now-customary sense as a synonym for
   "octet". All multi-byte values are encoded in network byte order
   (that is, most significant byte first, also known as "big-endian").

   This specification makes use of the following terminology:

   Data item:  A single piece of CBOR data. The structure of a data
      item may contain zero, one, or more nested data items. The term
      is used both for the data item in representation format and for
      the abstract idea that can be derived from that by a decoder; the
      former can be addressed specifically by using "encoded data item".

   Decoder:  A process that decodes a well-formed encoded CBOR data item
      and makes it available to an application. Formally speaking, a
      decoder contains a parser to break up the input using the syntax
      rules of CBOR, as well as a semantic processor to prepare the data
      in a form suitable to the application.

   Encoder:  A process that generates the (well-formed) representation
      format of a CBOR data item from application information.

   Data Stream:  A sequence of zero or more data items, not further
      assembled into a larger containing data item (see [RFC8742] for
      one application). The independent data items that make up a data
      stream are sometimes also referred to as "top-level data items".

   Well-formed:  A data item that follows the syntactic structure of
      CBOR. A well-formed data item uses the initial bytes and the byte
      strings and/or data items that are implied by their values as
      defined in CBOR and does not include following extraneous data.
      CBOR decoders by definition only return contents from well-formed
      data items.

   Valid:  A data item that is well-formed and also follows the semantic
      restrictions that apply to CBOR data items (Section 5.3).

   Expected:  Besides its normal English meaning, the term "expected" is
      used to describe requirements beyond CBOR validity that an
      application has on its input data. Well-formed (processable at
      all), valid (checked by a validity-checking generic decoder), and
      expected (checked by the application) form a hierarchy of layers
      of acceptability.

   Stream decoder:  A process that decodes a data stream and makes each
      of the data items in the sequence available to an application as
      they are received.

   Terms and concepts for floating-point values such as Infinity, NaN
   (not a number), negative zero, and subnormal are defined in
   [IEEE754].

   Where bit arithmetic or data types are explained, this document uses
   the notation familiar from the programming language C [C], except
   that "**" denotes exponentiation and ".." denotes a range that
   includes both ends given.
   Examples and pseudocode assume that signed
   integers use two's complement representation and that right shifts of
   signed integers perform sign extension; these assumptions are also
   specified in Sections 6.8.2 and 7.6.7 of the 2020 version of C++,
   successor of [Cplusplus17].

   Similar to the "0x" notation for hexadecimal numbers, numbers in
   binary notation are prefixed with "0b". Underscores can be added to
   a number solely for readability, so 0b00100001 (0x21) might be
   written 0b001_00001 to emphasize the desired interpretation of the
   bits in the byte; in this case, it is split into three bits and five
   bits. Encoded CBOR data items are sometimes given in the "0x" or
   "0b" notation; these values are first interpreted as numbers as in C
   and are then interpreted as byte strings in network byte order,
   including any leading zero bytes expressed in the notation.

   Words may be _italicized_ for emphasis; in the plain text form of
   this specification this is indicated by surrounding words with
   underscore characters. Verbatim text (e.g., names from a programming
   language) may be set in "monospace" type; in plain text this is
   approximated somewhat ambiguously by surrounding the text in double
   quotes (which also retain their usual meaning).

2.  CBOR Data Models

   CBOR is explicit about its generic data model, which defines the set
   of all data items that can be represented in CBOR. Its basic generic
   data model is extensible by the registration of "simple values" and
   tags. Applications can then subset the resulting extended generic
   data model to build their specific data models.

   Within environments that can represent the data items in the generic
   data model, generic CBOR encoders and decoders can be implemented
   (which usually involves defining additional implementation data types
   for those data items that do not already have a natural
   representation in the environment).
   The ability to provide generic
   encoders and decoders is an explicit design goal of CBOR; however,
   many applications will provide their own application-specific
   encoders and/or decoders.

   In the basic (un-extended) generic data model defined in Section 3, a
   data item is one of:

   *  an integer in the range -2**64..2**64-1 inclusive

   *  a simple value, identified by a number between 0 and 255, but
      distinct from that number itself

   *  a floating-point value, distinct from an integer, out of the set
      representable by IEEE 754 binary64 (including non-finites)
      [IEEE754]

   *  a sequence of zero or more bytes ("byte string")

   *  a sequence of zero or more Unicode code points ("text string")

   *  a sequence of zero or more data items ("array")

   *  a mapping (mathematical function) from zero or more data items
      ("keys") each to a data item ("values"), ("map")

   *  a tagged data item ("tag"), comprising a tag number (an integer in
      the range 0..2**64-1) and the tag content (a data item)

   Note that integer and floating-point values are distinct in this
   model, even if they have the same numeric value.

   Also note that serialization variants are not visible at the generic
   data model level, including the number of bytes of the encoded
   floating-point value or the choice of one of the ways in which an
   integer, the length of a text or byte string, the number of elements
   in an array or pairs in a map, or a tag number, (collectively "the
   argument", see Section 3) can be encoded.

2.1.  Extended Generic Data Models

   This basic generic data model comes pre-extended by the registration
   of a number of simple values and tag numbers right in this document,
   such as:

   *  "false", "true", "null", and "undefined" (simple values identified
      by 20..23)

   *  integer and floating-point values with a larger range and
      precision than the above (tag numbers 2 to 5)

   *  application data types such as a point in time or an RFC 3339
      date/time string (tag numbers 1, 0)

   Further elements of the extended generic data model can be (and have
   been) defined via the IANA registries created for CBOR. Even if such
   an extension is unknown to a generic encoder or decoder, data items
   using that extension can be passed to or from the application by
   representing them at the interface to the application within the
   basic generic data model, i.e., as generic simple values or generic
   tags.

   In other words, the basic generic data model is stable as defined in
   this document, while the extended generic data model expands by the
   registration of new simple values or tag numbers, but never shrinks.

   While there is a strong expectation that generic encoders and
   decoders can represent "false", "true", and "null" ("undefined" is
   intentionally omitted) in the form appropriate for their programming
   environment, implementation of the data model extensions created by
   tags is truly optional and a matter of implementation quality.

2.2.  Specific Data Models

   The specific data model for a CBOR-based protocol usually subsets the
   extended generic data model and assigns application semantics to the
   data items within this subset and its components.
   When documenting
   such specific data models, where it is desired to specify the types
   of data items, it is preferred to identify the types by the names
   they have in the generic data model ("negative integer", "array")
   instead of by referring to aspects of their CBOR representation
   ("major type 1", "major type 4").

   Specific data models can also specify what values (including values
   of different types) are equivalent for the purposes of map keys and
   encoder freedom. For example, in the generic data model, a valid map
   MAY have both "0" and "0.0" as keys, and an encoder MUST NOT encode
   "0.0" as an integer (major type 0, Section 3.1). However, if a
   specific data model declares that floating-point and integer
   representations of integral values are equivalent, using both map
   keys "0" and "0.0" in a single map would be considered duplicates,
   even while encoded as different major types, and so invalid; and an
   encoder could encode integral-valued floats as integers or vice
   versa, perhaps to save encoded bytes.

3.  Specification of the CBOR Encoding

   A CBOR data item (Section 2) is encoded to or decoded from a byte
   string carrying a well-formed encoded data item as described in this
   section. The encoding is summarized in Table 7 in Appendix B,
   indexed by the initial byte. An encoder MUST produce only well-
   formed encoded data items. A decoder MUST NOT return a decoded data
   item when it encounters input that is not a well-formed encoded CBOR
   data item (this does not detract from the usefulness of diagnostic
   and recovery tools that might make available some information from a
   damaged encoded CBOR data item).

   The initial byte of each encoded data item contains both information
   about the major type (the high-order 3 bits, described in
   Section 3.1) and additional information (the low-order 5 bits).
   With
   a few exceptions, the additional information's value describes how to
   load an unsigned integer "argument":

   Less than 24:  The argument's value is the value of the additional
      information.

   24, 25, 26, or 27:  The argument's value is held in the following 1,
      2, 4, or 8 bytes, respectively, in network byte order. For major
      type 7 and additional information value 25, 26, 27, these bytes
      are not used as an integer argument, but as a floating-point value
      (see Section 3.3).

   28, 29, 30:  These values are reserved for future additions to the
      CBOR format. In the present version of CBOR, the encoded item is
      not well-formed.

   31:  No argument value is derived. If the major type is 0, 1, or 6,
      the encoded item is not well-formed. For major types 2 to 5, the
      item's length is indefinite, and for major type 7, the byte does
      not constitute a data item at all but terminates an indefinite
      length item; all are described in Section 3.2.

   The initial byte and any additional bytes consumed to construct the
   argument are collectively referred to as the "head" of the data item.

   The meaning of this argument depends on the major type. For example,
   in major type 0, the argument is the value of the data item itself
   (and in major type 1 the value of the data item is computed from the
   argument); in major type 2 and 3 it gives the length of the string
   data in bytes that follows; and in major types 4 and 5 it is used to
   determine the number of data items enclosed.

   If the encoded sequence of bytes ends before the end of a data item,
   that item is not well-formed. If the encoded sequence of bytes still
   has bytes remaining after the outermost encoded item is decoded, that
   encoding is not a single well-formed CBOR item; depending on the
   application, the decoder may either treat the encoding as not well-
   formed or just identify the start of the remaining bytes to the
   application.
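
   As an illustration only, the rules above for splitting the initial
   byte and loading the argument might be sketched in Python as follows.
   This is not part of the specification; the names "decode_head" and
   "NotWellFormed" are hypothetical, and the sketch returns the raw
   argument without interpreting it for any major type.

```python
class NotWellFormed(ValueError):
    """Hypothetical error type: input cannot start a well-formed item."""


def decode_head(data: bytes, offset: int = 0):
    """Return (major_type, additional_info, argument, head_length).

    argument is None when the additional information is 31, because no
    argument value is derived in that case.  For major type 7 with
    additional information 25..27 the returned integer is only the raw
    bit pattern; the caller would reinterpret it as a floating-point
    value.
    """
    if offset >= len(data):
        raise NotWellFormed("input ends before the initial byte")
    initial = data[offset]
    major_type = initial >> 5          # high-order 3 bits
    additional_info = initial & 0x1F   # low-order 5 bits

    if additional_info < 24:
        # The argument is the additional information itself.
        return major_type, additional_info, additional_info, 1

    if additional_info <= 27:
        # 24, 25, 26, 27 -> 1, 2, 4, 8 following bytes, big-endian.
        size = 1 << (additional_info - 24)
        end = offset + 1 + size
        if end > len(data):
            raise NotWellFormed("input ends inside the head")
        argument = int.from_bytes(data[offset + 1:end], "big")
        return major_type, additional_info, argument, 1 + size

    if additional_info < 31:
        raise NotWellFormed("additional information 28..30 is reserved")

    # additional_info == 31: indefinite length or the "break" stop code.
    if major_type in (0, 1, 6):
        raise NotWellFormed("31 is not allowed for major types 0, 1, 6")
    return major_type, additional_info, None, 1
```

   Under these assumptions, decode_head(bytes.fromhex("1901f4")) yields
   major type 0, additional information 25, argument 500, and a head
   length of 3 bytes.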

   A CBOR decoder implementation can be based on a jump table with all
   256 defined values for the initial byte (Table 7). A decoder in a
   constrained implementation can instead use the structure of the
   initial byte and following bytes for more compact code (see
   Appendix C for a rough impression of how this could look).

3.1.  Major Types

   The following lists the major types and the additional information
   and other bytes associated with the type.

   Major type 0:  an unsigned integer in the range 0..2**64-1 inclusive.
      The value of the encoded item is the argument itself. For
      example, the integer 10 is denoted as the one byte 0b000_01010
      (major type 0, additional information 10). The integer 500 would
      be 0b000_11001 (major type 0, additional information 25) followed
      by the two bytes 0x01f4, which is 500 in decimal.

   Major type 1:  a negative integer in the range -2**64..-1 inclusive.
      The value of the item is -1 minus the argument. For example, the
      integer -500 would be 0b001_11001 (major type 1, additional
      information 25) followed by the two bytes 0x01f3, which is 499 in
      decimal.

   Major type 2:  a byte string. The number of bytes in the string is
      equal to the argument. For example, a byte string whose length is
      5 would have an initial byte of 0b010_00101 (major type 2,
      additional information 5 for the length), followed by 5 bytes of
      binary content. A byte string whose length is 500 would have 3
      initial bytes of 0b010_11001 (major type 2, additional information
      25 to indicate a two-byte length) followed by the two bytes 0x01f4
      for a length of 500, followed by 500 bytes of binary content.

   Major type 3:  a text string (Section 2), encoded as UTF-8
      ([RFC3629]). The number of bytes in the string is equal to the
      argument. A string containing an invalid UTF-8 sequence is well-
      formed but invalid (Section 1.2).
      This type is provided for
      systems that need to interpret or display human-readable text, and
      allows the differentiation between unstructured bytes and text
      that has a specified repertoire (that of Unicode) and encoding
      (UTF-8). In contrast to formats such as JSON, the Unicode
      characters in this type are never escaped. Thus, a newline
      character (U+000A) is always represented in a string as the byte
      0x0a, and never as the bytes 0x5c6e (the characters "\" and "n")
      nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and
      "a").

   Major type 4:  an array of data items. In other formats, arrays are
      also called lists, sequences, or tuples (a "CBOR sequence" is
      something slightly different, though [RFC8742]). The argument is
      the number of data items in the array. Items in an array do not
      need to all be of the same type. For example, an array that
      contains 10 items of any type would have an initial byte of
      0b100_01010 (major type 4, additional information 10 for the
      length) followed by the 10 remaining items.

   Major type 5:  a map of pairs of data items. Maps are also called
      tables, dictionaries, hashes, or objects (in JSON). A map is
      comprised of pairs of data items, each pair consisting of a key
      that is immediately followed by a value. The argument is the
      number of _pairs_ of data items in the map. For example, a map
      that contains 9 pairs would have an initial byte of 0b101_01001
      (major type 5, additional information 9 for the number of pairs)
      followed by the 18 remaining items. The first item is the first
      key, the second item is the first value, the third item is the
      second key, and so on. Because items in a map come in pairs,
      their total number is always even: A map that contains an odd
      number of items (no value data present after the last key data
      item) is not well-formed.
      A map that has duplicate keys may be
      well-formed, but it is not valid, and thus it causes indeterminate
      decoding; see also Section 5.6.

   Major type 6:  a tagged data item ("tag") whose tag number, an
      integer in the range 0..2**64-1 inclusive, is the argument and
      whose enclosed data item ("tag content") is the single encoded
      data item that follows the head. See Section 3.4.

   Major type 7:  floating-point numbers and simple values, as well as
      the "break" stop code. See Section 3.3.

   These eight major types lead to a simple table showing which of the
   256 possible values for the initial byte of a data item are used
   (Table 7).

   In major types 6 and 7, many of the possible values are reserved for
   future specification. See Section 9 for more information on these
   values.

   Table 1 summarizes the major types defined by CBOR, ignoring the next
   section for now. The number N in this table stands for the argument,
   mt for the major type.

   +====+=======================+=================================+
   | mt | Meaning               | Content                         |
   +====+=======================+=================================+
   | 0  | unsigned integer N    | -                               |
   +----+-----------------------+---------------------------------+
   | 1  | negative integer -1-N | -                               |
   +----+-----------------------+---------------------------------+
   | 2  | byte string           | N bytes                         |
   +----+-----------------------+---------------------------------+
   | 3  | text string           | N bytes (UTF-8 text)            |
   +----+-----------------------+---------------------------------+
   | 4  | array                 | N data items (elements)         |
   +----+-----------------------+---------------------------------+
   | 5  | map                   | 2N data items (key/value pairs) |
   +----+-----------------------+---------------------------------+
   | 6  | tag of number N       | 1 data item                     |
   +----+-----------------------+---------------------------------+
   | 7  | simple/float          | -                               |
+----+-----------------------+---------------------------------+ 616 Table 1: Overview over the definite-length use of CBOR major 617 types (mt = major type, N = argument) 619 3.2. Indefinite Lengths for Some Major Types 621 Four CBOR items (arrays, maps, byte strings, and text strings) can be 622 encoded with an indefinite length using additional information value 623 31. This is useful if the encoding of the item needs to begin before 624 the number of items inside the array or map, or the total length of 625 the string, is known. (The ability to start sending a data item 626 before all of it is known is often referred to as "streaming" within 627 that data item.) 629 Indefinite-length arrays and maps are dealt with differently than 630 indefinite-length strings (byte strings and text strings). 632 3.2.1. The "break" Stop Code 634 The "break" stop code is encoded with major type 7 and additional 635 information value 31 (0b111_11111). It is not itself a data item: it 636 is just a syntactic feature to close an indefinite-length item. 638 If the "break" stop code appears anywhere where a data item is 639 expected, other than directly inside an indefinite-length string, 640 array, or map -- for example directly inside a definite-length array 641 or map -- the enclosing item is not well-formed. 643 3.2.2. Indefinite-Length Arrays and Maps 645 Indefinite-length arrays and maps are represented using their major 646 type with the additional information value of 31, followed by an 647 arbitrary-length sequence of zero or more items for an array or key/ 648 value pairs for a map, followed by the "break" stop code 649 (Section 3.2.1). In other words, indefinite-length arrays and maps 650 look identical to other arrays and maps except for beginning with the 651 additional information value of 31 and ending with the "break" stop 652 code. 654 If the "break" stop code appears after a key in a map, in place of 655 that key's value, the map is not well-formed. 
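The well-formedness rules for the "break" stop code in arrays and maps can be illustrated with a small, non-normative Python sketch. It handles only integers, arrays, and maps; the names decode_item, read_argument, BREAK, and NotWellFormed are hypothetical and are not defined by this specification. Maps are returned as lists of key/value pairs, and no validity checking (such as duplicate-key detection) is attempted.

```python
BREAK = object()  # sentinel standing for the "break" stop code (0xff)


class NotWellFormed(ValueError):
    """Raised when the input violates a well-formedness rule."""


def read_argument(data, pos, ai):
    """Return (argument, new_pos) for additional information 0..27."""
    if ai < 24:
        return ai, pos
    if ai > 27:
        raise NotWellFormed("additional information %d not allowed here" % ai)
    size = 1 << (ai - 24)  # 1, 2, 4, or 8 bytes follow
    if pos + size > len(data):
        raise NotWellFormed("truncated head")
    return int.from_bytes(data[pos:pos + size], "big"), pos + size


def decode_item(data, pos=0, allow_break=False):
    """Decode one item; return (value, new_pos). Maps become lists of pairs."""
    if pos >= len(data):
        raise NotWellFormed("unexpected end of input")
    initial = data[pos]
    pos += 1
    if initial == 0xFF:
        if not allow_break:
            raise NotWellFormed('"break" where a data item is expected')
        return BREAK, pos
    mt, ai = initial >> 5, initial & 0x1F
    if mt in (0, 1):
        n, pos = read_argument(data, pos, ai)
        return (n if mt == 0 else -1 - n), pos
    if mt not in (4, 5):
        raise NotImplementedError("major type %d is outside this sketch" % mt)
    indefinite = ai == 31
    count = 0
    if not indefinite:
        count, pos = read_argument(data, pos, ai)
    per_entry = 1 if mt == 4 else 2  # a map entry is a key followed by a value
    items = []
    while indefinite or len(items) < count * per_entry:
        # "break" is acceptable only between entries of an indefinite-length item
        at_entry_start = len(items) % per_entry == 0
        value, pos = decode_item(data, pos, indefinite and at_entry_start)
        if value is BREAK:
            break
        items.append(value)
    if mt == 4:
        return items, pos
    return list(zip(items[0::2], items[1::2])), pos
```

In this sketch, a "break" inside a definite-length array (0x8201ff) and a "break" in place of a map value (0xbf01ff) are both rejected, while 0x9f018202039f0405ffff decodes to the same array as 0x8301820203820405.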
657  There is no restriction against nesting indefinite-length array or
658  map items. A "break" only terminates a single item, so nested
659  indefinite-length items need exactly as many "break" stop codes as
660  there are type bytes starting an indefinite-length item.

662  For example, assume an encoder wants to represent the abstract array
663  [1, [2, 3], [4, 5]]. The definite-length encoding would be
664  0x8301820203820405:

666  83        -- Array of length 3
667     01     -- 1
668     82     -- Array of length 2
669        02  -- 2
670        03  -- 3
671     82     -- Array of length 2
672        04  -- 4
673        05  -- 5

675  Indefinite-length encoding could be applied independently to each of
676  the three arrays encoded in this data item, as required, leading to
677  representations such as:

679  0x9f018202039f0405ffff
680  9F        -- Start indefinite-length array
681     01     -- 1
682     82     -- Array of length 2
683        02  -- 2
684        03  -- 3
685     9F     -- Start indefinite-length array
686        04  -- 4
687        05  -- 5
688        FF  -- "break" (inner array)
689     FF     -- "break" (outer array)

691  0x9f01820203820405ff
692  9F        -- Start indefinite-length array
693     01     -- 1
694     82     -- Array of length 2
695        02  -- 2
696        03  -- 3
697     82     -- Array of length 2
698        04  -- 4
699        05  -- 5
700     FF     -- "break"

702  0x83018202039f0405ff
703  83        -- Array of length 3
704     01     -- 1
705     82     -- Array of length 2
706        02  -- 2
707        03  -- 3
708     9F     -- Start indefinite-length array
709        04  -- 4
710        05  -- 5
711        FF  -- "break"

713  0x83019f0203ff820405
714  83        -- Array of length 3
715     01     -- 1
716     9F     -- Start indefinite-length array
717        02  -- 2
718        03  -- 3
719        FF  -- "break"
720     82     -- Array of length 2
721        04  -- 4
722        05  -- 5

724  An example of an indefinite-length map (that happens to have two key/
725  value pairs) might be:

727  0xbf6346756ef563416d7421ff
728  BF           -- Start indefinite-length map
729     63        -- First key, UTF-8 string length 3
730        46756e --   "Fun"
731     F5        -- First value, true
732     63        -- Second key, UTF-8 string length 3
733        416d74 --   "Amt"
734     21        -- Second value, -2
735     FF        -- "break"

737  3.2.3.
Indefinite-Length Byte Strings and Text Strings 739 Indefinite-length strings are represented by a byte containing the 740 major type and additional information value of 31, followed by a 741 series of zero or more byte or text strings ("chunks") that have 742 definite lengths, followed by the "break" stop code (Section 3.2.1). 743 The data item represented by the indefinite-length string is the 744 concatenation of the chunks (i.e., the empty byte or text string, 745 respectively, if no chunk is present). (Note that zero-length 746 chunks, while not particularly useful, are permitted.) 748 If any item between the indefinite-length string indicator 749 (0b010_11111 or 0b011_11111) and the "break" stop code is not a 750 definite-length string item of the same major type, the string is not 751 well-formed. 753 The design does not allow nesting indefinite-length strings as chunks 754 into indefinite-length strings. If it were allowed, it would require 755 decoder implementations to keep a stack, or at least a count, of 756 nesting levels. It is unnecessary on the encoder side because the 757 inner indefinite-length string would consist of chunks, and these 758 could instead be put directly into the outer indefinite-length 759 string. 761 If any definite-length text string inside an indefinite-length text 762 string is invalid, the indefinite-length text string is invalid. 763 Note that this implies that the UTF-8 bytes of a single Unicode code 764 point (scalar value) cannot be spread between chunks: a new chunk of 765 a text string can only be started at a code point boundary. 
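The chunking rules above can be illustrated with a non-normative Python sketch. It assumes a hypothetical encoder helper, encode_indefinite_text, that cuts chunks only at code point boundaries, and a hypothetical decoder helper, decode_indefinite_string, that rejects nested indefinite-length chunks and chunks of the wrong major type and validates each text chunk on its own. None of these names are defined by this specification.

```python
def definite_head(major_type, length):
    """Encode a definite-length head using the shortest argument form."""
    if length < 24:
        return bytes([major_type << 5 | length])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if length < 1 << (8 * size):
            return bytes([major_type << 5 | ai]) + length.to_bytes(size, "big")
    raise ValueError("length does not fit in 64 bits")


def split_at_code_points(text, max_bytes):
    """Split text into UTF-8 chunks, never cutting inside a code point."""
    if max_bytes < 4:
        raise ValueError("max_bytes must hold any UTF-8 code point (4 bytes)")
    chunks, current = [], b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(current) + len(encoded) > max_bytes:
            chunks.append(current)
            current = b""
        current += encoded
    if current:
        chunks.append(current)
    return chunks


def encode_indefinite_text(text, max_bytes):
    """Encode text as an indefinite-length text string (0x7f ... 0xff)."""
    body = b"".join(definite_head(3, len(c)) + c
                    for c in split_at_code_points(text, max_bytes))
    return b"\x7f" + body + b"\xff"


def decode_indefinite_string(data):
    """Concatenate the chunks of an indefinite-length byte or text string."""
    if not data or data[0] not in (0x5F, 0x7F):
        raise ValueError("not an indefinite-length string")
    major_type = data[0] >> 5
    pos, chunks = 1, []
    while True:
        if pos >= len(data):
            raise ValueError('missing "break" stop code')
        initial = data[pos]
        pos += 1
        if initial == 0xFF:
            break
        mt, ai = initial >> 5, initial & 0x1F
        if mt != major_type or ai > 27:
            raise ValueError("chunk is not a definite-length string "
                             "of the same major type")
        length = ai
        if ai >= 24:
            size = 1 << (ai - 24)
            if pos + size > len(data):
                raise ValueError("truncated chunk head")
            length = int.from_bytes(data[pos:pos + size], "big")
            pos += size
        chunk = data[pos:pos + length]
        if len(chunk) != length:
            raise ValueError("truncated chunk")
        pos += length
        if major_type == 3:
            chunk.decode("utf-8")  # each chunk must be valid UTF-8 by itself
        chunks.append(chunk)
    joined = b"".join(chunks)
    return joined.decode("utf-8") if major_type == 3 else joined
```

The choice to validate every text chunk separately, rather than only the concatenation, mirrors the rule that a code point cannot be spread between chunks: a chunk ending in the middle of a multi-byte sequence fails to decode even though the concatenated bytes would be valid UTF-8.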
767  For example, assume an encoded data item consisting of the bytes:

769  0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111

771  5F              -- Start indefinite-length byte string
772     44           -- Byte string of length 4
773        aabbccdd  -- Bytes content
774     43           -- Byte string of length 3
775        eeff99    -- Bytes content
776     FF           -- "break"

778  After decoding, this results in a single byte string with seven
779  bytes: 0xaabbccddeeff99.

781  3.2.4. Summary of Indefinite-Length Use of Major Types

783  Table 2 summarizes the major types defined by CBOR as used for
784  indefinite-length encoding (with additional information set to 31).
785  mt stands for the major type.

787  +====+===================+==================================+
788  | mt | Meaning           | enclosed up to "break" stop code |
789  +====+===================+==================================+
790  | 0  | (not well-formed) | -                                |
791  +----+-------------------+----------------------------------+
792  | 1  | (not well-formed) | -                                |
793  +----+-------------------+----------------------------------+
794  | 2  | byte string       | definite-length byte strings     |
795  +----+-------------------+----------------------------------+
796  | 3  | text string       | definite-length text strings     |
797  +----+-------------------+----------------------------------+
798  | 4  | array             | data items (elements)            |
799  +----+-------------------+----------------------------------+
800  | 5  | map               | data items (key/value pairs)     |
801  +----+-------------------+----------------------------------+
802  | 6  | (not well-formed) | -                                |
803  +----+-------------------+----------------------------------+
804  | 7  | "break" stop code | -                                |
805  +----+-------------------+----------------------------------+

807  Table 2: Overview of the indefinite-length use of CBOR
808  major types (mt = major type, additional information =
809  31)

811  3.3.
Floating-Point Numbers and Values with No Content

813  Major type 7 is for two types of data: floating-point numbers and
814  "simple values" that do not need any content. Each value of the
815  5-bit additional information in the initial byte has its own separate
816  meaning, as defined in Table 3. Like the major types for integers,
817  items of this major type do not carry content data; all the
818  information is in the initial bytes (the head).

820  +=============+===================================================+
821  | 5-Bit Value | Semantics                                         |
822  +=============+===================================================+
823  | 0..23       | Simple value (value 0..23)                        |
824  +-------------+---------------------------------------------------+
825  | 24          | Simple value (value 32..255 in following byte)    |
826  +-------------+---------------------------------------------------+
827  | 25          | IEEE 754 Half-Precision Float (16 bits follow)    |
828  +-------------+---------------------------------------------------+
829  | 26          | IEEE 754 Single-Precision Float (32 bits follow)  |
830  +-------------+---------------------------------------------------+
831  | 27          | IEEE 754 Double-Precision Float (64 bits follow)  |
832  +-------------+---------------------------------------------------+
833  | 28-30       | Reserved, not well-formed in the present document |
834  +-------------+---------------------------------------------------+
835  | 31          | "break" stop code for indefinite-length items     |
836  |             | (Section 3.2.1)                                   |
837  +-------------+---------------------------------------------------+

839  Table 3: Values for Additional Information in Major Type 7

841  As with all other major types, the 5-bit value 24 signifies a single-
842  byte extension: it is followed by an additional byte to represent the
843  simple value. (To minimize confusion, only the values 32 to 255 are
844  used.) This maintains the structure of the initial bytes: as for the
845  other major types, the length of these always depends on the
846  additional information in the first byte. Table 4 lists the numeric
847  values assigned and available for simple values.

849  +=========+==============+
850  | Value   | Semantics    |
851  +=========+==============+
852  | 0..19   | (Unassigned) |
853  +---------+--------------+
854  | 20      | False        |
855  +---------+--------------+
856  | 21      | True         |
857  +---------+--------------+
858  | 22      | Null         |
859  +---------+--------------+
860  | 23      | Undefined    |
861  +---------+--------------+
862  | 24..31  | (Reserved)   |
863  +---------+--------------+
864  | 32..255 | (Unassigned) |
865  +---------+--------------+

867  Table 4: Simple Values

869  An encoder MUST NOT issue two-byte sequences that start with 0xf8
870  (major type 7, additional information 24) and continue with a byte
871  less than 0x20 (32 decimal). Such sequences are not well-formed.
872  (This implies that an encoder cannot encode false, true, null, or
873  undefined in two-byte sequences, and that only the one-byte variants
874  of these are well-formed; more generally speaking, each simple value
875  only has a single representation variant).

877  The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit
878  IEEE 754 binary floating-point values [IEEE754]. These floating-
879  point values are encoded in the additional bytes of the appropriate
880  size. (See Appendix D for some information about 16-bit floating-
881  point numbers.)

883  3.4. Tagging of Items

885  In CBOR, a data item can be enclosed by a tag to give it some
886  additional semantics, as uniquely identified by a "tag number". The
887  tag is major type 6, its argument (Section 3) indicates the tag
888  number, and it contains a single enclosed data item, the "tag
889  content". (If a tag requires further structure to its content, this
890  structure is provided by the enclosed data item.)
We use the term 891 "tag" for the entire data item consisting of both a tag number and 892 the tag content: the tag content is the data item that is being 893 tagged. 895 For example, assume that a byte string of length 12 is marked with a 896 tag of number 2 to indicate it is a positive "bignum" 897 (Section 3.4.3). The encoded data item would start with a byte 898 0b110_00010 (major type 6, additional information 2 for the tag 899 number) followed by the encoded tag content: 0b010_01100 (major type 900 2, additional information of 12 for the length) followed by the 12 901 bytes of the bignum. 903 The definition of a tag number describes the additional semantics 904 conveyed for tags with this tag number in the extended generic data 905 model. These semantics may include equivalence of some tagged data 906 items with other data items, including some that can already be 907 represented in the basic generic data model. For instance, 0xc24101, 908 a bignum the tag content of which is the byte string with the single 909 byte 0x01, is equivalent to an integer 1, which could also be encoded 910 for instance as 0x01, 0x1801, or 0x190001. The tag definition may 911 include the definition of a preferred serialization (Section 4.1) 912 that is recommended for generic encoders; this may prefer basic 913 generic data model representations over ones that employ a tag. 915 The tag definition usually restricts what kinds of nested data item 916 or items are valid for such tags. Tag definitions may restrict their 917 content to a very specific syntactic structure, as the tags defined 918 in this document do, or they may aim at a more semantically defined 919 definition of their content, as for instance tags 40 and 1040 do 920 [RFC8746]: These accept a number of different ways of representing 921 arrays. 
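The equivalence described above, in which 0xc24101 denotes the same number as 0x01, 0x1801, or 0x190001, can be made concrete with a short, non-normative Python sketch. The helper names read_head and decode_integer are hypothetical. The sketch covers definite-length heads only and treats tag numbers 2 and 3 as specified in Section 3.4.3.

```python
def read_head(data, pos=0):
    """Return (major_type, argument, new_pos) for a definite-length head."""
    initial = data[pos]
    mt, ai = initial >> 5, initial & 0x1F
    pos += 1
    if ai < 24:
        return mt, ai, pos
    if ai > 27:
        raise ValueError("additional information %d not handled here" % ai)
    size = 1 << (ai - 24)  # 1, 2, 4, or 8 bytes follow the initial byte
    if pos + size > len(data):
        raise ValueError("truncated head")
    return mt, int.from_bytes(data[pos:pos + size], "big"), pos + size


def decode_integer(data):
    """Decode a basic integer or a tag 2/3 bignum to a Python int."""
    mt, arg, pos = read_head(data)
    if mt == 0:
        return arg
    if mt == 1:
        return -1 - arg
    if mt == 6 and arg in (2, 3):
        inner_mt, length, pos = read_head(data, pos)
        if inner_mt != 2:
            raise ValueError("bignum tag content must be a byte string")
        content = data[pos:pos + length]
        if len(content) != length:
            raise ValueError("truncated byte string")
        n = int.from_bytes(content, "big")  # network byte order
        return n if arg == 2 else -1 - n
    raise ValueError("not an integer in the extended generic data model")
```

With these helpers, read_head(b"\xc2") yields major type 6 with argument 2, read_head(b"\x4c") yields major type 2 with argument 12, and all four encodings of the number 1 mentioned above decode to the same Python integer.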
923 As a matter of convention, many tags do not accept null or undefined 924 values as tag content; instead, the expectation is that a null or 925 undefined value can be used in place of the entire tag; Section 3.4.2 926 provides some further considerations for one specific tag about the 927 handling of this convention in application protocols and in mapping 928 to platform types. 930 Decoders do not need to understand tags of every tag number, and tags 931 may be of little value in applications where the implementation 932 creating a particular CBOR data item and the implementation decoding 933 that stream know the semantic meaning of each item in the data flow. 934 Their primary purpose in this specification is to define common data 935 types such as dates. A secondary purpose is to provide conversion 936 hints when it is foreseen that the CBOR data item needs to be 937 translated into a different format, requiring hints about the content 938 of items. Understanding the semantics of tags is optional for a 939 decoder; it can simply present both the tag number and the tag 940 content to the application, without interpreting the additional 941 semantics of the tag. 943 A tag applies semantics to the data item it encloses. Tags can nest: 944 If tag A encloses tag B, which encloses data item C, tag A applies to 945 the result of applying tag B on data item C. 947 IANA maintains a registry of tag numbers as described in Section 9.2. 948 Table 5 provides a list of tag numbers that were defined in 949 [RFC7049], with definitions in the rest of this section. (Tag number 950 35 was also defined in [RFC7049]; a discussion of this tag number 951 follows in Section 3.4.5.3.) Note that many other tag numbers have 952 been defined since the publication of [RFC7049]; see the registry 953 described at Section 9.2 for the complete list. 
955  +============+=============+==================================+
956  | Tag Number | Data Item   | Tag Content Semantics            |
957  +============+=============+==================================+
958  | 0          | text string | Standard date/time string; see   |
959  |            |             | Section 3.4.1                    |
960  +------------+-------------+----------------------------------+
961  | 1          | integer or  | Epoch-based date/time; see       |
962  |            | float       | Section 3.4.2                    |
963  +------------+-------------+----------------------------------+
964  | 2          | byte string | Positive bignum; see             |
965  |            |             | Section 3.4.3                    |
966  +------------+-------------+----------------------------------+
967  | 3          | byte string | Negative bignum; see             |
968  |            |             | Section 3.4.3                    |
969  +------------+-------------+----------------------------------+
970  | 4          | array       | Decimal fraction; see            |
971  |            |             | Section 3.4.4                    |
972  +------------+-------------+----------------------------------+
973  | 5          | array       | Bigfloat; see Section 3.4.4      |
974  +------------+-------------+----------------------------------+
975  | 21         | (any)       | Expected conversion to base64url |
976  |            |             | encoding; see Section 3.4.5.2    |
977  +------------+-------------+----------------------------------+
978  | 22         | (any)       | Expected conversion to base64    |
979  |            |             | encoding; see Section 3.4.5.2    |
980  +------------+-------------+----------------------------------+
981  | 23         | (any)       | Expected conversion to base16    |
982  |            |             | encoding; see Section 3.4.5.2    |
983  +------------+-------------+----------------------------------+
984  | 24         | byte string | Encoded CBOR data item; see      |
985  |            |             | Section 3.4.5.1                  |
986  +------------+-------------+----------------------------------+
987  | 32         | text string | URI; see Section 3.4.5.3         |
988  +------------+-------------+----------------------------------+
989  | 33         | text string | base64url; see Section 3.4.5.3   |
990  +------------+-------------+----------------------------------+
991  | 34         | text string | base64; see Section 3.4.5.3      |
992  +------------+-------------+----------------------------------+
993  | 36         | text string | MIME message; see                |
994  |            |             | Section 3.4.5.3                  |
995  +------------+-------------+----------------------------------+
996  | 55799      | (any)       | Self-described CBOR; see         |
997  |            |             | Section 3.4.6                    |
998  +------------+-------------+----------------------------------+

1000  Table 5: Tag numbers defined in RFC 7049

1002  Conceptually, tags are interpreted in the generic data model, not at
1003  (de-)serialization time. A small number of tags (at this time, tag
1004  number 25 and tag number 29 [IANA.cbor-tags]) have been registered
1005  with semantics that may require processing at (de-)serialization
1006  time: The decoder needs to be aware and the encoder needs to be in
1007  control of the exact sequence in which data items are encoded into
1008  the CBOR data item. This means these tags cannot be implemented on
1009  top of an arbitrary generic CBOR encoder/decoder (which might not
1010  reflect the serialization order for entries in a map at the data
1011  model level and vice versa); their implementation therefore typically
1012  needs to be integrated into the generic encoder/decoder. The
1013  definition of new tags with this property is NOT RECOMMENDED.

1015  IANA allocated tag numbers 65535, 4294967295, and
1016  18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
1017  These can be used as a convenience for implementers that want a
1018  single integer data structure to indicate either that a specific tag
1019  is present, or the absence of a tag. That allocation is described in
1020  Section 10 of [I-D.bormann-cbor-notable-tags]. These tags are not
1021  intended to occur in actual CBOR data items; implementations MAY flag
1022  such an occurrence as an error.
1024 Protocols using tag numbers 0 and 1 extend the generic data model 1025 (Section 2) with data items representing points in time; tag numbers 1026 2 and 3, with arbitrarily sized integers; and tag numbers 4 and 5, 1027 with floating-point values of arbitrary size and precision. 1029 3.4.1. Standard Date/Time String 1031 Tag number 0 contains a text string in the standard format described 1032 by the "date-time" production in [RFC3339], as refined by Section 3.3 1033 of [RFC4287], representing the point in time described there. A 1034 nested item of another type or a text string that doesn't match the 1035 [RFC4287] format is invalid. 1037 3.4.2. Epoch-based Date/Time 1039 Tag number 1 contains a numerical value counting the number of 1040 seconds from 1970-01-01T00:00Z in UTC time to the represented point 1041 in civil time. 1043 The tag content MUST be an unsigned or negative integer (major types 1044 0 and 1), or a floating-point number (major type 7 with additional 1045 information 25, 26, or 27). Other contained types are invalid. 1047 Non-negative values (major type 0 and non-negative floating-point 1048 numbers) stand for time values on or after 1970-01-01T00:00Z UTC and 1049 are interpreted according to POSIX [TIME_T]. (POSIX time is also 1050 known as "UNIX Epoch time".) Leap seconds are handled specially by 1051 POSIX time and this results in a 1 second discontinuity several times 1052 per decade. Note that applications that require the expression of 1053 times beyond early 2106 cannot leave out support of 64-bit integers 1054 for the tag content. 1056 Negative values (major type 1 and negative floating-point numbers) 1057 are interpreted as determined by the application requirements as 1058 there is no universal standard for UTC count-of-seconds time before 1059 1970-01-01T00:00Z (this is particularly true for points in time that 1060 precede discontinuities in national calendars). The same applies to 1061 non-finite values. 
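As a non-normative illustration, the following Python sketch maps already-decoded tag number 1 content to a UTC timestamp. The function name epoch_tag_to_datetime is hypothetical. Rejecting negative and non-finite values is merely one possible application policy, chosen here as an assumption because this specification leaves their interpretation to the application.

```python
import math
from datetime import datetime, timedelta, timezone

# The reference point of tag number 1: 1970-01-01T00:00Z
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_tag_to_datetime(content):
    """Map decoded tag 1 content (int or float) to an aware UTC datetime."""
    if isinstance(content, bool) or not isinstance(content, (int, float)):
        raise ValueError("tag 1 content must be an integer or a float")
    if isinstance(content, float) and not math.isfinite(content):
        # assumption: this application does not assign a meaning to NaN/Infinity
        raise ValueError("non-finite tag 1 content is left to the application")
    if content < 0:
        # assumption: this application does not handle times before the epoch
        raise ValueError("times before the epoch are left to the application")
    return EPOCH + timedelta(seconds=content)
```

For example, the content 1363896240 corresponds to 2013-03-21T20:04:00Z, and the content 4294967296 (2**32), which no longer fits a 32-bit argument, falls in early 2106.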
1063  To indicate fractional seconds, floating-point values can be used
1064  within tag number 1 instead of integer values. Note that this
1065  generally requires binary64 support, as binary16 and binary32 provide
1066  non-zero fractions of seconds only for a short period of time around
1067  early 1970. An application that requires tag number 1 support may
1068  restrict the tag content to be an integer (or a floating-point value)
1069  only.

1071  Note that platform types for date/time may include null or undefined
1072  values, which may also be desirable at an application protocol level.
1073  While emitting tag number 1 values with non-finite tag content values
1074  (e.g., with NaN for undefined date/time values or with Infinity for
1075  an expiry date that is not set) may seem an obvious way to handle
1076  this, using untagged null or undefined avoids the use of non-finites
1077  and results in a shorter encoding. Application protocol designers
1078  are encouraged to consider these cases and include clear guidelines
1079  for handling them.

1081  3.4.3. Bignums

1083  Protocols using tag numbers 2 and 3 extend the generic data model
1084  (Section 2) with "bignums" representing arbitrarily sized integers.
1085  In the basic generic data model, bignum values are not equal to
1086  integers from the same model, but the extended generic data model
1087  created by this tag definition defines equivalence based on numeric
1088  value, and preferred serialization (Section 4.1) never makes use of
1089  bignums that also can be expressed as basic integers (see below).

1091  Bignums are encoded as a byte string data item, which is interpreted
1092  as an unsigned integer n in network byte order. Contained items of
1093  other types are invalid. For tag number 2, the value of the bignum
1094  is n. For tag number 3, the value of the bignum is -1 - n. The
1095  preferred serialization of the byte string is to leave out any
1096  leading zeroes (note that this means the preferred serialization for
1097  n = 0 is the empty byte string, but see below). Decoders that
1098  understand these tags MUST be able to decode bignums that do have
1099  leading zeroes. The preferred serialization of an integer that can
1100  be represented using major type 0 or 1 is to encode it this way
1101  instead of as a bignum (which means that the empty string never
1102  occurs in a bignum when using preferred serialization). Note that
1103  this means the non-preferred choice of a bignum representation
1104  instead of a basic integer for encoding a number is not intended to
1105  have application semantics (just as the choice of a longer basic
1106  integer representation than needed, such as 0x1800 for 0x00 does
1107  not).

1109  For example, the number 18446744073709551616 (2**64) is represented
1110  as 0b110_00010 (major type 6, tag number 2), followed by 0b010_01001
1111  (major type 2, length 9), followed by 0x010000000000000000 (one byte
1112  0x01 and eight bytes 0x00). In hexadecimal:

1114  C2                        -- Tag 2
1115     49                     -- Byte string of length 9
1116        010000000000000000  -- Bytes content

1118  3.4.4. Decimal Fractions and Bigfloats

1120  Protocols using tag number 4 extend the generic data model with data
1121  items representing arbitrary-length decimal fractions of the form
1122  m*(10**e). Protocols using tag number 5 extend the generic data
1123  model with data items representing arbitrary-length binary fractions
1124  of the form m*(2**e). As with bignums, values of different types are
1125  not equal in the generic data model.

1127  Decimal fractions combine an integer mantissa with a base-10 scaling
1128  factor. They are most useful if an application needs the exact
1129  representation of a decimal fraction such as 1.1 because there is no
1130  exact representation for many decimal fractions in binary floating-
1131  point representations.
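The forms m*(10**e) and m*(2**e) can be evaluated exactly with rational arithmetic. The following non-normative Python sketch assumes that the tag content has already been decoded into a two-element list holding the exponent e first and the mantissa m second, which is the order this section specifies for the tagged array. The function name tagged_fraction_value is hypothetical.

```python
from fractions import Fraction


def tagged_fraction_value(tag_number, content):
    """Return the exact value of a decimal fraction (tag 4) or bigfloat (tag 5).

    content is assumed to be the decoded tag content [e, m].
    """
    if tag_number not in (4, 5):
        raise ValueError("only tag numbers 4 and 5 are handled")
    if len(content) != 2 or not all(
            isinstance(x, int) and not isinstance(x, bool) for x in content):
        raise ValueError("tag content must be an array of two integers")
    e, m = content
    base = 10 if tag_number == 4 else 2
    return m * Fraction(base) ** e  # exact: no binary rounding takes place
```

With this helper, tagged_fraction_value(4, [-1, 11]) is exactly 11/10, whereas Fraction(1.1), built from the nearest binary64 value, is not. This is the situation in which a decimal fraction is preferable to a floating-point number.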
1133 "Bigfloats" combine an integer mantissa with a base-2 scaling factor. 1134 They are binary floating-point values that can exceed the range or 1135 the precision of the three IEEE 754 formats supported by CBOR 1136 (Section 3.3). Bigfloats may also be used by constrained 1137 applications that need some basic binary floating-point capability 1138 without the need for supporting IEEE 754. 1140 A decimal fraction or a bigfloat is represented as a tagged array 1141 that contains exactly two integer numbers: an exponent e and a 1142 mantissa m. Decimal fractions (tag number 4) use base-10 exponents; 1143 the value of a decimal fraction data item is m*(10**e). Bigfloats 1144 (tag number 5) use base-2 exponents; the value of a bigfloat data 1145 item is m*(2**e). The exponent e MUST be represented in an integer 1146 of major type 0 or 1, while the mantissa can also be a bignum 1147 (Section 3.4.3). Contained items with other structures are invalid. 1149 An example of a decimal fraction is that the number 273.15 could be 1150 represented as 0b110_00100 (major type 6 for tag, additional 1151 information 4 for the tag number), followed by 0b100_00010 (major 1152 type 4 for the array, additional information 2 for the length of the 1153 array), followed by 0b001_00001 (major type 1 for the first integer, 1154 additional information 1 for the value of -2), followed by 1155 0b000_11001 (major type 0 for the second integer, additional 1156 information 25 for a two-byte value), followed by 0b0110101010110011 1157 (27315 in two bytes). 
In hexadecimal:

1159  C4             -- Tag 4
1160     82          -- Array of length 2
1161        21       -- -2
1162        19 6ab3  -- 27315

1164  An example of a bigfloat is that the number 1.5 could be represented
1165  as 0b110_00101 (major type 6 for tag, additional information 5 for
1166  the tag number), followed by 0b100_00010 (major type 4 for the array,
1167  additional information 2 for the length of the array), followed by
1168  0b001_00000 (major type 1 for the first integer, additional
1169  information 0 for the value of -1), followed by 0b000_00011 (major
1170  type 0 for the second integer, additional information 3 for the value
1171  of 3). In hexadecimal:

1173  C5             -- Tag 5
1174     82          -- Array of length 2
1175        20       -- -1
1176        03       -- 3

1178  Decimal fractions and bigfloats provide no representation of
1179  Infinity, -Infinity, or NaN; if these are needed in place of a
1180  decimal fraction or bigfloat, the IEEE 754 half-precision
1181  representations from Section 3.3 can be used.

1183  3.4.5. Content Hints

1185  The tags in this section are for content hints that might be used by
1186  generic CBOR processors. These content hints do not extend the
1187  generic data model.

1189  3.4.5.1. Encoded CBOR Data Item

1191  Sometimes it is beneficial to carry an embedded CBOR data item that
1192  is not meant to be decoded immediately at the time the enclosing data
1193  item is being decoded. Tag number 24 (CBOR data item) can be used to
1194  tag the embedded byte string as a single data item encoded in CBOR
1195  format. Contained items that aren't byte strings are invalid. A
1196  contained byte string is valid if it encodes a well-formed CBOR data
1197  item; validity checking of the decoded CBOR item is not required for
1198  tag validity (but could be offered by a generic decoder as a special
1199  option).

1201  3.4.5.2. Expected Later Encoding for CBOR-to-JSON Converters

1203  Tag numbers 21 to 23 indicate that a byte string might require a
1204  specific encoding when interoperating with a text-based
1205  representation.
These tags are useful when an encoder knows that the 1206 byte string data it is writing is likely to be later converted to a 1207 particular JSON-based usage. That usage specifies that some strings 1208 are encoded as base64, base64url, and so on. The encoder uses byte 1209 strings instead of doing the encoding itself to reduce the message 1210 size, to reduce the code size of the encoder, or both. The encoder 1211 does not know whether or not the converter will be generic, and 1212 therefore wants to say what it believes is the proper way to convert 1213 binary strings to JSON. 1215 The data item tagged can be a byte string or any other data item. In 1216 the latter case, the tag applies to all of the byte string data items 1217 contained in the data item, except for those contained in a nested 1218 data item tagged with an expected conversion. 1220 These three tag numbers suggest conversions to three of the base data 1221 encodings defined in [RFC4648]. Tag number 21 suggests conversion to 1222 base64url encoding (Section 5 of RFC 4648), where padding is not used 1223 (see Section 3.2 of RFC 4648); that is, all trailing equals signs 1224 ("=") are removed from the encoded string. Tag number 22 suggests 1225 conversion to classical base64 encoding (Section 4 of RFC 4648), with 1226 padding as defined in RFC 4648. For both base64url and base64, 1227 padding bits are set to zero (see Section 3.5 of RFC 4648), and the 1228 conversion to alternate encoding is performed on the contents of the 1229 byte string (that is, without adding any line breaks, whitespace, or 1230 other additional characters). Tag number 23 suggests conversion to 1231 base16 (hex) encoding, with uppercase alphabetics (see Section 8 of 1232 RFC 4648). Note that, for all three tag numbers, the encoding of the 1233 empty byte string is the empty text string. 1235 3.4.5.3. 
Encoded Text 1237 Some text strings hold data that have formats widely used on the 1238 Internet, and sometimes those formats can be validated and presented 1239 to the application in appropriate form by the decoder. There are 1240 tags for some of these formats. 1242 * Tag number 32 is for URIs, as defined in [RFC3986]. If the text 1243 string doesn't match the "URI-reference" production, the string is 1244 invalid. 1246 * Tag numbers 33 and 34 are for base64url- and base64-encoded text 1247 strings, respectively, as defined in [RFC4648]. If any of: 1249 - the encoded text string contains non-alphabet characters or 1250 only 1 alphabet character in the last block of 4 (where 1251 alphabet is defined by Section 5 of [RFC4648] for tag number 33 1252 and Section 4 of [RFC4648] for tag number 34), or 1254 - the padding bits in a 2- or 3-character block are not 0, or 1256 - the base64 encoding has the wrong number of padding characters, 1257 or 1259 - the base64url encoding has padding characters, 1261 the string is invalid. 1263 * Tag number 36 is for MIME messages (including all headers), as 1264 defined in [RFC2045]. A text string that isn't a valid MIME 1265 message is invalid. (For this tag, validity checking may be 1266 particularly onerous for a generic decoder and might therefore not 1267 be offered. Note that many MIME messages are general binary data 1268 and can therefore not be represented in a text string; 1269 [IANA.cbor-tags] lists a registration for tag number 257 that is 1270 similar to tag number 36 but uses a byte string as its tag 1271 content.) 1273 Note that tag numbers 33 and 34 differ from 21 and 22 in that the 1274 data is transported in base-encoded form for the former and in raw 1275 byte string form for the latter. 1277 [RFC7049] also defined a tag number 35, for regular expressions that 1278 are in Perl Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] 1279 or in JavaScript regular expression syntax [ECMA262]. 
The state of 1280 the art in these regular expression specifications has since advanced 1281 and is continually advancing, so the present specification does not 1282 attempt to update the references to a snapshot that is current at the 1283 time of writing. Instead, this tag remains available (as registered 1284 in [RFC7049]) for applications that specify the particular regular 1285 expression variant they use out-of-band (possibly by limiting the 1286 usage to a defined common subset of both PCRE and ECMA262). As the 1287 present specification clarifies tag validity beyond [RFC7049], we 1288 note that due to the open way the tag was defined in [RFC7049], any 1289 contained string value needs to be valid at the CBOR tag level (but 1290 may then not be "expected" at the application level). 1292 3.4.6. Self-Described CBOR 1294 In many applications, it will be clear from the context that CBOR is 1295 being employed for encoding a data item. For instance, a specific 1296 protocol might specify the use of CBOR, or a media type is indicated 1297 that specifies its use. However, there may be applications where 1298 such context information is not available, such as when CBOR data is 1299 stored in a file that does not have disambiguating metadata. Here, 1300 it may help to have some distinguishing characteristics for the data 1301 itself. 1303 Tag number 55799 is defined for this purpose, specifically for use at 1304 the start of a stored encoded CBOR data item as specified by an 1305 application. It does not impart any special semantics on the data 1306 item that it encloses; that is, the semantics of the tag content 1307 enclosed in tag number 55799 is exactly identical to the semantics of 1308 the tag content itself. 1310 The serialization of this tag's head is 0xd9d9f7, which does not 1311 appear to be in use as a distinguishing mark for any frequently used 1312 file types. 
In particular, 0xd9d9f7 is not a valid start of a 1313 Unicode text in any Unicode encoding if it is followed by a valid 1314 CBOR data item. 1316 For instance, a decoder might be able to decode both CBOR and JSON. 1317 Such a decoder would need to mechanically distinguish the two 1318 formats. An easy way for an encoder to help the decoder would be to 1319 tag the entire CBOR item with tag number 55799, the serialization of 1320 which will never be found at the beginning of a JSON text. 1322 4. Serialization Considerations 1324 4.1. Preferred Serialization 1326 For some values at the data model level, CBOR provides multiple 1327 serializations. For many applications, it is desirable that an 1328 encoder always chooses a preferred serialization (preferred 1329 encoding); however, the present specification does not put the burden 1330 of enforcing this preference on either encoder or decoder. 1332 Some constrained decoders may be limited in their ability to decode 1333 non-preferred serializations: For example, if only integers below 1334 1_000_000_000 (one billion) are expected in an application, the 1335 decoder may leave out the code that would be needed to decode 64-bit 1336 arguments in integers. An encoder that always uses preferred 1337 serialization ("preferred encoder") interoperates with this decoder 1338 for the numbers that can occur in this application. More generally 1339 speaking, it therefore can be said that a preferred encoder is more 1340 universally interoperable (and also less wasteful) than one that, 1341 say, always uses 64-bit integers. 
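As an illustration of what a "preferred encoder" does, the following is a minimal Python sketch, not a definitive implementation. The function names (encode_head, encode_int, encode_float) are hypothetical and chosen for this example only. The integer part picks the shortest argument form; the floating-point part follows the preferred-serialization rule stated later in this section (shortest encoding that preserves the value), using a test conversion from 64 bits down to 32 and then 16 bits. The sketch assumes that an application is satisfied with the single NaN encoding 0xf97e00.

```python
import math
import struct


def encode_head(major_type: int, argument: int) -> bytes:
    """Encode a CBOR head using the shortest possible argument form."""
    if argument < 0 or argument >= 2**64:
        raise ValueError("argument out of range for a CBOR head")
    initial = major_type << 5
    if argument < 24:
        return bytes([initial | argument])      # argument fits in the initial byte
    if argument < 2**8:
        return bytes([initial | 24, argument])  # one additional byte
    if argument < 2**16:
        return bytes([initial | 25]) + struct.pack(">H", argument)
    if argument < 2**32:
        return bytes([initial | 26]) + struct.pack(">I", argument)
    return bytes([initial | 27]) + struct.pack(">Q", argument)


def encode_int(value: int) -> bytes:
    """Major type 0 for non-negative integers, major type 1 (argument -1 - n) otherwise."""
    if value >= 0:
        return encode_head(0, value)
    return encode_head(1, -1 - value)


def _round_trips(fmt: str, value: float) -> bool:
    """True if packing into the narrower format and unpacking yields the same value."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, value))[0] == value
    except OverflowError:
        # The value is too large for the narrower format.
        return False


def encode_float(value: float) -> bytes:
    """Shortest of binary16/32/64 that preserves the value."""
    if math.isnan(value):
        # Assumption of this sketch: NaN payloads are not needed.
        return bytes.fromhex("f97e00")
    if not _round_trips(">f", value):
        return b"\xfb" + struct.pack(">d", value)
    if not _round_trips(">e", value):
        return b"\xfa" + struct.pack(">f", value)
    return b"\xf9" + struct.pack(">e", value)
```

With this sketch, one billion is encoded with a 32-bit argument (0x1a3b9aca00), so a constrained decoder that omits 64-bit argument support can still process it.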
1343 Similarly, a constrained encoder may be limited in the variety of 1344 representation variants it supports in such a way that it does not 1345 emit preferred serializations ("variant encoder"): Say, it could be 1346 designed to always use the 32-bit variant for an integer that it 1347 encodes even if a short representation is available (again, assuming 1348 that there is no application need for integers that can only be 1349 represented with the 64-bit variant). A decoder that does not rely 1350 on only ever receiving preferred serializations ("variation-tolerant 1351 decoder") can therefore be said to be more universally interoperable 1352 (it might very well optimize for the case of receiving preferred 1353 serializations, though). Full implementations of CBOR decoders are 1354 by definition variation-tolerant; the distinction is only relevant if 1355 a constrained implementation of a CBOR decoder meets a variant 1356 encoder. 1358 The preferred serialization always uses the shortest form of 1359 representing the argument (Section 3); it also uses the shortest 1360 floating-point encoding that preserves the value being encoded. 1362 The preferred serialization for a floating-point value is the 1363 shortest floating-point encoding that preserves its value, e.g., 1364 0xf94580 for the number 5.5, and 0xfa45ad9c00 for the number 5555.5. 1365 For NaN values, a shorter encoding is preferred if zero-padding the 1366 shorter significand towards the right reconstitutes the original NaN 1367 value (for many applications, the single NaN encoding 0xf97e00 will 1368 suffice). 1370 Definite length encoding is preferred whenever the length is known at 1371 the time the serialization of the item starts. 1373 4.2. Deterministically Encoded CBOR 1375 Some protocols may want encoders to only emit CBOR in a particular 1376 deterministic format; those protocols might also have the decoders 1377 check that their input is in that deterministic format. 
Those 1378 protocols are free to define what they mean by a "deterministic 1379 format" and what encoders and decoders are expected to do. This 1380 section defines a set of restrictions that can serve as the base of 1381 such a deterministic format. 1383 4.2.1. Core Deterministic Encoding Requirements 1385 A CBOR encoding satisfies the "core deterministic encoding 1386 requirements" if it satisfies the following restrictions: 1388 * Preferred serialization MUST be used. In particular, this means 1389 that arguments (see Section 3) for integers, lengths in major 1390 types 2 through 5, and tags MUST be as short as possible, for 1391 instance: 1393 - 0 to 23 and -1 to -24 MUST be expressed in the same byte as the 1394 major type; 1396 - 24 to 255 and -25 to -256 MUST be expressed only with an 1397 additional uint8_t; 1399 - 256 to 65535 and -257 to -65536 MUST be expressed only with an 1400 additional uint16_t; 1402 - 65536 to 4294967295 and -65537 to -4294967296 MUST be expressed 1403 only with an additional uint32_t. 1405 Floating-point values also MUST use the shortest form that 1406 preserves the value, e.g. 1.5 is encoded as 0xf93e00 (binary16) 1407 and 1000000.5 as 0xfa49742408 (binary32). (One implementation of 1408 this is to have all floats start as a 64-bit float, then do a test 1409 conversion to a 32-bit float; if the result is the same numeric 1410 value, use the shorter form and repeat the process with a test 1411 conversion to a 16-bit float. This also works to select 16-bit 1412 float for positive and negative Infinity as well.) 1414 * Indefinite-length items MUST NOT appear. They can be encoded as 1415 definite-length items instead. 1417 * The keys in every map MUST be sorted in the bytewise lexicographic 1418 order of their deterministic encodings. For example, the 1419 following keys are sorted correctly: 1421 1. 10, encoded as 0x0a. 1423 2. 100, encoded as 0x1864. 1425 3. -1, encoded as 0x20. 1427 4. "z", encoded as 0x617a. 1429 5. 
"aa", encoded as 0x626161. 1431 6. [100], encoded as 0x811864. 1433 7. [-1], encoded as 0x8120. 1435 8. false, encoded as 0xf4. 1437 (Implementation note: the self-delimiting nature of the CBOR 1438 encoding means that there are no two well-formed CBOR encoded data 1439 items where one is a prefix of the other. The bytewise 1440 lexicographic comparison of deterministic encodings of different 1441 map keys therefore always ends in a position where the byte 1442 differs between the keys, before the end of a key is reached.) 1444 4.2.2. Additional Deterministic Encoding Considerations 1446 CBOR tags present additional considerations for deterministic 1447 encoding. If a CBOR-based protocol were to provide the same 1448 semantics for the presence and absence of a specific tag (e.g., by 1449 allowing both tag 1 data items and raw numbers in a date/time 1450 position, treating the latter as if they were tagged), the 1451 deterministic format would not allow the presence of the tag, based 1452 on the "shortest form" principle. For example, a protocol might give 1453 encoders the choice of representing a URL as either a text string or, 1454 using Section 3.4.5.3, tag number 32 containing a text string. This 1455 protocol's deterministic encoding needs to either require that the 1456 tag is present or require that it is absent, not allow either one. 1458 In a protocol that does require tags in certain places to obtain 1459 specific semantics, the tag needs to appear in the deterministic 1460 format as well. Deterministic encoding considerations also apply to 1461 the content of tags. 1463 If a protocol includes a field that can express integers with an 1464 absolute value of 2^64 or larger using tag numbers 2 or 3 1465 (Section 3.4.3), the protocol's deterministic encoding needs to 1466 specify whether smaller integers are also expressed using these tags 1467 or using major types 0 and 1. Preferred serialization uses the 1468 latter choice, which is therefore recommended. 
1470 Protocols that include floating-point values, whether represented 1471 using basic floating-point values (Section 3.3) or using tags (or 1472 both), may need to define extra requirements on their deterministic 1473 encodings, such as: 1475 * Although IEEE floating-point values can represent both positive 1476 and negative zero as distinct values, the application might not 1477 distinguish these and might decide to represent all zero values 1478 with a positive sign, disallowing negative zero. (The application 1479 may also want to restrict the precision of floating-point values 1480 in such a way that there is never a need to represent 64-bit -- or 1481 even 32-bit -- floating-point values.) 1483 * If a protocol includes a field that can express floating-point 1484 values, with a specific data model that declares integer and 1485 floating-point values to be interchangeable, the protocol's 1486 deterministic encoding needs to specify whether (for example) the 1487 integer 1.0 is encoded as 0x01 (unsigned integer), 0xf93c00 1488 (binary16), 0xfa3f800000 (binary32), or 0xfb3ff0000000000000 1489 (binary64). Example rules for this are: 1491 1. Encode integral values that fit in 64 bits as values from 1492 major types 0 and 1, and other values as the preferred 1493 (smallest of 16-, 32-, or 64-bit) floating-point 1494 representation that accurately represents the value, 1496 2. Encode all values as the preferred floating-point 1497 representation that accurately represents the value, even for 1498 integral values, or 1500 3. Encode all values as 64-bit floating-point representations. 1502 Rule 1 straddles the boundaries between integers and floating- 1503 point values, and Rule 3 does not use preferred serialization, so 1504 Rule 2 may be a good choice in many cases. 1506 * If NaN is an allowed value and there is no intent to support NaN 1507 payloads or signaling NaNs, the protocol needs to pick a single 1508 representation, typically 0xf97e00. 
If that simple choice is not 1509 possible, specific attention will be needed for NaN handling. 1511 * Subnormal numbers (nonzero numbers with the lowest possible 1512 exponent of a given IEEE 754 number format) may be flushed to zero 1513 outputs or be treated as zero inputs in some floating-point 1514 implementations. A protocol's deterministic encoding may want to 1515 specifically accommodate such implementations while creating an 1516 onus on other implementations, by excluding subnormal numbers from 1517 interchange, interchanging zero instead. 1519 * The same number can be represented by different decimal fractions, 1520 by different bigfloats, and by different forms under other tags 1521 that may be defined to express numeric values. Depending on the 1522 implementation, it may not always be practical to determine 1523 whether any of these forms (or forms in the basic generic data 1524 model) are equivalent. An application protocol that presents 1525 choices of this kind for the representation format of numbers 1526 needs to be explicit in how the formats are to be chosen for 1527 deterministic encoding. 1529 4.2.3. Length-first Map Key Ordering 1531 The core deterministic encoding requirements (Section 4.2.1) sort map 1532 keys in a different order from the one suggested by Section 3.9 of 1533 [RFC7049] (called "Canonical CBOR" there). Protocols that need to be 1534 compatible with [RFC7049]'s order can instead be specified in terms 1535 of this specification's "length-first core deterministic encoding 1536 requirements": 1538 A CBOR encoding satisfies the "length-first core deterministic 1539 encoding requirements" if it satisfies the core deterministic 1540 encoding requirements except that the keys in every map MUST be 1541 sorted such that: 1543 1. If two keys have different lengths, the shorter one sorts 1544 earlier; 1546 2. If two keys have the same length, the one with the lower value in 1547 (byte-wise) lexical order sorts earlier. 
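The difference between the bytewise lexicographic ordering of Section 4.2.1 and the length-first ordering defined here can be sketched in a few lines of Python. This is an illustration under the assumption that the keys are already available in their deterministic encodings; the function names are hypothetical. The hexadecimal strings are the example key encodings given in this document.

```python
# Deterministic encodings of the example keys:
# 10, 100, -1, "z", "aa", [100], [-1], false
ENCODED_KEYS = [
    bytes.fromhex(h)
    for h in ("0a", "1864", "20", "617a", "626161", "811864", "8120", "f4")
]


def bytewise_order(encoded_keys):
    """Core deterministic ordering: plain bytewise lexicographic comparison."""
    # Python compares bytes objects lexicographically by unsigned byte value.
    return sorted(encoded_keys)


def length_first_order(encoded_keys):
    """Length-first ordering: shorter encodings first, ties broken bytewise."""
    return sorted(encoded_keys, key=lambda k: (len(k), k))


if __name__ == "__main__":
    print([k.hex() for k in bytewise_order(ENCODED_KEYS)])
    print([k.hex() for k in length_first_order(ENCODED_KEYS)])
```

Because both functions operate on encoded bytes, neither needs to understand the data model types of the keys.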
1549 For example, under the length-first core deterministic encoding 1550 requirements, the following keys are sorted correctly: 1552 1. 10, encoded as 0x0a. 1554 2. -1, encoded as 0x20. 1556 3. false, encoded as 0xf4. 1558 4. 100, encoded as 0x1864. 1560 5. "z", encoded as 0x617a. 1562 6. [-1], encoded as 0x8120. 1564 7. "aa", encoded as 0x626161. 1566 8. [100], encoded as 0x811864. 1568 (Although [RFC7049] used the term "Canonical CBOR" for its form of 1569 requirements on deterministic encoding, this document avoids this 1570 term because "canonicalization" is often associated with specific 1571 uses of deterministic encoding only. The terms are essentially 1572 interchangeable, however, and the set of core requirements in this 1573 document could also be called "Canonical CBOR", while the length- 1574 first-ordered version of that could be called "Old Canonical CBOR".) 1576 5. Creating CBOR-Based Protocols 1578 Data formats such as CBOR are often used in environments where there 1579 is no format negotiation. A specific design goal of CBOR is to not 1580 need any included or assumed schema: a decoder can take a CBOR item 1581 and decode it with no other knowledge. 1583 Of course, in real-world implementations, the encoder and the decoder 1584 will have a shared view of what should be in a CBOR data item. For 1585 example, an agreed-to format might be "the item is an array whose 1586 first value is a UTF-8 string, second value is an integer, and 1587 subsequent values are zero or more floating-point numbers" or "the 1588 item is a map that has byte strings for keys and contains a pair 1589 whose key is 0xab01". 1591 CBOR-based protocols MUST specify how their decoders handle invalid 1592 and other unexpected data. CBOR-based protocols MAY specify that 1593 they treat arbitrary valid data as unexpected. Encoders for CBOR- 1594 based protocols MUST produce only valid items, that is, the protocol 1595 cannot be designed to make use of invalid items. 
An encoder can be 1596 capable of encoding as many or as few types of values as is required 1597 by the protocol in which it is used; a decoder can be capable of 1598 understanding as many or as few types of values as is required by the 1599 protocols in which it is used. This lack of restrictions allows CBOR 1600 to be used in extremely constrained environments. 1602 The rest of this section discusses some considerations in creating 1603 CBOR-based protocols. With few exceptions, it is advisory only and 1604 explicitly excludes any language from BCP 14 other than words that 1605 could be interpreted as "MAY" in the sense of BCP 14. The exceptions 1606 aim at facilitating interoperability of CBOR-based protocols while 1607 making use of a wide variety of both generic and application-specific 1608 encoders and decoders. 1610 5.1. CBOR in Streaming Applications 1612 In a streaming application, a data stream may be composed of a 1613 sequence of CBOR data items concatenated back-to-back. In such an 1614 environment, the decoder immediately begins decoding a new data item 1615 if data is found after the end of a previous data item. 1617 Not all of the bytes making up a data item may be immediately 1618 available to the decoder; some decoders will buffer additional data 1619 until a complete data item can be presented to the application. 1620 Other decoders can present partial information about a top-level data 1621 item to an application, such as the nested data items that could 1622 already be decoded, or even parts of a byte string that hasn't 1623 completely arrived yet. Such an application also MUST have a matching 1624 streaming security mechanism, where the desired protection is 1625 available for incremental data presented to the application. 1627 Note that some applications and protocols will not want to use 1628 indefinite-length encoding.
Using indefinite-length encoding allows 1629 an encoder to not need to marshal all the data for counting, but it 1630 requires a decoder to allocate increasing amounts of memory while 1631 waiting for the end of the item. This might be fine for some 1632 applications but not others. 1634 5.2. Generic Encoders and Decoders 1636 A generic CBOR decoder can decode all well-formed encoded CBOR data 1637 items and present the data items to an application. See Appendix C. 1638 (The diagnostic notation, Section 8, may be used to present well- 1639 formed CBOR values to humans.) 1641 Generic CBOR encoders provide an application interface that allows 1642 the application to specify any well-formed value to be encoded as a 1643 CBOR data item, including simple values and tags unknown to the 1644 encoder. 1646 Even though CBOR attempts to minimize these cases, not all well- 1647 formed CBOR data is valid: for example, the encoded text string 1648 "0x62c0ae" does not contain valid UTF-8 (because [RFC3629] requires 1649 always using the shortest form) and so is not a valid CBOR item. 1650 Also, specific tags may make semantic constraints that may be 1651 violated, for instance by a bignum tag enclosing another tag, or by 1652 an instance of tag number 0 containing a byte string, or containing a 1653 text string with contents that do not match [RFC3339]'s "date-time" 1654 production. There is no requirement that generic encoders and 1655 decoders make unnatural choices for their application interface to 1656 enable the processing of invalid data. Generic encoders and decoders 1657 are expected to forward simple values and tags even if their specific 1658 codepoints are not registered at the time the encoder/decoder is 1659 written (Section 5.4). 1661 5.3. Validity of Items 1663 A well-formed but invalid CBOR data item (Section 1.2) presents a 1664 problem with interpreting the data encoded in it in the CBOR data 1665 model. 
A CBOR-based protocol could be specified in several layers, 1666 in which the lower layers don't process the semantics of some of the 1667 CBOR data they forward. These layers can't notice any validity 1668 errors in data they don't process and MUST forward that data as-is. 1669 The first layer that does process the semantics of an invalid CBOR 1670 item MUST take one of two choices: 1672 1. Replace the problematic item with an error marker and continue 1673 with the next item, or 1675 2. Issue an error and stop processing altogether. 1677 A CBOR-based protocol MUST specify which of these options its 1678 decoders take, for each kind of invalid item they might encounter. 1680 Such problems might occur at the basic validity level of CBOR or in 1681 the context of tags (tag validity). 1683 5.3.1. Basic validity 1685 Two kinds of validity errors can occur in the basic generic data 1686 model: 1688 Duplicate keys in a map: Generic decoders (Section 5.2) make data 1689 available to applications using the native CBOR data model. That 1690 data model includes maps (key-value mappings with unique keys), 1691 not multimaps (key-value mappings where multiple entries can have 1692 the same key). Thus, a generic decoder that gets a CBOR map item 1693 that has duplicate keys will decode to a map with only one 1694 instance of that key, or it might stop processing altogether. On 1695 the other hand, a "streaming decoder" may not even be able to 1696 notice. See Section 5.6 for more discussion of keys in maps. 1698 Invalid UTF-8 string: A decoder might or might not want to verify 1699 that the sequence of bytes in a UTF-8 string (major type 3) is 1700 actually valid UTF-8 and react appropriately. 1702 5.3.2. 
Tag validity 1704 Two additional kinds of validity errors are introduced by adding tags 1705 to the basic generic data model: 1707 Inadmissible type for tag content: Tag numbers (Section 3.4) specify 1708 what type of data item is supposed to be used as their tag 1709 content; for example, the tag numbers for positive or negative 1710 bignums are supposed to be put on byte strings. A decoder that 1711 decodes the tagged data item into a native representation (a 1712 native big integer in this example) is expected to check the type 1713 of the data item being tagged. Even decoders that don't have such 1714 native representations available in their environment may perform 1715 the check on those tags known to them and react appropriately. 1717 Inadmissible value for tag content: The type of data item may be 1718 admissible for a tag's content, but the specific value may not be; 1719 e.g., a value of "yesterday" is not acceptable for the content of 1720 tag 0, even though it properly is a text string. A decoder that 1721 normally ingests such tags into equivalent platform types might 1722 present this tag to the application in a similar way to how it 1723 would present a tag with an unknown tag number (Section 5.4). 1725 5.4. Validity and Evolution 1727 A decoder with validity checking will expend the effort to reliably 1728 detect data items with validity errors. For example, such a decoder 1729 needs to have an API that reports an error (and does not return data) 1730 for a CBOR data item that contains any of the validity errors listed 1731 in the previous subsection. 1733 The set of tags defined in the tag registry (Section 9.2), as well as 1734 the set of simple values defined in the simple values registry 1735 (Section 9.1), can grow at any time beyond the set understood by a 1736 generic decoder. 
A validity-checking decoder can do one of two 1737 things when it encounters such a case that it does not recognize: 1739 * It can report an error (and not return data). Note that treating 1740 this case as an error can cause ossification, and is thus not 1741 encouraged. This error is not a validity error per se. This kind 1742 of error is more likely to be raised by a decoder that would be 1743 performing validity checking if this were a known case. 1745 * It can emit the unknown item (type, value, and, for tags, the 1746 decoded tagged data item) to the application calling the decoder, 1747 with an indication that the decoder did not recognize that tag 1748 number or simple value. 1750 The latter approach, which is also appropriate for decoders that do 1751 not support validity checking, provides forward compatibility with 1752 newly registered tags and simple values without the requirement to 1753 update the encoder at the same time as the calling application. (For 1754 this, the API for the decoder needs to have a way to mark unknown 1755 items so that the calling application can handle them in a manner 1756 appropriate for the program.) 1758 Since some of the processing needed for validity checking may have an 1759 appreciable cost (in particular with duplicate detection for maps), 1760 support of validity checking is not a requirement placed on all CBOR 1761 decoders. 1763 Some encoders will rely on their applications to provide input data 1764 in such a way that valid CBOR results from the encoder. A generic 1765 encoder may also want to provide a validity-checking mode where it 1766 reliably limits its output to valid CBOR, independent of whether or 1767 not its application is indeed providing API-conformant data. 1769 5.5. Numbers 1771 CBOR-based protocols should take into account that different language 1772 environments pose different restrictions on the range and precision 1773 of numbers that are representable. 
For example, the basic JavaScript 1774 number system treats all numbers as floating-point values, which may 1775 result in silent loss of precision in decoding integers with more 1776 than 53 significant bits. Another example is that, since CBOR keeps 1777 the sign bit for its integer representation in the major type, it has 1778 one bit more for signed numbers of a certain length (e.g., 1779 -2**64..2**64-1 for 1+8-byte integers) than the typical platform 1780 signed integer representation of the same length (-2**63..2**63-1 for 1781 8-byte int64_t). A protocol that uses numbers should define its 1782 expectations on the handling of non-trivial numbers in decoders and 1783 receiving applications. 1785 A CBOR-based protocol that includes floating-point numbers can 1786 restrict which of the three formats (half-precision, single- 1787 precision, and double-precision) are to be supported. For an 1788 integer-only application, a protocol may want to completely exclude 1789 the use of floating-point values. 1791 A CBOR-based protocol designed for compactness may want to exclude 1792 specific integer encodings that are longer than necessary for the 1793 application, such as to save the need to implement 64-bit integers. 1794 There is an expectation that encoders will use the most compact 1795 integer representation that can represent a given value. However, a 1796 compact application that does not require deterministic encoding 1797 should accept values that use a longer-than-needed encoding (such as 1798 encoding "0" as 0b000_11001 followed by two bytes of 0x00) as long as 1799 the application can decode an integer of the given size. Similar 1800 considerations apply to floating-point values; decoding both 1801 preferred serializations and longer-than-needed ones is recommended. 
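A decoder that tolerates longer-than-needed integer encodings, as recommended above, might look like the following Python sketch. The function name decode_uint and its return convention (value, number of bytes consumed) are assumptions of this example; only major type 0 is handled.

```python
from typing import Tuple

# Number of argument bytes that follow the initial byte,
# keyed by the additional information value.
_ARGUMENT_SIZES = {24: 1, 25: 2, 26: 4, 27: 8}


def decode_uint(data: bytes) -> Tuple[int, int]:
    """Decode an unsigned integer (major type 0); return (value, bytes consumed)."""
    if not data:
        raise ValueError("no data")
    initial = data[0]
    if initial >> 5 != 0:
        raise ValueError("not an unsigned integer")
    additional = initial & 0x1F
    if additional < 24:
        return additional, 1
    if additional not in _ARGUMENT_SIZES:
        raise ValueError("not well-formed for major type 0")
    size = _ARGUMENT_SIZES[additional]
    if len(data) < 1 + size:
        raise ValueError("truncated data item")
    # No check that the argument uses the shortest form: non-preferred
    # serializations are accepted on purpose.
    return int.from_bytes(data[1:1 + size], "big"), 1 + size
```

For example, decode_uint(bytes([0b000_11001, 0x00, 0x00])) accepts the three-byte encoding of 0 mentioned above. A constrained decoder that does not implement 64-bit integers could drop the entry for additional information 27 from the table and report such items as unsupported.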
1803 CBOR-based protocols for constrained applications that provide a 1804 choice between representing a specific number as an integer and as a 1805 decimal fraction or bigfloat (such as when the exponent is small and 1806 non-negative), might express a quality-of-implementation expectation 1807 that the integer representation is used directly. 1809 5.6. Specifying Keys for Maps 1811 The encoding and decoding applications need to agree on what types of 1812 keys are going to be used in maps. In applications that need to 1813 interwork with JSON-based applications, conversion is simplified by 1814 limiting keys to text strings only; otherwise, there has to be a 1815 specified mapping from the other CBOR types to text strings, and this 1816 often leads to implementation errors. In applications where keys are 1817 numeric in nature and numeric ordering of keys is important to the 1818 application, directly using the numbers for the keys is useful. 1820 If multiple types of keys are to be used, consideration should be 1821 given to how these types would be represented in the specific 1822 programming environments that are to be used. For example, in 1823 JavaScript Maps [ECMA262], a key of integer 1 cannot be distinguished 1824 from a key of floating-point 1.0. This means that, if integer keys 1825 are used, the protocol needs to avoid use of floating-point keys the 1826 values of which happen to be integer numbers in the same map. 1828 Decoders that deliver data items nested within a CBOR data item 1829 immediately on decoding them ("streaming decoders") often do not keep 1830 the state that is necessary to ascertain uniqueness of a key in a 1831 map. Similarly, an encoder that can start encoding data items before 1832 the enclosing data item is completely available ("streaming encoder") 1833 may want to reduce its overhead significantly by relying on its data 1834 source to maintain uniqueness. 
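The key-representation concern described above for JavaScript Maps exists in other environments as well. As a hedged illustration (Python is used here only as an example environment; the helper names are hypothetical), a Python dict also treats the integer 1 and the floating-point value 1.0 as the same key, so a decoder that loads a CBOR map directly into a dict silently merges the two entries. Keeping the type alongside the value is one possible way to keep them apart.

```python
# Two CBOR map entries whose keys are distinct in the generic data model.
PAIRS = [(1, "integer key"), (1.0, "floating-point key")]


def load_naively(pairs):
    """Load key/value pairs into a plain dict; 1 and 1.0 collide."""
    return dict(pairs)


def load_with_types(pairs):
    """Keep the Python type name next to each key so that 1 and 1.0 stay distinct."""
    return {(type(key).__name__, key): value for key, value in pairs}


if __name__ == "__main__":
    print(len(load_naively(PAIRS)))      # one entry survives
    print(len(load_with_types(PAIRS)))   # both entries are kept
```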
1836 A CBOR-based protocol MUST define what to do when a receiving 1837 application does see multiple identical keys in a map. The resulting 1838 rule in the protocol MUST respect the CBOR data model: it cannot 1839 prescribe a specific handling of the entries with the identical keys, 1840 except that it might have a rule that having identical keys in a map 1841 indicates a malformed map and that the decoder has to stop with an 1842 error. When processing maps that exhibit entries with duplicate 1843 keys, a generic decoder might do one of the following: 1845 * Not accept maps with duplicate keys (that is, enforce validity for 1846 maps, see also Section 5.4). These generic decoders are 1847 universally useful. An application may still need to perform 1848 its own duplicate checking based on application rules (for 1849 instance if the application equates integers and floating-point 1850 values in map key positions for specific maps). 1852 * Pass all map entries to the application, including ones with 1853 duplicate keys. This requires the application to handle (check 1854 against) duplicate keys, even if the application rules are 1855 identical to the generic data model rules. 1857 * Lose some entries with duplicate keys, e.g. by only delivering the 1858 final (or first) entry out of the entries with the same key. With 1859 such a generic decoder, applications may get different results for 1860 a specific key on different runs and with different generic 1861 decoders, because which value is returned depends on the generic decoder 1862 implementation and the actual order of keys in the map. In 1863 particular, applications cannot validate key uniqueness on their 1864 own as they do not necessarily see all entries; they may not be 1865 able to use such a generic decoder if they do need to validate key 1866 uniqueness.
These generic decoders can only be used in situations 1867 where the data source and transfer can be relied upon to always 1868 provide valid maps; this is not possible if the data source and 1869 transfer can be attacked. 1871 Generic decoders need to document which of these three approaches 1872 they implement. 1874 The CBOR data model for maps does not allow ascribing semantics to 1875 the order of the key/value pairs in the map representation. Thus, a 1876 CBOR-based protocol MUST NOT specify that changing the key/value pair 1877 order in a map would change the semantics, except to specify that 1878 some orders are disallowed, for example where they would not meet the 1879 requirements of a deterministic encoding (Section 4.2). (Any 1880 secondary effects of map ordering such as on timing, cache usage, and 1881 other potential side channels are not considered part of the 1882 semantics but may be enough reason on their own for a protocol to 1883 require a deterministic encoding format.) 1885 Applications for constrained devices that have maps where a small 1886 number of frequently used keys can be identified should consider 1887 using small integers as keys; for instance, a set of 24 or fewer 1888 frequent keys can be encoded in a single byte as unsigned integers, 1889 up to 48 if negative integers are also used. Less frequently 1890 occurring keys can then use integers with longer encodings. 1892 5.6.1. Equivalence of Keys 1894 The specific data model applying to a CBOR data item is used to 1895 determine whether keys occurring in maps are duplicates or distinct. 1897 At the generic data model level, numerically equivalent integer and 1898 floating-point values are distinct from each other, as they are from 1899 the various big numbers (Tags 2 to 5). Similarly, text strings are 1900 distinct from byte strings, even if composed of the same bytes. A 1901 tagged value is distinct from an untagged value or from a value 1902 tagged with a different tag number. 
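The first of the three decoder approaches listed in Section 5.6 (not accepting maps with duplicate keys), combined with the generic data model distinctions just described, can be sketched as follows. This is a simplified Python illustration, not a complete implementation: the names generic_key and check_unique are hypothetical, only integers, floating-point values, text strings, byte strings, and booleans are handled as keys, and NaN keys are rejected rather than compared. The sketch relies on -0.0 and 0.0 being numerically equal, as this section specifies.

```python
def generic_key(key):
    """Return an identity for a map key at the generic data model level."""
    if isinstance(key, bool):
        # bool is a subclass of int in Python; keep simple values separate.
        return ("simple", key)
    if isinstance(key, int):
        return ("int", key)
    if isinstance(key, float):
        if key != key:
            raise ValueError("NaN keys are not handled in this sketch")
        return ("float", key)  # -0.0 and 0.0 compare (and hash) equal
    if isinstance(key, str):
        return ("text", key)
    if isinstance(key, bytes):
        return ("bytes", key)
    raise TypeError("unsupported key type in this sketch")


def check_unique(pairs):
    """Return the pairs unchanged, or raise ValueError on a duplicate key."""
    seen = set()
    result = []
    for key, value in pairs:
        identity = generic_key(key)
        if identity in seen:
            raise ValueError(f"duplicate map key: {key!r}")
        seen.add(identity)
        result.append((key, value))
    return result
```

In this sketch, the keys 1, 1.0, "1", and b"1" are all distinct, while 0.0 and -0.0 are reported as duplicates. An application whose specific data model equates integers and floating-point values would still need its own check on top of this.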
1904 Within each of these groups, numeric values are distinct unless they 1905 are numerically equal (specifically, -0.0 is equal to 0.0); for the 1906 purpose of map key equivalence, NaN (not a number) values are 1907 equivalent if they have the same significand after zero-extending 1908 both significands at the right to 64 bits. 1910 (Byte and text) strings are compared byte by byte, arrays element by 1911 element, and are equal if they have the same number of bytes/elements 1912 and the same values at the same positions. Two maps are equal if 1913 they have the same set of pairs regardless of their order; pairs are 1914 equal if both the key and value are equal. 1916 Tagged values are equal if both the tag number and the tag content 1917 are equal. (Note that a generic decoder that provides processing for 1918 a specific tag may not be able to distinguish some semantically 1919 equivalent values, e.g. if leading zeroes occur in the content of tag 1920 2/3 (Section 3.4.3).) Simple values are equal if they simply have 1921 the same value. Nothing else is equal in the generic data model; a 1922 simple value 2 is not equivalent to an integer 2 and an array is 1923 never equivalent to a map. 1925 As discussed in Section 2.2, specific data models can make values 1926 equivalent for the purpose of comparing map keys that are distinct in 1927 the generic data model. Note that this implies that a generic 1928 decoder may deliver a decoded map to an application that needs to be 1929 checked for duplicate map keys by that application (alternatively, 1930 the decoder may provide a programming interface to perform this 1931 service for the application). Specific data models are not able to 1932 distinguish values for map keys that are equal for this purpose at 1933 the generic data model level. 1935 5.7. 
Undefined Values 1937 In some CBOR-based protocols, the simple value (Section 3.3) of 1938 Undefined might be used by an encoder as a substitute for a data item 1939 with an encoding problem, in order to allow the rest of the enclosing 1940 data items to be encoded without harm. 1942 6. Converting Data between CBOR and JSON 1944 This section gives non-normative advice about converting between CBOR 1945 and JSON. Implementations of converters MAY use whichever advice 1946 here they want. 1948 It is worth noting that a JSON text is a sequence of characters, not 1949 an encoded sequence of bytes, while a CBOR data item consists of 1950 bytes, not characters. 1952 6.1. Converting from CBOR to JSON 1954 Most of the types in CBOR have direct analogs in JSON. However, some 1955 do not, and someone implementing a CBOR-to-JSON converter has to 1956 consider what to do in those cases. The following non-normative 1957 advice deals with these by converting them to a single substitute 1958 value, such as a JSON null. 1960 * An integer (major type 0 or 1) becomes a JSON number. 1962 * A byte string (major type 2) that is not embedded in a tag that 1963 specifies a proposed encoding is encoded in base64url without 1964 padding and becomes a JSON string. 1966 * A UTF-8 string (major type 3) becomes a JSON string. Note that 1967 JSON requires escaping certain characters ([RFC8259], Section 7): 1968 quotation mark (U+0022), reverse solidus (U+005C), and the "C0 1969 control characters" (U+0000 through U+001F). All other characters 1970 are copied unchanged into the JSON UTF-8 string. 1972 * An array (major type 4) becomes a JSON array. 1974 * A map (major type 5) becomes a JSON object. This is possible 1975 directly only if all keys are UTF-8 strings. A converter might 1976 also convert other keys into UTF-8 strings (such as by converting 1977 integers into strings containing their decimal representation); 1978 however, doing so introduces a danger of key collision. 
Note also 1979 that, if tags on UTF-8 strings are ignored as proposed below, this 1980 will cause a key collision if the tags are different but the 1981 strings are the same. 1983 * False (major type 7, additional information 20) becomes a JSON 1984 false. 1986 * True (major type 7, additional information 21) becomes a JSON 1987 true. 1989 * Null (major type 7, additional information 22) becomes a JSON 1990 null. 1992 * A floating-point value (major type 7, additional information 25 1993 through 27) becomes a JSON number if it is finite (that is, it can 1994 be represented in a JSON number); if the value is non-finite (NaN, 1995 or positive or negative Infinity), it is represented by the 1996 substitute value. 1998 * Any other simple value (major type 7, any additional information 1999 value not yet discussed) is represented by the substitute value. 2001 * A bignum (major type 6, tag number 2 or 3) is represented by 2002 encoding its byte string in base64url without padding and becomes 2003 a JSON string. For tag number 3 (negative bignum), a "~" (ASCII 2004 tilde) is inserted before the base-encoded value. (The conversion 2005 to a binary blob instead of a number is to prevent a likely 2006 numeric overflow for the JSON decoder.) 2008 * A byte string with an encoding hint (major type 6, tag number 21 2009 through 23) is encoded as described by the hint and becomes a JSON 2010 string. 2012 * For all other tags (major type 6, any other tag number), the tag 2013 content is represented as a JSON value; the tag number is ignored. 2015 * Indefinite-length items are made definite before conversion. 2017 A CBOR-to-JSON converter may want to keep to the JSON profile I-JSON 2018 [RFC7493], to maximize interoperability and increase confidence that 2019 the JSON output can be processed with predictable results. 
For 2020 example, this has implications on the range of integers that can be 2021 represented reliably, as well as on the top-level items that may be 2022 supported by older JSON implementations. 2024 6.2. Converting from JSON to CBOR 2026 All JSON values, once decoded, directly map into one or more CBOR 2027 values. As with any kind of CBOR generation, decisions have to be 2028 made with respect to number representation. In a suggested 2029 conversion: 2031 * JSON numbers without fractional parts (integer numbers) are 2032 represented as integers (major types 0 and 1, possibly major type 2033 6 tag number 2 and 3), choosing the shortest form; integers longer 2034 than an implementation-defined threshold may instead be 2035 represented as floating-point values. The default range that is 2036 represented as integer is -2**53+1..2**53-1 (fully exploiting the 2037 range for exact integers in the binary64 representation often used 2038 for decoding JSON [RFC7493]). A CBOR-based protocol, or a generic 2039 converter implementation, may choose -2**32..2**32-1 or 2040 -2**64..2**64-1 (fully using the integer ranges available in CBOR 2041 with uint32_t or uint64_t, respectively) or even -2**31..2**31-1 2042 or -2**63..2**63-1 (using popular ranges for two's complement 2043 signed integers). (If the JSON was generated from a JavaScript 2044 implementation, its precision is already limited to 53 bits 2045 maximum.) 2047 * Numbers with fractional parts are represented as floating-point 2048 values, performing the decimal-to-binary conversion based on the 2049 precision provided by IEEE 754 binary64. The mathematical value 2050 of the JSON number is converted to binary64 using the 2051 roundTiesToEven procedure in Section 4.3.1 of [IEEE754]. 
Then, 2052 when encoding in CBOR, the preferred serialization uses the 2053 shortest floating-point representation exactly representing this 2054 conversion result; for instance, 1.5 is represented in a 16-bit 2055 floating-point value (not all implementations will be capable of 2056 efficiently finding the minimum form, though). Instead of using 2057 the default binary64 precision, there may be an implementation- 2058 defined limit to the precision of the conversion that will affect 2059 the precision of the represented values. Decimal representation 2060 should only be used on the CBOR side if that is specified in a 2061 protocol. 2063 CBOR has been designed to generally provide a more compact encoding 2064 than JSON. One implementation strategy that might come to mind is to 2065 perform a JSON-to-CBOR encoding in place in a single buffer. This 2066 strategy would need to carefully consider a number of pathological 2067 cases, such as that some strings represented with no or very few 2068 escapes and longer (or much longer) than 255 bytes may expand when 2069 encoded as UTF-8 strings in CBOR. Similarly, a few of the binary 2070 floating-point representations might cause expansion from some short 2071 decimal representations (1.1, 1e9) in JSON. This may be hard to get 2072 right, and any ensuing vulnerabilities may be exploited by an 2073 attacker. 2075 7. Future Evolution of CBOR 2077 Successful protocols evolve over time. New ideas appear, 2078 implementation platforms improve, related protocols are developed and 2079 evolve, and new requirements from applications and protocols are 2080 added. Facilitating protocol evolution is therefore an important 2081 design consideration for any protocol development. 2083 For protocols that will use CBOR, CBOR provides some useful 2084 mechanisms to facilitate their evolution. Best practices for this 2085 are well known, particularly from JSON format development of JSON- 2086 based protocols. 
Therefore, such best practices are outside the 2087 scope of this specification. 2089 However, facilitating the evolution of CBOR itself is very well 2090 within its scope. CBOR is designed to both provide a stable basis 2091 for development of CBOR-based protocols and to be able to evolve. 2092 Since a successful protocol may live for decades, CBOR needs to be 2093 designed for decades of use and evolution. This section provides 2094 some guidance for the evolution of CBOR. It is necessarily more 2095 subjective than other parts of this document. It is also necessarily 2096 incomplete, lest it turn into a textbook on protocol development. 2098 7.1. Extension Points 2100 In a protocol design, opportunities for evolution are often included 2101 in the form of extension points. For example, there may be a 2102 codepoint space that is not fully allocated from the outset, and the 2103 protocol is designed to tolerate and embrace implementations that 2104 start using more codepoints than initially allocated. 2106 Sizing the codepoint space may be difficult because the range 2107 required may be hard to predict. Protocol designs should attempt to 2108 make the codepoint space large enough so that it can slowly be filled 2109 over the intended lifetime of the protocol. 2111 CBOR has three major extension points: 2113 * the "simple" space (values in major type 7). Of the 24 efficient 2114 (and 224 slightly less efficient) values, only a small number have 2115 been allocated. Implementations receiving an unknown simple data 2116 item may easily be able to process it as such, given that the 2117 structure of the value is indeed simple. The IANA registry in 2118 Section 9.1 is the appropriate way to address the extensibility of 2119 this codepoint space. 2121 * the "tag" space (values in major type 6). The total codepoint 2122 space is abundant; only a tiny part of it has been allocated. 
2123 However, not all of these codepoints are equally efficient: the 2124 first 24 only consume a single ("1+0") byte, and half of them have 2125 already been allocated. The next 232 values only consume two 2126 ("1+1") bytes, with nearly a quarter already allocated. These 2127 subspaces need some curation to last for a few more decades. 2128 Implementations receiving an unknown tag number can choose to 2129 process just the enclosed tag content or, preferably, to process 2130 the tag as an unknown tag number wrapping the tag content. The 2131 IANA registry in Section 9.2 is the appropriate way to address the 2132 extensibility of this codepoint space. 2134 * the "additional information" space. An implementation receiving 2135 an unknown additional information value has no way to continue 2136 decoding, so allocating codepoints in this space is a major step 2137 beyond just exercising an extension point. There are also very 2138 few codepoints left. See also Section 7.2. 2140 7.2. Curating the Additional Information Space 2142 The human mind is sometimes drawn to filling in little perceived gaps 2143 to make something neat. We expect the remaining gaps in the 2144 codepoint space for the additional information values to be an 2145 attractor for new ideas, just because they are there. 2147 The present specification does not manage the additional information 2148 codepoint space by an IANA registry. Instead, allocations out of 2149 this space can only be done by updating this specification. 2151 For an additional information value of n >= 24, the size of the 2152 additional data typically is 2**(n-24) bytes. Therefore, additional 2153 information values 28 and 29 should be viewed as candidates for 2154 128-bit and 256-bit quantities, in case a need arises to add them to 2155 the protocol. 
Additional information value 30 is then the only 2156 additional information value available for general allocation, and 2157 there should be a very good reason for allocating it before assigning 2158 it through an update of the present specification. 2160 8. Diagnostic Notation 2162 CBOR is a binary interchange format. To facilitate documentation and 2163 debugging, and in particular to facilitate communication between 2164 entities cooperating in debugging, this section defines a simple 2165 human-readable diagnostic notation. All actual interchange always 2166 happens in the binary format. 2168 Note that this truly is a diagnostic format; it is not meant to be 2169 parsed. Therefore, no formal definition (as in ABNF) is given in 2170 this document. (Implementers looking for a text-based format for 2171 representing CBOR data items in configuration files may also want to 2172 consider YAML [YAML].) 2174 The diagnostic notation is loosely based on JSON as it is defined in 2175 RFC 8259, extending it where needed. 2177 The notation borrows the JSON syntax for numbers (integer and 2178 floating-point), True (>true<), False (>false<), Null (>null<), UTF-8 2179 strings, arrays, and maps (maps are called objects in JSON; the 2180 diagnostic notation extends JSON here by allowing any data item in 2181 the key position). Undefined is written >undefined< as in 2182 JavaScript. The non-finite floating-point numbers Infinity, 2183 -Infinity, and NaN are written exactly as in this sentence (this is 2184 also a way they can be written in JavaScript, although JSON does not 2185 allow them). 
A tag is written as an integer number for the tag 2186 number, followed by the tag content in parentheses; for instance, an 2187 RFC 3339 (ISO 8601) date could be notated as: 2189 0("2013-03-21T20:04:00Z") 2191 or the equivalent relative time as 2193 1(1363896240) 2195 Byte strings are notated in one of the base encodings, without 2196 padding, enclosed in single quotes, prefixed by >h< for base16, >b32< 2197 for base32, >h32< for base32hex, >b64< for base64 or base64url (the 2198 actual encodings do not overlap, so the string remains unambiguous). 2199 For example, the byte string 0x12345678 could be written h'12345678', 2200 b32'CI2FM6A', or b64'EjRWeA'. 2202 Unassigned simple values are given as "simple()" with the appropriate 2203 integer in the parentheses. For example, "simple(42)" indicates 2204 major type 7, value 42. 2206 A number of useful extensions to the diagnostic notation defined here 2207 are provided in Appendix G of [RFC8610], "Extended Diagnostic 2208 Notation" (EDN). Similarly, an extension of this notation could be 2209 provided in a separate document to provide for the documentation of 2210 NaN payloads, which are not covered in the present document. 2212 8.1. Encoding Indicators 2214 Sometimes it is useful to indicate in the diagnostic notation which 2215 of several alternative representations were actually used; for 2216 example, a data item written >1.5< by a diagnostic decoder might have 2217 been encoded as a half-, single-, or double-precision float. 2219 The convention for encoding indicators is that anything starting with 2220 an underscore and all following characters that are alphanumeric or 2221 underscore, is an encoding indicator, and can be ignored by anyone 2222 not interested in this information. For example, "_" or "_3". 2223 Encoding indicators are always optional. 
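The byte string notations described in Section 8 can be reproduced with a few lines of Python. This is a minimal sketch: the function name diag_bytes is hypothetical, the h form is assumed to use lowercase hexadecimal digits, and the b64 form is assumed to use the base64url alphabet.

```python
import base64

# Map the base32 alphabet onto the base32hex alphabet (both from RFC 4648).
_B32_TO_B32HEX = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",
    "0123456789ABCDEFGHIJKLMNOPQRSTUV",
)


def diag_bytes(data: bytes, base: str = "h") -> str:
    """Render a byte string in diagnostic notation, without padding."""
    if base == "h":
        body = data.hex()
    elif base == "b32":
        body = base64.b32encode(data).decode("ascii")
    elif base == "h32":
        body = base64.b32encode(data).decode("ascii").translate(_B32_TO_B32HEX)
    elif base == "b64":
        body = base64.urlsafe_b64encode(data).decode("ascii")
    else:
        raise ValueError(f"unknown base prefix: {base!r}")
    # The notation omits the "=" padding characters.
    return f"{base}'{body.rstrip('=')}'"


data = bytes.fromhex("12345678")
print(diag_bytes(data))          # h'12345678'
print(diag_bytes(data, "b32"))   # b32'CI2FM6A'
print(diag_bytes(data, "b64"))   # b64'EjRWeA'
```

The three printed forms match the examples given for the byte string 0x12345678 in Section 8.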
2225 A single underscore can be written after the opening brace of a map 2226 or the opening bracket of an array to indicate that the data item was 2227 represented in indefinite-length format. For example, [_ 1, 2] 2228 contains an indicator that an indefinite-length representation was 2229 used to represent the data item [1, 2]. 2231 An underscore followed by a decimal digit n indicates that the 2232 preceding item (or, for arrays and maps, the item starting with the 2233 preceding bracket or brace) was encoded with an additional 2234 information value of 24+n. For example, 1.5_1 is a half-precision 2235 floating-point number, while 1.5_3 is encoded as double precision. 2236 This encoding indicator is not shown in Appendix A. (Note that the 2237 encoding indicator "_" is thus an abbreviation of the full form "_7", 2238 which is not used.) 2240 The detailed chunk structure of byte and text strings of indefinite 2241 length can be notated in the form (_ h'0123', h'4567') and (_ "foo", 2242 "bar"). However, for an indefinite length string with no chunks 2243 inside, (_ ) would be ambiguous whether a byte string (0x5fff) or a 2244 text string (0x7fff) is meant and is therefore not used. The basic 2245 forms ''_ and ""_ can be used instead and are reserved for the case 2246 with no chunks only -- not as short forms for the (permitted, but not 2247 really useful) encodings with only empty chunks, which to preserve 2248 the chunk structure need to be notated as (_ ''), (_ ""), etc. 2250 9. IANA Considerations 2252 IANA has created two registries for new CBOR values. The registries 2253 are separate, that is, not under an umbrella registry, and follow the 2254 rules in [RFC8126]. IANA has also assigned a new MIME media type and 2255 an associated Constrained Application Protocol (CoAP) Content-Format 2256 entry. 2258 [To be removed by RFC editor:] IANA is requested to update these 2259 registries to point to the present document instead of RFC 7049. 2261 9.1. 
Simple Values Registry 2263 IANA has created the "Concise Binary Object Representation (CBOR) 2264 Simple Values" registry at [IANA.cbor-simple-values]. The initial 2265 values are shown in Table 4. 2267 New entries in the range 0 to 19 are assigned by Standards Action. 2268 It is suggested that these Standards Actions allocate values starting 2269 with the number 16 in order to reserve the lower numbers for 2270 contiguous blocks (if any). 2272 New entries in the range 32 to 255 are assigned by Specification 2273 Required. 2275 9.2. Tags Registry 2277 IANA has created the "Concise Binary Object Representation (CBOR) 2278 Tags" registry at [IANA.cbor-tags]. The tags that were defined in 2279 [RFC7049] are described in detail in Section 3.4, and other tags have 2280 already been defined since then. 2282 New entries in the range 0 to 23 ("1+0") are assigned by Standards 2283 Action. New entries in the ranges 24 to 255 ("1+1") and 256 to 32767 2284 (lower half of "1+2") are assigned by Specification Required. New 2285 entries in the range 32768 to 18446744073709551615 (upper half of 2286 "1+2", "1+4", and "1+8") are assigned by First Come First Served. 2287 The template for registration requests is: 2289 * Data item 2291 * Semantics (short form) 2293 In addition, First Come First Served requests should include: 2295 * Point of contact 2297 * Description of semantics (URL) -- This description is optional; 2298 the URL can point to something like an Internet-Draft or a web 2299 page. 2301 Applicants exercising the First Come First Served range and making a 2302 suggestion for a tag number that is not representable in 32 bits 2303 (i.e., larger than 4294967295) should be aware that this could reduce 2304 interoperability with implementations that do not support 64-bit 2305 numbers. 2307 9.3. 
Media Type ("MIME Type") 2309 The Internet media type [RFC6838] for a single encoded CBOR data item 2310 is application/cbor, as defined in [IANA.media-types]: 2312 Type name: application 2314 Subtype name: cbor 2316 Required parameters: n/a 2318 Optional parameters: n/a 2320 Encoding considerations: Binary 2322 Security considerations: See Section 10 of this document 2324 Interoperability considerations: n/a 2326 Published specification: This document 2328 Applications that use this media type: Many 2330 Additional information: 2331 * Magic number(s): n/a 2333 * File extension(s): .cbor 2335 * Macintosh file type code(s): n/a 2337 Person & email address to contact for further information: IETF CBOR 2338 Working Group cbor@ietf.org (mailto:cbor@ietf.org) or IETF 2339 Applications and Real-Time Area art@ietf.org (mailto:art@ietf.org) 2341 Intended usage: COMMON 2343 Restrictions on usage: none 2345 Author: IETF CBOR Working Group cbor@ietf.org (mailto:cbor@ietf.org) 2347 Change controller: The IESG iesg@ietf.org (mailto:iesg@ietf.org) 2349 9.4. CoAP Content-Format 2351 The CoAP Content-Format for CBOR is registered in 2352 [IANA.core-parameters]: 2354 Media Type: application/cbor 2355 Encoding: - 2357 Id: 60 2359 Reference: [RFCthis] 2361 9.5. The +cbor Structured Syntax Suffix Registration 2363 The Structured Syntax Suffix [RFC6838] for media types based on a 2364 single encoded CBOR data item is +cbor, as defined in 2365 [IANA.media-type-structured-suffix]: 2367 Name: Concise Binary Object Representation (CBOR) 2369 +suffix: +cbor 2371 References: [RFCthis] 2373 Encoding Considerations: CBOR is a binary format. 2375 Interoperability Considerations: n/a 2377 Fragment Identifier Considerations: The syntax and semantics of 2378 fragment identifiers specified for +cbor SHOULD be as specified 2379 for "application/cbor". (At publication of this document, there 2380 is no fragment identification syntax defined for "application/ 2381 cbor".) 
2383 The syntax and semantics for fragment identifiers for a specific 2384 "xxx/yyy+cbor" SHOULD be processed as follows: 2386 * For cases defined in +cbor, where the fragment identifier 2387 resolves per the +cbor rules, then process as specified in 2388 +cbor. 2390 * For cases defined in +cbor, where the fragment identifier does 2391 not resolve per the +cbor rules, then process as specified in 2392 "xxx/yyy+cbor". 2394 * For cases not defined in +cbor, then process as specified in 2395 "xxx/yyy+cbor". 2397 Security Considerations: See Section 10 of this document 2399 Contact: IETF CBOR Working Group cbor@ietf.org 2400 (mailto:cbor@ietf.org) or IETF Applications and Real-Time Area 2401 art@ietf.org (mailto:art@ietf.org) 2403 Author/Change Controller: The IESG iesg@ietf.org 2404 (mailto:iesg@ietf.org) 2405 // Editors' note: RFC 6838 has a template field Author/Change 2406 // controller, the descriptive text of which makes clear that this 2407 is 2408 // the change controller, not the author. Go figure. There is no 2409 // separate author entry as in the media types registry. (RFC 2410 // editor: Please remove this note before publication.) 2412 10. Security Considerations 2414 A network-facing application can exhibit vulnerabilities in its 2415 processing logic for incoming data. Complex parsers are well known 2416 as a likely source of such vulnerabilities, such as the ability to 2417 remotely crash a node, or even remotely execute arbitrary code on it. 2418 CBOR attempts to narrow the opportunities for introducing such 2419 vulnerabilities by reducing parser complexity, by giving the entire 2420 range of encodable values a meaning where possible. 2422 Because CBOR decoders are often used as a first step in processing 2423 unvalidated input, they need to be fully prepared for all types of 2424 hostile input that may be designed to corrupt, overrun, or achieve 2425 control of the system decoding the CBOR data item. 
A CBOR decoder 2426 needs to assume that all input may be hostile even if it has been 2427 checked by a firewall, has come over a secure channel such as TLS, is 2428 encrypted or signed, or has come from some other source that is 2429 presumed trusted. 2431 Section 4.1 gives examples of limitations in interoperability when 2432 using a constrained CBOR decoder with input from a CBOR encoder that 2433 uses a non-preferred serialization. When a single data item is 2434 consumed both by such a constrained decoder and a full decoder, it 2435 can lead to security issues that can be exploited by an attacker who 2436 can inject or manipulate content. 2438 As discussed throughout this document, there are many values that can 2439 be considered "equivalent" in some circumstances and "not equivalent" 2440 in others. As just one example, the numeric value for the number 2441 "one" might be expressed as an integer or a bignum. A system 2442 interpreting CBOR input might accept either form for the number 2443 "one", or might reject one (or both) forms. Such acceptance or 2444 rejection can have security implications in the program that is using 2445 the interpreted input. 2447 Hostile input may be constructed to overrun buffers, overflow or 2448 underflow integer arithmetic, or cause other decoding disruption. 2449 CBOR data items might have lengths or sizes that are intentionally 2450 extremely large or too short. Resource exhaustion attacks might 2451 attempt to lure a decoder into allocating very big data items 2452 (strings, arrays, maps, or even arbitrary precision numbers) or 2453 exhaust the stack depth by setting up deeply nested items. Decoders 2454 need to have appropriate resource management to mitigate these 2455 attacks. (Items for which very large sizes are given can also 2456 attempt to exploit integer overflow vulnerabilities.) 2458 A CBOR decoder, by definition, only accepts well-formed CBOR; this is 2459 the first step to its robustness. 
Input that is not well-formed CBOR 2460 causes no further processing from the point where the lack of well- 2461 formedness was detected. If possible, any data decoded up to this 2462 point should have no impact on the application using the CBOR 2463 decoder. 2465 In addition to ascertaining well-formedness, a CBOR decoder might 2466 also perform validity checks on the CBOR data. Alternatively, it can 2467 leave those checks to the application using the decoder. This choice 2468 needs to be clearly documented in the decoder. Beyond the validity 2469 at the CBOR level, an application also needs to ascertain that the 2470 input is in alignment with the application protocol that is 2471 serialized in CBOR. 2473 The input check itself may consume resources. This is usually linear 2474 in the size of the input, which means that an attacker has to spend 2475 resources that are commensurate to the resources spent by the 2476 defender on input validation. However, an attacker might be able to 2477 craft inputs that will take longer for a target decoder to process 2478 than for the attacker to produce. Processing for arbitrary-precision 2479 numbers may exceed linear effort. Also, some hash-table 2480 implementations that are used by decoders to build in-memory 2481 representations of maps can be attacked to spend quadratic effort, 2482 unless a secret key (see Section 7 of [SIPHASH_LNCS], also 2483 [SIPHASH_OPEN]) or some other mitigation is employed. Such 2484 superlinear efforts can be exploited by an attacker to exhaust 2485 resources at or before the input validator; they therefore need to be 2486 avoided in a CBOR decoder implementation. Note that tag number 2487 definitions and their implementations can add security considerations 2488 of this kind; this should then be discussed in the security 2489 considerations of the tag number definition. 
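The resource management concerns above can be sketched as a small checker that walks one encoded data item without allocating anything for it. This is an illustrative sketch only, not a complete decoder: the names skip_item, NotWellFormed, and MAX_DEPTH are hypothetical, the depth limit of 32 is an arbitrary assumption, and the sketch deliberately rejects indefinite-length items (which are well-formed CBOR) to stay short. It performs no validity checks such as UTF-8 or duplicate map key checking.

```python
MAX_DEPTH = 32  # assumed nesting limit; a real decoder would make this configurable


class NotWellFormed(ValueError):
    """Raised when the input cannot be a well-formed item (in this sketch)."""


def skip_item(buf: bytes, pos: int = 0, depth: int = 0) -> int:
    """Return the offset just past one definite-length data item, or raise."""
    if depth > MAX_DEPTH:
        raise NotWellFormed("nesting too deep")
    if pos >= len(buf):
        raise NotWellFormed("truncated input")
    major, ai = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if ai < 24:
        arg = ai
    elif ai <= 27:
        size = 1 << (ai - 24)  # 1, 2, 4, or 8 argument bytes
        if size > len(buf) - pos:
            raise NotWellFormed("truncated argument")
        arg = int.from_bytes(buf[pos:pos + size], "big")
        pos += size
    else:
        # 28 to 30 are reserved; 31 (indefinite length) is not handled here.
        raise NotWellFormed("additional information not handled in this sketch")
    if major == 7:
        if ai == 24 and arg < 32:
            raise NotWellFormed("two-byte encoding of a simple value below 32")
        return pos
    if major in (0, 1):
        return pos
    if major in (2, 3):
        # Compare against the remaining input before trusting the length.
        if arg > len(buf) - pos:
            raise NotWellFormed("string length exceeds remaining input")
        return pos + arg
    if major == 6:
        return skip_item(buf, pos, depth + 1)
    # Arrays hold arg items, maps hold arg key/value pairs.
    count = arg if major == 4 else 2 * arg
    # Every item needs at least one byte, so a larger count cannot be satisfied.
    if count > len(buf) - pos:
        raise NotWellFormed("more items declared than bytes remaining")
    for _ in range(count):
        pos = skip_item(buf, pos, depth + 1)
    return pos
```

Because declared lengths and counts are compared with the number of bytes actually remaining before any loop or slice, the work done stays linear in the size of the input, and the depth counter bounds stack usage for deeply nested items.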
2491 CBOR encoders do not receive input directly from the network and are 2492 thus not directly attackable in the same way as CBOR decoders. 2493 However, CBOR encoders often have an API that takes input from 2494 another level in the implementation and can be attacked through that 2495 API. The design and implementation of that API should assume the 2496 behavior of its caller may be based on hostile input or on coding 2497 mistakes. It should check inputs for buffer overruns, overflow and 2498 underflow of integer arithmetic, and other such errors that are aimed 2499 to disrupt the encoder. 2501 Protocols should be defined in such a way that potential multiple 2502 interpretations are reliably reduced to a single interpretation. For 2503 example, an attacker could make use of invalid input such as 2504 duplicate keys in maps, or exploit different precision in processing 2505 numbers to make one application base its decisions on a different 2506 interpretation than the one that will be used by a second 2507 application. To facilitate consistent interpretation, encoder and 2508 decoder implementations should provide a validity checking mode of 2509 operation (Section 5.4). Note, however, that a generic decoder 2510 cannot know about all requirements that an application poses on its 2511 input data; it is therefore not relieving the application from 2512 performing its own input checking. Also, since the set of defined 2513 tag numbers evolves, the application may employ a tag number that is 2514 not yet supported for validity checking by the generic decoder it 2515 uses. Generic decoders therefore need to provide documentation which 2516 tag numbers they support and what validity checking they can provide 2517 for each of them as well as for basic CBOR validity (UTF-8 checking, 2518 duplicate map key checking). 
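The duplicate-key hazard described above can be made concrete with a small sketch. It assumes a decoder that hands map entries to the application as a list of (key, value) pairs with text string keys; the three function names are hypothetical and stand for two lenient consumers and one strict consumer.

```python
def first_wins(pairs):
    """Lenient consumer that keeps the first value seen for each key."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result


def last_wins(pairs):
    """Lenient consumer that keeps the last value seen for each key."""
    return dict(pairs)


def strict(pairs):
    """Validity-checking consumer that rejects duplicate map keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value
    return result


# An invalid map with a duplicate key, as an attacker might construct it.
pairs = [("role", "user"), ("role", "admin")]

print(first_wins(pairs)["role"])  # user
print(last_wins(pairs)["role"])   # admin
```

Two applications using the two lenient consumers would base their decisions on different interpretations of the same input, whereas the strict consumer reduces the input to a single outcome by rejecting it.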
2520 Section 3.4.3 notes that using the non-preferred choice of a bignum 2521 representation instead of a basic integer for encoding a number is 2522 not intended to have application semantics, but it can have such 2523 semantics if an application receiving CBOR data is using a decoder in 2524 the basic generic data model. This disparity causes a security issue 2525 if the two sets of semantics differ. Thus, applications using CBOR 2526 need to specify the data model that they are using for each use of 2527 CBOR data. 2529 It is common to convert CBOR data to other formats. In many cases, 2530 CBOR has more expressive types than other formats; this is 2531 particularly true for the common conversion to JSON. The loss of 2532 type information can cause security issues for the systems that are 2533 processing the less-expressive data. 2535 Section 6.2 describes a possibly-common usage scenario of converting 2536 between CBOR and JSON that could allow an attack if the attacker knows 2537 that the application is performing the conversion. 2539 Security considerations for the use of base16 and base64 from 2540 [RFC4648], and the use of UTF-8 from [RFC3629], are relevant to CBOR 2541 as well. 2543 11. References 2545 11.1. Normative References 2547 [C] International Organization for Standardization, 2548 "Information technology — Programming languages — C", ISO/ 2549 IEC 9899:2018, Fourth Edition, June 2018. 2551 [Cplusplus17] 2552 International Organization for Standardization, 2553 "Programming languages — C++", ISO/IEC 14882:2017, Fifth 2554 Edition, December 2017. 2556 [IEEE754] IEEE, "IEEE Standard for Floating-Point Arithmetic", IEEE 2557 Std 754-2019, DOI 10.1109/IEEESTD.2019.8766229, 2558 . 2560 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2561 Extensions (MIME) Part One: Format of Internet Message 2562 Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996, 2563 .
2565 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2566 Requirement Levels", BCP 14, RFC 2119, 2567 DOI 10.17487/RFC2119, March 1997, 2568 . 2570 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: 2571 Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002, 2572 . 2574 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 2575 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2576 2003, . 2578 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2579 Resource Identifier (URI): Generic Syntax", STD 66, 2580 RFC 3986, DOI 10.17487/RFC3986, January 2005, 2581 . 2583 [RFC4287] Nottingham, M., Ed. and R. Sayre, Ed., "The Atom 2584 Syndication Format", RFC 4287, DOI 10.17487/RFC4287, 2585 December 2005, . 2587 [RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data 2588 Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006, 2589 . 2591 [RFC8126] Cotton, M., Leiba, B., and T. Narten, "Guidelines for 2592 Writing an IANA Considerations Section in RFCs", BCP 26, 2593 RFC 8126, DOI 10.17487/RFC8126, June 2017, 2594 . 2596 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2597 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2598 May 2017, . 2600 [TIME_T] The Open Group Base Specifications, "Open Group Standard: 2601 Vol. 1: Base Definitions, Issue 7", Section 4.16 'Seconds 2602 Since the Epoch', IEEE Std 1003.1, 2018 Edition, 2018, 2603 . 2606 11.2. Informative References 2608 [ASN.1] International Telecommunication Union, "Information 2609 Technology — ASN.1 encoding rules: Specification of Basic 2610 Encoding Rules (BER), Canonical Encoding Rules (CER) and 2611 Distinguished Encoding Rules (DER)", ITU-T Recommendation 2612 X.690, 1994. 2614 [BSON] Various, "BSON - Binary JSON", 2013, 2615 . 2617 [ECMA262] Ecma International, "ECMAScript 2018 Language 2618 Specification", ECMA Standard ECMA-262, 9th Edition, June 2619 2018, . 
2623 [I-D.bormann-cbor-notable-tags] 2624 Bormann, C., "Notable CBOR Tags", Work in Progress, 2625 Internet-Draft, draft-bormann-cbor-notable-tags-02, 25 2626 June 2020, . 2629 [IANA.cbor-simple-values] 2630 IANA, "Concise Binary Object Representation (CBOR) Simple 2631 Values", 2632 . 2634 [IANA.cbor-tags] 2635 IANA, "Concise Binary Object Representation (CBOR) Tags", 2636 . 2638 [IANA.core-parameters] 2639 IANA, "Constrained RESTful Environments (CoRE) 2640 Parameters", 2641 . 2643 [IANA.media-type-structured-suffix] 2644 IANA, "Structured Syntax Suffix Registry", 2645 . 2648 [IANA.media-types] 2649 IANA, "Media Types", 2650 . 2652 [MessagePack] 2653 Furuhashi, S., "MessagePack", 2013, . 2655 [PCRE] Ho, A., "PCRE - Perl Compatible Regular Expressions", 2656 2018, . 2658 [RFC0713] Haverty, J., "MSDTP-Message Services Data Transmission 2659 Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976, 2660 . 2662 [RFC6838] Freed, N., Klensin, J., and T. Hansen, "Media Type 2663 Specifications and Registration Procedures", BCP 13, 2664 RFC 6838, DOI 10.17487/RFC6838, January 2013, 2665 . 2667 [RFC7049] Bormann, C. and P. Hoffman, "Concise Binary Object 2668 Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049, 2669 October 2013, . 2671 [RFC7228] Bormann, C., Ersue, M., and A. Keranen, "Terminology for 2672 Constrained-Node Networks", RFC 7228, 2673 DOI 10.17487/RFC7228, May 2014, 2674 . 2676 [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, 2677 DOI 10.17487/RFC7493, March 2015, 2678 . 2680 [RFC7991] Hoffman, P., "The "xml2rfc" Version 3 Vocabulary", 2681 RFC 7991, DOI 10.17487/RFC7991, December 2016, 2682 . 2684 [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data 2685 Interchange Format", STD 90, RFC 8259, 2686 DOI 10.17487/RFC8259, December 2017, 2687 . 2689 [RFC8610] Birkholz, H., Vigano, C., and C. 
Bormann, "Concise Data 2690 Definition Language (CDDL): A Notational Convention to 2691 Express Concise Binary Object Representation (CBOR) and 2692 JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610, 2693 June 2019, . 2695 [RFC8618] Dickinson, J., Hague, J., Dickinson, S., Manderson, T., 2696 and J. Bond, "Compacted-DNS (C-DNS): A Format for DNS 2697 Packet Capture", RFC 8618, DOI 10.17487/RFC8618, September 2698 2019, . 2700 [RFC8742] Bormann, C., "Concise Binary Object Representation (CBOR) 2701 Sequences", RFC 8742, DOI 10.17487/RFC8742, February 2020, 2702 . 2704 [RFC8746] Bormann, C., Ed., "Concise Binary Object Representation 2705 (CBOR) Tags for Typed Arrays", RFC 8746, 2706 DOI 10.17487/RFC8746, February 2020, 2707 . 2709 [SIPHASH_LNCS] 2710 Aumasson, J. and D. Bernstein, "SipHash: A Fast Short- 2711 Input PRF", Lecture Notes in Computer Science pp. 489-508, 2712 DOI 10.1007/978-3-642-34931-7_28, 2012, 2713 . 2715 [SIPHASH_OPEN] 2716 Aumasson, J. and D.J. Bernstein, "SipHash: a fast short- 2717 input PRF", . 2719 [YAML] Ben-Kiki, O., Evans, C., and I.d. Net, "YAML Ain't Markup 2720 Language (YAML[TM]) Version 1.2", 3rd Edition, October 2721 2009, . 2723 Appendix A. Examples of Encoded CBOR Data Items 2725 The following table provides some CBOR-encoded values in hexadecimal 2726 (right column), together with diagnostic notation for these values 2727 (left column). Note that the string "\u00fc" is one form of 2728 diagnostic notation for a UTF-8 string containing the single Unicode 2729 character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (u umlaut). 2730 Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a 2731 single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, often 2732 representing "water"), and "\ud800\udd51" is a UTF-8 string in 2733 diagnostic notation with a single character U+10151 (GREEK ACROPHONIC 2734 ATTIC FIFTY STATERS). 
(Note that all these single-character strings 2735 could also be represented in native UTF-8 in diagnostic notation, 2736 just not in an ASCII-only specification.) In the diagnostic notation 2737 provided for bignums, their intended numeric value is shown as a 2738 decimal number (such as 18446744073709551616) instead of showing a 2739 tagged byte string (such as 2(h'010000000000000000')). 2741 +==============================+====================================+ 2742 |Diagnostic | Encoded | 2743 +==============================+====================================+ 2744 |0 | 0x00 | 2745 +------------------------------+------------------------------------+ 2746 |1 | 0x01 | 2747 +------------------------------+------------------------------------+ 2748 |10 | 0x0a | 2749 +------------------------------+------------------------------------+ 2750 |23 | 0x17 | 2751 +------------------------------+------------------------------------+ 2752 |24 | 0x1818 | 2753 +------------------------------+------------------------------------+ 2754 |25 | 0x1819 | 2755 +------------------------------+------------------------------------+ 2756 |100 | 0x1864 | 2757 +------------------------------+------------------------------------+ 2758 |1000 | 0x1903e8 | 2759 +------------------------------+------------------------------------+ 2760 |1000000 | 0x1a000f4240 | 2761 +------------------------------+------------------------------------+ 2762 |1000000000000 | 0x1b000000e8d4a51000 | 2763 +------------------------------+------------------------------------+ 2764 |18446744073709551615 | 0x1bffffffffffffffff | 2765 +------------------------------+------------------------------------+ 2766 |18446744073709551616 | 0xc249010000000000000000 | 2767 +------------------------------+------------------------------------+ 2768 |-18446744073709551616 | 0x3bffffffffffffffff | 2769 +------------------------------+------------------------------------+ 2770 |-18446744073709551617 | 0xc349010000000000000000 | 
2771 +------------------------------+------------------------------------+ 2772 |-1 | 0x20 | 2773 +------------------------------+------------------------------------+ 2774 |-10 | 0x29 | 2775 +------------------------------+------------------------------------+ 2776 |-100 | 0x3863 | 2777 +------------------------------+------------------------------------+ 2778 |-1000 | 0x3903e7 | 2779 +------------------------------+------------------------------------+ 2780 |0.0 | 0xf90000 | 2781 +------------------------------+------------------------------------+ 2782 |-0.0 | 0xf98000 | 2783 +------------------------------+------------------------------------+ 2784 |1.0 | 0xf93c00 | 2785 +------------------------------+------------------------------------+ 2786 |1.1 | 0xfb3ff199999999999a | 2787 +------------------------------+------------------------------------+ 2788 |1.5 | 0xf93e00 | 2789 +------------------------------+------------------------------------+ 2790 |65504.0 | 0xf97bff | 2791 +------------------------------+------------------------------------+ 2792 |100000.0 | 0xfa47c35000 | 2793 +------------------------------+------------------------------------+ 2794 |3.4028234663852886e+38 | 0xfa7f7fffff | 2795 +------------------------------+------------------------------------+ 2796 |1.0e+300 | 0xfb7e37e43c8800759c | 2797 +------------------------------+------------------------------------+ 2798 |5.960464477539063e-8 | 0xf90001 | 2799 +------------------------------+------------------------------------+ 2800 |0.00006103515625 | 0xf90400 | 2801 +------------------------------+------------------------------------+ 2802 |-4.0 | 0xf9c400 | 2803 +------------------------------+------------------------------------+ 2804 |-4.1 | 0xfbc010666666666666 | 2805 +------------------------------+------------------------------------+ 2806 |Infinity | 0xf97c00 | 2807 +------------------------------+------------------------------------+ 2808 |NaN | 0xf97e00 | 2809 
+------------------------------+------------------------------------+ 2810 |-Infinity | 0xf9fc00 | 2811 +------------------------------+------------------------------------+ 2812 |Infinity | 0xfa7f800000 | 2813 +------------------------------+------------------------------------+ 2814 |NaN | 0xfa7fc00000 | 2815 +------------------------------+------------------------------------+ 2816 |-Infinity | 0xfaff800000 | 2817 +------------------------------+------------------------------------+ 2818 |Infinity | 0xfb7ff0000000000000 | 2819 +------------------------------+------------------------------------+ 2820 |NaN | 0xfb7ff8000000000000 | 2821 +------------------------------+------------------------------------+ 2822 |-Infinity | 0xfbfff0000000000000 | 2823 +------------------------------+------------------------------------+ 2824 |false | 0xf4 | 2825 +------------------------------+------------------------------------+ 2826 |true | 0xf5 | 2827 +------------------------------+------------------------------------+ 2828 |null | 0xf6 | 2829 +------------------------------+------------------------------------+ 2830 |undefined | 0xf7 | 2831 +------------------------------+------------------------------------+ 2832 |simple(16) | 0xf0 | 2833 +------------------------------+------------------------------------+ 2834 |simple(255) | 0xf8ff | 2835 +------------------------------+------------------------------------+ 2836 |0("2013-03-21T20:04:00Z") | 0xc074323031332d30332d32315432303a | 2837 | | 30343a30305a | 2838 +------------------------------+------------------------------------+ 2839 |1(1363896240) | 0xc11a514b67b0 | 2840 +------------------------------+------------------------------------+ 2841 |1(1363896240.5) | 0xc1fb41d452d9ec200000 | 2842 +------------------------------+------------------------------------+ 2843 |23(h'01020304') | 0xd74401020304 | 2844 +------------------------------+------------------------------------+ 2845 |24(h'6449455446') | 0xd818456449455446 | 2846 
+------------------------------+------------------------------------+ 2847 |32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 | 2848 | | 616d706c652e636f6d | 2849 +------------------------------+------------------------------------+ 2850 |h'' | 0x40 | 2851 +------------------------------+------------------------------------+ 2852 |h'01020304' | 0x4401020304 | 2853 +------------------------------+------------------------------------+ 2854 |"" | 0x60 | 2855 +------------------------------+------------------------------------+ 2856 |"a" | 0x6161 | 2857 +------------------------------+------------------------------------+ 2858 |"IETF" | 0x6449455446 | 2859 +------------------------------+------------------------------------+ 2860 |"\"\\" | 0x62225c | 2861 +------------------------------+------------------------------------+ 2862 |"\u00fc" | 0x62c3bc | 2863 +------------------------------+------------------------------------+ 2864 |"\u6c34" | 0x63e6b0b4 | 2865 +------------------------------+------------------------------------+ 2866 |"\ud800\udd51" | 0x64f0908591 | 2867 +------------------------------+------------------------------------+ 2868 |[] | 0x80 | 2869 +------------------------------+------------------------------------+ 2870 |[1, 2, 3] | 0x83010203 | 2871 +------------------------------+------------------------------------+ 2872 |[1, [2, 3], [4, 5]] | 0x8301820203820405 | 2873 +------------------------------+------------------------------------+ 2874 |[1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x98190102030405060708090a0b0c0d0e | 2875 |10, 11, 12, 13, 14, 15, 16, | 0f101112131415161718181819 | 2876 |17, 18, 19, 20, 21, 22, 23, | | 2877 |24, 25] | | 2878 +------------------------------+------------------------------------+ 2879 |{} | 0xa0 | 2880 +------------------------------+------------------------------------+ 2881 |{1: 2, 3: 4} | 0xa201020304 | 2882 +------------------------------+------------------------------------+ 2883 |{"a": 1, "b": [2, 3]} | 
0xa26161016162820203 | 2884 +------------------------------+------------------------------------+ 2885 |["a", {"b": "c"}] | 0x826161a161626163 | 2886 +------------------------------+------------------------------------+ 2887 |{"a": "A", "b": "B", "c": "C",| 0xa5616161416162614261636143616461 | 2888 |"d": "D", "e": "E"} | 4461656145 | 2889 +------------------------------+------------------------------------+ 2890 |(_ h'0102', h'030405') | 0x5f42010243030405ff | 2891 +------------------------------+------------------------------------+ 2892 |(_ "strea", "ming") | 0x7f657374726561646d696e67ff | 2893 +------------------------------+------------------------------------+ 2894 |[_ ] | 0x9fff | 2895 +------------------------------+------------------------------------+ 2896 |[_ 1, [2, 3], [_ 4, 5]] | 0x9f018202039f0405ffff | 2897 +------------------------------+------------------------------------+ 2898 |[_ 1, [2, 3], [4, 5]] | 0x9f01820203820405ff | 2899 +------------------------------+------------------------------------+ 2900 |[1, [2, 3], [_ 4, 5]] | 0x83018202039f0405ff | 2901 +------------------------------+------------------------------------+ 2902 |[1, [_ 2, 3], [4, 5]] | 0x83019f0203ff820405 | 2903 +------------------------------+------------------------------------+ 2904 |[_ 1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x9f0102030405060708090a0b0c0d0e0f | 2905 |10, 11, 12, 13, 14, 15, 16, | 101112131415161718181819ff | 2906 |17, 18, 19, 20, 21, 22, 23, | | 2907 |24, 25] | | 2908 +------------------------------+------------------------------------+ 2909 |{_ "a": 1, "b": [_ 2, 3]} | 0xbf61610161629f0203ffff | 2910 +------------------------------+------------------------------------+ 2911 |["a", {_ "b": "c"}] | 0x826161bf61626163ff | 2912 +------------------------------+------------------------------------+ 2913 |{_ "Fun": true, "Amt": -2} | 0xbf6346756ef563416d7421ff | 2914 +------------------------------+------------------------------------+ 2915 Table 6: Examples of Encoded CBOR 
Data Items 2917 Appendix B. Jump Table for Initial Byte 2919 For brevity, this jump table does not show initial bytes that are 2920 reserved for future extension. It also only shows a selection of the 2921 initial bytes that can be used for optional features. (All unsigned 2922 integers are in network byte order.) 2924 +============+================================================+ 2925 | Byte | Structure/Semantics | 2926 +============+================================================+ 2927 | 0x00..0x17 | Unsigned integer 0x00..0x17 (0..23) | 2928 +------------+------------------------------------------------+ 2929 | 0x18 | Unsigned integer (one-byte uint8_t follows) | 2930 +------------+------------------------------------------------+ 2931 | 0x19 | Unsigned integer (two-byte uint16_t follows) | 2932 +------------+------------------------------------------------+ 2933 | 0x1a | Unsigned integer (four-byte uint32_t follows) | 2934 +------------+------------------------------------------------+ 2935 | 0x1b | Unsigned integer (eight-byte uint64_t follows) | 2936 +------------+------------------------------------------------+ 2937 | 0x20..0x37 | Negative integer -1-0x00..-1-0x17 (-1..-24) | 2938 +------------+------------------------------------------------+ 2939 | 0x38 | Negative integer -1-n (one-byte uint8_t for n | 2940 | | follows) | 2941 +------------+------------------------------------------------+ 2942 | 0x39 | Negative integer -1-n (two-byte uint16_t for n | 2943 | | follows) | 2944 +------------+------------------------------------------------+ 2945 | 0x3a | Negative integer -1-n (four-byte uint32_t for | 2946 | | n follows) | 2947 +------------+------------------------------------------------+ 2948 | 0x3b | Negative integer -1-n (eight-byte uint64_t for | 2949 | | n follows) | 2950 +------------+------------------------------------------------+ 2951 | 0x40..0x57 | byte string (0x00..0x17 bytes follow) | 2952 
+------------+------------------------------------------------+ 2953 | 0x58 | byte string (one-byte uint8_t for n, and then | 2954 | | n bytes follow) | 2955 +------------+------------------------------------------------+ 2956 | 0x59 | byte string (two-byte uint16_t for n, and then | 2957 | | n bytes follow) | 2958 +------------+------------------------------------------------+ 2959 | 0x5a | byte string (four-byte uint32_t for n, and | 2960 | | then n bytes follow) | 2961 +------------+------------------------------------------------+ 2962 | 0x5b | byte string (eight-byte uint64_t for n, and | 2963 | | then n bytes follow) | 2964 +------------+------------------------------------------------+ 2965 | 0x5f | byte string, byte strings follow, terminated | 2966 | | by "break" | 2967 +------------+------------------------------------------------+ 2968 | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow) | 2969 +------------+------------------------------------------------+ 2970 | 0x78 | UTF-8 string (one-byte uint8_t for n, and then | 2971 | | n bytes follow) | 2972 +------------+------------------------------------------------+ 2973 | 0x79 | UTF-8 string (two-byte uint16_t for n, and | 2974 | | then n bytes follow) | 2975 +------------+------------------------------------------------+ 2976 | 0x7a | UTF-8 string (four-byte uint32_t for n, and | 2977 | | then n bytes follow) | 2978 +------------+------------------------------------------------+ 2979 | 0x7b | UTF-8 string (eight-byte uint64_t for n, and | 2980 | | then n bytes follow) | 2981 +------------+------------------------------------------------+ 2982 | 0x7f | UTF-8 string, UTF-8 strings follow, terminated | 2983 | | by "break" | 2984 +------------+------------------------------------------------+ 2985 | 0x80..0x97 | array (0x00..0x17 data items follow) | 2986 +------------+------------------------------------------------+ 2987 | 0x98 | array (one-byte uint8_t for n, and then n data | 2988 | | items follow) | 
2989 +------------+------------------------------------------------+ 2990 | 0x99 | array (two-byte uint16_t for n, and then n | 2991 | | data items follow) | 2992 +------------+------------------------------------------------+ 2993 | 0x9a | array (four-byte uint32_t for n, and then n | 2994 | | data items follow) | 2995 +------------+------------------------------------------------+ 2996 | 0x9b | array (eight-byte uint64_t for n, and then n | 2997 | | data items follow) | 2998 +------------+------------------------------------------------+ 2999 | 0x9f | array, data items follow, terminated by | 3000 | | "break" | 3001 +------------+------------------------------------------------+ 3002 | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow) | 3003 +------------+------------------------------------------------+ 3004 | 0xb8 | map (one-byte uint8_t for n, and then n pairs | 3005 | | of data items follow) | 3006 +------------+------------------------------------------------+ 3007 | 0xb9 | map (two-byte uint16_t for n, and then n pairs | 3008 | | of data items follow) | 3009 +------------+------------------------------------------------+ 3010 | 0xba | map (four-byte uint32_t for n, and then n | 3011 | | pairs of data items follow) | 3012 +------------+------------------------------------------------+ 3013 | 0xbb | map (eight-byte uint64_t for n, and then n | 3014 | | pairs of data items follow) | 3015 +------------+------------------------------------------------+ 3016 | 0xbf | map, pairs of data items follow, terminated by | 3017 | | "break" | 3018 +------------+------------------------------------------------+ 3019 | 0xc0 | Text-based date/time (data item follows; see | 3020 | | Section 3.4.1) | 3021 +------------+------------------------------------------------+ 3022 | 0xc1 | Epoch-based date/time (data item follows; see | 3023 | | Section 3.4.2) | 3024 +------------+------------------------------------------------+ 3025 | 0xc2 | Positive bignum (data item "byte 
string" | 3026 | | follows) | 3027 +------------+------------------------------------------------+ 3028 | 0xc3 | Negative bignum (data item "byte string" | 3029 | | follows) | 3030 +------------+------------------------------------------------+ 3031 | 0xc4 | Decimal Fraction (data item "array" follows; | 3032 | | see Section 3.4.4) | 3033 +------------+------------------------------------------------+ 3034 | 0xc5 | Bigfloat (data item "array" follows; see | 3035 | | Section 3.4.4) | 3036 +------------+------------------------------------------------+ 3037 | 0xc6..0xd4 | (tag) | 3038 +------------+------------------------------------------------+ 3039 | 0xd5..0xd7 | Expected Conversion (data item follows; see | 3040 | | Section 3.4.5.2) | 3041 +------------+------------------------------------------------+ 3042 | 0xd8..0xdb | (more tags; 1/2/4/8 bytes of tag number and | 3043 | | then a data item follow) | 3044 +------------+------------------------------------------------+ 3045 | 0xe0..0xf3 | (simple value) | 3046 +------------+------------------------------------------------+ 3047 | 0xf4 | False | 3048 +------------+------------------------------------------------+ 3049 | 0xf5 | True | 3050 +------------+------------------------------------------------+ 3051 | 0xf6 | Null | 3052 +------------+------------------------------------------------+ 3053 | 0xf7 | Undefined | 3054 +------------+------------------------------------------------+ 3055 | 0xf8 | (simple value, one byte follows) | 3056 +------------+------------------------------------------------+ 3057 | 0xf9 | Half-Precision Float (two-byte IEEE 754) | 3058 +------------+------------------------------------------------+ 3059 | 0xfa | Single-Precision Float (four-byte IEEE 754) | 3060 +------------+------------------------------------------------+ 3061 | 0xfb | Double-Precision Float (eight-byte IEEE 754) | 3062 +------------+------------------------------------------------+ 3063 | 0xff | "break" stop code | 
3064    +------------+------------------------------------------------+

3066                  Table 7: Jump Table for Initial Byte

3068 Appendix C. Pseudocode

3070    The well-formedness of a CBOR item can be checked by the pseudocode
3071    in Figure 1. The data is well-formed if and only if:

3073    *  the pseudocode does not "fail";

3075    *  after execution of the pseudocode, no bytes are left in the input
3076       (except in streaming applications)

3078    The pseudocode has the following prerequisites:

3080    *  take(n) reads n bytes from the input data and returns them as a
3081       byte string. If n bytes are no longer available, take(n) fails.

3083    *  uint() converts a byte string into an unsigned integer by
3084       interpreting the byte string in network byte order.

3086    *  Arithmetic works as in C.

3088    *  All variables are unsigned integers of sufficient range.

3090    Note that "well_formed" returns the major type for well-formed
3091    definite length items, but 99 for an indefinite length item (or -1
3092    for a "break" stop code, only if "breakable" is set). This is used
3093    in "well_formed_indefinite" to ascertain that indefinite length
3094    strings only contain definite length strings as chunks.
3096    well_formed(breakable = false) {
3097      // process initial bytes
3098      ib = uint(take(1));
3099      mt = ib >> 5;
3100      val = ai = ib & 0x1f;
3101      switch (ai) {
3102        case 24: val = uint(take(1)); break;
3103        case 25: val = uint(take(2)); break;
3104        case 26: val = uint(take(4)); break;
3105        case 27: val = uint(take(8)); break;
3106        case 28: case 29: case 30: fail();
3107        case 31:
3108          return well_formed_indefinite(mt, breakable);
3109      }
3110      // process content
3111      switch (mt) {
3112        // case 0, 1, 7 do not have content; just use val
3113        case 2: case 3: take(val); break; // bytes/UTF-8
3114        case 4: for (i = 0; i < val; i++) well_formed(); break;
3115        case 5: for (i = 0; i < val*2; i++) well_formed(); break;
3116        case 6: well_formed(); break;     // 1 embedded data item
3117        case 7: if (ai == 24 && val < 32) fail(); // bad simple
3118      }
3119      return mt;                    // definite-length data item
3120    }

3122    well_formed_indefinite(mt, breakable) {
3123      switch (mt) {
3124        case 2: case 3:
3125          while ((it = well_formed(true)) != -1)
3126            if (it != mt)           // need definite-length chunk
3127              fail();               //    of same type
3128          break;
3129        case 4: while (well_formed(true) != -1); break;
3130        case 5: while (well_formed(true) != -1) well_formed(); break;
3131        case 7:
3132          if (breakable)
3133            return -1;              // signal break out
3134          else fail();              // no enclosing indefinite
3135        default: fail();            // wrong mt
3136      }
3137      return 99;                    // indefinite-length data item
3138    }

3140              Figure 1: Pseudocode for Well-Formedness Check

3142    Note that the remaining complexity of a complete CBOR decoder is
3143    about presenting data that has been decoded to the application in an
3144    appropriate form.

3146    Major types 0 and 1 are designed in such a way that they can be
3147    encoded in C from a signed integer without actually doing an if-then-
3148    else for positive/negative (Figure 2). This uses the fact that
3149    (-1-n), the transformation for major type 1, is the same as ~n
3150    (bitwise complement) in C unsigned arithmetic; ~n can then be
3151    expressed as (-1)^n for the negative case, while 0^n leaves n
3152    unchanged for non-negative. The sign of a number can be converted to
3153    -1 for negative and 0 for non-negative (0 or positive) by arithmetic-
3154    shifting the number by one bit less than the bit length of the number
3155    (for example, by 63 for 64-bit numbers).

3157    void encode_sint(int64_t n) {
3158      uint64_t ui = n >> 63;    // extend sign to whole length
3159      unsigned mt = ui & 0x20;  // extract (shifted) major type
3160      ui ^= n;                  // complement negatives
3161      if (ui < 24)
3162        *p++ = mt + ui;
3163      else if (ui < 256) {
3164        *p++ = mt + 24;
3165        *p++ = ui;
3166      } else
3167           ...

3169            Figure 2: Pseudocode for Encoding a Signed Integer

3171    See Section 1.2 for some specific assumptions about the profile of
3172    the C language used in these pieces of code.

3174 Appendix D. Half-Precision

3176    As half-precision floating-point numbers were only added to IEEE 754
3177    in 2008 [IEEE754], today's programming platforms often still only
3178    have limited support for them. It is very easy to include at least
3179    decoding support for them even without such support. An example of a
3180    small decoder for half-precision floating-point numbers in the C
3181    language is shown in Figure 3. A similar program for Python is in
3182    Figure 4; this code assumes that the 2-byte value has already been
3183    decoded as an (unsigned short) integer in network byte order (as
3184    would be done by the pseudocode in Appendix C).

3186    #include <math.h>

3188    double decode_half(unsigned char *halfp) {
3189      unsigned half = (halfp[0] << 8) + halfp[1];
3190      unsigned exp = (half >> 10) & 0x1f;
3191      unsigned mant = half & 0x3ff;
3192      double val;
3193      if (exp == 0) val = ldexp(mant, -24);
3194      else if (exp != 31) val = ldexp(mant + 1024, exp - 25);
3195      else val = mant == 0 ? INFINITY : NAN;
3196      return half & 0x8000 ? -val : val;
3197    }

3199               Figure 3: C Code for a Half-Precision Decoder

3201    import struct
3202    from math import ldexp

3204    def decode_single(single):
3205        return struct.unpack("!f", struct.pack("!I", single))[0]

3207    def decode_half(half):
3208        valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16
3209        if ((half & 0x7c00) != 0x7c00):
3210            return ldexp(decode_single(valu), 112)
3211        return decode_single(valu | 0x7f800000)

3213            Figure 4: Python Code for a Half-Precision Decoder

3215 Appendix E. Comparison of Other Binary Formats to CBOR's Design
3216             Objectives

3218    The proposal for CBOR follows a history of binary formats that is as
3219    long as the history of computers themselves. Different formats have
3220    had different objectives. In most cases, the objectives of the
3221    format were never stated, although they can sometimes be implied by
3222    the context where the format was first used. Some formats were meant
3223    to be universally usable, although history has proven that no binary
3224    format meets the needs of all protocols and applications.

3226    CBOR differs from many of these formats due to it starting with a set
3227    of objectives and attempting to meet just those. This section
3228    compares a few of the dozens of formats with CBOR's objectives in
3229    order to help the reader decide if they want to use CBOR or a
3230    different format for a particular protocol or application.

3232    Note that the discussion here is not meant to be a criticism of any
3233    format: to the best of our knowledge, no format before CBOR was meant
3234    to cover CBOR's objectives in the priority we have assigned them. A
3235    brief recap of the objectives from Section 1.1 is:

3237    1.  unambiguous encoding of most common data formats from Internet
3238        standards

3240    2.  code compactness for encoder or decoder

3242    3.  no schema description needed

3244    4.  reasonably compact serialization

3246    5.  applicability to constrained and unconstrained applications

3248    6.  good JSON conversion

3250    7.  extensibility

3252    A discussion of CBOR and other formats with respect to a different
3253    set of design objectives is provided in Section 5 and Appendix C of
3254    [RFC8618].

3256 E.1. ASN.1 DER, BER, and PER

3258    [ASN.1] has many serializations. In the IETF, DER and BER are the
3259    most common. The serialized output is not particularly compact for
3260    many items, and the code needed to decode numeric items can be
3261    complex on a constrained device.

3263    Few (if any) IETF protocols have adopted one of the several variants
3264    of Packed Encoding Rules (PER). There could be many reasons for
3265    this, but one that is commonly stated is that PER makes use of the
3266    schema even for parsing the surface structure of the data item,
3267    requiring significant tool support. There are different versions of
3268    the ASN.1 schema language in use, which has also hampered adoption.

3270 E.2. MessagePack

3272    [MessagePack] is a concise, widely implemented counted binary
3273    serialization format, similar in many properties to CBOR, although
3274    somewhat less regular. While the data model can be used to represent
3275    JSON data, MessagePack has also been used in many remote procedure
3276    call (RPC) applications and for long-term storage of data.

3278    MessagePack has been essentially stable since it was first published
3279    around 2011; it has not yet had a transition. The evolution of
3280    MessagePack is impeded by an imperative to maintain complete
3281    backwards compatibility with existing stored data, while only few
3282    bytecodes are still available for extension. Repeated requests over
3283    the years from the MessagePack user community to separate out binary
3284    and text strings in the encoding recently have led to an extension
3285    proposal that would leave MessagePack's "raw" data ambiguous between
3286    its usages for binary and text data. The extension mechanism for
3287    MessagePack remains unclear.

3289 E.3. BSON

3291    [BSON] is a data format that was developed for the storage of JSON-
3292    like maps (JSON objects) in the MongoDB database. Its major
3293    distinguishing feature is the capability for in-place update, which
3294    prevents a compact representation. BSON uses a counted
3295    representation except for map keys, which are null-byte terminated.
3296    While BSON can be used for the representation of JSON-like objects on
3297    the wire, its specification is dominated by the requirements of the
3298    database application and has become somewhat baroque. The status of
3299    how BSON extensions will be implemented remains unclear.

3301 E.4. MSDTP: RFC 713

3303    Message Services Data Transmission Protocol (MSDTP) is a very early
3304    example of a compact message format; it is described in [RFC0713],
3305    written in 1976. It is included here for its historical value, not
3306    because it was ever widely used.

3308 E.5. Conciseness on the Wire

3310    While CBOR's design objective of code compactness for encoders and
3311    decoders is a higher priority than its objective of conciseness on
3312    the wire, many people focus on the wire size. Table 8 shows some
3313    encoding examples for the simple nested array [1, [2, 3]]; where some
3314    form of indefinite-length encoding is supported by the encoding,
3315    [_ 1, [2, 3]] (indefinite length on the outer array) is also shown.
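
   As a rough illustration of where the two CBOR entries in this
   comparison come from, the following Python sketch assembles both
   encodings of the nested array by hand. The helper names (head,
   uint, array, indefinite_array) are hypothetical and exist only for
   this example; the sketch assumes every argument is below 24, so
   that it fits into the initial byte.

```python
BREAK = b"\xff"  # "break" stop code: major type 7, additional information 31


def head(major_type, argument):
    """Initial byte for arguments 0..23 only (an assumption of this sketch)."""
    if not 0 <= argument < 24:
        raise ValueError("this sketch only handles arguments 0..23")
    return bytes([(major_type << 5) | argument])


def uint(n):
    """Unsigned integer (major type 0)."""
    return head(0, n)


def array(*items):
    """Definite-length array: the item count is in the head."""
    return head(4, len(items)) + b"".join(items)


def indefinite_array(*items):
    """Indefinite-length array: 0x9f, the items, then the "break" stop code."""
    return bytes([(4 << 5) | 31]) + b"".join(items) + BREAK


inner = array(uint(2), uint(3))
definite = array(uint(1), inner)               # [1, [2, 3]]
indefinite = indefinite_array(uint(1), inner)  # [_ 1, [2, 3]]

print(definite.hex(" "))    # 82 01 82 02 03
print(indefinite.hex(" "))  # 9f 01 82 02 03 ff
```

   The indefinite-length form costs one extra byte here (the "break"
   stop code), but the encoder does not need to know the number of
   items before it starts writing the array.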
3317 +=============+============================+================+
3318 | Format      | [1, [2, 3]]                | [_ 1, [2, 3]]  |
3319 +=============+============================+================+
3320 | RFC 713     | c2 05 81 c2 02 82 83       |                |
3321 +-------------+----------------------------+----------------+
3322 | ASN.1 BER   | 30 0b 02 01 01 30 06 02 01 | 30 80 02 01 01 |
3323 |             | 02 02 01 03                | 30 06 02 01 02 |
3324 |             |                            | 02 01 03 00 00 |
3325 +-------------+----------------------------+----------------+
3326 | MessagePack | 92 01 92 02 03             |                |
3327 +-------------+----------------------------+----------------+
3328 | BSON        | 22 00 00 00 10 30 00 01 00 |                |
3329 |             | 00 00 04 31 00 13 00 00 00 |                |
3330 |             | 10 30 00 02 00 00 00 10 31 |                |
3331 |             | 00 03 00 00 00 00 00       |                |
3332 +-------------+----------------------------+----------------+
3333 | CBOR        | 82 01 82 02 03             | 9f 01 82 02 03 |
3334 |             |                            | ff             |
3335 +-------------+----------------------------+----------------+

3337 Table 8: Examples for Different Levels of Conciseness

3339 Appendix F. Well-formedness errors and examples

3341 There are three basic kinds of well-formedness errors that can occur
3342 in decoding a CBOR data item:

3344 * Too much data: There are input bytes left that were not consumed.
3345   This is only an error if the application assumed that the input
3346   bytes would span exactly one data item. Where the application
3347   uses the self-delimiting nature of CBOR encoding to permit
3348   additional data after the data item, as is for example done in
3349   CBOR sequences [RFC8742], the CBOR decoder can simply indicate
3350   what part of the input has not been consumed.

3352 * Too little data: The input data available would need additional
3353   bytes added at their end for a complete CBOR data item. This may
3354   indicate the input is truncated; it is also a common error when
3355   trying to decode random data as CBOR.
For some applications,
3356   however, this may not actually be an error, as the application may
3357   not be certain it has all the data yet and can obtain or wait for
3358   additional input bytes. Some of these applications may have an
3359   upper limit for how much additional data can show up; here the
3360   decoder may be able to indicate that the encoded CBOR data item
3361   cannot be completed within this limit.

3363 * Syntax error: The input data are not consistent with the
3364   requirements of the CBOR encoding, and this cannot be remedied by
3365   adding (or removing) data at the end.

3367 In Appendix C, errors of the first kind are addressed in the first
3368 paragraph/bullet list (requiring "no bytes are left"), and errors of
3369 the second kind are addressed in the second paragraph/bullet list
3370 (failing "if n bytes are no longer available"). Errors of the third
3371 kind are identified in the pseudocode by specific instances of
3372 calling fail(), in order:

3374 * a reserved value is used for additional information (28, 29, 30)

3376 * major type 7, additional information 24, value < 32 (incorrect two-byte encoding of a simple value)

3378 * incorrect substructure of indefinite length byte/text string (may
3379   only contain definite length strings of the same major type)

3381 * "break" stop code (mt=7, ai=31) occurs in a value position of a
3382   map, or anywhere other than a position directly in an indefinite
3383   length item where another enclosed data item could also occur

3385 * additional information 31 used with major type 0, 1, or 6

3387 F.1. Examples for CBOR data items that are not well-formed

3389 This subsection shows a few examples for CBOR data items that are not
3390 well-formed. Each example is a sequence of bytes, each shown in
3391 hexadecimal; multiple examples in a list are separated by commas.

3393 Examples for well-formedness error kind 1 (too much data) can easily
3394 be formed by adding data to a well-formed encoded CBOR data item.
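The three error kinds, and the five syntax error subkinds listed above, can be told apart mechanically. The following Python sketch is modeled loosely on the well-formedness check of Appendix C. It is an illustration under stated assumptions rather than a complete decoder: the names "TooLittleData", "NotWellFormed", "Reader", and "classify" are hypothetical and introduced only for this example, and the recursion has no depth limit, which a production decoder would need.

```python
class TooLittleData(Exception):
    """Kind 2: the input ends before the data item is complete."""


class NotWellFormed(Exception):
    """Kind 3: a syntax error that more input cannot repair."""


BREAK = -1        # returned when a "break" stop code ends an indefinite-length item
INDEFINITE = 99   # returned for a complete indefinite-length item


class Reader:
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def take(self, n: int) -> bytes:
        # Check the remaining length first, so that a huge announced length
        # (e.g. 5b ff ff ff ff ff ff ff ff) fails without allocating memory.
        if len(self.data) - self.pos < n:
            raise TooLittleData
        self.pos += n
        return self.data[self.pos - n:self.pos]


def well_formed(r: Reader, breakable: bool = False) -> int:
    ib = r.take(1)[0]
    mt, ai = ib >> 5, ib & 0x1F
    val = ai
    if 24 <= ai <= 27:
        val = int.from_bytes(r.take(1 << (ai - 24)), "big")
    elif ai in (28, 29, 30):
        raise NotWellFormed("subkind 1: reserved additional information")
    elif ai == 31:
        return well_formed_indefinite(r, mt, breakable)
    if mt in (2, 3):
        r.take(val)                       # byte string or text string content
    elif mt == 4:
        for _ in range(val):
            well_formed(r)
    elif mt == 5:
        for _ in range(2 * val):
            well_formed(r)
    elif mt == 6:
        well_formed(r)                    # tag content
    elif mt == 7 and ai == 24 and val < 32:
        raise NotWellFormed("subkind 2: reserved two-byte simple value")
    return mt


def well_formed_indefinite(r: Reader, mt: int, breakable: bool) -> int:
    if mt in (2, 3):
        while True:
            chunk = well_formed(r, True)
            if chunk == BREAK:
                break
            if chunk != mt:               # must be a definite chunk of the same type
                raise NotWellFormed("subkind 3: bad string chunk")
    elif mt == 4:
        while well_formed(r, True) != BREAK:
            pass
    elif mt == 5:
        while well_formed(r, True) != BREAK:
            well_formed(r)                # value position: a "break" here is an error
    elif mt == 7:
        if not breakable:
            raise NotWellFormed("subkind 4: misplaced break")
        return BREAK
    else:
        raise NotWellFormed("subkind 5: ai 31 with major type 0, 1, or 6")
    return INDEFINITE


def classify(data: bytes) -> str:
    r = Reader(data)
    try:
        well_formed(r)
    except TooLittleData:
        return "too little data"
    except NotWellFormed:
        return "syntax error"
    return "well-formed" if r.pos == len(data) else "too much data"
```

For instance, classify(bytes.fromhex("82 01 82 02 03 00")) reports "too much data", classify(bytes.fromhex("82 00")) reports "too little data", and classify(bytes.fromhex("f8 18")) reports "syntax error". An application that accepts CBOR sequences would treat the first of these outcomes as success and continue decoding at the reader's position.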
3396 Similarly, examples for well-formedness error kind 2 (too little
3397 data) can be formed by truncating a well-formed encoded CBOR data
3398 item. In test suites, it may be beneficial to specifically test with
3399 incomplete data items that would require a large amount of additional
3400 data to be completed (for instance by starting the encoding of a
3401 string of a very large size).

3403 A premature end of the input can occur in a head or within the
3404 enclosed data, which may be bare strings or enclosed data items that
3405 are either counted or should have been ended by a "break" stop code.

3407 * End of input in a head: 18, 19, 1a, 1b, 19 01, 1a 01 02, 1b 01 02
3408   03 04 05 06 07, 38, 58, 78, 98, 9a 01 ff 00, b8, d8, f8, f9 00, fa
3409   00 00, fb 00 00 00

3411 * Definite length strings with short data: 41, 61, 5a ff ff ff ff
3412   00, 5b ff ff ff ff ff ff ff ff 01 02 03, 7a ff ff ff ff 00, 7b 7f
3413   ff ff ff ff ff ff ff 01 02 03

3415 * Definite length maps and arrays not closed with enough items: 81,
3416   81 81 81 81 81 81 81 81 81, 82 00, a1, a2 01 02, a1 00, a2 00 00
3417   00

3419 * Tag number not followed by tag content: c0

3421 * Indefinite length strings not closed by a "break" stop code: 5f 41
3422   00, 7f 61 00

3424 * Indefinite length maps and arrays not closed by a "break" stop
3425   code: 9f, 9f 01 02, bf, bf 01 02 01 02, 81 9f, 9f 80 00, 9f 9f 9f
3426   9f 9f ff ff ff ff, 9f 81 9f 81 9f 9f ff ff ff

3428 A few examples for the five subkinds of well-formedness error kind 3
3429 (syntax error) are shown below.
3431 Subkind 1:

3433 * Reserved additional information values: 1c, 1d, 1e, 3c, 3d, 3e,
3434   5c, 5d, 5e, 7c, 7d, 7e, 9c, 9d, 9e, bc, bd, be, dc, dd, de, fc,
3435   fd, fe

3437 Subkind 2:

3439 * Reserved two-byte encodings of simple values: f8 00, f8 01, f8 18,
3440   f8 1f

3442 Subkind 3:

3444 * Indefinite length string chunks not of the correct type: 5f 00 ff,
3445   5f 21 ff, 5f 61 00 ff, 5f 80 ff, 5f a0 ff, 5f c0 00 ff, 5f e0 ff,
3446   7f 41 00 ff

3448 * Indefinite length string chunks not definite length: 5f 5f 41 00
3449   ff ff, 7f 7f 61 00 ff ff

3451 Subkind 4:

3453 * Break occurring on its own outside of an indefinite length item:
3454   ff

3456 * Break occurring in a definite length array or map or a tag: 81 ff,
3457   82 00 ff, a1 ff, a1 ff 00, a1 00 ff, a2 00 00 ff, 9f 81 ff, 9f 82
3458   9f 81 9f 9f ff ff ff ff

3460 * Break in indefinite length map would lead to odd number of items
3461   (break in a value position): bf 00 ff, bf 00 00 00 ff

3463 Subkind 5:

3465 * Major type 0, 1, or 6 with additional information 31: 1f, 3f, df

3467 Appendix G. Changes from RFC 7049

3469 As discussed in the introduction, this document is a revised edition
3470 of RFC 7049, with editorial improvements, added detail, and fixed
3471 errata. This document formally obsoletes RFC 7049, while keeping
3472 full compatibility of the interchange format from RFC 7049. This
3473 document does not create a new version of the format.

3475 G.1. Errata processing, clerical changes

3477 The two verified errata on RFC 7049, EID 3764 and EID 3770, concerned
3478 two encoding examples in the text that have been corrected
3479 (Section 3.4.3: "29" -> "49", Section 5.5: "0b000_11101" ->
3480 "0b000_11001"). Also, RFC 7049 contained an example using the
3481 numeric value 24 for a simple value (EID 5917), which is not
3482 well-formed; this example has been removed. Errata report 5763
3483 pointed to an accident in the wording of the definition of tags; this
3484 was resolved during a re-write of Section 3.4.
Errata report 5434
3485 pointed out that the UBJSON example in Appendix E no longer complied
3486 with the version of UBJSON current at the time of submitting the
3487 report. It turned out that the UBJSON specification had completely
3488 changed since 2013; this example was therefore also removed. Further
3489 errata reports (4409, 4963, 4964) complained that the map key sorting
3490 rules for canonical encoding were onerous; these led to a
3491 reconsideration of the canonical encoding suggestions and replacement
3492 by the deterministic encoding suggestions (described below). An
3493 editorial suggestion in errata report 4294 was also implemented
3494 (improved symmetry by adding "Second value" to a comment to the last
3495 example in Section 3.2.2).

3497 Other more clerical changes include:

3499 * use of new RFCXML functionality [RFC7991];

3501 * explanation of some more of the notation used;

3503 * updated references, e.g. for RFC4627 to [RFC8259] in many places,
3504   for CNN-TERMS to [RFC7228]; added missing reference to [IEEE754]
3505   (importing required definitions) and updated to [ECMA262]; added a
3506   reference to [RFC8618] that further illustrates the discussion in
3507   Appendix E;

3509 * the discussion of diagnostic notation mentions the "Extended
3510   Diagnostic Notation" (EDN) defined in [RFC8610] as well as the gap
3511   that diagnostic notation has in representing NaN payloads; an
3512   explanation was added on how to represent indefinite length
3513   strings with no chunks;

3515 * the addition of this appendix.

3517 G.2. Changes in IANA considerations

3519 The IANA considerations were generally updated (clerical changes,
3520 e.g., now pointing to the CBOR working group as the author of the
3521 specification). References to the respective IANA registries have
3522 been added to the informative references.
3524 Tags in the space from 256 to 32767 (lower half of "1+2") are no
3525 longer assigned by First Come First Served; this range is now
3526 Specification Required.

3528 G.3. Changes in suggestions and other informational components

3530 In revising the document, beyond processing errata reports, the WG
3531 could draw on nearly seven years of experience with the use of CBOR
3532 in a diverse set of applications. This led to a number of editorial
3533 changes, including adding tables for illustration, but also to
3534 emphasizing some aspects and de-emphasizing others.

3536 A significant addition in this revision is Section 2, which discusses
3537 the CBOR data model and its small variations involved in the
3538 processing of CBOR. Introducing terms for those (basic generic,
3539 extended generic, specific) enables more concise language in other
3540 places of the document, but also helps in clarifying expectations on
3541 implementations and on the extensibility features of the format.

3543 RFC 7049, as a format derived from the JSON ecosystem, was influenced
3544 by the JSON number system that was in turn inherited from JavaScript
3545 at the time. JSON does not provide distinct integers and
3546 floating-point values (and the latter are decimal in the format). CBOR
3547 provides binary representations of numbers, which do differ between
3548 integers and floating-point values. Experience from implementation
3549 and use now suggested that the separation between these two number
3550 domains should be more clearly drawn in the document; language that
3551 suggested an integer could seamlessly stand in for a floating-point
3552 value was removed. Also, a suggestion (based on I-JSON [RFC7493])
3553 was added for handling these types when converting JSON to CBOR, and
3554 the use of a specific rounding mechanism has been recommended.

3556 For a single value in the data model, CBOR often provides multiple
3557 encoding options.
The revision adds a new Section 4, which
3558 first introduces the term "preferred serialization" (Section 4.1) and
3559 defines it for various kinds of data items. On the basis of this
3560 terminology, the section goes on to discuss how a CBOR-based protocol
3561 can define "deterministic encoding" (Section 4.2), which now avoids
3562 the RFC 7049 terms "canonical" and "canonicalization". The
3563 suggestion of "Core Deterministic Encoding Requirements"
3564 (Section 4.2.1) enables generic support for such protocol-defined
3565 encoding requirements. The present revision further eases the
3566 implementation of deterministic encoding by simplifying the map
3567 ordering suggested in RFC 7049 to simple lexicographic ordering of
3568 encoded keys. A description of the older suggestion is kept as an
3569 alternative, now termed "length-first map key ordering"
3570 (Section 4.2.3).

3572 The terminology for well-formed and valid data was sharpened and more
3573 stringently used, avoiding less well-defined alternative terms such
3574 as "syntax error", "decoding error" and "strict mode" outside
3575 examples. Also, a third level of requirements beyond CBOR-level
3576 validity that an application has on its input data is now explicitly
3577 called out. Well-formed (processable at all), valid (checked by a
3578 validity-checking generic decoder), and expected input (as checked by
3579 the application) are treated as a hierarchy of layers of
3580 acceptability.

3582 The handling of non-well-formed simple values was clarified in text
3583 and pseudocode. Appendix F was added to discuss well-formedness
3584 errors and provide examples for them. The pseudocode was updated to
3585 be more portable and some portability considerations were added.

3587 The discussion of validity has been sharpened in two areas. Map
3588 validity (handling of duplicate keys) was clarified and the domain of
3589 applicability of certain implementation choices explained.
Also,
3590 while streamlining the terminology for tags, tag numbers, and tag
3591 content, discussion was added on tag validity, and the restrictions
3592 on tag content were clarified, in general and specifically for tag 1.

3594 An implementation note (and note for future tag definitions) was
3595 added to Section 3.4 about defining tags with semantics that depend
3596 on serialization order.

3598 Tag 35 is no longer defined in this updated document; the
3599 registration based on the definition in RFC 7049 remains in place.

3601 Terminology was introduced in Section 3 for "argument" and "head",
3602 simplifying further discussion.

3604 The security considerations were mostly rewritten and significantly
3605 expanded; in multiple other places, the document is now more explicit
3606 that a decoder cannot simply condone well-formedness errors.

3608 Acknowledgements

3610 CBOR was inspired by MessagePack. MessagePack was developed and
3611 promoted by Sadayuki Furuhashi ("frsyuki"). This reference to
3612 MessagePack is solely for attribution; CBOR is not intended as a
3613 version of or replacement for MessagePack, as it has different design
3614 goals and requirements.

3616 The need for functionality beyond the original MessagePack
3617 Specification became obvious to many people at about the same time
3618 around the year 2012. BinaryPack is a minor derivation of
3619 MessagePack that was developed by Eric Zhang for the binaryjs
3620 project. A similar, but different, extension was made by Tim Caswell
3621 for his msgpack-js and msgpack-js-browser projects. Many people have
3622 contributed to the discussion about extending MessagePack to separate
3623 text string representation from byte string representation.

3625 The encoding of the additional information in CBOR was inspired by
3626 the encoding of length information designed by Klaus Hartke for CoAP.
3628 This document also incorporates suggestions made by many people,
3629 notably Dan Frost, James Manger, Jeffrey Yasskin, Joe Hildebrand,
3630 Keith Moore, Laurence Lundblade, Matthew Lepinski, Michael
3631 Richardson, Nico Williams, Peter Occil, Phillip Hallam-Baker, Ray
3632 Polk, Stuart Cheshire, Tim Bray, Tony Finch, Tony Hansen, and Yaron
3633 Sheffer. Benjamin Kaduk provided an extensive review during IESG
3634 processing. Éric Vyncke, Erik Kline, Robert Wilton, and Roman Danyliw
3635 provided further IESG comments, which included an IoT directorate
3636 review by Eve Schooler.

3638 Authors' Addresses

3640 Carsten Bormann
3641 Universitaet Bremen TZI
3642 Postfach 330440
3643 D-28359 Bremen
3644 Germany

3646 Phone: +49-421-218-63921
3647 Email: cabo@tzi.org

3648 Paul Hoffman
3649 ICANN

3651 Email: paul.hoffman@icann.org