Network Working Group                                         C. Bormann
Internet-Draft                                   Universitaet Bremen TZI
Obsoletes: 7049 (if approved)                                 P. Hoffman
Intended status: Standards Track                                   ICANN
Expires: 3 April 2021                                  30 September 2020

              Concise Binary Object Representation (CBOR)
                       draft-ietf-cbor-7049bis-16

Abstract

   The Concise Binary Object Representation (CBOR) is a data format
   whose design goals include the possibility of extremely small code
   size, fairly small message size, and extensibility without the need
   for version negotiation. These design goals make it different from
   earlier binary serializations such as ASN.1 and MessagePack.

   This document is a revised edition of RFC 7049, with editorial
   improvements, added detail, and fixed errata. This revision formally
   obsoletes RFC 7049, while keeping full compatibility of the
   interchange format from RFC 7049. It does not create a new version
   of the format.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 3 April 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Objectives
     1.2.  Terminology
   2.  CBOR Data Models
     2.1.  Extended Generic Data Models
     2.2.  Specific Data Models
   3.  Specification of the CBOR Encoding
     3.1.  Major Types
     3.2.  Indefinite Lengths for Some Major Types
       3.2.1.  The "break" Stop Code
       3.2.2.  Indefinite-Length Arrays and Maps
       3.2.3.  Indefinite-Length Byte Strings and Text Strings
       3.2.4.  Summary of indefinite-length use of major types
     3.3.  Floating-Point Numbers and Values with No Content
     3.4.  Tagging of Items
       3.4.1.  Standard Date/Time String
       3.4.2.  Epoch-based Date/Time
       3.4.3.  Bignums
       3.4.4.  Decimal Fractions and Bigfloats
       3.4.5.  Content Hints
         3.4.5.1.  Encoded CBOR Data Item
         3.4.5.2.  Expected Later Encoding for CBOR-to-JSON Converters
         3.4.5.3.  Encoded Text
       3.4.6.  Self-Described CBOR
   4.  Serialization Considerations
     4.1.  Preferred Serialization
     4.2.  Deterministically Encoded CBOR
       4.2.1.  Core Deterministic Encoding Requirements
       4.2.2.  Additional Deterministic Encoding Considerations
       4.2.3.  Length-first Map Key Ordering
   5.  Creating CBOR-Based Protocols
     5.1.  CBOR in Streaming Applications
     5.2.  Generic Encoders and Decoders
     5.3.  Validity of Items
       5.3.1.  Basic validity
       5.3.2.  Tag validity
     5.4.  Validity and Evolution
     5.5.  Numbers
     5.6.  Specifying Keys for Maps
       5.6.1.  Equivalence of Keys
     5.7.  Undefined Values
   6.  Converting Data between CBOR and JSON
     6.1.  Converting from CBOR to JSON
     6.2.  Converting from JSON to CBOR
   7.  Future Evolution of CBOR
     7.1.  Extension Points
     7.2.  Curating the Additional Information Space
   8.  Diagnostic Notation
     8.1.  Encoding Indicators
   9.  IANA Considerations
     9.1.  Simple Values Registry
     9.2.  Tags Registry
     9.3.  Media Type ("MIME Type")
     9.4.  CoAP Content-Format
     9.5.  The +cbor Structured Syntax Suffix Registration
   10. Security Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Examples of Encoded CBOR Data Items
   Appendix B.  Jump Table for Initial Byte
   Appendix C.  Pseudocode
   Appendix D.  Half-Precision
   Appendix E.  Comparison of Other Binary Formats to CBOR's Design
           Objectives
     E.1.  ASN.1 DER, BER, and PER
     E.2.  MessagePack
     E.3.  BSON
     E.4.  MSDTP: RFC 713
     E.5.  Conciseness on the Wire
   Appendix F.  Well-formedness errors and examples
     F.1.  Examples for CBOR data items that are not well-formed
   Appendix G.  Changes from RFC 7049
     G.1.  Errata processing, clerical changes
     G.2.  Changes in IANA considerations
     G.3.  Changes in suggestions and other informational components
   Acknowledgements
   Authors' Addresses

1. Introduction

   There are hundreds of standardized formats for binary representation
   of structured data (also known as binary serialization formats).
   Of those, some are for specific domains of information, while others
   are generalized for arbitrary data. In the IETF, probably the
   best-known formats in the latter category are ASN.1's BER and DER
   [ASN.1].

   The format defined here follows some specific design goals that are
   not well met by current formats. The underlying data model is an
   extended version of the JSON data model [RFC8259]. It is important
   to note that this is not a proposal that the grammar in RFC 8259 be
   extended in general, since doing so would cause a significant
   backwards incompatibility with already deployed JSON documents.
   Instead, this document simply defines its own data model that starts
   from JSON.

   Appendix E lists some existing binary formats and discusses how well
   they do or do not fit the design objectives of the Concise Binary
   Object Representation (CBOR).

   This document is a revised edition of [RFC7049], with editorial
   improvements, added detail, and fixed errata. This revision formally
   obsoletes RFC 7049, while keeping full compatibility of the
   interchange format from RFC 7049. It does not create a new version
   of the format.

1.1. Objectives

   The objectives of CBOR, roughly in decreasing order of importance,
   are:

   1. The representation must be able to unambiguously encode most
      common data formats used in Internet standards.

      * It must represent a reasonable set of basic data types and
        structures using binary encoding. "Reasonable" here is
        largely influenced by the capabilities of JSON, with the major
        addition of binary byte strings. The structures supported are
        limited to arrays and trees; loops and lattice-style graphs
        are not supported.

      * There is no requirement that all data formats be uniquely
        encoded; that is, it is acceptable that the number "7" might
        be encoded in multiple different ways.

   2. The code for an encoder or decoder must be able to be compact in
      order to support systems with very limited memory, processor
      power, and instruction sets.

      * An encoder and a decoder need to be implementable in a very
        small amount of code (for example, in class 1 constrained
        nodes as defined in [RFC7228]).

      * The format should use contemporary machine representations of
        data (for example, not requiring binary-to-decimal
        conversion).

   3. Data must be able to be decoded without a schema description.

      * Similar to JSON, encoded data should be self-describing so
        that a generic decoder can be written.

   4. The serialization must be reasonably compact, but data
      compactness is secondary to code compactness for the encoder and
      decoder.

      * "Reasonable" here is bounded by JSON as an upper bound in
        size, and by the implementation complexity limiting how much
        effort can go into achieving that compactness. Using either
        general compression schemes or extensive bit-fiddling violates
        the complexity goals.

   5. The format must be applicable to both constrained nodes and high-
      volume applications.

      * This means it must be reasonably frugal in CPU usage for both
        encoding and decoding. This is relevant both for constrained
        nodes and for potential usage in applications with a very high
        volume of data.

   6. The format must support all JSON data types for conversion to and
      from JSON.

      * It must support a reasonable level of conversion as long as
        the data represented is within the capabilities of JSON. It
        must be possible to define a unidirectional mapping towards
        JSON for all types of data.

   7. The format must be extensible, and the extended data must be
      decodable by earlier decoders.

      * The format is designed for decades of use.

      * The format must support a form of extensibility that allows
        fallback so that a decoder that does not understand an
        extension can still decode the message.

      * The format must be able to be extended in the future by later
        IETF standards.

1.2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The term "byte" is used in its now-customary sense as a synonym for
   "octet". All multi-byte values are encoded in network byte order
   (that is, most significant byte first, also known as "big-endian").

   This specification makes use of the following terminology:

   Data item: A single piece of CBOR data. The structure of a data
      item may contain zero, one, or more nested data items. The term
      is used both for the data item in representation format and for
      the abstract idea that can be derived from that by a decoder; the
      former can be addressed specifically by using "encoded data item".

   Decoder: A process that decodes a well-formed encoded CBOR data item
      and makes it available to an application. Formally speaking, a
      decoder contains a parser to break up the input using the syntax
      rules of CBOR, as well as a semantic processor to prepare the data
      in a form suitable to the application.

   Encoder: A process that generates the (well-formed) representation
      format of a CBOR data item from application information.

   Data Stream: A sequence of zero or more data items, not further
      assembled into a larger containing data item (see [RFC8742] for
      one application). The independent data items that make up a data
      stream are sometimes also referred to as "top-level data items".

   Well-formed: A data item that follows the syntactic structure of
      CBOR. A well-formed data item uses the initial bytes and the byte
      strings and/or data items that are implied by their values as
      defined in CBOR and does not include following extraneous data.
      CBOR decoders by definition only return contents from well-formed
      data items.

   Valid: A data item that is well-formed and also follows the semantic
      restrictions that apply to CBOR data items (Section 5.3).

   Expected: Besides its normal English meaning, the term "expected" is
      used to describe requirements beyond CBOR validity that an
      application has on its input data. Well-formed (processable at
      all), valid (checked by a validity-checking generic decoder), and
      expected (checked by the application) form a hierarchy of layers
      of acceptability.

   Stream decoder: A process that decodes a data stream and makes each
      of the data items in the sequence available to an application as
      they are received.

   Terms and concepts for floating-point values such as Infinity, NaN
   (not a number), negative zero, and subnormal are defined in
   [IEEE754].

   Where bit arithmetic or data types are explained, this document uses
   the notation familiar from the programming language C [C], except
   that "**" denotes exponentiation and ".." denotes a range that
   includes both ends given. Examples and pseudocode assume that signed
   integers use two's complement representation and that right shifts of
   signed integers perform sign extension; these assumptions are also
   specified in Sections 6.8.2 and 7.6.7 of the 2020 version of C++,
   successor of [Cplusplus17].

   Similar to the "0x" notation for hexadecimal numbers, numbers in
   binary notation are prefixed with "0b". Underscores can be added to
   a number solely for readability, so 0b00100001 (0x21) might be
   written 0b001_00001 to emphasize the desired interpretation of the
   bits in the byte; in this case, it is split into three bits and five
   bits. Encoded CBOR data items are sometimes given in the "0x" or
   "0b" notation; these values are first interpreted as numbers as in C
   and are then interpreted as byte strings in network byte order,
   including any leading zero bytes expressed in the notation.

   Words may be _italicized_ for emphasis; in the plain text form of
   this specification this is indicated by surrounding words with
   underscore characters. Verbatim text (e.g., names from a programming
   language) may be set in "monospace" type; in plain text this is
   approximated somewhat ambiguously by surrounding the text in double
   quotes (which also retain their usual meaning).

2. CBOR Data Models

   CBOR is explicit about its generic data model, which defines the set
   of all data items that can be represented in CBOR. Its basic generic
   data model is extensible by the registration of "simple values" and
   tags. Applications can then subset the resulting extended generic
   data model to build their specific data models.

   Within environments that can represent the data items in the generic
   data model, generic CBOR encoders and decoders can be implemented
   (which usually involves defining additional implementation data types
   for those data items that do not already have a natural
   representation in the environment). The ability to provide generic
   encoders and decoders is an explicit design goal of CBOR; however,
   many applications will provide their own application-specific
   encoders and/or decoders.

   In the basic (un-extended) generic data model defined in Section 3, a
   data item is one of:

   * an integer in the range -2**64..2**64-1 inclusive

   * a simple value, identified by a number between 0 and 255, but
     distinct from that number itself

   * a floating-point value, distinct from an integer, out of the set
     representable by IEEE 754 binary64 (including non-finites)
     [IEEE754]

   * a sequence of zero or more bytes ("byte string")

   * a sequence of zero or more Unicode code points ("text string")

   * a sequence of zero or more data items ("array")

   * a mapping (mathematical function) from zero or more data items
     ("keys") each to a data item ("values"), ("map")

   * a tagged data item ("tag"), comprising a tag number (an integer in
     the range 0..2**64-1) and the tag content (a data item)

   Note that integer and floating-point values are distinct in this
   model, even if they have the same numeric value.

   Also note that serialization variants are not visible at the generic
   data model level, including the number of bytes of the encoded
   floating-point value or the choice of one of the ways in which an
   integer, the length of a text or byte string, the number of elements
   in an array or pairs in a map, or a tag number, (collectively "the
   argument", see Section 3) can be encoded.

2.1. Extended Generic Data Models

   This basic generic data model comes pre-extended by the registration
   of a number of simple values and tag numbers right in this document,
   such as:

   * "false", "true", "null", and "undefined" (simple values identified
     by 20..23)

   * integer and floating-point values with a larger range and
     precision than the above (tag numbers 2 to 5)

   * application data types such as a point in time or an RFC 3339
     date/time string (tag numbers 1, 0)

   Further elements of the extended generic data model can be (and have
   been) defined via the IANA registries created for CBOR. Even if such
   an extension is unknown to a generic encoder or decoder, data items
   using that extension can be passed to or from the application by
   representing them at the interface to the application within the
   basic generic data model, i.e., as generic simple values or generic
   tags.

   In other words, the basic generic data model is stable as defined in
   this document, while the extended generic data model expands by the
   registration of new simple values or tag numbers, but never shrinks.

   While there is a strong expectation that generic encoders and
   decoders can represent "false", "true", and "null" ("undefined" is
   intentionally omitted) in the form appropriate for their programming
   environment, implementation of the data model extensions created by
   tags is truly optional and a matter of implementation quality.

2.2. Specific Data Models

   The specific data model for a CBOR-based protocol usually subsets the
   extended generic data model and assigns application semantics to the
   data items within this subset and its components.
   When documenting such specific data models, where it is desired to
   specify the types of data items, it is preferred to identify the
   types by the names they have in the generic data model ("negative
   integer", "array") instead of by referring to aspects of their CBOR
   representation ("major type 1", "major type 4").

   Specific data models can also specify what values (including values
   of different types) are equivalent for the purposes of map keys and
   encoder freedom. For example, in the generic data model, a valid map
   MAY have both "0" and "0.0" as keys, and an encoder MUST NOT encode
   "0.0" as an integer (major type 0, Section 3.1). However, if a
   specific data model declares that floating-point and integer
   representations of integral values are equivalent, using both map
   keys "0" and "0.0" in a single map would be considered duplicates,
   even while encoded as different major types, and so invalid; and an
   encoder could encode integral-valued floats as integers or vice
   versa, perhaps to save encoded bytes.

3. Specification of the CBOR Encoding

   A CBOR data item (Section 2) is encoded to or decoded from a byte
   string carrying a well-formed encoded data item as described in this
   section. The encoding is summarized in Table 7 in Appendix B,
   indexed by the initial byte. An encoder MUST produce only well-
   formed encoded data items. A decoder MUST NOT return a decoded data
   item when it encounters input that is not a well-formed encoded CBOR
   data item (this does not detract from the usefulness of diagnostic
   and recovery tools that might make available some information from a
   damaged encoded CBOR data item).

   The initial byte of each encoded data item contains both information
   about the major type (the high-order 3 bits, described in
   Section 3.1) and additional information (the low-order 5 bits). With
   a few exceptions, the additional information's value describes how to
   load an unsigned integer "argument":

   Less than 24: The argument's value is the value of the additional
      information.

   24, 25, 26, or 27: The argument's value is held in the following 1,
      2, 4, or 8 bytes, respectively, in network byte order. For major
      type 7 and additional information value 25, 26, 27, these bytes
      are not used as an integer argument, but as a floating-point value
      (see Section 3.3).

   28, 29, 30: These values are reserved for future additions to the
      CBOR format. In the present version of CBOR, the encoded item is
      not well-formed.

   31: No argument value is derived. If the major type is 0, 1, or 6,
      the encoded item is not well-formed. For major types 2 to 5, the
      item's length is indefinite, and for major type 7, the byte does
      not constitute a data item at all but terminates an indefinite
      length item; all are described in Section 3.2.

   The initial byte and any additional bytes consumed to construct the
   argument are collectively referred to as the "head" of the data item.

   The meaning of this argument depends on the major type. For example,
   in major type 0, the argument is the value of the data item itself
   (and in major type 1 the value of the data item is computed from the
   argument); in major types 2 and 3 it gives the length of the string
   data in bytes that follows; and in major types 4 and 5 it is used to
   determine the number of data items enclosed.

   If the encoded sequence of bytes ends before the end of a data item,
   that item is not well-formed. If the encoded sequence of bytes still
   has bytes remaining after the outermost encoded item is decoded, that
   encoding is not a single well-formed CBOR item; depending on the
   application, the decoder may either treat the encoding as not well-
   formed or just identify the start of the remaining bytes to the
   application.
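
   As a non-normative illustration, the head rules above can be sketched
   in Python roughly as follows. The function name "decode_head" and its
   return shape are assumptions made for this sketch only; they are not
   defined by this specification, and the sketch stops at the head (it
   does not decode string content, nested items, or floating-point
   values).

```python
def decode_head(data, offset=0):
    """Return (major_type, additional_info, argument, next_offset).

    Hypothetical helper for illustration. `argument` is None when the
    additional information is 31 (no argument value is derived).
    Raises ValueError if the head is truncated or not well-formed.
    """
    if offset >= len(data):
        raise ValueError("input ends before the initial byte")
    initial = data[offset]
    major_type = initial >> 5      # high-order 3 bits
    info = initial & 0x1F          # low-order 5 bits
    offset += 1

    if info < 24:
        # The argument is the additional information itself.
        return major_type, info, info, offset

    if info <= 27:
        size = 1 << (info - 24)    # 1, 2, 4, or 8 following bytes
        raw = data[offset:offset + size]
        if len(raw) != size:
            raise ValueError("input ends inside the head")
        # Network byte order. For major type 7 with info 25..27 a caller
        # would reinterpret these bytes as a floating-point value.
        return major_type, info, int.from_bytes(raw, "big"), offset + size

    if info < 31:
        raise ValueError("additional information 28..30 is reserved")

    # info == 31: indefinite length (major types 2..5) or "break" (7).
    if major_type in (0, 1, 6):
        raise ValueError("additional information 31 is not allowed here")
    return major_type, info, None, offset
```

   For instance, decode_head(bytes.fromhex("1901f4")) yields
   (0, 25, 500, 3): major type 0, a two-byte argument of 500, and the
   offset at which any remaining bytes start.
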
488 A CBOR decoder implementation can be based on a jump table with all 489 256 defined values for the initial byte (Table 7). A decoder in a 490 constrained implementation can instead use the structure of the 491 initial byte and following bytes for more compact code (see 492 Appendix C for a rough impression of how this could look). 494 3.1. Major Types 496 The following lists the major types and the additional information 497 and other bytes associated with the type. 499 Major type 0: an unsigned integer in the range 0..2**64-1 inclusive. 500 The value of the encoded item is the argument itself. For 501 example, the integer 10 is denoted as the one byte 0b000_01010 502 (major type 0, additional information 10). The integer 500 would 503 be 0b000_11001 (major type 0, additional information 25) followed 504 by the two bytes 0x01f4, which is 500 in decimal. 506 Major type 1: a negative integer in the range -2**64..-1 inclusive. 507 The value of the item is -1 minus the argument. For example, the 508 integer -500 would be 0b001_11001 (major type 1, additional 509 information 25) followed by the two bytes 0x01f3, which is 499 in 510 decimal. 512 Major type 2: a byte string. The number of bytes in the string is 513 equal to the argument. For example, a byte string whose length is 514 5 would have an initial byte of 0b010_00101 (major type 2, 515 additional information 5 for the length), followed by 5 bytes of 516 binary content. A byte string whose length is 500 would have 3 517 initial bytes of 0b010_11001 (major type 2, additional information 518 25 to indicate a two-byte length) followed by the two bytes 0x01f4 519 for a length of 500, followed by 500 bytes of binary content. 521 Major type 3: a text string (Section 2), encoded as UTF-8 522 ([RFC3629]). The number of bytes in the string is equal to the 523 argument. A string containing an invalid UTF-8 sequence is well- 524 formed but invalid (Section 1.2). 
This type is provided for 525 systems that need to interpret or display human-readable text, and 526 allows the differentiation between unstructured bytes and text 527 that has a specified repertoire (that of Unicode) and encoding 528 (UTF-8). In contrast to formats such as JSON, the Unicode 529 characters in this type are never escaped. Thus, a newline 530 character (U+000A) is always represented in a string as the byte 531 0x0a, and never as the bytes 0x5c6e (the characters "\" and "n") 532 nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and 533 "a"). 535 Major type 4: an array of data items. In other formats, arrays are 536 also called lists, sequences, or tuples (a "CBOR sequence" is 537 something slightly different, though [RFC8742]). The argument is 538 the number of data items in the array. Items in an array do not 539 need to all be of the same type. For example, an array that 540 contains 10 items of any type would have an initial byte of 541 0b100_01010 (major type 4, additional information 10 for the 542 length) followed by the 10 remaining items. 544 Major type 5: a map of pairs of data items. Maps are also called 545 tables, dictionaries, hashes, or objects (in JSON). A map is 546 comprised of pairs of data items, each pair consisting of a key 547 that is immediately followed by a value. The argument is the 548 number of _pairs_ of data items in the map. For example, a map 549 that contains 9 pairs would have an initial byte of 0b101_01001 550 (major type 5, additional information 9 for the number of pairs) 551 followed by the 18 remaining items. The first item is the first 552 key, the second item is the first value, the third item is the 553 second key, and so on. Because items in a map come in pairs, 554 their total number is always even: A map that contains an odd 555 number of items (no value data present after the last key data 556 item) is not well-formed. 
A map that has duplicate keys may be
557        well-formed, but it is not valid, and thus it causes indeterminate
558        decoding; see also Section 5.6.

560     Major type 6:  a tagged data item ("tag") whose tag number, an
561        integer in the range 0..2**64-1 inclusive, is the argument and
562        whose enclosed data item ("tag content") is the single encoded
563        data item that follows the head.  See Section 3.4.

565     Major type 7:  floating-point numbers and simple values, as well as
566        the "break" stop code.  See Section 3.3.

568     These eight major types lead to a simple table showing which of the
569     256 possible values for the initial byte of a data item are used
570     (Table 7).

572     In major types 6 and 7, many of the possible values are reserved for
573     future specification.  See Section 9 for more information on these
574     values.

576     Table 1 summarizes the major types defined by CBOR, ignoring the next
577     section for now.  The number N in this table stands for the argument,
578     mt for the major type.

580     +====+=======================+=================================+
581     | mt | Meaning               | Content                         |
582     +====+=======================+=================================+
583     | 0  | unsigned integer N    | -                               |
584     +----+-----------------------+---------------------------------+
585     | 1  | negative integer -1-N | -                               |
586     +----+-----------------------+---------------------------------+
587     | 2  | byte string           | N bytes                         |
588     +----+-----------------------+---------------------------------+
589     | 3  | text string           | N bytes (UTF-8 text)            |
590     +----+-----------------------+---------------------------------+
591     | 4  | array                 | N data items (elements)         |
592     +----+-----------------------+---------------------------------+
593     | 5  | map                   | 2N data items (key/value pairs) |
594     +----+-----------------------+---------------------------------+
595     | 6  | tag of number N       | 1 data item                     |
596     +----+-----------------------+---------------------------------+
597     | 7  | simple/float          | -                               |
598     +----+-----------------------+---------------------------------+

600         Table 1: Overview over the definite-length use of CBOR major
601                     types (mt = major type, N = argument)

603  3.2.  Indefinite Lengths for Some Major Types

605     Four CBOR items (arrays, maps, byte strings, and text strings) can be
606     encoded with an indefinite length using additional information value
607     31.  This is useful if the encoding of the item needs to begin before
608     the number of items inside the array or map, or the total length of
609     the string, is known.  (The ability to start sending a data item
610     before all of it is known is often referred to as "streaming" within
611     that data item.)

613     Indefinite-length arrays and maps are dealt with differently than
614     indefinite-length strings (byte strings and text strings).

616  3.2.1.  The "break" Stop Code

618     The "break" stop code is encoded with major type 7 and additional
619     information value 31 (0b111_11111).  It is not itself a data item: it
620     is just a syntactic feature to close an indefinite-length item.

622     If the "break" stop code appears anywhere where a data item is
623     expected, other than directly inside an indefinite-length string,
624     array, or map -- for example directly inside a definite-length array
625     or map -- the enclosing item is not well-formed.

627  3.2.2.  Indefinite-Length Arrays and Maps

629     Indefinite-length arrays and maps are represented using their major
630     type with the additional information value of 31, followed by an
631     arbitrary-length sequence of zero or more items for an array or key/
632     value pairs for a map, followed by the "break" stop code
633     (Section 3.2.1).  In other words, indefinite-length arrays and maps
634     look identical to other arrays and maps except for beginning with the
635     additional information value of 31 and ending with the "break" stop
636     code.

638     If the "break" stop code appears after a key in a map, in place of
639     that key's value, the map is not well-formed.
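
As a rough illustration of the framing described in Sections 3.2.1 and 3.2.2, the following Python sketch builds definite-length and indefinite-length arrays from already-encoded items. The helper names ("small_head", "uint", "definite_array", "indefinite_array") are hypothetical and not part of the specification, and the sketch assumes that every argument is below 24 so that it fits in the initial byte.

```python
BREAK = bytes([0xFF])  # major type 7, additional information 31


def small_head(major_type, argument):
    """Initial byte for an argument of 0..23 (assumption: larger arguments are out of scope)."""
    if not 0 <= major_type <= 7 or not 0 <= argument <= 23:
        raise ValueError("sketch only supports major types 0..7 and arguments 0..23")
    return bytes([(major_type << 5) | argument])


def uint(n):
    """Unsigned integer (major type 0) with a small value."""
    return small_head(0, n)


def definite_array(items):
    """Major type 4 with the item count as the argument, followed by the items."""
    return small_head(4, len(items)) + b"".join(items)


def indefinite_array(items):
    """Major type 4 with additional information 31, the items, then the "break" stop code."""
    return bytes([(4 << 5) | 31]) + b"".join(items) + BREAK


# The abstract array [1, [2, 3], [4, 5]] with the outer and the last inner
# array encoded with indefinite length.
encoded = indefinite_array([
    uint(1),
    definite_array([uint(2), uint(3)]),
    indefinite_array([uint(4), uint(5)]),
])
print(encoded.hex())  # 9f018202039f0405ffff
```

Each indefinite-length array contributes exactly one 0x9f byte and one "break" stop code, which is why the nested case ends in two 0xff bytes.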
641     There is no restriction against nesting indefinite-length array or
642     map items.  A "break" only terminates a single item, so nested
643     indefinite-length items need exactly as many "break" stop codes as
644     there are type bytes starting an indefinite-length item.

646     For example, assume an encoder wants to represent the abstract array
647     [1, [2, 3], [4, 5]].  The definite-length encoding would be
648     0x8301820203820405:

650     83        -- Array of length 3
651        01     -- 1
652        82     -- Array of length 2
653           02  -- 2
654           03  -- 3
655        82     -- Array of length 2
656           04  -- 4
657           05  -- 5

659     Indefinite-length encoding could be applied independently to each of
660     the three arrays encoded in this data item, as required, leading to
661     representations such as:

663     0x9f018202039f0405ffff
664     9F        -- Start indefinite-length array
665        01     -- 1
666        82     -- Array of length 2
667           02  -- 2
668           03  -- 3
669        9F     -- Start indefinite-length array
670           04  -- 4
671           05  -- 5
672           FF  -- "break" (inner array)
673        FF     -- "break" (outer array)

675     0x9f01820203820405ff
676     9F        -- Start indefinite-length array
677        01     -- 1
678        82     -- Array of length 2
679           02  -- 2
680           03  -- 3
681        82     -- Array of length 2
682           04  -- 4
683           05  -- 5
684        FF     -- "break"

686     0x83018202039f0405ff
687     83        -- Array of length 3
688        01     -- 1
689        82     -- Array of length 2
690           02  -- 2
691           03  -- 3
692        9F     -- Start indefinite-length array
693           04  -- 4
694           05  -- 5
695           FF  -- "break"

697     0x83019f0203ff820405
698     83        -- Array of length 3
699        01     -- 1
700        9F     -- Start indefinite-length array
701           02  -- 2
702           03  -- 3
703           FF  -- "break"
704        82     -- Array of length 2
705           04  -- 4
706           05  -- 5

708     An example of an indefinite-length map (that happens to have two key/
709     value pairs) might be:

711     0xbf6346756ef563416d7421ff
712     BF           -- Start indefinite-length map
713        63        -- First key, UTF-8 string length 3
714           46756e -- "Fun"
715        F5        -- First value, true
716        63        -- Second key, UTF-8 string length 3
717           416d74 -- "Amt"
718        21        -- Second value, -2
719        FF        -- "break"

721  3.2.3.
Indefinite-Length Byte Strings and Text Strings 723 Indefinite-length strings are represented by a byte containing the 724 major type for byte string or text string with an additional 725 information value of 31, followed by a series of zero or more strings 726 of the specified type ("chunks") that have definite lengths, and 727 finished by the "break" stop code (Section 3.2.1). The data item 728 represented by the indefinite-length string is the concatenation of 729 the chunks. If no chunks are present, the data item is an empty 730 string of the specified type. Zero-length chunks, while not 731 particularly useful, are permitted. 733 If any item between the indefinite-length string indicator 734 (0b010_11111 or 0b011_11111) and the "break" stop code is not a 735 definite-length string item of the same major type, the string is not 736 well-formed. 738 The design does not allow nesting indefinite-length strings as chunks 739 into indefinite-length strings. If it were allowed, it would require 740 decoder implementations to keep a stack, or at least a count, of 741 nesting levels. It is unnecessary on the encoder side because the 742 inner indefinite-length string would consist of chunks, and these 743 could instead be put directly into the outer indefinite-length 744 string. 746 If any definite-length text string inside an indefinite-length text 747 string is invalid, the indefinite-length text string is invalid. 748 Note that this implies that the UTF-8 bytes of a single Unicode code 749 point (scalar value) cannot be spread between chunks: a new chunk of 750 a text string can only be started at a code point boundary. 
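
The chunking rules of this section can be sketched as a small Python decoder. This is an illustration under stated assumptions rather than a complete implementation: the function name "decode_indefinite_string" is hypothetical, chunk lengths above 23 are rejected to keep the sketch short, and every failure is reported as a ValueError without distinguishing "not well-formed" from "invalid".

```python
BREAK = 0xFF  # major type 7, additional information 31


def decode_indefinite_string(data: bytes) -> bytes:
    """Concatenate the chunks of an indefinite-length byte or text string."""
    major_type, info = data[0] >> 5, data[0] & 0x1F
    if major_type not in (2, 3) or info != 31:
        raise ValueError("not an indefinite-length byte or text string")
    chunks = []
    pos = 1
    while True:
        if pos >= len(data):
            raise ValueError('missing "break" stop code')
        initial = data[pos]
        pos += 1
        if initial == BREAK:
            return b"".join(chunks)
        length = initial & 0x1F
        # Chunks must be definite-length strings of the same major type,
        # so a nested indefinite-length string (length == 31) is rejected.
        if initial >> 5 != major_type or length == 31:
            raise ValueError("chunk is not a definite-length string of the same major type")
        if length > 23:
            raise ValueError("sketch limitation: chunk lengths above 23 are not handled")
        chunk = data[pos:pos + length]
        if len(chunk) != length:
            raise ValueError("truncated chunk")
        if major_type == 3:
            # Each text chunk must be valid UTF-8 on its own, so a code point
            # cannot be split across chunks; UnicodeDecodeError is a ValueError.
            chunk.decode("utf-8")
        chunks.append(chunk)
        pos += length


print(decode_indefinite_string(bytes.fromhex("5f44aabbccdd43eeff99ff")).hex())  # aabbccddeeff99
```

Because nesting is not allowed, the decoder needs no stack or nesting counter: the first "break" stop code it meets always closes the string.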
752     For example, assume an encoded data item consisting of the bytes:

754     0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111

756     5F              -- Start indefinite-length byte string
757        44           -- Byte string of length 4
758           aabbccdd  -- Bytes content
759        43           -- Byte string of length 3
760           eeff99    -- Bytes content
761        FF           -- "break"

763     After decoding, this results in a single byte string with seven
764     bytes: 0xaabbccddeeff99.

766  3.2.4.  Summary of indefinite-length use of major types

768     Table 2 summarizes the major types defined by CBOR as used for
769     indefinite length encoding (with additional information set to 31).
770     mt stands for the major type.

772     +====+===================+==================================+
773     | mt | Meaning           | enclosed up to "break" stop code |
774     +====+===================+==================================+
775     | 0  | (not well-formed) | -                                |
776     +----+-------------------+----------------------------------+
777     | 1  | (not well-formed) | -                                |
778     +----+-------------------+----------------------------------+
779     | 2  | byte string       | definite-length byte strings     |
780     +----+-------------------+----------------------------------+
781     | 3  | text string       | definite-length text strings     |
782     +----+-------------------+----------------------------------+
783     | 4  | array             | data items (elements)            |
784     +----+-------------------+----------------------------------+
785     | 5  | map               | data items (key/value pairs)     |
786     +----+-------------------+----------------------------------+
787     | 6  | (not well-formed) | -                                |
788     +----+-------------------+----------------------------------+
789     | 7  | "break" stop code | -                                |
790     +----+-------------------+----------------------------------+

792          Table 2: Overview over the indefinite-length use of CBOR
793           major types (mt = major type, additional information =
794                                     31)

796  3.3.  Floating-Point Numbers and Values with No Content

798     Major type 7 is for two types of data: floating-point numbers and
799     "simple values" that do not need any content.  Each value of the
800     5-bit additional information in the initial byte has its own separate
801     meaning, as defined in Table 3.  Like the major types for integers,
802     items of this major type do not carry content data; all the
803     information is in the initial bytes (the head).

805     +=============+===================================================+
806     | 5-Bit Value | Semantics                                         |
807     +=============+===================================================+
808     | 0..23       | Simple value (value 0..23)                        |
809     +-------------+---------------------------------------------------+
810     | 24          | Simple value (value 32..255 in following byte)    |
811     +-------------+---------------------------------------------------+
812     | 25          | IEEE 754 Half-Precision Float (16 bits follow)    |
813     +-------------+---------------------------------------------------+
814     | 26          | IEEE 754 Single-Precision Float (32 bits follow)  |
815     +-------------+---------------------------------------------------+
816     | 27          | IEEE 754 Double-Precision Float (64 bits follow)  |
817     +-------------+---------------------------------------------------+
818     | 28-30       | Reserved, not well-formed in the present document |
819     +-------------+---------------------------------------------------+
820     | 31          | "break" stop code for indefinite-length items     |
821     |             | (Section 3.2.1)                                   |
822     +-------------+---------------------------------------------------+

824         Table 3: Values for Additional Information in Major Type 7

826     As with all other major types, the 5-bit value 24 signifies a single-
827     byte extension: it is followed by an additional byte to represent the
828     simple value.  (To minimize confusion, only the values 32 to 255 are
829     used.)  This maintains the structure of the initial bytes: as for the
830     other major types, the length of these always depends on the
831     additional information in the first byte.  Table 4 lists the numeric
832     values assigned and available for simple values.

834     +=========+==============+
835     | Value   | Semantics    |
836     +=========+==============+
837     | 0..19   | (Unassigned) |
838     +---------+--------------+
839     | 20      | False        |
840     +---------+--------------+
841     | 21      | True         |
842     +---------+--------------+
843     | 22      | Null         |
844     +---------+--------------+
845     | 23      | Undefined    |
846     +---------+--------------+
847     | 24..31  | (Reserved)   |
848     +---------+--------------+
849     | 32..255 | (Unassigned) |
850     +---------+--------------+

852                       Table 4: Simple Values

854     An encoder MUST NOT issue two-byte sequences that start with 0xf8
855     (major type 7, additional information 24) and continue with a byte
856     less than 0x20 (32 decimal).  Such sequences are not well-formed.
857     (This implies that an encoder cannot encode false, true, null, or
858     undefined in two-byte sequences, and that only the one-byte variants
859     of these are well-formed; more generally speaking, each simple value
860     only has a single representation variant).

862     The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit
863     IEEE 754 binary floating-point values [IEEE754].  These floating-
864     point values are encoded in the additional bytes of the appropriate
865     size.  (See Appendix D for some information about 16-bit floating-
866     point numbers.)

868  3.4.  Tagging of Items

870     In CBOR, a data item can be enclosed by a tag to give it some
871     additional semantics, as uniquely identified by a "tag number".  The
872     tag is major type 6, its argument (Section 3) indicates the tag
873     number, and it contains a single enclosed data item, the "tag
874     content".  (If a tag requires further structure to its content, this
875     structure is provided by the enclosed data item.)
We use the term 876 "tag" for the entire data item consisting of both a tag number and 877 the tag content: the tag content is the data item that is being 878 tagged. 880 For example, assume that a byte string of length 12 is marked with a 881 tag of number 2 to indicate it is a positive "bignum" 882 (Section 3.4.3). The encoded data item would start with a byte 883 0b110_00010 (major type 6, additional information 2 for the tag 884 number) followed by the encoded tag content: 0b010_01100 (major type 885 2, additional information of 12 for the length) followed by the 12 886 bytes of the bignum. 888 The definition of a tag number describes the additional semantics 889 conveyed for tags with this tag number in the extended generic data 890 model. These semantics may include equivalence of some tagged data 891 items with other data items, including some that can already be 892 represented in the basic generic data model. For instance, 0xc24101, 893 a bignum the tag content of which is the byte string with the single 894 byte 0x01, is equivalent to an integer 1, which could also be encoded 895 for instance as 0x01, 0x1801, or 0x190001. The tag definition may 896 include the definition of a preferred serialization (Section 4.1) 897 that is recommended for generic encoders; this may prefer basic 898 generic data model representations over ones that employ a tag. 900 The tag definition usually restricts what kinds of nested data item 901 or items are valid for such tags. Tag definitions may restrict their 902 content to a very specific syntactic structure, as the tags defined 903 in this document do, or they may aim at a more semantically defined 904 definition of their content, as for instance tags 40 and 1040 do 905 [RFC8746]: These accept a number of different ways of representing 906 arrays. 
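
The equivalence mentioned earlier in this section (0xc24101 on the one hand; 0x01, 0x1801, and 0x190001 on the other) can be checked with a short Python sketch. The function names "decode_head" and "decode_integer_or_bignum" are hypothetical; the sketch assumes the input is a single, complete data item that is either a basic integer or a tag number 2 or 3 bignum, and it does not handle indefinite-length byte strings as tag content.

```python
def decode_head(data: bytes):
    """Return (major type, argument, head length) for a definite-length head."""
    major_type, info = data[0] >> 5, data[0] & 0x1F
    if info < 24:
        return major_type, info, 1
    if info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4, or 8 bytes of argument follow
        if len(data) < 1 + size:
            raise ValueError("truncated head")
        return major_type, int.from_bytes(data[1:1 + size], "big"), 1 + size
    raise ValueError("additional information 28..31 is not handled in this sketch")


def decode_integer_or_bignum(data: bytes) -> int:
    """Numeric value of a basic integer or of a tag 2/3 bignum."""
    major_type, argument, used = decode_head(data)
    if major_type == 0:
        return argument
    if major_type == 1:
        return -1 - argument
    if major_type == 6 and argument in (2, 3):
        content_type, length, content_head = decode_head(data[used:])
        if content_type != 2:
            raise ValueError("bignum tag content must be a byte string")
        start = used + content_head
        n = int.from_bytes(data[start:start + length], "big")  # network byte order
        return n if argument == 2 else -1 - n
    raise ValueError("not an integer or bignum")


for encoding in ("01", "1801", "190001", "c24101"):
    print(encoding, decode_integer_or_bignum(bytes.fromhex(encoding)))  # each line ends in 1
```

All four encodings yield the same number; only 0x01 is the shortest form.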
908 As a matter of convention, many tags do not accept null or undefined 909 values as tag content; instead, the expectation is that a null or 910 undefined value can be used in place of the entire tag; Section 3.4.2 911 provides some further considerations for one specific tag about the 912 handling of this convention in application protocols and in mapping 913 to platform types. 915 Decoders do not need to understand tags of every tag number, and tags 916 may be of little value in applications where the implementation 917 creating a particular CBOR data item and the implementation decoding 918 that stream know the semantic meaning of each item in the data flow. 919 Their primary purpose in this specification is to define common data 920 types such as dates. A secondary purpose is to provide conversion 921 hints when it is foreseen that the CBOR data item needs to be 922 translated into a different format, requiring hints about the content 923 of items. Understanding the semantics of tags is optional for a 924 decoder; it can simply present both the tag number and the tag 925 content to the application, without interpreting the additional 926 semantics of the tag. 928 A tag applies semantics to the data item it encloses. Tags can nest: 929 If tag A encloses tag B, which encloses data item C, tag A applies to 930 the result of applying tag B on data item C. 932 IANA maintains a registry of tag numbers as described in Section 9.2. 933 Table 5 provides a list of tag numbers that were defined in 934 [RFC7049], with definitions in the rest of this section. (Tag number 935 35 was also defined in [RFC7049]; a discussion of this tag number 936 follows in Section 3.4.5.3.) Note that many other tag numbers have 937 been defined since the publication of [RFC7049]; see the registry 938 described at Section 9.2 for the complete list. 
940     +============+=============+==================================+
941     | Tag Number | Data Item   | Tag Content Semantics            |
942     +============+=============+==================================+
943     | 0          | text string | Standard date/time string; see   |
944     |            |             | Section 3.4.1                    |
945     +------------+-------------+----------------------------------+
946     | 1          | integer or  | Epoch-based date/time; see       |
947     |            | float       | Section 3.4.2                    |
948     +------------+-------------+----------------------------------+
949     | 2          | byte string | Positive bignum; see             |
950     |            |             | Section 3.4.3                    |
951     +------------+-------------+----------------------------------+
952     | 3          | byte string | Negative bignum; see             |
953     |            |             | Section 3.4.3                    |
954     +------------+-------------+----------------------------------+
955     | 4          | array       | Decimal fraction; see            |
956     |            |             | Section 3.4.4                    |
957     +------------+-------------+----------------------------------+
958     | 5          | array       | Bigfloat; see Section 3.4.4      |
959     +------------+-------------+----------------------------------+
960     | 21         | (any)       | Expected conversion to base64url |
961     |            |             | encoding; see Section 3.4.5.2    |
962     +------------+-------------+----------------------------------+
963     | 22         | (any)       | Expected conversion to base64    |
964     |            |             | encoding; see Section 3.4.5.2    |
965     +------------+-------------+----------------------------------+
966     | 23         | (any)       | Expected conversion to base16    |
967     |            |             | encoding; see Section 3.4.5.2    |
968     +------------+-------------+----------------------------------+
969     | 24         | byte string | Encoded CBOR data item; see      |
970     |            |             | Section 3.4.5.1                  |
971     +------------+-------------+----------------------------------+
972     | 32         | text string | URI; see Section 3.4.5.3         |
973     +------------+-------------+----------------------------------+
974     | 33         | text string | base64url; see Section 3.4.5.3   |
975     +------------+-------------+----------------------------------+
976     | 34         | text string | base64; see Section 3.4.5.3      |
977     +------------+-------------+----------------------------------+
978     | 36         | text string | MIME message; see                |
979     |            |             | Section 3.4.5.3                  |
980     +------------+-------------+----------------------------------+
981     | 55799      | (any)       | Self-described CBOR; see         |
982     |            |             | Section 3.4.6                    |
983     +------------+-------------+----------------------------------+

985                  Table 5: Tag numbers defined in RFC 7049

987     Conceptually, tags are interpreted in the generic data model, not at
988     (de-)serialization time.  A small number of tags (at this time, tag
989     number 25 and tag number 29 [IANA.cbor-tags]) have been registered
990     with semantics that may require processing at (de-)serialization
991     time: The decoder needs to be aware and the encoder needs to be in
992     control of the exact sequence in which data items are encoded into
993     the CBOR data item.  This means these tags cannot be implemented on
994     top of an arbitrary generic CBOR encoder/decoder (which might not
995     reflect the serialization order for entries in a map at the data
996     model level and vice versa); their implementation therefore typically
997     needs to be integrated into the generic encoder/decoder.  The
998     definition of new tags with this property is NOT RECOMMENDED.

1000    IANA allocated tag numbers 65535, 4294967295, and
1001    18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
1002    These can be used as a convenience for implementers that want a
1003    single integer data structure to indicate either that a specific tag
1004    is present, or the absence of a tag.  That allocation is described in
1005    Section 10 of [I-D.bormann-cbor-notable-tags].  These tags are not
1006    intended to occur in actual CBOR data items; implementations MAY flag
1007    such an occurrence as an error.
1009 Protocols using tag numbers 0 and 1 extend the generic data model 1010 (Section 2) with data items representing points in time; tag numbers 1011 2 and 3, with arbitrarily sized integers; and tag numbers 4 and 5, 1012 with floating-point values of arbitrary size and precision. 1014 3.4.1. Standard Date/Time String 1016 Tag number 0 contains a text string in the standard format described 1017 by the "date-time" production in [RFC3339], as refined by Section 3.3 1018 of [RFC4287], representing the point in time described there. A 1019 nested item of another type or a text string that doesn't match the 1020 [RFC4287] format is invalid. 1022 3.4.2. Epoch-based Date/Time 1024 Tag number 1 contains a numerical value counting the number of 1025 seconds from 1970-01-01T00:00Z in UTC time to the represented point 1026 in civil time. 1028 The tag content MUST be an unsigned or negative integer (major types 1029 0 and 1), or a floating-point number (major type 7 with additional 1030 information 25, 26, or 27). Other contained types are invalid. 1032 Non-negative values (major type 0 and non-negative floating-point 1033 numbers) stand for time values on or after 1970-01-01T00:00Z UTC and 1034 are interpreted according to POSIX [TIME_T]. (POSIX time is also 1035 known as "UNIX Epoch time".) Leap seconds are handled specially by 1036 POSIX time and this results in a 1 second discontinuity several times 1037 per decade. Note that applications that require the expression of 1038 times beyond early 2106 cannot leave out support of 64-bit integers 1039 for the tag content. 1041 Negative values (major type 1 and negative floating-point numbers) 1042 are interpreted as determined by the application requirements as 1043 there is no universal standard for UTC count-of-seconds time before 1044 1970-01-01T00:00Z (this is particularly true for points in time that 1045 precede discontinuities in national calendars). The same applies to 1046 non-finite values. 
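
A minimal Python sketch of decoding tag number 1 follows. The function name "decode_epoch_time" is hypothetical. Two choices in it are assumptions of this sketch and not requirements of the text: negative values are mapped by simply counting seconds backwards from the epoch, and non-finite values are rejected, whereas the text leaves both to the application. Leap seconds are not modelled.

```python
import math
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
FLOAT_FORMATS = {25: ">e", 26: ">f", 27: ">d"}  # binary16, binary32, binary64


def decode_epoch_time(data: bytes) -> datetime:
    """Decode a tag number 1 item whose tag head is the single byte 0xc1."""
    if data[0] != 0xC1:
        raise ValueError("expected tag number 1 in its one-byte form")
    major_type, info = data[1] >> 5, data[1] & 0x1F
    payload = data[2:]
    if major_type in (0, 1):
        if info < 24:
            argument = info
        elif info <= 27:
            size = 1 << (info - 24)
            argument = int.from_bytes(payload[:size], "big")
        else:
            raise ValueError("not a well-formed integer head")
        seconds = argument if major_type == 0 else -1 - argument
    elif major_type == 7 and info in FLOAT_FORMATS:
        fmt = FLOAT_FORMATS[info]
        (seconds,) = struct.unpack(fmt, payload[:struct.calcsize(fmt)])
        if not math.isfinite(seconds):
            # Assumption of this sketch: reject NaN and infinities.
            raise ValueError("non-finite tag content is application-defined")
    else:
        raise ValueError("tag content must be an integer or a floating-point number")
    # Assumption of this sketch: negative values count backwards from the epoch.
    return EPOCH + timedelta(seconds=seconds)


print(decode_epoch_time(bytes.fromhex("c11a514b67b0")).isoformat())  # 2013-03-21T20:04:00+00:00
```

The same instant with a half-second fraction needs the binary64 form (0xc1fb41d452d9ec200000), which illustrates why fractional seconds generally require binary64 support.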
1048 To indicate fractional seconds, floating-point values can be used 1049 within tag number 1 instead of integer values. Note that this 1050 generally requires binary64 support, as binary16 and binary32 provide 1051 non-zero fractions of seconds only for a short period of time around 1052 early 1970. An application that requires tag number 1 support may 1053 restrict the tag content to be an integer (or a floating-point value) 1054 only. 1056 Note that platform types for date/time may include null or undefined 1057 values, which may also be desirable at an application protocol level. 1058 While emitting tag number 1 values with non-finite tag content values 1059 (e.g., with NaN for undefined date/time values or with Infinite for 1060 an expiry date that is not set) may seem an obvious way to handle 1061 this, using untagged null or undefined avoids the use of non-finites 1062 and results in a shorter encoding. Application protocol designers 1063 are encouraged to consider these cases and include clear guidelines 1064 for handling them. 1066 3.4.3. Bignums 1068 Protocols using tag numbers 2 and 3 extend the generic data model 1069 (Section 2) with "bignums" representing arbitrarily sized integers. 1070 In the basic generic data model, bignum values are not equal to 1071 integers from the same model, but the extended generic data model 1072 created by this tag definition defines equivalence based on numeric 1073 value, and preferred serialization (Section 4.1) never makes use of 1074 bignums that also can be expressed as basic integers (see below). 1076 Bignums are encoded as a byte string data item, which is interpreted 1077 as an unsigned integer n in network byte order. Contained items of 1078 other types are invalid. For tag number 2, the value of the bignum 1079 is n. For tag number 3, the value of the bignum is -1 - n. 
The 1080 preferred serialization of the byte string is to leave out any 1081 leading zeroes (note that this means the preferred serialization for 1082 n = 0 is the empty byte string, but see below). Decoders that 1083 understand these tags MUST be able to decode bignums that do have 1084 leading zeroes. The preferred serialization of an integer that can 1085 be represented using major type 0 or 1 is to encode it this way 1086 instead of as a bignum (which means that the empty string never 1087 occurs in a bignum when using preferred serialization). Note that 1088 this means the non-preferred choice of a bignum representation 1089 instead of a basic integer for encoding a number is not intended to 1090 have application semantics (just as the choice of a longer basic 1091 integer representation than needed, such as 0x1800 for 0x00 does 1092 not). 1094 For example, the number 18446744073709551616 (2**64) is represented 1095 as 0b110_00010 (major type 6, tag number 2), followed by 0b010_01001 1096 (major type 2, length 9), followed by 0x010000000000000000 (one byte 1097 0x01 and eight bytes 0x00). In hexadecimal: 1099 C2 -- Tag 2 1100 49 -- Byte string of length 9 1101 010000000000000000 -- Bytes content 1103 3.4.4. Decimal Fractions and Bigfloats 1105 Protocols using tag number 4 extend the generic data model with data 1106 items representing arbitrary-length decimal fractions of the form 1107 m*(10**e). Protocols using tag number 5 extend the generic data 1108 model with data items representing arbitrary-length binary fractions 1109 of the form m*(2**e). As with bignums, values of different types are 1110 not equal in the generic data model. 1112 Decimal fractions combine an integer mantissa with a base-10 scaling 1113 factor. They are most useful if an application needs the exact 1114 representation of a decimal fraction such as 1.1 because there is no 1115 exact representation for many decimal fractions in binary floating- 1116 point representations. 
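
The point about exactness can be made concrete with a small Python sketch that computes the value of the [exponent, mantissa] pair carried by tag numbers 4 and 5. The function names are hypothetical, and Python's "fractions.Fraction" is used here only as a convenient exact rational type; the text does not prescribe any particular platform type.

```python
from fractions import Fraction


def decimal_fraction_value(exponent: int, mantissa: int) -> Fraction:
    """Value m*(10**e) of a tag number 4 item, computed exactly."""
    return Fraction(mantissa) * Fraction(10) ** exponent


def bigfloat_value(exponent: int, mantissa: int) -> Fraction:
    """Value m*(2**e) of a tag number 5 item, computed exactly."""
    return Fraction(mantissa) * Fraction(2) ** exponent


# 1.1 as the pair [-1, 11] is exact ...
print(decimal_fraction_value(-1, 11) == Fraction(11, 10))  # True
# ... whereas the binary64 number closest to 1.1 is not exactly 11/10.
print(Fraction(1.1) == Fraction(11, 10))  # False
```

A bigfloat such as [-1, 3] evaluates to exactly 3/2, which binary floating point also represents exactly; the difference between the two tags only shows up for values like 1.1 that have no finite binary expansion.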
1118 "Bigfloats" combine an integer mantissa with a base-2 scaling factor. 1119 They are binary floating-point values that can exceed the range or 1120 the precision of the three IEEE 754 formats supported by CBOR 1121 (Section 3.3). Bigfloats may also be used by constrained 1122 applications that need some basic binary floating-point capability 1123 without the need for supporting IEEE 754. 1125 A decimal fraction or a bigfloat is represented as a tagged array 1126 that contains exactly two integer numbers: an exponent e and a 1127 mantissa m. Decimal fractions (tag number 4) use base-10 exponents; 1128 the value of a decimal fraction data item is m*(10**e). Bigfloats 1129 (tag number 5) use base-2 exponents; the value of a bigfloat data 1130 item is m*(2**e). The exponent e MUST be represented in an integer 1131 of major type 0 or 1, while the mantissa can also be a bignum 1132 (Section 3.4.3). Contained items with other structures are invalid. 1134 An example of a decimal fraction is that the number 273.15 could be 1135 represented as 0b110_00100 (major type 6 for tag, additional 1136 information 4 for the tag number), followed by 0b100_00010 (major 1137 type 4 for the array, additional information 2 for the length of the 1138 array), followed by 0b001_00001 (major type 1 for the first integer, 1139 additional information 1 for the value of -2), followed by 1140 0b000_11001 (major type 0 for the second integer, additional 1141 information 25 for a two-byte value), followed by 0b0110101010110011 1142 (27315 in two bytes). 
In hexadecimal: 1144 C4 -- Tag 4 1145 82 -- Array of length 2 1146 21 -- -2 1147 19 6ab3 -- 27315 1149 An example of a bigfloat is that the number 1.5 could be represented 1150 as 0b110_00101 (major type 6 for tag, additional information 5 for 1151 the tag number), followed by 0b100_00010 (major type 4 for the array, 1152 additional information 2 for the length of the array), followed by 1153 0b001_00000 (major type 1 for the first integer, additional 1154 information 0 for the value of -1), followed by 0b000_00011 (major 1155 type 0 for the second integer, additional information 3 for the value 1156 of 3). In hexadecimal: 1158 C5 -- Tag 5 1159 82 -- Array of length 2 1160 20 -- -1 1161 03 -- 3 1163 Decimal fractions and bigfloats provide no representation of 1164 Infinity, -Infinity, or NaN; if these are needed in place of a 1165 decimal fraction or bigfloat, the IEEE 754 half-precision 1166 representations from Section 3.3 can be used. 1168 3.4.5. Content Hints 1170 The tags in this section are for content hints that might be used by 1171 generic CBOR processors. These content hints do not extend the 1172 generic data model. 1174 3.4.5.1. Encoded CBOR Data Item 1176 Sometimes it is beneficial to carry an embedded CBOR data item that 1177 is not meant to be decoded immediately at the time the enclosing data 1178 item is being decoded. Tag number 24 (CBOR data item) can be used to 1179 tag the embedded byte string as a single data item encoded in CBOR 1180 format. Contained items that aren't byte strings are invalid. A 1181 contained byte string is valid if it encodes a well-formed CBOR data 1182 item; validity checking of the decoded CBOR item is not required for 1183 tag validity (but could be offered by a generic decoder as a special 1184 option). 1186 3.4.5.2. Expected Later Encoding for CBOR-to-JSON Converters 1188 Tag numbers 21 to 23 indicate that a byte string might require a 1189 specific encoding when interoperating with a text-based 1190 representation. 
These tags are useful when an encoder knows that the 1191 byte string data it is writing is likely to be later converted to a 1192 particular JSON-based usage. That usage specifies that some strings 1193 are encoded as base64, base64url, and so on. The encoder uses byte 1194 strings instead of doing the encoding itself to reduce the message 1195 size, to reduce the code size of the encoder, or both. The encoder 1196 does not know whether or not the converter will be generic, and 1197 therefore wants to say what it believes is the proper way to convert 1198 binary strings to JSON. 1200 The data item tagged can be a byte string or any other data item. In 1201 the latter case, the tag applies to all of the byte string data items 1202 contained in the data item, except for those contained in a nested 1203 data item tagged with an expected conversion. 1205 These three tag numbers suggest conversions to three of the base data 1206 encodings defined in [RFC4648]. Tag number 21 suggests conversion to 1207 base64url encoding (Section 5 of RFC 4648), where padding is not used 1208 (see Section 3.2 of RFC 4648); that is, all trailing equals signs 1209 ("=") are removed from the encoded string. Tag number 22 suggests 1210 conversion to classical base64 encoding (Section 4 of RFC 4648), with 1211 padding as defined in RFC 4648. For both base64url and base64, 1212 padding bits are set to zero (see Section 3.5 of RFC 4648), and the 1213 conversion to alternate encoding is performed on the contents of the 1214 byte string (that is, without adding any line breaks, whitespace, or 1215 other additional characters). Tag number 23 suggests conversion to 1216 base16 (hex) encoding, with uppercase alphabetics (see Section 8 of 1217 RFC 4648). Note that, for all three tag numbers, the encoding of the 1218 empty byte string is the empty text string. 1220 3.4.5.3. 
Encoded Text 1222 Some text strings hold data that have formats widely used on the 1223 Internet, and sometimes those formats can be validated and presented 1224 to the application in appropriate form by the decoder. There are 1225 tags for some of these formats. 1227 * Tag number 32 is for URIs, as defined in [RFC3986]. If the text 1228 string doesn't match the "URI-reference" production, the string is 1229 invalid. 1231 * Tag numbers 33 and 34 are for base64url- and base64-encoded text 1232 strings, respectively, as defined in [RFC4648]. If any of: 1234 - the encoded text string contains non-alphabet characters or 1235 only 1 alphabet character in the last block of 4 (where 1236 alphabet is defined by Section 5 of [RFC4648] for tag number 33 1237 and Section 4 of [RFC4648] for tag number 34), or 1239 - the padding bits in a 2- or 3-character block are not 0, or 1241 - the base64 encoding has the wrong number of padding characters, 1242 or 1244 - the base64url encoding has padding characters, 1246 the string is invalid. 1248 * Tag number 36 is for MIME messages (including all headers), as 1249 defined in [RFC2045]. A text string that isn't a valid MIME 1250 message is invalid. (For this tag, validity checking may be 1251 particularly onerous for a generic decoder and might therefore not 1252 be offered. Note that many MIME messages are general binary data 1253 and can therefore not be represented in a text string; 1254 [IANA.cbor-tags] lists a registration for tag number 257 that is 1255 similar to tag number 36 but uses a byte string as its tag 1256 content.) 1258 Note that tag numbers 33 and 34 differ from 21 and 22 in that the 1259 data is transported in base-encoded form for the former and in raw 1260 byte string form for the latter. 1262 [RFC7049] also defined a tag number 35, for regular expressions that 1263 are in Perl Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] 1264 or in JavaScript regular expression syntax [ECMA262]. 
The state of 1265 the art in these regular expression specifications has since advanced 1266 and is continually advancing, so the present specification does not 1267 attempt to update the references to a snapshot that is current at the 1268 time of writing. Instead, this tag remains available (as registered 1269 in [RFC7049]) for applications that specify the particular regular 1270 expression variant they use out-of-band (possibly by limiting the 1271 usage to a defined common subset of both PCRE and ECMA262). As the 1272 present specification clarifies tag validity beyond [RFC7049], we 1273 note that due to the open way the tag was defined in [RFC7049], any 1274 contained string value needs to be valid at the CBOR tag level (but 1275 may then not be "expected" at the application level). 1277 3.4.6. Self-Described CBOR 1279 In many applications, it will be clear from the context that CBOR is 1280 being employed for encoding a data item. For instance, a specific 1281 protocol might specify the use of CBOR, or a media type is indicated 1282 that specifies its use. However, there may be applications where 1283 such context information is not available, such as when CBOR data is 1284 stored in a file that does not have disambiguating metadata. Here, 1285 it may help to have some distinguishing characteristics for the data 1286 itself. 1288 Tag number 55799 is defined for this purpose, specifically for use at 1289 the start of a stored encoded CBOR data item as specified by an 1290 application. It does not impart any special semantics on the data 1291 item that it encloses; that is, the semantics of the tag content 1292 enclosed in tag number 55799 is exactly identical to the semantics of 1293 the tag content itself. 1295 The serialization of this tag's head is 0xd9d9f7, which does not 1296 appear to be in use as a distinguishing mark for any frequently used 1297 file types. 
In particular, 0xd9d9f7 is not a valid start of a 1298 Unicode text in any Unicode encoding if it is followed by a valid 1299 CBOR data item. 1301 For instance, a decoder might be able to decode both CBOR and JSON. 1302 Such a decoder would need to mechanically distinguish the two 1303 formats. An easy way for an encoder to help the decoder would be to 1304 tag the entire CBOR item with tag number 55799, the serialization of 1305 which will never be found at the beginning of a JSON text. 1307 4. Serialization Considerations 1309 4.1. Preferred Serialization 1311 For some values at the data model level, CBOR provides multiple 1312 serializations. For many applications, it is desirable that an 1313 encoder always chooses a preferred serialization (preferred 1314 encoding); however, the present specification does not put the burden 1315 of enforcing this preference on either encoder or decoder. 1317 Some constrained decoders may be limited in their ability to decode 1318 non-preferred serializations: For example, if only integers below 1319 1_000_000_000 (one billion) are expected in an application, the 1320 decoder may leave out the code that would be needed to decode 64-bit 1321 arguments in integers. An encoder that always uses preferred 1322 serialization ("preferred encoder") interoperates with this decoder 1323 for the numbers that can occur in this application. More generally 1324 speaking, it therefore can be said that a preferred encoder is more 1325 universally interoperable (and also less wasteful) than one that, 1326 say, always uses 64-bit integers. 
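As a rough illustration of what a "preferred encoder" does, the following Python sketch encodes integers with the shortest possible argument and floating-point values with the shortest format that preserves the value (the rule stated later in this section). The function names encode_int and encode_float are hypothetical and not part of any CBOR library. The sketch covers only integers that fit major types 0 and 1, and it collapses every NaN to 0xf97e00, an assumption that suits only applications without NaN payloads. It tries binary16 before binary32, which gives the same result as the stepwise test conversion described in Section 4.2.1.

```python
import math
import struct


def encode_int(value: int) -> bytes:
    """Encode an integer in major type 0 or 1 with the shortest argument."""
    if not -(1 << 64) <= value < (1 << 64):
        raise ValueError("outside the range of major types 0 and 1")
    major_type = 0 if value >= 0 else 1
    argument = value if value >= 0 else -1 - value
    if argument < 24:
        # The argument fits in the initial byte itself.
        return bytes([(major_type << 5) | argument])
    # Additional information 24..27 selects a 1-, 2-, 4-, or 8-byte argument.
    for additional_info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if argument < 1 << (8 * size):
            initial = (major_type << 5) | additional_info
            return bytes([initial]) + argument.to_bytes(size, "big")
    raise AssertionError("unreachable: the range was checked above")


def encode_float(value: float) -> bytes:
    """Encode a float in the shortest of binary16/32/64 that preserves it."""
    if math.isnan(value):
        return bytes.fromhex("f97e00")  # assumption: a single NaN suffices
    for initial, fmt in ((0xF9, ">e"), (0xFA, ">f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue  # magnitude too large for this format
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([initial]) + packed
    return bytes([0xFB]) + struct.pack(">d", value)


print(encode_int(1_000_000_000).hex())  # 1a3b9aca00
print(encode_float(5.5).hex())          # f94580
```

One billion needs only a 32-bit argument here, so a constrained decoder without 64-bit support can still read it.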
1328 Similarly, a constrained encoder may be limited in the variety of 1329 representation variants it supports in such a way that it does not 1330 emit preferred serializations ("variant encoder"): Say, it could be 1331 designed to always use the 32-bit variant for an integer that it 1332 encodes even if a short representation is available (again, assuming 1333 that there is no application need for integers that can only be 1334 represented with the 64-bit variant). A decoder that does not rely 1335 on only ever receiving preferred serializations ("variation-tolerant 1336 decoder") can therefore be said to be more universally interoperable 1337 (it might very well optimize for the case of receiving preferred 1338 serializations, though). Full implementations of CBOR decoders are 1339 by definition variation-tolerant; the distinction is only relevant if 1340 a constrained implementation of a CBOR decoder meets a variant 1341 encoder. 1343 The preferred serialization always uses the shortest form of 1344 representing the argument (Section 3); it also uses the shortest 1345 floating-point encoding that preserves the value being encoded. 1347 The preferred serialization for a floating-point value is the 1348 shortest floating-point encoding that preserves its value, e.g., 1349 0xf94580 for the number 5.5, and 0xfa45ad9c00 for the number 5555.5. 1350 For NaN values, a shorter encoding is preferred if zero-padding the 1351 shorter significand towards the right reconstitutes the original NaN 1352 value (for many applications, the single NaN encoding 0xf97e00 will 1353 suffice). 1355 Definite length encoding is preferred whenever the length is known at 1356 the time the serialization of the item starts. 1358 4.2. Deterministically Encoded CBOR 1360 Some protocols may want encoders to only emit CBOR in a particular 1361 deterministic format; those protocols might also have the decoders 1362 check that their input is in that deterministic format. 
Those 1363 protocols are free to define what they mean by a "deterministic 1364 format" and what encoders and decoders are expected to do. This 1365 section defines a set of restrictions that can serve as the base of 1366 such a deterministic format. 1368 4.2.1. Core Deterministic Encoding Requirements 1370 A CBOR encoding satisfies the "core deterministic encoding 1371 requirements" if it satisfies the following restrictions: 1373 * Preferred serialization MUST be used. In particular, this means 1374 that arguments (see Section 3) for integers, lengths in major 1375 types 2 through 5, and tags MUST be as short as possible, for 1376 instance: 1378 - 0 to 23 and -1 to -24 MUST be expressed in the same byte as the 1379 major type; 1381 - 24 to 255 and -25 to -256 MUST be expressed only with an 1382 additional uint8_t; 1384 - 256 to 65535 and -257 to -65536 MUST be expressed only with an 1385 additional uint16_t; 1387 - 65536 to 4294967295 and -65537 to -4294967296 MUST be expressed 1388 only with an additional uint32_t. 1390 Floating-point values also MUST use the shortest form that 1391 preserves the value, e.g., 1.5 is encoded as 0xf93e00 (binary16) 1392 and 1000000.5 as 0xfa49742408 (binary32). (One implementation of 1393 this is to have all floats start as a 64-bit float, then do a test 1394 conversion to a 32-bit float; if the result is the same numeric 1395 value, use the shorter form and repeat the process with a test 1396 conversion to a 16-bit float. This also selects the 16-bit 1397 float for positive and negative Infinity.) 1399 * Indefinite-length items MUST NOT appear. They can be encoded as 1400 definite-length items instead. 1402 * The keys in every map MUST be sorted in the bytewise lexicographic 1403 order of their deterministic encodings. For example, the 1404 following keys are sorted correctly: 1406 1. 10, encoded as 0x0a. 1408 2. 100, encoded as 0x1864. 1410 3. -1, encoded as 0x20. 1412 4. "z", encoded as 0x617a. 1414 5.
"aa", encoded as 0x626161. 1416 6. [100], encoded as 0x811864. 1418 7. [-1], encoded as 0x8120. 1420 8. false, encoded as 0xf4. 1422 (Implementation note: the self-delimiting nature of the CBOR 1423 encoding means that there are no two well-formed CBOR encoded data 1424 items where one is a prefix of the other. The bytewise 1425 lexicographic comparison of deterministic encodings of different 1426 map keys therefore always ends in a position where the byte 1427 differs between the keys, before the end of a key is reached.) 1429 4.2.2. Additional Deterministic Encoding Considerations 1431 CBOR tags present additional considerations for deterministic 1432 encoding. If a CBOR-based protocol were to provide the same 1433 semantics for the presence and absence of a specific tag (e.g., by 1434 allowing both tag 1 data items and raw numbers in a date/time 1435 position, treating the latter as if they were tagged), the 1436 deterministic format would not allow the presence of the tag, based 1437 on the "shortest form" principle. For example, a protocol might give 1438 encoders the choice of representing a URL as either a text string or, 1439 using Section 3.4.5.3, tag number 32 containing a text string. This 1440 protocol's deterministic encoding needs to either require that the 1441 tag is present or require that it is absent, not allow either one. 1443 In a protocol that does require tags in certain places to obtain 1444 specific semantics, the tag needs to appear in the deterministic 1445 format as well. Deterministic encoding considerations also apply to 1446 the content of tags. 1448 If a protocol includes a field that can express integers with an 1449 absolute value of 2^64 or larger using tag numbers 2 or 3 1450 (Section 3.4.3), the protocol's deterministic encoding needs to 1451 specify whether smaller integers are also expressed using these tags 1452 or using major types 0 and 1. Preferred serialization uses the 1453 latter choice, which is therefore recommended. 
1455 Protocols that include floating-point values, whether represented 1456 using basic floating-point values (Section 3.3) or using tags (or 1457 both), may need to define extra requirements on their deterministic 1458 encodings, such as: 1460 * Although IEEE floating-point values can represent both positive 1461 and negative zero as distinct values, the application might not 1462 distinguish these and might decide to represent all zero values 1463 with a positive sign, disallowing negative zero. (The application 1464 may also want to restrict the precision of floating-point values 1465 in such a way that there is never a need to represent 64-bit -- or 1466 even 32-bit -- floating-point values.) 1468 * If a protocol includes a field that can express floating-point 1469 values, with a specific data model that declares integer and 1470 floating-point values to be interchangeable, the protocol's 1471 deterministic encoding needs to specify whether (for example) the 1472 integer 1.0 is encoded as 0x01 (unsigned integer), 0xf93c00 1473 (binary16), 0xfa3f800000 (binary32), or 0xfb3ff0000000000000 1474 (binary64). Example rules for this are: 1476 1. Encode integral values that fit in 64 bits as values from 1477 major types 0 and 1, and other values as the preferred 1478 (smallest of 16-, 32-, or 64-bit) floating-point 1479 representation that accurately represents the value, 1481 2. Encode all values as the preferred floating-point 1482 representation that accurately represents the value, even for 1483 integral values, or 1485 3. Encode all values as 64-bit floating-point representations. 1487 Rule 1 straddles the boundaries between integers and floating- 1488 point values, and Rule 3 does not use preferred serialization, so 1489 Rule 2 may be a good choice in many cases. 1491 * If NaN is an allowed value and there is no intent to support NaN 1492 payloads or signaling NaNs, the protocol needs to pick a single 1493 representation, typically 0xf97e00. 
If that simple choice is not 1494 possible, specific attention will be needed for NaN handling. 1496 * Subnormal numbers (nonzero numbers with the lowest possible 1497 exponent of a given IEEE 754 number format) may be flushed to zero 1498 outputs or be treated as zero inputs in some floating-point 1499 implementations. A protocol's deterministic encoding may want to 1500 specifically accommodate such implementations while creating an 1501 onus on other implementations, by excluding subnormal numbers from 1502 interchange, interchanging zero instead. 1504 * The same number can be represented by different decimal fractions, 1505 by different bigfloats, and by different forms under other tags 1506 that may be defined to express numeric values. Depending on the 1507 implementation, it may not always be practical to determine 1508 whether any of these forms (or forms in the basic generic data 1509 model) are equivalent. An application protocol that presents 1510 choices of this kind for the representation format of numbers 1511 needs to be explicit in how the formats are to be chosen for 1512 deterministic encoding. 1514 4.2.3. Length-first Map Key Ordering 1516 The core deterministic encoding requirements (Section 4.2.1) sort map 1517 keys in a different order from the one suggested by Section 3.9 of 1518 [RFC7049] (called "Canonical CBOR" there). Protocols that need to be 1519 compatible with [RFC7049]'s order can instead be specified in terms 1520 of this specification's "length-first core deterministic encoding 1521 requirements": 1523 A CBOR encoding satisfies the "length-first core deterministic 1524 encoding requirements" if it satisfies the core deterministic 1525 encoding requirements except that the keys in every map MUST be 1526 sorted such that: 1528 1. If two keys have different lengths, the shorter one sorts 1529 earlier; 1531 2. If two keys have the same length, the one with the lower value in 1532 (byte-wise) lexical order sorts earlier. 
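The bytewise ordering of Section 4.2.1 and the length-first ordering can be compared with a short Python sketch. It takes the eight example keys from Section 4.2.1 in their encoded form and sorts them both ways. The variable names are purely illustrative.

```python
# Deterministic encodings of the example keys from Section 4.2.1:
# 10, 100, -1, "z", "aa", [100], [-1], false.
encoded_keys = [
    bytes.fromhex(h)
    for h in ("0a", "1864", "20", "617a", "626161", "811864", "8120", "f4")
]

# Core deterministic encoding: plain bytewise lexicographic order.
# Python compares bytes objects in exactly this way.
core_order = sorted(encoded_keys)

# Length-first variant: shorter encodings sort earlier; encodings of the
# same length fall back to bytewise lexicographic order.
length_first_order = sorted(encoded_keys, key=lambda e: (len(e), e))

print([k.hex() for k in core_order])
print([k.hex() for k in length_first_order])
```

Both sorts work on the encoded bytes only, so an encoder can order map keys without interpreting their types.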
1534 For example, under the length-first core deterministic encoding 1535 requirements, the following keys are sorted correctly: 1537 1. 10, encoded as 0x0a. 1539 2. -1, encoded as 0x20. 1541 3. false, encoded as 0xf4. 1543 4. 100, encoded as 0x1864. 1545 5. "z", encoded as 0x617a. 1547 6. [-1], encoded as 0x8120. 1549 7. "aa", encoded as 0x626161. 1551 8. [100], encoded as 0x811864. 1553 (Although [RFC7049] used the term "Canonical CBOR" for its form of 1554 requirements on deterministic encoding, this document avoids this 1555 term because "canonicalization" is often associated with specific 1556 uses of deterministic encoding only. The terms are essentially 1557 interchangeable, however, and the set of core requirements in this 1558 document could also be called "Canonical CBOR", while the length- 1559 first-ordered version of that could be called "Old Canonical CBOR".) 1561 5. Creating CBOR-Based Protocols 1563 Data formats such as CBOR are often used in environments where there 1564 is no format negotiation. A specific design goal of CBOR is to not 1565 need any included or assumed schema: a decoder can take a CBOR item 1566 and decode it with no other knowledge. 1568 Of course, in real-world implementations, the encoder and the decoder 1569 will have a shared view of what should be in a CBOR data item. For 1570 example, an agreed-to format might be "the item is an array whose 1571 first value is a UTF-8 string, second value is an integer, and 1572 subsequent values are zero or more floating-point numbers" or "the 1573 item is a map that has byte strings for keys and contains a pair 1574 whose key is 0xab01". 1576 CBOR-based protocols MUST specify how their decoders handle invalid 1577 and other unexpected data. CBOR-based protocols MAY specify that 1578 they treat arbitrary valid data as unexpected. Encoders for CBOR- 1579 based protocols MUST produce only valid items, that is, the protocol 1580 cannot be designed to make use of invalid items. 
An encoder can be 1581 capable of encoding as many or as few types of values as is required 1582 by the protocol in which it is used; a decoder can be capable of 1583 understanding as many or as few types of values as is required by the 1584 protocols in which it is used. This lack of restrictions allows CBOR 1585 to be used in extremely constrained environments. 1587 The rest of this section discusses some considerations in creating 1588 CBOR-based protocols. With few exceptions, it is advisory only and 1589 explicitly excludes any language from BCP 14 other than words that 1590 could be interpreted as "MAY" in the sense of BCP 14. The exceptions 1591 aim at facilitating interoperability of CBOR-based protocols while 1592 making use of a wide variety of both generic and application-specific 1593 encoders and decoders. 1595 5.1. CBOR in Streaming Applications 1597 In a streaming application, a data stream may be composed of a 1598 sequence of CBOR data items concatenated back-to-back. In such an 1599 environment, the decoder immediately begins decoding a new data item 1600 if data is found after the end of a previous data item. 1602 Not all of the bytes making up a data item may be immediately 1603 available to the decoder; some decoders will buffer additional data 1604 until a complete data item can be presented to the application. 1605 Other decoders can present partial information about a top-level data 1606 item to an application, such as the nested data items that could 1607 already be decoded, or even parts of a byte string that hasn't 1608 completely arrived yet. Such an application also MUST have a 1609 matching streaming security mechanism, where the desired protection 1610 is available for incremental data presented to the application. 1612 Note that some applications and protocols will not want to use 1613 indefinite-length encoding. 
Using indefinite-length encoding allows 1614 an encoder to not need to marshal all the data for counting, but it 1615 requires a decoder to allocate increasing amounts of memory while 1616 waiting for the end of the item. This might be fine for some 1617 applications but not others. 1619 5.2. Generic Encoders and Decoders 1621 A generic CBOR decoder can decode all well-formed encoded CBOR data 1622 items and present the data items to an application. See Appendix C. 1623 (The diagnostic notation, Section 8, may be used to present well- 1624 formed CBOR values to humans.) 1626 Generic CBOR encoders provide an application interface that allows 1627 the application to specify any well-formed value to be encoded as a 1628 CBOR data item, including simple values and tags unknown to the 1629 encoder. 1631 Even though CBOR attempts to minimize these cases, not all well- 1632 formed CBOR data is valid: for example, the encoded text string 1633 "0x62c0ae" does not contain valid UTF-8 (because [RFC3629] requires 1634 always using the shortest form) and so is not a valid CBOR item. 1635 Also, specific tags may make semantic constraints that may be 1636 violated, for instance by a bignum tag enclosing another tag, or by 1637 an instance of tag number 0 containing a byte string, or containing a 1638 text string with contents that do not match [RFC3339]'s "date-time" 1639 production. There is no requirement that generic encoders and 1640 decoders make unnatural choices for their application interface to 1641 enable the processing of invalid data. Generic encoders and decoders 1642 are expected to forward simple values and tags even if their specific 1643 codepoints are not registered at the time the encoder/decoder is 1644 written (Section 5.4). 1646 5.3. Validity of Items 1648 A well-formed but invalid CBOR data item (Section 1.2) presents a 1649 problem with interpreting the data encoded in it in the CBOR data 1650 model. 
A CBOR-based protocol could be specified in several layers, 1651 in which the lower layers don't process the semantics of some of the 1652 CBOR data they forward. These layers can't notice any validity 1653 errors in data they don't process and MUST forward that data as-is. 1654 The first layer that does process the semantics of an invalid CBOR 1655 item MUST take one of two choices: 1657 1. Replace the problematic item with an error marker and continue 1658 with the next item, or 1660 2. Issue an error and stop processing altogether. 1662 A CBOR-based protocol MUST specify which of these options its 1663 decoders take, for each kind of invalid item they might encounter. 1665 Such problems might occur at the basic validity level of CBOR or in 1666 the context of tags (tag validity). 1668 5.3.1. Basic validity 1670 Two kinds of validity errors can occur in the basic generic data 1671 model: 1673 Duplicate keys in a map: Generic decoders (Section 5.2) make data 1674 available to applications using the native CBOR data model. That 1675 data model includes maps (key-value mappings with unique keys), 1676 not multimaps (key-value mappings where multiple entries can have 1677 the same key). Thus, a generic decoder that gets a CBOR map item 1678 that has duplicate keys will decode to a map with only one 1679 instance of that key, or it might stop processing altogether. On 1680 the other hand, a "streaming decoder" may not even be able to 1681 notice. See Section 5.6 for more discussion of keys in maps. 1683 Invalid UTF-8 string: A decoder might or might not want to verify 1684 that the sequence of bytes in a UTF-8 string (major type 3) is 1685 actually valid UTF-8 and react appropriately. 1687 5.3.2. 
Tag validity 1689 Two additional kinds of validity errors are introduced by adding tags 1690 to the basic generic data model: 1692 Inadmissible type for tag content: Tag numbers (Section 3.4) specify 1693 what type of data item is supposed to be used as their tag 1694 content; for example, the tag numbers for positive or negative 1695 bignums are supposed to be put on byte strings. A decoder that 1696 decodes the tagged data item into a native representation (a 1697 native big integer in this example) is expected to check the type 1698 of the data item being tagged. Even decoders that don't have such 1699 native representations available in their environment may perform 1700 the check on those tags known to them and react appropriately. 1702 Inadmissible value for tag content: The type of data item may be 1703 admissible for a tag's content, but the specific value may not be; 1704 e.g., a value of "yesterday" is not acceptable for the content of 1705 tag 0, even though it properly is a text string. A decoder that 1706 normally ingests such tags into equivalent platform types might 1707 present this tag to the application in a similar way to how it 1708 would present a tag with an unknown tag number (Section 5.4). 1710 5.4. Validity and Evolution 1712 A decoder with validity checking will expend the effort to reliably 1713 detect data items with validity errors. For example, such a decoder 1714 needs to have an API that reports an error (and does not return data) 1715 for a CBOR data item that contains any of the validity errors listed 1716 in the previous subsection. 1718 The set of tags defined in the tag registry (Section 9.2), as well as 1719 the set of simple values defined in the simple values registry 1720 (Section 9.1), can grow at any time beyond the set understood by a 1721 generic decoder. 
A validity-checking decoder can do one of two 1722 things when it encounters such a case that it does not recognize: 1724 * It can report an error (and not return data). Note that treating 1725 this case as an error can cause ossification, and is thus not 1726 encouraged. This error is not a validity error per se. This kind 1727 of error is more likely to be raised by a decoder that would be 1728 performing validity checking if this were a known case. 1730 * It can emit the unknown item (type, value, and, for tags, the 1731 decoded tagged data item) to the application calling the decoder, 1732 with an indication that the decoder did not recognize that tag 1733 number or simple value. 1735 The latter approach, which is also appropriate for decoders that do 1736 not support validity checking, provides forward compatibility with 1737 newly registered tags and simple values without the requirement to 1738 update the encoder at the same time as the calling application. (For 1739 this, the API for the decoder needs to have a way to mark unknown 1740 items so that the calling application can handle them in a manner 1741 appropriate for the program.) 1743 Since some of the processing needed for validity checking may have an 1744 appreciable cost (in particular with duplicate detection for maps), 1745 support of validity checking is not a requirement placed on all CBOR 1746 decoders. 1748 Some encoders will rely on their applications to provide input data 1749 in such a way that valid CBOR results from the encoder. A generic 1750 encoder may also want to provide a validity-checking mode where it 1751 reliably limits its output to valid CBOR, independent of whether or 1752 not its application is indeed providing API-conformant data. 1754 5.5. Numbers 1756 CBOR-based protocols should take into account that different language 1757 environments pose different restrictions on the range and precision 1758 of numbers that are representable. 
For example, the basic JavaScript 1759 number system treats all numbers as floating-point values, which may 1760 result in silent loss of precision in decoding integers with more 1761 than 53 significant bits. Another example is that, since CBOR keeps 1762 the sign bit for its integer representation in the major type, it has 1763 one bit more for signed numbers of a certain length (e.g., 1764 -2**64..2**64-1 for 1+8-byte integers) than the typical platform 1765 signed integer representation of the same length (-2**63..2**63-1 for 1766 8-byte int64_t). A protocol that uses numbers should define its 1767 expectations on the handling of non-trivial numbers in decoders and 1768 receiving applications. 1770 A CBOR-based protocol that includes floating-point numbers can 1771 restrict which of the three formats (half-precision, single- 1772 precision, and double-precision) are to be supported. For an 1773 integer-only application, a protocol may want to completely exclude 1774 the use of floating-point values. 1776 A CBOR-based protocol designed for compactness may want to exclude 1777 specific integer encodings that are longer than necessary for the 1778 application, such as to save the need to implement 64-bit integers. 1779 There is an expectation that encoders will use the most compact 1780 integer representation that can represent a given value. However, a 1781 compact application that does not require deterministic encoding 1782 should accept values that use a longer-than-needed encoding (such as 1783 encoding "0" as 0b000_11001 followed by two bytes of 0x00) as long as 1784 the application can decode an integer of the given size. Similar 1785 considerations apply to floating-point values; decoding both 1786 preferred serializations and longer-than-needed ones is recommended. 
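A decoder that tolerates longer-than-needed integer encodings might be sketched in Python as follows. The names decode_int, is_preferred, and fits_int64 are hypothetical. The sketch handles only major types 0 and 1 and assumes the whole item is already in memory. fits_int64 shows the range check that a platform with 64-bit signed integers would need, because CBOR integers span -2**64..2**64-1.

```python
def decode_int(data: bytes):
    """Decode a major type 0 or 1 item; return (value, bytes_consumed)."""
    if not data:
        raise ValueError("no data")
    major_type, additional_info = data[0] >> 5, data[0] & 0x1F
    if major_type not in (0, 1):
        raise ValueError("not an integer item")
    if additional_info < 24:
        argument, size = additional_info, 0
    elif additional_info <= 27:
        size = 1 << (additional_info - 24)  # 1, 2, 4, or 8 bytes
        if len(data) < 1 + size:
            raise ValueError("truncated argument")
        argument = int.from_bytes(data[1:1 + size], "big")
    else:
        raise ValueError("not well-formed for an integer")
    value = argument if major_type == 0 else -1 - argument
    return value, 1 + size


def is_preferred(data: bytes) -> bool:
    """True if the integer at the start of data uses the shortest argument."""
    value, consumed = decode_int(data)
    argument = value if value >= 0 else -1 - value
    if argument < 24:
        shortest = 1
    elif argument < 1 << 8:
        shortest = 2
    elif argument < 1 << 16:
        shortest = 3
    elif argument < 1 << 32:
        shortest = 5
    else:
        shortest = 9
    return consumed == shortest


def fits_int64(value: int) -> bool:
    """True if value is representable as a platform int64_t."""
    return -(1 << 63) <= value < (1 << 63)


# "0" encoded as 0b000_11001 followed by two bytes of 0x00
print(decode_int(bytes.fromhex("190000")))  # (0, 3)
```

A protocol that requires deterministic encoding could reject items for which is_preferred returns False. One that does not would simply use the decoded value.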
1788 CBOR-based protocols for constrained applications that provide a 1789 choice between representing a specific number as an integer and as a 1790 decimal fraction or bigfloat (such as when the exponent is small and 1791 non-negative), might express a quality-of-implementation expectation 1792 that the integer representation is used directly. 1794 5.6. Specifying Keys for Maps 1796 The encoding and decoding applications need to agree on what types of 1797 keys are going to be used in maps. In applications that need to 1798 interwork with JSON-based applications, conversion is simplified by 1799 limiting keys to text strings only; otherwise, there has to be a 1800 specified mapping from the other CBOR types to text strings, and this 1801 often leads to implementation errors. In applications where keys are 1802 numeric in nature and numeric ordering of keys is important to the 1803 application, directly using the numbers for the keys is useful. 1805 If multiple types of keys are to be used, consideration should be 1806 given to how these types would be represented in the specific 1807 programming environments that are to be used. For example, in 1808 JavaScript Maps [ECMA262], a key of integer 1 cannot be distinguished 1809 from a key of floating-point 1.0. This means that, if integer keys 1810 are used, the protocol needs to avoid use of floating-point keys the 1811 values of which happen to be integer numbers in the same map. 1813 Decoders that deliver data items nested within a CBOR data item 1814 immediately on decoding them ("streaming decoders") often do not keep 1815 the state that is necessary to ascertain uniqueness of a key in a 1816 map. Similarly, an encoder that can start encoding data items before 1817 the enclosing data item is completely available ("streaming encoder") 1818 may want to reduce its overhead significantly by relying on its data 1819 source to maintain uniqueness. 
1821 A CBOR-based protocol MUST define what to do when a receiving 1822 application does see multiple identical keys in a map. The resulting 1823 rule in the protocol MUST respect the CBOR data model: it cannot 1824 prescribe a specific handling of the entries with the identical keys, 1825 except that it might have a rule that having identical keys in a map 1826 indicates a malformed map and that the decoder has to stop with an 1827 error. When processing maps that exhibit entries with duplicate 1828 keys, a generic decoder might do one of the following: 1830 * Not accept maps with duplicate keys (that is, enforce validity for 1831 maps, see also Section 5.4). These generic decoders are 1832 universally useful. An application may still need to perform 1833 its own duplicate checking based on application rules (for 1834 instance, if the application equates integers and floating-point 1835 values in map key positions for specific maps). 1837 * Pass all map entries to the application, including ones with 1838 duplicate keys. This requires the application to handle (check 1839 against) duplicate keys, even if the application rules are 1840 identical to the generic data model rules. 1842 * Lose some entries with duplicate keys, e.g., by only delivering the 1843 final (or first) entry out of the entries with the same key. With 1844 such a generic decoder, applications may get different results for 1845 a specific key on different runs and with different generic 1846 decoders, because the value returned depends on the generic decoder 1847 implementation and the actual order of keys in the map. In 1848 particular, applications cannot validate key uniqueness on their 1849 own as they do not necessarily see all entries; they may not be 1850 able to use such a generic decoder if they do need to validate key 1851 uniqueness.
These generic decoders can only be used in situations 1852 where the data source and transfer can be relied upon to always 1853 provide valid maps; this is not possible if the data source and 1854 transfer can be attacked. 1856 Generic decoders need to document which of these three approaches 1857 they implement. 1859 The CBOR data model for maps does not allow ascribing semantics to 1860 the order of the key/value pairs in the map representation. Thus, a 1861 CBOR-based protocol MUST NOT specify that changing the key/value pair 1862 order in a map would change the semantics, except to specify that 1863 some orders are disallowed, for example where they would not meet the 1864 requirements of a deterministic encoding (Section 4.2). (Any 1865 secondary effects of map ordering such as on timing, cache usage, and 1866 other potential side channels are not considered part of the 1867 semantics but may be enough reason on their own for a protocol to 1868 require a deterministic encoding format.) 1870 Applications for constrained devices that have maps where a small 1871 number of frequently used keys can be identified should consider 1872 using small integers as keys; for instance, a set of 24 or fewer 1873 frequent keys can be encoded in a single byte as unsigned integers, 1874 up to 48 if negative integers are also used. Less frequently 1875 occurring keys can then use integers with longer encodings. 1877 5.6.1. Equivalence of Keys 1879 The specific data model applying to a CBOR data item is used to 1880 determine whether keys occurring in maps are duplicates or distinct. 1882 At the generic data model level, numerically equivalent integer and 1883 floating-point values are distinct from each other, as they are from 1884 the various big numbers (Tags 2 to 5). Similarly, text strings are 1885 distinct from byte strings, even if composed of the same bytes. A 1886 tagged value is distinct from an untagged value or from a value 1887 tagged with a different tag number. 
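Host languages often blur these distinctions, so a decoder may need an explicit notion of key identity. In Python, for instance, the integer 1 and the float 1.0 collide as dict keys. The sketch below derives a comparison key that keeps integers, floating-point values, text strings, byte strings, and tagged values apart, and uses it to reject duplicates. Tag, model_key, and reject_duplicates are hypothetical names. The sketch is an assumption-laden illustration that leaves out NaN keys and keys that are arrays or maps.

```python
from dataclasses import dataclass
from typing import Any, Hashable, Iterable, Tuple


@dataclass(frozen=True)
class Tag:  # hypothetical container for a tagged data item
    number: int
    content: Any


def model_key(key: Any) -> Hashable:
    """Return a value that compares equal only for equal generic-model keys."""
    if isinstance(key, Tag):
        return ("tag", key.number, model_key(key.content))
    if key is None or isinstance(key, bool):
        # Checked before int, because bool is a subclass of int in Python.
        return ("simple", key)
    if isinstance(key, int):
        return ("int", key)
    if isinstance(key, float):
        if key != key:
            raise ValueError("NaN keys are not handled in this sketch")
        return ("float", key)
    if isinstance(key, str):
        return ("text", key)
    if isinstance(key, bytes):
        return ("bytes", key)
    raise TypeError(f"unsupported key type: {type(key).__name__}")


def reject_duplicates(pairs: Iterable[Tuple[Any, Any]]) -> dict:
    """Collect map entries, raising ValueError on a duplicate key."""
    seen = {}
    for key, value in pairs:
        identity = model_key(key)
        if identity in seen:
            raise ValueError(f"duplicate map key: {key!r}")
        seen[identity] = (key, value)
    return seen


# A plain dict silently merges 1 and 1.0; the model keys keep them apart.
print(len(dict([(1, "int"), (1.0, "float")])))               # 1
print(len(reject_duplicates([(1, "int"), (1.0, "float")])))  # 2
```

An application whose specific data model equates integers and floating-point values would still need its own check on top of this one.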
1889 Within each of these groups, numeric values are distinct unless they 1890 are numerically equal (specifically, -0.0 is equal to 0.0); for the 1891 purpose of map key equivalence, NaN (not a number) values are 1892 equivalent if they have the same significand after zero-extending 1893 both significands at the right to 64 bits. 1895 (Byte and text) strings are compared byte by byte, arrays element by 1896 element, and are equal if they have the same number of bytes/elements 1897 and the same values at the same positions. Two maps are equal if 1898 they have the same set of pairs regardless of their order; pairs are 1899 equal if both the key and value are equal. 1901 Tagged values are equal if both the tag number and the tag content 1902 are equal. (Note that a generic decoder that provides processing for 1903 a specific tag may not be able to distinguish some semantically 1904 equivalent values, e.g. if leading zeroes occur in the content of tag 1905 2/3 (Section 3.4.3).) Simple values are equal if they simply have 1906 the same value. Nothing else is equal in the generic data model; a 1907 simple value 2 is not equivalent to an integer 2 and an array is 1908 never equivalent to a map. 1910 As discussed in Section 2.2, specific data models can make values 1911 equivalent for the purpose of comparing map keys that are distinct in 1912 the generic data model. Note that this implies that a generic 1913 decoder may deliver a decoded map to an application that needs to be 1914 checked for duplicate map keys by that application (alternatively, 1915 the decoder may provide a programming interface to perform this 1916 service for the application). Specific data models are not able to 1917 distinguish values for map keys that are equal for this purpose at 1918 the generic data model level. 1920 5.7. 
Undefined Values 1922 In some CBOR-based protocols, the simple value (Section 3.3) of 1923 Undefined might be used by an encoder as a substitute for a data item 1924 with an encoding problem, in order to allow the rest of the enclosing 1925 data items to be encoded without harm. 1927 6. Converting Data between CBOR and JSON 1929 This section gives non-normative advice about converting between CBOR 1930 and JSON. Implementations of converters MAY use whichever advice 1931 here they want. 1933 It is worth noting that a JSON text is a sequence of characters, not 1934 an encoded sequence of bytes, while a CBOR data item consists of 1935 bytes, not characters. 1937 6.1. Converting from CBOR to JSON 1939 Most of the types in CBOR have direct analogs in JSON. However, some 1940 do not, and someone implementing a CBOR-to-JSON converter has to 1941 consider what to do in those cases. The following non-normative 1942 advice deals with these by converting them to a single substitute 1943 value, such as a JSON null. 1945 * An integer (major type 0 or 1) becomes a JSON number. 1947 * A byte string (major type 2) that is not embedded in a tag that 1948 specifies a proposed encoding is encoded in base64url without 1949 padding and becomes a JSON string. 1951 * A UTF-8 string (major type 3) becomes a JSON string. Note that 1952 JSON requires escaping certain characters ([RFC8259], Section 7): 1953 quotation mark (U+0022), reverse solidus (U+005C), and the "C0 1954 control characters" (U+0000 through U+001F). All other characters 1955 are copied unchanged into the JSON UTF-8 string. 1957 * An array (major type 4) becomes a JSON array. 1959 * A map (major type 5) becomes a JSON object. This is possible 1960 directly only if all keys are UTF-8 strings. A converter might 1961 also convert other keys into UTF-8 strings (such as by converting 1962 integers into strings containing their decimal representation); 1963 however, doing so introduces a danger of key collision. 
Note also 1964 that, if tags on UTF-8 strings are ignored as proposed below, this 1965 will cause a key collision if the tags are different but the 1966 strings are the same. 1968 * False (major type 7, additional information 20) becomes a JSON 1969 false. 1971 * True (major type 7, additional information 21) becomes a JSON 1972 true. 1974 * Null (major type 7, additional information 22) becomes a JSON 1975 null. 1977 * A floating-point value (major type 7, additional information 25 1978 through 27) becomes a JSON number if it is finite (that is, it can 1979 be represented in a JSON number); if the value is non-finite (NaN, 1980 or positive or negative Infinity), it is represented by the 1981 substitute value. 1983 * Any other simple value (major type 7, any additional information 1984 value not yet discussed) is represented by the substitute value. 1986 * A bignum (major type 6, tag number 2 or 3) is represented by 1987 encoding its byte string in base64url without padding and becomes 1988 a JSON string. For tag number 3 (negative bignum), a "~" (ASCII 1989 tilde) is inserted before the base-encoded value. (The conversion 1990 to a binary blob instead of a number is to prevent a likely 1991 numeric overflow for the JSON decoder.) 1993 * A byte string with an encoding hint (major type 6, tag number 21 1994 through 23) is encoded as described by the hint and becomes a JSON 1995 string. 1997 * For all other tags (major type 6, any other tag number), the tag 1998 content is represented as a JSON value; the tag number is ignored. 2000 * Indefinite-length items are made definite before conversion. 2002 A CBOR-to-JSON converter may want to keep to the JSON profile I-JSON 2003 [RFC7493], to maximize interoperability and increase confidence that 2004 the JSON output can be processed with predictable results. 
For 2005 example, this has implications on the range of integers that can be 2006 represented reliably, as well as on the top-level items that may be 2007 supported by older JSON implementations. 2009 6.2. Converting from JSON to CBOR 2011 All JSON values, once decoded, directly map into one or more CBOR 2012 values. As with any kind of CBOR generation, decisions have to be 2013 made with respect to number representation. In a suggested 2014 conversion: 2016 * JSON numbers without fractional parts (integer numbers) are 2017 represented as integers (major types 0 and 1, possibly major type 2018 6 tag number 2 and 3), choosing the shortest form; integers longer 2019 than an implementation-defined threshold may instead be 2020 represented as floating-point values. The default range that is 2021 represented as integer is -2**53+1..2**53-1 (fully exploiting the 2022 range for exact integers in the binary64 representation often used 2023 for decoding JSON [RFC7493]). A CBOR-based protocol, or a generic 2024 converter implementation, may choose -2**32..2**32-1 or 2025 -2**64..2**64-1 (fully using the integer ranges available in CBOR 2026 with uint32_t or uint64_t, respectively) or even -2**31..2**31-1 2027 or -2**63..2**63-1 (using popular ranges for two's complement 2028 signed integers). (If the JSON was generated from a JavaScript 2029 implementation, its precision is already limited to 53 bits 2030 maximum.) 2032 * Numbers with fractional parts are represented as floating-point 2033 values, performing the decimal-to-binary conversion based on the 2034 precision provided by IEEE 754 binary64. The mathematical value 2035 of the JSON number is converted to binary64 using the 2036 roundTiesToEven procedure in Section 4.3.1 of [IEEE754]. 
Then, 2037 when encoding in CBOR, the preferred serialization uses the 2038 shortest floating-point representation exactly representing this 2039 conversion result; for instance, 1.5 is represented in a 16-bit 2040 floating-point value (not all implementations will be capable of 2041 efficiently finding the minimum form, though). Instead of using 2042 the default binary64 precision, there may be an implementation- 2043 defined limit to the precision of the conversion that will affect 2044 the precision of the represented values. Decimal representation 2045 should only be used on the CBOR side if that is specified in a 2046 protocol. 2048 CBOR has been designed to generally provide a more compact encoding 2049 than JSON. One implementation strategy that might come to mind is to 2050 perform a JSON-to-CBOR encoding in place in a single buffer. This 2051 strategy would need to carefully consider a number of pathological 2052 cases, such as that some strings represented with no or very few 2053 escapes and longer (or much longer) than 255 bytes may expand when 2054 encoded as UTF-8 strings in CBOR. Similarly, a few of the binary 2055 floating-point representations might cause expansion from some short 2056 decimal representations (1.1, 1e9) in JSON. This may be hard to get 2057 right, and any ensuing vulnerabilities may be exploited by an 2058 attacker. 2060 7. Future Evolution of CBOR 2062 Successful protocols evolve over time. New ideas appear, 2063 implementation platforms improve, related protocols are developed and 2064 evolve, and new requirements from applications and protocols are 2065 added. Facilitating protocol evolution is therefore an important 2066 design consideration for any protocol development. 2068 For protocols that will use CBOR, CBOR provides some useful 2069 mechanisms to facilitate their evolution. Best practices for this 2070 are well known, particularly from JSON format development of JSON- 2071 based protocols. 
Therefore, such best practices are outside the 2072 scope of this specification. 2074 However, facilitating the evolution of CBOR itself is very well 2075 within its scope. CBOR is designed to both provide a stable basis 2076 for development of CBOR-based protocols and to be able to evolve. 2077 Since a successful protocol may live for decades, CBOR needs to be 2078 designed for decades of use and evolution. This section provides 2079 some guidance for the evolution of CBOR. It is necessarily more 2080 subjective than other parts of this document. It is also necessarily 2081 incomplete, lest it turn into a textbook on protocol development. 2083 7.1. Extension Points 2085 In a protocol design, opportunities for evolution are often included 2086 in the form of extension points. For example, there may be a 2087 codepoint space that is not fully allocated from the outset, and the 2088 protocol is designed to tolerate and embrace implementations that 2089 start using more codepoints than initially allocated. 2091 Sizing the codepoint space may be difficult because the range 2092 required may be hard to predict. Protocol designs should attempt to 2093 make the codepoint space large enough so that it can slowly be filled 2094 over the intended lifetime of the protocol. 2096 CBOR has three major extension points: 2098 * the "simple" space (values in major type 7). Of the 24 efficient 2099 (and 224 slightly less efficient) values, only a small number have 2100 been allocated. Implementations receiving an unknown simple data 2101 item may easily be able to process it as such, given that the 2102 structure of the value is indeed simple. The IANA registry in 2103 Section 9.1 is the appropriate way to address the extensibility of 2104 this codepoint space. 2106 * the "tag" space (values in major type 6). The total codepoint 2107 space is abundant; only a tiny part of it has been allocated. 
2108 However, not all of these codepoints are equally efficient: the 2109 first 24 only consume a single ("1+0") byte, and half of them have 2110 already been allocated. The next 232 values only consume two 2111 ("1+1") bytes, with nearly a quarter already allocated. These 2112 subspaces need some curation to last for a few more decades. 2113 Implementations receiving an unknown tag number can choose to 2114 process just the enclosed tag content or, preferably, to process 2115 the tag as an unknown tag number wrapping the tag content. The 2116 IANA registry in Section 9.2 is the appropriate way to address the 2117 extensibility of this codepoint space. 2119 * the "additional information" space. An implementation receiving 2120 an unknown additional information value has no way to continue 2121 decoding, so allocating codepoints in this space is a major step 2122 beyond just exercising an extension point. There are also very 2123 few codepoints left. See also Section 7.2. 2125 7.2. Curating the Additional Information Space 2127 The human mind is sometimes drawn to filling in little perceived gaps 2128 to make something neat. We expect the remaining gaps in the 2129 codepoint space for the additional information values to be an 2130 attractor for new ideas, just because they are there. 2132 The present specification does not manage the additional information 2133 codepoint space by an IANA registry. Instead, allocations out of 2134 this space can only be done by updating this specification. 2136 For an additional information value of n >= 24, the size of the 2137 additional data typically is 2**(n-24) bytes. Therefore, additional 2138 information values 28 and 29 should be viewed as candidates for 2139 128-bit and 256-bit quantities, in case a need arises to add them to 2140 the protocol. 
Additional information value 30 is then the only 2141 additional information value available for general allocation, and 2142 there should be a very good reason for allocating it before assigning 2143 it through an update of the present specification. 2145 8. Diagnostic Notation 2147 CBOR is a binary interchange format. To facilitate documentation and 2148 debugging, and in particular to facilitate communication between 2149 entities cooperating in debugging, this section defines a simple 2150 human-readable diagnostic notation. All actual interchange always 2151 happens in the binary format. 2153 Note that this truly is a diagnostic format; it is not meant to be 2154 parsed. Therefore, no formal definition (as in ABNF) is given in 2155 this document. (Implementers looking for a text-based format for 2156 representing CBOR data items in configuration files may also want to 2157 consider YAML [YAML].) 2159 The diagnostic notation is loosely based on JSON as it is defined in 2160 RFC 8259, extending it where needed. 2162 The notation borrows the JSON syntax for numbers (integer and 2163 floating-point), True (>true<), False (>false<), Null (>null<), UTF-8 2164 strings, arrays, and maps (maps are called objects in JSON; the 2165 diagnostic notation extends JSON here by allowing any data item in 2166 the key position). Undefined is written >undefined< as in 2167 JavaScript. The non-finite floating-point numbers Infinity, 2168 -Infinity, and NaN are written exactly as in this sentence (this is 2169 also a way they can be written in JavaScript, although JSON does not 2170 allow them). 
A tag is written as an integer number for the tag 2171 number, followed by the tag content in parentheses; for instance, an 2172 RFC 3339 (ISO 8601) date could be notated as: 2174 0("2013-03-21T20:04:00Z") 2176 or the equivalent relative time as 2178 1(1363896240) 2180 Byte strings are notated in one of the base encodings, without 2181 padding, enclosed in single quotes, prefixed by >h< for base16, >b32< 2182 for base32, >h32< for base32hex, >b64< for base64 or base64url (the 2183 actual encodings do not overlap, so the string remains unambiguous). 2184 For example, the byte string 0x12345678 could be written h'12345678', 2185 b32'CI2FM6A', or b64'EjRWeA'. 2187 Unassigned simple values are given as "simple()" with the appropriate 2188 integer in the parentheses. For example, "simple(42)" indicates 2189 major type 7, value 42. 2191 A number of useful extensions to the diagnostic notation defined here 2192 are provided in Appendix G of [RFC8610], "Extended Diagnostic 2193 Notation" (EDN). Similarly, an extension of this notation could be 2194 provided in a separate document to provide for the documentation of 2195 NaN payloads, which are not covered in the present document. 2197 8.1. Encoding Indicators 2199 Sometimes it is useful to indicate in the diagnostic notation which 2200 of several alternative representations were actually used; for 2201 example, a data item written >1.5< by a diagnostic decoder might have 2202 been encoded as a half-, single-, or double-precision float. 2204 The convention for encoding indicators is that anything starting with 2205 an underscore and all following characters that are alphanumeric or 2206 underscore, is an encoding indicator, and can be ignored by anyone 2207 not interested in this information. For example, "_" or "_3". 2208 Encoding indicators are always optional. 
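As an aside on the byte string notation described earlier in Section 8, the following minimal sketch shows how a diagnostic tool might produce it. The function name diag_bytes is hypothetical. The sketch covers only the h, b32, and b64 prefixes (the latter using base64url) and omits h32.

```python
import base64


def diag_bytes(data, encoding="h"):
    """Render a byte string in diagnostic notation, without padding."""
    if encoding == "h":
        body = data.hex()
    elif encoding == "b32":
        body = base64.b32encode(data).decode("ascii").rstrip("=")
    elif encoding == "b64":
        # base64url alphabet; the padding characters are stripped.
        body = base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    else:
        raise ValueError(f"unsupported encoding prefix: {encoding}")
    return f"{encoding}'{body}'"
```

Applied to the byte string 0x12345678, this yields h'12345678', b32'CI2FM6A', and b64'EjRWeA', matching the examples given in the text.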
2210 A single underscore can be written after the opening brace of a map 2211 or the opening bracket of an array to indicate that the data item was 2212 represented in indefinite-length format. For example, [_ 1, 2] 2213 contains an indicator that an indefinite-length representation was 2214 used to represent the data item [1, 2]. 2216 An underscore followed by a decimal digit n indicates that the 2217 preceding item (or, for arrays and maps, the item starting with the 2218 preceding bracket or brace) was encoded with an additional 2219 information value of 24+n. For example, 1.5_1 is a half-precision 2220 floating-point number, while 1.5_3 is encoded as double precision. 2221 This encoding indicator is not shown in Appendix A. (Note that the 2222 encoding indicator "_" is thus an abbreviation of the full form "_7", 2223 which is not used.) 2225 The detailed chunk structure of byte and text strings of indefinite 2226 length can be notated in the form (_ h'0123', h'4567') and (_ "foo", 2227 "bar"). However, for an indefinite length string with no chunks 2228 inside, (_ ) would be ambiguous whether a byte string (0x5fff) or a 2229 text string (0x7fff) is meant and is therefore not used. The basic 2230 forms ''_ and ""_ can be used instead and are reserved for the case 2231 with no chunks only -- not as short forms for the (permitted, but not 2232 really useful) encodings with only empty chunks, which to preserve 2233 the chunk structure need to be notated as (_ ''), (_ ""), etc. 2235 9. IANA Considerations 2237 IANA has created two registries for new CBOR values. The registries 2238 are separate, that is, not under an umbrella registry, and follow the 2239 rules in [RFC8126]. IANA has also assigned a new MIME media type and 2240 an associated Constrained Application Protocol (CoAP) Content-Format 2241 entry. 2243 9.1. 
Simple Values Registry 2245 IANA has created the "Concise Binary Object Representation (CBOR) 2246 Simple Values" registry at [IANA.cbor-simple-values]. The initial 2247 values are shown in Table 4. 2249 New entries in the range 0 to 19 are assigned by Standards Action. 2250 It is suggested that these Standards Actions allocate values starting 2251 with the number 16 in order to reserve the lower numbers for 2252 contiguous blocks (if any). 2254 New entries in the range 32 to 255 are assigned by Specification 2255 Required. 2257 9.2. Tags Registry 2259 IANA has created the "Concise Binary Object Representation (CBOR) 2260 Tags" registry at [IANA.cbor-tags]. The tags that were defined in 2261 [RFC7049] are described in detail in Section 3.4, and other tags have 2262 already been defined since then. 2264 New entries in the range 0 to 23 ("1+0") are assigned by Standards 2265 Action. New entries in the ranges 24 to 255 ("1+1") and 256 to 32767 2266 (lower half of "1+2") are assigned by Specification Required. New 2267 entries in the range 32768 to 18446744073709551615 (upper half of 2268 "1+2", "1+4", and "1+8") are assigned by First Come First Served. 2269 The template for registration requests is: 2271 * Data item 2273 * Semantics (short form) 2275 In addition, First Come First Served requests should include: 2277 * Point of contact 2279 * Description of semantics (URL) -- This description is optional; 2280 the URL can point to something like an Internet-Draft or a web 2281 page. 2283 Applicants exercising the First Come First Served range and making a 2284 suggestion for a tag number that is not representable in 32 bits 2285 (i.e., larger than 4294967295) should be aware that this could reduce 2286 interoperability with implementations that do not support 64-bit 2287 numbers. 2289 9.3. 
Media Type ("MIME Type") 2291 The Internet media type [RFC6838] for a single encoded CBOR data item 2292 is application/cbor, as defined in [IANA.media-types]: 2294 Type name: application 2296 Subtype name: cbor 2298 Required parameters: n/a 2300 Optional parameters: n/a 2302 Encoding considerations: Binary 2304 Security considerations: See Section 10 of this document 2306 Interoperability considerations: n/a 2308 Published specification: This document 2310 Applications that use this media type: Many 2312 Additional information: 2313 * Magic number(s): n/a 2315 * File extension(s): .cbor 2317 * Macintosh file type code(s): n/a 2319 Person & email address to contact for further information: IETF CBOR 2320 Working Group cbor@ietf.org (mailto:cbor@ietf.org) or IETF 2321 Applications and Real-Time Area art@ietf.org (mailto:art@ietf.org) 2323 Intended usage: COMMON 2325 Restrictions on usage: none 2327 Author: IETF CBOR Working Group cbor@ietf.org (mailto:cbor@ietf.org) 2329 Change controller: The IESG iesg@ietf.org (mailto:iesg@ietf.org) 2331 9.4. CoAP Content-Format 2333 The CoAP Content-Format for CBOR is registered in 2334 [IANA.core-parameters]: 2336 Media Type: application/cbor 2337 Encoding: - 2339 Id: 60 2341 Reference: [RFCthis] 2343 9.5. The +cbor Structured Syntax Suffix Registration 2345 The Structured Syntax Suffix [RFC6838] for media types based on a 2346 single encoded CBOR data item is +cbor, as defined in 2347 [IANA.media-type-structured-suffix]: 2349 Name: Concise Binary Object Representation (CBOR) 2351 +suffix: +cbor 2353 References: [RFCthis] 2355 Encoding Considerations: CBOR is a binary format. 2357 Interoperability Considerations: n/a 2359 Fragment Identifier Considerations: The syntax and semantics of 2360 fragment identifiers specified for +cbor SHOULD be as specified 2361 for "application/cbor". (At publication of this document, there 2362 is no fragment identification syntax defined for "application/ 2363 cbor".) 
2365 The syntax and semantics for fragment identifiers for a specific 2366 "xxx/yyy+cbor" SHOULD be processed as follows: 2368 * For cases defined in +cbor, where the fragment identifier 2369 resolves per the +cbor rules, then process as specified in 2370 +cbor. 2372 * For cases defined in +cbor, where the fragment identifier does 2373 not resolve per the +cbor rules, then process as specified in 2374 "xxx/yyy+cbor". 2376 * For cases not defined in +cbor, then process as specified in 2377 "xxx/yyy+cbor". 2379 Security Considerations: See Section 10 of this document 2381 Contact: IETF CBOR Working Group cbor@ietf.org 2382 (mailto:cbor@ietf.org) or IETF Applications and Real-Time Area 2383 art@ietf.org (mailto:art@ietf.org) 2385 Author/Change Controller: The IESG iesg@ietf.org 2386 (mailto:iesg@ietf.org) 2388 10. Security Considerations 2390 A network-facing application can exhibit vulnerabilities in its 2391 processing logic for incoming data. Complex parsers are well known 2392 as a likely source of such vulnerabilities, such as the ability to 2393 remotely crash a node, or even remotely execute arbitrary code on it. 2394 CBOR attempts to narrow the opportunities for introducing such 2395 vulnerabilities by reducing parser complexity, by giving the entire 2396 range of encodable values a meaning where possible. 2398 Because CBOR decoders are often used as a first step in processing 2399 unvalidated input, they need to be fully prepared for all types of 2400 hostile input that may be designed to corrupt, overrun, or achieve 2401 control of the system decoding the CBOR data item. A CBOR decoder 2402 needs to assume that all input may be hostile even if it has been 2403 checked by a firewall, has come over a secure channel such as TLS, is 2404 encrypted or signed, or has come from some other source that is 2405 presumed trusted. 
2407 Section 4.1 gives examples of limitations in interoperability when 2408 using a constrained CBOR decoder with input from a CBOR encoder that 2409 uses a non-preferred serialization. When a single data item is 2410 consumed both by such a constrained decoder and a full decoder, it 2411 can lead to security issues that can be exploited by an attacker who 2412 can inject or manipulate content. 2414 As discussed throughout this document, there are many values that can 2415 be considered "equivalent" in some circumstances and "not equivalent" 2416 in others. As just one example, the numeric value for the number 2417 "one" might be expressed as an integer or a bignum. A system 2418 interpreting CBOR input might accept either form for the number 2419 "one", or might reject one (or both) forms. Such acceptance or 2420 rejection can have security implications in the program that is using 2421 the interpreted input. 2423 Hostile input may be constructed to overrun buffers, overflow or 2424 underflow integer arithmetic, or cause other decoding disruption. 2425 CBOR data items might have lengths or sizes that are intentionally 2426 extremely large or too short. Resource exhaustion attacks might 2427 attempt to lure a decoder into allocating very big data items 2428 (strings, arrays, maps, or even arbitrary precision numbers) or 2429 exhaust the stack depth by setting up deeply nested items. Decoders 2430 need to have appropriate resource management to mitigate these 2431 attacks. (Items for which very large sizes are given can also 2432 attempt to exploit integer overflow vulnerabilities.) 2433 A CBOR decoder, by definition, only accepts well-formed CBOR; this is 2434 the first step to its robustness. Input that is not well-formed CBOR 2435 causes no further processing from the point where the lack of well- 2436 formedness was detected. If possible, any data decoded up to this 2437 point should have no impact on the application using the CBOR 2438 decoder. 
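The resource-management concerns above can be illustrated with a minimal sketch of a decoder front end that walks one data item without allocating anything. It is not a complete decoder: the names NotWellFormed, read_head, skip_item, and the limit MAX_DEPTH are assumptions made for illustration, and indefinite-length items are rejected rather than handled. The point is that every length or count is checked against the remaining input before it is used, and that nesting depth is bounded.

```python
MAX_DEPTH = 32  # assumed nesting limit; a real decoder would make this configurable


class NotWellFormed(ValueError):
    """Raised as soon as the input cannot be well-formed CBOR (or is not handled here)."""


def read_head(data, pos):
    """Return (major_type, additional_info, argument, next_pos) for the item head at pos."""
    if pos >= len(data):
        raise NotWellFormed("unexpected end of input")
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        return major, info, info, pos
    if info > 27:
        raise NotWellFormed("reserved or indefinite-length head not handled in this sketch")
    size = 1 << (info - 24)  # 1, 2, 4, or 8 bytes of argument
    if pos + size > len(data):
        raise NotWellFormed("truncated argument")
    return major, info, int.from_bytes(data[pos:pos + size], "big"), pos + size


def skip_item(data, pos=0, depth=0):
    """Walk one definite-length data item and return the position just after it."""
    if depth > MAX_DEPTH:
        raise NotWellFormed("nesting too deep")
    major, info, arg, pos = read_head(data, pos)
    if major in (0, 1):  # integers carry no further content
        return pos
    if major == 7:  # simple values and floating-point numbers
        if info == 24 and arg < 32:
            raise NotWellFormed("two-byte simple value below 32")
        return pos
    if major in (2, 3):  # byte and text strings
        if arg > len(data) - pos:
            raise NotWellFormed("string length exceeds remaining input")
        return pos + arg
    if major == 6:  # tag: exactly one enclosed item
        return skip_item(data, pos, depth + 1)
    count = arg if major == 4 else 2 * arg  # arrays, or key/value pairs of maps
    if count > len(data) - pos:  # every item needs at least one byte
        raise NotWellFormed("item count exceeds remaining input")
    for _ in range(count):
        pos = skip_item(data, pos, depth + 1)
    return pos
```

Because the count check happens before the loop, a head such as 0x9bffffffffffffffff is rejected immediately instead of driving a very long iteration or a large allocation.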
2440 In addition to ascertaining well-formedness, a CBOR decoder might 2441 also perform validity checks on the CBOR data. Alternatively, it can 2442 leave those checks to the application using the decoder. This choice 2443 needs to be clearly documented in the decoder. Beyond the validity 2444 at the CBOR level, an application also needs to ascertain that the 2445 input is in alignment with the application protocol that is 2446 serialized in CBOR. 2448 The input check itself may consume resources. This is usually linear 2449 in the size of the input, which means that an attacker has to spend 2450 resources that are commensurate to the resources spent by the 2451 defender on input validation. However, an attacker might be able to 2452 craft inputs that will take longer for a target decoder to process 2453 than for the attacker to produce. Processing for arbitrary-precision 2454 numbers may exceed linear effort. Also, some hash-table 2455 implementations that are used by decoders to build in-memory 2456 representations of maps can be attacked to spend quadratic effort, 2457 unless a secret key (see Section 7 of [SIPHASH_LNCS], also 2458 [SIPHASH_OPEN]) or some other mitigation is employed. Such 2459 superlinear efforts can be exploited by an attacker to exhaust 2460 resources at or before the input validator; they therefore need to be 2461 avoided in a CBOR decoder implementation. Note that tag number 2462 definitions and their implementations can add security considerations 2463 of this kind; this should then be discussed in the security 2464 considerations of the tag number definition. 2466 CBOR encoders do not receive input directly from the network and are 2467 thus not directly attackable in the same way as CBOR decoders. 2468 However, CBOR encoders often have an API that takes input from 2469 another level in the implementation and can be attacked through that 2470 API. 
The design and implementation of that API should assume that the 2471 behavior of its caller may be based on hostile input or on coding 2472 mistakes. It should check inputs for buffer overruns, overflow and 2473 underflow of integer arithmetic, and other such errors that are aimed 2474 at disrupting the encoder. 2476 Protocols should be defined in such a way that potential multiple 2477 interpretations are reliably reduced to a single interpretation. For 2478 example, an attacker could make use of invalid input such as 2479 duplicate keys in maps, or exploit different precision in processing 2480 numbers to make one application base its decisions on a different 2481 interpretation than the one that will be used by a second 2482 application. To facilitate consistent interpretation, encoder and 2483 decoder implementations should provide a validity checking mode of 2484 operation (Section 5.4). Note, however, that a generic decoder 2485 cannot know about all requirements that an application poses on its 2486 input data; it therefore does not relieve the application of 2487 performing its own input checking. Also, since the set of defined 2488 tag numbers evolves, the application may employ a tag number that is 2489 not yet supported for validity checking by the generic decoder it 2490 uses. Generic decoders therefore need to document which 2491 tag numbers they support and what validity checking they can provide 2492 for each of them as well as for basic CBOR validity (UTF-8 checking, 2493 duplicate map key checking). 2495 Section 3.4.3 notes that using the non-preferred choice of a bignum 2496 representation instead of a basic integer for encoding a number is 2497 not intended to have application semantics, but it can have such 2498 semantics if an application receiving CBOR data is using a decoder in 2499 the basic generic data model. This disparity causes a security issue 2500 if the two sets of semantics differ. 
Thus, applications using CBOR 2501 need to specify the data model that they are using for each use of 2502 CBOR data. 2504 It is common to convert CBOR data to other formats. In many cases, 2505 CBOR has more expressive types than other formats; this is 2506 particularly true for the common conversion to JSON. The loss of 2507 type information can cause security issues for the systems that are 2508 processing the less-expressive data. 2510 Section 6.2 describes a possibly common usage scenario of converting 2511 between CBOR and JSON that could allow an attack if the attacker knows 2512 that the application is performing the conversion. 2514 Security considerations for the use of base16 and base64 from 2515 [RFC4648], and the use of UTF-8 from [RFC3629], are relevant to CBOR 2516 as well. 2518 11. References 2520 11.1. Normative References 2522 [C] International Organization for Standardization, 2523 "Information technology — Programming languages — C", ISO/ 2524 IEC 9899:2018, Fourth Edition, June 2018. 2526 [Cplusplus17] 2527 International Organization for Standardization, 2528 "Programming languages — C++", ISO/IEC 14882:2017, Fifth 2529 Edition, December 2017. 2531 [IEEE754] IEEE, "IEEE Standard for Floating-Point Arithmetic", IEEE 2532 Std 754-2019, DOI 10.1109/IEEESTD.2019.8766229, 2533 . 2535 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2536 Extensions (MIME) Part One: Format of Internet Message 2537 Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996, 2538 . 2540 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2541 Requirement Levels", BCP 14, RFC 2119, 2542 DOI 10.17487/RFC2119, March 1997, 2543 . 2545 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: 2546 Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002, 2547 . 2549 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 2550 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2551 2003, . 
2553 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2554 Resource Identifier (URI): Generic Syntax", STD 66, 2555 RFC 3986, DOI 10.17487/RFC3986, January 2005, 2556 . 2558 [RFC4287] Nottingham, M., Ed. and R. Sayre, Ed., "The Atom 2559 Syndication Format", RFC 4287, DOI 10.17487/RFC4287, 2560 December 2005, . 2562 [RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data 2563 Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006, 2564 . 2566 [RFC8126] Cotton, M., Leiba, B., and T. Narten, "Guidelines for 2567 Writing an IANA Considerations Section in RFCs", BCP 26, 2568 RFC 8126, DOI 10.17487/RFC8126, June 2017, 2569 . 2571 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2572 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2573 May 2017, . 2575 [TIME_T] The Open Group Base Specifications, "Open Group Standard: 2576 Vol. 1: Base Definitions, Issue 7", Section 4.16 'Seconds 2577 Since the Epoch', IEEE Std 1003.1, 2018 Edition, 2018, 2578 . 2581 11.2. Informative References 2583 [ASN.1] International Telecommunication Union, "Information 2584 Technology — ASN.1 encoding rules: Specification of Basic 2585 Encoding Rules (BER), Canonical Encoding Rules (CER) and 2586 Distinguished Encoding Rules (DER)", ITU-T Recommendation 2587 X.690, 1994. 2589 [BSON] Various, "BSON - Binary JSON", 2013, 2590 . 2592 [ECMA262] Ecma International, "ECMAScript 2018 Language 2593 Specification", ECMA Standard ECMA-262, 9th Edition, June 2594 2018, . 2598 [I-D.bormann-cbor-notable-tags] 2599 Bormann, C., "Notable CBOR Tags", Work in Progress, 2600 Internet-Draft, draft-bormann-cbor-notable-tags-02, 25 2601 June 2020, . 2604 [IANA.cbor-simple-values] 2605 IANA, "Concise Binary Object Representation (CBOR) Simple 2606 Values", 2607 . 2609 [IANA.cbor-tags] 2610 IANA, "Concise Binary Object Representation (CBOR) Tags", 2611 . 2613 [IANA.core-parameters] 2614 IANA, "Constrained RESTful Environments (CoRE) 2615 Parameters", 2616 . 
2618 [IANA.media-type-structured-suffix] 2619 IANA, "Structured Syntax Suffix Registry", 2620 . 2623 [IANA.media-types] 2624 IANA, "Media Types", 2625 . 2627 [MessagePack] 2628 Furuhashi, S., "MessagePack", 2013, . 2630 [PCRE] Ho, A., "PCRE - Perl Compatible Regular Expressions", 2631 2018, . 2633 [RFC0713] Haverty, J., "MSDTP-Message Services Data Transmission 2634 Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976, 2635 . 2637 [RFC6838] Freed, N., Klensin, J., and T. Hansen, "Media Type 2638 Specifications and Registration Procedures", BCP 13, 2639 RFC 6838, DOI 10.17487/RFC6838, January 2013, 2640 . 2642 [RFC7049] Bormann, C. and P. Hoffman, "Concise Binary Object 2643 Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049, 2644 October 2013, . 2646 [RFC7228] Bormann, C., Ersue, M., and A. Keranen, "Terminology for 2647 Constrained-Node Networks", RFC 7228, 2648 DOI 10.17487/RFC7228, May 2014, 2649 . 2651 [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, 2652 DOI 10.17487/RFC7493, March 2015, 2653 . 2655 [RFC7991] Hoffman, P., "The "xml2rfc" Version 3 Vocabulary", 2656 RFC 7991, DOI 10.17487/RFC7991, December 2016, 2657 . 2659 [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data 2660 Interchange Format", STD 90, RFC 8259, 2661 DOI 10.17487/RFC8259, December 2017, 2662 . 2664 [RFC8610] Birkholz, H., Vigano, C., and C. Bormann, "Concise Data 2665 Definition Language (CDDL): A Notational Convention to 2666 Express Concise Binary Object Representation (CBOR) and 2667 JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610, 2668 June 2019, . 2670 [RFC8618] Dickinson, J., Hague, J., Dickinson, S., Manderson, T., 2671 and J. Bond, "Compacted-DNS (C-DNS): A Format for DNS 2672 Packet Capture", RFC 8618, DOI 10.17487/RFC8618, September 2673 2019, . 2675 [RFC8742] Bormann, C., "Concise Binary Object Representation (CBOR) 2676 Sequences", RFC 8742, DOI 10.17487/RFC8742, February 2020, 2677 . 
2679 [RFC8746] Bormann, C., Ed., "Concise Binary Object Representation 2680 (CBOR) Tags for Typed Arrays", RFC 8746, 2681 DOI 10.17487/RFC8746, February 2020, 2682 . 2684 [SIPHASH_LNCS] 2685 Aumasson, J. and D. Bernstein, "SipHash: A Fast Short- 2686 Input PRF", Lecture Notes in Computer Science pp. 489-508, 2687 DOI 10.1007/978-3-642-34931-7_28, 2012, 2688 . 2690 [SIPHASH_OPEN] 2691 Aumasson, J. and D.J. Bernstein, "SipHash: a fast short- 2692 input PRF", . 2694 [YAML] Ben-Kiki, O., Evans, C., and I.d. Net, "YAML Ain't Markup 2695 Language (YAML[TM]) Version 1.2", 3rd Edition, October 2696 2009, . 2698 Appendix A. Examples of Encoded CBOR Data Items 2700 The following table provides some CBOR-encoded values in hexadecimal 2701 (right column), together with diagnostic notation for these values 2702 (left column). Note that the string "\u00fc" is one form of 2703 diagnostic notation for a UTF-8 string containing the single Unicode 2704 character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (u umlaut). 2705 Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a 2706 single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, often 2707 representing "water"), and "\ud800\udd51" is a UTF-8 string in 2708 diagnostic notation with a single character U+10151 (GREEK ACROPHONIC 2709 ATTIC FIFTY STATERS). (Note that all these single-character strings 2710 could also be represented in native UTF-8 in diagnostic notation, 2711 just not in an ASCII-only specification.) In the diagnostic notation 2712 provided for bignums, their intended numeric value is shown as a 2713 decimal number (such as 18446744073709551616) instead of showing a 2714 tagged byte string (such as 2(h'010000000000000000')). 
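As an illustration of how a few of the table rows come about, the following is a minimal Python sketch and not part of the specification. The function names (encode_head, encode_int, encode_text) are hypothetical. The sketch assumes preferred (shortest) head encoding and uses tags 2 and 3 for integers that do not fit into 64 bits, which is why the diagnostic notation can show a bignum as a plain decimal number. Note that the diagnostic-notation escape "\ud800\udd51" is a surrogate pair; in Python the same character is written "\U00010151".

```python
def encode_head(major_type: int, value: int) -> bytes:
    """Encode a CBOR head: 3-bit major type plus an unsigned argument."""
    if value < 24:
        return bytes([(major_type << 5) | value])
    # Additional information 24..27 announces a 1-, 2-, 4-, or 8-byte argument.
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < (1 << (8 * size)):
            return bytes([(major_type << 5) | ai]) + value.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode_int(n: int) -> bytes:
    """Encode an integer; fall back to a tag 2/3 bignum beyond 64 bits."""
    major_type, arg = (0, n) if n >= 0 else (1, -1 - n)
    if arg < (1 << 64):
        return encode_head(major_type, arg)
    # Bignum: tag 2 (non-negative) or tag 3 (negative) around a byte string.
    magnitude = arg.to_bytes((arg.bit_length() + 7) // 8, "big")
    tag = 2 if n >= 0 else 3
    return encode_head(6, tag) + encode_head(2, len(magnitude)) + magnitude


def encode_text(s: str) -> bytes:
    """Encode a text string as major type 3 with a UTF-8 payload."""
    payload = s.encode("utf-8")
    return encode_head(3, len(payload)) + payload


print(encode_int(18446744073709551616).hex())  # bignum row of the table
print(encode_text("\U00010151").hex())         # U+10151 as a UTF-8 string
```

The head length is chosen by value range only, so the same helper serves integers, string lengths, and tag numbers.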
2716 +==============================+====================================+ 2717 |Diagnostic | Encoded | 2718 +==============================+====================================+ 2719 |0 | 0x00 | 2720 +------------------------------+------------------------------------+ 2721 |1 | 0x01 | 2722 +------------------------------+------------------------------------+ 2723 |10 | 0x0a | 2724 +------------------------------+------------------------------------+ 2725 |23 | 0x17 | 2726 +------------------------------+------------------------------------+ 2727 |24 | 0x1818 | 2728 +------------------------------+------------------------------------+ 2729 |25 | 0x1819 | 2730 +------------------------------+------------------------------------+ 2731 |100 | 0x1864 | 2732 +------------------------------+------------------------------------+ 2733 |1000 | 0x1903e8 | 2734 +------------------------------+------------------------------------+ 2735 |1000000 | 0x1a000f4240 | 2736 +------------------------------+------------------------------------+ 2737 |1000000000000 | 0x1b000000e8d4a51000 | 2738 +------------------------------+------------------------------------+ 2739 |18446744073709551615 | 0x1bffffffffffffffff | 2740 +------------------------------+------------------------------------+ 2741 |18446744073709551616 | 0xc249010000000000000000 | 2742 +------------------------------+------------------------------------+ 2743 |-18446744073709551616 | 0x3bffffffffffffffff | 2744 +------------------------------+------------------------------------+ 2745 |-18446744073709551617 | 0xc349010000000000000000 | 2746 +------------------------------+------------------------------------+ 2747 |-1 | 0x20 | 2748 +------------------------------+------------------------------------+ 2749 |-10 | 0x29 | 2750 +------------------------------+------------------------------------+ 2751 |-100 | 0x3863 | 2752 +------------------------------+------------------------------------+ 2753 |-1000 | 0x3903e7 | 2754 
+------------------------------+------------------------------------+ 2755 |0.0 | 0xf90000 | 2756 +------------------------------+------------------------------------+ 2757 |-0.0 | 0xf98000 | 2758 +------------------------------+------------------------------------+ 2759 |1.0 | 0xf93c00 | 2760 +------------------------------+------------------------------------+ 2761 |1.1 | 0xfb3ff199999999999a | 2762 +------------------------------+------------------------------------+ 2763 |1.5 | 0xf93e00 | 2764 +------------------------------+------------------------------------+ 2765 |65504.0 | 0xf97bff | 2766 +------------------------------+------------------------------------+ 2767 |100000.0 | 0xfa47c35000 | 2768 +------------------------------+------------------------------------+ 2769 |3.4028234663852886e+38 | 0xfa7f7fffff | 2770 +------------------------------+------------------------------------+ 2771 |1.0e+300 | 0xfb7e37e43c8800759c | 2772 +------------------------------+------------------------------------+ 2773 |5.960464477539063e-8 | 0xf90001 | 2774 +------------------------------+------------------------------------+ 2775 |0.00006103515625 | 0xf90400 | 2776 +------------------------------+------------------------------------+ 2777 |-4.0 | 0xf9c400 | 2778 +------------------------------+------------------------------------+ 2779 |-4.1 | 0xfbc010666666666666 | 2780 +------------------------------+------------------------------------+ 2781 |Infinity | 0xf97c00 | 2782 +------------------------------+------------------------------------+ 2783 |NaN | 0xf97e00 | 2784 +------------------------------+------------------------------------+ 2785 |-Infinity | 0xf9fc00 | 2786 +------------------------------+------------------------------------+ 2787 |Infinity | 0xfa7f800000 | 2788 +------------------------------+------------------------------------+ 2789 |NaN | 0xfa7fc00000 | 2790 +------------------------------+------------------------------------+ 2791 |-Infinity | 0xfaff800000 
| 2792 +------------------------------+------------------------------------+ 2793 |Infinity | 0xfb7ff0000000000000 | 2794 +------------------------------+------------------------------------+ 2795 |NaN | 0xfb7ff8000000000000 | 2796 +------------------------------+------------------------------------+ 2797 |-Infinity | 0xfbfff0000000000000 | 2798 +------------------------------+------------------------------------+ 2799 |false | 0xf4 | 2800 +------------------------------+------------------------------------+ 2801 |true | 0xf5 | 2802 +------------------------------+------------------------------------+ 2803 |null | 0xf6 | 2804 +------------------------------+------------------------------------+ 2805 |undefined | 0xf7 | 2806 +------------------------------+------------------------------------+ 2807 |simple(16) | 0xf0 | 2808 +------------------------------+------------------------------------+ 2809 |simple(255) | 0xf8ff | 2810 +------------------------------+------------------------------------+ 2811 |0("2013-03-21T20:04:00Z") | 0xc074323031332d30332d32315432303a | 2812 | | 30343a30305a | 2813 +------------------------------+------------------------------------+ 2814 |1(1363896240) | 0xc11a514b67b0 | 2815 +------------------------------+------------------------------------+ 2816 |1(1363896240.5) | 0xc1fb41d452d9ec200000 | 2817 +------------------------------+------------------------------------+ 2818 |23(h'01020304') | 0xd74401020304 | 2819 +------------------------------+------------------------------------+ 2820 |24(h'6449455446') | 0xd818456449455446 | 2821 +------------------------------+------------------------------------+ 2822 |32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 | 2823 | | 616d706c652e636f6d | 2824 +------------------------------+------------------------------------+ 2825 |h'' | 0x40 | 2826 +------------------------------+------------------------------------+ 2827 |h'01020304' | 0x4401020304 | 2828 
+------------------------------+------------------------------------+ 2829 |"" | 0x60 | 2830 +------------------------------+------------------------------------+ 2831 |"a" | 0x6161 | 2832 +------------------------------+------------------------------------+ 2833 |"IETF" | 0x6449455446 | 2834 +------------------------------+------------------------------------+ 2835 |"\"\\" | 0x62225c | 2836 +------------------------------+------------------------------------+ 2837 |"\u00fc" | 0x62c3bc | 2838 +------------------------------+------------------------------------+ 2839 |"\u6c34" | 0x63e6b0b4 | 2840 +------------------------------+------------------------------------+ 2841 |"\ud800\udd51" | 0x64f0908591 | 2842 +------------------------------+------------------------------------+ 2843 |[] | 0x80 | 2844 +------------------------------+------------------------------------+ 2845 |[1, 2, 3] | 0x83010203 | 2846 +------------------------------+------------------------------------+ 2847 |[1, [2, 3], [4, 5]] | 0x8301820203820405 | 2848 +------------------------------+------------------------------------+ 2849 |[1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x98190102030405060708090a0b0c0d0e | 2850 |10, 11, 12, 13, 14, 15, 16, | 0f101112131415161718181819 | 2851 |17, 18, 19, 20, 21, 22, 23, | | 2852 |24, 25] | | 2853 +------------------------------+------------------------------------+ 2854 |{} | 0xa0 | 2855 +------------------------------+------------------------------------+ 2856 |{1: 2, 3: 4} | 0xa201020304 | 2857 +------------------------------+------------------------------------+ 2858 |{"a": 1, "b": [2, 3]} | 0xa26161016162820203 | 2859 +------------------------------+------------------------------------+ 2860 |["a", {"b": "c"}] | 0x826161a161626163 | 2861 +------------------------------+------------------------------------+ 2862 |{"a": "A", "b": "B", "c": "C",| 0xa5616161416162614261636143616461 | 2863 |"d": "D", "e": "E"} | 4461656145 | 2864 
+------------------------------+------------------------------------+ 2865 |(_ h'0102', h'030405') | 0x5f42010243030405ff | 2866 +------------------------------+------------------------------------+ 2867 |(_ "strea", "ming") | 0x7f657374726561646d696e67ff | 2868 +------------------------------+------------------------------------+ 2869 |[_ ] | 0x9fff | 2870 +------------------------------+------------------------------------+ 2871 |[_ 1, [2, 3], [_ 4, 5]] | 0x9f018202039f0405ffff | 2872 +------------------------------+------------------------------------+ 2873 |[_ 1, [2, 3], [4, 5]] | 0x9f01820203820405ff | 2874 +------------------------------+------------------------------------+ 2875 |[1, [2, 3], [_ 4, 5]] | 0x83018202039f0405ff | 2876 +------------------------------+------------------------------------+ 2877 |[1, [_ 2, 3], [4, 5]] | 0x83019f0203ff820405 | 2878 +------------------------------+------------------------------------+ 2879 |[_ 1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x9f0102030405060708090a0b0c0d0e0f | 2880 |10, 11, 12, 13, 14, 15, 16, | 101112131415161718181819ff | 2881 |17, 18, 19, 20, 21, 22, 23, | | 2882 |24, 25] | | 2883 +------------------------------+------------------------------------+ 2884 |{_ "a": 1, "b": [_ 2, 3]} | 0xbf61610161629f0203ffff | 2885 +------------------------------+------------------------------------+ 2886 |["a", {_ "b": "c"}] | 0x826161bf61626163ff | 2887 +------------------------------+------------------------------------+ 2888 |{_ "Fun": true, "Amt": -2} | 0xbf6346756ef563416d7421ff | 2889 +------------------------------+------------------------------------+ 2891 Table 6: Examples of Encoded CBOR Data Items 2893 Appendix B. Jump Table for Initial Byte 2895 For brevity, this jump table does not show initial bytes that are 2896 reserved for future extension. It also only shows a selection of the 2897 initial bytes that can be used for optional features. (All unsigned 2898 integers are in network byte order.) 
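The byte ranges in the jump table of this appendix follow from splitting the initial byte into a 3-bit major type and 5 bits of additional information, as the pseudocode in Appendix C does with "ib >> 5" and "ib & 0x1f". The following Python sketch is illustrative only; the names split_initial_byte and argument_length are hypothetical and do not appear in the specification.

```python
from typing import Optional, Tuple


def split_initial_byte(ib: int) -> Tuple[int, int]:
    """Return (major type, additional information) for an initial byte."""
    return ib >> 5, ib & 0x1F


def argument_length(ai: int) -> Optional[int]:
    """Number of bytes after the initial byte that carry the argument.

    Returns None for the reserved values 28..30 and for 31, which marks
    an indefinite-length item or the "break" stop code.
    """
    if ai < 24:
        return 0                # argument is the additional information itself
    if ai <= 27:
        return 1 << (ai - 24)   # 1, 2, 4, or 8 bytes in network byte order
    return None


print(split_initial_byte(0x1A), argument_length(26))  # four-byte unsigned integer
print(split_initial_byte(0xF9), argument_length(25))  # half-precision float
```

For major type 7, the same lengths apply to the floating-point encodings (0xf9, 0xfa, 0xfb carry 2, 4, and 8 bytes).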
2900 +============+================================================+ 2901 | Byte | Structure/Semantics | 2902 +============+================================================+ 2903 | 0x00..0x17 | Unsigned integer 0x00..0x17 (0..23) | 2904 +------------+------------------------------------------------+ 2905 | 0x18 | Unsigned integer (one-byte uint8_t follows) | 2906 +------------+------------------------------------------------+ 2907 | 0x19 | Unsigned integer (two-byte uint16_t follows) | 2908 +------------+------------------------------------------------+ 2909 | 0x1a | Unsigned integer (four-byte uint32_t follows) | 2910 +------------+------------------------------------------------+ 2911 | 0x1b | Unsigned integer (eight-byte uint64_t follows) | 2912 +------------+------------------------------------------------+ 2913 | 0x20..0x37 | Negative integer -1-0x00..-1-0x17 (-1..-24) | 2914 +------------+------------------------------------------------+ 2915 | 0x38 | Negative integer -1-n (one-byte uint8_t for n | 2916 | | follows) | 2917 +------------+------------------------------------------------+ 2918 | 0x39 | Negative integer -1-n (two-byte uint16_t for n | 2919 | | follows) | 2920 +------------+------------------------------------------------+ 2921 | 0x3a | Negative integer -1-n (four-byte uint32_t for | 2922 | | n follows) | 2923 +------------+------------------------------------------------+ 2924 | 0x3b | Negative integer -1-n (eight-byte uint64_t for | 2925 | | n follows) | 2926 +------------+------------------------------------------------+ 2927 | 0x40..0x57 | byte string (0x00..0x17 bytes follow) | 2928 +------------+------------------------------------------------+ 2929 | 0x58 | byte string (one-byte uint8_t for n, and then | 2930 | | n bytes follow) | 2931 +------------+------------------------------------------------+ 2932 | 0x59 | byte string (two-byte uint16_t for n, and then | 2933 | | n bytes follow) | 2934 
+------------+------------------------------------------------+ 2935 | 0x5a | byte string (four-byte uint32_t for n, and | 2936 | | then n bytes follow) | 2937 +------------+------------------------------------------------+ 2938 | 0x5b | byte string (eight-byte uint64_t for n, and | 2939 | | then n bytes follow) | 2940 +------------+------------------------------------------------+ 2941 | 0x5f | byte string, byte strings follow, terminated | 2942 | | by "break" | 2943 +------------+------------------------------------------------+ 2944 | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow) | 2945 +------------+------------------------------------------------+ 2946 | 0x78 | UTF-8 string (one-byte uint8_t for n, and then | 2947 | | n bytes follow) | 2948 +------------+------------------------------------------------+ 2949 | 0x79 | UTF-8 string (two-byte uint16_t for n, and | 2950 | | then n bytes follow) | 2951 +------------+------------------------------------------------+ 2952 | 0x7a | UTF-8 string (four-byte uint32_t for n, and | 2953 | | then n bytes follow) | 2954 +------------+------------------------------------------------+ 2955 | 0x7b | UTF-8 string (eight-byte uint64_t for n, and | 2956 | | then n bytes follow) | 2957 +------------+------------------------------------------------+ 2958 | 0x7f | UTF-8 string, UTF-8 strings follow, terminated | 2959 | | by "break" | 2960 +------------+------------------------------------------------+ 2961 | 0x80..0x97 | array (0x00..0x17 data items follow) | 2962 +------------+------------------------------------------------+ 2963 | 0x98 | array (one-byte uint8_t for n, and then n data | 2964 | | items follow) | 2965 +------------+------------------------------------------------+ 2966 | 0x99 | array (two-byte uint16_t for n, and then n | 2967 | | data items follow) | 2968 +------------+------------------------------------------------+ 2969 | 0x9a | array (four-byte uint32_t for n, and then n | 2970 | | data items follow) | 
2971 +------------+------------------------------------------------+ 2972 | 0x9b | array (eight-byte uint64_t for n, and then n | 2973 | | data items follow) | 2974 +------------+------------------------------------------------+ 2975 | 0x9f | array, data items follow, terminated by | 2976 | | "break" | 2977 +------------+------------------------------------------------+ 2978 | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow) | 2979 +------------+------------------------------------------------+ 2980 | 0xb8 | map (one-byte uint8_t for n, and then n pairs | 2981 | | of data items follow) | 2982 +------------+------------------------------------------------+ 2983 | 0xb9 | map (two-byte uint16_t for n, and then n pairs | 2984 | | of data items follow) | 2985 +------------+------------------------------------------------+ 2986 | 0xba | map (four-byte uint32_t for n, and then n | 2987 | | pairs of data items follow) | 2988 +------------+------------------------------------------------+ 2989 | 0xbb | map (eight-byte uint64_t for n, and then n | 2990 | | pairs of data items follow) | 2991 +------------+------------------------------------------------+ 2992 | 0xbf | map, pairs of data items follow, terminated by | 2993 | | "break" | 2994 +------------+------------------------------------------------+ 2995 | 0xc0 | Text-based date/time (data item follows; see | 2996 | | Section 3.4.1) | 2997 +------------+------------------------------------------------+ 2998 | 0xc1 | Epoch-based date/time (data item follows; see | 2999 | | Section 3.4.2) | 3000 +------------+------------------------------------------------+ 3001 | 0xc2 | Positive bignum (data item "byte string" | 3002 | | follows) | 3003 +------------+------------------------------------------------+ 3004 | 0xc3 | Negative bignum (data item "byte string" | 3005 | | follows) | 3006 +------------+------------------------------------------------+ 3007 | 0xc4 | Decimal Fraction (data item "array" follows; | 3008 | | see 
Section 3.4.4) | 3009 +------------+------------------------------------------------+ 3010 | 0xc5 | Bigfloat (data item "array" follows; see | 3011 | | Section 3.4.4) | 3012 +------------+------------------------------------------------+ 3013 | 0xc6..0xd4 | (tag) | 3014 +------------+------------------------------------------------+ 3015 | 0xd5..0xd7 | Expected Conversion (data item follows; see | 3016 | | Section 3.4.5.2) | 3017 +------------+------------------------------------------------+ 3018 | 0xd8..0xdb | (more tags; 1/2/4/8 bytes of tag number and | 3019 | | then a data item follow) | 3020 +------------+------------------------------------------------+ 3021 | 0xe0..0xf3 | (simple value) | 3022 +------------+------------------------------------------------+ 3023 | 0xf4 | False | 3024 +------------+------------------------------------------------+ 3025 | 0xf5 | True | 3026 +------------+------------------------------------------------+ 3027 | 0xf6 | Null | 3028 +------------+------------------------------------------------+ 3029 | 0xf7 | Undefined | 3030 +------------+------------------------------------------------+ 3031 | 0xf8 | (simple value, one byte follows) | 3032 +------------+------------------------------------------------+ 3033 | 0xf9 | Half-Precision Float (two-byte IEEE 754) | 3034 +------------+------------------------------------------------+ 3035 | 0xfa | Single-Precision Float (four-byte IEEE 754) | 3036 +------------+------------------------------------------------+ 3037 | 0xfb | Double-Precision Float (eight-byte IEEE 754) | 3038 +------------+------------------------------------------------+ 3039 | 0xff | "break" stop code | 3040 +------------+------------------------------------------------+ 3042 Table 7: Jump Table for Initial Byte 3044 Appendix C. Pseudocode 3046 The well-formedness of a CBOR item can be checked by the pseudocode 3047 in Figure 1. 
The data is well-formed if and only if: 3049 * the pseudocode does not "fail"; 3050 * after execution of the pseudocode, no bytes are left in the input 3051 (except in streaming applications) 3053 The pseudocode has the following prerequisites: 3055 * take(n) reads n bytes from the input data and returns them as a 3056 byte string. If n bytes are no longer available, take(n) fails. 3058 * uint() converts a byte string into an unsigned integer by 3059 interpreting the byte string in network byte order. 3061 * Arithmetic works as in C. 3063 * All variables are unsigned integers of sufficient range. 3065 Note that "well_formed" returns the major type for well-formed 3066 definite length items, but 99 for an indefinite length item (or -1 3067 for a "break" stop code, only if "breakable" is set). This is used 3068 in "well_formed_indefinite" to ascertain that indefinite length 3069 strings only contain definite length strings as chunks. 3071 well_formed(breakable = false) { 3072 // process initial bytes 3073 ib = uint(take(1)); 3074 mt = ib >> 5; 3075 val = ai = ib & 0x1f; 3076 switch (ai) { 3077 case 24: val = uint(take(1)); break; 3078 case 25: val = uint(take(2)); break; 3079 case 26: val = uint(take(4)); break; 3080 case 27: val = uint(take(8)); break; 3081 case 28: case 29: case 30: fail(); 3082 case 31: 3083 return well_formed_indefinite(mt, breakable); 3084 } 3085 // process content 3086 switch (mt) { 3087 // case 0, 1, 7 do not have content; just use val 3088 case 2: case 3: take(val); break; // bytes/UTF-8 3089 case 4: for (i = 0; i < val; i++) well_formed(); break; 3090 case 5: for (i = 0; i < val*2; i++) well_formed(); break; 3091 case 6: well_formed(); break; // 1 embedded data item 3092 case 7: if (ai == 24 && val < 32) fail(); // bad simple 3093 } 3094 return mt; // definite-length data item 3095 } 3097 well_formed_indefinite(mt, breakable) { 3098 switch (mt) { 3099 case 2: case 3: 3100 while ((it = well_formed(true)) != -1) 3101 if (it != mt) // need 
definite-length chunk 3102 fail(); // of same type 3103 break; 3104 case 4: while (well_formed(true) != -1); break; 3105 case 5: while (well_formed(true) != -1) well_formed(); break; 3106 case 7: 3107 if (breakable) 3108 return -1; // signal break out 3109 else fail(); // no enclosing indefinite 3110 default: fail(); // wrong mt 3111 } 3112 return 99; // indefinite-length data item 3113 } 3115 Figure 1: Pseudocode for Well-Formedness Check 3117 Note that the remaining complexity of a complete CBOR decoder is 3118 about presenting data that has been decoded to the application in an 3119 appropriate form. 3121 Major types 0 and 1 are designed in such a way that they can be 3122 encoded in C from a signed integer without actually doing an if-then- 3123 else for positive/negative (Figure 2). This uses the fact that 3124 (-1-n), the transformation for major type 1, is the same as ~n 3125 (bitwise complement) in C unsigned arithmetic; ~n can then be 3126 expressed as (-1)^n for the negative case, while 0^n leaves n 3127 unchanged for non-negative. The sign of a number can be converted to 3128 -1 for negative and 0 for non-negative (0 or positive) by arithmetic- 3129 shifting the number by one bit less than the bit length of the number 3130 (for example, by 63 for 64-bit numbers). 3132 void encode_sint(int64_t n) { 3133 uint64_t ui = n >> 63; // extend sign to whole length 3134 unsigned mt = ui & 0x20; // extract (shifted) major type 3135 ui ^= n; // complement negatives 3136 if (ui < 24) 3137 *p++ = mt + ui; 3138 else if (ui < 256) { 3139 *p++ = mt + 24; 3140 *p++ = ui; 3141 } else 3142 ... 3144 Figure 2: Pseudocode for Encoding a Signed Integer 3146 See Section 1.2 for some specific assumptions about the profile of 3147 the C language used in these pieces of code. 3149 Appendix D.
Half-Precision 3151 As half-precision floating-point numbers were only added to IEEE 754 3152 in 2008 [IEEE754], today's programming platforms often still only 3153 have limited support for them. Even without native support, it is 3154 very easy to include at least decoding support for them. An example of a 3155 small decoder for half-precision floating-point numbers in the C 3156 language is shown in Figure 3. A similar program for Python is in 3157 Figure 4; this code assumes that the 2-byte value has already been 3158 decoded as an (unsigned short) integer in network byte order (as 3159 would be done by the pseudocode in Appendix C). 3161 #include <math.h> 3163 double decode_half(unsigned char *halfp) { 3164 unsigned half = (halfp[0] << 8) + halfp[1]; 3165 unsigned exp = (half >> 10) & 0x1f; 3166 unsigned mant = half & 0x3ff; 3167 double val; 3168 if (exp == 0) val = ldexp(mant, -24); 3169 else if (exp != 31) val = ldexp(mant + 1024, exp - 25); 3170 else val = mant == 0 ? INFINITY : NAN; 3171 return half & 0x8000 ? -val : val; 3172 } 3174 Figure 3: C Code for a Half-Precision Decoder 3176 import struct 3177 from math import ldexp 3179 def decode_single(single): 3180 return struct.unpack("!f", struct.pack("!I", single))[0] 3182 def decode_half(half): 3183 valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16 3184 if ((half & 0x7c00) != 0x7c00): 3185 return ldexp(decode_single(valu), 112) 3186 return decode_single(valu | 0x7f800000) 3188 Figure 4: Python Code for a Half-Precision Decoder 3190 Appendix E. Comparison of Other Binary Formats to CBOR's Design 3191 Objectives 3193 The proposal for CBOR follows a history of binary formats that is as 3194 long as the history of computers themselves. Different formats have 3195 had different objectives. In most cases, the objectives of the 3196 format were never stated, although they can sometimes be inferred from 3197 the context where the format was first used.
Some formats were meant 3198 to be universally usable, although history has proven that no binary 3199 format meets the needs of all protocols and applications. 3201 CBOR differs from many of these formats in that it starts with a set 3202 of objectives and attempts to meet just those. This section 3203 compares a few of the dozens of formats with CBOR's objectives in 3204 order to help the reader decide if they want to use CBOR or a 3205 different format for a particular protocol or application. 3207 Note that the discussion here is not meant to be a criticism of any 3208 format: to the best of our knowledge, no format before CBOR was meant 3209 to cover CBOR's objectives in the priority order we have assigned them. A 3210 brief recap of the objectives from Section 1.1 is: 3212 1. unambiguous encoding of most common data formats from Internet 3213 standards 3215 2. code compactness for encoder or decoder 3217 3. no schema description needed 3219 4. reasonably compact serialization 3221 5. applicability to constrained and unconstrained applications 3223 6. good JSON conversion 3225 7. extensibility 3227 A discussion of CBOR and other formats with respect to a different 3228 set of design objectives is provided in Section 5 and Appendix C of 3229 [RFC8618]. 3231 E.1. ASN.1 DER, BER, and PER 3233 [ASN.1] has many serializations. In the IETF, DER and BER are the 3234 most common. The serialized output is not particularly compact for 3235 many items, and the code needed to decode numeric items can be 3236 complex on a constrained device. 3238 Few (if any) IETF protocols have adopted one of the several variants 3239 of Packed Encoding Rules (PER). There could be many reasons for 3240 this, but one that is commonly stated is that PER makes use of the 3241 schema even for parsing the surface structure of the data item, 3242 requiring significant tool support. There are different versions of 3243 the ASN.1 schema language in use, which has also hampered adoption. 3245 E.2.
MessagePack 3247 [MessagePack] is a concise, widely implemented counted binary 3248 serialization format, similar in many properties to CBOR, although 3249 somewhat less regular. While the data model can be used to represent 3250 JSON data, MessagePack has also been used in many remote procedure 3251 call (RPC) applications and for long-term storage of data. 3253 MessagePack has been essentially stable since it was first published 3254 around 2011; it has not yet had a transition. The evolution of 3255 MessagePack is impeded by an imperative to maintain complete 3256 backwards compatibility with existing stored data, while only a few 3257 bytecodes are still available for extension. Repeated requests over 3258 the years from the MessagePack user community to separate out binary 3259 and text strings in the encoding have recently led to an extension 3260 proposal that would leave MessagePack's "raw" data ambiguous between 3261 its usages for binary and text data. The extension mechanism for 3262 MessagePack remains unclear. 3264 E.3. BSON 3266 [BSON] is a data format that was developed for the storage of JSON- 3267 like maps (JSON objects) in the MongoDB database. Its major 3268 distinguishing feature is the capability for in-place update, which 3269 prevents a compact representation. BSON uses a counted 3270 representation except for map keys, which are null-byte terminated. 3271 While BSON can be used for the representation of JSON-like objects on 3272 the wire, its specification is dominated by the requirements of the 3273 database application and has become somewhat baroque. The status of 3274 how BSON extensions will be implemented remains unclear. 3276 E.4. MSDTP: RFC 713 3278 Message Services Data Transmission Protocol (MSDTP) is a very early example of 3279 a compact message format; it is described in [RFC0713], written in 3280 1976. It is included here for its historical value, not because it 3281 was ever widely used. 3283 E.5.
Conciseness on the Wire 3285 While CBOR's design objective of code compactness for encoders and 3286 decoders is a higher priority than its objective of conciseness on 3287 the wire, many people focus on the wire size. Table 8 shows some 3288 encoding examples for the simple nested array [1, [2, 3]]; where some 3289 form of indefinite-length encoding is supported by the encoding, 3290 [_ 1, [2, 3]] (indefinite length on the outer array) is also shown. 3292 +=============+============================+================+ 3293 | Format | [1, [2, 3]] | [_ 1, [2, 3]] | 3294 +=============+============================+================+ 3295 | RFC 713 | c2 05 81 c2 02 82 83 | | 3296 +-------------+----------------------------+----------------+ 3297 | ASN.1 BER | 30 0b 02 01 01 30 06 02 01 | 30 80 02 01 01 | 3298 | | 02 02 01 03 | 30 06 02 01 02 | 3299 | | | 02 01 03 00 00 | 3300 +-------------+----------------------------+----------------+ 3301 | MessagePack | 92 01 92 02 03 | | 3302 +-------------+----------------------------+----------------+ 3303 | BSON | 22 00 00 00 10 30 00 01 00 | | 3304 | | 00 00 04 31 00 13 00 00 00 | | 3305 | | 10 30 00 02 00 00 00 10 31 | | 3306 | | 00 03 00 00 00 00 00 | | 3307 +-------------+----------------------------+----------------+ 3308 | CBOR | 82 01 82 02 03 | 9f 01 82 02 03 | 3309 | | | ff | 3310 +-------------+----------------------------+----------------+ 3312 Table 8: Examples for Different Levels of Conciseness 3314 Appendix F. Well-formedness errors and examples 3316 There are three basic kinds of well-formedness errors that can occur 3317 in decoding a CBOR data item: 3319 * Too much data: There are input bytes left that were not consumed. 3320 This is only an error if the application assumed that the input 3321 bytes would span exactly one data item. 
Where the application
3322     uses the self-delimiting nature of CBOR encoding to permit
3323     additional data after the data item, as is for example done in
3324     CBOR sequences [RFC8742], the CBOR decoder can simply indicate
3325     what part of the input has not been consumed.

3327   * Too little data: The input data available would need additional
3328     bytes added at their end for a complete CBOR data item. This may
3329     indicate the input is truncated; it is also a common error when
3330     trying to decode random data as CBOR. For some applications,
3331     however, this may not actually be an error, as the application may
3332     not be certain it has all the data yet and can obtain or wait for
3333     additional input bytes. Some of these applications may have an
3334     upper limit for how much additional data can show up; here the
3335     decoder may be able to indicate that the encoded CBOR data item
3336     cannot be completed within this limit.

3338   * Syntax error: The input data are not consistent with the
3339     requirements of the CBOR encoding, and this cannot be remedied by
3340     adding (or removing) data at the end.

3342   In Appendix C, errors of the first kind are addressed in the first
3343   paragraph/bullet list (requiring "no bytes are left"), and errors of
3344   the second kind are addressed in the second paragraph/bullet list
3345   (failing "if n bytes are no longer available").
Errors of the third
3346   kind are identified in the pseudocode by specific instances of
3347   calling fail(), in order:

3349   * a reserved value is used for additional information (28, 29, 30)

3351   * major type 7, additional information 24, value < 32 (incorrect)

3353   * incorrect substructure of indefinite length byte/text string (may
3354     only contain definite length strings of the same major type)

3356   * "break" stop code (mt=7, ai=31) occurs in a value position of a
3357     map, or anywhere other than directly in an indefinite length item
3358     at a position where another enclosed data item could also occur

3360   * additional information 31 used with major type 0, 1, or 6

3362   F.1. Examples for CBOR data items that are not well-formed

3364   This subsection shows a few examples for CBOR data items that are not
3365   well-formed. Each example is a sequence of bytes each shown in
3366   hexadecimal; multiple examples in a list are separated by commas.

3368   Examples for well-formedness error kind 1 (too much data) can easily
3369   be formed by adding data to a well-formed encoded CBOR data item.

3371   Similarly, examples for well-formedness error kind 2 (too little
3372   data) can be formed by truncating a well-formed encoded CBOR data
3373   item. In test suites, it may be beneficial to specifically test with
3374   incomplete data items that would require large amounts of addition to
3375   be completed (for instance by starting the encoding of a string of a
3376   very large size).

3378   A premature end of the input can occur in a head or within the
3379   enclosed data, which may be bare strings or enclosed data items that
3380   are either counted or should have been ended by a "break" stop code.
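
The three error kinds can be told apart mechanically. The following is a minimal Python sketch, assuming a recursive check modeled on the structure this appendix attributes to the Appendix C pseudocode (a take() step that fails when bytes run out, and explicit fail() points for syntax errors). The names Reader, TooLittleData, NotWellFormed, and classify are illustrative only and are not defined by this document.

```python
class TooLittleData(Exception):
    """Input ended before the data item was complete (error kind 2)."""


class NotWellFormed(Exception):
    """Input can never become well-formed by adding data (error kind 3)."""


BREAK = -1        # a "break" stop code was consumed
INDEFINITE = 99   # an indefinite-length item was consumed


class Reader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        # Check the bound first so absurdly large lengths never allocate.
        if self.pos + n > len(self.data):
            raise TooLittleData()
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk


def well_formed(r: Reader, breakable: bool = False) -> int:
    ib = r.take(1)[0]
    mt, ai = ib >> 5, ib & 0x1F
    val = ai
    if 24 <= ai <= 27:
        val = int.from_bytes(r.take(1 << (ai - 24)), "big")
        if mt == 7 and ai == 24 and val < 32:
            raise NotWellFormed("two-byte simple value below 32")
    elif 28 <= ai <= 30:
        raise NotWellFormed("reserved additional information")
    elif ai == 31:
        return well_formed_indefinite(r, mt, breakable)
    if mt in (2, 3):              # byte/text string payload
        r.take(val)
    elif mt == 4:                 # array: val items
        for _ in range(val):
            well_formed(r)
    elif mt == 5:                 # map: val key/value pairs
        for _ in range(2 * val):
            well_formed(r)
    elif mt == 6:                 # tag: one enclosed item
        well_formed(r)
    return mt


def well_formed_indefinite(r: Reader, mt: int, breakable: bool) -> int:
    if mt in (2, 3):
        # Chunks must be definite-length strings of the same major type.
        while (it := well_formed(r, True)) != BREAK:
            if it != mt:
                raise NotWellFormed("bad chunk in indefinite-length string")
    elif mt == 4:
        while well_formed(r, True) != BREAK:
            pass
    elif mt == 5:
        # A break is only acceptable in a key position, never a value.
        while well_formed(r, True) != BREAK:
            well_formed(r)
    elif mt == 7:
        if breakable:
            return BREAK
        raise NotWellFormed("break outside an indefinite-length item")
    else:
        raise NotWellFormed("additional information 31 with mt 0, 1, or 6")
    return INDEFINITE


def classify(data: bytes) -> str:
    r = Reader(data)
    try:
        well_formed(r)
    except TooLittleData:
        return "too little data"
    except NotWellFormed:
        return "syntax error"
    return "well-formed" if r.pos == len(data) else "too much data"
```

With this sketch, classify(bytes.fromhex("8201820203")) reports the CBOR encoding from Table 8 as well-formed, the same bytes with 00 appended as "too much data", and the same bytes with the final 03 removed as "too little data". Whether leftover bytes are treated as an error remains an application decision, as described above.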

3382   * End of input in a head: 18, 19, 1a, 1b, 19 01, 1a 01 02, 1b 01 02
3383     03 04 05 06 07, 38, 58, 78, 98, 9a 01 ff 00, b8, d8, f8, f9 00, fa
3384     00 00, fb 00 00 00

3386   * Definite length strings with short data: 41, 61, 5a ff ff ff ff
3387     00, 5b ff ff ff ff ff ff ff ff 01 02 03, 7a ff ff ff ff 00, 7b 7f
3388     ff ff ff ff ff ff ff 01 02 03

3390   * Definite length maps and arrays not closed with enough items: 81,
3391     81 81 81 81 81 81 81 81 81, 82 00, a1, a2 01 02, a1 00, a2 00 00
3392     00

3394   * Tag number not followed by tag content: c0

3396   * Indefinite length strings not closed by a "break" stop code: 5f 41
3397     00, 7f 61 00

3399   * Indefinite length maps and arrays not closed by a "break" stop
3400     code: 9f, 9f 01 02, bf, bf 01 02 01 02, 81 9f, 9f 80 00, 9f 9f 9f
3401     9f 9f ff ff ff ff, 9f 81 9f 81 9f 9f ff ff ff

3403   A few examples for the five subkinds of well-formedness error kind 3
3404   (syntax error) are shown below.

3406   Subkind 1:

3408   * Reserved additional information values: 1c, 1d, 1e, 3c, 3d, 3e,
3409     5c, 5d, 5e, 7c, 7d, 7e, 9c, 9d, 9e, bc, bd, be, dc, dd, de, fc,
3410     fd, fe

3412   Subkind 2:

3414   * Reserved two-byte encodings of simple values: f8 00, f8 01, f8 18,
3415     f8 1f

3417   Subkind 3:

3419   * Indefinite length string chunks not of the correct type: 5f 00 ff,
3420     5f 21 ff, 5f 61 00 ff, 5f 80 ff, 5f a0 ff, 5f c0 00 ff, 5f e0 ff,
3421     7f 41 00 ff

3423   * Indefinite length string chunks not definite length: 5f 5f 41 00
3424     ff ff, 7f 7f 61 00 ff ff

3426   Subkind 4:

3428   * Break occurring on its own outside of an indefinite length item:
3429     ff

3431   * Break occurring in a definite length array or map or a tag: 81 ff,
3432     82 00 ff, a1 ff, a1 ff 00, a1 00 ff, a2 00 00 ff, 9f 81 ff, 9f 82
3433     9f 81 9f 9f ff ff ff ff

3435   * Break in indefinite length map would lead to odd number of items
3436     (break in a value position): bf 00 ff, bf 00 00 00 ff

3438   Subkind 5:

3440   * Major type 0, 1, 6 with additional information 31: 1f, 3f, df

3442   Appendix G.
Changes from RFC 7049

3444   As discussed in the introduction, this document is a revised edition
3445   of RFC 7049, with editorial improvements, added detail, and fixed
3446   errata. This document formally obsoletes RFC 7049, while keeping
3447   full compatibility of the interchange format from RFC 7049. This
3448   document does not create a new version of the format.

3450   G.1. Errata processing, clerical changes

3452   The two verified errata on RFC 7049, EID 3764 and EID 3770, concerned
3453   two encoding examples in the text that have been corrected
3454   (Section 3.4.3: "29" -> "49", Section 5.5: "0b000_11101" ->
3455   "0b000_11001"). Also, RFC 7049 contained an example using the
3456   numeric value 24 for a simple value (EID 5917), which is not well-
3457   formed; this example has been removed. Errata report 5763 pointed to
3458   an accident in the wording of the definition of tags; this was
3459   resolved during a re-write of Section 3.4. Errata report 5434
3460   pointed out that the UBJSON example in Appendix E no longer complied
3461   with the version of UBJSON current at the time of submitting the
3462   report. It turned out that the UBJSON specification had completely
3463   changed since 2013; this example therefore also was removed. Further
3464   errata reports (4409, 4963, 4964) complained that the map key sorting
3465   rules for canonical encoding were onerous; these led to a
3466   reconsideration of the canonical encoding suggestions and replacement
3467   by the deterministic encoding suggestions (described below). An
3468   editorial suggestion in errata report 4294 was also implemented
3469   (improved symmetry by adding "Second value" to a comment to the last
3470   example in Section 3.2.2).

3472   Other more clerical changes include:

3474   * use of new RFCXML functionality [RFC7991];

3476   * explain some more of the notation used;

3478   * updated references, e.g.
for RFC4627 to [RFC8259] in many places,
3479     for CNN-TERMS to [RFC7228]; added missing reference to [IEEE754]
3480     (importing required definitions) and updated to [ECMA262]; added a
3481     reference to [RFC8618] that further illustrates the discussion in
3482     Appendix E;

3484   * the discussion of diagnostic notation mentions the "Extended
3485     Diagnostic Notation" (EDN) defined in [RFC8610] as well as the gap
3486     diagnostic notation has in representing NaN payloads; an
3487     explanation was added on how to represent indefinite length
3488     strings with no chunks;

3490   * the addition of this appendix.

3492   G.2. Changes in IANA considerations

3494   The IANA considerations were generally updated (clerical changes,
3495   e.g., now pointing to the CBOR working group as the author of the
3496   specification). References to the respective IANA registries have
3497   been added to the informative references.

3499   Tags in the space from 256 to 32767 (lower half of "1+2") are no
3500   longer assigned by First Come First Served; this range is now
3501   Specification Required.

3503   G.3. Changes in suggestions and other informational components

3505   In revising the document, beyond processing errata reports, the WG
3506   could use nearly seven years of experience with the use of CBOR in a
3507   diverse set of applications. This led to a number of editorial
3508   changes, including adding tables for illustration, but also to
3509   emphasizing some aspects and de-emphasizing others.

3511   A significant addition in this revision is Section 2, which discusses
3512   the CBOR data model and its small variations involved in the
3513   processing of CBOR. Introducing terms for those (basic generic,
3514   extended generic, specific) enables more concise language in other
3515   places of the document, but also helps in clarifying expectations on
3516   implementations and on the extensibility features of the format.

3518   RFC 7049, as a format derived from the JSON ecosystem, was influenced
3519   by the JSON number system that was in turn inherited from JavaScript
3520   at the time. JSON does not provide distinct integers and floating-
3521   point values (and the latter are decimal in the format). CBOR
3522   provides binary representations of numbers, which do differ between
3523   integers and floating-point values. Experience from implementation
3524   and use now suggested that the separation between these two number
3525   domains should be more clearly drawn in the document; language that
3526   suggested an integer could seamlessly stand in for a floating-point
3527   value was removed. Also, a suggestion (based on I-JSON [RFC7493])
3528   was added for handling these types when converting JSON to CBOR, and
3529   the use of a specific rounding mechanism has been recommended.

3531   For a single value in the data model, CBOR often provides multiple
3532   encoding options. The revision adds a new Section 4, which
3533   first introduces the term "preferred serialization" (Section 4.1) and
3534   defines it for various kinds of data items. On the basis of this
3535   terminology, the section goes on to discuss how a CBOR-based protocol
3536   can define "deterministic encoding" (Section 4.2), which now avoids
3537   the RFC 7049 terms "canonical" and "canonicalization". The
3538   suggestion of "Core Deterministic Encoding Requirements"
3539   (Section 4.2.1) enables generic support for such protocol-defined
3540   encoding requirements. The present revision further eases the
3541   implementation of deterministic encoding by simplifying the map
3542   ordering suggested in RFC 7049 to simple lexicographic ordering of
3543   encoded keys. A description of the older suggestion is kept as an
3544   alternative, now termed "length-first map key ordering"
3545   (Section 4.2.3).
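
The difference between the two map key orderings can be illustrated with a short Python sketch. It assumes the keys have already been encoded; the byte strings below are the CBOR encodings of 10, 100, -1, "z", "aa", [100], [-1], and false, and the function names are illustrative rather than taken from this document.

```python
def bytewise_lexicographic(encoded_keys):
    """Order encoded keys by plain bytewise lexicographic comparison."""
    # Python compares bytes objects byte by byte as unsigned values.
    return sorted(encoded_keys)


def length_first(encoded_keys):
    """Older ordering: shorter encodings first, then bytewise order."""
    return sorted(encoded_keys, key=lambda k: (len(k), k))


# Encoded keys in arbitrary input order.
keys = [bytes.fromhex(h) for h in
        ("f4", "8120", "811864", "626161", "617a", "20", "1864", "0a")]

for name, order in (("bytewise", bytewise_lexicographic),
                    ("length-first", length_first)):
    print(name, [k.hex() for k in order(keys)])
```

Under the bytewise ordering, the encoded key 1864 (100) sorts before 20 (-1); under the length-first ordering, all one-byte encodings (0a, 20, f4) come before any two-byte encoding. The bytewise variant needs no knowledge of the key lengths beyond the comparison itself, which is one way to read the "eases the implementation" remark above.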

3547   The terminology for well-formed and valid data was sharpened and more
3548   stringently used, avoiding less well-defined alternative terms such
3549   as "syntax error", "decoding error" and "strict mode" outside
3550   examples. Also, a third level of requirements beyond CBOR-level
3551   validity that an application has on its input data is now explicitly
3552   called out. Well-formed (processable at all), valid (checked by a
3553   validity-checking generic decoder), and expected input (as checked by
3554   the application) are treated as a hierarchy of layers of
3555   acceptability.

3557   The handling of non-well-formed simple values was clarified in text
3558   and pseudocode. Appendix F was added to discuss well-formedness
3559   errors and provide examples for them. The pseudocode was updated to
3560   be more portable and some portability considerations were added.

3562   The discussion of validity has been sharpened in two areas. Map
3563   validity (handling of duplicate keys) was clarified and the domain of
3564   applicability of certain implementation choices explained. Also,
3565   while streamlining the terminology for tags, tag numbers, and tag
3566   content, discussion was added on tag validity, and the restrictions
3567   were clarified on tag content, in general and specifically for tag 1.

3569   An implementation note (and note for future tag definitions) was
3570   added to Section 3.4 about defining tags with semantics that depend
3571   on serialization order.

3573   Tag 35 is no longer defined in this updated document; the
3574   registration based on the definition in RFC 7049 remains in place.

3576   Terminology was introduced in Section 3 for "argument" and "head",
3577   simplifying further discussion.

3579   The security considerations were mostly rewritten and significantly
3580   expanded; in multiple other places, the document is now more explicit
3581   that a decoder cannot simply condone well-formedness errors.

3583   Acknowledgements

3585   CBOR was inspired by MessagePack.
MessagePack was developed and
3586   promoted by Sadayuki Furuhashi ("frsyuki"). This reference to
3587   MessagePack is solely for attribution; CBOR is not intended as a
3588   version of or replacement for MessagePack, as it has different design
3589   goals and requirements.

3591   The need for functionality beyond the original MessagePack
3592   Specification became obvious to many people at about the same time
3593   around the year 2012. BinaryPack is a minor derivation of
3594   MessagePack that was developed by Eric Zhang for the binaryjs
3595   project. A similar, but different, extension was made by Tim Caswell
3596   for his msgpack-js and msgpack-js-browser projects. Many people have
3597   contributed to the discussion about extending MessagePack to separate
3598   text string representation from byte string representation.

3600   The encoding of the additional information in CBOR was inspired by
3601   the encoding of length information designed by Klaus Hartke for CoAP.

3603   This document also incorporates suggestions made by many people,
3604   notably Dan Frost, James Manger, Jeffrey Yasskin, Joe Hildebrand,
3605   Keith Moore, Laurence Lundblade, Matthew Lepinski, Michael
3606   Richardson, Nico Williams, Peter Occil, Phillip Hallam-Baker, Ray
3607   Polk, Stuart Cheshire, Tim Bray, Tony Finch, Tony Hansen, and Yaron
3608   Sheffer. Benjamin Kaduk provided an extensive review during IESG
3609   processing. Éric Vyncke, Erik Kline, Robert Wilton, and Roman Danyliw
3610   provided further IESG comments, which included an IoT directorate
3611   review by Eve Schooler.

3613   Authors' Addresses

3615   Carsten Bormann
3616   Universitaet Bremen TZI
3617   Postfach 330440
3618   D-28359 Bremen
3619   Germany

3621   Phone: +49-421-218-63921
3622   Email: cabo@tzi.org
3623   Paul Hoffman
3624   ICANN

3626   Email: paul.hoffman@icann.org