idnits 2.17.00 (12 Aug 2021)

/tmp/idnits32276/draft-fielding-uri-rfc2396bis-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC2732, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC2396, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC1808, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC1738, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 618 has weird spacing: '... query frag...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 16, 2004) is 6668 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC2277' is defined on line 1951, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 1808
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2110
     (Obsoleted by RFC 2557)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2518
     (Obsoleted by RFC 4918)

  -- Obsolete informational reference (is this intentional?): RFC 2717
     (Obsoleted by RFC 4395)

  -- Obsolete informational reference (is this intentional?): RFC 2718
     (Obsoleted by RFC 4395)

  -- Obsolete informational reference (is this intentional?): RFC 2732
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3513
     (Obsoleted by RFC 4291)

     Summary: 5 errors (**), 0 flaws (~~), 5 warnings (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1   Network Working Group                                     T. Berners-Lee
2   Internet-Draft                                                   MIT/LCS
3   Updates: 1738 (if approved)                                  R. Fielding
4   Obsoletes: 2732, 2396, 1808 (if approved)                   Day Software
5   Expires: August 16, 2004                                     L. Masinter
6                                                                      Adobe
7                                                          February 16, 2004

9              Uniform Resource Identifier (URI): Generic Syntax
10                     draft-fielding-uri-rfc2396bis-04

12  Status of this Memo

14     This document is an Internet-Draft and is in full conformance with
15     all provisions of Section 10 of RFC2026.

17     Internet-Drafts are working documents of the Internet Engineering
18     Task Force (IETF), its areas, and its working groups. Note that other
19     groups may also distribute working documents as Internet-Drafts.

21     Internet-Drafts are draft documents valid for a maximum of six months
22     and may be updated, replaced, or obsoleted by other documents at any
23     time. It is inappropriate to use Internet-Drafts as reference
24     material or to cite them other than as "work in progress."

26     The list of current Internet-Drafts can be accessed at
27     .

29     The list of Internet-Draft Shadow Directories can be accessed at
30     .

32     This Internet-Draft will expire on August 16, 2004.

34  Copyright Notice

36     Copyright (C) The Internet Society (2004). All Rights Reserved.

38  Abstract

40     A Uniform Resource Identifier (URI) is a compact string of characters
41     for identifying an abstract or physical resource. This specification
42     defines the generic URI syntax and a process for resolving URI
43     references that might be in relative form, along with guidelines and
44     security considerations for the use of URIs on the Internet.

46     The URI syntax defines a grammar that is a superset of all valid
47     URIs, such that an implementation can parse the common components of
48     a URI reference without knowing the scheme-specific requirements of
49     every possible identifier. This specification does not define a
50     generative grammar for URIs; that task is performed by the individual
51     specifications of each URI scheme.

53  Editorial Note

55     Discussion of this draft and comments to the editors should be sent
56     to the uri@w3.org mailing list. An issues list and version history
57     is available at .

59  Table of Contents

61     1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
62     1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4
63     1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . . . . 5
64     1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 6
65     1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 6
66     1.2 Design Considerations . . . . . . . . . . . . . . . . . . . 6
67     1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . 6
68     1.2.2 Separating Identification from Interaction . . . . . . . . . 7
69     1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . . . . 9
70     1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . . 10
71     2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . 11
72     2.1 Percent Encoding . . . . . . . . . . . . . . . . . . . . . . 11
73     2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 12
74     2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 12
75     2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . . 13
76     3. Syntax Components . . . . . . . . . . . . . . . . . . . . . 15
77     3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
78     3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . . 16
79     3.2.1 User Information . . . . . . . . . . . . . . . . . . . . . . 16
80     3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
81     3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
82     3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
83     3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
84     3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . 22
85     4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
86     4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . . 24
87     4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . . 24
88     4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . . 25
89     4.4 Same-document Reference . . . . . . . . . . . . . . . . . . 25
90     4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . . 25
91     5. Reference Resolution . . . . . . . . . . . . . . . . . . . . 27
92     5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 27
93     5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 27
94     5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 28
95     5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 28
96     5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 28
97     5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . . 28
98     5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . . . . 29
99     5.2.2 Transform References . . . . . . . . . . . . . . . . . . . . 29
100    5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . . . . 30
101    5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . . . . 30
102    5.3 Component Recomposition . . . . . . . . . . . . . . . . . . 32
103    5.4 Reference Resolution Examples . . . . . . . . . . . . . . . 33
104    5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 33
105    5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 33
106    6. Normalization and Comparison . . . . . . . . . . . . . . . . 35
107    6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 35
108    6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 36
109    6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 36
110    6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 37
111    6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 38
112    6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 39
113    6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . 39
114    7. Security Considerations . . . . . . . . . . . . . . . . . . 41
115    7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 41
116    7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 41
117    7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . . 42
118    7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 42
119    7.5 Sensitive Information . . . . . . . . . . . . . . . . . . . 43
120    7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 43
121    8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 45
122    Normative References . . . . . . . . . . . . . . . . . . . . 46
123    Informative References . . . . . . . . . . . . . . . . . . . 47
124    Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 48
125    A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . 50
126    B. Parsing a URI Reference with a Regular Expression . . . . . 52
127    C. Delimiting a URI in Context . . . . . . . . . . . . . . . . 53
128    D. Summary of Non-editorial Changes . . . . . . . . . . . . . . 55
129    D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 55
130    D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 55
131    Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
132    Intellectual Property and Copyright Statements . . . . . . . 62

134 1. Introduction

136    A Uniform Resource Identifier (URI) provides a simple and extensible
137    means for identifying a resource. This specification of URI syntax
138    and semantics is derived from concepts introduced by the World Wide
139    Web global information initiative, whose use of such identifiers
140    dates from 1990 and is described in "Universal Resource Identifiers
141    in WWW" [RFC1630], and is designed to meet the recommendations laid
142    out in "Functional Recommendations for Internet Resource Locators"
143    [RFC1736] and "Functional Requirements for Uniform Resource Names"
144    [RFC1737].

146    This document obsoletes [RFC2396], which merged "Uniform Resource
147    Locators" [RFC1738] and "Relative Uniform Resource Locators"
148    [RFC1808] in order to define a single, generic syntax for all URIs.
149    It excludes those portions of RFC 1738 that defined the specific
150    syntax of individual URI schemes; those portions will be updated as
151    separate documents. The process for registration of new URI schemes
152    is defined separately by [RFC2717]. Advice for designers of new URI
153    schemes can be found in [RFC2718].

155    All significant changes from RFC 2396 are noted in Appendix D.

157    This specification uses the terms "character" and "character
158    encoding" in accordance with the definitions provided in [RFC2978].

160 1.1 Overview of URIs

162    URIs are characterized as follows:

164    Uniform

166       Uniformity provides several benefits: it allows different types of
167       resource identifiers to be used in the same context, even when the
168       mechanisms used to access those resources may differ; it allows
169       uniform semantic interpretation of common syntactic conventions
170       across different types of resource identifiers; it allows
171       introduction of new types of resource identifiers without
172       interfering with the way that existing identifiers are used; and,
173       it allows the identifiers to be reused in many different contexts,
174       thus permitting new applications or protocols to leverage a
175       pre-existing, large, and widely-used set of resource identifiers.

177    Resource

179       Anything that can be named or described can be a resource.
180       Familiar examples include an electronic document, an image, a
181       service (e.g., "today's weather report for Los Angeles"), and a
182       collection of other resources. A resource is not necessarily
183       accessible via the Internet; e.g., human beings, corporations, and
184       bound books in a library can also be resources. Likewise, abstract
185       concepts can be resources, such as the operators and operands of a
186       mathematical equation or the types of a relationship (e.g.,
187       "parent" or "employee").

189    Identifier

191       An identifier embodies the information required to distinguish
192       what is being identified from all other things within its scope of
193       identification.

195    A URI is an identifier that consists of a sequence of characters
196    matching the syntax defined by the syntax rule named "URI" in Section
197    3. A URI can be used to refer to a resource. This specification does
198    not place any limits on the nature of a resource or the reasons why
199    an application might wish to refer to a resource. URIs have a global
200    scope and should be interpreted consistently regardless of context,
201    but that interpretation may be defined in relation to the user's
202    context (e.g., "http://localhost/" refers to a resource that is
203    relative to the user's network interface and yet not specific to any
204    one user).

206 1.1.1 Generic Syntax

208    Each URI begins with a scheme name, as defined in Section 3.1, that
209    refers to a specification for assigning identifiers within that
210    scheme. As such, the URI syntax is a federated and extensible naming
211    system wherein each scheme's specification may further restrict the
212    syntax and semantics of identifiers using that scheme.

214    This specification defines those elements of the URI syntax that are
215    required of all URI schemes or are common to many URI schemes. It
216    thus defines the syntax and semantics that are needed to implement a
217    scheme-independent parsing mechanism for URI references, such that
218    the scheme-dependent handling of a URI can be postponed until the
219    scheme-dependent semantics are needed. Likewise, protocols and data
220    formats that make use of URI references can refer to this
221    specification as defining the range of syntax allowed for all URIs,
222    including those schemes that have yet to be defined.

224    A parser of the generic URI syntax is capable of parsing any URI
225    reference into its major components; once the scheme is determined,
226    further scheme-specific parsing can be performed on the components.
227    In other words, the URI generic syntax is a superset of the syntax of
228    all URI schemes.

230 1.1.2 Examples

232    The following examples illustrate URIs that are in common use.

234       ftp://ftp.is.co.za/rfc/rfc1808.txt

236       http://www.ietf.org/rfc/rfc2396.txt

238       mailto:John.Doe@example.com

240       news:comp.infosystems.www.servers.unix

242       telnet://melvyl.ucop.edu/

244 1.1.3 URI, URL, and URN

246    A URI can be further classified as a locator, a name, or both. The
247    term "Uniform Resource Locator" (URL) refers to the subset of URIs
248    that, in addition to identifying a resource, provide a means of
249    locating the resource by describing its primary access mechanism
250    (e.g., its network "location"). The term "Uniform Resource Name"
251    (URN) has been used historically to refer to both URIs under the
252    "urn" scheme [RFC2141], which are required to remain globally unique
253    and persistent even when the resource ceases to exist or becomes
254    unavailable, and to any other URI with the properties of a name.

256    An individual scheme does not need to be classified as being just one
257    of "name" or "locator". Instances of URIs from any given scheme may
258    have the characteristics of names or locators or both, often
259    depending on the persistence and care in the assignment of
260    identifiers by the naming authority, rather than any quality of the
261    scheme. Future specifications and related documentation should use
262    the general term "URI", rather than the more restrictive terms URL
263    and URN [RFC3305].

265 1.2 Design Considerations

267 1.2.1 Transcription

269    The URI syntax has been designed with global transcription as one of
270    its main considerations. A URI is a sequence of characters from a
271    very limited set: the letters of the basic Latin alphabet, digits,
272    and a few special characters. A URI may be represented in a variety
273    of ways: e.g., ink on paper, pixels on a screen, or a sequence of
274    integers from a coded character set. The interpretation of a URI
275    depends only on the characters used and not how those characters are
276    represented in a network protocol.

278    The goal of transcription can be described by a simple scenario.
279    Imagine two colleagues, Sam and Kim, sitting in a pub at an
280    international conference and exchanging research ideas. Sam asks Kim
281    for a location to get more information, so Kim writes the URI for the
282    research site on a napkin. Upon returning home, Sam takes out the
283    napkin and types the URI into a computer, which then retrieves the
284    information to which Kim referred.

286    There are several design considerations revealed by the scenario:

288    o  A URI is a sequence of characters that is not always represented
289       as a sequence of octets.

291    o  A URI might be transcribed from a non-network source, and thus
292       should consist of characters that are most likely to be able to be
293       entered into a computer, within the constraints imposed by
294       keyboards (and related input devices) across languages and
295       locales.

297    o  A URI often needs to be remembered by people, and it is easier for
298       people to remember a URI when it consists of meaningful or
299       familiar components.

301    These design considerations are not always in alignment. For
302    example, it is often the case that the most meaningful name for a URI
303    component would require characters that cannot be typed into some
304    systems. The ability to transcribe a resource identifier from one
305    medium to another has been considered more important than having a
306    URI consist of the most meaningful of components.

308    In local or regional contexts and with improving technology, users
309    might benefit from being able to use a wider range of characters;
310    such use is not defined in this specification. Percent-encoded
311    octets (Section 2.1) may be used within a URI to represent characters
312    outside the range of the US-ASCII coded character set if such
313    representation is defined by the scheme or by the protocol element in
314    which the URI is referenced; such a definition will specify the
315    character encoding scheme used to map those characters to octets
316    prior to being percent-encoded for the URI.

318 1.2.2 Separating Identification from Interaction

320    A common misunderstanding of URIs is that they are only used to refer
321    to accessible resources. In fact, the URI alone only provides
322    identification; access to the resource is neither guaranteed nor
323    implied by the presence of a URI. Instead, an operation (if any)
324    associated with a URI reference is defined by the protocol element,
325    data format attribute, or natural language text in which it appears.

327    Given a URI, a system may attempt to perform a variety of operations
328    on the resource, as might be characterized by such words as "access",
329    "update", "replace", or "find attributes". Such operations are
330    defined by the protocols that make use of URIs, not by this
331    specification. However, we do use a few general terms for describing
332    common operations on URIs. URI "resolution" is the process of
333    determining an access mechanism and the appropriate parameters
334    necessary to dereference a URI; such resolution may require several
335    iterations. To use that access mechanism to perform an action on the
336    URI's resource is to "dereference" the URI.

338    When URIs are used within information systems to identify sources of
339    information, the most common form of URI dereference is "retrieval":
340    making use of a URI in order to retrieve a representation of its
341    associated resource. A "representation" is a sequence of octets,
342    along with representation metadata describing those octets, that
343    constitutes a record of the state of the resource at the time that
344    the representation is generated. Retrieval is achieved by a process
345    that might include using the URI as a cache key to check for a
346    locally cached representation, resolution of the URI to determine an
347    appropriate access mechanism (if any), and dereference of the URI for
348    the sake of applying a retrieval operation. Depending on the
349    protocols used to perform the retrieval, additional information might
350    be supplied about the resource (resource metadata) and its relation
351    to other resources.

353    URI references in information systems are designed to be
354    late-binding: the result of an access is generally determined at the
355    time it is accessed and may vary over time or due to other aspects of
356    the interaction. When an author creates a reference to such a
357    resource, they do so with the intention that the reference be used in
358    the future; what is being identified is not some specific result that
359    was obtained in the past, but rather some characteristic that is
360    expected to be true for future results. In such cases, the resource
361    referred to by the URI is actually a sameness of characteristics as
362    observed over time, perhaps elucidated by additional comments or
363    assertions made by the resource provider.

365    Although many URI schemes are named after protocols, this does not
366    imply that use of such a URI will result in access to the resource
367    via the named protocol. URIs are often used simply for the sake of
368    identification. Even when a URI is used to retrieve a representation
369    of a resource, that access might be through gateways, proxies,
370    caches, and name resolution services that are independent of the
371    protocol associated with the scheme name, and the resolution of some
372    URIs may require the use of more than one protocol (e.g., both DNS
373    and HTTP are typically used to access an "http" URI's origin server
374    when a representation isn't found in a local cache).

376 1.2.3 Hierarchical Identifiers

378    The URI syntax is organized hierarchically, with components listed in
379    order of decreasing significance from left to right. For some URI
380    schemes, the visible hierarchy is limited to the scheme itself:
381    everything after the scheme component delimiter (":") is considered
382    opaque to URI processing. Other URI schemes make the hierarchy
383    explicit and visible to generic parsing algorithms.

385    The generic syntax uses the slash ("/"), question mark ("?"), and
386    number sign ("#") characters for the purpose of delimiting components
387    that are significant to the generic parser's hierarchical
388    interpretation of an identifier. In addition to aiding the
389    readability of such identifiers through the consistent use of
390    familiar syntax, this uniform representation of hierarchy across
391    naming schemes allows scheme-independent references to be made
392    relative to that hierarchy.

394    It is often the case that a group or "tree" of documents has been
395    constructed to serve a common purpose, wherein the vast majority of
396    URIs in these documents point to resources within the tree rather
397    than outside of it. Similarly, documents located at a particular
398    site are much more likely to refer to other resources at that site
399    than to resources at remote sites. Relative referencing of URIs
400    allows document trees to be partially independent of their location
401    and access scheme. For instance, it is possible for a single set of
402    hypertext documents to be simultaneously accessible and traversable
403    via each of the "file", "http", and "ftp" schemes if the documents
404    refer to each other using relative references. Furthermore, such
405    document trees can be moved, as a whole, without changing any of the
406    relative references.

408    A relative URI reference (Section 4.2) refers to a resource by
409    describing the difference within a hierarchical name space between
410    the reference context and the target URI. The reference resolution
411    algorithm, presented in Section 5, defines how such a reference is
412    transformed to the target URI. Since relative references can only be
413    used within the context of a hierarchical URI, designers of new URI
414    schemes should use a syntax consistent with the generic syntax's
415    hierarchical components unless there are compelling reasons to forbid
416    relative referencing within that scheme.

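As a rough illustration of the point about movable document trees, the sketch below resolves a single relative reference against two base URIs that differ only in scheme. It relies on `urljoin` from Python's standard `urllib.parse` module, which is that library's own resolver and is not guaranteed to match the algorithm of Section 5 in every corner case. The host `example.org`, the paths, and the helper name `resolve_all` are assumptions made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical base URIs: the same document tree published under two schemes.
BASES = [
    "http://example.org/docs/a/index.html",
    "ftp://example.org/docs/a/index.html",
]

# One relative reference, as it might appear inside index.html.
RELATIVE_REF = "../b/page.html"


def resolve_all(bases, ref):
    """Resolve the same relative reference against each base URI."""
    return [urljoin(base, ref) for base in bases]


TARGETS = resolve_all(BASES, RELATIVE_REF)
for target in TARGETS:
    print(target)
```

The reference text itself never changes; only the base URI supplied by the retrieval context does, which is why the tree can be moved as a whole.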
418 All URIs are parsed by generic syntax parsers when used. A URI scheme 419 that wishes to remain opaque to hierarchical processing must disallow 420 the use of slash and question mark characters. However, since a 421 non-relative URI reference is only modified by the generic parser if 422 it contains complete path segments of "." or ".." (see Section 3.3), 423 URIs may safely use "/" for other purposes if they do not allow 424 dot-segments. 426 1.3 Syntax Notation 428 This specification uses the Augmented Backus-Naur Form (ABNF) 429 notation of [RFC2234], including the following core ABNF syntax rules 430 defined by that specification: ALPHA (letters), CR (carriage return), 431 CTL (control characters), DIGIT (decimal digits), DQUOTE (double 432 quote), HEXDIG (hexadecimal digits), LF (line feed), and SP (space). 433 The complete URI syntax is collected in Appendix A. 435 2. Characters 437 Although ABNF notation defines its terminal values to be non-negative 438 integers (codepoints) based on the US-ASCII coded character set 439 [ASCII], we must invert that relation in order to understand the URI 440 syntax, since URIs are defined as strings of characters independent 441 of any particular encoding. Therefore, the integer values must be 442 mapped back to their corresponding characters via US-ASCII in order 443 to complete the syntax rules. 445 This specification does not mandate the use of any particular 446 character encoding scheme for mapping between URI characters and the 447 octets used to store or transmit those characters. When a URI appears 448 in a protocol element, the character encoding is defined by that 449 protocol; absent such a definition, a URI is assumed to use the same 450 character encoding as the surrounding text. 452 A URI is composed from a limited set of characters consisting of 453 digits, letters, and a few graphic symbols. 
A reserved (Section 2.2) 454 subset of those characters may be used to delimit syntax components 455 within a URI, while the remaining characters, including both the 456 unreserved (Section 2.3) set and those reserved characters not acting 457 as delimiters, define each component's data. 459 2.1 Percent Encoding 461 A percent-encoding mechanism is used to represent a data octet in a 462 component when that octet's corresponding character is outside the 463 allowed set or is being used as a delimiter of, or within, the 464 component. A percent-encoded octet is encoded as a character triplet, 465 consisting of the percent character "%" followed by the two 466 hexadecimal digits representing that octet's numeric value. For 467 example, "%20" is the percent-encoding for the binary octet 468 "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space 469 character (SP). 471 pct-encoded = "%" HEXDIG HEXDIG 473 The uppercase hexadecimal digits 'A' through 'F' are equivalent to 474 the lowercase digits 'a' through 'f', respectively. Two URIs that 475 differ only in the case of hexadecimal digits used in percent-encoded 476 octets are equivalent. For consistency, URI producers and 477 normalizers should use uppercase hexadecimal digits for all 478 percent-encodings. 480 2.2 Reserved Characters 482 URIs include components and sub-components that are delimited by 483 characters in the "reserved" set. These characters are called 484 "reserved" because they may (or may not) be defined as delimiters by 485 the generic syntax, by each scheme-specific syntax, or by the 486 implementation-specific syntax of a URI's dereferencing algorithm. 487 If data for a URI component would conflict with a reserved 488 character's purpose as a delimiter, then the conflicting data must be 489 percent-encoded before forming the URI. 491 reserved = gen-delims / sub-delims 493 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" 495 sub-delims = "!" 
/ "$" / "&" / "'" / "(" / ")" 496 / "*" / "+" / "," / ";" / "=" 498 A subset of the reserved characters (gen-delims) are used as 499 delimiters of the generic URI components described in Section 3. A 500 component's ABNF syntax rule will not use the reserved or gen-delims 501 rule names directly; instead, each syntax rule lists those reserved 502 characters that are allowed within that component (i.e., not 503 delimiting it). The allowed reserved characters, including those in 504 the sub-delims set and any of the gen-delims that are not a delimiter 505 of that component, are reserved for use as sub-component delimiters 506 within the component. Only the most common sub-components are 507 defined by this specification; other sub-components may be defined by 508 a URI scheme's specification, or by the implementation-specific 509 syntax of a URI's dereferencing algorithm, provided that such 510 sub-components are delimited by characters in that component's 511 reserved set. If no such delimiting role has been assigned, then a 512 reserved character appearing in a component represents the data octet 513 corresponding to its encoding in US-ASCII. 515 URIs that differ in the replacement of a reserved character with its 516 corresponding percent-encoded octet are not equivalent. 517 Percent-encoding a reserved character, or decoding a percent-encoded 518 octet that corresponds to a reserved character, will change how the 519 URI is interpreted by most applications. 521 2.3 Unreserved Characters 523 Characters that are allowed in a URI but do not have a reserved 524 purpose are called unreserved. These include uppercase and lowercase 525 letters, decimal digits, hyphen, period, underscore, and tilde. 527 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" 529 URIs that differ in the replacement of an unreserved character with 530 its corresponding percent-encoded octet are equivalent: they identify 531 the same resource. 
However, percent-encoded unreserved characters 532 may change the result of some URI comparisons (Section 6), 533 potentially leading to incorrect or inefficient behavior. For 534 consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A 535 and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore 536 (%5F), or tilde (%7E) should not be created by URI producers and, 537 when found in a URI, should be decoded to their corresponding 538 unreserved character by URI normalizers. 540 2.4 When to Encode or Decode 542 Under normal circumstances, the only time that octets within a URI 543 are percent-encoded is during the process of producing the URI from 544 its component parts. It is during that process that an 545 implementation determines which of the reserved characters are to be 546 used as sub-component delimiters and which can be safely used as 547 data. Once produced, a URI is always in its percent-encoded form. 549 When a URI is dereferenced, the components and sub-components 550 significant to the scheme-specific dereferencing process (if any) 551 must be parsed and separated before the percent-encoded octets within 552 those components can be safely decoded, since otherwise the data may 553 be mistaken for component delimiters. The only exception is for 554 percent-encoded octets corresponding to characters in the unreserved 555 set, which can be decoded at any time. For example, the octet 556 corresponding to the tilde ("~") character is often encoded as "%7E" 557 by older URI processing software; the "%7E" can be replaced by "~" 558 without changing its interpretation. 560 Because the percent ("%") character serves as the indicator for 561 percent-encoded octets, it must be percent-encoded as "%25" in order 562 for that octet to be used as data within a URI. 
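The normalization recommended in Section 2.3 can be sketched in a few lines of Python. This is a minimal illustration and not part of the specification: the function name normalize_pct_encoding is hypothetical, and the sketch assumes its input is already a syntactically valid URI string. It decodes only those percent-encoded octets that correspond to unreserved characters and uppercases the hexadecimal digits of all others, in a single pass.

```python
import re

# The unreserved set from Section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct_encoding(uri):
    """Decode percent-encoded unreserved characters; uppercase all other triplets."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # safe to decode at any time
        return "%" + match.group(1).upper()  # keep encoded, uppercase hex

    # re.sub scans the input once and never rescans its own replacements,
    # so a triplet such as "%25" is left alone and nothing is decoded twice.
    return PCT_ENCODED.sub(fix, uri)
```

Under these assumptions, "http://example.com/%7euser/a%2fb" would become "http://example.com/~user/a%2Fb": the tilde is decoded, while the encoded slash stays encoded because decoding a reserved character could change how the URI is interpreted.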
Implementations must 563 not percent-encode or decode the same string more than once, since 564 decoding an already decoded string might lead to misinterpreting a 565 percent data octet as the beginning of a percent-encoding, or vice 566 versa in the case of percent-encoding an already percent-encoded 567 string. 569 URI characters serve as an external interface for identification 570 between systems. A system that internally provides identifiers in 571 the form of a different character encoding, such as EBCDIC, will 572 generally perform character translation of textual identifiers to 573 UTF-8 [RFC3629] (or some other superset of the US-ASCII character 574 encoding) at an internal interface, since that results in more 575 meaningful identifiers than simply percent-encoding the original 576 octets. When interpreting an incoming URI on such an interface, 577 percent-encoded octets must be decoded before the reverse transcoding 578 can be applied. 580 In some cases, the interface between a URI component and the 581 identifying data it has been crafted to represent is much less direct 582 than a character encoding translation. For example, portions of a 583 URI might reflect a query on non-ASCII data, numeric coordinates on a 584 map, etc. Likewise, a URI scheme may define components with 585 additional encoding requirements, such as base64, that are applied 586 prior to forming the component and producing the URI. 588 When a URI scheme defines a component that represents textual data 589 consisting of characters from the Unicode (ISO/IEC 10646-1) character 590 set, the data should be encoded first as octets according to the 591 UTF-8 character encoding [RFC3629], and then only those octets that 592 do not correspond to characters in the unreserved set should be 593 percent-encoded. 
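The two-step rule above (UTF-8 first, then percent-encoding of every octet outside the unreserved set) can be sketched as follows. The function name pct_encode_text is hypothetical, and the sketch assumes the whole input is data, so that no reserved character in it is meant to act as a delimiter.

```python
# Octet values of the unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def pct_encode_text(text):
    """Encode text as UTF-8, then percent-encode each octet outside unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            # Uppercase hexadecimal digits, as recommended in Section 2.1.
            out.append("%{:02X}".format(octet))
    return "".join(out)
```

On Python 3.7 or later, urllib.parse.quote(text, safe="") is expected to give the same output; that equivalence is an observation about the standard library, not something this specification defines.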
For example, the character A would be represented 594 as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be 595 represented as "%C3%80", and the character KATAKANA LETTER A would be 596 represented as "%E3%82%A2". 598 3. Syntax Components 600 The generic URI syntax consists of a hierarchical sequence of 601 components referred to as the scheme, authority, path, query, and 602 fragment. 604 URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment] 606 The scheme and path components are required, though path may be empty 607 (no characters). An ABNF-driven parser will find that the border 608 between authority and path is ambiguous; they are disambiguated by 609 the "first-match-wins" (a.k.a. "greedy") algorithm. In other words, 610 if authority is present then the first segment of the path must be 611 empty. 613 The following are two example URIs and their component parts: 615 foo://example.com:8042/over/there?name=ferret#nose 616 \_/ \______________/\_________/ \_________/ \__/ 617 | | | | | 618 scheme authority path query fragment 619 | _____________________|__ 620 / \ / \ 621 urn:example:animal:ferret:nose 623 3.1 Scheme 625 Each URI begins with a scheme name that refers to a specification for 626 assigning identifiers within that scheme. As such, the URI syntax is 627 a federated and extensible naming system wherein each scheme's 628 specification may further restrict the syntax and semantics of 629 identifiers using that scheme. 631 Scheme names consist of a sequence of characters beginning with a 632 letter and followed by any combination of letters, digits, plus 633 ("+"), period ("."), or hyphen ("-"). Although scheme is 634 case-insensitive, the canonical form is lowercase and documents that 635 specify schemes must do so using lowercase letters. 
An 636 implementation should accept uppercase letters as equivalent to 637 lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for 638 the sake of robustness, but should only produce lowercase scheme 639 names, for consistency. 641 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 643 Individual schemes are not specified by this document. The process 644 for registration of new URI schemes is defined separately by 646 [RFC2717]. The scheme registry maintains the mapping between scheme 647 names and their specifications. Advice for designers of new URI 648 schemes can be found in [RFC2718]. 650 When presented with a URI that violates one or more scheme-specific 651 restrictions, the scheme-specific resolution process should flag the 652 reference as an error rather than ignore the unused parts; doing so 653 reduces the number of equivalent URIs and helps detect abuses of the 654 generic syntax that might indicate the URI has been constructed to 655 mislead the user (Section 7.6). 657 3.2 Authority 659 Many URI schemes include a hierarchical element for a naming 660 authority, such that governance of the name space defined by the 661 remainder of the URI is delegated to that authority (which may, in 662 turn, delegate it further). The generic syntax provides a common 663 means for distinguishing an authority based on a registered name or 664 server address, along with optional port and user information. 666 The authority component is preceded by a double slash ("//") and is 667 terminated by the next slash ("/"), question mark ("?"), or number 668 sign ("#") character, or by the end of the URI. 670 authority = [ userinfo "@" ] host [ ":" port ] 672 URI producers and normalizers should omit the "@" delimiter that 673 separates userinfo from host if the userinfo component is empty (zero 674 length) and should omit the ":" delimiter that separates host from 675 port if the port component is empty. 
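As an illustration of the authority rule and of the advice on empty delimiters, the following Python sketch splits an authority into its sub-components and recomposes it. The names split_authority and join_authority are hypothetical, the sketch does not validate the characters of userinfo or host, and it treats a leading "[" as the start of a bracketed IP literal (Section 3.2.2) so that colons inside the brackets are not mistaken for the port delimiter.

```python
def split_authority(authority):
    """Split an authority into (userinfo, host, port); None marks an absent part."""
    userinfo = None
    hostport = authority
    if "@" in authority:
        userinfo, _, hostport = authority.rpartition("@")

    port = None
    if hostport.startswith("["):
        # Bracketed IP literal: colons inside the brackets belong to the host.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP literal")
        host, rest = hostport[: end + 1], hostport[end + 1:]
        if rest.startswith(":"):
            port = rest[1:]
        elif rest:
            raise ValueError("unexpected text after IP literal")
    else:
        host, sep, tail = hostport.rpartition(":")
        if sep:
            port = tail
        else:
            host = hostport

    # port = *DIGIT (an empty port is syntactically allowed)
    if port is not None and not all(c in "0123456789" for c in port):
        raise ValueError("port must consist of decimal digits")
    return userinfo, host, port


def join_authority(userinfo, host, port):
    """Recompose an authority, omitting "@" and ":" when their part is empty."""
    result = host
    if userinfo:
        result = userinfo + "@" + result
    if port:
        result = result + ":" + port
    return result
```

With these helpers, join_authority(*split_authority("@example.com:")) yields "example.com", which is the form a producer or normalizer is advised to emit.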
Some schemes do not allow the 676 userinfo and/or port sub-components. 678 3.2.1 User Information 680 The userinfo sub-component may consist of a user name and, 681 optionally, scheme-specific information about how to gain 682 authorization to access the resource. The user information, if 683 present, is followed by a commercial at-sign ("@") that delimits it 684 from the host. 686 userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) 688 Use of the format "user:password" in the userinfo field is 689 deprecated. Applications should not render as clear text any data 690 after the first colon (":") character found within a userinfo 691 sub-component unless such data is the empty string (indicating no 692 password) or "anonymous". Applications may choose to ignore or reject 693 such data when received as part of a reference, and should reject the 694 storage of such data in unencrypted form. The passing of 695 authentication information in clear text has proven to be a security 696 risk in almost every case where it has been used. 698 Applications that render a URI for the sake of user feedback, such as 699 in graphical hypertext browsing, should render userinfo in a way that 700 is distinguished from the rest of a URI, when feasible. Such 701 rendering will assist the user in cases where the userinfo has been 702 misleadingly crafted to look like a trusted domain name (Section 703 7.6). 705 3.2.2 Host 707 The host sub-component of authority is identified by an IP literal 708 encapsulated within square brackets, an IPv4 address in 709 dotted-decimal form, or a host name. 711 host = IP-literal / IPv4address / reg-name 713 The syntax rule for host is ambiguous because it does not completely 714 distinguish between an IPv4address and a reg-name. Again, the 715 "first-match-wins" algorithm applies: If host matches the rule for 716 IPv4address, then it should be considered an IPv4 address literal and 717 not a reg-name. 
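The "first-match-wins" rule for host can be sketched as an ordered series of tests. The pattern below is a transcription of the IPv4address and dec-octet rules that appear later in this section; the function name classify_host is hypothetical, and the sketch neither validates the contents of an IP literal nor checks the characters of a reg-name.

```python
import re

# The five dec-octet alternatives from the ABNF, combined into one pattern.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"(?:{0}\.){{3}}{0}".format(DEC_OCTET))


def classify_host(host):
    """Return 'IP-literal', 'IPv4address', or 'reg-name' for a host string."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"
    if IPV4ADDRESS.fullmatch(host):
        return "IPv4address"  # first match wins over reg-name
    return "reg-name"
```

In this sketch "192.0.2.16" is classified as an IPv4address, whereas "256.0.0.1" fails the dec-octet range and therefore falls through to reg-name.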
Although host is case-insensitive, producers and 718 normalizers should use lowercase for host names and hexadecimal 719 addresses for the sake of uniformity, while only using uppercase 720 letters for percent-encodings. 722 A host identified by an Internet Protocol literal address, version 6 723 [RFC3513] or later, is distinguished by enclosing the IP literal 724 within square brackets ("[" and "]"). This is the only place where 725 square bracket characters are allowed in the URI syntax. In 726 anticipation of future, as-yet-undefined IP literal address formats, 727 an optional version flag may be used to indicate such a format 728 explicitly rather than relying on heuristic determination. 730 IP-literal = "[" ( IPv6address / IPvFuture ) "]" 732 IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) 734 The version flag does not indicate the IP version; rather, it 735 indicates future versions of the literal format. As such, 736 implementations must not provide the version flag for existing IPv4 737 and IPv6 literal addresses. If a URI containing an IP-literal that 738 starts with "v" (case-insensitive), indicating that the version flag 739 is present, is dereferenced by an application that does not know the 740 meaning of that version flag, then the application should return an 741 appropriate error for "address mechanism not supported". 743 A host identified by an IPv6 literal address is represented inside 744 the square brackets without a preceding version flag. The ABNF 745 provided here is a translation of the text definition of an IPv6 746 literal address provided in [RFC3513]. A 128-bit IPv6 address is 747 divided into eight 16-bit pieces. Each piece is represented 748 numerically in case-insensitive hexadecimal, using one to four 749 hexadecimal digits (leading zeroes are permitted). The eight encoded 750 pieces are given most-significant first, separated by colon 751 characters. 
Optionally, the least-significant two pieces may instead 752 be represented in IPv4 address textual format. A sequence of one or 753 more consecutive zero-valued 16-bit pieces within the address may be 754 elided, omitting all their digits and leaving exactly two consecutive 755 colons in their place to mark the elision. 757 IPv6address = 6( h16 ":" ) ls32 758 / "::" 5( h16 ":" ) ls32 759 / [ h16 ] "::" 4( h16 ":" ) ls32 760 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 761 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 762 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 763 / [ *4( h16 ":" ) h16 ] "::" ls32 764 / [ *5( h16 ":" ) h16 ] "::" h16 765 / [ *6( h16 ":" ) h16 ] "::" 767 ls32 = ( h16 ":" h16 ) / IPv4address 768 ; least-significant 32 bits of address 770 h16 = 1*4HEXDIG 771 ; 16 bits of address represented in hexadecimal 773 A host identified by an IPv4 literal address is represented in 774 dotted-decimal notation (a sequence of four decimal numbers in the 775 range 0 to 255, separated by "."), as described in [RFC1123] by 776 reference to [RFC0952]. Note that other forms of dotted notation may 777 be interpreted on some platforms, as described in Section 7.4, but 778 only the dotted-decimal form of four octets is allowed by this 779 grammar. 781 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 783 dec-octet = DIGIT ; 0-9 784 / %x31-39 DIGIT ; 10-99 785 / "1" 2DIGIT ; 100-199 786 / "2" %x30-34 DIGIT ; 200-249 787 / "25" %x30-35 ; 250-255 789 A host identified by a registered name is a string of characters that 790 is intended for lookup within a locally-defined host or service name 791 registry. The most common of such registry mechanisms is the Domain 792 Name System (DNS), as defined by Section 3 of [RFC1034] and Section 793 2.1 of [RFC1123]. A DNS name consists of a sequence of domain labels 794 separated by ".", each domain label starting and ending with an 795 alphanumeric character and possibly also containing "-" characters. 
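The label syntax just described can be checked with a short Python sketch. The function name looks_like_dns_name is hypothetical, and the check covers only the character rules stated in this section; it does not enforce DNS length limits or perform any lookup.

```python
import re

# One label: alphanumeric at both ends, "-" allowed only in the interior.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def looks_like_dns_name(name):
    """Check the label syntax described in this section; lengths are not checked."""
    if name.endswith("."):
        name = name[:-1]  # a fully qualified name may carry one trailing "."
    if not name:
        return False
    return all(LABEL.fullmatch(label) is not None for label in name.split("."))
```

A name such as "www.example.com" passes this check, while "-bad.example" and "a..b" do not.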
796 The rightmost domain label of a fully qualified domain name in DNS 797 may be followed by a single "." and should be followed by one if it 798 is necessary to distinguish between the complete domain name and some 799 local domain. 801 reg-name = 0*255( unreserved / pct-encoded / sub-delims ) 803 If the host component is defined and the registered name is empty 804 (zero length), then the name defaults to "localhost" (Section 6.2.3 805 discusses how this should be normalized). If "localhost" is not 806 determined by a host name lookup, then it should be interpreted to 807 mean the machine on which the URI is being resolved. 809 This specification does not mandate a particular registered name 810 lookup technology and therefore does not restrict the syntax of 811 reg-name beyond that necessary for interoperability. Instead, it 812 delegates the issue of host name syntax conformance to the operating 813 system of each application performing URI resolution, and that 814 operating system decides what it will allow for the purpose of host 815 identification. A URI resolution implementation might use DNS, host 816 tables, yellow pages, NetInfo, WINS, or any other system for lookup 817 of host and service names. However, a globally-scoped naming system, 818 such as DNS fully-qualified domain names, is necessary for URIs that 819 are intended to have global scope. URI producers should use host 820 names that conform to the DNS syntax, even when use of DNS is not 821 immediately apparent. 823 The reg-name syntax allows percent-encoded octets in order to 824 represent non-ASCII host or service names in a uniform way that is 825 independent of the underlying name resolution technology; such octets 826 must represent characters encoded in the UTF-8 character encoding 827 [RFC3629] prior to being percent-encoded. 
When a non-ASCII host name 828 represents an internationalized domain name intended for resolution 829 via DNS, the name must be transformed to the IDNA encoding [RFC3490] 830 prior to name lookup. URI producers should provide such host names in 831 the IDNA encoding, rather than a percent-encoding, if they wish to 832 maximize interoperability with legacy URI resolvers. 834 The presence of host within a URI does not imply that the scheme 835 requires access to the given host on the Internet. In many cases, 836 the host syntax is used only for the sake of reusing the existing 837 registration process created and deployed for DNS, thus obtaining a 838 globally unique name without the cost of deploying another registry. 839 However, such use comes with its own costs: domain name ownership may 840 change over time for reasons not anticipated by the URI producer. 842 3.2.3 Port 844 The port sub-component of authority is designated by an optional port 845 number in decimal following the host and delimited from it by a 846 single colon (":") character. 848 port = *DIGIT 850 A scheme may define a default port. For example, the "http" scheme 851 defines a default port of "80", corresponding to its reserved TCP 852 port number. The type of port designated by the port number (e.g., 853 TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers 854 and normalizers should omit the port component and its ":" delimiter 855 if port is empty or its value would be the same as the scheme's 856 default. 858 3.3 Path 860 The path component contains data, usually organized in hierarchical 861 form, that, along with data in the non-hierarchical query component 862 (Section 3.4), serves to identify a resource within the scope of the 863 URI's scheme and naming authority (if any). If a URI contains an 864 authority component, then the initial path segment must be empty 865 (i.e., the path must begin with a slash ("/") character or be 866 entirely empty). 
The path is terminated by the first question mark 867 ("?") or number sign ("#") character, or by the end of the URI. 869 path = segment *( "/" segment ) 870 segment = *pchar 872 pchar = unreserved / pct-encoded / sub-delims / ":" / "@" 874 A path consists of a sequence of path segments separated by a slash 875 ("/") character. A path is always defined for a URI, though the 876 defined path may be empty (zero length). Use of the slash character 877 to indicate hierarchy is only required when a URI will be used as the 878 context for relative references. For example, the URI 879 has a path of "fred@example.com", whereas 880 the URI has an empty path. 882 The path segments "." and ".." are defined for relative reference 883 within the path name hierarchy. They are intended for use at the 884 beginning of a relative path reference (Section 4.2) for indicating 885 relative position within the hierarchical tree of names. This is 886 similar to their role within some operating systems' file directory 887 structure to indicate the current directory and parent directory, 888 respectively. However, unlike a file system, these dot-segments are 889 only interpreted within the URI path hierarchy and are removed as 890 part of the resolution process (Section 5.2). 892 Aside from dot-segments in hierarchical paths, a path segment is 893 considered opaque by the generic syntax. URI-producing applications 894 often use the reserved characters allowed in a segment for the 895 purpose of delimiting scheme-specific or dereference-handler-specific 896 sub-components. For example, the semicolon (";") and equals ("=") 897 reserved characters are often used for delimiting parameters and 898 parameter values applicable to that segment. The comma (",") 899 reserved character is often used for similar purposes. 
For example, 900 one URI producer might use a segment like "name;v=1.1" to indicate a 901 reference to version 1.1 of "name", whereas another might use a 902 segment like "name,1.1" to indicate the same. Parameter types may be 903 defined by scheme-specific semantics, but in most cases the syntax of 904 a parameter is specific to the implementation of the URI's 905 dereferencing algorithm. 907 3.4 Query 909 The query component contains non-hierarchical data that, along with 910 data in the path component (Section 3.3), serves to identify a 911 resource within the scope of the URI's scheme and naming authority 912 (if any). The query component is indicated by the first question mark 913 ("?") character and terminated by a number sign ("#") character or by 914 the end of the URI. 916 query = *( pchar / "/" / "?" ) 918 The characters slash ("/") and question mark ("?") may represent data 919 within the query component, but should not be used as such within a 920 URI that is expected to be the base for relative references (Section 921 5.1). Incorrect implementations of reference resolution often fail 922 to distinguish query data from path data when looking for 923 hierarchical separators, thus resulting in non-interoperable results. 924 However, since query components are often used to carry identifying 925 information in the form of "key=value" pairs, and one frequently used 926 value is a reference to another URI, it is sometimes better for 927 usability to avoid percent-encoding those characters. 929 3.5 Fragment 931 The fragment identifier component of a URI allows indirect 932 identification of a secondary resource by reference to a primary 933 resource and additional identifying information. The identified 934 secondary resource may be some portion or subset of the primary 935 resource, some view on representations of the primary resource, or 936 some other resource defined or described by those representations. 
A 937 fragment identifier component is indicated by the presence of a 938 number sign ("#") character and terminated by the end of the URI. 940 fragment = *( pchar / "/" / "?" ) 942 The semantics of a fragment identifier are defined by the set of 943 representations that might result from a retrieval action on the 944 primary resource. The fragment's format and resolution is therefore 945 dependent on the media type [RFC2046] of a potentially retrieved 946 representation, even though such a retrieval is only performed if the 947 URI is dereferenced. Individual media types may define their own 948 restrictions on, or structure within, the fragment identifier syntax 949 for specifying different types of subsets, views, or external 950 references that are identifiable as secondary resources by that media 951 type. If the primary resource has multiple representations, as is 952 often the case for resources whose representation is selected based 953 on attributes of the retrieval request (a.k.a., content negotiation), 954 then whatever is identified by the fragment should be consistent 955 across all of those representations: each representation should 956 either define the fragment such that it corresponds to the same 957 secondary resource, regardless of how it is represented, or the 958 fragment should be left undefined by the representation (i.e., not 959 found). 961 As with any URI, use of a fragment identifier component does not 962 imply that a retrieval action will take place. A URI with a fragment 963 identifier may be used to refer to the secondary resource without any 964 implication that the primary resource is accessible or will ever be 965 accessed. 967 Fragment identifiers have a special role in information systems as 968 the primary form of client-side indirect referencing, allowing an 969 author to specifically identify those aspects of an existing resource 970 that are only indirectly provided by the resource owner. 
As such, 971 interpretation of the fragment identifier during a retrieval action 972 is performed solely by the user agent; the fragment identifier is not 973 passed to other systems during the process of retrieval. Although 974 this is often perceived to be a loss of information, particularly in 975 regards to accurate redirection of references as content moves over 976 time, it also serves to prevent information providers from denying 977 reference authors the right to selectively refer to information 978 within a resource. 980 The characters slash ("/") and question mark ("?") are allowed to 981 represent data within the fragment identifier, but should not be used 982 as such within a URI that is expected to be the base for relative 983 references (Section 5.1) for the same reasons as described above for 984 query. 986 4. Usage 988 When applications make reference to a URI, they do not always use the 989 full form of reference defined by the "URI" syntax rule. In order to 990 save space and take advantage of hierarchical locality, many Internet 991 protocol elements and media type formats allow an abbreviation of a 992 URI, while others restrict the syntax to a particular form of URI. 993 We define the most common forms of reference syntax in this 994 specification because they impact and depend upon the design of the 995 generic syntax, requiring a uniform parsing algorithm in order to be 996 interpreted consistently. 998 4.1 URI Reference 1000 URI-reference is used to denote the most common usage of a resource 1001 identifier. 1003 URI-reference = URI / relative-URI 1005 A URI-reference may be relative: if the reference's prefix matches 1006 the syntax of a scheme followed by its colon separator, then the 1007 reference is a URI rather than a relative-URI. 
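The test that separates a URI from a relative-URI can be sketched as a prefix match against the scheme rule of Section 3.1. The function name is_relative_reference is hypothetical, and the sketch is non-validating: it only decides which of the two forms a reference takes and does not check the rest of the string.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by its ":" separator
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def is_relative_reference(reference):
    """Return True when the reference does not begin with a scheme and colon."""
    return SCHEME_PREFIX.match(reference) is None
```

Under this test "http://example.com/" is a URI, while "//example.com/x", "#nose", and the empty string are relative. A reference such as "this:that" is read as a URI whose scheme is "this", whereas "./this:that" is relative.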
1009 A URI-reference is typically parsed first into the five URI 1010 components, in order to determine what components are present and 1011 whether or not the reference is relative, and then each component is 1012 parsed for its subparts and their validation. The ABNF of 1013 URI-reference, along with the "first-match-wins" disambiguation rule, 1014 is sufficient to define a validating parser for the generic syntax. 1015 Readers familiar with regular expressions should see Appendix B for 1016 an example of a non-validating URI-reference parser that will take 1017 any given string and extract the URI components. 1019 4.2 Relative URI 1021 A relative URI reference takes advantage of the hierarchical syntax 1022 (Section 1.2.3) in order to express a reference that is relative to 1023 the name space of another hierarchical URI. 1025 relative-URI = ["//" authority] path ["?" query] ["#" fragment] 1027 The URI referred to by a relative reference, also known as the target 1028 URI, is obtained by applying the reference resolution algorithm of 1029 Section 5. 1031 A relative reference that begins with two slash characters is termed 1032 a network-path reference; such references are rarely used. A relative 1033 reference that begins with a single slash character is termed an 1034 absolute-path reference. A relative reference that does not begin 1035 with a slash character is termed a relative-path reference. 1037 A path segment that contains a colon character (e.g., "this:that") 1038 cannot be used as the first segment of a relative-path reference 1039 because it would be mistaken for a scheme name. Such a segment must 1040 be preceded by a dot-segment (e.g., "./this:that") to make a 1041 relative-path reference. 1043 4.3 Absolute URI 1045 Some protocol elements allow only the absolute form of a URI without 1046 a fragment identifier. 
For example, defining a base URI for later 1047 use by relative references calls for an absolute-URI syntax rule that 1048 does not allow a fragment. 1050 absolute-URI = scheme ":" ["//" authority] path ["?" query] 1052 4.4 Same-document Reference 1054 When a URI reference refers to a URI that is, aside from its fragment 1055 component (if any), identical to the base URI (Section 5.1), that 1056 reference is called a "same-document" reference. The most frequent 1057 examples of same-document references are relative references that are 1058 empty or include only the number sign ("#") separator followed by a 1059 fragment identifier. 1061 When a same-document reference is dereferenced for the purpose of a 1062 retrieval action, the target of that reference is defined to be 1063 within the same entity (representation, document, or message) as the 1064 reference; therefore, a dereference should not result in a new 1065 retrieval action. 1067 Normalization of the base and target URIs prior to their comparison, 1068 as described in Section 6.2.2 and Section 6.2.3, is allowed but 1069 rarely performed in practice. Normalization may increase the set of 1070 same-document references, which may be of benefit to some caching 1071 applications. As such, reference authors should not assume that a 1072 slightly different, though equivalent, reference URI will (or will 1073 not) be interpreted as a same-document reference by any given 1074 application. 1076 4.5 Suffix Reference 1078 The URI syntax is designed for unambiguous reference to resources and 1079 extensibility via the URI scheme. However, as URI identification and 1080 usage have become commonplace, traditional media (television, radio, 1081 newspapers, billboards, etc.) have increasingly used a suffix of the 1082 URI as a reference, consisting of only the authority and path 1083 portions of the URI, such as 1085 www.w3.org/Addressing/ 1087 or simply a DNS registered name on its own. 
Such references are 1088 primarily intended for human interpretation, rather than for 1089 machines, with the assumption that context-based heuristics are 1090 sufficient to complete the URI (e.g., most host names beginning with 1091 "www" are likely to have a URI prefix of "http://"). Although there 1092 is no standard set of heuristics for disambiguating a URI suffix, 1093 many client implementations allow them to be entered by the user and 1094 heuristically resolved. 1096 While this practice of using suffix references is common, it should 1097 be avoided whenever possible and never used in situations where 1098 long-term references are expected. The heuristics noted above will 1099 change over time, particularly when a new URI scheme becomes popular, 1100 and are often incorrect when used out of context. Furthermore, they 1101 can lead to security issues along the lines of those described in 1102 [RFC1535]. 1104 Since a URI suffix has the same syntax as a relative path reference, 1105 a suffix reference cannot be used in contexts where a relative 1106 reference is expected. As a result, suffix references are limited to 1107 those places where there is no defined base URI, such as dialog boxes 1108 and off-line advertisements. 1110 5. Reference Resolution 1112 This section defines the process of resolving a URI reference within 1113 a context that allows relative references, such that the result is a 1114 string matching the "URI" syntax rule of Section 3. 1116 5.1 Establishing a Base URI 1118 The term "relative" implies that there exists a "base URI" against 1119 which the relative reference is applied. Aside from fragment-only 1120 references (Section 4.4), relative references are only usable when a 1121 base URI is known. A base URI must be established by the parser 1122 prior to parsing URI references that might be relative. 1124 The base URI of a reference can be established in one of four ways, 1125 discussed below in order of precedence. 
The order of precedence can 1126 be thought of in terms of layers, where the innermost defined base 1127 URI has the highest precedence. This can be visualized graphically 1128 as: 1130 .----------------------------------------------------------. 1131 | .----------------------------------------------------. | 1132 | | .----------------------------------------------. | | 1133 | | | .----------------------------------------. | | | 1134 | | | | .----------------------------------. | | | | 1135 | | | | | | | | | | 1136 | | | | `----------------------------------' | | | | 1137 | | | | (5.1.1) Base URI embedded in content | | | | 1138 | | | `----------------------------------------' | | | 1139 | | | (5.1.2) Base URI of the encapsulating entity | | | 1140 | | | (message, representation, or none) | | | 1141 | | `----------------------------------------------' | | 1142 | | (5.1.3) URI used to retrieve the entity | | 1143 | `----------------------------------------------------' | 1144 | (5.1.4) Default Base URI (application-dependent) | 1145 `----------------------------------------------------------' 1147 5.1.1 Base URI within Document Content 1149 Within certain media types, a base URI for relative references can be 1150 embedded within the content itself such that it can be readily 1151 obtained by a parser. This can be useful for descriptive documents, 1152 such as tables of content, which may be transmitted to others through 1153 protocols other than their usual retrieval context (e.g., E-Mail or 1154 USENET news). 1156 It is beyond the scope of this specification to specify how, for each 1157 media type, a base URI can be embedded. The appropriate syntax, when 1158 available, is described by each media type's specification. 1160 5.1.2 Base URI from the Encapsulating Entity 1162 If no base URI is embedded, the base URI is defined by the 1163 representation's retrieval context. 
For a document that is enclosed 1164 within another entity, such as a message or archive, the retrieval 1165 context is that entity; thus, the default base URI of a 1166 representation is the base URI of the entity in which the 1167 representation is encapsulated. 1169 A mechanism for embedding a base URI within MIME container types 1170 (e.g., the message and multipart types) is defined by MHTML 1171 [RFC2110]. Protocols that do not use the MIME message header syntax, 1172 but do allow some form of tagged metadata to be included within 1173 messages, may define their own syntax for defining a base URI as part 1174 of a message. 1176 5.1.3 Base URI from the Retrieval URI 1178 If no base URI is embedded and the representation is not encapsulated 1179 within some other entity, then, if a URI was used to retrieve the 1180 representation, that URI shall be considered the base URI. Note that 1181 if the retrieval was the result of a redirected request, the last URI 1182 used (i.e., the URI that resulted in the actual retrieval of the 1183 representation) is the base URI. 1185 5.1.4 Default Base URI 1187 If none of the conditions described above apply, then the base URI is 1188 defined by the context of the application. Since this definition is 1189 necessarily application-dependent, failing to define a base URI using 1190 one of the other methods may result in the same content being 1191 interpreted differently by different types of application. 1193 A sender of a representation containing relative references is 1194 responsible for ensuring that a base URI for those references can be 1195 established. Aside from fragment-only references, relative references 1196 can only be used reliably in situations where the base URI is 1197 well-defined. 1199 5.2 Relative Resolution 1201 This section describes an algorithm for converting a URI reference 1202 that might be relative to a given base URI into the parsed components 1203 of the reference's target.
The components can then be recomposed, as 1204 described in Section 5.3, to form the target URI. This algorithm 1205 provides definitive results that can be used to test the output of 1206 other implementations. Applications may implement relative reference 1207 resolution using some other algorithm, provided that the results 1208 match what would be given by this algorithm. 1210 5.2.1 Pre-parse the Base URI 1212 The base URI (Base) is established according to the procedure of 1213 Section 5.1 and parsed into the five main components described in 1214 Section 3. Note that only the scheme component is required to be 1215 present in a base URI; the other components may be empty or 1216 undefined. A component is undefined if its associated delimiter does 1217 not appear in the URI reference; the path component is never 1218 undefined, though it may be empty. 1220 Normalization of the base URI, as described in Section 6.2.2 and 1221 Section 6.2.3, is optional. A URI reference must be transformed to 1222 its target URI before it can be normalized. 1224 5.2.2 Transform References 1226 For each URI reference (R), the following pseudocode describes an 1227 algorithm for transforming R into its target URI (T): 1229 -- The URI reference is parsed into the five URI components 1230 -- 1231 (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); 1233 -- A non-strict parser may ignore a scheme in the reference 1234 -- if it is identical to the base URI's scheme. 
1235 -- 1236 if ((not strict) and (R.scheme == Base.scheme)) then 1237 undefine(R.scheme); 1238 endif; 1240 if defined(R.scheme) then 1241 T.scheme = R.scheme; 1242 T.authority = R.authority; 1243 T.path = remove_dot_segments(R.path); 1244 T.query = R.query; 1245 else 1246 if defined(R.authority) then 1247 T.authority = R.authority; 1248 T.path = remove_dot_segments(R.path); 1249 T.query = R.query; 1250 else 1251 if (R.path == "") then 1252 T.path = Base.path; 1253 if defined(R.query) then 1254 T.query = R.query; 1255 else 1256 T.query = Base.query; 1257 endif; 1258 else 1259 if (R.path starts-with "/") then 1260 T.path = remove_dot_segments(R.path); 1261 else 1262 T.path = merge(Base.path, R.path); 1263 T.path = remove_dot_segments(T.path); 1264 endif; 1265 T.query = R.query; 1266 endif; 1267 T.authority = Base.authority; 1268 endif; 1269 T.scheme = Base.scheme; 1270 endif; 1272 T.fragment = R.fragment; 1274 5.2.3 Merge Paths 1276 The pseudocode above refers to a "merge" routine for merging a 1277 relative-path reference with the path of the base URI. This is 1278 accomplished as follows: 1280 o If the base URI has a defined authority component and an empty 1281 path, then return a string consisting of "/" concatenated with the 1282 reference's path; otherwise, 1284 o Return a string consisting of the reference's path component 1285 appended to all but the last segment of the base URI's path (i.e., 1286 excluding any characters after the right-most "/" in the base URI 1287 path, or excluding the entire base URI path if it does not contain 1288 any "/" characters). 1290 5.2.4 Remove Dot Segments 1292 The pseudocode also refers to a "remove_dot_segments" routine for 1293 interpreting and removing the special "." and ".." complete path 1294 segments from a referenced path. 
This is done after the path is 1295 extracted from a reference, whether or not the path was relative, in 1296 order to remove any invalid or extraneous dot-segments prior to 1297 forming the target URI. Although there are many ways to accomplish 1298 this removal process, we describe a simple method using two string 1299 buffers. 1301 1. The input buffer is initialized with the now-appended path 1302 components and the output buffer is initialized to the empty 1303 string. 1305 2. Replace any prefix of "./" or "../" at the beginning of the input 1306 buffer with "/". 1308 3. While the input buffer is not empty, loop: 1310 1. If the input buffer begins with a prefix of "/./" or "/.", 1311 where "." is a complete path segment, then replace that 1312 prefix with "/"; otherwise 1314 2. If the input buffer begins with a prefix of "/../" or "/..", 1315 where ".." is a complete path segment, then replace that 1316 prefix with "/" and remove the last segment and its preceding 1317 "/" (if any) from the output buffer; otherwise 1319 3. Remove the first segment and its preceding "/" (if any) from 1320 the input buffer and append them to the output buffer. 1322 4. Finally, the output buffer is returned as the result of 1323 remove_dot_segments. 1325 The following illustrates how the above steps are applied for two 1326 example merged paths, showing the state of the two buffers after each 1327 step. 1329 STEP OUTPUT BUFFER INPUT BUFFER 1331 1 : /a/b/c/./../../g 1332 3c: /a /b/c/./../../g 1333 3c: /a/b /c/./../../g 1334 3c: /a/b/c /./../../g 1335 3a: /a/b/c /../../g 1336 3b: /a/b /../g 1337 3b: /a /g 1338 3c: /a/g 1340 STEP OUTPUT BUFFER INPUT BUFFER 1342 1 : mid/content=5/../6 1343 3c: mid /content=5/../6 1344 3c: mid/content=5 /../6 1345 3b: mid /6 1346 3c: mid/6 1348 Some applications may find it more efficient to implement the 1349 remove_dot_segments algorithm using two segment stacks rather than 1350 strings.
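The steps of Sections 5.2.2 through 5.2.4 can be sketched in Python as follows. This is an illustrative sketch only, not a reference implementation. The function names mirror the pseudocode, and recompose() follows the component recomposition pseudocode of Section 5.3. The regular expression used by parse() is the generic-syntax expression published in RFC 2396, Appendix B; it is assumed here to be an adequate parser for the examples of Section 5.4. Undefined components are represented as None so that they stay distinct from empty ones.

```python
import re

# Generic-syntax regular expression as published in RFC 2396, Appendix B.
_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse(ref):
    """Return (scheme, authority, path, query, fragment); undefined is None."""
    m = _URI.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3: merge a relative path with the base URI's path."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Keep everything up to and including the right-most "/" of the base path.
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """Section 5.2.4: the two-string-buffer method, step by step."""
    inp, out = path, ""                                  # step 1
    if inp.startswith("../"):                            # step 2
        inp = inp[2:]
    elif inp.startswith("./"):
        inp = inp[1:]
    while inp:                                           # step 3
        if inp.startswith("/./") or inp == "/.":         # step 3.1
            inp = "/" + inp[3:]
        elif inp.startswith("/../") or inp == "/..":     # step 3.2
            inp = "/" + inp[4:]
            out = out[: out.rfind("/")] if "/" in out else ""
        else:                                            # step 3.3
            cut = inp.find("/", 1)
            cut = len(inp) if cut == -1 else cut
            out, inp = out + inp[:cut], inp[cut:]
    return out                                           # step 4


def recompose(scheme, authority, path, query, fragment):
    """Section 5.3: rebuild the string, skipping undefined components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


def resolve(base, ref, strict=True):
    """Section 5.2.2: transform reference `ref` into its target URI."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if not strict and scheme == b_scheme:
        scheme = None                      # non-strict: ignore a same-scheme prefix
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = b_scheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = b_auth
            if path == "":
                path = b_path
                if query is None:
                    query = b_query
            elif path.startswith("/"):
                path = remove_dot_segments(path)
            else:
                path = remove_dot_segments(merge(b_auth, b_path, path))
    return recompose(scheme, auth, path, query, fragment)


print(resolve("http://a/b/c/d;p?q", "../../../g"))
```

Because the query and fragment are split off by parse() before any path merging takes place, dot-segments that appear inside them are left untouched.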
1352 Note: Some client applications will fail to separate a reference's 1353 query component from its path component before merging the base 1354 and reference paths. This may result in loss of information if 1355 the query component contains the strings "/../" or "/./". 1357 5.3 Component Recomposition 1359 Parsed URI components can be recomposed to obtain the corresponding 1360 URI reference string. Using pseudocode, this would be: 1362 result = "" 1364 if defined(scheme) then 1365 append scheme to result; 1366 append ":" to result; 1367 endif; 1369 if defined(authority) then 1370 append "//" to result; 1371 append authority to result; 1372 endif; 1374 append path to result; 1376 if defined(query) then 1377 append "?" to result; 1378 append query to result; 1379 endif; 1381 if defined(fragment) then 1382 append "#" to result; 1383 append fragment to result; 1384 endif; 1386 return result; 1388 Note that we are careful to preserve the distinction between a 1389 component that is undefined, meaning that its separator was not 1390 present in the reference, and a component that is empty, meaning that 1391 the separator was present and was immediately followed by the next 1392 component separator or the end of the reference. 1394 5.4 Reference Resolution Examples 1396 Within a representation with a well-defined base URI of 1398 http://a/b/c/d;p?q 1400 a relative URI reference is transformed to its target URI as follows. 1402 5.4.1 Normal Examples 1404 "g:h" = "g:h" 1405 "g" = "http://a/b/c/g" 1406 "./g" = "http://a/b/c/g" 1407 "g/" = "http://a/b/c/g/" 1408 "/g" = "http://a/g" 1409 "//g" = "http://g" 1410 "?y" = "http://a/b/c/d;p?y" 1411 "g?y" = "http://a/b/c/g?y" 1412 "#s" = "http://a/b/c/d;p?q#s" 1413 "g#s" = "http://a/b/c/g#s" 1414 "g?y#s" = "http://a/b/c/g?y#s" 1415 ";x" = "http://a/b/c/;x" 1416 "g;x" = "http://a/b/c/g;x" 1417 "g;x?y#s" = "http://a/b/c/g;x?y#s" 1418 "" = "http://a/b/c/d;p?q" 1419 "." = "http://a/b/c/" 1420 "./" = "http://a/b/c/" 1421 ".." 
= "http://a/b/" 1422 "../" = "http://a/b/" 1423 "../g" = "http://a/b/g" 1424 "../.." = "http://a/" 1425 "../../" = "http://a/" 1426 "../../g" = "http://a/g" 1428 5.4.2 Abnormal Examples 1430 Although the following abnormal examples are unlikely to occur in 1431 normal practice, all URI parsers should be capable of resolving them 1432 consistently. Each example uses the same base as above. 1434 Parsers must be careful in handling cases where there are more 1435 relative path ".." segments than there are hierarchical levels in the 1436 base URI's path. Note that the ".." syntax cannot be used to change 1437 the authority component of a URI. 1439 "../../../g" = "http://a/g" 1440 "../../../../g" = "http://a/g" 1442 Similarly, parsers must remove the dot-segments "." and ".." when 1443 they are complete components of a path, but not when they are only 1444 part of a segment. 1446 "/./g" = "http://a/g" 1447 "/../g" = "http://a/g" 1448 "g." = "http://a/b/c/g." 1449 ".g" = "http://a/b/c/.g" 1450 "g.." = "http://a/b/c/g.." 1451 "..g" = "http://a/b/c/..g" 1453 Less likely are cases where the relative URI reference uses 1454 unnecessary or nonsensical forms of the "." and ".." complete path 1455 segments. 1457 "./../g" = "http://a/b/g" 1458 "./g/." = "http://a/b/c/g/" 1459 "g/./h" = "http://a/b/c/g/h" 1460 "g/../h" = "http://a/b/c/h" 1461 "g;x=1/./y" = "http://a/b/c/g;x=1/y" 1462 "g;x=1/../y" = "http://a/b/c/y" 1464 Some applications fail to separate the reference's query and/or 1465 fragment components from a relative path before merging it with the 1466 base path and removing dot-segments. This error is rarely noticed, 1467 since typical usage of a fragment never includes the hierarchy ("/") 1468 character, and the query component is not normally used within 1469 relative references. 
1471 "g?y/./x" = "http://a/b/c/g?y/./x" 1472 "g?y/../x" = "http://a/b/c/g?y/../x" 1473 "g#s/./x" = "http://a/b/c/g#s/./x" 1474 "g#s/../x" = "http://a/b/c/g#s/../x" 1476 Some parsers allow the scheme name to be present in a relative URI 1477 reference if it is the same as the base URI scheme. This is 1478 considered to be a loophole in prior specifications of partial URI 1479 [RFC1630]. Its use should be avoided, but is allowed for backward 1480 compatibility. 1482 "http:g" = "http:g" ; for strict parsers 1483 / "http://a/b/c/g" ; for backward compatibility 1485 6. Normalization and Comparison 1487 One of the most common operations on URIs is simple comparison: 1488 determining if two URIs are equivalent without using the URIs to 1489 access their respective resource(s). A comparison is performed every 1490 time a response cache is accessed, a browser checks its history to 1491 color a link, or an XML parser processes tags within a namespace. 1492 Extensive normalization prior to comparison of URIs is often used by 1493 spiders and indexing engines to prune a search space or reduce 1494 duplication of request actions and response storage. 1496 URI comparison is performed with respect to some particular purpose, 1497 and software with differing purposes will often be subject to 1498 differing design trade-offs with regard to how much effort should be 1499 spent in reducing duplicate identifiers. This section describes a 1500 variety of methods that may be used to compare URIs, the trade-offs 1501 between them, and the types of applications that might use them. 1503 6.1 Equivalence 1505 Since URIs exist to identify resources, presumably they should be 1506 considered equivalent when they identify the same resource. However, 1507 such a definition of equivalence is not of much practical use, since 1508 there is no way for software to compare two resources without 1509 knowledge of the implementation-specific syntax of each URI's 1510 dereferencing algorithm.
For this reason, determination of 1511 equivalence or difference of URIs is based on string comparison, 1512 perhaps augmented by reference to additional rules provided by URI 1513 scheme definitions. We use the terms "different" and "equivalent" to 1514 describe the possible outcomes of such comparisons, but there are 1515 many application-dependent versions of equivalence. 1517 Even though it is possible to determine that two URIs are equivalent, 1518 it is never possible to be sure that two URIs identify different 1519 resources. For example, an owner of two different domain names could 1520 decide to serve the same resource from both, resulting in two 1521 different URIs. Therefore, comparison methods are designed to 1522 minimize false negatives while strictly avoiding false positives. 1524 In testing for equivalence, applications should not directly compare 1525 relative URI references; the references should be converted to their 1526 target URI forms before comparison. When URIs are being compared for 1527 the purpose of selecting (or avoiding) a network action, such as 1528 retrieval of a representation, the fragment components (if any) 1529 should be excluded from the comparison. 1531 6.2 Comparison Ladder 1533 A variety of methods are used in practice to test URI equivalence. 1534 These methods fall into a range, distinguished by the amount of 1535 processing required and the degree to which the probability of false 1536 negatives is reduced. As noted above, false negatives cannot in 1537 principle be eliminated. In practice, their probability can be 1538 reduced, but this reduction requires more processing and is not 1539 cost-effective for all applications. 
1541 If this range of comparison practices is considered as a ladder, the 1542 following discussion will climb the ladder, starting with those 1543 practices that are cheap but have a relatively higher chance of 1544 producing false negatives, and proceeding to those that have higher 1545 computational cost and lower risk of false negatives. 1547 6.2.1 Simple String Comparison 1549 If two URIs, considered as character strings, are identical, then it 1550 is safe to conclude that they are equivalent. This type of 1551 equivalence test has very low computational cost and is in wide use 1552 in a variety of applications, particularly in the domain of parsing. 1554 Testing strings for equivalence requires some basic precautions. This 1555 procedure is often referred to as "bit-for-bit" or "byte-for-byte" 1556 comparison, which is potentially misleading. Testing of strings for 1557 equality is normally based on pairwise comparison of the characters 1558 that make up the strings, starting from the first and proceeding 1559 until both strings are exhausted and all characters found to be 1560 equal, a pair of characters compares unequal, or one of the strings 1561 is exhausted before the other. 1563 Such character comparisons require that each pair of characters be 1564 put in comparable form. For example, should one URI be stored in a 1565 byte array in EBCDIC encoding, and the second be in a Java String 1566 object (UTF-16), bit-for-bit comparisons applied naively will produce 1567 both false-positive and false-negative errors. It is better to speak 1568 of equality on a character-for-character rather than byte-for-byte or 1569 bit-for-bit basis. In practical terms, character-by-character 1570 comparisons should be done codepoint-by-codepoint after conversion to 1571 a common character encoding. 1573 6.2.2 Syntax-based Normalization 1575 Software may use logic based on the definitions provided by this 1576 specification to reduce the probability of false negatives. 
Such 1577 processing is moderately higher in cost than character-for-character 1578 string comparison. For example, an application using this approach 1579 could reasonably consider the following two URIs equivalent: 1581 example://a/b/c/%7Bfoo%7D 1582 eXAMPLE://a/./b/../b/%63/%7bfoo%7d 1584 Web user agents, such as browsers, typically apply this type of URI 1585 normalization when determining whether a cached response is 1586 available. Syntax-based normalization includes such techniques as 1587 case normalization, encoding normalization, empty-component 1588 normalization, and removal of dot-segments. 1590 6.2.2.1 Case Normalization 1592 When a URI scheme uses components of the generic syntax, it will also 1593 use the common syntax equivalence rules, namely that the scheme and 1594 host are case-insensitive and therefore should be normalized to 1595 lowercase. For example, the URI is 1596 equivalent to . Applications should not 1597 assume anything about the case sensitivity of other URI components, 1598 since that is dependent on the implementation used to handle a 1599 dereference. 1601 The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" 1602 versus "%3A") are case-insensitive and therefore should be normalized 1603 to use uppercase letters for the digits A-F. 1605 6.2.2.2 Encoding Normalization 1607 The percent-encoding mechanism (Section 2.1) is a frequent source of 1608 variance among otherwise identical URIs. In addition to the 1609 case-insensitivity issue noted above, some URI producers 1610 percent-encode octets that do not require percent-encoding, resulting 1611 in URIs that are equivalent to their non-encoded counterparts. Such 1612 URIs should be normalized by decoding any percent-encoded octet that 1613 corresponds to an unreserved character, as described in Section 2.3. 1615 6.2.2.3 Empty-component Normalization 1617 Components of the generic URI syntax are delimited from other 1618 components by optional separators. 
For example, a query component is 1619 separated from the path by a question mark ("?") and a port 1620 sub-component is separated from host by a colon (":"). A URI in 1621 which a delimiter is present and the (sub-)component it delimits is 1622 empty is equivalent to the same URI without that delimiter. For 1623 example, the following are all equivalent: 1625 ftp://example.com/ 1626 ftp://example.com:/ 1627 ftp://@example.com:/ 1628 ftp://@example.com:/? 1629 ftp://@example.com:/?# 1631 URI producers and normalizers should omit a delimiter if the 1632 component it delimits is empty, as exemplified by the first URI 1633 above, with one exception: a double-slash delimiter indicating an 1634 authority component should not be removed, even when the authority is 1635 empty, since doing so can lead to misinterpreting the path. 1637 6.2.2.4 Path Segment Normalization 1639 The complete path segments "." and ".." have a special meaning within 1640 hierarchical URI schemes. As such, they should not appear in 1641 absolute paths; if they are found, they can be removed by applying 1642 the remove_dot_segments algorithm to the path, as described in 1643 Section 5.2. 1645 6.2.3 Scheme-based Normalization 1647 The syntax and semantics of URIs vary from scheme to scheme, as 1648 described by the defining specification for each scheme. Software 1649 may use scheme-specific rules, at further processing cost, to reduce 1650 the probability of false negatives. 
For example, since the "http" 1651 scheme makes use of an authority component, has a default port of 1652 "80", and defines an empty path to be equivalent to "/", the 1653 following four URIs are equivalent: 1655 http://example.com 1656 http://example.com/ 1657 http://example.com:/ 1658 http://example.com:80/ 1660 In general, a URI that uses the generic syntax for authority with an 1661 empty path should be normalized to a path of "/"; likewise, an 1662 explicit ":port", where the port is empty or the default for the 1663 scheme, is equivalent to one where the port and its ":" delimiter are 1664 elided. In other words, the second of the above URI examples is the 1665 normal form for the "http" scheme. 1667 Another case where normalization varies by scheme is in the handling 1668 of an empty authority component. For many scheme specifications, an 1669 empty authority is considered an error; for others, it is considered 1670 equivalent to "localhost". For the sake of uniformity, future scheme 1671 specifications should define an empty authority as being equivalent 1672 to "localhost", and URI producers and normalizers should use 1673 "localhost" instead of an empty authority. 1675 6.2.4 Protocol-based Normalization 1677 Web spiders, for which substantial effort to reduce the incidence of 1678 false negatives is often cost-effective, are observed to implement 1679 even more aggressive techniques in URI comparison. For example, if 1680 they observe that a URI such as 1682 http://example.com/data 1684 redirects to a URI differing only in the trailing slash 1686 http://example.com/data/ 1688 they will likely regard the two as equivalent in the future. This 1689 kind of technique is only appropriate when equivalence is clearly 1690 indicated by both the result of accessing the resources and the 1691 common conventions of their scheme's dereference algorithm (in this 1692 case, use of redirection by HTTP origin servers to avoid problems 1693 with relative references). 
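The syntax-based and scheme-based steps of Sections 6.2.2 and 6.2.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a complete normalizer. The names normalize, DEFAULT_PORTS, _pct and _drop_dots are hypothetical helpers introduced for illustration. The unreserved character set is assumed to be letters, digits, "-", ".", "_" and "~" and should be checked against Section 2.3. The parser is the generic-syntax regular expression published in RFC 2396, Appendix B. Dot-segments are removed with a segment stack rather than with string buffers, and only from absolute paths.

```python
import re

# Generic-syntax regular expression as published in RFC 2396, Appendix B.
_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Assumption: unreserved = letters, digits, "-", ".", "_", "~" (see Section 2.3).
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": "80"}  # illustrative table; extend per scheme


def _pct(text):
    """Uppercase hex digits; decode triplets that encode unreserved characters."""
    def fix(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in _UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def _drop_dots(path):
    """Remove "." and ".." segments from an absolute path using a stack."""
    segs, out = path.split("/"), []
    for i, seg in enumerate(segs):
        last = i == len(segs) - 1
        if seg == ".." and len(out) > 1:
            out.pop()                      # never pop the leading empty segment
        if seg in (".", ".."):
            if last:
                out.append("")             # keep the trailing "/"
        else:
            out.append(seg)
    return "/".join(out)


def normalize(uri):
    m = _URI.match(uri)
    scheme, authority, path = m.group(2), m.group(4), _pct(m.group(5))
    query, fragment = m.group(7), m.group(9)
    result = ""
    if scheme is not None:
        scheme = scheme.lower()            # case normalization (6.2.2.1)
        result = scheme + ":"
    if authority is not None:
        userinfo, _, hostport = authority.rpartition("@")
        if hostport.endswith("]") or ":" not in hostport:
            host, port = hostport, ""      # no port (or a bare IPv6 literal)
        else:
            host, _, port = hostport.rpartition(":")
        if port == DEFAULT_PORTS.get(scheme):
            port = ""                      # scheme default port (6.2.3)
        result += "//"                     # "//" is kept even if authority is empty
        if userinfo:
            result += _pct(userinfo) + "@"
        result += _pct(host.lower())
        if port:
            result += ":" + port
        path = path or "/"                 # empty path with authority (6.2.3)
    if path.startswith("/"):
        path = _drop_dots(path)            # path segment normalization (6.2.2.4)
    result += path
    if query:                              # empty query or fragment is dropped
        result += "?" + _pct(query)
    if fragment:
        result += "#" + _pct(fragment)
    return result


print(normalize("eXAMPLE://a/./b/../b/%63/%7bfoo%7d"))
```

Two URIs can then be compared by applying normalize() to both and testing the results with simple string comparison, which climbs one or two rungs of the ladder at the cost of the extra processing.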
1695 6.3 Canonical Form 1697 It is in the best interests of everyone to avoid false-negatives in 1698 comparing URIs and to minimize the amount of software processing for 1699 such comparisons. Those who produce and make reference to URIs can 1700 reduce the cost of processing and the risk of false negatives by 1701 consistently providing them in a form that is reasonably canonical 1702 with respect to their scheme. Specifically: 1704 o Always provide the URI scheme in lowercase characters. 1706 o Always provide the host, if any, in lowercase characters. 1708 o Only perform percent-encoding where it is essential. 1710 o Always use uppercase A-through-F characters when percent-encoding. 1712 o Prevent /./ and /../ from appearing in non-relative URI paths. 1714 o Omit delimiters when their associated (sub-)component is empty. 1716 o For schemes that define an empty authority to be equivalent to 1717 "localhost", use "localhost". 1719 o For schemes that define an empty path to be equivalent to a path 1720 of "/", use "/". 1722 7. Security Considerations 1724 A URI does not in itself pose a security threat. However, since URIs 1725 are often used to provide a compact set of instructions for access to 1726 network resources, care must be taken to properly interpret the data 1727 within a URI, to prevent that data from causing unintended access, 1728 and to avoid including data that should not be revealed in plain 1729 text. 1731 7.1 Reliability and Consistency 1733 There is no guarantee that, having once used a given URI to retrieve 1734 some information, the same information will be retrievable by that 1735 URI in the future. Nor is there any guarantee that the information 1736 retrievable via that URI in the future will be observably similar to 1737 that retrieved in the past. The URI syntax does not constrain how a 1738 given scheme or authority apportions its name space or maintains it 1739 over time. 
Such a guarantee can only be obtained from the person(s) 1740 controlling that name space and the resource in question. A specific 1741 URI scheme may define additional semantics, such as name persistence, 1742 if those semantics are required of all naming authorities for that 1743 scheme. 1745 7.2 Malicious Construction 1747 It is sometimes possible to construct a URI such that an attempt to 1748 perform a seemingly harmless, idempotent operation, such as the 1749 retrieval of a representation, will in fact cause a possibly damaging 1750 remote operation to occur. The unsafe URI is typically constructed 1751 by specifying a port number other than that reserved for the network 1752 protocol in question. The client unwittingly contacts a site that is 1753 running a different protocol service and data within the URI contains 1754 instructions that, when interpreted according to this other protocol, 1755 cause an unexpected operation. A frequent example of such abuse has 1756 been the use of a protocol-based scheme with a port component of 1757 "25", thereby fooling user agent software into sending an unintended 1758 or impersonating message via an SMTP server. 1760 Applications should prevent dereference of a URI that specifies a TCP 1761 port number within the "well-known port" range (0 - 1023) unless the 1762 protocol being used to dereference that URI is compatible with the 1763 protocol expected on that well-known port. Although IANA maintains a 1764 registry of well-known ports, applications should make such 1765 restrictions user-configurable to avoid preventing the deployment of 1766 new services. 1768 When a URI contains percent-encoded octets that match the delimiters 1769 for a given resolution or dereference protocol (for example, CR and 1770 LF characters for the TELNET protocol), such percent-encoded octets 1771 must not be decoded before transmission across that protocol. 
1772 Transfer of the percent-encoding, which might violate the protocol, 1773 is less harmful than allowing decoded octets to be interpreted as 1774 additional operations or parameters, perhaps triggering an unexpected 1775 and possibly harmful remote operation. 1777 7.3 Back-end Transcoding 1779 When a URI is dereferenced, the data within it is often parsed by 1780 both the user agent and one or more servers. In HTTP, for example, a 1781 typical user agent will parse a URI into its five major components, 1782 access the authority's server, and send it the data within the 1783 authority, path, and query components. A typical server will take 1784 that information, parse the path into segments and the query into 1785 key/value pairs, and then invoke implementation-specific handlers to 1786 respond to the request. As a result, a common security concern for 1787 server implementations that handle a URI, either as a whole or split 1788 into separate components, is proper interpretation of the octet data 1789 represented by the characters and percent-encodings within that URI. 1791 Percent-encoded octets must be decoded at some point during the 1792 dereference process. Applications must split the URI into its 1793 components and sub-components prior to decoding the octets, since 1794 otherwise the decoded octets might be mistaken for delimiters. 1795 Security checks of the data within a URI should be applied after 1796 decoding the octets. Note, however, that the "%00" percent-encoding 1797 (NUL) may require special handling and should be rejected if the 1798 application is not expecting to receive raw data within a component. 1800 Special care should be taken when the URI path interpretation process 1801 involves the use of a back-end filesystem or related system 1802 functions. 
Filesystems typically assign an operational meaning to 1803 special characters, such as the "/", "\", ":", "[", and "]" 1804 characters, and special device names like ".", "..", "...", "aux", 1805 "lpt", etc. In some cases, merely testing for the existence of such a 1806 name will cause the operating system to pause or invoke unrelated 1807 system calls, leading to significant security concerns regarding 1808 denial of service and unintended data transfer. It would be 1809 impossible for this specification to list all such significant 1810 characters and device names; implementers should research the 1811 reserved names and characters for the types of storage device that 1812 may be attached to their application and restrict the use of data 1813 obtained from URI components accordingly. 1815 7.4 Rare IP Address Formats 1817 Although the URI syntax for IPv4address only allows the common, 1818 dotted-decimal form of IPv4 address literal, many implementations 1819 that process URIs make use of platform-dependent system routines, 1820 such as gethostbyname() and inet_aton(), to translate the string 1821 literal to an actual IP address. Unfortunately, such system routines 1822 often allow and process a much larger set of formats than those 1823 described in Section 3.2.2. 1825 For example, many implementations allow dotted forms of three 1826 numbers, wherein the last part is interpreted as a 16-bit quantity 1827 and placed in the right-most two bytes of the network address (e.g., 1828 a Class B network). Likewise, a dotted form of two numbers means the 1829 last part is interpreted as a 24-bit quantity and placed in the 1830 right-most three bytes of the network address (Class A), and a single 1831 number (without dots) is interpreted as a 32-bit quantity and stored 1832 directly in the network address.
Adding further to the confusion, 1833 some implementations allow each dotted part to be interpreted as 1834 decimal, octal, or hexadecimal, as specified in the C language (i.e., 1835 a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 1836 implies octal; otherwise, the number is interpreted as decimal). 1838 These additional IP address formats are not allowed in the URI syntax 1839 due to differences between platform implementations. However, they 1840 can become a security concern if an application attempts to filter 1841 access to resources based on the IP address in string literal format. 1842 If such filtering is performed, literals should be converted to 1843 numeric form and filtered based on the numeric value, rather than a 1844 prefix or suffix of the string form. 1846 7.5 Sensitive Information 1848 URI producers should not provide a URI that contains a username or 1849 password which is intended to be secret: URIs are frequently 1850 displayed by browsers, stored in clear text bookmarks, and logged by 1851 user agent history and intermediary applications (proxies). A 1852 password appearing within the userinfo component is deprecated and 1853 should be considered an error (or simply ignored) except in those 1854 rare cases where the 'password' parameter is intended to be public. 1856 7.6 Semantic Attacks 1858 Because the userinfo sub-component is rarely used and appears before 1859 the host in the authority component, it can be used to construct a 1860 URI that is intended to mislead a human user by appearing to identify 1861 one (trusted) naming authority while actually identifying a different 1862 authority hidden behind the noise. For example 1864 ftp://ftp.example.com&story=breaking_news@10.0.0.1/top_story.htm 1866 might lead a human user to assume that the host is 1867 'ftp.example.com', whereas it is actually '10.0.0.1'. Note that 1868 a misleading userinfo sub-component could be much longer than the 1869 example above.
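How a parser actually splits the authority of the example URI can be checked with a short Python sketch. It uses the standard library's urllib.parse.urlsplit. The helper name split_authority is hypothetical and introduced only for illustration. A user agent that wants to render userinfo distinctly could start from a split of this kind.

```python
from urllib.parse import urlsplit


def split_authority(uri):
    """Return (userinfo, host) as a parser sees them, not as a reader might."""
    parts = urlsplit(uri)
    # Everything before the last "@" in the authority is userinfo.
    userinfo, _, _ = parts.netloc.rpartition("@")
    return userinfo, parts.hostname


userinfo, host = split_authority(
    "ftp://ftp.example.com&story=breaking_news@10.0.0.1/top_story.htm"
)
print("userinfo:", userinfo)
print("host:    ", host)
```

The text that looks like a host name ends up entirely in the userinfo sub-component; the host that would actually be contacted is the IP address literal after the "@".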
1871 A misleading URI, such as the one above, is an attack on the user's 1872 preconceived notions about the meaning of a URI, rather than an 1873 attack on the software itself. User agents may be able to reduce the 1874 impact of such attacks by distinguishing the various components of 1875 the URI when rendered, such as by using a different color or tone to 1876 render userinfo if any is present, though there is no general 1877 panacea. More information on URI-based semantic attacks can be found 1878 in [Siedzik]. 1880 8. Acknowledgments 1882 This specification is derived from RFC 2396 [RFC2396], RFC 1808 1883 [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those 1884 documents still apply. It also incorporates the update (with 1885 corrections) for IPv6 literals in the host syntax, as defined by 1886 Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in 1887 [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, 1888 Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, 1889 Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin 1890 Duerst, Stefan Eissing, Clive D.W. Feather, Tony Hammond, Pat Hayes, 1891 Henry Holtzman, Ian B. Jacobs, Michael Kay, John C. Klensin, Graham 1892 Klyne, Dan Kohn, Bruce Lilly, Andrew Main, Ira McDonald, Michael 1893 Mealling, Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, 1894 Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, Stuart 1895 Williams, and Henry Zongaro are gratefully acknowledged. 1897 Normative References 1899 [ASCII] American National Standards Institute, "Coded Character 1900 Set -- 7-bit American Standard Code for Information 1901 Interchange", ANSI X3.4, 1986. 1903 [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 1904 Specifications: ABNF", RFC 2234, November 1997. 1906 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1907 10646", STD 63, RFC 3629, November 2003. 

Informative References

   [RFC0952]  Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
              With Widely Deployed DNS Software", RFC 1535, October
              1993.

   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
              Unifying Syntax for the Expression of Names and Addresses
              of Objects on the Network as used in the World-Wide Web",
              RFC 1630, June 1994.

   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
              Resource Locators", RFC 1736, February 1995.

   [RFC1737]  Masinter, L. and K. Sollins, "Functional Requirements for
              Uniform Resource Names", RFC 1737, December 1994.

   [RFC1738]  Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
              Resource Locators (URL)", RFC 1738, December 1994.

   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators", RFC
              1808, June 1995.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2110]  Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of
              Aggregate Documents, such as HTML (MHTML)", RFC 2110,
              March 1997.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC2396]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D.
              Jensen, "HTTP Extensions for Distributed Authoring --
              WEBDAV", RFC 2518, February 1999.

   [RFC2717]  Petke, R. and I. King, "Registration Procedures for URL
              Scheme Names", BCP 35, RFC 2717, November 1999.

   [RFC2718]  Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
              "Guidelines for new URL Schemes", RFC 2718, November 1999.

   [RFC2732]  Hinden, R., Carpenter, B. and L. Masinter, "Format for
              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.

   [RFC2978]  Freed, N. and J. Postel, "IANA Charset Registration
              Procedures", BCP 19, RFC 2978, October 2000.

   [RFC3305]  Mealling, M. and R. Denenberg, "Report from the Joint W3C/
              IETF URI Planning Interest Group: Uniform Resource
              Identifiers (URIs), URLs, and Uniform Resource Names
              (URNs): Clarifications and Recommendations", RFC 3305,
              August 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P. and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3513]  Hinden, R. and S. Deering, "Internet Protocol Version 6
              (IPv6) Addressing Architecture", RFC 3513, April 2003.

   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?", April
              2001, .

Authors' Addresses

   Tim Berners-Lee
   World Wide Web Consortium
   MIT/LCS, Room NE43-356
   200 Technology Square
   Cambridge, MA 02139
   USA

   Phone: +1-617-253-5702
   Fax:   +1-617-258-5999
   EMail: timbl@w3.org
   URI:   http://www.w3.org/People/Berners-Lee/

   Roy T. Fielding
   Day Software
   5251 California Ave., Suite 110
   Irvine, CA 92612-3074
   USA

   Phone: +1-949-679-2960
   Fax:   +1-949-679-2972
   EMail: fielding@gbiv.com
   URI:   http://roy.gbiv.com/

   Larry Masinter
   Adobe Systems Incorporated
   345 Park Ave
   San Jose, CA 95110
   USA

   Phone: +1-408-536-3024
   EMail: LMM@acm.org
   URI:   http://larry.masinter.net/

Appendix A.
Collected ABNF for URI

   URI           = scheme ":" ["//" authority] path ["?" query]
                   ["#" fragment]

   URI-reference = URI / relative-URI

   relative-URI  = ["//" authority] path ["?" query] ["#" fragment]

   absolute-URI  = scheme ":" ["//" authority] path ["?" query]

   scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   authority     = [ userinfo "@" ] host [ ":" port ]
   userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
   host          = IP-literal / IPv4address / reg-name
   port          = *DIGIT

   IP-literal    = "[" ( IPv6address / IPvFuture ) "]"

   IPvFuture     = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address   =                            6( h16 ":" ) ls32
                 /                       "::" 5( h16 ":" ) ls32
                 / [               h16 ] "::" 4( h16 ":" ) ls32
                 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                 / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                 / [ *4( h16 ":" ) h16 ] "::"              ls32
                 / [ *5( h16 ":" ) h16 ] "::"              h16
                 / [ *6( h16 ":" ) h16 ] "::"

   h16           = 1*4HEXDIG
   ls32          = ( h16 ":" h16 ) / IPv4address

   IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet

   dec-octet     = DIGIT                 ; 0-9
                 / %x31-39 DIGIT         ; 10-99
                 / "1" 2DIGIT            ; 100-199
                 / "2" %x30-34 DIGIT     ; 200-249
                 / "25" %x30-35          ; 250-255

   reg-name      = 0*255( unreserved / pct-encoded / sub-delims )

   path          = segment *( "/" segment )
   segment       = *pchar

   query         = *( pchar / "/" / "?" )
   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Appendix B.
Parsing a URI Reference with a Regular Expression

   Since the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking-down a
   well-formed URI reference into its components.

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis). We refer to the value matched for subexpression
   <n> as $<n>. For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example. Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

   and, going in the opposite direction, we can recreate a URI reference
   from its components using the algorithm of Section 5.3.

Appendix C. Delimiting a URI in Context

   URIs are often transmitted through formats that do not provide a
   clear context for their interpretation. For example, there are many
   occasions when a URI is included in plain text; examples include text
   sent in electronic mail, USENET news messages, and, most importantly,
   printed on paper. In such cases, it is important to be able to
   delimit the URI from the rest of the text, and in particular from
   punctuation marks that might be mistaken for part of the URI.

   In practice, URIs are delimited in a variety of ways, but usually
   within double-quotes "http://example.com/", angle brackets
   <http://example.com/>, or just using whitespace

      http://example.com/

   These wrappers do not form part of the URI.

   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
   need to be added to break a long URI across lines. The whitespace
   should be ignored when extracting the URI.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of line when breaking a line, the interpreter of a
   URI containing a line break immediately after a hyphen should ignore
   all whitespace around the line break, and should be aware that the
   hyphen may or may not actually be part of the URI.

   Using <> angle brackets around each URI is especially recommended as
   a delimiting style for a reference that contains embedded whitespace.

   The prefix "URL:" (with or without a trailing space) was formerly
   recommended as a way to help distinguish a URI from other bracketed
   designators, though it is not commonly used in practice and is no
   longer recommended.

   For robustness, software that accepts a user-typed URI should attempt
   to recognize and strip both delimiters and embedded whitespace.

   For example, the text:

      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
      but you can probably pick it up from <ftp://foo.example.com/rfc/>.
      Note the warning in
      <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>.

   contains the URI references

      http://www.w3.org/Addressing/
      ftp://foo.example.com/rfc/
      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING

Appendix D.
Summary of Non-editorial Changes

D.1 Additions

   IPv6 (and later) literals have been added to the list of possible
   identifiers for the host portion of an authority component, as
   described by [RFC2732], with the addition of "[" and "]" to the
   reserved set and a version flag to anticipate future versions of IP
   literals. Square brackets are now specified as reserved within the
   authority component and not allowed outside their use as delimiters
   for an IP literal within host. In order to make this change without
   changing the technical definition of the path, query, and fragment
   components, those rules were redefined to directly specify the
   characters allowed rather than be defined in terms of uric.

   Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
   address, which unfortunately lacks an ABNF description of
   IPv6address, we created a new ABNF rule for IPv6address that matches
   the text representations defined by Section 2.2 of [RFC3513].
   Likewise, the definition of IPv4address has been improved in order to
   limit each decimal octet to the range 0-255.

   Section 6 on URI normalization and comparison has been
   completely rewritten and extended using input from Tim Bray and
   discussion within the W3C Technical Architecture Group.

   An ABNF rule for URI has been introduced to correspond to the common
   usage of the term: an absolute URI with optional fragment.

D.2 Modifications from RFC 2396

   The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234].
   This change required all rule names that formerly included underscore
   characters to be renamed with a dash instead.

   Section 2 on characters has been rewritten to explain what characters
   are reserved, when they are reserved, and why they are reserved even
   when not used as delimiters by the generic syntax. The mark
   characters that are typically unsafe to decode, including the
   exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open
   and close parentheses ("(" and ")"), have been moved to the reserved
   set in order to clarify the distinction between reserved and
   unreserved and hopefully answer the most common question of scheme
   designers. Likewise, the section on percent-encoded characters has
   been rewritten, and URI normalizers are now given license to decode
   any percent-encoded octets corresponding to unreserved characters.
   In general, the terms "escaped" and "unescaped" have been replaced
   with "percent-encoded" and "decoded", respectively, to reduce
   confusion with other forms of escape mechanisms.

   The ABNF for URI and URI-reference has been redesigned to make them
   more friendly to LALR parsers and significantly reduce complexity. As
   a result, the layout form of syntax description has been removed,
   along with the uric, uric_no_slash, hier_part, opaque_part, net_path,
   abs_path, rel_path, path_segments, rel_segment, and mark rules. All
   references to "opaque" URIs have been replaced with a better
   description of how the path component may be opaque to hierarchy. The
   ambiguity regarding the parsing of URI-reference as a URI or a
   relative-URI with a colon in the first segment is now explained and
   disambiguated in the section defining relative-URI.

   The fragment identifier has been moved back into the section on
   generic syntax components and within the URI and relative-URI rules,
   though it remains excluded from absolute-URI. The number sign ("#")
   character has been moved back to the reserved set as a result of
   reintegrating the fragment syntax.

   The ABNF has been corrected to allow a relative path to be empty.
   This also allows an absolute-URI to consist of nothing after the
   "scheme:", as is present in practice with the "dav:" namespace
   [RFC2518] and the "about:" scheme used internally by many WWW browser
   implementations. The ambiguity regarding the boundary between
   authority and path is now explained and disambiguated in the same
   section.

   Registry-based naming authorities that use the generic syntax are now
   defined within the host rule and limited to 255 path characters. This
   change allows current implementations, where whatever name provided
   is simply fed to the local name resolution mechanism, to be
   consistent with the specification and removes the need to re-specify
   DNS name formats here. It also allows the host component to contain
   percent-encoded octets, which is necessary to enable
   internationalized domain names to be provided in URIs, processed in
   their native character encodings at the application layers above URI
   processing, and passed to an IDNA library as a registered name in the
   UTF-8 character encoding. The server, hostport, hostname,
   domainlabel, toplabel, and alphanum rules have been removed.

   The algorithm of [RFC2396] for resolving relative references has been
   rewritten using pseudocode for this revision to improve clarity and
   fix the following issues:

   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
      with no path.

   o  Restored the behavior of [RFC1808] where, if the reference
      contains an empty path and a defined query component, then the
      target URI inherits the base URI's path component.

   o  Removed the special-case treatment of same-document references
      within the URI parser in favor of a section that explains when a
      reference should be interpreted by a dereferencing engine as a
      same-document reference: when the target URI and base URI,
      excluding fragments, match. This change does not modify the
      behavior of existing same-document references as defined by RFC
      2396 (fragment-only references); it merely adds the same-document
      distinction to other references that refer to the base URI and
      simplifies the interface between applications and their URI
      parsers, as is consistent with the internal architecture of
      deployed URI processing implementations.

   o  Separated the path merge routine into two routines: merge, for
      describing combination of the base URI path with a relative-path
      reference, and remove_dot_segments, for describing how to remove
      the special "." and ".." segments from a composed path. The
      remove_dot_segments algorithm is now applied to all URI reference
      paths in order to match common implementations and improve the
      normalization of URIs in practice. This change only impacts the
      parsing of abnormal references and same-scheme references wherein
      the base URI has a non-hierarchical path.
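
   The split into merge and remove_dot_segments described in the last
   item above can be sketched as follows. This is an illustrative
   Python rendering that assumes a simple input-string/output-list
   formulation; the function signatures, and in particular the
   base_has_authority flag, are choices made for this example. The
   normative pseudocode is the one given in the body of this
   specification, which may differ in detail.

```python
def merge(base_path: str, ref_path: str, base_has_authority: bool) -> str:
    """Combine a base URI path with a relative-path reference."""
    # A base URI with an authority but no path behaves as if its path were "/".
    if base_has_authority and base_path == "":
        return "/" + ref_path
    # Otherwise keep everything up to and including the right-most "/".
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path: str) -> str:
    """Remove the special "." and ".." segments from a composed path."""
    output = []  # completed segments, each with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x"
            if output:
                output.pop()         # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


# Resolving "../g" against a base path of "/b/c/d;p".
print(remove_dot_segments(merge("/b/c/d;p", "../g", True)))
```

   Keeping the two routines separate lets remove_dot_segments be
   applied to any path, including one that was never merged with a
   base path.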

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2004). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.