idnits 2.17.00 (12 Aug 2021)

/tmp/idnits28113/draft-fielding-uri-rfc2396bis-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 740: '... practice is NOT RECOMMENDED, because ...'

  -- The draft header indicates that this document obsoletes RFC2732, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC2396, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC1808, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC1738, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 672 has weird spacing: '... query frag...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 6, 2003) is 6923 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC2277' is defined on line 1756, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 1808
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2518
     (Obsoleted by RFC 4918)

  -- Obsolete informational reference (is this intentional?): RFC 3513
     (Obsoleted by RFC 4291)

  -- Obsolete informational reference (is this intentional?): RFC 2732
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2110
     (Obsoleted by RFC 2557)

  -- Obsolete informational reference (is this intentional?): RFC 2717
     (Obsoleted by RFC 4395)

  -- Obsolete informational reference (is this intentional?): RFC 2279 (ref.
     'UTF-8') (Obsoleted by RFC 3629)

     Summary: 6 errors (**), 0 flaws (~~), 5 warnings (==), 17 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Network Working Group                                     T. Berners-Lee
3    Internet-Draft                                                   MIT/LCS
4    Updates: 1738 (if approved)                                  R. Fielding
5    Obsoletes: 2732, 2396, 1808 (if approved)                   Day Software
6                                                                 L. Masinter
7    Expires: December 5, 2003                                          Adobe
8                                                                June 6, 2003

10              Uniform Resource Identifier (URI): Generic Syntax
11                       draft-fielding-uri-rfc2396bis-03

13   Status of this Memo

15      This document is an Internet-Draft and is in full conformance with
16      all provisions of Section 10 of RFC2026.

18      Internet-Drafts are working documents of the Internet Engineering
19      Task Force (IETF), its areas, and its working groups. Note that other
20      groups may also distribute working documents as Internet-Drafts.

22      Internet-Drafts are draft documents valid for a maximum of six months
23      and may be updated, replaced, or obsoleted by other documents at any
24      time. It is inappropriate to use Internet-Drafts as reference
25      material or to cite them other than as "work in progress."

27      The list of current Internet-Drafts can be accessed at
28      .

30      The list of Internet-Draft Shadow Directories can be accessed at
31      .

33   Copyright Notice

35      Copyright (C) The Internet Society (2003). All Rights Reserved.

37   Abstract

39      A Uniform Resource Identifier (URI) is a compact string of characters
40      for identifying an abstract or physical resource. This specification
41      defines the generic URI syntax and a process for resolving URI
42      references that might be in relative form, along with guidelines and
43      security considerations for the use of URIs on the Internet.

45      The URI syntax defines a grammar that is a superset of all valid
46      URIs, such that an implementation can parse the common components of
47      a URI reference without knowing the scheme-specific requirements of
48      every possible identifier. This specification does not define a
49      generative grammar for URIs; that task is performed by the individual
50      specifications of each URI scheme.

52   Editorial Note

54      Discussion of this draft and comments to the editors should be sent
55      to the uri@w3.org mailing list. An issues list and version history
56      is available at .

59   Table of Contents

61      1.    Introduction
62      1.1   Overview of URIs
63      1.1.1 Generic Syntax
64      1.1.2 Examples
65      1.1.3 URI, URL, and URN
66      1.2   Design Considerations
67      1.2.1 Transcription
68      1.2.2 Separating Identification from Interaction
69      1.2.3 Hierarchical Identifiers
70      1.3   Syntax Notation
71      2.    Characters
72      2.1   Encoding of Characters
73      2.2   Reserved Characters
74      2.3   Unreserved Characters
75      2.4   Escaped Characters
76      2.4.1 Escaped Encoding
77      2.4.2 When to Escape and Unescape
78      2.5   Excluded Characters
79      3.    Syntax Components
80      3.1   Scheme
81      3.2   Authority
82      3.2.1 User Information
83      3.2.2 Host
84      3.2.3 Port
85      3.3   Path
86      3.4   Query
87      3.5   Fragment
88      4.    Usage
89      4.1   URI Reference
90      4.2   Relative URI
91      4.3   Absolute URI
92      4.4   Same-document Reference
93      4.5   Suffix Reference
94      5.    Reference Resolution
95      5.1   Establishing a Base URI
96      5.1.1 Base URI within Document Content
97      5.1.2 Base URI from the Encapsulating Entity
98      5.1.3 Base URI from the Retrieval URI
99      5.1.4 Default Base URI
100     5.2   Obtaining the Referenced URI
101     5.3   Recomposition of a Parsed URI
102     5.4   Reference Resolution Examples
103     5.4.1 Normal Examples
104     5.4.2 Abnormal Examples
105     6.    Normalization and Comparison
106     6.1   Equivalence
107     6.2   Comparison Ladder
108     6.2.1 Simple String Comparison
109     6.2.2 Syntax-based Normalization
110     6.2.3 Scheme-based Normalization
111     6.2.4 Protocol-based Normalization
112     6.3   Canonical Form
113     7.    Security Considerations
114     7.1   Reliability and Consistency
115     7.2   Malicious Construction
116     7.3   Rare IP Address Formats
117     7.4   Sensitive Information
118     7.5   Semantic Attacks
119     8.    Acknowledgments
120           Normative References
121           Informative References
122           Authors' Addresses
123     A.    Collected ABNF for URI
124     B.    Parsing a URI Reference with a Regular Expression
125     C.    Delimiting a URI in Context
126     D.    Summary of Non-editorial Changes
127     D.1   Additions
128     D.2   Modifications from RFC 2396
129           Index
130           Intellectual Property and Copyright Statements

132  1. Introduction

134     A Uniform Resource Identifier (URI) provides a simple and extensible
135     means for identifying a resource. This specification of URI syntax
136     and semantics is derived from concepts introduced by the World Wide
137     Web global information initiative, whose use of such identifiers
138     dates from 1990 and is described in "Universal Resource Identifiers
139     in WWW" [RFC1630], and is designed to meet the recommendations laid
140     out in "Functional Recommendations for Internet Resource Locators"
141     [RFC1736] and "Functional Requirements for Uniform Resource Names"
142     [RFC1737].

144     This document obsoletes [RFC2396], which merged "Uniform Resource
145     Locators" [RFC1738] and "Relative Uniform Resource Locators"
146     [RFC1808] in order to define a single, generic syntax for all URIs.
147     It excludes those portions of RFC 1738 that defined the specific
148     syntax of individual URI schemes; those portions will be updated as
149     separate documents. The process for registration of new URI schemes
150     is defined separately by [RFC2717].

152     All significant changes from RFC 2396 are noted in Appendix D.

154  1.1 Overview of URIs

156     URIs are characterized as follows:

158     Uniform

160        Uniformity provides several benefits: it allows different types of
161        resource identifiers to be used in the same context, even when the
162        mechanisms used to access those resources may differ; it allows
163        uniform semantic interpretation of common syntactic conventions
164        across different types of resource identifiers; it allows
165        introduction of new types of resource identifiers without
166        interfering with the way that existing identifiers are used; and,
167        it allows the identifiers to be reused in many different contexts,
168        thus permitting new applications or protocols to leverage a
169        pre-existing, large, and widely-used set of resource identifiers.

171     Resource

173        Anything that can be named or described can be a resource.
174        Familiar examples include an electronic document, an image, a
175        service (e.g., "today's weather report for Los Angeles"), and a
176        collection of other resources. A resource is not necessarily
177        accessible via the Internet; e.g., human beings, corporations, and
178        bound books in a library can also be resources. Likewise, abstract
179        concepts can be resources, such as the operators and operands of a
180        mathematical equation or the types of a relationship (e.g.,
181        "parent" or "employee").

183     Identifier

185        An identifier embodies the information required to distinguish
186        what is being identified from all other things within its scope of
187        identification.

189     A URI is an identifier that consists of a sequence of characters
190     matching the syntax defined by the grammar rule named "URI" in
191     Section 3. A URI can be used to refer to a resource. This
192     specification does not place any limits on the nature of a resource
193     or the reasons why an application might wish to refer to a resource.
194     URIs have a global scope and should be interpreted consistently
195     regardless of context, but that interpretation may be defined in
196     relation to the user's context (e.g., "http://localhost/" refers to a
197     resource that is relative to the user's network interface and yet not
198     specific to any one user).

200  1.1.1 Generic Syntax

202     Each URI begins with a scheme name, as defined in Section 3.1, that
203     refers to a specification for assigning identifiers within that
204     scheme. As such, the URI syntax is a federated and extensible naming
205     system wherein each scheme's specification may further restrict the
206     syntax and semantics of identifiers using that scheme.

208     This specification defines those elements of the URI syntax that are
209     required of all URI schemes or are common to many URI schemes. It
210     thus defines the syntax and semantics that are needed to implement a
211     scheme-independent parsing mechanism for URI references, such that
212     the scheme-dependent handling of a URI can be postponed until the
213     scheme-dependent semantics are needed. Likewise, protocols and data
214     formats that make use of URI references can refer to this
215     specification as defining the range of syntax allowed for all URIs,
216     including those schemes that have yet to be defined.

218     A parser of the generic URI syntax is capable of parsing any URI
219     reference into its major components; once the scheme is determined,
220     further scheme-specific parsing can be performed on the components.
221     In other words, the URI generic syntax is a superset of the syntax of
222     all URI schemes.

224  1.1.2 Examples

226     The following examples illustrate URIs that are in common use.

228     ftp://ftp.is.co.za/rfc/rfc1808.txt
229        -- ftp scheme for File Transfer Protocol services

231     gopher://gopher.tc.umn.edu:70/11/Mailing%20Lists/
232        -- gopher scheme for Gopher and Gopher+ Protocol services

234     http://www.ietf.org/rfc/rfc2396.txt
235        -- http scheme for Hypertext Transfer Protocol services

237     mailto:John.Doe@example.com
238        -- mailto scheme for electronic mail addresses

240     news:comp.infosystems.www.servers.unix
241        -- news scheme for USENET news groups and articles

243     telnet://melvyl.ucop.edu/
244        -- telnet scheme for interactive TELNET services

246  1.1.3 URI, URL, and URN

248     A URI can be further classified as a locator, a name, or both. The
249     term "Uniform Resource Locator" (URL) refers to the subset of URIs
250     that, in addition to identifying a resource, provide a means of
251     locating the resource by describing its primary access mechanism
252     (e.g., its network "location"). The term "Uniform Resource Name"
253     (URN) refers to URIs under the "urn" scheme [RFC2141], which are
254     required to remain globally unique and persistent even when the
255     resource ceases to exist or becomes unavailable.

257     An individual scheme does not need to be classified as being just one
258     of "name" or "locator". Instances of URIs from any given scheme may
259     have the characteristics of names or locators or both, often
260     depending on the persistence and care in the assignment of
261     identifiers by the naming authority, rather than any quality of the
262     scheme.

264  1.2 Design Considerations

266  1.2.1 Transcription

268     The URI syntax has been designed with global transcription as one of
269     its main considerations. A URI is a sequence of characters from a
270     very limited set: the letters of the basic Latin alphabet, digits,
271     and a few special characters. A URI may be represented in a variety
272     of ways: e.g., ink on paper, pixels on a screen, or a sequence of
273     octets in a coded character set.
        The interpretation of a URI depends
274     only on the characters used and not how those characters are
275     represented in a network protocol.

277     The goal of transcription can be described by a simple scenario.
278     Imagine two colleagues, Sam and Kim, sitting in a pub at an
279     international conference and exchanging research ideas. Sam asks Kim
280     for a location to get more information, so Kim writes the URI for the
281     research site on a napkin. Upon returning home, Sam takes out the
282     napkin and types the URI into a computer, which then retrieves the
283     information to which Kim referred.

285     There are several design considerations revealed by the scenario:

287     o  A URI is a sequence of characters that is not always represented
288        as a sequence of octets.

290     o  A URI might be transcribed from a non-network source, and thus
291        should consist of characters that are most likely to be able to be
292        entered into a computer, within the constraints imposed by
293        keyboards (and related input devices) across languages and
294        locales.

296     o  A URI often needs to be remembered by people, and it is easier for
297        people to remember a URI when it consists of meaningful or
298        familiar components.

300     These design considerations are not always in alignment. For
301     example, it is often the case that the most meaningful name for a URI
302     component would require characters that cannot be typed into some
303     systems. The ability to transcribe a resource identifier from one
304     medium to another has been considered more important than having a
305     URI consist of the most meaningful of components. In local or
306     regional contexts and with improving technology, users might benefit
307     from being able to use a wider range of characters; such use is not
308     defined in this specification.

310  1.2.2 Separating Identification from Interaction

312     A common misunderstanding of URIs is that they are only used to refer
313     to accessible resources.
        In fact, the URI alone only provides
314     identification; access to the resource is neither guaranteed nor
315     implied by the presence of a URI. Instead, an operation (if any)
316     associated with a URI reference is defined by the protocol element,
317     data format attribute, or natural language text in which it appears.

319     Given a URI, a system may attempt to perform a variety of operations
320     on the resource, as might be characterized by such words as "denote",
321     "access", "update", "replace", or "find attributes". Such operations
322     are defined by the protocols that make use of URIs, not by this
323     specification. However, we do use a few general terms for describing
324     common operations on URIs. URI "resolution" is the process of
325     determining an access mechanism and the appropriate parameters
326     necessary to dereference a URI; such resolution may require several
327     iterations. Use of that access mechanism to perform an action on the
328     URI's resource is termed a "dereference" of the URI.

330     When URIs are used within information systems to identify sources of
331     information, the most common form of URI dereference is "retrieval":
332     making use of a URI in order to retrieve a representation of its
333     associated resource. A "representation" is a sequence of octets,
334     along with metadata describing those octets, that constitutes a
335     record of the state of the resource at the time that the
336     representation is generated. Retrieval is achieved by a process that
337     might include using the URI as a cache key to check for a locally
338     cached representation, resolution of the URI to determine an
339     appropriate access mechanism (if any), and dereference of the URI for
340     the sake of applying a retrieval operation.

342     URI references in information systems are designed to be
343     late-binding: the result of an access is generally determined at the
344     time it is accessed and may vary over time or due to other aspects of
345     the interaction.
        When an author creates a reference to such a
346     resource, they do so with the intention that the reference be used in
347     the future; what is being identified is not some specific result that
348     was obtained in the past, but rather some characteristic that is
349     expected to be true for future results. In such cases, the resource
350     referred to by the URI is actually a sameness of characteristics as
351     observed over time, perhaps elucidated by additional comments or
352     assertions made by the resource provider.

354     Although many URI schemes are named after protocols, this does not
355     imply that use of such a URI will result in access to the resource
356     via the named protocol. URIs are often used simply for the sake of
357     identification. Even when a URI is used to retrieve a representation
358     of a resource, that access might be through gateways, proxies,
359     caches, and name resolution services that are independent of the
360     protocol associated with the scheme name, and the resolution of some
361     URIs may require the use of more than one protocol (e.g., both DNS
362     and HTTP are typically used to access an "http" URI's origin server
363     when a representation isn't found in a local cache).

365  1.2.3 Hierarchical Identifiers

366     The URI syntax is organized hierarchically, with components listed in
367     order of decreasing significance from left to right. For some URI
        schemes, the
368     visible hierarchy is limited to the scheme itself: everything after
369     the scheme component delimiter is considered opaque to URI
370     processing. Other URI schemes make the hierarchy explicit and visible
371     to generic parsing algorithms.

373     The URI syntax reserves the slash ("/"), question-mark ("?"), and
374     number-sign ("#") characters for the purpose of delimiting components
375     that are significant to the generic parser's hierarchical
376     interpretation of an identifier.
        In addition to aiding the
377     readability of such identifiers through the consistent use of
378     familiar syntax, this uniform representation of hierarchy across
379     naming schemes allows scheme-independent references to be made
380     relative to that hierarchy.

382     It is often the case that a group or "tree" of documents has been
383     constructed to serve a common purpose; the vast majority of URIs in
384     these documents point to resources within the tree rather than
385     outside of it. Similarly, documents located at a particular site are
386     much more likely to refer to other resources at that site than to
387     resources at remote sites.

389     Relative referencing of URIs allows document trees to be partially
390     independent of their location and access scheme. For instance, it is
391     possible for a single set of hypertext documents to be simultaneously
392     accessible and traversable via each of the "file", "http", and "ftp"
393     schemes if the documents refer to each other using relative
394     references. Furthermore, such document trees can be moved, as a
395     whole, without changing any of the relative references.

397     A relative URI reference (Section 4.2) refers to a resource by
398     describing the difference within a hierarchical name space between
399     the current context and the target URI. The reference resolution
400     algorithm, presented in Section 5, defines how such references are
401     resolved.

403  1.3 Syntax Notation

405     This specification uses the Augmented Backus-Naur Form (ABNF)
406     notation of [RFC2234] to define the URI syntax. Although the ABNF
407     defines syntax in terms of the US-ASCII character encoding [ASCII],
408     the URI syntax should be interpreted in terms of the character that
409     the ASCII-encoded octet represents, rather than the octet encoding
410     itself.
        How a URI is represented in terms of bits and bytes on the
411     wire is dependent upon the character encoding of the protocol used to
412     transport it, or the charset of the document that contains it.

414     The following core ABNF productions are used by this specification as
415     defined by Section 6.1 of [RFC2234]: ALPHA, CR, CTL, DIGIT, DQUOTE,
416     HEXDIG, LF, OCTET, and SP. The complete URI syntax is collected in
417     Appendix A.

419  2. Characters

421     A URI consists of a restricted set of characters, primarily chosen
422     to aid transcription and usability both in computer systems and in
423     non-computer communications. Characters used conventionally as
424     delimiters around a URI are excluded. The set of URI characters
425     consists of digits, letters, and a few graphic symbols chosen from
426     those common to most of the character encodings and input facilities
427     available to Internet users.

429        uric = reserved / unreserved / escaped

431     Within a URI, reserved characters are used to delimit syntax
432     components, unreserved characters are used to describe registered
433     names, and unreserved, non-delimiting reserved, and escaped
434     characters are used to represent strings of data (1*OCTET) within the
435     components.

437  2.1 Encoding of Characters

439     As described above (Section 1.3), the URI syntax is defined in terms
440     of characters by reference to the US-ASCII encoding of characters to
441     octets. This specification does not mandate the use of any
442     particular mapping between its character set and the octets used to
443     store or transmit those characters.

445     URI characters representing strings of data within a component may,
446     if allowed by the component production, represent an arbitrary
447     sequence of octets. For example, portions of a given URI might
448     correspond to a filename on a non-ASCII file system, a query on
449     non-ASCII data, numeric coordinates on a map, etc.
        Some URI schemes
450     define a specific encoding of raw data to US-ASCII characters as part
451     of their scheme-specific requirements. Most URI schemes represent
452     data octets by the US-ASCII character corresponding to that octet,
453     either directly in the form of the character's glyph or by use of an
454     escape triplet (Section 2.4).

456     When a URI scheme defines a component that represents textual data
457     consisting of characters from the Unicode (ISO 10646) character set,
458     we recommend that the data be encoded first as octets according to
459     the UTF-8 [UTF-8] character encoding, and that only those octets
460     that are not in the unreserved character set then be escaped.

462  2.2 Reserved Characters

464     URIs include components and sub-components that are delimited by
465     certain special characters. These characters are called "reserved",
466     since their usage within a URI component is limited to their reserved
467     purpose within that component. If data for a URI component would
468     conflict with the reserved purpose, then the conflicting data must be
469     escaped (Section 2.4) before forming the URI.

471        reserved = "/" / "?" / "#" / "[" / "]" / ";" /
472                   ":" / "@" / "&" / "=" / "+" / "$" / ","

474     Reserved characters are used as delimiters of the generic URI
475     components described in Section 3, as well as within those components
476     for delimiting sub-components. A component's ABNF syntax rule will
477     not use the "reserved" production directly; instead, each rule lists
478     those reserved characters that are allowed within that component.
479     Allowed reserved characters that are not assigned a sub-component
480     delimiter role by this specification should be considered reserved
481     for special use by whatever software generates the URI (i.e., they
482     may be used to delimit or indicate information that is significant to
483     interpretation of the identifier, but that significance is outside
484     the scope of this specification).
        Outside of the URI's origin, a
485     reserved character cannot be escaped without fear of changing how it
486     will be interpreted; likewise, an escaped octet that corresponds to a
487     reserved character cannot be unescaped outside the software that is
488     responsible for interpreting it during URI resolution.

490     The slash ("/"), question-mark ("?"), and number-sign ("#")
491     characters are reserved in all URIs for the purpose of delimiting
492     components that are significant to the generic parser's hierarchical
493     interpretation of an identifier. The hierarchical prefix of a URI,
494     wherein the slash ("/") character signifies a hierarchy delimiter,
495     extends from the scheme (Section 3.1) through to the first
496     question-mark ("?"), number-sign ("#"), or the end of the URI string.
497     In other words, the slash ("/") character is not treated as a
498     hierarchical separator within the query (Section 3.4) and fragment
499     (Section 3.5) components of a URI, but is still considered reserved
500     within those components for purposes outside the scope of this
501     specification.

503  2.3 Unreserved Characters

505     Characters that are allowed in a URI but do not have a reserved
506     purpose are called unreserved. These include uppercase and lowercase
507     letters, decimal digits, and a limited set of punctuation marks and
508     symbols.

510        unreserved = ALPHA / DIGIT / mark

512        mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")"

514     Escaping unreserved characters in a URI does not change what resource
515     is identified by that URI. However, it may change the result of a
516     URI comparison (Section 6), potentially leading to less efficient
517     actions by an application. Therefore, unreserved characters should
518     not be escaped unless the URI is being used in a context that does
519     not allow the unescaped character to appear.
        URI normalization
520     processes may unescape sequences in the ranges of ALPHA (%41-%5A and
521     %61-%7A), DIGIT (%30-%39), hyphen (%2D), underscore (%5F), or tilde
522     (%7E) without fear of creating a conflict, but unescaping the other
523     mark characters is usually counterproductive.

525  2.4 Escaped Characters

527     Data must be escaped if it does not have a representation using an
528     unreserved character; this includes data that does not correspond to
529     a printable character of the US-ASCII coded character set or
530     corresponds to a US-ASCII character that delimits the component from
531     others, is reserved in that component for delimiting sub-components,
532     or is excluded from any use within a URI (Section 2.5).

534  2.4.1 Escaped Encoding

536     An escaped octet is encoded as a character triplet, consisting of
537     the percent character "%" followed by the two hexadecimal digits
538     representing that octet's numeric value. For example, "%20" is the
539     escaped encoding for the binary octet "00100000" (ABNF: %x20), which
540     corresponds to the US-ASCII space character (SP). This is sometimes
541     referred to as "percent-encoding" the octet.

543        escaped = "%" HEXDIG HEXDIG

545     The uppercase hexadecimal digits 'A' through 'F' are equivalent to
546     the lowercase digits 'a' through 'f', respectively. Two URIs that
547     differ only in the case of hexadecimal digits used in escaped octets
548     are equivalent. For consistency, we recommend that uppercase digits
549     be used by URI generators and normalizers.

551  2.4.2 When to Escape and Unescape

553     Under normal circumstances, the only time that characters within a
554     URI string are escaped is during the process of generating the URI
555     from its component parts. Each component may have its own set of
556     characters that are reserved, so only the mechanism responsible for
557     generating or interpreting that component can determine whether or
558     not escaping a character will change its semantics.
The exception is 559 when a URI is being used within a context where the unreserved "mark" 560 characters might need to be escaped, such as when used for a 561 command-line argument or within a single-quoted attribute. 563 Once generated, a URI is always in an escaped form. When a URI is 564 resolved, the components significant to that scheme-specific 565 resolution process (if any) must be parsed and separated before the 566 escaped characters within those components can be safely unescaped. 568 In some cases, data that could be represented by an unreserved 569 character may appear escaped; for example, some of the unreserved 570 "mark" characters are automatically escaped by some systems. A URI 571 normalizer may unescape escaped octets that are represented by 572 characters in the unreserved set. For example, "%7E" is sometimes 573 used instead of tilde ("~") in an "http" URI path and can be 574 converted to "~" without changing the interpretation of the URI. 576 In all cases, a URI character is equivalent to its corresponding 577 ASCII-encoded octet, even when that octet is represented as a 578 percent-escape. URI characters are provided as an external ASCII 579 interface for identification between systems. A system that 580 internally provides identifiers in the form of a different character 581 encoding, such as EBCDIC, will generally perform character 582 translation of textual identifiers to UTF-8 at some internal 583 interface, thus providing meaningful identifiers in ASCII even though 584 the back-end identifiers are in a different encoding. Escaped octets 585 must be unescaped before such a transcoding is applied. Although 586 this specification does not define the character encoding of escaped 587 octets outside the ASCII range, the general principle of unescaping 588 before transcoding should be applied for all character encodings. 
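As a rough illustration of the normalization permitted by Sections 2.3 and 2.4.1, the following Python sketch uppercases the hexadecimal digits of every escaped octet and unescapes only those octets that correspond to ALPHA, DIGIT, hyphen, underscore, or tilde. The function name normalize_escapes is hypothetical and is not defined by this specification; the sketch assumes its input is a single, already well-formed URI component.

```python
import re
import string

# Characters that Section 2.3 says may be unescaped without creating a
# conflict: ALPHA, DIGIT, hyphen, underscore, and tilde.
_SAFE_TO_UNESCAPE = set(string.ascii_letters + string.digits + "-_~")

# One escaped octet: "%" HEXDIG HEXDIG (Section 2.4.1).
_ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(component: str) -> str:
    """Normalize the percent-escapes in one URI component (illustrative only)."""

    def _fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in _SAFE_TO_UNESCAPE:
            # Safe to represent directly as the unreserved character.
            return char
        # Otherwise keep the escape, but use uppercase hex digits.
        return "%" + match.group(1).upper()

    # A single left-to-right pass, so no octet is unescaped twice.
    return _ESCAPED.sub(_fix, component)
```

Because the substitution makes only one pass over the string, an input such as "%2541" is left as "%2541" rather than being unescaped a second time into "A".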
590 Because the percent ("%") character serves as the escape indicator, 591 it must be escaped as "%25" in order for that octet to be used as 592 data within a URI. Implementers should be careful not to escape or 593 unescape the same string more than once, since unescaping an already 594 unescaped string might lead to misinterpreting a percent data 595 character as another escaped character, or vice versa in the case of 596 escaping an already escaped string. 598 2.5 Excluded Characters 600 Although they are disallowed within the URI syntax, we include here 601 a description of those characters that have been excluded and the 602 reasons for their exclusion. 604 excluded = invisible / delims / unwise 606 The control characters (CTL) in the US-ASCII coded character set are 607 not used within a URI, both because they are non-printable and 608 because they are likely to be misinterpreted by some control 609 mechanisms. The space character (SP) is excluded because significant 610 spaces may disappear and insignificant spaces may be introduced when 611 a URI is transcribed, typeset, or subjected to the treatment of 612 word-processing programs. Whitespace is also used to delimit a URI 613 in many contexts. Characters outside the US-ASCII set are excluded as 614 well. 616 invisible = CTL / SP / %x80-FF 618 The angle-bracket ("<" and ">") and double-quote (") characters are 619 excluded because they are often used as the delimiters around a URI 620 in text documents and protocol fields. The percent character ("%") 621 is excluded because it is used for the encoding of escaped (Section 622 2.4) characters. 624 delims = "<" / ">" / "%" / DQUOTE 626 Other characters are excluded because gateways and other transport 627 agents are known to sometimes modify such characters. 629 unwise = "{" / "}" / "|" / "\" / "^" / "`" 631 Data octets corresponding to excluded characters must be escaped in 632 order to be represented within a URI. 634 3. 
Syntax Components 636 The generic URI syntax consists of a hierarchical sequence of 637 components referred to as the scheme, authority, path, query, and 638 fragment. 640 URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 642 hier-part = net-path / abs-path / rel-path 644 net-path = "//" authority [ abs-path ] 645 abs-path = "/" path-segments 646 rel-path = path-segments 648 The scheme and path components are required, though path may be empty 649 (no characters). An ABNF-driven parser of hier-part will find that 650 the three productions in the rule are ambiguous: they are 651 disambiguated by the "first-match-wins" (a.k.a. "greedy") algorithm. 652 In other words, if the string begins with two slash characters ("//"), 653 then it is a net-path; if it begins with only one slash 654 character, then it is an abs-path; otherwise, it is a rel-path. Note 655 that rel-path does not necessarily contain any slash ("/") 656 characters; a non-hierarchical path will be treated as opaque data by 657 a generic URI parser. 659 The authority component is only present when a string matches the 660 net-path production. Since the presence of an authority component 661 restricts the remaining syntax for path, we have not included a 662 specific "path" rule in the syntax. Instead, what we refer to as the 663 URI path is that part of the parsed URI string matching the abs-path 664 or rel-path production in the syntax above, since they are mutually 665 exclusive for any given URI and can be parsed as a single component. 667 The following are two example URIs and their component parts: 669 foo://example.com:8042/over/there?name=ferret#nose 670 \_/ \______________/\_________/ \_________/ \__/ 671 | | | | | 672 scheme authority path query fragment 673 | _____________________|__ 674 / \ / \ 675 urn:example:animal:ferret:nose 677 3.1 Scheme 679 Each URI begins with a scheme name that refers to a specification for 680 assigning identifiers within that scheme. 
As such, the URI syntax is 681 a federated and extensible naming system wherein each scheme's 682 specification may further restrict the syntax and semantics of 683 identifiers using that scheme. 685 Scheme names consist of a sequence of characters beginning with a 686 letter and followed by any combination of letters, digits, plus 687 ("+"), period ("."), or hyphen ("-"). Although scheme is 688 case-insensitive, the canonical form is lowercase and documents that 689 specify schemes must do so using lowercase letters. An 690 implementation should accept uppercase letters as equivalent to 691 lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for 692 the sake of robustness, but should only generate lowercase scheme 693 names, for consistency. 695 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 697 Individual schemes are not specified by this document. The process 698 for registration of new URI schemes is defined separately by 699 [RFC2717]. The scheme registry maintains the mapping between scheme 700 names and their specifications. 702 3.2 Authority 704 Many URI schemes include a hierarchical element for a naming 705 authority, such that governance of the name space defined by the 706 remainder of the URI is delegated to that authority (which may, in 707 turn, delegate it further). The generic syntax provides a common 708 means for distinguishing an authority based on a registered domain 709 name or server address, along with optional port and user 710 information. 712 The authority component is preceded by a double slash ("//") and is 713 terminated by the next slash ("/"), question-mark ("?"), or 714 number-sign ("#") character, or by the end of the URI. 716 authority = [ userinfo "@" ] host [ ":" port ] 718 The parts "@" and ":" may be omitted. 720 Some schemes do not allow the userinfo and/or port sub-components. 
721 When presented with a URI that violates one or more scheme-specific 722 restrictions, the scheme-specific URI resolution process should flag 723 the reference as an error rather than ignore the unused parts; doing 724 so reduces the number of equivalent URIs and helps detect abuses of 725 the generic syntax that might indicate the URI has been constructed 726 to mislead the user (Section 7.5). 728 3.2.1 User Information 730 The userinfo sub-component may consist of a user name and, 731 optionally, scheme-specific information about how to gain 732 authorization to access the server. The user information, if 733 present, is followed by a commercial at-sign ("@") that delimits it 734 from the host. 736 userinfo = *( unreserved / escaped / ";" / 737 ":" / "&" / "=" / "+" / "$" / "," ) 739 Some URI schemes use the format "user:password" in the userinfo 740 field. This practice is NOT RECOMMENDED, because the passing of 741 authentication information in clear text has proven to be a security 742 risk in almost every case where it has been used. Note also that 743 userinfo might be crafted to look like a trusted domain name in order 744 to mislead users, as described in Section 7.5. 746 3.2.2 Host 748 The host sub-component of authority is identified by an IPv6 literal 749 encapsulated within square brackets, an IPv4 address in 750 dotted-decimal form, or a domain name. 752 host = [ IPv6reference / IPv4address / hostname ] 754 If host is omitted, a default may be defined by the scheme-specific 755 semantics of the URI. For example, the "file" URI scheme defaults to 756 "localhost", whereas the "http" URI scheme does not allow host to be 757 omitted. 759 The production for host is ambiguous because it does not completely 760 distinguish between an IPv4address and a hostname. Again, the 761 "first-match-wins" algorithm applies: If host matches the production 762 for IPv4address, then it should be considered an IPv4 address literal 763 and not a hostname. 
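The "first-match-wins" rule for host can be sketched in Python as follows. This is a minimal illustration, not a normative parser: the name classify_host and its return strings are hypothetical, the regular expressions are transcribed from the dec-octet, IPv4address, and hostname productions given later in this section, and IPv6 references and the empty host are deliberately not handled.

```python
import re

# dec-octet: 0-9 / 10-99 / 100-199 / 200-249 / 250-255
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
_IPV4 = re.compile(rf"{_DEC_OCTET}(?:\.{_DEC_OCTET}){{3}}")

# domainlabel: alphanum [ 0*61( alphanum / "-" ) alphanum ]
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# hostname: domainlabel *( "." domainlabel ) [ "." ]
_HOSTNAME = re.compile(rf"{_LABEL}(?:\.{_LABEL})*\.?")


def classify_host(host: str) -> str:
    """Apply first-match-wins: try IPv4address before hostname."""
    if _IPV4.fullmatch(host):
        return "IPv4address"
    if _HOSTNAME.fullmatch(host):
        return "hostname"
    return "no match"
```

Under this sketch, "192.0.2.1" is classified as an IPv4address, whereas "192.0.2.256" fails the dec-octet range check and therefore falls through to the hostname production, which it matches syntactically.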
765 A hostname takes the form described in Section 3 of [RFC1034] and 766 Section 2.1 of [RFC1123]: a sequence of domain labels separated by 767 ".", each domain label starting and ending with an alphanumeric 768 character and possibly also containing "-" characters. The rightmost 769 domain label of a fully qualified domain name may be followed by a 770 single "." if it is necessary to distinguish between the complete 771 domain name and some local domain. 773 hostname = domainlabel qualified 774 qualified = *( "." domainlabel ) [ "." ] 775 domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] 776 alphanum = ALPHA / DIGIT 778 A host identified by an IPv4 literal address is represented in 779 dotted-decimal notation (a sequence of four decimal numbers in the 780 range 0 to 255, separated by "."), as described in [RFC1123] by 781 reference to [RFC0952]. Note that other forms of dotted notation may 782 be interpreted on some platforms, as described in Section 7.3, but 783 only the dotted-decimal form of four octets is allowed by this 784 grammar. 786 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 788 dec-octet = DIGIT ; 0-9 789 / %x31-39 DIGIT ; 10-99 790 / "1" 2DIGIT ; 100-199 791 / "2" %x30-34 DIGIT ; 200-249 792 / "25" %x30-35 ; 250-255 794 A host identified by an IPv6 literal address [RFC3513] is 795 distinguished by enclosing the IPv6 literal within square-brackets 796 ("[" and "]"). This is the only place where square-bracket 797 characters are allowed in the URI syntax. 
799 IPv6reference = "[" IPv6address "]" 801 IPv6address = 6( h4 ":" ) ls32 802 / "::" 5( h4 ":" ) ls32 803 / [ h4 ] "::" 4( h4 ":" ) ls32 804 / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 805 / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 806 / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 807 / [ *4( h4 ":" ) h4 ] "::" ls32 808 / [ *5( h4 ":" ) h4 ] "::" h4 809 / [ *6( h4 ":" ) h4 ] "::" 811 ls32 = ( h4 ":" h4 ) / IPv4address 812 ; least-significant 32 bits of address 814 h4 = 1*4HEXDIG 816 The presence of host within a URI does not imply that the scheme 817 requires access to the given host on the Internet. In many cases, 818 the host syntax is used only for the sake of reusing the existing 819 registration process created and deployed for DNS, thus obtaining a 820 globally unique name without the cost of deploying another registry. 821 However, such use comes with its own costs: domain name ownership may 822 change over time for reasons not anticipated by the URI creator. 824 3.2.3 Port 826 The port sub-component of authority is designated by an optional 827 port number in decimal following the host and delimited from it by a 828 single colon (":") character. 830 port = *DIGIT 832 If port is omitted, a default may be defined by the scheme-specific 833 semantics of the URI. Likewise, the type of network port designated 834 by the port number (e.g., TCP, UDP, SCTP, etc.) is defined by the URI 835 scheme. For example, the "http" URI scheme defines a default of TCP 836 port 80. 838 3.3 Path 840 The path component contains hierarchical data that, along with data 841 in the optional query (Section 3.4) component, serves to identify a 842 resource within the scope of that URI's scheme and naming authority 843 (if any). There is no specific "path" syntax production in the 844 generic URI syntax. 
Instead, what we refer to as the URI path is 845 that part of the parsed URI string matching either the abs-path or 846 the rel-path production, since they are mutually exclusive for any 847 given URI and can be parsed as a single component. The path is 848 terminated by the first question-mark ("?") or number-sign ("#") 849 character, or by the end of the URI. 851 path-segments = segment *( "/" segment ) 852 segment = *pchar 854 pchar = unreserved / escaped / ";" / 855 ":" / "@" / "&" / "=" / "+" / "$" / "," 857 The path consists of a sequence of path segments separated by a slash 858 ("/") character. A path is always defined for a URI, though the 859 defined path may be empty (zero length) or opaque (not containing any 860 "/" delimiters). For example, the URI <mailto:fred@example.com> has 861 a path of "fred@example.com". 863 The path segments "." and ".." are defined for relative reference 864 within the path name hierarchy. They are intended for use at the 865 beginning of a relative path reference (Section 4.2) for indicating 866 relative position within the hierarchical tree of names, with a 867 similar effect to how they are used within some operating systems' 868 file directory structure to indicate the current directory and parent 869 directory, respectively. Unlike a file system, however, these 870 dot-segments are only interpreted within the URI path hierarchy and 871 are removed as part of the URI normalization or resolution process, 872 as described in Section 5.2. 874 Aside from dot-segments in hierarchical paths, a path segment is 875 considered opaque by the generic syntax. URI generating applications 876 often use the reserved characters allowed in segment for the purpose 877 of delimiting scheme-specific or generator-specific sub-components. 878 For example, the semicolon (";") and equals ("=") reserved characters 879 are often used for delimiting parameters and parameter values 880 applicable to that segment. 
The comma (",") reserved character is 881 often used for similar purposes. For example, one URI generator 882 might use a segment like "name;v=1.1" to indicate a reference to 883 version 1.1 of "name", whereas another might use a segment like 884 "name,1.1" to indicate the same. Parameter types may be defined by 885 scheme-specific semantics, but in most cases the meaning of a 886 parameter is specific to the URI originator. 888 3.4 Query 890 The query component contains non-hierarchical data that, along with 891 data in the path (Section 3.3) component, serves to identify a 892 resource within the scope of that URI's scheme and naming authority 893 (if any). The query component is indicated by the first question-mark 894 ("?") character and terminated by a number-sign ("#") character or by 895 the end of the URI. 897 query = *( pchar / "/" / "?" ) 899 The characters slash ("/") and question-mark ("?") are allowed to 900 represent data within the query component, but such use is 901 discouraged; incorrect implementations of reference resolution often 902 fail to distinguish them from hierarchical separators, thus resulting 903 in non-interoperable results while parsing relative references. 904 However, since query components are often used to carry identifying 905 information in the form of "key=value" pairs, and one frequently used 906 value is a reference to another URI, it is sometimes better for 907 usability to include those characters unescaped. 909 Note: Some client applications will fail to separate a reference's 910 query component from its path component before merging the base 911 and reference paths (Section 5.2). This may result in loss of 912 information if the query component contains the strings "/../" or 913 "/./". 915 3.5 Fragment 917 The fragment identifier component allows indirect identification of a 918 secondary resource by reference to a primary resource and additional 919 identifying information that is selective within that resource. 
The 920 identified secondary resource may be some portion or subset of the 921 primary resource, some view on representations of the primary 922 resource, or some other resource that is merely named within the 923 primary resource. A fragment identifier component is indicated by 924 the presence of a number-sign ("#") character and terminated by the 925 end of the URI string. 927 fragment = *( pchar / "/" / "?" ) 929 The semantics of a fragment identifier are defined by the set of 930 representations that might result from a retrieval action on the 931 primary resource. The fragment's format and resolution is therefore 932 dependent on the media type [RFC2046] of the retrieved 933 representation, even though such a retrieval is only performed if the 934 URI is dereferenced. Individual media types may define their own 935 restrictions on, or structure within, the fragment identifier syntax 936 for specifying different types of subsets, views, or external 937 references that are identifiable as secondary resources by that media 938 type. If the primary resource is represented by multiple media 939 types, as is often the case for resources whose representation is 940 selected based on attributes of the retrieval request, then 941 interpretation of the fragment identifier must be consistent across 942 all of those media types in order for it to be viable as an 943 identifier. 945 As with any URI, use of a fragment identifier component does not 946 imply that a retrieval action will take place. A URI with a fragment 947 identifier may be used to refer to the secondary resource without any 948 implication that the primary resource is accessible. However, if 949 that URI is used in a context that does call for retrieval and is not 950 a same-document reference (Section 4.4), the fragment identifier is 951 only valid as a reference if a retrieval action on the primary 952 resource succeeds and results in a representation for which the 953 fragment identifier is meaningful. 
955 Fragment identifiers have a special role in information systems as 956 the primary form of client-side indirect referencing, allowing an 957 author to specifically identify those aspects of an existing resource 958 that are only indirectly provided by the resource owner. As such, 959 interpretation of the fragment identifier during a retrieval action 960 is performed solely by the user agent; the fragment identifier is not 961 passed to other systems during the process of retrieval. Although 962 this is often perceived to be a loss of information, particularly in 963 regards to accurate redirection of references as content moves over 964 time, it also serves to prevent information providers from denying 965 reference authors the right to selectively refer to information 966 within a resource. 968 The characters slash ("/") and question-mark ("?") are allowed to 969 represent data within the fragment identifier, but such use is 970 discouraged for the same reasons as described above for query. 972 4. Usage 974 When applications make reference to a URI, they do not always use the 975 full form of reference defined by the "URI" syntax production. In 976 order to save space and take advantage of hierarchical locality, many 977 Internet protocol elements and media type formats allow an 978 abbreviation of a URI, while others restrict the syntax to a 979 particular form of URI. We define the most common forms of reference 980 syntax in this specification because they impact and depend upon the 981 design of the generic syntax, requiring a uniform parsing algorithm 982 in order to be interpreted consistently. 984 4.1 URI Reference 986 The ABNF rule URI-reference is used to denote the most common usage 987 of a resource identifier. 989 URI-reference = URI / relative-URI 991 A URI-reference may be relative: if the reference string's prefix 992 matches the syntax of a scheme followed by its colon separator, then 993 the reference is a URI rather than a relative-URI. 
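The test described above can be sketched in a few lines of Python. The function name is_relative_reference is hypothetical; the pattern is transcribed from the scheme production of Section 3.1, and the sketch only distinguishes the two forms without validating the rest of the reference.

```python
import re

# scheme ":" as a prefix: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":"
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def is_relative_reference(reference: str) -> bool:
    """Return True if the reference does not begin with a scheme and colon."""
    return _SCHEME_PREFIX.match(reference) is None
```

With this check, "this:that" is treated as a URI with the scheme "this", while "./this:that" is a relative reference, because "." followed by "/" cannot match the scheme production.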
995 A URI-reference is typically parsed first into the five URI 996 components, in order to determine what components are present and 997 whether or not the reference is relative, and then each component is 998 parsed for its subparts and their validation. The ABNF of 999 URI-reference, along with the "first-match-wins" disambiguation rule, 1000 is sufficient to define a validating parser for the generic syntax. 1001 Readers familiar with regular expressions should see Appendix B for 1002 an example of a non-validating URI-reference parser that will take 1003 any given string and extract the URI components. 1005 4.2 Relative URI 1007 A relative URI reference takes advantage of the hier-part syntax 1008 (Section 3) in order to express a reference that is relative to the 1009 name space of another hierarchical URI. 1011 relative-URI = hier-part [ "?" query ] [ "#" fragment ] 1013 The URI referred to by a relative reference is obtained by applying 1014 the reference resolution algorithm of Section 5. 1016 A relative reference that begins with two slash characters is termed 1017 a network-path reference; such references are rarely used. A relative 1018 reference that begins with a single slash character is termed an 1019 absolute-path reference. A relative reference that does not begin 1020 with a slash character is termed a relative-path reference. 1022 A path segment that contains a colon character (e.g., "this:that") 1023 cannot be used as the first segment of a relative-path reference 1024 because it would be mistaken for a scheme name. Such a segment must 1025 be preceded by a dot-segment (e.g., "./this:that") to make a 1026 relative-path reference. 1028 4.3 Absolute URI 1030 Some protocol elements allow only the absolute form of a URI without 1031 a fragment identifier. For example, defining the base URI for later 1032 use by relative references calls for an absolute-URI production that 1033 does not allow a fragment. 
1035 absolute-URI = scheme ":" hier-part [ "?" query ] 1037 4.4 Same-document Reference 1039 When a URI reference occurring within a document or message refers to 1040 a URI that is, aside from its fragment component (if any), identical 1041 to the base URI (Section 5.1), that reference is called a 1042 "same-document" reference. The most frequent examples of 1043 same-document references are relative references that are empty or 1044 include only the number-sign ("#") separator followed by a fragment 1045 identifier. 1047 When a same-document reference is dereferenced for the purpose of a 1048 retrieval action, the target of that reference is defined to be 1049 within that current document or message; the dereference should not 1050 result in a new retrieval. 1052 4.5 Suffix Reference 1054 The URI syntax is designed for unambiguous reference to resources and 1055 extensibility via the URI scheme. However, as URI identification and 1056 usage have become commonplace, traditional media (television, radio, 1057 newspapers, billboards, etc.) have increasingly used a suffix of the 1058 URI as a reference, consisting of only the authority and path 1059 portions of the URI, such as 1061 www.w3.org/Addressing/ 1063 or simply the DNS hostname on its own. Such references are primarily 1064 intended for human interpretation rather than machine, with the 1065 assumption that context-based heuristics are sufficient to complete 1066 the URI (e.g., most hostnames beginning with "www" are likely to have 1067 a URI prefix of "http://"). Although there is no standard set of 1068 heuristics for disambiguating a URI suffix, many client 1069 implementations allow them to be entered by the user and 1070 heuristically resolved. It should be noted that such heuristics may 1071 change over time, particularly when new URI schemes are introduced. 
1073 Since a URI suffix has the same syntax as a relative path reference, 1074 a suffix reference cannot be used in contexts where a relative 1075 reference is expected. As a result, suffix references are limited to 1076 those places where there is no defined base URI, such as dialog boxes 1077 and off-line advertisements. 1079 5. Reference Resolution 1081 This section defines the process of resolving a URI reference within 1082 a context that allows relative references, such that the result is a 1083 string matching the "URI" syntax production of Section 3. 1085 5.1 Establishing a Base URI 1087 The term "relative" implies that there exists some "base URI" against 1088 which the relative reference is applied. Aside from same-document 1089 references (Section 4.4), relative references are only usable if the 1090 base URI is known. The base URI must be established by the parser 1091 prior to parsing URI references that might be relative. 1093 The base URI of a document can be established in one of four ways, 1094 listed below in order of precedence. The order of precedence can be 1095 thought of in terms of layers, where the innermost defined base URI 1096 has the highest precedence. This can be visualized graphically as: 1098 .----------------------------------------------------------. 1099 | .----------------------------------------------------. | 1100 | | .----------------------------------------------. | | 1101 | | | .----------------------------------------. | | | 1102 | | | | .----------------------------------. | | | | 1103 | | | | | | | | | | 1104 | | | | `----------------------------------' | | | | 1105 | | | | (5.1.1) Base URI embedded in the | | | | 1106 | | | | document's content | | | | 1107 | | | `----------------------------------------' | | | 1108 | | | (5.1.2) Base URI of the encapsulating entity | | | 1109 | | | (message, document, or none). 
| | | 1110 | | `----------------------------------------------' | | 1111 | | (5.1.3) URI used to retrieve the entity | | 1112 | `----------------------------------------------------' | 1113 | (5.1.4) Default Base URI is application-dependent | 1114 `----------------------------------------------------------' 1116 5.1.1 Base URI within Document Content 1118 Within certain document media types, the base URI of the document can 1119 be embedded within the content itself such that it can be readily 1120 obtained by a parser. This can be useful for descriptive documents, 1121 such as tables of content, which may be transmitted to others through 1122 protocols other than their usual retrieval context (e.g., E-Mail or 1123 USENET news). 1125 It is beyond the scope of this document to specify how, for each 1126 media type, the base URI can be embedded. It is assumed that user 1127 agents manipulating such media types will be able to obtain the 1128 appropriate syntax from that media type's specification. 1130 A mechanism for embedding the base URI within MIME container types 1131 (e.g., the message and multipart types) is defined by MHTML 1132 [RFC2110]. Protocols that do not use the MIME message header syntax, 1133 but do allow some form of tagged metadata to be included within 1134 messages, may define their own syntax for defining the base URI as 1135 part of a message. 1137 5.1.2 Base URI from the Encapsulating Entity 1139 If no base URI is embedded, the base URI of a document is defined by 1140 the document's retrieval context. For a document that is enclosed 1141 within another entity (such as a message or another document), the 1142 retrieval context is that entity; thus, the default base URI of the 1143 document is the base URI of the entity in which the document is 1144 encapsulated. 
1146 5.1.3 Base URI from the Retrieval URI 1148 If no base URI is embedded and the document is not encapsulated 1149 within some other entity (e.g., the top level of a composite entity), 1150 then, if a URI was used to retrieve the base document, that URI shall 1151 be considered the base URI. Note that if the retrieval was the 1152 result of a redirected request, the last URI used (i.e., that which 1153 resulted in the actual retrieval of the document) is the base URI. 1155 5.1.4 Default Base URI 1157 If none of the conditions described above apply, then the base URI 1158 is defined by the context of the application. Since this definition 1159 is necessarily application-dependent, failing to define the base URI 1160 using one of the other methods may result in the same content being 1161 interpreted differently by different types of application. 1163 It is the responsibility of the distributor(s) of a document 1164 containing a relative reference to ensure that the base URI for that 1165 document can be established. It must be emphasized that a relative 1166 reference, aside from a same-document reference, cannot be used 1167 reliably in situations where the document's base URI is not 1168 well-defined. 1170 5.2 Obtaining the Referenced URI 1172 This section describes an example algorithm for resolving URI 1173 references that might be relative to a given base URI. The algorithm 1174 is intended to provide a definitive result that can be used to test 1175 the output of other implementations. Implementation of the algorithm 1176 itself is not required, but the result given by an implementation 1177 must match the result that would be given by this algorithm. 1179 The base URI (Base) is established according to the rules of Section 1180 5.1 and parsed into the five main components described in Section 3. 1181 Note that only the scheme component is required to be present in the 1182 base URI; the other components may be empty or undefined. 
A 1183 component is undefined if its preceding separator does not appear in 1184 the URI reference; the path component is never undefined, though it 1185 may be empty. The algorithm assumes that the base URI is well-formed 1186 and does not contain dot-segments in its path. 1188 For each URI reference (R), the following pseudocode describes an 1189 algorithm for transforming R into its target URI (T): 1191 -- The URI reference is parsed into the five URI components 1192 -- 1193 (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); 1195 -- A non-strict parser may ignore a scheme in the reference 1196 -- if it is identical to the base URI's scheme. 1197 -- 1198 if ((not strict) and (R.scheme == Base.scheme)) then 1199 undefine(R.scheme); 1200 endif; 1202 if defined(R.scheme) then 1203 T.scheme = R.scheme; 1204 T.authority = R.authority; 1205 T.path = remove_dot_segments(R.path); 1206 T.query = R.query; 1207 else 1208 if defined(R.authority) then 1209 T.authority = R.authority; 1210 T.path = remove_dot_segments(R.path); 1211 T.query = R.query; 1212 else 1213 if (R.path == "") then 1214 T.path = Base.path; 1215 if defined(R.query) then 1216 T.query = R.query; 1217 else 1218 T.query = Base.query; 1219 endif; 1220 else 1221 if (R.path starts-with "/") then 1222 T.path = remove_dot_segments(R.path); 1223 else 1224 T.path = merge(Base.path, R.path); 1225 T.path = remove_dot_segments(T.path); 1226 endif; 1227 T.query = R.query; 1228 endif; 1229 T.authority = Base.authority; 1230 endif; 1231 T.scheme = Base.scheme; 1232 endif; 1234 T.fragment = R.fragment; 1236 The pseudocode above refers to a merge routine for merging a 1237 relative-path reference with the path of the base URI. 
This is 1238 accomplished as follows: 1240 o If the base URI's path is empty, then return a string consisting 1241 of "/" concatenated with the reference's path component; 1242 otherwise, 1244 o If the base URI's path is non-hierarchical, as indicated by not 1245 beginning with a slash, then return a string consisting of the 1246 reference's path component; otherwise, 1248 o Return a string consisting of the reference's path component 1249 appended to all but the last segment of the base URI's path (i.e., 1250 any characters after the right-most "/" in the base URI path are 1251 excluded). 1253 The pseudocode also refers to a remove_dot_segments routine for 1254 interpreting and removing the special "." and ".." complete path 1255 segments from a referenced path. This is done after the path is 1256 extracted from a reference, whether or not the path was relative, in 1257 order to remove any invalid or extraneous dot-segments prior to 1258 forming the target URI. Although there are many ways to accomplish 1259 this removal process, we describe a simple method using a separate 1260 string buffer: 1262 1. The buffer is initialized with the unprocessed path component. 1264 2. If the buffer begins with "./" or "../", the "." or ".." segment 1265 is removed. 1267 3. All occurrences of "/./" in the buffer are replaced with "/". 1269 4. If the buffer ends with "/.", the "." is removed. 1271 5. All occurrences of "/<segment>/../" in the buffer, where ".." and 1272 <segment> are complete path segments, are iteratively replaced 1273 with "/" in order from left to right until no matching pattern 1274 remains. If the buffer ends with "/<segment>/..", that is also 1275 replaced with "/". Note that <segment> may be empty. 1277 6. All prefixes of "<segment>/../" in the buffer, where ".." and 1278 <segment> are complete path segments, are iteratively replaced 1279 with "/" in order from left to right until no matching pattern 1280 remains. If the buffer ends with "<segment>/..", that is also 1281 replaced with "/". Note that <segment> may be empty. 1283 7. 
The remaining buffer is returned as the result of 1284 remove_dot_segments. 1286 Some systems may find it more efficient to implement the 1287 remove_dot_segments algorithm as a stack of path segments being 1288 compressed, rather than as a series of string pattern replacements. 1290 5.3 Recomposition of a Parsed URI 1292 Parsed URI components can be recomposed to obtain the corresponding 1293 URI reference string. Using pseudocode, this would be: 1295 result = "" 1297 if defined(scheme) then 1298 append scheme to result; 1299 append ":" to result; 1300 endif; 1302 if defined(authority) then 1303 append "//" to result; 1304 append authority to result; 1305 endif; 1307 append path to result; 1309 if defined(query) then 1310 append "?" to result; 1311 append query to result; 1312 endif; 1314 if defined(fragment) then 1315 append "#" to result; 1316 append fragment to result; 1317 endif; 1318 return result; 1320 Note that we are careful to preserve the distinction between a 1321 component that is undefined, meaning that its separator was not 1322 present in the reference, and a component that is empty, meaning that 1323 the separator was present and was immediately followed by the next 1324 component separator or the end of the reference. 1326 5.4 Reference Resolution Examples 1328 Within an object with a well-defined base URI of 1330 http://a/b/c/d;p?q 1332 a relative URI reference would be resolved as follows: 1334 5.4.1 Normal Examples 1336 "g:h" = "g:h" 1337 "g" = "http://a/b/c/g" 1338 "./g" = "http://a/b/c/g" 1339 "g/" = "http://a/b/c/g/" 1340 "/g" = "http://a/g" 1341 "//g" = "http://g" 1342 "?y" = "http://a/b/c/d;p?y" 1343 "g?y" = "http://a/b/c/g?y" 1344 "#s" = "http://a/b/c/d;p?q#s" 1345 "g#s" = "http://a/b/c/g#s" 1346 "g?y#s" = "http://a/b/c/g?y#s" 1347 ";x" = "http://a/b/c/;x" 1348 "g;x" = "http://a/b/c/g;x" 1349 "g;x?y#s" = "http://a/b/c/g;x?y#s" 1350 "." = "http://a/b/c/" 1351 "./" = "http://a/b/c/" 1352 ".." 
= "http://a/b/" 1353 "../" = "http://a/b/" 1354 "../g" = "http://a/b/g" 1355 "../.." = "http://a/" 1356 "../../" = "http://a/" 1357 "../../g" = "http://a/g" 1359 5.4.2 Abnormal Examples 1361 Although the following abnormal examples are unlikely to occur in 1362 normal practice, all URI parsers should be capable of resolving them 1363 consistently. Each example uses the same base as above. 1365 An empty reference refers to the current base URI. 1367 "" = "http://a/b/c/d;p?q" 1369 Parsers must be careful in handling the case where there are more 1370 relative path ".." segments than there are hierarchical levels in the 1371 base URI's path. Note that the ".." syntax cannot be used to change 1372 the authority component of a URI. 1374 "../../../g" = "http://a/g" 1375 "../../../../g" = "http://a/g" 1377 Similarly, parsers must remove the dot-segments "." and ".." when 1378 they are complete components of a path, but not when they are only 1379 part of a segment. 1381 "/./g" = "http://a/g" 1382 "/../g" = "http://a/g" 1383 "g." = "http://a/b/c/g." 1384 ".g" = "http://a/b/c/.g" 1385 "g.." = "http://a/b/c/g.." 1386 "..g" = "http://a/b/c/..g" 1388 Less likely are cases where the relative URI uses unnecessary or 1389 nonsensical forms of the "." and ".." complete path segments. 1391 "./../g" = "http://a/b/g" 1392 "./g/." = "http://a/b/c/g/" 1393 "g/./h" = "http://a/b/c/g/h" 1394 "g/../h" = "http://a/b/c/h" 1395 "g;x=1/./y" = "http://a/b/c/g;x=1/y" 1396 "g;x=1/../y" = "http://a/b/c/y" 1398 Some applications fail to separate the reference's query and/or 1399 fragment components from a relative path before merging it with the 1400 base path and removing dot-segments. This error is rarely noticed, 1401 since typical usage of a fragment never includes the hierarchy ("/") 1402 character, and the query component is not normally used within 1403 relative references. 
1405 "g?y/./x" = "http://a/b/c/g?y/./x" 1406 "g?y/../x" = "http://a/b/c/g?y/../x" 1407 "g#s/./x" = "http://a/b/c/g#s/./x" 1408 "g#s/../x" = "http://a/b/c/g#s/../x" 1410 Some parsers allow the scheme name to be present in a relative URI if 1411 it is the same as the base URI scheme. This is considered to be a 1412 loophole in prior specifications of partial URI [RFC1630]. Its use 1413 should be avoided, but is allowed for backward compatibility. 1415 "http:g" = "http:g" ; for strict parsers 1416 / "http://a/b/c/g" ; for backward compatibility 1418 6. Normalization and Comparison 1420 One of the most common operations on URIs is simple comparison: 1421 determining if two URIs are equivalent without using the URIs to 1422 access their respective resource(s). A comparison is performed every 1423 time a response cache is accessed, a browser checks its history to 1424 color a link, or an XML parser processes tags within a namespace. 1425 Extensive normalization prior to comparison of URIs is often used by 1426 spiders and indexing engines to prune a search space or reduce 1427 duplication of request actions and response storage. 1429 URI comparison is performed in respect to some particular purpose, 1430 and software with differing purposes will often be subject to 1431 differing design trade-offs in regards to how much effort should be 1432 spent in reducing duplicate identifiers. This section describes a 1433 variety of methods that may be used to compare URIs, the trade-offs 1434 between them, and the types of applications that might use them. 1436 6.1 Equivalence 1438 Since URIs exist to identify resources, presumably they should be 1439 considered equivalent when they identify the same resource. However, 1440 such a definition of equivalence is not of much practical use, since 1441 there is no way for software to compare two resources without 1442 knowledge of their origin. 
For this reason, determination of 1443 equivalence or difference of URIs is based on string comparison, 1444 perhaps augmented by reference to additional rules provided by URI 1445 scheme definitions. We use the terms "different" and "equivalent" to 1446 describe the possible outcomes of such comparisons, but there are 1447 many application-dependent versions of equivalence. 1449 Even though it is possible to determine that two URIs are equivalent, 1450 it is never possible to be sure that two URIs identify different 1451 resources. Therefore, comparison methods are designed to minimize 1452 false negatives while strictly avoiding false positives. 1454 In testing for equivalence, it is generally unwise to directly 1455 compare relative URI references; they should be converted to their 1456 absolute forms before comparison. Furthermore, when URI references 1457 are being compared for the purpose of selecting (or avoiding) a 1458 network action, such as retrieval of a representation, it is often 1459 necessary to remove fragment identifiers from the URIs prior to 1460 comparison. 1462 6.2 Comparison Ladder 1464 A variety of methods are used in practice to test URI equivalence. 1465 These methods fall into a range, distinguished by the amount of 1466 processing required and the degree to which the probability of false 1467 negatives is reduced. As noted above, false negatives cannot in 1468 principle be eliminated. In practice, their probability can be 1469 reduced, but this reduction requires more processing and is not 1470 cost-effective for all applications. 1472 If this range of comparison practices is considered as a ladder, the 1473 following discussion will climb the ladder, starting with those that 1474 are cheap but have a relatively higher chance of producing false 1475 negatives, and proceeding to those that have higher computational 1476 cost and lower risk of false negatives. 
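As a rough illustration of the advice in Section 6.1 and of the lowest rung of the ladder, the following Python sketch resolves each reference against its base URI, drops any fragment identifier, and then compares the results as plain strings. The function names are hypothetical, and the standard library's urllib.parse resolver follows later revisions of these rules, so its output may differ from this draft's algorithm in edge cases.

```python
from urllib.parse import urljoin, urldefrag


def absolute_without_fragment(base: str, reference: str) -> str:
    """Resolve `reference` against `base`, then drop any fragment identifier."""
    absolute = urljoin(base, reference)
    return urldefrag(absolute).url


def same_for_retrieval(base_a: str, ref_a: str, base_b: str, ref_b: str) -> bool:
    """Simple string comparison of the absolute, fragment-free forms.

    True means "equivalent"; False only means the strings differ, which
    may still be a false negative in the sense of Section 6.1.
    """
    return (absolute_without_fragment(base_a, ref_a)
            == absolute_without_fragment(base_b, ref_b))


if __name__ == "__main__":
    base = "http://a/b/c/d;p?q"
    print(same_for_retrieval(base, "g#s", base, "./g"))
    print(same_for_retrieval(base, "g", base, "../g"))
```

A comparison that returns False here is a candidate for the more expensive rungs described in the following sections; one that returns True needs no further work.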
1478 6.2.1 Simple String Comparison 1480 If two URIs, considered as character strings, are identical, then it 1481 is safe to conclude that they are equivalent. This type of 1482 equivalence test has very low computational cost and is in wide use 1483 in a variety of applications, particularly in the domain of parsing. 1485 Testing strings for equivalence requires some basic precautions. This 1486 procedure is often referred to as "bit-for-bit" or "byte-for-byte" 1487 comparison, which is potentially misleading. Testing of strings for 1488 equality is normally based on pairwise comparison of the characters 1489 that make up the strings, starting from the first and proceeding 1490 until both strings are exhausted and all characters found to be 1491 equal, a pair of characters compares unequal, or one of the strings 1492 is exhausted before the other. 1494 Such character comparisons require that each pair of characters be 1495 put in comparable form. For example, should one URI be stored in a 1496 byte array in EBCDIC encoding, and the second be in a Java String 1497 object, bit-for-bit comparisons applied naively will produce both 1498 false-positive and false-negative errors. Thus, in principle, it is 1499 better to speak of equality on a character-for-character rather than 1500 byte-for-byte or bit-for-bit basis. 1502 Unicode defines a character as being identified by number 1503 ("codepoint") with an associated bundle of visual and other 1504 semantics. At the software level, it is not practical to compare 1505 semantic bundles, so in practical terms, character-by-character 1506 comparisons are done codepoint-by-codepoint. 1508 6.2.2 Syntax-based Normalization 1510 Software may use logic based on the definitions provided by this 1511 specification to reduce the probability of false negatives. Such 1512 processing is moderately higher in cost than character-for-character 1513 string comparison. 
For example, an application using this approach 1514 could reasonably consider the following two URIs equivalent: 1516 example://a/b/c/%7A 1517 eXAMPLE://a/./b/../b/c/%7a 1519 Web user agents, such as browsers, typically apply this type of URI 1520 normalization when determining whether a cached response is 1521 available. Syntax-based normalization includes such techniques as 1522 case normalization, escape normalization, and removal of 1523 dot-segments. 1525 6.2.2.1 Case Normalization 1527 When a URI scheme uses components of the generic syntax, it will also 1528 use the common syntax equivalence rules, namely that the scheme and 1529 hostname are case insensitive and therefore can be normalized to 1530 lowercase. For example, a URI whose scheme or hostname contains uppercase letters is 1531 equivalent to the same URI with those two components written in lowercase. 1533 6.2.2.2 Escape Normalization 1535 The percent-escape mechanism described in Section 2.4 is a frequent 1536 source of variance among otherwise identical URIs. One cause is the 1537 choice of uppercase or lowercase letters for the hexadecimal digits 1538 within the escape sequence (e.g., "%3a" versus "%3A"). Such sequences 1539 are always equivalent; for the sake of uniformity, URI generators and 1540 normalizers are strongly encouraged to use uppercase letters for the 1541 hex digits A-F. 1543 Only characters that are excluded from or reserved within the URI 1544 syntax must be escaped when used as data. However, some URI 1545 generators go beyond that and escape characters that do not require 1546 escaping, resulting in URIs that are equivalent to their unescaped 1547 counterparts. Such URIs can be normalized by unescaping sequences 1548 that represent the unreserved characters, as described in Section 1549 2.3. 1551 6.2.2.3 Path Segment Normalization 1553 The complete path segments "." and ".." have a special meaning within 1554 hierarchical URI schemes. 
As such, they should not appear in 1555 absolute paths; if they are found, they can be removed by applying 1556 the remove_dot_segments algorithm to the path, as described in 1557 Section 5.2. 1559 6.2.3 Scheme-based Normalization 1561 The syntax and semantics of URIs vary from scheme to scheme, as 1562 described by the defining specification for each scheme. Software 1563 may use scheme-specific rules, at further processing cost, to reduce 1564 the probability of false negatives. For example, Web spiders that 1565 populate most large search engines would consider the following two 1566 URIs to be equivalent: 1568 http://example.com/ 1569 http://example.com:80/ 1571 This behavior is based on the rules provided by the syntax and 1572 semantics of the "http" URI scheme, which defines an empty port 1573 component as being equivalent to the default TCP port for HTTP (port 1574 80). In general, a URI scheme that uses the generic syntax for 1575 authority is defined such that a URI with an explicit ":port", where 1576 the port is the default for the scheme, is equivalent to one where 1577 the port is elided. 1579 6.2.4 Protocol-based Normalization 1581 Web spiders, for which substantial effort to reduce the incidence of 1582 false negatives is often cost-effective, are observed to implement 1583 even more aggressive techniques in URI comparison. For example, if 1584 they observe that a URI such as 1586 http://example.com/data 1588 redirects to a URI differing only in the trailing slash 1590 http://example.com/data/ 1592 they will likely regard the two as equivalent in the future. 1593 Obviously, this kind of technique is only appropriate in special 1594 situations. 1596 6.3 Canonical Form 1598 It is in the best interests of everyone to avoid false-negatives in 1599 comparing URIs and to minimize the amount of software processing for 1600 such comparisons. 
Those who generate and make reference to URIs can 1601 reduce the cost of processing and the risk of false negatives by 1602 consistently providing them in a form that is reasonably canonical 1603 with respect to their scheme. Specifically: 1605 o Always provide the URI scheme in lowercase characters. 1607 o Always provide the hostname, if any, in lowercase characters. 1609 o Only perform percent-escaping where it is essential. 1611 o Always use uppercase A-through-F characters when percent-escaping. 1613 o Prevent /./ and /../ from appearing in non-relative URI paths. 1615 The good practices listed above are motivated by deployed software 1616 that frequently uses these techniques for the purposes of 1617 normalization. 1619 7. Security Considerations 1621 A URI does not in itself pose a security threat. However, since URIs 1622 are often used to provide a compact set of instructions for access to 1623 network resources, care must be taken to properly interpret the data 1624 within a URI, to prevent that data from causing unintended access, 1625 and to avoid including data that should not be revealed in plain 1626 text. 1628 7.1 Reliability and Consistency 1630 There is no guarantee that, having once used a given URI to retrieve 1631 some information, the same information will be retrievable by that 1632 URI in the future. Nor is there any guarantee that the information 1633 retrievable via that URI in the future will be observably similar to 1634 that retrieved in the past. The URI syntax does not constrain how a 1635 given scheme or authority apportions its name space or maintains it 1636 over time. Such a guarantee can only be obtained from the person(s) 1637 controlling that name space and the resource in question. A specific 1638 URI scheme may define additional semantics, such as name persistence, 1639 if those semantics are required of all naming authorities for that 1640 scheme. 
1642 7.2 Malicious Construction 1644 It is sometimes possible to construct a URI such that an attempt to 1645 perform a seemingly harmless, idempotent operation, such as the 1646 retrieval of a representation, will in fact cause a possibly damaging 1647 remote operation to occur. The unsafe URI is typically constructed 1648 by specifying a port number other than that reserved for the network 1649 protocol in question. The client unwittingly contacts a site that is 1650 running a different protocol service. The content of the URI 1651 contains instructions that, when interpreted according to this other 1652 protocol, cause an unexpected operation. An example has been the use 1653 of a gopher URI to cause an unintended or impersonating message to be 1654 sent via a SMTP server. 1656 Caution should be used when dereferencing a URI that specifies a TCP 1657 port number other than the default for the scheme, especially when it 1658 is a number within the reserved space. 1660 Care should be taken when a URI contains escaped delimiters for a 1661 given protocol (for example, CR and LF characters for telnet 1662 protocols) that these octets are not unescaped before transmission. 1663 This might violate the protocol, but avoids the potential for such 1664 characters to be used to simulate an extra operation or parameter in 1665 that protocol which might lead to an unexpected and possibly harmful 1666 remote operation being performed. 1668 7.3 Rare IP Address Formats 1670 Although the URI syntax for IPv4address only allows the common, 1671 dotted-decimal form of IPv4 address literal, many implementations 1672 that process URIs make use of platform-dependent system routines, 1673 such as gethostbyname() and inet_aton(), to translate the string 1674 literal to an actual IP address. Unfortunately, such system routines 1675 often allow and process a much larger set of formats than those 1676 described in Section 3.2.2. 
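One way to avoid inheriting that leniency is to check a host literal against the dec-octet rule of Appendix A before handing it to any system routine. The Python sketch below does this with a regular expression transcribed from that rule and, on success, returns the 32-bit numeric value of the address. The names strict_ipv4_to_int, _DEC_OCTET, and _IPV4 are illustrative assumptions, not part of this specification.

```python
import re
from typing import Optional

# dec-octet from the collected ABNF: 0-9 / 10-99 / 100-199 / 200-249 / 250-255.
# Longer alternatives come first so that the regex engine tries them first.
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
_IPV4 = re.compile(r"(?:%s\.){3}%s" % (_DEC_OCTET, _DEC_OCTET))


def strict_ipv4_to_int(literal: str) -> Optional[int]:
    """Return the 32-bit value of a dotted-decimal IPv4 literal.

    Returns None for anything outside the four-part decimal form, such as
    two- or three-part forms, hexadecimal parts, or parts with leading zeros.
    """
    if _IPV4.fullmatch(literal) is None:
        return None
    value = 0
    for part in literal.split("."):
        value = (value << 8) | int(part)
    return value


if __name__ == "__main__":
    for candidate in ("10.0.0.1", "127.1", "0x7f.0.0.1", "010.0.0.1"):
        print(candidate, strict_ipv4_to_int(candidate))
```

Rejecting the literal outright, rather than guessing how a particular platform would interpret it, keeps the application's view of the address independent of the platform's parsing routine.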
1678 For example, many implementations allow dotted forms of three 1679 numbers, wherein the last part is interpreted as a 16-bit quantity 1680 and placed in the right-most two bytes of the network address (e.g., 1681 a Class B network). Likewise, a dotted form of two numbers means the 1682 last part is interpreted as a 24-bit quantity and placed in the right 1683 most three bytes of the network address (Class A), and a single 1684 number (without dots) is interpreted as a 32-bit quantity and stored 1685 directly in the network address. Adding further to the confusion, 1686 some implementations allow each dotted part to be interpreted as 1687 decimal, octal, or hexadecimal, as specified in the C language (i.e., 1688 a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 1689 implies octal; otherwise, the number is interpreted as decimal). 1691 These additional IP address formats are not allowed in the URI syntax 1692 due to differences between platform implementations. However, they 1693 can become a security concern if an application attempts to filter 1694 access to resources based on the IP address in string literal format. 1695 If such filtering is performed, it is recommended that literals be 1696 converted to numeric form and filtered based on the numeric value, 1697 rather than a prefix or suffix of the string form. 1699 7.4 Sensitive Information 1701 It is clearly unwise to use a URI that contains a password which is 1702 intended to be secret. In particular, the use of a password within 1703 the userinfo component of a URI is strongly discouraged except in 1704 those rare cases where the 'password' parameter is intended to be 1705 public. 
1707 7.5 Semantic Attacks 1709 Because the userinfo component is rarely used and appears before the 1710 hostname in the authority component, it can be used to construct a 1711 URI that is intended to mislead a human user by appearing to identify 1712 one (trusted) naming authority while actually identifying a different 1713 authority hidden behind the noise. For example 1715 http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm 1717 might lead a human user to assume that the host is 'www.example.com', 1718 whereas it is actually '10.0.0.1'. Note that the misleading userinfo 1719 could be much longer than the example above. 1721 A misleading URI, such as the one above, is an attack on the user's 1722 preconceived notions about the meaning of a URI, rather than an 1723 attack on the software itself. User agents may be able to reduce the 1724 impact of such attacks by visually distinguishing the various 1725 components of the URI when rendered, such as by using a different 1726 color or tone to render userinfo if any is present, though there is 1727 no general panacea. More information on URI-based semantic attacks 1728 can be found in [Siedzik]. 1730 8. Acknowledgments 1732 This specification is derived from RFC 2396 [RFC2396], RFC 1808 1733 [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those 1734 documents still apply. It also incorporates the update (with 1735 corrections) for IPv6 literals in the host syntax, as defined by 1736 Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in 1737 [RFC2732]. In addition, contributions by Reese Anschultz, Tim Bray, 1738 Rob Cameron, Dan Connolly, Adam M. Costello, John Cowan, Jason 1739 Diamond, Martin Duerst, Stefan Eissing, Clive D.W. Feather, Pat 1740 Hayes, Henry Holtzman, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew 1741 Main, Michael Mealling, Julian Reschke, Tomas Rokicki, Miles Sabin, 1742 Ronald Tschalaer, Marc Warne, Stuart Williams, and Henry Zongaro are 1743 gratefully acknowledged. 
1745 Normative References 1747 [ASCII] American National Standards Institute, "Coded Character 1748 Set -- 7-bit American Standard Code for Information 1749 Interchange", ANSI X3.4, 1986. 1751 [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 1752 Specifications: ABNF", RFC 2234, November 1997. 1754 Informative References 1756 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1757 Languages", BCP 18, RFC 2277, January 1998. 1759 [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A 1760 Unifying Syntax for the Expression of Names and Addresses 1761 of Objects on the Network as used in the World-Wide Web", 1762 RFC 1630, June 1994. 1764 [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform 1765 Resource Locators (URL)", RFC 1738, December 1994. 1767 [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 1768 Resource Identifiers (URI): Generic Syntax", RFC 2396, 1769 August 1998. 1771 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 1772 and Support", STD 3, RFC 1123, October 1989. 1774 [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC 1775 1808, June 1995. 1777 [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 1778 Extensions (MIME) Part Two: Media Types", RFC 2046, 1779 November 1996. 1781 [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. 1782 Jensen, "HTTP Extensions for Distributed Authoring -- 1783 WEBDAV", RFC 2518, February 1999. 1785 [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet 1786 host table specification", RFC 952, October 1985. 1788 [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 1789 (IPv6) Addressing Architecture", RFC 3513, April 2003. 1791 [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for 1792 Literal IPv6 Addresses in URL's", RFC 2732, December 1999. 1794 [RFC1736] Kunze, J., "Functional Recommendations for Internet 1795 Resource Locators", RFC 1736, February 1995. 
1797 [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for 1798 Uniform Resource Names", RFC 1737, December 1994. 1800 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 1802 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 1803 STD 13, RFC 1034, November 1987. 1805 [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of 1806 Aggregate Documents, such as HTML (MHTML)", RFC 2110, 1807 March 1997. 1809 [RFC2717] Petke, R. and I. King, "Registration Procedures for URL 1810 Scheme Names", BCP 35, RFC 2717, November 1999. 1812 [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April 1813 2001. 1815 [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 1816 10646", RFC 2279, January 1998. 1818 Authors' Addresses 1820 Tim Berners-Lee 1821 World Wide Web Consortium 1822 MIT/LCS, Room NE43-356 1823 200 Technology Square 1824 Cambridge, MA 02139 1825 USA 1827 Phone: +1-617-253-5702 1828 Fax: +1-617-258-5999 1829 EMail: timbl@w3.org 1830 URI: http://www.w3.org/People/Berners-Lee/ 1832 Roy T. Fielding 1833 Day Software 1834 2 Corporate Plaza, Suite 150 1835 Newport Beach, CA 92660 1836 USA 1838 Phone: +1-949-999-2523 1839 Fax: +1-949-644-5064 1840 EMail: roy.fielding@day.com 1841 URI: http://www.apache.org/~fielding/ 1843 Larry Masinter 1844 Adobe Systems Incorporated 1845 345 Park Ave 1846 San Jose, CA 95110 1847 USA 1849 Phone: +1-408-536-3024 1850 EMail: LMM@acm.org 1851 URI: http://larry.masinter.net/ 1853 Appendix A. Collected ABNF for URI 1855 abs-path = "/" path-segments 1857 absolute-URI = scheme ":" hier-part [ "?" query ] 1859 alphanum = ALPHA / DIGIT 1861 authority = [ userinfo "@" ] host [ ":" port ] 1863 dec-octet = DIGIT ; 0-9 1864 / %x31-39 DIGIT ; 10-99 1865 / "1" 2DIGIT ; 100-199 1866 / "2" %x30-34 DIGIT ; 200-249 1867 / "25" %x30-35 ; 250-255 1869 domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] 1871 escaped = "%" HEXDIG HEXDIG 1873 fragment = *( pchar / "/" / "?" 
) 1875 h4 = 1*4HEXDIG 1877 hier-part = net-path / abs-path / rel-path 1879 host = [ IPv6reference / IPv4address / hostname ] 1881 hostname = domainlabel qualified 1883 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 1885 IPv6address = 6( h4 ":" ) ls32 1886 / "::" 5( h4 ":" ) ls32 1887 / [ h4 ] "::" 4( h4 ":" ) ls32 1888 / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 1889 / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 1890 / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 1891 / [ *4( h4 ":" ) h4 ] "::" ls32 1892 / [ *5( h4 ":" ) h4 ] "::" h4 1893 / [ *6( h4 ":" ) h4 ] "::" 1895 IPv6reference = "[" IPv6address "]" 1897 ls32 = ( h4 ":" h4 ) / IPv4address 1899 mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")" 1900 net-path = "//" authority [ abs-path ] 1902 path-segments = segment *( "/" segment ) 1904 pchar = unreserved / escaped / ";" / 1905 ":" / "@" / "&" / "=" / "+" / "$" / "," 1907 port = *DIGIT 1909 qualified = *( "." domainlabel ) [ "." ] 1911 query = *( pchar / "/" / "?" ) 1913 rel-path = path-segments 1915 relative-URI = hier-part [ "?" query ] [ "#" fragment ] 1917 reserved = "/" / "?" / "#" / "[" / "]" / ";" / 1918 ":" / "@" / "&" / "=" / "+" / "$" / "," 1920 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 1922 segment = *pchar 1924 unreserved = ALPHA / DIGIT / mark 1926 URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 1928 URI-reference = URI / relative-URI 1930 uric = reserved / unreserved / escaped 1932 userinfo = *( unreserved / escaped / ";" / 1933 ":" / "&" / "=" / "+" / "$" / "," ) 1935 Appendix B. Parsing a URI Reference with a Regular Expression 1937 Since the "first-match-wins" algorithm is identical to the "greedy" 1938 disambiguation method used by POSIX regular expressions, it is 1939 natural and commonplace to use a regular expression for parsing the 1940 potential five components of a URI reference. 1942 The following line is the regular expression for breaking-down a 1943 well-formed URI reference into its components. 
1945 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? 1946 12 3 4 5 6 7 8 9 1948 The numbers in the second line above are only to assist readability; 1949 they indicate the reference points for each subexpression (i.e., each 1950 paired parenthesis). We refer to the value matched for subexpression <n> 1951 as $<n>. For example, matching the above expression to 1953 http://www.ics.uci.edu/pub/ietf/uri/#Related 1955 results in the following subexpression matches: 1957 $1 = http: 1958 $2 = http 1959 $3 = //www.ics.uci.edu 1960 $4 = www.ics.uci.edu 1961 $5 = /pub/ietf/uri/ 1962 $6 = <undefined> 1963 $7 = <undefined> 1964 $8 = #Related 1965 $9 = Related 1967 where <undefined> indicates that the component is not present, as is 1968 the case for the query component in the above example. Therefore, we 1969 can determine the value of the four components and fragment as 1971 scheme = $2 1972 authority = $4 1973 path = $5 1974 query = $7 1975 fragment = $9 1977 and, going in the opposite direction, we can recreate a URI reference 1978 from its components using the algorithm of Section 5.3. 1980 Appendix C. Delimiting a URI in Context 1982 URIs are often transmitted through formats that do not provide a 1983 clear context for their interpretation. For example, there are many 1984 occasions when a URI is included in plain text; examples include text 1985 sent in electronic mail, USENET news messages, and, most importantly, 1986 printed on paper. In such cases, it is important to be able to 1987 delimit the URI from the rest of the text, and in particular from 1988 punctuation marks that might be mistaken for part of the URI. 1990 In practice, URIs are delimited in a variety of ways, but usually 1991 within double-quotes "http://example.com/", angle brackets <http://example.com/>, or just using whitespace 1994 http://example.com/ 1996 These wrappers do not form part of the URI. 
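As a small illustration of treating the wrappers as separate from the URI, the Python sketch below removes surrounding whitespace and at most one enclosing pair of double-quotes or angle brackets from a candidate token. The function name strip_uri_wrapper is a hypothetical choice, and the sketch deliberately does nothing about whitespace embedded inside the URI.

```python
def strip_uri_wrapper(token: str) -> str:
    """Remove outer whitespace and one pair of double-quotes or angle brackets."""
    token = token.strip()
    for opener, closer in (('"', '"'), ("<", ">")):
        # Only strip when both the opening and closing delimiter are present.
        if len(token) >= 2 and token.startswith(opener) and token.endswith(closer):
            return token[1:-1].strip()
    return token


if __name__ == "__main__":
    for raw in ('"http://example.com/"', " <http://example.com/> ", "http://example.com/"):
        print(repr(strip_uri_wrapper(raw)))
```

Each of the three inputs in the usage lines yields the same string, http://example.com/, which is the point of the paragraph above: the delimiters carry no meaning of their own.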
1998 In the case where a fragment identifier is associated with a URI 1999 reference, the fragment would be placed within the brackets as well 2000 (separated from the URI with a "#" character). 2002 In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may 2003 need to be added to break a long URI across lines. The whitespace 2004 should be ignored when extracting the URI. 2006 No whitespace should be introduced after a hyphen ("-") character. 2007 Because some typesetters and printers may (erroneously) introduce a 2008 hyphen at the end of line when breaking a line, the interpreter of a 2009 URI containing a line break immediately after a hyphen should ignore 2010 all unescaped whitespace around the line break, and should be aware 2011 that the hyphen may or may not actually be part of the URI. 2013 Using <> angle brackets around each URI is especially recommended as 2014 a delimiting style for a URI that contains whitespace. 2016 The prefix "URL:" (with or without a trailing space) was formerly 2017 recommended as a way to help distinguish a URI from other bracketed 2018 designators, though it is not commonly used in practice and is no 2019 longer recommended. 2021 For robustness, software that accepts user-typed URIs should attempt 2022 to recognize and strip both delimiters and embedded whitespace. 2024 For example, the text: 2026 Yes, Jim, I found it under "http://www.w3.org/Addressing/", 2027 but you can probably pick it up from <ftp://ds.internic.net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>. 2031 contains the URI references 2033 http://www.w3.org/Addressing/ 2034 ftp://ds.internic.net/rfc/ 2035 http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING 2037 Appendix D. Summary of Non-editorial Changes 2039 D.1 Additions 2041 IPv6 literals have been added to the list of possible identifiers for 2042 the host portion of an authority component, as described by [RFC2732], 2043 with the addition of "[" and "]" to the reserved and uric sets. 
   Square brackets are now specified as reserved within the authority
   component and not allowed outside their use as delimiters for an
   IPv6reference within host. In order to make this change without
   changing the technical definition of the path, query, and fragment
   components, those rules were redefined to directly specify the
   characters allowed rather than be defined in terms of uric.

   Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
   address, which unfortunately lacks an ABNF description of
   IPv6address, we created a new ABNF rule for IPv6address that matches
   the text representations defined by Section 2.2 of [RFC3513].
   Likewise, the definition of IPv4address has been improved in order to
   limit each decimal octet to the range 0-255, and the definition of
   hostname has been improved to better specify length limitations and
   partially-qualified domain names.

   Section 6 on URI normalization and comparison has been completely
   rewritten and extended using input from Tim Bray and discussion
   within the W3C Technical Architecture Group. Likewise, Section 2.1
   on the encoding of characters has been replaced.

   An ABNF production for URI has been introduced to correspond to the
   common usage of the term: an absolute URI with optional fragment.

D.2 Modifications from RFC 2396

   The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234].
   This change required all rule names that formerly included underscore
   characters to be renamed with a dash instead.

   Section 2.2 on reserved characters has been rewritten to clearly
   explain what characters are reserved, when they are reserved, and why
   they are reserved even when not used as delimiters by the generic
   syntax.
   Likewise, the section on escaped characters has been rewritten, and
   URI normalizers are now given license to unescape any octets
   corresponding to unreserved characters. The number-sign ("#")
   character has been moved back from the excluded delims to the
   reserved set.

   The ABNF for URI and URI-reference has been redesigned to make them
   more friendly to LALR parsers and significantly reduce complexity. As
   a result, the layout form of syntax description has been removed,
   along with the uric-no-slash, opaque-part, and rel-segment
   productions. All references to "opaque" URIs have been replaced with
   a better description of how the path component may be opaque to
   hierarchy. The fragment identifier has been moved back into the
   section on generic syntax components and within the URI and
   relative-URI productions, though it remains excluded from
   absolute-URI. The ambiguity regarding the parsing of URI-reference as
   a URI or a relative-URI with a colon in the first segment is now
   explained and disambiguated in the section defining relative-URI.

   The ABNF of hier-part and relative-URI has been corrected to allow a
   relative URI path to be empty. This also allows an absolute-URI to
   consist of nothing after the "scheme:", as is present in practice
   with the "DAV:" namespace [RFC2518] and the "about:" URI used by many
   browser implementations. The ambiguity regarding the parsing of
   net-path, abs-path, and rel-path is now explained and disambiguated
   in the same section.

   Registry-based naming authorities that use the generic syntax
   authority component are now limited to DNS hostnames, since those
   have been the only such URIs in deployment. This change was
   necessary to enable internationalized domain names to be processed in
   their native character encodings at the application layers above URI
   processing.
   The reg_name, server, and hostport productions have been removed to
   simplify parsing of the URI syntax.

   The ABNF of qualified has been simplified to remove a parsing
   ambiguity without changing the allowed syntax. The toplabel
   production has been removed because it served no useful purpose. The
   ambiguity regarding the parsing of host as IPv4address or hostname is
   now explained and disambiguated in the same section.

   The algorithm of [RFC2396] for resolving relative references has
   been rewritten using pseudocode for this revision to improve clarity
   and fix the following issues:

   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
      with no path.

   o  Restored the behavior of [RFC1808] where, if the reference
      contains an empty path and a defined query component, then the
      target URI inherits the base URI's path component.

   o  Removed the special-case treatment of same-document references in
      favor of a section that explains that a new retrieval action
      should not be made if the target URI and base URI, excluding
      fragments, match. This change has no impact on user agent
      behavior aside from how the resolved reference might be described
      to the user.

   o  Separated the path merge routine into two routines: merge, for
      describing combination of the base URI path with a relative-path
      reference, and remove_dot_segments, for describing how to remove
      the special "." and ".." segments from a composed path. The
      remove_dot_segments algorithm is now applied to all URI reference
      paths in order to match common implementations and improve the
      normalization of URIs in practice. This change only impacts the
      parsing of abnormal references and same-scheme references wherein
      the base URI has a non-hierarchical path.
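
   The last item above can be illustrated with a short, non-normative
   Python sketch of the two routines. It assumes the behavior later
   published in RFC 3986, Sections 5.2.3 and 5.2.4, which may differ in
   detail from the pseudocode of this draft. The base_has_authority
   parameter is a hypothetical name introduced for the example.

```python
def merge(base_path, ref_path, base_has_authority):
    """Combine a base URI path with a relative-path reference.

    Assumed behavior (RFC 3986, Section 5.2.3): a base URI with an
    authority and an empty path contributes "/"; otherwise everything
    after the last "/" of the base path is replaced by the reference.
    """
    if base_has_authority and base_path == "":
        return "/" + ref_path
    # rfind returns -1 when there is no "/", which yields an empty prefix.
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """Remove "." and ".." segments from a composed path.

    Assumed behavior (RFC 3986, Section 5.2.4): the input is consumed
    from the left and complete segments are moved to an output list.
    """
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./" collapses to "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../" collapses to "/"
            if output:
                output.pop()           # and drops the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/" if any,
            # up to but not including the next "/".
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


if __name__ == "__main__":
    merged = merge("/b/c/d;p", "../g", True)
    print(merged, "->", remove_dot_segments(merged))
```

   Under these assumptions, merging the base path "/b/c/d;p" with the
   reference "../g" gives "/b/c/../g", which remove_dot_segments
   reduces to "/b/g". A base URI with an authority and no path yields
   "/g" for the reference "g", which covers the case named in the
   first item of the list.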

Index

   A
      ABNF
      abs-path
      absolute
      absolute-path
      absolute-URI
      access
      alphanum
      authority

   B
      base URI

   D
      dec-octet
      delims
      dereference
      domainlabel
      dot-segments

   E
      escaped
      excluded

   F
      fragment

   G
      generic syntax

   H
      h4
      hier-part
      hierarchical
      host
      hostname

   I
      identifier
      invisible
      IPv4
      IPv4address
      IPv6
      IPv6address
      IPv6reference

   L
      locator
      ls32

   M
      mark
      merge

   N
      name
      net-path
      network-path

   P
      path
      path-segments
      pchar
      port

   Q
      qualified
      query

   R
      rel-path
      relative
      relative-path
      relative-URI
      remove_dot_segments
      representation
      reserved
      resolution
      resource
      retrieval

   S
      same-document
      sameness
      scheme
      segment
      suffix

   T
      transcription

   U
      uniform
      unreserved
      unwise
      URI grammar
         abs-path
         absolute-URI
         ALPHA
         alphanum
         authority
         CR
         CTL
         dec-octet
         DIGIT
         domainlabel
         DQUOTE
         escaped
         fragment
         h4
         HEXDIG
         hier-part
         host
         hostname
         IPv4address
         IPv6address
         IPv6reference
         LF
         ls32
         mark
         net-path
         OCTET
         path-segments
         pchar
         port
         qualified
         query
         rel-path
         relative-URI
         reserved
         scheme
         segment
         SP
         unreserved
         URI
         URI-reference
         uric
         userinfo
      URI
      URI-reference
      uric
      URL
      URN
      userinfo

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.