idnits 2.17.00 (12 Aug 2021) /tmp/idnits28740/draft-fielding-uri-rfc2396bis-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1.a on line 19. -- Found old boilerplate from RFC 3978, Section 5.5 on line 2701. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2678. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2685. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2691. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line 2707), which is fine, but *also* found old RFC 2026, Section 10.4C, paragraph 1 text on line 38. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate instead of verbatim RFC 3978 boilerplate. After 6 May 2005, submission of drafts without verbatim RFC 3978 boilerplate is not accepted. The following non-3978 patterns matched text found in the document. That text should be removed or replaced: This document is an Internet-Draft and is subject to all provisions of Section 3 of RFC 3667. 
By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([1]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the document. -- The draft header indicates that this document obsoletes RFC2732, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC2396, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC1808, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document updates RFC1738, but the abstract doesn't seem to mention this, which it should. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 729 has weird spacing: '... query frag...' 
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 17, 2004) is 6516 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '1' on line 59 -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234) -- Possible downref: Non-RFC (?) normative reference: ref. 'UCS' -- Obsolete informational reference (is this intentional?): RFC 2717 (ref. 
'BCP35') (Obsoleted by RFC 4395) -- Obsolete informational reference (is this intentional?): RFC 1738 (Obsoleted by RFC 4248, RFC 4266) -- Obsolete informational reference (is this intentional?): RFC 1808 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2141 (Obsoleted by RFC 8141) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2518 (Obsoleted by RFC 4918) -- Obsolete informational reference (is this intentional?): RFC 2718 (Obsoleted by RFC 4395) -- Obsolete informational reference (is this intentional?): RFC 2732 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) -- Obsolete informational reference (is this intentional?): RFC 3513 (Obsoleted by RFC 4291) Summary: 10 errors (**), 0 flaws (~~), 4 warnings (==), 24 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group T. Berners-Lee 2 Internet-Draft W3C/MIT 3 Updates: 1738 (if approved) R. Fielding 4 Obsoletes: 2732, 2396, 1808 (if approved) Day Software 5 L. Masinter 6 Expires: January 15, 2005 Adobe 7 July 17, 2004 9 Uniform Resource Identifier (URI): Generic Syntax 10 draft-fielding-uri-rfc2396bis-06 12 Status of this Memo 14 This document is an Internet-Draft and is subject to all provisions 15 of section 3 of RFC 3667. By submitting this Internet-Draft, each 16 author represents that any applicable patent or other IPR claims of 17 which he or she is aware have been or will be disclosed, and any of 18 which he or she become aware will be disclosed, in accordance with 19 RFC 3668. 21 Internet-Drafts are working documents of the Internet Engineering 22 Task Force (IETF), its areas, and its working groups. 
Note that 23 other groups may also distribute working documents as 24 Internet-Drafts. 26 Internet-Drafts are draft documents valid for a maximum of six months 27 and may be updated, replaced, or obsoleted by other documents at any 28 time. It is inappropriate to use Internet-Drafts as reference 29 material or to cite them other than as "work in progress." 31 The list of current Internet-Drafts can be accessed at 32 . 33 The list of Internet-Draft Shadow Directories can be accessed at 34 . 36 Copyright Notice 38 Copyright (C) The Internet Society (2004). All Rights Reserved. 40 Abstract 42 A Uniform Resource Identifier (URI) is a compact sequence of 43 characters for identifying an abstract or physical resource. This 44 specification defines the generic URI syntax and a process for 45 resolving URI references that might be in relative form, along with 46 guidelines and security considerations for the use of URIs on the 47 Internet. The URI syntax defines a grammar that is a superset of all 48 valid URIs, such that an implementation can parse the common 49 components of a URI reference without knowing the scheme-specific 50 requirements of every possible identifier. This specification does 51 not define a generative grammar for URIs; that task is performed by 52 the individual specifications of each URI scheme. 54 Editorial Note 56 Discussion of this draft and comments to the editors should be sent 57 to the uri@w3.org mailing list. An issues list and version history 58 is available at <http://gbiv.com/protocols/uri/rev-2002/ 59 issues.html> [1]. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 64 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . 4 65 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . 6 66 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . 7 67 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . 7 68 1.2 Design Considerations . . . . . . . . . . . . . . . . . . 
7 69 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . 7 70 1.2.2 Separating Identification from Interaction . . . . . . 9 71 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . 10 72 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . 11 73 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11 74 2.1 Percent-Encoding . . . . . . . . . . . . . . . . . . . . . 12 75 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . 12 76 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . 13 77 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . 13 78 2.5 Identifying Data . . . . . . . . . . . . . . . . . . . . . 14 79 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16 80 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 16 81 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . 17 82 3.2.1 User Information . . . . . . . . . . . . . . . . . . . 17 83 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . 18 84 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . 21 85 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 86 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . 23 87 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . 23 88 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 89 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . 25 90 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . 26 91 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . 26 92 4.4 Same-document Reference . . . . . . . . . . . . . . . . . 26 93 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . 27 95 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28 96 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . 28 97 5.1.1 Base URI Embedded in Content . . . . . . . . . . . . . 28 98 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . 
29 99 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . 29 100 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . 29 101 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . 30 102 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . 30 103 5.2.2 Transform References . . . . . . . . . . . . . . . . . 30 104 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . 31 105 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . 32 106 5.3 Component Recomposition . . . . . . . . . . . . . . . . . 34 107 5.4 Reference Resolution Examples . . . . . . . . . . . . . . 34 108 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . 35 109 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . 35 110 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 36 111 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . 37 112 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . 37 113 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . 38 114 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . 38 115 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . 39 116 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . 40 117 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . 40 118 7. Security Considerations . . . . . . . . . . . . . . . . . . . 41 119 7.1 Reliability and Consistency . . . . . . . . . . . . . . . 41 120 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . 41 121 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . 42 122 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . 43 123 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . 44 124 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . 44 125 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 44 126 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 44 127 10. References . . . . . . . . . . . . . . . . . . . . . . . 
. . 45 128 10.1 Normative References . . . . . . . . . . . . . . . . . . . . 45 129 10.2 Informative References . . . . . . . . . . . . . . . . . . . 45 130 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 47 131 A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 48 132 B. Parsing a URI Reference with a Regular Expression . . . . . . 50 133 C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 50 134 D. Summary of Non-editorial Changes . . . . . . . . . . . . . . . 52 135 D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . 52 136 D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . 52 137 E. Instructions to RFC Editor . . . . . . . . . . . . . . . . . . 54 138 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 139 Intellectual Property and Copyright Statements . . . . . . . . 59 141 1. Introduction 143 A Uniform Resource Identifier (URI) provides a simple and extensible 144 means for identifying a resource. This specification of URI syntax 145 and semantics is derived from concepts introduced by the World Wide 146 Web global information initiative, whose use of such identifiers 147 dates from 1990 and is described in "Universal Resource Identifiers 148 in WWW" [RFC1630], and is designed to meet the recommendations laid 149 out in "Functional Recommendations for Internet Resource Locators" 150 [RFC1736] and "Functional Requirements for Uniform Resource Names" 151 [RFC1737]. 153 This document obsoletes [RFC2396], which merged "Uniform Resource 154 Locators" [RFC1738] and "Relative Uniform Resource Locators" 155 [RFC1808] in order to define a single, generic syntax for all URIs. 156 It contains the updates from, and obsoletes, [RFC2732], which 157 introduced syntax for IPv6 addresses. It excludes those portions of 158 RFC 1738 that defined the specific syntax of individual URI schemes; 159 those portions will be updated as separate documents. 
The process 160 for registration of new URI schemes is defined separately by [BCP35]. 161 Advice for designers of new URI schemes can be found in [RFC2718]. 163 All significant changes from RFC 2396 are noted in Appendix D. 165 This specification uses the terms "character" and "coded character 166 set" in accordance with the definitions provided in [BCP19], and 167 "character encoding" in place of what [BCP19] refers to as a 168 "charset". 170 1.1 Overview of URIs 172 URIs are characterized as follows: 174 Uniform 176 Uniformity provides several benefits: it allows different types of 177 resource identifiers to be used in the same context, even when the 178 mechanisms used to access those resources may differ; it allows 179 uniform semantic interpretation of common syntactic conventions 180 across different types of resource identifiers; it allows 181 introduction of new types of resource identifiers without 182 interfering with the way that existing identifiers are used; and, 183 it allows the identifiers to be reused in many different contexts, 184 thus permitting new applications or protocols to leverage a 185 pre-existing, large, and widely-used set of resource identifiers. 187 Resource 189 This specification does not limit the scope of what might be a 190 resource; rather, the term "resource" is used in a general sense 191 for whatever might be identified by a URI. Familiar examples 192 include an electronic document, an image, a source of information 193 with consistent purpose (e.g., "today's weather report for Los 194 Angeles"), a service (e.g., an HTTP to SMS gateway), a collection 195 of other resources, and so on. A resource is not necessarily 196 accessible via the Internet; e.g., human beings, corporations, and 197 bound books in a library can also be resources. 
Likewise, 198 abstract concepts can be resources, such as the operators and 199 operands of a mathematical equation, the types of a relationship 200 (e.g., "parent" or "employee"), or numeric values (e.g., zero, 201 one, and infinity). 203 Identifier 205 An identifier embodies the information required to distinguish 206 what is being identified from all other things within its scope of 207 identification. Our use of the terms "identify" and "identifying" 208 refers to this purpose of distinguishing one resource from all 209 other resources, regardless of how that purpose is accomplished 210 (e.g., by name, address, context, etc.). These terms should not 211 be mistaken as an assumption that an identifier defines or 212 embodies the identity of what is referenced, though that may be 213 the case for some identifiers. Nor should it be assumed that a 214 system using URIs will access the resource identified: in many 215 cases, URIs are used to denote resources without any intention 216 that they be accessed. Likewise, the "one" resource identified 217 might not be singular in nature (e.g., a resource might be a named 218 set or a mapping that varies over time). 220 A URI is an identifier, consisting of a sequence of characters 221 matching the syntax rule named <URI> in Section 3, that enables 222 uniform identification of resources via a separately defined, 223 extensible set of naming schemes (Section 3.1). How that 224 identification is accomplished, assigned, or enabled is delegated to 225 each scheme specification. 227 This specification does not place any limits on the nature of a 228 resource, the reasons why an application might wish to refer to a 229 resource, or the kinds of systems that might use URIs for the sake of 230 identifying resources. This specification does not require that a 231 URI persists in identifying the same resource over all time, though 232 that is a common goal of all URI schemes. 
Nevertheless, nothing in 233 this specification prevents an application from limiting itself to 234 particular types of resources, or to a subset of URIs that maintains 235 characteristics desired by that application. 237 URIs have a global scope and are interpreted consistently regardless 238 of context, though the result of that interpretation may be in 239 relation to the end-user's context. For example, "http://localhost/" 240 has the same interpretation for every user of that reference, even 241 though the network interface corresponding to "localhost" may be 242 different for each end-user: interpretation is independent of access. 243 However, an action made on the basis of that reference will take 244 place in relation to the end-user's context, which implies that an 245 action intended to refer to a single, globally unique thing must use 246 a URI that distinguishes that resource from all other things. URIs 247 that identify in relation to the end-user's local context should only 248 be used when the context itself is a defining aspect of the resource, 249 such as when an on-line help manual refers to a file on the 250 end-user's filesystem (e.g., "file:///etc/hosts"). 252 1.1.1 Generic Syntax 254 Each URI begins with a scheme name, as defined in Section 3.1, that 255 refers to a specification for assigning identifiers within that 256 scheme. As such, the URI syntax is a federated and extensible naming 257 system wherein each scheme's specification may further restrict the 258 syntax and semantics of identifiers using that scheme. 260 This specification defines those elements of the URI syntax that are 261 required of all URI schemes or are common to many URI schemes. It 262 thus defines the syntax and semantics that are needed to implement a 263 scheme-independent parsing mechanism for URI references, such that 264 the scheme-dependent handling of a URI can be postponed until the 265 scheme-dependent semantics are needed. 
Likewise, protocols and data
266  formats that make use of URI references can refer to this
267  specification as defining the range of syntax allowed for all URIs,
268  including those schemes that have yet to be defined, thus decoupling
269  the evolution of identification schemes from the evolution of
270  protocols, data formats, and implementations that make use of URIs.

272  A parser of the generic URI syntax is capable of parsing any URI
273  reference into its major components; once the scheme is determined,
274  further scheme-specific parsing can be performed on the components.
275  In other words, the URI generic syntax is a superset of the syntax of
276  all URI schemes.

278  1.1.2 Examples

280  The following example URIs illustrate several URI schemes and
281  variations in their common syntax components:

283     ftp://ftp.is.co.za/rfc/rfc1808.txt

285     http://www.ietf.org/rfc/rfc2396.txt

287     ldap://[2001:db8::7]/c=GB?objectClass?one

289     mailto:John.Doe@example.com

291     news:comp.infosystems.www.servers.unix

293     tel:+1-816-555-1212

295     telnet://192.0.2.16:80/

297     urn:oasis:names:specification:docbook:dtd:xml:4.1.2

299  1.1.3 URI, URL, and URN

301  A URI can be further classified as a locator, a name, or both. The
302  term "Uniform Resource Locator" (URL) refers to the subset of URIs
303  that, in addition to identifying a resource, provide a means of
304  locating the resource by describing its primary access mechanism
305  (e.g., its network "location"). The term "Uniform Resource Name"
306  (URN) has been used historically to refer to both URIs under the
307  "urn" scheme [RFC2141], which are required to remain globally unique
308  and persistent even when the resource ceases to exist or becomes
309  unavailable, and to any other URI with the properties of a name.

311  An individual scheme does not need to be classified as being just one
312  of "name" or "locator".
Instances of URIs from any given scheme may 313 have the characteristics of names or locators or both, often 314 depending on the persistence and care in the assignment of 315 identifiers by the naming authority, rather than any quality of the 316 scheme. Future specifications and related documentation should use 317 the general term "URI", rather than the more restrictive terms URL 318 and URN [RFC3305]. 320 1.2 Design Considerations 322 1.2.1 Transcription 324 The URI syntax has been designed with global transcription as one of 325 its main considerations. A URI is a sequence of characters from a 326 very limited set: the letters of the basic Latin alphabet, digits, 327 and a few special characters. A URI may be represented in a variety 328 of ways: e.g., ink on paper, pixels on a screen, or a sequence of 329 character encoding octets. The interpretation of a URI depends only 330 on the characters used and not how those characters are represented 331 in a network protocol. 333 The goal of transcription can be described by a simple scenario. 334 Imagine two colleagues, Sam and Kim, sitting in a pub at an 335 international conference and exchanging research ideas. Sam asks Kim 336 for a location to get more information, so Kim writes the URI for the 337 research site on a napkin. Upon returning home, Sam takes out the 338 napkin and types the URI into a computer, which then retrieves the 339 information to which Kim referred. 341 There are several design considerations revealed by the scenario: 343 o A URI is a sequence of characters that is not always represented 344 as a sequence of octets. 346 o A URI might be transcribed from a non-network source, and thus 347 should consist of characters that are most likely to be able to be 348 entered into a computer, within the constraints imposed by 349 keyboards (and related input devices) across languages and 350 locales. 
352 o A URI often needs to be remembered by people, and it is easier for 353 people to remember a URI when it consists of meaningful or 354 familiar components. 356 These design considerations are not always in alignment. For 357 example, it is often the case that the most meaningful name for a URI 358 component would require characters that cannot be typed into some 359 systems. The ability to transcribe a resource identifier from one 360 medium to another has been considered more important than having a 361 URI consist of the most meaningful of components. 363 In local or regional contexts and with improving technology, users 364 might benefit from being able to use a wider range of characters; 365 such use is not defined by this specification. Percent-encoded 366 octets (Section 2.1) may be used within a URI to represent characters 367 outside the range of the US-ASCII coded character set if such 368 representation is allowed by the scheme or by the protocol element in 369 which the URI is referenced; such a definition should specify the 370 character encoding used to map those characters to octets prior to 371 being percent-encoded for the URI. 373 1.2.2 Separating Identification from Interaction 375 A common misunderstanding of URIs is that they are only used to refer 376 to accessible resources. In fact, the URI alone only provides 377 identification; access to the resource is neither guaranteed nor 378 implied by the presence of a URI. Instead, an operation (if any) 379 associated with a URI reference is defined by the protocol element, 380 data format attribute, or natural language text in which it appears. 382 Given a URI, a system may attempt to perform a variety of operations 383 on the resource, as might be characterized by such words as "access", 384 "update", "replace", or "find attributes". Such operations are 385 defined by the protocols that make use of URIs, not by this 386 specification. 
However, we do use a few general terms for describing 387 common operations on URIs. URI "resolution" is the process of 388 determining an access mechanism and the appropriate parameters 389 necessary to dereference a URI; such resolution may require several 390 iterations. To use that access mechanism to perform an action on the 391 URI's resource is to "dereference" the URI. 393 When URIs are used within information retrieval systems to identify 394 sources of information, the most common form of URI dereference is 395 "retrieval": making use of a URI in order to retrieve a 396 representation of its associated resource. A "representation" is a 397 sequence of octets, along with representation metadata describing 398 those octets, that constitutes a record of the state of the resource 399 at the time that the representation is generated. Retrieval is 400 achieved by a process that might include using the URI as a cache key 401 to check for a locally cached representation, resolution of the URI 402 to determine an appropriate access mechanism (if any), and 403 dereference of the URI for the sake of applying a retrieval 404 operation. Depending on the protocols used to perform the retrieval, 405 additional information might be supplied about the resource (resource 406 metadata) and its relation to other resources. 408 URI references in information retrieval systems are designed to be 409 late-binding: the result of an access is generally determined at the 410 time it is accessed and may vary over time or due to other aspects of 411 the interaction. Such references are created in order to be used in 412 the future: what is being identified is not some specific result that 413 was obtained in the past, but rather some characteristic that is 414 expected to be true for future results. 
In such cases, the resource 415 referred to by the URI is actually a sameness of characteristics as 416 observed over time, perhaps elucidated by additional comments or 417 assertions made by the resource provider. 419 Although many URI schemes are named after protocols, this does not 420 imply that use of such a URI will result in access to the resource 421 via the named protocol. URIs are often used simply for the sake of 422 identification. Even when a URI is used to retrieve a representation 423 of a resource, that access might be through gateways, proxies, 424 caches, and name resolution services that are independent of the 425 protocol associated with the scheme name, and the resolution of some 426 URIs may require the use of more than one protocol (e.g., both DNS 427 and HTTP are typically used to access an "http" URI's origin server 428 when a representation isn't found in a local cache). 430 1.2.3 Hierarchical Identifiers 432 The URI syntax is organized hierarchically, with components listed in 433 order of decreasing significance from left to right. For some URI 434 schemes, the visible hierarchy is limited to the scheme itself: 435 everything after the scheme component delimiter (":") is considered 436 opaque to URI processing. Other URI schemes make the hierarchy 437 explicit and visible to generic parsing algorithms. 439 The generic syntax uses the slash ("/"), question mark ("?"), and 440 number sign ("#") characters for the purpose of delimiting components 441 that are significant to the generic parser's hierarchical 442 interpretation of an identifier. In addition to aiding the 443 readability of such identifiers through the consistent use of 444 familiar syntax, this uniform representation of hierarchy across 445 naming schemes allows scheme-independent references to be made 446 relative to that hierarchy. 
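The role of the generic delimiters can be illustrated with a small sketch. It is not part of this specification: it assumes Python's standard `urllib.parse.urlsplit`, which approximates the generic syntax, and the helper name `split_components` is hypothetical.

```python
from urllib.parse import urlsplit


def split_components(uri_reference: str) -> dict:
    """Split a URI reference into the five generic components.

    urlsplit() is used as an approximation of a generic parser; it
    reports an absent component and an empty one as the same "".
    """
    parts = urlsplit(uri_reference)
    return {
        "scheme": parts.scheme,      # text before the first ":"
        "authority": parts.netloc,   # text after "//", up to the next "/", "?" or "#"
        "path": parts.path,          # hierarchical ("/"-separated) or opaque data
        "query": parts.query,        # text after the first "?"
        "fragment": parts.fragment,  # text after the first "#"
    }


if __name__ == "__main__":
    for uri in (
        "http://www.ietf.org/rfc/rfc2396.txt",
        "mailto:John.Doe@example.com",
        "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
    ):
        print(uri, "->", split_components(uri))
```

For the "mailto" and "urn" examples, everything after the scheme delimiter ends up in the path component, which matches the description of schemes whose visible hierarchy is limited to the scheme itself.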
448 It is often the case that a group or "tree" of documents has been 449 constructed to serve a common purpose, wherein the vast majority of 450 URIs in these documents point to resources within the tree rather 451 than outside of it. Similarly, documents located at a particular 452 site are much more likely to refer to other resources at that site 453 than to resources at remote sites. Relative referencing of URIs 454 allows document trees to be partially independent of their location 455 and access scheme. For instance, it is possible for a single set of 456 hypertext documents to be simultaneously accessible and traversable 457 via each of the "file", "http", and "ftp" schemes if the documents 458 refer to each other using relative references. Furthermore, such 459 document trees can be moved, as a whole, without changing any of the 460 relative references. 462 A relative URI reference (Section 4.2) refers to a resource by 463 describing the difference within a hierarchical name space between 464 the reference context and the target URI. The reference resolution 465 algorithm, presented in Section 5, defines how such a reference is 466 transformed to the target URI. Since relative references can only be 467 used within the context of a hierarchical URI, designers of new URI 468 schemes should use a syntax consistent with the generic syntax's 469 hierarchical components unless there are compelling reasons to forbid 470 relative referencing within that scheme. 472 All URIs are parsed by generic syntax parsers when used. A URI 473 scheme that wishes to remain opaque to hierarchical processing must 474 disallow the use of slash and question mark characters. However, 475 since a URI reference is only modified by the generic parser if it 476 contains a dot-segment (a complete path segment of "." or "..", as 477 described in Section 3.3), URI schemes may safely use "/" for other 478 purposes if they do not allow dot-segments. 
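The location independence of relative references described above can be sketched as follows. This is an illustration only, assuming Python's standard `urllib.parse.urljoin` as a stand-in for the resolution algorithm of Section 5; the base URIs, the reference list, and the helper name `resolve_all` are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical relative references, as they might appear inside one document.
REFERENCES = ["chapter2.html", "./notes.html", "../images/fig1.png", "/index.html"]


def resolve_all(base_uri: str, references: list) -> dict:
    """Resolve each relative reference against the given base URI."""
    return {ref: urljoin(base_uri, ref) for ref in references}


if __name__ == "__main__":
    # The same document tree, reachable through two different schemes.
    for base in (
        "http://example.com/docs/guide/intro.html",
        "ftp://ftp.example.org/pub/docs/guide/intro.html",
    ):
        for ref, target in resolve_all(base, REFERENCES).items():
            print(f"{base} + {ref} -> {target}")
```

The references themselves are unchanged between the two bases; only the resolved targets differ. The ".." and "." segments are the dot-segments that a generic parser interprets and removes during resolution.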
480  1.3 Syntax Notation

482  This specification uses the Augmented Backus-Naur Form (ABNF)
483  notation of [RFC2234], including the following core ABNF syntax rules
484  defined by that specification: ALPHA (letters), CR (carriage return),
485  DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
486  digits), LF (line feed), and SP (space). The complete URI syntax is
487  collected in Appendix A.

489  2. Characters

491  The URI syntax provides a method of encoding data, presumably for the
492  sake of identifying a resource, as a sequence of characters. The URI
493  characters are, in turn, frequently encoded as octets for transport
494  or presentation. This specification does not mandate any particular
495  character encoding for mapping between URI characters and the octets
496  used to store or transmit those characters. When a URI appears in a
497  protocol element, the character encoding is defined by that protocol;
498  absent such a definition, a URI is assumed to be in the same
499  character encoding as the surrounding text.

501  The ABNF notation defines its terminal values to be non-negative
502  integers (codepoints) based on the US-ASCII coded character set
503  [ASCII]. Since a URI is a sequence of characters, we must invert
504  that relation in order to understand the URI syntax. Therefore, the
505  integer values used by the ABNF must be mapped back to their
506  corresponding characters via US-ASCII in order to complete the syntax
507  rules.

509  A URI is composed from a limited set of characters consisting of
510  digits, letters, and a few graphic symbols. A reserved subset of
511  those characters may be used to delimit syntax components within a
512  URI, while the remaining characters, including both the unreserved
513  set and those reserved characters not acting as delimiters, define
514  each component's identifying data.
516 2.1 Percent-Encoding 518 A percent-encoding mechanism is used to represent a data octet in a 519 component when that octet's corresponding character is outside the 520 allowed set or is being used as a delimiter of, or within, the 521 component. A percent-encoded octet is encoded as a character 522 triplet, consisting of the percent character "%" followed by the two 523 hexadecimal digits representing that octet's numeric value. For 524 example, "%20" is the percent-encoding for the binary octet 525 "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space 526 character (SP). Section 2.4 describes when percent-encoding and 527 decoding is applied. 529 pct-encoded = "%" HEXDIG HEXDIG 531 The uppercase hexadecimal digits 'A' through 'F' are equivalent to 532 the lowercase digits 'a' through 'f', respectively. Two URIs that 533 differ only in the case of hexadecimal digits used in percent-encoded 534 octets are equivalent. For consistency, URI producers and 535 normalizers should use uppercase hexadecimal digits for all 536 percent-encodings. 538 2.2 Reserved Characters 540 URIs include components and subcomponents that are delimited by 541 characters in the "reserved" set. These characters are called 542 "reserved" because they may (or may not) be defined as delimiters by 543 the generic syntax, by each scheme-specific syntax, or by the 544 implementation-specific syntax of a URI's dereferencing algorithm. 545 If data for a URI component would conflict with a reserved 546 character's purpose as a delimiter, then the conflicting data must be 547 percent-encoded before forming the URI. 549 reserved = gen-delims / sub-delims 551 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" 553 sub-delims = "!" / "$" / "&" / "'" / "(" / ")" 554 / "*" / "+" / "," / ";" / "=" 556 The purpose of reserved characters is to provide a set of delimiting 557 characters that are distinguishable from other data within a URI. 
558 URIs that differ in the replacement of a reserved character with its 559 corresponding percent-encoded octet are not equivalent. 560 Percent-encoding a reserved character, or decoding a percent-encoded 561 octet that corresponds to a reserved character, will change how the 562 URI is interpreted by most applications. Thus, characters in the 563 reserved set are protected from normalization and are therefore safe 564 to be used by scheme-specific and producer-specific algorithms for 565 delimiting data subcomponents within a URI. 567 A subset of the reserved characters (gen-delims) are used as 568 delimiters of the generic URI components described in Section 3. A 569 component's ABNF syntax rule will not use the reserved or gen-delims 570 rule names directly; instead, each syntax rule lists the characters 571 allowed within that component (i.e., not delimiting it) and any of 572 those characters that are also in the reserved set are "reserved" for 573 use as subcomponent delimiters within the component. Only the most 574 common subcomponents are defined by this specification; other 575 subcomponents may be defined by a URI scheme's specification, or by 576 the implementation-specific syntax of a URI's dereferencing 577 algorithm, provided that such subcomponents are delimited by 578 characters in the reserved set allowed within that component. 580 URI producing applications should percent-encode data octets that 581 correspond to characters in the reserved set. However, if a reserved 582 character is found in a URI component and no delimiting role is known 583 for that character, then it should be interpreted as representing the 584 data octet corresponding to that character's encoding in US-ASCII. 586 2.3 Unreserved Characters 588 Characters that are allowed in a URI but do not have a reserved 589 purpose are called unreserved. These include uppercase and lowercase 590 letters, decimal digits, hyphen, period, underscore, and tilde. 
592 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" 594 URIs that differ in the replacement of an unreserved character with 595 its corresponding percent-encoded US-ASCII octet are equivalent: they 596 identify the same resource. However, URI comparison implementations 597 do not always perform normalization prior to comparison (see Section 6). 598 For consistency, percent-encoded octets in the ranges of ALPHA 599 (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), 600 underscore (%5F), or tilde (%7E) should not be created by URI 601 producers and, when found in a URI, should be decoded to their 602 corresponding unreserved character by URI normalizers. 604 2.4 When to Encode or Decode 606 Under normal circumstances, the only time that octets within a URI 607 are percent-encoded is during the process of producing the URI from 608 its component parts. It is during that process that an 609 implementation determines which of the reserved characters are to be 610 used as subcomponent delimiters and which can be safely used as data. 611 Once produced, a URI is always in its percent-encoded form. 613 When a URI is dereferenced, the components and subcomponents 614 significant to the scheme-specific dereferencing process (if any) 615 must be parsed and separated before the percent-encoded octets within 616 those components can be safely decoded, since otherwise the data may 617 be mistaken for component delimiters. The only exception is for 618 percent-encoded octets corresponding to characters in the unreserved 619 set, which can be decoded at any time. For example, the octet 620 corresponding to the tilde ("~") character is often encoded as "%7E" 621 by older URI processing software; the "%7E" can be replaced by "~" 622 without changing its interpretation. 624 Because the percent ("%") character serves as the indicator for 625 percent-encoded octets, it must be percent-encoded as "%25" in order 626 for that octet to be used as data within a URI.
Implementations must 627 not percent-encode or decode the same string more than once, since 628 decoding an already decoded string might lead to misinterpreting a 629 percent data octet as the beginning of a percent-encoding, or vice 630 versa in the case of percent-encoding an already percent-encoded 631 string. 633 2.5 Identifying Data 635 URI characters provide identifying data for each of the URI 636 components, serving as an external interface for identification 637 between systems. Although the presence and nature of the URI 638 production interface is hidden from clients that use its URIs, and 639 thus beyond the scope of the interoperability requirements defined by 640 this specification, it is a frequent source of confusion and errors 641 in the interpretation of URI character issues. Implementers need to 642 be aware that there are multiple character encodings involved in the 643 production and transmission of URIs: local name and data encoding, 644 public interface encoding, URI character encoding, data format 645 encoding, and protocol encoding. 647 The first encoding of identifying data is the one in which the local 648 names or data are stored. URI producing applications (a.k.a., origin 649 servers) will typically use the local encoding as the basis for 650 producing meaningful names. The URI producer will transform the 651 local encoding to one that is suitable for a public interface, and 652 then transform the public interface encoding into the restricted set 653 of URI characters (reserved, unreserved, and percent-encodings). 654 Those characters are, in turn, encoded as octets to be used as a 655 reference within a data format (e.g., a document charset), and such 656 data formats are often subsequently encoded for transmission over 657 Internet protocols. 659 For most systems, an unreserved character appearing within a URI 660 component is interpreted as representing the data octet corresponding 661 to that character's encoding in US-ASCII. 
Consumers of URIs assume 662 that the letter "X" corresponds to the octet "01011000", and there is 663 no harm in making that assumption even when it is incorrect. A 664 system that internally provides identifiers in the form of a 665 different character encoding, such as EBCDIC, will generally perform 666 character translation of textual identifiers to UTF-8 [STD63] (or 667 some other superset of the US-ASCII character encoding) at an 668 internal interface, thereby providing more meaningful identifiers 669 than simply percent-encoding the original octets. 671 For example, consider an information service that provides data, 672 stored locally using an EBCDIC-based filesystem, to clients on the 673 Internet through an HTTP server. When an author creates a file on 674 that filesystem with the name "Laguna Beach", their expectation is 675 that the "http" URI corresponding to that resource would also contain 676 the meaningful string "Laguna%20Beach". If, however, that server 677 produces URIs using an overly-simplistic raw octet mapping, then the 678 result would be a URI containing 679 "%D3%81%87%A4%95%81@%C2%85%81%83%88". An internal transcoding 680 interface fixes that problem by transcoding the local name to a 681 superset of US-ASCII prior to producing the URI. Naturally, proper 682 interpretation of an incoming URI on such an interface requires that 683 percent-encoded octets be decoded (e.g., "%20" to SP) before the 684 reverse transcoding is applied to obtain the local name. 686 In some cases, the internal interface between a URI component and the 687 identifying data that it has been crafted to represent is much less 688 direct than a character encoding translation. For example, portions 689 of a URI might reflect a query on non-ASCII data, numeric coordinates 690 on a map, etc. Likewise, a URI scheme may define components with 691 additional encoding requirements that are applied prior to forming 692 the component and producing the URI. 
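The "Laguna Beach" example in this section can be reproduced with a small Python sketch. It assumes that Python's "cp037" codec is an acceptable stand-in for the unspecified EBCDIC code page of the local filesystem; the function names are hypothetical and exist only for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def raw_octet_segment(name, local_encoding="cp037"):
    """Percent-encode the local EBCDIC octets directly (the naive mapping)."""
    # "@" is left as-is because it is allowed as data in a path segment.
    return quote(name.encode(local_encoding), safe="@")


def transcoded_segment(name):
    """Transcode to UTF-8, a superset of US-ASCII, then percent-encode."""
    return quote(name.encode("utf-8"), safe="")


def local_octets_from_segment(segment, local_encoding="cp037"):
    """Reverse path: percent-decode first, then transcode to the local form."""
    text = unquote_to_bytes(segment).decode("utf-8")
    return text.encode(local_encoding)


print(raw_octet_segment("Laguna Beach"))
print(transcoded_segment("Laguna Beach"))
```

The EBCDIC space octet (hexadecimal 40) coincides with the US-ASCII "@", which is why the naive mapping produces a stray "@" in the middle of the name instead of "%20".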
694 When a new URI scheme defines a component that represents textual 695 data consisting of characters from the Unicode character set [UCS], 696 the data should be encoded first as octets according to the UTF-8 697 character encoding [STD63], and then only those octets that do not 698 correspond to characters in the unreserved set should be 699 percent-encoded. For example, the character A would be represented 700 as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be 701 represented as "%C3%80", and the character KATAKANA LETTER A would be 702 represented as "%E3%82%A2". 704 3. Syntax Components 706 The generic URI syntax consists of a hierarchical sequence of 707 components referred to as the scheme, authority, path, query, and 708 fragment. 710 URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 712 hier-part = "//" authority path-abempty 713 / path-absolute 714 / path-rootless 715 / path-empty 717 The scheme and path components are required, though path may be empty 718 (no characters). When authority is present, the path must either be 719 empty or begin with a slash ("/") character. When authority is not 720 present, the path cannot begin with two slash characters ("//"). 721 These restrictions result in five different ABNF rules for a path 722 (Section 3.3), only one of which will match any given URI reference. 724 The following are two example URIs and their component parts: 726 foo://example.com:8042/over/there?name=ferret#nose 727 \_/ \______________/\_________/ \_________/ \__/ 728 | | | | | 729 scheme authority path query fragment 730 | _____________________|__ 731 / \ / \ 732 urn:example:animal:ferret:nose 734 3.1 Scheme 736 Each URI begins with a scheme name that refers to a specification for 737 assigning identifiers within that scheme. As such, the URI syntax is 738 a federated and extensible naming system wherein each scheme's 739 specification may further restrict the syntax and semantics of 740 identifiers using that scheme. 
742 Scheme names consist of a sequence of characters beginning with a 743 letter and followed by any combination of letters, digits, plus 744 ("+"), period ("."), or hyphen ("-"). Although scheme is 745 case-insensitive, the canonical form is lowercase and documents that 746 specify schemes must do so using lowercase letters. An 747 implementation should accept uppercase letters as equivalent to 748 lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for 749 the sake of robustness, but should only produce lowercase scheme 750 names, for consistency. 752 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 754 Individual schemes are not specified by this document. The process 755 for registration of new URI schemes is defined separately by [BCP35]. 756 The scheme registry maintains the mapping between scheme names and 757 their specifications. Advice for designers of new URI schemes can be 758 found in [RFC2718]. 760 When presented with a URI that violates one or more scheme-specific 761 restrictions, the scheme-specific resolution process should flag the 762 reference as an error rather than ignore the unused parts; doing so 763 reduces the number of equivalent URIs and helps detect abuses of the 764 generic syntax that might indicate the URI has been constructed to 765 mislead the user (Section 7.6). 767 3.2 Authority 769 Many URI schemes include a hierarchical element for a naming 770 authority, such that governance of the name space defined by the 771 remainder of the URI is delegated to that authority (which may, in 772 turn, delegate it further). The generic syntax provides a common 773 means for distinguishing an authority based on a registered name or 774 server address, along with optional port and user information. 776 The authority component is preceded by a double slash ("//") and is 777 terminated by the next slash ("/"), question mark ("?"), or number 778 sign ("#") character, or by the end of the URI. 
780 authority = [ userinfo "@" ] host [ ":" port ] 782 URI producers and normalizers should omit the ":" delimiter that 783 separates host from port if the port component is empty. Some 784 schemes do not allow the userinfo and/or port subcomponents. 786 If a URI contains an authority component, then the path component 787 must either be empty or begin with a slash ("/") character. 788 Non-validating parsers (those that merely separate a URI reference 789 into its major components) will often ignore the subcomponent 790 structure of authority, treating it as an opaque string from the 791 double-slash to the first terminating delimiter, until such time as 792 the URI is dereferenced. 794 3.2.1 User Information 796 The userinfo subcomponent may consist of a user name and, optionally, 797 scheme-specific information about how to gain authorization to access 798 the resource. The user information, if present, is followed by a 799 commercial at-sign ("@") that delimits it from the host. 801 userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) 803 Use of the format "user:password" in the userinfo field is 804 deprecated. Applications should not render as clear text any data 805 after the first colon (":") character found within a userinfo 806 subcomponent unless the data after the colon is the empty string 807 (indicating no password). Applications may choose to ignore or 808 reject such data when received as part of a reference, and should 809 reject the storage of such data in unencrypted form. The passing of 810 authentication information in clear text has proven to be a security 811 risk in almost every case where it has been used. 813 Applications that render a URI for the sake of user feedback, such as 814 in graphical hypertext browsing, should render userinfo in a way that 815 is distinguished from the rest of a URI, when feasible. 
Such 816 rendering will assist the user in cases where the userinfo has been 817 misleadingly crafted to look like a trusted domain name 818 (Section 7.6). 820 3.2.2 Host 822 The host subcomponent of authority is identified by an IP literal 823 encapsulated within square brackets, an IPv4 address in 824 dotted-decimal form, or a registered name. The host subcomponent is 825 case-insensitive. The presence of a host subcomponent within a URI 826 does not imply that the scheme requires access to the given host on 827 the Internet. In many cases, the host syntax is used only for the 828 sake of reusing the existing registration process created and 829 deployed for DNS, thus obtaining a globally unique name without the 830 cost of deploying another registry. However, such use comes with its 831 own costs: domain name ownership may change over time for reasons not 832 anticipated by the URI producer. In other cases, the data within the 833 host component identifies a registered name that has nothing to do 834 with an Internet host. We use the name "host" for the ABNF rule 835 because that is its most common purpose, not its only purpose, and 836 thus should not be considered as semantically limiting the data 837 within it. 839 host = IP-literal / IPv4address / reg-name 841 The syntax rule for host is ambiguous because it does not completely 842 distinguish between an IPv4address and a reg-name. In order to 843 disambiguate the syntax, we apply the "first-match-wins" algorithm: 844 If host matches the rule for IPv4address, then it should be 845 considered an IPv4 address literal and not a reg-name. Although host 846 is case-insensitive, producers and normalizers should use lowercase 847 for registered names and hexadecimal addresses for the sake of 848 uniformity, while only using uppercase letters for percent-encodings. 
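The "first-match-wins" rule can be sketched in Python as follows. The regular expressions are hand translations of the dec-octet, IPv4address, and reg-name rules given later in this section and should be treated as assumptions. The contents of a bracketed IP literal are deliberately not validated, and the function name is hypothetical.

```python
import re

# Hand translation of dec-octet: 0-9, 10-99, 100-199, 200-249, 250-255.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4ADDRESS = re.compile("(?:" + DEC_OCTET + r"\.){3}" + DEC_OCTET)
# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def classify_host(host):
    """Return the name of the first host alternative that matches, or None."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"  # bracket contents are not checked in this sketch
    if IPV4ADDRESS.fullmatch(host):
        return "IPv4address"  # tried before reg-name: first match wins
    if REG_NAME.fullmatch(host):
        return "reg-name"
    return None


for candidate in ("192.0.2.16", "192.0.2.256", "example.com", "[2001:db8::7]"):
    print(candidate, "->", classify_host(candidate))
```

Note that "192.0.2.256" is still a syntactically valid host: it fails the IPv4address rule because 256 is not a dec-octet, and therefore falls through to reg-name.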
850 A host identified by an Internet Protocol literal address, version 6 851 [RFC3513] or later, is distinguished by enclosing the IP literal 852 within square brackets ("[" and "]"). This is the only place where 853 square bracket characters are allowed in the URI syntax. In 854 anticipation of future, as-yet-undefined IP literal address formats, 855 an optional version flag may be used to indicate such a format 856 explicitly rather than relying on heuristic determination. 858 IP-literal = "[" ( IPv6address / IPvFuture ) "]" 860 IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) 862 The version flag does not indicate the IP version; rather, it 863 indicates future versions of the literal format. As such, 864 implementations must not provide the version flag for existing IPv4 865 and IPv6 literal addresses. If a URI containing an IP-literal that 866 starts with "v" (case-insensitive), indicating that the version flag 867 is present, is dereferenced by an application that does not know the 868 meaning of that version flag, then the application should return an 869 appropriate error for "address mechanism not supported". 871 A host identified by an IPv6 literal address is represented inside 872 the square brackets without a preceding version flag. The ABNF 873 provided here is a translation of the text definition of an IPv6 874 literal address provided in [RFC3513]. A 128-bit IPv6 address is 875 divided into eight 16-bit pieces. Each piece is represented 876 numerically in case-insensitive hexadecimal, using one to four 877 hexadecimal digits (leading zeroes are permitted). The eight encoded 878 pieces are given most-significant first, separated by colon 879 characters. Optionally, the least-significant two pieces may instead 880 be represented in IPv4 address textual format. 
A sequence of one or 881 more consecutive zero-valued 16-bit pieces within the address may be 882 elided, omitting all their digits and leaving exactly two consecutive 883 colons in their place to mark the elision. 885 IPv6address = 6( h16 ":" ) ls32 886 / "::" 5( h16 ":" ) ls32 887 / [ h16 ] "::" 4( h16 ":" ) ls32 888 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 889 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 890 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 891 / [ *4( h16 ":" ) h16 ] "::" ls32 892 / [ *5( h16 ":" ) h16 ] "::" h16 893 / [ *6( h16 ":" ) h16 ] "::" 895 ls32 = ( h16 ":" h16 ) / IPv4address 896 ; least-significant 32 bits of address 898 h16 = 1*4HEXDIG 899 ; 16 bits of address represented in hexadecimal 901 A host identified by an IPv4 literal address is represented in 902 dotted-decimal notation (a sequence of four decimal numbers in the 903 range 0 to 255, separated by "."), as described in [RFC1123] by 904 reference to [RFC0952]. Note that other forms of dotted notation may 905 be interpreted on some platforms, as described in Section 7.4, but 906 only the dotted-decimal form of four octets is allowed by this 907 grammar. 909 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 911 dec-octet = DIGIT ; 0-9 912 / %x31-39 DIGIT ; 10-99 913 / "1" 2DIGIT ; 100-199 914 / "2" %x30-34 DIGIT ; 200-249 915 / "25" %x30-35 ; 250-255 917 A host identified by a registered name is a sequence of characters 918 that is usually intended for lookup within a locally-defined host or 919 service name registry, though the URI's scheme-specific semantics may 920 require that a specific registry (or fixed name table) be used 921 instead. The most common name registry mechanism is the Domain Name 922 System (DNS). A registered name intended for lookup in the DNS uses 923 the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of 924 [RFC1123]. 
Such a name consists of a sequence of domain labels 925 separated by ".", each domain label starting and ending with an 926 alphanumeric character and possibly also containing "-" characters. 927 The rightmost domain label of a fully qualified domain name in DNS 928 may be followed by a single "." and should be followed by one if it 929 is necessary to distinguish between the complete domain name and some 930 local domain. 932 reg-name = *( unreserved / pct-encoded / sub-delims ) 934 If the URI scheme defines a default for host, then that default 935 applies when the host subcomponent is undefined or when the 936 registered name is empty (zero length). For example, the "file" URI 937 scheme is defined such that no authority, an empty host, and 938 "localhost" all mean the end-user's machine, whereas the "http" 939 scheme considers a missing authority or empty host to be invalid. 941 This specification does not mandate a particular registered name 942 lookup technology and therefore does not restrict the syntax of 943 reg-name beyond that necessary for interoperability. Instead, it 944 delegates the issue of registered name syntax conformance to the 945 operating system of each application performing URI resolution, and 946 that operating system decides what it will allow for the purpose of 947 host identification. A URI resolution implementation might use DNS, 948 host tables, yellow pages, NetInfo, WINS, or any other system for 949 lookup of registered names. However, a globally-scoped naming 950 system, such as DNS fully-qualified domain names, is necessary for 951 URIs that are intended to have global scope. URI producers should 952 use names that conform to the DNS syntax, even when use of DNS is not 953 immediately apparent, and should limit such names to no more than 255 954 characters in length. 
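A URI producer that wishes to follow the advice above could apply a check along the following lines. This Python sketch covers only the constraints stated in this section (labels that start and end with an alphanumeric character, optional "-" inside a label, an optional single trailing ".", and at most 255 characters). The function name is hypothetical, and counting the trailing "." toward the limit is an assumption of the sketch.

```python
import re

# A label starts and ends with an alphanumeric and may contain "-" inside.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def looks_like_dns_name(name, max_length=255):
    """Check a registered name against the DNS-style syntax described above."""
    if not name or len(name) > max_length:
        return False
    # A single trailing "." marks a fully qualified domain name.
    if name.endswith("."):
        name = name[:-1]
    return all(LABEL.fullmatch(label) for label in name.split("."))


for candidate in ("www.example.com", "example.com.", "-bad.example", "a..b"):
    print(candidate, "->", looks_like_dns_name(candidate))
```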
956 The reg-name syntax allows percent-encoded octets in order to 957 represent non-ASCII registered names in a uniform way that is 958 independent of the underlying name resolution technology; such 959 non-ASCII characters must first be encoded according to UTF-8 [STD63] 960 and then each octet of the corresponding UTF-8 sequence must be 961 percent-encoded to be represented as URI characters. URI producing 962 applications must not use percent-encoding in host unless it is used 963 to represent a UTF-8 character sequence. When a non-ASCII registered 964 name represents an internationalized domain name intended for 965 resolution via the DNS, the name must be transformed to the IDNA 966 encoding [RFC3490] prior to name lookup. URI producers should 967 provide such registered names in the IDNA encoding, rather than a 968 percent-encoding, if they wish to maximize interoperability with 969 legacy URI resolvers. 971 3.2.3 Port 973 The port subcomponent of authority is designated by an optional port 974 number in decimal following the host and delimited from it by a 975 single colon (":") character. 977 port = *DIGIT 979 A scheme may define a default port. For example, the "http" scheme 980 defines a default port of "80", corresponding to its reserved TCP 981 port number. The type of port designated by the port number (e.g., 982 TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers 983 and normalizers should omit the port component and its ":" delimiter 984 if port is empty or its value would be the same as the scheme's 985 default. 987 3.3 Path 989 The path component contains data, usually organized in hierarchical 990 form, that, along with data in the non-hierarchical query component 991 (Section 3.4), serves to identify a resource within the scope of the 992 URI's scheme and naming authority (if any). The path is terminated 993 by the first question mark ("?") or number sign ("#") character, or 994 by the end of the URI. 
996 If a URI contains an authority component, then the path component 997 must either be empty or begin with a slash ("/") character. If a URI 998 does not contain an authority component, then the path cannot begin 999 with two slash characters ("//"). In addition, a URI reference 1000 (Section 4.1) may begin with a relative path, in which case the first 1001 path segment cannot contain a colon (":") character. The ABNF 1002 requires five separate rules to disambiguate these cases, only one of 1003 which will match the path substring within a given URI reference. We 1004 use the generic term "path component" to describe the URI substring 1005 matched by the parser to one of these rules. 1007 path = path-abempty ; begins with "/" or is empty 1008 / path-absolute ; begins with "/" but not "//" 1009 / path-noscheme ; begins with a non-colon segment 1010 / path-rootless ; begins with a segment 1011 / path-empty ; zero characters 1013 path-abempty = *( "/" segment ) 1014 path-absolute = "/" [ segment-nz *( "/" segment ) ] 1015 path-noscheme = segment-nz-nc *( "/" segment ) 1016 path-rootless = segment-nz *( "/" segment ) 1017 path-empty = 0<pchar> 1019 segment = *pchar 1020 segment-nz = 1*pchar 1021 segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) 1022 ; non-zero-length segment without any colon ":" 1024 pchar = unreserved / pct-encoded / sub-delims / ":" / "@" 1026 A path consists of a sequence of path segments separated by a slash 1027 ("/") character. A path is always defined for a URI, though the 1028 defined path may be empty (zero length). Use of the slash character 1029 to indicate hierarchy is only required when a URI will be used as the 1030 context for relative references. For example, the URI 1031 has a path of "fred@example.com", whereas 1032 the URI has an empty path. 1034 The path segments "." and "..", also known as dot-segments, are 1035 defined for relative reference within the path name hierarchy.
They 1036 are intended for use at the beginning of a relative path reference 1037 (Section 4.2) for indicating relative position within the 1038 hierarchical tree of names. This is similar to their role within 1039 some operating systems' file directory structure to indicate the 1040 current directory and parent directory, respectively. However, 1041 unlike a file system, these dot-segments are only interpreted within 1042 the URI path hierarchy and are removed as part of the resolution 1043 process (Section 5.2). 1045 Aside from dot-segments in hierarchical paths, a path segment is 1046 considered opaque by the generic syntax. URI-producing applications 1047 often use the reserved characters allowed in a segment for the 1048 purpose of delimiting scheme-specific or dereference-handler-specific 1049 subcomponents. For example, the semicolon (";") and equals ("=") 1050 reserved characters are often used for delimiting parameters and 1051 parameter values applicable to that segment. The comma (",") 1052 reserved character is often used for similar purposes. For example, 1053 one URI producer might use a segment like "name;v=1.1" to indicate a 1054 reference to version 1.1 of "name", whereas another might use a 1055 segment like "name,1.1" to indicate the same. Parameter types may be 1056 defined by scheme-specific semantics, but in most cases the syntax of 1057 a parameter is specific to the implementation of the URI's 1058 dereferencing algorithm. 1060 3.4 Query 1062 The query component contains non-hierarchical data that, along with 1063 data in the path component (Section 3.3), serves to identify a 1064 resource within the scope of the URI's scheme and naming authority 1065 (if any). The query component is indicated by the first question 1066 mark ("?") character and terminated by a number sign ("#") character 1067 or by the end of the URI. 1069 query = *( pchar / "/" / "?" 
) 1071 The characters slash ("/") and question mark ("?") may represent data 1072 within the query component. Beware that some older, erroneous 1073 implementations do not handle such URIs correctly when they are used 1074 as the base for relative references (Section 5.1), apparently because 1075 they fail to distinguish query data from path data when looking 1076 for hierarchical separators. However, since query components are 1077 often used to carry identifying information in the form of 1078 "key=value" pairs, and one frequently used value is a reference to 1079 another URI, it is sometimes better for usability to avoid 1080 percent-encoding those characters. 1082 3.5 Fragment 1084 The fragment identifier component of a URI allows indirect 1085 identification of a secondary resource by reference to a primary 1086 resource and additional identifying information. The identified 1087 secondary resource may be some portion or subset of the primary 1088 resource, some view on representations of the primary resource, or 1089 some other resource defined or described by those representations. A 1090 fragment identifier component is indicated by the presence of a 1091 number sign ("#") character and terminated by the end of the URI. 1093 fragment = *( pchar / "/" / "?" ) 1095 The semantics of a fragment identifier are defined by the set of 1096 representations that might result from a retrieval action on the 1097 primary resource. The fragment's format and resolution is therefore 1098 dependent on the media type [RFC2046] of a potentially retrieved 1099 representation, even though such a retrieval is only performed if the 1100 URI is dereferenced. If no such representation exists, then the 1101 semantics of the fragment are considered unknown and, effectively, 1102 unconstrained. Fragment identifier semantics are independent of the 1103 URI scheme and thus cannot be redefined by scheme specifications.
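The delimiting rules for the query (Section 3.4) and fragment components can be observed with the Python standard library's urllib.parse.urlsplit, used here as one example of a generic parser. The URI itself is invented for the illustration. Its query carries another URI as an unencoded value, and both the query and the fragment contain "/" and "?" as data.

```python
from urllib.parse import urlsplit

# Hypothetical URI whose query and fragment both contain "/" and "?".
EXAMPLE = "http://example.com/search?next=http://other.example/a?b#sec/1?x"

parts = urlsplit(EXAMPLE)

# The first "?" starts the query, and the first "#" starts the fragment.
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```

Only the first "?" and the first "#" act as delimiters; later occurrences are ordinary data within the query or fragment.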
1105 Individual media types may define their own restrictions on, or 1106 structure within, the fragment identifier syntax for specifying 1107 different types of subsets, views, or external references that are 1108 identifiable as secondary resources by that media type. If the 1109 primary resource has multiple representations, as is often the case 1110 for resources whose representation is selected based on attributes of 1111 the retrieval request (a.k.a., content negotiation), then whatever is 1112 identified by the fragment should be consistent across all of those 1113 representations: each representation should either define the 1114 fragment such that it corresponds to the same secondary resource, 1115 regardless of how it is represented, or the fragment should be left 1116 undefined by the representation (i.e., not found). 1118 As with any URI, use of a fragment identifier component does not 1119 imply that a retrieval action will take place. A URI with a fragment 1120 identifier may be used to refer to the secondary resource without any 1121 implication that the primary resource is accessible or will ever be 1122 accessed. 1124 Fragment identifiers have a special role in information retrieval 1125 systems as the primary form of client-side indirect referencing, 1126 allowing an author to specifically identify those aspects of an 1127 existing resource that are only indirectly provided by the resource 1128 owner. As such, the fragment identifier is not used in the 1129 scheme-specific processing of a URI; instead, the fragment identifier 1130 is separated from the rest of the URI prior to a dereference, and 1131 thus the identifying information within the fragment itself is 1132 dereferenced solely by the user agent and regardless of the URI 1133 scheme. 
Although this separate handling is often perceived to be a 1134 loss of information, particularly in regards to accurate redirection 1135 of references as resources move over time, it also serves to prevent 1136 information providers from denying reference authors the right to 1137 selectively refer to information within a resource. Indirect 1138 referencing also provides additional flexibility and extensibility to 1139 systems that use URIs, since new media types are easier to define and 1140 deploy than new schemes of identification. 1142 The characters slash ("/") and question mark ("?") are allowed to 1143 represent data within the fragment identifier. Beware that some 1144 older, erroneous implementations do not handle such URIs correctly 1145 when they are used as the base for relative references (Section 5.1). 1147 4. Usage 1149 When applications make reference to a URI, they do not always use the 1150 full form of reference defined by the "URI" syntax rule. In order to 1151 save space and take advantage of hierarchical locality, many Internet 1152 protocol elements and media type formats allow an abbreviation of a 1153 URI, while others restrict the syntax to a particular form of URI. 1154 We define the most common forms of reference syntax in this 1155 specification because they impact and depend upon the design of the 1156 generic syntax, requiring a uniform parsing algorithm in order to be 1157 interpreted consistently. 1159 4.1 URI Reference 1161 URI-reference is used to denote the most common usage of a resource 1162 identifier. 1164 URI-reference = URI / relative-URI 1166 A URI-reference may be relative: if the reference's prefix matches 1167 the syntax of a scheme followed by its colon separator, then the 1168 reference is a URI rather than a relative-URI. 
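As a minimal sketch of the rule above, the following Python fragment tests whether a reference begins with a scheme and its colon separator. The pattern assumes the scheme syntax of Section 3.1 (a letter followed by letters, digits, "+", "-", or "."); the name has_scheme is hypothetical and is used here only for illustration.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def has_scheme(reference: str) -> bool:
    """Return True if the reference starts with a scheme and a colon.

    A True result means the reference is a URI; a False result means it
    is a relative-URI (hypothetical helper, not part of the draft).
    """
    return _SCHEME_PREFIX.match(reference) is not None


for ref in ("http://a/b", "g:h", "../g", "./this:that", "//g", ""):
    print(repr(ref), has_scheme(ref))
```

Only the prefix is examined, so a colon that appears later in the reference does not make it a URI.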
1170 A URI-reference is typically parsed first into the five URI 1171 components, in order to determine what components are present and 1172 whether or not the reference is relative, after which each component 1173 is parsed for its subparts and their validation. The ABNF of 1174 URI-reference, along with the "first-match-wins" disambiguation rule, 1175 is sufficient to define a validating parser for the generic syntax. 1176 Readers familiar with regular expressions should see Appendix B for 1177 an example of a non-validating URI-reference parser that will take 1178 any given string and extract the URI components. 1180 4.2 Relative URI 1182 A relative URI reference takes advantage of the hierarchical syntax 1183 (Section 1.2.3) in order to express a reference that is relative to 1184 the name space of another hierarchical URI. 1186 relative-URI = relative-part [ "?" query ] [ "#" fragment ] 1188 relative-part = "//" authority path-abempty 1189 / path-absolute 1190 / path-noscheme 1191 / path-empty 1193 The URI referred to by a relative reference, also known as the target 1194 URI, is obtained by applying the reference resolution algorithm of 1195 Section 5. 1197 A relative reference that begins with two slash characters is termed 1198 a network-path reference; such references are rarely used. A 1199 relative reference that begins with a single slash character is 1200 termed an absolute-path reference. A relative reference that does 1201 not begin with a slash character is termed a relative-path reference. 1203 A path segment that contains a colon character (e.g., "this:that") 1204 cannot be used as the first segment of a relative-path reference 1205 because it would be mistaken for a scheme name. Such a segment must 1206 be preceded by a dot-segment (e.g., "./this:that") to make a 1207 relative-path reference. 1209 4.3 Absolute URI 1211 Some protocol elements allow only the absolute form of a URI without 1212 a fragment identifier. 
For example, defining a base URI for later 1213 use by relative references calls for an absolute-URI syntax rule that 1214 does not allow a fragment. 1216 absolute-URI = scheme ":" hier-part [ "?" query ] 1218 4.4 Same-document Reference 1220 When a URI reference refers to a URI that is, aside from its fragment 1221 component (if any), identical to the base URI (Section 5.1), that 1222 reference is called a "same-document" reference. The most frequent 1223 examples of same-document references are relative references that are 1224 empty or include only the number sign ("#") separator followed by a 1225 fragment identifier. 1227 When a same-document reference is dereferenced for the purpose of a 1228 retrieval action, the target of that reference is defined to be 1229 within the same entity (representation, document, or message) as the 1230 reference; therefore, a dereference should not result in a new 1231 retrieval action. 1233 Normalization of the base and target URIs prior to their comparison, 1234 as described in Section 6.2.2 and Section 6.2.3, is allowed but 1235 rarely performed in practice. Normalization may increase the set of 1236 same-document references, which may be of benefit to some caching 1237 applications. As such, reference authors should not assume that a 1238 slightly different, though equivalent, reference URI will (or will 1239 not) be interpreted as a same-document reference by any given 1240 application. 1242 4.5 Suffix Reference 1244 The URI syntax is designed for unambiguous reference to resources and 1245 extensibility via the URI scheme. However, as URI identification and 1246 usage have become commonplace, traditional media (television, radio, 1247 newspapers, billboards, etc.) have increasingly used a suffix of the 1248 URI as a reference, consisting of only the authority and path 1249 portions of the URI, such as 1251 www.w3.org/Addressing/ 1253 or simply a DNS registered name on its own. 
Such references are 1254 primarily intended for human interpretation, rather than for 1255 machines, with the assumption that context-based heuristics are 1256 sufficient to complete the URI (e.g., most registered names beginning 1257 with "www" are likely to have a URI prefix of "http://"). Although 1258 there is no standard set of heuristics for disambiguating a URI 1259 suffix, many client implementations allow them to be entered by the 1260 user and heuristically resolved. 1262 While this practice of using suffix references is common, it should 1263 be avoided whenever possible and never used in situations where 1264 long-term references are expected. The heuristics noted above will 1265 change over time, particularly when a new URI scheme becomes popular, 1266 and are often incorrect when used out of context. Furthermore, they 1267 can lead to security issues along the lines of those described in 1268 [RFC1535]. 1270 Since a URI suffix has the same syntax as a relative path reference, 1271 a suffix reference cannot be used in contexts where a relative 1272 reference is expected. As a result, suffix references are limited to 1273 those places where there is no defined base URI, such as dialog boxes 1274 and off-line advertisements. 1276 5. Reference Resolution 1278 This section defines the process of resolving a URI reference within 1279 a context that allows relative references, such that the result is a 1280 string matching the "URI" syntax rule of Section 3. 1282 5.1 Establishing a Base URI 1284 The term "relative" implies that there exists a "base URI" against 1285 which the relative reference is applied. Aside from fragment-only 1286 references (Section 4.4), relative references are only usable when a 1287 base URI is known. A base URI must be established by the parser 1288 prior to parsing URI references that might be relative. 
A base URI 1289 must conform to the "absolute-URI" syntax rule (Section 4.3): if the 1290 base URI is obtained from a URI reference, then that reference must 1291 be converted to absolute form and stripped of any fragment component 1292 prior to use as a base URI. 1294 The base URI of a reference can be established in one of four ways, 1295 discussed below in order of precedence. The order of precedence can 1296 be thought of in terms of layers, where the innermost defined base 1297 URI has the highest precedence. This can be visualized graphically 1298 as: 1300 .----------------------------------------------------------. 1301 | .----------------------------------------------------. | 1302 | | .----------------------------------------------. | | 1303 | | | .----------------------------------------. | | | 1304 | | | | .----------------------------------. | | | | 1305 | | | | | | | | | | 1306 | | | | `----------------------------------' | | | | 1307 | | | | (5.1.1) Base URI embedded in content | | | | 1308 | | | `----------------------------------------' | | | 1309 | | | (5.1.2) Base URI of the encapsulating entity | | | 1310 | | | (message, representation, or none) | | | 1311 | | `----------------------------------------------' | | 1312 | | (5.1.3) URI used to retrieve the entity | | 1313 | `----------------------------------------------------' | 1314 | (5.1.4) Default Base URI (application-dependent) | 1315 `----------------------------------------------------------' 1317 5.1.1 Base URI Embedded in Content 1319 Within certain media types, a base URI for relative references can be 1320 embedded within the content itself such that it can be readily 1321 obtained by a parser. This can be useful for descriptive documents, 1322 such as tables of content, which may be transmitted to others through 1323 protocols other than their usual retrieval context (e.g., E-Mail or 1324 USENET news).
1326 It is beyond the scope of this specification to specify how, for each 1327 media type, a base URI can be embedded. The appropriate syntax, when 1328 available, is described by the data format specification associated 1329 with each media type. 1331 5.1.2 Base URI from the Encapsulating Entity 1333 If no base URI is embedded, the base URI is defined by the 1334 representation's retrieval context. For a document that is enclosed 1335 within another entity, such as a message or archive, the retrieval 1336 context is that entity; thus, the default base URI of a 1337 representation is the base URI of the entity in which the 1338 representation is encapsulated. 1340 A mechanism for embedding a base URI within MIME container types 1341 (e.g., the message and multipart types) is defined by MHTML 1342 [RFC2557]. Protocols that do not use the MIME message header syntax, 1343 but do allow some form of tagged metadata to be included within 1344 messages, may define their own syntax for defining a base URI as part 1345 of a message. 1347 5.1.3 Base URI from the Retrieval URI 1349 If no base URI is embedded and the representation is not encapsulated 1350 within some other entity, then, if a URI was used to retrieve the 1351 representation, that URI shall be considered the base URI. Note that 1352 if the retrieval was the result of a redirected request, the last URI 1353 used (i.e., the URI that resulted in the actual retrieval of the 1354 representation) is the base URI. 1356 5.1.4 Default Base URI 1358 If none of the conditions described above apply, then the base URI is 1359 defined by the context of the application. Since this definition is 1360 necessarily application-dependent, failing to define a base URI using 1361 one of the other methods may result in the same content being 1362 interpreted differently by different types of application. 
1364 A sender of a representation containing relative references is 1365 responsible for ensuring that a base URI for those references can be 1366 established. Aside from fragment-only references, relative 1367 references can only be used reliably in situations where the base URI 1368 is well-defined. 1370 5.2 Relative Resolution 1372 This section describes an algorithm for converting a URI reference 1373 that might be relative to a given base URI into the parsed components 1374 of the reference's target. The components can then be recomposed, as 1375 described in Section 5.3, to form the target URI. This algorithm 1376 provides definitive results that can be used to test the output of 1377 other implementations. Applications may implement relative reference 1378 resolution using some other algorithm, provided that the results 1379 match what would be given by this algorithm. 1381 5.2.1 Pre-parse the Base URI 1383 The base URI (Base) is established according to the procedure of 1384 Section 5.1 and parsed into the five main components described in 1385 Section 3. Note that only the scheme component is required to be 1386 present in a base URI; the other components may be empty or 1387 undefined. A component is undefined if its associated delimiter does 1388 not appear in the URI reference; the path component is never 1389 undefined, though it may be empty. 1391 Normalization of the base URI, as described in Section 6.2.2 and 1392 Section 6.2.3, is optional. A URI reference must be transformed to 1393 its target URI before it can be normalized. 1395 5.2.2 Transform References 1397 For each URI reference (R), the following pseudocode describes an 1398 algorithm for transforming R into its target URI (T): 1400 -- The URI reference is parsed into the five URI components 1401 -- 1402 (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); 1404 -- A non-strict parser may ignore a scheme in the reference 1405 -- if it is identical to the base URI's scheme. 
1406 -- 1407 if ((not strict) and (R.scheme == Base.scheme)) then 1408 undefine(R.scheme); 1409 endif; 1411 if defined(R.scheme) then 1412 T.scheme = R.scheme; 1413 T.authority = R.authority; 1414 T.path = remove_dot_segments(R.path); 1415 T.query = R.query; 1416 else 1417 if defined(R.authority) then 1418 T.authority = R.authority; 1419 T.path = remove_dot_segments(R.path); 1420 T.query = R.query; 1421 else 1422 if (R.path == "") then 1423 T.path = Base.path; 1424 if defined(R.query) then 1425 T.query = R.query; 1426 else 1427 T.query = Base.query; 1428 endif; 1429 else 1430 if (R.path starts-with "/") then 1431 T.path = remove_dot_segments(R.path); 1432 else 1433 T.path = merge(Base.path, R.path); 1434 T.path = remove_dot_segments(T.path); 1435 endif; 1436 T.query = R.query; 1437 endif; 1438 T.authority = Base.authority; 1439 endif; 1440 T.scheme = Base.scheme; 1441 endif; 1443 T.fragment = R.fragment; 1445 5.2.3 Merge Paths 1447 The pseudocode above refers to a "merge" routine for merging a 1448 relative-path reference with the path of the base URI. This is 1449 accomplished as follows: 1451 o If the base URI has a defined authority component and an empty 1452 path, then return a string consisting of "/" concatenated with the 1453 reference's path; otherwise, 1455 o Return a string consisting of the reference's path component 1456 appended to all but the last segment of the base URI's path (i.e., 1457 excluding any characters after the right-most "/" in the base URI 1458 path, or excluding the entire base URI path if it does not contain 1459 any "/" characters). 1461 5.2.4 Remove Dot Segments 1463 The pseudocode also refers to a "remove_dot_segments" routine for 1464 interpreting and removing the special "." and ".." complete path 1465 segments from a referenced path. 
This is done after the path is 1466 extracted from a reference, whether or not the path was relative, in 1467 order to remove any invalid or extraneous dot-segments prior to 1468 forming the target URI. Although there are many ways to accomplish 1469 this removal process, we describe a simple method using two string 1470 buffers. 1472 1. The input buffer is initialized with the now-appended path 1473 components and the output buffer is initialized to the empty 1474 string. 1476 2. While the input buffer is not empty, loop: 1478 A. If the input buffer begins with a prefix of "../" or "./", 1479 then remove that prefix from the input buffer; otherwise, 1481 B. If the input buffer begins with a prefix of "/./" or "/.", 1482 where "." is a complete path segment, then replace that 1483 prefix with "/" in the input buffer; otherwise, 1485 C. If the input buffer begins with a prefix of "/../" or "/..", 1486 where ".." is a complete path segment, then replace that 1487 prefix with "/" in the input buffer and remove the last 1488 segment and its preceding "/" (if any) from the output 1489 buffer; otherwise, 1491 D. If the input buffer consists only of "." or "..", then remove 1492 that from the input buffer; otherwise, 1494 E. Move the first path segment in the input buffer to the end of 1495 the output buffer, including the initial "/" character (if 1496 any) and any subsequent characters up to, but not including, 1497 the next "/" character or the end of the input buffer. 1499 3. Finally, the output buffer is returned as the result of 1500 remove_dot_segments. 1502 Note that dot-segments are intended for use in URI references to 1503 express an identifier relative to the hierarchy of names in the base 1504 URI. The remove_dot_segments algorithm respects that hierarchy by 1505 removing extra dot-segments rather than treating them as an error or 1506 leaving them to be misinterpreted by dereference implementations. 
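The routines of Sections 5.2.2 through 5.2.4 can be sketched together in Python as shown below. This is an illustration under stated assumptions rather than a reference implementation: it covers only the strict-parser branch of the transform, it operates on components that have already been parsed, and the Parts type, with None standing for an undefined component, is a hypothetical container introduced for the example.

```python
from typing import NamedTuple, Optional


class Parts(NamedTuple):
    """Parsed components; None means "undefined" (hypothetical type)."""
    scheme: Optional[str] = None
    authority: Optional[str] = None
    path: str = ""
    query: Optional[str] = None
    fragment: Optional[str] = None


def remove_dot_segments(path: str) -> str:
    """Section 5.2.4, using an input buffer and an output buffer."""
    inp, out = path, ""                      # step 1
    while inp:                               # step 2
        if inp.startswith("../"):            # 2A
            inp = inp[3:]
        elif inp.startswith("./"):           # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):          # 2B
            inp = "/" + inp[3:]
        elif inp == "/.":                    # 2B, "." is the final segment
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":   # 2C
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/", if any.
            out = out[:out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):             # 2D
            inp = ""
        else:                                # 2E
            cut = inp.find("/", 1)
            if cut == -1:
                cut = len(inp)
            out, inp = out + inp[:cut], inp[cut:]
    return out                               # step 3


def merge(base: Parts, ref_path: str) -> str:
    """Section 5.2.3: merge a relative-path reference with the base path."""
    if base.authority is not None and base.path == "":
        return "/" + ref_path
    # Keep the base path up to and including its right-most "/".
    return base.path[:base.path.rfind("/") + 1] + ref_path


def transform(base: Parts, r: Parts) -> Parts:
    """Section 5.2.2 for a strict parser: compute the target components."""
    if r.scheme is not None:
        return Parts(r.scheme, r.authority, remove_dot_segments(r.path),
                     r.query, r.fragment)
    if r.authority is not None:
        return Parts(base.scheme, r.authority, remove_dot_segments(r.path),
                     r.query, r.fragment)
    if r.path == "":
        query = r.query if r.query is not None else base.query
        return Parts(base.scheme, base.authority, base.path, query, r.fragment)
    if r.path.startswith("/"):
        path = remove_dot_segments(r.path)
    else:
        path = remove_dot_segments(merge(base, r.path))
    return Parts(base.scheme, base.authority, path, r.query, r.fragment)


base = Parts("http", "a", "/b/c/d;p", "q")
print(transform(base, Parts(path="../g")).path)   # /b/g
```

The early returns mirror the nested if/else structure of the pseudocode; the fragment is always taken from the reference.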
1508 The following illustrates how the above steps are applied for two 1509 example merged paths, showing the state of the two buffers after each 1510 step. 1512 STEP OUTPUT BUFFER INPUT BUFFER 1514 1 : /a/b/c/./../../g 1515 2E: /a /b/c/./../../g 1516 2E: /a/b /c/./../../g 1517 2E: /a/b/c /./../../g 1518 2B: /a/b/c /../../g 1519 2C: /a/b /../g 1520 2C: /a /g 1521 2E: /a/g 1523 STEP OUTPUT BUFFER INPUT BUFFER 1525 1 : mid/content=5/../6 1526 2E: mid /content=5/../6 1527 2E: mid/content=5 /../6 1528 2C: mid /6 1529 2E: mid/6 1531 Some applications may find it more efficient to implement the 1532 remove_dot_segments algorithm using two segment stacks rather than 1533 strings. 1535 Note: Beware that some older, erroneous implementations will fail 1536 to separate a reference's query component from its path component 1537 prior to merging the base and reference paths, resulting in an 1538 interoperability failure if the query component contains the 1539 strings "/../" or "/./". 1541 5.3 Component Recomposition 1543 Parsed URI components can be recomposed to obtain the corresponding 1544 URI reference string. Using pseudocode, this would be: 1546 result = "" 1548 if defined(scheme) then 1549 append scheme to result; 1550 append ":" to result; 1551 endif; 1553 if defined(authority) then 1554 append "//" to result; 1555 append authority to result; 1556 endif; 1558 append path to result; 1560 if defined(query) then 1561 append "?" to result; 1562 append query to result; 1563 endif; 1565 if defined(fragment) then 1566 append "#" to result; 1567 append fragment to result; 1568 endif; 1570 return result; 1572 Note that we are careful to preserve the distinction between a 1573 component that is undefined, meaning that its separator was not 1574 present in the reference, and a component that is empty, meaning that 1575 the separator was present and was immediately followed by the next 1576 component separator or the end of the reference. 
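The distinction between undefined and empty components can be made concrete with a short Python sketch. It assumes a non-validating pattern in the style of the regular expression referred to in Appendix B, and it uses None for an undefined component and the empty string for an empty one; the function names parse and recompose are hypothetical.

```python
import re

# Non-validating pattern in the style of Appendix B. Groups 2, 4, 5, 7,
# and 9 capture the scheme, authority, path, query, and fragment.
_URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def parse(reference: str):
    """Split a reference into its five components.

    A component whose delimiter is absent is returned as None
    (undefined); the path is never None, though it may be "".
    """
    m = _URI_RE.match(reference)
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


def recompose(scheme, authority, path, query, fragment) -> str:
    """Rebuild the reference string from parsed components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


# An empty query ("?" present) survives the round trip; an undefined
# query does not gain a "?".
print(parse("http://a/b?"))
print(recompose(*parse("http://a/b?")))
print(recompose(*parse("http://a/b")))
```

Testing against None rather than for truthiness is what preserves an empty query, fragment, or authority on recomposition.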
1578 5.4 Reference Resolution Examples 1580 Within a representation with a well-defined base URI of 1582 http://a/b/c/d;p?q 1584 a relative URI reference is transformed to its target URI as follows. 1586 5.4.1 Normal Examples 1588 "g:h" = "g:h" 1589 "g" = "http://a/b/c/g" 1590 "./g" = "http://a/b/c/g" 1591 "g/" = "http://a/b/c/g/" 1592 "/g" = "http://a/g" 1593 "//g" = "http://g" 1594 "?y" = "http://a/b/c/d;p?y" 1595 "g?y" = "http://a/b/c/g?y" 1596 "#s" = "http://a/b/c/d;p?q#s" 1597 "g#s" = "http://a/b/c/g#s" 1598 "g?y#s" = "http://a/b/c/g?y#s" 1599 ";x" = "http://a/b/c/;x" 1600 "g;x" = "http://a/b/c/g;x" 1601 "g;x?y#s" = "http://a/b/c/g;x?y#s" 1602 "" = "http://a/b/c/d;p?q" 1603 "." = "http://a/b/c/" 1604 "./" = "http://a/b/c/" 1605 ".." = "http://a/b/" 1606 "../" = "http://a/b/" 1607 "../g" = "http://a/b/g" 1608 "../.." = "http://a/" 1609 "../../" = "http://a/" 1610 "../../g" = "http://a/g" 1612 5.4.2 Abnormal Examples 1614 Although the following abnormal examples are unlikely to occur in 1615 normal practice, all URI parsers should be capable of resolving them 1616 consistently. Each example uses the same base as above. 1618 Parsers must be careful in handling cases where there are more 1619 relative path ".." segments than there are hierarchical levels in the 1620 base URI's path. Note that the ".." syntax cannot be used to change 1621 the authority component of a URI. 1623 "../../../g" = "http://a/g" 1624 "../../../../g" = "http://a/g" 1626 Similarly, parsers must remove the dot-segments "." and ".." when 1627 they are complete components of a path, but not when they are only 1628 part of a segment. 1630 "/./g" = "http://a/g" 1631 "/../g" = "http://a/g" 1632 "g." = "http://a/b/c/g." 1633 ".g" = "http://a/b/c/.g" 1634 "g.." = "http://a/b/c/g.." 1635 "..g" = "http://a/b/c/..g" 1637 Less likely are cases where the relative URI reference uses 1638 unnecessary or nonsensical forms of the "." and ".." complete path 1639 segments. 
1641 "./../g" = "http://a/b/g" 1642 "./g/." = "http://a/b/c/g/" 1643 "g/./h" = "http://a/b/c/g/h" 1644 "g/../h" = "http://a/b/c/h" 1645 "g;x=1/./y" = "http://a/b/c/g;x=1/y" 1646 "g;x=1/../y" = "http://a/b/c/y" 1648 Some applications fail to separate the reference's query and/or 1649 fragment components from a relative path before merging it with the 1650 base path and removing dot-segments. This error is rarely noticed, 1651 since typical usage of a fragment never includes the hierarchy ("/") 1652 character, and the query component is not normally used within 1653 relative references. 1655 "g?y/./x" = "http://a/b/c/g?y/./x" 1656 "g?y/../x" = "http://a/b/c/g?y/../x" 1657 "g#s/./x" = "http://a/b/c/g#s/./x" 1658 "g#s/../x" = "http://a/b/c/g#s/../x" 1660 Some parsers allow the scheme name to be present in a relative URI 1661 reference if it is the same as the base URI scheme. This is 1662 considered to be a loophole in prior specifications of partial URI 1663 [RFC1630]. Its use should be avoided, but is allowed for backward 1664 compatibility. 1666 "http:g" = "http:g" ; for strict parsers 1667 / "http://a/b/c/g" ; for backward compatibility 1669 6. Normalization and Comparison 1671 One of the most common operations on URIs is simple comparison: 1672 determining if two URIs are equivalent without using the URIs to 1673 access their respective resource(s). A comparison is performed every 1674 time a response cache is accessed, a browser checks its history to 1675 color a link, or an XML parser processes tags within a namespace. 1676 Extensive normalization prior to comparison of URIs is often used by 1677 spiders and indexing engines to prune a search space or reduce 1678 duplication of request actions and response storage. 
1680 URI comparison is performed in respect to some particular purpose, 1681 and software with differing purposes will often be subject to 1682 differing design trade-offs in regards to how much effort should be 1683 spent in reducing duplicate identifiers. This section describes a 1684 variety of methods that may be used to compare URIs, the trade-offs 1685 between them, and the types of applications that might use them. A 1686 canonical form for URI references is defined to reduce the occurrence 1687 of false negative comparisons. 1689 6.1 Equivalence 1691 Since URIs exist to identify resources, presumably they should be 1692 considered equivalent when they identify the same resource. However, 1693 such a definition of equivalence is not of much practical use, since 1694 there is no way for software to compare two resources without 1695 knowledge of the implementation-specific syntax of each URI's 1696 dereferencing algorithm. For this reason, determination of 1697 equivalence or difference of URIs is based on string comparison, 1698 perhaps augmented by reference to additional rules provided by URI 1699 scheme definitions. We use the terms "different" and "equivalent" to 1700 describe the possible outcomes of such comparisons, but there are 1701 many application-dependent versions of equivalence. 1703 Even though it is possible to determine that two URIs are equivalent, 1704 it is never possible to be sure that two URIs identify different 1705 resources. For example, an owner of two different domain names could 1706 decide to serve the same resource from both, resulting in two 1707 different URIs. Therefore, comparison methods are designed to 1708 minimize false negatives while strictly avoiding false positives. 1710 In testing for equivalence, applications should not directly compare 1711 relative URI references; the references should be converted to their 1712 target URI forms before comparison. 
When URIs are being compared for 1713 the purpose of selecting (or avoiding) a network action, such as 1714 retrieval of a representation, the fragment components (if any) 1715 should be excluded from the comparison. 1717 6.2 Comparison Ladder 1719 A variety of methods are used in practice to test URI equivalence. 1720 These methods fall into a range, distinguished by the amount of 1721 processing required and the degree to which the probability of false 1722 negatives is reduced. As noted above, false negatives cannot in 1723 principle be eliminated. In practice, their probability can be 1724 reduced, but this reduction requires more processing and is not 1725 cost-effective for all applications. 1727 If this range of comparison practices is considered as a ladder, the 1728 following discussion will climb the ladder, starting with those 1729 practices that are cheap but have a relatively higher chance of 1730 producing false negatives, and proceeding to those that have higher 1731 computational cost and lower risk of false negatives. 1733 6.2.1 Simple String Comparison 1735 If two URIs, considered as character strings, are identical, then it 1736 is safe to conclude that they are equivalent. This type of 1737 equivalence test has very low computational cost and is in wide use 1738 in a variety of applications, particularly in the domain of parsing. 1740 Testing strings for equivalence requires some basic precautions. 1741 This procedure is often referred to as "bit-for-bit" or 1742 "byte-for-byte" comparison, which is potentially misleading. Testing 1743 of strings for equality is normally based on pairwise comparison of 1744 the characters that make up the strings, starting from the first and 1745 proceeding until both strings are exhausted and all characters found 1746 to be equal, a pair of characters compares unequal, or one of the 1747 strings is exhausted before the other. 
1749 Such character comparisons require that each pair of characters be 1750 put in comparable form. For example, should one URI be stored in a 1751 byte array in EBCDIC encoding, and the second be in a Java String 1752 object (UTF-16), bit-for-bit comparisons applied naively will produce 1753 errors. It is better to speak of equality on a 1754 character-for-character rather than byte-for-byte or bit-for-bit 1755 basis. In practical terms, character-by-character comparisons should 1756 be done codepoint-by-codepoint after conversion to a common character 1757 encoding. 1759 6.2.2 Syntax-based Normalization 1761 Software may use logic based on the definitions provided by this 1762 specification to reduce the probability of false negatives. Such 1763 processing is moderately higher in cost than character-for-character 1764 string comparison. For example, an application using this approach 1765 could reasonably consider the following two URIs equivalent: 1767 example://a/b/c/%7Bfoo%7D 1768 eXAMPLE://a/./b/../b/%63/%7bfoo%7d 1770 Web user agents, such as browsers, typically apply this type of URI 1771 normalization when determining whether a cached response is 1772 available. Syntax-based normalization includes such techniques as 1773 case normalization, percent-encoding normalization, and removal of 1774 dot-segments. 1776 6.2.2.1 Case Normalization 1778 When a URI scheme uses components of the generic syntax, it will also 1779 use the common syntax equivalence rules, namely that the scheme and 1780 host are case-insensitive and therefore should be normalized to 1781 lowercase. For example, the URI "HTTP://www.EXAMPLE.com/" is 1782 equivalent to "http://www.example.com/". Applications should not 1783 assume anything about the case sensitivity of other URI components, 1784 since that is dependent on the implementation used to handle a 1785 dereference.
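A minimal Python sketch of this rule follows. It lowercases only the scheme and the host, and leaves any userinfo untouched because its case sensitivity cannot be assumed. The function name normalize_case is hypothetical, and the sketch assumes that the host contains no percent-encoded octets, which would need separate handling.

```python
from typing import Tuple


def normalize_case(scheme: str, authority: str) -> Tuple[str, str]:
    """Lowercase the scheme and the host subcomponent of the authority.

    authority = [ userinfo "@" ] host [ ":" port ]
    The userinfo is preserved as given; the port consists of digits and
    is unaffected by lowercasing. Assumes no percent-encoded octets in
    the host (hypothetical helper for illustration).
    """
    userinfo, at, hostport = authority.rpartition("@")
    return scheme.lower(), userinfo + at + hostport.lower()


print(normalize_case("HTTP", "www.EXAMPLE.com"))
print(normalize_case("eXAMPLE", "User@Host.Example:8080"))
```

The path, query, and fragment are deliberately not passed to the function, since nothing can be assumed about their case sensitivity.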
1787 The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" 1788 versus "%3A") are case-insensitive and therefore should be normalized 1789 to use uppercase letters for the digits A-F. 1791 6.2.2.2 Percent-Encoding Normalization 1793 The percent-encoding mechanism (Section 2.1) is a frequent source of 1794 variance among otherwise identical URIs. In addition to the 1795 case-insensitivity issue noted above, some URI producers 1796 percent-encode octets that do not require percent-encoding, resulting 1797 in URIs that are equivalent to their non-encoded counterparts. Such 1798 URIs should be normalized by decoding any percent-encoded octet that 1799 corresponds to an unreserved character, as described in Section 2.3. 1801 6.2.2.3 Path Segment Normalization 1803 The complete path segments "." and ".." have a special meaning within 1804 hierarchical URI schemes. As such, they should not appear in 1805 absolute paths; if they are found, they can be removed by applying 1806 the remove_dot_segments algorithm to the path, as described in 1807 Section 5.2.4. 1809 6.2.3 Scheme-based Normalization 1811 The syntax and semantics of URIs vary from scheme to scheme, as 1812 described by the defining specification for each scheme. Software 1813 may use scheme-specific rules, at further processing cost, to reduce 1814 the probability of false negatives. For example, since the "http" 1815 scheme makes use of an authority component, has a default port of 1816 "80", and defines an empty path to be equivalent to "/", the 1817 following four URIs are equivalent: 1819 http://example.com 1820 http://example.com/ 1821 http://example.com:/ 1822 http://example.com:80/ 1824 In general, a URI that uses the generic syntax for authority with an 1825 empty path should be normalized to a path of "/"; likewise, an 1826 explicit ":port", where the port is empty or the default for the 1827 scheme, is equivalent to one where the port and its ":" delimiter are 1828 elided.
In other words, the second of the above URI examples is the 1829 normal form for the "http" scheme. 1831 Another case where normalization varies by scheme is in the handling 1832 of an empty authority component or empty host subcomponent. For many 1833 scheme specifications, an empty authority or host is considered an 1834 error; for others, it is considered equivalent to "localhost" or the 1835 end-user's host. When a scheme defines a default for authority and a 1836 URI reference to that default is desired, the reference should have 1837 an empty authority for the sake of uniformity, brevity, and 1838 internationalization. If, however, either the userinfo or port 1839 subcomponent is non-empty, then the host should be given explicitly 1840 even if it matches the default. 1842 6.2.4 Protocol-based Normalization 1844 Web spiders, for which substantial effort to reduce the incidence of 1845 false negatives is often cost-effective, are observed to implement 1846 even more aggressive techniques in URI comparison. For example, if 1847 they observe that a URI such as 1849 http://example.com/data 1851 redirects to a URI differing only in the trailing slash 1853 http://example.com/data/ 1855 they will likely regard the two as equivalent in the future. This 1856 kind of technique is only appropriate when equivalence is clearly 1857 indicated by both the result of accessing the resources and the 1858 common conventions of their scheme's dereference algorithm (in this 1859 case, use of redirection by HTTP origin servers to avoid problems 1860 with relative references). 1862 6.3 Canonical Form 1864 It is in the best interests of everyone concerned to avoid 1865 false-negatives in comparing URIs and to minimize the amount of 1866 software processing for such comparisons. 
Those who produce and make 1867 reference to URIs can reduce the cost of processing and the risk of 1868 false negatives by consistently providing them in a form that is 1869 reasonably canonical with respect to their scheme. Specifically: 1871 o Always provide the URI scheme in lowercase characters. 1873 o Always provide the host, if any, in lowercase characters. 1875 o Only perform percent-encoding where it is essential. 1877 o Always use uppercase A-through-F characters when percent-encoding. 1879 o Prevent dot-segments appearing in non-relative URI paths. 1881 o For schemes that define a default authority, use an empty 1882 authority if the default is desired. 1884 o For schemes that define an empty path to be equivalent to a path 1885 of "/", use "/". 1887 7. Security Considerations 1889 A URI does not in itself pose a security threat. However, since URIs 1890 are often used to provide a compact set of instructions for access to 1891 network resources, care must be taken to properly interpret the data 1892 within a URI, to prevent that data from causing unintended access, 1893 and to avoid including data that should not be revealed in plain 1894 text. 1896 7.1 Reliability and Consistency 1898 There is no guarantee that, having once used a given URI to retrieve 1899 some information, the same information will be retrievable by that 1900 URI in the future. Nor is there any guarantee that the information 1901 retrievable via that URI in the future will be observably similar to 1902 that retrieved in the past. The URI syntax does not constrain how a 1903 given scheme or authority apportions its name space or maintains it 1904 over time. Such a guarantee can only be obtained from the person(s) 1905 controlling that name space and the resource in question. A specific 1906 URI scheme may define additional semantics, such as name persistence, 1907 if those semantics are required of all naming authorities for that 1908 scheme. 
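As an illustration of the canonical-form guidelines listed in Section 6.3, the following Python sketch lowercases the scheme and host, decodes percent-encoded unreserved characters, uppercases the remaining percent-encoding triplets, drops an empty or default port, and replaces an empty path with "/". It is a minimal sketch and not a definitive implementation: the names canonicalize, normalize_pct, and DEFAULT_PORTS are assumptions introduced here, the default-port table covers only "http", and removal of dot-segments (Section 5.2) is deliberately omitted. The component split reuses the regular expression given in Appendix B.

```python
import re

# Component-splitting expression taken from Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
# Illustrative table only; real default ports come from each scheme's specification.
DEFAULT_PORTS = {"http": "80"}


def normalize_pct(text):
    """Decode percent-encoded unreserved characters; uppercase other triplets."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def canonicalize(uri):
    """Return uri in a reasonably canonical form (dot-segments are not handled)."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    if scheme is not None:
        scheme = scheme.lower()
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, colon, port = hostport.rpartition(":")
        if not colon or "]" in port:      # no port, or the colon was inside an IP literal
            host, port = hostport, ""
        # Decode first, lowercase, then restore uppercase hex in remaining triplets.
        host = normalize_pct(normalize_pct(host).lower())
        if port == DEFAULT_PORTS.get(scheme):
            port = ""                     # default port is elided with its ":"
        authority = normalize_pct(userinfo) + at + host + (":" + port if port else "")
        if path == "":
            path = "/"                    # empty path with authority becomes "/"
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += normalize_pct(path)
    if query is not None:
        result += "?" + normalize_pct(query)
    if fragment is not None:
        result += "#" + normalize_pct(fragment)
    return result
```

Under these assumptions, the four equivalent "http" URIs shown in Section 6.2.3 all map to "http://example.com/".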
1910 7.2 Malicious Construction 1912 It is sometimes possible to construct a URI such that an attempt to 1913 perform a seemingly harmless, idempotent operation, such as the 1914 retrieval of a representation, will in fact cause a possibly damaging 1915 remote operation to occur. The unsafe URI is typically constructed 1916 by specifying a port number other than that reserved for the network 1917 protocol in question. The client unwittingly contacts a site that is 1918 running a different protocol service and data within the URI contains 1919 instructions that, when interpreted according to this other protocol, 1920 cause an unexpected operation. A frequent example of such abuse has 1921 been the use of a protocol-based scheme with a port component of 1922 "25", thereby fooling user agent software into sending an unintended 1923 or impersonating message via an SMTP server. 1925 Applications should prevent dereference of a URI that specifies a TCP 1926 port number within the "well-known port" range (0 - 1023) unless the 1927 protocol being used to dereference that URI is compatible with the 1928 protocol expected on that well-known port. Although IANA maintains a 1929 registry of well-known ports, applications should make such 1930 restrictions user-configurable to avoid preventing the deployment of 1931 new services. 1933 When a URI contains percent-encoded octets that match the delimiters 1934 for a given resolution or dereference protocol (for example, CR and 1935 LF characters for the TELNET protocol), such percent-encoded octets 1936 must not be decoded before transmission across that protocol. 1937 Transfer of the percent-encoding, which might violate the protocol, 1938 is less harmful than allowing decoded octets to be interpreted as 1939 additional operations or parameters, perhaps triggering an unexpected 1940 and possibly harmful remote operation. 
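The port restriction recommended in Section 7.2 can be sketched as a small check performed before dereference. This is only an illustration under stated assumptions: the function name may_dereference and the COMPATIBLE_PORTS table are hypothetical, the table contents are examples rather than a registry, and an application would let the user configure that table, as the text above recommends.

```python
WELL_KNOWN_MAX = 1023  # upper bound of the "well-known port" range

# Hypothetical, user-configurable table: for each scheme, the well-known
# ports on which the expected protocol is considered compatible.
COMPATIBLE_PORTS = {"http": {80}, "ftp": {21}}


def may_dereference(scheme, port, compatible=COMPATIBLE_PORTS):
    """Return True if dereferencing scheme on the given TCP port is acceptable.

    port is an int, or None when the URI gives no explicit port.
    """
    if port is None:
        return True                      # scheme default will be used
    if not 0 <= port <= 65535:
        return False                     # not a valid TCP port at all
    if port > WELL_KNOWN_MAX:
        return True                      # outside the well-known range
    return port in compatible.get(scheme.lower(), set())
```

With the example table, an "http" URI naming port 25 is refused, while ports 80 and 8080 are allowed.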
1942 7.3 Back-end Transcoding 1944 When a URI is dereferenced, the data within it is often parsed by 1945 both the user agent and one or more servers. In HTTP, for example, a 1946 typical user agent will parse a URI into its five major components, 1947 access the authority's server, and send it the data within the 1948 authority, path, and query components. A typical server will take 1949 that information, parse the path into segments and the query into 1950 key/value pairs, and then invoke implementation-specific handlers to 1951 respond to the request. As a result, a common security concern for 1952 server implementations that handle a URI, either as a whole or split 1953 into separate components, is proper interpretation of the octet data 1954 represented by the characters and percent-encodings within that URI. 1956 Percent-encoded octets must be decoded at some point during the 1957 dereference process. Applications must split the URI into its 1958 components and subcomponents prior to decoding the octets, since 1959 otherwise the decoded octets might be mistaken for delimiters. 1960 Security checks of the data within a URI should be applied after 1961 decoding the octets. Note, however, that the "%00" percent-encoding 1962 (NUL) may require special handling and should be rejected if the 1963 application is not expecting to receive raw data within a component. 1965 Special care should be taken when the URI path interpretation process 1966 involves the use of a back-end filesystem or related system 1967 functions. Filesystems typically assign an operational meaning to 1968 special characters, such as the "/", "\", ":", "[", and "]" 1969 characters, and special device names like ".", "..", "...", "aux", 1970 "lpt", etc. 
In some cases, merely testing for the existence of such 1971 a name will cause the operating system to pause or invoke unrelated 1972 system calls, leading to significant security concerns regarding 1973 denial of service and unintended data transfer. It would be 1974 impossible for this specification to list all such significant 1975 characters and device names; implementers should research the 1976 reserved names and characters for the types of storage device that 1977 may be attached to their application and restrict the use of data 1978 obtained from URI components accordingly. 1980 7.4 Rare IP Address Formats 1982 Although the URI syntax for IPv4address only allows the common, 1983 dotted-decimal form of IPv4 address literal, many implementations 1984 that process URIs make use of platform-dependent system routines, 1985 such as gethostbyname() and inet_aton(), to translate the string 1986 literal to an actual IP address. Unfortunately, such system routines 1987 often allow and process a much larger set of formats than those 1988 described in Section 3.2.2. 1990 For example, many implementations allow dotted forms of three 1991 numbers, wherein the last part is interpreted as a 16-bit quantity 1992 and placed in the right-most two bytes of the network address (e.g., 1993 a Class B network). Likewise, a dotted form of two numbers means the 1994 last part is interpreted as a 24-bit quantity and placed in the right 1995 most three bytes of the network address (Class A), and a single 1996 number (without dots) is interpreted as a 32-bit quantity and stored 1997 directly in the network address. Adding further to the confusion, 1998 some implementations allow each dotted part to be interpreted as 1999 decimal, octal, or hexadecimal, as specified in the C language (i.e., 2000 a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 2001 implies octal; otherwise, the number is interpreted as decimal). 
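The forms described above can be made concrete with a short Python sketch that mimics the permissive parsing attributed to routines such as inet_aton() and returns the address as a 32-bit integer. It is an approximation for illustration only; the function names are assumptions introduced here, and actual platform routines differ in edge cases (which is precisely the problem this section describes).

```python
def parse_c_number(part):
    """Parse one dotted part using C-style base prefixes (0x hex, 0 octal)."""
    if part[:2].lower() == "0x":
        digits, base = part[2:], 16
    elif len(part) > 1 and part[0] == "0":
        digits, base = part[1:], 8
    else:
        digits, base = part, 10
    allowed = "0123456789abcdef"[:base]
    # int() alone would also accept signs, underscores, and whitespace.
    if not digits or any(c not in allowed for c in digits.lower()):
        raise ValueError("not a number: %r" % part)
    return int(digits, base)


def legacy_ipv4_to_int(text):
    """Convert a 1- to 4-part address literal to its 32-bit numeric value."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("too many parts: %r" % text)
    *leading, last = [parse_c_number(p) for p in parts]
    if any(value > 0xFF for value in leading):
        raise ValueError("leading part out of range: %r" % text)
    tail_bits = 8 * (4 - len(leading))   # last part fills the remaining bytes
    if last >= 1 << tail_bits:
        raise ValueError("last part out of range: %r" % text)
    address = 0
    for value in leading:
        address = (address << 8) | value
    return (address << tail_bits) | last
```

Under this sketch, "127.0.0.1", "127.1", "0x7f.0.0.1", "017700000001", and "2130706433" all yield the same numeric value, so a filter that compares string prefixes would treat them differently even though they designate one address.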
2003 These additional IP address formats are not allowed in the URI syntax 2004 due to differences between platform implementations. However, they 2005 can become a security concern if an application attempts to filter 2006 access to resources based on the IP address in string literal format. 2007 If such filtering is performed, literals should be converted to 2008 numeric form and filtered based on the numeric value, rather than a 2009 prefix or suffix of the string form. 2011 7.5 Sensitive Information 2013 URI producers should not provide a URI that contains a username or 2014 password which is intended to be secret: URIs are frequently 2015 displayed by browsers, stored in clear text bookmarks, and logged by 2016 user agent history and intermediary applications (proxies). A 2017 password appearing within the userinfo component is deprecated and 2018 should be considered an error (or simply ignored) except in those 2019 rare cases where the 'password' parameter is intended to be public. 2021 7.6 Semantic Attacks 2023 Because the userinfo subcomponent is rarely used and appears before 2024 the host in the authority component, it can be used to construct a 2025 URI that is intended to mislead a human user by appearing to identify 2026 one (trusted) naming authority while actually identifying a different 2027 authority hidden behind the noise. For example 2029 ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm 2031 might lead a human user to assume that the host is 'cnn.example.com', 2032 whereas it is actually '10.0.0.1'. Note that a misleading userinfo 2033 subcomponent could be much longer than the example above. 2035 A misleading URI, such as the one above, is an attack on the user's 2036 preconceived notions about the meaning of a URI, rather than an 2037 attack on the software itself. 
User agents may be able to reduce the 2038 impact of such attacks by distinguishing the various components of 2039 the URI when rendered, such as by using a different color or tone to 2040 render userinfo if any is present, though there is no general 2041 panacea. More information on URI-based semantic attacks can be found 2042 in [Siedzik]. 2044 8. IANA Considerations 2046 URI scheme names, as defined by <scheme> in Section 3.1, form a 2047 registered name space that is managed by IANA according to the 2048 procedures defined in [BCP35]. 2050 9. Acknowledgments 2052 This specification is derived from RFC 2396 [RFC2396], RFC 1808 2053 [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those 2054 documents still apply. It also incorporates the update (with 2055 corrections) for IPv6 literals in the host syntax, as defined by 2056 Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in 2057 [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, 2058 Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, 2059 Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin 2060 Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond, 2061 Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael 2062 Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew 2063 Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert, 2064 Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai 2065 Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, 2066 Stuart Williams, and Henry Zongaro are gratefully acknowledged. 2068 10. References 2070 10.1 Normative References 2072 [ASCII] American National Standards Institute, "Coded Character 2073 Set -- 7-bit American Standard Code for Information 2074 Interchange", ANSI X3.4, 1986. 2076 [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 2077 Specifications: ABNF", RFC 2234, November 1997.
2079 [STD63] Yergeau, F., "UTF-8, a transformation format of ISO 2080 10646", STD 63, RFC 3629, November 2003. 2082 [UCS] International Organization for Standardization, 2083 "Information Technology - Universal Multiple-Octet Coded 2084 Character Set (UCS)", ISO/IEC 10646:2003, December 2003. 2086 10.2 Informative References 2088 [BCP19] Freed, N. and J. Postel, "IANA Charset Registration 2089 Procedures", BCP 19, RFC 2978, October 2000. 2091 [BCP35] Petke, R. and I. King, "Registration Procedures for URL 2092 Scheme Names", BCP 35, RFC 2717, November 1999. 2094 [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet 2095 host table specification", RFC 952, October 1985. 2097 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 2098 STD 13, RFC 1034, November 1987. 2100 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 2101 and Support", STD 3, RFC 1123, October 1989. 2103 [RFC1535] Gavron, E., "A Security Problem and Proposed Correction 2104 With Widely Deployed DNS Software", RFC 1535, October 2105 1993. 2107 [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A 2108 Unifying Syntax for the Expression of Names and Addresses 2109 of Objects on the Network as used in the World-Wide Web", 2110 RFC 1630, June 1994. 2112 [RFC1736] Kunze, J., "Functional Recommendations for Internet 2113 Resource Locators", RFC 1736, February 1995. 2115 [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for 2116 Uniform Resource Names", RFC 1737, December 1994. 2118 [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform 2119 Resource Locators (URL)", RFC 1738, December 1994. 2121 [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC 2122 1808, June 1995. 2124 [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2125 Extensions (MIME) Part Two: Media Types", RFC 2046, 2126 November 1996. 2128 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 
2130 [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 2131 Resource Identifiers (URI): Generic Syntax", RFC 2396, 2132 August 1998. 2134 [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. 2135 Jensen, "HTTP Extensions for Distributed Authoring -- 2136 WEBDAV", RFC 2518, February 1999. 2138 [RFC2557] Palme, F., Hopmann, A., Shelness, N. and E. Stefferud, 2139 "MIME Encapsulation of Aggregate Documents, such as HTML 2140 (MHTML)", RFC 2557, March 1999. 2142 [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, 2143 "Guidelines for new URL Schemes", RFC 2718, November 1999. 2145 [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for 2146 Literal IPv6 Addresses in URL's", RFC 2732, December 1999. 2148 [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint W3C/ 2149 IETF URI Planning Interest Group: Uniform Resource 2150 Identifiers (URIs), URLs, and Uniform Resource Names 2151 (URNs): Clarifications and Recommendations", RFC 3305, 2152 August 2002. 2154 [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, 2155 "Internationalizing Domain Names in Applications (IDNA)", 2156 RFC 3490, March 2003. 2158 [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 2159 (IPv6) Addressing Architecture", RFC 3513, April 2003. 2161 [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April 2162 2001, . 2165 Authors' Addresses 2167 Tim Berners-Lee 2168 World Wide Web Consortium 2169 Massachusetts Institute of Technology 2170 77 Massachusetts Avenue 2171 Cambridge, MA 02139 2172 USA 2174 Phone: +1-617-253-5702 2175 Fax: +1-617-258-5999 2176 EMail: timbl@w3.org 2177 URI: http://www.w3.org/People/Berners-Lee/ 2179 Roy T. 
Fielding 2180 Day Software 2181 5251 California Ave., Suite 110 2182 Irvine, CA 92617 2183 USA 2185 Phone: +1-949-679-2960 2186 Fax: +1-949-679-2972 2187 EMail: fielding@gbiv.com 2188 URI: http://roy.gbiv.com/ 2190 Larry Masinter 2191 Adobe Systems Incorporated 2192 345 Park Ave 2193 San Jose, CA 95110 2194 USA 2196 Phone: +1-408-536-3024 2197 EMail: LMM@acm.org 2198 URI: http://larry.masinter.net/ 2200 Appendix A. Collected ABNF for URI 2202 URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 2204 hier-part = "//" authority path-abempty 2205 / path-absolute 2206 / path-rootless 2207 / path-empty 2209 URI-reference = URI / relative-URI 2211 absolute-URI = scheme ":" hier-part [ "?" query ] 2213 relative-URI = relative-part [ "?" query ] [ "#" fragment ] 2215 relative-part = "//" authority path-abempty 2216 / path-absolute 2217 / path-noscheme 2218 / path-empty 2220 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 2222 authority = [ userinfo "@" ] host [ ":" port ] 2223 userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) 2224 host = IP-literal / IPv4address / reg-name 2225 port = *DIGIT 2227 IP-literal = "[" ( IPv6address / IPvFuture ) "]" 2229 IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) 2231 IPv6address = 6( h16 ":" ) ls32 2232 / "::" 5( h16 ":" ) ls32 2233 / [ h16 ] "::" 4( h16 ":" ) ls32 2234 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 2235 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 2236 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 2237 / [ *4( h16 ":" ) h16 ] "::" ls32 2238 / [ *5( h16 ":" ) h16 ] "::" h16 2239 / [ *6( h16 ":" ) h16 ] "::" 2241 h16 = 1*4HEXDIG 2242 ls32 = ( h16 ":" h16 ) / IPv4address 2243 IPv4address = dec-octet "." dec-octet "." dec-octet "." 
dec-octet 2245 dec-octet = DIGIT ; 0-9 2246 / %x31-39 DIGIT ; 10-99 2247 / "1" 2DIGIT ; 100-199 2248 / "2" %x30-34 DIGIT ; 200-249 2249 / "25" %x30-35 ; 250-255 2251 reg-name = *( unreserved / pct-encoded / sub-delims ) 2253 path = path-abempty ; begins with "/" or is empty 2254 / path-absolute ; begins with "/" but not "//" 2255 / path-noscheme ; begins with a non-colon segment 2256 / path-rootless ; begins with a segment 2257 / path-empty ; zero characters 2259 path-abempty = *( "/" segment ) 2260 path-absolute = "/" [ segment-nz *( "/" segment ) ] 2261 path-noscheme = segment-nz-nc *( "/" segment ) 2262 path-rootless = segment-nz *( "/" segment ) 2263 path-empty = 0<pchar> 2265 segment = *pchar 2266 segment-nz = 1*pchar 2267 segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) 2268 ; non-zero-length segment without any colon ":" 2270 pchar = unreserved / pct-encoded / sub-delims / ":" / "@" 2272 query = *( pchar / "/" / "?" ) 2274 fragment = *( pchar / "/" / "?" ) 2276 pct-encoded = "%" HEXDIG HEXDIG 2278 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" 2279 reserved = gen-delims / sub-delims 2280 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" 2281 sub-delims = "!" / "$" / "&" / "'" / "(" / ")" 2282 / "*" / "+" / "," / ";" / "=" 2284 Appendix B. Parsing a URI Reference with a Regular Expression 2286 Since the "first-match-wins" algorithm is identical to the "greedy" 2287 disambiguation method used by POSIX regular expressions, it is 2288 natural and commonplace to use a regular expression for parsing the 2289 potential five components of a URI reference. 2291 The following line is the regular expression for breaking-down a 2292 well-formed URI reference into its components. 2294 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? 2295 12 3 4 5 6 7 8 9 2297 The numbers in the second line above are only to assist readability; 2298 they indicate the reference points for each subexpression (i.e., each 2299 paired parenthesis).
We refer to the value matched for subexpression 2300 <n> as $<n>. For example, matching the above expression to 2302 http://www.ics.uci.edu/pub/ietf/uri/#Related 2304 results in the following subexpression matches: 2306 $1 = http: 2307 $2 = http 2308 $3 = //www.ics.uci.edu 2309 $4 = www.ics.uci.edu 2310 $5 = /pub/ietf/uri/ 2311 $6 = <undefined> 2312 $7 = <undefined> 2313 $8 = #Related 2314 $9 = Related 2316 where <undefined> indicates that the component is not present, as is 2317 the case for the query component in the above example. Therefore, we 2318 can determine the value of the four components and fragment as 2320 scheme = $2 2321 authority = $4 2322 path = $5 2323 query = $7 2324 fragment = $9 2326 and, going in the opposite direction, we can recreate a URI reference 2327 from its components using the algorithm of Section 5.3. 2329 Appendix C. Delimiting a URI in Context 2331 URIs are often transmitted through formats that do not provide a 2332 clear context for their interpretation. For example, there are many 2333 occasions when a URI is included in plain text; examples include text 2334 sent in electronic mail, USENET news messages, and, most importantly, 2335 printed on paper. In such cases, it is important to be able to 2336 delimit the URI from the rest of the text, and in particular from 2337 punctuation marks that might be mistaken for part of the URI. 2339 In practice, URIs are delimited in a variety of ways, but usually 2340 within double-quotes "http://example.com/", angle brackets <http://example.com/>, or just using whitespace 2343 http://example.com/ 2345 These wrappers do not form part of the URI. 2347 In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may 2348 need to be added to break a long URI across lines. The whitespace 2349 should be ignored when extracting the URI. 2351 No whitespace should be introduced after a hyphen ("-") character.
2352 Because some typesetters and printers may (erroneously) introduce a 2353 hyphen at the end of line when breaking a line, the interpreter of a 2354 URI containing a line break immediately after a hyphen should ignore 2355 all whitespace around the line break, and should be aware that the 2356 hyphen may or may not actually be part of the URI. 2358 Using <> angle brackets around each URI is especially recommended as 2359 a delimiting style for a reference that contains embedded whitespace. 2361 The prefix "URL:" (with or without a trailing space) was formerly 2362 recommended as a way to help distinguish a URI from other bracketed 2363 designators, though it is not commonly used in practice and is no 2364 longer recommended. 2366 For robustness, software that accepts user-typed URIs should attempt 2367 to recognize and strip both delimiters and embedded whitespace. 2369 For example, the text: 2371 Yes, Jim, I found it under "http://www.w3.org/Addressing/", 2372 but you can probably pick it up from <ftp://foo.example.com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>. 2376 contains the URI references 2378 http://www.w3.org/Addressing/ 2379 ftp://foo.example.com/rfc/ 2380 http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING 2382 Appendix D. Summary of Non-editorial Changes 2384 D.1 Additions 2386 IPv6 (and later) literals have been added to the list of possible 2387 identifiers for the host portion of an authority component, as 2388 described by [RFC2732], with the addition of "[" and "]" to the 2389 reserved set and a version flag to anticipate future versions of IP 2390 literals. Square brackets are now specified as reserved within the 2391 authority component and not allowed outside their use as delimiters 2392 for an IP literal within host. In order to make this change without 2393 changing the technical definition of the path, query, and fragment 2394 components, those rules were redefined to directly specify the 2395 characters allowed rather than be defined in terms of uric.
2397 Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal 2398 address, which unfortunately lacks an ABNF description of 2399 IPv6address, we created a new ABNF rule for IPv6address that matches 2400 the text representations defined by Section 2.2 of [RFC3513]. 2401 Likewise, the definition of IPv4address has been improved in order to 2402 limit each decimal octet to the range 0-255. 2404 Section 6 on URI normalization and comparison has been 2405 completely rewritten and extended using input from Tim Bray and 2406 discussion within the W3C Technical Architecture Group. 2408 An ABNF rule for URI has been introduced to correspond to the common 2409 usage of the term: an absolute URI with optional fragment. 2411 D.2 Modifications from RFC 2396 2413 The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. 2414 This change required all rule names that formerly included underscore 2415 characters to be renamed with a dash instead. 2417 Section 2 on characters has been rewritten to explain what characters 2418 are reserved, when they are reserved, and why they are reserved even 2419 when not used as delimiters by the generic syntax. The mark 2420 characters that are typically unsafe to decode, including the 2421 exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open 2422 and close parentheses ("(" and ")"), have been moved to the reserved 2423 set in order to clarify the distinction between reserved and 2424 unreserved and hopefully answer the most common question of scheme 2425 designers. Likewise, the section on percent-encoded characters has 2426 been rewritten, and URI normalizers are now given license to decode 2427 any percent-encoded octets corresponding to unreserved characters. 2428 In general, the terms "escaped" and "unescaped" have been replaced 2429 with "percent-encoded" and "decoded", respectively, to reduce 2430 confusion with other forms of escape mechanisms.
2432 The ABNF for URI and URI-reference has been redesigned to make them 2433 more friendly to LALR parsers and reduce complexity. As a result, 2434 the layout form of syntax description has been removed, along with 2435 the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, 2436 path_segments, rel_segment, and mark rules. All references to 2437 "opaque" URIs have been replaced with a better description of how the 2438 path component may be opaque to hierarchy. The ambiguity regarding 2439 the parsing of URI-reference as a URI or a relative-URI with a colon 2440 in the first segment has been eliminated through the use of five 2441 separate path matching rules. 2443 The fragment identifier has been moved back into the section on 2444 generic syntax components and within the URI and relative-URI rules, 2445 though it remains excluded from absolute-URI. The number sign ("#") 2446 character has been moved back to the reserved set as a result of 2447 reintegrating the fragment syntax. 2449 The ABNF has been corrected to allow a relative path to be empty. 2450 This also allows an absolute-URI to consist of nothing after the 2451 "scheme:", as is present in practice with the "dav:" namespace 2452 [RFC2518] and the "about:" scheme used internally by many WWW browser 2453 implementations. The ambiguity regarding the boundary between 2454 authority and path has been eliminated through the use of five 2455 separate path matching rules. 2457 Registry-based naming authorities that use the generic syntax are now 2458 defined within the host rule. This change allows current 2459 implementations, where whatever name provided is simply fed to the 2460 local name resolution mechanism, to be consistent with the 2461 specification and removes the need to re-specify DNS name formats 2462 here. 
It also allows the host component to contain percent-encoded 2463 octets, which is necessary to enable internationalized domain names 2464 to be provided in URIs, processed in their native character encodings 2465 at the application layers above URI processing, and passed to an IDNA 2466 library as a registered name in the UTF-8 character encoding. The 2467 server, hostport, hostname, domainlabel, toplabel, and alphanum rules 2468 have been removed. 2470 The resolving relative references algorithm of [RFC2396] has been 2471 rewritten using pseudocode for this revision to improve clarity and 2472 fix the following issues: 2474 o [RFC2396] section 5.2, step 6a, failed to account for a base URI 2475 with no path. 2477 o Restored the behavior of [RFC1808] where, if the reference 2478 contains an empty path and a defined query component, then the 2479 target URI inherits the base URI's path component. 2481 o The determination of whether a URI reference is a same-document 2482 reference has been decoupled from the URI parser, simplifying the 2483 URI processing interface within applications in a way consistent 2484 with the internal architecture of deployed URI processing 2485 implementations. The determination is now based on comparison to 2486 the base URI after transforming a reference to absolute form, 2487 rather than on the format of the reference itself. This change 2488 may result in more references being considered "same-document" 2489 under this specification than would be under the rules given in 2490 RFC 2396, especially when normalization is used to reduce aliases. 2491 However, it does not change the status of existing same-document 2492 references. 2494 o Separated the path merge routine into two routines: merge, for 2495 describing combination of the base URI path with a relative-path 2496 reference, and remove_dot_segments, for describing how to remove 2497 the special "." and ".." segments from a composed path. 
The 2498 remove_dot_segments algorithm is now applied to all URI reference 2499 paths in order to match common implementations and improve the 2500 normalization of URIs in practice. This change only impacts the 2501 parsing of abnormal references and same-scheme references wherein 2502 the base URI has a non-hierarchical path. 2504 Appendix E. Instructions to RFC Editor 2506 Prior to publication as an RFC, please remove this section and the 2507 "Editorial Note" that appears after the Abstract. If [BCP35] or any 2508 of the normative references are updated prior to publication, the 2509 associated reference in this document can be safely updated as well. 2510 This document has been produced using the xml2rfc tool set; the XML 2511 version can be obtained via the URI listed in the editorial note. 2513 Index 2515 A 2516 ABNF 11 2517 absolute 26 2518 absolute-path 25 2519 absolute-URI 26 2520 access 9 2521 authority 15, 17 2523 B 2524 base URI 28 2526 C 2527 character encoding 4 2528 character 4 2529 characters 11 2530 coded character set 4 2532 D 2533 dec-octet 20 2534 dereference 9 2535 dot-segments 22 2537 F 2538 fragment 15, 23 2540 G 2541 gen-delims 12 2542 generic syntax 6 2544 H 2545 h16 19 2546 hier-part 15 2547 hierarchical 10 2548 host 18 2550 I 2551 identifier 5 2552 IP-literal 19 2553 IPv4 20 2554 IPv4address 20 2555 IPv6 19 2556 IPv6address 19 2557 IPvFuture 19 2559 L 2560 locator 7 2561 ls32 19 2563 M 2564 merge 31 2566 N 2567 name 7 2568 network-path 25 2570 P 2571 path 15, 21 2572 path-abempty 21 2573 path-absolute 21 2574 path-empty 21 2575 path-noscheme 21 2576 path-rootless 21 2577 path-abempty 15 2578 path-absolute 15 2579 path-empty 15 2580 path-rootless 15 2581 pchar 21 2582 pct-encoded 12 2583 percent-encoding 12 2584 port 21 2586 Q 2587 query 15, 23 2589 R 2590 reg-name 20 2591 registered name 20 2592 relative 10, 28 2593 relative-path 25 2594 relative-URI 25 2595 remove_dot_segments 31 2596 representation 9 2597 reserved 12 2598 resolution 9, 
28 2599 resource 5 2600 retrieval 9 2602 S 2603 same-document 26 2604 sameness 9 2605 scheme 15, 16 2606 segment 21 2607 segment-nz 21 2608 segment-nz-nc 21 2609 sub-delims 12 2610 suffix 27 2612 T 2613 transcription 7 2615 U 2616 uniform 4 2617 unreserved 13 2618 URI grammar 2619 absolute-URI 26 2620 ALPHA 11 2621 authority 16, 17 2622 CR 11 2623 dec-octet 20 2624 DIGIT 11 2625 DQUOTE 11 2626 fragment 16, 24, 26 2627 gen-delims 12 2628 h16 19 2629 HEXDIG 11 2630 hier-part 16 2631 host 17, 18 2632 IP-literal 19 2633 IPv4address 20 2634 IPv6address 19, 19 2635 IPvFuture 19 2636 LF 11 2637 ls32 19 2638 mark 13 2639 OCTET 11 2640 path 22 2641 path-abempty 16, 22 2642 path-absolute 16, 22 2643 path-empty 16, 22 2644 path-noscheme 22 2645 path-rootless 16, 22 2646 pchar 22, 23, 24 2647 pct-encoded 12 2648 port 17, 21 2649 query 16, 23, 26, 26 2650 reg-name 20 2651 relative-URI 25, 26 2652 reserved 12 2653 scheme 16, 16, 26 2654 segment 22 2655 segment-nz 22 2656 segment-nz-nc 22 2657 SP 11 2658 sub-delims 12 2659 unreserved 13 2660 URI 16, 25 2661 URI-reference 25 2662 userinfo 17, 17 2663 URI 15 2664 URI-reference 25 2665 URL 7 2666 URN 7 2667 userinfo 17 2669 Intellectual Property Statement 2671 The IETF takes no position regarding the validity or scope of any 2672 Intellectual Property Rights or other rights that might be claimed to 2673 pertain to the implementation or use of the technology described in 2674 this document or the extent to which any license under such rights 2675 might or might not be available; nor does it represent that it has 2676 made any independent effort to identify any such rights. Information 2677 on the procedures with respect to rights in RFC documents can be 2678 found in BCP 78 and BCP 79. 
2680 Copies of IPR disclosures made to the IETF Secretariat and any 2681 assurances of licenses to be made available, or the result of an 2682 attempt made to obtain a general license or permission for the use of 2683 such proprietary rights by implementers or users of this 2684 specification can be obtained from the IETF on-line IPR repository at 2685 http://www.ietf.org/ipr. 2687 The IETF invites any interested party to bring to its attention any 2688 copyrights, patents or patent applications, or other proprietary 2689 rights that may cover technology that may be required to implement 2690 this standard. Please address the information to the IETF at 2691 ietf-ipr@ietf.org. 2693 Disclaimer of Validity 2695 This document and the information contained herein are provided on an 2696 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 2697 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 2698 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 2699 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 2700 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 2701 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 2703 Copyright Statement 2705 Copyright (C) The Internet Society (2004). This document is subject 2706 to the rights, licenses and restrictions contained in BCP 78, and 2707 except as set forth therein, the authors retain all their rights. 2709 Acknowledgment 2711 Funding for the RFC Editor function is currently provided by the 2712 Internet Society.