idnits 2.17.00 (12 Aug 2021) /tmp/idnits35178/draft-fielding-uri-rfc2396bis-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 659: '... practice is NOT RECOMMENDED, because ...' -- The draft header indicates that this document obsoletes RFC2732, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC2396, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC1808, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document updates RFC1738, but the abstract doesn't seem to mention this, which it should. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (March 3, 2003) is 7018 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 
'ASCII' ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234) -- Obsolete informational reference (is this intentional?): RFC 1738 (Obsoleted by RFC 4248, RFC 4266) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 1808 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2518 (Obsoleted by RFC 4918) -- Obsolete informational reference (is this intentional?): RFC 2373 (Obsoleted by RFC 3513) -- Obsolete informational reference (is this intentional?): RFC 2732 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2110 (Obsoleted by RFC 2557) -- Obsolete informational reference (is this intentional?): RFC 2717 (Obsoleted by RFC 4395) -- Obsolete informational reference (is this intentional?): RFC 2279 (ref. 'UTF-8') (Obsoleted by RFC 3629) Summary: 4 errors (**), 0 flaws (~~), 2 warnings (==), 17 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group T. Berners-Lee 3 Internet-Draft MIT/LCS 4 Updates: 1738 (if approved) R. Fielding 5 Obsoletes: 2732, 2396, 1808 (if approved) Day Software 6 Expires: September 1, 2003 L. Masinter 7 Adobe 8 March 3, 2003 10 Uniform Resource Identifier (URI): Generic Syntax 11 draft-fielding-uri-rfc2396bis-01 13 Status of this Memo 15 This document is an Internet-Draft and is in full conformance with 16 all provisions of Section 10 of RFC2026. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that other 20 groups may also distribute working documents as Internet-Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. 
It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt. 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on September 1, 2003. 35 Copyright Notice 37 Copyright (C) The Internet Society (2003). All Rights Reserved. 39 Abstract 41 A Uniform Resource Identifier (URI) is a compact string of characters 42 for identifying an abstract or physical resource. This document 43 defines the generic syntax of a URI, including both absolute and 44 relative forms, and guidelines for their use. 46 This document defines a grammar that is a superset of all valid URIs, 47 such that an implementation can parse the common components of a URI 48 reference without knowing the scheme-specific requirements of every 49 possible identifier type. This document does not define a generative 50 grammar for all URIs; that task will be performed by the individual 51 specifications of each URI scheme. 53 Editorial Note 55 Discussion of this draft and comments to the editors should be sent 56 to the uri@w3.org mailing list. An issues list and version history 57 is available at . 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 63 1.2 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 5 64 1.3 Example URIs . . . . . . . . . . . . . . . . . . . . . . . . 6 65 1.4 Hierarchical URIs and Relative Forms . . . . . . . . . . . . 6 66 1.5 URI Transcribability . . . . . . . . . . . . . . . . . . . . 7 67 1.6 Syntax Notation and Common Elements . . . . . . . . . . . . 8 68 2. URI Characters and Escape Sequences . . . . . . . . . . . . 9 69 2.1 URIs and non-ASCII characters . . . . . . . . . . . . . . . 9 70 2.2 Reserved Characters . . . . . 
. . . . . . . . . . . . . . . 10 71 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 11 72 2.4 Escape Sequences . . . . . . . . . . . . . . . . . . . . . . 11 73 2.4.1 Escaped Encoding . . . . . . . . . . . . . . . . . . . . . . 11 74 2.4.2 When to Escape and Unescape . . . . . . . . . . . . . . . . 11 75 2.4.3 Excluded US-ASCII Characters . . . . . . . . . . . . . . . . 12 76 3. URI Syntactic Components . . . . . . . . . . . . . . . . . . 14 77 3.1 Scheme Component . . . . . . . . . . . . . . . . . . . . . . 15 78 3.2 Authority Component . . . . . . . . . . . . . . . . . . . . 15 79 3.2.1 Registry-based Naming Authority . . . . . . . . . . . . . . 16 80 3.2.2 Server-based Naming Authority . . . . . . . . . . . . . . . 16 81 3.3 Path Component . . . . . . . . . . . . . . . . . . . . . . . 18 82 3.4 Query Component . . . . . . . . . . . . . . . . . . . . . . 19 83 4. URI References . . . . . . . . . . . . . . . . . . . . . . . 20 84 4.1 Fragment Identifier . . . . . . . . . . . . . . . . . . . . 20 85 4.2 Same-document References . . . . . . . . . . . . . . . . . . 21 86 4.3 Parsing a URI Reference . . . . . . . . . . . . . . . . . . 21 87 5. Relative URI References . . . . . . . . . . . . . . . . . . 22 88 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 23 89 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 24 90 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 24 91 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 25 92 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 25 93 5.2 Resolving Relative References to Absolute Form . . . . . . . 25 94 6. URI Normalization and Comparison . . . . . . . . . . . . . . 29 95 6.1 URI Equivalence . . . . . . . . . . . . . . . . . . . . . . 29 96 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 29 97 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 30 98 6.2.2 Syntax-based Normalization . . . . . 
. . . . . . . . . . . . 31 99 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 32 100 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 32 101 6.3 Good Practice When Using URIs . . . . . . . . . . . . . . . 32 102 7. Security Considerations . . . . . . . . . . . . . . . . . . 34 103 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 34 104 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 34 105 7.3 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 35 106 7.4 Sensitive Information . . . . . . . . . . . . . . . . . . . 35 107 7.5 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 36 108 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 37 109 Normative References . . . . . . . . . . . . . . . . . . . . 38 110 Non-normative References . . . . . . . . . . . . . . . . . . 39 111 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 40 112 A. Collected BNF for URI . . . . . . . . . . . . . . . . . . . 42 113 B. Parsing a URI Reference with a Regular Expression . . . . . 43 114 C. Examples of Resolving Relative URI References . . . . . . . 44 115 C.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 44 116 C.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 44 117 D. Embedding the Base URI in HTML documents . . . . . . . . . . 46 118 E. Recommendations for Delimiting URI in Context . . . . . . . 47 119 F. Abbreviated URIs . . . . . . . . . . . . . . . . . . . . . . 49 120 G. Summary of Non-editorial Changes . . . . . . . . . . . . . . 50 121 G.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 50 122 G.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 50 123 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 124 Intellectual Property and Copyright Statements . . . . . . . 55 126 1. Introduction 128 A Uniform Resource Identifier (URI) provides a simple and extensible 129 means for identifying a resource. 
This specification of URI syntax 130 and semantics is derived from concepts introduced by the World Wide 131 Web global information initiative, whose use of such objects dates 132 from 1990 and is described in "Universal Resource Identifiers in WWW" 133 [RFC1630], and is designed to meet the recommendations laid out in 134 "Functional Recommendations for Internet Resource Locators" [RFC1736] 135 and "Functional Requirements for Uniform Resource Names" [RFC1737]. 137 This document obsoletes [RFC2396], which merged "Uniform Resource 138 Locators" [RFC1738] and "Relative Uniform Resource Locators" 139 [RFC1808] in order to define a single, generic syntax for all URIs. 140 It excludes those portions of RFC 1738 that defined the specific 141 syntax of individual URI schemes; those portions will be updated as 142 separate documents. The process for registration of new URI schemes 143 is defined separately by [RFC2717]. 145 All significant changes from RFC 2396 are noted in Appendix G. 147 1.1 Overview of URIs 149 URIs are characterized by the following definitions: 151 Uniform 153 Uniformity provides several benefits: it allows different types of 154 resource identifiers to be used in the same context, even when the 155 mechanisms used to access those resources may differ; it allows 156 uniform semantic interpretation of common syntactic conventions 157 across different types of resource identifiers; it allows 158 introduction of new types of resource identifiers without 159 interfering with the way that existing identifiers are used; and, 160 it allows the identifiers to be reused in many different contexts, 161 thus permitting new applications or protocols to leverage a 162 pre-existing, large, and widely-used set of resource identifiers. 164 Resource 166 A resource can be anything that has identity. Familiar examples 167 include an electronic document, an image, a service (e.g., 168 "today's weather report for Los Angeles"), and a collection of 169 other resources. 
Not all resources are network "retrievable"; 170 e.g., human beings, corporations, and bound books in a library can 171 also be considered resources. 173 The resource is the conceptual mapping to an entity or set of 174 entities, not necessarily the entity which corresponds to that 175 mapping at any particular instance in time. Thus, a resource can 176 remain constant even when its content---the entities to which it 177 currently corresponds---changes over time, provided that the 178 conceptual mapping is not changed in the process. 180 Identifier 182 An identifier is an object that can act as a reference to 183 something that has identity. In the case of a URI, the object is 184 a sequence of characters with a restricted syntax. 186 Having identified a resource, a system may perform a variety of 187 operations on the resource, as might be characterized by such words 188 as `access', `update', `replace', or `find attributes'. 190 1.2 URI, URL, and URN 192 A URI can be further classified as a locator, a name, or both. The 193 term "Uniform Resource Locator" (URL) refers to the subset of URIs 194 that, in addition to identifying the resource, provide a means of 195 locating the resource by describing its primary access mechanism 196 (e.g., its network "location"). The term "Uniform Resource Name" 197 (URN) refers to the subset of URIs that are required to remain 198 globally unique and persistent even when the resource ceases to exist 199 or becomes unavailable. 201 An individual scheme does not need to be cast into one of a discrete 202 set of URI types such as "URL", "URN", "URC", etc. Any given URI 203 scheme may define subspaces that have the characteristics of a name, 204 a locator, or both, often depending on the persistence and care in 205 the assignment of identifiers by the naming authority, rather than on 206 any quality of the URI scheme. 
For that reason, this specification 207 deprecates use of the terms URL or URN to distinguish between 208 schemes, instead using the term URI throughout. 210 Each URI scheme (Section 3.1) defines the namespace of the URI, and 211 thus may further restrict the syntax and semantics of identifiers 212 using that scheme. This specification defines those elements of the 213 URI syntax that are either required of all URI schemes or are common 214 to many URI schemes. It thus defines the syntax and semantics that 215 are needed to implement a scheme-independent parsing mechanism for 216 URI references, such that the scheme-dependent handling of a URI can 217 be postponed until the scheme-dependent semantics are needed. 219 Although many URI schemes are named after protocols, this does not 220 imply that use of such a URI will result in access to the resource 221 via the named protocol. URIs are often used in contexts that are 222 purely for identification, just like any other identifier. Even when 223 a URI is used to obtain a representation of a resource, that access 224 might be through gateways, proxies, caches, and name resolution 225 services that are independent of the protocol of the resource origin, 226 and the resolution of some URIs may require the use of more than one 227 protocol (e.g., both DNS and HTTP are typically used to access an 228 "http" URI's resource when it can't be found in a local cache). 230 A parser of the generic URI syntax is capable of parsing any URI 231 reference into its major components; once the scheme is determined, 232 further scheme-specific parsing can be performed on the components. 233 In other words, the URI generic syntax is a superset of the syntax of 234 all URI schemes. 236 1.3 Example URIs 238 The following examples illustrate URIs that are in common use. 
240 ftp://ftp.is.co.za/rfc/rfc1808.txt 241 -- ftp scheme for File Transfer Protocol services 243 gopher://gopher.tc.umn.edu:70/11/Mailing%20Lists/ 244 -- gopher scheme for Gopher and Gopher+ Protocol services 246 http://www.ietf.org/rfc/rfc2396.txt 247 -- http scheme for Hypertext Transfer Protocol services 249 mailto:John.Doe@example.com 250 -- mailto scheme for electronic mail addresses 252 news:comp.infosystems.www.servers.unix 253 -- news scheme for USENET news groups and articles 255 telnet://melvyl.ucop.edu/ 256 -- telnet scheme for interactive TELNET services 258 1.4 Hierarchical URIs and Relative Forms 260 An absolute identifier refers to a resource independent of the 261 context in which the identifier is used. In contrast, a relative 262 identifier refers to a resource by describing the difference within a 263 hierarchical namespace between the current context and an absolute 264 identifier of the resource. 266 Some URI schemes support a hierarchical naming system, where the 267 hierarchy of the name is denoted by a "/" delimiter separating the 268 components in the scheme. This document defines a scheme-independent 269 `relative' form of URI reference that can be used in conjunction with 270 a `base' URI of a hierarchical scheme to produce the `absolute' URI 271 form of the reference. The syntax of a hierarchical URI is described 272 in Section 3; the relative URI calculation is described in Section 5. 274 1.5 URI Transcribability 276 The URI syntax was designed with global transcribability as one of 277 its main concerns. A URI is a sequence of characters from a very 278 limited set, i.e. the letters of the basic Latin alphabet, digits, 279 and a few special characters. A URI may be represented in a variety 280 of ways: e.g., ink on paper, pixels on a screen, or a sequence of 281 octets in a coded character set. The interpretation of a URI depends 282 only on the characters used and not how those characters are 283 represented in a network protocol. 
285 The goal of transcribability can be described by a simple scenario. 286 Imagine two colleagues, Sam and Kim, sitting in a pub at an 287 international conference and exchanging research ideas. Sam asks Kim 288 for a location to get more information, so Kim writes the URI for the 289 research site on a napkin. Upon returning home, Sam takes out the 290 napkin and types the URI into a computer, which then retrieves the 291 information to which Kim referred. 293 There are several design concerns revealed by the scenario: 295 o A URI is a sequence of characters, which is not always represented 296 as a sequence of octets. 298 o A URI may be transcribed from a non-network source, and thus 299 should consist of characters that are most likely to be able to be 300 typed into a computer, within the constraints imposed by keyboards 301 (and related input devices) across languages and locales. 303 o A URI often needs to be remembered by people, and it is easier for 304 people to remember a URI when it consists of meaningful 305 components. 307 These design concerns are not always in alignment. For example, it 308 is often the case that the most meaningful name for a URI component 309 would require characters that cannot be typed into some systems. The 310 ability to transcribe the resource identifier from one medium to 311 another was considered more important than having its URI consist of 312 the most meaningful of components. In local and regional contexts 313 and with improving technology, users might benefit from being able to 314 use a wider range of characters; such use is not defined in this 315 document. 317 1.6 Syntax Notation and Common Elements 319 This document uses two conventions to describe and define the syntax 320 for URI. The first, called the layout form, is a general description 321 of the order of components and component separators, as in 323 <first>/<second>;<third>?<fourth> 
325 The component names are enclosed in angle-brackets and any characters 326 outside angle-brackets are literal separators. Whitespace should be 327 ignored. These descriptions are used informally and do not define 328 the syntax requirements. 330 The second convention is a formal grammar defined using the Augmented 331 Backus-Naur Form (ABNF) notation of [RFC2234]. Although the ABNF 332 defines syntax in terms of the ASCII character encoding [ASCII], the 333 URI syntax should be interpreted in terms of the character that the 334 ASCII-encoded octet represents, rather than the octet encoding 335 itself. How a URI is represented in terms of bits and bytes on the 336 wire is dependent upon the character encoding of the protocol used to 337 transport it, or the charset of the document that contains it. 339 The complete URI syntax is collected in Appendix A. 341 2. URI Characters and Escape Sequences 343 A URI consists of a restricted set of characters, primarily chosen 344 to aid transcribability and usability both in computer systems and in 345 non-computer communications. Characters used conventionally as 346 delimiters around a URI are excluded. The restricted set of 347 characters consists of digits, letters, and a few graphic symbols 348 chosen from those common to most of the character encodings and input 349 facilities available to Internet users. 351 uric = reserved / unreserved / escaped 353 Within a URI, characters are either used as delimiters or to 354 represent strings of data (octets) within the delimited portions. 355 Octets are either represented directly by a character (using the 356 US-ASCII character for that octet [ASCII]) or by an escape encoding. 357 This representation is elaborated below. 359 2.1 URIs and non-ASCII characters 361 The relationship between URIs and characters has been a source of 362 confusion for characters that are not part of US-ASCII. 
To describe 363 the relationship, it is useful to distinguish between a "character" 364 (as a distinguishable semantic entity) and an "octet" (an 8-bit 365 byte). There are two mappings, one from URI characters to octets, and 366 a second from octets to original characters: 368 URI character sequence->octet sequence->original character sequence 370 A URI is represented as a sequence of characters, not as a sequence 371 of octets. That is because a URI might be "transported" by means that 372 are not through a computer network, e.g., printed on paper, read over 373 the radio, etc. 375 Within a delimited component of a URI, a sequence of characters is 376 used to represent a sequence of octets. For example, the character 377 "a" represents the octet 97 (decimal), while the character sequence 378 "%", "0", "a" represents the octet 10 (decimal). 380 There is a second translation for some resources: the sequence of 381 octets defined by a component of the URI is subsequently used to 382 represent a sequence of characters. A 'charset' defines this mapping. 383 There are many charsets in use in Internet protocols. For example, 384 UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences 385 of characters in the repertoire of ISO 10646. 387 In the simplest case, the original character sequence contains only 388 characters that are defined in US-ASCII, and the two levels of 389 mapping are simple and easily invertible: each 'original character' 390 is represented as the octet for the US-ASCII code for it, which is, 391 in turn, represented as either the US-ASCII character, or else the 392 "%" escape sequence for that octet. 394 For original character sequences that contain non-ASCII characters, 395 however, the situation is more difficult. Internet protocols that 396 transmit octet sequences intended to represent character sequences 397 are expected to provide some way of identifying the charset used, if 398 there might be more than one [RFC2277]. 
However, there is currently 399 no provision within the generic URI syntax to accomplish this 400 identification. An individual URI scheme may require a single 401 charset, define a default charset, or provide a way to indicate the 402 charset used. For example, a new scheme "foo" might be defined such 403 that any escaped octet is keyed to the UTF-8 encoding in order to 404 determine the corresponding Unicode character. 406 It is expected that a systematic treatment of character encoding 407 within URIs will be developed as a future modification of this 408 specification. 410 2.2 Reserved Characters 412 Many URI include components consisting of or delimited by, certain 413 special characters. These characters are called "reserved", since 414 their usage within the URI component is limited to their reserved 415 purpose. If the data for a URI component would conflict with the 416 reserved purpose, then the conflicting data must be escaped before 417 forming the URI. 419 reserved = "[" / "]" / ";" / "/" / "?" / 420 ":" / "@" / "&" / "=" / "+" / "$" / "," 422 The "reserved" syntax class above refers to those characters that are 423 allowed within a URI, but which may not be allowed within a 424 particular component of the generic URI syntax; they are used as 425 delimiters of the components described in Section 3. 427 Characters in the "reserved" set are not reserved in all contexts. 428 The set of characters actually reserved within any given URI 429 component is defined by that component. In general, a character is 430 reserved if the semantics of the URI changes if the character is 431 replaced with its escaped US-ASCII encoding. 433 2.3 Unreserved Characters 435 Data characters that are allowed in a URI but do not have a reserved 436 purpose are called unreserved. These include upper and lower case 437 letters, decimal digits, and a limited set of punctuation marks and 438 symbols. 440 unreserved = ALPHA / DIGIT / mark 442 mark = "-" / "_" / "." / "!" 
/ "~" / "*" / "'" / "(" / ")" 444 Unreserved characters can be escaped without changing the semantics 445 of the URI, but this should not be done unless the URI is being used 446 in a context that does not allow the unescaped character to appear. 447 URI normalization processes may unescape sequences in the ranges of 448 ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), underscore (%5F), or 449 tilde (%7E) without fear of creating a conflict, but unescaping the 450 other mark characters is usually counterproductive. 452 2.4 Escape Sequences 454 Data must be escaped if it does not have a representation using an 455 unreserved character; this includes data that does not correspond to 456 a printable character of the US-ASCII coded character set, or that 457 corresponds to any US-ASCII character that is disallowed, as 458 explained below. 460 2.4.1 Escaped Encoding 462 An escaped octet is encoded as a character triplet, consisting of 463 the percent character "%" followed by the two hexadecimal digits 464 representing the octet code in . For example, "%20" is the escaped 465 encoding for the US-ASCII space character. 467 escaped = "%" HEXDIG HEXDIG 469 2.4.2 When to Escape and Unescape 471 A URI is always in an "escaped" form, since escaping or unescaping a 472 completed URI might change its semantics. Normally, the only time 473 escape encodings can safely be made is when the URI is being created 474 from its component parts; each component may have its own set of 475 characters that are reserved, so only the mechanism responsible for 476 generating or interpreting that component can determine whether or 477 not escaping a character will change its semantics. Likewise, a URI 478 must be separated into its components before the escaped characters 479 within those components can be safely decoded. 
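The ordering rule above (separate the components first, decode afterwards) can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the specification: it relies on the standard urllib.parse module, which follows the later RFC 3986 syntax rather than this draft's grammar, and the helper name decode_path_segments and the example.com URI are hypothetical.

```python
from urllib.parse import urlsplit, unquote


def decode_path_segments(uri):
    """Split the URI first, then unescape each path segment on its own.

    Hypothetical helper for illustration; not part of this specification.
    """
    path = urlsplit(uri).path
    return [unquote(segment) for segment in path.split("/")]


# "%2F" is an escaped slash carried as data inside one path segment.
uri = "http://example.com/a%2Fb/c"

# Correct order: separate the components, then decode within each one.
safe = decode_path_segments(uri)

# Wrong order: decoding the whole URI turns the data slash into a delimiter.
unsafe = urlsplit(unquote(uri)).path.split("/")

print(safe)
print(unsafe)
```

In the first result the segment "a/b" survives as a single unit. In the second, the same data is split into two segments, which is the change in semantics that the text warns about.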
481 In some cases, data that could be represented by an unreserved 482 character may appear escaped; for example, some of the unreserved 483 "mark" characters are automatically escaped by some systems. If the 484 given URI scheme defines a canonicalization algorithm, then 485 unreserved characters may be unescaped according to that algorithm. 486 For example, "%7e" is sometimes used instead of "~" in an http URI 487 path, but the two are equivalent for an http URI. 489 Because the percent "%" character always has the reserved purpose of 490 being the escape indicator, it must be escaped as "%25" in order to 491 be used as data within a URI. Implementers should be careful not to 492 escape or unescape the same string more than once, since unescaping 493 an already unescaped string might lead to misinterpreting a percent 494 data character as another escaped character, or vice versa in the 495 case of escaping an already escaped string. 497 2.4.3 Excluded US-ASCII Characters 499 Although they are disallowed within the URI syntax, we include here a 500 description of those US-ASCII characters that have been excluded and 501 the reasons for their exclusion. 503 The control characters (CTL) in the US-ASCII coded character set are 504 not used within a URI, both because they are non-printable and 505 because they are likely to be misinterpreted by some control 506 mechanisms. 508 The space character (SP) is excluded because significant spaces may 509 disappear and insignificant spaces may be introduced when a URI is 510 transcribed or typeset or subjected to the treatment of 511 word-processing programs. Whitespace is also used to delimit a URI 512 in many contexts. 514 The angle-bracket "<" and ">" and double-quote (") characters are 515 excluded because they are often used as the delimiters around a URI 516 in text documents and protocol fields. 
The character "#" is excluded 517 because it is used to delimit a URI from a fragment identifier in a 518 URI reference (Section 4). The percent character "%" is excluded 519 because it is used for the encoding of escaped characters. 521 delims = "<" / ">" / "#" / "%" / DQUOTE 523 Other characters are excluded because gateways and other transport 524 agents are known to sometimes modify such characters, or they are 525 used as delimiters. 527 unwise = "{" / "}" / "|" / "\" / "^" / "`" 529 Data corresponding to excluded characters must be escaped in order to 530 be properly represented within a URI. 532 3. URI Syntactic Components 534 The URI syntax is dependent upon the scheme. In general, absolute 535 URIs are written as follows: 537 <scheme>:<scheme-specific-part> 539 An absolute URI contains the name of the scheme being used (<scheme>) 540 followed by a colon (":") and then a string (the 541 <scheme-specific-part>) whose interpretation depends on the scheme. 543 The URI syntax does not require that the scheme-specific-part have 544 any general structure or set of semantics which is common among all 545 URIs. However, a subset of URI do share a common syntax for 546 representing hierarchical relationships within the namespace. This 547 "generic URI" syntax consists of a sequence of four main components: 549 <scheme>://<authority><path>?<query> 551 each of which, except <scheme>, may be absent from a particular URI. 552 For example, some URI schemes do not allow an <authority> component, 553 and others do not use a <query> component. 555 absolute-URI = scheme ":" ( hier-part / opaque-part ) 557 URIs that are hierarchical in nature use the slash "/" character for 558 separating hierarchical components. For some file systems, a "/" 559 character (used to denote the hierarchical structure of a URI) is the 560 delimiter used to construct a file name hierarchy, and thus the URI 561 path will look similar to a file pathname. This does NOT imply that 562 the resource is a file or that the URI maps to an actual filesystem 563 pathname. 565 hier-part = [ net-path / abs-path ] [ "?" 
query ] 567 net-path = "//" authority [ abs-path ] 569 abs-path = "/" path-segments 571 URIs that do not make use of the slash "/" character for separating 572 hierarchical components are considered opaque by the generic URI 573 parser. 575 opaque-part = uric-no-slash *uric 577 uric-no-slash = unreserved / escaped / "[" / "]" / ";" / "?" / 578 ":" / "@" / "&" / "=" / "+" / "$" / "," 580 We use the term <path> to refer to both the <abs-path> and 581 <opaque-part> constructs, since they are mutually exclusive for any 582 given URI and can be parsed as a single component. 584 3.1 Scheme Component 586 Just as there are many different methods of access to resources, 587 there are a variety of schemes for identifying such resources. The 588 URI syntax consists of a sequence of components separated by reserved 589 characters, with the first component defining the semantics for the 590 remainder of the URI string. 592 Scheme names consist of a sequence of characters beginning with a 593 lower case letter and followed by any combination of lower case 594 letters, digits, plus ("+"), period ("."), or hyphen ("-"). For 595 resiliency, programs interpreting a URI should treat upper case 596 letters as equivalent to lower case in scheme names (e.g., allow 597 "HTTP" as well as "http"). 599 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 601 Relative URI references are distinguished from absolute URIs in that 602 they do not begin with a scheme name. Instead, the scheme is 603 inherited from the base URI, as described in Section 5.2. 605 3.2 Authority Component 607 Many URI schemes include a top hierarchical element for a naming 608 authority, such that the namespace defined by the remainder of the 609 URI is governed by that authority. This authority component is 610 typically defined by an Internet-based server or a scheme-specific 611 registry of naming authorities.
613 authority = server / reg-name 615 The authority component is preceded by a double slash "//" and is 616 terminated by the next slash "/", question-mark "?", or by the end of 617 the URI. Within the authority component, the characters ";", ":", 618 "@", "?", "/", "[", and "]" are reserved. 620 An authority component is not required for a URI scheme to make use 621 of relative references. A base URI without an authority component 622 implies that any relative reference will also be without an authority 623 component. 625 3.2.1 Registry-based Naming Authority 627 The structure of a registry-based naming authority is specific to 628 the URI scheme, but constrained to the allowed characters for an 629 authority component. 631 reg-name = 1*( unreserved / escaped / ";" / 632 ":" / "@" / "&" / "=" / "+" / "$" / "," ) 634 3.2.2 Server-based Naming Authority 636 URI schemes that involve the direct use of an IP-based protocol to a 637 specified server on the Internet use a common syntax for the server 638 component of the URI's scheme-specific data: 640 <userinfo>@<host>:<port> 642 where <userinfo> may consist of a user name and, optionally, 643 scheme-specific information about how to gain authorization to access 644 the server. The parts "<userinfo>@" and ":<port>" may be omitted. If <host> 645 is omitted, the default host is defined by the scheme-specific 646 semantics of the URI (e.g., the "file" URI scheme defaults to 647 "localhost", whereas the "http" URI scheme does not allow host to be 648 omitted). 650 server = [ [ userinfo "@" ] hostport ] 652 The user information, if present, is followed by a commercial 653 at-sign "@". 655 userinfo = *( unreserved / escaped / ";" / 656 ":" / "&" / "=" / "+" / "$" / "," ) 658 Some URI schemes use the format "user:password" in the userinfo 659 field. This practice is NOT RECOMMENDED, because the passing of 660 authentication information in clear text has proven to be a security 661 risk in almost every case where it has been used.
Note also that 662 userinfo which is crafted to look like a trusted domain name might be 663 used to mislead users, as described in Section 7.5. 665 The server is identified by a network host --- as described by an 666 IPv6 literal encapsulated within square brackets, an IPv4 address in 667 dotted-decimal form, or a domain name --- and an optional port 668 number. The server's port, if any is required by the URI scheme, can 669 be specified by a port number in decimal following the host and 670 delimited from it by a colon (":") character. If no explicit port 671 number is given, the default port number, as defined by the URI 672 scheme, is assumed. The type of network port identified by the URI 673 (e.g., TCP, UDP, SCTP) is defined by the scheme-specific 674 semantics of the URI scheme. 676 hostport = host [ ":" port ] 677 host = IPv6reference / IPv4address / hostname 678 port = *DIGIT 680 A hostname takes the form described in Section 3 of [RFC1034] and 681 Section 2.1 of [RFC1123]: a sequence of domain labels separated by 682 ".", each domain label starting and ending with an alphanumeric 683 character and possibly also containing "-" characters. The rightmost 684 domain label of a fully qualified domain name will never start with a 685 digit, thus syntactically distinguishing domain names from IPv4 686 addresses, and may be followed by a single "." if it is necessary to 687 distinguish between the complete domain name and any local domain. 689 hostname = domainlabel qualified 690 qualified = *( "." domainlabel ) [ "." toplabel "." ] 691 domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] 692 toplabel = ALPHA [ 0*61( alphanum / "-" ) alphanum ] 693 alphanum = ALPHA / DIGIT 695 A host identified by an IPv4 literal address is represented in 696 dotted-decimal notation (a sequence of four decimal numbers in the 697 range 0 to 255, separated by "."), as described in [RFC1123] by 698 reference to [RFC0952].
Note that other forms of dotted notation may 699 be interpreted on some platforms, as described in Section 7.3, but 700 only the dotted-decimal form of four octets is allowed by this 701 grammar. 703 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 704 dec-octet = DIGIT / ; 0-9 705 ( %x31-39 DIGIT ) / ; 10-99 706 ( "1" 2DIGIT ) / ; 100-199 707 ( "2" %x30-34 DIGIT ) / ; 200-249 708 ( "25" %x30-35 ) ; 250-255 710 A host identified by an IPv6 literal address [RFC2373] is 711 distinguished by enclosing the IPv6 literal within square brackets 712 ("[" and "]"). This is the only place where square-bracket 713 characters are allowed in the hierarchical URI syntax. 715 IPv6reference = "[" IPv6address "]" 717 IPv6address = ( 6( h4 ":" ) ls32 ) 718 / ( "::" 5( h4 ":" ) ls32 ) 719 / ( [ h4 ] "::" 4( h4 ":" ) ls32 ) 720 / ( [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 ) 721 / ( [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 ) 722 / ( [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 ) 723 / ( [ *4( h4 ":" ) h4 ] "::" ls32 ) 724 / ( [ *5( h4 ":" ) h4 ] "::" h4 ) 725 / ( [ *6( h4 ":" ) h4 ] "::" ) 727 ls32 = ( h4 ":" h4 ) / IPv4address 728 ; least-significant 32 bits of address 730 h4 = 1*4HEXDIG 732 3.3 Path Component 734 The path component contains data, specific to the authority (or the 735 scheme if there is no authority component), identifying the resource 736 within the scope of that scheme and authority. 738 path = [ abs-path / opaque-part ] 740 path-segments = segment *( "/" segment ) 741 segment = *pchar 743 pchar = unreserved / escaped / ";" / 744 ":" / "@" / "&" / "=" / "+" / "$" / "," 746 The path may consist of a sequence of path segments separated by a 747 single slash "/" character. Within a path segment, the characters 748 "/", ";", "=", and "?" are reserved. The semicolon (";") and equals 749 ("=") characters have the reserved purpose of delimiting parameters 750 and parameter values within a path segment.
However, parameters are 751 not significant to the parsing of relative references. 753 3.4 Query Component 755 The query component is a string of information to be interpreted by 756 the resource. 758 query = *( pchar / "/" / "?" ) 760 Within a query component, the characters ";", "/", "?", ":", "@", 761 "&", "=", "+", ",", and "$" are reserved. 763 4. URI References 765 The term "URI-reference" is used here to denote the common usage of 766 a resource identifier. A URI reference may be absolute or relative, 767 and may have additional information attached in the form of a 768 fragment identifier. However, "the URI" that results from such a 769 reference includes only the absolute URI after the fragment 770 identifier (if any) is removed and after any relative URI is resolved 771 to its absolute form. Although it is possible to limit the 772 discussion of URI syntax and semantics to that of the absolute 773 result, most usage of URI is within general URI references, and it is 774 impossible to obtain the URI from such a reference without also 775 parsing the fragment and resolving the relative form. 777 URI-reference = [ absolute-URI / relative-URI ] [ "#" fragment ] 779 Many protocol elements allow only the absolute form of a URI with an 780 optional fragment identifier. 782 absolute-URI-reference = absolute-URI [ "#" fragment ] 784 The syntax for a relative URI is a shortened form of that for an 785 absolute URI, where some prefix of the URI is missing and certain 786 path components ("." and "..") have a special meaning when, and only 787 when, interpreting a relative path. The relative URI syntax is 788 defined in Section 5. 
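As a non-normative illustration, the separation of a URI reference into its URI part and its optional fragment might be sketched in Python as follows. The function name split_fragment is hypothetical. The sketch assumes only that the first "#" delimits the fragment, as the URI-reference rule implies, and it uses None to mark a fragment that is absent (as opposed to present but empty).

```python
def split_fragment(reference):
    """Split a URI reference at the first "#".

    Returns (uri_part, fragment). The fragment is None when no "#"
    appears, and "" when "#" is the last character of the reference.
    """
    uri_part, separator, fragment = reference.partition("#")
    if not separator:
        return uri_part, None
    return uri_part, fragment
```

A reference such as "#top" yields an empty URI part, which Section 4.2 treats as a reference to the current document.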
790 4.1 Fragment Identifier 792 When a URI reference is used to perform a retrieval action on the 793 identified resource, the optional fragment identifier, separated from 794 the URI by a crosshatch ("#") character, consists of additional 795 reference information to be interpreted by the user agent after the 796 retrieval action has been successfully completed. As such, it is not 797 part of a URI, but is often used in conjunction with a URI. 799 fragment = *( pchar / "/" / "?" ) 801 The semantics of a fragment identifier is a property of the data 802 resulting from a retrieval action, regardless of the type of URI used 803 in the reference. Therefore, the format and interpretation of 804 fragment identifiers is dependent on the media type [RFC2046] of the 805 retrieval result. The character restrictions described in Section 2 806 for a URI also apply to the fragment in a URI-reference. Individual 807 media types may define additional restrictions or structure within 808 the fragment for specifying different types of "partial views" that 809 can be identified within that media type. 811 A fragment identifier is only meaningful when a URI reference is 812 intended for retrieval and the result of that retrieval is a document 813 for which the identified fragment is consistently defined. 815 4.2 Same-document References 817 A URI reference that does not contain a URI is a reference to the 818 current document. In other words, an empty URI reference within a 819 document is interpreted as a reference to the start of that document, 820 and a reference containing only a fragment identifier is a reference 821 to the identified fragment of that document. Traversal of such a 822 reference should not result in an additional retrieval action. 
823 However, if the URI reference occurs in a context that is always 824 intended to result in a new request, as in the case of HTML's FORM 825 element [HTML], then an empty URI reference represents the base URI 826 of the current document and should be replaced by that URI when 827 transformed into a request. 829 4.3 Parsing a URI Reference 831 A URI reference is typically parsed according to the four main 832 components and fragment identifier in order to determine what 833 components are present and whether the reference is relative or 834 absolute. The individual components are then parsed for their 835 subparts and, if not opaque, to verify their validity. 837 Although the BNF defines what is allowed in each component, it is 838 ambiguous in terms of differentiating between an authority component 839 and a path component that begins with two slash characters. The 840 greedy algorithm is used for disambiguation: the left-most matching 841 rule soaks up as much of the URI reference string as it is capable of 842 matching. In other words, the authority component wins. 844 Readers familiar with regular expressions should see Appendix B for a 845 concrete parsing example and test oracle. 847 5. Relative URI References 849 It is often the case that a group or "tree" of documents has been 850 constructed to serve a common purpose; the vast majority of URIs in 851 these documents point to resources within the tree rather than 852 outside of it. Similarly, documents located at a particular site are 853 much more likely to refer to other resources at that site than to 854 resources at remote sites. 856 Relative addressing of URIs allows document trees to be partially 857 independent of their location and access scheme. For instance, it is 858 possible for a single set of hypertext documents to be simultaneously 859 accessible and traversable via each of the "file", "http", and "ftp" 860 schemes if the documents refer to each other using relative URIs. 
861 Furthermore, such document trees can be moved, as a whole, without 862 changing any of the relative references. Experience within the WWW 863 has demonstrated that the ability to perform relative referencing is 864 necessary for the long-term usability of embedded URIs. 866 The relative URI syntax takes advantage of the <hier-part> syntax of 867 <absolute-URI> (Section 3) in order to express a reference that is 868 relative to the namespace of another hierarchical URI. 870 relative-URI = [ net-path / abs-path / rel-path ] [ "?" query ] 872 A relative reference beginning with two slash characters is termed a 873 network-path reference, as defined by <net-path> in Section 3. Such 874 references are rarely used. 876 A relative reference beginning with a single slash character is 877 termed an absolute-path reference, as defined by <abs-path> in 878 Section 3. 880 A relative reference that does not begin with a scheme name or a 881 slash character is termed a relative-path reference. 883 rel-path = rel-segment [ abs-path ] 885 rel-segment = 1*( unreserved / escaped / ";" / 886 "@" / "&" / "=" / "+" / "$" / "," ) 888 Within a relative-path reference, the complete path segments "." and 889 ".." have special meanings: "the current hierarchy level" and "the 890 level above this hierarchy level", respectively. Although this is 891 very similar to their use within Unix-based filesystems to indicate 892 directory levels, these path components are only considered special 893 when resolving a relative-path reference to its absolute form 894 (Section 5.2). 896 Authors should be aware that a path segment which contains a colon 897 character cannot be used as the first segment of a relative URI path 898 (e.g., "this:that"), because it would be mistaken for a scheme name. 899 It is therefore necessary to precede such segments with other 900 segments (e.g., "./this:that") in order for them to be referenced as 901 a relative path.
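The effect of a colon in the first segment can be observed with an existing parser. The following Python sketch uses the standard library's urllib.parse module purely as an illustration; it is one implementation's behavior, not a definition, and the base URI shown is an arbitrary example.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: an arbitrary hierarchical base URI chosen for illustration.
BASE = "http://a/b/c/d"

# "this:that" parses with "this" as its scheme, so it is not treated
# as a relative path at all.
mistaken_scheme = urlsplit("this:that").scheme

# Prefixing "./" makes the first segment ".", so the colon is no longer
# in a position where it could end a scheme name.
resolved = urljoin(BASE, "./this:that")
```

Here mistaken_scheme is "this", while resolved is "http://a/b/c/this:that".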
903 It is not necessary for all URIs within a given scheme to be 904 restricted to the <hier-part> syntax, since the hierarchical 905 properties of that syntax are only necessary when a relative URI is 906 used within a particular document. Documents can only make use of a 907 relative URI when their base URI fits within the <hier-part> syntax. 908 It is assumed that any document which contains a relative reference 909 will also have a base URI that obeys the syntax. In other words, a 910 relative URI cannot be used within a document that has an unsuitable 911 base URI. 913 Some URI schemes do not allow a hierarchical syntax matching the 914 <hier-part> syntax, and thus cannot use relative references. 916 5.1 Establishing a Base URI 918 The term "relative URI" implies that there exists some absolute "base 919 URI" against which the relative reference is applied. Indeed, the 920 base URI is necessary to define the semantics of any relative URI 921 reference; without it, a relative reference is meaningless. In order 922 for a relative URI to be usable within a document, the base URI of that 923 document must be known to the parser. 925 The base URI of a document can be established in one of four ways, 926 listed below in order of precedence. The order of precedence can be 927 thought of in terms of layers, where the innermost defined base URI 928 has the highest precedence. This can be visualized graphically as: 930 .----------------------------------------------------------. 931 | .----------------------------------------------------. | 932 | | .----------------------------------------------. | | 933 | | | .----------------------------------------. | | | 934 | | | | .----------------------------------.
| | | | 935 | | | | | | | | | | 936 | | | | `----------------------------------' | | | | 937 | | | | (5.1.1) Base URI embedded in the | | | | 938 | | | | document's content | | | | 939 | | | `----------------------------------------' | | | 940 | | | (5.1.2) Base URI of the encapsulating entity | | | 941 | | | (message, document, or none). | | | 942 | | `----------------------------------------------' | | 943 | | (5.1.3) URI used to retrieve the entity | | 944 | `----------------------------------------------------' | 945 | (5.1.4) Default Base URI is application-dependent | 946 `----------------------------------------------------------' 948 5.1.1 Base URI within Document Content 950 Within certain document media types, the base URI of the document can 951 be embedded within the content itself such that it can be readily 952 obtained by a parser. This can be useful for descriptive documents, 953 such as tables of content, which may be transmitted to others through 954 protocols other than their usual retrieval context (e.g., E-Mail or 955 USENET news). 957 It is beyond the scope of this document to specify how, for each 958 media type, the base URI can be embedded. It is assumed that user 959 agents manipulating such media types will be able to obtain the 960 appropriate syntax from that media type's specification. An example 961 of how the base URI can be embedded in the Hypertext Markup Language 962 (HTML) [HTML] is provided in Appendix D. 964 A mechanism for embedding the base URI within MIME container types 965 (e.g., the message and multipart types) is defined by MHTML 966 [RFC2110]. Protocols that do not use the MIME message header syntax, 967 but which do allow some form of tagged metainformation to be included 968 within messages, may define their own syntax for defining the base 969 URI as part of a message. 
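As one illustration of an embedded base URI, the following Python sketch extracts the value of an HTML BASE element. It assumes the HTML convention of a BASE element carrying an HREF attribute; the class and function names (BaseFinder, embedded_base) are hypothetical, and other media types would need their own mechanism.

```python
from html.parser import HTMLParser


class BaseFinder(HTMLParser):
    """Records the HREF of the first BASE element encountered."""

    def __init__(self):
        super().__init__()
        self.base = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "base" and self.base is None:
            href = dict(attrs).get("href")
            if href:
                self.base = href


def embedded_base(html_text):
    """Return the embedded base URI, or None if the document has none."""
    finder = BaseFinder()
    finder.feed(html_text)
    finder.close()
    return finder.base
```

A result of None means that no base URI is embedded in the content, in which case the rules of Sections 5.1.2 through 5.1.4 apply.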
971 5.1.2 Base URI from the Encapsulating Entity 973 If no base URI is embedded, the base URI of a document is defined by 974 the document's retrieval context. For a document that is enclosed 975 within another entity (such as a message or another document), the 976 retrieval context is that entity; thus, the default base URI of the 977 document is the base URI of the entity in which the document is 978 encapsulated. 980 5.1.3 Base URI from the Retrieval URI 982 If no base URI is embedded and the document is not encapsulated 983 within some other entity (e.g., the top level of a composite entity), 984 then, if a URI was used to retrieve the base document, that URI shall 985 be considered the base URI. Note that if the retrieval was the 986 result of a redirected request, the last URI used (i.e., that which 987 resulted in the actual retrieval of the document) is the base URI. 989 5.1.4 Default Base URI 991 If none of the conditions described in Sections 5.1.1--5.1.3 apply, 992 then the base URI is defined by the context of the application. Since 993 this definition is necessarily application-dependent, failing to 994 define the base URI using one of the other methods may result in the 995 same content being interpreted differently by different types of 996 application. 998 It is the responsibility of the distributor(s) of a document 999 containing a relative URI to ensure that the base URI for that 1000 document can be established. It must be emphasized that a relative 1001 URI cannot be used reliably in situations where the document's base 1002 URI is not well-defined. 1004 5.2 Resolving Relative References to Absolute Form 1006 This section describes an example algorithm for resolving URI 1007 references that might be relative to a given base URI. The algorithm 1008 is intended to provide a definitive result that can be used to test 1009 the output of other implementations. 
Implementation of the algorithm 1010 itself is not required, but the result given by an implementation 1011 must match the result that would be given by this algorithm. 1013 The base URI is established according to the rules of Section 5.1 and 1014 parsed into the four main components as described in Section 3. Note 1015 that only the scheme component is required to be present in the base 1016 URI; the other components may be empty or undefined. A component is 1017 undefined if its preceding separator does not appear in the URI 1018 reference; the path component is never undefined, though it may be 1019 empty. The base URI's query component is not used by the resolution 1020 algorithm and may be discarded. 1022 For each URI reference (R), the following pseudocode describes an 1023 algorithm for transforming R into its target (T), which is either an 1024 absolute URI or the current document, and R's optional fragment: 1026 (R.scheme, R.authority, R.path, R.query, fragment) = parse(R); 1027 -- The URI reference is parsed into the four components and 1028 -- fragment identifier, as described in Section 4.3. 1030 if ((not validating) and (R.scheme == Base.scheme)) then 1031 -- A non-validating parser may ignore a scheme in the 1032 -- reference if it is identical to the base URI's scheme. 
1033 undefine(R.scheme); 1034 endif; 1036 if defined(R.scheme) then 1037 T.scheme = R.scheme; 1038 T.authority = R.authority; 1039 T.path = R.path; 1040 T.query = R.query; 1041 else 1042 if defined(R.authority) then 1043 T.authority = R.authority; 1044 T.path = R.path; 1045 T.query = R.query; 1046 else 1047 if (R.path == "") then 1048 if defined(R.query) then 1049 T.path = Base.path; 1050 T.query = R.query; 1051 else 1052 -- An empty reference refers to the current document 1053 return (current-document, fragment); 1054 endif; 1055 else 1056 if (R.path starts-with "/") then 1057 T.path = R.path; 1058 else 1059 T.path = merge(Base.path, R.path); 1060 endif; 1061 T.query = R.query; 1062 endif; 1063 T.authority = Base.authority; 1064 endif; 1065 T.scheme = Base.scheme; 1066 endif; 1068 return (T, fragment); 1070 The pseudocode above refers to a merge routine for merging a 1071 relative-path reference with the path of the base URI to obtain the 1072 target path. Although there are many ways to do this, we will 1073 describe a simple method using a separate string buffer: 1075 1. All but the last segment of the base URI's path component is 1076 copied to the buffer. In other words, any characters after the 1077 last (right-most) slash character, if any, are excluded. If the 1078 base URI's path component is the empty string, then a single 1079 slash character ("/") is copied to the buffer. 1081 2. The reference's path component is appended to the buffer string. 1083 3. All occurrences of "./", where "." is a complete path segment, 1084 are removed from the buffer string. 1086 4. If the buffer string ends with "." as a complete path segment, 1087 that "." is removed. 1089 5. All occurrences of "<segment>/../", where <segment> is a complete 1090 path segment not equal to "..", are removed from the buffer 1091 string. Removal of these path segments is performed iteratively, 1092 removing the leftmost matching pattern on each iteration, until 1093 no matching pattern remains. 1095 6.
If the buffer string ends with "<segment>/..", where <segment> is 1096 a complete path segment not equal to "..", that "<segment>/.." is 1097 removed. 1099 7. If the resulting buffer string still begins with one or more 1100 complete path segments of "..", then the reference is considered 1101 to be in error. Implementations may handle this error by 1102 retaining these components in the resolved path (i.e., treating 1103 them as part of the final URI), by removing them from the 1104 resolved path (i.e., discarding relative levels above the root), 1105 or by avoiding traversal of the reference. 1107 8. The remaining buffer string is the target URI's path component. 1109 Some systems may find it more efficient to implement the merge 1110 algorithm as a pair of path segment stacks being merged, rather than 1111 as a series of string pattern replacements. 1113 Note: Some WWW client applications will fail to separate the 1114 reference's query component from its path component before merging 1115 the base and reference paths. This may result in a loss of 1116 information if the query component contains the strings "/../" or 1117 "/./". 1119 The resulting target URI components and fragment can be recombined to 1120 provide the absolute form of the URI reference. Using pseudocode, 1121 this would be: 1123 result = "" 1125 if defined(T.scheme) then 1126 append T.scheme to result; 1127 append ":" to result; 1128 endif; 1130 if defined(T.authority) then 1131 append "//" to result; 1132 append T.authority to result; 1133 endif; 1135 append T.path to result; 1137 if defined(T.query) then 1138 append "?"
to result; 1139 append T.query to result; 1140 endif; 1142 if defined(fragment) then 1143 append "#" to result; 1144 append fragment to result; 1145 endif; 1147 return result; 1149 Note that we must be careful to preserve the distinction between a 1150 component that is undefined, meaning that its separator was not 1151 present in the reference, and a component that is empty, meaning that 1152 the separator was present and was immediately followed by the next 1153 component separator or the end of the reference. 1155 Resolution examples are provided in Appendix C. 1157 6. URI Normalization and Comparison 1159 One of the most common operations on URIs is simple comparison: 1160 determining if two URIs are equivalent without using the URIs to 1161 access their respective resource(s). A comparison is performed every 1162 time a response cache is accessed, a browser checks its history to 1163 color a link, or an XML parser processes tags within a namespace. 1164 Extensive normalization prior to comparison of URIs is often used by 1165 spiders and indexing engines to prune a search space or reduce 1166 duplication of request actions and response storage. 1168 URI comparison is performed in respect to some particular purpose, 1169 and software with differing purposes will often be subject to 1170 differing design trade-offs in regards to how much effort should be 1171 spent in reducing duplicate identifiers. This section describes a 1172 variety of methods that may be used to compare URIs, the trade-offs 1173 between them, and the types of applications that might use them. 1175 6.1 URI Equivalence 1177 Since URIs exist to identify resources, presumably they should be 1178 considered equivalent when they identify the same resource. However, 1179 such a definition of equivalence is not of much practical use, since 1180 there is no way for software to compare two resources without 1181 knowledge of their origin. 
For this reason, determination of 1182 equivalence or difference of URIs is based on string comparison, 1183 perhaps augmented by reference to additional rules provided by URI 1184 scheme definitions. We use the terms "different" and "equivalent" to 1185 describe the possible outcomes of such comparisons, but there are 1186 many application-dependent versions of equivalence. 1188 Even though it is possible to determine that two URIs are equivalent, 1189 it is never possible to be sure that two URIs identify different 1190 resources. Therefore, comparison methods are designed to minimize 1191 false negatives while strictly avoiding false positives. 1193 In testing for equivalence, it is generally unwise to directly 1194 compare relative URI references; they should be converted to their 1195 absolute forms before comparison. Furthermore, when URI references 1196 are being compared for the purpose of selecting (or avoiding) a 1197 network action, such as retrieval of a representation, it is often 1198 necessary to separate fragment identifiers from the URIs prior to 1199 comparison. 1201 6.2 Comparison Ladder 1203 A variety of methods are used in practice to test URI equivalence. 1204 These methods fall into a range, distinguished by the amount of 1205 processing required and the degree to which the probability of false 1206 negatives is reduced. As noted above, false negatives cannot in 1207 principle be eliminated. In practice, their probability can be 1208 reduced, but this reduction requires more processing and is not 1209 cost-effective for all applications. 1211 If this range of comparison practices is considered as a ladder, the 1212 following discussion will climb the ladder, starting with those that 1213 are cheap but have a relatively higher chance of producing false 1214 negatives, and proceeding to those that have higher computational 1215 cost and lower risk of false negatives. 
1217 6.2.1 Simple String Comparison 1219 If two URIs, considered as character strings, are identical, then it 1220 is safe to conclude that they are equivalent. This type of 1221 equivalence test has very low computational cost and is in wide use 1222 in a variety of applications, particularly in the domain of parsing. 1224 Testing strings for equivalence requires some basic precautions. This 1225 procedure is often referred to as "bit-for-bit" or "byte-for-byte" 1226 comparison, which is potentially misleading. Testing of strings for 1227 equality is normally based on pairwise comparison of the characters 1228 that make up the strings, starting from the first and proceeding 1229 until both strings are exhausted and all characters found to be 1230 equal, or a pair of characters compares unequal or one of the strings 1231 is exhausted before the other. 1233 Such character comparisons require that each pair of characters be 1234 put in comparable form. For example, should one URI be stored in a 1235 byte array in EBCDIC encoding, and the second be in a Java String 1236 object, bit-for-bit comparisons applied naively will produce both 1237 false-positive and false-negative errors. Thus, in principle, it is 1238 better to speak of equality on a character-for-character rather than 1239 byte-for-byte or bit-for-bit basis. 1241 Unicode defines a character as being identified by number 1242 ("codepoint") with an associated bundle of visual and other 1243 semantics. At the software level, it is not practical to compare 1244 semantic bundles, so in practical terms, character-by-character 1245 comparisons are done codepoint-by-codepoint. 1247 6.2.2 Syntax-based Normalization 1249 Software may use logic based on the definitions provided by this 1250 specification to reduce the probability of false negatives. Such 1251 processing is (moderately) higher in cost than 1252 character-for-character string comparison. 
For example, an 1253 application using this approach could reasonably consider the 1254 following two URIs equivalent: 1256 example://a/b/c/%7A 1257 eXAMPLE://a/./b/../b/c/%7a 1259 Web user agents, such as browsers, typically apply this type of URI 1260 normalization when determining whether a cached response is 1261 available. Syntax-based normalization includes such techniques as 1262 case normalization, escape normalization, and removal of leftover 1263 relative path segments. 1265 6.2.2.1 Case Normalization 1267 When a URI scheme uses elements of the common syntax, it will also 1268 use the common syntax equivalence rules, namely that the scheme and 1269 hostname are case insensitive and therefore can be normalized to 1270 lowercase. For example, the URI is 1271 equivalent to . 1273 6.2.2.2 Escape Normalization 1275 The %-escape mechanism described in Section 2.4 is a frequent source 1276 of variance among otherwise identical URIs. One cause is the choice 1277 of upper-case or lower-case letters for the hexadecimal digits within 1278 the escape sequence (e.g., "%3a" versus "%3A"). Such sequences are 1279 always equivalent; for the sake of uniformity, URI generators and 1280 normalizers are strongly encouraged to use upper-case letters for the 1281 hex digits A-F. 1283 Only characters that are excluded from or reserved within the URI 1284 syntax must be escaped when used as data. However, some URI 1285 generators go beyond that and escape characters that do not require 1286 escaping, resulting in URIs that are equivalent to their unescaped 1287 counterparts. Such URIs can be normalized by unescaping sequences 1288 that represent the unreserved characters, as described in Section 1289 2.3. 1291 6.2.2.3 Path Segment Normalization 1293 The complete path segments "." and ".." have a special meaning within 1294 hierarchical URI schemes.
   As such, they should not appear in absolute URI paths; if they are
   found, they can be removed by splitting the URI just after the "/"
   that starts the path, using the left half as the base URI and the
   right as a relative reference, and normalizing the URI by merging the
   two in accordance with the relative URI processing algorithm
   (Section 5).

6.2.3 Scheme-based Normalization

   The syntax and semantics of URIs vary from scheme to scheme, as
   described by the defining specification for each scheme. Software
   may use scheme-specific rules, at further processing cost, to reduce
   the probability of false negatives. For example, Web spiders that
   populate most large search engines would consider the following two
   URIs to be equivalent:

      http://example.com/
      http://example.com:80/

   This behavior is based on the rules provided by the syntax and
   semantics of the "http" URI scheme, which defines an empty port
   component as being equivalent to the default TCP port for HTTP (port
   80). In general, a URI scheme that uses the generic syntax of
   hostport is defined such that a URI with an explicit ":port", where
   the port is the default for the scheme, is equivalent to one where
   the port is elided.

6.2.4 Protocol-based Normalization

   Web spiders, for which substantial effort to reduce the incidence of
   false negatives is often cost-effective, are observed to implement
   even more aggressive techniques in URI comparison. For example, if
   they observe that a URI such as

      http://example.com/data

   redirects to

      http://example.com/data/

   they will likely regard the two as equivalent in the future.
   Obviously, this kind of technique is only appropriate in special
   situations.
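   The syntax-based and scheme-based steps of Sections 6.2.2 and 6.2.3
   can be sketched in a few lines of Python. This is an illustrative
   sketch, not part of the specification. The function names, the
   DEFAULT_PORTS table, and the choice of unreserved set (letters,
   digits, "-", ".", "_", "~") are assumptions made for the example, and
   the dot-segment removal is a simplified stand-in for the Section 5
   algorithm that leaves excess ".." segments in place.

```python
import re
import string

# Regular expression from Appendix B; groups 2, 4, 5, 7, 9 are the components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
ESCAPE_RE = re.compile(r"%([0-9A-Fa-f]{2})")

# Assumption: a conservative unreserved set chosen for this sketch.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Assumption: illustrative table; only the "http" entry comes from the text.
DEFAULT_PORTS = {"http": "80"}


def normalize_escapes(text):
    """Upper-case hex digits and unescape unreserved characters (6.2.2.2)."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return ESCAPE_RE.sub(fix, text)


def remove_dot_segments(path):
    """Drop complete "." and ".." segments from an absolute path (6.2.2.3)."""
    if not path.startswith("/"):
        return path
    segments = path.split("/")[1:]
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                out.append("")          # keep the trailing slash
        elif segment == ".." and out and out[-1] != "..":
            out.pop()
            if is_last:
                out.append("")
        else:
            out.append(segment)         # includes ".." with nothing to pop
    return "/" + "/".join(out)


def normalize(uri):
    match = URI_RE.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    if scheme is not None:
        scheme = scheme.lower()                      # 6.2.2.1
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, colon, port = hostport.rpartition(":")
        if not colon or (port and not port.isdigit()):
            host, port = hostport, ""                # no port present
        if port == DEFAULT_PORTS.get(scheme):
            port = ""                                # 6.2.3
        authority = userinfo + at + host.lower() + (":" + port if port else "")
    path = normalize_escapes(path)
    if scheme is not None:
        path = remove_dot_segments(path)
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + normalize_escapes(query)
    if fragment is not None:
        result += "#" + normalize_escapes(fragment)
    return result
```

   Under these assumptions, normalize("eXAMPLE://a/./b/../b/c/%7a") and
   normalize("example://a/b/c/%7A") yield the same string, as do
   normalize("http://example.com:80/") and
   normalize("http://example.com/"). Two URIs can then be compared with
   the simple string comparison of Section 6.2.1.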
6.3 Good Practice When Using URIs

   It is in the best interests of everyone to avoid false negatives in
   comparing URIs, and to require only the minimum amount of software
   processing for such comparisons. Those who generate and make
   reference to URIs can reduce the cost of processing and the risk of
   false negatives by consistently providing them in a form that is
   reasonably canonical with respect to their scheme. Specifically:

      Always provide the URI scheme in lower-case characters.

      Always provide the hostname, if any, in lower-case characters.

      Only perform %-escaping where it is essential.

      Always use upper-case A-through-F characters when %-escaping.

      Use the UTF-8 character-to-octet mapping, whenever possible.

      Prevent /./ and /../ from appearing in absolute URI paths.

   The choices listed above are motivated by observations that a high
   proportion of deployed software already uses these techniques in
   practice for the purposes of normalization.

7. Security Considerations

   A URI does not in itself pose a security threat. However, since URIs
   are often used to provide a compact set of instructions for access to
   network resources, care must be taken to properly interpret the data
   within a URI, to prevent that data from causing unintended access,
   and to avoid including data that should not be revealed in plain
   text.

7.1 Reliability and Consistency

   There is no guarantee that, having once used a given URI to retrieve
   some information, the same information will be retrievable by that
   URI in the future. Nor is there any guarantee that the information
   retrievable via that URI in the future will be observably similar to
   that retrieved in the past. The URI syntax does not constrain how a
   given scheme or authority apportions its namespace or maintains it
   over time.
   Such a guarantee can only be obtained from the person(s) controlling
   that namespace and the resource in question. A specific URI scheme
   may define additional semantics, such as name persistence, if those
   semantics are required of all naming authorities for that scheme.

7.2 Malicious Construction

   It is sometimes possible to construct a URI such that an attempt to
   perform a seemingly harmless, idempotent operation, such as the
   retrieval of a representation associated with a resource, will in
   fact cause a possibly damaging remote operation to occur. The unsafe
   URI is typically constructed by specifying a port number other than
   that reserved for the network protocol in question. The client
   unwittingly contacts a site that is in fact running a different
   protocol. The content of the URI contains instructions that, when
   interpreted according to this other protocol, cause an unexpected
   operation. An example has been the use of a gopher URI to cause an
   unintended or impersonating message to be sent via an SMTP server.

   Caution should be used when using any URI that specifies a TCP port
   number other than the default for the protocol, especially when it is
   a number within the reserved space.

   Care should be taken when a URI contains escaped delimiters for a
   given protocol (for example, CR and LF characters for telnet
   protocols) that these are not unescaped before transmission. This
   might violate the protocol, but avoids the potential for such
   characters to be used to simulate an extra operation or parameter in
   that protocol, which might lead to an unexpected and possibly harmful
   remote operation being performed.
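   The two cautions above could be applied as a pre-flight check before
   a URI is dereferenced. The following Python sketch is illustrative
   only. The function name audit(), the DEFAULT_PORTS table, and the
   reading of "reserved space" as ports below 1024 are assumptions made
   for the example; this document does not define them.

```python
import re

# Scheme and authority portions of the Appendix B regular expression.
PREFIX_RE = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?")
# Escaped CR (%0D) or LF (%0A), in either hex-digit case.
ESCAPED_CRLF_RE = re.compile(r"%0[AaDd]")

# Assumption: a small illustrative table of default ports per scheme.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70}
# Assumption: "reserved space" is read here as ports below 1024.
RESERVED_LIMIT = 1024


def audit(uri):
    """Return a list of human-readable warnings for the given URI."""
    findings = []
    match = PREFIX_RE.match(uri)
    scheme = (match.group(1) or "").lower()
    authority = match.group(2) or ""

    # Ignore any userinfo, then look for a numeric port after the host.
    hostport = authority.rpartition("@")[2]
    host, colon, port = hostport.rpartition(":")
    if colon and port.isdigit():
        number = int(port)
        if number != DEFAULT_PORTS.get(scheme):
            findings.append("non-default port %d" % number)
            if number < RESERVED_LIMIT:
                findings.append("port is within the reserved range")

    # Flag escaped line delimiters; they are only reported, never unescaped.
    if ESCAPED_CRLF_RE.search(uri):
        findings.append("escaped CR or LF present")
    return findings
```

   For instance, a gopher URI that names port 25 and carries "%0D%0A"
   in its path would be reported on all three counts, whereas
   "http://example.com:80/" would produce no warnings. A check of this
   kind only flags candidates for closer inspection; it does not decide
   whether a URI is safe.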
7.3 Rare IP Address Formats

   Although the URI syntax for IPv4address only allows the common,
   dotted-decimal form of IPv4 address literal, many implementations
   that process URIs make use of platform-dependent system routines,
   such as gethostbyname() and inet_aton(), to translate the string
   literal to an actual IP address. Unfortunately, such system routines
   often allow and process a much larger set of formats than those
   described in Section 3.2.2.

   For example, many implementations allow dotted forms of three
   numbers, wherein the last part is interpreted as a 16-bit quantity
   and placed in the right-most two bytes of the network address (e.g.,
   a Class B network). Likewise, a dotted form of two numbers means the
   last part is interpreted as a 24-bit quantity and placed in the
   right-most three bytes of the network address (Class A), and a single
   number (without dots) is interpreted as a 32-bit quantity and stored
   directly in the network address. Adding further to the confusion,
   some implementations allow each dotted part to be interpreted as
   decimal, octal, or hexadecimal, as specified in the C language (i.e.,
   a leading 0x or 0X implies hexadecimal; otherwise, a leading 0
   implies octal; otherwise, the number is interpreted as decimal).

   These additional IP address formats are not allowed in the URI syntax
   due to differences between platform implementations. However, they
   can become a security concern if an application attempts to filter
   access to resources based on the IP address in string literal format.
   If such filtering is performed, it is recommended that literals be
   converted to numeric form and filtered based on the numeric value,
   rather than a prefix or suffix of the string form.

7.4 Sensitive Information

   It is clearly unwise to use a URI that contains a password which is
   intended to be secret.
   In particular, the use of a password within the userinfo component
   of a URI is strongly discouraged except in those rare cases where
   the 'password' parameter is intended to be public.

7.5 Semantic Attacks

   Because the userinfo component is rarely used and appears before the
   hostname in the authority component, it can be used to construct a
   URI that is intended to mislead a human user by appearing to identify
   one (trusted) naming authority while actually identifying a different
   authority hidden behind the noise. For example

      http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm

   might lead a human user to assume that the authority is
   'www.example.com', whereas it is actually '10.0.0.1'. Note that the
   misleading userinfo could be much longer than the example above.

   A misleading URI, such as the one above, is an attack on the user's
   preconceived notions about the meaning of a URI, rather than an
   attack on the software itself. User agents may be able to reduce the
   impact of such attacks by visually distinguishing the various
   components of the URI when rendered, such as by using a different
   color or tone to render userinfo if any is present, though there is
   no general panacea. More information on URI-based semantic attacks
   can be found in [Siedzik].

8. Acknowledgements

   This document is derived from RFC 2396 [RFC2396], RFC 1808 [RFC1808],
   and RFC 1738 [RFC1738]; the acknowledgements in those specifications
   still apply. It also incorporates the update (with corrections) for
   IPv6 literals in the host syntax, as defined by Robert M. Hinden,
   Brian E. Carpenter, and Larry Masinter in [RFC2732]. In addition,
   contributions by Reese Anschultz, Tim Bray, Dan Connolly, Adam M.
   Costello, Jason Diamond, Martin Duerst, Henry Holtzman, Graham Klyne,
   Dan Kohn, Bruce Lilly, Michael Mealling, Julian Reschke, Tomas
   Rokicki, Miles Sabin, Ronald Tschalaer, Marc Warne, Henry Zongaro,
   and Zefram are gratefully acknowledged.

Normative References

   [ASCII]    American National Standards Institute, "Coded Character
              Set -- 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

   [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 2234, November 1997.

Non-normative References

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
              Unifying Syntax for the Expression of Names and Addresses
              of Objects on the Network as used in the World-Wide Web",
              RFC 1630, June 1994.

   [RFC1738]  Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
              Resource Locators (URL)", RFC 1738, December 1994.

   [RFC2396]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators", RFC
              1808, June 1995.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D.
              Jensen, "HTTP Extensions for Distributed Authoring --
              WEBDAV", RFC 2518, February 1999.

   [RFC0952]  Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC2373]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 2373, July 1998.
   [RFC2732]  Hinden, R., Carpenter, B. and L. Masinter, "Format for
              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.

   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
              Resource Locators", RFC 1736, February 1995.

   [RFC1737]  Masinter, L. and K. Sollins, "Functional Requirements for
              Uniform Resource Names", RFC 1737, December 1994.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC2110]  Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of
              Aggregate Documents, such as HTML (MHTML)", RFC 2110,
              March 1997.

   [RFC2717]  Petke, R. and I. King, "Registration Procedures for URL
              Scheme Names", BCP 35, RFC 2717, November 1999.

   [HTML]     Raggett, D., Le Hors, A. and I. Jacobs, "Hypertext Markup
              Language (HTML 4.01) Specification", December 1999.

   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?", April
              2001.

   [UTF-8]    Yergeau, F., "UTF-8, a transformation format of ISO
              10646", RFC 2279, January 1998.

Authors' Addresses

   Tim Berners-Lee
   World Wide Web Consortium
   MIT/LCS, Room NE43-356
   200 Technology Square
   Cambridge, MA 02139
   USA

   Phone: +1-617-253-5702
   Fax:   +1-617-258-5999
   EMail: timbl@w3.org
   URI:   http://www.w3.org/People/Berners-Lee/

   Roy T. Fielding
   Day Software
   2 Corporate Plaza, Suite 150
   Newport Beach, CA 92660
   USA

   Phone: +1-949-999-2523
   Fax:   +1-949-644-5064
   EMail: roy.fielding@day.com
   URI:   http://www.apache.org/~fielding/

   Larry Masinter
   Adobe Systems Incorporated
   345 Park Ave
   San Jose, CA 95110
   USA

   Phone: +1-408-536-3024
   EMail: LMM@acm.org
   URI:   http://larry.masinter.net/

Appendix A. Collected BNF for URI

   To be filled-in later.

Appendix B. Parsing a URI Reference with a Regular Expression

   As described in Section 4.3, the generic URI syntax is not sufficient
   to disambiguate the components of some forms of URI. Since the
   "greedy algorithm" described in that section is identical to the
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential four components and fragment identifier of a URI reference.

   The following line is the regular expression for breaking down a URI
   reference into its components.

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis). We refer to the value matched for subexpression
   <n> as $<n>. For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example. Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

   and, going in the opposite direction, we can recreate a URI reference
   from its components using the algorithm of Section 5.2.

Appendix C. Examples of Resolving Relative URI References

   Within an object with a well-defined base URI of

      http://a/b/c/d;p?q

   the relative URI would be resolved as follows:

C.1 Normal Examples

      g:h           =  g:h
      g             =  http://a/b/c/g
      ./g           =  http://a/b/c/g
      g/            =  http://a/b/c/g/
      /g            =  http://a/g
      //g           =  http://g
      ?y            =  http://a/b/c/d;p?y
      g?y           =  http://a/b/c/g?y
      #s            =  (current document)#s
      g#s           =  http://a/b/c/g#s
      g?y#s         =  http://a/b/c/g?y#s
      ;x            =  http://a/b/c/;x
      g;x           =  http://a/b/c/g;x
      g;x?y#s       =  http://a/b/c/g;x?y#s
      .             =  http://a/b/c/
      ./            =  http://a/b/c/
      ..            =  http://a/b/
      ../           =  http://a/b/
      ../g          =  http://a/b/g
      ../..         =  http://a/
      ../../        =  http://a/
      ../../g       =  http://a/g

C.2 Abnormal Examples

   Although the following abnormal examples are unlikely to occur in
   normal practice, all URI parsers should be capable of resolving them
   consistently. Each example uses the same base as above.

   An empty reference refers to the start of the current document.

      <>            =  (current document)

   Parsers must be careful in handling the case where there are more
   relative path ".." segments than there are hierarchical levels in the
   base URI's path. Note that the ".." syntax cannot be used to change
   the authority component of a URI.

      ../../../g    =  http://a/../g
      ../../../../g =  http://a/../../g

   In practice, some implementations strip leading relative symbolic
   elements (".", "..") after applying a relative URI calculation, based
   on the theory that compensating for obvious author errors is better
   than allowing the request to fail. Thus, the above two references
   will be interpreted as "http://a/g" by some implementations.

   Similarly, parsers must avoid treating "." and ".." as special when
   they are not complete components of a relative path.

      /./g          =  http://a/./g
      /../g         =  http://a/../g
      g.            =  http://a/b/c/g.
      .g            =  http://a/b/c/.g
      g..           =  http://a/b/c/g..
      ..g           =  http://a/b/c/..g

   Less likely are cases where the relative URI uses unnecessary or
   nonsensical forms of the "." and ".." complete path segments.

      ./../g        =  http://a/b/g
      ./g/.         =  http://a/b/c/g/
      g/./h         =  http://a/b/c/g/h
      g/../h        =  http://a/b/c/h
      g;x=1/./y     =  http://a/b/c/g;x=1/y
      g;x=1/../y    =  http://a/b/c/y

   Some applications fail to separate the reference's query and/or
   fragment components from a relative path before merging it with the
   base path. This error is rarely noticed, since typical usage of a
   fragment never includes the hierarchy ("/") character, and the query
   component is not normally used within relative references.

      g?y/./x       =  http://a/b/c/g?y/./x
      g?y/../x      =  http://a/b/c/g?y/../x
      g#s/./x       =  http://a/b/c/g#s/./x
      g#s/../x      =  http://a/b/c/g#s/../x

   Some parsers allow the scheme name to be present in a relative URI if
   it is the same as the base URI scheme. This is considered to be a
   loophole in prior specifications of partial URI [RFC1630]. Its use
   should be avoided, but is allowed for backwards compatibility.

      http:g        =  http:g           ; for validating parsers
                    /  http://a/b/c/g   ; for backwards compatibility

Appendix D. Embedding the Base URI in HTML documents

   It is useful to consider an example of how the base URI of a document
   can be embedded within the document's content. In this appendix, we
   describe how documents written in the Hypertext Markup Language
   (HTML) [HTML] can include an embedded base URI. This appendix does
   not form a part of the URI specification and should not be considered
   as anything more than a descriptive example.

   HTML defines a special element "BASE" which, when present in the
   "HEAD" portion of a document, signals that the parser should use the
   BASE element's "HREF" attribute as the base URI for resolving any
   relative URI.
   The "HREF" attribute must be an absolute URI. Note that, in HTML,
   element and attribute names are case-insensitive. For example:

      An example HTML document
      ... a hypertext anchor ...

   A parser reading the example document should interpret the given
   relative URI "../x" as representing the absolute URI

   regardless of the context in which the example document was obtained.

Appendix E. Recommendations for Delimiting URI in Context

   URIs are often transmitted through formats that do not provide a
   clear context for their interpretation. For example, there are many
   occasions when a URI is included in plain text; examples include text
   sent in electronic mail, USENET news messages, and, most importantly,
   printed on paper. In such cases, it is important to be able to
   delimit the URI from the rest of the text, and in particular from
   punctuation marks that might be mistaken for part of the URI.

   In practice, URIs are delimited in a variety of ways, but usually
   within double-quotes "http://example.com/", angle brackets
   <http://example.com/>, or just using whitespace

      http://example.com/

   These wrappers do not form part of the URI.

   In the case where a fragment identifier is associated with a URI
   reference, the fragment would be placed within the brackets as well
   (separated from the URI with a "#" character).

   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
   need to be added to break a long URI across lines. The whitespace
   should be ignored when extracting the URI.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of line when breaking a line, the interpreter of a
   URI containing a line break immediately after a hyphen should ignore
   all unescaped whitespace around the line break, and should be aware
   that the hyphen may or may not actually be part of the URI.

   Using <> angle brackets around each URI is especially recommended as
   a delimiting style for a URI that contains whitespace.

   The prefix "URL:" (with or without a trailing space) was formerly
   recommended as a way to help distinguish a URI from other bracketed
   designators, though it is not commonly used in practice and is no
   longer recommended.

   For robustness, software that accepts user-typed URIs should attempt
   to recognize and strip both delimiters and embedded whitespace.

   For example, the text:

      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
      but you can probably pick it up from <ftp://ds.internic.net/rfc/>.
      Note the warning in
      <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>.

   contains the URI references

      http://www.w3.org/Addressing/
      ftp://ds.internic.net/rfc/
      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING

Appendix F. Abbreviated URIs

   The URI syntax was designed for unambiguous reference to network
   resources and extensibility via the URI scheme. However, as URI
   identification and usage have become commonplace, traditional media
   (television, radio, newspapers, billboards, etc.) have increasingly
   used abbreviated URI references. That is, a reference consisting of
   only the authority and path portions of the identified resource, such
   as

      www.w3.org/Addressing/

   or simply the DNS hostname on its own.
   Such references are primarily intended for human interpretation
   rather than machine, with the assumption that context-based
   heuristics are sufficient to complete the URI (e.g., most hostnames
   beginning with "www" are likely to have a URI prefix of "http://").
   Although there is no standard set of heuristics for disambiguating
   abbreviated URI references, many client implementations allow them to
   be entered by the user and heuristically resolved. It should be noted
   that such heuristics may change over time, particularly when new URI
   schemes are introduced.

   Since an abbreviated URI has the same syntax as a relative URI path,
   abbreviated URI references cannot be used in contexts where relative
   URIs are expected. This limits the use of abbreviated URIs to places
   where there is no defined base URI, such as dialog boxes and off-line
   advertisements.

Appendix G. Summary of Non-editorial Changes

G.1 Additions

   IPv6 literals have been added to the list of possible identifiers for
   the host portion of a server component, as described by [RFC2732],
   with the addition of "[" and "]" to the reserved, uric, and
   uric-no-slash sets. Square brackets are now specified as reserved
   for the authority component, allowed within the opaque part of an
   opaque URI, and not allowed in the hierarchical syntax except for
   their use as delimiters for an IPv6reference within host. In order
   to make this change without changing the technical definition of the
   path, query, and fragment components, those rules were redefined to
   directly specify the characters allowed rather than continuing to be
   defined in terms of uric.

   Since [RFC2732] defers to [RFC2373] for definition of an IPv6 literal
   address, which unfortunately has an incorrect ABNF description of
   IPv6address, we created a new ABNF rule for IPv6address that matches
   the text representations defined by Section 2.2 of [RFC2373].
   Likewise, the definition of IPv4address has been improved in order to
   limit each decimal octet to the range 0-255, and the definition of
   hostname has been improved to better specify length limitations and
   partially-qualified domain names.

   Section 6 on URI normalization and comparison has been completely
   rewritten and extended using input from Tim Bray and discussion
   within the W3C Technical Architecture Group.

G.2 Modifications from RFC 2396

   The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234].
   This change required all rule names that formerly included underscore
   characters to be renamed with a dash instead. Likewise, absoluteURI
   and relativeURI have been changed to absolute-URI and relative-URI,
   respectively, for consistency.

   The ABNF of hier-part and relative-URI (Section 3) has been corrected
   to allow a relative URI path to be empty. This also allows an
   absolute-URI to consist of nothing after the "scheme:", as is present
   in practice with the "DAV:" namespace [RFC2518] and the "about:" URI
   used by many browser implementations.

   The ABNF of qualified has been simplified to remove a parsing
   ambiguity without changing the allowed syntax.

   The resolving relative references algorithm of [RFC2396] has been
   rewritten using pseudocode for this revision to improve clarity and
   fix the following issues:

   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
      with no path.

   o  Restored the behavior of [RFC1808] where, if the reference
      contains an empty path and a defined query component, then the
      target URI inherits the base URI's path component.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other
   rights that might be claimed to pertain to the implementation or use
   of the technology described in this document or the extent to which
   any license under such rights might or might not be available;
   neither does it represent that it has made any effort to identify any
   such rights. Information on the IETF's procedures with respect to
   rights in standards-track and standards-related documentation can be
   found in BCP-11. Copies of claims of rights made available for
   publication and any assurances of licenses to be made available, or
   the result of an attempt made to obtain a general license or
   permission for the use of such proprietary rights by implementors or
   users of this specification can be obtained from the IETF
   Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
However, this 2070 document itself may not be modified in any way, such as by removing 2071 the copyright notice or references to the Internet Society or other 2072 Internet organizations, except as needed for the purpose of 2073 developing Internet standards in which case the procedures for 2074 copyrights defined in the Internet Standards process must be 2075 followed, or as required to translate it into languages other than 2076 English. 2078 The limited permissions granted above are perpetual and will not be 2079 revoked by the Internet Society or its successors or assignees. 2081 This document and the information contained herein is provided on an 2082 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 2083 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 2084 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 2085 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 2086 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 2088 Acknowledgement 2090 Funding for the RFC Editor function is currently provided by the 2091 Internet Society.