URIs and URLs

Andreas Otte <andreas.otte@debitel.net>
January 1, 2002

Overview

Handling network and locally retrievable resources is a central part of Necko. Resources are identified by a URI ("Uniform Resource Identifier"). The following definitions are taken from RFC 2396:

Uniform

Uniformity provides several benefits: it allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ; it allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers; it allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used; and, it allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely-used set of resource identifiers.

Resource

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources. The resource is the conceptual mapping to an entity or set of entities, not necessarily the entity which corresponds to that mapping at any particular instance in time. Thus, a resource can remain constant even when its content---the entities to which it currently corresponds---changes over time, provided that the conceptual mapping is not changed in the process.

Identifier

An identifier is an object that can act as a reference to something that has identity. In the case of URI, the object is a sequence of characters with a restricted syntax.

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource.

...

The URI scheme defines the namespace of the URI, and thus may further restrict the syntax and semantics of identifiers using that scheme. Although many URL schemes are named after protocols, this does not imply that the only way to access the URL's resource is via the named protocol. Gateways, proxies, caches, and name resolution services might be used to access some resources, independent of the protocol of their origin, and the resolution of some URL may require the use of more than one protocol (e.g., both DNS and HTTP are typically used to access an "http" URL's resource when it can't be found in a local cache).

In Necko every URI scheme is represented by a protocol handler. Sometimes a protocol handler represents more than one scheme. The protocol handler provides scheme-specific information and methods to create new URIs of the schemes it supports. One of the main Necko goals is to provide "pluggable" protocol support. This means that it should be possible to add new protocols to Necko just by implementing nsIProtocolHandler and nsIChannel. A new protocol may also require a new urlparser, but often it does not, because Necko already provides URI implementations that can deal with a number of schemes by implementing the generic urlparser defined in RFC 2396.

nsIURI and nsIURL

Strictly speaking, Necko only knows URLs; URIs as defined above are much too generic to be properly represented inside a library.

There are however two interfaces which loosely relate to the distinction between URI and URL as per the above definition: nsIURI and nsIURL.

nsIURI represents access to a very simple, very generic form of a URL. Simply put, such a URL consists of a scheme and a non-scheme part, separated by a colon, as in URLs of the "about" scheme. nsIURL inherits from nsIURI and represents access to typical URLs with schemes like "http", "ftp", ...

nsSimpleURI

One implementation of nsIURI is nsSimpleURI, which is the basis for protocols like "about". nsSimpleURI contains setters and getters for the URI and the components of a URI: scheme and path (non-scheme). There are no prewritten urlparsers for simple URIs, because their structure is so simple.
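Splitting a simple URI really does need very little code, as the following Python sketch suggests. The function name `split_simple_uri` is hypothetical, and lowercasing the scheme is a choice made here (scheme names are case-insensitive under RFC 2396), not a statement about what nsSimpleURI does.

```python
def split_simple_uri(spec):
    """Split a simple URI into (scheme, path) at the first colon."""
    scheme, colon, path = spec.partition(":")
    if not colon or not scheme:
        raise ValueError("a simple URI needs a scheme followed by ':'")
    # Everything after the first colon is the opaque non-scheme part.
    return scheme.lower(), path
```

For example, `split_simple_uri("about:blank")` yields the scheme `about` and the path `blank`; no further structure is assumed inside the path.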

nsStandardURL

The most important implementation of nsIURL is nsStandardURL which is the basis for protocols like http, ftp, ...

These schemes support a hierarchical naming system, where the hierarchy of the name is denoted by a "/" delimiter separating the components in the path. nsStandardURL also contains the facilities to parse this type of URL and to break the specification of the URL (the "spec") down into its most basic segments.

The spec consists of prepath and path. The prepath consists of scheme and authority. The authority consists of prehost, host and port. The prehost consists of username and password. The path consists of directory, filename, param, query and ref. The filename consists of filebasename and fileextension.

If the spec is completely broken down, it consists of: scheme, username, password, host, port, directory, filebasename, fileextension, param, query and ref. Together these segments form the URL spec with the following syntax:

scheme://username:password@host:port/directory/filebasename.fileextension;param?query#ref
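The breakdown described above can be sketched in Python as follows. This is an illustration only, not the nsStandardURL parser: the function name `split_spec` is hypothetical, the input is assumed to be well formed and to contain "://", bracketed IPv6 hosts are not handled, and keeping the leading and trailing "/" in the directory segment is a choice made for this sketch.

```python
def split_spec(spec):
    """Break a hierarchical URL spec into its most basic segments."""
    names = ("scheme", "username", "password", "host", "port", "directory",
             "filebasename", "fileextension", "param", "query", "ref")
    seg = dict.fromkeys(names, "")

    # spec = prepath + path, where prepath = scheme + "://" + authority
    seg["scheme"], _, rest = spec.partition("://")
    slash = rest.find("/")
    if slash < 0:
        authority, path = rest, ""
    else:
        authority, path = rest[:slash], rest[slash:]

    # authority = prehost + "@" + host + ":" + port
    prehost, at, hostport = authority.rpartition("@")
    if at:
        seg["username"], _, seg["password"] = prehost.partition(":")
    seg["host"], _, seg["port"] = hostport.partition(":")

    # path = directory + filename + ";" param + "?" query + "#" ref
    path, _, seg["ref"] = path.partition("#")
    path, _, seg["query"] = path.partition("?")
    path, _, seg["param"] = path.partition(";")
    directory, last_slash, filename = path.rpartition("/")
    seg["directory"] = directory + last_slash

    # filename = filebasename + "." + fileextension
    base, dot, ext = filename.rpartition(".")
    if dot:
        seg["filebasename"], seg["fileextension"] = base, ext
    else:
        seg["filebasename"] = filename
    return seg
```

Applied to the syntax line above, each placeholder ends up under the key of the same name, with the directory segment holding "/directory/".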

For performance reasons the complete spec is stored in escaped form in the nsStandardURL object, together with pointers (position and length) to each basic segment and to the more global segments such as path and prehost.
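The storage scheme can be pictured with a minimal Python sketch: the spec string is kept once, and each segment is recorded only as a (position, length) pair into it. The class `SegmentedSpec` and its layout are hypothetical, and only a few of the more global segments are computed here; the real object tracks every basic segment.

```python
class SegmentedSpec:
    """Keep one spec string plus (position, length) pairs per segment."""

    def __init__(self, spec):
        self.spec = spec
        self.segments = {}  # segment name -> (position, length)

        scheme_end = spec.index("://")
        auth_start = scheme_end + 3
        path_start = spec.find("/", auth_start)
        if path_start < 0:
            path_start = len(spec)

        self.segments["scheme"] = (0, scheme_end)
        self.segments["authority"] = (auth_start, path_start - auth_start)
        self.segments["prepath"] = (0, path_start)
        self.segments["path"] = (path_start, len(spec) - path_start)

    def get(self, name):
        """Return a segment by slicing the stored spec; nothing is copied
        until a caller actually asks for the segment."""
        position, length = self.segments[name]
        return self.spec[position:position + length]
```

The point of the design is that reading the whole spec is free, and reading a segment costs a single slice, at the price of recomputing positions whenever a segment is changed.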

Necko provides prewritten urlparsers for schemes based on hierarchical naming systems.

Escaping

To be able to parse a URL safely it is sometimes necessary to "escape" certain characters, to hide them from the parser. An escaped character is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the octet code. For example, "%20" is the escaped encoding for the US-ASCII space character.
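The triplet encoding itself is small enough to show directly. The Python sketch below uses hypothetical helper names, and encoding non-ASCII characters as UTF-8 octets before escaping is an assumption made for the sketch rather than something the text above specifies.

```python
def escape_char(ch):
    """Encode one character as one '%XX' triplet per octet."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def unescape_triplet(triplet):
    """Decode a single '%XX' triplet back into its octet value."""
    if len(triplet) != 3 or triplet[0] != "%":
        raise ValueError("expected a triplet of the form %XX")
    return int(triplet[1:], 16)
```

With these helpers, `escape_char(" ")` gives `"%20"`, and `unescape_triplet("%20")` gives 32, the octet code of the space character.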

Another quote from RFC 2396:

A URI is always in an "escaped" form, since escaping or unescaping a completed URI might change its semantics. Normally, the only time escape encodings can safely be made is when the URI is being created from its component parts; each component may have its own set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics. Likewise, a URI must be separated into its components before the escaped characters within those components can be safely decoded.

This implies that the segments of URLs are escaped differently. This is done by NS_EscapeURL, which is now part of XPCOM but started as part of Necko. The information on how to escape each segment is stored in a matrix.

Also, a string should not be escaped more than once. Necko will not escape an already escaped character unless forced to by a special mask, which can be used when the string is known to be unescaped.
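Both rules, segment-dependent escaping and no double escaping unless forced, can be sketched in Python. This is not NS_EscapeURL: the function `escape_segment`, its `force` parameter, and the contents of the `SEGMENT_SAFE` table are assumptions chosen for illustration, and the real matrix may classify characters differently.

```python
import string

# Characters RFC 2396 calls "unreserved"; never escaped in this sketch.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

# Hypothetical per-segment table of extra characters left unescaped.
SEGMENT_SAFE = {
    "directory": set("/:@&=+$,"),
    "query": set(";/?:@&=+$,"),
    "ref": set(";/?:@&=+$,"),
}

HEX_DIGITS = set(string.hexdigits)


def escape_segment(text, segment, force=False):
    """Escape text for one segment, leaving existing %XX triplets alone
    unless force is set (for input known to be unescaped)."""
    safe = UNRESERVED | SEGMENT_SAFE[segment]
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        is_triplet = (ch == "%" and len(text) - i >= 3
                      and text[i + 1] in HEX_DIGITS
                      and text[i + 2] in HEX_DIGITS)
        if is_triplet and not force:
            out.append(text[i:i + 3])  # already escaped: copy through
            i += 3
            continue
        if ch in safe:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        i += 1
    return "".join(out)
```

In this sketch a "?" is escaped inside a directory but left alone inside a query, and "a%20b" passes through unchanged unless `force=True` is given, in which case the "%" itself is escaped.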

Parsing URLs

RFC 2396 defines a URL parser that can deal with the syntax that is common to most URL schemes currently in existence.

Sometimes scheme-specific parsing is required. Also, to be somewhat tolerant of syntax errors, the parser has to know more about the specific syntax of the URLs for that scheme. To stay almost generic, Necko contains three parsers for the main classes of standard URLs. Which one is used is defined by the implementation of nsIProtocolHandler for the scheme in question.

The three main classes are:

Authority
These URLs have an authority segment, like those of the "http" scheme.

NoAuthority
These URLs have no authority segment or a degenerate one, like those of the "file" scheme. Where the platform allows it, this parser can also identify drives.

Standard
It is not known whether an authority segment exists, so less syntax correction can be applied in this case.

Notable Differences

Necko does not support certain deprecated forms of relative URLs, based on the following part of RFC 2396:

If the scheme component is defined, indicating that the reference starts with a scheme name, then the reference is interpreted as an absolute URI and we are done. Otherwise, the reference URI's scheme is inherited from the base URI's scheme component.

Due to a loophole in prior specifications [RFC1630], some parsers allow the scheme name to be present in a relative URI if it is the same as the base URI scheme. Unfortunately, this can conflict with the correct parsing of non-hierarchical URI. For backwards compatibility, an implementation may work around such references by removing the scheme if it matches that of the base URI and the scheme is known to always use the "hier_part" syntax. The parser can then continue with the steps below for the remainder of the reference components. Validating parsers should mark such a misformed relative reference as an error.

The decision was made against backwards compatibility. This means that URLs like "http:page.html" or "http:/directory/page.html" are interpreted as absolute URLs and "corrected" by the parser.

Also the handling of query segments is different from the examples given in RFC 2396:

Within an object with a well-defined base URI of

http://a/b/c/d;p?q

the relative URI would be resolved as follows:

...

?y = http://a/b/c/?y

...

Instead

?y = http://a/b/c/d;p?y

was implemented as suggested by the older RFC 1808. This decision is based on an email by Roy T. Fielding, one of the authors of RFC 2396, stating that the given example is wrong. Details can be found at bug 90439.
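The two readings differ only in whether the last path segment of the base is kept. The Python sketch below contrasts them for references of the form "?query" only; the function `resolve_query_only` and its `rfc2396_example` flag are hypothetical and do not reflect how Necko implements resolution.

```python
def resolve_query_only(base, ref, rfc2396_example=False):
    """Resolve a reference of the form '?query' against a base URL."""
    if not ref.startswith("?"):
        raise ValueError("this sketch only handles '?query' references")
    base = base.split("#", 1)[0]       # a base fragment never carries over
    base_path = base.split("?", 1)[0]  # the base query is replaced
    if rfc2396_example:
        # Reading shown in the RFC 2396 example: also drop the last
        # path segment, keeping everything up to the final "/".
        base_path = base_path[:base_path.rfind("/") + 1]
    return base_path + ref
```

With the base `http://a/b/c/d;p?q`, the default call returns `http://a/b/c/d;p?y`, the behaviour described above, while `rfc2396_example=True` returns `http://a/b/c/?y`.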

Registry-based authorities

Currently Necko's URL objects only support host-based authorities or URLs with no authority. Registry-based authorities as defined in RFC 2396

Many URI schemes include a top hierarchical element for a naming authority, such that the namespace defined by the remainder of the URI is governed by that authority. This authority component is typically defined by an Internet-based server or a scheme-specific registry of naming authorities.

...

The structure of a registry-based naming authority is specific to the URI scheme, but constrained to the allowed characters for an authority component.

are not supported.

References

The main reference for URIs, URLs and URL parsing is RFC 2396.
