RFC1808 and RFC1738 Notes

urlparse has the following unit tests:

run_unittest(urlParseTestCase)
- checkRoundtrips
- test_roundtrips (8)
- test_http_roundtrips (6)
- checkJoin
- test_unparse_parse (9)
- test_RFC1808 (1)
- test_RFC2396 (2)
- test_urldefrag (10)
- test_urlsplit_attributes (11)
- test_attributes_bad_port (3)
- test_attributes_without_netloc (4)
- test_caching (5)
- test_noslash (7)

<pre>
>>> RFC1808_BASE = "http://a/b/c/d;p?q#f"
>>> urlparse.urlsplit(RFC1808_BASE)
SplitResult(scheme='http', netloc='a', path='/b/c/d;p', query='q', fragment='f')
</pre>

The checkJoin tests take the parameters (base, relurl, expected).

Relative URL resolution always takes a base URL and a relative URL, follows the
algorithm described in RFC1808, and produces the expected absolute URL.

The syntax for relative URLs is a shortened form of that for absolute URLs,
where the prefix of the URL is missing and certain path components ('.' and
'..') have a special meaning when interpreting the relative path.
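To make the (base, relurl, expected) idea concrete, here is a minimal sketch of such a check. The `check_join` function is a hypothetical stand-in written for this note, not the helper from the real test suite, and the import fallback is only there so the snippet runs on both Python 2 and Python 3. The relative references are the ones listed in RFC1808.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

RFC1808_BASE = "http://a/b/c/d;p?q#f"


def check_join(base, relurl, expected):
    """Hypothetical stand-in for checkJoin: resolve relurl against base."""
    result = urljoin(base, relurl)
    assert result == expected, (base, relurl, result, expected)
    return result


# A plain name replaces the last path segment of the base.
check_join(RFC1808_BASE, "g", "http://a/b/c/g")
# '.' refers to the current directory of the base path.
check_join(RFC1808_BASE, "./g", "http://a/b/c/g")
# '..' climbs one directory up before appending.
check_join(RFC1808_BASE, "../g", "http://a/b/g")
# A leading '/' replaces the whole path.
check_join(RFC1808_BASE, "/g", "http://a/g")
# A leading '//' replaces the netloc as well.
check_join(RFC1808_BASE, "//g", "http://g")
```

Only the scheme survives in the last case, which shows how much of the base is reused depends on how much of the prefix the relative URL leaves out.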

- If both the params and the query are present, the query must occur after the
parameters.

- The question mark character (?) is allowed in an ftp or file path segment.

- Parsing a scheme: the scheme can contain alphanumerics, "+", "." and "-", and
must end with ':'

I was confused about how the scheme can contain characters like "+", "." and
"-", and was looking for examples.
Well, svn+ssh://svn.python.org/trunk/ is an example where the scheme contains
the "+" character.
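Two of the points above (params come before the query, and a scheme may contain "+") can be checked directly from the interpreter. This is just a small sketch; the import fallback is my own addition so that it runs on Python 2 and Python 3 alike.

```python
try:
    from urllib.parse import urlparse, urlsplit  # Python 3
except ImportError:
    from urlparse import urlparse, urlsplit      # Python 2

# urlparse (unlike urlsplit) separates the ';params' from the path.
parsed = urlparse("http://a/b/c/d;p?q#f")
print(parsed.path, parsed.params, parsed.query)

# The '+' stays part of the scheme; everything after '//' is the netloc.
svn = urlsplit("svn+ssh://svn.python.org/trunk/")
print(svn.scheme, svn.netloc, svn.path)
```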


A URL is written as:

<pre>
<scheme>:<scheme-specific-part>
</pre>

A scheme name consists of a sequence of characters. The lower case letters
"a"-"z", digits, and the characters plus "+", period "." and hyphen "-" are
allowed.
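The character rule above is easy to express as a regular expression. The sketch below is my own illustration and not code from urlparse; `SCHEME_RE` and `looks_like_scheme` are made-up names. It lowercases the input first because RFC1738 recommends treating upper case letters in scheme names as equivalent to lower case.

```python
import re

# Assumption: only the characters named in the text above are allowed,
# i.e. lower case letters, digits, "+", "." and "-".
SCHEME_RE = re.compile(r"^[a-z0-9+.-]+\Z")


def looks_like_scheme(name):
    """Return True if name uses only the characters allowed in a scheme."""
    return SCHEME_RE.match(name.lower()) is not None


for candidate in ("http", "svn+ssh", "HTTP", "ht tp", "a_b", ""):
    print(repr(candidate), looks_like_scheme(candidate))
```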

- A URL is basically a sequence of octets in a coded character set. (Did not
quite understand.)

- In hierarchical naming schemes, the components of the hierarchy are separated
by "/".


From RFC1738, URL schemes that involve the direct use of an IP-based protocol
to a specified host on the Internet use a common syntax for the scheme-specific
data.

<pre>
//<user>:<password>@<host>:<port>/<url-path>
</pre>

- The scheme-specific data starts with a double slash '//' to indicate that it
complies with the common Internet scheme syntax.
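The pieces of the common syntax map onto attributes of the result that urlsplit returns. The URL below is made up for illustration (the host, user and password are not real), and the import fallback is again only there to cover both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

# Hypothetical URL that uses every part of
# //<user>:<password>@<host>:<port>/<url-path>
parts = urlsplit("ftp://user:secret@ftp.example.org:2121/pub/file.txt")

print(parts.netloc)    # the whole user:password@host:port chunk
print(parts.username)
print(parts.password)
print(parts.hostname)
print(parts.port)      # converted to an int
print(parts.path)
```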


In its current state, the urlparse module complies with RFC1808 and basically
assumes that the netloc specification starts with //.
That may be right as far as the relative URL syntax goes, but in practice we do
write URLs as www.python.org, i.e. without the scheme and without the // that
introduces the netloc.
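A quick sketch of what this means in practice: without the leading //, the host name ends up in the path rather than in the netloc. The import fallback is my own addition so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

# No scheme and no '//': the whole string is treated as a relative path.
bare = urlsplit("www.python.org")
print(repr(bare.netloc), repr(bare.path))

# With the leading '//' the same text is recognised as the netloc.
with_slashes = urlsplit("//www.python.org")
print(repr(with_slashes.netloc), repr(with_slashes.path))
```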

Port handling in urlparse is seriously lacking.
