URLError giving NameError

* Reported by Georgij Kondratjev, who explains:

Not creating new bug entry because everybody can quickly fix it.

In urllib/request.py, some instances of URLError are raised with "raise
urllib.error.URLError" and this works, but there are lines with "raise
URLError", which produce "NameError: global name 'URLError' is not defined".

http://bugs.python.org/issue2775

Update: Georg fixed it in revision 64624.
It was quick, and I turned up a bit late.

urllib issues Roundup

I had a spreadsheet listing the bugs in urllib, but I could not use it as effectively as I wished to. I decided to list the bugs in the blog itself so that I stay on top of the TODO items.


Feature Requests:


Bugs:


The following are quick fixes, as per my analysis.



The following will take days of work.



Low priority.

  • http://bugs.python.org/issue1285086
    urllib.quote is too slow

Fixed Issues: Yet to be closed

Issue 600362: relocate cgi.parse_qs() into urlparse

Patch for Python 2.6 and Python 3.0: http://bugs.python.org/issue600362

I followed a long route for this patch.

- Ran svn revert -R .
- Then modified the files.
- Created the patch.
- Installed and tested it.

The previous patch modifies the trunk code at places which are only +/- 2 lines away from this patch.
Now, this is really difficult: only one of them will apply cleanly. I don't know of a way in which related patches, addressing different issues, can be supplied continuously and all apply cleanly.

I either have to figure out a way or ask around.

Working on issue 754016

I have been working on issue 754016 for the past two days.

- Facundo's suggestion is that when the netloc does not start with '//', urlparse
should raise a ValueError. I am analyzing how feasible that would be, because
urlparse is not only for http-style URLs but for other schemes also, where the
path directly follows the scheme name:

>>> urlparse.urlparse('mailto:orsenthil@example.com')
ParseResult(scheme='mailto', netloc='', path='o', params='', query='', fragment='')

To give an idea of the various URL forms which urlparse can take, I wrote this
test program:



import urlparse

list_of_valid_urls = ['g:h', 'http:g', 'http:', 'g', './g', 'g/',
                      '/g', '//g', '?y', 'g?y', 'g?y/./x', '.', './', '..',
                      '../', '../g', '../..', '../../g', '../../../g', './../g',
                      './g/.', '/./g', 'g/./h', 'g/../h', 'http:g', 'http:', 'http:?y',
                      'http:g?y', 'http:g?y/./x']

for url in list_of_valid_urls:
    print 'url-> %s -> %s' % (url, urlparse.urlsplit(url))




- The ValueError suggestion needs some more thought and discussion, I believe.

- My current solution stands at a documentation fix highlighting the need for
compliance with the RFC when specifying the netloc value: the netloc should be
given as //netloc.


- In the same bug, an error was pointed out by Antony: the patch fails
for the URL 'http:' with an IndexError.
Yes, that is a mistake I made in the patch: I referenced the character beyond ':' to check whether it is a digit, and when that character is not present, an IndexError results. The tests did not catch it either.

I have corrected it now and added tests for it.
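
Roughly, the corrected check has to guard the index before peeking at the character after ':'. A minimal sketch of the idea (my reconstruction with a hypothetical helper name, not the exact patch):

<pre>
def _colon_starts_scheme(url):
    # Treat 'name:' as a scheme separator unless the character right
    # after ':' is a digit (which would indicate a port). Guard the
    # index so inputs like 'https:' do not raise IndexError.
    i = url.find(':')
    if i <= 0:
        return False
    return i + 1 == len(url) or not url[i + 1].isdigit()
</pre>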

But just as I added a test for 'http:', I added tests for 'https:' and 'ftp:' as
well, and recognized that the current urlparse fails for the same reason the old
patch failed.

>>> urlparse.urlparse('https:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
    scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>> urlparse.urlparse('ftp:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
    scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>>

http: does not fail because urlparse handles it as a special case.

I have got to give this some more thought today evening. It could lead to some
more changes to urlparse, to handle 'scheme:' as valid input for a scheme
instead of failing with an IndexError.

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

urllib and NTLM Authentication?

I don't think it is in my list of bug fixes, but I have got to look into this
topic, as it was required when developing certain apps at the office. Yesterday,
one of my friends brought it up again as well.

urllib package

The first betas of Python 3.0 and Python 2.6 were scheduled for release on June
11, but have now been postponed to June 18.

There is a TODO task of packaging urllib, and it comes under my GSoC task as
well. The bug report had another developer assigned to it, and I have informed
them that I would give it a try.

The standard library reorganization follows PEP 3108, and most of the other
pieces are already done. So things are set as such.

If I follow the example of the httplib reorganization, the following has already
taken effect (Python 2.5 module on the left, Python 3.0/Python 2.6 http package
module on the right):

httplib          -------  http.client    (client.py)
BaseHTTPServer   -------  http.server    (server.py)
CGIHTTPServer    -------  http.server    (server.py)
SimpleHTTPServer -------  http.server    (server.py)
(No naming conflicts should occur.)
Cookie           -------  http.cookies   (cookies.py)
cookielib        -------  http.cookiejar (cookiejar.py)
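
To make the already-finished http side concrete, the new-style imports in Python 3.0 look like this (illustrative class names chosen by me, not an exhaustive mapping):

<pre>
from http.client import HTTPConnection                       # was httplib
from http.server import HTTPServer, BaseHTTPRequestHandler   # was BaseHTTPServer
from http.cookies import SimpleCookie                        # was Cookie
from http.cookiejar import CookieJar                         # was cookielib
</pre>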

A similar reorganization is planned for urllib, and this will be my TODO task.
From PEP 3108:

urllib2  --------  urllib.request (request.py)
urlparse --------  urllib.parse   (parse.py)
urllib   --------  urllib.parse, urllib.request

The current urllib module will be split into parse.py and request.py:
- quoting-related functionality will be added to parse.py
- URLopener and FancyURLopener will be added to request.py
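
A quick sketch of what client code would look like after the split, assuming the final PEP 3108 names land as planned:

<pre>
from urllib.parse import urlparse, quote            # parsing/quoting from parse.py
from urllib.request import urlopen, FancyURLopener  # openers from request.py

print(urlparse('http://www.python.org/~guido/'))
print(quote('/~guido/Python.html'))
# urlopen('http://www.python.org/') would fetch the page, as the old
# urllib.urlopen and urllib2.urlopen did.
</pre>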

Other activities should include:

- Docs need to be updated.
- Tests need to be verified to run properly.
- No conflicts should occur.
- Python 3.0: testing needs to be done.
- Changes to other modules.

I shall set an internal target of June 16, with 4 hours per day devoted
exclusively to this task.

For Bugs #2195 and #754016

A patch to fix this issue. I deliberated upon
this for a while and came up with the approach to:

1) fix the port issue, wherein urlparse should technically
recognize the ':' separator for a port as distinct from the ':' after a scheme.

2) and a doc fix wherein it is advised that, in the absence of
a scheme, the net_loc be given as //net_loc (following RFC 1808).

If we go for any other fix, like internally prepending //
when the user has not specified the scheme (as in many
practical uses), then we stand a chance of breaking a
number of tests (cases where the URL is 'g' (path only) or ';x'
(path with params), and cases where the relative URL is g:h), as illustrated below.
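
For illustration, this is roughly how such inputs come out today, and why silently prepending '//' would change them (output as I would expect from the unpatched module):

<pre>
>>> from urlparse import urlparse
>>> urlparse('g:h')     # relative reference with scheme 'g' in the test data
ParseResult(scheme='g', netloc='', path='h', params='', query='', fragment='')
>>> urlparse('g')       # path-only input; would turn into a netloc if '//' were prepended
ParseResult(scheme='', netloc='', path='g', params='', query='', fragment='')
</pre>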

Let me know your thoughts on this.

>>> urlparse('1.2.3.4:80')
ParseResult(scheme='', netloc='', path='1.2.3.4:80', params='', query='', fragment='')
>>> urlparse('http://www.python.org:80/~guido/foo?query#fun')
ParseResult(scheme='http', netloc='www.python.org:80', path='/~guido/foo', params='', query='query', fragment='fun')
>>>

Index: Doc/library/urlparse.rst
===================================================================
--- Doc/library/urlparse.rst (revision 64056)
+++ Doc/library/urlparse.rst (working copy)
@@ -52,6 +52,23 @@
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

+ If the scheme value is not specified, urlparse following the syntax
+ specifications from RFC 1808, expects the netloc value to start with '//',
+ Otherwise, it is not possible to distinguish between net_loc and path
+ component and would classify the indistinguishable component as path as in
+ a relative url.
+
+ >>> from urlparse import urlparse
+ >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('help/Python.html')
+ ParseResult(scheme='', netloc='', path='help/Python.html', params='',
+ query='', fragment='')
+
If the *default_scheme* argument is specified, it gives the default addressing
scheme, to be used only if the URL does not specify one. The default value for
this argument is the empty string.

Index: Lib/urlparse.py
===================================================================
--- Lib/urlparse.py (revision 64056)
+++ Lib/urlparse.py (working copy)
@@ -145,7 +144,7 @@
clear_cache()
netloc = query = fragment = ''
i = url.find(':')
- if i > 0:
+ if i > 0 and not url[i+1].isdigit():
if url[:i] == 'http': # optimize the common case
scheme = url[:i].lower()
url = url[i+1:]

RFC 1808 and RFC 1738 Notes

urlparse has the following unit tests:

run_unittest(urlParseTestCase)
- checkRoundtrips
- test_roundtrips (8)
- test_http_roundtrips (6)
- checkJoin
- test_unparse_parse(9)
- test_RFC1808 (1)
- test_RFC2396 (2)
- test_urldefrag (10)
- test_urlsplit_attributes (11)
- test_attributes_bad_port (3)
- test_attributes_without_netloc (4)
- test_caching (5)
- test_noslash (7)
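
Roughly, the round-trip tests assert that splitting a URL and joining it back reproduces the original string; a paraphrase of the idea (not the test file's exact code):

<pre>
import urlparse

def check_roundtrip(url):
    # urlunsplit(urlsplit(url)) should give the original URL back.
    parts = urlparse.urlsplit(url)
    assert urlparse.urlunsplit(parts) == url, url

check_roundtrip('http://a/b/c/d;p?q#f')
</pre>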

<pre>
>>> import urlparse
>>> RFC1808_BASE = "http://a/b/c/d;p?q#f"
>>> urlparse.urlsplit(RFC1808_BASE)
SplitResult(scheme='http', netloc='a', path='/b/c/d;p', query='q', fragment='f')
</pre>

In the checkJoin tests, each case takes the parameters (base, relurl, expected).

Relative URL resolution always takes the base URL and the relative URL, follows
the algorithm described in RFC 1808, and produces the expected URL accordingly.

The syntax for relative URLs is a shortened form of that for absolute URLs,
where the prefix of the URL is missing and certain path components ('.'
and '..') have a special meaning when interpreting the relative path.
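
For instance, with the RFC 1808 base URL used in the tests, urljoin resolves relative references like this (illustrative session):

<pre>
>>> import urlparse
>>> BASE = 'http://a/b/c/d;p?q#f'
>>> urlparse.urljoin(BASE, 'g')
'http://a/b/c/g'
>>> urlparse.urljoin(BASE, '../g')
'http://a/b/g'
</pre>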

- If both the params and the query are present, the query must occur after the
params.

- The question mark character (?) is allowed in the ftp and file path segments.

- Parsing a scheme: the scheme can contain alphanumerics, "+", "." and "-", and
must end with ':'.

I was confused as to how a scheme can contain characters like "+", "." and "-",
and was looking for examples.
Well, svn+ssh://svn.python.org/trunk/ is an example where the scheme contains
the "+" character.


A URL is denoted by:

<pre>
<scheme>:<scheme-specific-part>
</pre>

A scheme name consists of a sequence of characters. The lower-case letters
"a"-"z", digits, and the characters plus "+", period "." and hyphen "-" are allowed.

- A URL is basically a sequence of octets in a coded character set. (I did not
quite understand this part.)

- In hierarchical naming schemes, the components of the hierarchy are separated
by "/".


From RFC 1738: URL schemes that involve the direct use of an IP-based protocol
to a specified host on the Internet use a common syntax for the scheme-specific
data:

<pre>
//<user>:<password>@<host>:<port>/<url-path>
</pre>

- The scheme-specific data starts with a double slash '//' to indicate that it
complies with the common Internet scheme syntax.
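
For example (hypothetical host and credentials), urlsplit keeps the whole <user>:<password>@<host>:<port> part in the netloc:

<pre>
>>> import urlparse
>>> urlparse.urlsplit('ftp://user:secret@ftp.example.com:2121/pub/file.txt')
SplitResult(scheme='ftp', netloc='user:secret@ftp.example.com:2121', path='/pub/file.txt', query='', fragment='')
</pre>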


In its current state, the urlparse module complies with RFC 1808 and basically
assumes that the netloc specification starts with //.
That may be right with respect to the relative-URL syntax, but for practical
purposes we do specify URLs as www.python.org, i.e. without the scheme and
without the // before the netloc.

The port handling part of urlparse is seriously lacking.

urlparse and port number

Bugs #2195 and #754016 both complain about urlparse not handling port numbers properly, often giving erroneous results for scheme, netloc and path.

Yes, it misbehaves when you do not start the netloc with //. But for all practical purposes, when we use a URL without a scheme, we plainly give just the netloc part, like www.python.org.

This requires a fix, and the following patch will do that.


@@ -143,7 +143,7 @@ def urlsplit(url, scheme='', allow_fragm
if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
clear_cache()
netloc = query = fragment = ''
- i = url.find(':')
+ i = url.find('://')
if i > 0:
if url[:i] == 'http': # optimize the common case
scheme = url[:i].lower()
@@ -164,6 +164,9 @@ def urlsplit(url, scheme='', allow_fragm
scheme, url = url[:i].lower(), url[i+1:]
if scheme in uses_netloc and url[:2] == '//':
netloc, url = _splitnetloc(url, 2)
+ else:
+ netloc, url = _splitnetloc(url)
+
if allow_fragments and scheme in uses_fragment and '#' in url:
url, fragment = url.split('#', 1)
if scheme in uses_query and '?' in url:

1) The first change differentiates between the port's ':' and the scheme's '://'.
2) The second change: when the scheme is not given, just split the input into netloc and the rest of the URL (a rough sketch of this idea follows below).
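
A rough sketch of the idea behind change (2): with no scheme present, everything before the first '/', '?' or '#' is taken as the netloc, which mirrors what _splitnetloc does from position 0 (illustrative code, not the patch itself):

<pre>
def split_netloc_when_schemeless(url):
    # Take everything before the first '/', '?' or '#' as the netloc.
    delim = len(url)
    for c in '/?#':
        pos = url.find(c)
        if pos >= 0:
            delim = min(delim, pos)
    return url[:delim], url[delim:]

print split_netloc_when_schemeless('www.python.org:80/~guido/foo?query')
# -> ('www.python.org:80', '/~guido/foo?query')
</pre>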

Got to write the tests for it and submit it.

One general review comment: urlparse.urlsplit is not written in a very composed/collected way. There have been a lot of realizations (just like the one above), and then patches/additions to fix them; so we see, for instance, a special condition for http handled in its own block of code. Those can be cleaned up.

Timelines for next beta releases

Timelines for next releases:

Jun 11 2008: Python 2.6b1 and 3.0b1 planned
Jul 02 2008: Python 2.6b2 and 3.0b2 planned
Aug 06 2008: Python 2.6rc1 and 3.0rc1 planned
Aug 20 2008: Python 2.6rc2 and 3.0rc2 planned

Summer of Code 2008

I am working on enhancing the Python urllib module as part of the Google Summer of Code. I will be using this blog to post updates on the project.