Simple is better than complex.: Task

urllib issues Roundup

I had a spreadsheet listing the bugs in urllib, but I could not use it as effectively as I wished. I decided to list the bugs in the blog itself so that I stay on top of my TODO items.

Feature Requests:

http://bugs.python.org/issue1591035
update urlparse to RFC 3986
http://bugs.python.org/issue1462525
URI parsing library
http://bugs.python.org/issue1448934
urllib2+https+proxy not working

Bugs:

The following are quick fixes, as per my analysis.

http://bugs.python.org/issue1432
Strange behavior of urlparse.urljoin
I have the patch attached for this.
Review is required before checkin.
http://bugs.python.org/issue2275
urllib2 header capitalization. Patch attached.
http://bugs.python.org/issue2464
urllib2 can't handle http://www.wikispaces.com
http://bugs.python.org/issue2776
urllib2.urlopen() gets confused with path with // in it
http://bugs.python.org/issue2756
urllib2 add_header fails with existing unredirected_header
Patch attached.
http://bugs.python.org/issue2916
urlgrabber.grabber calls setdefaulttimeout

The following will take a few days' time.

http://bugs.python.org/issue2885
Create the urllib package
http://bugs.python.org/issue1424152
urllib/urllib2: HTTPS over (Squid) Proxy fails
Patch recently attached.
http://bugs.python.org/issue1675455
Use getaddrinfo() in urllib2.py for IPv6 support. Patch provided.
http://bugs.python.org/issue2987
RFC2732 support for urlparse (e.g. http://[::1]:80/)

Low priority.

http://bugs.python.org/issue1285086
urllib.quote is too slow

Fixed Issues: Yet to be closed

http://bugs.python.org/issue600362
relocate cgi.parse_qs() into urlparse
http://bugs.python.org/issue2829
Copy cgi.parse_qs() to urllib.parse
http://bugs.python.org/issue2195
urlparse() does not handle URLs with port numbers properly
Duplicate issue. Needs to be closed.

urllib and NTLM Authentication?

I don't think this is on my list of bug fixes, but I have got to look into
this topic, as it was a requirement when developing certain apps at the
office. Yesterday, one of my friends also reminded me of it.

urllib package

The first betas of Python 3.0 and Python 2.6 were scheduled for release on
June 11, but they have now been postponed to June 18.

There is a TODO task of packaging urllib, and it comes under my GSoC task as
well. The bug report had another developer assigned to it, and I have informed
them that I would give it a try.

The Standard Library reorganization follows PEP 3108, and most of the other
modules are already done. So the pattern to follow is already set.

If I follow the example of the httplib reorganization, the following has
already taken effect.

Python 2.5          ||  Python 3.0/Python 2.6

http
httplib             -------  http.client (client.py)
BaseHTTPServer      -------  http.server (server.py)
CGIHTTPServer       -------  http.server (server.py)
SimpleHTTPServer    -------  http.server (server.py)
(No naming conflicts should occur)
Cookie              -------  http.cookies (cookies.py)
cookielib           -------  http.cookiejar (cookiejar.py)
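
As a rough illustration of the renamed modules, here is a small sketch written against Python 3. The host name and cookie values are made up for the example, and nothing here opens a network connection.

```python
import http.client      # was httplib
import http.cookiejar   # was cookielib
import http.cookies     # was Cookie
import http.server      # was BaseHTTPServer, CGIHTTPServer, SimpleHTTPServer

# Creating the connection object does not connect; that happens on the first request.
conn = http.client.HTTPConnection("www.python.org", 80)

# SimpleCookie lived in the old Cookie module.
cookie = http.cookies.SimpleCookie()
cookie["session"] = "abc123"  # illustrative value

# CookieJar lived in the old cookielib module.
jar = http.cookiejar.CookieJar()

# The request handler classes from the old server modules now share http.server.
handler_class = http.server.SimpleHTTPRequestHandler

print(conn.host, conn.port, cookie["session"].value, len(jar), handler_class.__name__)
```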

A similar reorganization is planned for urllib, and this will be my TODO
task.
From PEP 3108:

urllib2 -------- urllib.request ( request.py)
urlparse -------- urllib.parse ( parse.py)
urllib -------- urllib.parse, urllib.request

The current urllib module will be split into parse.py and request.py:
- quoting-related functionality will be added to parse.py
- URLopener and FancyURLopener will be added to request.py
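
To make the split concrete, this is roughly how the imports look from the Python 3 side once the package exists. It is only a sketch: the URLs and header value are illustrative, and no request is actually sent.

```python
# Quoting and URL-splitting helpers (old urllib and urlparse) live in urllib.parse.
from urllib.parse import quote, unquote, urljoin

# Opening and request machinery (old urllib2) lives in urllib.request.
from urllib.request import Request

quoted = quote("a b/c")  # '/' is in the default safe set, the space is escaped
joined = urljoin("http://www.python.org/doc/", "faq.html")

# Building a Request does not touch the network.
req = Request("http://www.python.org/", headers={"User-Agent": "example"})

print(quoted, unquote(quoted), joined, req.full_url)
```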

Other activities should include:

- Docs need to be updated.
- Tests need to be checked to make sure they still run properly.
- No conflicts should occur.
- Python 3.0 - Testing needs to be done.
- Changes to other modules.

I shall set an internal target of June 16, with 4 hours per day spent
exclusively on this task.

urlparse and port number

Bugs #2195 and #754016 both complain that urlparse does not handle port numbers properly and often gives erroneous results for the scheme, netloc and path.

Yes, it misbehaves when the netloc does not start with //. But for all practical purposes, when we use a URL without a scheme, we simply give the netloc part, like www.python.org.
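
A quick way to see the difference, sketched here with Python 3's urllib.parse (the renamed urlparse). The result for the bare form is only printed, not spelled out, because the exact scheme/path split has differed between Python versions.

```python
from urllib.parse import urlsplit

# With the leading '//', the host and port end up in netloc as expected.
with_slashes = urlsplit("//www.python.org:80/index.html")
print(with_slashes.netloc, with_slashes.hostname, with_slashes.port)
# prints: www.python.org:80 www.python.org 80

# Without it, the ':' before the port is ambiguous with a scheme separator,
# and nothing is recognised as a netloc at all.
bare = urlsplit("www.python.org:80/index.html")
print(bare)
```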

This requires a fix, and the following patch does that.


@@ -143,7 +143,7 @@ def urlsplit(url, scheme='', allow_fragm
     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
         clear_cache()
     netloc = query = fragment = ''
-    i = url.find(':')
+    i = url.find('://')
     if i > 0:
         if url[:i] == 'http': # optimize the common case
             scheme = url[:i].lower()
@@ -164,6 +164,9 @@ def urlsplit(url, scheme='', allow_fragm
             scheme, url = url[:i].lower(), url[i+1:]
     if scheme in uses_netloc and url[:2] == '//':
         netloc, url = _splitnetloc(url, 2)
+    else:
+        netloc, url = _splitnetloc(url)
+
     if allow_fragments and scheme in uses_fragment and '#' in url:
         url, fragment = url.split('#', 1)
     if scheme in uses_query and '?' in url:

1) The first change differentiates between the port's ':' and the scheme's '://'.
2) The second change handles the case where no scheme is given: just split into the netloc and the rest of the URL.
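
The idea behind the two changes can be sketched as a standalone function. This is not the stdlib urlsplit and not the patch itself; split_netloc_sketch is a hypothetical helper written only to show the logic.

```python
def split_netloc_sketch(url, default_scheme=""):
    """Hypothetical helper: return (scheme, netloc, rest) using the patch's two ideas."""
    scheme = default_scheme
    # Change 1: only '://' counts as a scheme separator, so 'host:80' keeps its port.
    i = url.find("://")
    if i > 0:
        scheme, url = url[:i].lower(), url[i + 1:]  # url now starts with '//'
    # Change 2: split off a netloc whether or not the '//' prefix is present.
    start = 2 if url[:2] == "//" else 0
    end = len(url)
    for delim in "/?#":
        pos = url.find(delim, start)
        if pos >= 0:
            end = min(end, pos)
    return scheme, url[start:end], url[end:]


print(split_netloc_sketch("www.python.org:80/index.html"))
print(split_netloc_sketch("http://www.python.org:80/"))
```

One trade-off of this approach, at least in the sketch: a relative path such as docs/index.html has its first segment treated as the netloc, so the tests would need to cover that case too.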

I have got to write the tests for it and submit it.

One general review comment is that urlparse.urlsplit is not written in a very composed, collected way. There have been a lot of realizations (just like the one above), followed by patches and additions to fix them.
So we see a special condition for http being handled in its own block of code.
Those can be cleaned up.