urlparse and port number

Bugs #2195 and #754016 both complain that urlparse does not handle port numbers properly and often gives erroneous results for the scheme, netloc and path.

Yes, it misbehaves when the netloc does not start with //. But for all practical purposes, when we use a URL without a scheme, we simply write the netloc part, like www.python.org.
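To make the misbehaviour concrete, here is a small sketch. It assumes Python 3, where the urlparse module lives on as urllib.parse; the example URLs are made up for illustration. It only shows the colon-free case, because the handling of a bare host:port string has changed between Python releases.

```python
from urllib.parse import urlsplit

# With a leading "//", host and port are recognised as the netloc.
with_slashes = urlsplit("//www.python.org:80/index.html")
print(with_slashes.netloc)  # www.python.org:80
print(with_slashes.path)    # /index.html

# Without "//", nothing is recognised as a netloc and the
# host name ends up at the front of the path instead.
bare = urlsplit("www.python.org/index.html")
print(repr(bare.netloc))    # ''
print(bare.path)            # www.python.org/index.html
```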

This requires a fix, and the following patch provides one.


@@ -143,7 +143,7 @@ def urlsplit(url, scheme='', allow_fragm
     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
         clear_cache()
     netloc = query = fragment = ''
-    i = url.find(':')
+    i = url.find('://')
     if i > 0:
         if url[:i] == 'http': # optimize the common case
             scheme = url[:i].lower()
@@ -164,6 +164,9 @@ def urlsplit(url, scheme='', allow_fragm
             scheme, url = url[:i].lower(), url[i+1:]
     if scheme in uses_netloc and url[:2] == '//':
         netloc, url = _splitnetloc(url, 2)
+    else:
+        netloc, url = _splitnetloc(url)
+
     if allow_fragments and scheme in uses_fragment and '#' in url:
         url, fragment = url.split('#', 1)
     if scheme in uses_query and '?' in url:

1) The first change distinguishes the port's colon (:) from the scheme's separator (://).
2) The second change applies when no scheme is given: the URL is simply split into the netloc and the rest.
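The intent of the two changes can be approximated without touching the library. The sketch below is not the patch itself: split_assuming_netloc is a hypothetical helper name, and it assumes Python 3's urllib.parse. It treats any URL without "://" as starting with its netloc by prepending "//" before splitting.

```python
from urllib.parse import urlsplit

def split_assuming_netloc(url):
    """Hypothetical helper: treat a scheme-less URL as starting with its netloc."""
    # Only "://" counts as a scheme separator, so a lone ":" is left
    # for the netloc parser to read as the start of a port number.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url)

parts = split_assuming_netloc("www.python.org:80/index.html")
print(parts.netloc, parts.port, parts.path)

# URLs that already carry a scheme pass through unchanged.
full = split_assuming_netloc("http://www.python.org:80/")
print(full.scheme, full.netloc, full.path)
```

Like the patch, this assumes that a URL without "://" never carries a scheme, so something like mailto:user@example.org would be read as a netloc.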

I still have to write the tests for it and submit it.

One general review comment: urlparse.urlsplit is not written in a very composed or collected way. There have been a lot of realizations (just like the one above), followed by patches and additions to fix them.
As a result, we see a special case for http being handled in its own block of code.
Such cases can be cleaned up.
