For Bugs #2195 and #754016

Attached is a patch to fix this issue. I deliberated on
this for a while and settled on the following approach:

1) Fix the port issue: urlparse should distinguish the ':'
that separates the port from the ':' that follows the scheme.

2) Fix the docs: advise that, in the absence of a scheme,
the net_loc be written as //net_loc (following RFC 1808).

If we go for any other fix, such as internally prepending '//'
when the user has not specified a scheme (as happens in many
practical cases), then we risk breaking a number of tests:
cases where the url is 'g' (path only) or ';x' (path with
params), and cases where the relative url is 'g:h'.
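
To illustrate the trade-off, here is a minimal sketch of doing the '//'
prepending on the caller's side instead of inside urlparse. It assumes
Python 3, where the urlparse module lives on as urllib.parse, and the
helper name parse_host_port is hypothetical, not part of the library. The
caller must already know the text is a scheme-less host[:port] string.

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module


def parse_host_port(netloc_text):
    """Hypothetical helper: parse text the caller knows is scheme-less host[:port]."""
    # RFC 1808 style: a net_loc without a scheme must be introduced by '//'.
    if not netloc_text.startswith('//'):
        netloc_text = '//' + netloc_text
    return urlparse(netloc_text)


result = parse_host_port('1.2.3.4:80')
print(result.netloc)    # 1.2.3.4:80
print(result.hostname)  # 1.2.3.4
print(result.port)      # 80

# Prepending '//' blindly would change the meaning of a relative reference:
print(urlparse('g').path)      # g   (a path, as the relative-url tests expect)
print(urlparse('//g').netloc)  # g   (now a network location, not a path)
```

The last two lines show why the prepending cannot safely happen inside
urlparse itself: only the caller knows whether 'g' is a host or a path.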

Let me know your thoughts on this.

>>> urlparse('1.2.3.4:80')
ParseResult(scheme='', netloc='', path='1.2.3.4:80',
params='', query='', fragment='')
>>> urlparse('http://www.python.org:80/~guido/foo?query#fun')
ParseResult(scheme='http', netloc='www.python.org:80',
path='/~guido/foo', params='', query='query',
fragment='fun')
>>>
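
The check behind the port fix can be sketched in isolation. The function
name colon_ends_scheme is hypothetical and exists only for illustration; it
mirrors the idea of the patch (a ':' followed by a digit is taken to
introduce a port, not to end a scheme) rather than reproducing urlparse.

```python
def colon_ends_scheme(url):
    """Sketch: does the first ':' in url terminate a scheme?"""
    i = url.find(':')
    # Slice instead of indexing so a trailing ':' (e.g. 'http:') yields ''
    # and does not raise IndexError; ''.isdigit() is False.
    return i > 0 and not url[i + 1:i + 2].isdigit()


print(colon_ends_scheme('http://www.python.org:80/~guido/foo'))  # True
print(colon_ends_scheme('1.2.3.4:80'))                           # False
print(colon_ends_scheme('g:h'))                                  # True
print(colon_ends_scheme('g'))                                    # False
print(colon_ends_scheme('http:'))                                # True
```

One limitation of this heuristic is that a URL whose scheme is followed
directly by a digit, such as 'tel:123', is also treated as having no scheme.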

Index: Doc/library/urlparse.rst
===================================================================
--- Doc/library/urlparse.rst (revision 64056)
+++ Doc/library/urlparse.rst (working copy)
@@ -52,6 +52,23 @@
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

+ If the scheme value is not specified, urlparse follows the syntax
+ specification of RFC 1808 and expects the netloc value to start with '//'.
+ Otherwise it cannot distinguish the net_loc component from the path
+ component, and it classifies the ambiguous component as a path, as in
+ a relative URL.
+
+ >>> from urlparse import urlparse
+ >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('help/Python.html')
+ ParseResult(scheme='', netloc='', path='help/Python.html', params='',
+ query='', fragment='')
+
If the *default_scheme* argument is specified, it gives the default addressing
scheme, to be used only if the URL does not specify one. The default value for
this argument is the empty string.

Index: Lib/urlparse.py
===================================================================
--- Lib/urlparse.py (revision 64056)
+++ Lib/urlparse.py (working copy)
@@ -145,7 +144,7 @@
clear_cache()
netloc = query = fragment = ''
i = url.find(':')
- if i > 0:
+ if i > 0 and not url[i+1:i+2].isdigit():
if url[:i] == 'http': # optimize the common case
scheme = url[:i].lower()
url = url[i+1:]
