Working on issue 754016

I have been working on issue 754016 for the past two days.

- Facundo's suggestion is that when netloc does not start with '//', urlparse
should raise a ValueError. I am analyzing how feasible that would be,
because urlparse is not only for URLs with a netloc but for other schemes as
well, where the path directly follows the scheme name:
>>> urlparse.urlparse('mailto:orsenthil@example.com')
ParseResult(scheme='mailto', netloc='', path='orsenthil@example.com', params='', query='',
fragment='')

To give an idea of the various URL forms which urlparse can take, I wrote this
test program:

import urlparse

list_of_valid_urls=['g:h','http:g','http:','g','./g','g/',
'/g','//g','?y','g?y','g?y/./x','.','./','..',
'../','../g','../..','../../g','../../../g','./../g',
'./g/.','/./g','g/./h','g/../h','http:g','http:','http:?y',
'http:g?y','http:g?y/./x']

for url in list_of_valid_urls:
    print 'url-> %s -> %s' % (url,urlparse.urlsplit(url))




- The ValueError suggestion needs some more thought and discussion, I believe.

- My current solution is a documentation fix highlighting the need for
compliance with the RFC when passing a netloc value:
the netloc should be written as //netloc.
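
To make the //netloc point concrete, here is a small sketch of the behaviour the documentation fix would describe. The host name and path are just illustrative values I picked, and the import fallback assumes the module is available as urllib.parse on Python 3 and urlparse on Python 2.

```python
try:
    from urllib.parse import urlsplit  # Python 3 location of the module
except ImportError:
    from urlparse import urlsplit      # Python 2 module discussed in this post

# Illustrative inputs: the same host and path, with and without the leading '//'.
with_slashes = urlsplit('//www.python.org/doc')
without_slashes = urlsplit('www.python.org/doc')

print(with_slashes.netloc)     # the host is recognised as netloc
print(without_slashes.netloc)  # empty: no '//' means no netloc
print(without_slashes.path)    # the whole string is treated as a path
```

Without the leading '//', nothing is reported as netloc and the host ends up inside the path, which is why the documentation should spell this out.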


- In the same bug, Antony pointed out that the patch fails for the url 'http:'
with an IndexError.
Yes, that is the mistake I made in the patch: I referenced the character
beyond ':' to check whether it is a digit. When that character is not
present, an IndexError results. The tests did not catch it either.

I have corrected it now and added tests for it.
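
The kind of guard I mean can be sketched like this. This is not the actual patch; char_after_colon_is_digit is a hypothetical helper name used only to show how a slice avoids the IndexError that plain indexing gives.

```python
def char_after_colon_is_digit(url):
    """Hypothetical helper: True if a digit directly follows the first ':'."""
    i = url.find(':')
    if i < 0:
        # No colon at all, so there is nothing to inspect.
        return False
    # url[i + 1] would raise IndexError for input like 'http:'.
    # A one-character slice returns '' in that case, and ''.isdigit() is False.
    return url[i + 1:i + 2].isdigit()


for candidate in ['http:', 'host:80', 'http:g', 'g']:
    print('%s -> %s' % (candidate, char_after_colon_is_digit(candidate)))
```

Slicing past the end of a string is always safe in Python, so the empty case needs no separate length check.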

Just as I added a test for 'http:', I added tests for 'https:' and 'ftp:' as
well, and realized that the current urlparse fails for the same reason the old
patch failed.

>>> urlparse.urlparse('https:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
    scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>> urlparse.urlparse('ftp:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
    scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>>

'http:' does not fail because urlparse handles the http scheme as a special case.

I need to give this some more thought this evening. It could lead to some more
changes to urlparse so that 'scheme:' is handled as valid input, instead of
failing with an IndexError.
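
A quick way to check how a given interpreter treats bare 'scheme:' inputs is sketched below. split_or_error is a hypothetical helper of mine, and the import fallback assumes urllib.parse on Python 3 and urlparse on Python 2. I have left out expected output on purpose, since it depends on whether the interpreter's urlparse already handles this case.

```python
try:
    from urllib.parse import urlsplit  # Python 3 location of the module
except ImportError:
    from urlparse import urlsplit      # Python 2 module discussed in this post


def split_or_error(url):
    """Hypothetical helper: return the split result, or the IndexError raised."""
    try:
        return urlsplit(url)
    except IndexError as exc:
        return exc


# Bare 'scheme:' inputs with nothing after the colon.
results = {}
for url in ['http:', 'https:', 'ftp:']:
    results[url] = split_or_error(url)
    print('%-7s -> %r' % (url, results[url]))
```

If 'scheme:' is accepted as valid input, every entry should be a split result with the scheme filled in and the remaining fields empty, rather than an IndexError.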

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

1 comment:

Facundo Batista said...

Note that for HTTP, the scheme also is followed by the netloc (with a ":" in the middle).

The detail is that the netloc should always start with a "//".