[issue2776] urllib2.urlopen() gets confused with path with // in it

I played around with the pdb module today to debug this issue. pdb is really
helpful.
Here is how control flows.
1) There is a URL with '//' in the path.
2) The call is data = urllib2.urlopen(url).read()
3) urlopen calls build_opener, which builds the opener from a tuple of
handlers.
4) The opener is an instance of OpenerDirector and has the default HTTPHandler
and HTTPSHandler.
5) When the request is opened and it has the 'http' protocol, the
http_request method is called.
6) HTTPHandler's http_request method is
AbstractHTTPHandler.do_request_
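
Steps 3 to 6 can be checked without making a network call. This is a minimal
sketch that uses urllib.request, the Python 3 counterpart of urllib2, so the
module name is an assumption relative to the urllib2 code discussed here. The
variable names are only for illustration.

```python
import urllib.request

# build_opener() returns an OpenerDirector populated with the default handlers.
opener = urllib.request.build_opener()
handler_names = [type(h).__name__ for h in opener.handlers]

# The default handler set includes HTTPHandler (and HTTPSHandler when the
# ssl module is available).
has_http_handler = "HTTPHandler" in handler_names

# HTTPHandler.http_request is just another name for
# AbstractHTTPHandler.do_request_.
same_function = (urllib.request.HTTPHandler.http_request
                 is urllib.request.AbstractHTTPHandler.do_request_)

print(handler_names)
print(has_http_handler, same_function)
```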

Now, for this issue we get to the do_request_ method and see that:

7) host is set in the do_request_ method by the get_host() call.
8) request.get_selector() is the call that causes this particular issue
of "urllib2 getting confused with path containing //".
The get_selector() method returns self.__r_host.
Now, when a proxy is set using set_proxy(), self.__r_host is self.__original
(the original complete URL itself), so the get_selector() call returns the
full URL as sel_url and we can get the host from the splithost() call on
sel_url.

When no proxy is set and the URL contains '//' in the path segment, the
get_host() call (step 7) has already separated self.host from self.__r_host
(which points to the rest of the URL), and get_selector() simply returns
self.__r_host, the rest of the URL except the host. splithost() then treats
the leading '//' of that path as the start of a host, causing the call to fail.
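
The mis-parse can be reproduced in isolation. The splithost below is a
simplified stand-in written for this example, not the library function itself,
and the example.com URL is made up.

```python
import re

# Simplified stand-in for splithost(): '//host/path' -> ('host', '/path').
_hostprog = re.compile(r'^//([^/?]*)(.*)$')

def splithost(url):
    match = _hostprog.match(url)
    if match:
        return match.group(1, 2)
    return None, url

# What get_host() does first: split the host from the rest of the URL.
host, selector = splithost('//www.example.com//foo/bar')
print(host, selector)

# Without a proxy the selector is only the path. Because the path itself
# starts with '//', a second splithost() call mistakes 'foo' for a host.
sel_host, sel_path = splithost(selector)
print(sel_host, sel_path)
```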

9) Before the fix, request.add_unredirected_header('Host', sel_host or host)
had a fallback for proper URLs, where sel_host is not set and host is used
instead. Unfortunately, that fallback failed when this bug caused sel_host
to be set from self.__r_host, so the Host header was set up wrongly (from
the rest of the URL).

The attached patch fixes the issue appropriately. I adapted it and attached a
version for py3k.
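
A rough sketch of the Host header choice before and after such a fix follows.
It is not the patch itself: the function names, the has_proxy flag and the
example hosts are all hypothetical, and it assumes the fix amounts to
consulting the selector for a host only when a proxy is set.

```python
import re

_typeprog = re.compile(r'^([^/:]+):')
_hostprog = re.compile(r'^//([^/?]*)(.*)$')

def strip_scheme(url):
    # Drop a leading 'scheme:' if present (simplified, illustration only).
    match = _typeprog.match(url)
    return url[match.end():] if match else url

def splithost(url):
    # Simplified stand-in: '//host/path' -> ('host', '/path').
    match = _hostprog.match(url)
    return match.group(1, 2) if match else (None, url)

def host_header_before(host, selector):
    # Old behaviour: always look for a host in the selector, fall back to host.
    sel_host, _ = splithost(strip_scheme(selector))
    return sel_host or host

def host_header_after(host, selector, has_proxy):
    # Assumed fix: trust the selector only when a proxy is set, because only
    # then is the selector the complete original URL.
    sel_host = host
    if has_proxy:
        sel_host, _ = splithost(strip_scheme(selector))
    return sel_host

print(host_header_before('www.example.com', '//foo/bar'))
print(host_header_after('www.example.com', '//foo/bar', has_proxy=False))
print(host_header_after('proxy.example.com:3128',
                        'http://www.example.com//foo/bar', has_proxy=True))
```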


>
> I could reproduce this issue on trunk and p3k branch. The patch attached
> by Adrianna Pinska "appropriately" fixes this issue. I agree with the
> logic. Attaching the patch for py3k with the same fix.
>
> Thanks,
> Senthil
>
> Added file: http://bugs.python.org/file11103/issue2776-py3k.diff
>
