[issue600362] relocate cgi.parse_qs() into urlparse

This issue and its comments somehow escaped my notice initially. I have
addressed your comments in the new set of patches.

1) The previous Docs patch had issues. Updated the Docs patch.
2) Included a message in cgi.py noting that parse_qs and parse_qsl are kept
there for backward compatibility.
3) The reason the py26 version of the patch carries the quote function from
urllib is to avoid a circular import: urllib imports urlparse for the urljoin
method, so the only way for urlparse to use quote is to have that portion of
code in the patch as well.
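
For illustration, this is how the relocated functions would be called once
the patch is applied (a sketch assuming the patched module layout; dict
ordering in the output may vary):

>>> import urlparse
>>> urlparse.parse_qs('key=val1&key=val2&x=1')
{'x': ['1'], 'key': ['val1', 'val2']}
>>> urlparse.parse_qsl('key=val1&key=val2&x=1')
[('key', 'val1'), ('key', 'val2'), ('x', '1')]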

Please have a look at the patches.
As this request has been open for a long time (since 2002-08-26!), is it
possible to include this change in b3?

Thanks,
Senthil

Added file: http://bugs.python.org/file11116/issue600362-py26-v2.diff

issue2756: urllib2 add_header fails with existing unredirected_header


>>> import urllib2
>>> url = 'http://www.whompbox.com/headertest.php'
>>> request = urllib2.Request(url)
>>> request.add_data("Spam")
>>> f = urllib2.urlopen(url)
>>> request.header_items()
[]
>>> request.unredirected_hdrs.items()
[]
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]


Comment: This is fine. What is actually happening is that the do_request_ method, invoked via http_open(), is setting unredirected_hdrs to the items above.
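
For reference, this is roughly what AbstractHTTPHandler.do_request_ does for
a request carrying data (paraphrased from urllib2.py, not the verbatim
source):

if request.has_data():  # a POST request
    data = request.get_data()
    if not request.has_header('Content-type'):
        request.add_unredirected_header(
            'Content-type', 'application/x-www-form-urlencoded')
    if not request.has_header('Content-length'):
        request.add_unredirected_header(
            'Content-length', '%d' % len(data))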



>>> request.add_header('Content-type','application/xml')
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]


Comment: When we call add_header(), the headers are indeed changed. Correct behavior.
>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

Comment: add_header() has not modified unredirected_hdrs.
Is this the whole purpose of issue2756? If yes, then a better understanding of unredirected_hdrs is needed, and in the do_request_ method of AbstractHTTPHandler, where it changes unredirected_hdrs under the guard "not request.has_header(...)", we need to ask what that check is actually aiming to verify.

If add_header() is not supposed to change unredirected_hdrs, and add_unredirected_header() is the call to change unredirected_hdrs, then it is working fine and as expected.

(This is an undocumented interface; the items() call was used here for viewing the headers, though actual code might not be using it.)
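
The two methods in question look roughly like this (simplified from the
Request class in urllib2.py):

def add_header(self, key, val):
    # useful for something like authentication
    self.headers[key.capitalize()] = val

def add_unredirected_header(self, key, val):
    # this header will not be sent on a redirected request
    self.unredirected_hdrs[key.capitalize()] = val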



>>> request.add_unredirected_header('Content-type','application/xml')
>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]
>>>
Comment: add_unredirected_header() has correctly updated unredirected_hdrs.

After applying the patch attached to the issue report, which modifies the add_header() and add_unredirected_header() methods to remove any existing header of the same name, we observe that the unredirected header itself is removed and is never added back.
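
A sketch of what the patch does, as I read it (not the exact diff); the key
point is that add_header() now also drops the matching unredirected header:

def add_header(self, key, val):
    self.headers[key.capitalize()] = val
    # drop any existing unredirected header of the same name
    self.unredirected_hdrs.pop(key.capitalize(), None)

Since has_header() also consults self.headers, the "not
request.has_header('Content-type')" guard in do_request_ now sees the header
we just added and never restores the unredirected one, which is why it goes
missing below.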

After applying the attached patch:

>>> url = 'http://www.whompbox.com/headertest.php'
>>> request = urllib2.Request(url)
>>> request.add_data("Spam")
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.add_header('Content-type','application/xml')
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]
>>>
Comment: Notice the absence of the Content-type header.

Issue 2756 - urllib2 add_header fails with existing unredirected header

The issue is noticeable here:

[ors@goofy ~]$ python
Python 2.6b2+ (trunk:65482M, Aug 4 2008, 14:26:01)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> url = 'http://hroch486.icpf.cas.cz/formpost.html'
>>> import urllib2
>>> req_obj = urllib2.Request(url)
>>> req_obj.unredirected_hdrs
{}
>>> req_obj.add_data("Spam")
>>> req_obj.unredirected_hdrs
{}
>>> response = urllib2.urlopen(req_obj)
>>> req_obj.unredirected_hdrs
{'Content-length': '4', 'Content-type': 'application/x-www-form-urlencoded',
'Host': 'hroch486.icpf.cas.cz', 'User-agent': 'Python-urllib/2.6'}
>>> req_obj.add_data("SpamBar")
>>> req_obj.add_header("Content-type","application/html")
>>> response = urllib2.urlopen(req_obj)
>>> req_obj.unredirected_hdrs
{'Content-length': '4', 'Content-type': 'application/x-www-form-urlencoded',
'Host': 'hroch486.icpf.cas.cz', 'User-agent': 'Python-urllib/2.6'}
>>> req_obj.get_data()
'SpamBar'
>>>


In the final req_obj call, the unredirected_hdrs did not change in either
Content-length or Content-type, even though the data and headers were updated.
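
The stale values follow from how has_header() looks in both dictionaries
(simplified from the Request class in urllib2.py), so the "not
request.has_header(...)" guards in do_request_ never fire a second time:

def has_header(self, header_name):
    return (header_name in self.headers or
            header_name in self.unredirected_hdrs)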

[issue2776] urllib2.urlopen() gets confused with path with // in it

I played around with the pdb module today to debug this issue. pdb is really
helpful.
Here is how the control flows:
1) There is a URL with two '//'s in the path.
2) The call is data = urllib2.urlopen(url).read()
3) urlopen calls build_opener. build_opener builds the opener using a (tuple
of) handlers.
4) opener is an instance of OpenerDirector() and has the default HTTPHandler
and HTTPSHandler.
5) When the Request call is made and the request has the 'http' protocol, the
http_request method is called.
6) HTTPHandler's http_request method is
AbstractHTTPHandler.do_request_

Now, for this issue we get to the do_request_ method and see that:

7) host is set in the do_request_ method via the get_host() call.
8) request.get_selector() is the call causing this particular issue of
"urllib2 getting confused with a path containing //".
The get_selector() method returns self.__r_host.
When a proxy is set using set_proxy(), self.__r_host is self.__original (the
complete original URL itself), so the get_selector() call returns the sel_url
properly and we can get the host from the splithost() call on the sel_url.

When no proxy is set and the URL contains '//' in the path segment, the
get_host() call (step 7) has already separated self.host from self.__r_host
(the latter pointing to the rest of the URL), and get_selector() simply
returns self.__r_host, i.e. the rest of the URL except the host. splithost()
then mistakes the path's leading '//' for a netloc, causing the call to fail.
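
That splithost() confusion is easy to see in the interpreter (a quick check
on 2.x):

>>> from urllib import splithost
>>> splithost('//foo/bar')    # a selector starting with // looks like a netloc
('foo', '/bar')
>>> splithost('/foo//bar')    # a normal path is left alone
(None, '/foo//bar')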

9) Before the fix, request.add_unredirected_header('Host', sel_host or host)
had an escape mechanism for proper URLs, wherein sel_host is not set and host
is used. Unfortunately, that failed when this bug caused sel_host to be set
from self.__r_host, and Host in the headers was being set up wrongly (to the
rest of the URL).

The patch which was attached appropriately fixes the issue. I modified it and
included it for py3k.
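
As I read the patch, the essence of the fix in do_request_ is to consult the
selector for the Host header only when a proxy is in use, and to trust
get_host() otherwise (a paraphrase of the fix, not the exact diff):

sel_host = host
if request.has_proxy():
    scheme, sel = splittype(request.get_selector())
    sel_host, sel_path = splithost(sel)
if not request.has_header('Host'):
    request.add_unredirected_header('Host', sel_host)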


>
> I could reproduce this issue on trunk and p3k branch. The patch attached
> by Adrianna Pinska "appropriately" fixes this issue. I agree with the
> logic. Attaching the patch for py3k with the same fix.
>
> Thanks,
> Senthil
>
> Added file: http://bugs.python.org/file11103/issue2776-py3k.diff
>

Python urllib bugs roundup

From the previous bugs roundup post, the finished activities (bugs which are now closed) include:

* http://bugs.python.org/issue1432 - Strange Behaviour of urlparse.urljoin
* http://bugs.python.org/issue2275 - urllib2 header capitalization.
* http://bugs.python.org/issue2916 - urlgrabber.grabber calls setdefaulttimeout
* http://bugs.python.org/issue2195 - urlparse() does not handle URLs with port numbers properly. - Duplicate issue.
* http://bugs.python.org/issue2829 - Copy cgi.parse_qs() to urllib.parse - Duplicate of issue600362.
* http://bugs.python.org/issue2885 - Create the urllib package. (But the tests are still named test_urllib2, test_urlparse, etc. I shall discuss with py-dev whether renaming the tests is okay before beta3 and give it an attempt.)

Activities TODO:

High priority:

Next is the list of bugs which are partially completed and require addressing some issues mentioned in the patches. These take higher priority, as addressing them would result in closure sooner.

* http://bugs.python.org/issue600362 - relocate cgi.parse_qs() into urlparse
* http://bugs.python.org/issue2776 - urllib2.urlopen() gets confused with path with // in it
* http://bugs.python.org/issue2756 - urllib2 add_header fails with existing unredirected_header. Patch attached.
* http://bugs.python.org/issue2464 - urllib2 can't handle http://www.wikispaces.com

Plan:
I shall attempt to address all these issues before the release of Beta3 (that should be either August 15 or August 23).

The following were some of the main issues to be taken up during GSoC.
Now that I have understood RFC 3986 better, I can work on issue1591035. I shall work on it in the branch and then discuss its inclusion in the trunk.


Feature Requests:

* http://bugs.python.org/issue1591035 - update urlparse to RFC 3986. Plan: by August 23.
* http://bugs.python.org/issue1462525 - URI parsing library. This depends upon the previous issue, so we can assume Aug 23 for closure.
* http://bugs.python.org/issue2987 - RFC2732 support for urlparse (e.g. http://[::1]:80/). This is a related bug again and will conclude by the same timeline.


I shall take up the following listed bugs after completion of the above.

* http://bugs.python.org/issue1448934 - urllib2+https+proxy not working.
* http://bugs.python.org/issue1424152 - urllib/urllib2: HTTPS over (Squid) Proxy fails
* http://bugs.python.org/issue1675455 - Use getaddrinfo() in urllib2.py for IPv6 support. Patch provided.

Low priority:
* http://bugs.python.org/issue1285086 - urllib.quote is too slow

Issue 3300

There is a good amount of discussion going on around
http://bugs.python.org/issue3300. I had been following it from the start and
had an inclination towards quote and quote_plus supporting UTF-8. But as the
discussion went further, without a strong point on which stance to take, I had
to refresh and improve my knowledge of Unicode support in Python, and
especially Unicode strings in Python 3.0. Hopefully this will come in handy in
other issues.
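
For context, percent-quoting an already UTF-8 encoded byte string in today's
2.x looks like this (a quick interpreter check; the issue3300 debate is about
what quote should do with unencoded text in 3.0):

>>> import urllib
>>> urllib.quote(u'\u00e9'.encode('utf-8'))
'%C3%A9'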

Here are some notes on Unicode and Python.


What is Unicode?
In computing, Unicode is an industry standard that allows computers to
consistently display and manipulate text expressed in most of the world's
writing systems.

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

What is Unicode Character Set?

What is character encoding?

What is encoding?
Converting a character (or something) to a number, because computers
internally store only numbers.

Unicode strings are a sequence of code points, ranging from 0x000000 to
0x10FFFF. This sequence needs to be represented as a set of bytes (meaning
values from 0-255) in memory. The rules for translating a Unicode string into
a sequence of bytes are called an encoding.

The representation in number format is required for homogeneity; otherwise it
would be difficult to convert to and from.


What is Unicode Transformation Format?

What is UTF-8?

Unicode can be implemented using many character encodings. The most commonly
used one is UTF-8, which uses 1 byte for all ASCII characters (which have the
same code values as in the standard ASCII encoding) and up to 4 bytes for
other characters.
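
A quick interpreter check of that variable width (Python 2.x):

>>> u'A'.encode('utf-8')           # ASCII: 1 byte
'A'
>>> u'\u00e9'.encode('utf-8')      # e-acute: 2 bytes
'\xc3\xa9'
>>> u'\u20ac'.encode('utf-8')      # euro sign: 3 bytes
'\xe2\x82\xac'
>>> u'\U0001d11e'.encode('utf-8')  # musical G clef: 4 bytes
'\xf0\x9d\x84\x9e'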

Beyond ASCII, the \u escapes denote the remaining Unicode code points, which
you will find defined internationally at unicode.org.

Now, how to represent them in BINARY (coz: computer!) is the trick, and you
have different encodings to do so. So UTF-8 is one encoding, and UTF-16 and
ASCII are other, different encodings.

So you construct a unicode string:

>>> mystr = u'\u0065\u0066\u0067\u0068'

mystr is a unicode object. Printing it directly goes through an implicit
encoding; to see the object itself, use repr():

>>> print repr(mystr)
u'efgh'

Now, the unicode object can be converted to binary using an encoding; let us
use 'ascii' and 'utf-8':

>>> asciistr = mystr.encode('ascii')
>>> utf8str = mystr.encode('utf-8')

Now each is a string object in BINARY. Let us print asciistr and utf8str:

>>> print asciistr
efgh
>>> print utf8str
efgh

(These four code points are all ASCII, so both encodings happen to produce
the same four bytes.)

STILL NEED MORE UNDERSTANDING.

http://boodebr.org/main/python/all-about-python-and-unicode

A Unicode string holds characters from the Unicode character set.

[issue1432] Strange behavior of urlparse.urljoin

I have made changes to urlparse.urljoin so that it behaves in conformance with
RFC 3986. The join of BASE ("http://a/b/c/d;p?q") with REL ("?y") now results
in "http://a/b/c/d;p?y" as expected.

I have added a set of test cases for conformance with RFC 3986 as well.

Added file: http://bugs.python.org/file11053/issue1432-py26.diff