Greedy vs Non-Greedy in re - A Good Example

Here is a good example that explains greedy vs. non-greedy matching using the re module in Python.



*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
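
A quick interactive check of both behaviours (Python 2 session):

>>> import re
>>> s = '<H1>title</H1>'
>>> re.match('<.*>', s).group()    # greedy: matches as much as possible
'<H1>title</H1>'
>>> re.match('<.*?>', s).group()   # non-greedy: matches as little as possible
'<H1>'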

Once in a blue moon

A Blue moon is a name given to an irregularly timed full moon. Most years have twelve full moons which occur approximately monthly, but each calendar year contains those twelve full lunar cycles plus about eleven days to spare. The extra days accumulate, so that every two or three years there is an extra full moon (this happens every 2.72 years). (Source: Wikipedia)

So, its frequency of occurrence, per second, would be:

1 / (2.72 * 365 * 24 * 60 * 60) = 1.16580118 × 10^-8


Now, this one is good. Excellent humor!

But I don't get how they arrived at that number; averaging in the leap years does not produce it either.
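
For what it's worth, the stated figure does check out if it is read as occurrences per second:

>>> print "%.4e" % (1 / (2.72 * 365 * 24 * 60 * 60))
1.1658e-08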

Python Types and Objects

http://www.cafepy.com/article/python_types_and_objects/python_types_and_objects.html

Good article to understand new style classes.

Discussing English Grammar in a bug report

Follow the discussion in this Python documentation bug report.
It's about the correct usage of English grammar.


Terry J. Reedy added the comment:

Benjamin: I thank you too for verifying that I was not crazy.

Martin: I noticed native/non-native split too, and chalked it up to a subtle difference between German and English.

For future reference, the problem with the original, as I see it now, is a subtle interaction between syntax and semantics. The original sentence combined two thoughts about has_key. The two should be either
coordinate (parallel in form) or one should be clearly subordinate. A subordinate modifier should generally be closer to the subject, especially if it is much shorter. Making that so was one of my suggestions. The coordinate form would be 'but it is deprecated'. But this does not work because 'it' would be somewhat ambiguous because of the particular first modifier.

The following pair of sentences illustrate what I am trying to say. Guido was once a Nederlander, but he moved to America. Guido was once a student of Professor X, but he moved to America.
In English, the second 'he' is ambiguous because of the particular first modifier.

So, to me, 'but deprecated' at the end of the sentence reads as either a misplaced subordinate form or as an abbreviated coordinate form that is at least somewhat ambiguous because of the meaning of the other modifier.

git

Linus Torvalds has quipped about the name "git", which is British English slang for a stupid or unpleasant person:
“ I'm an egotistical bastard, and I name all my projects after myself. First Linux, now git.”

This self-deprecation is certainly tongue-in-cheek, insofar as Torvalds did not, in fact, name Linux after himself (see History of Linux).

http://en.wikipedia.org/wiki/Git_(software)

RFC Hierarchy for Relative URL parsing

RFC hierarchy for relative URL formats:
RFC 3986 (STD 66) - This is the current standard.
|
RFC 2396 - This was the previous one.
|
RFC 2368
|
RFC 1808 - The urlparse module header says it follows this, but it has been updated many times since.
|
RFC 1738 - It started with this.

console font and resolution in ubuntu (Hardy Heron)

When Ubuntu (Hardy Heron) booted by default, the console resolution and font (ALT-F[1-6]) were really awful. I searched a lot to arrive at this solution.

To change your Console Resolution:

sudo vim /boot/grub/menu.lst


Go to your default kernel line, remove splash, and add vga=791.
The vga argument can take values from 791 to 795; google for the resolutions.
You have to remove the splash term from that line, otherwise the mode won't get set.
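
For illustration, a kernel line before and after the change might look like this (the kernel version and root device here are placeholders, not from the original setup):

# Before:
kernel /boot/vmlinuz-2.6.24-16-generic root=UUID=xxxx ro quiet splash
# After: splash removed, vga=791 (1024x768, 16-bit) added
kernel /boot/vmlinuz-2.6.24-16-generic root=UUID=xxxx ro quiet vga=791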

To change your console font:

sudo dpkg-reconfigure console-setup

Follow the interview screens and choose the font to your preference.
For me, VGA with Size 16 worked best.

Bogosort

"In computer science, bogosort (also random sort, shotgun sort or monkey sort) is a particularly ineffective sorting algorithm. Its only use is for educational purposes, to contrast it with other more realistic algorithms. If bogosort were used to sort a deck of cards, it would consist of checking if the deck were in order, and if it were not, one would throw the deck into the air, pick up the cards up at random, and repeat the process until the deck is sorted."

http://en.wikipedia.org/wiki/Bogosort
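
A toy Python sketch of the idea, for illustration only:

import random

def is_sorted(seq):
    return all(seq[i] <= seq[i + 1] for i in range(len(seq) - 1))

def bogosort(seq):
    # Shuffle until the sequence happens to come out in order.
    while not is_sorted(seq):
        random.shuffle(seq)
    return seq

print bogosort([3, 1, 2])

Expected running time grows as O(n * n!), which is why it is purely educational.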

Conversations with Jeremy while working on urllib for py3k

(01:08:39) Senthil: Hi Jeremy!
(01:14:33) Jeremy: hi
(01:16:00) Senthil: The patch applied properly.. while executing the tests,
it was not able to view the parse.py
(01:16:10) Senthil: File "/usr/local/lib/python3.0/email/utils.py", line 28,
in <module>
import urllib.parse
ImportError: No module named parse
(01:16:40) Jeremy: Be sure you delete urllib.py and urllib.pyc
(01:16:45) Senthil: python3.0 regrtest.py and python3.0 test_urllib.py would
give the same.
(01:16:47) Senthil: I did that.
(01:17:12) Jeremy: Can you import them interactively?
(01:17:18) Senthil: deleted the older urllib.py ( oops. let me check with
.pyc :) )
(01:17:42) Jeremy: >>> import urllib.parse >>> urllib.parse.__file__
'/usr/local/google/home/jhylton/py3k/Lib/urllib/parse.py'
(01:19:21) Senthil: [root@goofy python3.0]# python3.0
Python 3.0a5+ (py3k:64342M, Jun 18 2008, 00:51:20)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named urllib.parse
>>>
(01:20:43) Senthil: Okay, just a min, Jeremy. I shall get back.. I think, I
know the prob, it with patch application...
(01:22:57) Jeremy: ok
(01:57:57) Senthil: Hi Jeremy, two (minor) changes in the patch.
(01:58:42) Senthil: 1) Add in the Makefile LIBSUBDIRS = urllib\ ( so that
make install creates that package dir).
(01:59:14) Senthil: 2) __init__.py in the urllib ( that was reason parse.py
was not getting imported. :) )
(02:02:25) Jeremy: oops
(02:03:17) Jeremy: Is the __init__.py not in the patch?
(02:03:22) Senthil: nope,
(02:03:40) Senthil: It is not. Atleast in one you sent me.
(02:04:13) Senthil: test_urllib.py has 1 failure.
(02:04:19) Senthil: FAIL: test_geturl (__main__.urlopen_FileTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_urllib.py", line 103, in test_geturl
self.assertEqual(self.returned_obj.geturl(), self.pathname)
AssertionError: 'file:///tmp/@test' != '/tmp/@test'
(02:04:36) Jeremy: It's in my local client, so I'm not sure why some of the
new files get added and some don't.
(02:04:51) Senthil: ohh ok.
(02:05:00) Jeremy: That really baffling error I mentioned is in
test_urllibnet.py
(02:07:44) Senthil: yup. I see that in my box too.
(02:08:13) Jeremy: It would really help to get a second set of eyes on that
one. I think many of the other ones are shallow.
(02:09:05) Senthil: Jeremy, codereview.appspot.com seems to be giving 500
Server Error.
(02:09:18) Jeremy: I've been getting a lot of errors, too.
(02:09:19) Senthil: Have you checkin your code in the sandbox or any other
location?
(02:09:29) Jeremy: I guess I can check it in on a branch.
(02:10:22) Senthil: Would be a very good idea. Coz, to submit patches,
verify the changes etc etc.
(02:16:47) Senthil: Jeremy, both the test_urllibnet.py are due to urlopen
method not returning the file-object ( which previous urllib.py's urlopen ()
) returned.
(02:17:49) Jeremy: The HTTPResponse ought to behave like a file-object, I
thought.
(02:18:12) Senthil: the fix IMO, may not just be in the tests, but in what
to be decided upon for 2 urlopen from urllib.py and urllib2.py
(02:19:07) Senthil: In fact, both errors. One is test_fileno ( testing the
fileno attribute of file object and other test_readlines testing the
readlines method of the file object).
(02:20:49) Jeremy: Looking again, the both return an addinfourl() instance,
but...
(02:21:32) Jeremy: old urllib wraps a http client response and urllib2 wraps
an io.BufferedReader
(02:23:22) Jeremy: Doh! So a very small change seems to fix this particular
behavior.
(02:23:39) Jeremy: resp = urllib.response.addinfourl(r.fp, r.msg,
req.get_full_url()) resp.code = r.status resp.msg = r.reason return resp
(02:23:53) Jeremy: in urllib2 instead of wrapping it in a BufferedReader
(02:31:45) Senthil: am looking into it, digging through, trying to
understand/check it..
(02:32:10) Jeremy: I checked in my working copy in py3k-urllib branch!
(02:40:11) Jeremy: I'm heading home now. I'll check in with you later. Feel
free to fix stuff on the branch and we'll see if we can get it checked in
first thing in the morning (US time :-).
(02:40:14) Senthil: Jeremy, where do you plan to add this code:
(02:40:16) Senthil: resp = urllib.response.addinfourl
(02:40:29) Senthil: :)
(02:40:30) Senthil: sure.
(02:40:37) Senthil: I am at India time :D
(02:47:08) Jeremy: At the end of AbstractHTTPHandler.do_open
(02:58:52) Jeremy logged out.
(02:58:56) Senthil: Okay, I get it. :) such a minor change.
(02:59:20) Senthil: btw, it solved the fileno test, but readlines still
giving issues and new issues cropping up.
(02:59:38) Senthil: I shall look into it, Jeremy, Bye.
--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

Notes while working on urllib patches

Following are some of the notes I took while working on urllib patches.
It should be a handy reference when working on bugs again.


RFC 3986 Notes:
Mon Aug 25 10:17:01 IST 2008

- A URI is a sequence of characters that is not always represented as a
sequence of octets. (What are octets? An octet is just 8 bits. Nothing else!)

- Percent-encoded octets may be used within a URI to represent characters
outside the range of the US-ASCII coded character set.

- The specification uses the Augmented Backus-Naur Form (ABNF) notation of
[RFC2234], including the following core ABNF syntax rules defined by that
specification: ALPHA (letters), CR (carriage return), DIGIT (decimal digits),
DQUOTE (double quote), HEXDIG (hexadecimal digits), LF (line feed) and SP (space).

Section 1 of RFC 3986 is very generic. Understand that a URI should be transferable
and that a single generic syntax should cover the whole range of URI schemes.

- URI Characters are, in turn, frequently encoded as octets for transport or
presentation.
- This specification does not mandate any character encoding for mapping
between URI characters and the octets used to store or transmit those
characters.
pct-encoded = "%" HEXDIG HEXDIG

- For consistency, URI producers and normalizers should use uppercase
hexadecimal digits for all percent-encodings.
reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="


unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".

Section 2 was on encoding and decoding the characters in a URI: how
percent-encoding is used for reserved characters appearing within data, and how
transmission of a URL from a local context to a public one using a different
encoding should be translated at the interface level.

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty

Many URI schemes include a hierarchical element for a naming
authority so that governance of the name space defined by the
remainder of the URI is delegated to that authority (which may, in
turn, delegate it further).

userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
host = IP-literal / IPv4address / reg-name

In order to disambiguate the host syntax between IPv4address and reg-name, the
"first-match-wins" algorithm is applied:

A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.

IP-literal = "[" ( IPv6address / IPvFuture ) "]"

IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"

ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address

h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255

reg-name = *( unreserved / pct-encoded / sub-delims )

Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then
each octet of the corresponding UTF-8 sequence must be percent-encoded to be
represented as URI characters.

When a non-ASCII registered name represents an internationalized domain name
intended for resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup.
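
Python exposes this transformation as the 'idna' codec; for example (the domain here is purely illustrative):

>>> u'm\xfcnchen.example'.encode('idna')
'xn--mnchen-3ya.example'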

Section 3 was about the sub-components and their structure, and how to go about
encoding/decoding them when they are represented in non-ASCII.

path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters

path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

relative-ref = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty

Section 4 was on usage aspects and the heuristics used to determine the scheme
in normal usage where the scheme is not given.

- A base URI must be stripped of any fragment component prior to being used
as a base URI.

Section 5 was on the relative reference resolution algorithm. I had covered
it practically in the Python urlparse module.

Section 6 was on normalization of URIs for comparison and the various
normalization practices that are used.


========================================================================
Python playground:

>>> if -1:
...     print True
...
True
>>> if 0:
...     print True
...
>>>

Use of namedtuple in py3k branch for urlparse.

========================================================================

Dissecting urlparse:

1) __all__ provides the public interface to the methods: urlparse, urlunparse,
urljoin, urldefrag, urlsplit and urlunsplit.

2) Then there is a classification of schemes: uses_relative, uses_netloc,
non_hierarchical, uses_params, uses_query, uses_fragment.
- These should be defined in an RFC, most probably RFC 1808.
- There is a special '' (blank string) entry in certain classifications, which
means the classification applies by default.

3) Valid characters in a scheme name should also be defined in RFC 1808.

4) class ResultMixin is defined to provide username, password, hostname and
port.

5) from collections import namedtuple. This is available from Python 2.6;
namedtuple is a pretty interesting feature.

6) SplitResult and ParseResult. Very good use of namedtuple and ResultMixin.

7) The behaviour of the public methods urlparse, urlunparse, urlsplit,
urlunsplit and urldefrag matters most.

urlparse - scheme, netloc, path, params, query and fragment.
urlunparse will take those parameters and construct the url back.

urlsplit - scheme, netloc, path, query and fragment.
urlunsplit - takes these parameters (scheme, netloc, path, query and fragment)
and returns a url.

urlparse x urlunparse
urlsplit x urlunsplit
urldefrag
urljoin
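
A quick roundtrip over these pairs (Python 2.6 session):

>>> from urlparse import urlparse, urlunparse, urlsplit, urlunsplit
>>> url = 'http://www.python.org/path;params?query#frag'
>>> urlparse(url)
ParseResult(scheme='http', netloc='www.python.org', path='/path', params='params', query='query', fragment='frag')
>>> urlunparse(urlparse(url)) == url
True
>>> urlsplit(url)
SplitResult(scheme='http', netloc='www.python.org', path='/path;params', query='query', fragment='frag')
>>> urlunsplit(urlsplit(url)) == url
True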


Date: Tue Aug 19 20:40:46 IST 2008

Changes to urlsplit functionality in urllib.
As per the RFC3986, the url is split into:
scheme, authority, path, query, frag = url

The authority part in turn can be split into the sections:
user, passwd, host, port = authority

The following line is the regular expression for breaking-down a
well-formed URI reference into its components.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

scheme = $2
authority = $4
path = $5
query = $7
fragment = $9

The urlsplit functionality in urllib can be moved to a new regular-expression-based
parsing mechanism.
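
As a sketch of what that could look like, using the RFC 3986 appendix B regular expression quoted above (the function name here is mine, not urllib's):

import re

URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def regex_urlsplit(url):
    m = URI_RE.match(url)
    # Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query, fragment.
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

print regex_urlsplit('http://www.python.org/path?query#frag')
# ('http', 'www.python.org', '/path', 'query', 'frag')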

From man uri, which conforms to RFC 2396 and the HTML 4.0 specs:

- An absolute identifier refers to a resource independent of context, while a
relative identifier refers to a resource by describing the difference from
the current context.

- A path segment which contains a colon character ':' can't be used as the
first segment of a relative URI path; write it as './file:path' instead.

- A query can be given in the archaic "isindex" format, consisting of a word or
a phrase and not including an equals sign (=). If '=' is present, then it must
come after '&', as in the '&key=value' format.

Character Encodings:

- Reserved characters: ;/?:@&=+$,
- Unreserved characters: ALPHA, DIGITS, -_.!~*'()

An escaped octet is encoded as a character triplet consisting of the percent
character '%' followed by the two hexadecimal digits representing the octet
code.

HTML 4.0 specification section B.2 recommends the following, which should be
considered best available current guidance:

1) Represent each non-ASCII character as UTF-8
2) Escape those bytes with the URI escaping mechanism, converting each byte to
%HH where HH is the hexadecimal notation of the byte value.
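
Following that recipe by hand for, say, 'é' (U+00E9) in a Python 2 session:

>>> utf8_bytes = u'\u00e9'.encode('utf-8')          # step 1: UTF-8
>>> ''.join('%%%02X' % ord(b) for b in utf8_bytes)  # step 2: %HH per byte
'%C3%A9'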

One of the important changes when adhering to RFC3986 is parsing of IPv6
addresses.

===============================================================================
1) Bug issue1285086: urllib2.quote is too slow

For the short-circuit path, the regexp can make quote 10x as fast in
exceptional cases, even compared to the faster version in trunk. The
average win for the short circuit seems to be twice as fast as the map in my
timings. This sounds good.

For the normal path, the overhead can make quote 50% slower. This IMHO
makes it unfit as a quote replacement. Perhaps good for a cookbook recipe?

Regarding the OP's use case, I believe either adding a string cache to
quote or flagging stored strings as "safe" or "must quote" would result
in a much greater impact on performance.

Attaching patch against trunk. Web framework developers should be
interested in testing this and could provide the use cases/data needed
for settling this issue.

Index: urllib.py
===================================================================
--- urllib.py (revision 62222)
+++ urllib.py (working copy)
@@ -27,6 +27,7 @@
import os
import time
import sys
+import re
from urlparse import urljoin as basejoin

__all__ = ["urlopen", "URLopener", "FancyURLopener", "urlretrieve",
@@ -1175,6 +1176,7 @@
                'abcdefghijklmnopqrstuvwxyz'
                '0123456789' '_.-')
 _safemaps = {}
+_must_quote = {}
 
 def quote(s, safe = '/'):
     """quote('abc def') -> 'abc%20def'
@@ -1200,8 +1202,11 @@
     cachekey = (safe, always_safe)
     try:
         safe_map = _safemaps[cachekey]
+        if not _must_quote[cachekey].search(s):
+            return s
     except KeyError:
         safe += always_safe
+        _must_quote[cachekey] = re.compile(r'[^%s]' % safe)
         safe_map = {}
         for i in range(256):
             c = chr(i)


----------------------------------------------------------------------
What does this construct imply?

x = lambda: None
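
It binds x to a function that takes no arguments and returns None; roughly equivalent to:

def x():
    return None

It is a common idiom for a do-nothing default callback.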

----------------------------------------------------------------------
How can we differentiate whether an expression is used as a general expression or as a boolean expression?

----------------------------------------------------------------------
Having a construct like:

def __init__(self, *args, **kwargs):
    BaseClass.__init__(self, *args, **kwargs)

But in the base class, I find that it is not taking the tuple and dict as
arguments.

I don't understand the

assert(proxies, 'has_key'), "proxies must be mapping"
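
For what it's worth, the first part is just argument re-unpacking: *args and **kwargs collect extra positional and keyword arguments into a tuple and a dict, and the * and ** at the call site spread them back out, so the base class receives ordinary arguments rather than a tuple and a dict. A minimal sketch (class names hypothetical):

class BaseClass(object):
    def __init__(self, a, b, flag=False):
        print a, b, flag

class Derived(BaseClass):
    def __init__(self, *args, **kwargs):
        # Here args == (1, 2) and kwargs == {'flag': True}, but the call
        # below unpacks them back into individual arguments for BaseClass.
        BaseClass.__init__(self, *args, **kwargs)

Derived(1, 2, flag=True)   # prints: 1 2 True

As for the assert: as written, assert(proxies, 'has_key') asserts the two-element tuple (proxies, 'has_key'), which is always true; it reads like a mistyped hasattr(proxies, 'has_key') check.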

----------------------------------------------------------------------

* What is an addrinfo struct.

The getaddrinfo() function returns a list of 5-tuples with the following structure:
(family, socktype, proto, canonname, sockaddr)

family, socktype and proto are all integers and are meant to be passed to the socket() function. canonname is a string representing the canonical name of the host; it can be a numeric IPv4/v6 address when AI_CANONNAME is specified for a numeric host.

socket.gethostbyname(hostname)
Translate a host name to IPv4 address format. The IPv4 address is returned as a string, such as '100.50.200.5'. If the host name is an IPv4 address itself it is returned unchanged. See gethostbyname_ex() for a more complete interface. gethostbyname() does not support IPv6 name resolution, and getaddrinfo() should be used instead for IPv4/v6 dual stack support.


We need to replace the gethostbyname socket call because it is IPv4-specific; using the getaddrinfo() function can provide IPv4/v6 dual-stack support.

import socket

# Existing, IPv4-only call:
# print socket.gethostbyname(hostname)

def gethostbyname(hostname):
    # getaddrinfo returns a list of (family, socktype, proto, canonname,
    # sockaddr) tuples; the address itself is the first field of sockaddr.
    family, socktype, proto, canonname, sockaddr = socket.getaddrinfo(hostname, None)[0]
    return sockaddr[0]

----------------------------------------------------------------------

Compare dates for cache control

RFC 1123 date format:
Thu, 01 Dec 1994 16:00:00 GMT

<pre>
>>> import datetime, time
>>> datereturned = "Thu, 01 Dec 1994 16:00:00 GMT"
>>> dateexpired = "Sun, 05 Aug 2007 03:25:42 GMT"
>>> obj1 = datetime.datetime(*time.strptime(datereturned, "%a, %d %b %Y %H:%M:%S %Z")[0:6])
>>> obj2 = datetime.datetime(*time.strptime(dateexpired, "%a, %d %b %Y %H:%M:%S %Z")[0:6])
>>> if obj1 == obj2:
...     print "Equal"
... elif obj1 > obj2:
...     print datereturned
... elif obj1 < obj2:
...     print dateexpired
...
</pre>


* Now you can compare the headers for expiry in cache control.

Header field definition:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Adding Header information to Apache:
http://webmaster.info.aol.com/apache.html

To add header:
Go to the /etc/httpd/conf/httpd.conf
For e.g:
Add the information on headers
Header set Author "Senthil"


Q) The question is: can I add a test for the cache?
A) It's not functionality; it's merely an internal optimization.

Q) If a test for the cache needs to be written, how would you write it?
A) Request a URL that redirects, request it again, and verify that the
response comes from the dictionary, or that the dictionary value is stored.


from_url = "http://example.com/a.html"
to_url = "http://example.com/b.html"

h = urllib2.HTTPRedirectHandler()
o = h.parent = MockOpener()

req = Request(from_url)
def cached301redirect(h, req, url=to_url):
    h.http_error_301(req, MockFile(), 301, "Blah",
                     MockHeaders({"location": url}))

# Why is Request object taking two parameters?

req = Request(from_url, origin_req_host="example.com")
count = 0
try:
    while 1:
        redirect(h, req, "http://example.com")
        count = count + 1
        if count > 2:
            self.assertEqual("http://example.com",
                             urllib2.HTTPRedirectHandler().cache[req].geturl())
except urllib2.HTTPError:
    self.assertEqual(count, urllib2.HTTPRedirectHandler.max_repeats)

CacheFTPHandler test cases are hard to write.

--
O.R.Senthil Kumaran

Ariane 5 Flight 501

While listening to an introductory programming class, I came across a mention of the "Ariane 5 Flight 501" failure incident, which was caused by an arithmetic exception: an integer overflow resulting from automatic type conversion of a float to an integer in the Ada program. A very costly software bug.

This German article discusses the issue. Below is an English translation of it, made with the help of translate.google.com, with some comments.



Ariane 5 - 501 (1-3)

4th June 1996, Kourou / French Guiana, ESA

Maiden flight of the new European launcher (weight: 740 tons, payload 7-18 t) with 4 Cluster satellites.
Development costs over 10 years: DM 11 800 million. (I am not sure about "11 800 million"; if it means DM 11.8 billion, that is roughly 400 billion rupees at the ~34 INR/DM rate implied by the damage figures below.)

Ada program of the inertial navigation system (excerpt):

...
declare
    vertical_veloc_sensor:   float;
    horizontal_veloc_sensor: float;
    vertical_veloc_bias:     integer;
    horizontal_veloc_bias:   integer;
    ...
begin
    declare
        pragma suppress(numeric_error, horizontal_veloc_bias);
    begin
        sensor_get(vertical_veloc_sensor);
        sensor_get(horizontal_veloc_sensor);
        vertical_veloc_bias   := integer(vertical_veloc_sensor);
        horizontal_veloc_bias := integer(horizontal_veloc_sensor);
        ...
    exception
        when numeric_error => calculate_vertical_veloc();
        when others        => use_irs1();
    end;
end irs2;



Effect:
37 seconds after ignition of the rocket (30 seconds after liftoff), Ariane 5 reached an altitude of 3700 m with a horizontal velocity of 32768.0 (internal units). This value was about five times higher than that of the Ariane 4.

The conversion into an integer led to an overflow, which was not caught.

The backup computer (redundancy!) had run into the same problem 72 ms earlier and had immediately switched itself off.

As a result, diagnostic data were sent to the main computer, which interpreted them as trajectory data. Consequently, nonsensical control commands were sent to the side solid boosters, and later to the main engine, to correct a large flight deviation (over 20 degrees) that did not actually exist.

The rocket, however, was threatened with structural damage and destroyed itself (39 sec). I guess auto-destruct.

An intensive test of the navigation and main computers had not been undertaken, since the software had been tested in the Ariane 4.

Damage:
DM 250 million launch costs (~8.5 billion INR)
DM 850 million for the Cluster satellites (~29 billion INR)
DM 600 million for future improvements (~20 billion INR)
Loss of earnings for 2 to 3 years

The next test flight was carried out only 17 months later; the first stage ended its burn prematurely.

The first commercial flight took place in December 1999.

Tragedy:

The problematic part of the program was needed only during launch preparation and the launch itself.
It was supposed to remain active only for a transitional period after liftoff, for safety reasons: 50 seconds, so that the ground station could quickly resume the countdown after an interruption of the launch.

Despite the very different behaviour of the Ariane 5, the values were not re-examined.

Optimization:
Only 3 of the 7 variables were checked for overflow; for the other 4 variables, evidence existed that the values would remain small enough (for the Ariane 4).

This evidence did not hold for the Ariane 5, and this was not even recognized.
The problem was the reuse of software!

Incredible, after 40 years of software-defect findings:

During program design it was assumed that only hardware failures could occur! Therefore the backup computer ran identical software. The system specification established that, in case of failure, the computer switches itself off and the backup computer steps in. A restart of a computer was not considered useful, since re-determining the flight attitude would be too expensive.

PS: The attempt to launch 4 new Cluster satellites succeeded in July and August 2000, with two Russian launchers.


Master Vim Help system

This article
details the steps for creating your own help file in the Vim help system.
It should be very helpful.

fortune from pyljvim

Education is an admirable thing, but it is well to remember from time to
time that nothing that is worth knowing can be taught.
-- Oscar Wilde, "The Critic as Artist"

[issue600362] relocate cgi.parse_qs() into urlparse

This issue and its comments somehow escaped my notice initially. I have
addressed your comments in the new set of patches.

1) The previous Docs patch had issues. Updated the Docs patch.
2) Included a message in cgi.py about parse_qs and parse_qsl being present
for backward compatibility.
3) The reason the py26 version of the patch carries the quote function from
urllib is to avoid a circular reference: urllib imports urlparse for the
urljoin method, so the only way for us to use quote is to have that portion
of code in the patch as well.

Please have a look at the patches.
As this request has been open for a long time (since 2002-08-26!), is it
possible to include this change in b3?

Thanks,
Senthil

Added file: http://bugs.python.org/file11116/issue600362-py26-v2.diff

issue2756: urllib2 add_header fails with existing unredirected_header


>>> import urllib2
>>> url = 'http://www.whompbox.com/headertest.php'
>>> request = urllib2.Request(url)
>>> request.add_data("Spam")
>>> f = urllib2.urlopen(url)
>>> request.header_items()
[]
>>> request.unredirected_hdrs.items()
[]
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]


Comment: This is fine. What is actually happening is that the do_request_ method invoked via http_open() is setting the unredirected_hdrs to the above items.



>>> request.add_header('Content-type','application/xml')
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]


Comment: When we add_header() the headers are indeed changed. Correct behavior.
>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

Comment: add_header() has not modified the unredirected_hdrs.
Is this the whole purpose of issue2756? If yes, then a better understanding of unredirected_hdrs is needed: in the do_request_ method of AbstractHTTPHandler, which changes unredirected_hdrs based on the logic of "not request.has_header(...)", what is that check actually aiming for?

If add_header() is not supposed to change unredirected_hdrs, and add_unredirected_header() is the call to change unredirected_hdrs, then it is working fine and as expected.

(This is an undocumented interface; the items() call was used for viewing the headers, though actual code might not be using it.)



>>>request.add_unredirected_header('Content-type','application/xml')
>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]
>>>
Comment: add_unredirected_header() has taken effect correctly.

The patch attached to the issue report modifies the add_header() and add_unredirected_header() methods to remove any existing header of the same name. After applying it, we will observe that the unredirected header itself is removed and never added back.

After application of attached patch:

>>> url = 'http://www.whompbox.com/headertest.php'
>>> request = urllib2.Request(url)
>>> request.add_data("Spam")
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Content-type', 'application/x-www-form-urlencoded'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.add_header('Content-type','application/xml')
>>> f = urllib2.urlopen(request)
>>> request.header_items()

[('Content-length', '4'), ('Content-type', 'application/xml'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]

>>> request.unredirected_hdrs.items()

[('Content-length', '4'), ('Host', 'www.whompbox.com'), ('User-agent', 'Python-urllib/2.6')]
>>>
Comment: Notice the absence of the Content-type header.

Issue 2756 - urllib2 add_header fails with existing unredirected header

The issue is noticeable here:

[ors@goofy ~]$ python
Python 2.6b2+ (trunk:65482M, Aug 4 2008, 14:26:01)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> url = 'http://hroch486.icpf.cas.cz/formpost.html'
>>> import urllib2
>>> req_obj = urllib2.Request(url)
>>> req_obj.unredirected_hdrs
{}
>>> req_obj.add_data("Spam")
>>> req_obj.unredirected_hdrs
{}
>>> response = urllib2.urlopen(req_obj)
>>> req_obj.unredirected_hdrs
{'Content-length': '4', 'Content-type': 'application/x-www-form-urlencoded',
'Host': 'hroch486.icpf.cas.cz', 'User-agent': 'Python-urllib/2.6'}
>>> req_obj.add_data("SpamBar")
>>> req_obj.add_header("Content-type","application/html")
>>> response = urllib2.urlopen(req_obj)
>>> req_obj.unredirected_hdrs
{'Content-length': '4', 'Content-type': 'application/x-www-form-urlencoded',
'Host': 'hroch486.icpf.cas.cz', 'User-agent': 'Python-urllib/2.6'}
>>> req_obj.get_data()
'SpamBar'
>>>


In the final req_obj call, the unredirected_hdrs had changed in neither
Content-length nor Content-type.

[issue2776] urllib2.urlopen() gets confused with path with // in it

I played around with pdb module today to debug this issue. pdb is really
helpful.
Here's how the control goes.
1) There is an url with two '//'s in the path.
2) The call is data = urllib2.urlopen(url).read()
3) urlopen calls build_opener. build_opener builds the opener using a (tuple
of) handlers.
4) opener is an instance of OpenerDirector() and has default HTTPHandler and
HTTPSHandler.
5) When the Request call is made and the request has 'http' protocol, then
http_request method is called.
6) HTTPHandler has http_request method which is
AbstractHTTPHandler.do_request_

Now, for this issue we get to the do_request_ method and see that

7) host is set in the do_request_ method in the get_host() call.
8) request.get_selector() is the call which is causing this particular issue
of "urllib2 getting confused with path containing //".
.get_selector() method returns self.__r_host.
Now, when a proxy is set using set_proxy(), self.__r_host is self.__original
(the original complete url itself), so the get_selector() call returns the
sel_url properly and we can get the host from the splithost() call on the
sel_url.

When a proxy is not set and the url contains '//' in the path segment, the
.get_host() call (step 7) would have separated self.host and self.__r_host
(the latter pointing to the rest of the url), and .get_selector() simply
returns this self.__r_host, the rest of the url except the host, thus causing
the call to fail.

9) Before the fix, request.add_unredirected_header('Host', sel_host or host)
had the escape mechanism set for proper urls, wherein sel_host is not set and
the host is used. Unfortunately, that failed when this bug caused sel_host to
be set to self.__r_host, and the Host header was being set up wrongly (to the
rest of the url).

The patch which was attached appropriately fixed the issue. I modified it and
included it for py3k.


>
> I could reproduce this issue on trunk and p3k branch. The patch attached
> by Adrianna Pinska "appropriately" fixes this issue. I agree with the
> logic. Attaching the patch for py3k with the same fix.
>
> Thanks,
> Senthil
>
> Added file: http://bugs.python.org/file11103/issue2776-py3k.diff
>

Python urllib bugs roundup

From the previous post of bugs round up, finished activities (bugs which are closed now) include:

* http://bugs.python.org/issue1432 - Strange Behaviour of urlparse.urljoin
* http://bugs.python.org/issue2275 - urllib2 header capitalization.
* http://bugs.python.org/issue2916 - urlgrabber.grabber calls setdefaulttimeout
* http://bugs.python.org/issue2195 - urlparse() does not handle URLs with port numbers properly. - Duplicate issue.
* http://bugs.python.org/issue2829 - Copy cgi.parse_qs() to urllib.parse - Duplicate of issue600362.
* http://bugs.python.org/issue2885 - Create the urllib package. (But the tests are still named test_urllib2, test_urlparse etc. I shall discuss with python-dev whether renaming the tests is okay before beta3 and give it an attempt.)

Activities TODO:

High- Priority:

Now, the list of bugs which are partially completed and require addressing of some issues mentioned in the patches. These take higher priority, as addressing them would result in closure sooner.

* http://bugs.python.org/issue600362 - relocate cgi.parse_qs() into urlparse
* http://bugs.python.org/issue2776 - urllib2.urlopen() gets confused with path with // in it
* http://bugs.python.org/issue2756 - urllib2 add_header fails with existing unredirected_header Patch attached.
* http://bugs.python.org/issue2464 - urllib2 can't handle http://www.wikispaces.com

Plan:
I shall attempt to address all these issues before the release of beta3 (that should be either August 15 or August 23).

The following were some of the main issues to be taken up during GSoC.
Now that I have understood RFC 3986 better, I can work on issue1591035. I shall work on it on the branch and then discuss its inclusion in the trunk.


Feature Requests:

* http://bugs.python.org/issue1591035 - update urlparse to RFC 3986. Plan: by August 23.
* http://bugs.python.org/issue1462525 - URI parsing library. This depends upon the previous issue, so we can assume Aug 23 for closure.
* http://bugs.python.org/issue2987 - RFC2732 support for urlparse (e.g. http://[::1]:80/). This is a related bug again and will conclude by the same time-line.


I shall take up the following listed bugs after completion of the above.

* http://bugs.python.org/issue1448934 - urllib2+https+proxy not working.
* http://bugs.python.org/issue1424152 - urllib/urllib2: HTTPS over (Squid) Proxy fails
* http://bugs.python.org/issue1675455 - Use getaddrinfo() in urllib2.py for IPv6 support. Patch provided.

Low priority.
* http://bugs.python.org/issue1285086 - urllib.quote is too slow

Issue 3300

There is a good amount of discussion going on around
http://bugs.python.org/issue3300. I had been following it from the start and
had an inclination towards quote and quote_plus supporting UTF-8. But as the
discussion went further, without a strong position on which stance to take, I
had to refresh and improve my knowledge of unicode support in Python, and
especially Unicode strings in Python 3.0. Hopefully this will come in handy in
other issues.

Here are some notes on Unicode and Python.


What is Unicode?
In computing, Unicode is an industry standard allowing computers to
consistently display and manipulate text expressed in most of the world's
writing systems.

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

What is Unicode Character Set?

What is character encoding?

What is Encoding?
Converting a character (or anything) to a number, because computers internally
store only numbers.

Unicode strings are a sequence of code points in the range 0x000000 to 0x10FFFF.
This sequence needs to be represented as a sequence of bytes (meaning, values
from 0-255) in memory. The rules for translating a Unicode string into a
sequence of bytes are called an encoding.

The representation in number format is required for uniformity, otherwise
it would be difficult to convert to and from.


What is Unicode Transformation Format?

What is UTF-8?

Unicode can be implemented using many character encodings. The most commonly
used one is UTF-8, which uses 1 byte for all ASCII characters (which have the
same code values as in the standard ASCII encoding) and up to 4 bytes for
other characters.

Beyond ASCII, the \u escapes denote the remaining Unicode code points, which
you will find defined internationally at unicode.org.

Now, how to represent them in BINARY (coz: computer!) is the trick, and you
have different encodings to do so.
So UTF-8 is one encoding, and UTF-16 and ASCII are other, different encodings.

So you construct a unicode string:

mystr = u'\u0065\u0066\u0067\u0068'

mystr is a unicode object. It does not make sense to print it directly;
if you wish to inspect the object, use repr:

print repr(mystr)

Now, the unicode object can be converted to binary using an encoding; let us
use 'ascii' and 'utf-8'. So you would do:

asciistr = mystr.encode('ascii')
utf8str = mystr.encode('utf-8')

Now each is a string object in BINARY.

Let us print asciistr and utf8str.
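
Putting the above into a concrete Python 2 session (with one non-ASCII code point added to show where the encodings diverge):

>>> mystr = u'\u0065\u0066\u0067\u0068'
>>> print repr(mystr)
u'efgh'
>>> mystr.encode('ascii')
'efgh'
>>> mystr.encode('utf-8')
'efgh'
>>> u'\u00c0'.encode('utf-8')     # non-ASCII: two bytes in UTF-8
'\xc3\x80'
>>> u'\u00c0'.encode('ascii')     # not representable in ASCII
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc0' in position 0: ordinal not in range(128)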

STILL NEED MORE UNDERSTANDING.

http://boodebr.org/main/python/all-about-python-and-unicode

A Unicode string holds characters from the Unicode character set.

[issue1432] Strange behavior of urlparse.urljoin

I have made changes to urlparse.urljoin so that it behaves in conformance with
RFC 3986. The join of BASE ("http://a/b/c/d;p?q") with REL ("?y") results in
"http://a/b/c/d;p?y" as expected.

I have added a set of testcases for conformance with RFC3986 as well.

Added file: http://bugs.python.org/file11053/issue1432-py26.diff

Re: [issue2275] urllib2 header capitalization

I am submitting a revised patch for this issue.
I did some analysis of the history of this issue and found that this
.capitalize() vs .title() change had come up earlier too
(issue1542948) and a decision was taken:
- To retain the header format in .capitalize() to maintain backward
compatibility.
- However, when the headers are passed to httplib, they are converted to
.title() format (see the AbstractHTTPHandler method).
- Users are encouraged to use the .add_header(), .has_header() and
.get_header() methods to check for headers instead of using the .headers
dict directly (which will still remain an undocumented interface).

The note to Hans-Peter would be: changing the headers to .title() tends to
make the .header_items() retrieval backward incompatible, so the headers
will still be stored in .capitalize() format.
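
The difference between the two string methods, concretely:

>>> 'user-agent'.capitalize()
'User-agent'
>>> 'user-agent'.title()
'User-Agent'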

And I have made the following changes to the patch:
1) Support for case-insensitive dict lookup, which will work with
.has_header() and .get_header(). So .has_header("User-Agent") will
return True even when .headers gives {"User-agent": "blah"}.
2) Added tests to tests the behavior.
3) Changes to doc to reflect upon this issue.

Btw, the undocumented .headers interface will also support
case-insensitive lookup, so I have added tests for that too.

Let me know if you have any comments. Lets plan to close this issue.

Thanks,

_______________________________________
<http://bugs.python.org/issue2275>
_______________________________________

Update and [issue2275]

It's been some time since I posted my progress. Well, I traveled out of town for
a weekend, and then I couldn't get back into the groove immediately. Just
realized that it's been more than a week.
Things will be much faster now, and I hope not to get into unplanned travel
schedules.

Okay, coming back. I started working on issue2275, which is causing much
debate.

In the discussion, I realized that there is a "difference of opinion" on
fixing the bug. I had assumed that the headers dictionary should be
"User-Agent"="Mozilla Form", while currently it is "User-agent"="Mozilla
form".

For backward compatibility purposes, it looks like we will have to maintain
the capitalize() form, provide the title() case to the other methods, and also
implement the .headers methods.

After much thought on this discussion and reading some of the articles, I
have come to think that, apart from the current functionality of the headers,
the following would be desirable:

1) a public .headers interface;
2) a get_header method returning .title()-ed names;
3) a header_items method returning .title()-ed names.

I referenced the Python 2.3.8 library docs and found that those methods were
not there; they have been implemented only from Python 2.4. So I am looking
into those two older releases to see where this change surfaced, and to fix
things specific to that change so that older code does not break.

> John J Lee <jjlee@users.sourceforge.net> added the comment:
>
> > With respect to point 1), I assume that we all agree upon that headers
> > should stored in Titled-Format instead of Capitalized-format.
>
> I would probably choose to store the headers in Capitalized-form,
> because that makes implementing .headers trivial.
>
> [...]
> > Now, if we go for a Case Normalization at the much later stage, will the
> > headers be stored still in capitalize() format? ( In that case, this bug
> > requests it be stored in .titled() format confirming to many practices)
> > Would you like to explain a bit more on that?
>
> Implement .get_header() and friends using .headers, along the lines of:
>
> def get_header(self, header_name, default=None):
>     return self.headers.get(
>         header_name,
>         self.unredirected_hdrs.get(header_name, default)).title()
>
> And then ensure that the headers actually passed to httplib also get
> .title()-cased. This also has the benefit, compared with your patch, of
> leaving the behaviour of non-HTTP URL schemes unchanged.
>

[issue2275] urllib2 header capitalization

Added file: http://bugs.python.org/file10849/issue2275-py26.diff

- Included a case-insensitive dict lookup for the headers interface.
- Headers will now be .title()-ed instead of .capitalize()-ed.
- Included tests for the changes made.

http://bugs.python.org/issue2275

To Study:

- The difference between a normal header and an unredirected header in the HTTP
implementation.
- The difference between gethostname and gethostbyname, and whether gethostname
on (Unix/Windows/Mac) supports IPv6 addresses. This is used directly by _socket.c
in Python.
- RFC 2396! At least 3 bugs reference this, and the url parsing libraries need
to be upgraded to be conformant with RFC 2396.

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

[issue3094] By default, HTTPSConnection should send header "Host: somehost" instead of "Host: somehost:443"

http://bugs.python.org/issue3094

I had commented on the bug and patch. Good that it is fixed now.
_______________________________________

Gregory P. Smith added the comment:

fixed in trunk r64771.

(and indeed the previous behavior was buggy: in the extremely rare event that
someone ran an https server on port 80, the :80 should have been supplied).

Caseinsensitive Dict lookup


class CaseInsensitiveDict(dict):

    def __init__(self, *args, **kwargs):
        self.keystore = {}
        d = dict(*args, **kwargs)
        for k in d.keys():
            self.keystore[self._get_lower(k)] = k
        return super(CaseInsensitiveDict, self).__init__(*args, **kwargs)

    def __setitem__(self, k, v):
        if hasattr(self, 'keystore'):
            self.keystore[self._get_lower(k)] = k
        return super(CaseInsensitiveDict, self).__setitem__(k, v)

    def __getitem__(self, k):
        if hasattr(self, 'keystore') and self._get_lower(k) in self.keystore:
            k = self.keystore[self._get_lower(k)]
        return super(CaseInsensitiveDict, self).__getitem__(k)

    @staticmethod
    def _get_lower(k):
        if isinstance(k, str):
            return k.lower()
        else:
            return k



This one seems to do the trick at last.

This is based on the logic that, for every key in the dictionary (which you
create because you want case-insensitive lookup), the lowercased key is stored
in the keystore, mapping to the original key.

When you retrieve an item from the dict, your request can be in any case: it is
lowered, then used to look up the actual key as it was stored in the keystore,
and the value is then retrieved using that key.

As the __init__, __setitem__ and __getitem__ methods use the super() call to
dict with the *same* key and value passed to a normal dictionary, this class
with an internal keystore behaves as unsurprisingly as possible.
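
A quick interactive check of that behaviour, using the class above:

>>> d = CaseInsensitiveDict()
>>> d['User-Agent'] = 'Python-urllib/2.6'
>>> d['user-agent']
'Python-urllib/2.6'
>>> d.keys()
['User-Agent']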

I devised it this way after a good number of trials.
Lesson to myself:

Study the concepts and try the code before you look for examples on the web,
otherwise you tend to get influenced by the examples. Sometimes that might be
helpful, but it only serves as learning; you might want to get back to
implementing it in the way you best understand.

During this patch workout, I got to know mixins, decorators, staticmethod,
classmethod and dict better.

There are still a couple of tests failing with
issue2275 (http://bugs.python.org/issue2275); hopefully I will have ironed
them out by today.

sudo write in vim

After editing a file, you discover that its mode won't allow you to save.
Then you do:

:w !sudo tee % > /dev/null

release schedules from pep-0361

Jul 15 2008: Python 2.6b2 and 3.0b2 planned
Aug 23 2008: Python 2.6b3 and 3.0b3 planned
Sep 03 2008: Python 2.6rc1 and 3.0rc1 planned
Sep 17 2008: Python 2.6rc2 and 3.0rc2 planned
Oct 01 2008: Python 2.6 and 3.0 final planned

svn merge

All the changes with respect to urllib for py3k were made in py3k-urllib branch
before merging/checking in.
I wanted to fix the py3k related bugs in the same branch so that merge will be
easier later. In order to bring the branch up-to-date with the trunk code, I
had to merge the changes from the trunk into the py3k-urllib branch.

Looked into the svn merge and found out the way to do it.

- The svn merge command compares two trees and applies the difference to a
working copy.
- Syntax to remember:
<pre class="prettyprint">
svn merge <destination_url_to_merge_to> <source_url_to_merge_from> <working_copy>
</pre>
In my case it was:
<pre class="prettyprint">
svn merge svn+ssh://pythondev@svn.python.org/python/branches/py3k-urllib svn+ssh://pythondev@svn.python.org/python/branches/py3k .
</pre>

- Before this, I had to svn update my working copy as well.
- svn merge changes your working copy, so to record the merge on your branch
you have to do svn ci.

So in effect,
1) Go to your working copy.
2) svn update.
3) svn merge.
4) svn commit.


--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

working on Issue2275

Currently working on issue2275
 
import UserDict

class CaseInsensitiveDict(dict, UserDict.DictMixin):

    def __init__(self, *args, **kw):
        self.orig = {}
        super(CaseInsensitiveDict, self).__init__(*args, **kw)

    def items(self):
        keys = dict.keys(self)
        values = dict.values(self)
        return dict((self.orig[k], v) for k in keys for v in values)

    def __setitem__(self, k, v):
        hash_val = hash(k.lower())
        self.orig[hash_val] = k
        dict.__setitem__(self, hash_val, v)

    def __getitem__(self, k):
        return dict.__getitem__(self, hash(k.lower()))


somedict = CaseInsensitiveDict()
print somedict
somedict['Blah'] = "Boo"
somedict['blah'] = "Boo1"
print somedict['BLAH']
print somedict
print somedict.items()


This can be used for creating a case insensitive dictionary.
But there are tests failing in urllib2 if I use it directly. I think more methods than just items() need to be overridden for usage.

URLError giving NameError

* Georgij Kondratjev wrote:

Not creating new bug entry because everybody can quickly fix it.

In urllib/request.py some instances of URLError are raised with "raise
urllib.error.URLError" and this works, but there are lines with "raise
URLError" which produce "NameError: global name 'URLError' is not defined".

http://bugs.python.org/issue2775

Update: Georg fixed it in revision 64624.
It was quick, and I turned up a bit late.

urllib issues Roundup

I had a spreadsheet listing the bugs in urllib. I could not use it as effectively as I wished. Decided to list the bugs in the blog itself so that I stay on top of the TODO items.


Feature Requests:


Bugs:


The following are quick fixes as per my analysis.



The following will take days.



Low priority.

  • http://bugs.python.org/issue1285086
    urllib.quote is too slow

Fixed Issues: Yet to be closed

Issue600362 :relocate cgi.parse_qs() into urlparse

Patch for Python 2.6 and Python 3.0: http://bugs.python.org/issue600362

I followed a long route for this patch:

- Did svn revert -R .
- Then modified the files.
- Created the patch.
- Installed and tested.

The previous patch modifies the trunk code at places which are +/-2 lines away from this patch.
Now, this is really difficult: only one of them will apply cleanly. I don't know a way in which related patches, addressing different issues, can be supplied continuously and all apply cleanly.

Either have to figure out the way or ask around.

Working on issue 754016

Working on issue 754016, for the past two days.

- Facundo's suggestion is that when the netloc does not start with '//', urlparse
should raise a ValueError. I am analyzing how feasible that will be, because
urlparse is not only for urls but for other schemes also, where the path
directly follows the scheme name:
>>> urlparse.urlparse('mailto:orsenthil@example.com')
ParseResult(scheme='mailto', netloc='', path='orsenthil@example.com', params='',
query='', fragment='')

To give an idea of the various url forms which urlparse can take, I wrote this
test program



import urlparse

list_of_valid_urls = ['g:h','http:g','http:','g','./g','g/',
    '/g','//g','?y','g?y','g?y/./x','.','./','..',
    '../','../g','../..','../../g','../../../g','./../g',
    './g/.','/./g','g/./h','g/../h','http:g','http:','http:?y',
    'http:g?y','http:g?y/./x']

for url in list_of_valid_urls:
    print 'url-> %s -> %s' % (url, urlparse.urlsplit(url))




- The ValueError suggestion needs some more thought and discussion, I believe.

- My current solution stands at a documentation fix highlighting the need for
compliance with the RFC when mentioning the netloc value:
netloc should be //netloc.


- In the same bug, there was an error pointed out by Antony: the patch fails
for the url 'http:' with an IndexError.
Yes, that's the mistake I made in the patch, wherein I referenced the character beyond ':' to check whether it is a digit; when the character itself is not present, an IndexError results. The tests did not catch it either.

I have corrected it now and added tests for it.

Just as I added a test for 'http:', I added tests for 'https:' and 'ftp:' as well,
and recognized that the current urlparse fails for the same reason the old patch
failed.

>>> urlparse.urlparse('https:')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>> urlparse.urlparse('ftp:')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/urlparse.py", line 107, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "/usr/local/lib/python2.6/urlparse.py", line 164, in urlsplit
scheme, url = url[:i].lower(),url[i+1]
IndexError: string index out of range
>>>

http: does not fail because urlparse handles it as a special case.

Got to give some more thought to this today evening. It could lead to some more
changes to urlparse to handle 'scheme:' as a valid input for scheme, instead of
failing with an IndexError.

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org

urllib and NTLM Authentication?

I don't think it is in my list of bug fixes, but I've got to look into this topic as
it was a required thing when developing certain apps at the office. Yesterday, one
of my friends recollected it as well.

urllib package

The first betas of Python 3.0 and Python 2.6 were scheduled for release on Jun
11, but now they are postponed to June 18th.

There is a TODO Task of packaging urllib and it comes under my GSOC task as
well. The Bug report had another developer assigned to it and I have informed
that I would give it a try.

The Standard Library Reorganization follows the PEP3108, most of the other
things are done. So, things are set as such.

If I follow the example of httplib Reorganization, the following has already
taken effect.

Python 2.5        ||   Python 3.0 / Python 2.6 (new package: http)

httplib          ------- http.client   (client.py)
BaseHTTPServer   ------- http.server   (server.py)
CGIHTTPServer    ------- http.server   (server.py)
SimpleHTTPServer ------- http.server   (server.py)
                         (no naming conflicts should occur)
Cookie           ------- http.cookies  (cookies.py)
cookielib        ------- http.cookiejar

A similar reorganization is designed for urllib, and this will be my TODO
task. From PEP 3108:

urllib2 -------- urllib.request ( request.py)
urlparse -------- urllib.parse ( parse.py)
urllib -------- urllib.parse, urllib.request

The current urllib module will be split into parse.py and request.py
- quoting related functionalies will be added to parse.py
- URLOpener and FancyUrlOpener will be added to request.py
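
To make the end state concrete, here is a rough sketch of how client code would import things after the split; the module names come from PEP 3108, though the exact contents of each module are still to be settled:

<pre>
# After the reorganization (Python 3.0), per PEP 3108:
from urllib.parse import urlsplit, quote        # formerly urlparse / urllib quoting
from urllib.request import urlopen, URLopener   # formerly urllib2 / urllib openers

print(urlsplit('http://www.python.org/dev/'))
</pre>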

Other activities should include:

- Docs need to be updated.
- Tests need to be updated and verified to run properly.
- No conflicts should occur.
- Python 3.0: testing needs to be done.
- Changes to other modules.

I shall set an internal target of June 16, with 4 hours per day devoted
exclusively to this task.

For Bugs #2195 and #754016

A patch to fix this issue. I deliberated upon
this for a while and came up with the approach to:

1) fix the port issue, wherein urlparse should distinguish
the ':' that separates the port from the ':' after the scheme.

2) provide a doc fix advising that, in the absence of
a scheme, the net_loc be given as //net_loc (following RFC 1808).

If we go for any other fix, like internally prepending //
when the user has not specified the scheme (as for many
practical purposes), then we stand a chance of breaking a
number of tests (cases where the URL is 'g' (path only) or ';x'
(path with params), and cases where the relative URL is g:h).

Let me know your thoughts on this.

>>> urlparse('1.2.3.4:80')
ParseResult(scheme='', netloc='', path='1.2.3.4:80',
params='', query='', fragment='')
>>> urlparse('http://www.python.org:80/~guido/foo?query#fun')
ParseResult(scheme='http', netloc='www.python.org:80',
path='/~guido/foo', params='', query='query',
fragment='fun')
>>>

Index: Doc/library/urlparse.rst
===================================================================
--- Doc/library/urlparse.rst (revision 64056)
+++ Doc/library/urlparse.rst (working copy)
@@ -52,6 +52,23 @@
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

+ If the scheme value is not specified, urlparse, following the syntax
+ specifications of RFC 1808, expects the netloc value to start with '//'.
+ Otherwise, it is not possible to distinguish between the net_loc and path
+ components, and the indistinguishable component would be classified as the
+ path, as in a relative URL.
+
+ >>> from urlparse import urlparse
+ >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
+ ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
+ params='', query='', fragment='')
+ >>> urlparse('help/Python.html')
+ ParseResult(scheme='', netloc='', path='help/Python.html', params='',
+ query='', fragment='')
+
If the *default_scheme* argument is specified, it gives the default addressing
scheme, to be used only if the URL does not specify one. The default value for
this argument is the empty string.

Index: Lib/urlparse.py
===================================================================
--- Lib/urlparse.py (revision 64056)
+++ Lib/urlparse.py (working copy)
@@ -145,7 +144,7 @@
         clear_cache()
     netloc = query = fragment = ''
     i = url.find(':')
-    if i > 0:
+    if i > 0 and not url[i+1].isdigit():
         if url[:i] == 'http': # optimize the common case
             scheme = url[:i].lower()
             url = url[i+1:]

RFC1808 and RFC1738 Notes

urlparse has the following unittests:

run_unittest(urlParseTestCase)
- checkRoundtrips
- test_roundtrips (8)
- test_http_roundtrips (6)
- checkJoin
- test_unparse_parse(9)
- test_RFC1808 (1)
- test_RFC2396 (2)
- test_urldefrag (10)
- test_urlsplit_attributes (11)
- test_attributes_bad_port (3)
- test_attributes_without_netloc (4)
- test_caching (5)
- test_noslash (7)

<pre>
>>> RFC1808_BASE = "http://a/b/c/d;p?q#f"
>>> urlparse.urlsplit(RFC1808_BASE)
SplitResult(scheme='http', netloc='a', path='/b/c/d;p', query='q', fragment='f')
</pre>

In the checkJoin tests, the parameters are (base, relurl, expected).

Relative URL resolution always takes the base URL and the relative URL, follows
the algorithm described in RFC 1808, and produces the expected URL.

The syntax for a relative URL is a shortened form of that for an absolute
URL, where the prefix of the URL is missing and certain path components ('.'
and '..') have a special meaning when interpreting the relative path.
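
For instance, urljoin resolves '.' and '..' against the RFC 1808 base used in the tests; these two results are standard RFC 1808 cases:

<pre>
>>> import urlparse
>>> urlparse.urljoin('http://a/b/c/d;p?q#f', 'g')
'http://a/b/c/g'
>>> urlparse.urljoin('http://a/b/c/d;p?q#f', '../g')
'http://a/b/g'
</pre>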

- If both params and query are present, the query must occur after the
params.

- The question mark character (?) is allowed in the ftp and file path segments.

- Parsing a scheme: the scheme can contain alphanumeric characters, "+", ".",
and "-", and must end with ':'.

I was confused as to how the scheme can contain characters like "+", ".", and
"-", and was looking for examples.
Well, svn+ssh://svn.python.org/trunk/ is an example where the scheme contains
the "+" character.


The URL is denoted by:

<pre>
<scheme>://<scheme_specific_part>
</pre>

The scheme name consists of a sequence of characters. The lowercase letters
"a"-"z", digits, and the characters plus "+", period ".", and hyphen "-" are allowed.

- A URL is basically a sequence of octets in the coded character set. (I did
not quite understand this.)

- In hierarchical naming schemes, the components of the hierarchy are separated
by "/".


From RFC 1738: URL schemes that involve the direct use of an IP-based protocol
to a specified host on the Internet use a common syntax for the scheme-specific
data.

<pre>
//<user>:<password>@<host>:<port>/<url-path>
</pre>

- The scheme-specific data starts with a double slash '//' to indicate that it
complies with the common Internet scheme syntax.
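
urlsplit exposes exactly these components as attributes; for example (assuming a Python 2.6 build, with a made-up host for illustration):

<pre>
>>> import urlparse
>>> r = urlparse.urlsplit('ftp://user:secret@ftp.example.com:2121/pub/file.txt')
>>> r.username, r.password, r.hostname, r.port
('user', 'secret', 'ftp.example.com', 2121)
</pre>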


In its current state, the urlparse module complies with RFC 1808 and
basically assumes the netloc specification starts with //.
That may be correct with respect to the relative-URL syntax, but for practical
purposes we do specify URLs as www.python.org, i.e. without the scheme and
without the // of the netloc.

Port handling in urlparse is seriously lacking.

urlparse and port number

Bugs #2195 and #754016 both complain about urlparse not handling the port number properly and often giving erroneous results with respect to scheme, netloc, and path.

Yes, it misbehaves when the netloc does not start with //. But for all practical purposes, when we use a URL without a scheme, we plainly write the netloc part, like www.python.org.

This requires a fix, and the following patch should do it.


@@ -143,7 +143,7 @@ def urlsplit(url, scheme='', allow_fragm
     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
         clear_cache()
     netloc = query = fragment = ''
-    i = url.find(':')
+    i = url.find('://')
     if i > 0:
         if url[:i] == 'http': # optimize the common case
             scheme = url[:i].lower()
@@ -164,6 +164,9 @@ def urlsplit(url, scheme='', allow_fragm
         scheme, url = url[:i].lower(), url[i+1:]
     if scheme in uses_netloc and url[:2] == '//':
         netloc, url = _splitnetloc(url, 2)
+    else:
+        netloc, url = _splitnetloc(url)
+
     if allow_fragments and scheme in uses_fragment and '#' in url:
         url, fragment = url.split('#', 1)
     if scheme in uses_query and '?' in url:

1) The first change differentiates the port's ':' from the scheme's '://'.
2) The second change, for when no scheme is given, simply splits the URL into the netloc and the rest of the URL.
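
If the patch works as intended, a scheme-less URL carrying a port should then split into a proper netloc; the expected (not yet test-verified) result would be:

<pre>
>>> urlparse.urlparse('www.python.org:80/~guido/foo')
ParseResult(scheme='', netloc='www.python.org:80', path='/~guido/foo',
            params='', query='', fragment='')
</pre>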

Got to write the tests for it and submit it.

One general review comment: urlparse.urlsplit is not written in a very composed/collected way. There have been a lot of realizations (just like the one above), followed by patches/additions to fix them; that is why we see a special condition for http handled in its own block of code.
Those can be cleaned up.

Timelines for next beta releases

Timelines for next releases:

Jun 11 2008: Python 2.6b1 and 3.0b1 planned
Jul 02 2008: Python 2.6b2 and 3.0b2 planned
Aug 06 2008: Python 2.6rc1 and 3.0rc1 planned
Aug 20 2008: Python 2.6rc2 and 3.0rc2 planned

Summer of Code 2008

I am working on enhancing the Python urllib module as part of the Google Summer of Code project. I will be using this blog to post updates on the project.

How does indentation work for Python programs?

It is well explained in this article.

It is the lexical analyzer that takes care of indentation, not the Python parser. The lexical analyzer maintains a stack for the indentation.
1) Initially, for no indentation, it stores 0 on the stack: [0].
2) When an indentation occurs, it denotes it by the token INDENT and pushes the indent value onto the stack. Think of it as an opening { brace in a C program. And, if we visualize it, there can be only one INDENT token per line.
3) When a de-indent occurs on a line, values are popped off the stack until the value on top of the stack equals the new, reduced indentation (if no value matches, it is an error), and for each value popped a DEDENT token is written. (Like multiple closing }} in C.)

A simple piece of code like this:


if x:
    if True:
        print 'yes'
print 'end'


Would be written as:

<if> <x> <:>                        # Stack: [0]
<INDENT> <if> <True> <:>            # Stack: [0, 4]
<INDENT> <print> <'yes'>            # Stack: [0, 4, 8]
<DEDENT> <DEDENT> <print> <'end'>   # Stack: [0]

The parser then just treats <INDENT> as the { of a block and <DEDENT> as the } of a block, and is able to parse the code into logical blocks.
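
These tokens can be observed directly with the standard tokenize module; a minimal Python 2 sketch, using the example code above as the input string:

<pre>
import token
import tokenize
from StringIO import StringIO

code = "if x:\n    if True:\n        print 'yes'\nprint 'end'\n"

# Print only the INDENT/DEDENT tokens the lexical analyzer produces.
for tok_type, tok_str, start, end, line in tokenize.generate_tokens(StringIO(code).readline):
    if tok_type in (tokenize.INDENT, tokenize.DEDENT):
        print token.tok_name[tok_type], 'at line', start[0]
</pre>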

That was a well written article again.

fortune cookie with MS Outlook

Folks who love Unix have a habit of attaching a fortune cookie to their signatures. I missed it when using MS Outlook, so I figured out a way to get a fortune cookie attached to the MS Outlook signature.


1) Install QLiner Quotes Software.
2) Download QLiner_fortune.

This is a modified version of the fortune database for use with the QLiner software.
* Removed all offensive cookies. Most often we use MS Outlook at the office. I don't wish anyone to get fired; but again, if you do get fired due to some fortune cookie, I am not responsible.
* Only quotes of fewer than 100 characters are included. Long quotes are bad.
* There is a lame program, oneliners.py, which I wrote to convert the fortune database files to suit QLiner. Modify it and use it, if you wish (a rough sketch of it appears after these steps).

3) Go to C:\Program Files\QLiner\Quotes\files
4) Remove the existing files (or keep them if you wish) and copy the files from QLiner_fortune to this directory.
5) Use the Configuration Wizard to choose your interests.

There you go!
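
For the curious, here is roughly what oneliners.py does. This is a reconstruction from memory, assuming fortune's plain-text format (quotes separated by lines containing a single '%') and assuming QLiner reads one quote per line; the file names are placeholders:

<pre>
# Hypothetical sketch of oneliners.py: convert a fortune database
# file into a one-quote-per-line file for QLiner.
def convert(fortune_file, out_file, max_len=100):
    with open(fortune_file) as f:
        # fortune entries are separated by lines containing a single '%'
        quotes = f.read().split('\n%\n')
    with open(out_file, 'w') as out:
        for q in quotes:
            q = ' '.join(q.split())        # flatten to a single line
            if q and len(q) < max_len:     # keep only the short quotes
                out.write(q + '\n')

convert('fortunes', 'quotes.txt')
</pre>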

"I don't know, " said the voice on the PA, "apathetic bloody planet, I've no sympathy at all. "

UT300R2U USB Drivers Working on Windows XP

Got my BSNL UT300R2U USB drivers working on Windows XP. My previous attempts were futile; this time, one of these files did it right.
When installed, usbdriver asked me to plug in the USB device. I unplugged and replugged it a couple of times, but the device was not recognized.
Then I installed the modem DLLs using the Windows Add/Remove Hardware wizard that turned up.
In a couple of minutes, I found the USB modem connection working, and later the InstallShield of the usbdriver also finished the installation.

Now, I have two connections from the single modem. Ethernet going to Linux box (Goofy) and USB connection to Laptop (Dogbert).

Thank you retupmoc, for the USB cable.

cdrecord in FC2

Some notes we take save us a lot of time at a later date. This post at my older blog,
cdrecord in Fedora Core 2, saved me a couple of hours of research today.

Robbie, the eBot

This video shows "Robbie, the eBot" performing the grab and drop actions.