Issue 3300

There is a good amount of discussion going on around
http://bugs.python.org/issue3300. I had been following it from the start and was
inclined towards quote and quote_plus supporting UTF-8. But as the discussion
went on without a strong argument for which stance to take, I had to refresh
and improve my knowledge of Unicode support in Python, and especially of
Unicode strings in Python 3.0. Hopefully this will come in handy for other issues.

Here are some notes on Unicode and Python.


What is Unicode?
In computing, Unicode is an industry standard that allows computers to
consistently display and manipulate text expressed in most of the world's
writing systems.

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

What is Unicode Character Set?

What is character encoding?

What is Encoding?
Converting a character (or anything else) to a number, because computers
internally store only numbers.

A Unicode string is a sequence of code points, each a number from 0x000000 to
0x10FFFF. This sequence needs to be represented in memory as a sequence of
bytes (meaning values from 0 to 255). The rules for translating a Unicode
string into a sequence of bytes are called an encoding.
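
To make the code point versus byte distinction concrete, here is a small sketch.
It is written in Python 3 syntax (an assumption on my part, since the snippets
in these notes are Python 2 style), and the sample text and variable names are
made up for illustration.

```python
# A string is a sequence of code points; encoding turns it into bytes.
text = "h\u00e9llo"  # the second character is code point U+00E9

# One number per character: the code points themselves.
code_points = [ord(ch) for ch in text]

# Encoding produces bytes, each a value from 0 to 255.
encoded = text.encode("utf-8")
byte_values = list(encoded)

print(code_points)  # five code points
print(byte_values)  # six bytes, because U+00E9 takes two bytes in UTF-8
```

Note that the number of bytes need not match the number of code points.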

The representation in the number format is required for homogeneity; otherwise
it would be difficult to convert to and from.


What is Unicode Transformation Format?

What is UTF-8?

Unicode can be implemented using many character encodings. The most commonly
used one is UTF-8, which uses 1 byte for each ASCII character (with the same
code values as in the standard ASCII encoding) and up to 4 bytes for other
characters.
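
A quick way to check the 1 to 4 byte claim is to encode a few characters and
count the bytes. This is only a sketch in Python 3 syntax; the sample
characters are my own picks, one for each length.

```python
# One sample character for each UTF-8 length (1, 2, 3 and 4 bytes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each code point (as hex) to the number of bytes UTF-8 needs for it.
lengths = {hex(ord(ch)): len(ch.encode("utf-8")) for ch in samples}

# For ASCII characters, UTF-8 and ASCII produce the same byte.
ascii_same = "A".encode("utf-8") == "A".encode("ascii")

print(lengths)
print(ascii_same)
```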

When you write \u followed by four hex digits, you name one of the Unicode code
points, which are defined internationally and can be found at unicode.org.

Now, how to represent them in binary (because: computers!) is the trick, and
there are different encodings for doing so.
UTF-8 is one encoding; UTF-16 and ASCII are others.
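
As a sketch of how the same code point ends up as different bytes under
different encodings (Python 3 syntax; the choice of U+00E9 and of the
little-endian UTF-16 variant is mine, purely for illustration):

```python
# One code point, three encodings.
text = "\u00e9"

as_utf8 = text.encode("utf-8")       # two bytes: 0xC3 0xA9
as_utf16 = text.encode("utf-16-le")  # two bytes: 0xE9 0x00

# ASCII only covers code points 0 to 127, so this one cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(as_utf8, as_utf16, ascii_ok)
```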

So you construct a unicode string
mystr = u'\u0065\u0066\u0067\u0068'

mystr is a unicode object. Printing it directly depends on the encoding of the
output, so if you wish to see the object itself, use repr:
print repr(mystr)

Now, the unicode object can be converted to binary by encoding it. Let us use
'ascii' and 'utf-8':
asciistr = mystr.encode('ascii')
utf8str = mystr.encode('utf-8')

Now each result is a string object in binary, that is, a sequence of bytes.

Let us print asciistr and utf8str.
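
Here is the whole experiment as one runnable sketch. It uses Python 3 syntax
(an assumption; in Python 2 the literal would be u'...' and print would be a
statement). The names same_bytes and round_trip are mine.

```python
# Build the string from code points U+0065..U+0068, i.e. 'e', 'f', 'g', 'h'.
mystr = "\u0065\u0066\u0067\u0068"

# Encode the same text with two different encodings.
asciistr = mystr.encode("ascii")
utf8str = mystr.encode("utf-8")

print(asciistr)  # b'efgh'
print(utf8str)   # b'efgh'

# All four characters are ASCII, so both encodings give identical bytes.
same_bytes = asciistr == utf8str

# Decoding reverses the encoding and gives back the original string.
round_trip = utf8str.decode("utf-8")
```

The two results only differ once the string contains a character outside the
ASCII range.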

STILL NEED MORE UNDERSTANDING.

http://boodebr.org/main/python/all-about-python-and-unicode

A Unicode string holds characters from the Unicode character set.
