Unicode

Python 2.0 provides support for Unicode strings

  • Needed for internationalization, XML, etc.

Unicode: In a nutshell

  • Internally, all character values are extended to 16 bit integers (a C short or wchar_t).
  • Character values 0-127 represent the same characters as 8-bit ASCII.
  • Otherwise, everything is about the same (well, mostly).

Issues

  • How do you specify Unicode strings in a program? (You can't type most of the characters)
  • External representation and I/O.
  • Compatibility with 8-bit strings (comparison, coercion, regular expressions, etc.)

Note:

  • When discussing Unicode, U+xxxx used to indicate a Unicode character value.
  • Ex: U+006A
  • This is not python syntax
<<< O'Reilly OSCON 2001, New Features in Python 2, Slide 49
July 26, 2001, beazley@cs.uchicago.edu
>>>