Unicode - I/O

codecs example: Autodetection of XML encoding

  • XML document starts with an XML declaration: <?xml ... ?>
  • Encoding can be determined by looking at first few bytes of input
     3C 3F 78 6D             # UTF-8, ASCII, Latin-1
     3C 00 3F 00             # UTF-16-LE
     00 3C 00 3F             # UTF-16-BE
     ...
  • Use of codec
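     import codecs

     # First four bytes of the document -> codec name
     encodings = {
       '\x3c\x3f\x78\x6d' : 'utf-8',
       '\x3c\x00\x3f\x00' : 'utf-16-le',
       '\x00\x3c\x00\x3f' : 'utf-16-be' }
     
     f = open("foo.xml", "rb")          # binary mode: read raw bytes
     # lookup() returns (encoder, decoder, stream reader, stream writer)
     reader = codecs.lookup(encodings[f.read(4)])[2]
     f.seek(0)                          # rewind so "<?xm" is decoded too
     fr = reader(f)                     # fr.read() now returns Unicode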
     ... 
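
The same idea can be sketched as a small, testable function. This is only an illustration under stated assumptions: it targets Python 3 (so the signatures are bytes literals), the helper name open_xml_reader and the UTF-8 fallback for unknown signatures are hypothetical choices that are not on the slide, and it ignores byte-order marks and the encoding named inside the declaration itself.

```python
import codecs
import io

# Signatures from the table above: the first four bytes of "<?xm"
# as they appear in each encoding.
SIGNATURES = {
    b"\x3c\x3f\x78\x6d": "utf-8",
    b"\x3c\x00\x3f\x00": "utf-16-le",
    b"\x00\x3c\x00\x3f": "utf-16-be",
}

def open_xml_reader(stream, default="utf-8"):
    """Wrap a seekable binary stream in a decoding StreamReader.

    Hypothetical helper: 'default' is an assumed fallback for
    signatures that are not in the table.
    """
    head = stream.read(4)
    stream.seek(0)  # rewind so the declaration itself is decoded too
    encoding = SIGNATURES.get(head, default)
    return codecs.getreader(encoding)(stream), encoding

# Usage with in-memory data instead of a file on disk.
source = '<?xml version="1.0"?><doc>caf\u00e9</doc>'
for name in ("utf-8", "utf-16-le", "utf-16-be"):
    reader, detected = open_xml_reader(io.BytesIO(source.encode(name)))
    text = reader.read()  # decoded str, identical to 'source'
```

codecs.getreader(encoding) returns the same stream reader class that codecs.lookup(encoding)[2] does, so the detection logic matches the slide's code.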
<<< O'Reilly OSCON 2001, New Features in Python 2, Slide 57
July 26, 2001, beazley@cs.uchicago.edu
>>>