How to decode UTF strings?


On 2019-10-26 03:10, Arne Vajhøj wrote:
> On 10/25/2019 4:52 PM, DFS wrote:
>> =?iso-8859-9?b?T/B1eg==?= <oguz.ismail.uysal at gmail.com>
>> =?utf-8?Q?=EB=AF=B8?= <taeyeon10006 at gmail.com>
>> =?GBK?B?0Pu66A==?= <xuan.alan at 163.com>
>> =?UTF-8?B?zp3Or866zr/PgiDOks6tz4HOs86/z4I=?= <vergos.nikolas at gmail.com>
> 
> How does something like:
> 
> from email.header import decode_header
> 
> def test(s):
>       print(s)
>       s2 = decode_header(s)
>       print(s2[0][0])
>       print(s2[1][0].strip())
> 
> test('=?iso-8859-9?b?T/B1eg==?= <oguz.ismail.uysal at gmail.com>')
> test('=?utf-8?Q?=EB=AF=B8?= <taeyeon10006 at gmail.com>')
> test('=?GBK?B?0Pu66A==?= <xuan.alan at 163.com>')
> test('=?UTF-8?B?zp3Or866zr/PgiDOks6tz4HOs86/z4I=?= <vergos.nikolas at gmail.com>')
> 
> work?
> 
When you decode the header, you get a list of parts, each with its own
encoding (or None if the part wasn't encoded).
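
To see what those parts look like, here is a small sketch that prints the raw result of decode_header for one of the headers above. The variable names are only illustrative. The encoded word should come back as bytes along with its charset name, and the plain-text remainder with a charset of None.

```python
from email.header import decode_header

header = '=?utf-8?Q?=EB=AF=B8?= <taeyeon10006 at gmail.com>'
parts = decode_header(header)

for data, charset in parts:
    # Each part is a (data, charset) pair; charset is None for
    # text that was not MIME-encoded.
    print(repr(data), charset)
```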

Here's a simple example, based on your code:

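from email.header import decode_header

def test(header, default_encoding='utf-8'):
    parts = []

    # decode_header() yields (data, encoding) pairs, one per part.
    for data, encoding in decode_header(header):
        if isinstance(data, str):
            # Already text, so use it as-is.
            parts.append(data)
        else:
            # Bytes: decode with the part's own encoding, or the default
            # if the part has none.
            parts.append(data.decode(encoding or default_encoding))

    print(''.join(parts))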

test('=?iso-8859-9?b?T/B1eg==?= <oguz.ismail.uysal at gmail.com>')
test('=?utf-8?Q?=EB=AF=B8?= <taeyeon10006 at gmail.com>')
test('=?GBK?B?0Pu66A==?= <xuan.alan at 163.com>')
test('=?UTF-8?B?zp3Or866zr/PgiDOks6tz4HOs86/z4I=?= <vergos.nikolas at gmail.com>')
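
As an alternative sketch, the standard library's make_header can reassemble the decoded parts, and calling str() on the resulting Header object gives the text. The helper name decode_mime_header is hypothetical, not something from this thread, and the sketch assumes the charsets named in the header are ones Python knows about.

```python
from email.header import decode_header, make_header

def decode_mime_header(header):
    # Hypothetical helper: decode_header() splits the header into
    # (data, charset) parts, make_header() rebuilds a Header object
    # from them, and str() returns the decoded text.
    return str(make_header(decode_header(header)))

print(decode_mime_header('=?iso-8859-9?b?T/B1eg==?= <oguz.ismail.uysal at gmail.com>'))
print(decode_mime_header('=?utf-8?Q?=EB=AF=B8?= <taeyeon10006 at gmail.com>'))
```

Note that this version has no default_encoding fallback: parts without a charset are treated as ASCII, so the explicit loop is the safer choice if the unencoded text might not be plain ASCII.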