python2 vs python3


On 10/21/19, Albert-Jan Roskam <sjeik_appie at hotmail.com> wrote:
> On 18 Oct 2019 20:36, Chris Angelico <rosuav at gmail.com> wrote:
>
>> That's correct. The output of the command is, by default, given to you
>> in bytes.
>
> Do you happen to know why this is the default?  And is there a reliable way
> to figure out the encoding? On posix, it's probably utf8, but on windows I
> usually use cp437, but knowing windows, it could be any codepage

In Python 3.6+ on Windows, use "oem" instead of assuming OEM is
codepage 437. In Western Europe, OEM is 850, and in Windows 10 it can
even be set to 65001 (i.e. UTF-8). Python also supports "ansi"
("mbcs"). These two are implemented via codecs.code_page_encode and
codecs.code_page_decode, so, for better or worse, their 'replace'
error handling uses the Windows best-fit mappings instead of just
substituting "?". For example:

    >>> import unicodedata
    >>> c = '\N{GREEK SMALL LETTER BETA}'
    >>> c_oem = c.encode('oem', 'replace').decode('oem')
    >>> c_oem
    'ß'
    >>> unicodedata.name(c_oem)
    'LATIN SMALL LETTER SHARP S'
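
For contrast, Python's own table-driven codecs do no best-fit mapping.
A minimal sketch, assuming for the sake of comparison that OEM is
codepage 437, and using the portable "cp437" codec so that it runs on
any platform:

```python
import unicodedata

beta = '\N{GREEK SMALL LETTER BETA}'

# Python's built-in cp437 table has no entry for U+03B2, so the
# 'replace' handler substitutes a literal question mark.
replaced = beta.encode('cp437', 'replace').decode('cp437')
print(unicodedata.name(replaced))

# Sharp s does exist in codepage 437 (byte 0xE1). It is the character
# that the best-fit mapping in the example above substitutes for beta.
sharp_s = '\N{LATIN SMALL LETTER SHARP S}'
print(sharp_s.encode('cp437'))
```

So the same input round-trips to "?" with "cp437" but to a lookalike
character with "oem", which may or may not be what you want.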

I'd like to also have something like "conin" and "conout" encodings
that use the attached console's current input and output codepages.
But at least it's simple to write a little ctypes-based function
that implements this.
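
Something along these lines might do. It's only a sketch: the function
name console_encodings is made up for illustration, and it assumes
that the "cpNNN" name built from the codepage number is a codec that
Python knows about, which won't hold for every codepage.

```python
import ctypes
import sys

def console_encodings():
    """Return (input, output) codec names for the attached console.

    Returns None when not running on Windows, or when no console is
    attached (GetConsoleCP and GetConsoleOutputCP return 0 then).
    """
    if sys.platform != 'win32':
        return None
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    cp_in = kernel32.GetConsoleCP()
    cp_out = kernel32.GetConsoleOutputCP()
    if not cp_in or not cp_out:
        return None
    # Assumption: Python has a codec named after the codepage number.
    return 'cp%d' % cp_in, 'cp%d' % cp_out

print(console_encodings())
```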

When writing to a pipe, almost all Windows command-line programs
default to one of OEM, ANSI, the current console input or output
codepage, UTF-8, or UTF-16. The latter two may also write a UTF byte
order mark (BOM). Sometimes the output encoding can be configured via
command-line options or environment variables (e.g. ipconfig.exe
supports an "OutputEncoding" environment variable).
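
Given that spread of possibilities, one workable approach when reading
a program's output from a pipe is to check for a BOM first and only
then fall back on a legacy encoding. A rough sketch, where the helper
name decode_tool_output and the choice of fallback are my assumptions,
not something any particular program guarantees:

```python
import codecs

def decode_tool_output(data, fallback):
    """Decode bytes captured from a child process's stdout pipe.

    `fallback` is the codec to use when no BOM is found, e.g. 'oem'
    or 'ansi' on Windows. The caller has to guess this one.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode('utf-8')
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec consumes the BOM and picks the byte order.
        return data.decode('utf-16')
    return data.decode(fallback, errors='replace')

print(decode_tool_output(codecs.BOM_UTF8 + b'abc', 'ascii'))
```

This can't detect UTF-16 or UTF-8 that was written without a BOM; in
that case you have to know what the program does.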

>  (you can even change it with chcp.exe)

It's actually "chcp.com". Thus subprocess.Popen('chcp') fails because
CreateProcessW only appends ".EXE" to a name that has no extension
when looking for the executable; pass 'chcp.com' explicitly instead.
This binary uses the ".com" extension for compatibility with legacy
batch scripts. But don't let the extension fool you. It's just a
regular Windows PE binary, not a 16-bit MS-DOS binary.
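
A small sketch of spawning it with the full name. The wrapper run_chcp
is just an illustrative name, and it returns the raw bytes because how
the output should be decoded depends on the system:

```python
import subprocess
import sys

def run_chcp():
    """Run chcp.com and return its raw stdout bytes (None off Windows)."""
    if sys.platform != 'win32':
        return None
    # The explicit ".com" matters: a bare 'chcp' would be searched
    # for as "chcp.exe", which doesn't exist.
    proc = subprocess.run(['chcp.com'], stdout=subprocess.PIPE)
    return proc.stdout

print(run_chcp())
```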

As mentioned above, some programs use either the console input or
output codepage when writing to a pipe. This does not include Windows
Python, however, which instead defaults to ANSI. This can be
overridden via environment variables and command-line options that set
the standard I/O encoding (PYTHONIOENCODING) or force UTF-8 mode
(PYTHONUTF8=1 or "-X utf8" in Python 3.7+).
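
To see the effect, you can ask a child Python what it picked for a
piped stdout. This is a sketch; the helper child_stdout_encoding is a
name I made up, and the "-X utf8" case assumes Python 3.7 or later:

```python
import os
import subprocess
import sys

def child_stdout_encoding(extra_env=None, extra_args=()):
    """Return sys.stdout.encoding as reported by a child Python
    whose stdout is a pipe."""
    env = dict(os.environ)
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args, '-c',
           'import sys; print(sys.stdout.encoding)']
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return proc.stdout.decode('ascii').strip()

print(child_stdout_encoding())                               # default
print(child_stdout_encoding({'PYTHONIOENCODING': 'utf-8'}))  # env var
print(child_stdout_encoding(extra_args=['-X', 'utf8']))      # UTF-8 mode
```

The first line printed depends on the platform and locale; the other
two should report UTF-8 regardless.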