Inconsistent handling of non-ASCII characters in encodings.normalize_encoding() #136736

@serhiy-storchaka

Bug report

#83518 changed handling of non-ASCII characters in encodings.normalize_encoding(), but it is still inconsistent with codecs.lookup(), and not even self-consistent. For example:

>>> import encodings
>>> encodings.normalize_encoding('a¤b')
'a_b'
>>> encodings.normalize_encoding('aæb')
'ab'
>>> encodings.normalize_encoding('a-¤')
'a'
>>> encodings.normalize_encoding('a-æ')
'a_'
>>> encodings.normalize_encoding('a-¤-b')
'a_b'
>>> encodings.normalize_encoding('a-æ-b')
'a__b'

You can even get an underscore at the end or repeated underscores in the middle.
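
For comparison, here is a minimal sketch of one self-consistent rule. It is not the CPython implementation and not a proposed fix: the function name `normalize_sketch` is hypothetical, and the rule itself is an assumption (every character other than an ASCII letter, an ASCII digit, or a dot is treated as a separator, runs of separators collapse into a single underscore, and leading and trailing separators are dropped).

```python
def normalize_sketch(name: str) -> str:
    """Hypothetical self-consistent normalization (assumed rule, not CPython's).

    Keeps ASCII alphanumerics and '.', and treats every other character,
    ASCII or not, as a separator between name parts.
    """
    parts = []      # completed runs of kept characters
    current = []    # run currently being collected
    for ch in name:
        if ch == '.' or (ch.isascii() and ch.isalnum()):
            current.append(ch)
        elif current:
            # A separator ends the current run; consecutive separators
            # are ignored because `current` is already empty.
            parts.append(''.join(current))
            current = []
    if current:
        parts.append(''.join(current))
    # Joining the parts guarantees no leading, trailing, or doubled underscore.
    return '_'.join(parts)


for sample in ('a¤b', 'aæb', 'a-¤', 'a-æ', 'a-¤-b', 'a-æ-b'):
    print(repr(sample), '->', repr(normalize_sketch(sample)))
```

Under this assumed rule, '¤' and 'æ' are handled identically, so each pair of inputs above maps to the same result. Whether non-ASCII letters should act as separators, be dropped, or be rejected is a separate design question that this sketch does not settle.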

cc @malemburg, @vstinner, @shihai1991

Labels: 3.13 (bugs and security fixes), 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), stdlib (Python modules in the Lib dir), topic-unicode, type-bug (An unexpected behavior, bug, or error)