gh-55531: Implement `normalize_encoding` in C #136643

StanFromIreland · 2025-07-14T08:35:34Z

Issue: encoding package's normalize_encoding() function is too slow #55531

Lib/encodings/__init__.py

Lib/test/test_codecs.py

Objects/unicodeobject.c

picnixz

I know that it's a draft but here are already some comments that you can dismiss if you're working on them.

Lib/encodings/__init__.py

Lib/test/test_codecs.py

Modules/_codecsmodule.c

Objects/unicodeobject.c

Modules/_codecsmodule.c

Lib/encodings/__init__.py

Modules/_codecsmodule.c

StanFromIreland · 2025-07-14T09:26:08Z

I have cleaned up the changes and ensured the behavior remains the same; however, there are still a few points on which I need input from @malemburg (and, as Benedikt said, these should be their own issue):

This function is documented as taking strings, but during the 2->3 conversion an undocumented and untested change was made that allowed it to accept bytes. I have kept it this way (in Python, to make removal simpler), though I think this should be either documented and tested, or removed.
The function has been documented as ASCII-only, and for bytes it is. For strings, however, this has not been enforced with an error. What should we do?
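To make the two open questions concrete, here is a minimal pure-Python sketch of the behavior being described. It is not the PR's C code: the name `normalize_encoding_sketch` is hypothetical, and the logic is an assumption reconstructed from the test expectations quoted in this thread (bytes are decoded as ASCII, while non-ASCII characters in a `str` are silently dropped instead of raising).

```python
def normalize_encoding_sketch(encoding):
    """Hypothetical illustration of the behavior discussed above."""
    if isinstance(encoding, bytes):
        # Bytes input is strictly ASCII: decoding raises UnicodeDecodeError
        # on any non-ASCII byte.
        encoding = str(encoding, "ascii")

    chars = []
    pending_separator = False
    for ch in encoding:
        if ch.isalnum() or ch == ".":
            # Collapse any run of separators into a single underscore,
            # but never emit a leading underscore.
            if pending_separator and chars:
                chars.append("_")
            # Non-ASCII alphanumerics in a str are dropped, not rejected.
            if ch.isascii():
                chars.append(ch)
            pending_separator = False
        else:
            pending_separator = True
    return "".join(chars)


print(normalize_encoding_sketch("utf   8"))                        # utf_8
print(normalize_encoding_sketch("utf\xE9\u20AC\U0010ffff-8"))      # utf_8
print(normalize_encoding_sketch(b"utf-8"))                         # utf_8
```

The asymmetry is visible here: `b"utf\xe9"` fails loudly during decoding, while the equivalent `str` input is accepted and quietly loses the non-ASCII character.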

ZeroIntensity

Would you mind running some microbenchmarks?
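One way such a microbenchmark could look, using only the standard library's `timeit`. This is a rough sketch: the sample names and the `bench` helper are arbitrary choices for illustration, not anything from the PR, and a dedicated tool such as pyperf would give more stable numbers. Running it once on the base branch and once on the PR branch gives a before/after comparison.

```python
import timeit
from encodings import normalize_encoding

# Arbitrary sample inputs, chosen only for illustration.
SAMPLES = ["utf-8", "ISO_8859-1", "latin 1", "cp1252"]


def bench(number=100_000, repeat=5):
    """Return the best observed time per call, in nanoseconds, per sample."""
    results = {}
    for name in SAMPLES:
        timer = timeit.Timer(lambda: normalize_encoding(name))
        # Take the minimum over several repeats to reduce scheduling noise.
        best = min(timer.repeat(repeat=repeat, number=number))
        results[name] = best / number * 1e9
    return results


if __name__ == "__main__":
    for name, ns_per_call in bench().items():
        print(f"{name!r}: {ns_per_call:.0f} ns per call")
```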

ZeroIntensity · 2025-07-14T10:24:18Z

Modules/_codecsmodule.c

+    const char *cstr = PyUnicode_AsUTF8(encoding);
+    if (cstr == NULL) {
+        return NULL;
+    }
+
+    size_t len = strlen(cstr);


Use PyUnicode_AsUTF8AndSize.
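The reviewer does not spell out the reasoning, so the following is an assumption about why the size-returning call is preferable: it returns the length directly (no extra `strlen` pass), and that length stays correct even if the string contains an embedded NUL. The sketch below illustrates this from Python by calling the real C API function through `ctypes.pythonapi`; it only works on CPython, and `utf8_lengths` is a hypothetical helper name.

```python
import ctypes

# CPython-only: bind the C API function via ctypes.pythonapi.
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object,
                              ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8_and_size.restype = ctypes.c_void_p


def utf8_lengths(text):
    """Return (strlen-style length, reported size, full UTF-8 bytes)."""
    size = ctypes.c_ssize_t()
    ptr = _as_utf8_and_size(text, ctypes.byref(size))
    # Without a size, string_at stops at the first NUL byte, like strlen().
    strlen_view = ctypes.string_at(ptr)
    # With the reported size, the whole buffer is read.
    sized_view = ctypes.string_at(ptr, size.value)
    return len(strlen_view), size.value, sized_view


print(utf8_lengths("utf-8"))     # (5, 5, b'utf-8')
print(utf8_lengths("utf\0-8"))   # (3, 6, b'utf\x00-8')
```

For ordinary encoding names the two lengths agree, so the main gain in that case is skipping the second scan of the buffer.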

ZeroIntensity · 2025-07-15T10:04:27Z

Modules/_codecsmodule.c

+        return NULL;
+    }
+
+    char *normalized = PyMem_Malloc(len + 1);


We can avoid copies at the end by using PyUnicodeWriter here. It'll look something like this:

    PyUnicodeWriter *writer = PyUnicodeWriter_Create(len + 1); // instead of PyMem_Malloc
    if (writer == NULL) {
        /* ... */
    }
    /* ... */
    if (PyUnicodeWriter_WriteUTF8(writer, normalized, len + 1) < 0) {
        /* ... */
    }
    return PyUnicodeWriter_Finish(writer);

ZeroIntensity · 2025-07-15T10:08:06Z

Lib/encodings/__init__.py

@@ -26,9 +26,10 @@

 (c) Copyright CNRI, All Rights Reserved. NO WARRANTY.

-"""#"
+"""


Hate to be that guy, but these fly-by formatting changes just make it harder to review.

picnixz · 2025-07-15T10:20:21Z

Lib/test/test_codecs.py

@@ -3900,6 +3900,7 @@ def test_encodings_normalize_encoding(self):
        self.assertEqual(normalize('utf_8'), 'utf_8')
        self.assertEqual(normalize('utf\xE9\u20AC\U0010ffff-8'), 'utf_8')
        self.assertEqual(normalize('utf   8'), 'utf_8')
+


Suggested change

picnixz · 2025-07-15T10:22:00Z

Lib/encodings/__init__.py


 import codecs
+from _codecs import _normalize_encoding


put the from import after the sys import

picnixz · 2025-07-15T10:23:42Z

Lib/encodings/__init__.py

@@ -37,6 +38,7 @@
 _import_tail = ['*']
 _aliases = aliases.aliases

+


If you want to slip PEP-8 stuff, do it everywhere. In this case, I don't think we need to do it.

(Don't do it everywhere, that creates conflicts. If we're going to reformat, do it in a separate PR and backport it to 3.13 and 3.14.)

(Yes, also; so in this case, just remove this change)

C

92873d6

StanFromIreland requested review from vstinner and malemburg July 14, 2025 08:35

bedevere-app bot mentioned this pull request Jul 14, 2025

encoding package's normalize_encoding() function is too slow #55531

Open

Correct clinic note

4bae23a

picnixz reviewed Jul 14, 2025

StanFromIreland added 2 commits July 14, 2025 09:54

Little fixes

b5f3df3

Keep the messiness

2ad72b2

picnixz reviewed Jul 14, 2025

StanFromIreland added 2 commits July 14, 2025 10:29

Clean up tests

3660160

Remove unnecessary message

4e12b9e

StanFromIreland marked this pull request as ready for review July 14, 2025 12:54

bedevere-app bot added the awaiting review label Jul 14, 2025

StanFromIreland requested a review from picnixz July 14, 2025 12:54

ZeroIntensity reviewed Jul 15, 2025

picnixz reviewed Jul 15, 2025
