gh-55531: Implement normalize_encoding in C #136643
base: main
I know it's a draft, but here are some early comments; feel free to dismiss any you're already working on.
I have cleaned up the changes and ensured the behavior remains the same; however, there are still a few points where I need input from @malemburg.
Would you mind running some microbenchmarks?
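A minimal sketch of such a microbenchmark, assuming the function is reachable as `encodings.normalize_encoding` on each build being compared. The sample names and the `bench` helper are hypothetical, and this uses the standard-library `timeit` module rather than a dedicated tool such as pyperf:

```python
import timeit
from encodings import normalize_encoding

# Hypothetical sample inputs; pick names that match the real workload.
SAMPLES = ["utf-8", "UTF_8", "latin-1", "iso 8859-1", "utf\xe9-8"]


def bench(number=10_000, repeat=5):
    """Return the best per-call time in nanoseconds for each sample name."""
    results = {}
    for name in SAMPLES:
        # Bind `name` as a default argument so each timer keeps its own value.
        timer = timeit.Timer(lambda name=name: normalize_encoding(name))
        best = min(timer.repeat(repeat=repeat, number=number))
        results[name] = best / number * 1e9
    return results


if __name__ == "__main__":
    for name, ns in bench().items():
        print(f"{name!r:>16}: {ns:8.1f} ns per call")
```

Running the same script on a build of main and on a build of this branch would give two sets of numbers to compare.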
const char *cstr = PyUnicode_AsUTF8(encoding);
if (cstr == NULL) {
    return NULL;
}

size_t len = strlen(cstr);
Use PyUnicode_AsUTF8AndSize instead; it returns the length too, so the strlen() call is unnecessary.
    return NULL;
}

char *normalized = PyMem_Malloc(len + 1);
We can avoid copies at the end by using PyUnicodeWriter here. It'll look something like this:
PyUnicodeWriter *writer = PyUnicodeWriter_Create(len + 1); // instead of PyMem_Malloc
if (writer == NULL) {
    /* ... */
}
/* ... */
if (PyUnicodeWriter_WriteUTF8(writer, normalized, len + 1) < 0) {
    /* ... */
}
return PyUnicodeWriter_Finish(writer);
@@ -26,9 +26,10 @@

(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.

"""#"
"""
Hate to be that guy, but these fly-by formatting changes just make it harder to review.
@@ -3900,6 +3900,7 @@ def test_encodings_normalize_encoding(self):
self.assertEqual(normalize('utf_8'), 'utf_8')
self.assertEqual(normalize('utf\xE9\u20AC\U0010ffff-8'), 'utf_8')
self.assertEqual(normalize('utf 8'), 'utf_8')
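The three assertions above can be reproduced with a small pure-Python sketch. It assumes these rules: ASCII alphanumerics and dots are kept, non-ASCII alphanumerics are dropped, and any run of other characters collapses into a single underscore (never a leading one). The name `normalize_sketch` is hypothetical and this is not necessarily the exact code under review:

```python
def normalize_sketch(encoding: str) -> str:
    """Illustrative normalization of an encoding name (assumed rules)."""
    chars = []
    pending_separator = False
    for c in encoding:
        if c.isalnum() or c == ".":
            # Emit one underscore for the preceding run of separators,
            # but never at the very start of the result.
            if pending_separator and chars:
                chars.append("_")
            # Non-ASCII alphanumerics are recognized but not copied.
            if c.isascii():
                chars.append(c)
            pending_separator = False
        else:
            pending_separator = True
    return "".join(chars)


assert normalize_sketch("utf_8") == "utf_8"
assert normalize_sketch("utf\xe9\u20ac\U0010ffff-8") == "utf_8"
assert normalize_sketch("utf 8") == "utf_8"
```

A reference like this could also serve as an oracle when checking that a C implementation keeps the same behavior on arbitrary inputs.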
import codecs
from _codecs import _normalize_encoding
Put the from import after the sys import.
@@ -37,6 +38,7 @@
_import_tail = ['*']
_aliases = aliases.aliases

If you want to slip in PEP 8 fixes, do it everywhere. In this case, I don't think we need to.
(Don't do it everywhere, that creates conflicts. If we're going to reformat, do it in a separate PR and backport it to 3.13 and 3.14.)
(Yes, also; so in this case, just remove this change)