time.strftime() and Unicode characters on Windows #52551
There is inconsistent behavior in time.strftime, comparing Python 2.6 and 3.1. In 3.1, non-ASCII Unicode characters seem to get dropped, whereas in 2.6 you can keep them using the necessary Unicode-to-UTF-8 workaround. This should be fixed if it isn't intended behavior.

Python 2.6
>>> time.strftime(u"%d\u200F%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03\u200fSaturday'
>>> time.strftime(u"%d\u0041%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03ASaturday'

Python 3.1
>>> time.strftime("%d\u200F%A", time.gmtime())
''
>>> len(time.strftime("%d\u200F%A", time.gmtime()))
0
>>> time.strftime("%d\u0041%A", time.gmtime())
'03ASaturday' |
This seems to be fixed now, on both 3.1 and 3.2. |
Actually the bug seems related to Windows. |
Just installed Python 3.1.2, same problem. I'm using Windows XP SP2 with two Python installations (2.6.4 and now 3.1.2). |
Definitely a Windows problem. I did this on Visual Studio 2008:

size_t ret = wcsftime(out, 1000, L"%d%A", timeStruct);
wprintf(L"ret = %d, out = (%s)\n", ret, out);
ret = wcsftime(out, 1000, L"%d\u200f%A", timeStruct);
wprintf(L"ret = %d, out = (%s)\n", ret, out);

and the output was

ret = 8, out = (04Sunday)
ret = 0, out = ()

Python really shouldn't use any so-called standard functions on Windows. They never work as expected ^^... |
See also the issue bpo-10653: wcsftime() doesn't format correctly time zones, so Python 3 uses strftime() instead. |
Using 3.4.1 and 3.5.0 I get:

>>> time.strftime("%d\u200F%A", time.gmtime())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\u200f' in position 2: Illegal byte sequence |
I verified Mark's 3.4.1 result with Idle. It strikes me as a bug that a function that maps a unicode format string to a unicode string with interpolations added should ever encode the format to bytes, let alone using an encoding that fails or loses information. It is especially weird given that % formatting does not even work (at present) for bytes. It seems to me that strftime should never encode the non-special parts of the format text. Instead, it could split the format (re.split) into a list of alternating '%x' pairs and running text segments, replace the '%x' entries with the proper entries, and return the list joined back into a string. Some replacements would be locale dependent, others not. (Just wondering, are the locale names of days and months bytes restricted to ASCII, or unrestricted unicode using native characters?) |
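The splitting idea in the comment above could be sketched roughly as follows. This is only an illustration, not how CPython implements strftime: the name `strftime_split` is hypothetical, and the sketch assumes every directive is exactly two characters ('%' plus one character), so platform-specific flags or modifiers such as `%-d` or `%Ec` are not handled.

```python
import re
import time

# Assumption: a directive is '%' followed by exactly one character.
_DIRECTIVE = re.compile(r"(%.)", re.DOTALL)


def strftime_split(fmt, t=None):
    """Format `t` without ever passing the literal text to the C library."""
    if t is None:
        t = time.localtime()
    # With one capturing group, re.split returns literal text at even
    # indexes and the captured directives at odd indexes.
    parts = _DIRECTIVE.split(fmt)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            out.append(part)  # running text, kept as-is
        elif part == "%%":
            out.append("%")  # escaped percent sign
        else:
            out.append(time.strftime(part, t))  # ASCII-only directive
    return "".join(out)
```

Because only the two-character directives reach `time.strftime`, the literal segments never go through the locale encoding, so a character such as U+200F passes through unchanged.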
@alexander what is your take on this please? I can confirm that it is still a problem on Windows in 3.5.0. |
Mark, I am no expert on Windows. I believe Victor is most knowledgeable in this area. |
The problem is definitely that: Windows is using strftime, not wcsftime. It's not using wcsftime because of bpo-10653. If I force Windows to use wcsftime, this particular example works:

>>> time.strftime("%d\u200F%A", time.gmtime())
'25\u200fFriday'

I haven't looked at bpo-10653 enough to understand if it's still a problem with the new Visual C++. Maybe it is: I only tested with my default US locale. |
I've implemented a workaround for Sphinx:

>>> time.strftime(u'%Y 年'.encode('unicode-escape').decode(), *args).encode().decode('unicode-escape')
2015 年

https://github.com/sphinx-doc/sphinx/blob/8ae43b9fd/sphinx/util/osutil.py#L175 |
The problem from bpo-10653 is that internally the CRT encodes the time zone name using the ANSI codepage (i.e. the default system codepage). wcsftime decodes this string using mbstowcs (i.e. multibyte string to wide-character string), which uses Latin-1 in the C locale. In other words, in the C locale on Windows, mbstowcs just casts the byte values to wchar_t. With the new Universal CRT, strftime is implemented by calling wcsftime, so the accepted solution for bpo-10653 is broken in 3.5+. A simple way around the problem is to switch back to using wcsftime and temporarily (or permanently) set the thread's LC_CTYPE locale to the system default. This makes the internal mbstowcs call use the ANSI codepage. Note that on POSIX platforms 3.x already sets the default via setlocale(LC_CTYPE, "") in Python/pylifecycle.c. Why not set this for all platforms that have setlocale?
If your system locale uses codepage 1252 (a superset of Latin-1), then you can still test this on a per thread basis if your system has additional language packs. For example:

import ctypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
if kernel32.GetModuleHandleW('ucrtbased'): # debug build
    crt = ctypes.CDLL('ucrtbased', use_errno=True)
else:
    crt = ctypes.CDLL('ucrtbase', use_errno=True)
MUI_LANGUAGE_NAME = 8
LC_CTYPE = 2
class tm(ctypes.Structure):
    pass
crt._gmtime64.restype = ctypes.POINTER(tm)
# set a Russian locale for the current thread
kernel32.SetThreadPreferredUILanguages(MUI_LANGUAGE_NAME,
                                       'ru-RU\0', None)
crt._wsetlocale(LC_CTYPE, 'ru-RU')
# update the time zone name based on the thread locale
crt._tzset()
# get a struct tm *
ltime = ctypes.c_int64()
crt._time64(ctypes.byref(ltime))
tmptr = crt._gmtime64(ctypes.byref(ltime))
# call wcsftime using C and Russian locales
buf = (ctypes.c_wchar * 100)()
crt._wsetlocale(LC_CTYPE, 'C')
size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
tz1 = buf[:size]
crt._wsetlocale(LC_CTYPE, 'ru-RU')
size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
tz2 = buf[:size]
hcon = kernel32.GetStdHandle(-11)
pn = ctypes.pointer(ctypes.c_uint())

>>> _ = kernel32.WriteConsoleW(hcon, tz1, len(tz1), pn, None)
Âðåìÿ â ôîðìàòå UTC
>>> _ = kernel32.WriteConsoleW(hcon, tz2, len(tz2), pn, None)
Время в формате UTC

The first result demonstrates the ANSI => Latin-1 mojibake problem in the C locale. You can encode this result as Latin-1 and then decode it back as codepage 1251:

>>> tz1.encode('latin-1').decode('1251') == tz2
True

But transcoding isn't a general workaround since the format string shouldn't be restricted to ANSI, unless you can smuggle the Unicode through like Takayuki showed. |
Update since msg255133: Python 3.8+ now calls setlocale(LC_CTYPE, "") at startup in Windows, as 3.x has always done in POSIX. So decoding the output of C strftime("%Z") with PyUnicode_DecodeLocaleAndSize() 'works' again, since both default to the process code page. The latter is usually the system code page, unless overridden to UTF-8 in the application manifest.

But calling C strftime() as a workaround is still a fragile solution, since it requires that the process code page is able to encode the process or thread UI language. In general, the system code page, the current user locale, and current user preferred language are independent settings in Windows. Calling C strftime() also unnecessarily limits the format string to characters in the current LC_CTYPE locale encoding, which requires hacky workarounds.

Starting with Windows 10 v2004 (build 19041), ucrt uses an internal wide-character version of the time-zone name that gets returned by an internal __wide_tzname() call and used for "%Z" in wcsftime(). The wide-character value gets updated by _tzset() and kept in sync with _tzname. If Python switched to using wcsftime() in Windows 10 2004+, then the current locale encoding would no longer be a problem for any UI language.

Also, bpo-36779 switched to setting time.tzname by directly calling WinAPI GetTimeZoneInformation(). time.tzset() should be implemented in order to keep the value of time.tzname in sync with time.strftime("%Z"). |
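To see whether a given format string is affected by the locale-encoding limit described above, one can try encoding it up front. This is a rough sketch: the helper name `format_is_locale_encodable` is hypothetical, and it only approximates what the interpreter does internally, since the actual encoding step in CPython may use a different error handler.

```python
import locale


def format_is_locale_encodable(fmt, encoding=None):
    """Return True if `fmt` can be encoded in the given (or locale) encoding."""
    if encoding is None:
        # Assumption: the LC_CTYPE encoding is what limits the format string.
        encoding = locale.getpreferredencoding(False)
    try:
        fmt.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

With code page 1252, for instance, `format_is_locale_encodable("%d\u200f%A", "cp1252")` is false because U+200F has no mapping there, while the same format is encodable as UTF-8.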
I'm not sure of what you mean. The function is implemented:

static PyObject *
time_tzset(PyObject *self, PyObject *unused)
{
    PyObject* m;

    m = PyImport_ImportModuleNoBlock("time");
    if (m == NULL) {
        return NULL;
    }

    tzset();

    /* Reset timezone, altzone, daylight and tzname */
    if (init_timezone(m) < 0) {
        return NULL;
    }
    Py_DECREF(m);
    if (PyErr_Occurred())
        return NULL;

    Py_RETURN_NONE;
} |
My comment was limited to Windows, for which time.tzset() has never been implemented. Since Python has its own implementation of time.tzname in Windows, it should also implement time.tzset() to allow refreshing the value. Actually, ucrt implements C _tzset(), so the implementation of time.tzset() in Windows also has to call C _tzset() to update _tzname (and also ucrt's new private __wide_tzname), in addition to calling GetTimeZoneInformation() to update its own time.tzname value. Another difference with Python's time.tzname and C strftime("%Z") is that ucrt will use the TZ environment variable, but Python's implementation of time.tzname in Windows does not. |
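Since time.tzset() may be missing, as described above for Windows, portable code has to guard the call. A minimal sketch, with the hypothetical helper name `refresh_timezone`:

```python
import time


def refresh_timezone():
    """Call time.tzset() where it exists; report whether it was called."""
    tzset = getattr(time, "tzset", None)
    if tzset is None:
        # No tzset() on this platform (the Windows case discussed above),
        # so time.tzname cannot be refreshed from Python.
        return False
    tzset()
    return True


print(refresh_timezone(), time.tzname)
```

On platforms where the function returns False, time.tzname keeps whatever value it had when the time module was initialized.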
Yet a couple bugs. On platforms without >>> print(ascii(time.strftime('\udcf0\udc9f\udc90\udc8d')))
'\U0001f40d' The result depends on the locale encoding. The above was for UTF-8. I expect the similar result for |
Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time.
* Characters not encodable in the current locale are now acceptable in the format string.
* Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated.
* Embedded null character no longer terminates the format string.
This fixes also gh-78662 and gh-124531.
…5193) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531. (cherry picked from commit ad3eac1) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Python's |
…5193) (pythonGH-125657) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531. (cherry picked from commit 08ccbb9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> (cherry picked from commit ad3eac1)