https://github.com/numpy/numpy/issues/23442

BUG: `vectorize` truncates string outputs to 1 character, even with explicitly-specified `otypes` · Issue #23442 · numpy/numpy · GitHub

BUG: vectorize truncates string outputs to 1 character, even with explicitly-specified otypes #23442

Open
cgobat opened this issue Mar 23, 2023 · 4 comments

cgobat commented Mar 23, 2023

Describe the issue:

When a user creates a vectorize object from a function that returns a string, and specifies the function's output type(s) using the otypes argument, string typecode specifiers of any length (e.g. "U10" for a 10-character string) cause the returned strings to be truncated to 1 character (i.e. np.dtype("<U1")). The same thing happens with bytes (typecode "S", with any length specified). To make it work, one must either omit the otypes argument entirely or use "O" to get a generic object dtype.

It seems this issue is possibly related to #2485 and/or StackOverflow: How to explicitly specify the output's string length in numpy.vectorize, but I can't say for sure. It seems odd that otypes ignores explicit length declarations.

Reproduce the code example:

import numpy as np

def make_10char_str(n: int) -> str:
    """Returns a string version of the input integer, with spaces to the
       right to left-justify it and pad the string out to 10 characters"""
    return f"{n:<10d}"

vector_str_func = np.vectorize(make_10char_str, signature="()->()", otypes=["<U10"]) # "<U10" should correspond to a 10-character unicode str

print(vector_str_func([1, 24, 365, 4096])) # expected output is array(['1         ', '24        ', '365       ', '4096      '], dtype='<U10')

Output:

array(['1', '2', '3', '4'], dtype='<U1')

Runtime information:

np.__version__ is 1.24.2. sys.version is

3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55) 
[GCC 11.3.0]

Output of np.show_runtime() is:

[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'AVX2'],
                      'not_found': ['F16C',
                                    'FMA3',
                                    'AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL']}},
 {'architecture': 'Haswell',
  'filepath': '/home/cgobat/miniconda3/envs/.../lib/libopenblasp-r0.3.21.so',
  'internal_api': 'openblas',
  'num_threads': 4,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.21'}]

Context for the issue:

This issue matters because users who want to specify their function's otypes explicitly are forced to use "O". Downstream operations that expect a string dtype (rather than an object dtype) may then be unable to handle the output without additional processing.
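
The "O" route mentioned in this report can be sketched as follows: request an object-dtype output and cast it to a fixed-width string dtype afterwards. This is only an illustration of the workaround, not a fix; the final cast to "U10" is an assumption about the width the caller wants.

```python
import numpy as np

def make_10char_str(n: int) -> str:
    """Left-justify the integer and pad it with spaces to 10 characters."""
    return f"{n:<10d}"

# Ask vectorize for object outputs so no string length is imposed on the results.
vector_str_func = np.vectorize(make_10char_str, signature="()->()", otypes=[object])

as_objects = vector_str_func([1, 24, 365, 4096])  # dtype=object, full strings kept
as_strings = as_objects.astype("U10")             # explicit cast to the desired width
print(repr(as_strings))
```

The extra astype call is the "additional processing" that the object-dtype route requires.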

WarrenWeckesser (Member) commented Mar 24, 2023

Thanks for reporting the issue, @cgobat.

Edit: I removed my previous misguided comment.

The sizes of any string types specified in otypes are always ignored by vectorize. The otypes argument is converted to a sequence of single-character type codes. For example,

In [15]: def foo(a, b):
    ...:     pass
    ...: 

In [16]: vfoo = np.vectorize(foo, otypes=['<U16', '<U32'])

In [17]: vfoo.otypes
Out[17]: 'UU'

The lengths 16 and 32 have been discarded, and only the type code U is saved.
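
To see why the length cannot survive such a conversion, here is a small sketch using only np.dtype attributes. It assumes nothing about the internals of vectorize beyond what is described above: a dtype's single-character code carries no size information.

```python
import numpy as np

sized = np.dtype("<U16")
print(sized.char)        # the single-character type code, 'U'
print(sized.itemsize)    # 64 bytes: 16 characters at 4 bytes each

# Rebuilding a dtype from the type code alone yields an unsized string dtype.
unsized = np.dtype(sized.char)
print(unsized.itemsize)  # 0: the length is gone
```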

The actual output size of the strings will depend on the code path taken internally. If signature is not specified, the length of the output string type will be the maximum of the lengths of the computed values, e.g.

In [35]: vstr = np.vectorize(str, otypes=['U32'])  # The '32' is ignored.

In [36]: vstr.otypes
Out[36]: 'U'

In [37]: vstr(123)
Out[37]: array('123', dtype='<U3')

In [38]: vstr([[123, 99999],[-1, 0]])
Out[38]: 
array([['123', '99999'],
       ['-1', '0']], dtype='<U5')
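
A rough sketch of how this behavior can arise, assuming (this is not verified against the source here) that the no-signature path collects results as Python objects and then casts them to the unsized type code. Casting an object array to an unsized "U" lets NumPy pick the length of the longest element:

```python
import numpy as np

# Collect the results as Python objects first (assumed to resemble the internal path).
str_ufunc = np.frompyfunc(str, 1, 1)
as_objects = str_ufunc([[123, 99999], [-1, 0]])

# Casting to an unsized 'U' discovers the maximum string length in the data.
as_strings = np.asanyarray(as_objects, dtype="U")
print(repr(as_strings))
```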

If signature is given, a different code path is followed internally, and the output length is always 1:

In [40]: vstr = np.vectorize(str, signature='()->()', otypes=['U32'])

In [41]: vstr.otypes
Out[41]: 'U'

In [42]: vstr(123)
Out[42]: array('1', dtype='<U1')

In [43]: vstr([[123, 99999],[-1, 0]])
Out[43]: 
array([['1', '9'],
       ['-', '0']], dtype='<U1')

cgobat (Author) commented Mar 24, 2023

Thanks for looking into this, @WarrenWeckesser. Any ideas on how to proceed towards a resolution? Is a documentation update called for in the meantime?

lsaavedr commented

Is there any news on this?

ganesh-k13 (Member) commented May 13, 2025

The issue arises when signature is given, because this particular code path is taken:

https://github.com/numpy/numpy/blob/c458e69d8794d4d25549761e12b40bfeafa1a4e3/numpy/lib/_function_base_impl.py#L2257C24-L2259

Along this path, np.empty_like creates the result array with dtype <U1:

> /Users/gakathir/Documents/os/numpy/build-install/usr/lib/python3.13/site-packages/numpy/lib/_function_base_impl.py(2261)_create_arrays()
-> return arrays
(Pdb) p arrays
(array('', dtype='<U1'),)
(Pdb) for a, b, c in zip(results, shapes, dtypes): print(a,b,c)
123 () U
(Pdb) p np.empty_like(a, shape=b, dtype=c)
array('', dtype='<U1')
(Pdb)

[EDIT]

Combined with the change in #26270, the size of U is dropped, which seems correct. Hence, we need to fix the _create_arrays function to handle this.
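
The mechanism can be reproduced outside of vectorize with a minimal sketch. It mirrors the np.empty_like call from the debugger session and then assigns a longer string, which NumPy silently truncates to fit the allocated width. The variable names are illustrative only.

```python
import numpy as np

result = "123"

# An unsized 'U' dtype allocates room for a single character.
out = np.empty_like(result, shape=(), dtype="U")
print(out.dtype)   # <U1

# Assigning a longer string keeps only what fits in the allocated width.
out[...] = result
print(repr(out))
```

Any fix in _create_arrays would presumably need to size the output from the computed results rather than from the bare type code, though how to do that when later results are longer than the first is left open here.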
