Unexpected UnicodeError instead of UnicodeDecodeError within codec.readline() only for 'utf-16' encoding #112812

Closed as duplicate of #85287
@agowa

Description

Bug report

Bug description:

import chardet
import codecs

def detect_encoding(file_path):
  with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())
  return result['encoding']

def check_for_invalid_characters(file_path, encoding):
  try:
    with codecs.open(file_path, 'r', encoding=encoding, errors='strict') as f:
      f.readline()
      f.readline()
      third_line = f.readline()
      print(f"The 3rd line of the file: {third_line.strip()}")
      f.seek(0)  # Reset the file pointer to the beginning
      f.read()   # Check for invalid characters
    print(f"The file {file_path} is encoded with {encoding} and does not contain invalid characters.")
  except UnicodeDecodeError as e:
    print(f"The file {file_path} has invalid characters when decoded with {encoding}.")


def main(file_path):
  detected_encoding = detect_encoding(file_path)
  print(f"Detected encoding: {detected_encoding}")
  check_for_invalid_characters(file_path, detected_encoding)
  encodings_to_try = ['utf-8', 'latin-1', 'utf-16', 'iso-8859-1', 'iso-8859-15', 'iso-8859-7']
  for encoding in encodings_to_try:
    check_for_invalid_characters(file_path, encoding)

main('/tmp/booking.ics')  # path of the attached test file, as shown in the output below

In the code above, when the file cannot be decoded as utf-16 because it contains an invalid byte sequence, the exception raised differs from the one raised for the other encodings.

When 'utf-16' is specified as the encoding and decoding fails, a plain UnicodeError is raised (from encodings/utf_16.py, reached through codecs.py, as the traceback shows) instead of the expected UnicodeDecodeError. Since UnicodeDecodeError is a subclass of UnicodeError, the except UnicodeDecodeError clause does not catch it. This second UnicodeError appears only for 'utf-16', not for 'utf-16-le' or 'utf-16-be'.
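The difference between the codecs can be checked without the attached file. The following is a minimal sketch, not part of the original report: `exception_raised` and `SAMPLE` are hypothetical names, and the sample bytes (BOM-less UTF-16 text with a lone high surrogate) are an assumption standing in for the real test file. The class reported for 'utf-16' may depend on the Python version and platform byte order, so no output is claimed here.

```python
import codecs
import io


def exception_raised(data, encoding):
    """Hypothetical helper: return the exception class raised while
    reading `data` strictly with `encoding`, or None if it decodes."""
    reader = codecs.getreader(encoding)(io.BytesIO(data), errors="strict")
    try:
        reader.readline()
        reader.read()
    except UnicodeError as exc:  # base class, so every codec error lands here
        return type(exc)
    return None


# Assumed sample: "a\n" in UTF-16-LE without a BOM, then a lone high
# surrogate (0xD800) followed by 0x0000, which is not a valid pair.
SAMPLE = b"a\x00\n\x00\x00\xd8\x00\x00"

for name in ("utf-16", "utf-16-le"):
    exc_type = exception_raised(SAMPLE, name)
    print(name, "->", exc_type.__name__ if exc_type else "no error")
```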

The console output for the above code is:

Detected encoding: ISO-8859-7
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:φίδ
The file /tmp/booking.ics is encoded with ISO-8859-7 and does not contain invalid characters.
The file /tmp/booking.ics has invalid characters when decoded with utf-8.
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:ößä
The file /tmp/booking.ics is encoded with latin-1 and does not contain invalid characters.
Traceback (most recent call last):
  File "<frozen codecs>", line 507, in read
  File "/usr/lib/python3.11/encodings/utf_16.py", line 135, in decode
    codecs.utf_16_ex_decode(input, errors, 0, False)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 58-59: illegal encoding

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in main
  File "<stdin>", line 4, in check_for_invalid_characters
  File "<frozen codecs>", line 711, in readline
  File "<frozen codecs>", line 561, in readline
  File "<frozen codecs>", line 511, in read
  File "/usr/lib/python3.11/encodings/utf_16.py", line 141, in decode
    raise UnicodeError("UTF-16 stream does not start with BOM")
UnicodeError: UTF-16 stream does not start with BOM
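
As the traceback shows, the second exception is a plain UnicodeError, which an except UnicodeDecodeError clause lets through. One possible workaround, sketched here as an assumption rather than as the fix for this issue, is to catch the base class UnicodeError, which also covers UnicodeDecodeError. The helper name `decodes_cleanly` is hypothetical, and it works on in-memory bytes instead of a file path to keep the example self-contained.

```python
import codecs
import io


def decodes_cleanly(data, encoding):
    """Hypothetical helper: True if `data` can be read strictly with
    `encoding` through a codecs StreamReader, False otherwise."""
    reader = codecs.getreader(encoding)(io.BytesIO(data), errors="strict")
    try:
        reader.readline()  # same call that fails in the traceback
        reader.read()      # decode the remainder of the stream
    except UnicodeError:
        # UnicodeDecodeError is a subclass of UnicodeError, so this clause
        # handles both the codec error and the "does not start with BOM" one.
        return False
    return True
```

With this shape, the loop over encodings_to_try in the report would continue past 'utf-16' instead of stopping with an uncaught exception.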

My test file:
booking.zip

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Labels: stdlib, topic-unicode, type-bug