FilesPipeline: Optionally guess media extension from response headers, if provided. #1199

Open
ldgarcia opened this issue Apr 30, 2015 · 1 comment

@ldgarcia

I needed to download some auto-generated datasheets (in PDF format), and I decided to use the FilesPipeline provided with Scrapy.

However, my URLs didn't match the assumption made by the file_path method of the FilesPipeline class, namely that each URL ends in a well-structured filename (e.g. example.pdf).

Here is a snippet of my spider:

item['file_urls'] = []
item['file_urls'].append("https://www.domain.name/path/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22{}%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E".format(item['code']))

This is the code that extracts the media extension from the URL (in scrapy.contrib.pipeline.files.FilesPipeline):

media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation

Now, the value of media_ext for my URL is:

.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22CODE%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E

This is not only an incorrect extension, but in this particular case it also results in an IOError (filename too long) when the file is saved.
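
For illustration, running os.path.splitext over a shortened, hypothetical URL of the same shape shows where that "extension" comes from:

import os.path

# Hypothetical, shortened URL of the same shape as the datasheet URL above.
url = "https://www.domain.name/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22"

# splitext() splits on the last dot after the last path separator, so
# everything after the "1.0" in the query string becomes the "extension".
print(os.path.splitext(url)[1])  # -> .0%22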

One quick way to fix this is to append something like &type=.pdf to the URL while constructing the file_urls list, so that os.path.splitext picks up the trailing .pdf:

item['file_urls'] = []
item['file_urls'].append("https://www.domain.name/path/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22{}%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E&type=.pdf".format(item['code']))

However, I think it would be helpful to introduce some code that (optionally) guesses the extension from the response headers:

  • First, it checks the Content-Type header and uses the mimetypes module to guess the extension.
  • Second, if the previous step fails, it checks the Content-Disposition header.
  • Third, if both previous steps fail, it falls back to the standard implementation (splitting the extension off the URL).

This would be useful when the crawler gets a list of URLs by some other means (e.g. via an XPath query).
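
A minimal sketch of such a fallback chain, assuming a Scrapy Response (the function name is made up; Content-Disposition parsing here uses the stdlib email module rather than a regex):

import mimetypes
import os
from email.message import Message

def guess_media_ext(response):
    """Best-effort extension guess for a downloaded response (sketch only)."""
    # 1. Try the Content-Type header via the mimetypes module.
    content_type = response.headers.get(b"Content-Type")
    if content_type:
        mime = content_type.decode("latin-1").split(";")[0].strip()
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext
    # 2. Try the filename parameter of the Content-Disposition header.
    content_disp = response.headers.get(b"Content-Disposition")
    if content_disp:
        msg = Message()
        msg["Content-Disposition"] = content_disp.decode("latin-1")
        filename = msg.get_filename()
        if filename:
            ext = os.path.splitext(filename)[1]
            if ext:
                return ext
    # 3. Fall back to the standard behaviour: split the extension off the URL.
    return os.path.splitext(response.url)[1]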

Here is my approach, and I'd be happy to open a PR if you like the idea and the implementation.


mesmere commented Mar 9, 2024

Honestly, if the Content-Disposition header is set at all, it should override whatever default behavior FilesPipeline has going on. Or maybe it could be a configurable option.

My approach just uses the Content-Disposition header if it's available. It also uses the officially endorsed way to parse this header instead of @ldgarcia's regex approach. It's a weird format specified in an ancient RFC; from what I can tell it used to be parsed by the cgi module, but that module is deprecated and the parsing logic has been moved to the email module.
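
For reference, a minimal sketch of that email-module parsing, with a made-up header value:

from email.message import Message

# Parse a Content-Disposition header with the stdlib email machinery,
# the documented replacement for the deprecated cgi.parse_header().
msg = Message()
msg["Content-Disposition"] = 'attachment; filename="datasheet.pdf"'  # example value
print(msg.get_content_disposition())  # -> attachment
print(msg.get_filename())             # -> datasheet.pdf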
