FilesPipeline: Optionally guess media extension from response headers, if provided. #1199

Open
ldgarcia opened this issue Apr 30, 2015 · 1 comment

@ldgarcia

I needed to download some auto-generated datasheets (in PDF format), and I decided to use the FilesPipeline provided with Scrapy.

However, my URLs didn't match the assumption made by the file_path method of the FilesPipeline class, namely that each URL ends in a well-structured filename (e.g. example.pdf).

Here is a snippet of my spider:

item['file_urls'] = []
item['file_urls'].append("https://www.domain.name/path/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22{}%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E".format(item['code']))

This is the code that extracts the media extension from the URL (in scrapy.contrib.pipeline.files.FilesPipeline):

media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation

Now, the value of media_ext for my URL is:

.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22CODE%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E

This is not only an incorrect extension, but in this particular case it also results in an IOError (filename too long) when the file is saved.
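
For illustration, running os.path.splitext over a shortened, hypothetical URL of the same shape shows where that "extension" comes from:

import os.path

# Hypothetical, shortened URL of the same shape as the datasheet URL above.
url = "https://www.domain.name/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22"

# splitext() splits on the last dot after the last path separator, so
# everything after the "1.0" in the query string becomes the "extension".
print(os.path.splitext(url)[1])  # -> .0%22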

One quick way to fix this is to append something like &type=.pdf to the URL while constructing the file_urls list, so that os.path.splitext picks up the trailing .pdf:

item['file_urls'] = []
item['file_urls'].append("https://www.domain.name/path/path/DatasheetService?control=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpdf_generator_control%3E%3Cmode%3EPDF%3C%2Fmode%3E%3Cpdmsystem%3EPMD%3C%2Fpdmsystem%3E%3Ctemplate_selection+mlfb%3D%22{}%22+system%3D%22PRODIS%22%2F%3E%3Clanguage%3Ees%3C%2Flanguage%3E%3Ccaller%3EMall%3C%2Fcaller%3E%3C%2Fpdf_generator_control%3E&type=.pdf".format(item['code']))

However, I think it would be helpful to introduce some code that (optionally) guesses the extension from the response headers:

  • First, it checks the Content-Type header and uses the mimetypes module to guess the extension.
  • Second, if the previous step fails, it checks the Content-Disposition header.
  • Third, if both previous steps fail, it falls back to the standard implementation (splitting the extension off the URL).

This would be useful when the crawler gets a list of URLs by some other means (e.g. via an XPath query).
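
A minimal sketch of such a fallback chain, assuming a Scrapy Response (the function name is made up; Content-Disposition parsing here uses the stdlib email module rather than a regex):

import mimetypes
import os
from email.message import Message

def guess_media_ext(response):
    """Best-effort extension guess for a downloaded response (sketch only)."""
    # 1. Try the Content-Type header via the mimetypes module.
    content_type = response.headers.get(b"Content-Type")
    if content_type:
        mime = content_type.decode("latin-1").split(";")[0].strip()
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext
    # 2. Try the filename parameter of the Content-Disposition header.
    content_disp = response.headers.get(b"Content-Disposition")
    if content_disp:
        msg = Message()
        msg["Content-Disposition"] = content_disp.decode("latin-1")
        filename = msg.get_filename()
        if filename:
            ext = os.path.splitext(filename)[1]
            if ext:
                return ext
    # 3. Fall back to the standard behaviour: split the extension off the URL.
    return os.path.splitext(response.url)[1]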

Here is my approach, and I'd be happy to open a PR if you like the idea and the implementation.


mesmere commented Mar 9, 2024

Honestly, if the Content-Disposition header is set at all, it should override whatever default behavior FilesPipeline has going on. Or maybe it could be a configurable option.

My approach just uses the Content-Disposition header if it's available. It also uses the officially endorsed way to parse this header instead of @ldgarcia's regex approach. It's a weird format specified in an ancient RFC; from what I can tell it used to be parsed by the cgi module, but that module is deprecated and the parsing logic has been moved to the email module.
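
For reference, a minimal sketch of that email-module parsing, with a made-up header value:

from email.message import Message

# Parse a Content-Disposition header with the stdlib email machinery,
# the documented replacement for the deprecated cgi.parse_header().
msg = Message()
msg["Content-Disposition"] = 'attachment; filename="datasheet.pdf"'  # example value
print(msg.get_content_disposition())  # -> attachment
print(msg.get_filename())             # -> datasheet.pdf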
