FilesPipeline: Optionally guess media extension from response headers, if provided. #1199
I needed to download some autogenerated datasheets (in PDF format), and I decided to go with the `FilesPipeline` provided within Scrapy. However, the URL structure didn't follow the assumption of the `file_path` method in the `FilesPipeline` class (a well-structured filename, e.g. `example.pdf`).

Here is a snippet of my spider:
This is the code that extracts the media extension from the URL (in `scrapy.contrib.pipeline.files.FilesPipeline`):
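The extraction boils down to a `splitext` call on the request URL (paraphrased here, not the verbatim pipeline source):

```python
import os


def media_ext(url):
    # FilesPipeline.file_path derives the extension straight from the URL.
    return os.path.splitext(url)[1]


# A well-structured URL yields a sane extension:
media_ext("http://example.com/files/example.pdf")
# -> ".pdf"

# An autogenerated URL with a query string does not; everything after the
# last "." survives as the "extension":
media_ext("http://example.com/gen.php?part=ABC123&rev=2")
# -> ".php?part=ABC123&rev=2"
```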
Now, the value of `media_ext` for my URL is not only an incorrect extension type, but it also results (in this particular case) in an `IOError` (filename too long).
One quick way to fix this is to append something like `&type=.pdf` to the URL while constructing the `file_urls` list. However, I think it would be helpful to introduce some code that (optionally) guesses the extension from the response headers:
- the `Content-Type` header, using the `mimetypes` module to guess the extension;
- the `Content-Disposition` header.

This would be useful in the case that the crawler gets its list of URLs by some other means (e.g. an XPath query).
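For completeness, the quick `&type=.pdf` workaround mentioned earlier amounts to tacking a dummy query parameter onto each URL (the helper name is mine, not from the original report):

```python
import os


def with_fake_extension(url, ext=".pdf"):
    # Append a dummy query parameter so that os.path.splitext() picks up a
    # sane extension. Crude, but it keeps FilesPipeline's file_path happy.
    sep = "&" if "?" in url else "?"
    return url + sep + "type=" + ext


url = with_fake_extension("http://example.com/gen.php?part=ABC123")
# -> "http://example.com/gen.php?part=ABC123&type=.pdf"
os.path.splitext(url)[1]
# -> ".pdf"
```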
Here is my approach, and I'd be happy to open a PR if you like the idea and the implementation.
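The actual patch was not preserved here; a sketch of the idea (function name, header handling, and fallback order are assumptions, not the proposed implementation) could be:

```python
import mimetypes
import os


def guess_media_ext(request_url, response_headers):
    """Guess a file extension, preferring response headers over the URL.

    `response_headers` is a plain dict of header name -> string value,
    a simplification of Scrapy's Headers object.
    """
    # 1. Content-Disposition often carries the intended filename.
    disposition = response_headers.get("Content-Disposition", "")
    if "filename=" in disposition:
        filename = disposition.split("filename=")[-1].strip('" ')
        ext = os.path.splitext(filename)[1]
        if ext:
            return ext
    # 2. Otherwise, map the Content-Type to an extension via mimetypes.
    content_type = response_headers.get("Content-Type", "").split(";")[0].strip()
    ext = mimetypes.guess_extension(content_type) if content_type else None
    if ext:
        return ext
    # 3. Fall back to the old behaviour: take the extension from the URL.
    return os.path.splitext(request_url)[1]
```

This keeps the current URL-based behaviour as a fallback, so existing spiders would be unaffected unless a header provides better information.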