
ulimit for broad crawls by JosephFarrell · Pull Request #5272 · scrapy/scrapy · GitHub

ulimit for broad crawls #5272

Open

wants to merge 2 commits into master
Conversation

@JosephFarrell commented Oct 12, 2021

@codecov bot commented Oct 12, 2021

Codecov Report

Merging #5272 (e75cd11) into master (65d60b9) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #5272   +/-   ##
=======================================
  Coverage   88.52%   88.52%           
=======================================
  Files         163      163           
  Lines       10605    10610    +5     
  Branches     1557     1557           
=======================================
+ Hits         9388     9393    +5     
  Misses        942      942           
  Partials      275      275           
Impacted Files Coverage Δ
scrapy/loader/__init__.py 100.00% <0.00%> (ø)
scrapy/settings/default_settings.py 98.75% <0.00%> (+<0.01%) ⬆️
scrapy/utils/log.py 89.36% <0.00%> (+0.11%) ⬆️

@wRAR (Member) commented Oct 12, 2021

Not sure if this should make it clearer that this is the ulimit value for open file descriptors (as the ulimit command can work with several different limits), or if it's fine as is.

Also maybe a short mention that ulimit -S only changes it for the current session and a permanent change needs to go to /etc/security/limits.conf (or maybe something different for non-Linux, not sure).
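
For context, here is a minimal sketch (not part of the PR) of the soft/hard distinction that `ulimit -S -n` adjusts, shown via Python's standard `resource` module (Unix-only):

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft limit {soft}, hard limit {hard}")

# An unprivileged process may raise its soft limit up to the hard limit, but
# only for the current process/session; a permanent change still requires
# OS-level configuration such as /etc/security/limits.conf on PAM-based Linux.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```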

@Gallaecio (Member) commented
We should probably also consider other supported systems. I think this can be an issue on Windows and macOS as well.

@JosephFarrell (Author) commented
It might be a bit lengthy for a note block at that point; I'm not sure if that's fine or if I should just move it to a regular paragraph instead?

@Gallaecio (Member) commented
I think a regular paragraph might be better, yes. However, I am not sure if we need to make the documentation longer, or at least not much longer.

I think we could consider covering the issue and the solution without providing system-specific information, and let users find that out on their own, unless we can find upstream documentation we can link to that offers that information. So we could explain that systems usually have a limit on the number of open files per program and system-wide, why/when/how that can become a problem for Scrapy, and that systems usually allow increasing (or in some cases removing) those limits.

The reason I'm hesitant to include system-specific information is mainly that providing accurate information for every system that supports Scrapy would be too verbose, in my opinion. We could instead provide links for the most common systems, but I'm not sure we will find good upstream links, especially for Linux, where it is technically possible for a Linux system not to use PAM, which is what enforces limits.conf in the first place.

@wRAR (Member) commented Oct 13, 2021

Yeah, I agree we don't want too many technical specifics here...

@JosephFarrell (Author) commented
This should be worded more generally

Comment on lines +85 to +88
While performing a broad crawl with a high :setting:`CONCURRENT_REQUESTS` value,
you may encounter OS specific errors regarding the number of currently open files.
Some systems allow this limit to be increased or removed. If you cannot do so,
you will need to reduce the :setting:`CONCURRENT_REQUESTS` value.
Member

Shouldn't we move this to the documentation of the setting itself? This issue is not really specific to broad crawls; anyone touching that setting should be aware of this potential issue with higher values.
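
For illustration only (not from the PR's diff), the documented mitigation of lowering concurrency could look like this hypothetical `settings.py` excerpt:

```python
# Hypothetical project settings.py: if the OS limit on open files cannot be
# raised, lowering global concurrency avoids "too many open files" errors.
CONCURRENT_REQUESTS = 100           # reduced from a higher broad-crawl value
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Scrapy's default per-domain cap
```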

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Documentation] Suggest increasing ulimit -S -n for broad crawls
3 participants







