feat: stopping the crawlers gracefully with `BasicCrawler.stop()` by barjin · Pull Request #2792 · apify/crawlee

Merged · 12 commits · Jan 20, 2025
19 changes: 18 additions & 1 deletion packages/basic-crawler/src/internals/basic-crawler.ts
@@ -294,7 +294,7 @@ export interface BasicCrawlerOptions<Context extends CrawlingContext = BasicCrawlingContext>
  /**
   * Allows to keep the crawler alive even if the {@apilink RequestQueue} gets empty.
   * By default, the `crawler.run()` will resolve once the queue is empty. With `keepAlive: true` it will keep running,
-  * waiting for more requests to come. Use `crawler.teardown()` to exit the crawler.
+  * waiting for more requests to come. Use `crawler.stop()` to exit the crawler gracefully, or `crawler.teardown()` to stop it immediately.
   */
  keepAlive?: boolean;
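
As an editorial aside (not part of the diff): a minimal sketch of the documented behavior, assuming a crawlee version that includes this PR. The URL, timeout, and handler body are illustrative.

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    keepAlive: true, // keep running even when the request queue drains
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

// With `keepAlive: true`, run() would never resolve on its own.
// stop() starves the pool, lets in-flight requests finish, and then run() resolves;
// teardown() would instead stop the crawler immediately.
setTimeout(() => crawler.stop(), 60_000);

await crawler.run(['https://example.com']);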

@@ -974,6 +974,23 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext>
        return stats;
    }

    /**
     * Gracefully stops the current run of the crawler.
     *
     * All the tasks active at the time of calling this method will be allowed to finish.
     */
    stop(message = 'The crawler has been gracefully stopped.'): void {
        // Gracefully starve this.autoscaledPool so that it doesn't start new tasks. Resolves once the pool is cleared.
        this.autoscaledPool
            ?.pause()
            // Resolves the `autoscaledPool.run()` promise in the `BasicCrawler.run()` method. Since the pool is already paused, it resolves immediately and doesn't kill any tasks.
            .then(async () => this.autoscaledPool?.abort())
Member:

What's the motivation for not making the method async and ignoring the promise instead? It feels handy to be able to await it (and users can always decide themselves to ignore the promise, unlike with the current solution).

Also, I don't think you need the async in here.

Contributor Author:

I planned it like this in the beginning, see 284274b.

The reasons for moving to this were:
a. parity with the Python version
b. the following deadlock hazard:

async requestHandler({ crawler }) {
    if (iFoundWhatISearchedFor) {
        await crawler.stop();
    }
}

^ This looks like a completely harmless piece of code, but it will cause a deadlock: the requestHandler is waiting for autoscaledPool.pause(), which is waiting for all tasks to finish (including the aforementioned requestHandler).

> also I don't think you need the async in here

You and me both, but @typescript-eslint/promise-function-async thinks otherwise.

Contributor Author (@barjin), Jan 9, 2025:

Anyway, I'm definitely for making this async, but we need to address these issues somehow. Also, note that even with the implementation from 284274b, crawler.stop() resolves once autoscaledPool.abort() has resolved, but that (AFAIK) doesn't mean crawler.run() has resolved.

Contributor:

I'm honestly not sure that there is a good reason for an async stop() instead of a non-async one. If you are expected to call it from a request handler, waiting for the crawler to stop there leads to a deadlock, and the reasons are quite intuitive.

Should someone need to wait for the crawler to finish outside of the request handler, they can always wait for the promise from crawler.run() to resolve.
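
To make the pattern this thread converges on concrete, here is a sketch (assuming the merged non-async stop()): fire-and-forget stop() inside the handler, and await crawler.run() outside to observe the shutdown. The URL and the stop condition are illustrative.

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, crawler }) {
        if (request.url.includes('/target')) {
            // Fire-and-forget: awaiting stop() here would deadlock,
            // since pause() waits for this very handler to finish.
            crawler.stop();
        }
    },
});

// The supported way to wait for the graceful shutdown to complete.
await crawler.run(['https://example.com/target']);
console.log('Crawler has stopped.');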

Contributor Author:

I'm always a bit wary about floating promises, but as long as you guys are fine with it (and there is the catch clause to stop it from killing the whole process), let's see how it goes.

I'll still add the promised tests from #2803, just to stay on the safe side.

            .then(() => this.log.info(message))
            .catch((err) => {
                this.log.error('An error occurred when stopping the crawler:', err);
            });
    }
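
For readers unfamiliar with the pause-then-abort sequence that stop() builds on, a standalone sketch follows. It assumes AutoscaledPool's public run()/pause()/abort() methods behave as the inline comments above describe (this is background on @crawlee/core, not code from this PR); the task counter is illustrative.

import { AutoscaledPool } from 'crawlee';

let remaining = 3;
const pool = new AutoscaledPool({
    runTaskFunction: async () => {
        remaining -= 1;
        await new Promise((resolve) => setTimeout(resolve, 100));
    },
    isTaskReadyFunction: async () => remaining > 0,
    isFinishedFunction: async () => remaining <= 0,
});

const runPromise = pool.run(); // resolves once the pool finishes or is aborted

await pool.pause(); // stop scheduling new tasks and wait for in-flight ones
await pool.abort(); // resolve the pending run() promise without killing tasks
await runPromise;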

    async getRequestQueue() {
        if (!this.requestQueue && this.requestList) {
            this.log.warningOnce(