Rust based HTTP client as an alternative for got-scraping · Issue #2756 · apify/crawlee · GitHub

Rust based HTTP client as an alternative for got-scraping #2756

Open
B4nan opened this issue Nov 29, 2024 · 1 comment
Labels
  • Epic: An epic is a large body of work that can be broken down into a number of smaller issues.
  • feature: Issues that represent new features or improvements to existing features.
  • product roadmap: Issues synchronized to product roadmap.
  • t-tooling: Issues with this label are in the ownership of the tooling team.

Comments

@B4nan
Member

B4nan commented Nov 29, 2024

Feature

The got-scraping client is becoming obsolete, and because it is written in Node, HTTP/3 support is unlikely in the near future. We want to build a new client in Rust to allow better low-level control over the request and its fingerprint.

Once it is implemented for Node, we also want to reuse the same client for the Python version.

B4nan added the feature, Epic, and product roadmap labels on Nov 29, 2024
github-actions bot added the t-tooling label on Nov 29, 2024
@barjin
Contributor

barjin commented Dec 2, 2024

Not sure where to put these issues for now (likely https://github.com/retch-http/retch later), but I wanted to have them in one place first. Here's a list of the biggest points I can think of (items under Important are required; items under Tip are, IMO, optional).

Stealth

Important

  • Patch h2 (and likely reqwest) to modify the order of the HTTP/2 pseudo-headers
  • Impersonate the HTTP/2 markers in h2 (see e.g. the Akamai paper).

Tip

  • Automate fingerprint collecting
  • Automatic CI tests, comparing our fingerprints to e.g. https://ja4db.com/
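
To make the pseudo-header point concrete: HTTP/2 fingerprinting schemes such as the one in the Akamai paper include the order in which a client sends `:method`, `:authority`, `:scheme` and `:path`, usually abbreviated to something like `m,a,s,p`. Below is a minimal Rust sketch of how a CI check could derive that component from captured header names and compare it with a reference value. The function names (`abbreviate`, `pseudo_header_order`, `matches_expected`) are hypothetical and are not part of h2, reqwest or retch; the sketch assumes the header names have already been captured in wire order by some other tool.

```rust
/// Akamai-style one-letter abbreviation for an HTTP/2 request pseudo-header.
/// Returns `None` for anything that is not one of the four request pseudo-headers.
fn abbreviate(name: &str) -> Option<char> {
    match name {
        ":method" => Some('m'),
        ":authority" => Some('a'),
        ":scheme" => Some('s'),
        ":path" => Some('p'),
        _ => None,
    }
}

/// Builds the pseudo-header-order component of a fingerprint (e.g. "m,a,s,p")
/// from header names given in the order they were observed on the wire.
/// Regular headers such as `user-agent` are skipped.
fn pseudo_header_order<'a>(names: impl IntoIterator<Item = &'a str>) -> String {
    names
        .into_iter()
        .filter_map(abbreviate)
        .map(String::from)
        .collect::<Vec<String>>()
        .join(",")
}

/// Hypothetical CI helper: does the observed order match the reference
/// fingerprint component we are trying to impersonate?
fn matches_expected(observed: &[&str], expected: &str) -> bool {
    pseudo_header_order(observed.iter().copied()) == expected
}
```

For example, `pseudo_header_order(vec![":method", ":authority", ":scheme", ":path", "user-agent"])` yields `"m,a,s,p"`, and `matches_expected` returns `false` as soon as two pseudo-headers are swapped. The reference strings themselves would have to come from the collected fingerprints, not from this sketch.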

Technical

Important

  • Add testing
  • Investigate HTTP/3 capabilities
  • Remove bogus cipher implementations and use the actual implementations (e.g. Kyber from here, or the deprecated ciphers we now fake here).

Distribution

Important

  • Design Node.js interface
  • Set up Node.js bindings, release npm package with prebuilt binaries
  • Design Python interface
  • Set up Python bindings, release PyPI package with prebuilt binaries

Tip

  • Figure out macOS code signing for the CLI tool.


2 participants







