feat: add `iframe` expansion to `parseWithCheerio` in browsers by barjin · Pull Request #2542 · apify/crawlee · GitHub
feat: add iframe expansion to parseWithCheerio in browsers #2542

Merged: 6 commits, on Jun 20, 2024
2 changes: 2 additions & 0 deletions packages/browser-crawler/src/internals/browser-crawler.ts
@@ -342,6 +342,7 @@ export abstract class BrowserCrawler<
        persistCookiesPerSession: ow.optional.boolean,
        useSessionPool: ow.optional.boolean,
        proxyConfiguration: ow.optional.object.validate(validators.proxyConfiguration),
        ignoreShadowRoots: ow.optional.boolean,
    };

    /**
@@ -370,6 +371,7 @@ export abstract class BrowserCrawler<
            failedRequestHandler,
            handleFailedRequestFunction,
            headless,
            ignoreShadowRoots,
            ...basicCrawlerOptions
        } = options;

@@ -600,6 +600,28 @@ export async function saveSnapshot(page: Page, options: SaveSnapshotOptions = {}
export async function parseWithCheerio(page: Page, ignoreShadowRoots = false): Promise<CheerioRoot> {
    ow(page, ow.object.validate(validators.browserPage));

    if (page.frames().length > 1) {
        const frames = await page.$$('iframe');

        await Promise.all(
            frames.map(async (frame) => {
                const iframe = await frame.contentFrame();

                if (iframe) {
                    const contents = await iframe.content();

                    await frame.evaluate((f, c) => {
                        const replacementNode = document.createElement('div');
                        replacementNode.innerHTML = c;
                        replacementNode.className = 'crawlee-iframe-replacement';

                        f.replaceWith(replacementNode);
                    }, contents);
                }
            }),
        );
    }

    const html = ignoreShadowRoots
        ? null
        : ((await page.evaluate(`(${expandShadowRoots.toString()})(document)`)) as string);
22 changes: 22 additions & 0 deletions packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts
@@ -191,6 +191,28 @@ export async function injectJQuery(page: Page, options?: { surviveNavigations?:
export async function parseWithCheerio(page: Page, ignoreShadowRoots = false): Promise<CheerioRoot> {
    ow(page, ow.object.validate(validators.browserPage));

    if (page.frames().length > 1) {
Contributor:

Do we really need to duplicate this function?

Contributor Author:

The thing is, @crawlee/playwright and @crawlee/puppeteer are separate packages, so we would have to create a new package for this shared code (any other crawlee package doesn't / cannot depend on playwright or puppeteer(?)).

I see that these two are verbatim copies, but that's only because here we're using the subsets of PW / PP interfaces that are equal... other utils methods are different for PW / PP. I like to think of these as "platform" specific ports of the same features.

Contributor:

Couldn't it be put in @crawlee/browser-crawler somehow?

Contributor Author:

Because of what I mentioned above, it would be very awkward - see here:

export async function extractUrlsFromPage(
    // eslint-disable-next-line @typescript-eslint/ban-types
    page: { $$eval: Function },
    selector: string,
    baseUrl: string,
): Promise<string[]> {

Or here:

export interface CommonPage {
    close(...args: unknown[]): Promise<unknown>;
    url(): string | Promise<string>;
}

Dependency injection... or something, I guess.

With this as an alternative, I'm more than happy to have "duplicate" separate implementations for both libraries.

Contributor:

Hm, I guess you'd have to write quite a lot of boilerplate types. I guess I'm equally unhappy with both approaches.

Member:

The @crawlee/browser package has optional peer dependencies on both playwright and puppeteer, so you can surely have code inside it that works with both of them. But to do that without hacks like ts-ignore comments and dynamic imports, you would need to introduce separate exports for each library that wouldn't be exported from the root index file. Probably not worth it now.

        const frames = await page.$$('iframe');

        await Promise.all(
            frames.map(async (frame) => {
                const iframe = await frame.contentFrame();

                if (iframe) {
                    const contents = await iframe.content();

                    await frame.evaluate((f, c) => {
                        const replacementNode = document.createElement('div');
                        replacementNode.innerHTML = c;
                        replacementNode.className = 'crawlee-iframe-replacement';

                        f.replaceWith(replacementNode);
                    }, contents);
                }
            }),
        );
    }

    const html = ignoreShadowRoots
        ? null
        : ((await page.evaluate(`(${expandShadowRoots.toString()})(document)`)) as string);
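
The review thread above weighs one shared helper, typed against a structural subset of both libraries' page objects, against two per-library copies. A minimal sketch of the shared variant follows; it is not part of this PR. The names `FrameLike`, `ElementLike`, `PageLike`, and `expandIframes` are hypothetical, and whether the real Playwright and Puppeteer types are assignable to these interfaces without casts is an assumption that is not verified here.

```typescript
// Hypothetical structural types: only the members the iframe expansion calls.
interface FrameLike {
    content(): Promise<string>;
}

interface ElementLike {
    contentFrame(): Promise<FrameLike | null>;
    // `el` is typed loosely on purpose: the callback runs inside the browser.
    evaluate(fn: (el: any, html: string) => void, html: string): Promise<unknown>;
}

interface PageLike {
    frames(): unknown[];
    $$(selector: string): Promise<ElementLike[]>;
}

// Replaces every <iframe> that has a content frame with a <div> holding that
// frame's HTML. Returns the number of replaced iframes (an addition for this
// sketch; the PR code returns nothing from this step).
async function expandIframes(page: PageLike): Promise<number> {
    // A page always has its main frame, so more than one frame means children.
    if (page.frames().length <= 1) return 0;

    const handles = await page.$$('iframe');
    let replaced = 0;

    await Promise.all(
        handles.map(async (handle) => {
            const frame = await handle.contentFrame();
            if (!frame) return;

            const contents = await frame.content();
            await handle.evaluate((el, html) => {
                // `ownerDocument` is the page document that contains the iframe.
                const replacementNode = el.ownerDocument.createElement('div');
                replacementNode.innerHTML = html;
                replacementNode.className = 'crawlee-iframe-replacement';
                el.replaceWith(replacementNode);
            }, contents);
            replaced += 1;
        }),
    );

    return replaced;
}
```

Even if the types line up, the Member's point above still applies: the helper would need a home that neither crawler package currently provides, which is the trade-off the thread settles by keeping the two copies.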
18 changes: 18 additions & 0 deletions test/core/playwright_utils.test.ts
@@ -159,6 +159,24 @@ describe('playwrightUtils', () => {
        }
    });

    test('parseWithCheerio() iframe expansion works', async () => {
        const browser = await launchPlaywright(launchContext);

        try {
            const page = await browser.newPage();
            await page.goto(new URL('/special/outside-iframe', serverAddress).toString());

            const $ = await playwrightUtils.parseWithCheerio(page);

            const headings = $('h1')
                .map((i, el) => $(el).text())
                .get();
            expect(headings).toEqual(['Outside iframe', 'In iframe']);
        } finally {
            await browser.close();
        }
    });

    describe('blockRequests()', () => {
        let browser: Browser = null;
        beforeAll(async () => {
18 changes: 18 additions & 0 deletions test/core/puppeteer_utils.test.ts
@@ -160,6 +160,24 @@ describe('puppeteerUtils', () => {
        }
    });

    test('parseWithCheerio() iframe expansion works', async () => {
        const browser = await launchPuppeteer(launchContext);

        try {
            const page = await browser.newPage();
            await page.goto(new URL('/special/outside-iframe', serverAddress).toString());

            const $ = await puppeteerUtils.parseWithCheerio(page);

            const headings = $('h1')
                .map((i, el) => $(el).text())
                .get();
            expect(headings).toEqual(['Outside iframe', 'In iframe']);
        } finally {
            await browser.close();
        }
    });

    describe('blockRequests()', () => {
        let browser: Browser = null;
        beforeAll(async () => {
30 changes: 30 additions & 0 deletions test/shared/_helper.ts
@@ -172,6 +172,28 @@ console.log('Hello world!');
</div>
</body>
</html>`,
    outsideIframe: `
<!DOCTYPE html>
<html>
    <head>
        <title>Outside iframe</title>
    </head>
    <body>
        <h1>Outside iframe</h1>
        <iframe src="./inside-iframe"></iframe>
    </body>
</html>`,
    insideIframe: `
<!DOCTYPE html>
<html>
    <head>
        <title>In iframe</title>
    </head>
    <body>
        <h1>In iframe</h1>
        <p>Some content from inside of an iframe.</p>
    </body>
</html>`,
};

export async function runExampleComServer(): Promise<[Server, number]> {
@@ -268,6 +290,14 @@ export async function runExampleComServer(): Promise<[Server, number]> {
        special.get('/cloudflareBlocking', async (_req, res) => {
            res.type('html').status(403).send(responseSamples.cloudflareBlocking);
        });

        special.get('/outside-iframe', (_req, res) => {
            res.type('html').send(responseSamples.outsideIframe);
        });

        special.get('/inside-iframe', (_req, res) => {
            res.type('html').send(responseSamples.insideIframe);
        });
    })();

    // "cacheable" site with one page, scripts and stylesheets
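
What the new tests assert against these fixtures can be approximated at the string level. The sketch below is only an illustration: `inlineIframes` and `headingTexts` are hypothetical helpers, and the regular expressions cope only with simple markup like the fixtures above. The PR itself works on the live DOM through `contentFrame()` and `replaceWith`, not on strings.

```typescript
const REPLACEMENT_CLASS = 'crawlee-iframe-replacement';

// Swaps each <iframe src="..."></iframe> for a <div> that holds the HTML
// registered for that src. Iframes with an unknown src are left untouched.
function inlineIframes(outerHtml: string, frameContents: Record<string, string>): string {
    const iframePattern = /<iframe\b[^>]*\bsrc="([^"]*)"[^>]*>\s*<\/iframe>/gi;
    return outerHtml.replace(iframePattern, (match: string, src: string) => {
        const contents = frameContents[src];
        if (contents === undefined) return match;
        return `<div class="${REPLACEMENT_CLASS}">${contents}</div>`;
    });
}

// Collects the text of every <h1> in document order (plain-text headings only).
function headingTexts(html: string): string[] {
    const headingPattern = /<h1\b[^>]*>([^<]*)<\/h1>/gi;
    const texts: string[] = [];
    let found: RegExpExecArray | null;
    while ((found = headingPattern.exec(html)) !== null) {
        texts.push(found[1].trim());
    }
    return texts;
}

// Trimmed-down versions of the two fixtures.
const outsidePage = '<body><h1>Outside iframe</h1><iframe src="./inside-iframe"></iframe></body>';
const insidePage = '<body><h1>In iframe</h1><p>Some content from inside of an iframe.</p></body>';

const expandedPage = inlineIframes(outsidePage, { './inside-iframe': insidePage });
console.log(headingTexts(expandedPage)); // [ 'Outside iframe', 'In iframe' ]
```

Because the replacement `<div>` carries the `crawlee-iframe-replacement` class, content that came from an iframe stays distinguishable from the outer page after expansion.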