Introduce support for PDFs

### Feature request

The idea (discussed on the Discord server with @lhoestq) is to add a Pdf feature type alongside Image/Audio/Video. For example, [Video](https://github.com/huggingface/datasets/blob/main/src/datasets/features/video.py) was recently added; it shows how to decode a video file stored as a dictionary like {"path": ..., "bytes": ...} into a VideoReader using decord. We want to do the same for PDFs and return a [pypdfium2.PdfDocument](https://pypdfium2.readthedocs.io/en/stable/_modules/pypdfium2/_helpers/document.html#PdfDocument).
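
To make the request concrete, here is a rough sketch of what the decoding step could look like. It is not the actual `datasets` implementation: the names `pdf_source` and `decode_pdf` are hypothetical, and preferring in-memory bytes over the path is an assumption modelled on how the existing media features behave. The only third-party call used is `pypdfium2.PdfDocument(...)`, which accepts either a file path or raw bytes.

```python
from typing import Optional, TypedDict, Union


class EncodedPdf(TypedDict):
    """Storage format mentioned above: {"path": ..., "bytes": ...}."""
    path: Optional[str]
    bytes: Optional[bytes]


def pdf_source(value: EncodedPdf) -> Union[str, bytes]:
    """Pick the input for the PDF reader (hypothetical helper).

    Assumption: in-memory bytes take priority, the path is the fallback.
    """
    if value.get("bytes") is not None:
        return value["bytes"]
    if value.get("path") is not None:
        return value["path"]
    raise ValueError("A PDF needs either 'path' or 'bytes', but both are None.")


def decode_pdf(value: EncodedPdf):
    """Decode the stored dictionary into a pypdfium2.PdfDocument (sketch)."""
    # Imported lazily so pypdfium2 stays an optional dependency.
    import pypdfium2 as pdfium

    return pdfium.PdfDocument(pdf_source(value))
```

With something like this, `decode_pdf({"path": "report.pdf", "bytes": None})` would return a `PdfDocument`, and `len(...)` on it gives the page count. A real `Pdf` feature would also need the encoding side and Arrow storage handling, which this sketch leaves out.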

### Motivation

In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.

### Your contribution

I can start the implementation of the Pdf type :)

Introduce support for PDFs #7318
