Content-Length: 288036 | pFad | https://github.com/huggingface/datasets/issues/7318

27 Introduce support for PDFs · Issue #7318 · huggingface/datasets · GitHub
Skip to content

Introduce support for PDFs #7318

Open
Open
@yabramuvdi

Description

@yabramuvdi

Feature request

The idea (discussed in the Discord server with @lhoestq ) is to have a Pdf type like Image/Audio/Video. For example Video was recently added and contains how to decode a video file encoded in a dictionary like {"path": ..., "bytes": ...} as a VideoReader using decord. We want to do the same with pdf and get a pypdfium2.PdfDocument.

Motivation

In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.

Your contribution

I can start the implementation of the Pdf type :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions









      ApplySandwichStrip

      pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


      --- a PPN by Garber Painting Akron. With Image Size Reduction included!

      Fetched URL: https://github.com/huggingface/datasets/issues/7318

      Alternative Proxies:

      Alternative Proxy

      pFad Proxy

      pFad v3 Proxy

      pFad v4 Proxy