Disallow video push_to_hub (#7265) · huggingface/datasets@f8e3321 · GitHub

Commit f8e3321

Disallow video push_to_hub (#7265)
* disallow video push_to_hub
* docs
* minor
1 parent 46e4616 commit f8e3321

5 files changed, +188 -1 lines changed

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,8 @@
       title: Object detection
     - local: video_load
       title: Load video data
+    - local: video_dataset
+      title: Create a video dataset
     title: "Vision"
   - sections:
     - local: nlp_load

docs/source/how_to.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ The guides are organized into six sections:
 
 - <span class="underline decoration-sky-400 decoration-2 font-semibold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
 - <span class="underline decoration-pink-400 decoration-2 font-semibold">Audio</span>: How to load, process, and share audio datasets.
-- <span class="underline decoration-yellow-400 decoration-2 font-semibold">Vision</span>: How to load, process, and share image datasets.
+- <span class="underline decoration-yellow-400 decoration-2 font-semibold">Vision</span>: How to load, process, and share image and video datasets.
 - <span class="underline decoration-green-400 decoration-2 font-semibold">Text</span>: How to load, process, and share text datasets.
 - <span class="underline decoration-orange-400 decoration-2 font-semibold">Tabular</span>: How to load, process, and share tabular datasets.
 - <span class="underline decoration-indigo-400 decoration-2 font-semibold">Dataset repository</span>: How to share and upload a dataset to the <a href="https://huggingface.co/datasets">Hub</a>.

docs/source/video_dataset.mdx

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
+# Create a video dataset
+
+This guide will show you how to create a video dataset with `VideoFolder` and some metadata. This is a no-code solution for quickly creating a video dataset with several thousand videos.
+
+<Tip>
+
+You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
+
+</Tip>
+
+## VideoFolder
+
+The `VideoFolder` is a dataset builder designed to quickly load a video dataset with several thousand videos without requiring you to write any code.
+
+<Tip>
+
+💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `VideoFolder` creates dataset splits based on your dataset repository structure.
+
+</Tip>
+
+`VideoFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:
+
+```
+folder/train/dog/golden_retriever.mp4
+folder/train/dog/german_shepherd.mp4
+folder/train/dog/chihuahua.mp4
+
+folder/train/cat/maine_coon.mp4
+folder/train/cat/bengal.mp4
+folder/train/cat/birman.mp4
+```
+
+Then users can load your dataset by specifying `videofolder` in [`load_dataset`] and the directory in `data_dir`:
+
+```py
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder")
+```
+
+You can also use `videofolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:
+
+```
+folder/train/dog/golden_retriever.mp4
+folder/train/cat/maine_coon.mp4
+folder/test/dog/german_shepherd.mp4
+folder/test/cat/bengal.mp4
+```
+
+<Tip warning={true}>
+
+If all video files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
+
+</Tip>
+
+
+If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl`.
+
+```
+folder/train/metadata.csv
+folder/train/0001.mp4
+folder/train/0002.mp4
+folder/train/0003.mp4
+```
+
+You can also zip your videos:
+
+```
+folder/metadata.csv
+folder/train.zip
+folder/test.zip
+folder/valid.zip
+```
+
+Your `metadata.csv` file must have a `file_name` column which links video files with their metadata:
+
+```csv
+file_name,additional_feature
+0001.mp4,This is a first value of a text feature you added to your videos
+0002.mp4,This is a second value of a text feature you added to your videos
+0003.mp4,This is a third value of a text feature you added to your videos
+```
+
+or using `metadata.jsonl`:
+
+```jsonl
+{"file_name": "0001.mp4", "additional_feature": "This is a first value of a text feature you added to your videos"}
+{"file_name": "0002.mp4", "additional_feature": "This is a second value of a text feature you added to your videos"}
+{"file_name": "0003.mp4", "additional_feature": "This is a third value of a text feature you added to your videos"}
+```
+
+<Tip>
+
+If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`.
+
+</Tip>
+
+### Video captioning
+
+Video captioning datasets have text describing a video. An example `metadata.csv` may look like:
+
+```csv
+file_name,text
+0001.mp4,This is a golden retriever playing with a ball
+0002.mp4,A german shepherd
+0003.mp4,One chihuahua
+```
+
+Load the dataset with `VideoFolder`, and it will create a `text` column for the video captions:
+
+```py
+>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", split="train")
+>>> dataset[0]["text"]
+"This is a golden retriever playing with a ball"
+```
+
+### Upload dataset to the Hub
+
+Once you've created a dataset, you can share it to the Hub using `huggingface_hub`, for example. Make sure you have the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) library installed and you're logged in to your Hugging Face account (see the [Upload with Python tutorial](upload_dataset#upload-with-python) for more details).
+
+Upload your dataset with `huggingface_hub.HfApi.upload_folder`:
+
+```py
+from huggingface_hub import HfApi
+api = HfApi()
+
+api.upload_folder(
+    folder_path="/path/to/local/dataset",
+    repo_id="username/my-cool-dataset",
+    repo_type="dataset",
+)
+```
+
+## WebDataset
+
+The [WebDataset](https://github.com/webdataset/webdataset) format is based on TAR archives and is suitable for big video datasets.
+Indeed you can group your videos in TAR archives (e.g. 1GB of videos per TAR archive) and have thousands of TAR archives:
+
+```
+folder/train/00000.tar
+folder/train/00001.tar
+folder/train/00002.tar
+...
+```
+
+In the archives, each example is made of files sharing the same prefix:
+
+```
+e39871fd9fd74f55.mp4
+e39871fd9fd74f55.json
+f18b91585c4d3f3e.mp4
+f18b91585c4d3f3e.json
+ede6e66b2fb59aab.mp4
+ede6e66b2fb59aab.json
+ed600d57fcee4f94.mp4
+ed600d57fcee4f94.json
+...
+```
+
+You can store your video labels/captions/features in JSON or text files, for example.
+
+For more details on the WebDataset format and the python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).
+
+Load your WebDataset and it will create one column per file suffix (here "mp4" and "json"):
+
+```python
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
+>>> dataset[0]["json"]
+{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
+```
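
As a rough illustration of the `metadata.csv` requirement in the guide added above (a `file_name` column linking each video to its metadata), the sketch below writes such a file with Python's standard `csv` module. The helper name `write_metadata`, its arguments, and the choice of a `text` caption column are assumptions made for this example; they are not part of `datasets`.

```python
import csv
from pathlib import Path


def write_metadata(split_dir, captions):
    """Write a metadata.csv next to the videos in `split_dir`.

    `captions` maps a video file name (relative to `split_dir`) to its caption.
    Hypothetical helper for illustration only.
    """
    split_dir = Path(split_dir)
    metadata_path = split_dir / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        # `file_name` is the column the guide requires; `text` holds the caption.
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        for file_name, text in sorted(captions.items()):
            # Skip entries whose video is missing, so every row points at a real file.
            if (split_dir / file_name).is_file():
                writer.writerow({"file_name": file_name, "text": text})
    return metadata_path
```

Calling `write_metadata("/path/to/folder/train", {"0001.mp4": "A golden retriever"})` would produce a file in the layout shown in the guide. Skipping missing videos is a design choice of this sketch, not a documented behavior of `VideoFolder`.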

docs/source/video_load.mdx

Lines changed: 6 additions & 0 deletions
@@ -112,6 +112,12 @@ To ignore the information in the metadata file, set `drop_labels=False` in [`loa
 >>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", drop_labels=False)
 ```
 
+<Tip>
+
+For more information about creating your own `VideoFolder` dataset, take a look at the [Create a video dataset](./video_dataset) guide.
+
+</Tip>
+
 ## WebDataset
 
 The [WebDataset](https://github.com/webdataset/webdataset) format is based on a folder of TAR archives and is suitable for big video datasets.
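
The TAR layout that the docs describe, where the files of one example share the same prefix, can be sketched with the standard `tarfile` module. This is only an illustration under stated assumptions: `write_shard` and its `(key, video_bytes, annotations)` input shape are hypothetical names chosen here, and neither the `webdataset` library nor `datasets` is used.

```python
import io
import json
import tarfile


def write_shard(tar_path, examples):
    """Pack examples into one TAR shard (hypothetical helper for illustration).

    `examples` is an iterable of (key, video_bytes, annotations) tuples.
    The files of one example share `key` as their prefix: <key>.mp4 and <key>.json.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, video_bytes, annotations in examples:
            payloads = {
                f"{key}.mp4": video_bytes,
                f"{key}.json": json.dumps(annotations).encode("utf-8"),
            }
            for name, data in payloads.items():
                # Members are added from memory, so the size must be set explicitly.
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
```

Writing the two members of an example back to back keeps them adjacent in the archive, which matches the listing in the docs where each `.mp4` is directly followed by its `.json`.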

src/datasets/arrow_dataset.py

Lines changed: 7 additions & 0 deletions
@@ -5400,6 +5400,13 @@ def push_to_hub(
         >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr")
         ```
         """
+        if "Video(" in str(self.features):
+            raise NotImplementedError(
+                "push_to_hub is not implemented for video datasets, instead you should upload the video files "
+                "using e.g. the huggingface_hub library and optionally upload a metadata.csv or metadata.jsonl "
+                "file containing other information like video captions, features or labels. More information "
+                "at https://huggingface.co/docs/datasets/main/en/video_load#videofolder"
+            )
         if config_name == "data":
             raise ValueError("`config_name` cannot be 'data'. Please, choose another name for configuration.")
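
The guard added in this hunk tests the string form of the features for the substring `"Video("`. A minimal sketch of why that catches nested video features is given below, using stand-in dataclasses rather than the real `datasets` feature types; `Video`, `Value`, and `check_pushable` here are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Video:
    """Stand-in for a video feature type (hypothetical, not the datasets class)."""
    decode: bool = True


@dataclass
class Value:
    """Stand-in for a scalar feature type (hypothetical, not the datasets class)."""
    dtype: str


def check_pushable(features):
    """Raise if the string form of `features` contains "Video(".

    Mirrors the substring test in the diff: containers print their contents,
    so a video feature is detected at any nesting depth.
    """
    if "Video(" in str(features):
        raise NotImplementedError("push_to_hub is not implemented for video datasets")


check_pushable({"text": Value("string")})  # passes silently: no video feature present
```

A substring test is simple and depth-independent, at the cost of also matching any other type whose printed name ends in `Video(`; whether that trade-off matters for the real feature classes is not stated in the commit.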
