| 1 | +# Create a video dataset |
| 2 | + |
| 3 | +This guide will show you how to create a video dataset with `VideoFolder` and some metadata. This is a no-code solution for quickly creating a video dataset with several thousand videos. |
| 4 | + |
| 5 | +<Tip> |
| 6 | + |
| 7 | +You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub. |
| 8 | + |
| 9 | +</Tip> |
| 10 | + |
| 11 | +## VideoFolder |
| 12 | + |
| 13 | +`VideoFolder` is a dataset builder designed to quickly load a video dataset with several thousand videos without requiring you to write any code. |
| 14 | + |
| 15 | +<Tip> |
| 16 | + |
| 17 | +💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `VideoFolder` creates dataset splits based on your dataset repository structure. |
| 18 | + |
| 19 | +</Tip> |
| 20 | + |
| 21 | +`VideoFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like: |
| 22 | + |
| 23 | +``` |
| 24 | +folder/train/dog/golden_retriever.mp4 |
| 25 | +folder/train/dog/german_shepherd.mp4 |
| 26 | +folder/train/dog/chihuahua.mp4 |
| 27 | + |
| 28 | +folder/train/cat/maine_coon.mp4 |
| 29 | +folder/train/cat/bengal.mp4 |
| 30 | +folder/train/cat/birman.mp4 |
| 31 | +``` |
| 32 | + |
| 33 | +Then users can load your dataset by specifying `videofolder` in [`load_dataset`] and the directory in `data_dir`: |
| 34 | + |
| 35 | +```py |
| 36 | +>>> from datasets import load_dataset |
| 37 | + |
| 38 | +>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder") |
| 39 | +``` |
| 40 | + |
| 41 | +You can also use `videofolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure: |
| 42 | + |
| 43 | +``` |
| 44 | +folder/train/dog/golden_retriever.mp4 |
| 45 | +folder/train/cat/maine_coon.mp4 |
| 46 | +folder/test/dog/german_shepherd.mp4 |
| 47 | +folder/test/cat/bengal.mp4 |
| 48 | +``` |
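As a quick sanity check before loading, the layout above can be read back from a path. The following is a minimal sketch, not `VideoFolder`'s actual implementation: `split_and_label` is a hypothetical helper that assumes every video sits at exactly `folder/<split>/<label>/<file>`.

```python
from pathlib import PurePosixPath


def split_and_label(path):
    """Return (split, label) for a path shaped like folder/<split>/<label>/<file>.

    Hypothetical helper for checking a local layout; it does not reproduce
    how the library itself resolves splits and labels.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected folder/<split>/<label>/<file>, got {path!r}")
    # The file's parent directory names the label; the directory above it names the split.
    return parts[-3], parts[-2]
```

Running it over every file in your folder and collecting the results is a cheap way to spot videos that ended up at the wrong directory depth.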
| 49 | + |
| 50 | +<Tip warning={true}> |
| 51 | + |
| 52 | +If all video files are contained in a single directory, or if they are not at the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly. |
| 53 | + |
| 54 | +</Tip> |
| 55 | + |
| 56 | + |
| 57 | +If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl`. |
| 58 | + |
| 59 | +``` |
| 60 | +folder/train/metadata.csv |
| 61 | +folder/train/0001.mp4 |
| 62 | +folder/train/0002.mp4 |
| 63 | +folder/train/0003.mp4 |
| 64 | +``` |
| 65 | + |
| 66 | +You can also zip your videos: |
| 67 | + |
| 68 | +``` |
| 69 | +folder/metadata.csv |
| 70 | +folder/train.zip |
| 71 | +folder/test.zip |
| 72 | +folder/valid.zip |
| 73 | +``` |
| 74 | + |
| 75 | +Your `metadata.csv` file must have a `file_name` column which links video files with their metadata: |
| 76 | + |
| 77 | +```csv |
| 78 | +file_name,additional_feature |
| 79 | +0001.mp4,This is a first value of a text feature you added to your videos |
| 80 | +0002.mp4,This is a second value of a text feature you added to your videos |
| 81 | +0003.mp4,This is a third value of a text feature you added to your videos |
| 82 | +``` |
| 83 | + |
| 84 | +or using `metadata.jsonl`: |
| 85 | + |
| 86 | +```jsonl |
| 87 | +{"file_name": "0001.mp4", "additional_feature": "This is a first value of a text feature you added to your videos"} |
| 88 | +{"file_name": "0002.mp4", "additional_feature": "This is a second value of a text feature you added to your videos"} |
| 89 | +{"file_name": "0003.mp4", "additional_feature": "This is a third value of a text feature you added to your videos"} |
| 90 | +``` |
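If your captions already live in a Python dictionary, the metadata file can be generated rather than written by hand. This is a minimal sketch using only the standard library; `write_metadata` and its arguments are hypothetical names, and `additional_feature` is simply the example column used above.

```python
import csv
from pathlib import Path


def write_metadata(folder, captions):
    """Write a metadata.csv with a file_name column into `folder`.

    `captions` maps a video file name (relative to `folder`) to its text.
    Hypothetical helper, not part of the `datasets` library.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    metadata_path = folder / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # file_name is the required column; additional_feature is a free-form example column.
        writer.writerow(["file_name", "additional_feature"])
        for file_name, text in captions.items():
            writer.writerow([file_name, text])
    return metadata_path
```

Using the `csv` module instead of manual string formatting means captions containing commas or quotes are escaped correctly.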
| 91 | + |
| 92 | +<Tip> |
| 93 | + |
| 94 | +If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`. |
| 95 | + |
| 96 | +</Tip> |
| 97 | + |
| 98 | +### Video captioning |
| 99 | + |
| 100 | +Video captioning datasets have text describing a video. An example `metadata.csv` may look like: |
| 101 | + |
| 102 | +```csv |
| 103 | +file_name,text |
| 104 | +0001.mp4,This is a golden retriever playing with a ball |
| 105 | +0002.mp4,A german shepherd |
| 106 | +0003.mp4,One chihuahua |
| 107 | +``` |
| 108 | + |
| 109 | +Load the dataset with `VideoFolder`, and it will create a `text` column for the video captions: |
| 110 | + |
| 111 | +```py |
| 112 | +>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", split="train") |
| 113 | +>>> dataset[0]["text"] |
| 114 | +"This is a golden retriever playing with a ball" |
| 115 | +``` |
| 116 | + |
| 117 | +### Upload dataset to the Hub |
| 118 | + |
| 119 | +Once you've created a dataset, you can share it to the Hub using `huggingface_hub`, for example. Make sure you have the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) library installed and you're logged in to your Hugging Face account (see the [Upload with Python tutorial](upload_dataset#upload-with-python) for more details). |
| 120 | + |
| 121 | +Upload your dataset with `huggingface_hub.HfApi.upload_folder`: |
| 122 | + |
| 123 | +```py |
| 124 | +from huggingface_hub import HfApi |
| 125 | +api = HfApi() |
| 126 | + |
| 127 | +api.upload_folder( |
| 128 | + folder_path="/path/to/local/dataset", |
| 129 | + repo_id="username/my-cool-dataset", |
| 130 | + repo_type="dataset", |
| 131 | +) |
| 132 | +``` |
| 133 | + |
| 134 | +## WebDataset |
| 135 | + |
| 136 | +The [WebDataset](https://github.com/webdataset/webdataset) format is based on TAR archives and is suitable for big video datasets. |
| 137 | +You can group your videos into TAR archives (e.g. 1GB of videos per TAR archive) and have thousands of TAR archives: |
| 138 | + |
| 139 | +``` |
| 140 | +folder/train/00000.tar |
| 141 | +folder/train/00001.tar |
| 142 | +folder/train/00002.tar |
| 143 | +... |
| 144 | +``` |
| 145 | + |
| 146 | +In the archives, each example is made of files sharing the same prefix: |
| 147 | + |
| 148 | +``` |
| 149 | +e39871fd9fd74f55.mp4 |
| 150 | +e39871fd9fd74f55.json |
| 151 | +f18b91585c4d3f3e.mp4 |
| 152 | +f18b91585c4d3f3e.json |
| 153 | +ede6e66b2fb59aab.mp4 |
| 154 | +ede6e66b2fb59aab.json |
| 155 | +ed600d57fcee4f94.mp4 |
| 156 | +ed600d57fcee4f94.json |
| 157 | +... |
| 158 | +``` |
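An archive with this layout can be produced with Python's standard `tarfile` module. The sketch below is one possible approach, not the WebDataset library's own writer; `write_shard` and the shape of `examples` are assumptions made for illustration.

```python
import io
import json
import tarfile


def write_shard(tar_path, examples):
    """Write one TAR shard; `examples` is a list of (key, video_bytes, metadata_dict).

    Hypothetical helper: each example becomes <key>.mp4 and <key>.json,
    written next to each other so they share the same prefix.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, video_bytes, metadata in examples:
            payloads = {
                f"{key}.mp4": video_bytes,
                f"{key}.json": json.dumps(metadata).encode("utf-8"),
            }
            for name, data in payloads.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(data)  # the size must be set before adding in-memory data
                tar.addfile(info, io.BytesIO(data))
```

Calling it once per shard (for example `00000.tar`, `00001.tar`, and so on) with roughly equal amounts of data keeps the archives at a comparable size.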
| 159 | + |
| 160 | +You can store your videos' labels, captions, or features in JSON or text files, for example. |
| 161 | + |
| 162 | +For more details on the WebDataset format and the Python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset). |
| 163 | + |
| 164 | +Load your WebDataset and it will create one column per file suffix (here "mp4" and "json"): |
| 165 | + |
| 166 | +```python |
| 167 | +>>> from datasets import load_dataset |
| 168 | + |
| 169 | +>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train") |
| 170 | +>>> dataset[0]["json"] |
| 171 | +{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]} |
| 172 | +``` |