From 6ac561235c2a75afd612c0ab17402d0c921de22f Mon Sep 17 00:00:00 2001
From: Maxim Zhiltsov <zhiltsov.max35@gmail.com>
Date: Sat, 10 Dec 2022 10:06:47 +0300
Subject: [PATCH] Improve dataset manifest installation (#5447)

Extracted from #5083
Related #5096

- Improved dataset manifest docs
- Dataset manifest requirements are now installed in the server image
- Package dependencies are aligned with the server
---
 Dockerfile                                    |  10 +-
 .../contributing/development-environment.md   |   2 +-
 .../docs/manual/advanced/dataset_manifest.md  | 170 +++++++++++++-----
 utils/dataset_manifest/create.py              |   5 +
 utils/dataset_manifest/requirements.txt       |   8 +-
 5 files changed, 148 insertions(+), 47 deletions(-)
 mode change 100644 => 100755 utils/dataset_manifest/create.py

diff --git a/Dockerfile b/Dockerfile
index ea9034f7..36a931c4 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -47,8 +47,14 @@ RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:${PATH}"
 RUN python3 -m pip install --no-cache-dir -U pip==22.0.2 setuptools==60.6.0 wheel==0.37.1
 COPY cvat/requirements/ /tmp/requirements/
-RUN DATUMARO_HEADLESS=1 python3 -m pip install --no-cache-dir -r /tmp/requirements/${DJANGO_CONFIGURATION}.txt
-
+COPY utils/dataset_manifest/ /tmp/dataset_manifest/
+
+# The server implementation depends on the dataset_manifest utility
+# so we need to install its dependencies too
+# https://github.com/opencv/cvat/issues/5096
+RUN DATUMARO_HEADLESS=1 python3 -m pip install --no-cache-dir \
+    -r /tmp/requirements/${DJANGO_CONFIGURATION}.txt \
+    -r /tmp/dataset_manifest/requirements.txt
 
 FROM ubuntu:20.04
 
diff --git a/site/content/en/docs/contributing/development-environment.md b/site/content/en/docs/contributing/development-environment.md
index b8e938cf..c20c5dda 100644
--- a/site/content/en/docs/contributing/development-environment.md
+++ b/site/content/en/docs/contributing/development-environment.md
@@ -78,7 +78,7 @@ description: 'Installing a development environment for different operating syste
   python3 -m venv .env
   . .env/bin/activate
   pip install -U pip wheel setuptools
-  pip install -r cvat/requirements/development.txt
+  pip install -r cvat/requirements/development.txt -r utils/dataset_manifest/requirements.txt
   python manage.py migrate
   python manage.py collectstatic
   ```
diff --git a/site/content/en/docs/manual/advanced/dataset_manifest.md b/site/content/en/docs/manual/advanced/dataset_manifest.md
index 60d8b5d1..0d91775a 100644
--- a/site/content/en/docs/manual/advanced/dataset_manifest.md
+++ b/site/content/en/docs/manual/advanced/dataset_manifest.md
@@ -2,20 +2,80 @@
 
 ---
 
-title: 'Simple command line to prepare dataset manifest file'
+title: 'Dataset Manifest'
 linkTitle: 'Dataset manifest'
 weight: 30
-description: This section on [GitHub](https://github.com/cvat-ai/cvat/tree/develop/utils/dataset_manifest)
+description:
 
 ---
 
 <!--lint disable heading-style-->
 
-### Steps before use
+## Overview
 
-When used separately from Computer Vision Annotation Tool(CVAT), the required dependencies must be installed
+When we create a new task in CVAT, we need to specify where to get the input data from.
+CVAT allows to use different data sources, including local file uploads, a mounted
+file share on the server, cloud storages and remote URLs. In some cases CVAT
+needs to have extra information about the input data. This information can be provided
+in Dataset manifest files. They are mainly used when working with cloud storages to
+reduce the amount of network traffic used and speed up the task creation process.
+However, they can also be used in other cases, which will be explained below.
 
-#### Ubuntu:20.04
+A dataset manifest file is a text file in the JSONL format. These files can be created
+automatically with [the special command-line tool](https://github.com/opencv/cvat/tree/develop/utils/dataset_manifest),
+or manually, following [the manifest file format specification](#file-format).
+
+## How and when to use manifest files
+
+Manifest files can be used in the following cases:
+- A video file or a set of images is used as the data source and
+  the caching mode is enabled. [Read more](/docs/manual/advanced/data_on_fly/)
+- The data is located in a cloud storage. [Read more](/docs/manual/basics/cloud-storages/)
+
+## How to generate manifest files
+
+CVAT provides a dedicated Python tool to generate manifest files.
+The source code can be found [here](https://github.com/opencv/cvat/tree/develop/utils/dataset_manifest).
+
+Using the tool is the recommended way to create manifest files for you data. The data must be
+available locally to the tool to generate manifest.
+
+### Usage
+
+```bash
+usage: create.py [-h] [--force] [--output-dir .] source
+
+positional arguments:
+  source                Source paths
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --force               Use this flag to prepare the manifest file for video data
+                        if by default the video does not meet the requirements
+                        and a manifest file is not prepared
+  --output-dir OUTPUT_DIR
+                        Directory where the manifest file will be saved
+```
+
+### Use the script from a Docker image
+
+This is the recommended way to use the tool.
+
+The script can be used from the `cvat/server` image:
+
+```bash
+docker run -it --rm -u "$(id -u)":"$(id -g)" \
+  -v "${PWD}":"/local" \
+  --entrypoint python3 \
+  cvat/server \
+  utils/dataset_manifest/create.py --output-dir /local /local/<path/to/sources>
+```
+
+Make sure to adapt the command to your file locations.
+
+### Use the script directly
+
+#### Ubuntu 20.04
 
 Install dependencies:
 
@@ -38,72 +98,102 @@ Create an environment and install the necessary python modules:
 python3 -m venv .env
 . .env/bin/activate
 pip install -U pip
-pip install -r requirements.txt
-```
-
-### Using
-
-```bash
-usage: python create.py [-h] [--force] [--output-dir .] source
-
-positional arguments:
-  source                Source paths
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --force               Use this flag to prepare the manifest file for video data if by default the video does not meet the requirements
-                        and a manifest file is not prepared
-  --output-dir OUTPUT_DIR
-                        Directory where the manifest file will be saved
+pip install -r utils/dataset_manifest/requirements.txt
 ```
 
-### Alternative way to use with cvat/server
-
-```bash
-docker run -it -u root --entrypoint bash -v /path/to/host/data/:/path/inside/container/:rw cvat/server -c "pip3 install -r utils/dataset_manifest/requirements.txt && python3 utils/dataset_manifest/create.py --output-dir /path/to/manifest/directory/ /path/to/data/"
-```
+> Please note that if used with video this way, the results may be different from what
+would the server decode. It is related to the ffmpeg library version. For this reason,
+using the Docker-based version of the tool is recommended.
 
-### Examples of using
+### Examples
 
 Create a dataset manifest in the current directory with video which contains enough keyframes:
 
 ```bash
-python create.py ~/Documents/video.mp4
+python utils/dataset_manifest/create.py ~/Documents/video.mp4
 ```
 
 Create a dataset manifest with video which does not contain enough keyframes:
 
 ```bash
-python create.py --force --output-dir ~/Documents ~/Documents/video.mp4
+python utils/dataset_manifest/create.py --force --output-dir ~/Documents ~/Documents/video.mp4
 ```
 
 Create a dataset manifest with images:
 
 ```bash
-python create.py --output-dir ~/Documents ~/Documents/images/
+python utils/dataset_manifest/create.py --output-dir ~/Documents ~/Documents/images/
 ```
 
 Create a dataset manifest with pattern (may be used `*`, `?`, `[]`):
 
 ```bash
-python create.py --output-dir ~/Documents "/home/${USER}/Documents/**/image*.jpeg"
+python utils/dataset_manifest/create.py --output-dir ~/Documents "/home/${USER}/Documents/**/image*.jpeg"
 ```
 
-Create a dataset manifest with `cvat/server`:
+Create a dataset manifest using Docker image:
 
 ```bash
-docker run -it --entrypoint python3 -v ~/Documents/data/:${HOME}/manifest/:rw cvat/server
-utils/dataset_manifest/create.py --output-dir ~/manifest/ ~/manifest/images/
+docker run -it --rm -u "$(id -u)":"$(id -g)" \
+  -v ~/Documents/data/:${HOME}/manifest/:rw \
+  --entrypoint '/usr/bin/bash' \
+  cvat/server \
+  utils/dataset_manifest/create.py --output-dir ~/manifest/ ~/manifest/images/
 ```
 
-### Examples of generated `manifest.jsonl` files
+### File format
+
+The dataset manifest files are text files in JSONL format. These files have 2 sub-formats:
+_for video_ and _for images and 3d data_.
+
+> Each top-level entry enclosed in curly braces must use 1 string, no empty strings is allowed.
+> The formatting in the descriptions below is only for demonstration.
+
+#### Dataset manifest for video
 
-A manifest file contains some intuitive information and some specific like:
+The file describes a single video.
 
 `pts` - time at which the frame should be shown to the user
-`checksum` - `md5` hash sum for the specific image/frame
+`checksum` - `md5` hash sum for the specific image/frame decoded
+
+```json
+{ "version": <string, version id> }
+{ "type": "video" }
+{ "properties": {
+  "name": <string, filename>,
+  "resolution": [<int, width>, <int, height>],
+  "length": <int, frame count>
+}}
+{
+  "number": <int, frame number>,
+  "pts": <int, frame pts>,
+  "checksum": <string, md5 frame hash>
+} (repeatable)
+```
+
+#### Dataset manifest for images and other data types
+
+The file describes an ordered set of images and 3d point clouds.
+
+`name` - file basename and leading directories from the dataset root
+`checksum` - `md5` hash sum for the specific image/frame decoded
+
+```json
+{ "version": <string, version id> }
+{ "type": "images" }
+{
+  "name": <string, image filename>,
+  "extension": <string, . + file extension>,
+  "width": <int, width>,
+  "height": <int, height>,
+  "meta": <dict, optional>,
+  "checksum": <string, md5 hash, optional>
+} (repeatable)
+```
+
+### Example files
 
-#### For a video
+#### Manifest for a video
 
 ```json
 {"version":"1.0"}
@@ -117,7 +207,7 @@ A manifest file contains some intuitive information and some specific like:
 {"number":675,"pts":2430000,"checksum":"0e72faf67e5218c70b506445ac91cdd7"}
 ```
 
-#### For a dataset with images
+#### Manifest for a dataset with images
 
 ```json
 {"version":"1.0"}
diff --git a/utils/dataset_manifest/create.py b/utils/dataset_manifest/create.py
old mode 100644
new mode 100755
index 36b9f044..e52c4e6a
--- a/utils/dataset_manifest/create.py
+++ b/utils/dataset_manifest/create.py
@@ -1,6 +1,10 @@
+#!/usr/bin/env python3
+
 # Copyright (C) 2021-2022 Intel Corporation
+# Copyright (C) 2022 CVAT.ai Corporation
 #
 # SPDX-License-Identifier: MIT
+
 import argparse
 import os
 import sys
@@ -89,6 +93,7 @@ def main():
             sys.exit(str(ex))
 
     print('The manifest file has been prepared')
+
 if __name__ == "__main__":
     base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
     sys.path.append(base_dir)
diff --git a/utils/dataset_manifest/requirements.txt b/utils/dataset_manifest/requirements.txt
index 060bf8a6..ddfad5e7 100644
--- a/utils/dataset_manifest/requirements.txt
+++ b/utils/dataset_manifest/requirements.txt
@@ -1,5 +1,5 @@
-av==9.2.0 --no-binary=av
-opencv-python-headless==4.4.0.42
+av==9.2.0 --no-binary=av  # Pinned for the whole CVAT
+opencv-python-headless>=4.4.0.42
 Pillow==9.3.0
-tqdm==4.58.0
-natsort==8.0.0
+tqdm>=4.58.0
+natsort>=8.0.0