[Datumaro] Update docs (#2125)

* Update docs, add type hints, rename extract

* Add developer guide

* Update license headers, add license text

* Update developer_guide.md

Co-authored-by: Nikita Manovich <nikita.manovich@intel.com>
Maxim Zhiltsov 6 years ago committed by GitHub
parent 87d76c9a1a
commit 4dbfa3bfdf

@ -72,124 +72,4 @@ python manage.py test datumaro/
## Design and code structure
- [Design document](docs/design.md)
### Command-line
Use [Docker](https://www.docker.com/) as an example. Basically,
the interface is divided into contexts and single commands.
Contexts are semantically grouped commands,
related to a single topic or target. Single commands are handy, shorter
alternatives for the most frequently used commands, as well as special commands
that are hard to put into any specific context.
![cli-design-image](docs/images/cli_design.png)
- The diagram above was created with [FreeMind](http://freemind.sourceforge.net/wiki/index.php/Main_Page)
The Model-View-ViewModel (MVVM) UI pattern is used.
![mvvm-image](docs/images/mvvm.png)
### Datumaro project and environment structure
<!--lint disable fenced-code-flag-->
```
├── [datumaro module]
└── [project folder]
├── .datumaro/
| ├── config.yml
│   ├── .git/
│   ├── models/
│   └── plugins/
│   ├── plugin1/
│   | ├── file1.py
│   | └── file2.py
│   ├── plugin2.py
│   ├── custom_extractor1.py
│   └── ...
├── dataset/
└── sources/
├── source1
└── ...
```
<!--lint enable fenced-code-flag-->
### Plugins
Plugins are optional components that extend the project. In Datumaro there are
several types of plugins, which include:
- `extractor` - produces dataset items from data source
- `importer` - recognizes dataset type and creates project
- `converter` - exports dataset to a specific format
- `transformation` - modifies dataset items or other properties
- `launcher` - executes models
Plugins reside in plugin directories:
- `datumaro/plugins` for builtin components
- `<project_dir>/.datumaro/plugins` for project-specific components
A plugin is a Python file or package with any name, which exports some symbols.
To export a symbol, add it to the `exports` list of the module like this:
``` python
class MyComponent1: ...
class MyComponent2: ...
exports = [MyComponent1, MyComponent2]
```
or inherit it from one of the special classes:
``` python
from datumaro.components.extractor import Importer, SourceExtractor, Transform
from datumaro.components.launcher import Launcher
from datumaro.components.converter import Converter
```
There is an additional class to modify plugin appearance in the command line:
``` python
from datumaro.components.cli_plugin import CliPlugin
```
Plugin example:
<!--lint disable fenced-code-flag-->
```
datumaro/plugins/
- my_plugin1/file1.py
- my_plugin1/file2.py
- my_plugin2.py
```
<!--lint enable fenced-code-flag-->
`my_plugin1/file2.py` contents:
``` python
from datumaro.components.extractor import Transform, CliPlugin
from .file1 import something, useful
class MyTransform(Transform, CliPlugin):
    NAME = "custom_name"
    """
    Some description.
    """

    @classmethod
    def build_cmdline_parser(cls, **kwargs):
        parser = super().build_cmdline_parser(**kwargs)
        parser.add_argument('-q', help="Some help")
        return parser
    ...
```
`my_plugin2.py` contents:
``` python
from datumaro.components.extractor import SourceExtractor
class MyFormat: ...
class MyFormatExtractor(SourceExtractor): ...
exports = [MyFormat] # explicit exports declaration
# MyFormatExtractor won't be exported
```
- [Developer guide](docs/developer_guide.md)

@ -0,0 +1,22 @@
MIT License
Copyright (C) 2019-2020 Intel Corporation
 
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom
the Software is furnished to do so, subject to the following conditions:
 
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
OR OTHER DEALINGS IN THE SOFTWARE.
 

@ -1,4 +1,4 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,4 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
@ -278,7 +278,7 @@ def build_export_parser(parser_ctor=argparse.ArgumentParser):
parser = parser_ctor(help="Export project",
description="""
Exports the project dataset in some format. Optionally, a filter
can be passed, check 'extract' command description for more info.
can be passed, check 'filter' command description for more info.
Each dataset format has its own options, which
are passed after '--' separator (see examples), pass '-- -h'
for more info. If not stated otherwise, by default
@ -362,7 +362,7 @@ def export_command(args):
return 0
def build_extract_parser(parser_ctor=argparse.ArgumentParser):
def build_filter_parser(parser_ctor=argparse.ArgumentParser):
parser = parser_ctor(help="Extract subproject",
description="""
Extracts a subproject that contains only items matching filter.
@ -414,11 +414,11 @@ def build_extract_parser(parser_ctor=argparse.ArgumentParser):
help="Overwrite existing files in the save directory")
parser.add_argument('-p', '--project', dest='project_dir', default='.',
help="Directory of the project to operate on (default: current dir)")
parser.set_defaults(command=extract_command)
parser.set_defaults(command=filter_command)
return parser
def extract_command(args):
def filter_command(args):
project = load_project(args.project_dir)
if not args.dry_run:
@ -437,7 +437,7 @@ def extract_command(args):
filter_args = FilterModes.make_filter_args(args.mode)
if args.dry_run:
dataset = dataset.extract(filter_expr=args.filter, **filter_args)
dataset = dataset.filter(expr=args.filter, **filter_args)
for item in dataset:
encoded_item = DatasetItemEncoder.encode(item, dataset.categories())
xml_item = DatasetItemEncoder.to_string(encoded_item)
@ -447,8 +447,7 @@ def extract_command(args):
if not args.filter:
raise CliException("Expected a filter expression ('-e' argument)")
dataset.extract_project(save_dir=dst_dir, filter_expr=args.filter,
**filter_args)
dataset.filter_project(save_dir=dst_dir, expr=args.filter, **filter_args)
log.info("Subproject has been extracted to '%s'" % dst_dir)
@ -816,7 +815,7 @@ def build_parser(parser_ctor=argparse.ArgumentParser):
add_subparser(subparsers, 'create', build_create_parser)
add_subparser(subparsers, 'import', build_import_parser)
add_subparser(subparsers, 'export', build_export_parser)
add_subparser(subparsers, 'extract', build_extract_parser)
add_subparser(subparsers, 'filter', build_filter_parser)
add_subparser(subparsers, 'merge', build_merge_parser)
add_subparser(subparsers, 'diff', build_diff_parser)
add_subparser(subparsers, 'ediff', build_ediff_parser)

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
@ -27,26 +27,25 @@ AnnotationType = Enum('AnnotationType',
_COORDINATE_ROUNDING_DIGITS = 2
@attrs
@attrs(kw_only=True)
class Annotation:
id = attrib(default=0, validator=default_if_none(int), kw_only=True)
attributes = attrib(factory=dict, validator=default_if_none(dict), kw_only=True)
group = attrib(default=0, validator=default_if_none(int), kw_only=True)
id = attrib(default=0, validator=default_if_none(int))
attributes = attrib(factory=dict, validator=default_if_none(dict))
group = attrib(default=0, validator=default_if_none(int))
def __attrs_post_init__(self):
assert isinstance(self.type, AnnotationType)
@property
def type(self):
def type(self) -> AnnotationType:
return self._type # must be set in subclasses
def wrap(item, **kwargs):
return attr.evolve(item, **kwargs)
def wrap(self, **kwargs):
return attr.evolve(self, **kwargs)
@attrs
@attrs(kw_only=True)
class Categories:
attributes = attrib(factory=set, validator=default_if_none(set),
kw_only=True, eq=False)
attributes = attrib(factory=set, validator=default_if_none(set), eq=False)
@attrs
class LabelCategories(Categories):
@ -92,7 +91,7 @@ class LabelCategories(Categories):
indices[item.name] = index
self._indices = indices
def add(self, name, parent=None, attributes=None):
def add(self, name: str, parent: str = None, attributes: dict = None):
assert name not in self._indices, name
if attributes is None:
attributes = set()
@ -109,7 +108,7 @@ class LabelCategories(Categories):
self._indices[name] = index
return index
def find(self, name):
def find(self, name: str):
index = self._indices.get(name)
if index is not None:
return index, self.items[index]
@ -601,7 +600,7 @@ class SourceExtractor(Extractor):
def get_subset(self, name):
if name != self._subset:
return None
raise Exception("Unknown subset '%s' requested" % name)
return self
class Importer:
@ -629,5 +628,5 @@ class Transform(Extractor):
def categories(self):
return self._extractor.categories()
def transform_item(self, item):
def transform_item(self, item: DatasetItem) -> DatasetItem:
raise NotImplementedError()

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,12 +1,12 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
from collections import OrderedDict, defaultdict
from functools import reduce
import git
from glob import glob
from typing import Iterable, Union, Dict, List
import git
import importlib
import inspect
import logging as log
@ -19,7 +19,7 @@ from datumaro.components.config import Config, DEFAULT_FORMAT
from datumaro.components.config_model import (Model, Source,
PROJECT_DEFAULT_CONFIG, PROJECT_SCHEMA)
from datumaro.components.extractor import Extractor, LabelCategories,\
AnnotationType
AnnotationType, DatasetItem
from datumaro.components.launcher import ModelTransform
from datumaro.components.dataset_filter import \
XPathDatasetFilter, XPathAnnotationsFilter
@ -304,50 +304,40 @@ class Environment:
self.models.unregister(name)
class Subset(Extractor):
def __init__(self, parent):
self._parent = parent
self.items = OrderedDict()
class Dataset(Extractor):
class Subset(Extractor):
def __init__(self, parent):
self.parent = parent
self.items = OrderedDict()
def __iter__(self):
for item in self.items.values():
yield item
def __iter__(self):
yield from self.items.values()
def __len__(self):
return len(self.items)
def __len__(self):
return len(self.items)
def categories(self):
return self._parent.categories()
def categories(self):
return self.parent.categories()
class Dataset(Extractor):
@classmethod
def from_iterable(cls, iterable, categories=None):
"""Generation of Dataset from iterable object
Args:
iterable: Iterable object contains DatasetItems
categories (dict, optional): You can pass dict of categories or
you can pass list of names. It'll interpreted as list of names of
LabelCategories. Defaults to {}.
Returns:
Dataset: Dataset object
"""
def from_iterable(cls, iterable: Iterable[DatasetItem],
categories: Union[Dict, List[str]] = None):
if isinstance(categories, list):
categories = {AnnotationType.label : LabelCategories.from_iterable(categories)}
categories = { AnnotationType.label:
LabelCategories.from_iterable(categories)
}
if not categories:
categories = {}
class tmpExtractor(Extractor):
class _extractor(Extractor):
def __iter__(self):
return iter(iterable)
def categories(self):
return categories
return cls.from_extractors(tmpExtractor())
return cls.from_extractors(_extractor())
@classmethod
def from_extractors(cls, *sources):
@ -355,7 +345,7 @@ class Dataset(Extractor):
dataset = Dataset(categories=categories)
# merge items
subsets = defaultdict(lambda: Subset(dataset))
subsets = defaultdict(lambda: cls.Subset(dataset))
for source in sources:
for item in source:
existing_item = subsets[item.subset].items.get(item.id)
@ -416,20 +406,19 @@ class Dataset(Extractor):
if subset is None:
subset = item.subset
item = item.wrap(path=None, annotations=item.annotations)
if item.subset not in self._subsets:
self._subsets[item.subset] = Subset(self)
item = item.wrap(id=item_id, subset=subset, path=None)
if subset not in self._subsets:
self._subsets[subset] = self.Subset(self)
self._subsets[subset].items[item_id] = item
self._length = None
return item
def extract(self, filter_expr, filter_annotations=False, remove_empty=False):
def filter(self, expr, filter_annotations=False, remove_empty=False):
if filter_annotations:
return self.transform(XPathAnnotationsFilter, filter_expr,
remove_empty)
return self.transform(XPathAnnotationsFilter, expr, remove_empty)
else:
return self.transform(XPathDatasetFilter, filter_expr)
return self.transform(XPathDatasetFilter, expr)
def update(self, items):
for item in items:
@ -500,17 +489,14 @@ class ProjectDataset(Dataset):
sources = {}
for s_name, source in config.sources.items():
s_format = source.format
if not s_format:
s_format = env.PROJECT_EXTRACTOR_NAME
s_format = source.format or env.PROJECT_EXTRACTOR_NAME
options = {}
options.update(source.options)
url = source.url
if not source.url:
url = osp.join(config.project_dir, config.sources_dir, s_name)
sources[s_name] = env.make_extractor(s_format,
url, **options)
sources[s_name] = env.make_extractor(s_format, url, **options)
self._sources = sources
own_source = None
@ -531,7 +517,7 @@ class ProjectDataset(Dataset):
self._categories = categories
# merge items
subsets = defaultdict(lambda: Subset(self))
subsets = defaultdict(lambda: self.Subset(self))
for source_name, source in self._sources.items():
log.debug("Loading '%s' source contents..." % source_name)
for item in source:
@ -548,11 +534,8 @@ class ProjectDataset(Dataset):
# NOTE: consider imported sources as our own dataset
path = None
else:
path = item.path
if path is None:
path = []
path = [source_name] + path
item = item.wrap(path=path, annotations=item.annotations)
path = [source_name] + (item.path or [])
item = item.wrap(path=path)
subsets[item.subset].items[item.id] = item
@ -563,8 +546,7 @@ class ProjectDataset(Dataset):
existing_item = subsets[item.subset].items.get(item.id)
if existing_item is not None:
item = item.wrap(path=None,
image=self._merge_images(existing_item, item),
annotations=item.annotations)
image=self._merge_images(existing_item, item))
subsets[item.subset].items[item.id] = item
@ -590,6 +572,7 @@ class ProjectDataset(Dataset):
def put(self, item, item_id=None, subset=None, path=None):
if path is None:
path = item.path
if path:
source = path[0]
rest_path = path[1:]
@ -602,9 +585,9 @@ class ProjectDataset(Dataset):
if subset is None:
subset = item.subset
item = item.wrap(path=path, annotations=item.annotations)
if item.subset not in self._subsets:
self._subsets[item.subset] = Subset(self)
item = item.wrap(path=path)
if subset not in self._subsets:
self._subsets[subset] = self.Subset(self)
self._subsets[subset].items[item_id] = item
self._length = None
@ -713,7 +696,7 @@ class ProjectDataset(Dataset):
# NOTE: probably this function should be in the ViewModel layer
dataset = self
if filter_expr:
dataset = dataset.extract(filter_expr,
dataset = dataset.filter(filter_expr,
filter_annotations=filter_annotations,
remove_empty=remove_empty)
@ -727,15 +710,15 @@ class ProjectDataset(Dataset):
shutil.rmtree(save_dir)
raise
def extract_project(self, filter_expr, filter_annotations=False,
def filter_project(self, filter_expr, filter_annotations=False,
save_dir=None, remove_empty=False):
# NOTE: probably this function should be in the ViewModel layer
filtered = self
dataset = self
if filter_expr:
filtered = self.extract(filter_expr,
dataset = dataset.filter(filter_expr,
filter_annotations=filter_annotations,
remove_empty=remove_empty)
self._save_branch_project(filtered, save_dir=save_dir)
self._save_branch_project(dataset, save_dir=save_dir)
class Project:
@classmethod

@ -0,0 +1,4 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,3 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,3 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
@ -20,7 +19,7 @@ from datumaro.components.extractor import (SourceExtractor,
from datumaro.components.extractor import Importer
from datumaro.components.converter import Converter
from datumaro.util import cast
from datumaro.util.image import Image, save_image
from datumaro.util.image import Image
MotLabel = Enum('MotLabel', [

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,3 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,3 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,4 +1,3 @@
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,3 +1,7 @@
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
from collections import OrderedDict

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@ -0,0 +1,200 @@
## Basics
The central part of the library is the `Dataset` class, which allows iterating
over its elements. `DatasetItem`, an element of a dataset, represents a single
dataset entry with annotations: an image, a video sequence, an audio track, etc.
An item can contain only the data and meta information, only the annotations,
or both.
Basic library usage and data flow:
```lang-none
Extractors -> Dataset -> Converter
                |
                Filtration
                Transformations
                Statistics
                Merging
                Inference
                Quality Checking
                Comparison
                ...
```
1. Data is read (or produced) by one or many `Extractor`s and merged
into a `Dataset`
1. A dataset is processed in some way
1. A dataset is saved with a `Converter`
Datumaro has a number of dataset and annotation features:
- iteration over dataset elements
- filtering of datasets and annotations by custom criteria
- working with subsets (e.g. `train`, `val`, `test`)
- computing of dataset statistics
- comparison and merging of datasets
- various annotation operations
```python
from datumaro.components.project import Environment
# Import and save a dataset
env = Environment()
dataset = env.make_importer('voc')('src/dir').make_dataset()
env.converters.get('coco').convert(dataset, save_dir='dst/dir')
```
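A minimal sketch of these operations (the item ids, label names and filter
expression below are made up for illustration):
```python
from datumaro.components.extractor import DatasetItem, Label
from datumaro.components.project import Dataset

# Build a small in-memory dataset; 'categories' can be a plain list of label names
dataset = Dataset.from_iterable([
    DatasetItem(id='0', annotations=[Label(label=0)]),
    DatasetItem(id='1', annotations=[Label(label=1)]),
], categories=['cat', 'dog'])

# Iterate over dataset elements
for item in dataset:
    print(item.id, len(item.annotations))

# Keep only the items matching an XPath expression
filtered = dataset.filter('/item[id < 1]')
```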
## Library contents
### Dataset Formats
Dataset reading is supported by `Extractor`s and `Importer`s:
- An `Extractor` produces a list of `DatasetItem`s corresponding
to the dataset.
- An `Importer` creates a project from the data source location.
It is possible to add custom Extractors and Importers. To do this, you need
to put `Extractor` and `Importer` implementations in a plugin directory.
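A rough sketch of a custom extractor is shown below. The class name, the
`path` argument and the label names are hypothetical; a real extractor would
parse the files found at the source location instead of yielding a hard-coded
item.
```python
from datumaro.components.extractor import (AnnotationType, DatasetItem,
    Extractor, Label, LabelCategories)

class MyFormatExtractor(Extractor):
    def __init__(self, path):
        super().__init__()
        self._path = path # location of the data source

    def __iter__(self):
        # Parse self._path and yield dataset items (hard-coded here)
        yield DatasetItem(id='sample1', annotations=[Label(label=0)])

    def categories(self):
        return {AnnotationType.label: LabelCategories.from_iterable(['cat'])}
```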
Dataset writing is supported by `Converter`s.
A Converter produces a dataset of a specific format from dataset items.
It is possible to add custom `Converter`s. To do this, you need to put a
Converter implementation script in a plugin directory.
### Dataset Conversions ("Transforms")
A `Transform` is a function for altering a dataset and producing a new one.
It can update dataset items, annotations, classes, and other properties.
A list of available transforms for dataset conversions can be extended by
adding a `Transform` implementation script into a plugin directory.
### Model launchers
A list of available launchers for model execution can be extended by
adding a `Launcher` implementation script into a plugin directory.
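The launcher interface is not covered in detail here; the snippet below is only
a hedged sketch. The `launch()` and `categories()` method names, the
batch-of-images input and the list-of-annotation-lists output are assumptions
about the `Launcher` base class rather than a verified API reference.
```python
from datumaro.components.extractor import AnnotationType, Label, LabelCategories
from datumaro.components.launcher import Launcher

class MyModelLauncher(Launcher):
    def launch(self, inputs):
        # 'inputs' is assumed to be a batch of images; return one list of
        # annotations per input image (a dummy prediction here)
        return [[Label(label=0)] for _ in inputs]

    def categories(self):
        # The labels the model can predict
        return {AnnotationType.label: LabelCategories.from_iterable(['cat'])}
```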
## Plugins
Datumaro comes with a number of built-in formats and other tools,
but it can also be extended by plugins. Plugins are optional components
whose dependencies are not installed by default.
In Datumaro there are several types of plugins, which include:
- `extractor` - produces dataset items from a data source
- `importer` - recognizes the dataset type and creates a project
- `converter` - exports a dataset to a specific format
- `transformation` - modifies dataset items or other properties
- `launcher` - executes models
A plugin is a regular Python module. It must be present in a plugin directory:
- `<project_dir>/.datumaro/plugins` for project-specific plugins
- `<datumaro_dir>/plugins` for global plugins
A plugin can be used either via the `Environment` class instance,
or by importing it as a regular module:
```python
from datumaro.components.project import Environment, Project
from datumaro.plugins.yolo_format.converter import YoloConverter
# Import a dataset
dataset = Environment().make_importer('voc')(src_dir).make_dataset()
# Load an existing project, save the dataset in some project-specific format
project = Project.load('project/dir')
project.env.converters.get('custom_format').convert(dataset, save_dir=dst_dir)
# Save the dataset in some built-in format
Environment().converters.get('yolo').convert(dataset, save_dir=dst_dir)
YoloConverter.convert(dataset, save_dir=dst_dir)
```
### Writing a plugin
A plugin is a Python module with any name that exports some symbols.
To export a symbol, inherit it from one of the special classes:
```python
from datumaro.components.extractor import Importer, SourceExtractor, Transform
from datumaro.components.launcher import Launcher
from datumaro.components.converter import Converter
```
The `exports` list of the module can be used to override default behaviour:
```python
class MyComponent1: ...
class MyComponent2: ...
exports = [MyComponent2] # exports only MyComponent2
```
There is also an additional class to modify plugin appearance in the command line:
```python
from datumaro.components.cli_plugin import CliPlugin
```
#### Plugin example
<!--lint disable fenced-code-flag-->
```
datumaro/plugins/
- my_plugin1/file1.py
- my_plugin1/file2.py
- my_plugin2.py
```
<!--lint enable fenced-code-flag-->
`my_plugin1/file2.py` contents:
```python
from datumaro.components.extractor import Transform, CliPlugin
from .file1 import something, useful

class MyTransform(Transform, CliPlugin):
    """
    Some description. The text will be displayed in the command line output.
    """

    NAME = "custom_name" # could be generated automatically

    @classmethod
    def build_cmdline_parser(cls, **kwargs):
        parser = super().build_cmdline_parser(**kwargs)
        parser.add_argument('-q', help="Very useful parameter")
        return parser

    def __init__(self, extractor, q):
        super().__init__(extractor)
        self.q = q

    def transform_item(self, item):
        return item
```
`my_plugin2.py` contents:
```python
from datumaro.components.extractor import SourceExtractor
class MyFormat: ...
class MyFormatExtractor(SourceExtractor): ...
exports = [MyFormat] # explicit exports declaration
# MyFormatExtractor won't be exported
```
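Such a plugin can also be used by regular module importing, as noted above.
A brief sketch, assuming the plugin package is importable and using a
placeholder project path and `q` value:
```python
from datumaro.components.project import Environment, Project
from my_plugin1.file2 import MyTransform # assumes the plugin package is importable

project = Project.load('project/dir') # placeholder path
dataset = project.make_dataset()

# Wrap the dataset with the transform and save the result in a built-in format
transformed = MyTransform(dataset, q='some value')
Environment().converters.get('yolo').convert(transformed, save_dir='dst/dir')
```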
## Command-line
Basically, the interface is divided into contexts and single commands.
Contexts are semantically grouped commands related to a single topic or target.
Single commands are handy, shorter alternatives for the most frequently used
commands, as well as special commands that are hard to put into any specific
context. [Docker](https://www.docker.com/) is an example of a similar approach.
![cli-design-image](images/cli_design.png)
- The diagram above was created with [FreeMind](http://freemind.sourceforge.net/wiki/index.php/Main_Page)
The Model-View-ViewModel (MVVM) UI pattern is used.
![mvvm-image](images/mvvm.png)

@ -6,22 +6,23 @@
- [Interfaces](#interfaces)
- [Supported dataset formats and annotations](#supported-formats)
- [Command line workflow](#command-line-workflow)
- [Project structure](#project-structure)
- [Command reference](#command-reference)
- [Convert datasets](#convert-datasets)
- [Create a project](#create-project)
- [Create project](#create-project)
- [Add and remove data](#add-and-remove-data)
- [Import a project](#import-project)
- [Extract a subproject](#extract-subproject)
- [Import project](#import-project)
- [Filter project](#filter-project)
- [Update project (merge)](#update-project)
- [Merge projects](#merge-projects)
- [Export a project](#export-project)
- [Export project](#export-project)
- [Compare projects](#compare-projects)
- [Obtaining project info](#get-project-info)
- [Obtaining project statistics](#get-project-statistics)
- [Register a model](#register-model)
- [Register model](#register-model)
- [Run inference](#run-inference)
- [Run inference explanation](#explain-inference)
- [Transform a project](#transform-project)
- [Transform project](#transform-project)
- [Extending](#extending)
- [Links](#links)
@ -111,15 +112,39 @@ List of supported annotation types:
## Command line workflow
The key object is a project, so most CLI commands operate on projects. However, there
are few commands operating on datasets directly. A project is a combination of
a project's own dataset, a number of external data sources and an environment.
The key object is a project, so most CLI commands operate on projects.
However, there are a few commands that operate on datasets directly.
A project is a combination of a project's own dataset, a number of
external data sources and an environment.
An empty Project can be created with the `project create` command;
an existing dataset can be imported with the `project import` command.
A typical way to obtain projects is to export tasks from the CVAT UI.
If you want to interact with models, you need to add them to the project first.
### Project structure
<!--lint disable fenced-code-flag-->
```
└── project/
├── .datumaro/
| ├── config.yml
│   ├── .git/
│   ├── models/
│   └── plugins/
│   ├── plugin1/
│   | ├── file1.py
│   | └── file2.py
│   ├── plugin2.py
│   ├── custom_extractor1.py
│   └── ...
├── dataset/
└── sources/
├── source1
└── ...
```
<!--lint enable fenced-code-flag-->
## Command reference
> **Note**: command invocation syntax is subject to change,
@ -270,11 +295,11 @@ datum source add path <path/to/images/dir> -f image_dir
datum project export -f tf_detection_api
```
### Extract subproject
### Filter project
This command allows you to create a sub-Project from a Project. The new project
includes only the items satisfying some condition. [XPath](https://devhints.io/xpath)
is used as query format.
is used as a query format.
There are several filtering modes available (`-m/--mode` parameter).
Supported modes:
@ -290,38 +315,34 @@ returns `annotation` elements (see examples).
Usage:
``` bash
datum project extract --help
datum project filter --help
datum project extract \
datum project filter \
-p <project dir> \
-o <output dir> \
-e '<xpath filter expression>'
```
Example: extract a dataset with only the images whose `width` < `height`
``` bash
datum project extract \
datum project filter \
-p test_project \
-o test_project-extract \
-e '/item[image/width < image/height]'
```
Example: extract a dataset with only large annotations of class `cat` and any non-`persons`
``` bash
datum project extract \
datum project filter \
-p test_project \
-o test_project-extract \
--mode annotations -e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'
```
Example: extract a dataset with only occluded annotations, remove empty images
``` bash
datum project extract \
datum project filter \
-p test_project \
-o test_project-extract \
-m i+a -e '/item/annotation[occluded="True"]'
```
@ -362,7 +383,8 @@ Item representations are available with `--dry-run` parameter:
### Update project
This command updates items in a project from another one (check [Merge Projects](#merge-projects) for complex merging).
This command updates items in a project from another one
(check [Merge Projects](#merge-projects) for complex merging).
Usage:

@ -1,5 +1,5 @@
# Copyright (C) 2019 Intel Corporation
# Copyright (C) 2019-2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
@ -36,7 +36,7 @@ setuptools.setup(
version=find_version(),
author="Intel",
author_email="maxim.zhiltsov@intel.com",
description="Dataset Framework",
description="Dataset Management Framework (Datumaro)",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/opencv/cvat/datumaro",

@ -1,5 +1,5 @@
"""
Copyright (c) 2019 Intel Corporation
Copyright (C) 2019-2020 Intel Corporation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -250,7 +250,7 @@ class ProjectTest(TestCase):
project.env.extractors.register(e_type, TestExtractor)
project.add_source('source', { 'format': e_type })
dataset = project.make_dataset().extract('/item[id < 5]')
dataset = project.make_dataset().filter('/item[id < 5]')
self.assertEqual(5, len(dataset))
