Before a neural network can recognize objects in an image, it must first be “shown” what is there. That is the purpose of annotation.

Image annotation is an integral part of the development of artificial intelligence and one of the main tasks in computer vision. Annotated images are needed as training data for neural networks: object recognition in images lets computers “see” the world around them much as humans do.


Annotation tools allow you to add semantic markup to documents or, more generally, to resources. Common annotation tools typically provide domain-independent annotation support and are designed to meet general requirements such as ease of use and efficiency. [1] Such tools should support information retrieval, ontology and knowledge management, software interfaces, and a repository and user interface for ontology and knowledge-base editors.


The most basic annotation tools let users create annotations manually. These are simple text-annotation tools that nonetheless provide some support for ontologies.

What’s a data annotation tool?

A data annotation tool is a cloud-based, on-premise, or containerized software solution that can be used to annotate production-grade training data for machine learning. While some organizations take a do-it-yourself approach and build their own tools, there are many data annotation tools available via open source or freeware.

They are also offered commercially, for lease and purchase. Data annotation tools are generally designed to be used with specific types of data, such as image, video, text, audio, spreadsheet, or sensor data. They also offer different deployment models, including on-premise, container, SaaS (cloud), and Kubernetes.

Audio annotation is done for all types of speech and audible sound, preparing it for use in natural language processing. Cogito provides high-quality audio annotation services with a high level of accuracy for each audio file. Main tasks:

  • Audio Transcription
  • Topic Classification
  • Speaker Identification
  • Emotion Analysis (sentiment)
  • Voice Activity Detection (VAD).
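To make the last task above concrete, here is a minimal sketch of energy-based voice activity detection. It is an illustration only, not the method any of these services actually use; the frame length and threshold values are assumptions chosen for the example.

```python
import numpy as np

def detect_voice_activity(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Mark each frame as speech (True) or silence (False)
    based on its root-mean-square energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(rms > threshold)
    return flags

# Synthetic signal: 1 s of near-silence followed by 1 s of a loud 440 Hz tone
sr = 16000
silence = np.random.normal(0, 0.001, sr)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
flags = detect_voice_activity(np.concatenate([silence, tone]), sr)
```

Real VAD systems use far more robust features than raw energy, but the output shape is the same: a per-frame speech/non-speech label that annotators then verify.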

Text annotation is one of the most common types.
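To show what a text annotation actually looks like, here is a hypothetical record in the span-based layout many text-annotation tools use: labels are stored as character offsets into the raw text. The sentence, labels, and offsets are invented for illustration.

```python
# A hypothetical text-annotation record: each labeled entity is a
# (start, end) character span into the raw text plus a label.
text = "Intel developed CVAT in 2018."
annotations = [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 16, "end": 20, "label": "PRODUCT"},
    {"start": 24, "end": 28, "label": "DATE"},
]

def extract_spans(text, annotations):
    """Return the labeled surface strings for each annotated span."""
    return [(text[a["start"]:a["end"]], a["label"]) for a in annotations]

print(extract_spans(text, annotations))
```

Storing offsets rather than the substrings themselves keeps the annotation valid even if labels overlap or the same word appears twice.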



We researched some other annotation tools and outlined the pros and cons of each one. This will hopefully shine some light on your decision-making process.


LabelImg

A free, open-source graphical image annotation tool written in Python that is used to select objects in an image. Annotations can be saved as XML files in PASCAL VOC format or as text files in YOLO format. LabelImg lets you draw bounding boxes around objects in a Qt GUI.
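To give a feel for the PASCAL VOC output LabelImg produces, here is a sketch that parses a hypothetical annotation file with Python's standard library. The file contents (image name, label, coordinates) are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical LabelImg-style PASCAL VOC annotation for one image.
VOC_XML = """
<annotation>
  <filename>cat.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>48</xmin><ymin>60</ymin><xmax>320</xmax><ymax>410</ymax>
    </bndbox>
  </object>
</annotation>
"""

def read_voc_boxes(xml_text):
    """Extract (label, xmin, ymin, xmax, ymax) tuples from a PASCAL VOC file."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

print(read_voc_boxes(VOC_XML))
```

Each `<object>` element holds one bounding box, so a file with several annotated objects yields one tuple per box.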


CVAT (Computer Vision Annotation Tool)

Developed by researchers at Intel, CVAT is an open-source annotation tool that works for both images and videos. It is a browser-based application that works only in Google Chrome. It is relatively easy to deploy on a local network using Docker. [2]



Labelbox

Labelbox has a slick user interface and a ton of functionality. [3] In their own words, they are a “data-labeling and training-data management platform”. On top of their computer vision functionality, they also offer text classification. Their software is offered in three ways:

  1. SaaS web-based
  2. Hybrid on-premise: lets you host your data on your own servers, inaccessible to Labelbox, while generated assets are stored on Labelbox servers.
  3. Full on-premise: requires coordinating with their engineering team to deploy the software fully on site.

VoTT (Visual Object Tagging Tool)

An open-source annotation and labeling tool for videos and images. Its UI feels a little dated, and getting the hang of it takes longer than it should; once you figure it out, though, the rest goes rather smoothly. It can run locally, with macOS, Windows, and Linux support, or be accessed as a web app compatible with most modern browsers. I tested the macOS version; installation was quick and simple. You work with VoTT through what it calls “projects”. [4]


Annotating data in images or videos is necessary to “feed” deep-learning models information about what is depicted in a picture. That is how computers and machines learn to detect and recognize objects better.

