Text Annotation

To train a machine learning model to accurately recognise and classify text data, the model must be provided with a labelled dataset. Text annotation is the process of manually labelling the data with tags or categories that indicate the meaning or context of the text.

The outcome of manual text annotation is a labelled dataset, in which every piece of text has been reviewed and tagged by a human, ready for training machine learning models on natural language processing tasks.
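As a minimal illustration (the texts and labels below are invented examples, not from a real annotation project), a labelled dataset for a classification task can be as simple as text–label pairs:

```python
# A toy labelled dataset for sentiment classification.
# Each text has been paired with a manually assigned label.
labelled_data = [
    ("The service was quick and friendly", "positive"),
    ("My order arrived two weeks late", "negative"),
    ("The manual does not mention this setting", "neutral"),
]

# Models are trained on the texts (inputs) and labels (targets).
texts = [text for text, label in labelled_data]
labels = [label for text, label in labelled_data]
```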

Removing all identifying information from court documents by hand is vital, yet very demanding. We built an automatic system that uses statistical and neural network models to detect data such as names, educational institutions, birth dates, and addresses in these documents. Detected information that has not been blacklisted is automatically replaced with randomised initials or placeholder strings, producing an anonymised document.
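A highly simplified sketch of the replacement step is shown below. In the real system the entity spans come from the statistical and neural models mentioned above; here they are hard-coded, and the example document, entity types, and placeholder format are invented for illustration:

```python
import random
import string

def anonymise(text, entities):
    """Replace detected entity spans with placeholders.

    `entities` is a list of (surface_form, entity_type) pairs, as would be
    produced by an upstream entity detection model (hard-coded in this sketch).
    """
    replacements = {}
    for surface, etype in entities:
        if etype == "PERSON":
            # Randomised initials, e.g. "K. T."
            replacements[surface] = ". ".join(random.choices(string.ascii_uppercase, k=2)) + "."
        else:
            # Generic placeholder for other entity types.
            replacements[surface] = f"<{etype}>"
    for surface, placeholder in replacements.items():
        text = text.replace(surface, placeholder)
    return text

doc = "Mari Maasikas was born on 01.02.1990 in Tallinn."
entities = [("Mari Maasikas", "PERSON"), ("01.02.1990", "DATE"), ("Tallinn", "LOCATION")]
anonymised = anonymise(doc, entities)
```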


AI model development and deployment

With previously annotated data, the process of AI model development involves preparing the data, identifying relevant features, selecting an appropriate technical approach, training the model and evaluating its performance.
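The steps above can be sketched end to end. This toy example uses invented data, a bag-of-words feature extractor, and a nearest-centroid-style classifier chosen purely for brevity (not TEXTA's actual models), to show data preparation, feature extraction, training, and evaluation:

```python
from collections import Counter

# 1. Prepare annotated data (invented toy examples).
train = [("invoice payment due", "finance"), ("court hearing scheduled", "legal"),
         ("tax refund processed", "finance"), ("appeal filed by lawyer", "legal")]
test = [("payment of the invoice", "finance"), ("hearing in court", "legal")]

# 2. Feature extraction: bag-of-words counts.
def features(text):
    return Counter(text.lower().split())

# 3. Training: accumulate word counts per class (a "centroid" per label).
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(features(text))

# 4. Prediction: pick the label whose centroid overlaps the text the most.
def predict(text):
    feats = features(text)
    return max(centroids, key=lambda lbl: sum(min(feats[w], centroids[lbl][w]) for w in feats))

# 5. Evaluation: accuracy on held-out data.
accuracy = sum(predict(t) == lbl for t, lbl in test) / len(test)
```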

The main benefit of our AI models is their ability to handle complex patterns and adapt to variations in data. This understanding of context makes it possible to solve problems across many text domains. Once trained, the models can make accurate predictions or decisions on new, previously unseen data. Depending on the requirements, models can be hosted by TEXTA or deployed on the client's own infrastructure.

The National Library of Estonia was seeking a solution that would automatically tag articles with a suitable set of keywords and thereby make navigating large volumes of data easier for their users. To accomplish this goal, we experimented with various methods: simple rule-based ones, more sophisticated machine learning algorithms and, in some cases, a combination of both. The final result was a prototype showcasing the outcome of the different keyword assignment methods.
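A rule-based keyword assigner can be as simple as matching a curated vocabulary of trigger phrases against an article. The vocabulary and article below are invented for illustration; the National Library prototype combined such rules with machine learning models:

```python
# A toy vocabulary mapping trigger phrases to keywords (invented).
vocabulary = {
    "folk song": "folklore",
    "parliament": "politics",
    "exhibition": "art",
}

def assign_keywords(article):
    """Rule-based tagging: a keyword is assigned when its trigger phrase occurs."""
    text = article.lower()
    return sorted({kw for phrase, kw in vocabulary.items() if phrase in text})

article = "The exhibition opens with a recorded folk song from 1912."
keywords = assign_keywords(article)  # ['art', 'folklore']
```

Rules like these are transparent and easy for librarians to maintain, while machine learning models can cover articles the vocabulary misses, which is why combining both approaches can work well.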


Text Forensics

Text forensics combines various techniques and tools to explore and analyse textual datasets and to identify clues or patterns. The outcome provides insights and evidence that can be used for purposes such as fraud detection, legal investigations, reputation management, and market research.

Our tools provide the full text investigation experience:

  • extracting textual content from a multitude of document and image formats,
  • automatically identifying entities such as persons, organisations, and locations,
  • searching for hidden clues in data that is usually locked away as "unstructured".

Journalists discovered that personal and extremely sensitive data about students had somehow been made public by accident. The problem was that the client had no idea how many such documents had been leaked, or how to find them among the millions of documents in the system. With the help of TEXTA Toolkit we conducted the audit in approximately two weeks and handed over a list of over 1,500 documents that should never have been publicly visible.


Data Science as a Service

For companies seeking expert data analysis and insights without hiring full-time data scientists, or needing extra hands for their ML sprint, TEXTA offers Data Science as a Service: a convenient and cost-effective solution. You can access our team of skilled data scientists on demand, paying only for the hours worked.


We build complex systems using LLMs

Data scientists at TEXTA are fluent in recent technologies such as large language models (LLMs), which have revolutionised the field of natural language understanding. We design and build complex systems using LLMs to solve tasks like information retrieval and question answering in industrial settings.
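In retrieval-augmented question answering, a retriever first selects the passages most relevant to a question, and only those passages are handed to an LLM as context for generating the answer. The sketch below uses a toy word-overlap retriever with invented passages; production systems typically use embedding-based search instead:

```python
def retrieve(question, passages, top_k=1):
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k]

passages = [
    "TEXTA Toolkit supports entity extraction from documents.",
    "The library prototype assigned keywords to articles.",
    "Court documents were anonymised automatically.",
]
best = retrieve("How were keywords assigned to articles?", passages)
# `best` would then be passed to an LLM as context for answering the question.
```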