
Natural Language Processing *

Computer analysis and synthesis of natural languages


Multilingual Text-to-Speech Models for Indic Languages

Machine learning *Natural Language Processing *Voice user interfaces *

In this article, we shall provide some background on how multilingual multi-speaker models work and test an Indic TTS model that supports 9 languages (Hindi, Malayalam, Manipuri, Bengali, Rajasthani, Tamil, Telugu, Gujarati, Kannada) and 17 speakers.

It seems a bit counter-intuitive at first that one model can support so many languages and speakers, given that each Indic language has its own alphabet, but we shall see how it was implemented.
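
One common way to reconcile different scripts (an assumption about how such models can be built, not a confirmed detail of this particular one) is to transliterate all inputs into a single common script before feeding them to the model. A minimal sketch using the indic_transliteration package:

# pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Map text from different native scripts into one common romanization (IAST),
# so a single model sees a unified alphabet regardless of the source language.
hindi = "नमस्ते"    # Devanagari script
telugu = "నమస్తే"   # Telugu script

print(transliterate(hindi, sanscript.DEVANAGARI, sanscript.IAST))
print(transliterate(telugu, sanscript.TELUGU, sanscript.IAST))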

Also, we shall list the specs of these models, such as the supported sampling rates, and try something cool: making speakers of different Indic languages speak Hindi. If you are a native speaker of any of these languages, please share your opinion on how these voices sound, both in their respective languages and in Hindi.

Read more
Total votes 2: ↑2 and ↓0 +2
Views 136
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 2

Python *Data Mining *Twitter API *Big Data *Natural Language Processing *
Tutorial

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now the time has come to look more deeply into the datasets to find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.
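
For reference, a word cloud like this can be produced with the wordcloud package (a generic sketch, not necessarily the exact code used here; the `tweets` list is assumed):

# pip install wordcloud matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Concatenate all tweets into one string and let WordCloud count word frequencies.
text = " ".join(tweets)  # `tweets` is assumed to be a list of tweet strings

cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                  background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()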

Read more
Total votes 3: ↑3 and ↓0 +3
Views 141
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 1

Python *Data Mining *Twitter API *Big Data *Natural Language Processing *
Tutorial

Over the last decades, the world's population has been developing into an information society, in which information plays a substantial role in virtually all aspects of life. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of waging war have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people. 

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

From all the possible data to use (hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled upon English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all the essential characteristics typical of trolling activities in general, such as abuse, rudeness, references to external resources, provocations and bullying. Hashtags were chosen as another source of textual information, as they represent the central message of a tweet in one or two words.
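
As a rough illustration of this kind of text classification (a generic sketch, not the author's exact pipeline; `texts` and `labels` are assumed inputs), one can combine TF-IDF features with a linear classifier in scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# `texts` is a list of tweet strings, `labels` marks them 1 (troll) or 0 (regular user).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word and bigram features
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))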

Read more
Total votes 3: ↑3 and ↓0 +3
Views 145
Comments 0

How we tackled document recognition issues for autonomous and automatic payments using OCR and NER

Python *Natural Language Processing *
Sandbox

In this article, I would like to describe how we've tackled the named entity recognition (NER) problem at Sber with the help of advanced AI techniques. NER is one of many natural language processing (NLP) tasks that let you automatically extract data from unstructured text, including monetary values, dates, names, surnames and job titles.
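
To give a flavour of what NER produces (a generic illustration with the spaCy library, not Sber's in-house model):

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The contract with Acme Corp for $1,200,000 was signed "
          "on 1 March 2022 by John Smith.")

# Each extracted entity comes with a label such as ORG, MONEY, DATE or PERSON.
for ent in doc.ents:
    print(ent.text, ent.label_)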

Just imagine the countless text documents that even a medium-sized organisation deals with on a daily basis, let alone huge corporations. Take Sber, for example: it is the largest financial institution in Russia and in Central and Eastern Europe, with about 16,500 offices, over 250,000 employees, 137 million retail and 1.1 million corporate clients in 22 countries. With such an enormous scale, the company collaborates with hundreds of suppliers, contractors and other counterparties, which implies thousands of contracts. For instance, the estimated number of legal documents to be processed in 2022 is over 65,000, each of them 30 pages long on average. During its lifecycle, a contract is usually updated with 3 to 5 additional agreements. On top of this, a contract is accompanied by various source documents describing transactions. And in PDF format, too.

Previously, the processing duty fell to our service centre's employees, who checked whether the payment details in a bill matched those in the contract and then sent the document to the Accounting Department, where an accountant double-checked everything. This is quite a long journey to a payment, right?

Read more
Rating 0
Views 158
Comments 0

Collective meaning recognition

Search engines *Semantics *Algorithms *Natural Language Processing *
Translation

The published material is in the Appendix of my book [1].

Modern civilization finds itself at a crossroads where it must choose the meaning of life. Because of the development of technology, the majority of the world's population may become "superfluous", not in demand for the production of values. There is another option, in which each person is a supreme value and an absolute individual, and can be indispensably useful in the technology of the collective mind.

In the 1980s, the task was set of creating a scientific field of "collective intelligence". Collective intelligence is defined as the ability of a collective to find solutions to problems more effectively than each participant could individually. The right collective mind must be...

Read more
Total votes 2: ↑2 and ↓0 +2
Views 664
Comments 0

Our new public speech synthesis in super-high quality, 10x faster and more stable

Machine learning *Natural Language Processing *Voice user interfaces *



In our last article we made a bunch of promises about our speech synthesis.

After a lot of hard work, we have finally delivered on these promises:


  • Model size reduced 2x;
  • New models are 10x faster;
  • We added flags to control stress;
  • Now the models can make proper pauses;
  • A high-quality voice was added (plus unlimited "random" voices);
  • All speakers were squeezed into the same model;
  • Input length limitations were lifted, so the models can now work with whole paragraphs of text;
  • Pauses, speed and pitch can be controlled via SSML (see the sketch below);
  • Sampling rates of 8, 24 or 48 kHz are supported;
  • Models are much more stable: they do not omit words anymore.
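
For illustration, SSML markup controlling pauses, speed and pitch looks roughly like this (a minimal sketch; the exact set of supported tags is an assumption, check the docs):

ssml = """
<speak>
  This is the first sentence.
  <break time="500ms"/>
  <prosody rate="slow">This sentence is spoken more slowly,</prosody>
  <prosody pitch="high">and this one at a higher pitch.</prosody>
</speak>
"""
# This string would then be passed to the model's SSML-aware synthesis call.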

This is a truly breakthrough achievement for us, and we are not planning to stop anytime soon. We will shortly be adding as many languages as possible (CIS languages, English, European languages, Indic languages). We are also still planning to make our models an additional 2-5x faster.


We are also planning to add phonemes and a new model for stress, as well as to reduce the minimum amount of audio required to train a high-quality voice to 5-15 minutes.


As usual, you can try our models in our repo or in Colab.

Read more →
Total votes 13: ↑13 and ↓0 +13
Views 1.6K
Comments 0

Concordance of sense

Search engines *Semantics *Algorithms *Natural Language Processing *
Translation

In [1, 2, 3], texts (sign sequences with repetitions) were transformed (coordinatized) into algebraic systems using matrix units as images of words. Coordinatization is a necessary condition for the algebraization of any subject area. The function (arrow (7) in [1]) is a matrix coordinatization of a text. One can perform algebraic operations on words and on fragments of matrix texts as on integers, but taking into account the noncommutativity of multiplication of words as matrices. The structurization of texts is reduced to the calculation of ideals and categories of texts in matrix form.
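
As a reminder of the formalism (standard linear algebra, not a detail specific to [1]), matrix units $E_{ij}$ contain a single nonzero entry and multiply noncommutatively:

$$E_{ij} E_{kl} = \delta_{jk} E_{il}$$

so a text encoded as a product of matrix units retains the order of its words, unlike a bag-of-words representation.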

Read more
Total votes 1: ↑1 and ↓0 +1
Views 438
Comments 0

How to create bilingual books. Part 2. Lingtrain Alignment Studio

Open source *Programming *Learning languages Natural Language Processing *
Tutorial



How to make a parallel book for language learning. Part 1. Python and Colab version


This is the second article about making parallel books. Today we will use a more advanced tool with rich UI functionality. Lingtrain Alignment Studio is a web application written in Vue and Python. Its main purpose is to extract a parallel corpus from two raw texts and make a bilingual (or even multilingual) parallel book. This is an open-source project and I will be glad to hear all of your bright ideas. Links to the sources and our community contacts can be found below. Los geht's!


Setup


The app is packed into a docker container. Docker is a simple technology for deploying your stuff anywhere, from a server to your local machine, and it is available on all the major operating systems. So first you need docker installed locally. Then you need to run two simple commands. The first will download the container:


docker pull lingtrain/aligner:v4

And the second one will run the application:


docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v4

Here C:\app\data and C:\app\img are your local folders.


The app will be available on port 80. Let's open the localhost page in your favorite browser.




We will take three simple steps: Load, Align, Create.

Continue reading
Total votes 8: ↑8 and ↓0 +8
Views 977
Comments 0

Lingtrain Aligner. How to make parallel books for language learning. Part 1. Python and Colab version

Open source *Programming *Machine learning *Learning languages Natural Language Processing *
Tutorial



If you're interested in learning new languages or teaching them, then you probably know about parallel reading. It helps you immerse yourself in the context, increases your vocabulary, and lets you enjoy the learning process. When it comes to reading, you most likely want to choose your favorite author, theme, or something familiar, and this is often impossible if no one has published such a parallel book. It gets even worse when you're learning some cool language like Hungarian or Japanese.


Today we are taking a big step toward changing this situation.


We will use the lingtrain_aligner tool. It's an open-source Python project which aims to help everyone eager to learn foreign languages. It's part of the Lingtrain project; you can follow us on Telegram, Facebook and Instagram. Let's start!


Find the texts


First, we should find two texts we want to align. Let's take two editions of "To Kill a Mockingbird" by Harper Lee: the Russian translation and the original.
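
Under the hood, tools like this typically embed sentences from both texts into a shared multilingual vector space and match nearest neighbours. A rough illustration of that idea (not the lingtrain_aligner API) with the sentence-transformers library:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# LaBSE maps sentences from 100+ languages into one shared vector space.
model = SentenceTransformer("sentence-transformers/LaBSE")

en = ["Atticus was reading the paper.", "The night was quiet."]
ru = ["Ночь была тихой.", "Аттикус читал газету."]

# Cosine similarity between every English and every Russian sentence.
sim = util.cos_sim(model.encode(en), model.encode(ru))
for i, row in enumerate(sim):
    j = int(row.argmax())  # best Russian match for the i-th English sentence
    print(en[i], "<->", ru[j])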

Read more →
Total votes 5: ↑5 and ↓0 +5
Views 1.2K
Comments 0

We have published a model for text repunctuation and recapitalization for four languages

Python *Big Data *Machine learning *Natural Language Processing *




Working with speech recognition models, we often encounter misconceptions among potential customers and users (mostly related to the fact that people have a hard time distinguishing substance from form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.


Of course, you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have a relatively simple and universal solution for "restoring" punctuation marks and capital letters in the sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.


For this reason, we would like to share a system that:


  • Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
  • Works for 4 languages (Russian, English, German, Spanish) and can be extended;
  • By design is domain agnostic and is not based on any hard-coded rules;
  • Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.
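
Loading the published model via torch.hub looks roughly like this (the 'silero_te' entry point and the repo path are taken from memory and may have changed; check the repo README for the current call):

import torch

# Assumed hub entry point in the snakers4/silero-models repo; the helper
# apply_te runs repunctuation and recapitalization on raw lowercase text.
model, example_texts, languages, punct, apply_te = torch.hub.load(
    repo_or_dir="snakers4/silero-models", model="silero_te"
)

# Expected result: capitalized, punctuated text.
print(apply_te("hello how are you i am fine thanks", lan="en"))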

Read more →
Total votes 4: ↑3 and ↓1 +2
Views 2.3K
Comments 0

Context category

Search engines *Semantics *Algorithms *Natural Language Processing *
Translation

The mathematical model of sign sequences with repetitions (texts) is a multiset. The multiset was defined by D. Knuth in 1969 and later studied in detail by A. B. Petrovsky [1]. The universal property of a multiset is the existence of identical elements. The limiting case of a multiset, with unit multiplicities of elements, is a set. The set with unit multiplicities corresponding to a multiset is called its generating set, or domain. A multiset in which all multiplicities are zero is the empty set.
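
In symbols (a standard formalization consistent with the description above), a multiset $A$ over a domain $D$ can be written as

$$A = \{\, x^{m(x)} : x \in D \,\}, \qquad m : D \to \mathbb{N},$$

where $m(x)$ is the multiplicity of $x$; the case $m(x) = 1$ for all $x$ recovers an ordinary set, and $m \equiv 0$ gives the empty set.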

Read more
Total votes 1: ↑1 and ↓0 +1
Views 862
Comments 0

High-Quality Text-to-Speech Made Accessible, Simple and Fast

Machine learning *Sound Natural Language Processing *



There is a lot of commotion in text-to-speech now. There is a great variety of toolkits, a plethora of commercial APIs from GAFA companies (based on both new and older technologies), and a lot of Silicon Valley startups trying to ship products akin to "deep fakes" in speech.


But despite all this ruckus we have not yet seen open solutions that would fulfill all of these criteria:


  • Naturally sounding speech;
  • A large library of voices in many languages;
  • Support for 16kHz and 8kHz out of the box;
  • No GPUs / ML engineering team / training required;
  • Unique voices not infringing upon third-party licenses;
  • High throughput on slow hardware. Decent performance on one CPU thread;
  • Minimalism and lack of dependencies. One-line usage, no builds or coding in C++ required;
  • Positioned as a solution, not yet another toolkit / compilation of models developed by other people;
  • Not affiliated in any way with the ecosystems of Google / Yandex / Sberbank;

We decided to share with the community our open non-commercial solution that fits all of these criteria. Since we have published the whole pipeline, we do not focus much on cherry-picked examples, and we encourage you to visit our project's GitHub repo to test our TTS for yourself.
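
At the time of writing, loading a voice via torch.hub looked roughly like this (the 'silero_tts' entry point, repo path and speaker name are assumptions; see the repo for the exact current call):

import torch

# Assumed hub entry point and speaker; consult the repo README for current values.
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(
    repo_or_dir="snakers4/silero-models", model="silero_tts",
    language="en", speaker="lj_16khz"
)

# Synthesize a list of texts on a single CPU thread; returns audio tensors.
audio = apply_tts(texts=[example_text], model=model,
                  sample_rate=sample_rate, symbols=symbols, device="cpu")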

Total votes 5: ↑5 and ↓0 +5
Views 5K
Comments 8

Converting text into algebra

Search engines *Semantics *Algorithms *Natural Language Processing *
Translation

Algebra and language (writing) are two different learning tools. When they are combined, we can expect new methods of machine understanding to emerge. To determine the meaning (to understand) is to calculate how a part relates to the whole. Modern search algorithms already perform the task of meaning recognition, and Google's tensor processors perform the matrix multiplications (convolutions) necessary in an algebraic approach. At the same time, semantic analysis mainly uses statistical methods. Using statistics in algebra, for instance when looking for signs of divisibility of numbers, would simply be strange. The algebraic apparatus is also useful for interpreting the results of calculations when recognizing the meaning of a text.

Read more
Total votes 1: ↑1 and ↓0 +1
Views 944
Comments 0

Playing with Nvidia's New Ampere GPUs and Trying MIG

Image processing *Big Data *Machine learning *Computer hardware Natural Language Processing *


Every time the essential question arises of whether or not to upgrade the cards in the server room, I look through similar articles and watch videos like this.


The channel with the aforementioned video is very underrated, but its author does not deal with ML. In general, when analyzing comparisons of accelerators for ML, several things usually catch your eye:


  • The authors usually take into account only the "adequacy" of new cards for the US market;
  • The ratings are detached from real-world use and are made on very standard networks (which is probably good overall) without details;
  • The popular mantra of training more and more gigantic models makes its own adjustments to the comparison;

The answer to the question "which card is better?" is not rocket science: cards of the 20* series didn't gain much popularity, while 1080 Ti cards from Avito (the Russian Craigslist) are still very attractive (and, oddly enough, don't get cheaper, probably for this reason).


All this is fine and dandy, and the standard benchmarks are unlikely to lie too much, but recently I learned about the existence of Multi-Instance GPU (MIG) technology for A100 cards and native support for TF32 on Ampere devices, and I got the idea to share my experience of real-world testing of cards on the Ampere architecture (the 3090 and the A100). In this short note, I will try to answer the following questions:


  • Is the upgrade to Ampere worth it? (spoiler for the impatient — yes);
  • Are the A100 worth the money (spoiler — in general — no);
  • Are there any cases when the A100 is still interesting (spoiler — yes);
  • Is MIG technology useful (spoiler — yes, but for inference and for very specific cases for training);
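
For context, TF32 is enabled in PyTorch via global flags (a minimal sketch; the defaults have changed across PyTorch versions, so check yours):

import torch

# On Ampere GPUs, TF32 trades a few mantissa bits for large matmul speedups.
torch.backends.cuda.matmul.allow_tf32 = True   # matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # runs on TF32 tensor cores when the flags above are set
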
Read more →
Total votes 5: ↑5 and ↓0 +5
Views 2.4K
Comments 0

How to find an English teacher. Part 2

Python *Programming *Data visualization *Machine learning *Natural Language Processing *

This is a continuation of the story about using Data Science to find an English teacher. If you have not read Part 1 yet, this is an opportunity to become familiar with it.

Briefly: we had information about language teachers and tried to apply some basic ideas using pandas and our expectations. Unfortunately, we got stuck on the third step, because there was not enough information to resolve our last requirement: we need no more than 3 candidates at the end.

Disclaimer
This approach is based on my own experience and may not suit your point of view, ideas, or principles.
Rating 0
Views 641
Comments 0

How to find an English teacher. Part 1

Python *Programming *Data Mining *Data visualization *Natural Language Processing *


In the modern world, ideas arise here and there about using data science for extra benefit. For instance, Google can use your history of watched videos to recommend new ones. Online shops use recommendation systems to increase your receipt. However, if companies use data for their benefit, couldn't we do the same for our own needs, such as looking for an online English teacher?


Disclaimer

This approach is based on my own experience and may not suit your point of view, ideas, or principles.

Total votes 2: ↑1 and ↓1 0
Views 1.2K
Comments 0

Keyword Tree: graph analysis for semantic extraction

Data visualization *Machine learning *Natural Language Processing *



This post is a small abstract of a full-scale research project focused on keyword recognition. The technique of semantics extraction was initially applied in the field of social media research on depressive patterns. Here I focus on the NLP and math aspects without psychological interpretation. It is clear that analysis of single-word frequencies is not enough: multiple random mixing of a collection does not affect the relative frequencies but destroys the information totally (the bag-of-words effect). We need a more accurate approach for mining semantic attractors.
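
One order-aware alternative (a generic sketch in the spirit of this post, not the author's exact method) is to build a word co-occurrence graph over a sliding window and inspect its structure with networkx:

# pip install networkx
from itertools import combinations
import networkx as nx

tokens = "the mind wanders and the mind returns to the same dark thought".split()

G = nx.Graph()
window = 3  # words co-occurring within this window get connected
for i in range(len(tokens) - window + 1):
    for a, b in combinations(set(tokens[i:i + window]), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# Highly central words act as candidate "semantic attractors".
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:5])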

Read more →
Total votes 8: ↑7 and ↓1 +6
Views 1.1K
Comments 0
