• High-Quality Text-to-Speech Made Accessible, Simple and Fast



      There is a lot of commotion in text-to-speech right now: a great variety of toolkits and a plethora of commercial APIs from GAFA companies (based on both new and older technologies). Many Silicon Valley startups are also trying to ship products akin to "deep fakes" in speech.


      But despite all this ruckus we have not yet seen open solutions that would fulfill all of these criteria:


      • Natural-sounding speech;
      • A large library of voices in many languages;
      • Support for 16kHz and 8kHz out of the box;
      • No GPUs / ML engineering team / training required;
      • Unique voices not infringing upon third-party licenses;
      • High throughput on slow hardware. Decent performance on one CPU thread;
      • Minimalism and lack of dependencies. One-line usage, no builds or coding in C++ required;
      • Positioned as a solution, not yet another toolkit / compilation of models developed by other people;
      • Not affiliated in any way with the ecosystems of Google / Yandex / Sberbank;

      We decided to share with the community our open non-commercial solution, which fits all of these criteria. Since we have published the whole pipeline, we do not focus much on cherry-picked examples, and we encourage you to visit the project's GitHub repo and test our TTS for yourself.

    • Short-lived Music or MuseScore Code Analysis

        In some areas, a programming background alone is not enough to develop software. Medical software is one example. Music software, which this article discusses, is another. Such projects need advice from subject matter experts, which makes development more expensive, so developers sometimes save on code quality. The check of the MuseScore project described in this article shows why code quality expertise matters. Hopefully, programming and musical humor will brighten up the technical text.

        Read more →
      • Modern Portable Voice Activity Detector Released



          Currently, there are hardly any high-quality / modern / free / public voice activity detectors except for the WebRTC Voice Activity Detector. WebRTC, though, is starting to show its age, and it suffers from many false positives.


          In some cases it is also crucial to be able to anonymize large-scale spoken corpora (i.e. remove personal data). Typically, personal data is considered private / sensitive if it contains (i) a name or (ii) some private ID. Name recognition is a highly subjective matter that depends on the locale and the business case, but Voice Activity Detection and Number Detection are quite general tasks.


          Key features:


          • Modern, portable;
          • Low memory footprint;
          • Superior metrics to WebRTC;
          • Trained on huge spoken corpora and noise / sound libraries;
          • Slower than WebRTC, but fast enough for IoT / edge / mobile applications;
          • Unlike WebRTC (which mostly tells silence from voice), our VAD can tell voice from noise / music / silence;
          • PyTorch (JIT) and ONNX checkpoints;

          Typical use cases:


          • Spoken corpora anonymization;
          • Can be used together with WebRTC;
          • Voice activity detection for IoT / edge / mobile use cases;
          • Data cleaning and preparation, number and voice detection in general;
          • PyTorch and ONNX checkpoints can be used with a wide variety of deployment options and backends;
          Read more →
        • Modern Google-level STT Models Released


            We are proud to announce that we have built from the ground up and released our high-quality (i.e. on par with premium Google models) speech-to-text models for the following languages:


            • English;
            • German;
            • Spanish;

            You can find all of our models in our repository, together with examples and quality and performance benchmarks. We also invested some time into making our models as accessible as possible: you can try our examples as well as the PyTorch, ONNX and TensorFlow checkpoints. You can also load our model via TorchHub.
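            Loading via TorchHub, as mentioned above, might look roughly like the sketch below. The repository name `snakers4/silero-models` and the entry point `silero_stt` are assumptions that do not appear in the text above, and the helper names `hub_arguments` and `load_stt` are hypothetical; check the project repository for the current interface.

```python
# A minimal sketch, not the authors' official snippet.
# Assumptions: the hub repository is "snakers4/silero-models" and the
# entry point is "silero_stt"; neither name is given in the text above.

SUPPORTED_LANGUAGES = ("en", "de", "es")  # English, German, Spanish


def hub_arguments(language="en"):
    """Build the keyword arguments for torch.hub.load (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return {
        "repo_or_dir": "snakers4/silero-models",  # assumed repository name
        "model": "silero_stt",                    # assumed entry point
        "language": language,
    }


def load_stt(language="en", device="cpu"):
    """Download (on first call) and load a speech-to-text model via TorchHub."""
    import torch  # imported lazily so the helper above works without PyTorch

    return torch.hub.load(device=torch.device(device), **hub_arguments(language))
```

            Calling `load_stt("de")` needs PyTorch and network access on first use. Under the assumptions above, it is expected to return the model together with a decoder and a set of utilities.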


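            Checkpoint formats: PyTorch, ONNX, TensorFlow. Quality benchmarks and Colab notebooks per model:

            • English (en_v1): quality (link), Colab notebook;
            • German (de_v1): quality (link), Colab notebook;
            • Spanish (es_v1): quality (link), Colab notebook;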
            Read more →
          • Juggling work and study at ITMO University: CS edition

              We talked to the graduates of the Speech Information Systems MA program at ITMO about the ways our university helped jumpstart their careers.


              Read more →
            • Bluetooth stack modifications to improve audio quality on headphones without AAC, aptX, or LDAC codecs

                Before reading this article, we recommend reading the previous one: Audio over Bluetooth: most detailed information about profiles, codecs, and devices / in Russian

                Some wireless headphone users note low sound quality and a lack of high frequencies when using the standard Bluetooth SBC codec, which is supported by all headphones and other Bluetooth audio devices. A common recommendation for better sound quality is to buy devices and headphones that support the aptX or LDAC codecs. These codecs require licensing fees, which is why devices with them are more expensive.

                It turns out that the low quality of SBC is caused by artificial limitations in all current Bluetooth stacks and in headphone configuration, and these limitations can be circumvented on any existing device with a software modification alone.
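
                The passage above gives no numbers. As a rough illustration of how SBC encoder parameters translate into bitrate, here is a sketch of the frame-length and bitrate formulas from the A2DP specification. It assumes that the limitation in question concerns negotiated parameters such as the bitpool, which the text above does not state. The bitpool values are illustrative only, and the function names are hypothetical.

```python
import math


def sbc_frame_length(bitpool, channel_mode="joint_stereo", subbands=8, blocks=16):
    """SBC frame length in bytes, following the A2DP specification formula."""
    channels = 1 if channel_mode == "mono" else 2
    # 4-byte frame header plus 4 bits of scale factor per subband per channel.
    header = 4 + (4 * subbands * channels) // 8
    if channel_mode in ("mono", "dual_channel"):
        # Each channel gets its own bitpool.
        payload_bits = blocks * channels * bitpool
    else:
        # Stereo and joint stereo share one bitpool between both channels;
        # joint stereo adds one "join" flag bit per subband.
        join = 1 if channel_mode == "joint_stereo" else 0
        payload_bits = join * subbands + blocks * bitpool
    return header + math.ceil(payload_bits / 8)


def sbc_bitrate_kbps(bitpool, channel_mode="joint_stereo",
                     sample_rate=44100, subbands=8, blocks=16):
    """Bitrate in kbit/s: one frame carries subbands * blocks samples per channel."""
    frame = sbc_frame_length(bitpool, channel_mode, subbands, blocks)
    return 8 * frame * sample_rate / (subbands * blocks) / 1000


# Illustrative settings only (not taken from the text above).
for mode, bitpool in (("joint_stereo", 53), ("dual_channel", 38)):
    print(mode, bitpool, round(sbc_bitrate_kbps(bitpool, mode)), "kbit/s")
```

                The point of the sketch is only that bitrate follows directly from the bitpool and channel mode, so a cap on those parameters is a cap on quality.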
                Read more →
              • Audio over Bluetooth: most detailed information about profiles, codecs, and devices

                  XKCD comic. How standards proliferate. SITUATION: there are 14 competing standards. Geek: 14?! Ridiculous! We need to develop one universal standard that covers everyone's use cases. Geek's girlfriend: yeah! SOON: Situation: there are 15 competing standards.

                  This article is also available in Russian

                  The mass market of smartphones without a 3.5 mm audio jack has changed the headphone industry: for many users, wireless Bluetooth headphones have become the main way to listen to music and to communicate in headset mode.
                  Bluetooth device manufacturers rarely disclose detailed product specifications, and Bluetooth audio articles on the Internet are contradictory and sometimes incorrect. They do not cover all the features, and they often repeat the same false information.
                  Let's try to understand the protocol and the capabilities of Bluetooth stacks, headphones and speakers, and Bluetooth codecs for music and speech; find out what affects the quality and delay of the transmitted audio; and learn how to capture and decode information about supported codecs and other device features.

                  TL;DR:

                  • SBC codec is OK
                  • Headphones have their own per-codec equalizer and post-processing configuration
                  • aptX is not as good as the advertisements say
                  • LDAC is marketing fluff
                  • Voice audio quality is still low
                  • Browsers are able to execute audio encoders compiled to WebAssembly from C using emscripten, and they won't even lag.

                • Designing Sound for Pathfinder: Kingmaker


                    Pathfinder: Kingmaker (PF:K for short) is a role-playing video game created by Owlcat Games and released in Fall 2018 on Steam and GOG. Inspired by classic BioWare games, it uses a popular board game system ruleset, features real-time-with-pause combat and an isometric camera, and has a non-linear story with multiple unique endings.


                    In this article, I will share a little about how we worked on designing the audio throughout the game’s development including task management, the search for inspiration, and troubleshooting. An experienced specialist may not find anything particularly groundbreaking in this recap, but beginners and enthusiasts will definitely discover some points of interest.

                  • PC Speaker To Eleven

                      Known now as a «motherboard speaker», or just «beeper», the PC Speaker was introduced in 1981 along with the first IBM personal computer. Because that computer descended from the big serious computers for serious business, the speaker was designed to produce only very basic system beeps, so it never really had a chance to shine as a music device in the numerous entertainment programs of the emerging home market. Overshadowed by the much more advanced sound chips of popular home game systems and quickly replaced by powerful sound cards, it mostly served as a fallback option, playing severely downgraded versions of content made for better sound hardware.

                      «System Beeps» is a music album in the form of an MS-DOS program. It features original music composed for the PC Speaker using the same basic old techniques found in classic PC games. It follows the usual retro computing demoscene formula (take something rusty and obsolete, and push it to eleven) and attempts to reveal the long-hidden potential of this humble little sound device. You can hear it in action at Bandcamp or in the video below, and form your own opinion on how successful this attempt was. The following article is an in-depth overview of the original PC Speaker capabilities and the making of the project, for those who would like to know more.
