Big Data *

Everything about big data

Evrone 16 November 2022 at 14:13

How we designed the user interface for an enterprise analytical system

Singula Team corporate blog Big Data *CGI *Data Engineering *

In 2021, an industrial plant contacted us because it needed a system for analyzing its production processes. The plant's team had studied ready-made solutions, but none of them fully covered the required functionality. So they asked us to develop a custom analytical system that would collect data from all machines and make it possible to analyze that data and spot bottlenecks in production. For this project, we created a data-driven UI/UX design and developed a web-based interface for the equipment monitoring system.

490

vldmrvslv 29 June 2022 at 17:24

Detecting attempts of mass influencing via social networks using NLP. Part 2

Python *Data Mining *Twitter API *Big Data *Natural Language Processing *

Tutorial

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now the time has come to look more deeply into the datasets and find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.

727

vldmrvslv 29 June 2022 at 17:20

Detecting attempts of mass influencing via social networks using NLP. Part 1

Python *Data Mining *Twitter API *Big Data *Natural Language Processing *

Tutorial

Over the last few decades, the world has been developing into an information society, in which information plays a substantial end-to-end role in all aspects of life and in all processes. With the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of waging war have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people.

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

Of all the data available (hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled on English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all the characteristics typical of trolling in general, such as abuse, rudeness, references to external resources, provocations and bullying. Hashtags were chosen as another source of textual information because they represent the central message of a tweet in one or two words.
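
As a rough illustration of the feature choice described above, the sketch below splits a tweet into its hashtags and the remaining text. The regular expressions and the `split_tweet` helper are assumptions made for this example, not the preprocessing code used in the article.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for this sketch; real tweets need more careful handling.
HASHTAG_RE = re.compile(r"#(\w+)")      # \w also matches Cyrillic letters
URL_RE = re.compile(r"https?://\S+")


def split_tweet(tweet: str) -> Tuple[str, List[str]]:
    """Return (tweet text without links and hashtags, lowercase hashtags)."""
    no_urls = URL_RE.sub(" ", tweet)                      # drop links first
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(no_urls)]
    text = HASHTAG_RE.sub(" ", no_urls)                   # hashtags are kept separately
    return " ".join(text.split()), hashtags               # collapse whitespace


print(split_tweet("Vote now! #Election2016 https://t.co/abc #MAGA"))
# ('Vote now!', ['election2016', 'maga'])
```

Keeping the two outputs separate makes it possible to build one set of features from the text and another from the hashtags, which is one plausible way to treat them as distinct sources of information.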

1.1K

Yersin_DBA 30 October 2021 at 20:04

Extending and moving a ZooKeeper ensemble

Database Administration *Big Data *

Tutorial

Translation

Once upon a time, our DBA team had a task: to move a ZooKeeper ensemble that we had been using for a ClickHouse cluster. The usual way to move an ensemble is to move its data files. That seems easy and obvious, but our ClickHouse cluster held more than 400 TB of replicated data, and all of its replication information had been stored in the ZooKeeper cluster from the very beginning. We could not afford to lose even a single row of data. We looked for information on the internet, but the one good tutorial we found covered version 3.4.5 and didn't fit our version, 3.6.2. So we decided to move our ensemble by extending it.

910

snakers4 6 October 2021 at 17:20

We have published a model for text repunctuation and recapitalization for four languages

Python *Big Data *Machine learning *Natural Language Processing *

Working with speech recognition models, we often encounter misconceptions among potential customers and users (mostly because people have a hard time distinguishing substance from form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.

Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for "restoring" punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.

For this reason, we would like to share a system that:

Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
Works for 4 languages (Russian, English, German, Spanish) and can be extended;
By design is domain agnostic and is not based on any hard-coded rules;
Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.
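
For contrast, the naive baseline mentioned above (capitalize the first letter and put a full stop at the end) can be sketched in a few lines. The `naive_restore` function is a hypothetical helper written for this illustration. It is not part of the published model and does not reflect how that model works.

```python
def naive_restore(text: str) -> str:
    """Capitalize the first letter and append a full stop if none is present."""
    text = text.strip()
    if not text:
        return text
    restored = text[0].upper() + text[1:]
    if restored[-1] not in ".!?":
        restored += "."
    return restored


print(naive_restore("how are you doing today"))
# How are you doing today.
```

The example also shows the limit of the baseline: the input is a question, yet it gets a full stop, and no commas or capital letters can appear inside the sentence.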

3.7K

AlexZus 1 October 2021 at 19:27

Millions of orders per second matching engine testing

C++ *Data Mining *Big Data *Data Engineering *

Sandbox

Some time ago, I developed a matching engine for a cryptocurrency exchange. It was an interesting and challenging experience: I wrote it in pure C++ from scratch. Testing it is also quite a challenging task. You need to get data for testing, run the tests, collect statistics, and finally analyze the collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can give insights for optimizations even without the need to change the code. The matching engine I developed can do more than 1,000,000 TPS (transactions per second) and is 10 times faster than the matching engine of the Binance cryptocurrency exchange (see one post on Binance Blog).

5.6K

olegchir 28 September 2021 at 08:46

Big Data Tools with IntelliJ IDEA Ultimate, PyCharm Professional, DataGrip 2021.3 EAP, and DataSpell Support

JetBrains corporate blog Programming *Big Data *Data Engineering *

Recently we released a new build of the Big Data Tools plugin that is compatible with the 2021.3 versions of IntelliJ IDEA and PyCharm. DataGrip 2021.3 support will be available immediately after the release in October. The plugin also supports our new data science IDE – JetBrains DataSpell. If you still use previous versions, now is the perfect time to upgrade both your IDE and the plugin.

This year, we introduced a number of new features as well as some features that have been there for a while, for example, running Spark Submit with a run configuration.

Here’s a list of the key improvements:

1.3K

m31 1 July 2021 at 16:40

Data Phoenix Digest — 01.07.2021

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

We at Data Science Digest have always strived to ignite the fire of knowledge in the AI community. We’re proud to have helped thousands of people learn something new and to have given you the tools to push ahead. And we’ve not been standing still, either.

Please meet Data Phoenix, a Data Science Digest rebranded and risen anew from our own flame. Our mission is to help everyone interested in Data Science and AI/ML to expand the frontiers of knowledge. More news, more updates, and webinars(!) are coming. Stay tuned!

The new issue of the new Data Phoenix Digest is here! AI that helps write code, EU’s ban on biometric surveillance, genetic algorithms for NLP, multivariate probabilistic regression with NGBoosting, alias-free GAN, MLOps toys, and more…

If you’re more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: Twitter, Facebook.

-1

1.5K

m31 24 June 2021 at 13:09

DataScience Digest — 24.06.21

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

The new issue of DataScienceDigest is here!

The impact of NLP and the growing budgets to drive AI transformations. How Airbnb standardized metric computation at scale. Cross-Validation, MASA-SR, AgileGAN, EfficientNetV2, and more.

If you’re more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: Twitter, LinkedIn, Facebook.

1.5K

m31 10 June 2021 at 12:48

DataScience Digest — 10.06.21

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

The new issue of DataScienceDigest is here!

Machine learning in healthcare, the top 10 TED talks on AI, fraud detection in Uber, DatasetGAN, Text-to-Image generation via transformers, and more…

932

m31 2 June 2021 at 23:42

DataScience Digest — 02.06.21

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

New issue of DataScienceDigest is here! OpenAI is launching a $100 million startup fund, Albumentations 1.0 has been released, lessons on ML platforms, image cropping on Twitter, and more.

869

m31 28 May 2021 at 14:29

DataScience Digest — 28.05.21

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

The new issue of Data Science Digest is here! Hop in to learn about the latest news, articles, tutorials, research papers, and event materials on DataScience, AI, ML, and BigData. All sections are prioritized for your convenience. Enjoy!

593

Tott 21 April 2021 at 13:37

You are standing at a red light at an empty intersection. How to make traffic lights smarter?

Python *IT Infrastructure *Big Data *

Types of smart traffic lights: adaptive and neural networks

Adaptive control works at relatively simple intersections, where the rules and possibilities for switching phases are quite obvious. It is only applicable where there is no constant load in all directions; otherwise it simply has nothing to adapt to, because there are no free time windows. The first intersections with adaptive control appeared in the United States in the early 1970s. Unfortunately, they have reached Russia only now, and by some estimates there are no more than 3,000 of them in the country.

Neural networks are a higher level of traffic regulation. They take into account many factors at once, not all of which are obvious. Their result is based on self-learning: the computer receives live data on throughput and searches across all available algorithms for the maximum value, so that in total as many vehicles as possible pass from all sides in a comfortable mode per unit of time. Asked how this is done, programmers usually answer: we do not know, the neural network is a black box, but we will reveal the basic principles to you…

Adaptive traffic lights, at least those from the leading companies in Russia, use rather outdated technology for counting vehicles at intersections: physical sensors or a video background detector. A capacitive sensor or an induction loop only sees a vehicle at the installation site, within a few meters, unless of course you spend millions on laying them along the entire length of the roadway. The video background detector shows only how much of the roadway is filled with vehicles. The camera must see this area clearly, which is quite difficult at a long distance because of perspective, and it is highly susceptible to atmospheric interference: even a light snowstorm will be diagnosed as the presence of traffic, because the background video detector does not distinguish what it has detected.

1.5K

m31 21 April 2021 at 12:38

Data Science Digest — 21.04.21

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

Hi All,

I’m pleased to invite you all to enroll in the Lviv Data Science Summer School, to delve into advanced methods and tools of Data Science and Machine Learning, including such domains as CV, NLP, Healthcare, Social Network Analysis, and Urban Data Science. The courses are practice-oriented and are geared towards undergraduates, Ph.D. students, and young professionals (intermediate level). The school runs July 19–30 and will be hosted online. Make sure to apply: spots are running out fast!

If you’re more used to getting updates every day, follow us on social media:

Telegram
Twitter
LinkedIn
Facebook

Regards,
Dmitry Spodarets.

767

m31 15 April 2021 at 22:34

Data Science Digest — We Are Back

Python *Algorithms *Big Data *Machine learning *Artificial Intelligence

Hi All,

I have some good news for you…

Data Science Digest is back! We’ve been “offline” for a while, but no worries: you’ll receive regular digest updates with top news and resources on AI/ML/DS every Wednesday, starting today.

If you’re more used to getting updates every day, follow us on social media:

Telegram - https://t.me/DataScienceDigest
Twitter - https://twitter.com/Data_Digest
LinkedIn - https://www.linkedin.com/company/data-science-digest/
Facebook - https://www.facebook.com/DataScienceDigest/

And finally, your feedback is very much appreciated. Feel free to share any ideas with me and the team, and we’ll do our best to make Data Science Digest a better place for all.

Regards,
Dmitry Spodarets.

866

FizpokPak 1 February 2021 at 13:51

Coins classifier Neural Network: Head or Tail?

Python *Data Mining *Big Data *Data Engineering *TensorFlow *

Home of this article: https://robotics.snowcron.com/coins/02_head_or_tail.htm

The global objective of these articles is to build a coin classifier capable of scanning your pocket change and finding rare / valuable coins. This is the second article in the series, so let me remind you what happened earlier (https://habr.com/ru/post/538958/).

During the previous step we got a rather large dataset composed of pairs of images, loaded from the online coin site meshok.ru. Those images were uploaded to the Internet by people we do not know, and though each pair is supposed to contain the coin's head in one image and its tail in the other, we cannot rule out pairs with two heads and no tail, or vice versa. Also, at the moment we have no idea which image contains the head and which contains the tail: this might be important when we feed data to our final classifier.

So let's write a program to distinguish heads from tails. It is a rather simple task, involving a convolutional neural network that uses transfer learning.

As before, we are going to use the Google Colab environment, taking advantage of the free video card it gives us access to. We will store data on Google Drive, so the first thing we need is to allow Colab to access the Drive:

869

FizpokPak 24 January 2021 at 21:59

Coins Classification using Neural Networks

Python *Big Data *Data Engineering *

Tutorial

See more at robotics.snowcron.com

This is the first article in a series dedicated to coin classification.

With countless "dogs vs cats" or "find a pedestrian on the street" classifiers all over the Internet, coin classification doesn't look like a difficult task. At first. Unfortunately, it is an order of magnitude harder - a formidable challenge indeed. You can easily tell heads from tails? Great. Can you figure out if the number is shifted 1 mm to the left? See, from the classifier's point of view it is still the same head... while it can make the difference between a common coin priced according to the number on it and a rare one, 1000 times more expensive.

Of course, we can do what we usually do in image classification: provide 10,000 sample images... No, wait, we cannot. Some types of coins are rare indeed - you need to sort through a BASKET (10 liters) of coins to find one. Simple arithmetic suggests that to get 10,000 images of DIFFERENT coins you will need 10,000 baskets of coins to start with. Well, and unlimited time.

So it is not that easy.

Anyway, we are going to begin by getting a large number of images and work from there. We will use Russian coins as an example, as Russia had a monetary reform in 1994 and so the number of coins one can expect to find in a pocket is limited. Unlike the USA with its 200 years of monetary history. And yes, we are ONLY going to focus on current coins: the ultimate goal of our work is to write a smartphone program that classifies the coins you have received as change in a grocery store.

Which makes things even worse, as we cannot count on good lighting and quality cameras anymore. But we'll still try.

In addition to "only Russian coins, beginning from 1994", we are going to add an extra limitation: no special occasion coins. Those coins look distinctive, so anyone can figure out that such a coin is special. We focus on REGULAR coins. Which limits their number severely.

Don't get me wrong: if we need to apply the same approach to a full list of coins... it will work. But I got 15 GB of images for that limited set, can you imagine how large the complete set will be?!

To get images, I am going to scan one of the largest Russian coin sites, "meshok.ru". This site allows buyers and sellers to find each other; sellers can upload images... just what we need. Unfortunately, a business-oriented seller can easily upload his 1 rouble image to the 1, 2, 5, 10 roubles topics, just to increase the exposure.

So we cannot count on the topic name; we have to determine which coin is in the photo ourselves.

To scan the site, a simple scanner was written, based on Python's Beautiful Soup library. In just a few hours I got over 50,000 photos. Not a lot by Machine Learning standards, but definitely a start.

After we got the images, we have to - unfortunately - revisit them by hand, looking for images we do not want in our training set, or for images that should be edited somehow. For example, someone could have uploaded a photo of his cat. We don't need a cat in our dataset.

First, we delete all images that cannot be split into head/tail.
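
The core of such a scanner, collecting image URLs from a listing page, might look roughly like the sketch below. The article's scanner is based on Beautiful Soup; this example swaps in the standard library's `html.parser` so that it runs without extra packages. The sample markup, the `example.com` base URL, and the `ImageLinkParser` and `extract_image_urls` names are all assumptions made for illustration, and the real site's HTML will differ.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect absolute URLs of all <img> tags that have a src attribute."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip images without a source
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html: str, base_url: str) -> List[str]:
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical listing markup: one lot with two photos, one broken image tag.
sample_html = """
<div class="lot"><img src="/img/coin_1_a.jpg"><img src="/img/coin_1_b.jpg"></div>
<div class="lot"><img alt="no source"></div>
"""
print(extract_image_urls(sample_html, "https://example.com/"))
# ['https://example.com/img/coin_1_a.jpg', 'https://example.com/img/coin_1_b.jpg']
```

Downloading the pages and images, and walking through the listing topics, are left out of the sketch on purpose, since they depend on the structure of the site being scanned.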

1.9K

olegchir 16 December 2020 at 17:17

Big Data Tools EAP 12 Is Out: Experimental Python Support and Search Function in Zeppelin Notebooks

JetBrains corporate blog Python *Scala *Big Data *

Update 12 of the Big Data Tools plugin for IntelliJ IDEA Ultimate, PyCharm Professional Edition, and DataGrip has been released. You can install it from the JetBrains Plugin Repository or from inside your IDE. The plugin allows you to edit Zeppelin notebooks, upload files to cloud filesystems, and monitor Hadoop and Spark clusters.

In this release, we've added experimental Python support and global search inside Zeppelin notebooks. We’ve also addressed a variety of bugs. Let's talk about the details.

878

ValeryKomarov 15 December 2020 at 09:56

Big / Bug Data: Analyzing the Apache Flink Source Code

PVS-Studio corporate blog Programming *Java *Apache *Big Data *

Applications used in the field of Big Data process huge amounts of information, and this often happens in real time. Naturally, such applications must be highly reliable so that no error in the code can interfere with data processing. To achieve high reliability, one needs to keep a wary eye on the code quality of projects developed for this area. The PVS-Studio static analyzer is one of the solutions to this problem. Today, the Apache Flink project developed by the Apache Software Foundation, one of the leaders in the Big Data software market, was chosen as a test subject for the analyzer.

-1

693

snakers4 5 December 2020 at 12:55

Playing with Nvidia's New Ampere GPUs and Trying MIG

Image processing *Big Data *Machine learning *Computer hardware Natural Language Processing *

Every time the essential question arises of whether to upgrade the cards in the server room, I look through similar articles and watch such videos.

The channel with the aforementioned video is very underrated, but its author does not deal with ML. In general, when analyzing comparisons of accelerators for ML, several things usually catch your eye:

The authors usually take into account only the "adequacy" for the market of new cards in the United States;
The ratings are far from the people and are made on very standard networks (which is probably good overall) without details;
The popular mantra to train more and more gigantic models makes its own adjustments to the comparison;

The answer to the question "which card is better?" is not rocket science: cards of the 20* series didn't get much popularity, while the 1080 Ti from Avito (the Russian Craigslist) are still very attractive (and, oddly enough, don't get cheaper, probably for this reason).

All this is fine and dandy, and the standard benchmarks are unlikely to lie too much, but recently I learned about the existence of Multi-Instance GPU technology for A100 video cards and native support for TF32 on Ampere devices, and I got the idea to share my experience of actually testing cards on the Ampere architecture (3090 and A100). In this short note, I will try to answer the questions:

Is the upgrade to Ampere worth it? (spoiler for the impatient: yes);
Is the A100 worth the money? (spoiler: in general, no);
Are there any cases when the A100 is still interesting? (spoiler: yes);
Is MIG technology useful? (spoiler: yes, but for inference and for very specific cases for training);

2.9K
