The first text-based CAPTCHA ( we’ll call it just CAPTCHA for the sake of brevity ) was used in 1997 by AltaVista search engine. It prevented bots from adding Uniform Resource Locator (URLs) to their web search engine.
Back then it was a decent defense measure. However the progress can't be stopped, and this defense was bypassed using OCR available at those times (for example FineReader).
CAPTCHA became more complex, noise was added to it, along with distortions, so the popular OCRs couldn’t recognize this text. And then OCRs custom made for this task appeared. It costed extra money and knowledge for the attacking side. The CAPTCHA developers were required to understand the challenges the attackers met, what distortions to add, in order to make the automation of the CAPTCHA recognition more complex.
The misunderstanding of the principles the OCRs were based on, some CAPTCHAs were given such distortions, that they were more of a hassle for regular users than for a machine.
OCRs for different types of CAPTCHAs were made using heuristics, and the most complicated part of it was the CAPTCHA segmentation for the stand along symbols, that subsequently could be easily recognized by the CNN (for example LeNet-5), also SVM showed a good result even on the raw pixels.
In this article I’ll try to grasp the whole history of CAPTCHA recognition, from heuristics to the contemporary automated recognition systems. We’ll figure out, if a CAPTCHA is still alive.
I’ll review the yandex.com CAPTCHA. The Russian version of the same CAPTCHA is more complex.