Kiwix in Debian, 2022 update

Previous updates: 2018, 2021

Kiwix is an offline content reader, best known for distributing copies of Wikipedia. I have been maintaining it in Debian since 2017.

This year most of the work has been keeping all the packages up to date in anticipation of next year's Debian 12 Bookworm release, including several transitions for new libzim and libkiwix versions.

  • libzim: 6.3.0 → 8.0.0
  • zim-tools: 2.1.0 → 3.1.1
  • python-libzim: 0.0.3 → 1.1.1 (with a cherry-picked patch)
  • libkiwix: 9.4.1 → 11.0.0 (with DFSG issues fixed!)
  • kiwix-tools: 3.1.2 → 3.3.0
  • kiwix (desktop): 2.0.5 → 2.2.2

The Debian Package Tracker makes it really easy to keep an eye on all Kiwix-related packages.

All of the "user-facing" packages (zim-tools, kiwix-tools, kiwix) now have very basic autopkgtests that can provide a bit of confidence that the package isn't totally broken. I recommend reading the "FAQ for package maintainers" to learn about all the benefits you get from having autopkgtests.

Finally, back in March I wrote a blog post, How to mirror the Russian Wikipedia with Debian and Kiwix, which got significant readership (compared to most posts on this blog), including being quoted by LWN!

We are always looking for more contributors, please reach out if you're interested. The Kiwix team is one of my favorite groups of people to work with and they love Debian too.



A belated writeup of CVE-2022-28201 in MediaWiki

In December 2021, I discovered CVE-2022-28201: it's possible to get MediaWiki's Title::newMainPage() to go into infinite recursion. More specifically, if the local interwikis feature is configured (not used by default, but enabled on Wikimedia wikis), any on-wiki administrator could fully brick the wiki by editing the [[MediaWiki:Mainpage]] wiki page in a malicious manner. It would require someone with sysadmin access to recover, either by adjusting site configuration or manually editing the database.

In this post I'll explain the vulnerability in more detail, how Rust helped me discover it, and a better way to fix it long-term.

The vulnerability

At the heart of this vulnerability is Title::newMainPage(). The function, before my patch, is as follows (link):

public static function newMainPage( MessageLocalizer $localizer = null ) {
    if ( $localizer ) {
        $msg = $localizer->msg( 'mainpage' );
    } else {
        $msg = wfMessage( 'mainpage' );
    }
    $title = self::newFromText( $msg->inContentLanguage()->text() );
    // Every page renders at least one link to the Main Page (e.g. sidebar).
    // If the localised value is invalid, don't produce fatal errors that
    // would make the wiki inaccessible (and hard to fix the invalid message).
    // Gracefully fallback...
    if ( !$title ) {
        $title = self::newFromText( 'Main Page' );
    }
    return $title;
}

It gets the contents of the "mainpage" message (editable on-wiki at MediaWiki:Mainpage), parses the contents as a page title and returns it. As the comment indicates, it is called on every page view and as a result has a built-in fallback if the configured main page value is invalid for whatever reason.

Now, let's look at how interwiki links work. Normal interwiki links are pretty simple: they take the form [[prefix:Title]], where the prefix is the interwiki name of a foreign site. In the default interwiki map, "wikipedia" points to https://en.wikipedia.org/wiki/$1. There's no requirement that the interwiki target even be a wiki; for example, [[google:search term]] is a supported prefix and link.

And if you type in [[wikipedia:]], you'll get a link to https://en.wikipedia.org/wiki/, which redirects to the Main Page. Nice!

Local interwiki links are a bonus feature on top of this to make sharing of content across multiple wikis easier. A local interwiki is one that maps to the wiki we're currently on. For example, you could type [[wikipedia:Foo]] on the English Wikipedia and it would be the same as just typing in [[Foo]].

So now what if you're on English Wikipedia and type in [[wikipedia:]]? Naively that would be the same as typing [[]], which is not a valid link.

So in c815f959d6b27 (first included in MediaWiki 1.24), links like [[wikipedia:]] (where the prefix is a local interwiki) were made to resolve explicitly to the main page. This seems like entirely logical behavior and achieves the goal of local interwiki links: the link works the same regardless of which wiki it's used on.

Except it now means that when trying to parse a title, the answer might end up being "whatever the main page is". And if we're trying to parse the "mainpage" message to discover where the main page is? Boom, infinite recursion.

All you have to do is edit "MediaWiki:Mainpage" on your wiki to be something like localinterwiki: and your wiki is mostly hosed; recovering requires someone to either de-configure that local interwiki or manually edit that message via the database.

The patch I implemented was pretty simple, just add a recursion guard with a hardcoded fallback:

    public static function newMainPage( MessageLocalizer $localizer = null ) {
+       static $recursionGuard = false;
+       if ( $recursionGuard ) {
+           // Somehow parsing the message contents has fallen back to the
+           // main page (bare local interwiki), so use the hardcoded
+           // fallback (T297571).
+           return self::newFromText( 'Main Page' );
+       }
        if ( $localizer ) {
            $msg = $localizer->msg( 'mainpage' );
        } else {
            $msg = wfMessage( 'mainpage' );
        }

+       $recursionGuard = true;
        $title = self::newFromText( $msg->inContentLanguage()->text() );
+       $recursionGuard = false;

        // Every page renders at least one link to the Main Page (e.g. sidebar).
        // If the localised value is invalid, don't produce fatal errors that

Discovery

I was mostly exaggerating when I said Rust helped me discover this bug. I previously blogged about writing a MediaWiki title parser in Rust, and it was while working on that I read the title parsing code in MediaWiki enough times to discover this flaw.

A better fix

I do think that long-term, we have better options to fix this.

There's a new, somewhat experimental, configuration option called $wgMainPageIsDomainRoot. The idea is that rather than serve the main page from /wiki/Main_Page, it would just be served from /. Conveniently, this would mean that it doesn't actually matter what the name of the main page is, since we'd just have to link to the domain root.
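
Enabling it is a one-line switch in LocalSettings.php (shown here for illustration; as noted, the option is still somewhat experimental):

// LocalSettings.php: serve the main page from the domain root, so its
// on-wiki name no longer matters when linking to it.
$wgMainPageIsDomainRoot = true;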

There is an open request for comment to enable such functionality on Wikimedia sites. It would be a small performance win, give everyone cleaner URLs, and possibly break everything that expects https://en.wikipedia.org/ to return an HTTP 301 redirect, like it has for the past 20+ years. Should be fun!

Timeline

Acknowledgements

Thank you to Scott Bassett of the Wikimedia Security team for reviewing and deploying my patch, and Reedy for backporting and performing the security release.



How to mirror the Russian Wikipedia with Debian and Kiwix

It has been reported that the Russian government has threatened to block access to Wikipedia for documenting narratives that do not agree with the official position of the Russian government.

One of the anti-censorship strategies I've been working on is Kiwix, an offline Wikipedia reader (and plenty of other content too). Kiwix is free and open source software developed by a great community of people that I really enjoy working with.

With threats of censorship, traffic to Kiwix has increased fifty-fold, with users from Russia accounting for 40% of new downloads!

You can download copies of every language of Wikipedia for offline reading and distribution, as well as hosting your own read-only mirror, which I'm going to explain today.

Disclaimer: depending on where you live, it may be illegal or get you in trouble with the authorities to rehost Wikipedia content; please be aware of your digital and physical safety before proceeding.

With that out of the way, let's get started. You'll need a Debian (or Ubuntu) server with at least 30GB of free disk space. You'll also want to have a webserver like Apache or nginx installed (I'll share the Apache config here).

First, we need to download the latest copy of the Russian Wikipedia.

$ wget 'https://download.kiwix.org/zim/wikipedia/wikipedia_ru_all_maxi_2022-03.zim'

If the download is interrupted or fails, you can use wget -c $url to resume it.

Next let's install kiwix-serve and try it out. If you're using Ubuntu, I strongly recommend enabling our Kiwix PPA first.
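
Adding the PPA looks roughly like this (the PPA name here is from memory, so double-check the install instructions on the Kiwix website):

$ sudo add-apt-repository ppa:kiwixteam/release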

$ sudo apt update
$ sudo apt install kiwix-tools
$ kiwix-serve -p 3004 wikipedia_ru_all_maxi_2022-03.zim

At this point you should be able to visit http://yourserver.com:3004/ and see the Russian Wikipedia. Awesome! You can use any available port; I just picked 3004.

Now let's use systemd to daemonize it so it runs in the background. Create /etc/systemd/system/kiwix-ru-wp.service with the following:

[Unit]
Description=Kiwix Russian Wikipedia

[Service]
Type=simple
User=www-data
ExecStart=/usr/bin/kiwix-serve -p 3004 /path/to/wikipedia_ru_all_maxi_2022-03.zim
Restart=always

[Install]
WantedBy=multi-user.target

Now let's start it and enable it at boot:

$ sudo systemctl start kiwix-ru-wp
$ sudo systemctl enable kiwix-ru-wp
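
If you want a quick sanity check that it's running, look at the service status and make a local request (optional, but cheap):

$ sudo systemctl status kiwix-ru-wp
$ curl -I http://127.0.0.1:3004/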

Since we want to expose this on the public internet, we should put it behind a more established webserver and configure HTTPS.

Here's the Apache httpd configuration I used:

<VirtualHost *:80>
        ServerName ru-wp.yourserver.com

        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        <Proxy *>
                Require all granted
        </Proxy>

        ProxyPass / http://127.0.0.1:3004/
        ProxyPassReverse / http://127.0.0.1:3004/
</VirtualHost>

Put that in /etc/apache2/sites-available/kiwix-ru-wp.conf and run:

$ sudo a2ensite kiwix-ru-wp
$ sudo systemctl reload apache2

Finally, I used certbot to enable HTTPS on that subdomain and redirect all HTTP traffic over to HTTPS. This is an interactive process that is well documented so I'm not going to go into it in detail.
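
For reference, the invocation looks something like this (assuming the Apache plugin for certbot is installed; adjust the domain to yours):

$ sudo certbot --apache -d ru-wp.yourserver.com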

You can see my mirror of the Russian Wikipedia, following these instructions, at https://ru-wp.legoktm.com/. Anyone is welcome to use it or distribute the link, though I am not committing to running it long-term.

This is certainly not a perfect anti-censorship solution: the copy of Wikipedia that Kiwix provides became out of date the moment it was created, and the setup described here will require you to manually update the service when the new copy is available next month.

Finally, if you have some extra bandwidth, you can also help seed this as a torrent.
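
For example, with a command-line client (I believe the .torrent is published alongside the ZIM file, but double-check the URL on download.kiwix.org):

$ wget 'https://download.kiwix.org/zim/wikipedia/wikipedia_ru_all_maxi_2022-03.zim.torrent'
$ transmission-cli wikipedia_ru_all_maxi_2022-03.zim.torrent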


Building fast Wikipedia bots in Rust

Lately I've been working on mwbot-rs, a framework to write bots and tools in Rust. My main focus is on things related to Wikipedia, but all the code is generic enough to be used by any modern (1.35+) MediaWiki installation. One specific feature of mwbot-rs I want to highlight today is the way it enables building incredibly fast Wikipedia bots, taking advantage of Rust's "fearless concurrency".

Most Wikipedia bots follow the same general process:

  1. Fetch a list of pages
  2. For each page, process the page metadata/contents, possibly fetching other pages for more information
  3. Combine all the information into the new page contents, save the page, and possibly save other pages too.

The majority of bots have rather minimal/straightforward processing, so they are typically limited by the speed of I/O, since all fetching and editing happens over network requests to the API.

MediaWiki has a robust homegrown API (known as the "Action API") that predates modern REST, GraphQL, etc. and generally allows fetching information based on how it's organized in the database. It is optimized for bulk lookups, for example to get two pages' basic metadata (prop=info), you just set &titles=Albert Einstein|Taylor Swift (example).

However, when using this API, the guidelines are to make requests in series, not parallel, or in other words, use a concurrency of 1. Not to worry, there's a newer Wikimedia REST API (aka "RESTBase") that allows for concurrency up to 200 req/s, which makes it a great fit for writing our fearlessly concurrent bots. As a bonus, the REST API provides page content in annotated HTML format, which means it can be interpreted using any HTML parser rather than needing a dedicated wikitext parser, but that's a topic for another blog post.

Let's look at an example of a rather simple Wikipedia bot that I recently ported from Python to concurrent Rust. The bot's task is to create redirects for SCOTUS cases that don't have a period after "v" between parties. For example, it would create a redirect from Marbury v Madison to Marbury v. Madison. If someone leaves out the period while searching, they'll still end up at the correct article instead of having to find it in the search results. You can see the full source code on GitLab. I omitted a bit of code to focus on the concurrency aspects.

First, we get a list of all the pages in Category:United States Supreme Court cases.

let mut gen = categorymembers_recursive(
    &bot,
    "Category:United States Supreme Court cases",
);

Under the hood, this uses an API generator, which allows us to get the same prop=info metadata in the same request we are fetching the list of pages. This metadata is stored in the Page instance that's yielded by the page generator. Now calls to page.is_redirect() can be answered immediately without needing to make a one-off HTTP request (normally it would be lazy-loaded).
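
Roughly speaking, the underlying Action API request looks like this (hand-written and abbreviated here, not the exact query mwbot-rs sends):

action=query&generator=categorymembers&gcmtitle=Category:United States Supreme Court cases&gcmlimit=max&prop=info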

The next part is to spawn a Tokio task for each page.

let mut handles = vec![];
while let Some(page) = gen.recv().await {
    let page = page?;
    let bot = bot.clone();
    handles.push(tokio::spawn(
        async move { handle_page(&bot, page).await },
    ));
}

The mwbot::Bot type keeps all of its data wrapped in Arc<T>, which makes it cheap to clone since we're not actually cloning the underlying data. We keep track of each JoinHandle returned by tokio::spawn() so that, once every task has been spawned, we can await each one and the program only exits after all tasks have finished. We can also access the return value of each task, which in this case is Result<bool>, where the boolean indicates whether the redirect was created, allowing us to print a closing message saying how many new redirects were created.
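
The tail end of the bot then looks roughly like this (a sketch rather than the actual code; it assumes the surrounding function returns an anyhow-style Result so the two ? operators can propagate both join errors and task errors):

let mut created = 0;
for handle in handles {
    // First ? surfaces a panicked/cancelled task (JoinError),
    // second ? surfaces an error returned by handle_page itself.
    if handle.await?? {
        created += 1;
    }
}
println!("Created {} new redirect(s)", created);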

Now let's look at what each task does. The next three code blocks make up our handle_page function.

// Should not create if it's a redirect
if page.is_redirect().await? {
    println!("{} is redirect", page.title());
    return Ok(false);
}

First we check that the page we just got is not a redirect itself, as we don't want to create a redirect to another redirect. As mentioned earlier, the page.is_redirect() call does not incur a HTTP request to the API since we already preloaded that information.

let new = page.title().replace(" v. ", " v ");
let newpage = bot.page(&new)?;
// Create if it doesn't exist
let create = match newpage.html().await {
    Ok(_) => false,
    Err(Error::PageDoesNotExist(_)) => true,
    Err(err) => return Err(err.into()),
};
if !create {
    return Ok(false);
}

Now we create a new Page instance that has the same title, but with the period removed. We need to make sure this page doesn't exist before we try to create it. We could use the newpage.exists() function, except it will make an HTTP request to the Action API since the page doesn't have that metadata preloaded. Even worse, the Action API limits us to a concurrency of 1, so any task that has made it this far now loses the concurrency benefit we were hoping for.

So, we'll just cheat a bit by making a request for the page's HTML, served by the REST API that allows for the 200 req/s concurrency. We throw away the actual HTML response, but it's not that wasteful given that in most cases we either get a very small HTML response representing the redirect or we get a 404, indicating the page doesn't exist. Issue #49 proposes using a HEAD request to avoid that waste.

let target = page.title();
// Build the new HTML to be saved
let code = { ... };
println!("Redirecting [[{}]] to [[{}]]", &new, target);
newpage.save(code, ...).await?;
Ok(true)

Finally we build the HTML to be saved and then save the page. The newpage.save() function calls the API with action=edit to save the page, which limits us to a concurrency of 1. That's not actually a problem here: by Wikipedia policy, bots are generally supposed to pause 10 seconds in between edits if there is no urgency to them (in contrast to, say, an anti-vandalism bot, which wants to revert bad edits as fast as possible). This is mostly to avoid cluttering up the feed of recent changes that humans patrol. So regardless of how fast we can process the few thousand SCOTUS cases, we still have to wait 10 seconds in between each edit we want to make.

Despite that forced delay, the concurrency will still make most bots faster. If the few redirects we need to create appear at the end of our categorymembers queue, we would previously have had to process all the pages that come before them just to save one or two at the end. Now that everything is processed concurrently, the edits happen pretty quickly, despite being at the end of the queue.

The example we looked through was also quite simple; other bots can have handle_page functions that take longer than 10 seconds because they have to fetch multiple pages, in which case the concurrency really helps. My archiveindexer bot operates on a list of pages; for each page, it fetches all the talk page archives and builds an index of the discussions on them, which can easily mean pulling 5 to 50 pages depending on how controversial the subject is. The original Python version of this code took about 3-4 hours; the new concurrent Rust version finishes in 20 minutes.

The significant flaw in this goal of concurrent bots is that the Action API limits us to a concurrency of 1 request at a time. The cheat we did earlier requires intimate knowledge of how each underlying function works with the APIs, which is not a reasonable expectation, nor is it a good optimization strategy since it could change underneath you.

One of the strategies we are implementing to work around this is to combine compatible Action API requests. Since the Action API does bulk lookups really well, we can intercept multiple similar requests (for example, page.exists()), merge all the parameters into one request, send it off, and then split the response back up to each original caller. This lets us serve multiple tasks' requests while still only sending one request to the server.
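
To make the idea concrete, here's a very rough sketch of request coalescing with Tokio channels. None of these types exist in mwbot-rs (ExistsRequest and bulk_exists are made up for illustration); it only shows the shape of the technique: collect whatever lookups are currently waiting, answer them with one bulk request, and fan the results back out.

use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

// One caller's question, plus the channel to send the answer back on.
struct ExistsRequest {
    title: String,
    respond: oneshot::Sender<bool>,
}

// Stand-in for a single bulk prop=info lookup covering every title in the batch.
async fn bulk_exists(titles: &[String]) -> HashMap<String, bool> {
    titles.iter().map(|t| (t.clone(), true)).collect()
}

async fn batcher(mut rx: mpsc::Receiver<ExistsRequest>) {
    while let Some(first) = rx.recv().await {
        // Start a batch with the first waiting request, then drain
        // whatever else is already queued.
        let mut batch = vec![first];
        while let Ok(req) = rx.try_recv() {
            batch.push(req);
        }
        let titles: Vec<String> = batch.iter().map(|r| r.title.clone()).collect();
        let results = bulk_exists(&titles).await;
        // Fan the combined answer back out to each original caller.
        for req in batch {
            let exists = results.get(&req.title).copied().unwrap_or(false);
            let _ = req.respond.send(exists);
        }
    }
}

A caller sends an ExistsRequest with a fresh oneshot channel and awaits the receiving half, so from its point of view the lookup still looks like a single async call.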

This idea of combining requests is currently behind an unstable feature flag while some edge cases are still being worked out. Credit for this idea goes to Lucas Werkmeister, who pioneered it in his m3api project.

If this work interests you, the mwbot-rs project is always looking for more contributors; please reach out, either on-wiki, on GitLab, or in the #wikimedia-rust:libera.chat room (Matrix or IRC).