The Wikimania Festival goes global

16:53, Tuesday, 12 July 2022 UTC

From 11-14 August, the Wikimania Festival will bring together communities from around the world to celebrate free and open access to knowledge. Using a regionally-focused approach, we hope to welcome more contributors to our movement’s global flagship event. 

Registration opens soon. Here is a sneak peek of how the Festival will connect the local to the global.

Bringing regions together at the Wikimania Festival

The Wikimania Festival will have programming that rotates between three longitudinal zones: Asia-Oceania; the Americas; and Africa, Europe and the Middle East. Each day will be scheduled to accommodate the zone being featured, and organized to maximize overlap with other timezones. Each day will feature additional languages of the zone, and will spotlight the local and regional communities working within it. We expect each day to be unique and fun, giving Wikimania participants a bird’s-eye view of the dynamics, trends, and contributions of the different broader areas of the movement.

From seven to thirteen languages: welcoming more communities than ever before

Last year marked the first time Wikimania was interpreted live into multiple languages. This year, we are almost doubling the number of languages supported at the virtual event: from seven to thirteen. In addition to the core United Nations languages (Arabic, Chinese, English, French, Russian and Spanish), the Wikimania Festival will provide live interpretation on a rotating basis into Hindi, Indonesian, Japanese, Brazilian Portuguese, Swahili, Turkish, and Ukrainian. These languages were selected based on the number of global speakers (including whether the language serves as a common regional language), the size of the active editor community, the extent of affiliate presence, and responses to the Wikimania 2022 survey. While we would love to offer support to all of our language communities, we look forward to experimenting with this approach to enable live participation in more languages than ever before.

Rotating schedule blocks, so that everyone gets prime time

The virtual programming for Wikimania 2022 will cater to participants across all timezones. The approach is to optimize for the zone being spotlighted on that day, while also providing times that work in different zones. We wanted to plan with the global whole in mind, building a schedule that spans many hours to allow for each region to connect with every other region.

Day 1 will open with a single schedule block that covers after-work hours across Asia-Oceania, and also overlaps with other zones, allowing people across regions to be part of the kickoff. Day 2 will comprise two schedule blocks, during the day and in the evening for the Americas, separated by the Hackathon. Day 3 will also comprise two blocks, during the day and in the evening for Europe, separated by the Hackathon. The Global Day will have programming separated by global festivities. Days 3 and 4 will take advantage of the weekend with an additional hour of Wikimania programming. On every day, the networking space will be available for a number of hours beyond the program.

Day 1 – Asia-Oceania (August 11)
  • Hours of programming – Wikimania: 10:00 – 15:00 UTC
  • Core language live interpretation: Arabic, Chinese, French, Spanish, Russian
  • Additional language live interpretation: Hindi, Indonesian, Japanese

Day 2 – Americas (beginning August 12)
  • Hours of programming – Wikimania: 14:00 – 16:00 UTC; Hackathon: 16:00 – 22:00 UTC; Wikimania: 22:00 – 24:00 UTC
  • Core language live interpretation: Arabic, Chinese, French, Spanish, Russian
  • Additional language live interpretation: Brazilian Portuguese

Day 3 – Africa, Europe, Middle East (August 13)
  • Hours of programming – Wikimania: 9:00 – 12:00 UTC; Hackathon: 12:00 – 17:00 UTC; Wikimania: 17:00 – 20:00 UTC
  • Core language live interpretation: Arabic, Chinese, French, Spanish, Russian
  • Additional language live interpretation: Swahili, Turkish, Ukrainian

Day 4 – Global Day (August 14)
  • Hours of programming – Wikimania: 10:00 – 13:00 UTC; Global festivities: 13:00 – 16:00 UTC; Wikimania: 16:00 – 19:00 UTC
  • Core language live interpretation: Arabic, Chinese, French, Spanish, Russian
  • Additional language live interpretation: Hindi, Indonesian, Japanese, Brazilian Portuguese, Swahili, Turkish, Ukrainian

Speakers can present in any of these languages and receive live interpretation into English, and speakers presenting in English will receive live interpretation into all of these languages.

Local in-person events

As part of a hybrid Wikimania, Wikimedia affiliates around the globe will be hosting interactive in-person events, from watch parties to edit-a-thons, to slumber parties and picnics. Many of these events will be open to the wider movement virtually. These in-person events may plug in directly to the virtual event, or provide additional hours of programming or even support in additional languages. Details about these events will be published as part of the program in the coming weeks.

Ready to go?

Registration opens soon, so keep an eye out! Follow us here on Diff or on Twitter, Instagram, or Facebook for the latest updates.

Articles of the Universal Declaration of Human Rights written in chalk on the steps of Colchester Campus. University of Essex, CC BY 2.0, via Flickr

Last year, the Wikimedia Foundation affirmed its belief that knowledge is a human right with the announcement of our new Human Rights Policy. The policy was an important step forward in recognizing the role Wikimedia projects play in advancing human rights, and where the Foundation can be accountable for protecting the human rights of all people who use Wikimedia projects.

When we released the new policy, we made a commitment to share a Human Rights Impact Assessment that further informed our Human Rights Policy and wider human rights work as an organization. The assessment evaluated whether and how Wikimedia projects, platforms, and activities might cause or facilitate inadvertent human rights harms to Wikimedia volunteers, Foundation employees, readers, and others affected directly or indirectly by free knowledge projects. Today, we’re sharing that assessment on Meta-Wiki. 

The assessment outlines human rights risks to Wikimedia projects according to five categories: harmful content, harassment, government censorship and surveillance, risks to child rights, and limits to knowledge equity. Importantly, this report contains 59 recommendations for the Foundation. The Foundation has already implemented some recommendations, like adopting a human rights policy and hiring a Human Rights Lead. However, not all of the report’s recommendations may be feasible. We welcome feedback and discussion from volunteers, affiliates, and other movement stakeholders to determine which recommendations the Foundation should prioritize in order to benefit Wikimedia projects and the wider movement.

About the Human Rights Impact Assessment 

The assessment was finalized and shared with the Foundation in July 2020. In the two years since then, both Article One (the agency that produced the report for us) and Foundation staff carried out a comprehensive review of the report to remove or generalize any information that could put individuals or Wikimedia projects at risk. In that time period, the Foundation also took steps to advance human rights work, including some recommendations in the assessment, which aligned with the Foundation’s existing priorities. 

Some of these steps include: 

  • Strengthening human rights expertise at the Foundation, including the creation of a Human Rights team
  • Approving our Human Rights Policy, which commits the Foundation to:

    • conducting ongoing human rights due diligence
    • tracking and publicly reporting on our efforts to meet our human rights commitments
    • using our influence with partners, the private sector, and governments to advance and uphold respect for human rights, and
    • providing access to effective remedies when harms have occurred
  • Continuing human rights due diligence efforts
  • Mitigating the impacts of disinformation on Wikimedia projects in partnership with volunteers

These initial steps will allow the Foundation to continue responding meaningfully to the recommendations and findings in the report.

How Wikimedians Can Engage with the Assessment

This assessment can help all stakeholders in the Wikimedia movement to better understand the human rights risks and threats that we jointly face, and the work required to address those risks. By better understanding these risks, the Foundation, volunteers, and affiliates can work together to protect both our movement and our people. To this end, we have translated the assessment’s foreword and executive summary into Arabic, Chinese, French, Russian, and Spanish. 

The Foundation will continue to partner with the communities to learn more about current and emerging human rights threats and our role in responding to these threats. In May 2022, the Public Policy and Global Advocacy team hosted a series of regionally-focused community conversations to begin this dialogue, but there will be more opportunities in the future. This month, July 2022, Wikimedians can join the following events to ask questions about this assessment, provide feedback, and raise other human rights concerns:

Wikimedians can also share their feedback about this assessment and the risks and recommendations it identifies on the Movement Strategy Forum.

In the long run, mitigating the risks identified in our Human Rights Impact Assessment is key to the health, vitality, and sustainability of the Wikimedia movement. Disinformation, government surveillance, and censorship represent existential threats to freedom of expression, and also to the broader free knowledge community. Taking steps to reduce the potential harm these and other threats can have on the projects is necessary to protect the projects’ independence as well as the people who make these projects flourish. 

If you would like to raise a human rights-related concern to the Foundation, you can email [email protected] to reach the Human Rights Team. If you have reason to believe that your life is in immediate danger, contact [email protected].

UNLOCK 2022: New ideas addressing knowledge equity

14:24, Tuesday, 12 July 2022 UTC
Accessible Knowledge

In the third edition of the Wikimedia Accelerator UNLOCK, seven innovative, bold, engaging and diverse projects will be supported as they go through the accelerator. The UNLOCK program team is very excited to welcome passion-driven project teams with a social entrepreneurial spirit and a goal of achieving more equitable access to knowledge, information and data.

UNLOCK is a program by Wikimedia Deutschland (WMDE). For this year’s edition, we joined forces with Wikimedia Serbia (WMRS) to promote and strengthen cross-regional and cross-affiliate collaboration. We also teamed up with Impact Hub Belgrade – an organization from the innovation ecosystem.

New ideas addressing knowledge equity: UNLOCK 2022 projects

Seven innovative, bold, engaging and diverse projects – with a total of 24 participants from Albania, Germany, Montenegro and Serbia – will be supported within the UNLOCK program. Throughout the selection process, these projects convinced us with new perspectives on free knowledge – and on addressing knowledge equity – in different regional as well as thematic contexts:

  • activist.org – an open source platform that breaks down barriers to becoming politically active and thereby connects people and organizations from different social and activist movements. 
  • Game of political participation – encouraging young people in the Western Balkans to familiarize themselves with political decision making and political systems through elements of gamification.
  • f[ai]r – establishing an ethics certification for digital applications through a holistic examination of the AI system in the social context, addressing aspects of bias, discrimination, diversity and inclusion.
  • Inclusio – providing user-generated audio descriptions of visual content to the blind and visually impaired. Ideally the solution could be connected and tested on structured data in Wikimedia Commons.
  • macht.sprache. – fostering politically sensitive translation through an open source platform that allows for crowdsourcing and discussing politically sensitive terms and their translations, and through a tool to help translate with sensitivity.
  • MOCI SPACE – a digital space to connect activists, grassroots initiatives and civil society actors in the Western Balkans and that allows for co-creating, publishing and sharing knowledge by making use of the Matrix protocol for federated communication.
  • P2P Wiki for indigenous wisdom and biodiversity – an open source tool to collect and safeguard indigenous knowledge, and to raise awareness about biodiversity with a P2P offline-first methodology.

Looking back: The selection process

For this year’s open call we received 34 applications from 12 countries and a total of 104 applicants. Out of these 104, 53% are from the Western Balkans region, 24% from the German-speaking area and 23% “others” (incl. people from European and non-European countries; plus those without information). The selection process included several review stages: 

Stage 1 – Initial review of all applications based on their program fit: While many outstanding project ideas exist, not all of them fit into our program scope and focus. The criterion “program fit” aims at figuring out whether UNLOCK is the right program for the project in question, considering the status of the project and the fulfillment of our basic requirements for participation.

Stage 2 – Detailed assessment of all applications based on their idea fit, and ranking of recommended candidates to be shortlisted: The criterion “idea fit” aims at finding projects and initiatives that make a valuable contribution to the thematic focus of the program: Knowledge Equity. We are looking for fresh ideas and new concepts that are not only feasible, but also have great potential for impact, demonstrate awareness of and differentiation from other, similar projects, and properly meet the needs of the selected target groups. For this, we involved a jury who acted as an advisory board, lending their diverse expertise, experience and knowledge to the process. They supported us in properly assessing the applications, and shortlisted 10 teams for the final assessment stage (stage 3). The members of this year’s jury included:

  • Greta Doçi, Software Engineer at Nextcloud & Board Member Wikimedians of Albanian Language User Group
  • Hanna Petruschat, Head of Design at Wikimedia Deutschland
  • Gaia Montelatici, Co-Founder & CEO at Impact Hub Belgrade
  • Giulia Berti, Program Manager at Impact Hub Belgrade

Stage 3 – Getting to know the shortlisted project teams in an online call, checking for team fit and clarifying any open questions: While we see it as our mission to pass certain values and mindsets on to the teams during the program, the criterion “team fit” indicates what the ideal team should bring along right from the start. Participants should combine all the skills necessary for the development of their idea, but we are also looking for drive, a collaborative mindset and social spirit.

The information received in those three stages helped us – the program team – make an informed decision about which teams and projects would be supported.

Looking forward: The UNLOCK journey

Last Thursday (June 30) and Friday (July 1), we virtually launched the official program together with all 24 participants. On these two days we set the stage for the program – highlighting the purpose and desired outcomes, clarifying expectations, roles and responsibilities, and jointly co-creating with everyone involved a playbook, including values, that should guide us through the upcoming journey. This program is not a typical grant/funding program of the kind that exists within our movement; instead it’s a support environment with several key elements that will help projects to be validated, tested and prototyped. In the next four months we will support the teams with coaching, a variety of workshops with experts, regular exchanges and funding. WMDE and WMRS will be responsible for designing and facilitating the cross-team events and workshops, whereas Impact Hub Belgrade will provide need-oriented coaching and mentoring with regard to project and product development. At the end of the program, team members will have the chance to share their work at the Demo Day.

Support by the Movement Strategy Implementation Grant

The costs for the design and implementation of the UNLOCK Accelerator 2022 are covered by WMDE as well as by the Movement Strategy Implementation Grant. WMDE and WMRS jointly applied for the grant. The requested and approved grant of EUR 74.062,40 will be used to cover the costs incurred by WMRS and Impact Hub Belgrade only. This includes personnel and operating costs that are necessary for their respective involvement in the implementation of the program. This grant is of great importance to the partners, as their own budgets are very tight or have been allocated elsewhere and other in-kind options are not possible. More details about the grant can be found here.

More insights and updates

We also encourage you to take a look at the recent episode of the WIKIMOVE podcast, where we talked about the UNLOCK accelerator program and how it is being implemented in collaboration with WMRS and WMDE this year, and explored how the movement can become more of an innovation ecosystem.

Updates about the projects and the program itself can be found on the UNLOCK website. Or feel free to follow us on UNLOCK Twitter or UNLOCK LinkedIn.

Making Instant Commons Quick

08:46, Tuesday, 12 July 2022 UTC

 The Wikimedia family of websites includes one known as Wikimedia Commons. Its mission is to collect and organize freely licensed media so that other people can re-use them. More pragmatically, it collects all the files needed by different language Wikipedias (and other Wikimedia projects) into one place.

 

The 2020 Wikimedia Commons Picture of the Year: Common Kingfisher by Luca Casale / CC BY SA 4.0

 As you can imagine, it's extremely useful to have a library of freely licensed photos that you can just use to illustrate your articles.

However, it is not just useful for people writing encyclopedias. It is also useful for any sort of project.

To take advantage of this, MediaWiki, the software that powers Wikipedia and friends, comes with a feature to use this collection on your own Wiki. It's an option you can select when installing the software and is quite popular. Alternatively, it can be manually configured via $wgUseInstantCommons or the more advanced $wgForeignFileRepos.
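For context, here is a minimal sketch of both options in LocalSettings.php. The $wgForeignFileRepos keys follow the ForeignAPIRepo settings documented on mediawiki.org, so treat the exact values below as illustrative rather than a drop-in config:

```php
// LocalSettings.php – two ways to pull files from Wikimedia Commons.

// Option 1: the simple switch.
$wgUseInstantCommons = true;

// Option 2: an explicit foreign repo entry; adjust values as needed.
$wgForeignFileRepos[] = [
    'class'                  => ForeignAPIRepo::class,
    'name'                   => 'commonswiki',
    'apibase'                => 'https://commons.wikimedia.org/w/api.php',
    'hashLevels'             => 2,
    'fetchDescription'       => true,
    'descriptionCacheExpiry' => 43200, // 12 hours
    'apiThumbCacheExpiry'    => 0,     // 0 = hotlink thumbnails (see footnote ¹)
];
```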

The Issue

Unfortunately, instant commons has a reputation for being rather slow. As a weekend project I thought I'd measure how slow, and see if I could make it faster.

How Slow?

First things first, I'll need a test page. Preferably something with a large (but not extreme) number of images but not much else. A Wikipedia list article sounded ideal. I ended up using the English Wikipedia article: List of Governors General of Canada (Long live the Queen!). This has 85 images and not much else, which seemed perfect for my purposes.

I took the expanded Wikitext from https://en.wikipedia.org/w/index.php?title=List_of_governors_general_of_Canada&oldid=1054426240&action=raw&templates=expand, pasted it into my test wiki with instant commons turned on in the default config.

And then I waited...

Then I waited some more...

1038.18761 seconds later (17 minutes, 18 seconds) I was able to view a beautiful list of all my viceroys.

Clearly that's pretty bad. 85 images is not a small number, but it is definitely not a huge number either. Imagine how long [[Comparison_of_European_road_signs]] would take with its 3643 images or [[List_of_paintings_by_Claude_Monet]] with 1676.

Why Slow?

This raises the obvious question: why is it so slow? What is it doing for all that time?

When MediaWiki turns wikitext into html, it reads through the text. When it hits an image, it stops reading through the wikitext and looks for that image. The image may be cached, in which case MediaWiki can go back to rendering the page right away. Otherwise, it has to actually find it: first it checks the local DB to see if the image is there; if not, it looks at foreign image repositories, such as Commons (if configured).

To see if commons has the file we need to start making some HTTPS requests¹:

  1. We make a metadata request to see if the file is there and get some information about it: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
  2.  We make an API request to find the url for the thumbnail of the size we need for the article. For commons, this is just to find the url, but on wikis with 404 thumbnail handling disabled, this is also needed to tell the wiki to generate the file we will need: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  3. Some devices now have very high resolution screens. Screen displays are made up of dots. High resolution screens have more dots per inch, and thus can display finer detail. Traditionally 1 pixel equalled one dot on the screen. However if you keep that while increasing the dots-per-inch, suddenly everything on the screen that was measured in pixels is very small and hard to see. Thus these devices now sometimes have 1.5 dots per pixel, so they can display fine detail without shrinking everything. To take advantage of this, we use an image 1.5 times bigger than we normally would, so that when it is displayed at its normal size, we can take advantage of the extra dots and display a much clearer picture. Hence we need the same image but 1.5x bigger: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=450&iiurlheight=-1&iiurlparam=450px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  4. Similarly, some devices are even higher resolution and use 2 dots per pixel, so we also fetch an image double the normal size:  https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=600&iiurlheight=-1&iiurlparam=600px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en

 

This is the first problem - for every image we include we have to make 4 api requests. If we have 85 images that's 340 requests.
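To make the shape of that traffic concrete, here is a rough sketch of the per-image pattern just described. This is not MediaWiki's actual code, only an illustration of one metadata query followed by three thumbnail-size queries, each waiting for the previous response:

```php
<?php
// Illustration only: the four sequential API calls made for a single file.
$api  = 'https://commons.wikimedia.org/w/api.php';
$file = 'File:Example.png';

function apiGet( string $api, array $params ): array {
    $params += [ 'action' => 'query', 'format' => 'json', 'prop' => 'imageinfo' ];
    // Each call opens a fresh HTTPS connection, which is exactly the
    // per-request TCP/TLS overhead discussed below.
    $json = file_get_contents( $api . '?' . http_build_query( $params ) );
    return json_decode( $json, true );
}

// 1. Metadata: does the file exist, and what are its properties?
$meta = apiGet( $api, [
    'titles' => $file,
    'iiprop' => 'timestamp|user|comment|url|size|sha1|mime|mediatype|extmetadata',
] );

// 2–4. One request per thumbnail size: 1x, 1.5x and 2x of the display width.
foreach ( [ 300, 450, 600 ] as $width ) {
    $thumb = apiGet( $api, [
        'titles'     => $file,
        'iiprop'     => 'url|timestamp',
        'iiurlwidth' => $width,
    ] );
}
```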

Latency and RTT

It gets worse. All of these requests are done in serial. Before doing request 2, we wait until we have the answer to request 1. Before doing request 3 we wait until we get the answer to request 2, and so on.

Internet speed can be measured in two ways - latency and bandwidth. Bandwidth is the usual measurement we're familiar with: how much data can be transferred in bulk - e.g. 10 Mbps.

Latency, ping time or round-trip-time (RTT) is another important measure - it's how long it takes your message to get somewhere and come back.

When we start to send many small messages in serial, latency starts to matter. How big your latency is depends on how close you are to the server you're talking to. For Wikimedia Commons, the data-centers (DCs) are located in San Francisco (ulsfo), Virginia (eqiad), Texas (codfw), Singapore (eqsin) and Amsterdam (esams). For example, I'm relatively close to SF, so my ping time to the SF servers is about 50ms. For someone with a 50ms ping time, all this back and forth will take at a minimum 17 seconds just from latency.

However, it gets worse: your computer doesn't just ask for the page and get a response back, it has to set up the connection first (TCP & TLS handshake). This takes additional round-trips.

Additionally, not all data centers are equal. The Virginia data-center (eqiad)² is the main data center which can handle everything; the other DCs only have Varnish servers and can only handle cached requests. This makes browsing Wikipedia when logged out very speedy, but the type of API requests we are making here cannot be handled by these caching DCs³. For requests they can't handle, they have to ask the main DC what the answer is, which adds further latency. When I tried to measure mine, I got 255ms, but I didn't measure very rigorously, so I'm not fully confident in that number. In our particular case, the TLS & TCP handshake are handled by the closer DC, but the actual API response has to be fetched all the way from the DC in Virginia.

But wait, you might say: surely you only need to do the TLS & TCP setup once if communicating with the same host. And the answer would normally be yes, which brings us to major problem #2: each connection is set up and torn down independently, requiring us to re-establish the TCP/TLS session each time. This adds 2 additional RTTs per request. In our 85 image example, we're now up to 1020 round-trips. If you assume 50ms to the caching DC and 255ms to Virginia (these numbers are probably quite idealized; there are probably other things I'm not counting), we're up to 2 minutes.
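Putting those numbers together (all taken from the text above: 85 images, 4 requests each, 2 setup round-trips plus 1 request round-trip per request, 50ms to the nearest caching DC and 255ms to Virginia) lands right around the two-minute mark:

```php
<?php
// Back-of-the-envelope latency estimate, using the numbers from the text.
$images         = 85;
$requestsPerImg = 4;      // metadata + 3 thumbnail sizes
$rttPerRequest  = 3;      // 2 RTTs for TCP+TLS setup, 1 RTT for the request itself
$edgeRtt        = 0.050;  // ~50 ms to the nearest caching DC (handles the handshake)
$originRtt      = 0.255;  // ~255 ms for the uncached API response from the main DC

$roundTrips = $images * $requestsPerImg * $rttPerRequest;   // 1020 round-trips
$perRequest = 2 * $edgeRtt + $originRtt;                    // 0.355 s per request
$total      = $images * $requestsPerImg * $perRequest;      // ≈ 121 s

echo "$roundTrips round-trips, roughly " . round( $total ) . " seconds\n";
```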

To put it altogether, here is a diagram representing all the back and forth communication needed just to use a single image:

12 RTT per image used! This is assuming TLS 1.3. Earlier versions of TLS would be even worse.

Introducing HTTP/2

In 2015, HTTP/2 came on the scene. This was the first major revision to the HTTP protocol in almost 20 years.

The primary purpose of this revision of HTTP was to minimize the effect of latency when you are requesting many separate small resources around the same time. It works by allowing a single connection to be reused for many requests at the same time and allowing the responses to come in out of order or jumbled together. In HTTP/1.1 you can sometimes be stuck waiting for some request to finish before being allowed to start on the next one (head-of-line blocking), resulting in inefficient use of network resources.

This is exactly the problem that instant commons was having.

Now I should be clear, instant commons wasn't using HTTP/1.1 in a very efficient way, and it would be possible to do much better even with HTTP/1.1. However, HTTP/2 will still be that much better than what an improved usage of HTTP/1.1 would be.

Changing instant commons to use HTTP/2 changed two things:

  1. Instead of creating a new connection each time, with multiple round trips to set up TCP and TLS, we just use a single HTTP/2 connection that only has to do the setup once.
  2. If we have multiple requests ready to go, send them all off at once instead of having to wait for each one to finish before sending the next one.

We still can't do all requests at once, since the MediaWiki parser is serial and stops parsing once we hit an image, so we need to get information about the current image before we know which image we need next. However, this still helps: for each image, the 4 requests (metadata, thumbnail, 1.5dpp thumbnail and 2dpp thumbnail) can now be sent in parallel.
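Here is a sketch of what that per-image batch can look like using PHP's curl_multi interface. This is illustrative only, not the extension's actual code: one connection, four requests dispatched together, and libcurl multiplexing them over the same HTTP/2 connection.

```php
<?php
// Illustrative sketch: the four per-image API requests sent concurrently
// over a single HTTP/2 connection, instead of one at a time over fresh ones.
$api = 'https://commons.wikimedia.org/w/api.php?action=query&format=json'
    . '&prop=imageinfo&titles=' . rawurlencode( 'File:Example.png' );
$requestUrls = [
    $api . '&iiprop=' . rawurlencode( 'timestamp|user|url|size|sha1|mime|extmetadata' ),
    $api . '&iiprop=' . rawurlencode( 'url|timestamp' ) . '&iiurlwidth=300',
    $api . '&iiprop=' . rawurlencode( 'url|timestamp' ) . '&iiurlwidth=450',
    $api . '&iiprop=' . rawurlencode( 'url|timestamp' ) . '&iiurlwidth=600',
];

$mh = curl_multi_init();
// Let libcurl multiplex many requests over one HTTP/2 connection.
curl_multi_setopt( $mh, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX );

$handles = [];
foreach ( $requestUrls as $url ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS );
    curl_multi_add_handle( $mh, $ch );
    $handles[] = $ch;
}

// Drive all transfers to completion; responses may arrive in any order.
do {
    curl_multi_exec( $mh, $running );
    if ( $running ) {
        curl_multi_select( $mh );
    }
} while ( $running > 0 );

$responses = [];
foreach ( $handles as $ch ) {
    $responses[] = json_decode( curl_multi_getcontent( $ch ), true );
    curl_multi_remove_handle( $mh, $ch );
    curl_close( $ch );
}
curl_multi_close( $mh );
```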


The results are impressive for such a simple change. Where previously my test page took 17 minutes, now it only takes 2 (139 seconds).


Transform via 404

In vanilla MediaWiki, you have to request a specific thumbnail size before fetching it; otherwise it might not exist. This is not true on Wikimedia Commons. If you fetch a thumbnail that doesn't exist, Wikimedia Commons will automatically create it on the spot. MediaWiki calls this feature "TransformVia404".

In instant commons, we make requests to create thumbnails at the appropriate sizes. This is all pointless on a wiki where they will automatically be created on the first attempt to fetch them. We can just output <img> tags, and the first user to look at the page will trigger the thumbnail creation. Thus skipping 3 of the requests.
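To illustrate the idea, one way to derive such a thumbnail URL is directly from the file name, using the hashed /thumb/ path layout Commons currently uses (md5 of the underscore form of the name). That layout is a server-side implementation detail and differs for some file types (SVGs, PDFs and videos get extra suffixes), so take this as a sketch of the concept rather than a guaranteed interface:

```php
<?php
// Sketch: build a Commons thumbnail URL without any API round-trip.
// Works for ordinary raster files; SVG, PDF, video etc. use extra suffixes.
function commonsThumbUrl( string $name, int $width ): string {
    $name = str_replace( ' ', '_', $name );  // "My File.png" -> "My_File.png"
    $hash = md5( $name );                    // Commons shards files by this hash
    return 'https://upload.wikimedia.org/wikipedia/commons/thumb/'
        . $hash[0] . '/' . substr( $hash, 0, 2 ) . '/'
        . rawurlencode( $name ) . '/' . $width . 'px-' . rawurlencode( $name );
}

// The first request for this URL makes Commons render the 300px thumbnail.
echo commonsThumbUrl( 'Example.png', 300 ), "\n";
```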

Adding this optimization took the time down from 139 seconds with just HTTP/2 to 18.5 seconds with both this and HTTP/2. This is 56 times faster than what we started with!



Prefetching

18.5 seconds is pretty good. But can we do better?

We might not be able to if we actually have to fetch all the images, but there is a pattern we can exploit.

Generally when people edit an article, they might change a sentence or two, but often don't alter the images. Other times, MediaWiki might re-parse a page even if there are no changes to it (e.g. due to a cache expiry). As a result, often the set of images we need is the same as, or close to, the set that we needed for the previous version of the page. This set is already recorded in the database, in order to display which pages use an image on the image description page.

We can use this. First we retrieve this list of images used on the (previous version) of the page. We can then fetch all of these at once, instead of having to wait for the parser to tell us one at a time which image we need.
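A sketch of that prefetch step is below. The imagelinks table (il_from, il_to) is MediaWiki's real record of which files a page used when it was last parsed; $wikiPage is assumed to be the page about to be re-parsed, and the batch-fetch helper at the end is a hypothetical stand-in for firing all of those lookups as one parallel batch:

```php
// Sketch only. $wikiPage is the page about to be re-parsed; the
// prefetchForeignFileInfo() helper is hypothetical.
$pageId = $wikiPage->getId();

$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
    'imagelinks',              // files used by the previous rendering of the page
    'il_to',
    [ 'il_from' => $pageId ],
    __METHOD__
);

$titles = [];
foreach ( $res as $row ) {
    $titles[] = 'File:' . $row->il_to;
}

// Fire one parallel batch for all of these in the background, so the parser
// rarely has to stop and wait when it reaches each image later.
prefetchForeignFileInfo( $titles );
```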

It is possible of course, that this list could be totally wrong. Someone could have replaced all the images on the page. If it's right, we speed up by pre-fetching everything we need, all in parallel. If it's wrong, we fetched some things we didn't need, possibly making things slower than if we did nothing.

I believe in the average case, this will be a significant improvement. Even in the case that the list is wrong, we can send off the fetch in the background while MediaWiki does other page processing - the hope being, that MediaWiki does other stuff while this fetch is running, so if it is fetching the wrong things, time is not wasted.

On my test page, using this brings the time to render (where the previous version had all the same images) down to 1.06 seconds. A 980 times speed improvement! It should be noted that this is the time to render in total, not just the time to fetch images, so most of that time is probably related to rendering other stuff and not instant commons.

Caching

All the above is assuming a local cache miss. It is wasteful to request information remotely, if we just recently fetched it. It makes more sense to reuse information recently fetched.

In many cases, the parser cache, which in MediaWiki caches the entire rendered page, will mean that instant commons isn't called that often. However, some extensions that create dynamic content make the parser cache very short lived, which makes caching in instant commons more important. It is also common for people to use the same images on many pages (e.g. A warning icon in a template). In such a case, caching at the image fetching layer is very important.

There is a downside though: we have no way to tell if upstream has modified the image. This is not that big a deal for most things. Exif data being slightly out of date does not matter that much. However, if the aspect ratio of the image changes, then the image will appear squished until InstantCommons' cache is cleared.

To balance these competing concerns, Quick InstantCommons uses an adaptive cache. If the image has existed for a long time, we cache for a day (configurable). After all, if the image has been stable for years, it seems unlikely that it is going to be edited very soon. However, if the image has been edited recently, we use a dynamically determined shorter time to live. The idea being: if the image was edited 2 minutes ago, there is a much higher possibility that it might be edited a second time. Maybe the previous edit was vandalism, or maybe it just got improved further.
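As an illustration (not the extension's exact formula), an adaptive TTL of this kind can be as simple as caching for a fraction of the file's age, clamped between a short floor and the configured one-day maximum:

```php
// Illustrative sketch of an adaptive cache TTL.
function adaptiveCacheTtl( int $lastEditTimestamp, int $maxTtl = 86400 ): int {
    $age = max( 1, time() - $lastEditTimestamp );   // seconds since the last edit
    // Cache for one tenth of the file's age, clamped to [60 s, $maxTtl]:
    // a file edited 2 minutes ago is cached for about a minute, while a file
    // untouched for years gets the full day.
    return (int)min( $maxTtl, max( 60, intdiv( $age, 10 ) ) );
}
```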

As the cache entry for an image gets close to expiring, we refetch it in the background. The hope is that we can use the soon-to-expire version now while refetching in the background as MediaWiki processes other things, so that next time we have a new version without ever having to stall on the download while MediaWiki is blocked waiting for the image's information. That way things are kept fresh without a negative performance impact.

MediaWiki's built-in instant commons did support caching, however it wasn't configurable and the default time to live was very low. Additionally, the adaptive caching code had a bug in it that prevented it from working correctly. The end result was that often the cache could not be effectively used.

Missing MediaHandler Extensions

In MediaWiki's built-in InstantCommons feature, you need to have the same set of media extensions installed to view all files. For example, PDFs won't render via instant commons without Extension:PDFHandler.

This is really unnecessary where the file type just renders to a normal image. After all, the complicated bit is all on the other server. My extension fixes that, and does its best to show thumbnails for file types it doesn't understand. It can't support advanced features without the appropriate extension (e.g. navigating 3D models), but it will show a static thumbnail.

Conclusion

In the end, by making a few, relatively small changes, we were able to improve the performance of instant commons significantly. 980 times as fast!

Do you run a MediaWiki wiki? Try out the extension and let me know what you think.

Footnotes:

¹ This is assuming default settings and an [object] cache miss. This may be different if $wgResponsiveImages is false in which case high-DPI images won't be fetched, or if apiThumbCacheExpiry is set to non-zero in which case thumbnails will be downloaded locally to the wiki server during the page parse instead of being hotlinked.


² This role actually rotates between the Virginia & Texas data centers. Additionally, the Texas DC (when not primary) does do some things that the caching DCs don't, but that isn't particularly relevant to this topic. There are eventual plans to have multiple active DCs, all of which would be able to respond to the type of API queries being made here, but they are not complete as of this writing – https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki


³ The MediaWiki API actually supports an smaxage=<number of seconds> (shared maximum age) url parameter. This tells the API server you don't care if your request is that many seconds out of date, and to serve it from varnish caches in the local caching data center if possible. Unlike with normal Wikipedia page views, there is no cache invalidation here, so it is rarely used and it is not used by instant commons.





Announcing the Wikimedia CEE Meeting 2022

20:45, Monday, 11 July 2022 UTC
Wikimedia CEE Meeting 2022 logo (Credit: Kiril Simeonovski and Nikola Stojanoski, CC-BY-SA 4.0)

After the previous two editions of the Wikimedia CEE Meeting took place online and the Wikimedia Foundation announced the lifting of the Covid Travel Policy earlier this year, Wikimedians from Central and Eastern Europe are looking forward to meeting in person again at the Wikimedia CEE Meeting 2022, which will be held from 14–16 October in Ohrid under the slogan “Bringing Back Together!” to highlight the reunion.

This is going to be the first Wikimedia CEE Meeting to take place as an in-person event during the COVID-19 pandemic, which means that participants will have to abide by the local measures imposed to contain the pandemic. Additionally, speakers unable to participate on-site due to a positive test shortly before the conference will be allowed to present online.

Call for submissions now open

The programme for the event will be mostly filled with session proposals received through a call for submissions running from 15 June to 31 July 2022, while there will also be a handful of sessions that will feature keynote speakers or will be formatted as plenary discussions on important topics from the Wikimedia movement.

Session proposals can be submitted as lectures, panels, workshops, lightning talks and roundtables with a standardised duration, as well as posters which will be printed and publicly displayed in the venue during the course of the conference. No tracks have been pre-defined in the submission process this year and interested speakers are left with the option to indicate the topics that their session proposals fit the best into (e.g. Education, Community Engagement, Partnerships etc.).

Following the successful implementation of language interpretation from English to Russian and vice versa during the online editions of the conference, this is expected to be the first in-person edition with language interpretation, which means that the event can also accommodate sessions in Russian alongside English.

Central Asian communities invited

The conference is open primarily to participants from the CEE communities and, as in past years, the organisers will award scholarships for up to two members of each affiliate or community from the CEE region, with affiliates having the opportunity to send additional delegates at their own cost. There will also be scholarships awarded to interested Wikimedians from outside the region, particularly to people whose submissions are accepted for inclusion in the conference programme.

In order to extend collaboration with like-minded communities from the adjacent regions that show similarities with the CEE communities, this year’s conference welcomes Wikimedians from Central Asia, specifically from the Uzbek, Kyrgyz and Tajik communities, and the organisers will award scholarships for up to two members from these communities as well.

The registration for the conference is open until 15 August 2022, and all interested participants have to fill in the registration form.

Tech/News/2022/28

20:32, Monday, 11 July 2022 UTC

Other languages: Bahasa Indonesia, Deutsch, English, español, français, italiano, magyar, polski, português, português do Brasil, suomi, svenska, čeština, русский, українська, עברית, العربية, فارسی, বাংলা, 中文, 日本語

Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.

Recent changes

Changes later this week

  • There is no new MediaWiki version this week.
  • Some wikis will be in read-only for a few minutes because of a switch of their main database. It will be performed on 12 July at 07:00 UTC (targeted wikis).

Future changes

Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.

This Month in GLAM: June 2022

03:49, Monday, 11 July 2022 UTC
  • Albania report: CEE Spring 2022 in Albania and Kosovo
  • Argentina report: In the middle of new projects
  • Australia report: A celebration, a commitment, an edit-a-thon: Know My Name returns for 2022
  • Belgium report: Heritage and Wikimedian in Residence
  • Brazil report: FIRST WikiCon Brazil & Three States of GLAM
  • Croatia report: Network(ing) effect(s)
  • France report: French open content report promotion
  • Italy report: Opening and closing projects in June
  • Kosovo report: Edit-a-thon with Kino Lumbardhi; DokuTech; CEE Spring 2022 in Albania and Kosovo
  • New Zealand report: West Coast Wikipedian at Large and Auckland Museum updates
  • Poland report: Wikipedian in residence in the National Museum in Cracow; The next online meeting within the cycle of monthly editing GLAM meetings; Steps to communicate GLAM partnerships better and involve the Wikimedian community
  • Sweden report: 100 000 memories from the Nordic Museum; Report from the Swedish National Archives
  • Switzerland report: Diversity in GLAM Program
  • UK report: Featured images and cultural diversity
  • USA report: Fifty Women Sculptors; Juneteenth Edit-a-thon; Juneteenth Photobooths 2022; Wiknic June 2022; New York Botanical Garden June 2022; LGBT Pride Month
  • Structured Data on Wikimedia Commons report: Structured data on Commons editing now possible with OpenRefine 3.6; file uploading with 3.7
  • Calendar: July’s GLAM events

Tech News issue #28, 2022 (July 11, 2022)

00:00, Monday, 11 July 2022 UTC
2022, week 28 (Monday 11 July 2022)

Tech News: 2022-28

weeklyOSM 624

19:27, Sunday, 10 July 2022 UTC

28/06/2022-04/07/2022

lead picture

OSM Inspector – with a revamp [1] © Geofabrik | map data © OpenStreetMap contributors

About us

  • We found malware on our server and used the cleanup to come back online with a fresh server, an up-to-date theme and a language switcher plugin. Thank you to all of you who notified, analyzed, helped, tested and migrated with us – thehedgeh0g, firefishy, lonvia, matthiasmatthias, derFred, Michael S., someoneelse, stereo, strubbl, theFive. There are still some open points with all the changes, please do not hesitate to give us hints and suggestions if you come across anything you would change.

Mapping

  • ezekielf asked about the use of access tags for snowmobile trails.
  • User ImmaBeMe shared his concerns about the discrepancy between the few people in his area (Isfahan, Iran) who actively contribute to OSM and those who simply use it without thinking about how it was made.
  • In a short video, Gregory Marler used overpass turbo to find the outdoor artworks that he mapped.
  • Voting is currently open for:
    • school=entrance to deprecate the use of the tag school=entrance, until Wednesday 13 July.
    • school:for=* a tag for schools to indicate what kinds of facilities are available for special needs students.

Community

  • Yunita Sari, from Jakarta, is the UN Mapper of the month.
  • OSM Belgium has selected Ivan Lievens, from Belgium, as Mapper of the Month.
  • It’s once again time for an episode of ‘The OpenCage Blog’ interview series with OpenStreetMap communities around the world. This time you will hear from Martijn van Exel all about OpenStreetMap in Utah.
  • Sam Wilson briefly reported on a meetup in Shenton Park, Perth, which was supported by OSGeo Oceania, the Australian local chapter.

Imports

  • Daniel Capilla continued (es) > en with the import of open data from the Málaga municipality by adding parking areas for people with reduced mobility.

OpenStreetMap Foundation

  • Are you already using Mastodon? The OSMF is pleased to financially support the en.osm.town OpenStreetMap Mastodon (or ‘Mapstodon’) service.

Local chapter news

  • Włodzimierz Bartczak reported (pl) > en on an agreement between OSM Poland and the creators of ‘Seeing Assistant Move’ to support the movement of the blind using OSM data.

Events

  • Open Mapping Hubs from HOT (Asia-Pacific, Eastern and Southern Africa, and Western and Northern Africa) are going to conduct a training webinar, with Meta (Mapillary, MapWithAI), on AI-assisted mapping on Friday 15 July.

Maps

  • During the FOSSGIS Hackingweekend in Essen a map was presented which shows the characteristics of a way which are relevant and important for cyclists. You can, for example, select criteria (items only in (de)) such as ‘highways without cycleways and speed limits over 70 km/h’, but all tags will be displayed by clicking on a route section.

Software

  • Do you want your own worldwide route calculator? You get that with GraphHopper! Just try it.
  • Interested in mobile Linux, OSM, and/or Zig? Have a look at Mepo, a fast, simple, and hackable OSM map viewer for Linux, currently in active development. Designed with the Pinephone and mobile Linux in mind, it works both offline and online.
  • [1] OSM Inspector, an OSM QA tool, has received a revamp. Geofabrik is seeking feedback.
  • Since 7 June 2022, it has (de) > en been possible to create interactive maps in the German-language Wikipedia with comparatively little effort using ‘Kartographer’. Many articles are already waiting to be illustrated with maps. Take a look and join in.

Did you know …

  • … the OSM Notes Heatmap (fr) > en? You can limit notes to ‘Only open notes’, ‘Only anonymous notes’, and ‘Only notes w/o comments’, and select those ‘Created before’ and ‘Created after’ a certain date.
  • … that there’s a web version of the popular OsmAnd app?
  • … that non-square buildings are not the only kind of quality issue with mapped buildings on OSM? Berrely reported, on Discord, that he had found grouped multiple buildings in a relation with the building=yes tag applied to the relation and not the ways. This was related to the tasking manager project tasks.hotosm.org/projects/9717, which is no longer available, about the Izmir earthquake response, Turkey, a year ago.

OSM in the media

  • Ross Thorn gave an interview for the PLN8 Podcast about ‘The Realm of Playful Maps’. By this he means maps that are used in a playful context and thus pursue different goals than classic maps, e.g. by hiding specific information. He gave examples from computer games, board games and the field of art.

Other “geo” things

  • The Guardian reported that UK cycle couriers were fired from their jobs due to ‘impossible’ routes that were suggested by a routing service which was claimed to be 10 times cheaper than Google Maps.
  • Lat × Long is a new blog exploring and documenting geospatial things on the web, focusing on technology, data and standards. It provides – by their own account – regular updates and links to industry news, events, software and tools.
  • Do you want to be one of the very few driving down one of the loneliest roads in the United States? This map is for you.
  • Would you too consider it one of the last real and great challenges to have cycled all the tiles in as large a tile square as possible? Then read this report! Now.
  • The German Federal Cartel Office has (de) > en initiated proceedings against Google. According to the authority, it is investigating whether the US company restricts the combination of Google Maps with third-party map services and thus exploits its position of power.

Upcoming Events

Where, what and when:
  • Região Geográfica Imediata de Angra dos Reis: Mapatona – Angra dos Reis, 2022-07-09
  • Fremantle: Social Mapping Sunday: Fremantle, 2022-07-10
  • Zürich: 142. OSM-Stammtisch, 2022-07-11
  • London: London pub meet-up, 2022-07-12
  • 20095: Hamburger Mappertreffen, 2022-07-12
  • München: Münchner OSM-Treffen, 2022-07-12
  • Berlin: Missing Maps – GRC Online Mapathon, 2022-07-12
  • Landau an der Isar: Virtuelles Niederbayern-Treffen, 2022-07-12
  • Salt Lake City: OSM Utah Monthly Meetup, 2022-07-14
  • Berlin: 169. Berlin-Brandenburg OpenStreetMap Stammtisch, 2022-07-14
  • 臺北市: 第三次 OpenStreetMap 街景踏查團工作坊, 2022-07-17
  • OSMF Engineering Working Group meeting, 2022-07-18
  • 153. Treffen des OSM-Stammtisches Bonn, 2022-07-19
  • City of Nottingham: OSM East Midlands/Nottingham meetup (online), 2022-07-19
  • Lüneburg: Lüneburger Mappertreffen (online), 2022-07-19
  • 大阪市: ひがよどの街を世界にシェア #01, 2022-07-23
  • 京都市: 京都!街歩き!マッピングパーティ:第32回 妙心寺, 2022-07-24
  • Düsseldorf: Düsseldorfer OpenStreetMap-Treffen, 2022-07-27
  • Online: OpenStreetMap Foundation board of Directors – public videomeeting, 2022-07-28
  • 臺北市: COSCUP 2022 OpenStreetMap x Wikidata 聯合議程軌, 2022-07-30

Note:
If you would like to see your event here, please put it into the OSM calendar. Only data which is there will appear in weeklyOSM.

This weeklyOSM was produced by Nordpfeil, PierZen, SK53, Sammyhawkrad, Strubbl, Supaplex, TheSwavu, derFred.

Wikidata at the Detroit Institute of Arts

16:01, Thursday, 7 July 2022 UTC

If you’ve ever visited a museum or library, you’ve likely noticed the number of works in their collections. You may have even asked a librarian a hyper-specific question about a book, only to have an answer within minutes. How do they keep track of vast amounts of information in order to both serve their patrons and understand their own collections? Art museums and libraries are often tasked with cataloging and tracking all of their works, artists, authors, dimensions of each artwork, publication dates, exhibition data, genres, media, or any other data point that can help them document their collections. This work is both a major objective and quite the challenge, since the institution may have several thousand items to track.

To approach this challenge, museums assign unique identifiers to artists and artworks, much like using a social security number instead of your name. This process is called authority control, which disambiguates people or works of art that share the same name. Museums create their own numbers to use as unique identifiers, and this can work extremely well within the institution. But since each institution creates its own unique identifiers, artists and works of art may have dozens corresponding to them. I. Rice Pereira, for example, could have several identifiers that correspond to her name. And she does. Fifty-seven at the time of publication.

Query results showing I. Rice Pereira’s external identifiers (screenshot of the Wikidata Query Service)

Wouldn’t it be wonderful if we had a method of authority control that’s independent of individual institutions, helping us discern when we’re talking about the same work or person?

That’s where Wikidata comes in and why Wiki Education is so passionate about bringing museum professionals into the Wikidata community.

In the Wikidata Institute I teach museum professionals and others interested in data about applications of Wikidata in their real work as well as tools to simplify the process of both contributing to and querying Wikidata. One museum, the Detroit Institute of Arts (DIA), sent their database manager to Wiki Education’s Wikidata training course in 2019, and two more staff participated in 2022. The DIA library holds thousands of boxes of institutional records and historic photographs, and curators, scholars, and the general public use it. When describing an artist or creator for their collections, they have incorporated Wikidata into their process, creating authorities that extend beyond the use of a singular museum.

photo of DIA's archives
Example of research materials in the Detroit Institute of Arts Research Library & Archives TEXTILE Collection. Textile Curator Adèle Coulin Weibel compiled her research materials over a 35 year career and donated them to the museum. A description is available at Wikidata item Q110559858.

I recently spoke to James Hanks, the archivist at the DIA Research Library and Archives, who walked me through some of the cataloging magic the DIA uses to make the collection more accessible to everyone. “[C]reating Wikidata authorities is now part of our normal workflow for processing archival collections. Our objective over the past few years has been to build well-cited biographical [Wikidata items] for DIA personages. We found that this facilitated the eventual creation of [Wikidata items] for our processed collections, with the anticipated result that this would promote discoverability of the archives beyond our existing Worldcat and corporate web presence.”

Annually, the DIA hosts over 600,000 guests who are eager to see the museum’s collection of more than 65,000 works. By embedding Wikidata into their archival work and ensuring the DIA’s archives have a robust presence on Wikidata, they can increase exposure of their archives to even more people. Here are some of the ways this is significant:

  1. As Wikidata becomes more and more vital to the information ecosystem on the internet, more people will find the DIA archives on their research journey. More click-throughs to a museum’s website makes that museum more vital to the art and the community interested in it.
  2. Wikidata’s structured relationship with Wikipedia can make it easier, one day, to turn this data into citations in Wikipedia articles. This will help make Wikipedia more credible and will again drive readers back to the DIA’s source materials.
  3. This work can increase the likelihood that a Wikipedia article will be created about a work or a person. The more references that exist about something, the more likely it is that a Wikipedian will be able to write an article about it. This will be especially important for regional art and historically excluded groups of people who are otherwise missing in art publications.
  4. Corporate collection services like WorldCat can be wonderful resources, but they are not free and open to the public; they are available only to institutions that subscribe to them. Wikidata is free and open. This has big implications for using and reusing data in the present, and it also ensures that future metadata experts and archivists will have access to not just the DIA’s data, but potentially any museum’s data, now and into the future.

Another benefit to this Wikidata work? Working through the COVID-19 pandemic. “An unforeseen, but welcome benefit to integrating Wikidata within our procedures has been the ability to improve intellectual controls remotely,” James said. “For example, when the DIA was on lockdown in 2020, I was still able to conduct meaningful archival description work from home using a combination of Office 365, Wikidata, Archive.org, JSTOR, and my own personal reference library. Although I did not have hands-on access to the primary source materials, I could still enhance and contextualize records pertaining to collections.”

The stability and continuity of Wikidata have many benefits beyond pandemic-induced remote work. James is excited about additional avenues Wikidata is opening to practicum students and interns. He recommends they learn Wikidata to orient themselves to the DIA’s collections and to help fill in the blanks of their collection. Endeavoring to make records more complete is not only beneficial to any collection, but it can also reveal untold and underrepresented stories. “Wikidata has been part of our toolkit for providing access to a more thorough historical record,” James shares, “and we are quite pleased to promote the work of women curators who have worked at the Detroit Institute of Arts over the past 130+ years.” See for yourself here: Adèle Coulin Weibel Textile Department records, 1876-1973.

We are excited by all of the new opportunities Wikidata is bringing to the Detroit Institute of Arts and look forward to helping facilitate as other institutions take on this important work. If you’re interested in learning how to get started with Wikidata, check out our upcoming Wikidata Institute training courses.

A special thanks to James Hanks ([email protected]) for taking the time to share his Wikidata enthusiasm with me and to Christina Gibbs (https://www.christina-gibbs.com/about) who took our Wikidata course years ago and was an early champion of Wikidata at the DIA.

If you’re interested to learn more about Wikidata and how your institution could start to do something similar, follow this link for more information about our courses.

Image credits: Weadock313, CC BY-SA 4.0, via Wikimedia Commons; Weadock313, CC BY-SA 4.0, via Wikimedia Commons

Update on Net Neutrality in the EU

08:06, Thursday, 7 July 2022 UTC

Net Neutrality in the EU seemed like a topic of the past – something we had dealt with and secured, so that we could turn our attention to other issues. Two significant recent developments show that it remains a dynamic policy field and that we mustn’t forget about it. After all, we want an information infrastructure that allows all users to have equal access not only to Wikipedia and its sister projects, but also to all the citations and sources.

Bad news from the Commission

Very large telecoms companies have wanted to make very large online platforms pay for network use for a while. Now they seem to have found a like-minded EU Commissioner in Margrethe Vestager. The Danish politician’s argument is a modern classic for the EU: it boils down to the claim that very large platforms are responsible for the bulk of internet traffic but, according to telecoms companies, are not paying their fair share to fund the infrastructure.

And while this might seem like just another fight between two very profitable industries with excellent lobbying structures at their disposal, it could actually undermine net neutrality. The aim of what telecoms companies are asking for here is to create a two-sided market: On the one hand, end customers pay for the internet connection, on the other hand, internet services might have to pay fees to reach users. The risk is that by developing this market some services will become more accessible than others. 

A letter co-signed by 34 civil society organisations was sent to the Commission criticising their latest statements and outlining fears that such a move could lead to a tiered internet.

Good news from BEREC

On the other side, BEREC, the EU’s body of telecoms regulators, has updated its net neutrality guidelines. By doing so, it closed some loopholes and effectively banned zero rating for applications not yet explicitly covered. Until now, specific applications could be exempted from network providers’ data caps. The change was the result of a 2021 ruling by the European Court of Justice, which stated that zero-tariff options that differentiate between types of internet traffic violate Europe’s open internet rules. 

Wikimedia’s History with Wikipedia Zero

Wikipedia Zero was a project by the Wikimedia Foundation to provide Wikipedia free of charge on mobile phones via zero-rating, particularly in developing markets. The objective of the program was to increase access to free knowledge, in particular without data-usage costs. The program ended in 2018. Part of the reason is that data costs have become more affordable globally. Another important reason is that zero-rated access doesn’t allow users to research and investigate: Wikipedia doesn’t exist on its own, but cites and links to sources across the internet, and these must be equally accessible to anyone. 

Episode 116: Adam Baso and Julia Kieserman

17:54, Tuesday, 05 2022 July UTC

🕑 1 hour 14 minutes

Adam Baso and Julia Kieserman are both developers in the Abstract Wikipedia group at the Wikimedia Foundation; Adam is the director of engineering, while Julia is a senior software engineer.

Links for some of the topics discussed:

Can you reuse images of Italian cultural heritage in the public domain published on Wikimedia Commons for commercial purposes? According to the new Italian National Plan for the Digitization of Cultural Heritage (Piano Nazionale di Digitalizzazione, PND), images can be published on the Wikimedia projects, but to reuse them for commercial purposes you need to ask for permission and pay a fee. This is a restriction on the public domain and a misuse of our Wikimedia projects, which are collaborative repositories meant to freely provide content, including for commercial purposes.

The new PND – under review until June 30th, 2022 – for the first time explicitly refers to Wikimedia Commons in its Guidelines for the acquisition, circulation and reuse of cultural heritage reproductions in the digital environment (page 28), stating:

“The download of cultural heritage reproductions published on third-party websites is not under the control of the public entity that holds the assets (e.g., images of cultural heritage assets downloadable from Wikimedia Commons, made “freely” by contributors by their own means for purposes of free expression of thought and creative activity, and thus in the full legitimacy of the Cultural Heritage Code). It remains the responsibility of the cultural institution to charge fees for subsequent commercial uses of reproductions published by third parties.”

In spite of clear support for open access, FAIR data, collaboration, co-creation and reuse, the guidelines of the PND would turn all public domain images of Italian cultural heritage available on Wikimedia Commons into non-commercial (NC) images with the new label MIC BY NC (MIC stands for the Italian Ministry of Culture). According to an Italian administrative norm (Codice dei beni culturali e del patrimonio), Italian monuments and collections in the public domain can be photographed for non-commercial purposes, while commercial uses are allowed only with prior authorization and the payment of a fee to the institutions managing that site or collection.

The system can’t work and it is unsustainable

The application of this kind of fee by the Ministry of Culture and cultural institutes to commercial reuses of Italian cultural heritage images on Wikimedia Commons is unrealistic (especially if the re-users are based outside Italy) and complex and expensive to manage (for handling permissions and payments). Furthermore, this system follows an outdated business model that aims at making money from heritage digitization instead of opening it up for reuse, as European policies suggest (ref. open government, open data, open science).

Wikimedia projects are exploited and sabotaged

The fee charged on the reuse of Wikimedia content exploits our free infrastructure and the work of volunteers and donors, and goes against our principles of free knowledge, openness and reuse. Furthermore, it contradicts the thousands of authorizations collected by Wikimedia Italia over ten years of Wiki Loves Monuments, and the commitment of Italian GLAMs to providing their public domain heritage through open tools, accessible for all purposes without fees.

What is being done and what can be done

Wikimedia Italia sent an open letter to representatives of the Italian government, calling on them not to add restrictions on images of cultural heritage in the public domain released under an open license on Wikimedia projects. We will keep making this request, to push our country to align with international standards on openness and civil society participation in the conservation of its own heritage.

Help us raise our voice: let us know if you have similar issues in your country and how you have been dealing with them. If you are a volunteer on Wikimedia Commons, let us know if and how the community could help support our requests.

Learn more about the current situation here.

Tech/News/2022/27

15:49, Tuesday, 05 2022 July UTC

Other languages: Bahasa Indonesia, Deutsch, English, italiano, polski, português, português do Brasil, svenska, čeština, русский, українська, עברית, العربية, فارسی, বাংলা, 日本語

Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.

Changes later this week

  • The new version of MediaWiki will be on test wikis and MediaWiki.org from 5 July. It will be on non-Wikipedia wikis and some Wikipedias from 6 July. It will be on all wikis from 7 July (calendar).
  • Some wikis will be in read-only for a few minutes because of a switch of their main database. It will be performed on 5 July at 07:00 UTC (targeted wikis) and on 7 July at 7:00 UTC (targeted wikis).
  • The Beta Feature for DiscussionTools will be updated throughout July. Discussions will look different. You can see some of the proposed changes.
  • This change only affects pages in the main namespace in Wikisource. The JavaScript config variable proofreadpage_source_href will be removed from mw.config and replaced with the variable prpSourceIndexPage. [1]

Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.

Outreachy report #33: June 2022

00:00, Tuesday, 05 2022 July UTC

June was one of the hardest months since I joined the Outreachy team almost 4 years ago. It was the first time I’ve experienced the loss of a colleague—and Marina wasn’t just a colleague. I keep feeling that words aren’t enough to express how much I owe Marina and Outreachy for the life I now have. Before becoming an Outreachy intern, I struggled to find meaning in life; I had so many dreams, but it was hard to see myself achieving any of them.

Tech News issue #27, 2022 (July 4, 2022)

00:00, Monday, 04 2022 July UTC
2022, week 27 (Monday 04 July 2022)

Tech News: 2022-27

weeklyOSM 623

10:32, Sunday, 03 2022 July UTC

21/06/2022-27/06/2022

lead picture

Qwant-Map – OSM data connected to Wikimedia and Tripadvisor [1] © Qwant | map data © OpenStreetMap contributors

About us

  • DeepL Pro now offers translations into Bahasa Indonesia. We have taken the liberty of automatically translating and publishing issue #623. We would like to do this better in the future. However, to do this, we need two proofreaders to make the necessary corrections starting Friday each week. Get in touch via info at weeklyosm dot eu.

Mapping

  • Enock Seth Nyamador raised some concerns about the quality of YouthMappers’ edits in Ghana.
  • SK53 noted that in a number of countries Bing imagery in the iD editor is very out-of-date (5 years old or more). This appears to be the result of a recent change in iD’s code, and is discussed on GitHub.
  • Tobias Zwick wrote about another small project he will be working on over the next few months. It is made possible by an NLNet NGI Zero Discovery grant. Topic: how to improve and complete maxspeed=* data in OSM, including inferring default speed limits.
  • Requests have been made for comments on the following proposals:

  • Voting on amenity=library_dropoff, for mapping a place where library patrons can return or drop off books, other than the library itself, is open until Friday 8 July.
  • The proposal for the improved tagging of neighbourhood places (place=*) in Japan was approved with 14 votes for, 1 vote against, and 1 abstention.

OpenStreetMap Foundation

  • The OSM Tech Twitter account conducted a poll on whether OpenStreetMap should consider publishing a quarterly electronic newsletter. Although, at present, the poll is favourable, the thread highlights that there are a number of obstacles in producing such a newsletter.

Local chapter news

  • Take a look at the June OpenStreetMap US Newsletter.

Events

  • The deadline for submitting a poster for this year’s State of the Map conference is Sunday 31 July.

Software

  • Lilly Tinzmann reported that there have been features added to the Ohsome Quality Analyst, a data quality analysis tool for OSM accessible via a web interface. The new features include the ability to retrieve HTML snippets with a visual representation of the indicator results and expanded data input options.
  • Sarah Hoffmann blogged about the current status of postcodes in OpenStreetMap, explaining their usefulness and offering a QA layer for incorrectly formatted postcodes.
  • Visit Sights offers suggestions for self-guided sightseeing tours by foot around the world – based on OpenStreetMap and Wikipedia. For each city there is also an overview with individual sights including a map.

Did you know …

  • [1] … Qwant-Maps? We last reported on Qwant-Maps in July 2019. The map has been developed further since then. It draws on Tripadvisor for hotels and restaurants, and also links to Wikipedia, thus providing significant added value.
  • … Martijn van Exel has a bash script to create a vintage OpenStreetMap tile server?
  • … that Jason Davies, one of the contributors to the D3 graphics package, created a webpage demonstrating several dozen map projections of the Earth with smooth transitions between each?

Other “geo” things

  • Ariel Kadouri noticed that a road in Google Maps had been renamed incorrectly, the kind of problem often said to affect an open system like OSM rather than a closed system like Google Maps, which is evidently not true. Noel Hidalgo said a friend of his filed a ticket with Google, which resolved the issue after about 36 hours.
  • User F-5 System made (ru) a pilgrimage to a chapel at the reputed source of the Lena River in Northern Russia. It turns out that, as with many large rivers, the source is a contentious issue.

Upcoming Events

Where | What | When
Washington | A Synesthete’s Atlas (Washington, DC) | 2022-07-01
Essen | 17. OSM-FOSSGIS-Communitytreffen | 2022-07-01 – 2022-07-03
| OSM Africa July Mapathon: Map Liberia | 2022-07-01
| OSMF Engineering Working Group meeting | 2022-07-04
臺北市 | OpenStreetMap x Wikidata Taipei #42 | 2022-07-04
San Jose | South Bay Map Night | 2022-07-06
London | Missing Maps London Mapathon | 2022-07-05
Berlin | OSM-Verkehrswende #37 (Online) | 2022-07-05
Salt Lake City | OSM Utah Monthly Meetup | 2022-07-07
Roma | Incontro dei mappatori romani e laziali | 2022-07-06
Fremantle | Social Mapping Sunday: Fremantle | 2022-07-10
München | Münchner OSM-Treffen | 2022-07-12
Berlin | Missing Maps – GRC Online Mapathon | 2022-07-12
20095 | Hamburger Mappertreffen | 2022-07-12
London | London pub meet-up | 2022-07-12
Landau an der Isar | Virtuelles Niederbayern-Treffen | 2022-07-12
Salt Lake City | OSM Utah Monthly Meetup | 2022-07-14
| 153. Treffen des OSM-Stammtisches Bonn | 2022-07-19
City of Nottingham | OSM East Midlands/Nottingham meetup (online) | 2022-07-19
Lüneburg | Lüneburger Mappertreffen (online) | 2022-07-19

Note:
If you would like to see your event here, please add it to the OSM calendar. Only events which are entered there will appear in weeklyOSM.

This weeklyOSM was produced by JAAS, Lejun, LuxuryCoop, Nordpfeil, PierZen, SK53, Strubbl, TheSwavu, derFred.

A belated writeup of CVE-2022-28201 in MediaWiki

06:03, Sunday, 03 2022 July UTC

In December 2021, I discovered CVE-2022-28201, which is that it's possible to get MediaWiki's Title::newMainPage() to go into infinite recursion. More specifically, if the local interwikis feature is configured (not used by default, but enabled on Wikimedia wikis), any on-wiki administrator could fully brick the wiki by editing the [[MediaWiki:Mainpage]] wiki page in a malicious manner. It would require someone with sysadmin access to recover, either by adjusting site configuration or manually editing the database.

In this post I'll explain the vulnerability in more detail, how Rust helped me discover it, and a better way to fix it long-term.

The vulnerability

At the heart of this vulnerability is Title::newMainPage(). The function, before my patch, is as follows (link):

public static function newMainPage( MessageLocalizer $localizer = null ) {
    if ( $localizer ) {
        $msg = $localizer->msg( 'mainpage' );
    } else {
        $msg = wfMessage( 'mainpage' );
    }
    $title = self::newFromText( $msg->inContentLanguage()->text() );
    // Every page renders at least one link to the Main Page (e.g. sidebar).
    // If the localised value is invalid, don't produce fatal errors that
    // would make the wiki inaccessible (and hard to fix the invalid message).
    // Gracefully fallback...
    if ( !$title ) {
        $title = self::newFromText( 'Main Page' );
    }
    return $title;
}

It gets the contents of the "mainpage" message (editable on-wiki at MediaWiki:Mainpage), parses the contents as a page title and returns it. As the comment indicates, it is called on every page view and as a result has a built-in fallback if the configured main page value is invalid for whatever reason.

Now, let's look at how interwiki links work. Normal interwiki links are pretty simple, they take the form of [[prefix:Title]], where the prefix is the interwiki name of a foreign site. In the default interwiki map, "wikipedia" points to https://en.wikipedia.org/wiki/$1. There's no requirement that the interwiki target even be a wiki, for example [[google:search term]] is a supported prefix and link.

And if you type in [[wikipedia:]], you'll get a link to https://en.wikipedia.org/wiki/, which redirects to the Main Page. Nice!

Local interwiki links are a bonus feature on top of this to make sharing of content across multiple wikis easier. A local interwiki is one that maps to the wiki we're currently on. For example, you could type [[wikipedia:Foo]] on the English Wikipedia and it would be the same as just typing in [[Foo]].

So now what if you're on English Wikipedia and type in [[wikipedia:]]? Naively that would be the same as typing [[]], which is not a valid link.

So in c815f959d6b27 (first included in MediaWiki 1.24), it was implemented to have a link like [[wikipedia:]] (where the prefix is a local interwiki) resolve to the main page explicitly. This seems like entirely logical behavior and achieves the goals of local interwiki links - to make it work the same, regardless of which wiki it's on.

Except it now means that when trying to parse a title, the answer might end up being "whatever the main page is". And if we're trying to parse the "mainpage" message to discover where the main page is? Boom, infinite recursion.

All you have to do is edit "MediaWiki:Mainpage" on your wiki to be something like localinterwiki: and your wiki is mostly hosed, requiring someone to either de-configure that local interwiki or manually edit that message via the database to recover it.

The patch I implemented was pretty simple, just add a recursion guard with a hardcoded fallback:

    public static function newMainPage( MessageLocalizer $localizer = null ) {
+       static $recursionGuard = false;
+       if ( $recursionGuard ) {
+           // Somehow parsing the message contents has fallen back to the
+           // main page (bare local interwiki), so use the hardcoded
+           // fallback (T297571).
+           return self::newFromText( 'Main Page' );
+       }
        if ( $localizer ) {
            $msg = $localizer->msg( 'mainpage' );
        } else {
            $msg = wfMessage( 'mainpage' );
        }

+       $recursionGuard = true;
        $title = self::newFromText( $msg->inContentLanguage()->text() );
+       $recursionGuard = false;

        // Every page renders at least one link to the Main Page (e.g. sidebar).
        // If the localised value is invalid, don't produce fatal errors that

Discovery

I was mostly exaggerating when I said Rust helped me discover this bug. I previously blogged about writing a MediaWiki title parser in Rust, and it was while working on that I read the title parsing code in MediaWiki enough times to discover this flaw.

A better fix

I do think that long-term, we have better options to fix this.

There's a new, somewhat experimental, configuration option called $wgMainPageIsDomainRoot. The idea is that rather than serve the main page from /wiki/Main_Page, it would just be served from /. Conveniently, this would mean that it doesn't actually matter what the name of the main page is, since we'd just have to link to the domain root.

There is an open request for comment to enable such functionality on Wikimedia sites. It would be a small performance win, give everyone cleaner URLs, and possibly break everything that expects https://en.wikipedia.org/ to return an HTTP 301 redirect, like it has for the past 20+ years. Should be fun!

Timeline

Acknowledgements

Thank you to Scott Bassett of the Wikimedia Security team for reviewing and deploying my patch, and Reedy for backporting and performing the security release.

Sock🧦 nerdery🤓

01:15, Friday, 01 2022 July UTC

Being a nerd is not about what you love; it’s about how you love it.

Wil Wheaton

My running last week

I’m a runner and a sock nerd, and in four days, I’m running a half-marathon (eek!).

Here are some reflections on socks because if there’s one thing every runner knows it’s: socks. matter.

Join the Darn Tough sock cult.

Darn Tough makes merino wool socks prized by hikers, runners, and buy-it-for-lifers because they’re guaranteed for life.

Darn Tough’s lifetime warranty

According to my Amazon order history, I ordered five pairs of “Darn Tough Merino Wool Double Cross, No Show Tab, Light Cushion Sock Molten Large” socks in 2016. Today, six years later, I’m wearing a pair of the socks I ordered in 2016, and they’re great.

And in all this time I’ve never used their warranty program, but I decided to try it out on a particularly worn pair—we’ll see how it goes!

About compression socks

Why? Because squeezy is good.

Peter Sagal, Host of NPR’s “Wait Wait… Don’t Tell Me!”

Compression socks supply support and structure, and that makes them a joy to wear—even when you’re not running.

Initially, compression socks emerged to support circulation in the legs of diabetics. But now savvy runners sport them to capitalize on numerous studies claiming they aid performance and recovery (although who knows what the control is in those studies).

I own two colors of CEP Progressive+ Run 2.0—basic black and caution-tape yellow.

These socks are made (mostly) of nylon, which massages my calves, keeping my blood flowing on my recovery days. I’ve owned these socks for years and wear them weekly.

But it’s not all cozy, compressed joy:

  • 💸Compression socks are too expensive—mine cost $65 a pair!
  • 🧐 The socks come with instructions about how to put them on
  • 🛂 You need instructions to put them on

Avoid cotton socks

90% of everything is crap

Sturgeon’s Law

Most socks are crap for running because most socks are cotton.

But cotton is the wrong material for socks for the same reason it’s the right material for towels. Cotton is absorbent—it holds water and doesn’t release it. The sweat trapped between your foot and your cotton sock can cause blisters while running or hiking.

In contrast, technical socks tend to be made of less absorbent material that dries quickly. So when you sweat, your sweat moves to the surface of the sock and evaporates before it gives you blisters.

I believed blisters were unavoidable—I tossed a roll of Leukotape in my first-aid kit and accepted that I’d use it often. But then I realized the real problem was my cotton socks.

You think about socks every day.

“I don’t want to make decisions about what I’m eating or wearing. Because I have too many other decisions to make.”

Barack Obama

Mental energy is precious. You should avoid misspending your limited mental energy on your socks.

You could argue writing a blog post about socks is the definition of misspent mental energy. But I believe it’s when you’re spending your mental energy that matters.

If you find yourself bleary-eyed, rooting around for the one good pair of socks in the drawer, then you’re thinking about socks at the wrong time.

Spend your effort up-front.

Declare sock bankruptcy and find a brand of comfortable socks that you can wear in every situation, and then stock up.

My GLAMorous introduction into the Wikiverse

16:54, Thursday, 30 2022 June UTC

In January 2021, I had no experience on any of the Wikimedia platforms. By the end of 2021, I had added over 200,000 words across Wikipedia and Wikidata and assisted in two Smithsonian edit-a-thons.

The Beginning

After I completed a digital archival research project on anti-rape protests at The Ohio State University, my friend encouraged me to apply to the 2021 Virtual Because of Her Story (BOHS) internship project with the American Women’s History Initiative (AWHI) at the Smithsonian Institution. The internship was eight weeks, 40 hours a week, and paid. Without the financial assistance BOHS provided, I would not have been able to take this opportunity. 

My BOHS project “Wikimedia, Gender Equity, and the Digital Museum” aimed to “advance gender equity on Wikipedia by making our collections about women accessible on the Wikimedia platforms.” As a Women’s, Gender, and Sexuality Studies major, the project appealed to me because disseminating knowledge in accessible ways has been key to many feminists’ organizing efforts. 

Summer 2021

My mentor Kelly Doyle taught me the basics of Wiki-etiquette, including conflict of interest, determining reliable sources, establishing notability, and adding categories. My first edit was adding the category “South Korean adoptees” to Mia Mingus’ page. Kelly encouraged utilizing the Wikipedia: Task Center to find pages to categorize or copyedit. I then started editing people’s Wikipedia pages. The visual editor was incredibly helpful for me. Being able to make these changes and see the results immediately gave me a lot of motivation to keep editing.

Andrew Lih introduced me and other BOHS interns to Wikidata, Wikipedia, and Wikimedia Commons. I began editing Wikidata and realized Wikidata was more intuitive for me than Wikipedia. I felt comfortable creating Wikidata properties in real time, like when Zaila Avant-garde won the 2021 Scripps National Spelling Bee. I utilized my Wikipedia and Wikidata skills for the Black Women in Food Smithsonian Edit-A-Thon. I created my first ever article for LaDeva Davis and created Wikidata properties for women featured in our Edit-A-Thon.

Before my internship ended, I wanted to complete a passion project so I created the Wikipedia page for the “Asian Americans (documentary series).” Creating this Wikipedia page meant a lot to me because I wanted to highlight Asian-American contributions on Wikipedia. I wanted the page to act in a similar way as the documentary and connect Asian-American Wikipedia pages together in a cohesive and contextually relevant way. Being especially thorough when it comes to Asian-American Wikipedia presence is important to me because omitting details felt like erasing Asian-American contributions all over again. 

At the end of my Summer BOHS internship with the Smithsonian, I had created 6 Wikipedia pages and 24 Wikidata properties, added 294 references, and written ~36,000 words across Wikiplatforms.

Autumn 2021

Mia Cariello

I had the privilege to continue my internship into the Fall. During the Fall, I presented at WikidataCon 2021, collaborated with the National Air and Space Museum on their Wikipedia Edit-a-Thon, and participated in Wikipedia Asian Month (finishing at #18 out of 46 participants). During my fall internship I created 12 new Wikipedia articles and 45 Wikidata properties, added 1,170 references, and wrote ~188,000 words across Wikimedia. I did this all while completing my first semester of graduate school and working only 30 hours a week for AWHI. 

Off-Wiki Outcomes

I took my knowledge of the Smithsonian, Wikipedia, and women’s (under)representation to the classroom. I taught undergraduate students how they can find information on women in Wikipedia and Museum databases. We discussed how the internet can replicate biases and how including marginalized groups onto Wikipedia and Wikidata could help combat this. My WikiWork found its way into other people’s classes as well. Professors thanked me for creating the Asian Americans (documentary series) page because they planned on using it in their own courses.

The opportunity AWHI’s BOHS Internship program provided me is invaluable. After completing my degree, I hope to pursue more work with Wikipedia and GLAM institutions. I hope that the Smithsonian and other GLAM institutions continue to create or expand their Wiki-programs. Future interns could potentially add millions of words across Wikimedia platforms and have the tools to create their own passion projects well after their internships have ended. 


Mentor Observations

Mia’s contributions to Wikimedia are incredible and far exceeded my expectations. She went from a complete newbie with zero edits to a superstar editor who is now considering Wikimedia and/or open access as a career. Her internship teaches us several things: that it’s possible to pilot Wikimedia-focused internships as a model for future engagement at GLAMs, that mentorship and focused Wikimedia guidance produce dedicated editors who care about our movement, and that interns have high editor retention after their official role has ended. 

Mia participated in community campaigns that intersected with the focus of her internship, like Wikipedia Asian Month, and quickly began to navigate between editing Wikipedia and Wikidata. She continues to find connections between Wikimedia and her graduate-level coursework. Mia even incorporated Wikipedia into her Spring 2022 Women’s and Gender Studies course at Ohio State University. I’m hopeful that interns focusing on Wikimedia can become an integral part of future GLAM-Wiki engagement. 

This summer, I’m co-mentoring two more interns with the Smithsonian Asian Pacific American Center, focused on increasing the representation of Asian Pacific American women on Wikipedia and picking up on Mia’s successes in 2021. 


Learn more

Mia Cariello is currently pursuing a Master’s Degree in Women’s, Gender, and Sexuality Studies at The Ohio State University.

Kelly Doyle is the Open Knowledge Coordinator for the Smithsonian American Women’s History Initiative

By Jesse Amamgbu and Isaac Johnson

Introduction

Every month, editors on Wikipedia make somewhere between 10 and 14 million edits to Wikipedia content. While that is clearly a large amount of change, knowing what each of those edits did is surprisingly difficult to quantify. This data could support new research into edit dynamics on Wikipedia, more detailed dashboards of impact for campaigns or edit-a-thons, and new tools for patrollers or inexperienced editors. Editors themselves have long relied on diffs of the content to visually inspect and identify what an edit changed.

Figure 1: Example wikitext diff of Lady Cockburn and Her Three Eldest Sons, showing an edit by User:BrownHairedGirl that inserted a new template, changed an existing template, and changed an existing category; additional lines are shown for context. [Source]

For example, Figure 1 above shows a diff from the English Wikipedia article “​​Lady Cockburn and Her Three Eldest Sons” in which the editor inserted a new template, changed an existing template, and changed an existing category. Someone with a knowledge of wikitext syntax can easily determine that from viewing the diff, but the diff itself just shows where changes occurred, not what they did. The VisualEditor’s diffs (example) go a step further and add some annotations, such as whether any template parameters were changed, but these structured descriptions are limited to a few types of changes. Other indicators of change – the minor edit flag, the edit summary, the size of change in bytes – are often overly simplistic and at times misleading.

Our goal with this project was to generate diffs that provided a structured summary of the what of an edit – in effect seeking to replicate what many editors naturally do when viewing diffs on Wikipedia or the auto-generated edit summaries on Wikidata (example). For the edit to Lady Cockburn, that might look like: 1 template insert, 1 template change, 1 category change, and 1 new line across two sections (see Figure 2). Our hope is that this new functionality could have wide-ranging benefits:

  • Research: support analyses similar to Antin et al. about predictors of retention for editors on Wikipedia or more nuanced understandings of effective collaboration patterns such as Kittur and Kraut.
  • Dashboards: the Programs and Events Dashboard already shows measures of references and words added for campaigns and edit-a-thons, but could be expanded to include other types of changes such as images or new sections.
  • Vandalism Detection: the existing ORES edit quality models already use various structured features from the diff, such as references changed or words added, but could be enhanced with a broader set of features.
  • Newcomer Support: many newcomers are not aware of basic norms such as adding references for new facts or how to add templates. Tooling could potentially guide editors to the relevant policies as they edit or help identify what sorts of wikitext syntax they have not edited yet and introduce them to these concepts (more ideas).
  • Tooling for Patrollers: in the same way that editors can filter their watchlists to filter out minor edits, you could also imagine them setting more fine-grained filters, such as not showing edits that only change categories or whitespace.
Figure 2: The same wikitext diff of Lady Cockburn and Her Three Eldest Sons, with the edit-types output added to show how the library describes the edit: 1 category change, 1 template change, 1 template insertion, and 1 whitespace insertion across 2 changed sections. [Source]

Background

Automated approaches for describing edits are not a new idea. While our requirements led us to build our own end-to-end system, we were able to build heavily on past efforts. Past approaches largely fit into two categories: human-intelligible and machine-intelligible. The diff in Figure 1 from Wikidiff2 is an example of a human-intelligible diff that is generally only useful if someone who understands wikitext is interpreting it (a very valid assumption for patrollers on Wikipedia). This sort of diff has existed since the early 2000s (then called just Wikidiff).

Past research has also attempted to generate machine-intelligible diffs, primarily for machine-learning models to do higher-order tasks such as detecting vandalism or inferring editor intentions. These diffs are useful for models in that they are highly structured and quick to generate, but can be so decontextualized as to be non-useful for a person trying to understand what the edit did. An excellent example of this is the revscoring Python library, which provides a variety of tools for extracting basic features from edits such as the number of characters changed between two revisions. Most notably, this library supported work by Yang et al. to classify edits into a taxonomy of intentions – e.g., copy-editing, wikification, fact-update. These higher-order intentions require labeled data; however, that is expensive to gather from many different language communities.

We instead focus on identifying changes at the level of the different components of wikitext that comprise an article – e.g., categories, templates, words, links, formatting, images [1]. The closest analog to our goals and a major source of inspiration was the visual diffs technology, which was built in 2017 in support of VisualEditor. While its primary goal is to be human-intelligible, it does take that additional step of generating structured descriptions of what was changed for objects such as templates.
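
As a rough illustration of what these wikitext components look like programmatically, here is a minimal sketch that counts a few node types in a single revision using the mwparserfromhell parsing library (introduced in the Implementation section below). The categories are a simplification for illustration; the full taxonomy is the one linked in footnote [1].

# Minimal sketch: counting wikitext components of one revision with mwparserfromhell.
# The categories are simplified; the real taxonomy is richer (see footnote [1]).
import mwparserfromhell

def count_components(wikitext: str) -> dict:
    code = mwparserfromhell.parse(wikitext)
    return {
        "templates": len(code.filter_templates()),
        "wikilinks": len(code.filter_wikilinks()),        # includes category and media links
        "external_links": len(code.filter_external_links()),
        "headings": len(code.filter_headings()),
        "tags": len(code.filter_tags()),                   # e.g. <ref> and text formatting
        "words": len(code.strip_code().split()),           # naive whitespace tokenization
    }

example = "== History ==\nShe was born in [[London]].<ref>{{cite book|title=Example}}</ref>\n[[Category:1900 births]]"
print(count_components(example))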

Implementation

The design of our library is based heavily on the approach taken by the Visual Diffs team [2] with four main differences:

  1. Visual diffs is written in Javascript and we work in Python to allow for large-scale analyses and complement a suite of other Python libraries intended to provide support for Wikimedia researchers.
  2. Visual diffs work with the parsed HTML content of the page, not the raw wikitext markup. Because the parsed content of pages is not easily retrievable in bulk or historically, we work with the wikitext and parse the content to convert it into something akin to an HTML DOM.
  3. We do not need to visually display the changes so we relax some of the constraints of Visual diffs, especially around e.g., which words were most likely changed and how.
  4. We need broader coverage of the structured descriptions – i.e. not just specifics for templates and links, but also how many words, lists, references, etc. were edited.

There are four stages between the input of two raw strings of wikitext (usually a revision of a page and its parent revision) and the output of what edit actions were taken (a simplified sketch follows the list):

  1. Parse each version of wikitext and format it as a tree of nodes – e.g., a section with text, templates, etc. nested within it. For the parsing, we depend heavily on the amazingly powerful mwparserfromhell library.
  2. Prune the trees down to just the sections that were changed – a major challenge with diffs is balancing accuracy with computational complexity. The preprocessing and post-processing steps are quite important to this.
  3. Compute a tree diff – i.e. identify the most efficient way (inserts, removals, changes, moves) to get from one tree to the other. This is the most complex and costly stage in the process.
  4. Compute a node diff – i.e. identify what has changed about each individual element. In particular, we do a fair bit of additional processing to summarize what changed about the text of an article (sentences, words, punctuation, whitespace). It is at this stage that we could also compute additional details such as exactly how many parameters of a template were changed etc.
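
As a rough sketch of how these stages fit together, the following compares only the templates in changed sections, assuming mwparserfromhell for parsing; the real library builds a full tree diff and covers moves, changes, and many more node types.

# Heavily simplified sketch of the pipeline: parse, prune to changed sections,
# then report template names inserted or removed. No move or change detection.
import mwparserfromhell

def template_diff(old_wikitext: str, new_wikitext: str) -> dict:
    # Stage 1: parse each revision into a tree of nodes.
    old_code = mwparserfromhell.parse(old_wikitext)
    new_code = mwparserfromhell.parse(new_wikitext)

    # Stage 2: prune to sections whose raw text differs between the revisions.
    old_sections = {str(s) for s in old_code.get_sections(flat=True)}
    new_sections = {str(s) for s in new_code.get_sections(flat=True)}
    changed_old = [mwparserfromhell.parse(s) for s in old_sections - new_sections]
    changed_new = [mwparserfromhell.parse(s) for s in new_sections - old_sections]

    # Stages 3-4, very loosely: compare template names within the changed sections.
    old_templates = {str(t.name).strip() for sec in changed_old for t in sec.filter_templates()}
    new_templates = {str(t.name).strip() for sec in changed_new for t in sec.filter_templates()}
    return {
        "template_insert": sorted(new_templates - old_templates),
        "template_remove": sorted(old_templates - new_templates),
    }

print(template_diff("{{Infobox person}}\nSome text.", "{{Infobox person}}\n{{Citation needed}}\nSome text."))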

Learnings

Testing was crucial to our development. Wikitext is complicated and has lots of edge cases – images appear in brackets…except for when they are in templates or galleries. Diffs are complicated to compute and have no one right answer – editors often rearrange content while editing, which can raise questions about whether content was moved with small tweaks or larger blocks of text were removed and inserted elsewhere. Interpreting the diff in a structured way forces many choices about what counts as a change to a node – is a change in text formatting just when the type of formatting changes or also when the content within it changes? Does the content in reference tags contribute to word counts? Tests forced us to record our expectations and hold ourselves accountable to them, something the Wikidiff2 team also discovered when they made improvements in 2018. No amount of tests would truly cover the richness of Wikipedia either, so we also built an interface for testing the library on random edits so we could slowly identify edge cases that we hadn’t imagined.

Parsing wikitext is not easy and though we thankfully could rely on the mwparserfromhell library for much of this, we also made a few tweaks. First, mwparserfromhell treats all wikilinks equally regardless of their namespace. This is because identifying the namespace of a link is non-trivial: the namespace prefixes vary by language edition and there are many edge cases. We decided to differentiate between standard article links, media, and category links as the three most salient types of links on Wikipedia articles. We extracted a list of valid prefixes for each language from the Siteinfo API to assist with this, which is a simple solution, but will occasionally need to be updated to the most recent list of languages and aliases. Second, mwparserfromhell has a rudimentary function for removing the syntax from content and just leaving plaintext, but it was far from perfect for our purposes. For instance, because mwparserfromhell does not distinguish between link namespaces, parameters for image size or category names are treated as text. Content from references is included (if not wrapped in a template) even though these notes do not appear in-line and often are just bibliographic. We wrote our own wrapper for deciding what was text, so that the outputs more closely adhered to what we considered to be the textual content of the page.
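
A toy version of the wikilink classification described above might look like the following, assuming a hardcoded set of English prefixes purely for illustration; the actual library derives the valid category and media prefixes (and their aliases) for each language from the Siteinfo API.

# Sketch: classify wikilinks as article, category, or media links by their prefix.
# The prefix sets are English-only examples; real per-language prefixes and aliases
# would be fetched from the Siteinfo API.
import mwparserfromhell

CATEGORY_PREFIXES = {"category"}
MEDIA_PREFIXES = {"file", "image", "media"}

def classify_links(wikitext: str) -> dict:
    counts = {"article": 0, "category": 0, "media": 0}
    for link in mwparserfromhell.parse(wikitext).filter_wikilinks():
        prefix = str(link.title).split(":", 1)[0].strip().lower()
        if prefix in CATEGORY_PREFIXES:
            counts["category"] += 1
        elif prefix in MEDIA_PREFIXES:
            counts["media"] += 1
        else:
            counts["article"] += 1
    return counts

print(classify_links("[[Paris]] [[File:Eiffel.jpg|thumb]] [[Category:Capitals]]"))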

It is not easy to consistently identify words or sentences across Wikipedia’s over 300 languages. Many languages (like English) are written with words that are separated by spaces. Many languages are not though, either because the spaces actually separate syllables or because there are no spaces between characters at all. While the former are easy to tokenize into words, the latter set of languages requires specialized parsing or a different approach to describing the scale of changes. For now, we have borrowed a list of languages that would require specialized parsing and report character counts for them as opposed to word counts (code). For sentences, we aim for consistency across the languages. The challenge is constructing a global list of punctuation that is used to indicate the ends of sentences, including Latin scripts like the one in this blog post as well as characters such as the danda or many CJK punctuation marks. It is challenges like these that remind us of the richness and diversity of Wikipedia.
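
A toy version of this language handling, with an intentionally tiny punctuation set and language list, is sketched below; the real implementation uses a much larger set of sentence-ending characters and a borrowed list of languages that need character counts rather than word counts.

# Toy sketch: sentence counting with a global end-of-sentence character set, plus
# character counts for languages whose words are not space-separated. Both the
# punctuation set and the language list here are deliberately incomplete.
import re

SENTENCE_ENDINGS = r"[.!?。！？।]"       # Latin stops, CJK stops, Devanagari danda
NO_SPACE_DELIMITED = {"ja", "zh", "th"}  # tiny example list, not the real one

def count_sentences(text: str) -> int:
    # Treat an end-of-sentence character followed by whitespace or end-of-text as a boundary.
    return len(re.findall(SENTENCE_ENDINGS + r"(?=\s|$)", text))

def count_text_units(text: str, lang: str) -> int:
    if lang in NO_SPACE_DELIMITED:
        return len(re.sub(r"\s", "", text))  # character count, excluding whitespace
    return len(text.split())                 # naive word count

print(count_sentences("She was born in 1900. She died in 1973."))
print(count_text_units("彼女は1900年に生まれた。", "ja"))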

What’s next?

We welcome researchers and developers (or folks who are just curious) to try out the library and let us know what you find! You can download the Python library yourself or test out the tool via our UI. Feedback is very welcome on the talk page or as a Github issue. We have showcased a few examples of how to apply the library to the history dumps for Wikipedia or use the diffs as inputs into machine-learning models. We hope to make the diffs more accessible as well so that they can be easily used in tools and dashboards. 

While this library is generally stable, our development is likely not finished. Our initial scope was Wikipedia articles with a focus on the current format and norms of wikitext. As the scope for the library expands, additional tweaks may be necessary. The most obvious place is around types of wikilinks. Identifying media and category links is largely sufficient for the current state of Wikipedia articles, but applying this to e.g. talk pages would likely require extending this to at least include User and Wikipedia (policy) namespaces (and other aspects of signatures). Extending to historical revisions would require separating out interlanguage links.

We have attempted to develop a largely feature-complete diff library, but, for some applications, a little accuracy can be sacrificed in return for speed. For those use-cases, we have also built a simplified version that ignores document structure. The simplified library loses the ability to detect content moves or tell the difference between e.g., a category being inserted and a separate one being removed vs. a single category being changed. In exchange, it has an approximately 10x speed-up and far smaller footprint, especially for larger diffs. This can actually lead to more complete results when the full library otherwise times out.
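
The simplified approach can be thought of as a “bag of nodes” comparison: count the nodes of each type in each revision and subtract the counts, which yields net inserts and removals but cannot see moves or tell a change apart from a paired insert and removal. Here is a minimal sketch of that idea, again assuming mwparserfromhell and a reduced set of node types; it is an illustration of the approach, not the simplified library’s actual code.

# Sketch of a structure-free "bag of nodes" diff: count node types per revision and
# subtract. Fast and small, but blind to moves and to change-vs-insert+remove.
from collections import Counter
import mwparserfromhell

def node_bag(wikitext: str) -> Counter:
    code = mwparserfromhell.parse(wikitext)
    return Counter({
        "Template": len(code.filter_templates()),
        "Wikilink": len(code.filter_wikilinks()),
        "External link": len(code.filter_external_links()),
        "Heading": len(code.filter_headings()),
        "Word": len(code.strip_code().split()),
    })

def simple_diff(old_wikitext: str, new_wikitext: str) -> dict:
    old_bag, new_bag = node_bag(old_wikitext), node_bag(new_wikitext)
    return {
        "insert": dict(new_bag - old_bag),  # net additions per node type
        "remove": dict(old_bag - new_bag),  # net removals per node type
    }

print(simple_diff("Some text. [[Paris]]", "Some text. [[Paris]] [[Category:Capitals]] {{fact}}"))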

[1] For the complete list, see: https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types#Edit_Types_Taxonomy 

[2] For more information, see this great description by Thalia Chan from Wikimania 2017: https://www.mediawiki.org/wiki/VisualEditor/Diffs#Technology_used 

About this post

Featured image credit: File:Spot the difference.jpg by Eoneill6, licensed under Creative Commons Attribution 4.0 International

Figure 1 credit: File:Wikitext diff example.png by Isaac (WMF), licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Figure 2 credit: File:Edit types example.png by Isaac (WMF), licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

We are excited to be dropping the 3rd episode of WIKIMOVE, our podcast on everything Wikimedia Movement Strategy. In this episode we talk about innovation and explore the opportunities created by the UNLOCK accelerator within our movement and beyond. 

Good news!

Our podcast is now available via RSS feed on Acast, Spotify, Soundcloud, Stitcher and Castbox. More podcast platforms will follow. 

The video version of our show is also available on YouTube with English subtitles. 

What’s in this episode? 

We are looking back at years of complaints about how Wikimedia technology is outdated and exclusive. Non-encyclopedic forms of knowledge are still impossible or hard to insert into our existing formats. The last big innovation from our movement is Wikidata, which is now almost ten years old. Movement Strategy calls on us to innovate our technical and social systems so that new and marginalized communities can join and share their knowledge. We talk about the UNLOCK accelerator program, how it is being implemented in collaboration with WMS and WMDE this year, and explore how the movement can become more of an innovation ecosystem.

Our guests are…

Kannika Thaimai, Program lead of the UNLOCK accelerator at Wikimedia Deutschland

Ivana Madžarević, Program and Community Support Manager at Wikimedia Serbia

Please visit our meta page to react to the episode and subscribe to our newsletter to get notified of each new release. 

We wish you all a summer break and will be back in August with our next episode! 

Tech/News/2022/26

21:29, Monday, 27 2022 June UTC

Other languages: Bahasa Indonesia, Deutsch, English, français, italiano, magyar, polski, português, português do Brasil, čeština, русский, українська, עברית, العربية, فارسی, বাংলা, 中文, 日本語, 한국어

Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.

Recent changes

Changes later this week

  • The new version of MediaWiki will be on test wikis and MediaWiki.org from 28 June. It will be on non-Wikipedia wikis and some Wikipedias from 29 June. It will be on all wikis from 30 June (calendar).
  • Some wikis will be in read-only for a few minutes because of a switch of their main database. It will be performed on 28 June at 06:00 UTC (targeted wikis). [1]
  • Some global and cross-wiki services will be in read-only for a few minutes because of a switch of their main database. It will be performed on 30 June at 06:00 UTC. This will impact ContentTranslation, Echo, StructuredDiscussions, Growth experiments and a few more services. [2]
  • Users will be able to sort columns within sortable tables in the mobile skin. [3]

Future meetings

  • The next open meeting with the Web team about Vector (2022) will take place tomorrow (28 June). The following meetings will take place on 12 July and 26 July.

Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.

Tech News issue #26, 2022 (June 27, 2022)

00:00, Monday, 27 2022 June UTC
2022, week 26 (Monday 27 June 2022)

Tech News: 2022-26

weeklyOSM 622

10:13, Sunday, 26 2022 June UTC

14/06/2022-20/06/2022

lead picture

Osmose using open data in France and Spain now [1] © Osmose | map data © OpenStreetMap contributors

Breaking news

  • The next OSMF Board meeting will take place on Thursday 30 June 2022, at 13:00 UTC via the OSMF video room (which opens about 20 minutes before the meeting). The draft agenda is available on the wiki. The topics to be covered are:
    • Treasurer’s report
    • Updated membership prerequisites plan
    • Consider directing the OWG to cut access off due to attribution or other legal policy reasons, if flagged by the LWG
    • OSM Carto
    • OSM account creation API
    • Advisory Board – monthly update
    • Presentation by Mapbox Workers Union
    • Guest comments or questions.

Mapping

  • ViriatoLusitano has updated (pt) > de his very detailed and richly illustrated guide describing how to integrate data from the National Institute of Statistics (INE) into OSM, with names and georeferenced boundaries of different urban agglomerations.
  • Anne-Karoline Distel made a short report on her mapping trip to North Wales.
  • At this year’s SotM France conference, Stéphane Péneau gave (fr) an overview of street-level imagery, from hardware choice to file management.
  • Requests have been made for comments on the following proposals:
    • school=entrance to deprecate the use of the tag school=entrance.
    • exit=* to deprecate entrance=exit, entrance=emergency, and entrance=entrance in favour of clearer tags.
    • Emergency access and exits to address issues with the current tagging of these items.
    • aeroway=stopway for mapping the area beyond the runway that has a full-strength pavement able to support aircraft, which can be used for deceleration in the event of a rejected take-off.
    • runway=displaced_threshold for mapping the part of a runway which can be used for take-off, but not landing.
    • school:for=* a tag for schools to indicate what kinds of facilities are available for special needs students.
    • information=qr_code for tagging a QR code that provides information about a location of interest to tourists.
  • Voting on the pitch:net=* proposal, for indicating if a net is available at a sports pitch, is open until Saturday 2 July.
  • Voting on the following proposals has closed:
    • aeroway=aircraft_crossing to mark a point where the flow of traffic is impacted by crossing aircraft, was approved with 14 votes for, 0 votes against and 0 abstentions.
    • substation=* to improve tagging of power substations and transformers mixing on the same node, was approved with 11 votes for, 1 vote against and 0 abstentions.

Community

  • In the 133rd episode of the Geomob Podcast, the guest is Muki Haklay, Professor of Geoinformatics at UCL, an early adopter of combining geography with computer science and one of the earliest supporters of OpenStreetMap. There is a discussion about extreme Citizen Science.
  • Nathalie Sidibé (fr) > de, from OSM Mali, is now involved in another community: Wikipedia! Her commitment to the Malian community, to open source data and of course to OSM has already been featured in several profiles. Now there is her full biography (fr) > de and an initiative of the ‘Les sans pagEs(fr) > de women geographers project.

Imports

  • Daniel Capilla provided (es) > de an update about the import of Iberdrola charging stations for electric vehicles in Malaga, which is now complete. The data is available under an open licence from the Municipality of Malaga (Spain). He maintains a corresponding wiki page for the documentation and coordination of open data imports.

Events

  • YouthMappers UMSA, a recently opened chapter of YouthMappers in Bolivia, tweeted (es) about their first OpenStreetMap training activity on 22 June.
  • Videos of the presentations at the SofM-Fr 2022 conference are now available (fr) online. A session listing for the conference, which was held 10 to 12 June in Nantes, is available (fr) > en on their website.

Education

  • Anne-Karoline Distel explained in a new video how to add running trails to OpenStreetMap.
  • Astrid Günther explained, in a tutorial, how she created vector tiles for a small area of Earth and hosts them herself.

OSM research

  • Youjin Choe, a PhD student in Geomatics at the University of Melbourne, Australia, is looking for your advice on a potential focus group study on the design of the OSM changeset discussion interface. Her research topic is on the user conflict management process in online geospatial data communities (which has mixed components of GIS, HCI, and organisational management).

Maps

  • Hub and spoke is a map that shows the 10 nearest airports to a given position.
  • CipherBliss published (fr) a thematic map of places to eat based on OpenStreetMap, ‘Melting Pot(fr) > en.

Open Data

  • [1] Osmose is now using open data to compare against OpenStreetMap data to find any missing roads or power lines in OSM. At present comparisons are made for power lines in France and highways in Spain.

Software

  • The first version of ‘Organic Maps’, a free and open source mobility map app for Android and iOS, was released (ru) > en last June (2021). After more than 100,000 installations and one year of intensive development work, the results and plans for the future are presented.

Programming

  • The new OSM app OSM Go! is looking for translators and developers.

Releases

  • Version 17.1 of the Android OSM editor Vespucci has been released.
  • With StreetComplete v45.0-alpha1, Tobias Zwick introduced the new overlays functionality.

Did you know …

  • … that there are apps out there helping you find windy roads? Curvature, Calimoto and Kurviger are just some examples.
  • … the MapCSS style for special highlighting of bicycle infrastructure in JOSM?
  • … HistOSM.org, which renders historical features exclusively?
  • … the Japanese local chapter of OSMF, OSMFJ, maintains a tile server and also offers a vector tile download service (via user smellman)? More details are on the wiki (ja) > en.

OSM in the media

  • OpenStreetMap featured (fr) > en (see video (fr)) in an overview of a wide range of modern mapping technologies in a segment on the France24 news channel. The OSM examples were: participative mapping in Africa (3m17s); and Grab’s use of OSM in South-East Asia (4m10s), which allows them, unlike other map providers, to take into account the reality of Asia, with rainy seasons and a lot of narrow roads. Other topics include Apple’s 3-D visualisation of Las Vegas, 360 degree image capture, indoor mapping and geoblocking.

Other “geo” things

  • Matthew Maganga wrote, in ArchDaily, about the inequalities created through modern mapping methods and especially Google StreetView.
  • Google Earth blogged about how they process Copernicus Sentinel-2 satellite images daily to create a current and historical land cover data set.
  • Saman Bemel Benrud, an early Mapbox employee, looked back at the 12 years he worked at the company and describes how it changed over time – leading to a failed attempt to found a union, which was part of the reason he left the company last year.
  • Canada and Denmark had a decades long land dispute, called the Whisky War, over an uninhabited Arctic island between Nunavut and Greenland. Following an agreement to divide control of Hans Island / Tartupaluk / ᑕᕐᑐᐸᓗᒃ, Canada now has a land border with a second country after the United States. Note that Canada also shares a maritime border with a second European country (France) near Newfoundland (second because Greenland is a constituent country of the Kingdom of Denmark).

Upcoming Events

Where | What | When
Arlon | EPN d’Arlon – Atelier ouvert OpenStreetMap – Contribution | 2022-06-28
Hlavní město Praha | MSF Missing Maps CZ Mapathon 2022 #2 Prague, KPMG office (Florenc) | 2022-06-28
City of New York | A Synesthete’s Atlas (Brooklyn, NY) | 2022-06-29
Roma | Incontro dei mappatori romani e laziali | 2022-06-29
[Online] | OpenStreetMap Foundation board of Directors – public videomeeting | 2022-06-30
Washington | A Synesthete’s Atlas (Washington, DC) | 2022-07-01
Essen | 17. OSM-FOSSGIS-Communitytreffen | 2022-07-01 – 2022-07-03
| OSM Africa July Mapathon: Map Liberia | 2022-07-01
| OSMF Engineering Working Group meeting | 2022-07-04
臺北市 | OpenStreetMap x Wikidata Taipei #42 | 2022-07-04
San Jose | South Bay Map Night | 2022-07-06
London | Missing Maps London Mapathon | 2022-07-05
Berlin | OSM-Verkehrswende #37 (Online) | 2022-07-05
Salt Lake City | OSM Utah Monthly Meetup | 2022-07-07
Fremantle | Social Mapping Sunday: Fremantle | 2022-07-10
München | Münchner OSM-Treffen | 2022-07-12
20095 | Hamburger Mappertreffen | 2022-07-12
Landau an der Isar | Virtuelles Niederbayern-Treffen | 2022-07-12
Salt Lake City | OSM Utah Monthly Meetup | 2022-07-14

Note:
If you would like to see your event here, please add it to the OSM calendar. Only events which are entered there will appear in weeklyOSM.

This weeklyOSM was produced by Lejun, Nordpfeil, PierZen, SK53, SeverinGeo, Strubbl, Supaplex, TheSwavu, YoViajo, derFred.

Women’s suffrage and the Hunger Strike Medal

16:16, Friday, 24 2022 June UTC

Dr Sara Thomas, Scotland Programme Coordinator for Wikimedia UK

On International Women’s Day, I ran training for long-term Wikimedia UK partners Protests & Suffragettes and Women’s History Scotland. The editathon focused on Scottish Suffrage(ttes), and is just one of a series of events that they’ll be running over the next few months.  

A few days after the event, I was tagged in a brilliant Twitter thread from one participant and new Wikipedia editor, Becky Male. Becky had been working on the Hunger Strike Medal article. I was really struck not only by her new-found enthusiasm for Wikipedia editing, but also by this quote: “Knowledge activism matters because, for most people, Wikipedia is their first port of call for new info. I did the Cat and Mouse Act in GCSE History. Don’t remember learning about the medal or the names of the women.” 

We often talk about Knowledge Activism in the context of fixing content gaps that pertain to voices and communities left out by structures of power and privilege, and how the gender gap manifests in different ways on-wiki. I thought that this was a great example of how the Wikimedia community’s work is helping to address those gaps, so I reached out to Becky to ask if she’d like to write a blog for us, which you can read below. Thanks Becky!

Picture of the English suffragette Emily Davison, date unknown, but c.1910-12. CC0.

By Becky Male, @beccamale

Joining Wikipedia was one of those things I’d thought about doing from time to time – I’d come across an article that was woefully short and think to myself “someone should probably do something about that”. But fear of accidentally breaking something stopped me.

But then it’s International Women’s Day, and Women’s History Scotland, Protests & Suffragettes and Wikimedia UK are organising an Editathon to get some of the information P&S has found – they’ve created fantastic educational resources on the Scottish suffrage movement – added to Wikipedia. This is the Knowledge Gap: even when things are known about women, that knowledge hasn’t made it on to Wikipedia. It’s most people’s first port of call for new information, which makes this a big problem.

So I signed up and did the intro tutorial. A misspent adolescence on LiveJournal meant the leap from basic HTML to editing in source was fairly small. And there’s something about sitting in a Zoom call of two dozen women, all a bit nervous about this process too and being told “It’s okay, you really can’t screw this up that badly” that’s genuinely reassuring – failure’s a lot less scary when you’ve got backup.

Offline, I volunteer at Glasgow Women’s Library digitising artefacts. Creating the article on the Suffragette Penny sounded like a perfect extension of that. But it was wisely suggested that I should pick an existing article for my first. The Hunger Strike Medal needed work and was similar enough to get me started.

I studied the Cat and Mouse Act for GCSE History, so I already had some background knowledge of the suffragette tactic of hunger striking. I cleaned up the lead, separated the information into sections and added a few other interesting titbits – as I learned at the Editathon, Wikipedia users love trivia. But the biggest change I made was to the list of medal recipients.

The medal was the WSPU’s highest honour – not only had a woman been gaoled for her beliefs, she’d risked her life and health for the cause. The hunger strikes and subsequent force-feeding by prison authorities contributed to early deaths, caused serious illnesses, and destroyed women’s mental health. They suffered horrifically and their sacrifices deserve to be remembered.

The list is now over 90 names, each one sourced, each medal confirmed. Some I found in books, maybe just one line about them. Others I found with a Google search, the suggested images showing me new medals the deeper I went, leading me to the sites of auction houses and local museums. My favourites, though, are in newsreels from 1955, women well into their 60s still proudly wearing their medals.

There are another 60+ hunger strikers whose medals haven’t been found yet. Some names I moved to the Talk page because the evidence doesn’t yet support their inclusion on the list. I can’t say for sure that this is the most comprehensive list of WSPU hunger strikers, but I think it’s likely – I certainly haven’t found one anywhere else.

And I’ve still got that Suffragette Penny article to write.

Militant suffragette Janie Terrero (1858-1944) wearing her Hunger Strike Medal and Holloway brooch c1912. CC0.

The post Women’s suffrage and the Hunger Strike Medal appeared first on WMUK.

Should Vector be responsive?

20:35, Thursday, 23 2022 June UTC

Here I share some thoughts around the history of "responsive" MediaWiki skins and how we might want to think about it for Vector.

The buzzword "responsive" is thrown around a lot in Wikimedia-land, but essentially what we are talking about is whether to include a single tag in the page: a meta tag with the name viewport, which tells the browser how to adapt the page to a mobile device.

<meta name="viewport" content="width=device-width, initial-scale=1">

More information: https://css-tricks.com/snippets/html/responsive-meta-tag/

Because the viewport tag must be added explicitly, websites are not mobile-friendly by default. Since the traditional Wikimedia skins were built before mobile sites and this tag existed, CologneBlue, Modern, and Vector did not add it.
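Incidentally, you can check from the browser console whether the skin you are currently viewing ships the tag. A throwaway sketch (plain JavaScript, not part of any skin's code):

// Throwaway console sketch: report whether the current page declares a viewport.
const viewportTag = document.querySelector('meta[name="viewport"]');
console.log(viewportTag ? viewportTag.content : 'no viewport tag: this skin is not responsive');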

When viewing these skins on mobile, the content does not adapt to the device and instead appears zoomed out. One benefit is that the reader sees a design consistent with the desktop design: the interface is familiar and easy enough to navigate, since the user can pinch and zoom to parts of the UI. The downside is that reading is very difficult and requires far more hand manipulation to move between sentences and paragraphs, and for this reason many search engines penalize such pages in their rankings.

Enter Minerva

The Minerva skin (and MobileFrontend before it) was introduced to allow us to start adapting our content for mobile. This turned out to be a good decision, as it prevented our projects from being penalized in search rankings. However, building Minerva showed that making content mobile-friendly takes more than adding a meta tag. For example, many templates used HTML elements with fixed widths that were bigger than the available space. This was notably a problem with large tables. Minerva swept many of these issues under the rug with generic fixes (for example, enforcing horizontal scrolling on tables). Minerva took a bottom-up approach, adding features only after they were mobile-friendly. The result was a minimal experience that was not popular with editors.

Timeless

Timeless was the second responsive skin added to Wikimedia wikis. It was popular with editors because, unlike Minerva, it took a top-down approach, adding features despite their shortcomings on a mobile screen. It ran into many of the same issues that Minerva had, e.g. large tables, and copied many of Minerva's solutions.

MonoBook

During the building of Timeless, the Monobook skin was made responsive (T195625). Interestingly, this led to a lot of backlash from users (particularly on German Wikipedia), revealing that many users did not want a skin that adapted to the screen, presumably for the reasons I outlined earlier: while reading is harder, it is easier to get around a complex site. Because of this, a preference was added to allow editors to disable responsive mode (the viewport tag). This preference was later generalized to apply to all skins.

Responsive Vector

Around the same time, several attempts were made by volunteers to make Vector work as a responsive skin. This was feature-flagged, given the backlash against MonoBook's responsive mode. The feature flag saw little development, presumably because many gadgets popped up providing the same service.

Vector 2022

The feature flag for responsive Vector was removed for legacy Vector in T242772, and efforts were redirected into making the new Vector responsive. Currently, the Vector skin can be resized comfortably down to 500px. It does not yet add a viewport tag, so it does not adapt to a mobile screen.

However, during the building of the table of contents, many mobile users started complaining (T306910). The reason for this is that when you don't define a viewport tag, the browser makes decisions for you. To avoid these kinds of issues popping up, it might make sense for us to define an explicit viewport that requests content scaled out at a width of our choosing. For example, we could explicitly set a width of 1200px with a zoom level of 0.25, and users would see the page rendered at that width, scaled down to fit their screen.

If Vector were responsive, it would encourage people to think about mobile-friendly content as they edit on mobile. If editors insist on using the desktop skin on their mobile phones rather than Minerva, they have their reasons; but by not serving them a responsive skin, we are encouraging them to create content that does not work in Minerva or in other skins that adapt to the mobile device.

There is a little bit more work needed on our part to deal with content that cannot fit into narrow widths such as 320px, or anything below 500px. Currently, if the viewport tag is set, a horizontal scrollbar will be shown; for example, the header does not adapt to that breakpoint.


Decisions to be made

  1. Should we enable Vector 2022's responsive mode? The only downside of doing this is that some users may dislike it and need to visit preferences to opt out.
  2. When a user doesn't want responsive mode, should we be more explicit about what we serve them? For example, should we tell a mobile device to render at a width of 1000px with a scale of 0.25 (1/4 of the normal size)? This would avoid issues like T306910. Example code [1], demo.
  3. Should we apply the responsive mode to legacy Vector too? This would fix T291656, as it would mean the option applies to all skins.

[1]

<meta name="viewport" content="width=1400px, initial-scale=0.22">

Episode 115: BTB Digest 18

18:15, Tuesday, 21 2022 June UTC

🕑 30 minutes

It's another BTB Digest episode! Mike Cariaso explains why you should use SQLite, Tyler Cipriani talks about teaching deployment to volunteers, Dror Snir-Haim compares translation options, Alex Hollender defends sticky headers, Kunal Mehta criticizes Bitcoin miners, and more!

June 21, 2022, San Francisco, CA, USA ― Wikimedia Enterprise, a first-of-its-kind commercial product designed for companies that reuse and source Wikipedia and Wikimedia projects at a high volume, today announced its first customers: multinational technology company Google and nonprofit digital library Internet Archive.  Wikimedia Enterprise was recently launched by the Wikimedia Foundation, the nonprofit that operates Wikipedia, as an opt-in product. Starting today, it also offers a free trial account to new users who can self sign-up to better assess their needs with the product.

As Wikipedia and Wikimedia projects continue to grow, knowledge from Wikimedia sites is increasingly being used to power other websites and products. Wikimedia Enterprise was designed to make it easier for these entities to package and share Wikimedia content at scale in ways that best suit their needs: from an educational company looking to integrate a wide variety of verified facts into their online curricula, to an artificial intelligence startup that needs access to a vast set of accurate data in order to train their systems. Wikimedia Enterprise provides a feed of real-time content updates on Wikimedia projects, guaranteed uptime, and other system requirements that extend beyond what is freely available in publicly-available APIs and data dumps. 

“Wikimedia Enterprise is designed to meet a variety of content reuse and sourcing needs, and our first two customers are a key example of this. Google and Internet Archive leverage Wikimedia content in very distinct ways, whether it’s to help power a portion of knowledge panel results or preserve citations on Wikipedia,” said Lane Becker, Senior Director of Earned Revenue at the Wikimedia Foundation. “We’re thrilled to be working with them both as our longtime partners, and their insights have been critical to build a compelling product that will be useful for many different kinds of organizations.” 

Organizations and companies of any size can access Wikimedia Enterprise offerings with dedicated customer-support and Service Level Agreements, at a variable price based on their volume of use. Interested companies can now sign up on the website for a free trial account which offers 10,000 on-demand requests and unlimited access to a 30-day Snapshot. 

Google and the Wikimedia Foundation have worked together on a number of projects and initiatives to enhance knowledge distribution to the world. Content from Wikimedia projects helps power some of Google’s features, including being one of several data sources that show up in its knowledge panels. Wikimedia Enterprise will help make the content sourcing process more efficient. Tim Palmer, Managing Director, Search Partnerships at Google said, “Wikipedia is a unique and valuable resource, created freely for the world by its dedicated volunteer community. We have long supported the Wikimedia Foundation in pursuit of our shared goals of expanding knowledge and information access for people everywhere. We look forward to deepening our partnership with Wikimedia Enterprise, further investing in the long-term sustainability of the foundation and the knowledge ecosystem it continues to build.”

Internet Archive is a long-standing partner to the Wikimedia Foundation and the broader free knowledge movement. Their product, the Wayback Machine, has been used to fix more than 9 million broken links on Wikipedia. Wikimedia Enterprise is provided free of cost to the nonprofit to further support their mission to digitize knowledge sources. Mark Graham, Director of the Internet Archive’s Wayback Machine shared, “The Wikimedia Foundation and the Internet Archive are long-term partners in the mission to provide universal and free access to knowledge. By drawing from a real time feed of newly-added links and references in Wikipedia sites – in all its languages, we can now archive more of the Web more quickly and reliably.”

Wikimedia Enterprise is an opt-in, commercial product. Within a year of its commercial launch, it is covering its current operating costs, with a growing list of users exploring the product. All Wikimedia projects, including the suite of publicly-available datasets, tools, and APIs the Wikimedia Foundation offers, will continue to be available for free use to all users. 

The creation of Wikimedia Enterprise arose, in part, from the recent Movement Strategy – the global, collaborative strategy process to direct Wikipedia’s future by the year 2030 devised side-by-side with movement volunteers. By making Wikimedia content easier to discover, find, and share, the product speaks to the two key pillars of the 2030 strategy recommendations: advancing knowledge equity and knowledge as a service. 

Interested companies are encouraged to visit the Wikimedia Enterprise website for more information on the product offering and features, as well as to sign up for their free account. 

About the Wikimedia Foundation 

The Wikimedia Foundation is the nonprofit organization that operates Wikipedia and the other Wikimedia free knowledge projects. Wikimedia Enterprise is operated by Wikimedia, LLC, a wholly owned limited liability company (LLC) of the Wikimedia Foundation. The Foundation’s vision is a world in which every single human can freely share in the sum of all knowledge. We believe that everyone has the potential to contribute something to our shared knowledge, and that everyone should be able to access that knowledge freely. We host Wikipedia and the Wikimedia projects, build software experiences for reading, contributing, and sharing Wikimedia content, support the volunteer communities and partners who make Wikimedia possible, and advocate for policies that enable Wikimedia and free knowledge to thrive. 

The Wikimedia Foundation is a charitable, not-for-profit organization that relies on donations. We receive donations from millions of individuals around the world, with an average donation of about $15. We also receive donations through institutional grants and gifts. The Wikimedia Foundation is a United States 501(c)(3) tax-exempt organization with offices in San Francisco, California, USA.


How does Internet Archive know?

19:30, Monday, 20 2022 June UTC

The Internet Archive discovers in real-time when WordPress blogs publish a new post, and when Wikipedia articles reference new sources. How does that work?

Wikipedia

Wikipedia, and its sister projects such as Wiktionary and Wikidata, run on the MediaWiki open-source software. One of its core features is “Recent changes”. This enables the Wikipedia community to monitor site activity in real-time. We use it to facilitate anti-spam, counter-vandalism, machine learning, and many more quality and research efforts.

MediaWiki’s built-in REST API exposes this data in machine-readable form to query (or poll). For wikipedia.org, we have an additional RCFeed plugin that broadcasts events to the stream.wikimedia.org service (docs).
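As one way to poll this data, here is a minimal JavaScript sketch against the long-standing Action API (usable from a browser console or Node 18+; the rcprop fields are just an illustrative selection):

// Minimal polling sketch: fetch the ten most recent changes on English Wikipedia.
const apiUrl = 'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges'
  + '&rcprop=title|timestamp|user&rclimit=10&format=json&formatversion=2&origin=*';
const changes = (await (await fetch(apiUrl)).json()).query.recentchanges;
for (const change of changes) {
  console.log(`${change.timestamp} ${change.user} edited ${change.title}`);
}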

The service implements the HTTP Server-Sent Events protocol (SSE). Most programming languages have an SSE client via a popular package. Most exciting to me, though, is the original SSE client: the EventSource API — built straight into the browser.1 This makes cool demos possible, getting started with only the following JavaScript:

new EventSource('https://stream.wikimedia.org/…');

And from the command-line, with cURL:

$ curl 'https://stream.wikimedia.org/v2/stream/recentchange'

event: message
id: …
data: {"$schema":…,"meta":…,"type":"edit","title":…}

WordPress

WordPress played a major role in the rise of the blogosphere. In particular, ping servers (and pingbacks2) helped the early blogging community with discovery. The idea: your website notifies a ping server over a standardized protocol. The ping server in turn notifies feed reader services (Feedbin, Feedly), aggregators (FeedBurner), podcast directories, search engines, and more.3

Ping servers today implement the weblogsCom interface (specification), introduced in 2001 and based on the XML-RPC protocol.4 The default ping server in WordPress is Automattic’s Ping-O-Matic, which in turn powers the WordPress.com Firehose.
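The ping itself is a tiny XML-RPC call. As a rough sketch (Node 18+ JavaScript; the rpc.pingomatic.com endpoint and the two string parameters, blog name and blog URL, reflect my reading of the spec above rather than code taken from WordPress):

// Rough sketch of a weblogUpdates.ping XML-RPC call to Ping-O-Matic.
const pingBody = `<?xml version="1.0"?>
<methodCall>
  <methodName>weblogUpdates.ping</methodName>
  <params>
    <param><value><string>Example Blog</string></value></param>
    <param><value><string>https://example.com/</string></value></param>
  </params>
</methodCall>`;
const pingResponse = await fetch('http://rpc.pingomatic.com/', {
  method: 'POST',
  headers: { 'Content-Type': 'text/xml' },
  body: pingBody,
});
console.log(await pingResponse.text()); // XML-RPC response, indicating success or an error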

The WordPress.com Firehose is a Jabber/XMPP server at xmpp.wordpress.com:8008. It provides real-time events about blog posts published on any WordPress site, both WordPress.com and self-hosted ones.5 The firehose is also available as an HTTP stream.

$ curl -vi xmpp.wordpress.com:8008/posts.org.json # self-hosted
{ "published":"2022-06-05T21:26:09Z",
  "verb":"post",
  "generator":{},
  "actor":{},
  "target":{"objectType":"blog",…,},
  "object":{"objectType":"article",…}
}
{}

$ curl -vi xmpp.wordpress.com:8008/posts.json # WordPress.com
{}
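From JavaScript (Node 18+), the same HTTP stream can be read incrementally. A minimal sketch that simply prints each chunk as it arrives, making no assumptions about framing beyond what the cURL output above shows:

// Minimal sketch: print the firehose HTTP stream as it arrives.
const streamResponse = await fetch('http://xmpp.wordpress.com:8008/posts.org.json');
const reader = streamResponse.body.getReader();
const decoder = new TextDecoder();
for (;;) {
  const { value, done } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value, { stream: true }));
}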

Internet Archive

It might be surprising, but the Internet Archive does not try to index the entire Internet. This is in contrast to commercial search engines.

The Internet Archive consists of bulk datasets from curated sources (“collections”). Collections are often donated by other organizations and go beyond capturing web pages; they can also include books, music,6 and software.7 Any captured web pages are additionally surfaced via the Wayback Machine interface.

Perhaps you’ve used the “Save Page Now” feature, where you can manually submit URLs to capture. While also represented by a collection, these actually go to the Wayback Machine first, and appear in bulk as part of the collection later.

The Common Crawl and Wide Crawl collections represent traditional crawlers. These start with a seed list and go breadth-first to every site they find (within a certain global and per-site depth limit). Such a crawl can take months to complete, and captures a portion of the web from a particular period in time, regardless of whether a page was indexed before. Other collections are narrower in focus, e.g. regularly crawling a news site and capturing any articles not previously indexed.

Wikipedia collection

One such collection is Wikipedia Outlinks.8 This collection is fed several times a day with bulk crawls of new URLs. The URLs are extracted from recently edited or created Wikipedia articles, as discovered via the events from stream.wikimedia.org (Source code: crawling-for-nomore404).
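The linked crawler does the real work; as a rough sketch of the general idea, the current external links of a just-edited article can be listed with the Action API's extlinks module (illustrative only, not code from crawling-for-nomore404):

// Rough sketch: list the external links of a recently edited article.
async function outlinks(title) {
  const url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extlinks'
    + '&ellimit=max&format=json&formatversion=2&origin=*'
    + '&titles=' + encodeURIComponent(title);
  const data = await (await fetch(url)).json();
  return (data.query.pages[0].extlinks || []).map((link) => link.url);
}
console.log(await outlinks('VodafoneZiggo'));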

en.wikipedia.org, revision by Krinkle, on 30 May 2022 at 21:03:30.

Last month, I edited the VodafoneZiggo article on Wikipedia. My edit added several new citations. The articles I cited were from several years ago, and most already made their way into the Wayback Machine by other means. Among my citations was a 2010 article from an Irish news site (rtl.ie). I searched for it on archive.org and no snapshots existed of that URL.

A day later I searched again, and there it was!

web.archive.org found 1 result, captured at 30 May 2022 21:03:55. This capture was collected by: Wikipedia Eventstream.

I should note that, while the snapshot was uploaded a day later, the crawling occurred in real time. I published my edit to Wikipedia on May 30th, at 21:03:30 UTC. The snapshot of the referenced source article was captured at 21:03:55 UTC. A mere 25 seconds later!

In addition to archiving citations for future use, Wikipedia also integrates with the Internet Archive in the present. The so-called InternetArchiveBot (source code) continuously crawls Wikipedia, looking for “dead” links. When it finds one, it searches the Wayback Machine for a matching snapshot, preferring one taken on or near the date that the citation was originally added to Wikipedia. This is important for online citations, as web pages may change over time.

The bot then edits Wikipedia (example) to rescue the citation by filling in the archive link.

Wikipedia.org, revision by InternetArchiveBot, on 4 June 2022. Rescuing 1 source. The source was originally cited on 29 September 2018. The added archive URL is also from 29 September 2018. web.archive.org, found 1 result, captured 29 September 2018. This capture was collected by: Wikipedia Eventstream.
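The snapshot lookup itself can be illustrated with the public Wayback availability API. A minimal sketch (this is the public endpoint; InternetArchiveBot may use a different interface internally):

// Minimal sketch: find the snapshot closest to the date a citation was added.
async function closestSnapshot(url, timestamp /* YYYYMMDD[hhmmss] */) {
  const api = 'https://archive.org/wayback/available?url='
    + encodeURIComponent(url) + '&timestamp=' + timestamp;
  const data = await (await fetch(api)).json();
  return data.archived_snapshots.closest; // { url, timestamp, status, available } or undefined
}
console.log(await closestSnapshot('https://example.com/', '20180929'));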

WordPress collection

The NO404-WP collection on archive.org works in a similar fashion. It is fed by a crawler that uses the WordPress Firehose (source code). The firehose, as described above, is fed by pings from individual WordPress sites after they publish a new post.

For example, this blog post by Chris. According to the post metadata, it was published at 12:00:42 UTC. And by 12:01:55, one minute later, it was captured.9

In addition to preserving blog posts, the NO404-WP collection goes a step further and also captures any new material your post links to. (Akin to Wikipedia citations!) For example, this css-tricks.com post links to a file on GitHub inside the TT1 Blocks project. This deep link was not captured before and is unlikely to be picked up by regular crawling due to depth limits. It got captured and uploaded to the NO404-WP collection a few days later.


Footnotes:

  1. The “Server-sent events” technology was around as early as 2006, originating at Opera (announcement, history). It was among the first specifications to be drafted through WHATWG, which formed in 2004 after the W3C XHTML debacle.

  2. Pingback (Pingbacks explained, history) provides direct peer-to-peer discovery between blogs when one post mentions or links to another post. By the way, the Pingback and Server-Sent Events specifications were both written by Ian Hickson. 

  3. Feedbin supports push notifications. While these could come from its periodic RSS crawling, it tries to deliver them in real-time where possible. It does this by mapping pings from blogs that notify Ping-O-Matic to feed subscriptions. 

  4. The weblogUpdates spec for ping servers was written by Dave Winer in 2001, who took over Weblogs.com around that time (history) and needed something more scalable. This, by the way, is the same Dave Winer who developed the underlying XML-RPC protocol, the OPML format, and worked on RSS 2.0. 

  5. That is, unless the blog owner opts-out by disabling the “search engine” and “ping” settings in WordPress Admin. 

  6. The Muziekweb collection is one that stores music rather than web pages. Muziekweb is a library in the Netherlands that lends physical CDs, via local libraries, to patrons. They also digitize their collection for long-term preservation. One cool application of this is that you can stream any album in full from a library computer. And… they mirror to the Internet Archive! You can search for an artist, and listen online. For copyright reasons, most music is publicly limited to 30-second samples. Through Controlled digital lending, however, you can access many more albums in full. Plus, you can publicly stream any music that is in the public domain, under a free license, or pre-1972 and no longer commercially available.

  7. I find it particularly impressive that the Internet Archive also hosts platform emulators for the software it preserves, that these platforms include not only game consoles but also Macintosh and MS-DOS, and that these emulators are compiled via Emscripten to JavaScript and integrated right on the archive.org entry! For example, you can play the original Prince of Persia for Mac (via pce-macplus.js), the later color edition, or Wolfenstein 3D for MS-DOS (via js-dos or em-dosbox), or check out Bill Atkinson’s 1985 MacPaint.

  8. The “Wikipedia Outlinks” collection was originally populated via the NO404-WKP subcollection, which used the irc.wikimedia.org service from 2013 to 2019. It was phased out in favour of the wikipedia-eventstream subcollection.

  9. In practice, the ArchiveTeam URLs collection tends to beat the NO404-WP collection, and thus the latter doesn't crawl the URL again. Perhaps the ArchiveTeam scripts also consume the WordPress Firehose? For many WordPress posts I checked, the URL is only indexed once, by "ArchiveTeam URLs", within seconds of original publication.