The missing bedrock of Wikipedia’s geology coverage

16:56, Tuesday, 16 November 2021 UTC

The Catoctin Formation is a geological formation that extends from Virginia, through Maryland, to Pennsylvania. This ancient rock formation, which dates to the Precambrian, is mostly buried deep beneath more recent geological deposits, but is exposed in part of the Blue Ridge Mountains. And until a student in Sarah Carmichael’s Petrology and Petrography class expanded it this spring, Wikipedia’s article about the Catoctin Formation was only two sentences long. Now, thanks to this student editor, Wikipedia has a readable, informative, and well-illustrated article that’s almost 2,000 words long.

Despite having almost 6.4 million articles, Wikipedia is still missing plenty of topics. But it still surprises me when an entire class finds a lane as empty as this one did. In addition to working on two stubs, students in the class created 15 new articles.

The Roosevelt Gabbros is an intrusive igneous rock formation in southwestern Oklahoma. A gabbro is a magnesium- and iron-rich rock formed by the cooling of magma. The Roosevelt Gabbros are named after the town of Roosevelt in Kiowa County, Oklahoma, and are one of the geologic formations that make up the Wichita Mountains. Other new articles created by the class include Red Hill Syenite, an igneous rock complex in central New Hampshire; the Ashe Metamorphic Suite in Ashe County, North Carolina; and the Central Montana Alkalic Province, a geological province occupying much of the middle third of the state of Montana.

Content related to geology and mineralogy on Wikipedia is underdeveloped. From individual minerals to a 600,000 km² geological basin, student editors in past classes have been able to create new articles about broad, substantive topics. And where articles do exist, many of them are stubs.

Wiki Education’s Wikipedia Student Program offers instructors in geology and mineralogy — and other subjects — the opportunity to fill these content gaps by empowering students to contribute content as a class assignment. For more information, visit teach.wikiedu.org.

Image credit: Alex Speer, CC BY-SA 4.0, via Wikimedia Commons

The Wikimedia Foundation, the nonprofit that operates Wikipedia, applauds European policymakers’ efforts to make content moderation more accountable and transparent. However, some of the current provisions and proposed amendments of the Digital Services Act (DSA) also include requirements that could put Wikipedia’s collaborative and not-for-profit model at risk.

Wikipedia’s system of open collaboration has enabled knowledge-sharing on a global scale for more than 20 years. It is one of the most beloved websites in the world, as well as one of the most trusted sources for up-to-date knowledge about COVID-19. All of this is only made possible by laws that protect its volunteer-led model. But now, that people-powered model is getting caught in the crossfire of the DSA proposals.

The current DSA framework is designed to address the operating models of major tech platforms. But a variety of websites, Wikipedia included, don’t work in the same way that for-profit tech platforms do. Applying a one-size-fits-all solution to the complex problem of illegal content online could stifle a diverse, thriving, and noncommercial ecosystem of online communities and platforms.

We are calling on European lawmakers to take a more nuanced approach to internet regulation. There is more to the internet than Big Tech platforms run by multinational corporations. We ask lawmakers to protect and support nonprofit, community-governed, public interest projects like Wikipedia as the DSA proceeds through the European Parliament and Council.

We are ready to work with lawmakers to amend the DSA package so that it empowers and protects the ability of all Europeans to collaborate in the public interest. 

Protect Wikipedia, protect the people’s internet. 

Here are four things policymakers should know before finalizing the DSA legislation: 

  1. The DSA needs to address the algorithmic systems and business models that drive the harms caused by illegal content. 

DSA provisions remain overly focused on removing content through prescriptive content removal processes. The reality is that removing all illegal content from the internet as soon as it appears is as daunting as any effort to prevent and eliminate all crime in the physical world. Given that the European Union is committed to protecting human rights online and offline, lawmakers should focus on the primary cause of widespread harm online: systems that amplify and spread illegal content.

A safer internet is only possible if DSA provisions address the targeted advertising business model that drives the spread of illegal content. As the Facebook whistleblower Frances Haugen emphasized in her recent testimony in Brussels, the algorithms driving profits for ad-placements are also at the root of the problem that the DSA is seeking to address. New regulation should focus on these mechanisms that maximize the reach and impact of illegal content. 

But lawmakers should not be overly focused on Facebook and similar platforms. As a non-profit website, Wikipedia is available for free to everyone, without ads, and without tracking reader behavior. Our volunteer-led, collaborative model of content production and governance helps ensure that content on Wikipedia is neutral and reliable. Thousands of editors deliberate, debate, and work together to decide what information gets included and how it is presented. This works very differently than the centralized systems that lean on algorithms to both share information in a way that maximizes engagement, and to moderate potentially illegal or harmful content.  

In Wikipedia’s 20 years, our global community of volunteers has proven that empowering users to share and debate facts is a powerful means to combat the use of the internet by hoaxers, foreign influence operators, and extremists. It is imperative that new legislation like the DSA fosters space for a variety of web platforms, commercial and noncommercial, to thrive. 

  2. Terms of service should be transparent and equitable, but regulators should not be overly prescriptive in determining how they are created and enforced.

The draft DSA’s Article 12 currently states that an online provider has to disclose its terms of service — its rules and tools for content moderation — and that they must be enforced “in a diligent, objective, and proportionate manner.” We agree that terms of service should be as transparent and equitable as possible. However, the words “objective” and “proportionate” leave room for vague, open-ended interpretation. We sympathize with the intent, which is to make companies’ content moderation processes less arbitrary and opaque. But forcing platforms to be “objective” about terms of service violations would have unintended consequences. Such language could lead to enforcement that makes it impossible for community-governed platforms like Wikipedia to use volunteer-driven, collaborative processes to create new rules and enforce existing ones in a way that appropriately takes the context and origin of content into account.

The policies for content and conduct on Wikipedia are developed and enforced by the people contributing to Wikipedia themselves. This model allows people who know about a topic to determine what content should exist on the site and how that content should be maintained, based on established neutrality and reliable sourcing rules. This model, while imperfect, keeps Wikipedia neutral and reliable. As more people engage in the editorial process of debating, fact-checking, and adding information, Wikipedia articles tend to become more neutral. What’s more, volunteers’ deliberation, decisions, and enforcement actions are publicly documented on the website.  

This approach to content creation and governance is a far cry from the top-down power structure of the commercial platforms that DSA provisions target. The DSA should protect and promote spaces on the web that allow for open collaboration instead of forcing Wikipedia to conform to a top-down model.

  3. The process for identifying and removing “illegal content” must include user communities.

Article 14 states that online platforms will be responsible for removing any illegal content that might be uploaded by users, once the platforms have been notified of that illegal content. It also states that platforms will be responsible for creating mechanisms that make it possible for users to alert platform providers of illegal content. These provisions tend to only speak to one type of platform: those with centralized content moderation systems, where users have limited ability to participate in decisions over content, and moderation instead tends to fall on a singular body run by the platform. It is unclear how platforms that fall outside this archetype will be affected by the final versions of these provisions. 

The Wikipedia model empowers the volunteers who edit Wikipedia to remove content according to a mutually-agreed upon set of shared standards. Thus while the Wikimedia Foundation handles some requests to evaluate illegal content, the vast majority of content that does not meet Wikipedia’s standards is handled by volunteers before a complaint is even made to the Foundation. One size simply does not fit all in this case.

We fear that by placing legal responsibility for enforcement solely on service providers and requiring them to uphold strict standards for content removal, the law disincentivizes systems which rely on community moderators and deliberative processes. In fact, these processes have been shown to work well to identify and quickly remove bad content. The result would be an online world in which service providers, not people, control what information is available online. We are concerned that this provision will do the exact opposite of what the DSA intends by giving more power to platforms, and less to people who use them. 

  4. People cannot be replaced with algorithms when it comes to moderating content.

The best parts of the internet are powered by people, not in spite of them. Articles 12 and 14 would require platform operators to seize control of all decisions about content moderation, which would in turn incentivize or even require the use of automatic content detection systems. While such systems can support community-led content moderation by flagging content for review, they cannot replace humans. If anything, research has uncovered systemic biases and high error rates that are all too frequently associated with the use of automated tools. Such algorithms can thus further compound the harm posed by amplification. Automated tools are limited in their ability to identify fringe content that may be extreme but still has public interest value. One example of such content is videos documenting human rights abuses, which automated systems have been shown to remove swiftly. These examples only underscore the need to prioritize human context over speed.

Therefore, European lawmakers should avoid over-reliance on the kind of algorithms used by commercial platforms to moderate content. If the DSA forces or incentivizes platforms to deploy algorithms to make judgements about the value or infringing nature of content, we all – as digital citizenry – miss out on the opportunity to shape our digital future together. 

On Wikipedia, machine learning tools are used as an aid, not a replacement for human-led content moderation. These tools operate transparently on Wikipedia, and volunteers have the final say in what actions machine learning tools might suggest. As we have seen, putting more decision-making power into the hands of Wikipedia readers and editors makes the site more robust and reliable. 

“It is impossible to trust a ‘perfect algorithm’ to moderate content online. There will always be errors, by malicious intent or otherwise. Wikipedia is successful because it does not follow a predefined model; rather, it relies on the discussions and consensus of humans instead of algorithms.”

Maurizio Codogno, longtime Italian Wikipedia volunteer 

We urge policymakers to think about how new rules can help reshape our digital spaces so that collaborative platforms like ours are no longer the exception. Regulation should empower people to take control of their digital public spaces, instead of confining them to act as passive receivers of content moderation practices. We need policy and legal frameworks that enable and empower citizens to shape the internet’s future, rather than forcing platforms to exclude them further. 

Our public interest community is here to engage with lawmakers to help design regulations that empower citizens to improve our online spaces together. 

“Humanity’s knowledge is, more often than not, still inaccessible to many: whether it’s stored in private archives, hidden in little-known databases, or lost in the memories of our elders. Wikipedia aims to improve the dissemination of knowledge by digitizing our heritage and sharing it freely for everyone online. The COVID-19 pandemic and subsequent infodemic only further remind us of the importance of spreading free knowledge.”

Pierre-Yves Beaudouin, President, Wikimedia France

How to get in touch with Wikimedia’s policy experts 

  • For media inquiries to discuss Wikimedia’s position on the DSA, please contact [email protected] 
  • For MEPs and their staff, please contact Jan Gerlach, Lead Public Policy Manager, [email protected]

Tech News issue #46, 2021 (November 15, 2021)

00:00, Monday, 15 November 2021 UTC
2021, week 46 (Monday 15 November 2021)

weeklyOSM 590

11:30, Sunday, 14 November 2021 UTC

02/11/2021-08/11/2021

lead picture

30DayMapChallenge Day 5 – Buildings in Santa Cruz, Bolivia by Eric Armijo [1] © rcrmj | map data © OpenStreetMap contributors

Community

  • Public Lab Mongolia have started a blog series. First up: ‘Creating An Open-Source Database To Improve Access To Health Services Amid COVID-19 Pandemic In Mongolia’.
  • OpenStreetMap Belgium’s Mapper of the Month for November is Dasrakel from Belgium.

OpenStreetMap Foundation

  • Michael Collinson, acting as facilitator, has published the official set of questions and instructions for board candidates. Candidates are asked to send answers and manifestos by 24:00 UTC, Sunday 14 November.
  • Amanda McCann informed the Osmf-talk mailing list that the microgrants programme has been shelved while the Board works out the budgeting.
  • Amanda McCann shared, in her diary, what she did in OpenStreetMap during October.
  • This year’s OSMF Annual General Meeting has a special resolution to change the OSMF’s Articles of Association so that time spent as an associate member counts towards board candidacy requirements.
  • Instructions on voting at this year’s OSMF Annual General Meeting have been published.

OSM research

  • A dissertation by Filip Krumpe, dealing with the labelling of interactive maps, was published (de) > en at the University of Stuttgart. OSM data are used as the geodata basis. The thesis can be downloaded (en) as a PDF (file size: 29.1 MB).
  • Lukas Kruitwagen and colleagues at Oxford University published (paywall) a large worldwide dataset of predicted locations of solar power plants. The lead author has also written an accessible account. The work involved using machine learning based on a training dataset from solar farms mapped on OpenStreetMap around 2017. Satellite imagery from both SPOT and Sentinel-2 were used for both the initial training and creation of the predicted data.

Humanitarian OSM

  • The annual HOT Summit will be held on Monday 22 November as a virtual event, with the theme: ‘The Evolution of Local Humanitarian Open Mapping Ecosystems: Understanding Community, Collaboration, and Contribution’. Registration closes on Friday 19 November.

Maps

  • [1] Participants in the ’30 Day Map Challenge’ on Twitter continued to make maps using OpenStreetMap data:
    • Day 3: Polygons. Angela Teyvi showed how much detail exists for some buildings in Accra, Ghana.
    • Day 4: Hexagons. Hexbinning of bus stops in Accra also by Angela Teyvi. SIG UCA found some actual hexagons to map – lecture theatres in San Salvador, El Salvador.
    • Day 5: Buildings in Santa Cruz, Bolivia by Eric Armijo.
    • Day 6: Red. Polluted lakes in Finland by Sini Pöytäniemi.
    • Day 7: Green. Shammilah showed isochrones of walking time to health care facilities in Kisoro District, Uganda.
    • Day 8: Blue. Common choices were watery themes and places with blue in the name. Jaroslav_sm combined the two for lakes named ‘Blue Lake’ in Ukrainian.
    • Day 9: Monochrome. Heikki produced an intriguing identification quiz on Irish towns and cities, based on buildings alone (cleverly leveraging and publicising the project to map them across Ireland).
  • Day 5 was a little special as OpenStreetMap was the theme. Many mappers chose to explore specific classes of objects: Sber offices in Moscow (Дмитрий); restaurants in Merced (Derek Sollberger); 7-11 convenience stores in Hong Kong (Brandon Qilin). Xavier Olive did something a little different and explored the history of Zurich Airport on OSM.

Software

  • Mythic Beasts hosting company donated two virtual servers to Organic Maps to help them distribute maps for offline usage on mobile devices. They point out that the apparently low value of their donation (in comparison to some other cloud service providers) is in part due to them not having to fund their own space programme.
  • TrackExplorer is software that allows you to upload a GPX file and visualise the trip in 3D. O J’s diary post gives some examples and notes that the base data is OSM, so the more accurate the data, the better the 3D environment displayed.
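Since the bullet above refers to GPX uploads, the following minimal sketch shows what such a tool starts from: the (latitude, longitude, elevation) triples inside a GPX 1.1 track. It is an illustrative example only, not TrackExplorer’s actual code; the function name and the example file name are assumptions.

```python
import xml.etree.ElementTree as ET

# Minimal, illustrative sketch: extract trackpoints from a GPX 1.1 file.
# These (lat, lon, ele) triples are the raw material a 3D trip view is built from.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def read_trackpoints(path):
    root = ET.parse(path).getroot()
    points = []
    for trkpt in root.findall(".//gpx:trkseg/gpx:trkpt", GPX_NS):
        lat = float(trkpt.get("lat"))
        lon = float(trkpt.get("lon"))
        ele_node = trkpt.find("gpx:ele", GPX_NS)
        ele = float(ele_node.text) if ele_node is not None else 0.0
        points.append((lat, lon, ele))
    return points

# Hypothetical usage with a made-up file name:
# print(read_trackpoints("ride.gpx")[:5])
```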

Programming

  • Komadinovic Vanja gave a whirlwind introduction to using OSM’s OAuth 2 authentication service.
  • Martin Raifer (user tyr_asd), the new iD developer contracted by OSMF, introduced himself.

Releases

  • Sarah Hoffmann presented version 4.0.0 of Nominatim, now available with a more flexible approach to handling how places can be searched.

Did you know …

  • Open Etymology Map? It allows you to view and edit links to the Wikidata items of the people after whom a street is named.
  • … the polygon extractor of OSM France? This tool allows you to download OSM relations as GeoJSON, image and other formats based on the relations’ ID.

OSM in the media

  • The Economist covered (may be paywalled) the work of Kruitwagen and colleagues (reported above), including the role of OpenStreetMap data.

Other “geo” things

  • David Costa tweeted a link to a zoomable version of ‘Les grandes routes vélocipédiques de France’, an 1897 cycle touring map of France.
  • Niantic announced that the AR game ‘Harry Potter: Wizards Unite’ will cease to operate on 31 January 2022. The in-game map and data used to calculate monsters’ types and appearance rates are from OpenStreetMap.
  • User-contributed content added to Google Street View is causing players of GeoGuessr to get angry. As Andrew Deck explained, players of GeoGuessr, an online game where you guess your randomly selected location based on street views, are unhappy with the grainy, blurry, or otherwise poor-quality uploads that slow them down.
  • grin wrote about his experiences with his real-time kinematic (RTK) configuration in search of the most accurate position (precise to within a few centimetres).
  • ARTE has a series ((fr) with (en) subtitles) of videos on ‘Mapping the World’. The series presents the complex world of geopolitics broken down into ten minute, bite-sized chunks. Allegedly ‘you’ll never sound uninformed at the dinner table ever again’.

Upcoming Events

Where What When Country
Черкаси Open Mapathon: Digital Cherkasy 2021-10-24 – 2021-11-20 ua
Crowd2Map Tanzania GeoWeek FGM Mapathon 2021-11-15
UP Tacloban YouthMappers: MAPA-Bulig, Guiding the Youth to Community Mapping 2021-11-15
Bologna Geomatics at DICAM Geo Week Mapathon 2021-11-15
Grenoble OSM Grenoble Atelier OpenStreetMap 2021-11-15
OSMF Engineering Working Group meeting 2021-11-15
Missing Maps PDX GIS Day Mapathon 2021-11-16
UCB Brasil + CicloMapa: curso de mapeamento 2021-11-16 – 2021-11-26
Lyon Lyon : Réunion 2021-11-16
Bonn 145. Treffen des OSM-Stammtisches Bonn 2021-11-16
Berlin OSM-Verkehrswende #29 (Online) 2021-11-16
Lüneburg Lüneburger Mappertreffen (online) 2021-11-16
Missing Maps Arcadis GIS Day Mapathon 2021-11-17
Fort Collins CSU Geospatial Centroid GIS Day Mapathon 2021-11-18
Missing Maps WMU GIS Day Mapathon 2021-11-17
Köln OSM-Stammtisch Köln 2021-11-17
Zürich Missing Maps Zürich November Mapathon 2021-11-17
Chambéry Missing Maps CartONG Tour de France des Mapathons – Chambéry 2021-11-18
MSF Geo Week Global Mapathon 2021-11-19
State of the Map Africa 2021 2021-11-19 – 2021-11-21
Maptime Baltimore Mappy Hour 2021-11-20
Lyon EPN des Rancy : Technique de cartographie et d’édition 2021-11-20
Bogotá Distrito Capital – Departamento Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-20
HOT Summit 2021 2021-11-22
Bremen Bremer Mappertreffen (Online) 2021-11-22
San Jose South Bay Map Night 2021-11-24
Derby East Midlands OSM Pub Meet-up : Derby 2021-11-23
Vandœuvre-lès-Nancy Vandoeuvre-lès-Nancy : Rencontre 2021-11-24
Düsseldorf Düsseldorfer OSM-Treffen (online) 2021-11-24
[Online] OpenStreetMap Foundation board of Directors – public videomeeting 2021-11-26
Brno November Brno Missing Maps mapathon at Department of Geography 2021-11-26
長岡京市 京都!街歩き!マッピングパーティ:第27回 元伊勢三社 2021-11-27
Bogotá Distrito Capital – Departamento Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-27
泉大津市 オープンデータソン泉大津:町歩きとOpenStreetMap、Localwiki、ウィキペディアの編集 2021-11-27
Amsterdam OSM Nederland maandelijkse bijeenkomst (online) 2021-11-27
HOTOSM Training Webinar Series: Beginner JOSM 2021-11-27
Biella Incontro mensile degli OSMers BI-VC-CVL 2021-11-27
Chamwino How FAO uses different apps to measure Land Degradation 2021-11-29
OSM Uganda Mapathon 2021-11-29
Missing Maps Artsen Zonder Grenzen Mapathon 2021-12-02
Bochum OSM-Treffen Bochum (Dezember) 2021-12-02

Note:
If you would like to see your event here, please put it into the OSM calendar. Only data which is there will appear in weeklyOSM.

This weeklyOSM was produced by Nordpfeil, PierZen, SK53, Strubbl, TheSwavu, cafeconleche, derFred.

Improving Wikipedia’s coverage of the climate crisis

18:45, Friday, 12 November 2021 UTC

As the COP26 summit comes to a close, many people are reflecting on what we can do to help solve the climate crisis. Some student editors in Wiki Education’s Wikipedia Student Program already have: they’ve helped shape the world’s understanding of climate change and its impacts by sharing scientific information on Wikipedia. While some of the classes working on the topic have focussed specifically on climate change, others have been introductory-level composition classes.

Graduate students in Gunnar Schade’s Texas A&M climate change class took on a host of important topics. The student who re-wrote the Climate change in Texas article was able to flesh it out into an excellent article which addresses both the challenges Texas faces and some of the mitigation approaches. Another, who worked on the Media coverage of climate change article, was able to add information about coverage of recent events like the Trump Administration and the Australian wildfires.

Other students chose to focus on the science of climate change and its impacts. The history of climate change science article helps to contextualize what has been done, and can help readers understand the long history of climate science. Greenhouse and icehouse Earth are the two states between which the Earth’s climate has fluctuated; understanding them is important for forecasting future climates, and they are now explained more clearly on Wikipedia thanks to that student editor’s work. The Global temperature record, Polar amplification, and Tropical cyclones and climate change articles highlight the more obvious impacts of climate change; all were improved by student editors. The Climate change and ecosystems article looks at the impact of climate change on the natural systems human life depends on.

Effects of climate change on humans and the related Effects of climate change on human health articles are helping to connect the impacts of climate change to readers. Finally, the Climate change art article looks at climate change in another way, delving into some of the ways we react as humans.

Erin Larson’s Climate Change class at Alaska Pacific University worked on articles related to mechanisms like the CO2 fertilization effect, the Methane chimney effect, and the Tree credits article. A Fordham University student in Paul Bartlett’s Environmental Economics class worked on the Climate engineering article.

Yale University students in Helene Landemore’s Democracy, Science, and Climate Justice class focused on a different set of articles. One student expanded the Public opinion on climate change article, adding information about public perceptions of climate change in India. Other students expanded the Carbon tax and Climate change policy in the United States articles.

Matthew Bergman’s Introduction to Policy Analysis class at the University of California at San Diego made important additions to the Economics of climate change mitigation and Climate change policy in California articles, adding information about a series of bills passed in the state. Other students contributed to the Greenhouse gas emissions by the United States, United States withdrawal from the Paris Agreement, and San Diego Climate Action Plan articles.

Students from the University of California at Merced in Michelle Tonconis’ Extinction Events and Stewardship class also worked on the Effects of climate change on humans article; as humans, this topic is close to home for all of us.

While classes like these, with a science or policy focus related to climate change, are likely to contribute a lot to the topic, it’s an issue that almost everyone is aware of, and many classes with a more general focus were also able to make good contributions.

A University of Massachusetts Boston student in Brittany Peterson’s Composition 102 class, for example, was able to improve the Climate change in the United States article, while a College of DuPage student editor in Timothy Henningsen’s Research, Writing, and the Production of Knowledge class was able to improve the Effects of climate change article.

One of the participants in Joseph A. Ross’s Freshman Seminar at the University of North Carolina at Greensboro worked on the Individual action on climate change article.

All told, students from a wide range of backgrounds chose to work on articles related to climate change, demonstrating that, especially for younger people, climate change has a huge impact on their lives and their futures. By improving the information available to the public, student editors can help people understand the topic and cut through a lot of the misinformation that persists in the space.

If you’re a university instructor wondering what you can do about the climate crisis, join these instructors! Ask your students to improve Wikipedia’s coverage of climate change topics. Visit teach.wikiedu.org to get started.

Image credit: Insure Our Future, PDM-owner, via Wikimedia Commons

Dr Nowak has a Wikipedia article in several languages. Her notability is obvious because wolves are a very hot topic in many European countries. When people have opinions about wolves, it is obvious that, in a European context, you cannot dismiss Dr Nowak’s years of research.

When the notability and the quality of a Wikipedia article are assessed, it is obvious that an encyclopedic article is not best served by a list of papers Dr Nowak contributed to; the Scholia template provides more in-depth information. However, Scholia only functions when the papers are known and attributed.

In Wikidata, there were two items that needed to be merged. Three papers were linked, and an additional nine could be attributed. Additional identifiers were added; of particular significance is Google Scholar, as it knows many if not most of a scientist’s papers.

Adding missing papers is easy: you search for a paper by its DOI, and when Wikidata does not know it, you are prompted to add it using the QuickStatements tool. The best bit is that when Crossref knows the ORCiD identifier for an author, it will either identify the author or add the ORCiD identifier as a qualifier.
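As a concrete illustration of the lookup step described above, the sketch below asks Wikidata’s public SPARQL endpoint whether an item with a given DOI (property P356) already exists. It is a minimal sketch, not part of Scholia or QuickStatements themselves; the function name, User-Agent string, and example DOI are placeholders.

```python
import requests

# Illustrative sketch: check whether Wikidata already has an item for a given DOI.
# P356 is Wikidata's DOI property; DOIs are conventionally stored upper-cased there.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def find_item_by_doi(doi):
    """Return the Wikidata item URIs (if any) whose P356 value matches the DOI."""
    query = 'SELECT ?item WHERE { ?item wdt:P356 "%s" . }' % doi.upper()
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "doi-gap-check/0.1 (example script)"},  # hypothetical UA
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [row["item"]["value"] for row in bindings]

# An empty list means the paper is missing from Wikidata and could be added,
# for example via the QuickStatements workflow mentioned above.
print(find_item_by_doi("10.1234/EXAMPLE-DOI"))  # placeholder DOI
```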

Adding the Scholia template to any Wikipedia article about published scholars makes sense; the data is a "work in progress". It changes as more papers and co-authors become known. It is also an invitation to our communities and scientists to improve both the Wikipedia article and the data represented in the Scholias for any scientist.

Thanks, GerardM 

SMW at ENDORSE conference

09:41, Thursday, 11 November 2021 UTC

March 16, 2021

Semantic MediaWiki will be presented at the European Data Conference on Reference Data and Semantics (ENDORSE).

On March 18, 2021 (day 3), between 16:55 and 17:25, Bernhard Krabina will present "Linked Open Data with SMW". See this webpage for the detailed program of the conference.

This Month in GLAM: October 2021

06:21, Thursday, 11 November 2021 UTC

The Hidden Costs of Requiring Accounts

19:55, Tuesday, 09 November 2021 UTC

Should online communities require people to create accounts before participating?

This question has been a source of disagreement among people who start or manage online communities for decades. Requiring accounts makes some sense since users contributing without accounts are a common source of vandalism, harassment, and low quality content. In theory, creating an account can deter these kinds of attacks while still making it pretty quick and easy for newcomers to join. Also, an account requirement seems unlikely to affect contributors who already have accounts and are typically the source of most valuable contributions. Creating accounts might even help community members build deeper relationships and commitments to the group in ways that lead them to stick around longer and contribute more.

In a new paper published in Communication Research, I worked with Aaron Shaw to provide an answer. We analyze data from “natural experiments” that occurred when 136 wikis on Fandom.com started requiring user accounts. Although we find strong evidence that the account requirements deterred low-quality contributions, this came at a substantial (and usually hidden) cost: a much larger decrease in high-quality contributions. Surprisingly, the cost includes “lost” contributions from community members who already had accounts, but whose activity appears to have been catalyzed by the (often low-quality) contributions from those without accounts.


A version of this post was first posted on the Community Data Science blog.

The full citation for the paper is: Hill, Benjamin Mako, and Aaron Shaw. 2020. “The Hidden Costs of Requiring Accounts: Quasi-Experimental Evidence from Peer Production.” Communication Research, 48 (6): 771–95. https://doi.org/10.1177/0093650220910345.

If you do not have access to the paywalled journal, please check out this pre-print or get in touch with us. We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

Filling gaps in marine biodiversity

16:44, Tuesday, 09 November 2021 UTC

Reef-building corals rely on photosynthetic symbionts to be able to build reefs. Soft corals, which don’t rely on these symbiotic algae, are able to grow in much deeper water. In the case of Primnoa pacifica, this means that they are able to live in cold, dark waters as much as 6 km below the ocean surface. Student editors in Randi Rotjan’s Marine Biology class took a short stub and converted it into a very substantial article about this keystone species in sea-bottom ecosystems of the Gulf of Alaska.

Before students in this class started editing, the article about the genus Primnoa consisted of just a single sentence (along with an infobox): Primnoa is a genus of soft corals in the family Primnoidae. A reader who tried to look up information on this genus would have found almost no useful information and no link to Primnoa pacifica or mention of any of the other four species in the genus. Because Wikipedia ranks so high in search engine rankings, people trying to learn more about Primnoa might have ended up with less information than they would have had if they had clicked on another link. Fortunately, students in this class also expanded the Primnoa article into something that’s substantial, informative, and useful to readers.

All told, student editors in this marine biology class were able to make significant improvements to 50 Wikipedia articles including Phronima sedentaria (a species of amphipod), Elacatinus puncticulatus (a goby), Ulva australis (a species of sea lettuce), Canthigaster rostrata (a pufferfish) and Ophiocoma scolopendrina (a brittle star). Species articles on Wikipedia tend to have a fairly standard layout (which you can see in our Editing Wikipedia articles about species handout), and this makes it easy for students to understand where to slot various pieces of information into an article.

Species and genus articles remain areas with a lot of gaps on Wikipedia. By adding species articles to Wikipedia, students can help people to understand their importance in ecological contexts or to conservation. If Wikipedia has no article about a species or just has a short stub, it can be difficult for people to get a sense of the role or importance of the species. And because many people expect Wikipedia to be more or less complete, the fact that an article doesn’t exist about a topic is often interpreted to mean that the topic is unimportant. So when student editors work on species articles, they’re doing important work informing the public.

To learn more about assigning students to edit species articles, visit teach.wikiedu.org.

Image credit: q.phia, CC BY 2.0, via Wikimedia Commons

Outreachy report #26: October 2021

00:00, Tuesday, 09 November 2021 UTC

Highlights

  • We finished reviewing initial applications
  • We opened our contribution period
  • We worked really hard on our job posting and our hiring process
  • I gave a talk about remote working and Outreachy and joined a panel about the role of computing in a post-pandemic world at SECOMP 2021
  • I interviewed one LFX mentor about their experiences with the program

This month’s report will be a bit different.

Why library catalogers should learn Wikidata

17:04, Monday, 08 November 2021 UTC
Karen Snow head shot
Karen Snow
Image courtesy Karen Snow, all rights reserved.

“Linked data is one of the hot topics in the library cataloging world right now,” says Dominican University library and information science professor Karen Snow. That’s what prompted Karen to take one of Wiki Education’s recent Wikidata Institute classes.

The Wikidata Institute meets twice a week for three weeks and provides participants a detailed introduction to Wikidata, the open linked data repository.

“I felt it was essential for me to keep up-to-date on linked data projects that my students need to know about,” Karen says. “The course content made it really easy to get started making edits in Wikidata. I also appreciated the synchronous Zoom sessions twice a week to talk through Wikidata issues with others who were also novices. There was no judgment, only encouragement, which really helped me get over my initial fear!”

As part of the course, Karen edited several Wikidata items for cataloging-related people, especially those from the critical cataloging movement, such as Hope A. Olson and Sanford Berman.

“It has been fun researching some of my favorite topics!” she says. “What I like most about editing Wikidata is that I feel like I am making a positive contribution to the linked data community.”

As someone who teaches in a library and information science department, Karen feels courses like ours are important professional development. Prior to taking our course, she said it was difficult to imagine how linked data would change library cataloging processes on a purely practical level.

“How linked data will affect library catalogs is still murky to me, but working with Wikidata has helped me appreciate the practical potential of linked data,” she says. “I can imagine that other professionals would gain similar insights working with Wikidata.”

While the openness of Wikidata is a bit of a challenge — “as a library cataloger who has spent many years learning to follow standards, Wikidata’s more open approach still gives me a bit of anxiety” — Karen says the thing about Wikidata that daunts her the most is the enormity of the undertaking. Her own edits are a “small drop in the massive information bucket,” she says. Nevertheless, she’s committed to continuing to add to Wikidata. Since wrapping up the class, she’s continued to edit Wikidata items, including creating her first brand new item.

“It was terrifying and exciting!” she says.

To take a course like the one Karen took, visit wikiedu.org/wikidata.

Image credit: HalloweenHJB, CC BY 3.0, via Wikimedia Commons

The ePrivacy Regulation could potentially make communications better by setting a firm standard on how online tools can and cannot be used to profile and surveil individuals. We became directly interested in the proposed regulation when we realised that the proposed rules on how our chapters and affiliates can communicate with their supporters are ambiguous. Here is a breakdown of the problems and the ways out.

How it works now

The Regulation concerning the respect for private life and the protection of personal data in electronic communications (the full name of the Regulation on Privacy and Electronic Communications, or ePrivacy Regulation) is now subject to trilogue negotiations. We specifically look into the provisions on the scope of direct marketing. While we don’t “market” any services or products for sale to individuals, we all want to keep in touch with our supporters, and according to the ePrivacy proposal such communication falls under the definition of direct marketing. This concerns organisations in our movement that contact individuals to solicit donations or to encourage them to volunteer in various ways in support of our movement’s mission.

Currently, in several Member States, based on the ePrivacy Directive and subsequent national laws, nonprofits have the right to contact individuals they have been in touch with before, on an opt-out basis. It means that while they present a new initiative or a fundraising campaign, they need to give the contacted people the possibility to refuse to receive such information in the future.

We want to maintain this opportunity if, under the provisions of the ePrivacy Regulation, communication by nonprofits is considered direct marketing, as seems to be the case now. After all, Wikimedia chapters around Europe need to be in touch with their supporters in alignment with privacy protections.

“It is evident from the European Commission’s proposal that the legislator meant to include nonprofits in the opportunity that they already enjoy in many European jurisdictions.”

What is the problem?

In the draft, this framework is provided for commercial entities, which will be able to continue to use these electronic contact details for direct marketing of their own similar products or services only if customers are clearly and distinctly given the opportunity to object. Concretely, the proposal states that natural or legal persons may use electronic communications services for the purposes of sending direct marketing communications to end-users who are natural persons that have given their consent [art. 16(1)]. It also provides that the sender may use these electronic contact details for direct marketing of its own similar products or services only if customers are clearly and distinctly given the opportunity to object [art. 16(2)].

From the reading of these provisions, it seems that the legislator may have forgotten non-profit activities such as collecting donations, which are neither tied to information about products nor received in exchange for services. Why shouldn’t nonprofits enjoy equal rights?

Looking further into the text, the proposed recital 32 states that direct marketing refers to any form of advertising, and in addition to the offering of products and services for commercial purposes it also applies to messages sent by non-profit organisations to support the purposes of the organisation. However, the permission to use e-mail contact details as outlined in art. 16(2) itself is only further elaborated upon in recital 33 which directly refers to “existing customer relationship” and “offering of similar products or services” (emphases added). 

Recital against a recital

As we see from recitals 32 and 33, the text is ambiguous. There is a danger that the permission will not be interpreted as applying to messages sent by non-profit organisations to support the purposes of the organisation – they don’t have customers nor do they offer products or services in the commercial sense. 

The current wording results in an elevated risk of restrictive court interpretation. If a nonprofit acts on the understanding based on recital 32, somebody may challenge that decision based on the narrower scope of recital 33. This would place a considerable burden both on nonprofits in Member States and on those that operate on an EU-wide scale.

A simple clarification

The solution is to bring parity between communications on commercial relationships and those undertaken by non-profit organisations to support the purposes of the organisation. It can be done by introducing nonprofits into recital 33. Even better, they should also be mentioned in article 16(2).

It is evident from the European Commission’s proposal that the legislator meant to include nonprofits in the opportunity that they already enjoy in many European jurisdictions. Here we have a clear case where an intervention is easy and in alignment with the objectives of all parties in the trilogues. We are asking Members of the European Parliament and Member States to introduce this helpful tweak, which is in practice a quick fix.

Tech News issue #45, 2021 (November 8, 2021)

00:00, Monday, 08 November 2021 UTC
2021, week 45 (Monday 08 November 2021)

weeklyOSM 589

11:01, Sunday, 07 November 2021 UTC

26/10/2021-01/11/2021

lead picture

Ireland’s Coastline Simplified [1] | © HeikkiVesanto | map data © OpenStreetMap contributors

Mapping campaigns

  • Jinal Foflia invited contributors to participate in some interesting mapping challenges in the Philippines and Malaysia. She suggested that new or experienced mappers put their mapping hats on and join the MapRoulette challenges.
  • The mapping contest to improve OSM road data in Russia (we reported earlier) is over; results with some stats are available on this page (ru).

Mapping

  • An old forum thread concerning appropriate tags for populated places in Austria has been reanimated (de) > en . The original concern, back in 2015, was the promotion of most places in Salzburgerland to place=town. It appears that the tagging of some places is still contentious.
  • Voting on the following proposals has closed:
    • historic=creamery for an industrial building where butter, cheese or ice-cream was made from milk was approved with 12 votes for, 2 votes against and 1 abstention.
    • currency:crypto:*=yes,no, a currency key extension for cryptocurrency support was rejected with 22 votes for, 34 votes against and 2 abstentions. The rejection occurred despite the participation of cryptocurrency fans, often without substantial OSM experience.
    • boundary=border_zone was approved with 16 votes for, 0 votes against and 0 abstentions.
  • François Lacombe’s proposal on the new tag outlet=*, to map details of culverts or pipeline outlets releasing fluids, is open for voting until Thursday 11 November.
  • SK53 analysed and visualised, in a very descriptive way, the use of the tag natural=heath compared to a comprehensive habitat cover dataset for the whole of Wales.

Community

  • Mordechai23 showed, in a diary post, a collection of gifs illustrating the process he, and other mappers, used to redraw landuse, buildings, paths and other objects, both in his home city, Wrocław, and elsewhere in Poland.
  • Open Mapping Hub Asia Pacific by HOT has created a Facebook group.

Imports

  • Kai Poppe has outlined, on the wiki, preparations for an automated edit to update mapillary tags made obsolete by a software update (as we reported earlier). The plan is to update all mapillary tags from the old 22-character form to the new numeric IDs of version 4. The wiki page will be used to track current statistics and any further plan of action. There was also a MapRoulette challenge that sought to clean up all the invalid values, but it is now complete.

OpenStreetMap Foundation

  • At the Board’s request, the OSMF’s ban policy has been updated by the DWG with a new section ‘Blocks until a particular action has been taken’. This is largely documenting existing practice, but the Board felt that it was important the process was documented.

Local chapter news

  • Applications are now open for scholarships to attend the State of the Map US 2022 in Tucson, Arizona at the start of April.

Events

  • The Humanitarian Open Mapping Community Working Group by HOT invited (es) > en
    Spanish-speaking local OSM community organisers, leaders and members (new and old) to come together (es) and share their tips, tricks and challenges related to starting and sustaining local OSM communities on Monday 8 November at 16:00 UTC. Register (es) on Eventbrite if you wish to participate.
  • The theme of this year’s Genoa Science Festival (Italy) had ‘maps’ as its keyword. The Italian OSMF Chapter participated (it) > en both through a round table, in which four members of the community presented their professional activity involving OSM, and in a theatrical representation made by one of the winners of its ‘Free Theatre’ competition. Another presentation involving OSM was delivered by a team from Doctors Without Borders.

Maps

  • [1] The 30 Day Map Challenge (#30DayMapChallenge), which we reported last month, is now in full swing. Some examples of OSM-based maps from the first few days are:
    • Day 1 – Points: Bus stops in Bengaluru, by IamThiyaku. Every address in Garland County, Arkansas by Justin Myers.
    • Day 2 – Lines: Heikki Vesanto’s maps of Ireland proved very popular. They were made with progressively fewer lines, through simplification using the Douglas-Peucker algorithm (a minimal sketch of which appears after this list).
  • OpenStreetMap Uganda tweeted maps showing the difference in accessibility of older sources of drinking water compared with new solar-powered ones. The work was done in collaboration with Water Compass.
  • Mateusz Konieczny spotted that Google Maps has public transport data for the Polish capital city of Warsaw. Mikołaj Kuranowski explained that he created the feed by combining data from the Warsaw transport authority and OSM for bus route topology. The correct attribution appears at the bottom of the panel showing a planned journey.
  • Mateusz Fafinski reviewed the recent Reddit map of castles in Europe by Spatial_Overlay, especially noting regional differences in OpenStreetMap tagging practices.
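The Douglas-Peucker simplification mentioned above works by keeping the endpoints of a line and recursing only on the point farthest from the chord between them whenever that distance exceeds a tolerance. Below is a minimal, illustrative Python sketch of the classic algorithm, not the code behind the Ireland maps; the sample coordinates are made up.

```python
import math

def perpendicular_distance(pt, start, end):
    # Distance from pt to the straight line (chord) through start and end.
    if start == end:
        return math.dist(pt, start)
    (x0, y0), (x1, y1), (x2, y2) = pt, start, end
    numerator = abs((x2 - x1) * (y1 - y0) - (x1 - x0) * (y2 - y1))
    return numerator / math.hypot(x2 - x1, y2 - y1)

def douglas_peucker(points, epsilon):
    # Keep the endpoints; recurse on the farthest point if it is more than
    # epsilon away from the chord, otherwise drop everything in between.
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], start, end)
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [start, end]

# Example with made-up coordinates: a coarse tolerance keeps only a few points.
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(line, epsilon=1.0))
```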

Software

  • Wille Marcel announced a new release of OSMCha. This version provides a better user experience on smartphones and tablets.
  • The Organic Maps project is looking for translators to help localise and maintain each language version.
  • Ilya Zverev described (ru) > en the new OSMF Overpass server, which is dedicated to answering queries generated by the ↖? (Query Features) button on the main OSM website.

Programming

  • Digital artist JeongHo Park showcased an experimental visualisation of OpenStreetMap data on Twitter.
  • Brandon Liu, who has recently joined the Engineering Working Group, outlined why one cannot assume that OSM always uses Unicode.

Did you know …

  • … that it is possible to get free OSM advertising stickers and copyright banners?

Other “geo” things

  • In Atomos magazine, Ruth Hopkins surveyed the drive to restore indigenous topographic names in North America. Uldis Balodis pointed out, on Twitter, that this has resonances elsewhere, specifically in Baltic countries where they are involved in collecting Livonian place names.
  • Chronoscope World is a site for browsing historical maps.
  • Two unusual navigation errors were reported in Portugal this week. In one, the driver of a car got stuck on (pt) > en a staircase next to the Regional Legislative Assembly of Madeira. Apparently, the driver was misled by the car’s GPS, which told him to drive on.
  • In a second case, a driver crossed the D. Luís I Bridge (pt) > en in Vila Nova de Gaia, Portugal, and entered a tunnel (TomTom) intended for the Porto Metro. The driver of the vehicle, 46 years old, said he did not live in the area and that it was the car’s GPS that led him into the lane reserved for the surface metro. Apparently, the navigation system used in both cars was TomTom, which equips Renault vehicles.

Upcoming Events

Where What When Country
Черкаси Open Mapathon: Digital Cherkasy 2021-10-24 – 2021-11-20 ua
Mauguio HérOSM Mauguio : Cartopartie 2021-11-06
Paris Mairie de Paris Centre : Initiation OpenStreetMap 2021-11-06
Bogotá Distrito Capital Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-06
OSM Local Chapters & Communities Virtual Congress 2021-11-06
Crowd2Map Tanzania is 6! Join our party mapathon to learn more about our work.. 2021-11-07
Cuiabá Construcción de comunidad local en OSM: Consejos, trucos y desafíos 2021-11-08
臺北市 OSM x Wikidata Taipei #34 2021-11-08
Toronto OpenStreetMap Enthusiasts Meeting 2021-11-09
Missing Maps Artsen Zonder Grenzen Mapathon 2021-11-09
Hamburg Hamburger Mappertreffen 2021-11-09
Zürich OSM-Treffen Zürich 2021-11-11
Berlin 161. Berlin-Brandenburg OpenStreetMap Stammtisch 2021-11-11
FOSS4G State of the Map Oceania 2021 2021-11-12
Missing Maps MonarchMappers Fall 2021 Mapathon 2021-11-13
Bogotá Distrito Capital Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-13
Geography 2050 Symposium – Mapathon for an Equitable Future 2021-11-13
Crowd2Map Tanzania GeoWeek Human Right’s Day FGM Mapathon 2021-11-15
UP Tacloban YouthMappers: MAPA-Bulig, Guiding the Youth to Community Mapping 2021-11-15
Grenoble OSM Grenoble Atelier OpenStreetMap 2021-11-15
OSMF Engineering Working Group meeting 2021-11-15
Missing Maps PDX GIS Day Mapathon 2021-11-16
Lyon Lyon : Réunion 2021-11-16
Bonn 145. Treffen des OSM-Stammtisches Bonn 2021-11-16
Berlin OSM-Verkehrswende #29 (Online) 2021-11-16
Lüneburg Lüneburger Mappertreffen (online) 2021-11-16
Missing Maps Arcadis GIS Day Mapathon 2021-11-17
Missing Maps WMU GIS Day Mapathon 2021-11-17
Köln OSM-Stammtisch Köln 2021-11-17
Zürich Missing Maps Zürich November Mapathon 2021-11-17
Chambéry Missing Maps CartONG Tour de France des Mapathons – Chambéry 2021-11-18
MSF Global Mapathon 2021-11-19
State of the Map Africa 2021 2021-11-19 – 2021-11-21
Lyon EPN des Rancy : Technique de cartographie et d’édition 2021-11-20
HOT Summit 2021 2021-11-22
Bremen Bremer Mappertreffen (Online) 2021-11-22
Derby East Midlands OSM Pub Meet-up : Derby 2021-11-23
Düsseldorf Düsseldorfer OSM-Treffen (online) 2021-11-24
[Online] OpenStreetMap Foundation board of Directors – public videomeeting 2021-11-26
Brno November Brno Missing maps mapathon at Department of Geography 2021-11-26
Amsterdam OSM Nederland maandelijkse bijeenkomst (online) 2021-11-27
HOTOSM Training Webinar Series: Beginner JOSM 2021-11-27
長岡京市 京都!街歩き!マッピングパーティ:第27回 元伊勢三社 2021-11-27

Note:
If you would like to see your event here, please put it into the OSM calendar. Only data which is there will appear in weeklyOSM.

This weeklyOSM was produced by Nordpfeil, NunoMASAzevedo, PierZen, SK53, Strubbl, TheSwavu, arnalielsewhere, derFred.

Production Excellence #37: October 2021

02:05, Friday, 05 November 2021 UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

There were 4 documented incidents last month. This is about average compared to the past five years (per the Incident graphs).

2021-10-08 network provider
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. It was caused by a routing problem with one of several redundant network providers.

2021-10-22 eqiad networking
Impact: For ~40 minutes clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in ~10 minutes.

2021-10-25 s3 db replica
Impact: For ~30min MediaWiki backends were slower than usual. For ~12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.

2021-10-29 graphite
Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some were recovered via the redundant server, but others were lost because the redundant server had itself been upgraded since then and had lost some data in a similar fashion.

Remember to review and schedule Incident Follow-up work in Phabricator; these are the preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.


Trends
Norwegian blue 🐦

298 bugs were up on the board.
We solved 20 of those over the past thirty days.

How many might now be left unexplored?
We also added new bugs to our database.

Half those bugs are pining for their fjord.
The other 23 carry on, with their dossiers.

All in all, 301 bugs up on the board.

In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.

Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Issues carried over from recent months:

Apr 2021 9 of 42 issues left.
May 2021 16 of 54 issues left.
Jun 2021 9 of 26 issues left.
Jul 2021 12 of 31 issues left.
Aug 2021 12 of 46 issues left.
Sep 2021 11 of 24 issues left.
Oct 2021 23 of 49 new issues are carried forward.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

“Hey, I actually wrote the wiki article on that!”

15:49, Thursday, 04 November 2021 UTC

Pamela Kalas is an Associate Professor of Teaching in the Department of Zoology at the University of British Columbia (UBC), Vancouver. She would like to acknowledge the two UBC staff who supported the assignment from behind the scenes: Will Engle (Strategist, Open Education Initiatives) and Ria Namba (Open Educational Resources Developer), as well as the Wiki Education team. Evan Warner is a PhD candidate in Genome Science and Technology and a Graduate Teaching Assistant at UBC.

head shot of Pamela Kalas
Pamela Kalas.
Image courtesy Pamela Kalas, all rights reserved.

We first incorporated a collaborative Wikipedia assignment into our medium-size, upper-level biology class two years ago. The intention was to give students an opportunity to learn something about a relevant topic—and practice synthesizing this information—by expanding a Wikipedia stub of their own choosing. Seeing the students’ enthusiasm for this assignment and the excellent work that they produced (about 22,000 words and 218 citations contributed), we decided to repeat it in the 2020-21 academic year, and take a more deliberate approach in documenting students’ experiences and their perceived learning.

A thematic analysis of their final reflection assignments confirmed our anecdotal observations: students reported an overwhelmingly positive experience filled with enthusiasm, excitement, and pride about making an authentic contribution to such a major information source that they themselves regularly use. In their own words:

“It is really satisfying to go look at the page and see all my hard work for anyone to use and benefit from! I am really proud of the work I put in and hope that students will be able to use this information to understand DNA constructs better. It is super exciting and I hope that the information stays up on the page.”

“I feel really good about it. As mentioned before, it feels good to be contributing to a website that I benefitted lots from in my undergrad.”

“I feel that it’s one of the most meaningful projects I’ve done for school work! Often I write papers and complete projects just to present in front of class. With this Wikipedia page, I actually feel like I’m contributing back to the scientific community and helping others learn about this topic. Hopefully Wikipedia projects like this one will be implicated in more classes in the future.”

Although the course has been generally well-liked for many years, this is certainly the first time that an individual assignment has received such high praise from students!

Interestingly, when prompted about sharing what they learned about Wikipedia, most students expressed interest or surprise in discovering how the editing process works, and in many cases this altered their views about Wikipedia and its credibility as an information source. And it was not only the students becoming more educated about Wikipedia; there were some ‘a-ha’ moments for the teaching team, too! Considering how ubiquitous Wikipedia has become in our lives, this was a wonderful opportunity to all learn together about how its articles come to be.

word cloud
A word cloud of students’ responses to a reflection prompt asking them how they felt about having completed the assignment and having it “out in the world”.
Image courtesy Pamela Kalas, all rights reserved.

Students also identified several skills that the Wikipedia assignment helped them improve, including: researching information and evaluating sources, knowledge translation, and teamwork/collaboration — skills that not only align with many competencies deemed essential for biology graduates, but are also among the most relevant and transferable trans-disciplinary skills.

While a number of these skills can be developed and practiced in standard university assignments, the unique element of having a real (and large!) audience seemed to enhance students’ level of care and investment in the assignment, as illustrated by this comment from a course alumna:

There is a higher burden of responsibility when producing something that will go out into the world versus something done just for a grade. I cared a lot more about the accuracy in the Wikipedia assignment than I did about my other assignments, because a mistake I made could potentially confuse someone else.”

It is often challenging for instructors in STEM disciplines to convey the importance of transferable soft skills, and have students take them as seriously as they should be. By eliciting enthusiasm and a sense of purpose, Wikipedia assignments can serve as excellent tools to engage students with soft transferable skills in a deep and meaningful way. We would recommend this type of assignment to any colleagues!

Image credit: Xicotencatl, CC BY-SA 4.0, via Wikimedia Commons

Q&A about doing a PhD with my research group

07:11, Wednesday, 03 2021 November UTC

Ever considered doing research about online communities, free culture/software, and peer production full time? It’s PhD admission season and my research group—the Community Data Science Collective—is doing an open-to-anyone Q&A about PhD admissions this Friday, November 5th. We’ve got room in the session and it’s not too late to sign up to join us!

The session will be a good opportunity to hear from and talk to faculty recruiting students to our various programs at the University of Washington, Purdue, and Northwestern and to talk with current and previous students in the group.

I am hoping to admit at least one new PhD advisee to the Department of Communication at UW this year (maybe more) and am currently co-advising (and/or have previously co-advised) students in UW’s Allen School of Computer Science & Engineering, Department of Human-Centered Design & Engineering, and Information School.

One thing to keep in mind is that my primary/home department—Communication—has a deadline for PhD applications of November 15th this year.

The registration deadline for the Q&A session is listed as today but we’ll do what we can to sneak you in even if you register late. That said, please do register ASAP so we can get you the link to the session!

Benchmarking MediaWiki with PHPBench

17:42, Tuesday, 02 2021 November UTC

This post gives a quick introduction to a benchmarking tool, phpbench, ready for you to experiment with in core and skins/extensions.[1]

What is phpbench?

From their documentation:

PHPBench is a benchmark runner for PHP analogous to PHPUnit but for performance rather than correctness.

In other words, while a PHPUnit test will tell you if your code behaves a certain way given a certain set of inputs, a PHPBench benchmark only cares how long that same piece of code takes to execute.

The tooling and boilerplate will be familiar to you if you've used PHPUnit. There's a command-line runner at vendor/bin/phpbench, benchmarks are discoverable by default in tests/Benchmark, a configuration file (phpbench.json) allows for setting defaults across all benchmarks, and the benchmark classes and tests look pretty similar to PHPUnit tests.

Here's an example test for the Html::openElement() function:

namespace MediaWiki\Tests\Benchmark;

class HtmlBench {

        /**
         * @Assert("mode(variant.time.avg) < 85 microseconds +/- 10%")
         */
        public function benchHtmlOpenElement() {
                \Html::openElement( 'a', [ 'class' => 'foo' ] );
        }
}

So, taking it line by line:

  • class HtmlBench (placed in tests/Benchmark/includes/HtmlBench.php) – the class where you can define the benchmarks for methods in a class. It would make sense to create a single benchmark class for a single class under test, just like with PHPUnit.
  • public function benchHtmlOpenElement() {} – method names that begin with bench will be executed by phpbench; other methods can be used for set-up / teardown work. The contents of the method are benchmarked, so any set-up / teardown work should be done elsewhere.
  • @Assert("mode(variant.time.avg) < 85 microseconds +/- 10%") – we define a phpbench assertion that the average execution time will be less than 85 microseconds, with a tolerance of +/- 10%.

If we run the test with composer phpbench, we will see that the test passes. One thing to be careful with, though, is adding assertions that are too strict – you would not want a patch to fail CI because the assertion on execution time was not flexible enough (more on this later on).

Measuring performance while developing

One neat feature in PHPBench is the ability to tag current results and compare with another run. Looking at the HtmlBench benchmark test from above, for example, we can compare the work done in rMW5deb6a2a4546: Html::openElement() micro-optimisations to get before and after comparisons of the performance changes.

Here's a benchmark of e82c5e52d50a9afd67045f984dc3fb84e2daef44, the commit before the performance improvements added to Html::openElement() in rMW5deb6a2a4546: Html::openElement() micro-optimisations.

❯ git checkout -b html-before-optimizations e82c5e52d50a9afd67045f984dc3fb84e2daef44 # get the old HTML::openElement code before optimizations
❯ git review -x 727429 # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --tag=original

And the output [2]:

Note that we've used --tag=original to store the results. Now we can check out the newer code, and use --ref=original to compare with the baseline:

❯ git checkout -b html-after-optimizations 5deb6a2a4546318d1fa94ad8c3fa54e9eb8fc67c # get the new HTML::openElement code with optimizations
❯ git review -x 727429 # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --ref=original --report=aggregate

And the output [3]:

We can see that the execution time roughly halved, from 18 microseconds to 8 microseconds. (For understanding the other columns in the report, it's best to read through the Quick Start guide for phpbench.) PHPBench can also provide an error exit code if the performance decreased. One way that PHPBench might fit into our testing stack would be to have a job similar to Fresnel, where a non-voting comment on a patch alerts developers whether the PHPBench performance decreased in the patch.

Testing with extensions

A slightly more complex example is available in GrowthExperiments (patch). That patch makes use of setUp/tearDown methods to prepopulate the database entries needed for the code being benchmarked:

/**
 * @BeforeMethods("setUpLinkRecommendation")
 * @AfterMethods("tearDownLinkRecommendation")
 * @Assert("mode(variant.time.avg) < 20000 microseconds +/- 10%")
 */
public function benchFilter() {
        $this->linkRecommendationFilter->filter( $this->tasks );
}

The setUpLinkRecommendation and tearDownLinkRecommendation methods have access to MediaWikiServices, and you can generally do the same kinds of things you'd do in an integration test to set up and tear down the environment. This test is towards the opposite end of the spectrum from the core test discussed above, which looks at Html::openElement(); here, the goal is to look at a higher-level function that involves database queries and interacting with MediaWiki services.
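To illustrate, here is a minimal sketch of what such a setup/teardown pair might look like. The class name, fixture handling, and service lookup below are hypothetical placeholders, not the actual GrowthExperiments code:

namespace GrowthExperiments\Tests\Benchmark;

use MediaWiki\MediaWikiServices;

class LinkRecommendationFilterBench {

        /** @var array Hypothetical fixture tasks consumed by the benchmarked call. */
        private $tasks = [];

        /**
         * Referenced from @BeforeMethods: runs before the benchmark subject.
         * Services are available here, just like in an integration test.
         */
        public function setUpLinkRecommendation() {
                $services = MediaWikiServices::getInstance();
                // Insert whatever database rows the benchmarked code expects,
                // e.g. via a (hypothetical) store service obtained from $services,
                // and remember them in $this->tasks for cleanup.
        }

        /**
         * Referenced from @AfterMethods: removes the fixtures created above.
         */
        public function tearDownLinkRecommendation() {
                // Delete the rows created in setUpLinkRecommendation().
                $this->tasks = [];
        }
}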

What's next

You can experiment with the tooling and see if it is useful to you. Some open questions:

  • do we want to use phpbench? or are the scripts in maintenance/benchmarks already sufficient for our benchmarking needs?
  • we already have benchmarking tools in maintenance/benchmarks that extend a Benchmarker class; would it make sense to convert these to use phpbench?
  • what are sensible defaults for "revs" and "iterations" as well as retry thresholds?
  • do we want to run phpbench assertions in CI?
    • if yes, do we want assertions using absolute times (e.g. "this function should take less than 20 ms") or relative assertions ("patch code is within +/- 10% of old code")?
    • if yes, do we want to aggregate reports over time, so we can see trends for the code we benchmark?
    • should we disable phpbench as part of the standard set of tests run by Quibble, and only have it run as a non-voting job like Fresnel?

Looking forward to your feedback! [4]


[1] thank you, @hashar, for working with me to include this in Quibble and roll out to CI to help with evaluation!

[2]

> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--tag=original'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌

\MediaWiki\Tests\Benchmark\HtmlBench

    benchHtmlOpenElement....................R1 I1 ✔ Mo18.514μs (±1.94%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0
Storing results ... OK
Run: 1346543289c75373e513cc3b11fbf5215d8fb6d0
+-----------+----------------------+-----+------+-----+----------+----------+--------+
| benchmark | subject              | set | revs | its | mem_peak | mode     | rstdev |
+-----------+----------------------+-----+------+-----+----------+----------+--------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb  | 18.514μs | ±1.94% |
+-----------+----------------------+-----+------+-----+----------+----------+--------+

[3]

> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--ref=original' '--report=aggregate'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌
comparing [actual vs. original]

\MediaWiki\Tests\Benchmark\HtmlBench

    benchHtmlOpenElement....................R5 I4 ✔ [Mo8.194μs vs. Mo18.514μs] -55.74% (±0.50%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| benchmark | subject              | set | revs | its | mem_peak      | mode            | rstdev         |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb 0.00% | 8.194μs -55.74% | ±0.50% -74.03% |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+

[4] Thanks to @zeljkofilipin for reviewing a draft of this post.

Grappling with the history of contested monuments

15:57, Tuesday, 02 2021 November UTC

In the aftermath of the 2020 George Floyd protests and the 2017 Unite the Right rally, the question of monuments and their meaning has come to the forefront. Students in Oliver Wunsch’s Contested Monuments class worked on improving a number of Wikipedia articles about monuments, ranging from the Statue of Jefferson Davis at the U.S. Capitol, to the Gay Liberation Monument in New York, to the Stadio dei Marmi in Rome.

George Segal’s sculpture, Gay Liberation, was commissioned as a tribute to the 1969 Stonewall riots. Two castings of the sculpture were made, with one originally intended for Christopher Park in Greenwich Village, New York and the other for Los Angeles. Opposition in New York and a failure to gain approval in Los Angeles resulted in one casting being installed at Stanford University and the other in Madison, Wisconsin, before the latter was eventually relocated to its originally intended location in New York.

The monument has been controversial and subject to vandalism, both because it depicts same-sex couples, and because the depiction has been described as whitewashing Stonewall. Student editors in the class were able to expand the article in a way that brings the history and context of the sculpture into focus more clearly, and helps readers understand the relationship between the monument, what it was meant to depict, and what this depiction means now.

Jefferson Davis was president of the Confederate States of America, and the presence of a statue honoring him in the US Capitol has been controversial since its installation there in 1931; bills for its removal have been introduced in Congress in 2017, 2020, and 2021. But before a student in this class started working on the article, it was just a three-sentence stub with only the most basic information. This student editor was able to turn it into a substantial, well-referenced article that is currently undergoing a review process with the aim of classifying it as a Good Article, one of Wikipedia’s best works.

The Stadio dei Marmi is a stadium in Rome, originally built as part of the Foro Mussolini (now the Foro Italico) by Italy’s Fascist government in the 1920s. A student in this class converted a short, stubby article into its current form by adding information about the stadium’s design, the significance of its monumental architecture and decor, and its relationship with Italian fascism. They also added information about its use in the 1960 Olympics and afterwards, and how that continued use has been seen as a symbol of Italy’s failure to “come to terms with its role in World War II”.

The Schwerbelastungskörper is a large concrete cylinder in Berlin that was built as a test structure by Albert Speer in preparation for the construction of a triumphal arch honoring the victories of Nazi Germany. It is one of the few remnants of Hitler’s plan to remake the city and is a protected monument as the “only tangible relic of National Socialist urban planning”. By contextualizing the structure in terms of Nazi plans to remake Berlin, and by describing its construction and public perception, the student editor who worked on this article added a lot to readers’ ability to understand the structure and its significance.

Other students in the class improved articles on a range of monuments in the US, Italy, and Germany, including the recently removed Robert E. Lee Monument in Richmond, Virginia, the Pioneer Monument in San Francisco, and the Lenin Monument in Berlin.

As the United States and much of the world struggles to reassess relationships with monuments like these, the contributions of this class help readers contextualize what’s currently happening. They are not only filling content gaps, they’re also filling gaps in terms of the information that people need.

To learn more about how to assign students to contribute to Wikipedia as a class assignment, visit teach.wikiedu.org.

Image credit: Sol Octobris, CC BY-SA 4.0, via Wikimedia Commons

Writing Skills for Engineering Managers

13:28, Tuesday, 02 2021 November UTC

Managers at every level are prisoners of the notion that a simple style reflects a simple mind. Actually a simple style is the result of hard work and hard thinking

– William Zinsser, On Writing Well

Every software engineering manager’s most precious resource is time. But you wouldn’t know it from reading our emails: bloated screeds of business buzzwords we expect our engineers to decipher.

If you lead a team and you value their time, then demonstrate it through lean and confident writing. Below you will find guidelines to help hone your writing skills.

1️⃣ Have a point

Make sure you have something to say before you write.

Corporate-speak will write your email for you unless you remain vigilant.¹ Jargon lulls the writer into the false belief that they’ve said something precise, while your reader may wonder whether you’ve said anything at all.

Be direct and start your draft with the purpose of the email. Writing That Works by Roman and Raphaelson offers this advice: try writing what you want to say as if you’re talking face-to-face. Don’t worry if your first draft sounds too casual. You can always wrap your plain language in the requisite business shibboleths later.

2️⃣ Keep it short

Write as if you were dying. At the same time, assume you write for an audience consisting solely of terminal patients.

– Annie Dillard, The Writing Life

Everything you write is too long.

People reading your emails aren’t fans of your writing—they’re trying to get through their email.

When you’ve finished writing your email, use Stephen King’s equation from On Writing: “2nd Draft = 1st Draft – 10%.” Your writing will be more effective.

3️⃣ Make it easy

All visually displayed text involves typography

– Matthew Butterick, Butterick’s Practical Typography

Appropriate typography and thoughtful information architecture make your email easier to parse. It’s not enough for your email to be easy to read; it’s got to look easy to read.

Researchers at the University of Michigan gave student test subjects two identical sets of instructions: one in a hard-to-read font and one in an easy-to-read font.

Despite the steps being identical, the students’ predictions of the difficulty of the tasks differed. Students believed the less legible instructions described a more daunting task. The authors’ conclusion is the title of their study: “If It’s Hard to Read, It’s Hard to Do.”

Break up long text with headings. Keep your paragraphs short. Use bullet points and short sentences to make your text look less intimidating and easier to read.

A squint test of encyclopedia article vs. a popular article.

Squint test: Compare the shape of the first three paragraphs of a popular article about Barack Obama with the first three paragraphs of Barack Obama's Wikipedia entry.

Professional writers know to make it easy for their readers—the first paragraph on the left is a single sentence.

4️⃣ Make it ✨pretty✨

I know. Emojis 🙄.

In her book Because Internet, author and linguist Gretchen McCulloch posits that people embraced emojis because they add body language to our writing. The first two sentences of this section are an example of how emojis succinctly convey emotion.

Emojis help people process the shape of your text at a glance. I use emojis to lead the eye through a text. Emojis are precognitive signposts you can use to reinforce the meaning of your writing.

Emojis can’t substitute for substance, but they can make your text easier on your readers.

I ❤️ the judicious use of emojis.

Further Reading

Software

  • Hemingway Editor – In-browser editor that points out problems like overuse of adverbs and passive voice.
  • Grammarly – I pay $60 quarterly for this. I don’t use the browser extension since that seems likely to send every plain text field I fill in my browser to their servers. I don’t trust this service with my data, but I do like this service.

  1. Orwell said in “Politics and the English Language”: jargon will “construct your sentences for you—even think your thoughts for you, to a certain extent.”↩︎

mwcli CI in Wikimedia GitLab (docker in docker)

22:11, Monday, 01 2021 November UTC

mwcli is a golang CLI tool that I have been working on over the past year to replace the mediawiki-docker-dev development environment that I accidentally created a few years back (among other things). I didn’t start the CLI, but I did write the mediawiki-docker-dev-like functionality.

At some point in the development journey it became clear that one of the ways to set the new and old environments apart would be through some rigorous CI and testing.

This started with CI running on a Qemu node as part of the shared Wikimedia Jenkins CI infrastructure that is hooked up to Gerrit, where the code was being developed. This ended up being quite slow, and involved lots of manual steps.

A next iteration saw the majority of development take place in my own fork on Github, making use of Github Actions. Changes would then be copied over to Gerrit for final review once CI tests had run.

And finally the repository moved to the new Wikimedia GitLab instance (work in progress), where I could make use of GitLab Runners powered by a machine in Wikimedia Cloud VPS.

Screenshot of GitLab pipelines in action for the mwcli project

Overview

I have a dedicated Cloud VPS project for the machines used as runners for the mwcli project (T294283). Currently 2 runners are configured, each with 4 cores, 8GB memory and a 20GB disk, running Debian Buster.

The runners make use of Docker in Docker, which is one of the documented ways to use the docker executor per the GitLab documentation. I haven’t done a full review of the possible security implications of this approach yet, but it should be noted that the virtual machines only run CI for this one project, and only members of the project have the ability to run the CI.

Installation

You need docker installed. You can follow the docker install guide, or do something like this…


sudo apt-get update
sudo apt-get remove docker docker-engine docker.io containerd runc
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
    "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian \
    $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

And you need code for GitLab runners installed. There is an install guide, and it looks something like this…


curl -LJO "https://gitlab-runner-downloads.s3.amazonaws.com/latest/deb/gitlab-runner_amd64.deb"
sudo dpkg -i gitlab-runner_amd64.deb
rm gitlab-runner_amd64.deb

Registration

Once everything is installed, you are ready to register the runner, and connect it to the GitLab instance and project.

Head to Settings >> CI/CD on your project. Under the “Runners” section you should find a “registration token” which you’ll need to use on the runner.

This token can be used with the gitlab-runner register command, along with a user provided name and some other options such as --limit which limits the number of jobs that the runner can run at once.


sudo gitlab-runner register -n \
    --url https://gitlab.wikimedia.org/ \
    --registration-token xxxxxxxxxxxxxxxxxxxxxxx \
    --executor docker \
    --limit 3 \
    --name "gitlab-runner-addshore-1012-docker-01" \
    --docker-image "docker:19.03.15" \
    --docker-privileged \
    --docker-volumes "/certs/client"

You should now see the runner appear in the GitLab UI.

Further Configuration

Concurrency

Although we specified a limit of 3 jobs for the runner when registering it, this is only per-runner configuration. A single node can have multiple runners of multiple types (or of the same type), so there is also a node-level (global) concurrency setting that needs to be changed.


sudo sed -i 's/^concurrent =.*/concurrent = 3/' "/etc/gitlab-runner/config.toml"
sudo systemctl restart gitlab-runner

Docker mirror

If your CI will make use of images from Docker Hub or any other registry that imposes limits, or if you want to speed up CI, you may want to run and register a local docker mirror.

Again, you can follow a blog post for setup here, or do something like this…

Create the mirror in a container…


sudo docker run -d -p 6000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    --restart always \
    --name registry registry:2

Get the IP address of the host…


hostname --ip-address

Add the mirror to the Docker daemon config…


echo '{"registry-mirrors": ["http://<CUSTOM IP>:<PORT>"]}' | sudo tee /etc/docker/daemon.json
sudo service docker restart

And also register it in the runner config, which you should find at /etc/gitlab-runner/config.toml (see these docs for why this is also needed)


[[runners.docker.services]]
  name = "docker:19.03.15-dind"
  command = ["--registry-mirror", "http://<CUSTOM IP>:<PORT>"]

Finally restart the runner one last time…


sudo systemctl restart gitlab-runner

Example CI

You could then configure some very basic jobs using the GitLab CI configuration file for the project.


image: docker:19.03.15

variables:
  DOCKER_TLS_CERTDIR: "/certs"

services:
  - name: docker:19.03.15-dind

docker_system_info:
  only:
    - web
  stage: check
  script:
    - docker system info

docker_hub_quota_check:
  only:
    - web
  stage: check
  image: alpine:latest
  before_script:
    - apk add curl jq
  script:
    - |
      TOKEN=$(curl "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq --raw-output .token) && curl --head --header "Authorization: Bearer $TOKEN" "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" 2>&1

Gotchas & Reading

  • The Wikimedia GitLab instance is still currently a work in progress.
  • If using images from Docker Hub the limit can be annoying. As well as a mirror there is also documentation for providing a key for Docker Hub or another registry. (T288377)
  • Depending on your CI, 20GB of disk can fill up quite quickly. While running at a concurrency of 4 I would occasionally hit disk limitations.
  • When people open merge requests from forks, CI will not and cannot run using the project runners.
  • Default caching is done per project, per runner, per job / concurrency slot. This can lead to a lot of duplication unless a shared cache is used!

The post mwcli CI in Wikimedia GitLab (docker in docker) appeared first on Addshore.

Generating Rust types for MediaWiki API responses

06:50, Monday, 01 2021 November UTC

I just released version 0.2.0 of the mwapi_responses crate. It automatically generates Rust types based on the query parameters specified for use in MediaWiki API requests. If you're not familiar with the MediaWiki API, I suggest you play around with the API sandbox. It is highly dynamic, with the user specifying query parameters and values for each property they want returned.

For example, if you wanted a page's categories, you'd use action=query&prop=categories&titles=[...]. If you just wanted basic page metadata you'd use prop=info. For information about revisions, like who made specific edits, you'd use prop=revisions. And so on, for all the different types of metadata. For each property module, you can further filter what properties you want. If under info, you wanted the URL to the page, you'd use inprop=url. If you wanted to know the user who created the revision, you'd use rvprop=user. For the most part, each field in the response can be toggled on or off using various prop parameters. These parameters can be combined, so you can just get the exact data that your use-case needs, nothing extra.

For duck-typed languages like Python, this is pretty convenient. You know what fields you've requested, so that's all you access. But in Rust, it means you either need to type out the entire response struct for each API query you make, or just rely on the dynamic nature of serde_json::Value, which means you're losing out on the fantastic type system that Rust offers.

But what I've been working on in mwapi_responses is a third option: having a Rust macro generate the response structs based on the specified query parameters. Here's an example from the documentation:

use mwapi_responses::prelude::*;
#[query(
    prop="info|revisions",
    inprop="url",
    rvprop="ids"
)]
struct Response;

This expands to roughly:

#[derive(Debug, Clone, serde::Deserialize)]
pub struct Response {
    #[serde(default)]
    pub batchcomplete: bool,
    #[serde(rename = "continue")]
    #[serde(default)]
    pub continue_: HashMap<String, String>,
    pub query: ResponseBody,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseBody {
    pub pages: Vec<ResponseItem>,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseItem {
    pub canonicalurl: String,
    pub contentmodel: String,
    pub editurl: String,
    pub fullurl: String,
    pub lastrevid: Option<u32>,
    pub length: Option<u32>,
    #[serde(default)]
    pub missing: bool,
    #[serde(default)]
    pub new: bool,
    pub ns: i32,
    pub pageid: Option<u32>,
    pub pagelanguage: String,
    pub pagelanguagedir: String,
    pub pagelanguagehtmlcode: String,
    #[serde(default)]
    pub redirect: bool,
    pub title: String,
    pub touched: Option<String>,
    #[serde(default)]
    pub revisions: Vec<ResponseItemrevisions>,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseItemrevisions {
    pub parentid: u32,
    pub revid: u32,
}

It would be a huge pain to have to write that out by hand every time, so having the macro do it is really convenient.

The crate is powered by JSON metadata files for each API module, specifying the response fields and which parameters need to be enabled to have them show up in the output. And there are some uh, creative methods on how to represent Rust types in JSON so they can be spit out by the macro. So far I've been writing the JSON files by hand by testing each parameter out manually and then reading the MediaWiki API source code. I suspect it's possible to automatically generate them, but I haven't gotten around to that yet.

Using enums?

So far the goal has been to faithfully represent the API output and directly map it to Rust types. This was my original goal and I think a worthwhile one because it makes it easy to figure out what the macro is doing. It's not really convenient to dump the structs the macro creates (you need a tool like cargo-expand), but if you can see the API output, you know that the macro is generating the exact same thing, but using Rust types.

There's a big downside to this, which is mostly that we're not able to take full advantage of the Rust type system. In the example above, lastrevid, length, pageid and touched are all typed using Option<T>, because if the page is missing, then those fields will be absent. But that means we need to .unwrap() on every page after checking the value of the missing property. It would be much better if we had ResponseItem split into two using an enum, one for missing pages and the other for those that exist.

enum ResponseItem {
    Missing(ResponseItemMissing),
    Exists(ResponseItemExists)
}

This would also be useful for properties like rvprop=user|userid. Currently setting that property results in something like:

pub struct ResponseItemrevisions {
    #[serde(default)]
    pub anon: bool,
    pub user: Option<String>,
    #[serde(default)]
    pub userhidden: bool,
    pub userid: Option<u32>,
}

Again, Option<T> is being used for the case where the user is hidden, and those properties aren't available. Instead we could have something like:

enum RevisionUser {
    Hidden,
    Visible { username: String, id: u32 }   
}

(Note that anon can be figured out by looking at id == 0.) Again, this is much more convenient than the faithful representation of JSON.

I'm currently assuming these kinds of enums can be made to work with serde, or maybe we'll need some layer on top of that. I'm also still not sure whether we want to lose the faithful representation aspect of this.

Next steps

The main next step is to get this crate used in some real-world projects and see how people end up using it and what the awkward/bad parts are. One part I've found difficult so far is that these types are literally just types; there's no integration with any API library, so it's all up to the user to figure that out. There's also currently no logic to help with continuing queries; I might look into adding some kind of merge() function to help with that in the future.

I have some very very proof-of-concept integration code with my mwbot project, more on that to come in a future blog post.

Contributions are welcome in all forms! For questions/discussion, feel free to join #wikimedia-rust:libera.chat (via Matrix or IRC) or use the project's issue tracker.

To finish on a more personal note, this is easily the most complex Rust code I've written so far. proc-macros are super powerful, but it's also super easy to get lost writing code that just writes more code. It feels like it's been through at least 3 or 4 rounds of complex refactoring, each taking advantage of new Rust things I learn, generally making the code better and more robust. The code coverage metrics are off because the code is split between two crates; it is actually 100% covered by integration and unit tests.

Tech News issue #44, 2021 (November 1, 2021)

00:00, Monday, 01 2021 November UTC
2021, week 44 (Monday 01 November 2021)

weeklyOSM 588

11:04, Sunday, 31 2021 October UTC

19/10/2021-25/10/2021

lead picture

OpenData and OpenStreetMap [1] © Pascal Neis © OpenStreetMap contributors

Mapping

  • Pascal Neis tweeted that his ‘Unmapped Places of OSM’ has been updated. A total of 331,000 places have been identified this year, compared with 339,000 by the same time last year.
  • Christian Rogel, who recently bought an electrically assisted tricycle, suggested (fr) > en that he will be able to help provide advice on tags which would help tricyclists make use of existing cycle infrastructure.
  • DENelson83 started converting individually named bodies of water on the Atlantic coast of Canada to relations. He set out his rationale, but has already received some negative feedback.
  • Alec Schulze-Eckel reported the successful end of the 25th mapathon project that saw the formation of a small but solid group of German Red Cross (GRC) volunteers over the course of the project.
  • MKnight shared (de) > en his experience of guardrail mapping.
  • CzerwonyPazdzierz asked (pl) > de , in the forum, about the places that are best mapped on OSM. Besides the individual suggestions in the thread about it, there is also a map showing the ratio of OSM objects to population density.
  • A request for comments has been made for artwork_type=maypole, an additional tag to man_made=maypole for maypoles with artistic merit.

Community

  • China’s OpenStreetMap local community, OSMChina, has released (zhcn) > en their website (zhcn) > en and offer a tile service based on OSM-Carto style, but at the moment the site only contains data for mainland China.
  • Mateusz Konieczny asked about the status of HOT Tasking Manager (TM) projects, which triggered another discussion about mapping quality and the coordination of TM organised editing projects. Arnalie Vicario from HOT reacted by sharing information about the HOT Data Quality and Assurance working groups and invited contributors to participate in the next meeting on 4 November.
  • OSMF Board member Amanda McCann provided her regular monthly activity update for September.
  • BryDee reported that the HOT Open Mapping Hub – Asia Pacific (OMH-AP) have signed a letter of intent with the Asia Pacific Region branch (WSB/APR) of the secretariat about working together with the World Organization of the Scout Movement (WOSM). The initial agreement is for three years. BryDee sees a lot of interest from the Scouts and more importantly 30 million potential OSM mappers.
  • GOwin reported that OSMaPaaralan tasks are complete and more than 39,129 schools have been mapped and verified in 2 years and 22 days.
  • Someone on the #osm IRC channel noticed that if anyone mentions ‘OpenStreetMap’ in a YouTube video comment, the post is immediately blocked. Simon Poole tweeted about it here.
  • Lejun is looking for an app to map building facades. He describes his desired criteria and evaluates possible tools in terms of their suitability for this purpose.
  • martien-vdg wrote about his ideas on how to improve the mapping quality of new mappers.
  • PlayzinhoAgro suggested a re-imagining of the OSM user profile page.

OpenStreetMap Foundation

  • Dorothea Kazazi has announced that nominations for the OSMF Board election have closed. There are six candidates for four board seats:
    • Guillaume Rischard, USA
    • Michal Migurski, USA
    • Amanda McCann, Germany
    • Mikel Maron, USA
    • Roland Olbricht, Germany
    • Bryan Housel, USA.

    Members can continue adding their questions and information about the candidates is available on the Candidate 2021 wiki page.

    Voting information and proposed resolutions are detailed on the Annual General Meeting 2021 wiki page.

Events

  • Never been to Perth, Australia? Just join The FOSS4G SotM Oceania Perth Hub conference with an OSM workshop all day on Saturday 13 November.
  • State of the Map Africa 2021 is happening online 19 to 21 November. Registration is open and Geoffrey Kateregga has invited applications for internet data scholarships to enable remote participation for those who need it. The application form is online.
  • The LCCWG extended an invitation to leaders and members of OpenStreetMap local communities to attend the 2021 Local Chapters and Communities Congress, which will be held virtually on Saturday 6 November. Representatives of other OSM user groups are also welcome.

OSM research

  • John Bryant, of the Overseas Development Institute, has released a working paper Digital mapping and inclusion in humanitarian response. He notes that maps can contribute to ‘distancing’ and remote management of responses. Some humanitarians interviewed also observed that a map frequently becomes the end product rather than the ‘beginning of a conversation’.
  • Hao Li introduced a paper that proposes an automatic surface water mapping workflow by training a deep residual neural network (ResNet) based on OpenStreetMap data and Sentinel-2 multispectral data and using the Simple Non-Iterative Clustering (SNIC) superpixel algorithm for generating object-based training samples. As part of the paper, a case study was conducted that provides comprehensive insights into how to best explore the synergy of volunteered geographic information (VGI) and machine learning (ML) of Earth observation (EO) data in a large-scale surface water mapping task.

Maps

  • The HeiGIT ohsome team is extracting full history OSM data of the volcanic eruption on La Palma (Spain) and its constantly growing lava field to help keep OSM up to date. Temporally explore the data, including every single map edit, here.
  • Web developer and artist Hans Hack has created (de) a map that displays the ‘War Traces in Berlin Street Names’ on an OSM base map. The historical eras (Prussia, Empire …) are colour-coded and it can be filtered by a wide variety of viewpoints.

Open Data

Software

  • Bryan Housel announced the release of RapiD 1.1.8, which includes performance improvements and bugfixes.
  • Flatmap is a tool that generates Mapbox Vector Tiles from geographic data sources like OpenStreetMap. Vector tiles contain raw point, line, and polygon geometries that clients such as MapLibre can use to render custom maps (demo).

Programming

  • The Overpass API treats closed ways not only as ways but also as areas now.

Did you know …

  • … the key winter_service=* is available? The key can be used to map paths or areas that do not have winter service, as well as those which are regularly cleared of snow and ice.
  • … the NUNAV Navigation app for cars? It combines OpenStreetMap with real-time traffic data in Germany, Austria and Switzerland (for now) and is available in the app stores in a number of European countries. Your route is continuously recalculated based on traffic during a trip. The app has no ads or trackers and the GPS location data (de) is only used for traffic aggregation. You can give the routing a try on the website of the traffic control centre Lower Saxony (de).
  • … there is a list of software for OSM users at home or on the road?

Other “geo” things

  • Satellite imagery firm Planet Labs is ten years old, and expecting to go public shortly with a valuation of around $2.8 billion. Ari Lewis gave a potted history of the company in a Twitter thread.
  • Carlo Ratti found that people (without navigation software) tend to follow their intuition rather than the most effective path when choosing routes, after analysing mobile phone data at MIT.
  • Shaun McDonald examined the cycle-friendliness of roadworks during the installation of a new gas main.
  • TomTom described how they have made progress in improving quality with their MapMetrics system by removing bad data, e.g. filtering out GPS tracks from people on trains. The map data used during the process is based on OpenStreetMap.

Upcoming Events

Where What When Country
Черкаси Open Mapathon: Digital Cherkasy 2021-10-24 – 2021-11-20 ua
OSM Uganda Mapathon: Strengthening the OSM community 2021-10-30
Amsterdam OSM Nederland maandelijkse bijeenkomst (online) 2021-10-30
Bogotá Distrito Capital Resolvamos notas de Colombia creadas en OpenStreetMap 2021-10-30
Prévessin-Moëns Cartographie dans le Pays de Gex 2021-10-31
OSMF Engineering Working Group meeting 2021-11-01
MapRoulette Community Meeting 2021-11-02
[Online] OpenStreetMap Foundation – Board of directors and advisory board public videomeeting 2021-11-02
London Missing Maps London Mapathon 2021-11-02
Landau an der Isar Virtuelles Niederbayern-Treffen 2021-11-02
Stuttgart Stuttgarter Stammtisch (Online) 2021-11-02
Bochum OSM-Treffen Bochum (November) 2021-11-04
Bogotá Distrito Capital Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-06
OSM Local Chapters & Communities Virtual Congress 2021-11-06
Crowd2Map is 6! Join our party mapathon to learn more about our work.. 2021-11-07
臺北市 OSM x Wikidata Taipei #34 2021-11-08
Hamburg Hamburger Mappertreffen 2021-11-09
Zürich OSM-Treffen Zürich 2021-11-11
Berlin 161. Berlin-Brandenburg OpenStreetMap Stammtisch 2021-11-11
FOSS4G State of the Map Oceania 2021 2021-11-12
Missing Maps MonarchMappers Fall 2021 Mapathon 2021-11-13
Bogotá Distrito Capital Resolvamos notas de Colombia creadas en OpenStreetMap 2021-11-13
Geography 2050 Symposium – Mapathon for an Equitable Future 2021-11-13
Crowd2Map Tanzania GeoWeek Human Right’s Day FGM Mapathon 2021-11-15
Bonn 145. Treffen des OSM-Stammtisches Bonn 2021-11-16
Berlin OSM-Verkehrswende #29 (Online) 2021-11-16
Lüneburg Lüneburger Mappertreffen (online) 2021-11-16
Missing Maps Arcadis Mapathon 2021-11-17
Missing Maps WMU Mapathon 2021-11-17
Köln OSM-Stammtisch Köln 2021-11-17
Chambéry Missing Maps CartONG Tour de France des Mapathons – Chambéry 2021-11-18
State of the Map Africa 2021 2021-11-19 – 2021-11-21

Note:
If you would like to see your event here, please add it to the OSM calendar. Only events entered there will appear in weeklyOSM.

This weeklyOSM was produced by Nordpfeil, PierZen, RCarlow, SK53, Sammyhawkrad, Strubbl, TheSwavu, arnalielsewhere, derFred.

Volunteer support is a major theme in the Wikimedia movement, and a typical endeavour of many affiliates. Content contributors are supported in many ways − Wikipedia writers get access to reference books, Wikimedia Commons photographers to equipment and accreditations, contest organizers to project management support, etc.

Yet, to me there seems to be a major blind spot when it comes to technical contributors.

By technical contributors, I encompass all people contributing to server-side scripting (templates and modules), client-side scripting (user-scripts and gadgets), autonomous editing programs (bots), desktop and web applications, etc.

Existing support

Here I will try to draw a (likely non-exhaustive) map of the existing support.

Technical infrastructure

The Wikimedia Foundation (WMF) provides technical infrastructure:

  • to run the tools, e.g. web apps and bots, via Cloud VPS and Toolforge
  • to support the development life-cycle: issue tracking, code-hosting, code-review, build & testing, deployment − with Gerrit, Phabricator and Jenkins.

While these platforms have their shortcomings (indeed, many volunteer developers avoid them altogether), I believe they are an amazing proposition. I also believe they are well supported: hang out on IRC asking for help for your Toolforge tool, and you will most certainly get it, whether from WMF staff, volunteers, or WMF staff with their volunteer hat.

Events

The WMF, generally in partnership with an affiliate, has long been organizing hackathons − the Wikimedia hackathon around May, and the Wikimania one around July/August (other such events exist, such as the Dutch TechStorm).

I think hackathons are amazing − as I wrote before, they are great at creating an atmosphere suited to get work done.

Motivation & appreciation

A key tenet of the volunteer support, this has seen some welcome progress with the notable creation of the Coolest Tool Award.

Money

Ah, this is often where it ends: a question “What about VeryImportantTool” is likely to elicit an answer along the lines of “This tool is volunteer-developed, and not supported by BigOrganization. We would be happy to consider a grant request to work on it though!”

While grants are useful, and have certainly helped the creation and enhancement of great tools (both in the Wikiverse and in other open movements like OpenStreetMap¹), I think this answer is reductive and inappropriate in many cases. The main reason I see is that most often, when time or motivation are lacking, money will not buy either. Some tool developers may be self-employed freelancers, or be between jobs, with the flexibility to set aside a few weeks/months supported by a grant; not so much if one is regularly employed. Timing can also be tricky: some grants have cycles, which may not fit the (work-)life of people.

So what is missing?

I believe that what is missing in tool development is more collaboration. It takes a village to raise a tool − and various specialties ranging from product ownership, design, development, operations, testing, QA, security, documentation… −  yet more often than not, a single person is behind a tool.

Staff support

It has been suggested before that WMF should take over part of the duties related to a tool − typically the operations (of tools like PetScan), but UX/design has also been suggested.

Historically, I think it fair to say WMF has not been keen on doing so. I can see good reasons to avoid that (different priorities, limited resources, not so easy to adopt foreign and potentially messy codebases, perhaps fear of cannibalizing volunteer efforts), but also bad ones (wrong sense of priorities, NIH syndrome).

I can see how staff taking over operations of a major tool, or having a UX review queue, might not be sustainable. But I am convinced that there is space for a technical counterpart to the Volunteer Support for content contributors.

This crystalized for me with a story: following the pipenv 2018 release, Sumana Harihareswara (who once upon a time worked at WMF, interestingly enough) stepped in to help the maintainers put together a well-overdue release of a major piece of the Python ecosystem. Sumana called it “Coaching and cheerleading”:

An external perspective can help a lot. You can be that person. Whether you call yourself a sidekick, a project manager, a cheerleader, a coach, or something else, you can be a supportive accountability partner who helps with the bits that maintainers are not great at, or don’t have time for. And you don’t have to know the project codebase to do this, or be a feature-level developer – the only pipenv code I touched was the docs.

Sumana Harihareswara

We could really use some professional coaches and cheerleaders.

(Speaking only for myself, as the developer of the moderately successful integraality (and ex-co-maintainer of the monuments database) − no smash hits, but somewhat relied-upon tools − I can see how having someone checking in on me once every couple of months, to help me plan and organize work, could be very valuable.)

Being the village

As I outlined above, there are many parts to tool development, many of which are non-technical: handling bug reports, writing documentation, prioritizing feature requests… We are an amazing collaborative community − why isn’t there more collaboration around tools? For example, when building integraality, I experienced first-hand how delegating product decisions helped:

Making the decisions on where to take the product and what red lines to draw scope-wise is sometimes the hardest part. This saved me heaps of time in the couple of instances where I was unsure how to proceed. I wonder if this should be part of the hackathon setup − every hacker being the “owner” or “client” of another’s project.

This − again − crystalized with a recent story: reading Magnus Manske’s “The Buggregator” blog post. I could not help but feel deeply sad: Magnus, a prolific developer of highly-popular tools, is so overwhelmed with bug reports and feature requests, raised in many different places, that he feels the need to write a tool to help manage all of these. I could not help but think − why on earth does he have to manage this communication influx all by himself? Since his tools are so popular, surely there should be many folks happy to help with this tool, or that one? When someone reports a bug to Magnus on some random talk page or Telegram channel, why isn’t there a volunteer happy to take over, ask the necessary clarifications, file it in the canonical bug tracking place (whichever that might be), prioritize it for later?

Of course, most of us will never churn out so many tools that we need to write a bug aggregation platform; but I think that Magnus’ problems are everyone’s problems − just magnified (magnufied?) ten times over.

In short, there should be more of a community active around a popular tool − people helping out in particular with the non-technical aspects. 

I used to think that the lack of such community was somehow the responsibility of the developer themself − that if you were the single-maintainer on your Toolforge tool account, then you should try a bit harder.

I have come around on this: we set out to develop software, often to scratch our own itch, and not necessarily expecting success; we don’t sign up to do community building and management on top. But that community does not seem to happen on its own either: it seems that in this movement, we are more prone to ask tool developers for help than to ask whether they need help.

If so, then I think this should be the first and foremost job of the professional technical volunteer support: helping build up an active support network around a tool. This may include recruiting co-developers, but I see a real opportunity in engaging volunteers in non-technical capacities, and structuring that engagement with models and best practices. Far from cannibalizing volunteer resources, this would rather fit nicely in our goal of capacity building and increasing the sustainability of our movement.

Conclusion

In this post, I tried to map out the existing support for technical contributors, and have made the case for professional volunteer support, mirroring the practices and successes of support for content contributors. Beyond a direct “coaching and cheerleading”, this would entail recruiting and structuring volunteer-based support networks around each tool.

This may find an echo at the existing big software-houses that are WMF and WMDE, who may decide to create dedicated technical volunteer support roles or teams − I would certainly welcome it. But there are so many tools − I don’t think this is a problem that can be solved centrally. After all, no one expects WMF to help out individual content contributors in a particular country or focus area − this is what affiliates have excelled in.

Rather, the idea would play well with the Hubs model of support structures that has been developed as part of the 2030 strategy: a hypothetical “GLAM” thematic hub helping out on GLAM-related tools, or a “Wikisource” hub on Wikisource tools, etc.

Until then, in the spirit of decentralization and subsidiarity, I rather hope that the idea might be taken up by all affiliates: expanding their existing volunteer support to technical contributors − one tool at a time.


This piece was in the works for a while; the impetus to finish it came ahead of the “Please don’t get hit by a bus! Towards a resilient and sustainable Wikidata tool ecosystem” session at WikidataCon 2021.


Notes

¹ Pattypan (grant) immediately comes to my mind; I also think of the StreetComplete project (which I am quite fond of).

By Martin Urbanec, Software Engineer, Growth Team

The Growth team recently improved the performance of a script that prepares data for usage in the mentor dashboard: we decreased the average runtime of the script from more than 48 hours to less than five minutes. In this post, you will learn how we did that.

What is the mentor dashboard?

The mentor dashboard project lets community mentors view the list of newcomers that they are mentoring and some information about their editing activity. As of October 2021, the features are only available for Wikipedia projects.

For instance, the mentor dashboard lets the mentors see their mentees’ registration dates, the number of questions they asked, the total number of edits they made, the number of reverted edits they have or how many times they were blocked. This kind of information makes it possible for mentors to be more proactive. Rather than waiting for questions from their mentees, they can reach out early and offer support.

Figure 1: Mentor dashboard at test.wikipedia.org (image source, CC BY-SA 4.0)

How do we get the data needed for display in the mentor dashboard?

The naive solution would be to calculate the data when a mentor loads their mentor dashboard by issuing the needed SQL queries. However, this wouldn’t scale as the number of mentees grows: a mentor can have anywhere from a few dozen to thousands of mentees. We want the mentor dashboard to load quickly, as we don’t want mentors to have to wait several seconds, or even minutes.

To work around this problem, we decided to precalculate the data and store it in a caching database table. The process responsible for updating the data in that table will be referred to as “the update process” from now on.
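
As a rough sketch of the idea (the table and column names below are hypothetical, not the extension’s actual schema), the cache holds one precalculated row per mentee, which the update process refreshes and the dashboard reads directly:

    -- Hypothetical cache table: one precalculated row per mentee.
    -- The update process rewrites these rows; the dashboard only reads them.
    CREATE TABLE mentee_data_cache (
      mentee_id       INT UNSIGNED NOT NULL PRIMARY KEY,
      mentor_id       INT UNSIGNED NOT NULL,
      registration    BINARY(14)   NOT NULL,            -- MediaWiki-style timestamp
      questions_asked INT UNSIGNED NOT NULL DEFAULT 0,
      edit_count      INT UNSIGNED NOT NULL DEFAULT 0,
      reverted_edits  INT UNSIGNED NOT NULL DEFAULT 0,
      block_count     INT UNSIGNED NOT NULL DEFAULT 0,
      last_updated    BINARY(14)   NOT NULL,
      KEY mentor_idx (mentor_id)                         -- the dashboard filters by mentor
    );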

Since we wanted to deploy the first version of the dashboard to only four wikis (the Arabic, Czech, Bengali, and Vietnamese Wikipedias), we first enabled the update process only on those four wikis. Measuring the overall runtime there showed things were fine: most of the time was spent on the Arabic Wikipedia (two to three hours). I felt that was okay and understandable, since the Arabic Wikipedia has about 250k mentees to go through.

As I was preparing to deploy the mentor dashboard to more wikis, I started a test run of the update process on the French Wikipedia, to get an idea of how well it performs on that wiki. The French Wikipedia has about 200k mentees to go through, which is fewer than the Arabic Wikipedia has. When I started the test run, I thought, “this must take only a couple of hours too, as the number is comparable to the Arabic Wikipedia.” When the test run completed, I was shocked to see it had taken more than two days.

This wasn’t acceptable. We want to run the update process daily, which means it needs to complete within 24 hours. I filed a task in Wikimedia Phabricator and started looking for options to improve the performance of the update process.

Optimizing: Part I

While preparing to build the mentor dashboard, I had experimented with raw SQL queries on the analytics database replicas, and I noticed that the “how many times was the mentee blocked” query was remarkably slow. For that reason (and without doing any profiling), I started to suspect it was what was slowing down the update process.

I felt I knew why the blocks query was slow. Originally, that query used JOIN conditions like user_name=REPLACE(log_title, '_', ' '), meaning that the database wasn’t able to use the index on log_title. I, however, wasn’t sure how to do it more efficiently. Another member of the Growth team, Gergő Tisza, suggested that instead of doing the replaces at the database layer, I could do them at the application layer. With that change, the query would end up using a different index: instead of iterating through all blocks the wiki had placed, it would go through all log events related to the mentees. Since, for most wikis, the total number of blocks is greater than the number of events per user, this approach had a chance of decreasing the runtime significantly.
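
To make that concrete, here is a simplified sketch of the two query shapes (the real query is more involved; the tables and columns follow the standard MediaWiki logging and user tables, and the mentee values are placeholders):

    -- Original shape (simplified): the JOIN condition wraps log_title in an
    -- expression, so the index on log_title cannot be used and the database
    -- ends up walking every block log entry on the wiki.
    SELECT user_id, COUNT(*) AS block_count
    FROM logging
    JOIN user ON user_name = REPLACE(log_title, '_', ' ')
    WHERE log_type = 'block'
      AND user_id IN (1, 2, 3)                 -- mentee IDs (placeholders)
    GROUP BY user_id;

    -- Reworked shape (simplified): the space-to-underscore conversion happens
    -- in the application, so the query filters on plain log_title values and
    -- an index on log_title narrows the scan to the mentees' own log events.
    SELECT log_title, COUNT(*) AS block_count
    FROM logging
    WHERE log_type = 'block'
      AND log_title IN ('Mentee_one', 'Mentee_two', 'Mentee_three')  -- placeholders
    GROUP BY log_title;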

After I implemented the idea, the runtime decreased significantly: from 48 hours to 5.5 hours on the French Wikipedia.

Optimizing: Part II

With the performance improvement adopted in the first part, I was happy with the French Wikipedia result; it would make it possible to deploy the dashboard. To further prepare for a wide deployment, I needed to verify that the update process would also cope well with the English Wikipedia, which had accumulated more than 500k mentees in just three months of the Growth features being available there. Unfortunately, the English Wikipedia test run ran for more than three days, even with the Part I performance improvement, and I didn’t have enough patience to let it finish.

At that point, I was out of ideas about what could be slowing things down. To gather more information about the problem, I decided to consult Tendril. Tendril is a tool for analytics and performance tuning of the MariaDB servers. One of the many features it offers is the Slow queries report, which is available from the “Report” tab. To see queries by the maintenance script on French Wikipedia, I changed the user to wikiadmin (the user maintenance scripts run under) and set the schema to frwiki. Once I did so, I was able to see the queries my script made:


Figure 2: Queries executed by the update process, as shown by Tendril (image source, CC BY-SA 4.0)

I immediately noticed something was off: if you look closely at the user IDs, some of them are integers (as expected), but others are strings. Experimenting with raw SQL queries on the analytics replicas showed that casting the IDs to integers speeds up the query significantly: instead of taking half a minute, the query shown above now completes in less than a second. Once I cast the user IDs to integers, the overall update time for the French Wikipedia went down from 5.5 hours to 5.5 minutes.
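
To illustrate the kind of change this was (the table and column names below are generic placeholders, not the actual query), the fix amounts to the difference between the two forms below; in our case the string form was the one taking half a minute, and the integer form finished in under a second:

    -- Hypothetical illustration: mentee IDs arriving as strings. The literals
    -- do not match the integer column type, so the optimizer handled the
    -- comparison poorly and the query was slow.
    SELECT user_id, COUNT(*) AS edits
    FROM edits_by_user                        -- placeholder table
    WHERE user_id IN ('12345', '67890')
    GROUP BY user_id;

    -- After casting the IDs to integers in the application, the literals match
    -- the column type and the index on user_id can be used as intended.
    SELECT user_id, COUNT(*) AS edits
    FROM edits_by_user
    WHERE user_id IN (12345, 67890)
    GROUP BY user_id;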

Conclusion

With all of the changes described above implemented, the overall update time for French Wikipedia decreased from 48 hours to 5.5 minutes. During the process, I found Tendril to be a very useful tool, as it allows me to view the actual slow queries (including information about their origin).

Given the great performance improvement accomplished, we will look into updating the information in the dashboard more frequently than just daily, as we want to offer mentors as fresh data as possible. This will also allow us to let mentors update their own data on-demand, in addition to the automated update process, without putting too much load on the database servers.

About this post

Featured image credit: File:Tendril.jpg, Electron, CC BY 2.0

Using Manjari as new orthography Malayalam font

14:30, Friday, 29 2021 October UTC

Manjari is a traditional orthography font for Malayalam. It has a large set of ligatures: vowel signs like /u/ attach to their corresponding consonants to form ligatures. But sometimes there is a need to show new orthography Malayalam content in Manjari. Recently, Manjari was used to typeset an academic book about the Malayalam script, and some of its content needed to be shown in the new orthography, with detached vowel signs and detached reph signs.

Join us at WikidataCon!

13:51, Friday, 29 2021 October UTC

Today is Wikidata’s ninth birthday — and what better way to celebrate than a conference?

WikidataCon 2021 begins today! The conference spans three days (October 29–31), several tracks, and a great many sessions, so we thought we would share a list of the sessions that might be of interest to those of you in the GLAM and education sectors. Some of these sessions feature past participants from our Wikidata courses.

But first, if you’re not registered yet, you can take care of that at this link!

  • Wikidata at Texas A&M University Libraries: Enhancing Discovery for Dissertations (Jeannette Ho): This presentation will highlight how librarians at Texas A&M University uploaded student, dissertation, and faculty advisor data to Wikidata as part of the PCC Wikidata Pilot initiative. It will also cover some challenges and next steps, as well as possible implications that Wikidata may have for traditional processes to manage personal and organizational entities in a catalog.
  • Wiki API Connector – Simplifying ETL workflows from open APIs to Wikidata/Commons (Andrew Lih): This session will provide an overview of the Wiki API Connector. The connector aims to simplify the extract-transform-load (ETL) process of metadata uploads to Wikimedia projects without complicated coding or software development. This tool may serve as a general solution useful for other GLAM institutions or partner organizations. This session will address work completed with the tool so far and seek feedback on how it may be useful for other users and applications.
  • The Met Museum’s Work with Wikidata and Structured Data on Commons (Andrew Lih): This session will unpack how The Met Museum has contributed object metadata and depiction information to Wikimedia projects and in return, how Wikidata content is brought back into The Met’s database and made available via its open access API. This session will cover Structured Data on Commons (SDC), including the tools, processes, modeling challenges, and the complexities of using references for SDC.
  • Wikidata in the Classroom: Updates from North America (Stacy Allison Cassin, Lane Rasberry, Amanda Rust, Amy Ruskin): This session will explore some examples of Wikidata in the classroom across North America. From the University of Toronto, Stacy Allison Cassin will describe the use of Wikidata in an introductory library and information science course. At Northeastern University, Amanda Rust and Amy Ruskin will share insights about a public art documentation project they are working on with students. From the University of Virginia and Wiki Education, Lane Rasberry and Will Kent will report out on a partner project, supporting Data Science masters students who used Wikidata for a capstone project.
  • There are, of course, many other worthwhile sessions for anyone to attend, across all of the conference tracks. This year we’re especially excited about the entire Education & Science track, which was co-curated by Shani Evenstein Sigalov and Wiki Education’s Will Kent and LiAnna Davis. With education and science being so dear to us at Wiki Education, we recommend attending as many of these sessions as you can!

This conference happens only every other year, so now is the perfect time to meet some luminaries in the Wikidata community and catch up on Wikibase, decolonizing Wikidata, Wikidata tools, and building a sustainable future for Wikidata. We encourage you to attend as many sessions as you can. Many of them will be recorded and archived for you to watch as your schedule permits. See the full schedule here.

Image credits: AJurno (WMB), CC BY-SA 4.0, via Wikimedia Commons; Bleeptrack, CC BY-SA 4.0, via Wikimedia Commons

Wikipedia and the Representation of Reality

18:12, Thursday, 28 2021 October UTC

Wikipedia and the Representation of Reality is a new book by Zachary J. McDowell and Matthew A. Vetter that was published by Routledge this summer. The entire book is available as a free download from the publisher or as a free Kindle download from Amazon.

Wikipedia is the encyclopedia that anyone can edit. Making a change to its content is as simple as clicking the edit button and typing something in. And more often than not, what happens next for new editors is that their addition gets reverted and is hidden from everything but the edit history of the page. In theory Wikipedia’s barriers to entry are very low, but the barriers to making a meaningful contribution to the encyclopedia’s contents are much, much higher. In some ways this is a good thing — Wikipedia’s exclusion of a lot of new additions helps it achieve the seemingly impossible task of presenting reliable, high-quality information on a wiki that anyone can edit. But it also means that Wikipedia’s base of contributors, the topics they choose to write about, the information they choose to include, and the way they choose to phrase their contributions are limited to the group of people who make it past these barriers.

For those of us who work on trying to expand the demographic profile of Wikipedia contributors, it’s important to understand the policies and processes that shape content inclusion and exclusion on Wikipedia. But even a basic understanding of the key policies and processes can take years to acquire. A nuanced understanding of how they play out in practice — both the written and unwritten rules — is even more difficult.

This is what makes Wikipedia and the Representation of Reality such a valuable addition to the growing collection of scholarly and popular writing around Wikipedia. Zach McDowell and Matt Vetter are experienced members of the community of Wikipedians and educators who have been incorporating Wikipedia assignments into their teaching over the last decade. Both Zach and Matt have worked with Wiki Education for years both as instructors and, in Zach’s case, as a research fellow.

Over the two decades of its existence, Wikipedia has grown to the point where it has become the encyclopedia. Instead of explaining Wikipedia as an online encyclopedia, we now explain other encyclopedias in relation to Wikipedia. While Wikipedia’s policies on inclusion and exclusion were meant to limit its coverage to reality and avoid hoaxes, the project has come to shape reality, or at least shape what’s important for many of its readers: if it isn’t covered, is it really that important?

In the book, Zach and Matt take a careful look at key policies and try to tease out a lot of the underlying assumptions of the policy-writers. From the perspectives of the early techno-utopian Wikipedians, a statement like “Be Bold” was a way of telling potential contributors that they didn’t need to ask anyone’s permission to make Wikipedia better. But for a different audience, this means something different; the book quotes their students as saying they “ultimately felt more anxiety than boldness” and found themselves “afraid to upset, anger, or disappoint” the original authors of the work they were editing.

These disconnects between policy as designed and as interpreted, between intent and effect, are the kinds of things that have kept Wikipedia from being what it was designed to be.

By looking at the interplay between policy and the way that policy is applied, Zach and Matt manage to introduce readers to the importance of the community aspect of Wikipedia. Many of Wikipedia’s readers don’t understand that behind the text, there is a community of individuals with different ideas and opinions of how to interpret policy. When someone reverts your edit, it isn’t Wikipedia, it’s a certain Wikipedian. Unlike a top-down organization where there’s someone who can decide what a policy means and how it’s going to be interpreted, the application of policy on Wikipedia is a socially constructed reality built through a process of discussion, debate, and negotiation. The book explains this by invoking Steven Thorne’s “culture-of-use” theory, which provides a conceptual framework that is likely to help people who are new to Wikipedia make sense of this sort of thing.

By creating a framework for understanding Wikipedia, I believe that the book will open up Wikipedia for a wide range of people. The book is an academic work (albeit a very readable one) and should be a necessary primer for anyone interested in studying Wikipedia. But it’s also very helpful for Wikipedians who are interested in fixing its problems. The book is a critique of Wikipedia, but it’s written by people who love the project and are optimistic about its future. Far too often Wikipedia is either written about in purely positive terms or it is dismissed as hopelessly exclusionary and sexist.

As a person who has spent much of the last 17 years thinking about and interacting with Wikipedia policy and its impact on knowledge, existing editors, and new ones, I found that this book gave me a lot to think about. I didn’t agree with everything they said in the book, but even when I disagreed it tended to be over matters of interpretation rather than matters of fact. I love the fact that I finally have a source I can reference instead of feeling the need to explain everything from first principles.

I highly recommend this book to anyone who wants to know more about how Wikipedia works, or who wants to make Wikipedia better. The fact that they were able to release it under an open license means it’s freely available (either as a .pdf or through Amazon’s Kindle store) and easily accessible to anyone who has the time.

You can watch a presentation about this book from WikiConference North America here.

Image credits: Symbols illustrated by Jasmina El Bouamraoui and Karabo Poppy Moletsane, CC0, via Wikimedia Commons; image of books courtesy Zach McDowell, all rights reserved.