So, a really interesting question
cropped up this weekend:
I’m trying to find out how many biographies of living
persons exist on the English Wikipedia, and what kind of data we
have on them. In particular, I’m looking for the gender
breakdown. I’d also like to know when they were created;
average length; and whether they’ve been nominated for
deletion.
This is, of course, something that’s being discussed a lot
right now; there is a lot of emerging push-back against the
excellent
work being done to try and add more notable women to Wikipedia,
and one particular deletion debate got a
lot of attention in the past few weeks, so it’s on
everyone’s mind. And, instinctively, it seems plausible that
there is a bias in the relative frequency of nomination for
deletion – can we find if it’s there?
My initial assumption was, huh, I don’t think we can do
that with Wikidata. Then I went off and thought about it for a bit
more, and realised we could get most of the way there
with some inferences. Here are the results, and how I got
there. Thanks to Sarah for
prompting the research!
(If you want to get the tl;dr summary – yes, there is
some kind of difference in the way older male vs female
articles have been involved with the deletion process, but exactly
what that indicates is not obvious without data I can’t get
at. The difference seems to have mostly disappeared for
articles created in the last couple of years.)
Statistics on the gender breakdown of BLPs
As of a snapshot of yesterday morning, 5 May 2019, the English
Wikipedia had 906,720 articles identified as biographies of living
people (BLPs for short). Of those, 697,402 were identified as male
by Wikidata, 205,117 as female, 2464 had some other value for
gender, 1220 didn’t have any value for gender (usually
articles on groups of people, plus some not yet updated),
and 517 simply didn’t have a connected Wikidata item (yet).
Of those with known gender, it breaks down as 77.06% male, 22.67%
female, and 0.27% some other value. (Because of the limits of the
query, I didn’t try and break down those in any more
detail.)
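For anyone wanting to check the arithmetic, here is a minimal sketch of how those percentages fall out of the raw counts. The three counts are the ones quoted above; the variable names are just illustrative.

```shell
# Counts quoted in the text; variable names are illustrative only.
male=697402
female=205117
other=2464

# Share of each group among BLPs with a known gender, to two decimal places.
shares=$(awk -v m="$male" -v f="$female" -v o="$other" 'BEGIN {
    total = m + f + o
    printf "%.2f %.2f %.2f", 100 * m / total, 100 * f / total, 100 * o / total
}')
echo "$shares"
```

The denominator deliberately leaves out the 1220 articles with no gender value and the 517 with no Wikidata item, since the shares are "of those with known gender".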
This is, as noted, only articles about living people;
across
all 1,626,232 biographies in the English Wikipedia with a
gender known to Wikidata, it’s about 17.83% female, 82.13%
male, and 0.05% some other value. I’ll be sticking to data on
living people throughout this post, but it’s interesting to
compare the historic information.
So, how has that changed over time?
This graph shows all existing BLPs, broken down by gender and
(approximately) when they were created. As can be seen, and as
might be expected, the gap has closed a bit over time.
Looking at the ratio over time (expressed here as %age of total
male+female), the relative share of female BLPs was ~20% in 2009.
In late 2012, the rate of creation of female BLPs kicked up a gear,
and from then on it’s been noticeably above the long-term
average (almost hitting 33% in late 2017, but dropping back since
then). This has driven the overall share steadily and continually
upwards, now at 22.7% (as noted above).
Now the second question: do the article lengths differ by
gender? Indeed they do, by a small amount.
Female BLPs created at any time since 2009 are slightly longer
on average than male ones of similar age, with only a couple of
brief exceptions; the gap may be widening over the past year but
it’s maybe too soon to say for sure. Average difference is
about 500 bytes or a little under 10% of mean article size –
not dramatic but probably not trivial either. (Pre-2009 articles,
not shown here, are about even on average.)
Note that this is raw bytesize – actual prose size will be
smaller, particularly if an article is well-referenced; a single
well-structured reference can be a few hundred characters.
It’s also the current article size, not size at
creation, hence why older articles tend to be longer –
they’ve had more time to grow. It’s interesting to note
that once they’re more than about five years old they seem to
plateau in average length.
Finally, the third question – have they been nominated for
deletion? This was really interesting.
So, first of all, some caveats. This only identifies articles
which go through the structured “articles for deletion”
(AFD) process – nomination, discussion, decision to keep or
delete. (There are three
deletion processes on Wikipedia; the other two are more
lightweight and do not show up in an easily traceable form). It
also cannot specifically identify if that exact page was
nominated for deletion, only that “an article with exactly
the same page name has been nominated in the past” –
but the odds are good they’re the same if there’s a
match. It will miss out any where the article was renamed after the
deletion discussion, and, most critically, it will only see
articles that survived deletion. If they were deleted, I
won’t be able to see them in this analysis, so there’s
an obvious survivorship bias limiting what conclusions we can
draw.
Having said all that…
Female BLPs created 2009-16 appear noticeably more likely than
male BLPs of equivalent age to have been through a deletion
discussion at some point in their lives (and, presumably, all have
been kept). Since 2016, this has changed and the two groups are
about even.
Alongside this, there is a corresponding drop-off in the number
of articles created since 2016 which have associated deletion
discussions. My tentative hypothesis is that articles created in
the last few years are generally less likely to be nominated for
deletion, perhaps because the growing use of things like the draft
namespace (and associated reviews) means that articles are more
robust when first published. Conversely, though, it’s
possible that nominations continue at the same rate, but the
deletion process is just more rigorous now and a higher proportion
of those which are nominated get deleted (and so disappear from our
data). We can’t tell.
(One possible explanation that we can tentatively dismiss is age
– an article can be nominated at any point in its lifespan so
you would tend to expect a slowly increasing share over time, but I
would expect the majority of deletion nominations come in the first
weeks and then it’s pretty much evenly distributed after
that. As such, the drop-off seems far too rapid to be explained by
just article age.)
What we don’t know is what the overall nomination for
deletion rate, including deleted articles, looks like. From our
data, it could be that pre-2016 male and female articles are
nominated at equal rates but more male articles are deleted; or it
could be that pre-2016 male and female articles are equally likely
to get deleted, but the female articles are nominated more
frequently than they should be. Either of these would cause the
imbalance. I think this is very much the missing piece of data and
I’d love to see any suggestions for how we can work it out
– perhaps something like trying to estimate gender from the
names of deleted articles?
Update: Magnus has run some numbers on
deleted pages, doing exactly this – inferring gender from
pagenames. Of those which were probably a person, ~2/3 had an
inferred gender, and 23% of those were female. This is a remarkably
similar figure to the analysis here (~23% of current BLPs female;
~26% of all BLPs which have survived a deletion debate female).
So in conclusion…
- We know the gender breakdown: skewed male, but growing slowly
more balanced over time, and better for living people than
historical ones.
- We know the article lengths; slightly longer for women than men
for recent articles, about equal for those created a long time
ago.
- We know that there is something different about the way
male and female biographies created before ~2017 experience the
deletion process, but we don’t have clear data to indicate
exactly what is going on, and there are multiple potential
explanations.
- We also know that deletion activity seems to be more balanced
for articles in both groups created from ~2017 onwards, and that
these also have a lower frequency of involvement with the deletion
process than might have been expected. It is not clear what the
mechanism is here, or if the two factors are directly linked.
How can you extract this data? (Yes, this is very dull)
The first problem was generating the lists of articles and their
metadata. The English Wikipedia category system lets us identify
“living people”, but not gender; Wikidata lets us
identify gender (property P21), but not reliably “living
people”. However, we can creatively use the petscan tool to get the
intersection of a SPARQL gender query + the category. Instructing
it to explicitly use Wikipedia (“enwiki” in other
sources > manual list) and give output as a TSV – then
waiting for about fifteen minutes – leaves you with a nice
clean data dump. Thanks, Magnus!
(It’s worth noting that you can get this data with any
characteristic indexed by Wikidata, or any characteristic
identifiable through the Wikipedia category schema, but you will
need to run a new query for each aspect you want to analyse –
the exported data just has article metadata, none of the
Wikidata/category information)
The exported files contain three things that are very useful to
us: article title, pageid, and length. I normalised the files like
so:
grep '[0-9]' enwiki_blp_women_from_list.tsv | cut -f 2,3,5 > women-noheader.tsv
This drops the header line (it’s the only one with no
numeric characters) and extracts only the three values we care
about (and conveniently saves about 20MB).
This gives us two of the things we want (age and size) but not
deletion data. For that, we fall back on inference. Any article
that is put through the AFD process gets a new subpage created at
“Wikipedia:Articles for deletion/PAGENAME”. It is
reasonable to infer that if an article has a corresponding AFD
subpage, it’s probably about that specific article.
This is not always true, of course – names get recycled,
pages get moved – but it’s a reasonable working
hypothesis and hopefully the errors are evenly distributed over
time. I’ve racked my brains to see if I could anticipate a
noticeable difference here by gender, as this could really
complicate the results, but provisionally I think we’re okay
to go with it.
To find out if those subpages exist, we turn to the enwiki dumps.
Specifically, we want “enwiki-latest-all-titles.gz”
– which, as it suggests, is a simple file listing all page
titles on the wiki. Extracted, it comes to about 1GB. From this, we
can extract all the AFD subpages, like so:
grep "Articles_for_deletion/" enwiki-latest-all-titles | cut -f 2 | cut -d / -f 2 | sort -u > afds
This extracts all the AFD subpages, removes any duplicates
(since eg talkpages are listed here as well), and sorts the list
alphabetically. There are about 424,000 of them.
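One thing worth checking before matching against this list is that both sides spell titles the same way. The titles dump uses underscores in place of spaces; whether the exported article list does too is an assumption you should verify on your own export. A minimal sketch of a normalising step, with normalise_titles being a hypothetical helper name rather than anything from the original workflow:

```shell
# Hypothetical helper: print the title column (column 1) of a TSV with spaces
# turned into underscores, sorted and de-duplicated so it is ready for comm.
normalise_titles() {
    cut -f 1 "$1" | tr ' ' '_' | sort -u
}
```

If the export already uses underscores this is a no-op apart from the sort, so it is cheap insurance either way.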
Going back to our original list of articles, we want to bin them
by age. To a first approximation, pageid is sequential
with age – it’s assigned when the page is first
created. There are some big caveats here; for example, a page being
created as a redirect and later expanded will have the ID of its
initial creation. Pages being deleted and recreated may get a new
ID, pages which are merged may end up with either of the original
IDs, and some complicated page moves may end up with the original
IDs being lost. But, for the majority of pages, it’ll work
out okay.
To correlate pageID to age, I did a bit of speculative guessing
to find an item created on 1 January and 1 July every year back to
2009 (eg
pageid 43190000 was created at 11am on 1 July 2014). I could
then use these to extract the articles corresponding to each period
like so:
...
awk -F '\t' '$2 >= 41516000 && $2 < 43190000' < men-noheader.tsv > bins/2014-1-M
awk -F '\t' '$2 >= 43190000 && $2 < 44909000' < men-noheader.tsv > bins/2014-2-M
...
This finds all items with a pageid (in column #2 of the file)
between the specified values, and copies them into the relevant
bin. Run once for men and once for women.
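Rather than typing one awk line per period, the binning can be wrapped in a loop. This is only a sketch of the idea, not the script actually used: bin_by_pageid is a hypothetical helper, and it lists just the two 2014 periods whose boundary pageids appear above. Boundaries for other periods would need to be looked up in the same way.

```shell
# Hypothetical helper: split a three-column TSV (title, pageid, length) into
# one file per period, based on pageid ranges.
# Usage: bin_by_pageid INPUT.tsv SUFFIX OUTDIR
bin_by_pageid() {
    input=$1
    suffix=$2
    outdir=$3
    mkdir -p "$outdir"
    # Each entry is label:lower:upper, lower bound inclusive, upper exclusive.
    for spec in 2014-1:41516000:43190000 2014-2:43190000:44909000; do
        label=${spec%%:*}
        rest=${spec#*:}
        lo=${rest%%:*}
        hi=${rest#*:}
        awk -F '\t' -v lo="$lo" -v hi="$hi" '$2 >= lo && $2 < hi' \
            "$input" > "$outdir/$label-$suffix"
    done
}
```

Calling it as bin_by_pageid men-noheader.tsv M bins and then again with the women's file and F would produce the same bins/2014-1-M style filenames used here.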
Then we can run a short report, along these lines (the original
had loops in it):
cut -f 1 bins/2014-1-M | sort > temp-M
echo -e 2014-1-M"\tM\t"$(wc -l < bins/2014-1-M)"\t"$(awk -F '\t' '{ total += $3; count++ } END { print total/count }' bins/2014-1-M)"\t"$(comm -1 -2 temp-M afds | wc -l) >> report.tsv
This adds a line to the file report.tsv with (in order)
the name of the bin, the gender, the number of entries in it, the
mean value of the length column, and a count of the number which
also match names in the afds file. (The use of the
temp-M file is to deal with the fact that the comm tool
needs properly sorted input).
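For the curious, a looped version of the report might look something like the sketch below. It is a reconstruction under assumptions rather than the original script: report_bins is a hypothetical helper, it assumes bin files are named like 2014-1-M with the gender as the last hyphen-separated part, and it assumes the afds file is already sorted.

```shell
# Hypothetical helper: append one line per bin file to a report.
# Columns: bin name, gender, article count, mean length, AFD title matches.
# Usage: report_bins BINDIR AFDS_FILE REPORT.tsv
report_bins() {
    bindir=$1
    afds=$2
    report=$3
    for bin in "$bindir"/*; do
        name=$(basename "$bin")
        gender=${name##*-}          # text after the last hyphen, e.g. M
        sorted=$(mktemp)
        cut -f 1 "$bin" | sort > "$sorted"   # comm needs sorted input
        count=$(wc -l < "$bin" | tr -d ' ')
        mean=$(awk -F '\t' '{ total += $3 } END { if (NR) print total / NR }' "$bin")
        matched=$(comm -12 "$sorted" "$afds" | wc -l | tr -d ' ')
        printf '%s\t%s\t%s\t%s\t%s\n' \
            "$name" "$gender" "$count" "$mean" "$matched" >> "$report"
        rm -f "$sorted"
    done
}
```

Keep the report file outside the bins directory, otherwise the loop will try to treat it as a bin on the next run.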
After that, generating the data is lovely and straightforward
– drop the report into a spreadsheet and play around with
it.