
Wikipedia:Bots/Requests for approval

From Wikipedia, the free encyclopedia

BAG member instructions

If you want to run a bot on the English Wikipedia, you must first get it approved. To do so, follow the instructions below to add a request. If you are not familiar with programming it may be a good idea to ask someone else to run a bot for you, rather than running your own.

 Instructions for bot operators

Current requests for approval

ProcBot 10b

Operator: ProcrastinatingReader (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:05, Monday, April 4, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s):

Source code available:

Function overview: Removing {{current related}} templates from articles that no longer require the template (no changes in an extended amount of time)

Links to relevant discussions (where appropriate): Wikipedia:Bots/Requests for approval/ProcBot 10

Edit period(s): Continuous

Estimated number of pages affected: ~140 initially

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: Extends Wikipedia:Bots/Requests for approval/ProcBot 10 to the {{current related}} template, which is meant to serve the same purpose but on articles 'related' to a current event. It has the same guidance w.r.t. the page actually being actively edited. Right now we have a lot of stale articles tagged, like Zaporizhzhia Nuclear Power Plant, even though the current event referred to no longer has the template (it was removed due to staleness).

Will use the same filter logic and code as task 10, with the extra template name added for processing.
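A minimal sketch of the combined filter follows; the helper names and the 36-hour cutoff are illustrative assumptions, since the actual task-10 code and thresholds are not shown in this BRFA:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative sketch only; ProcBot's actual code and thresholds are
# not published in this BRFA. The 36-hour cutoff is an assumption.
STALE_AFTER = timedelta(hours=36)
# Matches {{Current}} or {{Current related}}, with optional parameters.
TAG = re.compile(r"\{\{\s*current(?: related)?\s*(?:\|[^}]*)?\}\}\n?",
                 re.IGNORECASE)

def is_stale(last_edit_utc: datetime, now: Optional[datetime] = None) -> bool:
    """True if the page has not been edited for an extended time."""
    now = now or datetime.now(timezone.utc)
    return now - last_edit_utc > STALE_AFTER

def remove_tag_if_stale(wikitext: str, last_edit_utc: datetime) -> str:
    if is_stale(last_edit_utc):
        return TAG.sub("", wikitext, count=1)
    return wikitext
```

Extending task 10 to the new template is then just the extra alternative in the TAG pattern.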

Discussion

DoggoBot 7

Operator: EpicPupper (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 22:49, Sunday, April 3, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): AWB

Source code available: WP:AWB

Function overview: Make issue templates more specific

Links to relevant discussions (where appropriate):

Edit period(s): Weekly

Estimated number of pages affected: 15,271 on first run for {{refimprove}}, 919 for {{primary sources}}

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: Using AWB, the bot would do the following:

This would make the issue templates more specific, allowing a range of benefits, such as helping volunteers prioritize which articles to work on (e.g. BLPs). General fixes will be enabled. Pinging GoingBatty, who originally had this listed as an idea on User:BattyBot.
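The approved replacement list is not shown above; purely as an illustration of the kind of AWB-style substitution meant (the template targets below are hypothetical assumptions, not the approved list), specialising generic cleanup tags on biographies of living people might look like:

```python
import re

# Hypothetical mapping for illustration only -- not the approved list.
BLP_MAP = {"refimprove": "BLP sources",
           "primary sources": "BLP primary sources"}

def specialise_issue_templates(wikitext: str, is_blp: bool) -> str:
    """Swap generic issue templates for BLP-specific ones on BLP pages."""
    if not is_blp:
        return wikitext
    for old, new in BLP_MAP.items():
        # Match the template name followed by a parameter pipe or the
        # closing braces, preserving whatever follows the name.
        wikitext = re.sub(r"\{\{\s*" + re.escape(old) + r"\s*([|}])",
                          "{{" + new + r"\1", wikitext,
                          flags=re.IGNORECASE)
    return wikitext
```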

Discussion

  • Fine in theory, but it makes me wonder why these templates aren't merged in the underlying code, with the output changing based on whether the BLP cat is present. It would make the template slightly easier to use, automatically change when a living person dies, and (during the initial merge run) help a bit in ensuring BLPs are tagged with the BLP cat. ProcrastinatingReader (talk) 13:35, 4 April 2022 (UTC)[reply]
    Also see Wikipedia:Bots/Requests for approval/OmniBot 2 ProcrastinatingReader (talk) 13:38, 4 April 2022 (UTC)[reply]

WOSlinkerBot 22

Operator: WOSlinker (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 16:41, Wednesday, March 30, 2022 (UTC)

Function overview: Fix pages with the Old behaviour of link-wrapping font tags lint issue.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Javascript

Source code available: At User:WOSlinkerBot/linttask22.js

Links to relevant discussions (where appropriate):

Edit period(s): one time run

Estimated number of pages affected: currently ~90,000 of these lint errors, but with multiple errors per page, so an estimated ~30,000 pages

Namespace(s): wikipedia & talk pages

Exclusion compliant (Yes/No): No

Function details: Similar to previous tasks, such as task 21, fixing lint issues. This task will fix the Old behaviour of link-wrapping font tags lint issues. While working through pages, additions will be made to the JavaScript code as additional combinations of the font tag and wikilinks are found.

Discussion

Shouldn't this actually be a supervised task? There are lots of different cases involved that would make it difficult as an automatic task. I tested User:WOSlinkerBot/linttask22.js against a variety of cases here. It had one false positive of replacing <font color="#AFFFF">[[User:abc|abc]]</font> with [[User:abc|<span style="color:#AFFFF;">abc</span>]]. You can add more regexes to catch the ones it skipped. ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ (talk) 18:41, 30 March 2022 (UTC)[reply]

I support this task, of course, but I think Malnadach is correct. I think instead of \#[a-f0-9]*, the bot op may need to specify only three or six characters in the font color. It should probably skip any other number of characters. I also recommend a supervised task with inspection, or at least starting with known signatures, as collected here. – Jonesey95 (talk) 19:27, 30 March 2022 (UTC)[reply]
I've changed the \#[a-f0-9]* to 3 and 6 fixed length versions. -- WOSlinker (talk) 21:47, 30 March 2022 (UTC)[reply]
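The fixed-length colour constraint discussed above can be illustrated with a sketch (Python here rather than the bot's JavaScript; this is a simplified stand-in for linttask22.js, not the actual code): only 3- or 6-digit hex colours are rewritten, so a malformed value like #AFFFF is left untouched.

```python
import re

# Simplified stand-in for the linttask22.js fix: rewrite
# <font color="#hex">[[Page|label]]</font> into
# [[Page|<span style="color:#hex;">label</span>]], accepting only
# 3- or 6-digit hex colours so malformed values are skipped.
FONT_LINK = re.compile(
    r'<font color="(#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}))">'
    r'\[\[([^|\]]+)\|([^\]]+)\]\]</font>'
)

def fix_font_wrapped_link(wikitext: str) -> str:
    return FONT_LINK.sub(
        r'[[\2|<span style="color:\1;">\3</span>]]', wikitext)
```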

Aidan9382-Bot

Operator: Aidan9382 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 08:09, Wednesday, March 23, 2022 (UTC)

Function overview: Replace clear-cut cases of improperly used "|format=" in citations

Automatic, Supervised, or Manual: Supervised

Programming language(s): Python

Source code available: Not for now

Links to relevant discussions (where appropriate):

Edit period(s): one time run

Estimated number of pages affected: A couple hundred, as there are about 1200 pages in the related category, but only a small number are actually clear-cut

Namespace(s): Mainspace

Exclusion compliant (Yes/No): No (only edits mainspace)

Function details: For now, it will simply replace clear-cut cases of misused |format= without a URL (Paperback, e-book, etc.) by swapping the field from format to type. It applies citation-error edits to only mainspace, so I don't imagine exclusion compliance is going to be needed, but I may be mistaken.
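The clear-cut swap can be sketched as follows; the whitelist and helper names are illustrative assumptions, not Aidan9382-Bot's actual (unpublished) code:

```python
import re

# Illustrative whitelist of clear-cut physical/media descriptors;
# the bot's real rule set is not published in this BRFA.
CLEAR_CUT = {"print", "hardback", "paperback", "e-book", "novel",
             "newspaper", "magazine"}

def swap_format_to_type(template: str) -> str:
    """Rename |format= to |type= in a citation template without a URL,
    but only for whitelisted, unambiguous values."""
    if "|url=" in template:
        return template          # |format= is legitimate with a URL
    def repl(m):
        value = m.group(1).strip()
        if value.lower() in CLEAR_CUT:
            return f"|type={value}"
        return m.group(0)        # leave ambiguous values alone
    return re.sub(r"\|\s*format\s*=\s*([^|}]+)", repl, template)
```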

Discussion

If there are only a "small amount" that are straight-forward as claimed, this might be better for a quick AWB run, no? Primefac (talk) 12:13, 27 March 2022 (UTC)[reply]

I'm relatively new to Wikipedia, and I wasn't aware of AWB. If you think an AWB run would be better, feel free to reject this and go for that instead. I just didn't feel like doing all the clear-cut cases manually, and didn't realise AWB existed. Aidan9382 (talk) 17:47, 27 March 2022 (UTC)[reply]
@Aidan9382: What would you consider a "clear-cut" case? Curious the sort of rule set you built for this. Could you give some examples? --TheSandDoctor Talk 17:23, 3 April 2022 (UTC)[reply]
@TheSandDoctor: Sorry for the late-ish reply; I was having an internal conflict on whether or not to withdraw this (I've decided against it for now). I'm going to consider fixing these terms:
  • e-Book / Google e-Book / Kindle e-Book (You get the point)
  • Print
  • Hardback / Paperback
  • DVD / Blu-ray (I may consider replacing format with medium rather than type here. Functionally it's the same, but it may make more sense to people)
  • Novel
  • Newspaper / Magazine
(I may come up with more later, but these are my main ideas right now)
(Note: The relevant category is Category:CS1_errors:_format_without_URL; I just didn't realise how to reference it properly when submitting)
If you have any other suggestions, do say, but these are the most common simple mistakes I see when going through the category. Thanks. Aidan9382 (talk) 11:27, 4 April 2022 (UTC)[reply]

TolBot 13B

Operator: Tol (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 19:15, Friday, March 18, 2022 (UTC)

Function overview: Creates redirects for tennis articles

Automatic, Supervised, or Manual: supervised

Programming language(s): Python

Source code available: simple mass creation

Links to relevant discussions (where appropriate):

Edit period(s): one-time

Estimated number of pages affected: 531 (reduced from 2,175; new redirects; no article changes)

Namespace(s): (Article)

Exclusion compliant (Yes/No): not applicable

Function details: Creates redirects based on this list (permalink), from the first link to the second link in each list item.

Discussion

  • This is a followup to the moves done by Tolbot 13A, where we moved all the non-redirects, but probably should have moved redirects, too, such that when we updated the case in links we wouldn't have created redlinks. But since we did create redlinks in some cases, where redirects only existed in the wrong-case versions, the easy fix is just to create the right-case versions of all those wrong-case redirects (it's not clear which ones we actually need, so let's do them all). So I requested and support this BRFA. Dicklyon (talk) 20:37, 18 March 2022 (UTC)[reply]
    Just to better understand the purpose of this BRFA: what's an example diff showing where some of these (current redlink) redirects would've been necessary? ProcrastinatingReader (talk) 17:41, 21 March 2022 (UTC)[reply]
    Here's a diff where the editor who found and reported the problem on my talk page did manual fixes by bypassing the redirects. There are an unknown number of these (probably hundreds). Dicklyon (talk) 02:00, 23 March 2022 (UTC)[reply]
    Could we use "WhatLinksHere" in mainspace to check there are actually incoming links before creating the redirect? e.g. with Special:WhatLinksHere with 2008 Canada Masters – Men's doubles (one instance of which was fixed in the diff you link) we see there is an incoming mainspace link at 2008 Tennis Masters Cup, so there's a problematic redlink there. A lot of the pages don't have any incoming links though; would prefer to see a more targeted task than the creation of 2175 redirects. ProcrastinatingReader (talk) 11:41, 23 March 2022 (UTC)[reply]
    @ProcrastinatingReader, I could trim the list with such checks. I'll put something together to remove titles with no incoming links from the list. Tol (talk | contribs) @ 13:42, 23 March 2022 (UTC)[reply]
    If we're going to go to the trouble to distinguish the ones that are unused, we should delete them, rather than leave redirects with only the incorrect capitalization. Dicklyon (talk) 16:12, 23 March 2022 (UTC)[reply]
    See original report/discussion at User talk:Letcord#Inquiry about Doubles → doubles. Dicklyon (talk) 06:15, 23 March 2022 (UTC)[reply]
  • I find myself mostly agreeing with PR on this one - if there isn't a reason other than "because they might be used in the future" to create over 2k redirects, I'm not overly inclined to push this task. Primefac (talk) 12:28, 27 March 2022 (UTC)[reply]
    OK, let's just create the ones that are used, and delete the wrong-case other ones then. Dicklyon (talk) 15:42, 28 March 2022 (UTC)[reply]
    @Tol, Qwerty284651, and Letcord: To get this done, someone will need to compile the list of which redirects are needed, and which should be deleted. I don't know how to do that. Dicklyon (talk) 23:17, 1 April 2022 (UTC)[reply]
    But hold off a few days, as the link case fixing we're doing is creating a need for more that are presently not needed (e.g. multiple years of these two I just fixed: 2009 Internazionali BNL d'Italia – Men's doubles, 2007 Internazionali BNL d'Italia – Men's singles). Dicklyon (talk) 23:39, 1 April 2022 (UTC)[reply]
    @Dicklyon, I've reduced my list down to only those red links that have been/will be created during your run updating all the link casings, from 2175 down to 531. There'll be no need to hold off a few days as this is the complete list. Letcord (talk) 00:10, 2 April 2022 (UTC)[reply]
  • @ProcrastinatingReader and Tol: We are done preparing the pruned list of used/redlinked redirects needed, at User:Letcord/sandbox/TennisRedirects. PTAL. Dicklyon (talk) 23:41, 2 April 2022 (UTC)[reply]
    Dicklyon, looks good. I've revised the task description accordingly. Tol (talk | contribs) @ 23:49, 2 April 2022 (UTC)[reply]
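The pruning step discussed above (keep only candidate titles that have at least one incoming mainspace link) can be sketched; the helper names are illustrative, and the counts could come from the API query shown or from a database dump:

```python
import urllib.parse

# Illustrative sketch, not Tol's actual code: build a WhatLinksHere
# API query restricted to mainspace, plus a pure filter over counts.
API = "https://en.wikipedia.org/w/api.php"

def backlinks_query_url(title: str) -> str:
    """MediaWiki API query for mainspace (ns 0) backlinks to a title."""
    params = {
        "action": "query", "list": "backlinks", "bltitle": title,
        "blnamespace": "0", "bllimit": "1", "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def prune(candidates, backlink_counts):
    """Keep only titles with at least one incoming mainspace link."""
    return [t for t in candidates if backlink_counts.get(t, 0) > 0]
```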

ButlerBlogBot

Operator: Butlerblog (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:03, Monday, March 7, 2022 (UTC)

Function overview: Remove "name=" parameter from {{Infobox television}} for pages where this value matches the {{PAGENAMEBASE}} value (to handle the Category:Pages using infobox television with unnecessary name parameter maintenance category).

Automatic, Supervised, or Manual: Automatic

Programming language(s): AutoWikiBrowser

Source code available: AWB

Links to relevant discussions (where appropriate):

Edit period(s): Daily, until maintenance category is reduced to a manageable number.

Estimated number of pages affected: ~32,000

Namespace(s): Mainspace/Articles

Exclusion compliant (Yes/No): Yes

Function details: An AWB regex to remove instances of |name= in {{Infobox television}} when page is in Category:Pages using infobox television with unnecessary name parameter.

My main account is already enabled for AWB, and I have been successfully running this regex on the maintenance category manually. My manual runs included AWB genfixes, but I would not run genfixes in automatic (depending on recommendations in discussions).
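In plain Python, the removal described above might look like the following sketch; it is a simplified stand-in for the AWB regex (which is not shown in this BRFA), and the helper names are illustrative:

```python
import re

# Simplified stand-in for the AWB rule: drop |name= from
# {{Infobox television}} when its value equals the page's base name
# (the title minus any " (disambiguator)" suffix, i.e. PAGENAMEBASE).
def pagenamebase(title: str) -> str:
    return re.sub(r" \([^)]*\)$", "", title)

def strip_redundant_name(wikitext: str, title: str) -> str:
    base = pagenamebase(title)
    pattern = re.compile(
        r"(\{\{Infobox television\b[^}]*?)\|\s*name\s*=\s*"
        + re.escape(base) + r"\s*(?=\||\}\})",
        re.IGNORECASE,
    )
    return pattern.sub(r"\1", wikitext)
```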

Discussion

Two things of note: first, there is a discussion about whether this is an appropriate task, see Template_talk:Infobox_television#Bot_needed. Second, if the consensus there is that this is an appropriate job for 32k essentially cosmetic edits (you can see where I'm falling on this side of the debate...), my bot is already approved to run it. In other words, this task is not necessary. Primefac (talk) 14:23, 7 March 2022 (UTC)[reply]

Since you're right, I thought about withdrawing this request. However, after some thought, I decided to leave it for consideration as it is my desire to expand my skills in this area. This task is a basic step to feeling more confident taking regexes I have used manually and moving them to an automated process (the regex for this task is already being used via AWB by me manually). It doesn't hurt my feelings if it's denied on the basis of "already exists in another bot", but it would be a confidence builder if approved. ButlerBlog (talk) 13:52, 8 March 2022 (UTC)[reply]

On hold. Pending the outcome of Template_talk:Infobox_television#Bot_needed. Primefac (talk) 10:29, 10 March 2022 (UTC)[reply]

Gaelan Bot 2

Operator: Gaelan (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 12:07, Monday, February 7, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): JS, Rust

Source code available: a bit of a mess at the moment, but happy to publish on request

Function overview: On file pages, remove {{fair use rationale}} and friends for pages that no longer use that file.

Links to relevant discussions (where appropriate): Wikipedia:Bot_requests#Remove_redundant_FURs_from_file_pages

Edit period(s): one time run for now

Estimated number of pages affected: <5,842

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): No

Function details: Many file pages include fair-use rationales that are no longer necessary. For example, File:AppleIIGSOS.png has a FUR for Palette (computing), but that article doesn't actually use that image. This bot finds those cases as follows:

  1. A xml dump and parse_wiki_text are used to find all File: pages containing one of these templates with an Article parameter.
  2. This is cross-checked against the results of this query (and a list of redirects, also extracted from the xml dump) to find which FURs are unused.

The resulting data is here. I've hand-checked a few dozen of these, and they seem fine. There are some cases like File:ContinentalSquare.JPG which have an FUR that accidentally links to the wrong article (the FUR links to Continental Center instead), which'll get removed by this bot. My thinking is that this is fine - it should get flagged up as having no FUR, and someone can rescue it from history? Not sure.
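The cross-check in steps 1-2 can be sketched as follows; the data structures are illustrative, and the actual dump-parsing code is not shown:

```python
# Illustrative sketch of the cross-check in steps 1-2: given FURs
# extracted from the dump and the actual file-usage table, list
# (file, article) pairs whose rationale no longer matches a use.
def unused_furs(fur_targets, image_usage, redirects):
    """fur_targets: {file: set of Article parameter values}
    image_usage: {file: set of articles actually using the file}
    redirects:   {redirect title: target title}"""
    stale = []
    for file, articles in fur_targets.items():
        used = image_usage.get(file, set())
        for article in articles:
            # follow a redirect before declaring the FUR unused
            resolved = redirects.get(article, article)
            if article not in used and resolved not in used:
                stale.append((file, article))
    return stale
```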

The actual editing part of the bot isn't implemented yet, but it should just consist of using pywikibot or mwn to loop over the JSON linked above, double check that the FUR still exists and is unused (as I'm working with a dump that's a week old at this point) and remove it.

For now, this'll just be a one-time run; I'd like to figure out an efficient way to run it continuously, but I'll file a new BRFA when we get to that point.

Discussion

My thinking is that this is fine - it should get flagged up as having no FUR, and someone can rescue it from history? But will someone? Or will some other bot or bot-like human come along and tag it for lacking a FUR? Keep in mind that fair use bots have historically been highly controversial. To what extent that was inherent in the task versus was due to the attitude of the operator I don't know, but people still may be touchy about the whole idea. It might be safer to limit the initial version to just those images that will remain fully FURred after the bot's edit, and to tag images also lacking a needed FUR for human attention (or ignore them for now). Anomie 12:48, 7 February 2022 (UTC)[reply]

P.S. If you do get to the point of wanting to run it continuously, the fact that the current version gives time for humans to revert vandalism that may have removed the images from the articles (by working from a week-old dump and only removes the FURs that were unused then and are unused "now") is a good thing that should be preserved. Anomie 12:48, 7 February 2022 (UTC)[reply]

I'm uncomfortable with the idea of a bot removing fair use rationales at all, but at minimum it must account for vandalism (it seems to do this) and page moves, mergers and splits (I'm not certain it attempts this). In the case of page moves and at least some mergers, the bot should follow any redirects and update the FUR if the image is used on the target page. If a redirect has been nominated at RfD then the bot should still follow the redirect - while most redirects from moves should be kept, there are occasional exceptions and there is going to be a large overlap between editors who don't know they are usually kept and those who don't know that FURs will need updating. I don't know how splits can be automatically detected. If there is consensus to remove FURs that are unusued, it would be much better for the bot to move them to the talk page with an explanation, perhaps something like:
"On <date> Gaelan Bot found that this image was not in use on the article(s) listed in the template below.
  • If the image has been restored or moved to a different article or title and the file page has no Free Use Rationale (FUR) for the current location, you should either move the template below back to the file page and update it appropriately or write a new FUR.
  • If the file page does contain a FUR for all current uses there is no action you need to take.
If you think the bot got something wrong, please leave a message with details at <preferred contact location>."
Thryduulf (talk) 10:49, 23 February 2022 (UTC)[reply]
I am not seeing a lot of confidence from those who have commented that this will be able to effectively deal with the issue presented at the BOTREQ without creating too many false positives and situations where images might be improperly altered after this removal. If these issues can be accounted for (noting that the bot operator has yet to respond to any of the above comments) then discussion can go further, but at the moment I am leaning towards declining this. Primefac (talk) 13:58, 27 February 2022 (UTC)[reply]
Hi. Sorry, Real Life has been a lot these past few weeks. I'll try to come back to this soon, but some quick notes:
  • It might be safer to limit the initial version to just those images that will remain fully FURred after the bot's edit: This is a good idea. If we went ahead with this, I'd limit it to files that have at least one other FUR, and maybe separately maintain a list of pages that seem to have no valid FURs.
  • In the case of page moves and at least some mergers, the bot should follow any redirects and update the FUR if the image is used on the target page. This is partially done: if a FUR refers to a redirect, that redirect is followed, and the destination page considered. I wasn't planning on updating the FUR to link to the redirect target, but that might be a good idea at some point.
One other possibility—and I haven't looked at the data to see how many useful edits this would exclude—is to only remove FURs when every current usage of the file is covered by an existing FUR. That (along with the anti-vandalism delay) should remove most of the issues with removing useful FURs—if every existing usage is covered, further FURs are (I think) pretty clearly not useful. Gaelan 💬✏️ 23:30, 28 February 2022 (UTC)[reply]
One other possibility—and I haven't looked at the data to see how many useful edits this would exclude—is to only remove FURs when every current usage of the file is covered by an existing FUR. That's what I was suggesting by saying "fully FURred". It's easier to start small and expand than to start too big then have to fight pushback. Anomie 12:50, 1 March 2022 (UTC)[reply]
  • @Gaelan: What is the status of this request? Following up as it's been over a month since the last activity on this page. Are you still wanting to pursue this? --TheSandDoctor Talk 17:09, 3 April 2022 (UTC)[reply]

ZabesBot

Operator: Zabe (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 22:43, Saturday, January 15, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available: pywikibot interwikidata.py script

Function overview: cleaning up old interwikilinks

Links to relevant discussions (where appropriate):

Edit period(s): one time run

Estimated number of pages affected: Not counted; probably between a few hundred and a few thousand

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: There are still quite a few articles that contain old interwiki links. I would like to clean them up. The bot basically goes through all pages and removes the interwiki links if the article is linked in the corresponding Wikidata item and there are no conflicts with the interwiki links.
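A simplified sketch of that rule follows; the real interwikidata.py also validates language prefixes against the site's known language list, whereas this stand-in just pattern-matches on lowercase prefixes:

```python
import re

# Simplified stand-in for interwikidata.py: drop an old-style interwiki
# link like [[de:Titel]] only when the same sitelink already exists on
# the page's Wikidata item (passed in here as {lang: title}); anything
# with an uppercase prefix (e.g. [[Category:X]]) is never touched.
INTERWIKI = re.compile(r"\[\[([a-z\-]{2,12}):([^\]]+)\]\]\n?")

def remove_migrated_interwikis(wikitext, wikidata_sitelinks):
    def repl(m):
        lang, title = m.group(1), m.group(2)
        if wikidata_sitelinks.get(lang) == title:
            return ""            # already on Wikidata: safe to remove
        return m.group(0)        # conflict: keep for human review
    return INTERWIKI.sub(repl, wikitext)
```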

Discussion

Could you give an example or three of what edits you will be making with your bot? (please do not ping on reply) Primefac (talk) 09:31, 16 January 2022 (UTC)[reply]

  • Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT 12:17, 16 January 2022 (UTC)[reply]

I made 3 test edits ([1], [2], [3]). --Zabe (talk) 12:18, 16 January 2022 (UTC)[reply]

I apologise but I should have asked this the first time - how are you finding these and could you give a slightly more accurate indication of how many edits are being planned? Primefac (talk) 19:05, 16 January 2022 (UTC)[reply]
I am fairly simply going through all pages in the article namespace. If that is a bad approach, I guess I could use an XML dump. --Zabe (talk) 21:39, 16 January 2022 (UTC)[reply]
For the number of edits, I need to query a bit. --Zabe (talk) 21:41, 16 January 2022 (UTC)[reply]
Zabe, have you had an opportunity to determine what sort of scale of edits we're looking at? Primefac (talk) 14:01, 13 February 2022 (UTC)[reply]
No, feel free to mark this request as stalled or so until I find some time to do that. --Zabe (talk) 17:00, 15 February 2022 (UTC)[reply]

On hold. Pending numbers-crunching. Primefac (talk) 12:39, 16 February 2022 (UTC)[reply]

ElliBot

Operator: Elliot321 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:46, Saturday, January 23, 2021 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s):

Source code available:

Function overview: Automatically apply {{redirect category shell}} templates to redirects with Wikidata, and remove redundant {{Wikidata redirect}} templates.

Links to relevant discussions (where appropriate):

Edit period(s): one time run

Estimated number of pages affected: 50,000-100,000

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: I recently modified {{Redirect category shell}} to automatically detect Wikidata links and apply the template {{Wikidata redirect}} if they exist. This was previously already done with protection levels, and there is no reason that {{Wikidata redirect}} should not also be applied.

There are currently 100,000 redirects in the category Category:Redirects connected to a Wikidata item, which is applied by the software. There are currently 30,000 redirects in the category Category:Wikidata redirects. Nearly all of these were put into that category by applying {{Wikidata redirect}} manually, meaning they will need the tag removed (as it will be a duplicate).

Many of the remaining 70,000 pages will need the template {{rcat shell}} added. As the change to {{Redirect category shell}} was recent, many redirects connected to Wikidata items, without {{Wikidata redirect}}, but with {{Redirect category shell}}, have not been added to Category:Wikidata redirects. The difference in count between Category:Wikidata redirects and Category:Redirects connected to a Wikidata item is the number of pages that will be modified.

The edits will be carried out with AWB running as an automated bot. There is very low risk of disruption in this task, though the number of edits is significant. Using AWB, this bot can also carry out other generic fixes to redirects, though this is not a significant part of its functions.

A somewhat similar failed request was Wikipedia:Bots/Requests for approval/TomBot, but that request was for a bot that would edit ~30-60x more pages, with less benefit overall. This is a much narrower and more useful request.
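The two edits described above can be sketched as follows; this is a simplified stand-in (not the actual AWB configuration), and note the empty-shell concern raised in the discussion below:

```python
import re

# Simplified stand-in for the two edits: strip a standalone
# {{Wikidata redirect}} when the shell is already present (the shell
# now transcludes it automatically), or otherwise swap the bare tag
# for the shell, which auto-detects the Wikidata link.
WD = re.compile(r"\{\{\s*Wikidata redirect\s*\}\}\n?", re.IGNORECASE)
SHELL = re.compile(r"\{\{\s*Redirect category shell\s*[\|\}]",
                   re.IGNORECASE)

def update_redirect(wikitext: str) -> str:
    if SHELL.search(wikitext):
        # shell already present: the manual tag is now redundant
        return WD.sub("", wikitext)
    if WD.search(wikitext):
        return WD.sub("{{Redirect category shell}}\n", wikitext, count=1)
    return wikitext
```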

Discussion

  1. Any prior discussions on doing this that you're aware of, which establish broader consensus for this task?
  2. Will this BRFA cause Template:Wikidata redirect to become redundant? If I understand correctly, this task will orphan all of its transclusions? If so, and especially if there's no prior discussion, I suggest sending that template to TfD (and then this bot task can be technically implementing that TfD). That would be one way to test the wider consensus for this task, too.

ProcrastinatingReader (talk) 16:01, 23 January 2021 (UTC)[reply]

There are no discussions I know of establishing consensus around this particular task. {{Wikidata redirect}} will not become redundant, for two reasons. First, {{redirect category shell}} transcludes it (though this usage could be subst, of course). The other usage is the "soft" one, in cross-wiki (such as to Wiktionary) and category redirects. The "hard" usage could be deprecated from the template, however (they are implemented slightly differently, with an automatic switch). Elliot321 (talk | contribs) 16:20, 23 January 2021 (UTC)[reply]
To begin with, I'd say stylistically this presentation is inferior. See e.g. here. The one on the top (caused by the edit) doesn't look as good as the manual one and looks slightly out of place with the plaintext.
If the rcat shell has to be manually added by bot, is there really a point to this? Why not have a bot add {{Wikidata redirect}} to pages in Category:Redirects connected to a Wikidata item? ProcrastinatingReader (talk) 00:39, 24 January 2021 (UTC)[reply]
Sorry - that was due to my changes being misunderstood and reverted. If you check now, you can see the way they were intended to look.
The reason for adding {{redirect category shell}} over {{wikidata redirect}} is for automatic detection. If the link on Wikidata is removed, no update on Wikipedia is necessary (likewise, if a link on Wikidata is added to one using the shell, no update is necessary). Elliot321 (talk | contribs) 07:52, 24 January 2021 (UTC)[reply]
Okay, makes sense. I'd suggest dropping a link to this BRFA from the template talk pages for the two templates, to allow some time for comments. ProcrastinatingReader (talk) 09:37, 24 January 2021 (UTC)[reply]
Done. Elliot321 (talk | contribs) 10:08, 24 January 2021 (UTC)[reply]

So the idea is that {{RCAT shell}} should add the Wikidata box by checking for the connected item. Manually adding the template wouldn't be necessary then because the software can already detect if a page is connected to a Wikidata item. Is that correct? --PhiH (talk) 13:20, 24 January 2021 (UTC)[reply]

@PhiH: pretty much. The shell will automatically detect a link to Wikidata, and if found, transclude the template. Therefore, this bot will remove the redundant manual transclusions of the template, and add the shell to automatically transclude on any redirect linked to Wikidata. Elliot321 (talk | contribs) 15:36, 24 January 2021 (UTC)[reply]

If I understand what changed with {{wikidata redirect}} and {{redirect category shell}} correctly, redirects that only have {{wikidata redirect}} will be changed to an empty {{redirect category shell}}, which then results in an error. This means that manual inspection is needed to determine another redirect category to apply, which obviously this Bot task cannot do. —seav (talk) 01:02, 25 January 2021 (UTC)[reply]

FYI, an empty Rcat shell results in sorting the redirect to the Miscellaneous redirects category, which is monitored by editors who will then tag the redirect with appropriate categories. P.I. Ellsworth  ed. put'r there 03:41, 25 January 2021 (UTC)[reply]
Would that be a problem then, Paine Ellsworth? Filling the cat up with some tens of thousands of pages? ProcrastinatingReader (talk) 08:12, 25 January 2021 (UTC)[reply]
An empty RCAT shell with a Wikidata item doesn't need to be categorised in Category:Miscellaneous redirects because it generates the Template:Wikidata redirect. I didn't check if that is implemented yet. A page with that template and no Wikidata item is a problem as well. They just move from one tracking category to another. --PhiH (talk) 08:44, 25 January 2021 (UTC)[reply]
Why doesn't it need to be categorised into misc redirects? Having a Wikidata item connected/existing isn't really an explanation of why there's a redirect on enwiki. Surely the redirect still needs to be categorised? ProcrastinatingReader (talk) 09:06, 25 January 2021 (UTC)[reply]
@PhiH: @ProcrastinatingReader: the {{redirect category shell}} template should not be applied without any categories by a bot as the Category:Miscellaneous redirects should be filled manually. Consequently, I don't plan to do that with this bot. I can manually categorize the redirects that do not have any categories.
(though, a tracking category for uncategorized redirects that can be applied by a bot would probably be useful. I don't feel like gaining consensus for that, though, as that would likely be much more contentious than this proposal) Elliot321 (talk | contribs) 11:37, 25 January 2021 (UTC)[reply]
I think my point is easier to demonstrate with an example, or I’m mistaken about exactly what is proposed here. Can you make 5 edits as a demonstration, either with the bot or by hand if you don’t have the bot coded yet? ProcrastinatingReader (talk) 12:10, 25 January 2021 (UTC)[reply]
Maybe a page demonstrating what changes would be made would be more useful, since there are a few differing cases here? Elliot321 (talk | contribs) 03:46, 26 January 2021 (UTC)[reply]
An actual edit or two of each case would be preferable, as that's the least confusing way to see what is actually proposed. ProcrastinatingReader (talk) 09:57, 26 January 2021 (UTC)[reply]
But you want to add an empty RCAT shell to pages that currently only use {{Wikidata redirect}}, don't you? Should they be added to Category:Miscellaneous redirects if they are connected to a Wikidata item or not? --PhiH (talk) 12:32, 25 January 2021 (UTC)[reply]
I can manually categorize the pages currently in that situation. Elliot321 (talk | contribs) 03:46, 26 January 2021 (UTC)[reply]
It's not about manual categorisation. We are discussing whether the category should be added by the RCAT shell when the redirect is connected to a Wikidata item. --PhiH (talk) 14:19, 26 January 2021 (UTC)[reply]
To ProcrastinatingReader: as long as there is at least one rcat template within the Rcat shell, such as the "Wikidata redirect" template, then the redirect would not be sorted to Category:Miscellaneous redirects. As the proposer suggests, that would not be a problem. The proposer appears to know that a bot should not add an empty Rcat shell to redirects, which would bloat the Misc. redirects category. P.I. Ellsworth  ed. put'r there 15:35, 25 January 2021 (UTC)[reply]

I think there are multiple cases we have to discuss, feel free to comment below.

  1. Redirects that already use the RCAT shell
    This should be uncontroversial: Where {{Wikidata redirect}} is used it gets removed and the template is transcluded by the RCAT shell.
  2. Redirects without the RCAT shell…
    1. …that use {{Wikidata redirect}} and are connected to a Wikidata item
      The template gets removed and the RCAT shell is added. Should the RCAT shell be programmed to add these pages to Category:Miscellaneous redirects?
    2. …that don't use {{Wikidata redirect}} and are connected to a Wikidata item
      The RCAT shell is added. Same question as above arises.
    3. …that use {{Wikidata redirect}} and are not connected to a Wikidata item
      The template gets removed. Adding the RCAT shell would cause them to be added to Category:Miscellaneous redirects.
      Currently these pages are tracked in Category:Unlinked Wikidata redirects. Before this bot task begins someone should work through this list and add the pages on Wikidata if necessary.

--PhiH (talk) 14:46, 26 January 2021 (UTC)[reply]

If I understand correctly, this bot will add the Rcat shell along with an internal {{Wikidata redirect}} tag when it senses any redirect that is already itemized on Wikidata. If that is what happens, then the redirect will not be sorted to the Misc. redirects category. I also sense a possible challenge where the {{NASTRO comment}} template is applied. One of many examples is at 3866 Langley. Would this bot do anything to those many redirects? I actually like the idea of more Rcat shell transclusions. I wonder if the bot will continue checking for new redirects that become connected to a Wikidata item? P.I. Ellsworth  ed. put'r there 21:57, 26 January 2021 (UTC)[reply]

The bot won't touch the {{NASTRO comment}} redirects, since it has no need to.
I could run this after the main clean-up job (probably a weekly run would be sufficient). Elliot321 (talk | contribs) 05:25, 27 January 2021 (UTC)[reply]
NASTRO comment applies the Rcat shell and so would auto-apply the Wikidata redirect template. There will then be two renditions of Wikidata redirect. Won't one of them have to be removed? P.I. Ellsworth  ed. put'r there 18:49, 27 January 2021 (UTC)[reply]
I thought the point of this bot is that these edits wouldn't be necessary anymore. Here you said If the link on Wikidata is removed, no update on Wikipedia is necessary (likewise, if a link on Wikidata is added to one using the shell, no update is necessary) --PhiH (talk) 19:00, 27 January 2021 (UTC)[reply]
A weekly run would handle any new redirects that have been created. Editors LOVE to create new redirects; however, they generally leave it up to bots and Wikignomes to categorize their new redirects. P.I. Ellsworth  ed. put'r there 17:11, 28 January 2021 (UTC)[reply]
That sounds reasonable. I hadn't thought about completely new pages. --PhiH (talk) 17:29, 28 January 2021 (UTC)[reply]
@PhiH: "Redirects that already use the RCAT shell: This should be uncontroversial": Have you thought about the cases where the rcat shell only contains the Wikidata rcat? 𝟙𝟤𝟯𝟺𝐪𝑤𝒆𝓇𝟷𝟮𝟥𝟜𝓺𝔴𝕖𝖗𝟰 (𝗍𝗮𝘭𝙠) 21:25, 16 February 2021 (UTC)[reply]

Also curious as to why AWB is used? Don't get me wrong; I love AWB. However it's not known for its speed or lack of clunkiness. According to the manual, ...any edit to the bot's talk page will halt the bot. Before restarting the bot, the bot operator must log in to the bot account and visit the bot's talk page, so that the "new messages" notification is cancelled. So why not make a non-AWB bot to do the task? P.I. Ellsworth  ed. put'r there 22:14, 26 January 2021 (UTC)[reply]

Mainly because I know AWB and regex better than I know any other frameworks to interface with Wikipedia. I could write custom code, if that would be preferred. Elliot321 (talk | contribs) 05:26, 27 January 2021 (UTC)[reply]
I was just curious, so it would be up to you, of course. I just know how it drives me crazy sometimes when I have to stop in the middle of something, log out of AWB, log in to Wikipedia just to check notifications, log back out and into AWB to commence. That happens with non-bot-auto work as well. P.I. Ellsworth  ed. put'r there 18:53, 27 January 2021 (UTC)[reply]
Um... you don't have to log out of AWB to reset the counter. Also, technically you don't have to log out of Wikipedia either if you log in to the bot account in private mode. Primefac (talk) 13:44, 8 March 2021 (UTC)[reply]

So just to clarify what I'm waiting on: An actual edit or two of each case would be preferable, as that's the least confusing way to see what is actually proposed. After that, it'll be more clear to have the discussion on which edits are good and which need further discussion. ProcrastinatingReader (talk) 15:32, 13 February 2021 (UTC)[reply]

Re: the message above. Primefac (talk) 13:44, 8 March 2021 (UTC)[reply]
@Primefac and ProcrastinatingReader: thanks for the ping. I've actually expanded the scope of what I'm intending to do here a bit (see User:Elli/rcat standardization) - and planning on getting consensus for the changes elsewhere first, before going through with this, so if I could put this request on hold or something that would be ideal (sorry, I'm not sure exactly how this type of situation works, but getting approval for the narrow task of shelling and removing one rcat doesn't really make sense given my goal to deal with ~20 of them). Elli (talk | contribs) 18:38, 8 March 2021 (UTC)[reply]
On hold. Just comment out the template when you're ready to go (no harm in having it sit here for a while). Primefac (talk) 19:29, 8 March 2021 (UTC)[reply]
This very productive discussion probably should have happened somewhere else prior to the BRFA being filed. Maybe this BRFA should be withdrawn pending a full discussion and manual demonstration of various test cases, and then resubmitted with a link to that discussion and a better explanation of the task. – Jonesey95 (talk) 02:25, 7 December 2021 (UTC)[reply]
@Elli: Just to follow up here: do you still intend to go through with this BRFA? ProcrastinatingReader (talk) 01:07, 8 November 2021 (UTC)[reply]
@ProcrastinatingReader: yes, just been busy with other stuff and not completely sure of the technical details yet. I'll try to follow up on that soon-ish. Elli (talk | contribs) 01:10, 8 November 2021 (UTC)[reply]
  1. it keeps Qids ("Q" number), thus we can track if someone moves the redirect to another wikidata entry.
  2. it does not mess with RCAT, which is only for categories. Heanor (talk) 18:20, 8 February 2022 (UTC)[reply]

Bots in a trial period

BareRefBot

Operator: Rlink2 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 21:35, Thursday, January 20, 2022 (UTC)

Function overview: The function of this bot is to fill in bare references. A bare reference is a reference with no information about it included in the citation; for example, <ref>https://wikipedia.org</ref> instead of <ref>{{cite web | url = https://encarta.microsoft.com | title = Microsoft Encarta}}</ref>. More detail can be found at Wikipedia:Bare_URLs and User:BrownHairedGirl/Articles_with_bare_links.

Automatic, Supervised, or Manual: Automatic; mistakes will be corrected as it goes.

Programming language(s): Multiple.

Source code available: Not yet.

Links to relevant discussions (where appropriate): WP:Bare_URLs, but citation bot already fills bare refs, and is approved to do so.

Edit period(s): Continuous.

Estimated number of pages affected: around 200,000 pages, maybe less, maybe more.

Namespace(s): Mainspace.

Exclusion compliant (Yes/No): Yes.

Function details: The purpose of the bot is to provide a better way of fixing bare refs. As explained by Enterprisey, our citation tools could do better. Citation bot is overloaded, and Reflinks consistently fails to get the title of the webpage. ReFill is slightly better but is very buggy due to architectural failures in the software pointed out by the author of the tool.

As evidenced by my AWB run, my script can get the title of many sites that Reflinks, reFill, or Citation Bot cannot get. The tool is like a "booster" for other tools like Citation bot: it picks up where the other tools leave off.

There are a few exceptions for when the bot will not fill in the title. For example, if the title is shorter than 5 characters, it will not fill it in, since it is highly unlikely that the title has any useful information. Twitter links will be left alone, as the Sand Doctor has a bot that can do a more complete filling.

There has been discussion over the "incompleteness" of the filling of these refs. For example, the bot wouldn't fill in the "work="/"website=" parameter unless it's a whitelisted site (NYT, YouTube, etc...). This is similar to what Citation bot does, IIRC. While these other parameters would usually not be filled, the consensus is that "perfect is the enemy of the good" and that any sort of filling represents an improvement to the citation. Any filled cites can always be improved even further by editors or another bot.
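A rough sketch of the whitelist-based filling described above. The bot's source is not yet published, so the function name, the whitelist entries, and the exact output format here are illustrative assumptions only:

```python
from urllib.parse import urlparse

# Hypothetical whitelist of hosts for which |website= may be filled;
# the bot's real list (NYT, YouTube, etc.) is not published.
WEBSITE_WHITELIST = {
    "www.nytimes.com": "The New York Times",
    "www.youtube.com": "YouTube",
}

def build_cite(url, title):
    """Fill a bare ref: always set |title=, and set |website= only
    when the host is whitelisted. Illustrative, not the bot's code."""
    host = urlparse(url).netloc.lower()
    cite = "{{cite web | url = " + url + " | title = " + title
    if host in WEBSITE_WHITELIST:
        cite += " | website = " + WEBSITE_WHITELIST[host]
    return cite + "}}"
```

Under this sketch, non-whitelisted hosts still get a |title=-only cite, which matches the "perfect is the enemy of the good" position above.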


Examples:

Special:Diff/1066367156

Special:Diff/1066364250

Special:Diff/1066364589


Discussion

Pre-trial discussion

{{BotOnHold}} pending closure of Wikipedia:Administrators'_noticeboard/Incidents#Rlink2. ProcrastinatingReader (talk) 23:25, 20 January 2022 (UTC)[reply]

@ProcrastinatingReader: The ANI thread has been closed. Rlink2 (talk) 15:03, 25 January 2022 (UTC)[reply]

Initial questions and thoughts (in no particular order):

  1. I would appreciate some comments on why Citation Bot is trigger-only (i.e. it will only edit individual articles on which it is triggered) rather than approved to mass-edit any article with bare URLs. Assuming the affected page count is accurate, it seems like there's no active and approved task for this job, and since this seems like a task that's obviously suitable for bot use I'm curious to know why that isn't the case.
  2. How did you come to the figure of 200,000 affected pages?
  3. Exactly which values of the citation template will this bot fill in? I gather that it will fill in |title= -- anything else?

ProcrastinatingReader (talk) 23:25, 20 January 2022 (UTC)[reply]

@ProcrastinatingReader: it's not really accurate to say that Citation bot will only edit individual articles on which it is triggered. Yes it needs to be triggered, but it also has a batch mode, of up to 2,200 articles at a time. In the last 6 months I have used that facility to feed the bot ~700,000 articles with bare URLs.
The reason that Citation bot needs targeting is simply scope. Citation bot can potentially make an improvement to any of the 6.4million articles on Wikipedia, but since it can process only a few thousand per day, it would need about 4 years to process them all. That is why Citation bot needs editors to target the bot at high-priority cases.
By contrast, BareRefBot's set of articles is about 200,000. That's only 3% of the total, and in each case BareRefBot will skip most of the refs on the page (whereas Citation bot processes all the refs, taking up to 10 minutes per page if there are hundreds of refs). The much simpler and more selective BareRefBot can process an article much much faster than Citation bot ... so it is entirely feasible for BareRefBot to process the lot at a steady 10 edits/min running 24/7, in only 14 days (10 X 60 X 24 X 14 = 201,600). It may be desirable to run it more slowly, but basically this job could clear the backlog in a fortnight. Hence no need for further selectivity.
I dunno the source of Rlink2's data, but 200,000 non-PDF bare URLs is my current estimate. I have scanned all the database dumps for the last few months, and that figure is derived from the numbers I found in the last database dump (20220101), minus an estimate of the progress since then. I will get data from the 20220120 dump within the next few days, and will add it here.
Note that my database scans show new articles with bare URLs being added at a rate of about 300 per day. (Probably some are filled promptly, but that's what remains at the end of the month). So there will be ongoing work every month on about 9k–10k articles. Some of that work will be done by Citation bot, which on first pass can usually fill all bare URL refs on about 30% of articles. BareRefBot can handle most of the rest. BrownHairedGirl (talk) • (contribs) 01:40, 21 January 2022 (UTC)[reply]
Numbers of articles. @ProcrastinatingReader: I have now completed my scans of the 20220120 database dump, and have the following headline numbers as of 20220120 :
  • Articles with untagged non-PDF bare URL refs: 221,824
  • Articles with untagged non-PDF bare URL refs in the 20220120 dump which were not in the 20220101 dump: 5,415 (an average of 285 additions per day)
My guesstimate had slightly overestimated the progress since 20220101. However, the 20220120 total of articles with untagged non-PDF bare URL refs is 30,402 lower than the 20220101 total of 252,226. So in 19 days, the total of articles with untagged bare URLs was reduced by just over 12%, which is great progress.
Those numbers do not include refs tagged with {{Bare URL inline}}. That tally fell from 33,794 in 20220101 to 13,082 in 20220120. That is a fall of 20,712 (61%), which is phenomenal progress, and it is overwhelmingly due to @Rlink2's very productive targeting of those inline-tagged bare URL refs.
There is some overlap between the sets of articles with tagged and untagged bare URLs, because some articles have both tagged and untagged bare URL refs. A further element of fuzziness comes from the fact that some of the articles with inline-tagged bare URLs are only to PDFs, which tools cannot fill.
Combining the two lists gives 20220120 total of 231,316 articles with tagged or untagged bare URL refs, including some PDFs. So I guesstimate a total of 230,000 articles with tagged or untagged non-PDF bare URLs refs.
Taking both tagged and untagged bare URL refs, the first 19 days of January saw the tally fall by about 40,000. I estimate that about 25,000 of that is due to the work of Rlink2, which is why I am so keen that Rlink2's work should continue. BrownHairedGirl (talk) • (contribs) 18:06, 22 January 2022 (UTC)[reply]
Update. I now have the data from my scans of the 20220201 database dump
  • Articles with untagged non-PDF bare URL refs: 215,177 (down from 221,824)
  • Articles with untagged non-PDF bare URL refs in the 20220201 dump which were not in the 20220120 dump: 3,731 (an average of 311 additions per day)
  • Articles with inline-tagged bare URL refs: 13,162 (slightly up from 13,082 in 20220120)
So in this 12-day period, the number of tagged+untagged non-PDF bare URLs fell by 6,567. That average net cleanup of 547 per day in late January is way down from over 2,000 per day in the first period of January.
In both periods, I was keeping Citation bot fed 24/7 with bare URL cleanup; the difference is that in early January, Rlink2's work turbo-charged progress. When this bot is authorised, the cleanup will be turbo-charged again. BrownHairedGirl (talk) • (contribs) 20:44, 5 February 2022 (UTC)[reply]
Thank you for the update. Provided everything goes well, we'll be singing the victory polka sooner than we think, meaning we can redirect our attention to bare URL pdfs (yes - I have some ideas of how to deal with PDFs, but let's focus on this right now). Rlink2 (talk) 04:10, 7 February 2022 (UTC)[reply]
@Rlink2: Sounds good.
I also have ideas for bare URL PDF refs. When this bot discussion is finished, let's chew over our ideas on how to proceed. BrownHairedGirl (talk) • (contribs) 09:57, 7 February 2022 (UTC)[reply]
  • Scope. @Rlink2: I ask that PDF bare URLs should be excluded from this task. {{Bare URL PDF}} is a useful tag, but I think that there are better ways of handling PDF bare URLs. I will launch a discussion elsewhere on how to proceed. They are easily excluded in database scans, and easily filtered out of other lists (AWB: skip if page does NOT match the regex <ref[^>]*?>\s*\[?\s*https?://[^>< \|\[\]]+(?<!\.pdf)\s*\]?\s*<\s*/\s*ref\b), so the bot can easily pass by them. --BrownHairedGirl (talk) • (contribs) 02:20, 21 January 2022 (UTC)[reply]
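The skip filter above can be transcribed into Python for testing. AWB uses .NET regex syntax, but this particular pattern (including the fixed-width lookbehind) is also valid in Python's re module; this is a sketch of the filtering logic, not the actual AWB configuration:

```python
import re

# Matches an untagged bare-URL ref whose URL does not end in ".pdf".
# The "skip if page does NOT match" AWB rule keeps only pages where
# this pattern is found somewhere in the wikitext.
BARE_NON_PDF_REF = re.compile(
    r'<ref[^>]*?>\s*\[?\s*https?://[^>< \|\[\]]+(?<!\.pdf)\s*\]?\s*<\s*/\s*ref\b'
)

def has_fillable_bare_ref(wikitext):
    """True if the page contains at least one non-PDF bare URL ref."""
    return BARE_NON_PDF_REF.search(wikitext) is not None
```

The negative lookbehind (?<!\.pdf) is what excludes PDF URLs: the match must end at a position not preceded by ".pdf", and since the URL character class cannot cross into </ref>, a ref whose URL ends in ".pdf" never matches.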
@BrownHairedGirl: Ok, I took it out of the proposal. The proposal is on hold due to the ANI, and it has not yet been transcluded on the main BRFA page, so I felt that it was OK to do so to clean up the clutter. Rlink2 (talk) 22:10, 21 January 2022 (UTC)[reply]
@Rlink2: I have had a rethink on the PDF bare URLs, and realise that I had fallen into the trap of letting the best be the enemy of the good.
Yes, I reckon that there probably are better ways to handle them. But as a first step, it is better to have them tagged than not to have them tagged ... and better to have them tagged with the specific {{Bare URL PDF}} than with the generic {{Bare URL inline}}.
So, please may I change my mind, and ask you to reinstate the tagging of PDF bare URLs? Sorry for messing you around. BrownHairedGirl (talk) • (contribs) 09:36, 1 March 2022 (UTC)[reply]
@BrownHairedGirl: No problem. I will make the change and update the source code to reflect it. Thanks for the feedback. Rlink2 (talk) 14:36, 1 March 2022 (UTC)[reply]
@Rlink2: that's great, and thanks for being so nice about my change-of-mind.
In the meantime. I have updated User:BrownHairedGirl/BareURLinline.js so that it uses {{Bare URL PDF}} for PDFs. I have also done an AWB run on the existing uses of {{Bare URL inline}} for PDFs, converting them to {{Bare URL PDF}}. BrownHairedGirl (talk) • (contribs) 16:15, 1 March 2022 (UTC)[reply]

Opening comments: I've seen <!--Bot generated title--> inserted in similar initiatives. Would that be a useful sort of thing to do here? It is acknowledged that the titles proposed to be inserted by this bot can be verbose and repetitive, terse or plainly wrong. Manual improvements will be desired in many cases. How do we help editors interested in doing this work?

The bot has a way of identifying bad and unsuitable titles, and will not fill in the citation if that is the case. I am using the list from Citation bot plus some other ones I have come across in my AWB runs. Rlink2 (talk) 22:06, 21 January 2022 (UTC)[reply]

Like ProcrastinatingReader I am interested in understanding bot permission precedence here. I'm not convinced that these edits are universally productive. I believe there has been restraint exercised in the past on bot jobs for which there is not a strong consensus that the changes are making significant improvements. I think improvements need to be large enough to overcome the downside of all the noise this will be adding to watchlists. I'm not convinced that bar is cleared here. See User_talk:Rlink2#A_little_mindless for background. ~Kvng (talk) 16:53, 21 January 2022 (UTC)[reply]

@Kvng: I think that a ref like {{cite web | title = Wikipedia - Encarta, from Microsoft | url=https://microsoft.com/encarta/shortcode/332d}} is better than simply just a link like <ref>https://microsoft.com/encarta/shortcode/332d</ref>. The consensus is that even when a bare ref is filled incompletely (without the website parameter), that is still better than the link being 100% bare, as it leaves the new ref more informative. It's impractical to go from OK to perfect 100% of the time.
I understand that some people may want perfection, and I think if there is room for improvement, we should take it. I recently made an upgrade to the script (the upgrade wasn't active for that edit) that does a better job of filling in the website parameter when it can. With the new script update, the ref you talked about on my page (http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter) would be converted into {{cite web |url=http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter |title = Leaky bucket counter | website = TheFreeDictionary.com}} . This is better than the old filling, which was {{cite web |url=http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter. |title = Leaky bucket counter. {{!}} Article about leaky bucket counter. by The Free Dictionary}} It does not work for all sites, but it is a start. Rlink2 (talk) 22:06, 21 January 2022 (UTC)[reply]
Rlink2 and BrownHairedGirl make the argument that these replacements are good and those opposing them are seeking perfect. In most cases, these are clear incremental improvements (good). In a few cases and aspects, they arguably don't improve or even degrade things (not good). Because the bot relies on external metadata (HTML titles) of highly variable quality and format, there doesn't seem to be a reliable way to separate the good from the not good. One solution is to have human editors follow the bot around and fix these but we don't have volunteers lined up to do that. Another solution is to tolerate the few not good contributions in appreciation of the overall good accomplished but I don't know how we do that value calculation. ~Kvng (talk) 16:14, 22 January 2022 (UTC)[reply]
@Kvng: I already explained that I upgraded the script to use more information than just HTML titles, for an even more complete filling. See my response above. Regarding there doesn't seem to be a reliable way to separate: I have developed ways to detect bad titles. In those cases, it will not fill in the ref. There is a difference between a slightly ugly title (like the free dictionary one) and a non-informative title (like "Website", "Twitter", "News story"). The former provides more information to the reader, while the latter provides less. So if the title is too generic it wouldn't fill in the ref. Rlink2 (talk) 16:18, 22 January 2022 (UTC)[reply]
Sure, we can make improvements as we go but because HTML titles are so varied, there will be more discovered along the way. Correct me if I misunderstand, but the approval I believe you're seeking is to crawl all of Wikipedia at unlimited rate and apply the replacements. With that approach, we'll only know how to avoid problems after all the problems have been introduced. ~Kvng (talk) 16:55, 22 January 2022 (UTC)[reply]
@Kvng: as requested below, please provide diffs showing the alleged problems. BrownHairedGirl (talk) • (contribs) 17:02, 22 January 2022 (UTC)[reply]
@Kvng: with that approach, we'll only know how to avoid problems after all the problems have been introduced. Not necessarily; I save all the titles to a file before applying them. I look over the file and see if there are any problem titles. If there are, I remove them and modify the script to not place that bad title. And even when the bot is in action, I'll still look at some diffs after the fact to catch any possible mistakes. Rlink2 (talk) 17:24, 22 January 2022 (UTC)[reply]
@Kvng: please post diffs which identify the cases where you believe that Rlink2's filling of the ref has:
  1. not improved the ref
  2. degraded the ref
I don't believe that these cases exist. You claim that they do exist, so please provide multiple examples of each type. BrownHairedGirl (talk) • (contribs) 16:29, 22 January 2022 (UTC)[reply]
My previous anecdotal complaints were based on edits I reviewed on my watchlist. I have now reviewed the 37 most recent (a screenful) bare reference edits by Rlink2 and find the following problems. 10 of 37 edits I don't consider to be improvements.
  1. [4] introduces WP:PEACOCK issue
  2. [5] broken link, uses title of redirect page
  3. [6] broken link, uses title of redirect page
  4. [7] broken link, uses title of redirect page
  5. [8] broken link, uses title of redirect page
  6. [9] broken link, uses title of redirect page
  7. [10] website name, not article title
  8. [11] incorrect title
  9. [12] new title gives less context than bare URL
  10. [13] new title gives less context than bare URL ~Kvng (talk) 17:44, 22 January 2022 (UTC)[reply]
@Kvng: So that means there were 27 improvements? Of course there are bugs and stuff, but we can always work through it.
  1. [14] An informative but WP:PEACOCK title is better than a bare ref IMO
Regarding the next set of links (uses title of redirect page), the upgrades I have made will fix those. If two different URLs have the same title, it will assume that it is a generic one. Most of these URL redirects are dead links anyway, so they will be left alone.
  1. [15] This has been fixed in the upgrade.
  2. [16] Don't see an issue.
  1. [17] Easily fixed, didn't catch that one, but kept in mind for future edits.
  1. [18] The bare URL arguably didn't have much information (there is a difference between "https://nytimes.com/what-you-should-do-2022" and "NY times" versus "redwater.ca/pebina-place" and "Pembina Place"). Nevertheless, the upgrade should have tackled some of these issues, so should hopefully happen less and less.
So now there is only one or two problem edits that I have not addressed yet (like the WP:PEACOCK one). Not bad Rlink2 (talk) 18:09, 22 January 2022 (UTC)[reply]
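The duplicate-title heuristic described above (a title shared by several URLs is assumed to be a generic redirect or soft-404 page, and very short titles are skipped per the 5-character rule in the function details) can be sketched as follows; the function name and data shapes are illustrative, not the bot's actual code:

```python
from collections import defaultdict

def titles_to_fill(fetched):
    """Given {url: fetched_title} pairs, keep only titles that are
    unique to a single URL and at least 5 characters long; everything
    else is withheld rather than filled in. Illustrative sketch."""
    urls_by_title = defaultdict(set)
    for url, title in fetched.items():
        urls_by_title[title].add(url)
    return {url: title for url, title in fetched.items()
            if len(urls_by_title[title]) == 1 and len(title) >= 5}
```

With this approach the "title of redirect page" class of error disappears automatically once two URLs land on the same generic page, at the cost of occasionally withholding a legitimate title shared by coincidence.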
The plan is for the bot to do 200,000 edits and at 1-2 issues for every 37 edits, we'd potentially be introducing 5-10,000 unproductive edits. I'm not sure that's acceptable. ~Kvng (talk) 19:21, 22 January 2022 (UTC)[reply]
@Kvng: I said 1-2 issues in your existing set, not that there would literally be 1-2 issues for every 37 edits. As more issues get fixed, the rate of bad edits will get less and less. The bot will run slowly at first, to catch any mistakes, then speed up. Sound good? Rlink2 (talk) 19:24, 22 January 2022 (UTC)[reply]
I'm extrapolating from a small sample. To find out more accurately what you're up against, we do need a larger review. Looking at just 50 edits, I've seen many ways this can go wrong. That leads me to assume there are still many others that have not been uncovered. You need to add some sort of QA plan to your proposal to address this. ~Kvng (talk) 00:31, 23 January 2022 (UTC)[reply]
@Kvng: You identified many edits with the same problem, problems that have since been fixed. You didn't find 10 different errors; you found 5 issues, 4 of which have been fixed already/will be fixed, and 1 which I don't think is an issue: even if the title is WP:PEACOCK, it is still more informative than the original ref (I will look into this, however). Remember, this is all about incremental improvement. These citations have no information attached to them at all. There is nothing. It is important to add "something"; even if not perfect, it will always be more informative than having nothing. If you were very thirsty and in need of a drink of water right now, would you deny the store brand water because you prefer Fiji Water? It's like saying you would rather have no car if you can't afford a Ferrari or Lamborghini.
I have a QA plan already in action, as explained before. Rlink2 (talk) 00:56, 23 January 2022 (UTC)[reply]
I assume you're referring to I save all the titles to a file before applying them. I look over the file and see if there any problem titles. If there are, I remove them, and modify the script to not place that bad title. And even when the bot is in action, I'll still look at some diffs after the fact to catch any possible mistakes. This didn't seem to work well for the meatbot edits you've already done. Despite your script improvements, I'm not confident this will go better with the real bot. How about some sort of a trial run and review of edit quality by independent volunteers. ~Kvng (talk) 21:18, 25 January 2022 (UTC)[reply]
Can you do something about the 30 pages now found by insource:"title = Stocks - Bloomberg"? 1234qwer1234qwer4 (talk) 20:56, 27 January 2022 (UTC)[reply]
@1234qwer1234qwer4: Nice to see you around here, thanks for reviewing my BRFA. Your opinion is very much appreciated and respected; you know a lot and have lots to say. Regarding "bloomberg": some (but not all) of those titles were placed by me. It appears that those 30 links with the generic title are dead links in the first place. I can go through them and replace them manually. The script has an upgrade to look for, and not place, any title that has been shared across multiple URLs, to help prevent the placement of generic titles. Rlink2 (talk) 21:11, 27 January 2022 (UTC)[reply]

It looks like a lot of cites use {{!}} with spammy content, for example from the first example |title = Blur {{!}} full Official Chart History {{!}} Official Charts Company. This is hard as you don't know which sub-string is spam vs. the actual title ("Blur"). One approach: split the string into three along the pipe boundary and add each as a new line in a very long text file. Then sort the file with counts for each string eg. "450\tOfficial Charts Company" indicates it found 450 titles containing that string along a pipe boundary ie. it is spam that can be safely removed. Add those strings to a squelch file so whenever they are detected in a title they are removed (along with the leading pipe). The squelch data would be invaluable to other bot writers as well. It can be run on existing cites on-wiki first to build up some data. You'd probably want to manually review the data for false positives but these spam strings are pretty obvious and you can get a lot of them this way pretty quickly. -- GreenC 07:10, 22 January 2022 (UTC)[reply]
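The squelch-list approach suggested above can be sketched roughly as follows; the threshold, function names, and file-free in-memory form are assumptions for illustration (GreenC describes a sorted text file, which amounts to the same counting):

```python
from collections import Counter

def build_squelch_list(titles, min_count=50):
    """Split every fetched title on the pipe ({{!}}) boundary, count each
    segment across the corpus, and treat segments that recur often (site
    names, slogans) as removable spam. min_count is an illustrative
    threshold, not a value from the discussion."""
    counts = Counter()
    for title in titles:
        for segment in title.split('|'):
            counts[segment.strip()] += 1
    return {seg for seg, n in counts.items() if n >= min_count}

def strip_spam_segments(title, squelch):
    """Remove squelched segments (and their pipes) from a title."""
    kept = [s.strip() for s in title.split('|') if s.strip() not in squelch]
    return ' | '.join(kept)
```

As suggested, the counts can be built from existing on-wiki cites first, and the resulting squelch data reviewed manually for false positives before being applied.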

If this gets done, I am leaning towards being amenable to a trial run; I don't expect this will get approved after only a single run but as mentioned in Kvng's thread above some of the concerns/issues likely won't pop until the bot actually starts going. Primefac (talk) 16:19, 25 January 2022 (UTC)[reply]
@Primefac: @GreenC: I have already done this. All the titles are saved into a file, and if more than one title from the same site has common parts after the "|" symbol, it can remove them, provided the website parameter can be filled. Detection and filling of the "website=" parameter is also a lot better than before, as I explained above.
some concerns/issues likely won't pop until the bot actually starts going. Yeah I agree. It will go slow at first, then speed up. Rlink2 (talk)
I'm not sure if you missed it (or if I've missed your response), but can you confirm your answer to my third initial question? ProcrastinatingReader (talk) 18:04, 25 January 2022 (UTC)[reply]
@ProcrastinatingReader: When I first made the script, it would only fill in the "title=" parameter. Some editors complained that they would like to see the "website=" parameter, and while there is consensus that filling in only the "title=" parameter is still better than nothing, I added the capability to add that parameter when possible into the script. It is successful at adding "website=" for some, but not all, websites.
However this bot will leave the dead links bare for now. Rlink2 (talk) 18:29, 25 January 2022 (UTC)[reply]
@Rlink2: please, please, can the bot tag with {{dead link}} (dated) any bare URL refs which return a 404 error?
This would be a huge help for other bare-URL-fixing processes, because such refs can be excluded at list-making stage, saving a lot of time.
Note that there are other situations where a link should be treated as dead, but they may require multiple checks. A 404 is fairly definitive, so it can be safely tagged on first pass. BrownHairedGirl (talk) • (contribs) 19:07, 25 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, I can definitely do that. Rlink2 (talk) 20:03, 25 January 2022 (UTC)[reply]
Thanks! BrownHairedGirl (talk) • (contribs) 20:09, 25 January 2022 (UTC)[reply]
PS @Rlink2: my experimentation with bare URL PDFs shows that while HTTP status 410 ("Gone") is rarely used, it does have a non-zero usage.
Since 410 is a definitively dead link, please can the bot treat it like a 404, i.e. tag any such URL as a {{dead link}}?
Also pinging @GreenC, in case they have any caveats to add about 410. BrownHairedGirl (talk) • (contribs) 01:52, 8 February 2022 (UTC)[reply]
@BrownHairedGirl: Sounds good. I will add this. Thanks for bringing up the issue with the highest standards of civility and courteousness, as you always do.
Just to make sure the change works in the bot, could you link to some of the diffs where 410 is the code returned? Thank you again. Rlink2 (talk) 01:59, 8 February 2022 (UTC)[reply]
Many thanks, @Rlink2. I have not been tracking them so far, just tagging them as dead in the hour or so since I added 410 to my experimental code. That leaves no trace of whether the error was 404 or 410.
I will now start logging them as part of my tests, and will get back to you when I have a set. (There won't be any diffs, just page name, URL, HTTP code, and HTTP message). BrownHairedGirl (talk) • (contribs) 02:09, 8 February 2022 (UTC)[reply]
@Rlink2: I have posted[19] at User talk:Rlink2#HTTP_410 a list of 9 such URLs. That is all my script found since I started logging them a few hours ago.
Hope this helps. BrownHairedGirl (talk) • (contribs) 11:25, 8 February 2022 (UTC)[reply]
Accurately determining web page status is harder than it looks. For example, forbes.com uses bot blocking, and if you check their site more than X times in a row without sufficient pause it will return 404s (or 403?) even though the page is a 200. It's a CloudFlare service I think, so lots of sites use it. A robust general-purpose dead link checker is quite difficult. IABot for example checks a link three times over at least a 3-week period to allow for network variances. -- GreenC 20:34, 25 January 2022 (UTC)[reply]
For example, forbes.com uses bot blocking and if you check their site more than X times in a row without sufficient pause it will return 404s (or 403?) even though the page is 200. To be exact, it does not return a 404, it returns something else. BHG was just talking about 404 links, which are pretty clear-cut in their dead-or-alive status. Rlink2 (talk) 20:40, 25 January 2022 (UTC)[reply]
Maybe that will work, keep an eye out because websites do all sorts of unexpected nonstandard and illogical things with headers and codes. -- GreenC 21:19, 25 January 2022 (UTC)[reply]
This project has so far been marked by underappreciation of the complexity of the work. We should keep the scope tight and gain some more experience with the primary task. I do not support adding dead link detection and tagging to the bot's function. ~Kvng (talk) 21:26, 25 January 2022 (UTC)[reply]
@Kvng: This project has so far been marked by underappreciation of the complexity of the work. Don't be confused; I have been fine-tuning the script for some time now. I am aware of the nooks and crannies. Adding dead link detection is uncontroversial and keeps the tool well within scope. So why don't you support it? Rlink2 (talk) 21:39, 25 January 2022 (UTC)[reply]
Because assuring we get a usable title is hard enough. We don't need the distraction. The bot is not very likely to be adding a title and dead link tag in the same edit so there will be few additional edits if we do dead link tagging as a separate task later. ~Kvng (talk) 22:44, 25 January 2022 (UTC)[reply]
Because assuring we get a usable title is hard enough. Except it isn't. You have identified 5 bugs, which we have already fixed. The bot is not very likely to be adding a title and dead link tag in the same edit The title and dead link detection are similar but not the same. If the title is unsuitable, it will leave the ref alone. If the link is dead, it will place the {{dead link}} template. Rlink2 (talk) 22:58, 25 January 2022 (UTC)[reply]
@Rlink2: I don't know how much software development experience you have but my experience tells me that the number of remaining bugs is directly related to the number of bugs already reported. It is wrong to assume all programs have a similar number of bugs and the more you've found and fixed, the better the software is. The reality is that the quality of software and the complexity of problems varies greatly and some software has an order of magnitude more issues than others. I found several problems in your work quickly so I think it is responsible to assume there are many more yet to be found. ~Kvng (talk) 14:12, 26 January 2022 (UTC)[reply]
@Kvng: There is zero distraction. The info needed to decide to tag a URL as dead will always be available to the bot, because the first step in trying to fill the URL is to make a HTTP request. If that request fails with a 404 error, then we have a dead link. It's a very simple binary decision.
Your claim about low coincidence is the complete opposite of my experience of months of working almost solely on bare URLs. There is a very high incidence of pages with both live and dead bare URLs. So not doing it here will mean a lot of additional edits, and -- even more importantly -- a much higher number of wasted human and bot jobs repeatedly trying to fill bare URLs which are actually dead. BrownHairedGirl (talk) • (contribs) 23:01, 25 January 2022 (UTC)[reply]
PS Just for clarification, a 404 error reliably indicates a dead URL. As GreenC notes there are many other results where a URL is definitively dead but a 404 is not returned, and those may take multiple passes. But I haven't seen any false 404s. (There may be some, but they are very rare). BrownHairedGirl (talk) • (contribs) 04:59, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: I respect your experience on this. I did not find any of those cases in the 50 edits I have reviewed. Perhaps that's because of the state of Rlink2's tool.
I don't agree that there is zero distraction. We were already distracted discussing the details of implementing this before I came in and suggested we stay focused. ~Kvng (talk) 14:12, 26 January 2022 (UTC)[reply]
@Kvng: That talk of distraction is disingenuous. There were two brief posts on this before you created a distraction by turning it into a debate which required explanation of things you misunderstood. BrownHairedGirl (talk) • (contribs) 14:22, 26 January 2022 (UTC)[reply]
Happy to take the heat for drawing out the process. It's the opposite of what I'm trying to do so apparently I'm not doing it well. I still think we should fight scope creep and stick to filling in missing titles. ~Kvng (talk) 00:21, 27 January 2022 (UTC)[reply]
As I already explained, tagging dead links is an important part of the process of filling titles, because it removes unfixables from the worklist.
And as I already explained, it is a very simple task which uses info which the bot already has. BrownHairedGirl (talk) • (contribs) 00:45, 27 January 2022 (UTC)[reply]
Yes, you did explain and I read and it did not persuade me to change my position. I appreciate that being steadfast about this doesn't mean I get my way. ~Kvng (talk) 00:56, 27 January 2022 (UTC)[reply]

Source code

Speaking of fine tuning, do you intend to publish your source code? I think we may be able to identify additional gotchas though code review. ~Kvng (talk) 22:44, 25 January 2022 (UTC)[reply]
Hopefully, but not right now. It wouldn't be very useful for "code review" in the way you are thinking. If there are bugs, though, you can always report them. Rlink2 (talk) 22:54, 25 January 2022 (UTC)[reply]
@Rlink2: I have to disagree with you on this. As a general principle, I am very much in favour of open-source code. That applies even more strongly in a collaborative environment such as Wikipedia, so I approach bots with a basic presumption that the code should be available, unless there is very good reason to make an exception.
Publishing the code brings several benefits:
  1. it allows other editors to verify that the code does what it claims to do
  2. it allows other editors to help find any bugs
  3. it helps others who may want to develop tools for related tasks
So if a bot-owner does not publish the source code, I expect a good explanation of why it is being withheld. BrownHairedGirl (talk) • (contribs) 00:35, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, nice to see your perspective on it. I will definitely be making it open source then. When should I make it available? I can provide a link later in the week, or should I wait until the bot enters trial? Where would I even post the code anyway? Thanks for your opinion. Rlink2 (talk) 00:39, 26 January 2022 (UTC)[reply]
@Rlink2: Up to you, but my practice is to make it available whenever I am ready to start a trial. That is usually before a trial is authorised.
I usually put the code in a sub-page (or pages) of the BRFA page. BrownHairedGirl (talk) • (contribs) 01:06, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Sounds good, I will follow your example and make it available as soon as I can (later this week). A subpage sounds great; good idea, and it keeps everything on-wiki. Rlink2 (talk) 01:11, 26 January 2022 (UTC)[reply]
There is preliminary code up at Wikipedia:Bots/Requests_for_approval/BareRefBot/Code. There is more to the script than that (e.g. networking code, wikitext code), but this is the core of it. Will be releasing more as time goes on and I have time to comment the additional portions. Rlink2 (talk) 20:08, 26 January 2022 (UTC)[reply]
Code review comments and discussion at Wikipedia talk:Bots/Requests for approval/BareRefBot/Code

Trial

Trial 1

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. As I mentioned above, this is most likely not going to be the only time the bot ends up in trial, and even if there is 100% success in this first round it might get shipped for a larger trial anyway depending on feedback. Primefac (talk) 14:12, 26 January 2022 (UTC)[reply]

@Rlink2: Please can the report on the trial include not just a list of the edits, but also the list of pages which the bot skipped. That info is very useful in evaluating the bot. BrownHairedGirl (talk) • (contribs) 14:25, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok. Rlink2 (talk) 20:07, 26 January 2022 (UTC)[reply]
@Primefac: could you please enable AWB for the bot for the trial? Thank you. Rlink2 (talk) 21:43, 26 January 2022 (UTC)[reply]
@Rlink2: I don't see any problem with doing the trial edits from your own account, with an edit summary linking to the BRFA: e.g.
[[WP:BRFA/BareRefBot|BareRefBot]] trial: fill 3 [[WP:Bare URLs]]
... which renders as: BareRefBot trial: fill 3 WP:Bare URLs
That is what I have done with my BRFAs. BrownHairedGirl (talk) • (contribs) 18:06, 27 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, I will do this later today. Thank you for the tips. Rlink2 (talk) 18:11, 27 January 2022 (UTC)[reply]
Trial complete. See edits here (the page is a bit slow to load). The ones the bot skipped already had the bare refs filled in by Cite Bot, since I am working from the older database dump. If it skipped/skips one due to a bug in the script, I would have listed and noted that. Rlink2 (talk) 03:18, 28 January 2022 (UTC)[reply]
Here is the list of edits via the conventional route of a contribs list: https://en.wikipedia.org/w/index.php?title=Special:Contributions/Rlink2&offset=202201280316&dir=next&target=Rlink2&limit=53
Note that there were 53 edits, rather than the authorised 50. BrownHairedGirl (talk) • (contribs) 03:27, 28 January 2022 (UTC)[reply]
Whoops! AWB said 50, so I think the edit counter is slightly off with AWB. Maybe I accidentally stopped the session, which reset the edit counter or something. Not sure how it works exactly. Sorry about that. But it's just 2 more edits (the actual amount seems to be 52, not 53), so I don't think it should make a big difference. Rlink2 (talk) 03:38, 28 January 2022 (UTC)[reply]
Sorry, it's 52. My contribs list above included one non-article edit. Here's fixed contribs list: https://en.wikipedia.org/w/index.php?target=Rlink2&namespace=0&tagfilter=&start=&end=&limit=52&title=Special%3AContributions
I don't think that it's a big deal of itself. However, when the bot is under scrutiny, the undisclosed counting error is not a great look. --BrownHairedGirl (talk) • (contribs) 13:50, 28 January 2022 (UTC)[reply]
Well, if anything, it was my human mistake of overcounting, not an issue with the bot code. Next time I'll make sure it's exactly 50 edits. Sorry about that. Rlink2 (talk) 14:03, 28 January 2022 (UTC)[reply]
I don't know much about this but I thought the way this was done was to program the bot to stop after making 50 edits? Levivich 18:46, 28 January 2022 (UTC)[reply]
I did the trial with AWB manually, and apparently the AWB counter is slightly bugged. If I was using the bot frameworks I could have made it exactly 50. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]
@Rlink2: I think that an AWB bug is very very unlikely. I have done about 1.5 million AWB edits over 16 years, and have never seen a bug in its counter.
I think that the error is most likely to have arisen from the bot saving a page with no changes. That would increment AWB's edit counter, but the server would see it as a WP:Null edit, and not create a new revision.
One technique that I use to avoid this is to make the bot copy the variable ArticleText to FixedArticleText. All changes are applied to FixedArticleText. Then, as a final sanity check after all processing is complete, I test whether ArticleText == FixedArticleText ... and if they are equal, I skip the page. BrownHairedGirl (talk) • (contribs) 00:17, 29 January 2022 (UTC)[reply]
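BHG's AWB modules are written in C#, but the same sanity check can be sketched in any bot framework. A Python sketch (the two fixer functions are hypothetical placeholders standing in for the real title-filling and dead-link logic):

```python
def fill_bare_refs(text):
    # Placeholder for the real bare-ref filling logic.
    return text.replace(
        "<ref>http://example.com</ref>",
        "<ref>{{cite web |url=http://example.com |title=Example}}</ref>")

def tag_dead_links(text):
    # Placeholder for the real dead-link tagging logic.
    return text

def process_page(article_text):
    """Apply all fixes to a copy; return None when nothing changed, so the
    caller skips the save. Saving an unchanged page is a null edit: it
    increments the local edit counter without creating a revision."""
    fixed_text = article_text      # work on a copy, never the original
    fixed_text = fill_bare_refs(fixed_text)
    fixed_text = tag_dead_links(fixed_text)
    if fixed_text == article_text:
        return None                # no real change: skip the save
    return fixed_text
```

The final equality test is the whole point: every transformation funnels through one variable, and the save happens only when the before and after texts actually differ.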
I think that the error is most likely to have arisen from the bot saving a page with no changes. This is the most likely explanation. Rlink2 (talk) 01:13, 29 January 2022 (UTC)[reply]
Not sure I understand this, since that would seem to result in fewer edits being made rather than more. User:1234qwer1234qwer4 (talk) 17:53, 29 January 2022 (UTC)[reply]
Well, if anything, it was my human error that made it above 50, since i manually used the script with AWB. It is not a problem with the bot or the script. Rlink2 (talk) 17:59, 29 January 2022 (UTC)[reply]

Couple thoughts:

  • It looks like if there is |title=xxx - yyy and |url=zzz.com, and zzz is equal to either xxx or yyy, it should be safe to remove it from the title and add to |website=. (or {{!}} or long dash instead of dash). Appears to be a common thing: A1, A2, A3, A4, A5
  • Similar to above could check for abbreviated versions of zzz: B1, B2, B3
  • FindArticles.com is a common site on wikipedia: C1 It's also many soft-404s. Looks like that is the case here, "dead" link resulting in the wrong title.
  • GoogleNews is a common site that could have a special rule: D1

-- GreenC 15:18, 28 January 2022 (UTC)[reply]

and zzz is equal to either xxx or yyy, it should be safe to remove it from the title and add to "website" Like I said, the script does do this when it can. See this diff as one example out of many. Some of the diffs you link also exhibit this behavior. Emphasis on "when it can" - it errs on the side of caution since a sensible title field is better than a possibly malformed website field. Also some of the diffs you linked to are cites of the main page of the website, so in that case a "generic" title is expected.
Even in some of the ones you linked, there is no obvious way to tell the difference between "| Article | RPGGeek" and just " |RPGGeek" since there are two splices and not just one.
findarticles.com - Looks like that is the case here, "dead" link resulting in the wrong title. Ok, good to know. The script does have something to detect when the same title is used across multiple sites.
GoogleNews is a common site that could have a special rule I saw that. I thought it was fine, because the title has more information than the original URL, which is the entire point, right? What special rule are you proposing? Rlink2 (talk) 15:36, 28 January 2022 (UTC)[reply]
  • A1: title {{!}} RPGGeek == url of rpggeek .. thus if the title is split along {{!}} this match shows up. It's a literal match (other than case) should be safe.
  • A2: same as A1. Split along "-" and there is a literal match with the URL. Adjust for spaces and case.
  • A3: same with title {{!}} Air Journal and url air-journal .. in this case flatten case and test for space or dash in URL.
  • A4: same with iReadShakespeare in the title and url.
  • A5: another RPGGeek
  • D1: for example, a site-specific rule if "Google News Archive Search" in the title, remove and set the work "Google News"
-- GreenC 18:09, 28 January 2022 (UTC)[reply]
A1, A2, A3: I didn't make the script code every splice blindly, with one half going in the title and the other half in the website field. That opens up a can of bugs, since sites can put anything in there. If the script is going to split, it needs quality assurance. When it has that quality assurance, it will split the title and fill the website parameter, like it did with some of the URLs in the airport article diff.
If the source is used a lot on enwiki, it is easy to remove the common portions without much thought using the list. But the common portions of a title are not necessarily suitable for a website parameter (for example: the website above is RPGgeek.com, but the common parts of the title are "| Article | RPGgeek.com"). Of course, you could say "just take the last splice", but what if there is another site that does "| RPGgeek.com | Article"? There are a lot of website configurations, so we need to follow Postel's law and play it safe.
Compare this to IMDB, where the part after the dash is suitable for the website parameter. So the script is not going to just remove common parts of the title if it's not sure where that extra information should go. We want to make the citation more informative, not less.
A4: The website name is part of the title as a pun; look at it closely. That's one case where we don't want to remove the website title. If we just go around removing and splitting stuff blindly, this is one of the problems we'd be creating. And it's a cite of a main webpage too.
D1 - OK, that sounds fine. Good suggestion. Rlink2 (talk) 18:49, 28 January 2022 (UTC)[reply]
but for A1..A3 it's not just anything, it's a literal match. Test for the literal match. To be more explicit with A1:
title found = whatever {{!}} whatever {{!}} RPGGeek. And the existing |url=rpggeek.com. Split the string along {{!}} (or dash). Now there are three strings: "whatever", "whatever", "RPGGeek". For each of the three strings, compare with the base URL string, in this case "rpggeek". Comparison 1: "whatever" != "rpggeek". Comparison 2: "whatever" != "rpggeek". Comparison 3: "RPGGeek" == "rpggeek" - we found a match! Thus you can safely do two things: remove {{!}} RPGGeek from the title; and, add |website=RPGGeek. This rule/system should work for every example. You may need to remove spaces and/or replace them with "-" and/or lower-case the title string when doing the URL string comparison. I see what you're saying about A4: you don't want to mangle existing titles when it's a legit usage along a split boundary. I guess the question is how common that is. -- GreenC 19:14, 28 January 2022 (UTC)[reply]
BTW, if you're not comfortable doing it, don't do it. It's the sort of thing that may be correct 95% of the time and wrong 5%, so you have to weigh the utility of that versus doing nothing for the 95%. -- GreenC 19:51, 28 January 2022 (UTC)[reply]
@GreenC: Thank you for your insight; I will have to think about implementing this. I have already kinda done this, see the updated source code I uploaded. I can implement what you are asking for domains that come after the splices. For example, if the website is "encarta.com" and the title is "Wiki | Encarta.com", then the "encarta.com" can be split off, but if the title is "Wiki | Encarta Encyclopedia - the Paid Encyclopedia", with no other metadata to help retrieve the website name, then it's a harder situation to deal with, so I don't split at all. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]

I went through all 52 so that my contribution to this venture wouldn't be limited to re-enacting the Spanish Inquisition at ANI.

  1. Special:Diff/1068376250 - The bare link was more informative to the reader than the citation template, because the bare link at least said "Goodreads.com", whereas the citation template just gives the title, which is the title of the book, and the same as the title of the Wikipedia article (and the title was in the URL anyway). So in this case, the bot removed (or hid, behind a citation template) useful information, rather than adding useful information. I don't see how this edit is an improvement.
  2. Special:Diff/1068372499 - Similarly, here the bot replaced a bare URL to aviancargo.com with a citation template with the title "Pagina sin titulo" ("page without title"). This hides the useful information of the domain name and replaces it with a useless page title. This part of this edit is not an improvement.
  3. Special:Diff/1068369653 - Replaces English-language domain name in bare URL with citation template title using foreign language characters. Not an improvement; the English-speaking reader will learn more from the bare URL than the citation template.
  4. Special:Diff/1068369064 - |website=DR tells me less than www.dr.dk, but maybe that's a problem with the whitelist?
  5. Special:Diff/1068369849 - an example of promo being added via website title, in this case the source's tagline, "out loud and in community!"
    • I have a similar concern about Special:Diff/1068369121 because we're adding "Google News Archive Search" prominently in citation templates. However, news.google.com was already in the bare URL, and the bot is also adding the name of the newspaper, so it is adding useful information. My promo concern here is thus weak.
  6. Special:Diff/1068368882, Special:Diff/1068375545, Special:Diff/1068369433 (first one) - tagged as a dead URLs, but are not a dead URLs, they all go to live websites for me.
  7. Special:Diff/1068375631 and Special:Diff/1068369185 - tagged as dead URL but coming back to me as 503 not 404. Similarly Special:Diff/1068372071 is 522 not 404. Special:Diff/1068376097 is coming back to me as a timeout not a 404. Special:Diff/1068376127 as a DNS error, not a 404. This may not be a problem if "dead URL" also applies to 503 and 522s and timeouts and DNS errors and all the rest, and not just 404s, but thought I'd mention it.

I wonder if the concerns in #1-4 could be addressed by simply adding |website=[domain name] to the citation template? That would at least preserve the useful domain name from the bare URL. No. 5 is concerning to me as this came up in previous runs. Even if this promo problem only occurs 2% of the time, if we run this on 200k pages, that's 4,000 promo statements we'll be adding to the encyclopedia. Personally, I don't know if that is, or is not, too high or a price to pay for the benefit of converting bare URLs into citation templates. (I am biased on this issue, though, as I don't see much use in citation templates personally.) No. 6 is a problem, and I question whether tagging something as dead based on one ping is sufficient, as mentioned above. #7 may not be a problem at all, I recognize. Hope this helps, and thank you to everyone involved for your work on this, especially Rlink. Levivich 18:19, 28 January 2022 (UTC)[reply]

@Levivich: I went through all 52 so that my contribution to this venture wouldn't be limited to re-enacting the Spanish Inquisition at ANI. Thank you for taking the time to review those edits, and thank you for your civility and good faith both here and at ANI. Hopefully we avoided Wikimedia Archive War 2. Wikimedia Archive War 1 was the war to end all wars, there were lots of casualties, we don't need another one. as much as I think arguments about archive sites are stupid, and this comment was made before the conflict started, let's respect everyone who is suffering through a very real war right now.... Off-topic banter aside....
Special:Diff/1068369064 - the difference between DR and DR.dk is very minimal. Besides, "DR" is the name of the news agency/website, so that is the more accurate one IMO.
And regarding the non-404s: I have explained before that I just recently upgraded the getter to only catch 404 links and nothing else. While the diffs you linked that are not 404s are mostly still actually dead links, the consensus here was to only mark firm 404 status code returns as dead links, so I made that change. The dead-link data used in this set was collected before that change was made to reflect 404s only, and I only realized that after the fact. As for the completely live links being marked as dead, that might just have been a timeout error (not that it matters now, because anything that is not a 404 but doesn't work for me will just be left alone).
Even if this promo problem only occurs 2% of the time. It's less than that. I think there was only one diff showing this. And if it's a big issue I can blacklist puff words and not fill those in.
I don't see much use in citation templates personally. Well, normally, it wouldn't matter. But we are adding information to the citation, and the cite template is the preferred way to do so.
Hope this helps, and thank you to everyone involved for your work on this, especially Rlink. Thank you, and also thanks for all the hard work you do around here on Wikipedia as well. But none of this would be possible without BHG. She laid the foundations of all this stuff. Without her involvement this would have been impossible. Her role in fixing bare refs is far, far greater than mine. I am just playing my small part, "helping" out. But she has all the expertise. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]

I've taken the time to review the first 25 edits. My findings:

  1. [20] is a certificate issue (SSL_ERROR_BAD_CERT_DOMAIN) and is presumably accessible if you want to risk it. Is it right to mark this as a dead link?
  2. [21], [22], [23], [24] I don't understand why there is no |website= on these.
  3. [25] first link does not appear to be dead.
  4. [26] first link does not appear to be dead.
  5. [27] first link appears to be dead.
  6. [28] https://www.vueling.com/en/book-your-flight/flight-timetables does not appear to be a dead link.
  7. [29] https://thetriangle.org/news/apartment-complex-will-go-up-at-38th-and-chestnut/ is reporting a Cloudflare connection timeout. Is it right to mark this as a dead link?

Problems with bare link titles are mostly about the |website= parameter. The code that sorts this out is in a library and not posted and I don't know how it works and I'm not convinced it's doing what we want it to do. See the code review page for further discussion. ~Kvng (talk) 18:25, 28 January 2022 (UTC)[reply]


Is it right to mark this as dead link? (regarding SSL_ERROR_BAD_CERT_DOMAIN) I saw that one. If you click through the SSL error (type "thisisunsafe" in Chrome or Chromium-based browsers) you see it redirects to another page. If you look even closer, adding any sort of random characters to the URL redirects to the same page, meaning that there is a blanket redirect on that website. So yes, I think it is right to mark it as dead.
Regarding the findarticles thing, yes, it has already been reported. I think I have to add a redirect check to it: if multiple URLs redirect to the same one, mark it as dead. So thank you for reporting that one.
I don't understand why there is no website As explained before, it will only add the website parameter when it is absolutely sure it has a correct and valid website parameter. It is not as simple as splitting on any character like "|" or "-"; that seems obvious, but a lot of bugs could arise just from that.
Is it right to mark this as a dead link? That link does not work for me. I tested on multiple browsers. Rlink2 (talk) 18:47, 28 January 2022 (UTC)[reply]
@Kvng and Levivich: I have always believed that the approach to the |website= parameter should be to:
  1. Use the name of the website if it can be reliably determined (either from the webpage or from a lookup table)
    or
  2. If the name of the website is not available, use the domain name from the URL.
For example, take a bare URL ref to https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161
If the bot can reliably determine that the name of the website is "The Irish Times", then the cite template should include |website=The Irish Times
... but if the bot cannot reliably determine the name of the website, then the cite template should include |website=www.irishtimes.com.
I take that view because without a name, we have two choices on how to form the cite:
  • A {{cite web |url=https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161 |title=Munich prosecutors sent child abuse complaint linked to Pope Benedict}}
  • B {{cite web |url=https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161 |title=Munich prosecutors sent child abuse complaint linked to Pope Benedict |website=www.irishtimes.com}}
Those two options render as:
  • A: "Munich prosecutors sent child abuse complaint linked to Pope Benedict".
  • B: "Munich prosecutors sent child abuse complaint linked to Pope Benedict". www.irishtimes.com.
Option A is to my mind very annoying, because it gives no indication of i) whether the articles appears on a website from Japan or Zambia or Russia or Bolivia, ii) whether the source is a reputable newspaper, a partisan politics site, a blog, a porn site, a satire site or an ecommerce site. That deprives the reader of crucial info needed to make a preliminary assessment of the reliability of the source.
In my view, option B is way more useful, because it gives a precise description of the source. Not as clear as the name, but way better than nothing: in many cases the source can be readily identified from the domain name, and this is one of them.
This is the practice followed by many editors. Unfortunately, a small minority of purists prefer no value for |website= instead of |website=domain name. Their perfection-or-nothing approach significantly undermines the utility of bare URL filling, by letting the best (full website name) become the enemy of the good (domain name).
I know that @Rlink2 has had some encounters with those perfection-or-nothing purists, and I fear that Rlink2's commendable willingness to accommodate concerns has led them accept the demands of this fringe group of zealots. I hope that Rlink2 will dismiss that perfectionism, and prioritise utility to readers ... by reconfiguring the bot to add the domain name. BrownHairedGirl (talk) • (contribs) 19:45, 28 January 2022 (UTC)[reply]
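The name-if-known, domain-otherwise policy described above is a one-line fallback in code. A sketch (the KNOWN_SITES table is a hypothetical stand-in for whatever lookup the bot actually uses):

```python
from urllib.parse import urlparse

# Hypothetical lookup table mapping bare domains to canonical site names.
KNOWN_SITES = {
    "irishtimes.com": "The Irish Times",
}

def website_value(url):
    """Prefer the real site name when the lookup knows it; otherwise fall
    back to the host from the URL rather than leaving |website= empty."""
    host = urlparse(url).netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    return KNOWN_SITES.get(bare, host)
```

With this, an unrecognised domain still yields a useful |website=www.example.com rather than a blank parameter, which is the "option B" outcome argued for above.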
I agree. Not only is B better than A in your example, but I would even say that the bare link is better than A in your example, because the bare link has both the title and the website name in it, but A only gives the title. I honestly struggle to see how anyone could think that a blank |website parameter in a citation template is better than having the domain name in the |website parameter. Levivich 19:50, 28 January 2022 (UTC)[reply]
@Levivich: puritanism can lead people to take very strange stances. I have seen some really bizarre stuff in other discussions on filling bare URLs.
As to this particular link, its URL is formed as a derivative of the article title, so the bare URL is quite informative. So it's a bit of tossup whether filling it with only the title is actually an improvement.
However, some major websites form the URL by numerical formulae (e.g. https://www.bbc.co.uk/news/uk-politics-60166997) or alphanumerical formulae (e.g. https://www.ft.com/content/8f1ec868-7e60-11e6-bc52-0c7211ef3198). In those (alpha)numerical examples, the title alone is more informative.
However, title+website is always more informative than bare URL, provided that the title is not generic. BrownHairedGirl (talk) • (contribs) 20:12, 28 January 2022 (UTC)[reply]
On the subject of |website=, one way of determining the correct website title is relying on redirects from domain names. That is, since irishtimes.com redirects to The Irish Times, the bot can know to add |website=The Irish Times. That is likely to be more comprehensive than any manually maintained database. * Pppery * it has begun... 20:42, 28 January 2022 (UTC)[reply]
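Pppery's suggestion could be sketched roughly as below. This is an illustrative Python sketch, not the bot's actual AWB code; the redirect table is hypothetical stand-in data for what would in practice be a lookup against the MediaWiki API's redirect resolution (e.g. action=query with the redirects flag).

```python
# Hedged sketch: resolve a domain name to a publication name via enwiki
# redirects, as Pppery suggests. The redirect_table dict is assumed data;
# a real implementation would query the MediaWiki API instead.

def website_from_domain(domain, redirect_table):
    """Return the redirect target for a domain-name page, falling back to
    the domain itself when no redirect exists."""
    title = domain.lower()
    if title.startswith("www."):
        title = title[len("www."):]
    return redirect_table.get(title, domain)

redirects = {"irishtimes.com": "The Irish Times"}  # assumed data
print(website_from_domain("www.irishtimes.com", redirects))  # The Irish Times
print(website_from_domain("example.org", redirects))         # example.org
```

As the discussion below notes, this breaks down when one domain hosts several publications, which is why the simpler domain-name fallback was preferred.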
That is a good idea, thanks for letting me know @Pppery:. Your thoughts are always welcome here. I kinda have to agree with BHG, as usual. I just didn't know what the consensus was on it, but BHG and Levivich make a clear case for the website parameter. I will add this to the script. One of the community wishlist items should have been to bring VE to non-article spaces. Replying to this chain is difficult for me. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]
@Rlink2: to make replying much much easier, go to Special:Preferences#mw-prefsection-betafeatures enable "Discussion tools" (4th item from the top).
That will give you "reply" link after every sig. BrownHairedGirl (talk) • (contribs) 22:05, 28 January 2022 (UTC)[reply]
Thanks for that, this way is so much easier. Rlink2 (talk) 22:07, 28 January 2022 (UTC)[reply]
@Pppery: the problem with that approach is that some domain names host more than one publication, e.g.
It would be easy to over-complicate this bot's task by trying to find the publication's name. But better to KISS by just using the domain name. BrownHairedGirl (talk) • (contribs) 22:01, 28 January 2022 (UTC)[reply]
Makes sense. I have no objection to just using the domain name. * Pppery * it has begun... 22:10, 28 January 2022 (UTC)[reply]

Most of my concerns have to do with dead link detection. This is turning out to be the distraction I predicted. There were only 3 articles with both bare-link and dead-link edits: [30], [31], [32]. Running these as separate tasks would require 12% more edits, and I don't think that's a big deal. I again request we disable dead link detection and marking and focus on filling bare links now.

Many of the links you linked are actually dead. And regarding the ones that weren't, I think it's using the data from when the script was more liberal about tagging a dead link (the code is now much stricter: 404s only). I said I will be adding more source code as we go along, with complete comments. Rlink2 (talk) 18:47, 28 January 2022 (UTC)[reply]
@Rlink2: you could have avoided a lot of drama by publishing all the bot's code, rather than just a useless fragment. I suggest that you do so without delay, ideally with sufficient completeness that another competent editor could use AWB to fully replicate the bot.
Yes, I will do this as soon as I finish my responses to these questions. Rlink2 (talk) 19:17, 28 January 2022 (UTC)[reply]
Also please note that on the 25th, I specifically requested[33] that the bot tag with {{dead link}} (dated) any bare URL refs which return a 404 error ... and you replied[34] just under 1 hour later to say Ok, I can definitely do that.
Now it seems that in your trial run, dead link tagging was not in fact restricted to 404 errors. I do not see any point in this discussion at which you disclosed that you would use some other basis for tagging a link as dead. Can you see how that undeclared change of scope undermines trust in the bot operator? BrownHairedGirl (talk) • (contribs) 19:15, 28 January 2022 (UTC)[reply]
Yeah, I said it was a bug from stale data from before I updated the script. I am sorry. I only realized after the fact. Rlink2 (talk) 19:17, 28 January 2022 (UTC)[reply]
The posted code was not useless. It helped me understand the project and I pointed out a few things that helped Rlink2 make small improvements.
I'm not upset about a gap between promises and performance on this trial because that is the observation that originally brought me into this. Rlink2 is clearly working in good faith; thank you! Progress has been made and we'll get there soon. ~Kvng (talk) 21:29, 28 January 2022 (UTC)[reply]
Thank you for the kind words. In response to BHG's I do not see any point in this discussion at which you disclosed that you would use some other basis for tagging a link as dead: when I was running the script on my main account, I did use a wider basis for tagging links as dead. However, when we started the BRFA, we limited the scope to just 404 responses. What I should have done is run a new batch using data that matched the dead link criteria listed in the BRFA, but I forgot to do so and used bare ref data collected from before the BRFA, which is why other conditions that indicated a dead link (but not a 404) were marked as dead. I am so sorry to have disappointed you. I will do better next time and be careful. The fix for this is to not place a "dead link" template for any of the old data, and only do it for the new data going forward, to make sure the scope is defined.
Can you see how that undeclared change of scope undermines trust in the bot operator? It was not my intent to be sneaky or try to bypass the scope.
The most important thing is what Kvng said. Progress has been made and we'll get there soon. yes, there is always improvements to be made. Rlink2 (talk) 21:44, 28 January 2022 (UTC)[reply]
@Rlink2: thanks for the collaborative reply, but this is not yet resolved.
You appear to be saying that the bot relies on either 1) the list of articles which it is fed not including particular articles, or 2) on cached data from previous http requests to that URL.
Neither approach is safe. The code should make a fresh check of each URL for a 404 error, and apply the {{Dead link}} tag to that URL and only that URL.
  1. The list of pages which the bot processes should be irrelevant to its actions. Pre-selection is a great way of making the bot more efficient by avoiding having it skip thousands of pages where it has nothing to do. However, pre-selection is no substitute for code which ensures that the bot can accurately handle any page it processes.
  2. Caching HTTP requests for this task is a bad idea. It adds a further vector for errors, which are not limited to this instance of the cache remaining unflushed after a change of criteria. BrownHairedGirl (talk) • (contribs) 22:19, 28 January 2022 (UTC)[reply]
I've not had a chance to fully review the additional code Rlink2 has recently posted but a brief look shows that it uses a database of URLs which is apparently populated by a different process. That database should have been rebuilt for the trial and wasn't but there is nothing fundamentally wrong with this sort of two-stage approach to the problem. The list of pages is indeed relevant if this approach is used. ~Kvng (talk) 23:03, 28 January 2022 (UTC)[reply]
the list of articles which it is fed not including particular articles Well, it should work on either batches or individual articles.
Caching HTTP requests for this task is a bad idea. Despite my use of 'cache' as a variable name and the database, the way the script is supposed to work is: retrieve the titles, save them, and then retrieve them immediately after, which would constitute a "fresh check" while saving the title for further analysis. So there is one script that gets the title, and another that places it within the article. I released the code for the latter already, and will release the code for the former shortly. I did try to run the getter in advance for some of them (like now), but I won't do this anymore thanks to your feedback. Rlink2 (talk) 23:11, 28 January 2022 (UTC)[reply]
@Rlink2: my point did not relate to batches vs individual articles. It was about something different: that is, not relying on any pre-selection process.
As to the rest, I remain unclear about how the bot actually works. Posting all the code and AWB settings could resolve that.
@Kvng: the fundamental problem with the two stage approach is as I described above: that it creates extra opportunity for error, as happened in the trial run. BrownHairedGirl (talk) • (contribs) 00:04, 29 January 2022 (UTC)[reply]
I have posted the "getter" code at Wikipedia:Bots/Requests_for_approval/BareRefBot/Code2. If I missed something or something needs clarification let me know. I am a bit tired right now, and have been working all day on this, so it is entirely possible I forgot to explain something.
Again, the delay in releasing the code is getting it commented and cleaned up so you can understand it and be clear about how the bot actually works. Rlink2 (talk) 01:08, 29 January 2022 (UTC)[reply]
  • @Rlink2: I have just begun assessing the trial and noticed two minor things.
  1. the bot is filling the cite templates with a space either side of the equals sign in each parameter, e.g. |website = Cricbuzz.
    That makes the template harder to read, because when the wikimarkup is word-wrapped in the edit window, the spaces can cause the parameter and value to be on different lines. Please can you omit those spaces, e.g. |website=Cricbuzz
  2. in some cases, parameter values are followed by more than one space. Please can you eliminate this by adding some extra code to process each template, replacing multiple successive whitespace characters with one space?
Thanks. --BrownHairedGirl (talk) • (contribs) 01:18, 29 January 2022 (UTC)[reply]
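The two spacing fixes requested above are simple enough to illustrate with a couple of regular expressions. This is a hedged Python sketch of the idea, not the bot's actual AWB code; real cite templates have edge cases (nested templates, pipes inside URLs) that a production implementation would need to handle.

```python
import re

# Hedged sketch of the two fixes requested above: strip spaces around '='
# in template parameters, and collapse runs of spaces/tabs in values.

def tidy_template(wikitext):
    # |website = Cricbuzz  ->  |website=Cricbuzz
    out = re.sub(r"\s*\|\s*(\w+)\s*=\s*", r"|\1=", wikitext)
    # collapse repeated spaces/tabs to a single space
    out = re.sub(r"[ \t]{2,}", " ", out)
    return out

print(tidy_template("{{cite web |url=https://example.com |website = Cricbuzz  }}"))
```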
the bot is filling the cite templates with a space either side of the equals sign in each parameter Fixed, and reflected in posted source code.
parameter values are followed by more than one space Done, and reflected in posted source code. Rlink2 (talk) 01:22, 29 January 2022 (UTC)[reply]
Thanks, @Rlink2. That was quick! BrownHairedGirl (talk) • (contribs) 01:24, 29 January 2022 (UTC)[reply]
  • Comment Cite templates should only be added to articles that use that style of referencing. What are you doing to detect the referencing style and to keep amended references in the article's style? Keith D (talk) 21:43, 30 January 2022 (UTC)[reply]
    @Keith D: So you are saying that if an article is using references like [https://google.com google], then the bare ref <ref>https://duckduckgo.com</ref> should be converted to <ref>[https://duckduckgo.com Duckduckgo]</ref> style instead of the cite template? I can code that in. Rlink2 (talk) 21:50, 30 January 2022 (UTC)[reply]
    That is what I would expect in that case. Keith D (talk) 21:53, 30 January 2022 (UTC)[reply]
    @Keith D: I have added this in, but will have to update the source code posted here to reflect that. Rlink2 (talk) 15:54, 1 February 2022 (UTC)[reply]
    @Rlink2: Hang on. Please do not implement @Keith D's request.
    Citation bot always converts bracketed bare URLs to cite templates. I don't see why this bot should work differently.
    There are a few articles which deliberately use the bracketed style [https://google.com google], but they are very rare. The only cases I know of are the deaths-by-month series, e.g. Deaths in May 2020, which use the bracketed style because they have so many refs that cite templates slow them down. It would be much better to simply skip those pages, or apply the bracketed format to a defined set. BrownHairedGirl (talk) • (contribs) 17:21, 1 February 2022 (UTC)[reply]
    Citation bot should not be converting references to templates if that is not the citation style used in the article. It should be honouring the established style of the article. Keith D (talk) 17:37, 1 February 2022 (UTC)[reply]
    This is why {{article style}} exists, which only has 54 transclusions in 6 years! All we need is a new option for square-link-only, editors to use it, and bots to honor it. It's like CSS, a central mechanism to determine any style settings for a page. -- GreenC 18:06, 1 February 2022 (UTC)[reply]
    Citation templates radically improve the maintainability of refs, and ensure consistency of style. There are a very few cases where they are impractical due to the server load of hundreds of refs, but those pages are rare.
    In most cases where the square bracket refs dominate, it is simply because refs have been added by editors who don't know how to use the cite templates and/or don't like the extra work involved. We should be working to improve those other refs, not degrading the work of the bot. BrownHairedGirl (talk) • (contribs) 19:20, 1 February 2022 (UTC)[reply]
    See WP:CITEVAR which states "Editors should not attempt to change an article's established citation style merely on the grounds of personal preference, to make it match other articles, or without first seeking consensus for the change." Keith D (talk) 23:45, 1 February 2022 (UTC)[reply]
    It would be foolish to label the results of quick-and-dirty referencing as a "style". BrownHairedGirl (talk) • (contribs) 01:54, 2 February 2022 (UTC)[reply]
    As above, I would much prefer that the bot always use cite templates.
    But if it is going to try to follow the bracketed style where that is the established style, then please can it use a high threshold to determine the established style. I suggest that the threshold should be
    1. Minimum of 5 non-bare refs using the bracketed style (i.e. [http://example.com/foo Fubar] counts as bracketed, but [http://example.com/foo] doesn't)
    2. The bracketed, non-bare refs must be more than 50% of the inline refs on the page.
    I worry about the extra complexity this all adds, but if the bot is not going to use cite templates every time, then it needs to be careful not to use the bracketed format excessively. BrownHairedGirl (talk) • (contribs) 20:27, 2 February 2022 (UTC)[reply]
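If that threshold approach were adopted, it could be sketched as below. This is an illustrative Python sketch, not the bot's AWB code; the regexes are deliberately simplified and would need hardening for real wikitext (named refs, self-closing refs, list-defined references).

```python
import re

# Hedged sketch of the proposed threshold: treat bracketed refs as the
# established style only when there are at least 5 bracketed NON-bare refs
# AND they make up more than 50% of all inline refs on the page.

ANY_REF = re.compile(r"<ref[^>/]*>.*?</ref>", re.DOTALL)
BRACKETED_NON_BARE = re.compile(r"<ref[^>/]*>\s*\[https?://\S+\s+[^\]]+\]\s*</ref>")

def bracketed_style_established(wikitext, min_count=5, min_share=0.5):
    total = len(ANY_REF.findall(wikitext))
    bracketed = len(BRACKETED_NON_BARE.findall(wikitext))
    return bracketed >= min_count and total > 0 and bracketed / total > min_share
```

For example, six bracketed refs plus four cite-template refs would pass (6 ≥ 5 and 6/10 > 50%), while four bracketed refs would fail the minimum count.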
    As above, I would much prefer that the bot always use cite templates. As usual I have to agree with BHG here, if only to reduce bugs and complexity. The majority of articles are using citation templates anyway.

    While it is technically possible to implement BHG's criteria, it would add extra complexity. So I would prefer following BHG's advice of always using templates, but I am open to anything. Rlink2 (talk) 20:44, 2 February 2022 (UTC)[reply]
    There is currently no mechanism to inform automated tools what style to use. It's so uncommon not to use CS1|2 these days, as a conscious choice, that it should be the responsibility of the page to flag to tools how to behave rather than engaging in error-prone and complex guesswork. I'm working on a solution to adapt {{article style}}, but it won't be ready before this BRFA closes. In the mean time, if you run into editors who remain CS1|2 holdouts (do they exist?) they will revert, and we can come up with a simple and temporary solution to flag the bot, similar to how {{cbignore}} works: an empty template that does nothing, the bot just checks for its existence anywhere on the page and skips if so. -- GreenC 21:12, 2 February 2022 (UTC)[reply]
    99% of articles are using the citation templates. I agree with BHG, we want to avoid "scope creep" where most of the code is solving 1% of the problems.
    Personally, I don't have any skin in the citation game, but again, basically all of the articles are using them.
    In the mean time, if you run into editors who remain holdouts (do they exist?) they will revert and we can come up with a simple and temporary solution to flag the bot Yes. Rlink2 (talk) 15:40, 4 February 2022 (UTC)[reply]
    @Rlink2: that is a much better solution. I suspect that such cases will be very rare, much less than 1% of pages. BrownHairedGirl (talk) • (contribs) 02:32, 5 February 2022 (UTC)[reply]
    I have noticed recently through an article on my watchlist that BrownHairedGirl has manually been tagging the dead 404 refs herself. If she and others can focus on tagging all the dead refs, then we can take dead link tagging out of the bot. What do people here think? Rlink2 (talk) 14:20, 5 February 2022 (UTC)[reply]
    My tagging is very slow work. I have been doing some of it on an experimental basis, but that is no reason to remove the functionality from this bot. If this bot is processing the page, and already has the HTTP error code, then why not use it to tag? BrownHairedGirl (talk) • (contribs) 18:34, 5 February 2022 (UTC)[reply]

It's now over 7 days since the trial edits. @Rlink2: have you made a list of what changes have been proposed, and which you have accepted?

I think that a review of that list would get us closer to a second trial. --BrownHairedGirl (talk) • (contribs) 20:47, 5 February 2022 (UTC)[reply]

Here are the big ones:
  • PDF tagging was excluded before the trial, and will continue to stay that way. There was no tagging of PDF refs during the 1st trial.
  • Prior to the trial, the consensus was for the bot to mark refs with the "dead link" template if and only if the link returned a "404" status code at the time of filling. If the link was not a 404 but had issues (service unavailable, "cloudflare", generic redirect, invalid HTTPS certificate, etc.) the bare ref would simply be left alone at that moment. During the trial, several links that were not necessarily alive but did not return a 404 status error were marked with the "dead link" template, which was not the intended goal. The first change was to make sure the 404 detection was working properly and didn't cache the inaccurate data. Other than the marking of these links, the bot will do nothing regarding dead links in references or archiving, "broadly construed".
  • There was a proposal to use bracketed refs when converting from bare to non-bare in articles that predominantly used bracketed refs, but there was no consensus to implement this. Editors pointed out that "bracketed ref" articles are very rare and usually special cases. In cases like this, the editors of the article make it clear that citation templates are not to be used, and use bot exclusions, so the bot wouldn't have even processed those articles. GreenC pointed out that a template to indicate the citation style of the article exists, but it only has 54 transclusions, and other editors expanded on this by explaining that it would be difficult for a bot to determine the citation style of an article.
  • BrownHairedGirl pointed out two minor nitpicks regarding spacing of parameters, which was fixed.
  • There was some discussion about the possibility of WP:PEACOCK titles, but I explained that such instances are rare, and trying to get a bot to understand what a "peacock" title even is would be difficult. The people who brought this up seemed to be satisfied with my answer, and so there was no consensus to do anything regarding this.
  • There was some argument over what to do regarding the website parameter. The bot is able to extract a proper website parameter and split the website and title parameters for some but not all websites. There was some debate over how far the bot could go regarding the website parameter, but I expressed a need to "play it safe" and not dwell too much on this aspect since we are dealing with unstructured data. There was consensus that if the bot could not extract the website name, it should just use the domain name for the website parameter (e.g. {{cite web | title = Search Results | website=duckduckgo.com}} instead of {{cite web | title = Search Results }}) so the resulting ref still has important info about the website being cited. This change has been made. Rlink2 (talk) 21:58, 5 February 2022 (UTC)[reply]
Many thanks, @Rlink2, for that prompt and detailed reply. It seems to me to be a good summary of where we have got to.
It seems to me that on that basis the bot should proceed to a second trial run, to test whether the changes resolve the concerns raised by the first trial. @Primefac, what do you think? Are we ready for that? BrownHairedGirl (talk) • (contribs) 23:13, 5 February 2022 (UTC)[reply]
Small update to this: the bot now catches 410 "gone" status codes, as explained above. 410 is a less-used way to indicate that the content is no longer available. The number of sites using 410 status codes to indicate a dead link is small, but there are some, so it has been implemented in the bot. Rlink2 (talk) 21:09, 8 February 2022 (UTC)[reply]
Thanks, @Rlink2. After a long batch of checks, I now estimate that about 0.5% of pages with bare URLs have one or more bare URLs which return a 410 error. That suggests that there are about 1,300 such bare URLs to be tagged as {{dead link}}s, so this addition will be very helpful. BrownHairedGirl (talk) • (contribs) 00:29, 9 February 2022 (UTC)[reply]
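The agreed dead-link rule is narrow enough to express as a tiny pure function. Below is an illustrative Python sketch (not the bot's AWB code): only a fresh, uncached HTTP response of 404 or 410 produces a tag; any other status, timeout, or TLS failure leaves the ref alone. The exact template parameters the bot emits (e.g. a |bot= credit) are omitted here for simplicity.

```python
DEAD_STATUSES = {404, 410}  # "not found" and "gone": the only codes tagged

def dead_link_tag(status, month_year):
    """Return a {{dead link}} tag for the URL, or '' to leave the bare ref
    untouched. `status` comes from a fresh HTTP request at filling time."""
    if status in DEAD_STATUSES:
        return "{{dead link|date=%s}}" % month_year
    return ""

print(dead_link_tag(404, "February 2022"))  # tag emitted
print(dead_link_tag(503, "February 2022"))  # empty string: left alone
```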

Trial 2

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Sorry for the delay here, second trial looks good. Primefac (talk) 14:35, 13 February 2022 (UTC)[reply]

Trial complete. Adding for the record Primefac (talk) 14:46, 21 March 2022 (UTC)[reply]
@Primefac: @BrownHairedGirl:
Diffs can be found here: https://en.wikipedia.org/w/index.php?target=Rlink2&namespace=all&tagfilter=&start=2022-02-16&end=2022-02-16&limit=50&title=Special%3AContributions

The refs skipped had either PDF URLs or dead URLs that did not return 404 or 410 (for example: expired domain, connection timeout). One site had some strange website misconfiguration, so it didn't work in Chrome, Safari, Pale Moon, SeaMonkey, or Firefox. (I could only view it in some obscure browser.) As agreed by consensus, the bot will not touch these non-404/410 dead links, and it did not during the 2nd trial.

I think there was also a non-Wayback archive.org URL (as you know, archive.org has more than just archived webpages; they have books and scans of documents as well), along with a bare ref with the "Webarchive" template right next to it. As part of "broadly construed", these were not filled. The number of archived bare refs is small, I think, so this should not be an issue.

The rest of the sites skipped had junk titles (like "please wait ....." or "403 forbidden").

As requested, the website parameter was added when the "natural" name of the website could not be determined and the website name was not in the title. Extra care was taken to avoid a situation where there is a cite like
{{Cite web | title = Search results {{!}} duckduckgo.com | website=www.duckduckgo.com }}
which would look like
"Search results | duckduckgo.com". www.duckduckgo.com. Rlink2 (talk) 04:18, 16 February 2022 (UTC)[reply]
Thanks, @Rlink2.
The list of exactly 50 Trial2 edits can also be found at https://en.wikipedia.org/w/index.php?title=Special:Contributions&dir=prev&offset=20220214042814&target=Rlink2&namespace=0&tagfilter=AWB BrownHairedGirl (talk) • (contribs) 05:09, 16 February 2022 (UTC)[reply]
Yes, this time I tried to make it exactly 50 for precision and to avoid drama. Rlink2 (talk) 15:26, 16 February 2022 (UTC)[reply]
  • Big problem. I just checked the first 6 diffs. One of them is a correctly tagged dead link, but in the other 5 cases ([35], [36], [37], [38], [39]) there is no |website= parameter. Instead, the website name is appended to the title.
    This is not what was agreed after Trial1 (and summarised here[40] by @Rlink2) ... so please revert all trial2 edits which filled a ref without adding the website field. BrownHairedGirl (talk) • (contribs) 05:22, 16 February 2022 (UTC)[reply]
    I see that some edits ([41], [42]) did add a website field, filling it with the domain name.
    It appears that what has been happening is that when the bot figures out the website name, it is wrongly appending that to the title, rather than the correct step of placing it in the |website= parameter. BrownHairedGirl (talk) • (contribs) 05:41, 16 February 2022 (UTC)[reply]
    Hi @BrownHairedGirl:
    The reason the website parameter was not added in those diffs is that the website name is in the title (for example, the NYT link has "New York Times" right within the title; you can check for yourself in your browser). The bot did not modify or change the title to add the website name; if it could extract the website name, it would have been added to the "website=" parameter as we agreed.
    There are three possibilities:
    • The website name can be extracted from the website, hence there is no need to use the domain name for the website parameter, since a more accurate name is available. An example of this would be:
    "Article Story". New York Times.
    • The bot could detect that the website name is included in the title, but for some reason could not extract it. As stated before, extracting the website name from a title can be difficult sometimes, so even if the bot is able to detect that the website name is included, it may not be able to get a value suitable for the "website=" parameter. In this case, adding a website parameter would look like:
    "The Battle of Tripoli-Versailles - The Green-Brown Times". www.thegreenbrowntimes.com.
    in which case the website parameter would just repeat information, so the bot did a cite like this instead:
    "The Battle of Tripoli-Versailles - The Green-Brown Times".
    • The bot could not detect the website name and so added the website parameter with the domain name (as evidenced by the additional diffs you provided above). The cite would look like this:
    "Search results". www.duckduckgo.com. Rlink2 (talk) 15:25, 16 February 2022 (UTC)[reply]
    @Rlink2, I think you are over-complicating something quite simple, which I thought had been clearly agreed: the |website= parameter should always be present and filled. The points above should determine what its value is, but it should never be omitted. BrownHairedGirl (talk) • (contribs) 16:04, 16 February 2022 (UTC)[reply]
    @BrownHairedGirl:
    The reasoning behind adding the "website=" parameter was to make sure the name of the website is always present in the citation. In the first comment where you asked for the website param, the example title did not have the website name, so in that case it was clear that the website parameter should be added. In addition, in the "website=" example I gave in my final list before we started Trial 2, the website name was not included in the title. In the citations where the bot did not add the "website=" parameter, the name of the website was still present.

    Personally, I am fine with following your advice and always including the website parameter, even if the website name is in the title. However, I feared it could have caused anger amongst some factions of the citation game who would claim that the bot was "bloating" refs with possibly redundant info, so this was done to keep them happy. Rlink2 (talk) 18:19, 16 February 2022 (UTC)[reply]
    @Rlink2: the name of the work in which the article is included is always a key fact in any reference. If it is available in any form, it should be included as a separate field ... and for URLs, it is always available in some form, even if only as a domain name. The "separate field" issue is crucial, because the whole aim of cite templates is to provide consistently structured data rather than unstructured text of the form [http://example.com/foo More foo in Ballybeg next year -- Daily Example 39th of March 2031]
    If there is any bloating, it is the addition of the site name to the title, where it doesn't belong. If you can reliably remove any such redundancy from the title, then great ... but I don't think you will satisfy anyone at all by dumping all the data into the |title= parameter.
    I am a bit concerned by this, because it doesn't give me confidence that you fully grasp what citation templates are for. They are about consistently structured data, and issues of redundancy are secondary to that core purpose. BrownHairedGirl (talk) • (contribs) 18:37, 16 February 2022 (UTC)[reply]
    @BrownHairedGirl:
    the name of the work in which the article is included is always a key fact in any reference. If it is available in any form, it should be included as a separate field ... and for URLs, it is always available in some form, even if only as a domain name. Ok.
    If you can reliably remove any such redundancy from the title, then great I was actually about to suggest this idea in my first reply, because the bot should be able to reliably remove website titles if that is what is desired. That way we have something like
    {{Cite web | title = Article Title | website=nytimes.com}}
    instead of
    {{Cite web | title = Article Title {{!}} The New York Times }}
    or
    {{Cite web | title = Article Title {{!}} The New York Times | website=nytimes.com }}
    I am a bit concerned by this, because it doesn't give me confidence that you fully grasp what citation templates are for. They are about consistently structured data, and issues of redundancy are secondary to that core purpose. You'd be right, I know relatively little about citation templates compared to people like you, who have been editing even before the citation templates were created, but I am learning as time goes on. Thanks for telling me all this, I really appreciate it. Rlink2 (talk) 18:57, 16 February 2022 (UTC)[reply]
    @Rlink2: thanks for the long reply, but we are still not there. Please do NOT remove website names entirely.
    The ideal output is to have the name of the website in the website field. If that isn't possible, use the domain name.
    If you can determine the website's name with enough reliability to strip it from the |title= parameter, don't just dump the info -- use it in the website field, 'cos it's better than the domain name.
    And if you are not sure, then some redundancy is better than omission.
    Taking your examples above:
    1. {{Cite web | title = Article Title | website=nytimes.com}}
      bad: you had the website's name, but dumped it
    2. {{Cite web | title = Article Title {{!}} The New York Times }}
      bad: no website field
    3. {{Cite web | title = Article Title {{!}} The New York Times | website=nytimes.com }}
      not ideal, but least worst of these three
    In this case, the best would be {{Cite web | title = Article Title |website= The New York Times}}
    I think it might help if I set out in pseudocode what's needed:
VAR thisURL = "http://example.com/fubar"
VAR domainName = FunctionGetDomainNamefromURL(thisURL)
VAR articleTitle = FunctionGetTitleFromURL(thisURL)
// start by setting default value for websiteParam 
VAR websiteParam = domainName // e.g. "magicseaweed.com"
// now see if we can get a website name
VAR foundWebsiteName = FunctionToFindWebsiteNameAndDoASanityCheck()
IF foundWebsiteName  IS NOT BLANK // e.g. "Magic Seaweed" for https://magicseaweed.com/ 
     THEN BEGIN
         websiteParam = foundWebsiteName
         IF articleTitle INCLUDES foundWebsiteName
            THEN BEGIN
                VAR trimmedArticleTitle = articleTitle - foundWebsiteName
                IF trimmedArticleTitle IS NOT BLANK OR CRAP
                    THEN articleTitle = trimmedArticleTitle
                ENDIF 
             END
         ENDIF
     END
ENDIF
FunctionMakeCiteTemplate(thisURL, articleTitle, websiteParam)
  • Hope this helps BrownHairedGirl (talk) • (contribs) 20:25, 16 February 2022 (UTC)[reply]
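The pseudocode above translates to runnable Python along these lines. This is a hedged sketch: the Function* lookups are assumed to happen elsewhere, so their results are passed in as precomputed arguments, and the title-trimming uses a naive substring removal for illustration.

```python
# Hedged Python rendering of BrownHairedGirl's pseudocode above: |website=
# is always filled, defaulting to the domain name; a detected website name
# overrides the default and, where safe, is trimmed out of the title.

def build_cite_fields(domain_name, article_title, found_website_name):
    """Return (title, website) for the cite template."""
    website_param = domain_name  # default value for |website=
    if found_website_name:
        website_param = found_website_name
        if found_website_name in article_title:
            trimmed = article_title.replace(found_website_name, "").strip(" -|")
            if trimmed:  # keep the trim only if a usable title remains
                article_title = trimmed
    return article_title, website_param

# e.g. a fetched title that embeds the site name after a separator
print(build_cite_fields("nytimes.com",
                        "Article Title | The New York Times",
                        "The New York Times"))
# -> ('Article Title', 'The New York Times')
```

Note how this reproduces the "IS NOT BLANK OR CRAP" guard: if stripping the site name would leave an empty title, the original title is kept, with some redundancy preferred over omission.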
    @BrownHairedGirl: Ok, this makes sense. I will keep this in mind from here on out. So the website parameter will always be present from now on. Rlink2 (talk) 23:28, 16 February 2022 (UTC)[reply]
    @Rlink2: I was hoping that rather than just keep this in mind, you'd be telling us that the code had been restructured on that basis, and that the revised code had been uploaded. BrownHairedGirl (talk) • (contribs) 13:40, 19 February 2022 (UTC)[reply]
    @BrownHairedGirl: Yes, precise language is not my strong suit ;)
    Done, and reflected in the source code (all the other bug fixes, like the 410 addition, should also be uploaded now as well). So now, if the website parameter cannot be extracted or is not present, the domain name will always be used instead.
    And if you are not sure, then some redundancy is better than omission. I agree. Rlink2 (talk) 14:16, 19 February 2022 (UTC)[reply]
    Ok, it's been some time, and this is the only issue that has been brought up (and has been fixed). Should we have one more trial? Rlink2 (talk) 13:56, 22 February 2022 (UTC)[reply]
    @Rlink2: where is the revised code? BrownHairedGirl (talk) • (contribs) 10:01, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: Code can be found at the same place, Wikipedia:Bots/Requests_for_approval/BareRefBot/Code Rlink2 (talk) 12:48, 23 February 2022 (UTC)[reply]
    @Rlink2: code dated // 2.0 - 2022 Febuary 27.
    Some time-travelling? BrownHairedGirl (talk) • (contribs) 13:39, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: LOL, I meant 17th. Thank you ;) Rlink2 (talk) 13:44, 23 February 2022 (UTC)[reply]
    @Rlink2, no prob. Tipos happon tu us oll.
    I haven't fully analysed the revised code, but I did look over it. In principle it looks like it's taking a sound approach.
    I think that trial of this new code would be a good idea, and also that this trial should be of a bigger set (say 250 or 500 edits) to test a wider variety of cases. Some webmasters do really weird stuff with their sites. BrownHairedGirl (talk) • (contribs) 20:12, 23 February 2022 (UTC)[reply]
  • Problem2. In the edits which tagged link as dead (e.g. [43], [44]), the tag added is {{Dead link|bot=bareref|date=February 2022}}.
    This is wrong. The bot's name is BareRefBot, so the tag should be {{Dead link|bot=BareRefBot|date=February 2022}}. BrownHairedGirl (talk) • (contribs) 05:33, 16 February 2022 (UTC)[reply]
    I have fixed this. Rlink2 (talk) 15:26, 16 February 2022 (UTC)[reply]
  • I have not checked either trial to see if this issue has arisen, but domain-reselling pages and similar should not be filled; instead the links should be marked as dead, as they need human review to find a suitable archive or new location. AFAIK there is no reliable way to automatically determine whether a page is a domain reseller, but the following strings are common examples:
    • This website is for sale
    • Deze website is te koop
    • HugeDomains.com
    • Denna sida är till salu
    • available at DomainMarket.com
    • 主婦が消費者金融に対して思う事
  • In addition, the following indicate errors and should be treated as such (I'd guess the bare URL is going to the best option):
    • page not found
    • ACTUAL ARTICLE TITLE BELONGS HERE
    • Website disabled
  • The string "for sale!" is frequently found in the titles of domain reselling pages and other unsuitable links, but there might be some false positives? If someone has the time (I don't atm) and desire it would be useful to see what the proportion is to determine whether it's better to skip them as more likely unsuitable or accept that we'll get a few unsuitable links alongside many more good ones. In all cases your code should allow the easy addition or removal of strings from each category as they are detected. Thryduulf (talk) 11:44, 23 February 2022 (UTC)[reply]
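The skip logic described above can be sketched as follows. The category strings are the illustrative ones listed; as Thryduulf suggests, a real implementation would keep them in easily edited lists so strings can be added or removed as they are detected:

```python
# Title fragments suggesting a usurped/parked domain: mark the link dead
# rather than filling it, since it needs human review.
USURPED_TITLE_STRINGS = [
    "this website is for sale",
    "deze website is te koop",
    "hugedomains.com",
    "denna sida är till salu",
    "available at domainmarket.com",
    "主婦が消費者金融に対して思う事",
]

# Titles indicating an error page: leave the URL bare.
ERROR_TITLE_STRINGS = [
    "page not found",
    "actual article title belongs here",
    "website disabled",
]

def classify_title(title):
    t = title.lower()
    if any(s in t for s in USURPED_TITLE_STRINGS):
        return "usurped"   # tag {{dead link}}, needs human review
    if any(s in t for s in ERROR_TITLE_STRINGS):
        return "error"     # skip: keep the bare URL for later
    return "ok"
```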
    @Thryduulf: Thank you for the feedback. I already did this (as in, detect domain-for-sale titles). Usually anything with "for sale" in it is a junk title, and it is better to skip the ref for later than to fill it with a bad title. Rlink2 (talk) 12:45, 23 February 2022 (UTC)[reply]
    This approach seems sound, but there will always be unexpected edge cases. I suggest that the bot's first few thousand edits be run at a slow pace on a random sample of articles, to facilitate checking.
    It would also be a good idea to
    1. not follow redirected URLs. That facility is widely abused by webmasters, and can lead to very messy outcomes
    2. maintain a blacklist of usurped domains, to accommodate cases which evade the filters above.
    Hope that helps. BrownHairedGirl (talk) • (contribs) 20:18, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: I suggest that the bot's first few thousand edits be run at a slow pace on a random sample of articles, to facilitate checking. Yes, this is a good idea. While filling out bare refs manually with AWB I saw first hand many of the edge cases and "gotchas", so more checking is always a good thing.
    not follow redirected URLs. This could actually be a good idea. I don't know the data on how many URLs are redirects and how many of those are valid, but there are many dead links that use a redirect to the front page instead of throwing a 404. There can be an exception placed for redirects that just go from HTTP to HTTPS (since that usually does not indicate a change or removal of content). Again, I will have to do some data collection and see if this approach is feasible, but it looks like a good idea that will work.
    maintain a blacklist of usurped domains I already have a list of "blacklisted" domains that will not be filled, yes this is a good idea. Rlink2 (talk) 19:39, 24 February 2022 (UTC)[reply]
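The HTTP-to-HTTPS exception mentioned above can be sketched as a simple check on the original and final URLs (a hypothetical helper, not the bot's code; a real bot would also decide how to treat a "www." change):

```python
from urllib.parse import urlparse

def is_scheme_upgrade_only(original_url, final_url):
    """True when a redirect merely switched http to https (possibly
    adding or dropping "www."), which usually does not indicate that
    the content moved or was removed."""
    a, b = urlparse(original_url), urlparse(final_url)
    host_a = a.netloc.removeprefix("www.")
    host_b = b.netloc.removeprefix("www.")
    return (a.scheme == "http" and b.scheme == "https"
            and host_a == host_b
            and a.path.rstrip("/") == b.path.rstrip("/"))
```

A redirect that fails this check (e.g. one landing on the front page instead of returning a 404) would be left alone as a possible soft 404.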
    When it comes to soft 404 string detection, they are all edge cases. There is near infinite variety. For example there are titles in foreign languages: "archivo no encontrado|pagina non trovata|página não encontrada|erreur 404|något saknas" .. it goes on and on and on.. -- GreenC 21:40, 24 February 2022 (UTC)[reply]
    @GreenC: well the number "404" is in there for one of them, which would be filtered. Of course there will always be an infinite variety, but we can get 99.9% of them. During my run, the only soft 404s I remember seeing after my existing filtering were fake redirects to the same page (discussed above). Rlink2 (talk) 22:05, 24 February 2022 (UTC)[reply]
    Well, I've been working on a soft 404 detector for over 4 years as a sponsored employee of Internet Archive and at best I can get 85%. That's after years of effort finding strings to filter on. There is a wall at that point because the last 15% are all mostly unique cases, one offs, so you can't really predict them. I appreciate you strive for 99% but nobody gets that. The best soft 404 filter in existence is made by Google and I don't think they get 99%. There are academic papers on this topic, AI programs, etc.. I wish you luck, please appreciate the problem, it's non-trivial. -- GreenC 23:11, 24 February 2022 (UTC)[reply]
    @GreenC:
    Yes, I agree that soft 404 detection is a very difficult problem. However, in this case, we may not even need to solve it.
    So I'm guessing it's 85 percent of 99%. Let's just say that, because of my relative lack of experience, my script is 75% or even 65%. So out of all the "soft 404s" (of which there are not many among Wikipedia bare refs, which are the purpose of the bot) it can still catch a good chunk.
    The soft 404s I've seen are things like redirects to the same page. Some redirects could be legitimate, and some not. That's a hard problem to figure out, like you said. But we know that where there is a redirect there may or may not be a soft 404, so we avoid the detection problem by just leaving the ref alone for the moment.
    Another example is when multiple pages have the same title. There is a possibility of a soft 404 there, or maybe not. But if we do nothing at all under this circumstance, we don't have to worry about "detecting" a soft 404.
    It's kind of like asking "what is the hottest place to live in Antarctica" and answering "Let's avoid the place altogether; we'll deal with Africa or South America". Not a perfect analogy, but you get the point.
    The only thing that I have no idea how to deal with is foreign-language 404s, but again, there are not too many of them.
    My use of "99%" was not literal; it was an exaggeration. Nothing will ever come close to 100%, because there is an infinite number of websites with an endless variety of configurations. It is impossible to plan for all of those websites, but at the same time those types of websites are rare. Rlink2 (talk) 05:20, 26 February 2022 (UTC)[reply]
    User:Rlink2: Some domains have few to none; others have very high rates, as much as 50% (looking at you, ibm.com, ugh). What constitutes a soft 404 can itself be difficult to determine, because the landing page may have relevant content but not be the same as the original, detectable only by comparing with the archive URL. One method: determine the date the URL was added to the wiki page, examine the archive URL for that date, and use the title from there. That's what I would do if writing a title bot. All URLs eventually revert to 404 or soft 404 in time, so a snapshot close to the time the URL was added to the wiki will be the most reliable data. -- GreenC 15:19, 2 March 2022 (UTC)[reply]
    "determine the date the URL was added to the wiki page. Examine the archive URL for the date, and use the title from there.". This is actually a good idea, I think I thought this once actually but forgot, thanks for telling (or reminding) me.

    However, as part of "broadly construed", I don't want the bot to do anything with archive sites; it would create unnecessary drama that would take away from the goal of filling bare refs. Also, the website could have changed the title to be more descriptive, or the content may have moved, so the archived title may not always be the best one. Maybe a mismatch between the archive title and the current URL title should be a signal to leave the ref alone for the moment.

    If any site in particular has high soft 404 rates, we will simply blacklist it and the bot will not fill any refs from those domains. Rlink2 (talk) 16:18, 2 March 2022 (UTC)[reply]
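For reference, GreenC's snapshot-lookup idea maps onto the Wayback Machine availability API (the endpoint and JSON shape are the documented ones; the helper names are illustrative). The query asks for the snapshot closest to the date the URL was added to the page:

```python
from urllib.parse import quote

WAYBACK_API = "https://archive.org/wayback/available"

def wayback_query(url, added_date):
    """Build the availability-API request for the snapshot closest to
    added_date ("YYYYMMDD"), i.e. near when the URL entered the page."""
    return f"{WAYBACK_API}?url={quote(url, safe='')}&timestamp={added_date}"

def pick_snapshot(api_response):
    """Extract the closest snapshot URL from the parsed JSON reply,
    or None when nothing usable was archived."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```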
    And regarding foreign titles, there were very few of them in my runs. At most I saw 10 of them during my 50,000+ bare-ref edit run. Rlink2 (talk) 22:50, 24 February 2022 (UTC)[reply]
    Are you saying foreign language websites account for about 10 out of every 50k? -- GreenC 23:24, 24 February 2022 (UTC)[reply]
    Actually, maybe there were like 50 articles with foreign-language titles, but I can only remember like 5 or 10 of them. I filtered out some of the Cyrillic characters since they were creating cite errors due to the way the script handled them, so the actual amount the bot has to decide on is less than that. Rlink2 (talk) 05:22, 26 February 2022 (UTC)[reply]

@Rlink2 and Primefac: it is now 4 weeks since the second trial, and Rlink2 has resolved all the issues raised. Isn't it time for a third trial? I suggest that this trial should be bigger, say 250 edits, to give a higher chance of detecting edge cases. --BrownHairedGirl (talk) • (contribs) 23:14, 12 March 2022 (UTC)[reply]

@BrownHairedGirl, Yes, I think its time. Rlink2 (talk) 02:33, 13 March 2022 (UTC)[reply]

BareRefBot as a secondary tool

I would like to ask that BareRefBot be run as a secondary tool, i.e. that it should be targeted as far as possible to work on refs where the more polished Citation bot has tried and failed.

This is a big issue which I should probably have raised at the start. The URLs that Citation bot cannot fill are why I have been so keen to get BareRefBot working, and I should have explained this in full earlier on. Pinging the other contributors to this BRFA: @Rlink2, Primefac, GreenC, ProcrastinatingReader, Kvng, Levivich, Pppery, 1234qwer1234qwer4, and Thryduulf, whose input on this proposal would be helpful.

I propose this because on the links which Citation bot can handle, it does a very thorough job. It uses the zotero servers to extract a lot of metadata such as date and author which BareRefBot cannot get, and it has a large and well-developed set of lookups to fix issues with individual sites, such as using {{cite news}} or {{cite journal}} when appropriate. It also has well-developed lookup tables for converting domain names to work titles.

So ideally, all bare URLs would be filled by the well-polished Citation bot. Unfortunately, there are many websites which Citation bot cannot fill, because the zotero provides no data. Other tools such as WP:REFLINKS and WP:REFILL often can handle those URLs, but none of them works in batch mode and individual editors cannot do the manual work fast enough to keep up with Citation bot's omissions.

The USP of BareRefBot is that thanks to Rlink2's cunning programming, it can do this followup work in batch mode, and that is where it should be targeted. That way we get the best of both worlds: Citation bot does a polished job if it can, and BareRefBot does the best it can with the rest.

I am systematically feeding Citation bot with long lists of articles with bare URLs, in two sets:

  1. User:BrownHairedGirl/Articles with new bare URL refs, consisting of the Articles with bare URL refs (ABURs) which were in the latest database dump but not in the previous dump. The 20220220 dump had 4,904 new ABURs, of which 4,518 still had bare URLs.
  2. User:BrownHairedGirl/Articles with bare links, consisting of articles not part of my Citation bot lists since a cutoff date. The bot is currently about halfway through a set of 33,239 articles which Citation bot had not processed since 1 December 2021.

If BareRefBot is targeted at these lists after Citation bot has done them, we get the best of both worlds. Currently, these lists are easily accessed: all my use of Citation bot is publicly logged in the pages linked and I will happily email Rlink2 copies of the full (unsplit lists) if that is more convenient. If I get run over by a bus or otherwise stop feeding Citation bot, then it would be simple for Rlink2 or anyone else to take over the work of first feeding Citation bot.

What do others think? --BrownHairedGirl (talk) • (contribs) 11:25, 2 March 2022 (UTC)[reply]

Here is an example of what I propose.
Matt Wieters is page #2178 in my list Not processed since 1 December - part 6 of 11 (2,847 pages), which is currently being processed by Citation bot.
Citation bot edited the article at 11:26, 2 March 2022, but it didn't fill any bare URL refs. I followed up by using WP:REFLINKS to fill the 1 bare URL ref, in this edit.
That followup is what I propose that BareRefBot should do. BrownHairedGirl (talk) • (contribs) 11:42, 2 March 2022 (UTC)[reply]
I think first and foremost you should look both ways before crossing the road so you don't get run over by a bus. :-D It strikes me as more efficient to have BRB follow CB as suggested. I don't see any downside. Levivich 19:28, 2 March 2022 (UTC)[reply]
@BrownHairedGirl
This makes sense; I think that Citation bot is better at filling out refs completely. One thing that would be interesting to know is whether Citation Bot can improve already-filled refs. For example, let's say we have a source that Citation bot can get the author, title, name, and date for, but BareRefBot can only get the title. If BareRefBot only fills in the title, and Citation bot comes after it, would Citation bot fill in the rest?
and it has a large and well-developed set of lookups to fix issues with individual sites, such as using cite news or cite journal when appropriate. I agree.
It uses the zotero servers to extract a lot of metadata such as date and author which BareRefBot cannot get, and it has a large and well-developed set of lookups to fix issues with individual sites Correct.
It also has well-developed lookup tables for converting domain names to work titles. Yes, do note that list could be ported to Bare Ref Bot (list can be found here)
That way we get the best of both worlds: Citation bot does a polished job if it can, and BareRefBot does the best it can with the rest. I agree. Let's see what others have to say Rlink2 (talk) 19:38, 2 March 2022 (UTC)[reply]
Glad we agree in principle, @Rlink2. You raise some useful questions:
One thing that would be intresting to know is if Citation Bot can improve already filled refs.
yes, it can and does. But I don't think it overwrites all existing data, which is why I think it's better to give it the first pass.
For example, let's say we have a source that citation bot can get the author, title, name, and date for, but BareRefBot can only get the title. If BareRefBot only fills in the title, and citation bot comes after it, would citation bot fill in the rest?
If an existing cite has only |title= filled, Citation Bot often adds many other parameters (see e.g. [45]).
However, I thought we had agreed that BareRefBot was always going to add and fill a |website= parameter?
My concern is mostly with the |title=. Citation Bot does quite a good job of stripping extraneous stuff from the title when it fills a bare ref, but I don't think that it re-processes an existing title. So I think it's best to give Citation Bot the first pass at filling the title.
Hope that helps. Maybe CB's maintainer AManWithNoPlan can check my evaluation and let us know if I have misunderstood anything about how Citation Bot handles partially-filled refs. BrownHairedGirl (talk) • (contribs) 20:27, 2 March 2022 (UTC)[reply]
I think you are correct. Citation bot relies mostly on the Wikipedia zotero; there are a few cases where we go beyond zotero (IEEE might be the only one). A big thing that the bot does is extensive error checking (bad dates, authors of "check the rss feed", and such). Also, it almost never overwrites existing data. AManWithNoPlan (talk) 20:35, 2 March 2022 (UTC)[reply]
Many thanks to @AManWithNoPlan for that prompt and helpful clarification. --BrownHairedGirl (talk) • (contribs) 20:51, 2 March 2022 (UTC)[reply]
@BrownHairedGirl @AManWithNoPlan
But I don't think it overwrites all existing data, which is why I think it's better to give it the first pass. Yeah, I think John raised this point at the Citation Bot talk page, and AManWithNoPlan has said above that it can add new info but not overwrite the old.
However, I thought we had agreed that BareRefBot was always going to add and fill a |website= parameter? Yes, this hasn't changed. I forgot to say "title and website", while Citation Bot can get author, title, website, date, etc.
So I think it's best to give Citation Bot the first pass at filling the title. This makes sense.
Citation Bot does quite a good job of stripping extraneous stuff from the title when it fills a bare ref, I agree. Maybe AManWithNoPlan could share the techniques used so they can be ported to BareRefBot? Or is the stripping done on the Zotero servers? He would have more information regarding this.
I also have a question about the turnaround of the list making process. How long does it usually take for Citation Bot to finish a batch of articles? Rlink2 (talk) 20:43, 2 March 2022 (UTC)[reply]
See https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation and https://github.com/ms609/citation-bot/blob/master/Zotero.php it has list of NO_DATE_WEBITES, tidy_date function, etc. AManWithNoPlan (talk) 20:45, 2 March 2022 (UTC)[reply]
@Rlink2: Citation Bot processes my lists of ABURs at a rate of about 3,000 articles per day. There's quite a lot of variation in that (e.g. big lists are slooow, wee stubs are fast), but 3k/day is a good ballpark.
The 20220301 database dump contains 155K ABURs, so we are looking at ~50 days to process the backlog. BrownHairedGirl (talk) • (contribs) 20:47, 2 March 2022 (UTC)[reply]
@BrownHairedGirl
So every 50 days there will be a new list, or you will break the list up into pieces and give the list of articles citation bot did not fix to me incrementally? Rlink2 (talk) 21:01, 2 March 2022 (UTC)[reply]
@Rlink2: it's in batches of up to 2,850 pages, which is the limit for Citation Bot batches.
See my job list pages: User:BrownHairedGirl/Articles with bare links and User:BrownHairedGirl/Articles with new bare URL refs. I can email you the lists as they are done, usually about one per day. BrownHairedGirl (talk) • (contribs) 21:27, 2 March 2022 (UTC)[reply]
  • Duh @me.
@Rlink2, I just realised that in order to follow Citation Bot, BareRefBot's worklist does not need to be built solely off my worklists.
Citation Bot has 4 channels, so my lists comprise only about a quarter of Citation Bot's work. The other edits are done on behalf of other editors, both as batch jobs and as individual requests. Most editors do not publish their work lists like I do, but Citation Bot's contribs list is a record of the pages which the bot edited on their behalf, so it is a partial job list (obviously, it does not include pages which Citation bot processed but did not edit).
https://en.wikiscan.org/user/Citation%20bot shows the bot averaging ~2,500 edits per day. So if BareRefBot grabs, say, the last 10,000 edits by Citation Bot, that will usually amount to about four days of work by CB, which would be a good list to work on. Most editors do not choose their Citation bot jobs on the basis of bare URLs, so the incidence of bare URLs in those lists will be low ... but any bare URLs which are there will have been recently processed by Citation Bot.
Also, I don't see any problem with BareRefBot doing a run in which the bot does no filling, but just applies {{Bare URL PDF}} where appropriate. A crude search shows that there are currently over 30,000 such refs to be tagged, which should keep the bot busy for a few days: just disable filling, and let it run in tagging mode.
Hope this helps. BrownHairedGirl (talk) • (contribs) 21:20, 4 March 2022 (UTC)[reply]
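Grabbing Citation Bot's recent edits is a standard MediaWiki `list=usercontribs` query (the endpoint and parameter names are the documented API ones; the pagination shape sketched here is illustrative). Repeating the request with each reply's `uccontinue` value walks back through the last ~10,000 edits:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def usercontribs_query(user, limit=500, uccontinue=None):
    """Build one page of a list=usercontribs request; pass the
    uccontinue token from each JSON reply to fetch the next page."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,
        "uclimit": limit,
        "ucnamespace": 0,   # mainspace only
        "format": "json",
    }
    if uccontinue:
        params["uccontinue"] = uccontinue
    return API + "?" + urlencode(params)
```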
@BrownHairedGirl:
BareRefBot's worklist does not need to be built solely off my worklists. Oh yes, I forgot about the contribution list as well.
So if BareRefBot grab says the last 10,000 edits by Citation Bot, that will usually amount to about four days work by CB, which would be a good list to work on. I agree.
Most editors do not choose their Citation bot jobs on the basis of bare URLs, so the incidence of bare URLs in those lists will be low ... but any bare URLs which are there will have been recently processed by Citation Bot. True. Just note that tying the bot to Citation bot means it can only go as fast as Citation bot goes; that's fine with me since there isn't really a big rush, but it's something to note.
Also, I don't see any problem with BareRefBot doing a run in which the bot does no filling, Me neither. Rlink2 (talk) 01:44, 5 March 2022 (UTC)[reply]
Thanks, @Rlink2.
I had kinda hoped that once BareRefBot was authorised, it could start working around the clock. At, say, 7 edits per minute, it would do ~10,000 pages per day, and clear the backlog in under 3 weeks.
By making it follow Citation bot, we restrict it to about 3,000 pages per day. That means that it may take up to 10 weeks, which is a pity. But I think we will get better results this way. BrownHairedGirl (talk) • (contribs) 01:58, 5 March 2022 (UTC)[reply]
@BrownHairedGirl: Maybe a hybrid model could work, for example it could avoid filling in refs for websites where the bot knows citation bot could possibly get better data (e.x: nytimes, journals, websites with metadata tags the barerefbot doesn't understand, etc..). That way we have the best of both worlds - the speed of barerefbot, and the (higher) quality of citation bot. Rlink2 (talk) 02:02, 5 March 2022 (UTC)[reply]
@Rlink2: that is theoretically possible, but I think it adds a lot of complexity with no gain.
The problem that BareRefBot exists to resolve is the opposite of that set, viz. the URLs which Citation bot cannot fill, and we can't get a definitive list of those. My experience of trying to make such a list for Reflinks was daunting: the sub-pages of User:BrownHairedGirl/No-reflinks websites list over 1400 sites, and it's far from complete. BrownHairedGirl (talk) • (contribs) 02:16, 5 March 2022 (UTC)[reply]
  • Some numbers. @Rlink2: I did some analysis of the numbers, using AWB's list comparer and pre-parser. The TL;DR is that there are indeed very slim pickings for BareRefBot in the other articles processed by Citation bot: ~16 per day.
I took CB's latest 10,000 edits, as of about midday UTC today. That took me back to just two hours short of five days, on 28 Feb. Of those 10K, only 4,041 were not from my list. Only 13 of them still have a {{Bare URL inline}} tag, and 93 have an untagged, non-PDF bare URL ref. After removing duplicates, that left 104 pages, but 25 of those were drafts, leaving only 79 mainspace articles.
So CB's contribs list gives an average of only 16 non-BHG-suggested articles per day for BareRefBot to work on.
In those 5 days, I fed CB with 14,168 articles, on which the bot made just short of 6,000 edits. Of those 14,168 articles, 2,366 still have a {{Bare URL inline}} tag, and 10,107 have an untagged, non-PDF bare URL ref. After removing duplicates, that left 10,143 articles for BareRefBot to work on. That is about 2,000 per day.
So in those 5 days, Citation bot filled all the bare URLs on 28.5% of the articles I fed it. (There are more articles where it filled some but not all bare refs.) It will be great if BareRefBot can make a big dent in the remainder.
Hope this helps. --BrownHairedGirl (talk) • (contribs) 20:03, 5 March 2022 (UTC)[reply]
  • For what it's worth, I dislike the idea of having a bot whose sole task is to clean up after another bot; we should be improving the other bot in that case. If this bot can edit other pages outside of those done by Citation bot, then it should do so. Primefac (talk) 12:52, 27 March 2022 (UTC)[reply]
    @Primefac, well that's also a good way of thinking about it. I'm personally fine with any of the options (work on its own or follow citation bot), its up to others to come to a consensus over what is best. Rlink2 (talk) 12:55, 27 March 2022 (UTC)[reply]
    @Primefac: my proposal is not to clean up after another bot, which would describe one bot fixing errors made by another.
    My proposal is different: that this bot should do the tasks that Citation bot has failed to do. BrownHairedGirl (talk) • (contribs) 03:37, 28 March 2022 (UTC)[reply]
    BrownHairedGirl is right; the proposal is not about cleaning up the other bot's errors, it is about what Citation Bot is not doing (more specifically, the bare refs not being filled). Rlink2 (talk) 17:55, 28 March 2022 (UTC)[reply]
    @Primefac: Also, there seems to me to be no scope for extending the range of URLs Citation bot can fill. CB uses the zotero servers for its info on the bare URLs, and if the zotero doesn't provide the info, CB is helpless.
    It is of course theoretically conceivable that CB could be extended with a whole bunch of code of its own to gather data about the URLs which the zoteros can't handle. But that would be a big job, and I don't see anyone volunteering to do that.
    But what we do have is a very willing editor who has developed a separate tool to do some of what CB doesn't do. Please don't let the ideal of an all-encompassing Citation Bot (which is not even on the drawing board) become the enemy of the good, i.e. of the ready-to-roll BareRefBot.
    This BRFA is now in its tenth week. Rlink2 has been very patient, but please let's try to get this bot up and running without further long delay. BrownHairedGirl (talk) • (contribs) 18:25, 28 March 2022 (UTC)[reply]
    Maybe I misread your initial idea, but you have definitely misread my reply. I was saying that if this were just a case of cleaning up after CB, then CB should be fixed. Clearly, there are other pages to be dealt with, which makes that entire statement void, and I never suggested that CB be expanded purely to take over this task. Primefac (talk) 18:31, 28 March 2022 (UTC)[reply]
    @Primefac: maybe we went the long way around, but it's good to find that in the end we agree that there is a job for BareRefBot to do. Please can we try to get it over the line without much more time? BrownHairedGirl (talk) • (contribs) 20:11, 28 March 2022 (UTC)[reply]

Trial 3

Symbol tick plus blue.svg Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Primefac (talk) 12:48, 27 March 2022 (UTC)[reply]

AssumptionBot

Operator: AssumeGoodWraith (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 11:34, Wednesday, February 16, 2022 (UTC)

Function overview: Adds AFC unsubmitted templates to drafts.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: I think this works?

Links to relevant discussions (where appropriate): Wikipedia:Village pump (proposals) § Bot proposal (AFC submission templates)

Edit period(s): Meant to be continuous.

Estimated number of pages affected: ~100 a day, judging by the new pages feed (about 250 today) and assuming that not many drafts are left without the afc template

Namespace(s): Draft

Exclusion compliant (Yes/No): Yes (pywikibot)

Function details: Adds AFC unsubmitted templates ( {{afc submission/draft}} ) to drafts in draftspace that don't have them, the {{draft article}} template, or anything that currently redirects to those 2. See the examples in the VPR proposal listed above.

Discussion

  • I'm not going to decline this outright, if only to allow for feedback and other opinions, but not all drafts need to go through AFC, and so having a bot place the template on every draft is extremely problematic. Primefac (talk) 12:22, 16 February 2022 (UTC)[reply]
  • {{BotOnHold}} until the RFC (which I have fixed the link to) has completed. In the future, get consensus before filing a request. Primefac (talk) 12:22, 16 February 2022 (UTC)[reply]
  • @Primefac: Not sure if this is a misunderstanding, but it's the unsubmitted template, not the submitted one (Template:afc submission/draft). — Preceding unsigned comment added by AssumeGoodWraith (talkcontribs) 12:28, 16 February 2022 (UTC)[reply]
    I know, and my point still stands - not every draft is meant to be sent for review at AFC, and so adding the template to every draft is problematic. Primefac (talk) 12:38, 16 February 2022 (UTC)[reply]
    @Primefac: I thought you interpreted the proposal as "automatically submitting all new drafts for review". I'll wait for the RFC. – AssumeGoodWraith (talk | contribs) 12:49, 16 February 2022 (UTC)[reply]
  • Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT 12:41, 18 February 2022 (UTC)[reply]
  • I'm not a BAG member, but I'd like to point out that your code won't work as you expect for multiple reasons.
    First, Python will interpret "{{afc submission".lower(), "{{articles for creation".lower(), etc. as separate conditions that are always True, meaning the only condition that is actually considered is "{{draft article}}".lower() not in page.text.
    Also, your time.sleep call is outside the loop, meaning it will never actually be run. Bsoyka (talk · contribs) 04:59, 25 February 2022 (UTC)[reply]
    I'll figure it out when approved. – AssumeGoodWraith (talk | contribs) 05:09, 25 February 2022 (UTC)[reply]
    ... Or now. – AssumeGoodWraith (talk | contribs) 05:09, 25 February 2022 (UTC)[reply]
    Yes, if there are errors in the code, please sort them out sooner rather than later, as there is little point in further delaying a request because known bugs still need fixing. Primefac (talk) 13:54, 27 February 2022 (UTC)[reply]
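The pitfall Bsoyka describes above is a common one: bare non-empty string literals are truthy on their own, so chaining them with `or` (or commas) short-circuits before the membership test ever runs. A hedged reconstruction of the shape of the bug (the bot's actual code is not reproduced here; template names are from the comment):

```python
page_text = "Example draft text containing {{AfC submission}}".lower()

TEMPLATES = ("{{afc submission", "{{articles for creation", "{{draft article}}")

# Buggy shape: the first string literal is non-empty, hence truthy, so
# the whole expression is True regardless of what page_text contains.
buggy = bool("{{afc submission" or "{{articles for creation"
             or "{{draft article}}" not in page_text)

# Correct shape: test every template name against the page text.
needs_template = all(t not in page_text for t in TEMPLATES)

# Likewise, a rate-limit sleep must sit *inside* the processing loop:
# for page in pages:
#     process(page)
#     time.sleep(delay)
```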
  • I'd like to note that I've closed the RfC on this task. From the close: "There is consensus for such a bot, provided that it does not tag drafts created by experienced editors. The consensus on which users are experienced enough is less clear, but it looks like (auto)confirmed is a generally agreed upon threshold." Tol (talk | contribs) @ 19:06, 18 March 2022 (UTC)[reply]
    Approved for trial (50 edits or 21 days, whichever happens first). Please provide a link to the relevant contributions and/or diffs when the trial is complete. This is based on the assumption that the bot will only be adding the template to non-AC creations. Primefac (talk) 12:37, 27 March 2022 (UTC)[reply]
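For reference, the two Python pitfalls Bsoyka describes above can be reproduced in miniature. These are hypothetical snippets reconstructing the bugs as described; the bot's actual source is not shown here.

```python
import time

page_text = "{{AfC submission}} some draft text"
text = page_text.lower()

# Pitfall 1: bare strings chained with `and` are always truthy, so this
# condition collapses to just the final `not in` test.
broken = "{{afc submission".lower() and "{{draft article}}".lower() not in text
assert broken is True  # the AfC template IS present, yet the check passes

# Correct version: apply the membership test to each marker explicitly.
fixed = all(
    marker not in text
    for marker in ("{{afc submission", "{{articles for creation",
                   "{{draft article}}")
)
assert fixed is False  # now the existing AfC template is detected

# Pitfall 2: a time.sleep placed after a loop never throttles the loop
# body; it must go inside the loop to rate-limit each iteration.
for _ in range(3):
    pass              # per-page work would happen here
    time.sleep(0)     # correct placement (0 seconds for illustration only)
```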

Bots that have completed the trial period

Qwerfjkl (bot) 9

Operator: Qwerfjkl (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 08:12, Saturday, April 2, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): JavaScript

Source code available: Just some regexps

Function overview: Mass-remove statement from articles

Links to relevant discussions (where appropriate): Wikipedia:Bot requests#Removal of a WP:RSUW statement from around 3,000 Poland related stub articles covering small villages and rural communities

Edit period(s): One time run

Estimated number of pages affected: ~3000

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): Yes

Function details: The bot will remove a WP:RSUW statement, "Before 1945 the area was part of Germany", per the reasons described at the BOTREQ linked above.

Discussion

  • Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's make sure this works and there aren't any further objections. Please link to this page in the edit summary. --TheSandDoctor Talk 17:00, 3 April 2022 (UTC)[reply]
    @TheSandDoctor: See these 50 contributions (I accidentally ran this on my main account, not my bot account). ― Qwerfjkltalk 18:12, 3 April 2022 (UTC)[reply]
Trial complete. For the sake of updating the status. --TheSandDoctor Talk 02:38, 4 April 2022 (UTC)[reply]

IndentBot

Operator: Notsniwiast (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 03:20, Friday, October 15, 2021 (UTC)

Function overview: Adjust indentation on discussion pages.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python, pywikibot

Source code available: On Github

Links to relevant discussions (where appropriate): Wikipedia:Bot_requests/Archive_83#Bot_to_fix_indents

Edit period(s): Continuous (tracking recent changes on a delay)

Estimated number of pages affected: Depends on parameters. With a delay of 10 minutes, around 20-30 pages are checked per 10 minutes (see function details below). Initially, most pages with substantial content will be edited, but since the bot processes the entire page, this will taper off over time as it covers more ground.

Namespace(s): All talk namespaces, and the project namespace. Not sure if any other namespaces have discussion pages.

Exclusion compliant (Yes/No): Yes, uses pywikibot's save function.

Function details: First, the wikitext is partitioned into lines in the usual manner using \n as a delimiter, except that certain newlines, such as those immediately preceding a table, template, or tag (as detected by WikiTextParser), are not considered the end of a line. Then we apply fix_gaps, fix_extra_indents, and fix_indent_style to the sequence of lines.

Definitions

  • The indentation characters are *, :, and #.
  • Given a line X, we denote the indentation characters of the line by indent_text(X), and we denote the indentation level by lvl(X). In particular, if X is not indented then lvl(X) == 0.
  • A blank line is a line consisting of whitespace only.
  • A gap is a nonempty contiguous sequence of blank lines sandwiched between two indented lines, which are called the opening line and closing line.
  • The length of a gap is the length of the sequence of blank lines.
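The definitions above can be expressed as a few small helpers. This is an illustrative sketch; the names mirror the description, not necessarily the bot's source.

```python
import re

INDENT_RE = re.compile(r'[*:#]*')

def indent_text(line: str) -> str:
    """Leading indentation characters of a line (may be empty)."""
    return INDENT_RE.match(line).group()

def lvl(line: str) -> int:
    """Indentation level; 0 for an unindented line."""
    return len(indent_text(line))

def is_blank(line: str) -> bool:
    """A blank line consists of whitespace only."""
    return not line.strip()

# A length-1 gap: one blank line sandwiched between two indented lines.
lines = ["::reply one", "", "::reply two"]
assert lvl(lines[0]) == 2 and is_blank(lines[1]) and lvl(lines[2]) == 2
```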

Fixes

  1. fix_gaps: This fix has many variations. Let A and B be the opening and closing lines, respectively. No gap with an opening or closing line beginning with # is removed. Otherwise, all length 1 gaps are removed, and longer gaps are removed only if lvl(B) > 1.
  2. fix_extra_indents: We iterate over the lines from beginning to end. If we encounter a line A followed by a line B such that lvl(B) > lvl(A) + 1, then the subsequent chunk of lines which have indentation level greater than or equal to lvl(B), beginning with B, is shifted to the left by lvl(B) - lvl(A) - 1 positions. This is done by stripping out indent_text[lvl(A):lvl(B)-1] (in Python notation) from these lines.
  3. fix_indent_style: We iterate over the lines from beginning to end and adjust the indent_text of each line to use corresponding characters from the closest previous line with the same or smaller level, except that # characters are not removed from, introduced to, or shifted inside a line.
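Fix 2 is specified precisely enough to sketch. The following is a self-contained illustration of the described shifting; the real implementation handles additional edge cases.

```python
import re

def _indent(line):
    """Leading *, :, # characters of a line."""
    return re.match(r'[*:#]*', line).group()

def fix_extra_indents(lines):
    """Shift over-indented chunks left so levels never jump by more than 1.

    Illustrative sketch of fix 2, not the bot's actual code.
    """
    lines = list(lines)
    for i in range(len(lines) - 1):
        a, b = _indent(lines[i]), _indent(lines[i + 1])
        if len(b) > len(a) + 1:
            cut_start, cut_end = len(a), len(b) - 1
            # The chunk starts at B and runs while level >= lvl(B).
            j = i + 1
            while j < len(lines) and len(_indent(lines[j])) >= len(b):
                j += 1
            # Strip indent_text[lvl(A):lvl(B)-1] from each line in the chunk.
            for k in range(i + 1, j):
                it = _indent(lines[k])
                lines[k] = it[:cut_start] + it[cut_end:] + lines[k][len(it):]
    return lines

# A reply over-indented by two extra levels is pulled back to one level deep.
assert fix_extra_indents(["comment", ":::reply", ":::also"]) == \
    ["comment", ":reply", ":also"]
```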

The above description leaves out some details (namely some exceptions for edge cases). The fixes are repeatedly applied in the above order until a further round would not alter the page (one round is almost always enough).

It's basically impossible to handle all edge cases, and it's not difficult to come up with some, especially with ordered lists and combinations of possible mistakes. The hope is that these are rare enough to be acceptable.

The bot tracks recent changes on a delay-minute delay, in chunks of chunk minutes, checking for non-minor, non-bot edits that add a user signature and have not been superseded in the most recent delay minutes. The effect of this is that IndentBot is activated by signature-adding edits only, and does not edit any page which has had a signature-adding edit in the most recent delay minutes. I believe delay should be set to 10 to 30 minutes. Too long a delay results in editors manually fixing indentation in active discussions, partially defeating the purpose of the bot. Non-talk pages must have at least 3 signatures to be edited, ensuring that a single accidental signature on a non-discussion page doesn't trigger the bot. Most sandboxes are avoided.
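The filtering just described might be sketched as follows. The data shape and field names here are invented for illustration; the real bot reads the recent-changes feed via pywikibot.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=10)

def pages_to_process(changes, now):
    """Select pages whose latest signature-adding edit is older than DELAY.

    `changes` is a list of dicts with keys 'title', 'time', 'minor',
    'bot', and 'has_signature' (a hypothetical shape for illustration).
    """
    latest = {}
    for c in changes:
        # Only non-minor, non-bot edits that add a signature count.
        if c['minor'] or c['bot'] or not c['has_signature']:
            continue
        if c['title'] not in latest or c['time'] > latest[c['title']]:
            latest[c['title']] = c['time']
    # Skip pages with a signature-adding edit in the last DELAY minutes.
    return [title for title, ts in latest.items() if now - ts >= DELAY]

now = datetime(2021, 10, 15, 12, 0)
changes = [
    {'title': 'Talk:A', 'time': now - timedelta(minutes=30),
     'minor': False, 'bot': False, 'has_signature': True},
    {'title': 'Talk:B', 'time': now - timedelta(minutes=5),
     'minor': False, 'bot': False, 'has_signature': True},
]
# Talk:A has been quiet long enough to edit; Talk:B is still active.
assert pages_to_process(changes, now) == ['Talk:A']
```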

Discussion

  • Also, can someone make IndentBot a confirmed user so that it can bypass CAPTCHAs? Winston (talk) 04:01, 15 October 2021 (UTC)[reply]
    • Nevermind, now autoconfirmed. Winston (talk) 22:54, 15 October 2021 (UTC)[reply]
  • Does anyone know why I still see some bots when filtering recent changes for human edits only? Winston (talk) 08:25, 17 October 2021 (UTC)[reply]
    Answered here. Winston (talk) 01:51, 18 October 2021 (UTC)[reply]
  • Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT 23:07, 18 October 2021 (UTC)[reply]
    Sorry, ran the wrong function once. Winston (talk) 01:42, 19 October 2021 (UTC)[reply]
  • Thanks for working on this. In response to Not sure if any other namespaces have discussion pages, DYK noms are the odd example that always comes to mind, e.g. Template:Did you know nominations/La Folia Barockorchester. It's probably fine if we skip these to keep things simple, though. — The Earwig (talk) 03:39, 20 October 2021 (UTC)[reply]
    If it's only a couple cases like the DYK noms, then it's pretty easy to handle them with a quick title prefix check. Winston (talk) 03:53, 20 October 2021 (UTC)[reply]
  • The code has pretty much settled and the bot is ready for a short trial if the example diffs given look good. Winston (talk) 04:07, 20 October 2021 (UTC)[reply]

Approved for trial (50 edits or 7 days, whichever happens first). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Primefac (talk) 07:35, 20 October 2021 (UTC)[reply]

@Primefac Should edits be minor? Winston (talk) 07:36, 20 October 2021 (UTC)[reply]
For the trial, let's go with "no" so that it receives a bit more scrutiny. I think if this goes through, marking as minor would match similar bots. Primefac (talk) 09:06, 20 October 2021 (UTC)[reply]

Trial complete. See the diffs here.

  • Haven't looked too carefully yet, but one edge case I saw was Line 80 in the diff Wikipedia:Arbitration enforcement log/2021 involving {{Div col}}. It would be fine if {{Div col}} started on the same line as the comment. Winston (talk) 12:56, 20 October 2021 (UTC)[reply]
    Possible fix is to not adjust style if the previous line contains an exceptional newline character, in this case the exceptional newline is the one just before {{Div col}} (since newlines just before templates do not count as delimiters in the line partitioning phase). Winston (talk) 13:03, 20 October 2021 (UTC)[reply]
  • Note: I suspect the easiest way to handle edge cases as they come up is to simply prevent the bot from making certain edits to certain lines, rather than trying to handle every case correctly. Winston (talk) 13:07, 20 October 2021 (UTC)[reply]
  • Not sure if this is open to all comments, feel free to remove if not. I came here after seeing the edit at Talk:List of Ayatollahs and I was interested. If you look at the edit it made, it didn't manage to get it correct. Although it starts well, it fails at the signature section starting "please study the answers", which should have been indented. Because it missed this, all the edits made afterwards are wrong.
    Also the messages it changed are over a decade old, will it be normal practice for it to change messages that are that old or was this part of the test? ActivelyDisinterested (talk) 23:34, 20 October 2021 (UTC)[reply]
    The edit looks better formatted to me and I don't see unintended edge cases, though I'm interested in others' opinions. Be sure to check out the links at User:IndentBot#IndentBot to understand why and how the indentation is being adjusted.
    As for old messages, the bot does not take that into account. It adjusts indentation on the entire page at once. For more active talk pages, old discussions are often stored in archives. Since the bot is only activated by a recent edit with signature, archived pages shouldn't be touched. Winston (talk) 00:07, 21 October 2021 (UTC)[reply]
    It's indented part of a message, and left the last section unindented (the second grey unedited section). That's definitely not right. ActivelyDisinterested (talk) 00:46, 21 October 2021 (UTC)[reply]
    Could you partially quote the lines you are referring to? Note that the bot does not fix indentation completely—in particular, it does not add extra indentation (so unindented lines will remain unindented). It only changes indentation characters, removes blank lines, and reduces over-indentation. Winston (talk) 00:56, 21 October 2021 (UTC)[reply]
    Ah! I see what happened. The section at issue in full is:
    please study the answers in his discussion pages in different languages. Academycanada (talk) 03:21, 24 November 2009 (UTC)
    This is the end of the message that began in full:
    :By a simple search, you can find the sources such as Islamic organizations, independent websites and academic institutions which introduced him as one of Marjas and Grand Ayatollahs. Here are some of them in different languages:. note the message starts with an indent
    The start of the message was indented, the bot correctly indented the middle lines of the message, but the end section was not originally indented and so the bot ignored it. As you said unindented lines remain unindented, but that does leave one message with two levels of indents. ActivelyDisinterested (talk) 01:08, 21 October 2021 (UTC)[reply]
    Yeah, this bot isn't smart enough to fix all errors. It would have to be way more advanced to tackle issues at a "per message" level of detail, and even then there are too many edge cases. Winston (talk) 01:14, 21 October 2021 (UTC)[reply]
  • Made minor improvements to the line partitioning. Also, fix_indent_style now resets its "memory" after list-breaking newlines. This behavior makes more sense and is more faithful to the original indentation. It solves quite a few bugs including the {{Div col}} one I mentioned earlier. Winston (talk) 05:08, 22 October 2021 (UTC)[reply]
  • @Primefac: Can I do another trial to draw more scrutiny? (Also want to test it on Toolforge this time.) Winston (talk) 08:42, 22 October 2021 (UTC)[reply]
  • Regarding [46]: this is not the bot's fault, but on AFDs it's convention to use bullets for voting, while delsort notices use a colon indent. In this case, the bot changed all bullets to colons after the first delsort notice – and this would happen on literally every AFD. Can something be done about this?
    Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. I would suggest citing a policy/information page in the edit summary. Also, consider using the minor edit flag for user talk pages even in trial, as otherwise the users would get new-message alerts. – SD0001 (talk) 12:24, 24 October 2021 (UTC)[reply]
    Ok, edits to user talk pages will get minor edits. Also, I can add a simple exception for comments beginning with <small class="delsort-notice". Added link to MOS:INDENTMIX to the edit summary. Winston (talk) 12:31, 24 October 2021 (UTC)[reply]
    @SD0001: Actually, how are these delsort notices inserted? Are they manually typed out or is there some automation involved? I noticed one of them just used <small> without the "class" attribute. Winston (talk) 12:48, 24 October 2021 (UTC)[reply]
    Found it. It is {{Deletion sorting}}. But I guess every now and then someone adds it manually. I'll just use a regex for a small tag followed by "Note:". Winston (talk) 12:57, 24 October 2021 (UTC)[reply]
    Yes, that template is substed by a couple of tools – MediaWiki:Gadget-twinklexfd.js and User:Enterprisey/delsort.js being the two common ones. – SD0001 (talk) 12:58, 24 October 2021 (UTC)[reply]
    @Notsniwiast Actually, I went ahead and boldly edited that template to use a bullet instead. If no one reverts my edit, then an exception would be unnecessary. – SD0001 (talk) 13:02, 24 October 2021 (UTC)[reply]
    @SD0001 Do you still want me to add this exception for the trial, then maybe remove it later? Winston (talk) 13:05, 24 October 2021 (UTC)[reply]
    Yes that would be better, as the bot would be touching many pages that already have colon-indented delsort notices. – SD0001 (talk) 13:06, 24 October 2021 (UTC)[reply]

Trial complete. See the contributions here, or see the diffs in alphabetical order here. Winston (talk) 14:50, 24 October 2021 (UTC)[reply]

  • It seems Wikipedia:Categories for discussion also uses some templates which trigger the bot. Winston (talk) 14:53, 24 October 2021 (UTC)[reply]
    See Template talk:Cfd2#Remove leading colons regarding those templates. – SD0001 (talk) 16:58, 24 October 2021 (UTC)[reply]
  • I think this edit should not have been made. It divided User:Salimfadhley's comment into 3 bullet points, when it looks like they intended to create an effect similar to parabreaks. ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ (talk) 15:53, 24 October 2021 (UTC)[reply]
  • Special:Diff/1051601137 is a worry for me. Not necessarily because the bot shouldn't have made the edit, but because those entries were all made by the default templates. Either we change the template, exclude the PERM pages from the bot, or accept the fact that every time someone requests a permission the bot will follow behind and fix it. I think option 1 (changing the pre-set layout) is likely best but that will likely require further discussion and/or consensus, especially since there's a bot that needs to clerk (not sure how that will affect it). Primefac (talk) 17:08, 24 October 2021 (UTC) sorry for the no-show today, dealing with a rather heavy headache for some reason[reply]
    • Yeah it seems there's a couple of these templates around. I guess the plan right now is to exclude the relevant pages, and include them later if the templates are changed. But I'm still not sure if all the relevant entries are made using templates. I see some variation in the delsort notices, e.g. <small class="delsort-notice"> versus just <small>, so unless there's more than one version of the templates or editors are doing it manually, there might be some other tools involved (I don't know anything about assisted editing tools). Another example is Wikipedia:Articles_for_deletion/Metropolitan_Gazette_(2nd_nomination) where the delsort notices still use : even though it was made after SD0001's edit. For now, I will skip "Wikipedia:Requests_for_permissions/" and "Wikipedia:Categories for discussion/". The notices using <small> tags were already handled for the trial. Winston (talk) 02:16, 25 October 2021 (UTC)[reply]
    I ended up asking one editor why their delsort notices didn't have the class attribute, and apparently they were just doing it manually. So I think it's likely that variations are due to manual edits. Winston (talk) 11:14, 25 October 2021 (UTC)[reply]
  • Note: (This is in reply to ಮಲ್ನಾಡಾಚ್ ಕೊಂಕ್ಣೊ's comment, but I'm posting it here since it's more generally relevant.) Unfortunately, there’s not much to do in these inevitable cases. SD0001 brought up a similar example before. The nature of the problem requires that the bot operate on entire discussions at once. As a result, anything more than a single minor “violation” in a discussion makes it impossible to create a consistent and accessible list without sometimes changing an editor’s indentation visually. Making exceptions leaves broken lists/markup, and often just shifts the issue to a different part of the list. Since the change is usually minor and doesn’t alter core content, I hope this is acceptable. I also hope the bot’s work will increase awareness of templates such as {{pb}} and {{HTML lists}} which address the most common reasons (that I’ve seen) for incorrect markup. I have links to these templates and other guidelines on the bot's user page. Winston (talk) 02:26, 25 October 2021 (UTC)[reply]
  • I have noticed the bot doing useless edits removing blank lines, which is not needed. In fact, everything listed for this bot to do is useless. I will deliberately indent more or change indent style, so it looks as if this will try to undo that. Looks like this bot is trying to fix a non-problem. Surely there are more useful things to do with bots around here. Graeme Bartlett (talk) 10:24, 26 October 2021 (UTC)[reply]
    @Graeme Bartlett Could you provide an example of your using over-indentation or changing indent style, for which normal indentation would be inadequate and for which an accessible solution is impractical? From what I've seen, this is quite rare, but it could be an edge case that can be avoided. Winston (talk) 02:46, 27 October 2021 (UTC)[reply]
 – Winston (talk) 11:10, 26 October 2021 (UTC) First time moving a discussion, tell me if I did it incorrectly.[reply]

This ?bot? made a useless edit here: https://en.wikipedia.org/w/index.php?title=Talk:Bicarbonate&curid=1450293&diff=1051599806&oldid=1051598562

which has no effect on the output we see. I thought that bots were not permitted to make cosmetic-only changes. Even if the extra blank line is redundant, there is no need to remove it! Graeme Bartlett (talk) 10:18, 26 October 2021 (UTC)[reply]

Break

What's the status of this? Not sure where to go from here. I've noticed that on mobile, bulleted and unbulleted comments don't line up (check here for example), so the bot is even more effective there. Winston (talk) 01:06, 29 October 2021 (UTC)[reply]

{{BAG assistance needed}} Winston (talk) 09:17, 31 October 2021 (UTC)[reply]
I think this needs another round of trial, this time a larger one. The CfD templates have been fixed per talkpage note, and I see you've edited the PERM template too. As for WP:RFUD, which is where I assume @Graeme Bartlett is coming from, the issue seems to be that {{UND}} when substed produces a bullet indent, but most users haven't noticed this and are anyway adding an indent character of their own.
Also, I think the issue of changing the final indent character should be discussed. I don't have any preferences, but I think changing a visible bullet to no bullet (or vice versa, see several cases in [47]) can be seen as intrusive. Would like to hear others' thoughts on this. – SD0001 (talk) 12:59, 31 October 2021 (UTC)[reply]
Apologies for radio silence on this one, it's relatively low-priority at this point in my life, but I do agree based on a read-through here that a further trial would probably be good. Primefac (talk) 13:03, 31 October 2021 (UTC)[reply]
I did realize that changing the final (and hence visual) character could be annoying, but the point is that mixing characters shouldn't happen in the first place. So if the final indent character is not changed, it neuters a large portion of the fixes. Even a simple single-level list such as
* Comment 1.
: Comment 2.
* Comment 3.
: Comment 4.
would be left as four separate lists in HTML and to screen readers. Let me see if I can compute approximately what fraction of indentation style fixes occur in the final character. Winston (talk) 13:17, 31 October 2021 (UTC)[reply]
In Category:Non-talk pages that are automatically signed (just using this to get a quick collection of pages), 2770 lines would have indentation characters altered, and 839 of those lines would have an altered final character. Each altered character represents (almost always) a new list being started where there shouldn't be. Winston (talk) 13:31, 31 October 2021 (UTC)[reply]
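A rough way to count such style breaks is sketched below. This is illustrative only, not the script used to produce the numbers above; it counts adjacent same-level lines whose final indent characters differ, each of which starts a new HTML list.

```python
import re

def list_breaks(lines):
    """Count adjacent indented lines at the same level whose final
    indent character differs; each mismatch starts a new HTML list."""
    breaks = 0
    for prev, cur in zip(lines, lines[1:]):
        p = re.match(r'[*:#]*', prev).group()
        c = re.match(r'[*:#]*', cur).group()
        if p and c and len(p) == len(c) and p[-1] != c[-1]:
            breaks += 1
    return breaks

# The four-line example earlier in this thread renders as four separate
# lists: the initial list plus three breaks.
assert list_breaks(["* Comment 1.", ": Comment 2.",
                    "* Comment 3.", ": Comment 4."]) == 3
```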
@SD0001 I'm confused about the {{UND}} template. When I substed it into my sandbox I didn't see a bullet point, and the template's doc doesn't show bullet points either. I believe Graeme Bartlett noticed the bot through the diff they linked. Winston (talk) 14:26, 31 October 2021 (UTC)[reply]
Indeed it doesn't. I assumed that was the reason why so many of the RFUD comments were over-indented ([48]). – SD0001 (talk) 14:36, 31 October 2021 (UTC)[reply]

@SD0001 I reviewed the "last character issue" and I see how it can be intrusive when, for example, the first comment in a level is unbulleted, but the following comments are all or mostly bulleted which then get changed by the bot. Two examples are in the sections "Unban request for Soumya-8974" and "SoyokoAnis unban appeal" in this diff. Perhaps I could implement a compromise where the bot first computes which type (bulleted or unbulleted) is more common for each level, then when it encounters an INDENTMIX violation, it uses the more common type. Winston (talk) 10:10, 1 November 2021 (UTC)[reply]

With this strategy, the number of lines with altered final character gets reduced by 25% to 630. Winston (talk) 13:27, 1 November 2021 (UTC)[reply]
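The per-level majority strategy described above might look like this. It is an illustrative sketch, not the bot's actual code.

```python
import re
from collections import Counter

def preferred_final_chars(lines):
    """For each indentation level, find the most common final indent
    character; an INDENTMIX violation at that level would then be
    normalised to the majority character (sketch of the compromise)."""
    counts = {}  # level -> Counter of final indent characters
    for line in lines:
        it = re.match(r'[*:#]*', line).group()
        if it:
            counts.setdefault(len(it), Counter())[it[-1]] += 1
    return {level: c.most_common(1)[0][0] for level, c in counts.items()}

# Level 1 is mostly bulleted, level 2 mostly colon-indented, so
# mismatched lines at each level would be changed to match the majority.
assert preferred_final_chars(["*a", "*b", ":c", "::d", "::e"]) == {1: '*', 2: ':'}
```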

I've made a number of slight improvements to each of the three fixes and I think the bot is ready for a third trial. I don't think the final character issue can be mitigated any further without simply ignoring final-character INDENTMIX violations. I guess we can see whether anyone complains during/after the trial. For the trial, I'll continue marking edits as non-minor (except on user talk pages) to draw more scrutiny. Winston (talk) 13:56, 2 November 2021 (UTC)

@SD0001: I realized the bot wasn't conservative enough and would sometimes make the text harder to understand by not preserving the original editor's indentation style. After some brainstorming and trial and error, I've managed to make the bot respect the original indentation much more while sacrificing a bit of accessibility, i.e. it defers to the original text for certain INDENTMIX violations. The number of final indentation characters changed has been reduced a further 48%. Can we start a third trial? Winston (talk) 11:04, 4 November 2021 (UTC)[reply]

Sure, go ahead. Approved for extended trial (200 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. – SD0001 (talk) 06:48, 5 November 2021 (UTC)[reply]
An indent-bot is definitely required. Some editors make mistakes with their indents. Some simply don't know how to indent. Most frustrating? Some deliberately mis-indent (usually after their mistakes have been pointed out), and when they continue to deliberately mis-indent, it's basically their way of giving you (the adviser) the figurative 'middle finger'. GoodDay (talk) 17:47, 5 November 2021 (UTC)[reply]

Trial feedback

Examples
  • I just reverted this massive refactoring when I saw this bot editing my discussion; I chose bullets on purpose to break that section apart. — xaosflux Talk 13:53, 5 November 2021 (UTC)[reply]
  • Here is another example: diff - this doesn't make sense, that first line was clearly not intended to be part of the "discussion" - so was stylized differently. — xaosflux Talk 14:01, 5 November 2021 (UTC)[reply]
  • More bad edits (already reverted by another editor). — xaosflux Talk 14:03, 5 November 2021 (UTC)[reply]
  • Let's not chase around another bot example that I can assume was specifically programmed to edit one way already. — xaosflux Talk 14:09, 5 November 2021 (UTC)[reply]
  • Another example diff that made the new list worse, see the section around "Person who is autistic" - where this bot has introduced double bullets. — xaosflux Talk 14:54, 5 November 2021 (UTC)[reply]
Discuss
  • I think this task is going to need a much larger discussion before being released on all edits, all the time; I expect it will continue to make contentious edits that don't have a policy to support them (i.e. a policy that only certain indentation or list styles are allowed to be used). — xaosflux Talk 13:58, 5 November 2021 (UTC)[reply]
  • The more I look at these edits, the more fundamentally broken I think this is. Perhaps as an OPT-IN-ONLY on certain pages it could be useful? — xaosflux Talk 14:04, 5 November 2021 (UTC)[reply]
    I guess I should disable altering the final indent character completely for now. Too many edge cases. Sorry about that. I'll review the diffs you posted and see if the bot would still have made those edits after disabling this behavior. The final character issue was brought up before, but I underestimated the problem. Winston (talk) 14:09, 5 November 2021 (UTC)[reply]
    Yes, I'd suggest not changing the final character indents at all. They're not much of an accessibility issue in practise I believe, and fixing them is clearly looking like more trouble than it's worth. – SD0001 (talk) 14:47, 5 November 2021 (UTC)[reply]
  • I don't know what you people were thinking when you approved this thing, but it's completely screwing up existing discussions [49]. And BTW, according to a friend who actually uses a screen reader, the whole idea that indenting patterns are this big deal is a myth. This has the potential to make literally hundreds of thousands of discussions and posts unintelligible. Cut it out RIGHT NOW. EEng 14:10, 5 November 2021 (UTC)[reply]
    EEng, this is a trial run, which is done specifically to see if these sorts of issues arise. Clearly, there are major concerns, and based on the last few posts here I'm starting to think that this bot will not be approved without significant overhaul. Primefac (talk) 14:12, 5 November 2021 (UTC)[reply]
    Ya think? How can you possibly have ever thought this could fly? Above I read I've managed to make the bot respect the original indentation much more -- oh, he's respecting the indentation used by discussants, which is critical to following the flow of the discussion, much more? You mean, like, you guys are willing to compromise on that and only make discussions somewhat impossible to follow? EEng 14:19, 5 November 2021 (UTC)[reply]
    Maybe it would be better not to do a trial run on userpages without pre-approval by the users involved? I've reverted the bot at EEng's talkpage, just because it seemed really hard to believe EEng would like the effect. Remember, his talkpage can be seen from space. Bishonen | tålk 14:23, 5 November 2021 (UTC).[reply]
    You're right, I've removed the user talk namespace. Winston (talk) 14:25, 5 November 2021 (UTC)[reply]
    You've removed the user talk namespace, so you're only going to fuck up article talk pages and project guideline talk pages? Well, I guess that's a start.
    You're not getting this. There is no possible way to do what you're doing without screwing up existing pages, because there's a fundamental conflict between the assertions in INDENT (or wherever) and the way people actually format their discussions. What you're trying to do inevitably changes the formatting of existing discussions so that the meaning of editors' comments is changed. You're trying to square the circle, and need to give it up completely. EEng 14:56, 5 November 2021 (UTC) P.S. I just noticed above that the plan is to fuck up project pages (e.g. actual guidelines and policies, not just the talk pages) as well. The lunatics have clearly taken over the asylum.[reply]
    There were 2 trials already done (50 + 50 = 100 edits) which drew basically no negative feedback, which was why this was approved for extended trial of 200 edits. Something looks to have regressed in the newer code that's causing the issues. It looks like @Notsniwiast has stopped the bot now. – SD0001 (talk) 14:25, 5 November 2021 (UTC)[reply]
  • Agree. This thing is jacking up the formatting on talk pages. Sometimes the formatting is there intentionally. Just undid the bot at Talk:Stanley Kubrick for an example. Jip Orlando (talk) 14:14, 5 November 2021 (UTC)[reply]
    @Jip Orlando Sorry about that, was the issue the swapping from bullet/no bullet issue? If the issue was isolated, could you mention which part? Winston (talk) 14:27, 5 November 2021 (UTC)[reply]
    [50] here, it looks like it's tweaking the replyto stuff by moving the discussions to the left and adding bullets where colons were. I understand that it is making the formatting appear consistent, but it is undoing what appears to have been done intentionally. I see the bullets as used for making a salient point and the indents as a reply to the point. Maybe I'm being nitpicky, but having a sudden sea of bullets doesn't make things look organized. Jip Orlando (talk) 14:38, 5 November 2021 (UTC)[reply]
  • Not a fan of having an alert pop up wrt my account talkpage, only to find out it's a semantically void whitespace twiddle. If it had been another person doing the same thing I'd be miffed. More so when it's a mindless thing. ☆ Bri (talk) 14:21, 5 November 2021 (UTC)[reply]
    Yeah sorry, I should have respected the user talk space more. It's been removed from the bot for now. Winston (talk) 14:29, 5 November 2021 (UTC)[reply]
    For the record, user talk page notifications are suppressed when edits are marked as minor edit + bot edit + account has bot flag. I believe you need to add bot=True near here. Also the bot flag has expired. – SD0001 (talk) 14:39, 5 November 2021 (UTC)[reply]
    I forgot the bot flag expired. When the bot flag is on, the edits are automatically marked as a bot edit. Winston (talk) 14:41, 5 November 2021 (UTC)[reply]
  • USER TALK testing should not happen unless this has a bot flag. By combining +bot and +minor attributes this could make use of the (nominornewtalk) feature to not trigger the new message notifications. (This is not an endorsement that this should be currently tested.) — xaosflux Talk 14:39, 5 November 2021 (UTC)[reply]
  • @SD0001: I suggest that the operator, @Notsniwiast: needs to go manually review every edit they just made and revert anything that possibly made the page worse. — xaosflux Talk 14:56, 5 November 2021 (UTC)[reply]
    I don't think he can be trusted to do that. What he needs to do is revert everything immediately, and where a page has been edited subsequent to the bot's edit, post a message or something warning watchers to take a look themselves. This is really serious. I cannot believe this got anywhere at all. EEng 14:59, 5 November 2021 (UTC)[reply]
    Yeah I'm reverting right now. Winston (talk) 15:00, 5 November 2021 (UTC)[reply]
  • From my watchlist: I don't want to pile on, but [51] this changed the placement (and hence meaning) of, at least, AlwaysInRed's message. Many of the diffs have large changes, so it is hard to figure out which are problematic. Urve (talk) 15:51, 5 November 2021 (UTC)[reply]
    @Urve Could you partially quote the message so I can find it? Is it "I am the lead-moderator and one"? Winston (talk) 15:53, 5 November 2021 (UTC)[reply]
    Yes. Even if that was an unintentional error, people do purposefully comment in this way (several indents after an outdent), to continue replying to the comment that isn't outdented. Why people do this (instead of a message directly underneath what they wish to reply to), I'm not sure. But the problem is that the meaning is changed if these are all outdented without regard to what they're replying to. Urve (talk) 15:57, 5 November 2021 (UTC)[reply]
  • I think this bot is doomed to failure, should not be approved for ongoing use, and should never have been approved even for the limited testing runs it made. The first thing I saw here was several diffs of completely wrong talk page refactorings and the diff-posters were correct that they were completely wrong. People on discussion pages use indentation to mean different things, that cannot adequately be guessed by a bot, because the meaning of the indentation is in the semantics of what they're saying rather than in the syntax of their comment. Just to pick an easy example, people will choose the indentation level of a comment (among several different indentation levels for a comment placed in the exact same place in the discussion) to indicate to whom they are replying; unless the bot can understand that part of the back-and-forth (and it can't) it cannot correctly adjust the indentation. People will sometimes deliberately choose between *-indentation of their comments or :-indentation of their comments according to how prominent they want that comment to be, and will use both *-indentation and :-indentation for sub-elements within comments as well as for whole comments. Additionally, editors often take significant offense even at careful human refactoring of their comments. This is not a task that can be solved without full human-level AI, which does not exist, and even then is of dubious value. A bot rampage that changes what is meant is a bad thing, and completely unnecessary. We do not need our talk pages to be well structured according to some spec. We need them to communicate with each other. —David Eppstein (talk) 16:16, 5 November 2021 (UTC)[reply]
    • @David Eppstein: What if only edits with no visual difference were made? That is, edits of the sort
      * One.
      :: Two.
      
      to
      * One.
      *: Two.
      
      Winston (talk) 16:25, 5 November 2021 (UTC)[reply]
      • Look at your example above. Look at the wikicode. If you ran your bot on this very page, it would "fix" your first example, rendering your message meaningless. That is the inherent problem here: you can't write logic that will know that this particular instance of "*" followed by "::" should remain because it's an intentional example of an error. You need a human brain for that. Levivich 16:33, 5 November 2021 (UTC)[reply]
        The bot wouldn't fix the first example since it is inside a "syntaxhighlight" tag. Winston (talk) 16:34, 5 November 2021 (UTC)[reply]
        I realized that as soon as I hit publish :-) But most editors wouldn't know to use such a tag. Anyway, what can the bot do about this:
        * One.
        : Two.
        
        Can it tell if "Two" is a new comment or the second paragraph of "One"? What if One were unsigned? Etc. Levivich 16:40, 5 November 2021 (UTC)[reply]
        It would do nothing, since final indentation characters would no longer be altered at all since that would change the visual appearance. Winston (talk) 16:43, 5 November 2021 (UTC)[reply]
        OK, then what about this:
        * One.
        : Two.
        * Three.
        : Four.
        * Five.
        
        Should Two and Four be bullets? Or, alternatively:
        : One.
        * Two.
        : Three.
        * Four.
        : Five.
        
        Is this all one comment with two bullet lists in it, or five different comments? Are we changing colons to bullets, bullets to colons, or nothing? Levivich 16:45, 5 November 2021 (UTC)[reply]
        Nothing would change. Sorry I should clarify, by final indentation character I mean the last indentation character for a line. So for *: it would be :. Winston (talk) 16:48, 5 November 2021 (UTC)[reply]
        Only basic list gaps and non-final characters could be altered. So indentation levels and final bullet/no bullet would not be changed at all. Winston (talk) 16:49, 5 November 2021 (UTC)[reply]
        Heh, you anticipated my next question about indentation levels :-) So these two changes (no change to indentation level, no change to final character) would be two things that are different from the last trial run that was just run? Levivich 16:56, 5 November 2021 (UTC)[reply]
        Correct. The visuals should not change. Winston (talk) 16:57, 5 November 2021 (UTC)[reply]
        Well, technically the visual would change if there was something like ::: followed by ***, since the bot would change the latter to ::*. Winston (talk) 17:01, 5 November 2021 (UTC)[reply]
        Yeah, but it seems like that particular example (::: followed by ***) is just a flat-out mistake, so the change would be for the better for both sighted and non-sighted readers. I think you're right that not changing the indentation level, and not changing the final character, are key to not making a visual change. I'm not a BAG member or anything, but it seems reasonable to me to do another trial run with those modifications you've suggested (and limiting the namespaces for the trial, etc.). It does seem like limiting the bot as you're describing would make the changes invisible to sighted readers. I recognize it won't totally fix the problem that you're setting out to fix (which can't be fixed, because editing text files and using indentation to separate one comment from another is downright stone-age archaic, we might as well use vacuum tubes), but it could improve things without pissing editors off. :-D Levivich 17:05, 5 November 2021 (UTC)[reply]
        Yeah I feel bad for angering/annoying a bunch of people. I was overzealous. Winston (talk) 17:11, 5 November 2021 (UTC)[reply]
        No worries. Heck, many of us annoy EEng just for sport. Levivich 17:18, 5 November 2021 (UTC)[reply]
    • (edit conflict) This would run afoul of WP:COSMETICBOT. Jip Orlando (talk) 16:35, 5 November 2021 (UTC)[reply]
      • Nevermind, this is an exception. Either way, you'll have a horde of mad users cluttering their watchlists. Jip Orlando (talk) 16:37, 5 November 2021 (UTC)[reply]
        • The COSMETICBOT argument is compelling to me but there's more to it than that. If you think that changing talk pages to normalize indentation coding without changing the appearance is helpful as a way to produce semantically clean wikimarkup, you're deluded. :-indentation is never semantically clean. :-formatting is only proper within definition lists, where its actual purpose is to delimit the body of a definition and the indentation is merely a side effect of how this kind of list is formatted. Its use on talk pages for indentation is a hack. As such, the bot's task would be to fill our watchlists with edits while polishing a hack rather than accomplishing anything useful. —David Eppstein (talk) 17:56, 5 November 2021 (UTC)[reply]
          It's not about the semantics. The changes are to help screen readers. I was simply overzealous with the bot, and unfortunately it took until this trial to become apparent. The limited version described above in the comment chain with Levivich should be much better. If you use macOS, you can try reading a list with gaps and/or mixed indents with VoiceOver to see how screen readers are affected. Winston (talk) 18:39, 5 November 2021 (UTC)[reply]
  • This would be better as a script than a bot. Preferably, a script that worked on just one section of a talk page. As a script, editors could manually review/correct mistakes before publishing. Levivich 16:31, 5 November 2021 (UTC)[reply]
  • Look, Winston, I know you're trying to help, but you have only 1800 edits to Wikipedia, and only a handful of those are to talk pages. You don't have the experience to even begin to understand the subtleties of what you're getting into. It's like having someone who's never driven a car start redesigning the highways. EEng 16:44, 5 November 2021 (UTC)[reply]
  • We most certainly do need an Indent Bot. Some editors don't know how to indent, or make human mistakes or simply refuse to, after being given advice. GoodDay (talk) 17:49, 5 November 2021 (UTC)[reply]
    • That's not something a bot is capable of fixing. —David Eppstein (talk) 17:56, 5 November 2021 (UTC)[reply]
      • Wish one could be created, that was capable of doing so. Frustrating, when you read long drawn out discussions, with mis-indents. Throws you off, as to who's responding to who. GoodDay (talk) 18:15, 5 November 2021 (UTC)[reply]
  • It's a good idea and I'm sure there's some sort of bot task that could be approved someday. Editing the wikitext of discussions happens to be just about the hardest bot task I can think of to do correctly. I think starting with the LISTGAP change or otherwise trying to limit the amount of change the bot does would be a good idea. Please ask me if you have any questions; I (unfortunately? lol) have a few years of experience with manipulating discussion wikitext. Enterprisey (talk!) 21:15, 5 November 2021 (UTC)[reply]
    Basically I caught feature creep. I've pared the bot back to simple LISTGAP and non-final-indentation-character INDENTMIX changes. Winston (talk) 21:37, 5 November 2021 (UTC)[reply]

Changes to bot

In the original bot request for a bot to fix indentation, two examples were given. The first example was the removal of a single extra indent (a general fix), and the second was a non-final-indent-character indentmix fix (an accessibility fix). I decided to tackle this request, but caught feature creep and took the idea too far. This ended up making some "fixes" the very opposite, as the last trial demonstrated. I believe the issues brought up (other than procedural issues like editing user talks and missing the bot flag) were due to the features I implemented beyond the original request, and I apologize.

I have limited the bot to listgap and non-final-character indentmix fixes only. Indentation levels and final indentation characters are not changed (so the first example in the original bot request would actually be left alone). Here are some sandbox diffs. These are accessibility changes, and the only noticeable change for sighted readers should be the hiding of “floating bullets”: bullet points produced by a * that is not the last indent character. For example,

Markup:

:One.
*: Two
*** Three.

Renders as:

One.
  • Two
      • Three.

would become

Markup:

:One.
:: Two
::* Three.

Renders as:

One.
    Two
      • Three.

Winston (talk) 10:56, 7 November 2021 (UTC)[reply]
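In rough pseudocode terms, the non-final-character indentmix fix described above could be sketched as follows. This is a hypothetical illustration, not the bot's actual source; the function name `fix_indentmix` is invented. The idea is that each indent depth keeps the character it was first introduced with, while the final character of every line's prefix is left untouched, so indentation levels and trailing bullets never change.

```python
import re

def fix_indentmix(lines):
    """Rewrite non-final indentation characters so each indent depth
    keeps the character it was first introduced with; the final
    character of every line's prefix is left untouched. Hypothetical
    sketch only, not the bot's actual code."""
    established = []   # indent character first seen at each depth
    fixed_lines = []
    for line in lines:
        m = re.match(r"[:*#]+", line)
        if not m:
            established = []   # a non-list line ends the list
            fixed_lines.append(line)
            continue
        prefix = m.group(0)
        # Record the character chosen the first time each depth appears.
        while len(established) < len(prefix):
            established.append(prefix[len(established)])
        established = established[:len(prefix)]
        # Replace all but the final character with the established ones.
        new_prefix = "".join(established[:len(prefix) - 1]) + prefix[-1]
        fixed_lines.append(new_prefix + line[len(prefix):])
    return fixed_lines
```

Running this on the example above turns `*: Two` into `:: Two` and `*** Three.` into `::* Three.`, while a line whose prefix already matches (or whose only bullet is the final character) passes through unchanged.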

@Notsniwiast: here is just a sample mixed up list - what, if anything would you do to it?
Extended content
  • A
    • A
      • A
        A
        A
        • A
          1. A
          2. A
          3. A
        • A
          1. A
          2. A
          3. A
        A
      • A
      • A
    • A
  • A
xaosflux Talk 13:05, 7 November 2021 (UTC)[reply]
I've just tested it on this list. It does nothing. Winston (talk) 13:09, 7 November 2021 (UTC)[reply]
@Xaosflux: Please see the above. --TheSandDoctor Talk 07:39, 29 December 2021 (UTC)[reply]
  • Trial complete. Closing out the previous trial which was aborted. Winston (talk) 06:29, 5 January 2022 (UTC)[reply]
  • {{BAGAssistanceNeeded}} I'd like to try out the limited version as described above. To recap, there shall be no changes to indentation levels and no changes to the final indent character. The only noticeable visual difference should be the hiding of "floating" bullet points. The other changes are reductions in the number of list gaps and amount of indentation-style mixing, which should not be visually noticeable. Here are some fresh diff examples. I can do more sandboxed runs if we're still wary of a trial on the live wiki. Winston (talk) 06:29, 5 January 2022 (UTC)[reply]
    Btw, the ordered pair in the edit summaries represents (# of blank lines removed, # of lines with at least one altered indent character). Winston (talk) 06:51, 5 January 2022 (UTC)[reply]
  • ...Sure. Approved for extended trial (200 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. I totally support approving this task in some form. But: although I'd rather not say this, recognizing that people really don't like discussions getting messed up (as you can see above), I must warn you that any more edits that change the meaning of discussions (even in the most insignificant way) or mistakes aren't going to look too good for the request. I'd err on the side of being cautious. From my experience developing reply-link, in the land of Wikipedia talk pages, even if something looks like a mistake, there's a decent chance that it's intentional. Enterprisey (talk!) 07:02, 5 January 2022 (UTC)[reply]
    Understood. If we find a legitimate use of floating bullets affecting meaning, then I can simply prevent the bot from changing * to : , thus preventing bullets from disappearing whether floating or not. I'll do this trial in smaller chunks, posting the diffs for each chunk after I review them and point out diffs where bullet points have been removed. Winston (talk) 07:19, 5 January 2022 (UTC)[reply]
    @Enterprisey Before starting, should the bot be given a (temp) bot flag? Also, minor edits or not minor? Winston (talk) 07:22, 5 January 2022 (UTC)[reply]
    If the previous trials weren't flagged, I wouldn't think this one should be; and I'd mark them as not minor (even though the distinction isn't very important these days) because it's a trial and I think people would be slightly more likely to pay attention to non-minor edits. Enterprisey (talk!) 07:34, 5 January 2022 (UTC)[reply]
    @Notsniwiast, I'd even recommend, to be extra cautious, making sure that the edits don't change the visual appearance of the page (besides removing the "double bullets" error); we can always add more tasks to the bot later. Not sure if you were doing that already (I didn't check); just making a note. Enterprisey (talk!) 08:12, 5 January 2022 (UTC)[reply]
    Yes, only bullets which are not the final indent character get removed. I will pause after 20 edits to show the diffs, pointing out the ones where a bullet has been hidden. Winston (talk) 08:16, 5 January 2022 (UTC)[reply]

Chunk 1 (20 diffs)

  • See here. I'm pausing here to see if any concerns are raised. Winston (talk) 08:54, 5 January 2022 (UTC)[reply]
    The few I checked look fine. If nobody objects in the next day or two, feel free to keep going. Maybe pause again after 100 edits have been made? Enterprisey (talk!) 01:48, 6 January 2022 (UTC)[reply]
    @Notsniwiast, I notice the bot is currently editing your sandbox. If the sandbox edits aren't part of the trial, keep going and ignore this message. However, since you linked to one of them just above, I'm assuming you're counting the sandbox edits as the trial. Since the bot task is for editing actual discussion pages, the trial should be as similar to that usage as possible. That means the bot should edit the actual pages, not just its sandbox, for this trial. Part of the trial, in my view, is making sure that people won't object to the edits, and they won't have the opportunity to object if the edits are made to the sandbox. Enterprisey (talk!) 08:10, 6 January 2022 (UTC)[reply]
    Yup the sandbox edits aren't part of the trial. Not sure which link you're referring to, but when I link to the actual trial diffs I use a permanent url to a revision of my sandbox where I put the diffs, so as not to clutter up this page. Winston (talk) 08:21, 6 January 2022 (UTC)[reply]
    Sounds good. My bad; misread. Enterprisey (talk!) 08:41, 6 January 2022 (UTC)[reply]

Chunk 2 (50 diffs)

  • See here. Winston (talk) 10:07, 6 January 2022 (UTC)[reply]
  • This is the first I'm aware of this bot, and I've not read all the text above so sorry if this has been addressed before, but could the edit summaries be improved please: e.g. "Adjusted indentation per MOS:ACCESS#Lists. Trial edit. (1, 10)" has three parts:
    • "Adjusted indentation..." is sort of OK, but could imply that it is changing the indentation level (which it isn't), "Fixing indentation markup" would be better imo.
    • "Trial edit." is entirely unproblematic
    • "(1, 10)" is cryptic and while potentially useful to the operator for debugging is just confusing for editors who aren't intimately familiar with the bot.
    • The edit summary does not mention that it removed multiple blank lines from lists, or why. Personally I know that this is per MOS:LISTGAP, but not everybody will. I recommend including it in the summary as (a) noting what the bot has done, and (b) noting why it has done it so that people aren't tempted to revert the bot and also learn why they shouldn't leave blank lines in the first place. I do a bit of fixing of lists, and Redrose64 does even more, both of us mention LISTGAP in edit summaries and I've seen positive responses to that. Thryduulf (talk) 13:02, 6 January 2022 (UTC)[reply]
      • Good points. I changed the edit summaries to "Adjusting indentation markup per MOS:LISTGAP and MOS:INDENTMIX. X blank lines removed. Y adjustments of indent markup. Trial edit." Winston (talk) 15:18, 6 January 2022 (UTC)[reply]

Chunk 3 (60 diffs)

  • See here. I am stuck figuring out an error in this one: Diff for Talk:Mass killings under communist regimes. The bot apparently introduced floating bullets to the line beginning with "I don't think it is fair". But when I copy the wikitext into my sandbox here, it looks fine (can anyone confirm this?). It also looks fine in the edit preview of Talk:Mass killings under communist regimes. Is wikitext displayed differently in User talk vs Talk or something? I can't reproduce the error (though I haven't tried reproducing it in actual Talk pages and there doesn't seem to be a sandbox Talk page). Winston (talk) 20:22, 7 January 2022 (UTC)[reply]
    Ok so I copied the entire wikitext rather than just the section, and indeed the floating bullets showed up. So I tried to produce a minimal reproducible example (the original talk page is over 600k bytes), and have discovered that it has something to do with links. Consider this revision (excuse the gibberish, I did some transformations to reduce the page size). The line we are interested in is the one containing "conclusions were rejected". Notice the floating bullets. Now edit the page and delete the wikilink to water at the end of the wikitext (deleting some other wikilink may work too). Notice how the floating bullets are gone. Instead of deleting a wikilink, you can delete the first template on the page and the floating bullets also disappear... Not sure what's going on. Winston (talk) 01:42, 8 January 2022 (UTC)[reply]
    • From a discussion at WP:VPT, this is probably a GIGO issue due to a newline inside a wikilink. I had thought that such newlines were allowed since they seemed to work ok, but apparently not. To keep the fix simple and to be extra conservative, I'm having the bot simply refuse to perform any indentmix fix on the page at all if it encounters a wikilink containing \n.
      I did not see anything unexpected in the other edits for this chunk. Winston (talk) 00:13, 9 January 2022 (UTC)[reply]
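The conservative guard described here (refusing the indentmix pass when any wikilink contains a newline) could look something like this. A hypothetical sketch only; the function name is invented and the bot's real implementation may differ.

```python
import re

def has_multiline_wikilink(text):
    """Return True if any [[wikilink]] spans a line break, in which
    case the bot skips indentmix fixes for the entire page.
    Hypothetical helper, not the bot's actual code."""
    return any("\n" in m.group(1)
               for m in re.finditer(r"\[\[(.*?)\]\]", text, re.DOTALL))
```

Because such pages are rare, skipping them outright costs little while avoiding the garbage-in/garbage-out rendering problem described above.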

Chunk 4 (70 diffs)

  • See here. There was only one error here where a bullet point was introduced. This was due to a template creating a table which the bot did not anticipate. The bot now expands templates to check for tables. Trial complete. Winston (talk) 06:59, 10 January 2022 (UTC)[reply]
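The template-expansion check mentioned here could be done with the Action API's expandtemplates module, roughly as below. This is a hedged sketch: the parameter names are from the real API module, but the HTTP/session plumbing is omitted and the helper function name is invented.

```python
def contains_table(expanded_wikitext):
    """Return True if the expanded wikitext opens a wikitable ({|),
    so the bot can skip the affected lines. Hypothetical helper."""
    return any(line.lstrip().startswith("{|")
               for line in expanded_wikitext.splitlines())

# Parameters for the (real) action=expandtemplates API module;
# the template call below is a placeholder.
expand_params = {
    "action": "expandtemplates",
    "prop": "wikitext",
    "text": "{{some template}}",   # placeholder template invocation
    "format": "json",
}
```

Expanding first matters because a template's wikitext gives no hint that its output is a table, and table rows would otherwise be misread as list lines.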

Gonna take a break. Code is still available. Withdrawing this request. {{BotWithdrawn}} Notsniwiast (talk) 05:15, 13 January 2022 (UTC)[reply]

Well, as the trial has been completed, assuming no issues are found, you don't have to do anything more as of now. If this is approved, you can start running the bot whenever you return. – SD0001 (talk) 09:51, 13 January 2022 (UTC)[reply]
SD0001, are you approving this request, or are you accepting their withdrawal? Primefac (talk) 15:10, 23 January 2022 (UTC)[reply]
@SD0001? theleekycauldron (talkcontribs) (she/they) 10:10, 13 February 2022 (UTC)[reply]
Left a note at their talk page. Primefac (talk) 14:18, 27 February 2022 (UTC)[reply]

Back

Hi, I'm back from my break. If people feel this bot will be useful, let's continue. The only thing I'm worried about is whether people will think it's a nuisance relative to the number of people it helps. Winston (talk) 20:01, 27 February 2022 (UTC)[reply]

  • David Eppstein, sorry to put this on you, but you pursued the issues here farther than I did. Are your (our) concerns all addressed? EEng 20:21, 27 February 2022 (UTC)[reply]
    • I think my concerns are addressed by "I have limited the bot to listgap and non-final-character indentmix fixes only. Indentation levels and final indentation characters are not changed". —David Eppstein (talk) 20:26, 27 February 2022 (UTC)[reply]


Approved requests

Bots that have been approved for operations after a successful BRFA will be listed here for informational purposes. No other approval action is required for these bots. Recently approved requests can be found here, while old requests can be found in the archives.


Denied requests

Bots that have been denied for operations will be listed here for informational purposes for at least 7 days before being archived. No other action is required for these bots. Older requests can be found in the Archive.

Expired/withdrawn requests

These requests have either expired, as information required by the operator was not provided, or been withdrawn. These tasks are not authorized to run, but such lack of authorization does not necessarily follow from a finding as to merit. A bot that, having been approved for testing, was not tested by an editor, or one for which the results of testing were not posted, for example, would appear here. Bot requests should not be placed here if there is an active discussion ongoing above. Operators whose requests have expired may reactivate their requests at any time. The following list shows recent requests (if any) that have expired, listed here for informational purposes for at least 7 days before being archived. Older requests can be found in the respective archives: Expired, Withdrawn.