Wikipedia:Village pump (idea lab)

The idea lab section of the village pump is a place where new ideas or suggestions on general Wikipedia issues can be incubated, for later submission for consensus discussion at Village pump (proposals). Try to be creative and positive when commenting on ideas.
Before commenting, note:

  • This page is not for consensus polling. Stalwart "Oppose" and "Support" comments generally have no place here. Instead, discuss ideas and suggest variations on them.
  • Wondering whether someone already had this idea? Search the archives below, and look through Wikipedia:Perennial proposals.

Discussions are automatically archived after remaining inactive for two weeks.


Option to omit subordinate sections on edit

Case in point: [1] The editor meant to add the content at the end of the "Discussion (II)" section, but ended up adding it at the end of its subordinate section, "Split off into a new page". He didn't catch the error and it was fixed later by a different editor (me). He is an experienced editor, significantly above average in technical competence, and I see this happen too often.

(In this case, I ended up changing the level of "Split off into a new page" to that of "Discussion (II)" to prevent this from happening again, but that solution was sub-optimal. By all logic the "Split off into a new page" should be subordinate to the Discussion section.)

Even if one is aware of this pitfall, it can be really cumbersome to have to back up to find the section you want. Imagine if there are four or five subordinates, some of them really long.

There should be the option to edit a section without its subordinates. Equally beneficial on any page that has multi-level sections, including articles, not just talk pages. As for specifics, that's why I'm on this page.

One thing to consider is that an editor might not know the option exists, or it might not occur to them to use it. In such cases the option would do little good. I'm thinking a pop-up box if the edited section has any subordinates: "Do you want to include the subordinate section(s)?" ―Mandruss  21:58, 10 December 2023 (UTC)

+1 for this sort of feature. It's been requested in various places for over a decade IIRC. I don't get caught adding content in the wrong place, so much as it's annoying to have to scroll to the correct place and to sit through an excessively long preview of subsections I am not planning to change. DMacks (talk) 22:19, 10 December 2023 (UTC)
Okay, only half a decade. I knew it sounded familiar though... Wikipedia:Village pump (technical)/Archive 163#Edit section without subsections. DMacks (talk) 07:52, 12 December 2023 (UTC)
So the last comment in that thread was PrimeHunter, one of our most credible editors on technical questions, saying this is not only technically possible but "straightforward". There was no reply, suggesting concession by the naysayers. That was at VPT, and it seems to me the next step would've been this page. Not sure why that didn't happen. ―Mandruss  22:17, 12 December 2023 (UTC)
@PrimeHunter:... DMacks (talk) 20:16, 18 December 2023 (UTC)
I said "It seems straightforward". I'm not a MediaWiki developer and don't know how easy it would be in practice but it doesn't sound hard. I don't believe Izno's earlier comment there: I'm pretty sure "this is not technically feasible" is the answer due to the way that HTML sectioning works. That seems irrelevant. When you save a section edit, MediaWiki reparses the wikitext of the whole page in the same way as if you had edited the whole page. PrimeHunter (talk) 21:55, 18 December 2023 (UTC)
-1 to the popup confirmation, but +1 to being able to edit just the "lead" of a section sans any subsections. I'm sure people will jump in with some good examples, but I'm struggling to imagine when "edit smallest applicable subsection" and "edit entire page" are both worse options than "edit intermediate size chunk". Folly Mox (talk) 02:19, 11 December 2023 (UTC)
@Folly Mox: Your last sentence seems to suggest that it should never include subordinate sections, which would be another way of solving this problem; do I have that correct? If so, there are some cases where one would want to do that, such as re-ordering the subordinate sections or moving text between subordinate sections. Such things could be accomplished in other ways, including editing the entire page, but significantly less easily and with more risk of error. ―Mandruss  20:33, 11 December 2023 (UTC)
Yeah, never including subsections except in the "edit full page" case was my idea for avoiding a popup confirmation, but those things you mention are fine arguments for retaining the ability to edit a section including all its subsections. Another one is when there is no "section lead", and the prose starts after the first subsection. Misclicking on the wrong pencil would send users to an empty editing interface, which we'd have to cancel out of annoyingly. So maybe my idea is bad? I definitely am not liking an additional modal thing to tap between the editing pencil and the editing interface, but I'm not sure of the way round it. Folly Mox (talk) 21:45, 11 December 2023 (UTC)
"Editing pencil": You must be using a different editor. I click [ edit ] next to the section heading.
Remember that the pop-up would only happen when there are subordinates, so the impact might be less than you imagine. The question would be asked only when needed. ―Mandruss  21:56, 11 December 2023 (UTC)
On the mobile skin, you have to go all the way to the top toolbar on a page, click the three dots, and click "edit full page" to do that. On very large pages that may well be a bigger inconvenience than the issue described here. Mach61 (talk) 19:50, 11 December 2023 (UTC)
(Actually, there's no technical reason why this feature would have to be implemented the same on m.wiki AFAIK, so carry on) Mach61 (talk) 19:52, 11 December 2023 (UTC)
There are indeed two issues here. The major one is the back-end: we need MW API support for it. The other one is the interface to activate it, for which we could have all sorts of UI/UX design ideas, gadgets, etc. But none of the latter matters without the former. DMacks (talk) 02:12, 12 December 2023 (UTC)
That's above my pay grade. If this earned a consensus at VPR, what are the realistic odds it would happen? ―Mandruss  06:47, 12 December 2023 (UTC)
Any chance the gadget that allows the editing of lead sections might help? CMD (talk) 07:43, 12 December 2023 (UTC)
No, that is quite different. Each section is numbered sequentially, so the lead is section 0 already and is not a header-delimited section at all (so the other sections are not subsections of it, in the way a === is a subsection of ==). DMacks (talk) 07:52, 12 December 2023 (UTC)
All the gadget does is make a section=0 link like https://en.wikipedia.org/w/index.php?title=The_Example&action=edit&section=0&summary=/*%20top%20*/%20 to use a feature which already exists in MediaWiki. You could have made the same url manually. The proposal here would require a new MediaWiki feature. PrimeHunter (talk) 21:55, 18 December 2023 (UTC)
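For illustration, the kind of link the gadget builds can be reproduced in a few lines (a sketch only; the edit-summary parameter from the example URL above is omitted):

```python
from urllib.parse import urlencode

# Build a MediaWiki section-edit URL. Section numbers count the
# header-delimited sections in page order; section 0 is the lead.
def section_edit_url(title, section, base="https://en.wikipedia.org/w/index.php"):
    params = {"title": title, "action": "edit", "section": section}
    return f"{base}?{urlencode(params)}"

section_edit_url("The_Example", 0)
# → https://en.wikipedia.org/w/index.php?title=The_Example&action=edit&section=0
```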
Brainstorming a gadget that would be a clickable link in the section to call action=edit but then intercept the actual spawning of the editor. It would snip off everything starting with the first line that begins with "==" into a hidden separate field, then reattach it when the user clicks 'publish'. DMacks (talk) 10:11, 2 January 2024 (UTC)
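The snip/reattach logic could look something like this (a sketch, assuming the edit box starts with the section's own heading, so the split point is the first heading of a deeper level; an actual gadget would be JavaScript, but the logic is the same):

```python
import re

# Match a wikitext heading line such as "== Title ==" or "=== Title ===".
HEADING = re.compile(r"^(=+)[^=].*?\1\s*$", re.M)

def snip_subsections(wikitext):
    """Split section wikitext into (editable head, hidden subsection tail)."""
    headings = list(HEADING.finditer(wikitext))
    own_level = len(headings[0].group(1)) if headings else 0
    for m in headings[1:]:
        if len(m.group(1)) > own_level:  # first subordinate heading
            return wikitext[:m.start()], wikitext[m.start():]
    return wikitext, ""  # nothing subordinate to hide

def reattach(edited_head, hidden_tail):
    # On 'publish', the hidden tail is appended back unchanged.
    return edited_head + hidden_tail

text = "== Discussion ==\nLead text.\n=== Split off ===\nSub text.\n"
head, tail = snip_subsections(text)
```

Here the user would edit only `head` ("== Discussion ==\nLead text.\n") while `tail` rides along hidden, so content added at the end lands in the right section.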
Also, when used, it can prepend an edit summary accordingly -- that alone, if graduated from gadget to default anonymous feature, would be so worth it to ease reviewing a significant percentage of edits to article ledes (which are a significant percentage of cruddy/controversial edits). SamuelRiv (talk) 23:02, 17 January 2024 (UTC)

Brainstorming a COPYVIO-hunter bot

I'd like to propose the idea of a COPYVIO-hunter bot, but I'm not ready to make a specific Bot request yet, and so I'd like to expose this idea here first to brainstorm it. Sometimes, copyright violations are discovered that have been present on Wikipedia for years. (The copyright-violating content at Barnabas#Alleged writings was added on 4 August 2014 and discovered 18 December 2023.) But for an alert Teahouse questioner two days ago, who knows when, if ever, this would have been discovered. That's worrisome.

We have some good tools out there, such as Earwig's detector, and my basic idea is to leverage that by building a bot around it, which would apply it to articles, and either generate a report, or apply the {{Copyvio}} template directly. A couple of additional bot tasks could streamline the human part of the investigation by finding the insertion point (Blame) and determining copy direction (IA search). There are input, performance, and scaling questions, human factors, and likely others I haven't thought of. As far as input, ideally I'd like to see a hybrid or dual-channel input: a hopper with manual feed by editors (possibly semi-automated feed by other tools), and an automated input where the bot picks urls based on some heuristic.

For performance, I launched Earwig with all three boxes checked, and it took 62 seconds to return results for Charles de Gaulle (174,627b) and 16 seconds for (randomly chosen) Junes Barny (5,563b). I'm pretty sure there are a lot more articles closer in size to the latter than the former, so let's say Earwig takes 30 seconds per search on average; multiplying that by {{NUMBEROFARTICLES}} gives us 6.43 years to search all of Wikipedia with a dumb, single-threaded bot with no ability to prune its input stack. (Of course, Wikipedia would be bigger six years later, but that gives us an idea.) Given that the Barnabas violation went undiscovered for nine years, six years is not so bad, as I see it. But not all articles are equal, and probably some pruning method could decrease the size of the input stack, or at least prioritize it towards articles more likely to have undiscovered violations.
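The arithmetic behind that estimate, for the record (assuming roughly 6.76 million articles, the approximate value of {{NUMBEROFARTICLES}} at the time):

```python
# Back-of-the-envelope: one dumb, single-threaded bot averaging
# 30 seconds of Earwig time per article.
ARTICLES = 6_760_000           # approx. English Wikipedia article count, Dec 2023
SECONDS_PER_ARTICLE = 30
YEAR = 365.25 * 24 * 3600      # Julian year, in seconds

years_single_thread = ARTICLES * SECONDS_PER_ARTICLE / YEAR
print(f"{years_single_thread:.2f} years")  # → 6.43 years
```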

As far as scaling, I have no idea of server availability at WMF, but presumably there are some bot instruction pages somewhere for bot writers which address how many threads are optimal, and other factors that could scale up the processing for better throughput; maybe someone knows something about that. If we had six threads going against one input stack, that would reduce it to one year; it would be great to run it annually against the entire encyclopedia.

For human factors, I'm thinking about the increased number of articles tagged with copy violations, and the additional load on admins that would inevitably result. There are currently 17 articles tagged with the {{Copyvio}} template. I wanted to provide some estimate of activity at Wikipedia:Copyright problems to gauge current throughput, but I'm not so familiar with the page, and was unable to do so. Inevitably, a bot would increase the load on admins (for WP:REVDEL) and other volunteers, and it would be helpful to gather some data about what would happen. Not sure if it's possible to project that, but maybe a stripped-down version of the bot just to wrap Earwig and spit out numbers on a test run of a week or two might give us some idea. I'm guessing in operation, it would generate a big backlog balloon initially based on the first two decades of Wikipedia, but then its output would slow to some steady state; in any case, backlogs in other areas have been generated and attacked before with success.

Maybe a bot could somewhat reduce load per investigation, by means of a handy output report that includes Earwig percent, maybe a brief excerpt of copied content, and so on. A couple of additional tasks could be defined which would work off the output report: one task running Blame on the suspect articles to add date of insertion to the report, and another reading IA snapshots to determine direction of copy (i.e., is it a mirror, or a copyvio). The result would be a report with information that ought to make the human part of the investigation considerably faster and more efficient per occurrence, which should at least somewhat offset the increased overall number of investigations.

Would love to hear any feedback on the technical aspects of this, as well as the human factors, and whether something like this should even be attempted. Thanks, Mathglot (talk) 02:00, 21 December 2023 (UTC)

Maybe a fourth task could be a disposition-triage task, and would act on the report output of previous tasks based on configurable values; something like: "if copy-direction = copyvio then if Earwig-pct > 85 then remove content from article and mark/categorize as revdel-needed; else if Earwig-pct < 20 then remove Copyvio template and mark report as handled; else leave for human assessment; else mark as mirror and handled." Mathglot (talk) 02:29, 21 December 2023 (UTC)
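That triage rule, written out as code (a sketch only: the thresholds and report field names are placeholders from the pseudo-code above, not a worked-out design):

```python
# Disposition triage over a per-article report produced by the earlier tasks.
# "copy_direction" and "earwig_pct" are hypothetical report fields; the
# 85/20 thresholds would be configurable.
def triage(report):
    if report["copy_direction"] != "copyvio":
        return "mark as mirror and handled"
    if report["earwig_pct"] > 85:
        return "remove content, mark as revdel-needed"
    if report["earwig_pct"] < 20:
        return "remove Copyvio template, mark report as handled"
    return "leave for human assessment"
```

Only the middle band (20–85 percent, copy direction confirmed as copyvio) would ever reach a human.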
EranBot currently sends every new edit through CopyPatrol if I understand it correctly, which essentially runs the edits through Turnitin/iThenticate. One could reduce the bot load by making it only look at articles that were created prior to August 2016.
@MusikAnimal (WMF) and Mathglot: I understand that the WMF is currently working on a replacement/re-vamp of CopyPatrol (i.e. Plagiabot). Is there a way to integrate a sort of "historical article detection" into a similar interface while re-using some of the code from the new Plagiabot, or is this something that you think would be better kept separate? — Red-tailed hawk (nest) 02:42, 21 December 2023 (UTC)
That's terrific news, which means, if I understand correctly, that whatever the scope of the problem is, at least it's not getting worse (assuming perfect precision from Plagiabot). So we only have to deal with the pre-whatever-year issue, and slowly chip away at it. (I am subscribed; no ping needed.) Mathglot (talk) 02:56, 21 December 2023 (UTC)
@MusikAnimal (WMF) I remember putting this up on phabricator somewhere (I think?), but would it be possible to provide a stable API to integrate CopyPatrol with various other editing/CVUA tools (specifically it would be great to be able to answer the question "What is the iThenticate score/URLs for a specific edit") Sohom (talk) 06:29, 21 December 2023 (UTC)
I've left MusikAnimal a comment on their WMF account talk page. It would be nice to hear from them on this. — Red-tailed hawk (nest) 17:45, 25 December 2023 (UTC)
I acknowledge it's Christmas, and many WMF staff are taking vacation/holiday, so it's fairly possible that we might not hear back for a week or so. — Red-tailed hawk (nest) 17:53, 25 December 2023 (UTC)
Thanks. I've added DNAU for 1 month, imagining that he may be on a nice, long winter vacation. Mathglot (talk) 21:24, 25 December 2023 (UTC)
An API for reviewing/unreviewing does exist, but it's undocumented right now. It also doesn't provide Access Control headers. I was working on an external-use API for CopyPatrol, but decided to hold off until the new version that uses Symfony was finished and deployed, since it won't be usable anyway until deployment has finished. Chlod (say hi!) 02:22, 26 December 2023 (UTC)
Thanks for your patience! I was "around" on my volunteer account, but haven't been checking this one until today (my first day back at work after the break).
It sounds like you all are asking for phab:T165951, which was declined last November. It can be re-opened if there's interest in it. However, it's worth noting CopyPatrol doesn't go through every edit, only those that meet certain criteria. I'll let @JJMC89 speak to that before I say something wrong ;)
As for an API, we can certainly add an endpoint to get the score for a given revision, if it exists in our database. That's simple to implement and won't require authentication. If you could file a bug, I can have that ready for when the new CopyPatrol goes live.
API endpoints that make changes to our db, such as reviewing/unreviewing, are another matter. Right now we authenticate with OAuth, so we'd need to somehow have clients go through that before they could use the endpoint. If @Chlod is interested in building this, I'll happily review it! :) Off the top of my head, I'm not sure how to go about implementing it. Alternatively, maybe we could provide all logged-in users an API key? That would avoid clients having to log in to CopyPatrol.
I don't think we want to permit requesting new scores for any arbitrary revision, at least not until our partnership with Turnitin is finalized. That should happen very soon, and then we'll know for sure if we can send out that many API requests. Some changes to JJMC89's bot would likely also need to be made. All in all, I'd say this feature request is not much more than a "maybe".
Also, in case no one's mentioned it yet, attempting to identify old copyvios is tricky because of the all-too-common WP:BACKWARDSCOPY issue. In some cases it may not be possible to ascertain which came first -- Wikipedia or the source -- so I'd be wary of attempting to automate this. MusikAnimal (WMF) (talk) 00:57, 3 January 2024 (UTC)
The new bot looks at edits made in the article and draft namespaces (0 and 118) to submit to turnitin and skips the following types of edits:
  • made by bots or users on the allow list
  • (revision) deleted before processing (rare unless catching up from a service outage)
  • rollbacks (MediaWiki native or Twinkle)
  • additions of < 500 characters after cleaning the wikitext.
Those that come back with more than a 50% match to a (non-allow listed) source are shown in CopyPatrol for human assessment.
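Those skip rules amount to a filter along these lines (a sketch; the edit fields and the allow list are hypothetical stand-ins for the bot's real data structures, not its actual code):

```python
# Sketch of the CopyPatrol submission filter described above.
CHECKED_NAMESPACES = {0, 118}   # article and draft namespaces
ALLOW_LIST = {"ExampleBot"}     # hypothetical allow-listed accounts

def should_check(edit):
    """Decide whether an edit gets submitted to Turnitin at all."""
    if edit["namespace"] not in CHECKED_NAMESPACES:
        return False
    if edit["is_bot"] or edit["user"] in ALLOW_LIST:
        return False
    if edit["revdeleted"] or edit["is_rollback"]:
        return False
    # only additions of >= 500 characters after cleaning the wikitext
    return edit["cleaned_addition_chars"] >= 500

def flag_for_review(match_percent, source_on_allow_list):
    # Matches above 50% to a non-allow-listed source reach human patrollers.
    return match_percent > 50 and not source_on_allow_list
```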
As a quick test, I added an endpoint to dump the data from the database for a specified revision.[2]
{
  "diff_id": 7275308,
  "lang": "en",
  "page_namespace": 0,
  "page_title": "Mahāyāna_Mahāparinirvāṇa_Sūtra",
  "project": "wikipedia",
  "rev_id": 1178398456,
  "rev_parent_id": 1178304407,
  "rev_timestamp": "Tue, 03 Oct 2023 12:16:34 GMT",
  "rev_user_text": "Javierfv1212",
  "sources": [
    {
      "description": "C. V. Jones. \"The Buddhist Self\", Walter de Gruyter GmbH, 2021",
      "percent": 50.3817,
      "source_id": 820817,
      "submission_id": "3084bde6-3b8b-488c-bf33-c8c27a73ae06",
      "url": "https://doi.org/10.1515/9780824886493"
    }
  ],
  "status": 0,
  "status_timestamp": "Tue, 03 Oct 2023 12:38:16 GMT",
  "status_user_text": null,
  "submission_id": "3084bde6-3b8b-488c-bf33-c8c27a73ae06"
}
Please file a task so we can workshop the best way to design the API.
— JJMC89(T·C) 00:40, 4 January 2024 (UTC)
Filed as phab:T354324. This could be done on either the frontend or the backend, but it doesn't look like the backend source is publicly available (and API endpoints are a frontend task anyway, so it should probably live on the frontend). Chlod (say hi!) 10:03, 4 January 2024 (UTC)
I'd encourage making the repos public unless there is a reason for keeping them private. It will make things easier if someone goes inactive or if someone wants to submit a patch. –Novem Linguae (talk) 11:36, 4 January 2024 (UTC)
Hi, Mathglot! Great to hear more initiative on copyright cleanup tasks; they're always a big help. Someone brought up a related idea at WT:CCI a while back, and I responded with a few points that probably apply here too. I've got a cannula lodged in my hand right now, so I'll copy over what I said in that thread to avoid straining it. There wasn't a lot of back-and-forth on that thread anyway so it's probably easier if I just repost it here.

There was an idea previously floated around about having Turnitin or Earwig run on all revisions of past cases; I'd say this is probably the general idea when talking about automation for CCI cases. When it actually comes down to making it happen, though, it's a spider web of caveats and limitations that make it hard to get off the ground. Here's a more-organized explanation of my thoughts that I randomly collected in the past few months:

  • First is the issue of cost. There are around 508,000 revisions left to check (as of May this year), but we only ever have a finite number of Earwig search engine searches or Turnitin credits. Processing all of these automatically means we have to work with the WMF to get more credits for a one-time run-through, and we're not sure if we'll get decent results for a majority of those checks.
    • We could work around this by completely disabling search engine checks, as the thread you linked discussed, but this can either work for or against us based on the case. We could also work around this by only selecting a few cases which rely mostly on web sources or (for Turnitin) sources that we know would probably be indexed. This significantly cuts down on the amount of revisions to check. But then there's the next issue:
  • A lot of the older cases, especially the ones over three years old, start getting a lot of false positives. As article text remains on the wiki for long periods of time, SEO spam sites, academic documents, slideshows, and others start copying from Wikipedia. We filter out a lot of these already (like those in this list and a bunch of others), but we still hit them every once in a while and enough that it clogs up what reports we would otherwise get from Earwig/Turnitin.
    • A possible solution to this would be human intervention (which is more or less a given with something like this), where editors will double-check to see if a flagged revision actually is copied from somewhere, or if it's just a false positive. Human intervention will weed out false positives, but then it won't weed out the false negatives.
  • At the end of the day, copyvio checking is a really hard computer science problem that humanity is still in the middle of solving. False negatives, like when a revision flies under the radar because a source it copied from has died, or when the text has been paraphrased enough to make checkers think it's completely original text, will always be one of the biggest brick walls we face. False positives waste editor time, yes, but false negatives arguably take up more time, because we then need to re-check the case. It also wouldn't be a good look for us or the WMF if it turns out that we get a lot of false positives and negatives, since that could be perceived by the community as a waste of funds. Perhaps this is still something that could benefit from research and testing.
    — User:Chlod 13:02, 24 November 2023 (UTC)
This was for checking revisions on CCI pages, but the same applies for scanning every latest revision for all articles. It seems we've also been stretching Earwig to its limits recently: it has been going down almost every day for the past two weeks (CommTech's UptimeRobot). Unfortunately, the Earwig logs are project members-only, so I can't snoop in to figure out the cause by myself. But usually, we chalk this up to Earwig running out of Google API tokens. Would appreciate comments or ideas for the problems above; anything to ensure copyvios don't fly under the radar. Chlod (say hi!) 02:15, 26 December 2023 (UTC)
Chlod thanks much for this. A few questions or comments:
  • What's the 508,000 revisions? Is that just from CCI investigations?
  • In that same bullet, what cost are you talking about, processing time? And what did you mean by decent results, are you alluding to false +/- that you raised lower down?
    • As far as the workarounds, this sounds like roughly what I referred to as various pruning methods to shorten or reorder the input list.
  • Re false + due to websites copying from Wikipedia, I don't see this as a major problem and I addressed it in the 'direction of copy' comment involving IA checks. Maybe we'd have to negotiate with IA for a certain amount of search traffic per unit time, but as a fellow non-profit and given the reasons for it, I can't imagine there wouldn't be some positive arrangement to come out of that. That would eliminate the need for human intervention in a proportion of cases; see the "if-then" pseudo-code at the end of my comment. The triage attempts to automate a lot of it, and steer only the grey-area cases toward human intervention. And it should also weed out most false negatives for the same reason, and I don't see the failure to have 0% false negatives as a problem. There is always a problem identifying edge cases, even when humans are involved; if an automated solution improves our accuracy and throughput over what it was before, then it's worthwhile. One hundred percent accuracy and coverage are a goal but they will never be attained and that shouldn't stop us from incremental progress; even if automated processes fail to identify some sites for human intervention, we'll catch 'em, hopefully, next iteration of the processing.
  • "Really hard computer science problem": again, imho, we don't need to "solve" it, we just need to do a bit better than we were doing heretofore. Paraphrase will fall, imho, to better shingling turbocharged with some AI to recognize synonyms and linguistic transformations at some point in the not-nearly so distant future as I would've guessed a year ago. We needn't let the perfect be the enemy of the good, and I think we can do a lot of good now.
  • Earwig woes: is anyone maintaining it?
Thanks, Mathglot (talk) 00:02, 27 December 2023 (UTC)
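As an aside on the shingling mentioned above: at its simplest, it means comparing the sets of overlapping word n-grams ("shingles") two texts share. A toy sketch, nothing like a production detector, which would add normalization, hashing, and the paraphrase handling discussed here:

```python
# Toy word-level shingling with Jaccard similarity. Two texts that share
# long verbatim runs share many k-word shingles; paraphrase defeats this,
# which is exactly the hard part.
def shingles(text, k=5):
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over a sleeping cat"
similarity(a, b)  # shares 2 of 8 distinct 5-word shingles → 0.25
```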
  • Yep, the 508k revisions are those we have to check at CCI. That's from a dashboard by Firefly to see how much is left. It has its inaccuracies, but it's correct for most cases.
  • For the cost, it's actual monetary cost. From what I've heard (and what I assume from what I've heard), the WMF pays for the Google API and Turnitin credits, and that cost is pinned to how much we use Earwig and how many edits are checked by CopyPatrol, respectively. Attempting to request more credits for either needs discussion with the WMF, who then needs to discuss with Google/Turnitin. And yeah, the decent results is whether or not Earwig comes up with a false positive/negative.
    • Definitely; there's a lot of one-or-two-sentence stubs that don't really need checking. This could, of course, be filtered out, possibly with a lot more criteria for skipping than just that.
  • I'm wary about using Internet Archive as a "source of truth" for dates. Though we do exactly that in CCI, it's probably not reliable enough to make broad judgements on whether a page is a copy or was copied from. If the pipeline goes Earwig → URL of likely match → Internet Archive, the data it would provide in a report could be a false positive if either the page changed URLs at any point in time (as I've seen happen with Sparknotes) as Internet Archive may not recognize the switch or if it was never archived before (though this practically never happens for recently-added citations). Of course, it's best if this is tested empirically first.
    • This is a step in the right direction though. The downside of not using a system like this at all is that the direction checking will be manual, which then just pushes the investigation work back to the addressing user/administrator, and that could result in anywhere from zero (by luck) to a lot of false positives. But what has to be checked first is whether this will end up increasing processing time/workload for checking users.
  • Earwig's Copyvio Tool is actively maintained by The Earwig. The recent downtimes were shortly discussed in User talk:The Earwig § Copyvio tool is down; I only saw this now. Seems to have been from increased usage.
I agree; something is better than nothing. I'm mostly just worried about stretching the few editors working on copyvio even thinner by adding more work to do. We could balance this by encouraging more editors to help out at WP:CCP, but copyright cleanup really just has historically low participation rates. Chlod (say hi!) 05:14, 27 December 2023 (UTC)
Hey Chlod, thanks for pinging me here.
  • With Google's API, there's a hard daily limit of 10,000 queries per day, which costs US$50. The copyvio detector will make up to 8 queries per page (each query corresponds to a sentence or so of text, so that is chosen to strike a balance between performance and detection accuracy – longer articles would really benefit from more than 8 queries in many cases). So that works out to somewhere between 1,250 and 10,000 articles per day; let's say 2,000 on average. To be very clear, that's a limit built into Google's API terms. We can't get around it without a special agreement with Google, and everything I've heard from the WMF indicates we have no special agreement: we're paying the regular rate. Over ten years of running the copyvio detector, and despite multiple people asking, I've never managed to make the right connections with the right people at Google to get a special agreement (or the WMF hasn't, and IMO it's really them who should be doing that instead of me).
  • Just bashing the numbers out, checking 500,000 pages without a special agreement with Google would cost $12,500 and take at least 8 months (again assuming 5 queries/page).
  • The search engine is really the limiting factor here, hence my emphasizing it. Compute cost is much cheaper and we could use WMCloud to parallelize this more effectively if the daily limits weren't so severe.
  • Recent issues aren't related to using up all of our Google API credits but mostly due to my own poor software engineering decisions ten years ago. Sometimes it's due to unauthorized bot traffic that needs to be identified and blocked, but in this case I haven't noticed any. There's an ongoing project to improve performance, but no timeline for when it will be ready, unfortunately.
— The Earwig (talk) 14:53, 27 December 2023 (UTC)
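For concreteness, those figures pencil out as follows (assuming 5 queries per page and the $50-per-10,000-queries rate stated above):

```python
# Check the cost/time estimate: 500,000 pages at 5 Google queries each,
# against a hard cap of 10,000 queries per day at $50 per 10,000 queries.
PAGES = 500_000
QUERIES_PER_PAGE = 5
DAILY_LIMIT = 10_000
COST_PER_10K_USD = 50

queries = PAGES * QUERIES_PER_PAGE           # 2,500,000 queries
days = queries / DAILY_LIMIT                 # 250 days, a bit over 8 months
cost_usd = queries / 10_000 * COST_PER_10K_USD  # $12,500
```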
Thanks for these detailed explanations. Just noting that I've started User:Novem Linguae/Essays/Copyvio detectors to try to document all these copyright tools and their nuances. Seems like every couple months this comes up and I've forgotten all the details since the last discussion, so maybe an essay will help me remember it :) –Novem Linguae (talk) 12:13, 31 December 2023 (UTC)
@The Earwig: Anywhere I could possibly help with the copyvio detector's uptime? It's also affecting the NPP workflow at times, as the copyvio detector is part of checks to be done when patrolling. Chlod (say hi!) 13:56, 4 January 2024 (UTC)
@Chlod: Thanks for offering to help! I've given you maintainer access to the tool, and you have permission to restart it when needed. This is the case if the request backlog gets full (a log message "uWSGI listen queue of socket" is printed to uwsgi.log over several minutes) but occasional slowness doesn't necessarily mean the queue is full and needs to be cleared. It's good for us to have maintainers across different timezones. But beyond the occasional restarts, addressing the underlying issue is complicated and not something I expect help with. As hinted above, a backend rewrite is in progress to improve performance. — The Earwig (talk) 16:41, 4 January 2024 (UTC)
As I understand it, the issues with applying Earwig's copyvio thing to more pages (and the reason it always takes a million years to run) have nothing to do with computational power or programming skill on our part, but rather because Google search, which is a quite critical part of this software working, has deliberately decided to fuck us sideways on search queries.
Well, it's not clear: it could be that, or it could be that nobody from Wikipedia or from the WMF has succeeded in figuring out how to ask them for a special dispensation.
At any rate, we have a rather low quota, and it would cost tens of thousands of dollars to make it higher, and we do not get any special dispensation although I guess they are perfectly fine to make millions of dollars from reusing our content in their own knowledge panels lol. jp×g🗯️ 11:25, 28 December 2023 (UTC)Reply[reply]
Maybe @NPerry (WMF): might give more insight as to why the Wikimedia Foundation has not been able to get resources for copyright detection with Google search ? AFAIR, last year, they were involved with managing Wikimedia's partnership with Google. Sohom (talk) 11:54, 28 December 2023 (UTC)Reply[reply]
  • I'm not active in copyvio detection work, so take what I say as an outsider's perspective. Overall, copyvio detection on Wikipedia seems like an area that's struggling despite the heroic efforts of those working on it — multi-year backlogs at places like CCI are indicative of a system that's just not working. Bot assistance is our best hope of changing that dynamic on a systemic level, so I think it's a fruitful avenue to pursue. It'd be complex on a level greater even than ClueBotNG, but if successful it'd be similarly impactful.
    One thing to perhaps think about is the difference between old copyvios and newly added ones. My vague understanding is that a lot of the difficulty/pain comes from years-old insertions, which have since been built upon, necessitating removal of large chunks of an article. If it'd be simpler to build a bot that only checks/fixes new contributions, then perhaps that'd be a good place to start. If it could sufficiently stem the tide, perhaps it'd lead to a situation similar to what we have with non-notable articles/deficient FAs today, where there's a bunch of stuff in the past to clean up, but ultimately it's a finite backlog with few new entries being added, creating hope we'll someday get through it (cf. WP:SWEEP).
    Hope that's helpful, and good luck with this work! {{u|Sdkb}}talk 00:03, 3 January 2024 (UTC)Reply[reply]
  • (Possible overlap with part of above) - we have a copyright flagging system already (see log) - and allowing more bots to flag is fairly easy to do. Like many have said, building a reliable algorithm for doing the actual checking is a "hard" problem. One problem that came up during prior third party solutions like TURNITIN is that these companies wanted to reuse Wikipedia content without honoring the licensing requirements (e.g. we send them some text, they store it, then they re-serve it to other people without attribution). — xaosflux Talk 17:00, 4 January 2024 (UTC)Reply[reply]

Deletion of account is needed[edit]

There should be an account deletion system. Edits made by a deleted account should be left with the name of the account, without a link. 160.238.0.118 (talk) 19:34, 26 December 2023 (UTC)Reply[reply]

For legal reasons related to attribution of material, it is not possible to delete accounts. They can however be renamed in some circumstances: see Wikipedia:Courtesy vanishing. AndyTheGrump (talk) 19:45, 26 December 2023 (UTC)Reply[reply]
Given I can just search for all other edits made by that "name of the account", there is no difference whether or not they have a "link". Sounds like a distinction without a difference. What is your understanding of what an 'account' actually is? DMacks (talk) 10:03, 2 January 2024 (UTC)Reply[reply]
In the site's terms of service, you agree that the link is sufficient attribution. Remove the link, and the attribution is no longer within what you agreed to, and technically, you can sue the WMF. Snowmanonahoe (talk · contribs · typos) 14:27, 14 January 2024 (UTC)Reply[reply]

Bibliography articles[edit]

We have a number of articles titled 'Bibliography of X'/'X bibliography'. Sometimes these are lists of works by a subject, eg Virginia Woolf bibliography. Sometimes they are lists of works about a subject, eg Bibliography of Andrew Jackson. Sometimes they're both, eg Harold Pinter bibliography. Is "both" a desired approach? For example, if I wanted to split out some of the massive bibliography at Virginia Woolf, would I add it to the existing Virginia Woolf bibliography or would I create a new article? And if the latter, what would that be called to distinguish it from the existing article? Nikkimaria (talk) 21:06, 7 January 2024 (UTC)Reply[reply]

That massive bibliography at the Virginia Woolf article isn't just a bibliography, it is part of the references. The article uses shortened footnotes, so each of those sources is the target of a hyperlink from the short footnotes in the references section. So they can't be moved to another article. Since the term "Bibliography" is ambiguous I would rather articles used the terms Citations / References for the two sections rather than References / Bibliography.
This doesn't answer your question, however. StarryGrandma (talk) 10:19, 8 January 2024 (UTC)Reply[reply]
Many of the works listed at Virginia Woolf#Bibliography are in fact not referred to by any of the shortened footnotes: more than eighty of them, at a quick count. A script like User:Trappist the monk/HarvErrors marks these.
To answer Nikkimaria's question, the only comparative example I can immediately find is Winston Churchill, which has Bibliography of Winston Churchill for works about Churchill, and Winston Churchill as writer for works by him. Caeciliusinhorto (talk) 20:55, 8 January 2024 (UTC)Reply[reply]
Yep, wouldn't be looking at removing any of the sources actually cited, just some of the ones that aren't. Thanks for the example, that's helpful - anyone have thoughts on what the best titling approach would be for these different types of bibliographies? Nikkimaria (talk) 00:04, 9 January 2024 (UTC)Reply[reply]
Post on an article's Talk page when you remove/move anything from an article's improperly-intermingled Bibliography + Works Cited + References section -- that is, so that anything removed was not from the article's history from a state prior to material being cited in-line, such that any residual material failing in-line verification may be more easily cross-referenced to the article's history. (To those to whom this paragraph seems like nonsensical word salad: you know it when you see it -- see major history articles.) SamuelRiv (talk) 00:30, 16 January 2024 (UTC)Reply[reply]

Workshop: draftifying[edit]

Firstly, Jimbo Wales agrees with me. Well, not with me directly. But with the gist of this argument, and the argument behind unreferenced-PROD. He wrote in 2006:

I really want to encourage a much stronger culture which says: it is better to have no information, than to have information like this, with no sources. Any editor who removes such things, and refuses to allow it back without an actual and appropriate source, should be the recipient of a barnstar.

[1]

Anyways...

As a New Page Patroller, I frequently draftify unsourced articles.

Not infrequently, the creator of the article moves the draft back to mainspace, or re-creates it with the same content. The topic is frequently fringe, difficult to verify, but not necessarily PRODable or AfD'able.

What to do? There's an unsourced "article" in mainspace. It should be in draftspace, where it should be improved by the creator. It is unfit for mainspace. As one of my favourite essays points out, unsourced content is essentially digital graffiti and should be removed. The WP:BURDEN is on the creator to add references to their claims.

It isn't 2005 anymore. We shouldn't have new unsourced articles being created. They do get created, but are usually PRODed or draftified by NPPers.

Per WP:DRAFTIFY, we aren't allowed to re-draftify an article. Because of this clause, draftifying is essentially useless. All the creator has to do is move it back.

An analogy (or possibly a parable):

Someone dumps a pile of garbage on the sidewalk. There might be some re-usable or recyclable items in there, but it's hard to tell. Per municipal policy, a street cleaner takes it to the waste-dumper's house. It's their garbage.
Instead of throwing it out normally, or sorting out re-usable or recyclable stuff, the waste-dumper takes their garbage back out of their house and puts it right back onto the street. The street cleaner finds it again. Municipal policy states that the cleaner should either sort through it themself or ignore it. Once they have finished sorting, they should keep the recyclable items and take the rest to a waste-management facility, where they will have to write a report detailing why they think the garbage should be destroyed. The waste management facility is called AFD.
This is clearly nonsense. Why should the street cleaner have to sort through someone else's garbage?

I would like to propose disallowing draftified articles being moved back to mainspace if the problem for which the "article" was draftified has not been fixed. Let the street cleaner take the garbage back to the waste-dumper's house. 🌺 Cremastra (talk) 15:34, 13 January 2024 (UTC)Reply[reply]

Notified: WT:NPP, WT:Draft, WT:AFD.
🌺 Cremastra (talk) 15:42, 13 January 2024 (UTC)Reply[reply]

References

  1. ^ Wales, Jimmy (2006-07-19). "insist on sources". WikiEN-l. Retrieved 2007-01-31.
  • A reminder for those who don't notice the brightly-coloured editnotice: this is the idea lab, so no !votes. 🌺 Cremastra (talk) 15:35, 13 January 2024 (UTC)Reply[reply]
  • I think we discussed this before on Discord, and the main highlighted point was: 'Users have the right to object to draftification and can move it back to the article space.' But it's good to see you here, searching for some additional ideas. – DreamRimmer (talk) 16:17, 13 January 2024 (UTC)Reply[reply]
  • When I check my Draftify log, most articles have been improved & returned to Mainspace. The major challenge is the PROD/AfD process. In May 2023, I set up my PROD/AfD subpage to track articles. I am totally OK with reverts along with a reasonable explanation. It's a problem for unexplained reverts, and "junk/incomplete" articles remaining in mainspace. And I understand the goal is article improvements. Thanks for this discussion. Regards, JoeNMLC (talk) 16:44, 13 January 2024 (UTC)Reply[reply]
    Looking at one's draftify log is a good idea. Looking at my own, from August 2023, when I only draftified 8 articles:
    • One was redirected, after some confusing duplicated drafts/AfC-dodging and this related AfD.
    • Five remain in draftspace. Two of those drafts have been deleted because they were created by a sockpuppet.
    • One has been moved back to mainspace after being improved in draftspace, and looks pretty good.
    • One was re-created, unreferenced, in mainspace. It was unsuccessfully PRODed by a different user in October 2023; it has no references but is a list of sorts.
    🌺 Cremastra (talk) 16:54, 13 January 2024 (UTC)Reply[reply]
  • I hate that linked essay, and have found that in most cases, unreferenced prose is relatively easily verifiable in published sources that the original editor neglected to include.
    Having said that, I do think the current wording of WP:DRAFTOBJECT is overly strict. I don't think the same reviewer / patroller should draftify the same page more than once, even in the absence of improvements, but if multiple reviewers / patrollers think an article should be returned to draftspace for improvement, that no longer strikes me as "unilateral", and it is in fact the draft creator's moves to mainspace that are "unilateral", and the required next process should be AfC rather than AfD.
    The AfD problem is real, but the garbage analogy is inapt. Unreferenced articles are less "this is garbage" and more "someone didn't fill out the paperwork". (Also, unless you're very nosy, it's usually pretty difficult to determine whose garbage you've happened across littered in the public space, and no municipality I'm aware of requires street cleaners to sort waste streams on pickup, even if it is best practice. Typically, this duty falls on the people who work the recycle and hazmat streams at the transfer station or other facilities, with the acknowledgement that the landfill stream will often contain material that properly ought to be processed differently.) Folly Mox (talk) 18:01, 13 January 2024 (UTC)Reply[reply]
  • For 99% of people, having their article moved to draftspace is going to discourage them from ever improving it.★Trekker (talk) 22:16, 13 January 2024 (UTC)Reply[reply]
    Why so? Having it moved to draftspace is a chance for them to fix it without other editors swarming over it with cleanup tags and PROD templates and brightly-coloured banners. 🌺 Cremastra (talk) 22:42, 13 January 2024 (UTC)Reply[reply]
    I believe @WhatamIdoing has the specific numbers, but draftified articles have a dismal return-to-mainspace rate. Mach61 (talk) 01:26, 14 January 2024 (UTC)Reply[reply]
    Draftspace is where articles go to die, and we've known that for years. Steven Walling knows the original research on this best, and if you wanted to get more recent numbers, @Cryptic or someone else at Wikipedia:Request a query could probably tell you what percentage of pages in the Draft: namespace got deleted last year (e.g., created in January 2023 and deleted from the Draft: namespace since then).
    You can also estimate it from the logs. You can find the number of page moves into vs out of the Draft: space in Special:Log/move and the number of articles created and deleted in Special:RecentChanges. The numbers for the last couple of days look like roughly 120 articles created each day, 150 articles moved into the draftspace each day, 150 articles moved out of the draftspace each day, and 150 articles deleted each day. We put 270 articles in, and we deleted 150 of them. That's a 55% deletion rate. Ideally, you'd look at these numbers over the space of at least 7 full days, as there are definitely weekly trends in activity, and things like a holiday weekend, an important football game, a change in the activity level for even one key editor, etc., can throw the numbers off quite a bit. WhatamIdoing (talk) 22:48, 14 January 2024 (UTC)Reply[reply]
    But is all this necessarily bad? I believe in quality over quantity. 🌺 Cremastra (talk) 22:51, 14 January 2024 (UTC)Reply[reply]
    I believe in notable topics being allowed to have articles that have a chance to actually become better, which they do not in draftspace.★Trekker (talk) 21:58, 24 January 2024 (UTC)Reply[reply]
  • Issues I believe this proposal would have to first resolve in order to have any chance of gaining consensus: (1) There will probably be a dispute about whether the alleged problem with the article/draft actually existed in the first place. (2) There will probably be a dispute about whether the alleged problem with the article/draft was sufficiently serious to justify draftification. (3) There will probably be a dispute about whether the alleged problem with the article/draft has actually been fixed. In all three cases, the draftifier is not particularly unlikely to be completely on the wrong side of consensus. The fact that the draftifier believes or claims that a page is "garbage" does not mean that the page actually is garbage. To take the example given by the proposer, I have, over the course of many years, seen many articles tagged as "unreferenced", despite the fact that those articles very obviously did have references (presumably because the tagger just did not like the references in the article). I cannot imagine the community supporting the unilateral draftification, with no right of appeal, of articles, where there is a real dispute about the 'appropriateness' of the draftification. James500 (talk) 02:18, 14 January 2024 (UTC)Reply[reply]
  • I don't think this can pass. Judging from previous village pump discussions, about half of Wikipedians don't like draftspace, seeing it as a backdoor to deletion. The de facto options in situations where a very poor article (such as one with no sources) is WP:DRAFTOBJECTed are WP:AFD, or WP:TNT. Hope this helps. –Novem Linguae (talk) 03:58, 14 January 2024 (UTC)Reply[reply]
  • Unsourced articles are an easy target but fundamentally WP:DRAFTOBJECT is not about article content, it's about WP:CONSENSUS. If you think an article doesn't meet WP:V and therefore shouldn't be in main space, but another editor disagrees in good faith (i.e. by reverting your bold move to draft space), then you have to stop and talk about it. There's really no way around that. You can't just insist that you're right and the other editor has to satisfy you, because you're the New Page Reviewer. That's not "the encyclopedia that anyone can edit". Besides, I've seen NPPers wrongly identify articles as unsourced plenty of times, whether because they missed references that looked like something else, because a new editor struggled to format their sources, or because they just didn't read it properly. Folly Mox makes a good point about multiple reviewers being involved above, but still, if multiple editors are involved in a dispute about where a page should be, we'd expect them to discuss it (e.g. at AfD), not get into a move war. – Joe (talk) 07:14, 14 January 2024 (UTC)Reply[reply]
  • Wasn't the "information like this" from the Jimbo quote something about two tech founders throwing pies at each other to settle a dispute? That probably doesn't apply to most of the articles we're talking about, which don't tend to involve unlikely stories about BLPs (the quote is from years before the creation of WP:BLP). A few thoughts:
    • The underlying assumption is that the article creator WP:OWNs the article. This is supposed to be a collaborative project, so why should we treat an unsourced article as "your garbage"? I disagree that unsourced content is always "garbage" or "graffiti", but why don't we think of it as "our" problem? New content is a gift to us and to the world. Some gifts are bigger or smaller, and some are better or worse, but the absence of a little blue clicky number doesn't make it garbage. (My own idea of garbage is misinformation and disinformation.)
    • The belief that an unsourced article is "unfit for mainspace" is not supported by any policy or guideline. It is the personal preference of a fraction of editors, but it's not one of our rules. If we want to build a system based on this preference, then that preference needs to be turned into an actual rule first.
    • I wonder how big this problem actually is. I checked the pages created during the last three days in the mainspace and draftspace, using the visual editor (because there's a tag that makes it easy to check for the addition of a new ref, but it's not available for the 2010 wikitext editor [yet?]). 40% of them were redirects, at least 45% had at least one ref tag added in the first version of the page, and the remaining small fraction either had a ref added later (example, example, example), or not at all (example article, example dab page), or it actually had refs but they weren't autodetected (example, example, example, and pinging User:ESanders (WMF) to see whether that's a bug in mw:EditCheck). This is overall not feeling like a serious problem. Most pages that are supposed to have refs (e.g., they're not dab pages) are already getting refs. In fact, having looked at this, I don't think I would draftify a new article if this were the only serious problem.
  • WhatamIdoing (talk) 23:41, 14 January 2024 (UTC)Reply[reply]
    Unsourced articles are definitely unfit for mainspace in this day and age, especially if they don't fall into the evergreen WP:NSPECIES, WP:NPOL and WP:NGEO spectrum. I personally prefer the AFD route to the draftification route; however, it still stands that unless improved, a completely unsourced article is no better than misinformation and disinformation. Sohom (talk) 15:42, 15 January 2024 (UTC)Reply[reply]
    @Sohom Datta, there is no policy or guideline that says all articles must cite at least one reliable source. Wikipedia:Notability explicitly says the opposite: what makes a subject notable is whether sources exist in the real world, not whether sources have been typed into the Wikipedia article. It is true that some individuals personally believe that an article without a source is unfit for mainspace, but that's a personal belief and is not supported by policy.
    BTW, the research on the draftspace indicates that if you want unsourced articles to get sources, you need to leave them in the mainspace. If your goal is to get them deleted with a minimum of fuss and bother, then you should put them in the draftspace. WhatamIdoing (talk) 17:27, 15 January 2024 (UTC)Reply[reply]
    I do agree that articles need to have at least one reliable source cited though. I think what you meant here is that this should not be enforced retroactively. CactiStaccingCrane (talk) 17:29, 15 January 2024 (UTC)Reply[reply]
    My point is even smaller than that: Editors should not misrepresent the state of the actual rules by claiming that unsourced articles can't be, or shouldn't be, in the mainspace solely because they are unsourced. The correct (i.e., accurate and honest) process is:
    1. Rules explicitly do not require a source to be cited in a mainspace article.
    2. Get rules changed to require at least one source to be cited.
    3. Tell editors (especially newbies) that their new article is unacceptable because it does not comply with the new rule.
    The process that some editors are currently using is:
    1. Rules explicitly do not require a source to be cited in a mainspace article.
    2. Tell editors (especially newbies) that their new article is unacceptable because it does not meet my personal criteria, while pretending that my personal criteria are the actual rules.
    Whether the new rule is retroactive or not is not really a concern of mine. I am confident that it would eventually become retroactive even if it doesn't start that way. (That's exactly what happened with the rules for WP:BLPPROD: it started off as solely forward-looking, and became retroactive later.) What concerns me is editors claiming that the rules are X when the rules are actually not-X. Either change your claims or change the rules, but don't misrepresent the rules. WhatamIdoing (talk) 17:41, 15 January 2024 (UTC)Reply[reply]
    @WhatamIdoing I think you missed my point about an AFD. The fastest way to get sources added to an article in my experience (counterintuitively and unfortunately) is an articles for deletion/discussion, not letting it languish in mainspace (and definitely not draftspace). An AFD puts it on the radar of multiple WikiProjects, which are much more likely to provide reliable sourcing than I will ever be able to provide.
    If even after 2/3 weeks of advertising, nobody (including the article creator) thinks the article is worth saving, that could/should indicate that the article is probably not notable at that current moment.
    Also, I agree that there currently exists no policy that prevents an editor from not including any sources in an article (theoretically). But at a much more practical level, it is not really fair to expect an editor with limited prior understanding of the subject matter to accurately evaluate an article's notability if they have absolutely zero starting points for their search for sources. Sohom (talk) 17:57, 15 January 2024 (UTC)Reply[reply]
    And yet we say that Wikipedia:Deletion is not cleanup, because it's uncollegial and anti-collaborative for an editor to demand that others drop everything they're doing because an article must be sourced this week, or it will be deleted. As you say, editors with limited prior understanding of the subject matter have difficulty accurately evaluating notability for those subjects – so they shouldn't be sending them to AFD in the first place. AFD is for articles that you genuinely believe to be non-notable, not articles you'd like someone else to improve right away.
    Permitting editors to use AFD to demand clean up of subjects they're unfamiliar with is also a source of systemic bias. We've had altogether too many cases of editors sending Asia- and Africa-related subjects off to AFD out of ignorance, thinking that WP:NEVERHEARDOFIT is a good enough excuse and that if other editors want to keep it, then they will cheerfully drop everything they're doing to provide sources. If nobody intervenes, we lose the articles. This is not okay. WhatamIdoing (talk) 18:07, 15 January 2024 (UTC)Reply[reply]
    I personally don't think it is uncollaborative to go "Hey, this article has no sources, and I could not find any based on a few Google searches, what should we do about this?" (which tends to be most AFDs these days). For all you know, it could be a radioactive piece of hallucinating ChatGPT junk (or other promotional garbage) which needs to be nuked out of orbit ASAP, or it could be documenting an important niche topic that few people have heard about which needs to be preserved. AFD is a lot more collaborative than going "well, that's somebody else's problem I guess" and walking away. Sohom (talk) 18:28, 15 January 2024 (UTC)Reply[reply]
    Hey, this article has no sources, and I could not find any based on a few Google searches, what should we do about this ? That's something that should be said on the article's talk page, not in an AfD. In AfD, what should be said is: "Here are very good reasons why this article should be deleted". The two statements are not interchangeable. Sometimes they will address the same situation, but sometimes they won't. —Alalch E. 15:04, 17 January 2024 (UTC)Reply[reply]
    Do you mean that the creating-edit isn't tagged with "adds reference"? That does look suspicious. Am I still allowed to ask you to file a Phab task 🙂 ESanders (WMF) (talk) 18:31, 15 January 2024 (UTC)Reply[reply]
    If y'all end up creating a phab, could you subscribe @soda (me) as well :) Sohom (talk) 18:34, 15 January 2024 (UTC)Reply[reply]
    Ed, you can always ask... ;-) WhatamIdoing (talk) 14:32, 19 January 2024 (UTC)Reply[reply]
  • Oppose - The remedy here is AFD, not permanent banishment to AFC (followed by eventual deletion without discussion through G13). If you can't see a consensus to delete, please don't harass article creators. ~Kvng (talk) 13:15, 16 January 2024 (UTC)Reply[reply]
    Sigh. You didn't read the banner, or the edit notice, or my reminder, did you. 🌺 Cremastra (talk) 13:24, 16 January 2024 (UTC)Reply[reply]
    @Cremastra, No I clearly did not. Sorry. I'm not sure how to be constructive about this proposal. Does that mean I should say nothing? ~Kvng (talk) 02:14, 17 January 2024 (UTC)Reply[reply]
  • "The topic is frequently fringe, difficult to verify, but not necessarily PRODable or AfD'able." If it would be appropriate to boldly draftify an article, then it would be appropriate (when contested without fixing the issue) to nominate it at AfD for draftification. As with a contested blank-to-redirect, for which the appropriate discussion venue is AfD per this request for comment, you don't need to request deletion in order to nominate an article at AfD. And if it is not nominated for the purpose of deletion, then a full WP:BEFORE inquiry about whether the subject is notable and so forth isn't applicable.
I'd like to see the standard draftification messages more explicitly say that if an editor disagrees with the reasons for draftification, they can respond to the reasons for the move and ask (insist) that the article be restored to mainspace until there is a discussion to either delete it or make it a draft. SilverLocust 💬 19:56, 16 January 2024 (UTC)Reply[reply]
  • The topic is frequently fringe, difficult to verify, but not necessarily PRODable or AfD'able—why would it not be AfDable?—Alalch E. 15:01, 17 January 2024 (UTC)Reply[reply]
    I suspect that what's meant by "not AFD'able" is "it would probably not get deleted at AFD". WhatamIdoing (talk) 14:44, 19 January 2024 (UTC)Reply[reply]
WhatamIdoing, when you write The belief that an unsourced article is "unfit for mainspace" is not supported by any policy or guideline, that seems to contradict our core content policy of Verifiability, which says All quotations, and any material whose verifiability has been challenged or is likely to be challenged, must include an inline citation to a reliable source that directly supports the material. Emphasis added. If any editor acting in good faith says "I challenge the material in this particular unreferenced article, because it is unreferenced", does that not impose an immediate policy burden to provide references (citations) that verify the challenged material? Cullen328 (talk) 00:25, 20 January 2024 (UTC)Reply[reply]
  • And since that hasn't happened? "I challenge every unsourced article just because they're unsourced" is not acceptable. Even then, issuing a WP:CHALLENGE doesn't make the material unfit for the mainspace. If it did, then {{citation needed}} would hide the text instead of leaving it there. WhatamIdoing (talk) 18:12, 26 January 2024 (UTC)Reply[reply]
    @WhatamIdoing and Cullen328: It would be great if {{citation needed}} hid the text! When someone found a reference, the template could be removed and the text could reappear! Like a more powerful version of {{citation needed span}}. Instead of this:
    It is a city on the planet Earth.[1] The city has a population of 300,320 as of 2019.[citation needed] It is 10km from City Y.[1]
    We could have:
    It is a city on the planet Earth.[1] (unsourced content—please add a reference) It is 10km from City Y.[1]
    Of course, that's an ugly version, but with templatestyles we could have much better CSS. Not having the tooltip rely on title= would be a start. (If you have external CSS, you can do it with something like <span class="tooltip" data-mouseover="mouseover text here"> in the HTML and CSS like this:
    .tooltip:hover::after {
        cursor: help;
        content: attr(data-mouseover);
        background-color: peru;
        z-index: 5;
        position: fixed;
        font-size: 15px;
        color: white;
        padding: 2px;
    }
    Or, at least, that works for me. Fundamentally a good idea.
And "I challenge every unsourced article just because they're unsourced" is acceptable. The burden remains on the writer. 🌺 Cremastra (talk) 18:42, 26 January 2024 (UTC)Reply[reply]
I think you're reaching Wikipedia:Don't disrupt Wikipedia to make a point territory with that last comment. Anomie 19:00, 26 January 2024 (UTC)Reply[reply]
What? CactiStaccingCrane (talk) 01:17, 27 January 2024 (UTC)Reply[reply]

Excise the "crime" topic from the "law" topic; make "law, government and administration" an established grouping of topics[edit]

As an editor largely focused on law and legal philosophy articles, I find it hard to locate articles on subjects relevant to me and my expertise on Wikipedia-space pages that are sorted by topic (like WP:FAC, WP:GAC, or WP:RfC).

The undifferentiated inclusion of the very popular "crime" and "law enforcement" categories (here I don't mean categories in the technical sense) in the overall fairly unpopular "law" category means that most articles sorted under "law" turn out to be "Murder of so-and-so" or "Jimminy Bob Killman (murderer)"-type articles which require no particular expertise in any country's law or any field of legal scholarship. In terms of the skills required to contribute, they are perhaps closer to biography or history articles, or they might be in a category of their own.

Since law is a fairly underappreciated topic on Wikipedia (I find), it may be appropriate in some contexts to group it together with related topics. I would suggest that "government" be split off from "politics and government" where it exists, and merged with "law" and a newly minted category, "administration", forming the "law, government and administration" category. ("Government" would thenceforth focus more on formal/institutional political science and social philosophy, whereas "politics" would retain people, parties, ideologies, and what have you.)

I'd be interested in alternative solutions, though, since two objections to the idea do spring readily to my mind: For one, public law is arguably the only government-/administration-adjacent field of law, and private law should be appended to the "economics" topic or something. For another, grouping law with government might create the impression that the law is definitely an institution of the state (the government), or even synonymous with it. This is not undisputed; especially some jurists in the common law tradition (found e.g. in England and the U.S.) maintain that it is fundamentally a set of societal traditions, influenced by, but not derived from, the state.

What I'll be looking for when I put the finished proposal out is a consensus that the new categorization/grouping practices be recommended for internal and public-facing categorization purposes.

Again, I'd love to hear perspectives and suggestions. If you could tell me which user communities I should notify about this post or the finished proposal so that consensus is representative and the relevant people are aware of this re-systematization, I'd also be much obliged.

§§ LegFun §§ talk §§ 22:20, 13 January 2024 (UTC)Reply[reply]

I was going to recommend -deepcat:"Crime", but there are too many subcategories and the search fails. Folly Mox (talk) 02:44, 14 January 2024 (UTC)Reply[reply]
Have you tried searching incategory:"WikiProject Law articles" -incategory:"WikiProject Crime and Criminal Biography articles" ? Make sure you are searching in the Talk: namespace or maybe just go through Category:WikiProject Law articles 115.188.140.167 (talk) 10:16, 14 January 2024 (UTC)Reply[reply]
Are you looking for articles to edit, or specifically for FAs and GAs? SuggestBot can look at your contributions and suggest a list of articles that you might be interested in. WhatamIdoing (talk) 23:44, 14 January 2024 (UTC)Reply[reply]
Thanks for those recommendations! SuggestBot and the specific phrasing of the search terms provided, in particular, sound useful for my purposes. I was hoping to help bring already high-quality law articles into the spotlight by reviewing/supporting them in the FA/GA processes, which is why I was looking there specifically. I also want to get a general idea of the structure and weak points of jurisprudential Wikipedia.
That said, I'm not sure if looking for articles listed by WikiProject Law will help in every case. Wherever articles of all topics are listed for specific encyclopedic purposes, the poor categorization of law will make things hard; which is why I wanted to achieve a general consensus before I charge ahead with shifting around categories. Additionally, I'm not too well-versed in Wikipedia's more technical side, so any list-producing bots that automatically compile articles into categories would need to be adjusted by more adept minds.
Incidentally, it occurs to me that perhaps I ought to get WikiProject Law on board first . . .
§§ LegFun §§ talk §§ 08:04, 15 January 2024 (UTC)Reply[reply]
I agree that making friends over at WPLAW is a good idea. Also, take a look at Wikipedia:Version 1.0 Editorial Team/Law articles by quality statistics. The top- and high-priority articles are probably the ones you want (or, the ones you want should probably be considered top- and high-priority; the categorization is not always perfect). Wikipedia:WikiProject Law/Popular pages might also be useful to you. WhatamIdoing (talk) 17:32, 15 January 2024 (UTC)Reply[reply]

New tags on recent changes.[edit]

Hello, would it be possible to add the tags "previously undone edit", "possible edit war" and "likely repeated vandalism"? Each would be activated in a slightly different way: the first when the same user undoes the reversion of their original edit; the second when two separate versions of a revision are rapidly switched between; the third as an escalation of the first, for a high-ORES-score edit being consistently reverted and re-added.

 Thanks Geardona (talk to me?) 02:37, 14 January 2024 (UTC)Reply[reply]

Definitely maybe. The first might be possible in Special:AbuseFilter (about which you can inquire at Wikipedia:Village pump (technical)). I think that detecting the others might (a) have to be done in MediaWiki code, like the Reverted tag, and (b) not be available instantly, especially if you want it to work if there have been unrelated intervening tags. WhatamIdoing (talk) 23:47, 14 January 2024 (UTC)Reply[reply]
Probably useful to file a phab for this, and maybe something that the ModTools team would be interested in looking into :) Sohom (talk) 15:17, 15 January 2024 (UTC)Reply[reply]
How would I file a phab request? Geardona (talk to me?) 19:29, 16 January 2024 (UTC)Reply[reply]
WP:BUGS. --Redrose64 🌹 (talk) 19:32, 16 January 2024 (UTC)Reply[reply]
 Thanks Geardona (talk to me?) 19:34, 16 January 2024 (UTC)Reply[reply]
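For what it's worth, the first two patterns could be prototyped outside MediaWiki by examining a page's revision SHA-1 hashes (the Action API exposes one per revision via prop=revisions with rvprop=sha1): a re-reverted edit or a two-way edit war shows up as repeating hashes. A rough Python sketch, with illustrative function names and thresholds (not an actual AbuseFilter rule):

```python
def is_previously_undone(sha1s):
    """First tag: the newest revision restores a content state that was
    already reverted earlier, i.e. its SHA-1 reappears deeper in history.
    sha1s: list of revision SHA-1 strings, newest first."""
    return len(sha1s) >= 3 and sha1s[0] in sha1s[2:]

def looks_like_edit_war(sha1s, min_switches=3):
    """Second tag: the newest revisions alternate between exactly two
    content states (A, B, A, B, ...). Counts how long the alternating
    prefix runs; min_switches is an arbitrary illustrative threshold."""
    if len(sha1s) < 2 or sha1s[0] == sha1s[1]:
        return False
    switches = 1
    for i in range(2, len(sha1s)):
        if sha1s[i] != sha1s[i - 2]:
            break  # a third content state appeared; stop counting
        switches += 1
    return switches >= min_switches
```

Real tagging would have to run inside MediaWiki (or a bot), work from timestamps as well as hashes, and tolerate unrelated intervening edits, which this sketch does not.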

Also WP:Crime labels. Both new. -- GreenC 06:33, 15 January 2024 (UTC)Reply[reply]

This is an excellent essay and reflects a concern I've been incubating for some time. Wikipedia — because of its page rank prominence and the use of article leads to populate the snippets of Google Knowledge Graph — is now in the position of being a prime destination for the application of punitive measures against perceived transgressors and realization of this instinct is having a deleterious impact on our maintenance of encyclopedic standards in some cases.
I would like to see some practical measures adopted to deflate the ability of this project to be used in such a way. For instance, courtesy de-indexing (if requested by subjects) of BLPs (perhaps with some exceptions) would make the stakes considerably lower and disincline, to a certain degree, some editors from using BLPs as a punitive tool while still maintaining our WP:NOTCENSORED ethos. Chetsford (talk) 18:46, 17 January 2024 (UTC)Reply[reply]
I think there's a good point behind Wikipedia:Crime labels, but I don't think it's presented well (e.g., no clear definition of "label", no way to know whether the suggested criteria are met). Unfortunately, my efforts to clarify it, or even get my questions answered, do not appear to be welcome at the moment. WhatamIdoing (talk) 18:27, 26 January 2024 (UTC)Reply[reply]

Can Wikipedia Provide An AI Tool To Evaluate News and Information on the Internet[edit]

(refactored from Help desk)

The integration of a neutral and curated AI platform with search engines has the potential to revolutionize information access and combat misinformation. By providing context, flagging biases, and promoting critical thinking, these platforms can empower users to navigate the complex information landscape of the internet Perfedge (talk) 06:06, 15 January 2024 (UTC)Reply[reply]

Wikipedia is an online encyclopaedia, not a search engine, so that would be out of scope for the project. And while the Wikimedia Foundation might possibly have the funding to finance such a project, if it were feasible, they don't have much of a reputation for creating complex software with any degree of rapidity. And frankly, the evidence that current AI is even capable of fulfilling the objectives you describe would appear to be lacking. 'Critical thinking' will, for now, have to be facilitated by actually doing the thinking ourselves. AndyTheGrump (talk) 06:17, 15 January 2024 (UTC)Reply[reply]
Oddly enough, that statement sounds like something a chatbot would spit out. Or a "corporate mission statement." But I have to object to using "curated" "revolutionize" "combat" "promoting" "empower" and "landscape" in just two sentences.
Also, no. Per AndyTheGrump, we will have to continue to use brains. 🌺 Cremastra (talk) 22:40, 16 January 2024 (UTC)Reply[reply]

Ignoring the lead comment (which no offense to user, sounds like machine-generated businessbabble, hold the synergy), I'm intrigued by the header question. I don't think ChatGPT in any nearcoming commercial iteration will be particularly useful as a dynamic tool for evaluating sources, not least because it doesn't get updated fast enough. However at its most basic, I'd like to see if it can give some intelligent analysis (at either the individual article level or of an outlet given a sample of articles) that's better than some of the crud at Media Bias/Fact Check. (Have MBFC's writeups gotten better in the last couple years? Their methodology for comparing outlets actually in theory seems not bad.)

To start I tried asking ChatGPT (3.5) in a couple prompts to evaluate bias and reliability on an article from an inner-metro local paper, the Dorchester Reporter, figuring that the author is unknown and the paper would not be much evaluated in any secondary media GPT may have seen. I'm currently having trouble getting the output to say anything of substance (it almost seems to be telling me to do my own homework!) (and yes, I'm including the full article plaintext in the prompt). If anyone else is interested, see if they can get anything substantial -- I have prepared a regional cross-section of local outlet stories on the same topic, and also a spread of sample stories from a new outlet, to see what kind of reports it will give. SamuelRiv (talk) 01:54, 17 January 2024 (UTC)Reply[reply]

Export to PDF, Epub, Odt and LaTeX[edit]

Hi, I would like to include mediawiki2latex in Wikipedia. It exports wiki articles to PDF, Epub, Odt and LaTeX. Yours, Dirk Hünniger (talk) 11:43, 15 January 2024 (UTC)Reply[reply]

Improving Wikipedia's Editing Process[edit]

I'm an expert in an emerging technology called "General Collective Intelligence" or GCI platforms, which aim to significantly increase the collective intelligence of groups, and hence to significantly increase impact on any collective outcome in general that a group might target, per dollar of resources spent. This has the potential for radically disruptive positive impact in a vast range of areas from addressing poverty, to sustainable economic development, to addressing the environmental degradation that has been associated with climate change, but I'd like to begin with improving Wikipedia's editing process. In summary, a general collective intelligence identifies different problem solving strategies and the domains in which each strategy is optimal in order to switch to the optimal strategy for each problem. Some typical issues that plague Wikipedia (like edit warring) are just symptoms of a clash between problem-solving strategies rather than the collective intelligence to cooperate in using the optimal strategy. A short summary is linked below. I look forward to your guidance.

https://en.wikipedia.org/wiki/User:CognitiveMMA/sandbox CognitiveMMA (talk) 14:39, 15 January 2024 (UTC)Reply[reply]

Please provide citations to appropriate third-party sources supporting your claims concerning the abilities of 'General Collective Intelligence'. AndyTheGrump (talk) 14:44, 15 January 2024 (UTC)Reply[reply]
What is a GCI platform? Is this something you invented? Barnards.tar.gz (talk) 21:29, 17 January 2024 (UTC)Reply[reply]

To achieve Collective Intelligence one needs a lot of permutations, and most Wikipedia articles tend to reach a "good enough" state, then don't deviate much beyond that due to the work involved in rewriting existing text. It's actually harder to rewrite an existing article than create a new one. In rewriting, you have to consider and incorporate what is already there, plus add new content around it. Very time consuming. Most users don't undertake that unless they are really dedicated, or the existing text requires WP:TNT. Most articles reach a point, then don't evolve much. (Ignoring all the maintenance and trivial edits.) A better example of GCI is generative AI, because it speeds up the evolution process to (literally) light speed; one can quickly see collective intelligence generating new things of value. -- GreenC 16:24, 15 January 2024 (UTC)Reply[reply]

Uh, what? CactiStaccingCrane (talk) 17:31, 15 January 2024 (UTC)Reply[reply]
Sounds like doi:10.1007/978-981-19-2840-6_18. It's probably not a notable subject, and I doubt that it would work. That is, it would only work if nearly all the participants were truly open to any outcome, which is basically never the case when there's a dispute. It would have us be hyper-rational when people are being emotional. It could probably turn into a decent consulting career specializing in things like creating corporate mission statements and such, though. WhatamIdoing (talk) 17:52, 15 January 2024 (UTC)Reply[reply]
Indeed, uh what? Or to put it another way, 'citation needed'. I know it's fashionable these days to claim that AI can solve all of the world's problems before tea-time, but we actually need concrete verifiable evidence to justify using it on Wikipedia. AndyTheGrump (talk) 17:56, 15 January 2024 (UTC)Reply[reply]

Understandablifying advanced math articles -- GPT and the Holy Grail[edit]

Forever ago (2007) I created a few theoretical comp sci and neurosci articles -- stressing the content is mostly math. One of them, which I (mis?)remember but cannot seem to find now, attracted a comment on the Talk page saying it was completely incomprehensible and the worst article they had ever read on WP. (I had thought it was Flashsort, which I wrote on WP:REQuest and which attracted some mockery, but I might be confusing stuff.) I had a good laugh at this stuff -- if you check the Talk pages on WP math articles, such comments aren't exactly unusual -- and I always wondered in my time educating and writing if there wasn't some way to just break through this seemingly dense incomprehensibility barrier that's almost universal in rigorous technical text.

Fast-forward to today and people figuring out use cases for ChatGPT (3.5). I type in the following: Rewrite the wikipedia article on Flashsort into simplified English

ChatGPT 3.5: Flashsort into simplified English

Flashsort is a way to arrange numbers or data in order, from the smallest to the biggest. It's a special method that does this really quickly.

When we want to sort things, like a list of numbers, there are different methods to do it. Some are faster than others. Flashsort is one of the faster ones.

Here's how it works:

  • Dividing the Range: First, we look at the numbers we want to sort and find the smallest and biggest numbers. This helps us figure out how big our sorting range is.
  • Creating Buckets: Imagine we have a bunch of boxes, and each box can hold a certain range of numbers. We create these boxes, also called "buckets," to fit the range of numbers in the list.
  • Distributing Numbers: Now, we go through our list of numbers and quickly put them into the right buckets. We do this by calculating where each number should go based on its value.
  • Sorting Inside Buckets: After we've put all the numbers in their respective buckets, we sort the numbers inside each bucket. We can use any sorting method we like, but inside each bucket, the numbers are already close to each other, so we don't have to do a lot of work.
  • Joining Buckets: Finally, we take the sorted numbers from each bucket and put them together. When we do this, the numbers are all sorted from smallest to biggest.

Advantages of Flashsort:

  • It's fast, especially when dealing with a large number of items.
  • It's efficient because it doesn't waste time comparing every number to every other number.
  • It works well when the numbers are distributed unevenly, meaning some numbers are way bigger or smaller than others.

Limitations of Flashsort:

  • It might not be the best choice for sorting very small lists.
  • It needs some extra memory space to create the buckets.

In simple terms, Flashsort is a clever way to quickly sort numbers by dividing them into buckets, sorting those buckets, and then putting everything together. It's like sorting toys into different boxes based on their size and then arranging the toys inside each box neatly. This helps us find what we need faster.

The output is mostly accurate. More to the point, while it doesn't sound completely precise in a mathy sorta way, when I went over it a few times I couldn't think of anything wrong with it, or even necessarily disqualifyingly imprecise about the text. It reminded me of a really really well-written teen/lay science book, except such a really esoteric topic that nobody would ever put it in a teen or pop book.

All I can think is that this is a game changer, especially with GPT-4. The lede of almost every major, but densely written, technical article can be run through this and then curated, since the lede does not have to be 100% precise on any topic -- history, agriculture, sports, whatever -- that's for the body. Even more significantly, this could be an engine to draft minor articles for review for Simple English WP, which is still anemic by comparison.

Again, to emphasize, (as if anyone anywhere serious disagrees or hasn't considered this:) all generated material is a draft that must be reviewed in its entirety for content; and (lesser consensus on this:) any lengthy prose retained verbatim for WP or other publication should include attribution to content generation tools used. SamuelRiv (talk) 01:59, 16 January 2024 (UTC)Reply[reply]
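(As a reference point for the five steps the output lists, they correspond roughly to the following simplified sketch. Real flashsort classifies and permutes elements in place using the classification counts; this version builds explicit Python lists, so it is closer to a bucket/histogram sort. All names and the bucket-count choice are illustrative.)

```python
def flashsort_sketch(items, n_buckets=None):
    """Simplified bucket-based sort following the steps in the GPT summary.
    Not real flashsort: that algorithm permutes in place."""
    if len(items) < 2:
        return list(items)
    n_buckets = n_buckets or max(1, len(items) // 2)
    # 1. "Dividing the range": find the smallest and biggest values
    lo, hi = min(items), max(items)
    if lo == hi:                      # all values equal; already sorted
        return list(items)
    # 2. "Creating buckets": each bucket covers a slice of [lo, hi]
    buckets = [[] for _ in range(n_buckets)]
    scale = (n_buckets - 1) / (hi - lo)
    # 3. "Distributing numbers": a value's size picks its bucket
    for x in items:
        buckets[int(scale * (x - lo))].append(x)
    # 4. "Sorting inside buckets" (real flashsort uses insertion sort here)
    # 5. "Joining buckets": concatenate the sorted buckets in order
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```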

I think the "mostly" in "mostly accurate" is a problem. ChatGPT and these things have a known reputation for saying untrue things. In practice, I think this suggestion will fail because it's a lot of effort to check each claim, and people will often forget to do so. Jo-Jo Eumerus (talk) 08:48, 16 January 2024 (UTC)Reply[reply]
And let's not forget that the sheer existence of English Wikipedia and Simple English wikipedia, is one of the reasons this model even knows how to do this. —TheDJ (talkcontribs) 09:34, 16 January 2024 (UTC)Reply[reply]
You really don't see "anything wrong" with "It works well when the numbers are distributed unevenly"? I'm also going to take issue with why it says "Sorting Inside Buckets" is fast, the very existence of the "Joining Buckets" step, and (relatedly) that it doesn't say anything besides "quickly" about how this differs from a histogram sort, the explanation of which forms the bulk of the article. —Cryptic 10:09, 16 January 2024 (UTC)Reply[reply]
I know the idea lab is for positive feedback, but this example is not great. There are basic errors: not only does flashsort not work well when the numbers are distributed unevenly (because some buckets remain big), but "uneven" doesn't mean some numbers are way bigger or smaller than others (1, 2, 10 is uneven; 1, 1001, 2001 isn't). Worse, it sounds credible: having sucked the reader in with a few truisms, it hallucinates plausibly. Even some true facts are misleading: it doesn't waste time comparing every number to every other number, but even bubble sort isn't that dim. Less importantly, the tone is unencyclopedic, but that covers up the fact that even the correct statements are vague and imprecise (fast rather than O(n), etc.). One day we may be able to curate good AI-written articles, but we seem to be a long way from that yet. Certes (talk) 14:16, 16 January 2024 (UTC)Reply[reply]
Some things to clarify:
First, I did all this several months ago and only have notes of the prompt and output. I should have reread instead of going just from memory. Yes, the "Advantages" and "Limitations" are awful, so I shouldn't have flatly said there wasn't "anything wrong".
Second, we can show it with a better article example (one I actually cared to remember anything about), and also on GPT-4, but what I'm impressed by is the ability to put technical text into a readable format while still accurately describing steps of the algorithm. I'd still say on a quick re-read there's nothing disqualifyingly wrong in what is affirmatively stated (except in the two sections noted), except there is significant omission (it's a very short output for a long technical article) -- my point is that it can help write good ledes where you will always have imprecision and omission. I'm not sure where specifically you see hallucinations per se -- it doesn't go off into outer space in the sections noted.
Third, you can fine-tune tone with prompts, and you can fine-tune how much mathematical language you want to have in there too (or just add it yourself in the curation process). For the prompt I used, copied literally, I'd say it output something expected.
Fourth, who said anything about "AI-written articles"? I'm talking about a tool to aid in writing -- the kinds of things the whole rest of the ChatGPT-using population is using it for. SamuelRiv (talk) 18:06, 16 January 2024 (UTC)Reply[reply]
I agree that a simple prompt like the one you have provided can easily be improved to get better results. Pinging @JPxG who has a lot of experience in this area. — Qwerfjkltalk 19:37, 16 January 2024 (UTC)Reply[reply]
This output seems more suitable for the Simple English Wikipedia, than here. I disagree that since the lede does not have to be 100% precise on any topic - leads are subject to WP:V, WP:NPOV as much as any other article content. Sure, we can elide details in the lead but it still has to be verifiable and accurate. Galobtter (talk) 21:41, 19 January 2024 (UTC)Reply[reply]
My biggest problem, aside from the inaccuracies stated above, is how redundant most of the wording is. The second and the last paragraphs add nothing whatsoever. There's lots to say about flashsort's advantages other than being "fast". --Ahecht (TALK PAGE) 16:01, 24 January 2024 (UTC)Reply[reply]

Quick template script[edit]

I have been working on a script that allows for the quick addition of important templates (such as citation needed), and I was wondering if I could get some feedback. It is very simple at the moment, JavaScript not being my strong suit. At the moment the way click inputs are handled is very inefficient, and templates can only be added to the top of an article. I intend (if technically practical) to add options for different variations of templates, to handle click inputs better, and to future-proof it, which would allow for easy addition of new templates to the script. The page is found here. I thank you for your feedback, and I hope this is the correct way to present this (I am fairly new and haven't fully figured out the Village Pump). Vghfr (talk) 04:01, 18 January 2024 (UTC)Reply[reply]

Adding section/portion of a page to Watchlist[edit]

Is it possible to add a portion or a subsection of an article to one's watchlist instead of the entire page? Or are there any options available to do something like this? IOHANNVSVERVS (talk) 07:41, 18 January 2024 (UTC)Reply[reply]

It is possible on talk pages through the recently introduced WP:SUBSCRIBE feature. InfiniteNexus (talk) 08:03, 18 January 2024 (UTC)Reply[reply]
Right, I was aware of that (although I didn't know that it was only recently introduced)
Speaking of that feature, do you know if there's a way to have topics one has created on a talk page be subscribed to automatically?
Thanks, IOHANNVSVERVS (talk) 09:01, 18 January 2024 (UTC)Reply[reply]
It's in your Preferences (Ctrl+F Automatically subscribe to topics). InfiniteNexus (talk) 01:30, 21 January 2024 (UTC)Reply[reply]
Thank you! IOHANNVSVERVS (talk) 04:31, 24 January 2024 (UTC)Reply[reply]
It would be helpful to have them added to one's watchlist rather than the notification system. I often want to watch one section in a major forum such as VPT or ANI without being alerted to each comment on several massive unrelated threads that are important to others but of little interest to me. Actually, what might be better is "don't alert me for updates to this section", so I get notified of new sections by default but can set each of them to ignored. That works well on other forums such as Usenet (yes, it still exists). Certes (talk) 14:14, 18 January 2024 (UTC)Reply[reply]

Use of ChatGPT and other LLMs specifically for medical and scientific content[edit]

There has been much debate about the use of ChatGPT and other LLMs already, and from what I gather, the chief objections to using these tools for page content creation center around one or more of these subjects:

  1. Inaccurate/falsified statements
  2. Bias of the program and its database(s)
  3. Falsified references in generated content
  4. Sometimes being overly vague

I agree fundamentally with these conclusions, and I believe that they make LLMs unsuitable for many types of page on Wikipedia, such as biographies.

However, within my usual discipline, medicine, I have observed an exceptionally high degree of accuracy and completeness in ChatGPT's ability to characterize diseases. It is not flawless - but not even an expert is, which is why even subject-matter experts cite their sources. Fundamentally, what I am thinking about is generating page content with ChatGPT, and then manually verifying information before putting it on Wikipedia. This makes article expansion much easier, and there is a lot of expansion needed in WP:Medicine.

I present an example in the form of Japanese spotted fever, which is currently a stub. I know how to track down good sources to expand this page, and I have both a textbook and a high-quality review article open right now, but actually writing the page out would take a lot of time. And there are thousands of pages just like this.

Take a look at what ChatGPT wrote for me. I challenged it to "create profiles of diseases and medical conditions, including at minimum sections on: clinical presentation, diagnostics, treatment, and epidemiology" and then simply prompted it with Japanese spotted fever. To summarize: everything written there is correct. It is more detailed than the textbook I'm using to verify things, and much of it can only be found from sources like this, which are well over 20 years old. I find that impressive.

While that page needs some more formatting and editing, the base content is a vast improvement over the current stub. Would a single editor object to replacing the stub with that sandbox page, after I formatted it and verified+cited all the material? Would it not represent a major improvement to Wikipedia's medical utility, if medical editors were encouraged and guided to responsibly use ChatGPT as a tool, even for page content? Just-a-can-of-beans (talk) 20:29, 19 January 2024 (UTC)Reply[reply]

What you are talking about is an article where every word is written or specifically verified by a human editor and determined by that editor to be sourceable, and where the human editor has sourced it sufficiently to comply with wp:ver. This is sound human editing, and IMO not a reason for Wikipedia to give a particular blessing to whatever inspired or guided them to do that work (personal discussions with friends or experts, ChatGPT, a Ouija board :-) etc.). North8000 (talk) 20:47, 19 January 2024 (UTC)Reply[reply]
This might work well for a subject that has only been written about by knowledgeable people, but how does it get on with, say, COVID-19 vaccine or finasteride? Phil Bridger (talk) 21:39, 19 January 2024 (UTC)Reply[reply]
This is an interesting question. If database materials are inaccurate, then the AI could weave specific inaccuracies into the material. That reinforces the need to verify against outside sources, with existing standards for source veracity.
I'm not familiar with finasteride or what I presume are controversies surrounding it, but I asked it some COVID vaccine questions and even tried giving it some loaded questions. It definitely gets vague, but perhaps that's due to brevity. I don't see anything that looks like information sourced from questionable authors - perhaps the training data was vetted by competent and knowledgeable people or (more likely, in my opinion) the AI is able to draw its own conclusions about source veracity. It's also pretty good about answering specific questions posed in response to bits of vague language. This is more than good enough for a knowledgeable editor to use as a tool, I think. Just-a-can-of-beans (talk) 22:28, 19 January 2024 (UTC)Reply[reply]
Agree, but that isn't the question. North8000 (talk) 22:37, 19 January 2024 (UTC)Reply[reply]
I'm not really sure what Mr. Bridger was asking, then Just-a-can-of-beans (talk) 22:45, 19 January 2024 (UTC)Reply[reply]
My fundamental concern with suggestions of this nature, generally, is that (for instance) ChatGPT uses Wikipedia itself as part of its training corpus which presents the possibility of circular learning. However, in the specific form presented here: "Fundamentally, what I am thinking about is generating page content with ChatGPT, and then manually verifying information before putting it on Wikipedia." I probably don't personally have any objections, for whatever that's worth. Chetsford (talk) 22:42, 19 January 2024 (UTC)Reply[reply]
You make a great point about circular learning - if it does behave that way, this could be very dangerous in the long term. I formerly assumed ChatGPT used Wikipedia and thus didn't use it as a study aid for a long time, but having used it lately, I no longer personally believe it uses Wikipedia heavily, if at all, at least for medical subjects. A good example of this is coincidentally in the Japanese spotted fever page - compare ChatGPT's description of the rash with the one on Wikipedia at the moment. The AI is more detailed in its description, and matches several sources, but none exactly - while also omitting the mention of progression to petechiae. However, this could just be due to this particular page being very short and thus not considered much by the AI, so perhaps some testing with other pages is in order. Your concern definitely merits attention. Just-a-can-of-beans (talk) 22:58, 19 January 2024 (UTC)Reply[reply]
A fifth concern to add to the list is the possibility of the model emitting copyright-violating material. This could be of greater concern if the information only exists in a small number of niche sources. Barnards.tar.gz (talk) 22:53, 19 January 2024 (UTC)Reply[reply]
Considering the nature of LLMs, I'd say the burden is to show they have ever produced prose that would appear as plagiarism (by clone or close paraphrase) of a single source -- putting the as-yet unsettled nuances of copyright in scraping content for LLMs aside for now (and for the many years it will take to decide such law). Certainly they can reveal trade secrets or anything else people blab in their prompts, but that's not illegal. SamuelRiv (talk) 23:19, 19 January 2024 (UTC)Reply[reply]
Several examples in this article Barnards.tar.gz (talk) 23:50, 19 January 2024 (UTC)Reply[reply]
The song lyrics prompt examples are clearly problematic. The NYT prose example is basically directly asking for a copyvio of a specific article, which is kinda ridiculous.
Looking at G-Scholar I do see stuff on memorization (artificial intelligence) (WIP), which I didn't realize is an emergent LLM behavior. The studies they are doing are mostly still on GPT-2 to build on old literature it seems, so take some %ages with salt. Note de Wynter et al 2023[1] discusses GPT-3 and 4, and that while conditioning/biasing prompts on existing prose and authors/titles greatly increases likelihood of outputting memorized training data, simple prompt mitigations dramatically lower it.[pp. 9--10] Note also that memorization scales (log-linear to quadratic) with frequency in the training data, so pop songs are astronomically more likely to be quoted than a niche academic paper.
We could add plagiarism-detection software, as many of these papers use, perhaps a cloud-based solution with easy credits for editors, if we think the remaining risk is still inappropriately high (relative to the general risk from close paraphrasing). Any automated tool would certainly speed up the necessary review process anyway.
Note fyi Hartmann et al 2023 go over what is and isn't decided in copyright law, potentially, regarding LLM issues.[2] It may or may not be the case that spontaneous memorizations are a copyvio per se. SamuelRiv (talk) 21:30, 20 January 2024 (UTC) SamuelRiv (talk) 21:30, 20 January 2024 (UTC)Reply[reply]
I can foresee issues with this. If people don't mind an anecdote, at work someone tried to help when we couldn't find information on a MEDRES something and asked an AI for a list of scientific papers that covered the topic. Of the three papers returned, one was a real paper which didn't cover the topic and the other two were total hallucinations. The first of the two hallucinations was particularly difficult to spot because it took the name of a real paper and the author list of another real paper and merged them. Red Fiona (talk) 15:17, 20 January 2024 (UTC)
Yeah. I don't think it's up to snuff just yet for identifying sources. I'm involved in a systematic review using AI to help sift through things and even that proprietary, paid service requires a lot of human oversight and manual correction.
Basically what I support is letting the AI write up a paragraph or a page on a subject, then verifying the info manually. I.e., you may prompt the AI to "write a paragraph explaining medical treatment options for condition X" and then, when it spits out that short-course corticosteroids are the preferred first-line treatment, you do a quick check of good sources to make sure that's correct. It saves a lot of time because while you would have had to look up those sources anyway, you would have also had to spend a lot of time writing up the paragraph yourself, phrasing and editing it, etc. Just-a-can-of-beans (talk) 15:12, 22 January 2024 (UTC)
This might be me, but I am not seeing much time/effort saving between "I choose a topic, read a bunch of papers, synthesise and write" vs. "I choose a topic, machine writes it, I read the papers, check what's written reflects the papers, possibly have to rewrite much of what the machine wrote". Red Fiona (talk) 18:20, 23 January 2024 (UTC)
I think the main practical issue is that it's easier to fact-check text while you are writing it yourself, than to go through every sentence of text someone else wrote. Jo-Jo Eumerus (talk) 10:40, 24 January 2024 (UTC)


Here's the short version of my previous post. For the example described in the OP, the article was fully written and sourced by a human. So the example is not a reason for Wikipedia to be blessing ChatGPT.

For example, if I get my inspiration for an article from a Ouija board, and then I write a great article derived from and sourced to wp:reliable sources, that is not a reason for Wikipedia to specifically endorse use of Ouija boards. North8000 (talk) 23:25, 19 January 2024 (UTC)

OP's sandbox article was not "fully written" by a human -- not a word of the prose was written or altered by a human -- that's OP's point! Would a single editor object to replacing the stub with that sandbox page, after I formatted it and verified+cited all the material? No change in prose -- just verification and citations. We are long past talking about getting "inspiration" from AI as if it's 1988. SamuelRiv (talk) 23:32, 19 January 2024 (UTC)
I'm with you in spirit, but I think that you are interpreting me in reverse. For the example, a human editor either OK'd or wrote every word and sourced it. My point is that we should not let this probably-OK example be a Trojan horse to get Wikipedia to partially bless use of ChatGPT. Sincerely, North8000 (talk) 15:33, 22 January 2024 (UTC)

References

  1. ^ de Wynter, Adrian; Wang, Xun; Sokolov, Alex; Gu, Qilong; Chen, Si-Qing (September 2023). "An evaluation on large language model outputs: Discourse and memorization". Natural Language Processing Journal. 4. doi:10.1016/j.nlp.2023.100024.
  2. ^ Hartmann, Valentin; Suri, Anshuman; Bindschaedler, Vincent; Evans, David; Tople, Shruti; West, Robert (2023-10-24). "SoK: Memorization in General-Purpose Large Language Models". arXiv:2310.18362 [cs.CL].

Mickey Mouse (film series) and adding shorts

What do people think about adding the public domain Mickey Mouse shorts directly to the Mickey Mouse (film series) page? I think it would make them feel more accessible since you don't have to go to the direct page for each short. And it makes the public domain status feel more concrete to use them in a wider context. Generally this is an idea I have for all short film series, but I wanted to start smaller. SDudley (talk) 01:11, 21 January 2024 (UTC)

Bump XfD heading sizes

I currently have a draft to bump XfD heading sizes for each page to level-2 and for each day's log to level-1 at User:Aaron Liu/sandbox#XfD heading size. Thoughts? Aaron Liu (talk) 02:15, 21 January 2024 (UTC)

Are you asking about the format or substance? InfiniteNexus (talk) 04:37, 21 January 2024 (UTC)
The format. I don’t see how this could affect the substance. Aaron Liu (talk) 14:07, 21 January 2024 (UTC)

Make the talk page "Add a topic" form clearer

Currently if a user clicks "Add a topic" on an article talk page, they are prompted for a Subject and a Description and there's no further explanation of what's happening or what they're expected to type there.

Talk pages like Talk:ChatGPT and Talk:Speech synthesis have ended up having to be protected because so many IP visitors think that they're interacting with that software when they type there, and don't realise that they're posting a message on a Wikipedia talk page. Talk:DALL-E gets a lot but hasn't been protected yet. There are also weirder cases like Talk:Doppelgänger (perhaps it's also the name of an app?) where IPs constantly post short sentences about wanting to see their doppelgänger, sometimes entering their email address.

Can we give these cryptic Subject/Description boxes better names, and/or add a short "you are about to post a comment to Wikipedia" message somewhere? Description in particular seems a very strange word to use at all, for something that's a comment or a question. Belbury (talk) 15:47, 22 January 2024 (UTC)

I have never had a problem, but I am probably not a typical reader. Phil Bridger (talk) 18:39, 22 January 2024 (UTC)
Yes, I'm thinking more about new users here. As well as the IP problems above there will also be cases where the first talk page a new user visits happens to be blank, and they're left to guess what the Subject/Description interface is actually asking of them.
Replying to you here, the message box says Reply to Phil Bridger in grey text before I start typing. I'm wondering if we just forgot to set a meaningful box message for new comments (the new interface only went live in 2022). Belbury (talk) 19:52, 22 January 2024 (UTC)
"Description" could be replaced with "Message" or "Type your message here". QuietCicada - Talk 18:07, 23 January 2024 (UTC)
"Your message" (or similar variations as suggested by @QuietCicada) would be much clearer than "Description". (Generally, I think of "description" as metadata.) Along the same lines, "Title" is clearer than "Subject". Schazjmd (talk) 18:27, 23 January 2024 (UTC)
I was thinking "Suggest an improvement to the article" (for talkspace) or "Suggest an improvement to Wikipedia" (everywhere else), but these are more universal (neither would make sense in, say, a user talk page.) 🌺 Cremastra (talk) 21:04, 23 January 2024 (UTC)
Are you sure that this has anything to do with why people post messages on those pages? You see a problem that annoys you and you want a solution. You THINK you have found a cause that is associated, but there is no proof of that association; you are only guessing. But some people are just really young/lost/inexperienced/dumb etc. when it comes to interacting with the Internet. I know someone working at a helpdesk and they literally keep a list of phone numbers of OTHER help desks, because people will happily call their bank if their internet is down. No amount of endlessly stacked messaging or guardrails of any sort is going to protect some people from making mistakes like these. I just see a wall of meaningless text to ignore and archive. That page doesn't have to be clean. Nor are we required to answer each of those people dumb enough to ask ChatGPT a question there. —TheDJ (talkcontribs) 09:56, 24 January 2024 (UTC)
Yes, I am guessing at a connection; I saw a recurring behaviour and was considering what upstream factors might feed into it. Even if there turns out to be no connection, replacing the Subject / Description prompt with something clearer seems like it would still be a useful change to Wikipedia's interface. Belbury (talk) 14:07, 25 January 2024 (UTC)
@Trizek (WMF) will probably want to talk to the Editing team about this.
I've also wondered whether we're getting more reverted comments. For example, four misplaced comments were posted to WT:V last week. WhatamIdoing (talk) 18:21, 26 January 2024 (UTC)

Looking for a common way to indicate no last name for sorting

There are at least two situations where a person's name shouldn't be changed to "Last Name First" for the purposes of sorting: DEFAULTSORT for categories and listas for biographies. I'd like a common way to indicate that a person's name should be taken as a unit and not flipped with a supposed last name. (For example, a way to indicate that the drag queen Adore Delano should *always* be sorted under A, and never treated as if Delano was a last name.) This can also occur in some cases in countries where names do not include patronymics. I know there are the various templates in Category:Hatnote_templates_for_names, but I guess I'm looking for something more general. Naraht (talk) 19:05, 23 January 2024 (UTC)

Thank you for starting this discussion. I get tired of changing the DEFAULTSORT and LISTAS fields re: categories and WikiProject Biography talk page banner for drag performers, after editors (presumably not knowing the page title is a stage name) sort by a fake "last name".
I would welcome a way to flag this sorting issue to editors. Look at Willow Pill, for example. This should not be sorted as "Pill, Willow". Fixing this all the time is tiring. ---Another Believer (Talk) 19:10, 23 January 2024 (UTC)
I proposed last year explicitly noting bibliographic surname in the {{infobox person}} parent box, but was told to use DEFAULTSORT instead. The reason I think the infobox is a useful place to flag it is that it offers a few other indicators -- place/time of birth, parents, etc. -- that can be used to help indicate (by algorithm or not) that some extra consideration should be given. Also, listing bibliographic sort explicitly and visibly by editors following naming guidelines is a lot different from something like DEFAULTSORT, where doing so is seen as part of a backend cleanup with minimal concern for front-end content. SamuelRiv (talk) 19:38, 23 January 2024 (UTC)
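For reference, a minimal sketch of the two existing mechanisms this thread mentions, using the example name from the thread (the exact banner arrangement on any given talk page may differ):

```wikitext
<!-- In the article wikitext: sort the whole stage name as a unit, under "A",
     rather than flipping it to "Delano, Adore" -->
{{DEFAULTSORT:Adore Delano}}

<!-- In the WikiProject Biography banner on the talk page: the parallel parameter -->
{{WikiProject Biography|listas=Adore Delano}}
```

The point raised above is that these two settings live in different places and must be kept in sync by hand, which is why a single, more visible flag was proposed.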

Change the new article requirement to extended-confirmed

WP:ACTRIAL happened in 2011, when the extended-confirmed user right did not yet exist (that came about in 2016). The autoconfirmed right requires 10 edits and four days on Wikipedia, and for almost all editors that is not enough experience to successfully create an article. Over at the Help Desk / Teahouse, the suggestion to new users is quite often that they should get more experience with Wikipedia before trying to take on the task of creating a new article.

Idea: Change the autoconfirmed requirement to extended-confirmed (500 edits, 30 days), so that users will have more experience with Wikipedia before they can create a new article directly.

Thoughts? RudolfRed (talk) 00:31, 25 January 2024 (UTC)

  • I see where this is coming from, but I feel like that's too much for good-faith editors who might benefit from initial failure, and the non-good-faith editors will just game the system to reach that level (which is done already). The purpose of autoconfirmed is not to weed out inexperienced editors, but simply to confirm a person is operating the account and grant them additional access (that described at WP:AUTOCONFIRM). It would seem odd to let autoconfirmed do the stuff listed there except create articles. 331dot (talk) 00:40, 25 January 2024 (UTC)
Considering that many new users come in wanting to make new articles right off the bat, extended confirmed might come off as bitey. A new user right of 30 edits and 7 days might be good. QuietCicada chirp 01:02, 25 January 2024 (UTC)
I think that would be an improvement. Doug Weller talk 12:01, 25 January 2024 (UTC)
500 edits can seem a daunting amount of work, especially for those very welcome editors who preview and think before pressing Publish and thereby do more work with a lower edit count than others. It might also encourage article hijacking, which is harder to patrol than correctly created new articles. Would this restriction apply to the creation of all mainspace pages? I think I could have managed a useful redirect, and perhaps even a simple dab, before clocking up 500 edits. Certes (talk) 11:56, 25 January 2024 (UTC)
Even unregistered editors are creating good pages where ACTRIAL fails to obstruct their efforts because the page title fortuitously exists. An example from this morning: new dab 05. Certes (talk) 12:08, 25 January 2024 (UTC)
  • 10 to 500 edits is a massive jump and I think far too high a bar. EC seems trivial to those of us who edit frequently, but if you're someone who just focuses on content contributions in a limited area (writing new articles, for example!), it could take years to get there. More fundamentally, I don't think the idea behind ACTRIAL/ACPERM was that you need a certain level of experience to write articles. It was that before, it was trivially easy for bad-faith editors to create a throwaway account for spam, vandalism, etc. The idea that writing articles is difficult or beyond the capabilities of new editors has become a bit of a meme lately, but I've never seen any actual evidence for it, and it seems belied by the number of perfectly fine articles created by new editors. – Joe (talk) 14:31, 25 January 2024 (UTC)
    • Done much new editor training, Joe? In my experience new editors fall roughly into two groups: those who find it all too easy to start new articles, and those who find it technically & psychologically rather difficult - if they continue editing these are often the stronger ones. I don't know what evidence there could be for new editors not starting new articles. Johnbod (talk) 15:07, 25 January 2024 (UTC)
      No, but I've reviewed several thousand new articles. The theory, as I understand it, is not that new editors don't create articles, but that they do and do so badly – hence the need to be discouraged and/or restricted from trying. The evidence I'm looking for and not finding is lots of bad articles by new editors (just like, before ACTRIAL, it was very easy to find examples of bad articles by non-autoconfirmed accounts). – Joe (talk) 21:17, 25 January 2024 (UTC)
  • Support an increase to 100 edits and 7 days. If the new article is "urgent" someone else will probably start it, or they can ask another editor to do it for them. Johnbod (talk) 15:07, 25 January 2024 (UTC)
  • I think that is far too excessive. If we could ever get the new article patrol log down (it is currently at 8259 articles and 18460 redirects over 1254 days) then maybe I'd be supportive of changing it so that unpatrolled new articles are soft-published (only visible to logged-in editors). — xaosflux Talk 15:20, 25 January 2024 (UTC)
    One way to get the backlog down would be to autopatrol experienced editors – say 10 years, 100k edits, no blocks – even if they aren't article writers, so that their dabs and redirects don't clog the queue. Certes (talk) 20:52, 25 January 2024 (UTC)
The NPP backlog problem is due to difficult-to-NPP articles, not the sheer quantity. A difficult-to-NPP article takes 50-100 times longer than a clear-pass article. An experienced NPP-er could do 50-100 clear-pass articles in the time it takes to do one difficult one. North8000 (talk) 21:38, 25 January 2024 (UTC)
As it turns out, there are only 841 editors with more than 100,000 edits, and all but 271 already have autopatrolled. So it'd certainly be feasible to go through the remainder and check their eligibility. Not sure how much of a difference it would make to the queue, though. – Joe (talk) 21:41, 25 January 2024 (UTC)
I don't really understand the edit count calculus if you don't run cleanup bots. 100k edits in 10 years means ~30 edits a day. It usually takes me 10--30 minutes to do each articlespace edit. I don't get where this time is supposed to come from. SamuelRiv (talk) 02:00, 26 January 2024 (UTC)
Nobody ought to be allowed to edit Wikipedia at all til they have made at least 1000 edits on Wikipedia. That'd show them. Hyperbolick (talk) 02:40, 26 January 2024 (UTC)
  • Redirects have their own autopatrol with a lower bar at Wikipedia:New pages patrol/Redirect autopatrol list. CMD (talk) 03:40, 26 January 2024 (UTC)
  • What issue is this seeking to fix? NPP has been working pretty efficiently recently, so I'm not sure bumping more stuff to AfC (which by nature gets more inundated) is a good idea. Also, I created several articles directly in mainspace pre-ECP, so it would be hypocritical of me to support a raise all the way up to ECP. Curbon7 (talk) 22:04, 25 January 2024 (UTC)
  • Strong Support It's easier to create an article on WP than it is to delete one, and new articles are one of the areas most ripe for abuse. We already recognize that fact by requiring new articles be reviewed by NPP, except in cases of Autopatrolled editors who have to meet standards in excess of Extended Confirmed (not just meeting the mathematical criteria of EC, but also undergoing a manual review of their edit history). To not, therefore, require new article creators to be, at a minimum, Extended Confirmed is policy incoherence. Chetsford (talk) 01:21, 26 January 2024 (UTC)
    I guess whether that's "incoherent" depends on whether you believe that Wikipedia is best served – as the Wikipedia:Editing policy claims – when we have the most accepted knowledge, or if it's best served by making it as difficult as possible for new people to contribute separate articles on what they believe to be worthy subjects.
    BTW, the suggested level means that only the top 0.25% of registered accounts will be able to start an article directly in the mainspace. Have you considered what it means for 3,999 out of 4,000 registered accounts to be unable to do this? WhatamIdoing (talk) 18:41, 26 January 2024 (UTC)
@WhatamIdoing: Thank you for pointing this out. I just checked Special:Statistics and now see that there are only 68K extended confirmed editors, which is about half of the active editor count and a very small fraction of all editors. I had thought that the EC user count would be much higher, so these stats surprised me. RudolfRed (talk) 23:12, 26 January 2024 (UTC)
  • Oppose per Joe Roe; this suggestion seems to be pulled out of nowhere and has the potential to negatively impact the project. There needs to be some actual justification of a) there being a problem, b) how this would improve it, and c) what the potential negative impacts would be. Seddon talk 20:59, 26 January 2024 (UTC)
  • Oppose - too high a bar per Joe Roe, and per 331dot it won't have the desired effect. As an anecdote: I also edit the social GPS navigation app Waze, where you have to make a certain number of edits to get to the next level and unlock more functionality and road access. But my small province doesn't build a lot of roads, and Waze keeps locking new features to the highest edit level which you can only get to by applying. Even though I'm an "area manager" for my province I've been stuck at their level 2 (there are 5 levels) for nearly 10 years now, as the threshold for level 3 has gone from 5k edits when I started, to 10k edits a few years later, then to 15k, and just as I got close to that number they pushed it to 25k. Basically it means that except for like drawing a new parking lot, I have to ask someone else to do nearly everything, and every time I get close to being able to do some of those things myself, they raise the bar. It's incredibly frustrating, and I frequently wonder why I bother at all. I do not want any new Wikipedia editor to have that experience: new editors gain experience by editing, not by jumping through pointless hoops. As another anecdote: my account was 5 years old before I ever created a new article, and I was still under 500 edits at that time. If some automated message told me I wasn't trusted enough to write my own article after five years, I would have just bailed and never come back. And it still won't keep away dedicated spammers, we all know it won't. Ivanvector (Talk/Edits) 21:27, 26 January 2024 (UTC)
Thank you, everyone, for the replies. I understand the objections and why this is not a good idea. Appreciate everyone sharing their perspectives. RudolfRed (talk) 21:46, 26 January 2024 (UTC)
You drew attention to longstanding logistical problems in AfC that have not been resolved, and suggested a possible mitigation. Just because the idea seems to be a non-starter doesn't mean the identified issues go away, so better you said something rather than nothing. SamuelRiv (talk) 22:52, 26 January 2024 (UTC)