Wikipedia:Bots/Requests for approval/PrimeBOT 13 - Wikipedia


The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Primefac (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:10, Friday, March 24, 2017 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): AWB

Source code available: AWB

Function overview: Replace magic words with templates

Links to relevant discussions (where appropriate): RFC, RFC follow up

Edit period(s): one time run initially, then maybe once a month until the magic link functionality is removed

Estimated number of pages affected: 378k (369k ISBN, 7000 PMID, 2150 RFC)

Exclusion compliant (Yes/No): yes

Already has a bot flag (Yes/No): yes

Function details:

  • On pages in the ISBN cat: find ISBN\s([0-9 -]{9}[0-9 -]{4}?[0-9 -]{3}?[0-9X])(?!-|[0-9]) replace {{ISBN|$1}}
  • On pages in the RFC cat: find RFC\s([0-9]{1,4})(?![0-9]) replace {{IETF RFC|$1}}
  • On pages in the PMID cat: find PMID\s([0-9]+) replace {{PMID|$1}}

I've tried to account for every situation, and the only potential issue I can see is that strange ISBN values (specifically, mis-typed 11- or 14-digit ISBNs) get captured, but I genuinely can't figure out how to account for every single possibility without having an unnecessarily long regex (basically accounting for every possible combination of hyphen-and-number). Primefac (talk) 14:10, 24 March 2017 (UTC)[reply]

I feel that \[?\[?ISBN\]?\]?:?\s*(([0-9]|-|X)+), or alternatively ISBN\s*(([0-9]|-|X)+), would be much simpler. Let the template flag the errors and mistypes. Headbomb {talk / contribs / physics / books} 16:57, 24 March 2017 (UTC)[reply]
Likewise for
If you're fine with that, I'm fine with that. So (based on the assumption of that last half-phrase) that will make the regex
  • ISBN\s([0-9 -]+[0-9X])
  • PMID\s([0-9]+)
  • RFC\s([0-9]+) (modified from the above to take any number following the RFC)
Note that I still need to add that extra [] in ISBN because the magic words will accept numbers with hyphens and spaces. I'm not going with the possible [[ISBN]] option because that instance wouldn't trigger the magic link (and I don't want to be making assumptions about what people are attempting to use ISBN for). Primefac (talk) 17:29, 24 March 2017 (UTC)[reply]
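For readers following the regex changes, the simplified patterns above can be tried out in a few lines of Python. This is a minimal sketch, not the bot's AWB configuration: the function name `template_magic_links` is hypothetical, AWB's `$1` becomes `\1` in Python, and the {{IETF RFC}} target is carried over from the original function details.

```python
import re

# (pattern, replacement) pairs taken from the simplified rules in the discussion.
# The RFC replacement target is assumed to be unchanged from the function details.
RULES = [
    (re.compile(r"ISBN\s([0-9 -]+[0-9X])"), r"{{ISBN|\1}}"),
    (re.compile(r"PMID\s([0-9]+)"), r"{{PMID|\1}}"),
    (re.compile(r"RFC\s([0-9]+)"), r"{{IETF RFC|\1}}"),
]


def template_magic_links(text):
    """Apply each find/replace rule in turn (hypothetical helper)."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(template_magic_links("Prentice Hall. ISBN 0-13-981176-1."))
    print(template_magic_links("since the RFC 4 years ago"))
```

The second call shows why the RFC rule is risky on its own: any number after "RFC" gets wrapped, whether or not it refers to an IETF document.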

Approved for trial (25 edits of each). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Headbomb {talk / contribs / physics / books} 18:07, 24 March 2017 (UTC)[reply]

Trial complete. - ISBN, PMID, and RFC. Primefac (talk) 21:19, 24 March 2017 (UTC)[reply]
@Primefac: These two edits have issues: [1], [2]. Headbomb {talk / contribs / physics / books} 22:00, 24 March 2017 (UTC)[reply]
Also it would be nice if the bot made use of the CS1 parameters when possible [3]. A tricky thing is that |id= might have more than one identifier, but when there's only one, it's much better to make use of the existing CS1 parameter. See Help:Citation Style 1#Identifiers for which parameters exist.
I'm also very inclined to enlarge the task to linking all bare identifiers (again, see the CS1 help link above), with proper formatting cleanup when possible. This would need a wider discussion, but I don't suspect that this would be very controversial, since having links is better than having none. People will be very puzzled that the doi wasn't linked, or that the pmc wasn't linked. Headbomb {talk / contribs / physics / books} 22:08, 24 March 2017 (UTC)[reply]
I'll play around with look ahead/behind and see what can be done as far as implementing some of the fixes mentioned above. I don't mind updating other identifiers like doi and PMCID while doing this particular BRFA; I highly doubt it would be a contentious issue. Primefac (talk) 22:42, 24 March 2017 (UTC)[reply]
I have a rule in my own User:CitationCleanerBot that does find \|(\s*)id(\s*)=(\s*)\{\{(arxiv|asin|bibcode|biorxiv|citeseerx|doi|eissn|hdl|isbn|ismn|issn|jfm|jstor|lccn|mr|oclc|ol|osti|pmc|pmid|rfc|ssrn|zbl)\s*\|([^(\}\|)]*)\}\}\s*(\s*(\||\}\})) replace |$1$4$2=$3$5$6. It works pretty well last I checked. Headbomb {talk / contribs / physics / books} 22:49, 24 March 2017 (UTC)[reply]
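The rule quoted above can be exercised outside AWB as follows. This is a sketch under assumptions: the function name `promote_id_template` is hypothetical, the pattern is copied from the comment with `$n` rewritten as `\n`, and matching is left case-sensitive because the thread does not say which AWB options were set.

```python
import re

# Pattern copied from the comment above; only the replacement syntax changes.
ID_RULE = re.compile(
    r"\|(\s*)id(\s*)=(\s*)\{\{"
    r"(arxiv|asin|bibcode|biorxiv|citeseerx|doi|eissn|hdl|isbn|ismn|issn|jfm|"
    r"jstor|lccn|mr|oclc|ol|osti|pmc|pmid|rfc|ssrn|zbl)"
    r"\s*\|([^(\}\|)]*)\}\}\s*(\s*(\||\}\}))"
)


def promote_id_template(wikitext):
    """Turn |id={{doi|X}} into |doi=X when |id= holds a single template."""
    return ID_RULE.sub(r"|\1\4\2=\3\5\6", wikitext)


if __name__ == "__main__":
    print(promote_id_template("{{cite journal |title=Foo |id={{doi|10.1000/xyz}} |year=2000}}"))
    # Two identifiers in |id= do not match the rule, so the text is left alone.
    print(promote_id_template("{{cite journal |id={{doi|10.1/x}} {{pmid|123}} }}"))
```

Because the pattern requires a pipe or closing braces straight after the template, an |id= holding several identifiers is skipped rather than split, which matches the caution raised earlier in the thread.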

I note the regular expression for MediaWiki's magic ISBN linking is, I believe, \bISBN(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs})++((?:97[89](?:-|(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs}))?)?(?:[0-9](?:-|(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs}))?){9}[0-9Xx])\b. Most of the complexity is different kinds of whitespace that are allowed. It also skips anything that's inside link text, inside an HTML tag (e.g. in an attribute), and inside a bare URL. If you use something else, you're liable to miss and/or include things that are/aren't currently linked. The RFC and PMID links are almost right, they're \b(?:RFC|PMID)(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs})++([0-9]+)\b. The actual code for all of it is here, although it runs after much of the wikitext has been parsed. Anomie 23:03, 24 March 2017 (UTC)[reply]
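The ISBN pattern quoted above can be approximated in Python for experimentation. This is a sketch, not MediaWiki's code: Python's `re` module has no `\p{Zs}`, so the space-separator code points are listed by hand, the possessive `++` is written as a plain `+`, and the helper name `template_isbn` is hypothetical. It also does none of the skipping of link text, HTML tags, or bare URLs described in the comment.

```python
import re

# One "space" as the quoted pattern sees it: tab, space, an encoded
# non-breaking space, or a Unicode space separator (category Zs, listed
# explicitly here as an approximation of \p{Zs}).
SP = r"(?:\t| |&#0*160;|&#[Xx]0*[Aa]0;|[\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000])"

MAGIC_ISBN = re.compile(
    r"\bISBN" + SP + r"+"                          # plain + instead of possessive ++
    r"((?:97[89](?:-|" + SP + r")?)?"              # optional 978/979 prefix
    r"(?:[0-9](?:-|" + SP + r")?){9}[0-9Xx])\b"    # nine digits, then the check digit
)


def template_isbn(text):
    """Wrap anything the approximated magic-link pattern matches in {{ISBN}}."""
    return MAGIC_ISBN.sub(r"{{ISBN|\1}}", text)


if __name__ == "__main__":
    print(template_isbn("p. 1.ISBN 9780781767774."))
    print(template_isbn("ISBN\u00a00-13-981176-1"))
    print(template_isbn("ISBN 12345"))  # too short, left unchanged
```

The pattern only accepts ten digits, optionally preceded by 978 or 979, so mistyped 11- or 14-digit values are not matched as a whole.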

Hmm, I'm not sure I can get the id= and the actual regex from mediawiki to play nice, but I'll see what I can do. Primefac (talk) 23:20, 24 March 2017 (UTC)[reply]
Apply my regex after Anomie's. That ought to work. Headbomb {talk / contribs / physics / books} 23:24, 24 March 2017 (UTC)[reply]
(edit conflict) Yup, was just amending my strike to say that. Great minds, right? Primefac (talk) 23:26, 24 March 2017 (UTC)[reply]
@Primefac: Let me know when you're ready for another trial. Headbomb {talk / contribs / physics / books} 23:30, 24 March 2017 (UTC)[reply]
Headbomb, based on the new regex I'm not 100% thrilled (it still misses when the magic link is smack dab inside the URL brackets (first replacement)), but I've managed to catch all of the other errors you've brought up (and added a few tweaks to improve accuracy further). As a note, the one "failed" test was something I threw in just to see if I could get the regex to ignore it - obviously I couldn't manage. Primefac (talk) 00:30, 25 March 2017 (UTC) I FIXED IT. Primefac (talk) 00:46, 25 March 2017 (UTC)[reply]
(edit conflict) AWB has an "ignore external links" option. Have you tried that? Headbomb {talk / contribs / physics / books} 00:54, 25 March 2017 (UTC)[reply]

Approved for extended trial (100 edits each). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please update summary to indicate that it may also do things beyond magic link deprecation. Headbomb {talk / contribs / physics / books} 00:58, 25 March 2017 (UTC)[reply]

Trial complete. - ISBN, PMID, and RFC. Primefac (talk) 01:53, 25 March 2017 (UTC)[reply]
  • ISBN edits seem all good. Reviewing others.
  1. [4] is problematic and didn't fix the magic link. I'm not quite sure where it is. In {{vcite journal}} maybe?
  2. [5] [6] is there a reason to change the ref tags?
  3. Also, I thought you'd be doing edits like [7] alongside the 'main' task, but I guess I wasn't very clear on this either. I'd be fine including those in the trial, assuming there's a notice at the RFC saying that bot could clean up similar plain instances to templated instances, and that people are invited to comment here.
  4. [8] this is a fairly unique case. The bot didn't make anything worse than it already is, though. I also note the existence of several 'genes'-related RFCs, like RFC1, RFC2, etc... It's likely that an exclusion for RFC < 100 is needed so they can be done by hand/semi-automatically.
Sorry if there's an ec, wanted to number the responses to better reply.
  1. You're right, I noticed that on quite a few that were skipped by AWB. I've updated the template, and that should fix the issue going forward.
  2. Shit, meant to modify that. Because I want to be editing inside templates, I can't check the "ignore refs, templates..." box, and throwing in a one-off exception to account for <ref name=RFC #> seemed like a quick fix. I was planning on un-modifying the change after the actual regex had gone through, but didn't implement it. I've fixed the regex.
  3. Forgot, actually. I was more focused on the regex of the original task and didn't think about the doi regex.
  4. Yeah, I noticed that in a reference for the Royal Flying Corps. There are valid RFCs that are <100. I think post-cleanup it would be easy to check through the transclusions for any value <100 and fix if necessary, because as you say, it's not doing any major harm to convert to a template at the moment.
I'll work on the ref issue. Primefac (talk) 02:29, 25 March 2017 (UTC) To avoid further ECs I'll leave this for now and get back when you're fully done. Primefac (talk) 02:36, 25 March 2017 (UTC)[reply]
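The post-cleanup check suggested in point 4 could look roughly like this. It is a sketch only: it scans a page's wikitext for converted templates instead of walking transclusions, and the helper name `low_numbered_rfcs` and the default threshold of 100 are assumptions based on the numbers mentioned above.

```python
import re

# Matches the template form produced by the RFC conversion rule.
IETF_RFC = re.compile(r"\{\{IETF RFC\|([0-9]+)\}\}")


def low_numbered_rfcs(wikitext, threshold=100):
    """Return RFC numbers below the threshold that deserve a manual look."""
    return [int(n) for n in IETF_RFC.findall(wikitext) if int(n) < threshold]


if __name__ == "__main__":
    sample = "the gene {{IETF RFC|2}} and the spec {{IETF RFC|2119}}"
    print(low_numbered_rfcs(sample))
```

A hit is only a candidate for review, since valid IETF RFCs below 100 do exist.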

@Primefac: Fully done now. Also, is there a reason why AWB genfixes shouldn't be enabled during the main run? Headbomb {talk / contribs / physics / books} 02:41, 25 March 2017 (UTC)[reply]

Honestly couldn't say. I thought at one point someone said to disable genfixes, but I just checked the previous BRFAs and there's no mention. I can certainly enable it. Primefac (talk) 02:46, 25 March 2017 (UTC)[reply]
For doi, I've got find doi:([\S]*)/([A-Za-z0-9.]*) replace {{doi|$1/$2}}, and for PMCID I've got find PMCID:PMC([0-9]+) replace {{PMC|$1}}. Primefac (talk) 02:58, 25 March 2017 (UTC)[reply]
DOIs have a lot more than alphanumeric characters. Also look for a plain PMC0123456 or a PMCID PMC0123456. Headbomb {talk / contribs / physics / books} 03:09, 25 March 2017 (UTC)[reply]
Err, since when are there magic words for DOI and PMC? Or is this BRFA now going beyond fixing the three existing magic words? Anomie 12:49, 25 March 2017 (UTC)[reply]
@Anomie: Right now, I'm of the opinion those can be done alongside the magicword deprecation, mostly because the reader will be quite puzzled that in "PMID 0123456, doi:10.123456", the PMID is linked, but the doi isn't. I'm 99% certain this could be done on its own too. I certainly can't think of any argument against making all those identifiers link to their respective databases. Headbomb {talk / contribs / physics / books} 12:59, 25 March 2017 (UTC)[reply]
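The PMC spellings mentioned in this thread ("PMCID:PMC0123456", "PMCID PMC0123456", and a plain "PMC0123456") could be covered by one pattern along these lines. This is an illustrative sketch, not a rule either bot is known to use: the lookbehind that skips values already sitting in a parameter or a URL is an assumption, as is the helper name `template_pmc`.

```python
import re

# Optional "PMCID" prefix with an optional colon, then PMC and the digits.
# The lookbehind (an assumption) skips |pmc=PMC..., URL paths, and longer words.
BARE_PMC = re.compile(r"(?<![|=/\w])(?:PMCID:?\s*)?PMC([0-9]+)")


def template_pmc(text):
    """Wrap bare PMC identifiers in {{PMC}} (hypothetical helper)."""
    return BARE_PMC.sub(r"{{PMC|\1}}", text)


if __name__ == "__main__":
    for sample in ("PMCID:PMC0123456", "PMCID PMC0123456", "see PMC0123456."):
        print(template_pmc(sample))
```

Text that is already templated is left alone, because "PMC|" is not followed by a digit.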

(edit conflict) I came up with

  1. find (?<!(\[\[|=))doi:?\s*10\.([^(\s)]+?)(\.)?(\])?<(\/)?ref> replace {{doi|10.$2}}$3$4<$5ref>, followed by
  2. find (?<!(\[\[|=))doi:?\s*10\.([^(\s)]+?)(\.|,|\])?(\s) replace {{doi|10.$2}}$3$4

There are some GIGO cases, but I've yet to see a place where this screws up anything. Errors tend to be that GIGO will result in a missed conversion rather than a wrong conversion. Headbomb {talk / contribs / physics / books} 12:53, 25 March 2017 (UTC)[reply]
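The two doi rules above translate to Python with one adjustment. Python's `re` module requires fixed-width lookbehinds, so `(?<!(\[\[|=))` is split into two lookbehinds; that removes a capture group, so `$2` to `$5` in the AWB replacements become `\1` to `\4` here. The helper name `template_bare_dois` is hypothetical, and this is a sketch for trying the rules out, not the bot's code.

```python
import re

# Skip dois that follow "[[" or "=" (already linked or already in a parameter).
GUARD = r"(?<!\[\[)(?<!=)"

# Rule 1: a doi that runs up to an opening or closing ref tag.
DOI_BEFORE_REF = re.compile(GUARD + r"doi:?\s*10\.([^(\s)]+?)(\.)?(\])?<(/)?ref>")
# Rule 2: a doi that ends at whitespace, optionally after ".", "," or "]".
DOI_BEFORE_SPACE = re.compile(GUARD + r"doi:?\s*10\.([^(\s)]+?)(\.|,|\])?(\s)")


def template_bare_dois(text):
    """Apply rule 1, then rule 2, as described in the comment above."""
    text = DOI_BEFORE_REF.sub(r"{{doi|10.\1}}\2\3<\4ref>", text)
    return DOI_BEFORE_SPACE.sub(r"{{doi|10.\1}}\2\3", text)


if __name__ == "__main__":
    print(template_bare_dois("Smith 2010. doi:10.1000/xyz123.</ref>"))
    print(template_bare_dois("See doi:10.1000/abc, and more"))
```

Trailing punctuation is captured separately from the identifier, so a sentence-ending period or comma stays outside the template.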

See also Wikipedia:Bots/Requests_for_approval/CitationCleanerBot_2. Headbomb {talk / contribs / physics / books} 13:56, 25 March 2017 (UTC)[reply]
Honestly, at this point I'm going to leave it up to BAG which additional tasks get done. I can agree that there's no point in two bots running the same general task (one for magic links, another for what-should-probably-be-magic-links), and contrary to what Magioladitis is implying below I won't be going out of my way to find them, so even if I do the extra changes there will still be plenty of pages not hit that another bot can handle. Primefac (talk) 15:00, 25 March 2017 (UTC)[reply]
@Primefac: This is just a notice that I'm proposing a similar bot for non-magic link identifiers and cleanup. I want to focus on more error-prone/less popular identifiers, so that you can focus on the less error-prone magic link ones. Then we can share regex between us as we find corner cases, so both bots get all the benefits. I don't see the necessity of waiting for bare {{zbl}} conversions to start bare ISBN/PMID/RFC conversions. But if you can cover the most common identifiers (ISSN/DOI/PMC) while focusing on ISBN/PMID/RFC magic word conversions, that will be 99% of the needed conversions with little need for further bot cleanup. I want to cover the remainder of these conversions. Headbomb {talk / contribs / physics / books} 15:14, 25 March 2017 (UTC)[reply]
That's fine; makes sense. Primefac (talk) 15:37, 25 March 2017 (UTC)[reply]

I think this large scale task is a good opportunity that the bot also does AWB's general fixes so that all secondary tasks are done in addition to the main task. Many other bots have been running general fixes while making a primary task. It's a great time. -- Magioladitis (talk) 14:11, 25 March 2017 (UTC)[reply]

Magioladitis, I genuinely have no idea what you're saying here, other than possibly being sarcastic about my age, experience, or your own experiences. Primefac (talk) 15:00, 25 March 2017 (UTC)[reply]
Likely a typo for "that the bot". Headbomb {talk / contribs / physics / books} 15:03, 25 March 2017 (UTC)[reply]
That.... makes a lot more sense. My apologies. Primefac (talk) 15:04, 25 March 2017 (UTC)[reply]
Er... not ofcourse!!! Also not "bee" but been. I type too fast. My apologies. -- Magioladitis (talk)

Based on my involvement with ISBNs I propose some additions to the current regex, for instance to also catch tabs and dashes. There is also a list of bad ISBNs found in Wikipedia:WikiProject Check Wikipedia/ISBN errors. -- Magioladitis (talk) 14:27, 25 March 2017 (UTC)[reply]

What proposed additions? My regex is the same that the magic links currently use to find the ISBNs, which I'm pretty sure does catch tabs (unless \t means something else I'm not aware of). Either way, if it's a badly formatted ISBN it shouldn't be converted to a template in the first place. I'm not going to bother with the checkwiki errors; I'm just focusing on the categories populated by the magic links. Primefac (talk) 15:00, 25 March 2017 (UTC)[reply]
OK. -- Magioladitis (talk) 15:08, 25 March 2017 (UTC)[reply]
I agree that the bot should only operate on pages in the categories. It should leave the CheckWiki pages alone, since they are by definition erroneous edge cases that should be dealt with by humans. – Jonesey95 (talk) 17:56, 25 March 2017 (UTC)[reply]

Did this task at other wiki. Some notes:

  • Remember that those categories aren't fully populated (yet). There will be more pages. After templating category members, I did a database scan for ISBN, which found relatively many more results.
  • Don't know what the situation is at enwiki, but bear in mind that VE wrapped ISBNs (and probably other magic links) in nowiki tags.
  • The MediaWiki regex that Anomie posted here for some reason isn't catching everything it should catch; for example, these three ISBNs weren't changed (using either AWB or Python):
    • Phibbs, Brendan (2007). The human heart: a basic guide to heart disease(2nd ed.). Philadelphia: Lippincott Williams & Wilkins. p. 1.ISBN 9780781767774.
    • Aina Dālmane (2004). Histoloģija. Rīga: LU Akadēmiskais apgāds. 319. lpp. ISBN 9984-770-42-7.
    • Maton, Anthea; Hopkins, Jean (1993). Human Biology and Health. Englewood Cliffs, New Jersey, USA: Prentice Hall. ISBN 0-13-981176-1.
    • Not necessarily regex/AWB/python fault, it may be my stupidity :) --Edgars2007 (talk/contribs) 03:57, 26 March 2017 (UTC)[reply]

Looking through the RFC category, it looks like the false positive rate for pages in this category is well over 1%. "RFC" is used within Wikipedia, meaning that phrases like "...since the RFC 4 years ago" should not be wrapped. RFC can also mean Rangers F.C. and Randers FC and whatever is meant in 1976 Eastern Suburbs Roosters season. Since the total population of the category is only about 2,000 pages, I recommend that AWB, rather than a bot, be used on this category, with "RFC \d+" text that does not refer to an IETF RFC being wrapped in nowiki tags until the magic words are turned off. – Jonesey95 (talk) 16:08, 26 March 2017 (UTC)[reply]

I've been working on killing these magic links for a very long time. About a year ago I looked at doing the ISBN updates. Those are easy enough to do automatically or semi-automatically. PMID as well, probably. But RFC has only a few hundred instances and many are quirky/tricky. I would recommend doing RFC magic links manually. --MZMcBride (talk) 23:42, 3 April 2017 (UTC)[reply]

Yeah, pretty much everyone agrees on that. I'll formally strike it from the request. Primefac (talk) 23:52, 3 April 2017 (UTC)[reply]
@Primefac: would you be willing to add code to handle basic conversions for doi/pmc (as a side benefit)? Headbomb {t · c · p · b} 19:32, 17 April 2017 (UTC)[reply]
I have no objections to that, provided it's as an approved amendment. Primefac (talk) 22:59, 17 April 2017 (UTC)[reply]

There were no objections in the call for comment, and those who commented agreed it was a good idea. The bot should likely restrict itself to something like (pseudo regex)

Maybe all the ISBN-fixing bots should use the same edit summary? -- Magioladitis (talk) 08:24, 7 June 2017 (UTC)[reply]

I'll be honest, I'm amazed that not a single person has suggested that yet. We're all using the same regex (I think), might as well do the same ES. Brilliant idea. Primefac (talk) 11:38, 7 June 2017 (UTC)[reply]

@Primefac: I'm going to try to get this moving this week. Where do you see the current status of this task as being? There appear to have been substantial fixes, so I think this should probably have one more trial to ensure everything looks good. First, what ever happened with the ref tag issue? If that was never fixed, please experiment with a negative lookbehind for the start of a ref tag. That should fix the issue pretty easily. ~ Rob13Talk 09:38, 13 June 2017 (UTC)[reply]

The ref thing was for RFC, so since we've dropped that I don't think it will be an issue going forward. I'm fine with another trial. Primefac (talk) 12:24, 13 June 2017 (UTC)[reply]
@Primefac: I was referring to this [9]. ~ Rob13Talk 15:17, 13 June 2017 (UTC)[reply]
Oh, right, forgot about that. Yes, I included a lookbehind for errant ref tags. Primefac (talk) 16:19, 13 June 2017 (UTC)[reply]

Approved for extended trial (400 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please perform 200 edits each. ~ Rob13Talk 16:30, 13 June 2017 (UTC)[reply]

Trial complete. ISBN, PMID. Note that I was unable to perform the sorcery done in the Magic links bot edit summary to get the MW RFC link in there, but hopefully the summary used is acceptable. Primefac (talk) 22:51, 16 June 2017 (UTC)[reply]
I looked at about 50 ISBN edits and 25 PMID edits. I found zero errors. – Jonesey95 (talk) 00:00, 18 June 2017 (UTC)[reply]

 Approved. ~ Rob13Talk 23:25, 18 June 2017 (UTC)[reply]

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.