Unify the various deletion systems


  • Affected components:
    • The Page deletion (archiving) feature of MediaWiki core.
    • The Revision delete (RevDel) feature of MediaWiki core.
  • Engineer for initial implementation: TBD.
  • Code steward: TBD.

Motivation

We currently have two systems that provide a way to make content no longer publicly accessible:

  1. Page deletion.
  2. Revision delete.

The "Revision delete" system seems to scale fairly well currently. It has a natural way to limit or divide its internal database interactions. If it were to show scale problems, we would have a clear path for how to make it scale further.

The "Page delete" system, on the other hand, has severe limitations. Even if we ignore the edge case of pages with 5000+ revisions, the underlying concept is still problematic. Database operations that move rows between tables, even for smaller pages, are something DBAs would prefer never happen at any scale, and should be migrated away from as soon as possible.

The objective is to unify these two systems and end up with something that is as good as the best of both.

Issues:

Requirements
  • Administrators must still be able to delete entire pages in a way that is as easy as "Page deletion" is today.
  • Administrators must still be able to selectively hide revisions in a way that is as easy as "Revision deletion" offers today.
  • The technical implementation of that action must not move rows between tables.
  • The viewing of "Page history" and "User contributions" (and related APIs) must not display revisions of deleted pages (by default), the same as today.

Exploration

Status quo: Page deletion

This is MediaWiki's original deletion system. It is exposed through the interface as "Delete page" (action=delete) and "Restore page" (Special:Undelete).

Database process:
Moves a page and its revisions to the "archive" database table.
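
To make "moving rows between tables" concrete, here is a minimal sketch using Python's sqlite3 module. The schema is a heavily simplified assumption (the real page, revision and archive tables have many more columns, and MediaWiki core is written in PHP), and the function names are hypothetical.

```python
import sqlite3


def archive_page(conn: sqlite3.Connection, page_id: int) -> int:
    """Move all revisions of a page into the archive table; return rows moved."""
    with conn:  # one transaction: copy the rows, then delete the originals
        conn.execute(
            "INSERT INTO archive (ar_rev_id, ar_page_id, ar_text) "
            "SELECT rev_id, rev_page, rev_text FROM revision WHERE rev_page = ?",
            (page_id,),
        )
        cur = conn.execute("DELETE FROM revision WHERE rev_page = ?", (page_id,))
        conn.execute("DELETE FROM page WHERE page_id = ?", (page_id,))
    return cur.rowcount


def demo_archive() -> tuple[int, int, int]:
    """Build a toy wiki with one page and three revisions, then delete the page."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_text TEXT);
        CREATE TABLE archive (ar_rev_id INTEGER PRIMARY KEY, ar_page_id INTEGER, ar_text TEXT);
        """
    )
    conn.execute("INSERT INTO page VALUES (1, 'Example')")
    conn.executemany(
        "INSERT INTO revision VALUES (?, 1, ?)",
        [(i, f"text {i}") for i in range(1, 4)],
    )
    moved = archive_page(conn, 1)
    remaining = conn.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
    return moved, remaining, archived
```

Every revision is written once to the target table and then deleted from the source table, all within a single transaction, so the cost grows with the size of the page history.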

Visibility:
Revisions from deleted (or "archived") pages are not shown in page history or user contributions. Administrators may view them via Special:Undelete/<title> or Special:DeletedContributions/<user>.

Limitations:
The database process for page deletion is inefficient. This cannot be improved, because the problem is not how we do it but what we do: moving rows between tables, which is considered bad practice for database operations. To reduce its negative impact on database stability, replication lag, and performance, "Page deletion" can be limited via the $wgDeleteRevisionsLimit configuration. When limited, only users with the bigdelete right may access the feature on pages with more than this number of revisions.
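
The limit check described above can be sketched as follows. This is an illustration in Python, not the actual MediaWiki code; the function name is hypothetical, and treating a limit of 0 as "no limit" is an assumption of this sketch.

```python
def may_delete_page(
    revision_count: int,
    user_rights: set[str],
    delete_revisions_limit: int = 0,
) -> bool:
    """Return True if the user may use "Page deletion" on a page.

    delete_revisions_limit stands in for $wgDeleteRevisionsLimit.
    Assumption: 0 means the limit is disabled.
    """
    if "delete" not in user_rights:
        return False
    is_big = delete_revisions_limit > 0 and revision_count > delete_revisions_limit
    if is_big:
        # Pages above the limit additionally require the bigdelete right.
        return "bigdelete" in user_rights
    return True
```

With a limit of 5,000, an administrator holding only the delete right can delete a page with exactly 5,000 revisions but not one with 5,001.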

On Wikimedia wikis, the limit is set at 5,000 revisions, and the right is mostly reserved for Stewards and Developers. When used with caution, these users are sometimes able to perform the deletion through a simple request procedure. However, even with this user right, the underlying process is highly inefficient and can have a lasting impact on database performance in the minutes or hours that follow. For that reason, all database transactions on Wikimedia wikis are subject to additional limits that abort them when this is about to happen.

As a result, pages with far more than 5,000 revisions cannot be deleted through this process. The only way to delete them without disrupting database performance would be to batch the deletion. However, it is unknown whether this can be done safely, given the possible database failure and rollback scenarios it would have to account for.

See also:

Status quo: Revision delete

This is a newer mechanism, introduced in 2009. It is exposed on the "View history" and "User contributions" views as "Change visibility of selected revisions", and works by first ticking the check boxes of the relevant revisions.

Database process:
Changes the numerical value in the rev_deleted field for the relevant revisions in the database. This can be done in batches.
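
A minimal sketch of such a batched in-place update, again using sqlite3 with a simplified, assumed schema. The column is named rev_deleted here, following MediaWiki's schema; the helper names and the batch size of 500 are hypothetical.

```python
import sqlite3
from typing import Iterator


def chunks(ids: list[int], size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def set_visibility_bits(
    conn: sqlite3.Connection, rev_ids: list[int], bits: int, batch_size: int = 500
) -> int:
    """OR `bits` into rev_deleted, one short transaction per batch."""
    batches = 0
    for batch in chunks(rev_ids, batch_size):
        placeholders = ",".join("?" * len(batch))
        with conn:  # commit after each batch to keep transactions small
            conn.execute(
                "UPDATE revision SET rev_deleted = rev_deleted | ? "
                f"WHERE rev_id IN ({placeholders})",
                (bits, *batch),
            )
        batches += 1
    return batches


def demo_batches() -> tuple[int, int]:
    """Hide the text of 1200 revisions and report (batches, rows hidden)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        "rev_id INTEGER PRIMARY KEY, rev_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO revision (rev_id) VALUES (?)", [(i,) for i in range(1, 1201)]
    )
    conn.commit()
    batches = set_visibility_bits(conn, list(range(1, 1201)), bits=1)
    hidden = conn.execute(
        "SELECT COUNT(*) FROM revision WHERE (rev_deleted & 1) != 0"
    ).fetchone()[0]
    return batches, hidden
```

No row changes table, and each batch commits on its own, so a failure partway through leaves a consistent state in which some revisions are hidden and the rest can be retried.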

Visibility:
Revisions that have been "deleted" (or "hidden") still have a placeholder row shown in the interface on "Page history" and "User contributions".

The "Revision delete" feature allows admins to decide which aspect(s) of a revision to hide, and from whom. In particular, it can separately control the visibility of the textual content, the edit summary, and the user's name/IP. It can hide these either from non-admins only, or from everyone (suppression, aka "oversight").
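
The per-aspect visibility described above maps naturally onto a bit field. The sketch below is illustrative Python; the bit values are meant to mirror the DELETED_* constants in MediaWiki core (1, 2, 4, 8), but the class, the function, and the simplified two-level permission model are assumptions.

```python
from enum import IntFlag


class RevDeleted(IntFlag):
    TEXT = 1        # hide the revision content
    COMMENT = 2     # hide the edit summary
    USER = 4        # hide the user's name/IP
    RESTRICTED = 8  # hidden aspects are also hidden from admins (suppression)


def can_view(
    field: RevDeleted, rev_deleted: int, is_admin: bool, is_suppressor: bool
) -> bool:
    """Decide whether a viewer may see one aspect of a revision."""
    flags = RevDeleted(rev_deleted)
    if field not in flags:
        return True  # this aspect is not hidden at all
    if RevDeleted.RESTRICTED in flags:
        return is_suppressor  # suppressed: admins are locked out too
    return is_admin  # ordinary RevDel: admins may still look
```

For example, a value of 1 hides only the text from non-admins, while 9 (TEXT plus RESTRICTED) hides the text from admins as well and leaves the edit summary and user name visible to everyone.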

Limitations:
I couldn't find any limitation in the code (which is concerning), but the interfaces (History page, Contributions page) do limit how many revisions they offer at once. In any event, the general transaction limits still apply. Regardless of whether this needs a limit, it could be batched internally if needed (either in-request or using the JobQueue). As a last fallback, users have the option to "batch" manually as well (e.g. increase history to show 500 rows at once, and shift-select them as one chunk), which could work in extreme cases when stewards/developers need to intervene.

See also https://www.mediawiki.org/wiki/Help:RevisionDelete.

Proposal

Nothing specific yet, but I (@Krinkle) and others think it is worth exploring whether we can re-implement the logic behind "Page deletion" using the same code and database logic that is used by "Revision delete". This would involve the following:

  • Add a bit-field value for revision.rev_deleted to represent "archived".
  • Update page/user revision views (Page history, User contributions) to make sure revisions with this flag are not shown by default.
  • Add a way to see them (e.g. by re-using Special:DeletedContributions, or through a switch on Special:Contribs itself; same for history).
  • TODO: Decide what to do with the page entity itself (metadata). E.g. a page_deleted flag (possibly including a state for "deletion in progress", to be batch-friendly).
  • TODO: Decide how/if to migrate archive into revision.rev_deleted=archived.
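
As a rough illustration of the first three points, the sketch below models an "archived" bit in pure Python. The value 16, the Revision class, and all function names are hypothetical; nothing here exists in MediaWiki today.

```python
from dataclasses import dataclass, replace

ARCHIVED = 16  # hypothetical new bit, chosen to avoid the existing RevDel bits 1-8


@dataclass(frozen=True)
class Revision:
    rev_id: int
    rev_page: int
    rev_deleted: int = 0


def delete_page(revisions: list[Revision], page_id: int) -> list[Revision]:
    """Set the ARCHIVED bit on every revision of the page; no rows move."""
    return [
        replace(r, rev_deleted=r.rev_deleted | ARCHIVED) if r.rev_page == page_id else r
        for r in revisions
    ]


def restore_page(revisions: list[Revision], page_id: int) -> list[Revision]:
    """Clear the ARCHIVED bit, leaving any existing RevDel bits untouched."""
    return [
        replace(r, rev_deleted=r.rev_deleted & ~ARCHIVED) if r.rev_page == page_id else r
        for r in revisions
    ]


def page_history(
    revisions: list[Revision], page_id: int, include_archived: bool = False
) -> list[Revision]:
    """Default view hides archived revisions; admins could opt in to see them."""
    return [
        r
        for r in revisions
        if r.rev_page == page_id
        and (include_archived or not (r.rev_deleted & ARCHIVED))
    ]
```

Because the flag is a separate bit, a revision whose text was already hidden via "Revision delete" keeps that state across a page delete and restore cycle.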

Original task description from bugzilla.wikimedia.org user FT2.wiki:

At present we have 4 means of deleting material from either the public or from administrators. Material can be:

  • Deleted from the public with traditional deletion
  • Deleted from the public (part or full) with RevisionDeleted
  • Deleted from admin view with Oversight
  • Suppressed from admin view with RevisionDeleted

This collection means that any review of editor actions or conduct, or article matters on the wiki, now faces several problems in evaluating the existence or seriousness of any issue:

  • It's incredibly easy to overlook some edits or actions in the review that should be taken into account.
  • It's more complex, and takes examination of several screens, to review a matter.
  • Each of these has different mechanisms for viewing the edits it affects; there is no consistency of links, formats, access methods, etc.
  • At a technical level, it's a lot to maintain, allows for inconsistent software behavior (or bugs fixed in one of these but not spotted in the others), and requires more developer time.

I would like to suggest that in fact, all we now need is RevisionDeleted, with the following options:

  • What to hide - revision text, edit summary, user name/IP
  • Whether admins can or can't access the hidden data
  • Whether admins or users who cannot access the hidden data, should nonetheless be able to see it exists even if they can't read it (there are cases when this is safe, and cases when it isn't).

This proposal is that RevisionDeleted is amended slightly to show the above options, and then both traditional deleted revisions and oversighted revisions are converted to RevisionDeleted entries as a background task (i.e. a script that achieves this via the job queue over time). Following this:

  • Delete and oversight both redirect to RevDel for their actions
  • Delete/undelete and oversight URLs both redirect to the appropriate lookup link for any historical URL used to view an old deleted/oversighted edit.

The issue here is not so much one of software development as of a one-off conversion task: moving old data stored in one system into another.