T17161 A generalized language conversion engine


Resurrecting this task, slightly.

Above when I wrote:

Please see the worked example at the end of comment 6 for how variant conversion can be accomplished without information loss.

I was referring to:

(12:20:59 PM) cscott: i think the idea is that, if i edit in en-pig and type 'appleway' it should get saved as appleway and probably a default translation into en-us should be made? (ie, in the latin/arabic pairs, assume lowercase). There should be a specific UX affordance in VE to specify both sides of the variant, which serializes into -{en-pig:appleway,en-us:apple,en-gb:apple}-.
(12:24:45 PM) cscott: i guess when you edit text which was originally in en-us, it needs to be converted to -{en-us:apple,en-pig:appleway}- by the language converter so that information isn't lost when the edited en-pig text is saved back.

To be more precise, *just before you edit the variant B text* in variant A, you do a reversible transformation from variant B to variant A, which entails adding markup like the apple/appleway example above to ensure that reconverting from A back to B yields *exactly* the original variant B text.

That is, assume the original text, in the en-us variant, was:

John was here. I ate an apple.

When I edit in en-pig the source text is first converted to:

<span id=X>Ohnjay -{en-us:was,en-pig:asway}- erehay.</span>
<span id=Y>-{en-us:I,en-pig:Iway}- -{en-us:ate,en-pig:ateway}- -{en-us:an,en-pig:anway}- -{en-us:apple,en-pig:appleway}-.</span>

And displayed in VE as:

Ohnjay asway erehay. Iway ateway anway appleway.

I'm using <span> tags to represent selser boundaries for the purposes of illustration; there would not necessarily be actual <span> tags present in the DOM. (The idea would be to use a language-dependent sentence segmentation algorithm; one implementation could indeed use synthetic <span> tags to indicate breaks.) Similarly, I'm using LanguageConverter syntax, but you can mentally substitute an appropriate DOM representation if you prefer. Note that "extra" information only needs to be stored for non-reversible constructs -- in Pig Latin, those are just the words ending in "-way".
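
To make that concrete, here is a minimal Python sketch of the reversible conversion step, using the simplified one-consonant Pig Latin from the example above (the function names are mine, not any existing MediaWiki API). The key observation is that any converted word ending in "-way" is ambiguous -- it could have come from the vowel rule or from a word-initial "w" -- so exactly those words get wrapped in markup:

```
import re

VOWELS = "aeiou"

def to_pig(word):
    """en-us -> en-pig: append 'way' to vowel-initial words, otherwise
    move the single leading consonant to the end and append 'ay'."""
    cap = word[0].isupper()
    w = word.lower()
    out = w + "way" if w[0] in VOWELS else w[1:] + w[0] + "ay"
    return out.capitalize() if cap else out

def convert_for_editing(text):
    """Reversible en-us -> en-pig conversion: ambiguous words keep both
    variants in -{...}- markup so the round trip is lossless."""
    def repl(m):
        word, pig = m.group(0), to_pig(m.group(0))
        if pig.lower().endswith("way"):   # vowel rule vs. 'w' onset: ambiguous
            return "-{en-us:%s,en-pig:%s}-" % (word, pig)
        return pig                        # unambiguous, round-trips as-is
    return re.sub(r"[A-Za-z]+", repl, text)

print(convert_for_editing("John was here. I ate an apple."))
# Ohnjay -{en-us:was,en-pig:asway}- erehay. -{en-us:I,en-pig:Iway}-
#   -{en-us:ate,en-pig:ateway}- -{en-us:an,en-pig:anway}- -{en-us:apple,en-pig:appleway}-.
```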

If I then changed "appleway" to "orangeway", VE would have an internal document like this:

<span id=X>Ohnjay -{en-us:was,en-pig:asway}- erehay.</span>
<span id=Y>-{en-us:I,en-pig:Iway}- -{en-us:ate,en-pig:ateway}- -{en-us:an,en-pig:anway}- orangeway.</span>

And when this was serialized to wikitext using the fine-grained selser we'd have the following en-us wikitext, which preserves the original en-us text for span X:

John was here. I ate an worange.

Note that this reads correctly when converted to en-pig, but we've introduced an error in the en-us variant: it should be "orange" not "worange". This is part of the inherent tradeoff of LanguageConverter, and occurs frequently in zhwiki during edits. The community prefers errors such as these to be immediately visible so that an en-us speaker can see the problem and fix it quickly. This is different from the delayed model of editing supported by Content Translation.
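
The "worange" artifact falls directly out of the naive reverse conversion applied to newly typed text: with no markup attached, the converter cannot tell whether a "-way" ending came from the vowel rule or from a word-initial "w". A sketch of that reverse step, continuing the simplified Pig Latin above (again a hypothetical helper; it always takes the consonant reading, as the example does):

```
def from_pig(word):
    """Naive en-pig -> en-us for unmarked text.  For '-way' words this
    must guess, and here it always moves the final consonant back to the
    front, so 'orangeway' comes back as 'worange', not 'orange'."""
    cap = word[0].isupper()
    w = word.lower()
    if not w.endswith("ay"):
        return word
    stem = w[:-2]                 # drop 'ay'
    out = stem[-1] + stem[:-1]    # move the trailing consonant back to the front
    return out.capitalize() if cap else out

print(from_pig("orangeway"))  # -> 'worange': the error an en-us reader then fixes
print(from_pig("erehay"))     # -> 'here': unambiguous words round-trip fine
```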

VE should, of course, provide good visibility of both variants during editing (client-side conversion?) and an explicit UX affordance to edit the "nondefault variant" so that a user fluent in both en-us and en-pig could author -{en-us:orange,en-pig:orangeway}- directly and avoid the variant conversion error.

(One wrinkle: currently LanguageConverter uses no explicit marking of variant in the article text. Stuff which "looks like" variant A is converted to variant B and vice-versa. Articles contain multiple variants interleaved. This seems to mostly work fine in practice, especially for script conversions where character set can be used to identify the variant being used. We probably want to separate the variant sniffing out as an orthogonal issue and introduce some synthetic elements for explicit variant markup -- maybe just <span lang=en-us> -- so that it is clear during html2wt which variant we expect to generate for every region of the article.)
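
For script pairs, the variant sniffing itself can be as simple as a character-range heuristic, e.g. (illustrative only -- the variant codes are placeholders, and a real implementation would run per text region rather than per article, precisely because variants are interleaved):

```
def sniff_variant(text, latin_code="xx-latn", cyrl_code="xx-cyrl"):
    """Guess the variant of an unmarked text run from its dominant script.
    Ties default to Latin; non-alphabetic characters are ignored."""
    cyrillic = sum("\u0400" <= ch <= "\u04FF" for ch in text)
    latin = sum("a" <= ch.lower() <= "z" for ch in text)
    return cyrl_code if cyrillic > latin else latin_code
```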

ToC:

  1. British <> American English and Portuguese variant conversion
  2. Options to enable/disable the system in various ways, plus user settings for custom rules and separate rule sets
  3. Some words on the language converter in editing mode
  4. JavaScript conversion tool
  5. Sentence-based conversion
  6. Classical Chinese Kanbun conversion
  7. Multiple parallel conversion schemes

  1. Would it be a good idea to ask for an implementation of the language converter between British English and American English? Benefits: editors would no longer need to be forced into a fixed variant in any given article, since the language converter would handle it, and people could read Wikipedia in the spelling they are accustomed to. It would probably also be easier for English-based developers to work on the language converter, and would attract interest from English-speaking developers to the module.
    1. And what is currently blocking T28121 at this stage? Someone mentioned that Portuguese, being a space-delimited language whose conversion would require more than word-to-word substitution, might make the language converter difficult to implement on the Portuguese Wikipedia, and T17161 is listed as a subtask of it; but an English <> Pig Latin conversion is already available, and that one is also word-based?
  2. Currently the language converter is enabled or disabled for an entire wiki, but that does not suit every need. For instance, as shown in the recent request for a Cyrillic converter on the Romanian Wikipedia (T169453), a small number of users would be interested in reading the wiki in the Cyrillic alphabet, but most Romanian users reject the idea. The situation is not unique to the Romanian Wikipedia: when a JavaScript-based conversion tool was introduced on the Cantonese Wikipedia, there were similar debates. Therefore, I think it would be a good idea that:
    1. An option should be available to users on each wiki that has the language converter: users should be able to choose in their personal preferences whether the Language Converter button appears.
    2. A wiki-wide setting should be made available so that a wiki's admins can decide, per community consensus, whether the tool is enabled for all users, enabled by default but switchable off per user, disabled by default but switchable on per user, or disabled entirely even when the language converter is installed (a sketch of such a setting follows after this list).
    3. Additionally, a setting that allows individual users to configure their own conversion rules would probably be a nice idea too. Currently there are only general rules per region/script/variant of a language, but where no formal orthography exists, individual users may write things slightly differently from others. A personalized user option for what to convert could therefore also be a good idea.
    4. And then, on the Chinese Wikipedia, the conversion options currently available include: unconverted, hans, hant, CN, SG, TW, HK, MO. Instead of keeping the plain character conversion options (hans/hant) separate from the regional settings, it would probably make more sense for the system to treat regional-variant conversion and script conversion as independent levels, with a checkbox on both the article reading interface and the personal preferences page letting users select which level of conversion they want. The conversion database would need to be split accordingly.
  3. Currently, the language converter does not convert text being edited in source mode (I am not sure about visual mode). I have encountered users who are literate in only one of the scripts in use and who find it difficult to edit in this situation. (I have just read a complaint about this on the Chinese Wikipedia; on the Cantonese Wikipedia there is a JavaScript tool that partially helps with it; and on the Korean Wikipedia this has been one of the arguments against a potential Hanja conversion.) The language converter should therefore probably support conversion in source mode.
    1. However, it is important that the language converter not save the converted wikicode itself. It should diff what the user edited after the conversion and apply only those parts back to the original wikicode; text the user did not edit should be unaffected by the conversion tool (see the write-back sketch after this list).
  4. Was there any prior discussion of why the Cantonese Wikipedia uses its own JavaScript tool to convert between scripts instead of the language converter?
  5. Inner Mongolia University has developed a converter between Cyrillic Mongolian and the traditional Mongolian script, the "Conversion System between Traditional Mongolian and Cyrillic Mongolian" (http://trans.mglip.com/EnglishC2T.aspx); the technique is apparently discussed in this paper: https://link.springer.com/chapter/10.1007/978-3-642-41644-6_2 . It claims that a language-model-based approach can achieve a correct conversion rate of up to 87%. However, the approach means the whole sentence must be analysed to establish each word's context before a likely conversion candidate can be selected, so a converter that works on whole sentences rather than individual words would be required (a sketch follows after this list). The situation is similar for other languages that use both a phonetic and an ideographic writing system. It would probably be a good idea for a generalized conversion system to be able to use the sentence context in which individual words occur.
  6. For Classical Chinese, a dedicated system has been developed that uses very specific rules to apply reading marks to Classical Chinese text, with effects including adding extra characters, skipping characters, and rearranging the reading order of characters within a sentence, so that the text becomes understandable to anyone who can read Old Japanese. If the rules are provided, would it be possible for the language converter to:
    1. Automatically add Japanese reading marks to Chinese text when desired, and
    2. Automatically convert Classical Chinese text into Japanese reading order?
    3. Additionally, it seems that similar systems were historically developed for other, non-Chinese and non-Japanese, readers. Is there any further information about that?
  7. Is there a way to run multiple conversion schemes at the same time? For example, the Chinese Wikipedia currently runs conversion between Traditional and Simplified Chinese characters with different regional variants. However, in various (if rare) cases, pinyin romanization or zhuyin phonetic notation is also used in article text to record something phonetically, either because the subject is only attested phonetically or because the article discusses phonetics itself. Would it be possible to additionally develop a pinyin <> zhuyin converter and deploy it on the Chinese Wikipedia alongside the Chinese character conversion system?
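
Regarding point 2.2 above, the four availability levels could be modelled roughly as follows (a sketch only; the names are invented for illustration and this is not an existing MediaWiki configuration variable):

```
from enum import Enum

class ConverterAvailability(Enum):
    """Hypothetical per-wiki policy for the conversion UI, settable by
    admins according to community consensus."""
    DISABLED = 0   # installed, but never shown to anyone
    OPT_IN = 1     # off by default; individual users may switch it on
    OPT_OUT = 2    # on by default; individual users may switch it off
    ALWAYS_ON = 3  # shown to all users

def show_converter_button(policy, user_pref):
    """user_pref is the per-user preference from point 2.1 (True/False/None)."""
    if policy is ConverterAvailability.DISABLED:
        return False
    if policy is ConverterAvailability.ALWAYS_ON:
        return True
    default = policy is ConverterAvailability.OPT_OUT
    return default if user_pref is None else user_pref
```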
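
Regarding point 3.1, "apply only those parts back" is essentially a diff-based write-back. A minimal Python sketch, assuming the converter maps the original wikicode to the displayed text line by line, and with a hypothetical unconvert() callback for the reverse direction:

```
import difflib

def write_back(original_lines, converted_lines, edited_lines, unconvert):
    """The user edited the *converted* text, but lines they left alone
    must be saved verbatim from the *original* wikicode; only lines they
    actually changed are reverse-converted."""
    out = []
    sm = difflib.SequenceMatcher(a=converted_lines, b=edited_lines)
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(original_lines[a1:a2])        # untouched: keep original
        else:
            out.extend(unconvert(l) for l in edited_lines[b1:b2])
    return out
```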
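
And regarding point 5, converting a sentence as a unit amounts to choosing, for each word, among several target candidates and letting a language model score the whole sequence. A brute-force sketch of the idea (not the cited paper's algorithm; a real system would use Viterbi-style decoding instead of enumerating every combination):

```
from itertools import product

def convert_sentence(words, candidates, bigram_score):
    """Pick, for each source word, the target candidate that makes the
    whole sentence most probable.  'candidates' maps a source word to
    its possible conversions; 'bigram_score' returns a log-probability
    for an adjacent pair of target words."""
    best_seq, best_score = None, float("-inf")
    for seq in product(*(candidates.get(w, [w]) for w in words)):
        score = sum(bigram_score(a, b) for a, b in zip(seq, seq[1:]))
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq)
```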