How to train a model

This is a simple guide to training a model using editquality. This tutorial assumes that you have cloned the editquality repo, installed all dependencies and are working from the repo’s root directory.
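
If you have not set that up yet, the steps usually look something like the following (the repository URL and the plain pip invocation are assumptions here; check the editquality README for the authoritative instructions):

git clone https://github.com/wikimedia/editquality.git
cd editquality
pip install -r requirements.txt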

For our purposes, we will train a revert model for a fictitious wiki called foowiki.

Config

First, you need to create a config file. This should be a YAML file that lives in config/wikis/. For example, for foowiki, you would create the file config/wikis/foowiki.yaml.

The config file is where you define your datasets and model configurations.

Example config:

name: foowiki
label: Foo Wikipedia
host: foo.wikipedia.org

external_samples:
  sampled_revisions.60k_2019:
    quarry_url: https://quarry.wmflabs.org/run/385851/output/0/json-lines?download=true

autolabeled_samples:
  trusted_edits: 1000
  trusted_groups:
    - checkuser
    - bureaucrat
    - sysop
    - eliminator
    - bot
  labeled_samples:
    autolabeled_revisions.60k_2019: sampled_revisions.60k_2019

extracted_samples:
  autolabeled_revisions.w_cache.60k_2019:
    sample: autolabeled_revisions.60k_2019
    features_for:
      - reverted

models:
  reverted:
    observations: autolabeled_revisions.w_cache.60k_2019
    label: reverted_for_damage
    pop_rate_true: 0.0405
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        learning_rate: 0.01
        max_depth: 3
        max_features: log2
        n_estimators: 700
        min_samples_leaf: 7

In the config above, we define a sample of revisions pulled from Quarry that will be autolabeled and have features extracted. Then we define our revert model and specify its training parameters.
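
Before going further, it is worth checking that the new file parses as valid YAML. A quick way to do that (assuming PyYAML is available in your environment, which it normally is as a dependency of this stack) is:

python -c 'import yaml; yaml.safe_load(open("config/wikis/foowiki.yaml")); print("ok")'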

Create a features list

Now we need to create a list of language features we would like to extract from the data. This should correspond to an existing feature collection module in revscoring. Also, make sure that you have installed the language dictionary files needed for your model.
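
For a real wiki, that usually means installing the system spelling dictionary and the NLTK data used by the language's revscoring module. The exact packages depend on the language, so the ones below are only illustrative:

sudo apt-get install aspell-en
python -m nltk.downloader stopwords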

This list should be a Python file located in the feature_lists/ directory. The file should define a few list variables containing the different features we would like to extract.

Here is an example feature list for our “foowiki” revert model:

from revscoring.languages import foo

from . import enwiki, mediawiki, wikipedia, wikitext

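# How the number of badword matches changes in the edit diff (absolute and proportional deltas)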
badwords = [
    foo.badwords.revision.diff.match_delta_sum,
    foo.badwords.revision.diff.match_delta_increase,
    foo.badwords.revision.diff.match_delta_decrease,
    foo.badwords.revision.diff.match_prop_delta_sum,
    foo.badwords.revision.diff.match_prop_delta_increase,
    foo.badwords.revision.diff.match_prop_delta_decrease
]

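# The same delta measures, but for matches of informal language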
informals = [
    foo.informals.revision.diff.match_delta_sum,
    foo.informals.revision.diff.match_delta_increase,
    foo.informals.revision.diff.match_delta_decrease,
    foo.informals.revision.diff.match_prop_delta_sum,
    foo.informals.revision.diff.match_prop_delta_increase,
    foo.informals.revision.diff.match_prop_delta_decrease
]

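# Combine generic page/wikitext/mediawiki features with the badword and informal lists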
damaging = wikipedia.page + \
    wikitext.parent + wikitext.diff + mediawiki.user_rights + \
    mediawiki.protected_user + mediawiki.comment + \
    badwords + informals + \
    enwiki.badwords + enwiki.informals

reverted = damaging
goodfaith = damaging
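
Once the file is in place, a quick import check can catch typos early. The import path below assumes the file ends up importable as a module named foowiki inside the feature_lists package; adjust it to match where the repo actually keeps these lists:

python -c 'from editquality.feature_lists import foowiki; print(len(foowiki.reverted))'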

Generate a Makefile

Next, you need to generate a new Makefile from your new config. You can do this with the generate_make utility:

./utility generate_make > Makefile

This generates a new Makefile with additional targets based on the new config. These targets build all the necessary datasets and run all the steps needed to train the new model. You can review them quickly with git diff Makefile; there should be a number of dataset targets as well as model training and tuning targets.
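
For example, to see every generated line that mentions the new wiki:

grep -n foowiki Makefile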

Training a model

Assuming everything is configured correctly, you should be able to build all the necessary datasets and train the model using a single command:

make foowiki_models
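
When the build finishes, the trained model file should appear under models/. You can inspect the fitness statistics stored in it with revscoring's model_info utility; the exact file name below is an assumption based on how the generated Makefile names models for other wikis:

revscoring model_info < models/foowiki.reverted.gradient_boosting.model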

Tuning a model

Once you have trained a model, you should be able to tune it and generate fitness reports using a single command:

make foowiki_tuning_reports

This will create a new report in the tuning_reports/ directory containing fitness statistics for the new model.
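
The report is an ordinary text file, so you can read it directly; the exact file name below is an assumption based on the naming used for other wikis:

less tuning_reports/foowiki.reverted.md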