.. _tutorials.train_model: ===================== How to train a model ===================== This is a simple guide to training a model using editquality. This tutorial assumes that you have cloned the editquality repo, installed all dependencies and are working from the repo's root directory. For our purposes we will be training a revert model for a fictitious wiki called `foowiki`. Config ------- First, you need to make a config file. This should be a yaml file that lives in `config/wikis`. For example, with `foowiki`, you would create a file `config/wikis/foowiki.yaml`. The config file is where you define your datasets and model configurations. Example config:: name: foowiki label: Foo Wikipedia host: foo.wikipedia.org external_samples: sampled_revisions.60k_2019: quarry_url: https://quarry.wmflabs.org/run/385851/output/0/json-lines?download=true autolabeled_samples: trusted_edits: 1000 trusted_groups: - checkuser - bureaucrat - sysop - eliminator - bot labeled_samples: autolabeled_revisions.60k_2019: sampled_revisions.60k_2019 extracted_samples: autolabeled_revisions.w_cache.60k_2019: sample: autolabeled_revisions.60k_2019 features_for: - reverted models: reverted: observations: autolabeled_revisions.w_cache.60k_2019 label: reverted_for_damage pop_rate_true: 0.0405 tune: true cv_train: algorithm: GradientBoosting parameters: learning_rate: 0.01 max_depth: 3 max_features: log2 n_estimators: 700 min_samples_leaf: 7 In the above config we define some sampled data from Quarry to be autolabeled and feature extracted. Then we define our revert model and specify parameters. Create a features list ----------------------- Now we need to create a list of language features we would like to extract from the data. This should correspond to an existing feature collection module in `revscoring `_. Also, make sure that you have installed the language dictionary files needed for your model. This list should be Python file located in the `feature_lists/` directory. The contents of this file should be a couple of list variables containing different features we would like to extract. Here is an example feature list for our "foowiki" revert model:: from revscoring.languages import foo from . import enwiki, mediawiki, wikipedia, wikitext badwords = [ foo.badwords.revision.diff.match_delta_sum, foo.badwords.revision.diff.match_delta_increase, foo.badwords.revision.diff.match_delta_decrease, foo.badwords.revision.diff.match_prop_delta_sum, foo.badwords.revision.diff.match_prop_delta_increase, foo.badwords.revision.diff.match_prop_delta_decrease ] informals = [ foo.informals.revision.diff.match_delta_sum, foo.informals.revision.diff.match_delta_increase, foo.informals.revision.diff.match_delta_decrease, foo.informals.revision.diff.match_prop_delta_sum, foo.informals.revision.diff.match_prop_delta_increase, foo.informals.revision.diff.match_prop_delta_decrease ] damaging = wikipedia.page + \ wikitext.parent + wikitext.diff + mediawiki.user_rights + \ mediawiki.protected_user + mediawiki.comment + \ badwords + informals +\ enwiki.badwords + enwiki.informals reverted = damaging goodfaith = damaging Generate a Makefile -------------------- Next, you need to generate a new Makefile with your new config. We can do this using the `generate_make` utility like this:: ./utility generate_make > Makefile This will code-generate a new Makefile containing additional commands based on the new config. These commands will help us create all the necessary datasets and run all the steps to train the new model. You can see them quickly by doing `git diff Makefile`. There should be a number of dataset commands and model training and tuning commands. Training a model ----------------- Assuming everything is configured correctly, you should be able to build all the necessary datasets and train the model using a single command:: make foowiki_models Tuning a model ------------------- Once you have trained a model, you should be able to tune it and generate fitness reports using a single command:: make foowiki_tuning_reports This will create a new report in the `/tuning_reports` directory, which contains fitness statistics for the new model.