Windows Interface for TreeTagger
Ciarán Ó Duibhín
Latest version of tagger program interface: 2016/03/30
New: a customised version of ttmodels.txt is allowed in the directory containing the shortcut to the tagging interface. If such a file exists, it will be used in preference to ttmodels.txt from the \lib directory. For this feature to work, it is necessary to blank out the "Start in" field of the shortcut (which is probably a good idea anyway). It is also permitted to place parameter files in the shortcut directory, which may be useful for models under development.
New: an error is fixed which prevented "languages" with names like "Italian (Baroni)" or "Dutch (Bioche)" from detecting relevant abbreviations files or multiword files.
Latest version of training program interface: 2016/03/30
New: a "Force utf8 input" checkbox is provided, corresponding to the -utf8 parameter for the TreeTagger training program.
This version of the Windows graphic interface for the tagger program is intended to work with utf8 encoding, as well as with latin-1 encoding (provided that the encoding of the input text matches that of the language model). The previous version of the interface, in which the built-in tokenization option works only with latin-1 encoding, remains available here. Please report any problems with the utf8 version.
The TreeTagger is a program developed by Helmut Schmid at the University of Stuttgart (now at the University of München), for part-of-speech tagging and lemmatization. Language models (known as “parameters”, file extension .par) are supplied on the TreeTagger webpage for using the program with texts in English, French, German, Italian, Spanish, Russian, Bulgarian, Dutch, Estonian, Finnish, Galician, Latin, Mongolian, Polish, Slovak and Swahili, and models for some other languages are available from sites linked to the TreeTagger webpage. For a language for which no model exists, it is necessary to hand-tag some text, and then run a training program (provided with the TreeTagger) to create the model.
A zipped Windows distribution of the TreeTagger is available for download through a link near the end of the "download" section of the TreeTagger webpage. As supplied in that distribution, the programs have to be run from the MS-DOS command line, on which the required options are specified.
What is offered here is an add-on Windows graphic interface to the tagger program — and also a similar interface to the training program — which allows the options to be selected visually, and then the TreeTagger program to be launched, without the user having to switch from Windows to MS-DOS.
The selected set of options may be saved to a file and re-loaded later, in the manner of a configuration file.
Below, a screenshot of the Windows interface to the tagger program.
Below, a screenshot of the Windows interface to the training program.
Download the Windows interface to the tagger program (updated 2016/03/30).
Download a text file required by the Windows interface to the tagger program.
Download the Windows interface to the training program (updated 2016/03/30).
2015/06/01: It has come to my attention that Trend Micro reports the first and third of these downloads as "dangerous". You can rest assured that this is not true; these files, like all others offered on my webpages, are not in any way harmful.
If you already have TreeTagger installed, skip to step 3.
If you are installing the Windows TreeTagger distribution from the beginning, I would suggest the following procedure, to prepare the TreeTagger for use from the Windows interface.
1. From the TreeTagger website, download the Windows TreeTagger distribution. Unzip it to C:\Program Files, with the "Use folder names" box ticked in WinZip's "Extract" dialog; DON'T OMIT TO TICK THIS BOX. (In Vista, right-click the zip file, choose "Extract all", and browse to C:\Program Files.) This will create and populate the following directories:
To use the TreeTagger from the graphical interface, you will not need any of the files which you will find in \cmd, and you will not need any of the .bat files in \bin. All you will need are the two .exe files in \bin, and the lists of abbreviations and multi-words in \lib.
2. From the TreeTagger website, download the parameter files for the language models you need, decompress them (e.g. using WinZip) and move them to the subdirectory C:\Program Files\TreeTagger\lib.
3. To add the graphic interface, simply place the two interface programs (wintreetagger.exe and wintraintreetagger.exe) in the \bin subdirectory, alongside the two .exe files from the TreeTagger distribution (tree-tagger.exe and train-tree-tagger.exe). Also place the text file (ttmodels.txt) in the \lib subdirectory, alongside the language model files.
Do NOT make copies of the interface programs in other directories, as such copies will be unable to find the TreeTagger components. When you want to use either of the interface programs from another directory, place a shortcut in that directory to the interface program in the \bin subdirectory. The "Start in" field in the properties of the shortcut to the tagging program should be set blank (otherwise, the facility to make a customised ttmodels.txt in the directory containing the shortcut will not work).
Now you can test the TreeTagger. Find or create a plain text file containing a small piece of running text in English (not more than a couple of hundred words for a start). Say the file is called sample.txt. In Windows, go to the directory containing sample.txt, make a shortcut to bin\wintreetagger.exe and run the shortcut.
Regarding running the training program, some limited experiments suggest the following:
• the openclasses.txt file must ALWAYS be encoded in Latin-1
• the best parameters for tagging Latin-1 text are made with both the lexicon and handtagged files encoded in Latin-1, and — though it does not seem to make much difference — without the -utf8 option
• the best parameters for tagging UTF8 text are made with both the lexicon and handtagged files encoded in UTF8, and with the -utf8 option
• the tagged text file will have the same encoding — Latin-1 or UTF8 — as the input untagged text
• the tagging of Latin-1 text is slightly more accurate than the tagging of UTF8 text, compared on the same training data
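Since the points above call for the lexicon and hand-tagged files to share one encoding, it may be necessary to re-encode a file before training. The following is a minimal Python sketch of such a conversion, not part of the interface; the function name and the file names in the usage comment are hypothetical. It should not be applied to openclasses.txt, which according to the first point must stay in Latin-1.

```python
def convert_encoding(src, dst, src_encoding="latin-1", dst_encoding="utf-8"):
    """Re-encode a text file, for example a Latin-1 lexicon to UTF-8.

    newline="" switches off newline translation, so the original
    line endings (e.g. CR LF on Windows) are written back unchanged.
    """
    with open(src, "r", encoding=src_encoding, newline="") as infile:
        text = infile.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as outfile:
        outfile.write(text)


# Hypothetical usage:
# convert_encoding("lexicon-latin1.txt", "lexicon-utf8.txt")
```

Reading a UTF-8 file as Latin-1 does not raise an error but silently garbles accented characters, so it is worth checking the source encoding before converting.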
The ttmodels.txt file as supplied lists the currently-available language models (as of June 2014) from the TreeTagger site, except for Chinese, spoken French, Portuguese (Gamallo) and Galician (Gamallo), which are omitted as requiring special tokenization (not available in the GUI) or otherwise falling outside the usual pattern. Note that user-written programs for special tokenization can be accommodated by the interface.
Each line of the file describes one language model, giving the language name, the name of the file containing the model's parameters, and the encoding. Columns are separated by a tab character.
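To illustrate the line format just described, here is a minimal Python sketch that reads tab-separated lines into records. It is not the interface's own code: the names ModelEntry and parse_models are hypothetical, and the handling of blank or short lines is an assumption, since the source does not say how the interface treats them.

```python
from typing import NamedTuple


class ModelEntry(NamedTuple):
    language: str        # name shown on the languages menu
    parameter_file: str  # file name of the .par model
    encoding: str        # encoding label as written in the file


def parse_models(text):
    """Parse the contents of a ttmodels.txt-style file into entries.

    Assumption: blank lines and lines with fewer than three
    tab-separated columns are skipped.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 3:
            continue
        entries.append(ModelEntry(*fields[:3]))
    return entries
```

The entries come back in file order, which matters because the order of lines is also the order of the languages menu.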
The names used in this file for the parameter files correspond to the downloads current in June 2014.
Several non-current models are also listed, mostly using latin-1 encoding. Three of these have "-latin1" added to the original name of the parameter file, for uniqueness.
This file is used by the tagger interface to populate its menu of languages. Any model listed in this file, and whose parameter file is found in the \lib directory, will be included on the languages menu. The order of models on the languages menu follows the order in this file.
This file is user-editable. It is important to update it if any change is made to the models in the \lib directory. Reasons for such changes might be:
• you rename a parameter file in \lib
• you obtain or make parameters for a new language model; just add a line to this file
• you want to re-order the languages on the menu, or to restrict the number of languages shown
It is permitted to create a customised copy of ttmodels.txt in any directory containing a shortcut to the tagging interface program, and the shortcut will use such a copy in preference to the copy in the \lib directory. For this facility to work correctly, the "Start in" field of the properties of the shortcut should be blanked out (which is probably a good idea anyway).
It is also permitted to place parameter files in the shortcut directory, which may be useful for models under development. When the language menu of the tagger interface is being populated from ttmodels.txt — whether located in the shortcut directory or in the \lib directory — each named parameter file is sought first in the shortcut directory; if not found there, it is sought in the \lib directory.
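The search order described above (shortcut directory first, then \lib) can be sketched in Python as follows. This is an illustration of the lookup rule only, under the assumption that a model whose parameter file is found in neither directory is simply left off the menu; find_first and build_menu are hypothetical names, not functions of the interface.

```python
from pathlib import Path


def find_first(filename, directories):
    """Return the first existing path for filename, trying directories in order."""
    for directory in directories:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


def build_menu(entries, shortcut_dir, lib_dir):
    """Build (language, path) pairs in file order.

    entries is a list of (language, parameter_file) pairs as listed
    in ttmodels.txt. Each parameter file is sought in the shortcut
    directory first and then in the lib directory.
    """
    search_path = [shortcut_dir, lib_dir]
    menu = []
    for language, parameter_file in entries:
        found = find_first(parameter_file, search_path)
        if found is not None:
            menu.append((language, found))
    return menu
```

The same find_first helper covers the choice of ttmodels.txt itself: calling it with "ttmodels.txt" and the two directories returns the customised copy when one exists, and the copy in \lib otherwise.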
Text passed to the TreeTagger must be in a one-token-per-line format. Tokenization is the automated process of converting a running text into this format.
A token is usually a word, but a token may have internal spaces, i.e. be a multi-word unit. Each punctuation mark is treated as a token and should be on a separate line too. Clitics should also be on separate lines, if they were treated as separate tokens in the language's training data. The character-set (or encoding) of the text should be the same as that of the training data, e.g. utf8 or latin-1. Text in the one-token-per-line format may optionally contain manual tagging for some tokens, in the form of a probability and/or a lemma.
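As a rough illustration of the one-token-per-line format, the Python sketch below splits running text into words and punctuation marks and writes one per line. It is deliberately naive and is not the interface's built-in tokenization: it does not separate clitics, join multi-word units or recognise abbreviations, and the regular expression is an assumption chosen for the example.

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split running text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_one_token_per_line(text):
    """Return the text with each token on its own line."""
    return "\n".join(tokenize(text)) + "\n"
```

For example, tokenize("Hello, world!") gives the four tokens Hello, a comma, world and an exclamation mark, each of which would occupy its own line in the file passed to the TreeTagger.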
This interface provides several options for tokenization:
The built-in tokenization performs the following operations:
There are several further optional actions in the built-in tokenization:
The following features are under consideration for addition to the interface in the future:
This interface is offered as a facility for corpus analysis on Windows. By using it, you are deemed to accept that the author bears no responsibility for any adverse consequences. Needless to say, he hopes that there will be no such consequences. He will be pleased to receive comments, but cannot promise to act upon them. For help with a problem, please send the following:
• an input file which produces the problem
• your options, saved in a .cfg file
• your ttmodels.txt file, if you have changed it
• a list of the files in your TreeTagger \bin and \lib folders, with dates and filesizes
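One way to produce the file list requested in the last item is a short Python script such as the sketch below. The function name and output layout are hypothetical, and the installation path in the usage comment assumes the directory suggested in the installation steps.

```python
from datetime import datetime
from pathlib import Path


def list_files(directory):
    """Return one line per file: modification date, size in bytes, name."""
    lines = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            info = path.stat()
            stamp = datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M")
            lines.append(f"{stamp}\t{info.st_size}\t{path.name}")
    return lines


# Hypothetical usage, assuming the suggested installation directory:
# for folder in (r"C:\Program Files\TreeTagger\bin",
#                r"C:\Program Files\TreeTagger\lib"):
#     print(folder)
#     print("\n".join(list_files(folder)))
```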