Windows Interface for Tree Tagger
Ciarán Ó Duibhín
Latest version of tagger program interface: 2017/02/03
• Handling of utf8 improved. Utf8 text may now be used with Latin-1 language model, and vice versa. Utf8 text may or may not have a byte-order-mark (BOM). The output will have the same encoding and the same BOM status as the input.
Version of 2016/08/21
• Fixed: options to output probability tree and affix tree did not run if directory names contained a space.
Version of 2016/03/30
• A customised version of ttmodels.txt is now allowed in the directory containing the shortcut to the tagging interface. If such a file exists, it will be used in preference to ttmodels.txt from the \lib directory. For this feature to work, it is necessary to blank out the "Start in" field of the shortcut (which is probably a good idea anyway). It is also permitted to place parameter files in the shortcut directory, which may be useful for models under development.
• An error is fixed which prevented "languages" with names like "Italian (Baroni)" or "Dutch (Bioche)" from detecting relevant abbreviations files or multiword files.
Latest version of training program interface: 2016/03/30
• A "Force utf8 input" checkbox is provided, corresponding to the -utf8 parameter for the TreeTagger training program.
This version of the Windows graphic interface for the tagger program is intended to work with utf8 encoding, as well as with latin-1 encoding. The previous version of the interface, in which the built-in tokenization option works only with latin-1 encoding, remains available here. Please report any problems with the utf8 version.
The TreeTagger is a program developed by Helmut Schmid at the University of Stuttgart (now at the University of München), for part-of-speech tagging and lemmatization. Language models (known as “parameters”, file extension .par) are supplied on the TreeTagger webpage for using the program with texts in English, French, German, Italian, Spanish, Russian, Bulgarian, Dutch, Estonian, Finnish, Galician, Latin, Mongolian, Polish, Slovak and Swahili, and models for some other languages are available from sites linked to the TreeTagger webpage. For a language for which no model exists, it is necessary to hand-tag some text, and then run a training program (provided with the TreeTagger) to create the model.
A zipped Windows distribution of the TreeTagger is available for download through a link near the end of the "download" section of the TreeTagger webpage. As supplied in that distribution, the programs have to be run from the MS-DOS command line, on which the required options are specified.
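For comparison, here is a rough sketch of how a command-line run could be assembled from a script. It assumes the standard TreeTagger argument order (options, then parameter file, then input file, then output file) and the `-token` and `-lemma` options. The install path and the file names `english.par`, `sample.txt` and `sample.tagged.txt` are hypothetical, and the input file is assumed to be already in one-token-per-line format.

```python
from pathlib import PureWindowsPath


def build_tagger_command(bin_dir, par_file, input_file, output_file):
    """Return the argument list for one tree-tagger.exe run.

    Assumed layout: options first, then parameter file, input, output.
    """
    exe = PureWindowsPath(bin_dir) / "tree-tagger.exe"
    return [
        str(exe),
        "-token",   # print the token alongside its tag
        "-lemma",   # print the lemma as well
        str(par_file),
        str(input_file),
        str(output_file),
    ]


# Hypothetical paths, for illustration only.
cmd = build_tagger_command(
    r"C:\Program Files\TreeTagger\bin",
    r"C:\Program Files\TreeTagger\lib\english.par",
    "sample.txt",
    "sample.tagged.txt",
)
print(cmd)
# On Windows, the list could then be passed to subprocess.run(cmd, check=True).
```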
What is offered here is an add-on Windows graphic interface to the tagger program — and also a similar interface to the training program — which allows the options to be selected visually, and then the TreeTagger program to be launched, without the user having to switch from Windows to MS-DOS.
The selected set of options may be saved and re-loaded, similar to a ‘configuration file.’
Below, a screenshot of the Windows interface to the tagger program.
Below, a screenshot of the Windows interface to the training program.
Download the Windows interface to the tagger program (updated 2017/02/03).
Download a text file required by the Windows interface to the tagger program.
Download the Windows interface to the training program (updated 2016/03/30).
2015/06/01: It has come to my attention that Trend Micro reports the first and third of these downloads as "dangerous". You can rest assured that this is not true; these files, like all others offered on my webpages, are not in any way harmful.
Note that the interface to the tagger program allows Latin-1-encoded text to be run against a UTF8-encoded language model; or UTF8-encoded text to be run against a Latin-1-encoded language model — the text will be internally converted to match the model encoding if necessary. In the latter case however the conversion of the text from UTF8 will alter any character not covered by Latin-1, and this will be seen in the output. Also, do not run text in any 8-bit encoding other than Latin-1 through the interface: it will be treated as if it were in Latin-1 and the result will be meaningless; such texts should be externally converted to UTF8 before using the interface.
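The effect of converting UTF8 text for a Latin-1 model can be illustrated with a short sketch. This is not the interface's own code: the function names are invented for the example, and the use of `?` as the substitute for characters outside Latin-1 is an assumption, since the interface's actual substitute is not documented here.

```python
import codecs


def split_bom(raw: bytes):
    """Return (has_bom, payload) for UTF-8 input bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw


def utf8_to_latin1(raw: bytes) -> bytes:
    """Convert UTF-8 bytes to Latin-1.

    Characters not covered by Latin-1 are replaced with '?'
    (an assumed substitute, chosen only for this illustration).
    """
    _, payload = split_bom(raw)
    return payload.decode("utf-8").encode("latin-1", errors="replace")


def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert Latin-1 bytes to UTF-8; this direction loses nothing."""
    return raw.decode("latin-1").encode("utf-8")


sample = "Ciarán Ó Duibhín wrote ‘this’".encode("utf-8")
converted = utf8_to_latin1(sample)
print(converted.decode("latin-1"))
```

The accented letters survive because Latin-1 covers them, while the curly quotation marks do not and are altered. This is the kind of change that will be visible in the tagger output.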
If you already have TreeTagger installed, skip to number 3.
If you are installing the Windows TreeTagger distribution from the beginning, I would suggest the following procedure, to prepare the TreeTagger for use from the Windows interface.
1. From the TreeTagger website, download the Windows TreeTagger distribution. Unzip it to C:\Program Files, with the "Use folder names" box ticked in Winzip's "Extract" dialog (DON'T OMIT TO TICK THIS BOX). (In Vista, right-click the zip file, choose "Extract all", and browse to C:\Program Files.) This will create and populate the following directories:
To use the TreeTagger from the graphical interface, you will not need any of the files which you will find in \cmd, and you will not need any of the .bat files in \bin. All you will need are the two .exe files in \bin, and the lists of abbreviations and multi-words in \lib.
2. From the TreeTagger website, download the parameter files for the language models you need, decompress them (e.g. using Winzip) and move them to the subdirectory C:\Program Files\TreeTagger\lib.
3. To add the graphic interface, simply place the two interface programs (wintreetagger.exe and wintraintreetagger.exe) in the \bin subdirectory, alongside the two .exe files from the TreeTagger distribution (tree-tagger.exe and train-tree-tagger.exe). Also place the text file (ttmodels.txt) in the \lib subdirectory, alongside the language model files.
Now you can test the TreeTagger. Find or create a plain text file containing a small piece of running text in English — not more than a couple of hundred words for a start. This file should not be located under Program Files, which may be a protected directory in some versions of Windows. Say the file is called sample.txt. In the directory containing sample.txt, make a shortcut to bin\wintreetagger.exe and run the shortcut.
You should always place your own text data in a normal directory, not in Program Files, which may be a protected directory in some versions of Windows. Then, to run either of the interface programs, create in your data directory a shortcut to the interface program in its \bin subdirectory (under Program Files or wherever TreeTagger is installed). Do NOT make a copy of the interface program in your data directory, as such a copy will be unable to find the TreeTagger components. The "Start in" field in the properties of the shortcut to the tagging interface should be left blank (otherwise, the facility to use a customised ttmodels.txt in your data directory will not work).
Regarding running the training program, some limited experiments suggest the following:
• the openclasses.txt file must ALWAYS be encoded in Latin-1
• the best parameters for tagging Latin-1 text are made with both the lexicon and handtagged files encoded in Latin-1, and — though it does not seem to make much difference — without the -utf8 option
• the best parameters for tagging UTF8 text are made with both the lexicon and handtagged files encoded in UTF8, and with the -utf8 option
• the tagged text file will have the same encoding — Latin-1 or UTF8 — as the input untagged text
• the tagging of Latin-1 text is slightly more accurate than the tagging of UTF8 text, compared on the same training data
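Since these observations depend on the lexicon and hand-tagged files sharing one encoding, a quick check before training may be useful. The sketch below is a heuristic only, with invented function names: bytes that are pure ASCII are compatible with either encoding, bytes that decode as UTF8 are taken to be UTF8, and anything else is assumed to be Latin-1.

```python
def guess_encoding(raw: bytes) -> str:
    """Classify bytes as 'ascii', 'utf8' or 'latin1' (heuristic)."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume an 8-bit encoding, taken here to be Latin-1.
        return "latin1"
    return "utf8"


def training_files_agree(lexicon: bytes, handtagged: bytes) -> bool:
    """Return True if the two files do not mix UTF-8 and Latin-1."""
    kinds = {guess_encoding(lexicon), guess_encoding(handtagged)} - {"ascii"}
    return len(kinds) <= 1


print(guess_encoding("Ciarán".encode("latin-1")))
print(guess_encoding("Ciarán".encode("utf-8")))
print(training_files_agree("é".encode("latin-1"), "é".encode("utf-8")))
```

A short Latin-1 file can occasionally happen to be valid UTF8, so the result is a hint and not a guarantee.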
The ttmodels.txt file as supplied lists the currently-available language models (as of June 2014) from the TreeTagger site, except for Chinese, spoken French, Portuguese (Gamallo) and Galician (Gamallo), which are omitted as requiring special tokenization (not available in the GUI) or otherwise falling outside the usual pattern. Note that external or user-written programs for special tokenization can be accommodated by the interface.
Each line of the file describes one language model, giving the language name, the name of the file containing the model's parameters, and the encoding. Columns are separated by a tab character.
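As an illustration of this layout, the following sketch reads such a file into records. The sample lines, the parameter file names and the spelling of the encoding labels (`latin1`, `utf8`) are assumptions made for the example; check the supplied ttmodels.txt for the actual values.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    language: str   # name shown on the languages menu
    par_file: str   # file containing the model's parameters
    encoding: str   # encoding label for the model


def parse_ttmodels(text: str) -> List[Model]:
    """Parse tab-separated model lines, keeping the order of the file."""
    models = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not have exactly three columns
        models.append(Model(*(field.strip() for field in fields)))
    return models


# Hypothetical content, for illustration only.
example = "English\tenglish.par\tlatin1\nFrench\tfrench-utf8.par\tutf8\n"
for model in parse_ttmodels(example):
    print(model.language, model.par_file, model.encoding)
```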
The names used in this file for the parameter files correspond to the downloads current in June 2014.
Several non-current models are also listed, mostly using latin-1 encoding. Three of these have "-latin1" added to the original name of the parameter file, for uniqueness.
This file is used by the tagger interface to populate its menu of languages. Any model listed in this file, and whose parameter file is found in the \lib directory, will be included on the languages menu. The order of models on the languages menu follows the order in this file.
This file is user-editable. It is important to update it if any change is made to the models in the \lib directory. Reasons for such changes might be:
• you rename a parameter file in \lib
• you obtain or make parameters for a new language model; just add a line to this file
• you want to re-order the languages on the menu, or to restrict the number of languages shown
It is permitted to create a customised copy of ttmodels.txt in any directory containing a shortcut to the tagging interface program, and the shortcut will use such a copy in preference to the copy in the \lib directory. For this facility to work correctly, the "Start in" field of the properties of the shortcut should be blanked out (which is probably a good idea anyway).
It is also permitted to place parameter files in the shortcut directory, which may be useful for models under development. When the language menu of the tagger interface is being populated from ttmodels.txt — whether located in the shortcut directory or in the \lib directory — each named parameter file is sought first in the shortcut directory; if not found there, it is sought in the \lib directory.
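The search order just described can be sketched as follows. This is an illustration and not the interface's actual code: the function names are invented, and the demonstration uses temporary directories and empty placeholder files named `english.par` and `french.par`.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def choose_models_file(shortcut_dir: Path, lib_dir: Path) -> Path:
    """Prefer ttmodels.txt beside the shortcut; fall back to the lib copy."""
    local = shortcut_dir / "ttmodels.txt"
    return local if local.is_file() else lib_dir / "ttmodels.txt"


def find_parameter_file(name: str, shortcut_dir: Path, lib_dir: Path) -> Optional[Path]:
    """Look in the shortcut directory first, then in the lib directory."""
    for directory in (shortcut_dir, lib_dir):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def build_menu(entries: List[Tuple[str, str]], shortcut_dir: Path,
               lib_dir: Path) -> List[Tuple[str, Path]]:
    """Keep only models whose parameter file exists, in file order."""
    menu = []
    for language, par_name in entries:
        path = find_parameter_file(par_name, shortcut_dir, lib_dir)
        if path is not None:
            menu.append((language, path))
    return menu


with tempfile.TemporaryDirectory() as tmp:
    shortcut_dir = Path(tmp) / "data"
    lib_dir = Path(tmp) / "lib"
    shortcut_dir.mkdir()
    lib_dir.mkdir()
    (lib_dir / "english.par").touch()
    (lib_dir / "french.par").touch()
    (shortcut_dir / "french.par").touch()  # development copy beside the shortcut
    entries = [("English", "english.par"), ("French", "french.par"),
               ("Latin", "latin.par")]
    for language, path in build_menu(entries, shortcut_dir, lib_dir):
        print(language, "found in", path.parent.name)
```

In the demonstration, the French model is taken from the data directory because that copy is found first, and the Latin entry is left off the menu because no parameter file for it exists in either place.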
Text passed to the TreeTagger must be in a one-token-per-line format. Tokenization is the automated process of converting a running text into this format.
A token is usually a word, but a token may have internal spaces, i.e. be a multi-word unit. Each punctuation mark is treated as a token and should be on a separate line too. Clitics should also be on separate lines, if they were treated as separate tokens in the language's training data. The character set (or encoding) of the text should be the same as that of the training data, e.g. utf8 or latin-1. Text in the one-token-per-line format may optionally contain manual tagging for some tokens, in the form of a probability and/or a lemma.
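To make the target format concrete, here is a deliberately naive sketch that puts words and punctuation marks on separate lines. It is not the interface's built-in tokenization: it keeps apostrophes and hyphens inside words, and it does not separate clitics, join multi-word units or recognise abbreviations.

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def one_token_per_line(text: str) -> str:
    """Return the text with each token on its own line."""
    return "\n".join(TOKEN_PATTERN.findall(text))


print(one_token_per_line("Hello, world! It's well-known."))
```

Each comma, exclamation mark and full stop ends up on its own line, while "It's" and "well-known" stay whole. Whether that is right depends on how the training data for the language was tokenized.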
This interface provides several options for tokenization:
The built-in tokenization performs the following operations:
There are several further optional actions in the built-in tokenization:
This program has been developed and tested under 32-bit Windows Vista, with UAC (user account control) switched off.
If you receive error messages like "File access denied", particularly under more recent versions of Windows, here are some things to try:
• don't place your data under Program Files, which may be a protected directory; place your data in a normal directory and create a shortcut in that directory to the interface program in Program Files
• make sure UAC is off
• run from an account with administrator privilege
• under Windows 8 or later, set the "Compatibility Mode" of the shortcut's properties to "Windows 7"
• run the program as administrator
It is hoped that some combination of these settings will solve the problem.
Thanks to Mohsen Shirazizadeh for several of these suggestions.
• Irene Doval Reixa reports that, under Windows 10 Home 64-bit, moving the entire TreeTagger folder from C:\Program Files\ to D:\ solved the "File access denied" problem.
The author will try to help with any problems. To get help, please send him the following:
• an input file which produces the problem
• your options, saved in a .cfg file
• your ttmodels.txt file, if you have changed it
• a list of the files in the folder containing your shortcut, with dates and filesizes
• a list of the files in your TreeTagger \bin and \lib folders, with dates and filesizes
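The file lists requested above can be produced in any convenient way. One possibility is a small script along the following lines, whose function name and output layout (name, size in bytes, modification date) are just one choice among many.

```python
from datetime import datetime
from pathlib import Path


def list_files(folder):
    """Return one 'name<TAB>size<TAB>date' line per file in the folder."""
    lines = []
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file():
            info = entry.stat()
            stamp = datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M")
            lines.append(f"{entry.name}\t{info.st_size}\t{stamp}")
    return lines


# List the current directory; replace "." with the folder to report on.
for line in list_files("."):
    print(line)
```

Run it once for the folder containing the shortcut and once each for the \bin and \lib folders, and paste the output into the message.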
The following features are under consideration for addition to the interface in the future:
This interface is offered as a facility for corpus analysis on Windows. By using it, you are deemed to accept that the author bears no responsibility for any adverse consequences. Needless to say, he hopes that there will be no such consequences. He will be pleased to receive comments, but cannot promise to act upon them.