Using the IMS Corpus Workbench in Windows

Ciarán Ó Duibhín

Corpus Workbench (CWB) is a system for managing and querying large text corpora for linguistic purposes. Texts may be encoded in Unicode or in Latin-1, and may be annotated at token level, e.g. with lemma or part of speech. The system was originally created at the Institut für Maschinelle Sprachverarbeitung (IMS) in Stuttgart, and is now open-source. It is written in C and implemented on Unix, but a Windows port has been created (still in beta-test as of April 2014).

To use one's own corpus, it is necessary first of all to encode and index it. Instructions on how to do this in Windows may be added here later (refer meanwhile to the encoding tutorial). Initially we will show only how to get the query program (CQP) up and running in Windows, using a supplied demonstration corpus (to take querying further, refer to the query tutorial).

Running a query on a demonstration corpus

The home page for the Windows port of CWB contains the download link for the current beta version (3.4.3 since 2012/02/05). This should be downloaded and unpacked. If using WinZip, the "Use folder names" option should be active; then, if C:\Program Files is chosen as the target, the material will be placed in C:\Program Files\cwb-3.4.3-windows-i586 and its subfolders. I suggest renaming this folder to C:\Program Files\CWB. The folder C:\Program Files\CWB\bin should then be added to the value of the PATH environment variable, so that the CWB programs can be run from any folder.
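
As a minimal sketch, assuming the folder names suggested above, the following lines typed in a Command Prompt window add the bin folder to PATH for that window only, and then check that cqp can be found. A permanent change is made in the Windows environment-variable settings instead.

        set "PATH=%PATH%;C:\Program Files\CWB\bin"
        where cqp

If the folder has been added correctly, the where command should report the location of the cqp program; otherwise it prints an error message.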

Next, download the sample corpus of Dickens novels and unpack it, again retaining folder names, to a folder C:\Program Files\CWB\data. (It is not necessary to use a subfolder of C:\Program Files\CWB for this; any folder will do, with the necessary adjustments to subsequent instructions.) The material will be placed in C:\Program Files\CWB\data\DemoCorpus, which I suggest renaming to C:\Program Files\CWB\data\dickens. A batch file should now be created in C:\Program Files\CWB\data\dickens; it may be called rundickens.bat. Its contents may be as follows, where the -e option enables command-line editing and the -r option names the corpus registry folder:

        cqp -e -r "C:\Program Files\CWB\data\dickens\registry"
        pause
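
A somewhat more defensive variant of the batch file may be sketched as follows. It assumes the folder names suggested above; the variable names CWB_BIN and REGISTRY are introduced here purely for illustration. It checks that the registry folder exists, and adds the bin folder to PATH for its own run only, so that it also works where PATH has not been changed permanently.

        @echo off
        rem Sketch only: folder names follow the suggestions on this page
        setlocal
        set "CWB_BIN=C:\Program Files\CWB\bin"
        set "REGISTRY=C:\Program Files\CWB\data\dickens\registry"
        rem Stop with a message if the registry folder is not where expected
        if not exist "%REGISTRY%\" goto missing
        rem Make cqp findable for this run without altering the system PATH
        set "PATH=%CWB_BIN%;%PATH%"
        cqp -e -r "%REGISTRY%"
        pause
        exit /b 0

        :missing
        echo Registry folder not found: "%REGISTRY%"
        pause
        exit /b 1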

On double-clicking the icon for this batch file, a command-line window will appear, with the prompt "[no corpus]>". Even though there is only one corpus at this stage, it must be named and opened. The following line should be typed (ending with the enter key):

        DICKENS;

The command-line prompt now changes to "DICKENS>" and a CQP query may be typed, terminated by a semicolon. Such a query may be as simple as a single word to be concorded, e.g.

        "pauper";

Typing exit at the prompt will close the program. Further queries may be formulated with the help of the query tutorial referred to above.

Uninstallation of CWB involves only the removal of the unpacked folders and files.

Disclaimer

This page is offered as a facility for corpus analysis on Windows.  By using it, you are deemed to accept that the author bears no responsibility for any adverse consequences.  Needless to say, he hopes that there will be no such consequences.  He will be pleased to receive comments, but cannot promise to act upon them.


Ciarán Ó Duibhín
2014/04/04