Using the FreeLing Analyzer Program under Windows
Ciarán Ó Duibhín
FreeLing
is a system for the linguistic analysis of text (tagging, lemmatization, etc.)
developed at the Universitat Politècnica de Catalunya by a team including Lluís
Padró. The system includes extensive language data for Spanish, Catalan,
Galician, English and Italian; and from Version 2.2, for Welsh, Portuguese and
Asturian. The system takes the form of a library, which can be called from
within a computer program, but there is also a fully-compiled application
program, called analyzer.exe, which allows most of
the functionality of FreeLing to be used. We will be concerned here only with
the installation and use of this stand-alone analyzer program.
FreeLing is written in C++, and development takes place in a Unix environment.
The latest version as of 03 September 2010 is FreeLing 2.2. A port of FreeLing
2.2 for Windows has been made by Israel Olalla, cross-compiled on Linux using
MinGW32. Information about the port, the binaries (8.4 MB) and the user manual
for 2.2 (407 KB) are available for download. Data files for the supported
languages are to be found in the
machine-independent 2.2
distribution (40.5 MB). A number of ports of earlier versions of FreeLing
are also downloadable, and will be mentioned later.
These Windows ports have been made by individuals on a
voluntary basis, and are offered by the FreeLing developers for download from
the FreeLing website merely "as a service". The FreeLing developers are at
pains to point out that they have no interest in Windows, and cannot assist
users of FreeLing under Windows. Some discussion among users of FreeLing under
Windows may be read on the FreeLing
Forum (to contribute to discussion, you must become registered and then
login on the FreeLing home page).
The zip files for Version 2.2 named in the links above should
be downloaded. The zip file containing the Windows binary should be extracted
into a suitable directory, taking care to preserve the internal subdirectory
structure (by checking the "Use folder names" button or similar). If you choose
to extract to C:\Program Files, the package will be extracted into
C:\Program Files\freeling-2.2-mingw and its subdirectories. It may facilitate
later steps if you now rename this directory to C:\Program Files\FreeLing-2.2.
No installer program is required.
The Version 2.2 User Manual may be downloaded separately or, alternatively,
extracted from the machine-independent 2.2 distribution.
As a Unix program in origin, the analyzer expects to read its options from the
MS-DOS command line (other possibilities are mentioned later). The success of
the installation so far may be verified by opening an MS-DOS window, moving to
the extraction directory, and typing
bin\analyzer -h
This should run the analyzer program and display a list of allowed program
options (the options are fully described in the user manual).
The most important option for the analyzer is the name of a configuration
file. This is a file containing a set of program options, so that the
remainder of the command line need contain only options absent from the
configuration file, or those which are to be made different from what is in the
configuration file. If no configuration file is specified on the command line,
the default is analyzer.cfg in the extraction
directory.
Other possible command-line content includes redirection of the input (text for
analysis) and/or output (results of analysis) to named files. In the absence
of redirection, input comes from the keyboard and output goes to the MS-DOS
window. End of input through the keyboard may be signalled by keying
Ctrl/Z followed by Enter.
You should now extract the data files for the supported languages from the
downloaded zip file for the machine-independent 2.2 distribution. You need
select only those files packed in FreeLing-2.2\data, and then extract, having
ensured that "Use folder names" or similar is selected, and that the target is
set to C:\Program Files. You will now find a configuration file for each
supported language — ca.cfg, en.cfg, es.cfg, gl.cfg, it.cfg, as.cfg, cy.cfg, pt.cfg — in the data\config
subdirectory. Among the contents of a configuration file will be a number of
filenames, and these filenames — at least in those configuration files which
you intend to use — may have to be changed to suit (a) Windows syntax — in
particular, by changing forward slashes in paths to backslashes; and (b) your
choice of extraction directory.
So, for example, if a configuration file contains the filename
$FREELINGSHARE/en/tokenizer.dat
it should be changed to
C:\Program Files\FreeLing-2.2\data\en\tokenizer.dat
on the assumption that you extracted to C:\Program Files\FreeLing-2.2.
You will also have to change two filenames in configuration files:
maco.db to dicc.src
senses30.db to senses30.src
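The filename changes described above can also be applied mechanically rather than by hand. The following is a minimal Python sketch, not part of FreeLing: the names DATA_DIR, RENAMES, convert_line and convert_config are hypothetical, and it assumes that every path to be changed begins with $FREELINGSHARE/ and that you extracted to C:\Program Files\FreeLing-2.2.

```python
import re
from pathlib import PureWindowsPath

# Assumption: the data files were extracted to this directory.
DATA_DIR = PureWindowsPath(r"C:\Program Files\FreeLing-2.2\data")

# The two filename substitutions named in the text.
RENAMES = {"maco.db": "dicc.src", "senses30.db": "senses30.src"}


def convert_line(line, data_dir=DATA_DIR):
    """Rewrite one configuration line for Windows."""

    def to_windows(match):
        # Split the Unix-style relative path and rejoin it with backslashes
        # under the chosen data directory.
        relative_parts = match.group(1).split("/")
        return str(data_dir.joinpath(*relative_parts))

    line = re.sub(r"\$FREELINGSHARE/(\S+)", to_windows, line)
    for old_name, new_name in RENAMES.items():
        line = line.replace(old_name, new_name)
    return line


def convert_config(text, data_dir=DATA_DIR):
    """Apply convert_line to every line of a configuration file's text."""
    return "".join(
        convert_line(line, data_dir) for line in text.splitlines(keepends=True)
    )
```

For example, convert_line("TokenizerFile=$FREELINGSHARE/en/tokenizer.dat") yields the Windows path shown earlier; the option name on the left of the equals sign is only illustrative. Keep a copy of the original configuration file before overwriting it.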
This completes the installation of the analyzer program. It can be run by
opening an MS-DOS window, moving to the extraction directory, and typing a
command line. A typical command-line might begin with
bin\analyzer -f data\config\en.cfg
On its own, this command line will accept text from the keyboard, analyse it
according to typical rules for English, and output the results to the MS-DOS
window. Further options and redirections could be added on the command line,
to name input and/or output files, or to vary some of the settings in the
distributed English configuration file.
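The same command line, with input and output redirected to named files, can be driven from a script. This is a sketch only, assuming Python is installed on the Windows machine; build_command and analyse_file are hypothetical helper names, and the only analyzer option used is the -f option shown above.

```python
import subprocess
from pathlib import PureWindowsPath


def build_command(install_dir, config=r"data\config\en.cfg"):
    """Return the argument list equivalent to: bin\\analyzer -f <config>."""
    root = PureWindowsPath(install_dir)
    return [str(root / "bin" / "analyzer.exe"), "-f", str(root / config)]


def analyse_file(install_dir, input_path, output_path,
                 config=r"data\config\en.cfg"):
    """Run the analyzer with input and output redirected to named files.

    This has the same effect as appending <input_path >output_path to the
    command line. The analyzer's exit code is returned.
    """
    with open(input_path, "rb") as source, open(output_path, "wb") as target:
        completed = subprocess.run(
            build_command(install_dir, config),
            stdin=source,
            stdout=target,
            cwd=install_dir,  # run from the extraction directory, as in the text
        )
    return completed.returncode
```

A call such as analyse_file(r"C:\Program Files\FreeLing-2.2", "input.txt", "output.txt") would then analyse input.txt; both file names are placeholders.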
Note that this Windows port of FreeLing 2.2 has not been tested here under any
version of Windows other than Vista.
As an alternative to typing command lines at the MS-DOS
prompt, you can — while remaining in Windows — go to the extraction directory
and edit the same lines into a text file with extension .bat (called, for example, freeling.bat). Double-clicking on this file will open
an MS-DOS window and run the program in it. You can even add a shortcut to the
batch file from your desktop or from your Start Menu. If filenames mentioned
in the batch file contain non-ASCII characters (such as accented letters), you
may have to render these in the MS-DOS character-set rather than in the Windows
character-set; in that case, you may find it easier to edit the batch file
under MS-DOS than under Windows.
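The character-set point can be handled by generating the batch file from a script that writes it in the MS-DOS (OEM) code page. The sketch below assumes that code page is 850, the usual Western European one; your machine may use another, such as 437. The function names are hypothetical.

```python
def batch_file_bytes(command, oem_codepage="cp850"):
    """Return the contents of a small batch file, encoded for MS-DOS.

    The file switches off command echoing, runs the given command line,
    and then pauses so that the window stays open after the program ends.
    """
    lines = ["@echo off", command, "pause"]
    return ("\r\n".join(lines) + "\r\n").encode(oem_codepage)


def write_batch_file(path, command, oem_codepage="cp850"):
    """Write the batch file to disk in binary mode, so that no further
    character-set conversion takes place."""
    with open(path, "wb") as handle:
        handle.write(batch_file_bytes(command, oem_codepage))
```

An accented letter such as é is stored as byte 0x82 in code page 850 but as 0xE9 in the Windows character set, which is why a batch file saved from a Windows editor may name the wrong file.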
The analyzer program has to be run from the MS-DOS
command-line, where the required options and redirections are specified.
We hope at some stage to offer a Windows graphic interface to FreeLing, which
will allow the options to be selected visually, and then the MS-DOS application
to be launched automatically.
Earlier versions of FreeLing are available for Windows as
follows, and are downloadable from the FreeLing website.
• Version 1.4 (28.8 MB) has been compiled for Windows by Jordi Atserias using
cygwin.
• Version 1.5, the version prior to 2.0, has been compiled twice for Windows:
firstly, by Bruno Martínez using MS Visual C++ 2005, as Version 1.5 (Martínez)
(22.8 MB); and secondly, by Javier Puche using MinGW + Msys + msysDTK, as
Version 1.5 (Puche) (27.59 MB).
• Version 2.0 has been compiled using cygwin and is available for download as
a set of three zip files: the Version 2.0 program files (5.72 MB), the Version
2.0 data files for English and Italian (14.19 MB), and the Version 2.0 data
files for Spanish, Catalan and Galician (20.95 MB).
A comparison of the directory structures created by unzipping the various ports
may be helpful. Those ports without a \bin subdirectory hold the binary in the
root directory.
| 2.2 (Olalla) | 2.0 | 1.5 (Puche) | 1.5 (Martínez) | 1.4 (Atserias) |
| bin | bin | |||
| doc | doc | doc | ||
| userman | userman | |||
| html | html | |||
| refman | ||||
| html | ||||
| latex | ||||
| diagrams | ||||
| include | include | include | ||
| freeling | freeling | |||
| fries | fries | |||
| omlet | omlet | |||
| lib | ||||
| devel | ||||
| java | ||||
| dynamic | ||||
| java | ||||
| util | ||||
| share | data | |||
| common | common | common | common | |
| nec | nec | nec | nec | |
| config | config | config | config | |
| ca | ca | ca | ca | |
| nec | nec | nec | nec | |
| en | en | en | en | |
| nec | nec | nec | nec | |
| es | es | es | es | |
| nec | nec | nec | nec | |
| gl | gl | gl | gl | |
| nec | nec | nec | nec | |
| it | it | it | it | |
| nec | nec | nec | nec |
The zip files linked above for Version 2.0 should be
downloaded, and extracted into a suitable directory (e.g. C:\Program Files\FreeLing20), taking care to preserve
the internal subdirectory structure. No installer program is used. The
remaining information required for Windows installation will be found in the
file README.txt, contained in the top-level
directory.
The Version 2.0 User Manual is included in the download of the program files,
as doc\userman\userman.pdf.
As a Unix program in origin, the analyzer expects to read its options from the
MS-DOS command line (other possibilities are mentioned later). For now, the
installation of the analyzer can be tested by opening an MS-DOS window, moving
to the extraction directory, and typing
bin\analyzer -h
This should display a list of allowed program options (the options are fully
described in the user manual).
The most important option for the analyzer is the name of a configuration
file. This is a file containing a set of program options, so that the
remainder of the command line need contain only options absent from the
configuration file, or those which are to be made different from what is in the
configuration file. If no configuration file is specified on the command line,
the default is analyzer.cfg in the extraction
directory.
Other possible command-line content includes redirection of the input (text for
analysis) and/or output (results of analysis) to named files. In the absence
of redirection, input comes from the keyboard and output goes to the MS-DOS
window. End of input through the keyboard may be signalled by keying
Ctrl/Z followed by Enter.
You will find a configuration file for each language — ca.cfg, en.cfg, es.cfg, gl.cfg, it.cfg — in the share\config subdirectory. Among the contents of a
configuration file will be a number of filenames, and these filenames — at
least in those configuration files which you intend to use — may have to be
changed to suit (a) Windows syntax — in particular, by changing forward slashes
in paths to backslashes; and (b) your choice of extraction directory.
So, for example, if a configuration file contains the filename
$FREELINGSHARE/en/tokenizer.dat
it should be changed to
C:\Program Files\FreeLing20\share\en\tokenizer.dat
on the assumption that you extracted to C:\Program Files\FreeLing20.
This completes the installation of the analyzer program. It can be run by
opening an MS-DOS window, moving to the extraction directory, and typing a
command line. A typical command-line might begin with
bin\analyzer -f share\config\en.cfg
On its own, this command line will accept text from the keyboard, analyse it
according to typical rules for English, and output the results to the MS-DOS
window. Further options and redirections could be added on the command line,
to name input and/or output files, or to vary some of the settings in the
distributed English configuration file. Note that, in older versions of
Windows, you may have to abbreviate filenames and directory names to their 8+3
forms.
The zip file linked above for Version 1.5 (Puche) should be
downloaded, and extracted into a suitable directory (e.g. C:\Program Files\freeling1.5-win-java-all-langs), taking
care to preserve the internal subdirectory structure. No installer program is
used. (Don't worry about the mention of Java: it is not involved in running
the compiled analyzer program.)
The Version 1.5 PDF User Manual is not included in the download, and has been
superseded on the FreeLing website. Therefore I make it available here.
As a Unix program in origin, the analyzer expects to read its options from the
MS-DOS command line (other possibilities are mentioned later). For now, the
installation of the analyzer can be tested by opening an MS-DOS window, moving
to the extraction directory, and typing
analyzer -h
This should display a list of allowed program options (the options are fully
described in the user manual).
The most important option for the analyzer is the name of a configuration
file. This is a file containing a set of program options, so that the
remainder of the command line need contain only options absent from the
configuration file, or those which are to be made different from what is in the
configuration file. If no configuration file is specified on the command line,
the default is analyzer.cfg in the extraction
directory.
Other possible command-line content includes redirection of the input (text for
analysis) and/or output (results of analysis) to named files. In the absence
of redirection, input comes from the keyboard and output goes to the MS-DOS
window. End of input through the keyboard may be signalled by keying
Ctrl/Z followed by Enter.
You will find a configuration file for each language — ca.cfg, en.cfg, es.cfg, gl.cfg, it.cfg — in the config
subdirectory. Among the contents of a configuration file will be a number of
filenames, and these filenames — at least in those configuration files which
you intend to use — may have to be changed to suit (a) Windows syntax — in
particular, by changing forward slashes in paths to backslashes; and (b) your
choice of extraction directory.
So, for example, if a configuration file contains the filename
en/tokenizer.dat
it may have to be changed to
C:\Program Files\freeling1.5-win-java-all-langs\en\tokenizer.dat
on the assumption that you extracted to C:\Program Files\freeling1.5-win-java-all-langs.
This completes the installation of the analyzer program. It can be run by
opening an MS-DOS window, moving to the extraction directory, and typing a
command line. A typical command-line might begin with
analyzer -f config\en.cfg
On its own, this command line will accept text from the keyboard, analyse it
according to typical rules for English, and output the results to the MS-DOS
window. Further options and redirections could be added on the command line,
to name input and/or output files, or to vary some of the settings in the
distributed English configuration file. Note that, in older versions of
Windows, you may have to abbreviate filenames and directory names to their 8+3
forms.
First off, if you are using Windows
95, FreeLing 1.5 (Martínez) will NOT work under it — you may follow the
process below for a certain distance, but it will eventually fail.
The zip file linked above for Version 1.5 (Martínez) should be downloaded, and
extracted into a suitable directory (e.g. C:\Program Files\FreeLing-1.5), taking
care to preserve the internal subdirectory structure. No installer program is
used, and no installation instructions whatsoever are included.
The Version 1.5 PDF User Manual is not included in the download, and has been
superseded on the FreeLing website. Therefore I make it available here.
As a Unix program in origin, the analyzer expects to read its options from the
MS-DOS command line (other possibilities are mentioned later). For now, the
installation of the analyzer can be tested by opening an MS-DOS window, moving
to the extraction directory, and typing
analyzer -h
This should display a list of allowed program options (the options are fully
described in the user manual).
However, an error may occur at this point because the download does NOT contain
the essential files MSVCR80.DLL and MSVCP80.DLL, which are runtime libraries required by
applications written in Visual C++ 2005 (that includes this port of FreeLing).
You may already have these files on your machine, if you have previously
installed Visual C++ 2005, or an application written in it.
The absence of these files produces different error messages in different
versions of Windows. In early versions (95; 98; ME? 2000?), it may say "A
required .DLL file, MSVCP80.DLL, was not found." With later versions of
Windows (XP; XP with SP2? 2003? Vista?), the message may be "The application
has failed to start because the application configuration is incorrect.
Reinstalling the application may fix this problem." Use of Resource Hacker shows that
analyzer.exe contains an embedded manifest which asks for version 8.0.50727.762
of the .dlls.
If these .dlls are missing or are causing errors, the best (and safest) way to
rectify this is to download and install the appropriate one of two free
Microsoft packages:
• for Windows 98; 98 Second Edition; ME: Microsoft
Visual C++ 2005 Redistributable Package (x86), v. 1.0 (actually,
8.0.50727.42), dated 2006/04/10 (2.6 MB);
• for Windows 2000; XP; 2003; Vista: Microsoft
Visual C++ 2005 SP1 Redistributable Package (x86), v. 8.0.50727.762, dated
2007/04/10 (2.6 MB)
As regards Windows 95, the first of these
Microsoft packages will actually install under Windows 95 — or at least, under
Windows 95B — but running the analyzer program now produces the message "The
MSVCR80.DLL file is linked to missing export
KERNEL32.DLL:GetLongPathNameW." This means that MSVCR80.DLL (and indeed
VC++ 2005 as a whole) is simply incompatible with Windows 95, which does not
have routines like GetLongPathNameW in its kernel. To run under Windows 95,
this port of FreeLing would need to be recompiled under an earlier version of
VC++, such as VC++ 7.1 (also known as Visual C++ .NET 2003), or possibly even
as far back as VC++ 6.
Under Windows 98 or later, if the analyzer -h
command is now producing output, we may try something more ambitious.
The most important option for the analyzer is the name of a configuration
file. This is a file containing a set of program options, so that the
remainder of the command line need contain only options absent from the
configuration file, or those which are to be made different from what is in the
configuration file. If no configuration file is specified on the command line,
the default is analyzer.cfg in the extraction
directory.
Other possible command-line content includes redirection of the input (text for
analysis) and/or output (results of analysis) to named files. In the absence
of redirection, input comes from the keyboard and output goes to the MS-DOS
window. End of input through the keyboard may be signalled by keying Ctrl/Z
followed by Enter.
You will find a configuration file for each language — ca.cfg, en.cfg, es.cfg, gl.cfg, it.cfg — in the config
subdirectory. Among the contents of a configuration file will be a number of
filenames, and these filenames — at least in those configuration files which
you intend to use — may have to be changed to suit (a) Windows syntax — in
particular, by changing forward slashes in paths to backslashes; and (b) your
choice of extraction directory.
So, for example, if a configuration file contains the filename
en/tokenizer.dat
it may have to be changed to
C:\Program Files\FreeLing-1.5\en\tokenizer.dat
on the assumption that you extracted to C:\Program Files\FreeLing-1.5.
You may now try a command line such as the following:
analyzer -f config\en.cfg
On its own, this command line will accept text from the keyboard, analyse it
according to typical rules for English, and output the results to the MS-DOS
window. Further options and redirections could be added on the command line,
to name input and/or output files, or to vary some of the settings in the
distributed English configuration file. Note that, in older versions of
Windows, you may have to abbreviate filenames and directory names to their 8+3
forms.
In Windows 98 — and the same is probably
true of Windows ME — the analyzer program
may now produce error messages such as "Error 14 while opening database
en\maco.db." This problem is cured if the distributed file libdb45.dll is replaced by this
one, which was kindly compiled by Andrei Costache of Oracle/BerkeleyDB for
Windows 98/ME as the target system. (It works on newer Windows systems too, but
may not perform as efficiently on the newer systems as the distributed
libdb45.dll.) Many thanks to Andrei for his patient help with this. Remember
that use of libdb45.dll is subject to the terms of the BerkeleyDB
licence agreement.
This completes the installation of the analyzer program.
The omissions in the distribution and the incompatibility of the executable
with Windows 95 are regrettable, as compilation under Visual C++ feels like
the best way to go.
The zip file linked above for Version 1.4 should be
downloaded, and extracted into a suitable directory (e.g. C:\Program Files\FreeLing-1.4), taking care to preserve
the internal subdirectory structure. No installer program is used. The
remaining information required for Windows installation will be found in the
file Readme, contained in the top-level
directory.
The Version 1.4 User Manual is included in the general download for that
version, as doc\userman\userman.pdf.
As a Unix program in origin, the analyzer expects to read its options from the
MS-DOS command line (other possibilities are mentioned later). For now, the
installation of the analyzer can be tested by opening an MS-DOS window, moving
to the extraction directory, and typing
analyzer -h
This should display a list of allowed program options (the options are fully
described in the user manual).
The most important option for the analyzer is the name of a configuration
file. This is a file containing a set of program options, so that the
remainder of the command line need contain only options absent from the
configuration file, or those which are to be made different from what is in the
configuration file. If no configuration file is specified on the command line,
the default is analyzer.cfg in the extraction
directory.
Other possible command-line content includes redirection of the input (text for
analysis) and/or output (results of analysis) to named files. In the absence
of redirection, input comes from the keyboard and output goes to the MS-DOS
window. End of input through the keyboard may be signalled by keying
Ctrl/Z followed by Enter.
You will find a configuration file for each language — ca.cfg, en.cfg, es.cfg, gl.cfg, it.cfg — in the data\config
subdirectory. Among the contents of a configuration file will be a number of
filenames, and these filenames — at least in those configuration files which
you intend to use — may have to be changed to suit (a) Windows syntax — in
particular, by changing forward slashes in paths to backslashes; and (b) your
choice of extraction directory.
So, for example, if a configuration file contains the filename
N:/Eines-SL/freeling1.4/FreeLing/en/tokenizer.dat
or, in fact,
<anything>/en/tokenizer.dat
it should be changed to
C:\Program Files\FreeLing-1.4\data\en\tokenizer.dat
on the assumption that you extracted to C:\Program Files\FreeLing-1.4.
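Because the Version 1.4 configuration files may carry any prefix in front of the language directory, the rewriting rule here is "discard everything up to the language directory". A hedged Python sketch of that rule follows. It assumes each configuration line has the form option=value, that paths use forward slashes, and that the data directories are the ones listed in the comparison above; all names in the code are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# Assumption: Version 1.4 was extracted to this directory.
DATA_DIR = PureWindowsPath(r"C:\Program Files\FreeLing-1.4\data")

# Data subdirectories whose names mark the start of the part to keep.
LANG_DIRS = ("ca", "en", "es", "gl", "it", "common")

# Lazily skip any prefix, then capture from the language directory onwards.
TAIL = re.compile(r"^.*?/((?:%s)/.+)$" % "|".join(LANG_DIRS))


def relocate(value, data_dir=DATA_DIR):
    """Replace <anything>/<lang>/... by data_dir\\<lang>\\..."""
    match = TAIL.match(value.strip())
    if match is None:
        return value  # not a data-file path: leave it untouched
    return str(data_dir.joinpath(*match.group(1).split("/")))


def convert_line(line, data_dir=DATA_DIR):
    """Rewrite the value of one option=value line (without its newline)."""
    key, separator, value = line.partition("=")
    if not separator:
        return line  # comment or blank line
    return key + separator + relocate(value, data_dir)
```

Lines whose values are not paths under one of the listed directories pass through unchanged, so the function can be applied to every line of a configuration file.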
This completes the installation of the analyzer program. It can be run by
opening an MS-DOS window, moving to the extraction directory, and typing a
command line. A typical command-line might begin with
analyzer -f data\config\en.cfg
On its own, this command line will accept text from the keyboard, analyse it
according to typical rules for English, and output the results to the MS-DOS
window. Further options and redirections could be added on the command line,
to name input and/or output files, or to vary some of the settings in the
distributed English configuration file. Note that, in older versions of
Windows, you may have to abbreviate filenames and directory names to their 8+3
forms.
This webpage is really about using FreeLing on Windows in the
form of the analyzer application, but it is also
possible to call FreeLing from a programming language, such as C++ or Java. I
have no personal experience of doing this, but here are a few basic notes,
which others are invited to correct and extend.
To use FreeLing from a programming language in Windows, FreeLing has to be
compiled as a .dll file. Only the Version 2.2 (Olalla) and Version 1.5 (Puche)
ports supply such a file — actually, it was found necessary there to break the
.dll file into two parts, which are called morfo.dll and morfo_java.dll.
I will try to describe here Puche's use of these files. His file java\USAGE.txt tells how to call FreeLing from a Java
application. A Java application which calls FreeLing 1.5 (Puche) requires the
above two distributed .dlls and also a distributed file libmorfo_java.jar, which
contains the FreeLing API definitions, as well as some code. Such an
application, myprog, is compiled as follows:
javac -classpath libmorfo_java.jar myprog.java
and is executed as follows:
java -classpath libmorfo_java.jar;. myprog
The application source file myprog.java makes
internal reference to morfo_java.dll.
To call FreeLing from a C++ application under Windows, we may again use the
.dll files, while the definitions are in the .h
files in the \include directory. These .h files
are not distributed in either of the Version 1.5 ports (Puche or Martínez),
though they are probably to be found in the Linux distribution of that
version. Of course, the C++ programmer has the alternative of recompiling any
version of FreeLing entirely from source along with his own application.
This page is offered as a facility for corpus analysis on
Windows. By using it, you are deemed to accept that the author bears no
responsibility for any adverse consequences. Needless to say, he hopes that
there will be no such consequences. He will be pleased to receive comments,
but cannot promise to act upon them.