Twitter log of work on POOLS-T

(See multidict.net for the latest versions of Wordlink and Multidict.)

2008-12-02

Wrote rough report, “Wordlink - current bugs and things to do”: Wordlink_report_2008-12-02.doc

2008-12-16

Came across mysql-administrator, a mySQL management tool and installed it and explored it. Looks very useful, especially for foreign keys in mySQL databases - Needed for lexical databases (which are needed for lemmatization). Also for simplifying remote user management - useful for future database of dictionaries which people can add to.

2008-12-17

Started reading up on Apache mod_rewrite rules. These are needed for the setup of www2.smo.uhi.ac.uk, so that selected parts of the SMO website can be developed using PHP5 on www2.smo.uhi.ac.uk, while everything else is transparently passed over to www.smo.uhi.ac.uk (so relative links will still work). This is already working for http://www2.smo.uhi.ac.uk/gaidhlig/faclair/, but I need to get it working also for http://www2.smo.uhi.ac.uk/wordlink/.

Reading up on $_SERVER variables in PHP. Trying to find a way (better than the present hack) in Wordlink of retrieving the physical address of the fetched page (with any trailing slashes, etc, added). This is needed so that it can set basedir and relative links (e.g. to pictures!) will work. Did various tests. Nothing worked. Silly idea anyway, thinking $_SERVER variables would work. Need to read up on PEAR packages, especially HTTP_Request2.

2008-12-18

Looked at various POOLS-T partner websites, including new hopeful Swiss partner. Added them to http://www.smo.uhi.ac.uk/~caoimhin/obair/pools-t/. Started this twitter log (previous entries are backdated, from memory).

Looked a bit at www.languages.dk - Good stuff - Need to read it thoroughly. Looked at POOLS Blog. Tried sending a “test” message. Successful, except that Edit HTML didn’t work - probably something to do with Opera browser and Wordpress editing. Need to try another browser next time. Read various messages from Gordon and Kent. Read Hoorn minutes. Read project schedule. Noted proposed March workshop dates in my diary. Checked Gordon’s revised evaluation form. Had idea for online quick report facility to supplement this. Could automatically note problem URLs in error-report log. Would be easy to do general logging of URLs, but not needed and privacy issues (which a logging on/off switch would partially solve). Logging of dictionary use statistics would be good, though.

2009-01-05

Back at work. Learned a tiny bit of/about Modern Greek during the holidays.

2009-01-06

Reading up on Apache module mod_rewrite. Learned enough to complete my setup of parallel server www2.smo.uhi.ac.uk, so that /gaidhlig/faclair/ and /wordlink/ are serviced by www2.smo.uhi.ac.uk and anything else is redirected to www.smo.uhi.ac.uk. Moved Wordlink from claran.smo.uhi.ac.uk to www2 and set up redirection on claran.

Incidentally, mod_rewrite looks really powerful. Looks like it could probably be used to do something I have wanted to do for years - take the "Accept-language" parameter of the HTTP request and rewrite it to put Gàidhlig (gd) in at the top if it is not already there, but to leave Accept-language alone if it already has gd somewhere. This could be used to set up a webserver with dual Gàidhlig-English webpages which serves up the Gàidhlig by default to everyone except those who explicitly take the trouble to put Gàidhlig lower down than English in their browser language preferences. Could be useful for other minority language websites too.
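
The header rewrite described above can be sketched in a few lines. This is only an illustration of the logic in Python, not mod_rewrite configuration; the function name prefer_gaelic is made up for the example, and it assumes the header arrives as a plain comma-separated string.

```python
def prefer_gaelic(accept_language: str, code: str = "gd") -> str:
    """Put `code` first in an Accept-Language value unless it is already listed."""
    ranges = [r.strip() for r in accept_language.split(",") if r.strip()]
    for language_range in ranges:
        # Drop any ";q=..." weight and compare just the language tag.
        tag = language_range.split(";")[0].strip().lower()
        # Treat "gd" and regional variants such as "gd-GB" as already present.
        if tag == code or tag.startswith(code + "-"):
            return accept_language  # leave the user's own ordering alone
    return ", ".join([code] + ranges)


print(prefer_gaelic("en-GB,en;q=0.8"))   # gd is added at the front
print(prefer_gaelic("en, gd;q=0.5"))     # unchanged: gd is already listed
```

Anyone who has deliberately ranked Gàidhlig below English keeps that ranking, because the header is returned untouched whenever gd appears anywhere in it.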

2009-01-07

Tidied up, simplified, and beautified a bit the Wordlink navigation frame, incorporating some suggestions from Kent.

2009-01-08

Incorporated a language choice suggestion from Kent (error message if no language specified). Some tests. Message to Kent. Found Breton-French dictionary no longer working. Found its new location and altered the program. A database of dictionaries and their parameters would be very useful to us.

2009-01-09

Did a lot of tests to find out why Wordlink was just returning a blank page for some sites. It turned out that it happened whenever the webpage html had trailing whitespace after the </html>. I couldn't understand why this was happening, but it was easy enough to correct by getting rid of the trailing whitespace with the PHP function rtrim.

Improved Wordlink by making it detect, when breaking the text into words to be linked, that the sequence “&nbsp;” constitutes whitespace.

Tweaked the navigation frame layout a bit more.

2009-01-10

Message from Kent with a suggestion by Elizabeth to use the IATE terminology database as a “dictionary” to link to. Worked out a url which will work for doing a search on this, and added Dutch-English translation (as a first step) to Wordlink.

Decided on a system for archiving versions of Wordlink, both for safety and so that people could see that it was (hopefully) improving. Added an “About Wordlink” link to the navigation frame, and a link to an archived version on the “About” page.

2009-01-11

Added Greek-English translation via IATE to Wordlink, and added the Greek and Dutch Wikipedias as examples in the navigation frame. Message to Kent and Elizabeth.

IATE, if it proves useful, gives translation between 276 language pairs, thereby enormously increasing the number of language pairs which Wordlink should be able to deal with. It would swamp the program and make it very ugly if they were all written in directly, so some restructuring will be necessary. I have been planning to put the dictionaries into a database keyed on language pairs and for Wordlink to be able to fetch the dictionaries from there, with some JavaScript (perhaps also generated with help from the database) somehow helping the user to choose dictionaries. However, it might not even be sensible to put all 276 pairs as separate records in the dictionaries database, so some rethink might be necessary there too. Apart from the source and target languages, IATE also opens up the choice of interface language, a third language choice.

Talked to a friend who works in a health service physiology clinic in an area with lots of immigrants who know little English, and who was asking me about a quick online translation service which could translate simple sentences such as, “The Urdu translator is ill today. Could you come back at half past ten on Thursday next week.” That gave me the idea that Wordlink should include a facility to open up a scratch-pad where the user could type in a few sentences which would be linked to a dictionary. Something for the future.

2009-01-13

Main Gaelic-English dictionary, Dwelly (from early 1900s) suddenly released on Internet by enthusiasts at www.dwelly.info. Studying this and offering comments. Succeeded in creating an HTTP GET interface so that it can be used with POOLS-T. Testing this. Works well except that there is a huge 6 second delay for some unknown reason.

2009-01-14

Further tests and comments to originator on Dwelly. Trying to track down along with Màrtainn Dòmhnallach (SMO) the reason for the delay in my GET interface to Dwelly. Suspected SMO firewall, but Màrtainn doesn't see how this could be the case.

Swopped messages with Gordon Wells. Tried (unsuccessfully so far) to get Wordlink working with HTTP_Request.

2009-01-15

Major improvements to Wordlink (though still to be moved from /wordlink/test/ to /wordlink/). Succeeded in getting Wordlink working with HTTP_Request (though I should probably be looking at HTTP_Request2). This simplifies things (e.g. it can follow redirects and add trailing slashes automatically). And it gives me access to more facilities. e.g. I can now get the content-type from the HTTP header. Used this to (hopefully mostly correctly) determine the character encoding of the page to be processed. Changed Wordlink to convert the page if necessary from this encoding to utf-8 (having decided to work with utf-8 as standard) (and to convert any meta content-type charset tag in the header to specify utf-8). This enabled Wordlink to work with lots of pages which didn't work before (outputting nothing but “..appears to have no <body> tag”).
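
A rough sketch of the encoding step, in Python rather than the PHP that Wordlink actually uses. The function names and the iso-8859-1 fallback are assumptions made for the example, and it covers only the HTTP header, not the rewriting of the meta charset tag.

```python
import re


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Extract the charset from a Content-Type header value, e.g.
    'text/html; charset=ISO-8859-7'. Falls back to `default` (an assumption)."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type or "", re.IGNORECASE)
    return match.group(1).lower() if match else default


def to_utf8(body: bytes, content_type) -> bytes:
    """Decode `body` using the declared charset and re-encode it as utf-8."""
    charset = charset_from_content_type(content_type)
    try:
        text = body.decode(charset, errors="replace")
    except LookupError:
        # The server named an encoding Python does not know; use the fallback.
        text = body.decode("iso-8859-1", errors="replace")
    return text.encode("utf-8")


greek_page = "καλημέρα".encode("iso-8859-7")
print(to_utf8(greek_page, "text/html; charset=ISO-8859-7").decode("utf-8"))
```

Undecodable bytes are replaced rather than raising an error, on the view that a page with a few odd characters is more use than no page at all.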

Used “html_entity_decode” to convert any html “entities” (things like “&egrave;”) to proper utf-8. This cleans up lots of mess which had been appearing on certain pages. It also seems to cope neatly with the “&nbsp;”s, although I need to check this.

Changed Wordlink to deal more correctly with existing links - to look for a “</a>” rather than thinking that any tag found within the link text represented the end of the link text. (I should perhaps be doing the same for similar less common situations such as “<textarea>...</textarea>” and “<object>...</object>”.)

Improved the stylesheet for Wordlink links (class="wll"). Added “inherit” attributes to stop overstyling. This greatly improved the appearance of very many (but not all) linked pages.

2009-01-16

Moved latest version from /wordlink/test/ to /wordlink/ and archived old version.

Tried getting HTTP_Request to use UHI proxy (wwwcache.uhi.ac.uk) to see if this would get over current delays in fetching pages from new sites (due to SMO firewall???), but while this generally worked, it sometimes failed so reverted back to direct request. (Could try again sometime with increased timeout?)

2009-01-20

Testing Dwelly’s dictionary over the last few days, and offering some advice to the guys who put it online. Improved the 6 second delay on lookups from Wordlink somewhat by using HTTP_Request to go through a proxy server.

2009-01-22

Over the last few days started work on major task - Moving all dictionary lookup pages/forms/programs at SMO (e.g. Stòr-dàta Briathrachais, Briathrachan Beag, Dúil Bélrai, Celtic Cognates) to www2.smo.uhi.ac.uk and doing a major upgrade - first in about four years - to make them work with PHP5, using $_GET and $_POST and exceptions and the modern object-oriented database query interface. This is not totally POOLS-T related, but it is indirectly related in many ways:

Finished the modernisation of the Stòr-dàta Briathrachais search program.

Produced a “to-do” page for Wordlink - Features we would like to have in the long term. Sent message to main POOLS-T people. Meeting at SMO with Alison Dix and Gordon Wells to overview POOLS-T work.

2009-01-22

Over the last few days continued work on the modernisation of dictionary search programs at SMO. Finished the Briathrachan Beag. Finished the first half (headword search) of the Dúil Bélrai. Tried hard yet again, unsuccessfully, to get a search string which will work to look up words in eDIL, the huge authoritative dictionary of Old-Irish.

Learning new techniques for debugging PHP programs: running them offline; watching PHP errors and notices in the Apache webserver log; switching them on and off with the likes of error_reporting(E_ALL & ~E_NOTICE) and ini_set. Corrected various longstanding errors.

Reading up on security techniques in PHP - Cross-site scripting (XSS) is potentially a serious problem, since we want scripts and everything from the original page to be processed and not filtered out. Something we need to think about.

Modified postair.php (the program which replaces GET queries with POST queries for dictionaries which require POST) to replace action="" with action="$url" - So that followup searches go to the dictionary rather than going to postair.php and failing.

2009-02-12

Over the last few weeks: Correspondence with the people who put the Gaelic dictionary Dwelly-d on the Internet; updating Gaelic dictionaries at SMO to PHP5; tests of HTML iframes as a possible alternative to frames - so far inconclusive; checked out the English-Greek dictionary Elizabeth found.

2009-02-13

A lot of work on building up a database of Internet dictionaries in a variety of languages, including fields for source language, target language, get/post methodology and the parameters required to call the dictionary.

2009-02-23

In the past week, adding to the database of Internet dictionaries, but also busy with some other work, not directly connected with POOLS-T, but with some overlap. Continuing converting Gaelic dictionary lookup programs to PHP5 (needed so they stay working when we convert our main web server to PHP5, and also needed for development work adding a lemmatisation step to lookups). Developing the college weblog system (so that now there is a “Log a-steach” (login) link on the college internal homepage). (This might be useful for extending to Wordlink later in the project, as a supplement to cookies for storing user preferences (favourite languages and dictionaries), and might even be necessary for limiting access to subscription dictionaries or to “open proxy” facilities.)

Various other semi-educational/semi-practical work. Accessed mySQL developer forums to pursue the problem of being able to do indexed dictionary retrievals for both case/accent dependent and independent lookups on the same field. Seems the best solution for the moment might be my current one of duplicate columns, but using trigger functions for automatic update.

Message from Kent to say Wordlink had stopped working. Put it right. (It was due to changes I made at the weekend in the location of PHP class libraries.)

2009-02-24

Produced new test version Wordlink2 with much better (I think) behaviour as regards frames. Frame behaviour in Opera had been fine, but Firefox and Internet Explorer do not seem to respect “target=” when linking from one frame to another. So the Wordlinked document was being generated (possibly out of sight) in a new window or tab instead of in a frame below the navigation frame as intended. Cured this by targeting _top and by parametrizing index.php so that it can regenerate both navframe and mainframe. Some people might prefer the other behaviour, though, so I need to ask the testers. It could possibly be made an option via a checkbox in navframe.

Brief project update meeting with Alison Dix and Gordon Wells.

2009-02-26

Request from Kent to try and analyse the POST parameters in the Korlex English-Bosnian dictionary and try and produce a GET request url for querying the dictionary. Did this and produced a url using postair.php and sent it to Kent. Added the Korlex dictionary to my database of dictionaries. Found it also does Bosnian-English and added this too.

Making progress in trying to program Wordlink to replace pre-existing links in the document with links to Wordlinked documents, rather than just the original. The idea being that people should be able to browse around in a Wordlinked world, the same as for BBC Vocab. Decided, though, that I really need to move from PEAR package HTTP_Request, which I am currently using, to HTTP_Request2, since the HTTP_Request documentation clearly says that it has been superseded and to use HTTP_Request2 instead. Attempting to do this, but so far unsuccessfully, due to lack of online documentation for HTTP_Request2.

2009-02-27

Succeeded in getting Wordlink working with HTTP_Request2. Hurray!

Got a version of Wordlink going which changes any pre-existing links in the document into links to Wordlinked documents. Another hurray! Produced a page with three test versions, Wordlink1, Wordlink2 and Wordlink3, illustrating the new features and sent out a message to testers. Messages coming in from some testers say Wordlink2 and Wordlink3 are not working and just giving an error message.

2009-02-28

Away for the weekend, but did some tests on Wordlink. Found it is not working well on pages with embedded iframes, since it doesn't Wordlink the iframe content. This should now be easy to sort.

2009-03-02

Back from weekend and more messages from testers saying Wordlink2 and Wordlink3 were just giving an error message. Found there was an error (a <?xml line which should not be there in a frameset page) which Internet Explorer was choking on but other browsers were letting through. Put it right.

Looked properly for the first time at the Wordlinked output produced by Internet Explorer and found that it is rather ugly, often with every word underlined or in blue or both. Found that this is because Internet Explorer, up to and including the current version, Internet Explorer 7, does not support the CSS “inherit” value, whereas all the other major browsers do: Firefox, Opera, Chrome, Safari, Konqueror. (See http://reference.sitepoint.com/css/inheritvalue.) Not sure what if anything can be done about this. Downloaded Internet Explorer 8 (currently available as a release candidate) and found that it supports inherit and looks ok.

2009-03-09

Working over the past week on a new version, Wordlink4. This overcomes a problem with certain pages which had their own CSS styles attached to links - All the wordlinked words were being given this style too, which made the whole document appear very ugly. Overcame this by rather cunningly giving the <body> tag a “wll” id, and then using a “body#wll a.wll” CSS selector in my rules to make sure they get given higher priority than the page’s own rules. So many Wordlinked pages now look much better.

A more major improvement behind the scenes, though, is that Wordlink4 now connects to a database and uses this for logging. This will be very useful for many things:

2009-03-13

Lots and lots of improvements to Wordlink over the past week. Improved the character set detection, following testing by Elizabeth. Now detecting Greek encodings much better in lots of pages.

Now following HTTP redirections by examining HTTP status codes from HTTP_Request2, retrieving the Location header, and resolving it if necessary. Moved test version Wordlink4 to be the “current” version of Wordlink, since it is now much better than Wordlink 1.
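
A minimal sketch of the redirect handling, in Python instead of PHP and HTTP_Request2. The set of status codes, the hop limit and the fetch callback are all assumptions for the example; “resolving” the Location header is taken to mean turning a relative value into an absolute URL.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_url(current_url, status, headers):
    """Return the URL to fetch next, or None if the response is not a redirect."""
    if status not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    if not location:
        return None
    # A relative Location is resolved against the URL that was just fetched.
    return urljoin(current_url, location)


def follow_redirects(url, fetch, max_hops=5):
    """Call fetch(url) -> (status, headers, body) until a non-redirect arrives.

    Returns the final URL and body; raises if more than max_hops redirects occur."""
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        target = next_url(url, status, headers)
        if target is None:
            return url, body
        url = target
    raise RuntimeError("too many redirects")
```

Returning the final URL along with the body matters for Wordlink, since that is the address relative links in the page need to be resolved against.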

Replied to message from Elizabeth about, in effect, the requirement for lemmatization. Lots of ideas on how we could approach the problem of lemmatization.

2009-04-04

Huge developments over the past three weeks, but too busy to write it up at the time. Wordlink now uses Multidict, a new general purpose interface to online dictionaries, based on a database of dictionaries with specifications of the parameters and methods required to access them, and of the source and target languages. Multidict can also cope with “multidictionaries”, online dictionaries or termbanks such as IATE and Sensagent which can deal with any source-target pair from a large set of possible languages. This means a huge increase in the number of languages which Wordlink can deal with, and great flexibility in adding more. Also a major rewrite of internal logic - Wordlink now makes great use of session id and session variables, stored in a database. Built up the database of dictionaries.

Lots of other things:- Added an option to remove existing document links so that these words too can be Wordlinked. Put all the program source online. Attended the week-long POOLS-T workshop in Brussels.

2009-04-07

E-mail from Kent saying that Wordlink was failing on some pages on the new SDE website. Tracked the problem down to ampersands within the URL, which Wordlink was confusing with the ampersands separating its own parameters. Cured this very simply by protecting ampersands by getting wordlink.php to convert them to "{and}" in links, and getting readGetvars in the WlSession class to convert them back again.
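
The ampersand protection amounts to a pair of string replacements. A sketch in Python follows; the function names and the “url” parameter name are invented for the example (only “sl” appears elsewhere in this log), and the real code lives in wordlink.php and the WlSession class.

```python
from urllib.parse import parse_qs


def protect_ampersands(url: str) -> str:
    """Hide ampersands in a page URL before it is embedded in a Wordlink link."""
    return url.replace("&", "{and}")


def restore_ampersands(value: str) -> str:
    """Undo protect_ampersands once the parameter has been read back."""
    return value.replace("{and}", "&")


def read_url_param(query: str) -> str:
    """Read the (hypothetical) 'url' parameter from a query string and restore it."""
    params = parse_qs(query)
    return restore_ampersands(params["url"][0])


page = "http://example.org/page?id=3&lang=fr"
query = "sl=fr&url=" + protect_ampersands(page)
print(read_url_param(query))   # the original page URL, ampersand intact
```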

Demonstrated Wordlink and Multidict to Màrtainn Dòmhnallach in Computing at SMO. Discussion about "lemmatization" needs and possibilities for Gaelic.

2009-04-12

Added the Pons dictionaries (pons.eu) to Multidict - These translate between German and each of English, French, Spanish, Italian, Polish and Russian; and also between English and French, English and Spanish.

2009-04-13

Added a new mechanism to Multidict to cater for multidictionaries such as Interglot which use their own language codes instead of standard codes such as fr, de, sv. This works via a new field in the dictLang table in the dictionaries database. Added the Interglot multidictionary (nl, en, de, fr, es, sv) to the dictionaries database.

Added the Pons Bildwörterbuch to the dictionaries database.

Added Google Translate as a “multidictionary” (with lemmatization!) covering 41 languages.

Added the Kypros Greek-English dictionary. Needs character encoding conversion added to postair.php.

2009-04-14

Set up a copy of postair.php at its new home at /multidict/postair.php. Converted it to use the new PEAR class HTTP_Request2 in place of the now deprecated HTTP_Request. Added character encoding detection and signaling in the HTTP response header, so the Kypros Greek dictionaries now work properly, as do other dictionaries in non utf-8 encodings.

Added Google Translate as a “multidictionary” in the database. This just gives a single translation with no further information - so from that point of view it is not good for language learning, and occasionally it can be badly wrong - but it does cater for 41 languages, is remarkably complete, and best of all it does lemmatization.

2009-04-15

Added a new feature to Wordlink. If you forget to specify the webpage language, it now attempts to guess this, using PEAR class Text/LanguageDetect, and invites you to confirm. Doesn’t recognise Gaelic or Greek, but otherwise seems quite successful.

Examined in detail Frans’s message from January on the Pools blog, and replied in detail to his observations from tests on Wordlink. Also started looking at the dicts.info dictionaries which he recommended - Very interesting.

2009-06-29

Busy with non-POOLS things over the last couple of months, but various maintenance and preparation work on POOLS as well.

Added various dictionaries to the dictionaries database, so that they are now available to Wordlink and Multidict. e.g. Added all the Leo dictionary pairs: German<->English; German<->French; German<->Spanish; German<->Italian; German<->Chinese. Added a Maori<->English dictionary on a recommendation.

Began work on a Gaelic lexical database (needed for lemmatization) - initially by rehashing Am Briathrachan Beag, the Gaelic schools dictionary which I typed in years ago. Consulted David Adger and Kevin Scannell on questions of structure. Reading up on Kevin Scannell’s grammar checker for many languages, An Gramadóir, which already includes some sort of Scottish Gaelic lexicon.

Reading up on some of Kevin Scannell’s suggestions for possible algorithmic lemmatizers (or “stemmers”) for various languages: Snowball, and a possible Greek stemmer.

Reading up on plans for better Unicode support in PHP6, when it finally arrives.

2009-07-18

Major restructure of the dictionary parameter database and the way Multidict works. Previously this had a “url” field (which contained everything up to the question mark in the case of GET queries), a “params” field (which contained either GET parameters - i.e. everything after the question mark, or else POST parameters), and a “method” field (which said whether the params were to be interpreted as GET parameters or POST parameters). Now it only has a “url” field (which also contains GET parameters, together with the question mark, if there are any), and a “pparams” field (which contains the POST parameters if any). There is no need for a “method” field. And it is now possible to send both GET and POST parameters to an online dictionary, which is occasionally useful.

Added the Lexicelt Welsh-Irish and Irish-Welsh dictionary to the database.

Increased the size of the url and pparams fields from 255 to 2046 bytes (needed for the large code parameter in Lexicelt).

2009-07-20

Google Translate had stopped working as a “dictionary” from Multidict because its parameter structure had changed. Updated its entry in the dictParams table to get it working again.

Lots and lots of thinking over the last few days on a modified target language and dictionary choice system for Wordlink and Multidict. Settled on a system which would try to remember the user’s last dozen or so sl|tl|dict usage combinations and choose from them. But if it didn’t find anything there, as in the case of cold calls for example, it would default to a system based, not on dictionary “qualities” assigned by me as at present, but on weightings based on an exponentially decaying average call rate with a “half-life” of probably about three months or so. Devised a logging system which could implement this efficiently and started work on it.
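
One way an exponentially decaying call weighting can be kept cheaply is to store just two numbers per dictionary: the current weight and the time it was last updated. The sketch below illustrates that idea in Python; it is not necessarily the logging system devised here, and the class name, the 90-day half-life and the use of day numbers as timestamps are all assumptions.

```python
import math

HALF_LIFE_DAYS = 90.0                          # assumed: "about three months"
DECAY_PER_DAY = math.log(2) / HALF_LIFE_DAYS   # so the weight halves every 90 days


class DecayedCallRate:
    """Running usage weight for one dictionary; old calls count for less."""

    def __init__(self):
        self.weight = 0.0     # decayed number of calls as of last_day
        self.last_day = 0.0   # time of the last update, in days

    def value(self, day):
        """Weight as seen on `day`, decayed from the last update (no mutation)."""
        return self.weight * math.exp(-DECAY_PER_DAY * (day - self.last_day))

    def record_call(self, day):
        """Decay the stored weight up to `day`, then add one for the new call."""
        self.weight = self.value(day) + 1.0
        self.last_day = day


rate = DecayedCallRate()
rate.record_call(0.0)
print(rate.value(90.0))   # one call, one half-life later
```

Because each update needs only the stored weight and timestamp, there is no need to rescan a log of individual calls in order to rank dictionaries.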

2009-08-29

Various things in the last week or so. Following a prompt from Kent, looked at the InterTran multidictionary and assessed it. Added it with some of its languages to Wordlink/Multidict. Character encoding issues with other languages - need to return and have another in-depth look later.

Made a major upgrade to software on the webserver, including mySQL, phpMyAdmin, most importantly PHP (upgraded to 5.3). Wordlink/Multidict depend on mySQL and PHP. Some things collapsed following the upgrade of PHP. Eventually tracked the problem down to the location and permissions for the handling of temporary files (e.g. for sessions), which needs to be secure these days. Put this right by creating a separate folder in /tmp for PHP temporary files. Then found Wordlink/Multidict had collapsed with a second problem. Tracked this down to PHP not finding PEAR libraries. Put this right in php.ini.

Reading up on new features in PHP. Reading up on PHP’s PDO database interface (now recommended), and considering whether to switch to using this.

2009-09-01

Read up on and installed the PMA tables and special features in phpMyAdmin - something I have been meaning to do for a long time. Looks good, especially the “Designer” tool. These special features give extra tools for managing and documenting the databases on which Wordlink/Multidict depends.

Carried out tests of the PDO interface between PHP and mySQL. These worked and look good. PDO is now the recommended interface between PHP and mySQL, instead of the mysqli interface which I am currently using. It looks as if it gives somewhat cleaner and clearer code which would be good. As well as that it gives greater flexibility in choosing the underlying database (very easy to change from mySQL to something else), which would give greater flexibility if we ever wanted to move the system to another host or produce a mirror somewhere other than at SMO. Thinking about converting the programs to use PDO and wondering how much work this would be and whether this is the right time.

2009-09-02

Added the MultiTrans (Мультитран) dictionary, www.multitran.ru to the Multidict database. This gives Russian translation to and from 12 other languages, including English, German, Dutch, French, Italian; and also German translation to and from English.

Corrected a bug introduced by the restructuring mentioned in the note of 2009-07-18. Following this restructuring I had been sending all dictionary queries to the dictionary as a POST query (with a suitable URL if there were GET parameters), since some dictionaries take POST parameters, some take GET parameters, and some can take both. However, it turns out that the Leo dictionary is not happy with a query arriving as POST - and the same may be true for other dictionaries. Corrected the problem by setting the request method to GET if the dictionary takes no POST parameters.
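
The rule is simple enough to show in a few lines. This is a Python sketch, with the function name and the returned dictionary layout invented for illustration; “url” and “pparams” are the two database fields described in the note of 2009-07-18.

```python
def build_request(url: str, pparams: str) -> dict:
    """Decide how to call a dictionary from its 'url' and 'pparams' fields.

    The url already carries any GET parameters. POST is used only when there
    are POST parameters to send; otherwise the query goes out as a plain GET."""
    if pparams:
        return {"method": "POST", "url": url, "body": pparams}
    return {"method": "GET", "url": url, "body": None}


print(build_request("http://dict.example/search?q=cat", ""))
print(build_request("http://dict.example/search?lang=en", "word=cat"))
```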

Finally got Google Translate (using it as a “dictionary”) to deal properly with words with accented characters. Found that it needs an additional parameter, “ie=utf8”, to make it interpret the word as utf-8, otherwise it treats it as Latin-1.

2009-09-03

More work on incorporating the MultiTrans (Мультитран) dictionary into Wordlink/Multidict. This doesn’t use utf-8 at all for its parameters. Set the encoding to cp1251 when the source language is Russian, which seems to be what it uses. For Latin script source languages, however, I worked out that it encodes non-ascii characters as html decimal entities ('é' as '&#233;' for example), which it further urlencodes. Set up a mechanism for handling this, with a new flag field in the dictParams table. This might be useful for other dictionaries too in the future.
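
The two encodings can be sketched as follows, in Python rather than PHP. The function name and the use of the source language code as the switch are assumptions for the example; the real mechanism is driven by the flag field in dictParams.

```python
from urllib.parse import quote


def encode_query_word(word: str, source_lang: str) -> str:
    """Encode a search word the way this dictionary appears to expect it."""
    if source_lang == "ru":
        # Russian: percent-encode the cp1251 bytes instead of utf-8 bytes.
        return quote(word.encode("cp1251"), safe="")
    # Other (Latin script) languages: turn each non-ascii character into an
    # html decimal entity, then urlencode the resulting ascii string.
    entity_form = word.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return quote(entity_form, safe="")


print(encode_query_word("café", "fr"))   # caf%26%23233%3B
print(encode_query_word("да", "ru"))     # %E4%E0
```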

Spotted that MultiTrans also does English to Japanese translation and added this to the table.

2009-09-04

Returned to the InterTran multidictionary which Kent tipped me off about and which I mentioned in my note of 2009-08-29. This uses different character encodings for each source language (yeuch! - Life will be much simpler when everyone uses utf-8). Added a new field to the dictLang table and a new mechanism to Multidict to cope with this. Worked out what encoding it uses for each of the 23 languages it handles and added these to the table. This gives translation between 23x23 language pairs (in theory at least).

Came across the Koralsoft Eurodict family of dictionaries which give translation between Bulgarian and 7 other languages (including Greek), and between Turkish and 3 other languages; and also the ABBYY family of dictionaries, which give translation between Russian and 7 other languages. Threw these into the system (very easy since they use utf-8 encoding).

Added a timeout to Multidict so that it times out and gives an error message if the dictionary fails to respond in 12 seconds. Previously it could just hang waiting for a very long time if the dictionary happened to be down and failed to respond. (Perhaps this was what caused one comment in the Swiss team testing that Multidict “crashed the computer”)

2009-09-05

Added “robots=noindex,nofollow” to wordlink.php, just in case Google starts going daft and indexing millions of Wordlinked pages.

Produced a list of the languages which Google Translate can now handle and compared this with the existing list. Found that Google Translate now handles 10 new languages (a 25% increase in six months), namely: Afrikaans, Belarusian, Irish, Icelandic, Macedonian, Malay, Persian, Swahili, Yiddish, Welsh. Added these to the dictLang table used by Multidict. Found that Google Translate uses an obsolete code for Hebrew (‘iw’ instead of ‘he’) and incorrect macrolanguage codes instead of the specific language codes for Malay (‘ms’ instead of ‘zlm’) and Persian (‘fa’ instead of ‘pes’). Added translation between the correct and incorrect codes to dictLang.
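
The code translation is essentially a per-dictionary lookup table with a fall-through to the standard code. A Python sketch follows, using the three Google Translate codes mentioned above; the table and function names are made up, and in Multidict the mapping is held in the dictLang table rather than in program code.

```python
# Standard code -> code the dictionary itself expects (from the note above).
GOOGLE_CODES = {"he": "iw", "zlm": "ms", "pes": "fa"}


def to_dictionary_code(lang: str, overrides: dict) -> str:
    """Translate a standard language code into the dictionary's own code."""
    return overrides.get(lang, lang)


def from_dictionary_code(code: str, overrides: dict) -> str:
    """Translate a dictionary's own code back into the standard code."""
    reverse = {own: standard for standard, own in overrides.items()}
    return reverse.get(code, code)


print(to_dictionary_code("he", GOOGLE_CODES))    # iw
print(to_dictionary_code("fr", GOOGLE_CODES))    # fr (no override needed)
print(from_dictionary_code("ms", GOOGLE_CODES))  # zlm
```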

Cured a bug in multidict.php and in wordlink.php whereby blank_page.html (the result of a query with no word or url specified) was apparently being cached by the browser, resulting in subsequent queries giving no result. Did away with the file blank_page.html and instead got the program to generate a blank page dynamically with pragmas to prevent caching.

Added to Multidict a couple of English monolingual dictionaries which were recommended at the March Brussels workshop, namely the Oxford Advanced Learner’s Dictionary and the Cambridge Advanced Learner’s Dictionary. Had a look at the Zanichelli monolingual Italian dictionary which was recommended by the Swiss team, but this is subscription and 30-day trial only, so could not add it.

2009-09-09

Looked closely at the dictionaries at www.magenta.gr. These look like excellent dictionaries, Greek to and from English, German, French, Italian, Spanish, Ancient Greek, etc. And they seem to do lemmatization, which would be great. Unfortunately, however, they turn out to be limited to five uses each per day - they are just tasters for CDs. So that turned out to be a bit of a waste of time.

Devising a cunning plan to make use of the huge number of old, out of copyright dictionaries which are now appearing on the Internet in page image format at the Web Archive and at Google Books. Even though page image format is clumsy, it at least has no advertising whereas most online dictionaries have so much advertising that it can be difficult to see the results. And if you could get straight to the appropriate page for the word, it bypasses the lemmatization problem. (At least, it does in the case of end-of-word variation, which is what most languages have, and in the case of Irish and Scottish Gaelic any start-of-word mutation can be removed algorithmically.) In both the Web Archive and Google Books, individual pages can be addressed via a page number in the url. The problem is finding the appropriate page for any given word. I think this could be dealt with by storing for each dictionary a table giving the first word on each page - not too big a task to set up - and the program would do a binary search on this to determine the page.
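
The binary search idea can be sketched with Python's bisect module. The function name and the sample word list are invented for the example, and it assumes the first words are already normalised (for instance lowercased) so that plain string comparison agrees with the dictionary's alphabetical order.

```python
from bisect import bisect_right


def page_for_word(first_words, word):
    """Return the 1-based page on which `word` should appear.

    first_words[i] is the first headword on page i + 1, in sorted order."""
    # Count how many pages start at or before the word; the last of these
    # is the page that should contain it.
    index = bisect_right(first_words, word)
    # A word sorting before the very first headword is sent to page 1.
    return max(index, 1)


first_words = ["a", "bad", "cat", "dog"]
print(page_for_word(first_words, "bee"))   # 2: the page starting at "bad"
print(page_for_word(first_words, "cat"))   # 3: the page starting at "cat"
```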

2009-09-10

Implemented yesterday’s cunning plan and it works! Wrote a short, simple PHP program which takes a dictionary parameter and a word parameter and redirects to an image of the appropriate page where that word can be found. All it needs is a table giving the first word on each page of the dictionary. It is super-flexible - It works with the Web Archive, with Google Books, and with other systems. It can cope with multi-volume dictionaries - You just specify the URL for each volume. This is awesome! It is going to make all these old dictionaries really usable on an everyday basis.

Didn’t in fact bother with the binary search, but left it to a SELECT query on the database to determine the last firstword (initial word on a dictionary page) which comes before the search word. Did this on the grounds that (1) databases are so fast that a query examining a thousand records will be processed internally in no time; and (2) the database might even, who knows, be clever enough to implement the query by doing a binary search internally.
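A rough Python equivalent of that query, using an in-memory SQLite database so that it is self-contained. The table name, column names and sample rows are assumptions made for the sketch; the real system uses PHP and mySQL and its schema is not shown here.

```python
import sqlite3


def build_index(rows):
    """Create a throwaway page index; rows are (dict, page, firstword) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pageindex (dict TEXT, page INTEGER, firstword TEXT)"
    )
    conn.executemany("INSERT INTO pageindex VALUES (?, ?, ?)", rows)
    return conn


def lookup_page(conn, dict_id, word):
    """Find the last firstword which comes at or before the search word."""
    row = conn.execute(
        "SELECT page FROM pageindex "
        "WHERE dict = ? AND firstword <= ? "
        "ORDER BY firstword DESC LIMIT 1",
        (dict_id, word),
    ).fetchone()
    # None means the dictionary is unknown or the word sorts before page one.
    return row[0] if row else None
```

An index on (dict, firstword) would let the database answer this without scanning every row, which is the "binary search internally" possibility mentioned above.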

2009-09-12

Major work over the last couple of days, both improving the implementation of the dictionary page-image search, and also making it available for various Gàidhlig dictionaries.

Added an ‘inst=’ parameter to dictpage.php, making it possible to switch between different copies (“instances”) of the page images (e.g. WebArchive/GoogleBooks/SMO) while using the same page index.

When I added page-image dictionaries to Multidict, found that they worked with images at GoogleBooks and at SMO, but not with images at the WebArchive which is where most of the dictionaries are. Tracked the problem down to the ‘#’ which appears in the middle of url’s for pages on the WebArchive. Got round it by using an HTTP redirect for page-image dictionaries, rather than returning the result to Multidict for processing as I do for other dictionaries. Decided that the page-image dictionaries are all utf-8 compliant anyway (at least, WebArchive and GoogleBooks are) and so didn’t need any postprocessing to massage the character encoding.

For two Gàidhlig dictionaries which I had typed into the computer myself 20 years ago, namely Am Briathrachan Beag and MacBain’s etymological dictionary, I was able, by editing the original files, to produce a table giving the first word on each page. This enabled me to make them available in page-image format to WordLink/Multidict. For Am Briathrachan Beag, we had page images available on the SMO website. Made both this and the Web Archive copy available via Wordlink/Multidict, for test and comparison purposes.

For Dwelly (the main Gàidhlig-English dictionary), though, I had to spend many hours (about 7 at a guess), typing in the first word of each of the 1000 pages. Did this and it works a treat.

Added a link to Wordlink for the first time to the SMO internal homepage, http://www.smo.uhi.ac.uk/dachaigh/, and also to the SMO page listing other Gàidhlig organisations. While testing it, noticed a bug whereby internal links on a Wordlinked page (ones starting with ‘#’) did not work because they had been Wordlinked like everything else, when they shouldn’t have been. Cured the bug by amending the wordlink.php program.
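The fix for internal links amounts to a test like the following Python sketch. The base address and the function name are hypothetical; the actual change was made in wordlink.php and may be organised quite differently.

```python
from urllib.parse import quote

# Hypothetical address of the wordlinking service, for illustration only.
WORDLINK_BASE = "http://example.org/wordlink/?url="


def rewrite_href(href):
    """Route a link through the wordlinker, except same-page '#' links."""
    if href.startswith("#"):
        # Internal anchors must stay as they are so that they still jump
        # within the page that has already been wordlinked.
        return href
    return WORDLINK_BASE + quote(href, safe="")
```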

2009-09-12

Added Dinneen’s dictionary to the Multidict database. This is the classic Irish Gaelic to English dictionary, published in 1927. It is available in page image format from the University of Limerick, which bypasses the lemmatisation problem - which is just as well since it is in the old pre-reform spelling. However, this is actually good for Scottish Gaelic, since the old Irish Gaelic spelling was actually much closer to Scottish Gaelic spelling than is the reformed spelling. So I added Dinneen to the Scottish Gaelic dictionaries as well as the Irish Gaelic.

The University of Limerick site with the images of Dinneen’s dictionary includes a sophisticated search and navigation system, so there was no need, I thought at first!, to make use of my new facility to find page images using an index of the first word on each page. However, I came across a page detailing the index of “first words” used by the University of Limerick facility and found it to be very seriously faulty - which explained something that I had been vaguely aware of in the past, that the site very frequently takes you to the wrong page. All they were using for an index was the list of page headers from the book, which was faulty in all the following ways: (1) the keys were far too short, usually just the first two to five letters of the word; (2) they preserved hyphens, whereas the ordering in the dictionary ignores hyphens; (3) they used ‘h’ for lenition, whereas the dictionary uses a diacritic which gives a different ordering; (4) they OCRed the page headers, with very frequent faults. It is a shame that such an excellent facility is spoiled somewhat by something which could be so easily cured. Decided because of this to also provide an alternative facility linking to the bare page images and using a copy I took of the page index and cleaned up partially - only partially, but still much much more accurate than the original.

2009-09-13

Realised that the official page index to Dinneen is worse than I first thought, and decided that I ought to use my cleaned up page index for linking to the official pages as well as the bare images. To do this, I had to add a new field and mechanism to the page image lookup facility to cope with lookups which require POST as well as GET parameters. This mechanism might be useful for other dictionaries too in the future.

Spent some time testing Wordlink myself on Gàidhlig pages. Fixed a long-standing bug in the handling of the “Remove existing links” checkbox.

2009-09-17

Found another copy of Dwelly’s dictionary on the Web Archive. Threw this into my system as well - very easy since the page index is already in there. The different instances become searchable via URLs such as:
  http://www.smo.uhi.ac.uk/multidict/dictpage.php?dict=Dwelly&inst=1&word=...
  http://www.smo.uhi.ac.uk/multidict/dictpage.php?dict=Dwelly&inst=2&word=...
Found that the second instance actually seemed to have clearer type with fewer faint and broken letters, so swopped them and made this the default.

2009-09-17

At Gordon’s suggestion, added Colin Mark’s Gaelic-English dictionary to Multidict. This is a new book, only available in Preview format from Google Books, so full of holes. We might not keep it in the system, therefore, although it has enough information to be useful in some circumstances. Adding the dictionary required a couple of new tricks to be added to the system - including a flag to strip accents from the search word, because Google Books doesn’t like accents.
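Stripping accents from a search word can be done with Unicode decomposition. The following Python sketch shows one common way; it is not necessarily how the flag is implemented in Multidict, and the function name is made up.

```python
import unicodedata


def strip_accents(word):
    """Remove combining accents, e.g. for sites that reject accented queries."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)
```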

I am devising a system to allow dictionaries to remain in the system but be “disabled” temporarily, perhaps with a user option to switch on and off the less useful dictionaries.

Not directly related to my mainstream POOLS-T work, but I revived an old system, my Celtic Cognates Database, the search program for which had not been working since moving to a new server running PHP5. Completely rewrote it, and took the opportunity to learn and use the new PDO interface between PHP and database systems such as mySQL. This is so good that after the Brussels workshop I want to move all my programs - Wordlink, Multidict, etc. - to it. It is only about three years since I was converting all my programs to the mysqli interface, which was then the bee’s knees, but PDO is a lot better. Things are moving fast. Changed the Celtic Cognates Database search to make use of Multidict instead of its own dictionary system. This has greatly simplified the program and improved it for the user. It shows the power of the Multidict approach.

2009-09-19

Conversations and tests over the past week over the Scottish Parliament’s official Gaelic<->English dictionary, “Faclair na Pàrlamaid”, which is one of the Gaelic dictionaries I use in Multidict. The Parliament’s Gaelic officer notified me and their technical support contact that it had stopped working via my system, which is how a lot of people use it. After investigating, I found out that their system had become very much more complicated behind the scenes, much more complicated than most dictionaries on the Internet, and that it would need a lot of work to incorporate it into the system, if it turned out to be possible at all. So I sent off a message to the technical contact explaining the situation, and asking if it would be possible for them to provide a simpler “hook” for linking to the dictionary for other systems. (Dwelly’s dictionary www.dwelly.info is a very similar system, which had the same problems, and when I asked Will Robertson, one of the pair who put the dictionary on the Internet, whether he could provide a simpler hook, he came up with it straight away.) No reply yet from the Faclair na Pàrlamaid people, so I’ll probably have to delete it from the Multidict system.

Discussions with Will Robertson about the possibility of him providing a “neighbouring words” hook to www.dwelly.info, so that if it doesn't find your search word it at least returns nearby words, which usually include what you want, thereby overcoming the lemmatisation problem. I need to do the same for the dictionaries/termbanks I look after, the Stòr-dàta Briathrachais Gàidhlig and the Briathrachan Beag, etc.

2009-09-22

After a great struggle, purely because I hadn't done file uploads via PHP before and didn’t realise you had to follow their instructions exactly to get the file uploaded past their anti-hacker mechanisms, I succeeded in getting a file upload facility added to Wordlink. This will ask you to select an html file from your own computer. After uploading, it is wordlinked and you can save the wordlinked file back to your own computer if you want.

Following a suggestion from Will Robertson that we cooperate with Sámi language people in Tromsø, where he is at the moment, added the Freelang Sámi-English dictionary to Wordlink. In contrast to the uploads this took only a few minutes, and it works for at least a proportion of the words on the Sámi Wikipedia.

Gave demonstrations of Wordlink/Multidict to two batches of college students (about 25 in all), and to three of the teaching staff. All seemed to think it would be useful. A couple of things which the teaching staff were looking for were an upload facility for material which is not on WWW, and for the system to work with material in password protected systems, such as the Virtual Learning Environment which UHI uses (Blackboard). It doesn't do this yet, which means that it doesn't work with social networking sites either. Other things they asked about were the possibility of wordlinking PDF files (totally impossible) or Word files (might just be possible, but not now - Maybe a Word macro?).

During the demonstrations, I encountered a problem while looking up the word “muinntir” in the page image version of Dwelly at the Web Archive. It turns out that the problem was that I had switched to a clearer scan of Dwelly which I found on the Web Archive, and it turns out this is not totally identical to the first. While the pages are identical, they are split into three volumes in slightly different places. Put the problem right.

Correspondence and co-operation with Will Robertson, who is attempting to set up a new “hook” to Dwelly for us which will return neighbouring words rather than simply returning blank when asked for a word (usually a wordform) which it doesn’t have. Added this to Multidict as a test version. In the process, added a new mechanism to Multidict to simply redirect (i.e. HTTP redirect) to dictionaries which work better like this (which I now flag in a new field in the dictionaries database), as opposed to other dictionaries which require a POST query and capture of the results, or still others which require a character encoding change on the results.

Added a checkbox to Wordlink to make it easy to choose to upload a file. Added an “encoding” select option to the file upload form, to allow non-utf-8 encodings to be converted and processed correctly.
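The conversion step behind the “encoding” option is essentially a decode followed by a re-encode. A minimal Python sketch, assuming the user has picked the correct source encoding on the form (the function name is invented):

```python
def to_utf8(data, encoding):
    """Convert uploaded bytes from the declared encoding to utf-8."""
    # Decoding raises UnicodeDecodeError if the declared encoding is wrong
    # for these bytes, which is better than silently corrupting the text.
    return data.decode(encoding).encode("utf-8")
```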

2009-09-23

Added to two Gaelic dictionaries/termbanks I have control of (Stòr-dàta Briathrachais and Am Briathrachan Beag) the feature whereby if a search word is not found then a list of neighbouring words is displayed which exist in the dictionary and which the user can click on. This makes them much more useful for use with Multidict, compared to before when they would simply return blank if presented with a wordform which was not a headword. This is another ploy to get round the current lack of lemmatisation, although it would still be useful even if we had lemmatisation.
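The neighbouring-words fallback can be illustrated with a sorted headword list. This Python sketch assumes the headwords are already sorted in plain string order and uses an arbitrary window of two words either side; the real termbanks presumably do the equivalent with database queries.

```python
from bisect import bisect_left


def neighbours(headwords, word, span=2):
    """Return [word] if it is a headword, else the headwords around its slot."""
    # Position where the word is, or where it would be inserted.
    i = bisect_left(headwords, word)
    if i < len(headwords) and headwords[i] == word:
        return [word]
    # Up to `span` headwords before and after the insertion point.
    return headwords[max(i - span, 0): i + span]
```

Because inflected forms usually share their opening letters with the headword, the headword tends to fall inside this window.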

Added a javascript popups option to Wordlink, whereby the dictionary page is opened in a popup window rather than being opened in the same tab. This is a feature requested by users, although I prefer myself to just use mouse gestures to flick back. It seems to be working well on the whole, although there is the question of what size the popup window should be - need to ask users - and the question of how to make it that size (and position) in all the different browsers. Currently Wordlink is not remembering your choice properly when you follow links - Need to investigate and sort this.

Added a primitive “Compose” facility to Wordlink, allowing text to be copy-and-pasted in and then wordlinked.

2009-10-10

Major improvements to Wordlink/Multidict following workshop in Brussels. Added the www.vertalen.nu n×n dictionary recommended by the Dutch team which does nine languages with lemmatisation!!! Wonderful. Also added the Danish-Danish dictionary, Ordbog over det Danske Sprog, and the Ordbogen subscription dictionary.

Added a new option to Wordlink to open the dictionary in a new tab. (Reorganised the navigation frame to make room for the many options which now exist.) By popular request made this new option the default, although I don't like it myself (since I use Opera and mouse gestures and can go back easily in the same tab and find new tabs a nuisance).

Following a suggestion by Soren in the restaurant, added a new “splitscreen” option which enables the dictionary and page to be viewed side by side. This is far, far nicer to use than any other option, provided you have a fairly wide screen, most especially because you don’t lose track of the place where you are reading in the text. Provided too that you are using Opera, Safari or Chrome, because unfortunately it does not work with Internet Explorer or Firefox. Something to work on.

2009-10-11

Worked on Wordlink so that it can now process pages which use frames - e.g. http://www.smo.uhi.ac.uk/gaidhlig/corpus/samhlaidhean/gla-gle.html. At any rate, it works with the well-behaved pages using frames.

2009-10-12

Added archive copies of the current Wordlink and Multidict and all associated programs and databases to the folder http://www.smo.uhi.ac.uk/~caoimhin/obair/pools-t/wl/ (simply for the sake of openness of program source, as promised in the POOLS-T project application, and also to make sure that they are easily available for others to continue should I suddenly get run over by a bus).

Added an ‘accept="text/html"’ attribute to the file upload form in Wordlink, so that when you are browsing for an html file to upload, it will only display html files. This was a suggestion which someone made at the Brussels workshop. I can’t remember who it was now, but thank you.

2009-10-15

Got the splitscreen method in Wordlink working for Internet Explorer and Firefox - after a lot of reading up, although the trick turned out to be really simple: just give the dictionary frame a ‘name="dict"’ attribute as well as an ‘id="dict"’ attribute. Not sure why I was using id= rather than name= in the first place, but there is/was probably some good reason(?).

This is really good! It might even mean that I should change Wordlink to use splitscreen method by default.

2009-10-16

Testing something requested by Gordon in particular: some way of keeping a video stationary in a webpage while allowing the text to scroll, so that you can watch the video right through while reading the transcript. Did some tests on one example of Textblender output with an embedded video. Currently, these are at: http://www.smo.uhi.ac.uk/~caoimhin/ceolaschair/. Got it working using divs (avoiding using frames).

2009-10-26

Changed Wordlink to default to “splitscreen” mode instead of “new tab” mode. Everyone so far seems to like splitscreen mode. Further reports of this today from Kent from two workshops he gave. And I haven’t enjoyed new tab mode myself - I keep thinking the program is not working because the dictionary output has gone off to some tab I had forgotten was open, although it might be possible to cure this problem with some Javascript which would change the focus to the dictionary tab.

Also changed the name of the dictionary mode parameter from “pUps” to the more appropriate “mode”, and changed it from having numeric values to more meaningful string values. So instead of pUps=0,1,2,3, we now have mode=‘nt’,‘st’,‘pu’,‘ss’, respectively, for the modes “new tab”, “same tab”, “popups”, “splitscreen”.
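The correspondence between the old and new parameter values can be written out as a small table. In this Python sketch the fallback to the old ‘pUps’ parameter is an assumption added for illustration (the log does not say whether old-style links are still honoured), and the default follows the change to splitscreen described above.

```python
# Old numeric pUps values and their new string equivalents, as listed above.
PUPS_TO_MODE = {0: "nt", 1: "st", 2: "pu", 3: "ss"}
MODE_NAMES = {
    "nt": "new tab",
    "st": "same tab",
    "pu": "popups",
    "ss": "splitscreen",
}
DEFAULT_MODE = "ss"


def resolve_mode(params):
    """Pick the dictionary mode from request parameters (a dict of strings)."""
    mode = params.get("mode")
    if mode in MODE_NAMES:
        return mode
    # Assumed backward compatibility: translate an old-style pUps value.
    try:
        return PUPS_TO_MODE[int(params["pUps"])]
    except (KeyError, ValueError):
        return DEFAULT_MODE
```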

Changed the default width ratio, webpage:dictionary, for splitscreen from 50%:50% to 60%:40% - which is very suitable indeed for m.vertalen.nu and tolerable too for the Briathrachan Beag page-image which is now the default Gaelic dictionary, although it leaves rather little space for most other dictionaries.

Following a report from Kent that Wordlink/Multidict was not making use of Vertalen for Swedish, found that the problem was that I had used the wrong code for the Swedish language, ‘se’ instead of ‘sv’ in the parameters for the Vertalen dictionary. Corrected this.

Found that I had accidentally deleted the Freelang Sámi-English dictionary from the dictionaries database. Restored the record from a backup copy, although the dictionary is actually too small to be of much use.

2009-10-27

Looking for a way of switching off the search pane in the Web Archive bookreader, because it is just wasting space - space which is very precious, especially in splitscreen mode. e.g. Try the Gàidhlig Wikipedia with Dwelly page-images. So far unsuccessful. The bookreader doesn't seem to have a parameter for this yet. But left a request on this bulletin board and also on this blog.

2009-10-29

Report from Kent of a problem with Wordlink when you use it to Compose a new page. The page was appearing with every word shown blue and underlined in Internet Explorer, though not in other browsers. Found that the problem was caused by Internet Explorer defaulting to its old buggy handling of CSS in its non-strict mode, and that it was cured by adding the following lines to the top of the file:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">

Report from Elizabeth that Wordlink was not remembering the “Dictionary in” mode you chose, but instead was reverting to “Splitscreen” all the time. Found that this was due to a bug in my storeVars function. When I changed from using a numeric ‘pUps’ parameter to using a string valued ‘mode’ parameter, I had forgotten to change the first parameter of bind_param from ‘i’ to ‘s’. Corrected this.

2009-11-06

Got Multidict working with the Dictionary of the Scots Language. The parameters I had previously been using were linking to an old interface program of mine which no longer worked.

2009-11-09

Mail from Søren with screenshot of a glitch - “Go” button out of sight in Multidict. Now ok so must just have been a glitch. Reduced the width of the “Source language”, “Target language” and “Dictionary” select boxes in Multidict, though, to reduce the chances of the “Go” button being off screen in splitscreen mode if the user has a narrow screen. This means that the ends of some of the longer language and dictionary names will be lost, but I don’t think that will be much of a problem.

Added Bokmålsordboka og Nynorskordboka - Norwegian Bokmål and Nynorsk dictionaries - to the Multidict database. Separated the language codes for Norwegian into nb for Bokmål and nn for Nynorsk. Previously I had been using no which is now a macrolanguage.

2009-11-17

Added the Greek-Greek (Τριανταφυλλίδη) and Greek-English (Γεωργακά) dictionaries from http://www.greek-language.gr/greekLang/modern_greek/tools/lexica/ to Wordlink/Multidict. They look like very good dictionaries, but probably with too much information to be really useful in Wordlink.

Added EUdict (mobile version) - 49 language pairs to Multidict. Not as good as Vertalen, but it is concise like Vertalen, and it has some languages which Vertalen does not have, e.g. Norwegian, Hungarian, Finnish, and lots and lots of Croatian. Does not do lemmatisation (so again nowhere near as good as Vertalen), but does suggest similar words, which is a very good fallback.

2009-11-18

Demoted EUdict slightly by reducing its quality rating, following feedback from Kent.

Again following feedback from Kent, improved the Compose mode in Wordlink so that it now creates a new paragraph when it encounters a blank line in the input. Also so that it now escapes HTML special characters, & < >.
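The two Compose improvements can be sketched together in Python. The function name and the exact HTML produced are assumptions; the point is only the order of operations: split on blank lines first, then escape & < > inside each paragraph.

```python
import html
import re


def compose_to_html(text):
    """Turn pasted plain text into escaped HTML paragraphs."""
    # Normalise Windows line endings so that blank lines are detected reliably.
    text = text.replace("\r\n", "\n").strip()
    # One or more blank (or whitespace-only) lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n".join(
        "<p>" + html.escape(p, quote=False) + "</p>"
        for p in paragraphs
        if p.strip()
    )
```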

2009-12-05

Added the German-English dictionaries Vokabelsalat and dict.cc to Wordlink/Multidict. Also added the Beolingus set of dictionaries: German-English; German-Spanish; German-Portuguese. Also Neurolingo’s Lexiscope, a Greek-Greek lemmatizer and word information tool.

Pulled in Morph-it, the Italian lexical database. Looks great. I could make good use of this in Wordlink/Multidict to provide an intermediate lemmatization stage in between the webpage and the dictionary lookup for the sake of dictionaries which do not incorporate lemmatization. First of all, though, this needs some additional framework which I need to plan out and program.

2009-12-06

Following a suggestion from Frans and Kent, added to Multidict the Czech-English, Czech-German, Czech-French and Czech-Spanish dictionaries at www.wordbook.cz. These look really excellent. Most importantly, they seem to do lemmatization.

2009-12-10

Added the Perseus Latin-English dictionary and the Perseus Word Study Tool (lemmatization!) to Multidict. Same for Perseus Ancient Greek to English. Same for Perseus Old-Norse to English and Arabic to English although these aren’t so great. Added Latin Wikipedia and Ancient Greek Pater Noster to the Wordlink examples page.

2009-12-15

Added the Cornish-Welsh dictionary (produced by another Kevin Donnelly - someone with the same name as myself!) to Wordlink/Multidict. This is the first online dictionary I have found for the Cornish language.

2009-12-17

Added two Dutch monolingual dictionaries recommended to me today by Ellen and her students: www.vandale.nl and www.synoniemen.net.

Found that m.vertalen.nu is not working. It is displaying a programming error. I don’t know how long it has been like that, but it can only be a few days at most, so hopefully it will be fixed before too long. In the meantime, pointed “Vertalen” in the database at www.vertalen.nu, and inserted a new “low quality” entry, “m.Vertalen”, pointing at m.vertalen.nu.

Some of the multidictionaries in the database can function monolingually, e.g. Sensagent will happily work as a Spanish-Spanish or French-French dictionary. But others such as Vertalen and Mijnwoordenboek will not work monolingually. Up till now Multidict has been selecting them, or allowing them to be selected, for monolingual work and they have been failing. Added a new indicator to the dictionaries database (an ‘x’ instead of a ‘¤’ in the target language field in the dictParam table) to distinguish those multidictionaries which do not work monolingually, and altered the program so as to exclude them.
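One possible reading of the ‘¤’ and ‘x’ markers, sketched in Python. It assumes that ‘¤’ in the target language field means "any target language" and that ‘x’ means "any target language except the source language"; the function name is invented and the actual test in the program may differ.

```python
ANY_TARGET = "¤"      # assumed meaning: works for every target language
ANY_BUT_SOURCE = "x"  # assumed meaning: works only when target != source


def target_matches(row_tl, sl, tl):
    """Decide whether a dictParam-style row can serve the pair sl -> tl."""
    if row_tl == ANY_TARGET:
        return True
    if row_tl == ANY_BUT_SOURCE:
        # Excludes monolingual lookups such as es -> es.
        return tl != sl
    # Ordinary rows name one specific target language.
    return row_tl == tl
```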

2010-01-08

Added the Tritrans trilingual Norwegian-Spanish-English dictionary to Multidict. This refuses to work if there is no referrer information in the http request, so modified multidict.php to generate a “Referer” header.
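Supplying a “Referer” header on an outgoing request looks like this in Python. The URLs are placeholders, and the sketch only builds the request without sending it; multidict.php does the equivalent in PHP.

```python
from urllib.request import Request


def build_request(dict_url, referer="http://example.org/multidict/"):
    """Prepare a dictionary request carrying a Referer header."""
    # Some sites refuse requests that arrive with no referrer information.
    return Request(dict_url, headers={"Referer": referer})
```

The request object would then be passed to urllib.request.urlopen to fetch the dictionary page.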

Message from Frans to say that m.vertalen.nu is now working again - Great! - so reversed the temporary changes I made before Christmas.

2010-01-20

At Elizabeth’s request, altered the priorities on the Greek dictionaries to make Sensagent the first choice.

2010-01-22

Added the Welsh-English Eurfa dictionary to Multidict.

2010-01-26

Sorted a problem with the “encoding” parameter in Multidict, notified to me by Kent. I had introduced this to Multidict so that it could be used programmatically, and in particular so that it could be used from the TextBlender, which currently uses 8-bit encodings. However, I had only added it to /multidict/multidict.php, not to the version including the navigation frame, /multidict/, which the TextBlender now uses. Put this right.

2010-01-29

Added a new feature to Wordlink/Multidict whereby they remember the previous session id (sid) via a cookie (they had never made any use of cookies before), and they use this to fill in any target language and dictionary choices which are not directly specified in the current call. This was in response to someone (Frans if I remember) who pointed out that while the Textblender now makes use of Multidict, and Multidict helpfully allows the user to change dictionary, this was not being remembered and the user was having to change the dictionary repeatedly for every single word.

(Prior to this, experimented with a version where Wordlink/Multidict not only remembered the previous sid but reused it. This didn't turn out to be a good system, though. It was remembering too much - the previous page and the previous word, which was disconcerting when you returned to Wordlink/Multidict after a long absence, and caused cross-interference between concurrently running instances.)

The system needs expanded to remember more than just the previous call. It needs to remember the previous user choices for various source languages, not just one.

2010-02-03

Corrected a bug in Wordlink notified to me by Kent, and probably introduced by me by mistake last week. The navigation frame for the Wordlink form was missing a ‘target="top"’ and sending everything to the small navigation frame instead of the full window. Now ok.

2010-02-09

Cured (I think) problems with the dictionary output going to the wrong tab or to the wrong frame, if the user changed the dictionary in Multidict, either when it was used alone or when it was used in conjunction with Wordlink. I had been directing the output from the Multidict navigation form to either ‘target="_top"’ or ‘target="dict"’ and neither seemed to be right for all circumstances. What was needed was ‘target="_parent"’, which I had forgotten existed.

2010-02-10

Following a tip from a German friend (Olaf Klöcker - Vielen Dank, Olaf), changed all the LEO dictionary entries (all 10 of them - German to and from various languages, including Italian) to use "pda.leo.org" instead of "www.leo.org". This is their PDA or mobile phone version, which is much more succinct and much much better for use with Wordlink - a bit like m.vertalen.nu, but with more information than Vertalen and the big disadvantage compared to Vertalen that it doesn't do lemmatization. Left the old www.leo.org entries in the dictionary table, but demoted them.

2010-02-11

Found a very succinct Bulgarian English dictionary, sa.dir.bg, and added this to the Multidict database.

2010-02-22

At Gordon Wells’ suggestion, tested Wordlink on the Guthan nan Eilean/POOLS video transcripts newly added to the Am Baile website. Found that Wordlink worked fine on the transcripts, but that the link to the video didn't work through Wordlink. Found that this was a general problem applying to all media links - e.g. in photo galleries. Cured it by adding a few lines to wordlink.php to get it to avoid trying to wordlink known media types - .jpg, .gif, .tif, .mp3, .flv, etc. - and to give a direct link to the media instead.
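The media test can be sketched as a check on the file extension of the link. This Python version uses only the extensions named above (the real list is longer, as the "etc." indicates) and the function name is invented.

```python
import posixpath
from urllib.parse import urlparse

# Only the extensions named in the entry above; the real list is longer.
MEDIA_EXTENSIONS = {".jpg", ".gif", ".tif", ".mp3", ".flv"}


def is_media_link(url):
    """True if the link should be left pointing directly at a media file."""
    # Look at the path only, so that query strings do not hide the extension.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in MEDIA_EXTENSIONS
```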

2010-03-01

Our Stòr-dàta Briathrachais (Gaelic terminology database) and other resources were getting hammered all day by Googlebot pulling it in word by word, page by page, 100,000 Gaelic words and 100,000 English words at 2 words per second - so badly that the server became unusable for other work. This is because words are linked to other words so that ultimately all words are linked. Cleared the problem by restarting the computer. But learned my lesson and programmed the Stòr-dàta Briathrachais and also Wordlink and Multidict to generate a robots "noindex, follow" meta-tag in all circumstances except for the case of the blank form (so that Google will index the page with the empty form).
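The rule for the robots meta-tag reduces to one condition. In this Python sketch the parameter names are assumptions, and "blank form" is taken to mean a request with neither a word nor a url.

```python
def robots_meta(word="", url=""):
    """Return the robots meta-tag for a page, or '' for the blank form."""
    if not word and not url:
        # The empty form is the one page that search engines may index.
        return ""
    return '<meta name="robots" content="noindex, follow">'
```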

2010-03-04

Found that Multidict was not working properly with the WordReference dictionaries when the word had accented characters. Investigated and found that this was due to a flaw in the way in which WordReference deals with accented characters which it receives as GET parameters (in URLs). Introduced some code to work round this and it is now working ok.

Found that WordReference now has lots more dictionaries than before, including Italian monolingual and English<>Greek. Added these to the Multidict database. Found that it also now has a “mini” version, seemingly in all the languages that the main version does. Tried this and found it works with Multidict and could possibly be very useful.

2010-03-05

Got Multidict to remember (via a cookie) not just the most recent target language and dictionary, but the most recent target language for each source language, and the most recent dictionary used for each sl,tl combination.
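The remembered choices form a two-level lookup: source language to target language, and (sl,tl) pair to dictionary. The Python sketch below holds them in one structure that could be serialised as JSON; that storage format, the key layout and the function names are all assumptions, since the log says only that a cookie is involved.

```python
import json


def remember(prefs, sl, tl, dict_id):
    """Record the latest target language for sl and dictionary for (sl, tl)."""
    prefs.setdefault("tl", {})[sl] = tl
    prefs.setdefault("dict", {})[sl + "," + tl] = dict_id
    return prefs


def recall(prefs, sl, tl=None):
    """Fill in whichever of tl and dictionary the current call left out."""
    tl = tl or prefs.get("tl", {}).get(sl)
    dict_id = prefs.get("dict", {}).get(sl + "," + tl) if tl else None
    return tl, dict_id


def to_cookie_value(prefs):
    """Serialise the choices compactly (assumed storage format)."""
    return json.dumps(prefs, separators=(",", ":"))
```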

Removed the “Target language” field from the Wordlink navigation form. This never really did anything much and I think things are simpler without it. People liked the idea of being able to choose up front, but I think this is not so necessary now that Multidict is remembering the user’s previous choices.

Added some JavaScript to get Multidict to submit the form automatically when the user changes the source language (so that the list of available target languages can be updated), or target language (so that the list of available dictionaries gets updated). In the future it would probably be nice to get the JavaScript to know about a big array of target languages and dictionaries so that the lists can be updated locally without the need for submission to the server, but in the meantime automatic submission is a good stopgap.

Changed the list of available target languages in the select box in Multidict to be in alphabetic order. Previously I had attempted to show the available target languages in some kind of “merit” order, with the ones most likely to be needed appearing first. However, this proved to be too confusing, especially when the list of possible target languages was long, as was the case when English was the source language.

2010-03-06

Added a little EU favicon to the top right-hand corner of the Wordlink and Multidict navigation frames to help give the EU recognition for their sponsorship of POOLS-T.

Made a major change behind the scenes. Converted all the Wordlink and Multidict programs and the associated PHP class to access the underlying mySQL database using the new PDO (“PHP Data Objects”) interface methodology instead of the previous mysqli (“mysql improved”) methodology. This was a lot of work with nothing to be seen for it on the surface. However, it could prove beneficial in the long run. As well as giving some extra facilities, the PDO interface makes the programs completely independent of the brand of relational database which underlies them. So it would be easy to move everything to Oracle or Microsoft database server for example in the future if there was ever a need to move Wordlink/Multidict to a host where mySQL was not available.

Found that the Lexicelt Welsh<>Irish dictionary had stopped working from Multidict. Got it working again. Inspired by this, I looked again at Faclair na Pàrlamaid, the Scottish Parliament’s Gaelic<>English dictionary, which stopped working from Multidict a long time ago when they changed systems, and this time I succeeded in getting it working again.

2010-03-07

Added the Irish Gaelic terminology database Focal.ie to the dictionary database.

Programmed a system which displays below the dictionary select box in Multidict little “favicons” for each dictionary, each linked to a Javascript function which instantly switches to that dictionary and submits the form. This makes it very quick and easy to try the word in a different dictionary. Also put a little favicon for the current dictionary after the “Dictionary” label, this linked directly to the dictionary’s “homepage”, thus giving them some publicity. Altogether the system is working very nicely indeed. Visually it gives the impression when you click on an icon that it is “popping up” to be the current icon. All this is supported by a new table “dict” in the dictionaries database, which has fields for the dictionary name, icon, and dictionary homepage - as opposed to the main table “dictParam” which has the GET and POST parameters required to call it for a particular language combination.

2010-03-08

Busy collecting favicons for each dictionary and adding them into the database. Sometimes creating favicons for dictionaries which don’t have them.

A problem has become apparent with dictionaries which are available both as a normal version and as a mini/pda/mobile version. The mini version is usually very good when Multidict is used side by side with Wordlink or the Text Blender and space is very scarce; whereas the normal version might be better when Multidict is used as a standalone. The problem is that both the mini version and the normal version usually have the same favicon so there is no way to distinguish them. Started working on a system which would record in the database any particular properties for each dictionary (e.g. “mini”, or “page-image”) and would display the favicon with some distinguishing style to show the property - Currently just using a coloured underline or overline, since space is very scarce.

Updated the parameters for the Eurfa Welsh<>English dictionary, which had changed address and stopped working.

Changed An Focóir Beag, a monolingual Irish Gaelic dictionary, to use the standard Multidict lookup mechanism, instead of an old ad-hoc mechanism I had written years ago for it and was still using.

2010-03-09

Updated the Bulgarian-English diri (SA) dictionary, whose parameters had changed. Spotted that it does English-Bulgarian as well as Bulgarian-English and added this.

Found that the Hindi<>English Shabdkosh dictionary had stopped working from Multidict because its parameters had changed. Put this right.

Found that Српски (Serbian) was sorting in the wrong place in the lists of languages because its name in the database had a Latin script ‘C’ instead of a Cyrillic script ‘С’! Put this right.
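The two letters look identical on screen but are different characters, which is why the name sorted among the Latin-script names. A minimal Python sketch of the effect (not the Multidict code; the helper has_mixed_scripts and the sample list of names are made up for illustration):

```python
import unicodedata

# Faulty entry: Latin 'C' (U+0043) followed by Cyrillic letters.
faulty = "\u0043\u0440\u043f\u0441\u043a\u0438"
# Corrected entry: Cyrillic 'С' (U+0421) followed by the same letters.
fixed = "\u0421\u0440\u043f\u0441\u043a\u0438"

def has_mixed_scripts(name):
    """True if the letters of name come from more than one script."""
    scripts = {unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()}
    return len(scripts) > 1

# Hypothetical language list: two Latin-script names and one Cyrillic one.
names = ["Dansk", "Polski", "\u0420\u0443\u0441\u0441\u043a\u0438\u0439"]

# Sorting by code point puts the faulty name among the Latin-script names,
# and the corrected name after the other Cyrillic name.
sorted_faulty = sorted(names + [faulty])
sorted_fixed = sorted(names + [fixed])
```

A check like has_mixed_scripts could be run over all the language names in the database to catch similar slips.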

Improved (I hope) the layout of the Multidict (and Wordlink) navigation frame. Made the labels smaller and grey coloured. Moved the "no JavaScript" message down to an absolutely positioned div at the very bottom of the frame.

Added a new feature to Multidict, a Javascript button labelled ‘⇆’ which swops the source and target languages and submits the form.

2010-03-10

Put a link to Multidict (starting with Gaelic as source language of course!) on the Sabhal Mór Ostaig internal homepage for staff and students. Sent a message about this to all users on the college’s e-mail system.

Noticed that although Multidict was remembering the user’s previous choice of target language and dictionary, Wordlink now was not! Investigated and found that the cause was that I had removed the “sid=” (session id) parameter from the wordlinks generated by Wordlink after setting up the system based on cookies, thinking that it was no longer required. Put it back and everything seems to be working ok now.

Created a small Help page for Multidict. Made this and the Wordlink Help page open in the bottom frame rather than full screen.

Found that the ABBYY dictionaries (now called ABBYY Lingvo) now cater for a few more language pairs, uk<>en, uk-uk and ru<>la, and added these to the database. Noticed that ABBYY now uses new URLs with a simpler structure and changed all the ABBYY dictionaries to use this, even though the old URLs still work. Noticed that ABBYY now offers a choice of Russian or English user interfaces for all dictionaries. Opted for the English interface for the time being. However, Wordlink/Multidict badly needs a mechanism for recognising and making use of the user’s preferences for interface languages.

The large number of dictionaries I have in the database under the Lingvo name, or in one case Lingvosoft, have not been working for at least several days, possibly longer. Investigated and found that Lingvosoft is owned by a company called Ectaco, but their online dictionaries are not working either. Wondering whether they are in the process of being taken over by ABBYY or something?? Left the dictionaries in the database meantime.

Added the Sensagent mini dictionaries to the database. Sensagent is a multidictionary doing 26x26 language pairs, so that is a huge number of language pairs. We need to test and decide whether Sensagent or Sensagent mini is best for our purposes, and whether to keep the other or keep it in reserve. Sensagent mini certainly looks good at first sight - concise, and most importantly fluid format so none of it gets lost when the window gets narrower.

2010-03-11

Kent notified me of a bug. Multidict was sometimes not remembering the user’s previous choices properly, even when a cookie was set. Investigated and found a bug. When multidict.php was recording the user’s dictionary choice against uid,sl,tl, it was not updating the timestamp properly on old uid,sl,tl records, even though it was giving new records a proper timestamp. So when old dictionaries were reused, they weren’t being recorded as fresh uses. Put this right.
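The fix amounts to refreshing the timestamp whether the record is new or old. A minimal sketch of that logic, using Python and SQLite in place of the PHP and mySQL actually used; the table name dictUse, the column lastUsed and the function names are hypothetical, and it assumes one record per uid, sl, tl and dictionary:

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE dictUse (
        uid TEXT, sl TEXT, tl TEXT, dict TEXT, lastUsed INTEGER,
        PRIMARY KEY (uid, sl, tl, dict))""")
    return db

def record_use(db, uid, sl, tl, dict_name, now):
    # Refresh the timestamp on an existing record (the step that was missing)...
    cur = db.execute(
        "UPDATE dictUse SET lastUsed = ? "
        "WHERE uid = ? AND sl = ? AND tl = ? AND dict = ?",
        (now, uid, sl, tl, dict_name))
    # ...and only insert a new record if there was nothing to update.
    if cur.rowcount == 0:
        db.execute("INSERT INTO dictUse VALUES (?, ?, ?, ?, ?)",
                   (uid, sl, tl, dict_name, now))

def remembered_dict(db, uid, sl, tl):
    # The remembered dictionary is the most recently used one.
    row = db.execute(
        "SELECT dict FROM dictUse WHERE uid = ? AND sl = ? AND tl = ? "
        "ORDER BY lastUsed DESC LIMIT 1", (uid, sl, tl)).fetchone()
    return row[0] if row else None
```

With the bug, a user who went back to an old dictionary would keep being sent to whichever dictionary had been first used most recently, since the old record kept its stale timestamp.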

2010-03-16

Corrected a bug pointed out by Kent. Multidict was giving an unhappy face to show that there were no cookies found when really cookies were working fine. Turned out to be because I had changed the name of the cookie but was still testing for the old name.

2010-03-19

Set up redirection from http://multidict.net/multidict/ to http://www.smo.uhi.ac.uk/multidict/, and from http://multidict.net/wordlink/ to http://www.smo.uhi.ac.uk/wordlink/, so that the new domain can be used right away.

Added various new dictionaries to the Multidict database.

Corrected a minor bug in the processing of POST parameters in Multidict.

Got the Multidict navigation frame to submit the form automatically as soon as you change dictionary, to match the behaviour when you change source language or target language.

2010-03-20

Got a message from the other Kevin Donnelly (in Wales) about a great 34,000 word Breton-Dutch dictionary (Deloof) which he had just put online. Added it to the Multidict database.

2010-03-22

Corrected yet another bug which had been preventing Multidict from going to the previously remembered target language and dictionary in certain circumstances. (The automatic JavaScript onChange form submission needed to clear previous tl and dict values before submission.)

2010-03-23

Updated the archive copies of the program source and the database dump, which are freely available at http://www.smo.uhi.ac.uk/~caoimhin/obair/pools-t/wl/. (See the note above from 2009-10-12.)

Added more dictionaries recommended by Kevin Donnelly (Wales): the Breton<>Welsh sentence bank at brezhoneg.org.uk; and the excellent set of Russian to/from Welsh and other Celtic language dictionaries at www.cymraeg.ru - excellent because they do lemmatisation, including Russian<>English.

2010-03-24

Updated the database entry for the Lampeter Welsh<>English dictionary, as its parameters had changed and it had stopped working.

2010-03-26

Added foreign key constraints to the Multidict tables in the database (which helps to avoid errors when adding new dictionaries).

Added an excellent Dutch<>Polish dictionary, pools-woordenboek.nl to the database, and an Icelandic-English dictionary.

Switched to using a Wordlink examples page which automatically generates a list of Wikipedia links for languages in the Multidict database (which saves work maintaining the list as more and more dictionaries and languages are added). Got it to automatically show in grey those languages which have no dictionary other than Google Translate.

Did a lot of tidying up of program source.

Added the Concise Dictionary of Middle English historical dictionary by Mayhew and Skeat. This is a “page image” dictionary available from Project Gutenberg and so required a bit of work to construct an index.

2010-03-27

Added the Dutch<>German Uitmuntend dictionary (for schools, by the look of it) to the database.

Added the 6x6 multidictionary online-dictionary.nl to the database. Noticed that it is identical to the Interglot multidictionary but with more concise interface. Retained the Interglot dictionaries but marked them as “hidden”.

Introduced for the first time a “view” into the mySQL database. The view dictParamV is like the table dictParam but does not show hidden dictionaries and gets the dictionary name from the dict table (enabling me to drop the name field from the dictParam table). Changed the WlSession class to use dictParamV instead of dictParam. So for the first time, dictionaries marked as “hidden” are really hidden.
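A view of this kind might look roughly like the following. This is a sketch in Python with SQLite standing in for mySQL; the column names, the placement of the “hidden” flag in the dict table, the short dictionary ids and the example.org URLs are all assumptions, not the real schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dict (
    dict   TEXT PRIMARY KEY,            -- short dictionary id
    name   TEXT NOT NULL,               -- display name, stored once here
    hidden INTEGER NOT NULL DEFAULT 0   -- 1 = keep in database but do not offer
);
CREATE TABLE dictParam (
    dict TEXT NOT NULL REFERENCES dict(dict),
    sl   TEXT NOT NULL,                 -- source language
    tl   TEXT NOT NULL,                 -- target language
    url  TEXT NOT NULL                  -- lookup address for this language pair
);
-- The view hides hidden dictionaries and pulls the name in from dict.
CREATE VIEW dictParamV AS
    SELECT p.dict, p.sl, p.tl, p.url, d.name
    FROM dictParam p JOIN dict d ON d.dict = p.dict
    WHERE d.hidden = 0;

INSERT INTO dict VALUES ('odnl', 'online-dictionary.nl', 0);
INSERT INTO dict VALUES ('interglot', 'Interglot', 1);
INSERT INTO dictParam VALUES ('odnl', 'nl', 'en', 'http://example.org/odnl');
INSERT INTO dictParam VALUES ('interglot', 'nl', 'en', 'http://example.org/ig');
""")

# Programs which query the view never see the hidden dictionary.
visible = db.execute(
    "SELECT dict, name FROM dictParamV WHERE sl = 'nl' AND tl = 'en'").fetchall()
```

The advantage of doing this in a view rather than in each query is that the hiding rule lives in one place.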

Added what looks like an excellent set of Hungarian <> English/Dutch/German/French/Italian/Polish dictionaries, szotar.sztaki.hu to the database.

Added a Dutch<>Lithuanian dictionary.

2010-03-28

Added the Convertaal Dutch/English to Spanish termbank.

Added the lagom.nl Dutch<>Swedish dictionary. This looks like a good dictionary, but it kept jumping out of the frame - obviously it contained some JavaScript to make it jump out if it is loaded into a frame. Added an extra parameter, ‘zapOnload’ to the dictParam table, and a line of code to zap the “onload” JavaScript in dictionaries which require this to make them work with Multidict. Normally it is considered bad etiquette to load someone else’s page into a frame if it doesn’t want to be in a frame, but in this case I think it is ok. We are not passing the work off as our own, we are not “scraping” or removing advertising or anything else from the page, and we provide a link to the dictionary’s own homepage. In any case, the new parameter worked a treat.

The new parameter also proved to be just what was required to make the Dutch<>German Uitmuntend dictionary work properly with Wordlink. This fits fairly well into the narrow Multidict frame, except for a form on the right. The form isn’t needed for our purposes, but some onload JavaScript in the dictionary page gives it the focus, which has the side-effect of making the page scroll horizontally, removing the results from view. Zapping the onload JavaScript cured this a treat, without any obvious ill effects.

Added the Yiddish Online Dictionary. Split the Yiddish language into yi-Hebr (Yiddish written in the Hebrew script) and yi-Latn (Yiddish written in the Latin script).

2010-03-29

Added some older French dictionaries.

Added the Infopédia Portuguese monolingual dictionary.

2010-03-31

Added to Multidict the Wortschatz monolingual resources from the University of Leipzig for about 54 languages. This is not a dictionary (except for German, where they provide some dictionary facilities) but rather an interesting corpus resource, giving example sentences, word frequencies, and diagrams showing common collocations with other words. It seems to work better for some languages than for others. I am not sure whether this should be in with the dictionaries in Multidict, but I thought it was sufficiently interesting to add it in. We should probably have an option facility to switch things like this off or on according to user preferences.

Added the Infopédia Portuguese <> Spanish/French/English/German dictionaries. The quality looks excellent.

2010-04-01

Added the Michaelis Portuguese <> Portuguese/Spanish/Italian/French/English/German dictionaries.

2010-04-02

Added a Polish monolingual dictionary.

Added the Morris Basque<>English dictionary. Looks excellent, except that it doesn't seem to do lemmatization.

Added the Kamus Jot German<>Indonesian dictionary. Looks very good. It was jumping out of its frame, though, so added a new mechanism to zap the JavaScript which was making it do this.

2010-04-06

Following a suggestion from Kent, made Multidict also “recall” the previously used source language, in the event that it is opened without any source language being specified.

Added Den Danske Ordbog at Kent's suggestion.

2010-04-07

Added the Hazar Turkish<>English, German<>English and Spanish<>English dictionaries.

Added the langtolang.com multidictionary to Multidict. It does about 32x32 language pairs, including Greek and Lithuanian.

Added the Lexin dictionaries - Swedish to and from 16 other languages. These were produced to go along with the Swedish Government classes which teach Swedish to immigrants and look very good.

Added the Danish<>Swedish and Danish<>Turkish immigrant dictionaries at lexin.emu.dk.

2010-04-08

Added the Celtic Cognates Database as a “dictionary” for the various Celtic languages, even though it is not really a dictionary.

2010-04-17

Added the Norwegian Lexin dictionaries, which were designed for immigrants to Norway - Bokmål and Nynorsk to about 10 other languages.

Added the Finsetaal Dutch<>Finnish dictionary.

2010-04-18

Added the Svenska Akademiens Ordbok, a large Swedish monolingual dictionary.

2010-04-19

Added Dicts.info, a multidictionary which does any pair of 77(!) different languages.

2010-04-20

Added the Dicts.info Universal Dictionary, which does any pair of 72 different languages. What I added yesterday turned out to be actually a huge batch of Dicts.info bilingual dictionaries. Added quite a number of other Dicts.info bilingual pairs. I am finding the Dicts.info “empire” huge and rather confusing, but there is lots and lots of good stuff in it. Worth revisiting sometime in the future, especially dictlist1.php, to pick up some more languages.

Added the Gran Diccionari Catalana, a large Catalan monolingual dictionary. This keeps jumping out of frames, which I didn’t manage to cure. So it is just as well that Wordlink still has modes other than Splitscreen! Found that it is rather dangerous too, because it gives you no way of returning to Multidict, so after you use it once it can be difficult to change from Catalan->Catalan to say Catalan->English. (The solution is to choose Catalan-English without giving it any word so that you never get into the Gran Diccionari Catalana.)

2010-05-03

Made wordlink.php assume windows-1252 encoding when the page specifies iso-8859-1. This is because windows-1252 is a very commonly used extension (originally from Microsoft) of iso-8859-1, and quite a lot of pages wrongly specify iso-8859-1 when they contain characters such as “smart quotes” or the ‘€’ sign which are not in iso-8859-1.
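A small illustration of the difference, in Python rather than the PHP of wordlink.php; the function names and the sample bytes are invented for the example:

```python
LATIN1_ALIASES = {"iso-8859-1", "iso8859-1", "latin1", "latin-1"}

def effective_encoding(declared):
    """Treat a declared iso-8859-1 as windows-1252; leave other encodings alone."""
    name = declared.strip().lower()
    return "windows-1252" if name in LATIN1_ALIASES else name

def decode_page(raw, declared):
    # errors="replace" because a few byte values are undefined in windows-1252.
    return raw.decode(effective_encoding(declared), errors="replace")

# Bytes 0x93/0x94 are smart quotes and 0x80 is the euro sign in windows-1252,
# but invisible control characters if read strictly as iso-8859-1.
raw = b"\x93smart quotes\x94 cost \x801"
as_declared = raw.decode("iso-8859-1")
as_assumed = decode_page(raw, "ISO-8859-1")
```

Since windows-1252 agrees with iso-8859-1 everywhere except the range 0x80 to 0x9F, which iso-8859-1 only uses for control characters, the substitution does no harm to pages which really are iso-8859-1.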

2010-05-07

Added the Latvian Letonika dictionaries.

Added the Italian monolingual Treccani dictionary.

2010-05-08

Added Cregeen’s Manx-English dictionary (1825) from the Web Archive by typing in the first word on all 171 pages.
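Knowing the first word on each page is enough to find the page for any word by binary search. A rough Python sketch of the idea (not the Dictpage code; the function name and the sample words are invented, and plain string comparison stands in for proper dictionary collation):

```python
import bisect

def find_page(first_words, word):
    """Return the 1-based page on which word should appear.

    first_words[i] is the first headword on page i+1, in dictionary order.
    """
    keys = [w.lower() for w in first_words]
    # Number of pages whose first word is <= the search word.
    i = bisect.bisect_right(keys, word.lower())
    # A word which sorts before the first headword can only be on page 1.
    return max(i, 1)

# Hypothetical index for a four-page dictionary.
first_words = ["aa", "bab", "cab", "daa"]
page = find_page(first_words, "billey")
```

Here “billey” sorts after “bab” and before “cab”, so it is looked for on page 2.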

2010-05-09

Added Craine as an English to Manx dictionary. I had previously included it only as Manx-English and had forgotten to add it as English-Manx.

Added an old 400 page Cornish-English dictionary from the Web Archive, the Lexicon Cornu-Britannicum by Robert Williams, 1865.

2010-05-10

Added the Online Scots Dictionary.

2010-05-14

On the pages on the SMO website which list Gàidhlig dictionaries and Manx dictionaries which are available on the Web Archive, added a Multidict icon (linked to a Multidict lookup) against each dictionary where words can now be looked up using Multidict and Dictpage.

2010-05-16

Kent reports problems when linking to Multidict hosted at multidict.net using the TextBlender. If the word contains non-ASCII characters, it just comes up with “Not Acceptable”. After doing tests, I found out that JustHost is rejecting HTTP requests if the (%-encoded) GET parameters are not in utf-8. Contacted the JustHost helpdesk and was told that this was a security measure which could be switched off for the site. In the meantime, though, I devised a workaround. Things work, at least for Western European languages, if a single line is added to the dLink function in the JavaScript generated by the TextBlender to convert the word to utf-8 using the JavaScript function “encodeURIComponent”. It remains to be seen, though, whether this will also work for Greek.
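The difference between the two encodings of the GET parameter can be seen in a few lines. This is a Python sketch using urllib.parse.quote, which behaves much like JavaScript’s encodeURIComponent for these purposes; the function names are invented, and the check is only a guess at the sort of test the host applies:

```python
from urllib.parse import quote, unquote_to_bytes

def encode_word(word, encoding="utf-8"):
    """Percent-encode a word for use as a GET parameter value."""
    return quote(word, safe="", encoding=encoding)

def is_utf8_query_value(encoded):
    """True if the percent-encoded bytes form valid utf-8."""
    try:
        unquote_to_bytes(encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "G\u00e0idhlig"                      # "Gàidhlig"
as_utf8 = encode_word(word)                 # what encodeURIComponent would send
as_latin1 = encode_word(word, "latin-1")    # what an iso-8859-1 page might send

greek = encode_word("\u03bb\u03cc\u03b3\u03bf\u03c2")   # "λόγος"
```

Since utf-8 can represent every character, a Greek word encodes in just the same way, which suggests the workaround should carry over provided the TextBlender page holds the word correctly in the first place.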

2010-05-17

Added the “Dictionarium Scoto-Celticum” (otherwise known as the “Highland Society Dictionary”) to Dictpage and hence Multidict, by typing in the first word on each page. This is a high quality Gaelic->English and Gaelic->Latin dictionary with nearly 1000 pages, published in 1828.

2010-05-19

In the examples page for Wordlink, made the list of (currently 100) Wikipedia examples more attractive looking and clearer.

2010-05-22

Added McKenna’s English-Irish phrase dictionary, 1911, 285 pages, to Dictpage and hence to Multidict.

2010-05-23

Added Kelly’s English<>Manx dictionary, 1866, to Dictpage and hence to Multidict, both the Manx>English part (191 pages) and the English>Manx part (236 pages).

2010-05-24

“Disentangled” the Multidict/Wordlink database from other databases at SMO. Up until now, the various dictionary information tables used by Multidict and Wordlink, and the operational tables used by Multidict and Wordlink (for example, to “remember” previous dictionaries used by each user) have been embedded in the same database as Sabhal Mór Ostaig’s Gaelic terminology database, other Gaelic online dictionaries, a Gaelic placenames database, and lots more. Notably too, the database contains a copy of the English “Moby” thesaurus, which is used to enhance the Gaelic terminology database but which is so huge that it makes every database copy or backup take over an hour on the slow computer on which it currently runs.

I have now disentangled the tables used by Wordlink/Multidict/Dictpage to live in their own dedicated database called “multidict”. This was a necessary step on the way to being ready to make a copy of Wordlink/Multidict/Dictpage live and work separately at multidict.net. I still have to disentangle the programs from some PHP classes which they share with other programs at SMO, but hopefully that will be a smaller job.

2010-06-10

One hour keynote address, “Making Better Use of Online Dictionaries” to the annual conference of the North American Association for Celtic Language Teachers.

2010-06-13

Added Mark Nodine’s Welsh<>English dictionary to Multidict.

Got Dinneen’s Irish-English dictionary, page-image at the University of Limerick, working again with Wordlink. They had changed things and it had stopped working.

Added the Cronfa Genedlaethol o Dermau (Welsh National Database of Terms) to Multidict. This had proved stubborn, with a huge list of complicated encrypted POST parameters, but I managed to do it by adding a new “onload” dictionary handling mechanism to Multidict. The new mechanism does not attempt to assemble all the required parameters and submit a query directly. Instead, it fetches the search page from the dictionary site, and loads it into the user’s browser frame having first injected it with some “onload” JavaScript which will fill in the required form fields, including the search word, and then submit the form from the user’s browser. This new mechanism could prove very useful for lots of other dictionaries which have up till now proved stubborn!
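The injection step might be sketched as follows. This is Python rather than the PHP of multidict.php, and the function names, the choice of the first form on the page and the field name are all assumptions for the sake of the example:

```python
import json

def js_string(value):
    """Encode a Python string as a JavaScript string literal safe inside <script>."""
    return json.dumps(value).replace("</", "<\\/")

def inject_autosubmit(html, field, word, form_index=0):
    """Add onload JavaScript which fills in the search field and submits the form."""
    script = (
        "<script>window.onload = function () {"
        f" var f = document.forms[{form_index}];"
        f" f.elements[{js_string(field)}].value = {js_string(word)};"
        " f.submit(); };</script>"
    )
    # Put the script just before </body>, or at the end if there is no body tag.
    pos = html.lower().rfind("</body>")
    if pos == -1:
        return html + script
    return html[:pos] + script + html[pos:]

page = "<html><body><form><input name='q'></form></body></html>"
injected = inject_autosubmit(page, "q", "ci")
```

Because the form is submitted from the user’s own browser, any hidden or encrypted fields which the dictionary site put into the form go along with it unchanged.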

2010-06-14

Added the Vertimas (VDU) English-Lithuanian translator to Multidict using the new “onload” mechanism. Previously I had tried hard several times to incorporate this but had not succeeded. To do this, though, I had to further develop the new “onload” mechanism. Using JavaScript to submit the form did not work, so I had to use JavaScript to simulate the user actually clicking on the submit button, and the way to do this is different in Internet Explorer compared to other browsers. Checked with Jollita in Lithuania, who requested Vertimas, and she is happy with it.

2010-06-15

Following a suggestion from Søren, tried to incorporate the Gyldendal set of subscription Danish dictionaries into Multidict using the new mechanism, but so far without success. Gyldendal is even more complicated than other dictionaries.

Added the Interactive Manx Dictionary.

2010-06-17

Following a Wordlink/Multidict/TextBlender demonstration workshop at SUPSI, Jan Hardie reported a problem with Multidict not “remembering” the dictionary which the user was using. Kent said this would be due to the browser not being properly set up to accept cookies. Ellen and Frans then reported similar problems on college computers at Horizon college. Kent claimed that modern versions of Internet Explorer were not set up “out of the box” to accept cookies. I was initially extremely sceptical about this, because all modern Web life (Facebook, etc.) requires cookies, but Panos in Greece confirmed that Kent was right and that there was a problem with IE8 “out of the box”.

Following a lead from Kent (a mention of “privacy policy”) and help from Google, I eventually tracked the problem down. Internet Explorer treats all cookies in frames as “third party cookies” (even though in our case they are from the same website and not third-party at all), and it rejects third-party cookies by default (Medium Security). The solution was to set a “compact privacy policy” for the site (which I did by a mod_headers directive in Apache, and also (belt and braces) by using “header” in the PHP programs). When Internet Explorer sees our promise in the privacy policy that we will not do anything naughty with user data, it suddenly becomes willing to accept cookies. Problem solved!

2010-06-23

Added the Gerlyver Kernewek-Sowsnek, a Cornish<>English dictionary. Looks like the best Cornish dictionary yet.

2010-07-01

Lots of correspondence with the chap behind the Cornish online dictionary. He spotted that my publicly available copy of the program source was not in fact readable because the new SMO webserver is trying to process “*.php.txt” files as PHP. Changed the extension to “.phps”, which has cured the problem.

2010-07-15

Added three new Basque dictionaries/termbanks recommended by Michael Bauer to Multidict.

2010-07-17

Corrected a Wordlink problem notified to me by Kent. When it “wordlinks” existing links in the page, wordlink.php adds a ‘target="_top"’ to them to ensure that when the link is followed we escape from the Wordlink frame. However, a few pages, including one encountered by Kent, have links which already have a “target” attribute in them and this was taking precedence. I added a line to the program to zap any existing target attribute, and this seems to have cured the problem.
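In outline, the fix removes any target attribute already on a link before adding the new one. A regex-based Python sketch (wordlink.php itself is PHP, and its actual pattern is not shown in this log, so the expressions below are only illustrative):

```python
import re

A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
TARGET_ATTR = re.compile(
    r"""\s+target\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)

def retarget_links(html):
    """Give every <a> tag target="_top", zapping any existing target first."""
    def fix(match):
        attrs = TARGET_ATTR.sub("", match.group(1))
        return f'<a{attrs} target="_top">'
    return A_TAG.sub(fix, html)

before = '<a href="x.html" target="_blank">x</a> <a href=\'y.html\'>y</a>'
after = retarget_links(before)
```

A regular expression like this will be fooled by unusual markup (for example a “>” inside an attribute value), so it is a pragmatic fix rather than a full HTML parse.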

2010-07-18

Completely updated the copies of the program source and the database dump at /home/caoimhin/public_html/obair/pools-t/wl/.

2010-07-20

Added a new feature to Wordlink: If it is called with “url=referer”, it now takes the url from the HTTP_REFERER system variable from the HTTP request. This makes it very easy to add a Wordlink link to a webpage which will “wordlink” the page for the benefit of users who are not fluent in the language, or to add Wordlink links to all the pages in a website via an HTML template or Content Management System. It has the additional benefit that if the page is moved, the Wordlink link will still work with no alteration. However, since it is possible to turn off the sending of referer information by a browser, the method is not 100% guaranteed.
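The logic is simple enough to sketch in a few lines of Python (the real code is PHP reading HTTP_REFERER; the function name and the dictionaries standing in for the request are invented for the example):

```python
def url_to_wordlink(params, headers):
    """Work out which page to wordlink from the GET parameters and request headers.

    A page can carry a fixed link such as
        http://multidict.net/wordlink/?sl=gd&url=referer
    and the address of the page itself is then taken from the Referer header.
    """
    url = params.get("url", "")
    if url == "referer":
        # The browser may have been set not to send a Referer header.
        return headers.get("Referer") or None
    return url or None

page = url_to_wordlink({"sl": "gd", "url": "referer"},
                       {"Referer": "http://example.org/naidheachdan.html"})
```

When no referer information arrives, the sketch returns None, and the program would need to fall back on asking the user for the address.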

2010-07-28

Added Lexer’s Middle High German dictionary.

2010-07-02

Had another serious attempt at getting the Dictionary of the Irish Language (the very large and very important Old Irish dictionary) working properly with Multidict. Tried this time using the new "onload" mechanism which I devised for the Lithuanian Vertimas dictionary. Found that it didn’t work because their search page has corrupt html with no head or body tags(!) so had to get multidict.php to repair the html first. After all that, it still didn’t work properly a lot of the time. Gave up and reverted to the old system (which only works on DIL if you have initialised DIL first by searching www.dil.ie outside of Multidict). Just hoping they improve the interface sometime.

2010-08-03

Added Bosworth’s Anglo-Saxon dictionary (Old English) to Multidict, treating it as a page-image dictionary even though it isn’t really - but this gets round lemmatization problems. Rather slow because the files are so big, but the dictionary is huge and so should be useful to people trying to read Old English.

Made a minor alteration to wordlink.php to cope with those rare pages which use single quotes instead of double in the html - i.e. <a href='...'> instead of <a href="...">.

2010-09-02

Kent and I gave a presentation on POOLS-T at the Eurocall 2010 conference in Bordeaux.

2010-09-03

Got Sensagent working again with Multidict. (It had changed address and stopped working.)

2010-09-25

Added the reverso.net (Collins) dictionaries to Multidict - English to and from German, French, Spanish, Italian, Russian and Chinese, as well as English-English.

2010-10-05

Added an [Esc] button to Wordlink to make it easy to escape from Wordlink and go straight to the current URL, after a period of browsing around on the Internet with the help of Wordlink.

Did some tests in Wordpress (currently at http://teanga.wordpress.com/) of adding a Wordlink link automatically to all the pages of a Wordpress blog. Since the link makes use of url=referer, it allows the user to easily Wordlink any particular page they are reading.

2010-10-06

Following a suggestion from Kent (and also feedback from Gordon), modified the wordlink.php program slightly to deactivate and make invisible any Wordlink links which happen to be in the Wordlinked page. This is to avoid the user clicking “Wordlink” in a page which has already been Wordlinked, which had previously been causing lots of confusion and errors. A big improvement.

2010-10-19

Added the Norsk Ordbok at www.ordnett.no.

Did a bit of maintenance to Russian dictionaries.

2010-10-22

Major improvement to Wordlink to make it work with more sites. When webpages contain links to pictures or other pages, there is no problem if these are absolute links. However, if they are relative links (i.e. relative to the original location of the webpage), they would no longer work after the page has been Wordlinked and is served from the SMO server. Wordlink has coped with this in two ways: by adding a “basedir” directive to the page header, and by resolving any external links before “Wordlinking” the address. However, Wordlink has never previously coped with relative links to stylesheets, and since these are used extensively by the BBC, it has never worked with the BBC website. The improvement is that Wordlink now preprocesses the html file looking for any references such as href=, src=, or url(..) and resolving them all before going on to its other work. As a result, Wordlink now works with the BBC website, the EuroCALL website, and probably many other websites where it previously failed. However, it does not yet work with the EuroCALL 2011 website which has an additional complication, namely relative links within the CSS files themselves. To cope with this, Wordlink would have to modify the links to the stylesheets and preprocess the stylesheets to resolve relative links.
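The preprocessing step can be pictured like this. It is a Python sketch using urljoin, not the PHP in wordlink.php; the regular expressions and the function name are illustrative guesses, and the addresses are made up:

```python
import re
from urllib.parse import urljoin

# href="..." or src='...' (either kind of quote)
ATTR = re.compile(r"""\b(href|src)\s*=\s*(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)
# url(...) as found in inline styles, with or without quotes
CSS_URL = re.compile(r"""url\(\s*(["']?)(.*?)\1\s*\)""", re.IGNORECASE)

def absolutize(html, page_url):
    """Resolve relative references in html against the page's own address."""
    def fix_attr(m):
        name, quote, ref = m.group(1), m.group(2), m.group(3)
        return f"{name}={quote}{urljoin(page_url, ref)}{quote}"
    def fix_css(m):
        quote, ref = m.group(1), m.group(2)
        return f"url({quote}{urljoin(page_url, ref)}{quote})"
    return CSS_URL.sub(fix_css, ATTR.sub(fix_attr, html))

page_url = "http://example.org/news/index.html"
html = ('<link href="../css/site.css"><img src=\'img/a.png\'>'
        '<div style="background:url(bg.png)">')
resolved = absolutize(html, page_url)
```

Links which are already absolute pass through urljoin unchanged. Relative links inside the stylesheet files themselves are not touched, which matches the remaining limitation described above.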

2010-10-24

Added program code to wordlink/index.php to clean up the address a bit, removing unnecessary parameters such as “go=Go”, “upload=0” and “mode=ss”. This is purely so that people get a cleaner address if they copy and paste it from the address bar in the browser. More could be done on these lines: same sort of thing to Multidict; and standardising the order of the sl=, tl= and url= parameters.
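A sketch of this kind of clean-up, including the standardised parameter order which is so far only an idea. It is written in Python rather than PHP, the function name is invented, and it assumes that the three parameter values listed are simply the defaults:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter/value pairs assumed to be defaults which can be dropped.
UNNEEDED = {("go", "Go"), ("upload", "0"), ("mode", "ss")}
# Preferred order; any other parameters follow in their original order.
ORDER = ["sl", "tl", "url"]

def clean_address(address):
    parts = urlsplit(address)
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if (k, v) not in UNNEEDED]
    # list.sort is stable, so parameters with equal rank keep their order.
    params.sort(key=lambda kv: ORDER.index(kv[0]) if kv[0] in ORDER else len(ORDER))
    return urlunsplit(parts._replace(query=urlencode(params)))

messy = ("http://www.smo.uhi.ac.uk/wordlink/"
         "?url=http%3A%2F%2Fexample.org%2F&go=Go&sl=gd&mode=ss&upload=0")
clean = clean_address(messy)
```

Only exact name and value pairs are dropped, so a non-default value such as a different mode would survive the clean-up.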

I have ideas for something a bit more ambitious - getting rid of the sid= (session id) parameter, which confuses people because they copy and paste it and lock themselves into past choices. The idea is to hide it by assigning it as a window name (which can only be accessed via Javascript).

2010-10-31

I am thinking out some ideas for doing something about lemmatization (i.e. converting wordforms such as “churches” and “tumbling” to dictionary headwords such as “church” and “tumble” so that they can be found in the dictionary). Lack of lemmatization is currently the main hindrance to Wordlink’s usefulness in many languages.

The only tiny bit of “lemmatization” which the programs do at the moment is that Wordlink removes any letter ‘h’ found as the second letter of a Scottish Gaelic or Irish Gaelic word - which is nearly always a good thing to do because this is a very common grammatical mutation in Gaelic. It seemed appropriate at the time to put the lemmatization functionality into Wordlink rather than Multidict, since it was predominantly when using Wordlink that the requirement for lemmatization arises, and since Multidict would then display clearly what had happened and what word was actually being searched for in the dictionary. However, I now believe that this was a bad idea and that lemmatization functionality should go into Multidict rather than Wordlink - for the following reasons:

  1. If the functionality was in Multidict, it would be useful to the Textblender as well as to Wordlink.
  2. I am tending towards the view that Wordlink should not only be indifferent to the choice of target language and dictionary, but that it should perhaps often be indifferent to the source language as well. It should just link words on the page to “dictionaries” - i.e. to Multidict - rather than specifying the source language in each link, because Multidict can now perfectly well remember (via cookies) the previous choice of source language. And besides, a page might contain a mixture of languages, in which case it would not be appropriate to freeze a single source language into all the links.
  3. The appropriate type of lemmatization, or whether lemmatization is needed at all, will in many cases be dependent on the particular dictionary. The wonderful Vertalen dictionary, which does 9x9 European language pairs, has lemmatization built in, and no doubt does it far better than I could do. Similarly, Michael Bauer (“Akerbeltz”), the developer of the online Gaelic dictionaries, Dwelly and Am Faclair Beag, has contacted me to say that since these dictionaries now do their own lemmatization, Wordlink is actually doing a disservice by removing the ‘h’s, because in the case of a small number of common irregular verbs this actually prevents the dictionary from finding the headword.

So what I am thinking of doing is storing for each dictionary in the Multidict database an indication of what “lemmatization scheme” is usually best for that particular dictionary. Normally that would be either “no lemmatization at all”; or else “the best which Multidict can do for the language”, whether that be table-based or else based on some simple rules such as “remove ‘h’s”, or some combination of the two. However, more individualistic “lemmatization” schemes might be appropriate for particular dictionaries. Prior to the spelling reform of Irish Gaelic in the 1950s, for example, the usual practice was to write a “dot-above” diacritic on a letter rather than writing an ‘h’ after it. The equivalent in German would be a reform which changed the spelling of “überflüssig” to “ueberfluessig” and placed it under “ue...” in the dictionary rather than under “ub...”. So for using the excellent pre-reform Dinneen page-image dictionary with a modern Irish Gaelic text, an appropriate “lemmatization scheme” would be to remove every single ‘h’ from the word so as to put it in the correct alphabetical order.

Automatic spelling changes which attempt to convert between closely related languages might also be included in the portfolio of “lemmatization schemes”. For example, to attempt to convert a Scottish Gaelic word to modern Irish Gaelic spelling so as to find enlightenment in Irish Gaelic dictionaries, one would change ‘sg-’ at the beginning of the word to ‘sc-’ (e.g. “sgian” → “scian”) and change “-achadh” at the end of the word to “-ú”. Something similar might work between Norwegian Bokmål and Danish, or between pre- and post-reform Irish Gaelic.
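
The two spelling changes mentioned above could be sketched as a small list of rewrite rules. This is only a sketch under assumptions: the function name gd_to_ga is invented, only the two rules from the text are included, and it handles lower-case input only. A real conversion would need many more rules.

```python
import re

# Hypothetical rule list: only the two changes described in the text.
# Each entry is (compiled pattern, replacement).
GD_TO_GA_RULES = [
    (re.compile(r"^sg"), "sc"),      # 'sg-' at the beginning of the word becomes 'sc-'
    (re.compile(r"achadh$"), "ú"),   # '-achadh' at the end of the word becomes '-ú'
]

def gd_to_ga(word: str) -> str:
    """Apply each rule in turn to a lower-case Scottish Gaelic word."""
    for pattern, replacement in GD_TO_GA_RULES:
        word = pattern.sub(replacement, word)
    return word
```

So gd_to_ga("sgian") gives “scian”, and the suffix rule mechanically turns “beannachadh” into “beannú”. Words matching neither rule come back unchanged.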

My current thoughts are that Multidict, instead of displaying “[searchword] [Go]” as at present, would display “[wordform] [ls] [lemmatized word] [Go]”, and it would search for the lemmatized word in the dictionary. “[ls]” is an abbreviation for the lemmatization scheme, which could be changed by the user. The lemmatized word could also be manually changed by the user. (I am not sure what to do when a lemmatization scheme (e.g. table-based) comes up with several possible lemmas for the same wordform. A selection dropdown would be appropriate, but would get in the way of the user being able to manually rewrite the word.)
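
One possible way round the several-lemmas problem, sketched in Python purely to think it through: keep the list of candidates separate from the editable lemma field, and just prefill the field with the first candidate. Everything here is an assumption for illustration (the SearchForm structure, the field names, and the tiny English lemma table); it is not how Multidict currently works.

```python
from dataclasses import dataclass
from typing import List

# Invented toy table: a wordform may map to several possible lemmas.
LEMMA_TABLE = {
    "saw": ["see", "saw"],
    "left": ["leave", "left"],
}

@dataclass
class SearchForm:
    wordform: str          # the word as clicked in the document
    scheme: str            # the "[ls]" code, changeable by the user
    candidates: List[str]  # suggestions which could fill a dropdown
    lemma: str             # editable text field, prefilled with the first candidate

def build_search_form(wordform: str, scheme: str = "table") -> SearchForm:
    """Build the form data; fall back to the wordform itself if nothing is found."""
    if scheme == "table":
        candidates = LEMMA_TABLE.get(wordform.lower(), [wordform])
    else:
        candidates = [wordform]
    return SearchForm(wordform, scheme, candidates, candidates[0])
```

Here build_search_form("saw") would prefill the lemma field with “see” and offer “see” and “saw” as suggestions, while the field itself stays free for the user to retype.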

2010-11-02

Found that Faclair na Pàrlamaid, the Scottish Parliament’s Gaelic-English dictionary, had stopped working with Multidict because they had moved it to a different location. Tracked it down and got it working again.

Wrote “Note for dictionary owners” to link to from the About Multidict page, just as a bit of additional documentation for completeness.

2010-11-22

Added a new small but possibly important feature to Wordlink. It now *marks* the word in the document which you have clicked, and only unmarks it when you have clicked another word.

The reason for this is that I noticed that when I was reading through a webpage with the help of Wordlink, things were ok provided that I could keep the mouse near where I was reading. But if I had to move over to the dictionary frame on the right hand side to scroll the dictionary or change dictionaries or manually lemmatize the word, I lost track of where I had been reading in the document and it took me a while to find the place again. Hopefully the new feature will make it easy for users to cast their eye straight back and continue reading where they left off.

2011-09-14

At Kent’s request, following feedback from a workshop, promoted the SensAgent dictionary to first place for Greek, while demoting SensAgent Mini.

2011-09-21

Added the MalagasyWorld Malagasy-English/French dictionary to Multidict. Seems to be very good.

 

CPD