Acmhainn Díchlaochlaithe agus Scoilte Focal:
Word Demutation and Segmentation Tool

Ciarán Ó Duibhín

Más fada leat an teideal thuas, ní miste liom má bheir tú "An Duibhíneach" ar an acmhainn seo!

If you find the title above too long, I don't mind if you call this tool "An Duibhíneach"!

Bainidh an acmhainn seo na claochlaithe tosaigh ón fhocal a bheirtear dó, agus, má bhíonn uaschamóg nó fleiscín san fhocal, scoiltidh sé an focal ina chodanna dá réir.  Oibridheacht riachtanach seo i dtéacs-phróiseáil na Gaedhilge.

This tool attempts to remove initial mutations from a supplied Gaelic word, and also to segment any word containing an apostrophe or a hyphen into its constituent parts.  This is a basic operation in Gaelic text processing.

An taisbeanadh — Using the demo

Is féidir taisbeanadh beag fá choinne MS-Windows a tharraingt anuas as seo.  Leis an taisbeanadh seo, cuirtear isteach focla fríd an mhéarchlár, do réir ceann is ceann, lena bhfuascladh.  Bain triail as ar shamplaí mar iad seo: fhear, bhfear, n-éan, nÉan, héanacha, hata, d'ól, arsa'n, b'fhada, 'seadh, nárbh' etc.

A simple demo application for MS-Windows can be downloaded from here.  In this demo, words to be resolved are typed in, one at a time.  Try it out on examples such as: fhear, bhfear, n-éan, nÉan, héanacha, hata, d'ól, arsa'n, b'fhada, 'seadh, nárbh'.

Más eol dó go bhfuil an dá fhéidearthacht ann leis an fhocal a bheirtear dó (m.sh. thart), fiafruighidh sé díot an mbainfidh sé an claochlú de nó nach mbainfidh.  In gnáth-úsáid na h-acmhainne, agus na focla dá ndealbhú as téacs reatha, thiocfadh leis an chomhthéacs bheith ina chuidiú leis an cheist a fhreagairt.

If you give it a word-form which it knows to be ambiguous with regard to demutation (e.g. thart), it will ask you to choose whether to demutate or not. In a realistic application, where the words are being drawn from a running text, you could arrange to show some context to inform the decision.

Tá an acmhainn seo in úsáid leis na bliadhanta ag Foclóir na Nua-Ghaeilge in Acadamh Ríoga na hÉireann.  Is mar thoradh ar an fhéacháil chruaidh a cuireadh uirthi annsin a tháinig cuid mhaith den fhorbairt atá deánta uirthi.

This tool has been in use for many years by Foclóir na Nua-Ghaeilge at Acadamh Ríoga na hÉireann. Much of its development is due to the range of data which it has encountered there.

I dteannta na modhanna oibre a mbeifí ag súil leo, bainidh an acmhainn leas as comhad ina bhfuil liostaí de fhocla eisceachtamhla.  Tá dhá chomhad den tseort seo in éineacht leis an oideas taisbeanaidh, agus caithfear ceann acu a cheangal leis an oideas nuair a chuirtear a dhul é.  Is iad na comhaid seo

Is féidir sain-liostaí a chur ina n-áit seo (ar a ndeánamh le deisightheoir téacs), má tchíthear buntáiste le sin.

The tool uses obvious algorithms, backed up by a file containing lists of exceptional words.  Two such files are supplied with the demo, and one or other of them must be selected as the demo starts up.  They are

These may be replaced by a customised list (which can be made using a text editor) if it is advantageous to do so.

An t-oideas bunaidh — Using the source

Sa dóigh is go dtig leat an acmhainn seo a chur go feidhm in do obair oideas-chumtha féin, cuirtear ar fagháil é san fhoirm bhunaidh (in Delphi 5); thig a tharraingt chugat as seo.

If you wish to use the tool in your own programming, it is supplied in source form (in Delphi 5), downloadable from here.

Glaoidhtear air mar fhó-oideas ó do fheidhm-oideas féin:

The interface takes the form of a function which may be called from an application:

function enrichword
(word: string; action: affirmtype;
segment, demutate: boolean;
endofline, prefixnow: boolean;
splitter, padesc: char;
continuation: boolean): string;


word: the word to be processed.  Normally a complete word, but it may also be either part of a broken word (e.g. one hyphenated at the end of a line in running text).  To have it treated as the initial or final part of a broken word, set endofline or continuation respectively to true; for a complete word, both should be false.

action: the name of a user-supplied function to handle queries from enrichword and pass back the user's replies.  The specification is:
function action (prompt, default: string): boolean;
The function should display the prompt string and, if desired, the default value of the reply (both are supplied to it by enrichword), and should invite a reply (typically Y or N).  It should return true or false to enrichword, depending on the reply.
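
As an illustration only, a console version of such a function might look like the sketch below.  The name ConsoleAction, the parameter name deflt, and the rule that any reply beginning with Y counts as "yes" are assumptions made for this example; they are not taken from the tool's source.

```pascal
program ActionSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

{ Hypothetical callback with the shape described above.
  The second parameter is named "deflt" here only to keep clear of
  the Delphi "default" directive. }
function ConsoleAction(prompt, deflt: string): boolean;
var
  reply: string;
begin
  Write(prompt, ' [', deflt, '] ');
  ReadLn(reply);
  if reply = '' then
    reply := deflt;                { empty input accepts the default }
  { assumption: a reply starting with Y or y means "yes" }
  Result := (reply <> '') and (UpCase(reply[1]) = 'Y');
end;

begin
  { stand-alone trial of the callback; enrichword itself is not called }
  if ConsoleAction('Remove the mutation from "thart"?', 'N') then
    WriteLn('demutate')
  else
    WriteLn('leave as is');
end.
```

In a windowed application the same shape could instead be backed by an edit box or a dialog, as the demo does.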

segment: set to true (the default value) if the word is to be considered for segmentation at hyphens or apostrophes; otherwise set to false.

demutate: set to true (the default value) if the word is to be considered for demutation; otherwise set to false.

Note that the function may usefully be called even if both segment and demutate are false: through the endofline parameter, it can still be used to resolve end-of-line hyphens in a running text.

The three boolean parameters endofline, prefixnow and continuation are all set false in the demo, where words are typed in one at a time.  They may however prove useful in situations which arise when the words supplied to enrichword have been automatically extracted from running text.

endofline: if endofline is true and word ends in a hyphen, word is treated as the first part of a broken word, and (irrespective of the value of segment) the user is asked whether the hyphen should be dropped or regarded as permanent.  In addition, but only if segment is also true: if word ends in an apostrophe, does not consist of the apostrophe alone, and is on a stored list which contains d', m', b' etc., a splitter is inserted at the end of word, after the apostrophe.  This ensures that a splitter is inserted in cases where a word like "b'" is line-final and something like "fhearr" begins the next line; the same insertion can also be forced by setting prefixnow true.  The default value for endofline is false.

prefixnow: effective only if segment is true. Its default value is false.  If set true, it may force insertion of a splitter after a word-final apostrophe.

To understand the operation of prefixnow, suppose enrichword takes delivery of a word ending in an apostrophe, such as "d'".  Normally enrichword will perform no splitting here, whereas it will split a word like "d'achan" to give "d'+achan".  It can happen, however, that the original text has a string like "d'*Éamonn", where the "*" is a markup character (e.g. marking a name); and this could cause the string to arrive at enrichword as two words, the first of which is "d'".  In this case, we DO want to return "d'+", to mark a clear split at this point, rather than continuing to depend on the markup character.  Setting prefixnow to true will force insertion of the splitter character after a word-final apostrophe, in either of two situations: (1) word consists of an apostrophe and nothing else (e.g. "'*Éamoinn" from "a *Éamoinn"); (2) word is on a stored list which contains d', m', b', etc.  (In the second case, setting endofline true also forces splitting.)

splitter: a character to be inserted in the result string where a segmentation point is detected (if segment is true), or at an end-of-line hyphen made permanent; default value '+'

padesc: a character to be inserted in the result string before any character which is detected to be part of a mutation (if demutate is true), or at an end-of-line hyphen to be suppressed; default value '^'

continuation: effective only if demutate is true.  If continuation is set true, word will be treated as the second part of a broken word, and demutation of its initial will not be attempted, though of course if it should turn out to be splittable, demutation of later sections of it will be attempted.  The default value is false.

returned value: the word with splitter and padesc characters inserted as appropriate
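
To make the roles of endofline and continuation more concrete, the sketch below walks through two lines of made-up text and prints the flag values a caller might pass for each word.  It does not call enrichword.  The whitespace tokenisation, the sample text, and the rule that continuation is set whenever the previous line ended in a hyphen are all assumptions of this example; a real caller might, for instance, also take the user's answer about the hyphen into account.

```pascal
program FlagSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  { two lines of made-up running text; the first ends in a broken word }
  TextLines: array[0..1] of string = ('duine an-', 'mhaith ar fad');

var
  i, p: Integer;
  rest, w: string;
  first, brokenBefore, endofline, continuation: boolean;
begin
  brokenBefore := False;
  for i := Low(TextLines) to High(TextLines) do
  begin
    rest := Trim(TextLines[i]);
    w := '';
    first := True;
    while rest <> '' do
    begin
      { peel the next space-delimited word off the front of the line }
      p := Pos(' ', rest);
      if p = 0 then
      begin
        w := rest;
        rest := '';
      end
      else
      begin
        w := Copy(rest, 1, p - 1);
        rest := Trim(Copy(rest, p + 1, Length(rest)));
      end;
      endofline := (rest = '');               { last word on its line }
      continuation := first and brokenBefore; { follows a line-final hyphen }
      WriteLn(w, '  endofline=', endofline, '  continuation=', continuation);
      first := False;
    end;
    { remember whether this line ended in a hyphen }
    brokenBefore := (w <> '') and (w[Length(w)] = '-');
  end;
end.
```

Here endofline is set for every line-final word, since by the description above it is enrichword itself that checks whether the word ends in a hyphen or apostrophe.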


In the demo, word is as typed by the user; action uses a one-line edit box on the screen; segment and demutate are always true; endofline, prefixnow and continuation are always false; splitter is '+' and padesc is '^', but before the output is displayed splitter is replaced by a number of spaces, while padesc and the character following it are removed.
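
The display step described above can be sketched as follows.  This is not the demo's own code: the function name DisplayForm and the choice of two spaces per splitter are assumptions, and the sample string is a return value constructed from the parameter descriptions rather than actual output of the tool.

```pascal
program StripMarks;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

{ Turn a marked-up result into display form: each splitter becomes
  two spaces, and each padesc is dropped together with the character
  that follows it. }
function DisplayForm(const marked: string; splitter, padesc: char): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(marked) do
  begin
    if marked[i] = padesc then
      Inc(i, 2)                    { skip padesc and the escaped letter }
    else
    begin
      if marked[i] = splitter then
        Result := Result + '  '
      else
        Result := Result + marked[i];
      Inc(i);
    end;
  end;
end;

begin
  { assumed return value for the input b'fhada, not real tool output }
  WriteLn(DisplayForm('b''+f^hada', '+', '^'));
end.
```

A program that needs the mutation information, rather than just a display form, would keep the padesc marks instead of discarding them.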

In general, it is the responsibility of the user program which calls enrichword to:


The file of word lists will be opened when first required by the application.  If you wish to open it before then, you may call the procedure:
processlists (action)
where action has the same specification as the similarly-named parameter of enrichword above.  The same function that is passed to enrichword may be reused here.


Tá sé ceadmhach úsáid ar bith is mian leat a bhaint as an acmhainn seo, agus a chur in oireamhaint duit féin.  Má bhaintear leas as, ba mhór liom dá dtabharfaí creideamhaint don áit a bhfuarthas é.

This tool may be used freely and adapted in any way. If it is found useful, it would be appreciated if its source were credited.


Ciarán Ó Duibhín
2002/11/24