The System of Thesauri in the Sejm Library - implementation in ALEPH system

Presented at the 6th ICAU Meeting in Prague, 18-19 September 1995
Ewa Chmielewska-Gorczyca
Katarzyna Nowosad
Bozena Dabrowska
THE SEJM LIBRARY

The work of translating the EUROVOC thesaurus began in the Chancellory of the Sejm more than two years ago. After a survey of various indexing tools, EUROVOC was chosen as the most suitable for the needs of the different information services of the Chancellory. The goal was to replace the various subject retrieval systems that existed in the Chancellory and were used for different databases, catalogues and files with a single indexing language meeting the needs of all those services. Until then, subject headings and a variation of UDC were used in the Sejm Library catalogue, a home-made classification was used for the database (file) of periodical articles, a very simple home-made thesaurus was used for the database of legal acts, etc. These systems were not compatible with one another, and indexers and searchers had to learn each of them separately. Besides, maintaining several indexing systems was a serious burden. The possibility of replacing them with one language that was, more importantly, compatible with the European Community's system of subject approach seemed very exciting and promising.

The translation was completed within one year; the result was a database of English, French, German and Polish equivalents linked hierarchically. Unfortunately, most of the time was spent not on finding Polish equivalents but on finding French and German equivalents for the English terms (with the help of the "Multilingual Thesaurus"), typing them into the computer, linking them with relationships (exceptionally time-consuming), etc. If EUROVOC had been available in machine-readable form (not only as a hard copy), the work might have been done in half the time.

First of all, we needed the system to be more user-friendly. To achieve this goal, certain changes to the EUROVOC policy were necessary. They were:

more non-descriptors,
polyhierarchy,
AND and OR references,
more scope notes and indexing instructions,
more RT-relationships,
faceted structure.

Of course, all those changes were introduced only in the Polish thesaurus and will not be developed in the remaining versions.

Non-descriptors.

In EUROVOC the number of non-descriptors is quite limited (on average fewer than one non-descriptor per index term). In the Polish version it is much greater and still growing, because we want to lead our end-user to the proper indexing term (preferred term) through all possible terms, regardless of how bizarre they may seem to us. As a result, there are descriptors with ten or more lead-in terms.

Polyhierarchy.

In EUROVOC polyhierarchy was used in only two fields, Geography and International Organizations, and even there on a very limited scale. In the Polish thesaurus polyhierarchy was extended to all fields and used quite generously. This means that a term can appear in more than one place in the classification scheme (subject-oriented thesaurus), at different hierarchical levels. Thus "TUNISIA" appears as a narrower term under "Arab countries", "French-speaking Africa", "Maghreb", "Mediterranean countries", "North Africa", etc. As a consequence, the descriptor TUNISIA has several broader terms in the alphabetical list.
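The polyhierarchy described above can be pictured as a many-to-many mapping between terms and their broader terms. The following is only a minimal sketch in Python, not the structure used in MicroIsis or ALEPH; the class name `Thesaurus` and the method `add_bt` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical in-memory thesaurus allowing several broader terms per term."""

    def __init__(self):
        # term -> set of broader terms (BT); more than one entry means polyhierarchy
        self.broader = defaultdict(set)
        # term -> set of narrower terms (NT); kept in sync with self.broader
        self.narrower = defaultdict(set)

    def add_bt(self, term, broader_term):
        """Record that `broader_term` is a BT of `term` (and `term` an NT of it)."""
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)


thesaurus = Thesaurus()
for bt in ["Arab countries", "French-speaking Africa", "Maghreb",
           "Mediterranean countries", "North Africa"]:
    thesaurus.add_bt("TUNISIA", bt)

# The alphabetical entry for TUNISIA lists all of its broader terms.
tunisia_bts = sorted(thesaurus.broader["TUNISIA"])
for bt in tunisia_bts:
    print("TUNISIA  BT", bt)
```

Keeping both directions of the relationship in step means that the same descriptor shows up under every one of its broader terms in the systematic display.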

AND and OR references.

The thesaurus maintenance software available in the Sejm Library allows the use of AND and OR references. A reference of the AND type is a non-preferred term leading to more than one preferred term, to be used in combination.

The AND reference should not be confused with the OR-type reference, which leads from a non-preferred term to more than one descriptor, this time not in combination but as alternatives, e.g. customs check USE customs inspection OR customs formalities. It reminds users that the chosen term has to be replaced in the system by two (or more) specific ones.
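The difference between the two reference types can be sketched in a few lines of Python. This is an illustrative model only, not the format of the thesaurus maintenance software: the class `UseReference`, the function `indexing_terms`, and the placeholder terms in the AND example are all hypothetical; only the OR example comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UseReference:
    """Hypothetical USE reference from a non-descriptor to several descriptors."""
    non_descriptor: str
    targets: Tuple[str, ...]
    mode: str  # "AND" = use all targets together, "OR" = choose among them

    def render(self) -> str:
        joiner = " " + self.mode + " "
        return self.non_descriptor + " USE " + joiner.join(self.targets)


def indexing_terms(ref: UseReference, choice: Optional[str] = None) -> List[str]:
    """Return the descriptors to assign when a user enters the non-descriptor."""
    if ref.mode == "AND":
        # AND: every target descriptor is assigned, in combination.
        return list(ref.targets)
    # OR: the user must pick one of the alternatives.
    if choice not in ref.targets:
        raise ValueError("choose one of: " + ", ".join(ref.targets))
    return [choice]


or_ref = UseReference("customs check",
                      ("customs inspection", "customs formalities"), "OR")
# Placeholder terms; the source gives no AND example.
and_ref = UseReference("term X", ("descriptor A", "descriptor B"), "AND")

print(or_ref.render())
print(indexing_terms(or_ref, "customs inspection"))
print(indexing_terms(and_ref))
```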

Scope notes and indexing instructions.

In EUROVOC scope notes are scarce (344 scope notes for 5 359 descriptors). In the Polish system they are more numerous and are divided into two kinds: those explaining the meaning or the scope of a term (a sort of definition) and those concerning the usage of a term in indexing (a sort of indexing instruction).

Related terms.

The intent to give the user as much support as possible also resulted in a much greater number of related terms (associative relationships) assigned to a descriptor than is the case in EUROVOC. All possible associations were traced and added, giving a full list of connected concepts that may be useful when indexing or searching.

This helps users check whether the term they have chosen to represent the content of a document or an information query could be replaced by a more adequate one.

Faceted structure.

Another innovation in the Polish thesaurus was the application of a faceted structure wherever possible. This means that narrower terms are grouped together according to one characteristic of division, e.g. for the countries

(by political system): capitalist countries; socialist countries
(by level of development): developed countries; developing countries; less-developed countries.
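A faceted display of this kind can be modelled as narrower terms grouped under a facet label. The sketch below is an assumption about one possible representation, not the layout of the STEBIS records; the descriptor name "countries" and the function `facet_lines` are hypothetical.

```python
# Hypothetical structure: descriptor -> {facet label -> narrower terms}
facets = {
    "countries": {
        "by political system": ["capitalist countries", "socialist countries"],
        "by level of development": ["developed countries",
                                    "developing countries",
                                    "less-developed countries"],
    }
}


def facet_lines(descriptor):
    """Format the narrower terms of a descriptor, one line per facet."""
    return ["(" + label + "): " + "; ".join(terms)
            for label, terms in facets[descriptor].items()]


lines = facet_lines("countries")
for line in lines:
    print(line)
```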

System of thesauri.

Nevertheless, the most important change was the complete restructuring of EUROVOC.

Even during the translation it was noticed that not all the hierarchical and associative relationships were adequate to Polish (or at least Chancellory of the Sejm) needs.

The most crucial problem with the hierarchical structure was that our library is not only a parliamentary library but, more importantly, the main (leading) law library in Poland, and consequently our collections are mainly on law. Legal subjects are dispersed (scattered) all over EUROVOC (they are in various fields, or microthesauri). There are, indeed, some microthesauri dedicated completely to law, e.g. sources and branches of the law, civil law, criminal law, international law, but they are very modest and do not gather the terms that are in fact narrower to them (those that are the objects of the particular field of law). Most of the terms connected with law are scattered across different microthesauri, e.g. administrative law (0436 - Executive power and public service), agricultural law (5606 - Agricultural policy), commercial law (2006 - Trade policy), company law (4006 - Company organization), electoral law (0416 - Electoral procedure), energy law (6606 - Energy policy), environmental law (5206 - Environmental policy), family law (2806 - Family), patent law (6416 - Research and intellectual property), data-processing law (3236 - Computer science), transport law (4806 - Transport policy), etc.

To bring the legal terminology together and to keep a hierarchical structure that fully corresponds to the legal point of view, EUROVOC had to be "rebuilt". A "legal thesaurus" was constructed on its basis. However, as already mentioned, the translation of EUROVOC was supposed to serve other information services as well, not necessarily dedicated to law. To satisfy the needs of the users of those services, e.g. those oriented towards politics or parliamentary affairs, the thesaurus would have had to be rebuilt once more, this time organizing its structure from the political or parliamentary point of view. To avoid doing the same work many times, the content of EUROVOC (the set of descriptors) was divided into 10 subthesauri (together called the STEBIS system):

GEO- Thesaurus of Geographical Terms
TIO- Thesaurus of International Organizations
LAW- Thesaurus of Law
PAR- Thesaurus of Parliamentary Affairs
EBU- Thesaurus of Economy & Business
POL- Thesaurus of Politics
SEC- Thesaurus of Science, Education, Culture, Arts & Religion
TIC- Thesaurus of Information & Communications
SOC- Thesaurus of Social Policy & Environment
TIA- Thesaurus of Transport, Industry & Agriculture

The first two exist as separate microthesauri, but also as a combined file. They constitute a structured list of what the theory of indexing languages calls identifiers, that is, proper names, and can be added to any other thesaurus (in fact, not only one from the STEBIS system).

The other thesauri of the STEBIS system gather all the EUROVOC terminology specific to the fields to which they are dedicated. The sets of terms separated (isolated) in this way overlap, i.e. some terms appear in more than one thesaurus. In this case, they always have the same form and the same set of lead-in terms (UF-terms) attached to them, but their semantic relationships may differ. This difference is the reason for creating many thesauri instead of a single one, as there are pairs of terms that are connected by the reverse relationship in different fields, e.g.

in SOC family: NT family law
in LAW family law: NT family

Such a reverse relationship cannot be represented in a single thesaurus.
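The reason can be made concrete: if the NT relationships of SOC and LAW were merged into one hierarchy, "family" and "family law" would each be narrower than the other, which is a cycle. The following sketch checks this with a depth-first search. It is an illustration only; the dictionaries and the functions `merge` and `has_cycle` are hypothetical and contain just the pair of terms quoted above.

```python
# Each thesaurus keeps its own NT relationships: term -> set of narrower terms.
soc = {"family": {"family law"}}
law = {"family law": {"family"}}


def merge(*nt_maps):
    """Combine several NT maps into one, as a single thesaurus would require."""
    merged = {}
    for nt_map in nt_maps:
        for term, narrower in nt_map.items():
            merged.setdefault(term, set()).update(narrower)
    return merged


def has_cycle(nt_map):
    """Return True if following NT links can lead from a term back to itself."""
    in_progress, done = set(), set()

    def visit(term):
        in_progress.add(term)
        for nt in nt_map.get(term, ()):
            if nt in in_progress:
                return True  # reached a term that is still being expanded
            if nt not in done and visit(nt):
                return True
        in_progress.discard(term)
        done.add(term)
        return False

    return any(term not in done and visit(term) for term in list(nt_map))


print(has_cycle(soc), has_cycle(law), has_cycle(merge(soc, law)))
```

Each thesaurus is a valid hierarchy on its own; only the merged one is circular, which is why the relationships are stored per thesaurus while the terms and their lead-in terms are shared.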

In practice, every thesaurus has a main part, which groups all the terms typical of its field (e.g. in LAW, all terms representing legal concepts) and reflects the semantic structure adequate to that field, as well as an auxiliary part (fringe areas) containing all the terms from EUROVOC that were needed in indexing although they did not fit the field (had been recognized as being out of its scope). Thus, you can use a single thesaurus of the system for searching the database without having to switch from one to another all the time, but each one has a different semantic structure (hierarchical and associative relationships).

Implementation of Thesaurus in ALEPH system

The STEBIS system was created and is still maintained in MicroIsis by Dr. Ewa Chmielewska-Gorczyca, the head of the Subject Heading Department in our library.
In April 1995 LAW (Thesaurus of Law) was implemented in ALEPH to serve as an indexing and searching tool.

The following procedure of implementation was used:

1. Database records were exported from MicroIsis via the printing utility in the sequential format.
2. Using UTIL - 91, those records were uploaded into a new local library, CLA, in the global library TEZ, which is dedicated only to LAW. Thesaurus record numbers are the same as in the source database. STEBIS is still maintained in MicroIsis, and always using the same system numbers allows us to keep them when updating CLA.
3. Using UTIL - 67 for GBL TEZ, a file in the ACCREF sequential format was prepared for the local library BIS in the global library KSJ.
4. For GBL KSJ we used UTIL - 65 to upload the ACCREF file from CLA (prepared in step 3).
It is very important that the first non-empty field in the CLA record has the same name as the created ACC code.

Apart from TEZ we have another global library (GBL), TGM, for Geographical Names and International Organizations.

It is a pity that not all types of cross-references are created in ACCREF (only seef). That is why we chose the option of displaying the thesaurus record in the EXPAND window instead of the ACC file record. All cross-references exist only in the thesaurus records, but from there it is not possible to continue searching directly. The polyhierarchy and multilingual aspects of STEBIS are lost. We believe that in the near future all the advantages of STEBIS can be carried over to ALEPH.