A technique for preparing database input in the ISO 2709 exchange format

Trachtengerts M.S.

 

In this article I consider a workplace for an operator who prepares data records for input into a database (DB) that accepts the ISO 2709 exchange format. Its basic tool is the program IsoWin, which edits and converts text. As an example, the technique is applied to CDS/ISIS for Windows.

Introduction

It is well known that routinely adding new documents is a labour-intensive part of maintaining bibliographic and other information retrieval systems (IRS). The Thermophysical Center of the Institute for High Temperatures of the Russian Academy of Sciences has developed a technique for organizing input efficiently for the IRS CDS/ISIS for Windows (ISIS). For decades the Center has used various versions of ISIS, developed by UNESCO, for its main database TERMAL on the properties of substances.

ISIS has two means to enter new information:

The second way is much more effective, but it requires that the document file be prepared beforehand in some way; for example, it can be exported in ISO format from another IRS that supports such a conversion.

One more important circumstance, which has arisen with the global Internet, must be considered. Many of the documents that should be included in databases (articles, books, materials from periodicals, etc.) have become accessible there in electronic form. This reduces the manual typing of text into the fields of the system interface and considerably lowers the cost of data input.

The ISIS system interface can be used in this case too: the operator copies the electronic document line by line via the clipboard and then enters the finished document in the usual way, as if it had been typed by hand.

But it is also possible to mark up the text of the electronic document in a special way and convert it to the ISO 2709 data exchange format. Records prepared in this way in different sessions and/or by different operators can be collected in one file and added to a DB in a single session.

The same technique is now easy to apply when a relevant document is available only on paper. With a scanner and a text recognition program it can easily be transformed into electronic form. In some cases only the necessary part of the document (authors, title, abstract, etc.) needs to be scanned for input to a database. After that, the procedure mentioned above can be applied to obtain records in ISO 2709 format.

As an example, I now show some details of the approach applied at the Thermophysical Center (TPC) of the Institute for High Temperatures in Moscow, Russia.

Clearly, a text marking scheme must correspond to the DB structure in ISIS. So, to prepare input for a given DB, one must use that DB's own tags in the marking procedure, and these may differ in every ISIS database. Such an approach has been developed, and the program IsoWin (Beta version), which transforms a marked record into the ISO 2709 exchange format, has been created. The procedure is illustrated with examples from a real DB on the thermophysical properties of pure substances, TERMAL.

The scheme of text marking

Each database in ISIS has, as a rule, a unique record structure described in its Field Description Table (FDT). In marking the original text we use fragments from the FDT to label fields and subfields (tags and subfield indexes). This ensures that the new data are compatible with the database for which they are intended. It is not necessary to use all FDT labels, only those that have appropriate content in the text.

For example, the FDT for the DB TERMAL (ТЕРМАЛЬ) has the following structure:

Tag   Name of field                   Subfield
001   Authors                         A
002   Title in Russian
003   Title original
004   Source                          A B
005   Conference
006   Abstract
007   Full text holding (hard)
008   Properties                      A
009   Carrier                         A
010   Phase                           A
011   Phase trans.                    A
012   Property type                   A
013   External physical field         A
014   Kind of work                    A
015   Chemical formula                A
016   Class of substance              A
017   TPC number
018   Kind of document
019   Language
020   Year of the publication
021   Expert
022   Low temperature of interval
023   High temperature of interval
024   Low pressure of interval
025   High pressure of interval
026   Name of electronic full text

The letters in the right-hand column (subfield indexes) show the marks for repeating records in those fields.
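To illustrate how an FDT constrains the marking, here is a minimal Python sketch that checks a prepared record against a few rows of the table above. It is not part of IsoWin. The dictionary layout, the function name check_record and the (tag, subfield, text) representation of a record are assumptions made for this example, as is the reading of the right-hand column as the set of subfield letters allowed for a field.

```python
# A few rows of the FDT above: tag -> (field name, subfield letters).
# The layout of this dictionary is an assumption made for illustration.
FDT = {
    "001": ("Authors", "A"),
    "003": ("Title original", ""),
    "004": ("Source", "AB"),
    "006": ("Abstract", ""),
    "020": ("Year of the publication", ""),
}


def check_record(entries, fdt=FDT):
    """Return a list of problems for entries given as (tag, subfield, text)."""
    problems = []
    for tag, subfield, text in entries:
        if tag not in fdt:
            problems.append(f"{tag}: tag is not in the FDT")
        elif subfield and subfield.upper() not in set(fdt[tag][1]):
            problems.append(f"{tag}: subfield {subfield!r} is not defined")
        elif not text.strip():
            problems.append(f"{tag}: empty content")
    return problems


# Hypothetical record: only the labels that have content are used.
record = [("001", "A", "Smith J."), ("004", "B", "1999, v. 5"), ("020", "", "1999")]
print(check_record(record))
```

A check of this kind could be run before conversion, so that a mistyped tag is caught while the source text is still easy to correct.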

Fig. 1. Editor window of IsoWin with the text of a relevant document marked in accordance with the FDT structure of DB TERMAL.

As one can see, the rules of text marking for ISIS databases are simple:

These rules are largely self-evident. In the example in Fig. 1 the basic part of the record, in English, was taken from an external electronic source (the Internet, fields 001-006). The other part (search terms, names of substances and properties in accordance with the thesaurus of DB TERMAL in Russian, and others) was added by an expert reviewer after analysing the content of the publication as a whole.

Program IsoWin

This program transforms the marked text into the ISO 2709 exchange format. IsoWin runs on Microsoft Windows 95 and above. The program consists of two basic functional parts: its own text editor for text marking, and the converter of the marked text. Generally speaking, one may use any other editor that stores the text in a file exactly as it appears in the editor window. An example of a suitable editor is Microsoft Notepad. The marked file should contain no hidden symbols that control the kind and size of fonts, tabulation, etc., apart from the "line feed" and "carriage return" commands.

According to the position adopted by the developers of CDS/ISIS at UNESCO, it is necessary to preserve the information collections that users gathered in DOS versions with OEM coding. Therefore the extended multilingual Unicode codes are not used in ISIS and, accordingly, not in this program.

In this Beta version the IsoWin editor performs all operations typical of text editors, but to simplify work with external sources it has the following additional features:

The editor has quantitative restrictions:

After the marked file has been prepared in the editor window, or opened ready-made from another source, its conversion to the exchange format is started by the command GetISO (the button on the panel or the command in the menu Make). The result is shown in the same editor window. The structure of the exchange format is not discussed here. I'll note only that the record of each document begins with a long sequence of digits that encodes its structure. So it is possible to pick out separate records visually and to get some idea of the file content. Generally speaking, minimal editing of the result is possible in the "replace" (overwrite) mode of the keyboard, provided the field lengths do not change.
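The long sequence of digits at the start of each record is the ISO 2709 leader followed by the directory. The Python sketch below is not the IsoWin code; it only shows, under stated assumptions, how such a record can be assembled and read back. The assumptions are a MARC-style directory map "4500" (12-character entries), indicator and identifier lengths of 0, and the cp866 code page for the DOS coding. The function names are hypothetical. Actual ISIS import files may differ in details such as terminator characters and line wrapping, which this sketch does not try to reproduce.

```python
FIELD_END = b"\x1e"    # field terminator defined by ISO 2709
RECORD_END = b"\x1d"   # record terminator defined by ISO 2709


def build_record(fields, encoding="cp866"):
    """Build one record from (tag, text) pairs; tags are up to 3 characters."""
    directory = b""
    data = b""
    for tag, text in fields:
        body = text.encode(encoding) + FIELD_END
        # Directory entry: tag (3), field length (4), start position (5).
        directory += f"{tag:0>3}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    directory += FIELD_END
    base = 24 + len(directory)              # where the field data begins
    total = base + len(data) + len(RECORD_END)
    # Leader (24 characters): record length, status and implementation
    # codes (blank), indicator and identifier lengths (assumed 0), base
    # address, 3 blanks, and the directory map "4500".
    leader = f"{total:05d}" + " " * 5 + "00" + f"{base:05d}" + " " * 3 + "4500"
    return leader.encode("ascii") + directory + data + RECORD_END


def read_fields(record, encoding="cp866"):
    """Decode a record made by build_record back into (tag, text) pairs."""
    base = int(record[12:17])
    directory = record[24:base - 1]         # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = record[base + start:base + start + length - 1]
        fields.append((tag, body.decode(encoding)))
    return fields


record = build_record([("001", "Smith J."), ("020", "1999")])
print(record[:24].decode("ascii"))   # the leader
print(read_fields(record))
```

Because every length and position in the directory is counted in characters of the stored field, any edit that changes a field length invalidates the directory, which is why only overwrite-mode corrections are safe in the converted file.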

As was said above, texts in the internal records of ISIS databases have DOS coding (OEM), but nowadays they are almost always prepared or downloaded in Windows (ANSI code). Therefore the program can convert to the exchange format in two modes: with transformation of the coding from Windows to DOS (ANSI to OEM), and without it. The second mode is used when the initial file was prepared in OEM coding. It should be noted that, although an ANSI to OEM transformation is indicated in ISIS, its results for Russian are inappropriate: data import in ISIS needs a file in OEM coding. Probably the same occurs with other languages. In this version the ANSI to OEM transformation is performed only for Russian, with the program's own correspondence table. Russian letters in an input file that was coded in OEM by any other editor look unreadable in the IsoWin editor and are not fit for editing. The usual mode of operation is to type and edit text in the standard Windows mode and then convert the result to OEM.
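The two modes can be pictured with a short Python sketch. It is an illustration, not IsoWin's implementation: it assumes that the Windows (ANSI) coding for Russian is code page cp1251 and the DOS (OEM) coding is cp866, and the function name prepare_for_isis is hypothetical.

```python
def prepare_for_isis(raw: bytes, ansi_to_oem: bool = True) -> bytes:
    """Return the bytes of a marked file in DOS (OEM) coding."""
    if not ansi_to_oem:
        # Second mode: the file was already prepared in OEM coding.
        return raw
    # First mode: reinterpret Windows bytes as text, then store as DOS bytes.
    text = raw.decode("cp1251", errors="replace")
    return text.encode("cp866", errors="replace")


windows_bytes = "Медь, теплоёмкость".encode("cp1251")
oem_bytes = prepare_for_isis(windows_bytes)

# Viewing OEM bytes as if they were Windows text gives unreadable letters.
print(oem_bytes.decode("cp1251", errors="replace"))
```

The last line shows why OEM-coded Russian text is not fit for editing in a Windows editor: the same bytes stand for different letters in the two code pages.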

The algorithm for code conversion ANSI->OEM and back was chosen because documents in DB TERMAL are bilingual: as a rule, the same document contains records in both Russian and English. It is well known that most of the relevant information in science is now published (or presented as abstracts) in English. At the same time, search terms are presented in Russian for the convenience of local users. This is shown in the example document in Fig. 1. The code conversion concerns Cyrillic letters only, since Latin letters and punctuation look correct in both codes. In IsoWin the transformation is made with a pair of related code strings: for Windows, SWin="ёЁTюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ", and for DOS, SDos. The latter is not presented here because it appears as a string of meaningless symbols. I think the problem appears in every local ISIS DB application whose alphabet differs from English. For localization to another language these code strings may easily be obtained by the following steps:

Such an algorithm makes localization of the program IsoWin simple enough. Localization does not change its English interface, but this is not an inconvenience because English interfaces are common in the computer world.
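The idea of a pair of related code strings such as SWin and SDos can be sketched in Python as a byte translation table. This is an illustration under assumptions, not the program's source: the alphabet below contains only the Cyrillic letters, cp1251 and cp866 are assumed to be the Windows and DOS code pages, and the names make_tables, ansi_to_oem and oem_to_ansi are hypothetical.

```python
# National letters to be converted; all other bytes pass through unchanged.
ALPHABET = ("ёЁюабцдефгхийклмнопярстужвьызшэщчъ"
            "ЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ")


def make_tables(alphabet, win_codec="cp1251", dos_codec="cp866"):
    """Build the paired code strings and the two translation tables."""
    s_win = alphabet.encode(win_codec)   # the same letters in Windows coding
    s_dos = alphabet.encode(dos_codec)   # the same letters in DOS coding
    if len(s_win) != len(alphabet) or len(s_dos) != len(alphabet):
        raise ValueError("both code pages must be single-byte")
    return bytes.maketrans(s_win, s_dos), bytes.maketrans(s_dos, s_win)


WIN_TO_DOS, DOS_TO_WIN = make_tables(ALPHABET)


def ansi_to_oem(data: bytes) -> bytes:
    return data.translate(WIN_TO_DOS)


def oem_to_ansi(data: bytes) -> bytes:
    return data.translate(DOS_TO_WIN)


source = "Теплоёмкость Cu, 300 K".encode("cp1251")
converted = ansi_to_oem(source)
print(oem_to_ansi(converted).decode("cp1251"))
```

Localization then amounts to supplying another alphabet and another pair of code pages, for instance cp1250 and cp852 for a Central European language, if those are the code pages in use. Since only the listed letters are mapped, other characters above the ASCII range are left as they are and may not survive a round trip.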

First, the command GetISO automatically backs up the initial text to the file Source.txt in the same directory (folder) from which IsoWin is running. After conversion, the initial text in the editor window is replaced by the result, which is also automatically stored in the file Result.iso in the same folder. If the option ANSI->OEM is active, Cyrillic letters become unreadable. This file can also be saved elsewhere by the command Save as.

The result file can be viewed after transforming it back to Windows coding by the command AnsiToOem in the menu Window or by the button on the tool panel. Saving it in this coding is not useful, since a DB in ISIS will not import it correctly, but the view can be used to detect mistakes. The file Result.iso remains unchanged in its original place. It can be shown in the editor window by the command Result. If mistakes are found in the converted file, one can open the latest version of Source.txt by the command Source and make the necessary corrections. The other menu commands and corresponding buttons are typical of simple editors and need no special explanation.

Certainly, the program IsoWin can also be used for other databases that accept information in the ISO 2709 exchange format in a different coding.

For many information services IsoWin may in practice become the basic tool at the workplace of an operator who prepares data for input into a DB.

The Beta version is available free of charge from the developer. All comments, discussions, and proposals for localization are welcome; please send them to the developer, Michael Trachtengerts, PhD, at isoconvert@yahoo.com.