
Pollution and Protection of Informatic Cultural Heritage

Stefano Colonna
ISSN 1127-4883 BTA - Telematic Bulletin of Art, August 20th 2001, n. 277
http://www.bta.it/txt/a0/02/en/bta00277.html

Art historians adopted informatic tools later than the scientific disciplines did. The first important applications of automatic processing to art-historical data were experiments with prototype electronic records for the cataloguing of Cultural Heritage and with the automatic analysis of art texts, for example Vasari's Vite ¹. Telematic applications, which are meant to allow the online exchange of information over networks, developed with an even greater delay.
Along this line, the first step of informatic evolution in our field of interest, the History of Art, is digitisation, i.e. the change of the physical format of information from paper to bits. The second step is the publication of the data on an Intranet or on the public Internet.
But neither of the two steps, digitisation or telepublishing, can happen without a good encoding of the information.
The delay in the History of Art was not fortuitous: encoding art-historical data is more complex than encoding purely textual data, and processing it is more expensive because of the size of image and multimedia files.
Moreover, art historians are generally not inclined to use informatic encoding in historical studies. They regard encoding as an essentially automatic, hence repetitive, hence technical activity, one generally considered not directly related to university studies in the humanities.
Here we want to demonstrate that the empirical practice of encoding in fact produces very particular real situations that belong to the domain of Cultural Heritage in the literal sense; to bring out the damage caused by real cases of failed communication arising from what I define here as "informatic pollution"; and finally to advance a general solution for the protection of Informatic Cultural Heritage.

CONCEPT OF ENCODING

Encoding consists in the creation of a logical-symbolic frame of reference by which information can be converted in a one-to-one (biunivocal) way, without loss of data. Over the long history of automatic data processing, many levels of encoding have come into existence ².
To give only a few examples, with no claim to exhaustiveness: binary code, the smallest logical-mathematical element of computing; the ASCII character set and its extensions, which encode the letters of alphabets together with much of the related punctuation and special marks; machine language (Assembler), which allows communication between hardware and software; the operating system, which manages communication among the various software and hardware components; and application software, which carries out particular procedures and calculations.
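
To make the idea of a conversion "without loss of data" concrete, here is a minimal sketch, assuming plain ASCII text and 8 bits per character. The function names `text_to_bits` and `bits_to_text` are hypothetical and serve only as illustration: the text is encoded into binary code and decoded back, and the result is identical to the original.

```python
def text_to_bits(text: str) -> str:
    """Encode ASCII text as a string of 0s and 1s, 8 bits per character."""
    data = text.encode("ascii")  # raises UnicodeEncodeError outside ASCII
    return "".join(format(byte, "08b") for byte in data)


def bits_to_text(bits: str) -> str:
    """Decode a bit string produced by text_to_bits back into text."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


original = "Vite"
bits = text_to_bits(original)   # four characters become 32 bits
restored = bits_to_text(bits)   # the inverse conversion
print(restored == original)
```

The conversion is one-to-one only inside the declared frame of reference: a character outside ASCII is rejected by the encoder rather than silently altered.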

The encoding of texts and images is one of the most complicated operations for a computer and one with the most "humanistic" implications. It consists in analysing and declaring the structure of a text and its formatting with the SGML standard and its derivatives (HTML, DHTML, XML, etc.) and also, at an even more complex and sophisticated level, in META-information, which manages super-structured information such as the METADATA included, for example, inside an HTML file.
I want to stress that, in the context of SGML encoding, the word "text" does not mean text in the literal sense only, but the whole of the data deriving from traditional paper publishing, therefore including images and, by extension to modern multimedia publishing, also video, sound or music, and the execution of Java programs or other logical or network operations.

All the encodings mentioned above are well documented, solidly founded, widely publicised and commented, so that they are standards de facto and, I would say, also "by right" in the international context. The author of a new model of hardware or software, and also the author of any electronic publication, whether a single person or a public or private body, has to follow the existing encodings. Whoever wants to propose a new one has to submit it to public debate and prior approval: by forming a new working group composed of representatives of institutions and bodies other than their own, publishing a draft with the related request for comments, going through successive revisions and, at last, publishing the final document. This essentially cooperative process of managing information and innovation, based on building and respecting standard and universal encodings, has made global communication possible ³.

In hardware too, the initial restriction of IBM's proprietary architecture, the so-called "microchannel", was overcome by the global market, so that the personal computer spread into the home as well, paving the way for the subsequent world-wide success of the Internet.

Plain ASCII encoding of texts, theoretically perfect and universal, does not allow text to be associated with images. The HTML language, instead, which derives from SGML and is only one of many possible data encodings, uses the ASCII character set at two combined and structured levels: to create the text, and to create additional, structured information about the text itself. It has become the de facto standard for modern electronic and multimedia publishing founded upon universal encoding. Multimedia files produced according to the HTML standard can in fact be read and reproduced correctly on different operating systems, such as Microsoft Windows, Apple Macintosh, Unix, Linux, BSD, etc., and overcome the unpleasant barrier formed by the native incompatibility of file formats, media formats (floppy disk), etc.
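
The two levels described above can be sketched with Python's standard `html.parser` module. The sample page, the file name `portrait.jpg` and the class name `TwoLevelReader` are assumptions made only for illustration. The whole file is plain ASCII, yet the reader can separate the text itself from the structured information about it: the metadata and the reference that associates an image with the text.

```python
from html.parser import HTMLParser

# A hypothetical page: every character is ASCII, markup included.
SAMPLE = """<html><head>
<meta name="author" content="Giorgio Vasari">
<title>Vite</title></head>
<body><p>Life of Michelangelo</p>
<img src="portrait.jpg" alt="Portrait"></body></html>"""


class TwoLevelReader(HTMLParser):
    """Collect the text level and the markup level separately."""

    def __init__(self):
        super().__init__()
        self.text = []       # first level: the text itself
        self.images = []     # second level: references to images
        self.metadata = {}   # second level: META-information

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


reader = TwoLevelReader()
reader.feed(SAMPLE)
print(SAMPLE.isascii(), reader.text, reader.images, reader.metadata)
```

The image itself is not inside the file: only its address is, which is the reason an HTML document travels together with separate attachments.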

CONCEPT OF "POLLUTION" OF INFORMATIC DATA

The proliferation of intellectual energy caused by the sharing of resources and by free world-wide competition in writing software and encodings at the various levels mentioned above, and the establishment of such standards in practice in the virtual but real time and space of computing, have not prevented the formation of a set of very dangerous constraints and obstacles, which I define here as "pollutions", by analogy with other domains of applied science.

Every major Word Processor is supposed to guarantee that files can be saved in a universal format. But complex and structured texts that use tables and very precise formatting, and therefore all electronic publishing, even the simplest, travel badly even between different versions of the same software: the next version does not always read correctly the files produced by the previous one. Combining files created by programs from different software houses is very difficult.

Despite the existence and world-wide success of the HTML standard, this universal data encoding has been used only for publishing on the Internet or for making CD-ROMs. HTML is rarely adopted as a file exchange format, chiefly because a moderately experienced user finds it difficult to manage the potentially numerous attachments, such as images, which a Word Processor file conveniently groups inside a single file.

There is therefore an ever-increasing quantity of data destined, within a short time, to become obsolete and unable to communicate. Some software programs guarantee backward compatibility only for a limited period. Many files become incompatible with one another. Format conversion is a very hard informatic operation: it does not guarantee absolute fidelity, and it has to be carried out and supervised manually to avoid involuntary loss of data during the conversion process.
The present situation in fact implies an essential predominance of proprietary standards in the personal exchange of information. For example, neither HTML nor Microsoft RTF (Rich Text Format) preserves the progressive, automatic numbering of a text's footnotes after a format conversion.

A novelty worth considering is the invention and adoption of a set of proprietary filters that handle the input and output of files produced by the best-known Word Processors, inside a product that is free of charge: Star Office. Originally created and written by German developers and available for Linux, the open and free operating system, it was bought by Sun and offered also to users of the Microsoft Windows operating system, which makes it an atypical case. In fact this product, while remaining free for the single user, allows files in many formats to be imported, managed and saved without infringing the copyright of the respective Word Processor producers. But the conversion has some limits and does not always guarantee full compatibility.

Even more delicate is the management of complex and structured data, such as electronic archives originally created inside Word Processor tables, which grow considerably over time and cannot be converted automatically into relational database files, as would be necessary. In practice, a database created with Word Processor tables is equivalent to free text, because it has only the graphic appearance of a logical division of the information, not its real encoding according to the principles of relational algebra.
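
The difference can be sketched with Python's standard `sqlite3` module. The schema below (tables `author` and `work`, with their column names) is a hypothetical example, not a cataloguing standard. Unlike the cell of a Word Processor table, the author is encoded once and every work refers to that record by a key, so the relation between the two can be queried.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE work (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(id)
);
""")

# The author is stored once; works point to it through author_id.
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Michelangelo Merisi')")
conn.executemany(
    "INSERT INTO work (title, author_id) VALUES (?, ?)",
    [("Bacchus", 1), ("Medusa", 1)],
)

# A join reconstructs the logical link that free text only suggests visually.
rows = conn.execute(
    "SELECT work.title, author.name FROM work "
    "JOIN author ON author.id = work.author_id ORDER BY work.title"
).fetchall()
print(rows)
```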

The creation of an art-historical database in which authors' names are catalogued without reference to a dedicated thesaurus or authority file is bound to generate duplication, dispersion or loss of data. For example, records of a work of art with the author name "Michelangelo Merisi" would be kept distinct from others with the author name "Il Caravaggio", when in fact they are the same person. Anyone who created ex novo an authority file that has already been publicly published would duplicate existing information and prevent the automatic connection, via network, of the new electronic archive with existing ones. The more important the archive to be connected to, the more serious the deficiency.
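
A minimal sketch of how an authority file avoids this duplication follows. The identifier `A001`, the structure of the `AUTHORITY` dictionary and the function names are assumptions made for illustration; a real project would reuse an already published authority file rather than define its own. Every variant of a name resolves to the same identifier, so records filed under either name can be connected.

```python
# Hypothetical authority file: one identifier per person, many name forms.
AUTHORITY = {
    "A001": {
        "preferred": "Michelangelo Merisi",
        "variants": ["Il Caravaggio", "Caravaggio"],
    },
}


def build_index(authority):
    """Map every known name form (case-insensitive) to its identifier."""
    index = {}
    for auth_id, entry in authority.items():
        for name in [entry["preferred"], *entry["variants"]]:
            index[name.casefold()] = auth_id
    return index


INDEX = build_index(AUTHORITY)


def resolve(name):
    """Return the authority identifier for a name, or None if unknown."""
    return INDEX.get(name.strip().casefold())


print(resolve("Il Caravaggio") == resolve("Michelangelo Merisi"))
```

A name that is not in the file returns `None`, which signals that the cataloguer must consult or extend the shared authority file instead of inventing a new local form.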

The scientific community and civil society itself demand free access to information of public interest, a concept implicit in the institution of national libraries. Almost all OPACs, i.e. the electronic catalogues of libraries, are freely accessible on the Internet.
Likewise, art-historical resources relating to works of art in the public domain should be freely available, with restrictions, where necessary, limited to specific contexts.

Heavier still is the pollution produced not by the interest of a single person, but by the incapacity or lack of planning of information producers.
Much information published on the Web, often at high expense, becomes obsolete and is erased or definitively discarded instead of being properly and rationally archived. The damage is statistically enormous. Not only is the investment completely wasted, but the related historical memory is destroyed as well; and the greater damage is, of course, the second. It would be necessary to read the back-up units before completely erasing the existing data, in order to create one or more national and international archives of information.
Conversely, many users of Web resources download files but, still lacking an objective and universal identification system, which is currently still in a beta phase, cite the source by its Web address alone, although it is clear that a URL may change for many technical reasons on the part of servers and webmasters. It is as if one cited a periodical by its title alone.
The problem of "information pollution" is most acute when it is absolutely indispensable to guarantee the exact identification of a particular piece of information. If, for example, you want to cite a particular image that you have seen on a given Internet web site, with a given format, number of colours, DPI, etc., it is fundamental that the reader of the scholarly article in which you cite it can look at the same image in order to understand the text. Moving the resource to another directory of the web site, changing the site's domain, or changing the directory structure or the numbering system prevents the object from being identified, and generates what I would define as a type of "informatic pollution". It is certainly not "fraudulent pollution", because it is in fact unwitting, but it is a form of pollution nonetheless.
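
One possible way to make such a citation verifiable, offered here only as a sketch and not as the identification system mentioned above, is to record a checksum of the file's bytes next to the Web address. The class `ImageCitation`, its fields, the URL and the date are hypothetical. If the image moves to another address, a reader who finds a candidate copy can still check whether it is the same file.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageCitation:
    url: str         # address at the time of consultation (may change)
    retrieved: str   # date of consultation
    media_type: str  # declared format of the file
    sha256: str      # fingerprint of the exact bytes that were seen


def cite(url: str, retrieved: str, media_type: str, content: bytes) -> ImageCitation:
    """Build a citation that includes a SHA-256 fingerprint of the content."""
    return ImageCitation(url, retrieved, media_type, hashlib.sha256(content).hexdigest())


def is_same_resource(citation: ImageCitation, content: bytes) -> bool:
    """Check whether a candidate file is byte-identical to the cited one."""
    return hashlib.sha256(content).hexdigest() == citation.sha256


# Placeholder bytes stand in for a downloaded image file.
image_bytes = b"placeholder image bytes"
citation = cite("http://www.example.org/images/0001.jpg", "2001-08-20",
                "image/jpeg", image_bytes)
print(is_same_resource(citation, image_bytes))
```

The check is deliberately strict: an image resaved with a different number of colours or a different resolution produces a different fingerprint, which matches the requirement that the reader see exactly the image that was cited.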

Despite technical innovations, the substantial and inherent fragility of informatic media, whether magnetic, magneto-optical or optical, bears no comparison with the durability of paper, which has shown that it can withstand the wear of time for many centuries. In favour of the new technologies stand duplicability and reproduction absolutely identical to the original; against paper, the impossibility of faithfully reproducing a paper work. But this concept, too, is archivally fallacious.
In fact every document is composed not only of the content of the medium, but also of the nature, composition and dating of the medium itself. Just as a Renaissance letter is constituted not only by its text and data but also by the watermark of the paper mill that produced it, so the new informatic context is constituted not only by data but also by the type of informatic medium used to transmit the data themselves: by the name and type of the server and the type and version of the file system that manages the information, if the transmission occurs within a network; by the description of the information packet that contains the single datum, if the information was produced or transmitted inside a group according to a predefined logic. These are all elements that at first sight seem superfluous but that, in the long term and in a scientific context, are absolutely indispensable for understanding the meaning of "polluted" information: for example, for understanding data that lack a chronology or an author.

By "pollution of informatic data", then, I mean the alteration of the data or of their medium itself with respect to the encoding adopted when they were created; but also the failure to adopt the so-called best practices, with the consequent partial or complete loss of meaning or of the capacity to communicate.

AN INFORMATIC EXTENSION OF ITALIAN LAW 1089/1939

Since both databases and multimedia products, which can include all hypertexts published on the World Wide Web, fall within the field of application of copyright as intellectual products, it is necessary to understand the novelty of the phenomenon and to provide for suitable normative tools.

To limit the pollution of informatic data and the dispersion of the related historical memory, it would be necessary to act at the following different levels:

1) to promote best practices, especially in university research, in the light of the international context

2) to plan an extension of the field of application of the Italian protection Law 1089 of 1939 to the protection of informatic cultural heritage.

I am thinking of a tool that is not restricted to creating a mediateca in order to organise more effectively some material held by a single public institution, as RAI has done with its Teche project or Istituto Luce with its mediateca, but of a broad legal tool that enables Regions and Superintendences, when necessary, to exploit, place under protective constraint and acquire in copy any informatic resource of national and international interest as Cultural Heritage in the literal sense.

NOTES

All registered trademarks belong to their respective owners.

1 In Italy the "CRIBECU - Centro di Ricerche Informatiche per i Beni Culturali" of the Scuola Normale Superiore di Pisa has carried out pioneering work in this direction. But see our 249 Strumenti della ricerca storico-artistica dalla tradizione all'innovazione for a general analysis of the phenomenon.

2 Giuseppe Gigliozzi, Studi di codifica e trattamento automatico di testi, Roma, Bulzoni, 1987. Collana Informatica e Discipline Umanistiche diretta da Tito Orlandi, vol. 1.

3 To give just one example, the first Request for Comments was published on April 7th 1969, was signed by Steve Crocker of UCLA and concerns the implementation of the protocols of the new network. Today all the consortia that manage standards, first of all the W3 Consortium, which manages HTML and XML, publish the public drafts for every working phase of their standards.