Roberto Busa & IBM Adapt Punched Card Tabulating to Sort Words in a Literary Text: The Origins of Humanities Computing (1949 – 1951)

In 1949 Roberto Busa, a Jesuit priest, professor of Ontology, Theodicy and Scientific Methodology and, for some years, librarian in the "Aloisianum" Faculty of Philosophy of Gallarate, in Northern Italy, began the monumental task of creating an index verborum of all the words in the works of St Thomas Aquinas and related authors, totaling some 11 million words of medieval Latin. This was, of course, before any electronic digital computers were available. What was available was a single operating example of Vannevar Bush's Rapid Selector in Washington, D.C., and various versions of electric punched card tabulators, some of which could be programmed. Busa's first published report on this project appears to be Sancti Thomae Aquinatis hymnorum ritualium varia specimina concordantiarum. Archivum Philosophicum Aloisianum, Ser. II, no. 7. (Milan, 1951), in which the specimen of the concordance was, of course, published in Latin, while Busa's introductory text was published in English and Italian. The bilingual subtitle of the work read in English, "A First Example of Word Index Automatically Compiled and Printed by IBM Punched Card Machines." In this work Busa first summarized notable examples of indices verborum compiled before his project, and then analyzed five stages of the process:

"1- transcription of the text, broken down into phrases, on to separate cards;

"2- multiplication of the cards (as many as there are words on each);

"3- indicating on each card the respective entry (lemma);

"4- the selection and placing in alphabetical order of all the cards according to the lemma and its purely material quality;

"5 - finally, once that formal elaboration of the alphabetical order of the words which only an expert's intelligence can perform, has been done, the typographical composition of the pages to be published.

"A kind of mechanisation has been working for years so far as regards caption 2: the T.L.L. and the Mitellateinisches Wörterbuch use the services of Copying Bureaux, where one of the many well known systems of duplicating are used; Prof. J.H. Defarrari of Washington used electrical typewriters which can make many copies; Prof. P. O'Reilly of Notre Dame. . .had each side of the page repeated as many times as there were words contained theron" (Busi, op. cit., p. 20).

Busa ruled out the Rapid Selector and approached IBM in New York and at IBM's head office in Milan, where he obtained funding and cooperation. Busa's summary of his progress to date, published in 1951, is perhaps the earliest detailed discussion of the methods used and problems encountered in applying punched card tabulators to a humanities project. Therefore I quote it in detail. Readers will notice some peculiarities in the published English translation:

" Now what I intend publishing, are the results of a first series of experiments carried out with electric accounting machines operating by means of punched cards. Of the three companies using this system, the International Business Mchines (IBM), the Powers of the Remington Rand, and the Bull, it was at the Milan Head Office of the Italian organisation of the first, which is also the most important, that I continued the research I had commenced at the New York Headquarters.

"What had first appeared as merely intuition, can today be presented as an acquired fact: the punched card machines carry out all the material part of the work mentioned under captions 2, 3, 4, and 5 [above].

"I must say that if this success has its origin in the multiple adaptability, characteristic of the equipment in question, it was nonetheless due to the openmindedness and intelligence of the IBM people, who have honoured me with their patient confidence, that the method for such application has been found. I will give a brief description of the stages of the process and the first trials which were carried out on one of Dante's Cantos.

"The Automatic Punch, controlled by a keyboard similar to that of an ordinary typewriter, «wrote» by holes or perforations, one for each card, all the lines; a total of 136 cards. This is the sole work done by human eyes and fingers directly and responsibly; if at this point oversights occur, the error will be repeated from stage to stage; but if no mistakes were made, or were elminated, there is no fear of fresh errors; human work from now onwards is reduced to mere supervision on the proper functioning of the various machines.

"The contents of each card can be made legible either on the punch itself which, if required, can simultaneously write in letters on the upper edge of the card what is «written» in holes on the various lines of columns thereon; or else on a second machine, the so-called Interpreter, which transcribes in letters the holes it encounters on the cards (previously punched). This offers not only a more accurate transcription in virtue of the better type and greater spacing of the characters, but a transcription which can be effected on any desired portion of the card.

"The 136 cards thus punched were then processed through a third machine, the Reproducer: this automatically copied them on another 136 cards, but adding, sideways of the lines and their quotations, the first of the words contained in each. Subsequently it makes a second copy, adding on the side the second word, then a third copy adding the third, and so forth. There were finally 943 cards, as many as were the words of the third canto of Dante's Inferno; thus each word in that canto had its card, accompanied by the text (or rather, here, by the line) and by the quotation. This is equivalent to state that each line was multiplied as many times as words it contained. I must confess that in actual practice this was not so simple as I endeavoured to make it in the description; the second and the successive words did not actually commence in the same column on all cards. In fact, it was this lack of determined fields which constituted the greatest hindrance in transposing the system from the commercial and statistical uses to the sorting of words from a literary text [bold text mine, JN] The result was attained by exploring the cards, column by column, in order to identify by the non-punched columns the end of the previous word and the commencement of the following one; thus, operating with the sorter and reproducer together, were produced only those words commencing and finishing in the same columns.

"This operation is rather a long one; theoretically as many sortings and groups of reproductions as there are columns occupied by the longest line, multiplied by the number of letters contained in the longest word; in practice various devices make it possible to shorten this routine a good deal. It must be borne in mind that the amount of human work entailed by all ths processing the words and setting up of the reproducer panels--about two persons' one day work--remains unchanged notwithstanding the increased number of cards. While it is true that there are longer intervals, namely those intervals during which the machines carry out their own operations, it is equally true that the operations which in the case of a few cards are inevitably consecutive, with many cards can be simultaneous; the time taken by the reproducer to copy one stack can be used to sort others or to set up the panel for the next reproduction. At present the reproducer can reproduce 6,000 cards an hour, and the sorter can explore 36,000.

"Having reached this point, it is a trifle to put the words into alphabetical order; the Sorter, proceeding backwards, from the last letter, sorts and groups gradually column by column, all the identical letters; in a few minutes the words are aligned and the card file, in alphabetical order, is already compiled.

"This order can be obtained again with the same ease, as often as required. If the scholar, while making his research on the carried conceptual content, disturbed the alphabetical order of the items, this same order can be very easily obtained once more merely by the use of the sorter, which is the most elementary IBM machine.

"The philologist, however, must group or sort further on what the machine has not been able to «feel»; thus have, had are different forms of the same verb; thus, in Italian, andiamocene, diamogliene are several words joined into one, and for the Latin mortuus est is a single word form which means died, but could also mean the dead man is and then they would be two items; and so on for the whole wide range of homonyms.

"When the order has thus been properly modified and attains its final form, the cards are ready to be process in the Alphanumerical Accounting Machine, or Tabulator.

"The tabulator retranscribes on a sheet of paper, in letters and numbers— no longer in holes— line after line, the contents represented by the holes in the cards, at the rate of 4,800 cards per hour; and this is a page of the concordance or index in its final arrangement. The published edition can now obtained by some kind of reproduction; for ex. employing ribbon and paper of the kind that allows the use of lithographical dupicators.

"The concordance which I am presenting as an example is precisely an off-set reproduction of tabulated sheets turned out by the accounting machine.

"The flexibility of these machines offers the possibility of making varied and sometimes extremely useful, applications. I am making a brief mention of the most salient ones.

"The tabulated document can be printed on a continuous paper roll or else on separate sheets of varying sizes; in other words, the machine can be made to change the sheet automatically after a given number of lines.

"The distance between lines can also be automatically differentiated; it is possible to arrange the machine so as to make, for example, without further human intervention, a double space when it goes on to a new word (for example from anima to animato) and, say, four spaces between the words commencing with the letter A and those commencing with B, and so on,. The data which are, for example, at the right of the card can be tabulated, if desired, at the left, viceversa; so that the quotation can be placed prior or subsequent to the line independently of its position on the card.

"The card contents can be reproduced also partially, which makes it possible to obtain only an index of the quotations for those words of which it is not deemed desirable to have the concordance.

"The tabulator's performance is extremely useful when, to use, the current technical phrase, it is running in tab.

"Then it turns out only the list of the words which are different if, for example, the cards containing the preposition ab total two hundred, the machine will print ab once only, but, if desired, will add at the side thereof the number of times, that is 200, and so on for each word. The list thus obtained is very useful in studying those intelligent integrating touches to be given to the alphabetical order of the words, which, as I said, is effected by the machine on the mere basis of the purely material quality of the printed word. It is also useful as an entry table for all who wish to peruse the whole vocabulary of an author for determined purposes; still more useful when beside the word is shown the frequency with which it is used. When another machine called the Summary Punch is connected to the accounting machine running in tab, while the latter is turning out the long tabulated list of different words, the former, electrically controlled by the accounting machine, simultanteously punches a new card for each of these words, thus providing ready headings to be placed before the single groups of lines or quotations. If necessary, these can be inserted in their proper place among all the others automatically by the collator.

"This Collator which searches simultaneously two separate groups of cards at the rate of 20,000 per hour, and can insert, substitute and change cards from one with the cards from the other group, also offers some initial solutions to the problem of finding phrases or compound expressions. Taking, for example the expression according to: the group of cards containing according and that containing to are processed in the machine; on the basis of the identical quotation, the machine will extract all those cards on which both appear. It is true that they may be separated by other words, but one thing is certain, namely that all the cards bearing according to will be among those extracted; the eye and the hand must do the rest. It is still easier to obtain the same result when a card beaing the phrase sought for can be used as a pilot-card.

"The collator can also be used to verify and correct the cards which have been manually punched at the beginning, and thus guarantee the accuracy of the transcription, an indispensable condition for philological works, particularly in the light of their peculiar function. Two separate typists punch the same text, each on his own; the collator compares the two series of cards, perceiving the discrepancies; of the cards not coinciding, at least one is wrong. This control allows only the following case to pass unobserved, namely two typists make the same error in the same place. This case is very improbable and so much the less probable in as much as the qualities and circumstances of typing and typist are different.

"This method of verifying, although substantially the same, offers perhaps some advantages over the other, usually employed by IBM in the intent of not doubling the number, and consequently the cost, of the cards purposely, whereas in our case this is no hindrance, since each card already has to be multiplied as many times as the words it contains; the punched cards are put through the Verifier on the keys of which a typist repeats the sane text; the machine signals him when his punching does not concord with the existing holes; one of the two is wrong.

"Before concluding, a criticism of these initial results should be made, also to justify the lines along which I am working to perfect the method: only the first man [an allusion to Adam] happened to begin his life as an adult.

"In the first place, the machines I used— those commonly used in Europe up to 1950— produce a final tabulated page the appearance of which is still perceptibly less satifactory than that of printed material. Many will hold the opinion that this is compensated by the automatic performance and the high speed of their writing. But it is indeed hard to sacrifice accents and punctuation as well as the difference between capitals and small letters. Similar considerable limitations are involved by the card capacity; eighty spaces.

"Since each card includes both quotation and lemma, the average text for each word could not therefore surpass, by much, a hendecasyllable. And this is little, the more so one bears in mind that the machines do not allow the omission of subordinate phrases or even words, by which the penworker instead can choose only those few words, which constitute the substance of an expression. This brevity in the text, perceptible in a printed concordance and even more so in the case of prose instead of verse, is extremely distressing when the card file is used for research work; infinite occasions will indeed arise where the scant surrounding will not give the lexicographer sufficient elements for a well-grounded interpretation and, by compelling him to a too frequent and aggravating recourse to the text, will tempt him—there are even little devils specialised in leading philologists into sin!— with the bait of a hasty judgment.

"Even with only the groups of machines above mentioned, it is quite possible to obviate the latter hindrance, but I will not set forth the various means of doing this. Not only so as not to disconcert the reader; it does happen indeed that when one glimpses at the unimagined possibility of carrying out, for example, in four years a work which would have required otherwise half a century (this is the case of the concordance I have in mind for 13,000 in folio pages of the works of St.Thomas Aquinas) everyone becomes so confident and at the same time so exacting with the new method, that all feel deluded when told that the operations involved in making it possible to have an abundance of text on every card will delay, let us say, by twelve months, the conclusion of the work. But it would above all be purposeless to devote time and attention to such devices, for new model IBM machines already in public use in the United States, but not yet in Europe, will allow a more aesthetically precise final printing, punctuation, accents and texts longer than the usual card capacity. I refer to the Cardatype and the type 407 Accounting Machine. I hope to write about this in the near future" (Busa, op. cit. 22-34).

(This entry was last revised on 03-15-2015.)