✦ LIBER ✦

A flexible real-time recognizer of spoken words for man-machine communication

✍ Scribed by R. de Mori; L. Gilli; A.R. Meo

Publisher: Elsevier Science
Year: 1970
Weight: 765 KB
Volume: 2
Category: Article
ISSN: 0020-7373
DOI: 10.1016/s0020-7373(70)80001-3

✦ Synopsis

A relatively simple real-time recognizer of spoken words is described. The main characteristics of this system are the following: (1) the vocabulary of accepted words is set by simple operations on the panel of the machine; (2) the system is quasi-adaptive in the sense that the characteristic parameters of a given word uttered by a certain speaker can be measured and displayed on a set of Nixie tubes, and it is very easy (operating on the keyboard) to fit the recognizer to the speaker according to the displayed data; (3) at present, a maximum of 15 words can be classified, but, owing to the modularity of the system, other units can be added in order to enlarge the accepted vocabulary.

The coder consists of a bank of active filters and a set of circuits translating spectral information into a set of binary digits. The processor is composed (a) of a set of combinational units which evaluate the Hamming distance of the input patterns from the characteristic sequences of phonemes or other "tracts" of words; (b) of a set of sequential units which complete the classification of the uttered word by analysing the time evolutions of the combinational network outputs. The panel controls which make it possible to fix the accepted vocabulary and to adapt the recognizer to the speaker operate on the connections between combinational and sequential units and on the operation parameters of all the units.
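
The two-stage processor described above can be illustrated in software. What follows is only a minimal sketch under stated assumptions, not the authors' circuit: the synopsis does not give the word length of the binary code, the distance thresholds, or the internal logic of the sequential units. Here each input frame is assumed to be already coded as a tuple of bits, a combinational unit is assumed to "fire" when the Hamming distance to its template is at most a threshold, and a sequential unit is assumed to accept a word when its tracts fire in the expected order. The names `TEMPLATES`, `VOCABULARY`, `firing_tracts` and `classify` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Pattern = Tuple[int, ...]

def hamming(a: Pattern, b: Pattern) -> int:
    """Number of bit positions in which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))

def firing_tracts(frame: Pattern, templates: Dict[str, Pattern],
                  threshold: int) -> Set[str]:
    """Combinational stage (assumed rule): a tract fires when the frame
    lies within `threshold` bits of that tract's template."""
    return {name for name, tpl in templates.items()
            if hamming(frame, tpl) <= threshold}

def classify(frames: Sequence[Pattern], templates: Dict[str, Pattern],
             vocabulary: Dict[str, List[str]],
             threshold: int = 1) -> Optional[str]:
    """Sequential stage (assumed rule): each word advances through its
    tract sequence whenever its next expected tract fires."""
    progress = {word: 0 for word in vocabulary}
    for frame in frames:
        active = firing_tracts(frame, templates, threshold)
        for word, tracts in vocabulary.items():
            i = progress[word]
            if i < len(tracts) and tracts[i] in active:
                progress[word] = i + 1
    complete = [w for w, t in vocabulary.items() if progress[w] == len(t)]
    # Reject the utterance unless exactly one word completed its sequence.
    return complete[0] if len(complete) == 1 else None

# Hypothetical 4-bit templates and a two-word vocabulary, for illustration.
TEMPLATES = {"A": (1, 0, 0, 1), "B": (0, 1, 1, 0), "C": (1, 1, 0, 0)}
VOCABULARY = {"uno": ["A", "B"], "due": ["B", "C"]}

# The first frame carries a one-bit error but is still within threshold of "A".
utterance = [(1, 0, 1, 1), (1, 0, 0, 1), (0, 1, 1, 0)]
print(classify(utterance, TEMPLATES, VOCABULARY))  # prints "uno"
```

In this sketch, editing `VOCABULARY` plays the role of the panel controls that change the connections between combinational and sequential units, while editing `TEMPLATES` and `threshold` plays the role of adapting the operation parameters to a speaker.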

The machine, used for recognizing the ten digits from 0 to 9 and five other words spoken in Italian, reaches an efficiency larger than 99% on condition of its being previously adapted to the speaker. Programmed according to the average characteristics of male speakers and for classifying words spoken by voices not used in the preceding stage of learning, it reaches average efficiencies of about 90% (from 85 to 99% for a given speaker).