Online Music Recognition and Searching (OMRAS)


  • Overall goals and objectives
  • Project scope
  • Potential users of OMRAS
  • Nature of the problem
  • Usability
  • Music-representation
  • Musical structure and searching
  • Structural types of music
  • Conversions between structural types
  • The OMRAS approach to musical content-based searching
  • ‘Score-matrices’
  • Musical indexing
  • Search-surface reduction
  • Format conversion
  • Display and playback of internal-format files
  • Search-query construction
  • Search-result display
  • Audio recognition
  • OMRAS system design
  • Project Management
  • Testing and evaluation
  • Dissemination
  • Bibliographical references


    Overall goals and objectives

    The overall goal of the research is to build a working prototype of a system, OMRAS (Online Music Recognition and Searching), for content-based searching of online musical databases via an intuitive interface that uses music in a visual or aural form familiar to the user, rather than a text-based encoding system, for both search-query construction and to display results. The query interface will allow the user to import, construct, interactively edit and send a query item which appears to him/her as an extract from a piece of music to be matched by the system. Results will be returned either as a list of complete pieces which contain the query item, or as locations within a specified piece or collection of pieces. The user will be able to test the results, and monitor the query item by audio playback at any time.

    Much experience has been gained in full-text content-based information search and retrieval systems based on data such as free text and tabular statistical material, and reasonably good tools for these media are now widely available. However, retrieving information from multimedia archives is severely hampered by the almost total absence of tools for searching their specialised content. This proposal aims to address one area of this problem, that of searching musical databases. Furthermore, in several ways, music is exceptionally challenging, and it is quite possible that some of the results from this project will be valuable in future research on (non-musical) content-based information-retrieval.



    Project scope

    The 3-year OMRAS project will focus principally on four aspects (see also the table in the Project Summary, above):

    1. System architecture, music-representation internal data-format, data-conversion;
    2. Preprocessing and indexing of musical data, implementation of efficient search algorithms;
    3. User interfaces for search-queries and for result-presentation;
    4. Audio music-recognition sufficient to generate data in the internal format.

    The proposers recognise that the system described in this proposal is complex, with several areas requiring significant research in difficult fields. The proposers expect during the course of the project to identify some tasks which require more research effort than has been allowed for in this proposal. For these, other external funding will be sought. However, it is impossible to predict now the exact nature or extent of such tasks, nor which funding bodies may be appropriate. The aim of this proposal is not to produce a fully-finished, ‘industrial-strength’ suite of software, but to provide a soundly-researched basis for the development of such a program. By the nature of research, at the end of the project, some features will be more advanced than others.

    For example, an aspect which the proposers expect will present some difficulty is number 4, audio music-recognition. While this is seen as of crucial importance for the long-term success of the project, and clearly reflects the demands of many potential users, it could make disproportionately great demands in terms of research effort. It is not intended, therefore, to devote a large portion of the project’s resources from the present bid to aspect 4; other funding is likely to be forthcoming for this because of its general importance and its potential for application outside the project. The techniques developed will be incorporated into OMRAS as they become available.



    Potential users of OMRAS

    Throughout the duration and scope of OMRAS, the research will keep sight of the needs of the spread of potential users and their diverse levels of musical knowledge and ability, since the perceived arcane nature of musical notation can be seen as a major disincentive to the kind of investigations and insights that these tools would allow. In this way it is intended that the whole musical community, rather than an elite of academic specialists, can benefit from the research.

    An important part of the project will engage with the principles of design for a choice of user-interfaces to reflect this need, which can be tailored to the requirements of various user-types from the sophisticated computer-literate musicologist engaged in academic research through to the musically untrained user seeking a familiar piece of music for a multimedia presentation. In each case the interface needs to offer a mechanism for search-queries which allows interactive editing and monitoring (audio playback) of the query before sending it to the search-engine. The three-year project scenario outlined here is likely only to allow the full development, testing and assessment of a working interface for two user-types (probably from these extremes); it is certain, however, that very useful insights into the general requirements of users will emerge from the process of evaluation and testing. Again, the OMRAS architectural model allows the independent development and rapid implementation of new interfaces as required in a future programme of research.



    Nature of the problem

    Searching a musical database can be accomplished in various ways. In most existing score-based musicological encoding systems, the musical data is encoded in a form (usually an ASCII encoding such as DARMS [Selfridge-Field 1997, pp. 163-200] or Plaine and Easie [Howard 1997]) that describes the musical score as completely as possible. However, this is not necessarily a convenient format for searching and retrieval. Scores may be re-encoded or converted to a format that allows retrieval by a specially adapted text-based method [Schaffrath 1997; Huron 1997]. Some cataloguing systems encode an identifying fragment of the music (usually the ‘incipit’, i.e. the music’s opening phrase) and can thus perform a useful retrieval service, but with a severe and obvious limitation: retrieval is restricted to the musical material present in the encoded identifying field. [The RISM music-cataloguing system is briefly described in Howard 1997.] In practice, retrieval is accomplished through the use of metadata such as titles, opus numbers, standard catalogue references, etc., rather than by musical content-based searching.

    For non-musicological applications, the situation is even less satisfactory. Music is familiarly ‘stored’ in the form of digital recordings; on the Internet, on the other hand, the standard MIDI file format (originally designed as an electronic musical-instrument control protocol) is increasingly popular as a means of recording and storing musical material, which can be played back through a multimedia computer’s standard sound-system. [See Selfridge-Field 1997, pp. 41-107.] However, apart from the use of catalogue metadata as described above, users have no means of searching and retrieving musical data from such files. [A prototype system for MIDI-file searching is described in McGettrick 1997.]

    Even given a file-format that allows efficient retrieval, such as the Humdrum ‘kern’ format [see Huron 1997], there are formidable difficulties in user interfaces for query formulation for music. The problems of text-retrieval interfaces have recently attracted much attention as a result of the phenomenal popularity of text searching on the World-Wide-Web [See Shneiderman et al., 1997.], but the problems of music retrieval are even more complex. Music is notoriously context-sensitive, and careless or unskilled query-construction can render such systems as exist almost useless. [Several examples of such problems are given in Selfridge-Field 1990 and Selfridge-Field 1998.] Even musical notation itself is prone to divergencies in interpretation and inconsistencies of application that can often frustrate attempts at computational treatment. [See Byrd 1984 and Byrd 1994.]



    Usability

    A further problem with existing systems for musical data-retrieval is that all depend on the user’s learning and understanding a more-or-less arcane encoding system for query-construction and a similarly obscure presentation of results. Even with a ‘user-friendly’ interface such as that described in Körnstadt 1998, there are many aspects that are simply too complex to be of use to any but dedicated computational musicologists.

    The most important recent advance in the usability of query-construction techniques for musical searches is represented by the MELDEX web-based system developed at the University of Waikato, New Zealand. [See Bainbridge 1998; McNab 1996; Smith et al. 1998.] Here, all the user has to do to query a database of c. 10,000 folk-tunes is to sing the required melodic fragment into a microphone; the query can be monitored by audio playback and re-recorded if necessary, and the system returns a list of songs containing the query or similar phrases. The query, and the matched songs, may be displayed in musical notation, and the system works well, although slowly, in the modest uses for which it has been implemented. (In the current web-implementation, the user has to e-mail queries as audio files to the system rather than singing directly into a microphone attached to the computer, but a prototype version for the Macintosh allowed this desirably freer interaction.)



    Music-representation

    All the schemes mentioned above depend on a highly-simplified internal music-representation scheme. This is necessary both for performance-related reasons, and for the transparent construction of search-queries by text-encoding, typically via a simple dialog box containing text-fields. [An example is illustrated in Körnstadt 1998.] In almost all cases, the music is stored as one-dimensional (monophonic) melodic strings, allowing the use of standard string-matching algorithms. These algorithms use familiar ‘distance’ measures to calculate matches or similarities, and can be highly efficient in implementation. [A review of string-matching techniques for music research is in Crawford et al. 1998; discussion of such algorithms can also be found in McGettrick 1997, Mongeau et al. 1990, O’Maidin 1998 and Smith et al. 1998.]
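
    To make the string-matching idea concrete, the following sketch (purely illustrative, in Python; none of the systems cited necessarily works this way) compares two monophonic melodies by computing a standard edit distance over their pitch-interval sequences, so that transposed statements of the same tune match exactly:

    ```python
    def intervals(midi_pitches):
        """Convert a monophonic melody (MIDI note numbers) to pitch intervals,
        so that transpositions of the same tune yield the same sequence."""
        return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

    def edit_distance(xs, ys):
        """Dynamic-programming edit distance (insert/delete/substitute, each
        costing 1) between two interval sequences."""
        m, n = len(xs), len(ys)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if xs[i - 1] == ys[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    # A short figure and the same figure transposed up a tone: distance 0.
    tune       = [60, 62, 64, 60, 60, 62, 64, 60]
    transposed = [62, 64, 66, 62, 62, 64, 66, 62]
    print(edit_distance(intervals(tune), intervals(transposed)))
    ```

    Real systems use more refined ‘distance’ measures (weighting pitch and rhythm differently, for example), but the dynamic-programming skeleton is essentially the same.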

    But simple melodic strings are satisfactory representations of a musical score in only a minority of special cases: when the music is essentially monophonic (as in most folk-song or in Western plainchant, for example) and when questions of chromatic or enharmonic spelling ambiguity are ignored or allowed for in some manner in the encoding. Also, a parallel search needs to be performed on an encoding of the rhythmic pattern of the music if this is to be taken into consideration; the same applies to other aspects such as dynamic level (loudness). Searches taking into account all such features simultaneously are likely to be considerably more complex and thus less efficient in performance.

    Furthermore, very little music in the Western tradition is simply monophonic, and, among other considerations of musical context, the most important is probably the simultaneous interaction of several polyphonic ‘voices’, each of which may have the inherent multi-dimensional nature implicit from the preceding discussion. Performing such complex musical search-queries on a database of polyphonic music encoded as a series of successive monophonic strings is likely to prove extremely difficult.



    Musical structure and searching

    The human brain is highly adapted to recognising levels of structure in physical phenomena such as music, and it is likely that much of our ability to recognise a piece of music (or our mother’s face, or a voice on the telephone) is related to this psychological fact. By the same token, musical searches might be made much less demanding for a computer system by the recognition of certain structural features. Two of these are of immediate importance in the present context: musical phrase-segmentation and structural repetition. (These will be discussed below under ‘Search-surface reduction’.)

    On the other hand, ‘music’ heard as sound (i.e. the patterned variations in air-pressure that activate the ear-drum) contains no explicit structure whatever. It is a cognitive response of the brain (more-or-less trained by experience of similar music or attuned by rigorous methods of listening) that imposes structure on it, and the resulting structured mental ‘image’ may or may not correspond with the composer’s intentions as expressed in the ‘music’ in the written score. Indeed, the ‘structure’ of a piece of music may not be made explicit in a composer’s score at all, and the composer him- or herself may not even be aware of its structure in every aspect: if this were not true, the discipline of musical analysis would be much less interesting. Audio recordings are captured as a simple stream of ‘samples’ (simple number values, typically captured 44,100 times per second in a commercial recording); although the format is very simple, processing is difficult owing to the large amounts of data involved.

    (The recent MPEG-7 initiative should allow highly-sophisticated structural tagging of audio files, with data-retrieval as one of the declared aims; the standard is, however, some distance from a usable definition. [See Koenen 1997; also see: http://drogo.cselt.stet.it/mpeg/standards/mpeg-7/mpeg-7.htm])

    ‘Music’ that consists of a stream (however notated) of performance indications without reference to their structural context (beyond the sequence and timing of musical events) represents an intermediate state. Examples are MIDI files, or ‘tablatures’ for guitar or for historical instruments like the lute, which only tell the performer (in the case of MIDI, the computer) which notes to sound and when. Such representations do not distinguish, for example, between alternative chromatic ‘spellings’ of notes (A#=Bb), and do not necessarily distinguish between individual polyphonic voices. However, they do contain a higher level of inherent structure than ‘raw’ audio, and they are fairly easily represented for computer processing. Such encodings are also extremely economical in terms of file size.



    Structural types of music

    In the above discussion, three distinct types of music files (in terms of ‘structuredness’) were identified:

    a. Highly-structured (as for example the files of a music-notation printing program or the slowly-emerging standards for music-notation information, NIFF [Selfridge-Field 1998, pp. 491-512] and SMDL [Selfridge-Field 1998, pp. 469-490]);
    b. Semi-structured (as for example MIDI files [Selfridge-Field 1998, pp. 41-69] or encoded lute tablatures [Crawford 1991]);
    c. Unstructured (audio files in the standard digital formats).



    Conversions between structural types

    It is possible to convert from one structural type to another, but in general it is far easier to reduce the level of structuredness than to enhance it. (The recognition of implicit structuredness is one of many cognitive tasks in which the human brain far outstrips the computer.) Files of type a can be reduced to type b with ease (see the discussion of data-formats, below); the playback of a MIDI representation of a score-notation file produces sound, which can easily be captured as digital audio (type c), although successfully reproducing in this manner the sound of ‘real’ instruments performing a piece of music is extremely difficult, of course. (We must also point out that it is very difficult to do any of these conversions in a way that is completely satisfying to a knowledgeable person.)

    (Re)constructing the structuredness of file type a from either b or c is an extremely difficult task, and requires techniques of artificial intelligence which will be a major research effort far beyond the scope of this project.

    Converting from type c to type b involves a simpler task of note-recognition and segmentation. Although this is greatly complicated by acoustical matters such as the presence of harmonic overtones and the interaction of these in a polyphonic recording, the conversion might be expected (using recently-developed DSP techniques) to provide a mapping to a format corresponding to type b. Pitches present can be plotted against note-onset times in what has been called a ‘characteristic signature’ of an audio music recording [Pfeiffer et al. 1996]. (See Audio recognition, below.)
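
    As a very rough illustration of this type-c-to-type-b conversion (and emphatically not of the DSP techniques that will actually be investigated), the sketch below picks the strongest spectral peak in successive analysis frames of a monophonic recording and records it as a (time, pitch) pair; the frame sizes, threshold and function names are invented for the example.

    ```python
    import numpy as np

    def characteristic_signature(samples, sr=44100, frame=4096, hop=2048):
        """Crude monophonic 'characteristic signature': for each analysis frame,
        take the frequency of the strongest spectral peak and convert it to the
        nearest MIDI note number.  Returns a list of (time_seconds, midi_note)."""
        window = np.hanning(frame)
        signature = []
        for start in range(0, len(samples) - frame, hop):
            spectrum = np.abs(np.fft.rfft(samples[start:start + frame] * window))
            peak = int(np.argmax(spectrum[1:])) + 1          # skip the DC bin
            if spectrum[peak] > 1e-3:                        # ignore near-silence
                freq = peak * sr / frame
                midi = int(round(69 + 12 * np.log2(freq / 440.0)))
                signature.append((start / sr, midi))
        return signature

    # One second of a 440 Hz sine wave maps to repeated MIDI note 69 (A4).
    t = np.arange(44100) / 44100.0
    print(characteristic_signature(np.sin(2 * np.pi * 440.0 * t))[:3])
    ```

    Real recordings of polyphonic music would, of course, defeat anything this naive; that is precisely where the harder research effort lies.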

    The following table summarises the relative difficulty of converting between the different structural types of music representation.

    Basic Representations Of Music And Audio

                                       | Digital audio (type c)                                    | Time-stamped MIDI (type b) | Music notation (type a)
    Unit                               | Sample                                                    | Event                      | Note, clef, dynamic mark, etc.
    Explicit structure                 | none                                                      | little                     | lots
    Average relative storage (approx.) | 2000                                                      | 1                          | 10
    Convert to left                    | -                                                         | easy                       | OK job: easy; excellent job: very hard
    Convert to right                   | 1 note*: pretty easy; 2 notes: hard; >2 notes: very hard | OK job: moderately hard    | -

    * n notes = n notes played at the same time. ‘Convert to left/right’ refers to doing so fully automatically.



    The OMRAS approach to musical content-based searching

    OMRAS will investigate two principal search-strategies, each appropriate for different structural types:

    1. The first strategy involves the construction of what might be termed ‘score-matrices’ from the original music files (in any of types a, b or c). These will broadly correspond to type b in terms of structure, but will be even simpler in terms of musical detail. They will be susceptible to rapid and efficient matching.
    2. The second strategy is more appropriate to music of structural type a, which can be separated into melodic strings corresponding to the individual voices. It will involve the derivation of indexes, analogous to, but different in nature from, the standard indexes associated with text-searching.



    ‘Score-matrices’

    Search-strategy 1 will make use of simple pitch/time matrices roughly corresponding with the ‘characteristic signature’ form of representation. Search-queries may be expressed as similar matrices, and treated as patterns to be located within the complete matrices stored in a ‘score-matrix’ database. The matched locations may be returned to the user either as a reference to a parent file (from which the matrix was derived) for further detailed display or aural monitoring (using appropriate software for the original file-type), or as detailed time-locations within the score-matrix, which could be examined by the user. OMRAS will develop software-modules for the display and (very simple) audio or MIDI playback from the score-matrices, which will also be used in monitoring during search-query construction.

    It must be stressed that this simple pitch/time matrix is a very crude form of music representation. At a higher level of sophistication, experiments will be conducted in increasing the number of dimensions in the matrix (to embrace categories such as ‘loudness’ or timbre) and treating queries as ‘surface’ elements to be matched within the overall ‘surface’ of the score-matrix.

    One advantage of the score-matrix is that it loosely resembles a bitmap image, and certain techniques associated with graphics can be applied to it. These include various pattern-matching and pattern-detecting routines, curve-fitting and smoothing, all of which might be useful in approximate matching of various kinds. Although it is an assertion which needs rigorous testing, it seems intuitively true that musical ‘shapes’ will be ‘similar’ in somewhat the same way that graphical shapes are ‘similar’. This is reminiscent of the well-known cluster hypothesis in text information retrieval, which ‘states that similar documents tend to be relevant to the same requests, and a clustering method should hence provide a way of bringing together the relevant documents for a query.’ (Sparck Jones and Willett 1997, p. 308)

    It is perfectly possible to construct such score-matrices using parameters other than pitch and time. In fact the term ‘pitch’ is used here as a generalisation for various kinds of value which can easily be derived from higher-structural-level representations (chromatic and diatonic scale-degrees being two obvious examples). [See Cambouropoulos 1996.] ‘Time’ might be measured in absolute or relative terms (milliseconds, or measures/beats, for example). Experimentation will determine which combinations of parameters are most effective for rapid and accurate searches.
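
    A minimal sketch of the score-matrix idea under one particular set of assumptions (a boolean pitch/time grid quantised to quarter-second columns; the real representation and matching algorithms are the subject of the research itself): the query matrix is slid across the target in both time and pitch, so that transposed occurrences are found as well as literal ones.

    ```python
    import numpy as np

    def score_matrix(events, time_step=0.25, pitch_range=(36, 96)):
        """Build a boolean pitch/time matrix from (onset_seconds, midi_pitch) events."""
        lo, hi = pitch_range
        n_cols = int(max(t for t, _ in events) / time_step) + 1
        m = np.zeros((hi - lo, n_cols), dtype=bool)
        for onset, pitch in events:
            if lo <= pitch < hi:
                m[pitch - lo, int(onset / time_step)] = True
        return m

    def crop_to_content(matrix):
        """Trim empty pitch rows and time columns from a query matrix."""
        rows = np.where(matrix.any(axis=1))[0]
        cols = np.where(matrix.any(axis=0))[0]
        return matrix[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

    def find_pattern(target, query):
        """Slide the cropped query over the target in both pitch and time; return
        (time_column, pitch_row) positions where every query cell is matched.
        Extra notes in the target are tolerated, i.e. this is a 'subset' match."""
        qp, qt = query.shape
        tp, tt = target.shape
        hits = []
        for row in range(tp - qp + 1):
            for col in range(tt - qt + 1):
                window = target[row:row + qp, col:col + qt]
                if np.all(window[query]):
                    hits.append((col, row))
        return hits

    # A three-note figure, and the same figure a fifth higher two seconds later:
    piece = score_matrix([(0, 60), (0.25, 62), (0.5, 64),
                          (2.0, 67), (2.25, 69), (2.5, 71)])
    motif = crop_to_content(score_matrix([(0, 60), (0.25, 62), (0.5, 64)]))
    print(find_pattern(piece, motif))   # two hits: time-columns 0 and 8, the second a fifth higher
    ```

    Exact subset-matching of this kind is only a starting point; the approximate and multi-dimensional matching discussed above would replace the all-or-nothing test in find_pattern.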

    Efficient methods for musical matrix-matching are currently being investigated in the KCL Melodic Similarity and Musical Recognition project (Depts of Computer Science and Music 1997-9), with valuable input from Drs Costas Iliopoulos and Rajeev Raman of the KCL Algorithm Design Group (Dept of Computer Science). [For details of the Algorithm Design Group, see: http://helium.dcs.kcl.ac.uk/research/groups/adg/index.html ]

    A preliminary report on one such method, by Matthew Dovey, a Faculty Associate for this proposal, will be presented at the Focus Workshop on Pattern Processing in Music Analysis and Creation to be held in the context of the Symposium on Artificial Intelligence and Musical Creativity at the AISB'99 Convention, Edinburgh College of Art & Division of Informatics, University of Edinburgh, 6th-9th April 1999. [See Dovey 1999.]



    Musical indexing

    Search strategy 2 adopts a technique analogous to the standard method used in large-scale text retrieval, that known as ‘file-inversion’, or more familiarly, indexing. This relies upon the fact that if certain sequences of characters in a text (usually words) recur frequently enough, it can be much more efficient (typically by several orders of magnitude) to find them by reference to an ordered index of such sequences referenced by position to the original text, rather than by exhaustively searching the text itself and noting all occurrences.

    In the case of music for which we have reliable information about the individual voices (type a), we can extract smaller, frequently occurring elements from the melodic strings that constitute each voice, and store these, with location-references to the original file, as terms in an index-table. The process of matching locates terms similarly derived from the query-string, and tests that their order and contiguity match their occurrence in the target score.

    The index-terms need not be absolute pitches; in fact pitch-intervals (differences between adjacent notes) will be useful in matching transposed versions. A further possibility is to use simple contour-descriptions (Up, Down, Same), which will be useful in matching certain types of related melodic fragments. Variants of the original search-string in the standard musical transformations (inversion, retrograde and retrograde inversion) can be detected by multiple passes.
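
    The sketch below makes the indexing idea concrete under one arbitrary set of choices (pitch-interval trigrams as index terms; the optimum term design is, as noted below, an open question): an inverted index maps each trigram to its (piece, position) occurrences, and a query is answered by looking up its trigrams and checking that they occur contiguously and in order.

    ```python
    from collections import defaultdict

    N = 3   # length of the interval n-grams used as index terms (an arbitrary choice)

    def intervals(pitches):
        return tuple(b - a for a, b in zip(pitches, pitches[1:]))

    def build_index(collection):
        """collection: {piece_id: [MIDI pitches of one voice]}.
        Returns {ngram: [(piece_id, position), ...]}."""
        index = defaultdict(list)
        for piece_id, pitches in collection.items():
            ivs = intervals(pitches)
            for pos in range(len(ivs) - N + 1):
                index[ivs[pos:pos + N]].append((piece_id, pos))
        return index

    def search(index, query_pitches):
        """Look up each query n-gram, then keep only those (piece, position) pairs
        where all the query's n-grams occur contiguously and in order."""
        q = intervals(query_pitches)
        grams = [q[i:i + N] for i in range(len(q) - N + 1)]
        if not grams:
            return set()
        gram_sets = {g: set(index.get(g, [])) for g in grams}
        hits = set()
        for piece_id, pos in index.get(grams[0], []):
            if all((piece_id, pos + k) in gram_sets[g] for k, g in enumerate(grams)):
                hits.add((piece_id, pos))
        return hits

    tunes = {'tune1': [60, 62, 64, 60, 60, 62, 64, 60, 64, 65, 67],
             'tune2': [67, 65, 64, 62, 60, 62, 64, 65, 67, 67, 67]}
    idx = build_index(tunes)
    print(search(idx, [62, 64, 66, 62, 62]))   # {('tune1', 0)}: a transposed form of tune1's opening
    ```

    Contour-based terms of the kind mentioned above could be obtained simply by replacing each interval with its sign before indexing.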

    As far as the proposers are aware, no large-scale musical indexing project has ever been attempted. Intuitively, it seems that it will be an essential technique for dealing with the large number of MIDI or audio files that are becoming available on the Internet, for example. By the same token, the task of searching a very large audio file (a CD recording of a complete act of an opera, say) could be hugely reduced by such a technique; annexing such an ‘index’ to the audio file itself might become a standard procedure in future.

    A problem with the apparently simple index-matching process is that it may not actually be very computationally efficient, since a very large number of occurrences of index-terms (especially if these are too simple and frequent) may need to be checked, which will be a very costly procedure. On the other hand, for large music files, it is almost certain to produce an improvement in performance and efficiency over exhaustive full-file searching.

    A good deal of research needs to be done into designing algorithms for indexing, for searching index-files efficiently and for maintaining index-tables as database items are added and deleted, as well as to decide on matters such as the optimum length of index terms, how to deal with approximate matching (perhaps by using the n-gram technique used in text-retrieval in certain situations, e.g., with OCR input), and how, precisely, to reference the large number of index-terms to the original files from which they are derived. [Many related issues in text-retrieval are addressed in Witten et al. 1994b.] It is expected that much progress can be made in these areas, especially with the input of the KCL Algorithm Design Group (see above) and in the light of the considerable experience of the CIIR in large-scale text-indexing methods.



    Search-surface reduction

    It is possible to reduce the computational task of musical searches by taking account of musical structure in ways which are closely related to techniques in computational music-analysis. For example, Dr Emilios Cambouropoulos, currently working at KCL on the Melodic Similarity and Musical Recognition project (Depts of Computer Science and Music 1997-9), has developed techniques for detecting significantly-recurring ‘motifs’ in a piece of music. [See Cambouropoulos 1998a; Cambouropoulos 1998b]

    This is a promising field of research which might significantly enhance the derivation of index-terms described above. The motifs which would be indexed are not necessarily identical (as the simple text-derived process would produce), but would be classified in such a way that ‘similar’ motifs could very quickly be located. Although Dr Cambouropoulos will be engaged elsewhere during this project, he is interested in this application of his ideas, and has offered his continuing active support.
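
    Purely to make the connection with indexing concrete (this is emphatically not Dr Cambouropoulos’s algorithm, merely a naive stand-in), a recurring-pattern finder might count every short interval substring of a voice and keep those that recur as candidate index motifs:

    ```python
    from collections import Counter

    def candidate_motifs(pitches, min_len=3, max_len=6, min_count=2):
        """Naive recurring-pattern finder: count every interval substring of length
        min_len..max_len and return those occurring at least min_count times."""
        ivs = tuple(b - a for a, b in zip(pitches, pitches[1:]))
        counts = Counter(ivs[i:i + n]
                         for n in range(min_len, max_len + 1)
                         for i in range(len(ivs) - n + 1))
        return {motif: c for motif, c in counts.items() if c >= min_count}

    voice = [60, 62, 64, 60, 60, 62, 64, 60, 64, 65, 67, 64, 65, 67]
    print(candidate_motifs(voice))   # {(2, 2, -4): 2}: the opening figure recurs
    ```

    The interesting step, which the techniques cited above address, is to group motifs that are similar rather than strictly identical.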



    Format conversion

    In those parts of OMRAS that make use of the score-matrix format, it will be necessary to provide a conversion module for data input and search-query construction. The system will need to be able to handle MIDI files, music-notation files and audio files. (The audio-recognition part of the system is more than a simple conversion module, and is discussed below.) MIDI-file code libraries are available which can be used to handle standard MIDI files, but music-notation programs usually have proprietary file-formats. (The NIFF format is slowly emerging as a possible common standard, but at present few programs support it. It is, however, worth considering the provision of a NIFF-import converter. [Selfridge-Field 1998, pp. 491-512])
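
    By way of illustration of the MIDI side of such a conversion module, the sketch below (assuming the third-party Python library mido, chosen only for brevity; a production module would more likely use a C or Java MIDI library) reduces a standard MIDI file to the simple onset/pitch event list used in the score-matrix sketch above:

    ```python
    import mido   # third-party MIDI library, used here purely for illustration

    def midi_to_events(path):
        """Reduce a standard MIDI file to a sorted list of (onset_seconds, midi_pitch)
        events, the form assumed by the score-matrix examples above."""
        events = []
        now = 0.0
        for msg in mido.MidiFile(path):   # messages arrive in playback order;
            now += msg.time               # msg.time is the delta time in seconds
            if msg.type == 'note_on' and msg.velocity > 0:
                events.append((now, msg.note))
        return sorted(events)

    # events = midi_to_events('example.mid')   # 'example.mid' is a placeholder filename
    ```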

    An exception to the rule about proprietary file-formats is the Macintosh program, Advanced Music Notation Systems’ Nightingale (R), developed by Donald Byrd (one of the PIs for this proposal) and others. This can save the contents of a file in an ASCII format very convenient for processing, the Nightingale Notelist. [Crawford et al. 1997.] Notelist files can also be read into Nightingale. Converting the Notelist files to and from the internal format of OMRAS is not expected to be difficult, so this provides a useful and relatively simple ‘ready-made’ means of creating search-queries in the internal format, and of visualising the results of a search returned in the internal format. This will be highly valuable, especially in the early phases of the project.

    The following message has been received from Advanced Music Notation Systems:

    7 January 1999

    I certify that, for four years from this date, Advanced Music Notation Systems, Inc., will make its product Nightingale(R) available at no cost to the Center for Intelligent Information Retrieval and to Kings College, London, but solely for use in research on music recognition and musical similarity.

    Donald Byrd, President, Advanced Music Notation Systems, Inc.
    57 South St., Williamsburg, MA 01096



    Display and playback of internal-format files

    It is desirable that the music stored within the OMRAS system can be displayed and monitored by audio playback. Because of the lack of explicit structure of the internal score-matrix format, as discussed above, it will be difficult to display proper musical score notation (suitable for printing, for example) on screen. A simpler form of musical notation sufficient for the task in hand can undoubtedly be achieved. This will be supplemented by a graphical display (perhaps somewhat similar to the ‘piano-roll’ display used in MIDI sequencing programs) on which passages can be selected, highlighted and played back. Many users may prefer this more visually-oriented display, especially those without musical training.
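
    A piano-roll view of the internal event list can be mocked up with standard plotting tools; the sketch below (assuming matplotlib, and inventing a fixed note-length since the crude internal format stores only onsets) is intended only to suggest the kind of visually-oriented display meant here:

    ```python
    import matplotlib.pyplot as plt

    def piano_roll(events, note_length=0.25):
        """Draw a simple piano-roll view of (onset_seconds, midi_pitch) events.
        A fixed note_length is assumed purely for display purposes."""
        fig, ax = plt.subplots()
        for onset, pitch in events:
            ax.broken_barh([(onset, note_length)], (pitch - 0.4, 0.8))
        ax.set_xlabel('time (seconds)')
        ax.set_ylabel('MIDI pitch')
        plt.show()

    piano_roll([(0, 60), (0.25, 62), (0.5, 64), (0.75, 60)])
    ```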

    This graphical interface will have use both in the construction of search queries and in the presentation of search results. As mentioned in the discussion of Project scope, above, it is important to bear in mind the range of skill and musical understanding of potential users while designing such an interface.



    Search-query construction

    A search will normally be specified by importing into the search-query module the musical material to be matched. This could be in any of the three music format-types for which a conversion module is available. It may comprise a longer piece of music from which an extract is required for matching. (In the case of music in Nightingale format, for example, the user could select the score-fragment to be matched within Nightingale, and the selection imported into the search-query module as a Notelist file.)

    Having been converted into the internal score-matrix format, the search-query may be monitored by audio or MIDI playback, and if necessary, modified for certain types of search. Thus the system must allow insertion, deletion and modification of musical objects in the graphic display in the search module.



    Search-result display

    The results of a search for a musical pattern may be quite unexpected, especially to an unsophisticated user. In general, for example, a short input query (the equivalent of a very common word in a text search) may give an unmanageably large number of matches. While a simple strategy to avoid this may be a restriction on the shortness of the query (‘Sorry, not enough notes provided!’), it is not necessarily helpful merely to ask the user to ‘refine the search’ — some helpful visual grouping or categorisation may be necessary. It will also be necessary to devise ways to present the results so that, as well as being ranked for their estimated ‘relevance’ to the query (a standard information-retrieval requirement), they can easily be tested by the user (e.g. by playback of the internal-format file, thus approximating the sound of the original score). The OMRAS architecture encourages the development of independent ‘output’ modules just as it does for the ‘input’ of search-queries.
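
    One simple presentation strategy, sketched below with invented piece-names and match scores, is to group raw hits by piece and rank the pieces by their best (lowest-mismatch) hit, so that the display can show one entry per piece together with the locations to be auditioned:

    ```python
    from collections import defaultdict

    def group_and_rank(hits):
        """hits: [(piece_id, position, mismatches), ...] from an approximate search.
        Returns pieces ranked by their best hit, each with its sorted hit list."""
        by_piece = defaultdict(list)
        for piece_id, pos, mismatches in hits:
            by_piece[piece_id].append((mismatches, pos))
        ranked = sorted(by_piece.items(), key=lambda item: min(item[1]))
        return [(piece, sorted(locs)) for piece, locs in ranked]

    print(group_and_rank([('gigue', 12, 1), ('allemande', 3, 0),
                          ('gigue', 40, 0), ('courante', 7, 2)]))
    # [('allemande', [(0, 3)]), ('gigue', [(0, 40), (1, 12)]), ('courante', [(2, 7)])]
    ```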

    (Presenting musical data in this way to users of varying levels of musical understanding is seen as one of the major research challenges for the next phase of the ongoing research, although it is unlikely to be addressed in detail in the present project. A well-designed combination of novel visual display, audio playback and conventional music notation is required, and this may have further uses outside the immediate goals of information retrieval, especially in the domain of music education.)

    Throughout its duration and scope, the research will keep sight of the needs of this spread of potential users, since the perceived arcane nature of musical notation, and equally of musical encoding systems, can be seen as a major disincentive to the kind of investigations and insights that these tools would allow.



    Audio recognition

    Within the rapidly-developing discipline of Intelligent Signal Processing, a number of techniques for coding of digital time-based signals have been developed. (Some of these were adopted in the MPEG-1 and MPEG-2 standards for audio coding.) Each involves the use of a time-frequency representation (TFR); a major design aspect of the selection or development of a TFR concerns its suitability for the related tasks of audio compression and analysis. (The technique of audio compression has close analogies with the search-surface reduction by recognition of frequently-recurring patterns mentioned elsewhere in this proposal.)

    The analysis of a signal involves breaking it down into its component parts, and recognising what those parts are made from. As well as recognising, for example, the different instruments or voices present in a musical recording, such analysis might be used for the automatic description of the music (timbres, scoring, musical style, etc.). It can also be used for the detailed description of the melodic content of each instrument or voice part, i.e., for automatic transcription.

    The OMRAS system does not require the level of audio analysis that would be necessary, for example, for printing a musical score from an audio recording. But the level would be similar to that needed to ‘track’ a musical score (stored electronically within the computer), an idea with obvious commercial potential for the recording and broadcast industries. All that is required here is the pattern of note-onset pitches as they occur in time, corresponding to the OMRAS internal score-matrix format. It is well understood that, although this might be very easy to provide for monophonic music (a single instrument or voice), the task is much harder for polyphonic music, where several instruments and voices play simultaneously and acoustically interact. It is also the case that the more complex the texture of the music, the more ambiguities and inaccuracies are bound to occur, hence the need for somewhat generalised approximate pattern-matching in the score-matrix searches outlined above.

    The TFRs used for the task will be selected from several in current use, especially those using multiple resolutions, including:

    The actual analysis of the TFRs will be carried out by the use of a neural network interacting dynamically with the TFR using the principle of progressive recognition. The recognition process will build upon recognition tasks completed earlier, and will have the potential to ‘rethink’ earlier decisions.



    OMRAS system design

    For a collaborative project of the nature of OMRAS, careful system design is essential, since work must proceed simultaneously on several interacting elements of the overall mechanism. For various reasons, including the need to allow for web access at every stage, OMRAS will have a platform-independent object-oriented architecture, wherein code modules can be ‘plugged in’ to the overall system at any stage, and thus it will be possible for development contributions to be made by distributed researchers using a variety of computer platforms. Of course, platform-independence is more an issue of implementation language, support libraries, etc., than of architecture. For our purposes the most practical implementation language is very likely Java with the standard Sun APIs, perhaps in conjunction with XML. The resulting system will have further benefits in interoperability and flexibility, allowing the researchers to test the mechanism on musical databases of various sizes and types on a range of hardware, with appropriate levels of computing power matched to the demands of those databases.
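
    As a sketch of the intended plug-in style (shown here in Python for brevity, although, as noted, the implementation language is more likely to be Java), input converters might register themselves against a small abstract interface so that new format modules can be added without touching the core system; all class and function names are invented for the example.

    ```python
    from abc import ABC, abstractmethod

    CONVERTERS = {}   # registry of available input-format converters

    class Converter(ABC):
        """Interface that every pluggable input-format converter must implement."""
        format_name = None

        @abstractmethod
        def to_internal(self, path):
            """Return a list of (onset_seconds, midi_pitch) events for the given file."""

    def register(converter_cls):
        """Class decorator: make a converter available to the rest of the system."""
        CONVERTERS[converter_cls.format_name] = converter_cls()
        return converter_cls

    @register
    class DummyConverter(Converter):
        """Stand-in converter, included only to show the registration mechanism."""
        format_name = 'dummy'
        def to_internal(self, path):
            return [(0.0, 60), (0.5, 64), (1.0, 67)]

    def load(path, fmt):
        return CONVERTERS[fmt].to_internal(path)

    print(load('anything', 'dummy'))   # [(0.0, 60), (0.5, 64), (1.0, 67)]
    ```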



    Project Management

    At the most recent SIGIR conference, CIIR had far more papers than any other institution. By any measure, CIIR is one of the world’s leading information-retrieval research institutions, and Bruce Croft, its founder and director, one of the world’s leading researchers in IR in general. Donald Byrd is well-known in the music-representation community for his work with conventional music notation, and the Nightingale program whose development he has spearheaded is a musical score editor that is highly regarded both for the quality of notation it produces and for its user interface [see Shrock 1997]. Thus, the CIIR team has much strength both in general IR and in some relevant aspects of computer applications to music.

    On the KCL team, Tim Crawford has been working for many years on representation and transcription of early music. Matthew Dovey has worked on methods of musical-structure recognition, and Crawford and Dovey have recently been working together (with Emilios Cambouropoulos) on the KCL Melodic Similarity and Musical Recognition project.

    Overall, while the two research teams clearly have different areas of expertise, there is a significant degree of overlap as well. This is a great strength of the proposal, but it also carries risks, which in general will be avoided by careful planning, good communications, and rigorous evaluation.

    With respect to communications, the team members will communicate very frequently, mostly via the Internet but undoubtedly via phone on occasion. The Senior Personnel will meet electronically as a group on a regular basis. Of course, face-to-face meetings will also be necessary. We plan a kickoff meeting attended by all members of both groups at the start of the project; after that, there will be annual meetings of all members. We expect that two additional trips by members of each group to the other site will be required each year.

    It should be mentioned that Crawford has been advising Byrd on the development of Nightingale since 1990, that Crawford made major contributions to Nightingale’s Notelist format (referred to elsewhere in this proposal), and that Crawford, Byrd, and a third person wrote the published description of that format [Crawford et al. 1997]. Essentially all of this collaboration has taken place with Byrd in the United States and Crawford in England, and the vast majority over the Internet. So senior members of the teams have a long history of intercontinental collaboration.

    The outline schedule of work presented below indicates the distribution of effort between the partners.

    No. | Aspect of project                                | Partners   | Codes
    1   | Music representation                             | KCL + CIIR | P, X
    2   | Internal file-format                             | KCL        | P
    3   | File conversion                                  | KCL        |
    4   | Musical search-surface reduction                 | KCL + CIIR | I, P, X
    5   | Musical indexing methods                         | CIIR + KCL | I, P
    6   | General index-creation and management methods    | CIIR       | E
    7   | Web-harvesting technology (as appropriate)       | CIIR       | E
    8   | Relevance feedback strategies and implementation | CIIR       | E
    9   | Search-interface design and implementation       | CIIR + KCL | I
    10  | Result-display design and implementation         | CIIR + KCL | I
    11  | User-evaluation strategy and implementation      | CIIR       | E
    12  | Performance-testing strategy and implementation  | CIIR + KCL | E
    13  | System architecture                              | KCL        | I, P
    14  | Integration with library information systems     | KCL        | I
    15  | Web implementation                               | KCL        | I
    16  | Search and musical-comparison algorithms         | KCL + CIIR | I, P, X
    17  | Audio music-recognition (as appropriate)         | KCL        | I, X
    18  | Report-writing and dissemination, etc.           | KCL + CIIR |
    19  | Project management                               | KCL        |

    (In the Partners column, the lead partner for each aspect of the project is named first. Codes: E, substantially-existing technologies, adapted for our purpose; I, major innovative research effort; P, significant preliminary work already has been carried out; X, related parallel research work external to this project will be incorporated.)

    The essential plan of work will be based round an annual cycle of deliverables. This will be continually adjusted according to feedback from the continuous evaluation of research progress. A detailed plan for the first year together with a provisional second-year plan will be provided within 3 months of commencement (as prescribed in the Annex to JISC Circular 15/98). The third year plan can only be outlined at first, but is likely to be devoted to intense evaluation, both internal (rigorous testing on a range of types and sizes of database) and external (user-trials using alpha versions of a pilot/provisional software product). According to the progress of the project, and to some extent the direction it takes during its course, approaches may be made to outside agencies and possible industrial partners for support in a future product-development phase.

    This exit strategy ensures that even if the research does not achieve all the goals set at commencement, it will make a very significant contribution in its fields which will be well tested and evaluated (see Testing and evaluation, below), as well as widely-disseminated to both the information-science and musical communities (see Dissemination, below). Given the complexity of music, there is also reason to believe that some of the work will have future use in content-based multimedia information retrieval in general, and thus will benefit a far larger community than just musicians.



    Testing and evaluation

    The OMRAS system, as described above, involves a complex interaction of various modules of research. The goal of musical retrieval is of course paramount, and testing and evaluation will have this as the primary focus. Unfortunately, there are no existing standard test data-sets for music corresponding to the TREC text data-sets, for example. It will be necessary to construct test databases of music files in each of the three data-types: score files, MIDI files, audio files. Furthermore, it is important to ensure that a wide range of musical styles is represented within each of these test data-sets; it is probable, for instance, that MIDI files of popular music (with highly ‘redundant’ continuous and repetitious drum-tracks) will present different problems for retrieval than, say, renaissance choral music. Raw material for such test data-sets of each type is readily available, from the World Wide Web and other sources.

    As well as ‘real’ music files, it will probably be useful to construct artificial test files, wherein example search-patterns are embedded in various densities of ‘randomised’ data.
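
    For example (the layout and parameters below are entirely invented), an artificial test ‘piece’ might be generated by embedding a known pattern at recorded positions within a stream of random notes, so that recall and precision can be measured against exact ground truth:

    ```python
    import random

    def make_test_piece(pattern, length=500, n_embeddings=3, seed=0):
        """Return (pitches, positions): a random monophonic 'piece' of the given
        length with the query pattern embedded at n_embeddings known positions.
        (Overlapping embeddings are not guarded against in this toy version.)"""
        rng = random.Random(seed)
        pitches = [rng.randint(48, 84) for _ in range(length)]
        positions = sorted(rng.sample(range(length - len(pattern)), n_embeddings))
        for pos in positions:
            pitches[pos:pos + len(pattern)] = pattern
        return pitches, positions

    piece, where = make_test_piece([60, 62, 64, 60, 60, 62, 64, 60])
    print(where)   # ground-truth locations against which retrieval output can be scored
    ```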

    A consistent and continuous routine of testing retrieval performance and accuracy on these test data-sets will be maintained throughout the project, and the results will feed back into the design process of the system and the various algorithms on which it depends.

    User-interface testing is seen as of crucial importance to this project. The specially-designed graphical displays, and the general user-interaction aspects of the OMRAS system will be constantly monitored and evaluated with the aid of students and academic peers as each part of the system becomes available. Since the project is conceived as having a WWW interface from early in its history, it will be possible to set up a group of testers with appropriate access to the relevant URLs; this has the potential advantage of international participation from the outset, with the special insights that it can bring to the project.



    Dissemination

    The best form of dissemination for a project of this sort, of course, is its actual implementation in a working high-profile application, freely accessible on the WWW. However, this is likely to take some time to come about, and more conventional scholarly and media channels will be exploited in the meantime. It seems likely that a good deal of press interest could be generated, especially when a suitable pilot system is up and running for demonstration purposes. (Audio-recognition, in particular, should excite a fair amount of media comment.) In the early phases of the project, there is much scope for scholarly publications on the more innovative aspects of OMRAS, especially the music-representation, matrix-searching and audio-recognition parts.

    A WWW site will be maintained (probably at KCL, and linked from CIIR’s web-site) to report on progress, to offer demonstrations, screen-shots and system diagrams as they become available, to maintain a list of current research groups working in similar fields and to allow interested parties to discuss issues in music similarity, retrieval and recognition. It may be useful to establish an Internet mailing-list on the topic, although it is important that such activities do not take up so much time that they divert the principal researchers from the task(s) of developing OMRAS.



    Bibliographical references

