Audio Fingerprint Extraction Using an Adapted Computational Geometry Algorithm

This work presents an adapted version of the Computational Geometry Algorithm (CGA) used for the development of audio-based applications and services. The CGA algorithm analyses an audio stream and produces a unique set of points that can be considered to be the audio data “fingerprint”. It is shown that this fingerprint is coding-independent, a fact that can render the proposed algorithm suitable for multiple purposes, including the categorisation of content identity and the identification of audio clips, hence providing support for the realisation of audio sorting/searching tasks and services. Additionally, based on specific novel applications and services, the overall algorithmic performance and efficiency characteristics of the CGA algorithm are discussed and analysed.


Introduction
The use of digital technology for content distribution and reproduction introduces new domains and audio-embedded applications. New-media arts (Trifonova, Jaccheri, & Bergaust, 2008) and interactive and experimental multimedia (Deliyannis, 2012) are characteristic areas of application featuring dynamic user-content interaction (Deliyannis & Karydis, 2011) and a culture of continually changing content use (Gillespie, 2004). These global transformations introduce new services and markets (Simpson, 2004), a fact that is supported by the evolution of standards such as MPEG-7, which links content to context and offers multimedia accessibility, and MPEG-21, which is designed to support full user-content interaction. The considerable availability of content through various media induces requirements for new broadcast-control and content-verification mechanisms. Additionally, the area of copyright management shares cross-media solutions, such as pattern-matching algorithms and techniques, to identify copyrighted content in various media types (Furht & Kirovski, 2005). All the above uses require identification and fingerprinting that are independent of the digital audio coding format and the sampling rate, and that can cover a wide range of music genres.
Audio fingerprinting techniques aim to develop mechanisms for assessing the perceptual equivalence of different audio content. This is performed by providing a summary of the corresponding audio clip, which is typically stored in a database and serves as an index to the audio library metadata (Beekhof, Voloshynovskiy, Koval, & Holotyak, 2009). By further using matching / similarity functions, distorted versions of the audio content can be mapped to the original version. Particular research issues relating to audio fingerprinting typically include: (i) the determination of how far an audio clip can be processed before it can no longer be identified (degradation) (Beekhof et al., 2009); (ii) the assessment of the probability of false identification (reliability) (Kennedy & Naaman, 2009; Turnbull, Barrington, Torres, & Lanckriet, 2008); (iii) the effect of the audio segment length on the identification efficiency (granularity) (Marolt, 2008; Orio, 2010; Seifert, 2004); and (iv) the search speed, i.e., the time required for accessing a fingerprint in a large database (Seo et al., 2005).
Due to the several forms of audio content processing and transformation, an efficient audio fingerprinting technique should provide accurate, reliable and robust identification against a large number of audio-processing tasks, including, for example, lossy compression, equalisation, time-scale modification, and linear speed change (Seo et al., 2005). Towards this aim, different approaches to realising fingerprinting algorithms exist (Baluja & Covell, 2006; Seifert, 2004; Sukittanon & Atlas, 2002). Some works in the literature utilise specific peaks extracted from the Power Spectral Density (PSD), which may be visualised through a "constellation map" (Wang, 2003). In particular, they derive PSD peaks and represent them using a scatter plot. This approach imposes a time-alignment problem between the peaks of the original audio and those of the comparison audio in the "noisy" (distorted) case (Baluja & Covell, 2006; Chandrasekhar, Sharifi, & Ross, 2011), as it detects a significant cluster of points that form a diagonal line within the scatter plot in order to perform matching on the basis of the density criterion (Wang, 2003). In that respect, successful matching is implemented via the identification of a similar diagonal line within the scatter plot of the original file stored in the library. In order to increase accuracy, proprietary mobile phone software applications, such as Shazam (shazam.com) and Sound Hound (soundhound.com), require audio samples of over 10 seconds for matching and 30-60 seconds for copyright detection (Chandrasekhar et al., 2011).
The present work offers a novel approach to the identification (matching) stage, as it is based on the significant reduction of the original spectral peaks enclosed in convex layer areas. The proposed approach introduces audio-track identification through the use of computational geometry algorithms. The problem of matching sample peaks with original peaks is addressed using an intersection technique between convex layers. More specifically, the proposed method produces a convex polygon in the frequency domain that resembles a coordinate-based fingerprint pattern. This is performed through the Computational Geometry Algorithm (CGA), a scheme of onion-like layers that results in unique frequency-domain representations of the innermost onion layer (Poulos, Belesiotis, & Papavlasopoulos, 2007; Poulos, Bokos, & Vaioulis, 2008; Poulos, Magkos, & Chrissicopoulos, 2003; Poulos, Papavlasopoulos, Bokos, Avlonitis, & Kanellopoulos, 2007; Poulos, Papavlasopoulos, Chrissicopoulos, & Magkos, 2003).
The paper is organised as follows. Section 2 presents the analytic description of the proposed CGA audio fingerprinting algorithm. Section 3 describes the proposed identification procedure using the onion algorithm in relation to the density criterion that is commonly applied in similar identification problems. Finally, Section 4 discusses further research and concluding remarks.

CGA Algorithm Description
As mentioned in the previous Section, robust and reliable audio fingerprinting fundamentally depends on the extraction of features that are independent of audio encoding techniques (quality) and of common signal degradation. Moreover, the size of the fingerprint produced for each sampled audio clip is a significant parameter that may affect the realisation complexity, mainly in terms of computational load and memory requirements. Both the above requirements are met by reducing the spectral resolution of the original Fast Fourier Transform (FFT) magnitudes using the Onion Algorithm (OA), according to our latest studies (Gillespie, 2004; Kosch, 2004; Trifonova et al., 2008). In more detail, the basis of the method depicts the centre of the multi-layers of a set of arithmetic points on the Cartesian plane that represent the values of the application used (Dalal, 2004). In the case of audio fingerprint verification (Poulos et al., 2007), these layers can be formed by the values of the audio signal FFT magnitudes. Additionally, the basis of this algorithm has already been applied successfully in text categorisation and audio identification procedures (Poulos et al., 2003, 2007, 2008). More specifically, the proposed CGA audio fingerprinting technique is divided into three stages, namely pre-processing, feature extraction and identification, which are described in the following Sections.

Pre-processing Stage
This stage is divided into two steps: audio digitisation and spectral processing.
In audio digitisation, the audio signal under identification is transformed / decoded into a pre-defined reference coding scheme. For the purposes of this work, the reference coding scheme considered was the widely employed linear Pulse Code Modulation (PCM) method. This was the coding scheme of all the reference (initial) signals employed in this work for assessing the efficiency of the proposed audio fingerprinting algorithm. Moreover, the pre-processing stage is also responsible for removing any additional data produced by the initial audio coding process. For example, assuming that the audio signal under identification is compressed according to the MPEG-1 Layer III standard, the initial fixed-duration silence imposed by the respective coder (if any) is removed.
In spectral processing, the Power Spectral Density (PSD) Pxx of the pre-processed signal under identification x is calculated in units of power per radians per sample. The corresponding vector of frequencies w is computed in radians per sample, and has the same length as Pxx. It should be noted that a real-valued input vector x produces a full-power one-sided (in frequency) PSD (by default), while a complex-valued x produces a two-sided PSD. In general, the length N of the FFT and the values of the input x determine the length of Pxx and the range of the corresponding normalised frequencies. In this case, the (default) length N of the FFT is the larger of 256 and the next power of 2 greater than the length of x.
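Consider the pre-processed sample $\{x(t)\}_{t=0}^{N-1}$ of a stationary, discrete-time, real-valued signal with power spectral density, or spectrum, $\Phi(\omega)$, $\omega \in (-\pi, \pi]$. For simplicity, $\Phi(p)$ denotes the values of the spectrum at the Fourier frequency grid points:
$$\omega_p = \frac{2\pi p}{N}, \qquad p = 0, 1, \ldots, N-1 \tag{1}$$
The periodogram estimate of $\Phi(p)$ is given by (Stoica & Moses, 1997) as:
$$\hat{\Phi}(p) = \frac{1}{N} \left| \sum_{t=0}^{N-1} x(t)\, e^{-i \omega_p t} \right|^{2} \tag{2}$$
Thus, the periodogram values consist of the elements of the vector S:
$$S = \left[ \hat{\Phi}(0), \hat{\Phi}(1), \ldots, \hat{\Phi}(N-1) \right] \tag{3}$$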
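The periodogram at the Fourier frequency grid points can be illustrated with a minimal sketch. The function name `periodogram` and the test signal are hypothetical, and the sketch assumes a direct N-point discrete Fourier sum with 1/N scaling rather than the padded FFT and one-sided output described above, so it is meant only to show the structure of the vector S.

```python
import cmath
import math


def periodogram(x):
    """Periodogram values at the Fourier grid points w_p = 2*pi*p/N.

    Assumption: plain N-point DFT with 1/N scaling, two-sided output.
    """
    n = len(x)
    values = []
    for p in range(n):
        w_p = 2.0 * math.pi * p / n
        # Discrete Fourier sum of the samples at grid frequency w_p.
        dft = sum(x[t] * cmath.exp(-1j * w_p * t) for t in range(n))
        values.append(abs(dft) ** 2 / n)
    return values


# Illustrative input: a constant offset plus one full cycle of a cosine.
N = 8
x = [1.0 + math.cos(2.0 * math.pi * t / N) for t in range(N)]
S = periodogram(x)

for p, value in enumerate(S):
    print(p, round(value, 6))
```

For this input the energy concentrates at p = 0 (the offset) and at p = 1 and p = N - 1 (the cosine), and the values of S sum to the signal energy, which is a quick sanity check on the scaling.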

Feature Extraction Stage
In the feature extraction stage, the OA method is based on a novel statistical estimation in which the smallest layer of an onion convex polygon encloses the geometric median value of a feature vector (Poulos et al., 2007). The semantic features of the OA are grounded in previous studies, in which the smallest convex polygon (layer) depicts a particular geometrical area where the average value of the data may be characterised as having a strong effect on the properties of the data (Dalal, 2004) and as unique (Poulos et al., 2003). The OA has also been widely used in many application domains, e.g., image processing, pattern recognition, photo image analysis, and the study of the Earth's atmosphere (Bae, Alkobaisi, Vojtechovsky, Narayanappa, & Bae, 2010), in which the process of peeling a planar point set is central to the study of robust estimators in statistics (Chazelle, 1985). The feature extraction of this study is based on a procedure in which the peaks in each time-frequency region are chosen, according to their amplitude, via an inner layer (of suitable depth) of the peak amplitudes. The choice of the critical depth of the OA relies on the geometric property of the convex hull, on which the onion technique is based, that allows one to scale, rotate, and shift the spatial structure without changing its properties (Chang et al., 2000). Thus, in the experimentation we selected the layer of suitable depth k (see Equation 4) for which the variance of the enclosed amplitudes was minimised. In this way, we applied Bartlett's estimation in spectral resolution to reduce the variance and isolate the minimised variance according to Equation (4). This statistical approximation has been verified empirically in several pattern recognition problems (Bose & Toussaint, 1995; Graham, 1972; Hess, 1983; O'Rourke, 1998). The proposed procedure is divided into the following three steps:
(1) In the Cartesian plane (see Figure 1), the values of vector S are placed on the y-axis and the order of each element of vector S is shown on the x-axis. We place the elements of vector S in the Cartesian plane according to the function $f(p, |\Phi_p|)$.
(2) We determine the finite set of points $S = S_0$. Let $S_1$ be the set $S_0 \setminus H(S_0)$, i.e., S minus all the points on the boundary of the hull of S, as illustrated in Figure 2.
(3) Similarly, we define $S_{i+1} = S_i \setminus H(S_i)$. The process continues until a set with three or fewer innermost points is produced. The hulls $H_i = H(S_i)$ are also referred to as the layers of the set, and the process of peeling away the layers is termed onion peeling (Graham, 1972; O'Rourke, 1998) (see Figure 3).
In Equation (4), k is the depth of the convex layer and n is the size of vector S (see Equation 3) in the numerical representation (in accordance with the description of the pre-processing stage in Section 2.1). According to Equation 4, the value of k is calculated as follows. The standard least-squares technique was selected in preference to the non-linear Newton-Raphson minimisation method because it estimates k more efficiently. Convergence of the algorithm described above is not guaranteed, as was verified in the experimental part. For those amplitudes of the envelope of the candidate ultimate layer k where convergence is not achieved, we resort to non-linear minimisation via the Newton-Raphson method. The experimental part showed that in almost all such cases the parameters converged. In the rare cases where convergence failed to occur, the candidate layers were deleted from the set. When applying the Newton-Raphson minimisation, the parameter estimate obtained through the least-squares minimisation is used as the initial point of the iteration. The first and second derivatives are estimated, and the respective derivative matrices F' and F'' are formed and used in the iteration, which produces a new estimate at each step k. The convergence threshold, which is the Euclidean distance between every two consecutive estimates, is taken equal to 0.1.
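For orientation, the textbook Newton-Raphson update for minimising a twice-differentiable objective has the form below. The symbol $\theta$ is a placeholder introduced here for the parameter estimate, and $F'$ and $F''$ stand for the first- and second-derivative matrices mentioned above. This generic form is assumed, not confirmed, to correspond to the iteration used in this work; the stopping rule restates the 0.1 threshold on the Euclidean distance between consecutive estimates.

```latex
\theta^{(k+1)} = \theta^{(k)} - \left[ F''\!\left(\theta^{(k)}\right) \right]^{-1} F'\!\left(\theta^{(k)}\right),
\qquad
\text{stop when } \left\lVert \theta^{(k+1)} - \theta^{(k)} \right\rVert_{2} < 0.1
```

Here $\theta^{(0)}$ would be the least-squares estimate, in line with the initialisation described in the text.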
In our example, the calculation of this depth k is valuable because this subset may be characterised as representing the significant semantics of the selected audio signal (see Figure 4).The decision regarding this selection is explained in Section 3. We may consider the smallest convex layer to comprise a significant geometrical region of frequency enclosing the median frequencies of the original audio shot file.
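The onion-peeling steps described in this Section can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the hull is computed with Andrew's monotone chain (the source does not name a hull algorithm), and picking the depth with the smallest variance of the enclosed amplitudes is a direct simplification that stands in for the least-squares and Newton-Raphson estimation of k, which is not reproduced here.

```python
import statistics


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain. Points lying on hull edges are kept, so
    that peeling removes every point on the hull boundary."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) < 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) < 0:
            upper.pop()
        upper.append(p)
    # Deduplicate (needed when all points are collinear), keeping order.
    return list(dict.fromkeys(lower[:-1] + upper[:-1]))


def onion_layers(points):
    """Peel S_{i+1} = S_i minus H(S_i) until three or fewer points remain."""
    remaining = list(dict.fromkeys(points))
    layers = []
    while len(remaining) > 3:
        hull = convex_hull(remaining)
        layers.append(hull)
        on_hull = set(hull)
        remaining = [p for p in remaining if p not in on_hull]
    if remaining:
        layers.append(remaining)  # innermost set (three points or fewer)
    return layers


def variance_by_depth(layers):
    """Variance of the amplitudes (y-values) enclosed at each depth k."""
    result = []
    for k in range(len(layers)):
        amplitudes = [y for layer in layers[k:] for (_, y) in layer]
        result.append(statistics.pvariance(amplitudes))
    return result


# Illustrative (index, amplitude) points: two nested squares and a core.
points = [(0, 0), (10, 0), (10, 10), (0, 10),
          (2, 2), (8, 2), (8, 8), (2, 8),
          (4, 5), (5, 4), (6, 5)]
layers = onion_layers(points)
variances = variance_by_depth(layers)
best_depth = min(range(len(variances)), key=variances.__getitem__)
print(len(layers), best_depth)
```

In practice the input would be the pairs (p, periodogram value) built from vector S, for example `list(enumerate(S))`.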

Identification Stage
In this stage, subset $S_{xy}$ intersects a new subset $N_{xy}$, which results from the processing of another set N (see Section 2.3), on the Cartesian plane, as illustrated in Figure 5. In this way, the degrees of similarity $s_1$ and $s_2$ form a vector indexed by $i = 1, \ldots, x$. More details on the degrees of similarity $s_1$ and $s_2$ are given in the experimental part and depicted in Figure 7.
Figure 7. Case of categorisation in the same category. The green area is the common area of the compared text categorisation documents
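The intersection of two innermost convex layers can be sketched as below. The exact definition of the similarity degrees is not given in this Section, so the sketch assumes one plausible reading: each degree is the fraction of one polygon's area covered by the common area. The function names are hypothetical, the clipping uses the Sutherland-Hodgman algorithm, and both polygons are assumed to be convex, non-degenerate, and listed counter-clockwise.

```python
def polygon_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    n = len(poly)
    return 0.5 * sum(
        poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
        for i in range(n)
    )


def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip one convex CCW polygon by another."""
    def inside(p, a, b):
        # True when p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of line p-q with line a-b (they are not parallel here).
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        current, output = output, []
        for j in range(len(current)):
            p, q = current[j - 1], current[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:
            break  # the polygons do not overlap
    return output


def similarity_degrees(layer_a, layer_b):
    """Assumed definition: common area as a fraction of each polygon."""
    common = polygon_area(clip_convex(layer_a, layer_b))
    return common / polygon_area(layer_a), common / polygon_area(layer_b)


# Illustrative polygons standing in for two innermost convex layers.
original = [(0, 0), (2, 0), (2, 2), (0, 2)]
candidate = [(1, 0), (4, 0), (4, 2), (1, 2)]
s1, s2 = similarity_degrees(original, candidate)
print(round(s1, 3), round(s2, 3))
```

Unrelated clips whose layers do not overlap yield (0.0, 0.0) under this definition, while identical layers yield (1.0, 1.0), so no time alignment of individual peaks is involved.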

Related Study
As mentioned in the introduction, the density criterion is the dominant criterion on which known methods are based (Wang, 2003). According to this criterion, the candidate peaks are used to ensure that the time-frequency strip for the audio file has reasonably uniform coverage. Thereafter, the selected peaks from each side are ordered so that diagonal elements correspond to target/target distances. Thus, ideally a threshold θ can be chosen such that all diagonal elements are smaller than θ and all off-diagonal elements are larger than θ (Burges, Platt, & Jana, 2003). For a rectangular window (see Figure 8), the transform kernel $\Phi_p$ (see Equation 2) has a sin(x)/x behaviour along the heaviest diagonal line (Sukittanon & Atlas, 2002), in which a threshold θ is obtained from the distance of the centroid of p to the diagonal (see Section 2.1). In our case, the threshold is the value of the fraction between the intersected convex layers, which expresses the degree of correlation between two audio files (clips). Thus, the proposed method is superior to known methods because the threshold is satisfied in a non-linear way (Milenkovic, 1999) via the degree of intersection (see Section 2.3). In this case, the absolute alignment between the amplitudes of the peaks with a minimum threshold error is not needed (as shown in Figure 10). For better comprehension, the phases in which the two algorithms compare the selected amplitude pairs of the peaks are illustrated in Figures 8-10.
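The ideal-threshold condition cited from Burges, Platt, & Jana (2003) can be made concrete with a small sketch. The function name and the distance values are hypothetical; the sketch only checks whether some θ separates the diagonal (target/target) distances from the off-diagonal ones and, as an arbitrary choice, returns the midpoint of the gap.

```python
def separating_threshold(distances):
    """Return theta with max(diagonal) < theta < min(off-diagonal),
    or None when no such threshold exists."""
    n = len(distances)
    diagonal = [distances[i][i] for i in range(n)]
    off_diagonal = [distances[i][j]
                    for i in range(n) for j in range(n) if i != j]
    if not off_diagonal or max(diagonal) >= min(off_diagonal):
        return None
    # Midpoint of the gap; any value strictly inside the gap would do.
    return 0.5 * (max(diagonal) + min(off_diagonal))


# Illustrative distances: row i is a query clip, column j a stored clip.
separable = [[0.10, 0.90, 0.80],
             [0.70, 0.20, 0.95],
             [0.85, 0.90, 0.15]]
overlapping = [[0.50, 0.40],
               [0.90, 0.10]]

print(separating_threshold(separable))
print(separating_threshold(overlapping))
```

The second matrix has a diagonal entry larger than an off-diagonal one, so no single θ satisfies the condition, which is the situation the intersection-based threshold is meant to avoid.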

Further Research and Conclusions
The research field addressed in this work is broad and interdisciplinary, incorporating not only aspects of computer science but also, in our case, diverse fields such as archival science, cognitive science, commerce, communications, law, library science and signal processing, which are essentially interconnected. Therefore, the need for interdisciplinary research in the area of information integration and computer science is evident from this work. The experimental aspects of the algorithm have to be addressed in the near future, while standardisation of the identification process is the key issue that needs to be resolved before this technology may be exploited commercially. Our preliminary studies showed that, beyond one-to-one pattern matching, dynamic content detection is also possible (Poulos et al., 2007). In that respect, once the standards are related to the semantic layers (fingerprints) of information and communication systems, important consequences arise that require further research. This study's major objective was to construct a mechanism for audio similarity detection and fingerprinting. This work presents the adaptation of a computational geometry algorithm for the semantic representation of the information in audio data in terms of a frequency-domain audio fingerprint. The idea for this construction came from testing the onion-peeling algorithm in other areas of signal processing, such as the identification of humans by fingerprints. The aim of this application is to construct an audio fingerprint (i.e., in terms of a serial number) that could identify a copyright-protected published audio file even if its file format has changed from one type to another. Furthermore, it aims to provide a satisfactory degree of correlation with other audio files created from the original by applying different coding / compression techniques, and to detect and automatically reject audio files that are not related to the original.

Figure 2. The external hull $S_0$

Figure 3. The iterative procedure of convex hulls calculation

Figure 4. Graphical representation of the ultimate (last) convex polygon, which has depth k

Figure 5. Decision stage between two different onion algorithm procedures

Figure 8. The red line depicts the diagonal elements (peaks) that correspond to target/target distances; the onion-peeling graphs, in green for the original audio file and blue for the comparison, are also presented