Hybrid-Based Compressed Domain Video Fingerprinting Technique



Introduction
Text, image, audio, and video can all be represented as digital data. The explosion of Internet applications has drawn people into the digital world, and communication via digital data has become routine. However, new issues have also arisen and been explored, such as data security in digital communications, copyright protection of digitized properties, and invisible communication via digital media (Nianhua, 2011). Due to the rapid development of video production technology and the decreasing cost of video acquisition tools and storage, a vast amount of video data is generated around the world every day, including feature films, television programs, personal/home/family videos, surveillance videos, game videos, etc. Digital video has opened up the potential of using video sources in ways other than traditional serial playback. However, this requires the development of new technologies for accessing and manipulating digital video (Moxley, 2010).
Additionally, the amount of digital video data, which has the potential of becoming much greater than that of traditional analog video, necessitates the development of digital video management tools for handling massive video databases. Also, the ease with which all digital media can be flawlessly copied makes the development of appropriate rights protection and authentication tools highly desirable. Techniques are needed for automatically managing this vast amount of information, so that users can structure it quickly, understand its content, and organize it efficiently. An emerging technology that is useful for the management of video, particularly with respect to rights protection, is fingerprinting (Cherubini, 2009; Peng, 2010), also known as perceptual hashing or replica detection. This is defined as the identification of a video segment using a representation called a fingerprint (or sometimes a perceptual hash), which is extracted from the video content (Saikia, 2011).
The fingerprint must uniquely identify a video segment, but does not necessarily need to represent its content. Additionally, it must remain the same when a video segment is manipulated, usually by common video processing operations such as resizing, cropping, histogram equalization, compression, etc. Fingerprints can be used for establishing whether two given segments are identical or derived from each other, and also for establishing whether a video segment is identical to (or derived from) any segment within a given video database (Liu, 2010).
Several video fingerprinting algorithms that work at the pixel level have been proposed. Working directly with pixels is, nowadays, computationally feasible and accurate. However, those solutions do not address the magnitude of the resources needed for such a system, and little analysis is provided on how to use a video fingerprinting solution in practical cases. Compressed domain processing techniques use information extracted directly from the compressed bitstream, and are therefore computationally cheaper than uncompressed domain techniques. To achieve this, the information already inherent in the video stream, which was included during the compression stage, is utilized (Maria, 2009).
A partial decompression must still be done to extract the information necessary for processing; however, this overhead is small compared to full decompression of the video stream. It has been shown that in MPEG video decompression, approximately 40% of the CPU time is spent in Inverse Discrete Cosine Transform (IDCT) calculations, even when using fast DCT algorithms (Young-min, 2000). This paper proposes a video fingerprinting technique that works directly in the compressed domain at the stage of variable length decoding (Figure 1), so it is computationally more efficient than uncompressed domain techniques and even than methods that utilize partially decoded DCT coefficients. The paper is organized as follows: related work is given in Section 2; Section 3 describes the proposed fingerprinting technique; experimental results, comparisons, and discussion are presented in Section 4; finally, Section 5 concludes the paper and points out some directions for future work.

Related Works
Joly, Frelicot and Buisson extract local fingerprints around interest points (Joly, 2003). These interest points are detected with the Harris detector and compared using the nearest neighbor method. They propose a statistical similarity search in (Joly & Buisson, 2005) and (Joly, 2005). Joly et al. use this method and propose a distortion-based probabilistic approximate similarity search technique in order to speed up scanning in a content-based video retrieval framework (Joly, Buisson, & Frelicot, 2007). Zhao et al. extract PCA-SIFT descriptors and use them for video matching (Zhao, 2007). They use nearest neighbor search for matching and SVMs for learning matching patterns with their duplicates. Law-To et al. propose a video indexing method using temporal contextual information extracted from local descriptors of interest points (Law-To, 2006; Law-To & Gouet-Branet, 2006). They use this contextual information in a voting function.
Poullot et al. present a method for monitoring a real-time TV channel (Poullot, 2007). They use the method for comparing the incoming data with indexed videos in a database. The innovations of the method are a z-grid for building indexes, uniformity-based sorting, and adapted partitioning of the components. Lienhart et al. (Yang, 2004) use a color coherence vector to characterize the key frames of the video. Sanchez et al. (1999) discuss using color histograms of key frames for copy detection. They test the developed system on TV commercials, and the system is sensitive to color variations. Hampapur (2000) uses edge features but ignores color variations. Indyk (1999) uses the distance between two scenes as the signature. However, it is a weak and limited signature.
So far, few methods have been proposed for video fingerprinting in the compressed domain. In one of them, the DC coefficient is used to model the fingerprint (Mikhalev et al., 2008). Given the extracted DC coefficients, the method constructs an approximation of each video frame; the modeled frames are evaluated to obtain the key frames of the video, which are then analyzed further to generate the fingerprints. Naphade (1999) uses histogram intersection of the YUV histograms of the DC sequence of the MPEG video. It is an efficient method in terms of compression. Recently, AlBaqir (2009) proposed a video fingerprinting method in which motion vectors are used to model the fingerprint. He uses the motion vectors to construct an approximated motion field, since motion vectors are commonly generated during video compression to exploit the temporal redundancy within a video.

Proposed Technique
In general, MPEG classifies video frames into I (intra) frames, P (predicted) frames, and B (bi-directional) frames, and each frame is divided into macroblocks (MBs), which are 16×16-pixel motion compensation units within a frame. I frames can only have intra blocks. P and B pictures can have different modes according to motion content. Macroblock type modes in P and B frames are given in Figure 2 and Figure 3, respectively. If intra coding is selected, the corresponding MB is encoded individually using the two-dimensional discrete cosine transform (2D-DCT) coefficients (Equation 1). On the other hand, if inter coding is selected, the MB is encoded using motion estimation-motion compensation (ME/MC) algorithms (Richardson, 2003; Heath, 2002).
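$$Y_{u,v} = \frac{2}{M}\, C(u)\, C(v) \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} X_{i,j} \cos\frac{(2i+1)u\pi}{2M} \cos\frac{(2j+1)v\pi}{2M} \qquad (1)$$
where $C(k) = 1/\sqrt{2}$ for $k = 0$ and $C(k) = 1$ otherwise, and $X_{i,j}$ is an M×M block of pixels.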
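As a minimal sketch of the 2D-DCT referred to as Equation 1, the following pure-Python function transforms an M×M block. It assumes the standard orthonormal DCT-II normalization, which may differ from the scaling an actual MPEG encoder applies; the name `dct2` is illustrative and not from the paper.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square M x M block given as a list of lists."""
    m = len(block)

    def c(k):
        # Normalization factor: 1/sqrt(2) for the DC index, 1 otherwise.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * m for _ in range(m)]
    for u in range(m):
        for v in range(m):
            acc = 0.0
            for i in range(m):
                for j in range(m):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / (2 * m))
                            * math.cos((2 * j + 1) * v * math.pi / (2 * m)))
            out[u][v] = (2.0 / m) * c(u) * c(v) * acc
    return out
```

For a flat block all energy ends up in the DC coefficient `out[0][0]`, which is why DC-based fingerprints can approximate a frame from that coefficient alone.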

Figure 2. Macroblock type modes in P pictures
The purpose of using ME/MC is to reduce redundancy in the temporal direction. To exploit temporal redundancy, MPEG algorithms compute an interframe difference called the prediction error (Pei, 1999).
Here M and N are the macroblock dimensions over which the prediction error is computed. For a given macroblock, the encoder first determines whether it is Motion Compensated (MC) or Non Motion Compensated (NO_MC), and a scheme is then used to determine whether the current block is intra or inter coded based on the prediction error. The scheme can be quite complex, but the general idea is to code the difference between the target macroblock and the reference macroblock when the prediction error is small, and to intra-code the macroblock otherwise. For normal scenes, prediction is performed in P and B frames. When there is a scene change, the prediction quality drops significantly, which leads to some macroblocks being encoded in intra mode. Strictly speaking, if a frame is inside a shot, its macroblocks should be predicted well from previous or next frames. However, when a frame is on a shot boundary, it cannot be predicted from the related macroblocks, and a high prediction error occurs (Yeo, 1995). This causes most of the macroblocks of the P frames to be intra coded instead of motion compensated.
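The intra/inter decision described above can be sketched as follows. This is an illustration under stated assumptions, not the encoder's actual scheme: the prediction error is taken to be the mean absolute difference over the M×N macroblock, the threshold value is hypothetical, and the function names are invented for the example.

```python
def prediction_error(target, reference):
    """Mean absolute difference between two equally sized macroblocks."""
    rows, cols = len(target), len(target[0])
    total = sum(abs(target[r][c] - reference[r][c])
                for r in range(rows) for c in range(cols))
    return total / (rows * cols)

def choose_mode(target, reference, threshold=20.0):
    """Return 'inter' when the prediction is good enough, otherwise 'intra'.

    The threshold is a hypothetical value chosen for illustration only.
    """
    if prediction_error(target, reference) <= threshold:
        return "inter"
    return "intra"

def intra_ratio(modes):
    """Fraction of intra-coded macroblocks in a frame.

    A high ratio in a P frame hints at a shot boundary.
    """
    if not modes:
        return 0.0
    return sum(1 for m in modes if m == "intra") / len(modes)
```

A frame whose `intra_ratio` is close to 1 behaves like the shot-boundary case in the text, where most P-frame macroblocks fall back to intra coding.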

Figure 3. Macroblock type modes in B pictures
The proposed technique fuses the macroblock type information and the motion field generated from the motion vectors in the MPEG stream to capture the intrinsic content of the video. Only VLC decoding and one addition operation per macroblock are necessary to obtain the MB data (see the rounded rectangle zone in Figure 1).
The proposed technique (Figure 4) is divided into two stages, namely the fingerprint extraction stage and the similarity matching stage. In the first stage, the compressed MPEG video clip is parsed to extract the macroblock information and the motion vector data. Then, a 10-bin one-dimensional histogram is constructed from the macroblock types (Figure 5). The macroblock types used to generate this histogram cover not only the types that occur in normal scenes but also the types that occur at scene changes. This histogram therefore carries important clues to the spatial and temporal content of the video. Lastly, the proposed technique merges this feature with the motion field feature (AlBaqir, 2009), where η_t is the total number of MBs in frame t.
To further reduce the redundancy of the resulting fingerprint, the proposed technique quantizes the generated fingerprint into a binary sequence (for each part separately, as Figure 4 shows) using the median as a threshold. This is achieved by fixing a threshold level and quantizing every bin value within each histogram according to that threshold. With the fingerprint in the form of a binary sequence, matching can be performed more efficiently using the Hamming distance between the compared fingerprints, instead of the histogram intersection (Equation 5) or the Jaccard coefficient (Equation 6).
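The construction of the macroblock type histogram can be sketched as below. Several details are assumptions made for illustration: the ten type labels are placeholders (the actual types are those listed in Figure 5), each frame's counts are normalized by its macroblock count η_t, and the per-frame histograms are averaged over the clip.

```python
from collections import Counter

# Placeholder labels for the 10 macroblock types; the real list is in Figure 5.
MB_TYPES = [f"type_{k}" for k in range(10)]

def mb_type_histogram(frames):
    """Build a 10-bin macroblock type histogram averaged over frames.

    frames: list of frames, each a list of macroblock type labels.
    Each frame's counts are divided by its macroblock count (eta_t),
    an assumed normalization.
    """
    hist = [0.0] * len(MB_TYPES)
    used = 0
    for mb_labels in frames:
        eta_t = len(mb_labels)
        if eta_t == 0:
            continue  # skip frames with no parsed macroblocks
        counts = Counter(mb_labels)
        for k, name in enumerate(MB_TYPES):
            hist[k] += counts.get(name, 0) / eta_t
        used += 1
    return [h / used for h in hist] if used else hist
```

Because every frame contributes a normalized histogram, clips of different lengths or resolutions yield comparable 10-bin vectors.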
where Q and R are the two fingerprints being compared, each containing N bins.
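The following sketch contrasts the three comparison measures mentioned above for two N-bin fingerprints Q and R. It assumes common textbook forms of the histogram intersection (sum of bin-wise minima divided by the sum of R) and of the generalized Jaccard coefficient (sum of minima over sum of maxima), which may differ from the paper's Equations 5 and 6. Binarization uses each histogram's own median as the threshold, and all function names are illustrative.

```python
from statistics import median

def _check(q, r):
    if len(q) != len(r):
        raise ValueError("fingerprints must have the same number of bins")

def binarize(hist):
    """Quantize each bin to 1 if it exceeds the histogram's median, else 0."""
    thr = median(hist)
    return [1 if h > thr else 0 for h in hist]

def hamming(a, b):
    """Number of positions at which two binary fingerprints differ."""
    _check(a, b)
    return sum(x != y for x, y in zip(a, b))

def histogram_intersection(q, r):
    """Assumed form: sum of bin-wise minima, normalized by the sum of r."""
    _check(q, r)
    return sum(min(x, y) for x, y in zip(q, r)) / sum(r)

def jaccard(q, r):
    """Assumed generalized form: sum of minima over sum of maxima."""
    _check(q, r)
    return sum(min(x, y) for x, y in zip(q, r)) / sum(max(x, y) for x, y in zip(q, r))
```

The Hamming distance on binarized fingerprints needs only bit comparisons, whereas the other two measures require floating-point minima and divisions per bin.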
Figure 5. The macroblock types used in generating the first part of the proposed fingerprint
The fusion of the two-part fingerprint in the second stage is performed according to Equation (7), where α is the fusion parameter, Ψ is the distance metric, and MFH_x is the MFH for video x.
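One plausible reading of the fusion step is a convex combination of the per-part distances, sketched below. This is an assumption, not the paper's Equation 7: in particular, the choice that α weights the MFH part and (1 − α) weights the macroblock type part is a guess, and Ψ is taken here to be a normalized Hamming distance. The names `fused_distance` and `best_match` are hypothetical.

```python
def hamming_fraction(a, b):
    """Normalized Hamming distance between two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def fused_distance(fp_x, fp_y, alpha=0.2, psi=hamming_fraction):
    """Combine the distances of both fingerprint parts.

    fp_x, fp_y: tuples (mb_type_bits, mfh_bits).
    Assumption: alpha weights the MFH part, (1 - alpha) the macroblock type part.
    """
    mb_x, mfh_x = fp_x
    mb_y, mfh_y = fp_y
    return (1.0 - alpha) * psi(mb_x, mb_y) + alpha * psi(mfh_x, mfh_y)

def best_match(query, database, alpha=0.2):
    """Return the key of the database fingerprint closest to the query."""
    return min(database, key=lambda name: fused_distance(query, database[name], alpha))
```

With α = 0 only the macroblock type part is compared and with α = 1 only the motion field part, so sweeping α between these extremes shows how much each part contributes to retrieval.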

Experimental Results
To evaluate the performance of the proposed technique, a test set of 200 videos, all taken from the ocean (ReefVid), was used. Attacks were then individually mounted on the videos, producing a new test set of 3600 videos. The mounted attacks included added watermark, mosaic effect, embossment effect, flipping, blindness effect, cropping, contrast adjustment, brightness modification, and bit rate change, as Table 1 illustrates. The PC used for the experiments had an Intel dual-core processor running at 2 GHz with 3 GB of memory; however, all tests were run using a single core. Finally, the proposed technique is compared with the motion field technique (AlBaqir, 2009), which also operates in the compressed domain, to make the comparison fair. Four sets of experiments are conducted to study the following issues:
1) Determine the best value of the fusion parameter.
3) Study the behavior of the proposed technique against content-preserving attacks (Table 1) to investigate the robustness and uniqueness of the proposed fingerprint.
4) Compare the proposed work against an existing technique working in the compressed domain (Motion Field (AlBaqir, 2009)).
The average retrieval rate across all 17 attacks for the proposed technique, with and without binarization, is shown in Figure 6. The figure shows that the proposed technique improves when binarization is used, and that α = 0.2 gives the best result in both cases.

Conclusion
This paper proposes a video fingerprinting method in the compressed domain that utilizes the macroblock and motion vector information in a hybrid way. Despite its low computational overhead, the proposed work gives promising results against a large spectrum of content-based video transformations. The proposed work also shows that macroblock types are more important than motion vectors as an intrinsic content-preserving feature in the compressed video domain. One direction for future work is combining this technique with compressed domain watermarking methods to design a robust content management methodology and apply it in the broadcast monitoring area. The proposed work can also be adapted to real-time environments such as cellular phones backed by cloud computing.

Figure 1. Full decompression versus partial decompression of the compressed video. The bold segmented rectangle is the zone where the partially decoded-based techniques work, and the rounded rectangle is where the proposed technique works.
Some video similarity detection methods use uncompressed MPEG video to directly extract the features. The content of the frames, the DC values of macroblocks, or the motion vectors are used as features. Ardizzone (1999) uses motion vectors for feature extraction. The signature of the video is either a global motion feature or a motion-based segmented feature. In the global motion extraction step, the statistical distribution of directions (i.e., an angle histogram) is calculated. The angle histogram is computed by dividing the [-180, 180] interval into subintervals and summing the magnitudes of the motion vectors that fall in each subinterval. In motion-based segmentation, motion vectors are clustered and labeled. Labels are assigned according to the similarity of the motion vectors or the histogram of motion vector magnitudes. Dominant regions are taken into account in the comparison step.
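The angle histogram described above can be sketched in a few lines. The number of subintervals, the treatment of zero-length vectors, and the function name are assumptions made for the example; the source only states that the [-180, 180] interval is subdivided and that magnitudes are summed per subinterval.

```python
import math

def angle_histogram(motion_vectors, bins=8):
    """Sum motion vector magnitudes per direction bin over [-180, 180] degrees.

    motion_vectors: iterable of (dx, dy) pairs.
    bins: number of equal-width angular subintervals (an assumed parameter).
    """
    hist = [0.0] * bins
    width = 360.0 / bins
    for dx, dy in motion_vectors:
        mag = math.hypot(dx, dy)
        if mag == 0.0:
            continue  # a zero vector has no direction, so it is skipped
        angle = math.degrees(math.atan2(dy, dx))  # value in [-180, 180]
        # Shift to [0, 360] and fold the +180 endpoint into the last bin.
        idx = min(int((angle + 180.0) // width), bins - 1)
        hist[idx] += mag
    return hist
```

Weighting by magnitude rather than counting vectors means that a few large motions dominate the histogram, which matches the idea of a global motion signature.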
The two features together form a two-part fingerprint. We call the first part of the proposed fingerprint the MBTH (Macroblock Type Information Histogram) and the second part the MFH (Motion Field Histogram).

Figure 6. The proposed technique's performance using the median to binarize the proposed fingerprint

Table 1. Distortions used in the study