Efficient and Privacy-Preserving Multi-User Outsourced K-Means Clustering

In recent years, with the development of the Internet, data on the network has been growing explosively. Big data mining aims at obtaining useful information through data processing such as clustering and classification. Clustering is an important branch of big data mining, popular for its simplicity. A new trend for clients who lack storage and computational resources is to outsource their data and clustering tasks to public cloud platforms. However, as datasets used for clustering may contain sensitive information (e.g., identity information, health information), simply outsourcing them to the cloud cannot protect privacy, so clients tend to encrypt their databases before uploading them to the cloud for clustering. In this paper, we focus on privacy protection and efficiency improvement for k-means clustering, and we propose a new privacy-preserving multi-user outsourced k-means clustering algorithm based on locality sensitive hashing (LSH). In this algorithm, we use the Paillier cryptosystem to encrypt databases, and employ LSH to prune off unnecessary computations during clustering; that is, we do not need to compute the Euclidean distance between every data record and every cluster center. Finally, theoretical and experimental results show that our algorithm is more efficient than most existing privacy-preserving k-means clustering schemes.

privacy-preserving multi-user outsourced k-means algorithm, which is based on locality sensitive hashing (LSH) (Datar, Immorlica, Indyk & Mirrokni, 2004) for pruning off unnecessary computations in clustering. Our contributions are summarized as follows: 1) We propose an LSH-based privacy-preserving multi-user outsourced k-means clustering (LSH-PPMOC) algorithm, whose main idea is to prune off unnecessary computations during the clustering task.
2) We combine the privacy-preserving techniques with LSH to guarantee the security of users' datasets and clustering results.
3) Experiments on real-world datasets show that our scheme greatly improves clustering efficiency.
The remainder of this paper is organized as follows. In Section 2, we discuss the existing related work. Section 3 presents definitions and properties related to the k-means clustering algorithm, the Paillier cryptosystem and locality sensitive hashing as background. In Section 4, we discuss our proposed LSH-PPMOC solution in detail. We then analyze the security guarantees and complexity of our solution in Section 5. Section 6 presents our experimental results on a real-world dataset under different parameter settings. Finally, we conclude the paper along with the scope for future research in Section 7.

Related Work
Privacy-preserving k-means clustering has been widely used in data mining. Private data mining algorithms can be categorized as 1) data perturbation based, 2) secure multiparty computation (SMC) based, and 3) cryptography based. In perturbation, data records are randomized by adding noise; the work by (Liu, Kargupta & Ryan, 2005) showed that clustering results based on multiplicative perturbation had an error rate below 5% compared to results on the original data. However, data perturbation cannot guarantee privacy in any formal sense (Liu, Giannella & Kargupta, 2006; Wong, Cheung, Kao & Mamoulis, 2009). For example, if an adversary obtains a few data records in plaintext, he may recover the remaining records even though they are perturbed (Liu, Giannella & Kargupta, 2006).
Algorithms based on secure multiparty computation (SMC) (Goldreich, 2009) can preserve the security and privacy of users' data. Prior work utilizing SMC, such as (Lindell & Pinkas, 2008; Mohassel, Rosulek & Trieu, 2020; Upmanyu, Namboodiri, Srinathan & Jawahar, 2010), proposed algorithms to perform distributed data mining without revealing the private inputs of participants. In addition, Clifton et al. (Clifton, Kantarcioglu, Vaidya, Lin & Zhu, 2002) proposed that a relatively small set of cryptographic primitives should be used to build SMC protocols. However, these solutions suffer from high communication and computation costs. We can conclude that, although private clustering algorithms based on SMC provide better security protection, they are too costly to apply in practice (Jagannathan & Wright, 2005).
There is also a line of work proposing privacy-preserving clustering algorithms using semantically secure additively or multiplicatively homomorphic encryption schemes. Liu et al. first leveraged fully homomorphic encryption (Gentry, 2009) to outsource clustering. However, the encryption scheme they adopted was shown to be insecure by (Wang, 2015). Moreover, their algorithm required clients to participate and provide information during each iteration, leading to heavy overhead on clients. To reduce the interaction with users, Almutairi et al. (Almutairi, Coenen & Dures, 2017) presented an efficient mechanism using the concept of an Updatable Distance Matrix (UDM). However, their work revealed partial private information to the cloud servers, such as the size of each cluster and the distances between data objects and centroids.
Similarly, Samanthula et al. (Samanthula, Rao, Bertino, Yi & Liu, 2014) proposed a secure outsourced k-means clustering scheme using the Paillier cryptosystem. Unfortunately, this scheme has a high computation cost because of its bit array-based comparison. Rong et al. (Rong, Wang, Liu, Hao & Xian, 2017) then put forward a privacy-preserving outsourced k-means clustering scheme under multiple keys based on a public key cryptosystem with double decryption (Youn, Park, Kim & Lim, 2005). However, the addition protocol in their scheme was not secure because the assistant server could extract the ratio of messages. To address this problem, (Zou, Zhao, Shi, Wang, Peng, Ping & Wang, 2020) proposed a highly secure outsourced k-means clustering scheme using the BCP cryptosystem, which has the additive homomorphic property.
To improve clustering performance in the cloud computing environment, many scholars proposed MapReduce-based (Dean & Ghemawat, 2008) k-means clustering schemes to handle large-scale datasets in parallel (Cui, Zhu, Yang, Li & Ji, 2014; Sardar & Ansari, 2020), but none of them considered privacy protection. Yuan et al. (Yuan & Tian, 2017) proposed a privacy-preserving scheme using a lightweight cryptosystem based on the hardness of learning with errors (LWE) (Brakerski, Gentry & Halevi, 2013), and incorporated MapReduce to improve efficiency. However, their scheme was not fully outsourced and did not support multiple users. Similarly, in (Rong, Wang, Liu, Hao & Xian, 2017), the clustering was executed under the Spark framework.
Besides, there are other ways to improve the efficiency of k-means clustering. Bhaskara et al. (Bhaskara & Wijewardena, 2018) proposed a variant of locality sensitive hashing (LSH) (Indyk & Motwani, 1998) to speed up clustering. Similarly, (Li, Wang, Wang, Hu, Li & Li, 2014) utilized the locality-sensitive property of LSH to prune off unnecessary computations during clustering and carried out experiments with the help of MapReduce. However, both schemes did not take privacy into consideration.
Inspired by (Li, Wang, Wang, Hu, Li & Li, 2014), we propose a new and efficient LSH-based privacy-preserving outsourced k-means clustering scheme. We consider a scenario in which there are two cloud service providers and multiple users. Under the explicit assumption that the two cloud servers never collude during clustering, our proposed scheme protects the confidentiality of all users' data.

Preliminaries
In this section, we first introduce the definition of typical k-means clustering. We then give a brief overview of the additively homomorphic Paillier cryptosystem and some basic cryptographic primitives for securely performing clustering. Finally, we briefly introduce LSH, which we use as the basis for our solution.

K-means Clustering
K-means clustering is an unsupervised clustering algorithm, and it is widely used in various application scenarios. Given $d$-dimensional data records $x_1, \ldots, x_m$, the goal of the k-means clustering algorithm is to divide these records into $k$ disjoint groups, such that objects in the same group are similar while objects in different groups have low similarity. We use $C_1, \ldots, C_k$ to denote the $k$ clusters, and $c_1, \ldots, c_k$ the corresponding centroids. To place one data record $x_i$, $i \in [1, m]$, into the correct cluster, we first compute the distances between it and the $k$ centroids $c_j$, $j \in [1, k]$. We use the squared Euclidean distance here, which is given as
$$\|x_i - c_j\|^2 = \sum_{l=1}^{d} (x_{i,l} - c_{j,l})^2.$$
We then assign $x_i$ to the cluster $C_j$ whose centroid $c_j$ has the smallest distance to $x_i$ among the $k$ distances.
A traditional k-means clustering has three stages: (1) initialization; (2) clustering; (3) updating new centroids. In Stage (1), $k$ initial records are selected randomly as cluster centroids $c_1, \ldots, c_k$. In Stage (2), we compute the $k$ distances between every record $x_i$ and the $k$ centroids, and assign each record to a cluster according to these distances. In Stage (3), new centroids $c_j'$ are derived as the mean values of the attributes of the records belonging to $C_j$. If the given maximum iteration number has not been reached and the clustering results differ from those of the previous iteration, the next iteration restarts from Stage (2) with the new centroids; otherwise, the algorithm terminates.
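To make the three stages concrete, here is a minimal plaintext k-means sketch in Python (our own illustration; the function and variable names are ours, not part of any protocol):

```python
# A minimal plaintext k-means sketch following the three stages above.
import random

def kmeans(records, k, max_iter=20):
    # Stage (1): pick k initial centroids at random
    centroids = random.sample(records, k)
    assignment = [None] * len(records)
    for _ in range(max_iter):
        # Stage (2): assign each record to its nearest centroid
        new_assignment = []
        for x in records:
            dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
            new_assignment.append(dists.index(min(dists)))
        if new_assignment == assignment:      # results unchanged: terminate
            break
        assignment = new_assignment
        # Stage (3): recompute each centroid as the mean of its members
        for j in range(k):
            members = [x for x, a in zip(records, assignment) if a == j]
            if members:
                d = len(members[0])
                centroids[j] = [sum(x[l] for x in members) / len(members) for l in range(d)]
    return centroids, assignment
```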

The Paillier Cryptosystem
In this paper, we use the Paillier encryption scheme (Paillier, 1999), which is a probabilistic asymmetric encryption scheme. Without loss of generality, let $E(\cdot)$ and $D(\cdot)$ denote the encryption and decryption functions under the Paillier cryptosystem, and let $N$ denote the RSA modulus.
We denote the encryption scheme as a triple $\{KeyGen, E, D\}$.
$E(m, r)$: select a random value $r \in \mathbb{Z}_N^*$ for the message $m$; the ciphertext is $c = g^m \cdot r^N \bmod N^2$.
Then, for any $m_1, m_2 \in \mathbb{Z}_N$, the encryption scheme is additively homomorphic:
$$E(m_1) \times E(m_2) = E(m_1 + m_2), \qquad E(m_1)^a = E(a \cdot m_1),$$
where the symbol "$\times$" denotes multiplication on ciphertexts, and "$\cdot$" denotes multiplication on plaintexts. Besides, we will omit the term $\bmod N^2$ from homomorphic operations in the rest of the paper.
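As a concrete illustration of these identities, the following is a self-contained toy Paillier sketch with $g = N + 1$ and small hard-coded primes (our own example; real deployments use much larger keys):

```python
# Toy Paillier sketch (g = N + 1) illustrating E(m1) x E(m2) = E(m1 + m2).
import math, random

p, q = 1000003, 1000033            # toy primes; real keys use ~1024-bit primes
N = p * q
N2 = N * N
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, N)               # valid here since gcd(lam, N) = 1

def encrypt(m):
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(2, N)
    # with g = N + 1: (N+1)^m = 1 + m*N (mod N^2)
    return (1 + m * N) % N2 * pow(r, N, N2) % N2

def decrypt(c):
    u = pow(c, lam, N2)            # = 1 + m*lam*N (mod N^2)
    return ((u - 1) // N) * mu % N

m1, m2 = 37, 505
assert decrypt(encrypt(m1) * encrypt(m2) % N2) == m1 + m2   # E(m1) x E(m2) = E(m1+m2)
assert decrypt(pow(encrypt(m1), 5, N2)) == 5 * m1           # E(m1)^a = E(a*m1)
```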

Basic Cryptographic Primitives
In this part, we present some cryptographic primitives as the basis of our methods.
(2) Secure Squared Euclidean Distance (SSED) Protocol. For the k-means algorithm, there are many ways to compute similarity scores between a data record $x_i$, $1 \le i \le m$, and a cluster centroid $c_j$, $1 \le j \le k$, such as Euclidean distance, cosine similarity and the Jaccard coefficient. In this paper, we use the squared Euclidean distance for its simplicity, denoted by $\|x_i - c_j\|^2$.
Note that $c_j$ is a vector whose components may be rational numbers. However, the ring $\mathbb{Z}_N$ does not support rational division, so we use a new form of expression to represent the cluster center, as in (Rong, Wang, Liu, Hao & Xian, 2017). Let $\langle s_j, |C_j| \rangle$ denote the new form of the cluster center, where $s_j$ and $|C_j|$ represent the sum vector and the total number of records belonging to cluster $C_j$, respectively. Now we define $\Omega_{(i,j)}$ as the scaled squared Euclidean distance between $x_i$ and $c_j$, which satisfies $\|x_i - c_j\| = \sqrt{\Omega_{(i,j)}}/|C_j|$. So $\Omega_{(i,j)}$ can be calculated as follows:
$$\Omega_{(i,j)} = \sum_{l=1}^{d} \left( |C_j| \cdot x_{i,l} - s_{j,l} \right)^2,$$
where $1 \le i \le m$, $1 \le j \le k$, and $m$ and $d$ are the total number of objects and the dimension size, respectively.
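In plaintext, the scaled distance and its relation to the true distance can be checked with a short sketch (our own illustration):

```python
# Scaled squared Euclidean distance Omega(i,j): integer arithmetic avoids
# the rational division s_j / |C_j|.
def scaled_sq_dist(x, s, size):
    # x: data record; s: sum vector of cluster C_j; size: |C_j|
    return sum((size * xl - sl) ** 2 for xl, sl in zip(x, s))

# sanity check against the true squared distance to the centroid s / |C_j|
x, s, size = [2, 3], [6, 9], 3                  # centroid is (2, 3)
omega = scaled_sq_dist(x, s, size)
true_sq = sum((xl - sl / size) ** 2 for xl, sl in zip(x, s))
assert omega == size ** 2 * true_sq             # ||x - c||^2 = Omega / |C_j|^2
```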
(3) Secure Comparison (SC) Protocol. Given that Alice has two encrypted values $E_{pk_a}(a)$ and $E_{pk_a}(b)$, and Bob has $sk_a$, the goal of the SC protocol is to securely compare $a$ and $b$ without revealing either of them. Specifically, Alice first generates a random value $r$ and encrypts it as $E_{pk_a}(r)$. Alice then blinds the inputs using the additive homomorphic property of the Paillier cryptosystem, and flips a coin $s = 0/1$ to randomize the order of the blinded values sent to Bob. The protocol returns the comparison result without revealing $a$ or $b$ to either party.
(4) Privacy-Preserving Minimum (PMIN) Protocol. In this protocol, we aim at comparing two squared Euclidean distances and outputting the minimum one. To achieve this, we use the privacy-preserving maximum protocol proposed by Liu et al. (Liu, Lu, Ma, Chen & Qin, 2015), modified slightly so that it returns the minimum value instead of the maximum.

[Algorithm 2 (PMIN), pseudo-code abridged: Alice and Bob jointly compute four blinded ciphertexts $v_1, v_2, v_3, v_4$, which are then sent to Bob.]
(5) Enhanced Secure Distance Comparison (ESDC) Protocol. The PMIN protocol cannot meet the requirements of the pruning strategies described later, so we design an enhanced secure distance comparison (ESDC) protocol, whose goal is to compare the absolute value of the difference between two encrypted Euclidean distances, $|D_a - D_b|$, with a given threshold $\theta$. As a result, we obtain a bit $u$, where $u = 1$ means $|D_a - D_b| > \theta$, and vice versa. The pseudo-code (inputs and the joint computation of Alice and Bob) is shown in Algorithm 3.
(6) Privacy-Preserving Minimum Among k Distances (PMIN$_k$) Protocol. The goal of the PMIN$_k$ protocol is to obtain the encrypted minimum distance and its corresponding secret from $k$ encrypted squared Euclidean distances. Assume Alice has the $k$ distances in ciphertext along with their secrets, and Bob has the corresponding secret key $sk_a$. To get the minimum value, Alice and Bob execute PMIN on two inputs at a time in a sequential fashion. This process is straightforward, so we omit it here.
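A plaintext analogue of this sequential chaining is sketched below (our own illustration; pmin here merely stands in for the two-party PMIN protocol that runs on ciphertexts):

```python
# Plaintext analogue of PMIN_k: chain k-1 pairwise PMIN calls.
def pmin(a, b):
    # stand-in for the two-party PMIN protocol on encrypted inputs
    return a if a[0] <= b[0] else b

def pmin_k(dist_secret_pairs):
    # each element is (distance, secret); k-1 sequential PMIN invocations
    current = dist_secret_pairs[0]
    for pair in dist_secret_pairs[1:]:
        current = pmin(current, pair)
    return current

print(pmin_k([(9, 'c1'), (4, 'c2'), (7, 'c3')]))   # -> (4, 'c2')
```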

Locality Sensitive Hashing
Locality sensitive hashing (LSH) was introduced in the seminal work (Indyk & Motwani, 1998), which gave an efficient algorithm for nearest neighbor search in high-dimensional space. The main idea of LSH is to use a family of special hash functions to map objects into different buckets, such that objects that are close to each other are mapped into the same bucket (a collision) with higher probability than objects that are far apart. Assume $S$ is the domain of objects and $D(\cdot, \cdot)$ is the distance measure between objects. A function family $\mathcal{H}$ is called $(r_1, r_2, p_1, p_2)$-sensitive if, for any $x, y \in S$: if $D(x, y) \le r_1$ then $\Pr[h(x) = h(y)] \ge p_1$, and if $D(x, y) \ge r_2$ then $\Pr[h(x) = h(y)] \le p_2$, where $p_1, p_2$ are both probability values, $0 < r_1 \le r_2$, and $0 \le p_2 < p_1 \le 1$.
Different LSH families are adopted for different distance functions. (Datar, Immorlica, Indyk & Mirrokni, 2004) proposed LSH families for $\ell_p$ norms based on $p$-stable distributions: if $p = 1$ the distance is the Manhattan distance, while if $p = 2$ it is the Euclidean distance. As we use the Euclidean distance in this paper, we use the hash function in Eq. (3), $h_{\mathbf{a},b}(\mathbf{v}) = \lfloor (\mathbf{a} \cdot \mathbf{v} + b)/w \rfloor$, where $\mathbf{a}$ is a $d$-dimensional vector with entries drawn from the standard Gaussian distribution, $b$ is drawn uniformly from $[0, w]$, and $w$ is the bucket width.
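For reference, the following is a sketch of this hash family in Python (our own illustration; a, b and w follow the standard construction of Datar et al.):

```python
# p-stable LSH for p = 2: h_{a,b}(v) = floor((a . v + b) / w).
import random

def make_hash(d, w):
    a = [random.gauss(0, 1) for _ in range(d)]   # 2-stable (Gaussian) entries
    b = random.uniform(0, w)
    def h(v):
        return int((sum(ai * vi for ai, vi in zip(a, v)) + b) // w)
    return h

h = make_hash(d=4, w=4.0)
print(h([1.0, 2.0, 0.5, 3.0]), h([1.1, 2.0, 0.5, 3.0]))  # close points often collide
```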

LSH Based Privacy-Preserving Multi-User Outsourced K-Means Clustering for Massive Datasets
This section presents our new LSH-based privacy-preserving multi-user outsourced k-means clustering algorithm (LSH-PPMOC), which includes two optimization strategies for clustering large-scale, high-dimensional data. Our goal is to minimize the cost of the k-means clustering process without revealing any information about users' data records or the distances between data records and centroids. First, we use an LSH function family agreed upon by all users to build the data skeleton, in which similar points are mapped into the same bucket. We can then avoid unnecessary computations and comparisons among the $k$ encrypted distances by performing clustering on the data skeleton and applying efficient pruning strategies.

Data Skeleton
In this section, we first briefly introduce the data skeleton proposed by Li et al. (Li, Wang, Wang, Hu, Li & Li, 2014), and then illustrate what we can do with it.
Each element of the data skeleton is a 2-tuple $(p_i, L_i)$, where $p_i$ is a representative point and $L_i$ is the list of points represented by $p_i$. To build the data skeleton, we first use a group of hash functions to map the original datasets into different buckets, where the bucket value is a multi-dimensional vector formed by the hash results. We then select the center point of each bucket as the representative data point $p_i$ and initially set its $L_i$ to null. The distances between $p_i$ and all other points in the bucket are computed, and a point is added into $L_i$ if its distance is smaller than $T$, where $T$ is a user-defined threshold. A point farther than $T$ from $p_i$ becomes an element of the data skeleton in its own right, with a null $L_i$. An example is given in Figure 1. It is worth noting that the data skeletons of all buckets together form a hash table (HT).
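The construction can be sketched as follows (our own simplification: we take the first point of each bucket as the representative, whereas the paper uses the bucket's center point):

```python
# Building a data skeleton: hash each point, pick a representative per
# bucket, and attach points within distance T to the representative's list.
import math

def build_skeleton(points, hash_funcs, T):
    buckets = {}
    for x in points:
        key = tuple(h(x) for h in hash_funcs)        # multi-dimensional bucket value
        buckets.setdefault(key, []).append(x)
    skeleton = {}
    for key, pts in buckets.items():
        rep = pts[0]                                  # representative point p_i
        elems = [(rep, [])]
        for x in pts[1:]:
            if math.dist(rep, x) < T:
                elems[0][1].append(x)                 # x joins L_i of the representative
            else:
                elems.append((x, []))                 # far point: own entry, null L_i
        skeleton[key] = elems
    return skeleton
```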
It is well known that, as the dataset grows, the cost of computing the distance for every point-centroid pair becomes unbearable, especially when the number of clusters is large. Since the points within one data skeleton element are close to each other, they are likely to belong to the same cluster with high probability, which allows us to save some unnecessary calculations.
We demonstrate this in Figure 2. Let $c_1$ and $c_2$ be two center points and $x_1$, $x_2$ two different data records; let $D_1$, $D_2$ be the distances from $x_1$ to $c_1$ and $c_2$, let $D_1'$, $D_2'$ be the corresponding distances from $x_2$, and let $\delta$ be the distance between $x_1$ and $x_2$. If $D_1$ is much smaller than $D_2$ and $\delta$ is near zero, it is very likely that $D_1' < D_2'$, which means that $x_2$ shares the same cluster with $x_1$. We can generalize this to Theorem 1.
Theorem 1. Let $c_a$ and $c_b$ be two center points, and let $x_1$ and $x_2$ be two points with distance $\delta = \|x_1 - x_2\|$. If $\|x_1 - c_b\| - \|x_1 - c_a\| \ge 2\delta$, then $x_2$ is at least as close to $c_a$ as to $c_b$.
The proof was given in (Li, Wang, Wang, Hu, Li & Li, 2014), so we do not go into detail here.
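For intuition, the following short triangle-inequality derivation (our own sketch, consistent with the statement of Theorem 1 above) shows why the theorem holds:

```latex
% Why Theorem 1 holds, by the triangle inequality:
\begin{align*}
\|x_2 - c_a\| &\le \|x_1 - c_a\| + \delta && \text{(triangle inequality)}\\
\|x_2 - c_b\| &\ge \|x_1 - c_b\| - \delta \\
\|x_1 - c_b\| - \|x_1 - c_a\| \ge 2\delta
  &\;\Longrightarrow\; \|x_2 - c_b\| \ge \|x_1 - c_a\| + \delta \ge \|x_2 - c_a\|.
\end{align*}
```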

Privacy-Preserving Pruning Strategies Using LSH
In this section, we explain how we use the locality-sensitive property of LSH to prune off unnecessary computations during privacy-preserving k-means clustering.
There are two pruning strategies, both based on Theorem 1. The first aims at reducing the number of data records that need to find their nearest centers. Each representative point $p_i$ in a bucket represents the points in $L_i$, and the distance between any point in $L_i$ and $p_i$ is less than $T$. We first compute the $k$ distances between $p_i$ and the centroids $\{c_1, \ldots, c_k\}$ using the SSED protocol. Let $D(p_i, c_{t_1})$ and $D(p_i, c_{t_2})$ be the smallest and second smallest of these $k$ distances. If $D(p_i, c_{t_2}) - D(p_i, c_{t_1}) \ge 2T$, then all points in $L_i$ represented by $p_i$ have their shortest distance to $c_{t_1}$, and we need not compute the distances between those points and the centroids in this iteration.
The second strategy reduces the number of centroids to be compared for each point $x_t$ in $L_i$. Consider the situation where we have computed the $k$ distances $D_1, \ldots, D_k$ for $p_i$, with $D_1$ the smallest, and $x_t$ is a point represented by $p_i$. When we assign $x_t$ to a proper cluster, we compare $D_j - D_1$ with $2T$ for $j \in [2, k]$. Only a cluster $C_j$ satisfying $D_j - D_1 \le 2T$ is marked and added into a set closeSet. That is, for $x_t$ we only need to compute the distances between it and the centroids in closeSet, which is obviously a subset of the $k$ centroids. A plaintext sketch of both strategies follows.
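The sketch below is our own plaintext illustration of the two strategies (in the protocol the comparisons run on ciphertexts via ESDC):

```python
# Pruning strategies: dists[j] is the distance from representative p_i to
# centroid j, and T is the data-skeleton threshold.
def prune(dists, T):
    order = sorted(range(len(dists)), key=lambda j: dists[j])
    t1, t2 = order[0], order[1]
    # Strategy 1: every point represented by p_i is closest to centroid t1
    if dists[t2] - dists[t1] >= 2 * T:
        return t1, None
    # Strategy 2: only centroids in closeSet need be checked for points in L_i
    close_set = [j for j in order[1:] if dists[j] - dists[t1] <= 2 * T]
    return None, [t1] + close_set

print(prune([1.0, 9.0, 2.5], T=1.0))   # -> (None, [0, 2]): check centroids 0 and 2
print(prune([1.0, 9.0, 8.0], T=1.0))   # -> (0, None): all points go to cluster 0
```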
In Algorithm 4, we give the pseudo-code of privacy-preserving LSH-based multi-user outsourced k-means clustering.
[Algorithm 4, pseudo-code abridged: for each $j \in [1, k]$ with $j \ne j'$, CS and AS jointly compute the required secure comparisons; CS and AS then update the bitmap matrix accordingly and cooperatively judge the termination conditions.]

System Model
In our setting, there are two types of entities, users and cloud service providers, as shown in Figure 3. There are $n$ users denoted by $U_1, \ldots, U_n$. Suppose $U_i$ holds a dataset $DB_i$ with $m_i$ $d$-dimensional data records, where $\sum_{i=1}^{n} m_i = m$. We use a Computation Server (CS) and an Assistant Server (AS) to execute the k-means clustering task; both are semi-honest and do not collude. AS generates two public/secret key pairs $(pk_a, sk_a)$ and $(pk_b, sk_b)$ under the Paillier cryptosystem; the public key $pk_a$ is sent to all users and CS, AS keeps $sk_a$, and $(pk_b, sk_b)$ is sent to CS. When clustering starts, all users first use the same LSH function family to build their own hash tables (data skeletons), and then blind the bucket values with MD5 to preserve the privacy of the buckets. After that, each user $U_i$ encrypts his/her own dataset attribute-wise using $pk_a$, and sends the hash table and encrypted dataset to CS. CS first aggregates all received hash tables into one hash table (HT), merging buckets with the same blinded value. Next, CS performs k-means clustering on the ciphertexts with the help of AS. When the clustering results no longer change or a predefined number of iterations is reached, the clustering finishes and the results are sent to the users.
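The aggregation step can be pictured as a simple merge of the users' blinded hash tables; the sketch below is our own simplification, with string keys standing in for blinded bucket values:

```python
# CS-side aggregation: buckets whose blinded values coincide are merged
# (entries here would be encrypted records in the actual protocol).
def aggregate(hash_tables):
    merged = {}
    for table in hash_tables:                 # one table per user
        for blinded_key, entries in table.items():
            merged.setdefault(blinded_key, []).extend(entries)
    return merged

ht = aggregate([{'a3f1': ['ct1']}, {'a3f1': ['ct2'], '9b07': ['ct3']}])
print(ht)   # {'a3f1': ['ct1', 'ct2'], '9b07': ['ct3']}
```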

The Proposed LSH-PPMOC Protocol
In this section, we discuss our proposed LSH-PPMOC algorithm, which is composed of three steps: 1) data uploading, 2) LSH-based pruning clustering, and 3) termination.
Data uploading. Each user first generates his/her own key pair $(pk_{u_i}, sk_{u_i})$ for decrypting the final cluster centroids. Before clustering, all users agree on a hash function family and generate their own hash tables, each made up of different buckets. Since a bucket value is derived from a user's original data records, it must be blinded by a one-way hash function (e.g., SHA or MD5) to keep information about the data records from leaking. The hash table is then sent to CS. Besides, each user encrypts his/her dataset using $pk_a$ before sending it to CS.
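For illustration, blinding a bucket value with a one-way hash might look like the following sketch (our own example using Python's hashlib; the comma-joined string encoding is an assumption of ours):

```python
# Blinding a bucket value with a one-way hash before upload (MD5 here,
# as in the system model; any one-way hash could substitute).
import hashlib

def blind_bucket(bucket_value):
    # bucket_value: the tuple of hash results produced by the LSH functions
    raw = ','.join(str(v) for v in bucket_value).encode()
    return hashlib.md5(raw).hexdigest()

print(blind_bucket((3, -1, 7)))   # same bucket value -> same blinded key
```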
LSH-based pruning clustering. After receiving encrypted hash tables, and encrypted original datasets, CS will aggregate tables firstly and then perform the clustering work cooperating with AS. This process can be divided into four steps.
(1) CS chooses $k$ initial centroids randomly from all encrypted objects for the $k$ clusters, denoted as $c_1, \ldots, c_k$.
(2) For every representative point $p_i$ in each bucket, CS and AS perform the original privacy-preserving k-means clustering: they compute the $k$ squared Euclidean distances $D_1, \ldots, D_k$ for each such point using SSED (where we always assume $D_1$ is the smallest), and then assign the point to the proper cluster by running the PMIN$_k$ protocol on the $k$ distances. After this assignment, the bitmap vector is updated.
(3) For any point $x_t$ in $L_i$ of each bucket, whose distance to the representative point $p_i$ of the bucket is smaller than $T$, we add a cluster $C_j$ into the set closeSet if $D_j - D_1 \le 2T$, $j \in [2, k]$. We then compute the distances between $x_t$ and only the centroids in closeSet, cluster $x_t$ accurately, and update the corresponding bitmap vector.
(4) After all points have been assigned, CS and AS jointly update each cluster center in the form $\langle s_j, |C_j| \rangle$ by homomorphically summing the encrypted records of each cluster according to the bitmap vectors.
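As a plaintext analogue of step (4), the following sketch (our own illustration; in the protocol the sums are computed homomorphically over ciphertexts) derives each $\langle s_j, |C_j| \rangle$ from the bitmap vectors:

```python
# Centroid update: derive each cluster's <s_j, |C_j|> from the bitmap.
def update_centers(records, bitmap, k):
    d = len(records[0])
    centers = []
    for j in range(k):
        members = [x for x, bits in zip(records, bitmap) if bits[j] == 1]
        s_j = [sum(x[l] for x in members) for l in range(d)]   # sum vector s_j
        centers.append((s_j, len(members)))                    # <s_j, |C_j|>
    return centers

print(update_centers([[1, 2], [3, 4]], [[1, 0], [1, 0]], k=2))
# -> [([4, 6], 2), ([0, 0], 0)]
```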

Termination. Steps (2)-(4) of the LSH-based pruning clustering process are executed repeatedly until the clustering results no longer change or the given maximum iteration number is reached.

Security Analysis
In this section, we give a security proof of our LSH-PPMOC protocol under the semi-honest model. Since the proofs for the preliminaries used in this paper are similar, we take only the SM protocol as an example and give a formal proof under the ``Real-versus-Ideal'' framework (Goldreich, 2009).
Theorem 2. The SM protocol securely computes the multiplication on ciphertexts under the Paillier cryptosystem in the presence of two semi-honest (passive) cloud servers.
Proof. Our SM protocol is performed by two semi-honest parties, Alice and Bob. We need to prove that SM is secure against both the semi-honest adversary $\mathcal{A}_{Alice}$ and the semi-honest adversary $\mathcal{A}_{Bob}$.
(1) Security against $\mathcal{A}_{Alice}$: In Step (1), Alice only handles ciphertexts encrypted under $pk_a$ and holds no decryption key. By the semantic security of the Paillier cryptosystem, we can build a simulator $S_{Alice}$ in the ideal world that outputs random ciphertexts, so that $\{\text{IDEAL}_{f_\times, S_{Alice}}\} \stackrel{c}{\approx} \{\text{REAL}_{\text{SM}, \mathcal{A}_{Alice}}\}$, where $f_\times$ denotes the multiplication function on $x$ and $y$, and $\stackrel{c}{\approx}$ means computationally indistinguishable.
(2) Security against $\mathcal{A}_{Bob}$: In Step (2), the real-world view of $\mathcal{A}_{Bob}$ comprises the inputs $E_{pk_a}(x + r_x)$, $E_{pk_a}(y + r_y)$ and the output $E_{pk_a}((x + r_x)(y + r_y))$. Although $\mathcal{A}_{Bob}$ can decrypt the ciphertexts with $sk_a$ and obtain the messages $x + r_x$ and $y + r_y$, it is still hard for $\mathcal{A}_{Bob}$ to learn anything useful about the original data records $x$ and $y$: since $r_x$ and $r_y$ are random values chosen by Alice, the blinded messages look random from $\mathcal{A}_{Bob}$'s point of view. We can therefore build a simulator $S_{Bob}$ in the ideal world that simply chooses random messages as the view, and it is computationally hard for $\mathcal{A}_{Bob}$ to distinguish the ideal world from the real world; that is, $\{\text{IDEAL}_{f_\times, S_{Bob}}\} \stackrel{c}{\approx} \{\text{REAL}_{\text{SM}, \mathcal{A}_{Bob}}\}$. Combining the above two analyses, we prove Theorem 2.
In the data uploading process, even though CS holds the data records in ciphertext form, it cannot learn any information about the original datasets due to the semantic security of the Paillier cryptosystem. As for the blinded hash tables, although they may disclose the relationship among a user's data records, this is a compromise made for efficiency. During the clustering process, AS, which holds the decryption key, assists CS in various computations, but the plaintexts obtained by AS are all randomized. Since the Paillier cryptosystem is semantically secure and the blinding factors are chosen at random, no additional information about users' data or clusters is revealed to the cloud servers. Besides, (Liu, Lu, Ma, Chen & Qin, 2015) gave security proofs for PMIN and PMIN$_k$. As the preliminaries used in our paper are thus proven privacy-preserving, by the Composition Theorem (Goldreich, 2009) we conclude that their sequential composition in the proposed LSH-PPMOC protocol is secure against semi-honest cloud servers.

Performance Analysis
In this section, we analyze the performance of the LSH-PPMOC protocol both theoretically and experimentally.

Theoretical Analysis
To better illustrate the computational and communication costs of our LSH-PPMOC protocol, we use the symbols Exp and Mul to denote the modular exponentiation and multiplication operations, respectively. We assume $g = N + 1$ in the Paillier cryptosystem, which is a common setting as in (Samanthula, Rao, Bertino, Yi & Liu, 2014); then
$$E_{pk}(m) = (N + 1)^m \cdot r^N \bmod N^2 = (m \cdot N + 1) \cdot r^N \bmod N^2. \tag{6}$$
From Equation (6), we can conclude that the computational cost of one encryption is 1 Exp + 2 Mul, and the cost of decryption is the same. The computational and communication costs of one iteration for our main preliminaries are given in Table 1, where $d$ is the dimension of a data record, $k$ is the number of predefined clusters and $m$ is the total number of all users' records. Besides, we use $|N|$ to denote the size (in bits) of the Paillier encryption key. It is worth noting that Stage 1 of our proposed algorithm runs only once, while Stages 2 and 3 run iteratively until the clustering finishes.

Experimental Analysis
In this part, we conduct our experiments on a local machine running Windows 10 with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz and 8.00 GB RAM. We compare our proposed LSH-PPMOC with PPOCM (Rong, Wang, Liu, Hao & Xian, 2017), because the two models are similar, both build on public key cryptosystems, and both target the same security goal for outsourced k-means clustering. We choose the key size $|N|$ as 1024 bits.
For the authenticity of our experiments, we choose a real dataset, the KEGG Metabolic Reaction Network dataset, which consists of 65,554 data records with 28 attributes. Since some data records have missing attribute values or are duplicates, we delete the corresponding records. And because some attribute values are decimals, we normalize all records so that each attribute value is scaled to an integer in [0, 1000].
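A sketch of this preprocessing (our own illustration; loading the KEGG data itself is omitted):

```python
# Drop incomplete/duplicate rows, then min-max scale every attribute
# to an integer in [0, 1000].
def preprocess(rows):
    rows = [r for r in rows if all(v is not None for v in r)]
    rows = [list(r) for r in dict.fromkeys(tuple(r) for r in rows)]  # dedupe
    d = len(rows[0])
    for l in range(d):
        lo = min(r[l] for r in rows)
        hi = max(r[l] for r in rows)
        span = (hi - lo) or 1
        for r in rows:
            r[l] = round(1000 * (r[l] - lo) / span)
    return rows

print(preprocess([[0.1, 5], [0.1, 5], [0.3, None], [0.2, 7]]))
# -> [[0, 0], [1000, 1000]]
```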

Figure 4. Original Dataset and Data Skeleton for KEGG
Before performing k-means clustering on the dataset, we illustrate the significant reduction from the number of data points in the original dataset to that in the data skeleton. The results are shown in Figure 4: after the users' preprocessing, the number of data points falls off drastically.
For secure k-means clustering on the KEGG dataset, a few factors affect the final results: the number of clusters ($k$), the size of the aggregated dataset ($m$), and the number of attributes ($d$). We first assess the performance of PPOCM and LSH-PPMOC on datasets of different sizes with dimension $d = 10$, for $k = 10$ and $20$. The results are given in Figure 5. We can see that the cloud running time grows almost linearly with $k$ for both algorithms, and that our LSH-PPMOC increasingly outperforms PPOCM as the number of clusters $k$ grows, owing to our pruning strategies.
We then compare two k-means algorithms, with and without the pruning strategies, while keeping everything else the same; notably, the algorithm with the pruning technique is exactly our LSH-PPMOC. More specifically, we vary the size of the aggregated dataset and the number of attributes $d$ respectively, with $k = 10$. The results are given in Figure 6. From the former diagram, we learn that the pruning strategies save about half the cloud running time compared with the algorithm without pruning, while from the latter we observe that our LSH-PPMOC saves nearly 70% of the distance calculation time. For example, when $m = 10000$, $d = 10$ and $k = 10$, the distance calculation takes 92 minutes with pruning versus 275 minutes without. The difference is attributable to the pruning strategies, whose goal is to reduce unnecessary distance calculations during clustering. We can also see that the pruning becomes more effective as the amount of data increases. Similarly, Figure 7 shows the effect of varying $k$ on the two methods. The cloud running time saved is clearly proportional to the number of clusters $k$; that is, the larger $k$ is, the more pronounced the pruning effect. Hence our LSH-PPMOC is suitable for k-means clustering on big datasets even when $k$ is large.

Conclusion
In this paper, we proposed an efficient and privacy-preserving outsourced k-means clustering algorithm (LSH-PPMOC) for big data mining. To this end, we gave a series of building blocks for ciphertext multiplication, squared Euclidean distance computation, comparison and so on, none of which leaks any useful information. The main idea of our algorithm is to exploit the locality-sensitive property of LSH to prune unnecessary computations during clustering. Experiments on the KEGG dataset show that our algorithm is more efficient than existing work. In the future, we plan to focus on data integrity verification during k-means clustering and on achieving stronger security under the malicious model.