Data Storage Control System Design

The paper presents a methodology for evaluating and improving the effectiveness of storage management during the development of automated control systems. A description of the storage management system in terms of queuing theory is proposed. A model of the system and criteria for the efficient processing of requests to read and write data are provided. The authors also propose the partitioning of the stored data and the use of several software solutions to improve system performance.


Introduction
Nowadays automated control systems (ACS) are becoming more and more widespread. ACSs are computer-aided systems aimed at converting information and performing calculations and logical operations based on computer networks and modern information technology (IT). ACSs have proliferated in the form of control loops within manufacturing, transport, construction and other economic processes. The purpose of automated control systems is to ensure sound management of a complex object (process) in accordance with performance targets. Owing to the limited capacity of data transmission channels and high server load, in some cases the shortcomings of the 'classic' client-server system significantly affect the performance of the ACS.
Another approach involves storing copies of the database at each site (documentation on Oracle database). This solves some of the problems and significantly reduces the load on the server, because each change is transmitted to the connected sites only once. However, in this case the problem of synchronizing the client and server copies of the database becomes much more difficult. Also, a client site either requests a full copy of the database when connecting to the server, or logging and 'partial update' mechanisms for site synchronization are necessary. In general, unlike client-server architecture storage, databases with this architecture are more suitable for frequently requested data of relatively small volumes.
To illustrate the above, it can be noted that cached systems greatly outperform client-server ones in read rate, especially as the number of reading applications grows (documentation on PostgreSQL database; documentation on SQLite database). The dependence of the reading speed on the number of processes for a fixed number of objects, measured for a remote PostgreSQL database and a local SQLite database, is represented in Figure 1.

Figure 1. The dependence of the reading speed on the number of processes

A number of software solutions combine the features of both mentioned architectures, implementing partial separation of the stored data and placing part of it in local databases. In such cases the factor that determines the performance of the system for fixed software solutions is the algorithm of data separation (the selection of the local part). In most such systems data separation is done by caching, that is, the dynamic placement of frequently requested data in the local database (documentation on memcached storage system). The caching technique operates only with the frequency of requests and does not take into account the characteristics of the local databases (if the local database differs from the main database). In some cases on-going data separation is based on the object domain and 'exemplary' properties of its parts.
Another quite common approach is to design a distributed database in which data is shared between the servers in a cluster, so that each server stores part of the data and the whole database allocation table; requests for pieces of data are transferred to the appropriate servers and the results are compiled afterwards (Carey, Livny, 1988). This approach has given a good account of itself, and there are many projects in this area, both open (documentation on the distributed system Apache Hadoop) and commercial (documentation on the distributed system the Google File System). The main advantages of the approach are the reduction of the load on individual servers and the relative ease of scaling the data up to volumes of petabytes. The main disadvantages are complexity and concurrency control. According to the well-known CAP theorem, we cannot simultaneously achieve all three guarantees of consistency, availability and partition tolerance (Gilbert, Lynch, 2012). The main difficulty of data storage control system development is the heterogeneity of the stored data, which makes designing a universal internal storage engine (the DSCS engine) a complex task. There is, for example, a significant difference in the access rates of the heterogeneous data. In ACSs the data types vary from large volumes of relatively rarely changing objects to smaller volumes of data requested and modified many times per second. In some cases special precautions, up to the storage of encrypted information, may be required (Dudakov, Pirogov, Shumilov, 2009).
Considering specific software solutions, it can be noted that in some cases the performance of a database that proved to be faster during simple operations dropped considerably with an increase in the intensity of the load. At the same time, software solutions specifically designed to be accessed by multiple applications are more resistant to concurrent load.
In general, in accordance with the internal architecture of each software solution, there is data with the 'most appropriate' set of properties, which is processed most effectively, and data with a 'less appropriate' set of properties, the processing of which can be difficult. At the same time, given the considerable diversity of the stored data and the stringent requirements and high load of modern ACSs, the characteristics of the processed data cannot be neglected for the benefit of a single software solution's performance.
Thus, the lack of a universal open-source database and the restrictions on the power and cost of server hardware mean that, when a large-scale ACS aimed at processing heterogeneous data is designed, it is often impossible to meet all the requirements with a single ready-made software solution.
The partitioning of data and the use of multiple interacting databases with modified software solutions could be a possible means of improving data storage efficiency. Accordingly, the stored information is considered as a set of data classes (tables in a relational data model), and the improvement of storage efficiency is achieved by means of a considered choice between ways to distribute the classes of data among the databases. The choice of several software solutions allows us to combine the advantages of each system, avoiding the shortcomings where possible, owing to the best-fit match of the data type and the respective database. The constancy of the partitioning ensures constant characteristics of the system as long as the load characteristics also remain constant, which is not achieved by caching.
During the DSCS design process a mathematical model of query processing by the storage (a system of several storages) was established. The model and the efficiency criteria make it possible to numerically assess the performance of one or more databases, which, in turn, enables an evaluation of the best way to distribute the data and of the suitability and feasibility of a software solution.
The DSCS working process can be described with sufficient adequacy by the mathematical tools of queuing theory (Kleinrock, 1979). Thus, according to the classical works (Khinchin, 1963) and various experimental data, queries to a database can be described as a Poisson events flow: the number of events k in the time interval t is distributed with the probability

P_k(t) = ((λt)^k / k!) · e^(−λt),

where λ is the intensity of the events flow. In this case, the superposition of Poisson flows is a Poisson flow with the total intensity λ = Σ_j λ_j.
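A minimal numerical sketch of these two properties (the intensities here are hypothetical): the Poisson probability above, and a check that the superposition of two independent flows is again Poisson with the summed intensity, obtained by convolving the component distributions.

```python
import math

def poisson_pmf(k, lam, t):
    """Probability of exactly k events in time t for a Poisson flow of intensity lam."""
    return (lam * t) ** k / math.factorial(k) * math.exp(-lam * t)

# Two independent query flows with hypothetical intensities 3 and 5 queries/s.
lam1, lam2, t = 3.0, 5.0, 1.0

# Superposition: the merged flow should be Poisson with intensity lam1 + lam2.
# Verify via the convolution of the two component distributions at k = 4.
k = 4
direct = poisson_pmf(k, lam1 + lam2, t)
convolved = sum(poisson_pmf(j, lam1, t) * poisson_pmf(k - j, lam2, t)
                for j in range(k + 1))
print(abs(direct - convolved) < 1e-12)  # the intensities simply add
```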
Further, the service time of a 'typical' query can be represented by a discrete random variable B: the probability that the service time equals b_j (the service time of a j-type query) is

P(B = b_j) = λ_j / λ,

where λ_j is the intensity of the flow of j-type queries and λ is the total intensity.
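Under this mixture representation the mean and variance of the service time follow directly from the query mix; a short sketch with hypothetical per-type intensities and service times:

```python
# Hypothetical query mix: intensities (queries/s) and service times (s) per query type.
intensities = [10.0, 4.0, 1.0]         # lambda_j
service_times = [0.002, 0.010, 0.050]  # b_j

lam = sum(intensities)                    # total flow intensity
probs = [lj / lam for lj in intensities]  # P(B = b_j) = lambda_j / lambda

b_mean = sum(p * bj for p, bj in zip(probs, service_times))                 # E[B]
b_var = sum(p * bj * bj for p, bj in zip(probs, service_times)) - b_mean**2  # Var[B]
print(b_mean, b_var)
```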
Considering this model, it is possible to use the average waiting time of a query as the basis of the efficiency criterion. For one database, the efficiency criterion for processing a group of data classes can be taken as the mean waiting time of an M/G/1 queue (the Pollaczek-Khinchine formula):

w = λ(b² + σ_B²) / (2(1 − λb)),

where b is the expectation of the query service time for this database in accordance with its load, and σ_B² is the variance of the query service time.
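A minimal sketch of this criterion, with the stability guard λb < 1 made explicit (the load figures are hypothetical, taken to match the mix above):

```python
def mean_waiting_time(lam, b_mean, b_var):
    """Pollaczek-Khinchine mean waiting time for an M/G/1 queue.

    lam    - total query intensity (Poisson arrivals),
    b_mean - expected service time E[B],
    b_var  - service-time variance Var[B].
    Requires lam * b_mean < 1 (no queue accumulation).
    """
    rho = lam * b_mean
    if rho >= 1.0:
        raise ValueError("unstable queue: lam * E[B] must be < 1")
    return lam * (b_mean ** 2 + b_var) / (2.0 * (1.0 - rho))

# Hypothetical database load: 15 queries/s, E[B] = 7.33 ms, Var[B] = 1.42e-4 s^2.
print(mean_waiting_time(15.0, 0.00733, 0.000142))
```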
In this paper we do not consider the interaction of DBMSs and data consistency.The relationship between classes could be represented as a sequence of queries.
Using the average query service time directly for a system of multiple databases is obviously wrong. As the efficiency criterion of the total system of multiple databases (i.e., the criterion of DSCS efficiency) the following value is proposed:

J = Σ_i (λ_i / λ) · w_i,

where w_i is the efficiency criterion of the i-type database and b_ij is the service time of a j-type query in the i-type database (let us denote by x_ij ∈ {0,1} the assignment of the j-th data class to the i-th database). The value of the criterion of DSCS efficiency for a fixed set of data classes and the software solutions used is, in fact, an evaluation of the efficiency of the separation of the data into the different databases.
Next, to determine the optimal way of dividing the stored data in terms of the proposed criterion, it is necessary to solve the problem of optimization over the variables x_ij, where the criterion J is the objective function and the following are the restrictions:

λ_i b_i < 1, i = 1, …, M,

where λ_i is the total intensity of the query flow for the i-type database, b_i is the expectation of the query service time for the i-type database, and M is the number of databases.
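The criterion and its restrictions can be sketched for a fixed 0-1 assignment; the intensities and service times below are hypothetical, and the per-database waiting time is the Pollaczek-Khinchine value (with E[B²] = b² + σ_B²):

```python
# Evaluating the DSCS criterion J for a fixed 0-1 assignment x[i][j]
# (class j stored in database i); all numbers are hypothetical.
lam_j = [10.0, 4.0, 1.0]             # per-class query intensities
b = [[0.002, 0.010, 0.050],          # b[i][j]: service time of class j in database i
     [0.004, 0.006, 0.020]]
x = [[1, 0, 0],                      # class 0 -> database 0
     [0, 1, 1]]                      # classes 1, 2 -> database 1

def criterion_J(x, lam_j, b):
    lam_total = sum(lam_j)
    J = 0.0
    for i, row in enumerate(x):
        lam_i = sum(xx * l for xx, l in zip(row, lam_j))
        if lam_i == 0.0:
            continue  # empty database contributes nothing
        probs = [xx * l / lam_i for xx, l in zip(row, lam_j)]
        b_mean = sum(p * b[i][j] for j, p in enumerate(probs))       # E[B] in DB i
        b_sq = sum(p * b[i][j] ** 2 for j, p in enumerate(probs))    # E[B^2] in DB i
        assert lam_i * b_mean < 1.0, "constraint: no queue accumulation in DB %d" % i
        w_i = lam_i * b_sq / (2.0 * (1.0 - lam_i * b_mean))  # P-K mean waiting time
        J += lam_i / lam_total * w_i
    return J

print(criterion_J(x, lam_j, b))
```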
The stated constraints ensure the absence of queue accumulation for each database. Any linear constraints can be added further to define, for example, a strict affiliation of particular classes to a database.
The problem of optimizing the criterion J belongs to the class of nonlinear pseudo-Boolean problems (Boros, Hammer, 2001). For a practical problem of 100 data classes and 2 databases there are approximately 10^30 possible partitions. Pseudo-Boolean optimization problems are often encountered in practice, and there are many approaches to solving them. However, in view of the large dimensions and high computational complexity, there is no universal algorithm that allows an exact solution in a reasonable time.

Optimisation of the Data Partitioning
To solve the problem we must find a proper partitioning X_opt = arg min J(X), where X is a data partition ranging over the set of all possible partitions.
The criterion J is a nonlinear function of the variables x_ij, which complicates its subsequent linearization. For a formal description of the pseudo-Boolean programming problem, we introduce the following notation. According to (Boros, Hammer, 2001), every pseudo-Boolean function can be uniquely represented as a multi-linear polynomial over the subsets S of variable indices (each subset identified by its characteristic vector):

f(x_1, …, x_n) = Σ_{S ⊆ {1,…,n}} c_S · Π_{j∈S} x_j.

The size of the largest subset S with c_S ≠ 0 is called the degree of f and is denoted by deg(f). We will call a pseudo-Boolean function linear if deg(f) ≤ 1, and linear-fractional if it can be represented as a fraction of two linear functions:

f(x) = (c_0 + Σ_h c_h x_h) / (d_0 + Σ_h d_h x_h).

In general, it is difficult to transform a function into multi-linear form due to the large number of additional variables and constraints.
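The multi-linear representation can be sketched directly as a map from index subsets to coefficients; the example function below is hypothetical:

```python
# A pseudo-Boolean function as a multi-linear polynomial: subset S -> coefficient c_S.
# Hypothetical example: f(x1, x2, x3) = 2 - 3*x1*x2 + x2*x3
f = {frozenset(): 2.0, frozenset({1, 2}): -3.0, frozenset({2, 3}): 1.0}

def evaluate(poly, x):
    """x maps variable index -> 0/1; each monomial is the product over its subset S."""
    total = 0.0
    for S, c in poly.items():
        term = 1
        for j in S:
            term *= x[j]
        total += c * term
    return total

deg_f = max(len(S) for S in f)            # deg(f) = size of the largest S with c_S != 0
print(deg_f)                              # 2: the function is quadratic
print(evaluate(f, {1: 1, 2: 1, 3: 0}))    # 2 - 3 = -1.0
```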
Integer programming problems and, in particular, problems of pseudo-Boolean optimization are well known. There are many approaches to solving them, but the lack of versatility and productivity of the algorithms makes it necessary to find a proper algorithm for each particular problem.
In general, the algorithms can be classified in terms of the accuracy of the solution (exact and approximate) and of belonging to a certain class of functions (linear, linear-fractional, quadratic problems, and others).
In some cases it is appropriate to use the branch and bound algorithm, the effectiveness of which is determined by the accuracy of the estimates of the upper and lower bounds of the values of the function being optimized (Martello, Toth, 1990).
Given the nature of the criterion proposed in the paper, as well as reasonable restrictions on the number of classes of data, we propose to perform a partial linearization of the criterion J, resulting in a linear-fractional function, with the introduction of additional Boolean variables and constraints. The obvious disadvantage of this approach is the increase in the number of variables according to the dimension of the criterion J.

Partial linearisation
Linearisation is a standard technique to reduce nonlinear binary optimisation to linear integer programming. The basic idea is to replace a nonlinear term with a new variable subject to additional constraints that force it, in all feasible solutions, to take the value of the original term. For instance, we can replace the product of two binary variables x, y ∈ {0,1} by a new binary variable u with the constraints

u ≤ x, u ≤ y, u ≥ x + y − 1, u ∈ {0,1}.

With these constraints, the values of the variable u are identical to the corresponding values of the product xy (Table 1).

Table 1. Replacement of the product of binary variables

Similarly, in the case of replacing a product of L variables x_1, …, x_L, we add L + 2 constraints of the following form:

u ≤ x_l, l = 1, …, L; u ≥ x_1 + … + x_L − (L − 1); u ∈ {0,1}.

During the optimization, since x_ij ∈ {0,1}, it is obvious that x_ij² = x_ij. Let us denote by u = {u_1, …, u_H} the vector of u-variables that reduces the criterion J (4) to the linear-fractional form (10), where H is the dimension of the vector.
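That the constraints force u to equal the product can be checked exhaustively; a minimal sketch:

```python
from itertools import product

# For u to equal the product x1*...*xL, the standard linearisation uses
#   u <= x_l                       (L constraints)
#   u >= x1 + ... + xL - (L - 1)
#   u in {0, 1}
def feasible_u(xs):
    """All u in {0,1} satisfying the linear constraints for the given binary xs."""
    L = len(xs)
    return [u for u in (0, 1)
            if all(u <= xl for xl in xs) and u >= sum(xs) - (L - 1)]

# Check: for every binary assignment, the only feasible u is the product itself.
for L in (2, 3, 4):
    for xs in product((0, 1), repeat=L):
        prod = 1
        for xl in xs:
            prod *= xl
        assert feasible_u(xs) == [prod]
print("linearisation constraints reproduce the product exactly")
```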
Clearly, u ∈ {0,1}^H, i.e. the problem is still pseudo-Boolean. With this notation it can be seen that each w_i is a fraction of two quadratic functions of the x-variables. Furthermore, by observing that the 'real' service time of a j-type query (the service time plus the waiting time) in the i-type database contains a waiting component that does not depend on j, the criterion can be rearranged so that each w_i is expressed through products of the variables x_ij. For a pseudo-Boolean function L(x_1, …, x_MN) with deg(L) = l, the number of additional variables is bounded by the number of distinct monomials of degree from 2 to l, and the number of additional constraints is bounded correspondingly. By the replacement (12) we can thus reduce the problem of optimizing the criterion J(X) to a pseudo-Boolean linear-fractional optimization problem.
It is worth considering the case M = 2, in which x_2j = 1 − x_1j; this greatly simplifies the final expression of J(X) and, in fact, halves the number of variables x_11, …, x_MN.

Approaches to Solve the Linear-Fractional Problem
After the partial linearisation, the criterion J can be represented in linear-fractional form. Let us denote

J_L(u) = F_1L(u) / F_2L(u), where F_1L(u) = c_0 + c_1 u_1 + … + c_H u_H and F_2L(u) = d_0 + d_1 u_1 + … + d_H u_H.

In fact, J_L = J_L(x_11, …, x_MN, u_MN+1, …, u_H−MN), i.e., we replace only the products; but, to simplify the notation, we rename all x-variables as u-variables as well. It should be noted that only those u for which F_2L(u) > 0 can be solutions of the optimisation problem.
That follows from the initial constraints of the form λ_i b_i < 1. Linear-fractional pseudo-Boolean problems often occur, for example, in clustering, when we have to decide whether an element belongs to some cluster or not (Prokopyev, 2006).
Furthermore, if the denominator F_2L can take both negative and positive values, the optimisation problem is NP-hard. It can be shown that this optimisation cannot be easier than finding a solution to the subset sum problem, a well-known NP-complete decision problem.

Complete Linearisation with Additional Non-Integer Variables
In the general case it is possible to reduce the linear-fractional problem to linear form. We replace the denominator by a new variable

z = 1 / F_2L(u),

and, denoting v_h = u_h z, we obtain a new linear problem with the variables v_1, …, v_H, z and the objective c_0 z + c_1 v_1 + … + c_H v_H, with the additional constraint d_0 z + d_1 v_1 + … + d_H v_H = 1. Due to the non-integer variable z, the problem transforms into a non-integer linear optimisation problem. After solving this problem we have to make further iterations (for example, by the cutting-plane method) to find an integer 0-1 solution u_1, …, u_H.
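The substitution can be verified numerically: on every 0-1 point the linearised objective coincides with the original fraction and the added constraint holds by construction. A sketch with hypothetical coefficients (the denominator is kept positive everywhere):

```python
from itertools import product

# Substitution z = 1/(d0 + sum d_h u_h), v_h = u_h * z: the fractional objective
# (c0 + sum c_h u_h)/(d0 + sum d_h u_h) becomes the linear expression
# c0*z + sum c_h v_h. Coefficients below are hypothetical.
c0, c = 1.0, [2.0, -1.0, 3.0]
d0, d = 4.0, [1.0, 2.0, 1.0]   # denominator positive for all u

for u in product((0, 1), repeat=3):
    denom = d0 + sum(dh * uh for dh, uh in zip(d, u))
    fractional = (c0 + sum(ch * uh for ch, uh in zip(c, u))) / denom
    z = 1.0 / denom
    v = [uh * z for uh in u]
    linear = c0 * z + sum(ch * vh for ch, vh in zip(c, v))
    # the added constraint d0*z + sum d_h v_h = 1 holds by construction
    assert abs(d0 * z + sum(dh * vh for dh, vh in zip(d, v)) - 1.0) < 1e-12
    assert abs(fractional - linear) < 1e-12
print("fractional and linearised objectives agree on all 0-1 points")
```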

Heuristic Approaches Based on the GRASP Algorithm
As one of the approaches to solving the optimisation problem we can consider heuristic algorithms, which are capable of providing a proper solution in a reasonable amount of time and with sufficient accuracy (Brady, Catanzaro, 2008). As an example, let us consider GRASP (Greedy Randomised Adaptive Search Procedure), an approach that is widely used for the construction of heuristic algorithms for optimisation problems (Mauricio, Resende, 1998). A GRASP typically consists of iterations made up of the successive construction of a greedy randomised solution and its subsequent iterative improvement through a local search.
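A minimal GRASP sketch for the partitioning task, under assumptions: the objective below is a simplified stand-in for the criterion J (a sum of per-database Pollaczek-Khinchine waiting times), and all intensities and service times are hypothetical.

```python
import random

# Hypothetical load: 5 data classes, 2 databases.
lam = [10.0, 4.0, 1.0, 6.0, 2.0]
b = [[0.002, 0.010, 0.050, 0.004, 0.020],   # b[i][j]: service time of class j in DB i
     [0.004, 0.006, 0.020, 0.008, 0.005]]

def cost(assign):
    """Stand-in objective: sum over databases of lam_i * E[B^2] / (2*(1 - lam_i*E[B]))."""
    total = 0.0
    for i in range(len(b)):
        classes = [j for j, a in enumerate(assign) if a == i]
        lam_i = sum(lam[j] for j in classes)
        if lam_i == 0:
            continue
        b_mean = sum(lam[j] * b[i][j] for j in classes) / lam_i
        b_sq = sum(lam[j] * b[i][j] ** 2 for j in classes) / lam_i
        if lam_i * b_mean >= 1.0:
            return float("inf")              # infeasible: queue accumulates
        total += lam_i * b_sq / (2 * (1 - lam_i * b_mean))
    return total

def grasp(iters=50, rcl_size=2, seed=1):
    random.seed(seed)
    best, best_cost = None, float("inf")
    for _ in range(iters):
        # Greedy randomised construction: place classes one by one, choosing at
        # random among the rcl_size cheapest databases for each placement.
        assign = []
        for _j in range(len(lam)):
            ranked = sorted(range(len(b)), key=lambda i: cost(assign + [i]))
            assign.append(random.choice(ranked[:rcl_size]))
        # Local search: single-class reassignments until no improvement.
        improved = True
        while improved:
            improved = False
            for j in range(len(lam)):
                for i in range(len(b)):
                    trial = assign[:]
                    trial[j] = i
                    if cost(trial) < cost(assign):
                        assign, improved = trial, True
        if cost(assign) < best_cost:
            best, best_cost = assign, cost(assign)
    return best, best_cost

print(grasp())
```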
This paper proposes the use of this approach only if the dimension of the problem does not allow an exact solution in a reasonable time. We take into account that the problem of data partitioning is a planning task, i.e., a real-time solution is not required and the solution accuracy is more important than time.

Reducing the Optimisation Problem to the SAT-Problem
One of the approaches to the optimisation problem is to reduce it to the fairly well-known Boolean Satisfiability Problem (SAT) (Een, Sorensson, 2006). The SAT-problem is NP-complete; there are many algorithms which solve it with varying efficiency.
The essence of the transformations considered for the optimisation problem is the following (Crama, Hansen, Jaumard, 1990): 1. Preparation of a solution satisfying the constraints (the construction of a feasible solution); 2. Adding a new constraint to cut off the solution found.
There are many programs solving SAT-problems (SAT-solvers), most of which deal with linear constraints.
The general scheme for solving the problem is as follows. The partially linearised problem is represented by the constraints L_1 and L_2 together with the linearisation constraints of the form (12). Having found a feasible solution u*, we add a constraint of the form

J_L(u) < J_L(u*), i.e. F_1L(u) − J_L(u*) · F_2L(u) < 0,

and when we solve the extended SAT-problem, we obtain another solution u** with J_L(u**) < J_L(u*). Setting u* = u**, we obtain an iteration cycle in which each iteration improves the value of the objective function. For the global minimum we should continue as long as the extended SAT-problem is solvable. The last solution obtained is an optimum; hence the corresponding data partitioning (x_11, …, x_MN) will be the optimal required partitioning.
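The iteration cycle can be sketched with a brute-force enumeration standing in for the SAT-solver's role of finding any feasible point under the current cut; the coefficients are hypothetical, with a positive denominator everywhere:

```python
from itertools import product

# Iterative scheme: repeatedly ask for ANY u with J_L(u) < J_L(u*),
# stopping when the extended problem becomes 'unsatisfiable'.
c0, c = 5.0, [2.0, -1.0, 3.0, -2.0]
d0, d = 4.0, [1.0, 2.0, 1.0, 1.0]   # hypothetical; denominator positive for all u

def J_L(u):
    return (c0 + sum(ch * uh for ch, uh in zip(c, u))) / \
           (d0 + sum(dh * uh for dh, uh in zip(d, u)))

def find_better(bound):
    """Oracle: return any 0-1 vector u with J_L(u) < bound (the SAT-solver's role)."""
    for u in product((0, 1), repeat=len(c)):
        if J_L(u) < bound:
            return u
    return None                      # 'unsatisfiable': the bound is optimal

u_star = (0, 0, 0, 0)                # initial feasible solution
while True:
    u_next = find_better(J_L(u_star))   # add the cut J_L(u) < J_L(u*) and re-solve
    if u_next is None:
        break
    u_star = u_next                  # each iteration strictly improves J_L

print(u_star, J_L(u_star))
```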
It is necessary to take into account that the additional constraints are linear and can be represented in the same form; moreover, at each iteration we solve the same SAT-problem with only one changing constraint.
At the core of SAT-solvers there is a backtracking-based search algorithm, DPLL (Davis-Putnam-Logemann-Loveland). Despite the high theoretical complexity of the algorithm, program SAT-solvers are sufficiently effective in practice (Sheini, Sakallah, 2006). According to the results of yearly competitions, modern SAT-solvers allow the solution of problems within a reasonable time even when they consist of about 10^6 variables and 3·10^6 constraints (Jarvisalo, Le Berre, Roussel, 2013).
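A toy sketch of the DPLL idea (unit propagation plus branching) on clauses in the usual signed-integer encoding; real solvers add learning, watched literals and restarts on top of this skeleton:

```python
def dpll(clauses, assignment=None):
    """Tiny DPLL SAT-solver sketch. Clauses are lists of non-zero ints:
    literal v means variable v is true, -v means variable v is false."""
    if assignment is None:
        assignment = {}
    # Unit propagation: repeatedly assign literals forced by single-literal clauses.
    changed = True
    while changed:
        changed = False
        simplified = []
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                      # clause already satisfied
            lits = [l for l in clause if assignment.get(abs(l)) is None]
            if not lits:
                return None                   # clause falsified: conflict
            if len(lits) == 1:
                assignment[abs(lits[0])] = lits[0] > 0
                changed = True
            simplified.append(clause)
        clauses = simplified
    if not clauses:
        return assignment                     # all clauses satisfied
    # Branch on the first unassigned variable.
    var = next(abs(l) for clause in clauses for l in clause
               if assignment.get(abs(l)) is None)
    for value in (True, False):
        branched = dict(assignment)
        branched[var] = value
        result = dpll(clauses, branched)
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
model = dpll([[1, 2], [-1, 3], [-2, -3]])
print(model is not None)  # satisfiable
```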

Simplified Optimisation with Strict Constraints
Considering the case in which F_2L(u) > 0 for all u, the solution can be found as follows. Without loss of generality let us assume d_h > 0, h = 0, …, H, and c_h > 0, h = 1, …, H, applying the following substitutions:
• if simultaneously c_h0 < 0 and d_h0 < 0, let us substitute u_h0 := 1 − u_h0;
• if c_h0 < 0 and d_h0 > 0, we can fix u_h0 = 1;
• if c_h0 > 0 and d_h0 < 0, similarly, we fix u_h0 = 0.
Furthermore, let u* denote the desired vector (the optimum of J); then, if we assume J(u*) ≤ t, we will have F_1L(u*) − t·F_2L(u*) ≤ 0, so the coordinates with c_h − t·d_h < 0 should be set to 1. If we reorder the coefficients so that c_1/d_1 ≤ c_2/d_2 ≤ … ≤ c_H/d_H, we will have u* as one of the H vectors of the form (1, …, 1, 0, …, 0). If all the vectors of this form satisfy the constraints L_1 and L_2 (32), then, obviously, the solution can be found faster than with iterative methods: the computation is dominated by ordering the coefficients and evaluating the candidate vectors. Given the limits of the practical problem (2 databases, fewer than 100 data classes), we propose:
• to attempt to solve the simplified optimisation problem first;
• in the case of a non-constant sign of F_2L(u), to reduce the pseudo-Boolean problem to a SAT-problem and to solve it with one of the open-source SAT-solvers.
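The prefix-vector scheme can be sketched and checked against brute force; the coefficients below are hypothetical, already normalised to be positive, with a positive denominator for all u:

```python
from itertools import product

# With all c_h, d_h > 0 (after the sign normalisations above) and F_2L(u) > 0,
# the optimum of (c0 + sum c_h u_h)/(d0 + sum d_h u_h) lies among the 'prefix'
# vectors (1,...,1,0,...,0) once coefficients are ordered by the ratio c_h/d_h.
c0, c = 6.0, [1.0, 4.0, 2.0, 9.0]
d0, d = 2.0, [2.0, 3.0, 1.0, 2.0]

def value(u, cs, ds):
    return (c0 + sum(ch * uh for ch, uh in zip(cs, u))) / \
           (d0 + sum(dh * uh for dh, uh in zip(ds, u)))

order = sorted(range(len(c)), key=lambda h: c[h] / d[h])   # ascending ratios
cs = [c[h] for h in order]
ds = [d[h] for h in order]

# Evaluate the candidate prefix vectors only.
prefixes = [tuple(1 if j < k else 0 for j in range(len(c)))
            for k in range(len(c) + 1)]
best_prefix = min(value(u, cs, ds) for u in prefixes)

# Brute force over all 2^H points confirms prefix-optimality here.
best_all = min(value(u, c, d) for u in product((0, 1), repeat=len(c)))
print(best_prefix, abs(best_prefix - best_all) < 1e-12)
```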
As an example, let us consider a partition of 10 classes into 2 databases; the input data are the intensities and the times of query processing (Table 2). The choice of software solutions as the basis for the ACS design is determined, first of all, by the characteristics of the subject area and the stored data. To solve practical problems it is most useful to combine software solutions with different architectures for the 'widest' coverage of the processed data characteristics. However, as a kind of limiting case, the described technique can also be applied to identical databases. In such cases the configuration is a highly scalable system with a permanent partitioning of the data classes. Also, the criterion for evaluating the effectiveness of a single database can be used when choosing the appropriate software solutions regardless of data partitioning.

Conclusion
Thus the article describes a technique for evaluating and improving the efficiency of data storage control systems. The technique is based on a mathematical model. Despite limitations in equipment capabilities and communication channels, it enables all required functions and effectively solves the task of storing different types of data during ACS design.

Table 2. The input data

Table 3. The initial approximation

The value of the criterion is J = 12.8 s; the average waiting time for the execution of queries is 0.69 s (i = 1) and 0.66 s (i = 2). The optimal partition X_opt is the following (Table 4):

Table 4. The optimal partition

The value of the criterion is J = 6.07 s; the average waiting time for the execution of queries is 0.01 s (i = 1) and 0.22 s (i = 2).