Database security is a critical aspect of information security. Access to enterprise databases grants attackers great control over critical data. For example, SQL injection attacks insert malicious code into the statements the application passes to the database layer. This enables attackers to do almost anything with the data, including accessing unauthorized data and altering, deleting, and inserting data. Although SQL injection exploitation has declined steadily over the years owing to secure frameworks and improved awareness, it remains a high-impact means to exploit system vulnerabilities. For example, Web applications receive four or more Web attack campaigns per month, and SQL injections are the most popular attacks on retailers.1 Furthermore, SQL injection vulnerabilities affect 32 percent of all Web applications.2
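As a minimal sketch of the injection mechanism described above (using Python's built-in sqlite3 module; the table, data, and payload are illustrative assumptions, not from the cited study), compare a query built by string concatenation with a parameterized one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "' OR '1'='1"  # classic tautology payload

# Vulnerable: user input is concatenated directly into the SQL statement,
# so the payload rewrites the WHERE clause into an always-true condition.
vulnerable = "SELECT * FROM users WHERE name = '%s'" % user_input
rows_leaked = conn.execute(vulnerable).fetchall()

# Safe: a parameterized query; the driver treats the input as data, not code.
safe = "SELECT * FROM users WHERE name = ?"
rows_safe = conn.execute(safe, (user_input,)).fetchall()

print(len(rows_leaked), len(rows_safe))  # the concatenated query leaks a row
```

The concatenated statement returns every row, while the parameterized one matches nothing, which is why parameterized queries are the standard mitigation.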
NoSQL (not only SQL) is a trending term in modern data stores; it refers to nonrelational databases that rely on different storage mechanisms such as document store, key-value store, and graph. The wide adoption of these databases has been facilitated by the new requirements of modern large-scale applications, such as Facebook, Amazon, and Twitter, which need to distribute data across a huge number of servers. Traditional relational databases don't meet these scalability requirements; they require a single database node to execute all operations of the same transaction.1 As a result, a growing number of distributed NoSQL key-value stores satisfy the scalability requirements of modern large-scale applications. These data stores include NoSQL databases such as MongoDB and Cassandra as well as in-memory stores and caches such as Redis and Memcached. Indeed, the popularity of NoSQL databases has grown consistently over the past several years, and MongoDB is ranked fourth among the 10 most popular databases, as Figure 1 illustrates. In this article, we provide an analysis of NoSQL threats and techniques as well as their mitigation mechanisms.
Mobile networks have rapidly evolved in recent years due to the increase in multimedia traffic and offered services. This has led to a growth in the volume of control data and measurements that are used by self-healing systems. To maintain a certain quality of service, self-healing systems must complete their tasks in a reasonable time. The conjunction of a big volume of data and the limitation of time requires a big data approach to the problem of self-healing. This article reviews the data that self-healing uses as input and justifies its classification as big data. Big data techniques applied to mobile networks are examined, and some use cases along with their big data solutions are surveyed.
Data security on web vulnerability with NoSQL injection.
Websites can be dynamic, static, or sometimes a combination of both. To assure security, all websites need to protect their databases. The database of a standard website can be easily attacked using SQL injection: the user may inject SQL queries as input to the website, which can attack its database in many ways. This paper proposes a static analysis of various techniques for detecting recently discovered web-based vulnerabilities, such as cross-site scripting and HTTP splitting attacks, along with NoSQL injection. Today, the situation is better, and traditional databases have introduced built-in protection mechanisms. NoSQL databases use different query languages, which makes traditional SQL injection techniques irrelevant. But does this mean that NoSQL systems are immune to injections? Our study shows that although the security of the query language has largely improved, there are still techniques for injecting malicious queries. Some works already provide reports of NoSQL injection techniques. We also show methods of accessing the dark web, using the TOR browser to explain dark-web access.
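One well-known NoSQL injection technique is operator injection in MongoDB-style queries. The sketch below emulates the matching logic in pure Python so it runs without a database (in a real application the filter would be passed to `collection.find()`); the user data and the tiny operator evaluator are illustrative assumptions:

```python
# Emulate a tiny subset of MongoDB query matching: a condition is either a
# plain value (equality) or an operator document such as {"$ne": ""}.
def matches(doc_value, condition):
    if isinstance(condition, dict):
        if "$ne" in condition:
            return doc_value != condition["$ne"]
        raise ValueError("unsupported operator")
    return doc_value == condition

users = [{"user": "admin", "pw": "s3cret"}, {"user": "bob", "pw": "hunter2"}]

def find(flt):
    return [d for d in users if all(matches(d[k], v) for k, v in flt.items())]

# Expected input: plain strings. Injected input: an operator document sent
# where a string was expected (e.g. pw[$ne]= in a query string), which turns
# the password check into an always-true condition.
honest_filter = {"user": "admin", "pw": "wrong-guess"}
injected_filter = {"user": "admin", "pw": {"$ne": ""}}

print(len(find(honest_filter)), len(find(injected_filter)))  # 0 vs 1
```

The honest login attempt matches nothing, but the injected `$ne` filter authenticates as admin without the password, which is why user input must be validated to be a scalar before it reaches the query.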
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce.
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree rather than conventional FP trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and these trees are then mined separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
Index Terms—Frequent itemsets, frequent items ultrametric tree (FIU-tree), Hadoop cluster, load balance, MapReduce.
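FiDoop's FIU-tree machinery is more involved, but the underlying MapReduce pattern for counting candidate itemsets can be sketched in a few lines. The sequential loop below stands in for the distributed shuffle; the transactions and support threshold are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

# Map phase: each mapper emits (itemset, 1) for every 2-itemset
# contained in one transaction.
def mapper(transaction):
    for pair in combinations(sorted(set(transaction)), 2):
        yield pair, 1

# Shuffle + reduce phase: sum the counts per itemset and keep only the
# itemsets whose support reaches min_support.
def reduce_counts(transactions, min_support):
    counts = defaultdict(int)
    for t in transactions:            # in Hadoop this loop runs across mappers
        for key, one in mapper(t):
            counts[key] += one
    return {k: v for k, v in counts.items() if v >= min_support}

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent = reduce_counts(transactions, min_support=3)
print(frequent)  # each of the three pairs occurs in exactly 3 transactions
```

Because each mapper only needs its own transactions and reducers only sum integers, the framework provides the parallelization, data distribution, and fault tolerance that the abstract says hand-written parallel miners lack.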
Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields.
Processing large volumes of data has presented a challenging issue, particularly in data-redundant systems. As one of the most recognized models, the conditional random fields (CRF) model has been widely applied in biomedical named entity recognition (Bio-NER). Due to its internally sequential nature, performance improvement of the CRF model is nontrivial and requires new parallelized solutions. By combining and parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, we propose a parallel CRF algorithm called MRCRF (MapReduce CRF) in this paper, which contains two parallel sub-algorithms to handle the two time-consuming steps of the CRF model. The MRLB (MapReduce L-BFGS) algorithm leverages the MapReduce framework to enhance the capability of estimating parameters. Furthermore, the MRVtb (MapReduce Viterbi) algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. Experimental results show that the MRCRF algorithm outperforms other competing methods, exhibiting significant improvement in time efficiency while preserving a guaranteed level of correctness.
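The step that MRVtb parallelizes is ordinary Viterbi decoding. A minimal sequential sketch over a toy HMM helps fix ideas; the B/I/O tag set mirrors Bio-NER conventions, but all probabilities and the vocabulary are illustrative assumptions:

```python
import math

states = ["B", "I", "O"]          # begin/inside/outside-entity tags
start = {"B": 0.4, "I": 0.1, "O": 0.5}
trans = {"B": {"B": 0.1, "I": 0.6, "O": 0.3},
         "I": {"B": 0.1, "I": 0.5, "O": 0.4},
         "O": {"B": 0.4, "I": 0.1, "O": 0.5}}
emit = {"B": {"protein": 0.7, "the": 0.1, "binds": 0.2},
        "I": {"protein": 0.5, "the": 0.1, "binds": 0.4},
        "O": {"protein": 0.1, "the": 0.6, "binds": 0.3}}

def viterbi(obs):
    # V[t][s] = best log-probability of any path ending in state s at time t
    V = [{s: math.log(start[s] * emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][best] + math.log(trans[best][s] * emit[s][o])
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "protein", "binds"]))  # → ['O', 'B', 'I']
```

The recurrence at each time step depends on the previous step, which is exactly the "internally sequential" dependency that makes parallelizing this decoding step nontrivial.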
Real-Time Big Data Analytical Architecture for Remote Sensing Application.
The assets of the remote sensing digital world generate a massive volume of real-time data daily (commonly referred to by the term "Big Data"), whose insight information has potential significance if collected and aggregated effectively. In today's era, there is a great deal more to real-time remote sensing Big Data than it seems at first, and extracting the useful information in an efficient manner leads a system toward major computational challenges, such as analyzing, aggregating, and storing data that are collected remotely. Keeping in view the above-mentioned factors, there is a need to design a system architecture that supports both real-time and offline data processing. Therefore, in this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications. The proposed architecture comprises three main units: 1) a remote sensing Big Data acquisition unit (RSDU); 2) a data processing unit (DPU); and 3) a data analysis decision unit (DADU). First, RSDU acquires data from the satellite and sends it to the Base Station, where initial processing takes place. Second, DPU plays a vital role in the architecture for efficient processing of real-time Big Data by providing filtration, load balancing, and parallel processing. Third, DADU is the upper-layer unit of the proposed architecture, responsible for compilation, storage of the results, and generation of decisions based on the results received from DPU. The proposed architecture is capable of dividing, load balancing, and parallel processing of only useful data. Thus, it efficiently analyzes real-time remote sensing Big Data using an earth observatory system.
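The DPU's filtration and load-balancing stages can be sketched as two small functions. The record format, the "usefulness" threshold, and the round-robin dispatch policy below are illustrative assumptions, not the paper's actual design:

```python
# Filtration: keep only records worth analyzing, so downstream stages
# process "only useful data" as the architecture prescribes.
def filtration(records, threshold):
    return [r for r in records if r["signal"] >= threshold]

# Load balancing: round-robin dispatch of the filtered records to
# parallel workers so each receives a near-equal share.
def load_balance(records, n_workers):
    queues = [[] for _ in range(n_workers)]
    for i, r in enumerate(records):
        queues[i % n_workers].append(r)
    return queues

raw = [{"id": i, "signal": i % 5} for i in range(10)]
useful = filtration(raw, threshold=3)          # only signal >= 3 survives
queues = load_balance(useful, n_workers=2)
print(len(useful), [len(q) for q in queues])   # 4 records, split 2 and 2
```

Dropping non-useful records before dispatch is what lets the architecture spend its parallel capacity only on data that can influence the decision stage.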
DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors.
The functionality of modern multi-core processors is often driven by a given power budget that requires designers to evaluate different decision trade-offs, e.g., to choose between many slow, power-efficient cores, or fewer faster, power-hungry cores, or a combination of them. Here, we prototype and evaluate a new Hadoop scheduler, called DyScale, that exploits capabilities offered by heterogeneous cores within a single multi-core processor for achieving a variety of performance objectives. A typical MapReduce workload contains jobs with different performance goals: large, batch jobs that are throughput oriented, and smaller interactive jobs that are response time sensitive. Heterogeneous multi-core processors enable creating virtual resource pools based on "slow" and "fast" cores for multi-class priority scheduling. Since the same data can be accessed with either "slow" or "fast" slots, spare resources (slots) can be shared between different resource pools. Using measurements in an actual experimental setting and via simulation, we argue in favor of heterogeneous multi-core processors, as they achieve "faster" (up to 40%) processing of small, interactive MapReduce jobs while offering improved throughput (up to 40%) for large, batch jobs. We evaluate the performance benefits of DyScale versus the FIFO and Capacity job schedulers that are broadly used in the Hadoop community.
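The pool-sharing idea can be sketched compactly: each job class prefers one pool but may borrow spare slots from the other, since either slot type can access the same data. Pool sizes, job classes, and the greedy fallback policy below are illustrative assumptions:

```python
# Assign a job to a slot: interactive jobs prefer "fast" slots, batch jobs
# prefer "slow" slots, and each class falls back to the other pool's spare
# slots when its preferred pool is exhausted.
def assign(job, pools):
    prefer = "fast" if job["class"] == "interactive" else "slow"
    other = "slow" if prefer == "fast" else "fast"
    for pool in (prefer, other):
        if pools[pool] > 0:
            pools[pool] -= 1
            return pool
    return None                       # no free slot anywhere: the job waits

pools = {"fast": 2, "slow": 2}
jobs = [{"class": "interactive"}, {"class": "batch"},
        {"class": "interactive"}, {"class": "interactive"}]
placed = [assign(j, pools) for j in jobs]
print(placed)  # third interactive job borrows a spare slow slot
```

The borrowing step is what keeps spare capacity from idling while still giving response-time-sensitive jobs first claim on the fast cores.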
An Incremental and Distributed Inference Method for Large-Scale Ontologies Based on MapReduce Paradigm.
With the upcoming data deluge of semantic data, the fast growth of ontology bases has brought significant challenges in performing efficient and scalable reasoning. Traditional centralized reasoning methods are not sufficient to process large ontologies. Distributed reasoning methods are thus required to improve the scalability and performance of inferences. This paper proposes an incremental and distributed inference method for large-scale ontologies using MapReduce, which realizes high-performance reasoning and runtime searching, especially for incremental knowledge bases. By constructing a transfer inference forest and effective assertional triples, storage is largely reduced and the reasoning process is simplified and accelerated. Finally, a prototype system is implemented on a Hadoop framework, and the experimental results validate the usability and effectiveness of the proposed approach.
Index Terms—Big data, MapReduce, ontology reasoning, RDF.
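A core pattern behind this kind of distributed ontology inference is deriving new `rdfs:subClassOf` triples by a join, iterated to a fixpoint. The sketch below emulates one MapReduce join round sequentially; the tiny ontology and the function names are illustrative assumptions, not the paper's transfer-inference-forest encoding:

```python
# One inference round: group triples (a subClassOf b) by subject, then join
# each (a, b) with every (b, c) to derive (a, c). In MapReduce, the grouping
# is the shuffle keyed on the shared element b.
def infer_step(triples):
    by_subject = {}
    for a, b in triples:
        by_subject.setdefault(a, set()).add(b)
    derived = set()
    for a, b in triples:
        for c in by_subject.get(b, ()):   # (a,b) joined with (b,c) gives (a,c)
            derived.add((a, c))
    return derived

triples = {("Dog", "Mammal"), ("Mammal", "Animal"), ("Animal", "Thing")}
closure = set(triples)
while True:
    new = infer_step(closure) - closure
    if not new:                            # fixpoint: nothing left to infer
        break
    closure |= new

print(sorted(closure))  # 6 triples: the 3 asserted plus 3 inferred
```

Running the round repeatedly until no new triples appear computes the transitive closure; materializing such derived triples naively is what blows up storage, motivating the paper's compressed representation.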
Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters.
The MapReduce framework and its open source implementation Hadoop have become the de facto platform for scalable analysis on large data sets in recent years. One of the primary concerns in Hadoop is how to minimize the completion length (i.e., makespan) of a set of MapReduce jobs. The current Hadoop only allows static slot configuration, i.e., fixed numbers of map slots and reduce slots throughout the lifetime of a cluster. However, we found that such a static configuration may lead to low system resource utilization as well as a long completion length. Motivated by this, we propose simple yet effective schemes which use the slot ratio between map and reduce tasks as a tunable knob for reducing the makespan of a given set of jobs. By leveraging the workload information of recently completed jobs, our schemes dynamically allocate resources (or slots) to map and reduce tasks. We implemented the presented schemes in Hadoop V0.20.2 and evaluated them with representative MapReduce benchmarks at Amazon EC2. The experimental results demonstrate the effectiveness and robustness of our schemes under both simple workloads and more complex mixed workloads.
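The slot-ratio knob can be illustrated with a deliberately simplified makespan model (reduce phase starts only after the map phase finishes, and work divides evenly across slots); the workload numbers are illustrative assumptions:

```python
# Simplified makespan model: total map work and reduce work are measured in
# slot-seconds from recently completed jobs.
def makespan(map_work, reduce_work, map_slots, reduce_slots):
    return map_work / map_slots + reduce_work / reduce_slots

# Try every split of a fixed slot pool and keep the one minimizing makespan.
def best_split(map_work, reduce_work, total_slots):
    candidates = [(makespan(map_work, reduce_work, m, total_slots - m), m)
                  for m in range(1, total_slots)]
    span, m = min(candidates)
    return m, total_slots - m, span

m, r, span = best_split(map_work=600, reduce_work=200, total_slots=8)
print(m, r, round(span, 2))  # map-heavy workload favors more map slots
```

With 600 slot-seconds of map work against 200 of reduce work, the search settles on a 5:3 map-to-reduce split, showing how workload history rather than a fixed configuration should drive the ratio.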
A social inverted index for social-tagging-based information retrieval.
Keywords have played an important role not only for searchers who formulate a query, but also for search engines that index documents and evaluate the query. Recently, tags chosen by users to annotate web resources have been gaining significance for improving information retrieval (IR) tasks, in that they can act as meaningful keywords bridging the gap between humans and machines. One critical aspect of tagging (besides the tag and the resource) is the user (or tagger); there exists a ternary relationship among the tag, resource, and user. The traditional inverted index, however, does not consider the user aspect, and is based on the binary relationship between term and document. In this paper we propose a social inverted index, a novel inverted index extended for social-based IR that maintains a separate user sublist for each resource in a resource-posting list to contain each user's various features as weights. The social inverted index differs from the normal inverted index in that it regards each user as a unique person, rather than simply counting the number of users, and highlights the value of a user who has participated in tagging. This extended structure facilitates the use of dynamic resource weights, which are expected to be more meaningful than simple user-frequency-based weights. It also allows a flexible response to the conditional queries that are increasingly required in tag-based IR. Our experiments have shown that this user-considering indexing performs better in IR tasks than a normal inverted index with no user sublists. The time and space overhead required for index construction and maintenance was also acceptable.
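The extended structure can be sketched as a three-level mapping, tag → resource → user sublist, so ranking can sum per-user weights instead of counting users. The weighting scheme (one weight per tagging act) and all names below are illustrative assumptions:

```python
from collections import defaultdict

# tag -> resource -> {user: weight}: the innermost dict is the user sublist
# that a plain term->document inverted index lacks.
index = defaultdict(lambda: defaultdict(dict))

def add_tagging(tag, resource, user, weight=1.0):
    index[tag][resource][user] = weight

def search(tag):
    # Rank resources by the summed weight of their taggers, so who tagged
    # matters, not just how many users tagged.
    scores = {res: sum(users.values()) for res, users in index[tag].items()}
    return sorted(scores, key=scores.get, reverse=True)

add_tagging("nosql", "doc1", "alice", 1.0)
add_tagging("nosql", "doc1", "bob", 0.5)
add_tagging("nosql", "doc2", "carol", 1.0)
print(search("nosql"))  # doc1 outranks doc2 on summed tagger weight
```

Because the user sublists are kept per resource, the weights can be recomputed dynamically (e.g., from a tagger's expertise) without rebuilding the index, which is the flexibility the abstract claims over frequency-only postings.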
Index Terms— information retrieval, inverted index, social tagging, tags, web search