Big data tools help practitioners with their everyday work. The following tools are commonly used:
Hivemall is a collection of machine learning algorithms for Hive. It includes a number of highly scalable algorithms for classification, regression, recommendation, k-nearest neighbors, anomaly detection, and feature hashing.
Supported operating systems: Independent of the operating system.
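Feature hashing, the last technique listed for Hivemall, can be illustrated in a few lines. The following is a minimal sketch in plain Python, not Hivemall's own functions: the name hash_features, the bucket count of 16, and the use of MD5 as the hash are all assumptions chosen for illustration.

```python
import hashlib


def hash_features(features, num_buckets=16):
    """Map {feature_name: value} to a fixed-length vector (the hashing trick).

    hash_features and num_buckets are hypothetical names for this sketch.
    """
    vector = [0.0] * num_buckets
    for name, value in features.items():
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_buckets
        # Colliding features simply add up in the same bucket.
        vector[index] += value
    return vector


if __name__ == "__main__":
    row = {"country=JP": 1.0, "device=mobile": 1.0, "clicks": 3.0}
    print(hash_features(row))
```

The point of the design is that the vector length stays fixed no matter how many distinct feature names appear, at the cost of occasional collisions.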
Mahout is an open source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, helping developers create smart applications more quickly and easily. Mahout covers clustering, classification, collaborative filtering (recommendation), and frequent itemset mining. In addition, by building on the Apache Hadoop library, Mahout can scale out effectively to the cloud.
MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its central concepts, “Map” and “Reduce”, are borrowed from functional programming languages, with further features borrowed from vector programming languages. It makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system.
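The Map and Reduce steps can be sketched on a single machine with a word count. This is an illustration of the programming model only, not the Hadoop API: the function names map_phase, shuffle, and reduce_phase are assumptions, and a real framework would run the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three steps: map every document, shuffle, then reduce."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data tools", "big data work"]
    print(word_count(docs))  # {'big': 2, 'data': 2, 'tools': 1, 'work': 1}
```

Because each map call sees only one record and each reduce call sees only one key, the programmer never writes any distribution logic. That is the convenience the model is meant to provide.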
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
Currently running workflow instances, including each instance’s state and variables
Pig is a data flow language and runtime environment for exploring very large data sets. It provides a higher level of abstraction for processing large data sets. Pig consists of two parts: the language used to describe data flows, called Pig Latin, and the execution environment that runs Pig Latin programs.
Sqoop (pronounced: skup) is an open source tool for transferring data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, …). It can import data from a relational database (for example MySQL, Oracle, or Postgres) into Hadoop’s HDFS, and it can export HDFS data back into a relational database.
Spark is an open source cluster computing environment similar to Hadoop, but with some differences that let it perform better on certain workloads. In particular, Spark supports in-memory distributed datasets, which, in addition to enabling interactive queries, also speed up iterative workloads.
Built on Apache Hadoop YARN, Tez is “an application framework that allows a complex directed acyclic graph to be built for tasks to process data.” It lets Hive and Pig simplify complex jobs that originally required multiple steps to complete.
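The idea of a directed acyclic graph of tasks can be sketched with Python's standard library. This is not the Tez API: run_dag, the task names, and the tiny in-memory data set are assumptions made for illustration. The sketch only shows how one job can run every task after the tasks it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run each task once, after all of the tasks it depends on.

    tasks: dict of task name -> function taking a dict of upstream results
    dependencies: dict of task name -> set of upstream task names
    (run_dag is a hypothetical helper, not part of Tez.)
    """
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda up: [3, 1, 2],
        "filter": lambda up: [x for x in up["load"] if x > 1],
        "count": lambda up: len(up["load"]),
        "report": lambda up: {"kept": sorted(up["filter"]), "total": up["count"]},
    }
    dependencies = {
        "load": set(),
        "filter": {"load"},
        "count": {"load"},
        "report": {"filter", "count"},
    }
    print(run_dag(tasks, dependencies)["report"])  # {'kept': [2, 3], 'total': 3}
```

Here "filter" and "count" both read the output of "load" directly, and "report" joins them. Expressing that shape as one graph, rather than as a chain of separate jobs, is the kind of simplification described above.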
ZooKeeper is an open source coordination service for distributed applications. It is an open source implementation of Google’s Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.