What is Hadoop?
Apache Hadoop is an open-source software project built on a master/slave architecture: one master machine coordinates many slave machines, collectively called a cluster. The master distributes large data sets across the slaves. Hadoop is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance.
Hadoop enables a computing solution that is:
- Scalable: New nodes can be added as and when required, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
- Cost-effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
- Fault-tolerant: When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
It has two main components:
- Hadoop Distributed File System (HDFS)
- A data processing framework (by default, MapReduce)
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers. It was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. By distributing storage and computation across many slaves, the combined storage resource can grow with demand while remaining economical at every size.
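The core idea behind HDFS fault tolerance is block replication: files are split into fixed-size blocks, and each block is copied to several slaves, so losing one machine loses no data. The toy Python simulation below (not real HDFS code; the node names, tiny block size, and helper functions are invented for illustration) shows why a file stays readable after a node failure when the replication factor is 3:

```python
import random

REPLICATION_FACTOR = 3   # HDFS default replication factor
BLOCK_SIZE = 8           # bytes per block, tiny for illustration (real HDFS uses much larger blocks)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes, replicas=REPLICATION_FACTOR):
    """Place each block on `replicas` distinct nodes (the master's job in HDFS)."""
    placement = {node: {} for node in nodes}
    for idx, block in enumerate(blocks):
        for node in random.sample(nodes, replicas):
            placement[node][idx] = block
    return placement

def read_file(placement, n_blocks):
    """Reassemble the file from whichever replicas are still reachable."""
    result = []
    for idx in range(n_blocks):
        for node_blocks in placement.values():
            if idx in node_blocks:
                result.append(node_blocks[idx])
                break
        else:
            raise IOError(f"block {idx} lost on all replicas")
    return b"".join(result)

data = b"large data set spread across the cluster"
blocks = split_into_blocks(data)
nodes = ["slave1", "slave2", "slave3", "slave4", "slave5"]
placement = distribute(blocks, nodes)

del placement["slave3"]  # simulate losing a node entirely
assert read_file(placement, len(blocks)) == data  # file is still fully readable
```

Because every block lives on three of the five nodes, losing any single node still leaves at least two live replicas of each block, which is exactly the property that lets Hadoop "redirect work to another copy of the data".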
The data processing framework is the tool used to work with the data itself. By default, this is the Java-based system known as MapReduce. It is the component that actually gets data processed. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks across the cluster.
Hadoop is not really a database: It stores data and you can pull data out of it, but there are no queries involved – SQL or otherwise. Hadoop is more of a data warehousing system – so it needs a system like MapReduce to actually process the data. MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed.
Where is it used?
Almost everywhere: name a large web site, and chances are it uses Hadoop to handle its huge volumes of data.
Facebook, eBay, Yahoo!, and Salesforce all use Hadoop.
Hadoop can also be run using Amazon EC2. Just log in to AWS and, with a few clicks, Hadoop will be running in front of you within a few minutes. There you can create as many clusters as you want, though you will be charged on a per-hour basis.