The command used for copying data from the local system to HDFS is: hadoop fs -copyFromLocal [source] [destination]
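As a minimal sketch, the command above can also be assembled programmatically before handing it to a process launcher; the file paths below are hypothetical examples, not paths from this article.

```python
def copy_from_local_cmd(local_src, hdfs_dest):
    """Return the argv list for `hadoop fs -copyFromLocal <src> <dest>`."""
    return ["hadoop", "fs", "-copyFromLocal", local_src, hdfs_dest]

# Hypothetical source file and HDFS destination:
cmd = copy_from_local_cmd("/tmp/sales.csv", "/user/data/sales.csv")
```

In a real script this list would be passed to something like `subprocess.run(cmd)` on a machine with the Hadoop client installed.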
The Context object helps the mapper interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide application-level status updates. The Context object also holds the configuration details for the job, along with the interfaces that help it generate the output.
The NameNode receives the Hadoop job, looks for the data requested by the client, and provides the block information. The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
All the data nodes put together form a storage area, i.e., the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e., the rack ID of each data node, is acquired by the NameNode. The process of selecting data nodes that are closer, based on this rack information, is known as Rack Awareness.
The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.
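The block-splitting and replica-placement idea above can be pictured with a small sketch; the block size is the Hadoop 2.x default, and the rack names are made-up placeholders.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # bytes; Hadoop 2.x default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies (ceiling division)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(racks):
    """Simplified replica placement: two replicas on the first rack,
    the third on a different rack."""
    first, second = racks[0], racks[1]
    return [first, first, second]

n_blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
placement = place_replicas(["rack-1", "rack-2"])
```

This is only an illustration of the policy's shape; the real NameNode also considers node load, locality of the writing client, and rack topology configuration.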
Hadoop provides an option where a particular set of bad input records can be skipped when processing map inputs. Applications can manage this feature through the SkipBadRecords class.
This feature can be used when map tasks fail deterministically on a particular input. This usually happens due to faults in the map function. The user would have to fix these issues.
HDFS does not support modifications at arbitrary offsets in a file or multiple concurrent writers. Files are written by a single writer in append-only format, i.e., writes to a file in HDFS are always made at the end of the file.
Hive stores the table data by default in /user/hive/warehouse.
The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times each block is replicated, ensuring high data availability. For every block stored in HDFS with a replication factor of n, the cluster holds n copies in total, i.e., n-1 duplicates. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, only a single copy of the data will exist. Under these circumstances, if the DataNode holding it crashes, that single copy of the data will be lost.
HDFS provides a distributed data copying facility through DistCP from a source to a destination. When data is copied between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both the source and destination to have a compatible or identical version of Hadoop.
By default, the replication factor is 3. No two copies will be on the same data node. Usually, the first two copies will be on the same rack, and the third copy will be on a different rack. It is advised to keep the replication factor at least three so that one copy is always safe, even if something happens to an entire rack.
We can set the default replication factor of the file system as well as of each file and directory exclusively. For files that are not essential, we can lower the replication factor, and critical files should have a high replication factor.
Commodity hardware refers to inexpensive systems that do not offer high availability or high-end quality. Commodity hardware includes RAM because certain services need to execute in RAM. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
It is a tool used for copying a very large amount of data to and from Hadoop file systems in parallel. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
1) Robust: Sqoop is highly robust, has strong community support and contributions, and is easy to use.
2) Full Load: Sqoop can load the whole table just by a single Sqoop command. It also allows us to load all the tables of the database by using a single Sqoop command.
3) Incremental Load: It supports incremental load functionality. Using Sqoop, we can load parts of the table whenever it is updated.
4) Parallel import/export: It uses the YARN framework for importing and exporting the data. That provides fault tolerance on the top of parallelism.
5) Import results of SQL query: It allows us to import the output from the SQL query into the Hadoop Distributed File System.
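The incremental-load feature above is driven by a handful of command-line flags. As a sketch, the invocation can be composed like this; the JDBC URL, table, and column names are hypothetical placeholders, not values from the article.

```python
def sqoop_incremental_import(jdbc_url, table, check_column, last_value):
    """Build the argv list for a Sqoop incremental append import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,            # JDBC connection string
        "--table", table,                 # source table to import
        "--incremental", "append",        # import only rows added since last run
        "--check-column", check_column,   # column used to detect new rows
        "--last-value", str(last_value),  # highest value imported so far
    ]

cmd = sqoop_incremental_import("jdbc:mysql://db:3306/shop", "orders", "id", 1000)
```

On each run, Sqoop records the new highest value of the check column, which becomes the `--last-value` for the next run.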
SerDe stands for Serializer/Deserializer. SerDes are implemented in Java. The interface provides instructions on how to process a record. Hive has built-in SerDes, and there are also many third-party SerDes you can use depending on the task.
You can share details on how you deployed Hadoop distributions like Cloudera and Hortonworks in your organization, either in a standalone environment or on the cloud. Mention how you configured the number of required nodes, tools, services, and security features such as SSL, SASL, Kerberos, etc. Having set up the Hadoop cluster, talk about how you initially extracted the data from data sources like APIs and SQL-based databases and stored it in HDFS (the storage layer), how you performed data cleaning and validation, and the series of ETLs you performed to transform the data into the given format and extract KPIs.
Some of these ETL tasks include:
Date format parsing
The casting of data type values
Deriving calculated fields
We have further categorized Big Data Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,7,8,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos-3,8,9,10
Big data is defined as a voluminous amount of structured, unstructured, or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by high velocity, volume, and variety, requiring cost-effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not.
The NameNode is responsible for ensuring that the number of copies of the data across the cluster is equal to the replication factor.
In some cases, maybe due to a failure in one of the nodes, the number of copies of the data is less than the replication factor. In such cases, the block is said to be under-replicated. The nodes are required to send updates to the NameNode regarding their health. In such cases, if the NameNode does not receive any updates from a particular node, it will ensure that the replication factor for a block is reached by starting re-replication of the blocks from the available nodes onto a new node.
Blocks are said to be over-replicated in cases where the number of copies of the data exceeds the replication factor. The NameNode fixes this issue by automatically deleting the extra copies of the blocks. Over-replication may occur when, after the shutdown of a particular node, the NameNode starts re-replication of its data across other nodes, following which the node that was previously unavailable is restored.
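The under- and over-replication bookkeeping described above amounts to comparing each block's live replica count against the replication factor; here is a minimal sketch with made-up block IDs.

```python
def classify_blocks(replica_counts, replication_factor=3):
    """Given {block_id: live replica count}, return the lists of
    under- and over-replicated block ids."""
    under = [b for b, n in replica_counts.items() if n < replication_factor]
    over = [b for b, n in replica_counts.items() if n > replication_factor]
    return under, over

# Hypothetical cluster state: blk_2 lost a replica, blk_3 gained one.
counts = {"blk_1": 3, "blk_2": 2, "blk_3": 4}
under, over = classify_blocks(counts)
```

The real NameNode then schedules re-replication for the `under` list and deletion of surplus replicas for the `over` list.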
The major differences between HBase and Hive are:
HBase is built on HDFS, while Hive is a data warehousing infrastructure.
HBase provides low latency, whereas Hive provides high latency for huge databases.
HBase operations are carried out in real-time, whereas Hive operations are carried out as MapReduce jobs internally.
1. setup(): It is used to configure various parameters like input data size and heap size.
2. reduce(): It is the heart of the reducer.
3. cleanup(): This method is used to clear the temporary files at the end of the reduced task.
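The lifecycle above (setup once, reduce per key, cleanup at the end) can be sketched in plain Python rather than the Hadoop Java API; the class and its behavior are an illustration, not Hadoop's actual Reducer interface.

```python
class SumReducer:
    """Toy reducer mirroring the setup() / reduce() / cleanup() lifecycle."""

    def setup(self):
        self.results = {}          # per-task state, initialised once

    def reduce(self, key, values):
        self.results[key] = sum(values)   # the heart of the reducer

    def cleanup(self):
        # final tidy-up before the task ends; here, sort results by key
        self.results = dict(sorted(self.results.items()))

r = SumReducer()
r.setup()
r.reduce("b", [2, 3])
r.reduce("a", [1])
r.cleanup()
```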
Simple distributed coordination process: The coordination process among all nodes in Zookeeper is straightforward.
Synchronization: Mutual exclusion and co-operation among server processes.
Ordered messages: Zookeeper stamps each update with a number denoting its order, so messages are kept ordered.
Serialization: Zookeeper encodes the data according to specific rules, ensuring the application runs consistently.
Reliability: Zookeeper is very reliable. In case of an update, it keeps all the data until it is forwarded.
Atomicity: Data transfer either succeeds or fails, but no transaction is partial.
Apache Zookeeper is an open-source service that supports controlling a huge set of hosts. Management and coordination in a distributed environment are complex. Zookeeper automates this process and enables developers to concentrate on building software features rather than bother about its distributed nature.
Zookeeper helps to maintain configuration information, naming, and group services for distributed applications. It implements various protocols on the cluster so that the application does not have to execute them on its own. It provides a single coherent view of many machines.
The following commands can be used to restart the NameNode:
First, stop the NameNode using:
./sbin/hadoop-daemon.sh stop namenode
After this, start the NameNode using this command:
./sbin/hadoop-daemon.sh start namenode
Hadoop is said to be highly fault tolerant. Hadoop achieves this feat through the process of replication. Data is replicated across multiple nodes in a Hadoop cluster. The data is associated with a replication factor, which indicates the number of copies of the data that are present across the various nodes in a Hadoop cluster. For example, if the replication factor is 3, the data will be present in three different nodes of the Hadoop cluster, where each node will contain one copy each. In this manner, if there is a failure in any one of the nodes, the data will not be lost, but can be recovered from one of the other nodes which contains copies or replicas of the data.
Hadoop 3.0, however, makes use of the method of Erasure Coding (EC). EC is implemented by means of Redundant Array of Inexpensive Disks (RAID) striping, where logically sequential data is divided into smaller units and these smaller units are then stored as consecutive units on different disks. Replication results in a storage overhead of 200% in the case of a replication factor of 3 (the default). The use of EC in Hadoop improves storage efficiency compared to replication, while maintaining the same level of fault tolerance.
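The overhead comparison above is simple arithmetic; as a sketch, 3-way replication stores three full copies (200% extra), while a Reed-Solomon (6, 3) erasure coding scheme (a common EC configuration, used here as an assumed example) stores 6 data units plus 3 parity units (50% extra).

```python
def replication_overhead(replication_factor):
    """Extra storage as a fraction of the raw data size under replication."""
    return replication_factor - 1

def ec_overhead(data_units, parity_units):
    """Extra storage as a fraction of the raw data size under erasure coding."""
    return parity_units / data_units

rep = replication_overhead(3)   # 3 copies total -> 2x extra (200% overhead)
ec = ec_overhead(6, 3)          # 6 data + 3 parity -> 0.5x extra (50% overhead)
```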
The decision to choose a particular file format is based on the following factors-
i) Schema evolution to add, alter and rename fields.
ii) Usage pattern like accessing 5 columns out of 50 columns vs accessing most of the columns.
iii)Splittability to be processed in parallel.
iv) Read/Write/Transfer performance vs block compression saving storage space
File Formats that can be used with Hadoop - CSV, JSON, Columnar, Sequence files, AVRO, and Parquet file.
i) Data Ingestion - The foremost step in deploying big data solutions is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images, and social media feeds. This data needs to be stored in HDFS. Data can either be ingested through batch jobs that run every 15 minutes, once every night, and so on, or through real-time streaming with latencies from 100 ms to 120 seconds.
ii) Data Storage - The subsequent step after ingesting data is to store it either in HDFS or in a NoSQL database like HBase. HBase storage works well for random read/write access, whereas HDFS is optimized for sequential access.
iii) Data Processing - The ultimate step is to process the data using one of the processing frameworks like MapReduce, Spark, Pig, Hive, etc.
The most common Input Formats defined in Hadoop are:
1. Text Input Format- This is the default input format defined in Hadoop.
2. Key Value Input Format- This input format is used for plain text files wherein each line is divided into a key and a value by a separator (tab by default).
3. Sequence File Input Format- This input format is used for reading files in sequence.
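What the key-value format does to each line can be sketched in a few lines; the sample line is a made-up example, and the tab separator mirrors the common default.

```python
def parse_key_value(line, separator="\t"):
    """Split a line into (key, value) at the first separator.
    If no separator is present, the whole line is the key and the value is empty."""
    key, _, value = line.partition(separator)
    return key, value

k, v = parse_key_value("user42\tclicked")
k2, v2 = parse_key_value("no-separator-here")
```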
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB RAM that use ECC memory. Hadoop benefits highly from using ECC memory, though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
Hadoop applications leverage a wide range of technologies that provide great advantages in solving complex business problems.
Core components of a Hadoop application are-
1) Hadoop Common
2) Hadoop Distributed File System (HDFS)
3) Hadoop MapReduce
Data Access Components are - Pig and Hive
Data Storage Component is - HBase
Data Integration Components are - Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.
Data Serialization Components are - Thrift and Avro
Data Intelligence Components are - Apache Mahout and Drill.
Hadoop Framework works on the following two core components-
1) HDFS - Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on the Master-Slave architecture.
2) Hadoop MapReduce - This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform 2 separate tasks - a map job and a reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
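The map-then-reduce flow described above can be sketched as a plain-Python word count; this is an illustration of the paradigm, not an actual Hadoop job.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break the input down into (word, 1) key-value pairs."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    """Reduce: combine the tuples into a smaller set of (word, count)."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# The reduce job is always performed after the map job:
counts = reduce_phase(map_phase(["big data", "big cluster"]))
```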
Resource Manager: It runs on a master daemon and controls the resource allocation in the cluster.
Node Manager: It runs on the slave daemons and executes a task on each single Data Node.
Application Master: It controls the user job lifecycle and resource demands of single applications. It works with the Node Manager and monitors the execution of tasks.
Container: It is a combination of resources, including RAM, CPU, Network, HDD, etc., on a single node.
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2.x. YARN allows many data processing engines, such as graph processing, batch processing, interactive processing, and stream processing, to execute and process data saved in the Hadoop Distributed File System. YARN also offers job scheduling. It extends the capability of Hadoop to other evolving technologies so that they can take good advantage of HDFS and economical clusters.
Apache YARN is the data operating system for Hadoop 2.x. It consists of a master daemon known as the "Resource Manager," a slave daemon called the Node Manager, and the Application Master.
MapReduce needs programs to be translated into map and reduce stages. As not all data analysts are accustomed to MapReduce, Yahoo researchers introduced Apache Pig to bridge the gap. Apache Pig was created on top of Hadoop, providing a high level of abstraction and enabling programmers to spend less time writing complex MapReduce programs.
Hive is an open-source system that processes structured data in Hadoop, sitting on top of the latter for summarizing Big Data and facilitating analysis and queries. Hive enables SQL developers to write Hive Query Language statements, similar to standard SQL statements, for data query and analysis. It was created to make MapReduce programming easier, because you don't need to know Java or write lengthy Java code.
MapReduce is the framework used to process Big Data with parallel, distributed algorithms on a cluster. It works in two steps - mapping and reducing. The map step filters and sorts the data, while the reduce step aggregates the mapped results.
The port number for NameNode is 50070.
As soon as the client is ready to load the file into the cluster, the file contents are divided into data blocks. The client then allocates three DataNodes for each block. Two copies of the data are stored in one rack, while the third copy is stored in another rack.
In this question, first explain NAS and HDFS, and then compare their features as follows:
1) Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can either be hardware or software that provides services for storing and accessing files. Hadoop Distributed File System (HDFS), on the other hand, is a distributed filesystem that stores data using commodity hardware.
2) In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS data is stored on dedicated hardware.
3) HDFS is designed to work with MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
4) HDFS uses commodity hardware, which is cost-effective, whereas NAS uses high-end storage devices, which come at a high cost.
In HDFS, a block is the minimum amount of data that is read or written. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x). A block scanner tracks the blocks present on a DataNode and verifies them to find errors.
Indexing is used to speed up the process of access to the data. Hadoop has a unique way of indexing. It does not automatically index the data. Instead, it stores the last part of the data that shows where (the address) the next part of the data chunk is stored.
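The chunk-linking idea described above can be sketched as follows; the chunk structure and field names are invented for illustration, not Hadoop internals.

```python
def build_chunks(data, chunk_size):
    """Split data into chunks, each recording where the next chunk starts
    (None for the last chunk)."""
    chunks = []
    for i in range(0, len(data), chunk_size):
        nxt = i + chunk_size if i + chunk_size < len(data) else None
        chunks.append({"payload": data[i:i + chunk_size], "next": nxt})
    return chunks

chunks = build_chunks("abcdefgh", 3)   # chunks "abc", "def", "gh"
```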
With HDFS, the data is stored on a distributed system while the regular FileSystem stores the data in a single system. Additionally, the data stored on HDFS can be recovered even if the DataNode crashes. But data recovery on the regular FileSystem can become difficult in case of a crash.
In Hadoop MapReduce, shuffling is used to transfer data from the mappers to the reducers. It is the process in which the system sorts the map output and transfers it as input to the reducers. It is a significant step for reducers; without it, they would not receive any input. Moreover, since shuffling can begin even before the map phase is complete, it helps to save time and finish the job sooner.
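As a minimal sketch, the shuffle step amounts to sorting the mappers' (key, value) outputs and grouping them per key so that each reducer call sees one key with all of its values.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(map_outputs):
    """Sort map outputs by key, then group the values under each key."""
    ordered = sorted(map_outputs, key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(ordered, key=itemgetter(0))}

# Hypothetical map outputs from several mappers:
grouped = shuffle([("b", 2), ("a", 1), ("b", 3)])
```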
Only one NameNode can be configured.
The Secondary NameNode takes hourly backups of the metadata from the NameNode.
It is only suitable for Batch Processing of a vast amount of Data, which is already in the Hadoop System.
It is not ideal for Real-time Data Processing.
It supports up to 4000 Nodes per Cluster.
It has a single component, the JobTracker, which performs many activities like resource management, job scheduling, job monitoring, re-scheduling of jobs, etc.
JobTracker is the single point of failure.
It supports only one NameNode and one Namespace per cluster.
It does not help the Horizontal Scalability of NameNode.
It runs only Map/Reduce jobs.
Yahoo (One of the biggest user & more than 80% code contributor to Hadoop)
1) File size limitations: HDFS can not handle a large number of small files. If you use such files, the NameNode will be overloaded.
2) Support for only batch-processing: Hadoop does not process streamed data and has support for batch-processing only. This lowers overall performance.
3) Difficulty in management: It can become difficult to manage complex applications in Hadoop.
Local mode: It is the default run mode in Hadoop. It uses a local file system and is used for input, debugging, and output operations. This mode does not support the use of HDFS.
Pseudo-distributed mode: The difference between local mode and this mode is that each daemon runs as a separate Java process in this mode.
Fully-distributed mode: All daemons are executed in separate nodes and form a multi-node cluster.
Big data analysis is helping businesses differentiate themselves - for example, Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, providing customized recommendations and launching new products based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. There are many more companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost their revenue.
IBM has a nice, simple explanation for the four critical features of big data:
a) Volume - Scale of data
b) Velocity - Analysis of streaming data
c) Variety - Different forms of data
d) Veracity - Uncertainty of data
HDFS, the Hadoop Distributed File System, is the storage layer for Hadoop. The files in HDFS are split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the cluster. By default, the size of a block is 128 MB, which can be configured as per our necessities. HDFS follows the master-slave architecture and contains two daemons: the NameNode and the DataNodes.
The NameNode is the master daemon that operates on the master node. It saves the filesystem metadata, that is, files names, data about blocks of a file, blocks locations, permissions, etc. It manages the Datanodes.
The DataNodes are the slave daemons that operate on the slave nodes. They save the actual business data and serve client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores metadata like block locations and permissions.
The three major challenges faced with Big Data are:
Storage: Since Big Data comprises high volume, storing such massive amounts of data is a major issue.
Security: Naturally, enterprises need to ensure that every bit of data is stored safely, and there are no instances of a data leak or data compromise in any other way.
Analytics and insights: It becomes difficult to derive insights, analyze trends, and identify patterns when dealing with such a high volume of data.
Hadoop proves to be one of the best solutions for managing Big Data operations due to the following reasons:
High storage: Hadoop enables the storage of huge raw files easily without schema.
Cost-effective: Hadoop is an economical solution for Big Data distributed storage and processing, as the commodity hardware required to run it is not expensive.
Scalable, reliable, and secure: The data can be easily scaled with Hadoop systems, as any number of new nodes can be added. Additionally, the data can be stored and accessed despite machine failure. Lastly, Hadoop provides a high level of data security.
Hadoop is a collection of open-source software that provides a framework to store and process Big Data using the MapReduce programming model. Hadoop is a part of the Apache project sponsored by the Apache Software Foundation.
Hadoop has helped solve many challenges previously faced with distributed storage and Big Data processing. It has gained popularity among data scientists, and it is the preferred choice of many.