Hadoop Interview Question Set 2: Apache Hadoop Interview Questions and Answers for Freshers & Experienced

How to use Apache Zookeeper command line interface?

ZooKeeper ships with a command line client for interactive use. The ZooKeeper command line interface is similar to the UNIX file system and shell. Data in ZooKeeper is stored in a hierarchy of znodes, where each znode can contain data, much like a file, and can also have children, just like directories in the UNIX file system.
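The znode hierarchy described above can be modeled as a small sketch. This is not ZooKeeper's API (a real client would use a library such as kazoo); the class and paths are purely illustrative of the "file with children" idea:

```python
# Minimal sketch of ZooKeeper's znode hierarchy: each znode holds data
# (like a file) and may have named children (like a directory).

class ZNode:
    def __init__(self, data=b""):
        self.data = data          # payload, like a file's contents
        self.children = {}        # named children, like a directory

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        """Create a znode at path (parent znodes must already exist)."""
        *parents, name = path.strip("/").split("/")
        node = self.root
        for part in parents:
            node = node.children[part]
        node.children[name] = ZNode(data)

    def get(self, path):
        """Return the data stored at a znode."""
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node.data

    def ls(self, path="/"):
        """List a znode's children, like `ls /path` in the ZooKeeper CLI."""
        node = self.root
        if path != "/":
            for part in path.strip("/").split("/"):
                node = node.children[part]
        return sorted(node.children)

tree = ZNodeTree()
tree.create("/app", b"config-root")
tree.create("/app/broker1", b"host:9092")
print(tree.ls("/app"))          # ['broker1']
print(tree.get("/app/broker1")) # b'host:9092'
```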

Explain about ZooKeeper in Kafka

Apache Kafka uses ZooKeeper to remain a highly distributed and scalable system. Kafka uses ZooKeeper to store various configurations and share them across the Kafka cluster in a distributed manner. To achieve this, configurations are replicated across the leader and follower nodes of the ZooKeeper ensemble. We cannot connect to Kafka directly by bypassing ZooKeeper, because if ZooKeeper is down, Kafka will not be able to serve client requests.

Does Apache Flume provide support for third party plug-ins?

Yes. Apache Flume has a plug-in-based architecture, so it can load data from external sources and transfer it to external destinations through third-party plug-ins.

How multi-hop agent can be setup in Flume?

The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.

What are the differences between Pig and SQL?

The prime differences between the two are as follows:

1. Pig is a procedural query language, while SQL is declarative.
2. Pig follows a nested relational data model, while SQL follows a flat one.
3. Having schema in Pig is optional, but in SQL, it is mandatory.
4. Pig offers limited query optimization, but SQL offers significant optimization.

Explain Grunt Shell.

Grunt is Pig's interactive shell. It lets you enter Pig Latin statements interactively and also provides commands for interacting with HDFS.

How do you configure an “Oozie” job in Hadoop?

“Oozie” is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs, such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and “Sqoop”. An Oozie job is defined in a workflow.xml file, configured through a job.properties file, and then submitted with the Oozie command-line client.

Define RDD.

RDD is the acronym for Resilient Distributed Dataset – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed, and RDDs are a key component of Apache Spark.
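The two defining properties above – immutable partitions, transformed in parallel into a new collection – can be sketched in a few lines. This is a toy illustration, not Spark's API; the class and method names are made up:

```python
# Toy sketch of the RDD idea: an immutable tuple of partitions, where a
# transformation returns a *new* collection instead of mutating the old
# one, and each partition can be processed in parallel.

from multiprocessing.dummy import Pool  # thread pool keeps the demo simple

class ToyRDD:
    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)  # immutable

    def map(self, fn):
        """Apply fn to every element, one partition per worker."""
        with Pool() as pool:
            new_parts = pool.map(lambda part: [fn(x) for x in part],
                                 self._partitions)
        return ToyRDD(new_parts)  # the original RDD is untouched

    def collect(self):
        """Gather all partitions back into one list."""
        return [x for part in self._partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4]])
doubled = rdd.map(lambda x: x * 2)
print(rdd.collect())      # [1, 2, 3, 4] -> original data unchanged
print(doubled.collect())  # [2, 4, 6, 8]
```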

What are the components of Region Server?

The components of a Region Server are:

1. WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
2. Block Cache: The Block Cache resides at the top of the Region Server. It stores frequently read data in memory.
3. MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
4. HFile: HFile is stored in HDFS. It stores the actual cells on the disk.
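How the four components cooperate on the write path can be sketched as follows. This is a simplified illustration, not HBase's implementation; the class name and flush threshold are made up:

```python
# Sketch of the Region Server write path: an edit is appended to the WAL
# first, then buffered in the MemStore; when the MemStore fills up, it is
# flushed as a sorted, immutable HFile (which in reality lives in HDFS).

class RegionServerSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []            # write-ahead log: durability before commit
        self.memstore = {}       # in-memory write cache (per column family)
        self.hfiles = []         # flushed, sorted files (stand-in for HDFS)
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log the edit first
        self.memstore[row] = value      # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write a sorted, immutable HFile and clear the MemStore
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

rs = RegionServerSketch(flush_threshold=2)
rs.put("row1", "a")
rs.put("row2", "b")   # threshold reached -> flush to an HFile
print(rs.hfiles)      # [[('row1', 'a'), ('row2', 'b')]]
print(rs.memstore)    # {} after the flush; the WAL still holds both edits
```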

Can the default “Hive Metastore” be used by multiple users (processes) at the same time?

The embedded “Derby database” is the default “Hive Metastore”. Multiple users (processes) cannot access it at the same time; it is mainly used for unit tests.

What is a UDF?

If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring in that functionality using other languages like Java, Python, Ruby, etc., and embed them in the script file.
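As a hedged sketch, here is the kind of Python function Pig can call as a UDF (Pig runs Python UDFs through Jython). The file name, alias, and field in the registration comment are illustrative:

```python
# A plain Python function usable as a Pig UDF. In a real Pig script it
# would be registered and called roughly like this (names illustrative):
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.normalize(name);

def normalize(value):
    """Trim whitespace and lower-case a field, passing nulls through."""
    if value is None:
        return None
    return value.strip().lower()

print(normalize("  Hadoop  "))   # hadoop
print(normalize(None))           # None
```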

What are the different data types in Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.

Atomic data types: Atomic or scalar data types are the basic data types, such as int, long, float, double, chararray and bytearray.

Complex Data Types: Complex data types are Tuple, Map and Bag.

What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading sequence files. A sequence file is a compressed binary file format optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.

Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.

What is a “Combiner”?

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
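The data reduction a combiner buys can be shown with a toy word count. The node split and input lines are illustrative; the point is that fewer (key, value) pairs cross the network to the reducer:

```python
# Sketch of how a combiner shrinks mapper output before the shuffle:
# counts are pre-aggregated on each node, so the reducer receives fewer
# (key, value) pairs than the mappers emitted.

from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]

def local_reduce(pairs):
    """Sum values per key; serves as both combiner and final reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

node1 = mapper("big data big cluster")          # 4 pairs on node 1
node2 = mapper("big data")                      # 2 pairs on node 2
combined1, combined2 = local_reduce(node1), local_reduce(node2)
shuffled = combined1 + combined2                # 5 pairs cross the network
result = local_reduce(shuffled)                 # instead of 6 without combiners
print(len(node1 + node2), len(shuffled))        # 6 5
print(result)                                   # [('big', 3), ('cluster', 1), ('data', 2)]
```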

Does Flume provide 100% reliability to the data flow?

Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.

What are the limitations of importing RDBMS tables into Hcatalog directly?

There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments, such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir, are not supported.

Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using -e or --query with the import command, the --target-dir value must be specified, and for parallel imports the query must include the $CONDITIONS placeholder.

How are large objects handled in Sqoop?

Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store:

1) CLOBs – Character Large Objects

2) BLOBs – Binary Large Objects

Large objects in Sqoop are handled by importing them into a file referred to as a “LobFile”, i.e. a Large Object File. The LobFile can store records of huge size; each record in a LobFile is a large object.

Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports –

1) Append

2) Last Modified

Explain “Distributed Cache” in a “MapReduce Framework”.

The Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework makes it available on every DataNode where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer.

What is the purpose of “RecordReader” in Hadoop?

The “InputSplit” defines a slice of work but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “InputFormat”.
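What a line-oriented RecordReader produces can be sketched like this. It mimics the (byte offset, line) pairs that TextInputFormat's reader hands to the Mapper; the function name and sample data are illustrative:

```python
# Sketch of a line-oriented RecordReader: turn a raw text stream (an
# InputSplit's contents) into (key, value) pairs for the Mapper, with
# the key being the byte offset of each line.

import io

def line_records(stream):
    """Yield (byte_offset, line_text) pairs from a text stream."""
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

data = io.StringIO("first line\nsecond line\n")
records = list(line_records(data))
print(records)   # [(0, 'first line'), (11, 'second line')]
```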

State some key components of ZooKeeper.

The primary components of ZooKeeper architecture are:

Node: the systems installed on the cluster.
ZNode: a node in the ZooKeeper data tree that stores data along with version information for updates.
Client applications: the client-side applications used to interact with the distributed application.
Server applications: the server-side applications that provide an interface for client applications to interact with the server.

What is an interceptor?

Interceptors are useful in filtering out unwanted log files. We can use them to eliminate events between the source and channel or the channel and sink based on our requirements.

What do you know about Sqoop metastore?

The Sqoop metastore is a shared repository where multiple local and remote users can execute saved jobs. We can connect to the Sqoop metastore through sqoop-site.xml or with the --meta-connect argument.

How to import BLOB and CLOB-like big objects in Sqoop?

We can use JDBC-based imports for BLOB and CLOB columns, since Sqoop's direct import mode does not support these types.

Mention the design challenges of distributed applications.

1. Heterogeneity: The design of applications should allow users to access services and run applications over a heterogeneous collection of computers and networks, taking into consideration hardware devices, operating systems, networks and programming languages.
2. Transparency: Distributed system designers must hide the complexity of the system as much as they can. Some forms of transparency are location, access, migration, relocation, and so on.
3. Openness: A characteristic that determines whether the system can be extended and reimplemented in various ways.
4. Security: Distributed system designers must take care of confidentiality, integrity, and availability.
5. Scalability: A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.

What is Apache Flume in Hadoop?

Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and carrying large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume is a very stable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.

How can I restart “NameNode” or all the daemons in Hadoop?

This question has two answers, and we will discuss both. We can restart the NameNode by the following methods:

1. You can stop the NameNode individually with the ./sbin/hadoop-daemon.sh stop namenode command and then start it with the ./sbin/hadoop-daemon.sh start namenode command.
2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them all again.

What is “speculative execution” in Hadoop?

If a node appears to be running a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. This process is called “speculative execution”.
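The "first result wins" behavior can be simulated with two racing attempts. This is a toy model, not Hadoop's scheduler; the node names and delays are made up (a real framework would actually kill the slower attempt rather than just ignore it):

```python
# Toy simulation of speculative execution: the same task is launched on
# two "nodes", the first result to arrive is accepted, and the slower
# attempt is discarded.

import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_task(node, delay):
    time.sleep(delay)               # pretend the node takes `delay` seconds
    return f"result-from-{node}"

with ThreadPoolExecutor(max_workers=2) as pool:
    attempts = {pool.submit(run_task, "slow-node", 0.5),
                pool.submit(run_task, "backup-node", 0.05)}
    done, pending = wait(attempts, return_when=FIRST_COMPLETED)
    winner = done.pop().result()    # accept whichever finished first
    for f in pending:
        f.cancel()                  # best effort; a running thread keeps going

print(winner)   # result-from-backup-node
```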

How do you define “Rack Awareness” in Hadoop?

Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so that most traffic between “DataNodes” stays within a rack and inter-rack network traffic is minimized. With the default replication factor of 3, the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
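The placement rule above can be sketched as a small function. This is a simplification of the real policy (rack and slot names are illustrative), but it captures the "two racks for three replicas" outcome:

```python
# Sketch of the default replica placement policy: with replication
# factor 3, one replica lands on the writer's rack and the other two
# share a single different rack.

def place_replicas(racks, local_rack):
    """Return a list of (rack, slot) for the 3 replicas of one block."""
    remote_rack = next(r for r in racks if r != local_rack)
    return [(local_rack, 1),      # first copy: the writer's rack
            (remote_rack, 1),     # second copy: a different rack
            (remote_rack, 2)]     # third copy: same rack as the second

placement = place_replicas(["rack-A", "rack-B", "rack-C"], "rack-A")
print(placement)
racks_used = {rack for rack, _ in placement}
print(len(racks_used))   # 2 -> three replicas span only two racks
```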

What does ‘jps’ command do?

The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc.

Why do we use HDFS for applications having large data sets and not when there are a lot of small files?

HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. The NameNode stores the metadata of the file system in RAM, so the amount of RAM limits the number of files an HDFS file system can hold. In other words, too many files generate too much metadata, and storing all of this metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes about 150 bytes.
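The 150-byte rule of thumb makes the small-files problem easy to quantify. The file counts and block sizes below are illustrative, but the arithmetic shows why a million tiny files cost the NameNode far more heap than one large file:

```python
# Back-of-the-envelope NameNode memory estimate using the ~150-byte
# rule of thumb for each file, block, or directory object.

BYTES_PER_OBJECT = 150  # rough metadata cost per namespace object

def namenode_heap(num_files, blocks_per_file):
    """Approximate NameNode heap used: one file entry plus its blocks."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

one_big_file = namenode_heap(1, 8)           # ~1 GB as 8 x 128 MB blocks
many_small = namenode_heap(1_000_000, 1)     # ~1 GB as a million tiny files
print(one_big_file)   # 1350 bytes of metadata
print(many_small)     # 300000000 bytes, i.e. ~300 MB of metadata
```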

Can NameNode and DataNode be a commodity hardware?

The smart answer to this question is that DataNodes are commodity hardware, like personal computers and laptops, as they store data and are required in large numbers. The NameNode, however, is the master node and stores metadata about all the blocks in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with plenty of memory.

How is HDFS fault tolerant?

When data is stored in HDFS, the NameNode replicates it to several DataNodes. The default replication factor is 3, and you can change it as per your needs. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas and makes the data available. This is how HDFS provides fault tolerance.

What applications are supported by Apache Hive?

The applications supported by Apache Hive are client applications written in Java, PHP, Python, Ruby and C++, which connect to Hive through its Thrift, JDBC and ODBC interfaces.


Explain the process of row deletion in HBase.

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

What are column families? What happens if you alter the block size of a column family on an already populated database?

The logical division of data is represented through a key known as a column family. Column families are the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.

Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores the locations of all the regions in the system.

Explain the three types of tombstone markers for deletion.

1. Family delete marker: It marks all the columns for deletion from a column family.
2. Version delete marker: It marks a single version of a column for deletion.
3. Column delete marker: It marks all versions of a column for deletion.

What is a checkpoint?

In brief, “checkpointing” is a process that takes the FsImage and edit log and compacts them into a new FsImage. Instead of replaying the edit log, the NameNode can then load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
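The merge step can be sketched as applying a list of edits to a snapshot. The edit-record format and paths here are made up for illustration; a real FsImage and edit log are binary structures:

```python
# Sketch of checkpointing: apply the edit log to the previous FsImage to
# produce a new FsImage, so the NameNode can start from the merged image
# instead of replaying every edit.

def checkpoint(fsimage, edit_log):
    """Return a new FsImage (path -> size) with all logged edits applied."""
    image = dict(fsimage)                    # start from the old snapshot
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = args[0]
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/data/a.txt": 128}
edits = [("create", "/data/b.txt", 256),
         ("delete", "/data/a.txt")]
new_image = checkpoint(fsimage, edits)
print(new_image)   # {'/data/b.txt': 256} -> edits no longer need replaying
```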

Give a brief on how Spark is good at low latency workloads like graph processing and Machine Learning.

Apache Spark stores data in memory for faster processing. Building a machine learning model may require many iterations of an algorithm and several conceptual steps to arrive at an optimized model, and graph algorithms repeatedly traverse all the nodes and edges of a graph. For such low-latency workloads, which need many iterations, keeping the data in memory greatly enhances performance.

What are the basic parameters of a mapper?

The primary parameters of a mapper are LongWritable and Text, which represent the input key/value types, and Text and IntWritable, which represent the intermediate output key/value types.

Explain the distributed Cache in MapReduce framework.

Distributed Cache is a significant feature provided by the MapReduce Framework, practiced when you want to share the files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files.

Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files, such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) where MapReduce jobs are running. Every DataNode gets a local copy of the file, which is delivered by the Distributed Cache.
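The classic use of this facility is a map-side join: each task loads the small cached file once and consults it for every record. The file format, field names, and data below are illustrative:

```python
# Sketch of the typical Distributed Cache use: a small read-only lookup
# file is shipped to every worker, loaded once during task setup, and
# consulted by the mapper (a map-side join).

def load_cache(lines):
    """Parse the cached lookup file, one 'id,name' pair per line."""
    return dict(line.split(",") for line in lines)

def mapper(record, lookup):
    """Join each (user_id, amount) record against the cached lookup."""
    user_id, amount = record
    return (lookup.get(user_id, "unknown"), amount)

cached_file = ["u1,alice", "u2,bob"]     # local copy on every DataNode
lookup = load_cache(cached_file)         # loaded once in task setup
out = [mapper(r, lookup) for r in [("u1", 10), ("u3", 5)]]
print(out)   # [('alice', 10), ('unknown', 5)]
```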

Explain the difference between RDBMS data model and HBase data model.

RDBMS is a schema-based database, whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

What is Row Key?

Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.

What are the different operational commands in HBase at record level and table level?

Record-level operational commands in HBase are – put, get, increment, scan and delete.

Table-level operational commands in HBase are – describe, list, drop, disable and scan.

Explain the actions followed by a Jobtracker in Hadoop.

1. The client application submits jobs to the JobTracker.
2. The JobTracker communicates with the NameNode to determine the data location.
3. Using the available slots and the proximity to the data, the JobTracker locates TaskTracker nodes.
4. It submits the work to the selected TaskTracker nodes.
5. When a task fails, the JobTracker notifies the client and decides the further steps.
6. The JobTracker monitors the TaskTracker nodes.

How does NameNode tackle DataNode failures?

The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send heartbeat messages, it is marked dead after a specific period of time.

The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
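Heartbeat-based failure detection boils down to a timestamp comparison. This sketch uses made-up node names and a made-up timeout (the real default is configurable and much longer):

```python
# Sketch of heartbeat-based failure detection: a DataNode whose last
# heartbeat is older than the timeout is marked dead, and its blocks
# become candidates for re-replication from the surviving replicas.

def find_dead_nodes(last_heartbeat, now, timeout=30):
    """Return DataNodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(node for node, t in last_heartbeat.items()
                  if now - t > timeout)

heartbeats = {"dn1": 100, "dn2": 70, "dn3": 99}   # seconds, illustrative
dead = find_dead_nodes(heartbeats, now=105)
print(dead)   # ['dn2'] -> its blocks get re-replicated elsewhere
```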

What happens when two clients try to access the same file in the HDFS?

HDFS supports exclusive writes only.

When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
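The lease logic described above can be sketched as a tiny grant/reject table. This is an illustration of the exclusive-write rule, not HDFS's actual lease manager; the class, paths, and client names are made up:

```python
# Sketch of the exclusive-write lease: the first client to open a file
# for writing gets the lease; a second open-for-write on the same file
# is rejected until the lease is released.

class LeaseManager:
    def __init__(self):
        self.leases = {}   # path -> client currently holding the write lease

    def open_for_write(self, path, client):
        """Grant the lease if free (or already held by this client)."""
        holder = self.leases.get(path)
        if holder is not None and holder != client:
            return False            # lease already granted to someone else
        self.leases[path] = client
        return True

    def release(self, path, client):
        """Give the lease back so another writer can take it."""
        if self.leases.get(path) == client:
            del self.leases[path]

lm = LeaseManager()
print(lm.open_for_write("/data/log.txt", "client-1"))   # True
print(lm.open_for_write("/data/log.txt", "client-2"))   # False, rejected
lm.release("/data/log.txt", "client-1")
print(lm.open_for_write("/data/log.txt", "client-2"))   # True now
```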
