Hadoop Interview Question Set 2: Apache Hadoop Interview Questions and Answers for Freshers & Experienced

How to use Apache Zookeeper command line interface?

ZooKeeper ships with a command line client for interactive use. The ZooKeeper command line interface is similar to the UNIX file system and shell. Data in ZooKeeper is stored in a hierarchy of znodes, where each znode can contain data, much like a file, and can also have children, just like directories in the UNIX file system.
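The znode hierarchy described above can be modeled as a small sketch. This is not ZooKeeper's API (a real client would use a library such as kazoo); the class and paths are purely illustrative of the "file with children" idea:

```python
# Minimal sketch of ZooKeeper's znode hierarchy: each znode holds data
# (like a file) and may have named children (like a directory).

class ZNode:
    def __init__(self, data=b""):
        self.data = data          # payload, like a file's contents
        self.children = {}        # named children, like a directory

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        """Create a znode at path (parent znodes must already exist)."""
        *parents, name = path.strip("/").split("/")
        node = self.root
        for part in parents:
            node = node.children[part]
        node.children[name] = ZNode(data)

    def get(self, path):
        """Return the data stored at a znode."""
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node.data

    def ls(self, path="/"):
        """List a znode's children, like `ls /path` in the ZooKeeper CLI."""
        node = self.root
        if path != "/":
            for part in path.strip("/").split("/"):
                node = node.children[part]
        return sorted(node.children)

tree = ZNodeTree()
tree.create("/app", b"config-root")
tree.create("/app/broker1", b"host:9092")
print(tree.ls("/app"))          # ['broker1']
print(tree.get("/app/broker1")) # b'host:9092'
```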

Explain about ZooKeeper in Kafka

Apache Kafka uses ZooKeeper to remain a highly distributed and scalable system. Kafka uses ZooKeeper to store various configurations and share them across the Kafka cluster in a distributed manner. To achieve this, configurations are replicated across the leader and follower nodes of the ZooKeeper ensemble. We cannot connect to Kafka directly by bypassing ZooKeeper, because if ZooKeeper is down, Kafka will not be able to serve client requests.

Does Apache Flume provide support for third party plug-ins?

Yes. Apache Flume has a plug-in-based architecture, so it can load data from external sources and transfer it to external destinations through third-party plug-ins.

How multi-hop agent can be setup in Flume?

The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.

What are the differences between Pig and SQL?

The prime differences between the two are as follows:

1. Pig is a procedural query language, while SQL is declarative.
2. Pig follows a nested relational data model, while SQL follows a flat one.
3. Having schema in Pig is optional, but in SQL, it is mandatory.
4. Pig offers limited query optimization, but SQL offers significant optimization.

Explain Grunt Shell.

Grunt is Pig's interactive shell. It lets you enter Pig Latin statements interactively and also provides commands for interacting with HDFS.

How do you configure an “Oozie” job in Hadoop?

“Oozie” is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs, such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and “Sqoop”. An Oozie job is defined in a workflow.xml file, configured through a job.properties file, and then submitted with the Oozie command-line client.

Define RDD.

RDD is the acronym for Resilient Distributed Dataset – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed, and RDDs are a key component of Apache Spark.
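The two defining properties above – immutable partitions, transformed in parallel into a new collection – can be sketched in a few lines. This is a toy illustration, not Spark's API; the class and method names are made up:

```python
# Toy sketch of the RDD idea: an immutable tuple of partitions, where a
# transformation returns a *new* collection instead of mutating the old
# one, and each partition can be processed in parallel.

from multiprocessing.dummy import Pool  # thread pool keeps the demo simple

class ToyRDD:
    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)  # immutable

    def map(self, fn):
        """Apply fn to every element, one partition per worker."""
        with Pool() as pool:
            new_parts = pool.map(lambda part: [fn(x) for x in part],
                                 self._partitions)
        return ToyRDD(new_parts)  # the original RDD is untouched

    def collect(self):
        """Gather all partitions back into one list."""
        return [x for part in self._partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4]])
doubled = rdd.map(lambda x: x * 2)
print(rdd.collect())      # [1, 2, 3, 4] -> original data unchanged
print(doubled.collect())  # [2, 4, 6, 8]
```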

What are the components of Region Server?

The components of a Region Server are:

1. WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
2. Block Cache: The Block Cache resides at the top of the Region Server. It stores frequently read data in memory.
3. MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
4. HFile: HFile is stored in HDFS. It stores the actual cells on the disk.
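How the four components cooperate on the write path can be sketched as follows. This is a simplified illustration, not HBase's implementation; the class name and flush threshold are made up:

```python
# Sketch of the Region Server write path: an edit is appended to the WAL
# first, then buffered in the MemStore; when the MemStore fills up, it is
# flushed as a sorted, immutable HFile (which in reality lives in HDFS).

class RegionServerSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []            # write-ahead log: durability before commit
        self.memstore = {}       # in-memory write cache (per column family)
        self.hfiles = []         # flushed, sorted files (stand-in for HDFS)
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log the edit first
        self.memstore[row] = value      # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write a sorted, immutable HFile and clear the MemStore
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

rs = RegionServerSketch(flush_threshold=2)
rs.put("row1", "a")
rs.put("row2", "b")   # threshold reached -> flush to an HFile
print(rs.hfiles)      # [[('row1', 'a'), ('row2', 'b')]]
print(rs.memstore)    # {} after the flush; the WAL still holds both edits
```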

Can the default “Hive Metastore” be used by multiple users (processes) at the same time?

The embedded “Derby database” is the default “Hive Metastore”. Multiple users (processes) cannot access it at the same time; it is mainly used for unit tests.

What is a UDF?

If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring in that functionality using other languages like Java, Python, Ruby, etc., and embed them in the script file.
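As a hedged sketch, here is the kind of Python function Pig can call as a UDF (Pig runs Python UDFs through Jython). The file name, alias, and field in the registration comment are illustrative:

```python
# A plain Python function usable as a Pig UDF. In a real Pig script it
# would be registered and called roughly like this (names illustrative):
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.normalize(name);

def normalize(value):
    """Trim whitespace and lower-case a field, passing nulls through."""
    if value is None:
        return None
    return value.strip().lower()

print(normalize("  Hadoop  "))   # hadoop
print(normalize(None))           # None
```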

What are the different data types in Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.

Atomic data types: Atomic or scalar data types are the basic data types, such as int, long, float, double, chararray and bytearray.

Complex Data Types: Complex data types are Tuple, Map and Bag.

What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading sequence files. A sequence file is a compressed binary file format optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.

Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.

What is a “Combiner”?

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
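The data reduction a combiner buys can be shown with a toy word count. The node split and input lines are illustrative; the point is that fewer (key, value) pairs cross the network to the reducer:

```python
# Sketch of how a combiner shrinks mapper output before the shuffle:
# counts are pre-aggregated on each node, so the reducer receives fewer
# (key, value) pairs than the mappers emitted.

from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]

def local_reduce(pairs):
    """Sum values per key; serves as both combiner and final reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

node1 = mapper("big data big cluster")          # 4 pairs on node 1
node2 = mapper("big data")                      # 2 pairs on node 2
combined1, combined2 = local_reduce(node1), local_reduce(node2)
shuffled = combined1 + combined2                # 5 pairs cross the network
result = local_reduce(shuffled)                 # instead of 6 without combiners
print(len(node1 + node2), len(shuffled))        # 6 5
print(result)                                   # [('big', 3), ('cluster', 1), ('data', 2)]
```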

Does Flume provide 100% reliability to the data flow?

Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.

What are the limitations of importing RDBMS tables into Hcatalog directly?

There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments, such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir, are not supported.

Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using -e or --query with the import command, the --target-dir value must be specified, and for parallel imports the query must include the $CONDITIONS placeholder.

How are large objects handled in Sqoop?

Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store:

1) CLOBs – Character Large Objects

2) BLOBs – Binary Large Objects

Large objects in Sqoop are handled by importing them into a file referred to as a “LobFile”, i.e. a Large Object File. The LobFile can store records of huge size; each record in a LobFile is a large object.

Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports –

1) Append

2) Last Modified

Explain “Distributed Cache” in a “MapReduce Framework”.

The Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework makes it available on every DataNode where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer.

What is the purpose of “RecordReader” in Hadoop?

The “InputSplit” defines a slice of work but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “InputFormat”.
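What a line-oriented RecordReader produces can be sketched like this. It mimics the (byte offset, line) pairs that TextInputFormat's reader hands to the Mapper; the function name and sample data are illustrative:

```python
# Sketch of a line-oriented RecordReader: turn a raw text stream (an
# InputSplit's contents) into (key, value) pairs for the Mapper, with
# the key being the byte offset of each line.

import io

def line_records(stream):
    """Yield (byte_offset, line_text) pairs from a text stream."""
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

data = io.StringIO("first line\nsecond line\n")
records = list(line_records(data))
print(records)   # [(0, 'first line'), (11, 'second line')]
```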

State some key components of ZooKeeper.

The primary components of ZooKeeper architecture are:

Node: the systems installed on the cluster.
ZNode: a node in the ZooKeeper data tree that stores data along with version information for updates.
Client applications: the client-side applications used to interact with the distributed application.
Server applications: the server-side applications that provide an interface for client applications to interact with the server.

What is an interceptor?

Interceptors are useful in filtering out unwanted log files. We can use them to eliminate events between the source and channel or the channel and sink based on our requirements.

What do you know about Sqoop metastore?

The Sqoop metastore is a shared repository where multiple local and remote users can execute saved jobs. We can connect to the Sqoop metastore through sqoop-site.xml or with the --meta-connect argument.

How to import BLOB and CLOB-like big objects in Sqoop?

We can use JDBC-based imports for BLOB and CLOB columns, since Sqoop's direct import mode does not support these types.

Mention the design challenges of distributed applications.

1. Heterogeneity: The design of applications should allow users to access services and run applications over a heterogeneous collection of computers and networks, taking into consideration hardware devices, operating systems, networks and programming languages.
2. Transparency: Distributed system designers must hide the complexity of the system as much as they can. Some forms of transparency are location, access, migration, relocation, and so on.
3. Openness: A characteristic that determines whether the system can be extended and reimplemented in various ways.
4. Security: Distributed system designers must take care of confidentiality, integrity, and availability.
5. Scalability: A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.

What is Apache Flume in Hadoop?

Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and carrying large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume is a very stable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.

How can I restart “NameNode” or all the daemons in Hadoop?

This question has two answers, and we will discuss both. We can restart the NameNode by the following methods:

1. You can stop the NameNode individually with the ./sbin/hadoop-daemon.sh stop namenode command and then start it with the ./sbin/hadoop-daemon.sh start namenode command.
2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them all again.

What is “speculative execution” in Hadoop?

If a node appears to be running a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. This process is called “speculative execution”.
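The "first result wins" behavior can be simulated with two racing attempts. This is a toy model, not Hadoop's scheduler; the node names and delays are made up (a real framework would actually kill the slower attempt rather than just ignore it):

```python
# Toy simulation of speculative execution: the same task is launched on
# two "nodes", the first result to arrive is accepted, and the slower
# attempt is discarded.

import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_task(node, delay):
    time.sleep(delay)               # pretend the node takes `delay` seconds
    return f"result-from-{node}"

with ThreadPoolExecutor(max_workers=2) as pool:
    attempts = {pool.submit(run_task, "slow-node", 0.5),
                pool.submit(run_task, "backup-node", 0.05)}
    done, pending = wait(attempts, return_when=FIRST_COMPLETED)
    winner = done.pop().result()    # accept whichever finished first
    for f in pending:
        f.cancel()                  # best effort; a running thread keeps going

print(winner)   # result-from-backup-node
```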

How do you define “Rack Awareness” in Hadoop?

Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so that most traffic between “DataNodes” stays within a rack and inter-rack network traffic is minimized. With the default replication factor of 3, the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
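The placement rule above can be sketched as a small function. This is a simplification of the real policy (rack and slot names are illustrative), but it captures the "two racks for three replicas" outcome:

```python
# Sketch of the default replica placement policy: with replication
# factor 3, one replica lands on the writer's rack and the other two
# share a single different rack.

def place_replicas(racks, local_rack):
    """Return a list of (rack, slot) for the 3 replicas of one block."""
    remote_rack = next(r for r in racks if r != local_rack)
    return [(local_rack, 1),      # first copy: the writer's rack
            (remote_rack, 1),     # second copy: a different rack
            (remote_rack, 2)]     # third copy: same rack as the second

placement = place_replicas(["rack-A", "rack-B", "rack-C"], "rack-A")
print(placement)
racks_used = {rack for rack, _ in placement}
print(len(racks_used))   # 2 -> three replicas span only two racks
```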

What does ‘jps’ command do?

The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc.

Why do we use HDFS for applications having large data sets and not when there are a lot of small files?

HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. The NameNode stores the metadata of the file system in RAM, so the amount of RAM limits the number of files an HDFS file system can hold. In other words, too many files generate too much metadata, and storing all of this metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes about 150 bytes.
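The 150-byte rule of thumb makes the small-files problem easy to quantify. The file counts and block sizes below are illustrative, but the arithmetic shows why a million tiny files cost the NameNode far more heap than one large file:

```python
# Back-of-the-envelope NameNode memory estimate using the ~150-byte
# rule of thumb for each file, block, or directory object.

BYTES_PER_OBJECT = 150  # rough metadata cost per namespace object

def namenode_heap(num_files, blocks_per_file):
    """Approximate NameNode heap used: one file entry plus its blocks."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

one_big_file = namenode_heap(1, 8)           # ~1 GB as 8 x 128 MB blocks
many_small = namenode_heap(1_000_000, 1)     # ~1 GB as a million tiny files
print(one_big_file)   # 1350 bytes of metadata
print(many_small)     # 300000000 bytes, i.e. ~300 MB of metadata
```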

Can NameNode and DataNode be a commodity hardware?

The smart answer to this question is that DataNodes are commodity hardware, like personal computers and laptops, as they store data and are required in large numbers. The NameNode, however, is the master node and stores metadata about all the blocks in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with plenty of memory.

How is HDFS fault tolerant?

When data is stored in HDFS, the NameNode replicates it to several DataNodes. The default replication factor is 3, and you can change it as per your needs. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas and makes the data available. This is how HDFS provides fault tolerance.

What applications are supported by Apache Hive?

The applications supported by Apache Hive are client applications written in Java, PHP, Python, Ruby and C++, which connect to Hive through its Thrift, JDBC and ODBC interfaces.


Explain the process of row deletion in HBase.

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

What are column families? What happens if you alter the block size of a column family on an already populated database?

The logical division of data is represented through a key known as a column family. Column families are the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.

Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores the locations of all the regions in the system.

Explain the three types of tombstone markers for deletion.

1. Family delete marker: It marks all the columns for deletion from a column family.
2. Version delete marker: It marks a single version of a column for deletion.
3. Column delete marker: It marks all versions of a column for deletion.

What is a checkpoint?

In brief, “checkpointing” is a process that takes the FsImage and edit log and compacts them into a new FsImage. Instead of replaying the edit log, the NameNode can then load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
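The merge step can be sketched as applying a list of edits to a snapshot. The edit-record format and paths here are made up for illustration; a real FsImage and edit log are binary structures:

```python
# Sketch of checkpointing: apply the edit log to the previous FsImage to
# produce a new FsImage, so the NameNode can start from the merged image
# instead of replaying every edit.

def checkpoint(fsimage, edit_log):
    """Return a new FsImage (path -> size) with all logged edits applied."""
    image = dict(fsimage)                    # start from the old snapshot
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = args[0]
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/data/a.txt": 128}
edits = [("create", "/data/b.txt", 256),
         ("delete", "/data/a.txt")]
new_image = checkpoint(fsimage, edits)
print(new_image)   # {'/data/b.txt': 256} -> edits no longer need replaying
```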

Give a brief on how Spark is good at low latency workloads like graph processing and Machine Learning.

Apache Spark stores data in memory for faster processing. Building a machine learning model may require many iterations of an algorithm and several conceptual steps to arrive at an optimized model, and graph algorithms repeatedly traverse all the nodes and edges of a graph. For such low-latency workloads, which need many iterations, keeping the data in memory greatly enhances performance.

What are the basic parameters of a mapper?

The primary parameters of a mapper are LongWritable and Text, which represent the input key/value types, and Text and IntWritable, which represent the intermediate output key/value types.

Explain the distributed Cache in MapReduce framework.

Distributed Cache is a significant feature provided by the MapReduce Framework, practiced when you want to share the files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files.

Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files, such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) where MapReduce jobs are running. Every DataNode gets a local copy of the file, which is delivered by the Distributed Cache.
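The classic use of this facility is a map-side join: each task loads the small cached file once and consults it for every record. The file format, field names, and data below are illustrative:

```python
# Sketch of the typical Distributed Cache use: a small read-only lookup
# file is shipped to every worker, loaded once during task setup, and
# consulted by the mapper (a map-side join).

def load_cache(lines):
    """Parse the cached lookup file, one 'id,name' pair per line."""
    return dict(line.split(",") for line in lines)

def mapper(record, lookup):
    """Join each (user_id, amount) record against the cached lookup."""
    user_id, amount = record
    return (lookup.get(user_id, "unknown"), amount)

cached_file = ["u1,alice", "u2,bob"]     # local copy on every DataNode
lookup = load_cache(cached_file)         # loaded once in task setup
out = [mapper(r, lookup) for r in [("u1", 10), ("u3", 5)]]
print(out)   # [('alice', 10), ('unknown', 5)]
```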

Explain the difference between RDBMS data model and HBase data model.

RDBMS is a schema-based database, whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

What is Row Key?

Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.

What are the different operational commands in HBase at record level and table level?

Record-level operational commands in HBase are – put, get, increment, scan and delete.

Table-level operational commands in HBase are – describe, list, drop, disable and scan.

Explain the actions followed by a Jobtracker in Hadoop.

1. The client application submits jobs to the JobTracker.
2. The JobTracker communicates with the NameNode to determine the data location.
3. Using the available slots and the proximity to the data, the JobTracker locates TaskTracker nodes.
4. It submits the work to the selected TaskTracker nodes.
5. When a task fails, the JobTracker notifies the client and decides the further steps.
6. The JobTracker monitors the TaskTracker nodes.

How does NameNode tackle DataNode failures?

The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send heartbeat messages, it is marked dead after a specific period of time.

The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
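Heartbeat-based failure detection boils down to a timestamp comparison. This sketch uses made-up node names and a made-up timeout (the real default is configurable and much longer):

```python
# Sketch of heartbeat-based failure detection: a DataNode whose last
# heartbeat is older than the timeout is marked dead, and its blocks
# become candidates for re-replication from the surviving replicas.

def find_dead_nodes(last_heartbeat, now, timeout=30):
    """Return DataNodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(node for node, t in last_heartbeat.items()
                  if now - t > timeout)

heartbeats = {"dn1": 100, "dn2": 70, "dn3": 99}   # seconds, illustrative
dead = find_dead_nodes(heartbeats, now=105)
print(dead)   # ['dn2'] -> its blocks get re-replicated elsewhere
```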

What happens when two clients try to access the same file in the HDFS?

HDFS supports exclusive writes only.

When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
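The lease logic described above can be sketched as a tiny grant/reject table. This is an illustration of the exclusive-write rule, not HDFS's actual lease manager; the class, paths, and client names are made up:

```python
# Sketch of the exclusive-write lease: the first client to open a file
# for writing gets the lease; a second open-for-write on the same file
# is rejected until the lease is released.

class LeaseManager:
    def __init__(self):
        self.leases = {}   # path -> client currently holding the write lease

    def open_for_write(self, path, client):
        """Grant the lease if free (or already held by this client)."""
        holder = self.leases.get(path)
        if holder is not None and holder != client:
            return False            # lease already granted to someone else
        self.leases[path] = client
        return True

    def release(self, path, client):
        """Give the lease back so another writer can take it."""
        if self.leases.get(path) == client:
            del self.leases[path]

lm = LeaseManager()
print(lm.open_for_write("/data/log.txt", "client-1"))   # True
print(lm.open_for_write("/data/log.txt", "client-2"))   # False, rejected
lm.release("/data/log.txt", "client-1")
print(lm.open_for_write("/data/log.txt", "client-2"))   # True now
```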
