The --compression-codec parameter is generally used to get the output file of a Sqoop import in formats other than .gz.
In general, partitioning in Hive is a logical division of a table into related parts based on the values of partition columns such as date, city, or department. These partitions can be further subdivided into buckets, which provide extra structure to the data that may be used for more efficient querying.
Now let’s look at data partitioning in Hive with an example. Consider a table named Table1 that contains client details such as id, name, dept, and year of joining. Suppose we need to retrieve the details of all the clients who joined in 2014.
Without partitioning, the query has to examine the whole table for the necessary data. But if we partition the client data by year and save it in a separate file, the query only reads the relevant partition, which decreases the query processing time.
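As a rough illustration (outside Hive, using hypothetical in-memory Python structures rather than actual partition files), partitioning by year means a query for 2014 only touches that partition instead of scanning every record:

```python
# Hypothetical stand-in: the same client records stored flat vs. partitioned by year.
all_rows = [
    {"id": 1, "name": "A", "year": 2013},
    {"id": 2, "name": "B", "year": 2014},
    {"id": 3, "name": "C", "year": 2014},
    {"id": 4, "name": "D", "year": 2015},
]

partitioned = {}
for row in all_rows:
    partitioned.setdefault(row["year"], []).append(row)

def full_scan(year):
    # Unpartitioned: every row in the table is examined.
    return [r for r in all_rows if r["year"] == year]

def partition_pruned(year):
    # Partitioned: only the matching partition's rows are read.
    return partitioned.get(year, [])
```

Both functions return the same clients, but the pruned version never looks at the 2013 or 2015 data, which is the point of partitioning.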
This is one of the most common questions in any big data interview. The three modes are:
Standalone mode – This is Hadoop’s default mode, which uses the local file system for both input and output operations. Its main purpose is debugging. It does not use HDFS and also lacks the custom configuration required in the mapred-site.xml, core-site.xml, and hdfs-site.xml files.
Pseudo-distributed mode – Also known as the single-node cluster, the pseudo-distributed mode includes both NameNode and DataNode within the same machine. In this mode, all the Hadoop daemons will run on a single node, and hence, the Master and Slave nodes are the same.
Fully distributed mode – This mode is known as the multi-node cluster wherein multiple nodes function simultaneously to execute Hadoop jobs. Here, all the Hadoop daemons run on different nodes. So, the Master and Slave nodes run separately.
The command used for copying data from the Local system to HDFS is:
hadoop fs -copyFromLocal [source] [destination]
There are three core methods of a reducer:
setup() – Called once at the start of the task to configure parameters such as heap size, distributed cache, and input data size.
reduce() – The heart of the reducer, called once per key with the set of values that share that key; it defines the work to be done on those values.
cleanup() – Called once at the end of the task to clean up any temporary files or data.
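The reducer lifecycle described above can be sketched in plain Python. This is not Hadoop code; the class and method names merely mirror the contract (setup once, reduce once per key, cleanup once at the end):

```python
# A minimal, non-Hadoop sketch of the reducer contract.
class SumReducer:
    def setup(self):
        # In Hadoop, configuration or distributed-cache reads would happen here.
        self.calls = 0

    def reduce(self, key, values):
        # Called once per key with all values sharing that key.
        self.calls += 1
        return key, sum(values)

    def cleanup(self):
        # In Hadoop, temporary files would be deleted here.
        pass

def run_reduce_phase(reducer, grouped):
    reducer.setup()
    output = [reducer.reduce(k, vs) for k, vs in grouped.items()]
    reducer.cleanup()
    return output
```

For example, `run_reduce_phase(SumReducer(), {"a": [1, 2], "b": [3]})` yields one aggregated pair per key.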
MapReduce is a programming model created for distributed computation on big data sets in parallel. A MapReduce model has a map function that performs filtering and sorting and a reduce function, which serves as a summary operation.
MapReduce is an important part of the Apache Hadoop open-source ecosystem, and it’s extensively used for querying and selecting data in the Hadoop Distributed File System (HDFS). A variety of queries may be done depending on the broad spectrum of MapReduce algorithms possible for creating data selections. In addition, MapReduce is fit for iterative computation involving large quantities of data requiring parallel processing. This is because it represents a data flow rather than a procedure.
The more enhanced data we produce and accumulate, the higher the need to process all that data to make it usable. MapReduce’s iterative, parallel processing programming model is a good tool for creating a sense of big data.
Hadoop MapReduce is a software framework for processing enormous data sets, and it is the main component for data processing in the Hadoop framework. It divides the input data into several parts and runs a program on every part in parallel. The word MapReduce refers to two separate and distinct tasks. The first is the map operation, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce operation then consolidates those tuples based on the key and aggregates their values accordingly.
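The two phases can be illustrated with the classic word count, here as a toy single-process Python sketch (real Hadoop distributes these phases across nodes; the function names are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

# Map phase: emit a (word, 1) tuple for every word in the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: group the tuples by key (the "shuffle"), then sum
# each key's values into a single count.
def reduce_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(vals) for key, vals in grouped.items()}

counts = reduce_phase(map_phase(["big data big"]))
```

Here the map output `[("big", 1), ("data", 1), ("big", 1)]` is consolidated by key into per-word totals.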
You can deploy a Big Data solution in three steps:
<> Data Ingestion – This is the first step in the deployment of a Big Data solution. You begin by collecting data from multiple sources, be it social media platforms, log files, business documents, anything relevant to your business. Data can either be extracted through real-time streaming or in batch jobs.
<> Data Storage – Once the data is extracted, you must store the data in a database. It can be HDFS or HBase. While HDFS storage is perfect for sequential access, HBase is ideal for random read/write access.
<> Data Processing – The last step in the deployment of the solution is data processing. Usually, data processing is done via frameworks like Hadoop, Spark, MapReduce, Flink, and Pig, to name a few.
The following command will copy data from the local file system onto HDFS:
hadoop fs -copyFromLocal [source] [destination]
Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv
In the above syntax, the source is the local path and the destination is the HDFS path. Use the -f (force) option to overwrite a file that already exists at the destination in HDFS.
Storing many small files on HDFS generates a lot of metadata. Keeping all this metadata in RAM is a challenge, as each file, block, or directory takes about 150 bytes of metadata, so the cumulative size of the metadata becomes too large.
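A back-of-the-envelope estimate makes the problem concrete. The sketch below assumes the ~150 bytes per object figure quoted above; the file counts are illustrative, not measured values:

```python
# Rough NameNode memory estimate for file metadata, assuming
# ~150 bytes per object (file, block, or directory).
BYTES_PER_OBJECT = 150

def metadata_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million small files, one block each: roughly 3 GB of NameNode RAM
# consumed by metadata alone.
small_files = metadata_bytes(10_000_000)
```

This is why consolidating many small files into fewer large ones is a standard HDFS recommendation.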
The main task of the p-value is to determine the significance of results after a hypothesis test in statistics. It always lies between 0 and 1, and readers can draw conclusions with its help:
P-value > 0.05 denotes weak evidence against the null hypothesis, meaning the null hypothesis cannot be rejected.
P-value <= 0.05 denotes strong evidence against the null hypothesis, meaning the null hypothesis can be rejected.
P-value = 0.05 is the marginal value, indicating it is possible to go either way.
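One simple way to see where a p-value comes from is a permutation test: the p-value is the fraction of random label shuffles whose test statistic is at least as extreme as the observed one. The sketch below is a minimal illustration; the sample data and shuffle count are arbitrary choices, not from any real dataset:

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Two-sample permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # Re-split the pooled data as if group labels were random.
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

p = permutation_p_value([2.1, 2.4, 2.3], [3.5, 3.8, 3.6])
```

A small p here would mean random relabelings rarely produce a gap as large as the observed one, i.e. strong evidence against the null hypothesis.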
The different output formats in Hadoop are -
* TextOutputFormat: the default output format in Hadoop.
* MapFileOutputFormat: used to write the output as map files in Hadoop.
* DBOutputFormat: used for writing the output to relational databases.
* SequenceFileOutputFormat: used for writing sequence files.
* SequenceFileAsBinaryOutputFormat: used to write keys and values to a sequence file in binary format.
This is one of the most common big data interview questions. In the present scenario, Big Data is everything. If you have data, you have the most powerful tool at your disposal. Big Data Analytics helps businesses transform raw data into meaningful and actionable insights that can shape their business strategies. The most important contribution of Big Data to business is data-driven decision-making: it makes it possible for organizations to base their decisions on tangible information and insights.
This Big Data interview question dives into your knowledge of HBase and its working.
There are three main tombstone markers used for deletion in HBase. They are-
Family Delete Marker – For marking all the columns of a column family.
Version Delete Marker – For marking a single version of a single column.
Column Delete Marker – For marking all the versions of a single column.
There are three core methods of a reducer. They are-
setup() – This is used to configure different parameters like heap size, distributed cache and input data.
reduce() – A method that is called once per key with the associated values in the concerned reduce task; it is the heart of the reducer.
cleanup() – Clears all temporary files; called only at the end of a reducer task.
This Big Data interview question aims to test your awareness regarding various tools and frameworks.
Oozie, Ambari, Pig and Flume are the most common data management tools that work with Edge Nodes in Hadoop.
NameNode – Port 50070
Task Tracker – Port 50060
Job Tracker – Port 50030
NFS (Network File System): A protocol that enables clients to access files over the network. NFS clients can access files as if they lived on the local device, even though they reside on the disk of a networked device.
HDFS (Hadoop Distributed File System): A distributed file system is shared between multiple networked machines or nodes. HDFS is fault-tolerant because it saves various copies of files on the file system; the default replication level is 3.
The notable difference between the two is Replication/Fault Tolerance. HDFS was intended to withstand failures. NFS does not possess any fault tolerance built-in.
Benefits of HDFS over NFS:
Apart from fault tolerance, HDFS creates multiple replicas of files. This reduces the traditional bottleneck of many clients accessing a single file. In addition, since files have multiple replicas on various physical disks, read performance scales better than NFS.
Listed in many Big Data Interview Questions and Answers, the best answer to this is –
Open-Source – Hadoop is an open-sourced platform. It allows the code to be rewritten or modified according to user and analytics requirements.
Scalability – Hadoop supports scaling by adding new nodes and hardware resources to the cluster.
Data Recovery – Hadoop follows replication which allows the recovery of data in the case of any failure.
Data Locality – This means that Hadoop moves the computation to the data and not the other way round. This way, the whole process speeds up.
This is one of the most important Big Data interview questions to help the interviewer gauge your knowledge of commands.
To start all the daemons: ./sbin/start-all.sh
To shut down all the daemons: ./sbin/stop-all.sh
The JPS command is used for testing the working of all the Hadoop daemons. It specifically tests daemons like NameNode, DataNode, ResourceManager, NodeManager and more.
(In any Big Data interview, you’re likely to find one question on JPS and its importance.)
HDFS can replicate data on different DataNodes. So, should one node fail or crash, the data may still be accessed from one of the others.
To check the status of the blocks, use the command:
hdfs fsck <path> -files -blocks
To check the health status of FileSystem, use the command:
hdfs fsck / -files -blocks -locations > dfs-fsck.log
The following commands will help you restart NameNode and all the daemons:
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them with the ./sbin/start-all.sh command.
K-means is a partitioning technique in which objects are categorized into K groups (clusters). The algorithm assumes roughly spherical clusters, with the data points centered around each cluster's centroid and a similar variance across clusters.
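A minimal sketch of the algorithm for 2-D points, in plain Python (the function name and structure are illustrative; production code would use a library implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means for 2-D point tuples: assign, update, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # converged: centroids stopped moving
            break
        centroids = new
    return centroids, clusters
```

On two well-separated groups of points, the loop converges with one centroid per group, matching the spherical-cluster assumption described above.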
By default, each block in HDFS is divided into 128 MB. The size of all the blocks, except the last block, will be 128 MB. For an input file of 350 MB, there are three input splits in total. The size of each split is 128 MB, 128MB, and 94 MB.
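The split arithmetic above can be checked with a tiny helper (a hypothetical function for illustration, not an HDFS API):

```python
def input_splits(file_mb, block_mb=128):
    # Divide a file into HDFS-style blocks; the last block gets the remainder.
    splits = []
    remaining = file_mb
    while remaining > 0:
        splits.append(min(block_mb, remaining))
        remaining -= block_mb
    return splits
```

For a 350 MB file, `input_splits(350)` yields three splits of 128 MB, 128 MB, and 94 MB, as stated.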
The two types of metadata that a NameNode server holds are:
1. Metadata on disk - This contains the edit log and the FsImage
2. Metadata in RAM - This contains the information about the DataNodes
A Combiner is a mini version of a reducer that performs a local reduction on each mapper's output before it is sent over the network. It runs on the mapper's node, aggregating the map output locally and passing the result on to the reducer. This reduces the quantum of data that needs to be sent to the reducers, improving the efficiency of MapReduce.
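The effect can be sketched in plain Python (not Hadoop code; the function names are illustrative): the same aggregation a reducer performs is applied locally to one mapper's output, so fewer (key, value) pairs cross the network.

```python
def map_output(line):
    # One mapper's raw output: a (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Local reduction on a single mapper's output before the shuffle.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())

raw = map_output("big data big big")   # 4 pairs would be shuffled
combined = combine(raw)                # only 2 pairs are shuffled
```

The reducer still receives correct per-key totals, but the shuffle volume shrinks from four pairs to two.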
Apache HBase is a distributed, open-source, scalable, multidimensional NoSQL database written in Java. It runs on top of HDFS and offers Google BigTable-like capabilities to Hadoop. Its fault-tolerant nature helps in storing large volumes of sparse data sets, and it achieves low latency and high throughput by offering faster access to large datasets for read/write operations.
The sources of Unstructured data are as follows:
* Text files and documents
* Server, website, and application logs
* Sensor data
* Images, videos, and audio files
* Social media data
Data cleansing, also known as data scrubbing, is the process of removing data that is incorrect, duplicated, or corrupted. It is used to enhance data quality by eliminating errors and irregularities.
The most important advantage of Big Data analysis is that it helps organizations harness their data and use it to identify new opportunities. This leads companies to smarter business moves, more efficient operations, higher profits, and happier customers.
HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes.
1. Regular FileSystem: In regular FileSystem, data is maintained in a single system. If the machine crashes, data recovery is challenging due to low fault tolerance. Seek time is more and hence it takes more time to process the data.
2. HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, data can still be recovered from other nodes in the cluster. The time taken to read data can be comparatively higher, as reads involve the local disk plus coordination of data from multiple systems.
The three modes in which Hadoop can run are :
* Standalone mode: This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
* Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
* Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.
The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure, IBM InfoSphere, and Hortonworks (now merged with Cloudera).
FSCK or File System Check is a command used by HDFS. It checks if any file is corrupt, has its replica, or if there are some missing blocks for a file. FSCK generates a summary report, which lists the overall health of the file system.
As Big Data offers an extra competitive edge to a business over its competitors, a business can decide to tap the potential of Big Data as per its requirements and streamline its various business activities in line with its objectives.
So the approaches to deal with Big Data are to be determined as per your business requirements and the available budgetary provisions.
First, you have to decide what kind of business concerns you have right now, what questions you want your data to answer, and what your business objectives are and how you want to achieve them.
As far as the approaches regarding Big Data processing are concerned, we can do it in two ways:
1. Batch processing
2. Stream processing
As per your business requirements, you can process Big Data in batches daily or after a certain duration. If your business demands it, you can process it in a streaming fashion every hour, or even every 15 seconds or so.
It all depends on your business objectives and the strategies you adopt.
It is one of the most commonly asked big data interview questions.
The important Big Data analytics tools are –
* Rattle GUI
Big data analysis has become very important for businesses. It helps them differentiate themselves from others and increase their revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, which is why companies are adopting big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing it. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.
Hadoop is an open-source framework in Java that processes even large volumes of data on a cluster of commodity hardware. It also allows running many exploratory data analysis tasks on full datasets, without sampling. Features that make Hadoop an essential requirement for Big Data are –
* Data collection
* Runs independently
Big Data is an asset, while Hadoop is an open-source software program, which accomplishes a set of goals and objectives to deal with that asset. Hadoop is used to process, store, and analyze complex unstructured data sets through specific proprietary algorithms and methods to derive actionable insights. So yes, they are related but are not alike.
There are three types of Big Data.
Structured Data – It suggests that the data can be processed, stored, and retrieved in a fixed format. It is highly organized information that can be easily assessed and stored, e.g. phone numbers, social security numbers, ZIP codes, employee information, and salaries, etc.
Unstructured Data – This refers to the data that has no specific structure or form. The most common types of unstructured data are formats like audio, video, social media posts, digital surveillance data, satellite data, etc.
Semi-structured Data – This refers to data that combines aspects of both structured and unstructured formats: it does not conform to a fixed schema, but contains tags or markers that give it partial organization, e.g. JSON and XML files.
This is yet another Big Data interview question you’re most likely to come across in any interview you sit for.
Commodity Hardware refers to the minimal hardware resources needed to run the Apache Hadoop framework. Any hardware that supports Hadoop’s minimum requirements is known as ‘Commodity Hardware.’
Now that we’re in the zone of Hadoop, the next Big Data interview question you might face will revolve around the same.
HDFS is Hadoop’s default storage unit and is responsible for storing different types of data in a distributed environment.
HDFS has the following two components:
NameNode – This is the master node that has the metadata information for all the data blocks in the HDFS.
DataNode – These are the nodes that act as slave nodes and are responsible for storing the data.
YARN, short for Yet Another Resource Negotiator, is responsible for managing resources and providing an execution environment for the said processes.
The two main components of YARN are –
ResourceManager – Responsible for allocating resources to respective NodeManagers based on the needs.
NodeManager – Executes tasks on every DataNode.
<> Volume: The considerable amount of data stored in data warehouses reflects the volume. These large volumes of data, which may reach or exceed terabytes and petabytes, need to be examined and processed.
<> Velocity: Velocity is the pace at which data is produced in real time. To give a simple example, imagine the rate at which Facebook, Instagram, or Twitter posts are generated per second or per hour.
<> Variety: Big Data comprises structured, unstructured, and semi-structured data collected from varied sources. This different variety of data requires very different and specific analyzing and processing techniques with unique and appropriate algorithms.
<> Veracity: Data veracity basically relates to how reliable the data is, or in a fundamental way, we can define it as the quality of the data analyzed.
<> Value: Raw data is of no use or meaning on its own; only once it is converted into something valuable can we extract helpful information.
Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, and that’s why special tools and methods are used to perform operations on a vast collection of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows the companies to take better business decisions backed by data.
When we talk about Big Data, we talk about Hadoop. So, this is another Big Data interview question that you will definitely face in an interview.
Hadoop is an open-source framework for storing, processing, and analyzing complex unstructured data sets for deriving insights and intelligence.
Big Data can be defined as a collection of complex unstructured or semi-structured data sets which have the potential to deliver actionable insights.
The four Vs of Big Data are –
Volume – Talks about the amount of data
Variety – Talks about the various formats of data
Velocity – Talks about the ever-increasing speed at which the data is growing
Veracity – Talks about the degree of accuracy of data available