Gradient boosting is an ensemble method, similar to AdaBoost, that iteratively builds new trees, each one correcting the errors of the previously constructed trees by fitting the gradient of the loss function. The final model prediction is the weighted sum of the predictions from all the individual models.
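The boosting loop can be sketched in plain Python. This is a minimal illustration, assuming squared-error loss and one-split "stumps" as the weak learners; real implementations use full decision trees and libraries such as XGBoost or scikit-learn:

```python
# Minimal gradient-boosting sketch for squared-error regression.
# Each round fits a one-split "stump" to the negative gradient (the
# residuals); the final prediction is the learning-rate-weighted sum
# of all stumps.

def fit_stump(x, residuals):
    """Fit a one-split stump minimizing squared error over candidate thresholds."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, rounds=50, lr=0.1):
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    # Final model: weighted sum of all the weak learners.
    return lambda xi: sum(lr * s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
model = gradient_boost(x, y)
```

After 50 rounds the ensemble's prediction for the low group approaches 1.0 and for the high group approaches 5.0, illustrating how successive residual fits accumulate into the final prediction.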
A technique called simple random sampling can be used to randomly select a sample from a population of product users. Simple random sampling is an unbiased technique in which a subset of individuals is drawn at random from a larger data set, each individual having an equal probability of being chosen. It is usually done without replacement.
If you are using a library like pandas, you can use the .sample() method to perform simple random sampling.
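A minimal sketch of the idea with the standard library (the population and sample size here are made up for illustration):

```python
import random

# Population of 100 hypothetical product users, identified by IDs 1..100.
population = list(range(1, 101))

# Draw a simple random sample of 10 users without replacement:
# every user has the same probability of being chosen.
sample = random.sample(population, k=10)
```

With pandas, `df.sample(n=10)` performs the equivalent operation on the rows of a DataFrame.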
Bias represents the accuracy of a model. A model with high bias tends to be oversimplified and underfits the data. Variance represents the model's sensitivity to the data and its noise; a model with high variance overfits.
The bias-variance trade-off is therefore a property of machine learning models in which lowering the variance tends to raise the bias, and vice versa. In general, an optimal balance between the two can be found at which the total error is minimized.
A good data model –
It should be easily consumed
Should scale well with large data changes
Should offer predictable performances
Should adapt to changes in requirements
An N-gram is a contiguous sequence of n items from a given sample of speech or text. An N-gram language model is a type of probabilistic model that predicts the next item in a sequence from the previous (n-1) items.
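Extracting n-grams from tokenized text is a short exercise; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
bigrams = ngrams(words, 2)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

An n-gram language model would then count how often each (n-1)-gram is followed by each item and use those frequencies to predict the next item.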
K-means clustering is a method of vector quantization. With this method, objects are assigned to one of K groups, where K is chosen a priori.
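A bare-bones sketch of the algorithm (Lloyd's iteration) on 2-D points; the naive initialization below is for illustration only, as real libraries use smarter seeding such as k-means++:

```python
import math

def kmeans(points, k, iters=100):
    """Plain k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's group.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(pts, k=2)
```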
A hierarchical clustering algorithm is one that merges and splits existing groups, creating a hierarchical structure that records the order in which the groups were split or merged.
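The bottom-up (agglomerative) variant can be sketched briefly; this toy version assumes single linkage and 2-D points:

```python
import math

def agglomerative(points, k):
    """Agglomerative hierarchical clustering with single linkage:
    start with every point in its own cluster, then repeatedly merge
    the two closest clusters until only k remain. The recorded merge
    sequence is the hierarchy (a dendrogram)."""
    clusters = [[p] for p in points]
    merges = []  # the hierarchy: which clusters merged, in order
    while len(clusters) > k:
        # Single-linkage distance: the minimum distance between any
        # member of one cluster and any member of the other.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(
                math.dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]
            ),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(pts, k=2)
```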
If a DataNode is executing any task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first will be accepted, and the other task would be killed. Therefore, speculative execution is useful if you are working in an intensive workload kind of environment.
For example, suppose node A is running a task slowly. The scheduler keeps track of the available resources, and with speculative execution turned on, it launches a copy of the slower task on node B. If node A's task is still slower, the output from node B is accepted and node A's task is killed.
Every business is different and is measured in different ways, so there is no single right answer and you will never have "enough" data. The amount of data required depends on the methods you use; it must be sufficient to give you a good chance of obtaining significant results.
The structuring of unstructured data has been one of the essential reasons why Big Data revolutionized the data science domain. The unstructured data is transformed into structured data to ensure proper data analysis. In reply to such big data interview questions, you should first differentiate between these two types of data and then discuss the methods you use to transform one form to another. Emphasize the role of machine learning in data transformation while sharing your practical experience.
* User Interface: It calls the execute interface of the driver and builds a session for the query. The query is then sent to the compiler to create an execution plan for it.
* Metastore: It stores the metadata and transfers it to the compiler when a query is executed.
* Compiler: It creates the execution plan, which takes the form of a DAG of stages, where each stage is either a map or reduce operation, a metadata operation, or an operation on HDFS.
* Execution Engine: This engine bridges the gap between Hadoop and Hive and processes the query. It communicates with the metastore bidirectionally in order to perform various tasks.
WAL stands for Write Ahead Log. This file is attached to every Region Server in the distributed environment. It stores new data that has not yet been persisted to permanent storage, and it is used to recover data sets in case of failure.
DistCP is used for transferring data between clusters, while Sqoop is used for transferring data between Hadoop and RDBMS, only.
Checkpointing is a crucial element of maintaining filesystem metadata in HDFS. It creates checkpoints of the file system metadata by merging the fsimage file with the edit log. The new version of fsimage is called the checkpoint.
The Active NameNode runs and serves the cluster, whereas the Passive NameNode maintains the same data as the Active NameNode and takes over whenever the Active NameNode fails.
Three types of biases can happen through sampling, which are –
Selection bias
Undercoverage bias
Survivorship bias
Yes, the following are ways to change the replication of files on HDFS:
We can set the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, and that replication factor will apply to any new content that comes in.
If you want to change the replication factor for a particular file or directory, use:
$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path/of/the/file
Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv
The eval tool in Sqoop enables users to carry out user-defined queries on the corresponding database servers and check the outcome in the console.
When a file is stored in HDFS, it is broken down into a set of blocks; HDFS is unaware of what is stored inside the file. The default block size in Hadoop 2 is 128 MB, and this value can be configured for individual files.
Collaborative filtering is a set of techniques that forecast which items a particular consumer will like based on the preferences of many other individuals. It is essentially the technical term for asking people for recommendations.
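A tiny user-based sketch of the idea, with made-up ratings: score an unseen item for a user by averaging other users' ratings for it, weighted by how similar those users' tastes are.

```python
import math

# Ratings per user: user -> {item: rating}. Toy data for illustration.
ratings = {
    "alice": {"laptop": 5, "phone": 3, "tablet": 4},
    "bob":   {"laptop": 5, "phone": 3, "camera": 2},
    "carol": {"laptop": 1, "phone": 5, "camera": 5},
}

def cosine_sim(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine_sim(ratings[user], r)
        num += s * r[item]
        den += s
    return num / den if den else None

score = predict("alice", "camera")
```

Because alice's tastes match bob's more closely than carol's, the predicted camera rating is pulled toward bob's low score rather than carol's high one.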
Multiple users cannot write to the same HDFS file at the same time. While the first user is accessing the file, write requests from a second user are rejected, because the HDFS NameNode grants exclusive write access.
This is due to a performance issue with the NameNode. The NameNode is allocated a large amount of memory to store file metadata, and each file consumes metadata space regardless of its size. Metadata is used most efficiently when it describes a small number of large files; with a large number of small files, the NameNode's memory fills up with metadata while the underlying space is poorly utilized, which hurts performance.
CLASSPATH includes necessary directories that contain jar files to start or stop Hadoop daemons. Hence, setting CLASSPATH is essential to start or stop Hadoop daemons.
However, setting up CLASSPATH every time is not the standard that we follow. Usually CLASSPATH is written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run Hadoop, it will load the CLASSPATH automatically.
A/B testing is a comparative study, where two or more variants of a page are presented before random users and their feedback is statistically analyzed to check which variation performs better.
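The statistical comparison is often a two-proportion z-test on conversion rates; a minimal sketch, with hypothetical conversion counts:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for comparing two conversion rates using the pooled
    standard error. |z| > 1.96 corresponds to significance at the usual
    5% level (two-sided)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: variant A converts 200/1000 users, variant B 250/1000.
z = two_proportion_z(200, 1000, 250, 1000)
```

Here z is about 2.68, above the 1.96 cutoff, so under these made-up numbers variant B would be declared the significantly better variation.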
Data science is a broad spectrum of activities involving analysis of Big Data, finding patterns, trends in data, interpreting statistical terms and predicting future trends. Big Data is just one part of Data Science. Though Data Science is a broad term and very important in the overall Business operations, it is nothing without Big Data.
All the activities we perform in Data Science are based on Big Data. Thus Big Data and Data Science are interrelated and can not be seen in isolation.
DataNodes store data in HDFS; a DataNode is a node where the actual data of the file system resides. Each DataNode sends a heartbeat message to the NameNode to notify it that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode dead and starts replicating the blocks that were hosted on it onto other DataNodes. A BlockReport contains the list of all blocks on a DataNode, and using it the system replicates the blocks that were stored on the dead DataNode.
The NameNode manages the replication of the data blocks from one DataNode to another. In this process, the replicated data is transferred directly between DataNodes; it never passes through the NameNode.
A NameNode without any data doesn’t exist in Hadoop. If there is a NameNode, it will contain some data in it or it won’t exist.
There are two methods to overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, test_file is the file whose replication factor will be set to 2.
Method 2: On Directory Basis
In this method, the replication factor is changed on a directory basis, i.e., the replication factor for all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the directory; the replication factor for it and all the files in it will be set to 5.
Commodity hardware is the basic hardware resource required to run the Apache Hadoop framework. It is a common term used for affordable devices, usually compatible with other such devices.
Here's your chance to demonstrate your enthusiasm and career ambitions. Of course, your answer will depend on your current level of academic qualifications/certifications, as well as your personal circumstances, which may include family responsibilities and financial considerations. Therefore, respond forthrightly and honestly. Bear in mind that numerous courses and learning modules are readily available online. Analytics vendors have also established training courses aimed at those seeking to upskill themselves in this domain. Also, you could inquire about the company's policy on mentoring and coaching.
The standard path for Hadoop Sqoop scripts is –
The jps command is used to check if the Hadoop daemons are running properly or not. This command shows all the daemons running on a machine i.e. Datanode, Namenode, NodeManager, ResourceManager etc.
Commodity hardware is low-cost hardware characterized by lower availability and quality than enterprise-grade hardware. It includes enough RAM to run the many services that require it. Hadoop does not require a high-end hardware configuration or supercomputers; it can run on any commodity hardware.
The different configuration files in Hadoop are –
core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings common to MapReduce and HDFS. It also specifies the hostname and port.
mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
Distributed Cache is a feature of the Hadoop MapReduce framework for caching files needed by applications. The framework makes the cached files available to every map/reduce task running on the data nodes, so each task can access them as local files within its job.
File formats used with Hadoop include –
* Sequence files
* Parquet files
Blocks are the smallest contiguous units of data storage on a hard drive. In HDFS, blocks are distributed across the Hadoop cluster.
* The default block size in Hadoop 1 is: 64 MB
* The default block size in Hadoop 2 is: 128 MB
In Hadoop, Kerberos – a network authentication protocol – is used to achieve security. Kerberos is designed to offer robust authentication for client/server applications via secret-key cryptography.
When you use Kerberos to access a service, you have to undergo three steps, each of which involves a message exchange with a server. The steps are as follows:
Authentication – This is the first step wherein the client is authenticated via the authentication server, after which a time-stamped TGT (Ticket Granting Ticket) is given to the client.
Authorization – In the second step, the client uses the TGT for requesting a service ticket from the TGS (Ticket Granting Server).
Service Request – In the final step, the client uses the service ticket to authenticate themselves to the server.
One of the important big data interview questions. In HDFS, datasets are stored as blocks in DataNodes across the Hadoop cluster. When a MapReduce job executes, each individual Mapper processes a data block (an Input Split). If the data is not present on the node where the Mapper is executing, it must be copied over the network from the DataNode where it resides to the Mapper's DataNode.
When a MapReduce job has over a hundred Mappers and each Mapper DataNode tries to copy the data from another DataNode in the cluster simultaneously, it will lead to network congestion, thereby having a negative impact on the system’s overall performance. This is where Data Locality enters the scenario. Instead of moving a large chunk of data to the computation, Data Locality moves the data computation close to where the actual data resides on the DataNode. This helps improve the overall performance of the system, without causing unnecessary delay.
Hadoop has three common input formats:
Text Input Format – This is the default input format in Hadoop.
Sequence File Input Format – This input format is used to read files in a sequence.
Key-Value Input Format – This input format is used for plain text files (files broken into lines).
Rack Awareness is one of the popular big data interview questions. Rack awareness is the algorithm by which the NameNode uses the DataNodes' rack information to determine how data blocks and their replicas will be placed, preferring DataNodes on the same or nearby racks. During the installation process, the default assumption is that all nodes belong to the same rack.
Rack awareness helps to:
* Improve data reliability and accessibility.
* Improve cluster performance.
* Improve network bandwidth.
* Keep bulk data flows within a rack whenever possible.
* Prevent data loss in case of a complete rack failure.
The main difference between data mining and data profiling is as follows:
* Data profiling: It focuses on analyzing individual attributes, such as value range, distinct values and their frequency, occurrence of null values, data type, length, etc.
* Data mining: It focuses on dependencies, sequence discovery, relations between several attributes, cluster analysis, detection of unusual records, etc.
Machine learning is a category of algorithms that helps software applications become more accurate at predicting outcomes without being explicitly programmed. The basic idea is to build algorithms that receive input data and use statistical analysis to predict an output, updating those outputs as new data becomes available.
It enables the computers or the machines to make data-driven decisions rather than being explicitly programmed for carrying out a certain task.
Again, one of the most important big data interview questions. Here are six outlier detection methods:
* Extreme Value Analysis – This method determines the statistical tails of the data distribution. Statistical methods like ‘z-scores’ on univariate data are a perfect example of extreme value analysis.
* Probabilistic and Statistical Models – This method determines the ‘unlikely instances’ from a ‘probabilistic model’ of data. A good example is the optimization of Gaussian mixture models using ‘expectation-maximization’.
* Linear Models – This method projects the data into lower dimensions.
* Proximity-based Models – In this approach, data instances that are isolated from the data group are determined by cluster, density, or nearest-neighbor analysis.
* Information-Theoretic Models – This approach seeks to detect outliers as the bad data instances that increase the complexity of the dataset.
* High-Dimensional Outlier Detection – This method identifies the subspaces for the outliers according to the distance measures in higher dimensions.
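The first method above, extreme value analysis with z-scores, can be sketched in a few lines (the data and threshold are made up for illustration):

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold
    (extreme value analysis on univariate data)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / sd) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers = zscore_outliers(data, threshold=2.0)  # flags the extreme value 95
```

Note that a single large outlier inflates the standard deviation itself, which is why robust variants (e.g. using the median absolute deviation) are often preferred in practice.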
Data preparation is the method of cleansing and modifying raw data before processing and analyzing it. It is a crucial step before processing and usually requires reformatting data, making improvements to data, and consolidating data sets to enrich data.
Data preparation is a never-ending task for data specialists and business users. It is essential for putting data into context so it can yield insights, and it eliminates the biased results that come from poor data quality.
For instance, the data preparation process typically includes standardizing data formats, enriching source data, and/or eliminating outliers.
An open-ended question and there are many ways to achieve this.
Programming: Coding is the most common method of transforming unstructured data into a structured form. Its advantage is the freedom it provides: you can reshape the structure of the data in any way you need. Several programming languages, such as Python and Java, can be used.
Data/Business Tools: Many BI (Business Intelligence) tools support drag-and-drop functionality for converting unstructured data into structured data. One caution before using BI tools is that most of them are paid products, and you must be prepared to support that cost. For people who lack the experience and skills needed for the first option, this is the way to go.
* Persistent znodes: The default znode type in ZooKeeper is the persistent znode. It stays in the ZooKeeper server permanently until it is explicitly deleted.
* Ephemeral znodes: These are temporary znodes. An ephemeral znode is destroyed whenever its creator client logs out of the ZooKeeper server. For example, assume client1 created eznode1. Once client1 logs out of the ZooKeeper server, eznode1 is destroyed.
* Sequential znodes: A sequential znode is assigned a 10-digit number, in numerical order, appended to the end of its name. Assume client1 created a sequential znode sznode1. In the ZooKeeper server, it will be named like this: <znode name>0000000001.
If client1 generates another sequential znode, it will bear the following number in a sequence. So the subsequent sequential znode is <znode name>0000000002.
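Following the numbering in the example above, the 10-digit zero-padded suffix can be reproduced with simple string formatting (the helper below is purely illustrative, not a ZooKeeper API):

```python
# Reproduce the 10-digit, zero-padded counter appended to sequential
# znode names, using the counter values from the example above.
def sequential_name(base, counter):
    return f"{base}{counter:010d}"

first = sequential_name("sznode1", 1)   # 'sznode10000000001'
second = sequential_name("sznode1", 2)  # 'sznode10000000002'
```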
Outliers are data points that lie very far from the rest of the group and are not part of any cluster. They can distort the behavior of a model, leading to wrong predictions or very low accuracy, so outliers must be handled carefully; note that they may also contain useful information. The presence of outliers can mislead a Big Data or Machine Learning model, resulting in:
* Poor Results
* Lower accuracy
* Longer Training Time
Hadoop provides an option whereby a particular set of bad input records can be skipped while processing map inputs. The SkipBadRecords class in Hadoop offers an optional mode of execution in which bad records are detected and skipped over multiple attempts. Such records may occur due to bugs in the map function. Fixing these bugs manually may not always be possible, because they may reside in third-party libraries. With this feature, only a small amount of data is lost, which may be acceptable when dealing with very large amounts of data.
The chief configuration parameters that the user of the MapReduce framework needs to specify are:
<> Job’s input Location
<> Job’s Output Location
<> The Input format
<> The Output format
<> The Class including the Map function
<> The Class including the reduce function
<> JAR file, which includes the mapper, the Reducer, and the driver classes.