Apache Spark Interview Questions and Answers for Freshers & Experienced

What file systems does Spark support?

The following three file systems are supported by Spark:

1. Hadoop Distributed File System (HDFS).
2. Local File system.
3. Amazon S3
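
A minimal sketch of how this looks in practice, assuming a spark-shell session where the SparkSession spark is predefined; the paths and the bucket name are hypothetical. The storage system is selected purely by the URI scheme passed to the same API call.

val fromHdfs  = spark.read.text("hdfs://namenode:8020/data/events.log")   // HDFS
val fromLocal = spark.read.text("file:///tmp/events.log")                 // local file system
val fromS3    = spark.read.text("s3a://my-bucket/events.log")             // Amazon S3 (needs the hadoop-aws connector on the classpath)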

How can Apache Spark be used alongside Hadoop?

The best part of Apache Spark is its compatibility with Hadoop, which makes for a very powerful combination of technologies. Using Spark and Hadoop together lets us leverage Spark's fast processing on top of the best of Hadoop: HDFS for storage and YARN for resource management.

What is a Parquet file?

Parquet is a columnar storage format supported by many data processing systems. Spark SQL performs both read and write operations on Parquet files, and it is considered one of the best file formats for big data analytics so far.

The advantages of columnar storage are as follows:

1. Columnar storage limits IO operations.
2. It can fetch specific columns that you need to access.
3. Columnar storage consumes less space.
4. It gives better-summarized data and follows type-specific encoding.
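
A hedged read/write sketch follows, assuming a spark-shell session (spark predefined) and a hypothetical JSON input file with a name column; it illustrates the column-pruning advantage listed above.

val df = spark.read.json("file:///tmp/people.json")      // hypothetical input
df.write.parquet("file:///tmp/people.parquet")           // write in the columnar Parquet format
val parquetDF = spark.read.parquet("file:///tmp/people.parquet")
parquetDF.select("name").show()                          // only the 'name' column needs to be read from disk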

What do you understand by worker node?

Worker nodes are those nodes that run the Spark application in a cluster. The Spark driver program listens for incoming connections from the executors and addresses the work to the worker nodes for execution. A worker node is like a slave node: it gets work from its master node and actually executes it. The worker nodes do the data processing and report the resources used back to the master. The master decides what amount of resources needs to be allocated, and based on their availability, it schedules the tasks for the worker nodes.

Is there a module to implement SQL in Spark? How does it work?

Spark SQL is a module in Spark that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMSs, Spark SQL is an easy transition from earlier tools, while letting you extend the boundaries of traditional relational data processing.

Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.

The following are the four libraries of Spark SQL.

1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
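
A minimal Spark SQL sketch, assuming spark-shell and a hypothetical JSON file containing name and age fields; it shows a DataFrame being registered as a view and queried with plain SQL.

val people = spark.read.json("file:///tmp/people.json")
people.createOrReplaceTempView("people")                 // register the DataFrame as a temporary SQL view
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()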

How is machine learning implemented in Spark?

MLlib is a scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
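
As an illustration, here is a small clustering sketch using the spark.ml KMeans estimator on a tiny in-memory dataset (the data points are made up purely for demonstration); it assumes a spark-shell session.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(df)     // cluster the points into two groups
model.clusterCenters.foreach(println)                    // print the two learned cluster centres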

What is PageRank in GraphX?

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
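
A short sketch of the dynamic variant, assuming spark-shell and a hypothetical edge-list file where each line holds a "srcId dstId" pair (for example, follower relationships).

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "file:///tmp/followers.txt")
val ranks = graph.pageRank(0.0001).vertices              // run until ranks change by less than the 0.0001 tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)   // top five vertices by rank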

Is there an API for implementing graphs in Spark?

GraphX is the Spark API for graphs and graph-parallel computation. At a high level, it extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

The property graph is a directed multigraph that can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it, and the parallel edges allow multiple relationships between the same vertices.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
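
The following is a minimal sketch, with made-up vertices and edges, of building such a property graph in spark-shell and inspecting it.

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))   // (vertexId, property)
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(1L, 3L, "likes")))
val graph    = Graph(vertices, edges)

println(graph.numVertices)                               // 3
println(graph.numEdges)                                  // 3
graph.triplets.foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))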

How is Streaming implemented in Spark? Explain with examples.

Spark Streaming is used for processing real-time streaming data, and is thus a useful addition to the core Spark API. It enables high-throughput, fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data. Data from sources like Flume and HDFS is streamed in and finally pushed out to file systems, live dashboards, and databases. It is similar to batch processing in that the input data is divided into micro-batches.
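
The classic word-count example below sketches this, assuming spark-shell (run with at least two local cores, one for the receiver and one for processing) and a hypothetical text source on localhost port 9999.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))          // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)      // DStream fed by a socket receiver
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                           // print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()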

Define Spark DataFrames.

Spark DataFrames are distributed collections of data organized into named columns, similar to tables in SQL. A DataFrame is equivalent to a table in a relational database and is mainly optimized for big data operations.
DataFrames can be created from a variety of data sources such as external databases, existing RDDs, Hive tables, etc. Following are the features of Spark DataFrames (a short usage sketch follows the list):

* Spark DataFrames can process data ranging in size from kilobytes to petabytes, on a single node or on large clusters.
* They support different data formats like CSV, Avro, Elasticsearch, etc., and various storage systems like HDFS, Cassandra, MySQL, etc.
* By making use of the Spark SQL Catalyst optimizer, state-of-the-art optimization is achieved.
* Spark DataFrames can be easily integrated with major big data tools using Spark Core.
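
A minimal DataFrame sketch in spark-shell, using a tiny made-up dataset, to show the named-column, table-like behaviour described above.

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
df.printSchema()                                         // columns with names and inferred types
df.filter($"age" > 30).show()                            // column-based filtering, optimized by Catalyst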

Name the components of Spark Ecosystem.

1. Spark Core: Base engine for large-scale parallel and distributed data processing
2. Spark Streaming: Used for processing real-time streaming data
3. Spark SQL: Integrates relational processing with Spark’s functional programming API
4. GraphX: Graphs and graph-parallel computation
5. MLlib: Performs machine learning in Apache Spark

What do you understand by Transformations in Spark?

Transformations are functions applied on an RDD, resulting in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting those elements of the current RDD that pass the function argument.

val rawData = sc.textFile("path to/movies.txt")    // load the raw text file as an RDD of lines

val moviesData = rawData.map(x => x.split(" "))    // transformation: split each line into fields
As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.

Define Partitions in Apache Spark.

As the name suggests, a partition is a smaller and logical division of data, similar to a 'split' in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations. Everything in Spark is a partitioned RDD.
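
A small sketch of controlling and inspecting partitions in spark-shell; the partition counts are illustrative.

val rdd = sc.parallelize(1 to 1000, numSlices = 8)       // explicitly request 8 partitions
println(rdd.getNumPartitions)                            // 8
val repartitioned = rdd.repartition(4)                   // reshuffle into 4 partitions
val coalesced = rdd.coalesce(2)                          // shrink to 2 partitions, avoiding a full shuffle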

What is Executor Memory in a Spark application?

Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much of the worker node's memory the application will utilize.
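
A hedged example of setting these values programmatically for a standalone application (the application name and memory figures are made up); the same values can equally be passed to spark-submit via --executor-memory and --executor-cores.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-memory-demo")
  .config("spark.executor.memory", "4g")                 // per-executor JVM heap size
  .config("spark.executor.cores", "2")                   // cores per executor
  .getOrCreate()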

Under what scenarios do you use Client and Cluster modes for deployment?

<> If the client machines are not close to the cluster, then Cluster mode should be used for deployment. This avoids the network latency between the driver and the executors that would occur in Client mode. Also, in Client mode, the entire process is lost if the machine goes offline.

<> If the client machine is inside the cluster, then Client mode can be used for deployment. Since the machine is inside the cluster, there won't be issues of network latency, and since the maintenance of the cluster is already handled, there is no cause for worry in case of failure.

Explain the working of Spark with the help of its architecture.

Spark applications run in the form of independent processes that are coordinated by the driver program by means of a SparkSession object. The cluster manager, Spark's resource manager entity, assigns the tasks of running the Spark jobs to the worker nodes on a one-task-per-partition basis. Iterative algorithms repeatedly apply operations to the data and benefit from caching datasets across iterations. Every task applies its unit of operations to the dataset within its partition and produces a new partitioned dataset. These results are sent back to the main driver application for further processing or to be stored on disk.

How many forms of transformations are there?

There are two forms of transformations: narrow transformations, where each output partition depends on a single input partition (e.g., map, filter), and wide transformations, which require data to be shuffled across partitions (e.g., reduceByKey, groupByKey).
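
A tiny spark-shell sketch of both forms, using made-up numbers:

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)                            // narrow: no data movement between partitions
val pairs = doubled.map(n => (n % 3, n))
val sums = pairs.reduceByKey(_ + _)                      // wide: triggers a shuffle across partitions
sums.collect().foreach(println)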

Explain what accumulators are.

Accumulators are variables used to aggregate information across the executors.
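
For example, the sketch below (spark-shell, made-up records) uses a long accumulator to count malformed records on the executors and reads the total back on the driver after an action runs.

val badRecords = sc.longAccumulator("badRecords")
val lines = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = lines.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
parsed.count()                                           // the action triggers the computation
println(badRecords.value)                                // 1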

Explain what SchemaRDD is.

A SchemaRDD is similar to a table in a traditional relational database. A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive. (In later Spark releases, SchemaRDD was renamed DataFrame.)

Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Yes. MapReduce is a paradigm used by many big data tools, including Spark. Understanding it remains relevant as data grows bigger and bigger, and most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.

What is a lazy evaluation in Spark?

When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action; this is known as lazy evaluation, and it helps optimize the overall data processing workflow.

Do you need to install Spark on all nodes of YARN cluster?

No, because Spark runs on top of YARN and runs independently of where it is installed. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are several configurations for running on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.

What are the data formats supported by Spark?

Spark supports both raw files and structured file formats for efficient reading and processing. File formats like Parquet, JSON, XML, CSV, RC, Avro, TSV, etc. are supported by Spark.

What is YARN?

YARN is Hadoop's central resource management platform, and support for it is one of the key features of Spark, enabling scalable operations across the cluster. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool. Spark can run on YARN the same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

What are the languages supported by Apache Spark and which is the most popular one?

Apache Spark supports the following four languages: Scala, Java, Python, and R. Among these, Scala and Python have interactive shells for Spark: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most widely used of the four because Spark itself is written in Scala.

Define the functions of Spark Core.

Serving as the base engine, Spark Core performs various important functions like memory management, monitoring jobs, providing fault-tolerance, job scheduling, and interaction with storage systems.

What is the method for creating a DataFrame?

A DataFrame can be created from structured data files (such as JSON, Parquet, or CSV), Hive tables, external databases, or existing RDDs.
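
A hedged sketch of three common creation paths, assuming spark-shell; the Hive table name and file path are hypothetical, and the Hive example needs Hive support enabled in the session.

val fromHive = spark.sql("SELECT * FROM my_hive_table")                                      // from a Hive table
val fromCsv  = spark.read.option("header", "true").csv("file:///tmp/people.csv")             // from a structured data file
val fromRdd  = spark.createDataFrame(sc.parallelize(Seq(("Alice", 29), ("Bob", 31)))).toDF("name", "age")   // from an existing RDD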

Explain the concept of a sparse vector.

A vector is a one-dimensional array of elements. In many applications, however, the vector elements have mostly zero values; such vectors are said to be sparse.

What are the different cluster managers available in Apache Spark?

>> Standalone Mode: By default, applications submitted to the standalone mode cluster run in FIFO order, and each application tries to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
>> Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.
>> Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.
>> Kubernetes: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

What are receivers in Apache Spark Streaming?

Receivers are those entities that consume data from different data sources and then move them to Spark for processing. They are created by using streaming contexts in the form of long-running tasks that are scheduled for operating in a round-robin fashion. Each receiver is configured to use up only a single core. The receivers are made to run on various executors to accomplish the task of data streaming. There are two types of receivers depending on how the data is sent to Spark:

<> Reliable receivers: Here, the receiver sends an acknowledgement to the data source after successful reception of the data and its replication on the Spark storage.

<> Unreliable receivers: Here, no acknowledgement is sent to the data source.

Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

What are the steps involved in structured API execution in Spark?

Steps involved:

1. Writing DataFrame/Dataset/SQL code.
2. If valid code, Spark converts this to a Logical Plan.
3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.
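
These plans can be inspected directly: explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan. A small spark-shell sketch with made-up data:

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
df.filter($"age" > 30).select("name").explain(true)      // shows logical and physical plans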

Is it necessary to install spark on all the nodes of a YARN cluster when running Apache Spark on YARN ?

Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

What do you understand by lazy evaluation?

Lazy evaluation means that Spark execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur.

Transformations are lazy in nature. When some operation is called on an RDD, it does not execute immediately; Spark maintains a record of all the operations being called through a Directed Acyclic Graph. A Spark RDD can be thought of as the data that we build up through transformations. Since transformations are lazy in nature, we can execute an operation at any time by calling an action on the data. Thus, with lazy evaluation, data is not loaded until it is necessary.
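
A brief spark-shell illustration: nothing runs until the final action is called.

val nums = sc.parallelize(1 to 1000000)
val mapped = nums.map(_ * 2)                             // recorded in the DAG, nothing executed yet
val filtered = mapped.filter(_ % 3 == 0)                 // still nothing executed
println(filtered.count())                                // the action triggers execution of the whole lineage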

What is DAG in Spark?

DAG refers to Directed Acyclic Graph. Here, the word "directed" means that it is a finite directed graph with no directed cycles. There are finite numbers of vertices and edges, where each edge is directed from one vertex to another. It contains a sequence of vertices such that every edge is directed from earlier to later in the sequence.

What is the role of a Spark Driver?

Once any action is performed on an RDD, the Spark context hands the program to the driver. The driver creates the directed acyclic graph, or execution plan (job), for the program. Once the DAG is created, the driver divides it into a number of stages. These stages are then divided into a number of smaller tasks, and all these tasks are given to the executors for execution.

The Spark driver is responsible for converting a user program into units of physical execution called tasks. All Spark programs follow the same structure at a high level: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph of operations. When the driver runs, it converts this logical graph into a physical execution plan, i.e., tasks.

How many types of Deploy mode are there in Spark?

There are two types of deploy modes in Spark: client mode and cluster mode. If the driver component of the Spark job runs on the machine from which the job is submitted, the deploy mode is client mode. If the driver component does not run on the local machine from which the job is submitted, the deploy mode is cluster mode; in that case the Spark job launches the driver component inside the cluster. The mode is selected with the --deploy-mode flag of spark-submit.

What is Shuffling in Spark?

Shuffling is a process of redistributing data across partitions (aka repartitioning) that may or may not cause moving data across JVM processes or even over the wire (between executors on separate machines). It is the process of data transfer between stages.


Name different types of data sources available in SparkSQL.

<> JSON Datasets: Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame.
<> Hive Tables: Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.
<> Parquet Files: Parquet is a columnar format, supported by many data processing systems.

Can you use Spark to access and analyse data stored in Cassandra databases?

Yes, it is possible if you use Spark Cassandra Connector.

Explain about transformations and actions in the context of RDDs.

Transformations are functions executed on demand to produce a new RDD. Transformations are not materialized until an action follows them. Some examples of transformations include map, filter, and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
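
The actions named above, run on a tiny illustrative RDD in spark-shell:

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
println(nums.reduce(_ + _))                              // 15
println(nums.collect().toList)                           // List(5, 1, 4, 2, 3)
println(nums.first())                                    // 5
println(nums.take(3).toList)                             // List(5, 1, 4)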

What is a Sparse Vector?

A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
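
For example, with the spark.ml vector API (the values are illustrative), a length-6 vector with non-zero entries only at indices 1 and 4 can be written as:

import org.apache.spark.ml.linalg.Vectors

val sparse = Vectors.sparse(6, Array(1, 4), Array(3.0, 5.0))   // (size, indices, values)
println(sparse)                                          // (6,[1,4],[3.0,5.0])
println(sparse.toDense)                                  // [0.0,3.0,0.0,0.0,5.0,0.0]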

List some use cases where Spark outperforms Hadoop in processing.

<> Sensor Data Processing – Apache Spark's 'in-memory computing' works best here, as data is retrieved and combined from different sources.

<> Real-time Querying – Spark is preferred over Hadoop for real-time querying of data.

<> Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.

What is Shark?

Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data.

Explain how Spark runs applications with the help of its architecture.

This is one of the most frequently asked Spark interview questions, and the interviewer will expect you to give a thorough answer to it.

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

What are the important components of the Spark ecosystem?

>> Language support: Spark integrates with different languages for building applications and performing analytics. These languages are Java, Python, Scala, and R.

>> Core Components: Spark supports 5 main core components. These are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.

>> Cluster Management: Spark can be run in 3 environments. Those are the Standalone cluster, Apache Mesos, and YARN.

What is RDD?

RDD stands for Resilient Distributed Datasets. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of datasets (a short creation sketch follows the list):

<> Parallelized collections: Meant for running in parallel.
<> Hadoop datasets: These perform operations on file record systems on HDFS or other storage systems.
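
The two creation paths above, sketched in spark-shell (the HDFS path is hypothetical):

val parallelized = sc.parallelize(Seq(1, 2, 3, 4, 5))                    // parallelized collection
val hadoopBacked = sc.textFile("hdfs://namenode:8020/data/events.log")   // Hadoop dataset backed by HDFS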

What are the features of Apache Spark?

* High Processing Speed: Apache Spark achieves a very high data processing speed by reducing read-write operations to disk. It is almost 100x faster when performing in-memory computation and 10x faster when performing disk computation, compared to Hadoop MapReduce.

* Dynamic Nature: Spark provides 80 high-level operators which help in the easy development of parallel applications.

* In-Memory Computation: The in-memory computation feature of Spark, enabled by its DAG execution engine, increases the speed of data processing. It also supports data caching and reduces the time required to fetch data from the disk.

* Reusability: Spark codes can be reused for batch-processing, data streaming, running ad-hoc queries, etc.

* Fault Tolerance: Spark supports fault tolerance using RDDs. Spark RDDs are abstractions designed to handle failures of worker nodes, ensuring zero data loss.

* Stream Processing: Spark supports stream processing in real-time. The problem in the earlier MapReduce framework was that it could process only already existing data.

* Lazy Evaluation: Spark transformations done using Spark RDDs are lazy. Meaning, they do not generate results right away, but they create new RDDs from existing RDD. This lazy evaluation increases the system efficiency.

* Support for Multiple Languages: Spark supports multiple languages like R, Scala, Python, and Java, which provides flexibility and helps in overcoming the Hadoop limitation of developing applications only in Java.

* Hadoop Integration: Spark also supports the Hadoop YARN cluster manager thereby making it flexible.

* Rich Libraries: Supports Spark GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, etc.

* Cost Efficiency: Apache Spark is considered a more cost-efficient solution compared to Hadoop, as Hadoop requires large storage and data centers for data processing and replication.

* Active Developer Community: Apache Spark has a large developer base involved in continuous development. It is considered one of the most important projects undertaken by the Apache community.

Can you tell me what is Apache Spark about?

Apache Spark is an open-source framework engine known for its speed and ease of use in the field of big data processing and analysis. It also has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow; it can run either in cluster mode or standalone mode and can access diverse data sources like HBase, HDFS, Cassandra, etc.
