Data Lakes Interview Questions and Answers for Freshers & Experienced

What is a source qualifier?

A source qualifier represents the rows that the Server reads when it executes a session. When a relational or flat file source definition is added to a mapping, it needs to be connected to a Source Qualifier transformation.

What is the difference between data cleaning and data transformation?

Data cleaning is the process of removing data that does not belong in your dataset. Data transformation is the method by which data is converted from one format or structure into another. Transformation processes are also referred to as data wrangling or data munging: transforming and mapping data from one "raw" form into another format for warehousing and analysis.
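To make the distinction concrete, here is a minimal Python sketch with invented records and field names: the cleaning step drops a row that does not belong, and the transformation step converts the surviving rows into a different format.

```python
from datetime import datetime

# Hypothetical raw records: the last one is invalid (missing amount).
raw = [
    {"order_id": "A1", "date": "03/14/2024", "amount": "19.99"},
    {"order_id": "A2", "date": "03/15/2024", "amount": "5.00"},
    {"order_id": "A3", "date": "03/16/2024", "amount": None},
]

# Data cleaning: remove records that do not belong in the dataset.
cleaned = [r for r in raw if r["amount"] is not None]

# Data transformation: convert the cleaned records into another format
# (ISO dates, numeric amounts) ready for warehousing and analysis.
transformed = [
    {
        "order_id": r["order_id"],
        "order_date": datetime.strptime(r["date"], "%m/%d/%Y").date().isoformat(),
        "amount": float(r["amount"]),
    }
    for r in cleaned
]

print(transformed)
```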

What do you mean by the slice operation, and how many dimensions are used in it?

A slice operation is a filtration process in a data warehouse: it selects one specific dimension from a given cube and produces a new sub-cube. Only a single dimension is used, so out of a multi-dimensional data warehouse, if one very specific dimension needs further analytics or processing, the slice operation is applied to that data warehouse.

What is the difference between a data warehouse and a data mart?

A data warehouse is a set of data isolated from operational systems; it is essentially a consolidated view built apart from the operational databases themselves. This helps an organization with its decision-making process. A data mart is a subset of a data warehouse that is geared to a particular business line. Data marts provide a stock of condensed data collected in the organization for analysis of a particular field or entity. In short, the data warehouse contains a whole variety of information, while a data mart is just a subset of that information focused on a particular business line or model.

What are conformed dimensions?

Conformed dimensions are dimensions that can be used across multiple data marts in combination with multiple fact tables. A conformed dimension has exactly the same meaning and contents wherever it is referenced, so it can be attached to multiple fact tables in multiple data marts within the same organization.

How are the time dimensions loaded?

Time dimensions are usually loaded by a program that loops through all possible dates appearing within the data; it is commonplace for 100 years to be represented in a time dimension, with one row per day.
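As an illustration, a small Python sketch that builds such a time dimension by looping over every date in a range; the column set shown is just one common choice, not a fixed standard.

```python
from datetime import date, timedelta

def build_time_dimension(start: date, end: date):
    """Generate one time-dimension row per day between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.strftime("%A"),
        })
        current += timedelta(days=1)
    return rows

dim_date = build_time_dimension(date(2024, 1, 1), date(2024, 1, 31))
print(len(dim_date), dim_date[0])
```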

Why do we override the execute method in Struts, and how is it part of the framework?

In Struts we develop the Action class, the ActionForm class, and other supporting classes. In the ActionForm class we can write a validate() method that returns an ActionErrors object, and the validation code goes in this method. If validate() returns null or an ActionErrors object of size zero, the web container calls the execute() method of the Action class. If it returns ActionErrors with a size greater than zero, execute() is not called; instead, the request is forwarded to the JSP or HTML page given by the input attribute in the struts-config.xml file.

Which one is faster: multidimensional OLAP or relational OLAP?

Multi-dimensional OLAP, also known as MOLAP, is faster than relational OLAP (ROLAP) for the following reasons.

In MOLAP, the data is stored in a multi-dimensional cube; the storage is not in a relational database but in proprietary, optimized formats. MOLAP also pre-computes and stores all the possible combinations of data in the multidimensional array.

What are the different types of SCDs used in data warehousing?

SCD stands for slowly changing dimension. It is a dimension in which data changes do not happen frequently or on any regular basis. There are three commonly used types. The first is SCD1, in which a new record replaces the original record: only one record exists in the database, the present data is replaced, and the new data takes its place.

SCD2 adds a new record to the dimension table. The record exists in the database alongside the current data, while the previous data is retained as audit or history rows.

SCD3 keeps the change in the same record: the original attribute is updated to the new value, and an additional column preserves the previous value, so both the current data and the prior data coexist in a single row.

What is a snapshot with reference to a data warehouse?

Snapshots are common in software, especially in databases. As the name suggests, a snapshot refers to a complete visualization of the data at the time of extraction. It occupies less space and can be used to back up and restore data quickly, so a snapshot is taken of a data warehouse whenever anyone wants to create a backup of it. Using the data warehouse catalog, a report is created, and the report is generated as soon as the session is disconnected from the data warehouse.

Explain the chameleon method utilized in data warehousing.

Chameleon is a hierarchical clustering algorithm that overcomes the limitations of earlier agglomerative methods by using dynamic modeling. It represents the data as a sparse k-nearest-neighbour graph, in which nodes represent data items and edge weights represent similarity, which allows it to operate successfully on large datasets. Clustering happens in two phases: the first phase partitions the graph into a large number of small sub-clusters, and the second phase uses an agglomerative hierarchical algorithm to repeatedly merge those sub-clusters, combining two clusters only when their relative inter-connectivity and relative closeness are both high. This lets Chameleon discover genuine clusters of arbitrary shape and size.

Explain the ETL cycles three-layer architecture.

ETL stands for extraction, transformation, and loading. There are three layers involved: the first is the staging layer, then the data integration layer, and the last is the access layer, one for each phase of the ETL cycle. The staging layer is used for extracting data from the various data structures of the sources. In the data integration layer, data from the staging layer is transformed and transferred to the database; here the data is arranged in hierarchical groups, often referred to as dimensions, facts, or aggregates (in a data warehousing system, the combination of fact and dimension tables is called a schema). Finally, the access layer is where the data is accessed and can be pulled for further analytics.
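The following Python sketch mirrors the three layers with made-up data and function names chosen purely for illustration; it is not a real ETL tool, just the shape of the flow.

```python
# Staging layer: extract raw data from the source systems as-is.
def extract_to_staging():
    return [
        {"cust": "  Alice ", "country": "us", "sales": "120"},
        {"cust": "Bob", "country": "UK", "sales": "80"},
    ]

# Data integration layer: transform staged rows and arrange them
# into a dimension and a fact.
def integrate(staged_rows):
    dim_customer, fact_sales = {}, []
    for key, row in enumerate(staged_rows, start=1):
        dim_customer[key] = {"customer_key": key,
                             "name": row["cust"].strip().title(),
                             "country": row["country"].upper()}
        fact_sales.append({"customer_key": key, "sales": float(row["sales"])})
    return dim_customer, fact_sales

# Access layer: expose the integrated data for analytics.
def total_sales_by_country(dim_customer, fact_sales):
    totals = {}
    for fact in fact_sales:
        country = dim_customer[fact["customer_key"]]["country"]
        totals[country] = totals.get(country, 0.0) + fact["sales"]
    return totals

dims, facts = integrate(extract_to_staging())
print(total_sales_by_country(dims, facts))
```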

What is the biggest difference between the Inmon and Kimball philosophies of data warehousing?

These are the two main philosophies we have in data warehousing. In the Kimball philosophy, the data warehouse is viewed as a constituency of data marts: data marts are focused on delivering business objectives for departments in an organization, and the data warehouse is built from conformed dimensions shared across those data marts, so a unified view of the enterprise is obtained from dimensional modeling at the departmental level. In the Inmon philosophy, the data warehouse is created subject area by subject area: development can start with, say, the data from the web store, and other subject areas are added to the warehouse as the need arises; point-of-sale (POS) data can be added later if management decides it is required. Put algorithmically, with the Kimball philosophy we first build data marts and then combine them to obtain the data warehouse, while with the Inmon philosophy we first create the data warehouse and then derive the data marts from it.

What is the level of granularity of a fact table?

A fact table is usually designed at a low level of granularity; this means we need to identify the lowest level of information that will be stored in the fact table. For example, "employee performance" is a very high level of granularity, while "employee performance daily" or "employee performance weekly" are lower levels, because the data is recorded much more frequently. Granularity is therefore the lowest level of information stored in the fact table; in the date dimension, the depth of the data level is the granularity, and the level could be year, month, quarter, period, week, or day, with day being the lowest level and year the highest. Determining granularity consists of two steps: determining the dimensions that are to be included, and determining the level within each dimension's hierarchy at which the information will be kept. These determinations are made as per the requirements.
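A short Python illustration of grain: facts stored at the daily level (the lowest grain here) can always be rolled up to a coarser grain such as month, but a table stored only at monthly grain could never recover the daily detail. The rows are invented.

```python
from collections import defaultdict

# Fact rows stored at daily grain (the lowest level of the date hierarchy).
daily_sales = [
    {"date": "2024-01-05", "employee": "E01", "units": 3},
    {"date": "2024-01-19", "employee": "E01", "units": 7},
    {"date": "2024-02-02", "employee": "E01", "units": 4},
]

# Rolling up from day to month is straightforward...
monthly = defaultdict(int)
for row in daily_sales:
    month = row["date"][:7]          # "YYYY-MM"
    monthly[(row["employee"], month)] += row["units"]

print(dict(monthly))
# ...but a table stored only at monthly grain could never recover the daily detail.
```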

What is the difference between agglomerative and divisive hierarchical clustering?

The agglomerative hierarchical clustering method allows clusters to be read from bottom to top, so the program always starts from the sub-components first and then moves to the parent in an upward direction. In contrast, divisive hierarchical clustering uses a top-to-bottom approach in which the parent is visited first and then the children. In the agglomerative method, each object initially forms its own cluster; these clusters are grouped to form larger clusters, and merging continues until all the single clusters are merged into one complete big cluster that contains the objects of the child clusters. In divisive clustering, however, the parent cluster is divided into smaller clusters, and division continues until each cluster contains a single object.
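If SciPy happens to be available, the agglomerative (bottom-up) side can be sketched in a few lines; SciPy does not ship a divisive variant, so that side is described in prose only. The sample points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A few 2-D points forming two loose groups, invented for illustration.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative (bottom-up): every point starts as its own cluster and
# the closest clusters are merged step by step into a tree.
merges = linkage(points, method="average")

# Cut the resulting tree into two flat clusters.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```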

What is the purpose of cluster analysis in data warehousing?

The purposes of cluster analysis include:

* Scalability: the system should be able to analyse the data regardless of its quantity.
* Ability to deal with different kinds of attributes, whatever data types are present in the dataset.
* Discovery of clusters with arbitrary shape.
* High dimensionality: handling data with many dimensions, i.e. more than 2-D.
* Ability to deal with noise, i.e. inconsistencies in the data.
* Interpretability.

What is a degenerate dimension?

In a data warehouse, a degenerate dimension is a dimension key in the fact table that does not have its own dimension table. Degenerate dimensions commonly occur when the fact table’s grain is a single transaction (or transaction line).

What is a Dimension Table?

A dimension table is a type of table that contains attributes of measurements stored in fact tables. It contains hierarchies, categories, and logic that can be used to traverse nodes.

What is the difference between E-R modelling and Dimensional modelling?

The basic difference is that E-R modeling has a logical and physical model while Dimensional modeling has only a physical model. E-R modeling is required for normalizing the OLTP database design, whereas dimensional modeling is required for de-normalizing the ROLAP/MOLAP design.

What are the types of Dimensional Modelling?

Types of Dimensional Modelling are listed below:

* Conceptual Modelling
* Logical Modelling
* Physical Modelling

What is dimensional data modelling?

Dimensional modeling is a set of guidelines to design database table structures for easier and faster data retrieval. It is a widely accepted technique. The benefits of using dimensional modeling are its simplicity and faster query performance. Dimension modeling elaborates logical and physical data models to further detail model data and data relationship requirements. Dimensional models map the aspects of every process within the business.

Dimensional Modelling is a core design concept used by many data warehouse designers to design data warehouses. In this design model, all the data is stored in two types of tables.

* Facts table
* Dimension table

The fact table contains the facts or measurements of the business, and the dimension table contains the context of measurements by which the facts are calculated. Dimension modeling is a method of designing a data warehouse.
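As a minimal sketch of the two table types, the following uses Python's built-in sqlite3 module; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: the context of the measurements.
cur.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT)""")

# Fact table: the measurements, keyed by the dimension.
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_date TEXT,
    quantity INTEGER,
    revenue REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Espresso", "Beverages"), (2, "Bagel", "Bakery")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, "2024-01-05", 3, 9.0), (2, "2024-01-05", 2, 5.0),
                 (1, "2024-01-06", 1, 3.0)])

# Facts are analysed through the context supplied by the dimension.
cur.execute("""SELECT p.category, SUM(f.revenue)
               FROM fact_sales f JOIN dim_product p USING (product_key)
               GROUP BY p.category""")
print(cur.fetchall())
```

The fact table holds the measurements (quantity, revenue) keyed by the dimension, while the dimension table supplies the descriptive context used to group and filter those measurements.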

Explain the main responsibilities of a data engineer

Data engineers have many responsibilities. They manage the source systems of data, simplify complex data structures, and prevent the duplication of data. Many times they also handle ELT and data transformation.

What is the core dimension?

A core dimension is a dimension table that is dedicated to a single fact table or data mart.

What is a conformed fact?

A conformed fact is a fact that is defined identically wherever it appears, so it can be used across multiple data marts and multiple fact tables.

Explain Hadoop distributed file system

Hadoop works with scalable distributed file systems like S3, HFTP FS, and HDFS. The Hadoop Distributed File System (HDFS) is modeled on the Google File System and is designed so that it can easily run on a large cluster of computers.

What are non-additive facts?

Non-additive facts are facts that cannot be summed up across any of the dimensions available in the fact table; ratios and percentages are typical examples. Such facts can still be useful if there is a change in the dimension.

Explain Snowflake Schema

A Snowflake Schema is an extension of a Star Schema in which the dimension tables are normalized, splitting the data into additional tables. It is called a snowflake schema because its diagram looks like a snowflake.

Explain FSCK

File System Check, or FSCK, is a command used by HDFS (for example, hdfs fsck /) to check for inconsistencies and problems in files, such as missing or corrupt blocks.

How to deploy a big data solution?

Follow these steps to deploy a big data solution:

1) Data ingestion: extract and integrate data from sources such as RDBMS, SAP, MySQL, and Salesforce.

2) Data storage: store the extracted data in either a NoSQL database or HDFS.

3) Data processing: process the data using frameworks like Pig, Spark, and MapReduce.
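As a hedged sketch of these three steps using PySpark (the JDBC URL, credentials, HDFS paths, and the orders/status columns are all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-orders").getOrCreate()

# 1) Data ingestion: pull rows from an RDBMS source (hypothetical URL and credentials).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "secret")
          .load())

# 2) Data storage: land the extracted data on HDFS.
orders.write.mode("overwrite").parquet("hdfs:///landing/orders")

# 3) Data processing: analyse the landed data with a processing framework (Spark here).
landed = spark.read.parquet("hdfs:///landing/orders")
landed.groupBy("status").count().show()
```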

Explain Star Schema

Star Schema, or Star Join Schema, is the simplest type of data warehouse schema. It is known as a star schema because its structure resembles a star: at the center there is one fact table surrounded by multiple associated dimension tables. This schema is used for querying large data sets.

What is the abbreviation of COSHH?

The abbreviation of COSHH is Classification and Optimization based Schedule for Heterogeneous Hadoop systems.

Explain the main methods of Reducer

* setup(): used to configure parameters such as the size of the input data and the distributed cache.
* reduce(): the heart of the reducer, called once per key with the associated list of values for that key.
* cleanup(): used to clean up temporary files and state once all keys have been processed.
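Hadoop's Reducer is a Java class, but its lifecycle can be mimicked with a small Python sketch to show when each method runs; this is only an analogy, not Hadoop's actual Python API.

```python
class WordCountReducer:
    def setup(self):
        # Configure parameters once, before any keys are processed.
        self.separator = "\t"

    def reduce(self, key, values):
        # Called once per key with all values associated with that key.
        print(f"{key}{self.separator}{sum(values)}")

    def cleanup(self):
        # Clean up temporary state after the last key has been processed.
        del self.separator

# Grouped intermediate data as it might arrive from the shuffle phase (invented).
grouped = {"lake": [1, 1, 1], "warehouse": [1, 1]}

reducer = WordCountReducer()
reducer.setup()
for key, values in grouped.items():
    reducer.reduce(key, values)
reducer.cleanup()
```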

What is a fact table?

A fact table contains the measurements, metrics, or facts of a business process. It is located at the middle of a star schema or a snowflake schema, and it is surrounded by dimension tables.

What is a Factless fact table?

A factless fact table is a fact table without any measures. Such a table contains only keys from different dimension tables.

What are the different types of SCD?

There are six types of Slowly Changing Dimension that are commonly described. They are as follows:

Type 0 – Fixed Dimension: the dimension never changes; no changes are permissible.

Type 1 – No History: the record is updated directly. There is no record of historical values, only the current state. A Type 1 SCD always reflects the latest values, and when changes in the source data are detected the dimension table is overwritten.

Type 2 – Row Versioning: changes are tracked as version records that can be identified by a current flag, active dates, and other metadata. If the source system does not store versions, it is usually the data warehouse load process that detects changes and manages them appropriately in the dimension table.

Type 3 – Previous Value Column: a change to a selected attribute is tracked by adding a column that holds the previous value, and this column is updated as further changes occur.

Type 4 – History Table: the dimension table shows only the current value, while all changes are tracked and stored in a separate history table.

Hybrid SCD – a hybrid SCD uses techniques from SCD Types 1, 2, and 3 to track change.

Only Types 0, 1, and 2 are widely used, while the others are applied for very specific requirements.
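The two most common approaches can be sketched in Python on an invented customer dimension: Type 1 simply overwrites the attribute, while Type 2 expires the current row and appends a new version.

```python
from datetime import date

dimension = [
    {"customer_key": 1, "customer_id": "C100", "city": "Pune",
     "current_flag": True, "valid_from": date(2020, 1, 1), "valid_to": None},
]

def scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city

def scd_type2(rows, customer_id, new_city, change_date):
    """Type 2: expire the current row and append a new version row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["current_flag"]:
            row["current_flag"] = False
            row["valid_to"] = change_date
            rows.append({"customer_key": len(rows) + 1,
                         "customer_id": customer_id, "city": new_city,
                         "current_flag": True, "valid_from": change_date,
                         "valid_to": None})
            break

# Exercise the Type 2 path: the old Pune row is closed, a Mumbai row is added.
scd_type2(dimension, "C100", "Mumbai", date(2024, 6, 1))
print(dimension)
```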

What are the four V's of big data?

The four V's of big data are:

* Velocity
* Variety
* Volume
* Veracity

What is a slowly changing dimension?

A slowly changing dimension (SCD) is one that appropriately manages changes to dimension members over time. It applies when the value of a business entity changes over time in an ad hoc manner.

List out various XML configuration files in Hadoop?

There are four main XML configuration files in Hadoop:

* mapred-site.xml
* core-site.xml
* hdfs-site.xml
* yarn-site.xml

Name two messages that NameNode gets from DataNode?

There are two messages which NameNode gets from DataNode. They are 1) Block report and 2) Heartbeat.

What are the steps that occur when Block Scanner detects a corrupted data block?

The following steps occur when the Block Scanner finds a corrupted data block:

1) First, when the Block Scanner finds a corrupted data block, the DataNode reports it to the NameNode.

2) The NameNode starts the process of creating a new replica using a correct replica of the corrupted block.

3) The corrupted replica is not deleted until the replication count of the correct replicas matches the replication factor.

Define Block and Block Scanner in HDFS

Blocks are the smallest unit of a data file; Hadoop automatically splits huge files into these small pieces.

The Block Scanner verifies the list of blocks present on a DataNode.

Name a few of the renowned ETL tools used in the industry.

Some of the major ETL tools are:

* Informatica
* Talend
* Pentaho
* Ab Initio
* Oracle Data Integrator
* Xplenty
* Skyvia
* Microsoft – SQL Server Integrated Services (SSIS)

Please provide a couple of data warehouse solutions that are widely used in the industry currently.

Several solutions are available in the market; some of the major ones are:

* Snowflake
* Oracle Exadata
* Apache Hadoop
* SAP BW4HANA
* Micro Focus Vertica
* Teradata
* AWS Redshift
* GCP BigQuery

What does data purging mean?

The name data purging is quite straightforward: it is the process, involving various techniques and strategies, of erasing data permanently from storage. Data purging is often contrasted with data deletion; they are not the same. Deleting data is more temporary, while purging permanently removes the data, which in turn frees up storage and memory space that can be utilized for other purposes. The purging process usually allows the data to be archived even though it is permanently removed from the main source, giving an option to recover it after the purge. The deletion process also removes data but does not necessarily involve keeping a backup, and it generally involves insignificant amounts of data.

What is the difference between view and materialized view?

A view accesses data from its underlying tables; it does not occupy storage space of its own, and changes in the corresponding tables are reflected in it. In a materialized view, pre-calculated data persists: it occupies physical space in storage, and changes in the corresponding tables are not reflected automatically. The materialized view concept came from database links, which were earlier mainly used for making copies of remote data sets; nowadays it is widely used for performance tuning.

The view always holds the real-time data, whereas Materialized view contains a snapshot of data that may not be real-time. There are a couple of methods available to refresh the data in the Materialized view.
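SQLite has no materialized views, but the contrast can be sketched with Python's sqlite3 module by comparing a plain view (always live) with a snapshot table that must be refreshed explicitly; the schema is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('east', 100), ('west', 50)")

# A view stores no data of its own; it always reflects the base table.
conn.execute("CREATE VIEW v_totals AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# A 'materialized' view emulated as a snapshot table that must be refreshed.
def refresh_materialized():
    conn.execute("DROP TABLE IF EXISTS m_totals")
    conn.execute("CREATE TABLE m_totals AS "
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

refresh_materialized()
conn.execute("INSERT INTO sales VALUES ('east', 25)")

print(conn.execute("SELECT * FROM v_totals").fetchall())  # live: east is now 125
print(conn.execute("SELECT * FROM m_totals").fetchall())  # stale: east is still 100
refresh_materialized()
print(conn.execute("SELECT * FROM m_totals").fetchall())  # refreshed: east is 125
```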

What is metadata and why is it used for?

The definition of metadata is "data about data". Metadata is the context that gives information a richer identity and forms the foundation for its relationship with other data. It can also be a helpful tool that saves time, keeps you organized, and helps you make the most of the files you work with. Structural metadata is information about how an object should be categorized to fit into a larger system with other objects; it establishes relationships with other files so that they can be organized and used in many ways.

Administrative Metadata is information about the history of an object, who used to own it, and what can be done with it. Things like rights, licenses, and permissions. This information is helpful for people managing and taking care of an object.

One point of data only gains its full meaning when it’s put in the right context. And the better-organized Metadata will reduce the searching time significantly.

What is ODS?

ODS stands for Operational Data Store. It stores only real-time operational data and does not store any long-term trend or historical data.

What are the differences between structured and unstructured data?

Structured data is neat, has a known schema, and fits into a fixed table. It uses DBMS storage, and scaling the schema is very difficult. Some of the protocols followed are ODBC, SQL, ADO.NET, etc.

Unstructured data, on the other hand, has no schema or structure. It is mostly unmanaged, is very easy to scale at runtime, and can store any type of data. Some of the protocols followed are XML, CSV, SMS, SMTP, JSON, etc.

What is Data Modelling?

Data Modelling, in the context of data engineering, is the step of simplifying an entity: a complex piece of software is simplified by breaking it up into diagrams and, further, into flowcharts. A flowchart is a simple representation of how a complex entity can be broken down into a simple diagram, so it gives a visual representation, an easier understanding of the complex problem, and better readability for a person who may not be proficient in that particular software.

Data modeling is generally defined as a framework for data to be used within information systems by supporting specific definitions and formats. It is a process used to define and analyze data requirements needed to support the business processes within the boundary of respective information systems in organizations. Therefore, the creation of data modeling involves experienced data modelers working closely with business stakeholders, as well as potential users of the information system.

What is a data model?

A data model is simply a diagram that displays a set of tables and the relationship between them. This helps in understanding the purpose of the table as well as their dependency. A data model applies to any software development that involves the creation of database objects to store and manipulate data. This includes transactional systems as well as data warehouse systems. The data model is being designed through three main stages; they are – conceptual data model, logical and physical data model in this order.

A conceptual data model is just a set of square shapes connected by a line. The square shape represents an entity, and the line represents a relationship between the entities. This is very high level and highly abstract, and key attributes should be here.

The logical data model expands the conceptual data model by adding more detail and distinguishing key attributes from non-key attributes. Key attributes are the attributes that define the uniqueness of an entity; in a time entity, for example, the date is a key attribute. The logical model also captures the relationship type, whether it is one-to-one, one-to-many, or many-to-many.

The physical data model looks similar to the logical data model; however, there are significant changes. Entities are replaced by tables, and attributes are referred to as columns. Tables and columns are terms specific to a database, whereas entities and attributes belong to logical data model design, so a physical data model always refers to tables and columns, and it must be compatible with the target database technology.

What is the difference between Database vs. Data lake vs. Warehouse vs Data Mart?

A database is typically structured with a defined schema, so structured data fits in a database. Items are organized as a set of tables with columns and rows, where columns represent attributes and rows represent an object or entity. A database is designed to be transactional and is generally not designed to perform data analytics. Some examples are Oracle, MySQL, SQL Server, PostgreSQL, MongoDB, Cassandra, etc. It is generally used to store and serve business functional or transactional data.

Data warehouse exists on top of several databases, and it is used for business intelligence. Data warehouse gathers the data from all of these databases and creates a layer to optimize data to perform analytics. It mainly stores the processed, refined, highly modeled, highly standardized, and cleansed data.

A data lake is a centralized repository for structured and unstructured data storage. It can be used to store raw data as-is, without any structure or schema, and there is no need to perform any ETL or transformation job before loading it. Any type of data can be stored here: images, text, files, videos, and even machine learning model artifacts, real-time and analytics output, etc. Structure is imposed only when the data is retrieved, so the schema is defined on read. A data lake mainly stores raw, unprocessed data; the main focus is to capture and store as much data as possible.
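A tiny Python illustration of the "schema on read" idea mentioned above: raw records are stored as-is, and a structure is imposed only at the moment they are read. The records and field names are invented.

```python
import json

# Raw, unmodelled records as they might sit in a data lake (stored as-is).
raw_lines = [
    '{"event": "click", "user": "u1", "ts": "2024-05-01T10:00:00"}',
    '{"event": "purchase", "user": "u2", "amount": 42.5}',
]

# Schema on read: decide which fields matter only when the data is consumed.
def read_events(lines, wanted_fields):
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in wanted_fields}

for event in read_events(raw_lines, ["event", "user", "amount"]):
    print(event)
```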

Data Mart lies between data warehouse and Data Lake. It’s basically a subset of filtered and structured essential data of a specific domain or area for a specific business need.

What are the key characteristics of a data warehouse?

Some of the major key characteristics of a data warehouse are listed below:

* Part of the data can be denormalized so that it is simplified and performance is improved.
* A huge volume of historical data is stored and used whenever it is needed.
* Query workloads are heavy, and large amounts of data are retrieved to support those queries.
* The data load is controlled.
* Ad hoc queries and planned queries are quite common when it comes to data extraction.

Why do we need a Data Warehouse?

The primary reason for a data warehouse is for an organization to gain an advantage over its competitors. It also helps the organization make smart decisions, and smarter decisions can be taken only if the executives responsible for taking them have data at their disposal.

What is data transformation?

Data transformation is the process or method of changing the format, structure, or values of data.

What is the difference between Data Warehousing and Data Mining?

A data warehouse is for storing data from different transactional databases through the process of extraction, transformation, and loading. Data is stored periodically. It stores a huge amount of data. A couple of use cases for data warehouses are product management and development, marketing, finance, banking, etc. It is used for improving operational efficiency and for MIS report generation and analysis purposes.

Whereas, Data Mining is a process of discovering patterns in large datasets by using machine learning methodology, statistics, and database systems. Data is analyzed regularly here. It analyses mostly on a sample of data. A couple of use cases are Market Analysis and management, identifying anomaly transactions, corporate analysis, risk management, etc. It is used for improving the business and making better decisions.

What is Data mining?

Data mining is a process of analyzing data from different perspectives, dimensions, patterns and summarizing them into meaningful content. Data is often retrieved or queried from the database in its own format. On the other hand, it can be defined as the method or process to turn raw data into useful information.

What is a Data warehouse?

A data warehouse is a central repository of all the data used by different parts of the organization. It is a repository of integrated information available for queries, analysis and can be accessed later. When the data has been moved, it needs to be cleaned, formatted, summarized, and supplemented with data from many other sources. And this resulting data warehouse becomes the most dependable source of data for report generation and analysis purposes.
