Hive's hive.fetch.task.conversion property lowers query latency by avoiding MapReduce overhead: for simple queries such as SELECT, FILTER, and LIMIT, Hive fetches the results directly instead of launching a MapReduce job.
It controls how the map output is distributed among the reducers. It is useful in the case of streaming data.
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
No, we cannot use the default metastore in sharing mode. To share it, the metastore must be hosted in a standalone "real" database such as MySQL or PostgreSQL.
We are using a precedence hierarchy for setting properties (highest to lowest):
1. The SET command in Hive
2. The command-line -hiveconf option
3. hive-site.xml
4. hive-default.xml
Basically, when we run Hive in embedded mode, it creates a local metastore, first checking whether the metastore already exists before creating it. This behavior is defined in the hive-site.xml configuration file by the property javax.jdo.option.ConnectionURL, with the default value jdbc:derby:;databaseName=metastore_db;create=true. Hence, to change the behavior, change the value to an absolute path, and the metastore will be used from that location.
Basically, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/, but only if the data is already present in HDFS. Using the EXTERNAL keyword, the user just has to define the table, which creates the table definition in the Hive metastore while leaving the data where it is.
CREATE EXTERNAL TABLE table_name (
  id INT,      -- illustrative column
  name STRING  -- illustrative column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data/table_name';  -- example HDFS path where the data already resides
Yes, by using the LOCATION keyword while creating the managed table, we can change the default location of managed tables. The only condition is that the user has to specify the desired storage path of the managed table as the value of the LOCATION keyword.
To add a new partition to the above table partitioned_transaction, we issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';
Basically, to configure the metastore in Hive, the hive-site.xml file has to be configured with the below property, which specifies the IP address (or hostname) and port of the metastore host:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1:9083</value>
</property>
Yes, one can run shell commands in Hive by prefixing the command with '!'.
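For example, from the Hive CLI (a minimal sketch; the directory listed is arbitrary):

```sql
-- run a shell command by prefixing it with '!'; the semicolon terminates it
!ls /tmp;
```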
Yes, you can overwrite Hadoop MapReduce configuration in Hive.
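For instance, a MapReduce setting can be overridden for the current session (a sketch; the value chosen is arbitrary):

```sql
-- override the number of reducers for this Hive session only
SET mapred.reduce.tasks=32;
```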
Usually, while reading, the user first communicates with the InputFormat, which connects to a RecordReader to read the records. Each record then goes to the SerDe: the deserializer of a custom SerDe uses an ObjectInspector to deserialize the record into row fields. Writing follows the reverse path, serializing rows before they are written out by the OutputFormat.
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You'll also want to explain the purpose of each option: for example, the JDBC Driver supports the JDBC protocol.
No. The name of a view must be unique among all the tables and views present in the same database.
To analyze the structure of individual columns and the internal structure of the row objects we use ObjectInspector. Basically, it provides access to complex objects which can be stored in multiple formats in Hive.
Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.
For example, we can partition the transaction log of an e-commerce website by month: January, February, etc. Then, any analytics regarding a particular month, say January, will have to scan only the January partition (sub-directory) instead of the whole table data.
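Such a table could be declared as follows (a sketch; the column names are illustrative):

```sql
-- each distinct value of month becomes a sub-directory of the table
CREATE TABLE partitioned_transaction (
  cust_id INT,
  amount FLOAT
)
PARTITIONED BY (month STRING);
```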
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.
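The difference looks like this in practice (a sketch; table and column names are illustrative):

```sql
-- per-reducer ordering: scales to huge datasets, output is only sorted within each reducer
SELECT cust_id, amount FROM transactions SORT BY amount DESC;

-- total ordering: all data funnels through a single reducer
SELECT cust_id, amount FROM transactions ORDER BY amount DESC;
```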
A Hive index is a Hive query optimization technique. Basically, we use it to speed up access to a column or set of columns in a Hive database, since with an index the database system does not need to read all the rows in the table to find the selected data.
Hive determines the bucket number for a row using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Basically, hash_function depends on the column data type. For the integer data type, the hash_function is simply:
hash_function (int_type_column) = value of int_type_column
For example, with 4 buckets, a row whose integer bucketing column has the value 27 goes to bucket 27 mod 4 = 3.
Basically, there are two main reasons for performing bucketing within a partition:
* A map-side join requires the data belonging to a unique join key to be present in the same bucket.
* It decreases the query time and makes the sampling process more efficient.
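A bucketed table can be declared like this (a sketch; names are illustrative):

```sql
-- rows are assigned to buckets by hash(cust_id) mod 4
CREATE TABLE bucketed_transaction (
  cust_id INT,
  amount FLOAT
)
CLUSTERED BY (cust_id) INTO 4 BUCKETS;
```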
ObjectInspector functionality in Hive is used to analyze the internal structure of the columns, rows, and complex objects. It allows to access the internal fields inside the objects.
A Hive variable is created in the Hive environment and can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
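For example (a sketch; the variable name, table, and threshold are illustrative):

```sql
-- define a variable, then reference it in a query with ${...}
SET hivevar:min_amount=100;
SELECT * FROM transactions WHERE amount > ${hivevar:min_amount};
```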
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION '<hdfs_path>'.
Hive's default read and write classes are:
1. TextInputFormat / HiveIgnoreKeyTextOutputFormat
2. SequenceFileInputFormat / SequenceFileOutputFormat
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to achieve low latency as HDFS read/write operations are time consuming processes.
Metastore in Hive stores the meta data information using RDBMS and an open source ORM (Object Relational Model) layer called Data Nucleus which converts the object representation into relational schema and vice versa.
In dynamic partitioning, the values of the partition columns are known only at runtime, in other words, while the data is being loaded into the Hive table.
* While loading data from an existing non-partitioned table, in order to improve sampling and thus decrease query latency.
* While we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
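Dynamic partitioning can be enabled and used like this (a sketch; the table and column names are illustrative):

```sql
-- allow all partition columns to be determined at runtime
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the value of the month column decides each row's partition
INSERT INTO TABLE partitioned_transaction PARTITION (month)
SELECT cust_id, amount, month FROM transaction_details;
```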
In a Hive table, Partitioning provides granularity. Hence, by scanning only relevant partitioned data instead of the whole dataset it reduces the query latency.
MapReduce mode is used when:
* Queries run on large datasets and must execute in a parallel way.
* Hadoop has multiple data nodes and the data is distributed across the different nodes.
* Large datasets have to be processed with better performance.
Basically, Hive organizes tables into partitions for the purpose of grouping similar types of data together on the basis of a column or partition key. Each table can have one or more partition keys to identify a particular partition. In other words, a Hive partition is a sub-directory in the table directory.
We should use SORT BY instead of ORDER BY, especially when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Hence, ORDER BY over a large number of inputs will take a lot of time to execute.
The metadata information along with the table data is deleted from the Hive warehouse directory if one drops a managed table.
Hive just deletes the metadata information regarding the table. Further, it leaves the table data present in HDFS untouched.
Local meta stores run on the same Java Virtual Machine (JVM) as the Hive service whereas remote meta stores run on a separate, distinct JVM.
Using the REPLACE COLUMNS option:
ALTER TABLE table_name REPLACE COLUMNS …
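A complete statement might look like this (a sketch; the new column list is illustrative):

```sql
-- replaces the table's entire column list with the one given here
ALTER TABLE table_name REPLACE COLUMNS (id INT, name STRING);
```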
By default, Hive offers an embedded Derby database instance backed by the local disk for the metastore. This is what we call the embedded metastore configuration.
In this configuration, the metastore service runs in the same JVM as the Hive service but connects to a database running in a separate JVM, either on the same machine or on a remote machine.
In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
Yes, the default managed table location can be changed in Hive by using the LOCATION '<hdfs_path>' clause.
Hive table data is stored in an HDFS directory by default: /user/hive/warehouse. This can be altered.
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. Basically, we use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
ALTER TABLE table_name RENAME TO new_name;
No, Hive does not provide row-level insert and update, so it is not suitable for OLTP systems.
There are two types: managed tables and external tables. In a managed table both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name.
Basically, we use the metastore to store the metadata information in Hive. It does so using an RDBMS and an open-source ORM (Object Relational Model) layer called DataNucleus, which converts the object representation into the relational schema and vice versa.
By default, the Hive table is stored in the HDFS directory /user/hive/warehouse. Moreover, one can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter present in hive-site.xml.
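For example, in hive-site.xml (a config fragment; the custom path shown is illustrative):

```
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/mywarehouse</value>
</property>
```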
Hive supports client applications written in Java, PHP, Python, C++, or Ruby by exposing its Thrift server.
Basically, Hive is what we call a data warehousing tool. Hive provides SQL-like queries to perform analysis, as well as an abstraction. Although Hive is not a database, it gives you a logical abstraction over databases and tables.