The Solo Career of Hive Metastore

Understanding the standalone Hive Metastore service

Felipe Miquelim
QuintoAndar Tech Blog

--

Photo by Jonathan Farber on Unsplash

Nowadays our metadata at QuintoAndar is scattered across, and accessed by, different tools such as AWS Redshift and AWS Athena, although both make use of the AWS Glue Data Catalog (which uses a Hive implementation underneath).

A unified metastore is a feasible approach that would reduce the complexity of managing and using two or more metastores. We’ve chosen Hive as our go-to solution because it is the right fit for our needs and adheres well to the rest of our stack, especially when the Hive Metastore is detached to run as a standalone service.

We want to share what we’ve been learning by using the Hive Metastore as the central metastore of our data hub, managing the metadata for both our data lake and our data warehouse.

🐝 Hive

Hive is an open-source data warehouse that enables reading, writing, and managing large datasets, which can reside in distributed storage such as HDFS and HBase, or in cloud storage such as AWS S3.

Hive is also user-friendly, since it allows querying files via its own language, the Hive Query Language (a.k.a. HiveQL), which is very similar to the ANSI SQL standard and therefore familiar to most developers.
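For illustration, a simple HiveQL query reads just like ANSI SQL (the table and column names below are hypothetical):

```sql
-- Aggregate a hypothetical listings table, exactly as one would in ANSI SQL
SELECT city,
       COUNT(*) AS total_listings
FROM listings
WHERE created_at >= '2021-01-01'
GROUP BY city
ORDER BY total_listings DESC;
```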

Hive Architecture 🚧

Basically, Hive’s main components are a processing engine (Hive Server, specifically HiveServer2 in its current version), a metadata storage and management unit (Hive Metastore), and its internal communication framework (the Thrift communication protocol). These main services work together to provide the data warehousing functionality. Hive’s default architecture runs all of these services, although in newer Hive releases it is possible to deploy only a subset of them.

Given that, for our data architecture we will focus on the internal components: the Thrift server, HiveServer2, and the Hive Metastore.

Thrift 📡

The Thrift communication protocol is an important concept both in Hive’s default architecture and in our custom solution (spoiler alert 😜 — it does not contain the Hive Server). That being said, let’s explore how Thrift works in the Hive ecosystem.

Thrift is a software framework that enables efficient and scalable communication between services, across different programming languages, by defining the data types and service interfaces used in that communication. The definition lives in a language-neutral file, which abstracts away the peculiarities of any given programming language. An engine that auto-generates the corresponding code for a target programming language is also an essential part of Thrift’s solution.

In other words, you can implement Thrift communication in your service by defining the data types and interfaces your software expects in a language-neutral definition file, which is then translated by the Thrift engine into executable objects in different languages such as Python, C, Java, and so on, allowing other services to easily establish the expected communication.

Example of a language-neutral Thrift mapping from the Hive Metastore, demonstrating both a data type and method declarations:

Example of Hive Metastore thrift mapping datatype and methods
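As a simplified illustration, a fragment of such a mapping might look like the following. It is modeled on the shape of Hive’s hive_metastore.thrift interface file and is heavily trimmed, not the full definition:

```thrift
// A data type: a struct with numbered, typed fields
struct Version {
  1: string version,
  2: string comments
}

// A service interface: methods the Metastore exposes to clients
service ThriftHiveMetastore {
  list<string> get_all_databases() throws(1:MetaException o1)
  Database get_database(1:string name) throws(1:NoSuchObjectException o1, 2:MetaException o2)
}
```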

A Python code translated by Thrift’s engine can be generated based on the previous mapping. The generated Python data types would look a lot like the following sample:

Python generated data types based on previous Thrift mapping
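To give an idea of that output, the generated data-type classes are plain Python classes. For a struct such as Version from the Hive Metastore mapping, the result resembles this simplified sketch (the real generated ttypes module also carries thrift_spec tables and read()/write() serialization methods):

```python
# Simplified sketch of a Thrift-generated Python data type.
class Version(object):
    """Attributes:
     - version
     - comments
    """

    def __init__(self, version=None, comments=None):
        self.version = version
        self.comments = comments

    def __repr__(self):
        values = ", ".join(
            "%s=%r" % (key, value) for key, value in self.__dict__.items()
        )
        return "%s(%s)" % (self.__class__.__name__, values)

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not (self == other)
```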

Moreover, auto-generated Python codes of main and auxiliary methods would have a similar structure to the following sample:

Python generated methods based on previous Thrift mapping
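The service methods follow the same pattern: each call is generated as a thin send/receive pair over the Thrift protocol. The sketch below mimics that structure with an illustrative protocol object (the names write_call and read_result are ours for illustration; the real generated client talks to a TProtocol/TTransport stack from the thrift runtime and also handles sequence ids and exception structs):

```python
# Simplified sketch of a Thrift-generated service client.
class GetAllDatabasesResult:
    """Illustrative stand-in for the generated *_result struct."""

    def __init__(self, success=None):
        self.success = success


class Client:
    """Mimics the shape of the generated ThriftHiveMetastore.Client."""

    def __init__(self, protocol):
        self._protocol = protocol

    def get_all_databases(self):
        # Each generated method is a thin wrapper: serialize the call,
        # then block on (and deserialize) the server's reply.
        self.send_get_all_databases()
        return self.recv_get_all_databases()

    def send_get_all_databases(self):
        # Illustrative: the real code writes a message via the protocol stack.
        self._protocol.write_call("get_all_databases", args=())

    def recv_get_all_databases(self):
        # Illustrative: the real code reads and decodes the result struct.
        result = self._protocol.read_result()
        return result.success
```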

Thrift in Hive 💬

When a HiveQL query needs to be executed in Hive, the flow of interactions would go as follows:

Default Hive architecture using HiveServer2. Core HiveServer2 components such as the Driver, Executor, and Compiler are abstracted into the HiveServer2 item, since we do not want to explore its internals deeply.

It is important to note that all interactions among Hive services are based on the Thrift framework. The architecture is clearly centralized on HiveServer2, since basically all communication in the Hive environment passes through it.

Since Hive 3.0, HiveServer2 has become an optional service in Hive’s ecosystem. The Hive Metastore offers its own Thrift mapping for data types and interactions, so we can connect directly to it via the Thrift framework:

Custom Hive Architecture after removing HiveServer2.

The green components are new features that must be implemented to obtain the full Hive metastore and querying functionality. Hive-metastore-client is QuintoAndar’s custom library for communicating directly with the Hive Metastore; it is introduced a few sections below. Also, there is a new communication path between the Metastore and the file storage to apply validations (when creating or updating metadata, for instance).

What does one gain or lose by not using HiveServer2? Among other features, the major drawback is not being able to query data using HiveQL. On the other hand, you have one less set of Hive services to manage, thus saving effort and costs. One may choose a full Hive implementation or a standalone Hive Metastore depending on the scenario.

🏠 QuintoAndar’s in-house Hive approach

The Hive Metastore checked a lot of boxes when we assessed which solution could unify QuintoAndar’s metadata, currently stored in the AWS Glue Data Catalog, which is tightly coupled to the AWS environment.

An in-house Hive solution would be able to centralize both the metastore and the query engine and unify our data-access layer, independently of which data one tries to access. We also wanted a solution that would give us full control over its infrastructure, to manage it as we see fit (scale up and down, control costs, integrate with the rest of our stack, etc.).

Is Hive all we need? 💛

Would a default Hive implementation solve all our problems? Sadly, no 😢. Hive’s querying engine is very limited and lacks the processing power we seek, and it does not integrate well with the rest of our data stack. As we saw above, we can split the Hive solution into the Metastore and HiveServer2 (the querying engine) and use the services standalone: the Hive Metastore supplies all of our needs; we just needed to find a substitute for the querying engine…

Trino 🐰 is a recent solution in the open-source universe. It fits our needs very well: it connects with the Hive Metastore and provides a parallel, distributed query engine. If you want to know more about how Trino works in our data stack, we explained our implementation in detail in this article.

Our approach is then based on the standalone Hive Metastore to store and manage metadata, and on Trino to query data and act as the gateway for our data access. Although Trino can run DDL (Data Definition Language) operations against the Hive Metastore (create schema, create table, alter column, add partitions, etc.), we seek to keep each service responsible for its own closed scope and to decouple our services as much as possible, which lets us manage and customize each solution tightly and also facilitates future software changes.

“ The lack of coupling means that the elements of our system are better isolated from each other and from change. This isolation makes it easier to understand each element of the system.” — Robert C. Martin (Uncle Bob) on Clean Code: A Handbook of Agile Software Craftsmanship

Thrift Client for Hive Metastore 💻

When working with a standalone Hive Metastore, a Thrift-based client is needed to interact with it directly.

Unfortunately, the standalone Hive Metastore is not widely used yet. It is far more common to find people using the default implementation, so there is no native Python library, nor even a well-consolidated one in the community.

The fact that there is no “go-to” Python library, and that Thrift makes it easy for developers to generate language-oriented code from its mapping files, encouraged us to take up the challenge and create the hive-metastore-client, a Python-oriented library for interacting directly with the standalone Hive Metastore.

We will soon cover, in another article, all the implementation details of the library, its features, and especially how it makes it easy for a programmer to manage metadata tête-à-tête with the Hive Metastore using Python. For now, you can check the project’s documentation and follow its internal guides and examples to get started! 🚀
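As a quick taste, connecting and listing databases looks roughly like this (based on the project’s README at the time of writing; check the documentation for the current API, and note that the host below is a placeholder and requires a live Metastore to run against):

```python
from hive_metastore_client import HiveMetastoreClient

HIVE_HOST = "<your-metastore-host>"  # placeholder
HIVE_PORT = 9083  # default Hive Metastore Thrift port

# The client is a context manager that opens and closes the
# Thrift connection to the standalone Hive Metastore.
with HiveMetastoreClient(HIVE_HOST, HIVE_PORT) as client:
    databases = client.get_all_databases()
    print(databases)
```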

Conclusion

The Hive project has been a well-known solution in data stacks for some years now; however, only recently, since Hive 3.0, has it become possible to split its services into standalone solutions. This was a turning point that enabled different approaches to data warehousing and metadata management, though it is quite hard to get the hang of it.

The standalone usage of the Hive Metastore isn’t that popular yet, so it is quite complicated to understand the range of possibilities it offers. Along the same road, Thrift does not have much documentation or many articles to help users get started.


Both complications make it a pain to start using the standalone Hive Metastore from scratch 😢. However, since we did not give up on it, it now supplies us with a very powerful and dynamic metastore solution.

Furthermore, by better understanding how Hive, the Hive Metastore, and Thrift work, we were able to fit them into our scenario with very good adherence to the rest of our stack, while also meeting our data governance and management needs.

Thanks to Lucas Fonseca, Ribaldo, and Kenji for the arch discussions.


Data Engineer @ QuintoAndar. ❤ Data, Football, NBA, NFL & Gaming Enthusiast ❤