HakjDB, a custom in-memory key-value data store written in Go

Juuso Hakala | Last updated 11 Oct 2024

Introduction

This is the first technical blog post I have ever written. I have written a lot of technical documentation before, but never blog posts. The project this post covers already has plenty of documentation, but I wanted to write a separate post about how it works and how I built it. I hope you enjoy it and find it interesting. Let’s get started!

I started learning the Go programming language in late 2023. By that point I had already been programming for a while, but Go was a language that really intrigued me: I was interested in its ecosystem and its simple but powerful nature. I wanted to work on a large project to learn the language really well and improve my software development skills at the same time. This is when I started building HakjDB.

What is HakjDB?

HakjDB logo

HakjDB is a simple in-memory key-value data store built as an educational hobby project. It is written in the Go programming language.

HakjDB allows you to store key-value pairs of different data types in namespaces called databases. The data is stored in the server’s memory.

HakjDB uses a simple client-server model, and has a well-defined and documented gRPC API. It can be used as a temporary database, session storage or cache. It may not be suitable for advanced needs, and does not offer data persistence on disk.

Data is stored at keys of different types. Each data type allows you to store a different kind of data, such as string values or objects.

Instances are easily configurable with command line flags, environment variables and a simple YAML file.

The project consists of the following three components:

  • hakjserver
  • hakjctl
  • hakjdb-gui

hakjserver is the HakjDB server process that listens for requests from HakjDB clients. It is responsible for managing the server, databases, and key-value pairs. hakjctl is a CLI tool for controlling HakjDB servers from the command line. hakjdb-gui is a cross-platform desktop application for controlling HakjDB servers with a graphical user interface.

hakjserver and hakjctl are both written in Go. Their GitHub repository is here. hakjdb-gui is written in TypeScript and Rust using the Tauri framework. This is a separate project with its own GitHub repository here.

At the time of writing, the current HakjDB version is 1.2.0.

How does it work?

The following diagram shows the architecture and request flow:

HakjDB architecture diagram

A client request starts when a gRPC client sends it to the server’s gRPC API. gRPC uses RPCs (Remote Procedure Calls) for communication. If TLS is enabled, a TLS handshake is performed; if the certificates are invalid, the connection is denied. If authentication is enabled, the server checks that the client sent a JWT access token in the gRPC metadata and validates it. All requests are also logged according to the configured log level, and if a log file is enabled, the logs are written to the file as well. Logging and authorization are handled as middleware using gRPC interceptors, which centralize these operations in a single place without duplicating the logic in every RPC handler.

The gRPC API does not contain any server-side logic itself. It is a collection of API endpoint handlers that call the corresponding methods in the server’s internal API. The internal API handles logic such as creating new databases and retrieving key-value pairs by calling the database library. Validations are performed for every operation that needs them; for example, database names are validated before creating a new database, because they have a length limit. The internal API returns errors to the gRPC API, which converts them into the correct gRPC errors with the appropriate status codes.
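A validation like the one described above could be sketched as follows. The length limit used here is a made-up placeholder, not HakjDB's actual limit:

```go
package main

import (
	"errors"
	"fmt"
)

// maxDBNameLength is a hypothetical limit for illustration;
// the real limit in HakjDB may differ.
const maxDBNameLength = 64

// validateDBName checks a database name before creation, the kind
// of validation the internal API performs for operations that need it.
func validateDBName(name string) error {
	if name == "" {
		return errors.New("database name cannot be empty")
	}
	if len(name) > maxDBNameLength {
		return fmt.Errorf("database name exceeds %d characters", maxDBNameLength)
	}
	return nil
}

func main() {
	fmt.Println(validateDBName("mydb")) // <nil>
	fmt.Println(validateDBName(""))     // database name cannot be empty
}
```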

Databases are namespaces for storing key-value pairs. A namespace contains unique keys, meaning there cannot be two keys with the same name in the same database. The database library is embedded in the server process and performs its read and write operations in memory. All databases and the keys stored in them live in RAM, which allows fast read and write access to the data.

The HakjDB server can be configured with command line flags, environment variables, and a YAML file. Configurations are loaded from these sources when the server starts. It is also possible to dynamically reload configurations at runtime. This is useful if we want to change, for example, the log level from info to debug without having to restart the server. Without this, we would have to restart the server to reload configurations, and all the stored data would be lost.
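One common way to make a setting safely swappable at runtime is to store it behind an atomic value. This is only a sketch of the technique, not HakjDB's actual implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// logLevel is stored atomically so it can be replaced at runtime
// without restarting the server and without data races, even while
// other goroutines are reading it.
var logLevel atomic.Value

func setLogLevel(level string) { logLevel.Store(level) }

func getLogLevel() string { return logLevel.Load().(string) }

func main() {
	setLogLevel("info")
	fmt.Println(getLogLevel()) // info

	// Simulate a dynamic configuration reload at runtime.
	setLogLevel("debug")
	fmt.Println(getLogLevel()) // debug
}
```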

Startup process

The following diagram shows the server startup process:

HakjDB startup process

A server instance can be started from the command line, for example with:

./hakjserver

First the server creates the logger so it can output logs. After this it loads configurations from the sources. The configuration sources are applied in the following order of precedence:

  1. Command line flags
  2. Environment variables
  3. YAML configuration file

This means that configurations set with command line flags override those set with environment variables and the config file. Command line flags are the fastest way to try out different configuration options, but they may not be the best choice in real usage: if the log level is set with a command line flag when the server is started, it cannot later be reloaded at runtime with an environment variable or by modifying the config file.
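The precedence rule above can be expressed as a small resolution function. This is a simplified sketch of the idea, treating an empty string as "not set"; HakjDB's actual configuration loading is more involved:

```go
package main

import "fmt"

// resolve returns the first value that was set, mirroring the
// precedence: command line flag > environment variable > YAML file,
// with a default as the final fallback.
func resolve(flagVal, envVal, fileVal, def string) string {
	for _, v := range []string{flagVal, envVal, fileVal} {
		if v != "" {
			return v
		}
	}
	return def
}

func main() {
	// A flag wins over both the env variable and the config file.
	fmt.Println(resolve("debug", "info", "warn", "info")) // debug
	// With no flag set, the env variable wins over the file.
	fmt.Println(resolve("", "info", "warn", "error")) // info
}
```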

After loading configurations, the server is initialized and the configurations are processed. This step creates the server data structure and sets things like the server password if authentication was configured. After this the default database is created. Requests will use this database to store data if no other database is created or specified.

After server initialization, the gRPC server is configured and created. If TLS is enabled, the gRPC server is configured to perform TLS handshakes for new connections. A network socket listener is assigned to the server; it listens on all network interfaces of the host machine on the configured TCP/IP port. After this the gRPC server can start listening for incoming gRPC requests. If the maximum number of active client connections (default 1000) is exceeded, the server denies all incoming connections until existing connections are closed.
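A connection limit like this is often implemented with a buffered channel used as a semaphore. The following is a sketch of that technique, not HakjDB's actual code:

```go
package main

import "fmt"

// connLimiter caps the number of concurrent client connections using
// a buffered channel as a counting semaphore.
type connLimiter struct {
	sem chan struct{}
}

func newConnLimiter(max int) *connLimiter {
	return &connLimiter{sem: make(chan struct{}, max)}
}

// acquire reports whether a new connection is allowed. When the
// buffer is full, the limit is reached and the connection is denied.
func (l *connLimiter) acquire() bool {
	select {
	case l.sem <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot when a connection closes, allowing new
// connections to be accepted again.
func (l *connLimiter) release() { <-l.sem }

func main() {
	l := newConnLimiter(2)
	fmt.Println(l.acquire()) // true
	fmt.Println(l.acquire()) // true
	fmt.Println(l.acquire()) // false: limit reached
	l.release()
	fmt.Println(l.acquire()) // true: a slot was freed
}
```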

Database library

I programmed a custom key-value database library that the server uses to store data. There are many great key-value database libraries for Go, but I wanted to build my own. Why? Because the original idea of the project was to learn and better understand how key-value data stores work. By building my own library, I was able to learn more and improve.

The database library only stores data in memory; it does not support persisting data on disk. This is because the project was developed solely as an in-memory database. Databases are Go structs that contain metadata such as the name, description, and timestamps for when the database was created and updated. Keys are stored inside this struct in a separate struct that uses Go maps as its data structure.

// DB is a database used as a namespace for storing key-value pairs.
type DB struct {
	// The name of the database.
	name string
	// The description of the database.
	description string
	// Timestamp describing when the database was created.
	createdAt time.Time
	// Timestamp describing when the database was updated.
	updatedAt time.Time
	// The data stored in this database.
	storedData dbStoredData

	cfg DBConfig
	mu  sync.RWMutex
}

The database struct has methods for performing operations on the database, such as setting, getting, and deleting keys. Everything is guarded with mutexes to keep the data synchronized and to prevent race conditions when multiple goroutines access the data simultaneously.
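To make the locking pattern concrete, here is a minimal sketch of a mutex-guarded key-value store, modeled loosely on the struct above. The method names mimic HakjDB's RPC names, but the code itself is illustrative, not HakjDB's implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a minimal mutex-guarded key-value namespace.
type store struct {
	mu   sync.RWMutex
	data map[string]string
}

func newStore() *store {
	return &store{data: make(map[string]string)}
}

// SetString stores a value; writers take the exclusive lock.
func (s *store) SetString(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// GetString reads a value; readers share the read lock, so
// concurrent reads do not block each other.
func (s *store) GetString(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

// DeleteKey removes a key and reports whether it existed.
func (s *store) DeleteKey(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.data[key]
	delete(s.data, key)
	return ok
}

func main() {
	s := newStore()
	s.SetString("greeting", "hello")
	v, ok := s.GetString("greeting")
	fmt.Println(v, ok) // hello true
}
```

Using sync.RWMutex rather than a plain Mutex lets many goroutines read simultaneously while still giving writers exclusive access.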

gRPC API

The API is implemented with gRPC, and Protobuf is used as the IDL and data serialization format. The API is defined in .proto files in multiple protobuf packages and services. Packages are divided by feature: a package contains all the messages and services related to that feature. For example, database-related messages and RPCs are defined in a database package. The API is versioned as a whole, meaning all packages and proto files are versioned together, not separately.

HakjDB clients can access the HakjDB server via the API. It provides RPCs such as CreateDatabase, SetString, and GetString, which can be used to manage the server, databases, and keys. Data related to a request is sent in the request message, and additional request data can be sent with gRPC metadata. The database used to access keys can be specified in the gRPC request metadata; if it is not set, the server uses the default database.
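The metadata fallback behavior can be sketched with a plain map standing in for gRPC metadata (which is also a map from keys to lists of string values). The metadata key "database" and the default database name here are assumptions for illustration; check the HakjDB documentation for the actual names:

```go
package main

import "fmt"

// defaultDBName is an assumed name for the default database.
const defaultDBName = "default"

// targetDB picks the database from request metadata, modeled as a
// map like gRPC metadata, falling back to the default database when
// the client did not specify one.
func targetDB(md map[string][]string) string {
	if vals, ok := md["database"]; ok && len(vals) > 0 && vals[0] != "" {
		return vals[0]
	}
	return defaultDBName
}

func main() {
	fmt.Println(targetDB(map[string][]string{"database": {"sessions"}})) // sessions
	fmt.Println(targetDB(map[string][]string{}))                         // default
}
```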

I also wrote a shell script that generates Go source code from the proto files. Running the script regenerates the code only if the proto files have changed. With this script, I made sure the code is generated the same way every time.

protoc --go_out=. --go_opt=paths=source_relative \
  --go-grpc_out=. --go-grpc_opt=paths=source_relative \
  api/v1/**/*.proto

At the beginning of the project I was planning to build a REST API, but then I wanted to learn more about gRPC. gRPC has several benefits for this project and better overall performance than REST. The API is not meant to be called from browsers, so gRPC’s weak native browser support was not a problem. Using Protobuf allowed for a reliable and type-safe API contract, and gRPC in Go made implementing authorization and native TLS support straightforward.

CI/CD pipeline with GitHub Actions

I built a CI/CD pipeline for the project using GitHub Actions. It automates the process of building the binaries, running tests, and releasing new versions.

The CI (Continuous Integration) part is responsible for building hakjserver and hakjctl and running tests against them. The project contains unit and integration tests that verify the main features work correctly. The CI workflow is triggered when a commit is pushed to the main branch.

The CD (Continuous Delivery) part is responsible for building the release binaries for supported platforms, building and pushing Docker images to Docker Hub, and making a new GitHub release with the built binaries. This process uses a tool called GoReleaser to build the binaries for Windows, Linux, and macOS. The CD workflow triggers when a new Git tag, such as v0.1.0, is pushed to the remote repository.

Example of releasing a new version:

# Create a new Git tag v0.1.0
git tag -a v0.1.0 -m "Release v0.1.0"

# Push it to the remote repository
git push origin v0.1.0

I released quite a few versions of the project over the months, so this CI/CD pipeline turned out to be very useful, especially the automated release process, which saved me a lot of time. All I had to do was push a new tag and watch the release happen by itself. It took a while to write the workflow files and set everything up, but it was worth it and I learned a lot.

Docker

HakjDB has support for Docker containers. Images with multiple tags are available on Docker Hub here. The images are currently built only for the amd64 architecture.

The image with the latest tag can be pulled with the following command:

docker pull hakj/hakjdb

There are two different Dockerfiles in the project. The first uses Alpine Linux as its base image and the other Debian Bookworm. The Alpine Linux version results in a much smaller image and has fewer security vulnerabilities.

The Alpine Linux image can be pulled with:

docker pull hakj/hakjdb:alpine

It is also possible to pull a specific version:

docker pull hakj/hakjdb:v1.2.0-alpine

Development

I’ve been working on this project since late December 2023, which is about nine months at the time of writing. The project grew from a small prototype into a large system over the months. I added new features and released new versions little by little. My main motivation was to better understand how popular key-value data stores like Redis and etcd work, and how they are designed and built. I kept refactoring the code over and over again as I learned new Go concepts and best practices. I also wrote a lot of documentation and tried to document every possible detail. The gRPC API grew from a small one-package API into a well-defined, documented, and versioned multi-package API.

I learned a ton while working on this and I am very happy with the current results. I don’t know if I will keep working on the project, because I have now implemented almost all the features I originally wanted. However, it would be interesting to develop a client library so it can be used in backend applications.

The project could still be extended a lot, as it lacks many features. For example, keys cannot be retrieved using patterns. Let’s say there are 10,000 keys in a database and we want to retrieve all the keys that start with some pattern; right now the only option is to fetch all 10,000 keys. Keys also do not have a TTL (time to live), so they never expire, which is a useful feature in production-ready key-value databases. Different database users, roles, and permissions for those roles would also be interesting to implement, enabling RBAC (role-based access control). Permissions could include read and write access to specific keys and databases.
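As a thought experiment, a naive version of pattern-based retrieval could look like the sketch below: a full scan that keeps only the keys matching a prefix. This is not a HakjDB feature, just an illustration of the idea; a real implementation could avoid the full scan with a sorted index or a trie.

```go
package main

import (
	"fmt"
	"strings"
)

// keysWithPrefix scans all keys and returns the ones that start with
// the given prefix. This is O(n) in the number of keys, which is why
// production stores often back pattern queries with better structures.
func keysWithPrefix(keys map[string]string, prefix string) []string {
	var matched []string
	for k := range keys {
		if strings.HasPrefix(k, prefix) {
			matched = append(matched, k)
		}
	}
	return matched
}

func main() {
	keys := map[string]string{
		"user:1": "alice",
		"user:2": "bob",
		"cfg:x":  "y",
	}
	fmt.Println(len(keysWithPrefix(keys, "user:"))) // 2
}
```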

Thank you for reading! You can check the project source code, documentation and details here.