The Anatomy of a Distributed Chat Application built with Elixir's Libcluster and Horde

A chat application built with Libcluster and Horde, which is highly resilient to node failures and ensures flawless user experience.

The Anatomy of a Distributed Chat Application built with Elixir's Libcluster and Horde

Introduction

ElixirChat is a Phoenix LiveView-based communication application that enables users to create rooms and allows chat with other users in those rooms. The app is deployed in a distributed environment that manages distributed state, but ensuring consistency throughout the active session of a chat can be challenging when nodes are transient. In this blog, we will discuss how we address this challenge by using LibclusterLibcluster, Horde SupervisorHorde Supervisor, and Horde Registry.

Clustering with Libcluster and Horde Registry

One way we address the challenge of maintaining a consistent state in a distributed environment is by using libcluster. Libcluster is a library that automatically creates clusters of Elixir replicas that are aware of each other. With libcluster's cluster strategies, BEAM nodes can communicate with each other, allowing Horde to manage persistent processes across all replicas. To facilitate clustering, a sample cluster formation using k8s is provided as part of the repo.

Cluster formation using k8s
def start(_type, _args) do
  topologies = [
    k8s_chat: [
      strategy: Elixir.Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "chat-svc-headless",
        application_name: "chat",
        polling_interval: 3_000          
      ]
    ]
  ]

ChatServer is where a chat room's message is saved/managed in memory. ChatServer is a process that gets created for each room and its reference is stored in the horde registry. These ChatServers (which are GenServers) can exist in any node i.e., a chat room can be present in any of the nodes in the libcluster cluster.

Dynamic Supervision with Horde Supervisor

To enable the supervisor to supervise long-running processes (i.e., chats) across the dynamic cluster and monitor them, we use a distributed dynamic supervisor called Horde Supervisor. It supervises child processes across multiple nodes in a cluster, and dynamically restructures the supervision tree based on node membership changes. When a node goes down, the signal is caught by the ChatServer’s Genserver and StateHandoff.handoff is invoked. This ensures that the application maintains its resilience even in the face of node failures.

code
defmodule Chat.StateHandoff do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  
  def child_spec(opts \\ []) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]}
    }
  end
        
  # join this crdt with one on another node by adding it as a neighbour
  def join(other_node) do
    # the second element of the tuple, { __MODULE__ , node} is a syntax that 
    # identifies the process named __MODULE__ running on the other node other_node 
    Logger.warn("Joining StateHandoff at #{inspect other_node}") 
    GenServer.call(__MODULE__, { :add_neighbours, { __MODULE__, other_node } })
  end

  # store a room_id and messages in the handoff crdt
  def handoff (room_id, messages) do
    GenServer.call(__MODULE__,{ :handoff, room_id, messages })
  end

  # pickup the stored messages for a room
  def pickup (room_id) do
    GenServer.call(__MODULE__, { :pickup, room_id }) 
  end
  

Distributed Process Registry with Horde Registry and State Management using DeltaCRDT

To manage distributed state, the state of chat rooms (messages) is kept in memory, and a handoff occurs when a replica containing the state fails due to node shutdown. The normal shutdown message sent by the node is intercepted, and the DeltaCRDTDeltaCRDT is handed off with the current state before being put to sleep so it can synchronize with other nodes. We use the DeltaCRDT to maintain the state, which is kept synchronized throughout the cluster using a Conflict-free Replicated Data Type (CRDT). DeltaCRDT library helps in synchronizing the state. We can imagine it as a distributed map that is stored/synchronized across all nodes.

Flow Diagram 1
Flow Diagram 2

The above diagrams show that each node has its respective instance of Horde Registry, DeltaCRDT, and a supervisor. The ChatServers' GenServers are registered in the Horde Registry to keep track of them across all replicas/nodes. DeltaCRDT is used for synchronizing states across multiple nodes. Finally, the supervisors monitor the chat processes across all nodes and provide quick recovery in case of node failures.

Conclusion

Overall, the combination of Libcluster, Horde Supervisor, and Horde Registry allows the ElixirChat application to maintain a consistent state across a dynamic cluster of nodes, making it highly resilient to node failures and ensuring a seamless user experience. With this setup, we are confident that we can provide a reliable chat service to our users while also being able to scale horizontally as needed.


Elixir
Libcluster
Horde Supervisor
Horde Registry
DeltaCRDT
GenServers
Elixir Chat
Chat Application

By Suresh Thiruppathi
June 14, 2023

Talk to us for more insights

What more? Your business success story is right next here. We're just a ping away. Let's get connected.