(Re)setting up my home Raspberry Pi cluster - part 1

Introduction

A while back I was working on setting up 3 RPIs based on this guide. Back then I was too excited to play around with new hardware and set everything up as soon as possible. I did not really account for the fact that stuff might break and that there would be no easy way to restore it. Of course, I could have followed the same guide from scratch, but how many times would I have to repeat the same process? Dreading it, I put this project on a shelf.

In my day job, like any good SRE, I am guided by the DevOps principle of having repeatable processes. That’s why, in my opinion, tools like Terraform, CloudCDK, Packer and Ansible are so popular and relevant right now.

So I thought that this holiday season I would slow down a bit, re-assemble my cluster, and build a proper, repeatable process around it.

It is worth noting that this project is heavily influenced by the original guide I used the first time around. The use of Ansible here is inspired by www.ansiblefordevops.com and other Jeff Geerling projects.

What do we have?

  1. 3 Raspberry Pi 4 Model B boards
  2. 3 PoE hats
  3. Uctronics Complete Enclosure for Raspberry Pi Clusters
  4. 2 8cm cooling fans
  5. Netgear GS305P unmanaged switch
  6. 3 Ethernet cables
  7. A few old USB 3.0 and USB-C cables

What is the end goal?

The idea is to have one RPI (we will call it the head node until a proper name is given) acting as a router and hosting a DHCP server. The other RPIs would get IPs in a dedicated subnet and communicate with the outside world through the head node. I still want to preserve the possibility of connecting to the nodes using their home network IPs. This way I keep access even if I break the head node.

ascii-rpi-topology.png

Fans setup

The last time I was setting up this cluster, I cut some corners because I did not have a soldering station, but this time I was ready to set things right. The two fans came with standard pin connectors. Using the GPIO pins to power them is not an option, as the PoE hat takes all of them. So I used some of my old USB cables and ended up with a pair of USB-powered fans.

fans.jpeg

Switch with PoE

netgear-switch.jpeg

Powering the RPIs over PoE allowed for a “clean”-looking setup without extra power cables and power sources.

RPIs setup

The setup is fairly straightforward. The PoE hats were mounted on top of the GPIO pins.

rpi-with-poe.jpeg

For this project, I have decided to use Raspberry Pi OS Lite as I am planning to have a Kubernetes cluster here later on.

So it was very convenient to use Raspberry Pi Imager to configure WLAN access and hostnames.

Node naming

The hostnames I have decided to use are Caspar, Balthasar and Melchior. These are the names of the Magi from the Gospel of Matthew, but I chose them after the three computers (also collectively referred to as MAGI) in the anime series Evangelion.

magi-eva.webp

Booting up the first node

first-rpi-connected.jpeg

As soon as the SD card was inserted and the Ethernet cable connected, the head node, Caspar, booted up.

Now, to connect to Caspar, we need its IP first. As I had managed to forget the admin password for my router, I had to brush up on my ARP knowledge using the handy arp-scan utility:

$ sudo arp-scan --interface=en0 192.168.0.0/24
Password:
Interface: en0, type: EN10MB, MAC: <REDACTED>, IPv4: 192.168.0.249
Starting arp-scan 1.10.0 with 256 hosts (https://github.com/royhills/arp-scan)
192.168.0.1	50:91:e3:b9:80:8c	(Unknown)
192.168.0.90	12:b4:c6:3a:3d:44	(Unknown: locally administered)
192.168.0.121	8e:f6:38:a3:fa:53	(Unknown: locally administered)
192.168.0.177	26:41:24:e0:3b:b7	(Unknown: locally administered)
192.168.0.202	e4:5f:01:fc:a9:94	Raspberry Pi Trading Ltd

526 packets received by filter, 0 packets dropped by kernel
Ending arp-scan 1.10.0: 256 hosts scanned in 1.833 seconds (139.66 hosts/sec). 4 responded

Knowing Caspar’s IP, we can test out some Ansible.

For this, I have created inventory.ini with the following content:

[magi]
192.168.0.202

Using this file, we can get free memory on the device using the following command:

$ ansible magi -a "free -h" -u pi

192.168.0.202 | CHANGED | rc=0 >>
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       174Mi       3.5Gi       1.2Mi       139Mi       3.5Gi
Swap:          511Mi          0B       511Mi

Okay, now that we know that Caspar is alive and kicking, we can start implementing its router capabilities.

DHCP setup

We will use a playbook to set up DHCP server which we will call dhcp-setup-playbook.yaml.

First, we want to run apt-get update and apt-get upgrade -y on our system to make sure that we stay up to date. For this, we will add the following content to the playbook:

- name: DHCP setup
  hosts: magi[0] #we want to run this playbook only on Caspar
  become: yes #at least most of the commands will require superuser access
  tasks:
    - name: Update the package list
      apt:
        update_cache: yes
    
    - name: Upgrade all packages
      apt:
        upgrade: dist

Currently, eth0 is disconnected, but it has a connection available. Its status can be checked through ifconfig or nmcli.

eth0: disconnected
        "eth0"
        1 connection available
        ethernet (bcmgenet), e4:5f:01:fc:a9:93, hw, mtu 1500

We need to add a few tasks to connect eth0:

- name: Check if eth0 is disconnected
  ansible.builtin.shell:
    cmd: "nmcli device status | grep -E '^eth0' | awk '{print $3}'"
  register: eth0_status
  failed_when: eth0_status.rc != 0 and 'eth0' not in eth0_status.stdout

- name: Debug the status of eth0
  debug:
    msg: "Interface eth0 is currently "

This allows us to check the status of eth0; we don’t want to reconfigure it if it is already connected. Then, based on the guide, let’s modify the connection and restart it only if its status is disconnected or connecting (I learnt about the latter status accidentally, when I wiped Caspar clean to test all my playbooks from scratch. It seems to appear while the system is booting and the interfaces are still figuring out their state).

    - name: Configure Wired connection 1 with static IP
      when: eth0_status.stdout in ["disconnected", "connecting"]
      ansible.builtin.command:
        cmd: nmcli con mod "Wired connection 1" ipv4.addresses "{ { main_node_static_ip } }/32" ipv4.method manual

    - name: Bring Wired connection 1 down
      when: eth0_status.stdout in ["disconnected", "connecting"]
      ansible.builtin.command:
        cmd: nmcli con down "Wired connection 1"

    - name: Bring Wired connection 1 up
      when: eth0_status.stdout in ["disconnected", "connecting"]
      ansible.builtin.command:
        cmd: nmcli con up "Wired connection 1"
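
As a side note, the community.general.nmcli module could replace these three command tasks with a single idempotent one. A minimal sketch, assuming the community.general collection is installed and keeping the same address value as the command above:

    - name: Configure Wired connection 1 with a static IP (module-based sketch)
      community.general.nmcli:
        conn_name: "Wired connection 1"
        ifname: eth0
        type: ethernet
        ip4: "{ { main_node_static_ip } }/32"
        method4: manual
        state: present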

Now eth0 will appear connected after a single run of the playbook (ansible-playbook dhcp-setup-playbook.yaml -u pi).

$ nmcli
...
eth0: connected to Wired connection 1
        "eth0"
        ethernet (bcmgenet), E4:5F:01:FC:A9:93, hw, mtu 1500
        inet4 10.3.50.1/32
        route4 10.3.50.0/24 metric 0
        inet6 fe80::67a:8941:9ec7:7ddd/64
        route6 fe80::/64 metric 1024
...

If we re-run this playbook, we will see that some steps are skipped because of the when statements validating eth0_status.stdout in ["disconnected", "connecting"].

Also, we don’t want to hardcode too much information in a variable file, so let’s grab the MAC address and save it for future use.

    - name: Extract the MAC address of eth0
      ansible.builtin.set_fact:
        eth0_mac: "{ { ansible_facts['eth0']['macaddress'] } }" #nevermind this awkward space between the brackets. otherwise it would not render correctly

Next, we need a DHCP server.

    - name: Install DHCP server
      apt:
        name: isc-dhcp-server

At the heart of this DHCP server lies a config file (/etc/dhcp/dhcpd.conf) that we need to create. We will template this file using the MAC address of eth0 we saved earlier and a few input variables, namely:

---
main_node_static_ip: 10.3.50.1
cluster_subnet: 10.3.50.0
cluster_broadcast: 10.3.50.255
cluster_subnet_netmask: 255.255.255.0
cluster_subnet_starting_ip: 10.3.50.10
cluster_subnet_ending_ip: 10.3.50.20
non_cluster_subnet: 192.168.0.0
non_cluster_netmask: 255.255.255.0

These variables are stored locally in group_vars/all.yaml on my machine.

Below you can find a template for /etc/dhcp/dhcpd.conf stored as templates/dhcpd.conf on my machine.

#once again nevermind this awkward space between the brackets. otherwise it would not render correctly
ddns-update-style none;
authoritative;
log-facility local7;

# No service will be given on this subnet
subnet { { non_cluster_subnet } } netmask { { non_cluster_netmask } } {
}

# The internal cluster network
group {
   option broadcast-address { { cluster_broadcast } };
   option routers { { main_node_static_ip } };
   default-lease-time 600;
   max-lease-time 7200;
   option domain-name "cluster";
   option domain-name-servers 8.8.8.8, 8.8.4.4;
   subnet { { cluster_subnet } } netmask { { cluster_subnet_netmask } } {
      # this is important as the DHCP server will issue IPs sequentially. So these two values help to set a predictable range
      range { { cluster_subnet_starting_ip } } { { cluster_subnet_ending_ip } };

      # Head Node
      host cluster {
         hardware ethernet { { eth0_mac } };
         fixed-address { { main_node_static_ip } };
      }

   }
}

This template will be rendered by the following task:

    - name: Create /etc/dhcp/dhcpd.conf
      ansible.builtin.template:
        src: dhcpd.conf
        dest: /etc/dhcp/dhcpd.conf
        owner: root
        group: root
        mode: 0644
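
A small safety net worth considering: the template module accepts a validate command that is run against the rendered file before it replaces the real one. A sketch of the same task with validation added (dhcpd is already installed by the previous task):

    - name: Create /etc/dhcp/dhcpd.conf (with validation)
      ansible.builtin.template:
        src: dhcpd.conf
        dest: /etc/dhcp/dhcpd.conf
        owner: root
        group: root
        mode: 0644
        # dhcpd -t -cf only test-parses the rendered file; the real config is replaced only if it passes
        validate: dhcpd -t -cf %s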

We also need to let the DHCP server know about the config file and make some changes to /etc/hosts:


    - name: Configure /etc/default/isc-dhcp-server
      ansible.builtin.copy:
        content: |
          DHCPDv4_CONF=/etc/dhcp/dhcpd.conf
          DHCPDv4_PID=/var/run/dhcpd.pid
          INTERFACESv4="eth0"
        dest: /etc/default/isc-dhcp-server
        owner: root
        group: root
        mode: '0644'

    - name: Configure /etc/hosts
      ansible.builtin.copy:
        content: |
          127.0.0.1	localhost
          ::1		localhost ip6-localhost ip6-loopback
          ff02::1		ip6-allnodes
          ff02::2		ip6-allrouters
          
          127.0.1.1	caspar
          127.0.1.1	cluster
          
          10.3.50.1	cluster
        dest: /etc/hosts
        owner: root
        group: root
        mode: '0644'

At this point, the guide calls for a node reboot so that all the settings fall into place. I have done it manually, as I was not sure whether I wanted this to be part of my playbooks.
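
If I later change my mind and fold that step into the automation, a sketch of what it might look like; restarting the isc-dhcp-server service is usually enough, and ansible.builtin.reboot can handle a full reboot and wait for the node to come back:

    - name: Restart and enable the DHCP server
      ansible.builtin.service:
        name: isc-dhcp-server
        state: restarted
        enabled: yes

    # or, if a full reboot is preferred, as in the guide:
    - name: Reboot the head node
      ansible.builtin.reboot:
        reboot_timeout: 300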

Once Caspar is up, we can check the lease list using the dhcp-lease-list command. As I have an unmanaged switch (it does not get an IP for itself), I got an empty list. So we need to wait for the second node, Balthasar, to boot up.

Routing

Before booting Balthasar, we also need to make sure that the routing part works correctly, as the DHCP server only hands out IPs in the subnet; it does not route traffic.

We do this by creating a separate playbook (mainly for simplicity of debugging) called nat-setup-playbook.yaml. First, we need to enable IP forwarding by setting the net.ipv4.ip_forward option to 1.

- name: NAT setup
  hosts: magi[0]
  become: yes
  tasks:
    - name: Gather facts
      ansible.builtin.setup:

    - name: Configure IP forward
      ansible.builtin.lineinfile:
        path: /etc/sysctl.conf
        line: 'net.ipv4.ip_forward=1'
        state: present
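
One caveat: lineinfile only edits the file, so the setting takes effect after a reboot (or a manual sysctl -p). A sketch of an alternative that applies it immediately, assuming the ansible.posix collection is available:

    - name: Enable IP forwarding right away and persist it
      ansible.posix.sysctl:
        name: net.ipv4.ip_forward
        value: '1'
        sysctl_set: yes
        state: present
        reload: yes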

Then we install iptables:

    - name: Install iptables
      apt:
        name: iptables

Once installed, we can configure iptables using the following tasks. They are purposely written in a way that avoids adding duplicate rules (thanks, JetBrains Assistant, for that):

    - name: Ensure POSTROUTING MASQUERADE rule exists in iptables
      ansible.builtin.shell: |
        iptables -t nat -C POSTROUTING -o wlan0 -j MASQUERADE || iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE

    - name: Ensure FORWARD RELATED,ESTABLISHED rule exists in iptables
      ansible.builtin.shell: |
        iptables -C FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT || iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT

    - name: Ensure FORWARD ACCEPT rule from eth0 to wlan0 exists in iptables
      ansible.builtin.shell: |
        iptables -C FORWARD -i eth0 -o wlan0 -j ACCEPT || iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT

iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE - Ensures outgoing packets from private networks are NAT-translated (masqueraded).

iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT - Allows return traffic for already established or related connections.

iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT - Enables new outgoing traffic from the private network to the public network.
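
For what it is worth, the same rules could also be expressed with the ansible.builtin.iptables module, which is idempotent on its own, so the -C || -A trick is not needed. A sketch of the first two rules in that style (the module uses conntrack matching instead of -m state, which is functionally equivalent, so it should replace the shell tasks rather than run alongside them):

    - name: Ensure POSTROUTING MASQUERADE rule exists (module-based sketch)
      ansible.builtin.iptables:
        table: nat
        chain: POSTROUTING
        out_interface: wlan0
        jump: MASQUERADE

    - name: Ensure FORWARD RELATED,ESTABLISHED rule exists (module-based sketch)
      ansible.builtin.iptables:
        chain: FORWARD
        in_interface: wlan0
        out_interface: eth0
        ctstate:
          - RELATED
          - ESTABLISHED
        jump: ACCEPT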

The tricky part about iptables is that the rules are not preserved across reboots automatically, so a few more tasks should be added:

    - name: Persist iptables rules
      ansible.builtin.shell:
        cmd: iptables-save > /etc/iptables.ipv4.nat

    - name: Write script to /etc/rc.local
      ansible.builtin.copy:
        dest: /etc/rc.local
        content: |
          #!/bin/bash
          _IP=$(hostname -I) || true
          if [ "$_IP" ]; then
            printf "My IP address is %s\n" "$_IP"
          fi

          iptables-restore < /etc/iptables.ipv4.nat
          
          exit 0
        owner: root
        group: root
        mode: '0755'

This ensures that iptables will be set up correctly after reboot.
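
An alternative to the rc.local approach is the iptables-persistent package, which restores saved rules at boot on its own. A sketch of how that could look:

    - name: Install iptables-persistent
      apt:
        name: iptables-persistent

    # netfilter-persistent save writes the current rules to /etc/iptables/rules.v4,
    # and the package's service reloads them automatically at boot
    - name: Save the current iptables rules
      ansible.builtin.command:
        cmd: netfilter-persistent save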

Balthasar and Melchior

As we are done setting up DHCP and NAT on Caspar, it is time to connect the remaining two nodes and validate that everything is in order.

final-cluster.jpeg

Once Balthasar is up, let’s SSH into it and see if we can connect to the internet.

$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.3.50.10  netmask 255.255.255.0  broadcast 10.3.50.255
        inet6 fe80::88fe:b645:43e8:aa83  prefixlen 64  scopeid 0x20<link>
        ether e4:5f:01:fc:aa:29  txqueuelen 1000  (Ethernet)
        RX packets 273  bytes 39557 (38.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 280  bytes 35241 (34.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
...

It seems that we got an IP from Caspar for eth0. So the DHCP part works correctly.

Can we ping 8.8.8.8?

$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
--- 8.8.8.8 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3060ms

Uh-oh, we cannot. Strange. We have followed the guide as thoroughly as we could. Can we at least ping Caspar?

$ ping -c 4 10.3.50.1
PING 10.3.50.1 (10.3.50.1) 56(84) bytes of data.

--- 10.3.50.1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3061ms

Hmmm, something is not right. I can see some network activity on the switch when I run the ping command.

*double checks iptables*

*double checks that net.ipv4.ip_forward=1*

*googles*

*googles*

*googles*

While talking to an AI assistant about possible causes, I noticed that the phrase ip route was mentioned a few times. So I asked, “Do I need ip route if I set iptables?”, and the response was:

Even though your iptables rules allow forwarding packets between interfaces and perform NAT (masquerading) on outgoing traffic, the routing table must explicitly define the path for the packets to reach their destination. Here’s why you needed to add the ip route command.

Maybe the traffic originating from this network is not handled properly because the router does not understand it?

Let’s check the routing table:

$ ip route
default via 192.168.0.1 dev wlan0 proto dhcp src 192.168.0.202 metric 600
192.168.0.0/24 dev wlan0 proto kernel scope link src 192.168.0.202 metric 600

Hmm, strange. What if we create a route specifically for the internal network on the eth0 interface?

$  ip route add 10.3.50.0/24 dev eth0 proto static 

Now let’s try the ping commands again from Balthasar.

$ ping -c 4 10.3.50.1
PING 10.3.50.1 (10.3.50.1) 56(84) bytes of data.
64 bytes from 10.3.50.1: icmp_seq=1 ttl=64 time=0.361 ms
64 bytes from 10.3.50.1: icmp_seq=2 ttl=64 time=0.224 ms
64 bytes from 10.3.50.1: icmp_seq=3 ttl=64 time=0.293 ms
64 bytes from 10.3.50.1: icmp_seq=4 ttl=64 time=0.209 ms

--- 10.3.50.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3056ms
rtt min/avg/max/mdev = 0.209/0.271/0.361/0.060 ms

$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=56 time=17.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=56 time=18.0 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=56 time=13.8 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=56 time=16.8 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 13.763/16.523/18.013/1.653 ms

It works!

IP routes are also not preserved between reboots. Based on some internet suggestions, I have decided to set up a service to ensure this route is present on Caspar.


- name: Create the IP route check script
  ansible.builtin.copy:
    dest: /usr/local/bin/check_ip_route.sh
    content: |
      #!/bin/bash

      # Define the route to check
      TARGET_ROUTE="10.3.50.0/24 dev eth0 proto static"

      # Check if the route already exists
      if ! ip route show | grep -q "$TARGET_ROUTE"; then
          # Route does not exist, add it
          ip route add 10.3.50.0/24 dev eth0 proto static
          echo "Route added: $TARGET_ROUTE"
      else
          echo "Route already exists: $TARGET_ROUTE"
      fi
    mode: '0755'  # Set correct permissions for the script
    owner: root
    group: root

- name: Create the systemd service file
  ansible.builtin.copy:
    dest: /etc/systemd/system/check-ip-route.service
    content: |
      [Unit]
      Description=Check and Add IP Route
      After=network.target

      [Service]
      Type=oneshot
      ExecStart=/usr/local/bin/check_ip_route.sh
      RemainAfterExit=true
      User=root
      Group=root

      [Install]
      WantedBy=multi-user.target
    mode: '0644'
    owner: root
    group: root

- name: Reload systemd manager configuration
  become: true
  command: systemctl daemon-reload

- name: Enable the IP route service
  systemd:
    name: check-ip-route.service
    enabled: true
    state: started
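
Another option would have been to attach the route to the NetworkManager connection profile itself, so it is restored whenever the connection comes up. A sketch in the same command-task style used earlier (the +ipv4.routes syntax appends a static route to the profile):

- name: Persist the cluster route on the connection profile
  ansible.builtin.command:
    cmd: nmcli con mod "Wired connection 1" +ipv4.routes "10.3.50.0/24"

- name: Reapply the connection
  ansible.builtin.command:
    cmd: nmcli con up "Wired connection 1"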

End notes

Never be afraid to ask your AI assistant dumb questions :)

magi.jpeg