How To Create a Redundant Storage Pool Using GlusterFS on Ubuntu 20.04

An earlier version of this tutorial was written by Justin Ellingwood.

Introduction

When storing any critical data, having a single point of failure is very risky. While many databases and other software allow you to spread data out in the context of a single application, other systems can operate on the filesystem level to ensure that data is copied to another location whenever it’s written to disk.

GlusterFS is a network-attached storage filesystem that allows you to pool storage resources of multiple machines. In turn, this lets you treat multiple storage devices that are distributed among many computers as a single, more powerful unit. GlusterFS also gives you the freedom to create different kinds of storage configurations, many of which are functionally similar to RAID levels. For instance, you can stripe data across different nodes in the cluster, or you can implement redundancy for better data availability.

Goals

In this guide, you will create a redundant clustered storage array, also known as a distributed file system or, as it’s referred to in the GlusterFS documentation, a Trusted Storage Pool. This will provide functionality similar to a mirrored RAID configuration over the network: each independent server will contain its own copy of the data, allowing your applications to access either copy, thereby helping distribute your read load.

This redundant GlusterFS cluster will consist of two Ubuntu 20.04 servers and will act similarly to a NAS server with mirrored RAID. You’ll then access the cluster from a third Ubuntu 20.04 server configured to function as a GlusterFS client.

A Note About Running GlusterFS Securely

When you add data to a GlusterFS volume, that data gets synced to every machine in the storage pool where the volume is hosted. This traffic between nodes isn’t encrypted by default, meaning there’s a risk it could be intercepted by malicious actors.

For this reason, if you’re going to use GlusterFS in production, it’s recommended that you run it on an isolated network. For example, you could set it up to run in a Virtual Private Cloud (VPC) or with a VPN running between each of the nodes.

If you plan to deploy GlusterFS on DigitalOcean, you can set it up in an isolated network by adding your server infrastructure to a DigitalOcean Virtual Private Cloud. For details on how to set this up, see our VPC product documentation.

Prerequisites

To follow this tutorial, you will need three servers running Ubuntu 20.04. Each server should have a non-root user with administrative privileges, and a firewall configured with UFW. To set this up, follow our initial server setup guide for Ubuntu 20.04.

Note: As mentioned in the Goals section, this tutorial will walk you through configuring two of your Ubuntu servers to act as servers in your storage pool and the remaining one to act as a client which you’ll use to access these nodes.

For clarity, this tutorial will refer to these machines with the following hostnames:

Hostname    Role in Storage Pool
gluster0    Server
gluster1    Server
gluster2    Client

Each command in this tutorial is introduced with a note saying where to run it: on gluster0, on gluster1, on the client (gluster2), or on more than one machine.

Step 1 — Configuring DNS Resolution on Each Machine

Setting up some kind of hostname resolution between each computer can help with managing your Gluster storage pool. This way, whenever you have to reference one of your machines in a gluster command later in this tutorial, you can do so with an easy-to-remember domain name or even a nickname instead of their respective IP addresses.

If you do not have a spare domain name, or if you just want to set up something quickly, you can instead edit the /etc/hosts file on each computer. This is a special file on Linux machines where you can statically configure the system to resolve any hostnames contained in the file to static IP addresses.

Note: If you’d like your servers to find each other through a domain that you own, you’ll first need to obtain a domain name from a domain registrar, such as Namecheap or Enom, and configure the appropriate DNS records.

Once you’ve configured an A record for each server, you can jump ahead to Step 2. As you follow this guide, make sure that you replace glusterN.example.com and glusterN with the domain name that resolves to the respective server referenced in the example command.

If you obtained your infrastructure from DigitalOcean, you could add your domain name to DigitalOcean then set up a unique A record for each of your servers.

Using your preferred text editor, open the /etc/hosts file with root privileges on each of your machines. Here, we’ll use nano:

  • sudo nano /etc/hosts

By default, the file will look something like this with comments removed:

/etc/hosts

127.0.1.1 hostname hostname
127.0.0.1 localhost

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

On one of your Ubuntu servers, add each server’s IP address followed by any names you wish to use to reference them in commands below the local host definition.

In the following example, each server is given a long hostname that aligns with glusterN.example.com and a short one that aligns with glusterN. You can change the glusterN.example.com and glusterN portions of each line to whatever name — or names separated by single spaces — you would like to use to access each server. Note, though, that this tutorial will use these examples throughout:

Note: If your servers are part of a Virtual Private Cloud infrastructure pool, you should use each server’s private IP address in the /etc/hosts file rather than their public IPs.

/etc/hosts

. . .
127.0.0.1       localhost
first_ip_address gluster0.example.com gluster0
second_ip_address gluster1.example.com gluster1
third_ip_address gluster2.example.com gluster2

. . .

When you are finished adding these new lines to the /etc/hosts file of one machine, copy and add them to the /etc/hosts files on your other machines. Each /etc/hosts file should contain the same lines, linking your servers’ IP addresses to the names you’ve selected.
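If you would rather script this step than edit each file by hand, the following is a minimal sketch of an idempotent helper. The function name add_host_entry, the HOSTS_FILE variable, and the 203.0.113.x addresses are placeholders introduced for this example, not part of GlusterFS or Ubuntu. The demonstration writes to a scratch file; to change the real /etc/hosts you would leave HOSTS_FILE unset and run the function as root.

```shell
# Append "IP NAME..." to a hosts file unless that IP already has an entry.
# HOSTS_FILE defaults to /etc/hosts (which needs root to modify).
add_host_entry() {
    hosts_file=${HOSTS_FILE:-/etc/hosts}
    ip=$1
    shift
    # Look for a line whose first field is exactly this IP address.
    if awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' "$hosts_file"; then
        echo "skip: $ip already listed in $hosts_file"
    else
        printf '%s %s\n' "$ip" "$*" >> "$hosts_file"
        echo "added: $ip $*"
    fi
}

# Demonstration against a scratch file, using placeholder addresses.
HOSTS_FILE=$(mktemp)
add_host_entry 203.0.113.10 gluster0.example.com gluster0
add_host_entry 203.0.113.11 gluster1.example.com gluster1
add_host_entry 203.0.113.12 gluster2.example.com gluster2
add_host_entry 203.0.113.10 gluster0.example.com gluster0   # already present, so skipped
cat "$HOSTS_FILE"
```

Because the helper checks for an existing entry first, running it again on the same machine does not duplicate lines.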

Save and close each file when you are finished. If you used nano, do so by pressing CTRL + X, Y, and then ENTER.

Now that you’ve configured hostname resolution between each of your servers, it will be easier to run commands later on as you set up a storage pool and volume. Next, you’ll go through another step that must be completed on each of your servers. Namely, you’ll add the Gluster project’s official personal package archive (PPA) to each of your three Ubuntu servers to ensure that you can install the latest version of GlusterFS.

Step 2 — Setting Up Software Sources on Each Machine

Although the default Ubuntu 20.04 APT repositories do contain GlusterFS packages, at the time of this writing they are not the most recent versions. One way to install the latest stable version of GlusterFS (version 7.6 as of this writing) is to add the Gluster project’s official PPA to each of your three Ubuntu servers.

Add the PPA for the GlusterFS packages by running the following command on each server:

  • sudo add-apt-repository ppa:gluster/glusterfs-7

Press ENTER when prompted to confirm that you actually want to add the PPA.

After adding the PPA, refresh each server’s local package index. This will make each server aware of the new packages available:

  • sudo apt update

After adding the Gluster project’s official PPA to each server and updating the local package index, you’re ready to install the necessary GlusterFS packages. However, because two of your three machines will act as Gluster servers and the other will act as a client, there are two separate installation and configuration procedures. First, you’ll install and set up the server components.

Step 3 — Installing Server Components and Creating a Trusted Storage Pool

A storage pool is any amount of storage capacity aggregated from more than one storage resource. In this step, you will configure two of your servers — gluster0 and gluster1 — as the cluster components.

On both gluster0 and gluster1, install the GlusterFS server package by typing:

  • sudo apt install glusterfs-server

When prompted, press Y and then ENTER to confirm the installation.

The installation process automatically configures GlusterFS to run as a systemd service. However, it doesn’t automatically start the service or enable it to run at boot.

To start glusterd, the GlusterFS service, run the following systemctl start command on both gluster0 and gluster1:

  • sudo systemctl start glusterd.service

Then run the following command on both servers. This will enable the service to start whenever the server boots up:

  • sudo systemctl enable glusterd.service

Following that, you can check the service’s status on either or both servers:

  • sudo systemctl status glusterd.service

If the service is up and running, you’ll receive output like this:

Output
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-06-02 21:32:21 UTC; 32s ago
     Docs: man:glusterd(8)
 Main PID: 14742 (glusterd)
    Tasks: 9 (limit: 2362)
   CGroup: /system.slice/glusterd.service
           └─14742 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Assuming you followed the prerequisite initial server setup guide, you will have set up a firewall with UFW on each of your machines. Because of this, you’ll need to open up the firewall on each node before you can establish communications between them and create a storage pool.

The Gluster daemon uses port 24007, so you’ll need to allow each node access to that port through the firewall of each other node in your storage pool. To do so, run the following command on gluster0. Remember to change gluster1_ip_address to gluster1’s IP address:

  • sudo ufw allow from gluster1_ip_address to any port 24007

And run the following command on gluster1. Again, be sure to change gluster0_ip_address to gluster0’s IP address:

  • sudo ufw allow from gluster0_ip_address to any port 24007

You’ll also need to allow your client machine (gluster2) access to this port. Otherwise, you’ll run into issues later on when you try to mount the volume. Run the following command on both gluster0 and gluster1 to open up this port to your client machine:

  • sudo ufw allow from gluster2_ip_address to any port 24007

Then, to ensure that no other machines are able to access Gluster’s port on either server, add the following blanket deny rule to both gluster0 and gluster1:

  • sudo ufw deny 24007
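The order of these rules matters: UFW evaluates rules in sequence and the first match wins, so the allow rules have to be added before the blanket deny. As a rough sketch, the helper below prints the commands in that order so you can review them before running anything. The function name ufw_rules_for_port and the 203.0.113.x addresses are placeholders made up for this example.

```shell
# Print (do not run) the UFW commands that restrict PORT to the listed source IPs.
# Allow rules come first; the blanket deny is always printed last.
ufw_rules_for_port() {
    port=$1
    shift
    for src in "$@"; do
        echo "sudo ufw allow from $src to any port $port"
    done
    echo "sudo ufw deny $port"
}

# Example for gluster0: allow gluster1 and the client, then deny everyone else.
ufw_rules_for_port 24007 203.0.113.11 203.0.113.12
```

Once the printed commands look right, you can paste them into the terminal on the relevant server.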

You’re now ready to establish communication between gluster0 and gluster1. To do so, you’ll need to run the gluster peer probe command on one of your nodes. It doesn’t matter which node you use, but the following example shows the command being run on gluster0:

  • sudo gluster peer probe gluster1

Essentially, this command tells gluster0 to trust gluster1 and register it as part of its storage pool. If the probe is successful, it will return the following output:

Output
peer probe: success

You can check that the nodes are communicating at any time by running the gluster peer status command on either one. In this example, it’s run on gluster1:

  • sudo gluster peer status

If you run this command from gluster1, it will show output like this:

Output
Number of Peers: 1

Hostname: gluster0.example.com
Uuid: a3fae496-c4eb-4b20-9ed2-7840230407be
State: Peer in Cluster (Connected)
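If you want to check this state from a script, one possible approach is to parse the peer status text. The sketch below assumes the output format shown in this tutorial; the function name all_peers_connected is made up for this example. It reads the status text on standard input, so on a real node you would pipe sudo gluster peer status into it.

```shell
# Read `gluster peer status` output on stdin; succeed only if every listed
# peer reports the "Peer in Cluster (Connected)" state.
all_peers_connected() {
    awk '
        /^Number of Peers:/ { expected = $4 }
        /^State: Peer in Cluster \(Connected\)/ { connected++ }
        END { exit !(expected > 0 && connected == expected) }
    '
}

# Demonstration using the sample output from this tutorial.
all_peers_connected <<'EOF' && echo "all peers connected"
Number of Peers: 1

Hostname: gluster0.example.com
Uuid: a3fae496-c4eb-4b20-9ed2-7840230407be
State: Peer in Cluster (Connected)
EOF
```

The function returns a non-zero exit status if no peers are listed or if any peer is in another state, which makes it usable in an if statement or a monitoring check.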

At this point, your two servers are communicating and ready to create storage volumes with each other.

Step 4 — Creating a Storage Volume

Recall that the primary goal of this tutorial is to create a redundant storage pool. To this end you’ll set up a volume with replica functionality, allowing you to keep multiple copies of your data and prevent your cluster from having a single point of failure.

To create a volume, you’ll use the gluster volume create command with this general syntax:

sudo gluster volume create volume_name replica number_of_servers domain1.com:/path/to/data/directory domain2.com:/path/to/data/directory force 

Here’s what this gluster volume create command’s arguments and options mean:

  • volume_name: This is the name you’ll use to refer to the volume after it’s created. The following example command creates a volume named volume1.
  • replica number_of_servers: Following the volume name, you can define what type of volume you want to create. Recall that the goal of this tutorial is to create a redundant storage pool, so we’ll use the replica volume type. This requires an argument indicating how many servers the volume’s data will be replicated to (2 in the case of this tutorial).
  • domain1.com:/… and domain2.com:/…: These define the machines and directory location of the bricks — GlusterFS’s term for its basic unit of storage, which includes any directories on any machines that serve as a part or a copy of a larger volume — that will make up volume1. The following example will create a directory named gluster-storage in the root directory of both servers.
  • force: This option will override any warnings or options that would otherwise come up and halt the volume’s creation.
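To see how these pieces fit together, here is a small sketch that assembles the command from a volume name, a brick path, and a list of hosts, setting the replica count to the number of hosts. The function name volume_create_cmd is hypothetical. It only prints the command and does not create anything.

```shell
# Print (do not run) a replica volume create command with one brick per host.
# The replica count is set to the number of hosts given.
volume_create_cmd() {
    volume=$1
    brick_path=$2
    shift 2
    count=$#
    bricks=""
    for host in "$@"; do
        bricks="$bricks $host:$brick_path"
    done
    echo "sudo gluster volume create $volume replica $count$bricks force"
}

volume_create_cmd volume1 /gluster-storage gluster0.example.com gluster1.example.com
```

With the two hostnames used in this tutorial, the helper prints a command with replica 2 and one brick per server.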

Following the conventions established earlier in this tutorial, you can run this command to create a volume. Note that you can run it from either gluster0 or gluster1:

  • sudo gluster volume create volume1 replica 2 gluster0.example.com:/gluster-storage gluster1.example.com:/gluster-storage force

If the volume was created successfully, you’ll receive the following output:

Output
volume create: volume1: success: please start the volume to access data

At this point, your volume has been created, but it’s not yet active. You can start the volume and make it available for use by running the following command, again from either of your Gluster servers:

  • sudo gluster volume start volume1

You’ll receive this output if the volume was started correctly:

Output
volume start: volume1: success

Next, check that the volume is online. Run the following command from either one of your nodes:

  • sudo gluster volume status

This will return output similar to this:

Output
Status of volume: volume1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster0.example.com:/gluster-storage 49152     0          Y       18801
Brick gluster1.example.com:/gluster-storage 49152     0          Y       19028
Self-heal Daemon on localhost               N/A       N/A        Y       19049
Self-heal Daemon on gluster0.example.com    N/A       N/A        Y       18822

Task Status of Volume volume1
------------------------------------------------------------------------------
There are no active volume tasks

Based on this output, the bricks on both servers are online.

As a final step to configuring your volume, you’ll need to open up the firewall on both servers so your client machine will be able to connect to and mount the volume. According to the previous command’s sample output, volume1 is running on port 49152 on both machines. This is GlusterFS’s default port for its initial volume, and the next volume you create will use port 49153, then 49154, and so on.
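Since the port depends on how many volumes exist, it can be safer to read it from the status output than to assume 49152. The following sketch extracts the distinct brick ports from status text supplied on standard input. The function name brick_ports is made up for this example, and the sketch assumes each brick is printed on a single line, as in the sample output in this tutorial.

```shell
# Read `gluster volume status` output on stdin and print each distinct
# TCP port used by a brick, one per line. Bricks without a numeric port
# (for example, offline bricks showing N/A) are ignored.
brick_ports() {
    awk '$1 == "Brick" && $3 ~ /^[0-9]+$/ { print $3 }' | sort -un
}

# Demonstration using the sample output from this tutorial.
brick_ports <<'EOF'
Status of volume: volume1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster0.example.com:/gluster-storage 49152     0          Y       18801
Brick gluster1.example.com:/gluster-storage 49152     0          Y       19028
Self-heal Daemon on localhost               N/A       N/A        Y       19049
Self-heal Daemon on gluster0.example.com    N/A       N/A        Y       18822
EOF
```

On a real node you would pipe sudo gluster volume status into brick_ports and open each reported port to your client.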

Run the following command on both gluster0 and gluster1 to allow gluster2 access to this port through each one’s respective firewall:

  • sudo ufw allow from gluster2_ip_address to any port 49152

Then, for an added layer of security, add another blanket deny rule for the volume’s port on both gluster0 and gluster1. This will ensure that no machines other than your client can access the volume on either server:

  • sudo ufw deny 49152

Now that your volume is up and running, you can set up your client machine and begin using it remotely.

Step 5 — Installing and Configuring Client Components

Your volume is now configured and available for use by your client machine. Before you begin, though, you need to install the glusterfs-client package on your client machine from the PPA you set up in Step 2. This package’s dependencies include some of GlusterFS’s common libraries and translator modules and the FUSE tools required for it to work.

Run the following command on gluster2:

  • sudo apt install glusterfs-client

You will mount your remote storage volume on your client computer shortly. Before you can do that, though, you need to create a mount point. Traditionally, this is in the /mnt directory, but anywhere convenient can be used.

For simplicity’s sake, create a directory named /storage-pool on your client machine to serve as the mount point. This directory name starts with a forward slash (/) which places it in the root directory, so you’ll need to create it with sudo privileges:

  • sudo mkdir /storage-pool

Now you can mount the remote volume. Before that, though, take a look at the syntax of the mount command you’ll use to do so:

sudo mount -t glusterfs domain1.com:volume_name /path/to/mount/point 

mount is a utility found in many Unix-like operating systems. It’s used to attach filesystems (anything from external storage devices, like SD cards or USB sticks, to network-attached storage, as in this tutorial) to directories on the machine’s existing filesystem. The mount command syntax you’ll use takes three pieces of information: the type of filesystem to be mounted, which is passed to the -t option; the device where the filesystem to mount can be found; and the directory on the client where you’ll mount the volume.

Notice that in this example syntax, the device argument points to a hostname followed by a colon and then the volume’s name. GlusterFS abstracts the actual storage directories on each host, meaning that this command doesn’t look to mount the /gluster-storage directory, but instead the volume1 volume.

Also notice that you only have to specify one member of the storage cluster. This can be either node: the client uses the server you name only to fetch the volume’s configuration, and then connects to the bricks on both servers directly.

Run the following command on your client machine (gluster2) to mount the volume to the /storage-pool directory you created:

  • sudo mount -t glusterfs gluster0.example.com:/volume1 /storage-pool

Following that, run the df command. This will display the amount of available disk space for file systems to which the user invoking it has access:

  • df

This command will show that the GlusterFS volume is mounted at the correct location:

Output
Filesystem                    1K-blocks    Used Available Use% Mounted on
. . .
gluster0.example.com:/volume1  50633164 1938032  48695132   4% /storage-pool
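To check the mount from a script rather than by eye, one option is to look up the mount point in the df output. The sketch below reads df output on standard input and prints the source mounted at a given directory. The function name mount_source is hypothetical. When piping real output, df -P is the safer choice because it keeps each filesystem on a single line.

```shell
# Read `df -P` output on stdin and print the source mounted at the given
# mount point. Assumes the mount point contains no spaces.
mount_source() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $1 }'
}

# Demonstration using the sample output from this tutorial.
mount_source /storage-pool <<'EOF'
Filesystem                    1K-blocks    Used Available Use% Mounted on
gluster0.example.com:/volume1  50633164 1938032  48695132   4% /storage-pool
EOF
```

An empty result means nothing is mounted at that directory, so a script could stop before writing data to the local disk by mistake.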

Now, you can move on to testing that any data you write to the volume on your client gets replicated to your server nodes as expected.

Step 6 — Testing Redundancy Features

Now that you’ve set up your client to use your storage pool and volume, you can test its functionality.

On your client machine (gluster2), navigate to the mount point that you defined in the previous step:

  • cd /storage-pool

Then create a few test files. The following command creates ten separate empty files in your storage pool:

  • sudo touch file_{0..9}.test

If you examine the storage directories you defined earlier on each storage host, you’ll discover that all of these files are present on each system.

On gluster0:

  • ls /gluster-storage
Output
file_0.test  file_2.test  file_4.test  file_6.test  file_8.test
file_1.test  file_3.test  file_5.test  file_7.test  file_9.test

Likewise, on gluster1:

  • ls /gluster-storage
Output
file_0.test  file_2.test  file_4.test  file_6.test  file_8.test
file_1.test  file_3.test  file_5.test  file_7.test  file_9.test

As these outputs show, the test files that you added to the client were also written to both of your nodes.
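With more than a handful of files, comparing the two listings by eye becomes error-prone. One possible approach is to save each server's ls output to a file and compare the files with diff. The sketch below shows the comparison step only. The function name same_listing is made up for this example, and two scratch directories stand in for the brick directories, which in a real setup live on different machines.

```shell
# Compare two saved directory listings; print a verdict and return
# non-zero when they differ.
same_listing() {
    if diff "$1" "$2" > /dev/null; then
        echo "listings match"
    else
        echo "listings differ"
        return 1
    fi
}

# Demonstration with two scratch directories standing in for the bricks.
brick_a=$(mktemp -d)
brick_b=$(mktemp -d)
for i in 0 1 2; do
    touch "$brick_a/file_$i.test" "$brick_b/file_$i.test"
done
ls "$brick_a" > "$brick_a.list"
ls "$brick_b" > "$brick_b.list"
same_listing "$brick_a.list" "$brick_b.list"
```

On the real servers you would run ls /gluster-storage on gluster0 and gluster1, save each result, and pass the two files to the function.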

If there is ever a point when one of the nodes in your storage cluster is down, it could fall out of sync with the storage pool if any changes are made to the filesystem. Doing a read operation on the client mount point after the node comes back online will alert the node to get any missing files:

  • ls /storage-pool
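GlusterFS can also report which files are still waiting to be synchronized, through the gluster volume heal volume1 info command. As a hedged sketch, the helper below totals the pending entries from that report. The function name pending_heal_entries is hypothetical, and the sample input is illustrative text in the format this sketch assumes rather than output captured from a real cluster.

```shell
# Read `gluster volume heal VOLUME info` output on stdin and print the total
# number of entries still waiting to be healed across all bricks.
pending_heal_entries() {
    awk '/^Number of entries:/ { total += $4 } END { print total + 0 }'
}

# Illustrative input only (assumed format, not real cluster output).
pending_heal_entries <<'EOF'
Brick gluster0.example.com:/gluster-storage
Status: Connected
Number of entries: 0

Brick gluster1.example.com:/gluster-storage
/file_3.test
/file_7.test
Status: Connected
Number of entries: 2
EOF
```

A total of zero suggests both bricks are back in sync; any other value means some files have not yet been healed.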

Now that you’ve verified that your storage volume is mounted correctly and can replicate data to both machines in the cluster, you can lock down access to the storage pool.

Step 7 — Restricting Access to the Volume

At this point, GlusterFS itself will accept a connection to your storage volume from any computer; only the firewall rules you added earlier limit access. You can change this by setting the auth.allow option, which defines the IP addresses of whatever clients should have access to the volume.

If you’re using the /etc/hosts configuration, the names you’ve set for each server will not route correctly. You must use a static IP address instead. On the other hand, if you’re using DNS records, the domain name you’ve configured will work here.

On either one of your storage nodes (gluster0 or gluster1), run the following command:

  • sudo gluster volume set volume1 auth.allow gluster2_ip_address

If the command completes successfully, it will return this output:

Outputvolume set: success 

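If you need to remove the restriction at any point, you can type the following. The asterisk is quoted so that your shell passes it to gluster unchanged instead of expanding it into the names of files in the current directory:

  • sudo gluster volume set volume1 auth.allow '*'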

This will allow connections from any machine again. This is insecure, but can be useful for debugging issues.

If you have multiple clients, you can specify their IP addresses or domain names at the same time (depending on whether you are using /etc/hosts or DNS hostname resolution), separated by commas:

  • sudo gluster volume set volume1 auth.allow gluster_client1_ip,gluster_client2_ip
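If the client list is long or generated elsewhere, building the comma-separated value by hand invites typos. The following sketch joins any number of addresses with commas and prints the resulting command without running it. The function name auth_allow_cmd and the 203.0.113.x addresses are placeholders introduced for this example.

```shell
# Print (do not run) the command that limits a volume to the given clients.
auth_allow_cmd() {
    volume=$1
    shift
    # "$*" joins the remaining arguments using the first character of IFS.
    old_ifs=$IFS
    IFS=,
    list="$*"
    IFS=$old_ifs
    echo "sudo gluster volume set $volume auth.allow $list"
}

auth_allow_cmd volume1 203.0.113.12 203.0.113.13
```

The value is joined with no spaces between the addresses, matching the comma-separated form shown in the command above.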

Your storage pool is now configured, secured, and ready for use. Next you’ll learn a few commands that will help you get information about the status of your storage pool.

Step 8 — Getting Information About your Storage Pool with GlusterFS Commands

When you begin changing some of the settings for your GlusterFS storage, you might get confused about what options you have available, which volumes are live, and which nodes are associated with each volume.

There are a number of different commands that are available on your nodes to retrieve this data and interact with your storage pool.

If you want information about each of your volumes, run the gluster volume info command:

  • sudo gluster volume info
Output
Volume Name: volume1
Type: Replicate
Volume ID: a1e03075-a223-43ab-a0f6-612585940b0c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster0.example.com:/gluster-storage
Brick2: gluster1.example.com:/gluster-storage
Options Reconfigured:
auth.allow: gluster2_ip_address
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

Similarly, to get information about any peers that this node is connected to, you can type:

  • sudo gluster peer status
Output
Number of Peers: 1

Hostname: gluster0.example.com
Uuid: cb00a2fc-2384-41ac-b2a8-e7a1793bb5a9
State: Peer in Cluster (Connected)

If you want detailed information about how each node is performing, you can profile a volume by typing:

  • sudo gluster volume profile volume_name start

When this command is complete, you can obtain the information that was gathered by typing:

  • sudo gluster volume profile volume_name info
Output
Brick: gluster0.example.com:/gluster-storage
--------------------------------------------
Cumulative Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             30      FORGET
      0.00       0.00 us       0.00 us       0.00 us             36     RELEASE
      0.00       0.00 us       0.00 us       0.00 us             38  RELEASEDIR

    Duration: 5445 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Interval 0 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             30      FORGET
      0.00       0.00 us       0.00 us       0.00 us             36     RELEASE
      0.00       0.00 us       0.00 us       0.00 us             38  RELEASEDIR

    Duration: 5445 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
. . .

As shown previously, for a list of all of the GlusterFS-related components running on each of your nodes, run the gluster volume status command:

  • sudo gluster volume status
Output
Status of volume: volume1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster0.example.com:/gluster-storage 49152     0          Y       19003
Brick gluster1.example.com:/gluster-storage 49152     0          Y       19040
Self-heal Daemon on localhost               N/A       N/A        Y       19061
Self-heal Daemon on gluster0.example.com    N/A       N/A        Y       19836

Task Status of Volume volume1
------------------------------------------------------------------------------
There are no active volume tasks

If you are going to be administering your GlusterFS storage volumes, it may be a good idea to drop into the GlusterFS console. This will allow you to interact with your GlusterFS environment without needing to type sudo gluster before everything:

  • sudo gluster

This will give you a prompt where you can type your commands. help is a good one to get yourself oriented:

  • help
Output
 peer help                - display help for peer commands
 volume help              - display help for volume commands
 volume bitrot help       - display help for volume bitrot commands
 volume quota help        - display help for volume quota commands
 snapshot help            - display help for snapshot commands
 global help              - list global commands

When you are finished, run exit to exit the Gluster console:

  • exit

With that, you’re ready to begin integrating GlusterFS with your next application.

Conclusion

By completing this tutorial, you have a redundant storage system that will allow you to write to two separate servers simultaneously. This can be useful for a number of applications and can ensure that your data is available even when one server goes down.