Podman, SELinux, and systemd

06 Jun, 2020

In my previous post about migrating this site to Podman, I laid out a rough outline of my plan to move forward with Podman. Step one was to move the database into a container.

I have a few updates on my progress, and some tips to share regarding SELinux and containers that run systemd for service control.

I've basically been starting from scratch on this - I don't have any experience with other container platforms like Docker. I have had some limited exposure to lxd on Ubuntu systems, but I've always treated those as live systems -- more like a VM than a container.

Building the image

When thinking about how I wanted to build my container image, I knew I had several options. There are a number of MariaDB container images available on Docker Hub, and I considered using one of those at first. In the end, I chose to use just the standard Fedora 32 image from the Fedora registry and install MariaDB into my own custom image based on Fedora. I did this for two reasons.

The first reason comes down to trust. I don't have anything against Docker or Docker Hub, but I honestly don't know much about them, and I don't know much about where the container images come from, or how often they are updated. Based on that I decided that I wanted to stick to the Fedora registry.

The second reason is that installing MariaDB on Fedora is trivial, and doesn't add any undue burden on me when creating the Containerfile. So why not just keep it simple - Fedora container - install MariaDB and push a simple service file.

Speaking of the Containerfile.

Podman Containerfile

Many of you probably already know this, but you can build a container image using a simple file that describes the end state of your image. By convention this file is usually named "Dockerfile" or "Containerfile", but you can name it anything you want. If you do choose to call it something other than Dockerfile or Containerfile, just specify the name with podman build -t <image_name> -f <container_file_name>.
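As a quick sketch of that: assuming a build file called mariadb.Containerfile (a made-up name for illustration) and the image tag I use later in this post, the build could be wrapped like this. The trailing . is the build context, the directory where COPY looks for files.

```shell
#!/bin/sh
# Hypothetical file name; override either value from the environment.
CONTAINERFILE="${CONTAINERFILE:-mariadb.Containerfile}"
IMAGE_TAG="${IMAGE_TAG:-localhost/test:testing}"

# -t names the resulting image, -f selects the build file, and the
# trailing "." is the build context directory.
build_image() {
    podman build -t "$IMAGE_TAG" -f "$CONTAINERFILE" .
}

# Only attempt the build when podman and the build file are both present.
if command -v podman >/dev/null 2>&1 && [ -f "$CONTAINERFILE" ]; then
    build_image
else
    echo "skipping build: podman or $CONTAINERFILE not found"
fi
```

Keeping the names in variables means the same script works whether the file is called Containerfile, Dockerfile, or something else entirely.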

My Containerfile is pretty simple and is based on the build file I found on here:

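    # Pin the release so a rebuild doesn't silently jump to the next Fedora version
    FROM registry.fedoraproject.org/fedora:32
    MAINTAINER luke at sudoedit.com
    # Install the MariaDB server and client
    RUN yum -y install mariadb-server mariadb
    # Drop-in that raises the service's resource limits
    COPY mariadb-service-limits.conf /etc/systemd/system/mariadb.service.d/limits.conf
    # Start MariaDB when systemd boots inside the container
    RUN systemctl enable mariadb
    RUN systemctl disable systemd-update-utmp.service
    # Run systemd as PID 1
    ENTRYPOINT ["/sbin/init"]
    CMD ["/sbin/init"]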

We'll see if I end up needing to change this at all, but basically I wanted to accomplish a few things.

* I wanted to specify the Fedora version - I was a bit worried that if I built the container using just "fedora", I would forget what was in here when the next major version is released and end up with changes I'm not ready for. Probably unlikely to be an issue, but this just seemed safer.
* Install MariaDB - and copy in a simple configuration file to allow more open files (file shown below).
* Allow systemd to manage the MariaDB service, and have it enabled at start time.

Contents of mariadb-service-limits.conf

    [Service]
    LimitNOFILE=10000
    LimitMEMLOCK=infinity

If you don't include these parameters then MariaDB will only be allowed 1024 open files, and it will complain about it.
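One way to confirm which limit a process actually received is to read its entry under /proc. This is only a minimal sketch: it checks the current shell by default, and you would pass the MariaDB PID instead (finding that PID is left to you) to check the service itself.

```shell
#!/bin/sh
# Print the soft "Max open files" limit for a PID by reading
# /proc/<pid>/limits. Defaults to the current shell's PID.
soft_nofile() {
    pid="${1:-$$}"
    # Columns are: Max open files <soft> <hard> files
    awk '/^Max open files/ { print $4 }' "/proc/$pid/limits"
}

if [ -r "/proc/$$/limits" ]; then
    soft_nofile "$$"
else
    echo "no /proc/<pid>/limits here (not Linux?)"
fi
```

Inside the container, systemctl show mariadb -p LimitNOFILE should also report the value that systemd picked up from the drop-in file.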

The first roadblock - SELinux booleans

After building the image, the next obvious step is to run it, right? So, that's what I did, and I was greeted with this interesting message:

    [luke@Fedora ~]$ podman run -i -it localhost/test:testing

    systemd v245.4-1.fc32 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
    Detected virtualization container-other.
    Detected architecture x86-64.

    Welcome to Fedora 32 (Container Image)!

    Set hostname to <7f0982604176>.
    Initializing machine ID from random generator.
    Failed to create /init.scope control group: Permission denied
    Failed to allocate manager object: Permission denied
    [!!!!!!] Failed to allocate manager object.
    Exiting PID 1...

Reading this error message, I wasn't quite sure where I had gone wrong. The "permission denied" message didn't seem to make much sense to me because obviously Podman had started the container. Why is that obvious?

* I see a "Welcome to Fedora 32 (Container Image)!" message - that means the container started.
* If I didn't have permission to run Podman or to use the image, I wouldn't have seen that message. Plus, I built the image as my own user account, so it would be incredibly weird if the resulting image were owned by someone other than me.

So, if the error is permission denied, and it's not me getting denied, it had to be something with the container. In these cases it makes sense to check the audit log for SELinux AVC (Access Vector Cache) denials. Among my findings was the following:

    [luke@Fedora ~]$ sudo sealert -a /var/log/audit/audit.log

    ...

    SELinux is preventing systemd from write access on the directory libpod-53808768ab5caa62b545bc57c88001fae301b3111a93deb02386d3a81bcb84e1.scope.

    *****  Plugin catchall_boolean (89.3 confidence) suggests   ******************

    If you want to allow container to manage cgroup
    Then you must tell SELinux about this by enabling the 'container_manage_cgroup' boolean.

    Do
    setsebool -P container_manage_cgroup 1

    ...

I had definitely violated some SELinux policy.

I really like how the sealert tool tells you exactly how to solve the problem - setsebool -P container_manage_cgroup 1.
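If sealert isn't installed, the raw denials can also be pulled out of the audit log with ausearch. A minimal sketch, assuming the audit tools are present and you are running as root (the function name is my own):

```shell
#!/bin/sh
# Print SELinux AVC denials recorded since a given time (default: "recent",
# which ausearch treats as the last ten minutes).
recent_avc() {
    ausearch -m AVC -ts "${1:-recent}"
}

# ausearch needs to read the audit log, which normally requires root.
if command -v ausearch >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    recent_avc recent || echo "no matching denials found"
else
    echo "run this as root on a host with the audit tools installed"
fi
```

The output is less friendly than sealert's, but it shows the source and target contexts of each denial.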

I wasn't entirely certain what this boolean controlled - I know it says manage cgroup, but what does that mean anyway? So I did a little bit of Googling and stumbled upon this Red Hat blog post: https://developers.redhat.com/blog/2019/04/24/how-to-run-systemd-in-a-container/ which has a great description of the issue I was facing:

On SELinux systems, systemd attempts to write to the cgroup file system. Containers writing to the cgroup file system are denied by default. The container_manage_cgroup boolean must be enabled for this to be allowed on an SELinux separated system.

That is where I was running into my permission denied error - or not me really, but the container process trying to write to the cgroup file system. By default, container processes cannot write to the cgroup file system, but they can be given permission to do so by flipping the container_manage_cgroup boolean to "true".

Turns out that works!
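Before flipping a boolean it can be handy to check its current state. A minimal sketch using getsebool, which prints lines of the form name --> on or name --> off (the helper function name is my own):

```shell
#!/bin/sh
# Succeed if the named SELinux boolean is currently on.
bool_is_on() {
    getsebool "$1" 2>/dev/null | grep -q -- '--> on$'
}

BOOL=container_manage_cgroup
if bool_is_on "$BOOL"; then
    echo "$BOOL is already on"
else
    echo "$BOOL is off (or the SELinux tools are unavailable)"
    echo "to enable it persistently: sudo setsebool -P $BOOL 1"
fi
```

The -P flag is what makes the change survive a reboot; without it the boolean only changes for the running system.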

After running the following command:

    sudo setsebool -P container_manage_cgroup 1

I tried to run my container image again - this time with much friendlier output:

    [luke@Fedora ~]$ podman run -i -it localhost/test:testing

    systemd v245.4-1.fc32 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
    Detected virtualization container-other.
    Detected architecture x86-64.

    Welcome to Fedora 32 (Container Image)!

    Set hostname to <3004166ec95b>.
    Initializing machine ID from random generator.
    [  OK  ] Started Dispatch Password Requests to Console Directory Watch.
    [  OK  ] Started Forward Password Requests to Wall Directory Watch.
    [  OK  ] Reached target Local File Systems.
    [  OK  ] Reached target Paths.
    [  OK  ] Reached target Remote File Systems.
    [  OK  ] Reached target Slices.
    [  OK  ] Reached target Swap.
    [  OK  ] Listening on Process Core Dump Socket.
    [  OK  ] Listening on initctl Compatibility Named Pipe.
    [  OK  ] Listening on Journal Socket (/dev/log).
    [  OK  ] Listening on Journal Socket.
    [  OK  ] Listening on User Database Manager Socket.
             Starting Rebuild Dynamic Linker Cache...
             Starting Journal Service...
             Starting Create System Users...
    [  OK  ] Finished Create System Users.
    [  OK  ] Finished Rebuild Dynamic Linker Cache.
    [  OK  ] Started Journal Service.
             Starting Flush Journal to Persistent Storage...
    [  OK  ] Finished Flush Journal to Persistent Storage.
             Starting Create Volatile Files and Directories...
    [  OK  ] Finished Create Volatile Files and Directories.
             Starting Rebuild Journal Catalog...
             Starting Update UTMP about System Boot/Shutdown...
    [  OK  ] Finished Update UTMP about System Boot/Shutdown.
    [  OK  ] Finished Rebuild Journal Catalog.
             Starting Update is Completed...
    [  OK  ] Finished Update is Completed.
    [  OK  ] Reached target System Initialization.
    [  OK  ] Started Daily Cleanup of Temporary Directories.
    [  OK  ] Reached target Timers.
    [  OK  ] Listening on D-Bus System Message Bus Socket.
    [  OK  ] Reached target Sockets.
    [  OK  ] Reached target Basic System.
             Starting MariaDB 10.4 database server...
             Starting Home Area Manager...
             Starting Permit User Sessions...
    [  OK  ] Finished Permit User Sessions.
             Starting D-Bus System Message Bus...
    [  OK  ] Started D-Bus System Message Bus.
    [  OK  ] Started Home Area Manager.
    [  OK  ] Started MariaDB 10.4 database server.
    [  OK  ] Reached target Multi-User System.
             Starting Update UTMP about System Runlevel Changes...
    [  OK  ] Finished Update UTMP about System Runlevel Changes.

Note: It's better to run this in detached mode - I just wanted to see the output as the container came up so that I would see if there were any other hangups.

Detached mode would look like this:

    podman run -d --rm -v /srv/sudoedit.com/data/mysql/:/var/lib/mysql/:Z localhost/test:testing

Checking the status I see:

    [luke@Fedora database]$ podman ps
    CONTAINER ID  IMAGE                    COMMAND     CREATED      STATUS          PORTS  NAMES
    bebb132a0bb2  localhost/test:testing  /sbin/init  2 hours ago  Up 2 hours ago         suspicious_beaver
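Seeing the container in podman ps only tells me that systemd is up, not that the database is. Here is a rough sketch of how the detached run and a service check could be scripted. The container name "mariadb" is a hypothetical one that I'd set with --name, and the function names are my own; the image tag and data directory are the ones used above.

```shell
#!/bin/sh
# "mariadb" is a hypothetical container name; the image tag and data
# directory are the ones used in this post.
CONTAINER="${CONTAINER:-mariadb}"
IMAGE="${IMAGE:-localhost/test:testing}"
DATA_DIR="${DATA_DIR:-/srv/sudoedit.com/data/mysql/}"

# Start detached with a fixed name. The :Z suffix asks Podman to relabel
# the host directory with a private SELinux label for this container.
run_db() {
    podman run -d --rm --name "$CONTAINER" \
        -v "$DATA_DIR:/var/lib/mysql/:Z" "$IMAGE"
}

# Ask systemd inside the container whether the MariaDB unit is active.
db_is_active() {
    podman exec "$CONTAINER" systemctl is-active --quiet mariadb
}

if command -v podman >/dev/null 2>&1 && podman container exists "$CONTAINER"; then
    if db_is_active; then
        echo "mariadb is active"
    else
        echo "mariadb is not active"
    fi
else
    echo "container $CONTAINER not found here; start it with run_db"
fi
```

A fixed name saves looking up whatever random name Podman generated each time.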

Next steps

I'm fairly close to bundling this image up and putting it into production. I have a few minor details that I need to sort out and a few things to test.

* Test restoring MariaDB from a mysqldump in the container on a writable mount. I've done some preliminary testing and it's definitely possible - I just need to break the steps down and script it.
* How to manage updates to the container - do I want to rebuild the image on a weekly/monthly basis and push it up to my server? Or should I just build the image directly on the host and restart/rebuild as necessary? I want to find out how others handle that sort of thing.
* How to get the webserver talking to the database?
  - Use port forwarding?
  - Some way to use the local unix socket?
  - Is one way better than the other?
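For the first item on that list, the restore could probably be scripted along these lines. This is only a sketch, not the finished procedure: the container name "mariadb" is hypothetical, the function name is my own, and it assumes the root account inside the container can log in over the local socket without a password.

```shell
#!/bin/sh
# "mariadb" is a hypothetical container name.
CONTAINER="${CONTAINER:-mariadb}"

# Feed a dump file to the mysql client inside the container. The -i flag
# keeps stdin open so the redirected file reaches the client.
restore_dump() {
    dump_file="$1"
    if [ ! -r "$dump_file" ]; then
        echo "cannot read dump file: $dump_file" >&2
        return 1
    fi
    podman exec -i "$CONTAINER" mysql < "$dump_file"
}

# Restore only when a dump file is given on the command line.
if [ -n "${1:-}" ]; then
    restore_dump "$1"
else
    echo "usage: $0 <dump.sql>"
fi
```

Because the dump is streamed over stdin, the file never has to be copied into the container or the mounted data directory.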

That's about it for the database. I feel like I can work out some of these final pieces as I go. I'd like to have a patching plan in place before jumping in, but I think the automated mysqldump restore is not a show-stopping problem. I'm thinking sometime in the next week I'll be able to have the first version of the MariaDB container up and running.