Kubernetes container security – Linux capabilities

Revision 2

Intro

Linux Capabilities in Kubernetes

Capabilities

AUDIT_CONTROL

AUDIT_READ

AUDIT_WRITE

BLOCK_SUSPEND

BPF

CHOWN

DAC_OVERRIDE

DAC_READ_SEARCH

FOWNER

FSETID

IPC_LOCK

IPC_OWNER

KILL

LEASE

LINUX_IMMUTABLE

MAC_ADMIN

MAC_OVERRIDE

MKNOD

NET_ADMIN

NET_BIND_SERVICE

NET_BROADCAST

NET_RAW

PERFMON

SETGID

SETFCAP

SETPCAP

SETUID

SYS_ADMIN

SYS_BOOT

SYS_CHROOT

SYS_MODULE

SYS_NICE

SYS_PACCT

SYS_PTRACE

SYS_RAWIO

SYS_RESOURCE

SYS_TIME

SYS_TTY_CONFIG

SYSLOG

WAKE_ALARM

 

Intro

Linux capabilities split the privileges traditionally reserved for root into distinct units that can be granted or removed independently. They are per-thread attributes, and they allow an otherwise unprivileged process (including one running in a container with a non-zero UID and GID) to perform specific privileged actions.

This document describes the different capabilities: their meaning and things to look out for when deploying them on a production cluster.

 

Linux Capabilities in Kubernetes

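Linux capabilities can be dropped from (denied to) or added to a Pod by using an OpenShift Security Context Constraint (SCC) or a Kubernetes Pod Security Policy (PSP). PSP was deprecated in Kubernetes 1.21 and removed in 1.25, so it applies to older versions only.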

The following is an example of an SCC or PSP YAML fragment with capabilities dropped:

requiredDropCapabilities:
- KILL
- MKNOD
- SYS_CHROOT
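The same drops can also be requested per container through the Pod's own securityContext. The manifest below is a minimal sketch, not a recommended baseline: the Pod name, container name, and image are hypothetical placeholders, and only the capabilities stanza is the point of the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capabilities-demo                 # hypothetical name
spec:
  containers:
  - name: app                             # hypothetical container name
    image: registry.example.com/app:1.0   # hypothetical image
    securityContext:
      capabilities:
        drop:                             # names are written without the CAP_ prefix
        - KILL
        - MKNOD
        - SYS_CHROOT
```

A cluster-wide SCC or PSP still takes precedence: a container cannot add a capability that the governing policy requires to be dropped.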

 

 

Capabilities

AUDIT_CONTROL

This capability allows a process to enable and disable kernel auditing, change the auditing filter rules, and retrieve the auditing status and filtering rules.

 

Impact

This capability can be safely dropped.

 

Reasoning:

Processes running isolated in a container should not be allowed to access the kernel in any way unless there is a very specific need, in which case a specific SCC or PSP should be created for that container.
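A dedicated policy for such a workload could allow the capability explicitly rather than opening it to every Pod. The fragment below is a sketch only, assuming an SCC or PSP whose remaining fields are defined elsewhere; it is not a complete manifest.

```yaml
# Fragment of a dedicated SCC or PSP for a single audited workload (hypothetical)
allowedCapabilities:
- AUDIT_CONTROL
```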

 

AUDIT_READ

This capability allows a process to read the audit log through a multicast netlink socket.

 

Impact

This capability may be required by processes such as Elasticsearch and similar; it is not recommended to remove it without first investigating the impact on related processes.

 

Reasoning:

Disabling this can cripple several applications, especially ones related to infrastructure.

AUDIT_WRITE

This capability allows a process to write to the audit log.

 

Impact

A process with this capability can write records into the node’s kernel audit system.

 

Reasoning:

Processes running inside containers should avoid accessing the node’s kernel.

BLOCK_SUSPEND

This capability allows a process to use features that block system suspend on a Linux system.

 

Impact

Although Kubernetes nodes should never go into suspend mode, this capability should be blocked.

 

Reasoning:

Unnecessary access to Linux kernel capabilities is strongly discouraged; this capability should be disabled.

BPF

Allows a process to perform privileged BPF (Berkeley Packet Filter) operations.

 

Impact

A process with this capability can perform privileged BPF operations against the node’s kernel; this capability should be disabled.

 

Reasoning:

This capability allows access to BPF (where the kernel supports it) and should be disabled.

CHOWN

Allows a process to change file and directory owner UID and GID.

 

Impact

Processes running inside containers generally should not be allowed to change file system ownership.

This should be disabled ONLY after verifying that all running Pods are not performing file system ownership changes either in a run script or during normal operation.

 

Reasoning:

Many developers migrating applications from legacy VMs onto Kubernetes reuse pre-existing scripts and methodologies; disabling this capability may cause some containers to fail continuously.

DAC_OVERRIDE

Allows a process to override discretionary access control (DAC) by bypassing file read, write, and execute permission checks.

 

Impact

A process with this capability can read, write, and execute files regardless of their owner and permission bits. This is highly discouraged and should be avoided.

 

Reasoning:

As with CHOWN, a migrated application that relies on bypassing file permissions may fail to start once this capability is dropped.

 

DAC_READ_SEARCH

Allows a process to bypass file read permission checks and directory read and execute (search) permission checks.

 

Impact

A process with this capability can read any file and traverse any directory regardless of its owner and permission bits. This is highly discouraged and should be avoided.

 

Reasoning:

As with CHOWN, a migrated application that relies on reading files it does not own may fail to start once this capability is dropped.

 

FOWNER

Bypasses permission checks on operations that normally require the process’s filesystem UID to match the file’s UID (e.g. chmod, utime).

 

Impact

A process may change file attributes while bypassing permission checks: deleting files in directories with the sticky bit set, setting ACLs on arbitrary files, and specifying O_NOATIME on arbitrary files, meaning a file’s last access time is not updated when it is read.

 

Reasoning:

This capability should be dropped (disabled) on containers, as it may be used to change the container’s filesystem and by stealth techniques that hide those changes.

 

FSETID

Allows a process to modify a file without the kernel clearing its set-user-ID (setuid) and set-group-ID (setgid) mode bits.

 

Impact

This capability may be used to preserve setuid and setgid bits on a modified executable, which then runs as another UID or GID. While setuid and setgid executables are common practice in Linux environments, it is strongly recommended to drop this capability to eliminate any chance of a process running with elevated privileges.

 

Reasoning:

This option should be dropped (disabled) – allowing it enables a process to perform tasks (run executables) with elevated permissions.

 

IPC_LOCK

Allows a process to lock its allotted virtual memory (i.e. prevent that memory from being paged out).

 

Impact

This capability allows a process to lock its virtual memory space against paging; many applications require this capability for normal operation.

 

Reasoning:

This capability should be allowed (not dropped) to avoid maintaining a per-application SCC.
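Where IPC_LOCK is not granted to every Pod, a single container can request it explicitly. The fragment below is a minimal sketch of the container-level stanza, assuming the governing SCC or PSP lists IPC_LOCK among its allowed capabilities; it is not a complete manifest.

```yaml
# Fragment of a container spec inside a Pod manifest
securityContext:
  capabilities:
    add:
    - IPC_LOCK
```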

IPC_OWNER

Allows a process to bypass permission checks for operations on System V IPC objects.

 

Impact

This allows a process to access the System V IPC objects in its IPC namespace (message queues, semaphore sets, and shared memory segments) while bypassing permission checks.

 

Reasoning:

This capability should be dropped, as it allows a process to access sensitive information and even interfere with other processes’ System V objects.

 

KILL

Allows a process to send signals to a process bypassing permission checks.

 

Impact

This capability allows a process to send signals to processes that it does not own.

 

Reasoning:

This capability should be dropped, as it allows a process to send signals to other processes while bypassing permission checks.

LEASE

Allows a process to establish leases on arbitrary files. A lease on an open file descriptor notifies the holder when another process tries to open or truncate the file.

 

Impact

This functionality is required by some processes for normal operation when opening and manipulating files.

 

Reasoning:

This capability should not be dropped, so that processes can manipulate files.

 

LINUX_IMMUTABLE

Allows a process to set and clear the append-only and immutable file flags.

 

Impact

This functionality is required by some processes for normal operation: locking files against changes and allowing append-only writes.

 

Reasoning:

This capability should not be dropped, so that processes can lock files.

 

MAC_ADMIN

Allows a process to change the MAC (Mandatory Access Control) configuration or state; implemented for the Smack Linux Security Module (LSM).

 

Impact

A process can change the node’s mandatory access control configuration.

 

Reasoning:

This capability should be dropped to prevent a process from changing mandatory access control.

 

MAC_OVERRIDE

Allows a process to override MAC (Mandatory Access Control); implemented for the Smack Linux Security Module (LSM).

 

Impact

A process can override the node’s mandatory access control.

 

Reasoning:

This capability should be dropped to prevent a process from overriding mandatory access control.

 

MKNOD

Allows a process to create special files, such as device nodes, using mknod.

 

Impact

A process can create special files, including device nodes.

 

Reasoning:

This capability should be dropped to prevent a process from creating device nodes.

 

NET_ADMIN

Allows a process to change interface configuration; administer the IP firewall, masquerading, and accounting; modify routing tables; bind to any address for transparent proxying; set type-of-service (TOS); clear driver statistics; set promiscuous mode; enable multicasting; and set privileged socket options.

 

Impact

A process can change the network configuration.

 

Reasoning:

This capability should be dropped to prevent a process from changing the network configuration.

 

NET_BIND_SERVICE

Allows a process to bind a socket to Internet domain privileged ports.

Impact

A process can bind to privileged ports (port numbers below 1024).

 

Reasoning:

This capability should not be dropped, as some processes bind to port 80 or 443 inside containers.
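A container that only needs to bind to a privileged port can keep this one capability while giving up the rest. The fragment below is a minimal sketch of the container-level stanza, assuming the governing SCC or PSP allows NET_BIND_SERVICE; it is not a complete manifest.

```yaml
# Fragment of a container spec inside a Pod manifest
securityContext:
  capabilities:
    drop:
    - ALL                # drop every capability first
    add:
    - NET_BIND_SERVICE   # then add back only the port-binding capability
```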

 

NET_BROADCAST

Allow a process to make network broadcasts and multicasts.

Impact

Allow a process to perform broadcasts and multicasts on a socket.

 

Reasoning:

This capability should not be dropped, as some processes require multicasts.

 

NET_RAW

Allows a process to use RAW and PACKET sockets, and so to craft IP packets (including ICMP) and send them.

Impact

This capability allows a process to craft packets and send them over the network.

 

Reasoning:

On development environments it is recommended to keep this capability for diagnostics purposes.

On production systems this capability should be dropped to prevent internal network attacks (between containers).
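For production clusters, the drop can be enforced by extending the SCC or PSP fragment shown earlier in this document. This is a sketch under the assumption that the three original entries are kept and NET_RAW is simply appended:

```yaml
# Fragment of a production SCC or PSP
requiredDropCapabilities:
- KILL
- MKNOD
- SYS_CHROOT
- NET_RAW
```

Tools such as ping typically rely on raw sockets, so diagnostics that depend on them should be expected to fail in containers governed by this policy.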

 

PERFMON

Employ performance monitoring mechanisms (BPF, kernel perf_event_open).

Impact

This capability allows a process to set up performance monitoring.

 

Reasoning:

This capability should be disabled to prevent direct access to the kernel monitoring. Intense performance monitoring can slow down system operation and significantly increase context switches.

 

SETGID

Manipulate GID (in the container namespace).

Impact

This capability allows a process to make arbitrary manipulations of its GIDs and supplementary GID list.

 

Reasoning:

This capability should be disabled to prevent processes from changing GIDs inside containers.

 

SETFCAP

Set arbitrary capabilities on a file.

Impact

This capability allows a process to set arbitrary capabilities on a file.

 

Reasoning:

This capability should be disabled.

 

 

SETPCAP

Add or drop capabilities on threads.

Impact

This capability lets a process add capabilities to, or drop capabilities from, threads.

 

Reasoning:

This capability should be disabled.

 

SETUID

Manipulate UID (in the container namespace).

Impact

This capability allows a process to make arbitrary manipulations of its UIDs.

 

Reasoning:

This capability should be disabled to prevent processes from changing UIDs inside containers.

 

SYS_ADMIN

This capability grants a process a broad range of system administration operations, such as mount and umount, swapon and swapoff, and setting the host name and domain name.

Impact

This capability grants a process many elevated privileges over the node at once.

 

Reasoning:

This capability should be disabled.

 

SYS_BOOT

This capability allows a process to use kexec_load or reboot.

Impact

A process can call dangerous kernel functions that reboot the node or load a new kernel.

 

Reasoning:

This capability should be disabled; calling these functions may compromise the node.

 

SYS_CHROOT

This capability allows a process to perform a chroot action.

Impact

A process can change its root directory using chroot and change mount namespaces using setns.

 

Reasoning:

This capability should be disabled, as it may allow a process to change between mount points on a node.

 

SYS_MODULE

This capability allows a process to load and unload kernel modules.

Impact

A process can load and unload kernel modules and potentially impact OS stability and/or compromise OS security.

 

Reasoning:

This capability should be disabled. A kernel module loaded from inside a container runs as code in the node’s kernel.

 

 

SYS_NICE

Lower a process’s nice value (raising its priority), and set CPU affinity, real-time scheduling policies, and I/O scheduling policies.

Impact

A process with this capability may dangerously raise the priority of running processes (including itself) to the point of hogging a node: as the highest-priority process on the node (even higher than kernel processes), it can consume too many resources.

 

Reasoning:

This capability should be disabled to minimize the ability of a process running inside a container to impact node functionality and stability.

 

SYS_PACCT

Allows a process to use kernel process accounting.

Impact

Allowing a process to use kernel process accounting can expose dangerous information to an attacker (running UID, commands being run, CPU time, etc.).

 

Reasoning:

This capability should be disabled to prevent divulging information to a potential attacker and to prevent an attacker from overloading the system.

 

SYS_PTRACE

Allows a process to trace other processes, get information about their actions, and transfer data to and from their memory.

Impact

This capability is dangerous in live environments; it may allow an attacker to transfer information directly into a process’s memory and/or trace process actions.

 

Reasoning:

This capability should be disabled as it has the potential to load the system (if several processes are traced) and it may also expose dangerous information to a potential attacker about how the process running inside a container is operating.

 

SYS_RAWIO

Allows a process to access /proc/kcore, change its I/O privilege level (iopl) and I/O port permissions (ioperm), access CPU model-specific registers (MSRs), and access memory and I/O devices directly.

Impact

Allowing access to raw I/O devices is very dangerous on any system, let alone in a containerized environment.

 

Reasoning:

This capability should be disabled; direct access to I/O devices should not be allowed for processes running inside containers.

 

SYS_RESOURCE

Allows a process to override disk quotas, increase resource limits, override the limit on the number of processes, override the number of consoles available, bypass the limit on in-flight file descriptors when passing a file descriptor to another process, override the maximum pipe buffer size, set oom_score_adj, etc.

Impact

A process can raise resource limits and OS limitations and basically render the OS unusable, or get specific processes killed by marking them to the OOM killer as candidates for termination.

 

Reasoning:

This capability should be disabled to prevent a process inside a container from affecting other containers and processes.

 

SYS_TIME

Allows a process to set the system time and the real-time (hardware) clock (RTC).

Impact

This may be used by an attacker to break communication with time-sensitive destinations (HTTPS certificate validation, for example) and even interfere with proper system operation.

 

Reasoning:

This capability should be disabled. Any time conversion inside a container should be done by the application, so as to conform to the “deploy anytime anywhere” methodology, in which a container may be deployed in different environments and is still required to function properly.

 

SYS_TTY_CONFIG

Allows a process to use vhangup and perform privileged operations on virtual terminals (using ioctl).

Impact

A tty inside a container is used to run the foreground application; a process inside a container should not be able to meddle with the terminal configuration or send a vhangup to it.

 

Reasoning:

This capability should be disabled. Access to the tty from inside a container is highly discouraged: it may be used to exploit the running process(es) inside that container by injecting data into the terminal, setting the window size, redirecting console output, etc.

 

SYSLOG

Perform syslog privileged operations, expose kernel data inside the container.

Impact

A process can change loglevel, close and open the log etc.

 

Reasoning:

This capability should be disabled. Changing syslog configuration should not be done from inside a container.

WAKE_ALARM

Set kernel timers to wake up the system.

Impact

A process can trigger something that will wake up the system.

 

Reasoning:

This capability should be disabled. The kernel timers should not be accessible from inside containers.