codedokode 20 hours ago [-]
I think it was a bad idea to put cryptographic APIs or VPN in the kernel. If userspace is too slow for this, you should either reduce context-switch overhead or create a special kind of process, one that is isolated but quick to switch into. They are repeating Windows mistakes.
Those Windows mistakes have been sorted out for a long time now.
cpach 17 hours ago [-]
Well at least if it’s crufty stuff like AF_ALG that barely no-one is using and is kind of a forgotten place of the kernel.
I don’t oppose reasonable crypto in the kernel, like WireGuard.
cluckindan 9 hours ago [-]
>barely no-one is using
Except, you know, many things
cpach 4 hours ago [-]
Many? No, I don’t agree.
nwallin 13 hours ago [-]
I like the idea of keeping stuff out of the kernel as much as possible, but in this case, there are good reasons why cryptography has to live in the kernel.
We need on-disk encryption, and we need to be able to boot from an encrypted disk. So we need encryption for that.
We need network filesystems, and we need the traffic over the network to be encrypted. So we need encryption.
IPsec, for better or for worse, is authenticated and partially encrypted at the transport layer, so if we want a linux machine to speak IPsec, we need encryption.
Fixing/changing this would require a huge restructuring of the kernel; it would basically require switching to a microkernel. Given the fact that nobody's ever written a microkernel that doesn't completely suck ass, I don't know that it would be worth the effort.
cpach 12 hours ago [-]
Sure. But it would probably still be a good thing if the kernel maintainers could tear out AF_ALG.
ranger_danger 13 hours ago [-]
What about having a way to run the same crypto code but in userspace? Or perhaps turn it into a library that can be used from userspace.
ohnei 18 hours ago [-]
I don't think it was a bad idea. Doing any idea requires an investment, and the kernel layer would have been the better investment; just ask the history of export control law what the US feared breaking more. Having security in userland means attacks in the kernel or in userland are both worthwhile against it. In the kernel it could have been secured better than OpenSSL was, with fewer resources, and could have kept keys unavailable from userland. Instead it got basically no uptake as everyone hobbled along on slightly more resources spread even thinner across OpenSSL clones.
amluto 1 days ago [-]
Sigh.
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image, or even a common layer in an image, to isolate dangerous workloads from each other? Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
jeroenhd 1 days ago [-]
> I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
I see a lot of projects blocking those sockets in containers as a response to this exploit, but it seems rather strange to me. We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use. We don't mass-disable kernel modules everywhere every time someone discovers an EoP bug, do we? Did we blacklist OpenSSL's binaries after Heartbleed?
I suppose it makes sense as a default on vulnerable kernels (though people running vulnerable kernels should put effort into patching rather than workarounds in my opinion), but these defaults are going to be around ten years from now when copy.fail is a distant memory.
throw0101a 21 hours ago [-]
> We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use.
The need for this feature/functionality in the first place is questioned by some:
> As someone who works on the Linux kernel's cryptography code, the regularly occurring AF_ALG exploits are really frustrating. AF_ALG, which was added to the kernel many years ago without sufficient review, should not exist. It's very complex, and it exposes a massive attack surface to unprivileged userspace programs. And it's almost completely unnecessary, as userspace already has its own cryptography code to use. The kernel's cryptography code is just for in-kernel users (for example, dm-crypt).
> The algorithm being used in this [specific] exploit, "authencesn", is even an IPsec implementation detail, which never should have been exposed to userspace as a general-purpose en/decryption API. […]

* https://news.ycombinator.com/item?id=47952181#unv_47956312

More than one time.

> a cryptographic performance enhancement feature

It's very rarely used.
> Did we blacklist OpenSSL's binaries after Heartbleed?
No, but lots of companies have since migrated away. OpenSSL was harder to move away from because there weren't obvious drop-in replacements. Blocking a syscall that you never actually used is simple and effective.
e12e 23 hours ago [-]
In fairness, after Heartbleed there was quite a push to move away from OpenSSL - like Google's BoringSSL, OpenBSD's LibreSSL, and Mozilla's NSS or GnuTLS - but the alternative here would be moving to a different kernel, like FreeBSD or OpenSolaris/illumos ...
PunchyHamster 23 hours ago [-]
that's just moving to a kernel that had 1000x less eyes on it. Yeah, sure it will have less exploits, but purely because nobody bothers to look when there are much juicier targets on Linux.

But I am disappointed that we still don't have a clear OpenSSL successor; there is nothing to be salvaged from this mess of a project.
steve1977 15 hours ago [-]
Less eyes but also less problems like "it's been fixed in the kernel but not in distro XYZ"
DarkUranium 22 hours ago [-]
1000x less eyes is true, but also: Linux, even in the kernel, has a long history of "move fast and break things".
Yes, the syscall API is (famously) stable, but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).
wahern 16 hours ago [-]
If you're using a container as a sandbox, you should use a default-deny policy and allow only the facilities required by the container. Though, in practice containers are used to package a huge collection of software, most of which the container creator has no familiarity with and no ability to determine what runtime dependencies, beyond other package names, are required. This is one of the reasons why containers, generally speaking, don't offer reliable security. If you can't or won't carefully design your components to sandbox themselves (e.g. by using seccomp and landlock with policies tailored to the specific component), like Chrome or various OpenBSD daemons, then it's far better to use VMs for isolation; and if you do design your components that way, containers are superfluous from a security perspective.
nubinetwork 23 hours ago [-]
> We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time?
To my knowledge, not many things were using the in-kernel code anyways, the recommended way is to use userland tools...
It's optional for openssl, systemd apparently needs it, but deleting the module from one of my systems didn't cause any issues. /shrug
PunchyHamster 23 hours ago [-]
I haven't had it loaded on 100s of servers ranging in kernel version from 5.10 to 6.14. The use is just that low.
Retr0id 22 hours ago [-]
iiuc the AF_ALG interface only offers real performance wins if you have specialized hardware that the kernel can offload computations to. If you're not using that hardware, there's little reason not to do the crypto in userspace.
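(For readers who haven't seen it, this is roughly what the AF_ALG interface being discussed looks like from unprivileged userspace - a minimal SHA-256 sketch with error handling omitted; the constants and struct come from <sys/socket.h> and <linux/if_alg.h>:)

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/if_alg.h>

    int main(void)
    {
        /* The "transform" socket selects an algorithm by name. */
        struct sockaddr_alg sa = {
            .salg_family = AF_ALG,
            .salg_type   = "hash",
            .salg_name   = "sha256",
        };
        int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(tfm, (struct sockaddr *)&sa, sizeof(sa));

        /* accept() yields an operation socket: write data in, read the digest out. */
        int op = accept(tfm, NULL, NULL);
        send(op, "hello", strlen("hello"), 0);

        unsigned char digest[32];
        read(op, digest, sizeof(digest));
        for (int i = 0; i < 32; i++)
            printf("%02x", digest[i]);
        printf("\n");

        close(op);
        close(tfm);
        return 0;
    }

The attack-surface complaint earlier in the thread is about how much in-kernel machinery a couple of ordinary socket calls like these can reach.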
hlieberman 1 days ago [-]
In fact, the authors specifically say on the very first line of their website that the copy/fail primitive can be used as a container escape. The entire premise of this article is flawed and irresponsible.
eqvinox 20 hours ago [-]
AIUI they haven't shown a container escape and are just claiming it so far. Or did I miss something?
angry_octet 9 hours ago [-]
It's not hard to see ways to escape the container with a cache write primitive. I suspect the copy.fail team have held back on releasing a PoC because of the disruption it could cause.
mjmas 20 hours ago [-]
Having write access on anything you can read should be enough if libraries or binaries are shared (read-only) between the host and container.
eqvinox 18 hours ago [-]
> if libraries or binaries are shared (read-only) between the host and container.
Yeah, exactly - that's a pretty big "if", and not how a lot of container automation does things. In particular you'd need to hit the base system; it's no help at all if you can only hit application files that the host does nothing with.
fguerraz 1 days ago [-]
I just contributed this [1] which does what you want for seccomp. Well, not by default, but profiling is now effective against this attack.

Blanket blocking socketcall() caused regressions for all 32-bit applications trying to make sockets. In theory, glibc disables socketcall when running on kernel version >= 4.3. In practice, Debian/Fedora/Ubuntu all set glibc's "expected kernel version" to 3.2, so socketcall() is still used by most 32-bit glibc binaries shipped.

https://salsa.debian.org/glibc-team/glibc/-/blob/sid/debian/...

https://src.fedoraproject.org/rpms/glibc/blob/rawhide/f/glib...

Oh, and this [2] just happened.

[1] https://github.com/containers/oci-seccomp-bpf-hook/pull/209 [2] https://github.com/moby/moby/pull/52501
That’s… great. But who runs containerised 32 bit applications?
dwroberts 1 days ago [-]
There is an addendum at the bottom where they admit the page corruption is still problematic even with rootless podman.
Although using this to justify their migration to micro-VMs is very strange to me. Sure, for this CVE it would have been better, but a future attack could surely hit a component shared across VMs but not containers? Are people really choosing technology based on the CVE of the week?
anygivnthursday 1 days ago [-]
Containers were never a security boundary. VMs have better isolation, which is why people choose them for security. Containers are convenience and usually have better performance.
dwroberts 1 days ago [-]
I see the ‘not a security boundary’ thing repeated constantly, and while it makes sense (eg. they’re sharing the underlying kernel or at least some access to it) if you think about it a little more, VMs are not magically different: they are better isolated, but VMs on the same host still share the host in common. A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers
throw0101c 20 hours ago [-]
> […] VMs are not magically different: they are better isolated, but VMs on the same host still share the host in common.
VMs are not different due to 'magic' but through hardware assist with things like Intel VT-x and AMD-V:

* https://en.wikipedia.org/wiki/X86_virtualization#Hardware-as...

* https://blog.lyc8503.net/en/post/hypervisor-explore/

* https://binarydebt.wordpress.com/2018/10/14/intel-virtualisa...
I disagree. VMs are better isolated to precisely the extent that (a) the attack surface is lower and (b) the implementation is simpler and thus less buggy.
Hardware virtualization has a strong effect on (b), but it’s not at all a foregone conclusion that it’s strictly in the direction of being more straightforward and thus more secure. And hardware features like fancy device passthrough encourages applications with a very, very large attack surface that has historically been full of holes.
necovek 1 days ago [-]
You are obviously right that these are similar in principle: a VM isolation exploit would lead to the same exposure as container-related isolation exploits.
VMs are considered vastly better because the surface area where exploits can happen is smaller and/or better isolated within the kernel.
If you are arguing the latter is not true — and we are all collectively hand-waving away a big chunk of the surface area, so that may be the case — it would help to be explicit about why you believe an exploit in that area is similarly likely.
robertlagrant 22 hours ago [-]
I would say it's the fact that "not a security boundary" appears to be a pass/fail statement, whereas the reality is more like a security continuum, along which VMs are further than containers.
necovek 10 hours ago [-]
I believe that is tautologically true, and thus not a very useful framing.
Security is obviously a continuum (eg. you can even have a bug in your IPMI FW, and a network packet could break in without any interaction with the OS; or there could be a HW bug too), but there is a discrete "jump" between containers and VMs to the extent that it is useful to call one a security boundary and the other not. Just like a firewall is a security boundary even if it can have security bugs.
Whether that jump in exploitable surface area warrants the distinction is the point: many believe it does.
anygivnthursday 3 hours ago [-]
But you also cannot just handwave away the difference with "it's a continuum". I did not use absolutes; I said "VMs are _better_ for security", which is already implicitly about a continuum.
Containers are mostly used as a deployment/packaging model, while VMs are typically used where stronger security is needed. This has been the established industry standard for a while. Look at major cloud providers, for example.
AWS:
> Unless explicitly stated, AWS does not consider a container or primitives such as an ECS task or a Kubernetes pod to be a security boundary. A notable exception to this is ECS tasks running AWS Fargate, where the isolation boundary is a task. To account for this, we recommend that you use Fargate with ECS if your applications have strict isolation requirements.
> When you’re using the Fargate launch type, each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
They also further recommend using different EC2 instances for even higher security requirements - which you can also run on dedicated hardware, etc. But the fact that you can further increase isolation beyond VMs does not make containers the same as VMs.

https://aws.amazon.com/blogs/security/security-consideration...

GCP:
> There’s one myth worth clearing up: containers do not provide an impermeable security boundary, nor do they aim to. They provide some restrictions on access to shared resources on a host, but they don’t necessarily prevent a malicious attacker from circumventing these restrictions. Although both containers and VMs encapsulate an application, the container is a boundary for the application, but the VM is a boundary for the application and its resources, including resource allocation.
> If you're running an untrusted workload on Kubernetes Engine and need a strong security boundary, you should fall back on the isolation provided by the Google Cloud Platform project. For workloads sharing the same level of trust, you may get by with multi-tenancy, where a container is run on the same node as other containers or another node in the same cluster.

https://cloud.google.com/blog/products/gcp/exploring-contain...
> Applications that run in traditional Linux containers access system resources in the same way that regular (non-containerized) applications do: by making system calls directly to the host kernel.
> One approach to improve container isolation is to run each container in its own virtual machine (VM). This gives each container its own "machine," including kernel and virtualized devices, completely separate from the host. Even if there is a vulnerability in the guest, the hypervisor still isolates the host, as well as other applications/containers running on the host.
> gVisor is more lightweight than a VM while maintaining a similar level of isolation. The core of gVisor is a kernel that runs as a normal, unprivileged process that supports most Linux system calls. This kernel is written in Go, which was chosen for its memory- and type-safety. Just like within a VM, an application running in a gVisor sandbox gets its own kernel and set of virtualized devices, distinct from the host and other sandboxes.

https://cloud.google.com/blog/products/identity-security/ope...
These guys are experts when it comes to securing workloads on shared infra and while there are different levels of isolation using various techniques, the current industry practice is to not consider regular Linux containers a security boundary.
staticassertion 20 hours ago [-]
Containers are a security boundary, yes.
> A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers
Yeah this almost never happens though whereas Linux privesc is 10x a day.
graemep 22 hours ago [-]
They may not provide the same isolation as VMs, but they clearly do limit some attacks. VMs do not provide the same isolation as physically separate hardware either.
I would have thought they provide better isolation than using multiple users which is the traditional security boundary.
It might depend on what you mean by a container? Are sandboxes such as Bubblewrap and Firejail containers?
anygivnthursday 3 hours ago [-]
> It might depend on what you mean by a container?
The article was about Podman and Linux namespaces
ButlerianJihad 1 days ago [-]
Containers are a convenience boundary, and they increase the complexity of your risk assessments.
It is easy for security scanners to scan a Linux system, but will they inspect your containers, and snaps, and flatpaks, and VMs? It is easy for DevOps to ssh into your Linux server, but can they also get logged in to each container, and do useful things? Your patches and all dependencies are up-to-date on your server, but those containers are still dragging around legacy dependencies, by design. Is your backup system aware of containers and capable of creating backup images or files, that are suitable for restoring back to service?
necovek 1 days ago [-]
Security scanners already support most container and VM image formats in widespread use.
Does this increase complexity? Yes, it does. Is it worth the cost? Depends on each individual case IMO.
throw0101c 20 hours ago [-]
> Security scanners already support most container and VM image formats in widespread use.
E.g.,
> Container Security stores and scans container images as the images are built, before production. It provides vulnerability and malware detection, along with continuous monitoring of container images. By integrating with the continuous integration and continuous deployment (CI/CD) systems that build container images, Container Security ensures every container reaching production is secure and compliant with enterprise policy.

* https://docs.tenable.com/enclave-security/container-security...
You need a tool like Anchore or PrismaCloud to scan the container images, then monitor them at runtime with PrismaCloud. Trellix can “scan”, however most people turn it off or exclude the container directories on the host because it can interfere with the running container.
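(As a concrete example, grype is Anchore's open-source scanner; the image reference below is just a placeholder:)

    # Scan a container image for known vulnerabilities before it ships
    grype registry.example.com/myapp:1.2.3

    # The same tool can also scan a local directory instead of an image
    grype dir:./myapp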
staticassertion 20 hours ago [-]
These sorts of vulns are extremely common on Linux. This one is making the rounds for various reasons but it's a good justification for a migration away from containers if your threat model is concerned about it.
MicroVMs have much lower attack surface and you can even toss a container into one if you'd like.
Or use gvisor, which mitigates this vulnerability.
PunchyHamster 23 hours ago [-]
> I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
there is no reason it would be default policy. Else might as well block every socket and just multiplex everything on stdin/out
cduzz 21 hours ago [-]
I'd have guessed that the default paranoia-first policy would be "drop everything; verify what you need" which would include AF_ALG.
share and enjoy!
tremon 18 hours ago [-]
How do you propose to implement that "drop everything except what you need" policy? Do your containers come with a detailed list of which OS services and syscalls are required? I think your idea has the same issue as what held back the adoption of selinux: many developers think that having to enumerate their application's behaviour like that is an undue burden.
A compounding issue is that using AF_ALG doesn't require a separate syscall: it's just using SYS_socket with the first argument set to 38. Your container behaviour specification needs to be specific enough to not only enumerate allowed syscalls, but the allowed values for each syscall parameter.
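(Which is doable, since seccomp can match on raw argument values. A sketch with libseccomp - the helper name is made up, and container runtimes express the same thing as an argument-filtered entry for "socket" in their seccomp JSON profiles:)

    #include <errno.h>
    #include <sys/socket.h>   /* AF_ALG */
    #include <seccomp.h>      /* libseccomp; link with -lseccomp */

    /* Allow everything, except make socket(AF_ALG, ...) fail with EAFNOSUPPORT. */
    int deny_af_alg(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (!ctx)
            return -1;

        /* Filter on the first syscall argument: the address family. */
        int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EAFNOSUPPORT),
                                  SCMP_SYS(socket), 1,
                                  SCMP_A0(SCMP_CMP_EQ, AF_ALG));
        if (rc == 0)
            rc = seccomp_load(ctx);
        seccomp_release(ctx);
        return rc;
    }

As noted elsewhere in the thread, 32-bit callers that go through socketcall() pass the address family inside a memory block, which an argument filter like this cannot inspect.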
cduzz 18 hours ago [-]
There are those who are paranoid and those who are expedient. If you're truly paranoid, you spin up the thing you want to run, measure what it does, and open the holes to allow it to do what it needs to. It's tedious and sometimes error-prone, but in some environments it is necessary.
In the vast majority of the world, you set permissions to what's reasonable and trust that most of the time things will work out pretty well and have a plan for if you need to fix things on the fly.
I personally am not terribly paranoid, but I've worked places where we had to be pretty paranoid (shared hosting).
staticassertion 20 hours ago [-]
The reason is that it's very rarely used and has a history of issues.
SV_BubbleTime 21 hours ago [-]
>might as well block every socket and just multiplex everything on stdin/out

You may be on to something…
> [...] that root was just my unprivileged podman user on the host
Couldn't you then simply re-run the exploit again as unprivileged podman user and gain root on the host?
kelnos 21 hours ago [-]
No, because you're still in the container, and there's no route to the host's root from there.
If you can orchestrate a container escape from the container's "root", then you're on to something.
wang_li 17 hours ago [-]
This pollutes the page cache, which affects the entire host. Getting "root" in a rootless container may mean nothing. But if it attacks the ls, ps, cat, grep, etc. commands, and any process outside the container invokes one of those commands, it runs the attacker's payload. What if the payload of the attack is just the same attack to escalate to root? So now you have escaped the container and gained root.
tuananh 22 hours ago [-]
did anyone try it? it's supposed to work, right?
bawolff 17 hours ago [-]
It sounds like they are saying the exploit works but the proof-of-concept doesn't due to superficial reasons(?) That hardly seems like something to brag about.
raddan 16 hours ago [-]
It’s not exactly superficial. It’s defense in depth: make sure that root inside a container is not root outside a container. There is also some good discussion about how the elevated user has access to page caches which can be dangerous when containers share pages (which is common). An attack “not working” for some seemingly trivial structural reason is a common trait of defense in depth. We would all love it if attacks like this were impossible, but absent some evidence of impossibility, why not hedge a little?
bawolff 8 hours ago [-]
> make sure that root inside a container is not root outside a container.
And it's a great idea in general, it just doesn't stop this exploit.

The proof of concept becomes root as a quick way to prove it has control of your computer. The system in the article isn't blocking the exploit, it's just blocking the mechanism used to prove it worked. It still worked; the test to verify it is now giving a false negative.

Good defense in depth disables necessary steps that by themselves aren't sufficient but are a necessary condition. In the context of this exploit (but not in general) this mitigation is more like renaming the su command to mysu and hoping nobody notices.
angry_octet 9 hours ago [-]
They seem to be in a weird state of denial? Why don't they make it clear that it's just this POC that is blocked? It's like they don't understand.
If the goal is just preventing full root privileges, a CapabilityBoundingSet in a systemd unit will do.
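(For reference, a sketch of that kind of unit hardening; the unit name is a placeholder, and this only limits what a root process inside the unit can do, it does not fix the underlying kernel bug:)

    # /etc/systemd/system/myworkload.service
    [Service]
    # Empty set: the service and all its children get no capabilities,
    # so even uid 0 inside the unit has no CAP_SYS_ADMIN, CAP_SETUID, ...
    CapabilityBoundingSet=
    NoNewPrivileges=yes
    # Often combined with:
    ProtectSystem=strict
    PrivateDevices=yes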
The dedicated website: https://copy.fail

However, Copy Fail can be used in many other ways not contained by containers or the above settings. For example it can modify /etc/ssl/certs to prepare for MitM attacks. If you have multiple containers based on the same image then one compromised CA set affects another.
I could be wrong, but I’m not sure those settings are enough to mitigate Copy Fail.
If your distro offers a patched kernel, it’s best to upgrade to that one and reboot.
You can also disable the vulnerable module (how to do it depends on what distro you’re using). But if you stay on an old unpatched kernel you might be exposed to other vulnerabilities.
netheril96 16 hours ago [-]
You are misinterpreting my goal here. I have patched my kernel against copy fail but I am thinking of ways to harden my setup against future CVEs in the kernel.
So the question is: before I learned about copy fail, what could I have done that would have limited the possible damage this vulnerability could do to me? CapabilityBoundingSet is one answer, and rootless podman as mentioned in this article is another. They don’t prevent everything, but at least `su` is useless.
cpach 15 hours ago [-]
If so, I would look into applying a decent seccomp profile.
Other hardening options could be to run the workloads inside a VM such as Firecracker, or under gVisor. But that might be more work to implement compared to seccomp.
2bitencryption 1 days ago [-]
tl;dr - within the container, the exploit works, and elevates to root (uid 0) within the container - BUT because that namespace actually maps to uid 1000 (the user) outside the container, the escalation does not flow up to the host.
But… does this escape the container? If not (the author seems to indicate it does not), then does it really matter whether you are in Docker or rootless Podman, since the end result is always the same: you have elevated to root within the container? If the rest of the container filesystem isolation does its job, the outcome is identical. Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?
firesteelrain 24 hours ago [-]
This is a problem and most people hadn’t considered it before because the caching is done to speed up build pipeline performance:
“ While rootless containers prevent the attacker from escalating to host root, the page cache is still shared across the host. Containers that re-use the same base image layers share the same cached pages for those layers — if a malicious CI job corrupts a binary in the page cache, other containers launched from that same image could end up executing the poisoned version.”
dwattttt 20 hours ago [-]
I'm no expert, but the kernel is shared between all containers and the host.
I don't believe the kernel maintains separate page caches for each container; a malicious CI job could corrupt a binary from any container, or the host.
firesteelrain 18 hours ago [-]
Only if there is a shared inode between host and container.
duped 14 hours ago [-]
Which is almost guaranteed if you're launching multiple containers with the same base image or shared layers.
kevincox 17 hours ago [-]
If any security relevant file from the host is mounted into the container this could be exploited quite easily. It is definitely a viable tool for escaping containers but it would require a bit of an attack chain and some containers may not be vulnerable.
grimblee 17 hours ago [-]
If I understand correctly, rootful podman with --userns=auto would also prevent the privilege escalation?
angry_octet 9 hours ago [-]
No it wouldn't. The exploit is not impacted by namespaces.
cpach 17 hours ago [-]
How?
grimblee 12 hours ago [-]
--userns=auto assigns a different user namespace to each container, so if you escape it you get a random uid far, far away from root.

It also protects other containers from the compromise, since they each have their own namespace and uid/gid range. The drawback, though, is that you can't mount shared volumes unless you use a pod, since you would see files from outside your uid/gid range as owned by nobody and inaccessible.
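(A quick way to see the mapping; the image name is just an example, and --userns=auto typically requires rootful podman with a subordinate ID range configured for the "containers" user:)

    # Rootful podman: --userns=auto gives the container its own automatic
    # uid/gid range; container uid 0 maps to a high host uid, not host root.
    sudo podman run --rm --userns=auto alpine cat /proc/self/uid_map

    # Rootless podman (the article's setup): container uid 0 maps to your
    # own unprivileged host uid, with the rest taken from /etc/subuid.
    podman run --rm alpine cat /proc/self/uid_map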
cpach 12 hours ago [-]
That might make Copy Fail harder to exploit, but I still wouldn’t bet money on CF being impossible to use in that scenario.
grimblee 11 hours ago [-]
Since in --userns=auto, root inside the container gets assigned to the first uid of the uid range assigned by podman, copyfail would succeed but you'd get uid 647831 and be able to do nothing with it
eqvinox 1 days ago [-]
Running sstrip on an ELF binary is called ELF "golfing"? TIL…
Retr0id 23 hours ago [-]
It is, although real ELF golfers consider that a little naive.
eqvinox 21 hours ago [-]
It does feel a little simplistic to get a special name. But lesser things have gotten fancier names...
angry_octet 9 hours ago [-]
I would only call it code golf if you actually reduced the amount of code.
repelsteeltje 22 hours ago [-]
Sorry for posting a n00b question, but could you share the etymology of this term, golfing?
mbreese 22 hours ago [-]
It’s manipulating the binary to make it as small as possible. In golf, the lowest score wins. So, in this context, the smallest binary that still works wins.
As befits a history of perl, it is full of random quotes and rambling discourses about history, but it has a lot of info in it.
Retr0id 22 hours ago [-]
In golf, lower scores are better.
walletdrainer 1 days ago [-]
This feels LLM-generated: lots of em-dashes, and even more text around a completely false premise.
cpach 23 hours ago [-]
What is the false premise in the article?
Retr0id 23 hours ago [-]
That rootless containers mitigate kernel exploits.
averi 20 hours ago [-]
Nowhere does the article claim that user namespaces completely mitigate the vulnerability. Page cache corruption still happens, but not being able to obtain root on the target host turns the attack from a one-liner into having to figure out whether specific shared base image layers are in use, by whom, and by what binaries (think of a shared CI platform like the one we run for GNOME).
Retr0id 20 hours ago [-]
The article does not prove that you can't get root on the host via page cache corruption, just that the specific exploit strategy they tried didn't work.
averi 19 hours ago [-]
There's a specific reason why the exploit targets a setuid binary: if you poison it in memory, it will be executed with the permissions of the user owning it, in this case root, meaning a setuid(0) plus spawning a new shell will effectively give you root access on the host system. This holds for systems where uid=0 is equivalent inside and outside the container itself. The vulnerability is still there and is deadly serious; with rootless containers the attack just becomes more involved, and the attacker will have to identify other factors (what containers are using a shared base image, what binaries are being called, what binaries should be overridden in memory, etc).

On top of this there's another thing worth mentioning: it's common in OpenShift (for non-rootless podman) to allow CAP_SETGID/CAP_SETUID to be able to create a container within a container (this is the allowPrivilegeEscalation setting in SCCs), which effectively grants you the ability to become uid=root in the container, and in that scenario uid=0 matches the host uid=0. The important difference is that that specific instance of the root user doesn't have CAP_SYS_ADMIN (or most of the other privileged kernel capabilities), meaning the actions the user can then perform are very limited.
Retr0id 19 hours ago [-]
I know how the reference exploit works, but that's not the only way to exploit the bug.
hackeman300 1 days ago [-]
It's a shame, this seems like an interesting topic but I can't get past the blatant AI-isms littered throughout.
>This is not raw shellcode — it is a fully formed ELF executable
washbasin 1 days ago [-]
Please post a tl;dr at the top or even in the subject. Many of us are scrambling to patch/reboot our **.
donaldjbiden 1 days ago [-]
This isn't a new CVE. It's just documenting what happened when this person ran the exploit inside a certain type of container.
atmosx 15 hours ago [-]
tl;dr: switch to podman :-) or (for Docker, not mentioned in the post but...) just set `allowPrivilegeEscalation=False` in the deployment's SCC and you'll be fine at the pod level. Most deployments don't need privilege escalation anyway; the ones that do need to either limit perms through capabilities or make sure the node (meaning the kernel) is patched.
cpach 14 hours ago [-]
How does allowPrivilegeEscalation=False help?
atmosx 5 hours ago [-]
Have you tested running the PoC in a pod with and without privEsc set?
That just prevents the faulty module from loading, so you have time to fix it properly (kernel upgrade).

Technically there should be zero impact (the very, very few tools that use it will fall back to userspace); I haven't even found that module loaded in infrastructure.

Then check if it is loaded, and if it is, unload/reboot.
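(Roughly, on a typical distro; the AF_ALG AEAD interface lives in the algif_aead module, paths may differ:)

    # Is the module currently loaded?
    lsmod | grep algif

    # Unload it from the running kernel (fails if something is using it)
    sudo modprobe -r algif_aead

    # Prevent it from being loaded again, even on explicit request
    echo "install algif_aead /bin/false" | sudo tee /etc/modprobe.d/disable-algif-aead.conf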
mjmas 20 hours ago [-]
Though this won't work for some kernels:
If algif_aead is built into the kernel rather than being a loadable module, it needs to be disabled by adding

    initcall_blacklist=algif_aead_init

to the boot cmdline.
However initcall_blacklist requires the kernel to be built with CONFIG_KALLSYMS.
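(To check whether that applies and to set the parameter persistently; a sketch assuming a GRUB-based distro that ships /boot/config-*:)

    # Is KALLSYMS available, and is the AEAD interface builtin (=y) or a module (=m)?
    grep -E 'CONFIG_KALLSYMS=|CONFIG_CRYPTO_USER_API_AEAD' /boot/config-$(uname -r)

    # Then add the parameter to the kernel command line, e.g. in /etc/default/grub:
    #   GRUB_CMDLINE_LINUX_DEFAULT="... initcall_blacklist=algif_aead_init"
    sudo update-grub    # or: grub2-mkconfig -o /boot/grub2/grub.cfg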
chrisss395 20 hours ago [-]
Dumb question: is preventing the module from loading safe to blindly run on, e.g., Unraid, Proxmox, WSL2? Is it possible to break anything?
cpach 17 hours ago [-]
I would say any sanely written application would fall back to doing the requested operations in userspace if it cannot use the AF_ALG socket.
It could fail though. But I have not yet heard of anyone noticing big problems due to disabling the problematic modules. And I have not noticed any such issues on our systems at ${DAYJOB}.
IMHO, since these parts of the Linux kernel are so crappy I personally would say disabling them is a good default choice. YMMV. But if you encounter problems, then you can always re-enable the modules. (Preferably after upgrading your kernel, obviously.)
isityettime 1 days ago [-]
It already has a table of contents. The heading titled "why rootless containers stopped the escalation" is your tl;dr.
averi 1 days ago [-]
[flagged]
averi 1 days ago [-]
[flagged]
ezequiel-garzon 1 days ago [-]
Please reply instead of (or in addition to) tagging the user you're replying to.
pjmlp 1 days ago [-]
Tagging isn't a feature in HN.
ramon156 1 days ago [-]
Thanks for the bikeshedding, they meant mentioning.
pjmlp 1 days ago [-]
It is also not supported, beyond people seeing their nick by sheer luck.
zenoprax 1 days ago [-]
If I see my points shoot up a bit I check my comment history to see what caused it.
anygivnthursday 1 days ago [-]
Or running their Claw scraping HN comments periodically for their mentions.
hlieberman 1 days ago [-]
That's true... for the exploit demo that they released. The primitive that underlies the exploit, however -- a page cache write -- can easily bypass the container boundary. One only needs to hook an executable which is also present in the host.
averi 1 days ago [-]
[flagged]
M_bara 1 days ago [-]
> (like reading env vars and sending them to an external server) it'd not be able to send credentials or fetch a malware remotely at all due to the DNS queries being intercepted by eBPF and being sent to a CoreDNS proxy.
Wouldn’t the exploit then just use ip addresses directly?
averi 20 hours ago [-]
You can work with the idea of a DNS whitelist: you pass a list of allowed DNS entries via your .gitlab-ci.yml (or a separate config), resolution happens, and those entries (IPs) are stored in a list; any other IP not present in that list gets denied by eBPF (which can easily be used to rewrite the source and destination of a packet before the packet actually reaches the NIC for dispatch).
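(Very roughly, the enforcement half of that could look like a small TC egress program keyed on a BPF map of resolved addresses. This is only a sketch: the map name and the userspace side that fills it from the resolved allowlist are assumed, and the DNS-rewriting part is not shown:)

    // Sketch: drop egress IPv4 packets whose destination is not in the
    // allowlist map (userspace fills the map after resolving the names
    // from the CI config). Compile with clang -target bpf.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);   /* allowed destination IPv4, network byte order */
        __type(value, __u8);
    } allowed_dsts SEC(".maps");

    SEC("tc")
    int egress_allowlist(struct __sk_buff *skb)
    {
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return TC_ACT_OK;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;            /* only IPv4 handled in this sketch */

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return TC_ACT_OK;

        __u32 dst = ip->daddr;
        if (bpf_map_lookup_elem(&allowed_dsts, &dst))
            return TC_ACT_OK;            /* destination came from an allowed resolution */
        return TC_ACT_SHOT;              /* everything else is dropped */
    }

    char LICENSE[] SEC("license") = "GPL";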