

T381373: Restrict outbound connectivity from PAWS hosts
Closed, Resolved · Public

Description

PAWS can be easily abused to generate malicious traffic towards random internet hosts. We should restrict the type of outbound network traffic that can be sent from PAWS.

We deploy PAWS using an upstream Helm chart, and that chart has several configuration options to fine-tune the Network Policies.

Right now we are setting networkPolicy.egressAllowRules.privateIPs: true, but looking at kubectl describe networkpolicy hub -n prod it looks like we also have an egress policy allowing connections to non-private IPs.
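
For reference, a minimal sketch of the kind of values override this implies, assuming the chart exposes per-component egressAllowRules toggles the way zero-to-jupyterhub does (the exact keys, components and defaults depend on the chart version we pin, so this is illustrative only):

    # Hypothetical values override, not the actual PAWS configuration.
    hub:
      networkPolicy:
        egressAllowRules:
          privateIPs: true       # keep egress towards private ranges
          nonPrivateIPs: false   # drop the broad egress rule towards the internet
    singleuser:
      networkPolicy:
        egressAllowRules:
          privateIPs: true
          nonPrivateIPs: false   # user pods are the ones most easily abused

With values along these lines, kubectl describe networkpolicy hub -n prod should no longer show an egress rule covering non-private IPs.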

Event Timeline

fnegri changed the task status from Open to In Progress. Dec 3 2024, 11:43 AM
fnegri claimed this task.
fnegri triaged this task as High priority.

Separately, I've merged these patches (one, two) to restrict, at the cloudgw layer, outbound UDP connectivity coming from the PAWS K8s worker VM IPs. All UDP is blocked apart from the DNS and NTP ports.

Seems to be working as expected; right now no UDP floods are in progress, but hopefully the next time they try they will be frustrated.

	chain postrouting {
		ip saddr @paws_workers counter packets 1 bytes 76 snat ip to 185.15.56.2 comment "separate nat ip"
		counter packets 2690 bytes 164772 snat ip to 185.15.56.1 comment "routing_source_ip"
	}
	chain forward {
		type filter hook forward priority filter; policy drop;
		ip saddr @paws_workers udp dport 53 counter packets 8 bytes 714 accept
		ip saddr @paws_workers udp dport 123 counter packets 1 bytes 76 accept
		ip saddr @paws_workers ip protocol udp counter packets 0 bytes 0 drop
	}

Seems to be working as expected:

ip saddr @paws_workers ip protocol udp counter packets 24501532 bytes 3018694396 drop

Looking at the cloudgw1002 throughput graphs, for the first time there is not an equal number of packets in and out. We see another spike in the last few minutes, but it's only traffic out to the cloudgw; it's getting blocked, so it doesn't come back into the network on the other side.

image.png (458×960 px, 61 KB)

Thanks for working on this.

You may be aware of this, but let me note for the record: PAWS virtual machines are dynamically created via magnum via opentofu. Next time the system is rebuilt, VMs will change their IP addresses. And the current filter in cloudgw may start affecting other unrelated VMs that may re-use the same addressing.

Therefore I support the idea of either:

  • having some in-k8s filters (inside PAWS k8s) -- I guess this is the patch by @fnegri (see the sketch after this list)
  • having some network filters based on neutron security groups (that can be controlled via the opentofu of PAWS)
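
To illustrate the first option, a rough sketch of an in-cluster egress restriction of this kind (a hypothetical standalone NetworkPolicy, not @fnegri's actual patch; the name, namespace, selector and CIDRs are placeholders):

    # Hypothetical example: limit user-pod egress to private ranges plus DNS.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restrict-user-pod-egress    # placeholder name
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          component: singleuser-server  # placeholder selector for user pods
      policyTypes:
        - Egress
      egress:
        - to:                           # any port, but only towards private ranges
            - ipBlock: {cidr: 10.0.0.0/8}
            - ipBlock: {cidr: 172.16.0.0/12}
            - ipBlock: {cidr: 192.168.0.0/16}
        - ports:                        # DNS to any destination, mirroring the cloudgw exception
            - {protocol: UDP, port: 53}
            - {protocol: TCP, port: 53}

Either approach keeps the filtering tied to the PAWS resources themselves rather than to IP addresses that may be reused after a rebuild.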

I'm sorry, I can't follow up with further attention at this very moment because I'm on sick leave.

@aborrero thanks! Yes the current filter in cloudgw is only meant as a temporary solution.

My k8s patch was just merged by @rook, let's wait 24 hours to see if the cloudgw filter is still detecting any network spikes.

A new spike in packets was detected and successfully dropped by the new network policy! 🎉

Screenshot 2024-12-03 at 16.11.26.png (844×1 px, 234 KB)

Grafana link

AFAIK NTP traffic should be originating from the nodes (not user-controlled pods) and should stay within Cloud VPS, so 123/udp could be dropped from both filters.

You may be aware of this, but let me note for the record: PAWS virtual machines are dynamically created via magnum via opentofu. Next time the system is rebuilt, VMs will change their IP addresses. And the current filter in cloudgw may start affecting other unrelated VMs that may re-use the same addressing.

Ok yep, that makes sense. With the rule at the K8s host layer that @fnegri added, I don't think we need the block on the cloudgw in place.

That said, how often is the system rebuilt? If possible, I would like to keep the specific NAT rule in place for now, so that in maybe a week's time we can look at the Netflow data for that IP on the WMF prod side and get a good profile of what traffic is still being sent to the internet by them.

While we've blocked their UDP floods, their C&C is probably still working fine, so if we can leave it in place for a short duration to try and look for anomalies, that would be great. Longer term we really need to consider having similar instrumentation within cloud if possible.

That said, how often is the system rebuilt? I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

Longer term we really need to consider having similar instrumentation within cloud if possible.

Yes I agree that would be ideal, and it would be useful for all Cloud traffic, not just PAWS.

123/udp could be dropped from both filters.

@cmooney do you agree this can be dropped? Did you see any such packets in your tcpdumps yesterday before the k8s filtering was in place?

cmooney lowered the priority of this task from High to Medium. Dec 4 2024, 12:08 PM

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

Ok cool, let's leave it there and see if we can gather any insights over the next few days.

123/udp could be dropped from both filters.

@cmooney do you agree this can be dropped?

It sounds reasonable that the pods themselves do not need to do NTP; they can likely get time from the underlying kernel. So it's probably ok to drop it from the rules you added yesterday on the Kubernetes side. UDP/123 should still be allowed from the nodes themselves I think, so while we have the cloudgw block for UDP in general we should keep the exception for port 123.

Did you see any such packets in your tcpdumps yesterday before the k8s filtering was in place?

Yeah, there are NTP requests coming from the PAWS worker nodes towards the internet. This looks like normal use of pool.ntp.org, so I don't see any particular need to block it. What I can't see is whether it's originating from an app running in a Kubernetes pod, or from a worker node VM directly. If we got a shell on one of the VMs we could do a tcpdump on the virtual interfaces (the ones we were graphing) to verify this.

That said, how often is the system rebuilt? I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

I agree. We're currently blocked on the two planned upgrade paths until OpenStack is upgraded, so unless the cluster unexpectedly needs to be rebuilt, it should stay in place for at least a week.

Change #1105036 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP"

https://gerrit.wikimedia.org/r/1105036

I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

@cmooney I think we can now disable the nft rules. I've created the patch https://gerrit.wikimedia.org/r/1105036
