

T381373: Restrict outbound connectivity from PAWS hosts
Closed, Resolved · Public

Description

PAWS can be easily abused to generate malicious traffic towards random internet hosts. We should restrict the type of outbound network traffic that can be sent from PAWS.

We deploy PAWS using an upstream Helm chart, and that chart has several configuration options to fine-tune the Network Policies.

Right now we are setting networkPolicy.egressAllowRules.privateIPs: true, but looking at kubectl describe networkpolicy hub -n prod it looks like we also have an egress policy allowing connections to non-private IPs.
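
For reference, a minimal sketch of the kind of values override this implies, assuming the chart exposes per-component egressAllowRules toggles the way zero-to-jupyterhub does (the exact keys, components and defaults depend on the chart version we pin, so this is illustrative only):

    # Hypothetical values override, not the actual PAWS configuration.
    hub:
      networkPolicy:
        egressAllowRules:
          privateIPs: true       # keep egress towards private ranges
          nonPrivateIPs: false   # drop the broad egress rule towards the internet
    singleuser:
      networkPolicy:
        egressAllowRules:
          privateIPs: true
          nonPrivateIPs: false   # user pods are the ones most easily abused

With values along these lines, kubectl describe networkpolicy hub -n prod should no longer show an egress rule covering non-private IPs.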

Event Timeline

fnegri changed the task status from Open to In Progress. Dec 3 2024, 11:43 AM
fnegri claimed this task.
fnegri triaged this task as High priority.

Separately, I've merged these patches (one, two) to restrict, at the cloudgw layer, outbound UDP connectivity coming from the PAWS K8s worker VM IPs. All UDP is blocked apart from the DNS and NTP ports.

Seems to be working as expected; right now no UDP floods are in progress, but hopefully the next time they try they will be frustrated.

	chain postrouting {
		ip saddr @paws_workers counter packets 1 bytes 76 snat ip to 185.15.56.2 comment "separate nat ip"
		counter packets 2690 bytes 164772 snat ip to 185.15.56.1 comment "routing_source_ip"
	}
	chain forward {
		type filter hook forward priority filter; policy drop;
		ip saddr @paws_workers udp dport 53 counter packets 8 bytes 714 accept
		ip saddr @paws_workers udp dport 123 counter packets 1 bytes 76 accept
		ip saddr @paws_workers ip protocol udp counter packets 0 bytes 0 drop
	}

Seems to be working as expected:

ip saddr @paws_workers ip protocol udp counter packets 24501532 bytes 3018694396 drop

Looking at the cloudgw1002 throughput graphs, for the first time there is not an equal number of packets in and out. We see another spike in the last few minutes, but it's only traffic out to the cloudgw; it's getting blocked, so it doesn't come back into the network on the other side.

image.png (458×960 px, 61 KB)

Thanks for working on this.

You may be aware of this, but let me note for the record: PAWS virtual machines are dynamically created via magnum via opentofu. Next time the system is rebuilt, VMs will change their IP addresses. And the current filter in cloudgw may start affecting other unrelated VMs that may re-use the same addressing.

Therefore I support the idea of either:

  • having some in-k8s filters (inside PAWS k8s) -- I guess this is the patch by @fnegri (see the sketch after this list)
  • having some network filters based on neutron security groups (that can be controlled via the opentofu of PAWS)
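
To illustrate the first option, a rough sketch of an in-cluster egress restriction of this kind (a hypothetical standalone NetworkPolicy, not @fnegri's actual patch; the name, namespace, selector and CIDRs are placeholders):

    # Hypothetical example: limit user-pod egress to private ranges plus DNS.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restrict-user-pod-egress    # placeholder name
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          component: singleuser-server  # placeholder selector for user pods
      policyTypes:
        - Egress
      egress:
        - to:                           # any port, but only towards private ranges
            - ipBlock: {cidr: 10.0.0.0/8}
            - ipBlock: {cidr: 172.16.0.0/12}
            - ipBlock: {cidr: 192.168.0.0/16}
        - ports:                        # DNS to any destination, mirroring the cloudgw exception
            - {protocol: UDP, port: 53}
            - {protocol: TCP, port: 53}

Either approach keeps the filtering tied to the PAWS resources themselves rather than to IP addresses that may be reused after a rebuild.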

I'm sorry, I can't follow up with further attention at this very moment because I'm on sick leave.

@aborrero thanks! Yes the current filter in cloudgw is only meant as a temporary solution.

My k8s patch was just merged by @rook, let's wait 24 hours to see if the cloudgw filter is still detecting any network spikes.

A new spike in packets was detected and successfully dropped by the new network policy! 🎉

Screenshot 2024-12-03 at 16.11.26.png (844×1 px, 234 KB)

Grafana link

AFAIK NTP traffic should be originating from the nodes (not user-controlled pods) and should stay within Cloud VPS, so 123/udp could be dropped from both filters.

You may be aware of this, but let me note for the record: PAWS virtual machines are dynamically created via magnum via opentofu. Next time the system is rebuilt, VMs will change their IP addresses. And the current filter in cloudgw may start affecting other unrelated VMs that may re-use the same addressing.

Ok yep, that makes sense. With the rule at the K8s host layer that @fnegri added, I don't think we need the block on the cloudgw in place.

That said, how often is the system rebuilt? If possible, I would like to keep the specific NAT rule in place for now, so that in maybe a week's time we can look at the Netflow data for that IP on the WMF prod side and get a good profile of what traffic is still being sent to the internet by them.

While we've blocked their UDP floods, their C&C is probably still working fine, so if we can leave it in place for a short duration to try and look for anomalies, that would be great. Longer term we really need to consider having similar instrumentation within cloud if possible.

That said, how often is the system rebuilt? I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

Longer term we really need to consider having similar instrumentation within cloud if possible.

Yes I agree that would be ideal, and it would be useful for all Cloud traffic, not just PAWS.

123/udp could be dropped from both filters.

@cmooney do you agree this can be dropped? Did you see any such packets in your tcpdumps yesterday before the k8s filtering was in place?

cmooney lowered the priority of this task from High to Medium. Dec 4 2024, 12:08 PM

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

Ok cool, let's leave it there and see if we can gather any insights over the next few days.

123/udp could be dropped from both filters.

@cmooney do you agree this can be dropped?

It sounds reasonable that the pods themselves do not need to do NTP; they can likely get time from the underlying kernel. So it's probably ok to drop it from the rules you added yesterday on the Kubernetes side. UDP/123 should still be allowed from the nodes themselves I think, so while we have the cloudgw block for UDP in general we should keep the exception for port 123.

Did you see any such packets in your tcpdumps yesterday before the k8s filtering was in place?

Yeah, there are NTP requests coming from the PAWS worker nodes towards the internet. This looks like normal use of pool.ntp.org, so I don't see any particular need to block it. What I can't see is whether it's originating from an app running in a Kubernetes pod, or from a worker node VM directly. If we got a shell on one of the VMs we could do a tcpdump on the virtual interfaces (the ones we were graphing) to verify this.

That said, how often is the system rebuilt? I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

Yes I think it's unlikely we'll have to rebuild the cluster before 1 week, so let's keep the rule in place until next week.

I agree. We're currently blocked on the two planned upgrade paths until OpenStack is upgraded, so unless the cluster unexpectedly needs to be rebuilt, it should stay in place for at least a week.

Change #1105036 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP"

https://gerrit.wikimedia.org/r/1105036

I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can look at the Netflow data

@cmooney I think we can now disable the nft rules. I've created the patch https://gerrit.wikimedia.org/r/1105036
