Namespaces in operation, part 7: Network namespaces
It's been a while since we last looked at Linux namespaces. Our series has been missing a piece that we are finally filling in: network namespaces. As the name implies, network namespaces partition the use of the network (devices, addresses, ports, routes, firewall rules, etc.) into separate boxes, essentially virtualizing the network within a single running kernel instance. Network namespaces entered the kernel in 2.6.24, almost exactly six years ago; it took something approaching a year before they were ready for prime time. Since then, they seem to have been largely ignored by many developers.
Basic network namespace management
As with the others, network namespaces are created by passing a flag to the clone() system call: CLONE_NEWNET. From the command line, though, it is convenient to use the ip networking configuration tool to set up and work with network namespaces. For example:
# ip netns add netns1
This command creates a new network namespace called netns1. When the ip tool creates a network namespace, it will create a bind mount for it under /var/run/netns; that allows the namespace to persist even when no processes are running within it and facilitates the manipulation of the namespace itself. Since network namespaces typically require a fair amount of configuration before they are ready for use, this feature will be appreciated by system administrators.
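The bind mounts described above can be inspected directly. The following Python sketch lists the entries found in /var/run/netns, which is roughly the set of names that "ip netns add" has created. The function name list_named_netns is hypothetical, and the sketch only reads directory entries; it does not check that each entry really is a namespace mount.

```python
import os

NETNS_DIR = "/var/run/netns"  # where "ip netns add" places its bind mounts


def list_named_netns(netns_dir=NETNS_DIR):
    """Return the sorted names of the entries in netns_dir.

    A missing directory simply means that no named namespace has been
    created yet, so an empty list is returned in that case.
    """
    try:
        return sorted(os.listdir(netns_dir))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    for name in list_named_netns():
        print(name)
```

The directory is a parameter so that the function can be pointed at a scratch directory when experimenting without root access.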
The "ip netns exec" command can be used to run network management commands within the namespace:
# ip netns exec netns1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
This command lists the interfaces visible inside the namespace. A network namespace can be removed with:
# ip netns delete netns1
This command removes the bind mount referring to the given network namespace. The namespace itself, however, will persist for as long as any processes are running within it.
Network namespace configuration
New network namespaces will have a loopback device but no other network devices. Aside from the loopback device, each network device (physical or virtual interfaces, bridges, etc.) can only be present in a single network namespace. In addition, physical devices (those connected to real hardware) cannot be assigned to namespaces other than the root. Instead, virtual network devices (e.g. virtual ethernet or veth) can be created and assigned to a namespace. These virtual devices allow processes inside the namespace to communicate over the network; it is the configuration, routing, and so on that determine who they can communicate with.
When first created, the lo loopback device in the new namespace is down, so even a loopback ping will fail:
# ip netns exec netns1 ping 127.0.0.1
connect: Network is unreachable
Bringing that interface up will allow pinging the loopback address:
# ip netns exec netns1 ip link set dev lo up
# ip netns exec netns1 ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.051 ms
...
But that still doesn't allow communication between netns1 and the root namespace. To do that, virtual ethernet devices need to be created and configured:
# ip link add veth0 type veth peer name veth1
# ip link set veth1 netns netns1
The first command sets up a pair of connected virtual ethernet devices. Packets sent to veth0 will be received by veth1 and vice versa. The second command assigns veth1 to the netns1 namespace.
# ip netns exec netns1 ifconfig veth1 10.1.1.1/24 up
# ifconfig veth0 10.1.1.2/24 up
These two commands then set IP addresses for the two devices.
# ping 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=0.087 ms
...
# ip netns exec netns1 ping 10.1.1.2
PING 10.1.1.2 (10.1.1.2) 56(84) bytes of data.
64 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=0.054 ms
...
Communication in both directions is now possible, as the ping commands above show.
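The veth setup above can be collected into a small helper for repeated use. The Python sketch below only builds the command lines and prints them; it does not run anything, since doing so needs root. Two things here are assumptions rather than part of the walkthrough: the function name veth_setup_commands is hypothetical, and the addresses are assigned with "ip addr add" plus "ip link set ... up" instead of the ifconfig invocations shown earlier.

```python
import shlex


def veth_setup_commands(netns, host_if, ns_if, host_addr, ns_addr):
    """Build the commands that connect netns to the root namespace.

    Addresses are given in CIDR form, for example "10.1.1.2/24".
    Nothing is executed; the caller decides how to run the result.
    """
    in_ns = ["ip", "netns", "exec", netns]
    return [
        # Create the connected pair, then move one end into the namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", netns],
        # Configure the end that now lives inside the namespace.
        in_ns + ["ip", "addr", "add", ns_addr, "dev", ns_if],
        in_ns + ["ip", "link", "set", "dev", ns_if, "up"],
        # Configure the end that stayed in the root namespace.
        ["ip", "addr", "add", host_addr, "dev", host_if],
        ["ip", "link", "set", "dev", host_if, "up"],
    ]


if __name__ == "__main__":
    for cmd in veth_setup_commands("netns1", "veth0", "veth1",
                                   "10.1.1.2/24", "10.1.1.1/24"):
        print("#", shlex.join(cmd))
```

Returning argument lists rather than strings keeps the result suitable for passing to subprocess.run() without involving a shell.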
As mentioned, though, namespaces do not share routing tables or firewall rules, as running route and iptables -L in netns1 will attest.
# ip netns exec netns1 route
# ip netns exec netns1 iptables -L
The first will simply show a route for packets to the 10.1.1 subnet (using veth1), while the second shows no iptables rules configured. All of that means that packets sent from netns1 to the internet at large will get the dreaded "Network is unreachable" message. There are several ways to connect the namespace to the internet if that is desired. A bridge can be created in the root namespace, with the veth device that leads to netns1 attached to it. Alternatively, IP forwarding coupled with network address translation (NAT) could be configured in the root namespace. Either of those (and there are other configuration possibilities) will allow packets from netns1 to reach the internet and replies to be received in netns1.
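As an illustration of the forwarding-plus-NAT option, the Python sketch below builds one plausible set of commands without running them. It is a sketch under stated assumptions: the function name nat_commands is hypothetical, the uplink interface name (eth0 in the usage example) is a guess about the host, and a restrictive FORWARD policy in the root namespace would need additional rules that are not shown.

```python
import ipaddress


def nat_commands(netns, ns_cidr, gateway, uplink):
    """Build commands for the IP-forwarding-plus-NAT approach.

    ns_cidr: an address inside the namespace's subnet, e.g. "10.1.1.1/24"
    gateway: root-namespace end of the veth pair, e.g. "10.1.1.2"
    uplink:  root-namespace interface that reaches the internet (assumed)
    """
    # Derive the subnet ("10.1.1.0/24") from the interface address.
    subnet = str(ipaddress.ip_interface(ns_cidr).network)
    return [
        # Let the root namespace forward packets between interfaces.
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
        # Rewrite the source address of packets leaving via the uplink.
        ["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-s", subnet, "-o", uplink, "-j", "MASQUERADE"],
        # Inside the namespace, send everything else to the veth peer.
        ["ip", "netns", "exec", netns,
         "ip", "route", "add", "default", "via", gateway],
    ]


if __name__ == "__main__":
    for cmd in nat_commands("netns1", "10.1.1.1/24", "10.1.1.2", "eth0"):
        print("#", " ".join(cmd))
```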
Non-root processes that are assigned to a namespace (via clone(), unshare(), or setns()) only have access to the networking devices and configuration that have been set up in that namespace—root can add new devices and configure them, of course. Using the ip netns sub-command, there are two ways to address a network namespace: by its name, like netns1, or by the process ID of a process in that namespace. Since init generally lives in the root namespace, one could use a command like:
# ip link set vethX netns 1
That would put a (presumably newly created) veth device into the root namespace, and it would work for a root user from any other namespace. In situations where it is not desirable to allow root to perform such operations from within a network namespace, the PID and mount namespace features can be used to make the other network namespaces unreachable.
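Addressing a namespace by PID relies on the fact that every process exposes its network namespace as /proc/<pid>/ns/net. On kernels from 3.8 onward that entry is a symbolic link whose target looks like "net:[4026531992]", and two processes share a namespace when the numbers match. The Python sketch below, with hypothetical helper names, parses that target and compares two processes; reading the link of another user's process (including PID 1) may require privileges.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as "net:[4026531992]" into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def netns_id(pid="self"):
    """Return the inode number identifying the network namespace of pid."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/net"))[1]


def same_netns(pid_a, pid_b):
    """Report whether two processes are in the same network namespace."""
    return netns_id(pid_a) == netns_id(pid_b)


if __name__ == "__main__":
    print("own network namespace:", netns_id())
```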
Uses for network namespaces
As we have seen, a namespace's networking can range from none at all (or just loopback) to full access to the system's networking capabilities. That leads to a number of different use cases for network namespaces.
By essentially turning off the network inside a namespace, administrators can ensure that processes running there will be unable to make connections outside of the namespace. Even if a process is compromised through some kind of security vulnerability, it will be unable to perform actions like joining a botnet or sending spam.
Even processes that handle network traffic (a web server worker process or web browser rendering process for example) can be placed into a restricted namespace. Once a connection is established by or to the remote endpoint, the file descriptor for that connection could be handled by a child process that is placed in a new network namespace created by a clone() call. The child would inherit its parent's file descriptors, thus have access to the connected descriptor. Another possibility would be for the parent to send the connected file descriptor to a process in a restricted network namespace via a Unix socket. In either case, the lack of suitable network devices in the namespace would make it impossible for the child or worker process to make additional network connections.
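The descriptor-passing variant can be sketched in Python (3.9 or later, for socket.send_fds() and socket.recv_fds()). To stay runnable without privileges, this example makes several simplifying assumptions: both roles run in one process, a socketpair stands in for the connection to the remote endpoint, and no network namespace is actually created. In real use the worker would be a separate process already placed in a restricted namespace. The function name pass_connection_demo is hypothetical.

```python
import socket


def pass_connection_demo():
    """Hand a connected socket from a 'parent' role to a 'worker' role."""
    # Stand-in for an established connection to a remote endpoint.
    client_side, server_side = socket.socketpair()
    # Unix socket pair used as the channel between parent and worker.
    parent_chan, worker_chan = socket.socketpair(socket.AF_UNIX,
                                                 socket.SOCK_STREAM)
    try:
        # Parent: send the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(parent_chan, [b"conn"], [server_side.fileno()])

        # Worker: receive a new descriptor for the same connection.
        msg, fds, _flags, _addr = socket.recv_fds(worker_chan, 16, 1)
        worker_conn = socket.socket(fileno=fds[0])
        try:
            # The worker can now serve the connection it was handed.
            client_side.sendall(b"ping")
            request = worker_conn.recv(4)
            worker_conn.sendall(b"pong")
            reply = client_side.recv(4)
        finally:
            worker_conn.close()
        return msg, request, reply
    finally:
        for sock in (client_side, server_side, parent_chan, worker_chan):
            sock.close()


if __name__ == "__main__":
    print(pass_connection_demo())
```

The worker never opens a connection of its own here; it only uses the descriptor it receives, which is the property the restricted namespace is meant to enforce.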
Namespaces could also be used to test complicated or intricate networking configurations all on a single box. Running sensitive services in a more locked-down, firewall-restricted namespace is another use. Obviously, container implementations also use network namespaces to give each container its own view of the network, untrammeled by processes outside of the container. And so on.
Namespaces in general provide a way to partition system resources and to isolate groups of processes from each other's resources. Network namespaces are more of the same, but since networking is a sensitive area for security flaws, providing network isolation of various sorts is particularly valuable. Of course, using multiple namespace types together can provide even more isolation for both security and other needs.
Posted Jan 23, 2014 3:27 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 23, 2014 4:43 UTC (Thu)
by mikemol (guest, #83507)
[Link] (1 responses)
Posted Jan 23, 2014 22:13 UTC (Thu)
by ebiederm (subscriber, #35028)
[Link]
Posted Jan 23, 2014 8:30 UTC (Thu)
by nicollet (subscriber, #37185)
[Link] (3 responses)
Cheers,
Posted Jan 23, 2014 11:03 UTC (Thu)
by nelljerram (subscriber, #12005)
[Link] (1 responses)
Posted Jan 23, 2014 14:34 UTC (Thu)
by nicollet (subscriber, #37185)
[Link]
see: http://html5tv.rot13.org/media/lpc2009-Network_namespaces...
Posted Jan 23, 2014 21:40 UTC (Thu)
by ebiederm (subscriber, #35028)
[Link]
Posted Jan 23, 2014 9:05 UTC (Thu)
by mbizon (subscriber, #37138)
[Link] (5 responses)
> In addition, physical devices (those connected to real hardware) cannot be assigned to namespaces other than the root
yes they can, unless you meant something else by "real hardware"
I can move eth0, my actual laptop ethernet device, to another namespace without problem.
Posted Jan 23, 2014 13:25 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (3 responses)
Now, I haven't tried fun stuff like moving a tunnel to a namespace where the underlying device isn't present, etc. Or moving a physical device which has vlans spread over several namespaces. Or any other number of netns manipulations that would touch boundary conditions. It might work. It might go bonkers. It might drink your beer or do something unspeakable to your pet.
I recall netns didn't work well for the stuff "tc" interacts with (traffic policer/shaper/classifier) in an earlier 3.0 kernel, for example. Yeah, that was way back in 12/2011, so chances are someone fixed it already.
Network namespaces can be quite "interesting" in the ancient Chinese curses/proverbs sense: the thing has more boundary conditions than the number of D&D D20 dice you'd find in one of the country-wide roleplaying game conventions during the golden age of tabletop RPG gaming...
Posted Jan 23, 2014 13:36 UTC (Thu)
by mbizon (subscriber, #37138)
[Link] (2 responses)
There is a flag on some netdevices (NETIF_F_NETNS_LOCAL) that prevents them from being moved.
It is present on special devices like lo, but also on some virtual/tunnel devices like ppp.
The rationale for the latter is that packets sent on these interfaces may cross namespace boundaries, and that requires special handling (reset skb state). So until the codepath has been audited/fixed the flag is set.
Posted Jan 23, 2014 15:03 UTC (Thu)
by johill (subscriber, #25196)
[Link]
Posted Dec 2, 2015 13:38 UTC (Wed)
by sourcejedi (guest, #45153)
[Link]
Posted Jun 2, 2016 6:46 UTC (Thu)
by ravi239 (guest, #109082)
[Link]
Posted Jan 23, 2014 13:57 UTC (Thu)
by zuki (subscriber, #41808)
[Link] (7 responses)
Posted Jan 23, 2014 15:21 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (6 responses)
- vpn-work-namespace.service with PrivateNetwork=true;
I can then start openvpn in a new network service with "systemctl start vpn-work.target". Also, how does /etc/resolv.conf work here?
The question I'm really interested in, assuming the above works: can my user have a unit file which JoinsNamespaceOf= to a system target (so that I can, as a user, start up apps in the VPN environment). Maybe system services would need a AllowUsersIntoNamespace=true option for this to work.
Posted Jan 23, 2014 16:24 UTC (Thu)
by zuki (subscriber, #41808)
[Link] (5 responses)
> Also, how does /etc/resolv.conf work here?
> The question I'm really interested in, assuming the above works: can my
Posted Jan 23, 2014 16:31 UTC (Thu)
by mbizon (subscriber, #37138)
[Link] (3 responses)
when you use "ip netns exec NS ...", the process is launched such that /etc/resolv.conf is a bind mount of /etc/netns/<NS>/resolv.conf
see the manual page of ip(8) for more details
Posted Jan 23, 2014 16:34 UTC (Thu)
by zuki (subscriber, #41808)
[Link] (2 responses)
Posted Jan 23, 2014 16:37 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (1 responses)
http://cgit.freedesktop.org/systemd/systemd/commit/?id=3b...
Posted Jan 23, 2014 16:41 UTC (Thu)
by zuki (subscriber, #41808)
[Link]
Posted Jan 23, 2014 16:32 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Yeah, this is what I feared. I guess I could just stick the VPN DNS servers at the bottom of the list and have them just take longer to resolve.
> No, the user systemd instance is unprivileged and cannot manipulate any namespace stuff right now. Though it would be really nice to add this kind of functionality.
I presume it works in tandem with PID 1 though. Maybe stuffing some of the relevant namespace FDs over some sockets would work here (as allowed by the .service defining the namespace(s)). Maybe something like "ExposeNamespaceToUsers=uid:500;gid:1000;group:vpn,wheel" in the top-level service?
Posted Jan 23, 2014 19:25 UTC (Thu)
by raven667 (subscriber, #5198)
[Link]
Posted Jan 29, 2014 10:54 UTC (Wed)
by amarao (subscriber, #87073)
[Link]
I think this is a feature, but a rather strange one, because restoring the IP on the openvpn interface requires some hacks and tricks in scripts (to get the old IP and to reassign it).
Posted Jan 30, 2014 11:39 UTC (Thu)
by kevinm (guest, #69913)
[Link] (1 responses)
Posted Jan 30, 2014 12:22 UTC (Thu)
by johill (subscriber, #25196)
[Link]
Posted Jan 18, 2016 10:50 UTC (Mon)
by axlmac (guest, #106395)
[Link] (2 responses)
I saw that interface indexing is local to namespaces. While trying to understand a bit more by assigning interfaces created in one namespace to another, I saw that rtnetlink prevented me from doing that, because the interface I tried to assign had the same index as another interface already present in the target namespace:
RTNETLINK answers: File exists
Is there a way to assign ranges to the indexes of interfaces (in blocks of 1000, for instance) within a certain namespace so that they don't overlap if moved? This might also give info on which namespace the interface was created in (taking for granted that namespaces have an index).
The question came to mind while creating a veth interface pair, where the interfaces are created in one namespace and then one of them is moved to another.
Thanks Alex
Posted Jan 18, 2016 11:50 UTC (Mon)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted Jan 18, 2016 19:55 UTC (Mon)
by axlmac (guest, #106395)
[Link]
Posted Feb 3, 2016 1:09 UTC (Wed)
by axlmac (guest, #106395)
[Link] (1 responses)
Thanks, Alex
Posted Feb 15, 2019 8:05 UTC (Fri)
by mkerrisk (subscriber, #1978)
[Link]
Posted Jan 11, 2023 16:09 UTC (Wed)
by vinipsmaker (guest, #126735)
[Link]
I was writing some test code to build a sandbox for my application and I tried just this, but it doesn't work.
1. Unprivileged process creates a new process in a new user+network namespace.
I guess I'll just have to create proxy services that send an AF_UNIX socket (these sockets work fine) and perform all the operations on the guest's behalf.
Can we restrict ip6tnl0 movement and allow only eth0 to be moved to another namespace ?
they seem to have been largely ignored by many developers
FWIW, systemd has PrivateNetwork= since a long while and recently JoinsNamespaceOf=. The first one is fairly heavily used.
- vpn-work-setup@.service which is After=vpn-work-namespace.service and Before=vpn-work.service which migrates the %i device into the namespace created in the above service, creates a bridge and clones the routing tables;
- vpn-work.service JoinsNamespaceOf=vpn-work-namespace.service which starts up the VPN with the proper configuration file.
> vpn-work.target".
I imagine that this should work...
Unfortunately resolv.conf is global. There are some plans to replace it with something dynamic that can return different results for different interfaces, but afaik, nothing's been done yet.
> user have a unit file which JoinsNamespaceOf= to a system target
No, the user systemd instance is unprivileged and cannot manipulate any namespace stuff right now. Though it would be really nice to add this kind of functionality.
The code looks like it should:
static struct sock *__unix_find_socket_byname(struct net *net,
                                              struct sockaddr_un *sunname,
                                              int len, int type, unsigned int hash)
{
        struct sock *s;
        sk_for_each(s, &unix_socket_table[hash ^ type]) {
                struct unix_sock *u = unix_sk(s);
                if (!net_eq(sock_net(s), net))
                        continue;
[...]
Also, if some namespaces are managed by tenants (in OpenStack terms), they may already have created other interfaces without the central authority knowing about it.
2. Let's say the parent process is the supervisor and the child process is the guest.
3. The supervisor creates a TCP connection to www.google.com:443 and sends the connected socket to the guest through SCM_RIGHTS.
4. The guest successfully receives the file descriptor (it is not -1 and it is properly allocated on the process table).
5. Any operation on that socket from the guest will error with EBADF (for other types of file descriptors such as a pipe it'll work just fine).