Linuxkernnet PDF
Please feel free to send any feedback, questions, or suggestions to
Rami Rosen by email: ramirose@gmail.com
I will try hard to answer each and every question (though sometimes it
takes time).
Contents
• 1 Introduction
• 2 Hierarchy of networking layers
• 3 Networking Data Structures
• 3.1 SK_BUFF
• 3.2 net_device
• 4 NAPI
• 5 Routing Subsystem
• 5.1 Routing Tables
• 5.2 Routing Cache
• 5.2.1 Creating a Routing Cache Entry
• 5.3 Policy Routing (multiple tables)
• 5.3.1 Policy Routing: add/delete a rule example
• 5.4 Routing Table lookup algorithm
• 6 Receiving a packet
• 6.1 Forwarding
• 7 Sending a Packet
• 8 Multipath routing
• 9 Netfilter
• 9.1 Netfilter rule example
• 9.2 Connection Tracking
• 10 Traffic Control
• 11 ICMP redirect message
• 12 Neighboring Subsystem
• 13 Network Namespaces
• 14 Virtual Network Devices
• 15 IPv6
• 16 VLAN
• 17 Bonding Network device
• 18 Teaming Network Device
• 19 PPP
• 20 TUN/TAP
• 21 BLUETOOTH
• BlueZ
• RFCOMM
• L2CAP
• 22 VXLAN
• 23 TCP
• 24 IPSec
• 24.1 Example: Host to Host VPN (using openswan)
• 25 Wireless Subsystem
• 26 Links and more info
Introduction
● Understanding a packet walkthrough in the kernel is a key to
understanding kernel networking. Understanding it is a must if we
want to understand Netfilter or IPSec internals, and more.
● We will deal with this walkthrough in this document (design and
implementation details).
● The layers that we will deal with (based on the 7 layers model) are:
- Link Layer (L2) (Ethernet)
- Network Layer (L3) (IPv4, IPv6)
- Transport Layer (L4) (UDP, TCP...)
Networking Data Structures
● The two most important structures of the Linux kernel network layer are:
– sk_buff struct (defined in include/linux/skbuff.h)
– net_device struct (defined in include/linux/netdevice.h)
It is better to know a bit about them before delving into the
walkthrough code.
SK_BUFF
reality, however, the sk_buff_head is included in the doubly linked list
of sk_buffs (so it actually forms a ring).
When an sk_buff is allocated, its data space is also allocated from
kernel memory. sk_buff allocation is done with alloc_skb() or
dev_alloc_skb(); drivers use dev_alloc_skb(). (Buffers are freed with
kfree_skb() and dev_kfree_skb().) However, sk_buff provides an
additional management layer. The data space is divided into a head
area and a data area. This allows kernel functions to reserve space for
the header, so that the data doesn't need to be copied around.
Typically, therefore, after allocating an sk_buff, header space is
reserved using skb_reserve().
skb_pull(int len) removes data from the start of a buffer (skipping
over an existing header) by advancing data to data+len and by
decreasing len.
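The head/data pointer handling described above can be illustrated with a small user-space model. This is only a sketch: struct toy_skb and the toy_* helpers are hypothetical names, they are not the kernel's struct sk_buff or its API, and no bounds checking is done.

```c
#include <stddef.h>

/* Toy user-space model of the sk_buff pointers (NOT the kernel struct). */
struct toy_skb {
    unsigned char buf[256]; /* backing storage, from head to end */
    unsigned char *head;    /* start of the buffer */
    unsigned char *data;    /* start of the valid data */
    unsigned char *tail;    /* end of the valid data */
    unsigned char *end;     /* end of the buffer */
    unsigned int len;       /* number of bytes between data and tail */
};

static void toy_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->len = 0;
}

/* Modeled on skb_reserve(): open headroom in an empty buffer. */
static void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Modeled on skb_put(): append n bytes of payload at the tail. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Modeled on skb_push(): prepend an n-byte header into the headroom. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Modeled on skb_pull(): skip an n-byte header at the front. */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

Reserving 14 bytes first lets a later toy_push(skb, 14) place an Ethernet header in front of the payload without moving the payload itself, which is the point of the head area.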
We also handle alignment when allocating sk_buff:
- when allocating an sk_buff, by netdev_alloc_skb(), we eventually
call __alloc_skb() and in fact, we have two allocations here:
- the sk_buff itself (struct sk_buff *skb)
this is done by
...
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
....
see __alloc_skb() in net/core/skbuff.c
The allocation of data above forces alignment.
The struct sk_buff objects themselves are private for every network
layer. When a packet is passed from one layer to another, the struct
sk_buff is cloned. However, the data itself is not copied in that case.
Note that struct sk_buff is quite large, but most of its members are
unused in most situations. The copy overhead when cloning is
therefore limited.
• net_enable_timestamp() must be called in order to get valid
timestamp values.
Helper method: static inline ktime_t skb_get_ktime(const struct
sk_buff *skb) : returns tstamp of the skb.
Many network modules define a private skb cb of their own, and use
skb->cb for their own needs. For example, in
include/net/bluetooth/bluetooth.h, we have:
#define bt_cb(skb) ((struct bt_skb_cb *)((skb)->cb))
• helper method:
• static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
• struct dst_entry *dst – the route for this sk_buff; this route is
determined by the routing subsystem.
• It has 2 important function pointers:
• int (*input)(struct sk_buff*);
• int (*output)(struct sk_buff*);
• input() can be assigned to one of the following : ip_local_deliver,
ip_forward, ip_mr_input, ip_error or dst_discard_in.
• output() can be assigned to one of the following :ip_output,
ip_mc_output, ip_rt_bug, or dst_discard_out.
• We will deal more with dst when talking about routing.
• In the usual case, there is only one dst_entry for every skb.
• When using IPsec, there is a linked list of dst_entries, and only the last
one is for routing; all other dst_entries are for IPsec transformers.
These other dst_entries have the DST_NOHASH flag set. Entries
which have the DST_NOHASH flag set are not kept in the
routing cache, but are kept instead in the flow cache.
skb->priority = sk->sk_priority;
And we have
int ip_forward(struct sk_buff *skb)
{
...
skb->priority = rt_tos2priority(iph->tos);
...
}
__u32 priority;
There are other cases when we set the priority of the skb.
For example, in vlan_do_receive() (net/8021q/vlan_core.c).
kmemcheck_bitfield_begin(flags1);
__u8 local_df:1,
cloned:1,
ip_summed:2,
nohdr:1,
nfctinfo:3;
__u8 pkt_type:3
• The packet type is determined in the eth_type_trans() method.
• eth_type_trans() gets skb and net_device as parameters
(see net/ethernet/eth.c).
• The packet type depends on the destination MAC address in the
Ethernet header.
• It is PACKET_BROADCAST for broadcast.
• It is PACKET_MULTICAST for multicast.
• It is PACKET_HOST if the destination MAC address is the MAC address of
the device which was passed as a parameter.
• It is PACKET_OTHERHOST if these conditions are not met.
• (There is another type for outgoing packets,
PACKET_OUTGOING; see dev_queue_xmit_nit().)
• Notice that eth_type_trans() is unique to Ethernet; for FDDI, for
example, we have fddi_type_trans() (see net/802/fddi.c).
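The classification rules listed above can be sketched in user space as follows. The enum and the function name toy_classify() are hypothetical; the kernel's eth_type_trans() also does other work (such as setting skb->protocol) that is not modeled here.

```c
#include <string.h>

/* Hypothetical stand-ins for the PACKET_* packet types. */
enum toy_pkt_type {
    TOY_PACKET_HOST,
    TOY_PACKET_BROADCAST,
    TOY_PACKET_MULTICAST,
    TOY_PACKET_OTHERHOST
};

/*
 * Classify a frame by its destination MAC address, given the MAC
 * address of the receiving device.
 */
static enum toy_pkt_type toy_classify(const unsigned char dest[6],
                                      const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    if (memcmp(dest, bcast, 6) == 0)
        return TOY_PACKET_BROADCAST;
    if (dest[0] & 0x01)             /* group bit set: multicast */
        return TOY_PACKET_MULTICAST;
    if (memcmp(dest, dev_addr, 6) == 0)
        return TOY_PACKET_HOST;
    return TOY_PACKET_OTHERHOST;
}
```

The broadcast address also has the group bit set, so it has to be recognized before (or inside) the multicast test.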
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1 - netfilter packet trace flag
kmemcheck_bitfield_end(flags1);
__be16 protocol;
• skb->protocol is set in Ethernet network drivers by assigning it the
return value of eth_type_trans().
void (*destructor)(struct sk_buff *skb);
Helper method: static inline void skb_orphan(struct sk_buff *skb)
• If the skb has a destructor, call this destructor;
• set skb->sk and skb->destructor to null.
skb_iif to be the ifindex of the device on which we arrived,
skb->dev.
__u32 rxhash;
kmemcheck_bitfield_end(flags2);
dma_cookie_t dma_cookie;
__u32 secmark;
union {
__u32 mark;
__u32 dropcount;
__u32 avail_size;
};
sk_buff_data_t transport_header; - the transport layer (L4) header
(can be for example tcp header/udp header/icmp header, and more)
Receive Packet Steering (rps)
There is a global table called rps_sock_flow_table.
Each call to recvmsg or sendmsg updates the rps_sock_flow_table
by calling sock_rps_record_flow() which eventually calls
rps_record_sock_flow().
struct rps_sock_flow_table has an array called "ents".
- The index to this array is a hash (sk_rxhash) of the socket (sock)
from user space.
- The value of each element is the (desired) CPU.
Each call to send/receive from user space updates the CPU according
to the CPU on which the call was done.
For example, in net/ipv4/af_inet.c:
int inet_recvmsg()
{
...
rps_record_sock_flow()
...
}
In net/ipv4/tcp.c:
ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
...
sock_rps_record_flow(sk);
...
}
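The bookkeeping done by rps_record_sock_flow() can be pictured with a simplified user-space model: a table indexed by the masked flow hash, whose entries hold the CPU on which the socket was last used. All names below (toy_sock_flow_table, TOY_FLOW_ENTRIES, TOY_NO_CPU) are hypothetical, and the table size and sentinel value are assumptions made for the sketch.

```c
#include <stdint.h>

#define TOY_FLOW_ENTRIES 8      /* assumed size; must be a power of two */
#define TOY_NO_CPU 0xffff       /* assumed sentinel: no CPU recorded yet */

/* Simplified model of a sock flow table with an "ents" array. */
struct toy_sock_flow_table {
    uint32_t mask;
    uint16_t ents[TOY_FLOW_ENTRIES];
};

static void toy_table_init(struct toy_sock_flow_table *t)
{
    unsigned int i;

    t->mask = TOY_FLOW_ENTRIES - 1;
    for (i = 0; i < TOY_FLOW_ENTRIES; i++)
        t->ents[i] = TOY_NO_CPU;
}

/* Called on send/receive: remember the CPU the application ran on. */
static void toy_record_flow(struct toy_sock_flow_table *t,
                            uint32_t hash, uint16_t cpu)
{
    t->ents[hash & t->mask] = cpu;
}

/* Called on packet receive: which CPU should process this flow? */
static uint16_t toy_desired_cpu(const struct toy_sock_flow_table *t,
                                uint32_t hash)
{
    return t->ents[hash & t->mask];
}
```

Because the index is only a masked hash, two flows can share an entry; the last writer wins, which is acceptable for a steering hint.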
It can be set via:
echo numEntries > /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
net_device
unsigned long base_addr; - device I/O address
unsigned int irq - device IRQ number (this is the irq number with which
we call request_irq()).
unsigned long state;
• A flag which can be one of these values:
__LINK_STATE_START
__LINK_STATE_PRESENT
__LINK_STATE_NOCARRIER
__LINK_STATE_LINKWATCH_PENDING
__LINK_STATE_DORMANT
NETIF_F_VLAN_CHALLENGED is also set when creating a bonding device,
before enslaving the first Ethernet interface to it
(in drivers/net/bonding/bond_main.c):
bond_setup()
{
...
bond_dev->features |= NETIF_F_VLAN_CHALLENGED;
...
}
This is done to avoid problems that occur when adding VLANs over
an empty bond. See also later in the bonding section.
ipip_tunnel_setup() {
...
dev->features |= NETIF_F_LLTX;
...
}
in veth: (drivers/net/veth.c)
static void veth_setup(struct net_device *dev)
{
...
dev->features |= NETIF_F_LLTX;
...
}
and
also in the IPv4 over IPSec tunneling driver, net/ipv4/ip_vti.c, we have:
static void vti_tunnel_setup(struct net_device *dev) {
...
dev->features |= NETIF_F_LLTX;
...
}
NETIF_F_LLTX is also used in a few drivers which have their own Tx lock,
such as drivers/net/ethernet/chelsio/cxgb.
In drivers/net/ethernet/chelsio/cxgb/cxgb2.c, we have:
static int __devinit init_one(struct pci_dev *pdev,
const struct pci_device_id *ent)
{
...
netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM |
NETIF_F_RXCSUM | NETIF_F_LLTX;
...
}
For the full list of net_device features, look in:
include/linux/netdev_features.h.
See more info in Documentation/networking/netdev-features.txt by
Michal Miroslaw.
netdev_features_t wanted_features - user-requested features
netdev_features_t vlan_features; - mask of features inheritable by
VLAN devices.
int ifindex - Interface index. A unique device identifier.
• Helper method: static int dev_new_index(struct net *net)
• When creating a network device, ifindex is set.
• The ifindex is incremented by 1 each time we create a new network
device. This is done by the dev_new_index() method. (Since ifindex is
an int, the method takes into account cyclic overflow of the integer.)
• The first network device we create, which is almost always the
loopback device, has an ifindex of 1.
• You can see the ifindex of the loopback device by:
• cat /sys/class/net/lo/ifindex
• You can see the ifindex of any other network device, which is named
netDeviceName, by:
• cat /sys/class/net/netDeviceName/ifindex
int iflink;
struct net_device_stats stats - device statistics, like number
of rx_packets, number of tx_packets, and more.
atomic_long_t rx_dropped - packets dropped by the core network
stack; drivers should not use this counter. There are some cases when the
stack increments the rx_dropped counter; for example, under certain
conditions in __netif_receive_skb().
const struct iw_handler_def * wireless_handlers;
struct iw_public_data * wireless_data;
const struct net_device_ops *netdev_ops;
net_device_ops includes pointers to several callback methods
which we define when we want to override the default
behavior. The net_device_ops object MUST be initialized (even to an empty
struct) prior to calling register_netdevice()! The reason is that
register_netdevice() checks whether dev->netdev_ops->ndo_init exists
*without* first verifying that dev->netdev_ops is not NULL. If we
do not initialize netdev_ops, we will get a NULL pointer dereference there.
dev->flags = IFF_NOARP;
...
}
IFF_POINTOPOINT is set for ppp devices.
For example, in drivers/net/ppp/ppp_generic.c, we have:
static void ppp_setup(struct net_device *dev)
{
...
dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
...
}
IFF_MASTER is set for master devices (whereas IFF_SLAVE is set for
slave devices).
For example, for bond devices, we have, in drivers/net/bonding/bond_main.c,
static void bond_setup(struct net_device *bond_dev)
{
bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
}
unsigned int priv_flags;
• These are flags you cannot see from user space with ifconfig or other
utils.
• Some examples of priv_flags:
• IFF_EBRIDGE for a bridge interface.
• This flag is set in br_dev_setup() in net/bridge/br_device.c
• IFF_BONDING
• This flag is set in bond_setup() method.
• This flag is set also in bond_enslave() method.
• both methods are in drivers/net/bonding/bond_main.c.
• IFF_802_1Q_VLAN
• This flag is set in vlan_setup() in net/8021q/vlan_dev.c
• IFF_TX_SKB_SHARING
• In ieee80211_if_setup() , net/mac80211/iface.c we have:
• dev->priv_flags &= ~IFF_TX_SKB_SHARING;
• IFF_TEAM_PORT
• This flag is set in team_port_enter() method
in drivers/net/team/team.c
• IFF_UNICAST_FLT
• Specifies that the driver handles unicast address filtering.
• In mv643xx_eth_probe(), drivers/net/ethernet/marvell/mv643xx_eth.c,
• ...
• dev->priv_flags |= IFF_UNICAST_FLT;
• ...
• The patch which added IFF_UNICAST_FLT:
http://www.spinics.net/lists/netdev/msg172606.html
• IFF_LIVE_ADDR_CHANGE
• When this flag is set, the MAC address can be changed
with eth_mac_addr() even while the device is running. Many drivers
use eth_mac_addr() as the ndo_set_mac_address() callback of struct
net_device_ops.
• See eth_mac_addr() in net/ethernet/eth.c.
Maximum Transmission Unit: the maximum size of frame the device
can handle. RFC 791 sets 68 as a minimum for internet module MTU.
The eth_change_mtu() method above does not permit setting an MTU
lower than 68. This minimum should not be confused with the
576-byte datagram size that every host must be able to accept
(also according to RFC 791).
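The lower bound described above can be sketched as a small user-space check. The function name toy_change_mtu() and the TOY_* constants are hypothetical; the 1500-byte upper bound is an assumption taken from the standard Ethernet payload size and is not stated in the text above.

```c
#include <errno.h>

#define TOY_MIN_MTU 68          /* minimum from RFC 791 */
#define TOY_ETH_DATA_LEN 1500   /* assumed upper bound: Ethernet payload */

/*
 * Validate and apply a new MTU.
 * Returns 0 on success or -EINVAL if new_mtu is out of range,
 * in which case *mtu is left untouched.
 */
static int toy_change_mtu(unsigned int *mtu, int new_mtu)
{
    if (new_mtu < TOY_MIN_MTU || new_mtu > TOY_ETH_DATA_LEN)
        return -EINVAL;
    *mtu = (unsigned int)new_mtu;
    return 0;
}
```

A driver that supports jumbo frames would need a different upper bound, which is one reason drivers may provide their own MTU-change callback.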
• For ppp, the device type ARPHRD_PPP is assigned
in ppp_setup(). see drivers/net/ppp/ppp_generic.c.
• For IPv4 tunnels, the type is ARPHRD_TUNNEL.
For IPv6 tunnels, the type is ARPHRD_TUNNEL6.
For example, for ip in ip tunnel in IPv4 (net/ipv4/ipip.c), we have:
static void ipip_tunnel_setup(struct net_device *dev) {
...
dev->type = ARPHRD_TUNNEL;
...
}
And for ip in ip tunnel in IPv6, we have:
static void ip6_tnl_dev_setup(struct net_device *dev)
{
...
dev->type = ARPHRD_TUNNEL6;
...
}
...
dev->hard_header_len = ETH_HLEN;
...
}
In case of tunnel devices, it is set to different values, according to the
tunnel specifics. So in case of vxlan, we have, in drivers/net/vxlan.c
static void vxlan_setup(struct net_device *dev)
{
...
dev->hard_header_len = ETH_HLEN + VXLAN_HEADROOM;
...
}
where VXLAN_HEADROOM is the size of the IP header (20) + the size of
the UDP header (8) + the size of the VXLAN header (8) + the size of
the Ethernet header (14); so VXLAN_HEADROOM is 50 bytes in total.
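The arithmetic can be checked with a few constants. This is only a worked example: the TOY_* names are hypothetical, and the sizes assume an IPv4 header without options (20 bytes), a UDP header of 8 bytes, a VXLAN header of 8 bytes and an Ethernet header of 14 bytes.

```c
/* Header sizes in bytes (IPv4 without options is assumed). */
enum {
    TOY_IPV4_HLEN  = 20,
    TOY_UDP_HLEN   = 8,
    TOY_VXLAN_HLEN = 8,
    TOY_ETH_HLEN   = 14
};

/* Headroom needed for the encapsulation headers: 20 + 8 + 8 + 14. */
static unsigned int toy_vxlan_headroom(void)
{
    return TOY_IPV4_HLEN + TOY_UDP_HLEN + TOY_VXLAN_HLEN + TOY_ETH_HLEN;
}

/* Value assigned to hard_header_len in the vxlan_setup() excerpt. */
static unsigned int toy_vxlan_hard_header_len(void)
{
    return TOY_ETH_HLEN + toy_vxlan_headroom();
}
```

With these sizes the headroom is 50 bytes, and ETH_HLEN + VXLAN_HEADROOM comes to 64 bytes.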
With ipip tunnel we have in ipip_tunnel_setup(), (net/ipv4/ipip.c)
static void ipip_tunnel_setup(struct net_device *dev)
{
...
dev->hard_header_len = LL_MAX_HEADER + sizeof(struct iphdr);
...
}
/* extra head- and tailroom the hardware may need, but not in all cases
* can this be guaranteed, especially tailroom. Some cases also use
* LL_MAX_HEADER instead to allocate the skb.
*/
unsigned short needed_headroom;
unsigned short needed_tailroom;
/* Interface address info. */
unsigned char perm_addr[MAX_ADDR_LEN]; - permanent hw address
unsigned char addr_assign_type - hw address assignment type.
By default, the MAC address is permanent (NET_ADDR_PERM). In case
the MAC address was generated with a helper method
called eth_hw_addr_random(), the type of the MAC address is
NET_ADDR_RANDOM. There is also a type called NET_ADDR_STOLEN,
which is not used. The type of the MAC address is stored in the
addr_assign_type member of the net_device. Also, when we change
the MAC address of the device with eth_mac_addr(), we clear
the NET_ADDR_RANDOM bit of addr_assign_type (in case it
was marked as NET_ADDR_RANDOM before).
a counter of the times a NIC is told to work in promiscuous
mode; used to enable more than one sniffing client; it is also used in
the bridging subsystem, when adding a bridge interface (see the call
to dev_set_promiscuity() in br_add_if(), net/bridge/br_if.c).
dev_set_promiscuity() sets the IFF_PROMISC flag of the netdevice. Since
promiscuity is an int, dev_set_promiscuity() takes into account
cyclic overflow of the integer.
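The counting scheme, including the overflow guard, can be sketched in user space as below. struct toy_dev and toy_set_promiscuity() are hypothetical names; the counter is modeled as an unsigned int so that wraparound is well defined in C, and the flag value 0x100 is used as a stand-in for IFF_PROMISC.

```c
#include <errno.h>

#define TOY_IFF_PROMISC 0x100   /* stand-in for IFF_PROMISC */

struct toy_dev {
    unsigned int flags;
    unsigned int promiscuity;   /* number of promiscuous-mode users */
};

/*
 * Add inc (positive or negative) to the promiscuity counter.
 * The flag is cleared when the last user leaves; if an increment
 * wraps the counter to 0, the change is undone and -EOVERFLOW is
 * returned.
 */
static int toy_set_promiscuity(struct toy_dev *dev, int inc)
{
    dev->flags |= TOY_IFF_PROMISC;
    dev->promiscuity += inc;
    if (dev->promiscuity == 0) {
        if (inc < 0) {
            /* last user left: leave promiscuous mode */
            dev->flags &= ~TOY_IFF_PROMISC;
        } else {
            /* counter wrapped around: undo and report */
            dev->promiscuity -= inc;
            return -EOVERFLOW;
        }
    }
    return 0;
}
```

Two sniffers each calling with +1 leave the counter at 2, so the interface stays promiscuous until both have called with -1.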
● This pointer is assigned to a pointer to struct in_device
in inetdev_init() (net/ipv4/devinet.c)
● unsigned int num_rx_queues -
number of RX queues allocated at register_netdev() time
ls /sys/class/net/p2p1.100/queues
rx-0 rx-1 rx-2 rx-3 rx-4 rx-5 rx-6
tx-0 tx-1 tx-2 tx-3 tx-4 tx-5
Helper method: netdev_rx_handler_register(struct net_device *dev,
rx_handler_func_t
*rx_handler,
void *rx_handler_data)
/*
* Cache lines mostly used on transmit path
*/
struct netdev_queue *_tx ____cacheline_aligned_in_smp;
dev_init_scheduler() method initializes qdisc in register_netdevice().
helper methods:
static inline void dev_hold(struct net_device *dev)
increments reference count of the device.
static inline void dev_put(struct net_device *dev)
decrements reference count of the device.
int netdev_refcnt_read(const struct net_device *dev)
reads sum of all CPUs reference counts of this device.
/* delayed register/unregister */
struct list_head todo_list;
/* device index hash chain */
struct hlist_node index_hlist;
NETREG_REGISTERED, /* completed register_netdevice */
NETREG_UNREGISTERING, /* called unregister_netdevice */
NETREG_UNREGISTERED, /* completed unregister todo */
NETREG_RELEASED, /* called free_netdev */
NETREG_DUMMY, /* dummy device for NAPI poll */
} reg_state:8;
enum rtnl_link_state -
• rtnl_link_state can
be RTNL_LINK_INITIALIZING or RTNL_LINK_INITIALIZED.
• When creating a new link, in rtnl_newlink(), the rtnl_link_state is set to
RTNL_LINK_INITIALIZING (this is done by rtnl_create_link(), which is
invoked from rtnl_newlink()); later on,
when calling rtnl_configure_link(), the rtnl_link_state is set to
RTNL_LINK_INITIALIZED.
If the NETIF_F_NETNS_LOCAL flag of the net device is set, the
operation is not performed and an error is returned. Callers of this
method must hold the rtnl semaphore. This method returns 0 upon
success.
/* mid-layer private */
union {
void *ml_priv;
struct pcpu_lstats __percpu *lstats; /* loopback stats */
struct pcpu_tstats __percpu *tstats; /* tunnel stats */
struct pcpu_dstats __percpu *dstats; /* dummy stats */
};
struct garp_port __rcu *garp_port;
we send an RTM_NEWLINK message. This message is handled
by rtnl_newlink() callback in net/core/rtnetlink.c.
int group;
• The group the device belongs to.
● Helper method: void dev_set_group(struct net_device *dev, int
new_group): a helper method to set a new group.
struct pm_qos_request pm_qos_req; - for power management
requests.
}
● Macros starting with IN_DEV, like IN_DEV_FORWARD() or
IN_DEV_RX_REDIRECTS(), are related to
net_device. struct in_device has a member named cnf (an instance of
ipv4_devconf). Setting /proc/sys/net/ipv4/conf/all/forwarding eventually
sets the forwarding member of in_device to 1. The same is true
for accept_redirects and send_redirects; both are also members of cnf
(ipv4_devconf).
.name = "bridge",
};
br_dev_setup()
{
SET_NETDEV_DEVTYPE(dev, &br_type);
}
Calling SET_NETDEV_DEVTYPE() in this way enables us to
see DEVTYPE=bridge when running the udevadm command on the bridge
sysfs entry:
udevadm info -q all -p /sys/devices/virtual/net/mybr
P: /devices/virtual/net/mybr
E: DEVPATH=/devices/virtual/net/mybr
E: DEVTYPE=bridge
E: ID_MM_CANDIDATE=1
E: IFINDEX=7
E: INTERFACE=mybr
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/mybr
E: TAGS=:systemd:
E: USEC_INITIALIZED=4288173427
The following patch from Doug Goldstein (which was applied) adds
sysfs type to vlan:
The patch itself:
http://www.spinics.net/lists/netdev/msg214013.html
The patch approval:
http://www.spinics.net/lists/netdev/msg216184.html
Another example of usage of SET_NETDEV_DEVTYPE() macro is in
cfg80211_netdev_notifier_call() in net/wireless/core.c
static struct device_type wiphy_type = {
.name = "wlan",
};
cfg80211_netdev_notifier_call() {
...
SET_NETDEV_DEVTYPE(dev, &wiphy_type);
...
}
We can view the number of packets rejected by Reverse Path Filter
by:
netstat -s | grep IPReversePathFilter
IPReversePathFilter: 12
This displays the LINUX_MIB_IPRPFILTER MIB counter, which is
incremented whenever
ip_rcv_finish() gets -EXDEV error from ip_route_input_noref().
We can also view it by cat /proc/net/netstat
IN_DEV_RPFILTER(idev) macro - used in fib_validate_source().
See myping.c as an example of spoofing in the following link:
http://www.tenouk.com/Module43a.html
Network interface drivers
● Most NICs are PCI devices. There are cases, especially with
SoC (System On Chip) vendors, where the network interfaces are not
PCI devices. There are also some USB network devices.
● The drivers for network PCI devices use the generic PCI calls,
like pci_register_driver() and pci_enable_device().
● For more info on NIC drivers, see the article “Writing Network Device
Driver for Linux” (link no. 9 in links) and chapter 17 in LDD3.
● There are two modes in which a NIC can receive a packet.
– The traditional way is interrupt driven:
each received packet is an asynchronous event which causes an
interrupt.
NAPI
The initial change to napi_struct is explained in:
http://lwn.net/Articles/244640/
Routing Subsystem
● The routing table enables us to find the net device and the address of
the host to which a packet will be sent.
● Reading entries in the routing table is done by calling fib_lookup
• In IPv4: int fib_lookup(struct net *net, struct flowi4 *flp,
struct fib_result *res)
• In IPv6: struct dst_entry *fib6_rule_lookup(struct net *net,
struct flowi6 *fl6, int flags, pol_lookup_t lookup)
● FIB is the “Forwarding Information Base”.
● There are two routing tables by default: (non Policy Routing case)
– local FIB table (ip_fib_local_table ; ID 255).
– main FIB table (ip_fib_main_table ; ID 254) – See :
include/net/ip_fib.h.
● Routes can be added into the main routing table in one of 3 ways:
– By sys admin command (route add/ip route).
– By routing daemons.
– As a result of ICMP (REDIRECT).
● A routing table is implemented by struct fib_table.
Routing Tables
Routing Cache
Note: In recent kernels, the routing cache has been removed.
● The dst_entry is the protocol independent part.
– Thus, for example, we have a first member called dst also in rt6_info
in IPv6; rt6_info is the parallel of rtable for IPv6 (include/net/ip6_fib.h).
● rtable is created in __mkroute_input() and
in __mkroute_output(). (net/ipv4/route.c)
● There is a member in rtable called rt_is_input, specifying whether it is
an input route or an output route.
● There are also two helper
methods, rt_is_input_route() and rt_is_output_route(), which return
whether the route is input route or output route.
● The key for a lookup operation in the routing cache is an IP address
(whereas in the routing table the key is a subnet).
● The lookup is done by fib_trie (net/ipv4/fib_trie.c)
● It is based on extending the lookup key.
● By Robert Olsson et al (see links).
– TRASH (trie + hash)
– Active Garbage Collection
● You can view fib tries stats by:
cat /proc/net/fib_triestat
● You can flush the routing cache by: ip route flush cache
caveat: it sometimes takes 2-3 seconds or more; it depends on your
machine.
● You can show the routing cache by:
ip route show cache
rth->u.dst.input and rth->u.dst.output
● Setting the flowi member of dst (rth->fl)
– Next time there is a lookup in the cache, for example in
ip_route_input(), we will compare against rth->fl.
● A garbage collection call which deletes eligible entries from the
routing cache.
● Which entries are not eligible?
ip rule add
– The rule can be based on input interface, TOS, fwmark (from
netfilter).
● ip rule list – show all rules.
Routing Table lookup algorithm
Receiving a packet
● The handler for receiving an IPV6 packet
is ipv6_rcv() (net/ipv6/ip6_input.c)
● Handler for the protocols are registered at init phase.
– Likewise, arp_rcv() is the handler for ARP packets.
● First, ip_rcv() performs some sanity checks. For example:
if (iph->ihl < 5 || iph->version != 4)
goto inhdr_error;
– iph is the IP header; iph->ihl is the IP header length (4 bits),
counted in 32-bit words.
– The IP header must be at least 20 bytes.
– It can be up to 60 bytes (when we use IP options).
● Then it calls ip_rcv_finish(), by:
NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
● This division of methods into two stages (where the second has the
same name with the suffix finish or slow) is typical for networking
kernel code.
● In many cases the second method has a “slow” suffix instead of
“finish”; this usually happens when the first method looks in some
cache and the second method performs a lookup in a table, which is
slower.
● ip_rcv_finish() implementation:
if (skb->dst == NULL) {
int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
skb->dev);
...
}
...
return dst_input(skb);
● ip_route_input(): first performs a lookup in the routing cache to see
if there is a match. If there is no match (cache miss), it calls
ip_route_input_slow() to perform a lookup in the routing table. (This
lookup is done by calling fib_lookup().)
● fib_lookup(const struct flowi *flp, struct fib_result *res): the results
are kept in fib_result.
● ip_route_input() returns 0 upon successful lookup (also when there
is a cache miss but a successful lookup in the routing table).
According to the results of fib_lookup(), we know if the frame is for
local delivery or for forwarding or to be dropped.
● If the frame is for local delivery, we will set the input() function
pointer of the route to ip_local_deliver():
rth->u.dst.input = ip_local_deliver;
● If the frame is to be forwarded, we will set the input() function
pointer to ip_forward():
rth->u.dst.input = ip_forward;
Local Delivery
● Prototype: ip_local_deliver(struct sk_buff *skb) (net/ipv4/ip_input.c).
It calls NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL,
ip_local_deliver_finish);
● Delivers the packet to the higher protocol layers according to
its type.
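Before a packet reaches routing and local delivery, ip_rcv() applies the header sanity check on ihl and version mentioned earlier in this section. A user-space sketch of that check on a raw byte buffer is shown here; the function names are hypothetical, and the kernel performs further checks (for example on the checksum and total length) that are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Header length in bytes: ihl is counted in 32-bit words. */
static unsigned int toy_ipv4_header_len(const uint8_t *pkt)
{
    return (pkt[0] & 0x0fu) * 4u;
}

/*
 * Return 1 if the buffer starts with a plausible IPv4 header:
 * version 4, ihl of at least 5 words (20 bytes), and enough bytes
 * in the buffer to hold the whole header (up to 60 bytes).
 */
static int toy_ipv4_header_ok(const uint8_t *pkt, size_t len)
{
    unsigned int version, ihl;

    if (len < 20)
        return 0;
    version = pkt[0] >> 4;      /* high nibble of the first byte */
    ihl = pkt[0] & 0x0fu;       /* low nibble of the first byte */
    if (ihl < 5 || version != 4)
        return 0;
    if (len < (size_t)ihl * 4)
        return 0;
    return 1;
}
```

Since ihl is a 4-bit field, its maximum value is 15 words, which gives the 60-byte upper limit on the header size.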
Forwarding
● ip_output() will call ip_finish_output().
– This is the NF_IP_POST_ROUTING point.
● ip_finish_output() will eventually send the packet from a neighbor
by:
– dst->neighbour->output(skb)
– arp_bind_neighbour() sees to it that the L2 address of the next hop
will be known. (net/ipv4/arp.c)
● If the packet is for the local machine:
– dst->output = ip_output
– dst->input = ip_local_deliver
– ip_output() will send the packet on the loopback device.
– Then we will go into ip_rcv() and ip_rcv_finish(), but
this time dst is NOT null; so we will end in ip_local_deliver().
● See: net/ipv4/route.c
GRO:
GRO stands for Generic Receive Offload.
In order to work with GRO (Generic Receive Offload):
- you must set NETIF_F_GRO in device features.
- you should call napi_gro_receive() from the RX path of the driver.
GRO replaces LRO (Large Receive Offload), as LRO was only for TCP in
IPv4.
LRO was removed from the network stack.
GRO works in conjunction with GSO (Generic Segmentation Offload).
Multipath routing
● This feature enables the administrator to set multiple next hops for a
destination.
● To enable multipath routing, CONFIG_IP_ROUTE_MULTIPATH
should be set when building the kernel.
● There was also an option for multipath caching (by setting
CONFIG_IP_ROUTE_MULTIPATH_CACHED).
● It was experimental and was removed in 2.6.23. See links (6).
To add a multicast address at MAC level, you can use "ip maddr add".
Note that "ip maddr add" expects a MAC address, not an IP address!
So this is ok:
ip maddr add 01:00:5e:01:01:25 dev eth0
but this is wrong: (pay attention, you will not get any error message!)
ip maddr add 226.1.2.3
You can join a multicast group also by setsockopt
with IP_ADD_MEMBERSHIP; see for
example: https://github.com/troglobit/toolbox/blob/master/mcjoin.c
All multicast addresses in MAC representation start with 01:00:5E,
according to IANA requirements.
Multicast addresses are translated from IP notation to mac address
by a formula; see ip_eth_mc_map() in include/net/ip.h.
This is needed for example in arp
translation, arp_mc_map() in net/ipv4/arp.c.
The handler for multicast RX is ip_mr_input() in net/ipv4/ipmr.c.
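The IP-to-MAC formula mentioned above can be sketched in user space: the MAC address is the fixed prefix 01:00:5e followed by the low 23 bits of the IPv4 group address. The function name toy_ip_eth_mc_map() is hypothetical, and unlike the kernel helper it takes the address in host byte order to keep the example short.

```c
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet
 * multicast MAC address: 01:00:5e + low 23 bits of the group.
 */
static void toy_ip_eth_mc_map(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;  /* top bit of this byte is dropped */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

Because only 23 of the 28 group bits are kept, 32 different IP groups share each MAC address; for example, 224.1.1.37 and 225.129.1.37 both map to 01:00:5e:01:01:25.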
the same network segment.
224.0.0.13 : All PIM Routers.
This is done in k_join() method of pimd-2.1.8/kern.c.
Two membership reports are sent as a result.
• These membership reports also have a TTL of 1.
See the IPv4 Multicast Address Space Registry:
http://www.iana.org/assignments/multicast-addresses/multicast-addresses.xml
pimd creates an IGMP socket.
pimd adds entries to the multicast cache (MFC). This is done by
setsockopt with MRT_ADD_MFC which
invokes ipmr_mfc_add() method in net/ipv4/ipmr.c
You can see entries and statistics of the multicast cache (MFC) by:
cat /proc/net/ip_mr_cache
This patch (4.12.12) from Nicolas Dichtel makes it possible to advertise MFC
stats via rtnetlink. This is done by adding a struct named rta_mfc_stats
in include/uapi/linux/rtnetlink.h.
See: ipmr/ip6mr: advertise mfc stats via rtnetlink:
http://permalink.gmane.org/gmane.linux.network/251481
Secondary addresses:
An address is considered "secondary" if it is included in the subnet of
another address on the same interface.
Example:
ip address add 192.168.0.1/24 dev p2p1
ip address add 192.168.0.2/24 dev p2p1
ip addr list dev p2p1
3: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast state UP qlen 1000
link/ether 00:a1:b0:69:74:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/24 scope global p2p1
inet 192.168.0.2/24 scope global secondary p2p1
inet6 fe80::2a1:b0ff:fe69:7400/64 scope link
valid_lft forever preferred_lft forever
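The "secondary" rule from the example above can be sketched as a subnet comparison. The names below are hypothetical, addresses are 32-bit values in host byte order, prefix lengths are assumed to be between 0 and 32, and requiring equal prefix lengths is an assumption of this sketch rather than something stated in the text.

```c
#include <stdint.h>

/* Netmask for a prefix length in the range 0..32. */
static uint32_t toy_prefix_mask(unsigned int prefix_len)
{
    return prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
}

/*
 * Return 1 if new_addr/new_plen falls in the subnet of an address
 * that is already configured on the interface (primary/primary_plen),
 * so that the new address would be considered secondary.
 */
static int toy_is_secondary(uint32_t new_addr, unsigned int new_plen,
                            uint32_t primary, unsigned int primary_plen)
{
    uint32_t mask = toy_prefix_mask(primary_plen);

    if (new_plen != primary_plen)
        return 0;
    return (new_addr & mask) == (primary & mask);
}
```

With 192.168.0.1/24 already configured, 192.168.0.2/24 matches the same /24 network and is reported as secondary, while 192.168.1.2/24 is not.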
IGMP snooping
IGMP snooping can be controlled through sysfs interface.
For brN, the settings can be found
under /sys/devices/virtual/net/brN/bridge.
For example:
brctl addbr br0
cat /sys/devices/virtual/net/br0/bridge/multicast_snooping
The multicast_disabled member of struct net_bridge represents
multicast_snooping.
rtnl_register()
The rtnl_register() gets 3 callbacks as parameters:
doit, dumpit, and calcit callbacks.
We have two rtnl_register() with RTM_GETROUTE calls for the routing
subsystem;
rtnl_register(PF_INET, RTM_GETROUTE, inet_rtm_getroute, NULL,
NULL)
in net/ipv4/route.c
and
rtnl_register(PF_INET, RTM_GETROUTE, NULL, inet_dump_fib, NULL);
in net/ipv4/fib_frontend.c
They are called according to the type of userspace call:
ip route get 192.168.1.1 is implemented via inet_rtm_getroute()
ip route show is implemented via inet_dump_fib()
Note:
In rtnetlink_net_init() we have:
sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
rtnetlink_net_init() is called
from netlink_proto_init(), net/netlink/af_netlink.c.
We also have, in net/ipv4/fib_frontend.c:
netlink_kernel_create(net, NETLINK_FIB_LOOKUP, &cfg);
NETLINK_FIB_LOOKUP is not used by iproute2; it *is* used by libnl, in a
utility named nl-fib-lookup, and also in other libnl code.
NETLINK_FIB_LOOKUP has one callback, named nl_fib_input(struct
sk_buff *skb), which eventually performs a FIB lookup. You
might wonder why the NETLINK_FIB_LOOKUP socket is needed if we
have "ip route get", which uses the NETLINK_ROUTE socket
and the RTM_GETROUTE message; the answer is
that NETLINK_FIB_LOOKUP was added when adding the trie code, and
it probably stayed as a legacy.
See: http://lists.openwall.net/netdev/2009/05/25/33
VRRP
VRRP stands for Virtual Router Redundancy Protocol
http://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol
You can find a GPL licensed implementation of VRRP designed
for Linux operating systems here:
http://sourceforge.net/projects/vrrpd/
The VRRPd daemon is an implementation of VRRPv2 as specified
in RFC 2338.
It runs in userspace on Linux.
xorp project:
http://www.xorp.org/
• fea is the Forwarding Engine Abstraction
• mfea is the Multicast Forwarding Engine Abstraction.
XORP git tree https://github.com/greearb/xorp.ct.git
If you download the XORP tar.gz and have build problems, you
might consider cloning the XORP git tree and building with scons &&
scons install.
Netfilter
● Netfilter is the kernel layer to support applying iptables rules.
– It enables:
● Filtering
● Changing packets (masquerading)
● Connection Tracking
● see: http://www.netfilter.org/
Xtables modules are prefixed with xt, for
example, net/netfilter/xt_REDIRECT.c.
Xtables matches are always lowercase.
Xtables targets are always uppercase (for example, xt_REDIRECT.c)
struct xt_target : defined in include/linux/netfilter/x_tables.h
Registering xt_target is done by xt_register_target().
Registering an array of xt_target is done
by xt_register_targets(). see net/netfilter/xt_TPROXY.c
● "Writing Netfilter modules" (67 pages pdf) by Jan Engelhardt,
Nicolas Bouliane:
http://jengelh.medozas.de/documents/Netfilter_Modules.pdf
Netfilter tables:
You register/unregister a netfilter table
by ipt_register_table()/ipt_unregister_table().
we have the following 5 netfilter tables in IPv4:
nat table - has 4 chains:
NF_INET_PRE_ROUTING
NF_INET_POST_ROUTING
NF_INET_LOCAL_OUT
NF_INET_LOCAL_IN
• see net/ipv4/netfilter/iptable_nat.c
• REDIRECT is a NAT table target; implemented
in net/netfilter/xt_REDIRECT.c
mangle table - has 5 chains:
NF_INET_PRE_ROUTING
NF_INET_LOCAL_IN
NF_INET_FORWARD
NF_INET_LOCAL_OUT
NF_INET_POST_ROUTING
see:net/ipv4/netfilter/iptable_mangle.c
TPROXY is a mangle table target; implemented
in net/netfilter/xt_TPROXY.c
• You can set the ICMP type with --reject-with type: it can be
icmp-net-unreachable, icmp-host-unreachable, icmp-port-unreachable,
icmp-proto-unreachable, icmp-net-prohibited, icmp-host-prohibited or
icmp-admin-prohibited.
see net/ipv4/netfilter/iptable_filter.c
security table - has 3 chains:
NF_INET_LOCAL_IN
NF_INET_FORWARD
NF_INET_LOCAL_OUT
see: net/ipv4/netfilter/iptable_security.c
Connection Tracking
A connection entry is represented by struct nf_conn.
• see include/net/netfilter/nf_conntrack.h
Each connection tracking entry is kept until a certain timeout elapses.
This timeout period is different for TCP, UDP and ICMP.
You can see the connection tracking entries by:
cat /proc/net/nf_conntrack
SNAT and DNAT are implemented in net/netfilter/xt_nat.c
MASQUERADE is implemented
in net/ipv4/netfilter/ipt_MASQUERADE.c
Traffic Control
The tc utility (from the iproute package) is used to configure Traffic
Control in the Linux kernel.
There are three areas which Traffic Control handles:
tc qdisc - Queuing discipline.
Implementation: in net/sched/sch_* files (for example,
net/sched/sch_fifo.c).
tc class
Implementation: Also in net/sched/sch_* files.
tc filter
Implementation: in net/sched/cls_* files.
important structures:
struct Qdisc : declared in include/net/sch_generic.h
• net_device has a Qdisc member (named qdisc).
struct Qdisc_ops : declared in include/net/sch_generic.h
The noqueue_qdisc is an example of Qdisc which is used in virtual
devices.
The noqueue_qdisc_ops is an example of Qdisc_ops (member
of noqueue_qdisc).
Both are defined in net/sched/sch_generic.c.
pfifo_fast is the default qdisc on all network interfaces.
• Enqueuing/dequeuing is done
by pfifo_fast_enqueue() and pfifo_fast_dequeue().
pfifo_fast is a classless queueing discipline, as opposed, for example,
to CBQ or HTB, which are class-based queuing disciplines. We can
easily determine from looking at the qdisc declaration whether it is
classless or class based, by inspecting if there is a class_ops member:
• cbq_qdisc_ops has cbq_class_ops; see net/sched/sch_cbq.c; it is a
class-based qdisc
• htb_qdisc_ops has htb_class_ops; see net/sched/sch_htb.c ; it is a
class-based qdisc
• pfifo_fast_ops doesn't have class_ops; see net/sched/sch_generic.c; it
is a classless qdisc.
tc class add dev p2p1 parent 10:0 classid 10:10 htb rate 5Mbit
• This triggers invocation of tc_ctl_tclass() in net/sched/sch_api.c
(handler of the RTM_NEWTCLASS message, sent from user space)
A class can be a parent class or a child class.
NETFILTER_XT_TARGET_TPROXY kernel config item should be set
for Transparent proxy (TPROXY) target support.
TPROXY target is somewhat similar to REDIRECT. It can only be used
in the mangle table and is useful to redirect traffic to a transparent
proxy.
As opposed to REDIRECT, it does not depend on Netfilter connection
tracking and NAT.
See net/netfilter/xt_TPROXY.c.
Port 3128 is the default port of squid; in /etc/squid/squid.conf, you can
define a tproxy port; for example,
http_port 3128 tproxy
Adding tproxy will trigger calling setsockopt()
with IP_TRANSPARENT, when starting the squid daemon. This in turn
will set the transparent member of struct inet_sock.
An iptables rule to work with TPROXY can be for example:
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY
--tproxy-mark 0x1/0x1 --on-port 3128
--tproxy-mark 0x1/0x1 is for setting skb->mark in the TPROXY
module.
Netfilter hooks
Netfilter rule example
● Short example:
● Applying the following iptables rule:
– iptables -A INPUT -p udp --dport 9999 -j DROP
● This is NF_IP_LOCAL_IN rule;
● The packet will go to:
● ip_rcv()
● and then: ip_rcv_finish()
● And then ip_local_deliver()
● but it will NOT proceed to ip_local_deliver_finish() as in the usual
case, without this rule.
● As a result of applying this rule it reaches nf_hook_slow() with
verdict == NF_DROP (which calls kfree_skb() to free the packet)
● See net/netfilter/core.c.
● iptables -t mangle -A PREROUTING -p udp --dport 9999 -j MARK
--set-mark 5
– Applying this rule will set skb->mark to 0x05 in ip_rcv_finish().
● To support sending ICMP redirects, the machine should be configured
to send redirect messages.
– /proc/sys/net/ipv4/conf/all/send_redirects should be 1.
● In order that the other side will receive redirects, we should set
/proc/sys/net/ipv4/conf/all/accept_redirects to 1.
● Example:
● Add a suboptimal route on 192.168.0.31:
● route add -net 192.168.0.10 netmask 255.255.255.255 gw
192.168.0.121
● Running now “route” on 192.168.0.31 will show a new entry:
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.0.10 192.168.0.121 255.255.255.255 UGH 0 0 0 eth0
● Send packets from 192.168.0.31 to 192.168.0.10 :
● ping 192.168.0.10 (from 192.168.0.31)
● We will see (on 192.168.0.31):
– From 192.168.0.121: icmp_seq=2 Redirect Host(New nexthop: 192.168.0.10)
● Now, running on 192.168.0.121:
– route -Cn | grep .10
● shows that there is a new entry in the routing cache:
● 192.168.0.31 192.168.0.10 192.168.0.10 ri 0 0 34 eth0
● The “r” in the flags column means: RTCF_DOREDIRECT.
● The 192.168.0.121 machine had sent a redirect by calling
ip_rt_send_redirect() from ip_forward(). (net/ipv4/ip_forward.c)
● And on 192.168.0.31, running “route -C | grep .10” now shows a
new entry in the routing cache (in case accept_redirects=1):
● 192.168.0.31 192.168.0.10 192.168.0.10 0 0 1 eth0
● In case accept_redirects=0 (on 192.168.0.31), we will see:
● 192.168.0.31 192.168.0.10 192.168.0.121 0 0 0 eth0
● which means that the gw is still 192.168.0.121 (which is the route
that we added in the beginning).
● Adding an entry to the routing cache as a result of getting ICMP
REDIRECT is done in ip_rt_redirect(), net/ipv4/route.c.
● The entry in the routing table is not deleted.
Neighboring Subsystem
● Best-known protocol: ARP (in IPv6: ND, neighbour discovery)
● ARP table.
● Ethernet header is 14 bytes long:
– Destination MAC address (6 bytes).
– Source MAC address (6 bytes).
– Type (2 bytes).
● 0x0800 is the type for IP packet (ETH_P_IP)
● 0x0806 is the type for ARP packet (ETH_P_ARP)
● 0x8100 is the type for VLAN packet (ETH_P_8021Q)
● see: include/linux/if_ether.h
● When there is no entry in the ARP cache for the destination IP
address of a packet, a broadcast is sent (ARP request,
ARPOP_REQUEST: who has IP address x.y.z...). This is done by a
method called arp_solicit(). (net/ipv4/arp.c)
● You can see the contents of the arp table by running “cat
/proc/net/arp” or by running “arp” from the command line.
● You can delete and add entries to the arp table; see man arp.
Bridging Subsystem
● Bridging implementation in Linux conforms to IEEE 802.1d standard
(which describes Bridging and Spanning tree).
See http://en.wikipedia.org/wiki/IEEE_802.1D
● You can define a bridge and add NICs to it (“enslaving ports”) using
brctl (from bridge-utils).
• bridge-utils is maintained by Stephen Hemminger.
• you can get the sources by:
• git clone
git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-
utils.git
• Building is simple: first run: autoconf (in order to create "configure"
file)
• Then run make.
There are two important structures in the bridging subsystem:
struct net_bridge represents a bridge.
struct net_bridge_port represents a bridge port.
(Both are defined in net/bridge/br_private.h)
net_bridge has a hash table inside called "hash".
It has 256 entries (BR_HASH_SIZE).
● You can have up to 1024 ports for every bridge device
(BR_MAX_PORTS) .
● Example:
● brctl addbr mybr (Create a bridge named "mybr")
● brctl addif mybr eth0 (add a port to a bridge).
● brctl show
● brctl delbr mybr (Delete the bridge named "mybr")
Note:
You can see the fdb by
brctl addif mybr wlan0
can't add wlan0 to bridge mybr: Operation not supported.
You cannot add a loopback device to a bridge:
brctl addbr mybr
brctl addif mybr lo
can't add lo to bridge mybr: Invalid argument
The reason:
In br_add_if(), we check the priv_flags of the device, and in
case IFF_DONT_BRIDGE
is set, we return -EOPNOTSUPP (Operation not supported).
In case of wireless device, cfg80211_netdev_notifier_call() method
sets the IFF_DONT_BRIDGE (see net/wireless/core.c)
TBD:
Under which circumstances do we remove the IFF_DONT_BRIDGE
flag in cfg80211_change_iface() in net/wireless/util.c?
rx_handler_data member. You cannot call
twice netdev_rx_handler_register() on the same network device; this
will return an error ("Device or resource busy", EBUSY).
see drivers/net/macvlan.c and drivers/net/bonding/bond_main.c.
● In the past, when we received a frame, netif_receive_skb() called
handle_bridge().
Now we call br_handle_frame(), via
invoking rx_handler() (see __netif_receive_skb() in
net/core/dev.c)
The maintainer is Jesse Gross.
See also Documentation/networking/openvswitch.txt.
Network namespaces
A network namespace is logically another copy of the network stack,
with its own routes, firewall rules, and network devices.
• A network device belongs to exactly one network namespace.
• A socket belongs to exactly one network namespace.
A network namespace provides an isolated view of the networking
stack
- network device interfaces
- IPv4 and IPv6 protocol stacks,
- IP routing tables
- firewall rules
- /proc/net and /sys/class/net directory trees
- sockets
- more
Network namespace is implemented by struct
net, include/net/net_namespace.h
By running:
ip netns add netns_one
we create a file under /var/run/netns/ called netns_one.
See: man ip netns
In order to show all of the named network namespaces, we run:
./ip/ip netns list
Next you run:
./ip link add name if_one type veth peer name if_one_peer
./ip link set dev if_one_peer netns netns_one
ip netns add myns1
ip netns add myns2
Now, running:
ip netns exec myns1 bash
will move me into the myns1 network namespace; so if I run there:
ifconfig -a
I will see p2p1;
the scenes, dev_change_net_namespace() checks
the NETIF_F_NETNS_LOCAL flag of the net device. If it is set, we will
not permit changing of network namespace, and we will
return EINVAL.
Under the hood, calling ip netns exec invokes two system calls from
user space:
setns system call with CLONE_NEWNET (kernel/nsproxy.c)
unshare system call with CLONE_NEWNS (kernel/fork.c)
process is created in the same network namespace as the calling
process. This flag is intended for the implementation of containers.
Three lwn articles about namespaces:
"network namespaces"
http://lwn.net/Articles/219794/
"PID namespaces in the 2.6.24 kernel"
http://lwn.net/Articles/259217/
"Notes from a container"
http://lwn.net/Articles/256389/
A new approach to user namespaces: Jonathan Corbet, April 2012
http://lwn.net/Articles/491310/
Checkpoint/restore mostly in the userspace:
http://lwn.net/Articles/451916/
Checkpoint and Restore: are we there yet? lecture by Pavel Emelyanov
http://linux.conf.au/schedule/30116/view_talk?day=thursday
TCP
TCP: RFC 793: http://www.ietf.org/rfc/rfc793.txt
TCP provides a connection-oriented service.
MSS = Maximum segment size
tcp_sendmsg() is the main handler in the TX path.
sk_state is the state of the TCP socket.
If it is in neither the TCPF_ESTABLISHED nor the TCPF_CLOSE_WAIT state,
we cannot send data.
Allocation of a new segment is done via sk_stream_alloc_skb().
helper: tcp_current_mss(): compute the current effective MSS.
Important structures:
struct tcp_sock:
• u32 snd_cwnd - the congestion sending window size.
• u8 ecn_flags - ECN status bits.
• ECN stands for Explicit Congestion Notification.
• can be one of the following:
• TCP_ECN_OK
• TCP_ECN_QUEUE_CWR
• TCP_ECN_DEMAND_CWR
• TCP_ECN_SEEN
from include/uapi/linux/tcp.h:
struct tcphdr {
__be16 source;
__be16 dest;
__be32 seq;
__be32 ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u16 res1:4,
doff:4,
fin:1,
syn:1,
rst:1,
psh:1,
ack:1,
urg:1,
ece:1,
cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
__u16 doff:4,
res1:4,
cwr:1,
ece:1,
urg:1,
ack:1,
psh:1,
rst:1,
syn:1,
fin:1;
#else
#error "Adjust your <asm/byteorder.h> defines"
#endif
__be16 window;
__sum16 check;
__be16 urg_ptr;
};
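The bitfield layout above depends on the endianness of the compiler target. As a minimal user-space sketch (not kernel code), the same fields can be read portably from the raw header bytes. The names parse_tcp_header(), struct tcp_hdr_fields and the SK_TCP_* constants are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits as they sit in byte 13 of the TCP header (hypothetical names). */
enum {
    SK_TCP_FIN = 0x01,
    SK_TCP_SYN = 0x02,
    SK_TCP_RST = 0x04,
    SK_TCP_PSH = 0x08,
    SK_TCP_ACK = 0x10,
    SK_TCP_URG = 0x20,
    SK_TCP_ECE = 0x40,
    SK_TCP_CWR = 0x80
};

/* Decoded fields in host byte order; this is not a kernel structure. */
struct tcp_hdr_fields {
    uint16_t source;
    uint16_t dest;
    uint32_t seq;
    uint32_t ack_seq;
    unsigned int hdr_len;   /* doff converted to bytes */
    uint8_t flags;          /* OR of SK_TCP_* bits */
};

/* Read a big-endian (network order) 32-bit value. */
static uint32_t rd_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer cannot hold a valid header. */
static int parse_tcp_header(const uint8_t *p, size_t len,
                            struct tcp_hdr_fields *out)
{
    if (len < 20)
        return -1;
    out->source = (uint16_t)((p[0] << 8) | p[1]);
    out->dest = (uint16_t)((p[2] << 8) | p[3]);
    out->seq = rd_be32(p + 4);
    out->ack_seq = rd_be32(p + 8);
    out->hdr_len = (unsigned int)(p[12] >> 4) * 4; /* doff counts 32-bit words */
    out->flags = p[13];
    if (out->hdr_len < 20 || out->hdr_len > len)
        return -1;
    return 0;
}
```

Reading bytes explicitly avoids any dependence on __LITTLE_ENDIAN_BITFIELD; the kernel uses the bitfields because it knows the target layout at build time.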
TCP packet loss can be detected by two events:
• a timeout
• receiving duplicate ACKs.
• When and why do we get "duplicate ACKs"?
• According to RFC 2581, "TCP Congestion Control"
• http://www.ietf.org/rfc/rfc2581.txt:
A TCP receiver SHOULD send an immediate duplicate ACK when an
out-of-order segment arrives. The purpose of this ACK is to inform the
sender that a segment was received out-of-order and which sequence
number is expected.
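A sender can use these duplicate ACKs as a loss signal. The following is a simplified sketch of the counting step, assuming the threshold of three duplicate ACKs that RFC 2581 uses for fast retransmit. It is not the kernel implementation: struct dupack_tracker and on_ack() are hypothetical names, and a real stack also checks that the ACK carries no data, that the window is unchanged and that data is outstanding.

```c
#include <stdint.h>

/* Hypothetical per-connection state for this sketch. */
struct dupack_tracker {
    uint32_t last_ack;      /* highest ACK number seen so far */
    unsigned int dup_count; /* duplicates of last_ack seen in a row */
};

/*
 * Feed one incoming ACK number into the tracker.
 * Returns 1 exactly when the third duplicate ACK arrives (the point at
 * which RFC 2581 fast retransmit would resend the missing segment),
 * otherwise 0. Sequence number wraparound is ignored for simplicity.
 */
static int on_ack(struct dupack_tracker *t, uint32_t ack)
{
    if (ack == t->last_ack) {
        t->dup_count++;
        return t->dup_count == 3;
    }
    /* A new ACK number: the receiver made progress, so reset the count. */
    t->last_ack = ack;
    t->dup_count = 0;
    return 0;
}
```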
see:
Congestion Avoidance and Control
Van Jacobson
Lawrence Berkeley Laboratory
Michael J. Karels
University of California at Berkeley
ee.lbl.gov/papers/congavoid.pdf
TCP timers:
Keep Alive timer - implemented in tcp_keepalive_timer() in
net/ipv4/tcp_timer.c
TCP retransmit timer - implemented in tcp_retransmit_timer() in
net/ipv4/tcp_timer.c
RTO - retransmission timeout.
RTT - round trip time.
IPSEC
● Works at network IP layer (L3)
● Used in many forms of secured networks like VPNs.
● Mandatory in IPv6. (not in IPv4)
● Implemented in many operating systems: Linux, Solaris, Windows,
and more.
● RFC2401
● In 2.6 kernel : implemented by Dave Miller and Alexey Kuznetsov.
● IPSec subsystem Maintainers:
Herbert Xu and David Miller.
Steffen Klassert was added as a maintainer in October 2012.
see:
http://marc.info/?t=135032283000003&r=1&w=2
IPSec git kernel repositories:
There are two git trees at kernel.org, an 'ipsec' tree that tracks the
net tree and an 'ipsec-next' tree that tracks the net-next tree.
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git
● Transformation bundles.
● Chain of dst entries; only the last one is for routing.
● User space tools: http://ipsectools.sf.net
● Openswan: http://www.openswan.org/ (Open Source project).
Also strongSwan: http://www.strongswan.org/
● There are also non IPSec solutions for VPN
– example: pptp
● struct xfrm_policy has the following member:
– struct dst_entry *bundles.
– __xfrm4_bundle_create() creates dst_entries (with the DST_NOHASH
flag) see: net/ipv4/xfrm4_policy.c
● Transport Mode and Tunnel Mode.
● Show the security policies:
– ip xfrm policy show
● Show xfrm states
- ip xfrm state show
● Create RSA keys:
– ipsec rsasigkey --verbose 2048 > keys.txt
– ipsec showhostkey --left > left.publickey
– ipsec showhostkey --right > right.publickey
Some IPSec links:
USAGI IPv6 IPsec Development for Linux
http://hiroshi1.hongo.wide.ad.jp/hiroshi/papers/SAINT2004_kanda-ipsec.pdf
Example: Host to Host VPN (using openswan)
in /etc/ipsec.conf:
conn linuxtolinux
    left=192.168.0.189
    leftnexthop=%direct
    leftrsasigkey=0sAQPPQ...
    right=192.168.0.45
    rightnexthop=%direct
    rightrsasigkey=0sAQNwb...
    type=tunnel
    auto=start
● service ipsec start (to start the service)
● ipsec verify – Check your system to see if IPsec got installed and
started correctly.
● ipsec auto --status – If you see “IPsec SA established”, this implies
success.
● Look for errors in /var/log/secure (fedora core) or in kernel syslog
Tips for hacking
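● Documentation/networking/ip-sysctl.txt: networking kernel tunables
● Example of reading a hex address:
● iph->daddr == 0x0A00A8C0 means checking whether the address is
192.168.0.10 (C0=192, A8=168, 00=0, 0A=10). The address is stored in
network byte order, so on a little-endian machine the bytes of the
32-bit word read in reverse.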
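The byte-by-byte reading can be sketched in user space as follows. This assumes the hex value was printed on a little-endian machine (as in the example above); le_word_to_octets() and format_dotted() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Split a 32-bit word, as a little-endian CPU sees a network-order IPv4
 * address, into its four octets in dotted-quad order.
 */
static void le_word_to_octets(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v & 0xff);         /* lowest byte is the first octet */
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Write the dotted-quad form into buf; returns the snprintf() result. */
static int format_dotted(uint32_t v, char *buf, size_t n)
{
    uint8_t o[4];

    le_word_to_octets(v, o);
    return snprintf(buf, n, "%u.%u.%u.%u",
                    (unsigned int)o[0], (unsigned int)o[1],
                    (unsigned int)o[2], (unsigned int)o[3]);
}
```

Calling format_dotted(0x0A00A8C0, buf, sizeof(buf)) fills buf with the address from the example.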
● When you encounter xfrm / CONFIG_XFRM, this has to do with
IPsec (transformers).
New and future trends
● I/OAT.
● NetChannels (Van Jacobson and Evgeniy Polyakov).
● TCP Offloading.
● RDMA - Remote Direct Memory Access.
- iWARP stands for Internet Wide Area RDMA Protocol.
- Currently there are only two drivers in the kernel tree for NICs with
RDMA support (RNICs):
1) drivers/infiniband/hw/amso1100
2) drivers/infiniband/hw/cxgb3 - driver for the Chelsio T3 1GbE and
10GbE adapters.
The kernel maintainer of the INFINIBAND SUBSYSTEM is Roland Dreier.
● Multiqueue: some new NICs, like e1000 and IPW2200, allow two or
more hardware Tx queues. Also with virtio, patches which support
multiqueue were recently sent.
In case you want to override the kernel selection of the tx queue, you
should implement the
ndo_select_queue() member of the net_device_ops struct in your
driver.
For example, this is done in the ieee80211_dataif_ops struct
in net/mac80211/iface.c
...
.ndo_select_queue = ieee80211_netdev_select_queue,
...
see Documentation/networking/multiqueue.txt
and also
Documentation/networking/scaling.txt
vger.kernel.org/netconf2011_slides/bwh_netconf2011.pdf
lnstat tool
The lnstat tool is a powerful tool, part of the iproute2 package.
Examples of usage:
lnstat -f rt_cache -k entries
shows number of routing cache entries
Fragmentation:
Fragmentation of outgoing packets:
When the length of the skb is larger than the MTU of the device from
which the packet is transmitted, we perform fragmentation; this is done
in the ip_fragment() method (net/ipv4/ip_output.c); in IPv6, it is done
in ip6_fragment() in net/ipv6/ip6_output.c
Fragmentation can be done in two ways:
- via a page array (called skb_shinfo(skb)->frags[]) (There can be up
to MAX_SKB_FRAGS; MAX_SKB_FRAGS is 16 when page size is 4K).
- via a list of SKBs (called skb_shinfo(skb)->frag_list)
- The method skb_has_frag_list() tests the second (this method
was called skb_has_frags() in the past).
...
int on = IP_PMTUDISC_DO;
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on));
...
In the kernel, ip_dont_fragment() checks the value of the pmtudisc field of
the socket (struct inet_sock, which embeds the sock structure).
In case pmtudisc equals IP_PMTUDISC_DO, we set the IP_DF (Don't
fragment) flag in the ip header by
iph->frag_off = htons(IP_DF). See for example,
ip_build_and_send_pkt() in net/ipv4/ip_output.c
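From user space, the setting can be applied and read back as in the following sketch. It assumes a Linux system with glibc headers; udp_socket_pmtudisc_do() is a hypothetical helper name, and IPPROTO_IP is used as the level here (it has the same value as SOL_IP).

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a UDP socket, request IP_PMTUDISC_DO (always set DF, never
 * fragment locally), and read the option back.
 * Returns the value reported by getsockopt(), or -1 on any error.
 */
static int udp_socket_pmtudisc_do(void)
{
    int on = IP_PMTUDISC_DO;
    int val = -1;
    socklen_t len = sizeof(val);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, &len) < 0)
        val = -1;
    close(fd);
    return val;
}
```

With this setting, sending a datagram larger than the path MTU is expected to fail with EMSGSIZE instead of being fragmented.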
So getting the offset and the flags from the IP header can
be done thus:
- see for example, ip_frag_queue() in net/ipv4/ip_fragment.c
Each fragment has the IP_MF flag ("More fragments") set, except for
the last fragment.
The id field of the ip header is the same for all fragments.
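The frag_off field packs the flags and the fragment offset into 16 bits. A user-space sketch of splitting them, modeled on what ip_frag_queue() does, might look like this. The mask values are the ones I believe include/net/ip.h uses (IP_DF 0x4000, IP_MF 0x2000, IP_OFFSET 0x1FFF); the SK_ prefixed names, struct frag_info and decode_frag_off() are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Assumed to mirror IP_DF, IP_MF and IP_OFFSET from include/net/ip.h. */
#define SK_IP_DF     0x4000  /* Don't Fragment */
#define SK_IP_MF     0x2000  /* More Fragments */
#define SK_IP_OFFSET 0x1FFF  /* 13-bit offset, in units of 8 bytes */

struct frag_info {
    unsigned int offset; /* byte offset of this fragment in the datagram */
    int df;              /* 1 if Don't Fragment is set */
    int mf;              /* 1 if More Fragments is set */
};

/* frag_off_be is the header field exactly as on the wire (network order). */
static struct frag_info decode_frag_off(uint16_t frag_off_be)
{
    uint16_t v = ntohs(frag_off_be);
    struct frag_info fi;

    fi.df = (v & SK_IP_DF) != 0;
    fi.mf = (v & SK_IP_MF) != 0;
    fi.offset = (unsigned int)(v & SK_IP_OFFSET) << 3; /* 8-byte chunks */
    return fi;
}
```

A fragment with mf == 0 and offset != 0 is the last fragment of its datagram; one with mf == 0 and offset == 0 is not fragmented at all.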
Neighboring Subsystem
● Why do we need the neighboring subsystem ?
● “The world is a jungle in general, and the networking game
contributes many animals.” (from RFC 826, ARP, 1982)
● Most known protocol: ARP (in IPv6: ND, neighbour discovery)
● Ethernet header is 14 bytes long:
● Source Mac address and destination Mac address are 6 bytes each.
– Type (2 bytes). For example, (include/linux/if_ether.h)
● 0x0800 is the type for IP packet (ETH_P_IP)
● 0x0806 is the type for ARP packet (ETH_P_ARP)
● 0X8035 is the type for RARP packet (ETH_P_RARP)
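As an illustration of the 14-byte layout, the type field sits in bytes 12 and 13 of the frame, after the two 6-byte MAC addresses. The sketch below reads it from a raw buffer in user space; eth_frame_type() and eth_type_name() are hypothetical helpers, and the numeric values are the ones listed above.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_HLEN 14 /* 6 (destination) + 6 (source) + 2 (type) */

/* Returns the EtherType in host order, or -1 if the frame is too short. */
static int eth_frame_type(const uint8_t *frame, size_t len)
{
    if (len < SK_ETH_HLEN)
        return -1;
    return (frame[12] << 8) | frame[13]; /* type is big-endian on the wire */
}

/* Map the types mentioned in the text to a short name. */
static const char *eth_type_name(int type)
{
    switch (type) {
    case 0x0800: return "IP";
    case 0x0806: return "ARP";
    case 0x8035: return "RARP";
    case 0x8100: return "VLAN";
    default:     return "unknown";
    }
}
```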
Neighboring Subsystem – struct neighbour
● A neighbour (instance of struct neighbour) is embedded in dst, which
is in turn embedded in sk_buff:
● Implementation: important data structures
● struct neighbour (include/net/neighbour.h)
– ha is the hardware address (MAC address when dealing with
Ethernet) of the neighbour. This field is filled when an ARP response
arrives.
• primary_key – The IP address (L3) of the neighbour.
● lookup in the arp table is done with the primary_key.
• nud_state represents the Network Unreachability Detection state of
the neighbor. (for example, NUD_REACHABLE).
● int (*output)(struct sk_buff *skb);
– output() can be assigned to different methods according to the state
of the neighbour. For example, neigh_resolve_output() and
neigh_connected_output().
Initially, it is neigh_blackhole().
– When the state changes, the output function may also be
assigned to a different function.
● refcnt is incremented by neigh_hold() and decremented
by neigh_release().
We don't free a neighbour when the refcnt is higher than 1; instead, we
set dead (a member of neighbour) to 1.
● timer (The callback method is neigh_timer_handler()).
● struct hh_cache *hh (defined in include/linux/netdevice.h)
● confirmed – confirmation timestamp.
– Confirmation can also be done from L4 (transport layer).
– For example, dst_confirm() calls neigh_confirm().
– dst_confirm() is called from tcp_ack() (net/ipv4/tcp_input.c) and by
udp_sendmsg() (net/ipv4/udp.c) and more.
– neigh_confirm() does NOT change the state; it is the job of
neigh_timer_handler().
● dev (net_device)
● arp_queue – every neighbour has a small arp queue of its own.
– There can be only 3 elements by default in an arp_queue.
– This is configurable: /proc/sys/net/ipv4/neigh/default/unres_qlen
struct neigh_table
● struct neigh_table represents a neighboring table
(include/net/neighbour.h)
– The arp table (arp_tbl) is a neigh_table. (include/net/arp.h)
– In IPv6, nd_tbl (Neighbor Discovery table) is also a neigh_table
(include/net/ndisc.h)
– There is also dn_neigh_table (DECnet) (net/decnet/dn_neigh.c) and
clip_tbl (for ATM) (net/atm/clip.c)
– gc_timer: neigh_periodic_timer() is the callback for garbage collection.
– neigh_periodic_timer() deletes FAILED entries from the ARP table.
Neighboring Subsystem arp
● When there is no entry in the ARP cache for the destination IP
address of a packet, a broadcast is sent (ARP request,
ARPOP_REQUEST: who has IP address x.y.z...). This is done by
a method called arp_solicit() (net/ipv4/arp.c).
– In IPv6, the parallel mechanism is called ND (Neighbor Discovery) and
is implemented as part of ICMPv6.
– A multicast is sent in IPv6 (and not a broadcast).
● If there is no answer in time to this arp request, then we will end up
with sending back an ICMP error (Destination Host Unreachable).
● This is done by arp_error_report() , which indirectly
calls ipv4_link_failure() ; see net/ipv4/route.c.
● You can see the contents of the arp table by running: “cat
/proc/net/arp” or by running the “arp” from a command line.
• You can view statistics of arp cache (IPV4) by: cat
/proc/net/stat/arp_cache
• You can view statistics of ndisc cache (IPV6)
by: cat /proc/net/stat/ndisc_cache
● "ip neigh show" is the new method to show arp (from IPROUTE2)
• In IPv6 it is "ip -6 neigh show".
● You can delete and add entries to the arp table; see man arp/man ip.
● When using “ip neigh add” you can specify the state of the entry
which you are adding (like permanent, stale, reachable, etc).
● arp command does not show reachability states except the
incomplete state and permanent state: Permanent entries are marked
with M in Flags:
example : arp output
Address      HWtype  HWaddress           Flags Mask  Iface
10.0.0.2             (incomplete)                    eth0
10.0.0.3     ether   00:01:02:03:04:05   CM          eth0
10.0.0.138   ether   00:20:8F:0C:68:03   C           eth0
Neighboring Subsystem – ip neigh show.
● We can see the current neighbour states:
● Example :
● ip neigh show
192.168.0.254 dev eth0 lladdr 00:03:27:f1:a1:31 REACHABLE
192.168.0.152 dev eth0 lladdr 00:00:00:cc:bb:aa STALE
192.168.0.121 dev eth0 lladdr 00:10:18:1b:1c:14 PERMANENT
192.168.0.54 dev eth0 lladdr aa:ab:ac:ad:ae:af STALE
● arp_process() handles both ARP requests and ARP responses.
– net/ipv4/arp.c
– If the target ip (tip) address in the arp header is the loopback
then arp_process() drops it since loopback does not need ARP.
...
if (LOOPBACK(tip) || MULTICAST(tip))
    goto out;
...
out:
...
kfree_skb(skb);
return 0;
(see: #define LOOPBACK(x) (((x) & htonl(0xff000000)) ==
htonl(0x7f000000)) in linux/in.h)
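The same test can be tried in user space. In the sketch below, SK_LOOPBACK copies the LOOPBACK() macro quoted above; SK_MULTICAST is my assumption of how the matching MULTICAST() macro was defined in older kernels (class D, 224.0.0.0/4), and tip_is_dropped() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Operate on addresses in network byte order, like the kernel macros. */
#define SK_LOOPBACK(x)  (((x) & htonl(0xff000000)) == htonl(0x7f000000))
/* Assumed definition: top four bits 1110 means multicast. */
#define SK_MULTICAST(x) (((x) & htonl(0xf0000000)) == htonl(0xe0000000))

/*
 * Returns 1 if arp_process() would drop an ARP packet with this target
 * IP according to the check above, 0 if not, and -1 if the string is
 * not a valid dotted-quad address.
 */
static int tip_is_dropped(const char *dotted)
{
    struct in_addr a;

    if (inet_pton(AF_INET, dotted, &a) != 1)
        return -1;
    return SK_LOOPBACK(a.s_addr) || SK_MULTICAST(a.s_addr);
}
```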
● If it is an ARP request (ARPOP_REQUEST) we call ip_route_input().
● Why ?
● In case it is for us (RTN_LOCAL), we send an ARP reply.
– arp_send(ARPOP_REPLY,ETH_P_ARP,sip,dev,tip,sha,dev->dev_addr,sha);
– We also update our arp table with the sender entry (ip/mac).
● Special case: ARP proxy server.
● In case we receive an ARP reply (ARPOP_REPLY):
– We perform a lookup in the arp table (by calling __neigh_lookup()).
– If we find an entry, we update the arp table by neigh_update().
● If there is no entry and there is NO support for unsolicited ARP, we
don't create an entry in the arp table.
– Support for unsolicited ARP is enabled by
setting /proc/sys/net/ipv4/conf/all/arp_accept to 1.
– The corresponding macro is: IPV4_DEVCONF_ALL(ARP_ACCEPT)
– In older kernels, support for unsolicited ARP was done by:
CONFIG_IP_ACCEPT_UNSOLICITED_ARP
Neighboring Subsystem – lookup
● Lookup in the neighboring subsystem is done via neigh_lookup();
parameters:
– neigh_table (arp_tbl)
– pkey (ip address, the primary_key of neighbour struct)
– dev (net_device)
● There are 2 wrappers:
– __neigh_lookup(): just one more parameter, creat (a flag: whether to
create a neighbour by neigh_create() or not)
– __neigh_lookup_errno()
Neighboring Subsystem – static entries
● Adding a static entry is done by:
arp -s ipAddress MacAddress
● Alternatively, this can be done by:
ip neigh add ipAddress dev eth0 lladdr MacAddress nud permanent
● The state (nud_state) of this entry will be NUD_PERMANENT
• ip neigh show will show it as PERMANENT.
● Why do we need PERMANENT entries ?
arp_bind_neighbour() method
● Suppose we are sending a packet to a host for the first time.
● a dst_entry is added to the routing cache by rt_intern_hash().
● We should know the L2 address of that host.
– So rt_intern_hash() calls arp_bind_neighbour().
● Only for RTN_UNICAST (not for multicast/broadcast).
– arp_bind_neighbour(): net/ipv4/arp.c
– dst->neighbour is NULL, so it calls __neigh_lookup_errno().
– There is no such entry in the arp table.
– So we will create a neighbour with neigh_create() and add it to the arp
table.
● neigh_create() creates a neighbour with NUD_NONE state
– setting nud_state to NUD_NONE is done in neigh_alloc()
The IFF_NOARP flag
● Disabling and enabling arp
● ifconfig eth1 -arp
– You will see the NOARP flag now in ifconfig -a
● ifconfig eth1 arp (to enable arp of the device)
● In fact, this sets the IFF_NOARP flag of net_device.
● There are cases where the interface has the IFF_NOARP flag set by
default (for example, a ppp interface; see
ppp_setup() in drivers/net/ppp_generic.c).
Changing IP address
● Suppose we try to set eth1 to an IP address of a different machine
on the LAN:
● First, we will set an IP address for eth1 (in Fedora Core 8, for example) in
● /etc/sysconfig/network-scripts/ifcfg-eth1
● ... IPADDR=192.168.0.122 ...
and then run:
● ifup eth1
● we will get:
• Error, some other host already uses address 192.168.0.122.
● But:
● ifconfig eth0 192.168.0.122
● works ok !
● Why is it so ?
Duplicate Address Detection (DAD)
● Duplicate Address Detection mode (DAD)
● arping -I eth0 -D 192.168.0.10
– sends a broadcast packet whose source address is 0.0.0.0.
0.0.0.0 is not a valid IP address (for example, you cannot set an ip
address to 0.0.0.0 with ifconfig)
● The mac address of the sender is the real one.
● -D flag is for Duplicate Address Detection mode.
Code (from arp_process(); see net/ipv4/arp.c):
/* Special case: IPv4 duplicate address detection packet (RFC2131) */
if (sip == 0) {
    if (arp->ar_op == htons(ARPOP_REQUEST) &&
        inet_addr_type(tip) == RTN_LOCAL &&
        !arp_ignore(in_dev, dev, sip, tip))
        arp_send(ARPOP_REPLY, ETH_P_ARP, tip, dev, tip, sha,
                 dev->dev_addr, dev->dev_addr);
    goto out;
}
Neighboring Subsystem – Garbage Collection
● Garbage Collection
– neigh_periodic_timer()
– neigh_timer_handler()
– neigh_periodic_timer() removes entries which are in NUD_FAILED
state. This is done by setting dead to 1, and calling neigh_release().
The refcnt must be 1 to ensure no one else uses this neighbour.
Expired entries are also removed.
● NUD_FAILED entries don't have a MAC address (see ip neigh show).
Neighboring Subsystem – Asynchronous Garbage Collection
● neigh_forced_gc() performs asynchronous Garbage Collection.
● It is called from neigh_alloc() when the number of the entries in the
arp table exceeds a (configurable) limit.
● This limit is configurable (gc_thresh2, gc_thresh3):
– /proc/sys/net/ipv4/neigh/default/gc_thresh2
– /proc/sys/net/ipv4/neigh/default/gc_thresh3
– The default for gc_thresh3 is 1024.
Candidates for cleanup: entries whose reference count is 1 and
whose state is NOT permanent.
● Changing the neighbour state is done only in neigh_timer_handler().
LVS (Linux Virtual Server)
● http://www.linuxvirtualserver.org/
● Integrated into the Linux kernel (in 2.4 kernel it was a patch).
● Located in: net/netfilter/ipvs in the kernel tree.
● LVS has eight scheduling algorithms.
● LVS/DR is LVS with direct routing (a load balancing solution).
● ipvsadm is the user space management tool (available in most
distros).
● Direct Routing is the packet forwarding method.
● -g, --gatewaying => Use gatewaying (direct routing)
● see man ipvsadm.
LVS/DR
● Example: 3 Real Servers and the Director all have the same VirtualIP
(VIP).
● There is an ARP problem in this configuration.
● When you send an ARP broadcast, and the receiving machine has
two or more NICs, each of them responds to this ARP request.
Example: a machine with two NICs ;
● eth0 is 192.168.0.151 and eth1 is 192.168.0.152.
LVS and ARP
● Solutions
1) Set ARP_IGNORE to 1:
• echo "1" > /proc/sys/net/ipv4/conf/eth0/arp_ignore
• echo "1" > /proc/sys/net/ipv4/conf/eth1/arp_ignore
2) Use arptables.
– There are 3 points in the arp walkthrough
(include/linux/netfilter_arp.h):
– NF_ARP_IN (in arp_rcv(), net/ipv4/arp.c).
– NF_ARP_OUT (in arp_xmit(), net/ipv4/arp.c).
– NF_ARP_FORWARD (in br_nf_forward_arp(), net/bridge/br_netfilter.c).
● http://ebtables.sourceforge.net/download.html
– Ebtables is in fact the parallel of netfilter but in L2.
LVS example (ipvsadm)
● An example for setting LVS/DR on TCP port 80 with three real
servers:
● ipvsadm -C // clear the LVS table
● ipvsadm -A -t DirectorIPAddress:80
● ipvsadm -a -t DirectorIPAddress:80 -r RealServer1 -g
● ipvsadm -a -t DirectorIPAddress:80 -r RealServer2 -g
● ipvsadm -a -t DirectorIPAddress:80 -r RealServer3 -g
● This example deals with tcp connections (for udp connections we
should use -u instead of -t in the last 3 lines).
LVS example:
● ipvsadm -Ln // list the LVS table
● /proc/sys/net/ipv4/ip_forward should be set to 1
● In this example, packets sent to the VIP will be sent to the load
balancer; it will delegate them to a real server according to its
scheduler. The destination MAC address in the L2 header will be the MAC
address of the real server to which the packet will be sent. The
destination address in the IP header will be the VIP.
● This is done with NF_IP_LOCAL_IN.
ARPD – arp user space daemon
● ARPD is a user space daemon; it can be used if we want to remove
some work from the kernel.
● The user space daemon is part of iproute2 (/misc/arpd.c)
● ARPD has support for negative entries and for dead hosts.
– The kernel arp code does NOT support these types of entries!
● The kernel by default is not compiled with ARPD support; we should
set CONFIG_ARPD for using it:
● Networking Support-> Networking Options-> IP: ARP daemon
support.
● see: /usr/share/doc/iproute2.6.22/arpd.ps (Alexey Kuznetsov).
87/178
● We should also set app_probes to a value greater than 0 by setting –
/proc/sys/net/ipv4/neigh/eth0/app_solicit – This can be done also by
the a (active_probes) parameter. – The value of this parameter tells
how many ARP requests to send before that neighbour is considered
dead.
● The k parameter tells the kernel not to send ARP broadcast; in such
case, the arpd daemon is not only listening to ARP requests, but also
send ARP broadcasts.
● Activation:
● arpd -a 1 -k eth0 &
● On some distros, you will get the error db_open: No such file or
directory unless you simply run mkdir /var/lib/arpd/ before (for the
arpd.db file).
● Pay attention: you can start the arpd daemon even when there is no
support for it in the kernel (CONFIG_ARPD is not set).
● In this case, ARP packets are still caught by the arpd daemon, in
get_arp_pkt() (misc/arpd.c).
● A changed MAC address can be the result of some security attack
(ARP cache poisoning).
● Arpwatch can help detect such an attack.
● Activation: arpwatch -d -i eth0 (output to stderr)
● Arpwatch keeps a table of ip/mac addresses and senses when there
is a change.
● -d is for redirecting the log to stderr (no syslog, no mail).
● In case someone changed MAC address on the same network, you
will get a message like this:
ARPwatch Example
From: root (Arpwatch)
To: root
Subject: changed ethernet address (jupiter)
hostname: jupiter
ip address: 192.168.0.54
ethernet address: aa:bb:cc:dd:ee:ff
ethernet vendor: <unknown>
old ethernet address: 0:20:18:61:e5:e0
old ethernet vendor: ...
Neighbour states
(Figure: neighbour state diagram. Labels: neigh_alloc(), None,
Incomplete, Reachable, Stale, Delay, Probe.)
Neighboring Subsystem
● States:
– NUD_NONE
– NUD_REACHABLE
– NUD_STALE
– NUD_DELAY
– NUD_PROBE
– NUD_FAILED
– NUD_INCOMPLETE
● Special states:
● NUD_NOARP
● NUD_PERMANENT
● No state transitions are allowed from these states to another state.
Neighboring Subsystem – states
● NUD state combinations:
● NUD_IN_TIMER (NUD_INCOMPLETE|NUD_REACHABLE| NUD_DELAY|
NUD_PROBE)
● NUD_VALID (NUD_PERMANENT|NUD_NOARP| NUD_REACHABLE|
NUD_PROBE|NUD_STALE|NUD_DELAY)
● NUD_CONNECTED (NUD_PERMANENT|NUD_NOARP|
NUD_REACHABLE)
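The combinations above are plain bit masks, so testing a neighbour's state against them is a single AND. The following is a minimal user-space sketch, not kernel code: the bit values are copied from include/uapi/linux/neighbour.h, and the three helper functions (nud_is_valid(), nud_is_connected(), nud_has_timer()) are hypothetical names introduced only for this illustration.

```c
#include <stdbool.h>

/* Single-state bits, with the values used in include/uapi/linux/neighbour.h */
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_FAILED     0x20
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80
#define NUD_NONE       0x00

/* State combinations, as listed in the text above */
#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_REACHABLE | NUD_DELAY | NUD_PROBE)
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical helpers (illustration only, not kernel functions) */
bool nud_is_valid(unsigned char state)
{
        return (state & NUD_VALID) != 0;
}

bool nud_is_connected(unsigned char state)
{
        return (state & NUD_CONNECTED) != 0;
}

bool nud_has_timer(unsigned char state)
{
        return (state & NUD_IN_TIMER) != 0;
}
```

With these masks, a STALE neighbour is valid but neither connected nor in a timer state, while an INCOMPLETE neighbour is in a timer state but not valid.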
● When a neighbour is in the STALE state, it remains in this state until
one of these two occurs:
– A packet is sent to this neighbour.
– Its state changes to FAILED.
● See neigh_resolve_output() and neigh_connected_output() in
net/core/neighbour.c.
● A neighbour in the INCOMPLETE state does not have a MAC address set
yet (the ha member of struct neighbour).
● So when neigh_resolve_output() is called, the neighbour state is
changed to INCOMPLETE.
● When neigh_connected_output() is called, the MAC address of the
neighbour is known; so we end up with calling dev_queue_xmit(),
which calls the ndo_start_xmit() callback method of the NIC device
driver.
● The ndo_start_xmit() method actually puts the frame on the wire.
Change of IP address/Mac address
● Change of IP address does not trigger notifying its neighbours.
● Change of MAC address (NETDEV_CHANGEADDR) also does not
trigger notifying its neighbours.
● It does update the local arp table by neigh_changeaddr().
– An exception to this is irlan eth: irlan_eth_send_gratuitous_arp()
(net/irda/irlan/irlan_eth.c).
– Some nics don't permit changing of the MAC address; you get:
SIOCSIFHWADDR: Device or resource busy.
Flushing the arp table
● Flushing the arp table:
● ip -statistics neigh flush dev eth0
*** Round 1, deleting 7 entries ***
*** Flush is complete after 1 round ***
● Specifying -statistics twice will also show which entries were deleted,
their mac addresses, etc.
● ip -statistics -statistics neigh flush dev eth0
192.168.0.254 lladdr 00:04:27:fd:ad:30 ref 17 used 0/0/0
REACHABLE
*** Round 1, deleting 1 entries ***
*** Flush is complete after 1 round ***
● calls neigh_delete() in net/core/neighbour.c
● Changes the state to NUD_FAILED
br_dev_setup() in net/bridge/br_device.c
...
dev->tx_queue_len = 0;
...
and
vlan_setup() in
net/8021q/vlan_dev.c
...
dev->tx_queue_len = 0;
...
and
bond_setup()
in drivers/net/bonding/bond_main.c:
...
bond_dev->tx_queue_len = 0;
...
and macvlan_setup()
in drivers/net/macvlan.c:
...
dev->tx_queue_len = 0;
...
and
vxlan_setup() in drivers/net/vxlan.c
...
dev->tx_queue_len = 0;
...
With pimreg (multicast) device, tx_queue_len is not initialized at all;
so when running ifconfig on pimreg device, you get:
txqueuelen 0 (UNSPEC)
Notice that for virtual devices, like loopback and vlan, the qdisc is
the noqueue qdisc.
So for example, when running "ip addr show" you will see for the
loopback device:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state
UNKNOWN
and for a vlan device (eth0.6 in this example):
eth0.6@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
qdisc noqueue state
whereas for other, non-virtual devices, you will have the pfifo_fast qdisc.
Some more implementation details about achieving it:
in attach_one_default_qdisc() (net/sched/sch_generic.c) we have this
code:
static void attach_one_default_qdisc(struct net_device *dev,
                                     struct netdev_queue *dev_queue,
                                     void *_unused)
{
        struct Qdisc *qdisc = &noqueue_qdisc;

        if (dev->tx_queue_len) {
                qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops, TC_H_ROOT);
                if (!qdisc) {
                        netdev_info(dev, "activation failed\n");
                        return;
                }
        }
        dev_queue->qdisc_sleeping = qdisc;
}
So when dev->tx_queue_len is 0, as in the case with virtual devices,
we use the noqueue_qdisc and do not call qdisc_create_dflt().
Another feature of virtual devices is that they appear
under /sys/devices/virtual/net. So for example, after boot, we
have a /sys/devices/virtual/net/lo/ entry for the loopback device. The
entries under /sys/devices/virtual/net are not created because
tx_queue_len of virtual devices is 0, nor because of the noqueue_qdisc
of virtual devices. They are created because with virtual devices we do
not call the SET_NETDEV_DEV() macro. If you look at this simple
macro, which should always be called before register_netdev(), you'll
see that all it does is assign the parent member in net_device. What
does this have to do with the virtual entry under sysfs? The answer is
that devices which have no parent are considered "virtual" class
devices. If you look at the implementation details, you see
that register_netdev() calls device_add() in netdev_register_kobject(),
and device_add() (in drivers/base/core.c) creates an entry under
/sys/devices/virtual/net for a device whose parent is NULL
(see the get_device_parent() method, which is invoked from device_add()).
So, for example, in the case of creating a tun device (which is a
virtual device) by:
ip tuntap add tun0 mode tun
You will have an entry under:
/sys/devices/virtual/net/tun0/
And when creating a tap device (which is also a virtual device) by:
ip tuntap add tap0 mode tap
You will have an entry under:
/sys/devices/virtual/net/tap0/
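The sysfs layout described above gives user space a cheap way to ask whether a network device is virtual: check whether a directory with its name exists under /sys/devices/virtual/net. The following is a minimal sketch under that assumption; is_virtual_netdev() is a hypothetical helper name, not an existing API.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if /sys/devices/virtual/net/<ifname>
 * exists and is a directory, 0 otherwise (including invalid names).
 */
int is_virtual_netdev(const char *ifname)
{
        char path[256];
        struct stat st;
        int n;

        /* Reject names which would not address a single sysfs entry */
        if (ifname[0] == '\0' || strchr(ifname, '/') != NULL ||
            strcmp(ifname, ".") == 0 || strcmp(ifname, "..") == 0)
                return 0;

        n = snprintf(path, sizeof(path), "/sys/devices/virtual/net/%s", ifname);
        if (n < 0 || (size_t)n >= sizeof(path))
                return 0;

        return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

On a typical system, is_virtual_netdev("lo") is expected to return 1, and the same holds for tun0 or tap0 after they are created as shown above.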
Tunnels
What is the difference between an ipip tunnel and a gre tunnel?
A gre tunnel supports multicasting, whereas an ipip tunnel supports
only unicast.
MTU
MTU stands for Maximum Transmission Unit (or sometimes also Maximum
Transfer Unit).
MTU is symmetrical and applies both to receive and transmit.
Layer 3 should not pass an skb which has a payload bigger than the
MTU.
GSO and TSO are exceptions; in such cases, the device will separate
the packet into smaller packets, which are smaller than the MTU.
Multicasting
struct net_device holds two lists of addresses (instances of
struct netdev_hw_addr_list ):
• uc is the unicast mac addresses list
• mc is the multicast mac addresses list
You add multicast addresses to the multicast mac addresses list (mc)
both in IPv4 and IPv6 by:
dev_mc_add() (in net/core/dev_addr_lists.c).
In ipv4, a device adds the 224.0.0.1 multicast address
(IGMP_ALL_HOSTS , see include/linux/igmp.h),
in ip_mc_up() (see net/ipv4/igmp.c).
GSO
For implementing GSO, a method called gso_segment was added
to the net_protocol struct in ipv4 (see include/net/protocol.h).
For tcp, this method is tcp_tso_segment() (see tcp_protocol
in net/ipv4/af_inet.c).
There are drivers which implement TSO; for example, e1000e of Intel.
IP address
In IPv4, when you set an IP address, you in fact assign it to
ifa->ifa_local.
(ifa is a pointer to struct in_ifaddr)
When running "ifconfig" or "ip addr show", you in fact issue
an SIOCGIFADDR ioctl,
for getting interface address, which is handled by
struct in_device from include/linux/inetdevice.h has a list : ifa_list,
which is the IP ifaddr chain.
ifa_local is a member of struct in_ifaddr which represents ipv4
address.
IPV6
In IPV6, the neighboring subsystem uses ICMPV6 for Neighboring
messages (instead of ARP messages in IPV4).
● There are 5 ICMPv6 message types used for neighbour discovery:
NEIGHBOUR SOLICITATION (135) parallel to ARP request in IPV4
NEIGHBOUR ADVERTISEMENT (136) parallel to ARP reply in IPV4
ROUTER SOLICITATION (133)
ROUTER ADVERTISEMENT (134)
REDIRECT (137)
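The five type values listed above can be turned into a small lookup, which is handy when decoding a capture by hand. This is only an illustrative sketch; nd_type_name() is a hypothetical helper, and the names returned are the ones used in this document.

```c
#include <stddef.h>

/*
 * Hypothetical helper: maps an ICMPv6 type to the name of the
 * neighbour discovery message, or NULL if the type is not one of
 * the five neighbour discovery types (133..137).
 */
const char *nd_type_name(unsigned int icmp6_type)
{
        switch (icmp6_type) {
        case 133: return "Router Solicitation";
        case 134: return "Router Advertisement";
        case 135: return "Neighbour Solicitation";  /* parallel to ARP request */
        case 136: return "Neighbour Advertisement"; /* parallel to ARP reply */
        case 137: return "Redirect";
        default:  return NULL;
        }
}
```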
Special Addresses:
– All Nodes (or: All Hosts) address: FF02::1
– ipv6_addr_all_nodes() sets address to FF02::1
– All Routers address: FF02::2
– ipv6_addr_all_routers() sets address to FF02::2
Both in include/net/addrconf.h
• In IPV6: all addresses starting with FF are multicast addresses.
● IPV4: Addresses in the range 224.0.0.0 – 239.255.255.255
are multicast addresses (class D).
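The two multicast rules above translate directly into code: an IPv4 address is multicast when its top four bits are 1110, and an IPv6 address is multicast when its first byte is 0xFF. The following user-space sketch applies both rules to a textual address; is_multicast_addr() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: returns 1 if text is a multicast address
 * (IPv4 or IPv6), 0 if it is a valid non-multicast address, and -1
 * if it cannot be parsed as an address at all.
 */
int is_multicast_addr(const char *text)
{
        struct in_addr a4;
        struct in6_addr a6;

        if (inet_pton(AF_INET, text, &a4) == 1)
                /* class D: 224.0.0.0 - 239.255.255.255 */
                return (ntohl(a4.s_addr) & 0xf0000000u) == 0xe0000000u;

        if (inet_pton(AF_INET6, text, &a6) == 1)
                /* all IPv6 addresses starting with FF */
                return a6.s6_addr[0] == 0xff;

        return -1;
}
```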
Privacy Extensions
● Since the address is built using a prefix and MAC address,
the identity of the machine can be found.
● To avoid this, you can use Privacy Extensions.
– This adds randomness to the IPV6 address creation process.
(calling get_random_bytes() for example).
● RFC 3041 Privacy Extensions for Stateless Address
Autoconfiguration in IPv6.
● You need CONFIG_IPV6_PRIVACY to be set when building the kernel.
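To see why the MAC address is visible in an autoconfigured address, it helps to build one by hand. The sketch below assumes the classic modified EUI-64 scheme for Ethernet (flip the universal/local bit of the first MAC byte and insert FF:FE in the middle) applied to the FE80::/64 link-local prefix; mac_to_link_local() is a hypothetical helper, not a kernel function.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fills addr with the link-local address
 * FE80::/64 + modified EUI-64 interface ID derived from a 48-bit MAC.
 */
void mac_to_link_local(const uint8_t mac[6], struct in6_addr *addr)
{
        memset(addr, 0, sizeof(*addr));

        /* Link-local prefix FE80::/64 */
        addr->s6_addr[0] = 0xfe;
        addr->s6_addr[1] = 0x80;

        /* Interface ID: MAC split in two halves with FF:FE in between */
        addr->s6_addr[8]  = mac[0] ^ 0x02; /* flip the universal/local bit */
        addr->s6_addr[9]  = mac[1];
        addr->s6_addr[10] = mac[2];
        addr->s6_addr[11] = 0xff;
        addr->s6_addr[12] = 0xfe;
        addr->s6_addr[13] = mac[3];
        addr->s6_addr[14] = mac[4];
        addr->s6_addr[15] = mac[5];
}
```

For the MAC address 00:04:27:fd:ad:30, this yields fe80::204:27ff:fefd:ad30, so anyone who sees the address can read the MAC back out of it. Privacy Extensions replace this predictable interface ID with a random one.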
Hosts can disable receiving Router Advertisements by
setting
Autoconfiguration
● When a host boots, (and its cable is connected) it first
creates a Link Local Address.
– A Link Local address starts with FE80.
– This address is tentative (only works with ND messages).
● The host sends a Neighbour Solicitation message.
– The target is its tentative address, the source is all zeros.
– This is DAD (Duplicate Address Detection).
● If there is no answer in due time, the state is changed to
permanent. (IFA_F_PERMANENT)
● Then the host sends a Router Solicitation.
– The target address of the Router Solicitation
message is the All Routers multicast address
FF02::2
– All the routers reply with a Router Advertisement message.
– The host sets address/addresses according to the prefix/prefixes
received and starts the DAD process as before.
● At the end of the process, the host will have two (or more) IPv6
addresses:
– Link Local IPV6 address.
– The IPV6 address/addresses which was built using the prefix (in case
that there is one or more routers sending RAs).
● There are three trials by default for sending Router
Solicitation.
– It can be configured by:
● /proc/sys/net/ipv6/conf/eth0/router_solicitations
VLAN (802.1Q)
VLAN (Virtual LAN) enables us to partition a physical network. Thus,
different broadcast domains are created. This is achieved by inserting
VLAN tag into the packet.
The VLAN tag is 4 bytes: 2 bytes are the Tag Protocol Identifier (TPID),
which has a value of 0x8100; 2 bytes are the Tag Control Identifier
(TCI). (In linux documentation, TCI is termed "tag control information",
see vlan_tci in the sk_buff struct, include/linux/skbuff.h)
The VLAN tag is inserted between the source mac address and
ethertype of the eth header. The vlan_insert_tag() method implements
this tag insertion (include/linux/if_vlan.h).
struct vlan_ethhdr represents the vlan ethernet header (ethhdr +
vlan_hdr).
h_vlan_proto in this struct always gets the value 0x8100.
h_vlan_TCI in this struct is the TCI, composed of the priority and the VLAN
ID.
vlan_insert_tag() is invoked from the vlan rx
handler, vlan_do_receive().
(see include/linux/if_vlan.h).
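The tag layout described above can be illustrated on a flat byte buffer. The following is a user-space sketch, not the kernel's vlan_insert_tag() (which works on an skb and its headroom); vlan_make_tci() and vlan_tag_frame() are hypothetical helper names. It assumes the usual TCI layout of 3 priority bits, 1 DEI/CFI bit and 12 VLAN ID bits.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAC_LEN   6      /* length of one MAC address */
#define DEMO_VLAN_HLEN 4      /* TPID (2 bytes) + TCI (2 bytes) */
#define DEMO_TPID      0x8100 /* same value as ETH_P_8021Q */

/* Hypothetical helper: builds a TCI from priority (0-7) and VLAN ID (12 bits) */
uint16_t vlan_make_tci(unsigned int prio, unsigned int vid)
{
        return (uint16_t)(((prio & 0x7) << 13) | (vid & 0x0fff));
}

/*
 * Hypothetical helper: inserts the 4-byte VLAN tag between the source
 * MAC address and the ethertype. frame holds len bytes and has room
 * for cap bytes. Returns the new frame length, or 0 on error.
 */
size_t vlan_tag_frame(uint8_t *frame, size_t len, size_t cap, uint16_t tci)
{
        const size_t off = 2 * DEMO_MAC_LEN; /* dst MAC + src MAC */
        uint16_t tpid_be = htons(DEMO_TPID);
        uint16_t tci_be = htons(tci);

        if (len < off + 2 || len + DEMO_VLAN_HLEN > cap)
                return 0;

        /* Shift ethertype and payload by 4 bytes to make room for the tag */
        memmove(frame + off + DEMO_VLAN_HLEN, frame + off, len - off);
        memcpy(frame + off, &tpid_be, sizeof(tpid_be));
        memcpy(frame + off + 2, &tci_be, sizeof(tci_be));
        return len + DEMO_VLAN_HLEN;
}
```

For priority 5 and VLAN ID 100, the TCI is 0xA064, so bytes 12 to 15 of the tagged frame are 81 00 A0 64, followed by the original ethertype.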
VLAN support in linux is under net/8021q.
There is also the macvlan driver (drivers/net/macvlan.c).
The header file for vlan is include/linux/if_vlan.h
The header file for macvlan is include/linux/if_macvlan.h
The maintainer of vlan is Patrick McHardy.
VLAN supports almost everything a regular ethernet interface does,
including
firewalling, bridging, and of course IP traffic.
You will need the 'vconfig' tool from the VLAN project in order to
effectively use VLANs.
In fedora, there is a package ("rpm") called vconfig; you install it by
"yum install vconfig".
In Ubuntu, vconfig belongs to a package named "vlan"; you install it
by "apt-get install vlan"
You can also set vlan/macvlan with "vconfig" utility thus:
vconfig add p2p1 vlanID
Notice that you can add up to 4094 VLANs per ethernet interface.
In case you try to add more than 4094, you will get this error:
ERROR: trying to add VLAN #vlanID to IF -:p2p1:- error: Numerical
result out of range
According to http://en.wikipedia.org/wiki/IEEE_802.1Q:
"The hexadecimal values of 0x000 and 0xFFF are reserved.".
You can get some info about vlan devices in procfs under:
• /proc/net/vlan
• /proc/net/vlan/config (this includes info about vlan id).
See More info in VLAN web page:
http://www.candelatech.com/~greear/vlan.html
VLAN traffic has 0x8100 type (ETH_P_8021Q).
VLAN interface is a virtual device (you set the netdevice tx_queue_len
to be 0)
In case VLAN is compiled as a kernel module, its name is 8021q.ko.
Adding/Deleting vlans is done via ioctls which are sent from user
space;
for example, adding vlan is triggered by
receiving ADD_VLAN_CMD ioctl from user space. This triggers
the register_vlan_device() method. As said above, you cannot add
more than 4094 vlans to a single ethernet device. In the beginning
of register_vlan_device() we have:
if (vlan_id >= VLAN_VID_MASK)
        return -ERANGE;
And VLAN_VID_MASK is 0x0fff (4095).
When returning -ERANGE, we get the error mentioned above:
error: Numerical result out of range
Deleting vlan is done by receiving DEL_VLAN_CMD ioctl from user
space. This triggers the unregister_vlan_dev() method.
These ioctls are defined in include/uapi/linux/if_vlan.h
(Once they were defined in include/linux/if_vlan.h).
The handler for these ioctls is vlan_ioctl_handler() in net/8021q/vlan.c
By default, ethernet header reorders are turned off. (The
VLAN_FLAG_REORDER_HDR
flag is not set). When ethernet header reorders are set, dumping the
device will appear as a common ethernet device without vlans.
VLAN private device data is represented by
struct vlan_dev_priv (net/8021q/vlan.h)
It has two arrays in it: egress_priority_map and ingress_priority_map.
We add entries to egress_priority_map array
by vlan_dev_set_egress_priority().
This is triggered by sending SET_VLAN_EGRESS_PRIORITY_CMD ioctl
from user space
(vconfig set_egress_map)
We add entries to ingress_priority_map array
by vlan_dev_set_ingress_priority().
This is triggered by sending SET_VLAN_INGRESS_PRIORITY_CMD ioctl
from user space
(vconfig set_ingress_map )
You can enable vlan reordering with vconfig thus:
vconfig set_flag eth0.100 1 1
Note that there are chances that the man page/help of some
distros is not accurate about this.
It says
set_flag [vlan-device] 0 | 1
And it should be:
set_flag [vlan-device] [flag-num] 0 | 1
There are some adapters which support VLAN hardware acceleration
offloading. You can get info about VLAN hardware acceleration
offloading with ethtool:
ethtool -k p2p1
...
rx-vlan-offload: on
tx-vlan-offload: on
...
#> vconfig add bond0 100
ERROR: trying to add VLAN #100 to IF -:bond0:- error: Operation not
supported.
How is this implemented ?
An empty bonding device has NETIF_F_VLAN_CHALLENGED set.
In vlan_check_real_dev(), which is invoked
from register_vlan_device() when
configuring VLAN over a device, we check the
NETIF_F_VLAN_CHALLENGED flag
of the device on which we are setting the VLAN.
If this flag is set, we return -EOPNOTSUPP:
int vlan_check_real_dev(struct net_device *real_dev, u16 vlan_id)
{
        ...
        if (real_dev->features & NETIF_F_VLAN_CHALLENGED) {
                pr_info("VLANs not supported on %s\n", name);
                return -EOPNOTSUPP;
        }
        ...
}
The Maintainers of the bonding driver are Jay Vosburgh and Andy
Gospodarek.
In the kernel, the bonding code is in drivers/net/bonding.
(these terms can be considered as synonyms).
Team has also a user-space util, libteam.
The team driver registers an RX handler
by netdev_rx_handler_register().
The handler is team_handle_frame().
This is common in virtual drivers; the bonding driver also registers an
RX handler, named bond_handle_frame(), and the bridge driver registers a
handler named br_handle_frame(). These handlers are invoked
in __netif_receive_skb() (net/core/dev.c)
a device, in case we
did not define dellink in the rtnl_link_ops, then we assign
the generic unregister_netdevice_queue() method to the dellink
callback of rtnl_link_ops. And when running "ip link del team0", we
arrive at rtnl_dellink() , which
eventually calls unregister_netdevice_queue() and unregisters the
net_device.
see, in net/core/rtnetlink.c
int __rtnl_link_register(struct rtnl_link_ops *ops)
{
        if (!ops->dellink)
                ops->dellink = unregister_netdevice_queue;
        ...
        return 0;
}
Adding p2p1:
ip link set p2p1 master team0
ip link set p2p1 master team0 triggers a call to team_port_add()
Removing p2p1:
ip link set p2p1 nomaster
Notice that p2p1 must be down for this operation to succeed; in case
it is up, you
will get "RTNETLINK answers: Device or resource busy" error.
Trying to add a loopback device to a team device will fail.
For example,
ip link set lo master team0
emits this error in the kernel log:
team0: Device lo is loopback device. Loopback devices can't be
added as a team port
There are four modules (or "modes", which is the word the team
code uses)
in the team driver:
team_mode_broadcast.c
The broadcast mode is a basic mode in which all packets are
sent via all available ports.
team_mode_roundrobin.c
The roundrobin mode is a basic mode with very simple transmit
port-selecting algorithm based on looping around the port list.
This is the only mode able to run on its own without userspace
interactions.
team_mode_activebackup.c
The activebackup mode, in which only one port is active at a time and
able to perform transmit and receive of skb.
The rest of the ports are backup ports.
This mode exposes the activeport option, through which a
userspace application can specify the active port.
team_mode_loadbalance.c
The loadbalance mode is a more complex mode used for
example for LACP (Link Aggregation Control Protocol) and
userspace controlled transmit and
receive load balancing.
LACP protocol is part of the 802.3ad standard
and is very common for smart switches.
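The roundrobin port selection described above ("looping around the port list") can be sketched in a few lines. This is an illustration of the idea only, not the code of team_mode_roundrobin.c; struct rr_state and rr_select_port() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical per-team state: a running count of transmitted packets */
struct rr_state {
        unsigned int sent_packets;
};

/*
 * Hypothetical helper: returns the index of the port which should
 * transmit the next packet, or -1 when there are no ports at all.
 * Each call advances the counter, so consecutive packets walk the
 * port list in a loop.
 */
int rr_select_port(struct rr_state *st, size_t port_count)
{
        if (port_count == 0)
                return -1;
        return (int)(st->sent_packets++ % port_count);
}
```

With three ports, successive calls return 0, 1, 2, 0, and so on. No decision from user space is needed, which fits the remark above that this mode can run on its own.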
The teaming network driver uses the Generic Netlink API; it
calls genl_register_family()
and genl_register_mc_group() and other methods of the Generic
Netlink API.
In fedora 16/17 there is an rpm for the user-space util (libteam).
Team Infrastructure Specification:
https://fedorahosted.org/libteam/wiki/InfrastructureSpecification
see: https://fedorahosted.org/libteam/
https://github.com/jpirko/libteam
PPP
The most commonly used user space daemon for ppp is pppd.
It can be downloaded from here:
ftp://ftp.samba.org/pub/ppp/
pppd website is:
http://ppp.samba.org/
In case you need to use pppoe in conjunction with ppp, you should
install rp-pppoe:
http://www.roaringpenguin.com/products/pppoe
ppp settings are configurable via /etc/ppp.
The generic ppp layer is implemented
in ppp_generic.c (drivers/net/ppp/ppp_generic.c).
PPPoE and PPPoL2TP use the generic ppp layer.
You register a ppp generic channel by calling
the ppp_register_net_channel() method of ppp_generic.
This is done in pppoe_connect() (drivers/net/ppp/pppoe.c)
and in pppol2tp_connect() (net/l2tp/l2tp_ppp.c).
These two modules also call ppp_input() for handling
receiving of PPP packets over the ppp channel.
Unregistering is done by the ppp_unregister_channel() method of
ppp_generic.
pppox_unbind_sock() calls ppp_unregister_channel()
(drivers/net/ppp/pppox.c).
For pppoe, pppox_unbind_sock() is invoked when a PPPoE socket is
closed. (pppoe_release() in
http://lxr.free-electrons.com/source/drivers/net/ppp/pppoe.c).
For l2tp_ppp, pppox_unbind_sock() is invoked
by pppol2tp_session_close() and pppol2tp_release().
PPPoE
PPPoE stands for Point-to-Point Protocol over Ethernet.
It is defined in RFC 2516:
http://www.ietf.org/rfc/rfc2516.txt
PPPoE is implemented in pppoe.c. (drivers/net/ppp/pppoe.c)
For establishing PPPoE connection, there are two stages: the
Discovery stage and the Session stage.
The Discovery stage consists of four steps between the client
computer
and the PPPoE server (access concentrator) at the ISP.
1) PADI (Initiation)
2) PADO (Offer)
3) PADR (Request)
4) PADS (Session confirmation).
The Discovery stage is managed by the pppd daemon.
You end a session by sending a PADT packet (termination packet).
The Discovery stage packets have an ethertype of 0x8863
(ETH_P_PPP_DISC, defined in include/uapi/linux/if_ether.h).
The Session stage packets have an ethertype of 0x8864
(ETH_P_PPP_SES, also
defined in include/uapi/linux/if_ether.h).
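The two ethertypes, together with the code byte of the PPPoE header, are enough to tell the packets of the two stages apart. The following sketch assumes the code values from RFC 2516 (PADI 0x09, PADO 0x07, PADR 0x19, PADS 0x65, PADT 0xa7, and 0x00 for session data); pppoe_packet_name() is a hypothetical helper and the DEMO_ macros only mirror the kernel constants named above.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_PPP_DISC 0x8863 /* same value as ETH_P_PPP_DISC */
#define DEMO_ETH_P_PPP_SES  0x8864 /* same value as ETH_P_PPP_SES */

/*
 * Hypothetical helper: names a PPPoE packet from its ethertype and
 * the code field of the PPPoE header. Returns NULL for anything that
 * is not a known PPPoE packet.
 */
const char *pppoe_packet_name(uint16_t ethertype, uint8_t code)
{
        if (ethertype == DEMO_ETH_P_PPP_SES)
                return code == 0x00 ? "session data" : NULL;
        if (ethertype != DEMO_ETH_P_PPP_DISC)
                return NULL;

        switch (code) {
        case 0x09: return "PADI"; /* Initiation */
        case 0x07: return "PADO"; /* Offer */
        case 0x19: return "PADR"; /* Request */
        case 0x65: return "PADS"; /* Session confirmation */
        case 0xa7: return "PADT"; /* Termination */
        default:   return NULL;
        }
}
```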
SKB RECYCLE
skb_recycle was a Linux kernel network stack feature which was
removed.
When we no longer need an skb, we free its memory by calling
(for example)
__kfree_skb(). The skb_recycle patch is based mainly on adding code
in __kfree_skb(),
so that this skb will not be freed. Instead, we initialize the members
of the skb so that the result is like a new skb which was just created.
TUN/TAP
TUN/TAP provides packet reception and transmission for user space
programs.
It can be seen as a simple Point-to-Point or Ethernet device, which,
instead
of receiving packets from physical media, receives them from user
space
program and instead of sending packets via physical media writes
them
to the user space program.
TUN/TAP is a driver which enables us to receive packets from user
space
and send packets to user space. TUN/TAP is different from other virtual
devices in that it does not rely on real devices for its work; it is a
purely software driver which works with user space sockets.
The implementation is in drivers/net/tun.c.
The tun driver has two net_device_ops instances:
• tap_netdev_ops for tap devices.
• tun_netdev_ops for tun devices .
The tun device is /dev/net/tun; it is a character device, created
with misc_register().
To insert tuntap module you should run: modprobe tun.
With recent iproute2, you can create tun/tap devices with ip tuntap
command.
see: ip tuntap help
For example:
ip tuntap add tap0 mode tap
or
ip tuntap add tun0 mode tun
Notice that if you try to delete a nonexistent tun or tap device, you
will not get an error message or any warning.
tun devices do not have mac addresses, but tap devices have an hw
address which was created by calling eth_hw_addr_random()
Trying to set a mac address on a tun device (tun2 in this example) will
give an error; for example,
ifconfig tun2 hw ether 00:01:02:03:04:05
SIOCSIFHWADDR: Operation not supported
On a tap device, changing the mac address in this way is possible.
With tun device, the tun_net_open() and tun_net_close() methods are
called when you run "ifconfig tun0 up" and "ifconfig tun0 down",
respectively.
The same is true also with tap device; tun_net_open() is invoked when
calling "ifconfig tap0 up" and tun_net_close() is invoked when
calling "ifconfig tap0 down"
calling fd = open("/dev/net/tun")
triggers tun_chr_open()
calling close(fd) triggers tun_chr_close()
Following is a simple user space app which creates a tun device.
Notice that calling TUNSETPERSIST is mandatory in this program. If
we do not call it, then when exiting the program the
fd (of "/dev/net/tun") will be closed and tun_chr_close() will be
invoked, as described above. In case TUNSETPERSIST is not
set, unregister_netdevice(dev) will be called (by __tun_detach()).
In case we set the IFF_NO_PI flag (TUN_NO_PI inside the driver; not set
in the example below), this
means that packet information (PI) will not be provided. Packet
Information is 4 bytes which are added
when the flag is not set.
These 4 bytes are 2 bytes of flags, and 2 bytes of protocol.
Wireshark sniffer does not show these 4 bytes.
see: include/uapi/linux/if_tun.h:
struct tun_pi {
__u16 flags;
__be16 proto;
};
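When the packet information header is enabled, every buffer read from the tun fd starts with these 4 bytes, and a reader has to skip them before reaching the packet itself. The following is a minimal parsing sketch; struct pi_info and tun_parse_pi() are hypothetical names, and the sketch assumes the field types shown in struct tun_pi above (flags in host byte order, proto in network byte order).

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of struct tun_pi, both fields in host order */
struct pi_info {
        uint16_t flags;
        uint16_t proto; /* an ethertype, e.g. 0x0800 for IPv4 */
};

/*
 * Hypothetical helper: decodes the 4-byte packet information header
 * at the start of buf. Returns a pointer to the packet which follows
 * it, or NULL if buf is too short to hold the header.
 */
const uint8_t *tun_parse_pi(const uint8_t *buf, size_t len, struct pi_info *out)
{
        uint16_t flags, proto;

        if (len < 4)
                return NULL;

        memcpy(&flags, buf, sizeof(flags));
        memcpy(&proto, buf + 2, sizeof(proto));
        out->flags = flags;        /* __u16: already in host byte order */
        out->proto = ntohs(proto); /* __be16: convert from network order */
        return buf + 4;
}
```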
// tun.c
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <linux/socket.h>
#include <stdio.h>
int main()
{
        int fd, err;
        struct ifreq ifr;

        fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
                printf("fd < 0 in open\n");
                return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        /* create the tun1 device */
        err = ioctl(fd, TUNSETIFF, (void *)&ifr);
        if (err < 0) {
                printf("err<0 after TUNSETIFF, ioctl\n");
                close(fd);
                return -1;
        }
        /* keep the device after the fd is closed */
        err = ioctl(fd, TUNSETPERSIST, 1);
        if (err < 0) {
                printf("err<0 after TUNSETPERSIST, ioctl\n");
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}
The only method that the tun driver exports is tun_get_socket(), and it
is
used in the vhost driver (drivers/vhost/net.c).
tunctl is an older tool for creating tun/tap devices
http://tunctl.sourceforge.net/
You can also use a util from openvpn to create a tun/tap device:
openvpn --mktun --dev tun2
openvpn --rmtun --dev tun2
By default, when the device name starts with "tun", "openvpn
--mktun" creates a TUN device. When the device name starts with
"tap", "openvpn --mktun" creates a TAP device. However, if you need
for some reason to create a tap device which its name starts with tun,
you still can do it thus:
openvpn --mktun --dev tun11 --dev-type tap
Notice that also here when you try to delete a nonexisting tun/tap
device, you don't get any warning.
TUN/TAP devices are widely used, in virtualization and in other fields.
For example, with virt-manager, libvirt and KVM, when we start a
guest, a TAP
device named vnet0 is created on the host. It is added to a bridge
interface
on the host, called virbr0, with 192.168.122.1 ip address. In the
guest, you can add the host bridge interface as a default gateway in
order to be connected to the outside WAN.
For implementation details of creating the tap device in libvirt, look in
virNetDevTapCreate() method in src/util/virnetdevtap.c of the libvirt
package.
The maintainer is Maxim Krasnyansky.
web site: http://vtun.sourceforge.net/tun.
An interesting patch series, adding multiqueue support for tuntap,
was sent by Jason Wang in October 2012:
http://www.spinics.net/lists/netdev/msg214869.html
Also an ioctl called TUNSETQUEUE was added; this ioctl,
with the IFF_ATTACH_QUEUE/IFF_DETACH_QUEUE flags, enables
attaching/detaching a queue from user space.
see:
http://www.spinics.net/lists/kernel/msg1429560.html
Following is an example of using tun multiqueues. Please notice that
we set IFF_MULTI_QUEUE when calling TUNSETIFF; later on, we
call TUNSETPERSIST on the same fd , and then open a new fd and
call TUNSETQUEUE with IFF_ATTACH_QUEUE flag set, and a third fd, on
which we again call TUNSETQUEUE with IFF_ATTACH_QUEUE flag set.
The reason for the pause() at the end is that without it all the file
descriptors will be closed. Closing the fd invokes tun_chr_close(),
which subsequently calls tun_detach(), removes the sys queue entries
and unregisters the device.
Calling TUNSETQUEUE twice, as in this example, results in having
3 queues in the end; we can view these queues also under the sys queue
entry:
ls /sys/class/net/tun1/queues/
rx-0 rx-1 rx-2 tx-0 tx-1 tx-2
// tuntap/tunMultiQueue.c
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <linux/socket.h>
#include <stdio.h>
int main()
{
        int fd, fd1, fd2, err;
        struct ifreq ifr;

        fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
                printf("fd < 0 in open\n");
                return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_MULTI_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        /* create tun1 as a multiqueue device; fd is its first queue */
        err = ioctl(fd, TUNSETIFF, (void *)&ifr);
        if (err < 0) {
                printf("err<0 after TUNSETIFF, ioctl\n");
                close(fd);
                return -1;
        }
        err = ioctl(fd, TUNSETPERSIST, 1);
        if (err < 0) {
                perror("TUNSETPERSIST");
                close(fd);
                return -1;
        }
        /* second fd: attach a second queue */
        fd1 = open("/dev/net/tun", O_RDWR);
        if (fd1 < 0) {
                printf("fd1 < 0 in open\n");
                return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_ATTACH_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd1, TUNSETQUEUE, (void *)&ifr);
        if (err < 0) {
                perror("TUNSETQUEUE (fd1)");
                close(fd1);
                return -1;
        }
        /* third fd: attach a third queue */
        fd2 = open("/dev/net/tun", O_RDWR);
        if (fd2 < 0) {
                printf("fd2 < 0 in open\n");
                return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_ATTACH_QUEUE;
        strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
        err = ioctl(fd2, TUNSETQUEUE, (void *)&ifr);
        if (err < 0) {
                perror("TUNSETQUEUE (fd2)");
                close(fd2);
                return -1;
        } else
                printf("Second call to TUNSETQUEUE (on fd2) is OK\n");
        /* keep all three file descriptors open */
        pause();
        return 0;
}
In case you create a tun or tap device, and
you want later to know the type of the device,
you can do it by ethtool -i unknownTypeDeviceName | grep bus-info
virtio
virtio was developed by Rusty Russell for his lguest project.
virtio has a common API for different types of devices (block devices,
net devices,
pci devices, and more).
The virtio network driver is implemented in drivers/net/virtio_net.c.
BLUETOOTH
Bluetooth is a wireless technology standard for exchanging data over
short distances.
Bluetooth implementation in the Linux kernel is found in two locations:
Bluetooth core:
net/bluetooth
Bluetooth drivers:
drivers/bluetooth/
Bluetooth kernel diagram:
(Note: cmtp is a module for ISDN, not commonly used)
Note that there are very few drivers for bluetooth, as many devices
use the generic drivers. We have, for example, the Generic Bluetooth
USB driver (drivers/bluetooth/btusb.c) which is used for many USB BT
devices.
For example, the ASUS USB BT21 dongle, which has a Broadcom
chip, uses this driver.
BlueZ
BlueZ is the user space package for bluetooth.
• From version 4.0 of BlueZ, the main daemon is
called bluetoothd (instead of hcid in earlier versions).
• The main configuration file is /etc/bluetooth/main.conf.
• This daemon also creates an sdp server, calling start_sdp_server().
• start_sdp_server() is implemented in bluez-4.99/src/sdpd-server.c.
• This sdp server opens two sockets:
• UNIX local domain socket, for getting requests sent from the local
machine, such as adding a service (sdptool add). The socket is opened
on /var/run/sdp.
• L2CAP socket for getting requests from outside (for example, when a
remote machine runs "sdptool browse" with the machine address).
There is also a Bluetooth virtual HCI
driver, drivers/bluetooth/hci_vhci.c; it works with a misc character
device, /dev/vhci. The hciemu, HCI emulator, from bluez package,
uses this driver.
Example:
modprobe hci_vhci
then:
./hciemu -n 10
This will create a virtual BT hci device. hciconfig will show it with "Bus:
VIRTUAL." By default, it will be BREDR device (Type: BR/EDR). In case
we need AMP device, we should first run "modprobe hci_vhci amp=1".
Then hciconfig will show AMP device (Type: AMP).
mknod /dev/vhci c 10 250
hciconfig hci0 up
These two commands send HCIDEVDOWN/HCIDEVUP ioctls from user
space.
These ioctls are handled by hci_dev_close() and hci_dev_open(),
respectively
• see net/bluetooth/hci_sock.c
Resetting a bluetooth device can be done by:
hciconfig hci0 reset
Scanning for bluetooth devices is done by:
hcitool scan
hcitool scan triggers a call to hci_inquiry() in user space
(bluez-4.99/lib/hci.c). This method creates a BTPROTO_HCI socket and
sends HCIINQUIRY to the kernel. This ioctl is handled
by hci_inquiry() in the kernel (net/bluetooth/hci_core.c).
hcitool con
show active connections.
Bluetooth sniffing can be done by:
hcidump
(you can add flags, like hcidump -Xt).
hcidump in Fedora is part of the bluez-hcidump package.
You should use the -t flag for temporary change or permanent. The -r
flag is for soft reset. Without it, you should perform hard reset by
removing and replugging the device.
Using Bluetooth input devices, like a mouse/keyboard:
Bluetooth Input devices are handled by the hidp kernel module
(net/bluetooth/hidp/hidp.ko).
and by opening a Raw HCI socket, with HCI protocol.
socket(AF_BLUETOOTH, SOCK_RAW, BTPROTO_HCI);
l2ping - L2CAP ping util.
sdptool browse XX:XX:XX:XX:XX:XX
• shows opened services on the specified device.
• sdptool browse btAddr does the following:
• creates an L2CAP socket (by l2cap_sock_create(), in
net/bluetooth/l2cap_sock.c).
• connects to this socket (by l2cap_sock_connect() in
net/bluetooth/l2cap_sock.c).
• calls hci_connect() with ACL_LINK, which eventually
calls hci_connect_acl(), in net/bluetooth/hci_conn.c.
sdptool browse local
• shows opened services on the local device.
"Network Access Point" (0x1116)
Protocol Descriptor List:
"L2CAP" (0x0100)
PSM: 15
"BNEP" (0x000f)
Version: 0x0100
SEQ16: 800 806
Language Base Attr List:
code_ISO639: 0x656e
encoding: 0x6a
base_offset: 0x100
Profile Descriptor List:
"Network Access Point" (0x1116)
Version: 0x0100
• And when we run:
• pand --listen --role GN
• Then "sdptool browse" on that device will show, among other SDP
services, the "PAN Group Network" service, which might be something
like this:
• Service Name: Group Network Service
Service Description: BlueZ PAN Service
Service Provider: BlueZ PAN
Service RecHandle: 0x10007
Service Class ID List:
"PAN Group Network" (0x1117)
Protocol Descriptor List:
"L2CAP" (0x0100)
PSM: 15
"BNEP" (0x000f)
Version: 0x0100
SEQ16: 800 806
Language Base Attr List:
code_ISO639: 0x656e
encoding: 0x6a
base_offset: 0x100
Profile Descriptor List:
"PAN Group Network" (0x1117)
Version: 0x0100
sdptool search --bdaddr XX:XX:XX:XX:XX:XX FTP
• shows opened OBEX FTP service on the specified device and the
respective channel.
• openobex site: http://dev.zuckschwerdt.org/openobex/
RFCOMM
• Acronym for: Radio Frequency Communications protocol.
Following is a practical example of establishing a PC to PC connection
with RFCOMM over serial:
Set up a Serial Port (SP) service on both sides:
sdptool add --channel=1 SP
(you can choose a channel other than 1, but it must be the same on the
client and the server)
Now, run on the listener side the following:
rfcomm listen rfcomm0 1
This command triggers creating a BTPROTO_RFCOMM kernel socket and
calling the rfcomm_sock_listen() method and afterwards rfcomm_sock_accept().
Only after the socket is created, a device named /dev/rfcomm0 is created
by sending an ioctl (RFCOMMCREATEDEV) to this socket.
struct rfcomm_dev represents the rfcomm device. A sysfs entry is
generated for this device, /sys/class/tty/rfcomm0. This is done
by device_create_file(). This folder contains values such as the
address and the channel of this device. The address is the dst
member of struct rfcomm_dev and the channel is the channel member
of struct rfcomm_dev.
On the sender side, run
rfcomm connect rfcomm0 00:11:22:33:44:55 1
This command triggers creating a BTPROTO_RFCOMM kernel socket,
then calling rfcomm_sock_bind() and rfcomm_sock_connect(), and
creating a device named /dev/rfcomm0 by sending an ioctl
(RFCOMMCREATEDEV) to this socket.
You should get this message on the sender:
Connected /dev/rfcomm0 to 00:11:22:33:44:55 on channel 1
Press CTRL-C for hangup
Now you can send text from the sender to the listener thus:
first, run on the listener, on a different console:
cat /dev/rfcomm0
then, on the sender, run:
echo "foo" >> /dev/rfcomm0
You should see "foo" on the listener terminal.
The RFCOMM tty module (net/bluetooth/rfcomm/tty.c) implements
serial emulation over Bluetooth using the tty driver API,
calling tty_register_driver() and tty_port_register_device().
We can establish a TCP/IP connection over Bluetooth devices in this
way, for example:
on the server side:
pand --listen --role=NAP
And on the client side:
pand --connect btAddressOfServer
An interface called bnep0 will be created on both sides.
We can assign IP addresses to these two interfaces and have TCP/IP
traffic.
In case you encounter problems, like "Connect to btAddr failed. Invalid
exchange(52)", or "connection refused", use hcidump to try to
debug the problem. Make sure that the ISCAN and PSCAN flags are set
on both sides.
pand --connect btAddressOfServer triggers the following sequence:
First, an L2CAP socket is created and connected from user space, by
invoking the socket() system call with the BTPROTO_L2CAP protocol and
then calling connect().
This is handled by l2cap_sock_connect() in the kernel
(net/bluetooth/l2cap_sock.c).
l2cap_sock_connect() also creates a new connection. In this
process, an entry is added under sysfs. This is done
by hci_conn_init_sysfs() and hci_conn_add_sysfs() in
net/bluetooth/hci_sysfs.c.
When the connection is removed, this entry is removed from
sysfs, with hci_conn_del_sysfs().
This entry has only 3 attributes (besides the generic device
attributes):
type, address and features.
L2CAP
The L2CAP header is 4 bytes:
- 2 bytes for the length of the L2CAP PDU payload in bytes (the header
itself is not counted).
• The maximum length can be 65529 or 65531 bytes (according to 3.3.1
in the spec).
- 2 bytes for the cid (Channel Identifier)
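The 4-byte header layout described above can be sketched in user space as follows. This is only an illustration: the struct and function names are hypothetical (they are not the kernel's definitions), and the code assumes the usual Bluetooth little-endian byte order for both fields.

```c
#include <stddef.h>
#include <stdint.h>

#define L2CAP_BASIC_HDR_SIZE 4

/* Hypothetical user-space view of the basic L2CAP header. */
struct l2cap_basic_hdr {
    uint16_t len;  /* payload length in bytes, header not included */
    uint16_t cid;  /* channel identifier */
};

/* Decodes the header from raw bytes (little-endian on the wire).
 * Returns 0 on success, -1 if the buffer is shorter than the header. */
int l2cap_parse_hdr(const uint8_t *buf, size_t buflen,
                    struct l2cap_basic_hdr *hdr)
{
    if (buflen < L2CAP_BASIC_HDR_SIZE)
        return -1;
    hdr->len = (uint16_t)(buf[0] | (buf[1] << 8));
    hdr->cid = (uint16_t)(buf[2] | (buf[3] << 8));
    return 0;
}
```

A caller would typically compare hdr.len with the number of bytes that follow the header before touching the payload.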
Info about establishing PAN can be found here:
http://bluez.sourceforge.net/contrib/HOWTO-PAN
Notice that some of the info about the pand daemon is out of date with
respect to recent pand releases.
For example, running the following command, which is mentioned in
this howto:
pand --listen --role NAP --sdp
will give the following error with pand of
bluez-compat-4.99-2.fc17.x86_64:
pand: unrecognized option '--sdp'
(The --nosdp option does exist)
The BNEP layer is for the transmission of IP packets in the Personal Area
Networking Profile and is implemented in net/bluetooth/bnep.
XX:XX:XX:XX:XX:XX
Lower Address Part (LAP): 24 bits
Upper Address Part (UAP): 8 bits
Nonsignificant Address Part (NAP): 16 bits
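As a small illustration of the three address parts, the following sketch splits a textual address into NAP, UAP and LAP. The names bd_addr_parts and bd_addr_split are hypothetical helpers, not part of BlueZ, and the sketch assumes the usual textual order in which the two leftmost octets are the NAP, the third octet is the UAP and the three rightmost octets are the LAP.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the three parts of a BD address. */
struct bd_addr_parts {
    uint16_t nap;  /* Nonsignificant Address Part, 16 bits */
    uint8_t  uap;  /* Upper Address Part, 8 bits */
    uint32_t lap;  /* Lower Address Part, 24 bits */
};

/* Parses "XX:XX:XX:XX:XX:XX"; returns 0 on success, -1 on malformed
 * input (wrong number of octets or trailing characters). */
int bd_addr_split(const char *str, struct bd_addr_parts *out)
{
    unsigned int b[6];
    char tail;

    if (sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x%c",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &tail) != 6)
        return -1;
    out->nap = (uint16_t)((b[0] << 8) | b[1]);
    out->uap = (uint8_t)b[2];
    out->lap = ((uint32_t)b[3] << 16) | ((uint32_t)b[4] << 8) | b[5];
    return 0;
}
```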
Helper methods:
static inline int bacmp(bdaddr_t *ba1, bdaddr_t *ba2)
compares two bt addresses; returns 0 if they are equal.
static inline void bacpy(bdaddr_t *dst, bdaddr_t *src)
copies the src address to the dst address.
int ba2str(const bdaddr_t *ba, char *str)
converts from bdaddr_t to a zero-terminated string.
int str2ba(const char *str, bdaddr_t *ba)
converts from a zero-terminated string to bdaddr_t.
Read more:
http://wiki.answers.com/Q/Whats_a_bd_address_as_it_asks_for_it_on_my_phone_for_bluetooth
ACL: The Asynchronous Connection-oriented Logical transport
protocol.
SCO: Synchronous Connection-Oriented logical transport.
SSP: Secure Simple Pairing.
• The headline feature of Bluetooth 2.1
• hciconfig hci0 sspmode 0 - this command disables sspmode.
• hciconfig hci0 sspmode 1 - this command enables sspmode.
• hciconfig hci0 sspmode - this command shows sspmode.
BlueDroid
With the Android 4.2 release, the BlueZ-based Bluetooth stack was replaced
with a new stack, named "Bluedroid", which is a collaboration
between Google and Broadcom.
See:
https://developer.android.com/about/versions/jelly-bean.html
http://lwn.net/Articles/525816/
http://lwn.net/Articles/525636/
Links:
Bluetooth git tree for developers (for submitting patches):
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
http://www.bluez.org/release-of-bluez-5-0/
The 5.0 BlueZ, By Nathan Willis, January 3, 2013:
http://lwn.net/Articles/531133/
BlueZ 5 API introduction and porting guide:
http://www.bluez.org/bluez-5-api-introduction-and-porting-guide/
Using Bluetooth article by Ben DuPont on DrDobbs (January 31, 2012)
http://www.drdobbs.com/mobile/using-bluetooth/232500828
OLS: Audio Streaming over Bluetooth - article by Ian Ward
http://lwn.net/Articles/293692/
Bluetooth profiles book:
http://www.amazon.com/Bluetooth-Profiles-Dean-Anthony-Gratton/dp/0130092215/ref=sr_1_1?s=books&ie=UTF8&qid=1355583216&sr=1-1&keywords=bluetooth+profiles
NETFILTER
The Linux 3.7 kernel includes support for IPv6 NAT.
See:
http://lwn.net/Articles/514087/
Most patches are from Patrick McHardy.
VXLAN
http://www.spinics.net/lists/netdev/msg212202.html
is for adding support for managing vxlan tunnels in iproute2.
The basic way to add vxlan virtual interface is by:
ip link add myvxlan type vxlan id 1
This sets the vni member of vxlan_dev struct to 1
(via vxlan_newlink() method).
vni is the virtual network id; the vni can be in the range 0-16777215
(whereas in vlans the id is restricted to 0-4094).
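The 24-bit range of the vni can be made concrete with a short sketch that checks a VNI and packs it into an 8-byte VXLAN header. The function names are hypothetical, and the header layout used here (a flags byte with the 0x08 "VNI valid" bit, three reserved bytes, the 24-bit VNI in network byte order, one reserved byte) is taken from the VXLAN specification rather than from the kernel sources discussed in this document.

```c
#include <stdint.h>
#include <string.h>

#define VXLAN_VNI_MAX   0xFFFFFFu   /* 16777215, the largest 24-bit value */
#define VXLAN_HDR_SIZE  8
#define VXLAN_FLAG_VNI  0x08        /* "I" flag: the VNI field is valid */

/* Fills hdr with a VXLAN header carrying vni.
 * Returns 0 on success, -1 if vni does not fit in 24 bits. */
int vxlan_build_hdr(uint32_t vni, uint8_t hdr[VXLAN_HDR_SIZE])
{
    if (vni > VXLAN_VNI_MAX)
        return -1;
    memset(hdr, 0, VXLAN_HDR_SIZE);
    hdr[0] = VXLAN_FLAG_VNI;
    hdr[4] = (uint8_t)(vni >> 16);
    hdr[5] = (uint8_t)(vni >> 8);
    hdr[6] = (uint8_t)vni;
    return 0;
}

/* Extracts the VNI back from a header built as above. */
uint32_t vxlan_hdr_vni(const uint8_t hdr[VXLAN_HDR_SIZE])
{
    return ((uint32_t)hdr[4] << 16) | ((uint32_t)hdr[5] << 8) | hdr[6];
}
```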
You can add vxlan with group address and ttl thus:
ip link add myvxlan type vxlan id 1 group 239.0.0.42 ttl 10
This sets also the ttl and the gaddr (multicast group address) of
the vxlan device (vxlan_dev)
Removing vxlan virtual interface is done thus:
ip link del myvxlan
(This triggers the vxlan_dellink() method)
See: http://www.spinics.net/lists/netdev/msg211564.html
Setting up VXLAN:
http://vincent.bernat.im/en/blog/2012-multicast-vxlan.html#setting-up-vxlan
http://blogs.cisco.com/datacenter/digging-deeper-into-vxlan/
Stephen Hemminger's blog about vxlan:
http://linux-network-plumber.blogspot.co.il/2012/09/just-published-linux-kernel.html
A First Look At VXLAN over Infiniband Network On Linux 3.7-rc7, by
Naoto MATSUMOTO on Nov 29, 2012
vxlan draft:
http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
This draft does not support IPv6, but probably IPv6 will be supported
in the future.
See also Documentation/networking/vxlan.txt
vxlan tools/userspace
There are two userspace apps, "vxland" and "vxlanctl".
vxland is a vxlan daemon, which forwards packets to the VXLAN overlay
network.
vxlanctl is a command for controlling vxlan.
You can create/destroy a vxlan tunnel interface using vxlanctl.
git clone git://github.com/upa/vxlan.git
It requires the uthash package, version 1.9 or later (for the hash table
usage). You can fetch uthash from http://uthash.sourceforge.net/. Then put
the header file, uthash.h, under /usr/include, and run "make" for the
vxlan project from github.
VXLAN includes support for Distributed Overlay Virtual Ethernet
(DOVE) networks by David L Stevens from IBM.
NFC
NFC stands for: Near Field Communication.
AF_NFC sockets are implemented under net/nfc.
neard, The Near Field Communication manager, is available in:
http://git.kernel.org/?p=network/nfc/neard.git;a=summary
http://www.linuxvirtualserver.org/
Implemented in net/netfilter/ipvs
IP Virtual Server lets you build a high-performance
virtual server based on cluster of two or more real servers.
Sockets:
There are two types of socket in the kernel; most of them are sockets
created from user space. There are also kernel sockets; they are
created by sock_create_kern().
For example, in bluetooth kernel stack (net/bluetooth/rfcomm/core.c):
pcap library is in use by sniffers such as tcpdump or wireshark.
Also hostapd uses PF_PACKET sockets (hostapd is a wireless access
point management project).
From hostapd source code:
...
drv->monitor_sock = socket(PF_PACKET,
SOCK_RAW, htons(ETH_P_ALL));
...
Type:
– SOCK_STREAM and SOCK_DGRAM are the most commonly used types.
The sk_protocol member of struct sock equals the third
parameter (protocol) of the socket() system call.
struct sock has three queues:
• sk_receive_queue for rx
• sk_write_queue for tx
• sk_error_queue for errors.
• skb_queue_tail() : Adding to the queue.
• skb_dequeue() : removing from the queue.
• For the error queue: sock_queue_err_skb() adds to its tail
(include/net/sock.h). Eventually, it also calls skb_queue_tail().
• Errors can be ICMP errors or EMSGSIZE errors.
UDP and TCP sockets:
• No explicit connection setup is done with UDP.
– In TCP there is a preliminary connection setup.
• Packets can be lost in UDP (there is no retransmission mechanism in
the kernel). TCP, on the other hand, is reliable (there is a
retransmission mechanism).
• Most of the Internet traffic is TCP (like http, ssh).
– UDP is for audio/video (RTP)/streaming.
● Note: streaming with VLC is by UDP (RTP).
● Streaming via YouTube is TCP (http).
__sum16 check;
};
UDP packet = UDP header + payload.
From user space, you can receive udp traffic by three system calls:
– recv() (when the socket is connected)
– recvfrom()
– recvmsg()
All three are handled by udp_recvmsg() in the kernel.
Note that all three calls take a flags parameter (the fourth parameter
of recv() and recvfrom(), the third of recvmsg()); however, this
parameter is NOT changed upon return. If you are interested in the
returned flags, you must use recvmsg() and read the msg.msg_flags
member after the call.
For example, suppose you have a client-server UDP application, and
the sender sends a packet which is longer than the input buffer the
receiver allocated. The kernel then truncates the packet and sets the
MSG_TRUNC flag. To retrieve it, call recvmsg() and test msg.msg_flags
for MSG_TRUNC after the call returns.
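A minimal sketch of reading the returned flags is shown here. The helper names recv_check_trunc() and trunc_demo() are hypothetical, and the demo assumes that a loopback interface is available: it sends a 100-byte datagram to a receiver that offers only a 10-byte buffer.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Receives one datagram into buf. Returns the number of bytes copied,
 * or -1 on error; *truncated is set to 1 when the kernel raised
 * MSG_TRUNC in msg_flags, 0 otherwise. */
ssize_t recv_check_trunc(int fd, void *buf, size_t buflen, int *truncated)
{
    struct iovec iov = { .iov_base = buf, .iov_len = buflen };
    struct msghdr msg;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;
    *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}

/* Loopback demo. Returns 1 if truncation was reported, 0 if it was
 * not, and -1 on any socket error. */
int trunc_demo(void)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char big[100], small[10];
    int rx, tx, truncated = 0, ret = -1;

    memset(big, 'x', sizeof(big));
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the kernel pick a port */

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto out;
    if (sendto(tx, big, sizeof(big), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (recv_check_trunc(rx, small, sizeof(small), &truncated) == 10)
        ret = truncated;
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ret;
}
```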
● In the same way we have :
– raw_rcv() as a handler for raw packets.
– tcp_v4_rcv() as a handler for TCP packets.
– icmp_rcv() as a handler for ICMP packets.
If there is a sock listening on the destination port,
call udp_queue_rcv_skb().
– Eventually calls sock_queue_rcv_skb().
● Which adds the packet to the sk_receive_queue by skb_queue_tail().
udp_recvmsg():
Calls __skb_recv_datagram() for receiving one sk_buff.
– The __skb_recv_datagram() method may block.
– Eventually, what __skb_recv_datagram() does is read one sk_buff from
the sk_receive_queue queue.
memcpy_toiovec() performs the actual copy to user space by invoking
copy_to_user().
● One of the parameters of udp_recvmsg() is a pointer to struct
msghdr. Let's take a look:
From include/linux/socket.h:
struct msghdr {
void *msg_name; /* Socket name */
int msg_namelen; /* Length of name */
struct iovec *msg_iov; /* Data blocks */
__kernel_size_t msg_iovlen; /* Number of blocks */
void *msg_control;
__kernel_size_t msg_controllen; /* Length of cmsg list */
unsigned msg_flags;
};
Control messages (ancillary messages)
● The msg_control member of msghdr points to a buffer of control messages.
– Sometimes you need to perform some special things. For example,
finding out the destination address of a received packet.
● Sometimes there is more than one address on a machine (and you can
also have multiple addresses on the same nic).
– How can the application know the destination address in the ip header?
– struct cmsghdr (/usr/include/bits/socket.h) represents a control
message.
cmsghdr members can mean different things based on the type of
socket.
● There is a set of macros for handling cmsghdr like
CMSG_FIRSTHDR(), CMSG_NXTHDR(), CMSG_DATA(), CMSG_LEN() and
more.
● There are no control messages for TCP sockets.
Socket options:
In order to tell the socket to get the information about the packet
destination, we should call setsockopt().
● setsockopt() and getsockopt() - set and get options on a socket.
– Both methods return 0 on success and -1 on error.
● Prototype: int setsockopt(int sockfd, int level, int optname,...
There are two levels of socket options:
To manipulate options at the sockets API level: SOL_SOCKET
To manipulate options at a protocol level, that protocol number should
be used;
– for example, for UDP it is IPPROTO_UDP or SOL_UDP
(both are equal 17) ; see include/linux/in.h and include/linux/socket.h
● SOL_IP is 0.
There are currently 19 Linux socket options and one additional option
for BSD compatibility.
● There is an option of SO_BINDTODEVICE (assigning a socket to a
specified device).
• This patch also added an option to get SO_BINDTODEVICE via getsockopt:
http://www.spinics.net/lists/netdev/msg214004.html
int ipi_ifindex; /* Interface index */
struct in_addr ipi_spec_dst; /* Routing destination address */
struct in_addr ipi_addr; /* Header destination address */
};
const int on = 1;
sockfd = socket(AF_INET, SOCK_DGRAM,0);
if (setsockopt(sockfd, SOL_IP, IP_PKTINFO, &on, sizeof(on))<0)
perror("setsockopt");
...
...
...
When calling recvmsg(), we will parse the msghdr like this:
for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
     cmptr = CMSG_NXTHDR(&msg, cmptr))
{
    if (cmptr->cmsg_level == SOL_IP && cmptr->cmsg_type == IP_PKTINFO)
    {
        pktinfo = (struct in_pktinfo *)CMSG_DATA(cmptr);
        printf("destination=%s\n",
               inet_ntop(AF_INET, &pktinfo->ipi_addr, str, sizeof(str)));
    }
}
In the kernel, this calls ip_cmsg_recv() in net/ipv4/ip_sockglue.c.
(which eventually calls
ip_cmsg_recv_pktinfo()).
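The fragments above can be combined into a complete user-space sketch. The helper names recv_with_dst() and pktinfo_demo() are hypothetical, the demo assumes a loopback interface, and IPPROTO_IP is used as the level because it has the same value (0) as SOL_IP.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes struct in_pktinfo on glibc */
#endif
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Receives one datagram on fd (IP_PKTINFO must already be enabled) and
 * writes the IP header destination address into dst as text.
 * Returns 0 on success, -1 on error or if no IP_PKTINFO message arrived. */
int recv_with_dst(int fd, char *dst, socklen_t dstlen)
{
    char data[512];
    union {                    /* union keeps the buffer cmsghdr-aligned */
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;
    } ctl;
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg;
    struct cmsghdr *c;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo info;

            memcpy(&info, CMSG_DATA(c), sizeof(info));
            return inet_ntop(AF_INET, &info.ipi_addr, dst, dstlen) ? 0 : -1;
        }
    }
    return -1;
}

/* Loopback demo: sends one datagram to 127.0.0.1 and reports the
 * destination address seen by the receiver. Returns 0 or -1. */
int pktinfo_demo(char *dst, socklen_t dstlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    const int on = 1;
    int rx, tx, ret = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;
    if (setsockopt(rx, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on)) < 0 ||
        bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto out;
    if (sendto(tx, "hi", 2, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    ret = recv_with_dst(rx, dst, dstlen);
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ret;
}
```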
● In this way you can retrieve other fields of the ip header:
– For getting the TTL:
● setsockopt(sockfd, SOL_IP, IP_RECVTTL, &on, sizeof(on)).
● But note that in the received control message: cmsg_type == IP_TTL.
– For getting ip_options:
● setsockopt() with IP_OPTIONS.
Note: you cannot get/set ip_options in Java
Sending packets in UDP
From user space, you can send udp traffic with three system calls:
– send() (when the socket is connected).
– sendto()
– sendmsg()
● All three are handled by udp_sendmsg() in the kernel.
● udp_sendmsg() is much simpler than its TCP
counterpart, tcp_sendmsg().
● udp_sendpage() is called when user space calls sendfile() (to copy a
file into a udp socket).
– sendfile() can be used also to copy data between one file descriptor
and another.
udp_sendpage() invokes udp_sendmsg().
● udp_sendpage() will work only if the nic supports Scatter/Gather
(NETIF_F_SG feature is supported).
Bind:
You cannot bind to privileged ports (ports lower than 1024) when you
are not root!
– Trying to do this will give:
– “Permission denied” (EACCES).
– You can enable non-root binding to a privileged port
by running as root (you will need at least a 2.6.24 kernel):
– setcap 'cap_net_bind_service=+ep' udpclient
– This sets the CAP_NET_BIND_SERVICE
capability.
You cannot bind on a port which is already bound.
– Trying to do this will give:
– “Address already in use” (EADDRINUSE)
● You cannot bind twice or more with the same UDP socket (even if
you change the port).
– You will get “bind: Invalid argument” error in such case (EINVAL)
If you call connect() on an unbound UDP socket and then bind(), you
will also get the EINVAL error. The reason is that connecting an
unbound socket calls inet_autobind() to automatically bind it (on a
random port). So after connect(), the socket is bound, and calling
bind() again will fail with EINVAL (since the socket is already bound).
Binding in the kernel for UDP is implemented in inet_bind() and
inet_autobind()
– (in IPV6: inet6_bind() )
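The bind() failure cases described above can be reproduced with a short sketch. All function names here are hypothetical helpers; the sketch assumes a loopback interface and uses port 0 so that the kernel picks a free port. Port 9 in the connect() case is an arbitrary choice, since a UDP connect() sends nothing.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static struct sockaddr_in loopback_addr(unsigned short port)
{
    struct sockaddr_in a;

    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);
    return a;
}

/* Returns 0 if bind() succeeded, otherwise the errno it set. */
static int try_bind(int fd, const struct sockaddr_in *a)
{
    return bind(fd, (const struct sockaddr *)a, sizeof(*a)) < 0 ? errno : 0;
}

/* Binds the same UDP socket twice; returns the result of the second
 * bind(), or -1 if the setup failed. */
int errno_of_second_bind(void)
{
    struct sockaddr_in a = loopback_addr(0);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int err = -1;

    if (fd < 0)
        return -1;
    if (try_bind(fd, &a) == 0)
        err = try_bind(fd, &a);
    close(fd);
    return err;
}

/* connect() on an unbound socket (which autobinds it), then bind(). */
int errno_of_bind_after_connect(void)
{
    struct sockaddr_in peer = loopback_addr(9), local = loopback_addr(0);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int err = -1;

    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0)
        err = try_bind(fd, &local);
    close(fd);
    return err;
}

/* A second socket tries to bind to a port that is already bound. */
int errno_of_bind_in_use(void)
{
    struct sockaddr_in a = loopback_addr(0);
    socklen_t alen = sizeof(a);
    int fd1 = socket(AF_INET, SOCK_DGRAM, 0);
    int fd2 = socket(AF_INET, SOCK_DGRAM, 0);
    int err = -1;

    if (fd1 >= 0 && fd2 >= 0 && try_bind(fd1, &a) == 0 &&
        getsockname(fd1, (struct sockaddr *)&a, &alen) == 0)
        err = try_bind(fd2, &a);
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return err;
}
```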
Non local bind
What happens if we try to bind on a non local address ? (a non
local address can be for example, an address of interface which
is temporarily down)
– We get EADDRNOTAVAIL error:
– “bind: Cannot assign requested address.”
– However, if we set
/proc/sys/net/ipv4/ip_nonlocal_bind to 1, by
– echo "1" > /proc/sys/net/ipv4/ip_nonlocal_bind
– Or adding in /etc/sysctl.conf:
net.ipv4.ip_nonlocal_bind=1
– The bind() will succeed, but it may sometimes break
applications.
What will happen if, in the above udp client example, we try
setting a broadcast address as the destination (instead of
192.168.0.121), thus:
inet_aton("255.255.255.255",&target.sin_addr);
● We will get an EACCES error (“Permission denied”) from sendto().
For UDP broadcast to work, we have to add:
int flag = 1;
if (setsockopt (s, SOL_SOCKET, SO_BROADCAST,&flag,
sizeof(flag)) < 0)
perror("setsockopt");
UDP socket options
● For the IPPROTO_UDP/SOL_UDP level, we have two socket options:
● UDP_CORK socket option.
– Added in Linux kernel 2.5.44.
int state=1;
setsockopt(s, IPPROTO_UDP, UDP_CORK, &state, sizeof(state));
for (j=0;j<1000;j++)
sendto(s,buf1,...)
state=0;
setsockopt(s, IPPROTO_UDP, UDP_CORK, &state,sizeof(state));
● The above code fragment will call udp_sendmsg() 1000 times without
actually sending anything on the wire (in the usual case, without
setsockopt() with UDP_CORK, 1000 packets would be sent).
● Only after the second setsockopt() is called, with UDP_CORK and
state=0, is one packet sent on the wire.
● Kernel implementation: when using UDP_CORK, udp_sendmsg()
passes
MSG_MORE to ip_append_data().
– Implementation detail: older glibc headers
(/usr/include/netinet/udp.h) do not define UDP_CORK; in that case
you need to add in your program:
– #define UDP_CORK 1
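The corking behavior can be observed with a small loopback sketch. The function name cork_demo() is hypothetical and a loopback interface is assumed: three 3-byte chunks are sent while the socket is corked, and the receiver then sees a single datagram containing all of them.

```c
#include <netinet/in.h>
#include <netinet/udp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef UDP_CORK
#define UDP_CORK 1             /* fallback for headers that lack it */
#endif

/* Sends "abc", "def" and "ghi" on a corked UDP socket, uncorks it and
 * receives on the other side. Returns the size of the first datagram
 * the receiver gets (copied into out), or -1 on error. */
ssize_t cork_demo(char *out, size_t outlen)
{
    static const char *chunks[] = { "abc", "def", "ghi" };
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int rx, tx, state, i;
    ssize_t n = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    state = 1;                 /* cork: data is only appended */
    if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &state, sizeof(state)) < 0)
        goto out;
    for (i = 0; i < 3; i++)
        if (sendto(tx, chunks[i], 3, 0,
                   (struct sockaddr *)&addr, sizeof(addr)) != 3)
            goto out;
    state = 0;                 /* uncork: the pending data is sent */
    if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &state, sizeof(state)) < 0)
        goto out;

    n = recv(rx, out, outlen, 0);
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```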
● UDP_ENCAP socket option.
– For usage with IPSEC.
● Used, for example, in ipsec-tools.
● Note: UDP_ENCAP does not appear yet in the man page
of udp (UDP_CORK does appear).
● Note that there are other socket options at the
SOL_SOCKET level which you can get/set on
UDP sockets: for example, SO_NO_CHECK (to
disable UDP checksum generation on transmit).
● SO_DONTROUTE (equivalent to MSG_DONTROUTE in send()).
● The SO_DONTROUTE option tells “don't send via a gateway,
only send to directly connected hosts.”
● Adding:
– int one = 1;
– setsockopt(s, SOL_SOCKET, SO_DONTROUTE, &one, sizeof(one));
– And sending the packet to a host on a different network will
cause a “Network is unreachable” error (ENETUNREACH).
– The same will happen when the MSG_DONTROUTE flag is set
in sendto().
● SO_SNDBUF.
● socklen_t len = sizeof(sndbuf);
● getsockopt(s, SOL_SOCKET, SO_SNDBUF, (void *) &sndbuf, &len).
Suppose we want to receive ICMP errors with the UDP client
example (like ICMP destination unreachable/port unreachable).
● How can we achieve this ?
● First, we should set this socket option:
– int val=1;
– setsockopt(s, SOL_IP, IP_RECVERR,(char*)&val, sizeof(val));
udp_sendmsg()
udp_sendmsg(struct kiocb *iocb, struct sock
*sk, struct msghdr *msg, size_t len)
● Sanity checks in udp_sendmsg():
The destination UDP port must not be 0.
● If we try destination port of 0 we get EINVAL error as a return value
of udp_sendmsg()
– The destination UDP port is embedded inside the msghdr parameter (in
fact, msg->msg_name points to a sockaddr_in; sin_port in sockaddr_in
is the destination port number).
● MSG_OOB is the only illegal flag for UDP.
udp_sendmsg() returns an EOPNOTSUPP error if this flag is
passed (it is only permitted for SOCK_STREAM).
● MSG_OOB is also illegal in AF_UNIX.
OOB stands for “Out Of Band data”.
● The MSG_OOB flag is permitted in TCP.
– It enables sending one byte of data in urgent mode.
– (telnet , “ctrl/c” for example).
● The destination must be either:
– specified in the msghdr (the name field in msghdr).
– Or the socket is connected.
● sk->sk_state == TCP_ESTABLISHED
– Notice that though this is UDP, we use TCP semantics here.
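These sanity checks can be observed from user space with a short sketch. The helper names are hypothetical, and port 9 on the loopback address is an arbitrary destination chosen for illustration; in each case the call fails before anything is put on the wire.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Builds a 127.0.0.1 destination with the given port (host order). */
struct sockaddr_in udp_loopback_dst(unsigned short port)
{
    struct sockaddr_in a;

    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);
    return a;
}

/* Performs one sendto() on a fresh, unconnected UDP socket.
 * dst may be NULL to send without a destination.
 * Returns 0 on success, the errno of the failed sendto(), or -1 if
 * the socket could not be created. */
int udp_send_errno(const struct sockaddr_in *dst, int flags)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int err;

    if (fd < 0)
        return -1;
    err = sendto(fd, "x", 1, flags, (const struct sockaddr *)dst,
                 dst ? sizeof(*dst) : 0) < 0 ? errno : 0;
    close(fd);
    return err;
}
```

Calling udp_send_errno() with a destination port of 0 is expected to return EINVAL, with MSG_OOB it is expected to return EOPNOTSUPP, and with a NULL destination on the unconnected socket it is expected to return EDESTADDRREQ.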
In case the socket is not connected, we should find a route to the
destination; this is done by calling ip_route_output_flow().
● In case it is connected, we use the route from the sock
(sk_dst_cache member of sk, which is an instance of dst_entry).
– When the connect() system call was
invoked, ip4_datagram_connect() found the route by
ip_route_connect() and set sk->sk_dst_cache in sk_dst_set().
● Moving the packet to Layer 3 (IP layer) is done by ip_append_data().
In TCP, moving the packet to Layer 3 is done with ip_queue_xmit().
– What's the difference ?
● UDP does not handle fragmentation itself;
ip_append_data() does handle fragmentation.
– TCP splits the data into segments in layer 4, so there is no
need for ip_append_data().
ip_queue_xmit() is (naturally) a simpler method.
● Basically what the udp_sendmsg() method
does is:
● Finds the route for the packet by
ip_route_output_flow()
● Sends the packet with
ip_local_out(skb)
Asynchronous I/O
● There is support for Asynchronous I/O in UDP sockets.
This means that instead of polling to know if
there is data (by select(), for example), the
kernel sends a SIGIO signal in such a case
Using Asynchronous I/O UDP in a user space
application is done in three stages:
– 1) Adding a SIGIO signal handler by calling the
sigaction() system call.
– 2) Calling fcntl() with F_SETOWN and the pid of our
process to tell the kernel that our process is the owner of the
socket (so that SIGIO signals will be delivered to it).
Several processes can access a socket. If we do not call
fcntl() with F_SETOWN, there can be ambiguity as to which
process will get the SIGIO signal. For example, if we call
fork(), the owner of the SIGIO is the parent; but we can call,
in the child, fcntl(s,F_SETOWN, getpid()).
– 3) Setting flags: calling fcntl() with F_SETFL and
O_NONBLOCK | FASYNC.
In the SIGIO handler, we call recvfrom().
● Example:
struct sockaddr_in source;
struct sigaction handler;
source.sin_family = AF_INET;
source.sin_port = htons(888);
source.sin_addr.s_addr = htonl(INADDR_ANY);
servSocket = socket(AF_INET, SOCK_DGRAM, 0);
bind(servSocket, (struct sockaddr*)&source, sizeof(struct sockaddr_in));
handler.sa_handler = SIGIOHandler;
sigfillset(&handler.sa_mask);
handler.sa_flags = 0;
sigaction(SIGIO, &handler, 0);
fcntl(servSocket,F_SETOWN, getpid());
fcntl(servSocket,F_SETFL, O_NONBLOCK | FASYNC);
The fcntl() which sets the O_NONBLOCK | FASYNC flags
invokes sock_fasync() in net/socket.c to add the socket to the
asynchronous notification (fasync) list.
– The SIGIOHandler() method will be called when there is
data (since a SIGIO signal was generated); it should call
recvmsg() or recvfrom().
RDMA (Infiniband)
See these sites by Dotan Barak:
http://www.rdmamojo.com/
http://www.rdmamojo.com/links/
Linux Wireless Subsystem (802.11).
Each MAC frame consists of a MAC header, a frame body of variable
length and an FCS (Frame Check Sequence), which is a 32 bit CRC. Next
figure shows the 802.11 header.
Protocol version:
The version of the MAC 802.11 we use. Currently there is only
one version of MAC, so this field is always 0.
Type:
There are three types of packets in 802.11: management, control and
data.
● Management packets (IEEE80211_FTYPE_MGMT) are for management
actions like association, authentication, scanning and more. We will
deal more with management packets in the following sections.
● Control packets (IEEE80211_FTYPE_CTL) usually have some relevance
to data packets; for example, a PS Poll packet is for retrieving
packets from an Access Point buffer. Another example: a station that
wants to transmit first sends a control packet called RTS (request to
send); if the medium is free, the destination station will send a
control packet called CTS (clear to send).
● Data packets (IEEE80211_FTYPE_DATA) are the raw data packets. Null
packets are a special case of raw packets.
of space for management packets subtypes, action management frames
are used also in 802.11n management packets. A value of 1011 for the
subtype field in a control packet denotes that this is a request to
send (IEEE80211_STYPE_RTS) control packet. A value of 0100 for the
subtype field of a data packet denotes that this is a null data
(IEEE80211_STYPE_NULLFUNC) packet, which is used for power management
control. A value of 1000 (IEEE80211_STYPE_QOS_DATA) for the subtype of
a data packet means that this is a QoS data packet; this subtype was
added by the IEEE802.11e amendment, which dealt with QoS enhancements.
ToDS:
When this bit is set, this means that the packet is for the distribution
system.
FromDS:
When this bit is set, this means that the packet is from the distribution
system.
More Frag:
When we use fragmentation, this bit is set to 1.
Retry:
When a packet is retransmitted, this bit is set to 1. A common case of
retransmission is when a packet that was sent did not receive an
acknowledgment in time. The acknowledgements are sent by the firmware
of the wireless device.
Pwr Mgmt:
When the power management bit is set, this means that the station will
enter power save mode.
More Data:
When an Access Point sends packets that it buffered for a sleeping
station, it sets the more data bit to 1 when the buffer is not empty.
Thus the station knows that there are more packets it should retrieve.
When the buffer has been emptied, this bit is set to 0.
Protected Frame:
This bit is set to 1 when the frame body is encrypted; only data frames
and authentication frames can be encrypted.
Order:
There is a MAC service called “strict ordering”. With this service, the
order of frames is important. When this service is in use, the order
bit is set to 1. It is rarely used.
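The frame control bits described above can be decoded with a few masks. The sketch below is a user-space illustration with hypothetical names (fc_fields, fc_decode); the mask values are intended to mirror the IEEE80211_FCTL_* definitions in include/linux/ieee80211.h, and the function takes the 16-bit frame control value already converted to host byte order.

```c
#include <stdint.h>

#define FCTL_VERS       0x0003
#define FCTL_FTYPE      0x000c
#define FCTL_STYPE      0x00f0
#define FCTL_TODS       0x0100
#define FCTL_FROMDS     0x0200
#define FCTL_MOREFRAGS  0x0400
#define FCTL_RETRY      0x0800
#define FCTL_PM         0x1000
#define FCTL_MOREDATA   0x2000
#define FCTL_PROTECTED  0x4000
#define FCTL_ORDER      0x8000

/* Type field values after shifting the two type bits down. */
enum { FTYPE_MGMT = 0, FTYPE_CTL = 1, FTYPE_DATA = 2 };

struct fc_fields {
    unsigned version, type, subtype;
    unsigned to_ds, from_ds, more_frag, retry;
    unsigned pwr_mgmt, more_data, protected_frame, order;
};

/* Splits a host-order frame control value into its fields. */
struct fc_fields fc_decode(uint16_t fc)
{
    struct fc_fields f;

    f.version         = fc & FCTL_VERS;
    f.type            = (fc & FCTL_FTYPE) >> 2;
    f.subtype         = (fc & FCTL_STYPE) >> 4;
    f.to_ds           = !!(fc & FCTL_TODS);
    f.from_ds         = !!(fc & FCTL_FROMDS);
    f.more_frag       = !!(fc & FCTL_MOREFRAGS);
    f.retry           = !!(fc & FCTL_RETRY);
    f.pwr_mgmt        = !!(fc & FCTL_PM);
    f.more_data       = !!(fc & FCTL_MOREDATA);
    f.protected_frame = !!(fc & FCTL_PROTECTED);
    f.order           = !!(fc & FCTL_ORDER);
    return f;
}
```

For example, a null data packet with the PM bit set has the value 0x1048: type data, subtype 0100, and pwr_mgmt set.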
Duration/ID:
The duration holds values for the Network Allocation Vector (NAV) in
microseconds, and it consists of 15 bits of the duration field. The
sixteenth bit is 0. When working in power save mode, it is the AID
(Association ID) of a station.
The Network Allocation Vector (NAV) is a virtual carrier sensing
mechanism.
Sequence control:
This is a 2 byte field specifying the sequence control. In 802.11, it is
possible that a packet will be received more than once. The most common
cause for such a case is when an acknowledgement is not received for
some reason. The sequence control field consists of a fragment number
(4 bits) and a sequence number (12 bits). The sequence number is
generated by the transmitting station, in ieee80211_tx_h_sequence(). In
case of a duplicate frame in a retransmission, it is dropped, and a
counter of the dropped duplicate frames (dot11FrameDuplicateCount) is
incremented by 1; this is done in ieee80211_rx_h_check(). The Sequence
Control field is not present in control packets.
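The split of the sequence control field into 4 + 12 bits can be sketched as follows. The function names are hypothetical; the masks are intended to match IEEE80211_SCTL_FRAG and IEEE80211_SCTL_SEQ in the kernel headers. The duplicate test is only a simplified model of the idea (a frame with the retry bit set whose sequence control equals the last one seen from that station), not a copy of the mac80211 code.

```c
#include <stdint.h>

#define SCTL_FRAG 0x000F   /* low 4 bits: fragment number */
#define SCTL_SEQ  0xFFF0   /* high 12 bits: sequence number */

unsigned sctl_frag(uint16_t sc)
{
    return sc & SCTL_FRAG;
}

unsigned sctl_seq(uint16_t sc)
{
    return (sc & SCTL_SEQ) >> 4;
}

/* Sequence control for the next frame: the sequence number is advanced
 * by one (wrapping at 4096) and the fragment number starts at 0. */
uint16_t sctl_next_seq(uint16_t sc)
{
    return (uint16_t)((sc + 0x10) & SCTL_SEQ);
}

/* Simplified duplicate check: a retransmitted frame that carries the
 * same sequence control as the last frame seen is a duplicate. */
int sctl_is_duplicate(int retry, uint16_t last_sc, uint16_t sc)
{
    return retry && last_sc == sc;
}
```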
Address Fields:
There are four addresses, but we don't always use all of them. Address
1 is the Receive Address (RA), and is used in all packets. Address 2 is
the Transmit Address (TA), and it exists in all packets except ACK and
CTS packets. Address 3 is used only for management and data packets.
Address 4 is used when both the ToDS and FromDS bits of the frame
control are set; this happens when operating in a Wireless Distribution
System (WDS).
QoS Control:
The QoS Control field was added by the 802.11e amendment and it is only
present in QoS data packets. Since it is not part of the original
802.11 spec, it is not part of the original mac80211 implementation, so
it is not a member of the ieee80211_hdr struct. In fact, it was added
at the end of the 802.11 header and it can be accessed by the
ieee80211_get_qos_ctl() method. The QoS Control field includes the tid
(Traffic Identification), the Ack Policy, and a field called A-MSDU
present, which tells whether an A-MSDU is present.
HT Control Field:
The HT Control Field was added by the 802.11n amendment. HT stands for
High Throughput. One of the most important features of the 802.11n
amendment is increasing the rate to up to 600 Mbps.
• All stations must authenticate and associate with the Access Point
prior to communicating.
Stations usually perform scanning prior to authentication and
association in order to get details about the Access Point (like mac
address, essid, and more).
Scanning is done thus:
ifconfig wlan0 up
iwlist wlan0 scan
● An Access Point will not receive any data frames from a station
before it is associated with the AP.
● An Access Point which receives an association request will check
whether the mobile station parameters match the Access Point
parameters.
– These parameters are SSID, Supported Rates and capability
information.
Hostapd
hostapd is a user space daemon implementing access point
functionality (and authentication servers). It supports Linux and
FreeBSD.
● http://hostap.epitest.fi/hostapd/
● Developed by Jouni Malinen
● hostapd.conf is the configuration file.
● Certain devices, which support Master Mode,
can be operated as Access Points by running
the hostapd daemon.
● Hostapd implements the part of the MLME AP code
which is not in the kernel
and probably will not be in the near future.
● For example: handling association requests which are
received from wireless clients.
Hostapd manages:
● Association/Disassociation requests.
● Authentication/deauthentication requests.
wpa_supplicant is part of the same project (hostap) as hostapd.
You can clone hostap by:
git clone git://w1.fi/srv/git/hostap.git
And then start wireshark and select the wlan0 interface.
You can find out the channel number while sniffing by
looking at the radiotap header in the sniffer output;
the channel frequency translates to a channel number
(one-to-one correspondence). Moreover, the channel number appears in
square brackets. Like:
– channel frequency 2437 [BG 6]
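The frequency to channel translation mentioned above can be sketched with the usual channel numbering formulas. The function name freq_to_channel() is hypothetical, and the ranges used (2412 to 2472 MHz plus 2484 MHz in the 2.4 GHz band, 5180 to 5825 MHz in the 5 GHz band) are a simplification that ignores regulatory restrictions.

```c
/* Maps a center frequency in MHz to an 802.11 channel number.
 * Returns -1 for frequencies outside the ranges handled here. */
int freq_to_channel(int freq_mhz)
{
    if (freq_mhz == 2484)                      /* channel 14 is special */
        return 14;
    if (freq_mhz >= 2412 && freq_mhz <= 2472 &&
        (freq_mhz - 2407) % 5 == 0)            /* 2.4 GHz: 5 MHz steps */
        return (freq_mhz - 2407) / 5;
    if (freq_mhz >= 5180 && freq_mhz <= 5825 &&
        (freq_mhz - 5000) % 5 == 0)            /* 5 GHz band */
        return (freq_mhz - 5000) / 5;
    return -1;
}
```

With the frequency from the sniffer line above, freq_to_channel(2437) gives 6.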
The radiotap header is added in certain cases under monitor mode.
It precedes the 802.11 header.
It is done in ieee80211_add_rx_radiotap_header() in
net/mac80211/rx.c
ieee80211_add_rx_radiotap_header() is invoked from:
• ieee80211_rx_monitor().
• ieee80211_rx_cooked_monitor().
You can know the mac address of your wireless nic by:
cat /sys/class/ieee80211/phy*/macaddress
● A station sends a null packet by calling ieee80211_send_nullfunc()
(net/mac80211/mlme.c)
The PM bit in the frame control of this packet is set.
(IEEE80211_FCTL_PM bit)
● Each access point has an array of skbs for buffering unicast packets
for the stations which enter power save mode.
● It is called ps_tx_buf (in struct sta_info; see
net/mac80211/sta_info.h)
An access point also has a ps_bc_buf queue for multicast and
broadcast packets.
ps_tx_buf can buffer up to 64 skbs. (STA_MAX_TX_BUFFER=64, in
net/mac80211/sta_info.h)
In case the buffer is filled, old skbs will be dropped.
● When a station enters PS mode it turns off its RF. From time to time
it turns the RF on, but only for receiving beacons.
● An Access Point sends beacon frames periodically (usually about 10
beacons per second).
● Each beacon has a TIM (Traffic Indication Map) field.
ieee80211_rx_mgmt_beacon() handles receiving
beacons. (net/mac80211/mlme.c).
A beacon is a management frame, represented by struct beacon, which is
one of the members of a union (named "u") in struct ieee80211_mgmt.
The "variable" member of struct beacon holds the "Information
Elements" this beacon carries; "Information Elements" can be SSID,
Supported rates, FH Params, DS Params, CF Params, IBSS Params, TIM,
and more.
struct ieee802_11_elems represents the parsed "Information Elements".
It contains a member called tim (struct ieee80211_tim_ie),
representing the TIM (Traffic Indication Map).
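A station that wakes up for a beacon has to test its own bit in the
TIM. The sketch below shows one way to do that test in user space C,
assuming the element layout of struct ieee80211_tim_ie (DTIM count, DTIM
period, bitmap control, partial virtual bitmap). The names struct tim_ie
and tim_has_traffic() are hypothetical; aid is the association ID that
the access point assigned to the station, and tim_len is the length of
the element body.

```c
#include <assert.h>
#include <stdint.h>

/* Body of a TIM element; same field order as struct ieee80211_tim_ie. */
struct tim_ie {
    uint8_t dtim_count;
    uint8_t dtim_period;
    uint8_t bitmap_ctrl;        /* bit 0: group traffic, bits 1-7: bitmap offset */
    uint8_t virtual_map[251];   /* partial virtual bitmap (1..251 bytes used) */
};

/* Returns 1 if the TIM indicates buffered unicast traffic for `aid`. */
int tim_has_traffic(const struct tim_ie *tim, uint8_t tim_len, uint16_t aid)
{
    unsigned int index = aid / 8;               /* byte of the full bitmap */
    uint8_t mask = (uint8_t)(1u << (aid & 7));  /* bit inside that byte */
    unsigned int first, last;

    if (tim_len < 4)            /* 3 header bytes + at least 1 bitmap byte */
        return 0;

    first = tim->bitmap_ctrl & 0xfe;    /* first bitmap byte that was sent */
    last = tim_len + first - 4;         /* last bitmap byte that was sent */
    if (index < first || index > last)
        return 0;                       /* bytes not sent are all zero */

    return (tim->virtual_map[index - first] & mask) != 0;
}
```

Only part of the bitmap is transmitted, which is why the byte index has
to be shifted by the offset carried in bitmap_ctrl before the lookup.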
Note that in the following diagram, we do not show the ACK packets.
AD Hoc
Implementation of 802.11 AD Hoc is mainly in: net/mac80211/ibss.c
ieee80211_if_ibss structure represents an AD hoc station.
802.11n
802.11n started with the High Throughput Study Group in about 2002.
In 802.11, each packet should be acknowledged. In 802.11n, we
group packets in a block and acknowledge this block instead of
acknowledging each packet separately. This improves performance.
Grouping packets in a block in this way is called "packet aggregation"
in 802.11n terminology.
There are two forms of aggregation:
● AMPDU (The more common form)
AMPDU aggregation requires the use of block
acknowledgement or BlockAck, which was introduced in 802.11e and
has been optimized in 802.11n.
802.11e is the quality-of-service extensions amendment.
The 802.11e amendment deals with QoS; it introduced four queues for
different types of traffic: voice traffic, video traffic, best-effort traffic
and background traffic. The Linux implementation of 802.11e uses
multiqueues. Traffic in higher priority queue is transmitted before
traffic in a lower priority queue.
MPDU stands for: MAC protocol data units
● AMSDU
With AMSDU, several packets are combined into one big packet.
This big packet should be acked.
Disadvantage: more risk of corruption of the big packet.
Less commonly used; fading out.
MSDU stands for: MAC service data units.
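The four 802.11e queues described above can be illustrated with a small
user space sketch. The mapping from the eight 802.1d user priorities to
the four access categories follows the usual WMM convention and is an
assumption here, as are the names enum ac, classify_priority() and
next_queue(). Strict priority between the queues is a simplification
that follows the wording above.

```c
#include <assert.h>

/* Access categories, highest priority first. */
enum ac { AC_VO = 0, AC_VI = 1, AC_BE = 2, AC_BK = 3, AC_COUNT = 4 };

/* Assumed mapping of 802.1d user priority (0..7) to access category. */
static const enum ac up_to_ac[8] = {
    AC_BE, AC_BK, AC_BK, AC_BE, AC_VI, AC_VI, AC_VO, AC_VO
};

/* Hypothetical helper: pick the queue for a packet with this priority. */
enum ac classify_priority(unsigned int user_priority)
{
    return up_to_ac[user_priority & 7];
}

/*
 * Hypothetical helper: given the number of packets waiting in each queue,
 * return the queue to serve next, or -1 if all queues are empty.
 */
int next_queue(const int queued[AC_COUNT])
{
    int ac;

    for (ac = AC_VO; ac < AC_COUNT; ac++)
        if (queued[ac] > 0)
            return ac;          /* first non-empty, highest priority queue */
    return -1;
}
```

For example, with no voice packets waiting but video, best-effort and
background packets queued, next_queue() selects the video queue.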
Packet aggregation
● There are two sides to a block ack session: originator and recipient.
Each block session has a different TID (traffic identifier).
● The originator starts the block acknowledge session by
calling ieee80211_start_tx_ba_session() (net/mac80211/agg-tx.c)
ieee80211_tx_ba_session_handle_start() is a callback
of ieee80211_start_tx_ba_session(). In this callback we send an
ADDBA (add Block Acknowledgment) request packet, by invoking the
ieee80211_send_addba_request() method (also in
net/mac80211/agg-tx.c).
The ieee80211_send_addba_request() method builds a management
action packet (the subtype is action, IEEE80211_STYPE_ACTION).
The response to the ADDBA request should be received within HZ
jiffies, which is one second regardless of the configured tick rate
(ADDBA_RESP_INTERVAL, defined in net-next/net/mac80211/sta_info.h).
In case we do not get a response in time,
sta_addba_resp_timer_expired() stops the BA session by
calling ieee80211_stop_tx_ba_session().
When the other side (the recipient) receives the ADDBA request, it
first sends an ACK. Then it processes the ADDBA request by
calling ieee80211_process_addba_request() (net/mac80211/agg-rx.c);
if everything is ok, it sets the aggregation state of this machine to
operational (HT_AGG_STATE_OPERATIONAL), and sends an ADDBA Response by
calling ieee80211_send_addba_resp().
After a session has started, a data block containing multiple MPDU
packets is sent. Subsequently, the originator sends a Block Ack
Request (BAR) packet by
calling ieee80211_send_bar(). (net/mac80211/agg-tx.c)
The BAR is a control packet with the Block Ack Request subtype
(IEEE80211_STYPE_BACK_REQ).
The BAR packet includes the SSN (start sequence number), which is
the sequence number of the oldest MSDU in the block which should be
acknowledged.
The BAR (HT Block Ack Request) is defined in
include/linux/ieee80211.h (struct ieee80211_bar).
Its start_seq_num member is initialized to the proper SSN.
There are two types of Block Ack: Immediate Block Ack and Delayed
Block Ack.
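To make the SSN handling a bit more concrete, here is a user space
sketch of how a 12-bit sequence number can be packed into a 16-bit
sequence control style field (the low 4 bits are the fragment number)
and how two sequence numbers can be compared when they wrap around at
4096. The helper names are hypothetical; the 4-bit shift and the
modulo-4096 comparison are assumptions based on the 802.11 sequence
control layout, not a copy of the mac80211 code. Byte order conversion
is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SCTL_SEQ   0xFFF0   /* sequence number bits (IEEE80211_SCTL_SEQ) */
#define SN_MODULO  4096     /* sequence numbers wrap at 2^12 */

/* Pack an SSN into the start_seq_num field (fragment number = 0). */
uint16_t ssn_to_field(uint16_t ssn)
{
    return (uint16_t)((ssn % SN_MODULO) << 4);
}

/* Extract the SSN from a start_seq_num field. */
uint16_t field_to_ssn(uint16_t field)
{
    return (uint16_t)((field & SCTL_SEQ) >> 4);
}

/* Returns 1 if sequence number a comes before b, taking wraparound into account. */
int sn_before(uint16_t a, uint16_t b)
{
    return ((uint16_t)(a - b) & (SN_MODULO - 1)) > (SN_MODULO >> 1);
}
```

The wraparound-aware comparison matters for the recipient: sequence
number 4090 comes before 5, even though it is numerically larger.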
Mac80211 debugfs support:
In order to have mac80211 debugfs support, the kernel should be built
with CONFIG_MAC80211_DEBUGFS (and CONFIG_DEBUG_FS).
Then, after running:
mount -t debugfs debugfs /sys/kernel/debug
you can see debugfs entries under:
/sys/kernel/debug/ieee80211/phy*
Open Firmware
The Atheros 802.11n USB chipset (AR9170) has open firmware;
see http://www.linuxwireless.org/en/users/Drivers/ar9170.fw
802.11 AC
The next generation of 802.11 is AC.
Support for 802.11ac was added to the mac80211 stack; see, for
example, ieee80211_ie_build_vht_cap() in net/mac80211/util.c, and
struct ieee80211_vht_capabilities and struct ieee80211_vht_operation
in include/linux/ieee80211.h.
VHT stands for: Very High Throughput
Development:
Sending patches should be done against the wireless-testing tree
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git
The maintainer of compat wireless is Luis R. Rodriguez.
• The second mesh topology is partial mesh. With partial mesh, nodes
are connected to only some of the other nodes, not all. This topology
is much more common in wireless mesh networks.
In 2.6.26, the network stack added support for the draft of wireless
mesh networking (802.11s), thanks to the open80211s project. The
open80211s project goal was to create the first open implementation
of 802.11s. The project got some sponsorship from the OLPC project.
Luis Carlos Cobo, Javier Cardona and other developers from Cozybit
developed the Linux mac80211 mesh code. This code was merged into the
Linux kernel in the 2.6.26 release (July 2008). Some drivers in the
Linux kernel support mesh networking (ath5k, b43, libertas_tf, p54,
zd1211rw).
HWMP protocol.
802.11s defines a default routing protocol called HWMP (Hybrid
Wireless Mesh Protocol). HWMP works at layer 2 (MAC addresses), as
opposed to an IPv4 routing protocol, for example, which works at
layer 3 (IP addresses). HWMP routing is based on two types of routing
(hence it is called hybrid). The first is on-demand routing and the
second is proactive, dynamic routing. Currently only on-demand routing
is implemented in the Linux kernel. There are three types of messages
in on-demand routing. The first is PREQ (Path Request). A message of
this type is sent as a broadcast when we look for a destination to
which we do not yet have a route. The PREQ message is propagated in
the mesh until it gets to its destination. On each station along the
way, a lookup is performed (by mesh_path_lookup(),
net/mac80211/mesh_pathtbl.c). In case the lookup fails, the PREQ is
forwarded (as a broadcast).
The PREQ message is sent in a management packet; its subtype is
action (IEEE80211_STYPE_ACTION).
Then a PREP (Path Reply) unicast packet is sent. This packet is sent
along the reverse path.
The PREP message is also sent in a management packet; its subtype is
also action (IEEE80211_STYPE_ACTION). It is handled by
hwmp_prep_frame_process(). Both PREQ and PREP are sent in the
mesh_path_sel_frame_tx() function. If there is some failure on the
way, a PERR (Path Error) is sent. A PERR message is sent by
mesh_path_error_tx().
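The PREQ/PREP exchange described above can be modelled with a toy user
space program. This is a sketch under strong assumptions and not the
mac80211 implementation: the mesh is a fixed link matrix, every node
forwards a PREQ once, the first copy that arrives wins (the real
protocol compares metrics), and PERR handling is left out. All names
(MESH_NODES, discover_path(), next_hop[]) are hypothetical.

```c
#include <assert.h>

#define MESH_NODES 5
#define NO_HOP     (-1)

/*
 * Flood a PREQ from `orig`; if it reaches `target`, walk the PREP back
 * along the reverse path. On success, next_hop[n] is the neighbour that
 * node n uses to reach `target`, and the hop count is returned.
 * Returns -1 if the target cannot be reached.
 */
int discover_path(int link[MESH_NODES][MESH_NODES], int orig, int target,
                  int next_hop[MESH_NODES])
{
    int reverse[MESH_NODES];        /* neighbour that delivered the PREQ first */
    int seen[MESH_NODES] = {0};     /* nodes that already received the PREQ */
    int queue[MESH_NODES];
    int head = 0, tail = 0, hops = 0, n;

    for (n = 0; n < MESH_NODES; n++) {
        reverse[n] = NO_HOP;
        next_hop[n] = NO_HOP;
    }

    /* PREQ phase: broadcast flood, each node forwards the request once. */
    seen[orig] = 1;
    queue[tail++] = orig;
    while (head < tail && !seen[target]) {
        int cur = queue[head++];

        for (n = 0; n < MESH_NODES; n++) {
            if (link[cur][n] && !seen[n]) {
                seen[n] = 1;
                reverse[n] = cur;   /* remember the way back to the originator */
                queue[tail++] = n;
            }
        }
    }
    if (!seen[target])
        return -1;

    /* PREP phase: unicast from the target back along the reverse path. */
    for (n = target; n != orig; n = reverse[n]) {
        next_hop[reverse[n]] = n;   /* node on the path learns its forward hop */
        hops++;
    }
    return hops;
}
```

In a mesh with links 0-1, 1-2, 0-2 and 2-3, discovering a path from
node 0 to node 3 yields two hops (0, 2, 3); node 1 is not on the path
and learns nothing, and an isolated node 4 is reported as unreachable.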
Route selection takes into consideration a radio-aware metric (the
airtime metric). The airtime metric is calculated in the
airtime_link_metric_get() method (net/mac80211/mesh_hwmp.c), based on
rate and other hardware parameters. Mesh Points continuously monitor
their links and update metric values with neighbours.
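As a rough idea of what an airtime metric expresses, the sketch below
uses the formula from the 802.11s specification, cost = (O + Bt / r) /
(1 - ef), where O is a per-frame overhead, Bt is a test frame size of
8192 bits, r is the data rate and ef is the frame error rate. This is a
floating point illustration under those assumptions; the function name
airtime_metric_us() and its parameters are hypothetical, and the kernel
code uses fixed point arithmetic instead.

```c
#include <assert.h>

#define TEST_FRAME_BITS 8192.0  /* size of the reference test frame, in bits */

/*
 * Hypothetical helper: estimated airtime cost of a link in microseconds.
 * overhead_us: channel access and protocol overhead per frame
 * rate_mbps:   data rate of the link in Mbit/s
 * err_rate:    frame error rate, 0.0 (perfect) up to but excluding 1.0
 */
double airtime_metric_us(double overhead_us, double rate_mbps, double err_rate)
{
    assert(rate_mbps > 0.0);
    assert(err_rate >= 0.0 && err_rate < 1.0);

    /* bits divided by Mbit/s gives microseconds on the air */
    return (overhead_us + TEST_FRAME_BITS / rate_mbps) / (1.0 - err_rate);
}
```

A lower value means a better link: a faster rate shortens the time on
the air, while a higher error rate inflates the cost because frames have
to be retransmitted.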
The station which sent the PREQ may try to send packets to the final
destination while still not knowing the route to that destination;
these packets are kept in a buffer called frame_queue, which is a
member of struct mesh_path (net/mac80211/mesh.h). In such a case,
when a PREP finally arrives, the pending packets in this buffer are
sent to the final destination (by calling mesh_path_tx_pending()).
The maximum number of frames buffered per destination for unresolved
destinations is 10 (MESH_FRAME_QUEUE_LEN, defined in
net/mac80211/mesh.h).
The disadvantages:
● Many broadcasts limit network performance
● Not all wireless drivers support mesh mode at the moment.
-----------------------------------------------------------
The WRT54GL Linksys wireless router comes out of the factory with
Linux.
In case you want to hack mac80211 with OpenWrt, you can do it with
backfire or with kamikaze, which are versions of OpenWrt. In case of
kamikaze, you will soon find out that with recent kamikaze releases
(8.09.1 and 8.09.2), the wireless driver (kmod-b43) does not exist.
For this reason "opkg install kmod-b43" fails on kamikaze 8.09.1 and
kamikaze 8.09.2.
You can also use kamikaze 9.0.2 and build the broadcom wireless
driver as a kernel module.
A simple way of achieving this is thus:
"make kernel_menuconfig"
Then:
select driver/network/wireless/B43 by
build_dir/linux-brcm47xx/compat-wireless-2011-12-01/drivers/net/wireless/b43
Tip:
When working with the b43 kernel module (b43.ko) it is enough to run
make target/linux/compile
in order to create b43.ko (under
build_dir/linux-brcm47xx/linux-2.6.32.27/drivers/net/wireless/b43/)
and copy it.
When trying "ifconfig wlan0 up", you may get an error about a
missing firmware file, like this one:
"b43-phy0 ERROR: Firmware file "b43/ucode5.fw" not found or load
failed."
In that case, do as described in:
http://linuxwireless.org/en/users/Drivers/b43#devicefirmware
wlan0 Interface doesn't support scanning : Operation not supported
(so when booting with kamikaze 8.09 you do see wireless interface
when
running iwconfig).
- In case there are any problems with burning an image and you
cannot access the WRT54GL Linksys device, you can burn an image via
tftp, in this way:
tftp 192.168.1.1
bin
trace
timeout 60
rexmt 1
put nameOfFirmwareFile
- When using this method, you should download the firmware from the
Linksys site:
http://homesupport.cisco.com/en-us/support/routers/WRT54GL
- If you try to burn an OpenWrt image, you will most likely get
errors like:
....
...
tftp> put openwrt-brcm47xx-squashfs.trx
received ACK <block=0>
sent DATA <block=1, 512 bytes>
received ACK <block=0>
received ERROR <code=4, msg=code pattern incorrect>
Error code 4: code pattern incorrect
...
...
OpenFWWF website:
http://www.ing.unibs.it/~openfwwf/
- also for wrt54GL.
Building a firmware for b43 is simple:
download b43-tools and the b43 firmware.
From b43-tools/assembler run "make && make install".
(You only need the assembler for building the b43 firmware.)
b43-tools/assembler># make
CC b43-asm.bin
/usr/bin/ld: cannot find -lfl
If you get this error, make sure that flex-static and flex are
installed (yum install flex-static flex).
Then simply go to the folder where you extracted the firmware, and
run "make".
A file named "ucode5.fw" will be generated.
With b43 on the WRT54GL, we use SSB_BUSTYPE_SSB.
This means that in b43_wireless_core_start()
(drivers/net/wireless/b43/main.c), dev->dev->bus->bustype is
SSB_BUSTYPE_SSB and we call request_threaded_irq() and not
b43_sdio_request_irq().
(The other possibilities are SSB_BUSTYPE_PCI, SSB_BUSTYPE_PCMCIA or
SSB_BUSTYPE_SDIO).
TBD:
The following downloads 8.09.2 and not 8.09; how do you get 8.09 and
not 8.09.2?
You can download kamikaze 8.09 by:
svn co svn://svn.openwrt.org/openwrt/branches/8.09
OpenWrt repositories are in the following link:
https://dev.openwrt.org/wiki/GetSource
RFKILL
rfkill is a simple tool for accessing the Linux rfkill device interface,
which is used to enable and disable wireless networking devices,
typically WLAN, Bluetooth and mobile broadband.
"rfkill list" lists the rfkill status of each device.
"rfkill block" sets a soft lock.
"rfkill unblock" clears a soft lock.
see:
http://www.linuxwireless.org/en/users/Documentation/rfkill
WiMAX
LTE will undoubtedly be the 4G technology. There is a WiMAX solution
in the Linux kernel, though.
WiMAX (Worldwide Interoperability for Microwave Access) is based on
the IEEE 802.16 standard. It is a wireless solution for broadband WAN
(Wide Area Network). There are about 200 WiMAX projects around the
world. WiMAX products can accommodate fixed and mobile usage models.
There is a WiMAX Linux git tree, maintained by Inaky Perez-Gonzalez
from Intel. In the past, Inaky was involved in developing the Linux
USB stack and the Linux UWB (Ultra Wideband) stack. The WiMAX stack
and driver were accepted into mainline for 2.6.29 in January 2009.
The WiMAX support in Linux consists of a kernel module
(net/wimax/wimax.ko), device-specific drivers under it, and a user
space management stack, WiMAX Network Service. There was in the past
an initiative from Nokia for a WiMAX stack for Linux, but it is not
integrated currently. Also, work was done on a D-Bus interface to the
WiMAX stack, which will help user space tools manage the WiMAX stack.
There is currently one WiMAX driver in the Linux tree, the Intel
WiMAX Connection 2400 over USB driver (which supports any of the
Intel Wireless WiMAX/WiFi Link 5x50 series). The WiMAX stack uses the
generic netlink protocol mechanism to send and receive netlink
messages to and from userspace. Free form messages can be sent back
and forth between driver/device and user space.
batman-adv
The B.A.T.M.A.N. Advanced Meshing Protocol is
a routing protocol for multi-hop ad-hoc mesh networks. The networks
may be wired or wireless.
The implementation is in net/batman-adv
See http://www.open-mesh.org/
IEEE 802.15.4
IEEE standard 802.15.4 is for wireless personal area networks (WPAN).
Implementation in the Linux kernel tree: net/ieee802154/
The maintainers of IEEE 802.15.4 SUBSYSTEM are Alexander Smirnov
and Dmitry Eremin-Solenikov.
compat-wireless
compat-wireless is a backport of the wireless stack from newer
kernels to older ones.
Wi-Fi Direct, previously known as Wi-Fi P2P, is a standard that allows
Wi-Fi devices to connect to each other without the need for an Access
Point.
Links:
"Linux wireless networking", article from 2004
http://www.ibm.com/developerworks/library/wi-enable/index.html
Books:
TBD:
ACS (Automatic Channel Selection)
Useful tips:
Printing an IP address (%pI4 takes a pointer to an address in
network byte order):
__be32 ipAddr;
printk("ipAddr = %pI4\n", &ipAddr);
when
u32 ipAddr;
TBD!
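For comparison, a user space counterpart of the tip above can be written
with inet_ntop(). This is only a sketch: the helper names
format_ipv4_be() and format_ipv4_host() are hypothetical, and the second
one shows the conversion that is needed when the address is held in host
byte order (a plain u32) rather than in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Format an IPv4 address given in network byte order; returns buf or NULL. */
const char *format_ipv4_be(uint32_t addr_be, char *buf, size_t len)
{
    struct in_addr in = { .s_addr = addr_be };

    return inet_ntop(AF_INET, &in, buf, (socklen_t)len);
}

/* Format an IPv4 address given in host byte order; returns buf or NULL. */
const char *format_ipv4_host(uint32_t addr_host, char *buf, size_t len)
{
    /* convert to network byte order first, then format as above */
    return format_ipv4_be(htonl(addr_host), buf, len);
}
```

A buffer of INET_ADDRSTRLEN bytes is large enough for any dotted-quad
string; for example, the host order value 0xC0A80101 formats as
192.168.1.1.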
wireshark tip:
Sometimes you see in the wireshark sniffer
that the number of "Bytes on wire" is larger than the MTU
of the network card.
This is probably due to using Jumbo packets or offloading.
Links and more info
http://vger.kernel.org/netconf2011.html
11) http://www.policyrouting.org/PolicyRoutingBook/
12) TRASH: A dynamic LC-trie and hash data structure,
Robert Olsson and Stefan Nilsson, August 2006
http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
13) IPSec howto:
http://www.ipsechowto.org/t1.html
(trafgen; uses PF_PACKET RAW sockets and sendto() sys call)
22) splice tools: http://brick.kernel.dk/snaps/splice-git-latest.tar.gz
network splice receive:
http://lwn.net/Articles/236918/
23) Network namespaces - by Jonathan Corbet:
http://lwn.net/Articles/219794/
24) The initial change to napi_struct is explained in
http://lwn.net/Articles/244640/
http://www.kloth.net/services/iplocate.php
kernel networking repositories:
To clone the stable tree you should run:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git