Apply for an account using the signup link. On the “Person Information” panel, click “Join Existing Project” and fill in our project name “MIT6828”. Your application will be approved by us shortly.
We will use m510 machines, each equipped with an eight-core Intel Xeon D-1548 2.0 GHz CPU, 64 GB of ECC memory, 256 GB of NVMe flash storage, and a dual-port Mellanox ConnectX-3 10 Gbps NIC. You can find the detailed hardware description and how the machines are interconnected here and current availability here. In our experience, m510 machines are readily available most of the time, but start the assignment early so that an availability shortage does not cause you to miss the deadline.
Mellanox ConnectX-3 requires the MLX4 poll mode driver library (librte_pmd_mlx4) so that DPDK can poll packets directly from the NIC. See the detailed documentation here. To enable the mlx4 driver, you first need to install Mellanox OFED (OpenFabrics Enterprise Distribution) on the machines.
$ sudo apt-get update
$ sudo apt-get install meson python3-pyelftools
$ wget https://content.mellanox.com/ofed/MLNX_OFED-4.9-5.1.0.0/MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64.tgz
$ tar -xvzf MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64.tgz
$ cd MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64
$ sudo ./mlnxofedinstall --upstream-libs --dpdk
$ sudo /etc/init.d/openibd restart
$ ibv_devinfo
If the installation succeeded, ibv_devinfo reports two available ports on the machine. Note that port numbering in ibv_devinfo starts from 1, whereas in DPDK it starts from 0. On the m510 machine, the first port (port 0 in DPDK) is used for the public (inter-cluster) connection, and the second port (port 1 in DPDK) is used for the private (intra-cluster) connection. To measure the latency between the two machines, we are going to use the second port (port 1 in DPDK) of the NIC.
Now that you have OFED installed, you are ready to build the DPDK library.
$ git clone https://github.com/DPDK/dpdk
$ cd dpdk
$ git checkout releases
$ meson build -Dexamples=all
$ cd build
$ meson configure
$ ninja
$ echo 1024 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
Congratulations! Your DPDK library is now ready to use. Before going on, you are encouraged to play with DPDK's sample applications in the examples folder of DPDK. Their documentation is here. Their compiled binaries are in build/examples.
Warning: The machine will be completely erased when the experiment ends (~16 hours by default). To keep your progress, consider storing your code on GitHub (using a private repo) or on your personal machine. CloudLab also supports creating a disk image of your machine as a snapshot. Note that your home directory will not be saved in disk images, but you can place data that should be included in the image in /opt.
In this assignment, you will build a server that can respond to ICMP echoes using DPDK. Since DPDK works directly with Ethernet frames, you will have to manually parse and modify the IP and ICMP headers of echo requests. DPDK uses the struct rte_mbuf data structure to store packet buffers. The programming guide explains this data structure here.
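The parsing step can be sketched without DPDK. The following is a minimal, standalone sketch in plain C that inspects a raw byte buffer; in DPDK the buffer would be the pointer obtained with rte_pktmbuf_mtod. The function name is_icmp_echo_request and the EX_ constants are hypothetical names introduced for illustration only, and the sketch assumes untagged Ethernet II frames (no VLAN tag).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch; DPDK ships its own definitions. */
enum {
    EX_ETH_HDR_LEN    = 14,      /* dst MAC (6) + src MAC (6) + EtherType (2) */
    EX_ETHERTYPE_IPV4 = 0x0800,
    EX_IP_PROTO_ICMP  = 1,
    EX_ICMP_ECHO_REQ  = 8,
    EX_ICMP_HDR_LEN   = 8
};

/* Returns 1 if the frame holds an IPv4 ICMP echo request, 0 otherwise. */
static int is_icmp_echo_request(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);

    if (len < EX_ETH_HDR_LEN + 20)        /* Ethernet + minimal IPv4 header */
        return 0;

    /* EtherType is stored big-endian at bytes 12..13. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype != EX_ETHERTYPE_IPV4)
        return 0;

    const uint8_t *ip = frame + EX_ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)                /* IP version field */
        return 0;

    /* IHL counts 32-bit words, so the header may be longer than 20 bytes. */
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;
    if (ihl < 20 || len < EX_ETH_HDR_LEN + ihl + EX_ICMP_HDR_LEN)
        return 0;

    if (ip[9] != EX_IP_PROTO_ICMP)        /* protocol field */
        return 0;

    const uint8_t *icmp = ip + ihl;
    return icmp[0] == EX_ICMP_ECHO_REQ && icmp[1] == 0;   /* type 8, code 0 */
}
```

Reading the IHL field rather than assuming a 20-byte IPv4 header keeps the ICMP offset correct if a sender includes IP options.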
If you find it hard to get started, use DPDK's simple L2 forwarding example (examples/skeleton) as a starting point. You will need to change the port initialization code, since you will use only port 1 in this experiment (remember that port 0 is needed for Linux and ssh). You have to modify lcore_main() to incorporate your packet parsing and crafting logic into its run-to-completion loop. You might find the following macros and functions useful; they are described in DPDK's API documentation.
rte_pktmbuf_pool_create
rte_pktmbuf_free
rte_pktmbuf_mtod
rte_pktmbuf_pkt_len
rte_pktmbuf_mtod_offset
rte_eth_rx_burst
rte_eth_tx_burst
rte_is_same_ether_addr
rte_ether_addr_copy
rte_ipv4_cksum
rte_raw_cksum
RTE_IP_ICMP_ECHO_REQUEST
RTE_IP_ICMP_ECHO_REPLY
Hint: It may be easier to modify each received packet buffer in place before sending it, rather than creating a new packet buffer.
Hint: IP and ICMP checksums must be updated if you modify the packet. DPDK provides several functions to help calculate checksums.
Hint: See RFC 792, ICMP echo, for more details about how ping works.
Hint: Make sure your Ethernet header contains the right MAC addresses.
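The hints above can be sketched in plain C, independent of DPDK. This is a minimal sketch, not a reference solution: inet_checksum implements the RFC 1071 Internet checksum by hand (in DPDK, rte_ipv4_cksum and rte_raw_cksum serve a similar purpose), and make_echo_reply and swap_bytes are hypothetical helper names. It assumes the frame is an untagged IPv4 ICMP echo request that has already been validated, and that the request was sent to the server's own MAC address, so swapping source and destination yields the right addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 Internet checksum over len bytes (intended for packet-sized inputs). */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)           /* sum big-endian 16-bit words */
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                               /* odd trailing byte, zero-padded */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                          /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Swaps two n-byte fields (n <= 6), e.g. MAC or IPv4 addresses. */
static void swap_bytes(uint8_t *a, uint8_t *b, size_t n)
{
    uint8_t tmp[6];

    assert(n <= sizeof tmp);
    memcpy(tmp, a, n);
    memcpy(a, b, n);
    memcpy(b, tmp, n);
}

/* Turns an echo request into an echo reply, modifying the buffer in place. */
static void make_echo_reply(uint8_t *frame, size_t len)
{
    uint8_t *ip = frame + 14;                          /* after Ethernet header */
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;           /* IPv4 header length */
    size_t ip_total = (size_t)((ip[2] << 8) | ip[3]);  /* IPv4 total length */

    assert(ihl >= 20 && ip_total >= ihl + 8 && len >= 14 + ip_total);

    uint8_t *icmp = ip + ihl;
    size_t icmp_len = ip_total - ihl;   /* excludes any Ethernet padding */
    uint16_t c;

    swap_bytes(frame, frame + 6, 6);    /* Ethernet dst <-> src */
    swap_bytes(ip + 12, ip + 16, 4);    /* IPv4 src <-> dst */

    /* Recompute the IPv4 header checksum with the field zeroed first. */
    ip[10] = 0;
    ip[11] = 0;
    c = inet_checksum(ip, ihl);
    ip[10] = (uint8_t)(c >> 8);
    ip[11] = (uint8_t)(c & 0xFF);

    /* Type 0 is echo reply; the ICMP checksum covers header and payload. */
    icmp[0] = 0;
    icmp[2] = 0;
    icmp[3] = 0;
    c = inet_checksum(icmp, icmp_len);
    icmp[2] = (uint8_t)(c >> 8);
    icmp[3] = (uint8_t)(c & 0xFF);
}
```

The ICMP length is taken from the IPv4 total-length field rather than from the frame length, so padding bytes added to short Ethernet frames do not end up in the checksum. A quick self-check is that inet_checksum over a header with its checksum field filled in returns 0.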
Now let us use Linux's ping tool as the client to test the server you just built. You can generate ping requests on the client machine as follows:
$ sudo ifconfig eno1d1 192.168.1.2 netmask 255.255.255.0
$ sudo arp -s 192.168.1.3 [your server's eno1d1 MAC address]
$ sudo ping -f 192.168.1.3 -c 500000
The last command takes around 10 seconds to complete. If successful, you will see ping statistics with zero packet loss.
Hint: If your code does not work correctly, tcpdump and dpdk-dumpcap are useful tools for debugging. You can use tcpdump for client-side diagnosis and/or dpdk-dumpcap for server-side diagnosis.
Now that you have a working DPDK-based ping server, the next step is to replace Linux's ping client with a DPDK-based ping client. The client mainly does two things: 1) it sends an ICMP echo request packet to the server, and 2) it receives the ICMP echo reply packet back from the server. For simplicity, you can hardcode the request packet in the client.
Hint: Run your ping server with the Linux client and print out the received echo request packet.
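One way to produce a fixed request packet is to fill a byte buffer field by field and then copy it into a packet buffer before sending. The sketch below is plain C with no DPDK calls; build_echo_request, ex_cksum, ex_put16, and EX_FRAME_LEN are hypothetical names, and the MAC and IP addresses are supplied by the caller. It builds a request with no ICMP payload, which makes a 42-byte frame. That is shorter than Ethernet's 60-byte minimum (excluding the FCS), so a payload or padding may be needed in practice; the sketch does not address that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { EX_FRAME_LEN = 14 + 20 + 8 };   /* Ethernet + IPv4 + ICMP, no payload */

/* RFC 1071 Internet checksum, used for both the IPv4 and ICMP headers. */
static uint16_t ex_cksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    if (i < len)
        sum += (uint32_t)(p[i] << 8);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Stores a 16-bit value in network (big-endian) byte order. */
static void ex_put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

/* Fills out[0..EX_FRAME_LEN-1] with an ICMP echo request; returns its length. */
static size_t build_echo_request(uint8_t *out,
                                 const uint8_t src_mac[6], const uint8_t dst_mac[6],
                                 const uint8_t src_ip[4], const uint8_t dst_ip[4],
                                 uint16_t id, uint16_t seq)
{
    assert(out != NULL);
    memset(out, 0, EX_FRAME_LEN);

    /* Ethernet header */
    memcpy(out, dst_mac, 6);
    memcpy(out + 6, src_mac, 6);
    ex_put16(out + 12, 0x0800);          /* EtherType: IPv4 */

    /* IPv4 header (20 bytes, no options) */
    uint8_t *ip = out + 14;
    ip[0] = 0x45;                        /* version 4, IHL 5 */
    ex_put16(ip + 2, 20 + 8);            /* total length: IPv4 + ICMP */
    ip[8] = 64;                          /* TTL */
    ip[9] = 1;                           /* protocol: ICMP */
    memcpy(ip + 12, src_ip, 4);
    memcpy(ip + 16, dst_ip, 4);
    ex_put16(ip + 10, ex_cksum(ip, 20)); /* checksum field was zero until now */

    /* ICMP echo request header */
    uint8_t *icmp = ip + 20;
    icmp[0] = 8;                         /* type 8: echo request, code 0 */
    ex_put16(icmp + 4, id);
    ex_put16(icmp + 6, seq);
    ex_put16(icmp + 2, ex_cksum(icmp, 8));

    return EX_FRAME_LEN;
}
```

Comparing the bytes produced here against an echo request captured from Linux's ping is a reasonable sanity check before sending anything from the DPDK client.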
Now that you have a working client and server that exchange ICMP echoes, the final step is to measure how much time your software and DPDK spend on one echo round trip. You have to figure out where in the code to place the timing API calls.
Hint: You might have to instrument part of the DPDK mlx4 driver.
For an accurate time measurement, you can read the CPU's time stamp counter (TSC) to get the elapsed cycles and calculate the elapsed time as cycles/freq. The following sample code shows how to do that in DPDK.
uint64_t hz = rte_get_timer_hz();
uint64_t begin = rte_rdtsc_precise();
// Do something
uint64_t elapsed_cycles = rte_rdtsc_precise() - begin;
uint64_t microseconds = elapsed_cycles * 1000000 / hz;
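The sample above reports whole microseconds, so a round trip of only a few microseconds loses most of its resolution to integer division. A minimal sketch of one alternative, in plain C, converts cycle counts to fractional nanoseconds and summarizes a batch of samples. The names cycles_to_ns, struct rtt_summary, and summarize_rtt are hypothetical; the hz argument is assumed to be the counter frequency, such as the value returned by rte_get_timer_hz().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts an elapsed cycle count to nanoseconds without integer truncation. */
static double cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    assert(hz > 0);
    return (double)cycles * 1e9 / (double)hz;
}

/* Minimum and mean round-trip time over a batch of samples. */
struct rtt_summary {
    double min_ns;
    double mean_ns;
};

/* cycles[i] is the elapsed cycle count of the i-th round trip. */
static struct rtt_summary summarize_rtt(const uint64_t *cycles, size_t n,
                                        uint64_t hz)
{
    assert(cycles != NULL && n > 0);

    uint64_t min = cycles[0];
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (cycles[i] < min)
            min = cycles[i];
        total += cycles[i];
    }

    struct rtt_summary s;
    s.min_ns = cycles_to_ns(min, hz);
    s.mean_ns = cycles_to_ns(total, hz) / (double)n;
    return s;
}
```

Storing raw cycle counts during the measurement and converting afterwards keeps floating-point arithmetic out of the timed path.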
If you want to learn more, this Intel white paper is a good reference. You’re free to use other methods to measure or infer your system’s performance, but make sure to describe your approach.