6.5810: Lab 2

Setting up the experiment environment in Cloudlab

We are going to use c6525-100g machines equipped with a 24-core AMD 7402P 2.80GHz CPU, 128 GB ECC memory, two 1.6 TB NVMe SSDs, and two Mellanox ConnectX-5 NICs (25 GbE and 100 GbE). You can find the detailed hardware description and how the machines are interconnected here, and current availability here. Because availability is low, we have reserved enough machines for the class from Sep 21 8:00 AM to Sep 30 1:00 AM for this lab. Please make sure to complete the assignment in time to avoid running out of machines.

  1. We start by creating a Cloudlab profile with two c6525-100g machines connected by two links. After that, we install the dependencies and Mellanox OFED to set up the NICs. Refer to lab1’s instructions for details.

Now that we have OFED installed, we are ready to build the DPDK and SPDK libraries.

  1. Clone and build DPDK.
    $ git clone https://github.com/DPDK/dpdk
    $ cd dpdk
    $ git checkout releases
    $ meson build
    $ meson configure -Dprefix=$PWD/build build
    $ ninja -C build
    $ ninja -C build install  
    
  2. Clone and build SPDK. Set $DPDK_ROOT to the path of your DPDK folder.
    $ git clone https://github.com/spdk/spdk.git
    $ cd spdk
    $ git checkout v22.05.x
    $ sudo scripts/pkgdep.sh
    $ ./configure --with-dpdk=$DPDK_ROOT/build/
    $ make -j`nproc`
    

Congratulations! Now your SPDK and DPDK libraries are ready to use.
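As a quick sanity check, you can ask pkg-config to find the DPDK you just installed. This assumes the $PWD/build prefix from the step above and that meson placed the .pc files under lib/x86_64-linux-gnu (the exact lib subdirectory can differ across distributions); adjust the path if yours differs.

    $ export PKG_CONFIG_PATH=$DPDK_ROOT/build/lib/x86_64-linux-gnu/pkgconfig
    $ pkg-config --modversion libdpdk    # should print the DPDK version you checked out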

Understanding the “Hello World” example

In this lab, we will use SPDK’s NVMe driver to access the NVMe SSD of our machine. At a high level, SPDK exposes a queue pair abstraction (similar to DPDK) that lets a userspace program access the storage device directly. To help you grasp the basics of SPDK, we provide a “Hello World” example. Carefully read the code of the functions main_loop(), read_complete(), and write_complete(), and skim through the remaining code. The example sets up an SPDK queue pair, writes the “Hello World!” string into the first sector of the device, and reads the string back. Now let us try to run the example.

Hint: You can find SPDK’s API documentation here.
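If it helps to see the core control flow in one place, below is a condensed, illustrative sketch of the write-then-read pattern the example follows. The names (io_complete(), write_then_read(), ns, qpair) are placeholders, not the example’s actual identifiers (its callbacks are read_complete() and write_complete()), error and -ENOMEM handling is omitted, and the real setup (probe, attach, namespace and queue pair allocation) happens elsewhere in the example.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <spdk/env.h>
    #include <spdk/nvme.h>

    static bool g_io_done;

    /* Completion callback: SPDK invokes it from
     * spdk_nvme_qpair_process_completions() once the command finishes. */
    static void io_complete(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        (void)arg;
        if (spdk_nvme_cpl_is_error(cpl)) {
            fprintf(stderr, "I/O failed\n");
        }
        g_io_done = true;
    }

    /* Write "Hello World!" to LBA 0 and read it back. ns and qpair stand in
     * for the namespace and I/O queue pair obtained during probe/attach. */
    static void write_then_read(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
    {
        uint32_t sector_size = spdk_nvme_ns_get_sector_size(ns);
        /* I/O buffers must come from pinned, DMA-able memory. */
        char *buf = spdk_zmalloc(sector_size, 0x1000, NULL,
                                 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
        snprintf(buf, sector_size, "Hello World!");

        g_io_done = false;
        spdk_nvme_ns_cmd_write(ns, qpair, buf, 0 /* LBA */, 1 /* sectors */,
                               io_complete, NULL, 0);
        while (!g_io_done) {
            /* Polling the queue pair is what triggers io_complete(). */
            spdk_nvme_qpair_process_completions(qpair, 0);
        }

        memset(buf, 0, sector_size);
        g_io_done = false;
        spdk_nvme_ns_cmd_read(ns, qpair, buf, 0, 1, io_complete, NULL, 0);
        while (!g_io_done) {
            spdk_nvme_qpair_process_completions(qpair, 0);
        }
        printf("%s\n", buf);
        spdk_free(buf);
    }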

  1. Before running any SPDK application, we have to allocate hugepages and unbind the NVMe device from the native kernel driver. SPDK provides a handy script for this purpose.
    $ sudo PCI_BLOCKED=0000:c5:00.0 ./scripts/setup.sh 

Warning: remember to exclude the OS root drive from unbinding using the PCI_BLOCKED variable above. Otherwise, the root filesystem can be corrupted.
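If you are unsure which PCI address belongs to the OS drive, one way to look it up is shown below; the device names nvme0/nvme0n1 are examples, so check the lsblk output for the device that actually holds /.

    $ lsblk -o NAME,MOUNTPOINT                                  # find which nvme device is mounted at /
    $ grep PCI_SLOT_NAME /sys/class/nvme/nvme0/device/uevent    # PCI address of that controller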

  2. Build and run the example. You are expected to see the “Hello World!” string in the output.
    $ make build
    $ make run 

Building a storage server with SPDK

After grasping the basics of SPDK, it is now your turn to build a storage server. At a high level, the server accepts storage requests from a client over DPDK (either reading from a sector or writing to a sector) and services them on the NVMe device using SPDK.

We will divide the task into two steps. First, we will build a storage server using SPDK without networking support. After making sure it works, we will then add networking support with DPDK. We provide skeleton code for you as a starting point. Read through the code and comments for more detailed instructions. You are also welcome to write your own code from scratch.

Hint: After implementing the storage logic, you can test it by mocking recv_req_from_client() and send_resp_to_client(). Make sure your code is working before moving to the next step.

Hint: When the queue pair is full, the call to spdk_nvme_ns_cmd_{read/write}() will return -ENOMEM.

Hint: You must periodically invoke spdk_nvme_qpair_process_completions() to drain the completion queue and trigger the callbacks you passed to spdk_nvme_ns_cmd_{read/write}().
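To make the last two hints concrete, here is one illustrative shape for the submission path. The names (submit_read(), req_done(), server_loop()) are placeholders rather than part of the skeleton, and the real loop also has to receive requests and send responses.

    #include <errno.h>
    #include <stdint.h>
    #include <spdk/nvme.h>

    /* Placeholder completion callback: in the server this is where you would
     * mark the request as finished and queue the response to the client. */
    static void req_done(void *req_ctx, const struct spdk_nvme_cpl *cpl)
    {
        (void)req_ctx;
        (void)cpl;
    }

    /* Submit a one-sector read; if the queue pair is full (-ENOMEM), drain
     * some completions and retry. */
    static void submit_read(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                            void *buf, uint64_t lba, void *req_ctx)
    {
        int rc;
        do {
            rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, 1, req_done, req_ctx, 0);
            if (rc == -ENOMEM) {
                spdk_nvme_qpair_process_completions(qpair, 0);
            }
        } while (rc == -ENOMEM);
    }

    /* The main loop interleaves request handling with completion polling so
     * that req_done() keeps firing even while new requests arrive. */
    static void server_loop(struct spdk_nvme_qpair *qpair)
    {
        for (;;) {
            /* ... receive a request and call submit_read()/submit_write() ... */
            spdk_nvme_qpair_process_completions(qpair, 0);
        }
    }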

Adding Networking Support with DPDK

Now that you have a solid storage logic implementation, your next step is to add networking support. More specifically, implement recv_req_from_client() and send_resp_to_client() using DPDK. You can largely reuse your ping server code from lab1, replacing the ping request with the storage request. For simplicity, you can encapsulate the storage request directly in an Ethernet frame instead of in an IP packet. For this assignment, use port 2 in DPDK for client/server communication.

Hint: The sector size of our NVMe device (512 B) is much smaller than the standard MTU size (1,500 B). Therefore, you do not have to do any fragmentation. 
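One possible wire format is sketched below, purely as an illustration: the struct name, opcode values, and the idea of a custom EtherType are choices you are free to make differently; nothing here is mandated by the skeleton.

    #include <stdint.h>
    #include <rte_ether.h>

    /* Hypothetical request layout, carried directly in an Ethernet frame.
     * Each request touches exactly one 512 B sector, so a single frame is
     * always enough. */
    #define STORAGE_OP_READ  1
    #define STORAGE_OP_WRITE 2

    struct __attribute__((packed)) storage_req {
        struct rte_ether_hdr eth;   /* dst/src MAC plus a custom EtherType of your choice */
        uint8_t  op;                /* STORAGE_OP_READ or STORAGE_OP_WRITE */
        uint64_t lba;               /* sector number to read or write */
        uint8_t  data[512];         /* payload for writes; response data for reads */
    };

With a layout like this, recv_req_from_client() would call rte_eth_rx_burst() on port 2 and interpret each frame through rte_pktmbuf_mtod(), while send_resp_to_client() would build a similar response frame and hand it to rte_eth_tx_burst().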

Writing a client to benchmark the storage server

Now that you have a fully functional storage server, your final task is to write a client to benchmark its performance. Similarly, you can pretty much reuse your ping client code from lab1.

We will first benchmark the unloaded latency of the storage server. You can achieve this by sending one storage request at a time to the server and measuring the elapsed time until the response arrives.
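A minimal sketch of this closed-loop measurement is below; send_one_request() and wait_for_response() are hypothetical helpers you would implement on top of your lab1 DPDK client code (e.g., with rte_eth_tx_burst() and rte_eth_rx_burst()).

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical helpers: implement them with your lab1 DPDK client code. */
    void send_one_request(void);
    void wait_for_response(void);

    /* Closed loop: exactly one outstanding request, so each iteration's
     * elapsed time is one sample of the unloaded end-to-end latency. */
    static void measure_latency(int iters)
    {
        struct timespec start, end;
        for (int i = 0; i < iters; i++) {
            clock_gettime(CLOCK_MONOTONIC, &start);
            send_one_request();
            wait_for_response();
            clock_gettime(CLOCK_MONOTONIC, &end);
            double us = (end.tv_sec - start.tv_sec) * 1e6 +
                        (end.tv_nsec - start.tv_nsec) / 1e3;
            printf("%.2f\n", us);   /* dump samples, e.g. to plot the CDF later */
        }
    }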

Then we will benchmark the throughput of the storage server. You can achieve this by keeping multiple storage requests in flight to stress the server.

Hint: For the throughput measurement, you will see packet drops and performance collapse if the client keeps sending requests at a rate above the storage server’s capacity.
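One simple way to respect this hint is a fixed-window client: keep at most a fixed number of requests outstanding, only send a new request when a response comes back, and sweep the window size to find the peak. A sketch with hypothetical helpers analogous to the ones above (packet loss and retransmission handling are omitted):

    #include <time.h>

    /* Hypothetical helpers built on your lab1 DPDK client code. */
    void send_one_request(void);
    int poll_responses(void);   /* non-blocking; returns the number of replies seen */

    /* Fixed-window load generator: never more than `window` requests in
     * flight, so the server is stressed without being overrun. */
    static double measure_throughput(int window, long total)
    {
        long sent = 0, done = 0;
        int inflight = 0;
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while (done < total) {
            while (inflight < window && sent < total) {
                send_one_request();
                sent++;
                inflight++;
            }
            int n = poll_responses();
            done += n;
            inflight -= n;
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        return (double)total / secs;   /* requests per second */
    }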

Hand-in Instructions

  1. Write a short paragraph summarizing your design of the server and the client.
  2. For latency, break down where the time goes: how much is spent in your code and the DPDK/SPDK stacks versus in the networking and storage hardware.
  3. Measure the end-to-end latency multiple times to get enough data points. Draw a CDF graph of the end-to-end latency. What do you observe? Can you explain why?
  4. Attach the throughput number you get.
  5. Do you think your implementation is optimal in terms of latency and throughput? If not, how would you further improve it?
  6. Submit your source code (excluding SPDK and DPDK libraries) and answers (of Q1-Q5) to 6828seminar-staff@lists.csail.mit.edu. You can submit multiple times before the deadline; the latest one will be used for grading.

Optional Challenge 1: If your current implementation is single-threaded, try to extend it to be multi-threaded to use more CPU cores. What is the highest throughput you get? And where is the bottleneck in this case?

Optional Challenge 2: Implement in-memory caching to accelerate your storage server. Generate requests with a skewed distribution to benchmark it.