Improvements in the LHCb DAQ

The LHCb Data Acquisition (DAQ) system consists of about 300 FPGA-powered data sources connected to a large farm of about 1500 x86 servers. The connection is made by an Ethernet Local Area Network with more than 3000 ports. The very simple, connection-less push protocol for event-building employed by LHCb relies critically on extremely low loss rates in the network. Since the last presentation of this system at the 2010 Real Time conference, the redundancy of the system has been significantly improved and it has also grown in size. The added redundancy has increased the complexity of the network, but we have managed to hide this from the event-builder “applications” on the FPGAs and on the individual CPU nodes. This setup and the challenges that come with it will be described in this paper. Ageing network hardware cannot always be replaced identically, because maintenance of old network devices becomes very expensive. We have therefore begun a campaign to identify replacement devices and will describe our procedure and measurement results.

One feature of the LHCb data acquisition system that distinguishes it from the other LHC DAQs is that the Timing and Fast Control (TFC) system, LHCb's variant of the LHC-wide Timing and Trigger Control (TTC), is also used for event management. The TFC is a hard real-time system, which has to collaborate with a variable-latency network for the purpose of event management. Together with the unreliable event-building protocol, this makes the overall system sensitive to the distribution of latencies in the network, which can occasionally cause problems. To better understand these effects we have measured the timing structure of the network traffic in real time using independent FPGA-based network probes. These rather challenging measurements on a live 500 Gbit/s network will be used to improve the system for the next LHC run.

Another change is the addition of disks to the event-receiving nodes, which previously did purely transient processing. This has increased the “elasticity” of the system, at the expense of increased operational complexity. We will discuss the resulting performance and reliability issues.
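To illustrate why the connection-less push protocol is so sensitive to network loss, the following sketch shows a UDP-style event-builder receiver in Python. It is only an illustration of the principle, not the LHCb implementation: the port number, the fragment header layout and the use of UDP are assumptions made for the example.

# Minimal sketch (not the actual LHCb protocol) of a connection-less,
# push-style event-builder receiver. Fragments are pushed without
# acknowledgement, so any fragment lost in the network leaves its event
# permanently incomplete -- hence the reliance on extremely low loss rates.
# Port number and header layout (event number + source id, 8 bytes each)
# are illustrative assumptions.
import socket
import struct
from collections import defaultdict

N_SOURCES = 300          # roughly the number of FPGA data sources
PORT = 45000             # hypothetical event-building port

def receive_events(max_datagrams=100000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    fragments = defaultdict(dict)   # event number -> {source id: payload}

    for _ in range(max_datagrams):
        datagram, _addr = sock.recvfrom(65536)
        event_no, source_id = struct.unpack("!QQ", datagram[:16])
        fragments[event_no][source_id] = datagram[16:]

        # An event is complete only when every source has pushed its
        # fragment; there is no mechanism to request a missing one again.
        if len(fragments[event_no]) == N_SOURCES:
            yield event_no, fragments.pop(event_no)

if __name__ == "__main__":
    for event_no, event in receive_events():
        print(f"built event {event_no} from {len(event)} fragments")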
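The probe measurements mentioned above amount, in essence, to building latency distributions from per-packet timestamps recorded on the live network. The sketch below shows such an offline analysis; the CSV input layout and its column names are assumptions made purely for illustration, since the real probes are FPGA based and their output format is not described here.

# Sketch of an offline analysis of probe timestamps: from per-packet
# send/receive times (in nanoseconds) it derives the latency distribution
# whose tails can disturb the TFC-driven event management. The input file
# format (columns "t_sent_ns", "t_received_ns") is an assumption.
import csv
import statistics

def latency_histogram(path, bin_width_us=10.0):
    latencies_us = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            latencies_us.append(
                (int(row["t_received_ns"]) - int(row["t_sent_ns"])) / 1000.0
            )

    bins = {}
    for lat in latencies_us:
        b = int(lat // bin_width_us)
        bins[b] = bins.get(b, 0) + 1

    print(f"packets: {len(latencies_us)}")
    print(f"mean latency: {statistics.mean(latencies_us):.1f} us")
    print(f"max latency:  {max(latencies_us):.1f} us")
    for b in sorted(bins):
        print(f"{b * bin_width_us:8.0f}-{(b + 1) * bin_width_us:8.0f} us: {bins[b]}")

if __name__ == "__main__":
    latency_histogram("probe_dump.csv")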