ECMWF's next-generation high-performance computing facility is being installed in its new Bologna data centre, and other production services will be migrated there shortly afterwards. This presents new challenges: during the transition period, forecast data will have to be copied between the Reading (UK) and Bologna (Italy) data centres over 100 Gbit/s connections to the public Internet. It was therefore imperative to better understand the issues related to long-distance data transfers so that they could be mitigated.
The importance of testing
Data transfers over long distances and Wide Area Networks (WANs) are inherently less predictable than data transfers over Local Area Networks (LANs). This is primarily because the data must pass through many devices administered by multiple third parties, so issues cannot easily be diagnosed and fixed without their involvement. Physics also plays a part, as longer transfer distances introduce substantial latency: for example, there is around 40 ms of latency between Reading and Bologna, compared with less than 1 ms on a local network. Finally, Internet protocol limitations and the envisaged transfer methods also need to be considered.
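To give a sense of why latency matters so much at these speeds, consider the bandwidth-delay product: the amount of data that must be "in flight" at any moment to keep a link fully utilised, which TCP buffers have to accommodate. The sketch below is illustrative only; it assumes a 100 Gbit/s link and treats the quoted ~40 ms figure as a round-trip time.

```python
def bandwidth_delay_product(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be 'in flight' to keep the link fully utilised."""
    return bandwidth_bps / 8 * rtt_s

# Assumed: 100 Gbit/s link, ~40 ms Reading-Bologna round-trip time
bdp = bandwidth_delay_product(100e9, 0.040)
print(f"{bdp / 1e6:.0f} MB")  # 500 MB of buffering needed to fill the pipe
```

Half a gigabyte of outstanding data per saturated link is far beyond typical default TCP buffer sizes, which is one reason long-distance transfers need dedicated system tuning.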
Therefore, testing is a vital stage in the solution delivery process, as it provides an empirical way to determine whether the solution meets its feature and performance targets. Early testing is particularly beneficial in projects which are complex or involve factors outside organisational control, in this instance the use of the Internet as a medium for long-distance data transfers. All in all, ensuring that long-distance data transfers perform well is a science that combines elements of physics, system performance tuning and application design. For this reason, simulation and testing are essential for understanding what performance can be expected and for finding the best system and application tuning parameters.
Testing long-distance data transfers
A multi-stage approach to testing was applied, with the initial tests being based on 'synthetic' benchmarks rather than real-life performance. From February 2020 to March 2021, tests were carried out using two 100-Gbit-capable test servers.
In order to progress at the right pace and increase the knowledge on these matters, ECMWF enlisted the help of a long-distance bandwidth testing expert from the Gigabit European Academic Network Technology (GÉANT), the pan-European data and communication network for Europe's education and research community. Dr Richard Hughes-Jones advised on how best to configure the test servers and carried out the bulk of the tests along with ECMWF's Networks and Security and Data Handling System teams.
Test results
The initial tests were carried out from February to October 2020 with two servers connected directly to each other. This established a baseline against which the performance of real WAN connections could be measured. Some tests were also made against GÉANT test servers, first in London and then in Paris. These were useful as they made it possible to estimate how increasing latency (ca. 9 ms London, ca. 16 ms Paris) affects performance.
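These latency figures translate directly into per-flow throughput ceilings: with a fixed TCP window, a flow cannot transfer more than one window of data per round trip. The sketch below is illustrative only; the 64 MB window is a hypothetical value, and the quoted latencies are treated as round-trip times.

```python
def window_limited_throughput_gbps(window_bytes: float, rtt_s: float) -> float:
    """With a fixed TCP window, throughput is capped at window / RTT."""
    return window_bytes * 8 / rtt_s / 1e9

# Assumed 64 MB window; RTTs from the article's latency estimates
for city, rtt_ms in [("London", 9), ("Paris", 16), ("Bologna", 40)]:
    cap = window_limited_throughput_gbps(64e6, rtt_ms / 1000)
    print(f"{city}: {cap:.1f} Gbit/s")
```

The same window that nearly fills the link to London supports only a fraction of the bandwidth to Bologna, which is why the London and Paris tests were useful stepping stones for estimating the effect of distance.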
All these tests made it possible to identify the best tuning parameters, so that the final tests between Reading and Bologna could be run quickly and efficiently once the Bologna Internet circuits went live in November 2020.
As a single high-bandwidth data flow is more susceptible to protocol limitations and performance issues from packet loss than multiple low-bandwidth data flows, it is common practice to run several data flows simultaneously over a significant period (600 seconds or more) to ascertain the overall performance of a link. The exact number of flows is not relevant, as the goal is to measure the highest speed that can be obtained, although it is provided for information in the graphs.
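The sensitivity of a single flow to packet loss can be illustrated with the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ MSS / (RTT · √p), where p is the packet loss rate (the constant factor of roughly 1.2 is dropped here for simplicity). The figures below are illustrative assumptions, not measurements from these tests.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Simplified Mathis approximation: steady-state TCP throughput in bits/s."""
    return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate))

# Assumed: ~40 ms RTT, 1460-byte MSS, one lost packet per million
single = mathis_throughput_bps(1460, 0.040, 1e-6)
print(f"single flow: {single / 1e9:.2f} Gbit/s")

# Parallel flows each recover from losses independently, so the
# aggregate scales roughly linearly with the flow count.
print(f"8 flows:     {8 * single / 1e9:.2f} Gbit/s")
```

Even a tiny loss rate at high latency sharply limits what one flow can achieve, which is why aggregate multi-flow tests give a truer picture of a link's capacity.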
The graphs summarise the data transfer speeds obtained during the tests. The speed variation shown for the real-life test is due to: (a) the uncertain state of the Internet according to the time of day at which the flows take place; (b) outages affecting the networks at any given time; and (c) the level of usage of the networks of the many Internet Service Providers (ISPs) involved in the end-to-end transfers.
Conclusion
As shown in this article, a well-thought-out testing strategy plays a crucial part in successful project delivery.
The test results gathered suggest that 25 Gbps for a single data flow and 90 Gbps for multiple flows should be achievable over Reading-Bologna site-to-site links. Despite the problems associated with using the public Internet, this compares very well with the baseline results of 30 Gbps and 100 Gbps respectively that we obtained when the test systems were directly connected.
This article has presented the results of network-level tests. The next stage will be to see how real-world applications behave when real transfers start to happen over the coming months, and what effect the ever-changing nature of the Internet has on the speed and reliability of these transfers over medium to long time frames.