The Pandora Papers have rocked the world. Since news organisations began publishing their explosive contents on October 3, the giant leak has dominated headlines and posed questions of some of the world’s most powerful people and their financial propriety.
Everyone from former UK prime minister Tony Blair to the King of Jordan has been dragged into a murky world of offshore finance, with stunning allegations being uncovered daily. And not for the first time, calls have been made to crack down on offshore financial products and institutions, and to instigate a fairer tax regime.
The Pandora Papers revelations came from an unfathomably large tranche of documents: 2.94 terabytes of data in all, comprising 11.9 million records and documents dating back to the 1970s. But how do you securely handle a leak of such size, when documents come in all shapes and formats, some dating back five decades?
The organisation behind the Pandora Papers leak, the International Consortium of Investigative Journalists (ICIJ), has spent the best part of a year coordinating simultaneous reporting from 150 different media outlets in 117 countries. Bringing these stories of hidden finance to light took a lot of technical infrastructure. “We had data from 14 different offshore providers,” says Delphine Reuter, a Belgian data journalist and researcher at the ICIJ. Work began on analysing the data in November 2020.
“The first challenge for us was to get the data,” explains Pierre Romera, chief technology officer at the ICIJ. “We exchanged for weeks and months with the sources, and at a point we had to find a way to get the data.” Initially, the ICIJ brokered a deal with its sources that would allow them to send the data remotely without needing to travel, but as the size of the document dump grew, so did the challenges in ensuring it all could be sent to a secure server. Some members of the ICIJ team met directly with sources and collected huge hard drives containing the documents.
But the sheer size of the leak was still tricky to navigate. “They’re massive,” Romera says. Analysing such a volume of data isn’t a job for Excel or existing database management programs. “You can’t just go at it with classic tools. There’s nothing in the market for journalists that can ingest so much data.” Worse, four million of the files were PDFs – notoriously bad to interrogate. “PDFs are horrible to extract information from,” says Reuter. And they weren’t ordinary PDFs either: seemingly unrelated documents were scanned together into single PDF files without rhyme or reason. “You might have copies or emails or registers of directors within the information we were interested in,” she adds.
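The problem Reuter describes, with unrelated records scanned together into one PDF, means a first pass has to guess where each document starts before any analysis can begin. A minimal sketch of that idea, assuming the per-page text has already been pulled out by an OCR or PDF library (the header keywords and page contents here are invented for illustration, not taken from the leak):

```python
# Illustrative heuristic only: split a bundle of per-page text into
# separate documents whenever a page opens with a known document header.
HEADERS = ("FROM:", "CERTIFICATE OF INCORPORATION", "REGISTER OF DIRECTORS")

def split_bundle(pages):
    """Group pages into documents, starting a new one at each header page."""
    docs = []
    for page in pages:
        lines = page.strip().splitlines()
        first_line = lines[0].upper() if lines else ""
        is_header = any(first_line.startswith(h) for h in HEADERS)
        if docs and not is_header:
            docs[-1].append(page)   # continuation of the current document
        else:
            docs.append([page])     # header page (or first page) opens a new one
    return docs

pages = [
    "From: agent@example.com\nRe: new company",
    "...body of the email...",
    "REGISTER OF DIRECTORS\nAcme Holdings Ltd",
]
print([len(doc) for doc in split_bundle(pages)])  # → [2, 1]
```

A real pipeline would need far more robust signals than a keyword list (OCR errors alone defeat exact matching), which is part of why four million PDFs posed such a challenge.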
However, the ICIJ has had practice in parsing huge troves of information. The Panama Papers, which in 2016 uncovered the rogue offshore finance industry through 11.5 million leaked documents totalling 2.6 terabytes of data, gave the coalition of investigative journalists a set of best practices for handling all that data. “We created our own tools and technology to extract the text and make it searchable,” says Romera. That task fell to a team including Bruno Thomas, senior developer at the ICIJ, who prepared the data to be accessible for scores of reporters worldwide.
The ICIJ used two self-developed technologies in combination to comb through the documents. One, Extract, is able to share the computational load of extracting information between multiple servers. “When you have millions of documents, Extract is able to tell a server to look at one document and another server to look at another,” Romera says. Extract is part of a larger ICIJ project, called Datashare, which is a data structuring tool. “Everyone has to use Datashare to explore the documents,” says Reuter. “They can download documents to their own machine, but they have to use Datashare to search the documents because it’s not doable to go through 11.9 million documents without the system.”
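The load-sharing Romera describes can be sketched in a few lines: if every coordinator assigns documents to extraction servers with the same stable hash, each server knows which files it owns without consulting the others. This is a toy illustration of the general technique, not the ICIJ's actual implementation; the server names and file paths are made up:

```python
# Deterministically assign each document to one of several extraction
# servers, so the work of processing millions of files can be shared.
import hashlib

SERVERS = ["extract-01", "extract-02", "extract-03"]  # hypothetical hosts

def assign(doc_path, servers=SERVERS):
    """Map a document path to a server via a stable hash of the path."""
    digest = hashlib.sha256(doc_path.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

batch = [
    "leak/provider-a/0001.pdf",
    "leak/provider-a/0002.pdf",
    "leak/provider-b/mails.mbox",
]
for path in batch:
    print(f"{assign(path)} <- {path}")
```

In practice a production system layers a job queue and retry logic on top of the assignment step, with the extracted text then fed into a search index so tools like Datashare can query it.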