Yes, You Can Examine 650,000 Emails in Eight Days

On the eve of the 2016 Presidential Election, the controversy that FBI Director James Comey sparked when he revealed that a new batch of emails related to Hillary Clinton’s use of a private email server had been found on computers belonging to disgraced New York congressman Anthony Weiner took a sudden turn. Comey reported that the new emails were mostly duplicates of old emails; that there were few, if any, classified documents contained in this batch; and that nothing in the recently discovered material could cause him to change his July decision not to prosecute.

Predictably, cries of a “rigged election” have resurfaced in the Trump camp. Trump and his supporters claim that you can’t examine 650,000 emails in 8 days. One tweet put it this way: “There R 691,200 seconds in 8 days. DIR Comey has thoroughly reviewed 650,000 emails in 8 days? An email / second? IMPOSSIBLE.”

Unfortunately, comments like these suggest that some people have never heard of computers. Properly configured and programmed, a computer system can examine 650,000 emails reasonably well in ten minutes.
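To put the two figures side by side, here is a short back-of-the-envelope calculation. It only restates the numbers already quoted (650,000 emails, eight days, ten minutes); the function name `required_rate` is mine, chosen for illustration.

```python
EMAILS = 650_000
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(emails, seconds):
    """Emails that must be processed per second to finish in the given time."""
    return emails / seconds


eight_days = 8 * SECONDS_PER_DAY   # the tweet's time budget, in seconds
ten_minutes = 10 * 60              # the time budget claimed in this post

print(eight_days)                                       # 691200
print(round(required_rate(EMAILS, eight_days), 2))      # 0.94
print(round(required_rate(EMAILS, ten_minutes)))        # 1083
```

The tweet's arithmetic is right: eight days works out to a little under one email per second. Finishing in ten minutes requires roughly 1,083 emails per second, which is a modest rate for a program that is only comparing strings.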

I am not familiar with the procedure the FBI followed to review the emails. But, if I were to design the process, I would start by identifying unique senders, recipients, and subject lines, ones that had not been identified in the previous email analyses, and set those special emails off to the side for more in-depth examination. Computers are really good at comparing strings of characters like subject lines and email addresses, so it would take very little time to identify emails that deviate from the previous batches.
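As a rough illustration of that first step, here is a minimal sketch in Python. It is not the FBI's procedure, which I don't know. It assumes each email is available as raw text with standard headers, and the helper names `header_keys` and `find_novel` are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def header_keys(raw_message):
    """Collect the lowercased addresses and subject line of one raw email."""
    msg = message_from_string(raw_message)
    keys = set()
    for field in ("From", "To", "Cc"):
        # get_all returns every header of that name; getaddresses splits them
        # into (display name, address) pairs.
        for _name, addr in getaddresses(msg.get_all(field, [])):
            if addr:
                keys.add(("address", addr.lower()))
    subject = (msg.get("Subject") or "").strip().lower()
    if subject:
        keys.add(("subject", subject))
    return keys


def find_novel(new_messages, known_keys):
    """Return the new emails that have a sender, recipient, or subject
    not present in the previously analyzed batches."""
    return [raw for raw in new_messages if header_keys(raw) - known_keys]
```

In practice `known_keys` would be built once by taking the union of `header_keys` over the earlier batches. Each lookup against that set takes roughly constant time, so the cost grows with the number of new emails rather than with the size of the old collection.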

Computers also excel at keyword searches. Just ask Google, which performs 3.5 billion keyword searches per day (40,000 per second). I assume the FBI compiled a file with hundreds or thousands of keyword searches related to classified information. By flagging the documents turned up by the keyword search and cross-listing them with the distinctive emails found in the previous step, the FBI could identify the emails most likely to merit further examination quickly and effectively.
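The keyword pass and the cross-listing might look something like the sketch below. The keyword list, the email IDs, and the function names `flag_by_keyword` and `prioritize` are all made up for illustration. It also assumes every keyword begins and ends with a letter or digit, so that the word-boundary check behaves as expected.

```python
import re


def flag_by_keyword(emails, keywords):
    """Return the IDs of emails whose text contains any keyword.

    `emails` maps an email ID to its text. Matching is case-insensitive
    and respects word boundaries, so "secret" does not match "secretary".
    """
    if not keywords:
        return set()
    # One combined pattern lets each email be scanned in a single pass.
    alternatives = "|".join(re.escape(k) for k in keywords)
    matcher = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return {eid for eid, text in emails.items() if matcher.search(text)}


def prioritize(keyword_hits, novel_ids):
    """Cross-list: emails that are both keyword hits and previously unseen."""
    return keyword_hits & novel_ids
```

For example, `prioritize(flag_by_keyword(emails, keywords), novel_ids)` gives the short list that a human reviewer would read first.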

All this could be done in about ten minutes, and faster if they devoted a farm of computers to the problem. In parallel, they could perform the more time-consuming task of directly comparing the new emails with the previously examined ones. Those that are duplicates would be cast to the side, because they had already been examined. Although comparing 650,000 emails with 50 million previously discovered ones sounds arduous, it becomes very efficient if you compute the cryptographic hash of every new email and every old email and compare the hashes instead of the full messages. When you hash a message, you compute a number that represents its contents. Given the size of the number, this representation is practically unique for each message. Anyone who has used Shazam to identify songs on the radio has seen (or heard) hashing in action, because Shazam hashes what your phone hears and compares that with its database of hashes of millions of song parts to find matches. Shazam works very quickly, doesn’t it? In the same way, through this more exhaustive search involving hash comparisons, any new email that duplicated an old email could be identified and removed from consideration.
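Here is a minimal sketch of hash-based duplicate removal. SHA-256 is my choice of hash; I don't know which one, if any, the FBI used. The sketch also assumes that a duplicate is byte-for-byte identical to the original. Real emails often differ in routing headers, so a practical version would first normalize each message, for instance by hashing only the body and a few selected headers.

```python
import hashlib


def digest(message):
    """Return the SHA-256 hash of a message as a 64-character hex string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def remove_duplicates(new_messages, old_messages):
    """Return the new messages whose hash matches no old message."""
    # Hash every old message once and keep the results in a set.
    seen = {digest(m) for m in old_messages}
    # Each new message then needs one hash and one set lookup,
    # not a comparison against every old message.
    return [m for m in new_messages if digest(m) not in seen]
```

The set is what makes the job tractable: the work is about 50 million hashes plus 650,000 lookups, rather than 650,000 times 50 million message-to-message comparisons.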

This is the craziest and most depressing election I have ever witnessed, mostly due to the misinformation that has been spread and how willingly people believe it without asking questions. But this task of examining 650,000 emails for actionable information was an easy one. Computer science has rigged it that way.

About Ray Klump

Associate Dean, College of Aviation, Science, and Technology at Lewis University; Director, Master of Science in Information Security, Lewis University. You can find him on Google+.
