AI could match 'fingerprints' of texts to their authors, under new intelligence program
Nextgov was pre-briefed by the Intelligence Advanced Research Projects Activity on a soon-to-be-released broad agency announcement for research that might one day help combat disinformation campaigns and human trafficking.
The intelligence community’s research arm is preparing to develop new artificial intelligence systems that can identify who, or what, authored any specific text and, on the flip side, systems that modify text to protect authors’ privacy.
“This effort, we think, is potentially game-changing for tracking disinformation campaigns, and things like combating human trafficking and other malicious activities that go on in online text forums, and elsewhere using text,” Dr. Timothy McKinnon told Nextgov in a recent interview.
McKinnon is the Intelligence Advanced Research Projects Activity, or IARPA, program manager leading this work, which is dubbed the HIATUS program, short for Human Interpretable Attribution of Text Using Underlying Structure.
IARPA will likely release a broad agency announcement next week to solicit research proposals for HIATUS. McKinnon provided an early look at the about-to-unfold project, which marks the IC’s latest research and development effort in human language technology.
“We anticipate that the program will last 42 months after it kicks off,” he confirmed.
The challenges IARPA aims to confront through HIATUS are incredibly complex.
“For a little bit of context, like just think about if you had 100 different people, and you ask them to describe some simple thing—like how to open a door—in two sentences or one sentence, you’d probably get about 100 different answers, right?” McKinnon said. “And, you know, each person sort of has their own idiosyncrasies as an author that are potentially used by authorship attribution systems.”
Heaps of multilingual raw text are produced by anonymous authors—both human and machine—every day. As the program manager noted, such materials generally contain linguistic components that can be used to pinpoint precisely who crafted the information, or to safeguard authors’ identities if attribution could put them in some sort of danger.
“With attribution, what we're doing is we're identifying stylistic features. So, these are things like word placement and syntax that can identify who wrote a given text. Think about it as like your written fingerprint, right? What characteristics make your writing unique? So the technology would be able to identify that fingerprint compared against a corpus of other documents, and match them up if they are from the same author,” he explained. “On the privacy side, what the technology would do is it would figure out ways that text could be modified so that it no longer looks like a person's writing.”
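The fingerprint matching McKinnon describes can be illustrated with a minimal sketch. It is not the HIATUS method, which the announcement has yet to detail. It assumes a toy feature set of common English function words (the FUNCTION_WORDS list is hypothetical) and uses cosine similarity to compare an unknown text against writing samples from known authors.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
# Real attribution systems would use far richer stylistic features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "you", "just", "then"]


def fingerprint(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(unknown, corpus):
    """Return (author, score) for the known author whose sample is most similar."""
    target = fingerprint(unknown)
    scores = {author: cosine(target, fingerprint(sample))
              for author, sample in corpus.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Calling attribute with an unknown text and a dictionary that maps author names to writing samples returns the closest match and its similarity score. With only a dozen features, this toy version would be easy to fool and says nothing about why a match was made, which is the interpretability gap the program aims to close.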
The program is structured to put these two elements in competition with one another to drive development on both sides. Through HIATUS, officials are essentially treating authorship attribution and privacy as an adversarial machine learning problem, since development and evaluation involve competition between those two components.
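One round of that competition might look like the toy sketch below. Everything in it is an assumption made for illustration: the MARKERS and FILLERS word lists, the 0.5 similarity threshold, and the strategy of deleting filler words are hypothetical stand-ins, not techniques attributed to IARPA. The attribution side tries to name the author, the privacy side rewrites the text, and the round records whether the attribution still succeeds.

```python
import math
import re
from collections import Counter

# Hypothetical style markers the attribution side counts.
MARKERS = ["just", "really", "basically", "however", "thus", "moreover"]
# Hypothetical filler words the privacy side is allowed to delete.
FILLERS = {"just", "really", "basically"}


def profile(text):
    """Count how often each style marker appears in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[marker] for marker in MARKERS]


def similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(text, known, threshold=0.5):
    """Attribution side: best-matching author, or None if no score beats the threshold."""
    target = profile(text)
    best_author, best_score = None, threshold
    for author, sample in known.items():
        score = similarity(target, profile(sample))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


def anonymize(text):
    """Privacy side: drop filler words so the style markers disappear."""
    kept = [word for word in text.split()
            if word.lower().strip(".,;:!?") not in FILLERS]
    return " ".join(kept)


def play_round(text, known):
    """Return the attribution result before and after the privacy rewrite."""
    return attribute(text, known), attribute(anonymize(text), known)
```

If the first value returned by play_round names the true author and the second does not, the privacy side won that round. In a real adversarial setup, each side would then adapt: the attribution system would look for features that survive rewriting, and the privacy system would look for further edits that preserve meaning.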
To date, there have been only three categories of approaches to the problems that IARPA’s team intends to address.
In traditional manual approaches, human experts analyze text and look for overlapping features or qualities characteristic of a specific author. A second category involves machine learning and algorithmic techniques like logistic regression or Bayesian models, but McKinnon said those don’t scale well across different text genres.
The third set of techniques is “very new,” he noted. It incorporates neural language models, which are sophisticated systems that represent human language.
“The problem with those models is that even though they're very, very fast—and they perform very well—we don't understand, really, what's going on inside of them. They're very complex,” McKinnon said. “And so what HIATUS is seeking to do, among other things, is to unearth some of the rationale underpinning those models’ behavior, so that you can actually, you know, when we perform attribution or we perform authorship privacy, we're able to really understand why the system is behaving the way it is, and be able to verify that it's not picking up on spurious stuff and that it's doing the right thing.”
When the BAA is launched, proposers will have the opportunity to highlight their own research and development in this realm and suggest how IARPA should move forward to meet its overall objectives.
“We're looking to develop systems that can be robustly performant across diverse domains and genres of text—and also, there's going to be foreign languages involved in the program as it progresses as well,” McKinnon said.
As the IC’s key research and development hub, IARPA conducts exploratory ventures and doesn’t have much of a role in fully deploying or operationalizing the technology it creates. Once a program is complete, the tools are handed off to agencies to implement based on their own specific needs.
Roughly 70% of its completed research efforts transition to other government partners.
For that reason and others, though, officials don’t speculate in great detail about the use cases that could eventually emerge from what they’ve produced.
Still, McKinnon noted that this work could have a major impact on combating human trafficking, or on understanding and stopping the increasingly sophisticated malicious influence campaigns on the internet.
“Let’s take a disinformation campaign as an example of what the technology could do. Imagine you had machine-generated text that was being created online to conduct a disinformation campaign,” McKinnon said. “What the technology will be able to do is it will be able to identify, potentially, the fact that a machine generated the text, and also help you understand which groups are engaged in those activities.”