Since Harvard’s Institutional Data Initiative launched last December, the team has formed partnerships with artificial intelligence developers like OpenAI and Microsoft to train large language models on archival documents in institutional collections.
The initiative runs through Harvard Law School’s Library Innovation Lab and aims to expand the resources for AI training by using data from documents in the public domain. The IDI works with institutional partners, such as newspapers and public and university libraries, to convert their archival materials into accessible data sets, which it then provides to AI researchers training chatbots.
Greg Leppert, the executive director at the IDI and chief technologist at the HLS Berkman Klein Center for Internet and Society, described the project as a “collaborative endeavor.” As of now, the IDI has worked with OpenAI, Microsoft, Google Books, and the Boston Public Library.
“We do try and get, you know, everybody around the table,” Leppert said. “Whether that's a bunch of libraries or model makers or open source AI creators to work on data.”
In June, the initiative shared nearly a million books from a Harvard Library collection with AI researchers, spanning more than 254 languages and dating as far back as the 1400s. Currently, the initiative is tackling newspapers and government documents from the Boston Public Library’s collection.
Leppert noted that the project focuses on resources from library collections that are “under-resourced.” Because the project draws on public domain documents, it also avoids copyright restrictions, which have posed an obstacle for companies training AI on preexisting material.
“The collections that they’re stewarding have been undervalued,” Leppert said of lesser-known library and institutional archives.
The initiative’s work with the Boston Public Library relies on “trained custom segmentation models” that break down and transform articles into “highly searchable” copies that are easier to access for model training.
Leppert said the project’s goal is to use information as a way of training AI in “the positive direction you want it to go.”
“That’s something we think a lot about,” he said. “How can libraries be a part of that dialogue and discourse, and how can knowledge institutions generally play a part in that?”
Leppert said he hopes the IDI will publish results of its work with Boston Public Library documents by the start of 2026. According to Leppert, the IDI’s long-term goals are to expand its reach to more newspapers and institutional libraries and to build tools that libraries can use to publish their own data.
“We’re trying to build connections across the globe to institutions of any size,” Leppert said. “It doesn’t have to be a million books that can be impactful.”
Leppert added that the IDI would “love to collaborate” with Harvard Library in the future due to their “unique collections.”
“I think that that would be phenomenal, and in fact, could make these collections have entirely new life,” he said.
—Staff writer Danielle J. Im can be reached at danielle.im@thecrimson.com.
— Staff writer Neeraja S. Kumar can be reached at neeraja.kumar@thecrimson.com. Follow her on X @neerajasrikumar.