How a Harvard Initiative is Translating Archives for AI Models

By Danielle J. Im and Neeraja S. Kumar, Crimson Staff Writers September 17, 2025

Since Harvard’s Institutional Data Initiative launched last December, the team has formed partnerships with open-source artificial intelligence developers like OpenAI and Microsoft to train large language models on archival documents in institutional collections.

The initiative was incubated by Harvard Law School’s Library Innovation Lab. Now run directly under the HLS library, it aims to expand the resources for AI training by using data from documents in the public domain. The IDI works with institutional partners, such as newspapers and public and university libraries, to convert their archival materials into accessible data sets, which it then provides to AI researchers training chatbots.

Greg Leppert, the executive director at the IDI and chief technologist at the HLS Berkman Klein Center for Internet and Society, described the project as a “collaborative endeavor.” As of now, the IDI has worked with OpenAI, Microsoft, Google Books, and the Boston Public Library.

“We do try and get, you know, everybody around the table,” Leppert said. “Whether that’s a bunch of libraries or model makers or open source AI creators to work on data.”

In June, the initiative shared with AI researchers nearly a million books from a Harvard Library collection, spanning more than 254 languages and dating as far back as the 1400s. The initiative is now working through newspaper collections and government documents held by the Boston Public Library.

Leppert noted that the project focuses on resources from library collections that are “under-resourced.” This approach also avoids copyright protections, which have posed an obstacle for companies training AI on preexisting material.

“The collections that they’re stewarding have been undervalued,” Leppert said of lesser-known library and institutional archives.

The initiative’s work with the Boston Public Library relies on “trained custom segmentation models” that break down and transform articles into “highly searchable” copies that are easier to access for model training.

Leppert said the project’s goal is to use information as a way of training AI in “the positive direction you want it to go.”

“That’s something we think a lot about,” he said. “How can libraries be a part of that dialogue and discourse, and how can knowledge institutions generally play a part in that?”

Leppert said he hopes the IDI will publish results of its work with Boston Public Library documents by the start of 2026. According to Leppert, the IDI’s long-term goals are to expand its reach to more newspapers and institutional libraries and to build tools that libraries can use to publish their own data.

“We’re trying to build connections across the globe to institutions of any size,” Leppert said. “It doesn’t have to be a million books that can be impactful.”

Leppert added that the IDI would “love to collaborate” with Harvard Library in the future due to their “unique collections.”

“I think that that would be phenomenal, and in fact, could make these collections have entirely new life,” he said.

Correction: September 17, 2025

A previous version of this article incorrectly stated that the Institutional Data Initiative runs through Harvard Law School’s Library Innovation Lab. In fact, it was incubated by the Library Innovation Lab but is now run directly under the HLS library.

—Staff writer Danielle J. Im can be reached at danielle.im@thecrimson.com.

— Staff writer Neeraja S. Kumar can be reached at neeraja.kumar@thecrimson.com. Follow her on X @neerajasrikumar.

The Harvard Crimson
