ILS Colloquium: Exploring the Future of Data Curation with AI – Luddy Department of Information & Library Science Blog

By Ruby Mae Miller and Quentin Kitchell

On October 3rd, the ILS Department welcomed Dr. Libby Hemphill (she/her/hers) from the University of Michigan as part of this fall’s ILS Colloquium Series. Students and faculty filled Dorsey Auditorium to hear her talk, “Augmenting Data Curation: Using Generative AI to Improve Data Discovery and Reduce Disclosure Risk.” The lecture was followed by an interactive workshop that dove deeper into how AI can assist, not replace, researchers and data curators.

About the Speaker

Dr. Hemphill is an Associate Professor of Information at the University of Michigan’s School of Information. She is also a Research Associate Professor, a Professor in the Digital Studies Institute, and a member of the Institute for Social Research.
Her research focuses on artificial intelligence, archives, digital curation, and data stewardship, alongside broader studies of social media, content moderation, and extremism.

Rethinking Metadata with AI

Dr. Hemphill shared her research goals: using generative AI to improve data discovery and reduce disclosure risks. Her initial research found that while generative AI was not yet successful at constructing useful metadata overall, it could effectively create tags for specific aspects of a scientific document, such as observations or variables. Most metadata profiles do not index these concepts, and those that do often rely on alphanumeric codes that are tedious to work with and unintuitive to search. For example, the ICPSR Social Science Variables Database codes the variable “cat as pet” as “VAR0243.” By using generative AI to create more natural, descriptive alternatives for such tags, data curators can enrich users’ search capabilities and improve accessibility.
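To make the search benefit concrete, here is a minimal Python sketch. It is not the ICPSR system or Dr. Hemphill's tooling: the `Variable` class, the `search` function, and the descriptive tags are all hypothetical, and the tags are hand-written stand-ins for what a generative model might suggest. Only the code “VAR0243” and the label “cat as pet” come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Variable:
    code: str    # opaque identifier, e.g. "VAR0243"
    label: str   # short original label
    tags: list[str] = field(default_factory=list)  # descriptive tags (hypothetical)


def search(catalog: list[Variable], query: str) -> list[str]:
    """Return codes of variables whose label or tags contain every query word."""
    words = query.lower().split()
    hits = []
    for var in catalog:
        # Search the label and tags together; the opaque code is not searched.
        text = " ".join([var.label, *var.tags]).lower()
        if all(word in text for word in words):
            hits.append(var.code)
    return hits


# Hand-written tags standing in for AI-suggested ones (an assumption).
catalog = [
    Variable("VAR0243", "cat as pet",
             tags=["pet ownership", "household animals", "owns a cat"]),
]

print(search(catalog, "household pet"))
```

A user who types “household pet” would never guess “VAR0243,” but the added tags let the query reach it. The matching here is plain substring comparison, so a real catalog would need something sturdier.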

Dr. Hemphill also proposed using generative AI to produce synthetic datasets, which would assist researchers with preliminary data analysis while mitigating privacy risks. Census data and other sensitive data must legally remain secure and private once collected. Because of this, the approval process for other researchers to access and use such data can be time-consuming and costly. Dr. Hemphill experimented with using generative AI to scan the original dataset in a secure environment and produce a conceptually similar synthetic dataset that can be processed without violating participant privacy. Outside researchers can then use this synthetic dataset to judge whether the original data is relevant to their work before undergoing the process of requesting access to it.
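The idea of a synthetic stand-in can be illustrated with a deliberately simple Python sketch. It does not use generative AI and is not the method described in the talk: it resamples each column independently from the original values, so every synthetic row mixes values from different real records. The function name `synthesize` and the sample records are made up for illustration, and this approach offers no formal privacy guarantee (a synthetic row can still coincide with a real one by chance).

```python
import random


def synthesize(rows, n, seed=0):
    """Build n synthetic rows by sampling each column independently.

    Each column keeps roughly its original mix of values, but the links
    between columns within any one real record are broken.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # For every cell, pick a random real record and copy only that column.
    return [{col: rng.choice(rows)[col] for col in columns} for _ in range(n)]


# Made-up records standing in for a restricted dataset.
original = [
    {"age_group": "18-29", "region": "Midwest", "has_cat": True},
    {"age_group": "30-44", "region": "South", "has_cat": False},
    {"age_group": "45-64", "region": "West", "has_cat": True},
]

synthetic = synthesize(original, n=5)
for row in synthetic:
    print(row)
```

An outside researcher could inspect rows like these to see which variables exist and what values they take before deciding whether a formal access request is worthwhile. Because columns are sampled independently, relationships between variables are lost, which is one reason more capable generative approaches are of interest.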

Dr. Hemphill left attendees with a clear message in both her talk and workshop: generative AI should empower data curators and researchers, not replace them.

Watch the full lecture recording here.