Data + Archives and Archival Violence
Sources:
Chiang, T. (2023, February 9). ChatGPT is a blurry jpeg of the web. The New Yorker.
- In this article, sci-fi author Chiang uses the analogy of lossy image compression (a blurry JPEG) to explain how large language models (LLMs), including ChatGPT, function and why they hallucinate, approximating seemingly correct but compromised information. It explores how the internet, as an archive of information, serves as a dataset for training generative AI. #Practical #Philosophical
Colavizza, G., Blanke, T., Jeurgens, C., & Noordegraaf, J. (2021, December 14). Archives and AI: An overview of current debates and future perspectives. Journal on Computing and Cultural Heritage, 15(1), 1-15.
- This academic journal article serves as a literature review of scholarship on the role of AI in archival processes: how the digitization of archival materials transforms content into big data, which archivist tasks could be automated and how automation will shift archivists' work, and the implications of both shifts for how information is organized and knowledge is rendered. The authors acknowledge that their survey doesn't include any substantive consideration of the ethics of AI use in this capacity. Dense reading, but a useful exploration written for subject experts. #Philosophical
Crawford, K., & Paglen, T. (2019, September 19). Excavating AI: The politics of training sets for machine learning. Excavating AI.
- In this position paper, originally published by New York University’s AI Now Institute to accompany the authors’ art exhibit, Training Humans, Crawford and Paglen elucidate the biases in training data (both the processes and the data itself), beginning with the problematic nature of taxonomy as an act of power before specifically critiquing ImageNet and its impact. It’s important to note that the paper and art exhibit have been criticized for a double standard: featuring images of subjects without their explicit consent, which is part of the authors’ original critique of the image archive. #Philosophical
freemyn, k.b. (2022). Expanding the Black archival imagination: Digital content creators and the movement to liberate Black narratives from institutional violence. In Burns-Simpson, S., Hayes, N.M., Ndumu, A., & Walker, S. (Eds.). The Black librarian in America: Reflections, resistance, and reawakening.
- In this chapter, freemyn discusses how independent Black digital memory workers use Instagram to do archival work outside institutional archives, which perpetuate violence against marginalized people. She describes how the conventions of the platform (e.g., the grid interface, hashtags) serve to construct independent archives of the work of Black women, and how institutional archivists can support this work through outreach, collaboration, and professional expertise. This chapter explores how technology and new media can be used to confront and circumvent institutionalized culture work that harms marginalized people. #Philosophical #Practical
Hempel, J. (2018, November 13). Fei-Fei Li’s quest to make AI better for humanity. Wired.
- This article profiles computer scientist and AI pioneer Fei-Fei Li, who founded the image archive ImageNet and ran Stanford's AI Lab for several years. It covers her motivations and process for starting ImageNet, her controversial time working for Google, and her views on the need for increased diversity in the field of AI. #Practical #Philosophical
Internet Archive. (2023, May 2). Generative AI meets open culture [Video]. Archive.org.
- A 60-minute recording of a panel discussion with expert representatives from the Internet Archive, the Wikimedia Foundation (Wikipedia), and Creative Commons. Topics covered include how the Internet Archive (IA) is using AI to explore and improve its records, how generative AI can be fun and joyful, how Wikipedia editors are testing ChatGPT, how creative participation is changing, the limits of copyright, and the tension between the intended public good of the internet and the corporate motivations of most AI companies. [Ed. note: The closed captions aren’t the most accurate.] #Practical #Philosophical
Jules, B. (2016, November 11). Confronting our failure of care around the legacies of marginalized people in the archives [Keynote presentation]. Medium.
- Dr. Jules is an archivist, historian, and co-founder of the Archiving the Black Web project. In this keynote presentation at the National Digital Stewardship Alliance annual meeting, he discusses how the field's emphasis on neutrality and standards over an ethics of care leads to the symbolic annihilation of marginalized peoples in information preservation and knowledge construction through archives, libraries, and museums. Though the keynote doesn't mention AI, it's instructive for recognizing how a similar insistence that technology like AI is neutral is both incorrect and harmful in its impact. #Philosophical #Practical
Moore, K. (2024, March 27). Review: Kate Crawford’s ‘Atlas of AI’, chapter 4: Classification. Hastac Commons.
- This peer-reviewed blog post, written by a doctoral student, synthesizes and analyzes the chapter on classification in Crawford’s book, Atlas of AI, including how classification acts as a system of power that shapes how AI is trained and used, drawing on a historical tradition of bias and racism in archival processes. #Practical #Philosophical
PBS NewsHour. (2017, January 2). Internet history is fragile. This archive is making sure it doesn’t disappear [Video]. YouTube.
- A 9-minute news report that serves as an explainer for the Wayback Machine, the visual search function of the Internet Archive, which allows users to access screen captures of pages from defunct or updated websites. #Practical
Rebeiz, L.D. (2023, November 1). AI tools paint a blurry picture of our current reality - so what do these biases mean for our future? It’s Nice That.
- In this article, artist Rebeiz discusses their experience training a Generative Adversarial Network (GAN) to generate paintings in the style of their own work, using a database of their own paintings as training data, and contrasts it with Midjourney and DALL-E, which are trained on web data and generate biased and limited outputs as a result. Rebeiz explores the complicated necessity of the archive as knowledge preservation, even when the preservation processes and the knowledge itself are problematic. They view engagement with AI as a “necessary form of resistance” to erasure from the digital memory of the world. #Philosophical
Wong, J.C. (2019, September 18). The viral selfie app ImageNet Roulette seemed fun - until it called me a racist slur. The Guardian.
- In this article, a technology journalist discusses her experience using the app created by Crawford and Paglen as part of their ImageNet critique, which allows users to upload an image of themselves that the app then categorizes according to ImageNet’s taxonomy, which includes racist categories. This piece is useful for exploring ethical considerations around critique as a concept and its direct discriminatory impact on people engaging with the technology. #Practical #Philosophical