Week 06: Data + People

September 29, 2025

Texts to have read / watched

D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism, MIT Press, 2020. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/pitt-ebooks/detail.action?docID=6120950.
- “Introduction: Why Data Science Needs Feminism.” pp. 1–19.
- “4. ‘What Gets Counted Counts.’” pp. 97–124.
Onuoha, Mimi. On Missing Data Sets. 2016. 16 July 2024. GitHub, https://github.com/MimiOnuoha/missing-datasets. See also the related [art installation](https://mimionuoha.com/the-library-of-missing-datasets) and its sequels [v2](https://mimionuoha.com/the-library-of-missing-datasets-v-20) and [v3](https://mimionuoha.com/the-library-of-missing-datasets-v3)
Schöch, Christof. “Big? Smart? Clean? Messy? Data in the Humanities.” Journal of Digital Humanities, Nov. 2013, https://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/.
Cairo, Alberto. “5: Basic Principles of Visualization.” The Truthful Art: Data, Charts, and Maps for Communication, New Riders, 2016. learning.oreilly.com, https://learning.oreilly.com/library/view/the-truthful-art/9780133440492/ch05.html.
Brown, AmyJo. “Building Your Own Data Set: A Journalist’s Approach.” What Are Digital Humanities?, 11 Nov. 2022, https://cmu-lib.github.io/dhlg/project-videos/brown/.

Writing to turn in

Two peer reviews, as assigned, for the project presentations (iteration 1), posted to the discussion forum

data: is or are? depends on the underlying construct. — "A Grammatical Conundrum," comic 1816 from Piled Higher and Deeper (PhD Comics) by Jorge Cham.

Plan for the day:

First half: Let’s discuss!
- Presentation debrief: think, pair, share
- Reading responses & follow-ups
- Writing to remember (inkshedding)
Second half: Let’s practice!
- Reflective writing: getting meta with categories
- Project studio (and mini-conferences)
Homework for next time:
- Materiality + Modeling

First half: Discussion

Presentation Debrief

Think: Take notes on your own.
- What helped you as a member of the audience or at-home reader?
- What made things hard?
- Was there ever too much of a good thing?
Pair: Working with the person next to you, share your second lists (what made things hard). What strategies can you brainstorm together to help with those challenges? Post your notes in the shared Google Doc.
- EXT: Share your other lists. What helpful things do you want to hold onto?
Share: Read everyone’s notes and discuss together as a class.

Reading responses & follow-ups

Some possible moments to revisit:

The process of capturing data

In her video about building the Public Ledger, starting at around 4:38, AmyJo Brown has some advice about capturing data from handwritten material:

After you are familiar with the data that you are using, and what you might be able to learn from it, start to write down the questions that you think it can answer for you. This will help focus your data entry effort.

It can be awfully tempting to capture every piece of data on the form. But that will significantly increase the cost of the project in time and resources. (This is a hard-learned lesson.)

At the same time, you don't want to get halfway through your data entry work and then realize there was a piece of data that you really wish you had to work with. So it helps to work backwards from the questions that you actually want to answer.

Why do you think questions come "after you are familiar with the data that you are using" (my emphasis)? If you had to draw a flow chart of this process, what would it look like?
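For anyone who thinks better in code than in flow charts, here is a minimal sketch of "working backwards from the questions." It is not Brown's method, just one way to picture it: each question lists the fields it needs, and only those fields get transcribed. The questions and field names are hypothetical, invented for illustration.

```python
# Hypothetical questions, each paired with the form fields needed to answer it.
questions = {
    "Who appears most often?": ["name", "date"],
    "How do amounts change over time?": ["date", "amount"],
}

# Hypothetical list of everything that appears on the handwritten form.
form_fields = ["name", "address", "occupation", "amount", "date", "notes"]

# Collect every field that at least one question depends on.
needed = {f for fields in questions.values() for f in fields}

# Keep the form's original order so data entry can follow the page.
to_capture = [f for f in form_fields if f in needed]
to_skip = [f for f in form_fields if f not in needed]

print("Capture:", to_capture)
print("Skip for now:", to_skip)
```

Adding a new question later changes `needed`, which is one reason to settle the questions before the data entry is halfway done.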

What is added by framing something as "data"?

Christof Schöch writes:

Data in the humanities could be considered a digital, selectively constructed, machine-actionable abstraction representing some aspects of a given object of humanistic inquiry. Whether we are historians using texts or other cultural artifacts as windows into another time or another culture, or whether we are literary scholars using knowledge of other times and cultures in order to construct the meaning of texts, digital data add another layer of mediation into the equation. Data (as well as the tools with which we manipulate them) add complexity to the relation between researchers and their objects of study.

What does it mean to say that data is a "layer of mediation"? How might thinking about your objects of study through the lens of data change your approach to them? Is there a benefit in "add[ing] complexity"?

"the paradox of exposure:"

This is a term that Catherine D'Ignazio and Lauren Klein define as "the double bind that places those who stand to significantly gain from being counted in the most danger from that same counting (or classifying) act" (Data Feminism 105).

Or again:

[B]eing represented also means being made visible, and being made visible to the matrix of domination–which continuously develops laws, practices, and cultural norms to police the gender binary–poses significant risks to the health and safety of minoritized groups. (110)

Or, not to make too much of it but to notice the emphasis signaled by their repetition:

Acts of counting and classification, especially as they relate to minoritized groups, must always balance harms and benefits. When data are collected about real people and their lives, risks ranging from exposure to violence are always present. But when deliberately considered, and when consent is obtained, counting can contribute to efforts to increase valuable and desired visibility. (118-119)

I don't have a question here, per se, but I did want to notice this and bring it into the room. There will be a number of damned-if-you-do, damned-if-you-don't binds in your research. Thinking about consent and conversation as one way forward in the midst of these binds, and balance as another, will be helpful, I hope. Any other thoughts you want to add?

Let's talk about the (non-hidden) figures

Alberto Cairo writes that "there are no graphic forms that are intrinsically good or bad but graphic forms that are more or less effective." Think about the example visualizations in the chapter I gave you from Data Feminism, a few of which are reproduced below.

Which of the affordances that Cairo discusses are these data visualizations making use of? For example, what visual features are data variables being mapped onto?
Do you agree that the visualizations are effective? Toward what ends, and why?

[Figures reproduced from Data Feminism: Pockets; CEOs; Congress; Genetic Sex]

(Apologies for the blurry screenshots! I encourage you to find the originals in the ebook for better detail.)

Let’s discuss! And maybe we can take some collaborative notes at bit.ly/dsam2025fall-notes?

If we get to inkshedding by around 10:15, that should leave us with enough time to share and break and still beat the 11:00 rush.

Writing to remember

Spend some time putting marks on a page to help you think through, and consolidate for yourself, what we discussed today. What do you want to remember? What are you left wondering?

After a few minutes, I’ll ask everyone to share one thing, to which we’ll all say, simply, “thank you.”

Break (10 minutes)

Assuming we left off at 10:35, let’s aim to start up again at 10:45 or so. That should beat most of the rush for 11am classes.

Second half: Let’s practice!

Reflective writing

One of several inescapable paradoxes that Catherine D’Ignazio and Lauren Klein discuss in Data Feminism is the problem of imposing hierarchical relations, to the point of reinscribing racism or sexism, in the very attempt to categorize and track some phenomenon of interest. What to do?

A simple solution might be to say, “Fine, then. Let’s just not classify anything or anyone!” But the flaw in that plan is that data must be classified in some way to be put to use. In fact, by the time that information becomes data, it’s already been classified in some way. Data, after all, is information made tractable, to borrow a term from computer science. “What distinguishes data from other forms of information is that it can be processed by a computer, or by computer-like operations,” as Lauren has written in an essay coauthored with information studies scholar Miriam Posner. And to enable those operations, which range from counting to sorting and from modeling to visualizing, the data must be placed into some kind of category—if not always into a conceptual category like gender, then at the least into a computational category like Boolean (a type of data with only two values, like true or false), integer (a type of number with no decimal points, like 237 or –1) or string (a sequence of letters or words, like “this”).

[…So i]t’s not that we should reject these classification systems out of hand, or even that we could if we wanted to. […] It’s just that once a system is in place, it becomes naturalized as “the way things are.” This means we don’t question how our classification systems are constructed, what values or judgments might be encoded into them, or why they were thought up in the first place. (103-104)
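To make the "computational category" point concrete, here is a small sketch in Python. The record and its field names are hypothetical, made up for illustration; the point is only that every value already belongs to a type before we do anything with it, and that the type shapes what operations mean.

```python
# Hypothetical record: field names and values are invented for illustration.
record = {
    "has_pockets": True,    # Boolean: one of only two values
    "pocket_count": 2,      # integer: a whole number
    "garment": "jeans",     # string: a sequence of characters
}

def computational_category(value):
    """Return the name of the Python type a value has been placed into."""
    return type(value).__name__

categories = {name: computational_category(value) for name, value in record.items()}
print(categories)

# The same mark on the page behaves differently depending on its category.
as_integer = 2 + 2      # arithmetic addition
as_string = "2" + "2"   # concatenation of characters
print(as_integer, as_string)
```

Here "adding" means arithmetic for integers and joining for strings, so even the choice between `2` and `"2"` is a classification decision with consequences.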

Think about the source material you've gathered so far for your project, and do a little writing to gather ideas. I'll write, too!

In what ways have you grouped or categorized this material? How are you labeling or dividing it so far?
How else might it have been grouped?
- If you can’t think of any, raise your hand and we can get into groups of two or three to brainstorm together.
What assumptions are baked into the categories you’ve so far taken as natural?
Who are the people implicated in the way you’ve been thinking of the data so far? Who made these things? Who might be affected or exposed by your gathering them or processing them?
Now that you’ve noticed the categories, do you want to affirm them (which might be just what you need to get traction)? Change them? Do you see a way to combine multiple vantage points (e.g. across several passes or data visualizations)?

A brief debrief

I want to save some time to work in class on your projects, but if we’re careful about timing we should still be able to talk about the process of reflection that just happened. Did anyone have any new insights to celebrate or questions to puzzle through together?

Project studio (and mini-conferences)

Starting about five minutes from now and continuing until the last five minutes of class, you have general studio time to work on your own project – whatever step you’re up to – in a dedicated co-working environment. And if I didn’t get to talk to you last week about your project, I’m hoping to have a few minutes now!

Set an intention

Those first five minutes, though? Please set some goals:

In our shared Google Doc, write up to two sentences declaring your intentions: Given what you hope to achieve by the end of the term, what small piece of your project do you hope to tackle in the next 30-40 minutes?

Don’t forget to document how you use the time in your Mindful Practice Journal, and take care as well to document any decisions you make about how to classify or modify something you’ll use as data – including any new definitions for your data dictionary / code book or items in a list of possible categories.
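If it helps to see what a data dictionary entry might hold, here is one possible shape, sketched in Python. Nothing about this structure is required for the course; the class name, field names, and example values are all assumptions made for illustration, and a spreadsheet row or a paragraph in your journal can do the same job.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CodebookEntry:
    """One hypothetical data-dictionary entry describing a single field."""
    name: str                  # the column or field label
    definition: str            # what the field is meant to record
    allowed_values: list[str]  # the categories currently in use
    decided_on: date           # when this definition was settled
    rationale: str = ""        # why the categories were drawn this way

    def allows(self, value: str) -> bool:
        """Check whether a value fits one of the documented categories."""
        return value in self.allowed_values

# Hypothetical example entry.
entry = CodebookEntry(
    name="source_type",
    definition="The kind of document an item was transcribed from.",
    allowed_values=["letter", "ledger", "photograph", "unknown"],
    decided_on=date(2025, 9, 29),
    rationale="Added 'unknown' so unclear items are not forced into a category.",
)

print(entry.allows("ledger"))
print(entry.allows("diary"))
```

Recording the rationale alongside the categories keeps a classification decision visible later, rather than letting it settle into "the way things are."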

Get to it

Captain Jean-Luc Picard of the Star Trek Enterprise says to engage — from ST:TNG and Imgflip Meme Generator

Exit note

As the end of class approaches, please head back into the Google Doc and respond to your own note from earlier:

How far did you get?
What are your new priorities for the coming week?

Homework for next time:

As always, please continue working on your project, and remember to keep track of your time in your Mindful Practice Journal.

After reading the texts below, please head to the discussion forum; the group that presented last week (John, Rose, Tunga, Scylla, Yuqing) is due to post a passage and a question that will kick off our in-class conversation when we return. If you want to claim first-rights for one of the texts below, you have the option to do so on the notes doc.

To prepare for week 07, on Materiality + Modeling, please watch / read:

Jannidis, Fotis, and Julia Flanders. “2 A Gentle Introduction to Data Modeling.” The Shape of Data in Digital Humanities: Modeling Texts and Text-Based Resources, by Julia Flanders and Fotis Jannidis, Taylor & Francis Group, 2018, pp. 55–65. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/pitt-ebooks/detail.action?docID=5582790.
- Section 1: What Is Data Modeling?
- Section 2: Some Basic Concepts
Cairo, Alberto. “3: The Truth Continuum.” The Truthful Art: Data, Charts, and Maps for Communication, New Riders, 2016, https://learning.oreilly.com/library/view/the-truthful-art/9780133440492/ch03.html.
- NB: to view the content, click “SIGN IN” at the top of the page, and begin logging in with your Pitt email address; you should then get the option to “Sign in with SSO” (single sign-on), which will take you to the Pitt Passport screen.
Ensmenger, Nathan L. “The Cloud Is a Factory.” Your Computer Is On Fire, edited by Thomas S. Mullaney, Benjamin Peters, Mar Hicks, and Kavita Philip, MIT Press, 2021, pp. 37–60. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/pitt-ebooks/detail.action?docID=6479710.
Neely-Cohen, Maxwell. “Century-Scale Storage.” https://lil.law.harvard.edu/century-scale-storage. Accessed 29 July 2025.
Ford, Paul. What Is Code? If You Don’t Know, You Need to Read This, Bloomberg.com, http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/.
- Section 2: “Let’s Begin.” (if you haven’t yet)
EXT for eager readers:
- Jannidis, Fotis, and Julia Flanders. The remainder of the chapter above.
- Crawford, Kate, and Vladan Joler. “Anatomy of an AI System: The Amazon Echo As An Anatomical Map of Human Labor, Data and Planetary Resources.” AI Now Institute and Share Lab, 7 Sept. 2018, https://www.anatomyof.ai.
- Crump, Jon. “Generating an Ordered Data Set from an OCR Text File.” Programming Historian, Nov. 2014. programminghistorian.org, https://programminghistorian.org/en/lessons/generating-an-ordered-data-set-from-an-OCR-text-file.
