Reverse detail from Kakelbont MS 1, a fifteenth-century French Psalter. This image is in the public domain. Daniel Paul O'Donnell


The bird in hand: Humanities research in the age of open data (Digital Science Report)

Posted: Oct 24, 2016 13:10;
Last Modified: Oct 24, 2016 13:10


Originally published as Daniel Paul O’Donnell. 2016. “The Bird in Hand: Humanities Research in the Age of Open Data.” In The State of Open Data: A Selection of Analyses and Articles about Open Data, Edited by Figshare, 34–35. Digital Science Report. London: Digital Science.

Traditionally, humanities scholars have resisted describing their raw material as
“data” 10.

Instead, they speak of “sources” and “readings.” “Primary sources” are the
texts, objects, and artifacts they study; “secondary sources” are the works
of other commentators used in their analyses; “readings” can be either the
arguments that represent the end product of their research or the extracts
and quotations they use for support.

These definitions are contextual. The primary source for one argument can be
the secondary source for another or, as in the case of a “critical edition” of a
historical text, simultaneously primary and secondary. Almost any document,
artifact or record of human activity can be a topic of study. Arguments proposing
previously unrecognized sources (“high school yearbooks, cookbooks, or wear
patterns in the floors of public places”) are valued acts of scholarship. 1

This resistance to “data” is a recognition of real differences in the way humanists
collect and use such material. In other domains, data are generated through
experiment, observation, and measurement. Darwin goes to the Galapagos
Islands, observes the finches, and fills notebooks with what he sees. His notes
(i.e. his “data”) “represent information in a formalized manner suitable for
communication, interpretation, or processing” 2. They are “the facts, numbers,
letters, and symbols that describe an object, idea, condition, situation, or other
factors” 3. Given the extent to which they are generated, it has been argued that
they might be described better as capta, “taken,” than data, “given”. 4

The material of humanities research traditionally is much more datum than
captum, finch than note. Since the humanities involve the study of the meaning
of human thought, culture, and history, such material typically involves other
people’s work. It is often unique and its interpretation is usually provisional,
depending on broader understandings of purpose, context and form that are
themselves open to analysis, argument and modification. In the humanities, we
more often end up debating why we think something is a finch than what we
can conclude from observing it.

Perhaps most telling is the fact that humanities sources, unlike scientific
data, are usually practically as well as theoretically non-rivalrous 5. Humanities
researchers rarely have an incentive (or capability) to prevent others from
accessing their raw material and entire research domains (e.g. Jane Austen
studies) can work for centuries from the same few primary sources. Priority
disputes that occur regularly in the sciences 6 are almost non-existent within
the humanities. 1

The digital age is changing one aspect of this traditional disciplinary difference.
Mass digitalization and new tools make it possible to extract material
algorithmically from large numbers of cultural artifacts. Where researchers
used to be limited to sources in archives and libraries to which they had
physical access, digital archives and metadata now make it easier to work
across complete historical or geographic corpora: all surviving periodicals from
19th century England, for example, or every known pamphlet from the Civil
War. In the digital age, humanities resources can be capta as well as data.

Such changes allow for new types of research and improve the efficacy of some
traditional approaches. But they also raise existential questions about long-
standing practices. Traditionally, humanities researchers have tended to work
with details from a limited corpus to make larger arguments: “close readings” of
selected passages in a given text to produce larger interpretations of the work
as a whole; or of passages from a few selected works to support arguments
about larger events, movements or schools. In one famous but far from atypical
example, author Ian Watt uses readings from five novels and three authors as the
main primary sources in his discussion of the Rise of the Novel. 7

In the age of open data, it is tempting to see this as being, in essence, a small-
sample analysis lacking in statistical power. 8 But such data-centric criticism of
traditional humanities arguments can be a form of category error. Humanities
research is as a rule more about interpretation than solution. It is about why
you understand something the way you do rather than why something is
the way it is. It treats its sources as examples to support an argument rather
than phenomena to be observed in the service of a solution. While Watt’s title,
“The Rise of the Novel,” can be understood as implying a historical scope
that his sample cannot support, his subtitle, “Studies in Defoe, Richardson,
and Fielding,” shows that he actually was making an argument about the
interpretation of three canonical authors based on his understanding of
the novel’s early history – an understanding that by definition always will be
provisional and open to amendment.

The real challenge for the humanities in the age of digital open data is
recognizing the value of both types of sources: the material we can now
generate algorithmically at previously unimaginable scales and the continuing
value of the exemplary source or passage. As the raw material of humanities
research begins to acquire formal qualities associated with data in other fields,
the danger is going to be that we forget that our research requires us to be
sensitive to both object and observation, datum and captum, finch and note. In
asking ourselves what we can do with a million books 9, we need to remember
that we remain interested in the meaning of individual titles and passages.

Works cited

1 Borgman, Christine L. 2007. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, Mass: MIT Press.

2 Consultative Committee for Space Data Systems. 2012. “Reference Model for an Open Archival Information System (OAIS).” CCSDS 650.0-M-2.

3 National Research Council. 1999. A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases. Washington: National Academies Press.

4 Jensen, H. E. 1950. “Editorial Note.” In Through Values to Social Interpretation: Essays on Social Contexts, Actions, Types, and Prospects, vii–xi. Sociological Series. Duke University Press.

5 Kitchin, Rob. 2014. The Data Revolution. Thousand Oaks, CA: SAGE Publications Ltd.

6 Casadevall, Arturo, and Ferric C. Fang. 2012. “Winner Takes All.” Scientific American 307 (2): 13. doi:10.1038/scientificamerican0812-13.

7 Watt, Ian P. (1957) 1987. The Rise of the Novel: Studies in Defoe, Richardson, and Fielding. London: Hogarth.

8 Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. Urbana, IL: University of Illinois Press.

9 Crane, Gregory. 2006. “What Do You Do with a Million Books?” D-Lib Magazine 12 (3). doi:10.1045/march2006-crane.

10 Marche, Stephen. 2012. “Literature Is Not Data: Against Digital Humanities.” Los Angeles Review of Books, October. article/literature-is-not-data-against-digital-humanities/


More on Aauthors and Aalphabetical placement

Posted: Jul 26, 2014 16:07;
Last Modified: Jul 26, 2014 16:07


In an earlier post today, I discussed some of the economic implications of having a last name beginning early in the alphabet in disciplines that traditionally order the authors on multi-author papers alphabetically.

I’ve since looked up the original paper (Einav, Liran, and Leeat Yariv. 2006. “What’s in a Surname? The Effects of Surname Initials on Academic Success.” The Journal of Economic Perspectives 20 (1): 175–88). This is more startling than I thought.

First of all, from the authors’ own description:

In this paper, we focus on the effects of surname initials on professional outcomes in the academic labor market for economists.

We begin our analysis with data on faculty in all top 35 U.S. economics departments. Faculty with earlier surname initials are significantly more likely to receive tenure at top ten economics departments, are significantly more likely to become fellows of the Econometric Society, and, to a lesser extent, are more likely to receive the Clark Medal and the Nobel Prize. These statistically significant differences remain the same even after we control for country of origin, ethnicity, religion or departmental fixed effects. All these effects gradually fade as we increase the sample to include our entire set of top 35 departments.

We suspect the “alphabetical discrimination” reported in this paper is linked to the norm in the economics profession prescribing alphabetical ordering of credits on coauthored publications. As a test, we replicate our analysis for faculty in the top 35 U.S. psychology departments, for which coauthorships are not normatively ordered alphabetically. We find no relationship between alphabetical placement and tenure status in psychology.

We then discuss the extent to which the effects of alphabetical placement are internalized by potential authors in their choices of the number of coauthors as well as in their willingness to follow the alphabetical ordering norm. We find that the distribution of authors’ surnames in single-authored, double-authored and triple-authored papers does not differ significantly. Nonetheless, authors with surname initials that are placed later in the alphabet are significantly less likely to participate in four- and five-author projects. Furthermore, such authors are also more likely to deviate from the accepted norm, and to write papers in which credits do not follow the alphabetical ordering.

Here are the core figures from the paper, comparing top Economics departments (alphabetical ordering) and top Psychology departments (non-alphabetical ordering):

Figure showing distribution of last initials in Top Economic departments.

Figure showing distribution of last initials in Top Psychology departments.

As Figure 1 shows, economics departments have a definite tendency towards tenured members with early last names: about 50% of the tenured faculty in top 5 departments have names beginning with the letters A–G, and by the time you get to O you’ve accounted for about 70% of the faculty. (Tip for economists planning on having affairs at economics conferences: use “John/Jane Doe” rather than “John/Jane Smith”; I’m guessing you’re likely to pull partners from better schools.)

In Psychology, on the other hand, you’re probably at “K” before you get to 50% of the tenured faculty. (The 50% mark in both sets of departments for untenured faculty comes in at about L, suggesting there isn’t such a bias in the case of pre-tenure hires, perhaps because they publish less before they are hired?)
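For readers who want to run the same kind of check on a faculty list of their own, here is a minimal sketch of the calculation behind the “50% mark” comparison. It is not the method used by Einav and Yariv; the function name `letter_at_share` and the sample surnames are made up for illustration and are not data from the paper.

```python
from collections import Counter
from string import ascii_uppercase


def letter_at_share(surnames, share=0.5):
    """Return the first letter (in A-Z order) at which the cumulative
    fraction of surnames reaches `share`."""
    # Count surnames by their first letter, ignoring empty strings.
    counts = Counter(name.strip()[0].upper() for name in surnames if name.strip())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no surnames supplied")
    running = 0
    for letter in ascii_uppercase:
        running += counts.get(letter, 0)
        if running / total >= share:
            return letter
    return None  # only reached if too many initials fall outside A-Z


# Hypothetical names for illustration only; not data from the paper.
sample = ["Adams", "Baker", "Chen", "Garcia", "Kim", "Olsen", "Smith", "Young"]
print(letter_at_share(sample))       # G
print(letter_at_share(sample, 0.7))  # O
```

Running this separately over tenured and untenured lists for each discipline would give the four “50% marks” compared above.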

So all in all, I guess this just means that Paul S. Krugman’s career is even more impressive than it looks: a job at Princeton and a Nobel prize and a last name beginning with “K”? What are the odds? Makes you wonder what’s wrong with Ben S. Bernanke: a last name beginning with “B” and he only makes chairman of the Fed?

Maybe somebody needs to do a study of the influence on academic careers of S. as a middle initial.


The credit line

Posted: Jul 13, 2014 13:07;
Last Modified: Aug 16, 2014 13:08


I think it is time to get rid of authorship altogether, at least in research communication.


What is an author

Outside of academia, the definition of authorship is quite straightforward. As the OED puts it, an author is “the writer of a book or other work.” Things get a little complicated with ghost-writers (is a ghost-written book “by” the person who commissioned it or the person who actually composed it?). But on the whole, there isn’t much room for ambiguity. Authors are people who write.

Within academia, however, things are more complicated. There you can have “authors” who don’t write anything and writers who aren’t authors.

This is because in academia, the writing is only part of a larger research process: articles and books report on research projects that take place in the laboratory or library but they are not the research projects themselves. A single research project will often lead to a number of articles and (occasionally) books and the people who end up writing an individual article or book can represent only a small sub-set of the entire team that was responsible for working on the project as a whole. An individual article can involve the essential intellectual contributions of a far larger number of people than those actually responsible for drafting its text.

Why “authorship” matters

Despite this ambiguity, however, academics devote a lot of attention to determining authorship, and, especially, distinguishing between authorship and other forms of “contribution.”

This is because being “an author” (as opposed to “a contributor”) carries with it real rewards. “Authorship” is the primary basis at universities for determining promotion, bonuses, and relative status. A researcher who receives a lot of authorship credits is going to do better (in terms of pay, rank, and position) than a researcher who is frequently acknowledged as a collaborator, even if that collaboration is as essential, or more essential, to the overall success of the project.

Authors who don’t write and writers who don’t author

Over the years, this disparity in reward has led to some authorship scandals, particularly in the medical sciences: people buying and selling “authorship” credit, authors who subsequently deny any responsibility for papers that are shown to be incorrect, researchers who do work being denied authorship credit.

It has also led to increasingly rigorous definitions of what authorship means in a research communication. The International Committee of Medical Journal Editors, for example, has come up with a fairly clear set of criteria for authorship credit:

The ICMJE recommends that authorship be based on the following 4 criteria:

* Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND

* Drafting the work or revising it critically for important intellectual content; AND

* Final approval of the version to be published; AND

* Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

In addition to being accountable for the parts of the work he or she has done, an author should be able to identify which co-authors are responsible for specific other parts of the work. In addition, authors should have confidence in the integrity of the contributions of their co-authors.

All those designated as authors should meet all four criteria for authorship, and all who meet the four criteria should be identified as authors. Those who do not meet all four criteria should be acknowledged—see Section II.A.3 below. These authorship criteria are intended to reserve the status of authorship for those who deserve credit and can take responsibility for the work.

Disambiguating “credit” and “responsibility”

The problem with this definition, and indeed with the author/collaborator distinction as a whole, lies in that last sentence: “These authorship criteria are intended to reserve the status of authorship for those who deserve credit and can take responsibility for the work.” That is to say, this definition, like the author/collaborator distinction in the first place, is attempting to do too much: provide credit and identify responsibility. “Credit” on a modern research project in particular needs to be distributed far more widely than simply to those who can take responsibility for its reports. From the people who secure the funding to the people who code the tools, modern research projects often make use of essential contributions from a large number of people, far more than ever end up drawing conclusions or reporting on results in writing. As long as the primary method of crediting research activity remains the by-line of the research article, it is essential that these people also receive credit for their work similar to that received by the “authors” of the reports themselves.

This is particularly true for the growing class of adjunct and para-researchers in contemporary universities: these researchers, who are usually as qualified and well-trained as the principal investigators, can easily end up in a kind of researchers no man’s land: working on many research projects but too early in the pipeline and not high enough up in the hierarchy to receive authorship credit.

Indeed, in one sense, “authorship” is itself just a contribution: projects need somebody to analyse and write up their results as surely as they need somebody to code their instruments. While authorship is the last step in the research cycle, it is not necessarily more important than all of the preceding steps, and, indeed, could not have taken place unless at least some of those preceding steps had occurred: somebody needed to frame the project in such a way that it could be funded and ensure that the lab receives the resources it needs; somebody had to implement the protocols or design the software and routines; somebody had to acquire and process the data, and so on.

Challenging outdated assumptions

In fact, I would argue that our struggles about the definition of “authorship” in a research context are evidence that the concept itself is outmoded. In the days when most projects were conceived of and carried out by a single person who then wrote up the reports by himself (pronoun being used advisedly), the idea of assigning credit for research to an “author” (in the traditional sense of “the writer of a book or other work”) made sense, though there are enough examples of lab assistants (usually women) not receiving credit for essential work even from those days to make one wonder.

Now that few projects are conceived of and executed in this way, however, the entire privileging of “authorship” (however defined) makes much less sense. It remains important to identify the people who bear intellectual responsibility for the argument and conclusions of a given research communication; but it does not seem to me that the people who bear intellectual responsibility for a particular set of conclusions are more important than all those others without whose contributions the “authors” would have nothing to conclude about. In the modern research world, “authorship” (like “writing,” but also like “getting funding” or “developing the algorithm” or “doing the coding”) is really just one kind of contribution credit.

Changing the byline to the credit line

In my view, the way to address this problem is to get rid of the “author”/“contributor” distinction altogether. While it is important to continue to identify the people who have intellectual responsibility for the presentation and conclusions of a given piece of research communication, especially in cases of research fraud, it does not seem at all appropriate to me to maintain the vast distinction in credit that comes with the “author”/“contributor” distinction. If authors are really just a particular kind of contributor to the project as a whole, then it seems to me that we can acknowledge their contributions (and responsibility) just as easily in the contributor list as on the byline. Or, perhaps better said, that we could just as easily eliminate the distinction altogether and credit all contributors on the byline.

In fact, I think we should change our understanding of the “byline” altogether. If modern definitions of research authorship run into trouble because they attempt to use the byline for two things (assigning credit and identifying a particular kind of responsibility), an approach that saw the byline as simply the “credit line” and saw the attribution of specific credit as something that could be handled by a note would disambiguate these two functions and provide a far fairer (and more reliable) attribution of responsibility than the current system.

How would this work?

This is really two questions:

  1. how would projects assign credit and responsibility on papers (i.e. order of names, definitions of responsibilities)
  2. how could a “creditline” system replace a “byline” system in actual publications

For the first of these, I suspect the answer would, at least initially, depend on the researchers and disciplines. There are currently few standards and different disciplinary customs regarding the attribution and place of authors on the byline (e.g. alphabetical, from greatest to least responsibility, and so on). Likewise, there are no agreed upon terms for describing individual types of responsibility. Journals, such as Science, that include breakdowns of responsibility tend to do this in a free-form narrative.

I don’t see this changing under a credit-line system. Research teams would still be faced with the problem of ordering credit and it seems unlikely to me that the current places of highest prestige would change very much. Instead of first and last “author” we would now speak of “first and last contributor” as being the places of highest prestige, but the problem of who goes where would remain very much something individual research teams would still need to solve.

For the second question, how such a system could be implemented, I suspect not much is required. The difference between the current system and this proposal is not actually that great: we need definitions of authorship in academia because “authorship” is no longer about who writes things. All that is needed is for journals to begin inviting research projects to adopt a system in which the identities of all essential contributors are credited in the “creditline” and their contributions defined in a “responsibilities” note.
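To make the proposal concrete, here is one way a “creditline” and its “responsibilities” note might be represented. This is only a sketch under my own assumptions: the `Contributor` class, its field names, the wording of the note, and the sample team are all hypothetical, and no journal is known to use this format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    name: str
    roles: List[str] = field(default_factory=list)
    # True for those who bear intellectual responsibility for the conclusions.
    accountable: bool = False


def credit_line(contributors):
    """Everyone is credited in one line, in the order the team chose."""
    return ", ".join(c.name for c in contributors)


def responsibilities_note(contributors):
    """One line per contributor, stating what each person did."""
    lines = []
    for c in contributors:
        line = f"{c.name}: {', '.join(c.roles)}"
        if c.accountable:
            line += " (responsible for the conclusions)"
        lines.append(line)
    return "\n".join(lines)


# Hypothetical team, for illustration only.
team = [
    Contributor("A. Researcher", ["secured funding", "drafted the text"], accountable=True),
    Contributor("B. Developer", ["coded the tools"]),
    Contributor("C. Analyst", ["acquired and processed the data"]),
]
print(credit_line(team))
print(responsibilities_note(team))
```

The point of the design is that credit (the list of names) and responsibility (the `accountable` flag and the note) are recorded separately, so neither has to stand in for the other. The ordering problem is left to the team, as discussed above.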

