Doing Digital History: A beginner’s guide to working with text as data

By Jonathan Blaney

A friend of mine is the most analogue person I know. We keep in touch by letter, and arranging to meet requires a flurry of postcards confirming time and place. If one of us were held up en route the other would presumably just hang around and wait.
So I was taken aback the last time we met. He was enthusing about the beautiful end papers for a book he had coming out and casually said, “hang on, I’ll show you a picture”, fished a smartphone out of his pocket, and started swiping through his photos. We really are all digital now.

Humanities researchers nowadays are pretty comfortable with lots of digital tools, whether photographing archives, searching online databases or collaborating via Google Docs. However, I’ve learned from training postgraduates that they tend to lack the skills needed to work with files or texts en masse. They often assume that you need to be a programmer to extract information from thousands of files. This is a shame because, as we show in our new book, you can do an enormous amount without programming.

Between us, my co-authors on Doing Digital History and I have decades of experience of the back end of digital work: getting from physical books to digital text, dealing with many files at once, managing the life-cycle of a complex project and producing outputs from all that data and all that work. We wanted to condense as much of our knowledge and experience as possible into a short book, and this is the result.

After putting digital history into its own historical context, alongside Digital Humanities more broadly, we show how the reader can use the current digital landscape – what is available and how; its limitations and its strengths – in developing their own digital research project. Then we describe the nuts and bolts of a digital history project: how, for example, can you go about working with texts which currently exist only in print form?

The heart of the book is working with text, which we divide into structured and unstructured text: these have common tools but also require different approaches. There are a couple of key skills that we spend a lot of time on because, in our experience of training humanities researchers in digital skills, they are scarcely known: the command line and regular expressions.

The command line is a more direct and powerful way of working with a computer than using graphical software. Because we’re so used to the latter, the command line can seem a bit intimidating, but we introduce it gradually while quickly showing its power and flexibility. For example, if you want to work with all of the files in a folder at once, the command line is an ideal tool.

The command line on a Mac
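
To give a flavour of what working with a whole folder at once means, here is a minimal sketch. It is written in Python rather than at the command line, so it is not the method used in the book, only an illustration of the same idea. The function name word_counts and the "*.txt" pattern are assumptions made for this example.

```python
from pathlib import Path


def word_counts(folder, pattern="*.txt"):
    """Count whitespace-separated words in every matching file in a folder.

    Hypothetical helper for illustration; returns {filename: word count}.
    """
    counts = {}
    # sorted() gives a stable, alphabetical order of files
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" avoids a crash on the odd badly encoded character
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[path.name] = len(text.split())
    return counts
```

Calling word_counts on a folder of transcriptions would return one count per file, however many files the folder holds.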

Regular expressions, or regex, are a way of working with text by matching patterns of characters (not literal sequences). This means we can use them to extract information without knowing exactly what it will look like. In the book we use regex to extract all of the professions from a section of the Victorian Post Office Directory for London. Then we use the command line to sort the list of professions and count them by frequency.
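
The extract, sort and count workflow described above can be sketched roughly as follows. This is a Python illustration, not the regex or the commands used in the book. The sample lines are invented, and the pattern assumes that each entry ends with a comma followed by the profession, which may not match the real directory's layout.

```python
import re
from collections import Counter

# Invented sample entries; the real directory's layout may differ.
SAMPLE = """\
Abbott William, baker
Brown Mary, dressmaker
Clark Thos., baker
Davis & Son, grocers
Evans Jane, dressmaker
Ford Henry, baker
"""

# Assumed pattern: the profession is whatever follows the last comma on a line.
PROFESSION = re.compile(r",[ \t]*([^,\n]+?)[ \t]*$", re.MULTILINE)


def count_professions(text):
    """Return (profession, count) pairs, most frequent first."""
    professions = (match.lower() for match in PROFESSION.findall(text))
    return Counter(professions).most_common()


for profession, count in count_professions(SAMPLE):
    print(count, profession)
```

The pattern never names a particular profession: it describes where a profession sits in a line, which is what lets it pull out values we have not seen in advance.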

You don’t need a special mindset to do this work, we argue, just practice and a bit of confidence.

Following on from our chapter on structured text (where we focus on XML) we discuss how to manage all of the data that this work produces: how to keep it under version control, how to share, and how to recover if you mess things up. Here we focus a lot on Git, a free program which has saved us from lost work and screw-ups with data so many times that we can get quite emotional about it.

Another reason for learning a little bit of Git as a humanities researcher is that nowadays Git repositories (mostly to be found on GitHub) are a goldmine for historical and other humanities data. In fact, the data we used in the book, such as the Post Office directory information, is (predictably) in an open repository on GitHub, so that readers can work through the exercises found throughout the book.

How about visualising the data we have been working with, either for research purposes (to see patterns more clearly) or for presentation and publication? That deserves a book in itself, but we spend a chapter looking at maps and charts and give some general advice on things like colour. For one street in the Post Office directory we go step-by-step through how we visualised its occupants:

Our visualisation of Beaufort Street

Historians won’t necessarily need or want to know everything in Doing Digital History but we hope there is something for all interested readers. We have written the book assuming no knowledge of digital techniques at all.

As we say throughout the book, digital tools add to the historian’s toolkit; they don’t replace other skills or tools. Some things, we say, remain best in analogue. I still don’t have my friend’s phone number. He probably thinks we’re communicating just fine by Royal Mail.

Jonathan Blaney

Doing Digital History: A beginner’s guide to working with text as data by Jonathan Blaney, Jane Winters, Sarah Milligan and Martin Steer is available to buy now.

Jonathan Blaney was Head of Digital Projects at the Institute of Historical Research, University of London until 2021

Sarah Milligan is an independent scholar based in Victoria, Canada

Marty Steer is Technical Lead, Digital Humanities at the School of Advanced Study, University of London

Jane Winters is Professor of Digital Humanities at the School of Advanced Study, University of London
