Cambridge Digital Humanities - Machine Reading the Archive

The Library as Data

Mon 15 Oct 2018 13:30 Finished

Discover the rich digital collections of Cambridge University Library and explore the methods and tools that researchers are using to analyse and visualise data.

Creating Databases from Historical Sources (Workshop) Mon 25 Feb 2019 11:00 Finished

This workshop will examine strategies for transforming a variety of sources into structured digital data, ranging from crumbling manuscripts to printed documents and books.

Optical Character Recognition (OCR): An Introduction (Workshop)

Mon 11 Feb 2019 11:00 Finished

Optical Character Recognition is a term used to describe techniques for converting images containing printed or handwritten text into a format that can be searched and analysed computationally. This workshop will introduce several such tools along with some practical techniques for using them, and will also highlight OCR and related services offered by the Digital Content Unit at the Cambridge University Library.

Introduction to Text-Mining with Python 2

Tue 7 May 2019 11:00 Finished

This session will introduce topic modelling. Topic modelling is looking for clusters of words that summarise the meaning of documents. We will talk about how to choose what sort of text mining you might want for your research. Some knowledge of Python is required, as gained from 'Introduction to Text-Mining with Python 1', or equivalent. No installations will be needed; we will use web services available in your browser to follow along with the examples.

Introduction to Text-Mining with Python 1

Tue 30 Apr 2019 11:00 Finished

This session will introduce basic methods for reading and processing text files in Python. We will walk through an example that reads in a large text corpus, splits it into tokens (words) and sentences, removes unwanted words (stopwords), counts the words (frequency analysis), and visualises results. We will talk about the 5 steps of text mining and what resources to use when learning text mining for your research in your own time. No prior knowledge of Python is required, and no installations will be needed. We will use web services available in your browser to follow along.

Sources to Data (Workshop) Wed 5 Jun 2019 11:00 Finished

This workshop will examine database creation from historical documents. Extracting data from these can be hard work and involves quite unusual skill combinations. You may need to digitise and transcribe from primary sources, and then design and build a database from scratch with the information. Other sources you use could already be digitised but may be arranged or filed in an unsuitable way for your project and therefore need conversion. We will look at techniques used when employing crumbling manuscripts, printed documents, books, or text searchable images, to harvest historical data. Techniques include manual data-entry, scanning and OCR, and handwritten text recognition systems.

Digital Mapping for Historians

Wed 26 Jun 2019 09:30 Finished

This intensive workshop will provide an overview of a range of applications of digital mapping in historical research projects and introduce GIS tools and software.

The Letters Connection: Social Network Analysis in the Scientific Correspondence Collection

Wed 18 Sep 2019 11:30 Finished

Letters have been for centuries the main form of communication between scientists. Correspondence collections are a unique window into the social networks of prominent historical figures. What can digital social sciences and humanities reveal about the correspondence networks of 19th century scientists? This two-session intensive workshop will give participants the opportunity to explore possible answers to this question.

With the digitisation and encoding of personal letters, researchers have at their disposal a wealth of relational data, which we propose to study through social network analysis (SNA). The workshop will be divided in two sessions during which participants will “learn by doing” how to apply SNA to personal correspondence datasets. Following a guided project framework, participants will work on the correspondence collections of John Herschel and Charles Darwin. After a contextual introduction to the datasets, the sessions will focus on the basic concepts of SNA, data transformation and preparation, data visualisation and data analysis, with particular emphasis on “ego network” measures.

The two demonstration datasets used during the workshop will be provided by the Epsilon project, a research consortium between Cambridge Digital Library, The Royal Institution and The Royal Society of London aimed at building a collaborative digital framework for 19th century letters of science. The first dataset, the “Calendar of the Correspondence of Sir John Hershel Database at the Adler Planetarium”, is a collection of the personal correspondence of John Frederick William Herschel (1792-1871), a polymath celebrated for his contributions to the field of astronomy. Its curation process started in the 50s at the Royal Society and currently comprises 14.815 digitised letters encoded in extensible markup language (.xml) format. The second dataset, the “Darwin Correspondence Project” has been locating, researching, editing and publishing Charles Darwin’s letters since 1974. In addition to a 30-volume print edition, the project has also made letters available in .xml format.

The workshop will provide a step-by-step guide to analysing correspondence networks from these collections, which will cover:

- Explanation of the encoding procedures and rationale following the Text Encoding Initiative guidelines; - Preparation and transformation of .xml files for analysis with an open source data wrangler; - Rendering of network visualisations using an open source SNA tool; - Analysis of the Ego Networks of John Herschel and Charles Darwin (requires UCINET)

About the speakers and course facilitators:

Anne Alexander is Director of Learning at Cambridge Digital Humanities

Hugo Leal is Methods Fellow at Cambridge Digital Humanities and Co-ordinator of the Cambridge Data School

Louisiane Ferlier is Digital Resources Manager at the Centre for the History of Science at the Royal Society. In her current role she facilitates research collaborations with the Royal Society collections, curates digital and physical exhibitions, as well as augmenting its portfolio of digital assets. A historian of ideas by training, her research investigates the material and intellectual circulation of ideas in the 17th and 18th centuries.

Elizabeth Smith is the Associate Editor for Digital Development at the Darwin Correspondence Project, where she contributed to the conversion of the Project’s work into TEI several years ago, and has since been collaborating with the technical director in enhancing the Darwin Project’s data. She is one of the co-ordinators of Epsilon, a TEI-based portal for nineteenth-century science letters.

No knowledge of prior knowledge of programming is required, instructions on software to install will be sent out before the workshop. Some exercises and preparation for the second session will be set during the first and participants should allow 2-3 hours for this. Please note, priority will be given to staff and students at the University of Cambridge for booking onto this workshop.

CDH Learning gratefully acknowledges the support of the Isaac Newton Trust and the Faculty of History for this workshop.

The Library as Data: Digital Text Markup and TEI

Wed 23 Oct 2019 11:00 Finished

Text encoding, or the addition of semantic meaning to text, is a core activity in digital humanities, covering everything from linguistic analysis of novels to quantitative research on manuscript collections. In this session we will take a look at the fundamentals of text encoding – why we might want to do it, and why we need to think carefully about our approaches. We will also introduce the TEI (Text Encoding Initiative), the most commonly used standard for markup in the digital humanities, and look at some common research applications through examples.

The Library as Data: Social Network Analysis in the Correspondence Collection Archive

Wed 30 Oct 2019 11:00 Finished

Correspondence collections are a unique window into the social networks of prominent historical figures. With the digitisation and encoding of personal letters, researchers have at their disposal a wealth of relational data, which can be studied using social network analysis.

This session will introduce and demonstrate foundational concepts, methods and tools in social network analysis using datasets prepared from the Darwin Correspondence collection. Topics covered will include

Explanation of the encoding procedures and rationale following the Text Encoding Initiative guidelines
Preparation and transformation of .xml files for analysis with an open source data wrangler
Rendering of network visualisations using an open source SNA tool

No knowledge of prior knowledge of programming is required, instructions on software to install will be sent out before the session

The Library as Data: An overview

Wed 16 Oct 2019 11:00 Finished

Is the "digital library" more than a virtual rendering of the bookshelf or filing cabinet? Does the transformation of books into bytes and manuscripts into pixels change the way we create and share knowledge? This session introduces a conceptual toolkit for understanding the library collection in the digital age, and provides a guide to key methods for accessing, transforming and analysing the contents as data. Using the rich collections of Cambridge University Library as a starting point, we will explore:

Relations between digital and material texts and artefacts
Definitions of data and metadata
Methods for accessing data in bulk from digital collections
Understanding file formats and standards

The session will also provide an overview of the content in the rest of the term’s Library as Data programme, and introduce our annual call for applications to the Machine Reading the Archive Projects mentoring scheme.

Introduction to Archival Photography workshop [cancelled re Covid-19]

Wed 10 Jun 2020 11:00 CANCELLED

We are currently reformatting our Learning programme for remote teaching; this will require some rescheduling so bookings will reopen and new sessions will be created for online courses as soon as possible. In the interim we would encourage you to register your interest so as to be notified of the new schedule. Please be aware that we hope to run many of our courses online, but that this is dependent on staff availability and resources so please be aware we may have to postpone or cancel some sessions

This session focusses on providing photography skills for those undertaking archival research. Dr Oliver Dunn has experience spanning more than 10 years digitising written and printed historical sources for major university research projects in the humanities and social sciences. The focus is very much on low-tech approaches and small budgets. We’ll consider best uses of smartphones, digital cameras and tripods.

The Library as Data: Exploring Digital Collections through Machine Learning

Wed 13 Nov 2019 11:00 Finished

Recent advances in machine learning are allowing computer vision and humanities researchers to develop new tools and methods for exploring digital image collections. Neural network models are now able to match, differentiate and classify images at scale in ways which would have been impossible a few years ago. This session introduces the IIIF image data framework, which has been developed by a consortium of the world’s leading research libraries and image repositories, and demonstrates a range of different machine learning- based methods for exploring digital image collections. We will also discuss some of the ethical challenges of applying computer vision algorithms to cultural and historical image collections. Topics covered will include:

Unlocking image collections with the IIIF image data framework
Machine Learning: a very short introduction
Working with images at scale: ethical and methodological challenges
Applying computer vision methods to digital collections

Network Analysis for Humanities Scholars

Mon 27 Jan 2020 12:30 Finished

This workshop is a very basic introduction to network analysis for humanities scholars. It will introduce the concepts of networks, nodes, edges, directed and weighted networks, bi- and multi-partite networks. It will give an overview of the kinds of things that can be thought about through a network framework, as well as some things that can’t. And it will introduce key theories, including weak ties, and small worlds. There will be an activity where participants will build their own test data set that they can then visualise. In the second half of the workshop we will cover some networks metrics including various centrality measures, clustering coefficient, community detection algorithms. It will include an activity introducing one basic web-based tool that allows you to run some of these algorithms and will provide suggestions for routes forward with other tools and coding libraries that allows quantitative analysis.

Attendees should bring their own laptops.

Ruth Ahnert is Professor of Literary History & Digital Humanities at Queen Mary University of London, and is currently leading two large AHRC-funded projects: Living with Machines, and Networking Archives. She is author of The Rise of Prison Literature in the Sixteenth Century (2013), and co-author of Tudor Networks of Power, and The Network Turn (both forthcoming).

Mapping the Past [remote delivery]

Fri 22 May 2020 11:00 Finished

This intensive workshop is split into two online chats and two 1-hour sessions. Participants will first learn to collect and process geospatial data from historical sources and process it using geographical information systems from Google Earth to QGIS.

The first online session introduces research techniques for collecting, arranging and mapping geospatial data from historical sources, and is taught by Dr Oliver Dunn. His session is split into two parts: Part A will introduce both online sessions by showing some of our own research that makes use of Google Earth, 3D Maps in Excel, and historical GIS. In Part B you will be asked to locate a set of Scotland’s historical lighthouses on historical maps online and map their location and other attributes in Google earth and 3D Maps.

The second online session introduces students to mapping humanities data using Q-GIS which is a free GIS (Geographical Information System) software platform. Course participants will need to download and install QGIS on their laptops before 5th of June. On the 1st of June there will be further details concerning downloading QGIS, a chat forum where we can discuss why you might wish to use GIS, and whether GIS is the right choice for you, and a release of course teaching materials. On 5 June you will be taken through the map creation process step-by-step. This session will be taught by Max Satchell.

Bug Hunt 2020 [cancelled - Covid 19]

Tue 21 Apr 2020 13:00 CANCELLED

This programme is an opportunity to learn, through practical experience and shared investigation, how to apply digital methods for exploring and analysing a body of archival texts. The core of the programme will be 5 x 2 hour classroom based sessions supplemented by group and individual work on tasks related to the project design, delivery and documentation in between sessions. In addition to attending all five face-to-face sessions, participants should set aside an additional 8-10 hours over the duration of the course for work on project-related tasks.

During the programme we’ll work together on a particular topic: how insects were represented in books created for children in the 19th century. This question will help us to think about how children’s encounters with the natural world might have been framed and shaped by their reading. We’ll work on digital collections of 19th century children’s books exploring how such collections are built and how they can be used for machine reading. We’ll develop specific research questions and you’ll learn how to explore them using different tools for textual stylistic analysis. At the end, we’ll present findings and consider the implications of what we’ve discovered.

Topics covered include;

• The development of methods for machine reading the archive – ideas, motivations and ethics • Children’s books of the long 19th century – a beginner’s guide • Designing a small-scale investigation • Building a collection of digital texts • Transforming texts into searchable data • Analysing stylistic patterns in the data

Introduction to Text-mining with Python [remote delivery]

Thu 30 Apr 2020 11:00 Finished

This online session will introduce basic methods for reading and processing text files in Python with Jupyter Notebooks. We'll discuss why you might wish to do text-mining, and whether coding with Python is the right choice for you. We'll run through the 5 steps of text-mining, and start to walk through an example that reads in a text corpus, splits it into words and sentences (tokens), removes unwanted words (stopwords), counts the tokens (frequency analysis), and visualises results.

This initial session is one hour long and will be delivered remotely by video conferencing. During the session we will cover the essentials of working with the Jupyter Notebooks provided so that you can carry on working through the materials in your own time. The first session will be followed by a second, optional Q&A session for troubleshooting issues and recapping essentials.

Required preparation: A short internet-based exercise in working with variables and text in Python will be sent out one week prior to the session. You will also get instructions on how to find the materials we will be using and how to log onto the video conferencing platform. Please make sure you have some time to prepare properly so that we can concentrate on teaching during the remote session.

Sources to Data

Wed 3 Jun 2020 11:00 CANCELLED

We are currently reformatting our Learning programme for remote teaching; this will require some rescheduling so bookings will reopen and new sessions will be created for online courses as soon as possible. In the interim we would encourage you to register your interest so as to be notified of the new schedule. Please be aware that we hope to run many of our courses online, but that this is dependent on staff availability and resources so please be aware we may have to postpone or cancel some sessions

Archives typically hold records containing enormous quantities of data presented in a variety of scribal and print formats. Extracting this information has traditionally involved long hours of expensive manual data-entry work. Nowadays this work can be automated to a large degree and could soon open archives and allow for unprecedentedly large structured data sets for curators, researchers, and the public alike. This workshop will examine new methods for collecting historical data from manuscript and printed documents. We will look at archival photography, OCR, page structure recognition, and new handwritten text recognition systems. Cutting-edge Cambridge research in this field will be demonstrated.

Evolve your Python Code into a Workflow for Text-based Research [cancelled - Covid-19]

Wed 20 May 2020 13:00 CANCELLED

We are currently reformatting our Learning programme for remote teaching; this will require some rescheduling so bookings will reopen and new sessions will be created for online courses as soon as possible. In the interim we would encourage you to register your interest so as to be notified of the new schedule. Please be aware that we hope to run many of our courses online, but that this is dependent on staff availability and resources so please be aware we may have to postpone or cancel some sessions

This workshop will develop your coding practice from testing ideas to creating an efficient workflow for your code, data and analysis. If you are using Jupyter Notebooks (but even if you’re not) this workshop will demonstrate how to better manage your code using good programming practices, and package your code into a program that is easier and quicker to run for lots of data and more reliable.

Required preparation (instructions provided): Python 3 installed on laptop; a text editor or IDE installed on laptop; git installed on laptop and signed up for GitHub; a short internet-based exercise in working with the command line.

Machine Reading the Archive 2020 - end of programme workshop [see eventbrite]

Mon 15 Jun 2020 11:30 Finished

We are currently reformatting our Learning programme for remote teaching; this will require some rescheduling so bookings will reopen and new sessions will be created for online courses as soon as possible. In the interim we would encourage you to register your interest so as to be notified of the new schedule. Please be aware that we hope to run many of our courses online, but that this is dependent on staff availability and resources so please be aware we may have to postpone or cancel some sessions

This public workshop will mark the end of the 2020 programme of Machine Reading the Archive, a digital methods development programme organised by Cambridge Digital Humanities with the support of the Researcher Development Fund.

It will showcase the digital archive projects created by our cohort of project participants as well as invited contributions from leading experts in the field.

(Critical) Machine Vision for the Humanities [remote delivery]

Tue 9 Jun 2020 15:00 Finished

Leonardo Impett, Cambridge Digital Humanities

Application forms should be returned to CDH Learning (learning@cdh.cam.ac.uk) by Friday 22 May 2020. Successful applicants will be notified by 26 May 2020.

This course will introduce graduate students, early-career researchers, and professionals in the humanities to the technologies of image recognition and machine vision, including recent developments in machine vision research in the past half-decade. The course will seek to combine a technical understanding of how machine vision systems work, with a detailed understanding of the possibilities they open to research and study in the humanities, and with a critical exploration of the social, political and ideological dimensions of machine vision.

Learning outcomes

By the end of the course, students should be able to:

Understand the basic tasks of machine vision, such as Image Classification, Object Detection, Image-to-Image Translation, Image Captioning, Image Segmentation etc.
Understand the fundamental technical operations of image processing and machine vision: the pixel encoding of images, Gaussian and convolutional filters,
Explore critical aspects of machine vision in a technically-informed way: e.g. the problems in algorithmic bias brought about by featureless convolutional networks
Develop and run their own simple machine vision and image processing pipelines, in a visual programming language compiling to Python
Understand the potential synergies and limitations of machine vision applications in humanities research and cultural heritage institutions

The Transkribus Guided Project

Wed 29 Jul 2020 16:00 Finished

We introduce the Transkribus software system that can be taught to read handwriting from images of documents and rapidly convert it into useful digital formats. This guided course provides basic training by practical immersion in this software, which requires only basic IT skills. Transkribus was developed by READ under the Horizon 2020 funding framework and is now a co-operative. It had 20,000+ users in 2019, and is becoming a standard research tool for mass transcription of archival sources. Participants will transcribe anonymised data from pre-loaded scans of forms filled out for the French national census of 1999 in Transkribus's downloadable software interface. These manual transcriptions will help train a handwritten text recognition (HTR) model to automatically transcribe many more of these forms later. In fact, the model will eventually allow the creation of one of the largest data sets ever attempted from manuscript sources. This course is a collaboration with Transkribus and Cambridge Digital Humanities. It is funded by a Cambridge Humanities Research Grant.

Delving into Massive Digital Archives - finding lost, forgotten and neglected texts (Guided Project)

Mon 12 Oct 2020 11:00 Finished

Application forms https://www.cdh.cam.ac.uk/file/cdhdelvingintomassivedaapplicationdocx should be returned to CDH Learning (learning@cdh.cam.ac.uk) by Tuesday 6 October 2020. Successful applicants will be notified by Thursday 8 October 2020.

Massive digital archives such as the Internet Archive offer researchers tantalising possibilities for the recovery of lost, forgotten and neglected literary texts. Yet the reality can be very frustrating due to limitations in the design of the archives and the tools available for exploring them. This programme supports researchers in understanding the issues they are likely to encounter and developing practical methods for delving into massive digital archives.

Theme: Machine Reading the Archive

Contact training provider

Privacy policy
Cookie policy

Study at Cambridge

About the University

Research at Cambridge