Thu 11 Mar, Thu 18 Mar 2021
11:00 - 12:00

Venue: Cambridge Digital Humanities Online

Provided by: Cambridge Digital Humanities


Bookings cannot be made on this event (Event is not taking bookings).


Methods Workshop: Introduction to Text-mining with Python



Text-mining is the extraction of information from unstructured text, such as books, newspapers, and manuscript transcriptions. This foundational course is aimed at students and staff who are new to text-mining. It presents a basic introduction to text-mining principles and methods, with coding examples and exercises in Python. To illustrate the process, we will walk through a simple example of collecting, cleaning and analysing a text.

If you are interested in attending this course, please fill in and return the application form by Monday, 22 February 2021. Priority will be given to students and staff in the Schools of Arts & Humanities and Humanities & Social Sciences, and in libraries and museums. If you study or work in a STEM department and use humanities or social sciences approaches, you are also welcome to apply.


We expect you to have some basic knowledge of Python, or coding in another language. At a minimum, we recommend that you have attended the CDH Basics session “First steps in coding and Jupyter Notebooks” and subsequently done some follow-on independent learning in basic Python. Alternatively, you may have equivalent basic coding experience in Python or a different language from another course of study.

If you are unsure whether your coding experience is sufficient, please apply anyway and we can talk about it together.


Number of sessions: 2

#  Date        Time           Venue                                Trainer
1  Thu 11 Mar  11:00 - 12:00  Cambridge Digital Humanities Online  Mary Chester-Kadwell
2  Thu 18 Mar  11:00 - 12:00  Cambridge Digital Humanities Online  Mary Chester-Kadwell

We will cover:

  • What text-mining is for and what text-mining methods are available (including topic modelling, sentiment analysis, named entity recognition).
  • The text-mining pipeline and its five steps: choosing and collecting text; cleaning and preparing; exploring; analysing; and presenting results.
  • Revision of basic Python:
    • Working with text using strings and manipulating lists of strings;
    • Importing code and calling functions;
    • Using Jupyter notebooks.
  • Methods for:
    • Harvesting text from the web;
    • Reading from and saving text to files;
    • Working with TEI-XML;
    • Cleaning up text (normalising);
    • Splitting strings into words and sentences (tokens);
    • Removing unwanted words (stopwords);
    • Counting tokens (frequency analysis);
    • Visualising results.
  • Next steps: resources and directions.
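
As a rough illustration of how several of the methods listed above fit together, here is a minimal sketch using only the Python standard library. It is not taken from the course materials: the sample TEI document, the tiny stopword list and all function names are assumptions made for this example. Harvesting text from the web and visualising results are left out to keep it short.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real projects use longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample document, standing in for a real transcription.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>The cat sat on the mat.</p>
    <p>The cat saw a dog, and the dog saw the cat.</p>
  </body></text>
</TEI>"""


def text_from_tei(xml_string):
    """Return the plain text inside the TEI <body>, ignoring the header."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # Join all text nodes, then collapse runs of whitespace.
    return " ".join("".join(body.itertext()).split())


def normalise(text):
    """Lowercase and replace anything that is not a-z or whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenise(text):
    """Split a normalised string into word tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def run_pipeline(xml_string):
    """Collect, clean, tokenise, filter and count, in that order."""
    raw = text_from_tei(xml_string)
    tokens = remove_stopwords(tokenise(normalise(raw)))
    return Counter(tokens)


if __name__ == "__main__":
    counts = run_pipeline(SAMPLE_TEI)
    for word, n in counts.most_common(3):
        print(word, n)
```

Note that this simple normalisation discards accented letters and digits, which may not suit every corpus; a real project would adapt each step to its own texts.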

By the end of this course you should be able to:

  • Understand the broad overview of different text-mining methods and their uses.
  • Plan a basic text-mining pipeline for your work.
  • Expand your skills in using Python and Jupyter Notebooks into text-mining.

This course takes a ‘flipped classroom’ approach, in which much of the learning is self-paced and done in your own time. Preparatory material is released in the week before the course takes place. The course starts with a 1-hour remote video session to introduce the topics and materials, and ends with another 1-hour remote video session to discuss progress and next steps. Self-paced materials are provided to work through between the sessions. A chat forum on Moodle will be available for asking and answering questions during the week.

Please make sure you can plan time in your schedule to complete the preparatory and self-paced materials in order to get the most out of the course. Time estimates for working through these materials are as follows:

  • Preparatory materials (total: 15 minutes to about 3 hours, depending on the optional items):
    • Introductory video: 15 minutes
    • Optional: Installing Python: 1 hour
    • Optional: Revision of basic Python: 1-2 hours
  • Self-paced Jupyter Notebooks (total: 2-4 hours)

The amount of time you spend on the self-paced materials will depend on your existing experience and personal goals.

System requirements

You will need a laptop or desktop computer to join the sessions and follow the self-paced materials. You will also need Python 3 and Jupyter installed; if you don’t already have them, full instructions are provided in the preparatory materials.

CDH Methods Workshop
