Advanced Topics in Data Preparation Using Python
As a science researcher, you will need to deal with heterogeneous and dirty data. The data may have been collected through different approaches: observation, surveys, interviews, experiments, or published printed or online sources. Moreover, the data may have been encoded by different software and people, and may come to you in different formats (e.g., txt, csv, xlsx, json). Therefore, the data typically needs to be preprocessed before you can make sense of it through statistics and graphical representations. For example, you may need to re-encode the information in a way that is more meaningful to your analysis goals. You may also need to re-arrange and clean the data, removing duplicates and incomplete information. Finally, you may need to apply all these transformations to other similarly structured data, over and over again. Doing this “by hand” is an arduous, time-consuming and error-prone task, so automating these routines is the smart way to go!
In this course, I will teach you how to read, transform and prepare different kinds of data using Python and its popular libraries NumPy and Pandas. We are going to solve several problems (“Missions”) together, of increasing levels of difficulty. Each Mission will introduce you to new data structures (e.g., dictionaries, series, dataframes), methods and attributes, extending your previous knowledge. The content of the course is designed to be to-the-point and focused on practicality. By the end of the course, you should be able to program preprocessing routines that you can apply to your own data. Moreover, you will have an advanced data-handling blueprint to which you can easily add new information and skills over time.
- Postgraduate students and staff
- Further details regarding eligibility criteria are available here
- Some experience coding and running Python scripts and a familiarity with the syntax.
- Knowledge of the most common data types, data structures (lists, tuples, sets), operators, and flow statements.
If you have not used Python before, it is recommended that you first take the ‘Introduction to Python’ module.
You are expected to bring your own laptop with a working Python environment (e.g., Visual Studio Code or JupyterLab) and with the NumPy and Pandas libraries already installed; the io module used in the course ships with Python's standard library and needs no separate installation. You should also bring a copy of the materials provided for the module, so that you can access the code, datasets and supporting information.
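One quick way to confirm that a laptop is ready is to run a short check like the sketch below before the first session. It only assumes that NumPy and Pandas are importable in the environment you plan to use; the printed version numbers will differ from machine to machine.

```python
import io
import sys

import numpy as np
import pandas as pd

# Report the interpreter and library versions found in this environment.
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

# io is part of the standard library, so it only needs to be imported.
buffer = io.StringIO("ok")
print("io:", buffer.read())
```

If any of the imports fails with a ModuleNotFoundError, the corresponding library is not installed in the active environment.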
Number of sessions: 4
# | Date | Time | Venue | Trainer |
---|---|---|---|---|
1 | Tue 5 Nov | 16:00 - 18:00 | Titan Teaching Room 1, New Museums Site | Maité Crespo García |
2 | Tue 12 Nov | 16:00 - 18:00 | Titan Teaching Room 1, New Museums Site | Maité Crespo García |
3 | Tue 19 Nov | 16:00 - 18:00 | Titan Teaching Room 1, New Museums Site | Maité Crespo García |
4 | Tue 26 Nov | 16:00 - 18:00 | Titan Teaching Room 1, New Museums Site | Maité Crespo García |
Outline of the sessions
Session 1: Introduction to Pandas and DataFrames. Introduction to the course contents and organization, basic guide on how to install necessary programs and how to work with the provided course material, discussion about the importance of preparing the data before analyses. Revision of prior knowledge on most common data types and structures used in Python. Introduction to other data structures such as dictionaries, series and dataframes. Beginning to work on Mission 1. Reading csv files with Pandas and learning how to inspect their content within the Python environment.
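As a rough illustration of the Session 1 topics, the sketch below builds a Series from a dictionary and then reads a small CSV into a DataFrame for inspection. The participant data and column names are made up for this example, and the CSV text is held in memory with io.StringIO so that the snippet runs without a file on disk; it is not taken from the course materials.

```python
import io

import pandas as pd

# A dictionary maps keys to values (hypothetical participant ages).
ages = {"P01": 23, "P02": 31, "P03": 27}

# A Series is a labelled one-dimensional array; the dictionary keys become its index.
age_series = pd.Series(ages, name="age")
print(age_series["P02"])

# Hypothetical CSV content, inlined instead of stored in a file.
csv_text = """participant,age,group,score
P01,23,control,7.5
P02,31,treatment,8.1
P03,27,control,6.9
"""

# read_csv accepts a path or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows of the table
print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type inferred for each column
df.info()             # column names, non-null counts and memory usage
print(df.describe())  # summary statistics for the numeric columns
```

With a real file, the same call would take a path instead, for example `pd.read_csv("my_data.csv")`, where the file name is a placeholder.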
Session 2: Selecting, sorting and re-encoding your data. Continuation of Mission 1. Learning how to re-encode the content of dataframes, how to select columns and rows, how to remove unwanted content, and how to filter and sort the information. Using the built-in help() and dir() functions to learn about different Python components and as a support when coding. Beginning to work on Mission 2. Reading Excel files.
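The operations listed for Session 2 could look roughly like the following sketch. The table, the column names and the "c"/"t" group codes are invented for illustration and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical data with abbreviated group codes and an unwanted column.
df = pd.DataFrame({
    "participant": ["P01", "P02", "P03", "P04"],
    "group": ["c", "t", "c", "t"],
    "score": [7.5, 8.1, 6.9, 9.0],
    "notes": ["", "late", "", ""],
})

# Re-encode: replace the short codes with more meaningful labels.
df["group"] = df["group"].map({"c": "control", "t": "treatment"})

# Remove unwanted content: drop a column that is not needed.
df = df.drop(columns=["notes"])

# Select columns by name.
scores = df[["participant", "score"]]

# Filter rows with a boolean condition.
high = df[df["score"] > 7.0]

# Sort rows by a column, highest score first.
ranked = df.sort_values("score", ascending=False)
print(ranked)

# dir() lists the attributes and methods of an object; here only those starting with "sort".
print([name for name in dir(df) if name.startswith("sort")])

# help() prints the documentation of a function or method (uncomment to view it).
# help(df.sort_values)
```

Each step returns a new object, so intermediate results such as `high` and `ranked` can be inspected without altering `df`.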
Session 3: Cleaning and combining your data. Continuation of Mission 2. Revising prior knowledge on how to write functions in Python. Writing useful functions to process data upon loading. Learning how to detect and remove missing data and duplicates. Combining dataframes and datasets. Saving clean datasets. Beginning of Mission 3. Reading json files.
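A minimal sketch of the Session 3 ideas follows: a small reusable function that removes duplicates and incomplete rows, a merge of two tables, and saving the result as CSV. The function name `clean`, the data and the choice of a left merge are assumptions made for this example; the result is written to an in-memory buffer rather than to disk.

```python
import io

import numpy as np
import pandas as pd


def clean(df):
    """Drop exact duplicate rows, then rows with any missing value."""
    return df.drop_duplicates().dropna().reset_index(drop=True)


# Hypothetical scores with one duplicated row and one missing value.
scores = pd.DataFrame({
    "participant": ["P01", "P02", "P02", "P03"],
    "score": [7.5, 8.1, 8.1, np.nan],
})

# Hypothetical second table holding the group of each participant.
groups = pd.DataFrame({
    "participant": ["P01", "P02", "P03"],
    "group": ["control", "treatment", "control"],
})

# Detect problems before removing them.
print(scores.isna().sum())        # missing values per column
print(scores.duplicated().sum())  # number of repeated rows

tidy = clean(scores)

# Combine the two tables on their shared key column.
combined = tidy.merge(groups, on="participant", how="left")

# Save the clean dataset; a file path could be passed instead of the buffer.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Whether to drop incomplete rows or fill them in depends on the analysis; dropping is used here only because it is the simplest option to show.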
Session 4: Automating data transformation with loops and batches. Continuation of Mission 3. Extending knowledge on control flow statements. Processing several datasets in a batch. Mission 4. Reading text files with the io library. Working on data handling problems provided by the students.
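Batch processing of the kind described for Session 4 can be sketched as a loop that applies one preparation function to several similarly structured datasets. In the example below, the file names, their contents and the `prepare` function are hypothetical, and the "files" are strings wrapped in io.StringIO so that the code is self-contained.

```python
import io

import pandas as pd

# Hypothetical raw files, kept in memory as text.
raw_files = {
    "site_a.csv": "id,value\n1,10\n2,\n2,\n3,30\n",
    "site_b.csv": "id,value\n1,5\n1,5\n2,15\n",
}


def prepare(handle, source):
    """Read one dataset, clean it, and record where it came from."""
    df = pd.read_csv(handle)
    df = df.drop_duplicates().dropna()
    return df.assign(source=source)


# Apply the same routine to every dataset in the batch.
frames = []
for name, text in raw_files.items():
    frames.append(prepare(io.StringIO(text), name))

# Stack the cleaned datasets into a single table.
batch = pd.concat(frames, ignore_index=True)
print(batch)
```

With files on disk, the loop could instead iterate over paths, for example those returned by `pathlib.Path("data").glob("*.csv")`, where the folder name is a placeholder.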
Aims: Gain advanced knowledge on how to read and preprocess different kinds of raw scientific data in Python, so that it can be ready for analyses and graphical representation.
Specific objectives:
- Read data encoded in different formats and originating from different sources.
- Visualize the data and obtain basic information about it.
- Reformat and re-encode the data into more meaningful and handy formats.
- Select and filter specific information within the data structures.
- Deal with duplicates, missing data and outliers.
- Combine data structures and datasets and save new and clean datasets.
- Process datasets in batches using loops and other strategies to handle multiple datasets or repetitive tasks with different parameters.
- Solve other problems that you may face when handling your own data.
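The objectives above mention outliers without naming a method. One common convention, used here purely as an assumption, is to flag values lying more than 1.5 times the interquartile range beyond the quartiles. The measurements in this sketch are invented.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10.2, 9.8, 10.5, 10.1, 9.9, 42.0], name="reaction_time")

# Interquartile range (IQR) and the conventional 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask marking the values outside the fences.
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])

# Keep only the values inside the fences.
cleaned = values[~is_outlier]
print(cleaned)
```

Whether a flagged value should be removed, corrected or kept is a judgement that depends on the data and the research question.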
The module consists of four in-person sessions. Each session lasts 2 hours and combines presentations by the instructor with several short hands-on intervals. During these intervals, students will run code snippets provided as part of the data Missions. Datasets will be provided by the instructor as part of the course materials. However, students are encouraged to bring their own data to work with during the fourth session of the module.
Click the "Booking" button panel on the left-hand sidebar (on a phone, this will be via a link called Booking/Availability near the top of the page).
Moodle is the 'Virtual Learning Environment' (VLE) that CaRM uses to deliver online courses.
CaRM instructors use Moodle to make teaching resources available before, during, and/or after classes, and to make announcements and answer questions.
For this reason, it is vital that all students enrol onto and explore their course Moodle pages once they have booked their CaRM modules via the UTBS, and that they do so before their module begins. Moodle pages for modules should go live around a week before the module commences, but some may be made visible to students earlier.
For more information, and links to specific Moodle module pages, please visit our website