Avoiding data disasters

Please note that these materials are no longer being developed. For a more up-to-date version, see here.

Outline

It has been said that 80% of the time in a data analysis is spent cleaning and preparing the data. Not only does this represent a significant time investment for the data analyst, but it is often a hurdle for non-specialists trying to get to grips with analysing their own data after attending an R or Python course. Despite the best intentions, a spreadsheet that is intuitive and easily understood by human eyes can lead to disaster when it is processed computationally.

This workshop will go through the basic principles that we can all adopt in order to work with data more effectively and "think like a computer". Moreover, we will discuss the best practices for data management and organisation so that our research is auditable and reproducible by ourselves, and others, in the future.
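As a small illustration of why a human-friendly spreadsheet can trip up a program, the sketch below reads a made-up table in which the weight column mixes plain numbers, a value with a unit, and a free-text missing-value marker. The data, the column names, and the `parse_weights` helper are all hypothetical and are not taken from the course materials.

```python
import csv
import io

# Hypothetical spreadsheet export: easy for a person to read,
# but the "weight" column is not consistently numeric.
RAW = """patient,weight
A01,72
A02,68kg
A03,n/a
A04,70.5
"""


def parse_weights(text):
    """Return (parsed, problems).

    parsed   maps patient ID to a numeric weight.
    problems lists (patient ID, cell text) for cells that are not plain numbers.
    """
    parsed, problems = {}, []
    for row in csv.DictReader(io.StringIO(text)):
        cell = row["weight"].strip()
        try:
            parsed[row["patient"]] = float(cell)
        except ValueError:
            # A person reads "68kg" as 68, but a program cannot without extra rules.
            problems.append((row["patient"], cell))
    return parsed, problems


parsed, problems = parse_weights(RAW)
print("Usable values:", parsed)
print("Cells needing cleaning:", problems)
```

Half of the rows here cannot be used without manual cleaning. Keeping one kind of value per column (numbers only, with the unit in the header and one agreed marker for missing data) avoids this.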

Timetable for 5th December 2016

12:30 - 13:00 Mark - Course introduction and general principles
13:00 - 14:00 Valeria - Data formatting issues (files for practical: Example 1, Example 2)
14:00 - 15:00 Mark - OpenRefine demo (example file here)
15:00 - 15:30 Sergio - File management
15:30 - 16:00 Peter - Strategies for backup
16:00 - 16:30 Rosie - Data Sharing at the University of Cambridge

References

The course was inspired by:

Data Carpentry workshops

The Data Organisation tutorial by Karl Broman

The Quartz guide to bad data

Three common bad practices in sharing tables and spreadsheets and how to avoid them.

Open Data

Cambridge researchers are advised to visit the Open Data page at Cambridge University for advice on data sharing and on complying with funding agency requirements for data management.