Careful thought about files at the beginning of a research project can save a lot of time, money and heartache later on.
File formats govern the ability to use and reuse data in the future, with the ongoing accessibility of data an important consideration.
Formats more likely to remain accessible in the future are non-proprietary, based on open, documented standards, in common use within the research community, encoded in a standard representation (ASCII, Unicode), unencrypted, and uncompressed.
- UK Data Archive's Data Formats table lists optimal data formats that are used for long-term preservation of data.
- ANDS Guide on File Formats covers institutional planning implications, obsolescence, file migration, open and proprietary formats, lossy and lossless formats, compression, standards and more.
File naming conventions
A file naming convention (FNC) is a framework for naming your files in a way that describes what they contain and how they relate to other files. It is essential to establish an FNC before you begin collecting data, to prevent a backlog of unorganized files that could lead to misplaced or lost data.
Naming records consistently, logically and predictably distinguishes similar records from one another at a glance. This makes records easier to store and retrieve, and lets users browse file names more efficiently. Agreed conventions should also make file naming easier for colleagues, because they will not have to rethink the process each time.
The University of Edinburgh has a comprehensive yet easy-to-follow list (with examples and explanations) of 13 Rules for file naming conventions.
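As a rough illustration, a naming convention can be written down as a small function so that every file name is assembled from the same components in the same order. The sketch below is only one possible convention, not the Edinburgh rules: the function name `build_file_name`, the choice of components (project, description, date, version) and the separators are all assumptions made for the example.

```python
import re
from datetime import date


def build_file_name(project: str, description: str, created: date,
                    version: int, extension: str) -> str:
    """Assemble a file name from fixed components in a fixed order.

    Hypothetical convention: project_description_YYYYMMDD_vNN.ext
    """
    def clean(part: str) -> str:
        # Replace any run of characters other than letters and digits
        # with a single hyphen, then trim hyphens from both ends.
        return re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-")

    stem = "_".join([
        clean(project),
        clean(description),
        created.strftime("%Y%m%d"),  # year-month-day sorts chronologically
        f"v{version:02d}",           # zero-padded so v02 sorts before v10
    ])
    return stem + "." + extension.lstrip(".").lower()


print(build_file_name("SoilSurvey", "site 3 pH readings",
                      date(2024, 3, 7), 2, "CSV"))
# SoilSurvey_site-3-pH-readings_20240307_v02.csv
```

Generating names this way, rather than typing them by hand, is one way to keep the whole team on the agreed convention.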
Having logical and known naming conventions in place can also help you with version control.
Because digital research data can so easily be changed, copied, or overwritten, researchers need to be able to protect its authenticity. Working with outdated versions of files wastes research time and can put valuable data at risk.
Version control can prevent this. Version control is the means by which different versions and drafts of a document (or file, record or dataset) are managed. This is particularly important if data is being used by multiple members of a research team, or if research files are shared across different locations.
Version control involves naming and distinguishing between a series of drafts of a document (or file, record or dataset) that lead to a final (or approved) version, which may in turn be subject to further amendments. It also provides an audit trail for the revision and update of draft and final versions.
The University of Leicester has some excellent resources on version control.
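The progression from drafts to an approved version and on to later amendments can be sketched in code. The example below assumes a hypothetical numbering scheme in which drafts are `v0-1`, `v0-2`, and so on, an approved version moves to the next whole number (`v1-0`), and amendments to it become `v1-1`, `v1-2`. The scheme, the function name `next_version` and the file names are illustrative assumptions, not taken from any of the resources mentioned here.

```python
import re

# Hypothetical pattern: <stem>_v<major>-<minor>.<extension>
VERSION_PATTERN = re.compile(
    r"^(?P<stem>.+)_v(?P<major>\d+)-(?P<minor>\d+)(?P<ext>\.[^.]+)$"
)


def next_version(file_name: str, approved: bool = False) -> str:
    """Return the name the next version of file_name should carry.

    A further draft or amendment increments the minor number; an
    approved version increments the major number and resets the minor.
    """
    match = VERSION_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"no version label found in {file_name!r}")
    major, minor = int(match["major"]), int(match["minor"])
    if approved:
        major, minor = major + 1, 0
    else:
        minor += 1
    return f"{match['stem']}_v{major}-{minor}{match['ext']}"


print(next_version("interview-guide_v0-2.docx"))                 # next draft
print(next_version("interview-guide_v0-3.docx", approved=True))  # approved
print(next_version("interview-guide_v1-0.docx"))                 # amendment
# interview-guide_v0-3.docx
# interview-guide_v1-0.docx
# interview-guide_v1-1.docx
```

A hyphen is used between the two numbers instead of a full stop so that the only full stop in the name is the one before the file extension.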
In the case of research data, a new version of a dataset may be created when an existing dataset is reprocessed, corrected or appended with additional data. Data versioning is one way to track changes in ‘dynamic’ data, that is, data that changes over time.
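One simple way to picture data versioning is as a log in which each new version of a dataset records why it was created. The sketch below is a minimal, hypothetical illustration: the class names `DatasetVersion` and `DatasetHistory`, the field names, the `"created"` label for the first release and the sample dataset are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date

# Reasons for a new version named in the text, plus an assumed
# "created" label for the first release.
CHANGE_TYPES = ("created", "reprocessed", "corrected", "appended")


@dataclass
class DatasetVersion:
    number: int
    released: date
    change: str
    note: str


@dataclass
class DatasetHistory:
    name: str
    versions: list = field(default_factory=list)

    def add_version(self, change: str, note: str,
                    released: date) -> DatasetVersion:
        """Record a new version; numbers increase by one each time."""
        if change not in CHANGE_TYPES:
            raise ValueError(f"unknown change type: {change!r}")
        version = DatasetVersion(len(self.versions) + 1,
                                 released, change, note)
        self.versions.append(version)
        return version


history = DatasetHistory("rainfall-2023")
history.add_version("created", "Initial release", date(2024, 1, 15))
latest = history.add_version("appended", "Added December readings",
                             date(2024, 2, 1))
print(latest.number, latest.change)
# 2 appended
```

Keeping the reason and date alongside each version number gives the kind of audit trail that version control is meant to provide.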