D-Lib Magazine, January/February 2015

Data Citation Practices in the CRAWDAD Wireless Network Data Archive
Tristan Henderson

Abstract

CRAWDAD (Community Resource for Archiving Wireless Data At Dartmouth) is a popular research data archive for wireless network data, archiving over 100 datasets used by over 6,500 users. In this paper we examine citation behaviour amongst 1,281 papers that use CRAWDAD datasets. We find that (in general) paper authors cite datasets in a manner that is sufficient both to credit dataset authors and to provide access to the datasets that were used. Only 11.5% of papers did not do so; common problems included (1) citing the canonical papers rather than the dataset, (2) describing the dataset using unclear identifiers, and (3) not providing URLs or pointers to datasets.

1 Introduction

The archiving, sharing and reuse of research data is a fundamental part of the scientific process, and the benefits of doing so have been increasingly recognised [4, 13]. Indeed such data sharing is now being encouraged or mandated by research funders [3, 10]. Since 2005 we have run the CRAWDAD network-data archive as a resource for wireless-network researchers to deposit and share their data, and for other researchers to download and use data in their research. Wireless network data is crucial for conducting research into future wireless networks; much research is based on analytical or simulation models that might not reflect the real world. We have therefore encouraged researchers to share their data, such as mobility measurements of wireless network users, or radio measurements of networks. By many measures CRAWDAD has been a success, with over six thousand users using datasets in over a thousand papers, and in teaching and standards development. CRAWDAD datasets are not used solely by wireless-network researchers; we have observed researchers from other fields such as geography, epidemiology, and sociology using our datasets. In this paper we investigate how these CRAWDAD users cite our datasets when they use them.
Understanding how people cite data, and the problems that they have in citing data, is important if we are to maximise the usefulness of shared research data. If data are cited in a way that makes it difficult to find or interpret the data, then this might limit future research. These issues have recently been recognised by the Force 11 Data Citation Principles [5].

2 The CRAWDAD data archive

CRAWDAD (Community Resource for Archiving Wireless Data at Dartmouth) was founded in 2005 [7], and initially funded for three years through a Community Resource award from the US National Science Foundation (NSF). The original NSF proposal summarised the archive as follows:

"The investigators propose to develop an archive of wireless network data and associated tools for collecting and processing the data, as a community resource for those involved in wireless network research and education. Today, this community is seriously starved for real data about real users on real networks. Most current research is based on analytical or simulation models; due to the complexity of radio propagation in the real world and a lack of understanding about the behavior of real wireless applications and users, these models are severely limited. On the other hand, the difficult logistical challenges involved in collecting detailed traces of wireless network activity preclude most people from working with experimental data."

Starting with a single wireless network dataset collected by the investigators, CRAWDAD has grown to become what we believe is the largest data archive of its type. (This is perhaps not as impressive as it might sound, as in general, there are few network data archives and even fewer dedicated to wireless networks.) As of October 2014 we host 116 datasets and tools. We require users to register and agree to a license before they can download datasets (viewing of metadata is free) and according to our registration records we now have over 6,200 users from 101 countries.
To initially bootstrap and publicise the archive, we held a series of workshops at the largest wireless network research conferences [16, 14, 15]. We also approached premier publication venues in our research community (in computer science these are typically conferences rather than journals) with a view to encouraging or even mandating data sharing. These approaches were not particularly successful, with the exception of the Internet Measurement Conference, which now requires data sharing for those papers that wish to be considered for a best paper award.

Other mechanisms for encouraging researchers to contribute to the data archive include the all-important crawdad toys (Figure 1) and stickers that are sent out to contributors! We also attempt to make it as easy as possible for researchers to contribute their data, by helping them to create metadata for their datasets. Because wireless data might well contain sensitive data (e.g., locations, or application-usage information), we help researchers with sanitising and removing sensitive data. As there are no standard formats for much of the data collected in our field, however, we generally point data contributors to existing tools and algorithms and help them to apply these, but do not carry out the sanitisation ourselves.