D-Lib Magazine, January/February 2015

A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing
Laura Drăgan, Markus Luczak-Rösch, Elena Simperl, Heather Packer and Luc Moreau

Abstract

In this paper we present opportunities to leverage crowdsourcing for the a-posteriori capture of dataset citation graphs. We describe a user study we carried out, which applied a possible crowdsourcing technique to collect this information from domain experts. We propose to publish the results as Linked Data, using the W3C PROV standard, and we demonstrate how to do this with the Web-based application we built for the study. Based on the results and feedback from this first study, we introduce a two-layered approach that combines information extraction technology and crowdsourcing in order to achieve both scalability (through the use of automatic tools) and accuracy (via human intelligence). In addition, non-experts can become involved in the process.

1 Introduction

The need to treat research datasets as "first-class citizens" in the scientific publishing process is recognised in many disciplines. Many popular citation guidelines have been enriched with templates for data publication and citation¹. This enables a more informed review and reuse of scientific work, as readers of scholarly publications can now easily consult the relevant datasets and assess their quality. References to datasets could also become an integral part of bibliographic algorithms in order to add data-specific statistics to traditional citation graphs. Going a step further, datasets could have their own form of citation: a dataset could be a composite of, derived from, a subset of, the aggregate of, or a new version of other datasets. The combination of metadata about scientific publications and the related data, citation links between these artefacts, and versioning information could be the source for rich analytics, which would offer a more complete picture of the scientific publishing process and would drastically improve reproducibility of research results.
However promising and conceptually simple such an idea might sound, exploring this integrated information space is still a thing of the future. In this paper we propose different opportunities to leverage crowdsourcing for a-posteriori creating dataset citation graphs. By a-posteriori we mean that the information is captured "after publication", as opposed to "at the time of writing or submission". This is motivated by the large number of existing publications and datasets that are already published, but not interlinked. We describe a practical approach, which exploits a specific crowdsourcing technique to elicit these graphs from domain experts. The results cover both types of information mentioned earlier: the relationship between publications and data sources, as well as between different dataset versions or derivatives. For the representation of these augmented citation graphs we apply provenance modelling as recommended by the W3C provenance working group, as well as the Linked Data principles (Berners-Lee, 2006) to facilitate online access and data integration. In the following we will refer to two examples to illustrate the main idea of our approach: the DBpedia Linked Open Data dataset, and the USEWOD log file datasets. We report on a small user study which was run during the USEWOD2014 workshop with a group of experts as participants in the crowdsourcing process. Following up the findings of this study, we redesigned the approach to balance accuracy and scalability; we combined information extraction technology (automatic, hence fast) with crowd intelligence (manual, hence accurate). This hybrid workflow opened up the possibility to use multiple forms of crowdsourcing for different tasks, most importantly enabling us to involve non-experts (hence, a larger crowd than the research community) in the information collection and analysis process. 
We define two types of relationships between publications and the datasets they refer to: a generic, high-level relationship which merely captures the fact that a dataset (possibly with some versioning information) is "used" in a paper; and a more specialized set of relationships which provide details about the role of the data artefact in that line of scientific inquiry. This distinction makes it easier to collect information; some contributors to our crowdsourcing endeavour might not have the time or knowledge to identify very specific data citation information from a publication. In those cases in which this information can be elicited, we offer the conceptual gestalt to represent it, hence enabling more advanced analytics and giving a more complete picture of the scientific process.

In Section 2 we present some of the current developments around data citation, and briefly introduce the fundamentals of crowdsourcing. In Section 3 we describe our two use cases, while in Section 4 we discuss a particular instantiation of the framework and its outcomes. In Section 5 we present the enhanced design of our system and conclude with an outlook on future work.

2 Background and related work

2.1 Capturing data citations

Many organisations have identified the need for data citation and have developed principles and rules to support it. The Force11 community (Bourne, et al., 2012) has created a list of eight data citation principles, which cover purpose, function, and attributes of citations. The principles describe data citation importance, credit and attribution, evidence, unique identification, access, persistence, specificity and verifiability, and interoperability and flexibility. "These principles recognise the dual necessity of creating citation practices that are both human understandable and machine-actionable."
(Force11, 2014) Michigan State University provide guidelines for citing data using established bibliography styles such as APA, MLA or Chicago and Harvard (Michigan State University Libraries, 2014). Attributes recommended to describe a dataset include author or creator, date of publication, title, publisher, and additional information such as edition or version, date accessed online, and a format description. Some e-Science initiatives such as OpenML (a repository of machine-learning experiments, in which datasets play a pivotal role) supply exportable metadata records for datasets; they also measure citation counts and usage in experiments and support links to dataset description publications, so that scientists who publish a dataset can indicate which paper they want other people to cite when reusing the data. This citation of a publication rather than of the data artefact itself is customary in some disciplines such as Computer Science. It ensures that credit is given to the creators of the dataset. However, as we show in the following sections, it creates a disconnect between the data and the results, which can be detrimental to the scientific workflow. We therefore propose to go beyond citing publications only. The data citation approach recommended by most publishers is to use the well-established Digital Object Identifier (DOI) system. A DOI is a persistent identifier which is dereferenceable and provides metadata describing the object, in our case the dataset. DataCite and Cite My Data create DOIs for the datasets published through their platform. The use of standardised vocabularies to describe data citation supports many of the guidelines and principles mentioned above. Some vocabularies are specifically designed for this very type of references, while others could be re-purposed to cover data citation as well. These include:
Research into data citation includes several domain-specific projects, such as the Advanced Climate Research Infrastructure for Data (ACRID) project, which developed a Linked Data approach to citation and publication of climate research data along with full provenance information, including the workflows and software that were used (Ball & Duke, 2012).

2.2 Crowdsourcing

Crowdsourcing was defined by Howe as: "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. This can take the form of peer-production (when the job is performed collaboratively), but is also often undertaken by sole individuals. The crucial prerequisite is the use of the open call format and the large network of potential laborers." (Howe, 2006)

There are various ways in which data citation links could be created through crowdsourcing. Putting aside the various ways of deploying the original notion defined by Howe, contributions could be sought from multiple audiences, or crowds, from the researchers authoring or reviewing a publication to publishers, dataset owners and users, and the general public. Going a step further, we could imagine various types of contributions and crowdsourcing workflows, ranging from the identification of links between papers and datasets to validating existing citations or eliciting further information about the dataset itself, including versioning. Automatic techniques could be exploited to identify potential dataset references, or to discover potential inconsistencies in the responses submitted by the crowd. All this could happen either at publication time (e.g., when the camera-ready version of an article is prepared for submission) or independently of the publication life-cycle.
In this section, we focus on settings where such information has not been collected at the time of publication and data citation tasks are outsourced to an open crowd of contributors using one or a combination of crowdsourcing mechanisms. When embarking on a crowdsourcing enterprise, the requester (that is, the party which resorts to the wisdom of the crowds to solve a given problem) has a variety of options to choose from in terms of specific contributions, their use as part of the solution to the problem, and the ways in which participants will be incentivised. Each of these is a dimension of the crowdsourcing design space. Case studies and experience reports in the field provide theoretical and empirical evidence for the extent to which certain regions in that space are likely to be more successful than others. In the remainder of this section we introduce these dimensions and discuss the implications of choosing one alternative over the other.

A first dimension of crowdsourcing refers to the task that is assigned. The literature distinguishes between two types of tasks: macro- and micro-tasks (Dawson & Bynghall, 2012). Macro-tasks are outsourced via an open call without specifying how they are to be completed. This is the case, most importantly, when the task is of a creative nature, which makes it difficult to define a single structured workflow that will achieve the goal (e.g., scientific challenges à la InnoCentive, or eParticipation approaches to policy making (Aitamurto, 2012)). A second category of crowdsourcing deals with micro-tasks: these are much more constrained and at a level of granularity that allows the contributors to solve them rapidly and without much effort. A typical project contains a number of such micro-tasks, which are outsourced to different contributors who approach them in parallel and independently of each other.
This makes the overall project very efficient, though it adds overhead in consolidating the individual inputs into the final result. Given the nature of the data citation problem, we expect a micro-task approach to be beneficial. For any collection of papers and datasets, one can easily define micro-tasks referring to pairs of papers and datasets, or one specific paper and all datasets that are relevant to it. No matter what the actual micro-task looks like, the requester has to carefully craft the description of the task and the instructions given to the contributors. Assuming the task asks for links between papers and a pre-defined list of datasets, one needs to think about the different ways in which both the paper and the dataset will be presented to the crowd. Alternatives include:
Each of these alternatives has advantages and disadvantages, and the choice also depends on the affordances given by the crowdsourcing platform used. A second dimension of crowdsourcing describes the targeted crowd of participants. The preconditions of the task may determine who can be in the targeted group. This group may be a restricted group of experts who qualify by fulfilling a given condition (e.g., having a skill, being of a certain height, or living in a given area). Alternatively, an open call for participation can be made, where anyone can take part. However, even when no explicit prerequisites apply, it is still worthwhile to consider what would drive contributors to engage with the task at all and to design incentives that would encourage them to do so (Simperl, Cuel & Stein, 2013). For data citation tasks, different types of participants can provide different types of useful information:
Targeting a specific crowd can have advantages and drawbacks. For example, experts are few and usually constrained by time, whereas non-experts may provide less precise information that must be validated. A third dimension of crowdsourcing is how results are managed. Depending on the task design and chosen crowd, the results may need to be validated, aggregated, integrated, etc. The solution space of some problems can be open, meaning that the number of possible correct solutions is unlimited. This is mostly the case with macro-tasks and open calls which require creative work and innovation. In the case of crowdsourcing data citation links however, the solution space is rather constrained, since there is a finite number of correct links that can be created between given publications and datasets. The solutions submitted by participants can be assessed automatically by comparing results against one another, using simple or weighted averages, or majority voting (Kittur, Chi & Suh, 2008). It is also possible to crowdsource the evaluation of results by assigning participants evaluation tasks, in which they verify the accuracy of previously submitted answers. This option is more costly, as it requires additional participants, but also likely to produce more reliable results, especially in cases in which answers are not straightforward to obtain or require very specific insight. A hybrid approach can employ both methods, by either randomly selecting results to be evaluated from the total set of submitted results, or by only evaluating the ones where agreement between participants is low. The performance of a crowd member can be evaluated by assessing their contributions' divergence from the "ground truth" approximated from aggregating over all contributions. When the ground truth is known partially (e.g. 
for a limited number of cases out of the total, for which we have expert-made annotations) we can test the participants' contributions by randomly or selectively assigning them tasks to which we know the true solution. Social mechanisms such as participant reputation, and rating and voting (Packer, Drăgan & Moreau, 2014), can also be used to infer the quality of work. We now turn to an analysis of these considerations.

3 Use cases: two datasets, two types of links, two crowds

In this section, we present two datasets that initially motivated our research, USEWOD and DBpedia. Our resulting approach, however, can be applied to any dataset and any domain. Based on these datasets and their characteristics we describe the relationships that exist between datasets, or different versions of a dataset, and between datasets and the publications using them. We then show how these relations can be crowdsourced to obtain data citation graphs, and how the process varies with the level of expertise of the participants.

DBpedia is the most prominent Linked Open Data source containing structured data that is automatically extracted from Wikipedia. Hence, DBpedia is a cross-domain open dataset, and a fantastic example of the problem we are attempting to solve. It has well-established creation and publication processes, which generate dataset versions with complex relationships between them. The DBpedia project wiki is well maintained, and it provides a comprehensive version history and detailed information. This makes it easy to set up mirrors of any particular DBpedia version and granularity (e.g. only a specific language or excluding particular link sets). In a change log the DBpedia team documents changes to the DBpedia ontology as well as changes to the extraction and interlinkage framework. But neither DBpedia in general nor any of its versions is archived in a research-data repository, which would allow for referring to a persistent identifier such as a DOI.
The dataset, its versions, and the protocol for their generation evolve dynamically, based on community input and collaboration. It is, however, provider-dependent², and neither sustainable availability nor reliable long-term archiving can be assured. Additionally, every DBpedia version originates from a particular Wikipedia dump. When a DBpedia dataset is used in research, there exists a transitive dependency which makes the respective Wikipedia dump that has been processed by the DBpedia extraction algorithms the actual source of the data used in the research, influenced certainly by the scripts used to extract it. The different Wikipedia dumps contain data created and altered by millions of people, and thus the relationship between DBpedia versions inherits the complexity of this provenance chain. Such complex relationships between datasets and versions are important in tracing the lineage of the data used in research publications, and the complexity is not specific to DBpedia.

DBpedia is also associated with a large number of research publications that claim to use it in some way: at the time of writing, a Google Scholar search for the "dbpedia" keyword returns approximately 10,000 articles³. However, the majority of the publications do not explicitly reference the particular DBpedia version they use, and those that do reference it do not do so in a consistent way. As described in Section 2, the key papers of the DBpedia publishers are cited rather than the actual DBpedia dataset version that was used in a particular study. This limits others' ability to reproduce or evaluate the published results, and it makes it difficult to validate the research and draw useful conclusions from validation efforts.

The USEWOD dataset is a collection of server access logs from various well-known Linked Data datasets, most prominently DBpedia, LinkedGeoData, and BioPortal amongst many others.
As part of a data analysis challenge, the chairs of the annual USEWOD workshop released four dataset versions, one before each instance of the workshop since 2011. The four individual USEWOD dataset versions are available upon request from http://usewod.org, and a description of the contents is included in the respective compressed archive file. It is noteworthy that the 2012 and 2013 versions of the dataset each contained the entire content of the preceding year plus additional data. This practice was changed in 2014 to release additional data only. As a lightweight citation policy, the workshop chairs asked users of the USEWOD dataset to cite one of the initial papers describing the workshop and the research dataset (Berendt, et al., 2011). The provision of the USEWOD dataset is representative of research datasets that are hosted by an academic unit or an individual researcher, such as the UCI Machine Learning Repository or the Stanford SNAP dataset collection. They all employ a non-standard way of hosting and maintaining research data without any guarantee of long-term availability of the service, and they do not provide their users with an option to refer to a persistent identifier controlled by an official entity managing research data. We detailed above through the DBpedia example how the relationships between various versions of the same dataset are relevant for the traceability of research results using one version or another. With the USEWOD dataset, which contains information related to other datasets, it becomes clear that the relationships between datasets are just as important: inclusion, dependence, transformation, aggregation, projection, etc. The links between datasets are not always expressed in a standardised, machine-readable way, but rather captured in textual documentation by the creators of each dataset or version thereof. 
As such, the capture of these relations can be done in two ways: by extracting the information automatically, where possible, from the documentation, or by asking the creators (the experts in this case) to manually specify them.

Moving on to the relations between publications and datasets, we find that for the majority of instances we can simply restrict the vocabulary to say that a publication uses a dataset (or more than one). Our first user study, described in the next section, suggests that this simple metadata is enough to gather a clear data citation graph. This general way of establishing the link between a publication and the precise dataset and version used for the research has the added advantage of being easy to elicit from non-experts, as no further information is required. It can also be automatically extracted in a large number of cases using text analysis and restrictions on the possible date ranges, as in the examples shown below.

The uses relationship, however, does not cover all cases. Some publications do more than just use a dataset: they describe how a new one was generated, or they analyse, compare, and evaluate existing datasets. This more detailed metadata provides richer information on how the data is used in research, but it is more difficult to extract automatically with high accuracy, and also more difficult to elicit from the crowd, as it requires expert users.

We investigate how to utilize the power of the two types of crowds, that of experts and that of non-experts, in the way most suitable to each type. For example, we target the authors of publications and other domain experts for the crowdsourcing of detailed usage metadata. For general usage metadata we propose to engage a wider set of participants in the crowdsourcing tasks, possibly including Amazon's Mechanical Turk or other paid micro-task platforms. We use simple information extraction tools to detect whether any dataset reference can be found automatically.
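The two automatic techniques named in this section, text analysis and date-range restriction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the study: the regular expression, the function names, and the sample sentences are hypothetical, and the release table holds only the two DBpedia release months quoted in this section (3.6 in January 2011, 3.7 in August 2011).

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern for explicit mentions such as "DBpedia (version 3.6)"
# or "DBpedia v. 3.6"; real papers will use many other phrasings.
VERSION_PATTERN = re.compile(
    r"DBpedia\s*\(?\s*(?:version|v\.?)\s*(\d+(?:\.\d+)*)\)?",
    re.IGNORECASE,
)

# Release months as (year, month), taken from this section of the paper.
# Earlier DBpedia versions are deliberately omitted from this sketch.
RELEASES = {"3.6": (2011, 1), "3.7": (2011, 8)}


def find_explicit_version(text: str) -> Optional[str]:
    """Return the version number stated in the text, or None if absent."""
    match = VERSION_PATTERN.search(text)
    return match.group(1) if match else None


def candidate_versions(cutoff: Tuple[int, int]) -> List[str]:
    """Return versions released no later than the cutoff (year, month),
    ordered by release date."""
    ordered = sorted(RELEASES.items(), key=lambda item: item[1])
    return [version for version, released in ordered if released <= cutoff]


# An explicit mention resolves the version directly.
print(find_explicit_version("We use DBpedia (version 3.6) as input."))
# Without a mention, the publication month (September 2011) bounds the candidates,
# and the submission deadline (April 2011) narrows them further.
print(candidate_versions((2011, 9)))
print(candidate_versions((2011, 4)))
```

Comparing `(year, month)` tuples keeps the sketch at the month granularity the paper reports, so no release days have to be assumed.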
Detecting such a reference is straightforward when the paper contains the version of the dataset in plain text, as in (Morsey, Lehmann, Auer & Ngonga Ngomo, 2011), which contains "DBpedia (version 3.6)" in its introduction. Additionally, if available, we can use some of the metadata about datasets and publications to restrict the set of possible datasets linked to a paper based on the intersection of the temporal range of the creation dates. For example, the DBpedia Spotlight paper (Mendes, Jakob, Garcia-Silva & Bizer, 2011) uses the DBpedia dataset for evaluation of a tool, but does not specify in the text which version was used. The paper was published in September 2011, which means it could only have used DBpedia datasets up to version 3.7 (released in August 2011, according to the changelog). Taking into account the fact that the submission deadline for the conference was in April 2011 (according to the call for papers), we can restrict the range by one more version, to DBpedia v. 3.6 (released in January 2011).

The automatically extracted and inferred information can be used in two ways: to validate the crowdsourced information, or to be validated by the crowd. We plan to explore both options in the future.

4 Crowdsourcing dataset references with experts: the USEWOD2014 study

During the USEWOD2014 edition of the workshop we ran a small user study in which we asked workshop participants to annotate papers and datasets with the relations between them. The participants were experts in the field of the papers they were asked to annotate, and some of them were authors of the papers in question. As experts, they were asked not only to capture the simple usage relation, but also the more detailed descriptions of how a dataset is used by a given publication. Figure 1 shows a screenshot of the tool developed for the study. It is available online at http://prov.usewod.org/.
The functionality of the tool, the data modelling, and the vocabularies used are described in detail in (Drăgan, Luczak-Rösch, Simperl, Berendt & Moreau, 2014). The vocabulary provided five possible relationships between publications and datasets: "mentions", "describes", "evaluates", "analyses", "compares"; and five possible relationships between datasets: "extends", "includes", "overlaps", "is transformation of", "is generalisation of". The simplest relation between a publication and a dataset was "mentions", which was intended to be used when none of the others were suitable, but a mention of the given dataset appears in the given paper.
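To make the two sets of relations concrete, the vocabulary listed above could be captured in a small data structure such as the one below. This is only an illustrative sketch: the class names, field names, and identifiers are hypothetical, and it does not reproduce the PROV-based Linked Data model used by the tool.

```python
from dataclasses import dataclass
from enum import Enum


class PublicationRelation(Enum):
    """Relations from a publication to a dataset, as listed in the text."""
    MENTIONS = "mentions"
    DESCRIBES = "describes"
    EVALUATES = "evaluates"
    ANALYSES = "analyses"
    COMPARES = "compares"


class DatasetRelation(Enum):
    """Relations between two datasets, as listed in the text."""
    EXTENDS = "extends"
    INCLUDES = "includes"
    OVERLAPS = "overlaps"
    IS_TRANSFORMATION_OF = "is transformation of"
    IS_GENERALISATION_OF = "is generalisation of"


@dataclass(frozen=True)
class PublicationLink:
    """One annotation linking a publication to a dataset (hypothetical shape)."""
    publication: str
    relation: PublicationRelation
    dataset: str


@dataclass(frozen=True)
class DatasetLink:
    """One annotation linking a dataset to another dataset (hypothetical shape)."""
    source: str
    relation: DatasetRelation
    target: str


# Example based on Section 3: the 2012 USEWOD release contained the 2011 data.
# The identifiers are placeholders, not real URIs.
usewod_link = DatasetLink("usewod-2012", DatasetRelation.INCLUDES, "usewod-2011")
print(f"{usewod_link.source} {usewod_link.relation.value} {usewod_link.target}")
```

Keeping the two relation sets in separate enumerations means a dataset-to-dataset relation such as "includes" cannot be attached to a publication link by mistake when a type checker is used.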