TY - JOUR
AU - Carlson, Samuelle
AU - Anderson, Ben
AB - One key feature of e-science is to encourage archiving and release of data so that they are available in digitally-processable forms for re-use almost from the point of collection. This assumes particular processes of translation by which data can be made visible in transportable and intelligible forms. It also requires mechanisms by which data quality and provenance can be trusted once “disconnected” from their producers. By analyzing the “life stages” of data in four academic projects, we show that these requirements create difficulties for disciplines where tacit knowledge and craft-like methods are deeply embedded in researchers, as well as for disciplines producing non-digital heterogeneous data or data derived from people rather than from material phenomena. While craft practices and tacit knowledges are a feature of most scientific endeavors, some disciplines currently appear more inclined to attempt to formalize or at least record these knowledges. We discuss the implications this has for the e-science objective of widespread data re-use. Introduction Partly as a result of financial inducement (research funds) but also for sound methodological and substantive reasons, social scientists in the United Kingdom and elsewhere are beginning to engage with the wider program of “e-science.” Investments in e-science technologies are motivated by a multiplicity of factors. First is the urgency to manage the increasingly large quantities of complex data produced by digital technologies and digitally enabled science. “Deluge,” “waves,” and “knowledge overload” are some of the terms used to describe the situation (Hey & Trefethen, 2003). Another related factor is the concern of funding bodies to “repurpose” their investments in data to avoid what is, in turn, termed “data tombs in mono-disciplinary silos” and to see a maximum return on their investments. In addition to greater computing capacity and larger sets of data to compare, service and data grids also raise the prospect that new scientific questions can be asked, questions that can only be addressed through massive analysis or the federation of disparate shared datasets. However, the act of sharing data implies the communication of something to a set of potentially unknown others. Moreover, the controversies surrounding the ethics of sharing data (Thompson, 2003) and the methodological reasons for (not) doing so in the social sciences (Fielding, 2004) as well as the natural sciences (Borgman, 2005; Campbell, Clarridge, Gokhale, Birenbaum, Hilgartner, Holtzman, et al., 2002; Hagstrom, 1974) indicate that how and what to communicate and to whom is more problematic than naïve accounts of scientific collaboration presume. Key recent studies in the field of e-science (Jirotka, Proctor, Hartswood, Slack, Simpson, Coopmans, et al., 2005; Purdam, Elliot, Smith, & Pickles, 2005) have underlined how most of the obstacles to such data provision are less technological than social, ethical, legal, and institutional, as researchers in fields such as Computer Supported Cooperative Work (CSCW) have known for some time (see Birnholtz & Bietz, 2003; Bowers, 1994). While there is ample evidence of why people say they do not wish to share and/or collaborate,1 this article uses the ethnography of actual data sharing practices in order to shed new light on the multiple social contexts generating this discourse. 
However, this is emphatically not an article intending to map disciplines in terms of those that are “better” at sharing and re-use than others, and as we will re-emphasize, simple assumptions and binary divides between, for example, qualitative and quantitative methods do not hold true in practice. Corti and Wright (2003), for example, found among medical researchers both an empiricist commitment to the objectivity of data and equally weighty if “softer” concerns about the reinterpretation of data, while Latour (1993) finds hybrids of “soft” sociality and “hard” materiality within laboratory science. However, the materiality of social science data production appears much less explored and rarely contrasted with that of other sciences (although see Whitely, 2000). If this article inscribes itself in a field of studies assessing the potential of disciplines for the uptake of e-science (Fry, 2006; Kling & McKim, 2000), it mainly follows up on Coopmans’ study of a digital mammography project (Coopmans, 2006) as it pays attention to the sometimes apparently trivial material transformations and translations implied by new modes of knowledge distribution. In this article, we show how different groups in both the natural and social sciences build socio-technical hybrids through the collection, processing, annotation, release, and re-use of data. We do this through four case studies which focus on several key “life stages” of data, in order to reflect on the technical and social contexts of “e-nabling” data for re-use. Field Sites and Methods The four projects selected varied considerably and intentionally in their approaches to data collection, disciplinary backgrounds, and use of technologies. They were: SkyProject– a £10M project initiated in 2001 by a consortium of 11 departments. Its distributed team is composed of scientists, software developers, and managers aiming at building the infrastructure for a data grid for U.K. astronomy, which will form the U.K. contribution to a global virtual observatory. It works closely with similar projects worldwide through an international alliance. The infrastructure developed enables the first beta-users to perform queries across distributed datasets through the SkyProject portal in order to access sequences of observations from a range of telescopes. Demonstrator tools accessed via a PC-based “workbench” have included the self-assembly of sky-movies and automated filtering and processing of observation data held on a range of distributed databases. SurveyProject– a resource center producing a large-scale complex survey dataset released every year through the government funded U.K. Data Archive (UKDA). The survey is conducted by a single academic institution, although the fieldwork itself is carried out via a subcontract to an agency. Every year some 10,000 U.K. households are contacted and interviewed, with the resulting data being fed back to the institute for cleaning, processing, checking, and packaging. The data are then deposited with the U.K. Data Archive, which is responsible for managing re-use by third parties. The academic institute also has a remit to support re-use through the provision of documentation, derived variables, and training, as well as the maintenance of a “user” group. Mature computing technologies are used all along the chain through which data are collected, processed, released, and analyzed. 
CurationProject– an activity which for 20 years has been digitizing records of a collection of more than 750,000 artifacts and 100,000 field photographs collected since 1884, and making them available for study through an online database. With images now being added to the documentation, the project is conceived as an additional “collection.” A curator has also recently initiated two new projects by which the museum’s database will be open to a community of researchers, artists, and community representatives from around the world, so that their alternative expertise, taxonomies, and meanings can be recorded at the core of the museum. AnthroProject– an anthropologist and her students who have undertaken to digitize and distribute all materials so far collected in a range of countries during her academic career in order to preserve in a digital medium anthropological materials including fieldwork notes, images, maps, texts, and videos that were quickly degenerating in their current forms. In addition, it was the intention to make these materials available for re-use by other researchers and also available to their sources’ descendants through both an online database and DVDs. Some of the professor’s students opened cultural area based sub-sites under the umbrella of the main project. They have developed a wiki, a forum, a proprietary probabilistic search engine designed in partnership with a consultancy, and they also participate in the Dspace worldwide digital archive, among other activities. The case studies were specifically selected to provide a reasonable spread of disciplines across the typology introduced by Becher (1987) and developed by Fry (2006). Thus SkyProject represents Becher’s “hard-pure” group, SurveyProject represents “soft-applied,” while CurationProject and AnthroProject represent “soft-pure.” Not represented here is the “hard-applied” group, although several e-science activities from this area have already been studied by others such as Hine (2005) and Fry (2006). In addition, the case studies enabled the comparison of leading edge e-science projects such as SkyProject with smaller scale innovative projects such as AnthroProject, and with more “traditional” users of computing technology such as SurveyProject and CurationProject. Finally, the case studies were also distributed across a dimension of maturity. As the least mature, SkyProject constituted a significant innovation. SurveyProject, on the other hand, has been in existence for almost 15 years and is providing a mature and critical service to U.K. and international social scientists. As such, it could not risk any innovation in the use of computing technologies or other processes that might jeopardize the survey itself or the timely processing, documentation, and archiving of the data. All change had to be incremental and carefully considered. CurationProject and AnthroProject, on the other hand, had an even longer history of the collection of heterogeneous materials, even if they had come only more recently to the use of digital technologies for archiving. In all cases it was believed that scientific practice could potentially be co-evolving with computing technology, but that this was more likely to be visible in SkyProject, which was more consciously attempting to reinvent both its scientific practices and its scientific tools and had perhaps a greater freedom to do so. 
One of the case studies (SkyProject) was nationally distributed, while the others were geographically concentrated in one or two co-located institutions. For each, the fieldwork included interviewing key informants initially recruited through personal contacts who served as sources of further interviewees and acted as a sounding board for early observations. In all, 16 people were interviewed, most on several occasions. The secondary respondents were selected in part by the key informant’s recommendation but also by the individual paths that the “data” took through each project. This required meeting people involved in data collection, processing, analysis and reuse, some of whom were external to the specific case study itself. We similarly followed up and interviewed contacts recommended by the four projects at various institutions, such as the U.K. Data Archive. In addition, we analyzed internal and external project documents produced, including project websites, and observed conference and other face-to-face meetings including a number of U.K. e-social science and e-science events such as two of the e-science All Hands Meetings/conferences and the first International Conference on e-Social Science. Finally, in the case of SkyProject, we traced email activity and other forms of electronic communication such as jabber, a wiki, and Skype (Internet telephony), all of which were extensively used. Each participant provisionally agreed to serve as a case study, and in line with the Association of Social Anthropologists guidelines on informed consent, it was considered crucial to establish through ongoing dialogue with participants the scope of information to which we were granted access and the extent to which it could be reported. Each group was made aware that due to the necessity of being quite specific about the kinds of data shared, it was likely to be impossible to guarantee that the identities of the institutions and individuals could be kept confidential. Participants were invited to review drafts of the reports and to comment and provide feedback through personal communication. The data were transcribed and coded using ATLAS.ti2 as a resource for rapid reanalysis and data inquiry. The codes were based on themes that emerged as relevant after several re-readings and re-codings of the materials; it was essential that categories emerged from the materials rather than be imposed on them from the outset (Glaser, 1998; Glaser & Strauss, 1967). As befits a series of case studies, the positions and claims reported here are representative of each field site insofar as its members generally held similar views. They cannot, of course, be taken as representative of all members of the respondents’ professions or disciplines, and it is not the place of case study based research to make such claims for generality. Only those themes that were richly illustrated have been generally reported in the article, and because it focused on specific academic groups, this article can only claim to be indicative, as compared to representative, of disciplinary states and stances. Results and Analysis As one might expect, the results of this project were as diverse as the case studies observed, with the various forms of data collection and the grounded (data driven and emergent) analysis producing a wealth of potentially significant themes. However, within the constraints of this article, we concentrate here on aspects of the “life stages” of data. 
Data Collection: Born Digital and Heterogeneous Legacy Data In comparing the potential of different disciplines for increasing data re-use, it was immediately clear that the basic requirement that data be “digital” constitutes a main challenge in certain research contexts. While it is not an obstacle for those disciplines, illustrated by the SkyProject and SurveyProject, that now produce “born digital” data, it remains a major issue for those observing and producing heterogeneous materials. In the field, members of AnthroProject might, for instance, observe tales, songs, dances, and countless other daily practices, from which similarly heterogeneous records will be produced: audiotapes, letters, diaries, notes, books, photographs, films, and objects. Data re-use thus has several implications for disciplines generating non-digital heterogeneous materials. For qualitative social sciences, an initial process of translation of data into digital form is therefore needed, and the data digitized first are often those which do not present this problematic heterogeneity. Within qualitative social research, text is most frequently retained, and it is not clear, for example, that the U.K. Data Archive is equipped to act as curator (Corti, 2000; Corti & Thompson, 2004). Indeed, during an interview, one UKDA respondent expressed her surprise at the homogeneity in the data produced by sociologists: “For them it’s mainly in-depth interviews, a few personal letters, a few photos. I was surprised that they didn’t make more use of video or things like that.” It is primarily verbal accounts, and especially those recorded through controlled and coherent formats, that make their way to digital archives. A member of SurveyProject mentioned in this respect: “It seems that survey data tend to make it as large datasets. But what about other things like observational stuff, such as coding newspapers?” The underlying worry that surfaced during interviews with social scientists was that, ultimately, the demand for data archiving and re-use might force qualitative materials into quantitative forms and logics (or favor the quantitative side of qualitative studies) and thereby jeopardize the specificity of qualitative approaches. More concretely, the task of digitizing data derived from other media proves very costly and sometimes requires decades in itself. It took AnthroProject 30 years to digitize their research materials progressively and 20 years for CurationProject to document its entire collection of over 850,000 objects in its database. One respondent told us that “at the present date only 30% of the collections in the U.K. are digitized.” As an observer of the similar Systematics project (classifying organisms) noted, the labor of digitization is often ignored in grand visions, even though it preoccupies people on the ground (Hine, 2005). As a member of CurationProject remarked, “The digital image needs to be named, then to be stored, then to be attached to the in-house database. That means opening the database and sitting for hours.” Another added: “The data entry is done in a cold room with no light. It’s very boring—you need to check the dates and plug numbers such as 1996, 1997, 1998 and so on. Cleaning the database is time-consuming—check consistency in formats for names; spelling mistakes, etc.” Indeed, from the online diary of AnthroProject, it was clear that the team spends most of its time transforming the format of existing data and maintaining them rather than creating/collecting new data. 
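By way of illustration of the checking and cleaning work described above, the following Python sketch shows the kind of consistency tests that such data entry invites; the record fields, accession-number format, and rules are invented for illustration and are not CurationProject’s actual schema.

import re
from dataclasses import dataclass

@dataclass
class CatalogueRecord:
    # One digitized object record; the fields are illustrative, not the museum's schema.
    accession_no: str
    collector: str
    year_collected: str
    image_file: str

def check_record(rec: CatalogueRecord) -> list[str]:
    # Return consistency problems of the kind data-entry staff describe:
    # malformed accession numbers, implausible dates, stray whitespace, odd file formats.
    problems = []
    if not re.fullmatch(r"\d{4}\.\d+", rec.accession_no):
        problems.append(f"accession number not in NNNN.N form: {rec.accession_no!r}")
    if not re.fullmatch(r"(18|19|20)\d{2}", rec.year_collected):
        problems.append(f"implausible collection year: {rec.year_collected!r}")
    if rec.collector != rec.collector.strip() or "  " in rec.collector:
        problems.append(f"stray whitespace in collector name: {rec.collector!r}")
    if not rec.image_file.lower().endswith((".tif", ".jpg")):
        problems.append(f"unexpected image format: {rec.image_file!r}")
    return problems

# Flag records needing manual correction before they are attached to the in-house database.
records = [
    CatalogueRecord("1996.12", "A. Example ", "1886", "obj_1996_12.tif"),
    CatalogueRecord("97.3", "B. Example", "86", "obj_97_3.bmp"),
]
for rec in records:
    for problem in check_record(rec):
        print(rec.accession_no, "-", problem)

Even in this toy form, most of the work lies in defining and maintaining the rules, which is precisely the labor the respondents describe.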
According to the leader of AnthroProject, “the likelihood of colleagues [anthropologists] doing it is very meager. … They either write from their notes or organize their notes. They can’t do both.” Another issue is that the digital form of data often will not substitute for the original but might only be a different and complementary kind of data. For example, one of the CurationProject curators remarked: “For people, the digital images [of objects in the collection] stand for the objects themselves, but in fact these images are objects in themselves. They are a new kind of objects. The artifacts of the collections almost never go out. What go onto the web are just performances of these artifacts.” Although this curator promotes the database of images as being a new collection in itself, a specific aura remains with the material artifacts. In the same way, the database does not fully replace former types of records (in the present case, a catalog of cards), which remain “the” authority. “The old catalogue cards have a tangible nature. It is also the only true one. [The head curator] likes the manual catalogue cards and also asks people to look at it rather than the e-one. The e-one is good for identifying, but serious research requires to look at the catalogue cards for hand-writing etc. and people generally prefer the feel of it” (respondent, CurationProject). Data Formatting: Codified Abstract Forms and Tacit Knowledge For collected materials to become data that can be used and, above all, reused, they need to be rendered disseminative, that is, rendered at once transportable in concise abstract forms and intelligible. As Strathern (2005a, n.p.) points out, communication entails a necessary reductionism “based perhaps on the idea that what has to be shown is how something has ‘traveled’ between disciplines. So the chances are that it will be a ‘thing’ (a piece of data) shorn of its relational co-ordinates.” She similarly argues elsewhere: “Not everything of course can—or should—be conveyed, and some kind of reduction and thus loss of complexity is seemingly inevitable in the making of new relations, as is involved in relations with other bodies of knowledge” (Strathern, 2005b, n.p.). This truism plays out in different ways in the four projects studied. In SurveyProject, for example, variables are renamed and recoded, and coding frames for verbatim responses have to be created. From the point of collection, answers formulated by respondents in words are visualized and circulated in the form of numbers. The 2000 flat files collected in a survey database mirroring the structure of questionnaires were transferred in batches and collapsed into a much smaller number of variables. In addition to the passage from CAPI software to SPSS to an in-house software solution, the renaming of the files was part of this first major conversion. Files were loaded into a database where the cleaning of data was done. Flat files were then re-expanded again when loaded into a third users’ database: “Having all the data, we transform them again because people, most analysts, want a much flatter structure,” commented the computer manager, and she added: “derived variables, weighting, imputations are put on.” What was first in CAPI and passed into SPSS and SIR was then made accessible again through SPSS, SAS, and STATA software. Variable names were kept consistent across years, as were the terms used for indexing. 
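As a purely illustrative sketch of the recoding, renaming, and derivation just described, the following Python fragment works through a toy respondent file; the variable names, codes, and the use of pandas are assumptions made here for clarity and do not reproduce SurveyProject’s actual software chain (CAPI, SPSS, SIR, SAS, STATA).

import pandas as pd

# Toy respondent-level extract; variable names and codes are invented for illustration.
raw = pd.DataFrame({
    "pid":        [1, 2, 3, 4],
    "q17_status": ["employed full time", "retired", "employed part time", "unemployed"],
    "q23_hours":  [38, 0, 16, 0],
})

# 1. Coding frame: verbatim answers are collapsed into numeric codes, so that answers
#    formulated in words circulate as numbers.
status_frame = {"employed full time": 1, "employed part time": 2, "unemployed": 3, "retired": 4}
coded = raw.assign(jbstat=raw["q17_status"].map(status_frame))

# 2. Renaming: questionnaire-order names are replaced by names kept consistent across years.
coded = coded.rename(columns={"q23_hours": "jbhrs"})

# 3. Derived variable: an analyst-friendly flag added before release.
coded["working"] = (coded["jbstat"] <= 2).astype(int)

# 4. The released file keeps only the recoded, renamed, and derived variables.
release = coded[["pid", "jbstat", "jbhrs", "working"]]
print(release)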
It was this process of converting words into numbers; of successive flattening and restructuring; of renaming and renumbering that transformed materials collected into visible, manageable, communicable, and intelligible data for its community of users. Though particularly bulky (with the largest databases growing at several terabytes per year), the materials collected in SkyProject are nonetheless similarly conveyed in highly visual and stable forms: images, catalogs of numerical values, and models. As the object of study (the sky) is entirely numerically mapped (a star is defined by its coordinates, for instance), an image can be translated into numbers, which can be translated into diagrams. To this visualization “tool-box,” one can add images created out of theory. On the issue of visualization, we were told by a SkyProject member: Our raw data are tables, not images. You can do things with numbers and you’ll still have to have numbers out of an image. Visualization is very important. One thing is to have a concept. The other is to make it accessible … to be able to create a figure, to create an analogy is important. Concepts on the edge are difficult to grab. Diagrams are important to evaluate the quality of data: many dimensions. Quantitative disciplines thus appear particularly well-equipped with tools for the visualization and simulation of the objects and processes they study, and this is a strong asset in the drive towards data re-use, which itself requires that data be circulated in transportable and intelligible forms. Funding bodies promoting not only data re-use but also interdisciplinarity increasingly pressure anthropology and related disciplines to formalize materials that exist between private field data and that which is the basis for, or is captured in, monographs and articles (Brenneis & Marcus, 2005). This requirement represents a challenge, for different reasons. The interviewed anthropologists who went alone into the field rather than in a team did not need to externalize their data until it took the final form of published monographs or articles, as earlier quotes have illustrated. Apart from some types of diagrams, such as the mapping of descent lineages inherited from the 19th century when anthropology aimed to be a quantitative science, anthropologists make very little use of diagrams to model the social processes they study. For those interviewed, the adoption of constraining models was seen as pushing them towards generalization, while they saw their primary goal as describing the specificities of particular contexts and drawing distinctions where others seek simplification. Most of them did not visualize their data in condensed and abstract forms that would easily lend themselves to circulation, and thus anthropology seems to belong to those disciplines for which data re-use would imply a radical epistemological change. Data Release: Ownership, Consent, and Moral Rights Beyond practical challenges, the caution with which some social scientists approach the prospect of making their data public on a new scale through e-science appears to follow from their data being derived from people and not objects. In contrast to objects, people have claims, and there was a concern that by enabling re-use at an early stage, researchers might disclose data to the wrong people or simply disclose erroneous data that had not gone through appropriate quality checks. 
In this respect, the four case studies illustrate a schematic continuum from SkyProject, where people are not represented in the source data at all, through SurveyProject, where they are represented in highly symbolic forms, to Curation and AnthroProjects. In the latter, the very point of the data is their subject-specificity, which makes any kind of anonymization largely impossible to achieve. Although SurveyProject derives its data from people, the issue is less acute as people are anonymized in the process, making data somewhat “claim-proof.” “The way we deal with the law that information on individuals cannot be kept is that we separate the database of respondents from the one on the results. We anticipated the change in law,” a member of SurveyProject commented. This often proves impossible when the identity of individuals or groups is inscribed in the data, as it is often inscribed in Curation and AnthroProject’s artifacts. Corti and Thompson (2004) have pointed to the same problem regarding the archiving of video or even audio materials, which are almost impossible to anonymize. This problem is best illustrated through the dilemma faced by CurationProject because of the multiplicity of people with which its primary data, artifacts, are entangled. One of the constraints on the implementation of digital technologies at CurationProject is that any people from one of the many contexts an object belongs to could potentially have a claim on the use made of it. For a given photograph, for instance, claimants may include those depicted, the photographer, the collector, or the museum. CurationProject deals with complex data, because these data convey the practice and knowledge of others who might consider it sensitive knowledge and whose conceptions of ownership might differ from those of the museum’s curators. We were for instance told, “[A researcher] has put a picture on the cover of a publication. He could be fined for that [by the community it originated from], because the artifact shows a ritual/secret process. This is despite the fact that it was a three-dimensional artifact sold to him.” Mentioning a Ph.D. student, the same curator remarked: “during her fieldwork in Malaysia, there was a photo collection (of a former local museum) that they wanted to sell to us. There were photos by tourists, army officers, etc. They think that they own every photo, but in our sense the photographer owns it, and we can therefore not show it.” As one respondent at CurationProject put it (see Herle, 1994): The problem is that you might not know that it’s sensitive knowledge until it gets public …. For instance, we have a photo showing relationships among people that these same people deny having had. They see in these pictures the political rather than the daily Tibetan practices …. Also, the museum is thought to be a neutral place for many communities because it’s so far from them. They sometimes prefer their objects to be there rather than on the island of their neighbors. People think that access is to take everything you have and put it online but it would be the best way to ensure not working with them anymore. No conflict ever arose from CurationProject having publicized specific objects, but CurationProject’s ethical obligation to respect creators and communities of origin nevertheless constitutes a barrier to the circulation of content over e-media, as a number of their grant proposals make clear. 
Both Anthro and CurationProject thus acknowledge being stretched between two trends: one driven by new technologies encouraging the widest distribution, the other advocating a case-by-case approach based on informed consent and the respect of moral ownership and claims. Data Re-Use: Trust, Provenance, and “Cookery” Data re-use not only involves the disconnection of data from the people they represent but also from the researchers who collected them. This opens up the central question as to how data collected or constructed by one researcher can be trusted or even understood by another. Across all case studies it was clear that this disconnection required not only visualizing data in intelligible forms, but more importantly, making explicit their context of production and setting up appropriate systems of quality checks and assessment. This issue proved to be central across disciplinary contexts, irrespective of the quantitative or qualitative character of a discipline’s approach or the material or social nature of its object. What interviews and observations showed was that researcher practices around data are always highly specific and qualitative, even within quantitative disciplines, and that the data are always “cooked,” to employ a term recurrently used by some informants. By way of illustration, one can list the parameters a SkyScientist would want to be made explicit before re-using data he has not himself collected. A SkyProject scientist explained: The most basic level is how good the atmospheric conditions were (clear night, good site) …. The second criteria of quality is the instrument. Unless you’re talking of a pristine photographic plate, with electronic data, no electronic detector is perfect. One needs to fix imperfection and calibrate and one gets to know a detector and how to calibrate it. He went on to note “When there is a new detector, nothing is known about this detector. It takes a lot of time learning about the detector [and] during that time there is no standard.” A SkyProject developer similarly remarked, “At the user workshop, people wanted to know what algorithm had been used. It was built into the program but they don’t trust other people’s calibration if they don’t have the algorithm.” Since a color image in SkyProject is itself “constructed” out of several images created from the use of several filters, an expert would want to know which filters have been used, especially because filters have developed differently over the years. S/he would want to know how much an image had been “cooked,” because s/he might have preferred a previous stage depending on the research questions under investigation. The repositories of data accessible through portals such as SkyProject therefore need to make explicit all these processes and parameters, and indeed many e-science projects are facing up to this challenge (Chen, 2005; Groth, 2005), although the extent to which they actually catalog all data transformations is rather unclear. However, it is important to remember that incomplete, inconsistent, or erroneous datasets can still be useful to those who are aware of these problems, even if they are misleading and frustrating for those who are not (Missier, Embury, Greenwood, Preece, & Binling, 2005). Beyond singular images or tables, SkyProject aims to address the issue of trust in data by recording the history of changes to each data item (its provenance) to allow scientists to verify both their own experiments and those of their peers. 
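A minimal sketch of what such a per-item history might look like follows, in Python; the field names and event types are illustrative assumptions drawn from the concerns quoted in this section, not SkyProject’s actual data model.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceEvent:
    # One step in a data item's history; the fields follow the concerns quoted in the text
    # (instrument, calibration, processing), not any actual SkyProject schema.
    timestamp: str
    actor: str        # observatory, pipeline, or scientist responsible
    action: str       # e.g. "observation", "calibration", "filtering"
    parameters: dict  # algorithm name, filter, calibration version, etc.

@dataclass
class DataItem:
    item_id: str
    history: list = field(default_factory=list)

    def record(self, actor: str, action: str, **parameters) -> None:
        # Append an event so later re-users can see how "cooked" the item is.
        self.history.append(ProvenanceEvent(
            timestamp=datetime.now(timezone.utc).isoformat(),
            actor=actor, action=action, parameters=parameters))

# A re-user inspecting calibration history before deciding whether to trust the data.
img = DataItem("obs-2006-0421-r")
img.record("telescope-A", "observation", filter="r", seeing_arcsec=0.9)
img.record("pipeline-v2", "calibration", algorithm="flatfield-v2", detector="ccd-3")
for ev in img.history:
    print(ev.timestamp, ev.actor, ev.action, ev.parameters)

Whether a re-user then trusts the item depends less on the record’s format than on whether every “cooking” step actually appears in it.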
As a SkyProject scientist said: In SkyScience we are often concerned with authenticating the data as much as the user—providing information on data provenance, pipeline history, calibration status, reliability, etc., including the issue of how random observatories and scientists can insert their data into the datagrid system, and whether other users can then trust these data. The case studies suggested that disciplines’ histories as well as the configuration of their research communities are factors that can impact their capacity to contextualize and document their data and processes appropriately. In the case of SurveyProject, trust and reliability of data also involve making explicit decisions concerning the processes underpinning their production. Different factors, such as the long tradition of depositing data in quantitative sociology and the obligation to provide data for third party analysis, explain the success of SurveyProject in this respect. In addition, the fact that those designing the survey, those carrying it out, those processing it, and those analyzing it are in different teams forces each of the teams to make explicit the way it proceeded. It provides an incentive to create documentation to accompany the data and describe precisely the conditions in which they were collected and what was done to them between collection and archiving. This clarification is achieved through a resident expert help-desk (an individual); through events (user group meetings and workshops on the use and usefulness of the data); and through documents such as the online documentation, questionnaires, and subject thesauri. Clearly both Sky and SurveyProject benefit from using numbers as a means of encoding and visualization, but also from the use of simplified symbolic languages (statistics) as a means of communication among re-users of the same datasets. These methods and languages are formally taught and therefore commonly interpretable, even if often contested, across communities of researchers. SurveyProject seminars are dedicated to testing the reliability of results by testing methods against others who use the same data set. As a senior research officer at SurveyProject’s host institution explained: “You can lie with statistics, but not to another statistician, the truth comes from the statistical methods, not the numbers. You will get found out because the numbers and the methods of interpretation are there side by side.” Although SurveyProject’s host institution gathers researchers with different backgrounds (economics, sociology, political sciences) who use the same statistical methods differently, statistics nevertheless provide the common language which allows them to expose their methods to each other and which therefore fosters re-use. As one SurveyProject respondent commented: “I know that when I talk to sociologists I have to use one set of terminology, with economists another.” Another suggested that: Statistics, too, is a language, but a language with dialects. I can’t understand how economists work their statistics. Reading this stuff it’s hard to judge it. I have a hard time because the way sociologists talk about stats and the ways statisticians talk about stats are two different things. Finally, for those in charge of SurveyProject, trust in data was expected to strengthen as good practices and standards became established. As the survey manager remarked: It’s part of trying to establish standards. 
So that people don’t go, “oh, I don’t know why it’s like that; the people who made the decision aren’t there anymore….” There are a set of generally accepted practices in terms of ways of designing questions, ways of dealing with data that you would expect any reputable survey to follow. And a lot of that has been backed up by tons of experimental research of various kinds …. You can see the effect of doing things in different ways, but one big problem with quality is that there’s no real way to measure validity. In the case of AnthroProject, which is generally weakly disposed to encoding and visualizing its data in concise, stable, abstract forms, systematic exposition of its methods is rare. In general, contrary to the case of SurveyProject, the one who collected the data and the one who interpreted them were the same person, and this had implications for the potential to meet data re-use requirements, because many assumptions, procedures, processes, and decisions often remained undocumented tacit knowledge. On the other hand, for most respondents in AnthroProject, it was not crucial that method be formalized, as they saw anthropology as more about explanation and interpretation than about replicable analysis. As an AnthroProject informant said, “there are no right or wrong or clear yes or no.” Another suggested that there is therefore a tremendous amount of trust in anthropology; no one wants to “check” data. One respondent from AnthroProject’s department similarly commented that one could not impose fixed procedures in anthropology: “All you can rely on is the person to have ethics. It is that person who makes decisions and selections every day. The anthropologist is a walking filter.” The anthropologist was thus assumed to have developed ways to “auto-audit.” As a result, the most difficult issue facing AnthroProject in the pursuit of data re-use remained meaningfully capturing the context needed to make qualitative data “reusable.” In the case of the U.K. Data Archive’s Qualidata service, for example, the guidelines for the preparation of data for submission have defined context in terms of 10 criteria, which generate 10 mandatory fields to be completed. The conception of context “audit-trails” of half a page that could be completed on a daily basis when in the field or when archiving data was met with amusement by the leader of AnthroProject, who had published at least one lengthy monograph simply on the context of just one of her field sites. For others in the same department, “context” and “reflexivity” were understood to form the boundary between quantitative and qualitative data, and it was the perceived impossibility of archiving context and reflexivity that for them rendered qualitative data problematic to reuse.3 Conclusion This article has used a set of four apparently different and contrasting case studies to examine the nature of data and some key aspects of their life-cycle in the context of data re-use in e-science. Our overriding conclusion is that two key assumptions that appear to underpin a number of discourses on e-science are not supported in practice:
That knowledge can easily and straightforwardly be disembedded from its producers and original contexts to become explicit data for temporally and geographically distributed re-users.
That there is a binary divide between the “quantitative” and “qualitative” sciences in their approach to, and ability to benefit from, e-science tools and practices, especially in terms of data re-use. 
One problem common to all disciplines but most obvious here in CurationProject was the issue of data that were not born digital. The article has discussed the issue with respect to CurationProject, but it should also be said that efforts to digitize historical quantitative materials, even of survey and census data, are extremely expensive, and as one respondent noted, the same would be the case for historical SkyScience data. One major factor for certain qualitative disciplines may be that they have not benefited from the funds that would allow them to dedicate the required significant resources to the task over long periods of time. At the same time, because they also often deal with sensitive materials and are less convinced of the viability and value of secondary data analysis, they have not necessarily wanted to rush into digitization. While the demand for qualitative data re-use remains low for the variety of reasons discussed, there may be little incentive for funding sources to make such investments. It would perhaps be better (for the discipline of anthropology) if funds were channeled into supporting significantly larger numbers of Ph.D. students and research fellows. In all case studies, decisions were being made about what data should be captured and archived for re-use and what could be filtered out and left behind. While this was common practice in Survey and SkyProjects, it caused unease in AnthroProject, where there were oft-stated concerns that certain kinds of data would become privileged not because of their substantive value but simply because they were easier to digitize. All case studies saw the need to address issues of documentation, context, and provenance as key requirements for the future re-use of data. As much in the case of SkyProject as in SurveyProject and AnthroProject, data were not self-contained units that could easily be circulated, but always needed complementary external information to be understood or trusted. Numbers and observed raw data were never self-explanatory or self-legitimizing, and the degree to which and how they were constructed needs to be stated. The data were always “cooked,” and the extent to which capturing context and cookery was deemed possible and thus plausible in the projects varied. It appeared to depend largely on what was considered to be “enough context” to fulfill the needs of the data re-user. For AnthroProject, steeped in a history of specificity, there perhaps never could be “enough.” Thus as one of the AnthroProject respondents pointed out, contextualizing data fully in anthropology would essentially come to reproduce the world, and for some the risks of potentially misrepresenting other people’s data and informants precludes any attempt at secondary use (van den Berg, 2005). However, one could also conclude that to specify context fully in SkyProject would be to reproduce the sky, but the nature of SkyScience is such that scientific practice deals easily with abstractions and “reasonable filtering” early in the data food chain, provided that the cookery that has taken place is “properly” documented. Of course, “reasonable” and “properly” are slippery and contested concepts. The case studies also drew attention to the limits on re-use that stem from the nature of the data themselves, what they represent, and in what way. 
Although data from the SurveyProject represent people, and as a result, are open to the issue of claims, they are relatively straightforward to anonymize and are encoded and recoded to produce abstracted representations. Curation and AnthroProjects, on the other hand, deal with data that are highly subject-specific, open to moral rights (Stokes, 2003), almost impossible to anonymize, and generally not encoded into abstract representations. This makes re-use a difficult case-by-case issue, because people and their claims are inevitably entangled in the data themselves. However, it would be a mistake to assume that this is a qualitative/quantitative divide, since the quantitative analysis of video data of naturalistic individual behavior would raise precisely the same issue (Fraser, Shaukat, Timakov, Hindmarsh, Tutt, Heath, et al., 2006). Similarly, the negotiation in CurationProject regarding the balancing act between “publication” for re-use—bearing in mind that the artifacts themselves are never published but rather merely representations of them—and non-disclosure to prevent the alienation of the indigenous sources, illustrates not that all anthropological materials are impossible to re-use, but rather that different materials will be differently re-usable and by different people, as has been demonstrated in the case of medical records and images (Coopmans, 2006). While this is easy enough to conclude, the curator of CurationProject noted that sufficiently flexible and context-dependent authorization systems may be highly problematic to implement in current computational infrastructures. We began this article with the notion that sharing data for re-use in all disciplines implies the communication of something to a set of potentially unknown and unknowable others. Through the four case studies, we illustrated what might need to be communicated across a range of disciplines and scientific practices as well as highlighted issues that may demand rather significant changes to the “way of doing things.” Whether this will prove revolutionary or evolutionary remains to be seen, but the precipitation of change is of course exactly what e-science is supposed to provoke.
Acknowledgment We are indebted to Dawn Nafus for her early work leading this project and to the two anonymous reviewers for their constructive suggestions for improving this article. This research was supported by the ESRC e-Social Science Programme Small Grant Entangled Data: Knowledge and Community Making in E (Social) Science, RES-149-25-1002 (http://www.essex.ac.uk/chimera/projects/edkm/).
Notes
1 See http://www.esds.ac.uk/qualidata/support/faq.asp.
2 See http://www.atlasti.com/.
3 For an extensive discussion, see the special issue of Forum: Qualitative Social Research on Secondary Analysis of Qualitative Data, 6(1), edited by Louise Corti, Andreas Witzel, and Libby Bishop (January, 2005).
References
Becher, T. (1987). The disciplinary shaping of the profession. In B. Clark (Ed.), The Academic Profession (pp. 271–301). Berkeley: University of California Press.
Birnholtz, J., & Bietz, M. (2003, November). Data at work: Supporting sharing in science and engineering. Paper presented at the 2003 International ACM SIGGROUP Conference on Supporting Group Work, Sanibel Island, Florida.
Borgman, C. (2005, July). Disciplinary differences in e-research: An information perspective. Paper presented at the First International Conference on e-Social Science, Manchester, UK.
Bowers, J. (1994, October). The work to make a network work: Studying CSCW in action. Paper presented at the 1994 ACM Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina.
Brenneis, D., & Marcus, G. E. (2005). In between, and on the margins of, the shining centres on the hill. Anthropology News, 46(6), 8–12.
Campbell, E. G., Clarridge, B. R., Gokhale, M., Birenbaum, L., Hilgartner, S., Holtzman, N. A., et al. (2002). Data withholding in academic genetics: Evidence from a national survey. Journal of the American Medical Association, 287(4), 473–480.
Chen, L. (2005, September). A proof of concept: Provenance in a service oriented architecture. Paper presented at the UK e-Science All Hands Meeting, Nottingham, UK.
Coopmans, C. (2006). Making mammograms mobile: Suggestions for a sociology of data mobility. Information, Communication & Society, 9(1), 1–19.
Corti, L. (2000). Progress and problems of preserving and providing access to qualitative data for social research—The international picture of an emerging culture. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 1(3). Retrieved November 11, 2006 from http://www.qualitative-research.net/fqs-texte/3-00/3-00corti-e.htm
Corti, L., & Thompson, P. (2004). Secondary analysis of archived data. In C. Seale (Ed.), Qualitative Research Practice (pp. 327–343). London: Sage Publications.
Corti, L., & Wright, M. (2003). MRC Population Data Archiving and Access Project Consultants’ Report. Colchester: UK Data Archive, University of Essex.
Fielding, N. (2004). Qualitative Research and E-Social Science: Appraising the Potential. University of Surrey. Retrieved November 11, 2006 from http://www.ncess.ac.uk/docs/qualitative_research_and_e_soc_sci.pdf
Fraser, M., Shaukat, M., Timakov, S., Hindmarsh, J., Tutt, D., Heath, C., et al. (2006, July). Using real-time annotations as qualitative e-research metadata. Paper presented at the Second International Conference on e-Social Science, Manchester, UK.
Fry, J. (2006). Coordination and control across scientific fields: Implications for a differentiated e-science. In C. Hine (Ed.), New Infrastructures for Knowledge Production: Understanding E-Science (pp. 167–188). Hershey, PA: IDEA Group.
Glaser, B. (1998). Doing Grounded Theory: Issues & Discussion. Mill Valley, CA: Sociology Press.
Glaser, B., & Strauss, A. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research. Chicago: Aldine.
Groth, P. (2005, September). PReServ: Provenance recording for services. Paper presented at the UK e-Science All Hands Meeting, Nottingham, UK.
Hagstrom, W. (1974). Competition in science. American Sociological Review, 39(1), 1–18.
Herle, A. (1994). Museums and shamans: A cross-cultural collaboration. Anthropology Today, 10(1), 2–5.
Hey, A., & Trefethen, A. (2003). The data deluge: An e-science perspective. In F. Berman, G. C. Fox, & A. Hey (Eds.), Grid Computing: Making the Global Infrastructure a Reality (pp. 809–824). Chichester, UK: John Wiley.
Hine, C. (2005, July). Material culture and the shaping of e-science. Paper presented at the First International Conference on e-Social Science, Manchester, UK.
Jirotka, M., Proctor, R., Hartswood, M., Slack, R., Simpson, A., Coopmans, C., et al. (2005). Collaboration and trust in healthcare innovation: The eDiaMoND case study. Computer Supported Cooperative Work, 14(4), 368–398.
Kling, R., & McKim, G. (2000). Not just a matter of time: Field differences in the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science, 51(14), 1306–1320.
Latour, B. (1993). We Have Never Been Modern. London: Harvester Wheatsheaf.
Missier, P., Embury, S., Greenwood, M., Preece, A., & Binling, J. (2005, September). An ontology-based approach to handling information quality in e-science. Paper presented at the UK e-Science All Hands Meeting, Nottingham, UK.
Purdam, K., Elliot, M., Smith, D., & Pickles, S. (2005, September). Confidential data access, disclosure risk and Grid computing. Paper presented at the UK e-Science All Hands Meeting, Nottingham, UK.
Stokes, S. (2003). Art & Copyright. Oxford: Hart Publishing.
Strathern, M. (2005a, February). Currencies of collaboration. Paper presented at the PLACEB-O Workshop, Girton College, Cambridge, UK.
Strathern, M. (2005b, March). Useful knowledge. Lecture presented at The Isaiah Berlin Lecture, Manchester, UK.
Thompson, P. (2003). Towards ethical practice in the use of archived transcripted interviews: A response. International Journal of Social Research Methodology, 6(4), 357–360.
Van den Berg, H. (2005). Reanalyzing qualitative interviews from different angles: The risk of decontextualization and other problems of sharing qualitative data. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 6(1). Retrieved November 11, 2006 from http://www.qualitative-research.net/fqs-texte/3-00/3-00corti-e.htm
Whitely, R. (2000). The Intellectual and Social Organization of the Sciences. Oxford: Oxford University Press.
About the Authors
Samuelle Carlson is a consultant for Arts Council England and was formerly a researcher at Chimera, Essex’s Institute for Social and Technical Research. She has a background in History of Art and a Doctorate in Social Anthropology from Cambridge University. She has interests in theories of historical change, the field of Design, and the technologies and materiality of interdisciplinary collaboration. 
Address: Chimera, University of Essex, PP1 Ross Building, Adastral Park, Ipswich, Suffolk, IP5 3RE, UK
Ben Anderson is Deputy Director of Chimera, the Institute for Social and Technical Research at the University of Essex. He has a BSc in Biology and Computer Science and a Ph.D. in Computer Studies. For the last nine years, he has led a program of strategic social science research analyzing the co-evolution of people and digital technologies.
Address: Chimera, University of Essex, PP1 Ross Building, Adastral Park, Ipswich, Suffolk, IP5 3RE, UK
© 2007 International Communication Association
TI - What Are Data? The Many Kinds of Data and Their Implications for Data Re-Use
JF - Journal of Computer-Mediated Communication
DO - 10.1111/j.1083-6101.2007.00342.x
DA - 2007-01-01
UR - https://www.deepdyve.com/lp/oxford-university-press/what-are-data-the-many-kinds-of-data-and-their-implications-for-data-DYKxzxVBT8
SP - 635
EP - 651
VL - 12
IS - 2
DP - DeepDyve
ER -