Data-processing, Data-transfer and Search: Further Technical Challenges for Open Access
By Wolfram Horstmann, Bielefeld University Library

Open Access is a child of the Internet. In theory, the World Wide Web (WWW) has made it possible for everyone to have immediate access to news, media and communication everywhere. Scientists created the Web almost 20 years ago in order to exchange academic information more efficiently, and thus made direct access to information possible. Since then, for many people, the Web has developed into the ultimate global information platform.

Today scientists and scholars are once again pursuing the goal of Open Access to academic information on the Internet. This is not because access to academic information on the Internet has since been closed off. Rather, information in the form we are concerned with today did not exist in the infancy of the WWW. In those days, academic publications existed largely in printed form. It is only in the past decade that they have become available electronically on a large scale. In addition, today we are not merely concerned with publications: many other data can be found in academic offices and laboratories on computers, storage media or servers that are not compatible with the standards of the WWW. Examples include digitised cultural heritage, experimental measurement data, computer programs for evaluation, modelling or simulation, and learning materials.

Manual processing of all these data is impossible, which is why the machine-readability of data plays an important role.
Firstly, machine-readability means that data must be recognisable by external servers or digital services. This recognition mostly takes place via metadata, a kind of digital label for data, which contains information about the form and content of the underlying object. In addition, machine-readability requires a transfer protocol which allows the data to be transferred from one place to another. In the traditional WWW this is primarily HTTP (Hypertext Transfer Protocol). However, for the multitude of data types and uses to be found in science and scholarship today, HTTP alone is not adequate, since far more varied information on the type and purpose of the data has to be exchanged before any transfer can take place.

For academic data, but also for other forms of data, labelling with the 'Simple Dublin Core Metadata Element Set' (http://dublincore.org) has become standard practice. As a transfer protocol for open, machine-readable data stores, the 'Open Archives Initiative Protocol for Metadata Harvesting' (OAI-PMH, http://www.openarchives.org) is often used. The combination of the two allows a new form of technical networking based on the principles of Open Access: digital knowledge stores, known as repositories, are coming into existence worldwide. Alongside the repositories that are created directly for academic disciplines, many academic libraries act as systematic providers of information in the digital age by operating repositories. These repositories expose their data, without access restrictions, to digital 'harvesters', which collect the metadata and structure them in intermediate storage facilities for systematic access.
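
The interplay of Dublin Core labels and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) can be illustrated with a minimal Python sketch. It is not a full harvester: it only builds the request URL a harvester might send and reads the Dublin Core fields out of a response. The verb `ListRecords`, the metadata prefix `oai_dc` and the XML namespaces are taken from the OAI-PMH 2.0 and Dublin Core specifications; the endpoint `https://repository.example.org/oai`, the sample record and the helper names `list_records_url` and `parse_records` are assumptions made purely for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2008-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented example title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would request to list a repository's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract the identifier and Dublin Core fields of each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # records without a metadata section are skipped
            continue
        fields = {}
        for element in dc:
            name = element.tag.split("}", 1)[-1]  # drop the namespace prefix
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


print(list_records_url(BASE_URL))
for rec in parse_records(SAMPLE_RESPONSE):
    print(rec["identifier"], rec["fields"])
```

Each Dublin Core element may occur several times in a record, which is why the sketch stores the values of every field as a list. A harvester along these lines would write the parsed records into an intermediate store, on which a search service can then be built.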
After that, search engines enable researchers, teachers and learners to access information that is distributed worldwide, in an unrestricted and targeted fashion.

But even if the data are present in repositories, labelled with metadata and accessible from other servers and services, there is still no guarantee that the results are actually usable by academics. Thanks to major Internet players such as Google, scientists and scholars are accustomed to relatively comprehensive and rapid access to results. Google and others invest a great deal in the registration and computer-based structuring of data, which relate not just to metadata but to every conceivable form of information contained in the digital object itself. The approach of structuring academic information exclusively via metadata is conceptually superior. In practice, however, this approach still needs to be turned from an isolated test application into a comprehensive everyday tool. A 'future-proof' solution could lie in collaboration between libraries, which guarantee the quality of the metadata and data presentation, and experts in information sciences, media studies and informatics.

Especially for the young generation of researchers and students, the WWW has developed into a highly interactive environment.
For many, the browser is a central switchboard in their professional and social lives, in which communication, the exchange of data and the structuring and configuration of their daily routines take place. The academic world also works on an increasingly interactive basis. This means that not only access is expected, but also the manipulation of data, collective editing à la Wikipedia (http://www.wikipedia.org) or sharing à la del.icio.us (http://del.icio.us).

The reconfiguration of the WWW into an interactive environment suitable for science and scholarship represents a challenge for service providers even with respect to publications with a relatively simple structure. Increasingly, however, we also have to deal with the other materials mentioned above, such as diverse digital items, computer programs, and learning materials. These days, many academic results are obtained with the help of precisely these new media; traditional publication with text and graphics forms only a fraction of this academic work. Tracing scientific results, let alone verifying them, is becoming increasingly difficult on the basis of publications alone.

At first sight, there seem to be no limits to the possibilities offered by a new, virtualised academic world in the context of such forms of electronic publishing. In such a comprehensive scenario, however, it must not be forgotten that data are generated in quantities that would be inconceivable in the analogue, non-electronic world. Also, much of this information is not intended to be used by the public or even by scientists or scholars in related disciplines.
And not every piece of academic information generated in such a scenario can or need be preserved and placed at the permanent disposal of posterity. Science and scholarship have become more fleeting.

In addition, the atomisation of science and scholarship into more and more sub-disciplines has made it increasingly difficult to provide interdisciplinary services for the academic community in the way that university libraries have traditionally done. Today, only academics themselves know what information and services they need for their work in their respective areas of research. The challenge for information service providers will be to offer functional, generally valid information tools that structure, process and make accessible academics' specialist knowledge, be they search engines or tools for the administration of information, documentation, editing or communication.

pp. 66-68 in: Open Access: Opportunities and challenges. A handbook / European Commission, German Commission for UNESCO. -- Luxembourg: Office for Official Publications of the European Communities, 2008. -- 144 pp., 14.8 x 21.0 cm. -- ISBN 978-92-79-06665-8. -- EUR 23459, http://tinyurl.com/3q8wo5