2.3.6 Harvesting
Harvesting methods
Harvesting is the mechanism that ‘drags’ (copies) metadata to the catalogue. It ensures that the metadata referred to in the catalogue is included and kept up to date. It is the task of the catalogue service to collect the metadata at its source location and to enter it in the catalogue.
There are three different methods of harvesting:
1. Harvest existing metadata as XML.
2. Harvest existing metadata from a catalogue.
3. Harvest the capabilities.
Harvesting is a process that can be done on a regular basis, for example, every day, or once a week. The data is synchronised during the harvesting process. A catalogue is able to recognize any metadata that is added, deleted or updated at the source location and can modify the catalogue database to accommodate this.
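The synchronisation step can be pictured as a comparison between what the source currently offers and what the catalogue already holds. The following is a minimal sketch of that idea, not the algorithm of any particular catalogue; the record layout (a mapping from identifier to modification date) and the function name `plan_sync` are assumptions made for the example.

```python
def plan_sync(source, catalogue):
    """Compare two {identifier: modification_date} mappings.

    Assumption of this sketch: dates are ISO 8601 strings in the same
    format, so plain string comparison orders them chronologically.
    """
    added = sorted(set(source) - set(catalogue))
    deleted = sorted(set(catalogue) - set(source))
    updated = sorted(
        key for key in set(source) & set(catalogue)
        if source[key] > catalogue[key]
    )
    return {"added": added, "deleted": deleted, "updated": updated}


# Hypothetical state of a source location and of the harvesting catalogue.
source = {"a": "2012-03-01", "b": "2012-03-05", "d": "2012-03-07"}
catalogue = {"a": "2012-03-01", "b": "2012-02-20", "c": "2012-01-15"}

print(plan_sync(source, catalogue))
# {'added': ['d'], 'deleted': ['c'], 'updated': ['b']}
```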
During harvesting it is possible to apply a filter, so that only a limited set of metadata is copied from the remote catalogue instead of everything. For example, a filter can be applied to free text, subjects, titles and summaries.
The harvesting mechanism is based on the concept of a universally unique identifier (uuid) and on the date of modification. The uuids enable harvesting to take place from various resources: even when the same metadata exists in more than one resource, the uuid ensures that it is included only once, and the date of modification ensures that only the most up-to-date version is listed in the register.
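The combination of uuid and modification date can be illustrated with a small sketch. It is only an illustration under stated assumptions: the record fields (`uuid`, `modified`, `title`) and the function name `merge_by_uuid` are hypothetical, and dates are assumed to be ISO 8601 strings that sort chronologically.

```python
def merge_by_uuid(*record_lists):
    """Keep one record per uuid: the one with the latest modification date."""
    latest = {}
    for records in record_lists:
        for record in records:
            current = latest.get(record["uuid"])
            if current is None or record["modified"] > current["modified"]:
                latest[record["uuid"]] = record
    return latest


# Two hypothetical resources that both publish the record "u1".
resource_a = [{"uuid": "u1", "modified": "2012-01-10", "title": "Roads"}]
resource_b = [
    {"uuid": "u1", "modified": "2012-03-02", "title": "Roads (revised)"},
    {"uuid": "u2", "modified": "2011-11-30", "title": "Rivers"},
]

merged = merge_by_uuid(resource_a, resource_b)
print(sorted((r["uuid"], r["title"]) for r in merged.values()))
# [('u1', 'Roads (revised)'), ('u2', 'Rivers')]
```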
Metadata XML harvesting
Implementing and managing a catalogue is not worthwhile for organisations that only handle a small amount of metadata. Because placing XML files in a web-accessible folder is easy, it is a simple way to make metadata documents available for harvesting by others. The WebDAV protocol (Web Distributed Authoring and Versioning) is used to harvest the metadata from such a server. WebDAV defines so-called collections of files on a web server; these can be used to harvest more than one set of metadata documents at the same time. System managers can configure WebDAV on a standard web server, and it should be possible to access the web folder without authorisation. During configuration, a URL is defined for the catalogue to harvest from, for example:
http://www.RIVM.nl/webdav Web accessible folder complying with WebDAV (IETF, RFC 2518)
In this way, it becomes possible to define and harvest Web accessible folders as a resource.
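One common way for a client to discover the files in a WebDAV collection is a PROPFIND request with the header `Depth: 1`, which the server answers with a multi-status XML document containing one `response` element per member. The sketch below only parses such an answer and keeps the XML documents; it makes no network call. The sample response is invented and trimmed (the property elements are left out), and the function name `list_metadata_documents` is an assumption for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed PROPFIND (Depth: 1) answer for a web folder.
MULTISTATUS = """<D:multistatus xmlns:D="DAV:">
  <D:response><D:href>/webdav/</D:href></D:response>
  <D:response><D:href>/webdav/dataset-a.xml</D:href></D:response>
  <D:response><D:href>/webdav/dataset-b.xml</D:href></D:response>
  <D:response><D:href>/webdav/readme.txt</D:href></D:response>
</D:multistatus>"""


def list_metadata_documents(multistatus_xml):
    """Return the hrefs of all members whose name ends in .xml."""
    root = ET.fromstring(multistatus_xml)
    hrefs = [el.text.strip() for el in root.iter("{DAV:}href") if el.text]
    return [href for href in hrefs if href.lower().endswith(".xml")]


print(list_metadata_documents(MULTISTATUS))
# ['/webdav/dataset-a.xml', '/webdav/dataset-b.xml']
```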
Exporting metadata as an XML file is a normal function of metadata tools. A lot of metadata documents are available this way in (government) organisations. Other parties want to use this information too.
Metadata can be placed in separate folders for different target groups. This way, each application can harvest all the metadata it needs from just one folder.
Harvesting Metadata from a catalogue
If an organisation has its own catalogue, metadata can be harvested from it. Metadata records are copied to, for example, the National Geo Register (NGR). The harvesting operation of the catalogue service focuses on creating and updating the records in the National Geo Register. The CSW standard is used for this. CSW stands for Catalogue Service for the Web, a search interface for catalogues developed by the Open Geospatial Consortium. NGR supports version 2.0.2 ISO AP of this standard.
When the CSW processes a harvest request, it takes the following steps:
1. It goes to the URI where the metadata resource is recorded.
2. It parses the resource.
3. It creates or modifies metadata records in the catalogue to register the resource.
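The three steps can be sketched in a few lines. This is an illustration under stated assumptions, not the behaviour of a specific catalogue: the catalogue is modelled as a plain dictionary, the download is passed in as a function (`fetch`) so that the example runs without a network, and the record is identified by the ISO 19139 `gmd:fileIdentifier` element. The names `harvest`, `fake_fetch` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"


def harvest(source_uri, fetch, catalogue):
    """Fetch a metadata document, parse it, then create or update its record."""
    document = fetch(source_uri)                      # step 1: go to the URI
    root = ET.fromstring(document)                    # step 2: parse the resource
    identifier = root.findtext(
        f"{{{GMD}}}fileIdentifier/{{{GCO}}}CharacterString"
    )
    action = "updated" if identifier in catalogue else "created"
    catalogue[identifier] = document                  # step 3: create or modify
    return action, identifier


# Hypothetical metadata document, returned by a stand-in for an HTTP download.
SAMPLE = (
    f'<gmd:MD_Metadata xmlns:gmd="{GMD}" xmlns:gco="{GCO}">'
    "<gmd:fileIdentifier><gco:CharacterString>abc-123</gco:CharacterString>"
    "</gmd:fileIdentifier></gmd:MD_Metadata>"
)


def fake_fetch(uri):
    return SAMPLE


catalogue = {}
print(harvest("http://example.org/metadata.xml", fake_fetch, catalogue))
# ('created', 'abc-123')
print(harvest("http://example.org/metadata.xml", fake_fetch, catalogue))
# ('updated', 'abc-123')
```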
http://www.opengis.net/cat/csw/2.0.2 can be given as the resource type, indicating which type of resource will be harvested.
This operation is executed once or periodically (for example, every night), depending on the settings determined by the manager of the catalogue.
In the CSW 2.0.2 publication schema, the harvesting operation is defined as follows:
Harvesting Capabilities
A catalogue can also harvest the capabilities of a service. Most catalogues can, for example, save the URL of the capabilities document and then harvest the metadata periodically. The metadata from the capabilities elements is then transformed into a CSW2 AP ISO document.
XSLT transformation of WMS capabilities document into CSW2 AP ISO document
In the CSW 2.0.2 publication schema, the harvesting operation for this type of resource is defined as follows:
This asks the catalogue service to harvest the resource “http://www.myhost.com?Service=WMS&Request=GetCapabilities” of the type “http://www.opengis.net/wms” periodically, every month (P1M). The MIME type of the resource is “application/xml”.
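As an illustration, a request with these values could be assembled as follows. This is a sketch, not a copy of the schema: the element names (`Harvest`, `Source`, `ResourceType`, `ResourceFormat`, `HarvestInterval`) are given as recalled from the CSW 2.0.2 publication schema and should be checked against the schema itself, and the function name `build_harvest_request` is an assumption for the example.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"


def build_harvest_request(source, resource_type,
                          resource_format="application/xml", interval=None):
    """Build the XML body of a CSW Harvest request as a string."""
    ET.register_namespace("csw", CSW)
    request = ET.Element(
        f"{{{CSW}}}Harvest", {"service": "CSW", "version": "2.0.2"}
    )
    ET.SubElement(request, f"{{{CSW}}}Source").text = source
    ET.SubElement(request, f"{{{CSW}}}ResourceType").text = resource_type
    ET.SubElement(request, f"{{{CSW}}}ResourceFormat").text = resource_format
    if interval is not None:
        # ISO 8601 duration, e.g. P1M for "every month"
        ET.SubElement(request, f"{{{CSW}}}HarvestInterval").text = interval
    return ET.tostring(request, encoding="unicode")


print(build_harvest_request(
    "http://www.myhost.com?Service=WMS&Request=GetCapabilities",
    "http://www.opengis.net/wms",
    interval="P1M",
))
```

The same helper could be called with the resource type http://www.opengis.net/cat/csw/2.0.2 to describe a harvest from another catalogue. Note that the `&` in the source URL is escaped as `&amp;` when the request is serialised.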
Distributed search pattern
Distributed searching is not a form of harvesting, but we describe it in this section because it is often referred to as catalogue-to-catalogue harvesting. It cannot, however, be compared with the harvesting operations described in the previous sections.
Distributed searching: the query is sent to the local catalogue and to every other catalogue known to it, up to a specified search depth (which depends on the network topology). The results are integrated and shown to the client. The metadata records of other catalogues are not copied to one’s own catalogue. Catalogue records remain with the resource.
At the interface level, the distributed query is part of the GetRecords operation of the catalogue service. This operation belongs to the Discovery class, not to the Manager class as the harvesting operation does.
In the CSW 2.0.2 schema, the distributed search operation is defined as follows:
The element “DistributedSearch” (of type “DistributedSearchType”) asks the catalogue service to pass the “GetRecords” request on to all catalogue services.
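A GetRecords request that asks for a distributed search could be built along the following lines. Again this is a hedged sketch: the `hopCount` attribute (the search depth) and the element names are given as recalled from the CSW 2.0.2 discovery schema, the query is reduced to the bare minimum without any filter, and the function name `build_distributed_getrecords` is an assumption for the example.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"


def build_distributed_getrecords(hop_count=2):
    """Build a minimal GetRecords request that is passed on to other catalogues."""
    ET.register_namespace("csw", CSW)
    request = ET.Element(
        f"{{{CSW}}}GetRecords",
        {"service": "CSW", "version": "2.0.2", "resultType": "results"},
    )
    # hopCount limits how many catalogues deep the request may travel.
    ET.SubElement(
        request, f"{{{CSW}}}DistributedSearch", {"hopCount": str(hop_count)}
    )
    query = ET.SubElement(request, f"{{{CSW}}}Query", {"typeNames": "csw:Record"})
    ET.SubElement(query, f"{{{CSW}}}ElementSetName").text = "summary"
    return ET.tostring(request, encoding="unicode")


print(build_distributed_getrecords(hop_count=2))
```

Leaving out the `DistributedSearch` element would restrict the same query to the local catalogue.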
A catalogue implementation can extend this function in order to ‘cache’ metadata from a remote catalogue. This is comparable with catalogue-to-catalogue harvesting, but it exists only at the implementation level; it is not defined at the specification level and therefore is not a standard approach.