INFORMATION EXCHANGE IN IABIN:

Technical Issues to be Considered in the Implementation

of the Inter-American Biodiversity Information Network
 
 

April 13, 1999
 
 
 

Prepared for the

Technical Meeting for the Establishment of IABIN

Brasilia, Brazil

April 15-18, 1999

 






 


TABLE OF CONTENTS

 



EXECUTIVE SUMMARY

I. BACKGROUND

II. APPROACH

III. GOALS

IV. STRATEGY

V. FIRST STEPS

VI. PILOT PROJECT LEVELS OF PARTICIPATION

Country Representatives
Country Nodes
IABIN Partners
Funding Considerations
VII. DATA STRUCTURES

VIII. VOCABULARIES

IX. MAPPING SPECIES OCCURRENCES: A DISTRIBUTED DATABASE APPLICATION

X. Z39.50 PROFILES

XI. METADATA AND ELECTRONIC PUBLISHING

XII. TECHNICAL CAPABILITIES NEEDED FOR IABIN SITES

Web Site Technology
Connectivity
Security
Search and Query Technology
XIII. CONCLUSIONS

APPENDIX: List of Acronyms
 


EXECUTIVE SUMMARY

The Inter-American Biodiversity Information Network (IABIN) was mandated in the action plan arising from the Santa Cruz (Bolivia) Summit of the Americas on Sustainable Development. Central to the implementation of IABIN is an understanding of the technical challenges which will need to be addressed by the participants. This study, in conjunction with the development of plans for a pilot project on invasive species in the Americas, undertook to discover and define those challenges and to offer recommendations to IABIN participants concerning information exchange in light of such challenges.

Any effective framework for networking biodiversity information in the Americas must address at least the following considerations:

It is recommended that IABIN support participants at a number of levels of commitment and complexity, including core sites (serving all of the Americas), national sites, project nodes, partner nodes (knowledge contributors connected to the system), and users.

The technical approaches to the implementation of IABIN should build on, and reflect, the approaches taken by other international biodiversity networking initiatives. The first step is to inventory available information, technology, and information needs. This should be done in a formal and organized fashion, developing consistent catalogs of experts, partner organizations, species and resources of concern to those experts and organizations, existing data sets, major existing scientific and management projects, and various kinds of resulting synthetic data, educational materials, and capacity-building opportunities. Because resource cataloging is an almost insurmountable task if attempted for all important biodiversity resources in the hemisphere, catalogs should first be developed for a small number of well-defined environmental issues of particularly high visibility and importance. These are the IABIN pilot projects.

Individual projects must select standards and protocols to be used, including metadata strategies, data structures, controlled vocabularies, electronic publishing (including peer review) processes, etc. Web site technology, connectivity, security, and search and query technology are other technical issues which must be considered. If carefully done, the information systems developed for these pilot projects should be widely applicable to other biodiversity issues considered by IABIN.
 

Summary of Specific Recommendations
Recommendation 1: IABIN should adopt the list of databases as a minimum set for information sharing in the Americas on invasive vascular plants, invasive fish, specialist pollinators, and amphibians, insofar as funds and personnel are available.

Recommendation 2: Each country participating in each pilot should designate a primary country representative. Ideally, each country representative should both participate in or collaborate with the country's official policy toward the Biodiversity Convention and IABIN, and be affiliated with an organization with the interest, facilities, and computer expertise to host a moderately sophisticated database center and Web site.

Recommendation 3: IABIN should explore adapting the North American Biodiversity Information Network's (NABIN) "Species Analyst" approach and software to develop a distributed species mapping capability for species groups (invasive plant, invasive fish, declining amphibians, selected declining pollinators, hantavirus outbreaks), as funds allow, as a prototype for a more ambitious effort to link biodiversity databases using public-domain Z39.50 server technology.

Acknowledgments
This report was prepared by Dr. James F. Quinn, Department of Environmental Science and Policy, University of California at Davis. Funding for this study was provided by the United States Agency for International Development, Project #598-0780, "Environmental Support Project," under an Interagency Agreement with the U.S. Department of the Interior. Project management was provided by the International Biological Informatics Program of the U.S. Geological Survey.


I. BACKGROUND

In December 1996, leaders of the governments of the Americas met at the Santa Cruz (Bolivia) Summit on Sustainable Development. Government leaders recognized the importance of reliable and accurate information on biodiversity in decision-making and the need for cooperation among the countries of the Western Hemisphere to link information sources together. Summit leaders agreed to:

Seek to establish an Inter-American Biodiversity Information Network, primarily through the Internet, that will promote compatible means of collection, communication and exchange of information relevant to decision-making and education on biodiversity conservation, and that builds upon such initiatives as the Clearing-House Mechanism provided for in the United Nations Convention on Biological Diversity, the Man and the Biosphere Network (MABNet), and the Biodiversity Conservation Information System (BCIS), an initiative of nine IUCN programs and partners.
This declaration, Initiative 31, prompted a series of informal meetings among interested parties, which were followed by two Experts' Meetings, sponsored by the Organization of American States, regarding the establishment of IABIN. At the Experts' Meeting in January 1998, the United States Geological Survey announced an inter-agency agreement with the U.S. Agency for International Development to support planning of the IABIN concept by identifying informational needs for sharing biodiversity information among IABIN partners, and by initiating several pilot projects. The purpose of the pilots is in part to demonstrate and test communications strategies for a wider range of biodiversity issues important to IABIN partners, and in part to begin the actual process of building a better infrastructure for exchanging scientific and management information on biodiversity in the hemisphere.

U.S. participants in IABIN met in October, 1998, in Alexandria, Virginia, to review available information and develop recommendations on goals, institutional frameworks, legal constraints, and information sharing for IABIN, and to develop priorities for pilot projects. As in earlier meetings, invasive species were identified as a priority for international information networking. Invasive species have huge ecological and economic impacts, have generated a substantial body of scientific knowledge which could be assembled into a useful information framework, and are of immediate concern to a number of present and potential IABIN cooperators. To keep an initial pilot project on invasive species manageable, the U.S. experts decided to concentrate on non-indigenous vascular plants and freshwater fish. Following similar logic, breakout groups in Alexandria also recommended pilot projects concerning amphibians and specialist pollinators, both of which have experienced marked declines on a vast geographic scale. Other expert groups have suggested other taxa or ecological guilds (e.g., corals vis-a-vis bleaching) as additional candidates for IABIN pilots.

Shortly after the Alexandria meeting, the U.S. Geological Survey, The Nature Conservancy, and the University of California sponsored a workshop at the National Center for Ecological Assessment and Synthesis at the University of California, Santa Barbara, to recommend strategies for both an invasive species pilot and an overview of information strategies to support that pilot and other related initiatives of IABIN. Participants and subsequent reviewers represent 10 countries and a range of governmental, non-profit, and academic organizations. The resulting recommendations for invasives are found in the report entitled, "Inter-American Biodiversity Information Network (IABIN): Invasive Species in the Americas Pilot Projects."

This report represents a parallel effort to review some approaches to information strategies to support the proposed pilot projects. The particular framework described was specifically designed around the information needed to assess and manage effects of invasive species. However, the recommendations should be broadly applicable to pollinators, amphibians, corals, or other species-occurrence-based initiatives arising within IABIN.
 

II. APPROACH

All of the IABIN meetings to date have concluded that a systematic approach to discovering information sources and facilitating exchange is fundamental to establishing an inter-American information network on biodiversity. Within a number of countries, efforts have already resulted in the establishment of national information networks; various sub-regional network efforts are also proceeding, for example in Central America. In the Andean nations, a network is focusing on exchanging social and economic data. Extending national and sub-regional biodiversity efforts into a hemisphere-wide information network is a next step and reflects the approach -- to build on existing efforts -- directed in Initiative 31 of the Santa Cruz summit action plan.

Setting up an information network involves a wide range of challenges in information technology, informatics, and information infrastructure. At the international level, an information network takes on additional complexities: distances, telephone connections, and connectivity, among many others. Even in an ideal technical configuration, information challenges include issues such as how information is stored, discovered, filtered, and exchanged.

Many organizations in the Americas have already confronted some of the technical and information challenges specific to their own sites, and the IABIN network development efforts should build on these lessons learned. In particular, it is important that any international efforts arising out of IABIN build upon, and communicate with, the initiatives of the Clearing-House Mechanism (CHM) of the Convention on Biological Diversity (http://www.biodiv.org/chm/), the MABNet Americas program (http://www.mabnetamericas.org/mabnet/home.html), and the extensive national efforts to build biodiversity information systems in partner countries. Particularly active examples include the biodiversity database efforts of the Base de Dados Tropical (http://www.bdt.org.br/bdt/) in Brazil, INBio (http://www.inbio.ac.cr/) in Costa Rica, CONABIO (http://www.conabio.gob.mx/) in Mexico, and the National Biological Information Infrastructure (NBII) in the United States (http://www.nbii.gov), but there are important national resources for biodiversity information in every participating country.

There are also a number of complementary international and global efforts to coordinate biodiversity information, especially on protected lands. Notable examples include the Biodiversity Conservation Information System (http://www.biodiversity.org/), The Nature Conservancy's network of Conservation Data Centers (http://www.consci.tnc.org/src/bcdover.html), and the Man and the Biosphere program's Biosphere Reserve Integrated Monitoring initiative (http://www.usmab.org/brim/home.html). While these programs share many goals, they are not currently particularly interoperable in their data structures, use of names and terminology, or software. A goal of IABIN should certainly be to build upon the knowledge represented in these efforts, and to enhance the conservation efforts they represent. However, seamless integration of the information among major biodiversity databases remains a very long-term objective.
 

III. GOALS

Any effective framework for networking biodiversity information in the Western Hemisphere must address at least the following considerations:


IV. STRATEGY

An information network serves a number of distinct functions that can be organized into a number of tiers of complexity:


The previous levels are intended to permit a user to identify a useful body of information (typically a database), and to understand its source, structure and applicability for particular tasks well enough to make use of it for purposes not foreseen by the data developer. The ultimate goal, of course, is to have some kinds of comparative data available from multiple sources covering large geographic areas. Sharing data has traditionally been done by having one site collect data from multiple contributors and compile a single summary database. While there are numerous worthy biodiversity compendia of this kind in the Americas, all suffer from infrequent updates, a lack of the local knowledge needed to interpret particular records, and limited scope. The principal goal of IABIN is to provide access to the most current records from a wider range of experts - in other words, a fully distributed database.

In practice, neither software, nor network capacity, nor consistency in methods now permits a fully distributed biodiversity database (live links to searchable digital collection labels for all American fish collections, for example). This kind of true interoperability is a long-term goal, and requires both improved technology and substantial funding to standardize and digitize data. In the meantime, it is reasonable to collect abstracted data from multiple sources. For example, IABIN participants might want to share the simplest core elements of species occurrence records (date, location, species, collector, repository), without communicating important ancillary information (morphology, genetics, diet, chemicals in tissues, etc.) which may be special to an individual study. This would permit the construction of range maps and estimates of rates of range expansion, even if the collected data were inadequate to analyze the effects of diet or pollution on a particular species.
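As a minimal sketch of this kind of abstraction, the Python fragment below keeps only the five core elements named above and leaves ancillary fields with the original study. The field names, the class name, and the sample record are hypothetical, chosen only for illustration; they are not drawn from any IABIN specification.

```python
from dataclasses import dataclass

# The five core elements named in the text; the field names are assumptions.
CORE_FIELDS = ("date", "location", "species", "collector", "repository")


@dataclass(frozen=True)
class CoreOccurrence:
    date: str          # collection date as text, e.g. "1997-06-14"
    location: tuple    # (latitude, longitude) in decimal degrees
    species: str       # Latin binomial
    collector: str
    repository: str    # institution holding the specimen or record


def abstract_record(full_record: dict) -> CoreOccurrence:
    """Keep only the shared core elements; ancillary data stays at the source."""
    missing = [name for name in CORE_FIELDS if name not in full_record]
    if missing:
        raise ValueError(f"record lacks core fields: {missing}")
    return CoreOccurrence(**{name: full_record[name] for name in CORE_FIELDS})


# Hypothetical record from a single study, including ancillary fields.
full = {
    "date": "1997-06-14",
    "location": (-15.78, -47.93),
    "species": "Cyprinus carpio",
    "collector": "A. Example",
    "repository": "Example Museum",
    "diet": "benthic invertebrates",   # ancillary, not shared
    "tissue_mercury_ppm": 0.12,        # ancillary, not shared
}
core = abstract_record(full)
```

A record that lacks any of the five core elements is rejected rather than silently shared in incomplete form, since range mapping depends on all of them.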

As a result, the final two tiers of data are


V. FIRST STEPS

Given the current explosion of information on the topics of the proposed pilots, it is important that the strategy be both incremental and designed to produce useful results early in the project.

A majority of the proposed pilot projects (invasive plants, invasive fish, pollinators, amphibians, bleaching corals, hantavirus outbreaks) share common features that justify a coordinated approach to their data systems. The October, 1998, Santa Barbara workgroup of experts on invasive species suggested eight data themes for better tracking effects of invasive species in the Americas. All are intended to be on-line and searchable, and most are designed to use the G8 Global Information Locator Service (GILS) format. The themes are:

Good models exist for each (discussed in the proposal document for the invasive pilot), but none has been specifically compiled for invasive species on a hemispheric basis. U.S. experts consulted on amphibians, pollinators, and hantavirus outbreaks report similar information needs for those taxa, suggesting that this list is a reasonable starting point for other pilots in the form of taxon-based assessments.

Recommendation 1: IABIN should adopt the list of databases as a minimum set for information sharing in the Americas on invasive vascular plants, invasive fish, specialist pollinators, and amphibians, insofar as funds and personnel are available.

Some of the databases (experts, species of concern) can be compiled in a straightforward fashion through questionnaires, digital templates (e.g., stand-alone Access data entry forms), or Web-based data entry. Most of the others require coordination at the level of a country or major institution. The invasives working group recommended that one or more country representatives be established for each country participating in each pilot. The representative(s) would be responsible for designating, and establishing if necessary, a main country Internet node for each pilot. Suggested institutional and technical capabilities and budgetary requirements for each node are discussed later, but in most countries the country representative and node should probably be associated with the government agency responsible for IABIN. Numerous secondary nodes -- for example, at universities, museums, and non-governmental organization (NGO) offices -- can be established. The working group encouraged country representatives to take advantage of existing digital data on biodiversity held by researchers, museums, Conservation Data Centers, parks and refuges, Biosphere Reserves, and other conservation-oriented organizations.

Recommendation 2: Each country participating in each pilot should designate a primary country representative. Ideally, each country representative should both participate in or collaborate with the country's official policy toward the Biodiversity Convention and IABIN, and be affiliated with an organization with the interest, facilities, and computer expertise to host a moderately sophisticated database center and Web site.
 

VI. PILOT PROJECT LEVELS OF PARTICIPATION

As a first step toward the development of a hemisphere-wide information system on invasive plants and fishes, participants in the Santa Barbara working group recommended a pilot project with several levels of participation in each country:

Country Representatives
The IABIN Country Representative for each pilot should be chosen by the country representatives to IABIN on the basis of demonstrated understanding of and interest in the focus of the project (e.g., invasive species), and professional connections to both the IABIN governance structure and the IABIN country node support structure.

Country Nodes
A country node which meets minimum Information Technology standards should be designated or established. These minimum IT standards include:

Many country nodes are likely to be national centers for biodiversity information and to have related responsibilities under the CHM, and for related initiatives, such as MABNet Americas and BCIS, as well as for other IABIN projects.

IABIN Partners
A number of organizations or sites within a country might be designated as IABIN Partners. Minimum requirements for participation in the pilot should probably include:

Partners could be nominated by country representatives, represent established programs already active in IABIN and related initiatives (such as many museums and universities), or represent a NGO with substantial technical capability that is active throughout the region (such as Conservation International, World Wildlife Fund, or The Nature Conservancy).

Because the purpose of the proposed pilots is to demonstrate exchange of digital information on focused biodiversity problems, partners would be expected to have substantial information available to contribute in digital form. However, it is clear that many potentially valuable partners (from a scientific perspective) still have limited computer or telecommunications capabilities. Consequently, partners would be expected to have a data entry computer and standard office software, and to have e-mail (reliable enough that a one day maximum turnaround could be expected of the system administrator). However, communications of large data sets might be transmitted on disk by mail rather than over the Internet. (This situation will change; with funding, satellite links to even the most remote areas may be expected in the near future.)

Funding Considerations
Funding needs depend upon the identity of the participant countries and organizations in the pilots. At a bare minimum, the national nodes probably need both a system administrator and at least one full time data expert to work on particular pilots (doing data management, programming, web site construction, etc.). Salary and expenses for the national representative are needed unless covered from another source. Hardware, software, and maintenance can be expected to cost some $5000 per year for each computer in use, with as much more for communications costs. Substantial training would probably be needed in the first year, suggesting that the full costs of a national node could hardly be less than $50,000 per year, and could easily be several times that amount. Partners' minimum communications and staff costs can be much lower. However, at many partner sites, creating or converting to digital data is a limiting factor to information exchange. Taxonomic specimen records can typically be digitized at rates of only a few specimens per hour, and ongoing field collecting is always underfunded. Consequently, information systems and data management may be a small part of the cost of linking to data from IABIN partners. However, without specific participants, costs are difficult to estimate. [Without some needed information, the Santa Barbara workshop estimated a minimum of about $500,000 for 2 years to construct a 6-country demonstration for the 8 databases recommended for freshwater fish and vascular plants.]
 

VII. DATA STRUCTURES

There is a general consensus that IABIN should follow the lead of the CHM, BCIS, the U.S. National Biological Information Infrastructure (NBII), and many other biodiversity data centers, and begin by systematically cataloging available data on the pilot themes. More specifically, the experts' meetings to date have recommended an incremental approach, beginning with simple registries of experts, species, data sets, and management or restoration projects, all built on the common framework of the Global Information Locator Service (GILS) specification (or equivalent catalog standards).

GILS provides the most basic elements needed to catalog a database or other "data object" consisting of environmental information. Its purpose is not to document the data fully, but rather to describe it in a way that helps users discover it and determine whether it is appropriate for their use. The description contains information on how to acquire the data or contact the data holder.

GILS is covered by agreements among the G8 nations and executive orders in the United States, and has become fairly standard within the environmental community (see http://www.gils.org and http://ceres.ca.gov/catalog for examples). GILS is also a profile of Z39.50, a specification for data description that also encompasses the MARC bibliographic database specification (used by the U.S. Library of Congress and many other libraries) and some detailed metadata structures, including the widely used Federal Geographic Data Committee (FGDC) metadata standard, which is required for geospatial data (maps and images) produced by any U.S. government program. As a result, server software exists to permit straightforward distributed access to GILS and other Z39.50 datasets maintained on multiple servers, which simplifies data discovery in a spatially fragmented organization.

GILS may be thought of as a series of structured tags labeling the information (or a pointer to where the information resides) covered by the catalog. Tag types include title, lead organization, contact person, geographical application, subjects of the information, time periods covered, and access instructions. Each of these elements may be used to specify particular attributes important to the project. For example, in all IABIN projects, one required subject designation might be the taxa covered (or the habitat types, or land use types). For the pilot project database types recommended in Section IV,

1) A registry of species under management or of priority concern to partners,

2) Experts,

3) Alerts (new locations or outbreaks),

4) A GILS metadata registry of pilot project datasets,

5) A searchable GILS-like registry of funded management projects,

6) A distributed on-line mapping system,

7) A catalog of needs and opportunities for capacity building, and

8) A web-based compendium of available educational materials related to the pilot,

most can be thought of as GILS-like. The first two (species and people) contain a subset of GILS. The distributed mapping system is more ambitious, though a prototype using Z39.50 has been successfully developed by NABIN and the University of Kansas. Information catalogs on invasive species that are specifically GILS-compatible have been deployed for invasive species alerts (Item 3 -- http://www.nfrcg.gov), and invasive plant management projects (Item 5 -- http://endeavor.des.ucdavis.edu/weeds/), as well as numerous databases and educational offerings (Items 7 and 8 -- see http://www.gils.org for examples).
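To make the idea of a "GILS-like" catalog entry concrete, the following Python sketch models an entry using the tag types listed above (title, lead organization, contact person, geographical application, subjects, time periods covered, access instructions). The attribute names are illustrative assumptions, not the official GILS element names, and the two sample entries are invented.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One GILS-like record describing a data set (attribute names are assumptions)."""
    title: str
    lead_organization: str
    contact_person: str
    geographic_coverage: list   # e.g. country or region names
    subjects: list              # e.g. taxa, habitat types, land use types
    time_period: tuple          # (first_year, last_year) covered by the data
    access_instructions: str    # how to acquire the data or reach the holder


def find_by_subject(catalog, subject):
    """Return entries whose subject list contains the term (case-insensitive)."""
    wanted = subject.casefold()
    return [entry for entry in catalog
            if any(s.casefold() == wanted for s in entry.subjects)]


# Invented entries, for illustration only.
catalog = [
    CatalogEntry(
        title="Example invasive plant management projects",
        lead_organization="Example Agency",
        contact_person="A. Example",
        geographic_coverage=["Example Country"],
        subjects=["invasive vascular plants", "management projects"],
        time_period=(1990, 1998),
        access_instructions="Contact the data holder by e-mail.",
    ),
    CatalogEntry(
        title="Example freshwater fish collection records",
        lead_organization="Example Museum",
        contact_person="B. Example",
        geographic_coverage=["Example Country"],
        subjects=["invasive fish", "freshwater fish"],
        time_period=(1950, 1997),
        access_instructions="Records are available on request.",
    ),
]
fish_entries = find_by_subject(catalog, "Invasive Fish")
```

The entry describes the data set well enough to discover it and reach its holder; it does not attempt to carry the data themselves.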
 

VIII. VOCABULARIES

GILS records or any other data integration strategy can only integrate data across multiple sources if entries denoting the same kind of object (e.g., the same species or habitat type) use the same name or code. Permitting data collectors to enter data without restrictions on their language leads to confusion. Consequently, it is important that the fields used to organize and search the data (species name, country, organization, etc.) only allow a limited list of terms selected from a standardized list (a "pick list," or "controlled vocabulary," or "thesaurus").
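A pick list can be enforced at data entry with very little machinery. The sketch below is a minimal illustration in Python; the list contents and the function name are hypothetical.

```python
# A small controlled vocabulary ("pick list"); the contents are illustrative.
COUNTRY_PICK_LIST = frozenset(
    {"Bolivia", "Brazil", "Costa Rica", "Mexico", "United States"}
)


def validate_term(value, pick_list):
    """Accept a search-field entry only if it appears in the controlled list."""
    term = value.strip()
    if term not in pick_list:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term


accepted = validate_term("  Brazil ", COUNTRY_PICK_LIST)

# A free-text variant is rejected rather than stored as a second spelling.
try:
    validate_term("Brasil", COUNTRY_PICK_LIST)
    variant_rejected = False
except ValueError:
    variant_rejected = True
```

Rejecting the variant at entry time is what keeps records from different contributors retrievable with a single search term.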

However, past experience has shown that imposing standardized names even for fairly standardized categories (such as species) is difficult and can essentially be impossible for more subjective classifications (such as habitats). In practice, vocabulary choices (or "thesauri") can only be successfully standardized among a particular professional community that chooses to adopt voluntarily a particular usage. However, it must be recognized that a particular use adopted by one community is likely to be unacceptable or of no use to professionals in another discipline. Consequently, it is important to designate not only the data entry (for example, species) but also the list from which it was taken (i.e., the taxonomic authority or reference).

GILS is very flexible about the thesauri used. It has specific places for geographical, methodological, and subject keywords, but the specific choices are not specified. For an IABIN pilot, the challenge is to agree on some keyword sets (locations, species, habitats, condition, etc.) that will adequately characterize the shared data for the professional community. For a given pilot, such as invasive fish, a standardized choice of species list (e.g., Eschmeyer, 1998) is achievable, but standardized choices for river names, collecting methods, study purposes, etc. would still have to be decided. It is unlikely that the choices (at least initially) could be effectively shared with other pilots (such as bleaching corals). Consequently, each pilot will need a working group or existing authority to choose thesauri to be used, and to provide updates and interpretations as needed. The thesauri used should always be recorded, named, and made available on-line. This task is further complicated, of course, in a multilingual environment, where links between terms in different languages must be maintained (presumably by reference to a common code).
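One way to maintain links between terms in different languages is to key every term to a language-neutral code, as suggested above. The following Python sketch assumes a tiny habitat thesaurus with invented codes and an invented thesaurus name; it is an illustration of the idea, not a proposed IABIN vocabulary.

```python
# The thesaurus is named so that records can cite it; name and codes are invented.
THESAURUS_NAME = "example-habitat-thesaurus"

# Each common code carries one label per language.
TERMS = {
    "HAB001": {"en": "river", "es": "río", "pt": "rio"},
    "HAB002": {"en": "wetland", "es": "humedal", "pt": "área úmida"},
}


def code_for(term, lang):
    """Find the common code for a term given in one language, or None."""
    wanted = term.casefold()
    for code, labels in TERMS.items():
        if labels.get(lang, "").casefold() == wanted:
            return code
    return None


def translate(term, source_lang, target_lang):
    """Translate by way of the common code rather than term-to-term tables."""
    code = code_for(term, source_lang)
    if code is None:
        return None
    return TERMS[code].get(target_lang)


portuguese_label = translate("río", "es", "pt")
```

Because every language maps to the same code, adding a further language means adding one label per code rather than a new table for each pair of languages.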

An additional complication is that a user may wish to recover a record using a different thesaurus than the one a scientist originally used to develop the record. For example, a fish biologist might classify a location as being in a particular river, but a ministry employee might want to classify location by country or province. While such "cross-walks" can be done, it is important to retain the information about the original classification (rivers) as opposed to the necessarily less exact inferences of the appropriate entry from another thesaurus (provinces) chosen for other purposes.
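A cross-walk of this kind might be sketched as follows, with the original classification kept intact and the derived one marked as an inference. The river and province names, the record layout, and the function name are placeholders assumed for illustration.

```python
# Hypothetical cross-walk from a river thesaurus to a province thesaurus.
# A river can cross several provinces, so each river maps to a list.
RIVER_TO_PROVINCES = {
    "River A": ["Province 1"],
    "River B": ["Province 1", "Province 2"],
}


def add_province_view(record):
    """Return a copy of the record with an inferred province classification.

    The original river classification is left untouched, and the derived
    entry is flagged as inferred so users can judge its precision.
    """
    out = dict(record)
    river = record["location"]["term"]
    out["derived_location"] = {
        "thesaurus": "provinces",
        "terms": list(RIVER_TO_PROVINCES.get(river, [])),
        "derived_from": dict(record["location"]),
        "inferred": True,
    }
    return out


original = {
    "species": "Cyprinus carpio",
    "location": {"thesaurus": "rivers", "term": "River B"},
}
crosswalked = add_province_view(original)
```

In this example the derived entry lists two provinces for one river, which shows why the inference is less exact than the classification the biologist recorded.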

IABIN probably should not try to impose particular thesauri or vocabularies on all users. For example, it would probably be counterproductive to insist that all field botanists throughout the Americas accept the FGDC vegetation classification recently proposed as a standard for the U.S. However, IABIN as a whole or IABIN working groups can catalog and document vocabularies in wide use, support maintenance functions, and help develop cross-walks. Over time, it is likely that data contributors will choose to specify the most widely searched thesaurus:keyword treatments in order to raise the visibility of their published data, whether or not they also include other treatments they might prefer, or which might be common in their own user community. Permitting living thesauri, rather than static database fields, to do the "heavy lifting" of defining categorical descriptions of data sets (and other data objects) permits greater simplicity in the standard, flexibility to incorporate unanticipated data types, and perhaps easier adaptation to upcoming approaches (such as XML tools) to data warehousing and metasearching.
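The thesaurus:keyword pairing can be illustrated with a short Python sketch in which a record carries the same concept under more than one vocabulary. The "thesaurus:keyword" string syntax, the thesaurus identifiers, and the sample keywords are assumptions made for the example.

```python
from typing import NamedTuple


class QualifiedKeyword(NamedTuple):
    thesaurus: str   # the named list the keyword was taken from
    keyword: str


def parse_keyword(text):
    """Split 'thesaurus:keyword' at the first colon; both parts are required."""
    thesaurus, sep, keyword = text.partition(":")
    if not sep or not thesaurus.strip() or not keyword.strip():
        raise ValueError(f"expected 'thesaurus:keyword', got {text!r}")
    return QualifiedKeyword(thesaurus.strip(), keyword.strip())


def matches(record_keywords, query):
    """True if the record carries the queried keyword from the same thesaurus."""
    wanted = parse_keyword(query)
    return any(parse_keyword(k) == wanted for k in record_keywords)


# A contributor lists both a widely searched treatment and a local one.
record_keywords = [
    "example-fish-authority:Cyprinus carpio",
    "local-common-names:common carp",
]
found = matches(record_keywords, "example-fish-authority:Cyprinus carpio")
```

A keyword matches only when both the term and the list it came from agree, so the same word used differently in two communities does not produce a false hit.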

Once the vocabularies of "keywords" are established for particular pilots, populating most of the minimum database set will be straightforward. A number of candidate thesauri for environmental information exist, and these should be explored by individual projects before developing project-specific vocabularies. Examples of subject thesauri include the IUCN draft environmental thesaurus (T. Moritz, 1997 -- see http://www.biodiversity.org), the Global Change Master Directory (http://gcmd.gsfc.nasa.gov/), and the CERES Thesaurus project (http://ceres.ca.gov/thesaurus/), which derives environmental thesauri from Library of Congress keywords. Other kinds of (somewhat) standardized keywords include species names and chemical names; it would be desirable for IABIN projects to agree on particular master authorities for these types of names. For example, all fish names could be cross-referenced to Eschmeyer's Fish of the World, or the Integrated Taxonomic Information System adaptation of that work, and all chemicals could use the Chemical Abstracts Service (CAS) code. Geographic keywords may be standardized on a coarse scale (countries, provinces, major rivers), but many geolocators (stream or village names) will probably have to be country-specific. Latitude-longitude designations are, of course, universal.

In summary, most of the databases recommended as core information for the biological pilot projects represent GILS or GILS-like databases. These are only easily searchable if they use specified vocabularies or thesauri. The best choices for thesauri are not established (beyond, perhaps, for species names), so they must be decided as an early activity of the country representatives and other IABIN partners for each pilot. A working group or organization charged with interpreting and updating the vocabularies is also needed (and should presumably deliberate on-line rather than in person). An initial workshop to make vocabulary choices and establish the maintenance mechanism is probably needed for each pilot.
 

IX. MAPPING SPECIES OCCURRENCES: A DISTRIBUTED DATABASE APPLICATION

As just discussed, the first step in the incremental development of a distributed information system for the biodiversity information developed by IABIN pilot projects is metadata catalogs, following the GILS guidelines. This is exactly the same strategy being pursued by BCIS (http://www.biodiversity.org) for a broader overview of biodiversity information. As noted by IUCN (http://biodiversity.org/metadatabase.html), distributed queries from multiple sources require coordinated vocabularies -- one potential product of the cataloging efforts. However, where widespread vocabularies are already used, more ambitious mechanisms for data sharing are possible. In the pilot projects, point locations of species records provide such an opportunity, as Latin binomials and latitude-longitude locations can be reasonably interpreted from any source.

A system for interactive mapping of species locations and species ranges has been prototyped for museum records of North American birds by NABIN using the "Species Analyst" application developed by the University of Kansas with funding provided by the Commission for Environmental Cooperation (CEC, part of the NAFTA process). This pilot study provides an integrated viewer for the locations of bird specimens in the collections of several dozen museums in Mexico, the U.S. and Canada. This is done using a public-domain Z39.50 server (YAZ) and an application that, in essence, helps the curator of the digital collection database construct a table relating the local format of the species name and lat-long (plus collector and date) to the public (Z39.50) format posted by each server. This approach is independent of both the hardware running the YAZ server and the database software and structure being queried (as long as a Latin binomial, a lat-long, a date, and a collector are found in each record). Free client software on any machine connected to the Internet can then display bird records for many sites, filtered for species, date, collection, etc. The NABIN project has also successfully used these data reports and species distribution models, originally developed in Australia (the GARP and BIOCLIM projects), to predict species ranges in areas where biological surveys are missing or incomplete.
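The table-driven idea described above can be sketched in a few lines of Python. This is not the Species Analyst software and it does not speak Z39.50; it only illustrates, under assumed names, how a per-collection mapping table lets records with different local column names be translated into one shared layout and then filtered. The collection names, column names, and sample rows are all hypothetical.

```python
from datetime import date

# Hypothetical per-collection tables relating shared fields to local column names.
FIELD_MAPS = {
    "museum_a": {"binomial": "SciName", "lat": "Latitude", "lon": "Longitude",
                 "collected": "CollDate", "collector": "Collector"},
    "museum_b": {"binomial": "taxon", "lat": "y", "lon": "x",
                 "collected": "fecha", "collector": "colector"},
}


def to_shared(source, row):
    """Translate one local row into the shared layout using that source's table."""
    mapping = FIELD_MAPS[source]
    record = {field: row[local] for field, local in mapping.items()}
    record["lat"] = float(record["lat"])
    record["lon"] = float(record["lon"])
    record["collected"] = date.fromisoformat(record["collected"])
    record["source"] = source
    return record


def filter_records(records, binomial=None, since=None):
    """Keep records matching an optional species name and earliest date."""
    kept = []
    for rec in records:
        if binomial is not None and rec["binomial"].lower() != binomial.lower():
            continue
        if since is not None and rec["collected"] < since:
            continue
        kept.append(rec)
    return kept


# Made-up sample rows in two different local formats.
rows = [
    ("museum_a", {"SciName": "Turdus migratorius", "Latitude": "38.97",
                  "Longitude": "-95.24", "CollDate": "1987-05-02",
                  "Collector": "A. Example"}),
    ("museum_b", {"taxon": "Turdus migratorius", "y": "19.43", "x": "-99.13",
                  "fecha": "1932-11-20", "colector": "B. Ejemplo"}),
]
records = [to_shared(src, row) for src, row in rows]
recent = filter_records(records, binomial="turdus migratorius",
                        since=date(1950, 1, 1))
```

Because only the mapping table differs between collections, adding a new museum in this sketch means adding one dictionary entry rather than changing the query code.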

Recommendation 3: IABIN should explore adapting the NABIN "Species Analyst" approach and software to develop a distributed species mapping capability for species groups (invasive plants, invasive fish, declining amphibians, selected declining pollinators, hantavirus outbreaks), as funds allow, as a prototype for a more ambitious effort to link biodiversity databases using public-domain Z39.50 server technology.
 

X. Z39.50 PROFILES

It should be noted that GILS, the NABIN species-occurrence databases, and the full FGDC Spatial Metadata Profile (the U.S. standard for full documentation of maps and images, soon to be considered by the International Organization for Standardization (ISO) as a defined standard) are all Z39.50 profiles, as is the bibliographic database standard (MARC) used by the U.S. Library of Congress and many other traditional and digital libraries. This family relationship among the important data types for networking biodiversity information suggests that the classes of databases recommended in Section IV could profitably be approached as a nested set of Z39.50 profiles, ranging from the most abstracted (a GILS catalog of databases) through highly detailed (full FGDC or IUCN metadata) or distributed (NABIN "Species Analyst"). As with the individual catalogs, full interoperability requires agreement on vocabularies (thesauri) as well as formats (database schemas) and server/data transfer standards (Z39.50).

A review by the American Institute of Biological Sciences of biological metadata strategies for the NBII suggested a similarly tiered Z39.50 strategy for cataloguing and documenting biological information developed by NBII participants and partners.

As GILS for locator records and MARC for bibliographic records are already well established as both international standards and Z39.50 profiles, it would be efficient for the IABIN pilot databases and networks to follow these standards. The FGDC formulation for full geospatial metadata (detailed documentation for remote-sensing images and maps) is likely to become an ISO standard, and a biological variant may be proposed soon. Both are also Z39.50 profiles, and could also profitably be adopted by IABIN (though full metadata should be developed by the creator of the data, not the network apparatus).

All of these standards are flexible and evolving, so detailed compliance is certainly less important than collecting and disseminating information with Z39.50 content standards in mind. Similarly, while Z39.50 public domain (YAZ) or commercial (Blue Angel) server software provides powerful technologies for distributing environmental data over the Internet, the particular server technologies will undoubtedly be supplanted by more powerful and flexible systems, such as the XML-based data warehousing systems now under development by most major software vendors (IBM, Oracle, Microsoft, etc.). Nevertheless, upward compatibility will probably be straightforward if current GILS and Z39.50-related standards and good programming practices are followed today.
 

XI. METADATA AND ELECTRONIC PUBLISHING

IABIN's role in developing metadata for the pilot projects is an opportunity not only to set and implement metadata standards for IABIN and partner databases, but also to serve as a forum for promoting quality electronic publishing. Evaluating (in many cases, peer reviewing) public-interest data and metadata, then cataloging and publishing the data (or pointing to other data publication sites) provides a public service akin to journal publication of scientific studies. Identifying a study or data set as valuable to or part of IABIN provides evidence of professional achievement to the contributors, and might be an important incentive for independent scientists or land managers to participate in an organized metadata system. Metadata of increasing complexity might be subjected to increasing levels of review and quality control, just as journal entries range from unreviewed letters to the editor through heavily edited feature articles. Metadata guidelines in this kind of strategy then might be profitably viewed as "instructions to contributors" rather than mandates.

Electronic publishing of data from IABIN participants and partners has many advantages. Many journals are unable or unwilling to publish extensive data because of the cost of printing copies on paper. Consequently, many of the large, synthetic studies most important to IABIN's mission are under-represented in the conventional scientific literature. Providing a prestigious outlet for biodiversity assessments will undoubtedly improve communication. Electronic publishing is also inexpensive, permitting the development of a healthy scientific press at national nodes in countries where publishing resources are limited.

An exact formulation of the editorial function is a charge to the organizers of the pilot projects. However, the pilots probably ought to include both unreviewed web sites, permitting remote users to voluntarily catalog information resources of potential interest to IABIN partners (akin to letters to the editor), and peer-reviewed sites where metadata descriptions and full data access (as appropriate) to IABIN-sponsored studies are published. Peer-reviewed entries should have or point to full descriptions of methods, data structure, and models used, and would normally be expected to make the underlying data public. Unreviewed records might simply list title, author, organization, subject, location, and access instructions, and could cover incomplete or proprietary work as well as major scientific or management initiatives.
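The two tiers of record described above could be represented, as one possible sketch, by a single record type with a stricter check for peer-reviewed entries. The six required fields come from the text; the class name, the `peer_reviewed` flag, and the `methods_url` field are assumptions made for illustration and do not correspond to any GILS element names.

```python
from dataclasses import dataclass, fields


@dataclass
class CatalogRecord:
    # Minimum fields listed in the text for an unreviewed record.
    title: str
    author: str
    organization: str
    subject: str
    location: str
    access_instructions: str
    # Hypothetical additions for the peer-reviewed tier.
    peer_reviewed: bool = False
    methods_url: str = ""  # pointer to the full methods description


def problems(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    found = []
    for f in fields(record):
        value = getattr(record, f.name)
        if isinstance(value, str) and f.name != "methods_url" and not value.strip():
            found.append(f"missing {f.name}")
    if record.peer_reviewed and not record.methods_url.strip():
        found.append("peer-reviewed entries need a pointer to methods")
    return found
```

Returning a list of problems, rather than rejecting a record outright, fits the "instructions to contributors" view of metadata guidelines: the contributor sees what is missing and can resubmit.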
 

XII. TECHNICAL CAPABILITIES NEEDED FOR IABIN SITES

The general consensus of past experts' meetings has been that IABIN needs to support participants at a number of levels of commitment and complexity.

The need to adopt the most current technology will probably become more acute as XML becomes established as a standard for representing and indexing Web documents. When this happens, only XML-compatible access software (including, reportedly, the next releases of Netscape and Internet Explorer) will be able to parse (view) the growing number of environmental XML websites, as well as display conventional HTML.

Obviously, some compromises will be necessary, and many users without new computers or dial-in capability may have to be supported by e-mail, FTP sites, or packaged CD-ROMs.

Web Site Technology
Increasingly, the function of Web sites is defined by open standards for communications, database content, and data transfer, and is largely independent of the underlying platform. Existing sites of IABIN partners that operate in ways consistent with the strategy described in this document employ Windows NT, Linux, UNIX (multiple variants), and even occasional Macintosh platforms, using a wide variety of website software. Most modern relational database systems have the capability to serve the kinds of databases described above over the Web, and at least Access (with additional tools), MS SQL Server, Oracle, and Informix are widely used for this purpose. Most on-line GIS sites either use ESRI products (e.g., MapObjects and Internet Map Server, often coupled with relational databases such as Oracle) or custom-designed graphic-display programs. Several public-domain Z39.50 server packages are widely used, but there are also better-supported commercial choices.

In short, IABIN sites can probably meet the capabilities described in this document in many ways, and should choose on the basis of the expertise of their staff and their other computing needs.

Connectivity
In most locations, connection to the Internet is still provided through the local or national telecom, though deregulation and privatization are rapidly increasing options in many urban areas. The best choices vary with country. In most urban settings, T1 or partial T1 lines (several hundred kilobits to about 1.5 megabits per second) are available, but may be prohibitively expensive. (Prices in the U.S. are typically one to several thousand dollars per month.) In many places (but not the U.S.) ISDN (56 kilobits per second) is a reasonable, slower but lower-cost alternative. However, a number of new higher-bandwidth (more than 1 megabit per second) technologies, including Digital Subscriber Lines and cable-modem connections, are becoming available, and should provide adequate speed for all but the largest of sites at a much diminished cost. Within a year, wireless high-bandwidth connections at costs of under $100/month, often using satellites, should be technically feasible throughout the hemisphere, though regulatory approval may slow their availability in many places.

In short, cost and regulations, rather than availability, are likely to be the main limits to connectivity in the near future.
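To make the bandwidth comparison concrete, the following Python sketch estimates ideal transfer times. It assumes that the line rates are expressed in bits per second, that a megabyte is one million bytes, and that protocol overhead and congestion are ignored, so real transfers will be slower. The 10-megabyte dataset size is a made-up example; 1544 kilobits per second is the nominal rate of a full T1 line.

```python
def transfer_seconds(size_megabytes, kilobits_per_second):
    """Ideal time to move a file, ignoring protocol overhead and congestion."""
    if kilobits_per_second <= 0:
        raise ValueError("line rate must be positive")
    kilobits = size_megabytes * 8 * 1000  # 1 MB = 8,000 kilobits (decimal units)
    return kilobits / kilobits_per_second


# Hypothetical 10 MB dataset over two of the line types discussed above.
slow_link = transfer_seconds(10, 56)     # 56 kbit/s line
full_t1 = transfer_seconds(10, 1544)     # full T1
```

Under these assumptions the slower line needs roughly 24 minutes for a file that a full T1 moves in under a minute, which is why sites serving large maps or images are the ones most affected by the choice of connection.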

Security
Since IABIN proposes to enable voluntary publication of biodiversity data, theft of data is not likely to be a major problem. However, some data, such as locations of endangered species, may be sensitive, and IABIN sites should certainly draw on the extensive experience of other sensitive-species sites (WCMC, CITES, most national biodiversity sites) in deciding which data to blur spatially or simply not post.
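One simple way to blur sensitive locations, shown here only as an illustrative sketch and not as the practice of any of the sites named above, is to report the center of the grid cell containing a record rather than its exact coordinates. The function name and the default cell size of half a degree are assumptions; points exactly on a pole or on the antimeridian are not handled specially.

```python
import math


def blur_coordinate(lat, lon, cell_degrees=0.5):
    """Snap a point to the center of the grid cell that contains it."""
    if cell_degrees <= 0:
        raise ValueError("cell_degrees must be positive")

    def snap(value):
        # Index of the containing cell, then the midpoint of that cell.
        return (math.floor(value / cell_degrees) + 0.5) * cell_degrees

    return snap(lat), snap(lon)
```

Every record within the same cell is published at the same point, so the cell size sets the degree of protection: larger cells hide locations better but make the data less useful for mapping.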

Protection against accidental or deliberate damage to the sites is more difficult. Existing biodiversity sites seem especially vulnerable to deliberate virus and "spam" attacks, and it is likely that any large site should use one of the many excellent commercial security suites (firewall, anti-virus, etc.) to help protect the site. Major vendors include Network Associates, Checkpoint Software, and Symantec, but many smaller companies offer effective products. Limiting public access to pilot data, at least at the initial stage, by password access is straightforward, but may or may not be desirable if improved communication is a principal objective. One approach that has been effective in many countries is to establish "research web sites" at a university or NGO during the development phase, then to open the official and accountable government version after the research site is fully operational and de-bugged.

Operators of public databases are always well advised to keep the official, archival copy of any data on a machine not accessible to the public, then to periodically post copies on the public site. Similarly, IABIN may choose to create "mirror" sites in multiple countries to provide more reliable access. Without automated mechanisms to update "mirror" sites, this is more expensive than it sounds, since "version-control" problems inevitably arise from uncoordinated updates at multiple sites. The San Diego Supercomputer Center is currently researching methods for automated mirroring of large environmental datasets, and may provide appropriate (and free) technology for doing so during the pilot studies.
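A minimal sketch of how version drift between an archival copy and its mirrors might be detected is given below. It compares content digests and is an illustration only; it is not the San Diego Supercomputer Center technology mentioned above, and the function names and the in-memory representation of each copy are assumptions.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest identifying one exact version of a dataset."""
    return hashlib.sha256(data).hexdigest()


def stale_mirrors(archive_copy: bytes, mirror_copies: dict) -> list:
    """List, in name order, the mirrors whose copy differs from the archive."""
    expected = fingerprint(archive_copy)
    return sorted(
        name for name, data in mirror_copies.items()
        if fingerprint(data) != expected
    )
```

In practice each mirror would publish only its digest, so that the check does not require transferring the full dataset; the comparison logic stays the same.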

Search and Query Technology
A number of engines for querying complex environmental data sources and documents are currently in use in the established biodiversity database centers. These include both general purpose search engines (Harvest, AltaVista, etc.) and a variety of project-specific query tools. In practice, general keyword searches can always be done, but the principal limitations on structured cross-site or cross-database queries are less technological than terminological: complex queries only work effectively when vocabularies have been standardized.

Extensible Markup Language (XML) or related "document-centric" technologies may provide a means for more generalized access to environmental information. XML may loosely be looked at as a superset of HTML, with more structured "metatags" categorizing each information "object" (page, document, map, herbarium sheet, etc.). If vocabularies are standardized in the (GILS-style) metadatabases discussed above, those metadatabases can be used to automatically generate easily searchable "metatags" for XML. XML, in turn, can be viewed ("parsed") with standard Web browser technology, or may be used to generate conventional HTML directly. One school of thought is that complex databases will continue to be developed in relatively conventional relational (or relational-object) database formats (Oracle, Informix, etc.), which will in turn generate XML "middleware." The XML can then be browsed or queried more easily by non-experts than the full-blown (SQL or equivalent) database. In any case, XML-like technology is being rapidly developed by most of the world's largest software companies and is central to their strategies for "data warehousing." Consequently, the capabilities of future XML are unpredictable, but it is sure to be more powerful and flexible than it appears at present. At this point, IABIN pilot projects should at least consider representing their data in an XML format, but it is probably premature to start designing specific search or query strategies using XML technology.
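The idea of generating XML automatically from a metadatabase record can be sketched with Python's standard library. The element names (`record`, `title`, `keyword`) and the sample record are hypothetical and do not follow any published GILS or FGDC schema; the sketch also assumes that every field name is a valid XML element name.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Render a flat metadata record as a small XML document string."""
    root = ET.Element("record")
    for name, value in record.items():
        if isinstance(value, (list, tuple)):
            # Repeated fields, such as keywords, become repeated elements.
            for item in value:
                ET.SubElement(root, name).text = str(item)
        else:
            ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up record standing in for one row of a metadatabase.
example = {
    "title": "Invasive fish survey (hypothetical)",
    "keyword": ["invasive species", "Cyprinus carpio"],
}
xml_text = record_to_xml(example)
```

If the keyword values come from a standardized vocabulary, the resulting `keyword` elements are directly comparable across sites, which is the point made above about vocabularies and searchable "metatags".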
 

XIII. CONCLUSIONS

The technical approaches to IABIN pilot projects should closely reflect those taken by IABIN as a whole, as well as other partners in the Clearing-House Mechanism, BCIS, MABNet, and other international biodiversity networking initiatives. As in any technical undertaking, the first step is to inventory available information, technology, and information needs. It is highly desirable to do so in a formal and organized fashion, developing consistent catalogs of experts, partner organizations, species and resources of concern to those experts and organizations, existing data sets, major existing scientific and management projects, and various kinds of resulting synthetic data, educational materials, and capacity-building opportunities. This would be an almost insurmountable task if attempted for all important biodiversity resources in the hemisphere. Consequently, these catalogs should first be developed for a small number of well-defined environmental issues of particularly high visibility and importance. These are the IABIN pilot projects. If carefully done, the information systems developed for those projects should be widely applicable to other biodiversity issues considered by IABIN.

As a practical matter, a generally agreed format for digital catalogs (or "metadatabases") in the environmental area has been adopted by many of IABIN's partner organizations. This is the G8-standard GILS specification. GILS fits data types defined in a number of pilot proposals (e.g., invasive species) well.

The power of a metadata system, however, is derived less from the format of the catalogs than it is from consistent use of language in describing the information available. "Controlled vocabularies" are recognized as a major research issue in metadata development by many IABIN participants and partners, and a number of draft proposals for shared environmental "thesauri" should be examined as part of each pilot project. However, as a practical matter, particular thesauri have been adopted as useful only by fairly well defined professional "communities", and it is probable that IABIN will have to track different kinds of data through different vocabularies useful to different sets of participants. It is the social process of developing a consensus on language, rather than the technical representation of that language, that is the limiting process on the coordination of environmental data across multiple countries, organizations, and professions. Presumably, the pilot projects will provide methods and insight on how to harmonize language on fairly well-defined topics.

Once the vocabulary strategy is under control, a hierarchical set of network nodes, ranging from hemispheric sites, through national representatives, specialist partners, and remote users, can be constructed using public-domain information standards and off-the-shelf hardware and software. Again, the technical challenges are less imposing than the human challenges of how to provide incentives for busy professionals to participate, how to provide the training and funding they need to develop and manage nodes, and how to coordinate their activities. However, the nature of maturing World-Wide Web technology and structured document (XML-like) expression of information are well suited to the complexity and informal nature of environmental networking, and it is likely that IABIN can easily model the technical solutions on the basis of the human innovations it is fostering.


APPENDIX

List of Acronyms
 

BCIS Biodiversity Conservation Information System

BDT Base de Dados Tropical

CEC Commission for Environmental Cooperation

CHM Clearing-House Mechanism, of the Convention on Biological Diversity

FGDC Federal Geographic Data Committee

GILS Global Information Locator Service

IABIN Inter-American Biodiversity Information Network

MABNet Man and the Biosphere Network

MARC Machine-Readable Cataloging

NABIN North American Biodiversity Information Network

NBII National Biological Information Infrastructure

NGO Non-Governmental Organization

XML Extensible Markup Language